GitBench

A benchmark suite for evaluating language models' Git competency. GitBench runs synthetic Git scenarios against models and scores their responses against known-good solutions.

Requirements

Python 3.10+
Git CLI installed and available in PATH

Installation

pip install -e .

Or with the included virtual environment:

source .venv/bin/activate

Quick Start

# List available benchmarks
gitbench list

# Run all benchmarks with the mock model (no API key needed)
gitbench run --all --model mock

# Run with a real model
gitbench run --all --model gpt-4o

# Run with a reasoning level
gitbench run --all --model o3-mini#high

# Run with Ollama locally
gitbench run --all --model llama3.1:8b --provider ollama

Configuration

Model profiles can be defined in gitbench.json (searched in ./gitbench.json, ./.gitbench.json, ~/.gitbench.json):

{
  "models": {
    "openai-gpt4o": {
      "models": ["gpt-4o", "gpt-4o-mini"],
      "provider": "openai",
      "api_key_env": "OPENAI_API_KEY"
    },
    "ollama-local": {
      "models": ["llama3.1:8b", "qwen2.5-coder:7b"],
      "provider": "ollama",
      "base_url": "http://localhost:11434"
    }
  },
  "outputs": {
    "json": "runs/latest.json"
  }
}

Then run with profiles:

gitbench run --all --profile openai-gpt4o
gitbench run --all --all-profiles              # all profiles sequentially
gitbench run --all --all-models                # flattened, can run concurrently

List configured profiles:

gitbench profiles

Running Benchmarks

Full CLI reference

gitbench run [OPTIONS]

Key options:

Option	Description
`--all`, `-a`	Run all benchmarks against all models (shortcut)
`--all-benchmarks`	Run all available benchmarks
`--benchmark`, `-b`	Run a specific benchmark by name
`--model`, `-m`	Model to use. `mock` for testing, Ollama model names, or OpenAI-compatible model IDs
`--profile`, `-p`	Model profile from gitbench.json
`--all-profiles`	Run against all profiles defined in config
`--all-models`	Run against all models across all profiles (flattened)
`--provider`	Explicit provider type: `ollama` or `openai` (auto-detected from base_url if omitted)
`--base-url`	API base URL. Defaults to `http://localhost:11434` for Ollama
`--verbose`, `-v`	Print detailed per-fixture results
`--timeout`, `-t`	Timeout in seconds per model attempt
`--retry-count`, `-r`	Retry attempts on failure (default: 3)
`--model-workers`	Number of models to run concurrently (default: 1)
`--fixture-workers`	Number of fixtures to run concurrently within a benchmark (default: 1)

Output options

Option	Description
`--json-output`	Single JSON file for results (default: `gitbench-results/{timestamp}/results-v{version}.json`)
`--output-dir`, `-d`	Write per-run auto-named JSON files to a directory
`--jsonl`, `-j`	Append run results as a JSON line to a file (useful for accumulating runs)
`--export`, `-e`	Export format(s): `csv`, `artificialanalysis` (repeatable)
`--export-path`	Path for export file (auto-named from model + timestamp if omitted)
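
To make the accumulation idea behind --jsonl concrete, here is a minimal sketch of appending one run envelope per line and reading the history back. The helper names append_run and read_runs are hypothetical; the actual writer in GitBench may differ.

```python
import json
from pathlib import Path


def append_run(path: Path, envelope: dict) -> None:
    """Append one run envelope as a single JSON line, creating the file if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(envelope) + "\n")


def read_runs(path: Path) -> list[dict]:
    """Load every accumulated run, skipping blank lines."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Because each run is one self-contained line, earlier runs never need to be rewritten when a new one is added.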

Examples

# Run a specific benchmark with verbose output
gitbench run --benchmark commit_messages --model mock --verbose

# Run all benchmarks with Ollama, 4 concurrent fixtures
gitbench run --all --model llama3.1:8b --provider ollama --fixture-workers 4

# Run with profile, export results
gitbench run --all --profile openai-gpt4o --export csv --export artificialanalysis

# Accumulate runs in a JSONL file
gitbench run --all --model gpt-4o --jsonl runs/history.jsonl

# Run two models concurrently
gitbench run --all --all-models --model-workers 2

Model reasoning levels

Models can be suffixed with #level to control reasoning effort:

gitbench run --all --model o3-mini#high
gitbench run --all --model gpt-4o#minimal

Valid levels: minimal, low, medium, high, xhigh (model-dependent). The harness validates the combination before any calls are made, failing fast on invalid pairings.

Result doctoring

Transient failures (rate limits, timeouts, server errors) can be repaired without re-running the entire suite:

# Preview what would be repaired
gitbench doctor results.json --dry-run

# Repair a single result file
gitbench doctor results.json

# Repair result files in every timestamped directory under gitbench-results/
gitbench doctor --latest

# Use a longer timeout for slow repair reruns
gitbench doctor --latest --timeout 180

The doctor command identifies failures caused by known transient error patterns (HTTP 429, 500/502/503/504, timeouts, rate limits) and re-runs only those fixtures against the original models.

Report generation

After running benchmarks, generate a static report site:

gitbench report

This aggregates results from gitbench-results/, generates web/public/results.json, builds the Astro site to web/dist/, and starts a preview server.

gitbench report --open       # build and open in browser
gitbench report --dev        # start dev server with hot reload
gitbench report --no-build   # only generate results.json (skip build)
gitbench report -d my-results/ --open   # use custom input dir

Export formats

Results can be exported to structured formats for external analysis:

Format	Description
`csv`	One row per fixture result (benchmark, fixture_id, model, passed, similarity, error, etc.)
`artificialanalysis`	One row per benchmark (model, benchmark, score, total, passed) — compatible with artificialanalysis.com

gitbench run --all --model gpt-4o --export csv --export artificialanalysis
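
For orientation, a per-fixture CSV along the lines of the csv format could be produced as sketched below. Only the column names mentioned in the table are used; the real exporter writes additional columns (the "etc." above), and to_csv is a hypothetical name.

```python
import csv
import io

# Columns named in the table above; the real export may include more.
COLUMNS = ["benchmark", "fixture_id", "model", "passed", "similarity", "error"]


def to_csv(model: str, results: list[dict]) -> str:
    """Flatten benchmark results into one CSV row per fixture score."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for result in results:
        for score in result["scores"]:
            writer.writerow({
                "benchmark": result["benchmark"],
                "fixture_id": score["fixture_id"],
                "model": model,
                "passed": score["passed"],
                "similarity": score.get("similarity"),
                "error": score.get("error") or "",
            })
    return buffer.getvalue()
```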

Result versioning

Saved run envelopes include two kinds of version field:

Field	Meaning
`schema_version` / `version`	Integer output schema version for result readers
`benchmark_suite_version`	`0.x` suite version tied to benchmark and fixture coverage

During the pre-1.0 period, bump benchmark_suite_version whenever fixtures, benchmark definitions, prompts, expected answers, or scoring rules change in a way that affects comparability. Use 0.x minor bumps for added or behavior-changing benchmark coverage, and patch bumps for corrections that keep the same intended coverage.

Generated default artifact filenames include the benchmark suite version, and report rendering sorts accumulated runs by benchmark_suite_version first, then timestamp.
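
The ordering rule can be sketched as a two-part sort key. This assumes ascending order and purely numeric dotted versions such as 0.1.0; both are assumptions, and sort_runs is a hypothetical name.

```python
from datetime import datetime


def version_key(version: str) -> tuple[int, ...]:
    """Turn "0.10.0" into (0, 10, 0) so versions compare numerically, not as text."""
    return tuple(int(part) for part in version.split("."))


def sort_runs(runs: list[dict]) -> list[dict]:
    """Order runs by suite version first, then by ISO 8601 timestamp."""
    return sorted(
        runs,
        key=lambda run: (
            version_key(run["benchmark_suite_version"]),
            datetime.fromisoformat(run["timestamp"]),
        ),
    )
```

Comparing parsed integers matters once a minor version reaches two digits: as plain strings, "0.10.0" would sort before "0.9.0".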

Benchmarks

GitBench includes 17 benchmark categories:

Benchmark	Description	Fixtures	Scoring
`blame_forensics`	Trace which commit introduced a bug using git blame/log	12	exact_match
`branch_cleanup`	Identify branches to delete (fully merged into main)	12	exact_match
`cherry_pick`	Apply specific commits from one branch to another, resolving conflicts	12	similarity
`commit_messages`	Given a diff, generate a meaningful commit message	12	similarity
`commit_squash`	Squash multiple commits into a single coherent commit	12	commit_selection
`git_bisect`	Identify the commit that introduced a bug via automated bisect	12	dynamic_hash
`git_clean`	Safely remove untracked files and directories	12	state_assertions
`git_grep`	Search repository content with git grep	12	exact_match/similarity
`git_log_format`	Format git log output for targeted history inspection	12	exact_match
`git_show`	Inspect commits, tags, and file state with git show	12	exact_match/dynamic_hash
`merge_conflicts`	Resolve merge conflicts producing the correct final tree	12	similarity
`rebase`	Clean up commit history before PR (squash, reorder, amend)	12	similarity
`reflog`	Restore lost commits or fix detached HEAD state	12	similarity
`stash_recovery`	Recover stashed changes or resolve stash pop conflicts	12	similarity
`submodule_usage`	Manage git submodules for external dependencies	12	state_assertions
`tag_management`	Create, inspect, move, and delete tags	12	state_assertions
`worktree_usage`	Use git worktrees for parallel development	12	state_assertions

Each benchmark has 12 fixtures — 204 total — for meaningful pass@1 scoring.

Scoring Types

Type	Description
`similarity`	Text similarity via `difflib.SequenceMatcher`. Default threshold: 0.5
`exact_match`	Exact string comparison after stripping whitespace
`command_equivalence`	Tokenized command comparison against fixture-declared accepted alternatives
`state_assertions`	Execute model output as git commands, then verify repo state via assertions (file_exists, dir_exists, file_content, branch_exists, git_config, git_output)
`structured`	Parse model output as key-value fields, score each independently (exact_match or similarity per field)
`commit_selection`	Verify that the model selects specific expected commits (used by commit_squash)
`dynamic_hash`	Match against a git hash that varies per run (used by git_bisect, git_show)
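
The two simplest scoring types can be sketched with the standard library alone. This is an illustration under two assumptions: that a score equal to the threshold counts as a pass, and that whitespace is stripped from both ends of each string. The function names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(expected: str, actual: str) -> float:
    """Ratio in [0.0, 1.0] as computed by difflib.SequenceMatcher."""
    return SequenceMatcher(None, expected, actual).ratio()


def passes_similarity(expected: str, actual: str, threshold: float = 0.5) -> bool:
    """Assumed rule: pass when the ratio is at or above the threshold."""
    return similarity(expected, actual) >= threshold


def passes_exact(expected: str, actual: str) -> bool:
    """Exact comparison after stripping leading and trailing whitespace."""
    return expected.strip() == actual.strip()
```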

Use command_equivalence for read-only fixtures that ask for a Git command and where multiple command spellings are semantically equivalent:

scoring:
  type: command_equivalence
  accepted:
    - git submodule
    - git submodule status

Multi-command alternatives are supported as ordered command sequences:

scoring:
  type: command_equivalence
  accepted:
    - - git submodule init
      - git submodule update
    - - git submodule update --init

Output Format

Envelope

Every run produces an envelope wrapping benchmark results with metadata:

{
  "version": 1,
  "schema_version": 1,
  "benchmark_suite_version": "0.1.0",
  "timestamp": "2025-01-15T10:30:00+00:00",
  "git_sha": "abc1234",
  "model": "gpt-4o",
  "profile": "openai-gpt4o",
  "summary": {
    "total_benchmarks": 17,
    "total_fixtures": 204,
    "total_passed": 120,
    "overall_pass_at_k": 0.5882
  },
  "results": [ ... ]
}
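
The summary figures in the envelope are consistent with overall_pass_at_k being total_passed divided by total_fixtures, rounded to four decimals (120 / 204 ≈ 0.5882). A minimal sketch of that aggregation, assuming exactly this formula and using a hypothetical summarize function:

```python
def summarize(results: list[dict]) -> dict:
    """Aggregate per-benchmark totals into an envelope-style summary."""
    total = sum(result["total"] for result in results)
    passed = sum(result["passed"] for result in results)
    return {
        "total_benchmarks": len(results),
        "total_fixtures": total,
        "total_passed": passed,
        # Assumed: passed / total rounded to four decimals, 0.0 for an empty run.
        "overall_pass_at_k": round(passed / total, 4) if total else 0.0,
    }
```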

Single benchmark

When running --benchmark <name>, output is a single benchmark result:

{
  "benchmark": "commit_messages",
  "total": 12,
  "passed": 3,
  "pass_at_k": 0.25,
  "scores": [
    {
      "fixture_id": "f001",
      "passed": true,
      "similarity": 0.72,
      "model_output": "Add greeting file",
      "error": null
    }
  ],
  "errors": 0
}

Multi-model / multi-profile output

When running --all-models or --all-profiles, results are nested:

{
  "benchmark_suite_version": "0.1.0",
  "summary": {
    "total_models": 2,
    "total_fixtures": 408,
    "total_passed": 220,
    "overall_pass_at_k": 0.5392
  },
  "models": [
    { "model": "gpt-4o", "summary": {...}, "results": [...] },
    { "model": "gpt-4o-mini", "summary": {...}, "results": [...] }
  ]
}

All benchmarks combined

When running --all, output is a combined JSON with a summary and per-benchmark results:

{
  "summary": {
    "total_benchmarks": 17,
    "total_fixtures": 204,
    "total_passed": 15,
    "overall_pass_at_k": 0.0735
  },
  "results": [
    {
      "benchmark": "commit_messages",
      "total": 12,
      "passed": 3,
      "pass_at_k": 0.25,
      "scores": [...],
      "errors": 0
    },
    ...
  ]
}

Field	Type	Description
`benchmark`	`string`	Benchmark name
`total`	`integer`	Total number of fixtures
`passed`	`integer`	Number of fixtures that passed
`pass_at_k`	`float`	Fraction of fixtures with at least one passing attempt
`scores`	`list`	Per-fixture score objects
`errors`	`integer`	Number of fixtures that produced an error

Each score object contains:

Field	Type	Description
`fixture_id`	`string`	Fixture identifier
`passed`	`boolean`	Whether the model output passed the threshold
`similarity`	`float`	Text similarity score (0.0 – 1.0)
`model_output`	`string`	The model's generated output
`error`	`string | null`	Error message if processing failed
`reasoning_level`	`string | null`	Reasoning level used (e.g. `"high"`)
`model`	`string`	Model name (present in multi-model output)

Running Tests

pytest tests/

Run with verbose output:

pytest tests/ -v

Run only integration tests:

pytest tests/test_integration.py -v

Run specific test files:

pytest tests/test_cli.py -v
pytest tests/test_export.py -v
pytest tests/test_parallel_fixtures.py -v
pytest tests/test_result_doctoring.py -v

Adding Fixtures and Benchmarks

See CONTRIBUTING.md for the full fixture authoring guide and step-by-step instructions for adding new benchmark categories.

Quick fixture template:

id: "f013"
description: "My scenario"
purpose: "Tests ability to generate a commit message for a new file."
difficulty: easy
tags: [commit-message, add, basic]
setup:
  - "git init"
  - "git config user.email 'test@test.com'"
  - "git config user.name 'Test User'"
  - "echo 'content' > file.txt"
  - "git add file.txt"
prompt: "Your instruction to the model"
expected: "The correct output"
scoring:
  type: "similarity"
  threshold: 0.5

Important gotchas:

YAML colon values: strings containing a colon followed by a space (e.g. Fix: login) must be quoted, preferably single-quoted as in 'Fix: login', to prevent PyYAML from parsing the text before the colon as a mapping key.
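Rebase vs merge conflict polarity: In merge conflicts, HEAD's changes (the branch you are on) appear above ======= and the incoming branch's changes appear below. In rebase conflicts the roles are reversed: HEAD is the upstream being rebased onto, so upstream's changes appear above ======= and the changes from the commit being replayed appear below.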
GitExecutor exit codes: git merge and git rebase exit with code 1 on conflicts. GitExecutor.setup_repo() handles this automatically — do not call _run_command() directly for merge/rebase commands in fixtures.
New benchmarks: Drop a Python module containing a class that inherits from Benchmark into gitbench/benchmarks/. No registration or harness changes are needed; the module is auto-discovered via importlib.
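
When writing expected output for conflict fixtures, it can help to split a conflicted file into the part above ======= and the part below it. The sketch below does that for the first conflict block only; split_conflict is a hypothetical helper, and it assumes Git's default conflict style (no ||||||| base section). Which branch each half belongs to depends on whether the conflict came from a merge or a rebase, as noted in the list above.

```python
def split_conflict(text: str) -> tuple[list[str], list[str]]:
    """Return (lines above =======, lines below =======) for the first conflict block."""
    above: list[str] = []
    below: list[str] = []
    side = None  # which list is currently being filled
    for line in text.splitlines():
        if side is None and line.startswith("<<<<<<< "):
            side = above
        elif side is above and line.startswith("======="):
            side = below
        elif side is below and line.startswith(">>>>>>> "):
            break
        elif side is not None:
            side.append(line)
    return above, below
```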

Project Structure

gitbench/
├── __init__.py
├── cli.py                  # Click-based CLI (run, list, doctor, profiles, report)
├── config.py               # Configuration loader (gitbench.json profiles)
├── export.py               # CSV and artificialanalysis format exporters
├── render.py               # Result aggregation and JSON generation for reports
├── result_doctoring.py     # Selective repair of transient failures
├── version.py              # Version constants (schema, suite, package)
├── harness/
│   ├── benchmark.py       # Benchmark abstract base class
│   ├── types.py           # Dataclasses: ModelMessage, Fixture, Score, BenchmarkResult
│   ├── model.py           # ModelInterface, OpenAIAdapter, OllamaAdapter, MockModelClient
│   ├── loader.py          # FixtureLoader: YAML loading and validation
│   ├── runner.py          # BenchmarkRunner: executes benchmarks with progress reporting
│   ├── scorer.py          # Scorer: similarity, command equivalence, state assertions, pass@k
│   └── reasoning.py       # Model reasoning level validation matrix
├── benchmarks/
│   ├── __init__.py        # Auto-discovery import (no registration needed)
│   ├── blame_forensics.py # Blame/forensics benchmark
│   ├── branch_cleanup.py  # Branch cleanup benchmark
│   ├── cherry_pick.py     # Cherry-pick benchmark
│   ├── commit_messages.py # Commit message generation benchmark
│   ├── commit_squash.py   # Commit squash benchmark
│   ├── git_bisect.py      # Git bisect benchmark
│   ├── git_clean.py       # Git clean benchmark
│   ├── git_grep.py        # Git grep benchmark
│   ├── git_log_format.py  # Git log formatting benchmark
│   ├── git_show.py        # Git show benchmark
│   ├── merge_conflicts.py # Merge conflict resolution benchmark
│   ├── rebase.py          # Interactive rebase benchmark
│   ├── reflog.py          # Reflog/detached HEAD recovery benchmark
│   ├── stash_recovery.py  # Stash recovery benchmark
│   ├── submodule_usage.py # Submodule usage benchmark
│   ├── tag_management.py  # Tag management benchmark
│   └── worktree_usage.py  # Worktree usage benchmark
├── ui/
│   ├── __init__.py
│   ├── display.py         # RichProgressDisplay: live TUI with progress bars
│   └── format.py          # Duration, cost, and human-readable formatting
└── utils/
    └── git.py             # GitExecutor: sandboxed git repo management
fixtures/
├── blame_forensics/       # 12 YAML fixtures
├── branch_cleanup/        # 12 YAML fixtures
├── cherry_pick/           # 12 YAML fixtures
├── commit_messages/       # 12 YAML fixtures
├── commit_squash/         # 12 YAML fixtures
├── git_bisect/            # 12 YAML fixtures
├── git_clean/             # 12 YAML fixtures
├── git_grep/              # 12 YAML fixtures
├── git_log_format/        # 12 YAML fixtures
├── git_show/              # 12 YAML fixtures
├── merge_conflicts/       # 12 YAML fixtures
├── rebase/                # 12 YAML fixtures
├── reflog/                # 12 YAML fixtures
├── stash_recovery/        # 12 YAML fixtures
├── submodule_usage/       # 12 YAML fixtures
├── tag_management/        # 12 YAML fixtures
└── worktree_usage/        # 12 YAML fixtures
tests/
├── conftest.py            # Shared test fixtures
├── test_benchmarks.py     # Benchmark correctness tests
├── test_cli.py            # CLI integration tests
├── test_config.py         # Config loader tests
├── test_export.py         # Export format tests
├── test_format.py         # UI formatting tests
├── test_git.py            # GitExecutor tests
├── test_integration.py    # End-to-end integration tests
├── test_loader.py         # Fixture loader tests
├── test_model.py          # Model adapter tests
├── test_parallel_fixtures.py  # Parallel fixture execution tests
├── test_reasoning.py      # Reasoning level validation tests
├── test_render.py         # Render/aggregation tests
├── test_result_doctoring.py   # Doctor command tests
├── test_scorer.py         # Scoring logic tests
└── test_types.py          # Data type tests

Architecture

CLI (gitbench/cli.py): Click-based command group with run, list, doctor, profiles, and report commands.
Configuration (gitbench/config.py): Loads model profiles from gitbench.json. Profiles define model lists, provider type, base URL, API key resolution, timeout, and retry settings.
Benchmark ABC (gitbench/harness/benchmark.py): Abstract base class enforcing the run_setup(), load_fixtures(), and score() interface. To add a new benchmark category, drop a new Python module in benchmarks/; it is auto-discovered via importlib.
Benchmark Runner (gitbench/harness/runner.py): Orchestrates fixture execution against a model, supports concurrent fixtures, and reports progress via the RunProgress protocol.
Model adapters (gitbench/harness/model.py): ModelInterface ABC with OpenAIAdapter, OllamaAdapter, and MockModelClient implementations. Models can specify a reasoning level with model#level syntax.
Reasoning validation (gitbench/harness/reasoning.py): Validates model + reasoning level combinations before any API calls. Fails fast on invalid pairings.
Fixture loader (gitbench/harness/loader.py): Parses and validates YAML fixtures. Supports single fixture per file (dict) or multiple (list).
Scorer (gitbench/harness/scorer.py): Computes text similarity via difflib.SequenceMatcher, command equivalence, state assertions, structured field scoring, and pass_at_k across fixtures.
Git executor (gitbench/utils/git.py): Manages sandboxed temporary git repositories for fixture isolation. Uses _run_command_permissive internally for git merge and git rebase, which exit with code 1 on conflicts by design.
Progress display (gitbench/ui/display.py): Rich-based live TUI with progress bars, model panels, throughput stats, cost estimation, and final summary tables.
Export (gitbench/export.py): Pluggable format registry. Ships with CSV (per-fixture) and artificialanalysis (per-benchmark) exporters.
Result doctoring (gitbench/result_doctoring.py): Identifies transient failures (rate limits, timeouts, server errors) and re-runs only affected fixtures against the original models.
Report rendering (gitbench/render.py): Aggregates run envelopes from directories or JSONL files, resolves model metadata, and produces results.json for the Astro report site.
