A benchmark suite for evaluating language models' Git competency. GitBench runs synthetic Git scenarios against models and scores their responses against known-good solutions.
- Python 3.10+
- Git CLI installed and available in `PATH`
pip install -e .

Or with the included virtual environment:

source .venv/bin/activate

# List available benchmarks
gitbench list
# Run all benchmarks with the mock model (no API key needed)
gitbench run --all --model mock
# Run with a real model
gitbench run --all --model gpt-4o
# Run with a reasoning level
gitbench run --all --model o3-mini#high
# Run with Ollama locally
gitbench run --all --model llama3.1:8b --provider ollama

Model profiles can be defined in `gitbench.json` (searched in `./gitbench.json`, `./.gitbench.json`, `~/.gitbench.json`):
{
"models": {
"openai-gpt4o": {
"models": ["gpt-4o", "gpt-4o-mini"],
"provider": "openai",
"api_key_env": "OPENAI_API_KEY"
},
"ollama-local": {
"models": ["llama3.1:8b", "qwen2.5-coder:7b"],
"provider": "ollama",
"base_url": "http://localhost:11434"
}
},
"outputs": {
"json": "runs/latest.json"
}
}

Then run with profiles:
gitbench run --all --profile openai-gpt4o
gitbench run --all --all-profiles # all profiles sequentially
gitbench run --all --all-models # flattened, can run concurrently

List configured profiles:

gitbench profiles

gitbench run [OPTIONS]
Key options:
| Option | Description |
|---|---|
| `--all`, `-a` | Run all benchmarks against all models (shortcut) |
| `--all-benchmarks` | Run all available benchmarks |
| `--benchmark`, `-b` | Run a specific benchmark by name |
| `--model`, `-m` | Model to use: `mock` for testing, an Ollama model name, or an OpenAI-compatible model ID |
| `--profile`, `-p` | Model profile from `gitbench.json` |
| `--all-profiles` | Run against all profiles defined in config |
| `--all-models` | Run against all models across all profiles (flattened) |
| `--provider` | Explicit provider type: `ollama` or `openai` (auto-detected from `base_url` if omitted) |
| `--base-url` | API base URL. Defaults to `http://localhost:11434` for Ollama |
| `--verbose`, `-v` | Print detailed per-fixture results |
| `--timeout`, `-t` | Timeout in seconds per model attempt |
| `--retry-count`, `-r` | Retry attempts on failure (default: 3) |
| `--model-workers` | Number of models to run concurrently (default: 1) |
| `--fixture-workers` | Number of fixtures to run concurrently within a benchmark (default: 1) |

Output options:

| Option | Description |
|---|---|
| `--json-output` | Single JSON file for results (default: `gitbench-results/{timestamp}/results-v{version}.json`) |
| `--output-dir`, `-d` | Write per-run auto-named JSON files to a directory |
| `--jsonl`, `-j` | Append run results as a JSON line to a file (useful for accumulating runs) |
| `--export`, `-e` | Export format(s): `csv`, `artificialanalysis` (repeatable) |
| `--export-path` | Path for export file (auto-named from model + timestamp if omitted) |
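
The accumulation behaviour described for `--jsonl` can be pictured with a small sketch. This is not GitBench's own writer: `append_run` and `read_runs` are hypothetical helpers, and the envelopes below are illustrative stand-ins for real run output.

```python
import json
import tempfile
from pathlib import Path


def append_run(jsonl_path: Path, envelope: dict) -> None:
    """Append one run envelope as a single JSON line (hypothetical helper)."""
    jsonl_path.parent.mkdir(parents=True, exist_ok=True)
    with jsonl_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(envelope) + "\n")


def read_runs(jsonl_path: Path) -> list[dict]:
    """Load every accumulated run, skipping blank lines."""
    with jsonl_path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Illustrative data only; real envelopes carry more fields.
history = Path(tempfile.mkdtemp()) / "runs" / "history.jsonl"
append_run(history, {"model": "mock", "summary": {"total_passed": 3}})
append_run(history, {"model": "gpt-4o", "summary": {"total_passed": 9}})
runs = read_runs(history)
```

Because each run occupies exactly one line, earlier runs are never rewritten and a reader can process the file line by line.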
# Run a specific benchmark with verbose output
gitbench run --benchmark commit_messages --model mock --verbose
# Run all benchmarks with Ollama, 4 concurrent fixtures
gitbench run --all --model llama3.1:8b --provider ollama --fixture-workers 4
# Run with profile, export results
gitbench run --all --profile openai-gpt4o --export csv --export artificialanalysis
# Accumulate runs in a JSONL file
gitbench run --all --model gpt-4o --jsonl runs/history.jsonl
# Run two models concurrently
gitbench run --all --all-models --model-workers 2

Models can be suffixed with `#level` to control reasoning effort:
gitbench run --all --model o3-mini#high
gitbench run --all --model gpt-4o#minimal

Valid levels: `minimal`, `low`, `medium`, `high`, `xhigh` (model-dependent).
The harness validates the combination before any calls are made, failing fast on invalid pairings.
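
As a rough illustration of the `model#level` syntax, the sketch below splits a model spec and rejects unknown level names before any call would be made. `split_model_spec` is a hypothetical name, not the harness's API, and the sketch checks only the level name. The real validation also depends on which levels each model supports, which is not reproduced here.

```python
from __future__ import annotations

# Level names taken from the list above.
VALID_LEVELS = ("minimal", "low", "medium", "high", "xhigh")


def split_model_spec(spec: str) -> tuple[str, str | None]:
    """Split 'model#level' into (model, level); level is None when absent.

    Hypothetical helper: validates the level name only, not the
    model-specific pairing.
    """
    model, sep, level = spec.partition("#")
    if not model:
        raise ValueError(f"missing model name in {spec!r}")
    if not sep:
        return model, None
    if level not in VALID_LEVELS:
        raise ValueError(f"unknown reasoning level {level!r} in {spec!r}")
    return model, level
```

Splitting on `#` leaves Ollama-style names such as `llama3.1:8b` untouched, since the colon is part of the model name.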
Transient failures (rate limits, timeouts, server errors) can be repaired without re-running the entire suite:
# Preview what would be repaired
gitbench doctor results.json --dry-run
# Repair a single result file
gitbench doctor results.json
# Repair result files in every timestamped directory under gitbench-results/
gitbench doctor --latest
# Use a longer timeout for slow repair reruns
gitbench doctor --latest --timeout 180

The `doctor` command identifies failures caused by known transient error patterns (HTTP 429, 500/502/503/504, timeouts, rate limits) and re-runs only those fixtures against the original models.
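
The selection step can be sketched as a pattern match over each score's `error` field. The regular expressions and the function names `is_transient` and `fixtures_to_repair` are assumptions made for illustration; the patterns GitBench actually uses may differ.

```python
from __future__ import annotations

import re

# Assumed patterns, derived from the error categories listed above.
TRANSIENT_PATTERNS = [
    re.compile(r"\b(429|500|502|503|504)\b"),
    re.compile(r"time[ -]?out|timed out", re.IGNORECASE),
    re.compile(r"rate[ -]?limit", re.IGNORECASE),
]


def is_transient(error: str | None) -> bool:
    """True when the error text matches any assumed transient pattern."""
    return bool(error) and any(p.search(error) for p in TRANSIENT_PATTERNS)


def fixtures_to_repair(scores: list[dict]) -> list[str]:
    """Return the fixture IDs whose failure looks transient."""
    return [s["fixture_id"] for s in scores if is_transient(s.get("error"))]
```

A bare status-code match is deliberately loose and could flag unrelated messages that happen to contain `500`; a production classifier would likely anchor on the provider's error format.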
After running benchmarks, generate a static report site:
gitbench report

This aggregates results from `gitbench-results/`, generates `web/public/results.json`, builds the Astro site to `web/dist/`, and starts a preview server.
gitbench report --open # build and open in browser
gitbench report --dev # start dev server with hot reload
gitbench report --no-build # only generate results.json (skip build)
gitbench report -d my-results/ --open # use custom input dir

Results can be exported to structured formats for external analysis:
| Format | Description |
|---|---|
| `csv` | One row per fixture result (benchmark, fixture_id, model, passed, similarity, error, etc.) |
| `artificialanalysis` | One row per benchmark (model, benchmark, score, total, passed); compatible with artificialanalysis.com |
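
To make the per-fixture layout of the `csv` format concrete, here is a minimal sketch that flattens benchmark results into one row per fixture. The column subset and the `to_csv` function are assumptions; the shipped exporter may emit more columns or order them differently.

```python
import csv
import io

# Assumed subset of the columns named in the table above.
COLUMNS = ["benchmark", "fixture_id", "model", "passed", "similarity", "error"]


def to_csv(model: str, results: list[dict]) -> str:
    """Flatten benchmark results into CSV text, one row per fixture score."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for result in results:
        for score in result["scores"]:
            writer.writerow({
                "benchmark": result["benchmark"],
                "fixture_id": score["fixture_id"],
                "model": model,
                "passed": score["passed"],
                "similarity": score.get("similarity"),
                "error": score.get("error") or "",
            })
    return buf.getvalue()
```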
gitbench run --all --model gpt-4o --export csv --export artificialanalysis

Saved run envelopes include two version fields:
| Field | Meaning |
|---|---|
| `schema_version` / `version` | Integer output schema version for result readers |
| `benchmark_suite_version` | 0.x suite version tied to benchmark and fixture coverage |
During the pre-1.0 period, bump benchmark_suite_version whenever fixtures, benchmark definitions, prompts, expected answers, or scoring rules change in a way that affects comparability. Use 0.x minor bumps for added or behavior-changing benchmark coverage, and patch bumps for corrections that keep the same intended coverage.
Generated default artifact filenames include the benchmark suite version, and report rendering sorts accumulated runs by benchmark_suite_version first, then timestamp.
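
The ordering rule can be sketched as a two-part sort key. This is an illustration under assumptions: ascending order, purely numeric dotted versions, and ISO 8601 timestamps that share a UTC offset so they compare correctly as strings. `version_key` and `sort_runs` are hypothetical names.

```python
def version_key(version: str) -> tuple[int, ...]:
    """Turn '0.10.0' into (0, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def sort_runs(runs: list[dict]) -> list[dict]:
    """Order runs by suite version first, then by timestamp."""
    return sorted(
        runs,
        key=lambda r: (version_key(r["benchmark_suite_version"]), r["timestamp"]),
    )
```

Comparing versions as integer tuples matters once a component reaches two digits: as plain strings, `0.10.0` would sort before `0.2.0`.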
GitBench includes 17 benchmark categories:
| Benchmark | Description | Fixtures | Scoring |
|---|---|---|---|
| `blame_forensics` | Trace which commit introduced a bug using git blame/log | 12 | exact_match |
| `branch_cleanup` | Identify branches to delete (fully merged into main) | 12 | exact_match |
| `cherry_pick` | Apply specific commits from one branch to another, resolving conflicts | 12 | similarity |
| `commit_messages` | Given a diff, generate a meaningful commit message | 12 | similarity |
| `commit_squash` | Squash multiple commits into a single coherent commit | 12 | commit_selection |
| `git_bisect` | Identify the commit that introduced a bug via automated bisect | 12 | dynamic_hash |
| `git_clean` | Safely remove untracked files and directories | 12 | state_assertions |
| `git_grep` | Search repository content with git grep | 12 | exact_match/similarity |
| `git_log_format` | Format git log output for targeted history inspection | 12 | exact_match |
| `git_show` | Inspect commits, tags, and file state with git show | 12 | exact_match/dynamic_hash |
| `merge_conflicts` | Resolve merge conflicts producing the correct final tree | 12 | similarity |
| `rebase` | Clean up commit history before PR (squash, reorder, amend) | 12 | similarity |
| `reflog` | Restore lost commits or fix detached HEAD state | 12 | similarity |
| `stash_recovery` | Recover stashed changes or resolve stash pop conflicts | 12 | similarity |
| `submodule_usage` | Manage git submodules for external dependencies | 12 | state_assertions |
| `tag_management` | Create, inspect, move, and delete tags | 12 | state_assertions |
| `worktree_usage` | Use git worktrees for parallel development | 12 | state_assertions |
Each benchmark has 12 fixtures — 204 total — for meaningful pass@1 scoring.
| Type | Description |
|---|---|
| `similarity` | Text similarity via `difflib.SequenceMatcher`. Default threshold: 0.5 |
| `exact_match` | Exact string comparison after stripping whitespace |
| `command_equivalence` | Tokenized command comparison against fixture-declared accepted alternatives |
| `state_assertions` | Execute model output as git commands, then verify repo state via assertions (`file_exists`, `dir_exists`, `file_content`, `branch_exists`, `git_config`, `git_output`) |
| `structured` | Parse model output as key-value fields, score each independently (`exact_match` or `similarity` per field) |
| `commit_selection` | Verify that the model selects specific expected commits (used by `commit_squash`) |
| `dynamic_hash` | Match against a git hash that varies per run (used by `git_bisect`, `git_show`) |
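
Three of the text-based scoring types can be approximated in a few lines. This is a sketch, not the project's `Scorer`: the function names are hypothetical, the `>=` comparison against the threshold is an assumption, and the command check handles only single-command alternatives tokenized with `shlex`.

```python
import shlex
from difflib import SequenceMatcher


def similarity_score(output: str, expected: str) -> float:
    """Ratio in [0.0, 1.0] from difflib.SequenceMatcher."""
    return SequenceMatcher(None, output, expected).ratio()


def passes_similarity(output: str, expected: str, threshold: float = 0.5) -> bool:
    """Assumes a score equal to the threshold counts as a pass."""
    return similarity_score(output, expected) >= threshold


def passes_exact_match(output: str, expected: str) -> bool:
    """Exact comparison after stripping surrounding whitespace."""
    return output.strip() == expected.strip()


def passes_command_equivalence(output: str, accepted: list[str]) -> bool:
    """Compare shell tokens so spacing and quoting differences do not matter."""
    tokens = shlex.split(output)
    return any(tokens == shlex.split(candidate) for candidate in accepted)
```

Tokenizing before comparing means `git  submodule   status` and `git submodule status` are treated as the same command, while a different subcommand still fails.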
Use command_equivalence for read-only fixtures that ask for a Git command and
where multiple command spellings are semantically equivalent:
scoring:
  type: command_equivalence
  accepted:
    - git submodule
    - git submodule status

Multi-command alternatives are supported as ordered command sequences:

scoring:
  type: command_equivalence
  accepted:
    - - git submodule init
      - git submodule update
    - - git submodule update --init

Every run produces an envelope wrapping benchmark results with metadata:
{
"version": 1,
"schema_version": 1,
"benchmark_suite_version": "0.1.0",
"timestamp": "2025-01-15T10:30:00+00:00",
"git_sha": "abc1234",
"model": "gpt-4o",
"profile": "openai-gpt4o",
"summary": {
"total_benchmarks": 17,
"total_fixtures": 204,
"total_passed": 120,
"overall_pass_at_k": 0.5882
},
"results": [ ... ]
}

When running `--benchmark <name>`, output is a single benchmark result:
{
"benchmark": "commit_messages",
"total": 12,
"passed": 3,
"pass_at_k": 0.25,
"scores": [
{
"fixture_id": "f001",
"passed": true,
"similarity": 0.72,
"model_output": "Add greeting file",
"error": null
}
],
"errors": 0
}

When running `--all-models` or `--all-profiles`, results are nested:
{
"benchmark_suite_version": "0.1.0",
"summary": {
"total_models": 2,
"total_fixtures": 408,
"total_passed": 220,
"overall_pass_at_k": 0.5392
},
"models": [
{ "model": "gpt-4o", "summary": {...}, "results": [...] },
{ "model": "gpt-4o-mini", "summary": {...}, "results": [...] }
]
}

When running `--all`, output is a combined JSON with a summary and per-benchmark results:

{
"summary": {
"total_benchmarks": 17,
"total_fixtures": 204,
"total_passed": 51,
"overall_pass_at_k": 0.25
},
"results": [
{
"benchmark": "commit_messages",
"total": 12,
"passed": 3,
"pass_at_k": 0.25,
"scores": [...],
"errors": 0
},
...
]
}

| Field | Type | Description |
|---|---|---|
| `benchmark` | string | Benchmark name |
| `total` | integer | Total number of fixtures |
| `passed` | integer | Number of fixtures that passed |
| `pass_at_k` | float | Fraction of fixtures with at least one passing attempt |
| `scores` | list | Per-fixture score objects |
| `errors` | integer | Number of fixtures that produced an error |

Each score object contains:

| Field | Type | Description |
|---|---|---|
| `fixture_id` | string | Fixture identifier |
| `passed` | boolean | Whether the model output passed the threshold |
| `similarity` | float | Text similarity score (0.0 – 1.0) |
| `model_output` | string | The model's generated output |
| `error` | string \| null | Error message if processing failed |
| `reasoning_level` | string \| null | Reasoning level used (e.g. "high") |
| `model` | string | Model name (present in multi-model output) |
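
The relationship between per-benchmark results and the envelope `summary` can be sketched as a simple aggregation. `summarize` is a hypothetical helper, and rounding to four decimal places is an assumption chosen to match the sample values shown earlier. With a single attempt per fixture, the overall rate reduces to passed fixtures divided by total fixtures.

```python
def summarize(results: list[dict]) -> dict:
    """Aggregate per-benchmark results into a summary block (sketch)."""
    total_fixtures = sum(r["total"] for r in results)
    total_passed = sum(r["passed"] for r in results)
    # Guard against an empty run so the division never fails.
    rate = total_passed / total_fixtures if total_fixtures else 0.0
    return {
        "total_benchmarks": len(results),
        "total_fixtures": total_fixtures,
        "total_passed": total_passed,
        "overall_pass_at_k": round(rate, 4),
    }
```

For example, 17 benchmarks of 12 fixtures each with 120 passes in total yield 120 / 204, which rounds to 0.5882.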
pytest tests/

Run with verbose output:

pytest tests/ -v

Run only integration tests:

pytest tests/test_integration.py -v

Run specific test files:

pytest tests/test_cli.py -v
pytest tests/test_export.py -v
pytest tests/test_parallel_fixtures.py -v
pytest tests/test_result_doctoring.py -v

See CONTRIBUTING.md for the full fixture authoring guide and step-by-step instructions for adding new benchmark categories.

Quick fixture template:
id: "f013"
description: "My scenario"
purpose: "Tests ability to generate a commit message for a new file."
difficulty: easy
tags: [commit-message, add, basic]
setup:
- "git init"
- "git config user.email 'test@test.com'"
- "git config user.name 'Test User'"
- "echo 'content' > file.txt"
- "git add file.txt"
prompt: "Your instruction to the model"
expected: "The correct output"
scoring:
  type: "similarity"
  threshold: 0.5

Important gotchas:

- YAML colon values: strings containing colons (e.g. `Fix: login`) must be quoted, for example `'Fix: login'`, to prevent PyYAML from parsing them as mapping keys.
- Rebase vs merge conflict polarity: in merge conflicts, HEAD's changes (the current branch) appear above `=======`. In rebase conflicts the polarity is reversed: HEAD is the upstream side, so upstream's changes appear above `=======` and the commit being replayed appears below it.
- GitExecutor exit codes: `git merge` and `git rebase` exit with code 1 on conflicts. `GitExecutor.setup_repo()` handles this automatically, so do not call `_run_command()` directly for merge/rebase commands in fixtures.
- New benchmarks: drop a Python module inheriting from `Benchmark` in `gitbench/benchmarks/`. No registration or harness changes are needed; modules are auto-discovered via `importlib`.

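
When authoring conflict fixtures, it can help to check which text sits on each side of `=======`. The sketch below splits a conflicted file into (above, below) pairs. `split_conflict` is a hypothetical helper, not part of GitBench, and it assumes Git's default two-way marker style (no `|||||||` base section).

```python
def split_conflict(text: str) -> list[tuple[str, str]]:
    """Return (above, below) text pairs for each conflict hunk in a file."""
    hunks: list[tuple[str, str]] = []
    above: list[str] = []
    below: list[str] = []
    state = None  # None outside a hunk, otherwise "above" or "below"
    for line in text.splitlines():
        if line.startswith("<<<<<<< "):
            state, above, below = "above", [], []
        elif line.startswith("=======") and state == "above":
            state = "below"
        elif line.startswith(">>>>>>> ") and state == "below":
            hunks.append(("\n".join(above), "\n".join(below)))
            state = None
        elif state == "above":
            above.append(line)
        elif state == "below":
            below.append(line)
    return hunks
```

The first element of each pair is the side Git labels `HEAD`; which branch that corresponds to depends on whether the conflict came from a merge or a rebase.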
gitbench/
├── __init__.py
├── cli.py # Click-based CLI (run, list, doctor, profiles, report)
├── config.py # Configuration loader (gitbench.json profiles)
├── export.py # CSV and artificialanalysis format exporters
├── render.py # Result aggregation and JSON generation for reports
├── result_doctoring.py # Selective repair of transient failures
├── version.py # Version constants (schema, suite, package)
├── harness/
│ ├── benchmark.py # Benchmark abstract base class
│ ├── types.py # Dataclasses: ModelMessage, Fixture, Score, BenchmarkResult
│ ├── model.py # ModelInterface, OpenAIAdapter, OllamaAdapter, MockModelClient
│ ├── loader.py # FixtureLoader: YAML loading and validation
│ ├── runner.py # BenchmarkRunner: executes benchmarks with progress reporting
│ ├── scorer.py # Scorer: similarity, command equivalence, state assertions, pass@k
│ └── reasoning.py # Model reasoning level validation matrix
├── benchmarks/
│ ├── __init__.py # Auto-discovery import (no registration needed)
│ ├── blame_forensics.py # Blame/forensics benchmark
│ ├── branch_cleanup.py # Branch cleanup benchmark
│ ├── cherry_pick.py # Cherry-pick benchmark
│ ├── commit_messages.py # Commit message generation benchmark
│ ├── commit_squash.py # Commit squash benchmark
│ ├── git_bisect.py # Git bisect benchmark
│ ├── git_clean.py # Git clean benchmark
│ ├── git_grep.py # Git grep benchmark
│ ├── git_log_format.py # Git log formatting benchmark
│ ├── git_show.py # Git show benchmark
│ ├── merge_conflicts.py # Merge conflict resolution benchmark
│ ├── rebase.py # Interactive rebase benchmark
│ ├── reflog.py # Reflog/detached HEAD recovery benchmark
│ ├── stash_recovery.py # Stash recovery benchmark
│ ├── submodule_usage.py # Submodule usage benchmark
│ ├── tag_management.py # Tag management benchmark
│ └── worktree_usage.py # Worktree usage benchmark
├── ui/
│ ├── __init__.py
│ ├── display.py # RichProgressDisplay: live TUI with progress bars
│ └── format.py # Duration, cost, and human-readable formatting
└── utils/
└── git.py # GitExecutor: sandboxed git repo management
fixtures/
├── blame_forensics/ # 12 YAML fixtures
├── branch_cleanup/ # 12 YAML fixtures
├── cherry_pick/ # 12 YAML fixtures
├── commit_messages/ # 12 YAML fixtures
├── commit_squash/ # 12 YAML fixtures
├── git_bisect/ # 12 YAML fixtures
├── git_clean/ # 12 YAML fixtures
├── git_grep/ # 12 YAML fixtures
├── git_log_format/ # 12 YAML fixtures
├── git_show/ # 12 YAML fixtures
├── merge_conflicts/ # 12 YAML fixtures
├── rebase/ # 12 YAML fixtures
├── reflog/ # 12 YAML fixtures
├── stash_recovery/ # 12 YAML fixtures
├── submodule_usage/ # 12 YAML fixtures
├── tag_management/ # 12 YAML fixtures
└── worktree_usage/ # 12 YAML fixtures
tests/
├── conftest.py # Shared test fixtures
├── test_benchmarks.py # Benchmark correctness tests
├── test_cli.py # CLI integration tests
├── test_config.py # Config loader tests
├── test_export.py # Export format tests
├── test_format.py # UI formatting tests
├── test_git.py # GitExecutor tests
├── test_integration.py # End-to-end integration tests
├── test_loader.py # Fixture loader tests
├── test_model.py # Model adapter tests
├── test_parallel_fixtures.py # Parallel fixture execution tests
├── test_reasoning.py # Reasoning level validation tests
├── test_render.py # Render/aggregation tests
├── test_result_doctoring.py # Doctor command tests
├── test_scorer.py # Scoring logic tests
└── test_types.py # Data type tests
- CLI (`gitbench/cli.py`): Click-based command group with `run`, `list`, `doctor`, `profiles`, and `report` commands.
- Configuration (`gitbench/config.py`): Loads model profiles from `gitbench.json`. Profiles define model lists, provider type, base URL, API key resolution, timeout, and retry settings.
- Benchmark ABC (`gitbench/harness/benchmark.py`): Abstract base class enforcing the `run_setup()`, `load_fixtures()`, and `score()` interface. Drop a new Python module in `benchmarks/` to add a new benchmark category; it is auto-discovered via `importlib`.
- Benchmark Runner (`gitbench/harness/runner.py`): Orchestrates fixture execution against a model, supports concurrent fixtures, and reports progress via the `RunProgress` protocol.
- Model adapters (`gitbench/harness/model.py`): `ModelInterface` ABC with `OpenAIAdapter`, `OllamaAdapter`, and `MockModelClient` implementations. Models can specify a reasoning level with `model#level` syntax.
- Reasoning validation (`gitbench/harness/reasoning.py`): Validates model + reasoning level combinations before any API calls. Fails fast on invalid pairings.
- Fixture loader (`gitbench/harness/loader.py`): Parses and validates YAML fixtures. Supports a single fixture per file (dict) or multiple (list).
- Scorer (`gitbench/harness/scorer.py`): Computes text similarity via `difflib.SequenceMatcher`, command equivalence, state assertions, structured field scoring, and `pass_at_k` across fixtures.
- Git executor (`gitbench/utils/git.py`): Manages sandboxed temporary git repositories for fixture isolation. Uses `_run_command_permissive` internally for `git merge` and `git rebase`, which exit with code 1 on conflicts.
- Progress display (`gitbench/ui/display.py`): Rich-based live TUI with progress bars, model panels, throughput stats, cost estimation, and final summary tables.
- Export (`gitbench/export.py`): Pluggable format registry. Ships with CSV (per-fixture) and artificialanalysis (per-benchmark) exporters.
- Result doctoring (`gitbench/result_doctoring.py`): Identifies transient failures (rate limits, timeouts, server errors) and re-runs only affected fixtures against the original models.
- Report rendering (`gitbench/render.py`): Aggregates run envelopes from directories or JSONL files, resolves model metadata, and produces `results.json` for the Astro report site.
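
The shape of the benchmark interface named above can be sketched with Python's `abc` module. Only the three method names come from this document; the signatures, the `name` attribute, and `EchoBenchmark` are assumptions for illustration and may not match `gitbench/harness/benchmark.py`.

```python
from abc import ABC, abstractmethod


class Benchmark(ABC):
    """Sketch of an abstract base; signatures here are hypothetical."""

    name: str = ""

    @abstractmethod
    def load_fixtures(self) -> list[dict]:
        """Return the fixtures this benchmark should run."""

    @abstractmethod
    def run_setup(self, fixture: dict) -> None:
        """Prepare the repository state for one fixture."""

    @abstractmethod
    def score(self, fixture: dict, model_output: str) -> bool:
        """Decide whether the model output passes for this fixture."""


class EchoBenchmark(Benchmark):
    """Toy subclass showing what a concrete category must provide."""

    name = "echo"

    def load_fixtures(self) -> list[dict]:
        return [{"id": "f001", "expected": "ok"}]

    def run_setup(self, fixture: dict) -> None:
        pass  # nothing to prepare in this toy example

    def score(self, fixture: dict, model_output: str) -> bool:
        return model_output.strip() == fixture["expected"]
```

Declaring the methods abstract means a subclass that omits one cannot be instantiated, so an incomplete benchmark module fails early rather than midway through a run.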