gitkraken/gitbench

GitBench

A benchmark suite for evaluating language models' Git competency. GitBench runs synthetic Git scenarios against models and scores their responses against known-good solutions.

Requirements

  • Python 3.10+
  • Git CLI installed and available in PATH

Installation

pip install -e .

Or with the included virtual environment:

source .venv/bin/activate

Quick Start

# List available benchmarks
gitbench list

# Run all benchmarks with the mock model (no API key needed)
gitbench run --all --model mock

# Run with a real model
gitbench run --all --model gpt-4o

# Run with a reasoning level
gitbench run --all --model o3-mini#high

# Run with Ollama locally
gitbench run --all --model llama3.1:8b --provider ollama

Configuration

Model profiles can be defined in gitbench.json (searched in ./gitbench.json, ./.gitbench.json, ~/.gitbench.json):

{
  "models": {
    "openai-gpt4o": {
      "models": ["gpt-4o", "gpt-4o-mini"],
      "provider": "openai",
      "api_key_env": "OPENAI_API_KEY"
    },
    "ollama-local": {
      "models": ["llama3.1:8b", "qwen2.5-coder:7b"],
      "provider": "ollama",
      "base_url": "http://localhost:11434"
    }
  },
  "outputs": {
    "json": "runs/latest.json"
  }
}

Then run with profiles:

gitbench run --all --profile openai-gpt4o
gitbench run --all --all-profiles              # all profiles sequentially
gitbench run --all --all-models                # flattened, can run concurrently

List configured profiles:

gitbench profiles
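
The config lookup described above can be sketched in a few lines. This is only an illustration: it assumes the first existing file in the documented search order wins, and `candidate_paths`, `find_config`, and `load_profiles` are hypothetical names, not GitBench's actual loader API.

```python
from __future__ import annotations

import json
from pathlib import Path


def candidate_paths(cwd: Path, home: Path) -> list[Path]:
    # Search order as documented: ./gitbench.json, ./.gitbench.json, ~/.gitbench.json
    return [cwd / "gitbench.json", cwd / ".gitbench.json", home / ".gitbench.json"]


def find_config(cwd: Path | None = None, home: Path | None = None) -> Path | None:
    """Return the first existing config file, or None (assumption: first match wins)."""
    cwd = cwd or Path.cwd()
    home = home or Path.home()
    for path in candidate_paths(cwd, home):
        if path.is_file():
            return path
    return None


def load_profiles(path: Path) -> dict:
    """Read the "models" mapping (profile name -> settings) from a config file."""
    with path.open(encoding="utf-8") as handle:
        return json.load(handle).get("models", {})
```

Passing `cwd` and `home` explicitly keeps the lookup easy to exercise against temporary directories.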

Running Benchmarks

Full CLI reference

gitbench run [OPTIONS]

Key options:

| Option | Description |
| --- | --- |
| --all, -a | Run all benchmarks against all models (shortcut) |
| --all-benchmarks | Run all available benchmarks |
| --benchmark, -b | Run a specific benchmark by name |
| --model, -m | Model to use. mock for testing, Ollama model names, or OpenAI-compatible model IDs |
| --profile, -p | Model profile from gitbench.json |
| --all-profiles | Run against all profiles defined in config |
| --all-models | Run against all models across all profiles (flattened) |
| --provider | Explicit provider type: ollama or openai (auto-detected from base_url if omitted) |
| --base-url | API base URL. Defaults to http://localhost:11434 for Ollama |
| --verbose, -v | Print detailed per-fixture results |
| --timeout, -t | Timeout in seconds per model attempt |
| --retry-count, -r | Retry attempts on failure (default: 3) |
| --model-workers | Number of models to run concurrently (default: 1) |
| --fixture-workers | Number of fixtures to run concurrently within a benchmark (default: 1) |

Output options

| Option | Description |
| --- | --- |
| --json-output | Single JSON file for results (default: gitbench-results/{timestamp}/results-v{version}.json) |
| --output-dir, -d | Write per-run auto-named JSON files to a directory |
| --jsonl, -j | Append run results as a JSON line to a file (useful for accumulating runs) |
| --export, -e | Export format(s): csv, artificialanalysis (repeatable) |
| --export-path | Path for export file (auto-named from model + timestamp if omitted) |

Examples

# Run a specific benchmark with verbose output
gitbench run --benchmark commit_messages --model mock --verbose

# Run all benchmarks with Ollama, 4 concurrent fixtures
gitbench run --all --model llama3.1:8b --provider ollama --fixture-workers 4

# Run with profile, export results
gitbench run --all --profile openai-gpt4o --export csv --export artificialanalysis

# Accumulate runs in a JSONL file
gitbench run --all --model gpt-4o --jsonl runs/history.jsonl

# Run two models concurrently
gitbench run --all --all-models --model-workers 2
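
A history file accumulated with --jsonl can be read back one line at a time. The sketch below assumes each line is a run envelope shaped like the one shown under Output Format; `read_history` and `pass_rates` are hypothetical helper names, not part of GitBench.

```python
import json
from pathlib import Path


def read_history(path):
    """Yield one parsed run envelope per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def pass_rates(path):
    """Return (timestamp, overall_pass_at_k) pairs in file order.

    Assumes each envelope has "timestamp" and "summary.overall_pass_at_k";
    missing fields come back as None rather than raising.
    """
    return [
        (run.get("timestamp"), run.get("summary", {}).get("overall_pass_at_k"))
        for run in read_history(path)
    ]
```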

Model reasoning levels

Models can be suffixed with #level to control reasoning effort:

gitbench run --all --model o3-mini#high
gitbench run --all --model gpt-4o#minimal

Valid levels: minimal, low, medium, high, xhigh (model-dependent). The harness validates the combination before any calls are made, failing fast on invalid pairings.
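
To make the model#level syntax concrete, here is a minimal parsing sketch. It only checks the level against the list above; the model-dependent validation that the harness performs is not reproduced, and `parse_model_spec` is a hypothetical name.

```python
from __future__ import annotations

# Levels as listed in this README; which ones a given model accepts is model-dependent.
VALID_LEVELS = ("minimal", "low", "medium", "high", "xhigh")


def parse_model_spec(spec: str) -> tuple[str, str | None]:
    """Split "model#level" into (model, level); level is None when no suffix is given."""
    model, sep, level = spec.partition("#")
    if not model:
        raise ValueError(f"missing model name in {spec!r}")
    if not sep:
        return model, None
    if level not in VALID_LEVELS:
        raise ValueError(f"unknown reasoning level {level!r} in {spec!r}")
    return model, level
```

Raising before any request is made mirrors the fail-fast behaviour described above.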

Result doctoring

Transient failures (rate limits, timeouts, server errors) can be repaired without re-running the entire suite:

# Preview what would be repaired
gitbench doctor results.json --dry-run

# Repair a single result file
gitbench doctor results.json

# Repair result files in every timestamped directory under gitbench-results/
gitbench doctor --latest

# Use a longer timeout for slow repair reruns
gitbench doctor --latest --timeout 180

The doctor command identifies failures caused by known transient error patterns (HTTP 429, 500/502/503/504, timeouts, rate limits) and re-runs only those fixtures against the original models.
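
The selection step can be pictured as a pattern match over each fixture's error message. The sketch below is an approximation under stated assumptions: the regular expressions are illustrative rather than the patterns GitBench ships, and `is_transient` and `fixtures_to_repair` are hypothetical names.

```python
import re

# Illustrative patterns for the error classes named above (assumed, not GitBench's own).
TRANSIENT_PATTERNS = [
    re.compile(r"\b(429|500|502|503|504)\b"),       # HTTP status codes
    re.compile(r"timed?\s*out", re.IGNORECASE),     # "timeout", "timed out", "time out"
    re.compile(r"rate\s*limit", re.IGNORECASE),     # "rate limit", "ratelimit"
]


def is_transient(error):
    """Return True when an error message matches a known transient pattern."""
    if not error:
        return False
    return any(pattern.search(error) for pattern in TRANSIENT_PATTERNS)


def fixtures_to_repair(benchmark_result):
    """List fixture IDs in one benchmark result whose error looks transient."""
    return [
        score["fixture_id"]
        for score in benchmark_result.get("scores", [])
        if is_transient(score.get("error"))
    ]
```

Matching bare status codes is deliberately crude; a message such as "500 files changed" would be a false positive, so a real implementation would likely anchor on more context.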

Report generation

After running benchmarks, generate a static report site:

gitbench report

This aggregates results from gitbench-results/, generates web/public/results.json, builds the Astro site to web/dist/, and starts a preview server.

gitbench report --open       # build and open in browser
gitbench report --dev        # start dev server with hot reload
gitbench report --no-build   # only generate results.json (skip build)
gitbench report -d my-results/ --open   # use custom input dir

Export formats

Results can be exported to structured formats for external analysis:

| Format | Description |
| --- | --- |
| csv | One row per fixture result (benchmark, fixture_id, model, passed, similarity, error, etc.) |
| artificialanalysis | One row per benchmark (model, benchmark, score, total, passed); compatible with artificialanalysis.com |

gitbench run --all --model gpt-4o --export csv --export artificialanalysis
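
For a sense of what the per-fixture CSV looks like, the sketch below flattens benchmark results into rows using only the columns named in the table. It is not GitBench's exporter: the real format includes further columns (the "etc." above), and `rows_from_result` and `to_csv` are hypothetical names.

```python
import csv
import io

# Subset of columns named in this README; the real export has more.
COLUMNS = ["benchmark", "fixture_id", "model", "passed", "similarity", "error"]


def rows_from_result(benchmark_result, model):
    """Yield one flat row per fixture score in a benchmark result."""
    for score in benchmark_result.get("scores", []):
        yield {
            "benchmark": benchmark_result["benchmark"],
            "fixture_id": score["fixture_id"],
            # Multi-model output carries a per-score model; otherwise use the run's model.
            "model": score.get("model", model),
            "passed": score["passed"],
            "similarity": score.get("similarity"),
            "error": score.get("error") or "",
        }


def to_csv(benchmark_results, model):
    """Render benchmark results as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for result in benchmark_results:
        writer.writerows(rows_from_result(result, model))
    return buffer.getvalue()
```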

Result versioning

Saved run envelopes include two version fields:

| Field | Meaning |
| --- | --- |
| schema_version / version | Integer output schema version for result readers |
| benchmark_suite_version | 0.x suite version tied to benchmark and fixture coverage |

During the pre-1.0 period, bump benchmark_suite_version whenever fixtures, benchmark definitions, prompts, expected answers, or scoring rules change in a way that affects comparability. Use 0.x minor bumps for added or behavior-changing benchmark coverage, and patch bumps for corrections that keep the same intended coverage.

Generated default artifact filenames include the benchmark suite version, and report rendering sorts accumulated runs by benchmark_suite_version first, then timestamp.
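
The sort order matters because version strings do not compare correctly as plain text ("0.10.0" sorts before "0.9.0"). A minimal sketch of a version-then-timestamp sort, assuming purely numeric dotted versions and ISO 8601 timestamps; `suite_version_key` and `sort_runs` are hypothetical names:

```python
from datetime import datetime


def suite_version_key(version):
    """Turn "0.10.0" into (0, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def sort_runs(runs):
    """Sort run envelopes by benchmark_suite_version first, then by timestamp."""
    return sorted(
        runs,
        key=lambda run: (
            suite_version_key(run["benchmark_suite_version"]),
            datetime.fromisoformat(run["timestamp"]),
        ),
    )
```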

Benchmarks

GitBench includes 17 benchmark categories:

| Benchmark | Description | Fixtures | Scoring |
| --- | --- | --- | --- |
| blame_forensics | Trace which commit introduced a bug using git blame/log | 12 | exact_match |
| branch_cleanup | Identify branches to delete (fully merged into main) | 12 | exact_match |
| cherry_pick | Apply specific commits from one branch to another, resolving conflicts | 12 | similarity |
| commit_messages | Given a diff, generate a meaningful commit message | 12 | similarity |
| commit_squash | Squash multiple commits into a single coherent commit | 12 | commit_selection |
| git_bisect | Identify the commit that introduced a bug via automated bisect | 12 | dynamic_hash |
| git_clean | Safely remove untracked files and directories | 12 | state_assertions |
| git_grep | Search repository content with git grep | 12 | exact_match/similarity |
| git_log_format | Format git log output for targeted history inspection | 12 | exact_match |
| git_show | Inspect commits, tags, and file state with git show | 12 | exact_match/dynamic_hash |
| merge_conflicts | Resolve merge conflicts producing the correct final tree | 12 | similarity |
| rebase | Clean up commit history before PR (squash, reorder, amend) | 12 | similarity |
| reflog | Restore lost commits or fix detached HEAD state | 12 | similarity |
| stash_recovery | Recover stashed changes or resolve stash pop conflicts | 12 | similarity |
| submodule_usage | Manage git submodules for external dependencies | 12 | state_assertions |
| tag_management | Create, inspect, move, and delete tags | 12 | state_assertions |
| worktree_usage | Use git worktrees for parallel development | 12 | state_assertions |

Each benchmark has 12 fixtures (17 × 12 = 204 in total), so pass@1 is computed over a sample large enough to be meaningful.

Scoring Types

| Type | Description |
| --- | --- |
| similarity | Text similarity via difflib.SequenceMatcher. Default threshold: 0.5 |
| exact_match | Exact string comparison after stripping whitespace |
| command_equivalence | Tokenized command comparison against fixture-declared accepted alternatives |
| state_assertions | Execute model output as git commands, then verify repo state via assertions (file_exists, dir_exists, file_content, branch_exists, git_config, git_output) |
| structured | Parse model output as key-value fields, score each independently (exact_match or similarity per field) |
| commit_selection | Verify that the model selects specific expected commits (used by commit_squash) |
| dynamic_hash | Match against a git hash that varies per run (used by git_bisect, git_show) |
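
The two text-based types are simple enough to sketch directly. The functions below are an illustration, not the code in scorer.py: whether GitBench treats a ratio equal to the threshold as a pass, and whether it normalizes text before comparing, are assumptions here, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(expected, actual):
    """Return the SequenceMatcher ratio between two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, expected, actual).ratio()


def score_similarity(expected, actual, threshold=0.5):
    """Pass when the ratio reaches the threshold (assumption: >= counts as a pass)."""
    ratio = similarity(expected, actual)
    return {"passed": ratio >= threshold, "similarity": round(ratio, 4)}


def score_exact_match(expected, actual):
    """Compare two strings after stripping leading and trailing whitespace."""
    return expected.strip() == actual.strip()
```

A fixture can tighten or loosen the similarity check by setting its own threshold under scoring, as in the fixture template later in this README.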

Use command_equivalence for read-only fixtures that ask for a Git command and where multiple command spellings are semantically equivalent:

scoring:
  type: command_equivalence
  accepted:
    - git submodule
    - git submodule status

Multi-command alternatives are supported as ordered command sequences:

scoring:
  type: command_equivalence
  accepted:
    - - git submodule init
      - git submodule update
    - - git submodule update --init

Output Format

Envelope

Every run produces an envelope wrapping benchmark results with metadata:

{
  "version": 1,
  "schema_version": 1,
  "benchmark_suite_version": "0.1.0",
  "timestamp": "2025-01-15T10:30:00+00:00",
  "git_sha": "abc1234",
  "model": "gpt-4o",
  "profile": "openai-gpt4o",
  "summary": {
    "total_benchmarks": 17,
    "total_fixtures": 204,
    "total_passed": 120,
    "overall_pass_at_k": 0.5882
  },
  "results": [ ... ]
}

Single benchmark

When running --benchmark <name>, output is a single benchmark result:

{
  "benchmark": "commit_messages",
  "total": 12,
  "passed": 3,
  "pass_at_k": 0.25,
  "scores": [
    {
      "fixture_id": "f001",
      "passed": true,
      "similarity": 0.72,
      "model_output": "Add greeting file",
      "error": null
    }
  ],
  "errors": 0
}

Multi-model / multi-profile output

When running --all-models or --all-profiles, results are nested:

{
  "benchmark_suite_version": "0.1.0",
  "summary": {
    "total_models": 2,
    "total_fixtures": 408,
    "total_passed": 220,
    "overall_pass_at_k": 0.5392
  },
  "models": [
    { "model": "gpt-4o", "summary": {...}, "results": [...] },
    { "model": "gpt-4o-mini", "summary": {...}, "results": [...] }
  ]
}

All benchmarks combined

When running --all, output is a combined JSON with a summary and per-benchmark results:

{
  "summary": {
    "total_benchmarks": 17,
    "total_fixtures": 204,
    "total_passed": 15,
    "overall_pass_at_k": 0.0735
  },
  "results": [
    {
      "benchmark": "commit_messages",
      "total": 12,
      "passed": 3,
      "pass_at_k": 0.25,
      "scores": [...],
      "errors": 0
    },
    ...
  ]
}

Each benchmark result contains:

| Field | Type | Description |
| --- | --- | --- |
| benchmark | string | Benchmark name |
| total | integer | Total number of fixtures |
| passed | integer | Number of fixtures that passed |
| pass_at_k | float | Fraction of fixtures with at least one passing attempt |
| scores | list | Per-fixture score objects |
| errors | integer | Number of fixtures that produced an error |

Each score object contains:

| Field | Type | Description |
| --- | --- | --- |
| fixture_id | string | Fixture identifier |
| passed | boolean | Whether the model output passed the threshold |
| similarity | float | Text similarity score (0.0 – 1.0) |
| model_output | string | The model's generated output |
| error | string or null | Error message if processing failed |
| reasoning_level | string or null | Reasoning level used (e.g. "high") |
| model | string | Model name (present in multi-model output) |
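
The summary block can be recomputed from the per-benchmark results, which is a handy consistency check when post-processing saved runs. The sketch assumes overall_pass_at_k is total_passed divided by total_fixtures, rounded to four decimals (inferred from the envelope example, where 120 / 204 gives 0.5882); `summarize` is a hypothetical name.

```python
def summarize(results):
    """Aggregate per-benchmark results into a summary dictionary."""
    total = sum(result["total"] for result in results)
    passed = sum(result["passed"] for result in results)
    return {
        "total_benchmarks": len(results),
        "total_fixtures": total,
        "total_passed": passed,
        # Assumption: overall rate is passed / total, rounded to 4 decimals.
        "overall_pass_at_k": round(passed / total, 4) if total else 0.0,
    }
```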

Running Tests

pytest tests/

Run with verbose output:

pytest tests/ -v

Run only integration tests:

pytest tests/test_integration.py -v

Run specific test files:

pytest tests/test_cli.py -v
pytest tests/test_export.py -v
pytest tests/test_parallel_fixtures.py -v
pytest tests/test_result_doctoring.py -v

Adding Fixtures and Benchmarks

See CONTRIBUTING.md for the full fixture authoring guide and step-by-step instructions for adding new benchmark categories.

Quick fixture template:

id: "f013"
description: "My scenario"
purpose: "Tests ability to generate a commit message for a new file."
difficulty: easy
tags: [commit-message, add, basic]
setup:
  - "git init"
  - "git config user.email 'test@test.com'"
  - "git config user.name 'Test User'"
  - "echo 'content' > file.txt"
  - "git add file.txt"
prompt: "Your instruction to the model"
expected: "The correct output"
scoring:
  type: "similarity"
  threshold: 0.5

Important gotchas:

  • YAML colon values: strings containing a colon followed by a space (e.g. Fix: login) must be quoted, otherwise PyYAML tries to parse them as a mapping. Single quotes work and need no backslash escaping.
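  • Rebase vs merge conflict polarity: In merge conflicts, HEAD's changes (the branch you are on) appear above =======. In rebase conflicts the polarity is reversed: HEAD is the upstream branch being rebased onto, so upstream's changes appear above ======= and the changes from the commit being replayed appear below.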
  • GitExecutor exit codes: git merge and git rebase exit with code 1 on conflicts. GitExecutor.setup_repo() handles this automatically — do not call _run_command() directly for merge/rebase commands in fixtures.
  • New benchmarks: Drop a Python module inheriting from Benchmark in gitbench/benchmarks/. No registration or harness changes needed — auto-discovered via importlib.
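
Auto-discovery of this kind is commonly built from pkgutil and importlib. The sketch below shows the general technique only; it is not the code in gitbench/benchmarks/__init__.py, and `discover_subclasses` is a hypothetical name.

```python
import importlib
import inspect
import pkgutil


def discover_subclasses(package_name, base):
    """Import every module in a package and collect concrete subclasses of base.

    Returns a mapping of module name -> class. Classes that are merely imported
    into a module (rather than defined there) are skipped.
    """
    package = importlib.import_module(package_name)
    found = {}
    for info in pkgutil.iter_modules(package.__path__):
        module = importlib.import_module(f"{package_name}.{info.name}")
        for _, obj in inspect.getmembers(module, inspect.isclass):
            if (
                issubclass(obj, base)
                and obj is not base
                and obj.__module__ == module.__name__
                and not inspect.isabstract(obj)
            ):
                found[info.name] = obj
    return found
```

With a layout like the one in this repository, a call might look like discover_subclasses("gitbench.benchmarks", Benchmark), assuming Benchmark is importable from the harness package.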

Project Structure

gitbench/
├── __init__.py
├── cli.py                  # Click-based CLI (run, list, doctor, profiles, report)
├── config.py               # Configuration loader (gitbench.json profiles)
├── export.py               # CSV and artificialanalysis format exporters
├── render.py               # Result aggregation and JSON generation for reports
├── result_doctoring.py     # Selective repair of transient failures
├── version.py              # Version constants (schema, suite, package)
├── harness/
│   ├── benchmark.py       # Benchmark abstract base class
│   ├── types.py           # Dataclasses: ModelMessage, Fixture, Score, BenchmarkResult
│   ├── model.py           # ModelInterface, OpenAIAdapter, OllamaAdapter, MockModelClient
│   ├── loader.py          # FixtureLoader: YAML loading and validation
│   ├── runner.py          # BenchmarkRunner: executes benchmarks with progress reporting
│   ├── scorer.py          # Scorer: similarity, command equivalence, state assertions, pass@k
│   └── reasoning.py       # Model reasoning level validation matrix
├── benchmarks/
│   ├── __init__.py        # Auto-discovery import (no registration needed)
│   ├── blame_forensics.py # Blame/forensics benchmark
│   ├── branch_cleanup.py  # Branch cleanup benchmark
│   ├── cherry_pick.py     # Cherry-pick benchmark
│   ├── commit_messages.py # Commit message generation benchmark
│   ├── commit_squash.py   # Commit squash benchmark
│   ├── git_bisect.py      # Git bisect benchmark
│   ├── git_clean.py       # Git clean benchmark
│   ├── git_grep.py        # Git grep benchmark
│   ├── git_log_format.py  # Git log formatting benchmark
│   ├── git_show.py        # Git show benchmark
│   ├── merge_conflicts.py # Merge conflict resolution benchmark
│   ├── rebase.py          # Interactive rebase benchmark
│   ├── reflog.py          # Reflog/detached HEAD recovery benchmark
│   ├── stash_recovery.py  # Stash recovery benchmark
│   ├── submodule_usage.py # Submodule usage benchmark
│   ├── tag_management.py  # Tag management benchmark
│   └── worktree_usage.py  # Worktree usage benchmark
├── ui/
│   ├── __init__.py
│   ├── display.py         # RichProgressDisplay: live TUI with progress bars
│   └── format.py          # Duration, cost, and human-readable formatting
└── utils/
    └── git.py             # GitExecutor: sandboxed git repo management
fixtures/
├── blame_forensics/       # 12 YAML fixtures
├── branch_cleanup/        # 12 YAML fixtures
├── cherry_pick/           # 12 YAML fixtures
├── commit_messages/       # 12 YAML fixtures
├── commit_squash/         # 12 YAML fixtures
├── git_bisect/            # 12 YAML fixtures
├── git_clean/             # 12 YAML fixtures
├── git_grep/              # 12 YAML fixtures
├── git_log_format/        # 12 YAML fixtures
├── git_show/              # 12 YAML fixtures
├── merge_conflicts/       # 12 YAML fixtures
├── rebase/                # 12 YAML fixtures
├── reflog/                # 12 YAML fixtures
├── stash_recovery/        # 12 YAML fixtures
├── submodule_usage/       # 12 YAML fixtures
├── tag_management/        # 12 YAML fixtures
└── worktree_usage/        # 12 YAML fixtures
tests/
├── conftest.py            # Shared test fixtures
├── test_benchmarks.py     # Benchmark correctness tests
├── test_cli.py            # CLI integration tests
├── test_config.py         # Config loader tests
├── test_export.py         # Export format tests
├── test_format.py         # UI formatting tests
├── test_git.py            # GitExecutor tests
├── test_integration.py    # End-to-end integration tests
├── test_loader.py         # Fixture loader tests
├── test_model.py          # Model adapter tests
├── test_parallel_fixtures.py  # Parallel fixture execution tests
├── test_reasoning.py      # Reasoning level validation tests
├── test_render.py         # Render/aggregation tests
├── test_result_doctoring.py   # Doctor command tests
├── test_scorer.py         # Scoring logic tests
└── test_types.py          # Data type tests

Architecture

  • CLI (gitbench/cli.py): Click-based command group with run, list, doctor, profiles, and report commands.
  • Configuration (gitbench/config.py): Loads model profiles from gitbench.json. Profiles define model lists, provider type, base URL, API key resolution, timeout, and retry settings.
  • Benchmark ABC (gitbench/harness/benchmark.py): Abstract base class enforcing the run_setup(), load_fixtures(), and score() interface. Drop a new Python module in benchmarks/ to add a new benchmark category; it is auto-discovered via importlib.
  • Benchmark Runner (gitbench/harness/runner.py): Orchestrates fixture execution against a model, supports concurrent fixtures, and reports progress via the RunProgress protocol.
  • Model adapters (gitbench/harness/model.py): ModelInterface ABC with OpenAIAdapter, OllamaAdapter, and MockModelClient implementations. Models can specify a reasoning level with model#level syntax.
  • Reasoning validation (gitbench/harness/reasoning.py): Validates model + reasoning level combinations before any API calls. Fails fast on invalid pairings.
  • Fixture loader (gitbench/harness/loader.py): Parses and validates YAML fixtures. Supports single fixture per file (dict) or multiple (list).
  • Scorer (gitbench/harness/scorer.py): Computes text similarity via difflib.SequenceMatcher, command equivalence, state assertions, structured field scoring, and pass_at_k across fixtures.
  • Git executor (gitbench/utils/git.py): Manages sandboxed temporary git repositories for fixture isolation. Uses _run_command_permissive internally for git merge and git rebase, which exit with code 1 on conflicts by design.
  • Progress display (gitbench/ui/display.py): Rich-based live TUI with progress bars, model panels, throughput stats, cost estimation, and final summary tables.
  • Export (gitbench/export.py): Pluggable format registry. Ships with CSV (per-fixture) and artificialanalysis (per-benchmark) exporters.
  • Result doctoring (gitbench/result_doctoring.py): Identifies transient failures (rate limits, timeouts, server errors) and re-runs only affected fixtures against the original models.
  • Report rendering (gitbench/render.py): Aggregates run envelopes from directories or JSONL files, resolves model metadata, and produces results.json for the Astro report site.
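
To make the Benchmark ABC contract above more tangible, here is a stand-in sketch. Only the three method names come from this README; the signatures, the dict-based fixture and score shapes, and the EchoBenchmark example are all assumptions made for illustration, so check gitbench/harness/benchmark.py and types.py for the real definitions.

```python
from abc import ABC, abstractmethod


class Benchmark(ABC):
    """Hypothetical stand-in for the base class in gitbench/harness/benchmark.py."""

    name = "unnamed"

    @abstractmethod
    def load_fixtures(self):
        """Return the fixtures this benchmark runs (assumed shape: list of dicts)."""

    @abstractmethod
    def run_setup(self, fixture):
        """Prepare the sandbox repository for one fixture (assumed signature)."""

    @abstractmethod
    def score(self, fixture, model_output):
        """Score one model output against a fixture (assumed signature)."""


class EchoBenchmark(Benchmark):
    """Toy subclass: passes when the output matches the expected text exactly."""

    name = "echo"

    def load_fixtures(self):
        return [{"id": "f001", "prompt": "Name the default branch", "expected": "main"}]

    def run_setup(self, fixture):
        return None  # nothing to set up in this toy example

    def score(self, fixture, model_output):
        passed = model_output.strip() == fixture["expected"]
        return {"fixture_id": fixture["id"], "passed": passed}
```

Because the three methods are abstract, a subclass that omits any of them cannot be instantiated, which is how an ABC enforces the interface.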
