This repository implements a refactored benchmark framework supporting a paper for IEEE Transactions on Knowledge and Data Engineering (T-DKE) on Resilient Semantic Reconciliation under API Schema Drift.
To ensure scientific rigor and reproducibility, the framework is strictly partitioned into two pathways that interact only through static dataset files:
graph TD
subgraph CHAOS ["2. Adversarial Chaos Generation (Secondary Path)"]
CG[generate_chaos_dataset.py] -->|Procedural Mutation Engine| CD[(chaos_dataset.json / CSV)]
end
subgraph SEMANTIC ["1. Semantic Translation Benchmark (Primary Path)"]
CD -->|Static Input Dataset| SB[run_semantic_benchmark.py]
SB -->|Local-only BERT| LB[StrictBERTModel]
SB -->|Local-only Gemma| LG[StrictGemmaModel]
SB -->|Resilience Metrics Package| RM[ResilienceEvaluator]
SB -->|IEEE T-DKE Ready Outputs| RO[per_run_benchmark.json]
SB -->|IEEE T-DKE Ready Outputs| RC[accuracy_vs_drift.csv]
end
- Directory: `semantic_benchmark/`
- Core Responsibilities:
  - Off-line evaluation of semantic drift detection and reconciliation algorithms.
  - Supports four reconcilers as first-class citizens: Regex, Levenshtein, BERT (sentence-transformers), and Gemma (generative LLM).
  - Implements detailed method attribution (metrics captured per run: `match_score`, `confidence`, `latency_ms`, `fallback_used`, `fallback_reason`).
  - Utilizes `resilience-metrics` for mathematical resilience profiling.
  - Offline Enforced: Zero cloud handshakes or API calls. Asserts `HF_HUB_OFFLINE=1` at runtime.
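As an illustration of per-run method attribution, the sketch below pairs a plain-Python Levenshtein matcher with a record that carries the five metrics listed above. The `Attribution` class, the `reconcile_key` function, the 0.5 threshold, and the `"below_threshold"` reason string are all hypothetical; the repository's actual reconcilers (including the native C++ accelerator) may work differently.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class Attribution:
    """Per-run metrics named in the README; the class itself is hypothetical."""
    match_score: float
    confidence: float
    latency_ms: float
    fallback_used: bool
    fallback_reason: Optional[str]


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def reconcile_key(drifted: str, candidates: list[str], threshold: float = 0.5):
    """Map a drifted key to the closest known key; the threshold is an assumed default."""
    start = time.perf_counter()
    best, best_score = None, 0.0
    for cand in candidates:
        longest = max(len(drifted), len(cand)) or 1
        # Normalized similarity in [0, 1]: 1.0 means identical strings.
        score = 1.0 - levenshtein(drifted.lower(), cand.lower()) / longest
        if score > best_score:
            best, best_score = cand, score
    matched = best_score >= threshold
    record = Attribution(
        match_score=best_score,
        confidence=best_score,  # assumption: confidence mirrors the similarity
        latency_ms=(time.perf_counter() - start) * 1000.0,
        fallback_used=not matched,
        fallback_reason=None if matched else "below_threshold",
    )
    return (best if matched else None), record
```

With these assumed settings, a typo such as `temperatur` maps back to `temperature`, while an abbreviation such as `tempC` falls below the threshold and is flagged as a fallback. That is the kind of case the semantic reconcilers are meant to cover.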
- Directory: `chaos_generator/`
- Core Responsibilities:
  - Procedural mutation synthesis (JSON corruption, schema mutation, paraphrase drift, and LLM-driven adversarial renames).
  - Produces static, replayable chaos datasets (JSON/CSV) that the scientific benchmark consumes.
  - Separation Guarantee: The Semantic Benchmark never invokes chaos generation or LLM mutation at runtime; it relies strictly on these static datasets to ensure reproducible experiments.
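One way to make a chaos dataset replayable is to drive every mutation from a seeded, local random generator and to serialize the result deterministically. The following is a minimal sketch under that assumption; the function names and the record layout (`run`, `drift_type`, `original`, `mutated`) are hypothetical and are not taken from `generate_chaos_dataset.py`.

```python
import json
import random


def generate_chaos_records(payloads, seed, runs_per_config=5):
    """Apply a toy 'missing_keys' mutation; a fixed seed makes the output replayable."""
    rng = random.Random(seed)  # local RNG: independent of the global random state
    records = []
    for payload in payloads:
        for run in range(runs_per_config):
            mutated = dict(payload)
            dropped = rng.choice(sorted(mutated))  # sorted() gives a stable key order
            del mutated[dropped]
            records.append({
                "run": run,
                "drift_type": "missing_keys",
                "original": payload,
                "mutated": mutated,
            })
    return records


def serialize(records):
    """Serialize deterministically so identical records yield identical text."""
    return json.dumps(records, sort_keys=True, indent=2)
```

Because the benchmark only reads the serialized file, two runs that start from the same seed see byte-identical input.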
Get the framework running in a few steps. The system detects your Python version (3.10–3.13) and selects dependency wheels to match.
Run the bootstrap utility to install optimized PyTorch, compile the native C++ Levenshtein accelerator, and pre-cache model weights locally:

    # 1. Initialize environment and pre-cache local BERT/Gemma weights
    python bootstrap.py --bootstrap
    # 2. Compile native C++ Levenshtein accelerator
    python setup.py build_ext --inplace
    # 3. Install the resilience-metrics package
    #    (the path below is machine-specific; point it at your local checkout)
    pip install -e /Users/tarekclarke/.gemini/antigravity/scratch/resilience-metrics

Query the baseline APIs and inject adversarial chaos to compile your evaluation dataset:

    python chaos_generator/generate_chaos_dataset.py \
      --output-dir chaos_generator/datasets \
      --runs-per-config 5 \
      --strategies json schema gemma

Execute the primary scientific benchmark under strict local-only validation:

    python semantic_benchmark/run_semantic_benchmark.py \
      --dataset-path chaos_generator/datasets/chaos_dataset.json \
      --require-local-models True \
      --strict-mode \
      --output-dir results

Use `--verbose` to view fine-grained matching scores, attributions, and latencies in real time.
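The local-only requirement can be enforced with a small guard that runs before any model is loaded. The sketch below is an assumption about how such a check might look. `assert_offline` is a hypothetical name, and only the `HF_HUB_OFFLINE` variable itself comes from this README.

```python
import os


def assert_offline(env=None):
    """Raise if the Hugging Face Hub offline switch is not set to 1."""
    env = os.environ if env is None else env
    if env.get("HF_HUB_OFFLINE") != "1":
        raise RuntimeError(
            "HF_HUB_OFFLINE=1 is required; refusing to run with Hub network access enabled."
        )
```

The variable is typically read when the Hugging Face libraries are imported, so a guard like this is most useful if it runs (and the variable is exported) before those imports.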
Algorithm robustness is assessed by integrating the official `resilience-metrics` package. System resilience is assessed under two distinct formulas. The component scores are:
- Throughput Score ($T$): Normalized as $\min(1.0, \frac{\text{throughput_pps}}{\text{target_hz}})$, assessing system capability to handle baseline processing frequencies (default: $100\text{ Hz}$).
- Detection Rate ($D$): Clamped to $[0, 1]$, measuring accuracy in identifying active schema drift events.
- Recovery Score ($R$): Clamped to $[0, 1]$, scoring schema mapping accuracy.
- Latency Score ($L$): Normalized as $\min(1.0, \frac{\text{baseline_p95_ms}}{\max(10^{-6}, \text{p95_latency_ms})})$, evaluating execution delays relative to a baseline threshold ($10\text{ ms}$).
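The component scores above translate directly into a few lines of Python. This is a sketch of the stated normalizations only, not the `resilience-metrics` API. The function names are hypothetical, and how the package combines the components into a final resilience value is not shown here.

```python
def clamp01(x: float) -> float:
    """Clamp a value into [0, 1], as used for the D and R scores."""
    return max(0.0, min(1.0, x))


def throughput_score(throughput_pps: float, target_hz: float = 100.0) -> float:
    """T = min(1.0, throughput_pps / target_hz); default target is 100 Hz."""
    return min(1.0, throughput_pps / target_hz)


def latency_score(p95_latency_ms: float, baseline_p95_ms: float = 10.0) -> float:
    """L = min(1.0, baseline_p95_ms / max(1e-6, p95_latency_ms)); default baseline is 10 ms."""
    return min(1.0, baseline_p95_ms / max(1e-6, p95_latency_ms))
```

For example, a system processing 50 payloads per second against the 100 Hz target scores $T = 0.5$, and a p95 latency of 20 ms against the 10 ms baseline scores $L = 0.5$. Anything at or better than the target saturates at 1.0, and the $10^{-6}$ floor keeps the latency ratio defined when the measured latency is zero.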
Resilience scores are aggregated globally, by drift type, and by reconciler method, and included in the final T-DKE output directory.
The framework supports 8 baseline schema drift types categorized to rigorously stress semantic matching bounds:
| Drift Type | Category | Original Schema | Mutated Schema |
|---|---|---|---|
| `missing_keys` | Structural / Lexical | `{"price": 100.0, "currency": "USD"}` | `{"currency": "USD"}` |
| `extra_keys` | Structural / Lexical | `{"price": 100.0}` | `{"price": 100.0, "price_extra": "dummy"}` |
| `renamed_keys` | Lexical / Semantic | `{"temperature": 22.5}` | `{"tempC": 22.5}` (or extreme domain renames) |
| `split_fields` | Structural / Syntactic | `{"location": "37.7 -122.4"}` | `{"location_lat": 37.7, "location_lng": -122.4}` |
| `merged_fields` | Structural / Syntactic | `{"first_name": "Max", "last_name": "Verstappen"}` | `{"full_name": "Max Verstappen"}` |
| `nested_corruption` | Structural | `{"address": "123 Main St"}` | `{"address": {"raw": "123 Main St"}}` |
| `type_mismatch` | Syntactic | `{"active": true}` | `{"active": "true"}` |
| `value_contradiction` | Semantic / Lexical | `{"price": 100.0}` | `{"price": 103.45}` (content/value paraphrases) |
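Several of the structural drift types in the table can be expressed as small pure functions over a payload dictionary. The sketch below reproduces five of the table's examples. The function signatures are hypothetical and chosen for illustration; the actual mutation engine in `chaos_generator/` may be organized differently.

```python
def missing_keys(payload, key):
    """Drop one key from the payload."""
    return {k: v for k, v in payload.items() if k != key}


def extra_keys(payload, key, value="dummy"):
    """Add a spurious '<key>_extra' field alongside the original keys."""
    return {**payload, f"{key}_extra": value}


def type_mismatch(payload, key):
    """Replace a value with its string form (booleans become 'true'/'false')."""
    out = dict(payload)
    value = out[key]
    out[key] = str(value).lower() if isinstance(value, bool) else str(value)
    return out


def merged_fields(payload, keys, merged_key, sep=" "):
    """Collapse several fields into a single string field."""
    out = {k: v for k, v in payload.items() if k not in keys}
    out[merged_key] = sep.join(str(payload[k]) for k in keys)
    return out


def nested_corruption(payload, key):
    """Wrap a scalar value in a nested object under a 'raw' key."""
    out = dict(payload)
    out[key] = {"raw": payload[key]}
    return out
```

Each function returns a new dictionary and leaves its input untouched, which keeps the original and mutated payloads available side by side for scoring.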
To systematically evaluate the reconcilers, the pipeline parameters are highly configurable:
- APIs: SpaceX, Finnhub, OpenMeteo, OpenF1.
- Intensities: Supports testing across any chaos intensity parameters (e.g., `--levels 5` or `--levels 0.05 0.01 0.005`).
- Frequencies: Evaluate performance profiles under traffic baseline targets using `--target-hz` (e.g. `--target-hz 100` for 100 Hz up to `--target-hz 1000000` for 1 MHz).
- Sequential Reconciler Loop: Reconcilers are run in strict sequence to prevent CPU/GPU core resource contention, ensuring pure latency and throughput metrics.
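The options above could be wired up roughly as follows. This is a sketch, not the project's CLI. Only the `--levels` and `--target-hz` flag names come from this README, while the defaults, the `parse_args` and `run_sequentially` helpers, and the shape of the timing dictionary are assumptions.

```python
import argparse
import time


def parse_args(argv=None):
    """Parse the two flags named in the README; the defaults are assumed."""
    parser = argparse.ArgumentParser(description="Hypothetical benchmark CLI sketch")
    parser.add_argument("--levels", type=float, nargs="+", default=[0.05])
    parser.add_argument("--target-hz", type=float, default=100.0)
    return parser.parse_args(argv)


def run_sequentially(reconcilers, samples):
    """Time each reconciler on its own, one after another (no threads or pools)."""
    timings = {}
    for name, fn in reconcilers.items():
        start = time.perf_counter()
        for sample in samples:
            fn(sample)
        elapsed = time.perf_counter() - start
        timings[name] = {
            "latency_ms": elapsed * 1000.0 / max(1, len(samples)),
            "throughput_pps": len(samples) / elapsed if elapsed > 0 else float("inf"),
        }
    return timings
```

Running the reconcilers in a plain loop means each measurement sees the whole machine, so a slow model cannot inflate the latency recorded for a fast string matcher.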
This framework provides optimized acceleration wheels across multiple hardware targets:
- Apple Silicon M4 Macs: Leverages macOS native GPU execution via Metal Performance Shaders (MPS).
- Windows AMD GPU Workstations (e.g. Radeon RX 7900 XT): Supports the newest ROCm/HIP 7.x environments on Windows by checking paths and environment variables (`HIP_PATH`, `ROCM_PATH`), falling back cleanly to Microsoft DirectML if needed.
- NVIDIA Linux Clusters: Integrates native NVIDIA CUDA acceleration.
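A backend selector along these lines might look like the sketch below. The priority order, the returned labels, and both function names are assumptions made for illustration. Only the `HIP_PATH` and `ROCM_PATH` variables and the DirectML fallback come from the list above. The decision logic is kept separate from the PyTorch probe so it can be exercised without a GPU.

```python
import os


def probe_torch():
    """Return (cuda_available, mps_available); both False if PyTorch is missing."""
    try:
        import torch
    except ImportError:
        return False, False
    mps = getattr(torch.backends, "mps", None)
    return torch.cuda.is_available(), bool(mps and mps.is_available())


def pick_backend(cuda_available, mps_available, env=None, is_windows=None):
    """Choose a backend label; the priority order here is an assumption."""
    env = os.environ if env is None else env
    is_windows = (os.name == "nt") if is_windows is None else is_windows
    if mps_available:
        return "mps"
    if cuda_available:
        # PyTorch ROCm builds report devices through the torch.cuda interface,
        # so the environment variables are used to tell the two apart.
        return "rocm" if (env.get("HIP_PATH") or env.get("ROCM_PATH")) else "cuda"
    if is_windows:
        return "directml"  # fallback named in the README for Windows AMD hosts
    return "cpu"
```

A caller would typically combine the two as `pick_backend(*probe_torch())`.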
The platform and ablation tables below are automatically compiled and updated from the latest experimental results. After executing a benchmark run, run the following utility:

    python scripts/update_readme_tables.py

This script parses the files in `results/`, computes aggregates, and updates the markdown sections below.
| Platform | Total Runs | Avg Latency (ms) | Avg Accuracy (%) | Avg Resilience P | Avg Throughput (pps) |
|---|---|---|---|---|---|
| Apple Silicon MPS (mps) | 2 | 1.75 | 75.0 | 0.950 | 4154.01 |

| Drift Type | Regex Acc | Levenshtein Acc | BERT Acc | Gemma Acc |
|---|---|---|---|---|
| renamed_keys | 1 | 0 | 0 | 0 |
| type_mismatch | 1 | 1 | 0 | 0 |

| Method | Avg Latency (ms) | Min Latency (ms) | Max Latency (ms) |
|---|---|---|---|
| regex | 3.18 | 0.1583 | 6.20 |
| levenshtein | 0.3134 | 0.1227 | 0.5042 |
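A per-method latency table like the last one above can be derived by grouping per-run records and emitting markdown rows. The sketch below assumes each record is a dictionary with `method` and `latency_ms` keys. That layout and the `latency_table` helper are hypothetical and not taken from `scripts/update_readme_tables.py`.

```python
from collections import defaultdict
from statistics import mean


def latency_table(records):
    """Group latencies by method and render avg/min/max as a markdown table."""
    by_method = defaultdict(list)
    for rec in records:
        by_method[rec["method"]].append(rec["latency_ms"])
    lines = [
        "| Method | Avg Latency (ms) | Min Latency (ms) | Max Latency (ms) |",
        "|---|---|---|---|",
    ]
    for method in sorted(by_method):  # sorted for a stable row order
        vals = by_method[method]
        lines.append(
            f"| {method} | {mean(vals):.4f} | {min(vals):.4f} | {max(vals):.4f} |"
        )
    return "\n".join(lines)
```

Sorting the method names keeps the generated table stable between runs, so regenerating the README only changes rows whose numbers actually moved.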