Telemetry-Chaos Evaluation Pipeline Audit Report

Date: May 28, 2026
Audit Scope: Reconciliation method tracking in raw evaluation JSON logs
Status: CRITICAL GAP IDENTIFIED

Executive Summary

ANSWER: NO. The raw evaluation JSON logs do not record which drift detection or repair method was used for a given drift event. The only method-related field, reconciliation_winner, is "canonical" in every record, and the indirect inference routes examined in Section 5 do not work on the current logs.

The pipeline code supports five reconciliation methods (canonical, regex, Levenshtein, BERT, Gemma), but only a single result is written to the evaluation logs: that of the canonical matcher. Per-method latency, confidence, and match information exists in memory during processing but is never persisted to any log file. Because every persisted latency field is 0, the logs alone cannot confirm whether the four non-canonical methods were actually executed.

Detailed Audit Findings

1. Drift Detection Logic Audit

Location: semantic/compare.py lines 17–97 (classify_drift method)

What Gets Logged:

✅ drift_types: dictionary with all 8 drift types and their counts
✅ drift_detected: Boolean flag
✅ drift_type_count: total count of detected anomalies

Field Structure (raw JSON):

{
  "drift_detected": true,
  "drift_types": {
    "missing_keys": 0,
    "extra_keys": 0,
    "renamed_keys": 0,
    "type_mismatch": 0,
    "value_contradiction": 1,
    "split_fields": 0,
    "merged_fields": 0,
    "nested_corruption": 0
  },
  "drift_type_count": 1
}

Critical Issue: The drift_types dictionary is deterministic and does not vary based on which detection method is used. There is no per-method detection result logged.
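The three logged detection fields can be cross-checked mechanically. The sketch below assumes, based only on the sample above, that drift_type_count equals the sum of the per-type counts and that drift_detected is true exactly when that sum is positive. The helper name check_drift_fields is hypothetical and is not part of the pipeline.

```python
def check_drift_fields(record: dict) -> list[str]:
    """Return a list of inconsistencies among the logged drift fields."""
    problems = []
    total = sum(record.get("drift_types", {}).values())
    # Assumption: drift_type_count is the sum of all per-type counts
    if record.get("drift_type_count") != total:
        problems.append(
            f"drift_type_count={record.get('drift_type_count')} but counts sum to {total}"
        )
    # Assumption: drift_detected is true exactly when at least one anomaly was counted
    if bool(record.get("drift_detected")) != (total > 0):
        problems.append("drift_detected disagrees with drift_types")
    return problems


sample = {
    "drift_detected": True,
    "drift_types": {
        "missing_keys": 0, "extra_keys": 0, "renamed_keys": 0, "type_mismatch": 0,
        "value_contradiction": 1, "split_fields": 0, "merged_fields": 0,
        "nested_corruption": 0,
    },
    "drift_type_count": 1,
}
print(check_drift_fields(sample))
```

A check of this kind confirms the fields agree with each other, but it cannot recover which method produced them, since none is recorded.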

Chaos Logging (during injection, not detection):

drift_type in chaos_metadata records what was injected, not what was detected.
Example: chaos_metadata.drift_type = "value_contradiction"
This is not the detected method—this is what was intentionally introduced.
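To keep the injected and detected values apart when reading a record, a comparison along the following lines may help. It is a sketch only: the function name is hypothetical, and it assumes the injected drift_type string is meant to match one of the keys in drift_types.

```python
def injected_drift_was_detected(record: dict) -> bool:
    """True if the injected drift type has a non-zero count in drift_types."""
    injected = record.get("chaos_metadata", {}).get("drift_type")
    # drift_types holds what classify_drift counted; chaos_metadata holds what was injected
    return record.get("drift_types", {}).get(injected, 0) > 0


record = {
    "chaos_metadata": {"drift_type": "value_contradiction"},
    "drift_types": {"value_contradiction": 1, "type_mismatch": 0},
}
print(injected_drift_was_detected(record))
```

An injected type that has no matching key in drift_types simply yields False here, which is one way naming mismatches between the injection and detection vocabularies would surface.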

2. Reconciliation Logic Audit

Location: semantic/compare.py lines 109–162 (process and compare_algorithms methods)

Methods Evaluated (all called):

Canonical matcher — JSON serialization + fallback
Regex reconciler — Pattern-based matching
Levenshtein reconciler — String distance matching
BERT reconciler — Semantic embeddings (all-MiniLM)
Gemma reconciler — LLM-based translation (Gemma-4 E4B)

Internal Processing (lines 153–162):

def compare_algorithms(self, canonical_keys: list, query_key: str) -> dict:
    result = {}
    if self.levenshtein is not None:
        result["levenshtein"] = self.levenshtein.reconcile(canonical_keys, query_key)
    if self.regex is not None:
        result["regex"] = self.regex.reconcile(canonical_keys, query_key)
    if self.bert is not None:
        result["bert"] = self.bert.reconcile(canonical_keys, query_key)
    if self.gemma is not None:
        result["gemma"] = self.gemma.reconcile(canonical_keys, query_key)
    return result

Each reconciler returns:

{
    "match": <matched_key>,
    "confidence": <0.0_to_1.0>,
    "latency_ms": <execution_time>
}

What process() Returns (in memory, before saving):

{
  "reconciliation_winner": "canonical",
  "fallback_used": false,
  "best_confidence": <best_value_across_methods>,
  "algorithm_results": {
    "levenshtein": { "match": "...", "confidence": ..., "latency_ms": ... },
    "regex": { "match": "...", "confidence": ..., "latency_ms": ... },
    "bert": { "match": "...", "confidence": ..., "latency_ms": ... },
    "gemma": { "match": "...", "confidence": ..., "latency_ms": ... }
  }
}

CRITICAL FINDING: The algorithm_results dictionary is available in memory in the process() method's return value, but based on inspection of actual raw JSON files, this field is not being persisted to the output JSON logs.

Verification: Sample JSON from results/raw/NVIDIA_B200_178GB/run_004_finnhub_*.json shows:

✅ reconciliation_winner is present
✅ fallback_used is present
✅ averages field exists (discussed below)
❌ algorithm_results field is NOT present
❌ Per-method confidence and match data is NOT persisted
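One way to close this gap is to refuse to write a record that lacks algorithm_results. The following is a minimal sketch under stated assumptions: save_evaluation_record is a hypothetical helper, the real save routine is not shown in this audit, and the values in the usage example are placeholders rather than measurements.

```python
import json
import tempfile
from pathlib import Path


def save_evaluation_record(result: dict, out_path: Path) -> None:
    """Write the full in-memory result, including algorithm_results, to JSON."""
    if "algorithm_results" not in result:
        raise KeyError("algorithm_results missing from process() result")
    out_path.write_text(json.dumps(result, indent=2, sort_keys=True))


# Usage with placeholder values (not real measurements)
result = {
    "reconciliation_winner": "canonical",
    "fallback_used": False,
    "algorithm_results": {
        "regex": {"match": "price", "confidence": 0.5, "latency_ms": 1.0},
    },
}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "run_example.json"
    save_evaluation_record(result, path)
    reloaded = json.loads(path.read_text())
```

Failing loudly on a missing field is a design choice; a softer variant could log a warning and write the record anyway.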

3. Evaluation Output Schema Audit

Location: parse_raw_results.py (parsing logic) and actual raw JSON files

Fields Present in Raw JSON:

| Field | Type | Values | Explicit Method Info? |
|---|---|---|---|
| `chaos_metadata.strategy` | str | `"json"`, `"schema"`, `"gemma"` | ✅ YES (chaos injection source) |
| `chaos_metadata.drift_type` | str | `"value_contradiction"`, `"type_mismatch"`, etc. | ✅ YES (injected drift type) |
| `drift_types` | dict | `{"missing_keys": 0, "extra_keys": 0, ...}` | ⚠️ IMPLICIT (detected type) |
| `drift_detected` | bool | `true`, `false` | ❌ NO |
| `reconciliation_winner` | str | `"canonical"` | ⚠️ IMPLICIT (always "canonical") |
| `fallback_used` | bool | `false`, `true` | ⚠️ IMPLICIT ONLY |
| `averages.levenshtein_latency` | float | `0` or `>0` | ⚠️ IMPLICIT ONLY |
| `averages.regex_latency` | float | `0` or `>0` | ⚠️ IMPLICIT ONLY |
| `averages.bert_latency` | float | `0` or `>0` | ⚠️ IMPLICIT ONLY |
| `averages.gemma_latency` | float | `0` or `>0` | ⚠️ IMPLICIT ONLY |
| `repair_rate` | float | `1.0`, `0.0` | ❌ NO |
| `recovery_score` | float | `0.0`–`1.0` | ❌ NO |
| `p95_latency_ms` | float | `>0` | ❌ NO |

Sample Raw JSON (from audit):

{
  "chaos_metadata": {
    "strategy": "gemma",
    "level": "medium",
    "drift_type": "value_contradiction"
  },
  "drift_detected": true,
  "drift_types": {
    "value_contradiction": 1,
    ...
  },
  "reconciliation_winner": "canonical",
  "fallback_used": false,
  "repair_rate": 1.0,
  "recovery_score": 0.9887741600578119,
  "averages": {
    "levenshtein_latency": 0,
    "regex_latency": 0,
    "bert_latency": 0,
    "gemma_latency": 0,
    "gemma_confidence": 1.0
  }
}

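Key Observation:

All four latency fields in averages are 0, so no per-method timing was persisted. Two explanations are consistent with this:
1. The pipeline runs only the canonical matcher and skips the other methods.
2. The pipeline runs all methods but discards their results before saving.
The sample also reports a gemma_confidence of 1.0 alongside a gemma_latency of 0, which the logs do not explain.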
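The all-zero observation can be tested per record with a small helper. This is a sketch: nonzero_latency_methods is a hypothetical name, and the four key names are taken from the sample JSON above.

```python
LATENCY_KEYS = (
    "levenshtein_latency",
    "regex_latency",
    "bert_latency",
    "gemma_latency",
)


def nonzero_latency_methods(record: dict) -> list[str]:
    """Return the methods whose persisted average latency is greater than zero."""
    averages = record.get("averages", {})
    return [
        key.removesuffix("_latency")  # requires Python 3.9+
        for key in LATENCY_KEYS
        if averages.get(key, 0) > 0
    ]


sample = {
    "averages": {
        "levenshtein_latency": 0,
        "regex_latency": 0,
        "bert_latency": 0,
        "gemma_latency": 0,
        "gemma_confidence": 1.0,
    }
}
print(nonzero_latency_methods(sample))
```

An empty list for every record is what the audit observed; it does not distinguish between the two explanations, only confirms that no timing evidence was saved.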

4. Per-Event Drift Metadata Audit

Chaos Injection Logs (drift_events.csv / drift_events.json)

CSV Headers:

timestamp, api_source, run_number, hardware_platform, 
hardware_model, cloud_platform, chaos_strategy, chaos_level, 
drift_type, original_field, mutated_field, metadata

Sample Row:

2026-05-20T17:46:49.214815Z,finnhub,1,MPS,Apple Silicon (arm),local,
json,0.05,value_typo,canonical_value,canonical_value,
{"original_value": "price", "mutated_value": "pice", "total_runtime_sec": 14.69502666698827}

Metadata Field (contains):

✅ original_value — original field value before chaos
✅ mutated_value — mutated field value after chaos
✅ total_runtime_sec — execution time
❌ NO field indicating which reconciliation method was used
❌ NO field indicating repair success/failure per method
❌ NO per-method latency or confidence
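The sample row can be parsed to confirm which keys the metadata column carries. The sketch below assumes the layout shown in the sample row, where metadata is the last column and appears as unquoted JSON; if the real file quotes that column according to CSV rules, Python's csv module would be the appropriate reader instead. The function name parse_drift_event is hypothetical.

```python
import json

CSV_HEADERS = [
    "timestamp", "api_source", "run_number", "hardware_platform",
    "hardware_model", "cloud_platform", "chaos_strategy", "chaos_level",
    "drift_type", "original_field", "mutated_field", "metadata",
]


def parse_drift_event(line: str) -> dict:
    """Split one row into named fields and decode the trailing JSON metadata."""
    # maxsplit keeps the JSON metadata, which itself contains commas, in one piece
    parts = line.rstrip("\n").split(",", len(CSV_HEADERS) - 1)
    event = dict(zip(CSV_HEADERS, parts))
    event["metadata"] = json.loads(event["metadata"])
    return event


row = (
    "2026-05-20T17:46:49.214815Z,finnhub,1,MPS,Apple Silicon (arm),local,"
    "json,0.05,value_typo,canonical_value,canonical_value,"
    '{"original_value": "price", "mutated_value": "pice", '
    '"total_runtime_sec": 14.69502666698827}'
)
event = parse_drift_event(row)
print(sorted(event["metadata"]))
```

The metadata keys recovered this way are limited to the original value, the mutated value, and the runtime, which matches the finding that no reconciliation method is recorded per event.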

Logs Available:

chaos_log.json — Contains injected chaos trace only
drift_events.csv/json — Contains injection metadata only
drift_detection_log.json — NOT FOUND IN AUDIT (may not be used)
reconciliation_log.json — Contains only final reconciliation decision, not per-method results

5. Implicit Inference Analysis

Can we infer the method indirectly?

5.1 Via `fallback_used` flag

Mapping Logic (from parse_raw_results.py lines 135–160):

def map_semantic_repair_pathway(fallback_used, reconciliation_winner, averages):
    if not fallback_used and reconciliation_winner == 'canonical':
        return 'Canonical Matcher Bypass (Serialization Only)'
    
    if fallback_used:
        if averages.get('gemma_latency', 0) > 0:
            return 'Gemma-4 E4B LLM Reconciler'
        if averages.get('bert_latency', 0) > 0:
            return 'BERT Semantic Embedding (all-MiniLM)'
        if averages.get('regex_latency', 0) > 0:
            return 'Regex Structural Template Matcher'
        if averages.get('levenshtein_latency', 0) > 0:
            return 'Levenshtein String Distance Filter'

Critical Finding (from EMPIRICAL_LOG_DOCUMENTATION.md):

"All 9,900 records route through Canonical Matcher Bypass (Serialization Only) — the fallback mechanism was not triggered in any evaluation run."

Implication:

✅ fallback_used=false → canonical method was used
❌ When fallback_used=true, we can infer which method, but only if latency > 0
❌ All latency fields are 0 in current logs, so inference is impossible
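As excerpted, the mapping function has no return statement for the case where fallback_used is true and every latency is 0, so in Python it would return None there. A variant with an explicit fallthrough might look like the sketch below. The function name and the "Undetermined" label are hypothetical; the other labels are copied from the excerpt.

```python
UNDETERMINED = "Undetermined (no method evidence in log)"  # hypothetical label

# Checked in the same order as the excerpt: Gemma, BERT, regex, Levenshtein
FALLBACK_LABELS = (
    ("gemma_latency", "Gemma-4 E4B LLM Reconciler"),
    ("bert_latency", "BERT Semantic Embedding (all-MiniLM)"),
    ("regex_latency", "Regex Structural Template Matcher"),
    ("levenshtein_latency", "Levenshtein String Distance Filter"),
)


def map_pathway_with_fallthrough(
    fallback_used: bool, reconciliation_winner: str, averages: dict
) -> str:
    """Map a record to a repair pathway label, never returning None."""
    if not fallback_used and reconciliation_winner == "canonical":
        return "Canonical Matcher Bypass (Serialization Only)"
    if fallback_used:
        for key, label in FALLBACK_LABELS:
            if averages.get(key, 0) > 0:
                return label
    # Reached when fallback fired without latency evidence, or the winner is unexpected
    return UNDETERMINED


print(map_pathway_with_fallthrough(True, "canonical", {"gemma_latency": 0}))
```

Returning a named sentinel instead of None makes undetermined records countable in downstream summaries.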

5.2 Via `averages.*_latency` fields

Current State: All latency fields are 0 in all 9,900 records.

Why:

Latency values would be non-zero only if the methods were actually executed and their results persisted.
Since every latency field is 0, either:
1. Only the canonical matcher is being called (most likely), or
2. All methods are called but their results are discarded before saving

Inference Capability:

❌ Cannot infer method from non-zero latency (all are zero)
❌ Cannot use process of elimination

5.3 Via `reconciliation_winner` field

Current State: All 9,900 records show reconciliation_winner: "canonical"

Does this tell us the method?

⚠️ PARTIALLY: If reconciliation_winner="canonical", we know canonical matcher was the final winner
❌ But it doesn't tell us which other methods were evaluated
❌ It doesn't tell us the confidence of each method
❌ It doesn't tell us which method would have won if fallback were triggered
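The claim that every record carries the same winner can be re-verified with a simple tally over the loaded records. The sketch below works on already-parsed dictionaries; tally_pathways is a hypothetical helper, and the three records in the usage example are placeholders, not the 9,900 audited records.

```python
from collections import Counter


def tally_pathways(records: list[dict]) -> Counter:
    """Count records by (reconciliation_winner, fallback_used) pair."""
    return Counter(
        (r.get("reconciliation_winner"), r.get("fallback_used")) for r in records
    )


# Placeholder records for illustration only
records = [
    {"reconciliation_winner": "canonical", "fallback_used": False},
    {"reconciliation_winner": "canonical", "fallback_used": False},
    {"reconciliation_winner": "canonical", "fallback_used": False},
]
counts = tally_pathways(records)
print(counts)
```

A single key in the resulting counter means the field carries no discriminating information across the dataset.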

6. Method Logic and Implicit Pathways

Detection Method Selection (implicit):

No explicit per-method detection logic
classify_drift() is deterministic and always runs the same algorithm
No fallback or selection mechanism for detection

Repair Method Selection (implicit):

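if best_confidence < 0.5:
    fallback_used = True
    # The fallback method itself is not recorded; parse_raw_results.py
    # infers it afterwards from whichever latency field is > 0
else:
    fallback_used = False
    result = canonical_result  # keep the canonical matcher's output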

Implication: Method selection is automatic and not logged.

Missing Fields Required for Explicit Method Logging

If the pipeline is to log reconciliation method information explicitly, the following fields must be added to evaluation JSON logs. The values shown are illustrative placeholders, not measurements.

6.1 Per-Method Results

{
  "algorithm_results": {
    "canonical": {
      "match": "canonical_key_name",
      "confidence": 0.95,
      "latency_ms": 0.123,
      "success": true
    },
    "regex": {
      "match": "detected_key_name",
      "confidence": 0.87,
      "latency_ms": 2.456,
      "success": true
    },
    "levenshtein": {
      "match": "detected_key_name",
      "confidence": 0.76,
      "latency_ms": 1.789,
      "success": true
    },
    "bert": {
      "match": "detected_key_name",
      "confidence": 0.92,
      "latency_ms": 45.234,
      "success": true
    },
    "gemma": {
      "match": "detected_key_name",
      "confidence": 0.88,
      "latency_ms": 120.567,
      "success": true
    }
  },
  "method_ranking": [
    {"method": "canonical", "confidence": 0.95, "latency_ms": 0.123},
    {"method": "bert", "confidence": 0.92, "latency_ms": 45.234},
    {"method": "gemma", "confidence": 0.88, "latency_ms": 120.567},
    {"method": "regex", "confidence": 0.87, "latency_ms": 2.456},
    {"method": "levenshtein", "confidence": 0.76, "latency_ms": 1.789}
  ],
  "winning_method": "canonical",
  "winning_confidence": 0.95,
  "winning_latency_ms": 0.123
}
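The ranking and winner fields can be derived from algorithm_results rather than maintained by hand. The sketch below is one possible derivation, not the pipeline's selection rule: it assumes ranking by descending confidence with lower latency as the tie-breaker, and rank_methods is a hypothetical name. The input values are the illustrative ones from the JSON above.

```python
def rank_methods(algorithm_results: dict) -> dict:
    """Derive method_ranking and winning_* fields from per-method results."""
    ranking = sorted(
        (
            {"method": name, "confidence": r["confidence"], "latency_ms": r["latency_ms"]}
            for name, r in algorithm_results.items()
        ),
        # Assumption: highest confidence first, lower latency breaks ties
        key=lambda m: (-m["confidence"], m["latency_ms"]),
    )
    best = ranking[0]
    return {
        "method_ranking": ranking,
        "winning_method": best["method"],
        "winning_confidence": best["confidence"],
        "winning_latency_ms": best["latency_ms"],
    }


# Illustrative values only
algorithm_results = {
    "canonical": {"confidence": 0.95, "latency_ms": 0.123},
    "regex": {"confidence": 0.87, "latency_ms": 2.456},
    "levenshtein": {"confidence": 0.76, "latency_ms": 1.789},
    "bert": {"confidence": 0.92, "latency_ms": 45.234},
    "gemma": {"confidence": 0.88, "latency_ms": 120.567},
}
summary = rank_methods(algorithm_results)
print([m["method"] for m in summary["method_ranking"]])
```

Deriving these fields at write time keeps them consistent with algorithm_results by construction.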

6.2 Detection Method Information

{
  "detection_method": "schema_comparer",
  "detection_algorithm": "classify_drift",
  "detection_algorithm_version": "2.0",
  "detected_anomalies": [
    {
      "type": "value_contradiction",
      "field_affected": "price",
      "confidence": 1.0,
      "detection_latency_ms": 0.456
    }
  ]
}

6.3 Repair Pathway Tracing

{
  "repair_decision_logic": {
    "fallback_triggered": false,
    "fallback_reason": null,
    "best_confidence_threshold": 0.5,
    "best_confidence_achieved": 0.95,
    "method_selection_rationale": "canonical matcher confidence (0.95) >= threshold (0.5)"
  }
}
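A record of this shape could be produced at the point where the fallback decision is made. The following sketch assumes the 0.5 threshold described in this audit; build_repair_decision and the wording of the reason and rationale strings are hypothetical.

```python
def build_repair_decision(best_confidence: float, threshold: float = 0.5) -> dict:
    """Build a repair_decision_logic block for one drift event."""
    fallback = best_confidence < threshold
    relation = "<" if fallback else ">="
    return {
        "repair_decision_logic": {
            "fallback_triggered": fallback,
            "fallback_reason": "best confidence below threshold" if fallback else None,
            "best_confidence_threshold": threshold,
            "best_confidence_achieved": best_confidence,
            "method_selection_rationale": (
                f"best confidence ({best_confidence}) {relation} threshold ({threshold})"
            ),
        }
    }


decision = build_repair_decision(0.95)
print(decision["repair_decision_logic"]["method_selection_rationale"])
```

Recording the threshold next to the achieved confidence lets a reader re-derive the decision even if the threshold changes between runs.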

Summary Table: Method Logging Status

| Component | Explicitly Logged | Implicitly Inferrable | Missing Info |
|---|---|---|---|
| Drift Detection | ❌ NO | ❌ NO | Per-method detection results, detection algorithm name, detection confidence |
| Canonical Matcher | ⚠️ PARTIAL | ✅ YES (when `fallback_used=false`) | Canonical match result, confidence, latency |
| Regex Reconciler | ❌ NO | ⚠️ Partial (only via latency > 0) | Match, confidence, latency, success flag |
| Levenshtein Reconciler | ❌ NO | ⚠️ Partial (only via latency > 0) | Match, confidence, latency, success flag |
| BERT Reconciler | ❌ NO | ⚠️ Partial (only via latency > 0) | Match, confidence, latency, success flag |
| Gemma LLM Reconciler | ❌ NO | ⚠️ Partial (only via latency > 0) | Match, confidence, latency, success flag |
| Fallback Logic | ✅ YES | ✅ YES | Fallback decision rationale, which method triggered fallback |
| Method Ranking | ❌ NO | ❌ NO | Ranked list of methods by confidence/latency |
| Repair Success | ✅ YES (`repair_rate`) | ✅ YES | Per-method success, repair confidence |

Recommendations

Immediate Actions (Priority 1)

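Persist algorithm_results to JSON logs
- The data is already computed in SchemaComparer.process()
- Add the field to the evaluation JSON output before saving
- Include match, confidence, and latency for all 5 methods; compare_algorithms as shown in Section 2 returns only the four non-canonical reconcilers, so the canonical result needs to be added alongside them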

Add explicit method winner field

"method_statistics": {
  "winning_method": "canonical",
  "winning_confidence": 0.95,
  "winning_latency_ms": 0.123,
  "runner_up_method": "bert",
  "runner_up_confidence": 0.92,
  "fallback_triggered": false
}

Log per-method repair success

"per_method_results": {
  "canonical": {"match": "...", "confidence": 0.95, "repair_success": true},
  "regex": {"match": "...", "confidence": 0.87, "repair_success": true},
  ...
}

Medium-term Actions (Priority 2)

Add drift detection method information
- Log the detection algorithm used (currently only classify_drift)
- Log per-anomaly detection metadata

Enhance fallback logging
- Log the condition that triggered fallback (confidence < 0.5)
- Log which method triggered the fallback
- Log the fallback decision rationale

Add method ranking field

"method_scores": [
  {"rank": 1, "method": "canonical", "confidence": 0.95},
  {"rank": 2, "method": "bert", "confidence": 0.92},
  ...
]

Long-term Actions (Priority 3)

Standardize drift metadata schema
- Define a consistent schema for per-event drift information
- Include original vs. mutated field values
- Include transformation pathway (which chaos injector created this drift)

Create unified telemetry format
- Single schema for chaos, detection, and repair events
- Traceability from injection → detection → repair
- Correlation IDs for end-to-end tracing

Add audit trail
- Immutable log of method decisions
- Timestamps for each processing stage
- Version information for algorithms
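The correlation-ID idea in the unified telemetry recommendation can be illustrated with a short sketch. Everything here is an assumption for illustration: the correlation_id and stage field names, the new_event helper, and the payload contents are not part of the current pipeline.

```python
import uuid
from datetime import datetime, timezone


def new_event(stage: str, correlation_id: str, payload: dict) -> dict:
    """Create one telemetry event tagged with a shared correlation ID."""
    return {
        "correlation_id": correlation_id,
        "stage": stage,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        **payload,
    }


# One ID is generated at injection time and reused by every later stage
cid = str(uuid.uuid4())
trace = [
    new_event("injection", cid, {"drift_type": "value_contradiction"}),
    new_event("detection", cid, {"drift_detected": True}),
    new_event("repair", cid, {"winning_method": "canonical"}),
]
print([e["stage"] for e in trace])
```

With a shared ID, injection rows and evaluation records can be joined directly instead of being matched by run number and timestamp.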

Compliance with Logic Constraints

The pipeline implements Logic Constraint 4 (Semantic Repair Pathway) by inferring method from latency fields:

# From parse_raw_results.py
if fallback_used:
    if averages.get('gemma_latency', 0) > 0:
        return 'Gemma-4 E4B LLM Reconciler'
    if averages.get('bert_latency', 0) > 0:
        return 'BERT Semantic Embedding (all-MiniLM)'
    # ... etc

Status: ⚠️ FRAGILE — Depends on latency > 0 to infer method, but all latencies are currently 0.

Conclusion

Direct Answer to Audit Questions

Does evaluation JSON contain ANY field identifying the reconciliation method?
- ❌ NO: No explicit method identification field
- ⚠️ PARTIAL: reconciliation_winner shows "canonical" but doesn't capture other evaluated methods
- ❌ NO: algorithm_results data is not persisted

Can the method be inferred indirectly?
- ⚠️ PARTIALLY:
  - When fallback_used=false → canonical method (100% confidence)
  - When fallback_used=true AND latency > 0 → can infer method (but all latencies are 0)
  - ❌ Cannot infer which methods were evaluated but not selected

Is fallback_used or reconciliation_winner sufficient?
- ⚠️ PARTIALLY:
  - reconciliation_winner="canonical" + fallback_used=false → canonical method used
  - ⚠️ But: Doesn't reveal competing methods, doesn't reveal confidence scores
  - ❌ Cannot reconstruct the full decision pathway

Any per-event drift metadata?
- ✅ YES: original_field, mutated_field, metadata in CSV
- ✅ YES: chaos_metadata.drift_type (what was injected)
- ❌ NO: No per-event repair method information
- ❌ NO: No per-event detection method

Can CSV be joined to evaluation logs to infer method?
- ❌ NO:
  - CSV contains chaos injection metadata only
  - No reconciliation method field in CSV
  - No correlation ID to join injection → repair pathway

Overall Assessment

CRITICAL GAP: The evaluation logs are method-agnostic at the telemetry level. The pipeline code supports five reconciliation methods, but only the final result (reconciliation_winner="canonical") is logged. Any per-method comparison data produced during processing is discarded before persistence, and the all-zero latency fields leave open whether the non-canonical methods ran at all.

For IEEE TKDE publication: This gap must be addressed for full reproducibility and transparency. Reviewers may ask: "How were all 5 methods evaluated? What were their relative performance metrics? Why was canonical always selected?" The current logs cannot answer these questions.

Report Generated: May 28, 2026
Auditor: Semantic Drift Research Group
Classification: Internal Technical Audit
