Reproducibility

Every iFixAi run writes a content-addressed manifest that captures every input. With a deterministic provider that returns a recorded response table, the manifest reproduces the scorecard byte-for-byte (modulo masked timestamps); against live providers, it records exactly which inputs were used, so you can see whether two runs are actually comparable.

Three artifacts per run

By default these are split across two directories: the manifest and per-test transcripts go to runs/<run_id>/ (override with --reliability-out); the scorecard, in JSON and Markdown, goes to ./ifixai-results/ (override with --output / -o).

text
# --reliability-out (default: runs)
runs/r-8c4f2e1d/
  manifest.json     # every input, content-addressed for replay
  transcripts/
    B01.jsonl       # raw prompts & responses for B01
    B02.jsonl       # …
    B32.jsonl

# --output / -o (default: ./ifixai-results)
ifixai-results/
  scorecard.json    # scores, grades, per-judge attribution, warnings[]
  scorecard.md      # same content, Markdown for humans

manifest.json

Captures every input. Every field that could possibly change the outcome is recorded. A manifest is a signed statement: "these are the inputs; this is the version; this is the output."

The shape below is illustrative. The authoritative source is the file emitted by your own run — capture it with:

bash
ifixai run --provider openai --model gpt-4o-mini --strategic \
  --output /tmp/ifixai-out \
  --reliability-out /tmp/ifixai-out/runs
cat /tmp/ifixai-out/runs/*/manifest.json
manifest.json, illustrative shape
{
  "run_id": "r-8c4f2e1d",
  "generated_at": "2026-04-27T14:32:00Z",
  "ifixai_version": "1.0.0",
  "mode": "standard",
  "eval_mode": "auto",

  "provider": {
    "name": "openai",
    "model": "gpt-4o",
    "sdk_version": "1.40.0"
  },

  "fixture": {
    "name": "default",
    "path": null,
    "hash": "sha256:7d19a1c…"
  },

  "judge_config": null,

  "test_corpus": {
    "b12_injection_corpus": "v1:sha256:a4d8f…"
  },

  "rubrics": {
    "b10_rubric": "sha256:2e9b1…",
    "b15_rubric": "sha256:c7f32…"
  },

  "strategic_set": ["B01","B02","B03","B04","B05","B06","B07","B25"],
  "mandatory_minimums": { "B01": 1.0, "B08": 0.95 },

  "seeds": {
    "sut_seed": null,
    "b12_seed": 20260422,
    "b14_seed": 20260422,
    "b30_seed": 20260422
  },
  "sut_temperature": 0.0,

  "ifixai_cli_args": [
    "--provider", "openai", "--model", "gpt-4o"
  ]
}
★
Content addressing
Every YAML, rubric, and test corpus is SHA-256 hashed and the hash is recorded in the manifest. Future-you, or an external auditor, hands the manifest to a new environment, pins the iFixAi version, and re-runs. If a rubric YAML drifted, the hash mismatch surfaces immediately, so the run isn't silently comparing apples to oranges.
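
The recorded hashes can be checked against the files on disk with standard tools, independently of iFixAi. A minimal sketch using jq and sha256sum, assuming a run that used a custom fixture file (path/to/fixture.yaml is a placeholder; the bundled default fixture records path: null):

bash
manifest=runs/r-8c4f2e1d/manifest.json
recorded=$(jq -r '.fixture.hash' "$manifest")
actual="sha256:$(sha256sum path/to/fixture.yaml | cut -d' ' -f1)"
[ "$recorded" = "$actual" ] && echo "fixture matches" || echo "fixture drifted"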

scorecard.json

Contains everything a human or a dashboard needs to interpret the run:

  • Overall: overall_score, grade, mandatory_minimum_violations.
  • Per category: score, weight, list of tests with individual scores.
  • Per test: score, threshold, passed, evidence count, scoring mode, per-trial details where applicable.
  • Per judge (Full mode): every vote from every judge, including dissenting votes and errors, under verdict.per_judge[].
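
For a dashboard or a quick shell check, the headline fields can be pulled straight out of the JSON. A minimal sketch with jq, assuming overall_score, grade, and mandatory_minimum_violations are top-level keys:

bash
jq '{overall_score, grade, mandatory_minimum_violations}' ifixai-results/scorecard.json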

transcripts/

One JSONL file per test with the raw prompt/response pairs the test sent and received. Essential for debugging an unexpected score: the scorecard tells you what; the transcript tells you why.

transcripts/B12.jsonl, one line per test call
{"test": "B12","payload_id": "direct_override_3","prompt": "…","response": "…","judge_verdict": "fail","rubric_dimensions":{"injection_recognition":0,"no_compliance":0}}

What replay does and does not promise

Live LLM APIs are non-deterministic, so two runs against the same live provider with the same inputs can produce different outputs. Live-provider replay is therefore not bit-identical. With a deterministic provider that returns a recorded response table, re-running against the same manifest reproduces the scorecard byte-for-byte, modulo a small set of masked timestamps.

What the manifest does guarantee for live runs: every input iFixAi controls is recorded, including provider, model, fixture digest, rubric hashes, test-corpus hashes, the four seeds (sut_seed, b12_seed, b14_seed, b30_seed), and sut_temperature. To replay, re-run with the same flags, pinned to the values recorded in the manifest. There is no single --from-manifest shortcut today; a dedicated replay CLI is planned for a future minor version.
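
Until that lands, the recorded ifixai_cli_args can be read back out of the manifest and handed to a fresh invocation. A hedged sketch; it assumes the recorded args capture everything you passed on the command line (anything supplied via environment or config has to be re-supplied by hand), and that iFixAi itself is pinned to the recorded ifixai_version:

bash
manifest=runs/r-8c4f2e1d/manifest.json
jq -r '.ifixai_version' "$manifest"                 # pin the tool to this version first
mapfile -t args < <(jq -r '.ifixai_cli_args[]' "$manifest")
ifixai run "${args[@]}"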

Keep runs/ out of source control

The runs/ directory grows fast; every run adds a manifest and one transcript file per test. Ignore it in .gitignore and keep only the manifests you need for audit in a long-term store.

.gitignore
runs/
!runs/.gitkeep
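
A small archival step before cleanup keeps the audit trail without the bulk. A sketch assuming an archive/ directory of your choosing; only the manifests are copied, and the transcripts are left to be pruned along with runs/:

bash
mkdir -p archive
for m in runs/*/manifest.json; do
  cp "$m" "archive/$(basename "$(dirname "$m")")-manifest.json"
done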

Comparing runs

bash
ifixai compare \
  runs/r-baseline/scorecard.json \
  runs/r-candidate/scorecard.json

Output is a vendor-neutral diff: overall delta, per-category deltas, and per-test score changes with pass/fail transitions highlighted. Useful for regression review on a PR, or for quarterly compliance reports.
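
If you only need the headline number, for example in a CI log, the delta can also be computed straight from the two scorecards. A sketch with jq, assuming overall_score is a top-level numeric field:

bash
jq -n \
  --slurpfile base runs/r-baseline/scorecard.json \
  --slurpfile cand runs/r-candidate/scorecard.json \
  '{overall_delta: ($cand[0].overall_score - $base[0].overall_score)}'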

✓
Reproducibility is a first-class test
B22, Decision Stability, measures cross-trial agreement directly: the same input is run N times and the verdicts are compared. If your model produces the same decision for the same input every time, B22 scores 1.0. If it wanders, the test tells you by how much.
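
The idea is easy to illustrate: run the same input N times and take the share of trials that landed on the modal verdict. A toy sketch of that arithmetic (not the B22 scoring code), over five hypothetical verdicts:

bash
# 4 of 5 trials say "approve" -> agreement 0.80
printf '%s\n' approve approve approve deny approve |
  sort | uniq -c | sort -rn |
  awk 'NR == 1 { top = $1 } { n += $1 } END { printf "agreement %.2f\n", top / n }'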