Reproducibility
Every iFixAi run writes a content-addressed manifest that captures every input. With a deterministic provider that returns a recorded response table, the manifest reproduces the scorecard byte-for-byte (modulo masked timestamps); against live providers, it pins down exactly which inputs were used, so two runs are directly comparable.
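As a rough sketch of the content-addressing idea (not iFixAi's actual schema or hashing scheme), the manifest's identity is a digest of its canonicalized contents, so changing any recorded input changes the address:

```python
import hashlib
import json

# Illustrative field names only; iFixAi's real manifest schema may differ.
inputs = {
    "provider": "example-provider",
    "model": "example-model",
    "sut_seed": 1, "b12_seed": 2, "b14_seed": 3, "b30_seed": 4,
    "sut_temperature": 0.0,
}
# Canonicalize (sorted keys, fixed separators) so the digest is stable.
canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
print(digest)  # identical inputs always yield the same address
```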
Three artifacts per run
By default these are written into two directories: the manifest and per-test transcripts go to runs/<run_id>/ (override with --reliability-out); the human-readable scorecard goes to ./ifixai-results/ (override with --output / -o).
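Assuming the default output locations (adjust if you passed --reliability-out or --output / -o), a sketch for locating the newest run's artifacts:

```python
from pathlib import Path

# Pick the most recently written run directory under the default runs/ root.
latest = max(Path("runs").iterdir(), key=lambda p: p.stat().st_mtime)
manifest = latest / "manifest.json"      # every recorded input
transcripts = latest / "transcripts"     # one JSONL file per test
scorecards = Path("ifixai-results")      # human-readable scorecard lives here
print(manifest, transcripts, scorecards, sep="\n")
```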
manifest.json
Captures every input. Every field that could possibly change the outcome is recorded. A manifest is a signed statement: "these are the inputs; this is the version; this is the output."
The shape below is illustrative; the authoritative source is the file emitted by your own run.
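A minimal sketch of capturing it, with the kind of fields to expect noted in comments (<run_id> is a placeholder; exact keys come from your own file):

```python
import json
from pathlib import Path

# Pretty-print the manifest your own run emitted; "<run_id>" is a placeholder.
manifest = json.loads(Path("runs/<run_id>/manifest.json").read_text())
print(json.dumps(manifest, indent=2, sort_keys=True))

# Expect recorded inputs along these lines (illustrative, not the schema):
# provider, model, fixture digest, rubric hashes, test-corpus hashes,
# sut_seed, b12_seed, b14_seed, b30_seed, sut_temperature.
```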
scorecard.json
Contains everything a human or a dashboard needs to interpret the run (a reading sketch follows this list):
- Overall: overall_score, grade, mandatory_minimum_violations.
- Per category: score, weight, and the list of tests with individual scores.
- Per test: score, threshold, passed, evidence count, scoring mode, and per-trial details where applicable.
- Per judge (Full mode): every vote from every judge, including losers and errors, under verdict.per_judge[].
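A sketch of reading those fields, under an assumed nesting (key names follow the list above; check a scorecard your own run emitted for the real structure):

```python
import json

# Hypothetical path and nesting; key names follow the list above.
card = json.load(open("ifixai-results/scorecard.json"))

print(card["overall_score"], card["grade"], card["mandatory_minimum_violations"])
for category in card["categories"]:          # "categories"/"tests" nesting assumed
    print(category["score"], category["weight"])
    for test in category["tests"]:
        print(test["score"], test["threshold"], test["passed"])
```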
transcripts/
One JSONL file per test with the raw prompt/response pairs the test sent and received. Essential for debugging an unexpected score: the scorecard tells you what happened; the transcript tells you why.
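To trace a surprising score back to its raw exchanges, something like this works (<run_id> and <test_id> are placeholders, and the record keys are assumptions, so inspect a real line first):

```python
import json
from pathlib import Path

# Walk one test's transcript, one raw prompt/response pair per line.
path = Path("runs/<run_id>/transcripts/<test_id>.jsonl")
for line in path.read_text().splitlines():
    record = json.loads(line)
    # "prompt"/"response" are assumed keys; check a real record for the schema.
    print(record.get("prompt"), "->", record.get("response"))
```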
What replay does and does not promise
Live LLM APIs are non-deterministic, so two runs against the same live provider with the same inputs can produce different outputs; live-provider replay is therefore not bit-identical. With a deterministic provider that returns a recorded response table, re-running against the same manifest reproduces the scorecard byte-for-byte, modulo a small set of masked timestamps.
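A deterministic provider in this sense can be as simple as a lookup table keyed by a digest of the prompt. This is a sketch of the idea, not iFixAi's actual provider interface:

```python
import hashlib
import json

class RecordedProvider:
    """Answer every prompt from a recorded response table, keyed by a digest
    of the prompt text, so replays of the same inputs are byte-identical."""

    def __init__(self, table_path: str):
        with open(table_path) as f:
            self.table = json.load(f)  # {prompt_digest: recorded_response}

    def complete(self, prompt: str) -> str:
        digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        return self.table[digest]  # KeyError means an unrecorded input
```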
What the manifest does guarantee for live runs: every input iFixAi controls is recorded, including provider, model, fixture digest, rubric hashes, test-corpus hashes, the four seeds (sut_seed, b12_seed, b14_seed, b30_seed), and sut_temperature. To replay, pass the same values back as pinned flags; there is no single --from-manifest shortcut today, though a dedicated replay CLI is planned for a future minor version.
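Until that replay CLI exists, one way to pull the pinned values out of a manifest for the next invocation (the path is a placeholder, and the top-level keys are assumptions until checked against your own file):

```python
import json

# Read back the recorded inputs so they can be pinned on the next run.
manifest = json.load(open("runs/<run_id>/manifest.json"))  # placeholder path
for key in ("provider", "model", "sut_seed", "b12_seed",
            "b14_seed", "b30_seed", "sut_temperature"):
    print(f"{key} = {manifest[key]}")  # flat key layout is an assumption
```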
Keep runs/ out of source control
The runs/ directory grows fast because every test run writes a transcript file. Ignore it in .gitignore and keep only the manifests you need for audit in long-term storage.
Comparing runs
Output is a vendor-neutral diff: overall delta, per-category deltas, and per-test score changes with pass/fail transitions highlighted. Useful for regression review on a pull request or for quarterly compliance reports.
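A minimal sketch of such a diff over two scorecard files (the filenames, the "name" key, and the nesting are all assumptions):

```python
import json

# Compare two scorecards; paths and structure are hypothetical.
old = json.load(open("ifixai-results/scorecard-old.json"))
new = json.load(open("ifixai-results/scorecard-new.json"))

print("overall delta:", new["overall_score"] - old["overall_score"])
old_tests = {t["name"]: t for c in old["categories"] for t in c["tests"]}
for category in new["categories"]:
    for test in category["tests"]:
        before = old_tests.get(test["name"])
        if before and before["passed"] != test["passed"]:
            # Highlight pass/fail transitions between the two runs.
            print("transition:", test["name"], before["passed"], "->", test["passed"])
```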