Standard vs Full
Two run modes trade setup cost against defensibility. Standard is zero-config and fast. Full is hand-authored, multi-judge, and leaderboard-grade. Pick the right one up front, and never publish Standard scores as a cross-vendor comparison.
At a glance

| | Standard | Full |
| --- | --- | --- |
| Setup | Zero-config | Hand-authored fixture |
| Judging | One cross-provider judge; same-model only by explicit opt-in | ≥2 distinct providers, majority ensemble |
| Defensible for | CI gates, within-model comparisons | Cross-vendor leaderboards, audits, procurement |
Standard mode: the default
Standard mode is designed for the low-config user experience: a grade in under five minutes. The first thing to understand is why iFixAi cares which provider grades the run.
The self-judge bias
Several of the 32 tests grade free-form responses with a second LLM (the judge). A model grading its own output is statistically more lenient than a neutral third party grading the same output. Standard mode therefore prefers a different provider as judge, and refuses to silently fall back to a same-model judge.
What that means in practice
When ≥2 distinct provider credentials are available, the run auto-pairs cross-provider (system under test = A, judge = B). With only one credential, the run refuses with a clear message unless you pass `--eval-mode self`, which is an explicit opt-in that stamps a bias warning onto the scorecard. The opt-in is acceptable for CI (signal is directional and comparison is within the same model-version) but not for leaderboard submissions.
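To make the pairing rule concrete, here is a minimal Python sketch of the policy described above. All names (`Credential`, `pick_judge`, `SingleProviderError`) are illustrative, not iFixAi's actual internals:

```python
from dataclasses import dataclass

# Hypothetical credential record; the real config schema may differ.
@dataclass
class Credential:
    provider: str   # e.g. "anthropic", "openai"
    api_key: str

class SingleProviderError(RuntimeError):
    pass

def pick_judge(creds: list[Credential], system_under_test: str,
               eval_mode: str = "cross") -> tuple[Credential, bool]:
    """Return (judge credential, bias_warning_flag).

    Prefers any provider other than the system under test. With only
    one provider available, refuses unless the caller explicitly opted
    into self-judging, in which case the scorecard gets a bias warning.
    """
    others = [c for c in creds if c.provider != system_under_test]
    if others:
        return others[0], False   # cross-provider pair: no warning
    if eval_mode == "self":
        return creds[0], True     # explicit opt-in: stamp bias warning
    raise SingleProviderError(
        "Only one provider credential found. Add a second provider, "
        "or pass --eval-mode self to accept a same-model judge."
    )
```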
Full mode: leaderboard-grade
Full mode requires a hand-authored fixture and ≥2 distinct judge providers. The judges run as a simple-majority ensemble, break ties conservatively (fail > partial > pass), continue on surviving judges when one errors, and record per-judge attribution in the scorecard JSON so an auditor can inspect every vote; a code sketch of the aggregation follows the rules below.
Ensemble aggregation
- Majority vote: the verdict shared by the plurality of judges wins.
- Conservative tie-break: fail > partial > pass. When judges split evenly, the safer verdict prevails.
- Error tolerance: if a judge errors, the remaining judges proceed. Only a zero-verdict outcome is reported as inconclusive.
- Per-judge attribution: every vote (including losers and errors) is recorded under `verdict.per_judge[]` in the scorecard.
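Under the same caveat (illustrative names, not iFixAi's shipped code), the aggregation rules above could be sketched as:

```python
from collections import Counter

# Conservative ordering: lower rank = safer verdict, wins ties.
SEVERITY = {"fail": 0, "partial": 1, "pass": 2}

def aggregate(votes: list[dict]) -> dict:
    """votes: [{"judge": ..., "verdict": "pass"|"partial"|"fail"|None}, ...]

    Returns a verdict record with per-judge attribution, mirroring the
    verdict.per_judge[] shape described above (field names assumed).
    """
    surviving = [v["verdict"] for v in votes if v.get("verdict")]
    if not surviving:  # zero-verdict outcome: every judge errored
        return {"verdict": "inconclusive", "per_judge": votes}
    counts = Counter(surviving)
    top = max(counts.values())
    tied = [v for v, n in counts.items() if n == top]
    # Plurality wins; even splits resolve to the safest verdict.
    verdict = min(tied, key=SEVERITY.__getitem__)
    return {"verdict": verdict, "per_judge": votes}
```

For example, a two-judge split of `pass` vs `fail` resolves to `fail` under the conservative tie-break, while a lone surviving judge's verdict stands on its own.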
Combining `--mode full` with `--eval-mode self` gets a clear CLI error. The two are incompatible by design: Full mode's premise is that no model judges itself.
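A hypothetical sketch of that guard, mirroring the documented CLI behavior rather than real source:

```python
def validate_flags(mode: str, eval_mode: str) -> None:
    # Illustrative only: Full mode never accepts a self-judge.
    if mode == "full" and eval_mode == "self":
        raise SystemExit(
            "--mode full is incompatible with --eval-mode self: "
            "Full mode's premise is that no model judges itself."
        )
```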
Choosing a mode
- Standard → Nightly CI regression gates
- Standard → A/B testing prompts within a single model
- Standard → Quick sanity check after a release
- Standard → The first 60 seconds of adoption
- Full → Cross-vendor leaderboard submissions
- Full → Regulatory filings & third-party audits
- Full → Procurement evaluations
- Full → Anywhere a score needs to survive scrutiny
Next steps
Ready to set up a domain fixture? See the fixtures guide. Ready to wire Full mode into CI? See the CLI reference.