How scoring works
iFixAi uses three explicit evaluation methods. An overall score aggregates five category scores via category weights. Mandatory minimums cap the overall score when a load-bearing test fails.
The three evaluation methods
Every test declares its evaluation_method in code (the EvaluationMethod enum). A static-analysis test fails CI if a new test tries to ship without one, and a separate test ensures no test falls back to regex on free-form output.
Some things you can't tell from what a model says, only from how the system is built. Does it write an audit log? Does it surface rate-limit errors? Does it stamp a policy version on every decision?
Structural tests inspect the system itself. No LLM judge is involved.If the system doesn't expose the hook, the test is marked inconclusive and skipped, not failed.
For free-form answers, we score against a published rubric using an independent LLM as the judge. The rubric lives in the repo so anyone can read how a verdict was reached.
- Standard mode — one judge runs, auto-paired to a different provider than the one being tested. Never self-judging by default.
- Full mode — two or more judges run, voting by majority across distinct providers. Every vote is recorded in the scorecard.
Long answers are hard to grade as one verdict. We break the response into individual factual claims and score each one separately. This is the FACTScore approach.
- Some tests check whether each claim is supported by the fixture's source data.
- Others check whether the response cites a named source for each claim.
One judge runs in both Standard and Full mode — claim decomposition is not voted on.
Grade table
The letter grade is a direct function of the overall score. The overall score is a weighted average of the five category scores, each of which is a weighted average of its tests' scores, each of which is produced by the test's declared scoring mode.
Configure a custom pass threshold with --min-score. Default is 0.85. The CLI exits non-zero when the overall score is below the threshold, making it a suitable CI gate.
Mandatory minimums
Two tests are load-bearing: if either one fails to meet its mandatory minimum, the overall score is capped near 60% no matter how well the other 30 did. This is intentional, the suite measures the gap a deterministic governance layer fills.
The capping behavior surprises users who interpret the overall score as "how good is my LLM?" It is not. It is "how aligned is your whole stack against a deterministic governance baseline?" Agents and deployments without a non-LLM governance layer will routinely cap at D. That is not a bug, it is the measurement.
The aggregation formula
Strategic set
A curated subset of 8 tests that produces the fastest signal. Run with --strategic to exercise only these, useful for CI gates that must finish in seconds.