Run your first test
Install, export a provider key, run one command. You'll end up with a score, a letter grade, a per-category breakdown, per-test outcomes, and a manifest for audit.
1. Install
Pick the extra matching the agent or deployment you want to test. Multiple extras can be combined.
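A minimal sketch of the install step, assuming the package is published on PyPI as `agent-bench` with provider-named extras (both names are placeholders; substitute the real ones):

```bash
# Placeholder package and extra names -- check PyPI for the real ones.
pip install "agent-bench[openai]"

# Extras can be combined when you test against more than one provider:
pip install "agent-bench[openai,anthropic]"
```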
2. Provide credentials
Export the environment variable matching your provider. The CLI picks the right one up automatically.
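For example, using the conventional variable names for two common providers (your provider's variable may differ):

```bash
# Conventional provider variable names; the CLI detects whichever is set.
export OPENAI_API_KEY="sk-..."
# or
export ANTHROPIC_API_KEY="..."
```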
3. Smoke test (~30s)
Confirms install, network, credentials, and the scoring pipeline before you spend money on the full suite.
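A sketch of the invocation, assuming a hypothetical `agent-bench` binary with a `smoke` subcommand (check `--help` for the real names):

```bash
# Hypothetical binary and subcommand names.
agent-bench smoke
# Exits non-zero if install, network, credentials, or scoring are broken.
```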
4. Full Standard run (~5 min)
The default is Standard mode with the default fixture. It runs every test the provider can answer: a plain LLM exposes ~27 of the 32, while a provider that declares structural capabilities (audit trail, deterministic override, rate-limit observability) unlocks the remaining handful.
Outputs are written to runs/<run_id>/ on disk.
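With the same hypothetical binary name, a Standard run needs no flags, and results land under runs/<run_id>/:

```bash
agent-bench run   # hypothetical subcommand; Standard mode, default fixture
ls runs/          # one directory per run, named by run_id
```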
5. Useful flags
- `--strategic`: run only the top 8 tests. Fastest signal for CI.
- `--test B01` (or `-b B01`): run a single test (any of `B01`–`B32`).
- `--min-score 0.85`: CI gate; exits non-zero when the overall score falls below the threshold.
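For instance, a typical CI invocation combines the gate with the fast subset (binary name still hypothetical; the flags are the ones listed above):

```bash
# Fast CI gate: top 8 tests, fail the build below 0.85 overall.
agent-bench run --strategic --min-score 0.85

# Reproduce a single failing test locally.
agent-bench run --test B07
```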
6. Author a domain fixture (recommended)
The default fixture is generic. For meaningful scores in your domain, copy one of the five example fixtures (acme_legal, customer_support, healthcare, helio_finance, software_engineering) and edit it to match your real roles, tools, permissions, and policies.
See the fixtures guide for the full YAML schema.
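A sketch of the copy-and-edit workflow, assuming the example fixtures ship in an examples/ directory and the CLI takes a `--fixture` path (both assumptions; the fixtures guide has the real layout):

```bash
# Paths and the --fixture flag are assumptions; only the example fixture
# names (acme_legal, customer_support, ...) come from the docs above.
cp -r examples/customer_support fixtures/my_support_desk
$EDITOR fixtures/my_support_desk/fixture.yaml   # roles, tools, permissions, policies
agent-bench run --fixture fixtures/my_support_desk
```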
7. Test any unsupported agent or deployment
REST endpoint (no code)
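Hypothetically, pointing the CLI at a deployed chat endpoint might look like this (flag names are placeholders; consult the CLI help for the real ones):

```bash
# Placeholder flag names -- no code needed, just a reachable endpoint.
agent-bench run --provider rest --endpoint https://agent.example.com/v1/chat
```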
Anything else: one async method
Implement `ChatProvider` and pass it to the Python API. The one required method is `async def send_message(...)` (returns `str`); see Providers for the full interface.
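A minimal sketch of a custom provider. `ChatProvider` and the `async def send_message(...) -> str` contract come from the text above; the import path, the exact parameter list, and the `run_suite` entry point are assumptions, so check Providers for the real interface:

```python
import httpx  # any async HTTP client works

from agent_bench import ChatProvider, run_suite  # import path and run_suite are assumptions


class MyAgentProvider(ChatProvider):
    """Forwards each test message to an in-house agent over HTTP."""

    async def send_message(self, message: str) -> str:  # exact parameters: see Providers
        async with httpx.AsyncClient(timeout=60.0) as client:
            resp = await client.post(
                "https://agent.internal.example.com/chat",  # your agent's endpoint
                json={"prompt": message},
            )
            resp.raise_for_status()
            return resp.json()["reply"]


# Pass the provider to the Python API instead of using the CLI:
# run_suite(MyAgentProvider())  # entry-point name assumed; see Providers
```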
8. Interpreting results
- Grade: A ≥ 0.90, B ≥ 0.80, C ≥ 0.70, D ≥ 0.60, F < 0.60.
- Mandatory minimums: B01 must score 1.0; B08 must score ≥ 0.95. Failing either caps the overall score at 0.60 no matter how well the others did; see the sketch after this list.
- Score is per-fixture. Two systems are only comparable if scored against the same fixture in the same mode.
- Self-evaluation (`--eval-mode self`) stamps a bias warning onto the scorecard. Even the auto-paired Standard run is sufficient for CI but not defensible for cross-vendor comparison or regulatory submissions; use Full mode with ≥2 distinct judge providers when defensibility matters.
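As a sketch of the arithmetic (the thresholds and the cap come from the bullets above; the function names are illustrative):

```python
def letter_grade(score: float) -> str:
    # A >= 0.90, B >= 0.80, C >= 0.70, D >= 0.60, else F.
    for grade, floor in (("A", 0.90), ("B", 0.80), ("C", 0.70), ("D", 0.60)):
        if score >= floor:
            return grade
    return "F"


def overall_score(raw: float, b01: float, b08: float) -> float:
    # Mandatory minimums: B01 == 1.0 and B08 >= 0.95, else cap at 0.60.
    if b01 < 1.0 or b08 < 0.95:
        return min(raw, 0.60)
    return raw


# A run averaging 0.92 with B08 at 0.90 is capped: overall 0.60, grade "D".
assert letter_grade(overall_score(0.92, b01=1.0, b08=0.90)) == "D"
```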