Validated
Response-level hallucination scoring — the 5-tier engine, SDK guard, FastAPI middleware, REST server, injection detection, and the agent/MCP preflight guard.
Benchmarks
Director-AI's production-validated metric is response-level hallucination scoring on LLM-AggreFact. Numbers below are committed measurements; the streaming contradiction halt is opt-in and evidence-bound, not a sole production gate.
Accuracy vs latency
The scorer escalates only as far as a claim needs. The cheapest tier settles the easy cases in under 0.5 ms; NLI handles what is still uncertain.
| Scoring tier | Balanced accuracy | Latency | Notes |
|---|---|---|---|
| NLI (cross-encoder) | 75.8% | 14.6 ms | Default production tier on LLM-AggreFact |
| NLI (larger model) | 77.4% | ~40 ms | Higher accuracy, higher cost |
| Embeddings | ~73% | ~15 ms | Semantic support from retrieved evidence |
| Heuristic / rules | ~55% | <0.5 ms | Model-free and essentially free to run; settles easy cases first |
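The escalation described above can be sketched as a cheapest-first cascade: each tier either returns a decisive score or passes the claim to the next tier. This is a minimal illustration and not Director-AI's actual API. The names `Tier`, `cascade`, `overlap_heuristic`, and `stub_nli`, and the threshold values, are all hypothetical, and the "NLI" tier here is a placeholder rather than a real model.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

# A tier returns a support score in [0, 1], or None when it cannot judge.
TierFn = Callable[[str, str], Optional[float]]


@dataclass
class Tier:
    name: str
    score: TierFn
    low: float   # score at or below this: decisively unsupported
    high: float  # score at or above this: decisively supported


def cascade(claim: str, evidence: str, tiers: Sequence[Tier]) -> Tuple[str, float]:
    """Run tiers cheapest-first and stop at the first decisive one.

    The final tier always decides, so every claim gets a score.
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    for index, tier in enumerate(tiers):
        is_last = index == len(tiers) - 1
        value = tier.score(claim, evidence)
        if value is None:
            if is_last:
                return tier.name, 0.5  # nothing could judge: report uncertainty
            continue
        if is_last or value <= tier.low or value >= tier.high:
            return tier.name, value
    raise AssertionError("unreachable: the last tier always returns")


def overlap_heuristic(claim: str, evidence: str) -> Optional[float]:
    """Hypothetical model-free tier: fraction of claim tokens found in the evidence."""
    claim_tokens = set(claim.lower().split())
    if not claim_tokens:
        return None
    return len(claim_tokens & set(evidence.lower().split())) / len(claim_tokens)


def stub_nli(claim: str, evidence: str) -> Optional[float]:
    """Placeholder for a real NLI model call; always reports 0.5 in this sketch."""
    return 0.5


# Thresholds are illustrative assumptions, not tuned values.
EXAMPLE_TIERS = [
    Tier("heuristic", overlap_heuristic, low=0.1, high=0.95),
    Tier("nli", stub_nli, low=0.0, high=1.0),
]

if __name__ == "__main__":
    # Identical text is settled by the heuristic tier alone.
    print(cascade("paris is the capital of france",
                  "paris is the capital of france", EXAMPLE_TIERS))
    # High but imperfect overlap is not decisive, so the claim escalates.
    print(cascade("the tower is 300 metres tall",
                  "the tower is 330 metres tall", EXAMPLE_TIERS))
```

The second call shows why the cheap tier cannot be the only gate: a near-identical sentence with one wrong number overlaps heavily with the evidence, so it falls between the thresholds and is handed to the more expensive tier.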
Rust acceleration
The Rust-accelerated path runs the hot NLI loop roughly 4x to 12x faster than the other model backends measured below; only the Transformers (CPU) baseline is a full order of magnitude slower.
| Backend | Latency per pair |
|---|---|
| Rust NLI | 17.9 ms/pair |
| PyTorch (CPU) | 80.1 ms/pair |
| ONNX | 118.9 ms/pair |
| Transformers (CPU) | 207.3 ms/pair |
| Heuristic | <0.5 ms/pair |
Backend names above are illustrative of the measured tiers; exact per-package figures live in the repository's benchmarks/results/. Reproduce with the committed benchmark scripts.
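As a quick check on how the backends compare, the ratios can be computed directly from the latency table. The sketch below only re-uses the figures listed above; the helper name `slowdown_vs` is an assumption made for illustration and is not part of the repository's benchmark scripts.

```python
from typing import Dict

# Latency figures copied from the table above (ms per pair).
LATENCY_MS_PER_PAIR: Dict[str, float] = {
    "Rust NLI": 17.9,
    "PyTorch (CPU)": 80.1,
    "ONNX": 118.9,
    "Transformers (CPU)": 207.3,
}


def slowdown_vs(baseline: str, latencies: Dict[str, float]) -> Dict[str, float]:
    """Return how many times slower each backend is than the baseline."""
    base = latencies[baseline]
    return {
        name: ms / base
        for name, ms in latencies.items()
        if name != baseline
    }


if __name__ == "__main__":
    for name, ratio in slowdown_vs("Rust NLI", LATENCY_MS_PER_PAIR).items():
        print(f"{name}: {ratio:.1f}x the latency of Rust NLI")
```

On the listed figures this gives about 4.5x for PyTorch (CPU), 6.6x for ONNX, and 11.6x for Transformers (CPU).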
Honest boundary
The streaming contradiction halt is opt-in and evidence-bound. Current local evidence for it is recorded in the repo, and the halt should not serve as the sole production gate.
Every number is a committed measurement. The benchmark scripts and result artefacts ship in the repository.