Why public benchmarks are the wrong measuring stick
Every LLM vendor markets with MMLU, GSM8K, HumanEval, and a handful of other curated benchmarks. They were useful when they were new, but they are close to useless for deciding whether a model is ready to put in front of an agency's users. MMLU does not measure whether a model grounds its answers in a specific corpus, refuses cleanly when information is missing, preserves classification markings in output, or holds up across a long context window of policy text. It measures whether a model can guess the right multiple-choice answer to an AP history question.
HELM from Stanford CRFM is broader and includes some robustness and fairness dimensions. BIG-bench is broader still. Both are valuable for vendor comparison at a gross level. Neither answers the question that matters inside a federal program: "will this system give a correct, cited, and defensible answer on the document my analyst just dropped into it?"
The eval harness we build inside federal programs starts with a premise: public benchmarks are a noisy filter for shortlisting candidate models. The real evaluation runs on the agency's data, against the agency's task definition, with scoring that matches how a domain expert would judge the output.
Task typology: the seven things federal LLMs actually do

Every agency pipeline we have built breaks into some subset of these task types. Eval is structured per type because the scoring function differs.
- Grounded QA. Answer a question from a retrieved passage, with citation. Scored on correctness, grounding, and citation validity.
- Extraction. Pull structured fields from a document (case number, dates, parties, amounts). Scored on field-level precision and recall.
- Summarization. Condense a document or a set of documents. Scored on faithfulness, coverage, and length conformance.
- Classification. Assign a label from a fixed taxonomy. Scored on precision, recall, F1 per class, and macro-F1.
- Redaction. Identify and mask PII or classified spans. Scored on precision and recall at the span level with an asymmetric cost — missed redactions are much worse than over-redactions.
- Translation / transformation. Convert text across languages, reading levels, or formats. Scored on faithfulness and target-format conformance.
- Agentic action. Tool use with side effects. Scored on task completion, tool-call validity, and safety constraints.
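The asymmetric cost called out for redaction is the scoring function most worth pinning down precisely. A minimal sketch, where spans are `(start, end)` tuples and the 10x weight on missed redactions is an illustrative assumption, not a number from this post:

```python
# Span-level redaction scoring with asymmetric cost.
# Spans are (start, end) character offsets; miss_weight is an assumed 10x.

def redaction_score(gold_spans, predicted_spans, miss_weight=10.0):
    """Precision/recall over exact spans, plus a weighted cost where a
    missed redaction counts miss_weight times an over-redaction."""
    gold = set(gold_spans)
    pred = set(predicted_spans)
    true_pos = len(gold & pred)
    misses = len(gold - pred)   # unredacted PII: the expensive failure
    extras = len(pred - gold)   # over-redaction: cheap by comparison
    precision = true_pos / len(pred) if pred else 1.0
    recall = true_pos / len(gold) if gold else 1.0
    cost = miss_weight * misses + extras
    return {"precision": precision, "recall": recall, "cost": cost}
```

Exact-span matching is deliberately strict; a production scorer might credit partial overlaps, but the asymmetry between misses and extras is the part that must survive any refinement.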
Golden sets that don't rot
A golden set is the labeled evaluation data — input queries paired with expected behaviors. It is the most valuable artifact on the program, and the one most likely to decay. The rules we enforce:
- Built from the live corpus, not synthesized. Analysts draft items from real documents and real questions, not from an LLM imagining what users might ask.
- Stratified by task type, difficulty, and sensitivity. A golden set dominated by easy grounded QA misleads. Include the hard, adversarial, and unanswerable cases deliberately.
- Versioned. Every item carries a version and a creation date. Items get deprecated when the underlying document changes or is superseded. You do not silently rewrite a golden item.
- Labeled by people qualified to label. A policy-QA golden set labeled by a contractor intern is not a policy-QA golden set. Use the SMEs who would read the production output.
- Kept small enough to run on every change, big enough to detect regression. Typical range: 300 to 1,500 items split across task types. A 30-item set hides regressions. A 30,000-item set does not run on every CI build.
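The versioning and deprecation rules above imply a concrete item schema. A sketch of one, with field names that are our assumptions rather than any standard format:

```python
from dataclasses import dataclass, field
from datetime import date

# Illustrative golden-set item schema; field names are assumptions.

@dataclass
class GoldenItem:
    item_id: str
    task_type: str          # e.g. "grounded_qa", "extraction"
    difficulty: str         # "easy" | "hard" | "adversarial" | "unanswerable"
    query: str
    expected: dict          # expected behavior; shape varies by task type
    source_doc_id: str      # the corpus document the item was drafted from
    version: int = 1
    created: date = field(default_factory=date.today)
    deprecated: bool = False
    deprecation_reason: str = ""

def deprecate(item: GoldenItem, reason: str) -> GoldenItem:
    """Items are never silently rewritten; they are deprecated with a
    recorded reason and replaced by a new item under a new id."""
    item.deprecated = True
    item.deprecation_reason = reason
    return item
```

Keeping `source_doc_id` on every item is what makes the monthly corpus diff described later mechanical rather than manual.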
Scoring: automated, LLM-judged, and human
Three scoring layers stack. Each catches what the others miss.
Layer 1: automated checks
Exact match, regex, JSON schema validation, citation-ID validation against the retrieved set, numeric range checks, and format checks. Fast, deterministic, cheap. Good for extraction, classification, format conformance, and grounded-citation validity. Bad for anything that depends on meaning.
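A Layer-1 check worth showing concretely is citation-ID validation against the retrieved set, since it is cheap and catches a common hallucination mode: citing a document that was never retrieved. A sketch, assuming a `[doc-N]` citation syntax that is ours, not a standard:

```python
import re

# Layer-1 automated check: every citation in the answer must point at a
# document that was actually in the retrieved set. The [doc-N] syntax is
# an assumed convention for this sketch.

CITATION = re.compile(r"\[(doc-\d+)\]")

def check_citations(answer: str, retrieved_ids: set[str]) -> dict:
    cited = set(CITATION.findall(answer))
    invalid = cited - retrieved_ids   # citations to never-retrieved documents
    return {
        "has_citation": bool(cited),
        "all_valid": bool(cited) and not invalid,
        "invalid_ids": sorted(invalid),
    }
```

Deterministic checks like this run on every item of every eval; nothing meaning-dependent should be delegated to them, but nothing this mechanical should be delegated to a judge model either.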
Layer 2: LLM-as-judge
A separate, strong model (typically a different provider or a different model family than the one under test) scores outputs against a rubric. Rubrics are explicit: "Rate 1-5 on faithfulness to source. Deduct for any claim not supported by the passage. Deduct if the tone is speculative. Return JSON with score, one-sentence rationale, and the unsupported claims if any."
LLM-as-judge is strong on subjective dimensions (clarity, tone, completeness) once the rubric is tight. It is weak on correctness where ground truth exists and should not be used for that — use automated checks or human scoring instead. Calibrate the judge against 100-plus human-labeled items and track judge-human agreement as a first-class metric. Judges drift; rubrics need revision.
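The calibration step above reduces to computing agreement between judge scores and human labels on the same items. A stdlib-only sketch reporting raw agreement and Cohen's kappa, which corrects for chance agreement on skewed rubrics:

```python
from collections import Counter

# Judge calibration: exact-agreement rate and Cohen's kappa between
# LLM-judge scores and human labels on the same items.

def judge_agreement(judge_scores, human_scores):
    assert len(judge_scores) == len(human_scores) and judge_scores
    n = len(judge_scores)
    agree = sum(j == h for j, h in zip(judge_scores, human_scores)) / n
    # Expected chance agreement from the marginal label frequencies
    jc, hc = Counter(judge_scores), Counter(human_scores)
    labels = set(jc) | set(hc)
    p_e = sum((jc[l] / n) * (hc[l] / n) for l in labels)
    kappa = (agree - p_e) / (1 - p_e) if p_e < 1 else 1.0
    return {"agreement": agree, "kappa": kappa}
```

Tracking kappa rather than raw agreement matters when most items score 4 or 5: a judge that always says "4" can post high raw agreement while carrying no signal.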
Layer 3: human review
A sampled subset — typically 50 to 200 items per eval run — is reviewed by SMEs with a structured form. Human review does three things automated scoring cannot: catches subtle correctness failures, validates that the automated and LLM-judge scores are tracking reality, and generates fresh failure cases that get added to the golden set.
Tooling: what runs the harness
A real federal eval harness is a small, tight piece of software. Five options we use depending on context:
| Harness | Best for | Gotchas |
|---|---|---|
| OpenAI Evals (open source) | Fast starts, lightweight CI integration, YAML-defined tasks | Opinionated toward chat models; custom task types need Python |
| LangChain / LlamaIndex eval modules | RAG-specific metrics (faithfulness, context recall), integrated with pipelines | Tight coupling to the framework; hard to swap in non-LangChain systems |
| Promptfoo | Side-by-side model comparison, prompt A/B, CI integration | Limited for multi-turn agentic tasks |
| Custom (Python + pytest + DuckDB) | Production federal programs with classified eval data | You build and maintain it; plus side — no external service sees your data |
| MLflow + Evidently | Tracking results over time, dashboards, drift monitoring | Eval logic lives elsewhere; these are the storage layer |
For any program handling CUI or classified eval data, custom harness in the authorization boundary is the default. The cost of building it is trivial compared to the compliance cost of sending agency data to an external eval service.
Regression suites that gate release
The point of an eval harness is to make deployment decisions. That requires release gates:
Task-level thresholds
A new candidate model or prompt must meet or exceed the current champion on each task type. Regressions on a single task type are not automatically overridden by gains elsewhere.
Cost ceilings
Quality is not the only axis. A candidate that improves accuracy by 1 percent at 4x cost usually does not ship.
Latency ceilings
P95 latency is scored alongside quality. A model that wins on quality but doubles P95 fails the gate unless the quality improvement is large.
Safety failures as hard blockers
A candidate that regresses on refusal-for-CUI or PII-redaction precision does not ship regardless of other gains.
Manual override trail
If a human overrides a gate failure, the rationale and approver are logged and reviewed.
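The gates above compose into a single pass/fail decision with recorded reasons, which is what the override trail needs. A sketch with illustrative thresholds and task names; the 2.0x cost and 1.5x P95 ratios are assumptions, not program policy:

```python
# Release-gate sketch. Thresholds and safety task names are illustrative.

def release_gate(candidate, champion, cost_ceiling=2.0, p95_ceiling=1.5,
                 safety_tasks=("refusal_cui", "pii_redaction")):
    """candidate/champion: dicts with per-task scores under 'tasks',
    plus 'cost' and 'p95_ms'. Returns (ship, reasons)."""
    reasons = []
    for task, champ_score in champion["tasks"].items():
        if candidate["tasks"].get(task, 0.0) < champ_score:
            tag = "SAFETY " if task in safety_tasks else ""
            reasons.append(f"{tag}regression on {task}")
    if candidate["cost"] > champion["cost"] * cost_ceiling:
        reasons.append("cost ceiling exceeded")
    if candidate["p95_ms"] > champion["p95_ms"] * p95_ceiling:
        reasons.append("p95 latency ceiling exceeded")
    return not reasons, reasons
```

A human override then means logging the `reasons` list alongside the approver, so the audit trail shows exactly which gate was waived.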
Evaluating refusal behavior
Federal systems fail more often from over-refusal than under-refusal. A model that refuses every question about "the investigation" because the word sounds sensitive is useless; a model that answers unanswerable questions with confident fabrication is dangerous. Both failure modes are measurable.
Build two complementary subsets:
- Should-refuse set. Queries that are out of scope, adversarial, unanswerable from the corpus, or involve classification levels above the caller's. Expected behavior: clean refusal with a specific reason.
- Should-answer set. Queries that are in scope and answerable. Expected behavior: answer with citations.
Report refusal precision (of the refusals, how many were correct) and recall (of the items that should have been refused, how many were) as separate numbers. Never collapse them into a single score — a system that refuses everything scores 100 percent on recall and 0 on precision.
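The two metrics, kept separate as prescribed, reduce to a few lines. A sketch over labeled items, each a pair of (should the system have refused, did it refuse):

```python
# Refusal precision and recall, reported as separate numbers.
# items: list of (should_refuse: bool, did_refuse: bool) pairs.

def refusal_metrics(items):
    refused = sum(1 for _, did in items if did)
    should = sum(1 for s, _ in items if s)
    correct = sum(1 for s, did in items if s and did)
    precision = correct / refused if refused else 1.0
    recall = correct / should if should else 1.0
    return {"refusal_precision": precision, "refusal_recall": recall}
```

The refuse-everything pathology the text warns about shows up immediately: on a set with one should-refuse item and three should-answer items, blanket refusal posts recall 1.0 and precision 0.25.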
Drift: the silent eval killer
Your golden set was written six months ago. The underlying documents have changed. The foundation model has been silently updated by the vendor. The retrieval corpus has grown. The production traffic mix is no longer what the golden set represented. Every one of these is drift, and every one erodes the harness's validity.
Drift controls that work:
Monthly corpus diff
What percent of the golden set's referenced documents have changed? Items touching changed documents get re-reviewed.
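If each golden item records its source document id, the corpus diff is a hash comparison. A sketch, assuming content hashes are snapshotted per document at each run; the dict shapes are ours:

```python
# Monthly corpus diff: flag golden items whose source document changed
# between snapshots, by content hash. Dict shapes are assumptions.

def corpus_diff(items, old_hashes, new_hashes):
    """items: dicts with 'item_id' and 'source_doc_id'.
    old_hashes/new_hashes: doc_id -> content hash at each snapshot.
    Returns the stale fraction and the item ids needing re-review."""
    stale = [
        it["item_id"] for it in items
        if old_hashes.get(it["source_doc_id"]) != new_hashes.get(it["source_doc_id"])
    ]
    pct = len(stale) / len(items) if items else 0.0
    return {"pct_stale": pct, "stale_items": stale}
```

Using `.get` on both sides also flags items whose source document was deleted from the corpus, which is the most urgent deprecation case.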
Model version pinning
Pin the exact model and version. For hosted APIs, test behavior before migrating to any new version. Sonnet X and Sonnet X.1 are not the same system.
Production sampling
Pull 200 to 500 real queries per month from production logs (scrubbed), route to the harness, and compare scores to the golden-set baseline. Large gaps mean the golden set has stopped representing real usage.
Refresh cadence
Add 30 to 50 new items per quarter from production failure cases. Retire items that no longer represent current corpus state.
Compliance overlay
Evaluation maps onto NIST SP 800-53 controls (CA-2 security assessments, CA-7 continuous monitoring, SI-4 system monitoring) and NIST AI RMF functions (Measure 1-4). For a federal authorization package, the eval harness and its results are evidence:
- Eval harness design doc maps to the AI RMF Measure function.
- Regression suite results map to SI-4 (and go into the continuous monitoring plan).
- Refusal-behavior evaluation maps to SI-5 (security alerts) when a refusal is triggered by a policy violation signal.
- Drift monitoring maps to CA-7.
- Human review procedures map to AT-3 (role-based training) when reviewers are identified and trained.
Writing the harness once and tagging every result with the control reference turns eval output into authorization evidence without extra work.
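That tagging step can be as small as a static mapping applied to every result record. A sketch: the control identifiers are the real NIST 800-53 and AI RMF labels listed above, but the mapping keys and record shape are our assumptions:

```python
# Tag each eval result with its control references so a run doubles as
# authorization evidence. Control IDs follow the mapping in the text;
# the eval_kind keys and record shape are illustrative.

CONTROL_MAP = {
    "harness_design": ["AI-RMF Measure"],
    "regression_suite": ["SI-4"],
    "refusal_eval": ["SI-5"],
    "drift_monitoring": ["CA-7"],
    "human_review": ["AT-3"],
}

def tag_result(result: dict, eval_kind: str) -> dict:
    """Return a copy of the result annotated with its control references."""
    return {**result, "controls": CONTROL_MAP.get(eval_kind, [])}
```

From there, filtering a results store by control ID produces the evidence package for an assessor without any separate documentation effort.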
Where this fits in our practice
We build eval harnesses that serve both engineering and compliance. See our RAG architecture and NIST 800-53 control mapping posts for how eval integrates with the rest of the stack.