Why public benchmarks are the wrong measuring stick
Every LLM vendor markets with MMLU, GSM8K, HumanEval, and a handful of other curated benchmarks. They were useful when they were new, but they are close to useless for deciding whether a model is ready to put in front of an agency's users. MMLU does not measure whether a model grounds its answers in a specific corpus, refuses cleanly when information is missing, preserves classification markings in output, or holds up across a long context window of policy text. It measures whether a model can guess the right multiple-choice answer to an AP history question.
HELM from Stanford CRFM is broader and includes some robustness and fairness dimensions. BIG-bench is broader still. Both are valuable for vendor comparison at a gross level. Neither answers the question that matters inside a federal program: "will this system give a correct, cited, and defensible answer on the document my analyst just dropped into it?"
The eval harness we build inside federal programs starts with a premise: public benchmarks are a noisy filter for shortlisting candidate models. The real evaluation runs on the agency's data, against the agency's task definition, with scoring that matches how a domain expert would judge the output.
Task typology: the seven things federal LLMs actually do

Every agency pipeline we have built breaks into some subset of these task types. Eval is structured per type because the scoring function differs.
- Grounded QA. Answer a question from a retrieved passage, with citation. Scored on correctness, grounding, and citation validity.
- Extraction. Pull structured fields from a document (case number, dates, parties, amounts). Scored on field-level precision and recall.
- Summarization. Condense a document or a set of documents. Scored on faithfulness, coverage, and length conformance.
- Classification. Assign a label from a fixed taxonomy. Scored on precision, recall, F1 per class, and macro-F1.
- Redaction. Identify and mask PII or classified spans. Scored on precision and recall at the span level with an asymmetric cost — missed redactions are much worse than over-redactions.
- Translation / transformation. Convert text across languages, reading levels, or formats. Scored on faithfulness and target-format conformance.
- Agentic action. Tool use with side effects. Scored on task completion, tool-call validity, and safety constraints.
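The asymmetric cost called out for redaction is the scoring function most worth pinning down precisely. A minimal sketch, where spans are `(start, end)` tuples and the 10x weight on missed redactions is an illustrative assumption, not a number from this post:

```python
# Span-level redaction scoring with asymmetric cost.
# Spans are (start, end) character offsets; miss_weight is an assumed 10x.

def redaction_score(gold_spans, predicted_spans, miss_weight=10.0):
    """Precision/recall over exact spans, plus a weighted cost where a
    missed redaction counts miss_weight times an over-redaction."""
    gold = set(gold_spans)
    pred = set(predicted_spans)
    true_pos = len(gold & pred)
    misses = len(gold - pred)   # unredacted PII: the expensive failure
    extras = len(pred - gold)   # over-redaction: cheap by comparison
    precision = true_pos / len(pred) if pred else 1.0
    recall = true_pos / len(gold) if gold else 1.0
    cost = miss_weight * misses + extras
    return {"precision": precision, "recall": recall, "cost": cost}
```

Exact-span matching is deliberately strict; a production scorer might credit partial overlaps, but the asymmetry between misses and extras is the part that must survive any refinement.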
Golden sets that don't rot
A golden set is the labeled evaluation data — input queries paired with expected behaviors. It is the most valuable artifact on the program, and the one most likely to decay. The rules we enforce:
- Built from the live corpus, not synthesized. Analysts draft items from real documents and real questions, not from an LLM imagining what users might ask.
- Stratified by task type, difficulty, and sensitivity. A golden set dominated by easy grounded QA misleads. Include the hard, adversarial, and unanswerable cases deliberately.
- Versioned. Every item carries a version and a creation date. Items get deprecated when the underlying document changes or is superseded. You do not silently rewrite a golden item.
- Labeled by people qualified to label. A policy-QA golden set labeled by a contractor intern is not a policy-QA golden set. Use the SMEs who would read the production output.
- Kept small enough to run on every change, big enough to detect regression. Typical range: 300 to 1,500 items split across task types. A 30-item set hides regressions. A 30,000-item set does not run on every CI build.
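The versioning and deprecation rules above imply a concrete item schema. A sketch of one, with field names that are our assumptions rather than any standard format:

```python
from dataclasses import dataclass, field
from datetime import date

# Illustrative golden-set item schema; field names are assumptions.

@dataclass
class GoldenItem:
    item_id: str
    task_type: str          # e.g. "grounded_qa", "extraction"
    difficulty: str         # "easy" | "hard" | "adversarial" | "unanswerable"
    query: str
    expected: dict          # expected behavior; shape varies by task type
    source_doc_id: str      # the corpus document the item was drafted from
    version: int = 1
    created: date = field(default_factory=date.today)
    deprecated: bool = False
    deprecation_reason: str = ""

def deprecate(item: GoldenItem, reason: str) -> GoldenItem:
    """Items are never silently rewritten; they are deprecated with a
    recorded reason and replaced by a new item under a new id."""
    item.deprecated = True
    item.deprecation_reason = reason
    return item
```

Keeping `source_doc_id` on every item is what makes the monthly corpus diff described later mechanical rather than manual.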
Scoring: automated, LLM-judged, and human
Three scoring layers stack. Each catches what the others miss.
Layer 1: automated checks
Exact match, regex, JSON schema validation, citation-ID validation against the retrieved set, numeric range checks, and format checks. Fast, deterministic, cheap. Good for extraction, classification, format conformance, and grounded-citation validity. Bad for anything that depends on meaning.
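A Layer-1 check worth showing concretely is citation-ID validation against the retrieved set, since it is cheap and catches a common hallucination mode: citing a document that was never retrieved. A sketch, assuming a `[doc-N]` citation syntax that is ours, not a standard:

```python
import re

# Layer-1 automated check: every citation in the answer must point at a
# document that was actually in the retrieved set. The [doc-N] syntax is
# an assumed convention for this sketch.

CITATION = re.compile(r"\[(doc-\d+)\]")

def check_citations(answer: str, retrieved_ids: set[str]) -> dict:
    cited = set(CITATION.findall(answer))
    invalid = cited - retrieved_ids   # citations to never-retrieved documents
    return {
        "has_citation": bool(cited),
        "all_valid": bool(cited) and not invalid,
        "invalid_ids": sorted(invalid),
    }
```

Deterministic checks like this run on every item of every eval; nothing meaning-dependent should be delegated to them, but nothing this mechanical should be delegated to a judge model either.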
Layer 2: LLM-as-judge
A separate, strong model (typically a different provider or a different model family than the one under test) scores outputs against a rubric. Rubrics are explicit: "Rate 1-5 on faithfulness to source. Deduct for any claim not supported by the passage. Deduct if the tone is speculative. Return JSON with score, one-sentence rationale, and the unsupported claims if any."
LLM-as-judge is strong on subjective dimensions (clarity, tone, completeness) once the rubric is tight. It is weak on correctness where ground truth exists and should not be used for that — use automated checks or human scoring instead. Calibrate the judge against 100-plus human-labeled items and track judge-human agreement as a first-class metric. Judges drift; rubrics need revision.
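The calibration step above reduces to computing agreement between judge scores and human labels on the same items. A stdlib-only sketch reporting raw agreement and Cohen's kappa, which corrects for chance agreement on skewed rubrics:

```python
from collections import Counter

# Judge calibration: exact-agreement rate and Cohen's kappa between
# LLM-judge scores and human labels on the same items.

def judge_agreement(judge_scores, human_scores):
    assert len(judge_scores) == len(human_scores) and judge_scores
    n = len(judge_scores)
    agree = sum(j == h for j, h in zip(judge_scores, human_scores)) / n
    # Expected chance agreement from the marginal label frequencies
    jc, hc = Counter(judge_scores), Counter(human_scores)
    labels = set(jc) | set(hc)
    p_e = sum((jc[l] / n) * (hc[l] / n) for l in labels)
    kappa = (agree - p_e) / (1 - p_e) if p_e < 1 else 1.0
    return {"agreement": agree, "kappa": kappa}
```

Tracking kappa rather than raw agreement matters when most items score 4 or 5: a judge that always says "4" can post high raw agreement while carrying no signal.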
Layer 3: human review
A sampled subset — typically 50 to 200 items per eval run — is reviewed by SMEs with a structured form. Human review does three things automated scoring cannot: catches subtle correctness failures, validates that the automated and LLM-judge scores are tracking reality, and generates fresh failure cases that get added to the golden set.
Tooling: what runs the harness
A real federal eval harness is a small, tight piece of software. Five options we use depending on context:
| Harness | Best for | Gotchas |
|---|---|---|
| OpenAI Evals (open source) | Fast starts, lightweight CI integration, YAML-defined tasks | Opinionated toward chat models; custom task types need Python |
| LangChain / LlamaIndex eval modules | RAG-specific metrics (faithfulness, context recall), integrated with pipelines | Tight coupling to the framework; hard to swap in non-LangChain systems |
| Promptfoo | Side-by-side model comparison, prompt A/B, CI integration | Limited for multi-turn agentic tasks |
| Custom (Python + pytest + DuckDB) | Production federal programs with classified eval data | You build and maintain it; plus side — no external service sees your data |
| MLflow + Evidently | Tracking results over time, dashboards, drift monitoring | Eval logic lives elsewhere; these are the storage layer |
For any program handling CUI or classified eval data, custom harness in the authorization boundary is the default. The cost of building it is trivial compared to the compliance cost of sending agency data to an external eval service.
Regression suites that gate release
The point of an eval harness is to make deployment decisions. That requires release gates:
Task-level thresholds
A new candidate model or prompt must meet or exceed the current champion on each task type. Regressions on a single task type are not automatically overridden by gains elsewhere.
Cost ceilings
Quality is not the only axis. A candidate that improves accuracy by 1 percent at 4x cost usually does not ship.
Latency ceilings
P95 latency is scored alongside quality. A model that wins on quality but doubles P95 fails the gate unless the quality improvement is large.
Safety failures as hard blockers
A candidate that regresses on refusal-for-CUI or PII-redaction precision does not ship regardless of other gains.
Manual override trail
If a human overrides a gate failure, the rationale and approver are logged and reviewed.
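The gates above compose into a single pass/fail decision with recorded reasons, which is what the override trail needs. A sketch with illustrative thresholds and task names; the 2.0x cost and 1.5x P95 ratios are assumptions, not program policy:

```python
# Release-gate sketch. Thresholds and safety task names are illustrative.

def release_gate(candidate, champion, cost_ceiling=2.0, p95_ceiling=1.5,
                 safety_tasks=("refusal_cui", "pii_redaction")):
    """candidate/champion: dicts with per-task scores under 'tasks',
    plus 'cost' and 'p95_ms'. Returns (ship, reasons)."""
    reasons = []
    for task, champ_score in champion["tasks"].items():
        if candidate["tasks"].get(task, 0.0) < champ_score:
            tag = "SAFETY " if task in safety_tasks else ""
            reasons.append(f"{tag}regression on {task}")
    if candidate["cost"] > champion["cost"] * cost_ceiling:
        reasons.append("cost ceiling exceeded")
    if candidate["p95_ms"] > champion["p95_ms"] * p95_ceiling:
        reasons.append("p95 latency ceiling exceeded")
    return not reasons, reasons
```

A human override then means logging the `reasons` list alongside the approver, so the audit trail shows exactly which gate was waived.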
Evaluating refusal behavior
Federal systems fail more often from over-refusal than under-refusal. A model that refuses every question about "the investigation" because the word sounds sensitive is useless; a model that answers unanswerable questions with confident fabrication is dangerous. Both failure modes are measurable.
Build two complementary subsets:
- Should-refuse set. Queries that are out of scope, adversarial, unanswerable from the corpus, or involve classification levels above the caller's. Expected behavior: clean refusal with a specific reason.
- Should-answer set. Queries that are in scope and answerable. Expected behavior: answer with citations.
Report refusal precision (of the refusals, how many were correct) and recall (of the items that should have been refused, how many were) as separate numbers. Never collapse them into a single score — a system that refuses everything scores 100 percent on recall and 0 on precision.
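The two metrics, kept separate as prescribed, reduce to a few lines. A sketch over labeled items, each a pair of (should the system have refused, did it refuse):

```python
# Refusal precision and recall, reported as separate numbers.
# items: list of (should_refuse: bool, did_refuse: bool) pairs.

def refusal_metrics(items):
    refused = sum(1 for _, did in items if did)
    should = sum(1 for s, _ in items if s)
    correct = sum(1 for s, did in items if s and did)
    precision = correct / refused if refused else 1.0
    recall = correct / should if should else 1.0
    return {"refusal_precision": precision, "refusal_recall": recall}
```

The refuse-everything pathology the text warns about shows up immediately: on a set with one should-refuse item and three should-answer items, blanket refusal posts recall 1.0 and precision 0.25.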
Drift: the silent eval killer
Your golden set was written six months ago. The underlying documents have changed. The foundation model has been silently updated by the vendor. The retrieval corpus has grown. The production traffic mix is no longer what the golden set represented. Every one of these is drift, and every one erodes the harness's validity.
Drift controls that work:
Monthly corpus diff
What percent of the golden set's referenced documents have changed? Items touching changed documents get re-reviewed.
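If each golden item records its source document id, the corpus diff is a hash comparison. A sketch, assuming content hashes are snapshotted per document at each run; the dict shapes are ours:

```python
# Monthly corpus diff: flag golden items whose source document changed
# between snapshots, by content hash. Dict shapes are assumptions.

def corpus_diff(items, old_hashes, new_hashes):
    """items: dicts with 'item_id' and 'source_doc_id'.
    old_hashes/new_hashes: doc_id -> content hash at each snapshot.
    Returns the stale fraction and the item ids needing re-review."""
    stale = [
        it["item_id"] for it in items
        if old_hashes.get(it["source_doc_id"]) != new_hashes.get(it["source_doc_id"])
    ]
    pct = len(stale) / len(items) if items else 0.0
    return {"pct_stale": pct, "stale_items": stale}
```

Using `.get` on both sides also flags items whose source document was deleted from the corpus, which is the most urgent deprecation case.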
Model version pinning
Pin the exact model and version. For hosted APIs, test behavior before migrating to any new version. Sonnet X and Sonnet X.1 are not the same system.
Production sampling
Pull 200 to 500 real queries per month from production logs (scrubbed), route to the harness, and compare scores to the golden-set baseline. Large gaps mean the golden set has stopped representing real usage.
Refresh cadence
Add 30 to 50 new items per quarter from production failure cases. Retire items that no longer represent current corpus state.
Compliance overlay
Evaluation maps onto NIST SP 800-53 controls (CA-2 security assessments, CA-7 continuous monitoring, SI-4 system monitoring) and NIST AI RMF functions (Measure 1-4). For a federal authorization package, the eval harness and its results are evidence:
- Eval harness design doc maps to the AI RMF Measure function.
- Regression suite results map to SI-4 (and go into the continuous monitoring plan).
- Refusal-behavior evaluation maps to SI-5 (security alerts) when a refusal is triggered by a policy violation signal.
- Drift monitoring maps to CA-7.
- Human review procedures map to AT-3 (role-based training) when reviewers are identified and trained.
Writing the harness once and tagging every result with the control reference turns eval output into authorization evidence without extra work.
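That tagging step can be as small as a static mapping applied to every result record. A sketch: the control identifiers are the real NIST 800-53 and AI RMF labels listed above, but the mapping keys and record shape are our assumptions:

```python
# Tag each eval result with its control references so a run doubles as
# authorization evidence. Control IDs follow the mapping in the text;
# the eval_kind keys and record shape are illustrative.

CONTROL_MAP = {
    "harness_design": ["AI-RMF Measure"],
    "regression_suite": ["SI-4"],
    "refusal_eval": ["SI-5"],
    "drift_monitoring": ["CA-7"],
    "human_review": ["AT-3"],
}

def tag_result(result: dict, eval_kind: str) -> dict:
    """Return a copy of the result annotated with its control references."""
    return {**result, "controls": CONTROL_MAP.get(eval_kind, [])}
```

From there, filtering a results store by control ID produces the evidence package for an assessor without any separate documentation effort.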
Where this fits in our practice
We build eval harnesses that serve both engineering and compliance. See our RAG architecture and NIST 800-53 control mapping posts for how eval integrates with the rest of the stack.