Why labeling is the hidden risk on federal AI programs
Every federal AI program depends on labeled data somewhere. Training sets, evaluation sets, fine-tuning examples, red-team adversarial sets. The quality of that labeling caps the quality of everything downstream. A model trained on noisy labels will not exceed the noise floor. An eval harness built on bad labels tells you the wrong thing about deployment readiness. And yet labeling is the part of the pipeline most often outsourced, least often audited, and most often treated as a commodity.
Federal data labeling is a controlled activity when labels are applied to sensitive data. Annotation platforms must operate within the authorization boundary. Labeler access logs are audit artifacts, not just operational records.
This post describes the labeling operation we design for federal programs: the workflow, the QA loop, the audit trail, and the vendor posture.
[Figure: Federal data labeling workflow — six-stage pipeline]
Guidelines are the product

Before anyone labels anything, the annotation guideline is written, reviewed, versioned, and tested. A good guideline document includes:
- Task definition. What is the label, exactly, in one paragraph.
- Label schema. Every label value, with a definition, with examples.
- Decision rules for edge cases. Explicit rules for the cases that labelers will otherwise split on.
- Worked examples. Ten to thirty real items with the correct label and a sentence of rationale per item.
- Counter-examples. Items that look like one label but are another, with the reasoning.
- Escalation rules. What to do when an item does not fit any label; who to ask; how to flag.
- Classification handling. How to mark, store, and discuss labeled data according to its classification.
- Version number. Every label-of-record is tagged with the guideline version it was produced under.
A guideline that fits on one page is almost always undercooked. Ten to thirty pages is typical for a real federal task.
Pilot, then scale
Before the production labeling run, the pilot: 50-200 items, 3-5 labelers, every labeler labels every item. Measure inter-annotator agreement (IAA). Walk the disagreements. Revise the guidelines. Repeat until IAA hits the target. Only then does production labeling start.
Programs that skip the pilot produce the largest mass of mis-labeled data they will ever regret.
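With every labeler labeling every item, the pilot's agreement number is Fleiss' kappa, and it needs no special tooling. A minimal pure-Python sketch, assuming pilot labels are collected as one list of label values per item (the function name and example data are illustrative):

```python
from collections import Counter

def fleiss_kappa(labels_per_item):
    """Fleiss' kappa for 3+ labelers labeling the same items.

    labels_per_item: list of lists, one inner list of label values per
    item; every item must carry the same number of raters.
    """
    n_items = len(labels_per_item)
    n_raters = len(labels_per_item[0])
    categories = sorted({lab for item in labels_per_item for lab in item})

    # Per-item agreement P_i, plus running marginal counts per category.
    total = {c: 0 for c in categories}
    p_bar = 0.0
    for item in labels_per_item:
        counts = Counter(item)
        for c, n in counts.items():
            total[c] += n
        p_i = (sum(n * n for n in counts.values()) - n_raters) / (n_raters * (n_raters - 1))
        p_bar += p_i / n_items

    # Chance agreement P_e from the marginal label distribution.
    p_e = sum((total[c] / (n_items * n_raters)) ** 2 for c in categories)
    return (p_bar - p_e) / (1 - p_e)

# Toy pilot: 3 labelers, 4 items, two label values.
pilot = [["A", "A", "A"], ["A", "A", "B"], ["B", "B", "B"], ["A", "B", "B"]]
print(round(fleiss_kappa(pilot), 3))  # 0.333
```

A kappa this low on a pilot is exactly the signal to walk the disagreements and revise the guideline before production starts.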
Measuring agreement honestly
| Metric | What it measures | When to use |
|---|---|---|
| Percent agreement | Raw fraction of items both labelers agreed on | Sanity check; misleading with class imbalance |
| Cohen's kappa | Two labelers, chance-adjusted | Standard two-rater metric for categorical tasks |
| Fleiss' kappa | Multiple labelers, chance-adjusted | 3+ labelers on the same items |
| Krippendorff's alpha | Missing data, any number of labelers, any data type | Most flexible; growing default |
| Confusion matrix | Which labels are being confused with which | Always. Reveals guideline gaps. |
Never report a single-number IAA without the confusion matrix. Two labelers with kappa 0.7 and confusion concentrated in one label pair is a fixable guideline problem; the same kappa with confusion spread evenly points to a broader problem with the task definition or labeler training.
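Both the kappa and the confusion matrix fall out of the same pair of label lists. A minimal pure-Python sketch (function names are illustrative; in practice libraries such as scikit-learn provide `cohen_kappa_score` and `confusion_matrix`):

```python
from collections import Counter

def cohen_kappa(a, b):
    """Chance-adjusted agreement between two labelers over the same items."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n     # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[c] * cb[c] for c in ca) / (n * n)  # chance agreement
    return (p_o - p_e) / (1 - p_e)

def confusion(a, b):
    """Counts of (label_A, label_B) pairs: shows WHERE labelers diverge."""
    return Counter(zip(a, b))

a = ["spam", "spam", "ham", "ham", "spam", "ham"]
b = ["spam", "ham",  "ham", "ham", "spam", "spam"]
print(round(cohen_kappa(a, b), 3))  # 0.333
print(confusion(a, b))
```

Reporting the two together is the point: the kappa says how bad it is, the confusion counts say which guideline section to fix.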
Adjudication and the label of record
Two-labeler-plus-adjudication is the default for any label that matters. The workflow:
- Labeler A and Labeler B independently annotate the item. Neither sees the other's label.
- If A and B agree, the label is provisionally accepted. A random fraction (5-10%) goes to adjudication anyway as a spot check.
- If A and B disagree, adjudicator C (senior labeler or SME) resolves. C sees both prior labels and rationales.
- The label of record is the adjudicated label. All three votes are logged. The adjudication time, reviewer, and rationale are logged.
For especially high-stakes labels (compliance classification, PII/CUI designation), the adjudicator is an SME and every item — not just disagreements — passes adjudication.
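The routing logic above reduces to a small state machine. A sketch with hypothetical field names; the spot-check fraction and the rule that high-stakes items always adjudicate come straight from the workflow described here:

```python
import random
from dataclasses import dataclass
from typing import Optional

SPOT_CHECK_RATE = 0.05  # 5-10% of agreements go to adjudication anyway

@dataclass
class Item:
    item_id: str
    label_a: str
    label_b: str
    high_stakes: bool = False          # e.g. PII/CUI designation
    adjudicated: Optional[str] = None  # set by adjudicator C when routed

def route(item, rng=random):
    """Return 'accept' or 'adjudicate' per the two-labeler workflow."""
    if item.high_stakes:
        return "adjudicate"            # every item passes adjudication
    if item.label_a != item.label_b:
        return "adjudicate"            # disagreement -> adjudicator C
    if rng.random() < SPOT_CHECK_RATE:
        return "adjudicate"            # random spot check on agreements
    return "accept"

def label_of_record(item):
    """The adjudicated label wins; otherwise the agreed label stands."""
    if item.adjudicated is not None:
        return item.adjudicated
    assert item.label_a == item.label_b, "disagreement must be adjudicated"
    return item.label_a
```

All three votes, the routing decision, and the adjudicator's rationale would be logged alongside the label of record, as described in the audit trail section below.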
Tooling: what we actually use
Label Studio
Open-source (Apache 2.0), self-hostable, supports text, image, audio, video, and complex tasks. Strong interface. Used in production across many federal pilots. The default first choice for most tasks.
CVAT
Open-source (MIT), focused on computer vision. Bounding boxes, polygons, keypoints, video tracking. Strong for geospatial and imagery work.
Prodigy
Commercial, from Explosion AI. Lightweight, keyboard-driven, strong for active-learning text annotation workflows. Runs locally, so it is compatible with sensitive-data environments.
Scale AI / Centaur Labs / Sama
Managed labeling services. Scale has a government practice. Centaur focuses on medical. All require careful vetting on clearance status, facility certification, and classification handling.
Custom
For classified or highly specialized tasks where no off-the-shelf tool clears authorization, a minimal custom interface on top of a database is sometimes faster than fighting an unfit tool.
Cleared-labeler posture
For classified data, labeling happens inside the authorized enclave by cleared personnel. This usually means: cleared workforce, accredited facility, VDI to a labeling interface, and audit logging to the program's SIEM. The working options in 2026:
- In-house cleared analysts (best when workload is steady).
- Cleared-labeler vendors (Scale AI government, select defense contractors with cleared annotation pools).
- GSA schedule vendors with cleared workforce for domain-specific labeling (imagery analysis, SIGINT preprocessing).
For CUI, the bar depends on the CUI category. Many categories accept labelers who are U.S. persons holding a public-trust determination, working from an authorized facility or VDI. The authorization package specifies.
Continuous QA after launch
Rolling IAA
A small fraction of items (5-10%) is dual-labeled continuously. IAA is tracked over time, per labeler, per label class. Drift is caught early.
Spot-checks by SMEs
A random sample reviewed weekly by a senior SME who can catch systematic errors no automated metric surfaces.
Per-labeler performance
Tracked against adjudicated labels. Labelers below target are coached or rotated off.
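Per-labeler accuracy against the label of record is a one-pass count. A sketch with hypothetical record fields:

```python
from collections import defaultdict

def labeler_accuracy(records):
    """records: iterable of (labeler_id, submitted_label, label_of_record)."""
    hits = defaultdict(int)
    seen = defaultdict(int)
    for labeler, submitted, record in records:
        seen[labeler] += 1
        hits[labeler] += submitted == record  # bool counts as 0 or 1
    return {labeler: hits[labeler] / seen[labeler] for labeler in seen}

records = [
    ("alice", "A", "A"), ("alice", "B", "B"), ("alice", "A", "B"),
    ("bob",   "A", "A"), ("bob",   "B", "B"),
]
print(labeler_accuracy(records))  # alice: 2/3, bob: 1.0
```

Tracking this per label class, not just overall, is what separates a labeler who needs coaching on one edge case from one who should rotate off the task.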
Guideline updates
When IAA reveals a systematic confusion, the guideline is updated. Prior labels covered by the old guideline may need re-review.
Active learning integration
Model uncertainty scores flag items most likely to improve the training set if labeled. Prioritize those in the labeling queue.
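Uncertainty-driven prioritization can be as simple as ranking unlabeled items by the entropy of the current model's predicted label distribution. A sketch; the probability vectors are assumed to come from whatever model the program is training:

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted label distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def prioritize(queue):
    """queue: list of (item_id, predicted_probs); most uncertain first."""
    return sorted(queue, key=lambda pair: entropy(pair[1]), reverse=True)

queue = [
    ("doc-1", [0.98, 0.02]),  # model is confident -> low labeling priority
    ("doc-2", [0.55, 0.45]),  # near the decision boundary -> label first
    ("doc-3", [0.80, 0.20]),
]
print([item_id for item_id, _ in prioritize(queue)])  # ['doc-2', 'doc-3', 'doc-1']
```

Entropy sampling is one of several standard acquisition strategies; margin or least-confidence sampling drops into the same queue-sorting pattern.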
Audit trail
Every label of record is a row that includes: item identifier, label value, labeler identity, timestamp, guideline version, adjudicator (if any), adjudication rationale (if any), classification marking of the underlying item, tool version used. This row is the evidence a reviewer or auditor will ask for.
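That row maps naturally onto a fixed schema. A sketch as a Python dataclass; the field names mirror the list above, and the example values (item IDs, marking strings, tool version) are illustrative:

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass(frozen=True)
class LabelOfRecord:
    item_id: str
    label_value: str
    labeler_id: str
    labeled_at: str              # ISO-8601 timestamp
    guideline_version: str
    classification_marking: str  # marking of the underlying item
    tool_version: str
    adjudicator_id: Optional[str] = None
    adjudication_rationale: Optional[str] = None

row = LabelOfRecord(
    item_id="doc-0042",
    label_value="responsive",
    labeler_id="labeler-17",
    labeled_at="2026-01-15T14:03:22Z",
    guideline_version="v2.3",
    classification_marking="CUI//SP-PRVCY",
    tool_version="label-studio 1.13",
)
print(asdict(row)["guideline_version"])  # v2.3
```

Freezing the dataclass is deliberate: a label of record is append-only evidence, and corrections arrive as new rows, not mutations.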
Labelers are trained (and logged as trained) on: the guideline, classification handling procedures, the tool, and whatever role-based security training modules the program requires (PII training, CUI training). Training records are an audit artifact.
Common failure modes
Guideline written after labeling started
Labels pre-guideline are inconsistent; programs either re-label or accept a known noise floor.
Single-labeler production
No IAA possible; no adjudication; label quality unknown.
Labelers not trained on classification
CUI or classified data labeled on an unauthorized device; spill.
Adjudication skipped under schedule pressure
Disagreements go to the senior labeler who rubber-stamps; effective quality drops silently.
Guideline frozen forever
Real tasks reveal edge cases; frozen guidelines produce predictable errors on those.
No version tracking on labels
Model trained on a mix of pre- and post-guideline-update labels; error modes are inexplicable.
Labelers outside the boundary
Convenience leads to off-boundary labeling; compliance finds out in the audit.
Where this fits in our practice
We design labeling operations as a first-class component of federal AI programs, not as an afterthought. See our LLM evaluation post for how labeled data powers the eval harness and our document AI post for extraction-specific labeling patterns.