
Federal data labeling at scale, under audit.

April 7, 2026 · 14 min read · Annotator training, IAA measurement, disagreement resolution, audit trails, and cleared-labeler vendors for classified work.

Why labeling is the hidden risk on federal AI programs

Every federal AI program depends on labeled data somewhere. Training sets, evaluation sets, fine-tuning examples, red-team adversarial sets. The quality of that labeling caps the quality of everything downstream. A model trained on noisy labels will not exceed the noise floor. An eval harness built on bad labels tells you the wrong thing about deployment readiness. And yet labeling is the part of the pipeline most often outsourced, least often audited, and most often treated as a commodity.

LABELING IS A SECURITY-CONTROLLED ACTIVITY

Federal data labeling is a controlled activity when labels are applied to sensitive data. Annotation platforms must operate within the authorization boundary. Labeler access logs are audit artifacts, not just operational records.

This post describes the labeling operation we design for federal programs — the workflow, the QA loop, the audit trail, and the vendor posture.

Federal data labeling workflow — six-stage pipeline

  1. Requirements scoping (weeks 1–2)
  2. Annotation guide (weeks 2–3)
  3. Pilot batch (100–500 samples)
  4. IAA measurement (target kappa > 0.75)
  5. Full production run (ongoing)
  6. Continuous QA (monthly audits)

Guidelines are the product

Before anyone labels anything, the annotation guideline is written, reviewed, versioned, and tested. A good guideline document includes:

  • Task definition. What is the label, exactly, in one paragraph.
  • Label schema. Every label value, with a definition, with examples.
  • Decision rules for edge cases. Explicit rules for the cases that labelers will otherwise split on.
  • Worked examples. Ten to thirty real items with the correct label and a sentence of rationale per item.
  • Counter-examples. Items that look like one label but are another, with the reasoning.
  • Escalation rules. What to do when an item does not fit any label; who to ask; how to flag.
  • Classification handling. How to mark, store, and discuss labeled data according to its classification.
  • Version number. Every label-of-record is tagged with the guideline version it was produced under.

A guideline that fits on one page is almost always undercooked. Ten to thirty pages is typical for a real federal task.

Pilot, then scale

Before the production labeling run, the pilot: 50-200 items, 3-5 labelers, every labeler labels every item. Measure inter-annotator agreement. Walk the disagreements. Revise the guidelines. Repeat until IAA hits target. Only then does production labeling start.
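With every labeler labeling every item, the pilot's agreement number is a multi-rater statistic. A minimal sketch of Fleiss' kappa computed from scratch, assuming equal numbers of labelers per item (function name and input shape are illustrative, not from any specific labeling tool):

```python
from collections import Counter

def fleiss_kappa(ratings):
    """ratings: one list per item; each inner list holds every labeler's
    label for that item (same number of labelers on every item)."""
    n = len(ratings)                 # items
    r = len(ratings[0])              # labelers per item
    category_totals = Counter()
    p_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Per-item observed agreement P_i.
        p_sum += (sum(c * c for c in counts.values()) - r) / (r * (r - 1))
    p_bar = p_sum / n                # mean observed agreement
    # Chance agreement from the pooled label distribution.
    p_e = sum((t / (n * r)) ** 2 for t in category_totals.values())
    return (p_bar - p_e) / (1 - p_e)
```

Each pilot round recomputes this after the guideline revision; production starts only once the value clears the program's target.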

Programs that skip the pilot produce the largest mass of mis-labeled data they will ever regret.

Measuring agreement honestly

  • Percent agreement: the raw fraction of items both labelers agreed on. Use as a sanity check; misleading under class imbalance.
  • Cohen's kappa: two labelers, chance-adjusted. The standard two-rater metric for categorical tasks.
  • Fleiss' kappa: multiple labelers, chance-adjusted. For three or more labelers on the same items.
  • Krippendorff's alpha: handles missing data, any number of labelers, any data type. The most flexible; a growing default.
  • Confusion matrix: shows which labels are being confused with which. Always report it; it reveals guideline gaps.

Never report a single-number IAA without the confusion matrix. Two labelers at kappa 0.7 with confusion concentrated in one label pair have a fixable guideline problem; the same kappa with confusion spread evenly across label pairs points to a deeper problem with the task definition itself.
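For the two-labeler case, both numbers can come from one pass over the data. A sketch (hypothetical helper, not tied to any particular tool) that returns Cohen's kappa alongside the confusion matrix so neither is reported without the other:

```python
from collections import Counter

def cohen_kappa_with_confusion(labels_a, labels_b):
    """labels_a, labels_b: parallel label lists from two independent labelers."""
    n = len(labels_a)
    confusion = Counter(zip(labels_a, labels_b))   # (label_a, label_b) -> count
    p_o = sum(c for (a, b), c in confusion.items() if a == b) / n
    totals_a, totals_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each labeler's marginal label distribution.
    p_e = sum(totals_a[c] * totals_b[c] for c in totals_a | totals_b) / (n * n)
    return (p_o - p_e) / (1 - p_e), confusion
```

Inspecting the off-diagonal entries of the returned counter is what localizes the guideline gap.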

Adjudication and the label of record

Two-labeler-plus-adjudication is the default for any label that matters. The workflow:

  1. Labeler A and Labeler B independently annotate the item. Neither sees the other's label.
  2. If A and B agree, the label is provisionally accepted. A random fraction (5-10%) goes to adjudication anyway as a spot check.
  3. If A and B disagree, adjudicator C (senior labeler or SME) resolves. C sees both prior labels and rationales.
  4. The label of record is the adjudicated label. All three votes are logged. The adjudication time, reviewer, and rationale are logged.

For especially high-stakes labels (compliance classification, PII/CUI designation), the adjudicator is an SME and every item — not just disagreements — passes adjudication.
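The four-step workflow reduces to a small resolution function. A sketch, assuming a hypothetical adjudicate callback that returns (adjudicator_id, label, rationale); the output field names are illustrative, not a mandated schema:

```python
import random
from datetime import datetime, timezone

def resolve(item_id, label_a, label_b, adjudicate, spot_check_rate=0.05, rng=random):
    """Adjudicate every disagreement, plus a random spot-check fraction
    of agreements; log all votes either way."""
    if label_a != label_b or rng.random() < spot_check_rate:
        adjudicator, label, rationale = adjudicate(item_id, label_a, label_b)
    else:
        # Agreement outside the spot-check sample: provisionally accepted.
        adjudicator, label, rationale = None, label_a, None
    return {
        "item_id": item_id,
        "label_of_record": label,
        "votes": {"A": label_a, "B": label_b},
        "adjudicator": adjudicator,
        "rationale": rationale,
        "resolved_at": datetime.now(timezone.utc).isoformat(),
    }
```

For the high-stakes case, the same function runs with spot_check_rate=1.0 so every item passes through the adjudicator.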

Tooling: what we actually use

Label Studio

Open-source (Apache 2.0), self-hostable, supports text, image, audio, video, and complex tasks. Strong interface. Used in production across many federal pilots. The default first choice for most tasks.

CVAT

Open-source (MIT), focused on computer vision. Bounding boxes, polygons, keypoints, video tracking. Strong for geospatial and imagery work.

Prodigy

Commercial, from Explosion AI. Lightweight, keyboard-driven, strong for active-learning text annotation workflows. Runs locally, so it is compatible with sensitive data.

Scale AI / Centaur Labs / Sama

Managed labeling services. Scale has a government practice. Centaur focuses on medical. All require careful vetting on clearance status, facility certification, and classification handling.

Custom

For classified or highly specialized tasks where no off-the-shelf tool clears authorization, a minimal custom interface on top of a database is sometimes faster than fighting an unfit tool.

Cleared-labeler posture

For classified data, labeling happens inside the authorized enclave by cleared personnel. This usually means: cleared workforce, accredited facility, VDI to a labeling interface, and audit logging to the program's SIEM. The working options in 2026:

  • In-house cleared analysts (best when workload is steady).
  • Cleared-labeler vendors (Scale AI government, select defense contractors with cleared annotation pools).
  • GSA schedule vendors with cleared workforce for domain-specific labeling (imagery analysis, SIGINT preprocessing).

For CUI, the bar depends on the CUI category. Many categories accept labelers who are U.S. persons with public-trust clearance working from an authorized facility or VDI. The authorization package specifies.

The label of record must be traceable. Which labeler produced it, when, under which guideline version, with which adjudicator if applicable. Without that trail, the downstream model has nothing to defend itself with.

Continuous QA after launch

Rolling IAA

A small fraction of items (5-10%) is dual-labeled continuously. IAA is tracked over time, per labeler, per label class. Drift is caught early.

Spot-checks by SMEs

A random sample reviewed weekly by a senior SME who can catch systematic errors no automated metric surfaces.

Per-labeler performance

Tracked against adjudicated labels. Labelers below target are coached or rotated off.

Guideline updates

When IAA reveals a systematic confusion, the guideline is updated. Prior labels covered by the old guideline may need re-review.

Active learning integration

Model uncertainty scores flag items most likely to improve the training set if labeled. Prioritize those in the labeling queue.
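One common way to produce those uncertainty scores is predictive entropy over the model's class probabilities; a minimal sketch (function names are ours, not any particular framework's):

```python
import math

def predictive_entropy(probs):
    """Shannon entropy of a model's class distribution for one item;
    higher means the model is less certain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def prioritize_queue(items):
    """items: (item_id, class_probabilities) pairs; most uncertain first."""
    return sorted(items, key=lambda pair: predictive_entropy(pair[1]), reverse=True)
```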

Audit trail

Every label of record is a row that includes: item identifier, label value, labeler identity, timestamp, guideline version, adjudicator (if any), adjudication rationale (if any), classification marking of the underlying item, tool version used. This row is the evidence a reviewer or auditor will ask for.
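As a concrete shape for that row, a frozen dataclass works; field names and example values here are illustrative, not a mandated schema:

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass(frozen=True)  # frozen: the label of record is immutable evidence
class LabelOfRecord:
    item_id: str
    label: str
    labeler_id: str
    labeled_at: str                  # ISO-8601 UTC timestamp
    guideline_version: str
    adjudicator_id: Optional[str]    # None when the labelers agreed
    adjudication_rationale: Optional[str]
    classification_marking: str      # marking of the underlying item
    tool_version: str

row = LabelOfRecord("item-0042", "pii_present", "labeler-07",
                    "2026-04-07T14:03:22Z", "guide-v3.1", None, None,
                    "CUI", "label-studio-1.12")
record = asdict(row)                 # serialize for the audit store
```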

Labelers are trained (and logged as trained) on: the guideline, classification handling procedures, the tool, and whatever role-based-security modules the program requires (PII training, CUI training). Training records are an audit artifact.

Common failure modes

Guideline written after labeling started

Labels pre-guideline are inconsistent; programs either re-label or accept a known noise floor.

Single-labeler production

No IAA possible; no adjudication; label quality unknown.

Labelers not trained on classification

CUI or classified data labeled on an unauthorized device; spill.

Adjudication skipped under schedule pressure

Disagreements go to the senior labeler who rubber-stamps; effective quality drops silently.

Guideline frozen forever

Real tasks reveal edge cases; frozen guidelines produce predictable errors on those.

No version tracking on labels

Model trained on a mix of pre- and post-guideline-update labels; error modes are inexplicable.

Labelers outside the boundary

Convenience leads to off-boundary labeling; compliance finds out in the audit.

Where this fits in our practice

We design labeling operations as a first-class component of federal AI programs, not as an afterthought. See our LLM evaluation post for how labeled data powers the eval harness and our document AI post for extraction-specific labeling patterns.

FAQ

What inter-annotator agreement threshold should a federal program target?
It depends on the task. Simple binary classification: Cohen's kappa above 0.8. Multi-class with clear boundaries: 0.7+. Subjective classification (policy-violation, tone): 0.55-0.7 is realistic. Report both kappa and percent agreement; single numbers hide important failure modes.
What labeling tools are viable for federal work?
Label Studio (open-source, self-hostable, widely used), CVAT for vision, Prodigy for text (commercial, lightweight), Scale AI and Centaur Labs for managed services (check FedRAMP and cleared-labeler availability), custom tooling for classified or highly specialized tasks.
Do you need cleared labelers for CUI or classified data?
For classified, yes, without exception. For CUI, it depends on the CUI category and the program authorization. Many CUI categories require labelers who are U.S. persons with at least public-trust clearance, working in an authorized facility or VDI. Cleared-labeler vendors exist; availability and cost vary.
How do you resolve labeler disagreement?
Tiered adjudication: two labelers independently annotate. Agreement: accept. Disagreement: third-labeler tiebreak or expert adjudication. For high-stakes labels, always expert adjudication. All votes are logged; the label of record is the adjudicated label, not the majority.
What does a federal labeling audit actually check?
Guidelines completeness and versioning, labeler training records, IAA over time, disagreement-resolution trail, random-sample expert review, classification handling of the data during labeling, and access logs to the labeling tool. Every label of record should trace back to a specific labeler at a specific timestamp with a specific guideline version.
How much does federal data labeling cost?
Planning numbers: simple tabular / classification at $0.05-0.50 per label with cleared labelers. Document-level annotation with SME review at $3-20 per document. Bounding-box annotation for imagery at $0.10-1.00 per box. Complex policy-classification with adjudication: $5-50 per item. Cleared-labeler vendors add 30-100% over commercial rates.

Standing up a labeling operation for a federal program?

We design labeling workflows, guidelines, and QA loops that produce ground truth your downstream model actually deserves.