Why federal MLOps is its own discipline
Commercial MLOps optimizes for velocity: ship fast, monitor loudly, roll back cheaply, retrain on a dime. Federal MLOps optimizes for defensibility: every model that reaches production has a traceable lineage, a signed artifact, a documented evaluation, a rollback path a human can execute, and an audit trail that a 3PAO can examine three years later. The tools overlap heavily. The lifecycle does not.
The friction is structural. A commercial team retrains nightly because a small regression costs nothing. A federal system on an ATO cannot silently replace a production model without violating CM-3 (configuration change control) and without producing the evidence an ISSM will require to approve the change. Any pattern that works commercially but cannot be documented and rolled back in writing will not survive a federal environment.
This is the MLOps stack we ship for production federal ML, the components we wire together, the monitoring we build in, and the documentation that makes it live cleanly inside an RMF package.
The reference architecture
The stack is predictable once you have built a few of these. The pieces we run in production:
- Source control for training code, inference code, feature definitions, prompts (if any), and configuration. GitHub Enterprise or GitLab, inside the boundary or connected through a governed egress.
- Feature store for reusable features with online and offline parity. Feast for self-hosted, SageMaker Feature Store in GovCloud, Databricks Feature Store on Azure Gov.
- Model registry for versioned, signed, evaluated model artifacts. MLflow, SageMaker Model Registry, or Azure ML Registry.
- Training orchestration with reproducible compute. SageMaker Training Jobs, Azure ML Jobs, Vertex Training, or Kubeflow Pipelines.
- CI/CD for models — the pipeline that takes a commit or a trigger and produces a registered, validated, gated model candidate.
- Serving layer with canary and shadow deployment. SageMaker Endpoints, Azure ML Online Endpoints, KServe on EKS/AKS, or a homegrown FastAPI + Ray Serve stack.
- Monitoring for drift, performance, latency, cost, and fairness. Evidently, WhyLabs, Arize, Fiddler, or open-source stacks around Prometheus and custom exporters.
- Evidence pipeline that continuously produces SSP artifacts: inventory, lineage, evaluation summaries, drift reports, change logs.
Model registry and versioning
The registry is the authoritative record of "what is in production, how it got there, and what it was validated against." Every production inference call must resolve to a specific, immutable model version. No "latest" tags in production.
What belongs in the registry entry:
- Model artifact (weights, code, tokenizer, preprocessors) with a signed hash.
- Training data reference — dataset ID and version, snapshot hash, record count.
- Training code reference — git SHA, base image digest, library versions (Python lock file).
- Evaluation results — metrics on the held-out set, segment-level breakdowns for fairness, comparison to the previous champion.
- Model card — intended use, limitations, known failure modes, recommended monitoring.
- Approver — who signed off, when, against which test run.
- Lineage edges — pointers to upstream features, downstream consumers.
MLflow gives you most of this out of the box if you configure it correctly and back it with Postgres and S3 (or their GovCloud equivalents). SageMaker Model Registry is more opinionated but bakes the approval workflow and lineage in. Choice depends on whether you are more comfortable running infrastructure or consuming managed services — both work inside GovCloud or Azure Government.
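Whichever registry you run, the "full metadata or no registration" rule is easy to enforce as a gate in the registration step. A minimal sketch — the field names here are illustrative, not any particular registry's schema:

```python
import hashlib
from pathlib import Path

# Fields the registry entry must carry before a candidate is accepted.
# Names are illustrative, not MLflow's or SageMaker's schema.
REQUIRED_FIELDS = {
    "artifact_sha256", "dataset_id", "dataset_snapshot_hash",
    "git_sha", "base_image_digest", "eval_metrics",
    "model_card", "approver", "lineage",
}

def validate_registry_entry(entry: dict) -> list:
    """Return the required fields missing from a candidate entry; empty means register."""
    return sorted(REQUIRED_FIELDS - entry.keys())

def artifact_sha256(path: Path) -> str:
    """Hash the artifact so the entry pins an immutable version — no 'latest'."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()
```

A CI step that calls `validate_registry_entry` and fails the pipeline on a non-empty result turns the metadata checklist from policy into an enforced gate.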
Signing model artifacts
Under SI-7, model weights are software components and require integrity verification. Sign artifacts with Cosign or Sigstore; store signatures alongside the artifact in the registry; verify signatures at load time in every serving replica. If a serving pod loads an unsigned or wrong-signature artifact, it fails closed and alerts. SBOM coverage via SPDX or CycloneDX should include model components, not just container layers.
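The fail-closed shape of the load-time check is worth making concrete. This sketch verifies only the pinned digest — a real deployment verifies the full Cosign/Sigstore signature, but the control flow in the serving replica is the same; the exception name is illustrative:

```python
import hashlib
import hmac

class ArtifactIntegrityError(RuntimeError):
    """Raised when a serving replica must fail closed rather than serve."""

def verify_artifact(artifact_bytes: bytes, pinned_sha256: str) -> None:
    """Fail closed if the loaded artifact does not match the registry's pinned hash.

    Production replicas verify the Cosign signature as well; the digest check
    shown here is the minimum SI-7 integrity gate at load time.
    """
    actual = hashlib.sha256(artifact_bytes).hexdigest()
    if not hmac.compare_digest(actual, pinned_sha256):
        # Do not serve. Alert and exit so the orchestrator falls back to the champion.
        raise ArtifactIntegrityError(
            f"artifact digest {actual[:12]}... != pinned {pinned_sha256[:12]}..."
        )
```

The important property is that verification happens in every replica at load time, not once at registration — a compromised artifact store is caught at the last step before serving.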
Feature stores and training/serving parity
The single biggest source of silent production failure is training/serving skew — features computed one way during training and another way at inference. A feature store with a single source of truth for feature logic, accessible both offline (for training) and online (for inference), eliminates a class of bugs that will otherwise recur.
What matters in a federal context:
- Encryption at rest with CMKs. Both offline (S3/Blob/BigQuery-equivalent) and online (DynamoDB, Redis, Cosmos) stores use customer-managed keys.
- Access logging. Every feature read, every feature write, landed in the audit log under AU-2.
- Data lineage. Each feature points back to source tables and the transform code. When an upstream source schema changes, you need to find every model that consumes it.
- PII and CUI tagging. Features carry classification labels. Training jobs and inference callers are authorized or denied based on those labels, not based on table-level ACLs alone.
Feast is the open-source workhorse; it runs anywhere and is the right choice when you want the feature logic fully inside your boundary. SageMaker Feature Store and Databricks Feature Store are the managed options; both are available in GovCloud/Gov regions.
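The parity pattern reduces to one definition of the feature logic that both paths import. A minimal sketch with a hypothetical `days_since_last_event` feature — a feature store formalizes, stores, and scales exactly this pattern:

```python
from datetime import datetime, timezone

def days_since_last_event(last_event: datetime, as_of: datetime) -> float:
    """The single source of truth for this feature's logic, imported by BOTH
    the offline training job and the online serving path. (Illustrative
    feature; a feature store like Feast makes this the enforced default.)"""
    return (as_of - last_event).total_seconds() / 86400.0

# Offline: applied over a snapshot, time-pinned to the training cutoff.
def build_training_feature(rows: list, cutoff: datetime) -> list:
    return [days_since_last_event(r["last_event"], cutoff) for r in rows]

# Online: applied to a single live record at request time.
def build_serving_feature(record: dict, now: datetime = None) -> float:
    now = now or datetime.now(timezone.utc)
    return days_since_last_event(record["last_event"], now)
```

The failure mode this prevents is the one described above: a Jupyter-computed version of the feature in training and a "close enough" pandas reimplementation at inference.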
CI/CD for models
Model CI/CD is CI/CD with three extra concerns: the artifact is large and data-dependent, the tests are statistical, and the deploy is staged over traffic rather than over infrastructure.
The pipeline
1. Trigger
- Git push to the model repo
- Scheduled retrain (weekly/monthly)
- Drift alert from monitoring
- Manual promotion
2. Data validation
- Schema check against contract
- Volume and freshness sanity checks
- Great Expectations or Deequ suites
3. Training
- Reproducible compute job (container digest pinned)
- Random seed set
- Feature snapshot (time-pinned)
- Outputs: weights, training metrics, training log
4. Evaluation
- Held-out test set (time-based split where applicable)
- Segment metrics (demographic, geographic, temporal)
- Fairness gates
- Regression vs. current champion
- Statistical test (bootstrap CIs, not point estimates)
5. Registration
- Push to model registry with full metadata
- Sign the artifact
- Status = "candidate"
6. Staging deploy
- Shadow traffic for N hours
- Compare predictions to champion on live inputs
- Watch latency, cost, error rates
7. Canary deploy
- 1% -> 5% -> 25% -> 100% over days
- Gate at each stage on SLOs and drift signals
8. Promotion
- Human approver (ISSO / product owner)
- Registry status = "champion"
- Archive previous champion for rollback
The statistical piece that commercial teams sometimes skip: bootstrap confidence intervals on your evaluation metrics. A point estimate telling you the new model is 0.2 percent better means nothing when the 95 percent CI spans zero. A federal assessor who has seen a few bad models will ask how you know the lift is real.
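The bootstrap check itself is short. A minimal sketch over paired 0/1 correctness indicators on the same held-out set; resample counts and the seed are illustrative:

```python
import random

def bootstrap_ci_diff(champion_correct, candidate_correct,
                      n_boot=2000, alpha=0.05, seed=0):
    """95% bootstrap CI on the accuracy lift of candidate over champion.

    Inputs are paired 0/1 correctness indicators on the held-out set.
    If the returned interval spans zero, the observed lift is not
    distinguishable from noise and should not justify promotion.
    """
    rng = random.Random(seed)
    n = len(champion_correct)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        champ = sum(champion_correct[i] for i in idx) / n
        cand = sum(candidate_correct[i] for i in idx) / n
        diffs.append(cand - champ)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Wiring this into the evaluation stage gives the promotion gate a principled answer to "is the lift real?" instead of a point estimate.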
Shadow and canary deployment
Shadow traffic lets you compare the candidate to the champion on real production inputs without user-visible impact. The candidate serves predictions that are logged and scored but not returned to callers. Run shadow for long enough to see the full input distribution — a day is often not enough for cyclical workloads.
Canary deployment ramps live traffic from 1 percent to 100 percent over hours or days with automated gates at each level. Every gate checks: latency SLO, error rate, per-segment performance, drift against training distribution, cost per prediction. A gate failure rolls back automatically.
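The gate at each canary stage can be sketched as a pure function over that stage's observed metrics. Threshold values here are illustrative placeholders, not recommendations — real values come from the system's SLOs:

```python
from dataclasses import dataclass

@dataclass
class GateThresholds:
    # Illustrative values; set these from your documented SLOs.
    max_p95_latency_ms: float = 250.0
    max_error_rate: float = 0.01
    max_feature_psi: float = 0.25
    max_cost_per_1k: float = 0.50

def canary_gate(metrics: dict, t: GateThresholds = GateThresholds()) -> list:
    """Return the gate failures at a canary stage; empty list means advance.

    Any non-empty result triggers automatic rollback to the champion.
    """
    failures = []
    if metrics["p95_latency_ms"] > t.max_p95_latency_ms:
        failures.append("latency SLO")
    if metrics["error_rate"] > t.max_error_rate:
        failures.append("error rate")
    if metrics["max_psi"] > t.max_feature_psi:
        failures.append("drift vs training distribution")
    if metrics["cost_per_1k"] > t.max_cost_per_1k:
        failures.append("cost per prediction")
    return failures
```

Returning the list of failures, rather than a boolean, gives the rollback alert and the CM-3 change record something concrete to log.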
Monitoring: drift, performance, latency, cost
You monitor several families of signals in production: drift, performance, operational metrics, and segment fairness. Miss one and you will eventually deploy a model that looks fine on the dashboard but is quietly wrong.
Data drift
The input distribution is changing relative to what the model was trained on. Measure on each input feature independently (one-dimensional drift) and on the joint distribution via dimensionality reduction plus a density test (multidimensional drift).
- KS test for continuous features.
- Population Stability Index (PSI) for categorical and binned features. PSI > 0.1 is a warning, > 0.25 is a problem.
- Wasserstein distance for a more robust continuous comparison.
- Domain classifier — train a model to distinguish training data from current production data. If AUC is meaningfully > 0.5, the distributions have drifted.
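Of these, PSI is the one teams most often implement by hand, and it is only a few lines over pre-binned proportions. The epsilon guard for empty bins is a common convention, not part of the definition:

```python
import math

def psi(expected, actual):
    """Population Stability Index over pre-binned proportions.

    `expected` is the training-time distribution across bins, `actual` the
    current production distribution. By the thresholds above, > 0.1 is a
    warning and > 0.25 is a problem.
    """
    eps = 1e-6  # guard empty bins so the log term stays finite
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * math.log(a / e)
    return total
```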
Prediction drift
The output distribution is shifting. This is a useful second signal because it catches shifts that are too subtle to show up on any one input feature but still change what the model produces. Same statistical tests applied to predicted labels or scores.
Concept drift and performance
The relationship between inputs and outputs has changed. The only direct, reliable way to detect it is with ground truth, and ground truth is usually delayed, partial, or expensive. Patterns that work:
- Sample a small fraction of predictions for human review on a continuous cadence. Review is expensive; the sample is small.
- Wait for natural labels where the use case produces them (fraud chargebacks, case resolution outcomes, classifier overrides by users).
- Proxy metrics — downstream business or mission metrics that correlate with model quality.
Latency, throughput, cost
Operational SLOs in the same place as accuracy metrics. p50, p95, p99 latency by route. Tokens or records per second. Cost per 1,000 predictions or 1,000 tokens. Budget alerts at 50 percent, 80 percent, 100 percent of allocation.
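The budget alert thresholds reduce to a simple crossing check; the 50/80/100 split mirrors the allocation levels above:

```python
def budget_alerts(spend: float, allocation: float,
                  levels=(0.5, 0.8, 1.0)) -> list:
    """Return the alert levels (as percentages) the current spend has crossed.

    Illustrative sketch; in practice this runs as a scheduled check against
    the billing export and fires one alert per newly crossed level.
    """
    return [int(l * 100) for l in levels if spend >= l * allocation]
```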
Fairness and segment performance
Performance broken out by the segments that matter to the mission: demographic, geographic, organizational, temporal. A model with strong aggregate accuracy can still fail a specific subgroup catastrophically. NIST AI RMF and the AI-specific OMB guidance implementing EO 14110 expect this to be a live metric, not an annual review.
Retraining: triggers and cadence
Trigger-based retraining beats calendar-based for most systems. The triggers we wire:
- Drift threshold breach. PSI over 0.25 on a primary feature for N consecutive days.
- Performance floor. Ground-truth metric drops below a defined floor, sustained over a window.
- Upstream change. Schema change, source change, or freshness SLO breach in the training data path.
- Time-based floor. Even with no other trigger, a retrain and full re-evaluation cycle at least every 6 to 12 months for stable domains. This catches slow drift the automated tests missed.
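The first trigger above — a sustained PSI breach rather than a single spike — can be sketched as a check over the feature's daily PSI history; the threshold and window are illustrative:

```python
def drift_trigger(psi_history, threshold: float = 0.25,
                  consecutive_days: int = 3) -> bool:
    """True when a primary feature's PSI has breached the threshold for
    N consecutive days. Requiring consecutive breaches filters out
    one-day spikes (holidays, batch backfills) that self-correct."""
    if len(psi_history) < consecutive_days:
        return False
    return all(p > threshold for p in psi_history[-consecutive_days:])
```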
Every retrain flows through the same CI/CD pipeline. No one retrains on a laptop and pushes to production. Every retrain is a CM-3 configuration change with an approval trail.
Rollback and the ATO boundary question
The rollback story matters more in federal than commercial because the window between detection and mitigation is longer. Two questions you must answer in writing before production:
- What is the rollback condition? Which signals, at which thresholds, trigger rollback? Who approves? What is the maximum time to rollback?
- What is the rollback mechanism? Is the previous champion still loaded in a warm replica, or does rollback require a redeploy? How is traffic shifted? How is rollback tested?
Patterns that work: keep the previous champion deployed alongside the current champion with 0 percent traffic, routed via a feature flag. Rollback is flipping the flag — seconds, not hours. Practice the rollback quarterly so your ops team is not learning the procedure during an incident.
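The warm-standby pattern can be sketched as a router that holds both loaded versions and swaps on a single flag; names and the model-lookup mechanism are illustrative:

```python
# Both model versions stay loaded and warm; rollback is a flag flip,
# not a redeploy. MODELS stands in for whatever holds the loaded objects.
MODELS = {}

class Router:
    def __init__(self, champion: str, previous: str):
        self.active = champion     # the flag: which version serves traffic
        self.previous = previous   # last champion, kept warm at 0% traffic

    def predict(self, features):
        return MODELS[self.active].predict(features)

    def rollback(self):
        """Flip the flag — seconds, not hours. Rehearse this quarterly."""
        self.active, self.previous = self.previous, self.active
```

In practice the flag lives in a feature-flag service or config store so the flip is atomic across replicas, but the serving-side shape is this simple.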
The ATO boundary question is trickier. If your model is served from within an authorization boundary and calls a foundation model or embedding API outside the boundary, that external dependency's version changes are configuration changes in your system even though you did not initiate them. You need the ability to pin the upstream version, detect when the provider has bumped it, and treat the bump as a CM-3 event — not a surprise. For hosted LLM APIs specifically, pin the versioned model ID (e.g., claude-opus-4-7-20260415, not claude-opus-latest) and run regression on every detectable upstream change.
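Detecting the bump reduces to comparing the model ID the provider reports on each response against the pinned ID. How you read the reported ID is provider-specific, so this sketch keeps that part abstract; the event shape is illustrative:

```python
from typing import Optional

def check_upstream_pin(reported: str, pinned: str) -> Optional[dict]:
    """Return a CM-3 change event when the provider-reported model ID no
    longer matches the pinned version; None while the pin holds.

    Run on every response (or on a sampled fraction); a mismatch should
    open a change record and trigger the regression suite, not silently
    accept the new version.
    """
    if reported == pinned:
        return None
    return {
        "event": "upstream_version_bump",
        "pinned": pinned,
        "observed": reported,
        "action": "open CM-3 change record; run regression suite",
    }
```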
Documentation for RMF packages
The artifacts that turn an ML system into an ATO-ready system:
- Model card per production model — intended use, inputs, outputs, training data summary, evaluation summary, known limitations, recommended monitoring, contact. Update on every version.
- Data sheet per training dataset — source, collection process, consent and legal basis, preprocessing, PII and CUI handling, retention.
- Fairness and bias assessment — the segments evaluated, results, mitigations applied, residual risks.
- Threat model for the ML attack surface — data poisoning, model inversion, membership inference, evasion attacks, model theft. Mitigations and residual risk.
- Monitoring plan — which metrics, which thresholds, which alerts, who owns each.
- Retraining procedure — triggers, approvals, testing gates, rollback plan.
- Continuous monitoring package — the ongoing evidence the system is still within its authorized baseline.
NIST AI RMF 1.0 gives you the governance structure. NIST SP 800-218A (the SSDF community profile for secure development of generative AI and dual-use foundation models) gives you the lifecycle controls. Map both against your 800-53 baseline and keep the mapping in the SSP.
GovCloud and Government-region specifics
Tool availability in GovCloud and Azure Government lags commercial by 3 to 12 months. What we check before committing a stack:
- SageMaker Feature Store, Model Registry, and Clarify are available in AWS GovCloud (US). Verify the specific sub-services you need (SageMaker Pipelines, Studio, JumpStart models) per region.
- Azure ML is available in Azure Government with most features at parity; specific foundation models in Azure AI Foundry have variable availability.
- MLflow (self-hosted) runs anywhere because it is just containers and Postgres/S3. Often the safest default.
- Open-source ML frameworks (PyTorch, TensorFlow, scikit-learn, XGBoost) are unrestricted; pull through your artifact proxy rather than from public PyPI/Hub.
- FIPS-validated crypto is the default in GovCloud/Gov regions but must be confirmed per service; some managed services use non-FIPS endpoints by default and need explicit configuration.
Common failure modes we see
- No feature store, no training/serving parity. Training uses a Jupyter-computed feature; inference uses a slightly different pandas transform. Silent accuracy loss.
- "Latest" tag in production. Upstream silently promotes a new version, production behavior changes, no one remembers rolling out a change.
- Monitoring only aggregates. P95 latency and overall accuracy look fine; a specific segment is catastrophically wrong for three weeks.
- No held-out eval on retraining. Retrain uses the same data that is also in the eval set; metrics inflate.
- No rollback rehearsal. Rollback plan exists on paper; first time it runs is during an incident.
- Model card is stale. Documentation from v1 still in the registry entry for v7.
Where this fits in our practice
We build federal MLOps stacks end to end: feature stores, registries, CI/CD pipelines, drift monitoring, and the evidence pipeline that turns day-to-day operations into RMF artifacts. See our machine learning and cloud architecture capabilities, and our GovCloud vs. Azure Government comparison when you are picking a platform.