Why federal MLOps is its own discipline
Commercial MLOps optimizes for velocity: ship fast, monitor loudly, roll back cheaply, retrain on a dime. Federal MLOps optimizes for defensibility: every model that reaches production has a traceable lineage, a signed artifact, a documented evaluation, a rollback path a human can execute, and an audit trail that a 3PAO can examine three years later. The tools overlap heavily. The lifecycle does not.
The friction is structural. A commercial team retrains nightly because a small regression costs nothing. A federal system on an ATO cannot silently replace a production model without violating CM-3 (configuration change control) and without producing the evidence an ISSM will require to approve the change. Any pattern that works commercially but cannot be documented and rolled back in writing will not survive a federal environment.
This is the MLOps stack we ship for production federal ML, the components we wire together, the monitoring we build in, and the documentation that makes it live cleanly inside an RMF package.
The reference architecture
The stack is predictable once you have built a few of these. The pieces we run in production:
- Source control for training code, inference code, feature definitions, prompts (if any), and configuration. GitHub Enterprise or GitLab, inside the boundary or connected through a governed egress.
- Feature store for reusable features with online and offline parity. Feast for self-hosted, SageMaker Feature Store in GovCloud, Databricks Feature Store on Azure Gov.
- Model registry for versioned, signed, evaluated model artifacts. MLflow, SageMaker Model Registry, or Azure ML Registry.
- Training orchestration with reproducible compute. SageMaker Training Jobs, Azure ML Jobs, Vertex Training, or Kubeflow Pipelines.
- CI/CD for models — the pipeline that takes a commit or a trigger and produces a registered, validated, gated model candidate.
- Serving layer with canary and shadow deployment. SageMaker Endpoints, Azure ML Online Endpoints, KServe on EKS/AKS, or a homegrown FastAPI + Ray Serve stack.
- Monitoring for drift, performance, latency, cost, and fairness. Evidently, WhyLabs, Arize, Fiddler, or open-source stacks around Prometheus and custom exporters.
- Evidence pipeline that continuously produces SSP artifacts: inventory, lineage, evaluation summaries, drift reports, change logs.
Model registry and versioning
The registry is the authoritative record of "what is in production, how it got there, and what it was validated against." Every production inference call must resolve to a specific, immutable model version. No "latest" tags in production.
What belongs in the registry entry:
- Model artifact (weights, code, tokenizer, preprocessors) with a signed hash.
- Training data reference — dataset ID and version, snapshot hash, record count.
- Training code reference — git SHA, base image digest, library versions (Python lock file).
- Evaluation results — metrics on the held-out set, segment-level breakdowns for fairness, comparison to the previous champion.
- Model card — intended use, limitations, known failure modes, recommended monitoring.
- Approver — who signed off, when, against which test run.
- Lineage edges — pointers to upstream features, downstream consumers.
MLflow gives you most of this out of the box if you configure it correctly and back it with Postgres and S3 (or their GovCloud equivalents). SageMaker Model Registry is more opinionated but bakes the approval workflow and lineage in. Choice depends on whether you are more comfortable running infrastructure or consuming managed services — both work inside GovCloud or Azure Government.
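Whichever registry you run, the "full metadata or no registration" rule is easy to enforce as a gate in the registration step. A minimal sketch — the field names here are illustrative, not any particular registry's schema:

```python
import hashlib
from pathlib import Path

# Fields the registry entry must carry before a candidate is accepted.
# Names are illustrative, not MLflow's or SageMaker's schema.
REQUIRED_FIELDS = {
    "artifact_sha256", "dataset_id", "dataset_snapshot_hash",
    "git_sha", "base_image_digest", "eval_metrics",
    "model_card", "approver", "lineage",
}

def validate_registry_entry(entry: dict) -> list:
    """Return the required fields missing from a candidate entry; empty means register."""
    return sorted(REQUIRED_FIELDS - entry.keys())

def artifact_sha256(path: Path) -> str:
    """Hash the artifact so the entry pins an immutable version — no 'latest'."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()
```

A CI step that calls `validate_registry_entry` and fails the pipeline on a non-empty result turns the metadata checklist from policy into an enforced gate.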
Signing model artifacts
Under SI-7, model weights are software components and require integrity verification. Sign artifacts with Cosign or Sigstore; store signatures alongside the artifact in the registry; verify signatures at load time in every serving replica. If a serving pod loads an unsigned or wrong-signature artifact, it fails closed and alerts. SBOM coverage via SPDX or CycloneDX should include model components, not just container layers.
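The fail-closed shape of the load-time check is worth making concrete. This sketch verifies only the pinned digest — a real deployment verifies the full Cosign/Sigstore signature, but the control flow in the serving replica is the same; the exception name is illustrative:

```python
import hashlib
import hmac

class ArtifactIntegrityError(RuntimeError):
    """Raised when a serving replica must fail closed rather than serve."""

def verify_artifact(artifact_bytes: bytes, pinned_sha256: str) -> None:
    """Fail closed if the loaded artifact does not match the registry's pinned hash.

    Production replicas verify the Cosign signature as well; the digest check
    shown here is the minimum SI-7 integrity gate at load time.
    """
    actual = hashlib.sha256(artifact_bytes).hexdigest()
    if not hmac.compare_digest(actual, pinned_sha256):
        # Do not serve. Alert and exit so the orchestrator falls back to the champion.
        raise ArtifactIntegrityError(
            f"artifact digest {actual[:12]}... != pinned {pinned_sha256[:12]}..."
        )
```

The important property is that verification happens in every replica at load time, not once at registration — a compromised artifact store is caught at the last step before serving.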
Feature stores and training/serving parity
The single biggest source of silent production failure is training/serving skew — features computed one way during training and another way at inference. A feature store with a single source of truth for feature logic, accessible both offline (for training) and online (for inference), eliminates a class of bugs that will otherwise recur.
What matters in a federal context:
- Encryption at rest with CMKs. Both offline (S3/Blob/BigQuery-equivalent) and online (DynamoDB, Redis, Cosmos) stores use customer-managed keys.
- Access logging. Every feature read, every feature write, landed in the audit log under AU-2.
- Data lineage. Each feature points back to source tables and the transform code. When an upstream source schema changes, you need to find every model that consumes it.
- PII and CUI tagging. Features carry classification labels. Training jobs and inference callers are authorized or denied based on those labels, not based on table-level ACLs alone.
Feast is the open-source workhorse; it runs anywhere and is the right choice when you want the feature logic fully inside your boundary. SageMaker Feature Store and Databricks Feature Store are the managed options; both are available in GovCloud/Gov regions.
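The parity pattern reduces to one definition of the feature logic that both paths import. A minimal sketch with a hypothetical `days_since_last_event` feature — a feature store formalizes, stores, and scales exactly this pattern:

```python
from datetime import datetime, timezone

def days_since_last_event(last_event: datetime, as_of: datetime) -> float:
    """The single source of truth for this feature's logic, imported by BOTH
    the offline training job and the online serving path. (Illustrative
    feature; a feature store like Feast makes this the enforced default.)"""
    return (as_of - last_event).total_seconds() / 86400.0

# Offline: applied over a snapshot, time-pinned to the training cutoff.
def build_training_feature(rows: list, cutoff: datetime) -> list:
    return [days_since_last_event(r["last_event"], cutoff) for r in rows]

# Online: applied to a single live record at request time.
def build_serving_feature(record: dict, now: datetime = None) -> float:
    now = now or datetime.now(timezone.utc)
    return days_since_last_event(record["last_event"], now)
```

The failure mode this prevents is the one described above: a Jupyter-computed version of the feature in training and a "close enough" pandas reimplementation at inference.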
CI/CD for models
Model CI/CD is CI/CD with three extra concerns: the artifact is large and data-dependent, the tests are statistical, and the deploy is staged over traffic rather than over infrastructure.
The pipeline
1. Trigger
- Git push to the model repo
- Scheduled retrain (weekly/monthly)
- Drift alert from monitoring
- Manual promotion
2. Data validation
- Schema check against contract
- Volume and freshness sanity checks
- Great Expectations or Deequ suites
3. Training
- Reproducible compute job (container digest pinned)
- Random seed set
- Feature snapshot (time-pinned)
- Outputs: weights, training metrics, training log
4. Evaluation
- Held-out test set (time-based split where applicable)
- Segment metrics (demographic, geographic, temporal)
- Fairness gates
- Regression vs. current champion
- Statistical test (bootstrap CIs, not point estimates)
5. Registration
- Push to model registry with full metadata
- Sign the artifact
- Status = "candidate"
6. Staging deploy
- Shadow traffic for N hours
- Compare predictions to champion on live inputs
- Watch latency, cost, error rates
7. Canary deploy
- 1% -> 5% -> 25% -> 100% over days
- Gate at each stage on SLOs and drift signals
8. Promotion
- Human approver (ISSO / product owner)
- Registry status = "champion"
- Archive previous champion for rollback
The statistical piece that commercial teams sometimes skip: bootstrap confidence intervals on your evaluation metrics. A point estimate telling you the new model is 0.2 percent better means nothing when the 95 percent CI spans zero. A federal assessor who has seen a few bad models will ask how you know the lift is real.
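The bootstrap check itself is short. A minimal sketch over paired 0/1 correctness indicators on the same held-out set; resample counts and the seed are illustrative:

```python
import random

def bootstrap_ci_diff(champion_correct, candidate_correct,
                      n_boot=2000, alpha=0.05, seed=0):
    """95% bootstrap CI on the accuracy lift of candidate over champion.

    Inputs are paired 0/1 correctness indicators on the held-out set.
    If the returned interval spans zero, the observed lift is not
    distinguishable from noise and should not justify promotion.
    """
    rng = random.Random(seed)
    n = len(champion_correct)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        champ = sum(champion_correct[i] for i in idx) / n
        cand = sum(candidate_correct[i] for i in idx) / n
        diffs.append(cand - champ)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Wiring this into the evaluation stage gives the promotion gate a principled answer to "is the lift real?" instead of a point estimate.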
Shadow and canary deployment
Shadow traffic lets you compare the candidate to the champion on real production inputs without user-visible impact. The candidate serves predictions that are logged and scored but not returned to callers. Run shadow for long enough to see the full input distribution — a day is often not enough for cyclical workloads.
Canary deployment ramps live traffic from 1 percent to 100 percent over hours or days with automated gates at each level. Every gate checks: latency SLO, error rate, per-segment performance, drift against training distribution, cost per prediction. A gate failure rolls back automatically.
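The gate at each canary stage can be sketched as a pure function over that stage's observed metrics. Threshold values here are illustrative placeholders, not recommendations — real values come from the system's SLOs:

```python
from dataclasses import dataclass

@dataclass
class GateThresholds:
    # Illustrative values; set these from your documented SLOs.
    max_p95_latency_ms: float = 250.0
    max_error_rate: float = 0.01
    max_feature_psi: float = 0.25
    max_cost_per_1k: float = 0.50

def canary_gate(metrics: dict, t: GateThresholds = GateThresholds()) -> list:
    """Return the gate failures at a canary stage; empty list means advance.

    Any non-empty result triggers automatic rollback to the champion.
    """
    failures = []
    if metrics["p95_latency_ms"] > t.max_p95_latency_ms:
        failures.append("latency SLO")
    if metrics["error_rate"] > t.max_error_rate:
        failures.append("error rate")
    if metrics["max_psi"] > t.max_feature_psi:
        failures.append("drift vs training distribution")
    if metrics["cost_per_1k"] > t.max_cost_per_1k:
        failures.append("cost per prediction")
    return failures
```

Returning the list of failures, rather than a boolean, gives the rollback alert and the CM-3 change record something concrete to log.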
Monitoring: drift, performance, latency, cost
You monitor several families of signals in production: drift, performance, operational metrics, and segment fairness. Miss one and you will eventually deploy a model that looks fine on the dashboard but is quietly wrong.
Data drift
The input distribution is changing relative to what the model was trained on. Measure on each input feature independently (one-dimensional drift) and on the joint distribution via dimensionality reduction plus a density test (multidimensional drift).
- KS test for continuous features.
- Population Stability Index (PSI) for categorical and binned features. PSI > 0.1 is a warning, > 0.25 is a problem.
- Wasserstein distance for a more robust continuous comparison.
- Domain classifier — train a model to distinguish training data from current production data. If AUC is meaningfully > 0.5, the distributions have drifted.
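Of these, PSI is the one teams most often implement by hand, and it is only a few lines over pre-binned proportions. The epsilon guard for empty bins is a common convention, not part of the definition:

```python
import math

def psi(expected, actual):
    """Population Stability Index over pre-binned proportions.

    `expected` is the training-time distribution across bins, `actual` the
    current production distribution. By the thresholds above, > 0.1 is a
    warning and > 0.25 is a problem.
    """
    eps = 1e-6  # guard empty bins so the log term stays finite
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * math.log(a / e)
    return total
```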
Prediction drift
The output distribution is shifting. This is a useful second signal because it catches shifts that are too subtle to show up on any one input feature but still change what the model produces. Same statistical tests applied to predicted labels or scores.
Concept drift and performance
The relationship between inputs and outputs has changed. The only direct, reliable way to detect it is with ground truth, and ground truth is usually delayed, partial, or expensive. Patterns that work:
- Sample a small fraction of predictions for human review on a continuous cadence. Review is expensive; the sample is small.
- Wait for natural labels where the use case produces them (fraud chargebacks, case resolution outcomes, classifier overrides by users).
- Proxy metrics — downstream business or mission metrics that correlate with model quality.
Latency, throughput, cost
Operational SLOs in the same place as accuracy metrics. p50, p95, p99 latency by route. Tokens or records per second. Cost per 1,000 predictions or 1,000 tokens. Budget alerts at 50 percent, 80 percent, 100 percent of allocation.
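The budget alert thresholds reduce to a simple crossing check; the 50/80/100 split mirrors the allocation levels above:

```python
def budget_alerts(spend: float, allocation: float,
                  levels=(0.5, 0.8, 1.0)) -> list:
    """Return the alert levels (as percentages) the current spend has crossed.

    Illustrative sketch; in practice this runs as a scheduled check against
    the billing export and fires one alert per newly crossed level.
    """
    return [int(l * 100) for l in levels if spend >= l * allocation]
```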
Fairness and segment performance
Performance broken out by the segments that matter to the mission: demographic, geographic, organizational, temporal. A model with strong aggregate accuracy can still fail a specific subgroup catastrophically. NIST AI RMF and the AI-specific OMB guidance implementing EO 14110 expect this to be a live metric, not an annual review.
Retraining: triggers and cadence
Trigger-based retraining beats calendar-based for most systems. The triggers we wire:
- Drift threshold breach. PSI over 0.25 on a primary feature for N consecutive days.
- Performance floor. Ground-truth metric drops below a defined floor, sustained over a window.
- Upstream change. Schema change, source change, or freshness SLO breach in the training data path.
- Time-based floor. Even with no other trigger, a retrain and full re-evaluation cycle at least every 6 to 12 months for stable domains. This catches slow drift the automated tests missed.
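The first trigger above — a sustained PSI breach rather than a single spike — can be sketched as a check over the feature's daily PSI history; the threshold and window are illustrative:

```python
def drift_trigger(psi_history, threshold: float = 0.25,
                  consecutive_days: int = 3) -> bool:
    """True when a primary feature's PSI has breached the threshold for
    N consecutive days. Requiring consecutive breaches filters out
    one-day spikes (holidays, batch backfills) that self-correct."""
    if len(psi_history) < consecutive_days:
        return False
    return all(p > threshold for p in psi_history[-consecutive_days:])
```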
Every retrain flows through the same CI/CD pipeline. No one retrains on a laptop and pushes to production. Every retrain is a CM-3 configuration change with an approval trail.
Rollback and the ATO boundary question
The rollback story matters more in federal than commercial because the window between detection and mitigation is longer. Two questions you must answer in writing before production:
- What is the rollback condition? Which signals, at which thresholds, trigger rollback? Who approves? What is the maximum time to rollback?
- What is the rollback mechanism? Is the previous champion still loaded in a warm replica, or does rollback require a redeploy? How is traffic shifted? How is rollback tested?
Patterns that work: keep the previous champion deployed alongside the current champion with 0 percent traffic, routed via a feature flag. Rollback is flipping the flag — seconds, not hours. Practice the rollback quarterly so your ops team is not learning the procedure during an incident.
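The warm-standby pattern can be sketched as a router that holds both loaded versions and swaps on a single flag; names and the model-lookup mechanism are illustrative:

```python
# Both model versions stay loaded and warm; rollback is a flag flip,
# not a redeploy. MODELS stands in for whatever holds the loaded objects.
MODELS = {}

class Router:
    def __init__(self, champion: str, previous: str):
        self.active = champion     # the flag: which version serves traffic
        self.previous = previous   # last champion, kept warm at 0% traffic

    def predict(self, features):
        return MODELS[self.active].predict(features)

    def rollback(self):
        """Flip the flag — seconds, not hours. Rehearse this quarterly."""
        self.active, self.previous = self.previous, self.active
```

In practice the flag lives in a feature-flag service or config store so the flip is atomic across replicas, but the serving-side shape is this simple.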
The ATO boundary question is trickier. If your model is served from within an authorization boundary and calls a foundation model or embedding API outside the boundary, that external dependency's version changes are configuration changes in your system even though you did not initiate them. You need the ability to pin the upstream version, detect when the provider has bumped it, and treat the bump as a CM-3 event — not a surprise. For hosted LLM APIs specifically, pin the versioned model ID (e.g., claude-opus-4-7-20260415, not claude-opus-latest) and run regression on every detectable upstream change.
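Detecting the bump reduces to comparing the model ID the provider reports on each response against the pinned ID. How you read the reported ID is provider-specific, so this sketch keeps that part abstract; the event shape is illustrative:

```python
from typing import Optional

def check_upstream_pin(reported: str, pinned: str) -> Optional[dict]:
    """Return a CM-3 change event when the provider-reported model ID no
    longer matches the pinned version; None while the pin holds.

    Run on every response (or on a sampled fraction); a mismatch should
    open a change record and trigger the regression suite, not silently
    accept the new version.
    """
    if reported == pinned:
        return None
    return {
        "event": "upstream_version_bump",
        "pinned": pinned,
        "observed": reported,
        "action": "open CM-3 change record; run regression suite",
    }
```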
Documentation for RMF packages
The artifacts that turn an ML system into an ATO-ready system:
- Model card per production model — intended use, inputs, outputs, training data summary, evaluation summary, known limitations, recommended monitoring, contact. Update on every version.
- Data sheet per training dataset — source, collection process, consent and legal basis, preprocessing, PII and CUI handling, retention.
- Fairness and bias assessment — the segments evaluated, results, mitigations applied, residual risks.
- Threat model for the ML attack surface — data poisoning, model inversion, membership inference, evasion attacks, model theft. Mitigations and residual risk.
- Monitoring plan — which metrics, which thresholds, which alerts, who owns each.
- Retraining procedure — triggers, approvals, testing gates, rollback plan.
- Continuous monitoring package — the ongoing evidence the system is still within its authorized baseline.
NIST AI RMF 1.0 gives you the governance structure. NIST SP 800-218A (the SSDF community profile for secure development of generative AI and dual-use foundation models) gives you the lifecycle controls. Map both against your 800-53 baseline and keep the mapping in the SSP.
GovCloud and Government-region specifics
Tool availability in GovCloud and Azure Government lags commercial by 3 to 12 months. What we check before committing a stack:
- SageMaker Feature Store, Model Registry, and Clarify are available in AWS GovCloud (US). Verify the specific sub-services you need (SageMaker Pipelines, Studio, JumpStart models) per region.
- Azure ML is available in Azure Government with most features at parity; specific foundation models in Azure AI Foundry have variable availability.
- MLflow (self-hosted) runs anywhere because it is just containers and Postgres/S3. Often the safest default.
- Open-source ML frameworks (PyTorch, TensorFlow, scikit-learn, XGBoost) are unrestricted; pull through your artifact proxy rather than from public PyPI/Hub.
- FIPS-validated crypto is the default in GovCloud/Gov regions but must be confirmed per service; some managed services use non-FIPS endpoints by default and need explicit configuration.
Common failure modes we see
- No feature store, no training/serving parity. Training uses a Jupyter-computed feature; inference uses a slightly different pandas transform. Silent accuracy loss.
- "Latest" tag in production. Upstream silently promotes a new version, production behavior changes, no one remembers rolling out a change.
- Monitoring only aggregates. P95 latency and overall accuracy look fine; a specific segment is catastrophically wrong for three weeks.
- No held-out eval on retraining. Retrain uses the same data that is also in the eval set; metrics inflate.
- No rollback rehearsal. Rollback plan exists on paper; first time it runs is during an incident.
- Model card is stale. Documentation from v1 still in the registry entry for v7.
Where this fits in our practice
We build federal MLOps stacks end to end: feature stores, registries, CI/CD pipelines, drift monitoring, and the evidence pipeline that turns day-to-day operations into RMF artifacts. See our machine learning and cloud architecture capabilities, and our GovCloud vs. Azure Government comparison when you are picking a platform.