MLOps is the hard part
The easy part of federal machine learning is the notebook: a data scientist writes Python, fits a model, achieves 94 percent accuracy on a held-out test set, and demos the result in a PowerPoint. The hard part starts the day after the demo. The notebook becomes a production system. The system must be deployed inside an authorization boundary. The model must be monitored for drift. The data pipeline must be reproducible. Incidents must be traced back to a specific model version trained on a specific dataset. Re-training must be automated but gated. Every change must leave an audit trail that satisfies a security control assessor. This is MLOps, and in a federal context it is what separates pilots that ship from pilots that die in a brief.
The federal MLOps reference architecture
Our default architecture for federal MLOps, deployable to AWS GovCloud, Azure Government, or on-premises Kubernetes:
- Data layer: versioned data snapshots via DVC, LakeFS, or Delta Lake time travel. Data lineage captured via OpenLineage and Marquez.
- Feature store: Feast for online/offline consistency, with tie-in to agency-managed PII classifiers to prevent leakage.
- Training orchestration: Kubeflow Pipelines or Airflow, running on EKS/AKS/GKE or OpenShift. Ray for distributed training on large models.
- Experiment tracking: MLflow self-hosted (inherits agency ATO) or Weights and Biases self-managed tier. Every run tagged with data hash, code commit, and hyperparameters.
- Model registry: MLflow Model Registry or SageMaker Model Registry with stage promotion workflow (staging, production, archived) and mandatory approval gates.
- Serving: KServe, Seldon Core, BentoML, SageMaker Endpoints, or NVIDIA Triton for GPU-heavy workloads. Blue/green and canary deployment patterns standard.
- Monitoring: Evidently, Whylogs, Arize, or homegrown Prometheus-based drift metrics. Integrated with Splunk or Elastic for the agency's SIEM.
- Explainability: SHAP values cached per prediction class, LIME for tabular audit, Captum for deep learning. Model cards generated automatically on each release.
- CI/CD: GitHub Actions, GitLab CI, or Jenkins with stage gates for security scanning, license scanning, dependency check, and model eval. See our federal CI/CD page.
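The experiment-tracking convention above (every run tagged with data hash, code commit, and hyperparameters) can be sketched with a small helper. The function name and tag keys here are illustrative, not a fixed schema; the resulting dict is the shape you would pass to a tracking API such as MLflow's `mlflow.set_tags`.

```python
import hashlib
import json

def run_tags(snapshot_manifest: dict, git_commit: str, hyperparams: dict) -> dict:
    """Build the tag set attached to every experiment-tracking run.

    The data hash is computed over a canonical (sorted-key) JSON dump of
    the snapshot manifest, so the same snapshot always yields the same
    hash regardless of key order.
    """
    canonical = json.dumps(snapshot_manifest, sort_keys=True).encode()
    return {
        "data_hash": hashlib.sha256(canonical).hexdigest(),
        "code_commit": git_commit,
        # Hyperparameters are stringified because tag values are strings
        # in most tracking backends.
        **{f"hp_{k}": str(v) for k, v in hyperparams.items()},
    }

# Example: tag a run against a hypothetical DVC/LakeFS snapshot manifest.
tags = run_tags(
    {"path": "s3://agency-bucket/train", "version": "v12"},
    "a1b2c3d",
    {"lr": 0.001, "epochs": 20},
)
```

Hashing the manifest rather than the raw data keeps tagging fast while still pinning the run to an immutable snapshot identifier.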
Model registry: the single source of truth
A federal model registry is not a nice-to-have. It is the authorization artifact that lets a security control assessor answer the question: "what model is running in production right now, what data was it trained on, who approved the promotion, and what is its current performance?" Without that, the assessor cannot approve the system.
We implement registries that enforce:
- a unique semantic version on every model;
- immutable pointers to the training data snapshot and code commit;
- a mandatory model card documenting intended use and known limitations;
- a two-human promotion gate (the author cannot self-promote);
- rollback to any prior version in under 5 minutes;
- an API exposed to the agency's GRC platform for inventory auditing.
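The two-human promotion gate is simple enough to show in full. This is a minimal sketch of the policy check, with a hypothetical function name; in practice it runs inside the registry's stage-transition hook.

```python
def can_promote(author: str, approvers: set, stage: str) -> bool:
    """Two-human promotion gate.

    Promotion to production requires at least two distinct approvers,
    and the model's author may never be among them (no self-promotion).
    Lower stages still require one independent approver.
    """
    if author in approvers:
        return False  # self-promotion is blocked at any stage
    required = 2 if stage == "production" else 1
    return len(approvers) >= required
```

The check is deliberately stateless: it takes identities from the pipeline's authentication context, so the same rule can be enforced in CI and re-verified by an assessor from the audit log.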
Continuous monitoring that matters
Most ML monitoring implementations are theater. Dashboards get built, thresholds get set, and the dashboards never get looked at. Federal MLOps must avoid this by wiring monitoring to action:
- Input drift — population stability index and KS-test on each feature distribution compared to training baseline. Breach triggers a ticket, not just a red square.
- Prediction drift — shift in predicted class distribution. Combined with input drift, flags concept versus covariate drift.
- Performance decay — for systems where ground-truth labels arrive with delay, shadow evaluation on a held-out canary set runs continuously. For online learning systems, streaming A/B comparison.
- Fairness slices — performance broken down by protected attributes and subpopulations. Required for rights-impacting systems under OMB M-24-10.
- Latency and error rate — standard SRE-style golden signals, with per-model SLOs.
- Adversarial probe alerts — periodic injection of known-bad inputs to validate guardrails still work.
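Of the drift signals above, the population stability index is usually the first alarm to fire. A dependency-free sketch of the computation (Evidently and Whylogs ship production-grade versions; the 0.2 threshold is a common rule of thumb, not a standard):

```python
import math

def psi(expected: list, actual: list, bins: int = 10) -> float:
    """Population stability index between a training-baseline feature
    sample (`expected`) and a production window (`actual`).

    Bins are derived from the baseline's range; production values
    outside that range are clamped into the edge bins. PSI > 0.2 is a
    common breach threshold that should open a ticket, not just turn
    a dashboard square red.
    """
    lo, hi = min(expected), max(expected)

    def bin_fractions(sample):
        counts = [0] * bins
        for x in sample:
            if hi > lo:
                i = min(max(int((x - lo) / (hi - lo) * bins), 0), bins - 1)
            else:
                i = 0
            counts[i] += 1
        # Floor each fraction to avoid log(0) on empty bins.
        return [max(c / len(sample), 1e-6) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Because the bins are frozen from the training baseline, the same function scores every production window consistently, which is what makes the breach threshold auditable.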
Alignment with NIST AI RMF
NIST AI Risk Management Framework 1.0 and the Generative AI Profile (NIST AI 600-1) define a cycle of Govern, Map, Measure, Manage. MLOps platforms produce the evidence that auditors need in Measure and Manage: versioned training data with lineage, model cards documenting intended use and limitations, evaluation results across the model lifecycle, incident records with root cause and remediation, and continuous monitoring telemetry. We deliver MLOps deployments with AI RMF artifacts pre-generated and mapped to the relevant control families.
Continuous ATO (cATO) for ML systems
Agencies with a mature continuous authorization program want their ML systems to plug into it. We design MLOps pipelines so that: each model promotion passes automated control checks (static analysis, SBOM generation, container scanning), a documented security review gate separates staging from production, control assessment artifacts are auto-generated from pipeline logs, and the agency's eMASS or Xacta system receives a feed of change records in near real time. The result is that a new model version does not trigger a full re-ATO, only a documented change.
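The change-record feed can be sketched as a small assembler over pipeline logs. The field names below are hypothetical, since eMASS and Xacta ingestion formats vary by agency; the point is the minimum content a record needs and the tamper-evident hash over it.

```python
import datetime
import hashlib
import json

def change_record(model_name: str, version: str,
                  scan_results: dict, approver: str) -> dict:
    """Assemble one change record for the agency's GRC feed.

    `scan_results` maps each automated control check (SBOM generation,
    container scan, static analysis, ...) to a pass/fail boolean, as
    reported by the pipeline's stage gates.
    """
    record = {
        "system": model_name,
        "model_version": version,
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "controls_checked": sorted(scan_results),
        "controls_passed": all(scan_results.values()),
        "approver": approver,
    }
    # Hash the canonical record so downstream consumers can detect
    # any after-the-fact edits.
    canonical = json.dumps(record, sort_keys=True).encode()
    record["record_hash"] = hashlib.sha256(canonical).hexdigest()
    return record
```

Emitting one such record per promotion is what lets a new model version land as a documented change rather than a re-ATO trigger.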
Operational excellence patterns
- Reproducible training. Any historical model can be retrained from its frozen data snapshot in under 4 hours.
- Feature parity. Online and offline features produce identical values. Drift between the two is the silent killer of production ML.
- Shadow deployment. New models run in shadow for 2-4 weeks, producing predictions that are logged but not served, before any traffic shifts.
- Progressive rollout. 1 percent, 5 percent, 25 percent, 100 percent canary with automatic rollback on any performance regression.
- Incident postmortems. Every production ML incident gets a written postmortem with action items tracked to closure.
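The progressive rollout pattern above reduces to a short control loop. This is a sketch under stated assumptions: `eval_at` is a hypothetical callback that returns the new model's metric on live traffic at a given split, and the 0.02 tolerance is illustrative.

```python
def progressive_rollout(eval_at, baseline: float,
                        tolerance: float = 0.02,
                        steps: tuple = (0.01, 0.05, 0.25, 1.0)):
    """Walk a canary through 1/5/25/100 percent of traffic.

    At each step, measure the candidate's performance on its traffic
    slice; if it regresses more than `tolerance` below the incumbent's
    `baseline`, roll back to zero traffic automatically. Returns the
    final traffic fraction and the outcome.
    """
    for fraction in steps:
        metric = eval_at(fraction)
        if metric < baseline - tolerance:
            return 0.0, "rolled_back"
    return 1.0, "promoted"
```

In a real deployment the traffic shift itself goes through the service mesh or serving layer (KServe and Seldon Core both support canary weights); the loop here is only the decision logic.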
Who we build MLOps for
- DoD — ML systems that must pass IL4/IL5 authorization and survive an adversarial red team.
- HHS and SAMHSA — production ML pipelines for substance abuse surveillance and grant analytics. Confirmed past performance.
- VA — claims adjudication models with mandatory fairness slicing and human review gates.
- DHS — operations ML for border, cyber, and emergency management workflows.
- NASA — scientific ML pipelines with provenance requirements for peer review.