What a real federal MLOps stack has to do
Federal MLOps is different from commercial MLOps in ways that look small individually and compound into a different architecture. Every step must be auditable. Every model must be traceable from training data to production endpoint. Every CI/CD runner must be hardened to a specific STIG benchmark. Every secret must flow through a managed service that logs access. Every data movement must respect classification. The result is a stack that looks familiar from commercial practice but has ten extra constraints written in its margins.
MLOps on GovCloud requires FedRAMP-authorized tooling throughout the pipeline. AWS SageMaker is FedRAMP High. MLflow must be self-hosted in the customer's GovCloud account. Commercial SaaS MLOps tools do not inherit the authorization boundary.
This post is the GovCloud MLOps reference we build from. It is AWS-first because that is where most federal AI workloads run in 2026; the Azure Government equivalents follow similar logic.
The reference architecture

DATA LAYER
  S3 (raw, curated, feature, model artifacts) --- KMS CMK
  AWS Lake Formation (governance, row/column security)
  AWS Glue / EMR Serverless (ETL)
FEATURE / TRAINING
  SageMaker Processing (ETL and feature prep)
  SageMaker Training Jobs (with spot + checkpointing)
  EKS + Kubeflow (for workloads SageMaker does not fit)
REGISTRY / GOVERNANCE
  SageMaker Model Registry (or self-hosted MLflow)
  Approvals gated in the registry state machine
SERVING
  SageMaker Hosted Endpoints (real-time, multi-model, async)
  SageMaker Batch Transform (batch)
  EKS / ECS (custom containers, including vLLM for LLMs)
ORCHESTRATION
  Step Functions (workflow)
  EventBridge (event-driven triggers)
CI/CD
  CodeCommit / CodeBuild / CodePipeline on STIG-hardened runners
  (or self-hosted GitLab / GitHub Enterprise with hardened runners)
OBSERVABILITY
  CloudWatch Metrics + Logs
  OpenSearch for search/analytics
  X-Ray for tracing
  Custom eval harness as a scheduled job
SECURITY
  IAM (least-privilege role assumption)
  KMS (CMKs, automatic rotation)
  Secrets Manager
  GuardDuty, Security Hub, Config
  CloudTrail (full capture, 7+ year retention)
Separation of environments
Three AWS accounts at minimum: dev, stage, prod. Some programs add a shared services account (for CI/CD, registry, logging aggregation) and a data-lake account. Cross-account access is via role assumption with explicit trust and external-ID conditions. Direct cross-account S3 access is narrow and logged. Production data does not flow to dev; dev trains on synthetic or de-identified samples. Production model weights are deployed to prod from a promoted registry entry, not rebuilt in prod.
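The role-assumption trust described above can be sketched as an IAM trust policy document. This is a minimal sketch: the account ID and external ID are placeholders, and note the GovCloud ARN partition (`aws-us-gov`).

```python
import json

def build_trust_policy(ci_account_id: str, external_id: str) -> dict:
    """Build an IAM trust policy that permits role assumption only from the
    named account AND only when the agreed external ID is presented.
    Account ID and external ID values here are placeholders."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws-us-gov:iam::{ci_account_id}:root"},
                "Action": "sts:AssumeRole",
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }

policy = build_trust_policy("111111111111", "mlops-shared-services")
print(json.dumps(policy, indent=2))
```

The external-ID condition is what makes the trust explicit rather than account-wide: a caller in the trusted account still cannot assume the role without presenting the agreed value.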
Data layer discipline
- S3 tier structure. raw/ → curated/ → feature/ → model-artifacts/. Raw is immutable, checksummed, encrypted. Curated is the transformed base. Feature is the input to training. Artifacts are versioned model outputs.
- KMS everywhere. CMK per environment or per data category. Automatic rotation. Key policies separated from resource policies. No default-key usage on production data.
- Lake Formation governance. Row- and column-level security for data catalog access. Tag-based access control for classification. LF-Tags propagated through Glue and Athena.
- Object Lock on any bucket that holds system-of-record data. Prevents delete during retention.
- Versioning on every bucket that holds inputs or outputs to training. Required for reproducibility.
- Access logging to a separate, retention-locked bucket.
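A tiny helper keeps the tier convention honest in pipeline code. This is a sketch; the dataset and version naming scheme is illustrative, not a fixed standard.

```python
# The four tiers from the bucket layout above, in lineage order.
TIERS = ("raw", "curated", "feature", "model-artifacts")

def object_key(tier: str, dataset: str, version: str, filename: str) -> str:
    """Build a tiered S3 key, e.g. 'curated/claims/v2026-01-15/part-000.parquet'.
    Keeping an explicit version segment in the key path works alongside bucket
    versioning to make training inputs reproducible."""
    if tier not in TIERS:
        raise ValueError(f"unknown tier {tier!r}; expected one of {TIERS}")
    return f"{tier}/{dataset}/{version}/{filename}"
```

Centralizing key construction like this means a misspelled tier fails fast in CI instead of silently creating a fifth prefix in production.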
Training: SageMaker and the alternatives
SageMaker Training Jobs are the default for tabular, classical, and many deep-learning workloads. Benefits: FedRAMP High authorization, spot instance support with managed checkpointing, built-in distributed training, Step Functions integration, and CloudTrail audit.
Where SageMaker does not fit:
- Very custom distributed training topologies (some LLM training patterns).
- Workloads requiring specific GPU hardware not currently available on SageMaker instance types.
- Training loops that need Kubernetes-native integration with the rest of an EKS platform.
Alternative: EKS with Kubeflow Training Operator, Ray on EKS, or bare-metal EC2 with SLURM. More operational work; worth it when SageMaker limitations block the workload.
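The spot-plus-checkpointing pattern can be sketched as a CreateTrainingJob request body. Everything here is a placeholder sketch (image URI, role ARN, bucket, instance type); the structural point is that `EnableManagedSpotTraining` pairs with `CheckpointConfig` so interruptions resume rather than restart.

```python
def spot_training_request(job_name: str, image_uri: str, role_arn: str,
                          bucket: str, kms_key_id: str) -> dict:
    """Assemble a SageMaker CreateTrainingJob request using managed spot
    capacity with checkpointing. All ARNs, URIs, and sizes are placeholders."""
    max_runtime = 4 * 3600   # active training budget
    max_wait = 8 * 3600      # runtime plus time spent waiting for spot capacity
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {"TrainingImage": image_uri,
                                   "TrainingInputMode": "File"},
        "RoleArn": role_arn,
        "OutputDataConfig": {"S3OutputPath": f"s3://{bucket}/model-artifacts/",
                             "KmsKeyId": kms_key_id},
        "ResourceConfig": {"InstanceType": "ml.g5.2xlarge",
                           "InstanceCount": 1,
                           "VolumeSizeInGB": 100,
                           "VolumeKmsKeyId": kms_key_id},
        "EnableManagedSpotTraining": True,
        "CheckpointConfig": {"S3Uri": f"s3://{bucket}/checkpoints/{job_name}/",
                             "LocalPath": "/opt/ml/checkpoints"},
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime,
                              "MaxWaitTimeInSeconds": max_wait},
    }
```

A boto3 SageMaker client would accept this dict via `create_training_job(**request)`; for spot jobs `MaxWaitTimeInSeconds` must be at least `MaxRuntimeInSeconds`. Note the KMS key appears on both the output path and the training volume.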
Model registry and promotion
The registry is the system of record. Every model gets a registry entry with:
- Model version and artifact URI (S3 path, versioned).
- Training job ID and full input manifest.
- Training code commit SHA.
- Container image digest.
- Training data version (snapshot IDs, Lake Formation query, whatever is applicable).
- Evaluation results on the standard eval harness.
- Approval state and approver identity.
- Deployment history (which endpoints received this version, when, status).
Promotion is a state-machine transition: Registered → Staging → Approved → Production. Transitions are gated on eval thresholds, approvals, and policy. Deployment reads from the registry; it does not deploy models that are not registered.
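A minimal sketch of that gate logic, with hypothetical field names (`state`, `eval_score`, `approver`) and an illustrative eval threshold:

```python
# Legal registry transitions; anything else is rejected outright.
TRANSITIONS = {
    "Registered": {"Staging"},
    "Staging": {"Approved", "Registered"},   # demote back on eval failure
    "Approved": {"Production", "Staging"},
    "Production": {"Staging"},               # rollback path
}

def promote(entry: dict, target: str, eval_threshold: float = 0.90) -> dict:
    """Advance a registry entry one state, enforcing eval and approval gates.
    Field names and the threshold value are illustrative."""
    state = entry["state"]
    if target not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition: {state} -> {target}")
    if target == "Approved" and entry.get("eval_score", 0.0) < eval_threshold:
        raise ValueError("eval score below threshold; promotion blocked")
    if target == "Production" and not entry.get("approver"):
        raise ValueError("manual approval required before Production")
    return {**entry, "state": target}
```

Because deployment reads only from the registry, encoding the gates here (rather than in deploy scripts) means an unapproved model has no path to an endpoint.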
CI/CD that meets the federal bar
The runner posture matters most. Runners must be STIG-hardened, patched, audited, and network-isolated. Options:
- CodeBuild on AL2023 STIG AMI. Managed; log everything to CloudTrail and S3; use VPC config with egress through approved proxies; custom image with Nessus agent and Session Manager.
- Self-hosted GitLab Runner on STIG RHEL. More control; more ops. Useful when the VCS is GitLab and the pipeline needs runner persistence.
- Self-hosted GitHub Actions Runner. Same pattern. GitHub-hosted (public) runners are not authorized.
Pipeline stages we ship:
- Lint, unit test, secret scan, SAST (SonarCloud is NOT authorized; use AWS CodeGuru, Semgrep self-hosted, or similar inside the boundary).
- Container build, sign (cosign or KMS-backed signing), push to ECR.
- Infrastructure-as-code validation (cfn-lint, checkov, tfsec, tflint).
- Training pipeline invocation (Step Function or SageMaker Pipeline).
- Eval harness run on trained model.
- Registry update with full lineage.
- Promotion gate (manual approval for Approved → Production).
- Deploy via blue/green or canary.
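The image-scan gate between build and promotion can be as simple as a severity-count check. The `findingSeverityCounts` shape below mirrors what ECR scan-findings responses return, but treat the exact shape as an assumption and the zero-tolerance thresholds as a policy choice.

```python
def scan_gate(finding_severity_counts: dict,
              max_critical: int = 0, max_high: int = 0) -> bool:
    """Pass an image for promotion only when CRITICAL/HIGH finding counts
    are within policy. Thresholds are illustrative policy defaults."""
    return (finding_severity_counts.get("CRITICAL", 0) <= max_critical
            and finding_severity_counts.get("HIGH", 0) <= max_high)
```

Running this in the pipeline, rather than relying on scan results being reviewed later, is what prevents the "vulnerable images slip through" failure mode described below.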
Secrets, identity, and access
- No long-lived access keys in pipelines. Every pipeline role assumption is short-lived and logged.
- SSO via IAM Identity Center for humans.
- Secrets in Secrets Manager, rotated, versioned, accessed by role not key.
- Every IAM policy reviewed for wildcard resources before merge.
- Cross-account access via role chaining with explicit trust and external ID.
- Break-glass access documented, logged, and reviewed.
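The wildcard-resource review can be partly automated as a pre-merge check. This sketch flags only the bare `"*"` resource; a real policy linter also reasons about action wildcards, NotResource, and condition keys.

```python
def wildcard_resources(policy: dict) -> list:
    """Return the indices of Allow statements granting access to Resource '*'.
    A deliberately narrow check, meant to run before merge in CI."""
    flagged = []
    for i, stmt in enumerate(policy.get("Statement", [])):
        resources = stmt.get("Resource", [])
        if isinstance(resources, str):     # IAM allows string or list here
            resources = [resources]
        if stmt.get("Effect") == "Allow" and "*" in resources:
            flagged.append(i)
    return flagged
```

Wiring this into the pipeline turns "every IAM policy reviewed for wildcard resources" from a checklist item into a failing build.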
Serving
SageMaker Hosted Endpoints are the default real-time serving layer. Multi-model endpoints are useful when you serve many small models (per-agency or per-task classifiers). Async endpoints handle long-running inference.
For LLMs specifically, SageMaker Large Model Inference (LMI) containers with vLLM or TensorRT-LLM are the supported pattern, or custom EKS deployments if you need more control. Either way, serving is deployed from the registry, not from a developer laptop.
Endpoints have autoscaling, CloudWatch alarms on error rate and latency, and request/response logging (with classification-aware redaction if the traffic handles CUI).
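Classification-aware redaction before logging can start as pattern masking. This is an illustrative sketch only: one SSN-shaped regex stands in for real CUI handling rules, which come from the program's data guide rather than a single pattern.

```python
import re

# Illustrative pattern only; real CUI redaction covers far more than SSNs.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact(payload: str) -> str:
    """Mask SSN-shaped tokens before a request/response pair is written to logs."""
    return SSN.sub("[REDACTED]", payload)
```

The key design point is that redaction runs in the serving path before the logging call, so unmasked CUI never lands in CloudWatch or OpenSearch in the first place.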
Monitoring and drift
- Service metrics. Latency, error rate, throughput. CloudWatch + alarms.
- Model metrics. Prediction distribution, feature distribution. SageMaker Model Monitor for tabular; custom for LLM / embedding models.
- Data drift. Compare the serving distribution to the training distribution on a cadence.
- Concept drift. Ground truth is delayed; compute model quality on labeled-in-retrospect samples.
- Eval harness on schedule. Rerun the regression eval nightly against the production model via the endpoint; alert on regression.
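The serving-vs-training distribution comparison is commonly scored with the Population Stability Index (PSI). A minimal sketch over pre-binned proportions; the banding thresholds quoted in the comment are a common rule of thumb, not a standard.

```python
import math

def psi(expected: list, actual: list) -> float:
    """Population Stability Index over matched histogram bins.
    Inputs are bin proportions that each sum to 1; a small floor avoids log(0).
    Rule-of-thumb read: < 0.1 stable, 0.1-0.25 drifting, > 0.25 shifted."""
    eps = 1e-6
    total = 0.0
    for e, a in zip(expected, actual):
        e = max(e, eps)   # expected (training) proportion for this bin
        a = max(a, eps)   # actual (serving) proportion for this bin
        total += (a - e) * math.log(a / e)
    return total
```

Computed on a schedule over the same feature bins used at training time, a rising PSI is the trigger for the drift alarms above, ahead of any delayed ground truth.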
Cost management
- Spot instances for training wherever checkpointing is implemented; typically 60-70% savings.
- Savings Plans for steady-state inference compute.
- Auto-shutdown for notebook instances; tag enforcement on cost centers.
- S3 Intelligent-Tiering or lifecycle policies for large training-data buckets.
- Right-size endpoints; multi-model endpoints for spiky low-volume workloads.
Observability against the AI RMF
Tag every artifact with the NIST AI RMF function it supports (Govern / Map / Measure / Manage). Tag every model-registry entry with its intended use, known limitations, and responsible owner. This makes the authorization narrative straightforward: the registry is both the engineering system and the governance artifact.
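That tagging discipline is enforceable at registration time. A sketch with hypothetical tag keys (`rmf_function`, `intended_use`, `known_limitations`, `owner`):

```python
# The four NIST AI RMF functions.
RMF_FUNCTIONS = {"Govern", "Map", "Measure", "Manage"}

# Hypothetical governance tag keys required on every registry entry.
REQUIRED_KEYS = {"rmf_function", "intended_use", "known_limitations", "owner"}

def validate_governance_tags(tags: dict) -> list:
    """Return a list of problems; an empty list means the entry is complete."""
    problems = [f"missing tag: {k}" for k in sorted(REQUIRED_KEYS - tags.keys())]
    if tags.get("rmf_function") not in RMF_FUNCTIONS:
        problems.append("rmf_function must be one of Govern/Map/Measure/Manage")
    return problems
```

Run as a gate on registry writes, this is what makes the registry double as the governance artifact: an entry missing its intended use or owner never enters the promotion state machine.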
Common failure modes
- Training runs on un-versioned data. A rebuild cannot reproduce the model. Auditors flag this reliably.
- Prod deployed from a developer machine. No CI/CD trail. Fails review.
- Wildcard IAM policies attached to a role with production read access. A finding in any remotely rigorous review.
- CloudTrail disabled in a region. A gap in the audit trail.
- No model rollback plan. A bad production promotion cannot be reverted quickly.
- Third-party CI service with outbound internet. Fails the authorization scrub.
- ECR image promotion not gated on scan results. Vulnerable images slip through.
Where this fits in our practice
We stand up and operate federal MLOps platforms end to end. See our GPU capacity planning for the compute layer, our vector database selection for retrieval infrastructure, and our Kubernetes in the IC tier for higher-classification deployments.