What a real federal MLOps stack has to do
Federal MLOps is different from commercial MLOps in ways that look small individually and compound into a different architecture. Every step must be auditable. Every model must be traceable from training data to production endpoint. Every CI/CD runner must be hardened to a specific STIG benchmark. Every secret must flow through a managed service that logs access. Every data movement must respect classification. The result is a stack that looks familiar from commercial practice but has ten extra constraints written in its margins.
MLOps on GovCloud requires FedRAMP-authorized tooling throughout the pipeline. AWS SageMaker is FedRAMP High. MLflow must be self-hosted in the customer's GovCloud account. Commercial SaaS MLOps tools do not inherit the authorization boundary.
This post is the GovCloud MLOps reference we build from. It is AWS-first because that is where most federal AI workloads run in 2026; the Azure Government equivalents follow similar logic.
The reference architecture

DATA LAYER
  S3 (raw, curated, feature, model artifacts) --- KMS CMK
  AWS Lake Formation (governance, row/column security)
  AWS Glue / EMR Serverless (ETL)
FEATURE / TRAINING
  SageMaker Processing (ETL and feature prep)
  SageMaker Training Jobs (with spot + checkpointing)
  EKS + Kubeflow (for workloads SageMaker does not fit)
REGISTRY / GOVERNANCE
  SageMaker Model Registry (or self-hosted MLflow)
  Approvals gated in the registry state machine
SERVING
  SageMaker Hosted Endpoints (real-time, multi-model, async)
  SageMaker Batch Transform (batch)
  EKS / ECS (custom containers, including vLLM for LLMs)
ORCHESTRATION
  Step Functions (workflow)
  EventBridge (event-driven triggers)
CI/CD
  CodeCommit / CodeBuild / CodePipeline on STIG-hardened runners
  (or self-hosted GitLab / GitHub Enterprise with hardened runners)
OBSERVABILITY
  CloudWatch Metrics + Logs
  OpenSearch for search/analytics
  X-Ray for tracing
  Custom eval harness as a scheduled job
SECURITY
  IAM (least-privilege role assumption)
  KMS (CMKs, automatic rotation)
  Secrets Manager
  GuardDuty, Security Hub, Config
  CloudTrail (full capture, 7+ year retention)
Separation of environments
Three AWS accounts at minimum: dev, stage, prod. Some programs add a shared services account (for CI/CD, registry, logging aggregation) and a data-lake account. Cross-account access is via role assumption with explicit trust and external-ID conditions. Direct cross-account S3 access is narrow and logged. Production data does not flow to dev; dev trains on synthetic or de-identified samples. Production model weights are deployed to prod from a promoted registry entry, not rebuilt in prod.
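The role-assumption trust described above can be sketched as an IAM trust policy document. This is a minimal sketch: the account ID and external ID are placeholders, and note the GovCloud ARN partition (`aws-us-gov`).

```python
import json

def build_trust_policy(ci_account_id: str, external_id: str) -> dict:
    """Build an IAM trust policy that permits role assumption only from the
    named account AND only when the agreed external ID is presented.
    Account ID and external ID values here are placeholders."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws-us-gov:iam::{ci_account_id}:root"},
                "Action": "sts:AssumeRole",
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }

policy = build_trust_policy("111111111111", "mlops-shared-services")
print(json.dumps(policy, indent=2))
```

The external-ID condition is what makes the trust explicit rather than account-wide: a caller in the trusted account still cannot assume the role without presenting the agreed value.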
Data layer discipline
- S3 tier structure. raw/ → curated/ → feature/ → model-artifacts/. Raw is immutable, checksummed, encrypted. Curated is the transformed base. Feature is the input to training. Artifacts are versioned model outputs.
- KMS everywhere. CMK per environment or per data category. Automatic rotation. Key policies separated from resource policies. No default-key usage on production data.
- Lake Formation governance. Row- and column-level security for data catalog access. Tag-based access control for classification. LF-Tags propagated through Glue and Athena.
- Object Lock on any bucket that holds system-of-record data. Prevents delete during retention.
- Versioning on every bucket that holds inputs or outputs to training. Required for reproducibility.
- Access logging to a separate, retention-locked bucket.
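A tiny helper keeps the tier convention honest in pipeline code. This is a sketch; the dataset and version naming scheme is illustrative, not a fixed standard.

```python
# The four tiers from the bucket layout above, in lineage order.
TIERS = ("raw", "curated", "feature", "model-artifacts")

def object_key(tier: str, dataset: str, version: str, filename: str) -> str:
    """Build a tiered S3 key, e.g. 'curated/claims/v2026-01-15/part-000.parquet'.
    Keeping an explicit version segment in the key path works alongside bucket
    versioning to make training inputs reproducible."""
    if tier not in TIERS:
        raise ValueError(f"unknown tier {tier!r}; expected one of {TIERS}")
    return f"{tier}/{dataset}/{version}/{filename}"
```

Centralizing key construction like this means a misspelled tier fails fast in CI instead of silently creating a fifth prefix in production.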
Training: SageMaker and the alternatives
SageMaker Training Jobs are the default for tabular, classical, and many deep-learning workloads. Benefits: FedRAMP High authorization, spot instance support with managed checkpointing, built-in distributed training, Step Functions integration, and CloudTrail audit.
Where SageMaker does not fit:
- Very custom distributed training topologies (some LLM training patterns).
- Workloads requiring specific GPU hardware not currently available on SageMaker instance types.
- Training loops that need Kubernetes-native integration with the rest of an EKS platform.
Alternative: EKS with Kubeflow Training Operator, Ray on EKS, or bare-metal EC2 with SLURM. More operational work; worth it when SageMaker limitations block the workload.
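The spot-plus-checkpointing pattern can be sketched as a CreateTrainingJob request body. Everything here is a placeholder sketch (image URI, role ARN, bucket, instance type); the structural point is that `EnableManagedSpotTraining` pairs with `CheckpointConfig` so interruptions resume rather than restart.

```python
def spot_training_request(job_name: str, image_uri: str, role_arn: str,
                          bucket: str, kms_key_id: str) -> dict:
    """Assemble a SageMaker CreateTrainingJob request using managed spot
    capacity with checkpointing. All ARNs, URIs, and sizes are placeholders."""
    max_runtime = 4 * 3600   # active training budget
    max_wait = 8 * 3600      # runtime plus time spent waiting for spot capacity
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {"TrainingImage": image_uri,
                                   "TrainingInputMode": "File"},
        "RoleArn": role_arn,
        "OutputDataConfig": {"S3OutputPath": f"s3://{bucket}/model-artifacts/",
                             "KmsKeyId": kms_key_id},
        "ResourceConfig": {"InstanceType": "ml.g5.2xlarge",
                           "InstanceCount": 1,
                           "VolumeSizeInGB": 100,
                           "VolumeKmsKeyId": kms_key_id},
        "EnableManagedSpotTraining": True,
        "CheckpointConfig": {"S3Uri": f"s3://{bucket}/checkpoints/{job_name}/",
                             "LocalPath": "/opt/ml/checkpoints"},
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime,
                              "MaxWaitTimeInSeconds": max_wait},
    }
```

A boto3 SageMaker client would accept this dict via `create_training_job(**request)`; for spot jobs `MaxWaitTimeInSeconds` must be at least `MaxRuntimeInSeconds`. Note the KMS key appears on both the output path and the training volume.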
Model registry and promotion
The registry is the system of record. Every model gets a registry entry with:
- Model version and artifact URI (S3 path, versioned).
- Training job ID and full input manifest.
- Training code commit SHA.
- Container image digest.
- Training data version (snapshot IDs, Lake Formation query, whatever is applicable).
- Evaluation results on the standard eval harness.
- Approval state and approver identity.
- Deployment history (which endpoints received this version, when, status).
Promotion is a state-machine transition: Registered → Staging → Approved → Production. Transitions are gated on eval thresholds, approvals, and policy. Deployment reads from the registry; it does not deploy models that are not registered.
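A minimal sketch of that gate logic, with hypothetical field names (`state`, `eval_score`, `approver`) and an illustrative eval threshold:

```python
# Legal registry transitions; anything else is rejected outright.
TRANSITIONS = {
    "Registered": {"Staging"},
    "Staging": {"Approved", "Registered"},   # demote back on eval failure
    "Approved": {"Production", "Staging"},
    "Production": {"Staging"},               # rollback path
}

def promote(entry: dict, target: str, eval_threshold: float = 0.90) -> dict:
    """Advance a registry entry one state, enforcing eval and approval gates.
    Field names and the threshold value are illustrative."""
    state = entry["state"]
    if target not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition: {state} -> {target}")
    if target == "Approved" and entry.get("eval_score", 0.0) < eval_threshold:
        raise ValueError("eval score below threshold; promotion blocked")
    if target == "Production" and not entry.get("approver"):
        raise ValueError("manual approval required before Production")
    return {**entry, "state": target}
```

Because deployment reads only from the registry, encoding the gates here (rather than in deploy scripts) means an unapproved model has no path to an endpoint.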
CI/CD that meets the federal bar
The runner posture matters most. Runners must be STIG-hardened, patched, audited, and network-isolated. Options:
- CodeBuild on AL2023 STIG AMI. Managed; log everything to CloudTrail and S3; use VPC config with egress through approved proxies; custom image with Nessus agent and Session Manager.
- Self-hosted GitLab Runner on STIG RHEL. More control; more ops. Useful when the VCS is GitLab and the pipeline needs runner persistence.
- Self-hosted GitHub Actions Runner. Same pattern. GitHub-hosted (public) runners are not authorized.
Pipeline stages we ship:
- Lint, unit test, secret scan, SAST (SonarCloud is NOT authorized; use AWS CodeGuru, Semgrep self-hosted, or similar inside the boundary).
- Container build, sign (cosign or KMS-backed signing), push to ECR.
- Infrastructure-as-code validation (cfn-lint, checkov, tfsec, tflint).
- Training pipeline invocation (Step Function or SageMaker Pipeline).
- Eval harness run on trained model.
- Registry update with full lineage.
- Promotion gate (manual approval for Approved → Production).
- Deploy via blue/green or canary.
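The image-scan gate between build and promotion can be as simple as a severity-count check. The `findingSeverityCounts` shape below mirrors what ECR scan-findings responses return, but treat the exact shape as an assumption and the zero-tolerance thresholds as a policy choice.

```python
def scan_gate(finding_severity_counts: dict,
              max_critical: int = 0, max_high: int = 0) -> bool:
    """Pass an image for promotion only when CRITICAL/HIGH finding counts
    are within policy. Thresholds are illustrative policy defaults."""
    return (finding_severity_counts.get("CRITICAL", 0) <= max_critical
            and finding_severity_counts.get("HIGH", 0) <= max_high)
```

Running this in the pipeline, rather than relying on scan results being reviewed later, is what prevents the "vulnerable images slip through" failure mode described below.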
Secrets, identity, and access
- No long-lived access keys in pipelines. Every pipeline role assumption is short-lived and logged.
- SSO via IAM Identity Center for humans.
- Secrets in Secrets Manager, rotated, versioned, accessed by role not key.
- Every IAM policy reviewed for wildcard resources before merge.
- Cross-account access via role chaining with explicit trust and external ID.
- Break-glass access documented, logged, and reviewed.
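The wildcard-resource review can be partly automated as a pre-merge check. This sketch flags only the bare `"*"` resource; a real policy linter also reasons about action wildcards, NotResource, and condition keys.

```python
def wildcard_resources(policy: dict) -> list:
    """Return the indices of Allow statements granting access to Resource '*'.
    A deliberately narrow check, meant to run before merge in CI."""
    flagged = []
    for i, stmt in enumerate(policy.get("Statement", [])):
        resources = stmt.get("Resource", [])
        if isinstance(resources, str):     # IAM allows string or list here
            resources = [resources]
        if stmt.get("Effect") == "Allow" and "*" in resources:
            flagged.append(i)
    return flagged
```

Wiring this into the pipeline turns "every IAM policy reviewed for wildcard resources" from a checklist item into a failing build.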
Serving
SageMaker Hosted Endpoints are the default real-time serving layer. Multi-model endpoints are useful when you serve many small models (per-agency or per-task classifiers). Async endpoints handle long-running inference.
For LLMs specifically, SageMaker Large Model Inference (LMI) containers with vLLM or TensorRT-LLM are the supported pattern, or custom EKS deployments if you need more control. Either way, serving is deployed from the registry, not from a developer laptop.
Endpoints have autoscaling, CloudWatch alarms on error rate and latency, and request/response logging (with classification-aware redaction if the traffic handles CUI).
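Classification-aware redaction before logging can start as pattern masking. This is an illustrative sketch only: one SSN-shaped regex stands in for real CUI handling rules, which come from the program's data guide rather than a single pattern.

```python
import re

# Illustrative pattern only; real CUI redaction covers far more than SSNs.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact(payload: str) -> str:
    """Mask SSN-shaped tokens before a request/response pair is written to logs."""
    return SSN.sub("[REDACTED]", payload)
```

The key design point is that redaction runs in the serving path before the logging call, so unmasked CUI never lands in CloudWatch or OpenSearch in the first place.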
Monitoring and drift
- Service metrics. Latency, error rate, throughput. CloudWatch + alarms.
- Model metrics. Prediction distribution, feature distribution. SageMaker Model Monitor for tabular; custom for LLM / embedding models.
- Data drift. Compare the serving distribution to the training distribution on a cadence.
- Concept drift. Ground truth is delayed; compute model quality on labeled-in-retrospect samples.
- Eval harness on schedule. Rerun the regression eval nightly against the production model via the endpoint; alert on regression.
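The serving-vs-training distribution comparison is commonly scored with the Population Stability Index (PSI). A minimal sketch over pre-binned proportions; the banding thresholds quoted in the comment are a common rule of thumb, not a standard.

```python
import math

def psi(expected: list, actual: list) -> float:
    """Population Stability Index over matched histogram bins.
    Inputs are bin proportions that each sum to 1; a small floor avoids log(0).
    Rule-of-thumb read: < 0.1 stable, 0.1-0.25 drifting, > 0.25 shifted."""
    eps = 1e-6
    total = 0.0
    for e, a in zip(expected, actual):
        e = max(e, eps)   # expected (training) proportion for this bin
        a = max(a, eps)   # actual (serving) proportion for this bin
        total += (a - e) * math.log(a / e)
    return total
```

Computed on a schedule over the same feature bins used at training time, a rising PSI is the trigger for the drift alarms above, ahead of any delayed ground truth.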
Cost management
- Spot instances for training wherever checkpointing is implemented; typically 60-70% savings.
- Savings Plans for steady-state inference compute.
- Auto-shutdown for notebook instances; tag enforcement on cost centers.
- S3 Intelligent-Tiering or lifecycle policies for large training-data buckets.
- Right-size endpoints; multi-model endpoints for spiky low-volume workloads.
Observability against the AI RMF
Tag every artifact with the NIST AI RMF function it supports (Govern / Map / Measure / Manage). Tag every model-registry entry with its intended use, known limitations, and responsible owner. This makes the authorization narrative straightforward: the registry is both the engineering system and the governance artifact.
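That tagging discipline is enforceable at registration time. A sketch with hypothetical tag keys (`rmf_function`, `intended_use`, `known_limitations`, `owner`):

```python
# The four NIST AI RMF functions.
RMF_FUNCTIONS = {"Govern", "Map", "Measure", "Manage"}

# Hypothetical governance tag keys required on every registry entry.
REQUIRED_KEYS = {"rmf_function", "intended_use", "known_limitations", "owner"}

def validate_governance_tags(tags: dict) -> list:
    """Return a list of problems; an empty list means the entry is complete."""
    problems = [f"missing tag: {k}" for k in sorted(REQUIRED_KEYS - tags.keys())]
    if tags.get("rmf_function") not in RMF_FUNCTIONS:
        problems.append("rmf_function must be one of Govern/Map/Measure/Manage")
    return problems
```

Run as a gate on registry writes, this is what makes the registry double as the governance artifact: an entry missing its intended use or owner never enters the promotion state machine.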
Common failure modes
- Training runs on un-versioned data. A rebuild cannot reproduce the model. Auditors flag this reliably.
- Prod deployed from a developer machine. No CI/CD trail. Fails review.
- Wildcard IAM policies attached to a role with production read access. A finding in any remotely rigorous review.
- CloudTrail disabled in a region. A gap in the audit trail.
- No model rollback plan. A bad production promotion cannot be reverted quickly.
- Third-party CI service with outbound internet. Fails the authorization scrub.
- ECR image promotion not gated on scan results. Vulnerable images slip through.
Where this fits in our practice
We stand up and operate federal MLOps platforms end to end. See our GPU capacity planning for the compute layer, our vector database selection for retrieval infrastructure, and our Kubernetes in the IC tier for higher-classification deployments.