Why this is hard to get right
GPU capacity is the single largest infrastructure cost on a federal AI program, and it is the one most often sized on vibes instead of math. Teams either over-provision (because a vendor said they would need 8 GPUs and nobody questioned it) or under-provision (because the initial POC was small and the production workload blew through the budget in the first month). The planning that works is grounded in measured workload characteristics, honest uncertainty about growth, and a clear lease-vs-buy decision for the authorization boundary.
The 2026 GPU lineup, honestly

| GPU | Memory | Best for | Federal availability |
|---|---|---|---|
| H100 SXM (80GB) | 80 GB HBM3 | LLM training and serving, general DL | GovCloud p5, on-prem (mature supply) |
| H200 SXM (141GB) | 141 GB HBM3e | Larger-context LLM serving, bigger models without tensor parallel | GovCloud p5en, on-prem (growing supply) |
| B200 / GB200 | 192 GB HBM3e | Frontier training, high-throughput inference, FP4/FP8 | Commercial + federal availability growing through 2026 |
| L40S | 48 GB GDDR6 | Mid-size inference, graphics-adjacent workloads | Widely available, lower cost |
| L4 | 24 GB GDDR6 | Small-model inference, CPU-offload augmentation | Widely available |
| A100 (40/80GB) | 40-80 GB HBM2e | Older but capable; still useful for training and serving | Reliable availability; price-per-unit-capability weakening |
| A10 / A10G | 24 GB GDDR6 | Small-model serving; cost-sensitive | Widely available |
Workload sizing
LLM inference
- 70B-class models: 2xH100 or 1xH200 per serving replica; 1,500-3,000 tok/s batched.
- 8B-class models: 1xL40S or 1xH100 MIG slice; 2,000-4,000 tok/s batched.
- 405B-class models: 8xH100 or 4xH200 per serving replica.
- Mixture-of-experts (Mixtral 8x22B): 4xH100 or 2xH200 in bf16; 2xH100 with AWQ 4-bit quantization.
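The per-model numbers above can be sanity-checked with a back-of-envelope memory estimate: weights plus KV cache must fit the replica's aggregate GPU memory. A minimal sketch, assuming a Llama-3-70B-like geometry (80 layers, 8 grouped-query KV heads, head dimension 128), bf16 weights, and an illustrative batch-token budget; none of these defaults come from a specific vendor spec:

```python
def serving_memory_gb(params_b, bytes_per_param=2, layers=80, kv_heads=8,
                      head_dim=128, kv_bytes=2, max_batch_tokens=65536):
    """Rough memory for one serving replica: weights plus KV cache,
    ignoring activations and runtime overhead."""
    weights = params_b * 1e9 * bytes_per_param                  # bf16 weights
    kv_per_token = 2 * layers * kv_heads * head_dim * kv_bytes  # K and V
    return (weights + kv_per_token * max_batch_tokens) / 1e9

# A 70B model in bf16 is ~140 GB of weights before any KV cache, which is
# why it needs 2x80GB H100s or a single 141GB H200 per replica.
print(round(serving_memory_gb(70), 1))
```

The same function with `params_b=8` shows why an 8B model fits comfortably in a 48 GB L40S or an H100 MIG slice.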
Fine-tuning
- LoRA on 7B-13B. 1xH100 or 1xA100 80GB. Hours to a day of compute.
- LoRA on 70B. 4x-8xH100. Overnight to a few days.
- Full fine-tune 70B. 8x-32xH100 for a week or more. Mostly avoid.
Embedding models
- Small embedding models (BGE, E5) on 1xL4 or CPU for light batch workloads.
- Larger models or high-throughput indexing: 1xL40S or 1xA10G.
Classical and deep-learning training
- Tabular or small-network training: often CPU or single-GPU (A10, L4).
- Vision model fine-tuning: 1x-4xH100 or L40S typical.
- Geospatial foundation-model fine-tuning: 4x-8xH100 per job.
Inference vs training: the real ratio
Across programs we have worked on, sustained inference GPU hours outnumber training hours by 5-20x. Most programs plan 50/50 for training vs inference; mature programs end up closer to 10/90 or worse. Capacity planning should reflect this: reserve steady-state capacity for inference and add burst capacity for periodic training. Over-provisioning training GPUs while starving inference is the most common capacity mistake.
Utilization: the honest number
Nominal GPU plans assume full utilization; real deployments run at 30-70%. The gap comes from:
- Idle time between requests (serving).
- KV cache fragmentation.
- Non-prompt GPU memory overhead.
- Single-GPU workloads on multi-GPU nodes (bad scheduling).
- Dev/test sharing that is hard to pin down.
Size for 50-60% average utilization as the planning target; 80%+ sustained means you are under-provisioned and at risk for latency SLAs.
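Sizing against the derated number rather than the benchmark number changes the answer materially. A sketch using the 55% midpoint of the planning band; the demand and throughput figures are illustrative assumptions:

```python
import math

def gpus_needed(demand_tok_per_s, nominal_tok_per_s, target_util=0.55):
    """GPUs required once nominal throughput is derated to the
    50-60% planning band (midpoint 55%)."""
    effective = nominal_tok_per_s * target_util
    return math.ceil(demand_tok_per_s / effective)

# 20,000 tok/s of sustained demand on GPUs that benchmark at 3,000 tok/s:
# naive sizing says 7 GPUs; derated sizing says 13.
print(math.ceil(20000 / 3000), gpus_needed(20000, 3000))
```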
Lease vs buy on a federal program
Lease via cloud
Good for: variable workloads, exploratory work, pre-ATO development, rapid-scale needs, burst training. GovCloud and Azure Government both offer reserved capacity (AWS Capacity Blocks for ML, Azure ML reserved instances) which can approach on-prem cost-per-GPU-hour at multi-year commitment.
Buy / capital lease
Good for: steady inference production, classified environments, long-term programs, situations where cloud is unavailable. Real cost is hardware plus datacenter integration plus power plus cooling plus ops staff plus refresh. Multi-year program horizon is necessary to amortize.
Hybrid
Common pattern: on-prem for steady production inference, cloud for burst training and for dev/test. Split the authorization accordingly.
Reserved capacity strategies
AWS Capacity Blocks for ML
Schedule specific H100/H200 instance capacity for a future window. Necessary for reliable large training jobs.
AWS Savings Plans
Commit to aggregate compute spend for 1-3 years; discount 20-60%. For steady inference.
Azure reservations
Equivalent mechanism on Azure Government.
Spot / preemptible
For training with checkpointing; 60-70% savings. Not for production inference.
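The options above can be compared on an annual basis for a steady inference node. The on-demand rate below is a hypothetical placeholder, not a quoted GovCloud price, and the discounts are midpoints of the ranges cited above:

```python
# Illustrative comparison of reserved-capacity options for steady inference.
ON_DEMAND = 98.32  # hypothetical $/hr for an 8xH100 node

plans = {
    "on-demand": ON_DEMAND,
    "savings-plan": ON_DEMAND * (1 - 0.40),  # midpoint of the 20-60% range
    "spot": ON_DEMAND * (1 - 0.65),          # midpoint of the 60-70% range
}

HOURS_PER_YEAR = 24 * 365  # steady inference runs around the clock
for name, rate in plans.items():
    print(f"{name:>12}: ${rate * HOURS_PER_YEAR:>10,.0f}/yr")
```

Spot's headline savings come with preemption risk, which is why it belongs with checkpointed training, not production inference.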
Power, cooling, and datacenter integration for on-prem
- 8xH100 SXM system: 10-12 kW sustained. 4 systems to a rack with high-density cooling.
- Power distribution: 3-phase, high-amperage whips. Facility needs must be confirmed before procurement.
- Cooling: liquid-cooled racks increasingly standard for H100/H200/B200 density. Air-cooled works but at lower density.
- Networking: NVIDIA Spectrum-X or InfiniBand for multi-node training. 400 Gbps per GPU is the pattern for H100 training clusters.
- Storage: high-IOPS NVMe tier for training data staging; object storage for archives.
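The power numbers above translate directly into operating cost. A sketch of the per-rack arithmetic; the PUE and electricity rate are assumptions to be replaced with facility-specific values:

```python
def rack_power_kw(systems_per_rack=4, kw_per_system=11.0):
    """IT load for a rack of 8xH100 SXM systems (10-12 kW each; 11 kW midpoint)."""
    return systems_per_rack * kw_per_system

def annual_power_cost_usd(it_load_kw, pue=1.4, usd_per_kwh=0.12):
    """Facility draw = IT load x PUE, sustained for a year.
    PUE and $/kWh are assumed placeholders, not measured values."""
    return it_load_kw * pue * usd_per_kwh * 24 * 365

rack_kw = rack_power_kw()                      # 44 kW of IT load per rack
print(round(annual_power_cost_usd(rack_kw)))   # annual power cost per rack
```

At these assumptions a single rack draws roughly $65k/year in power, which is why power dominates on-prem OpEx over a multi-year horizon.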
MIG vs whole-GPU scheduling
- MIG (H100/H200): partition a GPU into up to seven hardware-isolated instances (slice sizes of 1, 2, 3, 4, or 7 sevenths of the compute). Good for serving many small models, for dev/test multi-tenancy, and for separating classification contexts on the same hardware.
- Whole-GPU: for training, large-model inference, or any workload that would not fit a MIG slice.
- Time-slicing (Kubernetes plugin): no hardware isolation, just temporal sharing. Simpler than MIG but weaker boundary.
- Plan capacity in units that match the scheduling mode. A "10-GPU cluster" with MIG partitioning is not the same as 10 whole-GPU slots.
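Counting capacity in scheduling-mode units can be sketched as a lookup. The profile table below is a simplified subset of the H100 80GB profiles (see NVIDIA's MIG documentation for the full list):

```python
# Simplified subset of H100 80GB MIG profiles.
# Values: (compute sevenths, approx. memory GB).
MIG_PROFILES = {
    "1g.10gb": (1, 10),
    "2g.20gb": (2, 20),
    "3g.40gb": (3, 40),
    "7g.80gb": (7, 80),
}

def cluster_slots(gpus, profile):
    """Serving slots a cluster provides under one uniform MIG profile."""
    sevenths, _ = MIG_PROFILES[profile]
    return gpus * (7 // sevenths)

# A "10-GPU cluster" is 70 small slices, or 20 medium, or 10 whole GPUs:
print(cluster_slots(10, "1g.10gb"),
      cluster_slots(10, "3g.40gb"),
      cluster_slots(10, "7g.80gb"))
```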
Forecasting growth
- User count and per-user requests/day.
- Average input tokens and output tokens per request.
- Expected tokens/sec per GPU under the specific model and serving stack.
- Redundancy multiplier for availability (typically 1.5x-2x).
- Growth assumption: 2-5x year-over-year for successful deployments in the first two years, flattening after.
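The inputs above combine into a steady-state GPU count. A sketch that sizes to average demand (layer a peak-to-average factor on top if traffic is bursty); all example inputs are hypothetical:

```python
import math

def gpus_for_forecast(users, requests_per_user_day, tokens_per_request,
                      tok_per_s_per_gpu, utilization=0.55,
                      redundancy=1.5, growth=3.0):
    """Steady-state GPU count from the planning inputs listed above,
    with a growth multiplier applied to user demand."""
    daily_tokens = users * requests_per_user_day * tokens_per_request * growth
    demand_tok_per_s = daily_tokens / 86400
    base = demand_tok_per_s / (tok_per_s_per_gpu * utilization)
    return math.ceil(base * redundancy)

# 2,000 users x 20 requests/day x 1,500 tokens, 2,000 tok/s per GPU,
# 3x growth headroom, 1.5x redundancy:
print(gpus_for_forecast(2000, 20, 1500, 2000))
```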
What goes wrong
- Sizing on batch-1 latency. Actual throughput under batched inference is much higher; sizing on single-request latency over-provisions by 3-10x.
- Ignoring KV cache. Long contexts fit the model but not the cache; production OOM at high concurrency.
- MIG configured once, never re-tuned. Workload mix changes, MIG profile does not. Utilization collapses.
- Training capacity stranded. 16xH100 bought for training sits idle 90% of the time. Convert to inference or spot-sell internally.
- Power cost ignored. On-prem OpEx dominated by power over a 4-year horizon.
- Refresh not planned. GPUs bought in year 1 need a refresh plan in year 3-4 regardless of how the workload grows.
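The KV-cache failure mode can be made concrete. Assuming a 70B-class geometry (80 layers, 8 grouped-query KV heads, head dimension 128, 16-bit cache), a sketch of how few long contexts fit in the memory left after weights:

```python
def max_concurrent_contexts(free_gb, context_len=32768, layers=80,
                            kv_heads=8, head_dim=128, kv_bytes=2):
    """How many full-length contexts fit in the memory left after weights."""
    per_token = 2 * layers * kv_heads * head_dim * kv_bytes  # K and V
    return int(free_gb * 1e9 // (per_token * context_len))

# With ~20 GB free after a 70B model's weights, a single 32k context
# consumes ~10.7 GB of KV cache -- only one fits:
print(max_concurrent_contexts(20))
```

The model loads and passes functional tests, then OOMs the first time several users send long documents at once; this is why concurrency at full context length belongs in the sizing math.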
Where this fits in our practice
We size GPU capacity as part of the full stack build. See our on-prem LLM deployment for the workloads the GPUs run, our Kubernetes in the IC tier for the scheduling platform, and our MLOps on GovCloud for orchestration.