Why this is hard to get right
GPU capacity is the single largest infrastructure cost on a federal AI program, and it is the one most often sized on vibes instead of math. Teams either over-provision (because a vendor said they would need 8 GPUs and nobody questioned it) or under-provision (because the initial POC was small and the production workload blew through the budget in the first month). The planning that works is grounded in measured workload characteristics, honest uncertainty about growth, and a clear lease-vs-buy decision for the authorization boundary.
The 2026 GPU lineup, honestly

| GPU | Memory | Best for | Federal availability |
|---|---|---|---|
| H100 SXM (80GB) | 80 GB HBM3 | LLM training and serving, general DL | GovCloud p5, on-prem (mature supply) |
| H200 SXM (141GB) | 141 GB HBM3e | Larger-context LLM serving, bigger models without tensor parallel | GovCloud p5en, on-prem (growing supply) |
| B200 / GB200 | 192 GB HBM3e | Frontier training, high-throughput inference, FP4/FP8 | Commercial + federal availability growing through 2026 |
| L40S | 48 GB GDDR6 | Mid-size inference, graphics-adjacent workloads | Widely available, lower cost |
| L4 | 24 GB GDDR6 | Small-model inference, CPU-offload augmentation | Widely available |
| A100 (40/80GB) | 40-80 GB HBM2e | Older but capable; still useful for training and serving | Reliable availability; price-per-unit-capability weakening |
| A10 / A10G | 24 GB GDDR6 | Small-model serving; cost-sensitive | Widely available |
Workload sizing
LLM inference
- 70B-class models: 2xH100 or 1xH200 per serving replica; 1,500-3,000 tok/s batched.
- 8B-class models: 1xL40S or 1xH100 MIG slice; 2,000-4,000 tok/s batched.
- 405B-class models: 8xH100 or 4xH200 per serving replica.
- Mixture-of-experts (Mixtral 8x22B): 4xH100 or 2xH200 in bf16; 2xH100 with AWQ 4-bit quantization.
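The per-model numbers above can be sanity-checked with a back-of-envelope memory estimate: weights plus KV cache must fit the replica's aggregate GPU memory. A minimal sketch, assuming a Llama-3-70B-like geometry (80 layers, 8 grouped-query KV heads, head dimension 128), bf16 weights, and an illustrative batch-token budget; none of these defaults come from a specific vendor spec:

```python
def serving_memory_gb(params_b, bytes_per_param=2, layers=80, kv_heads=8,
                      head_dim=128, kv_bytes=2, max_batch_tokens=65536):
    """Rough memory for one serving replica: weights plus KV cache,
    ignoring activations and runtime overhead."""
    weights = params_b * 1e9 * bytes_per_param                  # bf16 weights
    kv_per_token = 2 * layers * kv_heads * head_dim * kv_bytes  # K and V
    return (weights + kv_per_token * max_batch_tokens) / 1e9

# A 70B model in bf16 is ~140 GB of weights before any KV cache, which is
# why it needs 2x80GB H100s or a single 141GB H200 per replica.
print(round(serving_memory_gb(70), 1))
```

The same function with `params_b=8` shows why an 8B model fits comfortably in a 48 GB L40S or an H100 MIG slice.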
Fine-tuning
- LoRA on 7B-13B. 1xH100 or 1xA100 80GB. Hours to a day of compute.
- LoRA on 70B. 4x-8xH100. Overnight to a few days.
- Full fine-tune 70B. 8x-32xH100 for a week or more. Mostly avoid.
Embedding models
- Small embedding models (BGE, E5) on 1xL4 or CPU for light batch workloads.
- Larger models or high-throughput indexing: 1xL40S or 1xA10G.
Classical and deep-learning training
- Tabular or small-network training: often CPU or single-GPU (A10, L4).
- Vision model fine-tuning: 1x-4xH100 or L40S typical.
- Geospatial foundation-model fine-tuning: 4x-8xH100 per job.
Inference vs training: the real ratio
Across programs we have worked on, sustained inference GPU hours outnumber training hours by 5-20x. Most programs plan 50/50 for training vs inference; mature programs end up closer to 10/90 or worse. Capacity planning should reflect this: reserve steady-state capacity for inference and add burst capacity for periodic training. Over-provisioning training GPUs while starving inference is the most common capacity mistake.
Utilization: the honest number
Nominal GPU plans assume full utilization; real deployments run at 30-70%. The gap comes from:
- Idle time between requests (serving).
- KV cache fragmentation.
- Non-prompt GPU memory overhead.
- Single-GPU workloads on multi-GPU nodes (bad scheduling).
- Dev/test sharing that is hard to pin down.
Size for 50-60% average utilization as the planning target; 80%+ sustained means you are under-provisioned and at risk for latency SLAs.
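Sizing against the derated number rather than the benchmark number changes the answer materially. A sketch using the 55% midpoint of the planning band; the demand and throughput figures are illustrative assumptions:

```python
import math

def gpus_needed(demand_tok_per_s, nominal_tok_per_s, target_util=0.55):
    """GPUs required once nominal throughput is derated to the
    50-60% planning band (midpoint 55%)."""
    effective = nominal_tok_per_s * target_util
    return math.ceil(demand_tok_per_s / effective)

# 20,000 tok/s of sustained demand on GPUs that benchmark at 3,000 tok/s:
# naive sizing says 7 GPUs; derated sizing says 13.
print(math.ceil(20000 / 3000), gpus_needed(20000, 3000))
```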
Lease vs buy on a federal program
Lease via cloud
Good for: variable workloads, exploratory work, pre-ATO development, rapid-scale needs, burst training. GovCloud and Azure Government both offer reserved capacity (AWS Capacity Blocks for ML, Azure ML reserved instances) which can approach on-prem cost-per-GPU-hour at multi-year commitment.
Buy / capital lease
Good for: steady inference production, classified environments, long-term programs, situations where cloud is unavailable. Real cost is hardware plus datacenter integration plus power plus cooling plus ops staff plus refresh. Multi-year program horizon is necessary to amortize.
Hybrid
Common pattern: on-prem for steady production inference, cloud for burst training and for dev/test. Split the authorization accordingly.
Reserved capacity strategies
AWS Capacity Blocks for ML
Schedule specific H100/H200 instance capacity for a future window. Necessary for reliable large training jobs.
AWS Savings Plans
Commit to aggregate compute spend for 1-3 years; discount 20-60%. For steady inference.
Azure reservations
Equivalent mechanism on Azure Government.
Spot / preemptible
For training with checkpointing; 60-70% savings. Not for production inference.
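The options above can be compared on an annual basis for a steady inference node. The on-demand rate below is a hypothetical placeholder, not a quoted GovCloud price, and the discounts are midpoints of the ranges cited above:

```python
# Illustrative comparison of reserved-capacity options for steady inference.
ON_DEMAND = 98.32  # hypothetical $/hr for an 8xH100 node

plans = {
    "on-demand": ON_DEMAND,
    "savings-plan": ON_DEMAND * (1 - 0.40),  # midpoint of the 20-60% range
    "spot": ON_DEMAND * (1 - 0.65),          # midpoint of the 60-70% range
}

HOURS_PER_YEAR = 24 * 365  # steady inference runs around the clock
for name, rate in plans.items():
    print(f"{name:>12}: ${rate * HOURS_PER_YEAR:>10,.0f}/yr")
```

Spot's headline savings come with preemption risk, which is why it belongs with checkpointed training, not production inference.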
Power, cooling, and datacenter integration for on-prem
- 8xH100 SXM system: 10-12 kW sustained. 4 systems to a rack with high-density cooling.
- Power distribution: 3-phase, high-amperage whips. Facility needs must be confirmed before procurement.
- Cooling: liquid-cooled racks increasingly standard for H100/H200/B200 density. Air-cooled works but at lower density.
- Networking: NVIDIA Spectrum-X or InfiniBand for multi-node training. 400 Gbps per GPU is the pattern for H100 training clusters.
- Storage: high-IOPS NVMe tier for training data staging; object storage for archives.
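The power numbers above translate directly into operating cost. A sketch of the per-rack arithmetic; the PUE and electricity rate are assumptions to be replaced with facility-specific values:

```python
def rack_power_kw(systems_per_rack=4, kw_per_system=11.0):
    """IT load for a rack of 8xH100 SXM systems (10-12 kW each; 11 kW midpoint)."""
    return systems_per_rack * kw_per_system

def annual_power_cost_usd(it_load_kw, pue=1.4, usd_per_kwh=0.12):
    """Facility draw = IT load x PUE, sustained for a year.
    PUE and $/kWh are assumed placeholders, not measured values."""
    return it_load_kw * pue * usd_per_kwh * 24 * 365

rack_kw = rack_power_kw()                      # 44 kW of IT load per rack
print(round(annual_power_cost_usd(rack_kw)))   # annual power cost per rack
```

At these assumptions a single rack draws roughly $65k/year in power, which is why power dominates on-prem OpEx over a multi-year horizon.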
MIG vs whole-GPU scheduling
- MIG (H100/H200): partition a GPU into up to seven hardware-isolated instances (slice sizes of 1, 2, 3, 4, or 7 sevenths of the compute). Good for serving many small models, for dev/test multi-tenancy, and for separating classification contexts on the same hardware.
- Whole-GPU: for training, large-model inference, or any workload that would not fit a MIG slice.
- Time-slicing (Kubernetes plugin): no hardware isolation, just temporal sharing. Simpler than MIG but weaker boundary.
- Plan capacity in units that match the scheduling mode. A "10-GPU cluster" with MIG partitioning is not the same as 10 whole-GPU slots.
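Counting capacity in scheduling-mode units can be sketched as a lookup. The profile table below is a simplified subset of the H100 80GB profiles (see NVIDIA's MIG documentation for the full list):

```python
# Simplified subset of H100 80GB MIG profiles.
# Values: (compute sevenths, approx. memory GB).
MIG_PROFILES = {
    "1g.10gb": (1, 10),
    "2g.20gb": (2, 20),
    "3g.40gb": (3, 40),
    "7g.80gb": (7, 80),
}

def cluster_slots(gpus, profile):
    """Serving slots a cluster provides under one uniform MIG profile."""
    sevenths, _ = MIG_PROFILES[profile]
    return gpus * (7 // sevenths)

# A "10-GPU cluster" is 70 small slices, or 20 medium, or 10 whole GPUs:
print(cluster_slots(10, "1g.10gb"),
      cluster_slots(10, "3g.40gb"),
      cluster_slots(10, "7g.80gb"))
```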
Forecasting growth
- User count and per-user requests/day.
- Average input tokens and output tokens per request.
- Expected tokens/sec per GPU under the specific model and serving stack.
- Redundancy multiplier for availability (typically 1.5x-2x).
- Growth assumption: 2-5x year-over-year for successful deployments in the first two years, flattening after.
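The inputs above combine into a steady-state GPU count. A sketch that sizes to average demand (layer a peak-to-average factor on top if traffic is bursty); all example inputs are hypothetical:

```python
import math

def gpus_for_forecast(users, requests_per_user_day, tokens_per_request,
                      tok_per_s_per_gpu, utilization=0.55,
                      redundancy=1.5, growth=3.0):
    """Steady-state GPU count from the planning inputs listed above,
    with a growth multiplier applied to user demand."""
    daily_tokens = users * requests_per_user_day * tokens_per_request * growth
    demand_tok_per_s = daily_tokens / 86400
    base = demand_tok_per_s / (tok_per_s_per_gpu * utilization)
    return math.ceil(base * redundancy)

# 2,000 users x 20 requests/day x 1,500 tokens, 2,000 tok/s per GPU,
# 3x growth headroom, 1.5x redundancy:
print(gpus_for_forecast(2000, 20, 1500, 2000))
```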
What goes wrong
- Sizing on batch-1 latency. Actual throughput under batched inference is much higher; sizing on single-request latency over-provisions by 3-10x.
- Ignoring KV cache. Long contexts fit the model but not the cache; production OOM at high concurrency.
- MIG configured once, never re-tuned. Workload mix changes, MIG profile does not. Utilization collapses.
- Training capacity stranded. 16xH100 bought for training sits idle 90% of the time. Convert to inference or spot-sell internally.
- Power cost ignored. On-prem OpEx dominated by power over a 4-year horizon.
- Refresh not planned. GPUs bought in year 1 need a refresh plan in year 3-4 regardless of how the workload grows.
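The KV-cache failure mode can be made concrete. Assuming a 70B-class geometry (80 layers, 8 grouped-query KV heads, head dimension 128, 16-bit cache), a sketch of how few long contexts fit in the memory left after weights:

```python
def max_concurrent_contexts(free_gb, context_len=32768, layers=80,
                            kv_heads=8, head_dim=128, kv_bytes=2):
    """How many full-length contexts fit in the memory left after weights."""
    per_token = 2 * layers * kv_heads * head_dim * kv_bytes  # K and V
    return int(free_gb * 1e9 // (per_token * context_len))

# With ~20 GB free after a 70B model's weights, a single 32k context
# consumes ~10.7 GB of KV cache -- only one fits:
print(max_concurrent_contexts(20))
```

The model loads and passes functional tests, then OOMs the first time several users send long documents at once; this is why concurrency at full context length belongs in the sizing math.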
Where this fits in our practice
We size GPU capacity as part of the full stack build. See our on-prem LLM deployment for the workloads the GPUs run, our Kubernetes in the IC tier for the scheduling platform, and our MLOps on GovCloud for orchestration.