The constraint that shapes everything
On-prem and air-gapped LLM deployments live by a different set of rules than their cloud cousins. No vendor API. No automatic updates. No outbound telemetry. No dynamic scaling from the hyperscaler's fleet. Every decision — which model, which serving stack, which quantization, which hardware — is made once, lived with for months, and updated through a deliberate cross-domain process.
This post is the reference design we ship for classified and air-gapped LLM deployments in 2026, plus the mistakes we have watched programs make by assuming cloud patterns transfer.
Air-gapped LLM deployment architecture
Model selection for disconnected environments

You cannot use hosted frontier models. The decision space is limited to open-weight models you can run entirely inside the enclave. The viable options in 2026:
- Llama 3.1 / 3.3 (Meta). 8B, 70B, 405B. The default. Permissive license, strong English and instruction-following, wide tooling support. 70B is the sweet spot for mission QA; 8B for latency-critical paths.
- Mixtral 8x22B (Mistral AI). Sparse mixture of experts; activates ~39B parameters per token. Strong quality at moderate cost. Apache 2.0 license.
- Qwen 2.5 (Alibaba). 7B to 72B. Strong on coding, math, multilingual. License is permissive for most federal use but check the specific variant.
- DeepSeek V3 / R1 derivatives. Strong reasoning. License is permissive. Originating from a foreign developer; review your program's foreign-weights policy before selecting.
- Phi-4 (Microsoft). 14B, small-model efficiency. Good for structured tasks.
- Gemma 2 (Google). 2B, 9B, 27B. Strong smaller-model option, released under the Gemma Terms of Use (a custom license distinct from Apache 2.0; review before committing).
Hardware sizing
Model choice drives hardware. Some working numbers for 2026 NVIDIA hardware (bf16 unless noted):
| Model | Min GPU memory | Typical serving config | Throughput (tokens/sec, batch) |
|---|---|---|---|
| Llama 3.1 8B bf16 | ~18 GB | 1x L40S or 1x H100 | 2000-4000 tok/s (batched) |
| Llama 3.1 70B bf16 | ~140 GB | 4x A100 80GB or 2x H100 | 1500-3000 tok/s (batched) |
| Llama 3.1 70B AWQ 4-bit | ~40 GB | 1x H100 or 2x L40S | 1200-2500 tok/s (batched) |
| Llama 3.1 405B bf16 | ~810 GB | 16x H100 (two nodes) or 8x H200 | 500-1200 tok/s (batched) |
| Mixtral 8x22B bf16 | ~280 GB | 4x H100 or 2x H200 | 1800-3500 tok/s (batched) |
| Mixtral 8x22B AWQ 4-bit | ~80 GB | 2x H100 | 1500-2800 tok/s (batched) |
Throughput varies widely with prompt length, batch composition, and serving stack. These are planning numbers, not SLAs.
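The table above can be approximated with simple arithmetic: weights memory is parameter count times bytes per parameter, plus headroom, rounded up to a power-of-two GPU count because tensor parallelism usually wants 1/2/4/8 ranks. A minimal sketch, with the 10% headroom factor being an illustrative assumption rather than a vendor figure:

```python
import math

def min_gpus(params_b: float, bytes_per_param: float, gpu_gb: float,
             headroom: float = 1.10) -> int:
    """Smallest power-of-two GPU count whose combined memory holds the
    weights plus headroom for runtime allocations. KV cache for long
    contexts needs its own budget on top of this."""
    need_gb = params_b * bytes_per_param * headroom   # decimal GB
    raw = math.ceil(need_gb / gpu_gb)
    return 1 << (raw - 1).bit_length()                # next power of two

print(min_gpus(70, 2.0, 80))    # Llama 3.1 70B bf16 on 80 GB H100s -> 2
print(min_gpus(405, 2.0, 141))  # Llama 3.1 405B bf16 on 141 GB H200s -> 8
```

This is a planning heuristic only; interconnect topology and KV cache sizing for your actual context lengths can push the count higher.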
Serving stack: vLLM and TensorRT-LLM
vLLM
Open-source, Apache 2.0, Python. Implements PagedAttention (efficient KV cache management), continuous batching (requests join and leave the batch across decode steps), speculative decoding, and wide model compatibility. Runs on NVIDIA and AMD GPUs. The default choice for on-prem serving in 2026.
What you get out of the box: OpenAI-compatible HTTP API, tensor and pipeline parallelism, quantization support (AWQ, GPTQ, FP8), LoRA adapter hot-swap, metrics endpoint for Prometheus.
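The OpenAI-compatible API means existing client code works unchanged against the in-enclave endpoint. A sketch of the request shape (model name, prompt, and host below are placeholders for your deployment, not prescribed values):

```python
import json

# Chat completion request body accepted by vLLM's OpenAI-compatible
# server. Model name and parameters here are illustrative.
payload = {
    "model": "llama-3.1-70b-instruct",
    "messages": [
        {"role": "system", "content": "Answer from the provided context only."},
        {"role": "user", "content": "Summarize the attached report."},
    ],
    "max_tokens": 512,
    "temperature": 0.2,
}
body = json.dumps(payload)
# POST this to http://<serving-host>:8000/v1/chat/completions with any
# OpenAI-compatible client; no outbound SaaS dependency is involved.
print(json.loads(body)["model"])
```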
TensorRT-LLM
NVIDIA-optimized inference engine. Higher throughput on NVIDIA hardware for steady-state production (20-40% over vLLM in many configurations), at the cost of an engine build step per model + quantization + parallelism combination. Engine builds are not trivial — expect 30-120 minutes per build and a dependency on matching CUDA and driver versions.
We use TensorRT-LLM when the workload is stable, throughput is the binding constraint, and the team has NVIDIA expertise on staff. Otherwise vLLM.
Alternatives worth naming
SGLang
Competitive with vLLM on throughput for some workloads; strong for structured generation.
llama.cpp
CPU and Apple Silicon inference; niche in federal environments but useful for edge deployments.
Hugging Face TGI
Mature, good tooling; overlaps with vLLM in capability.
NVIDIA NIM
Containerized TensorRT-LLM with preset configs. Convenient; review licensing for federal use.
Air-gap packaging
The installation bundle is a first-class deliverable. What goes in:
- Model weights (safetensors) with checksums.
- Tokenizer files.
- Serving image (OCI tarball) including vLLM / TensorRT-LLM with pinned versions.
- Python wheels for any Python-layer dependencies if not baked into the image.
- CUDA and driver version manifest.
- Install and verification scripts.
- Eval harness bundle and a small evaluation set so the install can be verified post-transfer.
- Signed manifest listing every file, hash, and provenance.
The bundle moves across the air gap via approved media or cross-domain solution. Verification on the high side re-computes every hash, validates the signature, and runs the eval set before the new version is promoted.
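The high-side hash re-check is mechanical enough to sketch. The manifest layout below (a flat JSON map of relative path to SHA-256) is an assumption for illustration; real programs also validate the manifest's signature before trusting its contents:

```python
import hashlib
import json
import pathlib

def verify_bundle(bundle_dir: str, manifest_path: str) -> list[str]:
    """Re-compute SHA-256 for every file named in a manifest of the
    assumed form {"files": {"relative/path": "<sha256 hex>", ...}}.
    Returns the list of paths whose hashes do not match; an empty
    list means the bundle arrived intact."""
    manifest = json.loads(pathlib.Path(manifest_path).read_text())
    mismatched = []
    for rel_path, expected in manifest["files"].items():
        digest = hashlib.sha256(
            (pathlib.Path(bundle_dir) / rel_path).read_bytes()).hexdigest()
        if digest != expected:
            mismatched.append(rel_path)
    return mismatched
```

A nonzero result should block the install outright, not merely warn.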
Update pipeline pattern
- Low-side pull. Weights are downloaded, hashed, and placed in a staging area on the low side. Source provenance is logged.
- Scan. Bundle is scanned for malware with approved tools. Supply-chain provenance is verified (model card, source repo commit hash).
- Package. Full install bundle is assembled and signed.
- Transfer. Bundle moves via approved path. Receipt is logged.
- High-side verify. Hashes and signatures re-checked. Test install on a non-production node.
- Eval. Run the regression eval set against the new model. Compare to champion.
- Promote. If eval passes, promote to staging, then canary, then full production. Rollback plan documented.
- Decommission. Old weights and image tags are retained for rollback window, then purged on schedule.
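The eval-and-promote steps reduce to a gate that compares challenger to champion per task. A minimal sketch; the 0.02 absolute-regression threshold is an illustrative assumption and should be set per task in practice:

```python
def promotion_gate(champion: dict, challenger: dict,
                   max_regression: float = 0.02) -> tuple[bool, list[str]]:
    """Promote only if no eval task regresses by more than
    max_regression (absolute score drop). Gating per task rather than
    on an aggregate score is deliberate: aggregate comparisons hide
    narrow-task collapses."""
    failures = [task for task, champ_score in champion.items()
                if challenger.get(task, 0.0) < champ_score - max_regression]
    return (not failures, failures)

ok, failed = promotion_gate(
    {"mission_qa": 0.81, "structured_extraction": 0.74},
    {"mission_qa": 0.82, "structured_extraction": 0.65},
)
print(ok, failed)   # False ['structured_extraction']
```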
Observability without outbound telemetry
You cannot ship traces to Datadog or model metrics to a SaaS. What works:
Prometheus + Grafana in-enclave
vLLM and TensorRT-LLM both expose Prometheus endpoints. Standard serving metrics: request rate, P50/P95/P99 latency, queue depth, KV cache usage, GPU utilization.
OpenTelemetry + Tempo or Jaeger in-enclave
Distributed tracing at the application layer.
Structured JSON logs to Loki or Elasticsearch
Request/response logging with classification markings.
Eval-as-monitoring
Run the eval harness on a cadence (hourly on a subset, daily on full) and alert on regression. In a disconnected environment this is your early warning for model-serving drift.
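The alerting rule can be as simple as flagging a score that falls well below the trailing distribution. A sketch, assuming a rolling window of recent scores; the three-sigma threshold is an illustrative default, not a recommendation:

```python
import statistics

def drift_alert(history: list[float], latest: float,
                sigmas: float = 3.0) -> bool:
    """Alert when the latest scheduled eval score falls more than
    `sigmas` standard deviations below the trailing mean of recent
    runs. Window size and threshold are assumptions to tune."""
    mean = statistics.fmean(history)
    spread = statistics.pstdev(history)
    return latest < mean - sigmas * spread

scores = [0.80, 0.81, 0.79, 0.80, 0.81]
print(drift_alert(scores, 0.80))  # False: within normal variation
print(drift_alert(scores, 0.60))  # True: regression worth paging on
```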
Security constraints that shape architecture
No outbound
Serving processes must not make any outbound network call. Block at the host firewall level and audit.
Supply-chain integrity
Every container, every Python wheel, every CUDA binary is pulled through an approved internal artifact registry, not pip directly.
Weights signing
Model weights are signed by the authorizing official; serving verifies signature on load.
STIG compliance
Host OS is STIG-hardened (DISA STIGs for RHEL / Ubuntu Pro). GPU driver install must be reconciled with STIG settings.
Access logging
Every inference request has a user identity, timestamp, classification, and input/output hash logged to an audit store.
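One audit-store line per request can carry everything listed above. Hashing the prompt and completion rather than storing raw text is a design assumption sketched here to keep the log's handling burden low; some programs must retain full text instead:

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_record(user: str, classification: str,
                 prompt: str, completion: str) -> str:
    """One JSON line for the audit store: identity, timestamp,
    classification, and input/output hashes for a single request."""
    return json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "classification": classification,
        "input_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "output_sha256": hashlib.sha256(completion.encode()).hexdigest(),
    })
```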
Data-at-rest encryption
Model weights, logs, eval data — all on encrypted storage (LUKS on Linux, self-encrypting drives, or SED-plus-volume-encryption).
Failure modes we have debugged in disconnected enclaves
Silent GPU MIG misconfiguration
H100 MIG slices configured for a prior workload left a subset of the GPU unusable. Throughput was half of what was expected, and no one noticed for a week.
Mismatched CUDA version on transfer
Bundle was built against CUDA 12.4; high-side hosts were pinned to 12.2. TensorRT-LLM engines would not load.
Tokenizer drift
New model version shipped a tokenizer change; downstream RAG chunking assumed the old tokenization. Context overflow in production.
LoRA adapter mismatch
Adapter trained on a slightly different base snapshot produced garbled output. Weight hashes must match.
Quantization quality collapse on narrow tasks
AWQ 4-bit performed fine on broad QA but lost meaningful accuracy on structured-extraction tasks. Task-level eval caught it; aggregate eval missed it.
KV cache exhaustion under real traffic
Production prompt length exceeded sizing assumptions. Latency spiked and requests were dropped. Sizing must include worst-case context length, not average.
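The KV cache arithmetic behind that failure is worth doing up front: two tensors (K and V) per layer per token, scaled by KV heads and head dimension. Using Llama 3.1 70B's published shape (80 layers, 8 KV heads via grouped-query attention, head dimension 128):

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_len: int, batch: int,
                bytes_per_elem: int = 2) -> float:
    """Decimal GB of KV cache: 2 (K and V) per layer per token,
    times KV heads, head dim, and element width (2 bytes for bf16)."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * context_len * batch / 1e9

# 32 concurrent requests at the full 128K context is on the order of
# 1.3 TB of KV cache, far beyond any single-node deployment's spare
# memory. Sizing against average context length hides exactly this.
print(round(kv_cache_gb(80, 8, 128, 128_000, 32)))
```

This is why the sizing exercise must assume worst-case concurrent context, and why limits on max context length per request are a capacity-planning control, not just a quality one.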
Cost planning
Order-of-magnitude numbers for a production on-prem LLM platform, 2026:
Hardware (one site, Llama 70B class, redundant)
$400K-800K for 4-8 H100s, networking, storage, and racks.
Software and licenses
Largely open-source; paid items are NVIDIA enterprise driver support, Red Hat OpenShift, and the monitoring stack. $50K-150K/year.
Engineering (build + operate, year one)
$600K-1.5M depending on scope.
Update pipeline (ongoing)
A named role; typically 0.5-1 FTE per enclave.
Where this fits in our practice
We design and build on-prem LLM platforms end to end. See our GPU capacity planning for sizing, our Kubernetes in the IC tier for the container platform, and our RAG architecture for the retrieval layer that typically sits alongside the LLM.