
On-prem LLM deployment for air-gapped federal environments.

April 4, 2026 · 17 min read · Hardware sizing, model selection, serving stack, and update pipelines for disconnected networks.

The constraint that shapes everything

On-prem and air-gapped LLM deployments live by a different set of rules than their cloud cousins. No vendor API. No automatic updates. No outbound telemetry. No dynamic scaling from the hyperscaler's fleet. Every decision — which model, which serving stack, which quantization, which hardware — is made once, lived with for months, and updated through a deliberate cross-domain process.

This post is the reference design we ship for classified and air-gapped LLM deployments in 2026, plus the mistakes we have watched programs make by assuming cloud patterns transfer.

Scope. Genuinely disconnected environments: SIPR, JWICS, isolated research enclaves, and OCONUS facilities with bandwidth constraints. FedRAMP High GovCloud deployments follow a related but less restrictive pattern.

Air-gapped LLM deployment architecture

```mermaid
flowchart LR
    U[User Workstation] --> GW[Secure API Gateway]
    GW --> AUTH[Auth Service PKI or CAC]
    GW --> INF[Inference Server Llama or Mistral]
    INF --> MS[Model Store NFS or local SSD]
    INF --> PA[Prompt Audit Log]
    PA --> SIEM[SIEM]
    style GW fill:#d97706,color:#fff,stroke:#d97706
    style INF fill:#3b82f6,color:#fff,stroke:#3b82f6
    style MS fill:#0d9488,color:#fff,stroke:#0d9488
    style PA fill:#dc2626,color:#fff,stroke:#dc2626
    style SIEM fill:#dc2626,color:#fff,stroke:#dc2626
    style AUTH fill:#d97706,color:#fff,stroke:#d97706
```

Model selection for disconnected environments

You cannot use hosted frontier models. The decision space is open-weight models you can run entirely inside the enclave. The viable options in 2026:

  • Llama 3.1 / 3.3 (Meta). 8B, 70B, 405B. The default. Permissive license, strong English and instruction-following, wide tooling support. 70B is the sweet spot for mission QA; 8B for latency-critical paths.
  • Mixtral 8x22B (Mistral AI). Sparse mixture of experts; activates ~39B parameters per token. Strong quality at moderate cost. Apache 2.0 license.
  • Qwen 2.5 (Alibaba). 7B to 72B. Strong on coding, math, multilingual. License is permissive for most federal use but check the specific variant.
  • DeepSeek V3 / R1 derivatives. Strong reasoning. License is permissive. Originating from a foreign developer; review your program's foreign-weights policy before selecting.
  • Phi-4 (Microsoft). 14B, small-model efficiency. Good for structured tasks.
  • Gemma 2 (Google). 2B, 9B, 27B. Strong smaller-model option with Google Terms of Use (distinct from Apache 2.0 — review).

Hardware sizing

Model choice drives hardware. Some working numbers for 2026 NVIDIA hardware (bf16 unless noted):

| Model | Min GPU memory | Typical serving config | Throughput (tok/s, batched) |
| --- | --- | --- | --- |
| Llama 3.1 8B bf16 | ~18 GB | 1x L40S or 1x H100 | 2000-4000 |
| Llama 3.1 70B bf16 | ~140 GB | 4x A100 80GB or 2x H100 | 1500-3000 |
| Llama 3.1 70B AWQ 4-bit | ~40 GB | 1x H100 or 2x L40S | 1200-2500 |
| Llama 3.1 405B bf16 | ~810 GB | 8x H100 or 4x H200 | 500-1200 |
| Mixtral 8x22B bf16 | ~280 GB | 4x H100 or 2x H200 | 1800-3500 |
| Mixtral 8x22B AWQ 4-bit | ~80 GB | 2x H100 | 1500-2800 |

Throughput varies widely with prompt length, batch composition, and serving stack. These are planning numbers, not SLAs.
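The memory column follows from simple arithmetic: raw weight footprint is parameter count times bytes per parameter, before KV cache and runtime overhead. A quick sketch of that planning math (the overhead margin noted in the comments is our rule of thumb, not a vendor figure):

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Raw weight footprint in GB: parameter count x precision width.

    Excludes KV cache, CUDA context, and activation buffers.
    """
    return params_billions * bytes_per_param

# Llama 3.1 70B in bf16 (2 bytes/param) -> 140 GB, matching the table.
print(weight_memory_gb(70, 2.0))   # 140.0
# AWQ 4-bit is ~0.5 bytes/param -> 35 GB raw; scales and overhead push it near 40 GB.
print(weight_memory_gb(70, 0.5))   # 35.0
```

Budget headroom on top of this: KV cache and activation buffers can add tens of GB at high batch sizes, which is why the 70B bf16 row asks for ~140 GB of weights across GPUs totaling 160 GB or more.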

Serving stack: vLLM and TensorRT-LLM

vLLM

Open-source, Apache 2.0, Python. Implements PagedAttention (efficient KV cache management), continuous batching (requests join and leave the batch across decode steps), speculative decoding, and wide model compatibility. Runs on NVIDIA and AMD GPUs. The default choice for on-prem serving in 2026.

What you get out of the box: OpenAI-compatible HTTP API, tensor and pipeline parallelism, quantization support (AWQ, GPTQ, FP8), LoRA adapter hot-swap, metrics endpoint for Prometheus.
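Because the API is OpenAI-compatible, in-enclave clients need nothing beyond the standard library. A minimal request-building sketch — the endpoint host and model name are hypothetical placeholders for your enclave's values:

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a chat-completions request against vLLM's OpenAI-compatible API."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "temperature": 0.2,
    }).encode()
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

# Hypothetical in-enclave endpoint and served model name:
req = build_chat_request("http://inference.enclave.local:8000",
                         "llama-3.1-70b-instruct", "Summarize the SOP.")
# resp = urllib.request.urlopen(req)  # only resolvable inside the enclave
```

Client code written against this shape also works unchanged if you later move a workload behind TensorRT-LLM fronted by an OpenAI-compatible shim.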

TensorRT-LLM

NVIDIA-optimized inference engine. Higher throughput on NVIDIA hardware for steady-state production (20-40% over vLLM in many configurations), at the cost of an engine build step per model + quantization + parallelism combination. Engine builds are not trivial — expect 30-120 minutes per build and a dependency on matching CUDA and driver versions.

We use TensorRT-LLM when the workload is stable, throughput is the binding constraint, and the team has NVIDIA expertise on staff. Otherwise vLLM.

Alternatives worth naming

SGLang

Competitive with vLLM on throughput for some workloads; strong for structured generation.

llama.cpp

CPU and Apple Silicon; niche in federal but useful for edge deployments.

Hugging Face TGI

Mature, good tooling; overlaps vLLM in capability.

NVIDIA NIM

Containerized TensorRT-LLM with preset configs. Convenient; review licensing for federal use.

Air-gap packaging

The installation bundle is a first-class deliverable. What goes in:

  • Model weights (safetensors) with checksums.
  • Tokenizer files.
  • Serving image (OCI tarball) including vLLM / TensorRT-LLM with pinned versions.
  • Python wheels for any Python-layer dependencies if not baked into the image.
  • CUDA and driver version manifest.
  • Install and verification scripts.
  • Eval harness bundle and a small evaluation set so the install can be verified post-transfer.
  • Signed manifest listing every file, hash, and provenance.

The bundle moves across the air gap via approved media or cross-domain solution. Verification on the high side re-computes every hash, validates the signature, and runs the eval set before the new version is promoted.
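The hash-verification half of that step is mechanical enough to sketch. This toy version works on an in-memory map of bundle files; a real implementation walks the media mount, and the signature check on the manifest itself (GPG or similar, per your program's PKI) is omitted here:

```python
import hashlib

def build_manifest(files: dict[str, bytes]) -> dict[str, str]:
    """Record a SHA-256 digest per bundle file (in-memory sketch)."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}

def verify_bundle(files: dict[str, bytes], manifest: dict[str, str]) -> list[str]:
    """Return files that are missing or whose digest disagrees with the manifest."""
    current = build_manifest(files)
    return [name for name, digest in manifest.items() if current.get(name) != digest]

bundle = {"model.safetensors": b"weights...", "tokenizer.json": b"vocab..."}
manifest = build_manifest(bundle)          # signed on the low side
tampered = dict(bundle, **{"model.safetensors": b"corrupted"})
print(verify_bundle(bundle, manifest))     # []
print(verify_bundle(tampered, manifest))   # ['model.safetensors']
```

Any non-empty result fails the transfer; the bundle is quarantined, not promoted.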

The update pipeline is your most important deliverable. Getting a model in is easy once. Getting twelve versions in over three years without drift is the actual engineering problem.

Update pipeline pattern

  1. Low-side pull. Weights are downloaded, hashed, and placed in a staging area on the low side. Source provenance is logged.
  2. Scan. Bundle is scanned for malware with approved tools. Supply-chain provenance is verified (model card, source repo commit hash).
  3. Package. Full install bundle is assembled and signed.
  4. Transfer. Bundle moves via approved path. Receipt is logged.
  5. High-side verify. Hashes and signatures re-checked. Test install on a non-production node.
  6. Eval. Run the regression eval set against the new model. Compare to champion.
  7. Promote. If eval passes, promote to staging, then canary, then full production. Rollback plan documented.
  8. Decommission. Old weights and image tags are retained for rollback window, then purged on schedule.
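Steps 6 and 7 reduce to a small per-task comparison. A sketch — the per-task structure matters because aggregate scores hide narrow regressions (see the quantization failure mode later in this post); the 2% budget and task names are illustrative, not a standard:

```python
def promote(champion: dict[str, float], challenger: dict[str, float],
            max_regression: float = 0.02) -> tuple[bool, list[str]]:
    """Gate promotion: every eval task must stay within the regression budget."""
    regressions = [task for task, base in champion.items()
                   if challenger.get(task, 0.0) < base - max_regression]
    return (not regressions, regressions)

champion   = {"mission_qa": 0.86, "extraction": 0.91}
challenger = {"mission_qa": 0.88, "extraction": 0.84}
ok, failing = promote(champion, challenger)
print(ok, failing)  # False ['extraction']
```

The challenger above is better on average yet still fails the gate, which is exactly the behavior you want in step 7.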

Observability without outbound telemetry

You cannot ship traces to Datadog or model metrics to a SaaS. What works:

Prometheus + Grafana in-enclave

vLLM and TensorRT-LLM both expose Prometheus endpoints. Standard serving metrics: request rate, P50/P95/P99 latency, queue depth, KV cache usage, GPU utilization.
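Beyond Prometheus scraping, ad-hoc checks against the /metrics endpoint are handy in an enclave. A minimal parser for the Prometheus text exposition format — the metric names below match vLLM's naming scheme, but verify them against your pinned version, and note this sketch ignores labeled series:

```python
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running 3.0
vllm:gpu_cache_usage_perc 0.42
"""

def parse_metrics(text: str) -> dict[str, float]:
    """Parse unlabeled samples from Prometheus text exposition output."""
    metrics = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip HELP/TYPE comments and blanks
        name, value = line.rsplit(" ", 1)
        metrics[name] = float(value)
    return metrics

m = parse_metrics(SAMPLE)
print(m["vllm:gpu_cache_usage_perc"])  # 0.42
```

A cron job running this against the live endpoint and alerting on KV cache usage is a serviceable fallback when standing up a full Prometheus stack is still weeks away in the accreditation queue.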

OpenTelemetry + Tempo or Jaeger in-enclave

Distributed tracing at the application layer.

Structured JSON logs to Loki or Elasticsearch

Request/response logging with classification markings.

Eval-as-monitoring

Run the eval harness on a cadence (hourly on a subset, daily on full) and alert on regression. In a disconnected environment this is your early warning for model-serving drift.

Security constraints that shape architecture

No outbound

Serving processes must not make any outbound network call. Block at the host firewall level and audit.

Supply-chain integrity

Every container, every Python wheel, every CUDA binary is pulled through an approved internal artifact registry, not pip directly.

Weights signing

Model weights are signed by the authorizing official; serving verifies signature on load.

STIG compliance

Host OS is STIG-hardened (DISA STIGs for RHEL / Ubuntu Pro). GPU driver install must be reconciled with STIG settings.

Access logging

Every inference request has a user identity, timestamp, classification, and input/output hash logged to an audit store.
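One workable record shape — storing hashes rather than raw text keeps the audit store at a lower handling level than the prompts themselves. Field names here are illustrative, not a schema we are prescribing:

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_record(user: str, classification: str, prompt: str, completion: str) -> str:
    """One JSON audit line per inference request; hashes stand in for content."""
    return json.dumps({
        "user": user,
        "ts": datetime.now(timezone.utc).isoformat(),
        "classification": classification,
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "completion_sha256": hashlib.sha256(completion.encode()).hexdigest(),
    })

line = audit_record("jdoe", "U//FOUO", "What is the SOP?", "The SOP is ...")
```

The hash still lets an investigator confirm that a specific prompt passed through the system without the audit pipeline ever carrying the prompt itself.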

Data-at-rest encryption

Model weights, logs, eval data — all on encrypted storage (LUKS on Linux, self-encrypting drives, or SED-plus-volume-encryption).

Failure modes we have debugged in disconnected enclaves

Silent GPU MIG misconfiguration

H100 MIG slices configured for a prior workload left a subset of the GPU unusable. Throughput was half what we expected, and no one noticed for a week.

Mismatched CUDA version on transfer

Bundle was built against CUDA 12.4; high-side hosts were pinned to 12.2. TensorRT-LLM engines would not load.

Tokenizer drift

New model version shipped a tokenizer change; downstream RAG chunking assumed the old tokens. Context overflow in production.

LoRA adapter mismatch

Adapter trained on a slightly different base snapshot produced garbled output. Weight hashes must match.

Quantization quality collapse on narrow tasks

AWQ 4-bit performed fine on broad QA but lost a meaningful percentage on structured-extraction tasks. Task-level eval caught it; aggregate eval missed it.

KV cache exhaustion under real traffic

Production prompt length exceeded sizing assumptions. Latency spiked and requests were dropped. Sizing must include worst-case context length, not average.
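Worst-case KV cache is computable up front: per token it is 2 (K and V) x layers x KV heads x head dim x bytes per value, then multiply by context length and concurrency. Using Llama 3.1 70B's published shape (80 layers, 8 KV heads under GQA, head dim 128) in bf16:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_len: int, batch: int, bytes_per_val: int = 2) -> float:
    """KV cache footprint: 2 (K and V) x layers x kv_heads x head_dim x bytes,
    per token, scaled by context length and concurrent sequences."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_val
    return per_token * context_len * batch / 1e9

# 32 concurrent requests at the 8k-token contexts the sizing assumed...
print(round(kv_cache_gb(80, 8, 128, 8192, 32), 1))   # 85.9
# ...versus the 32k contexts real traffic actually sent.
print(round(kv_cache_gb(80, 8, 128, 32768, 32), 1))  # 343.6
```

A 4x jump in context length is a 4x jump in cache demand; either the cache budget covers the worst case or the gateway enforces a hard context cap at admission.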

Cost planning

Order-of-magnitude numbers for a production on-prem LLM platform, 2026:

Hardware (one site, Llama 70B class, redundant)

$400K-800K for 4-8 H100s, networking, storage, and racks.

Software and licenses

Largely open source; budget for NVIDIA enterprise driver support, Red Hat OpenShift, and the monitoring stack. $50K-150K/year.

Engineering (build + operate, year one)

$600K-1.5M depending on scope.

Update pipeline (ongoing)

A named role, typically 0.5-1 FTE per enclave.

Where this fits in our practice

We design and build on-prem LLM platforms end to end. See our GPU capacity planning for sizing, our Kubernetes in the IC tier for the container platform, and our RAG architecture for the retrieval layer that typically sits alongside the LLM.

FAQ

Which open-weight models are ready for on-prem federal deployment in 2026?
The strongest options are Llama 3.1 / 3.3 (8B, 70B, 405B), Mixtral 8x7B and 8x22B, Qwen 2.5 (7B-72B), and DeepSeek V3 derivatives. For most mission workloads Llama 3.1 70B or Mixtral 8x22B is the baseline. 405B is only worthwhile when you need frontier-class quality and can afford 8x H100 or 4x H200 for serving.
Can vLLM serve models in a disconnected network?
Yes. vLLM is a Python package with no required network calls once the model weights and dependencies are local — point it at on-disk weight paths and set HF_HUB_OFFLINE=1 so nothing attempts to reach Hugging Face. Build an installation bundle with the wheel files, model weights, and a containerized serving image, transfer it across the air gap, and run. The same pattern works for TensorRT-LLM.
How do you update models in an air-gapped environment?
Through a controlled transfer pipeline: model is downloaded on the low side, hashed, scanned for malware, signed by an authorizing official, burned to approved removable media or pushed through a cross-domain solution, verified on the high side, and promoted through staging before production. Cadence is typically quarterly, not weekly.
What GPUs are available in GovCloud and classified enclaves?
AWS GovCloud has H100 and H200 availability that has steadily improved through 2025-2026. Classified enclaves typically use on-prem racks: DGX H100 or H200 systems, HPE Apollo, or Dell PowerEdge XE9680. Availability and waitlists vary by facility.
Do you need TensorRT-LLM or is vLLM enough?
vLLM is the default starting point — faster to stand up, wide model support, good throughput with PagedAttention. TensorRT-LLM delivers 20-40% better throughput on NVIDIA hardware for production steady-state, at the cost of engine builds and less model flexibility. Use vLLM for iteration, TensorRT-LLM for steady-state production.
Is quantization safe to use on mission models?
AWQ and GPTQ 4-bit quantization give 3-4x memory reduction with modest quality loss on most tasks. FP8 on H100/H200 is close to BF16 quality for many workloads. Always evaluate quantized vs full-precision on your eval harness before deploying quantized to mission.


Deploying an LLM inside a classified or disconnected enclave?

We size hardware, select open-weight models, build the vLLM or TensorRT-LLM serving layer, and design update pipelines that survive air-gap constraints.