The question people are actually asking
"Should we fine-tune the model or use RAG?" is almost never the real question. The real question is some mix of: "how do we get this model to be accurate on our domain," "how do we keep our data inside our boundary," "how do we make it cite sources," "how do we make it sound like us," and "how do we afford it at production scale." Each of those sub-questions has a different answer, and the architecture falls out of answering them honestly.
This post is the decision tree we walk with federal programs. It is not a theoretical survey. It is what we actually tell agencies who ask us whether they should stand up a fine-tuning pipeline or invest that effort in retrieval.
What each technique actually does

RAG keeps the knowledge outside the model. At query time you retrieve relevant passages from a vector or hybrid store, pass them to the model as context, and the model answers from that context. The model itself is unchanged. The knowledge lives in the index.
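The query path above can be sketched in a few lines. This is a toy — the retriever is keyword overlap standing in for a vector or hybrid store, and the corpus and question are invented — but the shape is the point: knowledge stays in the index, and the prompt carries it to an unchanged model.

```python
# Toy sketch of the RAG query path: retrieve passages, then build a prompt
# that asks the model to answer only from that context, with citations.

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Score each passage by query-term overlap and return the top k."""
    terms = set(query.lower().split())
    return sorted(corpus, key=lambda p: -len(terms & set(p.lower().split())))[:k]

def build_prompt(query: str, passages: list[str]) -> str:
    """The model answers from retrieved context; its weights are unchanged."""
    context = "\n".join(f"[{i+1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer from the context below and cite passage numbers.\n"
        f"{context}\nQuestion: {query}"
    )

corpus = [
    "NIST SP 800-53 defines security control families.",
    "The FAR governs federal acquisition.",
    "LoRA trains low-rank adapter matrices.",
]
prompt = build_prompt("Which publication defines control families?",
                      retrieve("Which publication defines control families?", corpus))
```

Swapping the toy `retrieve` for a real dense-plus-BM25 retriever changes nothing downstream — which is exactly why the corpus can churn daily without touching the model.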
Fine-tuning pushes knowledge, behavior, or both into the weights. Supervised fine-tuning (SFT) teaches the model patterns from input-output pairs. Instruction tuning teaches it to follow specific instruction formats. Preference tuning (DPO, KTO) teaches it to prefer certain answer styles. LoRA and QLoRA train a small adapter layered on top of a frozen base model, which is the only fine-tuning variant that makes economic sense on federal programs outside of very large primes.
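The economics claim about LoRA is just arithmetic: a rank-r adapter replaces a d x d weight update with two thin matrices, d x r and r x d. A quick back-of-envelope (hidden size and rank chosen as typical values, not program specifics):

```python
# Why LoRA is the economical fine-tune: trainable parameters per adapted
# matrix are 2*d*r for the low-rank pair versus d*d for a full update.

def lora_params(d: int, r: int) -> int:
    return 2 * d * r          # A is (d x r), B is (r x d)

d, r = 4096, 16               # typical 7B-class hidden size, common LoRA rank
full = d * d                  # full update to one attention projection
adapter = lora_params(d, r)
print(f"full: {full:,}  adapter: {adapter:,}  ratio: {full // adapter}x")
# → full: 16,777,216  adapter: 131,072  ratio: 128x
```

A 128x reduction per adapted matrix, with the base model frozen, is what lets the training run fit on a single GPU inside the boundary.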
The two techniques solve different problems. RAG solves "the model does not know this specific fact" and "we need to cite where we got it." Fine-tuning solves "the model does not respond the way we want" and "we need a consistent pattern of output."
Decision tree, top level
```mermaid
flowchart TD
A([Your federal LLM use case]) --> B{Answers from\na specific corpus?}
B -->|Yes| RAG1[RAG — document retrieval\nis mandatory]
B -->|No| C{Citations required\nback to source?}
C -->|Yes| RAG2[RAG — fine-tuned models\nfabricate citations]
C -->|No| D{Corpus changes\nmore than quarterly?}
D -->|Yes| RAG3[RAG — fine-tunes\ngo stale immediately]
D -->|No| E{Access is user-dependent\nor clearance-gated?}
E -->|Yes| RAG4[RAG — cannot bake\nper-user access into weights]
E -->|No| F{Narrow classification,\nextraction, or transformation?}
F -->|Yes| FT1[Fine-Tune — 7B specialized\nbeats frontier and cheaper]
F -->|No| G{Specific format or\ntone base model resists?}
G -->|Yes| FT2[Fine-Tune adapter —\nprompt-forcing is fragile at scale]
G -->|No| H[Evaluate further —\nRAG plus adapter hybrid often wins]
style RAG1 fill:#3b82f6,color:#fff,stroke:#3b82f6
style RAG2 fill:#3b82f6,color:#fff,stroke:#3b82f6
style RAG3 fill:#3b82f6,color:#fff,stroke:#3b82f6
style RAG4 fill:#3b82f6,color:#fff,stroke:#3b82f6
style FT1 fill:#7c3aed,color:#fff,stroke:#7c3aed
style FT2 fill:#7c3aed,color:#fff,stroke:#7c3aed
style H fill:#0d9488,color:#fff,stroke:#0d9488
```
- Does the task require answering from a specific document corpus? Yes: RAG is mandatory. The only question left is whether you also need fine-tuning on top.
- Does the task require citations back to source? Yes: RAG. Fine-tuning cannot honestly cite; it compresses the training data and will fabricate plausible-looking citations.
- Does the corpus change more than once a quarter? Yes: RAG. Any fine-tune is stale the week it ships.
- Is access to the corpus user-dependent (clearance, need-to-know)? Yes: RAG. You cannot bake user-variable access into weights.
- Is the task a narrow classification, extraction, or transformation? Consider fine-tuning. A specialized 7B model trained on 10K-50K labeled examples often beats a frontier model on narrow tasks, runs cheaper, and is easier to self-host.
- Does the output need a specific format, tone, or style the base model resists? Consider a small fine-tuned adapter. Prompting for format is cheap until prompt size dominates latency.
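The checklist above is mechanical enough to write down as a routing function. Field names are illustrative, not from any real intake form; the ordering mirrors the tree — corpus and governance questions first, task-shape questions only if none of those fire.

```python
# The decision checklist as a routing function. The RAG checks come first
# because any one of them is disqualifying for a pure fine-tune.

from dataclasses import dataclass

@dataclass
class UseCase:
    answers_from_corpus: bool          # specific document corpus?
    needs_citations: bool              # provenance back to source?
    corpus_churns: bool                # changes more than quarterly?
    access_user_gated: bool            # clearance / need-to-know?
    narrow_task: bool                  # classification / extraction / transform
    format_resisted: bool              # tone or format the base model fights

def route(u: UseCase) -> str:
    if u.answers_from_corpus or u.needs_citations:
        return "RAG"
    if u.corpus_churns or u.access_user_gated:
        return "RAG"
    if u.narrow_task:
        return "fine-tune"
    if u.format_resisted:
        return "fine-tune adapter"
    return "evaluate further: RAG + adapter hybrid often wins"
```

Note that `route` can return "RAG" and the program can still layer an adapter on top — the tree decides what is mandatory, not what is exclusive.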
When RAG is the right call
- Policy lookup. Agency policies, regulations, SOPs, CONOPS. Churn is continuous, provenance is required, and users have differing authorizations.
- Case file analysis. Investigators, adjudicators, and analysts asking questions about specific records.
- Research assistance. Technical libraries, NIST publications, journal archives, patent databases.
- Compliance assistance. Answering "what control applies" or "what is the reporting requirement" from a living standard.
- Contracting support. FAR/DFARS lookups, past-performance retrieval, clause analysis.
In every one of these, fine-tuning would be actively counterproductive. A fine-tuned model forgets yesterday's policy update, cannot cite its source, and cannot be partitioned by user clearance.
When fine-tuning is the right call
Domain classification
Routing incoming documents or tickets into agency-specific taxonomies. A LoRA on Llama-3.1-8B trained on 20K labeled examples typically matches or exceeds a frontier model at a fraction of the cost.
Structured extraction
Pulling specific fields from agency-specific forms. Fine-tuning locks in the output schema, and a tuned model tolerates variance in the source forms better than a prompted one does.
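What "locks in the schema" means in practice: the fine-tune is trained to emit exactly one set of fields, and the serving layer enforces it. A minimal sketch — the field names and the sample record are invented for illustration:

```python
# Enforce the fixed output schema a structured-extraction fine-tune targets.
# Whatever the source form looks like, the model must emit these fields.

import json

REQUIRED_FIELDS = {"case_number": str, "filing_date": str, "requester": str}

def validate_extraction(raw: str) -> dict:
    """Parse model output and reject anything that drifts from the schema."""
    record = json.loads(raw)
    for field, typ in REQUIRED_FIELDS.items():
        if not isinstance(record.get(field), typ):
            raise ValueError(f"missing or mistyped field: {field}")
    return record

record = validate_extraction(
    '{"case_number": "A-123", "filing_date": "2026-01-15", "requester": "J. Doe"}'
)
```

The validator doubles as an eval metric during training: schema-violation rate is one of the cleanest signals that the fine-tune has converged.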
Agency voice
Writing in the style of the agency (concise, formal, numbered lists, specific boilerplate). Prompting works but adds tokens on every call; an adapter bakes it in.
Small-model deployment
When you need a model that fits on a single GPU in a disconnected environment and still performs on the task, fine-tuning a 7B-13B model on task data is often the only path.
Latency-critical paths
Real-time tagging, routing, or scoring where a 500-token RAG prompt is too slow. A fine-tuned 3B-7B classifier hits sub-50ms.
The "both" case: adapter plus RAG
The pattern we most commonly ship on mature federal programs: a small LoRA adapter on an open-weight base model for tone and format, plus a RAG pipeline for facts. The adapter makes the output sound like the agency without prompting for it. The RAG pipeline keeps the content current and cited.
Cost math that is honest
Rough 2026 numbers for planning, not procurement.
| Approach | Setup cost | Per-1K-query cost | When it pays back |
|---|---|---|---|
| RAG on frontier model (Claude 4.x / GPT-4.x via GovCloud) | $50K-200K platform build | $1-8 at current API pricing | Immediately when corpus access is the blocker |
| RAG on self-hosted open-weight (Llama 3.1 70B, vLLM) | $150K-400K (includes GPU infra) | $0.10-0.50 amortized | At sustained >1M queries/month or when offline is mandatory |
| LoRA adapter on 7B-13B model | $10K-40K (training + eval) | $0.05-0.20 self-hosted | Immediately for narrow tasks at scale |
| Full SFT on 70B model | $100K-500K (compute + data curation) | $0.20-0.80 self-hosted | Rarely justified vs LoRA + RAG |
| Preference tuning (DPO) on adapter | +$15K-50K on top of SFT | Same as base adapter | When refusal or safety behavior is the blocker |
Those numbers exclude compliance and ATO work, which dominate federal project budgets and are independent of the approach.
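A worked example of how the "when it pays back" column falls out of the table, using the midpoints of the first two rows. These are the table's planning numbers, not quotes; plug in your own.

```python
# Break-even for self-hosted open-weight RAG versus frontier-API RAG,
# using midpoints of the setup and per-1K-query ranges in the table.

frontier_setup, selfhost_setup = 125_000, 275_000   # midpoints of $50K-200K, $150K-400K
frontier_per_1k, selfhost_per_1k = 4.50, 0.30       # midpoints of $1-8, $0.10-0.50

extra_setup = selfhost_setup - frontier_setup       # $150,000 more up front
savings_per_1k = frontier_per_1k - selfhost_per_1k  # $4.20 saved per 1K queries
breakeven_queries = extra_setup / savings_per_1k * 1000
print(f"break-even at ~{breakeven_queries / 1e6:.1f}M queries")
# → break-even at ~35.7M queries
```

At a sustained 1M queries/month that is roughly three years — which is why the table gates the self-hosted row on sustained volume or a hard offline requirement rather than on cost alone.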
Fine-tuning pitfalls that federal programs hit
Training data leakage
Fine-tuning on CUI binds that data into weights that then move everywhere the model moves. Treat fine-tuned weights with the same handling as the training data.
Catastrophic forgetting
Heavy fine-tuning on narrow data degrades the model's general reasoning. Mitigated with LoRA (base is frozen) and with mix-in of general instruction data.
Eval drift
Your fine-tune was a win against the Q1 eval set. Q2 queries have shifted. Without a refreshed eval harness you have no idea whether the fine-tune still helps.
Adapter sprawl
Every mission wants its own adapter. Without a registry, versioning, and rollback, the deployment matrix becomes unmaintainable.
Hidden prompting
A fine-tuned model that was trained on prompts with hidden system instructions will repeat those in surprising contexts. Document training prompts as rigorously as production prompts.
RAG pitfalls that send teams toward fine-tuning for the wrong reasons
Teams often reach for fine-tuning because their RAG is failing, when the actual problem is a bad RAG pipeline. Fix the pipeline before changing the approach:
- Poor chunking produces fragmented context. Fine-tuning will not fix retrieval.
- Single-model embeddings on a domain they were not trained on. Swap the embedder before touching the LLM.
- No hybrid search means exact matches on case numbers or statutes fail. Add BM25.
- No reranker, so the top-5 is noisy. Add a cross-encoder.
- Retrieval works but the prompt lets the model drift. Tighten the prompt before tuning the model.
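The hybrid-search fix from the list above is usually rank fusion rather than score mixing. A common choice is reciprocal rank fusion (RRF), sketched here with hard-coded stand-in rankings — the document IDs and the two rankings are invented for illustration:

```python
# Reciprocal rank fusion: merge a lexical (BM25-style) ranking with a
# dense-vector ranking, so exact matches on case numbers or statute cites
# survive even when the embedder misses them.

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists; documents ranked high in any list float upward."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

lexical = ["doc_case_1234", "doc_policy", "doc_memo"]  # exact case-number hit
dense   = ["doc_policy", "doc_memo", "doc_case_1234"]  # embedder missed it
fused = rrf([lexical, dense])
```

The case-number document the embedder buried at rank 3 lands in the fused top 2 — the behavior the "add BM25" bullet is asking for.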
Tooling: an honest review
Hugging Face TRL and PEFT
Default stack for LoRA and QLoRA training. Mature, well-documented, permissive licenses. Runs on any GPU box inside the boundary.
Axolotl
Opinionated YAML wrapper around TRL. Faster to stand up, easier to version training configs, slightly less flexible.
Unsloth
2x training speedup for small-to-mid models on single-GPU setups. Worth trying before buying more GPUs.
SageMaker JumpStart / Bedrock custom models
Managed fine-tuning available in GovCloud for select models. Convenient but locks you into the managed path and is expensive at scale.
Azure ML fine-tuning
Similar to SageMaker. Strong integration with Azure OpenAI in Azure Government for hosted fine-tunes of GPT family.
LangChain / LlamaIndex
RAG frameworks. Useful for prototypes; production federal systems usually strip them back to the pieces that actually ship.
vLLM / TensorRT-LLM
Serving. Fine-tuned models need a production-grade serving layer. See our on-prem deployment post.
The defensibility argument
On a federal program, "why did the model answer this way" is a question you will be asked. RAG makes it answerable: "because we retrieved these passages from these documents and presented them as context." Fine-tuning makes it harder: "because the model learned a pattern during training." For any application where an answer will be read by a decision-maker, a court, an IG, or Congress, the RAG story holds up better. This is not a technical preference; it is a defensibility preference that matters uniquely in federal work.
A recommended starting posture
- Build a solid RAG pipeline first. Hybrid retrieval, reranking, grounded citations, real eval harness.
- Measure what actually fails. Is it retrieval? Is it tone? Is it format?
- If retrieval is the failure, fix retrieval.
- If tone or format is the failure and prompt engineering is expensive or unreliable, train a small LoRA adapter.
- Only consider full fine-tuning when narrow-task performance, self-hosting economics, or latency floors force the issue.
- Whatever you ship, gate it on a regression eval that mirrors real usage.
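The regression gate in the last step can be this small and still do its job. Sketch only — the "models" below are lookup tables standing in for real inference, and the accuracy metric is a placeholder for whatever your eval harness actually scores:

```python
# Gate a candidate (fine-tune, new adapter, new retriever) on a regression
# eval: ship only if it matches or beats the current model on queries that
# mirror real usage.

def accuracy(model, eval_set: list[tuple[str, str]]) -> float:
    correct = sum(1 for q, expected in eval_set if model(q) == expected)
    return correct / len(eval_set)

def gate(candidate, baseline, eval_set, margin: float = 0.0) -> bool:
    """Block deployment on any regression beyond the allowed margin."""
    return accuracy(candidate, eval_set) >= accuracy(baseline, eval_set) - margin

# Stand-in models for illustration: simple lookup tables.
eval_set = [("q1", "a"), ("q2", "b"), ("q3", "c")]
baseline  = {"q1": "a", "q2": "b", "q3": "x"}.get   # 2/3 on the eval set
candidate = {"q1": "a", "q2": "b", "q3": "c"}.get   # 3/3 — strictly better
ship = gate(candidate, baseline, eval_set)
```

The eval set is the part that rots (the "eval drift" pitfall above); the gate itself is trivial to keep.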
Where this fits in our practice
We build RAG platforms, train LoRA adapters, and design the eval harnesses that tell you which to use when. See our RAG architecture and LLM evaluation posts for the rest of the stack.