
RAG vs fine-tuning: the federal decision tree.

April 2, 2026 · 15 min read · When RAG wins, when fine-tuning wins, when both, and the real cost math on a federal program.

The question people are actually asking

"Should we fine-tune the model or use RAG?" is almost never the real question. The real question is some mix of: "how do we get this model to be accurate on our domain," "how do we keep our data inside our boundary," "how do we make it cite sources," "how do we make it sound like us," and "how do we afford it at production scale." Each of those sub-questions has a different answer, and the architecture falls out of answering them honestly.

This post is the decision tree we walk with federal programs. It is not a theoretical survey. It is what we actually tell agencies who ask us whether they should stand up a fine-tuning pipeline or invest that effort in retrieval.

Short version. Default to RAG. Add a small fine-tuned adapter when you need consistent tone or format that RAG cannot reliably produce. Full fine-tuning of a frontier model is almost never the right first move on a federal program.

What each technique actually does

RAG keeps the knowledge outside the model. At query time you retrieve relevant passages from a vector or hybrid store, pass them to the model as context, and the model answers from that context. The model itself is unchanged. The knowledge lives in the index.

Fine-tuning pushes knowledge, behavior, or both into the weights. Supervised fine-tuning (SFT) teaches the model patterns from input-output pairs. Instruction tuning teaches it to follow specific instruction formats. Preference tuning (DPO, KTO) teaches it to prefer certain answer styles. LoRA and QLoRA train a small adapter layered on top of a frozen base model, which is the only fine-tuning variant that makes economic sense on federal programs outside of very large primes.

The two techniques solve different problems. RAG solves "the model does not know this specific fact" and "we need to cite where we got it." Fine-tuning solves "the model does not respond the way we want" and "we need a consistent pattern of output."

Decision tree, top level

RAG vs Fine-Tuning Decision Flow

```mermaid
flowchart TD
    A([Your federal LLM use case]) --> B{Answers from\na specific corpus?}
    B -->|Yes| RAG1[RAG — document retrieval\nis mandatory]
    B -->|No| C{Citations required\nback to source?}
    C -->|Yes| RAG2[RAG — fine-tuned models\nfabricate citations]
    C -->|No| D{Corpus changes\nmore than quarterly?}
    D -->|Yes| RAG3[RAG — fine-tunes\ngo stale immediately]
    D -->|No| E{Access is user-dependent\nor clearance-gated?}
    E -->|Yes| RAG4[RAG — cannot bake\nper-user access into weights]
    E -->|No| F{Narrow classification,\nextraction, or transformation?}
    F -->|Yes| FT1[Fine-Tune — 7B specialized\nbeats frontier and cheaper]
    F -->|No| G{Specific format or\ntone base model resists?}
    G -->|Yes| FT2[Fine-Tune adapter —\nprompt-forcing is fragile at scale]
    G -->|No| H[Evaluate further —\nRAG plus adapter hybrid often wins]
    style RAG1 fill:#3b82f6,color:#fff,stroke:#3b82f6
    style RAG2 fill:#3b82f6,color:#fff,stroke:#3b82f6
    style RAG3 fill:#3b82f6,color:#fff,stroke:#3b82f6
    style RAG4 fill:#3b82f6,color:#fff,stroke:#3b82f6
    style FT1 fill:#7c3aed,color:#fff,stroke:#7c3aed
    style FT2 fill:#7c3aed,color:#fff,stroke:#7c3aed
    style H fill:#0d9488,color:#fff,stroke:#0d9488
```
  1. Does the task require answering from a specific document corpus? Yes: RAG is mandatory. The only question left is whether you also need fine-tuning on top.
  2. Does the task require citations back to source? Yes: RAG. Fine-tuning cannot honestly cite; it compresses the training data and will fabricate plausible-looking citations.
  3. Does the corpus change more than once a quarter? Yes: RAG. Any fine-tune is stale the week it ships.
  4. Is access to the corpus user-dependent (clearance, need-to-know)? Yes: RAG. You cannot bake user-variable access into weights.
  5. Is the task a narrow classification, extraction, or transformation? Consider fine-tuning. A specialized 7B model trained on 10K-50K labeled examples often beats a frontier model on narrow tasks, runs cheaper, and is easier to self-host.
  6. Does the output need a specific format, tone, or style the base model resists? Consider a small fine-tuned adapter. Prompting for format is cheap until prompt size dominates latency.
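The six questions above can be encoded as a small routing function. The question order and outcomes mirror the list; the function and parameter names are illustrative, and real programs will want richer inputs than booleans:

```python
def route(corpus_qa, citations_required, changes_per_quarter,
          user_gated, narrow_task, format_resistant):
    """Top-level RAG-vs-fine-tuning routing, mirroring the six questions.
    Returns the recommended starting architecture, not a final design."""
    if corpus_qa:
        return "RAG"                # Q1: answers come from a specific corpus
    if citations_required:
        return "RAG"                # Q2: fine-tunes fabricate citations
    if changes_per_quarter > 1:
        return "RAG"                # Q3: fine-tunes go stale immediately
    if user_gated:
        return "RAG"                # Q4: cannot bake per-user access into weights
    if narrow_task:
        return "fine-tune"          # Q5: specialized small model wins
    if format_resistant:
        return "fine-tune-adapter"  # Q6: LoRA for tone/format
    return "evaluate-hybrid"        # RAG plus adapter often wins

# A policy-lookup workload hits Q1 immediately:
print(route(True, True, 4, True, False, False))   # → RAG
# A stable document-classification workload falls through to Q5:
print(route(False, False, 0, False, True, False)) # → fine-tune
```

Note that the RAG branches short-circuit: a single "yes" on questions 1-4 settles the retrieval question before fine-tuning is even considered, which is the point of the ordering.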

When RAG is the right call

  • Policy lookup. Agency policies, regulations, SOPs, CONOPS. Churn is continuous, provenance is required, and users have differing authorizations.
  • Case file analysis. Investigators, adjudicators, and analysts asking questions about specific records.
  • Research assistance. Technical libraries, NIST publications, journal archives, patent databases.
  • Compliance assistance. Answering "what control applies" or "what is the reporting requirement" from a living standard.
  • Contracting support. FAR/DFARS lookups, past-performance retrieval, clause analysis.

In every one of these, fine-tuning would be actively counterproductive. A fine-tuned model forgets yesterday's policy update, cannot cite its source, and cannot be partitioned by user clearance.

When fine-tuning is the right call

Domain classification

Routing incoming documents or tickets into agency-specific taxonomies. A LoRA on Llama-3.1-8B trained on 20K labeled examples typically matches or exceeds a frontier model at a fraction of the cost.

Structured extraction

Pulling specific fields from agency-specific forms. Fine-tuning locks in the schema and tolerates schema variance better than prompting.

Agency voice

Writing in the style of the agency (concise, formal, numbered lists, specific boilerplate). Prompting works but adds tokens on every call; an adapter bakes it in.

Small-model deployment

When you need a model that fits on a single GPU in a disconnected environment and still performs on the task, fine-tuning a 7B-13B model on task data is often the only path.

Latency-critical paths

Real-time tagging, routing, or scoring where a 500-token RAG prompt is too slow. A fine-tuned 3B-7B classifier hits sub-50ms.

The "both" case: adapter plus RAG

The pattern we most commonly ship on mature federal programs: a small LoRA adapter on an open-weight base model for tone and format, plus a RAG pipeline for facts. The adapter makes the output sound like the agency without prompting for it. The RAG pipeline keeps the content current and cited.

Fine-tune for behavior. Retrieve for facts. Confuse the two and you ship a confidently wrong system with a strong voice.
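A minimal sketch of that division of labor at prompt-assembly time: retrieved passages carry the facts and the source tags, while tone and format live in the adapter, so the prompt carries no style boilerplate. The passage record fields (`doc_id`, `section`) and the function name are illustrative, not a specific framework's API:

```python
def build_prompt(question, passages):
    """Assemble a grounded prompt. Facts come from retrieval with source
    tags the model must cite; tone and format come from the LoRA adapter,
    so no style instructions are needed here."""
    context = "\n\n".join(
        f"[{p['doc_id']} §{p['section']}]\n{p['text']}" for p in passages
    )
    return (
        "Answer strictly from the context below. "
        "Cite the bracketed source tag for every claim.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

# Hypothetical retrieved passage for a policy-lookup query:
passages = [{"doc_id": "POL-101", "section": "4.2",
             "text": "Records must be retained for seven years."}]
prompt = build_prompt("How long must records be retained?", passages)
```

Keeping the style out of the prompt is also the latency win: the adapter's behavior costs zero tokens per call.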

Cost math that is honest

Rough 2026 numbers for planning, not procurement.

| Approach | Setup cost | Per-1K-query cost | When it pays back |
| --- | --- | --- | --- |
| RAG on frontier model (Claude 4.x / GPT-4.x via GovCloud) | $50K-200K platform build | $1-8 at current API pricing | Immediately when corpus access is the blocker |
| RAG on self-hosted open-weight (Llama 3.1 70B, vLLM) | $150K-400K (includes GPU infra) | $0.10-0.50 amortized | At sustained >1M queries/month or when offline is mandatory |
| LoRA adapter on 7B-13B model | $10K-40K (training + eval) | $0.05-0.20 self-hosted | Immediately for narrow tasks at scale |
| Full SFT on 70B model | $100K-500K (compute + data curation) | $0.20-0.80 self-hosted | Rarely justified vs LoRA + RAG |
| Preference tuning (DPO) on adapter | +$15K-50K on top of SFT | Same as base adapter | When refusal or safety behavior is the blocker |

Those numbers exclude compliance and ATO work, which dominate federal project budgets and are independent of the approach.
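To make the "when it pays back" column concrete, the break-even arithmetic is simple amortization. The inputs below are mid-range planning assumptions pulled from the table (an extra $200K of setup for self-hosting, roughly $4 vs $0.30 per 1K queries), not quotes:

```python
def breakeven_monthly_queries(extra_setup, api_per_1k, hosted_per_1k, months):
    """Monthly query volume at which self-hosting's per-query savings
    repay its extra setup cost over the given horizon (in months)."""
    savings_per_query = (api_per_1k - hosted_per_1k) / 1000
    return extra_setup / (savings_per_query * months)

# Mid-range assumptions: $200K extra setup, $4.00 vs $0.30 per 1K queries.
monthly = breakeven_monthly_queries(200_000, 4.0, 0.30, months=36)
# ≈ 1.5M queries/month to pay back over a 3-year horizon, consistent
# with the ">1M queries/month" line in the table above
```

The sensitivity to the API price is the real lesson: at $8 per 1K queries the break-even volume roughly halves, which is why sustained-volume programs revisit this math every pricing cycle.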

Fine-tuning pitfalls that federal programs hit

Training data leakage

Fine-tuning on CUI binds that data into weights that then move everywhere the model moves. Treat fine-tuned weights with the same handling as the training data.

Catastrophic forgetting

Heavy fine-tuning on narrow data degrades the model's general reasoning. Mitigated with LoRA (base is frozen) and with mix-in of general instruction data.

Eval drift

Your fine-tune was a win against the Q1 eval set. Q2 queries have shifted. Without a refreshed eval harness you have no idea whether the fine-tune still helps.

Adapter sprawl

Every mission wants its own adapter. Without a registry, versioning, and rollback, the deployment matrix becomes unmaintainable.

Hidden prompting

A fine-tuned model that was trained on prompts with hidden system instructions will repeat those in surprising contexts. Document training prompts as rigorously as production prompts.

RAG pitfalls that send teams toward fine-tuning for the wrong reasons

Teams often reach for fine-tuning because their RAG is failing, when the actual problem is a bad RAG pipeline. Fix the pipeline before changing the approach:

  • Poor chunking produces fragmented context. Fine-tuning will not fix retrieval.
  • Off-the-shelf embeddings used on a domain they were not trained on retrieve poorly. Swap the embedder before touching the LLM.
  • No hybrid search means exact matches on case numbers or statutes fail. Add BM25.
  • No reranker, so the top-5 is noisy. Add a cross-encoder.
  • Retrieval works but the prompt lets the model drift. Tighten the prompt before tuning the model.
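The hybrid-search fix above often starts with simple rank fusion rather than score calibration. Reciprocal rank fusion (RRF) merges a BM25 ranking and a dense ranking without having to normalize their incompatible score scales; a minimal sketch:

```python
def rrf_merge(rankings, k=60):
    """Reciprocal rank fusion: merge several ranked lists of doc ids.
    Each doc scores sum(1 / (k + rank)) over the lists containing it;
    k=60 is the commonly used damping constant."""
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# BM25 catches the exact case number; dense retrieval catches the paraphrase.
bm25 = ["case-4471", "policy-12", "memo-9"]
dense = ["policy-12", "handbook-3", "case-4471"]
merged = rrf_merge([bm25, dense])
# Docs that appear in both lists ("policy-12", "case-4471") rise to the top.
```

This is exactly the failure mode in the list above: without the lexical leg, a query for "case 4471" retrieves nothing useful, and teams misdiagnose that as a model problem.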

Tooling: an honest review

Hugging Face TRL and PEFT

Default stack for LoRA and QLoRA training. Mature, well-documented, permissive licenses. Runs on any GPU box inside the boundary.

Axolotl

Opinionated YAML wrapper around TRL. Faster to stand up, easier to version training configs, slightly less flexible.

Unsloth

2x training speedup for small-to-mid models on single-GPU setups. Worth trying before buying more GPUs.

SageMaker JumpStart / Bedrock custom models

Managed fine-tuning available in GovCloud for select models. Convenient but locks you into the managed path and is expensive at scale.

Azure ML fine-tuning

Similar to SageMaker. Strong integration with Azure OpenAI in Azure Government for hosted fine-tunes of GPT family.

LangChain / LlamaIndex

RAG frameworks. Useful for prototypes; production federal systems usually strip them back to the pieces that actually ship.

vLLM / TensorRT-LLM

Serving. Fine-tuned models need a production-grade serving layer. See our on-prem deployment post.

The defensibility argument

On a federal program, "why did the model answer this way" is a question you will be asked. RAG makes it answerable: "because we retrieved these passages from these documents and presented them as context." Fine-tuning makes it harder: "because the model learned a pattern during training." For any application where an answer will be read by a decision-maker, a court, an IG, or Congress, the RAG story holds up better. This is not a technical preference; it is a defensibility preference that matters uniquely in federal work.

A recommended starting posture

  1. Build a solid RAG pipeline first. Hybrid retrieval, reranking, grounded citations, real eval harness.
  2. Measure what actually fails. Is it retrieval? Is it tone? Is it format?
  3. If retrieval is the failure, fix retrieval.
  4. If tone or format is the failure and prompt engineering is expensive or unreliable, train a small LoRA adapter.
  5. Only consider full fine-tuning when narrow-task performance, self-hosting economics, or latency floors force the issue.
  6. Whatever you ship, gate it on a regression eval that mirrors real usage.
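The regression gate in step 6 does not need to be elaborate to be effective. A sketch, with illustrative names and thresholds: pin a baseline pass rate from the current production system, and block any release that drops more than a small tolerance below it:

```python
def release_gate(results, baseline_pass_rate, max_regression=0.02):
    """Block a release if eval pass rate drops more than max_regression
    below the pinned baseline. `results` is one bool per eval case."""
    pass_rate = sum(results) / len(results)
    passed = pass_rate >= baseline_pass_rate - max_regression
    return passed, pass_rate

# 92/100 cases pass against a 93% baseline: inside the 2-point tolerance.
ok, rate = release_gate([True] * 92 + [False] * 8, baseline_pass_rate=0.93)
# ok is True, rate is 0.92 — the release proceeds
```

The harder work is keeping `results` honest: the eval set must be refreshed from real usage, or the gate silently measures last quarter's traffic.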

Where this fits in our practice

We build RAG platforms, train LoRA adapters, and design the eval harnesses that tell you which to use when. See our RAG architecture and LLM evaluation posts for the rest of the stack.

FAQ

When does RAG win over fine-tuning on a federal program?
When the source corpus changes faster than a fine-tuning cycle, when users demand citations back to source documents, when access control on the corpus is dynamic, and when compliance requires the data to stay in a specific authorization boundary. Most federal document-QA, policy lookup, and analyst-assistant workloads fall here.
When does fine-tuning win?
When you need a consistent tone or format that is hard to prompt-engineer, when latency or cost at scale is dominated by prompt size, when the task is a narrow skill (classification, extraction, structured output) and the domain is stable, or when you need to compress behavior into a small self-hostable model.
What is a LoRA adapter and why does it matter in federal?
LoRA trains small low-rank weight matrices instead of full model weights, producing a 10-200 MB adapter that can be loaded on top of a base model. For federal it matters because you can host one base model and swap adapters per mission, train adapters on sensitive data without retraining the base, and fit the footprint on self-hosted GPUs.
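The 10-200 MB figure follows from LoRA's parameter count: for a frozen weight of shape (d, k) and rank r, the adapter adds r*(d+k) trainable parameters. The arithmetic below uses approximate Llama-3.1-8B attention shapes (hidden size 4096, key/value projections to 1024 under grouped-query attention, 32 layers); treat those dimensions as assumptions for illustration:

```python
def lora_adapter_mb(shapes, rank=16, layers=32, bytes_per_param=2):
    """Size of a LoRA adapter targeting the given (d, k) weight shapes,
    repeated per layer, stored in fp16 (2 bytes per parameter)."""
    params_per_layer = sum(rank * (d + k) for d, k in shapes)
    return params_per_layer * layers * bytes_per_param / 1e6

# Assumed shapes: q_proj is 4096x4096; v_proj is 4096x1024 under GQA.
size = lora_adapter_mb([(4096, 4096), (4096, 1024)])
# ≈ 13.6 MB for rank 16 on q and v — comfortably inside 10-200 MB
```

Raising the rank or targeting more projection matrices scales the adapter linearly, which is why even aggressive configurations stay far below the multi-gigabyte base model.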
Can you combine RAG and fine-tuning?
Yes, and often you should. Fine-tune a small adapter for tone, format, and domain jargon. Use RAG for the facts. The adapter makes retrieved passages render in the agency voice without prompting for it every call; RAG keeps the facts current and cited.
What does fine-tuning cost on a federal program in 2026?
A LoRA adapter on a 7B-70B open-weight model runs a few hundred to a few thousand dollars of GPU time on 10K-100K examples. Full fine-tuning of a 70B model is five-to-six figures. The bigger cost is building and curating the training set with SMEs.
Where does fine-tuning go wrong in federal?
Training data leakage (fine-tuning on CUI without the right handling), catastrophic forgetting (the tuned model loses general capability), and drift between the fine-tune snapshot and the live corpus users actually need answers from.

Related insights

Choosing between RAG and fine-tuning for a federal program?

We design RAG and fine-tuning pipelines end to end, and we will tell you honestly which one your problem needs, or whether you need both.