RAG architecture for federal document corpora.

April 16, 2026 · 17 min read · Ingestion, chunking, FedRAMP-safe embeddings, vector stores, hybrid retrieval, reranking, citation, and CUI handling for production federal RAG.

Why RAG is the right fit for federal

Retrieval-augmented generation grounds a language model's output in documents the agency controls. Instead of relying on what the model memorized during training, a RAG system retrieves relevant passages at query time, passes them to the model as context, and expects the model to answer from that context with citations back to the source. The combination solves three problems at once, each of which matters more in federal contexts than in commercial ones.

  • Freshness. A foundation model's knowledge cutoff is months or years old. Agency policies, regulations, case law, and operational guidance change faster than that. RAG retrieves the current document.
  • Provenance. Federal outputs need to be defensible. "The model said so" is not acceptable when a response will be read by Congress, a court, or an IG. "Here is the specific paragraph from the specific document, cited inline" is.
  • Control. Training or fine-tuning a foundation model on sensitive data is a compliance burden most programs cannot absorb. RAG keeps the data in the agency's store, retrieves at query time, and lets the model do language, not memorization.

Done well, RAG over a federal corpus is the most defensible application pattern we have right now for putting LLMs in front of mission work. Done poorly, it produces confidently wrong answers with fake citations, which is worse than no LLM at all. This post walks the architecture we ship, with the pieces that matter most for federal.

Scope. Unclassified or CUI document corpora: regulations, case files, manuals, SOPs, research reports, investigation notes, contract archives. FedRAMP Moderate or High. The patterns extend to classified enclaves with self-hosted components.

Reference architecture

The RAG system we ship in federal has seven stages. Every stage has an associated control surface, an audit event, and a failure mode worth naming.

1. Ingest           Pull source docs, identify format, classify
2. Parse            Extract text, structure, tables, images
3. Chunk            Split into retrievable units with metadata
4. Embed            Convert chunks to vectors with a FedRAMP-safe model
5. Index            Store in vector DB + keyword index
6. Retrieve         Hybrid search + rerank at query time
7. Generate         LLM answers with grounded citations

Stage 1: ingestion

Federal corpora are messy. Expect: PDFs (born-digital and scanned), Word docs, Excel sheets with relevant free text, HTML from SharePoint and Confluence, PowerPoint decks with information buried in images, emails with attachments, CSVs with narrative columns, plain text logs, and the occasional WordPerfect file from the 1990s. An ingestion pipeline that assumes clean PDFs will not survive contact with a real corpus.

What the ingestion layer does:

  • Pull documents from the source system (SharePoint, Documentum, shared drives, database BLOBs, an S3 landing bucket).
  • Identify format, checksum, and deduplicate against the existing corpus.
  • Apply document-level classification — source trust, PII likelihood, CUI marking, record retention category.
  • Queue for parsing with appropriate handlers per format.
  • Emit an audit event per document with source, hash, classification, and ingestion timestamp.

Source trust is a concept most commercial RAG stacks do not have. In federal, you will have a mix of authoritative documents (policy manuals signed by the agency), semi-authoritative (internal working drafts), external-but-trusted (NIST publications), and external-untrusted (crawled web pages, external email bodies). Trust score affects how the chunk is weighted in retrieval and how the model is instructed to treat it. Indirect prompt injection exploits mix in through this channel, so treat the trust score as a security control, not a ranking hint.

Stage 2: parsing

Parsing quality dominates downstream RAG quality. A chunk that has lost its section heading, its table structure, or its footnote reference will retrieve poorly and generate worse. The tools that do this well in 2026:

  • PyMuPDF and pdfplumber for clean born-digital PDFs. Fast, local, no external calls. First choice when it works.
  • Unstructured.io (open source library, not the hosted API for federal) for mixed formats and fallback handling.
  • Azure Document Intelligence (Azure Government) and AWS Textract (GovCloud) for layout-aware extraction, table structure, and OCR on scanned docs. Both are FedRAMP authorized.
  • Marker or Nougat for scientific and dense technical PDFs where layout matters.
  • Tesseract for self-hosted OCR on scanned documents when cloud OCR is not allowed.

Layout-aware parsing matters more than most teams realize. A table flattened to a single run-on line is functionally lost. A section heading detached from its body breaks retrieval of any chunk from that section. Parsers that preserve structure (JSON output with element types, bounding boxes, reading order) are worth the extra integration cost.
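The difference is easy to see with a table: preserving rows and headers keeps the cell relationships that a flattened run-on line loses. The pipe-delimited rendering below is one common way to serialize parsed tables for chunking, not the only one:

```python
def table_to_text(header: list[str], rows: list[list[str]]) -> str:
    """Serialize a parsed table as pipe-delimited lines so each cell keeps its column."""
    lines = [" | ".join(header)]
    lines += [" | ".join(cells) for cells in rows]
    return "\n".join(lines)

# What a naive parser emits: cell adjacency is gone
flattened = "Record Type Retention Case files 7 years Emails 3 years"

# What a layout-aware parse lets you rebuild
structured = table_to_text(
    ["Record Type", "Retention"],
    [["Case files", "7 years"], ["Emails", "3 years"]],
)
# structured keeps "Case files" bound to "7 years"; flattened does not
```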

Stage 3: chunking

Chunking is where naive RAG dies. The goal: every chunk is self-contained enough to answer a question on its own while being small enough to retrieve precisely and fit within context budgets. The chunking strategies that work, in order of sophistication:

  • Fixed-size with overlap. 500 to 1000 tokens with 10 to 20 percent overlap. Works as a baseline. Fails on documents with strong structure because it cuts through headings.
  • Structural chunking. Split on document structure — headings, sections, paragraphs — and attach section context to every chunk. Much better on policy manuals, regulations, and SOPs.
  • Semantic chunking. Group adjacent sentences whose embeddings are similar, split where similarity drops. Better on prose without strong structural markers.
  • Hierarchical chunking. Index at multiple granularities — page-level summaries, section-level chunks, paragraph-level chunks — and retrieve the right layer for the query.
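Structural chunking, the strategy that pays off most on policy manuals, can be sketched against markdown-style headings; real inputs would be the element stream from a layout-aware parser:

```python
def chunk_structural(doc: str) -> list[dict]:
    """Split on markdown-style headings and attach the section path to every chunk."""
    path: list[str] = []
    chunks: list[dict] = []
    buf: list[str] = []

    def flush():
        if buf:
            chunks.append({"section_path": list(path), "text": " ".join(buf)})
            buf.clear()

    for line in doc.splitlines():
        if line.startswith("#"):
            flush()
            level = len(line) - len(line.lstrip("#"))
            # Trim the path back to this heading's level, then descend
            path = path[: level - 1] + [line.lstrip("# ").strip()]
        elif line.strip():
            buf.append(line.strip())
    flush()
    return chunks

doc = "# Part II\n## Chapter 4\nRetention is seven years.\n## Chapter 5\nDisposal follows NARA."
out = chunk_structural(doc)
# out[0] == {"section_path": ["Part II", "Chapter 4"], "text": "Retention is seven years."}
```

The section path travels with the chunk, so a retrieved paragraph never arrives detached from the chapter it came from.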

Whatever strategy you pick, every chunk must carry metadata:

{
  "chunk_id": "chk_01HX...",
  "doc_id": "doc_01HW...",
  "source": "policy-manual-v12",
  "section_path": ["Part II", "Chapter 4", "Section 4.3"],
  "page_start": 47,
  "page_end": 48,
  "classification": "CUI",
  "source_trust": "authoritative",
  "effective_date": "2025-09-01",
  "superseded_by": null,
  "text": "…",
  "embedding": [...],
  "created_at": "2026-04-16T...",
  "ingestion_run_id": "run_01HX..."
}

The effective_date and superseded_by fields are often missed and matter a lot in federal: agency policies change, old versions stay in the corpus, and the model needs to know which is current. A chunk from a superseded policy should either be excluded or retrieved with an explicit "superseded" flag that the generation prompt respects.

Stage 4: embeddings

The embedding model converts each chunk to a dense vector. Choice is constrained by FedRAMP authorization state and by whether the embeddings leave your boundary.

FedRAMP-safe embedding options

  • OpenAI text-embedding-3-large via Azure OpenAI in Azure Government — Azure OpenAI holds FedRAMP High authorization in Azure Government as of 2026. Strong quality, 3072 dimensions, good multilingual support.
  • Cohere Embed v3 or v4 via AWS Bedrock in GovCloud — Bedrock's Cohere availability in GovCloud has grown through 2025-2026. Strong English quality, good chunk/query separation (different prompts for indexing vs. search).
  • Amazon Titan Embeddings via Bedrock in GovCloud — available, lower quality than Cohere or OpenAI at time of writing but improving.
  • Self-hosted open-weight: BGE-large, E5-large-v2, Nomic Embed Text, Voyage-lite-instruct (when self-hosted). Deploy inside your boundary on SageMaker endpoints, Azure ML online endpoints, or KServe. No data leaves the boundary. Required for some classified contexts.

What to evaluate

Do not pick an embedding model on MTEB leaderboard scores alone. Build a representative eval set from your corpus — a few hundred realistic queries paired with known-relevant chunks — and measure recall@k and MRR for each candidate. The right model for a legal-corpus retrieval task can be different from the right model for a medical-corpus retrieval task.

Stage 5: indexing

You are indexing twice: a vector index for semantic search and a keyword index for exact-match search. Storing them in the same database simplifies operations.

Vector store options

  • pgvector on Postgres. RDS for PostgreSQL in GovCloud and Azure Database for PostgreSQL in Azure Government both support pgvector. HNSW indexes scale cleanly to tens of millions of vectors. Best default when you are also storing metadata and you value operational simplicity.
  • Qdrant (self-hosted). Excellent performance, rich filtering, runs anywhere. Strong choice when you need hundreds of millions of vectors or sub-50 ms latency at scale.
  • Weaviate (self-hosted). Similar tier to Qdrant with a different API model.
  • Amazon OpenSearch with k-NN. Available in GovCloud; native hybrid BM25 + vector in one engine.
  • Azure AI Search. Available in Azure Government; hybrid BM25 + vector + semantic ranking in one service.

For most federal programs, either pgvector on a managed Postgres or Azure AI Search / OpenSearch is the right default. Dedicated vector stores are faster and richer but add another service to operate and authorize.

Keyword indexing

BM25 on the same chunks. OpenSearch and Azure AI Search provide it natively. On Postgres, use full-text search (GIN index on tsvector). Index the raw chunk text plus any metadata fields that users search by (document title, section path, dates).
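To make concrete what the keyword side contributes, here is a minimal in-memory BM25 scorer over whitespace tokens (production uses OpenSearch, Azure AI Search, or Postgres FTS; k1 and b are the customary defaults):

```python
import math

def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score each doc against the query with BM25 over whitespace tokens."""
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(d) for d in tokenized) / len(tokenized)
    n = len(docs)
    scores = []
    for doc in tokenized:
        score = 0.0
        for term in query.lower().split():
            df = sum(term in d for d in tokenized)  # document frequency
            if df == 0:
                continue
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            tf = doc.count(term)
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores

docs = ["contract FA8702-26-C-0001 option year two",
        "general acquisition guidance for option years"]
scores = bm25_scores("FA8702-26-C-0001", docs)
# the exact contract number scores only on the doc that contains it
```

This is exactly the case where pure vector search falls down: the embedding of a contract number is close to the embeddings of every other contract number.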

Stage 6: retrieval

Hybrid search

Pure vector search has a known weakness: it misses exact matches on rare tokens (case numbers, contract numbers, statute citations, proper nouns, acronyms) because embeddings smooth over the specific. Keyword search alone misses paraphrase. Hybrid retrieval runs both and fuses the results.

The fusion method that works: Reciprocal Rank Fusion (RRF). For each document, sum 1 / (k + rank) across retrieval methods, where k is a small constant (typically 60). RRF is simple, robust, and does not require score calibration across different retrievers. Return the top N after fusion.
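RRF as described is a dozen lines; inputs are the ranked ID lists from each retriever, with k=60 as above:

```python
def rrf_fuse(ranked_lists: list[list[str]], k: int = 60, top_n: int = 10) -> list[str]:
    """Fuse ranked result lists with Reciprocal Rank Fusion: sum 1/(k + rank) per doc."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

vector_hits = ["chk_7", "chk_2", "chk_9"]   # dense retrieval order
keyword_hits = ["chk_2", "chk_4", "chk_7"]  # BM25 order
fused = rrf_fuse([vector_hits, keyword_hits])
# chk_2 and chk_7 appear in both lists, so they rise to the top
```

Note that no retriever's raw scores appear anywhere: only ranks. That is why RRF needs no calibration between BM25 scores and cosine similarities.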

Reranking

After hybrid retrieval returns a candidate set of, say, 50 chunks, a cross-encoder reranker scores each against the query and returns the top 5 to 10. Cross-encoders are slower than embedding retrievers (they compare query and chunk token-by-token instead of comparing two precomputed vectors) but meaningfully more accurate on the top-k that actually reaches the model. Options:

  • Cohere Rerank via Bedrock in GovCloud — strong quality, easy integration.
  • BGE-reranker-v2 or Jina reranker self-hosted — open weights, deploy on any GPU endpoint.
  • LLM-as-judge reranking — prompt a small LLM to rank candidates. Flexible, more expensive, quality dependent on the LLM.

Metadata filtering

Filter before or alongside retrieval on classification, source trust, effective_date, document type, and user-authorization labels. "Retrieve only chunks classified at or below the caller's clearance from documents effective as of the query date" is a filter, not a post-processing step. Most vector stores support filtered ANN search; use it.
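The filter clause can be expressed as a predicate pushed into the store's filtered ANN search. The classification ordering and field names below are illustrative; the chunk fields match the metadata schema shown earlier:

```python
from datetime import date

# Illustrative ordering: higher number = more sensitive
LEVELS = {"PUBLIC": 0, "CUI": 1}

def chunk_visible(chunk: dict, caller_level: str, as_of: date) -> bool:
    """True iff the chunk is at or below the caller's level, effective, and not superseded."""
    return (
        LEVELS[chunk["classification"]] <= LEVELS[caller_level]
        and date.fromisoformat(chunk["effective_date"]) <= as_of
        and chunk["superseded_by"] is None
    )

chunk = {"classification": "CUI", "effective_date": "2025-09-01", "superseded_by": None}
visible_to_cui = chunk_visible(chunk, "CUI", date(2026, 4, 16))      # True
visible_to_public = chunk_visible(chunk, "PUBLIC", date(2026, 4, 16))  # False
```

In a real deployment this predicate compiles to the store's filter syntax so unauthorized chunks are never even candidates, rather than being dropped after retrieval.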

Stage 7: generation with grounded citations

The generation prompt has three jobs: tell the model it must answer from the retrieved context, instruct it to quote and cite, and refuse to answer if the context is insufficient.

Prompt pattern

System:
You are an assistant answering questions about agency
documents. Use only the retrieved passages below. If the
passages do not contain the answer, say so plainly — do
not speculate. Cite every factual claim with the passage
ID in square brackets, e.g., [P3]. Quote directly when
accuracy matters.

Retrieved passages:
[P1] (source: policy-manual-v12, section 4.3.2, effective 2025-09-01)
"…"

[P2] (source: sop-17, section 2.1, effective 2024-03-15)
"…"

User:
<question>

Citation discipline

The model must not invent citations. Enforce this in three places:

  • Every retrieved chunk is presented to the model with a stable passage ID.
  • The generation prompt instructs the model to use those IDs and no others.
  • A post-processing step validates every citation token in the output against the retrieved set and flags or redacts any citation that does not match.

The validation step is where most teams save themselves from the "hallucinated citation" incident. A response that cites a passage not in the retrieval set is either a model mistake or a prompt injection; either way, do not render it to the user as if it were grounded.
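The validator itself is a few lines; the [P#] token format matches the prompt pattern above:

```python
import re

def validate_citations(answer: str, retrieved_ids: set[str]) -> tuple[bool, set[str]]:
    """Check every [P#] citation token in the answer against the retrieved passage set."""
    cited = set(re.findall(r"\[(P\d+)\]", answer))
    bogus = cited - retrieved_ids
    return (not bogus, bogus)

ok, bogus = validate_citations(
    "Records are retained for 7 years [P1] and marked CUI [P9].",
    retrieved_ids={"P1", "P2", "P3"},
)
# ok is False; bogus == {"P9"} — flag or redact before rendering
```

The check runs on every response, not just suspicious ones; a citation outside the retrieved set should never reach the user as if it were grounded.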

Structured output for provenance

Return the response as a structured object that makes provenance auditable:

{
  "answer": "The retention period for...",
  "claims": [
    {"text": "retained for 7 years", "passage_id": "P1"},
    {"text": "classified as CUI", "passage_id": "P2"}
  ],
  "refused": false,
  "refusal_reason": null,
  "retrieved_passages": ["P1", "P2", "P3"],
  "unused_passages": ["P3"]
}

PII and CUI handling in chunks

Sensitive data in federal corpora is a first-class concern, not a post-filter. The pattern:

  1. Classify at ingestion. Every document enters with a classification label inherited from its source. Every chunk inherits from the document.
  2. Detect PII per chunk. Presidio, AWS Comprehend PII detection, Azure Language Service PII detection, or a fine-tuned NER model. Tag which PII entity types are present.
  3. Route by sensitivity. Chunks with higher classification go to a separate index with stricter access control. Chunks with named PII can be redacted, tokenized, or paraphrased for the index while preserving the original in secure cold storage.
  4. Filter at query time. The user's authorization determines which index is queried. Do not rely on the model to self-censor; do not include a chunk in context if the user was not authorized to see it.
  5. Log classifications on retrieval. The audit log includes which classifications were returned to which user.
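Steps 2 and 3 can be sketched with a regex-based detector standing in for Presidio or an NER model; the entity patterns and index names here are illustrative:

```python
import re

# Stand-in patterns; a real system would use Presidio, Comprehend, or a tuned NER model
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.]+@[\w.]+\.\w+\b"),
}

def route_chunk(chunk: dict) -> str:
    """Tag PII entity types present in the chunk, then pick the target index."""
    chunk["pii_types"] = [name for name, pat in PII_PATTERNS.items()
                          if pat.search(chunk["text"])]
    if chunk["classification"] != "PUBLIC" or chunk["pii_types"]:
        return "restricted-index"
    return "general-index"

chunk = {"text": "Contact 123-45-6789 for case details.", "classification": "PUBLIC"}
target = route_chunk(chunk)
# target == "restricted-index"; chunk["pii_types"] == ["SSN"]
```

The routing decision happens once, at ingestion, so query-time access control reduces to choosing which index the caller is allowed to hit.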

Common RAG anti-patterns

  • Pure vector, no keyword. Misses exact identifiers. Hybrid is table stakes.
  • Chunks with no metadata. "The model retrieved a chunk" means nothing if you cannot trace it back to a specific document, version, page, and classification.
  • One embedding model for everything. Different corpora want different models. Evaluate on your data.
  • No retrieval eval. No one knows whether retrieval is getting better or worse between releases. Every prompt and index change should run a retrieval regression on a held-out query set.
  • Citations from the model, not from the retrieval layer. The model invents plausible-looking citations. Validate every citation against the retrieved set.
  • Superseded content retrieved without warning. Old policy versions sit in the index; users get answers from obsolete documents. Effective dates and superseded flags must be indexed and respected.
  • Ingesting untrusted web content into the same index as authoritative docs. The attack surface for indirect prompt injection becomes the entire internet.
  • Raw PII in chunks. Classification and PII detection must happen at ingestion, not at query time.

Evaluation: how you know retrieval is working

Build a retrieval eval set of 200 to 1000 queries paired with known-relevant chunks from your corpus. Measure:

  • Recall@k — does the relevant chunk appear in the top k results?
  • MRR (Mean Reciprocal Rank) — how high does the relevant chunk rank?
  • nDCG@k — graded relevance, weighted by rank.
  • Citation accuracy on generations — of the claims the model made, what fraction are actually supported by the cited chunk?
  • Refusal rate and refusal precision — how often the system declines, and how often declining was right.

Run this on every ingestion change, every embedding change, every chunking change, every reranker change, and every prompt change. Gate releases on no regression relative to the last champion.
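Recall@k and MRR over an eval set are a few lines each; nDCG adds graded relevance on top of the same loop. These are the standard formulas, assuming one known-relevant chunk per query:

```python
def recall_at_k(results: list[list[str]], relevant: list[str], k: int) -> float:
    """Fraction of queries whose relevant chunk appears in the top-k results."""
    hits = sum(rel in res[:k] for res, rel in zip(results, relevant))
    return hits / len(relevant)

def mrr(results: list[list[str]], relevant: list[str]) -> float:
    """Mean reciprocal rank of the relevant chunk (0 if not retrieved)."""
    total = 0.0
    for res, rel in zip(results, relevant):
        if rel in res:
            total += 1.0 / (res.index(rel) + 1)
    return total / len(relevant)

# Three eval queries: retrieved IDs in rank order, plus the known-relevant ID
results = [["chk_3", "chk_1"], ["chk_9", "chk_4"], ["chk_2", "chk_8"]]
relevant = ["chk_1", "chk_4", "chk_7"]
r2 = recall_at_k(results, relevant, k=2)  # 2 of 3 queries hit
m = mrr(results, relevant)                # (1/2 + 1/2 + 0) / 3
```

Persist these numbers per release; the point of the regression gate is the trend line, not any single value.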

FAQ

Which embedding models are safe to use in a federal system?
OpenAI text-embedding-3 via Azure OpenAI in Azure Government, Cohere Embed via AWS Bedrock in GovCloud, and self-hosted open-weight models like BGE, E5, or Nomic Embed Text. For classified work, self-hosted open-weight models are the default.
Is pgvector production-ready for federal workloads?
Yes. RDS for PostgreSQL in GovCloud and Azure Database for PostgreSQL in Azure Government both support pgvector. HNSW indexes scale to tens of millions of vectors. For hundreds of millions or very low latency, Qdrant or Weaviate self-hosted is more attractive.
Why is hybrid retrieval better than pure vector search?
Vector search misses exact matches on names, numbers, and acronyms that are common in federal documents. Hybrid combines dense vector search with BM25 keyword search and fuses the results, typically via Reciprocal Rank Fusion.
How do you handle PII and CUI in chunks?
Tag chunks with classification labels at ingestion, apply Presidio or a fine-tuned NER model to detect PII, redact or route sensitive chunks to a separate index with stricter access, and filter at query time by the caller's authorization.
What causes RAG systems to hallucinate citations?
Chunks too small to carry context, generation prompts that do not require grounding, or citation format produced by the model rather than attached deterministically. The fix is to attach citation metadata at retrieval, instruct the model to quote and cite, and validate every citation against the returned chunk set.

Where this fits in our practice

We build federal RAG systems end to end: ingestion, parsing, chunking, embedding, hybrid retrieval, reranking, grounded generation, and the evaluation loop that tells you the whole pipeline is still working. See our agentic AI and data engineering capabilities for the broader platform context, and our NIST 800-53 control mapping for the compliance overlay.

Building a RAG system over a federal corpus?

We design and build federal RAG platforms end to end — ingestion, hybrid retrieval, reranking, grounded generation, and the eval loop that keeps them honest.