RAG systems for federal document corpora.

Chunking, embeddings, hybrid retrieval, reranking, CUI-aware access control, and provenance tracking — built so every generated answer traces back to a source a reviewer can open.

Why RAG is the default for federal knowledge work

The most common federal AI question is the simplest: "what does our own policy say about this?" The answer sits in PDFs, SharePoint sites, policy libraries, SOP binders, OGC opinions, FOIA logs, and decades of memoranda that no human has read end-to-end in years. Retrieval-augmented generation is how federal agencies actually unlock that institutional knowledge — provided the system is built to the standards federal work requires.

We build production RAG systems that meet three non-negotiable federal requirements: every answer cites its sources, every source respects access control, and every retrieval is auditable. Miss any one of those and the system fails security review before it serves a user.

The RAG pipeline we ship

  1. Ingestion & parsing — Tika, Unstructured, pdfplumber, and custom extractors for federal formats. Preserve structural metadata: section headings, page numbers, paragraph offsets, table cells, form fields, banner markings, CUI categories, and classification.
  2. Chunking — recursive structural chunking along section/subsection/paragraph boundaries with 10-20% overlap. Table-aware row-level chunking. Hierarchical parent-child chunks for long documents. Per-chunk metadata carried end-to-end.
  3. Embedding — BGE-large, E5-Mistral, Nomic-Embed, or fine-tuned domain embeddings for agency-specific vocabulary. Multi-vector (ColBERT-style) for long-document tasks. All embedding runs happen inside the authorization boundary.
  4. Indexing — pgvector (default), Qdrant, Weaviate, or OpenSearch for hybrid. HNSW indexes with tuned ef_construction. Sparse BM25 alongside dense for hybrid fusion.
  5. Retrieval — hybrid BM25 + dense with reciprocal rank fusion. Metadata filters enforced pre-search for access control. Query rewriting and decomposition for multi-hop questions.
  6. Reranking — cross-encoder rerankers (BGE-reranker, Cohere Rerank 3, or self-hosted Qwen-Reranker) on the top 50-100 candidates. Typically lifts nDCG by 10-20 points on domain-specific queries.
  7. Generation — the LLM sees retrieved context with source IDs. The prompt enforces citation format. Output parser validates every claim maps to a cited chunk.
  8. Provenance — every generated sentence is tied back to source chunks, page numbers, and document URIs. A reviewer can click a citation and read the original paragraph.
  9. Evaluation — Ragas metrics plus domain-specific gold sets. Continuous eval on every deployment. Regression blocks bad releases.
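The citation-validation check in steps 7-8 can be sketched in a few lines. This is a minimal illustration, not our production parser: the bracketed `[doc:chunk]` citation format and all IDs below are hypothetical.

```python
import re

# Assumed citation format: claims carry one or more [doc_id:chunk_no]
# markers referring to retrieved chunks. Format and IDs are illustrative.
CITATION = re.compile(r"\[([A-Za-z0-9_-]+:\d+)\]")

def validate_citations(answer: str, retrieved_ids: set) -> list:
    """Return cited chunk IDs absent from the retrieved set (should be empty)."""
    return [c for c in CITATION.findall(answer) if c not in retrieved_ids]

answer = (
    "Records must be retained for 3 years [M-24-10:4]. "
    "Disposal requires written approval [M-24-10:7]."
)
# An empty result means every citation maps to a chunk the retriever returned.
bad = validate_citations(answer, {"M-24-10:4", "M-24-10:7"})
```

In practice the validator runs on every response; a non-empty result blocks the answer from reaching the user rather than shipping an unattributable claim.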

CUI-aware retrieval

The defining federal RAG problem is not retrieval quality — it is access control. A corpus contains CUI, PII, LES, SBU, and sometimes classified information. A user has a clearance level, need-to-know, and a role. Retrieval must enforce access before any content reaches the LLM. We implement this as a metadata filter evaluated at query time against the authenticated identity, with the filter mechanically combined into the vector search (not applied as a post-filter, which leaks content via embedding correlations).

Beyond filtering, we propagate banner markings through the generation step. If a chunk is CUI//BASIC//SP-PRIV, the response inherits the marking and the system appends appropriate dissemination language. Nothing crosses a boundary it should not.
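The pre-search filter reduces to a simple invariant: a chunk enters ranking only if the authenticated user holds a role on its ACL and is authorized for every marking it carries. A minimal sketch, with hypothetical field names and markings:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    markings: frozenset  # e.g. {"CUI//BASIC", "SP-PRIV"} — illustrative values
    acl: frozenset       # roles permitted to see this chunk

def allowed(chunk: Chunk, user_roles: set, user_markings: set) -> bool:
    # Both conditions must hold BEFORE the chunk is scored: role match
    # on the ACL, and authorization for every marking the chunk carries.
    return bool(chunk.acl & user_roles) and chunk.markings <= user_markings

def pre_filter(corpus, user_roles, user_markings):
    # Evaluated before vector search — never as a post-filter on results.
    return [c for c in corpus if allowed(c, user_roles, user_markings)]

corpus = [
    Chunk("c1", "records retention schedule ...", frozenset({"CUI//BASIC"}), frozenset({"records"})),
    Chunk("c2", "privacy incident report ...", frozenset({"CUI//BASIC", "SP-PRIV"}), frozenset({"privacy"})),
]
visible = pre_filter(corpus, user_roles={"records"}, user_markings={"CUI//BASIC"})
```

In a pgvector deployment the same predicate becomes a `WHERE` clause combined with the ANN `ORDER BY`, so ineligible rows are never scored at all.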

Chunking is where RAG lives or dies

We have rebuilt more federal RAG pilots because of bad chunking than because of bad models. The failure pattern is consistent: someone used LangChain's default 1000-character recursive splitter on a 400-page CFR and wondered why the retrieval was garbage. Federal documents have strong structural signals. Ignoring them is a choice.

Our chunking defaults: split at section and subsection boundaries first, paragraphs second, sentences only as a fallback. Attach the section heading as metadata. For tables, chunk row-by-row with the header row prepended to each row chunk. For long-form regulations, maintain a hierarchical index where a parent summary points to child chunks. For forms, index field-by-field with the form name and instruction text as metadata.
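The section-first split with the heading carried as metadata can be sketched as follows. The heading pattern is an assumption for illustration — real federal formats need per-corpus patterns, and this one over-matches lines that merely begin with a number.

```python
import re

# Assumed heading style: "1.2 Title" or "Section 3 ..." on its own line.
HEADING = re.compile(r"^(?:\d+(?:\.\d+)*\s+\S.*|Section\s+\d+.*)$", re.M)

def chunk_by_section(doc: str):
    """Split at headings first; each chunk carries its heading as metadata."""
    chunks = []
    matches = list(HEADING.finditer(doc))
    for i, m in enumerate(matches):
        start = m.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(doc)
        body = doc[start:end].strip()
        if body:
            chunks.append({"heading": m.group().strip(), "text": body})
    return chunks

doc = (
    "1.1 Scope\nThis policy applies to all components.\n"
    "1.2 Retention\nRecords are retained for 3 years."
)
chunks = chunk_by_section(doc)
```

A retrieval hit on the "1.2 Retention" chunk then surfaces both the paragraph and its heading, which is exactly what a citing answer needs.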

Hybrid search is non-negotiable

Federal queries are full of exact strings that dense embeddings fumble: "M-24-10", "44 U.S.C. 3554", "DD-254", "SF-1449", "IL5". BM25 handles these cleanly. Dense retrieval handles semantic queries: "what's the rule on vendor data retention after contract close-out." Running both and fusing via reciprocal rank fusion with tuned weights beats either alone by a wide margin. Add a cross-encoder reranker on the top results and you approach the ceiling of what retrieval can deliver on federal corpora.
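Reciprocal rank fusion itself is a few lines: each retriever contributes `1/(k + rank)` per document, and the sums are re-sorted. The `k=60` constant comes from the original RRF paper; in a real deployment it would be tuned alongside per-retriever weights. The document IDs below are made up.

```python
def rrf(rankings, k: int = 60):
    """Fuse best-first ranked lists of doc IDs via reciprocal rank fusion.

    score(d) = sum over lists containing d of 1 / (k + rank_in_that_list).
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["DD-254-guide", "far-52.204", "sow-2021"]          # exact-match hits
dense = ["far-52.204", "retention-policy", "DD-254-guide"]  # semantic hits
fused = rrf([bm25, dense])
```

Note that `far-52.204`, ranked highly by both retrievers, wins the fusion even though neither list put it first — which is the behavior hybrid search is buying.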

Deployment paths

  • Cloud-hosted: AWS GovCloud with Bedrock + OpenSearch, or Azure Government with Azure AI Search + Azure OpenAI. FedRAMP-aligned.
  • On-premise: pgvector + self-hosted embedding models + vLLM-served Llama or Mistral. For SIPR, JWICS, or agency-specific enclaves.
  • Hybrid: public/open corpus in cloud, sensitive corpus on-prem, federated query layer.

Federal use cases we build for

  • Policy and regulation Q&A with statutory citations
  • FOIA triage and de-duplication across request backlogs
  • Contract language retrieval from prior SOWs and RFPs
  • Clinical guideline retrieval for VA and HHS workflows
  • Intelligence report synthesis across source documents
  • Grant program eligibility determinations
  • Agency knowledge assistants over SOPs, training materials, and institutional memory

Federal RAG, answered.
Why is RAG preferred over fine-tuning for most federal document use cases?

RAG keeps the source of truth external and updatable. When policy changes — which happens constantly in federal environments — you re-index, not retrain. RAG also provides citations, which is non-negotiable for federal outputs. Fine-tuning bakes knowledge in, making it stale and unattributable.

How do you handle CUI, PII, and classification markings in a RAG corpus?

Every chunk is tagged at ingest with its classification, handling caveat, and access control list. Retrieval enforces per-user access before reranking. Responses only cite chunks the user is cleared for. We preserve banner markings, dissemination controls, and CUI categories through the entire pipeline.

What chunking strategy works best for federal documents?

Recursive structural chunking that respects sections, subsections, and paragraph boundaries, with 10-20% overlap, plus hierarchical summarization for the parent document. For tables, chunk row-wise with header context. For forms, index field-by-field.

Should we use hybrid search or pure vector search?

Hybrid, always. Federal queries often contain exact acronyms, statute citations, and form numbers that dense embeddings handle poorly. BM25 nails the exact matches, dense retrieval nails the semantic matches, and a reranker resolves the fusion.

How do you evaluate a federal RAG system?

Retrieval metrics (recall@k, MRR, nDCG) on a gold question set plus generation metrics (faithfulness, answer relevance, citation accuracy) via Ragas or custom rubrics. We also run adversarial evals — can the system be tricked into citing a wrong source?
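The retrieval side of that eval harness is small enough to sketch directly. A minimal version of recall@k and MRR over a gold set, with toy data:

```python
def recall_at_k(retrieved, relevant, k: int) -> float:
    """Fraction of relevant chunk IDs appearing in the top-k retrieved."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mrr(results) -> float:
    """Mean reciprocal rank of the first relevant hit across queries.

    `results` is a list of (retrieved_ids, relevant_id_set) pairs.
    """
    total = 0.0
    for retrieved, relevant in results:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(results)

r_at_2 = recall_at_k(["a", "b", "c"], ["b", "d"], k=2)   # one of two found
score = mrr([(["a", "b"], {"b"}), (["c"], {"c"})])        # ranks 2 and 1
```

Running these on every deployment against a frozen gold set is what makes the regression gate mechanical rather than a judgment call.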

Can you deploy RAG in an air-gapped enclave?

Yes. pgvector + self-hosted embeddings (BGE, E5) + vLLM-served open-weight LLM with no external call-out. Every component runs inside the boundary. Ideal for SIPR, JWICS, and agency-specific classified enclaves via a cleared prime partner.


Unlock institutional knowledge at scale.

Federal RAG with real citations. Ready to deliver.

[email protected]
UEI Y2JVCZXT9HP5 · CAGE 1AYQ0 · NAICS 541512 · SAM.gov: Active