Why RAG is the default for federal knowledge work
The most common federal AI question is the simplest: "what does our own policy say about this?" The answer sits in PDFs, SharePoint sites, policy libraries, SOP binders, OGC opinions, FOIA logs, and decades of memoranda that no human has read end-to-end in years. Retrieval-augmented generation is how federal agencies actually unlock that institutional knowledge — provided the system is built to the standards federal work requires.
We build production RAG systems that meet three non-negotiable federal requirements: every answer cites its sources, every source respects access control, and every retrieval is auditable. Miss any one of those and the system fails security review before it serves a user.
The RAG pipeline we ship
- Ingestion & parsing — Tika, Unstructured, pdfplumber, and custom extractors for federal formats. Preserve structural metadata: section headings, page numbers, paragraph offsets, table cells, form fields, banner markings, CUI categories, and classification.
- Chunking — recursive structural chunking along section/subsection/paragraph boundaries with 10-20% overlap. Table-aware row-level chunking. Hierarchical parent-child chunks for long documents. Per-chunk metadata carried end-to-end.
- Embedding — BGE-large, E5-Mistral, Nomic-Embed, or fine-tuned domain embeddings for agency-specific vocabulary. Multi-vector (ColBERT-style) for long-document tasks. All embedding runs happen inside the authorization boundary.
- Indexing — pgvector (default), Qdrant, Weaviate, or OpenSearch for hybrid. HNSW indexes with tuned ef_construction. Sparse BM25 alongside dense for hybrid fusion.
- Retrieval — hybrid BM25 + dense with reciprocal rank fusion. Metadata filters enforced pre-search for access control. Query rewriting and decomposition for multi-hop questions.
- Reranking — cross-encoder rerankers (BGE-reranker, Cohere Rerank 3, or self-hosted Qwen-Reranker) on the top 50-100 candidates. Lifts nDCG 10-20 points on domain-specific queries.
- Generation — the LLM sees retrieved context with source IDs. The prompt enforces citation format. Output parser validates every claim maps to a cited chunk.
- Provenance — every generated sentence is tied back to source chunks, page numbers, and document URIs. A reviewer can click a citation and read the original paragraph.
- Evaluation — Ragas metrics plus domain-specific gold sets. Continuous eval on every deployment; regression gates block bad releases.
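The generation and provenance steps above hinge on one mechanical check: every sentence in the output must cite a chunk that was actually retrieved. A minimal sketch of that validator (the `[chunk-<id>]` citation convention and the function name are illustrative assumptions, not our production schema):

```python
import re

# Assumed citation convention: each sentence carries one or more "[chunk-<id>]" tags.
CITATION = re.compile(r"\[(chunk-[\w-]+)\]")

def validate_citations(answer: str, retrieved_ids: set[str]) -> list[str]:
    """Return the sentences whose citations are missing or don't map to a retrieved chunk."""
    bad = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        cited = CITATION.findall(sentence)
        if not cited or any(c not in retrieved_ids for c in cited):
            bad.append(sentence)
    return bad

retrieved = {"chunk-ab12", "chunk-cd34"}
answer = ("Retention is six years [chunk-ab12]. "
          "Disposal requires OGC sign-off [chunk-zz99].")
flagged = validate_citations(answer, retrieved)  # the second sentence cites an unknown chunk
```

The same sentence-to-chunk mapping is what backs the clickable citations in the provenance step.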
CUI-aware retrieval
The defining federal RAG problem is not retrieval quality — it is access control. A corpus contains CUI, PII, LES, SBU, and sometimes classified information. A user has a clearance level, need-to-know, and a role. Retrieval must enforce access before any content reaches the LLM. We implement this as a metadata filter evaluated at query time against the authenticated identity. The filter is combined into the vector search query itself, never applied as a post-filter on retrieved results: post-filtering wastes the candidate budget and can expose restricted content through nearest-neighbor correlations.
Beyond filtering, we propagate banner markings through the generation step. If a chunk is marked CUI//SP-PRIV, the response inherits the marking and the system appends the appropriate dissemination language. Nothing crosses a boundary it should not.
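Pre-search enforcement is easiest to see in the query itself. A sketch against pgvector, with assumed table and column names (`chunks`, `acl_groups`, `clearance_rank`): the access predicates sit inside the same query that runs the nearest-neighbor search, so restricted rows are excluded before ranking rather than dropped from results afterward.

```python
def acl_filtered_search(query_vec: str, user_groups: list[str],
                        clearance_rank: int, k: int = 50) -> tuple[str, tuple]:
    """Build a pgvector ANN query with access control inside the search.

    Assumptions: `&&` is Postgres array overlap (the user holds at least one
    granting group), `<=>` is pgvector's distance operator, and clearance
    levels are encoded as an ordered integer rank.
    """
    sql = (
        "SELECT chunk_id, text, banner_marking, "
        "embedding <=> %s::vector AS dist "
        "FROM chunks "
        "WHERE acl_groups && %s "      # need-to-know: group overlap
        "AND clearance_rank <= %s "    # clearance: ordered rank comparison
        "ORDER BY dist LIMIT %s"
    )
    return sql, (query_vec, user_groups, clearance_rank, k)

sql, params = acl_filtered_search("[0.1, 0.2, 0.3]", ["foia-team"], 2)
```

Because the predicates are part of the ANN search, the index never surfaces a row the caller is not cleared to see.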
Chunking is where RAG lives or dies
We have rebuilt more federal RAG pilots because of bad chunking than because of bad models. The failure pattern is consistent: someone used LangChain's default 1000-character recursive splitter on a 400-page CFR and wondered why the retrieval was garbage. Federal documents have strong structural signals. Ignoring them is a choice.
Our chunking defaults: split at section and subsection boundaries first, paragraphs second, sentences only as a fallback. Attach the section heading as metadata. For tables, chunk row-by-row with the header row prepended to each row chunk. For long-form regulations, maintain a hierarchical index where a parent summary points to child chunks. For forms, index field-by-field with the form name and instruction text as metadata.
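A toy version of those defaults, with a stand-in heading regex doing the work a real structural parser would:

```python
import re

# Assumed heading convention: "§ 200.1" or "3.2 Scope" style numbered headings.
HEADING = re.compile(r"^(?:§|\d+(?:\.\d+)*)\s")

def chunk_by_structure(doc: str, max_chars: int = 1200) -> list[dict]:
    """Split at headings first, paragraphs second, sentences only as a fallback.
    Every chunk carries the heading that governs it as metadata."""
    chunks, heading = [], None
    for block in doc.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        if HEADING.match(block):
            heading = block
            continue
        if len(block) <= max_chars:
            chunks.append({"text": block, "section": heading})
        else:  # oversized paragraph: fall back to sentence-level chunks
            for sent in re.split(r"(?<=[.!?])\s+", block):
                chunks.append({"text": sent, "section": heading})
    return chunks

doc = ("3.1 Retention\n\nRecords are kept six years.\n\n"
       "3.2 Disposal\n\nDisposal requires approval.")
```

Table-row and parent-child chunking follow the same pattern: detect structure first, then attach it as per-chunk metadata.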
Hybrid search is non-negotiable
Federal queries are full of exact strings that dense embeddings fumble: "M-24-10", "44 U.S.C. 3554", "DD-254", "SF-1449", "IL5". BM25 handles these cleanly. Dense retrieval handles semantic queries: "what's the rule on vendor data retention after contract close-out?" Running both and fusing via reciprocal rank fusion with tuned weights beats either alone by a wide margin. Add a cross-encoder reranker on the top results and you approach the ceiling of what retrieval can deliver on federal corpora.
Deployment paths
- Cloud-hosted: AWS GovCloud with Bedrock + OpenSearch, or Azure Government with Azure AI Search + Azure OpenAI. FedRAMP-aligned.
- On-premises: pgvector + self-hosted embedding models + vLLM-served Llama or Mistral. For SIPR, JWICS, or agency-specific enclaves.
- Hybrid: public/open corpus in cloud, sensitive corpus on-prem, federated query layer.
Federal use cases we build for
- Policy and regulation Q&A with statutory citations
- FOIA triage and de-duplication across request backlogs
- Contract language retrieval from prior SOWs and RFPs
- Clinical guideline retrieval for VA and HHS workflows
- Intelligence report synthesis across source documents
- Grant program eligibility determinations
- Agency knowledge assistants over SOPs, training materials, and institutional memory