What RAG is, in one paragraph

Retrieval-augmented generation (RAG) is a pattern where a language model answers questions using documents you supply rather than its training memory. The system retrieves the most relevant passages from a document store, hands them to the model along with the question, and the model writes the answer with those passages as context. The point is grounding: the answer should reflect what the documents actually say, not what the model thinks they probably say.
Commercial RAG demos make this look easy. Federal RAG is harder for four specific reasons: the documents are heterogeneous (PDFs, scanned forms, regulations, briefing slides, emails, structured records), the user expects citations they can audit, the system has to know when to refuse, and the whole stack often has to run on government cloud or inside an air-gapped enclave. Each of those reasons changes the architecture.
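Stripped to a skeleton, the pattern looks like the sketch below. The `retrieve` and `generate` helpers are hypothetical stand-ins for a real vector store client and a real model endpoint; nothing here is a specific library's API.

```python
# Minimal RAG loop. `retrieve` and `generate` are hypothetical stand-ins
# for a real vector store client and a real generation model.

def answer(question: str, retrieve, generate, k: int = 5) -> str:
    # 1. Find the k passages most relevant to the question.
    passages = retrieve(question, k=k)

    # 2. Build a prompt that pins the model to the retrieved evidence.
    context = "\n\n".join(
        f"[{i + 1}] {p['text']} (source: {p['source']}, p. {p['page']})"
        for i, p in enumerate(passages)
    )
    prompt = (
        "Answer using ONLY the passages below, citing them as [n]. "
        "If they do not contain the answer, say so.\n\n"
        f"Passages:\n{context}\n\nQuestion: {question}"
    )

    # 3. The model writes the answer with the passages as context.
    return generate(prompt)
```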
Why federal corpora are heterogeneous
A "corpus" sounds like one thing. In federal use, it is rarely one thing. A typical RAG deployment touches multiple document types at once.
You have native PDFs (clean text, easy to extract). You have scanned PDFs (images of text that require OCR, Optical Character Recognition, which introduces its own errors). You have structured data in databases. You have semi-structured forms with fixed sections. You have briefing slide decks where the meaning lives partly in the layout. You have email threads where the context is the thread, not any single message.
Each of these requires a different ingestion pipeline. The published research is consistent that there is no universal extractor; what works is a per-source pipeline that produces a normalized intermediate format — usually clean text plus structured metadata about source, page, section, and any access controls — that downstream stages can treat uniformly.
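One way to picture that normalized intermediate format is as a single record type that every per-source pipeline emits. The field names below are illustrative, not a published schema:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    """Normalized output of every ingestion pipeline, whatever the source."""
    text: str             # clean extracted text for this chunk
    source: str           # originating document or system identifier
    page: int | None      # page number, where the source has pages
    section: str | None   # section or field label, where one exists
    access_labels: list[str] = field(default_factory=list)  # controls carried from the source
```

Downstream stages (embedding, retrieval, permission filtering) can then treat a scanned form and a briefing deck identically.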
Choosing a vector store: pgvector, FAISS, OpenSearch
The vector store is the part of the system that finds relevant passages. Each passage is converted into an embedding (a numerical fingerprint produced by a dedicated embedding model), and the store finds the embeddings closest to the question's embedding.
Three options dominate the public discussion. pgvector is a Postgres extension — it adds vector search to the relational database your team already runs. The advantage is operational simplicity: one database, one backup process, transactional access controls. The disadvantage is that pure-vector workloads at very large scale are not its strength.
FAISS (Facebook AI Similarity Search, an open-source library) is a research-grade in-memory index that is extremely fast at scale but is a library, not a database — you bolt it into your application yourself, and you get no built-in persistence, replication, or access control. OpenSearch (the AWS-stewarded fork of Elasticsearch) supports both vector and lexical (keyword) search in one engine, which matters because the published research consistently shows that hybrid retrieval — vector plus keyword — outperforms either alone.
For most federal RAG, the boring answer wins: pgvector for small to mid-size corpora, OpenSearch for hybrid retrieval at scale, FAISS only when raw embedding throughput is the binding constraint and you have engineering budget to wrap it.
| Vector store | Strengths | Trade-offs | Federal-context fit |
|---|---|---|---|
| pgvector | Lives inside Postgres; transactional access controls; simple ops | Pure vector at very large scale is not its strength | Strong fit for small-to-mid corpora; reuses existing audit and backup tooling |
| OpenSearch | Hybrid vector + keyword in one engine; mature scaling | More moving parts; tuning hybrid weights takes work | Strong fit when retrieval quality matters more than ops simplicity |
| FAISS | Extremely fast in-memory similarity | Library, not a database; you build the surrounding system | Only when throughput is the binding constraint |
| Milvus / Qdrant / Weaviate | Purpose-built vector DBs; rich filtering | Another system to operate, accredit, and harden | Strong tech, weaker case for adding a system to an air-gapped enclave |
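To make the pgvector row concrete: assuming a `chunks` table with an `embedding` column (the schema and connection string here are illustrative), a nearest-neighbor query is a few lines of ordinary Postgres access:

```python
import psycopg
from pgvector.psycopg import register_vector  # pgvector's Python adapter

conn = psycopg.connect("dbname=rag")  # illustrative connection string
register_vector(conn)                 # teach the driver the vector type

def nearest_chunks(query_embedding, k: int = 5):
    # query_embedding: a vector from the same embedding model used at ingestion.
    # <=> is pgvector's cosine-distance operator; smaller means closer.
    return conn.execute(
        "SELECT text, source, page, embedding <=> %s AS distance "
        "FROM chunks ORDER BY distance LIMIT %s",
        (query_embedding, k),
    ).fetchall()
```

The operational point stands on its own: retrieval, access control, and backup all live inside the database the team already audits.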
Citation-required outputs
In a federal context, an uncited answer is unusable. Reviewers, auditors, and FOIA (Freedom of Information Act) respondents all need to trace claims back to their sources.
The published research distinguishes two flavors of citation. The weaker version — "post-hoc citations" — lets the model write its answer freely and then attaches a list of retrieved passages at the end. This is fast to build and easy to fool: the listed passages may not actually support the claims in the answer.
The stronger version — "inline grounded citations" — requires every assertion in the answer to point to a specific passage that supports it. The model is prompted (or fine-tuned) to emit answers with bracketed citation markers, and a separate verification step checks that each cited passage genuinely entails the claim. This is what survives audit.
Faithfulness frameworks like RAGAS (a faithfulness-scoring framework for RAG outputs) and TruLens (an open-source RAG observability library) automate the verification step. The answer is decomposed into atomic claims, and each claim is checked against its cited passage using a separate language model as judge. Scores below threshold get flagged for human review or trigger refusal.
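The loop those frameworks automate has a simple shape. In the sketch below, `decompose_claims` and `judge_entailment` are hypothetical stand-ins for the framework internals: the first splits an answer into atomic claims, the second asks a judge model whether a cited passage supports a claim.

```python
def verify(answer: str, citations: dict, decompose_claims, judge_entailment,
           threshold: float = 0.8):
    """Return the claims whose cited passage does not entail them.

    `decompose_claims` and `judge_entailment` are hypothetical helpers;
    the threshold is illustrative and set by calibration, not by default.
    """
    flagged = []
    for claim in decompose_claims(answer):             # atomic claims
        passage = citations.get(claim.citation_id)     # the passage it cites
        if passage is None:                            # uncited claim: always flag
            flagged.append((claim.text, 0.0))
            continue
        score = judge_entailment(claim.text, passage)  # judge model, 0.0-1.0
        if score < threshold:
            flagged.append((claim.text, score))
    return flagged  # non-empty: route to human review or refuse
```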
Refuse-to-answer thresholds
Sometimes the right answer is "I do not know based on the documents available." Federal users would rather hear that than a confident wrong answer.
The published systems implement refusal in a few stacked ways. Retrieval-side: if no retrieved passage scores above a similarity threshold, the system refuses before generation. Generation-side: a small classifier is trained to detect ungrounded outputs and triggers refusal when the model's draft answer is not anchored in retrieved evidence. Self-consistency: the system runs the question multiple times with different sampling and refuses if the answers disagree on key facts.
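A sketch of how the three signals stack; every helper and threshold here is illustrative:

```python
def should_refuse(question, passages, draft_answer, retries,
                  grounding_clf, answers_agree, sim_floor: float = 0.35) -> bool:
    # Retrieval-side: nothing in the corpus is close enough to the question.
    if not passages or max(p["score"] for p in passages) < sim_floor:
        return True
    # Generation-side: a small classifier flags the draft as ungrounded.
    if grounding_clf(draft_answer, passages) == "ungrounded":
        return True
    # Self-consistency: resampled answers disagree on key facts.
    if not answers_agree([draft_answer, *retries]):
        return True
    return False
```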
Setting the threshold is a calibration exercise. Too high, and the system refuses easy questions. Too low, and it confidently answers questions the corpus does not cover. The published methodology pairs threshold tuning with a per-domain holdout set of questions known to be unanswerable, and reports refusal accuracy alongside answer accuracy.
Faithfulness evaluation in plain terms
"Faithfulness" is a specific metric in RAG: does the answer actually reflect the retrieved passages, or did the model add things that are not there? It is separate from "is the answer correct" and from "is the answer helpful."
The standard public eval frameworks decompose this into a few sub-metrics. Context relevance — were the right passages retrieved? Answer faithfulness — do the claims in the answer follow from those passages? Answer relevance — does the answer address the question? Citation precision — do the inline citations point to passages that actually support the cited claim?
Each gets scored separately, usually by an LLM judge configured for the task. The published research is careful about LLM-as-judge bias: judges agree most when claims are atomic and disagree when claims are compound. The methodology that survives review breaks claims down to atomic units before judging.
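In practice the sub-metric scoring is usually delegated to a framework rather than hand-rolled. A minimal RAGAS invocation looks roughly like the sketch below; the RAGAS API has shifted across releases (this follows the 0.1-era interface), and the example rows are invented for illustration:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# A one-row eval set; real holdout sets are hundreds of curated rows.
eval_set = Dataset.from_dict({
    "question":     ["What does section 4.2 require?"],
    "answer":       ["Section 4.2 requires annual program review. [1]"],
    "contexts":     [["Section 4.2: programs shall be reviewed annually."]],
    "ground_truth": ["Annual review of programs."],
})

scores = evaluate(
    eval_set,
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(scores)  # each sub-metric reported separately, per the decomposition above
```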
Air-gapped and federated deployment
Federal RAG often runs where a public LLM API cannot reach. Options narrow accordingly.
For an air-gapped enclave, every component — embedding model, vector store, generation model, evaluation models — lives inside the enclave. Open-weight models (Llama, Mistral, Qwen, and DeepSeek-class architectures, served via vLLM, llama.cpp, or TensorRT-LLM) are the practical answer for generation. Embedding models from the open BGE, E5, and Nomic families cover the retrieval side.
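Serving inside the enclave is, mechanically, a local model path and no network egress. A minimal vLLM offline-generation sketch (the model path is an example, not a recommendation):

```python
from vllm import LLM, SamplingParams

# Weights load from a local path inside the enclave; nothing calls out.
llm = LLM(model="/models/llama-3.1-8b-instruct")  # illustrative local path
params = SamplingParams(temperature=0.2, max_tokens=512)

prompt = "<grounded prompt built from retrieved passages>"
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```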
For federated data — multiple sites that cannot share their underlying documents — the published research explores patterns where retrieval happens at each site, results are returned with redacted metadata, and a central node performs the generation step. Trust boundaries are explicit at each hop. This is more architecture than most teams want, but it is the only published pattern that preserves data sovereignty across sites.
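The shape of that pattern, reduced to a sketch in which every name is hypothetical:

```python
def federated_answer(question, sites, redact, generate, k: int = 5):
    """Sketch of the federated pattern: retrieve per site, generate centrally."""
    pooled = []
    for site in sites:
        # Retrieval runs inside each site's trust boundary.
        for passage in site.retrieve(question, k=k):
            # Only redacted results cross the boundary to the central node.
            pooled.append(redact(passage))
    pooled.sort(key=lambda p: p["score"], reverse=True)
    # The central node sees redacted passages, never the underlying documents.
    return generate(question, pooled[:k])
```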
Either way, you need a hardened observability story. Logging every query, every retrieval result, every generation, every evaluation score — with controlled access — is what makes the system auditable. The published RAG observability tools (TruLens, Phoenix, LangSmith) can be deployed inside an enclave but require explicit wiring to the enclave's logging and authentication systems.
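A minimal shape for the per-query audit record (field names illustrative; the log itself sits behind the enclave's access controls):

```python
import json
import time
import uuid

def audit_record(user_id, question, passages, answer, eval_scores) -> str:
    """One append-only JSON line per query."""
    return json.dumps({
        "event_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "user": user_id,
        "question": question,
        "retrieved": [{"source": p["source"], "score": p["score"]} for p in passages],
        "answer": answer,
        "eval": eval_scores,  # faithfulness, relevance, citation precision
    })
```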
Permissions that flow with the data
Federal documents have access controls. A user who is not cleared for a document should not see passages from that document — ever — regardless of how the model phrases the answer.
The published pattern is to attach access metadata to every chunk at ingestion time, and to filter retrieval results by the requesting user's permissions before any passage reaches the generation model. The filter has to happen at the retrieval layer; you cannot rely on the model to refuse to use passages it has been shown.
This sounds simple and is not. Document-level permissions are the easy case. Section-level and field-level permissions — common in regulatory and personnel records — require ingestion-time chunking that respects the permission boundary. Cross-source permissions — "this user can see passage A from corpus 1 and passage B from corpus 2 but not the combination" — require explicit reasoning the published systems are still working out.
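The document-level case, the easy one, looks roughly like this (reusing the illustrative `access_labels` field from the ingestion sketch earlier; `store` is hypothetical):

```python
def retrieve_permitted(question, user_clearances: set[str], store, k: int = 5):
    # Over-fetch, then filter at the retrieval layer, before any passage
    # can reach the generation model. Never rely on the model to refuse.
    candidates = store.search(question, k=k * 4)
    permitted = [
        c for c in candidates
        if set(c.access_labels) <= user_clearances  # every label must be cleared
    ]
    return permitted[:k]
```

In production the filter is usually pushed into the store itself (a WHERE clause in pgvector, a filter in OpenSearch) so unpermitted chunks are never scored at all.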
Common questions on the public-research framing
Why hybrid retrieval and not just vector retrieval?
Vector retrieval finds passages that mean similar things to the question. Keyword retrieval finds passages that share specific terms with the question. Federal queries often hinge on exact terms (statute numbers, contract IDs, named entities) that vector embeddings can blur. Hybrid retrieval combines both and consistently outperforms either alone in the published evaluations.
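One common way to combine the two rankings is reciprocal rank fusion, a simple published scheme; a minimal sketch:

```python
def reciprocal_rank_fusion(vector_hits, keyword_hits, k: int = 60):
    """Merge two ranked lists of chunk ids; k=60 is the conventional constant."""
    scores: dict[str, float] = {}
    for hits in (vector_hits, keyword_hits):
        for rank, chunk_id in enumerate(hits):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

A statute number that exact-matches in the keyword list but ranks mid-pack in the vector list still surfaces near the top of the fused ranking.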
What is the difference between post-hoc citations and inline grounded citations?
Post-hoc citations attach a list of retrieved passages after a freely written answer; the listed passages may not actually support the answer's claims. Inline grounded citations require each claim to point to a specific supporting passage and verify the support before output. The latter survives audit; the former does not.
What does this article not cover?
Specific corpora, specific agency systems, or any Precision Federal architectural approach to a particular RAG problem.
Frequently asked questions
What is retrieval-augmented generation, and why does it matter for federal work?
Retrieval-augmented generation lets a language model answer questions using your documents instead of its training memory. For federal use, this matters because the answer needs to reflect the actual policy, regulation, or record — not what a general-purpose model thinks they probably say — and the answer needs to be auditable back to those sources.
Which vector store should a federal RAG system use?
For most cases, pgvector inside Postgres for small-to-mid corpora and OpenSearch for hybrid retrieval at scale. FAISS is a fit when raw similarity throughput is the binding constraint and you have engineering budget to wrap it. Adding another system to an air-gapped enclave has real ops cost; the boring answer often wins.
How does the system decide when to refuse to answer?
By stacking a few signals: low retrieval similarity (no passage above threshold), a classifier that detects ungrounded drafts, and self-consistency checks (the model gives different answers on retries). The thresholds are tuned against held-out questions known to be unanswerable, and refusal accuracy is reported alongside answer accuracy.
What is faithfulness evaluation?
A separate metric from accuracy and helpfulness: do the claims in the answer actually follow from the retrieved passages, or did the model add things not in the sources? Frameworks like RAGAS and TruLens automate the check by decomposing the answer into atomic claims and judging each against its cited passage.
What runs inside an air-gapped deployment?
Every component — embedding model, vector store, generation model, evaluation judge — lives inside the enclave. Open-weight generation models (Llama, Mistral, Qwen) served via vLLM or TensorRT-LLM are the practical answer; open embedding models (BGE, E5, Nomic) cover retrieval. Logging and access control wire into the enclave's existing infrastructure.
How we use this site
We write articles like this to make our reading of the open literature visible — what we think the published methods say, what the open gaps are, and where careful work might land. We do not use these pages to preview proposed approaches in active program spaces. Precision Federal is a software-only SBIR firm. If your office is funding work in this area and would value a software-first partner with a documented public-reading habit, we welcome the introduction.