Why LLMs need their own control mapping
NIST SP 800-53 Rev 5 was written for general-purpose information systems. When you drop a large language model into that framework, the control text still applies, but the implementation shifts in ways that most assessors have not yet internalized. LLM systems fail in unique ways: hallucinated outputs that look authoritative, prompt injection through untrusted context, data leakage through prompt echoes, adversarial completions that bypass downstream validation, and model drift after a silent upstream version bump.
None of those failure modes are called out by name in 800-53. All of them can be mapped to existing controls if you know where to look. This post walks the families that matter most for federal LLM deployments, gives you concrete language for your System Security Plan, and shows what the audit log actually needs to contain.
The advice below assumes a FedRAMP Moderate or High baseline, a typical agency deployment pattern (LLM behind an API gateway inside a VPC, calling a hosted foundation model or a self-hosted open-weight model), and that your system is in scope for an Authority to Operate.
Control families most affected by LLMs
Seven families carry almost all of the LLM-specific weight. Other families (PL, PS, AT, CP, IR, MA, MP, PE, SA) still apply but typically inherit from the underlying cloud platform or enterprise program without LLM-specific twists.
- AC — Access Control. Who can call the model, with what scopes, against which data.
- AU — Audit and Accountability. What gets logged for every prompt and completion.
- IA — Identification and Authentication. How agents, users, and service principals prove identity to the model endpoint.
- SC — System and Communications Protection. Boundary, transit, storage, and cryptographic protection for model traffic and artifacts.
- SI — System and Information Integrity. Prompt injection defense, output validation, model integrity, supply chain.
- RA — Risk Assessment. Vulnerability scanning that now includes model weights, prompts, and adversarial testing.
- CM — Configuration Management. Pinning model versions, prompt templates, tool definitions, and RAG indexes.
AC — Access Control for model endpoints
The most common early mistake is treating the model endpoint as a generic API and forgetting that prompts often carry sensitive context. AC-3 (access enforcement) needs to consider not just who can call the endpoint but what data they can include in the prompt and what completions they are allowed to receive.
AC-3 (access enforcement). Enforce scoped service roles per calling application. A chatbot role should not have access to the batch summarization endpoint. In Bedrock, use resource-based policies on model IDs. In Azure OpenAI, use Azure RBAC at the resource and deployment level, plus managed identity on the caller.
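As a sketch of what that enforcement can look like at the gateway before any cloud-native policy applies — role names and model IDs here are illustrative, and a real deployment would back this table with IAM or RBAC, not an in-memory dict:

```python
# Gateway-side access enforcement sketch (AC-3): each calling role is
# allowlisted for specific model deployments; deny by default.
ROLE_MODEL_POLICY = {
    "chatbot-role": {"chat-model-prod"},
    "batch-summarizer-role": {"summarize-model-prod"},
}

def enforce_model_access(role: str, model_id: str) -> bool:
    """Return True only if the calling role is allowlisted for this model."""
    return model_id in ROLE_MODEL_POLICY.get(role, set())
```

The point is the shape: an explicit, reviewable mapping from caller to model, checked on every request, with no fallthrough.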
AC-4 (information flow enforcement). Classification labels on prompts must propagate. If a user's context includes CUI, the completion inherits CUI classification until validated otherwise. A lot of systems get this wrong by logging completions at a lower classification than the prompt that generated them.
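A minimal sketch of that propagation rule, assuming a simple ordered label set (the ordering and label names are illustrative):

```python
# AC-4 label propagation sketch: a completion inherits the highest
# classification among everything that entered the prompt context.
LEVELS = ["PUBLIC", "INTERNAL", "CUI"]  # illustrative ordering, low to high

def propagate_classification(input_labels: list[str]) -> str:
    """Completion classification = max over the prompt-context labels."""
    return max(input_labels, key=LEVELS.index)
```

Applying this at log-write time is what prevents the common failure of logging a completion at a lower classification than its prompt.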
AC-6 (least privilege). The service account that the model calls for tool use (retrieval, code execution, database queries) should have the narrowest scope that works. A retrieval tool that has read access to the whole vector store when the user only needs a single collection is a finding waiting to happen.
AC-17 (remote access). If developers hit the model endpoint from outside the boundary, require a TIC-compliant path, a federated identity, and MFA. No shared API keys.
AU — Audit and Accountability
This is where most LLM systems are thinnest. AU-2 and AU-3 together demand that you log enough to reconstruct the event and attribute it. For an LLM, that event is a prompt and a completion, plus every tool call the model made in between.
Here is a log schema we use as a baseline. Treat this as a starting point, not a target.
{
  "event_id": "evt_01HX8K4YQZ3F4K8R2N1V0W7ZAB",
  "timestamp": "2026-04-16T14:02:19.482Z",
  "system_id": "llm-chat-prod",
  "environment": "prod",
  "classification": "CUI",
  "user": {
    "id": "user:[email protected]",
    "session_id": "sess_01HX8K4Y...",
    "source_ip": "10.42.17.88",
    "auth_method": "PIV-CAC"
  },
  "request": {
    "model_id": "anthropic.claude-sonnet-4",
    "model_version_hash": "sha256:7ab1...c4",
    "system_prompt_id": "sp_v12",
    "tools_enabled": ["rag_search_v3", "case_lookup_v2"],
    "prompt_hash": "sha256:91f8...3d",
    "prompt_tokens": 842,
    "temperature": 0.2,
    "max_tokens": 1024
  },
  "tool_calls": [
    {"name": "rag_search_v3", "args_hash": "sha256:2a...", "result_hash": "sha256:5f...", "latency_ms": 412}
  ],
  "response": {
    "completion_hash": "sha256:0e6b...8a",
    "completion_tokens": 318,
    "moderation_flags": [],
    "confidence_gate": "passed",
    "latency_ms": 1873
  },
  "policy": {
    "prompt_injection_score": 0.03,
    "pii_detected": false,
    "classification_check": "ok"
  }
}
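A minimal sketch of producing a record like this at the gateway, hashing content rather than storing cleartext. The helper names are illustrative and the record is trimmed; a production emitter would populate the full schema above:

```python
import hashlib
from datetime import datetime, timezone

def sha256_tag(text: str) -> str:
    """Hash content the way the schema's *_hash fields expect."""
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()

def build_audit_event(user_id: str, model_id: str,
                      prompt: str, completion: str) -> dict:
    """Emit a trimmed audit record; real records carry the full schema."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": {"id": user_id},
        "request": {"model_id": model_id, "prompt_hash": sha256_tag(prompt)},
        "response": {"completion_hash": sha256_tag(completion)},
    }
```

Hashing at emit time is what makes the hash-in-log, cleartext-in-classified-store split (discussed under AU-3 below) workable.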
AU-2 (event logging). Every model invocation is an auditable event. So is every tool call the model makes, every moderation flag, and every policy decision (accept, reject, redact).
AU-3 (content of audit records). The record must be sufficient to reconstruct the event. For CUI and above, storing a hash of the prompt with the cleartext in a separately classified store is defensible. For Moderate baselines with non-sensitive prompts, store the cleartext prompt and completion, compressed.
AU-6 (audit review, analysis, reporting). Build a dashboard that surfaces prompt injection scores, refusal rates, jailbreak flags, token spend per tenant, and latency outliers. Assessors love this. So does your ops team.
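A sketch of the rollup that feeds such a dashboard, assuming events are dicts in roughly the shape of the schema above (the `refused` field and the percentile choice are illustrative):

```python
def audit_rollup(events: list[dict]) -> dict:
    """Summarize the fields AU-6 reviewers care about from a batch of events."""
    n = len(events)
    refusals = sum(1 for e in events if e.get("refused"))
    scores = sorted(e.get("prompt_injection_score", 0.0) for e in events)
    # Crude p95: index into the sorted scores; fine for a dashboard sketch.
    p95 = scores[min(n - 1, int(0.95 * n))] if n else 0.0
    return {
        "invocations": n,
        "refusal_rate": refusals / n if n else 0.0,
        "injection_score_p95": p95,
    }
```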
AU-11 (record retention). Organization-defined. A reasonable default: 90 days hot (queryable), 1 year warm (S3 Glacier Instant Retrieval or equivalent), 7 years cold for CUI. Link the retention period to the records schedule that covers the underlying business process, not just the audit logs as their own class.
AU-12 (audit generation). Generate audit records at every trust boundary: the gateway, the model proxy, the tool runtime, the retrieval layer. A dedicated record stream per tier, aggregated centrally, beats a single log that tries to cover every layer.
IA — Identification and Authentication
IA-2 (identification and authentication — organizational users). PIV/CAC at the front door. Do not let the LLM's model identity substitute for a user identity on downstream systems; always pass the human's context through.
IA-5 (authenticator management). API keys to foundation models are authenticators. Rotate them, store them in a KMS-backed secret manager (AWS Secrets Manager, Azure Key Vault), and alert on usage from unexpected source identities.
IA-9 (service identification and authentication). Services calling the LLM gateway authenticate with SPIFFE IDs or managed identities, not shared secrets. This matters when an agent calls multiple tools which each need to know which agent is calling.
SC — System and Communications Protection
SC-7 (boundary protection). Put an LLM gateway inside the authorization boundary. The gateway is the chokepoint for logging, policy, tool allowlisting, and rate limiting. Downstream foundation model calls leave the boundary through an explicit egress path (PrivateLink, Private Endpoint) that is documented in the boundary diagram.
SC-8 (transmission confidentiality and integrity). TLS 1.2+ to the model endpoint. Mutual TLS inside the boundary. Document cipher suites against the FIPS 140-3 validated module list your agency requires.
SC-12 and SC-13 (cryptographic key establishment and cryptographic protection). If you fine-tune and store custom weights, they are at-rest sensitive. Use customer-managed keys (AWS KMS CMKs, Azure Key Vault keys) with rotation. SC-13 points to FIPS-validated cryptography; GovCloud and Azure Gov regions provide this by default.
SC-28 (protection of information at rest). Prompt logs, completion logs, vector embeddings, and fine-tuned weights are all at-rest sensitive. Encrypt all of them with CMKs. If embeddings can be inverted to reconstruct training text (and many can), treat them at the classification of the underlying data.
SC-39 (process isolation). Tool runtimes that execute code on behalf of the model should run in isolated sandboxes (Firecracker micro-VMs, gVisor, or a dedicated container per invocation). Shared Python processes are not acceptable.
SI — System and Information Integrity
This is the family that carries the most novel LLM content. Three controls do the heavy lifting.
SI-3 (malicious code protection). Prompt injection is the LLM-specific case. Map it here. Implementation: input classifiers that score untrusted content (retrieved documents, emails, uploaded files) before it enters the prompt; strict delimiters between system instructions and user content; output parsers that refuse to execute instructions that appeared inside tool results; denylists of known jailbreak patterns with quick rotation when new ones emerge.
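A toy sketch of two of those pieces — denylist scoring and delimiter wrapping. The patterns below are illustrative stand-ins; a real deployment uses a trained classifier plus a rotating pattern set, and this crude score is only a floor under it:

```python
import re

# Illustrative denylist; real systems rotate these as new jailbreaks emerge.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"you are now", re.IGNORECASE),
]

def score_untrusted(text: str) -> float:
    """Crude score: fraction of denylist patterns that match the content."""
    hits = sum(1 for p in INJECTION_PATTERNS if p.search(text))
    return hits / len(INJECTION_PATTERNS)

def wrap_untrusted(text: str) -> str:
    """Delimit untrusted content so it cannot blend into instructions."""
    return "<untrusted>\n" + text + "\n</untrusted>"
```

The wrapping only helps if the system prompt also tells the model that nothing inside the delimiters is an instruction; the delimiter and the instruction travel together.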
SI-4 (system monitoring). Log and alert on anomalous prompt patterns: unusually high injection scores, sudden shifts in token distribution, model outputs that trigger moderation, tool call patterns that do not match normal user behavior. Pipe these into your SIEM (Sentinel, Splunk, Elastic) with correlation rules.
SI-7 (software, firmware, and information integrity). Model weights are software. Require SBOM coverage (Syft generates SPDX/CycloneDX from container images and from model directories with custom formats). Sign artifacts with Cosign. Verify signatures at load time. Record model_version_hash in every audit event (see schema above). When a foundation model provider bumps a version silently, you need to detect it, not hope.
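For self-hosted weights, the hash-and-verify step can be as simple as the sketch below — a deterministic digest over the model directory, checked against the pinned value before load. The helpers are illustrative; Cosign signature verification would sit alongside this, not be replaced by it:

```python
import hashlib
from pathlib import Path

def hash_model_dir(path: Path) -> str:
    """Deterministic hash over every file in a model directory."""
    h = hashlib.sha256()
    for f in sorted(path.rglob("*")):
        if f.is_file():
            h.update(f.name.encode("utf-8"))
            h.update(f.read_bytes())
    return "sha256:" + h.hexdigest()

def verify_model(path: Path, pinned: str) -> bool:
    """Fail closed: refuse to load if the hash drifts from the pin."""
    return hash_model_dir(path) == pinned
```

The same pinned value is what goes into the model_version_hash field of every audit event, so a silent upstream change shows up as a verification failure, not a surprise in production behavior.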
SI-10 (information input validation). Not just JSON schema validation on the API layer. For LLMs this includes structured output validation, refusal on ambiguous intent, and confidence gating on completions before they pass downstream.
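A sketch of a completion gate combining the structural check and the confidence check, assuming the model is prompted to return JSON with a confidence field (the field name and threshold are illustrative):

```python
import json

def validate_completion(raw: str, required_keys: list[str],
                        min_confidence: float):
    """Parse a completion, check required keys, gate on confidence.
    Returns (ok, parsed_object_or_reason)."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    missing = [k for k in required_keys if k not in obj]
    if missing:
        return False, f"missing keys: {missing}"
    if obj.get("confidence", 0.0) < min_confidence:
        return False, "below confidence gate"
    return True, obj
```

Anything that fails the gate never reaches the downstream system; it gets logged (see AU-2) and either retried or routed to a human.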
SI-11 (error handling). Do not leak prompt content, system prompts, or model internals in error messages to unauthenticated callers. A surprising number of systems fail this because the foundation model provider returns verbose errors that get passed through.
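The fix is a hard translation layer at the boundary: never pass upstream error text through, return a generic message keyed by a reference ID the operator can correlate against the internal log. A minimal sketch (the message format and the elided logging call are illustrative):

```python
GENERIC_ERROR = "The request could not be completed. Reference ID: {rid}"

def sanitize_error(provider_error: str, request_id: str) -> str:
    """Log the verbose provider error internally; return only a generic
    message to the caller. (Internal logging elided in this sketch.)"""
    _ = provider_error  # goes to the internal audit log, never to the caller
    return GENERIC_ERROR.format(rid=request_id)
```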
RA — Risk Assessment
RA-5 (vulnerability monitoring and scanning). Extend your scanning program to cover model artifacts. Scan container images that bundle weights. Scan model files for malicious serialized payloads (picklescan, ClamAV signatures for ML payloads). Schedule adversarial testing (red teaming) on at least a quarterly cadence for production LLM systems.
RA-3 (risk assessment). The risk assessment needs to call out LLM-specific risks explicitly: hallucination in high-impact workflows, prompt injection through untrusted sources, data leakage through prompt memory or caching, model supply chain compromise, adversarial examples in vision-language models. If your risk register does not name these, assessors will ask why.
RA-9 (criticality analysis). Rank LLM-backed workflows by consequence of a wrong answer. A summarization tool for an analyst is lower risk than an agent that modifies a case record. Your controls should be proportional.
CM — Configuration Management
CM-2 (baseline configuration). Your LLM baseline includes: foundation model ID and version, system prompt text (or prompt template ID), tool definitions, RAG index version, retrieval parameters, decoding parameters. All of these go into source control with a CHANGELOG.
CM-3 (configuration change control). Treat prompt changes like code changes. Peer review, test against a regression set of prompts, tag with a version, deploy behind a feature flag. A five-line change in a system prompt can meaningfully change behavior across the whole population of users.
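What "no inline prompts" looks like in practice is a registry lookup keyed by (prompt ID, version). A sketch, with a dict standing in for what would really be a source-controlled store — the IDs, versions, and prompt text are illustrative:

```python
# Sketch of a source-controlled prompt registry (CM-2 / CM-3). In a real
# system this lives in git behind peer review, not an in-memory dict.
PROMPT_REGISTRY = {
    ("sp_summarize", "v12"): "Summarize the following case file.",
    ("sp_summarize", "v13"): "Summarize the following case file. Cite sources.",
}

def resolve_prompt(prompt_id: str, version: str) -> str:
    """Production invocations reference (id, version); unregistered
    prompts are a hard failure, not a silent fallback."""
    try:
        return PROMPT_REGISTRY[(prompt_id, version)]
    except KeyError:
        raise LookupError(f"unregistered prompt {prompt_id}@{version}")
```

Failing hard on an unregistered prompt is the point: it forces every prompt change through the registry and therefore through change control.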
CM-7 (least functionality). Disable model capabilities you do not need. If you do not need web browsing, do not enable it. If you do not need code execution, do not enable the interpreter tool. If a function calling capability is enabled, every tool in the registry should be individually audited and allowlisted per role.
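The per-role allowlist reduces to a deny-by-default filter at the gateway before any tool definition reaches the model. A sketch with illustrative role and tool names:

```python
# CM-7 least-functionality sketch: tools are individually allowlisted
# per role; anything not listed is disabled by default.
TOOL_ALLOWLIST = {
    "analyst": {"rag_search_v3"},
    "case-agent": {"rag_search_v3", "case_lookup_v2"},
}

def filter_tools(role: str, requested: list[str]) -> list[str]:
    """Return only the tools this role may invoke; deny by default."""
    allowed = TOOL_ALLOWLIST.get(role, set())
    return [t for t in requested if t in allowed]
```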
CM-8 (system component inventory). Your inventory names the model (provider, ID, version), the deployment region, the fine-tuning dataset (if any), the system prompt version, and the tool registry. Keep it current in your CMDB.
Controls that should exist but don't — yet
NIST is moving toward AI-specific guidance through the AI RMF and SP 800-218A, but 800-53 Rev 5 does not yet have a dedicated AI family. Until a future revision adds one, these gaps are typically filled by overlays:
- Prompt injection monitoring as its own control. Currently mapped to SI-3 / SI-4 by analogy.
- Hallucination confidence gating. Mapped to SI-10 but underspecified.
- Human oversight requirements for high-impact outputs. Partially covered by AC-6 and policy; addressed more directly by EO 14110 (now overlaid by OMB M-24-10 and its successors) and agency AI use case inventories.
- Model card and data sheet maintenance. Implied by CM-2 but rarely called out explicitly.
Sample SSP language you can adapt
These are short, reusable passages we drop into System Security Plans for LLM components. They are not finished SSP content — they are skeletons that need your system-specific details.
AU-2 (Event Logging) — LLM implementation
The LLM gateway generates an audit event for every model
invocation. Events include user identity, model ID and
version hash, prompt hash, completion hash, tool calls with
argument and result hashes, token counts, moderation flags,
classification labels, and latency. Events are written to
a centralized logging service within 5 seconds of completion
and retained per AU-11.
SI-3 (Malicious Code Protection) — LLM implementation
Prompt injection is treated as a malicious input vector.
Untrusted content (retrieved documents, email bodies, user
uploads) is scored by a prompt injection classifier before
inclusion in the prompt. Content scoring above the configured
threshold is rejected or quarantined. System prompt
delimiters prevent instruction blending. Outputs are parsed
through a strict schema and tool calls from injected content
are refused.
SI-7 (Integrity) — LLM implementation
Foundation model artifacts are treated as software. Container
images bundling open-weight models are scanned, signed with
Cosign, and verified at load. Model version hashes are
recorded in every audit event. Upstream model version changes
trigger an alert, a regression test run, and a configuration
change review per CM-3.
Putting it together: a practical implementation pattern
The shortest path from zero to an assessor-ready LLM system is to establish four pieces in this order:
- LLM gateway. A thin service in front of every model call. Enforces identity, logs the audit schema above, applies policies (prompt injection scoring, PII checks, classification checks), and rate limits. This is where most of your controls land.
- Prompt and tool registry. Source-controlled, versioned, with a CHANGELOG per prompt. Every production invocation references a specific prompt ID and version. No inline prompts in application code.
- Evidence pipeline. Nightly jobs that produce SSP evidence from the running system: audit log samples, boundary diagrams from Terraform state, SBOM diffs, model inventory, access control summaries. Stored in the same place your 3PAO will pull from.
- Regression and red team loop. A held-out evaluation set run on every prompt change and every model version bump. A quarterly adversarial testing engagement against the production gateway. Findings tracked in POA&M.
With those four in place, most of 800-53's LLM-relevant controls come for free. Without them, every control is a paper exercise.
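The gateway's request path ties the pieces together: run every policy check, reject and log on the first failure, otherwise invoke the model and log the acceptance. A compact sketch — the check list, the model stub, and the sink are all illustrative placeholders for real policy engines, SDK calls, and log pipelines:

```python
def handle_request(user: str, prompt: str, checks, model_call, audit_sink):
    """Gateway pipeline sketch: policy checks, then invoke, then audit.
    `checks` is a list of (name, predicate) pairs; `model_call` is a stub."""
    for name, check in checks:
        if not check(prompt):
            audit_sink.append({"user": user, "decision": f"rejected:{name}"})
            return None  # fail closed; nothing reaches the model
    completion = model_call(prompt)
    audit_sink.append({"user": user, "decision": "accepted"})
    return completion
```

Every control discussed above lands somewhere in this loop: AC in who can reach it, SI-3/SI-10 in the checks, SI-7 in what `model_call` is allowed to load, and AU in what the sink records.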
Where this fits in our practice
We build federal LLM systems with these controls mapped from sprint one, not bolted on at assessment time. If you are standing up an agentic system, a RAG platform, or a foundation model gateway inside an authorization boundary, the pattern above is what we ship. See our agentic AI, machine learning, and DevSecOps capabilities for more detail.