NIST 800-53 controls for LLM systems.

April 16, 2026 · 14 min read · A control-by-control mapping of NIST SP 800-53 Rev 5 to large language model systems, with SSP language and audit log schemas you can use.

Why LLMs need their own control mapping

NIST SP 800-53 Rev 5 was written for general-purpose information systems. When you drop a large language model into that framework, the control text still applies, but the implementation shifts in ways that most assessors have not yet internalized. LLM systems fail in unique ways: hallucinated outputs that look authoritative, prompt injection through untrusted context, data leakage through prompt echoes, adversarial completions that bypass downstream validation, and model drift after a silent upstream version bump.

None of those failure modes are called out by name in 800-53. All of them can be mapped to existing controls if you know where to look. This post walks the families that matter most for federal LLM deployments, gives you concrete language for your System Security Plan, and shows what the audit log actually needs to contain.

The advice below assumes a FedRAMP Moderate or High baseline, a typical agency deployment pattern (LLM behind an API gateway inside a VPC, calling a hosted foundation model or a self-hosted open-weight model), and that your system is in scope for an Authority to Operate.

Scope of this mapping. Rev 5 High baseline assumed. References to OSCAL mean the NIST Open Security Controls Assessment Language schema. This is not a substitute for your 3PAO's review; it is what we bring to assessor meetings so the conversation starts at the right altitude.

Control families most affected by LLMs

Seven families carry almost all of the LLM-specific weight. Other families (PL, PS, AT, CP, IR, MA, MP, PE, SA) still apply but typically inherit from the underlying cloud platform or enterprise program without LLM-specific twists.

  • AC — Access Control. Who can call the model, with what scopes, against which data.
  • AU — Audit and Accountability. What gets logged for every prompt and completion.
  • IA — Identification and Authentication. How agents, users, and service principals prove identity to the model endpoint.
  • SC — System and Communications Protection. Boundary, transit, storage, and cryptographic protection for model traffic and artifacts.
  • SI — System and Information Integrity. Prompt injection defense, output validation, model integrity, supply chain.
  • RA — Risk Assessment. Vulnerability scanning that now includes model weights, prompts, and adversarial testing.
  • CM — Configuration Management. Pinning model versions, prompt templates, tool definitions, and RAG indexes.

AC — Access Control for model endpoints

The most common early mistake is treating the model endpoint as a generic API and forgetting that prompts often carry sensitive context. AC-3 (access enforcement) needs to consider not just who can call the endpoint but what data they can include in the prompt and what completions they are allowed to receive.

AC-3 (access enforcement). Enforce scoped service roles per calling application. A chatbot role should not have access to the batch summarization endpoint. In Bedrock, use resource-based policies on model IDs. In Azure OpenAI, use Azure RBAC at the resource and deployment level, plus managed identity on the caller.

AC-4 (information flow enforcement). Classification labels on prompts must propagate. If a user's context includes CUI, the completion inherits CUI classification until validated otherwise. A lot of systems get this wrong by logging completions at a lower classification than the prompt that generated them.

AC-6 (least privilege). The service account that the model calls for tool use (retrieval, code execution, database queries) should have the narrowest scope that works. A retrieval tool that has read access to the whole vector store when the user only needs a single collection is a finding waiting to happen.

AC-17 (remote access). If developers hit the model endpoint from outside the boundary, require a TIC-compliant path, a federated identity, and MFA. No shared API keys.

AU — Audit and Accountability

This is where most LLM systems are thinnest. AU-2 and AU-3 together demand that you log enough to reconstruct the event and attribute it. For an LLM, that event is a prompt and a completion, plus every tool call the model made in between.

Here is a log schema we use as a baseline. Treat it as a floor, not a ceiling.

{
  "event_id": "evt_01HX8K4YQZ3F4K8R2N1V0W7ZAB",
  "timestamp": "2026-04-16T14:02:19.482Z",
  "system_id": "llm-chat-prod",
  "environment": "prod",
  "classification": "CUI",
  "user": {
    "id": "user:[email protected]",
    "session_id": "sess_01HX8K4Y...",
    "source_ip": "10.42.17.88",
    "auth_method": "PIV-CAC"
  },
  "request": {
    "model_id": "anthropic.claude-sonnet-4",
    "model_version_hash": "sha256:7ab1...c4",
    "system_prompt_id": "sp_v12",
    "tools_enabled": ["rag_search_v3", "case_lookup_v2"],
    "prompt_hash": "sha256:91f8...3d",
    "prompt_tokens": 842,
    "temperature": 0.2,
    "max_tokens": 1024
  },
  "tool_calls": [
    {"name":"rag_search_v3","args_hash":"sha256:2a...","result_hash":"sha256:5f...","latency_ms":412}
  ],
  "response": {
    "completion_hash": "sha256:0e6b...8a",
    "completion_tokens": 318,
    "moderation_flags": [],
    "confidence_gate": "passed",
    "latency_ms": 1873
  },
  "policy": {
    "prompt_injection_score": 0.03,
    "pii_detected": false,
    "classification_check": "ok"
  }
}

AU-2 (event logging). Every model invocation is an auditable event. So is every tool call the model makes, every moderation flag, and every policy decision (accept, reject, redact).

AU-3 (content of audit records). The record must be sufficient to reconstruct the event. For CUI and above, storing a hash of the prompt with the cleartext in a separately classified store is defensible. For Moderate baselines with non-sensitive prompts, store the cleartext prompt and completion, compressed.
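The hash-versus-cleartext decision can live in one small function. A sketch of how the prompt_hash field in the schema above might be produced, with the classification routing as an illustrative policy:

```python
import hashlib

def hash_field(text: str) -> str:
    """Produce the sha256:<hex> form used in the audit schema."""
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()

def audit_fields(prompt: str, classification: str) -> dict:
    """For CUI and above, the audit record carries only a hash; the
    cleartext goes to a separately classified store (not shown here).
    The classification labels are illustrative."""
    record = {"prompt_hash": hash_field(prompt), "classification": classification}
    if classification not in ("CUI", "SECRET", "TOP SECRET"):
        # Moderate baseline, non-sensitive prompt: store cleartext too.
        record["prompt_text"] = prompt
    return record
```

Either way, the hash gives you a stable join key between the audit record and whichever store holds the cleartext.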

AU-6 (audit review, analysis, reporting). Build a dashboard that surfaces prompt injection scores, refusal rates, jailbreak flags, token spend per tenant, and latency outliers. Assessors love this. So does your ops team.

AU-11 (record retention). Organization-defined. A reasonable default: 90 days hot (queryable), 1 year warm (S3 Glacier Instant Retrieval or equivalent), 7 years cold for CUI. Link the retention period to the records schedule that covers the underlying business process, not just the audit logs as their own class.
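The hot/warm/cold windows above are easy to encode as a pure function, which is also useful as evidence that the lifecycle policy matches the SSP. The tier names and durations below mirror the defaults in the text; they are a sketch, not a mandated schedule:

```python
from datetime import datetime, timedelta, timezone

# Illustrative defaults from the text: 90 days hot, 1 year warm, 7 years cold.
HOT = timedelta(days=90)
WARM = timedelta(days=365)
COLD = timedelta(days=7 * 365)

def retention_tier(event_time: datetime, now: datetime) -> str:
    """Map an audit record's age to its storage tier."""
    age = now - event_time
    if age <= HOT:
        return "hot"
    if age <= WARM:
        return "warm"
    if age <= COLD:
        return "cold"
    return "expired"
```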

AU-12 (audit generation). Generate audit records at every trust boundary: the gateway, the model proxy, the tool runtime, the retrieval layer. A dedicated audit stream per tier, forwarded to central storage, beats one monolithic log that tries to cover every layer.

IA — Identification and Authentication

IA-2 (identification and authentication — organizational users). PIV/CAC at the front door. Do not let the model's service identity substitute for a user identity on downstream systems; always propagate the human user's context.

IA-5 (authenticator management). API keys to foundation models are authenticators. Rotate them, store them in a KMS-backed secret manager (AWS Secrets Manager, Azure Key Vault), and alert on usage from unexpected source identities.

IA-9 (service identification and authentication). Services calling the LLM gateway authenticate with SPIFFE IDs or managed identities, not shared secrets. This matters when an agent calls multiple tools, each of which needs to know which agent is calling.
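If you go the SPIFFE route, the gateway's check reduces to validating the ID's scheme and trust domain before doing any authorization. A sketch, with a hypothetical trust domain:

```python
from urllib.parse import urlparse

TRUSTED_DOMAIN = "agency.gov"  # illustrative trust domain, not a real deployment value

def valid_spiffe_id(spiffe_id: str) -> bool:
    """Accept only SPIFFE IDs in our trust domain, e.g.
    spiffe://agency.gov/ns/prod/sa/llm-gateway. Workload attestation
    (proving the caller actually owns this ID) happens in the SPIFFE
    infrastructure, not here."""
    parsed = urlparse(spiffe_id)
    return (
        parsed.scheme == "spiffe"
        and parsed.netloc == TRUSTED_DOMAIN
        and parsed.path != ""
    )
```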

SC — System and Communications Protection

SC-7 (boundary protection). Put an LLM gateway inside the authorization boundary. The gateway is the chokepoint for logging, policy, tool allowlisting, and rate limiting. Downstream foundation model calls leave the boundary through an explicit egress path (PrivateLink, Private Endpoint) that is documented in the boundary diagram.

SC-8 (transmission confidentiality and integrity). TLS 1.2+ to the model endpoint. Mutual TLS inside the boundary. Document cipher suites against the FIPS 140-3 validated module list your agency requires.
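Pinning the TLS floor is a one-liner in most stacks. A minimal client-side sketch in Python's standard library (cipher suite selection against your agency's FIPS module list still has to happen at the platform level):

```python
import ssl

def model_endpoint_context() -> ssl.SSLContext:
    """Client-side TLS context for calls to the model endpoint:
    certificate verification on, protocol floor pinned at TLS 1.2."""
    ctx = ssl.create_default_context()  # enables cert and hostname checks
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx
```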

SC-12 and SC-13 (cryptographic key establishment and cryptographic protection). If you fine-tune and store custom weights, they are sensitive at rest. Use customer-managed keys (AWS KMS CMKs, Azure Key Vault keys) with rotation. SC-13 points to FIPS-validated cryptography; AWS GovCloud and Azure Government regions provide this by default.

SC-28 (protection of information at rest). Prompt logs, completion logs, vector embeddings, and fine-tuned weights are all at-rest sensitive. Encrypt all of them with CMKs. If embeddings can be inverted to reconstruct training text (and many can), treat them at the classification of the underlying data.

SC-39 (process isolation). Tool runtimes that execute code on behalf of the model should run in isolated sandboxes (Firecracker micro-VMs, gVisor, or a dedicated container per invocation). Shared Python processes are not acceptable.

SI — System and Information Integrity

This is the family that carries the most novel LLM content. Three controls do the heavy lifting.

SI-3 (malicious code protection). Prompt injection is the LLM-specific case. Map it here. Implementation: input classifiers that score untrusted content (retrieved documents, emails, uploaded files) before it enters the prompt; strict delimiters between system instructions and user content; output parsers that refuse to execute instructions that appeared inside tool results; denylists of known jailbreak patterns with quick rotation when new ones emerge.
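Two of those measures, denylist scoring and strict delimiters, are simple enough to sketch. This is a toy: production injection classifiers are models, not regexes, and the patterns and tag names below are illustrative only.

```python
import re

# Toy denylist. Real systems rotate these quickly and back them with a
# learned classifier; these two patterns are illustrative.
JAILBREAK_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.I),
    re.compile(r"disregard your system prompt", re.I),
]

def injection_score(untrusted: str) -> float:
    """Fraction of denylist patterns that fire on the untrusted content."""
    hits = sum(1 for p in JAILBREAK_PATTERNS if p.search(untrusted))
    return min(1.0, hits / len(JAILBREAK_PATTERNS))

def wrap_untrusted(doc: str) -> str:
    """Fence retrieved content with delimiters so that instructions
    inside it stay data. Tag name is illustrative."""
    return "<untrusted_document>\n" + doc + "\n</untrusted_document>"
```

The scoring threshold, and what happens above it (reject, quarantine, strip), is the organization-defined part.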

SI-4 (system monitoring). Log and alert on anomalous prompt patterns: unusually high injection scores, sudden shifts in token distribution, model outputs that trigger moderation, tool call patterns that do not match normal user behavior. Pipe these into your SIEM (Sentinel, Splunk, Elastic) with correlation rules.

SI-7 (software, firmware, and information integrity). Model weights are software. Require SBOM coverage (Syft generates SPDX/CycloneDX from container images and from model directories with custom formats). Sign artifacts with Cosign. Verify signatures at load time. Record model_version_hash in every audit event (see schema above). When a foundation model provider bumps a version silently, you need to detect it, not hope.
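Verification at load time can be as simple as comparing file digests against a signed manifest and refusing to start on any mismatch. A sketch (signature verification of the manifest itself, e.g. via Cosign, is assumed to have happened upstream):

```python
import hashlib
from pathlib import Path

def file_sha256(path: Path) -> str:
    """Stream the file through sha256 in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return "sha256:" + h.hexdigest()

def verify_weights(model_dir: Path, manifest: dict[str, str]) -> bool:
    """Compare every file named in the (already signature-verified)
    manifest against its recorded digest; refuse to load on any
    mismatch or missing file."""
    for name, expected in manifest.items():
        p = model_dir / name
        if not p.is_file() or file_sha256(p) != expected:
            return False
    return True
```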

SI-10 (information input validation). Not just JSON schema validation on the API layer. For LLMs this includes structured output validation, refusal on ambiguous intent, and confidence gating on completions before they pass downstream.

SI-11 (error handling). Do not leak prompt content, system prompts, or model internals in error messages to unauthenticated callers. A surprising number of systems fail this because the foundation model provider returns verbose errors that get passed through.
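The usual fix is an error firewall at the gateway: log the verbose provider error internally, return an opaque message plus a correlation ID to the caller. A sketch, with the in-memory log standing in for your real logging service:

```python
import uuid

def sanitize_error(provider_error: str, internal_log: list) -> dict:
    """Keep the verbose upstream error inside the boundary; hand the
    caller an opaque message and a reference ID for support."""
    ref = str(uuid.uuid4())
    internal_log.append({"ref": ref, "detail": provider_error})  # stays internal
    return {"error": "upstream model error", "ref": ref}  # what the caller sees
```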

RA — Risk Assessment

RA-5 (vulnerability monitoring and scanning). Extend your scanning program to cover model artifacts. Scan container images that bundle weights. Scan model files for known malicious pickles (picklescan, ClamAV signatures for ML payloads). Schedule adversarial testing (red teaming) on at least a quarterly cadence for production LLM systems.

RA-3 (risk assessment). The risk assessment needs to call out LLM-specific risks explicitly: hallucination in high-impact workflows, prompt injection through untrusted sources, data leakage through prompt memory or caching, model supply chain compromise, adversarial examples in vision-language models. If your risk register does not name these, assessors will ask why.

RA-9 (criticality analysis). Rank LLM-backed workflows by consequence of a wrong answer. A summarization tool for an analyst is lower risk than an agent that modifies a case record. Your controls should be proportional.

CM — Configuration Management

CM-2 (baseline configuration). Your LLM baseline includes the foundation model ID and version, the system prompt text (or prompt template ID), tool definitions, the RAG index version, retrieval parameters, and decoding parameters. All of these go into source control with a CHANGELOG.
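A useful trick is to reduce the whole baseline to one fingerprint, so drift in any item (a silent model bump, a prompt tweak, a changed temperature) is a single comparison. A sketch, with an illustrative baseline:

```python
import hashlib
import json

def baseline_fingerprint(baseline: dict) -> str:
    """Canonical sha256 over the CM-2 baseline items. Any change to the
    model version, prompt template, tools, or decoding parameters
    changes the fingerprint."""
    canonical = json.dumps(baseline, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Compare the deployed fingerprint against the one recorded at the last CM-3 change approval; a mismatch is your drift alarm.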

CM-3 (configuration change control). Treat prompt changes like code changes. Peer review, test against a regression set of prompts, tag with a version, deploy behind a feature flag. A five-line change in a system prompt can meaningfully change behavior across the whole population of users.

CM-7 (least functionality). Disable model capabilities you do not need. If you do not need web browsing, do not enable it. If you do not need code execution, do not enable the interpreter tool. If a function calling capability is enabled, every tool in the registry should be individually audited and allowlisted per role.

CM-8 (system component inventory). Your inventory names the model (provider, ID, version), the deployment region, the fine-tuning dataset (if any), the system prompt version, and the tool registry. Keep it current in your CMDB.

Controls that should exist but don't — yet

NIST is moving toward AI-specific guidance through the AI RMF and SP 800-218A, but 800-53 Rev 5 does not yet have a dedicated AI family. Until a future revision adds one, these gaps are typically filled by overlays:

  • Prompt injection monitoring as its own control. Currently mapped to SI-3 / SI-4 by analogy.
  • Hallucination confidence gating. Mapped to SI-10 but underspecified.
  • Human oversight requirements for high-impact outputs. Partially covered by AC-6 and organizational policy; the binding requirements in practice come from OMB M-24-10 (and its successors) and agency AI use case inventories.
  • Model card and data sheet maintenance. Implied by CM-2 but rarely called out explicitly.

Sample SSP language you can adapt

These are short, reusable passages we drop into System Security Plans for LLM components. They are not finished SSP content — they are skeletons that need your system-specific details.

AU-2 (Event Logging) — LLM implementation
The LLM gateway generates an audit event for every model
invocation. Events include user identity, model ID and
version hash, prompt hash, completion hash, tool calls with
argument and result hashes, token counts, moderation flags,
classification labels, and latency. Events are written to
a centralized logging service within 5 seconds of completion
and retained per AU-11.

SI-3 (Malicious Code Protection) — LLM implementation
Prompt injection is treated as a malicious input vector.
Untrusted content (retrieved documents, email bodies, user
uploads) is scored by a prompt injection classifier before
inclusion in the prompt. Content scoring above the configured
threshold is rejected or quarantined. System prompt
delimiters prevent instruction blending. Outputs are parsed
through a strict schema and tool calls from injected content
are refused.

SI-7 (Integrity) — LLM implementation
Foundation model artifacts are treated as software. Container
images bundling open-weight models are scanned, signed with
Cosign, and verified at load. Model version hashes are
recorded in every audit event. Upstream model version changes
trigger an alert, a regression test run, and a configuration
change review per CM-3.

Putting it together: a practical implementation pattern

The shortest path from zero to an assessor-ready LLM system is to establish four pieces in this order:

  • LLM gateway. A thin service in front of every model call. Enforces identity, logs the audit schema above, applies policies (prompt injection scoring, PII checks, classification checks), and rate limits. This is where most of your controls land.
  • Prompt and tool registry. Source-controlled, versioned, with a CHANGELOG per prompt. Every production invocation references a specific prompt ID and version. No inline prompts in application code.
  • Evidence pipeline. Nightly jobs that produce SSP evidence from the running system: audit log samples, boundary diagrams from Terraform state, SBOM diffs, model inventory, access control summaries. Stored in the same place your 3PAO will pull from.
  • Regression and red team loop. A held-out evaluation set run on every prompt change and every model version bump. A quarterly adversarial testing engagement against the production gateway. Findings tracked in POA&M.

With those four in place, most of 800-53's LLM-relevant controls come for free. Without them, every control is a paper exercise.
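The gateway's request path ties the pieces together in a fixed order: policy checks before the model sees anything, then the model call through the single chokepoint, then an audit event no matter what. A minimal sketch with injected callables standing in for the real policy engine, model client, and log sink:

```python
def gateway_handle(request: dict, policies: list, invoke, audit) -> dict:
    """Order of operations at the gateway (sketch, illustrative names).
    policies: callables returning "ok" or a rejection reason.
    invoke:   callable that performs the model call.
    audit:    callable that records one audit event."""
    # 1. Policy checks (injection scoring, PII, classification) run first.
    for policy in policies:
        verdict = policy(request)
        if verdict != "ok":
            audit({"event": "rejected", "policy": verdict})
            return {"status": "rejected", "reason": verdict}
    # 2. Model invocation through the single chokepoint.
    completion = invoke(request)
    # 3. Audit event for every invocation, per AU-2.
    audit({"event": "completion", "prompt": request.get("prompt")})
    return {"status": "ok", "completion": completion}
```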

FAQ

Does NIST 800-53 Rev 5 have dedicated AI or LLM controls?
Not as a standalone family yet. Rev 5 covers LLM risk through existing families (AC, AU, SC, SI, RA, CM). NIST AI RMF 1.0 and NIST SP 800-218A (SSDF for AI) provide AI-specific guidance that overlays 800-53. Assessors increasingly expect LLM systems to map to both.
What must I log for an LLM system under AU-2 and AU-3?
At minimum: timestamp, user identity, session ID, prompt content (or a secure hash if classification requires), completion content, model name and version, tool calls with arguments and results, token counts, moderation flags, and latency. AU-3 requires enough content to reconstruct the event; for LLMs that means the full input/output pair.
How long should prompt logs be retained under AU-11?
AU-11 is organization-defined. Most federal programs retain audit records for at least one year online and three to seven years total. For LLM systems, retain at least 90 days of full prompt and completion content online, then archive for the remainder of the ATO lifecycle.
Is prompt injection treated as malicious code under SI-3?
Increasingly, yes. Assessors accept mapping prompt injection defenses (input classifiers, delimiters, allowlists, jailbreak detection) to SI-3 and SI-4. It is a defensible interpretation because the injected content attempts to subvert the control flow of a computing component.
Do model weights need integrity protection?
Yes. Model weights are software components. SI-7 requires integrity verification. In practice that means signed weights (Cosign or sigstore), hashed manifests in the artifact registry, and verification at load time. SBOM coverage via SPDX or CycloneDX should include model components.

Where this fits in our practice

We build federal LLM systems with these controls mapped from sprint one, not bolted on at assessment time. If you are standing up an agentic system, a RAG platform, or a foundation model gateway inside an authorization boundary, the pattern above is what we ship. See our agentic AI, machine learning, and DevSecOps capabilities for more detail.
