
GAO-26-107859: the federal AI acquisition gap — what the April 2026 report means for SBIR

On April 13, 2026, GAO issued a report documenting recurring gaps in how federal agencies procure artificial-intelligence capabilities. The findings are not a surprise to anyone who has lived inside a federal AI acquisition. They are, however, a useful forcing function — and a template for how SBIR offerors should position Phase I deliverables for the next twelve months.

The report, in one sentence

GAO-26-107859 reviewed a cross-section of federal agency AI procurements initiated over the prior two fiscal years and documented the same four gaps appearing in case after case: agencies were not defining what acceptable AI performance looks like, not verifying what vendors represented about their models, not instrumenting deployed systems for the behaviors that actually degrade, and not capturing what they learned so that the next acquisition could benefit from the last one. Each gap is familiar in isolation. The report's contribution is showing how consistently the four cluster together and how predictably that cluster produces deployed systems that drift, surprise their sponsors, and require rework within the first year of use.

WHAT THIS ARTICLE IS AND IS NOT

This is a practitioner reading of the report, not a summary. We translate the recurring findings into the language federal program managers, contracting officers, and SBIR offerors use — acceptance criteria, authorization boundaries, continuous monitoring, and feedback instrumentation. We do not reproduce the report's numbers or case identifiers; those belong in the source document. We focus on what the findings imply for a Phase I work plan written next week.

The four gaps cluster together because they share a common cause: agencies treat AI procurement as a one-time purchase when it is, technically and operationally, a continuous system.

Gap 1: the requirements-definition gap

The first and most consequential gap the report documents is that agencies are procuring AI capabilities without specifying what "acceptable" means. A conventional IT acquisition under FAR Part 39 asks for a system that meets a set of functional requirements, a security baseline, and a service-level agreement. An AI acquisition asks for a model that "summarizes documents," "classifies requests," or "supports analyst decision-making" — and then stops. There is no threshold for acceptable summary fidelity, no false-positive rate the classifier must stay under, no calibration requirement for the decision-support output, no definition of what the system must refuse to do.

The practical consequence is that the agency and the vendor meet at delivery with different, unreconciled definitions of success. The vendor ships a model that performs within the bounds of its published benchmarks. The program office tries to accept it against unstated assumptions about domain fidelity, edge-case handling, and refusal behavior. The acceptance conversation becomes a negotiation rather than a measurement. Rework starts before production.

NIST AI RMF 1.0 provides the vocabulary for closing this gap — the Measure function in particular. A requirements document that references NIST AI RMF Measure 2.3 (model performance characterization), Measure 2.6 (AI system operational performance under deployment conditions), and Measure 2.11 (fairness and bias metrics where relevant) gives both sides a concrete basis for acceptance. None of this is exotic; it is the federal-procurement equivalent of specifying a tolerance in a mechanical-parts contract.

What a Phase I offeror should do about Gap 1

A Phase I work plan should include, as an explicit deliverable, a written Acceptance Criteria Specification that enumerates the behaviors the Phase I prototype will be measured against and the thresholds at which each behavior is considered acceptable. This document should be written in the first four weeks of a six-month Phase I, reviewed with the TPOC before the technical work ramps, and treated as a commitment the offeror can be held to. It costs little to produce and materially de-risks the Phase II decision for the agency.
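
To make the shape of such a specification concrete, here is a minimal sketch in Python. The criteria names, metrics, and thresholds are invented for illustration only; a real specification would be negotiated with the TPOC and tied to the NIST AI RMF Measure subcategories discussed above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AcceptanceCriterion:
    """One measurable behavior and the threshold at which it is acceptable."""
    behavior: str           # the behavior being measured
    metric: str             # how it is measured
    threshold: float        # acceptance boundary
    higher_is_better: bool  # direction of the comparison

    def passes(self, observed: float) -> bool:
        """True when the observed value clears the threshold."""
        if self.higher_is_better:
            return observed >= self.threshold
        return observed <= self.threshold

# Illustrative criteria only -- not drawn from the report or any real SOW.
CRITERIA = [
    AcceptanceCriterion("summary_fidelity", "ROUGE-L vs. reference set", 0.55, True),
    AcceptanceCriterion("classifier_fp_rate", "false-positive rate on holdout", 0.05, False),
    AcceptanceCriterion("refusal_rate", "refusals on out-of-scope probe set", 0.95, True),
]

def evaluate(observed: dict[str, float]) -> dict[str, bool]:
    """Score a prototype's measured behavior against every criterion."""
    return {c.behavior: c.passes(observed[c.behavior]) for c in CRITERIA}
```

The point is not the code; it is that each behavior carries a named metric and a numeric threshold, so acceptance becomes a measurement rather than a negotiation.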

Gap 2: the vendor due-diligence gap

The second gap the report documents is that vendor representations about AI systems — model provenance, training-data jurisdiction, FedRAMP authorization boundary scope, sub-processor lists, data-retention posture — are being accepted without independent verification. The vendor says the model is U.S.-hosted and U.S.-trained, the SSP says the same, and the evidence chain stops there. For FedRAMP-authorized services, the authorization package is public and inspectable on the FedRAMP Marketplace; but when the AI capability is delivered as a layer on top of an authorized platform (an agency-specific fine-tuned model sitting on a FedRAMP High IaaS), the layer itself often has no independent authorization and the representations about it are not independently verifiable.

This matters because AI systems fail in ways that are deeply coupled to provenance. A model trained on data of uncertain jurisdiction may exfiltrate memorized content. A model running inference in a region outside the declared boundary is a boundary violation regardless of how the data got there. A fine-tune whose base weights are themselves of uncertain lineage inherits all the supply-chain risk of the base, silently. The report's examples (paraphrased, not reproduced) show that multiple agencies accepted vendor attestations on all three dimensions without a verification procedure.

NIST SP 800-37 RMF Step 1 (Categorize) and Step 2 (Select) are the normative hooks here. Categorization must account for the sensitivity of any data the model was exposed to during training, not only the sensitivity of data it will see at inference. Control selection must include supply-chain controls — SR-3 (Supply Chain Controls and Processes), SR-4 (Provenance), SR-5 (Acquisition Strategies, Tools, and Methods) — applied to the model itself and not only to the hosting infrastructure.

A vendor's SSP describes the authorized platform. It does not, by itself, describe what an agency-specific fine-tuned model running on that platform was trained on or how it behaves at the edges.

What a Phase I offeror should do about Gap 2

Phase I proposals should include a Provenance and Due-Diligence package as a deliverable: base-model identity and version, training-data jurisdictional attestation, fine-tuning dataset documentation, authorization-boundary map showing exactly which components of the prototype are covered by an existing ATO and which are not, and a sub-processor list if any third-party inference services are invoked. Many Phase I prototypes will not have all this information neatly packaged at kickoff; that is precisely why it should be an explicit deliverable to be assembled during the period of performance. The Phase II proposal inherits it intact.
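
The package is easiest to keep complete when it is treated as a structured artifact rather than prose. A minimal sketch in Python, assuming a simple component-to-ATO boundary map; the field names here are our own, not mandated by any standard.

```python
from dataclasses import dataclass, field

@dataclass
class ProvenancePackage:
    """Structured due-diligence deliverable for a Phase I prototype."""
    base_model: str                  # base-model identity and version
    training_data_jurisdiction: str  # attested jurisdiction of training data
    fine_tune_datasets: list[str]    # documentation references for fine-tuning data
    boundary_map: dict[str, bool]    # component -> covered by an existing ATO?
    sub_processors: list[str] = field(default_factory=list)  # third-party inference services

    def uncovered_components(self) -> list[str]:
        """Components that sit outside every existing authorization boundary."""
        return [c for c, covered in self.boundary_map.items() if not covered]
```

A review step that flags a non-empty `uncovered_components()` forces the boundary conversation to happen during the period of performance rather than at acceptance.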

Gap 3: the continuous-monitoring gap

The third gap the report documents is that agencies are deploying AI systems without the instrumentation required to detect the behaviors that actually cause operational failure after go-live. Traditional IT continuous monitoring, as structured under NIST SP 800-37 RMF Step 6 and NIST SP 800-137, focuses on security control effectiveness — unauthorized access, configuration drift, vulnerability exposure. AI systems fail along additional axes that the traditional instrumentation does not see: model performance drift as input distributions shift, hallucination rate on queries outside the training distribution, bias emergence as fine-tuning data is added, and authorization-boundary integrity when an agentic system gains new tool-calling capabilities.

The report documents cases where agencies signed an ATO, stood the system up, and then had no way to observe any of these AI-specific failure modes until a user surfaced a bad output. By the time a user surfaces a hallucinated citation or a drifted classifier threshold, the failure has already affected some number of downstream decisions. The report's implicit argument — and one we strongly endorse — is that AI continuous monitoring is not optional polish. It is a prerequisite for the authorization to be meaningful over the life of the system.

The specific instrumentation that closes this gap is not mysterious. Log every inference request and response with sufficient fidelity to reconstruct the decision path. Track a small battery of golden-set performance metrics on a schedule. Instrument the tool-calls of any agentic system so that each call, its arguments, and its result are auditable. Establish a drift threshold and an alerting path so that degradation is detected before it accumulates. Document all of this in the Continuous Monitoring Strategy that NIST SP 800-37 requires in any case — just extend it to cover the AI-specific axes.
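
A minimal sketch of two of those hooks in Python: structured inference logging and a golden-set drift alert. The JSON fields and the drift threshold are illustrative assumptions, not a prescribed schema.

```python
import json
import logging
import statistics

log = logging.getLogger("ai_monitoring")

def record_inference(request_id: str, prompt: str, response: str,
                     tool_calls: list[dict]) -> None:
    """Log one inference with enough fidelity to reconstruct the decision
    path, including every tool call an agentic component made along the way."""
    log.info(json.dumps({
        "request_id": request_id,
        "prompt": prompt,
        "response": response,
        "tool_calls": tool_calls,  # each entry: name, arguments, result
    }))

def drift_alert(golden_scores: list[float], baseline_mean: float,
                threshold: float = 0.05) -> bool:
    """True when mean golden-set performance has degraded past the threshold."""
    return (baseline_mean - statistics.mean(golden_scores)) > threshold
```

Wired into the agency's existing alerting path, a check like `drift_alert` turns "a user noticed a bad output" into "the monitoring stack flagged degradation before the bad output reached a user."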

What a Phase I offeror should do about Gap 3

A Phase I technical volume should include a Monitoring Hooks section that enumerates the telemetry the prototype will emit and how that telemetry would feed into an agency's existing continuous-monitoring stack at Phase II scale. This is the cheapest possible differentiator in a competitive Phase I evaluation. Most offerors treat monitoring as a Phase II concern. The report establishes that treating it as a Phase II concern is the failure mode agencies are now being audited for.

Gap 4: the lessons-learned-loop gap

The fourth gap the report documents is that agencies are not systematically capturing what worked and what did not across sequential AI acquisitions. The first AI procurement in a program office is, understandably, a learning exercise. The second should be materially easier. The report finds that it often is not — because the first acquisition's lessons were not captured in any form the second acquisition's PM could consult. Each new AI acquisition re-learns the same lessons about acceptance criteria, vendor due-diligence, and monitoring, at agency cost, on a schedule it cannot afford.

This is, unglamorously, a documentation problem. A structured after-action on each AI acquisition — what was specified, what was delivered, where the two diverged, what changes to the next acquisition's SOW would have prevented the divergence — compounds across a program's acquisition portfolio. Without it, every PM in every program office begins from zero, and the aggregate cost to the federal AI enterprise is the sum of all the zeros.

What a Phase I offeror should do about Gap 4

A Phase I Final Report should include a Lessons-Learned section written for the TPOC's successor — the person who will run the next acquisition in the program office after the current one concludes. The section should cover what the prototype taught the offeror about the problem space, what the acceptance criteria got right and wrong, what vendor representations turned out to be load-bearing, and what monitoring hooks proved most useful. This is how an SBIR offeror positions itself as a continuing partner rather than a one-time vendor. It is also how a program office builds the institutional memory the report identifies as missing.

The four gaps, side by side

Gap | What the report documents | Closure mechanism (Phase I deliverable)
Requirements definition | Agencies procure AI capabilities without specified acceptance criteria for performance, security, or data-handling behavior. | Acceptance Criteria Specification tied to NIST AI RMF Measure functions. Reviewed with TPOC in first four weeks.
Vendor due-diligence | Vendor representations on provenance, training-data jurisdiction, and authorization-boundary scope accepted without independent verification. | Provenance and Due-Diligence package: base-model identity, training-data attestation, boundary map, sub-processor list. Mapped to NIST SP 800-37 RMF Steps 1–2 and SR-3/4/5.
Continuous monitoring | Deployed AI systems lack instrumentation for drift, hallucination, bias, and authorization-boundary integrity beyond traditional security monitoring. | Monitoring Hooks section in the technical volume. Concrete telemetry schema that extends the agency's existing CMS under NIST SP 800-37 Step 6 / SP 800-137.
Lessons-learned loop | No systematic capture of what worked or failed across sequential AI acquisitions in a program office. | Lessons-Learned section in the Phase I Final Report, written for the successor PM. Positions the offeror as institutional memory.

What this means for agency program managers and CORs

The same four gaps are also a template for how to write a better SOW. A program manager who reads the report and sits down to draft the next AI acquisition can close most of the documented failure modes at the requirements stage, at zero marginal cost, by embedding four clauses in the SOW. The clauses are not novel. Each has established normative backing. The report's contribution is making the case that they should be default rather than optional.

  • Acceptance criteria clause. The SOW should enumerate the measurable behaviors the delivered system will be accepted against, with thresholds. Reference NIST AI RMF Measure functions by number. Require the vendor to propose any additional criteria they believe are material, so that both parties commit to a shared measurement framework before technical work starts.
  • Provenance attestation clause. The SOW should require the vendor to submit, as a deliverable, a Provenance Package covering base-model identity, training-data jurisdictional scope, fine-tuning dataset documentation, authorization-boundary map, and sub-processor list. The agency should require this package before acceptance rather than accepting a representation at contract award.
  • Monitoring instrumentation clause. The SOW should require the vendor to deliver the continuous-monitoring hooks necessary for drift, hallucination, bias, and authorization-boundary integrity observation, with a schema compatible with the agency's existing Continuous Monitoring Strategy. This is a concrete, inspectable deliverable. Failure to deliver it is a failure to deliver the system.
  • Lessons-learned clause. The SOW should require a structured after-action at contract closeout, delivered in a form the successor PM can actually use. A paragraph of contract language costs nothing and materially improves the next acquisition.

Three of the four clauses are already conventional in good federal acquisitions. The report's finding is that they are frequently omitted in AI acquisitions because the program office treats the AI system as a capability purchase rather than an IT system under FAR Part 39 with continuous-monitoring obligations. Re-framing the acquisition as a system procurement with AI-specific axes brings the conventional mechanisms back online.

Where the report understates the problem

One observation from the field: the report treats the four gaps as independent, when in our experience they are not. The requirements-definition gap causes the vendor due-diligence gap (the agency does not know what to ask for, so the vendor's representations fill the vacuum). The vendor due-diligence gap causes the continuous-monitoring gap (the agency cannot instrument what it did not fully characterize at acquisition). The continuous-monitoring gap causes the lessons-learned-loop gap (nothing was captured about operational behavior because nothing was measured). The four are a chain. Breaking any link weakens the rest of the chain, but the highest leverage is at the first link: define the requirements, and the other three become tractable.

Define the acceptance criteria early and concretely. Every subsequent gap becomes cheaper to close.

Precision Federal's SBIR approach, in one paragraph

Our Phase I work plans include the four deliverables described above as line items: an Acceptance Criteria Specification in week four, a Provenance and Due-Diligence package by the mid-phase review, a Monitoring Hooks section in the technical volume, and a Lessons-Learned section in the Final Report. These are not promotional additions; they are the minimum compliance surface a federal AI prototype should ship with in 2026, and they map directly to the four gaps documented in the report. When we write a Phase II proposal off a Phase I, each of these deliverables compounds forward — the Acceptance Criteria become the Phase II measurement framework, the Provenance package becomes the Phase II SSP appendix, the Monitoring Hooks become the Phase II continuous-monitoring strategy, and the Lessons-Learned section becomes the narrative of why the Phase II scope is shaped the way it is. The approach is conservative, documentable, and aligned with how the federal acquisition system wants AI procurement to work.

What we expect to change over the next twelve months

Three practical shifts we expect in the wake of the report.

  • SOW language will converge. Program offices will adopt template clauses for AI acceptance criteria, provenance attestation, and monitoring instrumentation. Offerors who can speak that language fluently will price accurately; offerors who cannot will lose on evaluation or lose on performance.
  • Continuous-monitoring expectations will sharpen. The gap between a traditional Continuous Monitoring Strategy and an AI-extended one will close. Expect to see specific AI telemetry requirements in SOWs and in the CMS templates agencies share with vendors.
  • Due-diligence deliverables will become standard. Provenance documentation will move from a nice-to-have into the core deliverable set, similar to how SBOMs became standard after Executive Order 14028. Offerors who have a documented Provenance Package ready at kickoff will accelerate the acquisition timeline materially.

Frequently asked questions

Is GAO-26-107859 a binding requirement on federal agencies?

GAO reports are not self-executing regulation. They document findings and make recommendations. Agencies respond with a concurrence or non-concurrence and a corrective-action plan. The practical effect is that SOW language and acquisition practices tend to migrate toward the recommendations over the following twelve to twenty-four months, particularly where the findings align with existing NIST and OMB guidance — as they do here.

Do the four gaps apply to Phase I SBIR procurements specifically?

Phase I SBIR is a smaller acquisition than the major AI procurements the report primarily examines, but the four gaps scale down to it cleanly. A $200K Phase I that does not specify acceptance criteria, document model provenance, include monitoring hooks, or produce a lessons-learned artifact reproduces the same failure modes at smaller scale. Conversely, a Phase I that addresses all four closes a substantial piece of the Phase II risk-reduction case the agency is trying to build.

How do these findings interact with the NIST AI RMF?

The NIST AI Risk Management Framework 1.0 is the normative vocabulary for closing the gaps. Requirements-definition maps to the Govern and Map functions. Vendor due-diligence maps to Measure and Map. Continuous monitoring maps to Measure and Manage. Lessons-learned maps to Govern. Proposals and SOWs that reference the framework by function name give both sides a shared basis for measurement.

What about FedRAMP — does authorization close the vendor due-diligence gap?

Partially. A FedRAMP authorization characterizes the platform on which an AI system runs; it does not, by itself, characterize an agency-specific fine-tuned model running on that platform, the training data used to produce that fine-tune, or the sub-processor chain behind the inference path. Provenance documentation for the model layer is complementary to, not substitutable by, the platform's FedRAMP package.

How should continuous monitoring for AI differ from traditional IT continuous monitoring?

The traditional axes — access control effectiveness, configuration drift, vulnerability exposure — remain necessary. AI systems require additional axes: model performance drift against a golden set, hallucination or fabrication rate on out-of-distribution inputs, bias emergence as fine-tuning data accumulates, and authorization-boundary integrity when agentic systems gain new tool-calling capabilities. The existing CMS under NIST SP 800-37 Step 6 and NIST SP 800-137 is the right container; the instrumentation schema has to be extended.

What is the single highest-leverage change an offeror can make in response to the report?

Write the Acceptance Criteria Specification as a four-week deliverable and review it with the TPOC before the technical work ramps. It costs the least, has the highest downstream payoff, and is the link in the chain that makes the other three closure mechanisms tractable. If only one change is made, make this one.

Book a federal AI acquisition review

We help program offices and SBIR offerors close the four gaps documented in GAO-26-107859 — acceptance criteria, provenance documentation, continuous-monitoring instrumentation, and lessons-learned capture. Work-plan templates and SOW language included.

UEI Y2JVCZXT9HP5 | CAGE 1AYQ0 | NAICS 541512 | SAM.gov: Active