Why data engineering keeps appearing in open programs

The data-engineering pattern recurs in the public 2026 DAF open programs for a straightforward reason: many Air Force operational problems are blocked not by missing models but by missing or fragmented data. The open-innovation pathway is well suited to this kind of work because the offeror can describe a specific data problem at a specific customer site and propose a specific software-based solution, without having to guess at the program office's preferred technology.
Data engineering supports analytics and ML — not the reverse. Most operational ML at DAF depends on data work that has not yet been done at sufficient quality for evaluation, retraining, and maintenance.
Public reference architectures
The Department of the Air Force has published, at varying depths, several reference architectures for enterprise data: DAF efforts under the Advanced Battle Management System (ABMS) and CJADC2 lineage, the Department of the Air Force Chief Data and Artificial Intelligence Office (CDAIO) data publications, the published Platform One reference patterns, and several program-specific data architectures inside Air Force Materiel Command and Air Combat Command. These documents describe ingest, governance, access, and analytic layers in technology-neutral terms.
The CDAIO publications and the broader DoD CDAO publications cover the ingest, catalog, and access-control layers in unclassified detail. The DoD Data Strategy and the published Data Decrees signed by the Deputy Secretary of Defense (data as strategic asset; visible, accessible, understandable, linked, trustworthy, interoperable, secure — VAULTIS) are the policy substrate that the technical architectures implement. Offerors who quote VAULTIS by name in proposal text are demonstrating they have read the policy literature.
Offerors who read these documents carefully and align their proposed software to those conventions do better than offerors who propose against an idealized greenfield. The published reference architectures are not optional reading; reviewers will compare proposals against them. The cleanest posture is to name the document, name the architectural layer being addressed, and describe the offeror's contribution in those terms.
The legacy-system reality
A substantial fraction of operationally relevant Air Force data lives in systems built years or decades ago, with documented but not always machine-readable schemas, intermittent maintenance, and significant operator workarounds embedded in daily use. Aircraft maintenance data, logistics data, personnel data, and many ISR data flows have legacy origins, and the operator-workaround layer is often where the operationally meaningful semantics actually live.
The public research literature on enterprise data integration — schema matching (the Rahm and Bernstein survey lineage), entity resolution (the Christen lineage), semantic enrichment, and the broader Halevy/Doan dataspaces line of work — is well developed but is not always taught to junior software engineers. The methodological discipline that reviewers expect — probabilistic matching with documented thresholds, human-in-the-loop adjudication for ambiguous cases, audit-grade provenance — is the part of the literature that closes the gap between prototype and field.
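The threshold-and-adjudication discipline described above can be sketched in a few lines. Everything below is illustrative: the field names, weights, and thresholds are assumptions for the sketch, not drawn from any DAF system or published baseline.

```python
# Hypothetical sketch of threshold-banded probabilistic matching with a
# human-in-the-loop adjudication band. Field names, weights, and
# thresholds are illustrative assumptions, not from any DAF system.
from difflib import SequenceMatcher

MATCH_THRESHOLD = 0.85   # at or above: auto-link the records
REVIEW_THRESHOLD = 0.60  # between the two: route to a human adjudicator

def field_similarity(a: str, b: str) -> float:
    """Normalized string similarity for one field."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def match_score(rec_a: dict, rec_b: dict, weights: dict) -> float:
    """Weighted average of per-field similarities."""
    total = sum(weights.values())
    return sum(
        w * field_similarity(rec_a[f], rec_b[f]) for f, w in weights.items()
    ) / total

def classify(score: float) -> str:
    """Map a score into the documented decision bands."""
    if score >= MATCH_THRESHOLD:
        return "match"
    if score >= REVIEW_THRESHOLD:
        return "adjudicate"  # goes to the human-in-the-loop queue
    return "non-match"

# Two records for the same asset, entered with different conventions.
weights = {"tail_number": 0.5, "base": 0.3, "system": 0.2}
a = {"tail_number": "87-0117", "base": "Tinker AFB", "system": "B-1B"}
b = {"tail_number": "87-0117", "base": "TINKER AFB OK", "system": "B1B"}
score = match_score(a, b, weights)
print(classify(score))
```

The point of the sketch is the documented-threshold structure, not the similarity function: in production the scoring model would be fitted and validated, but the two named cutoffs and the explicit adjudication band are what make the pipeline auditable.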
Offerors who lead with their integration methodology and their evidence of doing it before tend to land better with reviewers. The honest framing acknowledges that legacy integration is grindy, that the AI/ML layer cannot bypass it, and that the offeror has prior evidence of doing it well.
Public reference architectures and patterns
- ABMS / CJADC2 lineage — Public Air Force command-and-control modernization.
- DAF CDAIO publications — Chief Data and AI Office data strategy artifacts.
- Schema matching — Rahm/Bernstein — foundational integration literature.
- Entity resolution — Christen — canonical reference for cross-source entity reconciliation.
- Data observability — Great Expectations, Soda, TFDV for production monitoring.
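The data-observability entry above refers to a shared pattern: declarative checks that run against production data and report which rows fail. The sketch below illustrates that pattern only; the API is invented for the example and is not the actual interface of Great Expectations, Soda, or TFDV.

```python
# Minimal sketch of the "expectation" pattern that data-observability
# tools implement. The API here is illustrative, not any real library's.
from dataclasses import dataclass, field

@dataclass
class ExpectationResult:
    name: str
    passed: bool
    failing_rows: list = field(default_factory=list)

def expect_not_null(rows, column):
    """Flag rows where the column is missing or empty."""
    failing = [i for i, r in enumerate(rows) if r.get(column) in (None, "")]
    return ExpectationResult(f"not_null:{column}", not failing, failing)

def expect_in_range(rows, column, lo, hi):
    """Flag rows where the column is absent or outside [lo, hi]."""
    failing = [
        i for i, r in enumerate(rows)
        if r.get(column) is None or not (lo <= r[column] <= hi)
    ]
    return ExpectationResult(f"range:{column}", not failing, failing)

# Illustrative maintenance-style records with one bad row (index 1).
records = [
    {"wuc": "11A", "man_hours": 3.5},
    {"wuc": "",    "man_hours": -2.0},
]
results = [
    expect_not_null(records, "wuc"),
    expect_in_range(records, "man_hours", 0.0, 24.0),
]
for r in results:
    print(r.name, "passed" if r.passed else f"failed rows {r.failing_rows}")
```

Run on a schedule against production tables, checks like these are what turn "the pipeline ran" into "the pipeline produced data the downstream model can trust."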
Governance and access
Air Force data governance is real and has consequences for any open-innovation data-engineering project. CUI handling under DoDI 5200.48, role-based access control under the published Air Force IT directives, audit logging consistent with the NIST SP 800-53 control set, and data-tagging consistent with the DoD CDAO's published metadata guidance are all published requirements. Offerors who skip these in their Phase I plan, expecting to add them later, generally find that they can't.
The published cloud-services landscape (Cloud One, Platform One, IL4/IL5/IL6 environments under the DoD Cloud Computing Security Requirements Guide) is part of this picture. Software-first offerors building data pipelines need to know which environment their target customer operates in, what controls are inherited, and what the offeror has to add. Offerors who can name the impact level, the inherited controls, and the residual control set are read as more credible than offerors who handwave at "secure cloud."
The right posture is to design governance in from the start, even at prototype scale. Audit logging, RBAC, and provenance are easier to build in than to bolt on; the published guidance from CDAIO, CDAO, and the Platform One community of practice consistently emphasizes this point.
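What "designing governance in from the start" looks like at prototype scale can be shown in miniature: every data access is role-checked and emits an audit record, including denials. The role names, grant table, and log format below are illustrative assumptions for the sketch, not from any published DAF control baseline.

```python
# Hypothetical sketch of governance-by-design at prototype scale:
# role-based access control plus an append-only audit trail, wrapped
# around every data operation. Roles and log fields are illustrative.
import json
import time

ROLE_GRANTS = {"analyst": {"read"}, "engineer": {"read", "write"}}
AUDIT_LOG = []  # stand-in for an append-only, tamper-evident store

def audited_access(user, role, action, dataset):
    """Check the role grant, log the attempt either way, then act."""
    allowed = action in ROLE_GRANTS.get(role, set())
    AUDIT_LOG.append(json.dumps({
        "ts": time.time(), "user": user, "role": role,
        "action": action, "dataset": dataset, "allowed": allowed,
    }))
    if not allowed:
        raise PermissionError(f"{role} may not {action} {dataset}")
    return f"{action}:{dataset}"  # stand-in for the real data operation

audited_access("jdoe", "analyst", "read", "mx_events")
try:
    audited_access("jdoe", "analyst", "write", "mx_events")
except PermissionError:
    pass
# Both the grant and the denial are now in the audit trail.
print(len(AUDIT_LOG), "audit records")
```

The design choice worth noting is that the denial is logged before the exception is raised: an audit trail that records only successes is not audit-grade.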
Analytics and ML on top
Data engineering supports analytics and ML, not the other way around. Several public Air Force data initiatives have explicitly observed that ML models cannot be evaluated, retrained, or maintained without a data substrate that supports those operations. The MLOps literature — the Sculley et al. paper on hidden technical debt in ML systems, the broader Google MLOps publications, and the published practice from federal ML programs — has converged on the data-substrate point as a precondition rather than a finishing detail.
Offerors who propose models without articulating the data substrate face honest skepticism from reviewers; offerors who propose data substrates that enable a clear ML roadmap fare better. The clearest framing is the data-then-models progression: a Phase I that delivers a clean data substrate with a documented evaluation harness; a Phase II that adds the first analytical or ML layer on top of that substrate, with the substrate as a permanent deliverable.
The published evaluation methodology from NIST AI RMF, the DoD Responsible AI Strategy, and the CDAO's published AI-evaluation guidance gives offerors a specific vocabulary for how reviewers will assess any AI/ML claim. Performance under realistic data distributions, bias audits, drift monitoring, and the data substrate that enables them are part of every credible AI/ML proposal in this space.
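One concrete piece of the drift-monitoring vocabulary above is the population stability index (PSI), a standard way to compare a live feature distribution against its training-time baseline. The bin fractions below are invented for the example, and the 0.10 / 0.25 cutoffs are commonly cited rules of thumb, not a DAF-published standard.

```python
# Illustrative drift check: population stability index (PSI) over
# pre-binned feature fractions. Data and thresholds are illustrative.
import math

def psi(baseline_fracs, live_fracs, eps=1e-6):
    """PSI: sum of (live - base) * ln(live / base) over bins."""
    return sum(
        (l - b) * math.log((l + eps) / (b + eps))
        for b, l in zip(baseline_fracs, live_fracs)
    )

baseline = [0.25, 0.25, 0.25, 0.25]  # training-time bin fractions
stable   = [0.24, 0.26, 0.25, 0.25]  # minor day-to-day wobble
shifted  = [0.05, 0.10, 0.25, 0.60]  # distribution has moved

print(f"stable  PSI = {psi(baseline, stable):.4f}")   # well under 0.10
print(f"shifted PSI = {psi(baseline, shifted):.4f}")  # well over 0.25
```

A check like this only works if the data substrate retains the training-time bin fractions alongside the model, which is exactly the data-then-models point: the monitoring layer is a consumer of substrate the Phase I work has to deliver.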
Transition
The transition pattern for successful data-engineering Phase II work is consistent: a specific Air Force unit or office adopts the prototype as part of its operational data flow, an O&M-style sustainment account is identified, and Phase III work scales the prototype to additional units. The published Air Force Phase III case studies show several data-engineering examples that follow this pattern, with the named operator unit and the named program element appearing in unclassified contract documents.
Offerors who scope Phase II to enable that transition — rather than to deliver a generic platform — perform better in the transition rate statistics. The cleanest Phase II SOW pattern is to deliver a unit-specific deployment with measurable adoption metrics, then describe the additional units that the same software supports without per-unit reengineering.
The non-SBIR appropriation question, which dominates Phase III, is also a Phase II question for data-engineering work because the unit's O&M line is what sustains the deployment after the SBIR ends. Offerors who can name the line and the responsible budget officer in the Phase II proposal are read as transition-credible; offerors who cannot are read as research-only.
Concept terms in this problem class
Reference architecture. A technology-neutral published description of an enterprise data system's layers — ingest, governance, access, analytics — that an offeror can align to instead of proposing against an idealized greenfield.
Schema matching / entity resolution. Two of the core enterprise-data-integration techniques that reviewers expect to see when an offeror is proposing legacy-system integration work.
Sustainment account. A non-SBIR appropriations source identified for Phase III scaling, usually a specific operational unit's O&M line.
Common questions on the public-record framing
What public DAF data references are foundational?
VAULTIS, Platform One, AFMC and ACC data publications, and the DAF Chief Data and AI Office data strategy. Each is public.
Why is legacy integration the recurring engineering bottleneck?
Operationally relevant data sits in systems built across decades with overlapping but non-identical schemas. Schema matching (Rahm/Bernstein), entity resolution (Christen), and dataspaces (Halevy/Doan) are the published frameworks.
How does governance show up at small Phase I scale?
CUI handling, role-based access, audit logging, data tagging — all retrofittable but expensive after the fact. Phase I scope that designs governance from the start scales better than scope that bolts it on.
What does this article not cover?
Specific data systems under modernization, specific named programs, or any Precision Federal data integration methodology.
Frequently asked questions
Why does data engineering keep appearing in DAF open programs?
Because many Air Force operational problems are blocked by missing or fragmented data, not by missing models. The open-innovation pathway is well-suited to specifying a data problem at a specific customer site and proposing a specific software solution.
Should offerors align to the published reference architectures?
Yes. Aligning to ABMS / CJADC2 and CDAIO publications is a credibility signal; proposing against an idealized greenfield is read as inexperience.
When should governance be designed in?
From the start, at prototype scale. CUI handling, RBAC, audit logging, and data-tagging are published requirements; offerors who plan to add them later usually find they can't.
What does a successful transition look like?
A specific Air Force unit or office adopts the prototype as part of its operational data flow, an O&M-style sustainment account is identified, and Phase III work scales the prototype to additional units.
Why this work matters to us
Precision Federal is a software-only SBIR firm. The reason articles like this one exist on this site is simple: federal program offices fund teams whose principal investigators have demonstrated, in public, that they think carefully about the problems the program is trying to solve. We write to demonstrate that posture, not to telegraph any particular technical approach. If your office is exploring the problem class above and wants a partner who reads the literature, codes the prototypes, and ships under a Phase I or Direct-to-Phase-II SOW, we are listening.