Why CDC is where our stack earns its keep
Precision Federal is pursuing opportunities at the Centers for Disease Control and Prevention. CDC sits at the intersection of three things our team does unusually well for a small business: production machine learning on sensitive health data, standards-driven data engineering over messy public health feeds, and agentic LLM systems that can accelerate the work of a small number of scientific staff across a very large corpus.
Our federal health anchor is a production ML system shipped at SAMHSA: HHS, full ATO, sensitive behavioral health data, real users. CDC is SAMHSA's sister agency under HHS, shares much of the governance regime, and is in the middle of a multi-billion-dollar, multi-year data modernization push that we regard as the largest AI/ML-addressable opportunity in U.S. public health.
The Data Modernization Initiative — CDC's center of gravity
The CDC Data Modernization Initiative (DMI) is the ten-year rebuild of the public health data pipeline that began with COVID-era emergency funding and has continued as a program of record. DMI's scope is vast and directly AI/ML-relevant:
- Electronic Case Reporting (eCR) — automatic condition reporting from EHRs to public health, FHIR-native.
- Electronic Laboratory Reporting (ELR) — HL7 v2 and FHIR lab feeds from thousands of labs into state and CDC systems.
- National Notifiable Diseases Surveillance System (NNDSS) modernization.
- National Electronic Disease Surveillance System (NEDSS) Base System (NBS) modernization with state jurisdictions.
- BioSense Platform / National Syndromic Surveillance Program (NSSP) — real-time emergency department chief complaint and diagnosis data.
- Immunization registries — IIS modernization and interoperability.
- Mortality and natality data — Vital Statistics modernization.
- Data hub and enterprise data architecture — cross-CDC lakehouse and API modernization.
Almost every DMI workstream has an AI/ML-addressable data quality, anomaly detection, entity resolution, or forecasting problem embedded in it. A small business that can show up with a production federal ML system in its past and governance discipline in its delivery playbook is in a rare position.
CDC centers and offices we target
- OPHDST — Office of Public Health Data, Surveillance, and Technology. DMI owner. Enterprise data architecture.
- CFA — Center for Forecasting and Outbreak Analytics. Infectious disease modeling, scenario analysis, nowcasting. Our Kaggle modeling background directly transfers.
- NCIRD — National Center for Immunization and Respiratory Diseases. Flu, COVID, RSV forecasting and VAERS-adjacent signal detection.
- NCHHSTP — HIV, Viral Hepatitis, STD, and TB Prevention. Case-based surveillance, molecular epidemiology.
- NCEZID — Emerging and Zoonotic Infectious Diseases. PulseNet, foodborne outbreak ML.
- NCCDPHP — Chronic Disease Prevention and Health Promotion. BRFSS analytics, YRBSS, cancer registry ML.
- NCIPC — Injury Prevention and Control. Overdose surveillance (DOSE), suicide surveillance. SAMHSA-adjacent.
- NCEH — Environmental Health. Environmental Public Health Tracking Network.
- NIOSH — Occupational Safety and Health. Worker safety data and ML.
- NCBDDD — Birth Defects and Developmental Disabilities.
- NCHS — National Center for Health Statistics. NHANES, NHIS, NAMCS, vital statistics.
- CGH — Center for Global Health. Global outbreak response, PEPFAR-adjacent informatics.
- PHIC — Public Health Infrastructure Center. Workforce, training, state/local capacity.
PHGKB and the genomic epidemiology frontier
The Public Health Genomics Knowledge Base (PHGKB) is CDC's curated knowledge system linking genomic variants to public health evidence. It sits at the junction of three things we build routinely: biomedical NLP, structured knowledge extraction, and evidence synthesis. PHGKB-adjacent scope includes:
- Literature triage and extraction — automated identification and classification of genomic epidemiology papers, PICO extraction, guideline alignment.
- Variant-disease linkage — structured extraction of variant-phenotype-population associations from open-access genomic literature.
- CDC Tier 1 evidence classification — classifying reported genomic applications against CDC's tiered evidence framework.
- PulseNet integration — linking PHGKB entries to PulseNet whole-genome sequence data for foodborne pathogens.
This is the kind of scope where agentic LLM systems with retrieval-augmented generation (RAG) and human-in-the-loop review gates can replace months of manual curation with hours of reviewed output.
Syndromic and signal-detection ML — our strongest lane
Syndromic surveillance anomaly detection
BioSense / NSSP chief complaint and diagnosis stream modeling. Bayesian change-point detection, seasonally adjusted anomaly scoring, and jurisdiction-level and hospital-level alerting with false-positive control.
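As a rough illustration of seasonally adjusted alerting, the sketch below flags days whose count sits far above a same-weekday baseline. It is a deliberately simple stand-in, not the Bayesian change-point approach named above, and the function name, the four-week baseline, and the threshold of 3.0 are all assumptions chosen for the example.

```python
from statistics import mean, stdev


def flag_anomalies(counts, baseline_weeks=4, threshold=3.0):
    """Flag days whose count exceeds a same-weekday baseline.

    counts: list of daily visit counts, oldest first.
    Returns a list of (day_index, count, z_score) for flagged days.
    All parameter defaults are illustrative assumptions.
    """
    flagged = []
    # Start once a full baseline window is available.
    for i in range(baseline_weeks * 7, len(counts)):
        # Baseline: the same weekday in each of the previous weeks,
        # which removes the day-of-week pattern from the comparison.
        baseline = [counts[i - 7 * k] for k in range(1, baseline_weeks + 1)]
        mu = mean(baseline)
        # Floor the spread so a flat baseline cannot divide by zero.
        sigma = max(stdev(baseline), 1.0)
        z = (counts[i] - mu) / sigma
        if z >= threshold:
            flagged.append((i, counts[i], round(z, 2)))
    return flagged
```

Raising `threshold` trades sensitivity for fewer false positives; a production system would also need to correct for running many jurisdiction-level and hospital-level tests at once.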
Respiratory virus forecasting
Flu, COVID, RSV nowcasting and short-horizon forecasting. Hierarchical Bayesian ensembles and ML meta-learners. Alignment with CFA's open forecasting hub standards.
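One common way to combine quantile forecasts from several models is a per-quantile median. The sketch below shows that baseline only; it is a simpler technique than the hierarchical Bayesian ensembles and meta-learners mentioned above, and the input layout (model name mapped to quantile level mapped to value) is an assumption for illustration, not a hub file format.

```python
from statistics import median


def median_ensemble(forecasts):
    """Combine quantile forecasts by taking the median at each level.

    forecasts: dict of model name -> dict of quantile level -> value.
    Only quantile levels reported by every model are combined.
    """
    if not forecasts:
        raise ValueError("at least one model forecast is required")
    # Keep only the quantile levels that all models share.
    shared_levels = set.intersection(*(set(f) for f in forecasts.values()))
    return {
        level: median(f[level] for f in forecasts.values())
        for level in sorted(shared_levels)
    }
```

A median is robust to a single badly calibrated model, which is why it is often used as the reference ensemble that more elaborate weighting schemes have to beat.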
Genomic epidemiology ML
Sequence classification, outbreak clustering, phylogenetic placement at scale. PulseNet and SARS-CoV-2 surveillance patterns.
VAERS and drug safety NLP
Adverse event signal detection, free-text narrative NLP, disproportionality analysis with ML residualization.
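Disproportionality analysis usually starts from a 2x2 table of report counts. The sketch below computes the proportional reporting ratio (PRR) with an approximate 95% confidence interval; the cell names `a`, `b`, `c`, `d` and the function name are assumptions for the example, and the ML residualization step is not shown.

```python
import math


def proportional_reporting_ratio(a, b, c, d):
    """PRR and approximate 95% CI from a 2x2 table of report counts.

    a: reports with the product of interest and the event of interest
    b: reports with the product of interest and any other event
    c: reports with all other products and the event of interest
    d: reports with all other products and any other event
    """
    if a <= 0 or c <= 0 or b < 0 or d < 0:
        raise ValueError("a and c must be positive; b and d non-negative")
    prr = (a / (a + b)) / (c / (c + d))
    # Standard error of ln(PRR), using the usual relative-risk formula.
    se = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    lower = math.exp(math.log(prr) - 1.96 * se)
    upper = math.exp(math.log(prr) + 1.96 * se)
    return prr, lower, upper
```

A PRR well above 1 with a lower bound above 1 is a screening signal only; it reflects reporting patterns, not incidence, and needs clinical review before any conclusion is drawn.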
Overdose surveillance
DOSE and state-reported overdose data, toxicology narrative NLP. Direct bridge from SAMHSA TEDS to CDC NCIPC.
Data quality ML for ELR and eCR
Lab feed quality scoring, duplicate detection and deduplication, entity resolution across jurisdictions. DMI-core scope.
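A minimal sketch of duplicate detection follows, assuming records are dicts with hypothetical `id`, `name`, and `dob` fields. It blocks on date of birth and scores name similarity with the standard library's `difflib.SequenceMatcher`; real entity resolution would use more fields, probabilistic weights, and clerical review.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name):
    """Lowercase, drop commas, and collapse repeated whitespace."""
    return " ".join(name.lower().replace(",", " ").split())


def find_candidate_duplicates(records, threshold=0.85):
    """Return (id, id, score) for record pairs that look like duplicates.

    records: list of dicts with hypothetical keys "id", "name", "dob".
    threshold: illustrative similarity cutoff between 0 and 1.
    """
    # Blocking: only compare records that share a date of birth,
    # which avoids scoring every possible pair.
    blocks = defaultdict(list)
    for record in records:
        blocks[record["dob"]].append(record)

    pairs = []
    for block in blocks.values():
        for left, right in combinations(block, 2):
            score = SequenceMatcher(
                None, normalize(left["name"]), normalize(right["name"])
            ).ratio()
            if score >= threshold:
                pairs.append((left["id"], right["id"], round(score, 2)))
    return pairs
```

Blocking on a single exact field misses duplicates where that field itself is wrong, so a fuller design would run several blocking passes on different keys.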
Standards we design around — not bolt onto
CDC's data stack is standards-first. We design ingestion and ML pipelines that respect:
- HL7 FHIR R4 — eCR, FHIR-based case reporting, US Core profiles.
- HL7 v2.5.1 — ELR, immunization, lab.
- LOINC — lab test identifiers.
- SNOMED CT — clinical findings.
- ICD-10-CM — diagnoses.
- RxNorm — medications.
- CDC PHIN VADS — CDC's vocabulary access.
- HL7 CDA — case report documents.
Machine learning that works in production at CDC must ingest and harmonize across these. We have built these pipelines.
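To make the harmonization step concrete, here is a minimal sketch that pulls the coded observation out of HL7 v2 OBX segments so results can be keyed by LOINC. It assumes the default `|` and `^` delimiters and ignores escape sequences, repetitions, and delimiters declared in MSH; the function names and output keys are hypothetical.

```python
def parse_obx(segment):
    """Extract the coded identifier, value, and units from one OBX segment."""
    fields = segment.split("|")
    if fields[0] != "OBX":
        raise ValueError("not an OBX segment")
    # Pad so short segments do not raise IndexError.
    fields += [""] * (7 - len(fields))
    # OBX-3 is a coded element: identifier ^ text ^ coding system.
    components = fields[3].split("^")
    components += [""] * (3 - len(components))
    code, text, coding_system = components[:3]
    return {
        "value_type": fields[2],   # OBX-2
        "code": code,              # OBX-3.1
        "text": text,              # OBX-3.2
        "coding_system": coding_system,  # OBX-3.3, "LN" denotes LOINC
        "value": fields[5],        # OBX-5
        "units": fields[6],        # OBX-6
        "is_loinc": coding_system == "LN",
    }


def extract_observations(message):
    """Return parsed OBX segments from a raw HL7 v2 message string."""
    # HL7 v2 separates segments with carriage returns; tolerate newlines too.
    segments = [s for s in message.replace("\n", "\r").split("\r") if s]
    return [parse_obx(s) for s in segments if s.startswith("OBX|")]
```

For a segment such as `OBX|1|NM|2345-7^Glucose^LN||95|mg/dL`, the parser yields code `2345-7` with `is_loinc` set to true. Observations carrying local codes instead of LOINC are the ones that need mapping before any model sees them.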
Vehicles and pathways into CDC
- CDC SBIR — CDC participates in HHS SBIR with smaller topic pools than NIH but clear AI/ML-relevant themes.
- CDC Broad Agency Announcement — the CDC Emerging Infectious Diseases BAA and public health informatics BAAs.
- CIO-SP4 — government-wide acquisition contract (GWAC) for IT administered by NIH NITAAC, reachable for CDC scope through teaming.
- CDC-specific IDIQs — public health informatics support vehicles.
- Cooperative agreements — DMI funding flows heavily through state, tribal, local, and territorial jurisdictions; we partner with grantees.
- OTA via BARDA / ASPR — for preparedness-adjacent response tooling.
- USAspending / SAM.gov opportunity streams — we monitor CDC NAICS 541512 activity weekly.
Governance: FISMA, HIPAA, and CDC-specific data use
CDC systems span FISMA Moderate and FISMA High impact levels. Many data use agreements with state jurisdictions, the National Death Index, NCHS restricted data, and linked NHANES files carry bespoke terms. Our SAMHSA delivery experience means we are used to:
- Designing ML pipelines that respect minimum-necessary and purpose-limitation constraints.
- Running analytics inside controlled environments with no exfiltration.
- Writing System Security Plans, POA&Ms, and ATO artifacts without needing a compliance hand-holder.
- Operating under IRB exemptions and determinations where needed for surveillance.
Subcontracting, partnering, and state jurisdiction teaming
Three engagement patterns:
- Subcontract to a CDC prime — AI/ML specialty on DMI task orders, NEDSS modernization, or PHGKB informatics.
- Team with a state health department — CDC flows cooperative-agreement dollars to states. We have been building relationships with Iowa HHS and neighboring Midwestern jurisdictions.
- Prime on SBIR or small-dollar BAA — where topic fit is clear and scope aligns with our ML and data engineering strengths.
How to engage on a CDC requirement
Email [email protected] with the CDC center, vehicle, and scope. We respond within 24 hours with a fit assessment, rough level of effort, and teaming construct. For SBIR topics, see SBIR partnering. For related capability pages, see Machine Learning and Data Engineering.