Federal Health Data Platform

Millions of records. Dozens of source systems. One auditable pipeline. How we designed and operated a federal-scale health data platform that served analysts, dashboards, and downstream machine learning.

Millions: Records ingested and curated under federal data governance.

Multi-source: Authoritative federal source systems normalized into a canonical model.

HIPAA-aligned: Access controls, encryption, and audit logging throughout.

Analyst-ready: Dashboards, extracts, and ML-ready tables from the same platform.

The Challenge

Federal health agencies run on data. Program offices need weekly reports. Congressional staff need answers in hours. Researchers need extracts with proper disclosure review. Analysts need dashboards that refresh overnight without breaking. Data scientists need tables that will not silently change shape between training and inference. And every one of those consumers is working against the same underlying source systems that were built, extended, and re-extended over decades.

The challenge is not "move data from A to B." The challenge is: build a platform that every one of those consumers can rely on, where every data element is traceable back to its authoritative source, where every access is logged, and where the whole thing can pass security review and stay passing year after year.

Scope note. This case study describes platform architecture and engineering patterns at a level suitable for public publication. Specific agency systems, contract details, and data internals are intentionally not disclosed.

Ingestion: Multi-Source Pipelines

Federal health data arrives in a half-dozen shapes: bulk extracts from legacy mainframes, FTP drops of flat files, REST APIs from modern subsystems, database replicas, and periodic manual uploads from partner organizations. A serious platform has to speak all of those protocols without letting any of them dictate the shape of the rest of the system.

Our ingestion tier was built on a pattern that served us well across every source: raw data lands untouched in a landing zone, a standardization layer normalizes formats and types, and a canonical layer maps everything into the shared model, with aggregate and extract layers built on top.

Separating landing, standardization, and canonical layers is not glamorous engineering. It is the difference between a platform that can be maintained for ten years and a platform that collapses under its own tangled SQL in eighteen months.
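As a minimal sketch of that separation (the record fields, source names, and canonical model below are hypothetical, not the real schema):

```python
# Hypothetical three-layer flow: landing -> standardized -> canonical.
RAW_ROW = {"PAT_DOB": "19700115", "SRC": "legacy_mainframe"}

def land(raw: dict, source: str) -> dict:
    """Landing: store the record byte-for-byte, tagged with provenance."""
    return {"payload": dict(raw), "source": source}

def standardize(landed: dict) -> dict:
    """Standardization: normalize formats and types, keep the source tag."""
    dob = landed["payload"]["PAT_DOB"]
    return {
        "date_of_birth": f"{dob[0:4]}-{dob[4:6]}-{dob[6:8]}",  # to ISO 8601
        "source": landed["source"],
    }

def to_canonical(std: dict) -> dict:
    """Canonical: map standardized fields onto the shared model,
    carrying lineage forward so provenance stays cheap."""
    return {"person": {"dob": std["date_of_birth"]}, "lineage": std["source"]}

record = to_canonical(standardize(land(RAW_ROW, "legacy_mainframe")))
```

Because each layer has one job, a format change in one source touches only that source's standardization step; the canonical layer and everything above it stay stable.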

Storage: Warehouse Design

The warehouse was designed around three access patterns: interactive dashboard queries over pre-aggregated tables, ad-hoc analyst SQL against the curated layers, and bulk extracts for reporting and ML.

Schema Discipline

Every table has an owner, a definition, a refresh cadence, and an SLA. Every column has a description, a type, a nullability rule, and a classification (public, sensitive, restricted). No table ships without that metadata. This is the unglamorous foundation that makes the rest of the platform auditable.
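One way to make "no table ships without metadata" enforceable rather than aspirational is a registration check in the deployment pipeline. The required fields below mirror the ones named above; the helper itself is a hypothetical sketch:

```python
# Required metadata, mirroring the fields named in the text.
REQUIRED_TABLE_FIELDS = {"owner", "definition", "refresh_cadence", "sla"}
REQUIRED_COLUMN_FIELDS = {"description", "type", "nullable", "classification"}
CLASSIFICATIONS = {"public", "sensitive", "restricted"}

def validate_table(meta: dict) -> list:
    """Return a list of metadata violations; empty means the table may ship."""
    errors = [f"table missing: {f}"
              for f in sorted(REQUIRED_TABLE_FIELDS - meta.keys())]
    for name, col in meta.get("columns", {}).items():
        errors += [f"{name} missing: {f}"
                   for f in sorted(REQUIRED_COLUMN_FIELDS - col.keys())]
        if col.get("classification") not in CLASSIFICATIONS:
            errors.append(f"{name}: bad classification")
    return errors
```

Wiring a check like this into CI turns the metadata rule into a gate: a table definition that omits its owner or SLA fails the build instead of reaching the warehouse.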

Partitioning and Performance

At federal scale, partitioning strategy is the difference between a three-minute query and a three-hour query. We partitioned on reporting period for most fact tables, with clustered keys on the high-cardinality attributes analysts filter on. Performance budgets were published: any query that does not hit a partition key is flagged in the warehouse query log and reviewed.
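The flagging step described above can be sketched as a pass over the query log. The table name and partition key here are hypothetical, and the `WHERE`-clause check is deliberately crude; a real implementation would inspect the engine's query plan:

```python
import re

# Hypothetical mapping from partitioned table to its partition key.
PARTITION_KEYS = {"claims_fact": "reporting_period"}

def flag_unpartitioned(table: str, query: str) -> bool:
    """Flag a query that scans a partitioned table without filtering
    on its partition key (crude text check, not plan inspection)."""
    key = PARTITION_KEYS.get(table)
    if key is None:
        return False  # table is not partitioned; nothing to flag
    where = re.search(r"\bwhere\b(.*)", query, re.IGNORECASE | re.DOTALL)
    return where is None or key not in where.group(1).lower()
```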

Governance and Compliance

HIPAA-Aligned Architecture

Health data under federal authority routinely intersects HIPAA. That shapes the architecture in concrete ways: identified and de-identified data live in separate zones, data is encrypted in transit and at rest, access is controlled and approved, and every touch is logged.

Access Controls

Role-based access control is the baseline. Above it sits attribute-based access control for data elements with program-specific restrictions. Above that sits a review workflow: certain queries and extracts require approval, and that approval is recorded. The platform makes the secure path the easy path.
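The three layers stack naturally as a single authorization decision. A toy sketch, with all role, program, and element names invented for illustration:

```python
def authorize(user: dict, element: dict, approvals: set) -> bool:
    """Layered decision: RBAC first, then ABAC, then the recorded-approval
    workflow for restricted elements. All names are illustrative."""
    # RBAC: the user must hold the role the element requires.
    if element["required_role"] not in user["roles"]:
        return False
    # ABAC: a program-specific restriction must match a user attribute.
    program = element.get("program")
    if program is not None and program not in user.get("programs", set()):
        return False
    # Review workflow: restricted elements also need a recorded approval.
    if element["classification"] == "restricted":
        return (user["id"], element["name"]) in approvals
    return True
```

Ordering the checks this way keeps the common case cheap (role check only) and makes the approval requirement impossible to bypass for restricted elements.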

Audit Logging

Every access, every query, every extract, every schema change is logged with operator, timestamp, query text, and row count. Logs are immutable and retained according to the agency's records schedule. When a reviewer asks "who touched this record between these dates," we can answer in minutes, not weeks.

Audit logging is not a feature you add at the end. It is a property you design into the platform from the first table.
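Immutability can be made verifiable rather than merely asserted by hash-chaining entries, so any retroactive edit is detectable. A toy sketch of the idea, not the production design:

```python
import datetime
import hashlib
import json

class AuditLog:
    """Append-only log: each entry records a hash of the previous entry,
    so editing history breaks the chain (a sketch, not a product)."""

    def __init__(self):
        self._entries = []

    def record(self, operator, action, query_text, row_count):
        prev = self._entries[-1]["hash"] if self._entries else ""
        entry = {
            "operator": operator,
            "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "action": action,
            "query_text": query_text,
            "row_count": row_count,
            "prev": prev,
        }
        entry["hash"] = hashlib.sha256(
            (prev + json.dumps(entry, sort_keys=True)).encode()
        ).hexdigest()
        self._entries.append(entry)

    def verify(self):
        """Recompute the chain; False means someone edited history."""
        prev = ""
        for e in self._entries:
            body = {k: v for k, v in e.items() if k != "hash"}
            expected = hashlib.sha256(
                (prev + json.dumps(body, sort_keys=True)).encode()
            ).hexdigest()
            if e["prev"] != prev or e["hash"] != expected:
                return False
            prev = e["hash"]
        return True
```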

Real-Time Dashboards and Automated Reporting

Program offices do not want to run SQL. They want answers. The platform exposes curated dashboards over the aggregated layer, with consistent visual grammar, clearly labeled "as-of" timestamps, and data-quality indicators.

Automated reporting runs on the same curated tables. Congressional response documents, internal briefs, and recurring program reports are generated from the warehouse, with the underlying queries version-controlled and reviewable. When a number shows up in a brief, anyone on the team can trace it back to the exact query, table, and partition that produced it.

Support for Analysts and ML Practitioners

A platform that only serves dashboards is not a platform. Analysts and ML practitioners are first-class users: analysts get governed SQL access to the curated layers, and ML teams get point-in-time correct snapshots, documented feature contracts, and a clear boundary between analytics tables and training tables.
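For the ML side, point-in-time correctness is the property that a snapshot "as of" a date reflects only what was known then. A minimal sketch, with the row shape (`entity_id` / `effective_at` / `value`) hypothetical:

```python
def snapshot_as_of(history, as_of):
    """For each entity, take the latest version whose effective
    timestamp is <= as_of, so training data never leaks the future."""
    latest = {}
    for row in sorted(history, key=lambda r: r["effective_at"]):
        if row["effective_at"] <= as_of:
            latest[row["entity_id"]] = row
    return latest
```

Training against snapshots built this way is what keeps tables from "silently changing shape between training and inference": the same `as_of` always yields the same table.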

Operational Patterns

Data Quality Monitoring

Row counts, null rates, value distributions, and referential integrity checks run on every ingest and every transformation. Anomalies fire alerts. Alerts have runbooks. Runbooks are rehearsed.
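Two of those checks can be sketched in a few lines; the thresholds and field names are hypothetical:

```python
def quality_checks(rows, expected_min, null_free_fields):
    """Run row-count and null-rate checks on one ingested batch;
    each returned string would fire an alert with its runbook."""
    alerts = []
    if len(rows) < expected_min:
        alerts.append(f"row_count {len(rows)} below floor {expected_min}")
    for field in null_free_fields:
        null_rate = sum(1 for r in rows if r.get(field) is None) / max(len(rows), 1)
        if null_rate > 0:
            alerts.append(f"{field} null_rate {null_rate:.2%}")
    return alerts
```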

Incident Response

A data incident is different from a security incident, and the platform has to be able to handle both. A bad upstream file that poisons a dashboard is a data incident: we need to roll back the affected partitions cleanly. A suspected unauthorized access is a security incident: we need to freeze the relevant accounts, collect logs, and escalate. Both workflows are documented and both have been exercised.

Change Management

Schema changes, pipeline changes, and access policy changes all go through a change management process. Emergency changes are possible but they are documented and reviewed after the fact. Silent changes are not allowed.

Lessons Learned for Federal Data Engineering

1. Separate layers ruthlessly

Landing, standardization, canonical, aggregate, extract. Each layer has one job. Blurring the layers creates messes that take years to clean up.

2. Metadata is first-class

Owner, definition, lineage, classification, SLA. If you are not capturing those at the moment a table is created, you will be reverse-engineering them later.

3. Make provenance cheap

Every downstream artifact should be traceable to an upstream source without archaeology. Build provenance in; never bolt it on.

4. Design for multiple consumers

Dashboards, analysts, and ML are different consumer types with different needs. A platform that optimizes for one and treats the others as afterthoughts will fail two out of three.

5. Respect the records schedule

Federal records retention is a legal matter. Build it into your storage policies from day one.

6. Your catalog is a product

If analysts cannot find a table, it does not exist. Invest in the catalog with the same seriousness you invest in the warehouse.

7. Treat de-identification as code

Policies about what can be shared with whom should live in reviewable, version-controlled code, not in memos. If the rule is in a memo, it will eventually be violated.
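What "policy as code" looks like in practice is a reviewable transformation rather than a memo. A toy sketch: the field names are invented, the salt would be a managed secret in practice, and a real pipeline would also assess re-identification risk:

```python
import hashlib

SALT = "example-salt"  # illustration only; in practice a managed secret

def deidentify(record: dict) -> dict:
    """Policy-as-code sketch: drop direct identifiers, pseudonymize the
    join key, and generalize quasi-identifiers. Field names hypothetical."""
    out = dict(record)
    out.pop("name", None)   # drop direct identifiers
    out.pop("ssn", None)
    # Pseudonymize the key so de-identified rows still join to each other.
    out["person_key"] = hashlib.sha256(
        (SALT + record["person_id"]).encode()
    ).hexdigest()[:16]
    del out["person_id"]
    out["zip3"] = record["zip"][:3]  # generalize ZIP to first three digits
    del out["zip"]
    return out
```

Because the rule lives in version control, every change to it is reviewed and every extract can be traced to the exact policy revision that produced it.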

8. Rehearse your incident response

Tabletop exercises cost hours. Discovering your runbook is wrong during a real incident costs days.

FAQ

What stack do you build federal health data platforms on?
It depends on the agency's approved environment. Common patterns include cloud warehouses, object storage for landing, orchestrators like Airflow, and dbt or similar for in-warehouse transformation. We use what the agency has approved and bring discipline to how it is used.
How do you handle de-identification?
A clearly defined boundary separates identified and de-identified zones. De-identification transformations are code, reviewed, and version-controlled. Re-identification risk is assessed and documented.
Do you support real-time streaming?
Yes, where the source system and use case justify it. Many federal health workloads are batch by nature; we do not force streaming architectures where batch is the right answer.
How do you support downstream ML teams?
Point-in-time correct snapshots, documented feature contracts, and a clear boundary between analytics tables and ML-training tables.
Can you work inside an existing federal data environment?
Yes. We are comfortable working inside an agency's existing accounts, warehouses, and tooling rather than insisting on greenfield.
What security standards apply?
FISMA/NIST SP 800-53 as the baseline, with HIPAA and agency-specific policy layered on. Exact control profile depends on the system's boundary and data classification.


Building a federal data platform?

We have done the hard parts before: governance, audit, consumer diversity, and ten-year maintainability. Let's talk.

Email Bo Peng →