Millions
Records ingested and curated under federal data governance.
Multi-source
Authoritative federal source systems normalized into a canonical model.
HIPAA-aligned
Access controls, encryption, and audit logging throughout.
Analyst-ready
Dashboards, extracts, and ML-ready tables from the same platform.
The Challenge
Federal health agencies run on data. Program offices need weekly reports. Congressional staff need answers in hours. Researchers need extracts with proper disclosure review. Analysts need dashboards that refresh overnight without breaking. Data scientists need tables that will not silently change shape between training and inference. And every one of those consumers is working against the same underlying source systems that were built, extended, and re-extended over decades.
The challenge is not "move data from A to B." The challenge is: build a platform that every one of those consumers can rely on, where every data element is traceable back to its authoritative source, where every access is logged, and where the whole thing can pass security review and stay passing year after year.
Ingestion: Multi-Source Pipelines
Federal health data arrives in a half-dozen shapes: bulk extracts from legacy mainframes, FTP drops of flat files, REST APIs from modern subsystems, database replicas, and periodic manual uploads from partner organizations. A serious platform has to speak all of those protocols without letting any of them dictate the shape of the rest of the system.
Our ingestion tier was built on a pattern that served us well across every source:
- Landing zone. Raw bytes land in an encrypted, access-controlled store with no transformation. If something goes wrong later, we can always replay from raw.
- Source-of-record manifest. Every arrival is catalogued with source system, operator, timestamp, record count, and a cryptographic hash. That manifest is the spine of the audit trail.
- Schema contract. Each source has a versioned schema contract checked on ingest. A silent schema drift in an upstream system does not propagate; it fires an alert.
- Standardization layer. Code sets, date formats, state codes, missingness conventions, and controlled vocabularies are normalized in an explicit, reviewable step, not in ad-hoc SQL scattered across downstream jobs.
- Canonical model. Standardized records merge into a canonical entity model that downstream consumers query.
Separating landing, standardization, and canonical layers is not glamorous engineering. It is the difference between a platform that can be maintained for ten years and a platform that collapses under its own tangled SQL in eighteen months.
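The source-of-record manifest described above can be sketched in a few lines. This is a minimal illustration, not the production implementation: the function name, the `manifest.jsonl` file, and the field names are assumptions, and a real deployment would write to a governed, append-only table rather than a local file.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def record_manifest_entry(landed_file: Path, source_system: str,
                          operator: str, record_count: int) -> dict:
    """Catalogue one raw arrival in the source-of-record manifest."""
    # Cryptographic hash of the raw bytes: the anchor of the audit trail.
    sha256 = hashlib.sha256(landed_file.read_bytes()).hexdigest()
    entry = {
        "source_system": source_system,
        "operator": operator,
        "arrived_at": datetime.now(timezone.utc).isoformat(),
        "record_count": record_count,
        "sha256": sha256,
        "path": str(landed_file),
    }
    # Append-only log; in production this would be a governed manifest table.
    with open("manifest.jsonl", "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

Because the hash is taken before any transformation, any later replay from the landing zone can be verified byte-for-byte against the manifest.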
Storage: Warehouse Design
The warehouse was designed around three access patterns:
- Analyst SQL. Star-schema fact and dimension tables, conformed across programs, documented in a data catalog.
- Dashboard extracts. Pre-aggregated tables sized for sub-second dashboard response, refreshed on a defined cadence with clearly published "as-of" timestamps.
- ML training pulls. Point-in-time correct snapshots for model training, engineered to prevent leakage between features and labels.
Schema Discipline
Every table has an owner, a definition, a refresh cadence, and an SLA. Every column has a description, a type, a nullability rule, and a classification (public, sensitive, restricted). No table ships without that metadata. This is the unglamorous foundation that makes the rest of the platform auditable.
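The "no table ships without metadata" rule is easy to enforce if the metadata is code. A sketch of what such a contract might look like, with hypothetical class and field names chosen for illustration:

```python
from dataclasses import dataclass
from enum import Enum

class Classification(Enum):
    PUBLIC = "public"
    SENSITIVE = "sensitive"
    RESTRICTED = "restricted"

@dataclass(frozen=True)
class ColumnSpec:
    name: str
    description: str
    dtype: str
    nullable: bool
    classification: Classification

@dataclass(frozen=True)
class TableSpec:
    name: str
    owner: str
    definition: str
    refresh_cadence: str
    sla_hours: int
    columns: tuple[ColumnSpec, ...]

    def validate(self) -> None:
        """Fail the build if required metadata is missing."""
        assert self.owner and self.definition, f"{self.name}: missing owner or definition"
        for col in self.columns:
            assert col.description, f"{self.name}.{col.name}: missing description"
```

Running `validate()` in CI turns the metadata rule from a policy memo into a gate that a pull request cannot pass without satisfying.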
Partitioning and Performance
At federal scale, partitioning strategy is the difference between a three-minute query and a three-hour query. We partitioned on reporting period for most fact tables, with clustered keys on the high-cardinality attributes analysts filter on. Performance budgets were published: any query that does not hit a partition key is flagged in the warehouse query log and reviewed.
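The query-log review described above can be automated with a simple lint pass. This is a deliberately naive sketch (regex matching, not real SQL parsing); the table names and the `reporting_period` partition key are illustrative assumptions:

```python
import re

# Assumed mapping of fact tables to their partition keys.
PARTITION_KEYS = {
    "claims_fact": "reporting_period",
    "enrollment_fact": "reporting_period",
}

def flag_unpartitioned(query_text: str) -> list[str]:
    """Return the tables a query touches without filtering on their partition key."""
    flagged = []
    for table, key in PARTITION_KEYS.items():
        touches_table = re.search(rf"\b{table}\b", query_text, re.IGNORECASE)
        filters_key = re.search(rf"\b{key}\b", query_text, re.IGNORECASE)
        if touches_table and not filters_key:
            flagged.append(table)
    return flagged
```

A production version would parse the query plan rather than the text, but even a crude check surfaces the three-hour full scans before they reach a shared cluster.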
Governance and Compliance
HIPAA-Aligned Architecture
Health data under federal authority routinely intersects HIPAA. That shapes the architecture in concrete ways:
- Encryption at rest and in transit. Every byte, every hop. No exceptions.
- Minimum necessary. Consumers get access to the minimum data needed for their role. If an analyst needs only state-level aggregates, they do not get row-level PHI.
- De-identification boundary. A clear, reviewed boundary separates identified and de-identified zones, with defined transformations and a documented re-identification risk assessment.
- Business Associate considerations. Where applicable, BAAs are in place and documented.
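The de-identification boundary is the kind of policy that should live in reviewable code. The sketch below is loosely modeled on Safe Harbor-style generalizations (the field names and specific rules are illustrative, and real rules require a documented re-identification risk assessment, including ZIP-code population checks this sketch omits):

```python
def deidentify(record: dict) -> dict:
    """Apply generalization rules at the de-identification boundary."""
    out = dict(record)
    # Direct identifiers are dropped outright.
    out.pop("ssn", None)
    out.pop("name", None)
    # ZIP codes are truncated to the first three digits.
    if "zip" in out:
        out["zip"] = str(out["zip"])[:3] + "XX"
    # Ages over 89 are collapsed into a single top-coded bucket.
    if "age" in out and out["age"] > 89:
        out["age"] = "90+"
    # Dates of birth are generalized to birth year.
    if "dob" in out:
        out["birth_year"] = str(out.pop("dob"))[:4]
    return out
```

Because the transformations are a function, they can be unit-tested, versioned, and cited in the risk assessment, rather than reconstructed from scattered SQL.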
Access Controls
Role-based access control is the baseline. Above it sits attribute-based access control for data elements with program-specific restrictions. Above that sits a review workflow: certain queries and extracts require approval, and that approval is recorded. The platform makes the secure path the easy path.
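The three layers (RBAC baseline, ABAC attributes, review workflow) compose naturally as sequential checks. A minimal sketch, with hypothetical role names, table names, and program attributes:

```python
# Assumed role-to-table grants (RBAC baseline).
ROLE_TABLES = {
    "analyst": {"state_aggregates"},
    "phi_analyst": {"state_aggregates", "claims_row_level"},
}
# Tables with program-specific restrictions (ABAC layer).
RESTRICTED = {"claims_row_level"}

def authorize(user: dict, table: str,
              approved_requests: set[tuple[str, str]]) -> bool:
    """Layered check: role grant, then attributes, then recorded approval."""
    if table not in ROLE_TABLES.get(user["role"], set()):
        return False                                        # RBAC baseline
    if table in RESTRICTED:
        if user.get("program") != "medicaid-integrity":     # ABAC attribute
            return False
        if (user["id"], table) not in approved_requests:    # review workflow
            return False
    return True
```

Each denial happens at the cheapest layer that can make the call, and the approval set gives reviewers a single place to see every recorded grant.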
Audit Logging
Every access, every query, every extract, every schema change is logged with operator, timestamp, query text, and row count. Logs are immutable and retained according to the agency's records schedule. When a reviewer asks "who touched this record between these dates," we can answer in minutes, not weeks.
Audit logging is not a feature you add at the end. It is a property you design into the platform from the first table.
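One way to make log immutability verifiable rather than merely asserted is a hash chain: each entry commits to its predecessor, so any after-the-fact edit breaks the chain. This in-memory sketch is illustrative only; a real deployment would persist to write-once storage under the records schedule:

```python
import hashlib
import json

class AuditLog:
    """Append-only audit log; each entry commits to its predecessor's hash."""

    def __init__(self):
        self.entries = []
        self._prev = "0" * 64  # genesis hash

    def append(self, operator: str, action: str,
               query_text: str, row_count: int) -> dict:
        entry = {"operator": operator, "action": action,
                 "query_text": query_text, "row_count": row_count,
                 "prev_hash": self._prev}
        entry["hash"] = hashlib.sha256(
            json.dumps(entry, sort_keys=True).encode()).hexdigest()
        self.entries.append(entry)
        self._prev = entry["hash"]
        return entry

    def verify(self) -> bool:
        """Recompute the chain; any tampered entry breaks verification."""
        prev = "0" * 64
        for e in self.entries:
            body = {k: v for k, v in e.items() if k != "hash"}
            expected = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()).hexdigest()
            if e["prev_hash"] != prev or expected != e["hash"]:
                return False
            prev = e["hash"]
        return True
```

With a chained log, answering "who touched this record between these dates" comes with proof that the answer has not been edited since it was written.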
Real-Time Dashboards and Automated Reporting
Program offices do not want to run SQL. They want answers. The platform exposes curated dashboards over the aggregated layer, with consistent visual grammar, clearly labeled "as-of" timestamps, and data-quality indicators.
Automated reporting runs on the same curated tables. Congressional response documents, internal briefs, and recurring program reports are generated from the warehouse, with the underlying queries version-controlled and reviewable. When a number shows up in a brief, anyone on the team can trace it back to the exact query, table, and partition that produced it.
Support for Analysts and ML Practitioners
A platform that only serves dashboards is not a platform. Analysts and ML practitioners are first-class users.
- Documented, stable tables. The canonical tables are documented in a searchable catalog. Analysts do not have to guess what a column means.
- Point-in-time correct snapshots. ML practitioners can reconstruct the state of the data as it existed at any historical timestamp. This is essential for building training sets without leakage.
- Notebook environments with the right guardrails. Approved notebook environments with access to curated tables, with output egress controlled.
- Feature contracts. Where the same feature is used by multiple downstream models, it is computed once, in one place, with one definition.
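Point-in-time correctness typically rests on validity intervals: each row carries the period during which it was the current version. A minimal sketch of snapshot reconstruction, assuming hypothetical `valid_from`/`valid_to` columns:

```python
from datetime import date

def as_of(rows: list[dict], snapshot: date) -> list[dict]:
    """Reconstruct the table state as it existed at `snapshot`.

    A row is visible if its validity interval covers the snapshot date;
    valid_to is None for the currently live version of a row.
    """
    return [
        r for r in rows
        if r["valid_from"] <= snapshot
        and (r["valid_to"] is None or snapshot < r["valid_to"])
    ]
```

Training features computed via `as_of(rows, label_date)` cannot see corrections that arrived after the label was observed, which is precisely the leakage the platform is designed to prevent.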
Operational Patterns
Data Quality Monitoring
Row counts, null rates, value distributions, and referential integrity checks run on every ingest and every transformation. Anomalies fire alerts. Alerts have runbooks. Runbooks are rehearsed.
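A null-rate drift check is representative of the whole family. This sketch compares observed per-column null rates against a stored baseline; the tolerance and field names are illustrative assumptions:

```python
def null_rate_alerts(rows: list[dict], baseline: dict[str, float],
                     tolerance: float = 0.05) -> list[tuple[str, float]]:
    """Return (column, observed_null_rate) for columns that drifted
    from the baseline by more than the tolerance."""
    n = len(rows)
    alerts = []
    for col, expected in baseline.items():
        observed = sum(1 for r in rows if r.get(col) is None) / n
        if abs(observed - expected) > tolerance:
            alerts.append((col, observed))
    return alerts
```

Each alert maps to a runbook entry: which upstream source to check, who owns it, and how to quarantine the ingest batch while the question is answered.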
Incident Response
A data incident is different from a security incident, and the platform has to be able to handle both. A bad upstream file that poisons a dashboard is a data incident: we need to roll back the affected partitions cleanly. A suspected unauthorized access is a security incident: we need to freeze the relevant accounts, collect logs, and escalate. Both workflows are documented and both have been exercised.
Change Management
Schema changes, pipeline changes, and access policy changes all go through a change management process. Emergency changes are possible but they are documented and reviewed after the fact. Silent changes are not allowed.
Lessons Learned for Federal Data Engineering
1. Separate layers ruthlessly
Landing, standardization, canonical, aggregate, extract. Each layer has one job. Blurring the layers creates messes that take years to clean up.
2. Metadata is first-class
Owner, definition, lineage, classification, SLA. If you are not capturing those at the moment a table is created, you will be reverse-engineering them later.
3. Make provenance cheap
Every downstream artifact should be traceable to an upstream source without archaeology. Build provenance in; never bolt it on.
4. Design for multiple consumers
Dashboards, analysts, and ML are different consumer types with different needs. A platform that optimizes for one and treats the others as afterthoughts will fail two out of three.
5. Respect the records schedule
Federal records retention is a legal matter. Build it into your storage policies from day one.
6. Your catalog is a product
If analysts cannot find a table, it does not exist. Invest in the catalog with the same seriousness you invest in the warehouse.
7. Treat de-identification as code
Policies about what can be shared with whom should live in reviewable, version-controlled code, not in memos. If the rule is in a memo, it will eventually be violated.
8. Rehearse your incident response
Tabletop exercises cost hours. Discovering your runbook is wrong during a real incident costs days.