- Production: System went live inside the federal boundary and is still operating.
- ATO-Ready: Passed HHS/SAMHSA security review with a full artifact package.
- Years Live: Multi-year operational history with drift monitoring and re-training.
- HHS Scope: Delivered under the Substance Abuse and Mental Health Services Administration.
Agency Context: SAMHSA and the HHS Mission
The Substance Abuse and Mental Health Services Administration (SAMHSA) is the agency within the U.S. Department of Health and Human Services (HHS) charged with leading public health efforts to advance the behavioral health of the nation. SAMHSA administers programs and funding that touch every state, every tribal nation, and tens of thousands of treatment facilities. Its work spans the full behavioral-health continuum: prevention, treatment, recovery support, crisis response, and population-level surveillance.
Because SAMHSA sits inside HHS, every system it operates must meet federal security controls (NIST SP 800-53 / FISMA Moderate at minimum, often with HIPAA considerations layered on top), follow HHS data governance standards, and be auditable end-to-end. Behavioral health data is among the most sensitive data the federal government holds, and the policy environment around it is correspondingly tight.
That is the environment in which this machine learning system was designed, reviewed, and put into production.
The Problem Space
Federal health agencies collect enormous quantities of structured and semi-structured data: treatment admissions, discharge records, survey instruments, claims, provider attributes, geographic attributes, and longitudinal outcomes. Analysts have traditionally used that data for descriptive statistics, program reporting, and congressional briefs. Machine learning opens a different door: pattern discovery, risk stratification, early signal detection, and decision support at a scale humans cannot reach by hand.
The challenge is that "machine learning" in a Silicon Valley context means something different from "machine learning" inside the federal boundary. Inside that boundary you inherit:
- A data system whose schema was designed in the 1990s and has carried forward through multiple technology generations.
- A security environment where every library, every model artifact, and every network egress has to be justified.
- A review cadence measured in months, not sprints.
- An operational environment where "it broke, we are redeploying" is not an acceptable answer.
Our charter was to build a production ML system that fit inside those constraints and still delivered real, repeatable value to the program office.
Technical Approach
Data Ingestion
The first engineering problem was getting the right data into the right place with the right guardrails. We built an ingestion layer that pulled from the authoritative federal data systems, normalized schemas across vintages, resolved coding and encoding drift (federal datasets accumulate small schema changes over the years), and produced a single canonical analytic table versioned by cohort and release.
Every ingestion job wrote a manifest: source system, record count, hash, run timestamp, job ID, and the operator identity under which it ran. That manifest became part of the audit trail the security team later reviewed.
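A manifest writer in this spirit can be sketched as follows. The field names and the `TEDS-A` source label are illustrative, not the production schema; the point is that every run emits a self-describing, hashable audit record.

```python
import datetime
import hashlib
import json

def write_manifest(source_system: str, records: list, job_id: str, operator: str) -> dict:
    """Build an audit manifest for one ingestion run (illustrative field names)."""
    digest = hashlib.sha256()
    for rec in records:
        digest.update(rec)  # incremental hash over raw record bytes
    return {
        "source_system": source_system,
        "record_count": len(records),
        "sha256": digest.hexdigest(),
        "run_timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "job_id": job_id,
        "operator": operator,
    }

manifest = write_manifest("TEDS-A", [b"row-1", b"row-2"], job_id="ing-0042", operator="svc-ingest")
print(json.dumps(manifest, indent=2))
```

Because the manifest carries a content hash plus the operator identity, a reviewer can later verify that a given analytic table was built from exactly the records the manifest describes.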
Feature Engineering
Federal behavioral-health data is rich in categorical attributes, temporal attributes, and missingness patterns that themselves carry signal. Feature engineering focused on three disciplines:
- Stable feature contracts. Every feature had a written definition, a computed lineage, and a version. A re-trained model could not silently start consuming a subtly different feature.
- Leakage discipline. In health data, leakage is everywhere: future-coded fields, retrospective updates, outcome indicators that appear to be inputs. We wrote feature tests that explicitly asserted temporal ordering.
- Policy-safe categoricals. Some attributes that look useful (for example, certain demographic combinations) are policy-sensitive or directly restricted. The feature layer encoded those policies so a downstream modeler could not accidentally violate them.
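The leakage discipline above reduces to tests that enforce temporal ordering. A minimal sketch (the `feature_asof` and `outcome_date` field names are hypothetical, not the production schema):

```python
import datetime

def assert_no_future_leakage(rows, feature_time_key, outcome_time_key):
    """Leakage check: every feature timestamp must strictly precede its outcome
    timestamp. Raises on violation so the pipeline fails loudly, not silently."""
    violations = [
        i for i, row in enumerate(rows)
        if row[feature_time_key] >= row[outcome_time_key]
    ]
    if violations:
        raise ValueError(f"temporal-ordering violation in rows: {violations}")
    return True

rows = [
    {"feature_asof": datetime.date(2020, 1, 1), "outcome_date": datetime.date(2020, 6, 1)},
    {"feature_asof": datetime.date(2021, 3, 1), "outcome_date": datetime.date(2021, 9, 1)},
]
assert_no_future_leakage(rows, "feature_asof", "outcome_date")
```

Run as part of the feature test suite, this turns "we believe there is no leakage" into an assertion that fails the build when a retrospective update slips a future-coded value into an input.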
Model Training
Training ran on an approved federal compute environment with a constrained software supply chain. Every Python package came from a mirrored, scanned, approved index. Model code was reviewed not only for correctness but for any outbound call, any dependency that could phone home, any serialization format that could execute on load. Pickle was replaced with safer formats where possible.
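One safer-than-pickle pattern is to persist model parameters in a data-only format such as JSON. This is a simplified sketch for a linear model, not the production serialization code; the key property is that `json.load` cannot execute code on deserialization, while `pickle.load` can run arbitrary code from an untrusted artifact.

```python
import json
import os
import tempfile

def save_params(path, coef, intercept, version):
    """Persist model parameters as plain JSON (data only, no executable payload)."""
    with open(path, "w") as f:
        json.dump({"coef": coef, "intercept": intercept, "version": version}, f)

def load_params(path):
    """Load parameters back; worst case on a corrupt file is a parse error."""
    with open(path) as f:
        return json.load(f)

path = os.path.join(tempfile.gettempdir(), "model_params.json")
save_params(path, coef=[0.4, -1.2], intercept=0.1, version="v3")
params = load_params(path)
```

For model families whose state cannot be reduced to plain parameters, the same principle applies: prefer formats that are loaded as data, and treat any format that deserializes into executable objects as a supply-chain finding.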
We trained multiple candidate model families and compared them on a locked holdout that the modeling team did not touch during development. The holdout was refreshed on a defined cadence so we could measure honest out-of-time performance, not just random-split performance. That discipline mattered later when the security and program teams asked, "How do you know it still works?"
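The out-of-time discipline is easiest to see as a split on a time field rather than a random split. A toy sketch (the `admit_date` key and the cutoff are illustrative):

```python
import datetime

def out_of_time_split(rows, time_key, cutoff):
    """Split records into train (before cutoff) and a locked holdout (on/after
    cutoff), so evaluation measures performance on a later time period."""
    train = [r for r in rows if r[time_key] < cutoff]
    holdout = [r for r in rows if r[time_key] >= cutoff]
    return train, holdout

# Quarterly records spanning 2019-2021; hold out everything from 2021 onward.
rows = [{"admit_date": datetime.date(2019 + i // 4, 1 + (i % 4) * 3, 1)} for i in range(12)]
train, holdout = out_of_time_split(rows, "admit_date", datetime.date(2021, 1, 1))
```

A random split would let the model see records from the holdout's time period during training; the time-based split is what makes "does it still work on new data?" an honest question.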
Validation
Validation was split into three independent phases:
- Statistical validation. Classical metrics on the locked holdout, plus calibration, subgroup performance, and stability across time.
- Operational validation. Does the model run in the deployment environment within latency and memory budgets? Does it fail gracefully on malformed input?
- Program validation. Does the output make sense to the subject-matter experts who will consume it? SMEs have veto power; a statistically good model that produces nonsense to the people doing the work does not ship.
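The calibration check in statistical validation can be illustrated with a basic reliability table: bin predictions by score and compare the mean predicted probability with the observed positive rate in each bin. This is a generic sketch, not the production validation code.

```python
def calibration_table(probs, labels, n_bins=5):
    """Group predictions into score bins and compare mean predicted probability
    against the observed positive rate per bin (a basic reliability check)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into the top bin
        bins[idx].append((p, y))
    table = []
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            observed = sum(y for _, y in b) / len(b)
            table.append({"mean_pred": round(mean_p, 3), "observed": round(observed, 3), "n": len(b)})
    return table

# Well-calibrated toy sample: a 0.8 score is positive 80% of the time.
probs = [0.8] * 10 + [0.2] * 10
labels = [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8
table = calibration_table(probs, labels)
```

Large gaps between `mean_pred` and `observed` in any bin mean the model's scores cannot be read as probabilities, which matters when SMEs use thresholds on the output.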
Federal Security Review
The security review is where most federal ML projects stall, so it deserves its own section. Review at HHS covers, at minimum:
- System boundary definition. A clean diagram of every component, data flow, account, role, and network path.
- Data classification. What categories of data the system touches, and the legal authority for each.
- Control mapping. Every applicable NIST SP 800-53 control, with evidence: how it is implemented, who owns it, and how it is tested.
- Supply chain. Every dependency, with provenance and vulnerability scan results.
- Risk assessment and POA&M. A frank accounting of residual risks and a plan of action and milestones for anything not fully mitigated.
- Continuous monitoring plan. How the system stays secure after go-live.
We prepared for this in parallel with engineering, not after. Every engineering decision carried a security consequence, and we wanted the security team reading living documentation, not a last-minute artifact dump.
The teams that succeed at federal ML treat the security review as a design input, not a tollbooth at the end of the sprint.
Deployment: Production Architecture
The production architecture prioritized three things, in order: reproducibility, observability, and graceful degradation.
- Reproducibility. Every deployed artifact was tied to a git SHA, a training data manifest, and a software bill of materials. Any output the system ever produced could be traced back to an exact model and exact inputs.
- Observability. Inference volume, latency, input distribution, and output distribution were logged and surfaced on dashboards. Operators did not need to SSH to know the system's state.
- Graceful degradation. If a dependency was slow or a feature was missing, the system returned a defined partial result rather than crashing. Federal operational environments do not tolerate silent failure.
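The graceful-degradation principle can be sketched as a scoring wrapper that substitutes documented defaults and flags the result as degraded, rather than raising. Function and field names here are hypothetical.

```python
def score_with_fallback(features, model_fn, required, defaults):
    """Return a defined partial result when inputs are incomplete: fill missing
    required features from documented defaults and mark the output as degraded."""
    missing = [k for k in required if k not in features]
    if missing:
        filled = {**{k: defaults[k] for k in missing}, **features}
        return {"score": model_fn(filled), "status": "degraded", "missing": missing}
    return {"score": model_fn(features), "status": "ok", "missing": []}

# Toy linear scorer over two illustrative features.
model_fn = lambda f: 0.3 * f["a"] + 0.7 * f["b"]
result = score_with_fallback(
    {"a": 1.0}, model_fn, required=["a", "b"], defaults={"a": 0.0, "b": 0.5}
)
```

The `status` and `missing` fields matter as much as the score: downstream consumers and dashboards can see that a result was produced under degraded conditions instead of treating it as a full-fidelity answer.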
Drift Detection
Federal data changes, slowly and without warning. A new reporting guideline, a change in a state's submission pipeline, a refresh of a controlled vocabulary, and your inputs have shifted. We implemented drift detection on both inputs and outputs, with alerts that distinguish "a feature moved" from "the world changed." The first is often a data bug; the second is sometimes a signal.
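One common input-drift statistic is the Population Stability Index (PSI), which compares a feature's current distribution against a baseline. A minimal stdlib sketch (the 0.2 alert threshold is a widely used convention, not a law):

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample and a current sample.
    Rule of thumb: PSI > 0.2 suggests meaningful drift in that feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def bin_fractions(sample):
        counts = [0] * bins
        for x in sample:
            idx = int((x - lo) / width)
            counts[max(0, min(idx, bins - 1))] += 1  # clamp out-of-range values
        # Additive smoothing so no bin fraction is zero (log would blow up).
        return [(c + 0.5) / (len(sample) + 0.5 * bins) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [i / 100 for i in range(100)]          # roughly uniform on [0, 1)
shifted = [0.5 + i / 200 for i in range(100)]     # mass pushed to the upper half
```

In practice a PSI alert on one feature usually points at a data bug in one source, while simultaneous alerts across many features and on the output distribution are the "world changed" signal that triggers the re-training conversation.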
What's Live Today
The system went through federal security review, received the authorization it needed, and deployed into SAMHSA's production environment. It has been operating in federal production ever since, with scheduled re-training, ongoing drift monitoring, and a defined operational runbook. It is still running today. That fact, unembellished, is the case study.
Lessons Learned for Federal ML Projects
1. The model is 20% of the work
Most of the engineering effort goes into data plumbing, feature contracts, deployment hardening, and security evidence. Budgets and timelines that assume "mostly modeling" will be wrong.
2. Treat the security team as a customer
They are not adversaries. They are reviewers protecting the agency and the public. Give them clean documentation, early. A security reviewer who can read your system in a morning will move faster than one who has to reverse-engineer it.
3. Write the runbook before you think you need it
"What does an operator do when X alert fires?" should be answerable in one page. If it is not, the system is not ready for production.
4. Lock your holdout and never look at it
The temptation to peek is constant. Peek once, and your reported performance is fiction.
5. Make policy part of the code
If a feature is policy-restricted, the restriction should live in the feature pipeline, not in a memo. Memos drift. Code does not.
6. Versioning is not optional
Data version, feature version, model version, code version, image version. All five. With hashes. You will be asked to reconstruct a production prediction from two years ago, and you need to be able to do it.
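Binding the five versions together can be as simple as one fingerprint record with content hashes. The field names here are illustrative; the production system used manifests and an SBOM, but the shape of the record is the same idea.

```python
import hashlib
import json

def artifact_fingerprint(data_bytes: bytes, feature_spec: dict, model_bytes: bytes,
                         git_sha: str, image_digest: str) -> dict:
    """Bind data, feature, model, code, and image versions into one hash record,
    so any production prediction can be traced back to exact inputs."""
    h = lambda b: hashlib.sha256(b).hexdigest()
    return {
        "data_sha256": h(data_bytes),
        # sort_keys makes the spec hash deterministic across runs
        "feature_spec_sha256": h(json.dumps(feature_spec, sort_keys=True).encode()),
        "model_sha256": h(model_bytes),
        "git_sha": git_sha,
        "image_digest": image_digest,
    }

fp = artifact_fingerprint(
    b"cohort-2023 canonical table bytes",
    {"age_band": "v4", "admit_quarter": "v2"},
    b"serialized model parameters",
    git_sha="abc1234",
    image_digest="sha256:deadbeef",
)
```

Stored alongside every prediction batch, a record like this is what makes "reconstruct a production prediction from two years ago" a lookup instead of an archaeology project.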
7. Plan for re-training from day one
The question is not "will we re-train?" It is "what triggers a re-train, who approves it, and what validation must it pass?" Answer that before you deploy the first model, not after the first drift alert.
8. Measure on subgroups
In federal health data, aggregate accuracy hides a lot. Subgroup analysis is not a nice-to-have; it is a design requirement.
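How much aggregate accuracy can hide is easy to demonstrate with a toy example (the `group`, `pred`, and `label` keys are illustrative): a model that is perfect on a 90% majority group and wrong on everyone in a 10% minority group still reports 90% overall accuracy.

```python
from collections import defaultdict

def subgroup_accuracy(records, group_key="group"):
    """Per-subgroup accuracy alongside the aggregate, so a failing subgroup
    cannot hide behind a good overall number."""
    stats = defaultdict(lambda: [0, 0])  # group -> [correct, total]
    for r in records:
        s = stats[r[group_key]]
        s[0] += int(r["pred"] == r["label"])
        s[1] += 1
    per_group = {g: c / n for g, (c, n) in stats.items()}
    overall = sum(c for c, _ in stats.values()) / sum(n for _, n in stats.values())
    return overall, per_group

records = (
    [{"group": "A", "pred": 1, "label": 1}] * 90   # majority group: all correct
    + [{"group": "B", "pred": 1, "label": 0}] * 10  # minority group: all wrong
)
overall, per_group = subgroup_accuracy(records)
```

An evaluation report that only printed `overall` would ship this model; the per-group breakdown is what surfaces the failure, which is why subgroup analysis is a design requirement rather than a nice-to-have.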