Production Machine Learning at SAMHSA

A federal-scale machine learning system that passed security review, went live in government production, and continues to operate today. What we built, how we built it, and what we learned about shipping ML inside the federal boundary.

Production: System went live inside the federal boundary and is still operating.

ATO-Ready: Passed HHS/SAMHSA security review with a full artifact package.

Years Live: Multi-year operational history with drift monitoring and re-training.

HHS Scope: Delivered under the Substance Abuse and Mental Health Services Administration.

Agency Context: SAMHSA and the HHS Mission

The Substance Abuse and Mental Health Services Administration (SAMHSA) is the agency within the U.S. Department of Health and Human Services (HHS) charged with leading public health efforts to advance the behavioral health of the nation. SAMHSA administers programs and funding that touch every state, every tribal nation, and tens of thousands of treatment facilities. Its work spans the full behavioral-health continuum: prevention, treatment, recovery support, crisis response, and population-level surveillance.

Because SAMHSA sits inside HHS, every system it operates must meet federal security controls (NIST SP 800-53 / FISMA Moderate at minimum, often with HIPAA considerations layered on top), follow HHS data governance standards, and be auditable end-to-end. Behavioral health data is among the most sensitive data the federal government holds, and the policy environment around it is correspondingly tight.

That is the environment in which this machine learning system was designed, reviewed, and put into production.

The Problem Space

Federal health agencies collect enormous quantities of structured and semi-structured data: treatment admissions, discharge records, survey instruments, claims, provider attributes, geographic attributes, and longitudinal outcomes. Analysts have traditionally used that data for descriptive statistics, program reporting, and congressional briefs. Machine learning opens a different door: pattern discovery, risk stratification, early signal detection, and decision support at a scale humans cannot reach by hand.

The challenge is that "machine learning" in a Silicon Valley context is a different thing from "machine learning" inside the federal boundary. In the federal boundary you inherit:

- Security controls (NIST SP 800-53 / FISMA Moderate) governing every component you deploy
- A constrained, scanned software supply chain instead of the open package ecosystem
- Data governance and policy restrictions on what data may be used, and how
- The expectation that every job, model, and prediction is auditable end-to-end

Our charter was to build a production ML system that fit inside those constraints and still delivered real, repeatable value to the program office.

Public framing. This case study intentionally stays at the capability level. The model family, feature set, and exact program within SAMHSA are not disclosed. What is disclosed matches what is already public: Precision Federal delivered and maintains a production ML system at SAMHSA that has passed federal security review.

Technical Approach

Data Ingestion

The first engineering problem was getting the right data into the right place with the right guardrails. We built an ingestion layer that pulled from the authoritative federal data systems, normalized schemas across vintages, resolved encoding drift (federal datasets accumulate small schema changes over years), and produced a single canonical analytic table versioned by cohort and release.

Every ingestion job wrote a manifest: source system, record count, hash, run timestamp, job ID, and the operator identity under which it ran. That manifest became part of the audit trail the security team later reviewed.
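A minimal sketch of such a manifest writer, with hypothetical source and job names (the production implementation is not disclosed):

```python
import hashlib
import json
import os
from datetime import datetime, timezone

def write_manifest(source_system, job_id, records):
    """Build an ingestion manifest: source, record count, content hash,
    run timestamp, and the operator identity the job ran under."""
    digest = hashlib.sha256()
    for rec in records:          # hash the raw record bytes in order
        digest.update(rec)
    return {
        "source_system": source_system,
        "job_id": job_id,
        "record_count": len(records),
        "sha256": digest.hexdigest(),
        "run_timestamp": datetime.now(timezone.utc).isoformat(),
        "operator": os.getenv("USER", "service-account"),
    }

manifest = write_manifest("state_submissions", "ingest-0001",
                          [b"row-1\n", b"row-2\n"])
print(json.dumps(manifest, indent=2))
```

Because the manifest records a content hash rather than just a count, a reviewer can later verify that the analytic table was built from exactly the bytes the manifest claims.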

Feature Engineering

Federal behavioral-health data is rich in categorical attributes, temporal attributes, and missingness patterns that themselves carry signal. Feature engineering focused on three disciplines:

- Encoding high-cardinality categorical attributes against the controlled vocabularies the source systems use
- Constructing temporal features that respect cohort and reporting boundaries
- Treating missingness as signal rather than noise, with explicit indicators
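To make the missingness point concrete, here is an illustrative (not production) encoder that emits an explicit missingness indicator alongside the one-hot categories, so absence itself becomes a feature:

```python
def encode_with_missingness(value, categories):
    """One-hot encode a categorical field, plus an explicit missingness
    indicator so the model can learn from the absence of a value."""
    features = {f"is_{c}": 0 for c in categories}
    features["is_missing"] = 1 if value is None else 0
    if value in categories:
        features[f"is_{value}"] = 1
    return features

print(encode_with_missingness("outpatient", ["outpatient", "residential"]))
print(encode_with_missingness(None, ["outpatient", "residential"]))
```

Imputing a default value instead would silently erase that signal; the indicator keeps it available to the model and visible to reviewers.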

Model Training

Training ran on an approved federal compute environment with a constrained software supply chain. Every Python package came from a mirrored, scanned, approved index. Model code was reviewed not only for correctness but for any outbound call, any dependency that could phone home, any serialization format that could execute on load. Pickle was replaced with safer formats where possible.
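The serialization point can be illustrated with a toy model (hypothetical class, not the deployed model family): persisting only parameters as JSON means loading can never execute code, which is the risk pickle carries.

```python
import json

class LinearModel:
    """Toy stand-in for a trained model; only its parameters need to persist."""
    def __init__(self, coef, intercept):
        self.coef = coef
        self.intercept = intercept

    def predict(self, x):
        return sum(c * xi for c, xi in zip(self.coef, x)) + self.intercept

def to_json(model):
    # JSON carries only data: deserializing it cannot run arbitrary code,
    # unlike pickle, whose load step can execute on load.
    return json.dumps({"coef": model.coef, "intercept": model.intercept})

def from_json(payload):
    params = json.loads(payload)
    return LinearModel(params["coef"], params["intercept"])

original = LinearModel([0.5, -0.2], 1.0)
restored = from_json(to_json(original))
print(restored.predict([2.0, 1.0]))  # identical to the original model's output
```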

We trained multiple candidate model families and compared them on a locked holdout that the modeling team did not touch during development. The holdout was refreshed on a defined cadence so we could measure honest out-of-time performance, not just random-split performance. That discipline mattered later when the security and program teams asked, "How do you know it still works?"
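An out-of-time split is simple to state in code. This sketch (hypothetical field names) splits on event date rather than at random, so the holdout measures performance on genuinely future data:

```python
from datetime import date

def out_of_time_split(records, cutoff):
    """Train on records before the cutoff date; hold out everything after.
    Unlike a random split, this measures out-of-time performance."""
    train = [r for r in records if r["event_date"] < cutoff]
    holdout = [r for r in records if r["event_date"] >= cutoff]
    return train, holdout

data = [
    {"event_date": date(2022, 3, 1), "y": 0},
    {"event_date": date(2023, 6, 1), "y": 1},
    {"event_date": date(2024, 1, 15), "y": 1},
]
train, holdout = out_of_time_split(data, date(2023, 1, 1))
print(len(train), len(holdout))  # 1 2
```

Refreshing the cutoff on a defined cadence, as described above, turns "does it still work?" into a measurement rather than an opinion.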

Validation

Validation was split into three independent phases:

- Statistical validation on the locked out-of-time holdout
- Subgroup performance analysis across the populations the program serves
- Operational validation of failure modes and runbook procedures in a production-like environment

Federal Security Review

The security review is where most federal ML projects stall, so it deserves its own section. Review at HHS covers, at minimum:

- Implementation of NIST SP 800-53 controls at the FISMA Moderate baseline
- Data handling, governance, and access boundaries
- Software supply-chain provenance for every dependency
- End-to-end auditability of jobs, models, and predictions

We prepared for this in parallel with engineering, not after. Every engineering decision carried a security consequence, and we wanted the security team reading living documentation, not a last-minute artifact dump.

The teams that succeed at federal ML treat the security review as a design input, not a tollbooth at the end of the sprint.

Deployment: Production Architecture

The production architecture prioritized three things, in order: reproducibility, observability, and graceful degradation.

Drift Detection

Federal data changes, slowly and without warning. A new reporting guideline, a change in a state's submission pipeline, a refresh of a controlled vocabulary, and your inputs have shifted. We implemented drift detection on both inputs and outputs, with alerts that distinguish "a feature moved" from "the world changed." The first is often a data bug; the second is sometimes a signal.
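One common way to implement input drift checks is the Population Stability Index; the sketch below is illustrative, not the production detector:

```python
import math

def psi(expected, observed, bins=10):
    """Population Stability Index of a numeric feature between a baseline
    sample and a recent sample. Common rule of thumb: < 0.1 stable,
    0.1-0.25 moderate shift, > 0.25 major shift worth an alert."""
    lo, hi = min(expected), max(expected)
    span = (hi - lo) or 1.0

    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            idx = min(max(int((x - lo) / span * bins), 0), bins - 1)
            counts[idx] += 1
        # small smoothing term avoids log(0) for empty bins
        return [(c + 1e-6) / (len(sample) + bins * 1e-6) for c in counts]

    return sum((o - e) * math.log(o / e)
               for e, o in zip(fractions(expected), fractions(observed)))

baseline = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]
print(round(psi(baseline, baseline), 6))  # 0.0: identical distributions
print(psi(baseline, shifted) > 0.25)      # True: major shift
```

Running a check like this per feature distinguishes "a feature moved" (investigate the pipeline) from a broad shift across many features (the world may have changed).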

What's Live Today

The system went through federal security review, received the authorization it needed, and deployed into SAMHSA's production environment. It has been operating in federal production ever since, with scheduled re-training, ongoing drift monitoring, and a defined operational runbook. It is still running today. That fact, unembellished, is the case study.

Why this matters. There are many vendors who can build a Jupyter notebook. There are very few who can build a notebook, turn it into a production service, pass federal security review, and keep it operational for years. That last 80% is where federal programs actually need help.

Lessons Learned for Federal ML Projects

1. The model is 20% of the work

Most of the engineering effort goes into data plumbing, feature contracts, deployment hardening, and security evidence. Budgets and timelines that assume "mostly modeling" will be wrong.

2. Treat the security team as a customer

They are not adversaries. They are reviewers protecting the agency and the public. Give them clean documentation, early. A security reviewer who can read your system in a morning will move faster than one who has to reverse-engineer it.

3. Write the runbook before you think you need it

"What does an operator do when X alert fires?" should be answerable in one page. If it is not, the system is not ready for production.

4. Lock your holdout and never look at it

The temptation to peek is constant. Peek once, and your reported performance is fiction.

5. Make policy part of the code

If a feature is policy-restricted, the restriction should live in the feature pipeline, not in a memo. Memos drift. Code does not.
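A minimal sketch of the idea, with hypothetical field names: the allowlist lives next to the pipeline code, and a guard fails fast if the approved set ever drifts into restricted territory.

```python
# Hypothetical policy lists; the real restricted fields are not disclosed.
RESTRICTED = {"detail_code", "street_address"}
APPROVED = {"age_band", "region", "admission_month"}

# Fail at import time if the approved list ever violates policy.
assert not (APPROVED & RESTRICTED), "approved feature list violates policy"

def select_features(row):
    """The allowlist, not a memo, decides what the model may see."""
    return {k: v for k, v in row.items() if k in APPROVED}

raw = {"age_band": "25-34", "region": "midwest", "street_address": "123 Main St"}
print(select_features(raw))  # street_address never reaches the model
```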

6. Versioning is not optional

Data version, feature version, model version, code version, image version. All five. With hashes. You will be asked to reconstruct a production prediction from two years ago, and you need to be able to do it.
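A sketch of what pinning all five versions into one provenance record might look like (hypothetical helper, illustrative inputs):

```python
import hashlib
import json

def sha256_of(payload: bytes) -> str:
    return hashlib.sha256(payload).hexdigest()

def provenance_record(data, features, model, code, image_digest):
    """Pin data, feature, model, code, and image versions with hashes so a
    production prediction can be reconstructed exactly, years later."""
    return {
        "data_sha256": sha256_of(data),
        "feature_sha256": sha256_of(features),
        "model_sha256": sha256_of(model),
        "code_sha256": sha256_of(code),
        "image_digest": image_digest,  # e.g. the container registry digest
    }

rec = provenance_record(b"data-release", b"feature-spec",
                        b"model-artifact", b"source-tree",
                        "sha256:deadbeef")
print(json.dumps(rec, indent=2))
```

Storing this record beside every batch of predictions is what makes the two-years-later reconstruction request answerable.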

7. Plan for re-training from day one

The question is not "will we re-train?" It is "what triggers a re-train, who approves it, and what validation must it pass?" Answer that before you deploy the first model, not after the first drift alert.
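The trigger question can be answered as data the system evaluates, rather than prose in a memo. The thresholds below are hypothetical placeholders, not the production policy:

```python
# Hypothetical re-training policy, expressed as data the system can evaluate.
RETRAIN_POLICY = {
    "max_days_since_training": 180,  # scheduled cadence
    "psi_threshold": 0.25,           # input drift alert level
    "min_holdout_auc": 0.70,         # out-of-time performance floor
}

def should_retrain(days_since_training, input_psi, holdout_auc,
                   policy=RETRAIN_POLICY):
    """Return the list of tripped triggers; an empty list means no re-train."""
    tripped = []
    if days_since_training > policy["max_days_since_training"]:
        tripped.append("schedule")
    if input_psi > policy["psi_threshold"]:
        tripped.append("input_drift")
    if holdout_auc < policy["min_holdout_auc"]:
        tripped.append("performance")
    return tripped

print(should_retrain(days_since_training=200, input_psi=0.1, holdout_auc=0.8))
# ['schedule']
```

The "who approves it" and "what validation must it pass" parts then attach to each trigger name in the runbook.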

8. Measure on subgroups

In federal health data, aggregate accuracy hides a lot. Subgroup analysis is not a nice-to-have; it is a design requirement.
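The mechanics are simple; what matters is making them a standing part of validation. An illustrative per-subgroup accuracy computation (hypothetical field names):

```python
from collections import defaultdict

def subgroup_accuracy(rows):
    """Accuracy per subgroup, so an aggregate number cannot hide
    uneven performance across populations."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in rows:
        totals[r["group"]] += 1
        hits[r["group"]] += int(r["pred"] == r["label"])
    return {g: hits[g] / totals[g] for g in totals}

rows = [
    {"group": "A", "pred": 1, "label": 1},
    {"group": "A", "pred": 0, "label": 0},
    {"group": "B", "pred": 1, "label": 0},
    {"group": "B", "pred": 1, "label": 1},
]
print(subgroup_accuracy(rows))  # {'A': 1.0, 'B': 0.5}
```

Here the aggregate accuracy is 0.75, which looks fine until the per-group view shows group B performing at coin-flip level.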

FAQ

Can you share the specific model, dataset, or contract for the SAMHSA engagement?
No. Public-facing documentation stays at the capability and architecture level. Specific implementation details, contract numbers, and model internals are not disclosed. We can speak to those under an appropriate agreement.
Is the system still running?
Yes. It went through federal security review, deployed into production, and continues to operate.
What federal security standards applied?
HHS inherits FISMA and NIST SP 800-53 controls, with additional HHS and SAMHSA-specific policies layered on. Depending on data category, HIPAA considerations also apply.
What tools and frameworks do you use for federal ML?
Python is the core language. We favor mature, well-maintained libraries with clean supply-chain provenance over bleeding-edge ones. Training runs on approved federal compute; deployment targets whatever approved environment the agency operates.
Do you help agencies prepare the ATO package?
Yes. We build the security artifacts alongside the system so the ATO package is a byproduct of the engineering work, not a separate project.
Can this approach be applied outside HHS?
Yes. The discipline generalizes to DoD, intelligence community, and civilian agencies. Specific controls shift (NIST 800-171, CMMC, IL4/IL5/IL6), but the engineering spine is the same.

Have a federal ML problem that needs to ship?

We build systems that pass security review and keep running. Tell us the problem and the constraints.

Email Bo Peng →