- Production: System went live inside the federal boundary and is still operating.
- ATO-Ready: Passed HHS/SAMHSA security review with a full artifact package.
- Years Live: Multi-year operational history with drift monitoring and re-training.
- HHS Scope: Delivered under the Substance Abuse and Mental Health Services Administration.
Agency Context: SAMHSA and the HHS Mission
The Substance Abuse and Mental Health Services Administration (SAMHSA) is the agency within the U.S. Department of Health and Human Services (HHS) charged with leading public health efforts to advance the behavioral health of the nation. SAMHSA administers programs and funding that touch every state, every tribal nation, and tens of thousands of treatment facilities. Its work spans the full behavioral-health continuum: prevention, treatment, recovery support, crisis response, and population-level surveillance.
Because SAMHSA sits inside HHS, every system it operates must meet federal security controls (NIST SP 800-53 / FISMA Moderate at minimum, often with HIPAA considerations layered on top), follow HHS data governance standards, and be auditable end-to-end. Behavioral health data is among the most sensitive data the federal government holds, and the policy environment around it is correspondingly tight.
That is the environment in which this machine learning system was designed, reviewed, and put into production.
The Problem Space
Federal health agencies collect enormous quantities of structured and semi-structured data: treatment admissions, discharge records, survey instruments, claims, provider attributes, geographic attributes, and longitudinal outcomes. Analysts have traditionally used that data for descriptive statistics, program reporting, and congressional briefs. Machine learning opens a different door: pattern discovery, risk stratification, early signal detection, and decision support at a scale humans cannot reach by hand.
The challenge is that "machine learning" in a Silicon Valley context means something different from "machine learning" inside the federal boundary. Inside that boundary you inherit:
- A data system whose schema was designed in the 1990s and has carried forward through multiple technology generations.
- A security environment where every library, every model artifact, and every network egress has to be justified.
- A review cadence measured in months, not sprints.
- An operational environment where "it broke, we are redeploying" is not an acceptable answer.
Our charter was to build a production ML system that fit inside those constraints and still delivered real, repeatable value to the program office.
Technical Approach
Data Ingestion
The first engineering problem was getting the right data into the right place with the right guardrails. We built an ingestion layer that pulled from the authoritative federal data systems, normalized schemas across vintages, resolved coding and encoding drift (federal datasets accumulate small schema changes over the years), and produced a single canonical analytic table versioned by cohort and release.
Every ingestion job wrote a manifest: source system, record count, hash, run timestamp, job ID, and the operator identity under which it ran. That manifest became part of the audit trail the security team later reviewed.
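A manifest writer in this spirit can be sketched as follows. The field names and the `TEDS-A` source label are illustrative, not the production schema; the point is that every run emits a self-describing, hashable audit record.

```python
import datetime
import hashlib
import json

def write_manifest(source_system: str, records: list, job_id: str, operator: str) -> dict:
    """Build an audit manifest for one ingestion run (illustrative field names)."""
    digest = hashlib.sha256()
    for rec in records:
        digest.update(rec)  # incremental hash over raw record bytes
    return {
        "source_system": source_system,
        "record_count": len(records),
        "sha256": digest.hexdigest(),
        "run_timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "job_id": job_id,
        "operator": operator,
    }

manifest = write_manifest("TEDS-A", [b"row-1", b"row-2"], job_id="ing-0042", operator="svc-ingest")
print(json.dumps(manifest, indent=2))
```

Because the manifest carries a content hash plus the operator identity, a reviewer can later verify that a given analytic table was built from exactly the records the manifest describes.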
Feature Engineering
Federal behavioral-health data is rich in categorical attributes, temporal attributes, and missingness patterns that themselves carry signal. Feature engineering focused on three disciplines:
- Stable feature contracts. Every feature had a written definition, a computed lineage, and a version. A re-trained model could not silently start consuming a subtly different feature.
- Leakage discipline. In health data, leakage is everywhere: future-coded fields, retrospective updates, outcome indicators that appear to be inputs. We wrote feature tests that explicitly asserted temporal ordering.
- Policy-safe categoricals. Some attributes that look useful (for example, certain demographic combinations) are policy-sensitive or directly restricted. The feature layer encoded those policies so a downstream modeler could not accidentally violate them.
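The leakage discipline above reduces to tests that enforce temporal ordering. A minimal sketch (the `feature_asof` and `outcome_date` field names are hypothetical, not the production schema):

```python
import datetime

def assert_no_future_leakage(rows, feature_time_key, outcome_time_key):
    """Leakage check: every feature timestamp must strictly precede its outcome
    timestamp. Raises on violation so the pipeline fails loudly, not silently."""
    violations = [
        i for i, row in enumerate(rows)
        if row[feature_time_key] >= row[outcome_time_key]
    ]
    if violations:
        raise ValueError(f"temporal-ordering violation in rows: {violations}")
    return True

rows = [
    {"feature_asof": datetime.date(2020, 1, 1), "outcome_date": datetime.date(2020, 6, 1)},
    {"feature_asof": datetime.date(2021, 3, 1), "outcome_date": datetime.date(2021, 9, 1)},
]
assert_no_future_leakage(rows, "feature_asof", "outcome_date")
```

Run as part of the feature test suite, this turns "we believe there is no leakage" into an assertion that fails the build when a retrospective update slips a future-coded value into an input.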
Model Training
Training ran on an approved federal compute environment with a constrained software supply chain. Every Python package came from a mirrored, scanned, approved index. Model code was reviewed not only for correctness but for any outbound call, any dependency that could phone home, any serialization format that could execute on load. Pickle was replaced with safer formats where possible.
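One safer-than-pickle pattern is to persist model parameters in a data-only format such as JSON. This is a simplified sketch for a linear model, not the production serialization code; the key property is that `json.load` cannot execute code on deserialization, while `pickle.load` can run arbitrary code from an untrusted artifact.

```python
import json
import os
import tempfile

def save_params(path, coef, intercept, version):
    """Persist model parameters as plain JSON (data only, no executable payload)."""
    with open(path, "w") as f:
        json.dump({"coef": coef, "intercept": intercept, "version": version}, f)

def load_params(path):
    """Load parameters back; worst case on a corrupt file is a parse error."""
    with open(path) as f:
        return json.load(f)

path = os.path.join(tempfile.gettempdir(), "model_params.json")
save_params(path, coef=[0.4, -1.2], intercept=0.1, version="v3")
params = load_params(path)
```

For model families whose state cannot be reduced to plain parameters, the same principle applies: prefer formats that are loaded as data, and treat any format that deserializes into executable objects as a supply-chain finding.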
We trained multiple candidate model families and compared them on a locked holdout that the modeling team did not touch during development. The holdout was refreshed on a defined cadence so we could measure honest out-of-time performance, not just random-split performance. That discipline mattered later when the security and program teams asked, "How do you know it still works?"
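The out-of-time discipline is easiest to see as a split on a time field rather than a random split. A toy sketch (the `admit_date` key and the cutoff are illustrative):

```python
import datetime

def out_of_time_split(rows, time_key, cutoff):
    """Split records into train (before cutoff) and a locked holdout (on/after
    cutoff), so evaluation measures performance on a later time period."""
    train = [r for r in rows if r[time_key] < cutoff]
    holdout = [r for r in rows if r[time_key] >= cutoff]
    return train, holdout

# Quarterly records spanning 2019-2021; hold out everything from 2021 onward.
rows = [{"admit_date": datetime.date(2019 + i // 4, 1 + (i % 4) * 3, 1)} for i in range(12)]
train, holdout = out_of_time_split(rows, "admit_date", datetime.date(2021, 1, 1))
```

A random split would let the model see records from the holdout's time period during training; the time-based split is what makes "does it still work on new data?" an honest question.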
Validation
Validation was split into three independent phases:
- Statistical validation. Classical metrics on the locked holdout, plus calibration, subgroup performance, and stability across time.
- Operational validation. Does the model run in the deployment environment within latency and memory budgets? Does it fail gracefully on malformed input?
- Program validation. Does the output make sense to the subject-matter experts who will consume it? SMEs have veto power; a statistically good model that produces nonsense to the people doing the work does not ship.
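The calibration check in statistical validation can be illustrated with a basic reliability table: bin predictions by score and compare the mean predicted probability with the observed positive rate in each bin. This is a generic sketch, not the production validation code.

```python
def calibration_table(probs, labels, n_bins=5):
    """Group predictions into score bins and compare mean predicted probability
    against the observed positive rate per bin (a basic reliability check)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into the top bin
        bins[idx].append((p, y))
    table = []
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            observed = sum(y for _, y in b) / len(b)
            table.append({"mean_pred": round(mean_p, 3), "observed": round(observed, 3), "n": len(b)})
    return table

# Well-calibrated toy sample: a 0.8 score is positive 80% of the time.
probs = [0.8] * 10 + [0.2] * 10
labels = [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8
table = calibration_table(probs, labels)
```

Large gaps between `mean_pred` and `observed` in any bin mean the model's scores cannot be read as probabilities, which matters when SMEs use thresholds on the output.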
Federal Security Review
The security review is where most federal ML projects stall, so it deserves its own section. Review at HHS covers, at minimum:
- System boundary definition. A clean diagram of every component, data flow, account, role, and network path.
- Data classification. What categories of data the system touches, and the legal authority for each.
- Control mapping. Every applicable NIST SP 800-53 control, with evidence: how it is implemented, who owns it, and how it is tested.
- Supply chain. Every dependency, with provenance and vulnerability scan results.
- Risk assessment and POA&M. A frank accounting of residual risks and a plan of action and milestones for anything not fully mitigated.
- Continuous monitoring plan. How the system stays secure after go-live.
We prepared for this in parallel with engineering, not after. Every engineering decision carried a security consequence, and we wanted the security team reading living documentation, not a last-minute artifact dump.
The teams that succeed at federal ML treat the security review as a design input, not a tollbooth at the end of the sprint.
Deployment: Production Architecture
The production architecture prioritized three things, in order: reproducibility, observability, and graceful degradation.
- Reproducibility. Every deployed artifact was tied to a git SHA, a training data manifest, and a software bill of materials. Any output the system ever produced could be traced back to an exact model and exact inputs.
- Observability. Inference volume, latency, input distribution, and output distribution were logged and surfaced on dashboards. Operators did not need to SSH to know the system's state.
- Graceful degradation. If a dependency was slow or a feature was missing, the system returned a defined partial result rather than crashing. Federal operational environments do not tolerate silent failure.
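The graceful-degradation principle can be sketched as a scoring wrapper that substitutes documented defaults and flags the result as degraded, rather than raising. Function and field names here are hypothetical.

```python
def score_with_fallback(features, model_fn, required, defaults):
    """Return a defined partial result when inputs are incomplete: fill missing
    required features from documented defaults and mark the output as degraded."""
    missing = [k for k in required if k not in features]
    if missing:
        filled = {**{k: defaults[k] for k in missing}, **features}
        return {"score": model_fn(filled), "status": "degraded", "missing": missing}
    return {"score": model_fn(features), "status": "ok", "missing": []}

# Toy linear scorer over two illustrative features.
model_fn = lambda f: 0.3 * f["a"] + 0.7 * f["b"]
result = score_with_fallback(
    {"a": 1.0}, model_fn, required=["a", "b"], defaults={"a": 0.0, "b": 0.5}
)
```

The `status` and `missing` fields matter as much as the score: downstream consumers and dashboards can see that a result was produced under degraded conditions instead of treating it as a full-fidelity answer.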
Drift Detection
Federal data changes, slowly and without warning. A new reporting guideline, a change in a state's submission pipeline, a refresh of a controlled vocabulary, and your inputs have shifted. We implemented drift detection on both inputs and outputs, with alerts that distinguish "a feature moved" from "the world changed." The first is often a data bug; the second is sometimes a signal.
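One common input-drift statistic is the Population Stability Index (PSI), which compares a feature's current distribution against a baseline. A minimal stdlib sketch (the 0.2 alert threshold is a widely used convention, not a law):

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample and a current sample.
    Rule of thumb: PSI > 0.2 suggests meaningful drift in that feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def bin_fractions(sample):
        counts = [0] * bins
        for x in sample:
            idx = int((x - lo) / width)
            counts[max(0, min(idx, bins - 1))] += 1  # clamp out-of-range values
        # Additive smoothing so no bin fraction is zero (log would blow up).
        return [(c + 0.5) / (len(sample) + 0.5 * bins) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [i / 100 for i in range(100)]          # roughly uniform on [0, 1)
shifted = [0.5 + i / 200 for i in range(100)]     # mass pushed to the upper half
```

In practice a PSI alert on one feature usually points at a data bug in one source, while simultaneous alerts across many features and on the output distribution are the "world changed" signal that triggers the re-training conversation.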
What's Live Today
The system went through federal security review, received the authorization it needed, and deployed into SAMHSA's production environment. It has been operating in federal production ever since, with scheduled re-training, ongoing drift monitoring, and a defined operational runbook. It is still running today. That fact, unembellished, is the case study.
Lessons Learned for Federal ML Projects
1. The model is 20% of the work
Most of the engineering effort goes into data plumbing, feature contracts, deployment hardening, and security evidence. Budgets and timelines that assume "mostly modeling" will be wrong.
2. Treat the security team as a customer
They are not adversaries. They are reviewers protecting the agency and the public. Give them clean documentation, early. A security reviewer who can read your system in a morning will move faster than one who has to reverse-engineer it.
3. Write the runbook before you think you need it
"What does an operator do when X alert fires?" should be answerable in one page. If it is not, the system is not ready for production.
4. Lock your holdout and never look at it
The temptation to peek is constant. Peek once, and your reported performance is fiction.
5. Make policy part of the code
If a feature is policy-restricted, the restriction should live in the feature pipeline, not in a memo. Memos drift. Code does not.
6. Versioning is not optional
Data version, feature version, model version, code version, image version. All five. With hashes. You will be asked to reconstruct a production prediction from two years ago, and you need to be able to do it.
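Binding the five versions together can be as simple as one fingerprint record with content hashes. The field names here are illustrative; the production system used manifests and an SBOM, but the shape of the record is the same idea.

```python
import hashlib
import json

def artifact_fingerprint(data_bytes: bytes, feature_spec: dict, model_bytes: bytes,
                         git_sha: str, image_digest: str) -> dict:
    """Bind data, feature, model, code, and image versions into one hash record,
    so any production prediction can be traced back to exact inputs."""
    h = lambda b: hashlib.sha256(b).hexdigest()
    return {
        "data_sha256": h(data_bytes),
        # sort_keys makes the spec hash deterministic across runs
        "feature_spec_sha256": h(json.dumps(feature_spec, sort_keys=True).encode()),
        "model_sha256": h(model_bytes),
        "git_sha": git_sha,
        "image_digest": image_digest,
    }

fp = artifact_fingerprint(
    b"cohort-2023 canonical table bytes",
    {"age_band": "v4", "admit_quarter": "v2"},
    b"serialized model parameters",
    git_sha="abc1234",
    image_digest="sha256:deadbeef",
)
```

Stored alongside every prediction batch, a record like this is what makes "reconstruct a production prediction from two years ago" a lookup instead of an archaeology project.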
7. Plan for re-training from day one
The question is not "will we re-train?" It is "what triggers a re-train, who approves it, and what validation must it pass?" Answer that before you deploy the first model, not after the first drift alert.
8. Measure on subgroups
In federal health data, aggregate accuracy hides a lot. Subgroup analysis is not a nice-to-have; it is a design requirement.
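How much aggregate accuracy can hide is easy to demonstrate with a toy example (the `group`, `pred`, and `label` keys are illustrative): a model that is perfect on a 90% majority group and wrong on everyone in a 10% minority group still reports 90% overall accuracy.

```python
from collections import defaultdict

def subgroup_accuracy(records, group_key="group"):
    """Per-subgroup accuracy alongside the aggregate, so a failing subgroup
    cannot hide behind a good overall number."""
    stats = defaultdict(lambda: [0, 0])  # group -> [correct, total]
    for r in records:
        s = stats[r[group_key]]
        s[0] += int(r["pred"] == r["label"])
        s[1] += 1
    per_group = {g: c / n for g, (c, n) in stats.items()}
    overall = sum(c for c, _ in stats.values()) / sum(n for _, n in stats.values())
    return overall, per_group

records = (
    [{"group": "A", "pred": 1, "label": 1}] * 90   # majority group: all correct
    + [{"group": "B", "pred": 1, "label": 0}] * 10  # minority group: all wrong
)
overall, per_group = subgroup_accuracy(records)
```

An evaluation report that only printed `overall` would ship this model; the per-group breakdown is what surfaces the failure, which is why subgroup analysis is a design requirement rather than a nice-to-have.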