What we build
Responsible AI is where federal programs get stuck. Mission teams ship a working model. The chief AI officer asks for an impact assessment. Legal asks about fairness. Privacy asks about training data. OMB asks whether the use case is on the inventory. Congress asks whether the model discriminates. Every one of those asks has a legitimate answer, and every one of them can halt deployment if the engineering team treated responsible AI as someone else's problem.
We build responsible AI engineering into the model lifecycle from day one. Bias testing runs in CI, not as a quarterly audit. Explainability outputs ship with predictions, not as a separate deck. Privacy protections are measured, not assumed. Documentation satisfies OMB M-24-10 and NIST AI RMF without requiring an after-the-fact rewrite. The result is systems that pass review the first time and operate with the audit trail agencies need to defend them.
- Bias & fairness testing — Aequitas, Fairlearn, AIF360, What-If Tool across protected classes and agency-specific groupings.
- Explainability — SHAP, LIME, integrated gradients, counterfactual explanations, concept activation vectors, Anchors.
- LLM evaluation — HELM, BIG-bench, TruthfulQA, MMLU, agency red-team suites, continuous eval harnesses.
- Privacy-preserving ML — differential privacy (Opacus, TF Privacy), federated learning (Flower, FATE), secure aggregation.
- Robustness testing — adversarial examples (ART, TextAttack), distribution shift analysis, stress testing.
- Governance documentation — AI Impact Assessments, Model Cards, Datasheets, System Security Plans aligned to NIST AI RMF.
- Monitoring — drift detection, fairness drift, performance stratified by group, incident reporting pipelines.
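"Bias testing runs in CI" means a fairness check that can fail the build. A minimal pure-Python sketch of such a gate, using the four-fifths rule as an illustrative threshold (the real stack uses Aequitas and Fairlearn with far richer slicing):

```python
def selection_rates(preds, groups):
    """Positive-prediction rate for each group."""
    rates = {}
    for g in sorted(set(groups)):
        picked = [p for p, gg in zip(preds, groups) if gg == g]
        rates[g] = sum(picked) / len(picked)
    return rates

def disparate_impact_gate(preds, groups, reference, ratio=0.8):
    """Fail the build if any group's selection rate falls below
    `ratio` times the reference group's rate (the four-fifths rule)."""
    rates = selection_rates(preds, groups)
    ref = rates[reference]
    failing = [g for g, r in rates.items() if r / ref < ratio]
    return {"rates": rates, "failing_groups": failing, "passed": not failing}
```

In practice this runs in the test stage alongside accuracy checks, against a held-out audit set, so a fairness regression blocks a merge the same way a broken unit test does.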
NIST AI Risk Management Framework
The NIST AI RMF 1.0 (January 2023) is the common language federal agencies use to talk about AI risk. It organizes work around four functions:
- Govern — policies, accountability, training, risk appetite for the organization.
- Map — context, purpose, stakeholders, risks for each AI use case.
- Measure — quantitative and qualitative assessment of identified risks.
- Manage — risk response, mitigation, monitoring, incident handling.
Our implementation translates these from framework language into engineering artifacts: risk registers per use case that map to specific RMF subcategories, measurement plans that define the metrics and test data for each risk, mitigation plans that specify the technical and procedural controls, and monitoring plans that specify the ongoing measurement and alert thresholds. Each artifact is versioned with the model and the data, so an auditor can see what was planned, what was measured, and what was mitigated.
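One way to make "versioned with the model and the data" concrete is to treat each risk-register entry as a structured record. The field names, example values, and the RMF subcategory ID below are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass, asdict

@dataclass
class RiskRegisterEntry:
    risk_id: str
    description: str
    rmf_subcategories: list   # e.g. ["MEASURE 2.11"] -- ID illustrative
    metrics: list             # measurement plan: what is measured, on what data
    mitigations: list         # technical and procedural controls
    alert_thresholds: dict    # monitoring plan: metric -> threshold
    model_version: str        # pinned so an auditor can compare plan to outcome
    data_version: str

entry = RiskRegisterEntry(
    risk_id="R-001",
    description="Selection-rate disparity across beneficiary categories",
    rmf_subcategories=["MEASURE 2.11"],
    metrics=["demographic parity ratio on held-out audit set"],
    mitigations=["exponentiated-gradient reduction; tradeoff documented"],
    alert_thresholds={"demographic_parity_ratio": 0.8},
    model_version="2.4.1",
    data_version="audit-set-v3",
)
```

Because the record is plain data, it serializes into the same repository as the model config, and a diff of the register is a diff of the risk posture.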
The Generative AI Profile (NIST AI 600-1, July 2024) adds specific guidance for LLMs and foundation models: hallucination risk, data provenance, harmful outputs, prompt injection, and malicious use. Our LLM engineering practices integrate this profile explicitly.
OMB M-24-10 and EO 14110
OMB Memorandum M-24-10 (March 2024) operationalized Executive Order 14110 (October 2023) for civilian federal agencies. Key requirements:
- AI use case inventory — every agency maintains a public inventory of AI use cases.
- Chief AI Officer — designated senior official responsible for coordinating AI.
- Rights-impacting and safety-impacting AI — defined categories with additional requirements.
- AI Impact Assessments (AIIAs) — required for rights-impacting AI before use and annually thereafter.
- Minimum practices — testing, monitoring, human alternatives, consultation with affected communities, public notice.
- Independent review — certain high-risk uses require independent evaluation.
Our engagements for federal programs subject to M-24-10 produce the documentation and engineering controls that satisfy each requirement. AIIAs include specific performance claims backed by test data, not vague statements of intent. Human alternatives are implemented and tested, not promised. Public notice drafts are prepared to agency and OMB standards. Minimum practices are wired into the CI/CD pipeline.
Bias and fairness testing in practice
Fairness is not a single metric. The most common federal failure mode is picking one metric that looks good and declaring victory. Statistical parity, equal opportunity, equalized odds, predictive parity, and calibration-within-groups are often mutually incompatible. The right answer depends on the use case: a loan-approval use case prioritizes different metrics than a resume-screening use case, which prioritizes different metrics than a risk-scoring use case.
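The incompatibility is easy to exhibit on toy data: the hypothetical decisions below have identical selection rates across two groups (statistical parity holds) but different true-positive rates (equal opportunity fails):

```python
def rate(flags):
    """Fraction of positive flags."""
    return sum(flags) / len(flags)

# Toy data: y = ground truth, p = model decision, per group
y_a, p_a = [1, 1, 0, 0], [1, 0, 1, 0]   # group a
y_b, p_b = [1, 0, 0, 0], [1, 1, 0, 0]   # group b

# Statistical parity: P(decision = 1) per group
parity_a, parity_b = rate(p_a), rate(p_b)           # both 0.5

# Equal opportunity: P(decision = 1 | y = 1) per group
tpr_a = rate([p for p, t in zip(p_a, y_a) if t])    # 0.5
tpr_b = rate([p for p, t in zip(p_b, y_b) if t])    # 1.0
```

Equalizing the true-positive rates here would break parity, which is why the metric choice has to be an explicit, documented decision rather than a default.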
Our bias testing stack: Aequitas for fairness audits with configurable reference groups, Fairlearn for mitigation algorithms and MetricFrame analysis, AIF360 for the full suite of fairness metrics and debiasing techniques, What-If Tool for interactive exploration of model behavior across groups. We test across protected classes (race, gender, age, disability) and agency-specific groupings (veteran status, geography, limited English proficiency, beneficiary category). We report tradeoffs explicitly and document which metric the agency chose to optimize and why.
Mitigation techniques: pre-processing (reweighing, disparate impact remover), in-processing (adversarial debiasing, exponentiated gradient reduction), post-processing (calibrated equalized odds, reject-option classification). We test the mitigation and report performance tradeoffs, because every mitigation costs accuracy somewhere.
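As a concrete post-processing example, per-group score cutoffs can equalize selection rates directly. This is a deliberately simplified cousin of reject-option classification and threshold-optimizer methods, not the AIF360 or Fairlearn API:

```python
def per_group_cutoffs(scores, groups, target_rate):
    """Choose a score threshold per group so each group accepts
    (approximately) the same fraction `target_rate` of its members."""
    cutoffs = {}
    for g in set(groups):
        ranked = sorted((s for s, gg in zip(scores, groups) if gg == g),
                        reverse=True)
        k = round(target_rate * len(ranked))        # how many to accept
        cutoffs[g] = ranked[k - 1] if k > 0 else float("inf")
    return cutoffs
```

The accuracy cost is visible immediately: with the cutoffs below, a score of 0.65 is accepted in one group and rejected in the other, and that is exactly the tradeoff we measure and report.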
Explainability
Explainability for federal AI serves three audiences: decision subjects (why was I denied), operators (should I trust this prediction), and reviewers (does this model work the way it claims). Each audience needs different explanations.
For tabular models: SHAP TreeExplainer for gradient-boosting, KernelSHAP for general models, LIME for local approximations, counterfactual explanations (DiCE) for "what would change the outcome" questions. For deep models: integrated gradients, SmoothGrad, Grad-CAM for vision, attention rollout for some transformer architectures (with appropriate skepticism). For LLMs: chain-of-thought with verification, retrieval citation for RAG outputs, counterfactual probing.
Explainability outputs are engineered as first-class predictions, not after-the-fact analysis. Every production prediction ships with the explanation that produced it, and explanations are tested for consistency and faithfulness against the underlying model.
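For a linear scorer, the "explanation ships with the prediction" pattern can be shown exactly, since per-feature contributions against a baseline input are the model's Shapley values. The feature names and weights below are hypothetical:

```python
def predict_with_explanation(weights, bias, x, baseline):
    """Return the score together with per-feature contributions
    relative to a baseline input (exact SHAP values for a linear model)."""
    contributions = {f: w * (x[f] - baseline[f]) for f, w in weights.items()}
    score = bias + sum(w * x[f] for f, w in weights.items())
    return {"score": score, "explanation": contributions}
```

A faithfulness test then checks that the contributions sum to the score minus the baseline score; for nonlinear models explained with KernelSHAP or LIME, that identity holds only approximately, which is exactly what the consistency tests measure.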
LLM evaluation
LLMs break the old evaluation paradigm. Static test sets do not capture the open-ended nature of generation. We build evaluation harnesses that combine:
- Public benchmarks (HELM, BIG-bench, TruthfulQA, MMLU, HumanEval) for capability ranking.
- Safety benchmarks (HarmBench, AdvBench) for jailbreak and harmful-output testing.
- Agency-specific evaluation sets that reflect actual use cases.
- Red-team prompts that probe for hallucination, bias, and privacy leakage.
- Rubric-based evaluation with LLM-as-judge, validated against human judgment.
- Continuous eval pipelines that run on every model or prompt change.
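The continuous-eval step reduces to a small harness: each case pairs a prompt with a grader, and the aggregate pass rate gates the change. The structure is illustrative; real harnesses add rubrics, judge models, and per-category reporting:

```python
def run_eval(model, cases, gate=0.9):
    """Run every case through the model, grade each output,
    and pass/fail on the aggregate rate -- the CI gate."""
    scores = [case["grader"](model(case["prompt"])) for case in cases]
    pass_rate = sum(scores) / len(scores)
    return {"pass_rate": pass_rate, "passed": pass_rate >= gate}

# Hypothetical echo "model" and two exact-match cases
model = lambda prompt: prompt.strip().upper()
cases = [
    {"prompt": "ok", "grader": lambda out: out == "OK"},
    {"prompt": "fail me", "grader": lambda out: out == "something else"},
]
```

Because graders are just callables, the same harness runs exact-match checks, regex checks, and judge-model scores side by side.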
Grounding matters especially for RAG systems. We measure retrieval recall, answer faithfulness (is every claim in the answer supported by retrieved documents), citation accuracy, and refusal appropriateness (does the model abstain when retrieval fails). RAGAS and TruLens are primary toolkits.
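A crude lexical stand-in illustrates what the faithfulness measurement does. RAGAS and TruLens use LLM judges rather than word overlap; the threshold and whitespace splitting here are illustrative only:

```python
def lexical_faithfulness(claims, retrieved_docs, min_overlap=0.6):
    """Fraction of answer claims whose words mostly appear
    in at least one retrieved document."""
    doc_words = [set(d.lower().split()) for d in retrieved_docs]

    def supported(claim):
        words = set(claim.lower().split())
        return any(len(words & dw) / len(words) >= min_overlap
                   for dw in doc_words)

    return sum(supported(c) for c in claims) / len(claims)
```

A faithfulness score below 1.0 means at least one claim in the answer has no support in the retrieved context, which is the RAG failure mode that matters most for federal use.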
Privacy-preserving ML
Federal data is often the most sensitive data in the country. Training on it requires privacy engineering:
- Differential privacy (Opacus, TensorFlow Privacy) with explicit epsilon-delta budgets tied to utility requirements.
- PII detection and redaction (Microsoft Presidio, spaCy NER) before training.
- Synthetic data generation (SDV, Gretel, CTGAN) for development environments.
- Federated learning (Flower, NVIDIA FLARE, FATE) where data cannot leave agency environments.
- Membership inference and model inversion testing to verify privacy claims hold empirically.
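The epsilon budget is easiest to see on a single numeric query via the textbook Laplace mechanism. DP-SGD training in Opacus and TensorFlow Privacy extends the same accounting to per-step gradient updates; this sketch is the classical mechanism, not those libraries' APIs:

```python
import math
import random

def laplace_release(true_value, sensitivity, epsilon):
    """Release true_value + Laplace(sensitivity / epsilon) noise,
    the basic epsilon-differentially-private numeric query."""
    scale = sensitivity / epsilon
    u = random.random() - 0.5      # uniform on [-0.5, 0.5)
    # Inverse-CDF sample from Laplace(0, scale)
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_value + noise
```

Smaller epsilon means more noise and stronger privacy; the engineering work is choosing a budget the utility requirements can tolerate, then verifying the claim empirically with membership-inference testing.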
Monitoring and incident response
Responsible AI is not done at deployment. Models drift, data shifts, and fairness degrades silently. We deploy monitoring that tracks: performance stratified by group (not just aggregate), fairness metrics over time, prediction distribution shifts, input distribution shifts, and adversarial input patterns. Alerts tie to on-call rotations and incident playbooks that include communication, containment, and remediation. Every incident becomes a new test.
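Prediction- and input-shift monitoring often reduces to a per-window distribution distance; the Population Stability Index is one common choice (the 0.2 alert threshold is a widespread rule of thumb, not a standard):

```python
import math

def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two binned distributions;
    larger values mean more drift, 0.0 means identical bins."""
    te, ta = sum(expected_counts), sum(actual_counts)
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        pe = max(e / te, eps)   # clamp so empty bins don't blow up the log
        pa = max(a / ta, eps)
        score += (pa - pe) * math.log(pa / pe)
    return score
```

Run the same score stratified by group and it doubles as a fairness-drift signal: a shift concentrated in one group can be invisible in the aggregate.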
Federal agencies and programs
- OMB & agency CAIO offices — M-24-10 implementation, use case inventory
- VA, HHS — clinical AI, benefits AI, fairness across veteran and beneficiary groups
- DHS, DOJ — risk assessment AI, immigration AI, civil-rights-impacting systems
- DoD CDAO — Responsible AI Strategy implementation, test and evaluation
- NIST AI Safety Institute — evaluation methodology, measurement science
- GAO — AI audit support for oversight engagements