What we build
Responsible AI is where federal programs get stuck. Mission teams ship a working model. The chief AI officer asks for an impact assessment. Legal asks about fairness. Privacy asks about training data. OMB asks whether the use case is on the inventory. Congress asks whether the model discriminates. Every one of those asks has a legitimate answer, and every one of them can halt deployment if the engineering team treated responsible AI as someone else's problem.
We build responsible AI engineering into the model lifecycle from day one. Bias testing runs in CI, not as a quarterly audit. Explainability outputs ship with predictions, not as a separate deck. Privacy protections are measured, not assumed. Documentation satisfies OMB M-24-10 and NIST AI RMF without requiring an after-the-fact rewrite. The result is systems that pass review the first time and operate with the audit trail agencies need to defend them.
- Bias & fairness testing — Aequitas, Fairlearn, AIF360, What-If Tool across protected classes and agency-specific groupings.
- Explainability — SHAP, LIME, integrated gradients, counterfactual explanations, concept activation vectors, Anchors.
- LLM evaluation — HELM, BIG-bench, TruthfulQA, MMLU, agency red-team suites, continuous eval harnesses.
- Privacy-preserving ML — differential privacy (Opacus, TF Privacy), federated learning (Flower, FATE), secure aggregation.
- Robustness testing — adversarial examples (ART, TextAttack), distribution shift analysis, stress testing.
- Governance documentation — AI Impact Assessments, Model Cards, Datasheets, System Security Plans aligned to NIST AI RMF.
- Monitoring — drift detection, fairness drift, performance stratified by group, incident reporting pipelines.
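"Bias testing runs in CI" means a fairness check that can fail the build. A minimal pure-Python sketch of such a gate, using the four-fifths rule as an illustrative threshold (the real stack uses Aequitas and Fairlearn with far richer slicing):

```python
def selection_rates(preds, groups):
    """Positive-prediction rate for each group."""
    rates = {}
    for g in sorted(set(groups)):
        picked = [p for p, gg in zip(preds, groups) if gg == g]
        rates[g] = sum(picked) / len(picked)
    return rates

def disparate_impact_gate(preds, groups, reference, ratio=0.8):
    """Fail the build if any group's selection rate falls below
    `ratio` times the reference group's rate (the four-fifths rule)."""
    rates = selection_rates(preds, groups)
    ref = rates[reference]
    failing = [g for g, r in rates.items() if r / ref < ratio]
    return {"rates": rates, "failing_groups": failing, "passed": not failing}
```

In practice this runs in the test stage alongside accuracy checks, against a held-out audit set, so a fairness regression blocks a merge the same way a broken unit test does.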
NIST AI Risk Management Framework
The NIST AI RMF 1.0 (January 2023) is the common language federal agencies use to talk about AI risk. It organizes work around four functions:
- Govern — policies, accountability, training, risk appetite for the organization.
- Map — context, purpose, stakeholders, risks for each AI use case.
- Measure — quantitative and qualitative assessment of identified risks.
- Manage — risk response, mitigation, monitoring, incident handling.
Our implementation translates these from framework language into engineering artifacts: risk registers per use case that map to specific RMF subcategories, measurement plans that define the metrics and test data for each risk, mitigation plans that specify the technical and procedural controls, and monitoring plans that specify the ongoing measurement and alert thresholds. Each artifact is versioned with the model and the data, so an auditor can see what was planned, what was measured, and what was mitigated.
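One way to make "versioned with the model and the data" concrete is to treat each risk-register entry as a structured record. The field names, example values, and the RMF subcategory ID below are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass, asdict

@dataclass
class RiskRegisterEntry:
    risk_id: str
    description: str
    rmf_subcategories: list   # e.g. ["MEASURE 2.11"] -- ID illustrative
    metrics: list             # measurement plan: what is measured, on what data
    mitigations: list         # technical and procedural controls
    alert_thresholds: dict    # monitoring plan: metric -> threshold
    model_version: str        # pinned so an auditor can compare plan to outcome
    data_version: str

entry = RiskRegisterEntry(
    risk_id="R-001",
    description="Selection-rate disparity across beneficiary categories",
    rmf_subcategories=["MEASURE 2.11"],
    metrics=["demographic parity ratio on held-out audit set"],
    mitigations=["exponentiated-gradient reduction; tradeoff documented"],
    alert_thresholds={"demographic_parity_ratio": 0.8},
    model_version="2.4.1",
    data_version="audit-set-v3",
)
```

Because the record is plain data, it serializes into the same repository as the model config, and a diff of the register is a diff of the risk posture.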
The Generative AI Profile (NIST AI 600-1, July 2024) adds specific guidance for LLMs and foundation models: hallucination risk, data provenance, harmful outputs, prompt injection, and malicious use. Our LLM engineering practices integrate this profile explicitly.
OMB M-24-10 and EO 14110
OMB Memorandum M-24-10 (March 2024) operationalized Executive Order 14110 (October 2023) for civilian federal agencies. Key requirements:
- AI use case inventory — every agency maintains a public inventory of AI use cases.
- Chief AI Officer — designated senior official responsible for coordinating AI.
- Rights-impacting and safety-impacting AI — defined categories with additional requirements.
- AI Impact Assessments (AIIAs) — required for rights-impacting AI before use and annually thereafter.
- Minimum practices — testing, monitoring, human alternatives, consultation with affected communities, public notice.
- Independent review — certain high-risk uses require independent evaluation.
Our engagements for federal programs subject to M-24-10 produce the documentation and engineering controls that satisfy each requirement. AIIAs include specific performance claims backed by test data, not vague statements of intent. Human alternatives are implemented and tested, not promised. Public notice drafts are prepared to agency and OMB standards. Minimum practices are wired into the CI/CD pipeline.
Bias and fairness testing in practice
Fairness is not a single metric. The most common federal failure mode is picking one metric that looks good and declaring victory. Statistical parity, equal opportunity, equalized odds, predictive parity, and calibration-within-groups are often mutually incompatible. The right answer depends on the use case: a loan-approval use case prioritizes different metrics than a resume-screening use case, which prioritizes different metrics than a risk-scoring use case.
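The incompatibility is easy to exhibit on toy data: the hypothetical decisions below have identical selection rates across two groups (statistical parity holds) but different true-positive rates (equal opportunity fails):

```python
def rate(flags):
    """Fraction of positive flags."""
    return sum(flags) / len(flags)

# Toy data: y = ground truth, p = model decision, per group
y_a, p_a = [1, 1, 0, 0], [1, 0, 1, 0]   # group a
y_b, p_b = [1, 0, 0, 0], [1, 1, 0, 0]   # group b

# Statistical parity: P(decision = 1) per group
parity_a, parity_b = rate(p_a), rate(p_b)           # both 0.5

# Equal opportunity: P(decision = 1 | y = 1) per group
tpr_a = rate([p for p, t in zip(p_a, y_a) if t])    # 0.5
tpr_b = rate([p for p, t in zip(p_b, y_b) if t])    # 1.0
```

Equalizing the true-positive rates here would break parity, which is why the metric choice has to be an explicit, documented decision rather than a default.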
Our bias testing stack: Aequitas for fairness audits with configurable reference groups, Fairlearn for mitigation algorithms and MetricFrame analysis, AIF360 for the full suite of fairness metrics and debiasing techniques, What-If Tool for interactive exploration of model behavior across groups. We test across protected classes (race, gender, age, disability) and agency-specific groupings (veteran status, geography, limited English proficiency, beneficiary category). We report tradeoffs explicitly and document which metric the agency chose to optimize and why.
Mitigation techniques: pre-processing (reweighing, disparate impact remover), in-processing (adversarial debiasing, exponentiated gradient reduction), post-processing (calibrated equalized odds, reject-option classification). We test the mitigation and report performance tradeoffs, because every mitigation costs accuracy somewhere.
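As a concrete post-processing example, per-group score cutoffs can equalize selection rates directly. This is a deliberately simplified cousin of reject-option classification and threshold-optimizer methods, not the AIF360 or Fairlearn API:

```python
def per_group_cutoffs(scores, groups, target_rate):
    """Choose a score threshold per group so each group accepts
    (approximately) the same fraction `target_rate` of its members."""
    cutoffs = {}
    for g in set(groups):
        ranked = sorted((s for s, gg in zip(scores, groups) if gg == g),
                        reverse=True)
        k = round(target_rate * len(ranked))        # how many to accept
        cutoffs[g] = ranked[k - 1] if k > 0 else float("inf")
    return cutoffs
```

The accuracy cost is visible immediately: with the cutoffs below, a score of 0.65 is accepted in one group and rejected in the other, and that is exactly the tradeoff we measure and report.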
Explainability
Explainability for federal AI serves three audiences: decision subjects (why was I denied), operators (should I trust this prediction), and reviewers (does this model work the way it claims). Each audience needs different explanations.
For tabular models: SHAP TreeExplainer for gradient-boosting, KernelSHAP for general models, LIME for local approximations, counterfactual explanations (DiCE) for "what would change the outcome" questions. For deep models: integrated gradients, SmoothGrad, Grad-CAM for vision, attention rollout for some transformer architectures (with appropriate skepticism). For LLMs: chain-of-thought with verification, retrieval citation for RAG outputs, counterfactual probing.
Explainability outputs are engineered as first-class predictions, not after-the-fact analysis. Every production prediction ships with the explanation that produced it, and explanations are tested for consistency and faithfulness against the underlying model.
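For a linear scorer, the "explanation ships with the prediction" pattern can be shown exactly, since per-feature contributions against a baseline input are the model's Shapley values. The feature names and weights below are hypothetical:

```python
def predict_with_explanation(weights, bias, x, baseline):
    """Return the score together with per-feature contributions
    relative to a baseline input (exact SHAP values for a linear model)."""
    contributions = {f: w * (x[f] - baseline[f]) for f, w in weights.items()}
    score = bias + sum(w * x[f] for f, w in weights.items())
    return {"score": score, "explanation": contributions}
```

A faithfulness test then checks that the contributions sum to the score minus the baseline score; for nonlinear models explained with KernelSHAP or LIME, that identity holds only approximately, which is exactly what the consistency tests measure.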
LLM evaluation
LLMs break the old evaluation paradigm. Static test sets do not capture the open-ended nature of generation. We build evaluation harnesses that combine:
- Public benchmarks (HELM, BIG-bench, TruthfulQA, MMLU, HumanEval) for capability ranking.
- Safety benchmarks (HarmBench, AdvBench) for jailbreak and harmful-output testing.
- Agency-specific evaluation sets that reflect actual use cases.
- Red-team prompts that probe for hallucination, bias, and privacy leakage.
- Rubric-based evaluation with LLM-as-judge, validated against human judgment.
- Continuous eval pipelines that run on every model or prompt change.
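The continuous-eval step reduces to a small harness: each case pairs a prompt with a grader, and the aggregate pass rate gates the change. The structure is illustrative; real harnesses add rubrics, judge models, and per-category reporting:

```python
def run_eval(model, cases, gate=0.9):
    """Run every case through the model, grade each output,
    and pass/fail on the aggregate rate -- the CI gate."""
    scores = [case["grader"](model(case["prompt"])) for case in cases]
    pass_rate = sum(scores) / len(scores)
    return {"pass_rate": pass_rate, "passed": pass_rate >= gate}

# Hypothetical echo "model" and two exact-match cases
model = lambda prompt: prompt.strip().upper()
cases = [
    {"prompt": "ok", "grader": lambda out: out == "OK"},
    {"prompt": "fail me", "grader": lambda out: out == "something else"},
]
```

Because graders are just callables, the same harness runs exact-match checks, regex checks, and judge-model scores side by side.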
Grounding matters especially for RAG systems. We measure retrieval recall, answer faithfulness (is every claim in the answer supported by retrieved documents), citation accuracy, and refusal appropriateness (does the model abstain when retrieval fails). RAGAS and TruLens are primary toolkits.
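A crude lexical stand-in illustrates what the faithfulness measurement does. RAGAS and TruLens use LLM judges rather than word overlap; the threshold and whitespace splitting here are illustrative only:

```python
def lexical_faithfulness(claims, retrieved_docs, min_overlap=0.6):
    """Fraction of answer claims whose words mostly appear
    in at least one retrieved document."""
    doc_words = [set(d.lower().split()) for d in retrieved_docs]

    def supported(claim):
        words = set(claim.lower().split())
        return any(len(words & dw) / len(words) >= min_overlap
                   for dw in doc_words)

    return sum(supported(c) for c in claims) / len(claims)
```

A faithfulness score below 1.0 means at least one claim in the answer has no support in the retrieved context, which is the RAG failure mode that matters most for federal use.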
Privacy-preserving ML
Federal data is often the most sensitive data in the country. Training on it requires privacy engineering:
- Differential privacy (Opacus, TensorFlow Privacy) with explicit epsilon-delta budgets tied to utility requirements.
- PII detection and redaction (Microsoft Presidio, spaCy NER) before training.
- Synthetic data generation (SDV, Gretel, CTGAN) for development environments.
- Federated learning (Flower, NVIDIA FLARE, FATE) where data cannot leave agency environments.
- Membership inference and model inversion testing to verify privacy claims hold empirically.
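The epsilon budget is easiest to see on a single numeric query via the textbook Laplace mechanism. DP-SGD training in Opacus and TensorFlow Privacy extends the same accounting to per-step gradient updates; this sketch is the classical mechanism, not those libraries' APIs:

```python
import math
import random

def laplace_release(true_value, sensitivity, epsilon):
    """Release true_value + Laplace(sensitivity / epsilon) noise,
    the basic epsilon-differentially-private numeric query."""
    scale = sensitivity / epsilon
    u = random.random() - 0.5      # uniform on [-0.5, 0.5)
    # Inverse-CDF sample from Laplace(0, scale)
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_value + noise
```

Smaller epsilon means more noise and stronger privacy; the engineering work is choosing a budget the utility requirements can tolerate, then verifying the claim empirically with membership-inference testing.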
Monitoring and incident response
Responsible AI is not done at deployment. Models drift, data shifts, and fairness degrades silently. We deploy monitoring that tracks: performance stratified by group (not just aggregate), fairness metrics over time, prediction distribution shifts, input distribution shifts, and adversarial input patterns. Alerts tie to on-call rotations and incident playbooks that include communication, containment, and remediation. Every incident becomes a new test.
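Prediction- and input-shift monitoring often reduces to a per-window distribution distance; the Population Stability Index is one common choice (the 0.2 alert threshold is a widespread rule of thumb, not a standard):

```python
import math

def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two binned distributions;
    larger values mean more drift, 0.0 means identical bins."""
    te, ta = sum(expected_counts), sum(actual_counts)
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        pe = max(e / te, eps)   # clamp so empty bins don't blow up the log
        pa = max(a / ta, eps)
        score += (pa - pe) * math.log(pa / pe)
    return score
```

Run the same score stratified by group and it doubles as a fairness-drift signal: a shift concentrated in one group can be invisible in the aggregate.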
Federal agencies and programs
- OMB & agency CAIO offices — M-24-10 implementation, use case inventory
- VA, HHS — clinical AI, benefits AI, fairness across veteran and beneficiary groups
- DHS, DOJ — risk assessment AI, immigration AI, civil-rights-impacting systems
- DoD CDAO — Responsible AI Strategy implementation, test and evaluation
- NIST AI Safety Institute — evaluation methodology, measurement science
- GAO — AI audit support for oversight engagements