What we build
Speech is the most information-dense modality in federal operations. Body-worn radio, interview rooms, call centers, courtrooms, intercept feeds, field recordings, tactical comms, and oral histories all generate audio that needs to become searchable, translatable, and analyzable text. Speech AI is the bridge, and the gap between a demo and a production system is wider here than almost anywhere else because acoustic conditions, accents, jargon, and overlapping speakers break off-the-shelf models in predictable ways.
We build production speech systems for federal agencies end-to-end: ingest, voice activity detection, diarization, transcription, translation, entity extraction, redaction, and indexing. Every component is measured, every error budget is explicit, and every deployment target is considered from day one rather than bolted on at the end.
- Automatic speech recognition (ASR) — Whisper large-v3, Whisper-v3-turbo, NeMo Canary-1B, Parakeet-TDT, wav2vec2, HuBERT, conformer-CTC models.
- Text-to-speech (TTS) — XTTS-v2, Coqui, Parler-TTS, Kokoro-82M, Bark, F5-TTS for natural neural synthesis with SSML and prosody control.
- Speaker diarization — pyannote.audio 3.x, NeMo TitaNet, VBx resegmentation, overlap detection, target-speaker ASR.
- Voice activity detection (VAD) — Silero-VAD, WebRTC-VAD, pyannote-VAD for robust speech-region extraction from noisy audio.
- Translation & code-switching — SeamlessM4T-v2, NLLB-200, Whisper translation mode for 100+ language pairs.
- Speaker identification & verification — ECAPA-TDNN, TitaNet-Large embeddings, cosine-scoring backends tuned for agency watchlist use.
- Emotion, sentiment, and acoustic analytics — paralinguistic feature extraction for triage, quality assurance, and operator support.
Federal transcription pipelines
A federal transcription pipeline is not a one-call-to-Whisper operation. Real-world federal audio lives in WAV, MP3, M4A, AMR (phone carriers), OPUS (radio), and proprietary formats dropped by legacy hardware. Sample rates range from 8 kHz telephony to 48 kHz broadcast. Channels can be mono mixdowns or stereo with separate speakers per channel. The first third of any production pipeline is ingest, format normalization, resampling, loudness normalization, and channel handling. Cutting corners here shows up as a 5-10 point WER regression that nobody can explain.
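The ingest decisions above can be sketched as a small planning function. This is a minimal sketch, not our production ingest code: the function name, the metadata fields, and the step labels are illustrative, and the 16 kHz mono target reflects the common ASR-model expectation.

```python
def normalization_plan(sample_rate_hz: int, channels: int,
                       target_hz: int = 16000) -> dict:
    """Decide ingest-normalization steps for one recording.

    Assumes the ASR model expects 16 kHz mono. Stereo inputs are
    split per channel rather than mixed down, because interview and
    telephony stereo often carries one speaker per channel.
    """
    steps = []
    if channels == 2:
        steps.append("split_channels")
    elif channels > 2:
        steps.append("downmix_to_mono")
    if sample_rate_hz != target_hz:
        # 8 kHz telephony is upsampled; 48 kHz broadcast is downsampled.
        steps.append(f"resample_{sample_rate_hz}_to_{target_hz}")
    # Loudness normalization last, so level matching sees final channels.
    steps.append("loudness_normalize_ebu_r128")
    return {"target_hz": target_hz, "steps": steps}
```

The point of making the plan explicit is auditability: when a recording transcribes badly, the first question is what ingest did to it, and a logged plan answers that immediately.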
Once audio is normalized, we run voice activity detection to strip silence, apply speaker diarization to label turns, transcribe each turn with the ASR model, apply language-model rescoring where vocabulary is domain-specific (medical, legal, military), and emit timestamped JSON that downstream systems can consume. For sensitive content, PII redaction runs on the transcript (names, SSNs, phone numbers, addresses, case numbers) with both rule-based and NER-based detection. For translation, we run ASR in the source language first and then apply NMT, rather than using Whisper's built-in translation mode, which tends to be lossy on technical content.
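The rule-based half of the redaction pass can be sketched in a few lines. The patterns below are illustrative, not our full rule set, and a production system pairs them with NER so names and addresses are caught too:

```python
import re

# Illustrative rule-based patterns; case numbers and addresses need
# agency-specific rules, and names need NER rather than regexes.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b(?:\+?1[-. ]?)?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace each PII match with a typed placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Typed placeholders (rather than blanket `[REDACTED]`) let reviewers verify redaction coverage without seeing the underlying values.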
Our pipelines emit speaker-attributed, timestamped, diarized transcripts in WebVTT, SRT, JSON, and TEI XML. We produce confidence scores per word, not just per segment, so agency reviewers know where to spend their time. For long-form audio (depositions, hearings, oral histories) we add chapter-level segmentation and topic labels so a 6-hour recording becomes navigable instead of a wall of text.
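Rendering the timestamped JSON into one of those subtitle formats is straightforward; here is a minimal SRT emitter as a sketch, assuming an illustrative segment schema with `start`, `end` (seconds), `speaker`, and `text` keys:

```python
def to_srt(segments) -> str:
    """Render speaker-attributed, timestamped segments as SRT."""
    def ts(seconds: float) -> str:
        # SRT timestamps use comma-separated milliseconds.
        ms = round(seconds * 1000)
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{ts(seg['start'])} --> {ts(seg['end'])}\n"
            f"{seg['speaker']}: {seg['text']}\n"
        )
    return "\n".join(blocks)
```

WebVTT differs mainly in the header line and its use of a period instead of a comma in timestamps, so the same segment JSON drives both outputs.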
Whisper in production, honestly
Whisper is the default for a reason: it is the most capable open-weight ASR model shipping today, it handles 99 languages, and it is free to run on your own hardware. But Whisper has failure modes that matter in federal use and that notebook demos hide.
- Hallucinations during silence or music — Whisper will generate plausible text when there is nothing to transcribe. VAD preprocessing and confidence thresholding are mandatory for production.
- Repetition loops — long audio can trigger repetitive output. We mitigate with chunked decoding, no-speech probability thresholds, and temperature fallback with beam search.
- Weak diarization — Whisper does not diarize natively. Pairing with pyannote is required.
- Timestamp drift — word-level timestamps need WhisperX or forced alignment (wav2vec2 CTC) for frame-accurate output.
- Domain drift — military jargon, agency acronyms, program names, and personal names are error-prone. We fine-tune on agency-provided corpora or apply LLM-based post-editing with a domain glossary.
We ship Whisper deployments with all of these known-failure mitigations baked in. That is the difference between a research demo and a system a federal agency can depend on.
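As a concrete sketch of the confidence-thresholding and repetition-loop mitigations, the filter below operates on Whisper-style segment dicts (openai-whisper emits `no_speech_prob` and `avg_logprob` per segment). The threshold values are illustrative and should be tuned per deployment:

```python
def filter_segments(segments, no_speech_max=0.6, logprob_min=-1.0):
    """Drop likely-hallucinated ASR segments.

    A segment is dropped when it both looks like non-speech and was
    decoded with low confidence, or when one token dominates it
    (a crude repetition-loop signature).
    """
    kept = []
    for seg in segments:
        if seg["no_speech_prob"] > no_speech_max and seg["avg_logprob"] < logprob_min:
            continue  # high no-speech odds AND low confidence
        words = seg["text"].split()
        if len(words) > 6 and max(words.count(w) for w in set(words)) > 0.8 * len(words):
            continue  # one token repeated for most of the segment
        kept.append(seg)
    return kept
```

Requiring both conditions for the hallucination drop matters: no-speech probability alone also fires on quiet but genuine speech, which is exactly the audio a federal reviewer cannot afford to lose.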
Domain adaptation
Most federal audio is out-of-distribution for the training data Whisper and other open ASR models have seen. Radio traffic, phone calls through legacy carriers, body-worn audio with wind and machinery, courtroom recordings with multiple lapel mics, and operational chatter with mission-specific jargon all require domain adaptation.
Our adaptation toolkit: (1) continued pretraining on unlabeled in-domain audio with self-supervised objectives (wav2vec2, HuBERT, WavLM); (2) supervised fine-tuning on labeled agency data, where 10-100 hours is typically enough to move WER by 5-15 points; (3) acoustic augmentation with noise, reverb, codec simulation, and channel distortion to match deployment conditions; (4) language model fusion with n-gram or neural LMs trained on agency text corpora; (5) biased decoding and contextual biasing for rare entities, call signs, and program names. We measure per-condition WER before and after adaptation so there is no ambiguity about whether the work paid off.
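Per-condition WER is just the standard word-level edit distance, bucketed by recording condition. A minimal sketch (the `per_condition_wer` helper and its input shape are illustrative):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate via Levenshtein distance over tokens."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))  # distance from empty reference
    for i, rw in enumerate(r, start=1):
        cur = [i]
        for j, hw in enumerate(h, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (rw != hw)))   # substitution
        prev = cur
    return prev[-1] / max(len(r), 1)

def per_condition_wer(pairs):
    """pairs: iterable of (condition, reference, hypothesis)."""
    totals = {}
    for cond, ref, hyp in pairs:
        errs, n = totals.get(cond, (0.0, 0))
        n_ref = len(ref.split())
        totals[cond] = (errs + wer(ref, hyp) * n_ref, n + n_ref)
    return {cond: errs / n for cond, (errs, n) in totals.items()}
```

Weighting each utterance by its reference length (rather than averaging per-utterance WERs) is what makes the per-condition number comparable across corpora with different utterance lengths.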
Diarization that survives real audio
Diarization is where speech systems fail silently. A single missed speaker turn corrupts every downstream analytic. Overlapping speech, short turns under 1 second, phone-bandwidth audio, and unknown speaker counts all stress clustering-based systems. We use a stacked approach: pyannote.audio 3.x for the initial segmentation and embedding, TitaNet embeddings for speaker representation, VBx resegmentation for cleaner boundaries, and an overlap-aware module to split simultaneous speech. For recordings where speakers are known (interviews with identified participants, courtrooms with known attorneys), we move to target-speaker ASR or per-channel ASR where the source format supports it.
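The final join between diarization and ASR is worth seeing concretely. The sketch below attributes each ASR segment to the diarization turn with the largest temporal overlap; the function and schema are illustrative, and a production system also handles overlapped speech explicitly rather than picking a single winner:

```python
def attribute_speakers(asr_segments, diar_turns):
    """Label ASR segments with the best-overlapping diarization speaker.

    asr_segments: [{"start": s, "end": e, "text": ...}, ...]
    diar_turns:   [{"start": s, "end": e, "speaker": ...}, ...]
    Times are in seconds on a shared clock.
    """
    out = []
    for seg in asr_segments:
        best, best_ov = "UNKNOWN", 0.0
        for turn in diar_turns:
            # Overlap of [seg.start, seg.end] with [turn.start, turn.end].
            ov = min(seg["end"], turn["end"]) - max(seg["start"], turn["start"])
            if ov > best_ov:
                best, best_ov = turn["speaker"], ov
        out.append({**seg, "speaker": best})
    return out
```

Keeping `UNKNOWN` as an explicit label, instead of forcing a nearest speaker, is what makes diarization failures visible to reviewers rather than silent.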
Translation and code-switching
Federal linguist workloads are shifting toward AI-assisted triage. Before a human listens to 40 hours of audio, a triage layer decides what deserves attention. We build that triage layer: language identification on each segment, ASR in the source language, NMT to English (or another target), keyword search, and entity extraction. SeamlessM4T-v2 handles the combined ASR+translation in one pass for 100+ languages. NLLB-200 handles text-to-text translation for extended language coverage. For code-switching audio (bilingual speakers, borrowed technical terms), we use language-aware ASR and maintain provenance so translators can see what language each segment was in.
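The ranking step at the end of that triage layer can be sketched simply. Everything here is illustrative, including the field names, the keyword-count scoring, and the priority-language boost; real deployments score with richer signals (entities, speaker identity, acoustic events):

```python
def triage_rank(segments, keywords, priority_langs):
    """Order translated segments by likely analyst interest.

    Score = keyword hits in the English translation, doubled when
    the detected source language is on the priority list.
    """
    def score(seg):
        text = seg["translation"].lower()
        hits = sum(text.count(k.lower()) for k in keywords)
        boost = 2.0 if seg["lang"] in priority_langs else 1.0
        return hits * boost
    return sorted(segments, key=score, reverse=True)
```

Because the detected language travels with each segment, a translator reviewing the ranked list always knows which source language produced the text in front of them.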
Text-to-speech for federal use
TTS is less glamorous than ASR but arguably more regulated. Section 508 accessibility, IVR modernization, training simulations, and emergency alert systems all need TTS that sounds natural, supports SSML, handles domain vocabulary, and respects voice-identity governance. Our open-weight TTS stack (XTTS-v2, Parler-TTS, Kokoro-82M, Coqui) gives agencies full control over deployment and voice data without sending anything to commercial cloud endpoints. For voice cloning applications (preservation of historical voices, accessibility for users with speech impairments), we implement authorization workflows and audio watermarking (AudioSeal) to prevent misuse.
Deployment posture
Speech AI deployments span from hosted cloud APIs (fast, cheap, not available for sensitive data) to fully air-gapped on-prem (slower to iterate, but necessary for classified or regulated content). We design across that spectrum: AWS GovCloud or Azure Government with FedRAMP Moderate or High authorization, on-prem GPU clusters for agencies with existing infrastructure, and edge deployment on ruggedized hardware for tactical use cases. Inference is optimized with TensorRT, CTranslate2, faster-whisper, and ONNX Runtime depending on target hardware. We benchmark real-time factor (RTF) and throughput explicitly, because an ASR system that processes audio slower than real time is not useful for live monitoring.
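The RTF benchmark itself is a one-liner worth pinning down, since the definition decides whether a system qualifies for live monitoring. A minimal sketch, with an illustrative callable interface:

```python
import time

def real_time_factor(transcribe, audio_seconds: float) -> float:
    """RTF = wall-clock processing time / audio duration.

    RTF < 1.0 means faster than real time, the bar for live
    monitoring. `transcribe` is any zero-argument callable that
    processes the clip end to end, including pre/post-processing.
    """
    start = time.perf_counter()
    transcribe()
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds
```

Timing the full callable, not just the model's forward pass, is deliberate: VAD, resampling, and diarization overhead all count against the real-time budget in production.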
Federal agencies and programs
- DoD — tactical radio transcription, multilingual intercept triage, training simulations, operator readiness
- FBI and DOJ — interview transcription, wiretap processing, courtroom support, evidence indexing
- DHS — call center modernization, border audio analysis, multilingual public engagement
- VA — clinical dictation, accessibility for visually impaired veterans, oral history preservation
- IC — multilingual SIGINT triage, linguist workload offload, target-speaker ASR
- NASA — mission audio archival, astronaut communications transcription, historical tape digitization