
Synthetic data for sensitive federal domains.

April 6, 2026 · 14 min read · Tabular and text synthesis, honest privacy-utility tradeoffs, and validation that actually protects against re-identification.

What synthetic data is actually for

Synthetic data serves five common jobs on a federal program: unblock development when production data access takes months, build realistic demos and documentation that do not expose real records, augment training sets to balance rare classes, load-test systems with realistic volume, and fuel third-party research without an MOU. Each job has a different fidelity requirement and a different privacy bar. Treating them as one problem is where synthetic-data programs run aground.

Rule of thumb. Start by asking what the synthetic data is for. Dev/test does not need the same fidelity as model training, and neither needs the same privacy guarantees as public-release research data.

Synthetic data fidelity vs. real data, by domain:

  • Tabular structured records: 87%
  • Clinical notes (de-identified): 72%
  • Geospatial tracks: 80%
  • Biometric signatures: 55%
  • Network traffic logs: 78%
  • Natural language chat: 65%

Tabular synthesis: the tools

SDV (Synthetic Data Vault)

Open-source Python library (MIT-licensed) with multiple synthesizers. What we actually use:

GaussianCopula

Fast, captures marginal distributions and pairwise correlations. Baseline for any tabular task; useful when the distribution is well-behaved.

CTGAN

Generative adversarial network for complex mixed-type tabular data. Handles categorical imbalance and non-Gaussian marginals. Slower to train, better fidelity on realistic tabular distributions.

TVAE

Variational autoencoder counterpart to CTGAN. Often comparable fidelity, with different failure modes.

HMA (Hierarchical Modeling Algorithm)

Multi-table relational synthesis. Critical for anything beyond a single denormalized table.
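SDV wraps all of these behind a common fit/sample interface, but the GaussianCopula idea is simple enough to sketch without the library: transform each column to normal scores through its empirical CDF, estimate the correlation in that space, sample correlated normals, and map them back through each real column's quantile function. A minimal two-column, library-free sketch on toy numeric data (the concept, not SDV's implementation):

```python
import math
import random
from statistics import NormalDist

ND = NormalDist()  # standard normal, used for CDF / inverse-CDF transforms

def normal_scores(col):
    """Map each value to a standard-normal score via its empirical-CDF rank."""
    rank = {v: i for i, v in enumerate(sorted(col))}  # duplicates collapse; fine for a sketch
    n = len(col)
    return [ND.inv_cdf((rank[v] + 1) / (n + 1)) for v in col]

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def fit_sample_copula(col_a, col_b, n_samples, seed=0):
    """Fit a two-column Gaussian copula and sample synthetic rows."""
    rho = pearson(normal_scores(col_a), normal_scores(col_b))
    rng = random.Random(seed)
    sorted_a, sorted_b = sorted(col_a), sorted(col_b)
    rows = []
    for _ in range(n_samples):
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + math.sqrt(1 - rho * rho) * rng.gauss(0, 1)  # correlated normals
        # Map back through each real column's empirical quantile function.
        a = sorted_a[min(int(ND.cdf(z1) * len(sorted_a)), len(sorted_a) - 1)]
        b = sorted_b[min(int(ND.cdf(z2) * len(sorted_b)), len(sorted_b) - 1)]
        rows.append((a, b))
    return rows
```

Note the sketch's limitation, which is also GaussianCopula's: it captures marginals and pairwise correlation, not higher-order structure. That is exactly the gap CTGAN and TVAE exist to close.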

Gretel.ai

Commercial platform with synthesis, privacy evaluation, and PII detection. A useful managed option; check FedRAMP status before using it in federal production. Quality is good, but weigh vendor lock-in.

MOSTLY AI

Commercial, similar scope. European provenance matters for some federal programs.

YData

Commercial; good for profiling and privacy evaluation as much as synthesis.

Custom GANs / diffusion for tabular

Rarely justified. SDV handles the 80% case. Custom work is for dataset sizes or types where library defaults fall down — very high cardinality categoricals, complex hierarchies, or domain-specific constraints.

Privacy: what actually protects sensitive data

The naive failure mode

A GAN trained on sensitive tabular data will sometimes produce near-exact copies of training records. Without explicit privacy mechanisms the model memorizes. Any "synthetic" record that is a permuted copy of a real record is a data leak, not synthesis.

Differential privacy at synthesis time

DP-CTGAN and PATE-GAN add calibrated noise during training such that no single training record can disproportionately influence the output. You get a provable (epsilon, delta) bound. You also pay in utility — the more privacy, the less signal. Budget carefully.
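The mechanics of calibrated noise are easiest to see on a single count query rather than inside a GAN. A minimal sketch of the Laplace mechanism, the building block behind these approaches (not DP-CTGAN itself):

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as an exponential draw with a random sign."""
    return rng.choice((-1.0, 1.0)) * rng.expovariate(1.0 / scale)

def dp_count(true_count, epsilon, rng):
    """Release a count under epsilon-DP. A count query has sensitivity 1
    (one record changes it by at most 1), so Laplace noise with scale
    1/epsilon suffices. Smaller epsilon = stronger privacy = noisier answer."""
    return true_count + laplace_noise(1.0 / epsilon, rng)
```

The tradeoff is visible directly: at epsilon = 10 the expected absolute error is 0.1; at epsilon = 0.1 it is 10. Every query spends budget, which is why "budget carefully" is not a platitude.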

Post-hoc privacy evaluation

Treat privacy as measurable, not assumed. What to measure:

  • Nearest-neighbor distance. For each synthetic record, distance to the closest real record. Distribution of distances should be similar between synthetic-to-real and real-to-real; if synthetic records are systematically closer to real than real records are to each other, you have memorization.
  • Membership inference attack. Can an attacker with access to the synthetic data distinguish whether a given real record was in the training set? Measured as AUC; closer to 0.5 is better.
  • Attribute inference attack. Given partial record information, can the attacker recover sensitive attributes from the synthetic data? Measured as accuracy vs a random baseline.
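The nearest-neighbor check is the cheapest of the three to stand up. A hedged, library-free sketch assuming small numeric records (a production harness would use an approximate-NN index and a proper hypothesis test rather than a median ratio):

```python
import math

def nn_distance(point, pool, skip_index=None):
    """Euclidean distance from point to its nearest neighbor in pool."""
    best = math.inf
    for i, other in enumerate(pool):
        if i == skip_index:
            continue  # exclude the record itself for real-to-real distances
        best = min(best, math.dist(point, other))
    return best

def memorization_ratio(real, synthetic):
    """Median synthetic-to-real NN distance over median real-to-real NN distance.
    A ratio well below 1 means synthetic records sit closer to real records
    than real records sit to each other: a memorization signal."""
    syn_to_real = [nn_distance(s, real) for s in synthetic]
    real_to_real = [nn_distance(r, real, skip_index=i) for i, r in enumerate(real)]
    median = lambda xs: sorted(xs)[len(xs) // 2]
    return median(syn_to_real) / median(real_to_real)
```

Exact copies drive the ratio toward zero; a healthy synthesizer produces a ratio near or above 1.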

Utility: does it actually work for the job

The canonical utility test: train-on-synthetic, test-on-real (TSTR). Train the downstream model on synthetic data, evaluate on a held-out real test set, compare to the train-on-real baseline. TSTR within 2-5% of baseline is strong utility; wider gaps require investigation.

Additional checks:

Marginal distributions

Every field's distribution in synthetic should match the real distribution (Kolmogorov-Smirnov for continuous, chi-square for categorical).

Pairwise correlations

Real correlation matrix vs synthetic correlation matrix; Frobenius norm of the difference.

Conditional distributions for key fields

P(outcome | demographics) should match. This catches the synthesis that gets marginals right but loses the signal that matters.

Rare-class preservation

Many federal datasets have rare-but-important classes (fraud, critical events). Synthesis should preserve these; measure explicitly.

A synthetic dataset that matches marginals and breaks conditionals is worse than no synthetic data. It passes the naive check and fails silently on the task that pays for it.
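The marginal and correlation checks above can be sketched without SciPy; a library-free illustration (a real harness would use scipy.stats for the KS and chi-square p-values rather than raw statistics):

```python
import math

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
    two empirical CDFs. 0 means identical samples, 1 means fully separated."""
    sa, sb = sorted(a), sorted(b)
    cdf = lambda xs, v: sum(x <= v for x in xs) / len(xs)
    return max(abs(cdf(sa, v) - cdf(sb, v)) for v in sorted(set(sa) | set(sb)))

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def corr_gap(real_cols, synth_cols):
    """Frobenius norm of (real correlation matrix - synthetic correlation matrix)."""
    k = len(real_cols)
    total = 0.0
    for i in range(k):
        for j in range(k):
            total += (pearson(real_cols[i], real_cols[j])
                      - pearson(synth_cols[i], synth_cols[j])) ** 2
    return math.sqrt(total)
```

Run the KS statistic per field and the correlation gap once per dataset; conditional checks are the same KS machinery applied within each demographic slice.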

Text synthesis for federal

Text synthesis is a different problem. The tools:

Template-based generation

For structured documents (forms, reports) with variable slots. Highest control, lowest sophistication. Use for dev/test data.

LLM-generated from a non-sensitive seed corpus

Prompt a strong LLM with a structured schema and examples of the public-document style, then generate at scale. Risk: the model can leak memorized style or content from its own training data.

Fine-tuned small LM on public data with constrained generation

Stronger control than prompting; requires training investment.

Paraphrase real documents after de-identification

Start with real public documents, paraphrase, inject synthetic variables. Useful middle ground.

For any synthetic text corpus, run PII detection (Presidio, Azure Language Service PII) against the entire output before release. LLMs occasionally emit real names, addresses, or numbers even when prompted not to.
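As a minimal illustration of the gate (not a substitute for Presidio or the Azure service, which do contextual NER, not just regex), a deny-list scan over the output corpus; the patterns here are illustrative, not exhaustive:

```python
import re

# Illustrative patterns only. A real release gate runs a full PII engine;
# regex catches formatted identifiers but misses names and free-form addresses.
PII_PATTERNS = {
    "ssn_like": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone_like": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def scan_corpus(records):
    """Return (record_index, pattern_name, match) for every hit.
    Release is gated on this list being empty."""
    hits = []
    for idx, text in enumerate(records):
        for name, pattern in PII_PATTERNS.items():
            for match in pattern.findall(text):
                hits.append((idx, name, match))
    return hits
```

Any nonzero result blocks the release and routes the flagged records to manual review.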

Validation checklist before release

  1. Run nearest-neighbor distance analysis. No synthetic record should be a near-copy of a real record.
  2. Run membership inference test. AUC should be close to 0.5.
  3. Run TSTR on the target downstream task. Utility gap vs real must be within program tolerance.
  4. Run distributional checks (marginals, pairwise, key conditionals).
  5. Run PII detection on every record, text and tabular.
  6. Run a manual review of a random sample with an SME.
  7. Document privacy budget (if DP), synthesis model version, training data version, and evaluation results.
  8. Assign classification to the synthetic dataset explicitly (usually derivative of the source classification, occasionally a lower marking depending on privacy guarantees).

What synthetic data will not solve

Production model training where accuracy matters

Synthetic data is an augmentation, not a replacement. Models trained purely on synthetic data underperform those trained on real data.

Research on rare or edge phenomena

If the phenomenon is rare in the real data, the synthesizer has few examples to learn from; it will either fail to reproduce the pattern or reproduce the few real examples nearly verbatim.

Evaluation of production systems

Evaluate on real data (or real data proxies). Synthetic eval sets miss real failure modes.

Bias elimination

Synthetic data reproduces the biases in the source. Do not sell synthesis as a fairness mechanism without separate bias-mitigation techniques.

Governance

  • The synthesis pipeline itself is treated as a sensitive system: it has access to real data.
  • Synthetic datasets are versioned, hashed, and documented.
  • Release to downstream users is gated on a validation report, not just a synthesis run.
  • Classification: default derivative of source; lower classification only with privacy evaluation that supports the downgrade.
  • Audit log includes who ran the synthesis, which version of the synthesizer, which training data version, and the validation results.
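The versioning and audit bullets can be sketched as a release-record helper; the field names here are our own illustration, not a standard schema:

```python
import datetime
import hashlib
import json

def release_record(dataset_bytes, synthesizer_version, training_data_version,
                   operator, validation_results):
    """Build an audit-ready release record: a content hash plus provenance.
    The hash pins exactly which synthetic dataset the validation applies to."""
    return {
        "dataset_sha256": hashlib.sha256(dataset_bytes).hexdigest(),
        "synthesizer_version": synthesizer_version,
        "training_data_version": training_data_version,
        "operator": operator,
        "validation": validation_results,
        "released_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }

# One JSON line per release in an append-only log is a reasonable shape:
#   log.write(json.dumps(release_record(...), sort_keys=True) + "\n")
```

Because the record hashes the dataset bytes, any post-validation edit to the released file is detectable, which is the property an audit actually needs.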

Where this fits in our practice

We build synthesis pipelines, privacy evaluation harnesses, and governance frameworks for synthetic data programs. See our federated learning post for an adjacent privacy-preserving pattern and our federal data labeling post for the upstream data workflows.

FAQ

When is synthetic data the right answer for federal?
When real data access is a compliance blocker for development, test, demo, or training workflows and the downstream use does not require production-fidelity signal. It is excellent for dev/test environments, documentation, and training samples. It is a weak substitute for real data in model training for production deployment.
What is SDV and how mature is it?
SDV (Synthetic Data Vault) is a Python library with multiple models: GaussianCopula for fast baseline synthesis, CTGAN for complex tabular distributions, HMA for multi-table relational data, and PAR for sequential data. Mature, well-maintained, and used in production.
Is synthetic data automatically private?
No. A naive GAN trained on sensitive data can memorize and reproduce real records. Privacy requires either differentially private synthesis (DP-CTGAN, PATE-GAN) with a bounded epsilon, or post-hoc privacy evaluation (nearest-neighbor distance, membership inference testing). Assume nothing; measure.
What about synthetic text from LLMs?
An LLM can generate synthetic text at scale, but it will reproduce style and sometimes content from training data. For federal documents, use an LLM fine-tuned on non-sensitive public documents with a structured prompt template, validate each output against a deny list of sensitive tokens, and evaluate the synthetic corpus for inadvertent PII leakage before release.
How do you validate synthetic data utility?
Train-on-synthetic, test-on-real (TSTR): train the downstream model on synthetic, evaluate on a real held-out set. Compare to train-on-real baseline. Also compute distributional similarity (marginals, correlations, joint distributions) and conditional-distribution fidelity for the fields that matter downstream.
Can synthetic data replace real data for ML training?
Rarely, at current fidelity. Synthetic data is excellent for augmenting training sets, balancing rare classes, and standing up pipelines before real data is available. It is a weak standalone replacement for real training data when production accuracy matters.

Related insights

Generating synthetic data for a federal development or test workload?

We build synthetic data pipelines with honest privacy evaluation, utility benchmarking, and governance that holds up to audit.