What synthetic data is actually for
Synthetic data serves five common jobs on a federal program:
- unblocking development when production data access takes months
- building realistic demos and documentation that do not expose real records
- augmenting training sets to balance rare classes
- load-testing systems with realistic volume
- fueling third-party research without a data-sharing MOU (memorandum of understanding)
Each job has a different fidelity requirement and a different privacy bar. Treating them as one problem is where synthetic-data programs run aground.
[Figure: synthetic data fidelity vs. real data, by domain]
Tabular synthesis: the tools

SDV (Synthetic Data Vault)
Python library with multiple synthesizers (historically MIT-licensed; check the current license terms before adoption). What we actually use:
GaussianCopula
Fast, captures marginal distributions and pairwise correlations. Baseline for any tabular task; useful when the distribution is well-behaved.
CTGAN
Generative adversarial network for complex mixed-type tabular data. Handles categorical imbalance and non-Gaussian marginals. Slower to train, better fidelity on realistic tabular distributions.
TVAE
Variational autoencoder counterpart to CTGAN. Often comparable fidelity; different failure modes.
HMA (Hierarchical Modeling Algorithm)
Multi-table relational synthesis. Critical for anything beyond a single denormalized table.
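To make the single-table options above concrete, here is a minimal sketch, assuming the SDV 1.x API (sdv.single_table); older releases expose the same models under sdv.tabular with different class names, and the input file name is a placeholder.

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer, GaussianCopulaSynthesizer

# Placeholder input: any single denormalized table.
real = pd.read_csv("claims_extract.csv")

# SDV infers column types; review and correct the metadata before trusting it.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real)

# Baseline: fast, captures marginals and pairwise correlations.
baseline = GaussianCopulaSynthesizer(metadata)
baseline.fit(real)
synthetic_baseline = baseline.sample(num_rows=len(real))

# Higher fidelity on messy mixed-type data; slower to train.
ctgan = CTGANSynthesizer(metadata, epochs=300)
ctgan.fit(real)
synthetic_ctgan = ctgan.sample(num_rows=len(real))
```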
Gretel.ai
Commercial platform with synthesis, privacy evaluation, and PII detection. Useful managed option; check FedRAMP status before federal production. Output quality is good, but weigh vendor lock-in before committing.
MOSTLY AI
Commercial, similar scope. European provenance matters for some federal programs.
YData
Commercial; good for profiling and privacy evaluation as much as synthesis.
Custom GANs / diffusion for tabular
Rarely justified. SDV handles the 80% case. Custom work is for dataset sizes or types where library defaults fall down: very high-cardinality categoricals, complex hierarchies, or domain-specific constraints.
Privacy: what actually protects sensitive data
The naive failure mode
A GAN trained on sensitive tabular data will sometimes produce near-exact copies of training records. Without explicit privacy mechanisms the model memorizes. Any "synthetic" record that is a permuted copy of a real record is a data leak, not synthesis.
Differential privacy at synthesis time
DP-CTGAN and PATE-GAN add calibrated noise during training such that no single training record can disproportionately influence the output. You get a provable (epsilon, delta) bound. You also pay in utility — the more privacy, the less signal. Budget carefully.
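For reference, the guarantee is the standard one: a training mechanism M satisfies (epsilon, delta)-differential privacy if, for any two training sets D and D' differing in a single record and any set of outputs S,

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S] + \delta
```

Smaller epsilon means a stronger guarantee and more injected noise; delta is the small probability that the epsilon bound fails.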
Post-hoc privacy evaluation
Treat privacy as measurable, not assumed. What to measure:
- Nearest-neighbor distance. For each synthetic record, the distance to the closest real record. The distribution of distances should be similar between synthetic-to-real and real-to-real; if synthetic records are systematically closer to real records than real records are to each other, you have memorization (see the sketch after this list).
- Membership inference attack. Can an attacker with access to the synthetic data distinguish whether a given real record was in the training set? Measured as AUC; closer to 0.5 is better.
- Attribute inference attack. Given partial record information, can the attacker recover sensitive attributes from the synthetic data? Measured as accuracy vs a random baseline.
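A minimal sketch of the nearest-neighbor check, assuming both datasets are already numerically encoded and scaled the same way; synthetic_encoded and real_encoded are placeholder arrays.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nn_distances(queries, reference, skip_self=False):
    """Distance from each query row to its nearest row in the reference set."""
    k = 2 if skip_self else 1
    nn = NearestNeighbors(n_neighbors=k).fit(reference)
    dist, _ = nn.kneighbors(queries)
    return dist[:, -1]  # second neighbor when queries == reference (skips the self-match)

# Placeholder arrays: identically encoded and scaled numeric matrices.
synth_to_real = nn_distances(synthetic_encoded, real_encoded)
real_to_real = nn_distances(real_encoded, real_encoded, skip_self=True)

# Red flag: synthetic records systematically closer to real records
# than real records are to each other (memorization / near-copies).
print("synthetic-to-real 5th percentile:", np.percentile(synth_to_real, 5))
print("real-to-real      5th percentile:", np.percentile(real_to_real, 5))
```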
Utility: does it actually work for the job
The canonical utility test: train-on-synthetic, test-on-real (TSTR). Train the downstream model on synthetic data, evaluate on a held-out real test set, compare to the train-on-real baseline. TSTR within 2-5% of baseline is strong utility; wider gaps require investigation.
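A minimal TSTR sketch for a binary classification task; the gradient-boosted model stands in for whatever the program's downstream model actually is, and the variable names are placeholders.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

def fit_and_score(train_X, train_y, test_X, test_y):
    """Train on one dataset, report AUC on another."""
    model = GradientBoostingClassifier().fit(train_X, train_y)
    return roc_auc_score(test_y, model.predict_proba(test_X)[:, 1])

# The held-out REAL test set is never shown to the synthesizer or either model.
auc_real = fit_and_score(real_train_X, real_train_y, real_test_X, real_test_y)
auc_synth = fit_and_score(synth_X, synth_y, real_test_X, real_test_y)

print(f"train-on-real  AUC: {auc_real:.3f}")
print(f"train-on-synth AUC: {auc_synth:.3f} (gap: {auc_real - auc_synth:.3f})")
```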
Additional checks (a combined sketch follows the list):
Marginal distributions
Every field's distribution in the synthetic data should match the real distribution (Kolmogorov-Smirnov for continuous, chi-square for categorical).
Pairwise correlations
Real correlation matrix vs synthetic correlation matrix; Frobenius norm of the difference.
Conditional distributions for key fields
P(outcome | demographics) should match. This catches the synthesis that gets marginals right but loses the signal that matters.
Rare-class preservation
Many federal datasets have rare-but-important classes (fraud, critical events). Synthesis should preserve these; measure explicitly.
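A combined sketch of the marginal and correlation checks using scipy and pandas; thresholds and pass/fail criteria are left to the program.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, ks_2samp

def marginal_checks(real: pd.DataFrame, synth: pd.DataFrame) -> dict:
    """Per-column test statistic and p-value: KS for numeric, chi-square for categorical."""
    results = {}
    for col in real.columns:
        if pd.api.types.is_numeric_dtype(real[col]):
            res = ks_2samp(real[col].dropna(), synth[col].dropna())
            results[col] = ("ks", res.statistic, res.pvalue)
        else:
            # Contingency table of category counts, one column per dataset.
            table = (
                pd.concat([real[col].value_counts(), synth[col].value_counts()], axis=1)
                .fillna(0)
                .values
            )
            chi2, p, _, _ = chi2_contingency(table)
            results[col] = ("chi2", chi2, p)
    return results

def correlation_gap(real: pd.DataFrame, synth: pd.DataFrame) -> float:
    """Frobenius norm of the difference between correlation matrices (numeric columns only)."""
    num = real.select_dtypes("number").columns
    return float(np.linalg.norm(real[num].corr().values - synth[num].corr().values, ord="fro"))
```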
Text synthesis for federal
Text synthesis is a different problem. The tools:
Template-based generation
For structured documents (forms, reports) with variable slots. Highest control, lowest sophistication. Use for dev/test data.
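A minimal template sketch using the Faker library; the slot schema and field names are hypothetical, and the point is control rather than realism.

```python
from datetime import date
from faker import Faker

fake = Faker("en_US")
Faker.seed(42)  # reproducible dev/test fixtures

def fake_claim_record() -> dict:
    """One record for a hypothetical benefits-claim style form."""
    return {
        "claim_id": fake.bothify("CLM-####-????"),
        "claimant_name": fake.name(),
        "address": fake.address().replace("\n", ", "),
        "filing_date": fake.date_between(date(2020, 1, 1), date(2024, 12, 31)).isoformat(),
        "narrative": (
            f"Claimant reports the incident occurred near {fake.city()} "
            f"and requests expedited review."
        ),
    }

records = [fake_claim_record() for _ in range(1_000)]
```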
LLM-generated from a non-sensitive seed corpus
Prompt a strong LLM with a structured schema and examples of the public-document style, then generate at scale. Risk: the model may reproduce memorized content from its own training data, leaking real style or content.
Fine-tuned small LM on public data with constrained generation
Stronger control than prompting; requires training investment.
Paraphrase real documents after de-identification
Start with real public documents, paraphrase, inject synthetic variables. Useful middle ground.
For any synthetic text corpus, run PII detection (Presidio, Azure Language Service PII) against the entire output before release. LLMs occasionally emit real names, addresses, or numbers even when prompted not to.
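A minimal Presidio sketch for that gate; synthetic_corpus is a placeholder list of strings, and the default analyzer needs a spaCy English model installed.

```python
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()  # default NLP engine (spaCy) for English

def pii_findings(texts, threshold=0.5):
    """Return (record index, entity type, matched text) for every hit above the score threshold."""
    findings = []
    for i, text in enumerate(texts):
        for result in analyzer.analyze(text=text, language="en"):
            if result.score >= threshold:
                findings.append((i, result.entity_type, text[result.start:result.end]))
    return findings

# Gate the release: any hit goes to manual review before the corpus ships.
hits = pii_findings(synthetic_corpus)
```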
Validation checklist before release
- Run nearest-neighbor distance analysis. No synthetic record should be a near-copy of a real record.
- Run membership inference test. AUC should be close to 0.5.
- Run TSTR on the target downstream task. Utility gap vs real must be within program tolerance.
- Run distributional checks (marginals, pairwise, key conditionals).
- Run PII detection on every record, text and tabular.
- Run a manual review of a random sample with an SME.
- Document privacy budget (if DP), synthesis model version, training data version, and evaluation results (a minimal manifest sketch follows this checklist).
- Assign classification to the synthetic dataset explicitly (usually derivative of the source classification, occasionally a lower marking depending on privacy guarantees).
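A minimal sketch of the documentation and hashing steps above; every field name, file name, and value here is illustrative, not a mandated schema.

```python
import hashlib
import json
from datetime import datetime, timezone

def sha256_of(path: str) -> str:
    """Hash the released file so downstream users can verify what they received."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

manifest = {
    "dataset": "synthetic_claims_v3",
    "released_at": datetime.now(timezone.utc).isoformat(),
    "sha256": sha256_of("synthetic_claims_v3.parquet"),
    "synthesizer": {"library": "sdv", "model": "CTGANSynthesizer", "version": "1.x"},
    "training_data_version": "claims_extract_2024_06",
    "privacy": {"epsilon": None, "nn_distance_pass": True, "membership_auc": 0.52},
    "utility": {"tstr_auc_gap": 0.03},
    "classification": "derivative of source",
}
with open("synthetic_claims_v3.manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```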
What synthetic data will not solve
Production model training where accuracy matters
Synthetic data is an augmentation, not a replacement. Models trained purely on synthetic data typically underperform models trained on real data.
Research on rare or edge phenomena
If the phenomenon is rare in the real data, the synthesizer has few examples to learn from; it will either fail to reproduce the pattern or effectively copy the handful of real examples, which is itself a privacy risk.
Evaluation of production systems
Evaluate on real data (or real data proxies). Synthetic eval sets miss real failure modes.
Bias elimination
Synthetic data reproduces the biases in the source. Do not sell synthesis as a fairness mechanism without separate bias-mitigation techniques.
Governance
- The synthesis pipeline itself is treated as a sensitive system: it has access to real data.
- Synthetic datasets are versioned, hashed, and documented.
- Release to downstream users is gated on a validation report, not just a synthesis run.
- Classification: default derivative of source; lower classification only with privacy evaluation that supports the downgrade.
- Audit log includes who ran the synthesis, which version of the synthesizer, which training data version, and the validation results.
Where this fits in our practice
We build synthesis pipelines, privacy evaluation harnesses, and governance frameworks for synthetic data programs. See our federated learning post for an adjacent privacy-preserving pattern and our federal data labeling post for the upstream data workflows.