Steven's Knowledge

Synthetic Data

Using models to generate the data that trains other models: when it works and when it backfires

Synthetic data — examples generated by a model rather than collected from the real world — has gone from research curiosity to a core tool in modern AI development. It powers fine-tuning datasets, eval sets, edge case coverage, and entire training corpora for frontier models. It also fails in subtle ways that can quietly degrade quality. Knowing when to reach for it matters.

What It's Used For

  • Fine-tuning data. Generating instruction/response pairs for a target task.
  • Eval set augmentation. Filling in coverage gaps that real data misses.
  • Edge case generation. Producing rare-but-important scenarios.
  • Distillation. Capturing a frontier model's behavior in a smaller model's training set.
  • Privacy-preserving datasets. Standing in for real user data that can't be used directly.
  • Bootstrapping. Generating starter data for tasks where no labeled examples exist yet.

Why It Works

Modern frontier models are good enough that their outputs, used carefully, train smaller models effectively. The smaller model doesn't need to discover capability from scratch; it learns to reproduce the teacher's behavior on the specific task. A cheaper variant uses the model only for the shape of the inputs, not for the answers: synthetic prompts paired with a real labeling step can produce diverse training data quickly.

The Standard Patterns

Self-instruct / instruction generation. Prompt a strong model to generate instruction/response pairs in a target domain. Filter and you have a fine-tuning corpus.
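The generate-then-filter loop described above can be sketched as follows. This is a minimal illustration, not a reference implementation: `generate` stands in for whatever model call you use (it is a hypothetical callable, not a real library API), and `non_trivial` is an assumed example filter.

```python
from typing import Callable, Iterable


def build_corpus(
    seed_topics: Iterable[str],
    generate: Callable[[str], dict],  # hypothetical model call: prompt -> pair
    keep: Callable[[dict], bool],     # filter applied to every generated pair
    per_topic: int = 3,
) -> list[dict]:
    """Generate instruction/response pairs per topic and keep the ones that pass."""
    corpus = []
    for topic in seed_topics:
        prompt = (
            f"Write one realistic user instruction about {topic} "
            "and a correct, complete response."
        )
        for _ in range(per_topic):
            pair = generate(prompt)
            if keep(pair):
                corpus.append(pair)
    return corpus


def non_trivial(pair: dict) -> bool:
    """Assumed example filter: both fields present, response is not a stub."""
    return bool(pair.get("instruction")) and len(pair.get("response", "")) >= 20
```

In practice the filter is where most of the quality comes from, so it usually grows well beyond a length check.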

Synthetic eval generation. For each existing example, prompt a model to generate variations or adjacent cases. Fills holes in eval coverage.

Adversarial generation. Specifically prompt a model to produce hard cases — edge cases, near-failures, tricky inputs. Useful for stress-testing.

Persona-driven generation. Vary a "user persona" prompt to get diverse phrasings of similar requests.
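One simple way to vary the persona is to cross a list of personas with a list of underlying requests and build one generation prompt per combination. The sketch below assumes a prompt wording and example personas that are purely illustrative.

```python
from itertools import product


def persona_prompts(personas: list[str], requests: list[str]) -> list[str]:
    """Build one generation prompt for every (persona, request) combination."""
    return [
        f"You are {persona}. In your own words, ask for help with: {request}"
        for persona, request in product(personas, requests)
    ]


# Illustrative values only; real personas should reflect your actual users.
prompts = persona_prompts(
    ["a busy nurse", "a first-year student"],
    ["resetting a password", "exporting a report"],
)
```

Each prompt is then sent to the generator, so two personas and two requests yield four differently phrased variants of the same underlying needs.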

Back-translation and rewriting. Take real examples and rewrite them in different styles or languages. This adds diversity while staying close to the real distribution.

Where It Goes Wrong

Mode collapse. A generator model tends to sample a narrow slice of the distribution. The resulting dataset can look like 1,000 examples while covering perhaps 50 distinct patterns. Filtering for diversity is mandatory.
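A cheap first check for this kind of collapse is a distinct-n ratio: the number of unique n-grams divided by the total number of n-grams across the dataset. The sketch below assumes whitespace tokenization, which is crude. A low ratio is a warning sign rather than proof, and what counts as low depends on the task.

```python
def distinct_n(texts: list[str], n: int = 2) -> float:
    """Ratio of unique n-grams to total n-grams across all texts (0.0 if none)."""
    unique = set()
    total = 0
    for text in texts:
        tokens = text.lower().split()  # crude whitespace tokenization (assumption)
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0
```

Tracking this number per generation batch makes it easier to notice when a prompt change narrows the output.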

Inherited errors. The synthetic generator's biases and mistakes become the student's. If the teacher is wrong about a class of inputs, the student learns to be wrong the same way.

Distribution drift. Synthetic data drifts away from real user data in ways that aren't visible until production. The model performs well on the synthetic eval and poorly in the wild.

Self-preference loops. Using a model to generate data, then using the same model to evaluate it, creates a feedback loop that selects for output the model likes — not necessarily output that's correct.

Confidence inflation. Synthetic data tends to be more uniformly confident and clean than real data. Models trained on it can be poorly calibrated when they encounter the real, messier world.
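Calibration can be measured on real held-out examples with expected calibration error (ECE): bucket predictions by stated confidence, then compare each bucket's average confidence with its actual accuracy. The sketch below assumes you already have a confidence in [0, 1] and a correctness flag per prediction; how you obtain those is outside its scope.

```python
def expected_calibration_error(
    confidences: list[float], correct: list[bool], n_bins: int = 10
) -> float:
    """Weighted average gap between confidence and accuracy over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece
```

Comparing this value before and after adding synthetic data is one way to see whether the student has become overconfident.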

Mitigations

  • Diversity filtering. Embed generated examples; cluster; keep one per cluster. Aggressive but effective.
  • Mix synthetic and real. Even a small fraction of real examples anchors the distribution.
  • Different generator and judge. Don't have the same model produce and grade the data.
  • Human spot-checks. Sample generated data and have humans review it. This is the expensive but reliable signal.
  • Hold out a real eval set. Synthetic data should never touch this held-out set. Final quality is judged on real-world examples only.
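The diversity filter from the list above can be sketched with a greedy pass that keeps one example per group of near-duplicates. To stay dependency-free, this sketch swaps in token-set Jaccard similarity where a real pipeline would use embeddings and a proper clustering step. The threshold of 0.6 is an assumed starting point, not a recommendation.

```python
def jaccard(a: set, b: set) -> float:
    """Overlap between two token sets, from 0.0 (disjoint) to 1.0 (identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def keep_one_per_cluster(examples: list[str], threshold: float = 0.6) -> list[str]:
    """Greedy filter: keep an example only if it is not too similar to any kept one.

    Stand-in for embed + cluster: similarity here is token overlap, not an
    embedding distance (assumption made to avoid external dependencies).
    """
    leaders = []  # (token_set, example) for each kept representative
    for example in examples:
        tokens = set(example.lower().split())
        if all(jaccard(tokens, kept) < threshold for kept, _ in leaders):
            leaders.append((tokens, example))
    return [example for _, example in leaders]
```

The first example seen in each group becomes its representative, so shuffling or sorting by quality before filtering changes which examples survive.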

Generation Quality Matters

Garbage synthetic data is worse than no synthetic data. Treat the generation pipeline like any other engineering problem:

  • Strong generator. Use the best model you can afford for generation; the data quality compounds.
  • Careful prompting. The prompts that produce the synthetic data are themselves prompts to engineer and version.
  • Validation step. Programmatic checks (does it parse? does it fit the schema? does it actually answer the question?) before any example enters the dataset.
  • Diversity targets. Generate more than you need; sample down for diversity.
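The programmatic part of the validation step can be sketched as a gate that every raw generation must pass. The schema here (`instruction` and `response` string fields in a JSON object) is an assumption for illustration. Checking whether a response actually answers the question usually needs a separate judge or human review and is not covered.

```python
import json

# Hypothetical schema: field name -> required type.
REQUIRED_FIELDS = {"instruction": str, "response": str}


def validate(raw: str) -> tuple[bool, str]:
    """Return (passed, reason) for one raw generated record."""
    try:
        record = json.loads(raw)  # does it parse?
    except json.JSONDecodeError:
        return False, "does not parse"
    if not isinstance(record, dict):
        return False, "not an object"
    for field, kind in REQUIRED_FIELDS.items():  # does it fit the schema?
        if not isinstance(record.get(field), kind):
            return False, f"missing or mistyped field: {field}"
    if not record["response"].strip():
        return False, "empty response"
    return True, "ok"
```

Logging the rejection reasons gives a quick view of which failure modes a prompt change introduces or removes.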

A Realistic Mental Model

Synthetic data is a force multiplier for the real data and capability you already have. It expands coverage, fills gaps, and accelerates iteration. It does not replace real user signal. The teams that get the most out of synthetic data are the ones that already have good real data and use synthetic to extend it — not the ones trying to substitute for it.
