Steven's Knowledge

Dataset Curation

The data is the model — and most of the work is data work

In every meaningful AI project, the dataset ends up being the dominant lever. Models converge to the quality of their training data; evals reflect the quality of their eval data; fine-tunes only ever get as good as the examples behind them. Curation is the practice of treating data as the first-class artifact it is.

Why Curation Matters More Than Volume

A small, clean, representative dataset often outperforms a much larger, noisy one. Large public corpora are full of duplicates, junk, biases, and content irrelevant to the target task. Models trained on raw scraped data frequently underperform models trained on a carefully filtered fraction of it, and evals built on raw data produce scores that are hard to trust.

This applies to fine-tuning, eval sets, RAG corpora, and prompt few-shot examples alike.

What "Quality" Means

Quality is task-dependent. For most LLM work, you're after some combination of:

  • Correctness. Labels reflect ground truth.
  • Coverage. The data spans the situations the model will see.
  • Cleanliness. No duplicates, no garbage, consistent formatting.
  • Relevance. Each example earns its spot — no filler.
  • Balance. Important categories aren't drowned out by common ones.

A useful working definition: data that, looked at example by example, makes you nod rather than wince.

The Curation Loop

Curation is iterative, not a single pass:

  1. Sample. Pull a small batch (a few hundred examples).
  2. Inspect. Read them. Look at the actual examples, not just statistics.
  3. Categorize problems. Where does the data fall short?
  4. Filter or fix. Remove what's broken, correct what's recoverable.
  5. Repeat. Sample a new batch and check whether the issues are resolved.

A team that keeps iterating on its data through this loop will usually out-ship a team that ran one massive cleanup script and moved on.
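A minimal sketch of the sample and categorize steps, using only the Python standard library. The function names (`sample_batch`, `tally_problems`) and the problem tags are hypothetical, chosen for illustration; the inspection itself still has to be done by a person reading the examples.

```python
import random
from collections import Counter


def sample_batch(records, k=200, seed=0):
    """Draw a reproducible random batch for manual inspection.

    A fixed seed means a reviewer can re-draw exactly the same batch later.
    """
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


def tally_problems(notes):
    """Count the problem tags a reviewer assigned while reading a batch.

    `notes` maps an example id to the list of tags given to that example
    (an empty list means the example looked fine).
    """
    counts = Counter()
    for tags in notes.values():
        counts.update(tags)
    return counts.most_common()
```

After reading a batch, a reviewer might pass something like `{"ex1": ["duplicate"], "ex2": ["duplicate", "wrong_label"], "ex3": []}` to `tally_problems` to see which defect to fix first, then draw the next batch with a different seed.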

Common Defects to Hunt For

  • Near-duplicates. Trivial paraphrases inflate apparent dataset size and bias the model toward repeated content.
  • Label noise. Wrong answers, inconsistent rubrics, annotator drift over time.
  • Leakage. Test set examples that overlap training data. Especially insidious with web-scraped data.
  • Length bias. All short examples or all long ones; models learn the bias as a feature.
  • Spurious correlations. A non-causal feature that the model ends up relying on.
  • Toxic content. Especially in RAG corpora and fine-tuning data scraped from the wild.

Each one has known detection patterns; finding them is mostly a matter of looking.
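Two of these defects lend themselves to quick checks. The sketch below assumes each example is a `(text, label)` pair, which is an assumption about the data layout rather than anything the defects require; the function names are hypothetical. It flags one symptom of label noise (the same input carrying different labels) and summarizes example lengths so a skewed length distribution is visible at a glance.

```python
import statistics
from collections import defaultdict


def conflicting_labels(examples):
    """Return normalized inputs that appear with more than one label."""
    labels_by_text = defaultdict(set)
    for text, label in examples:
        # Lowercase and collapse whitespace so trivial variants compare equal.
        key = " ".join(text.lower().split())
        labels_by_text[key].add(label)
    return {
        text: sorted(labels)
        for text, labels in labels_by_text.items()
        if len(labels) > 1
    }


def length_summary(examples):
    """Summarize example lengths in whitespace-separated words."""
    lengths = [len(text.split()) for text, _ in examples]
    return {
        "min": min(lengths),
        "median": statistics.median(lengths),
        "max": max(lengths),
    }
```

Neither check proves a defect on its own. Conflicting labels can reflect genuine ambiguity, and a narrow length range may be correct for the task; both are prompts to go and read the flagged examples.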

Filtering Pipelines

For large datasets, automated filtering is unavoidable. Common stages:

  • Deduplication. Exact-match and near-duplicate detection (MinHash, embeddings).
  • Language filtering. Keep only the languages you care about.
  • Quality classifiers. Trained models that score "is this good" — fast, imperfect, useful at scale.
  • Heuristics. Length bounds, repetition ratios, junk-character ratios.
  • Deduplication against eval sets. Remove training examples that match or closely paraphrase eval examples. Critical to prevent leakage.

Layer them. Each catches what the previous one missed.
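The layering can be sketched as a single pass over the corpus. This is a simplified illustration, not a production pipeline: it uses exact Jaccard similarity over word 3-gram shingles (the quantity MinHash approximates) instead of MinHash itself, it omits the language-filter and quality-classifier stages, and all thresholds and function names are assumptions chosen for the example.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def shingles(text, n=3):
    """Return the set of word n-grams in the normalized text."""
    words = normalize(text).split()
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def passes_heuristics(text, min_words=3, max_words=2000, max_repeat_ratio=0.5):
    """Length bounds plus a crude repetition ratio (share of repeated words)."""
    words = normalize(text).split()
    if not (min_words <= len(words) <= max_words):
        return False
    repeat_ratio = 1 - len(set(words)) / len(words)
    return repeat_ratio <= max_repeat_ratio


def filter_corpus(docs, eval_texts, near_dup_threshold=0.8):
    """Apply heuristics, exact dedup, eval-set dedup, then near-dup dedup."""
    eval_shingles = [shingles(t) for t in eval_texts]
    seen_hashes = set()
    kept, kept_shingles = [], []
    for doc in docs:
        if not passes_heuristics(doc):
            continue
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        sh = shingles(doc)
        if any(jaccard(sh, e) >= near_dup_threshold for e in eval_shingles):
            continue  # overlaps the eval set: would leak
        if any(jaccard(sh, k) >= near_dup_threshold for k in kept_shingles):
            continue  # near-duplicate of a document already kept
        seen_hashes.add(digest)
        kept.append(doc)
        kept_shingles.append(sh)
    return kept
```

The pairwise comparison is quadratic in the number of kept documents, so it only suits small corpora. At scale, the same structure holds but the near-duplicate step is typically replaced by MinHash with locality-sensitive hashing or by an embedding index.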

Eval Set Curation Is Different

For eval sets, prioritize differently:

  • Representativeness. The eval should reflect production distribution, not a convenient sample.
  • Difficulty spread. Easy, medium, hard cases. All-easy evals can't show progress.
  • Edge cases. Real failures from production are worth their weight in gold.
  • Stability over time. An eval set that drifts can't measure progress.
  • Smaller is fine. A few hundred curated examples beat thousands of mediocre ones.

A locked, versioned eval set is the foundation for every other improvement you'll make.
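One lightweight way to enforce the lock is to record a content fingerprint for each released version and refuse to run when the data no longer matches. The sketch below assumes eval examples are JSON-serializable dicts; the function names are hypothetical.

```python
import hashlib
import json


def eval_set_fingerprint(examples):
    """Return a SHA-256 fingerprint of the eval set's content.

    Examples are serialized with sorted keys and then sorted, so the
    fingerprint does not depend on key order or on example order.
    """
    canonical = sorted(
        json.dumps(ex, sort_keys=True, ensure_ascii=False) for ex in examples
    )
    return hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()


def check_locked(examples, expected_fingerprint):
    """Raise ValueError if the eval set differs from the locked version."""
    actual = eval_set_fingerprint(examples)
    if actual != expected_fingerprint:
        raise ValueError(
            f"Eval set changed: expected {expected_fingerprint[:12]}, "
            f"got {actual[:12]}. Release a new version instead of editing in place."
        )
```

Storing the fingerprint alongside every reported score makes it clear which version of the eval set a number refers to.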

Curation as a Discipline

Mature teams build muscle here:

  • Curation gets dedicated time, not "whoever has bandwidth."
  • Data inspections are part of the review cycle for any data-touching change.
  • Metrics on dataset health (size, dedup rate, label noise, coverage gaps) live in dashboards.
  • New failures from production loop back into the eval set automatically.
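A dashboard needs numbers to plot. The following sketch computes a few of the health metrics mentioned above, assuming each example is a dict with `text` and `category` fields; the field names, the `min_per_category` threshold, and the function name are illustrative assumptions. Label-noise rates are left out because they need a second source of labels to compare against.

```python
from collections import Counter


def dataset_health(examples, expected_categories, min_per_category=50):
    """Compute size, exact-duplicate rate, and under-covered categories."""
    total = len(examples)
    # Normalize text so trivial whitespace and case variants count as duplicates.
    unique_texts = {" ".join(ex["text"].lower().split()) for ex in examples}
    counts = Counter(ex["category"] for ex in examples)
    gaps = sorted(
        c for c in expected_categories if counts[c] < min_per_category
    )
    return {
        "size": total,
        "exact_dup_rate": (1 - len(unique_texts) / total) if total else 0.0,
        "coverage_gaps": gaps,
    }
```

Running this on every dataset change and logging the result gives reviewers a trend line rather than a one-off snapshot.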

The teams that ship the best AI products usually have the best data hygiene. Not coincidentally.
