Dataset Curation
The data is the model — and most of the work is data work
In most AI projects, the dataset ends up being the dominant lever. Model quality is bounded by the quality of the training data; eval results are only as trustworthy as the eval data; fine-tunes get only as good as the examples behind them. Curation is the practice of treating data as the first-class artifact it is.
Why Curation Matters More Than Volume
A small, clean, representative dataset often outperforms a much larger, noisy one. Large public corpora are full of duplicates, junk, biases, and content irrelevant to the target task, so models trained or evaluated on raw scraped data frequently underperform models trained on a carefully filtered fraction of it.
This applies to fine-tuning data, eval sets, RAG corpora, and few-shot examples in prompts alike.
What "Quality" Means
Quality is task-dependent. For most LLM work, you're after some combination of:
- Correctness. Labels reflect ground truth.
- Coverage. The data spans the situations the model will see.
- Cleanliness. No duplicates, no garbage, consistent formatting.
- Relevance. Each example earns its spot — no filler.
- Balance. Important categories aren't drowned out by common ones.
A useful working definition: data that, looked at example by example, makes you nod rather than wince.
The Curation Loop
Curation is iterative, not a single pass:
- Sample. Pull a small batch (a few hundred examples).
- Inspect. Read them. Look at the actual examples, not just statistics.
- Categorize problems. Where does the data fall short?
- Filter or fix. Remove what's broken, correct what's recoverable.
- Repeat. Sample a new batch and check whether the issues are resolved.
A team that keeps iterating on data through this loop will usually out-ship a team that ran one massive cleanup script and moved on.
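The sample and categorize steps of the loop can be sketched in a few lines. This is a minimal illustration, not a prescribed tool: the function names, the default batch size of 200, and the problem tags are all assumptions made for the example.

```python
import random
from collections import Counter


def sample_for_inspection(examples, k=200, seed=0):
    """Return a reproducible random sample of up to k examples."""
    rng = random.Random(seed)  # fixed seed so reviewers see the same batch
    if len(examples) <= k:
        return list(examples)
    return rng.sample(examples, k)


def tally_problems(notes):
    """Count the problem tags recorded while reading a batch.

    `notes` maps an example id to a list of tags. The tag names
    (e.g. "wrong_label", "truncated") are hypothetical.
    """
    counts = Counter()
    for tags in notes.values():
        counts.update(tags)
    return counts.most_common()


if __name__ == "__main__":
    examples = [{"id": i, "text": f"example {i}"} for i in range(1000)]
    batch = sample_for_inspection(examples, k=200, seed=1)
    # Tags written down by a human while reading the batch.
    notes = {3: ["wrong_label"], 17: ["truncated", "wrong_label"], 42: []}
    print(len(batch), tally_problems(notes))
```

Changing the seed on each pass gives a fresh batch, which is what the repeat step needs: the tally from the new batch shows whether the most common problems actually went away.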
Common Defects to Hunt For
- Near-duplicates. Trivial paraphrases inflate apparent dataset size and bias the model toward repeated content.
- Label noise. Wrong answers, inconsistent rubrics, annotator drift over time.
- Leakage. Test-set examples that overlap the training data. Especially insidious with web-scraped data, where benchmark content is often reposted.
- Length bias. All short examples or all long ones; models learn the bias as a feature.
- Spurious correlations. A non-causal feature that the model ends up relying on.
- Toxic content. Especially in RAG corpora and fine-tuning data scraped from the wild.
Each one has known detection patterns; finding them is mostly a matter of looking.
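As one example of such a detection pattern, near-duplicates can be flagged by comparing sets of word n-grams ("shingles") with Jaccard similarity. The sketch below is an assumption-laden illustration: the function names, the 3-word shingle size, and the 0.6 threshold are chosen for the example and would need tuning on real data.

```python
import re
from itertools import combinations


def shingles(text, n=3):
    """Return the set of lowercase word n-grams in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        # Texts shorter than n words become a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicate_pairs(texts, threshold=0.6, n=3):
    """Return (i, j, score) for every pair at or above the threshold."""
    sets = [shingles(t, n) for t in texts]
    pairs = []
    for i, j in combinations(range(len(texts)), 2):
        score = jaccard(sets[i], sets[j])
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs


if __name__ == "__main__":
    texts = [
        "The quick brown fox jumps over the lazy dog",
        "The quick brown fox jumps over the lazy dog today",
        "Completely different sentence about data curation",
    ]
    for i, j, score in near_duplicate_pairs(texts):
        print(i, j, round(score, 3))
```

Comparing every pair is quadratic in the number of examples, so this only suits an inspection batch. At corpus scale, MinHash approximates the same Jaccard similarity without materializing every pair.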
Filtering Pipelines
For large datasets, automated filtering is unavoidable. Common stages:
- Deduplication. Exact-match and near-duplicate detection (for example, MinHash or embedding similarity).
- Language filtering. Keep only the languages you care about.
- Quality classifiers. Trained models that score "is this good" — fast, imperfect, useful at scale.
- Heuristics. Length bounds, repetition ratios, junk-character ratios.
- Deduplication against eval sets. Critical to prevent leakage.
Layer them. Each catches what the previous one missed.
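A layered pipeline might look like the following sketch, which chains three of the stages above: exact deduplication, heuristics, and deduplication against an eval set. Everything specific here is an assumption for illustration: the function names, the thresholds, the use of the most frequent word's share as the repetition ratio, and the 8-word n-gram overlap test for eval contamination. Language filtering and quality classifiers are left out because they depend on external models.

```python
import hashlib
import re
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def passes_heuristics(text, min_words=5, max_words=2000,
                      max_top_word_ratio=0.3, max_junk_ratio=0.3):
    """Apply length, repetition, and junk-character checks."""
    words = text.lower().split()
    if not words or not (min_words <= len(words) <= max_words):
        return False
    # Repetition: share of the text taken up by its single most frequent word.
    top_count = Counter(words).most_common(1)[0][1]
    if top_count / len(words) > max_top_word_ratio:
        return False
    # Junk: characters that are neither alphanumeric nor whitespace.
    junk = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return junk / len(text) <= max_junk_ratio


def ngrams(text, n=8):
    """Return the set of lowercase word n-grams in text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def filter_corpus(texts, eval_texts, n=8):
    """Run the stages in order; return kept texts and per-stage drop counts."""
    eval_grams = set()
    for t in eval_texts:
        eval_grams |= ngrams(t, n)

    seen, kept = set(), []
    dropped = {"duplicate": 0, "heuristics": 0, "eval_overlap": 0}
    for text in texts:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest in seen:
            dropped["duplicate"] += 1
            continue
        seen.add(digest)
        if not passes_heuristics(text):
            dropped["heuristics"] += 1
            continue
        if ngrams(text, n) & eval_grams:
            dropped["eval_overlap"] += 1
            continue
        kept.append(text)
    return kept, dropped
```

Returning the per-stage drop counts alongside the kept texts is a deliberate choice: a stage that suddenly drops far more or far fewer examples than usual is a signal to go back and read a sample of what it removed.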
Eval Set Curation Is Different
For eval sets, prioritize differently:
- Representativeness. The eval should reflect production distribution, not a convenient sample.
- Difficulty spread. Easy, medium, hard cases. All-easy evals can't show progress.
- Edge cases. Real failures from production are worth their weight in gold.
- Stability. An eval set that drifts can't measure progress.
- Smaller is fine. A few hundred curated examples beat thousands of mediocre ones.
A locked, versioned eval set is the foundation for every other improvement you'll make.
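One lightweight way to enforce a lock is to record a content fingerprint for each eval set version and refuse to run when it no longer matches. The sketch below is an assumption for illustration, including the function names and the choice of SHA-256 over canonical JSON; it presumes each example is a JSON-serializable dict.

```python
import hashlib
import json


def eval_fingerprint(examples):
    """Return a SHA-256 hex digest that ignores example order and key order."""
    canonical = sorted(
        json.dumps(ex, sort_keys=True, ensure_ascii=False) for ex in examples
    )
    digest = hashlib.sha256()
    for line in canonical:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def assert_locked(examples, expected):
    """Raise ValueError if the eval set no longer matches its recorded fingerprint."""
    actual = eval_fingerprint(examples)
    if actual != expected:
        raise ValueError(
            f"eval set changed: expected {expected[:12]}, got {actual[:12]}"
        )


if __name__ == "__main__":
    eval_set = [
        {"input": "2 + 2", "expected": "4"},
        {"input": "capital of France", "expected": "Paris"},
    ]
    locked = eval_fingerprint(eval_set)  # store this next to the version tag
    assert_locked(list(reversed(eval_set)), locked)  # reordering is allowed
    print(locked[:12])
```

Storing the fingerprint with every reported score makes it possible to tell later whether two results were measured on the same eval set.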
Curation as a Discipline
Mature teams build muscle here:
- Curation gets dedicated time, not "whoever has bandwidth."
- Data inspections are part of the review cycle for any data-touching change.
- Metrics on dataset health (size, dedup rate, label noise, coverage gaps) live in dashboards.
- New failures from production loop back into the eval set automatically, each batch landing as a new version rather than a silent edit.
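The dataset health numbers mentioned above can start as a small function that runs on every data change. The sketch below is illustrative only: the function name, the row schema with "text" and "label" keys, and the use of conflicting labels on identical text as a rough proxy for label noise are all assumptions.

```python
from collections import Counter, defaultdict


def dataset_health(rows, expected_labels, min_per_label=2):
    """Compute simple health metrics for a labeled text dataset.

    Each row is a dict with "text" and "label" keys (hypothetical schema).
    """
    # Normalize case and whitespace so trivial variants count as duplicates.
    texts = [" ".join(r["text"].lower().split()) for r in rows]
    size = len(rows)
    unique = len(set(texts))

    # Identical text carrying different labels is a cheap label-noise signal.
    labels_by_text = defaultdict(set)
    for text, row in zip(texts, rows):
        labels_by_text[text].add(row["label"])
    conflicting = sum(1 for ls in labels_by_text.values() if len(ls) > 1)

    # Coverage gaps: expected labels with too few examples.
    counts = Counter(r["label"] for r in rows)
    gaps = sorted(l for l in expected_labels if counts[l] < min_per_label)

    return {
        "size": size,
        "duplicate_rate": (size - unique) / size if size else 0.0,
        "conflicting_label_texts": conflicting,
        "label_counts": dict(counts),
        "coverage_gaps": gaps,
    }


if __name__ == "__main__":
    rows = [
        {"text": "Refund my order", "label": "billing"},
        {"text": "refund my  order", "label": "shipping"},
        {"text": "Where is my package", "label": "shipping"},
        {"text": "Reset my password", "label": "account"},
    ]
    report = dataset_health(rows, ["billing", "shipping", "account", "returns"])
    for key, value in report.items():
        print(key, value)
```

The returned dict is flat enough to log on each run, so the numbers can be tracked over time rather than checked once.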
The teams that ship the best AI products usually have the best data hygiene. Not coincidentally.