Evaluation & Benchmarks
Public benchmarks tell you almost nothing about your application — build your own evals
The hardest part of shipping LLM features isn't building them; it's knowing whether they're getting better or worse. Evaluation is the discipline that turns "vibes" into something you can actually iterate on.
Why Public Benchmarks Disappoint
MMLU, GSM8K, HumanEval, GPQA — these are useful for comparing model capabilities at a population level, but they tell you essentially nothing about whether a given model will work for your application. Your data, your prompts, your edge cases, your tolerance for failure — none of it is captured by a public score.
Treat leaderboards as a rough first filter. Build your own evals for everything that matters.
Building Your Own Eval Set
The minimum viable eval is a CSV with two columns: input and the kind of output you want. Start there.
- Cover the golden path. The cases that should obviously work.
- Cover the edge cases. The cases that historically broke.
- Include adversarial cases. Things users actually try that you don't want the model to do.
- Lock the set. Don't iteratively tune on it; treat it like a held-out test set, because a prompt tuned against its own eval will overstate its quality.
A few hundred well-chosen examples beat a thousand bad ones. The bottleneck is curation, not volume.
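As a minimal sketch of the two-column starting point, the snippet below loads a small eval set and computes a pass rate. The column names (`input`, `expected`), the substring-match grading rule, and `fake_model` are all assumptions made for illustration; a real setup would call your model and likely use a stricter check.

```python
import csv
import io
from typing import Callable

# Hypothetical two-column eval set; in practice this would live in a CSV file.
EVAL_CSV = """input,expected
What is 2 + 2?,4
Capital of France?,Paris
"""

def load_eval_set(text: str) -> list[dict[str, str]]:
    """Parse CSV text into a list of {"input": ..., "expected": ...} rows."""
    return list(csv.DictReader(io.StringIO(text)))

def run_eval(rows: list[dict[str, str]], model: Callable[[str], str]) -> float:
    """Return the fraction of rows whose output contains the expected text."""
    if not rows:
        raise ValueError("eval set is empty")
    passed = sum(
        1 for row in rows
        if row["expected"].lower() in model(row["input"]).lower()
    )
    return passed / len(rows)

def fake_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption for illustration only)."""
    return "The answer is 4." if "2 + 2" in prompt else "I am not sure."

rows = load_eval_set(EVAL_CSV)
pass_rate = run_eval(rows, fake_model)
print(f"pass rate: {pass_rate:.2f}")
```

Keeping the model behind a plain callable means the same eval set can be replayed against any prompt or model variant without touching the data.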
Grading the Output
Three options, in increasing order of cost and quality:
- Exact / programmatic checks — does the JSON parse? Does the output contain the required field? Cheap, fast, narrow.
- LLM-as-judge — use a frontier model to grade outputs against a rubric. Good for open-ended tasks; needs careful rubric design.
- Human evaluation — slow and expensive but the ground truth. Use it to calibrate your LLM judges.
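The cheapest tier in the list above can be sketched in a few lines. This is one possible shape for a programmatic check, assuming the model is expected to return a JSON object; the function name and the reason strings are hypothetical.

```python
import json

def check_json_output(raw: str, required_fields: tuple[str, ...]) -> tuple[bool, str]:
    """Return (passed, reason) for a model output that should be a JSON object."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False, "invalid_json"
    if not isinstance(data, dict):
        return False, "not_an_object"
    missing = [field for field in required_fields if field not in data]
    if missing:
        return False, "missing:" + ",".join(missing)
    return True, "ok"

print(check_json_output('{"answer": "42"}', ("answer",)))
print(check_json_output("Sure! Here is the JSON you asked for", ("answer",)))
```

Returning a reason alongside the boolean makes it easier to group failures later instead of only counting them.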
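LLM-as-judge has its own failure modes: position bias (favoring an answer because of where it appears in the prompt), length bias (favoring longer answers), and self-preference (favoring outputs from the judge's own model family). Mitigate with explicit rubrics, paired comparisons, and randomized answer order.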
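One way to combine paired comparison with randomized order is sketched below. The `Judge` signature is an assumption: it stands for whatever wraps your judge model and returns "first" or "second". The `always_first_judge` function is a deliberately biased stand-in used to show that shuffling stops position bias from consistently favoring one system.

```python
import random
from typing import Callable

# Hypothetical judge interface: (prompt, first_answer, second_answer) -> "first" | "second".
Judge = Callable[[str, str, str], str]

def compare_pair(prompt: str, answer_a: str, answer_b: str,
                 judge: Judge, rng: random.Random) -> str:
    """Return "a" or "b", presenting the two answers to the judge in random order."""
    swapped = rng.random() < 0.5
    first, second = (answer_b, answer_a) if swapped else (answer_a, answer_b)
    verdict = judge(prompt, first, second)
    if verdict not in ("first", "second"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    # Map the positional verdict back to the original labels.
    if verdict == "first":
        return "b" if swapped else "a"
    return "a" if swapped else "b"

def always_first_judge(prompt: str, first: str, second: str) -> str:
    """A maximally position-biased judge, for demonstration only."""
    return "first"

rng = random.Random(0)
wins = {"a": 0, "b": 0}
for _ in range(1000):
    wins[compare_pair("q", "answer A", "answer B", always_first_judge, rng)] += 1
print(wins)  # roughly even split: the position bias no longer favors one system
```

Randomization does not remove the bias from any single judgment; it only prevents the bias from systematically favoring one system, so the noise averages out over many comparisons.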
What to Track
- Quality — pass rate against your eval set.
- Cost — input + output tokens per call.
- Latency — p50, p95 end-to-end.
- Failure modes — categorize regressions, not just count them.
A regression that drops one capability while improving another is the most common surprise in LLM development. Categorical breakdowns catch this; aggregate scores hide it.
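The sketch below illustrates the point with made-up numbers: two runs with the same aggregate pass rate but very different per-category results. The category names and result data are hypothetical, and the latency helper uses `statistics.quantiles` from the standard library, which needs at least two data points on most Python versions.

```python
from collections import defaultdict
from statistics import quantiles

def pass_rate_by_category(results):
    """results: list of (category, passed) pairs -> {category: pass rate}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {category: passes[category] / totals[category] for category in totals}

def overall_pass_rate(results):
    """Aggregate pass rate across all categories."""
    return sum(int(passed) for _, passed in results) / len(results)

def latency_percentiles(latencies_ms):
    """Return p50 and p95 from a list of end-to-end latencies."""
    cuts = quantiles(latencies_ms, n=100, method="inclusive")  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94]}

# Hypothetical results for two prompt versions.
baseline = ([("extraction", True)] * 8 + [("extraction", False)] * 2
            + [("summarization", True)] * 6 + [("summarization", False)] * 4)
candidate = ([("extraction", True)] * 5 + [("extraction", False)] * 5
             + [("summarization", True)] * 9 + [("summarization", False)] * 1)

print(overall_pass_rate(baseline), overall_pass_rate(candidate))  # identical aggregates
print(pass_rate_by_category(baseline))
print(pass_rate_by_category(candidate))  # extraction dropped, summarization improved
```

Here the aggregate stays flat while extraction falls from 0.8 to 0.5, which is the kind of trade-off a single score would hide.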
Closing the Loop
Evals are most valuable when they run automatically: on every prompt change, on every model upgrade, on every release. The team that ships LLM features fastest is usually the team that's most ruthless about regression-testing them.
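A regression gate for automated runs might look like the following sketch. It assumes per-category pass rates have already been computed for a stored baseline and for the candidate change; the function names, the 0.02 tolerance, and the choice to treat a missing category as a pass rate of 0.0 are all assumptions for illustration.

```python
def find_regressions(baseline: dict[str, float], candidate: dict[str, float],
                     tolerance: float = 0.02) -> dict[str, tuple[float, float]]:
    """Return categories whose pass rate dropped by more than `tolerance`."""
    regressions = {}
    for category, base_rate in baseline.items():
        # Assumption: a category missing from the candidate run counts as 0.0.
        new_rate = candidate.get(category, 0.0)
        if base_rate - new_rate > tolerance:
            regressions[category] = (base_rate, new_rate)
    return regressions

def gate(baseline: dict[str, float], candidate: dict[str, float],
         tolerance: float = 0.02) -> int:
    """Return a process exit code: 0 if nothing regressed, 1 otherwise."""
    regressions = find_regressions(baseline, candidate, tolerance)
    for category, (old, new) in sorted(regressions.items()):
        print(f"REGRESSION {category}: {old:.2f} -> {new:.2f}")
    return 1 if regressions else 0

exit_code = gate(
    {"extraction": 0.8, "summarization": 0.6},
    {"extraction": 0.5, "summarization": 0.9},
)
print("exit code:", exit_code)
```

In a CI job, passing the result to `sys.exit` would fail the build on any per-category drop, even when the aggregate score is unchanged.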