Evaluation & Benchmarks

Public benchmarks tell you almost nothing about your application — build your own evals

The hardest part of shipping LLM features isn't building them; it's knowing whether they're getting better or worse. Evaluation is the discipline that turns "vibes" into something you can actually iterate on.

Why Public Benchmarks Disappoint

MMLU, GSM8K, HumanEval, GPQA — these are useful for comparing model capabilities at a population level, but they tell you essentially nothing about whether a given model will work for your application. Your data, your prompts, your edge cases, your tolerance for failure — none of it is captured by a public score.

Treat leaderboards as a rough first filter. Build your own evals for everything that matters.

Building Your Own Eval Set

The minimum viable eval is a CSV with two columns: input and the kind of output you want. Start there.

Cover the golden path. The cases that should obviously work.
Cover the edge cases. The cases that historically broke.
Include adversarial cases. Things users actually try that you don't want the model to do.
Lock the set. Don't iteratively tune on it; treat it like a test set.

A few hundred examples beats a thousand bad ones. The bottleneck is curation, not volume.

Grading the Output

Three options, in order of cost and quality:

Exact / programmatic checks — does the JSON parse? Does the output contain the required field? Cheap, fast, narrow.
LLM-as-judge — use a frontier model to grade outputs against a rubric. Good for open-ended tasks; needs careful rubric design.
Human evaluation — slow and expensive but the ground truth. Use it to calibrate your LLM judges.

LLM-as-judge has its own failure modes: position bias, length bias, self-preference. Mitigate with rubrics, paired comparisons, and randomized order.

What to Track

Quality — pass rate against your eval set.
Cost — input + output tokens per call.
Latency — p50, p95 end-to-end.
Failure modes — categorize regressions, not just count them.

A regression that drops one capability while improving another is the most common surprise in LLM development. Categorical breakdowns catch this; aggregate scores hide it.

Closing the Loop

Evals are most valuable when they run automatically: on every prompt change, on every model upgrade, on every release. The team that ships LLM features fastest is usually the team that's most ruthless about regression-testing them.

Why Public Benchmarks Disappoint

Building Your Own Eval Set

Grading the Output

What to Track

Closing the Loop

On this page