Steven's Knowledge

Evaluation & Benchmarks

Public benchmarks tell you almost nothing about your application — build your own evals

The hardest part of shipping LLM features isn't building them; it's knowing whether they're getting better or worse. Evaluation is the discipline that turns "vibes" into something you can actually iterate on.

Why Public Benchmarks Disappoint

MMLU, GSM8K, HumanEval, GPQA — these are useful for comparing model capabilities at a population level, but they tell you essentially nothing about whether a given model will work for your application. Your data, your prompts, your edge cases, your tolerance for failure — none of it is captured by a public score.

Treat leaderboards as a rough first filter. Build your own evals for everything that matters.

Building Your Own Eval Set

The minimum viable eval is a CSV with two columns: input and the kind of output you want. Start there.

  • Cover the golden path. The cases that should obviously work.
  • Cover the edge cases. The cases that historically broke.
  • Include adversarial cases. Things users actually try that you don't want the model to do.
  • Lock the set. Don't iteratively tune on it; treat it like a test set.

A few hundred well-chosen examples beat a thousand bad ones. The bottleneck is curation, not volume.
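As a minimal sketch of the two-column CSV described above, the snippet below loads the cases and computes a fingerprint, so that an accidental edit to the locked set shows up as a changed hash. The column names `input` and `expected`, the helper names, and the sample rows are assumptions made for illustration.

```python
import csv
import hashlib
import io


def load_eval_set(csv_text: str) -> list[dict]:
    """Parse a two-column CSV (input, expected) into a list of cases."""
    reader = csv.DictReader(io.StringIO(csv_text))
    cases = [{"input": row["input"], "expected": row["expected"]} for row in reader]
    if not cases:
        raise ValueError("eval set is empty")
    return cases


def fingerprint(cases: list[dict]) -> str:
    """Hash every case so any change to the locked set is detectable."""
    digest = hashlib.sha256()
    for case in cases:
        digest.update(case["input"].encode("utf-8"))
        digest.update(b"\x00")  # separator so adjacent fields cannot blur together
        digest.update(case["expected"].encode("utf-8"))
        digest.update(b"\x00")
    return digest.hexdigest()


# Hypothetical sample rows; a real set would be read from a file.
SAMPLE = "input,expected\nWhat is 2+2?,4\nCapital of France?,Paris\n"
cases = load_eval_set(SAMPLE)
locked = fingerprint(cases)
```

Storing the fingerprint next to your results is one way to confirm that two runs were scored against the same set.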

Grading the Output

Three options, in increasing order of cost and quality:

  1. Exact / programmatic checks — does the JSON parse? Does the output contain the required field? Cheap, fast, narrow.
  2. LLM-as-judge — use a frontier model to grade outputs against a rubric. Good for open-ended tasks; needs careful rubric design.
  3. Human evaluation — slow and expensive but the ground truth. Use it to calibrate your LLM judges.
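The first option can be sketched in a few lines. The check below asks the two questions from the list: does the output parse as JSON, and does it contain the required fields? The function name and the reason strings are illustrative assumptions, not a standard API.

```python
import json


def check_json_output(output: str, required_fields: tuple[str, ...]) -> tuple[bool, str]:
    """Return (passed, reason) for a model output that should be a JSON object."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False, "invalid_json"
    if not isinstance(parsed, dict):
        return False, "not_an_object"
    missing = [field for field in required_fields if field not in parsed]
    if missing:
        return False, "missing_fields:" + ",".join(missing)
    return True, "ok"
```

Returning a reason alongside the boolean makes it easier to group failures by cause later, instead of only counting them.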

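LLM-as-judge has its own failure modes: position bias (favoring an answer because of where it appears), length bias (favoring longer answers), and self-preference (favoring outputs that resemble the judge model's own). Mitigate with rubrics, paired comparisons, and randomized order.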
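A paired comparison with randomized order might look like the sketch below. The `judge` callable is a hypothetical stand-in for whatever model call you use; it is assumed to return "first", "second", or "tie". The helper shuffles which output is shown first and then maps the positional verdict back to the original labels.

```python
import random
from typing import Callable

# Hypothetical judge signature: (prompt, first_output, second_output) -> verdict.
Judge = Callable[[str, str, str], str]


def compare_pair(judge: Judge, prompt: str, output_a: str, output_b: str,
                 rng: random.Random) -> str:
    """Return "a", "b", or "tie", presenting the two outputs in random order."""
    swapped = rng.random() < 0.5
    first, second = (output_b, output_a) if swapped else (output_a, output_b)
    verdict = judge(prompt, first, second)
    if verdict == "tie":
        return "tie"
    if verdict not in ("first", "second"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    # Map the positional verdict back to the original labels.
    if verdict == "first":
        return "b" if swapped else "a"
    return "a" if swapped else "b"


def always_first(prompt: str, first: str, second: str) -> str:
    """Stub judge with pure position bias, used only to demonstrate the effect."""
    return "first"


rng = random.Random(0)
tally = {"a": 0, "b": 0, "tie": 0}
for _ in range(200):
    tally[compare_pair(always_first, "q", "x", "y", rng)] += 1
```

With the purely position-biased stub, wins are split between "a" and "b" rather than all going to one side. Randomization does not remove the bias, but it stops the bias from systematically favoring one system.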

What to Track

  • Quality — pass rate against your eval set.
  • Cost — input + output tokens per call.
  • Latency — p50, p95 end-to-end.
  • Failure modes — categorize regressions, not just count them.
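The cost and latency metrics above can be computed from per-call records, as in this rough sketch. The `CallRecord` fields and the nearest-rank percentile definition are assumptions; a monitoring stack may define percentiles slightly differently.

```python
import math
from dataclasses import dataclass


@dataclass
class CallRecord:
    input_tokens: int
    output_tokens: int
    latency_ms: float  # end-to-end latency for one call


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value covering pct% of the data."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(records: list[CallRecord]) -> dict[str, float]:
    """Aggregate token usage and latency over a batch of calls."""
    latencies = [r.latency_ms for r in records]
    tokens = [r.input_tokens + r.output_tokens for r in records]
    return {
        "mean_tokens_per_call": sum(tokens) / len(tokens),
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
    }
```

Token counts convert to a monetary cost only once you multiply by your provider's prices, which are deliberately left out here.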

A regression that drops one capability while improving another is the most common surprise in LLM development. Categorical breakdowns catch this; aggregate scores hide it.
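To make this concrete, the sketch below uses made-up results in which two runs have the same aggregate pass rate but opposite per-category outcomes. The category names and the result shape (`category`, `passed`) are illustrative assumptions.

```python
from collections import defaultdict


def overall_pass_rate(results: list[dict]) -> float:
    """Fraction of all cases that passed."""
    return sum(int(r["passed"]) for r in results) / len(results)


def pass_rate_by_category(results: list[dict]) -> dict[str, float]:
    """Fraction of cases that passed, computed separately per category."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for r in results:
        totals[r["category"]] += 1
        passes[r["category"]] += int(r["passed"])
    return {category: passes[category] / totals[category] for category in totals}


# Fabricated toy data, for illustration only.
baseline = [
    {"category": "summarize", "passed": True},
    {"category": "summarize", "passed": True},
    {"category": "refuse", "passed": True},
    {"category": "refuse", "passed": False},
]
candidate = [
    {"category": "summarize", "passed": True},
    {"category": "summarize", "passed": False},
    {"category": "refuse", "passed": True},
    {"category": "refuse", "passed": True},
]
```

Both runs score 0.75 overall, yet the candidate has dropped from 1.0 to 0.5 on "summarize" while rising from 0.5 to 1.0 on "refuse". Only the per-category view shows the trade.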

Closing the Loop

Evals are most valuable when they run automatically: on every prompt change, on every model upgrade, on every release. The team that ships LLM features fastest is usually the team that's most ruthless about regression-testing them.
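One way to automate this is a small gate that compares per-category pass rates between a stored baseline and the candidate run, and fails the build when any category drops by more than a tolerance. The function names, the 0.02 default tolerance, and the choice to treat a missing category as a pass rate of 0.0 are all assumptions in this sketch.

```python
def find_regressions(baseline: dict[str, float], candidate: dict[str, float],
                     tolerance: float = 0.02) -> dict[str, float]:
    """Return {category: drop} for every category that fell by more than tolerance."""
    regressions = {}
    for category, base_rate in baseline.items():
        # Assumption: a category absent from the candidate run counts as 0.0.
        drop = base_rate - candidate.get(category, 0.0)
        if drop > tolerance:
            regressions[category] = drop
    return regressions


def gate_exit_code(baseline: dict[str, float], candidate: dict[str, float],
                   tolerance: float = 0.02) -> int:
    """Return 1 if any category regressed (to fail a CI job), else 0."""
    regressions = find_regressions(baseline, candidate, tolerance)
    for category, drop in sorted(regressions.items()):
        print(f"REGRESSION {category}: pass rate dropped by {drop:.2f}")
    return 1 if regressions else 0
```

In a CI job, the result of `gate_exit_code` could be passed to `sys.exit` so that a prompt change or model upgrade that regresses any category blocks the release.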
