Eval & Observability Tools
The category of tools that turns "the model got worse" from a feeling into a finding
Once an LLM application is in production, the gap between teams that improve it and teams that don't usually traces back to one thing: whether they can see what's happening. Eval and observability tools fill that gap. The two categories overlap heavily (most "observability" tools also run evals, and most "eval" tools also log production traces), so the right starting point is usually one platform that does both well.
What These Tools Do
- Trace logging — capture every model call with inputs, outputs, tokens, latency, cost.
- Multi-step tracing — connect agent runs and chains across many calls.
- Eval running — score outputs against rubrics, on offline sets and on production samples.
- LLM-as-judge — built-in or BYO judge prompts to grade open-ended outputs.
- Dataset management — eval sets, golden examples, version-controlled.
- Prompt versioning — diff prompts, attribute regressions.
- A/B testing — compare prompt or model variants on real traffic.
- Annotation — humans labeling outputs, which becomes more eval data.
The exact mix differs by tool; almost all of them do most of these.
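To make the first two capabilities concrete, here is a minimal sketch of what a trace record might hold and how multi-step runs can be linked. It is not any listed tool's API: `Span`, `traced_call`, `fake_model`, and the prices in `PRICE_PER_1K` are all hypothetical names and numbers chosen for illustration.

```python
import time
import uuid
from dataclasses import dataclass, field, asdict
from typing import Optional

# Hypothetical per-1K-token prices in USD; real prices depend on provider and model.
PRICE_PER_1K = {"input": 0.0005, "output": 0.0015}


@dataclass
class Span:
    """One logged model call. Spans sharing a trace_id belong to the same run."""
    trace_id: str
    name: str
    input: str
    output: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_usd: float
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)


def fake_model(prompt):
    """Stand-in for a provider SDK call: returns (text, input_tokens, output_tokens)."""
    text = prompt.upper()
    return text, len(prompt.split()), len(text.split())


def traced_call(model, prompt, *, trace_id, name, sink, parent_id=None):
    """Call the model, time it, estimate cost, and append one record to the sink."""
    start = time.perf_counter()
    output, in_tok, out_tok = model(prompt)
    latency_ms = (time.perf_counter() - start) * 1000
    cost = (in_tok * PRICE_PER_1K["input"] + out_tok * PRICE_PER_1K["output"]) / 1000
    span = Span(trace_id, name, prompt, output, in_tok, out_tok, latency_ms, cost, parent_id)
    sink.append(asdict(span))
    return span


# Two calls in one agent run: the second span points at the first as its parent.
log = []
run_id = uuid.uuid4().hex
plan = traced_call(fake_model, "plan the answer", trace_id=run_id, name="plan", sink=log)
answer = traced_call(fake_model, "write the answer", trace_id=run_id, name="answer",
                     sink=log, parent_id=plan.span_id)
```

In practice the sink would be the platform's SDK or an exporter rather than a list, but the shape of the record (inputs, outputs, tokens, latency, cost, and a parent link) is roughly what these tools store.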
The Major Players
Langfuse — open source, self-hostable, broad integrations. Strong tracing, growing eval surface. Easy to start with, hard to outgrow.
Braintrust — eval-first, very polished UX for offline evals and CI gating. Production tracing and online evals as well. Popular with teams that take evaluation seriously.
Arize Phoenix / Arize AX — Phoenix is the open-source side, AX the managed product. Strong on observability and embedding-level debugging; comes from a classical ML observability background and that shows.
Helicone — observability-first, low-friction integration (proxy or async logging). Good cost/usage analytics; lighter on the eval side.
LangSmith — LangChain's own tool. Tightest integration with LangChain/LangGraph; works fine standalone too.
Weights & Biases (Weave) — W&B's LLM-specific layer. Useful if you're already using W&B for ML training; otherwise overkill.
Honeycomb / Datadog / New Relic — your existing APM, increasingly with LLM-specific features. The "we already have this tool" answer that often beats specialized options for teams where LLM features are only a small part of the product.
OpenTelemetry GenAI conventions — not a tool but a standard. Once stable, lets you instrument once and choose backends freely. Worth tracking even if you don't use it yet.
Eval Tooling Specifically
Some tools lean hardest into evals as a primary surface:
- Promptfoo — eval-as-code. Configure scenarios in YAML, run from CLI, integrate into CI.
- Inspect (UK AI Security Institute, formerly the AI Safety Institute) — research-grade eval framework, popular for capability and safety evaluations.
- DeepEval — Python eval framework with many built-in metrics; pytest-style integration.
- Ragas — eval framework specifically for RAG pipelines; metrics for faithfulness, answer relevance, context recall.
These can be used standalone or alongside an observability platform.
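The frameworks above each have their own configuration and APIs, so the following is only a framework-free sketch of the eval-as-code idea they share: a dataset, a scorer, and a runner that produces a report. The examples in `GOLDEN`, the `stub_model` function, and the `contains_scorer` check are hypothetical; an LLM-as-judge scorer could plug in with the same `(output, example) -> float` signature.

```python
# Hypothetical golden examples; a real set would be curated and version-controlled.
GOLDEN = [
    {"input": "What is the capital of France?", "must_contain": "Paris"},
    {"input": "What is 2 + 2?", "must_contain": "4"},
    {"input": "Name the largest planet.", "must_contain": "Jupiter"},
]


def stub_model(question):
    """Stand-in for the application under test."""
    canned = {
        "What is the capital of France?": "The capital of France is Paris.",
        "What is 2 + 2?": "2 + 2 equals 4.",
        "Name the largest planet.": "Saturn is the largest planet.",  # deliberate miss
    }
    return canned[question]


def contains_scorer(output, example):
    """Score 1.0 if the expected substring appears (case-insensitive), else 0.0."""
    return 1.0 if example["must_contain"].lower() in output.lower() else 0.0


def run_eval(model, examples, scorer):
    """Run every example through the model and collect per-row and mean scores."""
    rows = []
    for ex in examples:
        out = model(ex["input"])
        rows.append({"input": ex["input"], "output": out, "score": scorer(out, ex)})
    mean = sum(r["score"] for r in rows) / len(rows)
    return {"mean_score": mean, "rows": rows}


report = run_eval(stub_model, GOLDEN, contains_scorer)
failures = [r["input"] for r in report["rows"] if r["score"] < 1.0]
```

Keeping per-row results alongside the mean matters more than it looks: the failing inputs are what get reviewed, annotated, and folded back into the dataset.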
Choosing
A pragmatic decision tree:
- Just need logging and basic dashboards — Langfuse self-hosted, or Helicone.
- Eval-driven team, CI gating matters — Braintrust, with Promptfoo for CLI workflows.
- Already on LangChain — LangSmith is the path of least resistance.
- Already on W&B for ML — Weave fits in cleanly.
- Existing APM stack — extend it before adding a specialized tool.
- RAG-heavy — Ragas for pipeline-specific metrics, on top of any of the above.
Most teams end up with one observability platform plus one or two eval-specific tools.
What Good Looks Like
A mature eval/observability setup has:
- Every production call traced with full inputs and outputs (sampled if volume requires it).
- An offline eval set with hundreds of curated examples, version-controlled.
- CI gating on prompt and model changes — regressions block merges.
- Online evals running on a sample of production traffic continuously.
- Annotation queues so humans can label tricky cases and grow the eval set.
- Dashboards for cost, latency, refusal rate, quality proxies — visible to the team daily.
Reaching this level is incremental. Start with tracing. Add an eval set. Add CI gating. Each step pays for itself before the next is needed.
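Two of the items above, sampling production traffic and gating merges on regressions, come down to small pieces of logic. The sketch below shows one way to do each, assuming traces carry a string ID and eval results arrive as a metric-to-score mapping. The function names, metric names, scores, and the 0.02 tolerance are hypothetical.

```python
import hashlib


def in_sample(trace_id, rate):
    """Deterministic sampling: the same trace_id always gets the same decision."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < int(rate * 2**64)


def gate(baseline, candidate, tolerance=0.02):
    """List metrics where the candidate fell more than `tolerance` below baseline."""
    regressions = []
    for metric, base in baseline.items():
        cand = candidate.get(metric, 0.0)  # a missing metric counts as a regression
        if cand < base - tolerance:
            regressions.append(f"{metric}: {base:.3f} -> {cand:.3f}")
    return regressions


# Hypothetical scores from the main branch and from a proposed prompt change.
baseline = {"faithfulness": 0.91, "answer_relevance": 0.88}
candidate = {"faithfulness": 0.92, "answer_relevance": 0.81}

regressions = gate(baseline, candidate)
exit_code = 1 if regressions else 0
# In a CI job, exiting with this code (for example via sys.exit) would block the merge.
```

Hashing the trace ID rather than drawing a random number means every step of a multi-call run lands on the same side of the sampling decision, and the tolerance keeps ordinary run-to-run noise from failing the build.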
A Reminder
The tool isn't the discipline. A team without an eval set won't suddenly have one because they signed up for Braintrust. Pick a tool that fits your workflow and use it; the curation work is what actually moves quality.