Steven's Knowledge

Eval & Observability Tools

The category of tools that turns "the model got worse" from a feeling into a finding

Once an LLM application is real, the gap between teams that improve it and teams that don't usually traces back to one thing: whether they can see what's happening. Eval and observability tools occupy that gap. The two categories overlap heavily: most "observability" tools also run evals, and most "eval" tools also log production traces. The right starting point is usually one platform that does both well, supplemented by an eval-specific tool where needed.

What These Tools Do

  • Trace logging — capture every model call with inputs, outputs, tokens, latency, cost.
  • Multi-step tracing — connect agent runs and chains across many calls.
  • Eval running — score outputs against rubrics, on offline sets and on production samples.
  • LLM-as-judge — built-in or BYO judge prompts to grade open-ended outputs.
  • Dataset management — eval sets, golden examples, version-controlled.
  • Prompt versioning — diff prompts, attribute regressions.
  • A/B testing — compare prompt or model variants on real traffic.
  • Annotation — humans labeling outputs, which becomes more eval data.

The exact mix differs by tool; almost all of them do most of these.
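
As a rough illustration of the first two items, trace logging amounts to wrapping each model call and writing a structured record, with a shared trace ID linking the calls of one agent run. The sketch below uses only the Python standard library. `TraceRecord`, `traced_call`, `fake_model`, and the per-token prices are hypothetical names and numbers chosen for illustration, not any particular tool's API.

```python
import time
import uuid
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical prices in USD per 1,000 tokens; real prices vary by model.
PRICE_PER_1K_INPUT = 0.50
PRICE_PER_1K_OUTPUT = 1.50


@dataclass
class TraceRecord:
    """One logged model call: the fields a tracing tool typically captures."""
    trace_id: str                  # shared by every call in one run or chain
    span_id: str                   # unique to this call
    parent_span_id: Optional[str]  # links a step to the step that triggered it
    prompt: str
    output: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_usd: float


def fake_model(prompt: str) -> Tuple[str, int, int]:
    """Stand-in for a real model client: returns (text, input_tokens, output_tokens)."""
    text = prompt.upper()
    return text, len(prompt.split()), len(text.split())


def traced_call(
    model: Callable[[str], Tuple[str, int, int]],
    prompt: str,
    trace_id: str,
    parent_span_id: Optional[str] = None,
    sink: Optional[List[TraceRecord]] = None,
) -> TraceRecord:
    """Call the model, time it, and append a TraceRecord to the sink if one is given."""
    start = time.perf_counter()
    output, in_tok, out_tok = model(prompt)
    latency_ms = (time.perf_counter() - start) * 1000
    cost = in_tok / 1000 * PRICE_PER_1K_INPUT + out_tok / 1000 * PRICE_PER_1K_OUTPUT
    record = TraceRecord(
        trace_id=trace_id,
        span_id=uuid.uuid4().hex,
        parent_span_id=parent_span_id,
        prompt=prompt,
        output=output,
        input_tokens=in_tok,
        output_tokens=out_tok,
        latency_ms=latency_ms,
        cost_usd=cost,
    )
    if sink is not None:
        sink.append(record)
    return record
```

In a real setup the sink would be a tool's SDK or an HTTP exporter rather than an in-memory list; the shape of the record is the part that carries over.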

The Major Players

Langfuse — open source, self-hostable, broad integrations. Strong tracing, growing eval surface. Easy to start with, hard to outgrow.

Braintrust — eval-first, very polished UX for offline evals and CI gating. Production tracing and online evals as well. Popular with teams that take evaluation seriously.

Arize Phoenix / Arize AX — Phoenix is the open-source side, AX the managed product. Strong on observability and embedding-level debugging; comes from a classical ML observability background and that shows.

Helicone — observability-first, low-friction integration (proxy or async logging). Good cost/usage analytics; lighter on the eval side.

LangSmith — LangChain's own tool. Tightest integration with LangChain/LangGraph; works fine standalone too.

Weights & Biases (Weave) — W&B's LLM-specific layer. Useful if you're already using W&B for ML training; otherwise overkill.

Honeycomb / Datadog / New Relic — your existing APM, increasingly with LLM-specific features. The "we already have this tool" answer that often beats specialized options for teams whose product is not primarily LLM-based.

OpenTelemetry GenAI conventions — not a tool but a standard. Once stable, lets you instrument once and choose backends freely. Worth tracking even if you don't use it yet.
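
To make the idea concrete, here is a sketch of the kind of attribute set the GenAI conventions describe for a single chat call. The `gen_ai.*` keys and the span-name pattern follow the conventions as I understand them at the time of writing; since the conventions are still evolving, treat the exact keys as an assumption to verify against the current specification. `genai_span_attributes` and `span_name` are hypothetical helpers, and the model name is a placeholder.

```python
from typing import Dict, Union

AttrValue = Union[str, int, float]


def genai_span_attributes(
    operation: str,
    request_model: str,
    input_tokens: int,
    output_tokens: int,
) -> Dict[str, AttrValue]:
    """Build span attributes for one model call using gen_ai.* style keys.

    Key names are assumed from the GenAI semantic conventions and may change.
    """
    return {
        "gen_ai.operation.name": operation,
        "gen_ai.request.model": request_model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }


def span_name(attrs: Dict[str, AttrValue]) -> str:
    """Name the span as '<operation> <model>' (assumed naming pattern)."""
    return f"{attrs['gen_ai.operation.name']} {attrs['gen_ai.request.model']}"


attrs = genai_span_attributes("chat", "example-model", input_tokens=12, output_tokens=34)
print(span_name(attrs), attrs)
```

Because the attributes are plain key/value pairs, the same dictionary can be handed to whichever backend receives the spans, which is the "instrument once" benefit in practice.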

Eval Tooling Specifically

Some tools lean hardest into evals as a primary surface:

  • Promptfoo — eval-as-code. Configure scenarios in YAML, run from CLI, integrate into CI.
  • Inspect (UK AI Security Institute, formerly the AI Safety Institute) — research-grade eval framework, popular for capability and safety evaluations.
  • DeepEval — Python eval framework with many built-in metrics; pytest-style integration.
  • Ragas — eval framework specifically for RAG pipelines; metrics for faithfulness, answer relevance, context recall.

These can be used standalone or alongside an observability platform.
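
The eval-as-code idea these tools share can be sketched in a few lines: test cases are data, checks are functions, and a runner reports a pass rate. This is a generic illustration in plain Python, not the API of Promptfoo, Inspect, DeepEval, or Ragas. `EvalCase`, `contains`, `max_words`, `run_evals`, `pass_rate`, and `stub_model` are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Check = Callable[[str], bool]


@dataclass
class EvalCase:
    name: str
    prompt: str
    checks: List[Check]  # every check must pass for the case to pass


def contains(needle: str) -> Check:
    """Check that the output mentions a string, ignoring case."""
    return lambda output: needle.lower() in output.lower()


def max_words(limit: int) -> Check:
    """Check that the output is no longer than `limit` words."""
    return lambda output: len(output.split()) <= limit


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call, with one canned answer."""
    answers = {"capital of France?": "The capital of France is Paris."}
    return answers.get(prompt, "I don't know.")


def run_evals(model: Callable[[str], str], cases: List[EvalCase]) -> Dict[str, bool]:
    """Run each case once and record whether all of its checks passed."""
    results: Dict[str, bool] = {}
    for case in cases:
        output = model(case.prompt)
        results[case.name] = all(check(output) for check in case.checks)
    return results


def pass_rate(results: Dict[str, bool]) -> float:
    """Fraction of cases that passed; 0.0 for an empty result set."""
    return sum(results.values()) / len(results) if results else 0.0


cases = [
    EvalCase("france", "capital of France?", [contains("paris"), max_words(10)]),
    EvalCase("germany", "capital of Germany?", [contains("berlin")]),
]
results = run_evals(stub_model, cases)
print(results, pass_rate(results))
```

An LLM-as-judge check fits the same shape: it is one more `Check` whose body calls a grading model instead of doing a string comparison.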

Choosing

A pragmatic decision tree:

  1. Just need logging and basic dashboards — Langfuse self-hosted, or Helicone.
  2. Eval-driven team, CI gating matters — Braintrust, with Promptfoo for CLI workflows.
  3. Already on LangChain — LangSmith is the path of least resistance.
  4. Already on W&B for ML — Weave fits in cleanly.
  5. Existing APM stack — extend it before adding a specialized tool.
  6. RAG-heavy — Ragas for pipeline-specific metrics, on top of any of the above.

Most teams end up with one observability platform plus one or two eval-specific tools.

What Good Looks Like

A mature eval/observability setup has:

  • Every production call traced with full inputs and outputs (sampled if volume requires it).
  • An offline eval set with hundreds of curated examples, version-controlled.
  • CI gating on prompt and model changes — regressions block merges.
  • Online evals running on a sample of production traffic continuously.
  • Annotation queues so humans can label tricky cases and grow the eval set.
  • Dashboards for cost, latency, refusal rate, quality proxies — visible to the team daily.
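
Both sampled tracing and online evals in the list above need a way to pick a stable subset of traffic. One common approach, shown here as an assumption rather than any specific tool's behavior, is to hash the trace ID so that a given trace is always either in or out of the sample. `in_sample` is a hypothetical helper.

```python
import hashlib


def in_sample(trace_id: str, rate: float) -> bool:
    """Deterministically decide whether a trace falls in a `rate` fraction sample.

    The trace ID is hashed to a number in [0, 1); the trace is sampled when
    that number is below `rate`. The same ID always gives the same answer.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


# Score roughly 10% of traces with online evals.
sampled = [f"trace-{i}" for i in range(1000) if in_sample(f"trace-{i}", 0.10)]
print(len(sampled))
```

A side effect of hashing instead of random sampling is that raising the rate only adds traces to the sample, so earlier results stay comparable.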

Reaching this level is incremental. Start with tracing. Add an eval set. Add CI gating. Each step pays for itself before the next is needed.
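
The CI gating step can be as simple as comparing the current run's scores to a stored baseline and failing the build on a drop. A minimal sketch, assuming per-metric scores where higher is better; `find_regressions`, `gate`, the metric names, and the tolerance value are hypothetical.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.02,
) -> List[str]:
    """List metrics that are missing or dropped by more than `tolerance`."""
    problems: List[str] = []
    for metric, base_score in sorted(baseline.items()):
        if metric not in current:
            problems.append(f"{metric}: missing from current run")
        elif current[metric] < base_score - tolerance:
            problems.append(f"{metric}: {base_score:.3f} -> {current[metric]:.3f}")
    return problems


def gate(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.02,
) -> int:
    """Return a process exit code: 0 if the change may merge, 1 if it is blocked."""
    problems = find_regressions(baseline, current, tolerance)
    for problem in problems:
        print("REGRESSION", problem)
    return 1 if problems else 0


baseline = {"accuracy": 0.90, "faithfulness": 0.85}
current = {"accuracy": 0.89, "faithfulness": 0.80}
exit_code = gate(baseline, current)
print("exit code:", exit_code)
```

In a CI job the return value would be passed to `sys.exit`, so a regression fails the pipeline. The tolerance exists because eval scores are noisy; setting it to zero tends to block merges on noise rather than on real regressions.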

A Reminder

The tool isn't the discipline. A team without an eval set won't suddenly have one because they signed up for Braintrust. Pick a tool that fits your workflow and use it; the curation work is what actually moves quality.
