Observability

LLM features fail in ways traditional services don't. Often there is no exception, no stack trace, and no change in error rate. The model produces something plausible-looking, the user gets a worse answer, and nothing in your dashboards lights up. Observability is what closes that gap.

What to Log

For every LLM call, capture at minimum:

Inputs — full prompt, tools available, model and parameters.
Outputs — full response, tool calls, finish reason.
Token counts — input, output, cached.
Latency — time-to-first-token, total.
Cost — derived from tokens and model pricing.
Identifiers — request ID, user ID, session ID, prompt version.
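
As a rough illustration, the fields above could be gathered into one record per call. This is a minimal sketch, not a standard schema: the class name, field names, the model name "example-model", and the prices in PRICES_PER_MTOK are all hypothetical, and the cost formula assumes cached tokens are reported as a subset of input tokens, which varies by provider.

```python
from dataclasses import dataclass, asdict
import json

# Hypothetical prices in USD per million tokens; real prices vary by provider and model.
PRICES_PER_MTOK = {
    "example-model": {"input": 3.00, "output": 15.00, "cached": 0.30},
}


@dataclass
class LLMCallRecord:
    # Identifiers
    request_id: str
    user_id: str
    session_id: str
    prompt_version: str
    # Inputs
    model: str
    params: dict
    prompt: str
    tools: list
    # Outputs
    response: str
    tool_calls: list
    finish_reason: str
    # Token counts
    input_tokens: int
    output_tokens: int
    cached_tokens: int
    # Latency in milliseconds
    ttft_ms: float
    total_ms: float

    def cost_usd(self) -> float:
        """Derive cost from token counts and the price table."""
        price = PRICES_PER_MTOK[self.model]
        # Assumption: cached tokens are counted inside input_tokens.
        uncached = self.input_tokens - self.cached_tokens
        total = (
            uncached * price["input"]
            + self.cached_tokens * price["cached"]
            + self.output_tokens * price["output"]
        )
        return total / 1_000_000

    def to_json(self) -> str:
        """Serialize the record, including derived cost, as one JSON log line."""
        row = asdict(self)
        row["cost_usd"] = self.cost_usd()
        return json.dumps(row)
```

Emitting one JSON line per call keeps the record easy to ship to whatever log store is already in use; deriving cost at write time avoids re-pricing old calls after a price change.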

Sampling is fine for high-volume systems, but errors and outliers should always be captured in full.
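
One way to express that rule is a small decision function that lets errors and slow calls bypass sampling. This is a sketch under assumptions: the function name, the 5% default rate, and the 10-second outlier threshold are illustrative choices, not recommendations.

```python
import hashlib


def should_log_full(
    request_id: str,
    is_error: bool,
    total_ms: float,
    sample_rate: float = 0.05,   # hypothetical default: keep 5% of normal calls
    slow_ms: float = 10_000.0,   # hypothetical latency threshold for an outlier
) -> bool:
    """Decide whether to store the full prompt and response for a call."""
    # Errors and latency outliers are always captured in full.
    if is_error or total_ms >= slow_ms:
        return True
    # Deterministic sampling: map the request ID to a bucket in [0, 1)
    # so the same request always gets the same decision.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < sample_rate
```

Hashing the request ID instead of calling a random number generator means every service that sees the same request makes the same decision, so a sampled request is captured end to end.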

Tracing Multi-Step Flows

A single user action can fan out into many model calls: retrieval, planning, tool calls, sub-agent invocations. Without a trace linking those calls, you cannot tell which step introduced a regression.

Adopt a tracing standard early. OpenTelemetry conventions for GenAI are stabilizing; tools like Langfuse, Helicone, Arize Phoenix, and Braintrust all build on similar primitives. The specific tool matters less than having one.
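
The primitive those tools share is a span: a named, timed unit of work that records its parent, so all calls behind one user action form a tree. The sketch below uses only the standard library to show the idea; it is not the OpenTelemetry API, and span, FINISHED_SPANS, and handle_user_action are hypothetical names. In a real system a tracing SDK would play this role.

```python
import contextvars
import time
import uuid
from contextlib import contextmanager

# Tracks the span that is currently open in this execution context.
_current_span = contextvars.ContextVar("current_span", default=None)

# Stand-in for an exporter; a real setup would send spans to a tracing backend.
FINISHED_SPANS = []


@contextmanager
def span(name, **attributes):
    """Open a span that inherits its trace ID from the enclosing span, if any."""
    parent = _current_span.get()
    record = {
        "span_id": uuid.uuid4().hex[:16],
        "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
        "parent_id": parent["span_id"] if parent else None,
        "name": name,
        "attributes": attributes,
    }
    token = _current_span.set(record)
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        _current_span.reset(token)
        FINISHED_SPANS.append(record)


def handle_user_action(question):
    """One user action fanning out into a retrieval step and a model call."""
    with span("user_action", question=question):
        with span("retrieval"):
            docs = ["doc-1", "doc-2"]  # placeholder for a real retrieval step
        with span("llm_call", model="example-model", n_docs=len(docs)):
            answer = "placeholder answer"  # placeholder for a real model call
    return answer
```

After one call to handle_user_action, FINISHED_SPANS holds three spans that share a trace ID, with the retrieval and model-call spans pointing at the user-action span as their parent.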

Prompt Versioning

Prompts are code. Version them like code:

Hash every prompt template (the text before per-request values are filled in) and log the hash with each call.
Tag releases so you can correlate quality changes with prompt changes.
Diff prompts in code review like any other change.

Without this, "the model got worse" becomes unfalsifiable.
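
A minimal sketch of the hashing step follows. It assumes the hash is taken over the prompt template rather than the fully rendered prompt, so that it identifies a version instead of a single request. SUMMARIZE_TEMPLATE, prompt_hash, call_metadata, and the release tag format are hypothetical.

```python
import hashlib

# Hypothetical template; {text} is filled in per request.
SUMMARIZE_TEMPLATE = "Summarize the following text in three bullet points:\n\n{text}"


def prompt_hash(template: str) -> str:
    """Return a short, stable identifier for a prompt template."""
    # Hash the exact bytes: even a whitespace change is a new version.
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]


def call_metadata(template: str, release_tag: str) -> dict:
    """Fields to attach to every logged call made with this template."""
    return {
        "prompt_hash": prompt_hash(template),
        "prompt_release": release_tag,  # for example, a git tag for the deploy
    }
```

Logging call_metadata(SUMMARIZE_TEMPLATE, "2024-06-01") alongside each call lets a quality chart be split by prompt hash, which turns "the model got worse" into a comparison between two specific versions.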

Quality Signals from Production

You can't run offline evals on every request, but you can collect proxies:

Implicit feedback — copy actions, regenerate clicks, abandonment.
Explicit feedback — thumbs, ratings, structured forms.
Downstream outcomes — did the user complete the task? Did they come back?
Self-reports — let the model flag its own low-confidence responses.

Each signal is noisy on its own, but a trend across thousands of calls is usually enough to catch a regression.
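
To make those proxies comparable across releases, they can be aggregated into rates per prompt version. The sketch below assumes a hypothetical event stream in which each event is a dict with a "prompt_version" and a "kind" of "response", "regenerate", "thumbs_up", or "thumbs_down"; the event names and the function are illustrative.

```python
from collections import defaultdict


def proxy_rates(events):
    """Aggregate feedback events into per-prompt-version quality proxies."""
    counts = defaultdict(lambda: defaultdict(int))
    for event in events:
        counts[event["prompt_version"]][event["kind"]] += 1

    rates = {}
    for version, c in counts.items():
        responses = c["response"]
        votes = c["thumbs_up"] + c["thumbs_down"]
        rates[version] = {
            # Share of responses the user asked to regenerate.
            "regenerate_rate": c["regenerate"] / responses if responses else None,
            # Share of explicit votes that were positive.
            "thumbs_up_rate": c["thumbs_up"] / votes if votes else None,
        }
    return rates
```

Returning None when there are no responses or votes keeps a version with no data from looking like a version with a 0% rate.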

Alerting

Useful alerts for LLM systems:

Cost spike — token spend deviation from baseline.
Latency regression — p95 TTFT or end-to-end time.
Quality proxy regression — drop in thumbs-up rate, jump in regenerate rate.
Tool failure rate — cascading failures often start in a tool, not the model.
Refusal / safety filter rate — sudden jumps usually mean a prompt regression.
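
Most of these alerts reduce to the same check: compare the current value of a metric against its recent baseline. The following is a minimal sketch using a z-score; the function name and the threshold of 3 standard deviations are assumptions, and a production system would more likely rely on its monitoring tool's built-in anomaly rules.

```python
from statistics import mean, stdev


def deviates_from_baseline(baseline, current, z_threshold=3.0):
    """Return True if `current` is far from the mean of `baseline` values.

    `baseline` is a list of recent values for one metric, for example
    daily token spend or hourly regenerate rate.
    """
    if len(baseline) < 2:
        return False  # not enough history to estimate spread
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any change at all counts as a deviation.
        return current != mu
    return abs(current - mu) / sigma > z_threshold
```

For example, with a daily spend baseline of [100, 102, 98, 101, 99], a day at 150 is flagged and a day at 101 is not. The check is two-sided, so it catches both a jump in regenerate rate and a drop in thumbs-up rate.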

What This Buys You

Once observability is in place, the team's iteration speed changes. You can answer "did this prompt change make things better?" with data instead of vibes. That's the difference between a feature that gets better over time and one that quietly rots.