Tracing
Distributed tracing - OpenTelemetry, Jaeger, Tempo, Honeycomb - see how a request flows through your services
A trace is the full story of a single request as it flows through your system. Every service the request touches records a span — a unit of work with start/end timestamps and metadata. Combined, the spans reconstruct the request's journey: which service called which, how long each step took, where errors happened.
If metrics tell you "the system is slow" and logs tell you "this specific thing happened", tracing tells you "this specific request was slow because the user-service called the inventory-service which called the pricing-service which timed out".
The Three Pillars
| Pillar | What it tells you | Examples |
|---|---|---|
| Metrics | Aggregated numbers over time | RPS, P99 latency, CPU usage — Prometheus |
| Logs | Discrete events with context | "User 42 failed login at 14:32" — ELK |
| Traces | The path of a single request | "Request X went through services A→B→C, took 1.2s, errored at C" |
Each answers different questions. You want all three; tracing is the one most often missing.
Why Trace
| Without traces | With traces |
|---|---|
| "Why is this request slow?" → guesswork | Visual flame graph shows the bottleneck instantly |
| "Which service is the bottleneck?" → blame meetings | Trace shows percentage time per service |
| "Did the user hit retries?" → log archeology | Trace shows every internal retry |
| "Cross-service errors are mysterious" | The span at the error has full context |
| "Hard to debug across teams" | Each team owns its own spans; the shared trace ID links them into one view |
The killer use case: the request that's slow only sometimes. Metrics show "P99 is bad"; logs show many normal lines; tracing shows the actual slow request, end-to-end.
OpenTelemetry: The Standard
The ecosystem has converged on OpenTelemetry (OTel) — a CNCF project that defines:
- Span / trace data model (trace and span IDs are compatible with W3C Trace Context)
- SDKs in every major language for emitting spans
- OTel Collector — a service that ingests OTel data and forwards to your backend
- Auto-instrumentation for popular frameworks (Express, Fastify, Django, Spring, etc.)
The key benefit: decouple the SDK from the backend. You instrument your code once with OTel; you can swap Jaeger for Tempo for Honeycomb for Datadog by changing the collector config, not your code.
Backends
OTel SDKs emit data; you need a backend to store and visualize:
| Backend | Self-host | Notes |
|---|---|---|
| Jaeger | Yes | CNCF; the classic; great UI; pluggable storage backends |
| Grafana Tempo | Yes | Object-storage-backed; cheap to run; integrates with Grafana |
| Zipkin | Yes | The original Dapper-inspired tracing system; simpler than Jaeger |
| Honeycomb | SaaS | Best-in-class analysis UI; expensive at scale |
| Lightstep / ServiceNow Cloud Observability | SaaS | Enterprise; high cardinality |
| Datadog APM | SaaS | Tightly integrated with their metrics/logs |
| New Relic / AppSignal / AppDynamics | SaaS | Traditional APM with tracing |
| AWS X-Ray / GCP Cloud Trace / Azure Application Insights | Cloud-managed | Per-cloud integration |
| SigNoz | Self-host or cloud | Open-source unified observability (OTel-native) |
| Sentry | SaaS | Errors + traces in one platform |
For new projects: OpenTelemetry SDK + Tempo (self-host) or OpenTelemetry SDK + a SaaS (Honeycomb / Datadog / SigNoz). Don't lock yourself into a vendor SDK; instrument with OTel.
Learning Path
1. Getting Started
Stand up Jaeger and the OTel Collector with Docker; trace a Node.js app end-to-end
2. Instrumentation
Auto-instrumentation, manual spans, attributes, error tracking, context propagation
3. Best Practices
Sampling, cardinality, retention, correlating with logs/metrics, cost, pitfalls
Anatomy of a Trace
┌────────────────────────────────────────────────────────────────────┐
│ Trace abc123 1200 ms │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ GET /api/checkout service: api │ │
│ │ ┌──────────────────────────────────────────────────────────┐ │ │
│ │ │ AUTH validate service: api 15 ms │ │ │
│ │ └──────────────────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────────────────┐ │ │
│ │ │ DB query users service: api 25 ms │ │ │
│ │ └──────────────────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────────────────┐ │ │
│ │ │ HTTP inventory-service service: api 200 ms │ │ │
│ │ │ ┌────────────────────────────────────────────────────┐ │ │ │
│ │ │ │ DB query inventory service: inv 180 ms │ │ │ │
│ │ │ └────────────────────────────────────────────────────┘ │ │ │
│ │ └──────────────────────────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────────────────┐ │ │
│ │ │ HTTP payment-service service: api 950 ms ⚠ │ │ │
│ │ └──────────────────────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────────┘

Each box is a span. The whole stack is a trace. Reading top to bottom: a checkout request took 1.2 s, and 950 ms of that was the payment service, so that is your bottleneck.
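To put numbers on that reading, here is a minimal sketch that computes each child span's share of the root span. The durations are copied from the diagram; the dict layout and the helper name `time_shares` are illustrative only, and the calculation assumes the child spans run one after another rather than in parallel.

```python
# Durations in milliseconds, copied from the diagram above.
root_ms = 1200
children_ms = {
    "AUTH validate": 15,
    "DB query users": 25,
    "HTTP inventory-service": 200,
    "HTTP payment-service": 950,
}


def time_shares(root_ms, children_ms):
    """Return each child's share of the root span's duration, in percent."""
    return {name: round(100 * ms / root_ms, 1) for name, ms in children_ms.items()}


shares = time_shares(root_ms, children_ms)

# The child with the largest share is the first place to look.
bottleneck = max(shares, key=shares.get)

# Time in the root span not covered by any child (its own work, or gaps).
unaccounted_ms = root_ms - sum(children_ms.values())

print(bottleneck, shares[bottleneck], unaccounted_ms)
```

Tracing UIs typically do this arithmetic for you; the point is only that the answer falls out of span durations alone, with no log digging.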
| Concept | Meaning |
|---|---|
| Trace | One end-to-end request flow |
| Span | A unit of work within the trace |
| Trace ID | Globally unique ID for the trace |
| Span ID | Unique ID for one span |
| Parent span ID | The span that triggered this one |
| Attributes | Key/value pairs on a span (HTTP method, DB query, user ID) |
| Events | Timestamped log entries within a span |
| Status | Unset (default) / OK / Error |
| Baggage | Key/value pairs propagated across service boundaries alongside the trace context |
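The concepts in the table map onto a small data structure. The sketch below is a simplified, hypothetical `Span` record (not the OTel SDK's class) plus a helper that rebuilds the call tree from parent span IDs, using a few spans from the checkout diagram.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Simplified span record; field names are illustrative, not the OTel API."""
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]  # None marks the root span
    name: str
    service: str
    duration_ms: int
    attributes: dict = field(default_factory=dict)
    status: str = "OK"


def render(spans):
    """Rebuild the tree from parent IDs and return one indented line per span."""
    children = {}
    for span in spans:
        children.setdefault(span.parent_span_id, []).append(span)

    lines = []

    def walk(parent_id, depth):
        for span in children.get(parent_id, []):
            lines.append(f"{'  ' * depth}{span.name} [{span.service}] {span.duration_ms} ms")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


spans = [
    Span("abc123", "1", None, "GET /api/checkout", "api", 1200),
    Span("abc123", "2", "1", "HTTP inventory-service", "api", 200),
    Span("abc123", "3", "2", "DB query inventory", "inv", 180),
    Span("abc123", "4", "1", "HTTP payment-service", "api", 950),
]

trace_lines = render(spans)
print("\n".join(trace_lines))
```

Spans arrive at the backend independently and in any order; the parent span ID is the only thing that lets the backend nest them.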
Context Propagation
A trace spans services. The trace context (trace ID, parent span ID) must travel with the request:
Service A Service B
┌──────────┐ HTTP request ┌──────────┐
│ span 1 │ traceparent: 00-abc...─→ │ span 2 │
│ trace abc│ │ trace abc│
│ id=1 │ │ id=2 │
│ parent=- │ │ parent=1 │
└──────────┘ └──────────┘

W3C Trace Context is the standard: the traceparent and tracestate HTTP headers. OTel SDKs and their HTTP client/server instrumentation propagate them automatically; an uninstrumented HTTP client will not add them on its own.
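For a feel of what travels on the wire, here is a rough sketch of reading and writing a version-00 `traceparent` header (`00-<32 hex trace ID>-<16 hex parent span ID>-<2 hex flags>`). The function names are made up for this example, and future header versions are simply rejected; in practice the OTel SDK's propagator does this for you.

```python
import re
import secrets

# Version 00 only: version, trace ID, parent span ID, trace flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return the parsed context as a dict, or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid per the W3C spec.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(incoming):
    """Build the header for an outgoing call: same trace ID, fresh span ID."""
    ctx = parse_traceparent(incoming) if incoming else None
    trace_id = ctx["trace_id"] if ctx else secrets.token_hex(16)
    flags = "01" if (ctx is None or ctx["sampled"]) else "00"
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-{flags}", span_id


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing, new_span_id = child_traceparent(incoming)
print(outgoing)
```

The trace ID stays constant across every hop, while the span ID in the header changes at each one; that is exactly the `parent=1` link in the diagram.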
For Kafka / RabbitMQ / SQS messages, propagate the same fields in message headers. For database queries, a SQL comment (-- trace=abc123) or a session setting (SET application_name='trace=abc123') carries the ID into DB query logs.
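As a rough sketch of both ideas, the helpers below copy the trace context into a message's headers and append the trace ID to a query as a SQL comment. Headers are modelled as a plain dict, which is an assumption: real Kafka, RabbitMQ, and SQS clients each have their own header or attribute types. All function names are hypothetical.

```python
import re


def inject_into_message(headers, traceparent, tracestate=None):
    """Return a copy of the message headers with the trace context added."""
    out = dict(headers)
    out["traceparent"] = traceparent
    if tracestate:
        out["tracestate"] = tracestate
    return out


def extract_from_message(headers):
    """Consumer side: read the traceparent back, or None if it is absent."""
    return headers.get("traceparent")


def tag_sql(query, trace_id):
    """Append the trace ID as a trailing SQL comment so it shows up in query logs."""
    # Accept only a 32-char hex ID so nothing else can be smuggled into the SQL.
    if not re.fullmatch(r"[0-9a-f]{32}", trace_id):
        raise ValueError("trace_id must be 32 lowercase hex characters")
    return f"{query.rstrip()} -- trace={trace_id}"


traceparent = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
message_headers = inject_into_message({"content-type": "application/json"}, traceparent)
tagged = tag_sql("SELECT * FROM inventory WHERE sku = %s", "4bf92f3577b34da6a3ce929d0e0e4736")
print(message_headers)
print(tagged)
```

The consumer calls `extract_from_message` before starting its own span, so the work it does on the message joins the producer's trace instead of starting a new one.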
A trace is only as good as the parts that are instrumented. One service that doesn't propagate context breaks the trace — you see two unconnected traces instead of one. Aim for 100% propagation across your services before you worry about depth of instrumentation.
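One way to spot a propagation gap is to look for root spans that start in services which should never be the first hop. The sketch below assumes spans exported as plain dicts and a hand-maintained set of entry-point services; both are assumptions made for illustration, not a feature of any particular backend.

```python
def suspicious_roots(spans, entry_services):
    """Return root spans whose service is not an expected entry point.

    A root span in an internal service usually means the caller did not
    forward the trace context, so the callee started a brand-new trace.
    """
    return [
        span
        for span in spans
        if span["parent_span_id"] is None and span["service"] not in entry_services
    ]


spans = [
    # One healthy trace: api calls inventory-service with context attached.
    {"trace_id": "t1", "span_id": "1", "parent_span_id": None, "service": "api"},
    {"trace_id": "t1", "span_id": "2", "parent_span_id": "1", "service": "inventory-service"},
    # Context was dropped on the way to payment-service, so it opened trace t2.
    {"trace_id": "t2", "span_id": "9", "parent_span_id": None, "service": "payment-service"},
]

broken = suspicious_roots(spans, entry_services={"api"})
print([span["service"] for span in broken])
```

Services that legitimately start work on their own, such as cron jobs or queue consumers without upstream context, would need to be added to the entry set to avoid false positives.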