Distributed tracing - OpenTelemetry, Jaeger, Tempo, Honeycomb - see how a request flows through your services

Tracing

A trace is the full story of a single request as it flows through your system. Every service the request touches records a span — a unit of work with start/end timestamps and metadata. Combined, the spans reconstruct the request's journey: which service called which, how long each step took, where errors happened.

If metrics tell you "the system is slow" and logs tell you "this specific thing happened", tracing tells you "this specific request was slow because the user-service called the inventory-service which called the pricing-service which timed out".

The Three Pillars

Pillar	What it tells you	Examples
Metrics	Aggregated numbers over time	RPS, P99 latency, CPU usage — Prometheus
Logs	Discrete events with context	"User 42 failed login at 14:32" — ELK
Traces	The path of a single request	"Request X went through services A→B→C, took 1.2s, errored at C"

Each answers different questions. You want all three; tracing is the one most often missing.

Why Trace

Without traces	With traces
"Why is this request slow?" → guesswork	Visual flame graph shows the bottleneck instantly
"Which service is the bottleneck?" → blame meetings	Trace shows percentage time per service
"Did the user hit retries?" → log archeology	Trace shows every internal retry
"Cross-service errors are mysterious"	The span at the error has full context
"Hard to debug across teams"	Each team sees only the spans they own; whole trace links them

The killer use case: the request that's slow only sometimes. Metrics show "P99 is bad"; logs show many normal lines; tracing shows the actual slow request, end-to-end.

OpenTelemetry: The Standard

The ecosystem has converged on OpenTelemetry (OTel) — a CNCF project that defines:

Span / trace data model (compatible with W3C Trace Context, which OTel uses as its default propagation format)
SDKs in every major language for emitting spans
OTel Collector — a service that ingests OTel data and forwards to your backend
Auto-instrumentation for popular frameworks (Express, Fastify, Django, Spring, etc.)

The key benefit: decouple the SDK from the backend. You instrument your code once with OTel; you can swap Jaeger for Tempo for Honeycomb for Datadog by changing the collector config, not your code.
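The decoupling can be pictured as application code that depends only on a neutral interface, with the destination chosen by configuration. The sketch below is a toy model of that idea, not the OTel SDK or the Collector. `SpanExporter`, `InMemoryExporter`, `ConsoleExporter`, `handle_checkout`, and the `TRACE_BACKEND` variable are all hypothetical names invented for illustration.

```python
import os
from typing import Protocol


class SpanExporter(Protocol):
    """Hypothetical neutral interface the instrumented code depends on."""
    def export(self, span: dict) -> None: ...


class InMemoryExporter:
    """Stand-in for one backend: keeps spans in a list."""
    def __init__(self) -> None:
        self.spans: list = []

    def export(self, span: dict) -> None:
        self.spans.append(span)


class ConsoleExporter:
    """Stand-in for another backend: prints spans."""
    def export(self, span: dict) -> None:
        print("span:", span)


EXPORTERS = {"memory": InMemoryExporter, "console": ConsoleExporter}


def exporter_from_env() -> SpanExporter:
    # The backend is picked by configuration (an assumed env var), not by code.
    return EXPORTERS[os.environ.get("TRACE_BACKEND", "console")]()


def handle_checkout(exporter: SpanExporter) -> None:
    # Instrumented code: it has no knowledge of which backend receives the span.
    exporter.export({"name": "GET /api/checkout", "duration_ms": 1200})
```

Switching `TRACE_BACKEND` from `console` to `memory` changes where spans go without touching `handle_checkout`, which is the same property the collector config gives you at system scale.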

Backends

OTel SDKs emit data; you need a backend to store and visualize it:

Backend	Self-host	Notes
Jaeger	Yes	CNCF; the classic; great UI; pluggable storage backends (e.g. Cassandra, Elasticsearch)
Grafana Tempo	Yes	Object-storage-backed; cheap to run; integrates with Grafana
Zipkin	Yes	The original Dapper-inspired tracing system; simpler than Jaeger
Honeycomb	SaaS	Best-in-class analysis UI; expensive at scale
Lightstep / ServiceNow Cloud Observability	SaaS	Enterprise; high cardinality
Datadog APM	SaaS	Tightly integrated with their metrics/logs
New Relic / AppSignal / AppDynamics	SaaS	Traditional APM with tracing
AWS X-Ray / GCP Cloud Trace / Azure Application Insights	Cloud-managed	Per-cloud integration
SigNoz	Self-host or cloud	Open-source unified observability (OTel-native)
Sentry	SaaS	Errors + traces in one platform

For new projects: OpenTelemetry SDK + Tempo (self-host) or OpenTelemetry SDK + a SaaS (Honeycomb / Datadog / SigNoz). Don't lock yourself into a vendor SDK; instrument with OTel.

Learning Path

1. Getting Started

Stand up Jaeger and the OTel Collector with Docker; trace a Node.js app end-to-end

2. Instrumentation

Auto-instrumentation, manual spans, attributes, error tracking, context propagation

3. Best Practices

Sampling, cardinality, retention, correlating with logs/metrics, cost, pitfalls

Anatomy of a Trace

┌────────────────────────────────────────────────────────────────────┐
│ Trace abc123                                              1200 ms │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ GET /api/checkout                              service: api    │ │
│ │ ┌──────────────────────────────────────────────────────────┐  │ │
│ │ │ AUTH validate            service: api          15 ms     │  │ │
│ │ └──────────────────────────────────────────────────────────┘  │ │
│ │ ┌──────────────────────────────────────────────────────────┐  │ │
│ │ │ DB query users           service: api          25 ms     │  │ │
│ │ └──────────────────────────────────────────────────────────┘  │ │
│ │ ┌──────────────────────────────────────────────────────────┐  │ │
│ │ │ HTTP inventory-service   service: api         200 ms     │  │ │
│ │ │   ┌────────────────────────────────────────────────────┐ │  │ │
│ │ │   │ DB query inventory   service: inv      180 ms      │ │  │ │
│ │ │   └────────────────────────────────────────────────────┘ │  │ │
│ │ └──────────────────────────────────────────────────────────┘  │ │
│ │ ┌──────────────────────────────────────────────────────────┐  │ │
│ │ │ HTTP payment-service     service: api         950 ms ⚠   │  │ │
│ │ └──────────────────────────────────────────────────────────┘  │ │
│ └───────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────────┘

Each box is a span. The whole stack is a trace. Reading top to bottom: a checkout request took 1.2s, and 950 ms of that was the payment service — which means that's your bottleneck.
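The same reading can be done mechanically. The sketch below hard-codes the spans from the diagram and computes each span's self time (its duration minus the time spent in its direct children), then picks the largest. It assumes children run one after another, as they do in the diagram; the tuple layout and the `self_times` helper are illustrative, not part of any tracing library.

```python
# (name, parent name, duration in ms), copied from the diagram above
spans = [
    ("GET /api/checkout", None, 1200),
    ("AUTH validate", "GET /api/checkout", 15),
    ("DB query users", "GET /api/checkout", 25),
    ("HTTP inventory-service", "GET /api/checkout", 200),
    ("DB query inventory", "HTTP inventory-service", 180),
    ("HTTP payment-service", "GET /api/checkout", 950),
]


def self_times(spans):
    """Return {span name: duration minus the summed duration of direct children}."""
    child_total = {}
    for _name, parent, duration in spans:
        if parent is not None:
            child_total[parent] = child_total.get(parent, 0) + duration
    return {name: duration - child_total.get(name, 0)
            for name, _parent, duration in spans}


times = self_times(spans)
total = spans[0][2]                      # duration of the root span
bottleneck = max(times, key=times.get)   # span with the largest self time
share = 100 * times[bottleneck] / total

print(f"{bottleneck}: {times[bottleneck]} ms ({share:.0f}% of the trace)")
# prints "HTTP payment-service: 950 ms (79% of the trace)"
```

Self time matters because a parent span's duration includes its children: the root span lasts 1200 ms, but only 10 ms of that is its own work.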

Concept	Meaning
Trace	One end-to-end request flow
Span	A unit of work within the trace
Trace ID	Globally unique ID for the trace
Span ID	Unique ID for one span
Parent span ID	The span that triggered this one
Attributes	Key/value pairs on a span (HTTP method, DB query, user ID)
Events	Timestamped log entries within a span
Status	Unset / Ok / Error (Unset is the default)
Baggage	Key/value pairs propagated alongside the trace context across service boundaries
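
As a rough illustration of how these concepts fit together, a span can be modeled as a small record. This is a simplified sketch, not the OpenTelemetry SDK's actual classes: the `Span` dataclass, its field names, and the millisecond timestamps are assumptions chosen to mirror the table.

```python
import secrets
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Illustrative span record; fields mirror the concepts table."""
    name: str
    trace_id: str                   # shared by every span in the trace
    span_id: str                    # unique to this span
    parent_span_id: Optional[str]   # None for the root span
    start_ms: int
    end_ms: int
    status: str
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)   # (timestamp_ms, message) pairs

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def new_trace_id() -> str:
    return secrets.token_hex(16)   # 16 random bytes, 32 hex characters


def new_span_id() -> str:
    return secrets.token_hex(8)    # 8 random bytes, 16 hex characters


trace_id = new_trace_id()
root = Span("GET /api/checkout", trace_id, new_span_id(), None, 0, 1200, "Ok",
            attributes={"http.method": "GET"})
child = Span("HTTP payment-service", trace_id, new_span_id(), root.span_id,
             250, 1200, "Error")
child.events.append((1200, "timed out waiting for payment-service"))
```

The parent/child link is nothing more than `child.parent_span_id == root.span_id` within the same `trace_id`; a backend rebuilds the tree from those two fields.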

Context Propagation

A trace spans services. The trace context (trace ID, parent span ID) must travel with the request:

Service A                              Service B
┌──────────┐  HTTP request             ┌──────────┐
│ span 1   │  traceparent: 00-abc...─→ │ span 2   │
│ trace abc│                            │ trace abc│
│ id=1     │                            │ id=2     │
│ parent=- │                            │ parent=1 │
└──────────┘                            └──────────┘

W3C Trace Context is the standard: it defines the traceparent and tracestate HTTP headers. OTel SDKs use it by default, and instrumented HTTP clients and servers inject and extract these headers automatically.
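
A traceparent value has four dash-separated, lowercase-hex fields: version, trace ID (32 characters), parent span ID (16 characters), and flags (2 characters, where the lowest bit means "sampled"). The sketch below parses and formats that layout with the standard library only. It handles version 00 only and ignores tracestate; `TraceContext`, `parse_traceparent`, and `format_traceparent` are illustrative names, not an OTel API.

```python
import re
from typing import NamedTuple, Optional

# version "00" - 32 hex trace ID - 16 hex parent span ID - 2 hex flags
TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is not valid version 00."""
    match = TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are explicitly invalid in the W3C spec.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def format_traceparent(ctx: TraceContext) -> str:
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{ctx.parent_id}-{flags}"


header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(header)
```

A receiving service keeps `ctx.trace_id`, records `ctx.parent_id` as the parent of its own new span, and sends its own span ID in the parent field when it calls the next service.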

For Kafka / RabbitMQ / SQS messages, propagate the same fields in message headers. For database queries, a session setting (SET application_name='trace=abc123') or a SQL comment (-- trace=abc123) carries the ID into DB query logs.
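
For messaging, the pattern is the same as for HTTP: the producer writes the context into the message headers and the consumer reads it back before starting its own span. The sketch below models a message as a plain dict rather than using a real Kafka, RabbitMQ, or SQS client; `inject`, `extract`, `start_consumer_span`, and the message shape are hypothetical, and the validation is deliberately minimal. OTel SDKs ship propagator APIs that do this for you.

```python
import secrets


def inject(headers: dict, trace_id: str, span_id: str) -> None:
    """Producer side: write the current span's context into the message headers."""
    headers["traceparent"] = f"00-{trace_id}-{span_id}-01"


def extract(headers: dict):
    """Consumer side: return (trace_id, parent_span_id), or None if unusable."""
    parts = headers.get("traceparent", "").split("-")
    if len(parts) != 4 or len(parts[1]) != 32 or len(parts[2]) != 16:
        return None
    return parts[1], parts[2]


def start_consumer_span(headers: dict) -> dict:
    ctx = extract(headers)
    if ctx is None:
        # No usable context: this span becomes the root of a new, disconnected trace.
        return {"trace_id": secrets.token_hex(16),
                "span_id": secrets.token_hex(8),
                "parent_span_id": None}
    trace_id, parent_span_id = ctx
    return {"trace_id": trace_id,
            "span_id": secrets.token_hex(8),
            "parent_span_id": parent_span_id}


# Producer: attach context to an outgoing message.
producer_span = {"trace_id": secrets.token_hex(16), "span_id": secrets.token_hex(8)}
message = {"headers": {}, "body": b'{"order_id": 42}'}
inject(message["headers"], producer_span["trace_id"], producer_span["span_id"])

# Consumer: one message with context, one without.
consumer_span = start_consumer_span(message["headers"])
orphan_span = start_consumer_span({})
```

`consumer_span` joins the producer's trace, while `orphan_span` gets a fresh trace ID and no parent, which is what a backend shows when a hop drops the headers.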

A trace is only as good as the parts that are instrumented. One service that doesn't propagate context breaks the trace — you see two unconnected traces instead of one. Aim for 100% propagation across your services before you worry about depth of instrumentation.
