Steven's Knowledge

Tracing

Distributed tracing - OpenTelemetry, Jaeger, Tempo, Honeycomb - see how a request flows through your services

A trace is the full story of a single request as it flows through your system. Every service the request touches records a span — a unit of work with start/end timestamps and metadata. Combined, the spans reconstruct the request's journey: which service called which, how long each step took, where errors happened.

If metrics tell you "the system is slow" and logs tell you "this specific thing happened", tracing tells you "this specific request was slow because the user-service called the inventory-service which called the pricing-service which timed out".

The Three Pillars

| Pillar | What it tells you | Examples |
|---|---|---|
| Metrics | Aggregated numbers over time | RPS, P99 latency, CPU usage (Prometheus) |
| Logs | Discrete events with context | "User 42 failed login at 14:32" (ELK) |
| Traces | The path of a single request | "Request X went through services A→B→C, took 1.2s, errored at C" |

Each answers different questions. You want all three; tracing is the one most often missing.

Why Trace

| Without traces | With traces |
|---|---|
| "Why is this request slow?" → guesswork | Visual flame graph shows the bottleneck instantly |
| "Which service is the bottleneck?" → blame meetings | Trace shows percentage time per service |
| "Did the user hit retries?" → log archeology | Trace shows every internal retry |
| "Cross-service errors are mysterious" | The span at the error has full context |
| "Hard to debug across teams" | Each team sees only the spans they own; the whole trace links them |

The killer use case: the request that's slow only sometimes. Metrics show "P99 is bad"; logs show many normal lines; tracing shows the actual slow request, end-to-end.

OpenTelemetry: The Standard

The ecosystem has converged on OpenTelemetry (OTel) — a CNCF project that defines:

  • Span / trace data model (trace and span IDs are compatible with W3C Trace Context, which OTel uses by default for propagation)
  • SDKs in every major language for emitting spans
  • OTel Collector — a service that ingests OTel data and forwards to your backend
  • Auto-instrumentation for popular frameworks (Express, Fastify, Django, Spring, etc)

The key benefit is that the SDK is decoupled from the backend. You instrument your code once with OTel, and you can move between Jaeger, Tempo, Honeycomb, or Datadog by changing the Collector's exporter config rather than your code.
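The decoupling idea can be illustrated in-process with a toy sketch. This is not the OTel API: `SpanExporter`, `ConsoleExporter`, `InMemoryExporter`, `make_exporter`, and `handle_checkout` are hypothetical names invented for illustration, and in a real deployment the swap usually happens in the Collector config rather than in the application.

```python
from typing import Protocol


class SpanExporter(Protocol):
    """Anything that can receive a finished span (hypothetical interface)."""

    def export(self, span: dict) -> None: ...


class ConsoleExporter:
    """Stand-in for one backend: prints spans."""

    def export(self, span: dict) -> None:
        print(f"span {span['name']} took {span['duration_ms']} ms")


class InMemoryExporter:
    """Stand-in for another backend: keeps spans in a list."""

    def __init__(self) -> None:
        self.spans: list[dict] = []

    def export(self, span: dict) -> None:
        self.spans.append(span)


# The backend is chosen by configuration, not by the instrumented code.
EXPORTERS = {"console": ConsoleExporter, "memory": InMemoryExporter}


def make_exporter(name: str) -> SpanExporter:
    return EXPORTERS[name]()


def handle_checkout(exporter: SpanExporter) -> None:
    # Application code only knows the exporter interface.
    exporter.export({"name": "GET /api/checkout", "duration_ms": 1200})


exporter = make_exporter("memory")
handle_checkout(exporter)
```

Switching `"memory"` to `"console"` changes where spans go without touching `handle_checkout`, which is the property the Collector gives you across real backends.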

Backends

OTel SDKs emit data; you need a backend to store and visualize:

| Backend | Hosting | Notes |
|---|---|---|
| Jaeger | Self-host | CNCF; the classic; great UI; storage adapters for various stores |
| Grafana Tempo | Self-host | Object-storage-backed; cheap to run; integrates with Grafana |
| Zipkin | Self-host | The original Dapper-inspired tracing system; simpler than Jaeger |
| Honeycomb | SaaS | Best-in-class analysis UI; expensive at scale |
| Lightstep / ServiceNow Cloud Observability | SaaS | Enterprise; high cardinality |
| Datadog APM | SaaS | Tightly integrated with their metrics/logs |
| New Relic / AppSignal / AppDynamics | SaaS | Traditional APM with tracing |
| AWS X-Ray / GCP Cloud Trace / Azure Application Insights | Cloud-managed | Per-cloud integration |
| SigNoz | Self-host or cloud | Open-source unified observability (OTel-native) |
| Sentry | SaaS | Errors + traces in one platform |

For new projects: OpenTelemetry SDK + Tempo (self-hosted), or OpenTelemetry SDK + a hosted backend (Honeycomb / Datadog / SigNoz Cloud). Don't lock yourself into a vendor SDK; instrument with OTel.

Anatomy of a Trace

┌────────────────────────────────────────────────────────────────────┐
│ Trace abc123                                              1200 ms │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ GET /api/checkout                              service: api    │ │
│ │ ┌──────────────────────────────────────────────────────────┐  │ │
│ │ │ AUTH validate            service: api          15 ms     │  │ │
│ │ └──────────────────────────────────────────────────────────┘  │ │
│ │ ┌──────────────────────────────────────────────────────────┐  │ │
│ │ │ DB query users           service: api          25 ms     │  │ │
│ │ └──────────────────────────────────────────────────────────┘  │ │
│ │ ┌──────────────────────────────────────────────────────────┐  │ │
│ │ │ HTTP inventory-service   service: api         200 ms     │  │ │
│ │ │   ┌────────────────────────────────────────────────────┐ │  │ │
│ │ │   │ DB query inventory   service: inv      180 ms      │ │  │ │
│ │ │   └────────────────────────────────────────────────────┘ │  │ │
│ │ └──────────────────────────────────────────────────────────┘  │ │
│ │ ┌──────────────────────────────────────────────────────────┐  │ │
│ │ │ HTTP payment-service     service: api         950 ms ⚠   │  │ │
│ │ └──────────────────────────────────────────────────────────┘  │ │
│ └───────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────────┘

Each box is a span. The whole stack is a trace. Reading top to bottom: a checkout request took 1.2s, and 950 ms of that was the payment service — which means that's your bottleneck.
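The same reading can be done mechanically. The sketch below transcribes the spans from the diagram and picks the slowest direct child of the root. `SpanRecord`, `slowest_child`, and `share_of_parent` are hypothetical helpers, and span names stand in for span IDs to keep the example short.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpanRecord:
    name: str
    service: str
    duration_ms: int
    parent: Optional[str]  # name of the parent span; None for the root


# Spans transcribed from the diagram above.
SPANS = [
    SpanRecord("GET /api/checkout", "api", 1200, None),
    SpanRecord("AUTH validate", "api", 15, "GET /api/checkout"),
    SpanRecord("DB query users", "api", 25, "GET /api/checkout"),
    SpanRecord("HTTP inventory-service", "api", 200, "GET /api/checkout"),
    SpanRecord("DB query inventory", "inv", 180, "HTTP inventory-service"),
    SpanRecord("HTTP payment-service", "api", 950, "GET /api/checkout"),
]


def slowest_child(spans: list[SpanRecord], parent_name: str) -> SpanRecord:
    """Return the direct child of parent_name with the longest duration."""
    children = [s for s in spans if s.parent == parent_name]
    return max(children, key=lambda s: s.duration_ms)


def share_of_parent(spans: list[SpanRecord], child: SpanRecord) -> float:
    """Fraction of the parent's duration spent in this child."""
    parent = next(s for s in spans if s.name == child.parent)
    return child.duration_ms / parent.duration_ms


bottleneck = slowest_child(SPANS, "GET /api/checkout")
print(bottleneck.name, f"{share_of_parent(SPANS, bottleneck):.0%}")
# prints: HTTP payment-service 79%
```

Tracing UIs do this visually; the point is only that the span durations and parent links are enough to locate the bottleneck.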

| Concept | Meaning |
|---|---|
| Trace | One end-to-end request flow |
| Span | A unit of work within the trace |
| Trace ID | Globally unique ID for the trace |
| Span ID | Unique ID for one span |
| Parent span ID | ID of the span that triggered this one (absent on the root span) |
| Attributes | Key/value pairs on a span (HTTP method, DB query, user ID) |
| Events | Timestamped log entries within a span |
| Status | Unset / OK / Error |
| Baggage | Key/value pairs propagated alongside the trace context, across service boundaries |
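As a rough illustration of how these fields fit together, the sketch below models a span as a small dataclass and rebuilds the call tree from parent span IDs. It is a simplified model, not the OTel SDK's types: `Span`, `build_tree`, and `render` are hypothetical names, and the IDs are shortened for readability.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]  # None marks the root span
    name: str
    attributes: dict = field(default_factory=dict)
    status: str = "UNSET"  # OTel status codes are Unset, Ok, Error


def build_tree(spans: list[Span]) -> dict:
    """Index spans by parent span ID; the root ends up under key None."""
    children = defaultdict(list)
    for span in spans:
        children[span.parent_span_id].append(span)
    return children


def render(children: dict, parent_id: Optional[str] = None, depth: int = 0) -> list[str]:
    """Walk the tree depth-first and return one indented line per span."""
    lines = []
    for span in children.get(parent_id, []):
        lines.append("  " * depth + span.name)
        lines.extend(render(children, span.span_id, depth + 1))
    return lines


spans = [
    Span("abc123", "1", None, "GET /api/checkout", {"http.method": "GET"}),
    Span("abc123", "2", "1", "HTTP inventory-service"),
    Span("abc123", "3", "2", "DB query inventory"),
    Span("abc123", "4", "1", "HTTP payment-service", status="ERROR"),
]
tree_lines = render(build_tree(spans))
print("\n".join(tree_lines))
```

Every span carries the same trace ID; only the parent span ID is needed to recover who called whom.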

Context Propagation

A trace spans services. The trace context (trace ID, parent span ID) must travel with the request:

Service A                              Service B
┌──────────┐  HTTP request             ┌──────────┐
│ span 1   │  traceparent: 00-abc...─→ │ span 2   │
│ trace abc│                            │ trace abc│
│ id=1     │                            │ id=2     │
│ parent=- │                            │ parent=1 │
└──────────┘                            └──────────┘

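W3C Trace Context is the standard: it defines the traceparent and tracestate HTTP headers. OTel SDKs use it by default, and HTTP clients and servers that are instrumented (automatically or manually) inject and extract these headers for you. An uninstrumented client sends nothing.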
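A minimal sketch of the traceparent header, assuming the version 00 layout from the W3C spec (`version-traceid-parentid-flags`, all lowercase hex). The helper names `format_traceparent`, `parse_traceparent`, and `child_traceparent` are hypothetical; an OTel SDK normally does this for you.

```python
import re
import secrets
from typing import Optional

# version (2 hex) - trace ID (32 hex) - parent span ID (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def format_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str) -> Optional[dict]:
    """Return the parsed fields, or None if the header is malformed."""
    match = TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec treats version ff and all-zero IDs as invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(incoming: str) -> str:
    """Header a service would send downstream: same trace ID, new span ID."""
    ctx = parse_traceparent(incoming)
    if ctx is None:
        # No usable context: start a new trace.
        return format_traceparent(secrets.token_hex(16), secrets.token_hex(8))
    return format_traceparent(ctx["trace_id"], secrets.token_hex(8), ctx["sampled"])


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
```

The trace ID is carried through unchanged, while the span ID field is replaced at each hop so the receiver knows which span is its parent.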

For Kafka / RabbitMQ / SQS messages, propagate the same fields in message headers (message attributes, in SQS terms). For database queries, a session setting (SET application_name='trace=abc123') or an SQL comment (-- trace=abc123) carries the ID into DB query logs.
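A sketch of both ideas, using a plain dict as a stand-in for a message's headers. `inject`, `extract`, and `with_trace_comment` are hypothetical helpers, not a broker or driver API; the bytes handling assumes a client that delivers header values as bytes, as Kafka clients typically do.

```python
import re
from typing import Optional


def inject(headers: dict, traceparent: str) -> dict:
    """Return a copy of the message headers with the trace context added."""
    carrier = dict(headers)
    carrier["traceparent"] = traceparent
    return carrier


def extract(headers: dict) -> Optional[str]:
    """Read traceparent from message headers (case-insensitive key match)."""
    for key, value in headers.items():
        if key.lower() == "traceparent":
            return value.decode() if isinstance(value, bytes) else value
    return None


def with_trace_comment(sql: str, trace_id: str) -> str:
    """Append a trailing SQL comment so the trace ID shows up in query logs."""
    # Only allow hex so the ID cannot terminate the comment or inject SQL.
    if not re.fullmatch(r"[0-9a-f]+", trace_id):
        raise ValueError("trace_id must be lowercase hex")
    return f"{sql} -- trace={trace_id}"


tp = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
message_headers = inject({"content-type": "application/json"}, tp)
query = with_trace_comment("SELECT * FROM inventory WHERE sku = %s", "abc123")
```

The consumer calls `extract` on the received headers and continues the trace from there; the query log then contains `-- trace=abc123` next to the slow statement.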

A trace is only as good as the parts that are instrumented. One service that doesn't propagate context breaks the trace — you see two unconnected traces instead of one. Aim for 100% propagation across your services before you worry about depth of instrumentation.
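One rough way to measure propagation, sketched under stated assumptions: spans are plain dicts with `service`, `kind`, and `parent_span_id` keys, and you know which services are legitimate entry points. An entry (SERVER) span in an internal service with no parent suggests that the caller dropped the context. `dropped_context_spans` and `propagation_rate` are hypothetical helpers, and the heuristic will misfire on internal services that are also called directly.

```python
def dropped_context_spans(spans: list[dict], edge_services: set[str]) -> list[dict]:
    """Entry spans of internal services that arrived without a parent."""
    return [
        s for s in spans
        if s["kind"] == "SERVER"
        and s["parent_span_id"] is None
        and s["service"] not in edge_services
    ]


def propagation_rate(spans: list[dict], edge_services: set[str]) -> float:
    """Share of internal entry spans that are linked to a parent span."""
    internal_entries = [
        s for s in spans
        if s["kind"] == "SERVER" and s["service"] not in edge_services
    ]
    if not internal_entries:
        return 1.0
    broken = dropped_context_spans(spans, edge_services)
    return 1 - len(broken) / len(internal_entries)


collected = [
    # The edge service starts the trace, so a missing parent is expected here.
    {"service": "user-service", "kind": "SERVER", "trace_id": "t1", "parent_span_id": None},
    # inventory-service received the context and is linked into trace t1.
    {"service": "inventory-service", "kind": "SERVER", "trace_id": "t1", "parent_span_id": "a1"},
    # pricing-service got no context and started its own, unconnected trace.
    {"service": "pricing-service", "kind": "SERVER", "trace_id": "t2", "parent_span_id": None},
]
rate = propagation_rate(collected, edge_services={"user-service"})
orphans = dropped_context_spans(collected, edge_services={"user-service"})
```

Here `rate` is 0.5 and `orphans` points at pricing-service, so the fix belongs in whichever caller of pricing-service fails to forward the headers.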
