Best Practices
Production tracing - sampling strategies, retention, cost, correlating with logs/metrics, pitfalls
Tracing's cost is roughly proportional to volume × cardinality × retention. The patterns below keep all three in check.
Sampling
You can't store every span at scale, so sampling comes down to two questions.
When to decide:
- Head sampling — the first service decides at the start of the trace; downstream services obey.
- Tail sampling — collect everything, decide at the end based on what happened.
What to keep:
- All errors / slow traces — they're rare and high-value.
- A percentage of normal traffic — to baseline against.
- Specific user IDs (e.g., enterprise customers, support tickets).
- Specific routes: keep everything from /checkout, sample 1% from /health.
Head Sampling
Simplest, cheapest:
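    import { NodeSDK } from '@opentelemetry/sdk-node';
    import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';
    const sdk = new NodeSDK({
      // Root spans: keep 10% of traces. Child spans: follow the caller's decision.
      sampler: new ParentBasedSampler({ root: new TraceIdRatioBasedSampler(0.1) }),
    });

The decision propagates via the sampled flag in the traceparent header. With the parent-based wrapper, downstream services follow that flag, so all services in a trace agree.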
Downside: you sample blindly. The one slow trace you really wanted is in the 90% you dropped.
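To make the propagation concrete, here is a minimal sketch of how a service could read the sampled flag from an incoming W3C traceparent header and fall back to a ratio only at the root. The names parseTraceparent and shouldRecord are illustrative, not part of any SDK; in practice the OpenTelemetry propagator and a parent-based sampler do this for you, and the real ratio sampler derives its decision from the trace ID rather than from a random draw.

```typescript
// Illustrative types and names; not an OpenTelemetry API.
interface ParentContext {
  traceId: string;
  parentSpanId: string;
  sampled: boolean;
}

// W3C Trace Context: version-traceid-parentid-flags, all lowercase hex.
const TRACEPARENT = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): ParentContext | null {
  const match = TRACEPARENT.exec(header.trim());
  if (match === null) return null;
  const [, version = "", traceId = "", parentSpanId = "", flags = ""] = match;
  // Version ff and all-zero IDs are invalid per the spec.
  if (version === "ff" || /^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null;
  // Bit 0 of the flags byte is the "sampled" flag.
  return { traceId, parentSpanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// Head sampling: obey the parent if there is one, otherwise decide once at the root.
// `random` is injectable so the behaviour can be tested deterministically.
function shouldRecord(
  header: string | undefined,
  ratio: number,
  random: () => number = Math.random
): boolean {
  const parent = header ? parseTraceparent(header) : null;
  return parent !== null ? parent.sampled : random() < ratio;
}
```

A downstream service that receives a header ending in -00 drops its spans even if its own ratio is 1.0, which is what keeps traces whole instead of leaving them with missing segments.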
Tail Sampling at the Collector
Better at scale. Collect everything, decide at the collector based on the full trace:
    # otel-config.yaml
    processors:
      tail_sampling:
        decision_wait: 10s
        policies:
          - name: errors
            type: status_code
            status_code: { status_codes: [ERROR] }
          - name: slow
            type: latency
            latency: { threshold_ms: 1000 }
          - name: baseline
            type: probabilistic
            probabilistic: { sampling_percentage: 5 }
          - name: enterprise-users
            type: string_attribute
            string_attribute:
              key: enterprise.account
              values: ["true"]

This config keeps:
- All errored traces
- All traces over 1 second
- 5% of everything else
- All enterprise customer traces
Result: valuable data preserved, noise filtered. Cost goes way down.
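The collector buffers spans per trace ID. It cannot know when a trace is complete, so it waits decision_wait after the first span of a trace arrives and then applies the policies to whatever has been buffered. Set decision_wait longer than your slowest expected trace.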
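The buffer-then-decide behaviour can be sketched in a few lines. This is a toy model of the four policies in the config above, not the collector's implementation: the Span shape, the TailSampler class, and the explicit decide call (standing in for the decision_wait timer) are all assumptions made for illustration.

```typescript
// Hypothetical span shape for illustration; a real collector works on OTLP spans.
interface Span {
  traceId: string;
  status: "OK" | "ERROR" | "UNSET";
  startMs: number;
  endMs: number;
  attributes: Record<string, string>;
}

type Policy = (spans: Span[]) => boolean;

// Duration of the whole trace: earliest start to latest end.
function traceDurationMs(spans: Span[]): number {
  if (spans.length === 0) return 0;
  const start = Math.min(...spans.map((s) => s.startMs));
  const end = Math.max(...spans.map((s) => s.endMs));
  return end - start;
}

// One function per policy in the config; `random` is injectable so tests are deterministic.
function buildPolicies(random: () => number = Math.random): Policy[] {
  return [
    (spans) => spans.some((s) => s.status === "ERROR"),                          // errors
    (spans) => traceDurationMs(spans) > 1000,                                    // slow
    () => random() < 0.05,                                                       // baseline
    (spans) => spans.some((s) => s.attributes["enterprise.account"] === "true"), // enterprise-users
  ];
}

class TailSampler {
  private readonly buffer = new Map<string, Span[]>();
  private readonly policies: Policy[];

  constructor(policies: Policy[]) {
    this.policies = policies;
  }

  // Buffer every incoming span under its trace ID.
  add(span: Span): void {
    const spans = this.buffer.get(span.traceId) ?? [];
    spans.push(span);
    this.buffer.set(span.traceId, spans);
  }

  // Call once the wait period for this trace has elapsed: keep the trace if ANY policy matches.
  decide(traceId: string): Span[] {
    const spans = this.buffer.get(traceId) ?? [];
    this.buffer.delete(traceId);
    return this.policies.some((policy) => policy(spans)) ? spans : [];
  }
}
```

The sketch also shows why tail sampling costs memory: every span sits in the buffer until its trace is decided, so the wait period and the traffic rate together set the collector's footprint.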
Vendor backends (Honeycomb, Datadog, Lightstep) often have built-in tail sampling — preferable to running your own collector for that purpose.
Retention
Hot retention (queryable in the UI): typically 7-30 days. Cold archive: object storage if you really need it. Most teams keep traces 7-14 days hot and don't archive.
Rule of thumb: you investigate today's traces today. Yesterday's are 90% useless unless something's structurally broken.
Cardinality
Every unique combination of attributes is a "series" the backend indexes. Common explosions:
| Anti-pattern | Symptom | Fix |
|---|---|---|
| http.url with full query string | Millions of unique URLs | Use http.route (the route template) |
| User ID as a span name | One span name per user | User ID is an attribute, not a name |
| Random IDs in span names | "POST /api/users/abc-123" | Normalize: POST /api/users/{id} |
| Putting full bodies in attributes | Backend storage cost spikes | Don't; use logs / blob storage for bodies |
| Timestamp in span name | One span name per request | Span name should describe the operation, not the instance |
Span names should be low-cardinality (operation name, route template). Attributes can be high-cardinality (user ID, order ID).
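When the framework does not expose the matched route template, a fallback normalizer can collapse ID-like path segments before the span is named. The sketch below is one possible approach; the normalizeSpanName helper and the ID patterns are assumptions about what your identifiers look like, and the route template from your router is preferable whenever it is available.

```typescript
// Illustrative patterns; adjust them to the ID formats your service actually uses.
const ID_PATTERNS: RegExp[] = [
  /^\d+$/,                                                            // numeric IDs
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i, // UUIDs
  /^[a-z]+-\d+$/i,                                                    // prefixed IDs such as abc-123
];

// Builds a low-cardinality span name from an HTTP method and a raw request path.
function normalizeSpanName(method: string, path: string): string {
  // Drop the query string; it belongs in attributes, if anywhere.
  const [pathOnly = ""] = path.split("?");
  const segments = pathOnly
    .split("/")
    .map((segment) => (ID_PATTERNS.some((pattern) => pattern.test(segment)) ? "{id}" : segment));
  return `${method.toUpperCase()} ${segments.join("/")}`;
}
```

With these patterns, POST /api/users/abc-123 and POST /api/users/xyz-999 both become POST /api/users/{id}, while the raw ID can still be recorded as an attribute on the span.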
Trace ID in Everything
The lynchpin of observability: the trace ID appears in logs, metrics (as exemplars rather than labels, since one label value per trace would explode cardinality), error reports, error pages, and customer-facing error responses.
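    import pino from 'pino';
    import { trace } from '@opentelemetry/api';
    const logger = pino();
    const traceId = trace.getActiveSpan()?.spanContext().traceId;
    // Every log line
    logger.info({ trace_id: traceId /* , ...other fields */ }, 'order placed');
    // Error response to the user (res is the framework's response object)
    res.status(500).json({ error: 'Internal error', trace_id: traceId });

When a customer files a support ticket, they include the trace ID. Support pastes it into the tracing UI and, provided the trace was sampled, sees the full request flow. Hours of debugging compressed to seconds.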
Correlating with Logs and Metrics
A real observability stack lets you jump between the three pillars:
| From | To | Mechanism |
|---|---|---|
| Trace UI | Logs for that trace | Trace ID in logs; deep-link |
| Trace UI | Metrics for that service | Service name attribute; jump to dashboards |
| Logs | Trace | Click trace_id in log line; opens tracing UI |
| Metrics | Trace | "Show example traces for this high-latency window" (exemplars) |
Exemplars are the modern bridge: Prometheus metrics carry trace IDs of sampled requests, so you can jump from a latency histogram bucket to an actual slow trace. Grafana, Datadog, Honeycomb all support this.
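On the wire, an exemplar is a small annotation attached to a histogram bucket sample. The sketch below formats one bucket line in OpenMetrics text format to show where the trace ID travels. The bucketLine helper and the Exemplar type are hypothetical, and trace_id as the exemplar label name is a common convention rather than a requirement; a metrics client library would normally emit this for you.

```typescript
// Hypothetical helper types; a real metrics client library produces this output itself.
interface Exemplar {
  traceId: string;
  value: number;             // the observed value, e.g. request duration in seconds
  timestampSeconds?: number; // optional in OpenMetrics
}

// Formats one histogram bucket sample, optionally followed by "# {labels} value [timestamp]".
function bucketLine(metric: string, le: string, count: number, exemplar?: Exemplar): string {
  const sample = `${metric}_bucket{le="${le}"} ${count}`;
  if (exemplar === undefined) return sample;
  const timestamp = exemplar.timestampSeconds !== undefined ? ` ${exemplar.timestampSeconds}` : "";
  return `${sample} # {trace_id="${exemplar.traceId}"} ${exemplar.value}${timestamp}`;
}
```

Because the trace ID rides alongside the sample instead of being a label, it adds no new time series, which is why exemplars are safe where a trace_id label would not be.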
Cost
Cost is roughly tracing volume × retention × the backend's unit price. Where the money goes at scale:
- Storage — most backends are pay-per-span ingested. 100 spans per request × 1M requests/day = 100M spans/day.
- Query — some backends charge for query time / cardinality.
- Network egress — collectors send to backend; bytes matter.
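The span arithmetic above is easy to turn into a back-of-the-envelope estimator. The sketch below only counts spans; the keepFraction and retentionDays inputs are hypothetical knobs, and converting span counts into money depends on your backend's pricing, which is deliberately left out.

```typescript
// Inputs for a rough volume estimate; all values are supplied by the caller.
interface TracingLoad {
  spansPerRequest: number;
  requestsPerDay: number;
  keepFraction: number;  // share of traces kept after sampling, between 0 and 1
  retentionDays: number;
}

interface VolumeEstimate {
  spansGeneratedPerDay: number;
  spansIngestedPerDay: number;
  spansStored: number; // steady-state spans held in hot retention
}

function estimateVolume(load: TracingLoad): VolumeEstimate {
  const spansGeneratedPerDay = load.spansPerRequest * load.requestsPerDay;
  const spansIngestedPerDay = spansGeneratedPerDay * load.keepFraction;
  const spansStored = spansIngestedPerDay * load.retentionDays;
  return { spansGeneratedPerDay, spansIngestedPerDay, spansStored };
}
```

For the example in the list (100 spans per request, 1M requests/day), keeping everything for 14 days means about 1.4 billion spans in hot storage; keeping 10% of traces cuts both ingest and storage by the same factor of ten.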
Cost levers:
| Lever | Effect |
|---|---|
| Tail sampling | Order-of-magnitude reduction |
| Reduce span attributes | Per-span size matters |
| Reduce auto-instrumentation depth | Some auto-instrumentation produces many spans per request |
| Shorter retention | Linear in cost |
| Cheaper backend | Tempo + S3 is dramatically cheaper than Datadog |
Tail sampling is almost always the biggest win.
Backend Choice Revisited
| If you have | Pick |
|---|---|
| Existing Grafana / Prometheus / Loki | Tempo — same ecosystem, exemplars work natively |
| Don't want to operate | Honeycomb (best UX) / Datadog (most coverage) / SigNoz Cloud |
| Strict data residency | Self-host (Jaeger / Tempo / SigNoz) |
| AWS-only | X-Ray (cheapest) or Honeycomb on AWS |
| Already on Sentry | Sentry Performance |
Hot take: the analysis UX between backends differs more than people expect. Honeycomb in particular is designed for "ask any question about any field"; Jaeger is more "look at one trace at a time." Try a few before committing.
What Tracing Doesn't Solve
Honest:
- Heisenbugs that don't reproduce — tracing helps if it happened recently AND you sampled it.
- Performance issues outside instrumented code — kernel, network, hardware are dark spots.
- Real-time alerting — metrics-based alerting is faster and cheaper. Traces feed alerts on aggregate, not per-request.
- Replacing logs — they're complementary; don't delete your logs.
Common Pitfalls
| Pitfall | Symptom | Fix |
|---|---|---|
| Vendor-specific SDK lock-in | Migration is a rewrite | Use OpenTelemetry SDK |
| No trace_id in logs | Can't jump from log to trace | Log middleware that adds trace_id |
| 100% sampling in production | Bills explode | Tail sample; keep errors, slow, baseline |
| Auto-instrumenting too deeply | Span count explosion | Disable verbose instrumentation (Redis per command, etc) |
| Custom HTTP client without OTel patch | Breaks the trace | Use a supported client or manually inject |
| Per-user-ID span names | Cardinality explosion | Span name = operation; user ID = attribute |
| http.url with PII / tokens | Trace UI leaks secrets | Strip / hash in collector processors |
| Stopping at "we have OTel" | Auto coverage only, no business attributes | Add manual spans for key business operations |
| Not propagating across queues | Background work disconnected from trace | Inject/extract in message headers |
| Trace data in transit unencrypted | Compliance issue | TLS between SDK → Collector → Backend |
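The queue pitfall in the table is the one that most often needs hand-written code, so here is a minimal sketch of inject and extract over message headers using the W3C traceparent format. The QueueMessage and SpanContext shapes and both function names are hypothetical; with OpenTelemetry you would normally call the propagation API or rely on an instrumentation package for your broker instead.

```typescript
// Hypothetical message shape; most brokers expose headers or message attributes in a similar way.
interface QueueMessage {
  body: string;
  headers: Record<string, string>;
}

interface SpanContext {
  traceId: string; // 32 hex characters
  spanId: string;  // 16 hex characters
  sampled: boolean;
}

// Producer side: copy the current span context into the message headers.
function inject(ctx: SpanContext, message: QueueMessage): QueueMessage {
  const flags = ctx.sampled ? "01" : "00";
  return {
    ...message,
    headers: { ...message.headers, traceparent: `00-${ctx.traceId}-${ctx.spanId}-${flags}` },
  };
}

// Consumer side: recover the producer's context so the worker's span joins the same trace.
function extract(message: QueueMessage): SpanContext | null {
  const parts = (message.headers["traceparent"] ?? "").split("-");
  if (parts.length !== 4) return null;
  const [version, traceId, spanId, flags] = parts;
  if (!version || !traceId || !spanId || !flags) return null;
  if (traceId.length !== 32 || spanId.length !== 16) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}
```

The worker then starts its span with the extracted trace ID and uses the extracted span ID as the parent, so background work shows up under the request that enqueued it.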
Security
- TLS between SDK and collector, collector and backend.
- PII scrubbing in the collector: attributes processor with regex.
- Don't log auth tokens / passwords in span attributes; strip them.
- RBAC on your tracing UI — traces can include sensitive business data.
- Audit access — who's looking at traces with PII attributes.
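The scrubbing rules are easier to review when written out as code. The sketch below expresses the drop-and-redact logic as a plain function; the blocked keys, the regular expressions, and the scrubAttributes name are illustrative assumptions, and the same rules would need to be translated into processor configuration to run in the collector.

```typescript
// Illustrative rules; tune the keys and patterns to the data your services actually emit.
const BLOCKED_KEYS = new Set(["http.request.header.authorization", "password", "auth.token"]);
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const QUERY_SECRET = /([?&](?:token|api_key|password)=)[^&\s]+/gi;

// Returns a copy of the attributes with blocked keys dropped and sensitive values redacted.
function scrubAttributes(attributes: Record<string, string>): Record<string, string> {
  const clean: Record<string, string> = {};
  for (const [key, value] of Object.entries(attributes)) {
    if (BLOCKED_KEYS.has(key.toLowerCase())) continue; // drop the attribute entirely
    clean[key] = value
      .replace(QUERY_SECRET, "$1[REDACTED]") // keep the parameter name, hide its value
      .replace(EMAIL, "[EMAIL]");
  }
  return clean;
}
```

Dropping by key handles attributes that should never exist, while the value patterns catch secrets that leak into otherwise legitimate fields such as URLs.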
Checklist
Production tracing checklist
- OpenTelemetry SDK (not vendor SDK)
- OTel Collector between apps and backend
- W3C Trace Context propagation across all HTTP services
- Manual spans for key business operations
- Trace ID emitted in every log line
- Trace ID returned to users in error responses
- Auto-instrumentation enabled for HTTP / DB / queue libraries
- Context propagation across queues / background jobs
- Tail sampling: keep all errors, all slow traces, baseline %
- Retention sized for typical investigation window (7-30 days)
- Span names are operations, not instances (low cardinality)
- PII / secrets scrubbed in collector before backend
- TLS between SDK / collector / backend
- Exemplars wired from metrics → traces
- Logs UI links to traces and back
- Backend cost monitored; alert on unexpected spend