Best Practices

Production tracing - sampling strategies, retention, cost, correlating with logs/metrics, pitfalls

Tracing's cost is roughly proportional to volume × cardinality × retention. The patterns below keep all three in check.

Sampling

You can't store every span at scale, so sampling comes down to two questions:

When to decide:

Head sampling — the first service decides at the start of the trace; downstream services obey.
Tail sampling — collect everything, decide at the end based on what happened.

What to keep:

All errors / slow traces — they're rare and high-value.
A percentage of normal traffic — to baseline against.
Specific user IDs (e.g., enterprise customers, support tickets).
Specific routes — keep everything from /checkout, sample 1% from /health.

Head Sampling

Simplest, cheapest:

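import { NodeSDK } from '@opentelemetry/sdk-node';
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';
const sdk = new NodeSDK({
  // Sample 10% of new traces; spans with a parent follow the parent's decision
  sampler: new ParentBasedSampler({ root: new TraceIdRatioBasedSampler(0.1) }),
});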

The decision propagates in the sampled flag of the W3C traceparent header, so every downstream service that uses a parent-based sampler agrees with it.
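To make the flag concrete, here is a minimal sketch of reading the sampled bit from a version-00 traceparent header. The `parseTraceparent` helper and `TraceParent` type are hypothetical names for illustration; in practice the OpenTelemetry propagator does this for you.

```typescript
// Parsed form of a W3C `traceparent` header (version 00 only).
interface TraceParent {
  traceId: string;
  parentId: string;
  sampled: boolean;
}

// Hypothetical helper: returns null for anything that is not a valid
// version-00 header.
function parseTraceparent(header: string): TraceParent | null {
  const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header.trim());
  if (match === null) return null;
  const traceId = match[1] as string;
  const parentId = match[2] as string;
  const flags = match[3] as string;
  // All-zero trace IDs and parent IDs are invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  // Bit 0 of the flags byte is the "sampled" flag.
  const sampled = (parseInt(flags, 16) & 0x01) === 0x01;
  return { traceId, parentId, sampled };
}

const kept = parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
const dropped = parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-00");
```

Here `kept.sampled` is true and `dropped.sampled` is false; a parent-based sampler downstream simply copies that bit instead of rolling the dice again.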

Downside: you sample blindly. The one slow trace you really wanted is in the 90% you dropped.

Tail Sampling at the Collector

Better at scale. Collect everything, decide at the collector based on the full trace:

# otel-config.yaml
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow
        type: latency
        latency: { threshold_ms: 1000 }
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
      - name: enterprise-users
        type: string_attribute
        string_attribute:
          key: enterprise.account
          values: ["true"]

This config keeps:

All errored traces
All traces over 1 second
5% of everything else
All enterprise customer traces

Result: valuable data preserved, noise filtered. Cost goes way down.

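The collector buffers spans per trace ID. It cannot know when a trace is complete, so it waits decision_wait (10s here) after the first span of a trace arrives and then applies the policies to whatever it has buffered. Spans that arrive after that point were not part of the evaluation, so set decision_wait above your longest expected trace duration. All spans of a trace must also reach the same collector instance, which in a multi-collector setup means routing by trace ID in front of the sampling tier.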
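The four policies in the config above combine with OR: a trace is kept if any one of them matches. The sketch below mirrors that logic as a plain function so the behavior is easy to reason about. It is not the collector's implementation; the `Span` shape, the `decide` function, and the use of the longest span as a stand-in for trace duration are all assumptions made for illustration.

```typescript
interface Span {
  traceId: string;
  durationMs: number;
  status: "OK" | "ERROR" | "UNSET";
  attributes: Record<string, string>;
}

interface Decision {
  keep: boolean;
  policy: string | null; // first policy that matched, or null if dropped
}

// `roll` is a number in [0, 1), passed in so the decision is deterministic
// in tests; a real sampler would derive it from the trace ID or an RNG.
function decide(spans: Span[], roll: number): Decision {
  if (spans.some((s) => s.status === "ERROR")) {
    return { keep: true, policy: "errors" };
  }
  // Assumption: approximate trace duration by the longest single span.
  const longest = spans.reduce((max, s) => Math.max(max, s.durationMs), 0);
  if (longest > 1000) {
    return { keep: true, policy: "slow" };
  }
  if (spans.some((s) => s.attributes["enterprise.account"] === "true")) {
    return { keep: true, policy: "enterprise-users" };
  }
  if (roll < 0.05) {
    return { keep: true, policy: "baseline" };
  }
  return { keep: false, policy: null };
}

const fast: Span = { traceId: "t1", durationMs: 40, status: "OK", attributes: {} };
const failed: Span = { traceId: "t2", durationMs: 40, status: "ERROR", attributes: {} };
const slow: Span = { traceId: "t3", durationMs: 2500, status: "OK", attributes: {} };
const enterprise: Span = {
  traceId: "t4",
  durationMs: 40,
  status: "OK",
  attributes: { "enterprise.account": "true" },
};
```

A fast, successful, non-enterprise trace is kept only when the roll falls under 0.05; the other three are kept regardless of the roll.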

Vendors such as Honeycomb, Datadog, and Lightstep offer their own tail-sampling options. If you already use one of them, that is usually less work than operating a sampling collector yourself.

Retention

Hot retention (queryable in the UI): typically 7-30 days. Cold archive: object storage if you really need it. Most teams keep traces 7-14 days hot and don't archive.

Rule of thumb: you investigate today's traces today. Yesterday's are 90% useless unless something's structurally broken.

Cardinality

Every unique span name and every distinct value of an indexed attribute is something the backend has to index. Common explosions:

Anti-pattern	Symptom	Fix
`http.url` with full query string	Millions of unique URLs	Use `http.route` (the route template)
User ID as a span name	Millions of unique span names	User ID is an attribute, not a name
Random IDs in span names	"POST /api/users/abc-123"	Normalize: `POST /api/users/{id}`
Putting full bodies in attributes	Backend storage cost spikes	Don't; use logs / blob storage for bodies
Timestamp in span name	Every span gets a unique name	Span name should describe the operation, not the instance

Span names should be low-cardinality (operation name, route template). Attributes can be high-cardinality (user ID, order ID).
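When the framework does not expose the route template, a small normalizer can approximate it. The sketch below is a heuristic, not a standard: `normalizeSpanName` and `looksLikeId` are hypothetical helpers, and the rule "any path segment containing a digit is an ID, except version segments like v1" is an assumption you would tune for your own routes.

```typescript
// Heuristic: a segment with a digit in it is treated as an identifier,
// unless it looks like an API version ("v1", "v2", ...).
function looksLikeId(segment: string): boolean {
  if (/^v\d+$/i.test(segment)) return false;
  return /\d/.test(segment);
}

// Builds a low-cardinality span name from an HTTP method and a raw path.
function normalizeSpanName(method: string, path: string): string {
  // Drop the query string; it never belongs in a span name.
  const queryStart = path.indexOf("?");
  const pathOnly = queryStart === -1 ? path : path.slice(0, queryStart);
  const segments = pathOnly
    .split("/")
    .map((segment) => (looksLikeId(segment) ? "{id}" : segment));
  return `${method.toUpperCase()} ${segments.join("/")}`;
}

const userSpan = normalizeSpanName("POST", "/api/users/abc-123");
const orderSpan = normalizeSpanName("get", "/api/v1/orders/42?expand=items");
```

`userSpan` is `POST /api/users/{id}` and `orderSpan` is `GET /api/v1/orders/{id}`. The raw ID still belongs on the span, just as an attribute rather than in the name.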

Trace ID in Everything

The lynchpin of observability: the trace ID appears in logs, metrics (as exemplars rather than labels, since a per-request label explodes cardinality), error reports, error pages, and customer-facing error responses.

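import pino from 'pino';
import { trace } from '@opentelemetry/api';

const logger = pino();
// Trace ID of the currently active span, if there is one
const traceId = trace.getActiveSpan()?.spanContext().traceId;

// Every log line
logger.info({ trace_id: traceId /* , ...other fields */ }, 'order placed');

// Error response to the user
res.status(500).json({ error: 'Internal error', trace_id: traceId });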

When a customer files a support ticket, they include the trace ID. Support pastes it into the tracing UI. You see the full request flow. Hours of debugging compressed to seconds.

Correlating with Logs and Metrics

A real observability stack lets you jump between the three pillars:

From	To	Mechanism
Trace UI	Logs for that trace	Trace ID in logs; deep-link
Trace UI	Metrics for that service	Service name attribute; jump to dashboards
Logs	Trace	Click trace_id in log line; opens tracing UI
Metrics	Trace	"Show example traces for this high-latency window" (exemplars)

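Exemplars are the modern bridge: a Prometheus histogram bucket can carry the trace ID of a sampled request that landed in it, so you can jump from a latency spike straight to an actual slow trace. Prometheus needs the OpenMetrics exposition format and exemplar storage enabled for this to work. Grafana, Datadog, and Honeycomb all support this kind of jump.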
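For a sense of what an exemplar looks like on the wire, the sketch below formats one histogram bucket line in the OpenMetrics text format, where the exemplar follows a `#` after the sample value. The `formatBucketWithExemplar` helper and the metric name are hypothetical; a metrics client library normally emits this for you.

```typescript
interface Exemplar {
  traceId: string;
  value: number; // the observed value that landed in this bucket
}

// Formats one histogram bucket sample with an attached exemplar:
//   <metric>_bucket{le="<bound>"} <count> # {trace_id="<id>"} <value>
function formatBucketWithExemplar(
  metric: string,
  upperBound: string,
  count: number,
  exemplar: Exemplar,
): string {
  const sample = `${metric}_bucket{le="${upperBound}"} ${count}`;
  const attached = `{trace_id="${exemplar.traceId}"} ${exemplar.value}`;
  return `${sample} # ${attached}`;
}

const line = formatBucketWithExemplar("http_request_duration_seconds", "1.0", 42, {
  traceId: "4bf92f3577b34da6a3ce929d0e0e4736",
  value: 0.87,
});
```

The dashboard reads the `trace_id` label from the exemplar and turns it into a deep link to the tracing UI.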

Cost

Cost is roughly span volume × retention × the backend's unit price. Where it shows up at scale:

Ingest and storage: many backends bill per span (or per GB) ingested. 100 spans per request × 1M requests/day = 100M spans/day.
Query: some backends charge for query time / cardinality.
Network egress: collectors send to the backend; bytes matter.

Cost levers:

Lever	Effect
Tail sampling	Order-of-magnitude reduction
Reduce span attributes	Per-span size matters
Reduce auto-instrumentation depth	Some auto-instrumentation produces many spans per request
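Shorter retention	Storage cost scales roughly linearly with retention
Cheaper backend	Self-hosted Tempo on S3 typically costs far less per span than a SaaS backend such as Datadog, but you operate it yourself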

Tail sampling is almost always the biggest win.
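A back-of-the-envelope calculation shows why. The sketch below takes the 100M spans/day figure from earlier and applies a keep-errors, keep-slow, sample-the-rest policy. The 1% error rate, 2% slow rate, and 5% baseline are illustrative assumptions, as is the simplification that every trace has the same number of spans and that errored and slow traces do not overlap.

```typescript
interface TrafficProfile {
  requestsPerDay: number;
  spansPerRequest: number;
  errorRate: number;    // fraction of traces that error (assumed)
  slowRate: number;     // fraction of traces over the latency threshold (assumed)
  baselineRate: number; // fraction of the remaining traces kept as a baseline
}

function spansPerDay(p: TrafficProfile): number {
  return p.requestsPerDay * p.spansPerRequest;
}

// Fraction of traces kept: all errored and slow traces, plus a sample of the rest.
function keptFraction(p: TrafficProfile): number {
  const alwaysKept = p.errorRate + p.slowRate;
  return alwaysKept + (1 - alwaysKept) * p.baselineRate;
}

const profile: TrafficProfile = {
  requestsPerDay: 1_000_000,
  spansPerRequest: 100,
  errorRate: 0.01,
  slowRate: 0.02,
  baselineRate: 0.05,
};

const totalSpans = spansPerDay(profile);              // 100,000,000
const keptSpans = totalSpans * keptFraction(profile); // about 7,850,000
```

Under these assumptions roughly 7.85% of spans survive, a reduction of about 13×, while every errored and slow trace is still there to look at.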

Backend Choice Revisited

If you have	Pick
Existing Grafana / Prometheus / Loki	Tempo — same ecosystem, exemplars work natively
Don't want to operate	Honeycomb (best UX) / Datadog (most coverage) / SigNoz Cloud
Strict data residency	Self-host (Jaeger / Tempo / SigNoz)
AWS-only	X-Ray (cheapest) or Honeycomb on AWS
Already on Sentry	Sentry Performance

Hot take: the analysis UX between backends differs more than people expect. Honeycomb in particular is designed for "ask any question about any field"; Jaeger is more "look at one trace at a time." Try a few before committing.

What Tracing Doesn't Solve

Honest:

Heisenbugs that don't reproduce — tracing helps if it happened recently AND you sampled it.
Performance issues outside instrumented code — kernel, network, hardware are dark spots.
Real-time alerting — metrics-based alerting is faster and cheaper. Traces feed alerts on aggregate, not per-request.
Replacing logs — they're complementary; don't delete your logs.

Common Pitfalls

Pitfall	Symptom	Fix
Vendor-specific SDK lock-in	Migration is a rewrite	Use OpenTelemetry SDK
No trace_id in logs	Can't jump from log to trace	Log middleware that adds trace_id
100% sampling in production	Bills explode	Tail sample; keep errors, slow, baseline
Auto-instrumenting too deeply	Span count explosion	Disable verbose instrumentation (e.g., a span per Redis command)
Custom HTTP client without OTel patch	Breaks the trace	Use a supported client or manually inject
Per-user-ID span names	Cardinality explosion	Span name = operation; user ID = attribute
`http.url` with PII / tokens	Trace UI leaks secrets	Strip / hash in collector processors
Stopping at "we have OTel"	Auto coverage only, no business attributes	Add manual spans for key business operations
Not propagating across queues	Background work disconnected from trace	Inject/extract in message headers
Trace data in transit unencrypted	Compliance issue	TLS between SDK → Collector → Backend

Security

TLS between SDK and collector, collector and backend.
Scrub PII in the collector: the attributes processor can delete or hash specific keys, and the redaction processor can mask values by regex.
Don't put auth tokens / passwords in span attributes; strip them before export.
RBAC on your tracing UI — traces can include sensitive business data.
Audit access — who's looking at traces with PII attributes.
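Scrubbing can also happen in the application before spans leave the process, as a second line of defense. The sketch below is one way to do it under stated assumptions: `scrubAttributes`, `stripQuery`, and the list of sensitive key patterns are hypothetical and would need to match your own attribute names.

```typescript
type Attributes = Record<string, string>;

// Assumption: attribute keys containing any of these words hold secrets.
const SENSITIVE_KEY = /(authorization|password|token|secret|cookie)/i;

// Removes the query string, where tokens and PII most often hide.
function stripQuery(url: string): string {
  const queryStart = url.indexOf("?");
  return queryStart === -1 ? url : url.slice(0, queryStart);
}

// Returns a copy of the attributes with secrets dropped and URLs trimmed.
function scrubAttributes(attrs: Attributes): Attributes {
  const clean: Attributes = {};
  for (const key of Object.keys(attrs)) {
    const value = attrs[key];
    if (value === undefined) continue;
    if (SENSITIVE_KEY.test(key)) continue; // drop the attribute entirely
    clean[key] = key === "http.url" ? stripQuery(value) : value;
  }
  return clean;
}

const scrubbed = scrubAttributes({
  "http.url": "https://shop.example.com/checkout?session_token=abc123",
  "http.request.header.authorization": "Bearer abc123",
  "order.id": "ord_789",
});
```

The result keeps `order.id`, drops the authorization header, and reduces `http.url` to `https://shop.example.com/checkout`.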
