Steven's Knowledge

Best Practices

Production tracing - sampling strategies, retention, cost, correlating with logs/metrics, pitfalls

Tracing's cost is roughly proportional to volume × cardinality × retention. The patterns below keep all three in check.

Sampling

You can't store every span at scale. Sampling comes down to two questions:

When to decide:

  • Head sampling — the first service decides at the start of the trace; downstream services obey.
  • Tail sampling — collect everything, decide at the end based on what happened.

What to keep:

  • All errors / slow traces — they're rare and high-value.
  • A percentage of normal traffic — to baseline against.
  • Specific user IDs (e.g., enterprise customers, support tickets).
  • Specific routes — keep everything from /checkout, sample 1% from /health.
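
The per-route rule in the last bullet can be expressed as a small lookup. This is a minimal sketch: the route names come from the bullet above, while the 10% default and the function names are assumptions for illustration, not part of any SDK.

```typescript
// Hypothetical per-route keep rates; tune these for your own traffic.
const ROUTE_RATES = new Map<string, number>([
  ["/checkout", 1.0], // keep everything
  ["/health", 0.01],  // keep 1%
]);
const DEFAULT_RATE = 0.1; // assumed baseline for routes not listed

function keepRateForRoute(route: string): number {
  return ROUTE_RATES.get(route) ?? DEFAULT_RATE;
}

// `random` is injectable so the decision can be tested deterministically.
function shouldKeep(route: string, random: () => number = Math.random): boolean {
  return random() < keepRateForRoute(route);
}
```

Passing the random source as a parameter keeps the decision logic pure, which makes it easy to unit-test the rates without flakiness.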

Head Sampling

Simplest, cheapest:

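import { NodeSDK } from '@opentelemetry/sdk-node';
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';

const sdk = new NodeSDK({
  // ParentBased honors the upstream decision; the ratio applies only to root spans (10% of traces)
  sampler: new ParentBasedSampler({ root: new TraceIdRatioBasedSampler(0.1) }),
});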

The decision travels in the sampled flag of the traceparent header. Downstream services honor it as long as they use a parent-based sampler, so all services in a trace agree.

Downside: you sample blindly. The one slow trace you really wanted is in the 90% you dropped.
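
Why can all services agree? The sampled bit is carried in the last field of the W3C traceparent header, and a ratio decision can be written as a pure function of the trace ID. The sketch below illustrates both ideas in simplified form. It is not the OpenTelemetry SDK's actual algorithm, and the function names are hypothetical.

```typescript
// traceparent format: version-traceid-parentid-flags, for example
// 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
interface ParentContext {
  traceId: string;
  spanId: string;
  sampled: boolean;
}

function parseTraceparent(header: string): ParentContext | null {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header.trim());
  if (!m) return null;
  const flags = parseInt(m[4], 16);
  // Bit 0 of the flags byte is the "sampled" flag.
  return { traceId: m[2], spanId: m[3], sampled: (flags & 0x01) === 0x01 };
}

// Simplified ratio decision: map the last 8 hex digits to [0, 1) and compare.
// The same trace ID and ratio always give the same answer.
function ratioDecision(traceId: string, ratio: number): boolean {
  const bucket = parseInt(traceId.slice(-8), 16) / 0x100000000;
  return bucket < ratio;
}

// Honor the parent's decision when there is one; otherwise decide as the root.
function decide(header: string | undefined, newTraceId: string, ratio: number): boolean {
  const parent = header ? parseTraceparent(header) : null;
  return parent ? parent.sampled : ratioDecision(newTraceId, ratio);
}
```

Only the first service ever reaches `ratioDecision`; every downstream hop takes the `parent.sampled` branch, which is the behavior a parent-based sampler gives you.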

Tail Sampling at the Collector

Better at scale. Collect everything, decide at the collector based on the full trace:

# otel-config.yaml
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow
        type: latency
        latency: { threshold_ms: 1000 }
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
      - name: enterprise-users
        type: string_attribute
        string_attribute:
          key: enterprise.account
          values: ["true"]

This config keeps:

  • All errored traces
  • All traces over 1 second
  • 5% of everything else
  • All enterprise customer traces

Result: valuable data preserved, noise filtered. Cost goes way down.

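The collector buffers spans per trace ID. It has no signal that a trace is complete, so it waits decision_wait (10s here) after the first span of a trace arrives, then applies the policies to whatever has been buffered. Spans that arrive later cannot influence the decision, so set decision_wait above your longest expected trace duration. With more than one collector instance, route all spans of a trace to the same instance (for example with the loadbalancing exporter keyed on trace ID); otherwise each instance decides on a partial trace.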
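
To make the decision step concrete, here is a minimal sketch of the same four policies applied to an already-buffered trace. It models only the decision logic, not the collector's buffering or timers. The `Span` shape and function name are assumptions for illustration, and the thresholds mirror the config above.

```typescript
interface Span {
  traceId: string;
  status: "OK" | "ERROR" | "UNSET";
  startMs: number;
  endMs: number;
  attributes: Record<string, string>;
}

const SLOW_MS = 1000;       // latency policy threshold
const BASELINE_RATE = 0.05; // probabilistic policy: 5%

// Decide on a whole trace once its buffered spans are handed over.
function keepTrace(spans: Span[], random: () => number = Math.random): boolean {
  if (spans.length === 0) return false;
  const hasError = spans.some((s) => s.status === "ERROR");
  // Trace duration: earliest start to latest end across all spans.
  const start = Math.min(...spans.map((s) => s.startMs));
  const end = Math.max(...spans.map((s) => s.endMs));
  const isSlow = end - start > SLOW_MS;
  const isEnterprise = spans.some((s) => s.attributes["enterprise.account"] === "true");
  // Policies are OR-ed: any single match keeps the trace.
  return hasError || isSlow || isEnterprise || random() < BASELINE_RATE;
}
```

Because the policies are OR-ed, the baseline percentage only matters for traces that are neither errored, slow, nor from an enterprise account.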

Vendor backends (Honeycomb, Datadog, Lightstep) often have built-in tail sampling — preferable to running your own collector for that purpose.

Retention

Hot retention (queryable in the UI): typically 7-30 days. Cold archive: object storage if you really need it. Most teams keep traces 7-14 days hot and don't archive.

Rule of thumb: you investigate today's traces today. Yesterday's are 90% useless unless something's structurally broken.

Cardinality

Every unique combination of attributes is a "series" the backend indexes. Common explosions:

| Anti-pattern | Symptom | Fix |
| --- | --- | --- |
| http.url with full query string | Millions of unique URLs | Use http.route (the route template) |
| User ID as a span name | Same | User ID is an attribute, not a name |
| Random IDs in span names | "POST /api/users/abc-123" | Normalize: POST /api/users/{id} |
| Putting full bodies in attributes | Backend storage cost spikes | Don't; use logs / blob storage for bodies |
| Timestamp in span name | Same | Span name should describe the operation, not the instance |

Span names should be low-cardinality (operation name, route template). Attributes can be high-cardinality (user ID, order ID).
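
When the framework does not hand you a route template, span names can be normalized with a heuristic. The sketch below assumes that any path segment containing a digit is an ID (except version prefixes such as v1). That rule and the function name are assumptions for illustration, so adjust them to your own URL scheme.

```typescript
// Turn a concrete request path into a low-cardinality span name.
// Heuristic (an assumption, not a standard): a segment containing a digit
// is treated as an ID, unless it looks like a version prefix such as "v1".
function normalizeSpanName(method: string, path: string): string {
  const pathOnly = path.split(/[?#]/)[0]; // drop query string and fragment
  const segments = pathOnly
    .split("/")
    .map((seg) => (/\d/.test(seg) && !/^v\d+$/.test(seg) ? "{id}" : seg));
  return `${method.toUpperCase()} ${segments.join("/")}`;
}
```

A route template supplied by the web framework is more reliable than this kind of guessing, so treat the heuristic as a fallback.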

Trace ID in Everything

The linchpin of observability: the trace ID appears in logs, metrics (as exemplars, since a trace ID is too high-cardinality for a label), error reports, error pages, and customer-facing error responses.

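// logger is a pino instance: const logger = pino();
// traceId comes from the active span, e.g. trace.getActiveSpan()?.spanContext().traceId

// Every log line
logger.info({ trace_id: traceId /* , other fields */ }, 'order placed');

// Error response to the user
res.status(500).json({ error: 'Internal error', trace_id: traceId });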

When a customer files a support ticket, they include the trace ID. Support pastes it into the tracing UI. You see the full request flow. Hours of debugging compressed to seconds.

Correlating with Logs and Metrics

A real observability stack lets you jump between the three pillars:

| From | To | Mechanism |
| --- | --- | --- |
| Trace UI | Logs for that trace | Trace ID in logs; deep-link |
| Trace UI | Metrics for that service | Service name attribute; jump to dashboards |
| Logs | Trace | Click trace_id in log line; opens tracing UI |
| Metrics | Trace | "Show example traces for this high-latency window" (exemplars) |

Exemplars are the modern bridge: Prometheus metrics carry trace IDs of sampled requests, so you can jump from a latency histogram bucket to an actual slow trace. Grafana, Datadog, Honeycomb all support this.
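
On the wire, an exemplar is a small annotation appended to a metric sample. The sketch below renders one histogram bucket line in the OpenMetrics text format to show where the trace ID sits. The metric name and helper are hypothetical, and in practice your metrics client library produces this line for you.

```typescript
interface Exemplar {
  traceId: string;
  value: number; // the observed value of the sampled request, e.g. seconds
}

// Render one histogram bucket sample; an exemplar, if present,
// follows the sample value after " # ".
function bucketLine(metric: string, le: string, count: number, ex?: Exemplar): string {
  const base = `${metric}_bucket{le="${le}"} ${count}`;
  return ex ? `${base} # {trace_id="${ex.traceId}"} ${ex.value}` : base;
}

const line = bucketLine("http_request_duration_seconds", "1.0", 42, {
  traceId: "4bf92f3577b34da6a3ce929d0e0e4736",
  value: 0.87,
});
```

The trace ID lives in the exemplar, not in the metric's label set, so it adds no new time series.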

Cost

Total cost is roughly span volume × retention × the backend's unit price. The real costs at scale:

  • Storage — most backends are pay-per-span ingested. 100 spans per request × 1M requests/day = 100M spans/day.
  • Query — some backends charge for query time / cardinality.
  • Network egress — collectors send to backend; bytes matter.
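
The span arithmetic above is easy to turn into a back-of-the-envelope estimator. This is a sketch under stated assumptions: the field names are illustrative, a month is taken as 30 days, and `pricePerMillionSpans` is a placeholder to be filled from your vendor's pricing, not a real figure.

```typescript
interface TracingLoad {
  requestsPerDay: number;
  spansPerRequest: number;
  keepRatio: number;     // fraction of traces kept after sampling, 0..1
  retentionDays: number;
}

function spansIngestedPerDay(load: TracingLoad): number {
  return load.requestsPerDay * load.spansPerRequest * load.keepRatio;
}

// Spans held in the backend at steady state.
function spansStored(load: TracingLoad): number {
  return spansIngestedPerDay(load) * load.retentionDays;
}

// pricePerMillionSpans is hypothetical; use your backend's actual rate.
function monthlyIngestCost(load: TracingLoad, pricePerMillionSpans: number): number {
  return ((spansIngestedPerDay(load) * 30) / 1_000_000) * pricePerMillionSpans;
}

// The example from the text: 100 spans per request × 1M requests/day, unsampled.
const unsampled: TracingLoad = {
  requestsPerDay: 1_000_000,
  spansPerRequest: 100,
  keepRatio: 1,
  retentionDays: 14,
};
```

Lowering `keepRatio` scales every figure linearly, which is why sampling tends to dominate the other levers.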

Cost levers:

| Lever | Effect |
| --- | --- |
| Tail sampling | Order-of-magnitude reduction |
| Reduce span attributes | Per-span size matters |
| Reduce auto-instrumentation depth | Some auto-instrumentation produces many spans per request |
| Shorter retention | Linear in cost |
| Cheaper backend | Tempo + S3 is dramatically cheaper than Datadog |

Tail sampling is almost always the biggest win.

Backend Choice Revisited

| If you have | Pick |
| --- | --- |
| Existing Grafana / Prometheus / Loki | Tempo (same ecosystem, exemplars work natively) |
| Don't want to operate | Honeycomb (best UX) / Datadog (most coverage) / SigNoz Cloud |
| Strict data residency | Self-host (Jaeger / Tempo / SigNoz) |
| AWS-only | X-Ray (cheapest) or Honeycomb on AWS |
| Already on Sentry | Sentry Performance |

Hot take: the analysis UX between backends differs more than people expect. Honeycomb in particular is designed for "ask any question about any field"; Jaeger is more "look at one trace at a time." Try a few before committing.

What Tracing Doesn't Solve

Honest:

  • Heisenbugs that don't reproduce — tracing helps if it happened recently AND you sampled it.
  • Performance issues outside instrumented code — kernel, network, hardware are dark spots.
  • Real-time alerting — metrics-based alerting is faster and cheaper. Traces feed alerts on aggregate, not per-request.
  • Replacing logs — they're complementary; don't delete your logs.

Common Pitfalls

| Pitfall | Symptom | Fix |
| --- | --- | --- |
| Vendor-specific SDK lock-in | Migration is a rewrite | Use OpenTelemetry SDK |
| No trace_id in logs | Can't jump from log to trace | Log middleware that adds trace_id |
| 100% sampling in production | Bills explode | Tail sample; keep errors, slow, baseline |
| Auto-instrumenting too deeply | Span count explosion | Disable verbose instrumentation (Redis per command, etc.) |
| Custom HTTP client without OTel patch | Breaks the trace | Use a supported client or manually inject |
| Per-user-ID span names | Cardinality explosion | Span name = operation; user ID = attribute |
| http.url with PII / tokens | Trace UI leaks secrets | Strip / hash in collector processors |
| Stopping at "we have OTel" | Auto coverage only, no business attributes | Add manual spans for key business operations |
| Not propagating across queues | Background work disconnected from trace | Inject/extract in message headers |
| Trace data in transit unencrypted | Compliance issue | TLS between SDK → Collector → Backend |
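
For the queue pitfall in the table above, the fix amounts to carrying the traceparent value in the message headers. The sketch below hand-rolls inject and extract to show the mechanics. The types and function names are hypothetical, and in real code the propagation API from @opentelemetry/api is the better choice.

```typescript
type MessageHeaders = Record<string, string>;

interface TraceContext {
  traceId: string; // 32 hex characters
  spanId: string;  // 16 hex characters
  sampled: boolean;
}

// Producer side: write the current context into the message headers.
function inject(ctx: TraceContext, headers: MessageHeaders = {}): MessageHeaders {
  const flags = ctx.sampled ? "01" : "00";
  return { ...headers, traceparent: `00-${ctx.traceId}-${ctx.spanId}-${flags}` };
}

// Consumer side: read it back so the job's spans join the original trace.
function extract(headers: MessageHeaders): TraceContext | null {
  const parts = (headers["traceparent"] ?? "").split("-");
  if (parts.length !== 4 || parts[1].length !== 32 || parts[2].length !== 16) return null;
  return {
    traceId: parts[1],
    spanId: parts[2],
    sampled: (parseInt(parts[3], 16) & 1) === 1,
  };
}
```

The consumer starts its span with the extracted context as parent, so the background job shows up under the request that enqueued it.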

Security

  • TLS between SDK and collector, collector and backend.
  • PII scrubbing in the collector: the attributes processor can delete or hash attributes by key, and the redaction processor can mask values that match a regex.
  • Don't log auth tokens / passwords in span attributes; strip them.
  • RBAC on your tracing UI — traces can include sensitive business data.
  • Audit access — who's looking at traces with PII attributes.
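
As an illustration of the scrubbing step, the sketch below masks sensitive query parameters in a URL before it is stored as a span attribute. The parameter list and function name are examples only. The same transformation can be done in the collector instead, which keeps the rule in one place.

```typescript
// Example parameter names; extend this list for your own APIs.
const SENSITIVE_PARAMS = ["token", "password", "api_key"];

// Replace the values of sensitive query parameters with a fixed marker.
function redactUrl(url: string, params: string[] = SENSITIVE_PARAMS): string {
  // Escape regex metacharacters so parameter names are matched literally.
  const names = params.map((p) => p.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")).join("|");
  const pattern = new RegExp(`([?&])(${names})=[^&#]*`, "gi");
  return url.replace(pattern, "$1$2=REDACTED");
}
```

Masking the value rather than dropping the parameter keeps the URL shape visible in the trace UI while removing the secret.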

Checklist

Production tracing checklist

  • OpenTelemetry SDK (not vendor SDK)
  • OTel Collector between apps and backend
  • W3C Trace Context propagation across all HTTP services
  • Manual spans for key business operations
  • Trace ID emitted in every log line
  • Trace ID returned to users in error responses
  • Auto-instrumentation enabled for HTTP / DB / queue libraries
  • Context propagation across queues / background jobs
  • Tail sampling: keep all errors, all slow traces, baseline %
  • Retention sized for typical investigation window (7-30 days)
  • Span names are operations, not instances (low cardinality)
  • PII / secrets scrubbed in collector before backend
  • TLS between SDK / collector / backend
  • Exemplars wired from metrics → traces
  • Logs UI links to traces and back
  • Backend cost monitored; alert on unexpected spend
