Steven's Knowledge

Observability Pipelines

Vector, OpenTelemetry Collector, Fluent Bit, Cribl - route, transform, sample, and reduce telemetry between producers and backends

An observability pipeline sits between your applications and your observability backends. It collects, transforms, samples, enriches, and routes logs, metrics, and traces, turning a raw firehose into something useful at a manageable cost.

Before observability pipelines, apps wrote directly to Datadog or Splunk via vendor SDKs. The bill was unbounded, the lock-in was complete, and changing backends meant changing code in every service. With a pipeline, apps emit a vendor-neutral format such as OTLP (the OpenTelemetry protocol) or syslog, and the pipeline decides where it goes and what it costs.

Why a Pipeline

Without | With
Every service has a vendor SDK | Services emit OTLP / syslog; pipeline routes
Vendor lock-in: switching means redeploying | Switch backends by changing the pipeline
Ingest is unbounded; bill is a surprise | Sample, drop, aggregate before egress
Same data goes to one backend | Same data to many (Datadog + S3 archive + SIEM)
Sensitive data leaks to vendor | Pipeline scrubs PII before egress
Different teams use different shippers | One pipeline supports all formats
10 collectors per host | One agent, multiple inputs

The economic argument alone justifies pipelines at any real scale. Datadog at 1 TB/day is roughly $10k/day before discounts. Drop 70% of debug logs at the pipeline edge and the bill follows.

The Players

Pipeline tools

Tool | Strengths | Weaknesses
OpenTelemetry Collector | OSS standard; logs + metrics + traces; vast ecosystem | Newer; learning curve
Vector (Datadog OSS) | Rust, fast, low memory; rich VRL transforms | Logs/metrics strong; traces weaker
Fluent Bit | Tiny binary; great on edge devices and K8s nodes | Less powerful transformation
Fluentd | Mature; Ruby-based; many plugins | Heavy, older; Fluent Bit usually preferred now
Cribl Stream | Commercial; powerful UI; routing/replay | Cost; not OSS
Logstash | Mature; rich plugins; part of ELK | JVM heavy; being replaced by Fluent Bit + Vector
Telegraf (InfluxData) | Lightweight metrics agent | Metrics-focused
Promtail | Loki's collector | Loki-only

For a fresh stack: OpenTelemetry Collector for traces and metrics + Vector for logs, or Fluent Bit on every node feeding Vector / OTel Collector as the aggregator. Or just OTel Collector for everything if you can navigate its config.

What goes through them

Sources                  Pipeline              Sinks
─────────                ────────              ─────
- App stdout             ┌────────────┐        - Datadog
- App OTLP        ──▶    │            │  ──▶   - S3 / GCS archive
- Syslog                 │ Transform  │        - Splunk / SIEM
- Kubernetes logs        │ Sample     │        - Elasticsearch / OpenSearch
- Cloud metrics          │ Filter     │        - Loki
- Prometheus scrape      │ Route      │        - Honeycomb
- AWS CloudWatch         │ Enrich     │        - Custom HTTP / Kafka
                         └────────────┘

What Transforms Do

A pipeline isn't just routing — it's transformation. Common operations:

Operation | Example
Drop | Discard healthcheck logs
Sample | 1% of debug logs, 100% of errors
Filter | Only logs from specific namespaces
Parse | JSON, regex, grok, multiline assembly
Enrich | Add cluster, region, environment tags
Mask | Replace credit card numbers, emails, IPs with ***
Aggregate | Combine N log lines into one metric
Convert | Log → metric; trace span → metric
Route | Errors → SIEM; everything → S3; rest → Datadog
Buffer | Hold data on disk during backend outage

VRL (Vector Remap Language), Vector's transformation language, is purpose-built for this. The OpenTelemetry Collector instead chains processors declared in YAML, with OTTL statements in the transform processor for field-level edits.
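
A few of the operations in the table above can be sketched in plain Python to show the shape of the logic. This is not VRL or Collector configuration, only an illustration. The event fields (`message`, `level`, `path`), the `/healthz` path, the tag values, and the sink names are all assumptions made for the example.

```python
import re

# Hypothetical event shape: a dict with "message", "level", and "path" keys.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# 13 to 16 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")

# Assumed static tags; a real pipeline would read these from its environment.
STATIC_TAGS = {"cluster": "prod-1", "region": "us-east-1"}


def transform(event):
    """Return the transformed event, or None if it should be dropped."""
    # Drop: discard healthcheck noise.
    if event.get("path") == "/healthz":
        return None
    out = dict(event)
    # Mask: replace emails and card-like digit runs with ***.
    message = out.get("message", "")
    message = EMAIL_RE.sub("***", message)
    message = CARD_RE.sub("***", message)
    out["message"] = message
    # Enrich: attach cluster and region tags.
    out.update(STATIC_TAGS)
    return out


def route(event):
    """Everything goes to the archive; errors to the SIEM; the rest to Datadog."""
    sinks = ["s3_archive"]
    if event.get("level") == "error":
        sinks.append("siem")
    else:
        sinks.append("datadog")
    return sinks
```

The regexes here are deliberately simple and will both miss and over-match real-world data; production masking usually needs stricter patterns and validation.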

Cost Math

A realistic example. 100-service production:

Raw log volume:      2 TB/day
Datadog list price:  $0.10 per GB ingested = $200/day = $73k/year

Pipeline transformations:

  • Drop healthchecks (-30%): 1.4 TB
  • Drop debug-level in production (-40%): 0.84 TB
  • Sample successful 200 responses (-20%): 0.67 TB
  • Route errors to SIEM, archive to S3 (still 0.67 TB to Datadog)
Filtered volume:     0.67 TB/day
Datadog cost:        $67/day = $24k/year
Pipeline cost:       maybe $200/month in compute
Net savings:         ~$46k/year

The pipeline pays for itself an order of magnitude over. And you can switch backends.
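
The arithmetic above can be reproduced with a short script, which makes it easy to plug in your own volumes and reduction rates. The numbers are the ones from the example; the function and variable names are illustrative, and the calculation assumes decimal units (1 TB = 1000 GB) and that each reduction applies to whatever volume is left after the previous step.

```python
RAW_TB_PER_DAY = 2.0
PRICE_PER_GB = 0.10          # list ingest price used in the example
PIPELINE_PER_MONTH = 200.0   # compute estimate used in the example

# Each step removes a fraction of the volume that is still left.
REDUCTIONS = [
    ("drop healthchecks", 0.30),
    ("drop debug-level in production", 0.40),
    ("sample successful 200 responses", 0.20),
]


def remaining_volume(raw_tb, reductions):
    """Apply each fractional reduction in sequence and return TB/day left."""
    volume = raw_tb
    for _name, fraction in reductions:
        volume *= 1.0 - fraction
    return volume


def yearly_cost(tb_per_day, price_per_gb):
    """Yearly ingest cost in dollars, assuming 1 TB = 1000 GB."""
    return tb_per_day * 1000 * price_per_gb * 365


filtered_tb = remaining_volume(RAW_TB_PER_DAY, REDUCTIONS)
before = yearly_cost(RAW_TB_PER_DAY, PRICE_PER_GB)
after = yearly_cost(filtered_tb, PRICE_PER_GB)
net_savings = before - after - PIPELINE_PER_MONTH * 12

print(f"filtered volume: {filtered_tb:.2f} TB/day")
print(f"cost before: ${before:,.0f}/year, after: ${after:,.0f}/year")
print(f"net savings: ${net_savings:,.0f}/year")
```

Because the reductions compound, their order does not change the final volume, but it does change how much each individual step appears to save.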

Sampling Patterns

For traces, sampling is essential — full-fidelity tracing is prohibitive at scale:

Strategy | Description
Head sampling | Decide at trace start; cheap but blind to outcomes
Tail sampling | Decide after trace completes; keep slow / error traces
Probabilistic | Random 1%; uniform but loses rare events
Adaptive / rate-limiting | Cap at N traces/second
Conditional | All errors, all P99-latency, 1% baseline
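
As an illustration of the rate-limiting row, here is a minimal fixed-window sketch that keeps at most N traces per second. The class name and the injectable `clock` parameter are assumptions for the example; real samplers often use token buckets or adaptive targets instead.

```python
import time


class RateLimitingSampler:
    """Keep at most `max_per_second` traces in each one-second window."""

    def __init__(self, max_per_second, clock=time.monotonic):
        self.max_per_second = max_per_second
        self.clock = clock                # injectable so tests can fake time
        self.window_start = clock()
        self.kept_in_window = 0

    def should_keep(self):
        now = self.clock()
        # Start a fresh window once a full second has elapsed.
        if now - self.window_start >= 1.0:
            self.window_start = now
            self.kept_in_window = 0
        if self.kept_in_window < self.max_per_second:
            self.kept_in_window += 1
            return True
        return False
```

A fixed window is simple but allows short bursts at window boundaries, which is usually acceptable for capping trace volume.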

OpenTelemetry Collector's tail_sampling processor is the standard implementation:

processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      - { name: errors-policy, type: status_code, status_code: {status_codes: [ERROR]} }
      - { name: slow-policy, type: latency, latency: {threshold_ms: 1000} }
      - { name: probabilistic, type: probabilistic, probabilistic: {sampling_percentage: 1} }

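This keeps every error trace, every trace slower than 1 second, and 1% of the rest; a trace is kept if any one policy matches. As a rough rule of thumb, storage drops to around 5% of full fidelity while most of the diagnostic value is retained, though the exact ratio depends on your error rate and latency distribution.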
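
The decision logic behind that configuration can be sketched in Python. This is a simplified model, not the processor's actual implementation: the `Trace` fields, the hash-based baseline, and the exact comparison against the threshold are assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Trace:
    trace_id: str
    has_error: bool
    duration_ms: float


def baseline_sampled(trace_id, pct):
    """Deterministically keep roughly `pct` percent of trace IDs."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10000  # 0..9999
    return bucket < pct * 100


def keep_trace(trace, threshold_ms=1000.0, baseline_pct=1.0):
    """Keep the trace if any policy matches (policies are OR-ed)."""
    if trace.has_error:                    # errors policy
        return True
    if trace.duration_ms > threshold_ms:   # slow policy
        return True
    return baseline_sampled(trace.trace_id, baseline_pct)  # baseline policy
```

Hashing the trace ID rather than drawing a random number means every collector instance reaches the same decision for the same trace, which matters when more than one instance might see it.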

Logs as Metrics

The cheapest log is the one you converted to a metric:

log: "checkout.completed user_id=abc total=42.50"
        ↓ pipeline extracts
metric: checkout_completed_total{} 1
        checkout_total_dollars 42.50

Now you can alert on metric thresholds without keeping the log. Most pipelines support this, for example via connectors such as count (OTel Collector) or the log_to_metric transform (Vector).

A standard pattern: logs are kept for debugging (short retention); metrics are kept for trends (long retention). The pipeline does the conversion.
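
A minimal Python sketch of the extraction shown above, assuming the log line has the form `event key=value key=value`. The function name and the in-memory counter dictionary are illustrative; a real pipeline would emit to a metrics sink instead.

```python
import re
from collections import defaultdict

# Assumed line shape: an event name followed by space-separated key=value pairs.
LINE_RE = re.compile(r"^(?P<event>[\w.]+)\s+(?P<fields>.*)$")


def log_to_metrics(line, counters):
    """Update counters from one log line; return True if the line matched."""
    match = LINE_RE.match(line)
    if match is None or match["event"] != "checkout.completed":
        return False
    fields = dict(
        pair.split("=", 1) for pair in match["fields"].split() if "=" in pair
    )
    # One completed checkout, plus its dollar total (0 if the field is absent).
    counters["checkout_completed_total"] += 1
    counters["checkout_total_dollars"] += float(fields.get("total", 0))
    return True


counters = defaultdict(float)
log_to_metrics("checkout.completed user_id=abc total=42.50", counters)
```

Note that `user_id` is deliberately not carried into the metric: high-cardinality labels are what make metrics expensive, so they stay in the short-retention logs.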

When You Don't Need a Pipeline (Yet)

Honest cases:

  • Small fleet, single backend, low volume. Direct SDK from app → Datadog is fine until ~$1k/month.
  • Single team, no compliance scoping. PII masking, SIEM routing, archive — these are scale concerns.
  • You're standing up your first observability stack. Get something working first; insert a pipeline once you understand what you're collecting.

But the moment any of these become true, add a pipeline:

  • Multiple backends to feed
  • Vendor cost concern
  • PII / compliance routing needs
  • Multiple shippers per host
  • Switching vendors becomes a project

The biggest pipeline mistake: writing complex transforms in the pipeline that should be done at the source. If your app emits structured JSON with the right fields, you barely need transformation. If your app emits "INFO 2024-01-15 ... checkout completed for user 123 total $42.50", the pipeline has to parse-regex-extract-rename, every event, forever. Push the structure upstream; let the pipeline focus on routing and sampling.
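
The difference is easy to see side by side. In this sketch, the regex is modeled on the unstructured example line in the paragraph above, and the JSON field names (`level`, `event`, `user_id`, `total`) are assumptions about what a well-structured app might emit.

```python
import json
import re

# Unstructured: the pipeline must know the exact wording of the message.
TEXT_RE = re.compile(
    r"^(?P<level>[A-Z]+) (?P<date>\d{4}-\d{2}-\d{2}) .*"
    r"checkout completed for user (?P<user_id>\S+) total \$(?P<total>[\d.]+)$"
)


def parse_text(line):
    """Parse, extract, and rename fields from a free-text log line."""
    match = TEXT_RE.match(line)
    if match is None:
        return None
    return {
        "level": match["level"].lower(),
        "event": "checkout.completed",
        "user_id": match["user_id"],
        "total": float(match["total"]),
    }


def parse_json(line):
    """Structured: the fields arrive already named and typed."""
    return json.loads(line)
```

The regex version breaks as soon as someone rewords the message; the JSON version only breaks if a field is renamed, which is a much more visible change.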
