Observability Pipelines
Vector, OpenTelemetry Collector, Fluent Bit, Cribl - route, transform, sample, and reduce telemetry between producers and backends
An observability pipeline sits between your applications and your observability backends. It collects, transforms, samples, enriches, and routes logs, metrics, and traces, turning a raw firehose into something useful at a manageable cost.
Before observability pipelines, apps wrote directly to Datadog or Splunk via vendor SDKs. The bill was unbounded, the lock-in was complete, and changing backends meant changing code in every service. With a pipeline, apps emit OTLP (the OpenTelemetry protocol) or syslog, and the pipeline decides where the data goes and what it costs.
Why a Pipeline
| Without | With |
|---|---|
| Every service has a vendor SDK | Services emit OTLP / syslog; pipeline routes |
| Vendor lock-in: switching means redeploying | Switch backends by changing the pipeline |
| Ingest is unbounded; bill is a surprise | Sample, drop, aggregate before egress |
| Same data goes to one backend | Same data to many (Datadog + S3 archive + SIEM) |
| Sensitive data leaks to vendor | Pipeline scrubs PII before egress |
| Different teams use different shippers | One pipeline supports all formats |
| 10 collectors per host | One agent, multiple inputs |
The economic argument alone justifies pipelines at any real scale. At the $0.10 per GB ingest list price used in the cost math below, Datadog at 1 TB/day is about $100/day for ingestion alone, before any discounts. Drop 70% of debug logs at the pipeline edge and the bill follows.
The Players
Pipeline tools
| Tool | Strengths | Weaknesses |
|---|---|---|
| OpenTelemetry Collector | OSS standard; logs + metrics + traces; vast ecosystem | Newer; learning curve |
| Vector (Datadog OSS) | Rust, fast, low memory; rich VRL transforms | Logs/metrics strong; traces weaker |
| Fluent Bit | Tiny binary; great on edge devices and K8s nodes | Less powerful transformation |
| Fluentd | Mature; Ruby-based; many plugins | Heavy, older; Fluent Bit usually preferred now |
| Cribl Stream | Commercial; powerful UI; routing/replay | Cost; not OSS |
| Logstash | Mature; rich plugins; part of ELK | JVM heavy; being replaced by Fluent Bit + Vector |
| Telegraf (InfluxData) | Lightweight metrics agent | Metrics-focused |
| Promtail | Loki's collector | Loki-only |
For a fresh stack: OpenTelemetry Collector for traces and metrics + Vector for logs, or Fluent Bit on every node feeding Vector / OTel Collector as the aggregator. Or just OTel Collector for everything if you can navigate its config.
What goes through them

    Sources                    Pipeline             Sinks
    ─────────                  ────────             ─────
    - App stdout             ┌────────────┐         - Datadog
    - App OTLP         ──▶   │            │   ──▶   - S3 / GCS archive
    - Syslog                 │ Transform  │         - Splunk / SIEM
    - Kubernetes logs        │ Sample     │         - Elasticsearch / OpenSearch
    - Cloud metrics          │ Filter     │         - Loki
    - Prometheus scrape      │ Route      │         - Honeycomb
    - AWS CloudWatch         │ Enrich     │         - Custom HTTP / Kafka
                             └────────────┘

What Transforms Do
A pipeline isn't just routing — it's transformation. Common operations:
| Operation | Example |
|---|---|
| Drop | Discard healthcheck logs |
| Sample | 1% of debug logs, 100% of errors |
| Filter | Only logs from specific namespaces |
| Parse | JSON, regex, grok, multiline assembly |
| Enrich | Add cluster, region, environment tags |
| Mask | Replace credit card numbers, emails, IPs with *** |
| Aggregate | Combine N log lines into one metric |
| Convert | Log → metric; metric → trace span |
| Route | Errors → SIEM; everything → S3; rest → Datadog |
| Buffer | Hold data on disk during backend outage |
VRL (Vector Remap Language), Vector's transformation language, is purpose-built for this. The OpenTelemetry Collector expresses the same operations as processors chained together in its YAML configuration.
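The operations in the table compose into a chain: each transform either returns a (possibly modified) event or drops it, and routing happens last. The Python sketch below illustrates that shape only; it is not VRL or Collector configuration. The event layout, the `/healthz` path, the tag values, and the sink names are all assumptions made up for the example.

```python
import re
from typing import Optional

# Hypothetical event shape: a flat dict with "message", "level", and "path" keys.
Event = dict

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def drop_healthchecks(event: Event) -> Optional[Event]:
    """Drop: discard events for the (assumed) health endpoint."""
    return None if event.get("path") == "/healthz" else event


def mask_emails(event: Event) -> Optional[Event]:
    """Mask: replace email addresses in the message with ***."""
    masked = EMAIL_RE.sub("***", event.get("message", ""))
    return {**event, "message": masked}


def enrich(event: Event) -> Optional[Event]:
    """Enrich: attach static deployment tags (placeholder values)."""
    return {**event, "cluster": "prod-1", "region": "us-east-1"}


def route(event: Event) -> list:
    """Route: archive everything; errors go to the SIEM, the rest to the primary backend."""
    sinks = ["s3_archive"]
    sinks.append("siem" if event.get("level") == "error" else "datadog")
    return sinks


TRANSFORMS = [drop_healthchecks, mask_emails, enrich]


def process(event: Event):
    """Run the transforms in order; return None as soon as one drops the event."""
    current = event
    for transform in TRANSFORMS:
        current = transform(current)
        if current is None:
            return None
    return current, route(current)
```

Ordering matters in a chain like this: dropping first means the masking and enrichment work is never spent on events that will be discarded anyway.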
Cost Math
A realistic example for a 100-service production environment (each percentage applies to the volume left after the previous step):

    Raw log volume:      2 TB/day
    Datadog list price:  $0.10 per GB ingested = $200/day = $73k/year

    Pipeline transformations:
    - Drop healthchecks (-30%):                   1.4 TB
    - Drop debug-level in production (-40%):      0.84 TB
    - Sample successful 200 responses (-20%):     0.67 TB
    - Route errors to SIEM, archive to S3 (still 0.67 TB to Datadog)

    Filtered volume:  0.67 TB/day
    Datadog cost:     $67/day = $24k/year
    Pipeline cost:    maybe $200/month in compute
    Net savings:      ~$46k/year

The pipeline pays for itself an order of magnitude over. And you can switch backends.
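The arithmetic above is easy to rerun with your own volumes and prices. The sketch below reproduces it in Python, using only the figures from the example ($0.10 per GB, 2 TB/day, the three reductions, $200/month of compute); the variable and function names are illustrative.

```python
PRICE_PER_GB = 0.10            # ingest list price from the example
RAW_GB_PER_DAY = 2000          # 2 TB/day
PIPELINE_COST_PER_MONTH = 200  # rough compute estimate from the example

# Each step removes a fraction of whatever volume is left after the previous one.
REDUCTIONS = [
    ("drop healthchecks", 0.30),
    ("drop debug-level in production", 0.40),
    ("sample successful 200 responses", 0.20),
]


def remaining_volume(raw_gb, reductions):
    """Apply each fractional reduction in sequence and return the GB/day left."""
    volume = raw_gb
    for _name, fraction in reductions:
        volume *= 1 - fraction
    return volume


def yearly_cost(gb_per_day):
    """Ingest cost per year at the list price."""
    return gb_per_day * PRICE_PER_GB * 365


filtered_gb = remaining_volume(RAW_GB_PER_DAY, REDUCTIONS)
cost_before = yearly_cost(RAW_GB_PER_DAY)
cost_after = yearly_cost(filtered_gb)
pipeline_cost = PIPELINE_COST_PER_MONTH * 12
net_savings = cost_before - cost_after - pipeline_cost

print(f"filtered volume: {filtered_gb:.0f} GB/day")
print(f"backend cost:    ${cost_before:,.0f} -> ${cost_after:,.0f} per year")
print(f"net savings:     ${net_savings:,.0f} per year")
```

With these inputs the filtered volume comes out at 672 GB/day and the net savings at roughly $46k a year. The reductions multiply rather than add, which is why three cuts of 30%, 40%, and 20% remove about two thirds of the volume and not 90%.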
Sampling Patterns
For traces, sampling is essential — full-fidelity tracing is prohibitive at scale:
| Strategy | Description |
|---|---|
| Head sampling | Decide at trace start; cheap but blind to outcomes |
| Tail sampling | Decide after trace completes; keep slow / error traces |
| Probabilistic | Random 1%; uniform but loses rare events |
| Adaptive / rate-limiting | Cap at N traces/second |
| Conditional | All errors, all P99-latency, 1% baseline |
The OpenTelemetry Collector's tail_sampling processor is the standard implementation:

    processors:
      tail_sampling:
        decision_wait: 10s
        num_traces: 100000
        policies:
          - { name: errors-policy, type: status_code, status_code: {status_codes: [ERROR]} }
          - { name: slow-policy, type: latency, latency: {threshold_ms: 1000} }
          - { name: probabilistic, type: probabilistic, probabilistic: {sampling_percentage: 1} }

This keeps every error trace, every slow trace, and 1% of the rest. As a rough rule of thumb, storage cost drops to ~5% of full fidelity while keeping ~95% of the learning value; the actual ratio depends on your error rate and latency distribution.
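To make the decision logic concrete, here is a minimal Python sketch of the keep/drop choice those three policies describe. It is not the Collector's implementation: the `Span` class and `keep_trace` function are hypothetical, and using the longest span as a stand-in for trace latency is a simplifying assumption.

```python
import random
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    duration_ms: float
    is_error: bool = False


def keep_trace(spans, threshold_ms=1000.0, baseline_pct=1.0, rng=random):
    """Decide whether to keep a completed trace.

    Called only after the trace is complete (the role decision_wait plays
    in the config above), so the outcome is known when deciding.
    """
    # Policy 1: keep every trace that contains an error.
    if any(span.is_error for span in spans):
        return True
    # Policy 2: keep slow traces. Assumption: the longest span approximates
    # the end-to-end latency of the trace.
    if max((span.duration_ms for span in spans), default=0.0) >= threshold_ms:
        return True
    # Policy 3: keep a random baseline percentage of everything else.
    return rng.random() * 100 < baseline_pct
```

The contrast with head sampling is visible in the signature: the function receives the finished spans, so it can look at errors and latency, at the price of buffering every trace until the decision is made.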
Logs as Metrics
The cheapest log is the one you converted to a metric:

    log:    "checkout.completed user_id=abc total=42.50"
              ↓ pipeline extracts
    metric: checkout_completed_total{} 1
            checkout_total_dollars 42.50

Now you can alert on metric thresholds without keeping the log. Most pipelines support this: Vector has a log_to_metric transform, and the OpenTelemetry Collector can derive metrics from logs with connectors such as count.
A standard pattern: logs are kept for debugging (short retention); metrics are kept for trends (long retention). The pipeline does the conversion.
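The conversion itself is small. The following Python sketch shows the extraction for the checkout line used in the example; it illustrates the idea and is not how Vector or the Collector implement it. The `CheckoutMetrics` class and the regular expression are assumptions written for this one log format.

```python
import re

# Matches the example line: checkout.completed user_id=<id> total=<amount>
LINE_RE = re.compile(
    r"^checkout\.completed user_id=(?P<user_id>\S+) total=(?P<total>\d+(?:\.\d+)?)$"
)


class CheckoutMetrics:
    """Accumulates the two metrics from the example instead of storing log lines."""

    def __init__(self):
        self.checkout_completed_total = 0
        self.checkout_total_dollars = 0.0

    def observe(self, line):
        """Update the metrics from one log line; return False if it does not match."""
        match = LINE_RE.match(line.strip())
        if match is None:
            return False
        self.checkout_completed_total += 1
        self.checkout_total_dollars += float(match.group("total"))
        return True


metrics = CheckoutMetrics()
metrics.observe("checkout.completed user_id=abc total=42.50")
print(metrics.checkout_completed_total, metrics.checkout_total_dollars)
```

Note that `user_id` is parsed but deliberately not turned into a metric label: a per-user label would create one time series per user and move the cost problem from logs to metrics.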
Learning Path
1. Getting Started
Deploy Vector and OpenTelemetry Collector locally; route logs through transforms; ship to multiple backends; observe with Prometheus
2. Patterns
Multi-backend routing, sampling strategies, PII masking, logs-to-metrics, edge vs aggregator, buffering, error budgets
3. Best Practices
Pipeline reliability, capacity, monitoring the pipeline, config management, common pitfalls, scaling
When You Don't Need a Pipeline (Yet)
Honest cases:
- Small fleet, single backend, low volume. Direct SDK from app → Datadog is fine until ~$1k/month.
- Single team, no compliance scoping. PII masking, SIEM routing, archive — these are scale concerns.
- You're standing up your first observability stack. Get something working first; insert a pipeline once you understand what you're collecting.
But the moment any of these becomes true, add a pipeline:
- Multiple backends to feed
- Vendor cost concern
- PII / compliance routing needs
- Multiple shippers per host
- Switching vendors becomes a project
The biggest pipeline mistake: writing complex transforms in the pipeline that should be done at the source. If your app emits structured JSON with the right fields, you barely need transformation. If your app emits "INFO 2024-01-15 ... checkout completed for user 123 total $42.50", the pipeline has to parse-regex-extract-rename, every event, forever. Push the structure upstream; let the pipeline focus on routing and sampling.
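As a sketch of what "push the structure upstream" can look like, the Python example below uses the standard library `logging` module to emit one JSON object per event. The `JsonFormatter` class, the `build_logger` helper, and the convention of passing a `fields` dict through `extra` are assumptions chosen for illustration, not an established API.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object instead of free text."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        }
        # Structured fields arrive via `extra={"fields": {...}}` (our convention).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(name, stream):
    """Create a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# Demo: write to an in-memory buffer, then show the emitted line.
buffer = io.StringIO()
log = build_logger("checkout", buffer)
log.info("checkout.completed", extra={"fields": {"user_id": "123", "total": 42.50}})
print(buffer.getvalue().strip())
```

An event emitted this way already carries `user_id` and `total` as typed fields, so the pipeline can route, sample, or convert it to a metric without a regex.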