Vector, OpenTelemetry Collector, Fluent Bit, Cribl - route, transform, sample, and reduce telemetry between producers and backends

Observability Pipelines

An observability pipeline sits between your applications and your observability backends. It collects, transforms, samples, enriches, and routes logs, metrics, and traces, turning a raw firehose into something useful at a manageable cost.

Before observability pipelines, apps wrote directly to Datadog or Splunk via vendor SDKs. The bill was unbounded, the lock-in was complete, and changing backends meant changing code in every service. With a pipeline, apps emit OTLP (the OpenTelemetry protocol) or syslog, and the pipeline decides where the data goes and what it costs.

Why a Pipeline

Without	With
Every service has a vendor SDK	Services emit OTLP / syslog; pipeline routes
Vendor lock-in: switching means redeploying	Switch backends by changing the pipeline
Ingest is unbounded; bill is a surprise	Sample, drop, aggregate before egress
Same data goes to one backend	Same data to many (Datadog + S3 archive + SIEM)
Sensitive data leaks to vendor	Pipeline scrubs PII before egress
Different teams use different shippers	One pipeline supports all formats
10 collectors per host	One agent, multiple inputs

The economic argument alone justifies pipelines at any real scale. At the $0.10/GB list price used in the cost math below, 1 TB/day of logs is roughly $100/day (about $36k/year) for ingest alone, before discounts. Drop 70% of debug logs at the pipeline edge and the bill follows.

The Players

Pipeline tools

Tool	Strengths	Weaknesses
OpenTelemetry Collector	OSS standard; logs + metrics + traces; vast ecosystem	Newer; learning curve
Vector (Datadog OSS)	Rust, fast, low memory; rich VRL transforms	Logs/metrics strong; traces weaker
Fluent Bit	Tiny binary; great on edge devices and K8s nodes	Less powerful transformation
Fluentd	Mature; Ruby-based; many plugins	Heavy, older; Fluent Bit usually preferred now
Cribl Stream	Commercial; powerful UI; routing/replay	Cost; not OSS
Logstash	Mature; rich plugins; part of ELK	JVM heavy; being replaced by Fluent Bit + Vector
Telegraf (InfluxData)	Lightweight metrics agent	Metrics-focused
Promtail	Loki's collector	Loki-only

For a fresh stack: OpenTelemetry Collector for traces and metrics + Vector for logs, or Fluent Bit on every node feeding Vector / OTel Collector as the aggregator. Or just OTel Collector for everything if you can navigate its config.

What goes through them

Sources                  Pipeline              Sinks
─────────                ────────              ─────
- App stdout             ┌────────────┐        - Datadog
- App OTLP        ──▶    │            │  ──▶   - S3 / GCS archive
- Syslog                 │ Transform  │        - Splunk / SIEM
- Kubernetes logs        │ Sample     │        - Elasticsearch / OpenSearch
- Cloud metrics          │ Filter     │        - Loki
- Prometheus scrape      │ Route      │        - Honeycomb
- AWS CloudWatch         │ Enrich     │        - Custom HTTP / Kafka
                         └────────────┘
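
The fan-out on the right-hand side of the diagram can be sketched in a few lines of Python. This is an illustration only, not any tool's API: the sink names (`s3_archive`, `siem`, `datadog`), the `level` field, and the routing policy itself (archive everything, send errors to the SIEM, send the rest to Datadog) are assumptions.

```python
from typing import List


def route(event: dict) -> List[str]:
    """Return the (hypothetical) sink names this event should be sent to."""
    sinks = ["s3_archive"]           # everything is archived
    if event.get("level") == "error":
        sinks.append("siem")         # errors go to the SIEM
    else:
        sinks.append("datadog")      # the rest goes to the primary backend
    return sinks
```

Because routing is data-driven, changing a backend means editing this one function (or its config equivalent) rather than every service.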

What Transforms Do

A pipeline isn't just routing — it's transformation. Common operations:

Operation	Example
Drop	Discard healthcheck logs
Sample	1% of debug logs, 100% of errors
Filter	Only logs from specific namespaces
Parse	JSON, regex, grok, multiline assembly
Enrich	Add cluster, region, environment tags
Mask	Replace credit card numbers, emails, IPs with `***`
Aggregate	Combine N log lines into one metric
Convert	Log → metric; trace span → metric
Route	Errors → SIEM; everything → S3; rest → Datadog
Buffer	Hold data on disk during backend outage

VRL (Vector Remap Language), Vector's transformation language, is purpose-built for this. The OpenTelemetry Collector instead chains processors declared in its YAML config.
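
To make three of the operations above concrete (drop, mask, enrich), here is a minimal Python sketch. It is neither VRL nor a Collector processor; the event shape (a dict with `path` and `message`), the `/healthz` path, the tag values, and the deliberately simple regexes are all assumptions for illustration.

```python
import re
from typing import Optional

# Hypothetical static tags an agent might attach to every event.
STATIC_TAGS = {"cluster": "prod-1", "region": "us-east-1", "env": "production"}

# Deliberately simple patterns; real PII detection needs more care.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")  # 13 to 16 digits


def transform(event: dict) -> Optional[dict]:
    """Return the transformed event, or None if it should be dropped."""
    # Drop: discard healthcheck noise.
    if event.get("path") == "/healthz":
        return None
    out = dict(event)  # copy so the caller's event is left untouched
    # Mask: replace emails and card-like digit runs with ***.
    message = out.get("message", "")
    message = EMAIL_RE.sub("***", message)
    message = CARD_RE.sub("***", message)
    out["message"] = message
    # Enrich: add cluster, region, and environment tags.
    out.update(STATIC_TAGS)
    return out
```

Order matters in a chain like this: dropping first means the more expensive regex work only runs on events that will actually leave the pipeline.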

Cost Math

A realistic example for a 100-service production environment:

Raw log volume:      2 TB/day
Datadog list price:  $0.10 per GB ingested = $200/day = $73k/year

Pipeline transformations:

Drop healthchecks (-30%): 1.4 TB
Drop debug-level in production (-40%): 0.84 TB
Sample successful 200 responses (-20%): 0.67 TB
Route errors to SIEM, archive to S3 (still 0.67 TB to Datadog)

Filtered volume:     0.67 TB/day
Datadog cost:        $67/day = $24k/year
Pipeline cost:       maybe $200/month in compute
Net savings:         ~$46k/year

The pipeline pays for itself more than an order of magnitude over, and you keep the freedom to switch backends.
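
The arithmetic above can be reproduced in a few lines of Python, which makes it easy to plug in your own volumes and prices. The figures are the ones from this example; the names (`daily_volume_after`, `PRICE_PER_GB`, and so on) are illustrative. Carrying unrounded intermediate values, the net comes out at roughly $46k/year.

```python
from typing import Sequence


def daily_volume_after(raw_tb: float, reductions: Sequence[float]) -> float:
    """Apply successive fractional reductions (0.30 means 'drop 30%')."""
    volume = raw_tb
    for r in reductions:
        volume *= 1.0 - r
    return volume


PRICE_PER_GB = 0.10   # list price used in the example above
GB_PER_TB = 1000
raw_tb = 2.0

# Healthchecks (-30%), debug logs (-40%), sampled 200 responses (-20%).
filtered_tb = daily_volume_after(raw_tb, [0.30, 0.40, 0.20])

raw_cost_year = raw_tb * GB_PER_TB * PRICE_PER_GB * 365
filtered_cost_year = filtered_tb * GB_PER_TB * PRICE_PER_GB * 365
pipeline_cost_year = 200 * 12  # "maybe $200/month in compute"
net_savings = raw_cost_year - filtered_cost_year - pipeline_cost_year
```

Note that the reductions compound: each percentage applies to what is left after the previous step, not to the raw volume.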

Sampling Patterns

For traces, sampling is essential: full-fidelity tracing is prohibitively expensive at scale.

Strategy	Description
Head sampling	Decide at trace start; cheap but blind to outcomes
Tail sampling	Decide after trace completes; keep slow / error traces
Probabilistic	Random 1%; uniform but loses rare events
Adaptive / rate-limiting	Cap at N traces/second
Conditional	All errors, all traces slower than P99 latency, 1% baseline

The OpenTelemetry Collector's tail_sampling processor (shipped in the contrib distribution) is the usual implementation:

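processors:
  tail_sampling:
    decision_wait: 10s      # time to wait after a trace's first span before deciding
    num_traces: 100000      # traces held in memory while awaiting a decision
    policies:
      # Keep every trace that contains an error.
      - { name: errors-policy, type: status_code, status_code: {status_codes: [ERROR]} }
      # Keep every trace slower than 1 second.
      - { name: slow-policy, type: latency, latency: {threshold_ms: 1000} }
      # Keep a random 1% baseline.
      - { name: probabilistic, type: probabilistic, probabilistic: {sampling_percentage: 1} }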

This keeps every error trace, every slow trace, and 1% of the rest. As a rough rule of thumb, that is ~5% of the storage cost of full fidelity for ~95% of the learning value.
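
The decision logic behind a policy set like this can be sketched in plain Python. This is not the Collector's implementation: the `Span` type, the `keep_trace` function, and the simplification that a trace's latency equals its longest span are assumptions made to keep the example short.

```python
import random
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Span:
    duration_ms: float
    is_error: bool = False


def keep_trace(
    spans: Sequence[Span],
    threshold_ms: float = 1000.0,
    baseline_pct: float = 1.0,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Decide after the trace completes; keep it if any policy matches."""
    # Policy 1: any span reported an error.
    if any(s.is_error for s in spans):
        return True
    # Policy 2: slow trace (approximated here by the longest span).
    if max((s.duration_ms for s in spans), default=0.0) > threshold_ms:
        return True
    # Policy 3: random baseline for everything else.
    return rng() * 100.0 < baseline_pct
```

The key property of tail sampling shows up in the signature: the function needs all spans of the trace, which is why the pipeline must buffer traces until the decision is made.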

Logs as Metrics

The cheapest log is the one you converted to a metric:

log: "checkout.completed user_id=abc total=42.50"
        ↓ pipeline extracts
metric: checkout_completed_total{} 1
        checkout_total_dollars 42.50

Now you can alert on metric thresholds without keeping the log. Most pipelines support this, for example via the count connector in the OpenTelemetry Collector or the log_to_metric transform in Vector.

A standard pattern: logs are kept for debugging (short retention); metrics are kept for trends (long retention). The pipeline does the conversion.
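
As a rough sketch of the conversion itself, the Python below extracts the two metrics from the example log line. It assumes the exact line format shown above and treats `checkout_total_dollars` as a running sum; the `update_metrics` function and the in-memory `counters` dict are hypothetical stand-ins for a real metrics sink.

```python
import re
from collections import defaultdict

# Matches the example line format; anything else is ignored.
LINE_RE = re.compile(
    r"^checkout\.completed user_id=(?P<user_id>\S+) total=(?P<total>\d+(?:\.\d+)?)$"
)


def update_metrics(line: str, counters: dict) -> bool:
    """Update counters from one log line; return True if the line matched."""
    m = LINE_RE.match(line)
    if m is None:
        return False
    counters["checkout_completed_total"] += 1
    counters["checkout_total_dollars"] += float(m.group("total"))
    # user_id is deliberately not used as a label, to avoid
    # one time series per user.
    return True


counters = defaultdict(float)
update_metrics("checkout.completed user_id=abc total=42.50", counters)
```

After this step the original line can be dropped or sent only to short-retention storage, since the alertable signal now lives in the counters.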

Learning Path

1. Getting Started

Deploy Vector and OpenTelemetry Collector locally; route logs through transforms; ship to multiple backends; observe with Prometheus

2. Patterns

Multi-backend routing, sampling strategies, PII masking, logs-to-metrics, edge vs aggregator, buffering, error budgets

3. Best Practices

Pipeline reliability, capacity, monitoring the pipeline, config management, common pitfalls, scaling

When You Don't Need a Pipeline (Yet)

Honest cases:

Small fleet, single backend, low volume. Direct SDK from app → Datadog is fine until ~$1k/month.
Single team, no compliance scoping. PII masking, SIEM routing, archive — these are scale concerns.
You're standing up your first observability stack. Get something working first; insert a pipeline once you understand what you're collecting.

But the moment any of these become true, add a pipeline:

Multiple backends to feed
Vendor cost concern
PII / compliance routing needs
Multiple shippers per host
Switching vendors becomes a project

The biggest pipeline mistake: writing complex transforms in the pipeline that should be done at the source. If your app emits structured JSON with the right fields, you barely need transformation. If your app emits "INFO 2024-01-15 ... checkout completed for user 123 total $42.50", the pipeline has to parse-regex-extract-rename, every event, forever. Push the structure upstream; let the pipeline focus on routing and sampling.
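
The difference is easy to see side by side. In this sketch, the structured event needs one `json.loads` call, while the free-text line needs a regex that must be kept in sync with the app's message format. The field names, the JSON shape, and the regex are assumptions; the elided middle of the sample line is matched loosely rather than guessed.

```python
import json
import re
from typing import Optional

# Fragile: breaks whenever the app rewords its log message.
UNSTRUCTURED_RE = re.compile(
    r"^(?P<level>[A-Z]+) (?P<date>\d{4}-\d{2}-\d{2}) .*?"
    r"checkout completed for user (?P<user_id>\d+) total \$(?P<total>\d+\.\d{2})$"
)


def parse_unstructured(line: str) -> Optional[dict]:
    """Parse, extract, and convert fields from a free-text line."""
    m = UNSTRUCTURED_RE.match(line)
    if m is None:
        return None
    fields = m.groupdict()
    fields["total"] = float(fields["total"])
    return fields


def parse_structured(line: str) -> dict:
    """The app already did the work; the pipeline only decodes."""
    return json.loads(line)
```

With structured input, the pipeline's per-event cost is a decode, and a reworded message no longer silently breaks downstream fields.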
