Best Practices

The operational realities of pipelines that handle terabytes per day.

Monitor the Pipeline

The most common failure: the pipeline is dropping data and no one knows. Required metrics on the pipeline itself:

Events received per source per minute — sudden drop = upstream broken
Events sent per sink per minute — sudden drop = downstream broken
Queue depth per sink — rising = downstream backpressure
Drop rate — anything > 0 is a problem worth investigating
CPU + memory per pipeline node — saturation = data loss imminent
End-to-end latency — time from event arrival to sink confirmation

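Vector exposes Prometheus metrics at :9598 by default once an internal_metrics source is wired to a prometheus_exporter sink; the OTel Collector serves its own metrics at :8888. Scrape both and alert on them.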
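To make the drop-rate metric concrete, here is a minimal Python sketch that parses two scrapes of Prometheus text output and computes the share of events dropped between them. The metric name vector_component_discarded_events_total is an assumption about what your Vector version exports; check the names on your own /metrics endpoint.

```python
from collections import defaultdict


def parse_counters(exposition_text):
    """Sum sample values per metric name from Prometheus text-format output."""
    totals = defaultdict(float)
    for raw in exposition_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # name{labels} value [timestamp]; label values may contain spaces
            name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1:]
        else:
            name, _, rest = line.partition(" ")
        fields = rest.split()
        if fields:
            totals[name] += float(fields[0])
    return dict(totals)


def drop_ratio(before, after,
               dropped_metric="vector_component_discarded_events_total",  # assumed name
               received_metric="vector_component_received_events_total"):
    """Fraction of received events that were dropped between two scrapes."""
    dropped = after.get(dropped_metric, 0.0) - before.get(dropped_metric, 0.0)
    received = after.get(received_metric, 0.0) - before.get(received_metric, 0.0)
    return dropped / received if received > 0 else 0.0
```

In practice Prometheus does this arithmetic for you with rate(); the sketch only shows what the alert expression is measuring.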

Sample alert rules:

# Drop rate > 0.1% over 5 min, per component
sum by (component_id) (rate(vector_component_discarded_events_total[5m])) / sum by (component_id) (rate(vector_component_received_events_total[5m])) > 0.001

# Queue depth growing for 10 min
delta(vector_buffer_events[10m]) > 1000 and vector_buffer_events > 5000

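# Sink error rate > 1% (errors per event sent), per sink
sum by (component_id) (rate(vector_component_errors_total{component_kind="sink"}[5m])) / sum by (component_id) (rate(vector_component_sent_events_total{component_kind="sink"}[5m])) > 0.01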

Capacity Planning

Sizing rules of thumb (logs):

Volume	Setup
< 1 GB/day	Edge-only, single agent per host
1-100 GB/day	Edge + 1 aggregator (2-4 cores, 8 GB RAM)
100 GB - 1 TB/day	Edge + 2-4 aggregators (4-8 cores each)
1-10 TB/day	Edge + aggregator cluster + Kafka in front
> 10 TB/day	Multi-tier with regional aggregators

Vector and Fluent Bit are very efficient: a single core handles roughly 50-100 MB/s. The OTel Collector is slower per core because it is more general (logs + metrics + traces); expect 10-30 MB/s per core.

Traces have a very different shape: they are high-cardinality, batch poorly, and are harder to sample correctly. Plan for them separately from logs.
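The per-core figures above can be turned into a rough lower bound on cores. This is a sketch under stated assumptions: decimal units (1 GB = 1000 MB), and a hypothetical burst_factor of 3 that you should replace with your own measured peak-to-average ratio.

```python
import math

SECONDS_PER_DAY = 86_400


def average_mb_per_s(gb_per_day):
    """Convert a daily volume in GB to an average rate in MB/s."""
    return gb_per_day * 1000 / SECONDS_PER_DAY


def cores_needed(gb_per_day, mb_per_s_per_core, burst_factor=3.0):
    """Cores required to absorb the burst rate at a given per-core throughput."""
    peak_mb_per_s = average_mb_per_s(gb_per_day) * burst_factor
    return max(1, math.ceil(peak_mb_per_s / mb_per_s_per_core))
```

For example, average_mb_per_s(1000) shows that 1 TB/day averages only about 11.6 MB/s. The result covers raw throughput only; the larger setups in the table presumably also budget for transforms, redundancy, and bursts.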

Config Management

The pipeline config is critical infrastructure. Treat it accordingly:

Source in Git. Every config change is a PR with diff and review.
Lint in CI. vector validate --config /etc/vector/vector.toml or otelcol validate --config /etc/otel/config.yaml.
Test in CI. Feed sample events through the pipeline binary in CI; assert sinks receive the expected output.
Stage before prod. New config to a staging pipeline first; only roll to prod once metrics confirm it works.
Canary deploys. Roll new config to one aggregator at a time; watch metrics; abort if errors rise.

A bad transform that drops every log is worse than no transform. Treat pipeline config like code.
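A CI test of this kind can be sketched as a small harness that pipes sample events into a command and parses what comes out. FAKE_PIPELINE is a stand-in written for this example, not a real tool; in CI you would pass the command line for your own pipeline binary, configured to read stdin and write JSON lines to stdout.

```python
import json
import subprocess
import sys


def run_pipeline(cmd, events, timeout=30):
    """Pipe events as JSON lines into cmd's stdin; return parsed JSON lines from stdout."""
    stdin_data = "".join(json.dumps(event) + "\n" for event in events)
    result = subprocess.run(
        cmd, input=stdin_data, capture_output=True,
        text=True, timeout=timeout, check=True,
    )
    return [json.loads(line) for line in result.stdout.splitlines() if line.strip()]


# Hypothetical stand-in for the pipeline binary: drops debug events, passes the rest.
FAKE_PIPELINE = [
    sys.executable, "-c",
    "import sys, json\n"
    "for line in sys.stdin:\n"
    "    event = json.loads(line)\n"
    "    if event.get('level') != 'debug':\n"
    "        print(json.dumps(event))\n",
]

if __name__ == "__main__":
    sample = [{"level": "info", "msg": "a"}, {"level": "debug", "msg": "b"}]
    received = run_pipeline(FAKE_PIPELINE, sample)
    # The assertion that catches a transform that silently drops everything.
    assert [e["msg"] for e in received] == ["a"], received
```

The useful property is that the test fails loudly when a transform drops or reshapes events, before the config reaches staging.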

Hot Reloads vs Restarts

Both Vector and the OTel Collector support config reload without a restart, but with caveats:

In-flight events: a reload typically waits for current batches to flush. Data that is already buffered may still be processed under the old config.
Buffer compatibility: changing disk buffer formats requires migration; in-memory buffers are dropped on reload.
External plugins may not survive reload.

For risky changes, restart is safer than reload — accept brief unavailability over inconsistent behavior.

Disk-Backed Buffers

Memory-only buffers lose data on restart. Disk buffers persist but are slower and need capacity planning:

[sinks.datadog]
buffer.type = "disk"
buffer.max_size = 10737418240  # 10 GiB per sink, in bytes
buffer.when_full = "block"

Size the buffer for your worst expected downstream outage. If Datadog can be down for 30 min and you ingest 100 MB/s, you need 30 × 60 × 100 MB = 180 GB of buffer to survive without loss. A more realistic choice is a smaller buffer with when_full = "block", accepting brief upstream backpressure.
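The sizing arithmetic is easy to script. A minimal sketch, assuming decimal units (1 MB = 1,000,000 bytes) and a steady ingest rate:

```python
def buffer_bytes_needed(outage_minutes, ingest_mb_per_s):
    """Bytes of disk buffer needed to absorb an outage of the given length."""
    return int(outage_minutes * 60 * ingest_mb_per_s * 1_000_000)


def minutes_covered(buffer_bytes, ingest_mb_per_s):
    """How long a buffer of the given size lasts at a steady ingest rate."""
    return buffer_bytes / (ingest_mb_per_s * 1_000_000) / 60
```

buffer_bytes_needed(30, 100) returns 180,000,000,000 bytes (180 GB). Running minutes_covered(10737418240, 100) on the 10 GiB buffer from the config above gives under two minutes at 100 MB/s, which is why the when_full behavior matters as much as the size.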

Security

The pipeline often holds everything — secrets in logs, request bodies, customer data. Protect it:

TLS on every hop (source → agent → aggregator → sink).
Authentication: mTLS between agents and aggregators; API tokens to backends.
Secrets (API keys, tokens) injected via environment / secrets manager — never in config files committed to Git.
RBAC on configs: who can change the pipeline (which routes get added/changed)?
PII masking as a default; opt-in to ship raw data only after legal review.
Audit logs of pipeline config changes.

A pipeline compromise is a privacy nightmare. Treat it like a database with everyone's data.
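Masking by default with an explicit opt-out can be sketched as follows. The two patterns (email addresses and bearer tokens) are illustrative assumptions, not a complete PII policy, and the allow_raw flag is a hypothetical stand-in for whatever approval mechanism your legal review produces.

```python
import re

# Illustrative patterns only; a real deployment needs a reviewed pattern set.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
BEARER_RE = re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/=-]+")


def mask_text(text):
    """Replace email addresses and bearer tokens with fixed placeholders."""
    text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    return BEARER_RE.sub("Bearer [REDACTED_TOKEN]", text)


def mask_event(event, allow_raw=False):
    """Mask every top-level string field unless raw shipping was explicitly approved."""
    if allow_raw:
        return event
    return {
        key: mask_text(value) if isinstance(value, str) else value
        for key, value in event.items()
    }
```

The sketch only looks at top-level string fields; nested objects would need a recursive walk.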

Pipeline as Vendor Switching Insurance

The strategic value of a pipeline: you can switch backends without re-deploying every service.

To preserve this:

Apps emit OTLP or structured JSON (vendor-neutral).
No vendor SDKs in app code (or thin wrappers easy to swap).
Pipeline does the vendor formatting at the sink.
Maintain ability to fan-out: send same data to current vendor + experimental backend during evaluation.

This is how you keep Datadog honest about pricing. They know you can dual-write to OpenSearch + Loki in a week.
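The fan-out idea, with vendor formatting applied only at the sink, can be sketched like this. The formatter functions and sink names are hypothetical; real sinks would make network calls where this example takes a plain callable.

```python
import json


def to_flat_json(event):
    """Hypothetical formatter: one JSON string per event."""
    return json.dumps(event, sort_keys=True)


def to_labeled_line(event):
    """Hypothetical formatter: a labels-plus-line shape."""
    return {
        "labels": {"service": event.get("service", "unknown")},
        "line": event.get("message", ""),
    }


def fan_out(event, sinks):
    """Send one vendor-neutral event to every sink; a failing sink does not block the rest.

    sinks maps a sink name to a (formatter, send) pair.
    """
    results = {}
    for name, (formatter, send) in sinks.items():
        try:
            send(formatter(event))
            results[name] = "ok"
        except Exception as exc:  # isolate sink failures from each other
            results[name] = f"error: {exc}"
    return results
```

Because the event stays vendor-neutral until the last step, adding an experimental backend means adding one entry to sinks rather than touching application code.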

Backpressure Strategy

The chain: app → agent → aggregator → backend. Each can stall. Decide upfront where data gets dropped:

Drop at agent: lose the most recent local data; the backend stays healthy. Best for tracing.
Drop at aggregator: lose data at the central tier; agents keep buffering. Best for archived data.
Block at agent: stalls the app. Almost always wrong (it degrades user experience for the sake of telemetry).
Block at aggregator: pushes backpressure onto agents, which buffer.

Default for logs: small agent buffers, large aggregator buffers, and drop at the aggregator if buffers fill.
Default for traces: drop at the agent (sampling is the real answer, with tail-based sampling at the aggregator).
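The difference between dropping and blocking comes down to what a full buffer does with the next event. A minimal in-memory sketch follows; the policy names mirror Vector's when_full values, but the class itself is illustrative and not how any real pipeline implements buffering.

```python
from collections import deque


class BoundedBuffer:
    """Fixed-capacity queue with a configurable full-buffer policy."""

    def __init__(self, capacity, when_full="drop_newest"):
        if when_full not in ("drop_newest", "block"):
            raise ValueError(f"unknown policy: {when_full}")
        self.capacity = capacity
        self.when_full = when_full
        self.items = deque()
        self.dropped = 0

    def offer(self, event):
        """Return True if accepted. On False the event was dropped (drop_newest)
        or the caller must hold it and retry, which is the backpressure (block)."""
        if len(self.items) < self.capacity:
            self.items.append(event)
            return True
        if self.when_full == "drop_newest":
            self.dropped += 1
        return False

    def drain(self, count):
        """Remove and return up to count events, oldest first."""
        out = []
        while self.items and len(out) < count:
            out.append(self.items.popleft())
        return out
```

With "block", a False return pushes the problem upstream to the caller; with "drop_newest", the loss is local and counted, which is what the drop-rate metric should surface.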

Test Replay

If you archive to S3, periodically replay a slice through the pipeline to confirm the archive is readable:

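# aws s3 cp streams one object at a time to stdout, so replay a single object (or loop over keys); assumes replay.toml reads from stdin
aws s3 cp s3://logs-archive/2026-05-01/OBJECT_KEY - | vector --config replay.toml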

An untested archive is an untested backup. Catch format, encoding, and compression issues early.
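A cheaper check than a full replay is to confirm that a sampled archive object decompresses and parses. This sketch assumes the archive holds gzip-compressed, newline-delimited JSON; adjust it if your archive uses another codec or framing.

```python
import gzip
import json
import zlib


def verify_archive_bytes(blob):
    """Return (parsed_line_count, bad_line_numbers) for a gzip blob of JSON lines.

    Raises ValueError if the blob cannot be decompressed or decoded at all.
    """
    try:
        text = gzip.decompress(blob).decode("utf-8")
    except (OSError, EOFError, zlib.error, UnicodeDecodeError) as exc:
        raise ValueError(f"archive unreadable: {exc}") from exc
    ok, bad = 0, []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        try:
            json.loads(line)
            ok += 1
        except json.JSONDecodeError:
            bad.append(lineno)
    return ok, bad
```

Run it on a random sample of objects on a schedule and alert if the bad-line list is non-empty or the function raises.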

Multi-Region

For globally distributed apps, run a pipeline per region:

EU apps → EU agent → EU aggregator → EU Datadog region (data residency)
US apps → US agent → US aggregator → US Datadog region

Cross-region forwarding is expensive (egress) and breaks data residency requirements. Keep the pipeline regional.

Cost Allocation

The pipeline often runs as platform infrastructure; cost should attribute to teams:

Tag every event with team based on source.
Aggregate per-team event count + bytes shipped per backend.
Showback monthly: "team payments shipped 2.3 TB last month to Datadog, ~$8k."
Feed the numbers into your FinOps process for cloud cost attribution.

This is what turns the pipeline from "platform overhead" into "self-serve cost-aware logging."
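The per-team aggregation can be sketched in a few lines. The team field name, the "unattributed" bucket, and the usd_per_gb rate are assumptions for the example; byte counts here use the serialized JSON size, which may differ from what a backend actually bills.

```python
import json
from collections import defaultdict


def showback(events, usd_per_gb):
    """Aggregate event count, bytes, and estimated cost per team."""
    usage = defaultdict(lambda: {"events": 0, "bytes": 0})
    for event in events:
        team = event.get("team", "unattributed")  # untagged events stay visible
        usage[team]["events"] += 1
        usage[team]["bytes"] += len(json.dumps(event).encode("utf-8"))
    return {
        team: {**totals, "usd": round(totals["bytes"] / 1e9 * usd_per_gb, 2)}
        for team, totals in usage.items()
    }
```

Keeping an explicit "unattributed" bucket makes tagging gaps show up in the report instead of silently landing on the platform team.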

Common Pitfalls

Underestimating bursts. Average is 100 MB/s; bursts hit 1 GB/s. If your pipeline is sized for the average, bursts drop data. Size for P99 with headroom.

Treating logs and traces the same. Logs are append-only streams; traces are assembled only after every span has completed. Routing both through the same processing chain works poorly, because they need different topologies.

Forgetting kernel limits. Linux's open-file-descriptor limit kills pipelines that watch many log files. Raise ulimit -n and fs.file-max.
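A quick preflight check for the per-process limit can be sketched with Python's standard resource module (Unix only). The reserve of 256 descriptors for sockets and internal files is a hypothetical allowance, not a measured figure.

```python
import resource


def fd_headroom(files_to_watch, reserve=256):
    """Compare the process's open-file soft limit with what the agent will need."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    needed = files_to_watch + reserve
    unlimited = soft == resource.RLIM_INFINITY
    return {
        "soft": soft,
        "hard": hard,
        "needed": needed,
        "ok": unlimited or soft >= needed,
    }
```

This reads the limits of the process that runs the check, so run it under the same user and service manager settings as the agent itself.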

Pipeline death by deep copy. Some transform graphs copy entire events unnecessarily. Profile transforms; structure them to mutate in place.

Vendor-specific features used as core. "Datadog magic field detection" — feels great, locks you in. Process at the pipeline, send structured data, don't depend on backend-side magic.

No pipeline-of-pipelines plan. As the org scales you'll end up with multiple pipelines (per team / per cloud region / for specific compliance). Plan for that early; don't build a single monolith you can't decompose.

Schema drift. App teams add fields without telling pipeline owners. Pipeline drops or mis-maps them. Document a contract: "what apps emit, what pipeline expects."
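A contract like this can be enforced with a small check in CI or at the pipeline edge. The field names in CONTRACT are hypothetical examples; the point is to report unknown fields rather than silently drop them.

```python
# Hypothetical contract: the fields apps agree to emit, with their expected types.
CONTRACT = {"timestamp": str, "level": str, "message": str, "service": str}


def check_event(event, contract=CONTRACT):
    """Report fields that are missing, mistyped, or not covered by the contract."""
    missing = sorted(key for key in contract if key not in event)
    wrong_type = sorted(
        key for key, expected in contract.items()
        if key in event and not isinstance(event[key], expected)
    )
    unknown = sorted(key for key in event if key not in contract)
    return {"missing": missing, "wrong_type": wrong_type, "unknown": unknown}
```

Counting unknown fields per service gives pipeline owners an early signal that an app team has changed what it emits.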

Checklist

What's Next

You have a pipeline practice. Connect it to:

Monitoring — pipeline metrics flow through Prometheus
Tracing — OpenTelemetry traces benefit most from tail sampling
ELK — Elasticsearch is a common sink behind a pipeline
FinOps — observability is often a top-5 cost line; pipelines are the primary lever
Secrets — pipeline API keys belong in Vault, not config files
