Best Practices
Pipeline reliability, capacity planning, monitoring the pipeline itself, config management, common pitfalls, scaling
The operational realities of pipelines that handle terabytes per day.
Monitor the Pipeline
The most common failure: the pipeline is dropping data and no one knows. Required metrics on the pipeline itself:
- Events received per source per minute — sudden drop = upstream broken
- Events sent per sink per minute — sudden drop = downstream broken
- Queue depth per sink — rising = downstream backpressure
- Drop rate — anything > 0 is a problem worth investigating
- CPU + memory per pipeline node — saturation = data loss imminent
- End-to-end latency — time from event arrival to sink confirmation
Vector exposes Prometheus metrics at :9598; OTel Collector at :8888. Scrape them and alert.
Sample alert rules:
# Drop rate > 0.1% in 5 min
sum(rate(vector_component_discarded_events_total[5m])) / sum(rate(vector_component_received_events_total{component_kind="source"}[5m])) > 0.001
# Queue depth growing for 10 min
delta(vector_buffer_events[10m]) > 1000 and vector_buffer_events > 5000
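# Sink failure rate > 1%
sum by (component_id) (rate(vector_component_errors_total{component_kind="sink"}[5m])) / sum by (component_id) (rate(vector_component_sent_events_total{component_kind="sink"}[5m])) > 0.01
Capacity Planning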
Sizing rules of thumb (logs):
| Volume | Setup |
|---|---|
| < 1 GB/day | Edge-only, single agent per host |
| 1-100 GB/day | Edge + 1 aggregator (2-4 cores, 8 GB RAM) |
| 100 GB - 1 TB/day | Edge + 2-4 aggregators (4-8 cores each) |
| 1-10 TB/day | Edge + aggregator cluster + Kafka in front |
| > 10 TB/day | Multi-tier with regional aggregators |
Vector and Fluent Bit are very efficient (a single core handles 50-100 MB/s). OTel Collector is slower per core because it's more general (logs + metrics + traces); expect 10-30 MB/s per core.
Traces have a very different shape: they are high cardinality, batch poorly, and are harder to sample correctly. Plan for them separately from logs.
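The per-core throughput figures above can be turned into a rough core count. The following is a minimal sketch, not a sizing tool: the function name `cores_needed` and the `burst_factor` parameter are assumptions made for illustration, and the result covers throughput only (no redundancy, no headroom for transforms).

```python
import math


def cores_needed(gb_per_day: float, mb_per_s_per_core: float,
                 burst_factor: float = 1.0) -> int:
    """Estimate cores for a daily log volume (hypothetical helper).

    gb_per_day        -- average daily volume in decimal GB
    mb_per_s_per_core -- sustained throughput of one core, e.g. 50 for Vector
    burst_factor      -- assumed peak-to-average ratio (1.0 = no bursts)
    """
    avg_mb_per_s = gb_per_day * 1000 / 86_400  # spread the day evenly
    peak_mb_per_s = avg_mb_per_s * burst_factor
    return max(1, math.ceil(peak_mb_per_s / mb_per_s_per_core))


# 1 TB/day at 50 MB/s per core fits on one core on average.
print(cores_needed(1000, 50))
# 10 TB/day with 10x bursts at 10 MB/s per core needs far more.
print(cores_needed(10_000, 10, burst_factor=10))
```

The gap between the two results is the reason the table adds aggregators well before average throughput alone would require them.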
Config Management
The pipeline config is critical infrastructure. Treat it accordingly:
- Source in Git. Every config change is a PR with diff and review.
- Lint in CI. `vector validate --config /etc/vector/vector.toml` or `otelcol validate --config /etc/otel/config.yaml`.
- Test in CI. Feed sample events through the pipeline binary in CI; assert sinks receive the expected output.
- Stage before prod. Deploy new config to a staging pipeline first; only roll to prod once metrics confirm it works.
- Canary deploys. Roll new config to one aggregator at a time; watch metrics; abort if errors rise.
A bad transform that drops every log is worse than no transform. Treat pipeline config like code.
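The "abort if errors rise" step of a canary deploy can be made explicit as a small decision function. This is only a sketch under stated assumptions: the name `should_abort_canary`, the thresholds (`max_ratio`, `floor`, `min_events`), and the idea of comparing the canary aggregator against the rest of the fleet are illustrative choices, not something the tools prescribe.

```python
def should_abort_canary(canary_errors: int, canary_sent: int,
                        baseline_errors: int, baseline_sent: int,
                        max_ratio: float = 2.0, floor: float = 0.001,
                        min_events: int = 1000) -> bool:
    """Return True when the canary's error rate is clearly worse.

    All thresholds are assumed defaults for illustration:
    max_ratio  -- tolerated multiple of the baseline error rate
    floor      -- absolute error rate always tolerated (avoids noise)
    min_events -- do not judge the canary on too little traffic
    """
    if canary_sent < min_events:
        return False  # not enough data yet; keep watching
    canary_rate = canary_errors / canary_sent
    baseline_rate = baseline_errors / baseline_sent if baseline_sent else 0.0
    return canary_rate > max(baseline_rate * max_ratio, floor)


# Canary at 0.5% errors against a 0.01% baseline: roll back.
print(should_abort_canary(50, 10_000, 10, 100_000))
```

The counters would come from the pipeline's own metrics, sampled over the same window for the canary node and the baseline nodes.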
Hot Reloads vs Restarts
Both Vector and OTel support config reload without restart, but with caveats:
- In-flight events: a reload typically waits for current batches to flush. Buffered data may stick to the old config.
- Buffer compatibility: changing disk buffer formats requires migration; in-memory buffers are dropped on reload.
- External plugins may not survive reload.
For risky changes, restart is safer than reload — accept brief unavailability over inconsistent behavior.
Disk-Backed Buffers
Memory-only buffers lose data on restart. Disk buffers persist but are slower and need capacity planning:
[sinks.datadog]
buffer.type = "disk"
buffer.max_size = 10737418240 # 10 GiB per sink
buffer.when_full = "block"
Size the buffer for your worst expected downstream outage. If Datadog can be down for 30 min and you ingest 100 MB/s, you need 180 GB of buffer to survive without loss. Realistic answer: pick a smaller buffer, set it to block, and accept brief upstream backpressure.
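The buffer arithmetic above is simple enough to check in a few lines. This is a minimal sketch; the helper names `buffer_bytes_needed` and `survivable_outage_minutes` are hypothetical, and both assume a constant ingest rate in decimal megabytes per second.

```python
def buffer_bytes_needed(ingest_mb_per_s: float, outage_minutes: float) -> int:
    """Bytes of buffer needed to absorb an outage at a constant ingest rate."""
    return int(ingest_mb_per_s * 1_000_000 * outage_minutes * 60)


def survivable_outage_minutes(buffer_bytes: int, ingest_mb_per_s: float) -> float:
    """Minutes of downstream outage a buffer can absorb before filling."""
    return buffer_bytes / (ingest_mb_per_s * 1_000_000) / 60


# 30 min outage at 100 MB/s: 180 GB, as in the text.
print(buffer_bytes_needed(100, 30))
# The 10 GiB buffer from the config lasts under two minutes at that rate.
print(round(survivable_outage_minutes(10 * 1024**3, 100), 2))
```

Running the second calculation for your own ingest rate shows how quickly a modest buffer fills, which is why the `when_full` policy matters as much as the size.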
Security
The pipeline often holds everything — secrets in logs, request bodies, customer data. Protect it:
- TLS on every hop (source → agent → aggregator → sink).
- Authentication: mTLS between agents and aggregators; API tokens to backends.
- Secrets (API keys, tokens) injected via environment / secrets manager — never in config files committed to Git.
- RBAC on configs: who can change the pipeline (which routes get added/changed)?
- PII masking as a default; opt-in to ship raw data only after legal review.
- Audit logs of pipeline config changes.
A pipeline compromise is a privacy nightmare. Treat it like a database with everyone's data.
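To make "PII masking as a default" concrete, here is a minimal sketch of a masking step with an explicit opt-in for raw data. Everything in it is an assumption for illustration: the function name `mask_event`, the `allow_raw` flag, and the choice to mask only email addresses with a simple regular expression. A real deployment would use the pipeline's own transform language and a broader set of patterns.

```python
import re

# Deliberately simple email pattern; real PII detection needs more than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_event(event: dict, allow_raw: bool = False) -> dict:
    """Return a copy of the event with emails masked in string fields.

    allow_raw is the opt-in: masking happens unless it is explicitly set.
    """
    if allow_raw:
        return dict(event)
    return {
        key: EMAIL_RE.sub("[REDACTED_EMAIL]", value) if isinstance(value, str) else value
        for key, value in event.items()
    }


print(mask_event({"msg": "login by alice@example.com", "status": 200}))
```

The design point is the default: a route has to ask for raw data, rather than a route having to remember to ask for masking.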
Pipeline as Vendor Switching Insurance
The strategic value of a pipeline: you can switch backends without re-deploying every service.
To preserve this:
- Apps emit OTLP or structured JSON (vendor-neutral).
- No vendor SDKs in app code (or thin wrappers easy to swap).
- Pipeline does the vendor formatting at the sink.
- Maintain ability to fan-out: send same data to current vendor + experimental backend during evaluation.
This is how you keep Datadog honest about pricing. They know you can dual-write to OpenSearch + Loki in a week.
Backpressure Strategy
The chain: app → agent → aggregator → backend. Each can stall. Decide upfront where data gets dropped:
- Drop at agent: lose recent local data; backend stays healthy. Best for tracing.
- Drop at aggregator: lose data at the central tier; agents keep buffering. Best for archived data.
- Block at agent: stalls the app. Almost always wrong (degrades user experience for telemetry).
- Block at aggregator: backpressures agents; agents buffer.
Default for logs: agent buffers small, aggregator buffers large, drop at aggregator if buffers fill. Default for traces: drop at agent (sampling is the answer; tail-based at aggregator).
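The difference between dropping and blocking at a hop can be sketched with a toy bounded buffer. This is an illustration only: the class name `BoundedBuffer` is hypothetical, and the two policy names simply mirror the `when_full` values `drop_newest` and `block` from Vector's buffer settings. A real pipeline applies the policy inside the agent or aggregator; here "block" is modelled as refusing the event so the caller has to hold it.

```python
from collections import deque


class BoundedBuffer:
    """Toy buffer showing drop-versus-block behaviour when full."""

    def __init__(self, capacity: int, when_full: str = "drop_newest"):
        if when_full not in ("drop_newest", "block"):
            raise ValueError(f"unknown policy: {when_full}")
        self.capacity = capacity
        self.when_full = when_full
        self.items = deque()
        self.dropped = 0

    def offer(self, event) -> bool:
        """Try to enqueue. False means dropped, or 'caller must wait' for block."""
        if len(self.items) < self.capacity:
            self.items.append(event)
            return True
        if self.when_full == "drop_newest":
            self.dropped += 1  # data loss happens here, at this hop
        return False  # with "block", the upstream hop keeps the event


aggregator = BoundedBuffer(capacity=2, when_full="drop_newest")
print([aggregator.offer(i) for i in range(3)], aggregator.dropped)
```

Choosing the policy per hop is the same decision as the list above: wherever `drop_newest` is configured is where data loss will occur during a stall.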
Test Replay
If you archive to S3, periodically replay a slice through the pipeline to confirm the archive is readable:
aws s3 cp s3://logs-archive/2026-05-01/ - | vector --config replay.toml
An untested archive is an untested backup. Catch format, encoding, and compression issues early.
Multi-Region
For globally distributed apps, run a pipeline per region:
EU apps → EU agent → EU aggregator → EU Datadog region (data residency)
US apps → US agent → US aggregator → US Datadog region
Cross-region forwarding is expensive (egress) and breaks data residency requirements. Keep the pipeline regional.
Cost Allocation
The pipeline often runs as platform infrastructure; cost should attribute to teams:
- Tag every event with `team` based on source.
- Aggregate per-team event count + bytes shipped per backend.
- Showback monthly: "team payments shipped 2.3 TB last month to Datadog, ~$8k."
- Attach to the FinOps process for cloud cost attribution.
This is what turns the pipeline from "platform overhead" into "self-serve cost-aware logging."
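The showback steps above amount to a group-by on the `team` tag. The sketch below assumes each event record carries a `team` field and a `bytes` size, and that the backend is billed at a flat rate per GB; the function name `showback`, the `unattributed` bucket, and the price are all made up for illustration.

```python
from collections import defaultdict


def showback(events: list, usd_per_gb: float) -> dict:
    """Aggregate event count, bytes, and estimated cost per team."""
    totals = defaultdict(lambda: {"events": 0, "bytes": 0})
    for event in events:
        # Untagged events are surfaced rather than silently ignored.
        team = event.get("team", "unattributed")
        totals[team]["events"] += 1
        totals[team]["bytes"] += event["bytes"]
    return {
        team: {**t, "usd": round(t["bytes"] / 1e9 * usd_per_gb, 2)}
        for team, t in totals.items()
    }


sample = [
    {"team": "payments", "bytes": 2_000_000_000},
    {"team": "payments", "bytes": 1_000_000_000},
    {"bytes": 500_000_000},  # missing team tag
]
print(showback(sample, usd_per_gb=3.5))
```

Keeping an explicit `unattributed` bucket makes gaps in tagging visible in the monthly report instead of hiding them in platform overhead.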
Common Pitfalls
Underestimating bursts. Average is 100 MB/s; bursts hit 1 GB/s. If your pipeline is sized for the average, bursts drop data. Size for P99 with headroom.
Treating logs and traces the same. Logs are append-only streams; traces are assembled after the request completes. Routing them through the same processing chain doesn't work; they need different topologies.
Forgetting kernel limits. Linux's open-file-descriptor limit kills pipelines that watch many log files. Raise ulimit -n and fs.file-max.
Pipeline death by deep copy. Some transform graphs copy entire events unnecessarily. Profile transforms; structure them to mutate in place.
Vendor-specific features used as core. "Datadog magic field detection" — feels great, locks you in. Process at the pipeline, send structured data, don't depend on backend-side magic.
No pipeline-of-pipelines plan. As the org scales you'll end up with multiple pipelines (per team / per cloud region / for specific compliance). Plan for that early; don't build a single monolith you can't decompose.
Schema drift. App teams add fields without telling pipeline owners. Pipeline drops or mis-maps them. Document a contract: "what apps emit, what pipeline expects."
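A documented contract is easier to keep when it can be checked mechanically. The following is a minimal sketch of such a check; the field names in `CONTRACT` and the function name `check_contract` are hypothetical examples, not a schema this document defines.

```python
# Hypothetical contract: field name -> expected Python type.
CONTRACT = {"timestamp": str, "service": str, "level": str, "message": str}


def check_contract(event: dict, contract: dict = CONTRACT) -> dict:
    """Report how an event deviates from the agreed schema."""
    return {
        # Fields the pipeline expects but the app did not send.
        "missing": sorted(f for f in contract if f not in event),
        # Fields present with a type the pipeline would mis-map.
        "wrong_type": sorted(
            f for f, t in contract.items()
            if f in event and not isinstance(event[f], t)
        ),
        # Fields the app added without updating the contract.
        "unexpected": sorted(f for f in event if f not in contract),
    }


drifted = {"timestamp": "2026-05-01T00:00:00Z", "service": "api",
           "level": 3, "user_id": "u1"}
print(check_contract(drifted))
```

Run against sample events in CI, a check like this turns silent drift into a failed build that both the app team and the pipeline owners see.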
Checklist
Observability pipeline production readiness:
- Config source in Git, reviewed via PR
- CI validates config syntax + sample event tests
- Pipeline staged before prod rollout
- Pipeline emits its own metrics (received/sent/dropped/queue)
- Alerts on drop rate, queue depth, sink errors
- Disk-backed buffers on critical sinks (or accepted RPO for data loss)
- TLS on all hops; secrets injected via env / secret manager
- PII masking on by default; opt-in for raw routes
- Multi-backend routing tested (failover scenario rehearsed)
- Capacity sized for P99 ingest, not average
- Cost attribution per team (showback ready)
- Replay from archive tested at least quarterly
- Regional pipelines for data residency where required
- Documented event schema contract between apps and pipeline
What's Next
You have a pipeline practice. Connect it to:
- Monitoring — pipeline metrics flow through Prometheus
- Tracing — OpenTelemetry traces benefit most from tail sampling
- ELK — Elasticsearch is a common sink behind a pipeline
- FinOps — observability is often a top-5 cost line; pipelines are the primary lever
- Secrets — pipeline API keys belong in Vault, not config files