Best Practices

Cardinality control, capacity planning, high availability, backups, query optimization, common pitfalls

The operational realities of running a TSDB at scale.

Cardinality Control

Runaway cardinality is the single most common failure mode. Control it from day one.

Define a tag policy. Document which fields can become tags and which can't, and enforce the policy in code review.

Measure active series. Most TSDBs expose this. Alert when it grows unexpectedly:

-- TimescaleDB-ish
SELECT COUNT(DISTINCT (service, endpoint, region)) FROM http_metrics
WHERE time > NOW() - INTERVAL '1 hour';

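# Prometheus: series per metric name (expensive on large servers)
count by (__name__)({__name__=~".+"})
# Prometheus: total series in the head block, cheap enough to alert on
prometheus_tsdb_head_series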

Bound user-controlled labels. If users can define metric labels (e.g., custom dimensions), validate against an allowlist or limit count.
Drop high-cardinality labels in ingest pipelines. Vector / Telegraf / Fluent Bit can strip labels before they reach the TSDB.
Have a kill switch. A bad deploy can multiply cardinality 100×; you need to drop labels at the agent layer quickly.
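
The allowlist-and-limit idea above can be sketched as a small filter that an ingest hook might run before a point is written. This is a minimal sketch, not any agent's actual API: `ALLOWED_LABELS`, `MAX_LABELS`, and `sanitize_labels` are hypothetical names chosen for illustration.

```python
# Hypothetical policy values; in practice they would come from the documented tag policy.
ALLOWED_LABELS = {"service", "endpoint", "region", "status"}
MAX_LABELS = 3


def sanitize_labels(labels, allowed=ALLOWED_LABELS, max_labels=MAX_LABELS):
    """Return (kept, dropped): labels that pass the policy and the names rejected.

    Labels outside the allowlist are dropped. If more than max_labels remain,
    the extras are dropped in sorted-name order so the result is deterministic.
    """
    permitted = sorted(name for name in labels if name in allowed)
    kept_names = permitted[:max_labels]
    kept = {name: labels[name] for name in kept_names}
    dropped = sorted(set(labels) - set(kept_names))
    return kept, dropped


if __name__ == "__main__":
    point = {"service": "checkout", "user_id": "u-1842", "region": "eu-west-1"}
    kept, dropped = sanitize_labels(point)
    print(kept, dropped)  # user_id is not on the allowlist, so it is dropped
```

Logging the `dropped` names is worth the extra line: it tells you which deploy started emitting the offending label.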

Capacity Planning

Sizing rules of thumb for a single-node deployment:

Component	Roughly drives
Active series count	RAM (indexes); each ~1-3 KB resident
Write rate	CPU + disk IOPS
Retention × points/series	Disk space
Concurrent query load	RAM (working sets) + CPU
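
The rules of thumb above translate into back-of-the-envelope arithmetic. The sketch below is only that: `estimate_capacity` is a hypothetical helper, and `bytes_per_point` in particular is an assumed figure that depends heavily on the engine and its compression, so calibrate every input against your own deployment.

```python
def estimate_capacity(active_series, bytes_per_series, scrape_interval_s,
                      retention_days, bytes_per_point):
    """Rough single-node estimate; every input is an assumption to calibrate."""
    points_per_series = retention_days * 86_400 / scrape_interval_s
    return {
        # Index RAM scales with the number of active series.
        "index_ram_bytes": active_series * bytes_per_series,
        # Write rate is one point per series per scrape interval.
        "write_rate_points_per_s": active_series / scrape_interval_s,
        # Disk is retention x points/series x size of a stored point.
        "disk_bytes": active_series * points_per_series * bytes_per_point,
    }


if __name__ == "__main__":
    example = estimate_capacity(
        active_series=2_000_000,
        bytes_per_series=2048,   # middle of the 1-3 KB range above
        scrape_interval_s=15,
        retention_days=30,
        bytes_per_point=2,       # assumed on-disk size after compression
    )
    for name, value in example.items():
        print(f"{name}: {value:,.0f}")
```

With these assumed inputs the estimate comes to about 4.1 GB of index RAM, roughly 133,000 points/s of writes, and about 691 GB of disk.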

Typical capacity per node (well-tuned):

TSDB	Active series	Write rate	Retention
Prometheus	1-10M	1M samples/s	15 days local
TimescaleDB	1-100M	1M rows/s	Years (compressed)
InfluxDB v3	10M+	1M+/s	Years
VictoriaMetrics	100M+	10M+/s	Years (cheap)

When you outgrow a node: shard by tenant/service, or move to a cluster (Cortex, Mimir, VictoriaMetrics cluster, InfluxDB Enterprise/Cloud, TimescaleDB Cloud multi-node).

High Availability

Strategies:

Replicate at ingest: dual-write from collectors (Prometheus remote_write to two backends, Telegraf with two outputs).
Replicate at storage: Postgres streaming replication for TimescaleDB; clustered modes in InfluxDB Enterprise / VictoriaMetrics.
Limit outage impact with storage tiers: keep the hot tier in one region and the downsampled cool/cold tiers in another.

Prometheus by itself isn't HA: deploy two instances scraping the same targets, and ship their data to a long-term store (Mimir or VictoriaMetrics via remote_write, Thanos via its sidecar). The dual instances handle "one node down"; the long-term store handles persistence.

For TimescaleDB:

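-- Standby with physical (streaming) replication
-- On the primary: create a replication slot so WAL is retained for the standby
SELECT pg_create_physical_replication_slot('standby1');
-- Then on the standby host, from a shell, clone the primary and start streaming:
--   pg_basebackup -h primary -D /var/lib/postgresql/data -X stream -R --slot=standby1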

Use Patroni or the TimescaleDB HA Helm chart for automated failover.

Backups

Time-series data is operational evidence. Lose it, lose your incident reconstruction. Strategy:

TimescaleDB: standard pg_dump / pg_basebackup / WAL archiving. Compressed chunks are dumped as compressed; fast.
InfluxDB: influx backup to a local directory, then copy the result to S3.
Prometheus: snapshots via the admin API (POST /api/v1/admin/tsdb/snapshot, which requires starting Prometheus with --web.enable-admin-api); a backup script then copies the snapshot directory to S3.

Test restores quarterly. Untested backups don't exist.
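
For the Prometheus case, the first half of such a backup script might look like the sketch below. It uses only the standard library; the helper names are hypothetical, the base URL and data directory are placeholders, and the upload step is left out because it depends on your object-store client.

```python
import json
import urllib.request
from pathlib import Path


def parse_snapshot_name(body):
    """Extract the snapshot name from the admin API's JSON response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError(f"snapshot failed: {payload}")
    return payload["data"]["name"]


def create_snapshot(base_url, timeout=60.0):
    """POST to the snapshot endpoint; needs --web.enable-admin-api on the server."""
    request = urllib.request.Request(
        f"{base_url}/api/v1/admin/tsdb/snapshot", data=b"", method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_snapshot_name(response.read())


def snapshot_dir(data_dir, name):
    """Snapshots are written under <data-dir>/snapshots/<name>."""
    return Path(data_dir) / "snapshots" / name


if __name__ == "__main__":
    # Placeholder URL and path; adjust to your deployment.
    name = create_snapshot("http://localhost:9090")
    print("copy this directory to object storage:", snapshot_dir("/prometheus", name))
```

Keeping the response parsing separate from the HTTP call makes the script testable without a running Prometheus, which helps when the restore drill is automated too.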

Query Optimization

The same query can be fast or terrible depending on how you write it.

Always include a time predicate. Without it, the query scans every chunk:

-- GOOD: chunk exclusion limits the scan to the chunk(s) covering the last hour
SELECT AVG(latency_ms) FROM http_metrics
WHERE time > NOW() - INTERVAL '1 hour' AND service = 'checkout';

-- BAD: scans every chunk
SELECT AVG(latency_ms) FROM http_metrics WHERE service = 'checkout';

Use continuous aggregates for ranges > 24h. Raw queries on long ranges are expensive.

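Keep the time predicate prunable. Most TSDBs partition by time first, so chunk pruning depends on a direct comparison against the time column. In SQL engines the textual order of predicates in WHERE does not change the plan, but wrapping the time column in a function or cast can prevent pruning.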

Avoid SELECT *. In columnar and compressed storage each column lives in its own block, so fewer columns means less I/O.

Pre-aggregate in app for tight loops. If you read 10M rows to sum them every second, summing in your service first reduces the TSDB load.
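
A minimal sketch of pre-aggregating in the service: raw samples are folded into one running sum and count per time bucket, and only the bucket rows are written to the TSDB. `PreAggregator` is a hypothetical class for illustration, and the write itself is omitted because it depends on your client library.

```python
from collections import defaultdict


class PreAggregator:
    """Collapse raw samples into one (sum, count) pair per time bucket."""

    def __init__(self, bucket_seconds=1):
        self.bucket_seconds = bucket_seconds
        self._buckets = defaultdict(lambda: [0.0, 0])

    def add(self, timestamp, value):
        # Align the timestamp to the start of its bucket.
        bucket_start = int(timestamp // self.bucket_seconds) * self.bucket_seconds
        accumulator = self._buckets[bucket_start]
        accumulator[0] += value
        accumulator[1] += 1

    def flush(self):
        """Return [(bucket_start, sum, count), ...] sorted by time, then reset."""
        rows = [(start, total, count)
                for start, (total, count) in sorted(self._buckets.items())]
        self._buckets.clear()
        return rows


if __name__ == "__main__":
    aggregator = PreAggregator(bucket_seconds=10)
    for timestamp, value in [(100.5, 2.0), (109.9, 3.0), (110.0, 5.0)]:
        aggregator.add(timestamp, value)
    print(aggregator.flush())  # one row per 10-second bucket instead of one per sample
```

Storing the count next to the sum keeps averages recoverable downstream, which a bare sum would not.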

Check execution plans. TimescaleDB has EXPLAIN ANALYZE. InfluxDB v3 supports EXPLAIN via SQL.

Monitoring the TSDB

The TSDB is monitoring everything else — who monitors it?

Key metrics to alert on:

Active series count — sudden growth is usually a label explosion
Ingest rate vs. expected — sudden drop = a collector is down
Query latency P99 — slow queries hurt dashboards
Disk usage and growth rate — TSDBs love disk
Compaction / chunk operations failing — usually a sign of disk pressure
Replication lag if HA

Prometheus exposes prometheus_tsdb_* metrics natively. InfluxDB has /health and /metrics. TimescaleDB exposes its state via pg_stat_* and the timescaledb_information views.
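
For the disk usage and growth rate item, alerting on the projected time until the disk fills is usually more useful than alerting on a fixed percentage. A minimal sketch, assuming linear growth; `days_until_full` is a hypothetical helper and the numbers in the example are made up.

```python
def days_until_full(used_bytes, capacity_bytes, growth_bytes_per_day):
    """Linear projection; returns None when usage is flat or shrinking."""
    if growth_bytes_per_day <= 0:
        return None
    remaining = max(capacity_bytes - used_bytes, 0)
    return remaining / growth_bytes_per_day


if __name__ == "__main__":
    # Made-up figures: 600 GB used of 1 TB, growing 50 GB per day.
    print(days_until_full(600e9, 1000e9, 50e9))
```

A threshold such as "page when fewer than 14 days remain" is an assumption to tune; it should be longer than the time it takes you to add disk or tighten retention.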

Security

Authentication and authorization: never expose a TSDB directly. Put it behind a gateway with auth, especially for write endpoints.
TLS for client-server and replication.
Don't put secrets in tag values. Tags are searchable, often logged, sometimes leaked in error messages.
Audit ingestion sources: a forgotten test agent writing to prod can blow up cardinality.

Common Pitfalls

Unbounded label growth. A new feature emits user_id as a label, active series grow 10×, ingest slows, and queries time out. Detection: dashboard on active series, alert on > 2× baseline.
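
The "> 2× baseline" rule can be sketched as a comparison against recent history. Using the median of past readings as the baseline is an assumption made here for robustness against one-off spikes; `series_growth_alert` is a hypothetical name, and the readings would come from whatever active-series query your TSDB exposes.

```python
from statistics import median


def series_growth_alert(history, current, factor=2.0):
    """True when current active series exceed factor x the median of history."""
    if not history:
        return False  # no baseline yet, so there is nothing to compare against
    return current > factor * median(history)


if __name__ == "__main__":
    recent = [100_000, 110_000, 105_000]  # made-up hourly readings
    print(series_growth_alert(recent, 1_000_000))  # a 10x jump trips the alert
    print(series_growth_alert(recent, 200_000))    # under 2x the median, no alert
```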

No downsampling. A year in, queries on dashboards take 30s. Adding continuous aggregates retroactively is doable but disruptive — build them from the start.

Mixing OLTP and TSDB workloads on TimescaleDB. A heavily updated transactional table next to a hot hypertable competes for the same I/O bandwidth. Use separate databases or instances if both are hot.

Late-arriving data into compressed chunks. Depending on the TimescaleDB version, such writes either fail or trigger expensive decompress-on-write. Keep recent chunks uncompressed.

Building dashboards on raw data over weeks. Slow now, unusable in 3 months. Aggregate.

No retention policy. Disk grows until pager fires. Set retention before you have data, not after.

Choosing by hype. "We need ClickHouse" / "We need Druid" — start with TimescaleDB or InfluxDB and grow into ClickHouse only when you've actually hit the limit. The migration to a more specialized system is much cheaper than the migration from one you didn't need.

Treating the TSDB as a write-only log. The point is fast reads; check that the queries you'll run are fast before committing.

What's Next

You have a working time-series stack. Connect it to:

Monitoring — Prometheus + Grafana sits naturally next to your TSDB
Tracing — high-cardinality data lives in traces, not metrics
ELK — logs complement metrics; both feed Grafana / OpenSearch
FinOps — observability is often a top-5 cost line; downsampling has a big bill impact
