Steven's Knowledge

Best Practices

Cardinality control, capacity planning, high availability, backups, query optimization, common pitfalls

The operational realities of running a TSDB at scale.

Cardinality Control

Cardinality is the single most common failure mode. Control it from day one.

  • Define a tag policy. Document which fields can become tags and which can't. Review it in code.

  • Measure active series. Most TSDBs expose this. Alert when it grows unexpectedly:

    -- TimescaleDB-ish
    SELECT COUNT(DISTINCT (service, endpoint, region)) FROM http_metrics
    WHERE time > NOW() - INTERVAL '1 hour';

    # Prometheus (series count per metric name)
    count by (__name__)({__name__=~".+"})

  • Bound user-controlled labels. If users can define metric labels (e.g., custom dimensions), validate them against an allowlist or cap the number of labels.

  • Drop high-cardinality labels in ingest pipelines. Vector / Telegraf / Fluent Bit can strip labels before they reach the TSDB.

  • Have a kill switch. A bad deploy can multiply cardinality 100×; you need to drop labels at the agent layer quickly.
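The allowlist, label cap, and kill switch from the bullets above can be sketched in a few lines. This is a minimal illustration, not any agent's actual API: `ALLOWED_LABELS`, `MAX_LABELS`, and `sanitize_labels` are hypothetical names, and in practice the same logic would usually live in the Vector / Telegraf / Fluent Bit configuration.

```python
from typing import Dict, FrozenSet, Set

# Hypothetical policy values; replace with your own documented tag policy.
ALLOWED_LABELS: Set[str] = {"service", "endpoint", "region", "status"}
MAX_LABELS = 8


def sanitize_labels(
    labels: Dict[str, str],
    allowed: Set[str] = ALLOWED_LABELS,
    max_labels: int = MAX_LABELS,
    kill_switch: FrozenSet[str] = frozenset(),
) -> Dict[str, str]:
    """Return a copy of `labels` without disallowed or kill-switched keys."""
    kept: Dict[str, str] = {}
    for key in sorted(labels):  # sorted so truncation is deterministic
        if key not in allowed or key in kill_switch:
            continue  # drop anything outside the policy
        if len(kept) >= max_labels:
            break  # enforce the label-count cap
        kept[key] = labels[key]
    return kept
```

Passing `kill_switch=frozenset({"region"})` drops an otherwise allowed label without a redeploy of the policy itself, which is the behaviour you want during a cardinality incident.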

Capacity Planning

Sizing rules of thumb for a single-node deployment:

Component | Roughly drives
Active series count | RAM (indexes); each ~1-3 KB resident
Write rate | CPU + disk IOPS
Retention × points/series | Disk space
Concurrent query load | RAM (working sets) + CPU
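The first and third rows reduce to simple arithmetic. The sketch below is a back-of-envelope estimate only: the function names are hypothetical, the 2 KB per series default is just the midpoint of the ~1-3 KB range above, and `bytes_per_point` is an assumption that depends heavily on the engine and its compression.

```python
def estimate_ram_bytes(active_series: int, bytes_per_series: int = 2048) -> int:
    """Index RAM: active series times resident bytes per series (~1-3 KB each)."""
    return active_series * bytes_per_series


def estimate_disk_bytes(
    active_series: int,
    sample_interval_s: float,
    retention_days: float,
    bytes_per_point: float = 2.0,  # assumed compressed size; measure your own
) -> int:
    """Disk: retention x points/series, multiplied across all series."""
    points_per_series = retention_days * 86_400 / sample_interval_s
    return int(active_series * points_per_series * bytes_per_point)
```

For example, 1M series sampled every 15 s and kept for 15 days gives 86,400 points per series, so roughly 173 GB at the assumed 2 bytes per point, plus about 2 GB of index RAM.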

Typical capacity per node (well-tuned):

TSDB | Active series | Write rate | Retention
Prometheus | 1-10M | 1M samples/s | 15 days local
TimescaleDB | 1-100M | 1M rows/s | Years (compressed)
InfluxDB v3 | 10M+ | 1M+/s | Years
VictoriaMetrics | 100M+ | 10M+/s | Years (cheap)

When you outgrow a node: shard by tenant/service, or move to a cluster (Cortex, Mimir, VictoriaMetrics cluster, InfluxDB Enterprise/Cloud, TimescaleDB Cloud multi-node).

High Availability

Strategies:

  • Replicate at ingest: dual-write from collectors (Prometheus remote_write to two backends, Telegraf to two backends).
  • Replicate at storage: Postgres streaming replication for TimescaleDB; clustered modes in InfluxDB Enterprise / VictoriaMetrics.
  • Manage outages with downsampling tiers: hot tier in one region, cool/cold in another.

Prometheus by itself isn't HA: deploy two instances scraping the same targets, and ship their data to a long-term store (remote_write to Mimir or VictoriaMetrics, or a Thanos sidecar). The dual instances handle "one node down"; the long-term store handles persistence.

For TimescaleDB:

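-- Standby with physical (streaming) replication.
-- On the primary, create a replication slot for the standby:
SELECT pg_create_physical_replication_slot('standby1');
-- Then on the standby host (shell command, not SQL):
-- pg_basebackup -h primary -D /var/lib/postgresql/data -X stream -R --slot=standby1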

Use Patroni or the TimescaleDB HA Helm chart for automated failover.

Backups

Time-series data is operational evidence. Lose it, lose your incident reconstruction. Strategy:

  • TimescaleDB: standard pg_dump / pg_basebackup / WAL archiving. Compressed chunks are dumped in compressed form, so dumps stay fast.
  • InfluxDB: influx backup to a local directory, then copy the result to S3.
  • Prometheus: snapshots via the admin API (POST /api/v1/admin/tsdb/snapshot, which requires --web.enable-admin-api); a backup script then copies the snapshot to S3.

Test restores quarterly. Untested backups don't exist.
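The Prometheus snapshot step can be scripted with the standard library alone. This is a sketch under assumptions: the helper names are hypothetical, the server must have been started with `--web.enable-admin-api`, and copying the resulting directory (under `snapshots/` in the Prometheus data directory) to S3 is left to whatever upload tool you already use.

```python
import json
import urllib.request


def build_snapshot_request(base_url: str) -> urllib.request.Request:
    """Build the POST request for the TSDB snapshot admin endpoint."""
    url = base_url.rstrip("/") + "/api/v1/admin/tsdb/snapshot"
    return urllib.request.Request(url, method="POST")


def snapshot_name(response_body: bytes) -> str:
    """Extract the snapshot directory name from the JSON response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"snapshot failed: {payload}")
    return payload["data"]["name"]


def take_snapshot(base_url: str, timeout_s: float = 60.0) -> str:
    """Trigger a snapshot and return its name (performs a network call)."""
    request = build_snapshot_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return snapshot_name(resp.read())
```

A backup job would call `take_snapshot("http://localhost:9090")` (the address is an example), upload the named directory, and then delete it locally so snapshots do not accumulate on the data disk.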

Query Optimization

The same query can be fast or terrible depending on how you write it.

Always include a time predicate. Without it, the query scans every chunk:

-- GOOD: chunk exclusion limits the scan to the most recent chunk(s)
SELECT AVG(latency_ms) FROM http_metrics
WHERE time > NOW() - INTERVAL '1 hour' AND service = 'checkout';

-- BAD: scans every chunk
SELECT AVG(latency_ms) FROM http_metrics WHERE service = 'checkout';

Use continuous aggregates for ranges > 24h. Raw queries on long ranges are expensive.
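To make concrete what a continuous aggregate stores, the sketch below buckets raw points by hour and keeps one average per bucket. It only illustrates the idea; TimescaleDB maintains such rollups incrementally inside the database, and `hourly_averages` is a hypothetical helper, not part of any library.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, Tuple


def hourly_averages(points: Iterable[Tuple[datetime, float]]) -> Dict[datetime, float]:
    """Collapse (timestamp, value) points into one mean per hour bucket."""
    sums = defaultdict(lambda: [0.0, 0])  # bucket start -> [running sum, count]
    for ts, value in points:
        bucket = ts.replace(minute=0, second=0, microsecond=0)
        acc = sums[bucket]
        acc[0] += value
        acc[1] += 1
    return {bucket: total / count for bucket, (total, count) in sorted(sums.items())}
```

A dashboard covering 30 days then reads 720 rows per series from the rollup instead of every raw sample, which is why long ranges should not hit raw data.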

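Make the time predicate usable for pruning. In SQL engines such as TimescaleDB the planner reorders WHERE clauses, so textual order does not matter; what matters is that the time filter compares the raw time column (not a function of it) so chunks can be excluded. In pipeline-style query languages, apply the time range first so later steps see less data.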

Avoid SELECT *. In columnar or compressed storage each column lives in its own block, so fewer columns means less I/O.

Pre-aggregate in app for tight loops. If you read 10M rows to sum them every second, summing in your service first reduces the TSDB load.
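One way to pre-aggregate is to keep a running sum in the service and emit a single point per window. The class below is a minimal single-threaded sketch; `PreAggregator` and its 10-second default window are assumptions for illustration, and the caller supplies the clock so the logic stays easy to test.

```python
from typing import Optional, Tuple


class PreAggregator:
    """Accumulate values in-process and emit one (start, sum, count) per window."""

    def __init__(self, flush_interval_s: float = 10.0) -> None:
        self.flush_interval_s = flush_interval_s
        self._sum = 0.0
        self._count = 0
        self._window_start: Optional[float] = None

    def add(self, value: float, now: float) -> Optional[Tuple[float, float, int]]:
        """Record a value; return the closed window if one just ended, else None."""
        flushed = None
        if self._window_start is None:
            self._window_start = now  # first value opens the first window
        elif now - self._window_start >= self.flush_interval_s:
            flushed = (self._window_start, self._sum, self._count)
            self._sum, self._count, self._window_start = 0.0, 0, now
        self._sum += value
        self._count += 1
        return flushed
```

The caller writes each returned tuple to the TSDB as one point, so readers query a handful of pre-summed rows instead of re-scanning millions of raw ones.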

Check execution plans. TimescaleDB has EXPLAIN ANALYZE. InfluxDB v3 supports EXPLAIN via SQL.

Monitoring the TSDB

The TSDB is monitoring everything else — who monitors it?

Key metrics to alert on:

  • Active series count — sudden growth is usually a label explosion
  • Ingest rate vs. expected — sudden drop = a collector is down
  • Query latency P99 — slow queries hurt dashboards
  • Disk usage and growth rate — TSDBs love disk
  • Compaction / chunk operations failing — usually a sign of disk pressure
  • Replication lag if HA

Prometheus exposes prometheus_tsdb_* metrics natively. InfluxDB has /health and /metrics. TimescaleDB exposes its state via pg_stat_* and the timescaledb_information views.
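For the disk usage and growth rate item, a linear projection is often enough to decide when to alert. This is a rough sketch that assumes growth stays constant; `days_until_full` is a hypothetical helper, and the inputs would come from whichever of the metrics above your TSDB exposes.

```python
from typing import Optional


def days_until_full(
    used_bytes: float, capacity_bytes: float, growth_bytes_per_day: float
) -> Optional[float]:
    """Days of headroom left at the current growth rate; None if not growing."""
    if growth_bytes_per_day <= 0:
        return None  # flat or shrinking usage never fills the disk
    remaining = max(capacity_bytes - used_bytes, 0)
    return remaining / growth_bytes_per_day
```

Alerting on "less than N days of headroom" tends to be more useful than a fixed percentage threshold, because it accounts for how fast the disk is actually filling.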

Security

  • Authentication and authorization: never expose a TSDB directly. Put it behind a gateway with auth, especially for write endpoints.
  • TLS for client-server and replication.
  • Don't put secrets in tag values. Tags are searchable, often logged, sometimes leaked in error messages.
  • Audit ingestion sources: a forgotten test agent writing to prod can blow up cardinality.

Common Pitfalls

Unbounded label growth. A new feature emits user_id as a label, active series grow 10×, ingest slows, and queries time out. Detection: dashboard on active series, alert on > 2× baseline.
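The "> 2× baseline" rule can be expressed directly. The sketch below assumes the baseline is the median of recent active-series samples, which is one reasonable choice rather than a standard; `is_cardinality_spike` is a hypothetical name.

```python
from statistics import median
from typing import Sequence


def is_cardinality_spike(
    current_series: int, history: Sequence[int], factor: float = 2.0
) -> bool:
    """True when the current active-series count exceeds factor x the baseline."""
    if not history:
        return False  # no baseline yet, so nothing to compare against
    baseline = median(history)  # median resists one-off outliers in history
    return current_series > factor * baseline
```

In Prometheus the same check would normally be an alerting rule over the active-series metric; the function form is shown only to make the comparison explicit.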

No downsampling. A year in, queries on dashboards take 30s. Adding continuous aggregates retroactively is doable but disruptive — build them from the start.

Mixing OLTP and TSDB workloads on TimescaleDB. A heavily updated transactional table and a hot hypertable share I/O bandwidth. Use separate databases or instances if both are hot.

Late-arriving data into compressed chunks. Writes either error out or trigger an expensive decompress-on-write. Keep recent chunks uncompressed.

Building dashboards on raw data over weeks. Slow now, unusable in 3 months. Aggregate.

No retention policy. Disk grows until the pager fires. Set retention before you have data, not after.

Choosing by hype. "We need ClickHouse" / "We need Druid" — start with TimescaleDB or InfluxDB and grow into ClickHouse only when you've actually hit the limit. The migration to a more specialized system is much cheaper than the migration from one you didn't need.

Treating the TSDB as a write-only log. The point is fast read; check that the queries you'll run are fast before committing.

Checklist

TSDB production readiness:

  • Schema reviewed; tag vs. field decisions documented
  • Cardinality budget defined; active series monitored and alerted
  • Continuous aggregates / recording rules for any dashboard range > 24h
  • Retention policy active on raw data
  • Compression policy active on warm data
  • Backup strategy in place and restore tested
  • HA strategy or accepted RPO/RTO
  • TLS + auth on all ingest and query endpoints
  • TSDB itself is monitored (active series, ingest, query latency, disk)
  • Kill switch for high-cardinality label drops at ingest
  • Capacity plan with projection for 12 months
  • Query patterns reviewed; slow queries indexed or aggregated

What's Next

You have a working time-series stack. Connect it to:

  • Monitoring — Prometheus + Grafana sits naturally next to your TSDB
  • Tracing — high-cardinality data lives in traces, not metrics
  • ELK — logs complement metrics; both feed Grafana / OpenSearch
  • FinOps — observability is often a top-5 cost line; downsampling has a big bill impact
