Best Practices

Production-ready Prometheus and Grafana - cardinality, recording rules, retention, federation, long-term storage

The stack is easy to stand up. Running it at scale, with alerts you trust and a budget you didn't blow, takes a few specific habits.

Cardinality, the One That Bites

A time series is a unique combination of metric name and labels. Prometheus indexes every series and keeps its own chunks of samples for it, so total series count is the dominant cost driver for:

RAM (head block)
Disk (compressed chunks)
Query latency

Common cardinality bombs:

Mistake	Impact
`user_id` as a label	One series per user — millions
`request_id` as a label	One series per request — explodes
Unbounded `path` (raw URL including IDs)	One series per URL
Unbounded `error_message`	One series per error variant

Rules of thumb:

A metric should have at most ~10 labels.
A single metric, across all of its label combinations, should stay under roughly 10,000 active series.
Use route templates (/users/:id), not raw paths.
For high-cardinality "what-happened" data, use logs or traces, not metrics.

Check cardinality before it's a problem:

# Top 20 metrics by series count
topk(20, count by (__name__) ({__name__=~".+"}))

# Series count for one specific metric
count(http_requests_total)
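Ad hoc queries only help when someone remembers to run them. One way to watch series growth continuously is an alert on Prometheus's own `prometheus_tsdb_head_series` gauge. This is a minimal sketch that assumes Prometheus scrapes itself; the file name, group name, alert name, severity label, and the 2 million threshold are all illustrative assumptions to tune against your instance's memory.

```yaml
# rules/cardinality.yml (hypothetical file name)
groups:
  - name: cardinality
    rules:
      - alert: PrometheusHighSeriesCount
        # prometheus_tsdb_head_series reports the number of series in the head block.
        # The 2e6 threshold is an assumption for illustration, not a recommendation.
        expr: prometheus_tsdb_head_series > 2e6
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus {{ $labels.instance }} holds more than 2M active series"
```

A warning-level alert like this gives you time to find the offending metric with the topk query before memory pressure becomes an outage.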

Recording Rules

Some queries are expensive: large sums over big histograms, joins, percentile calculations. Recording rules pre-compute them on every rule evaluation interval and store the result as a new metric. Dashboards and alerts then query the cheap pre-computed series.

# rules/recording.yml
groups:
  - name: precomputed
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      - record: job:http_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))

      - record: job:http_error_rate:ratio5m
        expr: job:http_errors:rate5m / job:http_requests:rate5m

      - record: job:http_p99:5m
        expr: |
          histogram_quantile(0.99,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

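Naming convention: <level>:<metric>:<operations>, where level is the aggregation level (the labels that remain), metric is the source metric name, and operations lists what was applied to it. Now your dashboards just query job:http_p99:5m directly, which is far cheaper than re-evaluating the histogram on every render.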
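A rule file does nothing until the main configuration loads it. A minimal sketch of the wiring, assuming the rule files live in a `rules/` directory next to `prometheus.yml` (the directory layout is an assumption):

```yaml
# prometheus.yml
global:
  # Default evaluation cadence for rule groups that do not set their own `interval`.
  evaluation_interval: 30s

rule_files:
  # Glob patterns are allowed; this picks up rules/recording.yml and its siblings.
  - rules/*.yml
```

Running `promtool check rules rules/recording.yml` before reloading catches syntax errors in the rule file without touching the running server.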

Retention and Storage

The default storage is local — fine for a single Prometheus, limited horizon (~weeks to a couple months).

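# Keep 30 days of data, or cap the TSDB at 100 GB, whichever limit is hit first.
# (Comments cannot follow a trailing backslash, so they live up here.)
prometheus \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=100GB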

Retention need	Solution
Days to weeks	Single Prometheus, local disk
Months	Bigger disk + recording rules to thin out resolution
Year+	Long-term storage: Thanos / Cortex / Mimir / VictoriaMetrics
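To choose between these tiers you need a rough disk estimate. The Prometheus documentation gives the rule of thumb disk = retention seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample after compression. The sketch below records the ingestion rate from Prometheus's own metrics so the estimate uses real numbers; the group name and the 100,000 samples/s figure in the worked example are assumptions.

```yaml
# rules/capacity.yml (hypothetical file name)
groups:
  - name: capacity
    rules:
      # Samples ingested per second, per Prometheus instance.
      - record: instance:prometheus_tsdb_head_samples_appended:rate5m
        expr: rate(prometheus_tsdb_head_samples_appended_total[5m])

# Worked example (assumed numbers):
#   retention        = 30 d = 2,592,000 s
#   ingestion rate   = 100,000 samples/s
#   bytes per sample = 2 (upper end of the 1-2 byte rule of thumb)
#   disk ≈ 2,592,000 × 100,000 × 2 = 518,400,000,000 bytes ≈ 518 GB
```

At that rate a 100 GB size cap would trim data well before the 30-day time limit is reached, which is worth knowing before you set both flags.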

Long-Term Storage: Thanos, Mimir, Cortex

For multi-year retention, multi-cluster federation, or horizontal scaling, layer one of these on:

System	Notes
Thanos	Sidecar pattern; ships TSDB blocks to S3/GCS; deduplication, downsampling
Grafana Mimir	Fork of Cortex maintained by Grafana Labs; horizontally scalable distributors + ingesters
Cortex	Original horizontally-scalable Prometheus backend
VictoriaMetrics	Single-binary or clustered; PromQL-compatible; very efficient

A typical Thanos setup:

Prometheus (per cluster, short retention)
   └─ Thanos Sidecar ──→ S3
                          │
                          ▼
                     Thanos Store ←── Thanos Query ←── Grafana

Local Prometheus stays fast for recent queries; long-term data lives in S3 and is served by Thanos Store.
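Two pieces of configuration make the sidecar pattern work: each Prometheus needs external labels that identify it uniquely, and the sidecar needs to know which bucket to upload to. A minimal sketch, shown as two YAML documents that would be saved as separate files; the label values, bucket name, and region are hypothetical, and credentials are assumed to come from the environment or an instance role.

```yaml
# prometheus.yml (per cluster): labels that identify this instance globally
global:
  external_labels:
    cluster: us-east-1        # hypothetical cluster name
    replica: a                # distinguishes HA replicas for deduplication
---
# bucket.yml: passed to the Thanos sidecar with --objstore.config-file
type: S3
config:
  bucket: example-thanos-metrics          # hypothetical bucket name
  endpoint: s3.us-east-1.amazonaws.com
  region: us-east-1
```

Thanos Store would read the same bucket configuration, which is what lets Thanos Query stitch recent data from the sidecars together with historical blocks from S3.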

Federation

When you have many Prometheus instances (one per region, one per cluster), a top-level Prometheus can pull selected aggregates from each:

# Top-level prometheus.yml
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 30s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"job:.+"}'           # only recording-rule outputs
        - '{__name__="up"}'                # exact match; no regex needed
    static_configs:
      - targets:
          - 'prom-us-east:9090'
          - 'prom-eu-west:9090'
          - 'prom-ap-south:9090'

Federate aggregates, not raw metrics — otherwise you're shipping everything to one box. Modern setups use Thanos/Mimir's global query layer instead of vanilla federation.

High Availability

Prometheus has no clustering. HA is two parallel Prometheus instances scraping the same targets and evaluating the same rules. Alertmanager deduplicates alerts that arrive with identical label sets, so you don't get duplicate pages.

alerting:
  alertmanagers:
    - static_configs:
        - targets: [alertmanager-1:9093, alertmanager-2:9093]

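Run 2-3 Alertmanagers as a cluster (each started with --cluster.peer flags pointing at the others); they gossip to synchronize silences and the notification log. Point every Prometheus at all of them rather than at a load balancer.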
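Deduplication depends on both replicas sending alerts with the same labels. If each replica carries a distinguishing external label, that label has to be stripped before alerts leave Prometheus. A minimal sketch of one replica's configuration, assuming the label is called `replica` (the label names and values are assumptions):

```yaml
# prometheus.yml on replica "a"; replica "b" is identical except for `replica: b`
global:
  external_labels:
    cluster: prod             # hypothetical cluster name
    replica: a

alerting:
  # Drop the replica label from outgoing alerts so both replicas send
  # identical label sets, which is what Alertmanager deduplicates on.
  alert_relabel_configs:
    - action: labeldrop
      regex: replica
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager-1:9093', 'alertmanager-2:9093']
```

The replica label still exists on the stored series, so a global query layer can use it to deduplicate data from the two replicas.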

Security

Concern	What to do
`/metrics` exposes internals	Restrict who can scrape; firewall it off
Prometheus UI is anonymous	Put it behind an auth proxy (OAuth2 Proxy, nginx + SSO)
Alertmanager webhook secrets	Keep `slack_api_url`, PagerDuty keys out of git; use a secret manager
Federation across teams	Federate only the recording-rule level, not raw metrics
TLS scrape targets	`scrape_configs.tls_config` supports certs and CA validation
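The TLS row above can be sketched as a scrape job that validates the target's certificate, presents a client certificate, and sends a bearer token read from a file. The job name, file paths, and target address are hypothetical; use only the parts your targets actually require.

```yaml
scrape_configs:
  - job_name: 'secure-app'                          # hypothetical job name
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/certs/ca.crt         # CA used to validate the target
      cert_file: /etc/prometheus/certs/client.crt   # client certificate for mutual TLS
      key_file: /etc/prometheus/certs/client.key
    authorization:
      type: Bearer
      # Reading the token from a file keeps the secret out of prometheus.yml and git.
      credentials_file: /etc/prometheus/secrets/scrape-token
    static_configs:
      - targets: ['app.internal.example:8443']      # hypothetical target
```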

SLOs and Error Budgets

Once metrics are reliable, build SLOs on top:

# SLO: 99.9% of requests succeed over 30 days
# This is a 30-day rolling error budget burn rate
- record: slo:http_errors:burn_rate30d
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[30d]))
    / sum(rate(http_requests_total[30d]))
    / 0.001

A burn rate of 1 consumes the error budget exactly over the 30-day window; anything above 1 exhausts it early. Alert on burn rate rather than on absolute thresholds.
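A 30-day window reacts far too slowly to page on. A common pattern, described in the Google SRE Workbook, pairs a long and a short window: for a 99.9% SLO, a burn rate of 14.4 sustained for one hour spends about 2% of the monthly budget, and the 5-minute window makes the alert clear quickly once the problem stops. A minimal sketch, where the group name, alert name, and severity label are assumptions:

```yaml
groups:
  - name: slo-burn-rate
    rules:
      - alert: ErrorBudgetFastBurn
        # Fires only when BOTH windows exceed 14.4x the allowed error ratio (0.001).
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
        labels:
          severity: page
        annotations:
          summary: "Burning the 30-day error budget at more than 14.4x"
```

In practice the two ratios would usually come from recording rules rather than being recomputed inside the alert expression.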

Tools that automate SLO calculations and multi-window burn-rate alerts:

Pyrra — SLO controller for Kubernetes
Sloth — generate Prometheus rules from SLO YAML
OpenSLO — vendor-neutral SLO spec

Operational Habits

A handful that pay off:

Alert on symptoms, not causes. "5xx rate > 1%" pages; "CPU > 90%" usually doesn't.
Every alert has a runbook. Set runbook_url in annotations.
Test alert rules in CI. promtool test rules.
Review dashboards quarterly. Delete the ones nobody opens.
Watch your cardinality. The topk query above should be on a dashboard.
Tag releases as annotations. "Did this start at the deploy?" should answer itself.
Run game days. Page yourself on purpose. Find out which alerts are noise.
One Prometheus per failure domain. Per cluster, per region. Federate or layer Thanos on top.
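The first three habits fit together in practice: a symptom-based alert, a runbook link in its annotations, and a unit test that CI runs with `promtool test rules`. A minimal sketch, shown as two YAML documents that would be saved as separate files. It reuses the `job:http_error_rate:ratio5m` recording rule from earlier; the file names, alert name, `job="api"` label, and runbook URL are hypothetical.

```yaml
# alerts.yml
groups:
  - name: symptoms
    rules:
      - alert: HighErrorRate
        expr: job:http_error_rate:ratio5m > 0.01    # symptom: more than 1% of requests fail
        for: 5m
        labels:
          severity: page
        annotations:
          runbook_url: https://runbooks.example.com/high-error-rate
---
# alerts_test.yml, run with: promtool test rules alerts_test.yml
rule_files:
  - alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Feed the recorded series directly: a constant 5% error ratio for 10 minutes.
      - series: 'job:http_error_rate:ratio5m{job="api"}'
        values: '0.05+0x10'
    alert_rule_test:
      # After 10 minutes the 5m `for` clause has elapsed, so the alert should be firing.
      - eval_time: 10m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: page
              job: api
            exp_annotations:
              runbook_url: https://runbooks.example.com/high-error-rate
```

Because the test also asserts on annotations, removing the runbook link from the rule fails CI, which enforces the "every alert has a runbook" habit mechanically.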
