Steven's Knowledge

Best Practices

Production-ready Prometheus and Grafana - cardinality, recording rules, retention, federation, long-term storage

The stack is easy to stand up. Running it at scale, with alerts you trust and a budget you didn't blow, takes a few specific habits.

Cardinality, the One That Bites

A time series is a unique combination of metric name + labels. Prometheus indexes and stores every series separately, so the number of active series is the dominant cost driver for:

  • RAM (head block)
  • Disk (compressed chunks)
  • Query latency

Common cardinality bombs:

Mistake                                   Impact
user_id as a label                        One series per user (millions)
request_id as a label                     One series per request (explodes)
Unbounded path (raw URL including IDs)    One series per URL
Unbounded error_message                   One series per error variant

Rules of thumb:

  • A metric should have at most ~10 labels.
  • A single metric, across all of its label combinations, should stay under ~10,000 active series.
  • Use route templates (/users/:id), not raw paths.
  • For high-cardinality "what-happened" data, use logs or traces, not metrics.

Check cardinality before it's a problem:

# Top 20 metrics by series count
topk(20, count by (__name__) ({__name__=~".+"}))

# Series count for one specific metric
count(http_requests_total)
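
When a label cannot be removed in the application right away, it can be dropped at scrape time as a stopgap. The fragment below is a minimal sketch: the job name, target, limit value, and the http_request_debug_.* metric family are assumptions for illustration, not part of any real setup.

```yaml
# prometheus.yml (fragment); names and limits are hypothetical
scrape_configs:
  - job_name: 'api'
    # Fail the whole scrape if a target exposes more samples than this,
    # so one bad deploy cannot flood the TSDB.
    sample_limit: 50000
    static_configs:
      - targets: ['api-1:8080']
    metric_relabel_configs:
      # Drop unbounded labels before the samples are ingested.
      - action: labeldrop
        regex: 'request_id|user_id'
      # Drop an entire metric family that is not worth its series count.
      - action: drop
        source_labels: [__name__]
        regex: 'http_request_debug_.*'
```

Dropping a label only works if the remaining labels still identify each series uniquely; otherwise the scrape produces duplicate series. Fixing the instrumentation is the durable solution.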

Recording Rules

Some queries are expensive: large sums over big histograms, joins, percentile calculations. Recording rules pre-compute them at every rule evaluation interval and store the result as a new series. Dashboards and alerts then query the cheap pre-computed series.

# rules/recording.yml
groups:
  - name: precomputed
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      - record: job:http_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))

      - record: job:http_error_rate:ratio5m
        expr: job:http_errors:rate5m / job:http_requests:rate5m

      - record: job:http_p99:5m
        expr: |
          histogram_quantile(0.99,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

Naming convention: <level>:<metric>:<operations>. Rules in a group are evaluated in order, so job:http_error_rate:ratio5m can build on the two rules above it. Dashboards can then query job:http_p99:5m directly, which is far cheaper than re-evaluating the histogram on every render.
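
Alerts can consume the recorded series in the same way dashboards do. A minimal sketch, assuming a separate rules/alerts.yml file; the alert name, the 1% threshold, and the runbook URL are hypothetical placeholders:

```yaml
# rules/alerts.yml (hypothetical file name)
groups:
  - name: http-alerts
    rules:
      - alert: HighErrorRate
        # Reads the series produced by the recording rule, not the raw counter.
        expr: job:http_error_rate:ratio5m > 0.01
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "5xx ratio above 1% for job {{ $labels.job }}"
          runbook_url: https://runbooks.example.com/high-error-rate
```

Both files need to be listed under rule_files in prometheus.yml for Prometheus to load them.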

Retention and Storage

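The default storage is the local TSDB, which keeps 15 days of data unless you change the retention. That is fine for a single Prometheus, but the practical horizon is limited (weeks to a couple of months).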

# Keep 30 days of data, capped at 100 GB; whichever limit is reached first applies.
prometheus \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=100GB

Retention need    Solution
Days to weeks     Single Prometheus, local disk
Months            Bigger disk + recording rules, so long-range queries read aggregated series
Year+             Long-term storage: Thanos / Cortex / Mimir / VictoriaMetrics

Long-Term Storage: Thanos, Mimir, Cortex

For multi-year retention, multi-cluster federation, or horizontal scaling, layer one of these on:

System             Notes
Thanos             Sidecar pattern; ships TSDB blocks to S3/GCS; deduplication, downsampling
Grafana Mimir      Cortex's successor; horizontally-scalable distributor + ingesters
Cortex             Original horizontally-scalable Prometheus backend
VictoriaMetrics    Single-binary or clustered; PromQL-compatible; very efficient

A typical Thanos setup:

Prometheus (per cluster, short retention)
   └─ Thanos Sidecar ──→ S3
                          ↑
                     Thanos Store ←── Thanos Query ←── Grafana

Local Prometheus stays fast for recent queries; long-term lives in S3 and is fronted by Thanos Store.
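
The sidecar and the store component both need to know where the bucket is. A minimal sketch of a Thanos object storage config file, assuming S3; the bucket name, endpoint, and region are placeholders, and credentials are assumed to come from the environment or an instance role rather than from this file:

```yaml
# bucket.yml (hypothetical file name), passed via --objstore.config-file
type: S3
config:
  bucket: "metrics-long-term"
  endpoint: "s3.us-east-1.amazonaws.com"
  region: "us-east-1"
```

Using the same file for every component that touches the bucket keeps uploads and reads pointed at the same place.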

Federation

When you have many Prometheus instances (one per region, one per cluster), a top-level Prometheus can pull selected aggregates from each:

# Top-level prometheus.yml
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 30s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"job:.+"}'           # only recording-rule outputs
        - '{__name__="up"}'                # exact match; no regex needed
    static_configs:
      - targets:
          - 'prom-us-east:9090'
          - 'prom-eu-west:9090'
          - 'prom-ap-south:9090'

Federate aggregates, not raw metrics; otherwise you're shipping everything to one box. Many newer setups use the global query layer of Thanos or Mimir instead of vanilla federation.

High Availability

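Prometheus has no built-in clustering. HA means running two identical Prometheus instances that scrape the same targets and evaluate the same rules. Both send their alerts to Alertmanager, which deduplicates them, so you don't get duplicate pages.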

alerting:
  alertmanagers:
    - static_configs:
        - targets: [alertmanager-1:9093, alertmanager-2:9093]

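Run 2-3 Alertmanagers as a cluster; they gossip silences and notification state to each other. Point every Prometheus at all of them directly rather than through a load balancer.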
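
Deduplication only works when both replicas send alerts with identical label sets. If each replica carries a distinguishing external label (as Thanos-style setups usually require), that label has to be removed from alerts before they are sent. A minimal sketch, where the cluster and replica label names and values are assumptions:

```yaml
# prometheus.yml (fragment) for replica A; replica B differs only in `replica`
global:
  external_labels:
    cluster: us-east
    replica: prom-a
alerting:
  # Strip the per-replica label so both replicas emit identical alerts.
  alert_relabel_configs:
    - action: labeldrop
      regex: replica
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager-1:9093', 'alertmanager-2:9093']
```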

Security

Concern                         What to do
/metrics exposes internals      Restrict who can scrape; firewall it off
Prometheus UI is anonymous      Put it behind an auth proxy (OAuth2 Proxy, nginx + SSO)
Alertmanager webhook secrets    Keep slack_api_url, PagerDuty keys out of git; use a secret manager
Federation across teams         Federate only the recording-rule level, not raw metrics
TLS scrape targets              scrape_configs.tls_config supports certs and CA validation
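
For the TLS row, a scrape job with CA validation, a client certificate, and a bearer token read from disk might look like the sketch below. The job name, target, and all file paths are hypothetical:

```yaml
# prometheus.yml (fragment); paths and names are placeholders
scrape_configs:
  - job_name: 'secure-app'
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/certs/ca.crt        # validates the target's certificate
      cert_file: /etc/prometheus/certs/client.crt  # client cert for mutual TLS
      key_file: /etc/prometheus/certs/client.key
    authorization:
      type: Bearer
      # Read the token from a file so it never lands in git.
      credentials_file: /etc/prometheus/secrets/scrape-token
    static_configs:
      - targets: ['app.internal:8443']
```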

SLOs and Error Budgets

Once metrics are reliable, build SLOs on top:

# SLO: 99.9% of requests succeed over 30 days
# Error budget = 1 - 0.999 = 0.001 (the final divisor below)
# Result: observed error ratio over 30 days, as a multiple of the budget
- record: slo:http_errors:burn_rate30d
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[30d]))
    / sum(rate(http_requests_total[30d]))
    / 0.001

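A burn rate above 1 means errors are consuming the budget faster than the SLO allows; measured over the full 30-day window, it means the budget is already spent. Alert on burn rate, ideally over shorter windows as well, rather than on absolute thresholds.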
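
A multi-window burn-rate alert can be sketched as follows. The 1h/5m window pair and the 14.4 factor follow the commonly cited multi-window approach from the Google SRE Workbook (14.4x is the rate that spends 2% of a 30-day budget in one hour); the alert name and runbook URL are hypothetical:

```yaml
groups:
  - name: slo-burn-rate
    rules:
      - alert: ErrorBudgetFastBurn
        # Fire only if both the long and the short window exceed 14.4x budget,
        # so the alert resets quickly once the problem stops.
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Error budget burning at more than 14.4x"
          runbook_url: https://runbooks.example.com/error-budget-burn
```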

Tools that automate SLO calculations and multi-window burn-rate alerts:

  • Pyrra — SLO controller for Kubernetes
  • Sloth — generate Prometheus rules from SLO YAML
  • OpenSLO — vendor-neutral SLO spec

Operational Habits

A handful that pay off:

  1. Alert on symptoms, not causes. "5xx rate > 1%" pages; "CPU > 90%" usually doesn't.
  2. Every alert has a runbook. Set runbook_url in annotations.
  3. Test alert rules in CI. promtool test rules.
  4. Review dashboards quarterly. Delete the ones nobody opens.
  5. Watch your cardinality. The topk query above should be on a dashboard.
  6. Tag releases as annotations. "Did this start at the deploy?" should answer itself.
  7. Run game days. Page yourself on purpose. Find out which alerts are noise.
  8. One Prometheus per failure domain. Per cluster, per region. Federation or Thanos sits above them.
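
Habits 2 and 3 can be sketched together: an alert rule that carries a runbook_url, and a promtool unit test that checks it fires. The file names, alert name, job and instance values, and runbook URL are all assumptions for illustration.

```yaml
# rules/availability.yml (hypothetical)
groups:
  - name: availability
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: page
        annotations:
          runbook_url: https://runbooks.example.com/instance-down
```

```yaml
# tests/availability_test.yml (hypothetical); run with: promtool test rules tests/availability_test.yml
rule_files:
  - ../rules/availability.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Target is up for two samples, then down from minute 2 onward.
      - series: 'up{job="api", instance="api-1:8080"}'
        values: '1 1 0 0 0 0 0 0 0 0'
    alert_rule_test:
      # Down since 2m plus a 5m `for` clause means firing by 7m.
      - eval_time: 7m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: page
              job: api
              instance: api-1:8080
            exp_annotations:
              runbook_url: https://runbooks.example.com/instance-down
```

The test asserts labels and annotations exactly, so a missing runbook_url fails CI instead of being discovered during an incident.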

Checklist

Pre-production Prometheus + Grafana checklist

  • Scrape interval ≤ 30s; retention set explicitly
  • HA: two parallel Prometheus instances, two Alertmanager instances
  • Long-term storage configured (Thanos / Mimir / VictoriaMetrics) for retention > a couple months
  • Cardinality monitored — topk query on a dashboard
  • Recording rules for every expensive dashboard / alert query
  • Alerts have a for: duration, a severity label, and a runbook_url annotation
  • up == 0 alert for every scrape job
  • promtool check rules and promtool test rules in CI
  • Dashboards version-controlled and provisioned from JSON
  • Release annotations posted from CI
  • Auth in front of Prometheus and Grafana UIs
  • Sensitive webhooks (Slack, PagerDuty) come from a secret manager
  • SLO and burn-rate alerts for top services
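
For the "provisioned from JSON" item, Grafana can load dashboards from a directory through a provisioning file. A minimal sketch, assuming the dashboard JSON files are checked out to /var/lib/grafana/dashboards; the provider name and path are placeholders:

```yaml
# /etc/grafana/provisioning/dashboards/default.yml (hypothetical path)
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    type: file
    disableDeletion: false
    allowUiUpdates: false          # keep git as the source of truth
    updateIntervalSeconds: 30      # how often Grafana rescans the directory
    options:
      path: /var/lib/grafana/dashboards
```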
