Best Practices
Production-ready Prometheus and Grafana - cardinality, recording rules, retention, federation, long-term storage
The stack is easy to stand up. Running it at scale, with alerts you trust and a budget you didn't blow, takes a few specific habits.
Cardinality, the One That Bites
A time series is a unique combination of metric name and label values. Prometheus tracks every series separately, in memory and on disk, so total series count is the dominant cost driver for:
- RAM (head block)
- Disk (compressed chunks)
- Query latency
Common cardinality bombs:
| Mistake | Impact |
|---|---|
| user_id as a label | One series per user; easily millions |
| request_id as a label | One series per request; explodes |
| Unbounded path (raw URL including IDs) | One series per URL |
| Unbounded error_message | One series per error variant |
Rules of thumb:
- A metric should have at most ~10 labels.
- Across all of its label combinations, a single metric should produce fewer than ~10,000 active series.
- Use route templates (/users/:id), not raw paths.
- For high-cardinality "what-happened" data, use logs or traces, not metrics.
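When a noisy label cannot be fixed at the source right away, one stopgap is to drop it at scrape time. The sketch below assumes a hypothetical job named api that exposes a request_id label and an unbounded metric called http_errors_by_message_total; none of these names come from this guide.

```yaml
scrape_configs:
  - job_name: 'api'                      # hypothetical job name
    static_configs:
      - targets: ['api:8080']            # hypothetical target
    # Fail the whole scrape if a target exposes more samples than this,
    # so one bad deploy cannot flood the TSDB.
    sample_limit: 50000
    metric_relabel_configs:
      # Remove the request_id label from every scraped series.
      - action: labeldrop
        regex: request_id
      # Drop an entire metric that is known to be unbounded.
      - source_labels: [__name__]
        regex: 'http_errors_by_message_total'   # hypothetical metric name
        action: drop
```

Dropping a label only works when the remaining labels still identify each series uniquely; otherwise the scrape contains duplicate series. Fixing the instrumentation remains the better long-term answer.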
Check cardinality before it's a problem:
# Top 20 metrics by series count
topk(20, count by (__name__) ({__name__=~".+"}))
# Series count for one specific metric
count(http_requests_total)

Recording Rules
Some queries are expensive: large sums over big histograms, joins, percentile calculations. Recording rules pre-compute them on a fixed evaluation interval and store the result as a new metric. Dashboards and alerts then query the cheap pre-computed series.
# rules/recording.yml
groups:
  - name: precomputed
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
      - record: job:http_error_rate:ratio5m
        expr: job:http_errors:rate5m / job:http_requests:rate5m
      - record: job:http_p99:5m
        expr: |
          histogram_quantile(0.99,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

Naming convention: <level>:<metric>:<operations>. Now your dashboards just query job:http_p99:5m directly, which is orders of magnitude cheaper than re-evaluating the histogram on every render.
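Alerts can read the same pre-computed series. As a minimal sketch, the rule below pages when the recorded error ratio stays above 1%; the group name, alert name, 10-minute hold, and runbook URL are illustrative assumptions.

```yaml
groups:
  - name: http-alerts                    # hypothetical group name
    rules:
      - alert: HighErrorRate             # hypothetical alert name
        # Reads the recording-rule output instead of re-aggregating raw series.
        expr: job:http_error_rate:ratio5m > 0.01
        for: 10m                         # must hold for 10 minutes before firing
        labels:
          severity: page
        annotations:
          summary: "5xx ratio above 1% for job {{ $labels.job }}"
          runbook_url: https://runbooks.example.com/high-error-rate   # hypothetical URL
```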
Retention and Storage
The default storage is the local TSDB: fine for a single Prometheus with a limited horizon (weeks to a couple of months). Retention defaults to 15 days unless you set it explicitly.
# Keep 30 days of data and cap the TSDB at 100 GB; whichever limit is hit first applies.
prometheus \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=100GB

| Retention need | Solution |
|---|---|
| Days to weeks | Single Prometheus, local disk |
| Months | Bigger disk + recording rules to thin out resolution |
| Year+ | Long-term storage: Thanos / Cortex / Mimir / VictoriaMetrics |
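To decide which row applies, it helps to estimate disk usage first. The Prometheus documentation gives a rough formula: retention in seconds × ingested samples per second × bytes per sample, with about 1 to 2 bytes per sample after compression. The sketch below records that estimate from Prometheus' own ingestion metric; the group name, rule names, 6-hour window, and the 2 bytes per sample figure are assumptions.

```yaml
groups:
  - name: capacity-planning              # hypothetical group name
    rules:
      # Samples ingested per second, averaged over the last 6 hours.
      - record: instance:prometheus_tsdb_samples_appended:rate6h
        expr: sum by (instance) (rate(prometheus_tsdb_head_samples_appended_total[6h]))
      # Rough bytes needed for 30 days of retention at 2 bytes per sample (assumed).
      - record: instance:prometheus_tsdb_disk_bytes:estimate30d
        expr: instance:prometheus_tsdb_samples_appended:rate6h * 30 * 86400 * 2
```

For example, at a hypothetical 100,000 samples per second, 30 days is 2,592,000 seconds, so the estimate is 100,000 × 2,592,000 × 2 bytes, roughly 518 GB.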
Long-Term Storage: Thanos, Mimir, Cortex
For multi-year retention, multi-cluster federation, or horizontal scaling, layer one of these on:
| System | Notes |
|---|---|
| Thanos | Sidecar pattern; ships TSDB blocks to S3/GCS; deduplication, downsampling |
| Grafana Mimir | Cortex's successor; horizontally-scalable distributor + ingesters |
| Cortex | Original horizontally-scalable Prometheus backend |
| VictoriaMetrics | Single-binary or clustered; PromQL-compatible; very efficient |
A typical Thanos setup:
Prometheus (per cluster, short retention)
  └─ Thanos Sidecar ──→ S3
                        │
                        ▼
                   Thanos Store ←── Thanos Query ←── Grafana

Local Prometheus stays fast for recent queries; long-term data lives in S3 and is fronted by Thanos Store.
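For this layout to work, each Prometheus needs external labels that identify it uniquely, and the sidecar and store need to know where the bucket is. A minimal sketch, assuming hypothetical cluster, replica, and bucket names:

```yaml
# prometheus.yml on each cluster's Prometheus
global:
  external_labels:
    cluster: us-east                     # hypothetical; must differ per cluster
    replica: prom-0                      # hypothetical; differs between HA replicas
---
# bucket.yml, passed to the sidecar and the store with --objstore.config-file
type: S3
config:
  bucket: thanos-metrics                 # hypothetical bucket name
  endpoint: s3.us-east-1.amazonaws.com
  region: us-east-1
```

Thanos Query can then deduplicate HA pairs when started with --query.replica-label=replica, matching the label name chosen above.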
Federation
When you have many Prometheus instances (one per region, one per cluster), a top-level Prometheus can pull selected aggregates from each:
# Top-level prometheus.yml
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 30s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"job:.+"}'   # only recording-rule outputs
        - '{__name__=~"up"}'
    static_configs:
      - targets:
          - 'prom-us-east:9090'
          - 'prom-eu-west:9090'
          - 'prom-ap-south:9090'

Federate aggregates, not raw metrics; otherwise you're shipping everything to one box. Modern setups often use the global query layer in Thanos or Mimir instead of vanilla federation.
High Availability
Prometheus has no clustering. HA means two Prometheus instances running in parallel and scraping the same targets. Both send alerts to Alertmanager, which dedupes them so you don't get duplicate pages.
alerting:
  alertmanagers:
    - static_configs:
        - targets: [alertmanager-1:9093, alertmanager-2:9093]

Run 2-3 Alertmanagers; they gossip and synchronize silences and notifications.
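Deduplication only works if both replicas send alerts with identical label sets. If the replicas carry a distinguishing external label, one way to handle it is to strip that label before alerts leave Prometheus. A sketch, assuming a hypothetical external label named replica:

```yaml
# prometheus.yml on replica "a"; replica "b" differs only in the replica value
global:
  external_labels:
    cluster: prod                        # hypothetical
    replica: a                           # hypothetical label that distinguishes the pair
alerting:
  alert_relabel_configs:
    # Remove the replica label so both instances emit identical alerts.
    - action: labeldrop
      regex: replica
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager-1:9093', 'alertmanager-2:9093']
```

Each Prometheus lists every Alertmanager instance directly rather than going through a load balancer, so an alert still arrives if one Alertmanager is down.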
Security
| Concern | What to do |
|---|---|
| /metrics exposes internals | Restrict who can scrape; firewall it off |
| Prometheus UI is anonymous | Put it behind an auth proxy (OAuth2 Proxy, nginx + SSO) |
| Alertmanager webhook secrets | Keep slack_api_url, PagerDuty keys out of git; use a secret manager |
| Federation across teams | Federate only the recording-rule level, not raw metrics |
| TLS scrape targets | scrape_configs.tls_config supports certs and CA validation |
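Two rows of the table translate directly into configuration. The first block is a sketch of a TLS scrape job with a bearer token read from a file; the second keeps the Slack webhook out of the Alertmanager config by reading it from a file. All job names, hostnames, and file paths are hypothetical.

```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'secure-app'               # hypothetical job name
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/certs/ca.crt          # CA used to validate the target
      cert_file: /etc/prometheus/certs/client.crt    # client certificate for mutual TLS
      key_file: /etc/prometheus/certs/client.key
    authorization:
      type: Bearer
      credentials_file: /etc/prometheus/secrets/scrape-token
    static_configs:
      - targets: ['app.internal.example:8443']
```

```yaml
# alertmanager.yml
global:
  # The webhook URL is read from a file mounted by the secret manager.
  slack_api_url_file: /etc/alertmanager/secrets/slack-webhook-url
route:
  receiver: team-slack
receivers:
  - name: team-slack                     # hypothetical receiver name
    slack_configs:
      - channel: '#alerts'
```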
SLOs and Error Budgets
Once metrics are reliable, build SLOs on top:
# SLO: 99.9% of requests succeed over 30 days
# This is a 30-day rolling error budget burn rate
# 0.001 is the error budget: 1 - 0.999
- record: slo:http_errors:burn_rate30d
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[30d]))
      / sum(rate(http_requests_total[30d]))
      / 0.001

burn_rate > 1 means you're burning the budget faster than it accumulates. Alert on that, not on absolute thresholds.
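A 30-day window reacts slowly to a sudden outage. A common refinement, described in Google's SRE Workbook, pairs a long window with a short one so the alert fires quickly and also resolves quickly. The sketch below assumes the same http_requests_total metric and 99.9% target; the threshold 14.4 corresponds to spending 2% of a 30-day budget in one hour (0.02 × 720 h / 1 h). The group name, alert name, and runbook URL are hypothetical.

```yaml
groups:
  - name: slo-burn-rate                  # hypothetical group name
    rules:
      - record: slo:http_errors:burn_rate1h
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
            / 0.001
      - record: slo:http_errors:burn_rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
            / 0.001
      - alert: ErrorBudgetFastBurn       # hypothetical alert name
        # Both windows must exceed the threshold: the 1h window shows the burn
        # is significant, the 5m window shows it is still happening.
        expr: slo:http_errors:burn_rate1h > 14.4 and slo:http_errors:burn_rate5m > 14.4
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Error budget burning at more than 14.4x the sustainable rate"
          runbook_url: https://runbooks.example.com/error-budget-burn   # hypothetical URL
```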
Tools that automate SLO calculations and multi-window burn-rate alerts:
- Pyrra — SLO controller for Kubernetes
- Sloth — generate Prometheus rules from SLO YAML
- OpenSLO — vendor-neutral SLO spec
Operational Habits
A handful that pay off:
- Alert on symptoms, not causes. "5xx rate > 1%" pages; "CPU > 90%" usually doesn't.
- Every alert has a runbook. Set runbook_url in annotations.
- Test alert rules in CI with promtool test rules.
- Review dashboards quarterly. Delete the ones nobody opens.
- Watch your cardinality. The topk query above should be on a dashboard.
- Tag releases as annotations. "Did this start at the deploy?" should answer itself.
- Run game days. Page yourself on purpose. Find out which alerts are noise.
- One Prometheus per failure domain. Per cluster, per region. Federate / Thanos above.
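Two of these habits, watching cardinality and attaching a runbook, can be combined in a single rule. The sketch below warns when the number of active series in the head block passes a limit; the threshold of 2 million, the alert name, and the runbook URL are assumptions to adjust for your own instance size.

```yaml
groups:
  - name: prometheus-self-monitoring     # hypothetical group name
    rules:
      - alert: PrometheusHighSeriesCount # hypothetical alert name
        # prometheus_tsdb_head_series is the number of active series in memory.
        expr: prometheus_tsdb_head_series > 2e6      # assumed threshold
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} holds {{ $value }} active series"
          runbook_url: https://runbooks.example.com/prometheus-cardinality   # hypothetical URL
```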
Checklist
Pre-production Prometheus + Grafana checklist
- Scrape interval ≤ 30s; retention set explicitly
- HA: two parallel Prometheus instances, two Alertmanager instances
- Long-term storage configured (Thanos / Mimir / VictoriaMetrics) for retention > a couple months
- Cardinality monitored: topk query on a dashboard
- Recording rules for every expensive dashboard / alert query
- Alerts have a for: duration, a severity label, and a runbook_url annotation
- up == 0 alert for every scrape job
- promtool check rules and promtool test rules in CI
- Dashboards version-controlled and provisioned from JSON
- Release annotations posted from CI
- Auth in front of Prometheus and Grafana UIs
- Sensitive webhooks (Slack, PagerDuty) come from a secret manager
- SLO and burn-rate alerts for top services
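Several checklist items (the up == 0 alert, the for:/severity/runbook_url fields, and promtool test rules in CI) fit into one small pair of files. This is a sketch under assumed names: the file names, job, instance, alert name, and runbook URL are all hypothetical.

```yaml
# alerts.yml
groups:
  - name: availability
    rules:
      - alert: TargetDown
        expr: up == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Target {{ $labels.instance }} of job {{ $labels.job }} is down"
          runbook_url: https://runbooks.example.com/target-down
```

```yaml
# alerts_test.yml, run in CI with: promtool test rules alerts_test.yml
rule_files:
  - alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Up for two minutes, then down for the rest of the test.
      - series: 'up{job="api", instance="api-1:8080"}'
        values: '1 1 0+0x10'
    alert_rule_test:
      # Down since minute 2, so the 5m hold has elapsed by minute 10.
      - eval_time: 10m
        alertname: TargetDown
        exp_alerts:
          - exp_labels:
              severity: page
              job: api
              instance: api-1:8080
            exp_annotations:
              summary: "Target api-1:8080 of job api is down"
              runbook_url: https://runbooks.example.com/target-down
```

Running promtool check rules alerts.yml in the same CI job catches syntax errors before the unit test runs.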