Alerting
Alert rules, Alertmanager routing, grouping, inhibition - building an alerting setup that pages humans only when needed
Two pieces work together:
- Prometheus evaluates alert rules on a schedule. When a rule's expression is true, an alert fires.
- Alertmanager receives firing alerts, deduplicates and groups them, and routes to receivers (Slack, PagerDuty, email...).
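The two pieces are wired together in prometheus.yml. A minimal sketch, assuming Alertmanager is reachable at alertmanager:9093 and the rules sit in alerts/application.yml (the hostname, port, and one-minute interval are illustrative assumptions, not values from a real deployment):

```yaml
# prometheus.yml (sketch; adjust hostnames and paths to your setup)
global:
  evaluation_interval: 1m        # how often alert rules are evaluated

rule_files:
  - "alerts/application.yml"     # rule files to load

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]   # assumed Alertmanager address
```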
Alert Rules
Rules live in YAML files referenced from the rule_files: section of prometheus.yml:
# alerts/application.yml
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate ({{ $value | humanizePercentage }})"
          description: "Error rate is above 5% for more than 5 minutes."
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
            > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency is {{ $value }}s (threshold: 2s)"
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping in {{ $labels.namespace }}"
  - name: infrastructure
    rules:
      - alert: HighMemoryUsage
        expr: |
          (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
            / node_memory_MemTotal_bytes > 0.90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Memory above 90% on {{ $labels.instance }}"
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})
            < 0.10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space below 10% on {{ $labels.instance }}"
      - alert: TargetDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Target {{ $labels.job }}/{{ $labels.instance }} is down"
The for: Clause is Crucial
for: 5m means "the expression must be true continuously for 5 minutes before firing." Without it, every brief spike pages you. Tune for: per alert based on:
- How long is "long enough to care"?
- How long can you tolerate not noticing?
- How noisy is the underlying metric?
A typical baseline: for: 2m for clear failures, for: 10-15m for warning-level latency/saturation alerts.
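As a rough illustration of that baseline, the sketch below pairs a clear failure with a short for: and a saturation warning with a longer one. The group name, alert names, and thresholds are hypothetical; only the for: values follow the baseline above.

```yaml
groups:
  - name: for-baseline-sketch          # hypothetical group name
    rules:
      # Clear failure: a short window is enough to rule out a single missed scrape
      - alert: TargetDownFast          # hypothetical alert name
        expr: up == 0
        for: 2m
        labels:
          severity: critical
      # Saturation warning: a longer window filters out brief spikes
      - alert: MemorySaturationSlow    # hypothetical alert name
        expr: |
          (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
            / node_memory_MemTotal_bytes > 0.90
        for: 15m
        labels:
          severity: warning
```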
Reload Rules Without Restart
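curl -X POST http://prometheus:9090/-/reload
The reload endpoint is only available when Prometheus is started with --web.enable-lifecycle; sending SIGHUP to the process works as well.
Alertmanager
Alertmanager turns firing alerts into messages someone will read. Its jobs: grouping, routing, inhibition, and silencing.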
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/xxx/xxx/xxx"
route:
  receiver: "default"
  group_by: ["alertname", "severity", "team"]   # same alert across hosts = one notification
  group_wait: 30s                                # wait briefly to gather more firings before paging
  group_interval: 5m                             # send updates every 5m if more alerts are added
  repeat_interval: 4h                            # re-notify for unresolved alerts every 4h
  routes:
    # Per-team routing: listed first with continue, so the severity routes below still apply
    - match: { team: data }
      receiver: "slack-data-team"
      continue: true
    # Critical → PagerDuty + Slack
    - match: { severity: critical }
      receiver: "pagerduty"
      continue: true
    - match: { severity: critical }
      receiver: "slack-critical"
    # Warnings → Slack only
    - match: { severity: warning }
      receiver: "slack-warning"
receivers:
  - name: "default"
    slack_configs:
      - channel: "#alerts"
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
  - name: "slack-critical"
    slack_configs:
      - channel: "#alerts-critical"
        title: '🚨 {{ .CommonAnnotations.summary }}'
  - name: "slack-warning"
    slack_configs:
      - channel: "#alerts-warning"
        title: '⚠️ {{ .CommonAnnotations.summary }}'
  - name: "slack-data-team"
    slack_configs:
      - channel: "#data-alerts"
  - name: "pagerduty"
    pagerduty_configs:
      - routing_key: "YOUR_PD_ROUTING_KEY"
        severity: critical
# Inhibition: when a critical fires, suppress warnings for the same target
inhibit_rules:
  - source_match: { severity: "critical" }
    target_match: { severity: "warning" }
    equal: ["alertname", "instance"]
Grouping
A bad day looks like 200 pods crashing and 200 separate pages. Grouping collapses them into one notification:
  group_by: ["alertname", "severity"]
  group_wait: 30s
- alertname: same alert (PodCrashLooping) groups together.
- group_wait: 30s: when the first alert in a new group fires, wait 30s before paging; usually the other 199 arrive in that window.
Choose your group_by keys to collapse the noise but keep ownership distinguishable.
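One way to keep ownership distinguishable is to override group_by on a child route. The sketch below assumes the data team's alerts carry a pipeline label; that label is hypothetical and stands in for whatever identifies the owning unit in your setup.

```yaml
route:
  receiver: "default"
  group_by: ["alertname", "severity", "team"]   # default grouping for everything
  routes:
    - match: { team: data }
      receiver: "slack-data-team"
      # Child routes may override grouping: one notification per alert and pipeline
      group_by: ["alertname", "pipeline"]       # "pipeline" is an assumed label
```

Routes without their own group_by inherit the parent's setting.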
Routing
Routes are a tree. Each route can match labels and forward to a receiver. With continue: true the alert keeps walking the tree, hitting more receivers. Without it, the first match wins.
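The tree shape is easiest to see with nested routes. In this sketch (the slack-platform receiver is hypothetical), an alert labeled team: platform enters the first branch; if it is also critical it goes to the nested route's receiver, otherwise it stays with the branch's own receiver.

```yaml
route:
  receiver: "default"                      # fallback when no child route matches
  routes:
    - match: { team: platform }
      receiver: "slack-platform"           # hypothetical receiver for the platform team
      routes:
        - match: { severity: critical }
          receiver: "pagerduty"            # critical platform alerts page on-call
    - match: { severity: warning }
      receiver: "slack-warning"            # reached only by alerts not matched above
```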
Common patterns:
| Goal | Approach |
|---|---|
| Critical pages on-call AND posts to Slack | Two routes matching severity: critical, first has continue: true |
| Team A's alerts go to Team A's channel | match: { team: A } route to their receiver |
| Office-hours-only warnings | mute_time_intervals on the route |
| Test alerts to a dev channel | match: { environment: dev } route |
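The office-hours pattern from the table could look roughly like this. It is a sketch that assumes office hours of 09:00 to 17:00 on weekdays; times are interpreted as UTC unless a location is configured, and the interval name is made up for the example.

```yaml
time_intervals:
  - name: outside-office-hours             # hypothetical interval name
    time_intervals:
      - weekdays: ["saturday", "sunday"]   # all weekend
      - weekdays: ["monday:friday"]        # weekday nights and early mornings
        times:
          - start_time: "00:00"
            end_time: "09:00"
          - start_time: "17:00"
            end_time: "24:00"

route:
  receiver: "default"
  routes:
    - match: { severity: warning }
      receiver: "slack-warning"
      # Notifications for this route are muted while any listed interval is active
      mute_time_intervals: ["outside-office-hours"]
```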
Inhibition
Inhibition suppresses lower-severity alerts when a higher-severity one is firing for the same target. Without it, a downed node generates one NodeDown and twenty derivative alerts.
inhibit_rules:
  - source_match: { alertname: "NodeDown" }
    target_match_re: { alertname: ".+" }
    equal: ["instance"]
Silencing
Silences are temporary mutes for known maintenance — created via the Alertmanager UI or amtool:
amtool --alertmanager.url=http://alertmanager:9093 silence add \
  alertname=HighLatency \
  instance=api-01 \
  --duration=2h \
  --comment="planned upgrade"
Designing Good Alerts
A few rules that prevent alert fatigue:
- Alert on symptoms, not causes. "Users see 5xx errors" is actionable; "memory above 80%" often isn't.
- Every alert must be actionable. If the runbook is "ignore it for 24h, it'll clear up," it's not an alert.
- Severity has meaning. Critical = wake someone up. Warning = read it tomorrow. Use them sparingly.
- Every alert links to a runbook. Add runbook_url in annotations; receivers can render it.
- Tune for: honestly. Too short = noise; too long = late detection. Iterate.
- Review alerts quarterly. Delete alerts that haven't fired. Tune ones that fired uselessly.
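For the runbook rule, a sketch of how the annotation might sit on the HighErrorRate alert from earlier. The URL is a placeholder, not a real runbook location.

```yaml
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate ({{ $value | humanizePercentage }})"
          # Placeholder URL: point this at wherever your runbooks actually live
          runbook_url: "https://runbooks.example.com/high-error-rate"
```

A receiver template can then reference it the same way the Slack receivers reference the summary, for example with {{ .CommonAnnotations.runbook_url }}.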
Testing Alerts
Don't ship rules without testing them. Two ways:
# Validate syntax (name the rule file explicitly so the check does not pick up tests.yml)
promtool check rules alerts/application.yml
# Unit tests for rules (run in CI)
promtool test rules alerts/tests.yml
# alerts/tests.yml
rule_files:
  - application.yml
tests:
  - interval: 1m
    input_series:
      # Both counters must keep increasing: 1 error/min against 9 successes/min is a 10% error ratio
      - series: 'http_requests_total{status="500"}'
        values: '0+1x15'
      - series: 'http_requests_total{status="200"}'
        values: '0+9x15'
    alert_rule_test:
      - eval_time: 10m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels: { severity: critical }
            exp_annotations:
              summary: "High error rate (10%)"
              description: "Error rate is above 5% for more than 5 minutes."
What's Next
You have alerts firing and getting routed. Now make them — and your day-to-day operating view — visible: dashboards → Grafana.