Alert rules, Alertmanager routing, grouping, inhibition - building an alerting setup that pages humans only when needed

Alerting

Two pieces work together:

Prometheus evaluates alert rules on a schedule. When a rule's expression is true, an alert fires.
Alertmanager receives firing alerts, deduplicates and groups them, and routes to receivers (Slack, PagerDuty, email...).

Alert Rules

Rules live in YAML files listed under rule_files: in prometheus.yml:

# alerts/application.yml
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate ({{ $value | humanizePercentage }})"
          description: "Error rate is above 5% for more than 5 minutes."

      - alert: HighLatency
        expr: |
          histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
          > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency is {{ $value }}s (threshold: 2s)"

      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping in {{ $labels.namespace }}"

  - name: infrastructure
    rules:
      - alert: HighMemoryUsage
        expr: |
          (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
          / node_memory_MemTotal_bytes > 0.90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Memory above 90% on {{ $labels.instance }}"

      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})
          < 0.10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space below 10% on {{ $labels.instance }}"

      - alert: TargetDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Target {{ $labels.job }}/{{ $labels.instance }} is down"
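
For these rules to be loaded and their alerts delivered, prometheus.yml has to list the rule file and point at an Alertmanager. A minimal sketch, assuming the file path from the comment at the top of the rules file; the alertmanager:9093 address and the 1m evaluation interval are assumptions to adjust for your setup:

```yaml
# prometheus.yml (fragment; address and interval are assumptions)
global:
  evaluation_interval: 1m        # how often alert rule expressions are evaluated

rule_files:
  - "alerts/application.yml"     # list rule files explicitly, not test files

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
```

Listing the file by name rather than with a glob keeps non-rule YAML in the same directory from being loaded as rules.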

The `for:` Clause is Crucial

for: 5m means "the expression must be true continuously for 5 minutes before firing." Without it, every brief spike pages you. Tune for: per alert based on:

How long is "long enough to care"?
How long can you afford to go without noticing?
How noisy is the underlying metric?

A typical baseline: for: 2m for clear failures, for: 10-15m for warning-level latency/saturation alerts.
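
As a rough illustration of that baseline, the sketch below pairs a short for: on a clear failure with a longer one on a noisy saturation signal. The group name, alert names, and the 90% CPU threshold are hypothetical, and node_cpu_seconds_total is assumed to come from node_exporter:

```yaml
groups:
  - name: for-tuning-example        # hypothetical group
    rules:
      # Clear failure: a scrape target is unreachable. Short for: window.
      - alert: InstanceUnreachable
        expr: up == 0
        for: 2m
        labels:
          severity: critical

      # Noisy saturation signal: CPU spikes are common, so wait longer.
      - alert: HighCpuSaturation
        expr: |
          1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
          > 0.90
        for: 15m
        labels:
          severity: warning
```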

Reload Rules Without Restart

# Requires Prometheus to be started with --web.enable-lifecycle
curl -X POST http://prometheus:9090/-/reload

Alertmanager

Alertmanager turns firing alerts into messages someone will read. Four jobs: grouping, routing, inhibition, and silencing.

# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/xxx/xxx/xxx"

route:
  receiver: "default"
  group_by: ["alertname", "severity", "team"]   # same alert across hosts = one notification
  group_wait: 30s                                # wait briefly to gather more firings before paging
  group_interval: 5m                             # send updates every 5m if more alerts are added
  repeat_interval: 4h                            # re-notify for unresolved alerts every 4h

  routes:
    # Per-team routing: listed first with continue: true so the
    # severity routes below still see the alert
    - match: { team: data }
      receiver: "slack-data-team"
      continue: true

    # Critical → PagerDuty + Slack
    - match: { severity: critical }
      receiver: "pagerduty"
      continue: true
    - match: { severity: critical }
      receiver: "slack-critical"

    # Warnings → Slack only
    - match: { severity: warning }
      receiver: "slack-warning"

receivers:
  - name: "default"
    slack_configs:
      - channel: "#alerts"
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'

  - name: "slack-critical"
    slack_configs:
      - channel: "#alerts-critical"
        title: '🚨 {{ .CommonAnnotations.summary }}'

  - name: "slack-warning"
    slack_configs:
      - channel: "#alerts-warning"
        title: '⚠️ {{ .CommonAnnotations.summary }}'

  - name: "slack-data-team"
    slack_configs:
      - channel: "#data-alerts"

  - name: "pagerduty"
    pagerduty_configs:
      - routing_key: "YOUR_PD_ROUTING_KEY"
        severity: critical

# Inhibition: when a critical fires, suppress warnings for the same target
inhibit_rules:
  - source_match: { severity: "critical" }
    target_match: { severity: "warning" }
    equal: ["alertname", "instance"]

Grouping

A bad day looks like 200 pods crashing and 200 separate pages. Grouping collapses them into one notification:

group_by: ["alertname", "severity"]
group_wait: 30s

alertname: same alert (PodCrashLooping) groups together.
group_wait: 30s: when the first alert in a new group fires, wait 30s before paging — usually the other 199 arrive in that window.

Choose your group_by keys to collapse the noise but keep ownership distinguishable.
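
One way to keep ownership visible is to override group_by on a child route. The fragment below is a sketch rather than a full config: it assumes the namespace label maps to an owning team and reuses the "default" receiver name for illustration.

```yaml
route:
  receiver: "default"
  group_by: ["alertname", "severity"]
  group_wait: 30s
  routes:
    # 200 crashing pods across 3 namespaces become 3 notifications,
    # one per namespace, instead of 1 undifferentiated page or 200 pages.
    - match: { alertname: PodCrashLooping }
      receiver: "default"
      group_by: ["alertname", "namespace"]
```

Child routes inherit grouping settings from their parent, so only the keys that differ need to be restated.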

Routing

Routes are a tree. Each route can match labels and forward to a receiver. With continue: true the alert keeps walking the tree, hitting more receivers. Without it, the first match wins.

Common patterns:

Goal	Approach
Critical pages on-call AND posts to Slack	Two routes matching `severity: critical`, first has `continue: true`
Team A's alerts go to Team A's channel	`match: { team: A }` route to their receiver
Office-hours-only warnings	`mute_time_intervals` on the route, covering nights and weekends
Test alerts to a dev channel	`match: { environment: dev }` route
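
The office-hours row can be sketched as follows. This assumes an Alertmanager version that supports the top-level time_intervals block (0.24 or later); the interval name and the 09:00 to 17:00 window are hypothetical, and times are interpreted as UTC by default:

```yaml
time_intervals:
  - name: outside-office-hours      # hypothetical name
    time_intervals:
      - weekdays: ["saturday", "sunday"]
      - weekdays: ["monday:friday"]
        times:
          - start_time: "00:00"
            end_time: "09:00"
          - start_time: "17:00"
            end_time: "24:00"

route:
  receiver: "default"
  routes:
    - match: { severity: warning }
      receiver: "slack-warning"
      # Notifications are held back while any listed interval is active.
      mute_time_intervals: ["outside-office-hours"]
```

Muting only suppresses notifications; the alerts still fire in Prometheus and remain visible in the Alertmanager UI.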

Inhibition

Inhibition suppresses lower-severity alerts when a higher-severity one is firing for the same target. Without it, a downed node generates one NodeDown and twenty derivative alerts.

inhibit_rules:
  - source_match: { alertname: "NodeDown" }
    target_match_re: { alertname: ".+" }
    equal: ["instance"]
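
Alertmanager 0.22 and later also accept a matchers-style syntax, and treat the older match/source_match fields as deprecated. A sketch of a similar rule in that form, here written so that the target side explicitly excludes NodeDown itself (a choice made for this illustration):

```yaml
inhibit_rules:
  - source_matchers:
      - alertname = "NodeDown"
    target_matchers:
      - alertname != "NodeDown"     # everything else on the same instance
    equal: ["instance"]
```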

Silencing

Silences are temporary mutes for known maintenance — created via the Alertmanager UI or amtool:

amtool --alertmanager.url=http://alertmanager:9093 silence add \
  alertname=HighLatency \
  instance=api-01 \
  --duration=2h \
  --comment="planned upgrade"

Designing Good Alerts

A few rules that prevent alert fatigue:

Alert on symptoms, not causes. "Users see 5xx errors" is actionable; "memory above 80%" often isn't.
Every alert must be actionable. If the runbook is "ignore it for 24h, it'll clear up," it's not an alert.
Severity has meaning. Critical = wake someone up. Warning = read it tomorrow. Use them sparingly.
Every alert links to a runbook. Add runbook_url in annotations; receivers can render it.
Tune for: honestly. Too short = noise; too long = late detection. Iterate.
Review alerts quarterly. Delete alerts that haven't fired. Tune ones that fired uselessly.
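
The runbook rule can be sketched in two parts: an annotation on the alert and a receiver template that renders it. The runbook URL is a placeholder, and the Slack text layout is only one possible choice:

```yaml
# Rule file: add a runbook_url annotation (placeholder URL)
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate ({{ $value | humanizePercentage }})"
          description: "Error rate is above 5% for more than 5 minutes."
          runbook_url: "https://runbooks.example.com/high-error-rate"
---
# alertmanager.yml fragment: show the link in the Slack message
receivers:
  - name: "slack-critical"
    slack_configs:
      - channel: "#alerts-critical"
        title: '{{ .CommonAnnotations.summary }}'
        text: |
          {{ .CommonAnnotations.description }}
          Runbook: {{ .CommonAnnotations.runbook_url }}
```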

Testing Alerts

Don't ship rules without testing them. Two ways:

# Validate syntax (name rule files explicitly; alerts/*.yml would also match tests.yml)
promtool check rules alerts/application.yml

# Unit tests for rules (run in CI)
promtool test rules alerts/tests.yml

# alerts/tests.yml
rule_files:
  - application.yml

tests:
  - interval: 1m
    input_series:
      # Counters: 1 error/min against 9 successes/min gives a 10% error ratio
      - series: 'http_requests_total{status="500"}'
        values: '0+1x15'
      - series: 'http_requests_total{status="200"}'
        values: '0+9x15'

    alert_rule_test:
      - eval_time: 10m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels: { severity: critical }
            exp_annotations:
              summary: "High error rate (10%)"
              description: "Error rate is above 5% for more than 5 minutes."
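
It is also worth asserting that the alert does not fire before the for: window has elapsed. A hedged sketch of one more entry for the alert_rule_test list above, assuming promtool compares exp_alerts against firing alerts only, so a pending alert satisfies an empty list:

```yaml
      # At 3m the expression has been true for less than 5m,
      # so the alert should still be pending and nothing should fire.
      - eval_time: 3m
        alertname: HighErrorRate
        exp_alerts: []
```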

What's Next

You have alerts firing and getting routed. Now make them — and your day-to-day operating view — visible: dashboards → Grafana.
