Steven's Knowledge

Alerting

Alert rules, Alertmanager routing, grouping, inhibition - building an alerting setup that pages humans only when needed

Two pieces work together:

  • Prometheus evaluates alert rules on a schedule. When a rule's expression is true, an alert fires.
  • Alertmanager receives firing alerts, deduplicates and groups them, and routes to receivers (Slack, PagerDuty, email...).
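The wiring between the two pieces lives in prometheus.yml. A minimal sketch, assuming Alertmanager is reachable at alertmanager:9093 and the rule file sits at alerts/application.yml; the 30s evaluation interval is an illustrative choice, not a recommendation:

```yaml
# prometheus.yml (sketch; hostnames, paths, and intervals are assumptions)
global:
  evaluation_interval: 30s          # how often alert rule expressions are evaluated

rule_files:
  - "alerts/application.yml"        # listed explicitly so test files in alerts/ are not loaded as rules

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]   # where firing alerts are sent
```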

Alert Rules

Rules live in YAML files referenced from the rule_files: section of prometheus.yml:

# alerts/application.yml
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate ({{ $value | humanizePercentage }})"
          description: "Error rate is above 5% for more than 5 minutes."

      - alert: HighLatency
        expr: |
          histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
          > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency is {{ $value }}s (threshold: 2s)"

      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping in {{ $labels.namespace }}"

  - name: infrastructure
    rules:
      - alert: HighMemoryUsage
        expr: |
          (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
          / node_memory_MemTotal_bytes > 0.90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Memory above 90% on {{ $labels.instance }}"

      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})
          < 0.10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space below 10% on {{ $labels.instance }}"

      - alert: TargetDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Target {{ $labels.job }}/{{ $labels.instance }} is down"

The for: Clause is Crucial

for: 5m means "the expression must be true continuously for 5 minutes before firing." Without it, every brief spike pages you. Tune for: per alert based on:

  • How long is "long enough to care"?
  • How much detection delay can you tolerate?
  • How noisy is the underlying metric?

A typical baseline: for: 2m for clear failures, for: 10-15m for warning-level latency/saturation alerts.

Reload Rules Without Restart

# Requires Prometheus to be started with --web.enable-lifecycle
curl -X POST http://prometheus:9090/-/reload

Alertmanager

Alertmanager turns firing alerts into messages someone will read. Three jobs: grouping, routing, silencing.

# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/xxx/xxx/xxx"

route:
  receiver: "default"
  group_by: ["alertname", "severity", "team"]   # same alert across hosts = one notification
  group_wait: 30s                                # wait briefly to gather more firings before paging
  group_interval: 5m                             # send updates every 5m if more alerts are added
  repeat_interval: 4h                            # re-notify for unresolved alerts every 4h

  routes:
    # Per-team routing: listed first, with continue: true, so the severity routes below still apply
    - match: { team: data }
      receiver: "slack-data-team"
      continue: true

    # Critical → PagerDuty + Slack
    - match: { severity: critical }
      receiver: "pagerduty"
      continue: true
    - match: { severity: critical }
      receiver: "slack-critical"

    # Warnings → Slack only
    - match: { severity: warning }
      receiver: "slack-warning"

receivers:
  - name: "default"
    slack_configs:
      - channel: "#alerts"
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'

  - name: "slack-critical"
    slack_configs:
      - channel: "#alerts-critical"
        title: '🚨 {{ .CommonAnnotations.summary }}'

  - name: "slack-warning"
    slack_configs:
      - channel: "#alerts-warning"
        title: '⚠️ {{ .CommonAnnotations.summary }}'

  - name: "slack-data-team"
    slack_configs:
      - channel: "#data-alerts"

  - name: "pagerduty"
    pagerduty_configs:
      - routing_key: "YOUR_PD_ROUTING_KEY"
        severity: critical

# Inhibition: when a critical fires, suppress warnings for the same target
inhibit_rules:
  - source_match: { severity: "critical" }
    target_match: { severity: "warning" }
    equal: ["alertname", "instance"]

Grouping

A bad day looks like 200 pods crashing and 200 separate pages. Grouping collapses them into one notification:

group_by: ["alertname", "severity"]
group_wait: 30s
  • alertname: same alert (PodCrashLooping) groups together.
  • group_wait: 30s: when the first alert in a new group fires, wait 30s before paging — usually the other 199 arrive in that window.

Choose your group_by keys to collapse the noise but keep ownership distinguishable.
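One way to keep ownership distinguishable is to override group_by on a child route. The sketch below assumes the PodCrashLooping alert carries a namespace label (as its summary annotation suggests) and that namespaces map roughly to owning teams; that mapping is an assumption, not something the setup above guarantees.

```yaml
# alertmanager.yml (excerpt, illustrative)
route:
  receiver: "default"
  group_by: ["alertname", "severity"]
  group_wait: 30s
  routes:
    # Child routes inherit the parent's settings and may override them.
    - match: { alertname: PodCrashLooping }
      receiver: "default"
      group_by: ["alertname", "namespace"]   # one notification per namespace, not one per pod
```

With this grouping, 200 crashing pods spread over three namespaces would produce three notifications rather than one undifferentiated page or 200 separate ones.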

Routing

Routes are a tree. Each route can match labels and forward to a receiver. With continue: true the alert keeps walking the tree, hitting more receivers. Without it, the first match wins.
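A nested sketch of the tree, assuming a hypothetical "slack-platform" receiver next to the receivers defined earlier:

```yaml
# alertmanager.yml (excerpt, illustrative; "slack-platform" is a hypothetical receiver)
route:
  receiver: "default"
  routes:
    - match: { team: platform }
      receiver: "slack-platform"
      routes:
        # Critical platform alerts page on-call, then keep walking to the next sibling.
        - match: { severity: critical }
          receiver: "pagerduty"
          continue: true
        - match: { severity: critical }
          receiver: "slack-platform"

    - match: { severity: warning }
      receiver: "slack-warning"
```

A critical platform alert reaches both pagerduty and slack-platform. A platform warning matches none of the child routes, so it is handled by the parent's receiver (slack-platform); because the platform route has no continue: true, it never reaches the top-level warning route.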

Common patterns:

Goal                                        Approach
Critical pages on-call AND posts to Slack   Two routes matching severity: critical; the first has continue: true
Team A's alerts go to Team A's channel      match: { team: A } route to their receiver
Office-hours-only warnings                  mute_time_intervals on the route
Test alerts to a dev channel                match: { environment: dev } route
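The last two patterns can be sketched as follows. The interval name, the 09:00-17:00 office hours, and the "slack-dev" receiver are assumptions for illustration; the top-level time_intervals key needs a reasonably recent Alertmanager release, and times are read as UTC unless a location is configured.

```yaml
# alertmanager.yml (excerpt, illustrative)
route:
  receiver: "default"
  routes:
    # Test alerts go to a dev channel and stop there.
    - match: { environment: dev }
      receiver: "slack-dev"                 # hypothetical receiver

    # Warnings notify only during office hours.
    - match: { severity: warning }
      receiver: "slack-warning"
      mute_time_intervals: ["outside-office-hours"]

time_intervals:
  - name: outside-office-hours             # hypothetical name
    time_intervals:
      - weekdays: ["saturday", "sunday"]
      - weekdays: ["monday:friday"]
        times:
          - start_time: "00:00"
            end_time: "09:00"
          - start_time: "17:00"
            end_time: "24:00"
```

Muting only suppresses notifications; the alerts still fire and remain visible in the Alertmanager UI.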

Inhibition

Inhibition suppresses lower-severity alerts when a higher-severity one is firing for the same target. Without it, a downed node generates one NodeDown and twenty derivative alerts.

inhibit_rules:
  - source_match: { alertname: "NodeDown" }
    target_match_re: { alertname: ".+" }
    equal: ["instance"]

Silencing

Silences are temporary mutes for known maintenance — created via the Alertmanager UI or amtool:

amtool --alertmanager.url=http://alertmanager:9093 silence add \
  alertname=HighLatency \
  instance=api-01 \
  --duration=2h \
  --comment="planned upgrade"

Designing Good Alerts

A few rules that prevent alert fatigue:

  1. Alert on symptoms, not causes. "Users see 5xx errors" is actionable; "memory above 80%" often isn't.
  2. Every alert must be actionable. If the runbook is "ignore it for 24h, it'll clear up," it's not an alert.
  3. Severity has meaning. Critical = wake someone up. Warning = read it tomorrow. Use them sparingly.
  4. Every alert links to a runbook. Add runbook_url in annotations; receivers can render it.
  5. Tune for: honestly. Too short = noise; too long = late detection. Iterate.
  6. Review alerts quarterly. Delete alerts that haven't fired. Tune ones that fired uselessly.
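Rule 4 might look like this in practice. The runbook URL is a placeholder, and the Slack text template is only one possible way to render it:

```yaml
# alerts/application.yml (excerpt)
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate ({{ $value | humanizePercentage }})"
          description: "Error rate is above 5% for more than 5 minutes."
          runbook_url: "https://runbooks.example.com/high-error-rate"   # placeholder URL
---
# alertmanager.yml (excerpt)
receivers:
  - name: "slack-critical"
    slack_configs:
      - channel: "#alerts-critical"
        title: '🚨 {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }} Runbook: {{ .CommonAnnotations.runbook_url }}'
```

CommonAnnotations only contains annotations whose values are identical across every alert in the group, so the link renders reliably when each alert name maps to a single runbook.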

Testing Alerts

Don't ship rules without testing them. Two ways:

# Validate syntax (point it at rule files only; the test file is not a rule file)
promtool check rules alerts/application.yml

# Unit tests for rules (run in CI)
promtool test rules alerts/tests.yml

# alerts/tests.yml
rule_files:
  - application.yml

tests:
  - interval: 1m
    input_series:
      # Counters must keep increasing for rate() to be non-zero.
      - series: 'http_requests_total{status="500"}'
        values: '0+10x15'     # +10 errors per minute
      - series: 'http_requests_total{status="200"}'
        values: '0+90x15'     # +90 successes per minute, so errors are 10% of traffic

    alert_rule_test:
      - eval_time: 10m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels: { severity: critical }
            exp_annotations:
              summary: "High error rate (10%)"
              description: "Error rate is above 5% for more than 5 minutes."
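
Unit tests are also a convenient way to pin down for: behaviour. The sketch below is a separate, hypothetical test file for the TargetDown rule; it assumes the rule is defined in application.yml as shown earlier and checks that the alert is still pending 3 minutes after the target goes down and firing after 8 minutes.

```yaml
# alerts/target_down_test.yml (hypothetical file name)
rule_files:
  - application.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Up for the first two samples (0m, 1m), down from 2m onward.
      - series: 'up{job="api", instance="api-01"}'
        values: '1 1 0 0 0 0 0 0 0 0 0'

    alert_rule_test:
      # Down for only 3 minutes: for: 5m has not elapsed, so nothing is firing.
      - eval_time: 5m
        alertname: TargetDown
        exp_alerts: []

      # Down for 8 minutes: the alert is firing.
      - eval_time: 10m
        alertname: TargetDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: api
              instance: api-01
            exp_annotations:
              summary: "Target api/api-01 is down"
```

It can be run the same way as the other test file, for example with promtool test rules alerts/target_down_test.yml.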

What's Next

You have alerts firing and getting routed. Now make them — and your day-to-day operating view — visible: dashboards → Grafana.
