Steven's Knowledge

Grafana Dashboards

Data sources, panels, dashboard design - building dashboards people actually look at

Prometheus has its own UI for ad-hoc queries. Grafana is for dashboards — the persistent views your team checks daily, the screens on the wall during an incident.

Data Source Configuration

Provision the data source at startup so nobody has to click through setup:

# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      timeInterval: "15s"          # matches your scrape_interval

For long-term storage (Thanos, Cortex, Mimir), add another data source pointing at it.

Anatomy of a Dashboard

A Grafana dashboard is a JSON document containing panels, each with one or more queries and a visualization. The relevant pieces:

  • Variables — drop-downs at the top (environment, service, pod) that get templated into every query.
  • Time range — global; every panel respects it.
  • Refresh — how often panels re-query (5s for incident dashboards, 1m for overview).
  • Panels — graphs, stats, gauges, tables, heatmaps.
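
To make those pieces concrete, here is a minimal sketch of a dashboard model built as a Python dict and serialized to JSON. The field names follow the dashboard JSON model as commonly exported; the helper name `build_dashboard`, the `uid`, and the chosen time range are illustrative assumptions, and a real export carries many more fields.

```python
import json


def build_dashboard(title: str, uid: str) -> dict:
    """Return a minimal dashboard model with one variable and one panel."""
    return {
        "uid": uid,
        "title": title,
        # Time range: global, applied to every panel.
        "time": {"from": "now-6h", "to": "now"},
        # Refresh: how often panels re-run their queries.
        "refresh": "1m",
        # Variables: rendered as drop-downs at the top of the dashboard.
        "templating": {
            "list": [
                {
                    "name": "service",
                    "type": "query",
                    "query": "label_values(up, service)",
                }
            ]
        },
        # Panels: each has a visualization type and one or more queries.
        "panels": [
            {
                "type": "timeseries",
                "title": "Request rate",
                "gridPos": {"x": 0, "y": 0, "w": 12, "h": 8},
                "targets": [
                    {
                        "refId": "A",
                        "expr": 'sum(rate(http_requests_total{service="$service"}[5m]))',
                    }
                ],
            }
        ],
    }


dashboard_json = json.dumps(build_dashboard("Service Overview", "service-overview"), indent=2)
```

Writing `dashboard_json` to a file gives something close to what the UI exports, which is a reasonable starting point for reading a real export.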

Panel Types Cheat Sheet

| Panel | When to use |
|---|---|
| Time series | Most metrics: rate, latency, gauge over time |
| Stat | Single big number: current error rate, total RPS |
| Gauge | Bounded value with thresholds: CPU %, disk % |
| Bar gauge | Compare a current value across labels: RPS per service |
| Heatmap | Histograms over time: latency distribution |
| Table | Listing: top 10 noisy pods, current alerts |
| Logs | Loki integration; render log lines |
| State timeline | Discrete states over time: up/down |

A "Golden Signals" Dashboard

Build one of these for every service. PromQL for each panel:

| Panel | Query | Visualization |
|---|---|---|
| Request rate | sum(rate(http_requests_total[5m])) | Time series |
| Error rate (%) | sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100 | Stat (with thresholds) |
| P50 latency | histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) | Time series |
| P99 latency | histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) | Time series |
| Saturation (CPU) | avg(rate(container_cpu_usage_seconds_total{pod=~"$service-.*"}[5m])) | Gauge |
| Saturation (memory) | avg(container_memory_working_set_bytes{pod=~"$service-.*"} / container_spec_memory_limit_bytes) | Gauge |
| Pods running | sum(kube_pod_status_phase{phase="Running", pod=~"$service-.*"}) | Stat |
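
The latency panels are estimates, not exact percentiles: `histogram_quantile` interpolates linearly inside the bucket that contains the target rank. The sketch below reproduces that interpolation in plain Python under simplifying assumptions (buckets sorted by upper bound, a final +Inf bucket, positive bounds, 0 < q <= 1). The bucket counts are made-up illustration data, not measurements.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Interpolate linearly between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts per `le` bucket (seconds).
BUCKETS = [(0.1, 50.0), (0.25, 80.0), (0.5, 95.0), (1.0, 99.0), (math.inf, 100.0)]

p90 = histogram_quantile(0.90, BUCKETS)    # lands inside the 0.25-0.5 bucket
p999 = histogram_quantile(0.999, BUCKETS)  # lands in +Inf, capped at 1.0
```

The practical consequence is that a P99 panel can never be more precise than the bucket boundaries the service exports, so it is worth choosing buckets around the latencies you care about.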

Variables (the Dropdowns)

Make a dashboard reusable across services and environments by templating:

# Variable: environment
Query type: Label values
Label: environment

# Variable: service (depends on environment)
Query type: Label values
Metric: up{environment="$environment"}
Label: service

Use them in queries:

sum(rate(http_requests_total{environment="$environment", service="$service"}[5m]))

Now the same dashboard works for staging-api, production-api, staging-worker, etc. Pick from the dropdown.
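
As a rough illustration of what templating does, the sketch below substitutes single-value selections into the query above using Python's `string.Template`. This only mimics the idea; Grafana's own interpolation also handles multi-value selections and formatting options, which are not modeled here. The function name `expand` is hypothetical.

```python
from string import Template

QUERY = Template(
    'sum(rate(http_requests_total{environment="$environment", service="$service"}[5m]))'
)


def expand(environment: str, service: str) -> str:
    """Substitute dashboard variables the way a single-value selection would."""
    return QUERY.substitute(environment=environment, service=service)


# One query definition, a different concrete query per dropdown selection.
staging_api = expand("staging", "api")
production_worker = expand("production", "worker")
```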

Dashboard Design Principles

A dashboard is a product. A few principles:

  1. Top-to-bottom = high level to detail. Service health up top; per-pod drilldown below.
  2. Left-to-right = traffic to errors to latency. Roughly the Golden Signals order.
  3. One signal per panel. Don't pile six lines onto one graph if they have different units.
  4. Y-axis units matter. Set each panel's unit (bytes, seconds, percent) so Grafana can auto-format the axis: "Bytes," not "Bytes (1000s)."
  5. Thresholds on stat/gauge panels. Green/yellow/red so anyone can read it at a glance.
  6. Annotate releases. Configure a Grafana annotation source so dashboards show deploy markers.
  7. Don't overload. A dashboard with 40 panels is two dashboards. Split.
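
Principles 4 and 5 map onto a panel's field configuration. The sketch below shows one plausible shape for an error-rate stat panel: a unit plus absolute threshold steps, with a small helper that picks the color a reading would get. The 1% and 5% cut-offs are placeholder assumptions to be replaced with your own SLO values, and `threshold_color` is an illustrative helper, not a Grafana API.

```python
ERROR_RATE_FIELD_CONFIG = {
    "defaults": {
        "unit": "percent",  # values are already on a 0-100 scale
        "thresholds": {
            "mode": "absolute",
            "steps": [
                {"color": "green", "value": None},  # base step, no lower bound
                {"color": "yellow", "value": 1},    # assumed warning level
                {"color": "red", "value": 5},       # assumed critical level
            ],
        },
    }
}


def threshold_color(value: float, steps: list) -> str:
    """Return the color of the highest step whose value is <= the reading.

    Assumes steps are sorted ascending with the base step first.
    """
    color = steps[0]["color"]
    for step in steps:
        if step["value"] is None or value >= step["value"]:
            color = step["color"]
    return color


STEPS = ERROR_RATE_FIELD_CONFIG["defaults"]["thresholds"]["steps"]
```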

Three Dashboards Every Service Should Have

| Dashboard | Audience | Refresh |
|---|---|---|
| Service Overview | Everyone (incl. non-engineers) | 30s |
| Operational Detail | Operating engineers | 15s |
| Infrastructure | Platform team | 30s |

Service Overview holds Golden Signals plus current alerts and SLO burn. Operational Detail goes deeper — per-endpoint latency, queue depth, downstream dependency health. Infrastructure shows pod/node resource pressure.

Annotations

Grafana annotations are markers drawn across panels — releases, incidents, maintenance windows. Wire them up from your deploy pipeline:

# Add an annotation from CI after a successful deploy
curl -X POST "$GRAFANA_URL/api/annotations" \
  -H "Authorization: Bearer $GRAFANA_API_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"dashboardId\": $DASHBOARD_ID,
    \"tags\": [\"deploy\", \"$SERVICE\", \"$ENVIRONMENT\"],
    \"text\": \"Deploy $VERSION\"
  }"

Now during an incident, the question "did this start when we deployed?" answers itself visually.
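
If the pipeline already runs Python, the same call can be made without hand-escaping JSON inside a shell string. The sketch below builds the payload with `json.dumps` and posts it with the standard library. It mirrors the fields of the curl example and adds an explicit `time` in epoch milliseconds; the function names and the timeout are assumptions.

```python
import json
import time
import urllib.request
from typing import Optional


def build_annotation(dashboard_id: int, service: str, environment: str,
                     version: str, when_ms: Optional[int] = None) -> dict:
    """Build the annotation body; `when_ms` defaults to the current time."""
    return {
        "dashboardId": dashboard_id,
        "time": when_ms if when_ms is not None else int(time.time() * 1000),
        "tags": ["deploy", service, environment],
        "text": f"Deploy {version}",
    }


def post_annotation(grafana_url: str, token: str, payload: dict) -> int:
    """POST the annotation and return the HTTP status code."""
    request = urllib.request.Request(
        f"{grafana_url}/api/annotations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Keeping payload construction separate from the network call makes the body easy to unit-test in CI without a running Grafana.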

Versioning Dashboards

Click-and-save in the UI is fine for ad-hoc dashboards. For ones that matter:

  • Export to JSON. Commit it to git.
  • Provision on startup. grafana/provisioning/dashboards/*.yml points at JSON files; Grafana loads them.
  • Review changes. Dashboard changes go through PR review like code.

# grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
  - name: 'default'
    folder: ''
    type: file
    options:
      path: /var/lib/grafana/dashboards

Drop the JSON files into /var/lib/grafana/dashboards (mounted from the repo) and they appear in Grafana.
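
Since dashboards go through PR review, a small CI check can catch broken files before Grafana tries to load them. The sketch below is one possible check, not a Grafana tool: it parses every JSON file in a directory and reports invalid JSON, missing `uid` or `title`, and duplicate `uid` values. Which fields to require is a policy assumption you may want to adjust.

```python
import json
from pathlib import Path


def check_dashboards(directory: Path) -> list:
    """Return a list of human-readable problems found in dashboard files."""
    problems = []
    seen_uids = {}  # uid -> first file that used it
    for path in sorted(directory.glob("*.json")):
        try:
            dashboard = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"{path.name}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(dashboard, dict):
            problems.append(f"{path.name}: top level is not a JSON object")
            continue
        uid = dashboard.get("uid")
        if not isinstance(uid, str) or not uid:
            problems.append(f"{path.name}: missing uid")
        elif uid in seen_uids:
            problems.append(f"{path.name}: uid {uid!r} already used by {seen_uids[uid]}")
        else:
            seen_uids[uid] = path.name
        if not dashboard.get("title"):
            problems.append(f"{path.name}: missing title")
    return problems
```

In a pipeline, call `check_dashboards(Path("dashboards"))` (the directory name is an assumption) and fail the job when the returned list is non-empty.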

Useful Community Dashboards

Don't build everything from scratch. Grafana Labs publishes dashboards for common exporters:

| Exporter | Dashboard ID |
|---|---|
| Node Exporter | 1860 (Node Exporter Full) |
| cAdvisor | 14282 |
| Kubernetes cluster | 7249, 315 |
| PostgreSQL | 9628 |
| Redis | 11835 |
| NGINX | 12708 |
Import by ID at + → Import dashboard. Use them as a starting point; customize the queries for your environment.

What's Next

You can now build dashboards people actually look at. The last page, Best Practices, covers running this stack well at scale: cardinality, recording rules, retention, federation.
