Grafana Dashboards
Data sources, panels, dashboard design - building dashboards people actually look at
Prometheus has its own UI for ad-hoc queries. Grafana is for dashboards — the persistent views your team checks daily, the screens on the wall during an incident.
Data Source Configuration
Provisioned at startup so you never click through setup:
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      timeInterval: "15s" # matches your scrape_interval
For long-term storage (Thanos, Cortex, Mimir), add another data source pointing at it.
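As a rough sketch of what a second data source could look like, the Python below renders a provisioning file with two entries. The `Thanos` name and the `http://thanos-query:9090` URL are placeholders rather than values from this guide, and the sketch assumes the long-term store speaks the Prometheus query API, so `type: prometheus` is reused.

```python
def render_datasources(sources):
    """Render a Grafana data source provisioning file as YAML text."""
    lines = ["apiVersion: 1", "datasources:"]
    for src in sources:
        lines += [
            f"  - name: {src['name']}",
            "    type: prometheus",
            "    access: proxy",
            f"    url: {src['url']}",
            # Only one data source should be marked as the default.
            f"    isDefault: {str(src.get('default', False)).lower()}",
        ]
    return "\n".join(lines) + "\n"


# Hypothetical endpoints: replace with the addresses in your environment.
SOURCES = [
    {"name": "Prometheus", "url": "http://prometheus:9090", "default": True},
    {"name": "Thanos", "url": "http://thanos-query:9090"},
]

print(render_datasources(SOURCES))
```

Keeping the short-retention Prometheus as the default means everyday panels stay fast, while individual panels can opt in to the long-term source.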
Anatomy of a Dashboard
A Grafana dashboard is a JSON document containing panels, each with one or more queries and a visualization. The relevant pieces:
- Variables — drop-downs at the top (environment, service, pod) that get templated into every query.
- Time range — global; every panel respects it.
- Refresh — how often panels re-query (5s for incident dashboards, 1m for overview).
- Panels — graphs, stats, gauges, tables, heatmaps.
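To make those four pieces concrete, here is a minimal sketch of a dashboard model built as a Python dict and printed as JSON. The title, `uid`, time range, and refresh values are illustrative assumptions; a dashboard exported from Grafana carries many more fields.

```python
import json


def minimal_dashboard(title, uid):
    """Return a small dashboard model with the four pieces listed above."""
    return {
        "title": title,
        "uid": uid,
        # Variables: rendered as drop-downs at the top of the dashboard.
        "templating": {"list": []},
        # Time range: the global default that every panel inherits.
        "time": {"from": "now-6h", "to": "now"},
        # Refresh: how often panels re-run their queries.
        "refresh": "1m",
        # Panels: each has a visualization type, a grid position, and queries.
        "panels": [
            {
                "type": "timeseries",
                "title": "Request rate",
                "gridPos": {"x": 0, "y": 0, "w": 12, "h": 8},
                "targets": [
                    {"refId": "A", "expr": "sum(rate(http_requests_total[5m]))"}
                ],
            }
        ],
    }


# Hypothetical title and uid, chosen only for the example.
dashboard = minimal_dashboard("Service Overview", "service-overview")
print(json.dumps(dashboard, indent=2))
```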
Panel Types Cheat Sheet
| Panel | When to use |
|---|---|
| Time series | Most metrics — rate, latency, gauge over time |
| Stat | Single big number — current error rate, total RPS |
| Gauge | Bounded value with thresholds — CPU %, disk % |
| Bar gauge | Compare a current value across labels — RPS per service |
| Heatmap | Histograms over time — latency distribution |
| Table | Listing — top 10 noisy pods, current alerts |
| Logs | Loki integration; render log lines |
| State timeline | Discrete states over time — up/down |
A "Golden Signals" Dashboard
Build one of these for every service. PromQL for each panel:
| Panel | Query | Visualization |
|---|---|---|
| Request rate | sum(rate(http_requests_total[5m])) | Time series |
| Error rate (%) | sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100 | Stat (with thresholds) |
| P50 latency | histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) | Time series |
| P99 latency | histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) | Time series |
| Saturation (CPU) | avg(rate(container_cpu_usage_seconds_total{pod=~"$service-.*"}[5m])) | Gauge |
| Saturation (memory) | avg(container_memory_working_set_bytes{pod=~"$service-.*"} / container_spec_memory_limit_bytes) | Gauge |
| Pods running | sum(kube_pod_status_phase{phase="Running", pod=~"$service-.*"}) | Stat |
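One way to turn rows like these into panels is to generate them rather than hand-place each one. The sketch below takes three rows from the table and lays them out left to right on Grafana's 24-column grid; the `layout_panels` helper and the 8x8 panel size are assumptions made for illustration.

```python
# (title, PromQL, Grafana panel type id) for a subset of the table above.
GOLDEN_SIGNALS = [
    ("Request rate",
     'sum(rate(http_requests_total[5m]))',
     "timeseries"),
    ("Error rate (%)",
     'sum(rate(http_requests_total{status=~"5.."}[5m]))'
     ' / sum(rate(http_requests_total[5m])) * 100',
     "stat"),
    ("P99 latency",
     'histogram_quantile(0.99, sum by (le) '
     '(rate(http_request_duration_seconds_bucket[5m])))',
     "timeseries"),
]

GRID_COLUMNS = 24  # Grafana's dashboard grid is 24 units wide.


def layout_panels(rows, width=8, height=8):
    """Place panels left to right, wrapping to a new row when the grid is full."""
    per_row = GRID_COLUMNS // width
    panels = []
    for i, (title, expr, panel_type) in enumerate(rows):
        panels.append({
            "id": i + 1,
            "type": panel_type,
            "title": title,
            "gridPos": {
                "x": (i % per_row) * width,
                "y": (i // per_row) * height,
                "w": width,
                "h": height,
            },
            "targets": [{"refId": "A", "expr": expr}],
        })
    return panels


panels = layout_panels(GOLDEN_SIGNALS)
for panel in panels:
    print(panel["title"], panel["type"], panel["gridPos"])
```

The resulting list can be dropped into the `panels` field of a dashboard JSON file.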
Variables (the Dropdowns)
Make a dashboard reusable across services and environments by templating:
# Variable: environment
Query type: Label values
Label: environment
# Variable: service (depends on environment)
Query type: Label values
Metric: up{environment="$environment"}
Label: service
Use them in queries:
sum(rate(http_requests_total{environment="$environment", service="$service"}[5m]))
Now the same dashboard works for staging-api, production-api, staging-worker, etc. Pick from the dropdown.
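To see what templating does to a query, the sketch below writes the two variables in the `label_values(...)` query form and substitutes chosen values into the PromQL. The `expand` helper is a simplified stand-in, not Grafana's own interpolation: it handles only single-value `$name` placeholders and ignores multi-value variables and format options.

```python
from string import Template

# Variable definitions as they might appear under "templating.list".
VARIABLES = [
    {"name": "environment", "type": "query",
     "query": "label_values(environment)"},
    {"name": "service", "type": "query",
     "query": 'label_values(up{environment="$environment"}, service)'},
]

QUERY = ('sum(rate(http_requests_total{environment="$environment", '
         'service="$service"}[5m]))')


def expand(query, **selected):
    """Substitute $name placeholders the way a single-value variable would."""
    return Template(query).substitute(**selected)


# The panel query after picking staging / api from the drop-downs.
print(expand(QUERY, environment="staging", service="api"))
# The chained variable: its own query depends on the environment selection.
print(expand(VARIABLES[1]["query"], environment="production"))
```

The second call shows why `service` depends on `environment`: the list of services is re-queried whenever the environment selection changes.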
Dashboard Design Principles
A dashboard is a product. A few principles:
- Top-to-bottom = high level to detail. Service health up top; per-pod drilldown below.
- Left-to-right = traffic to errors to latency. Roughly the Golden Signals order.
- One signal per panel. Don't pile six lines onto one graph if they have different units.
- Y-axis units matter. Feed panels raw base units (bytes, seconds) and set the panel unit instead of pre-scaling to "Bytes (1000s)"; Grafana auto-formats from there.
- Thresholds on stat/gauge panels. Green/yellow/red so anyone can read it at a glance.
- Annotate releases. Configure a Grafana annotation source so dashboards show deploy markers.
- Don't overload. A dashboard with 40 panels is two dashboards. Split.
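The unit and threshold principles map onto a panel's `fieldConfig`. The sketch below builds one for an error-rate stat panel and adds a small `color_for` helper that mimics how absolute threshold steps are read; the 1 % and 5 % limits are made-up values, and the helper assumes the steps are listed in ascending order.

```python
def stat_field_config(unit, warn, crit):
    """Build fieldConfig defaults with a unit and green/yellow/red thresholds."""
    return {
        "defaults": {
            "unit": unit,
            "thresholds": {
                "mode": "absolute",
                "steps": [
                    {"color": "green", "value": None},  # base step, no lower bound
                    {"color": "yellow", "value": warn},
                    {"color": "red", "value": crit},
                ],
            },
        }
    }


def color_for(value, field_config):
    """Return the color of the last step whose bound is <= value."""
    color = None
    for step in field_config["defaults"]["thresholds"]["steps"]:
        if step["value"] is None or value >= step["value"]:
            color = step["color"]
    return color


# Hypothetical limits for an error-rate stat panel: yellow at 1 %, red at 5 %.
error_rate = stat_field_config("percent", warn=1, crit=5)
for sample in (0.2, 2.5, 7):
    print(sample, color_for(sample, error_rate))
```

Setting `unit` to `percent` tells Grafana the value is already on a 0-100 scale, which matches an error-rate query that multiplies by 100.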
Three Dashboards Every Service Should Have
| Dashboard | Audience | Refresh |
|---|---|---|
| Service Overview | Everyone (incl. non-engineers) | 30s |
| Operational Detail | Operating engineers | 15s |
| Infrastructure | Platform team | 30s |
Service Overview holds Golden Signals plus current alerts and SLO burn. Operational Detail goes deeper — per-endpoint latency, queue depth, downstream dependency health. Infrastructure shows pod/node resource pressure.
Annotations
Grafana annotations are markers drawn across panels — releases, incidents, maintenance windows. Wire them up from your deploy pipeline:
# Add an annotation from CI after a successful deploy
curl -X POST "$GRAFANA_URL/api/annotations" \
  -H "Authorization: Bearer $GRAFANA_API_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"dashboardId\": $DASHBOARD_ID,
    \"tags\": [\"deploy\", \"$SERVICE\", \"$ENVIRONMENT\"],
    \"text\": \"Deploy $VERSION\"
  }"
Now during an incident, the question "did this start when we deployed?" answers itself visually.
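If the deploy pipeline already runs Python, the same call can be sketched with the standard library, which sidesteps hand-escaping quotes in the JSON body. The function below builds the request without sending it; the URL, key, dashboard id, and version are placeholders.

```python
import json
import urllib.request


def build_annotation_request(grafana_url, api_key, dashboard_id,
                             service, environment, version):
    """Build (but do not send) the POST request for a deploy annotation."""
    payload = {
        "dashboardId": dashboard_id,
        "tags": ["deploy", service, environment],
        "text": f"Deploy {version}",
    }
    return urllib.request.Request(
        url=f"{grafana_url.rstrip('/')}/api/annotations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values; in CI these would come from environment variables.
req = build_annotation_request(
    "https://grafana.example.com", "api-key-placeholder", 42,
    "api", "production", "v1.2.3",
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# Sending it is one more call: urllib.request.urlopen(req, timeout=10)
```

Because no `time` field is set, the annotation should land at the moment the request is received, so it is best sent right after the deploy step completes.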
Versioning Dashboards
Click-and-save in the UI is fine for ad-hoc dashboards. For ones that matter:
- Export to JSON. Commit it to git.
- Provision on startup. grafana/provisioning/dashboards/*.yml points at JSON files; Grafana loads them.
- Review changes. Dashboard changes go through PR review like code.
# grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
  - name: 'default'
    folder: ''
    type: file
    options:
      path: /var/lib/grafana/dashboards
Drop the JSON files into /var/lib/grafana/dashboards (mounted from the repo) and they appear in Grafana.
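For the export step, one possible helper is sketched below. It assumes the input is the JSON returned by Grafana's `GET /api/dashboards/uid/<uid>` endpoint, which wraps the model in a `dashboard` key next to `meta`; the sample response is made up and heavily trimmed.

```python
import json


def to_provisionable(api_response):
    """Turn a dashboard API response into file content suitable for git."""
    dashboard = dict(api_response["dashboard"])
    # The numeric id is specific to one Grafana instance; the uid is the
    # stable identifier, so clear the id before committing.
    dashboard["id"] = None
    # Sorted keys and fixed indentation keep diffs small in PR review.
    return json.dumps(dashboard, indent=2, sort_keys=True) + "\n"


# Trimmed, made-up response used only to exercise the function.
response = {
    "meta": {"slug": "service-overview"},
    "dashboard": {"id": 17, "uid": "service-overview",
                  "title": "Service Overview", "panels": []},
}
print(to_provisionable(response))
```

Writing the returned text to a file under the provisioned path gives reviewers a stable, diffable artifact.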
Useful Community Dashboards
Don't build everything from scratch. The dashboard library on grafana.com hosts community-maintained dashboards for common exporters:
| Exporter | Dashboard ID |
|---|---|
| Node Exporter | 1860 (Node Exporter Full) |
| cAdvisor | 14282 |
| Kubernetes cluster | 7249, 315 |
| PostgreSQL | 9628 |
| Redis | 11835 |
| NGINX | 12708 |
Import by ID at + → Import dashboard. Use them as a starting point; customize the queries for your environment.
What's Next
You can build dashboards people actually look at. The last page covers running this stack well at scale — cardinality, recording rules, retention, federation → Best Practices.