Data sources, panels, dashboard design - building dashboards people actually look at

Grafana Dashboards

Prometheus has its own UI for ad-hoc queries. Grafana is for dashboards — the persistent views your team checks daily, the screens on the wall during an incident.

Data Source Configuration

Provision the data source at startup so you never have to click through setup:

# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      timeInterval: "15s"          # matches your scrape_interval

For long-term storage (Thanos, Cortex, Mimir), add another data source pointing at it.
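
If the provisioning file is generated by a script rather than written by hand, the second data source could be sketched as follows. This is a minimal sketch under assumptions: the `Mimir` name and its URL are placeholders rather than values from this guide, the "exactly one default" check is this sketch's own convention, and the output is JSON, which YAML parsers generally accept, so it can be saved as a `.yml` file next to the one above.

```python
import json

def datasource(name, url, is_default=False, scrape_interval="15s"):
    """Return one entry for the `datasources` list of a provisioning file."""
    return {
        "name": name,
        # Thanos, Cortex and Mimir expose a Prometheus-compatible query API,
        # so they are added with the same data source type.
        "type": "prometheus",
        "access": "proxy",
        "url": url,
        "isDefault": is_default,
        "jsonData": {"timeInterval": scrape_interval},
    }

def provisioning_doc(entries):
    """Wrap entries in the apiVersion/datasources envelope used above."""
    defaults = [e["name"] for e in entries if e["isDefault"]]
    if len(defaults) != 1:
        raise ValueError(f"expected exactly one default data source, got {defaults}")
    return {"apiVersion": 1, "datasources": entries}

doc = provisioning_doc([
    datasource("Prometheus", "http://prometheus:9090", is_default=True),
    datasource("Mimir", "http://mimir:8080/prometheus"),  # placeholder URL
])
print(json.dumps(doc, indent=2))
```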

Anatomy of a Dashboard

A Grafana dashboard is a JSON document containing panels, each with one or more queries and a visualization. The relevant pieces:

Variables — drop-downs at the top (environment, service, pod) that get templated into every query.
Time range — global; every panel respects it.
Refresh — how often panels re-query (5s for incident dashboards, 1m for overview).
Panels — graphs, stats, gauges, tables, heatmaps.
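
To see how those four pieces map onto the JSON, here is a rough Python sketch that builds a two-panel dashboard model. The field names follow the dashboard JSON model as commonly exported; the title, UID, and time range are made-up values, and the exact shape varies by Grafana version, so export a dashboard from your own instance to confirm it.

```python
import json

def panel(panel_id, title, expr, panel_type="timeseries", x=0, y=0):
    """Build one panel: a visualization type plus the query feeding it."""
    return {
        "id": panel_id,
        "type": panel_type,
        "title": title,
        "gridPos": {"x": x, "y": y, "w": 12, "h": 8},  # the grid is 24 columns wide
        "targets": [{"refId": "A", "expr": expr}],
    }

def dashboard(title, uid, panels, refresh="30s"):
    """Assemble variables, time range, refresh and panels into one model."""
    return {
        "title": title,
        "uid": uid,
        "time": {"from": "now-6h", "to": "now"},   # global time range
        "refresh": refresh,                          # how often panels re-query
        "templating": {"list": [                     # variables (drop-downs)
            {"name": "environment", "type": "query",
             "query": "label_values(environment)"},
        ]},
        "panels": panels,
    }

model = dashboard("Service Overview", "service-overview", [  # hypothetical title and UID
    panel(1, "Request rate",
          'sum(rate(http_requests_total{environment="$environment"}[5m]))'),
    panel(2, "Error rate (%)",
          'sum(rate(http_requests_total{status=~"5.."}[5m]))'
          ' / sum(rate(http_requests_total[5m])) * 100',
          panel_type="stat", x=12),
])
print(json.dumps(model, indent=2))
```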

Panel Types Cheat Sheet

Panel	When to use
Time series	Most metrics — rate, latency, gauge over time
Stat	Single big number — current error rate, total RPS
Gauge	Bounded value with thresholds — CPU %, disk %
Bar gauge	Compare a current value across labels — RPS per service
Heatmap	Histograms over time — latency distribution
Table	Listing — top 10 noisy pods, current alerts
Logs	Loki integration; render log lines
State timeline	Discrete states over time — `up`/`down`

A "Golden Signals" Dashboard

Build one of these for every service. PromQL for each panel:

Panel	Query	Visualization
Request rate	`sum(rate(http_requests_total[5m]))`	Time series
Error rate (%)	`sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100`	Stat (with thresholds)
P50 latency	`histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))`	Time series
P99 latency	`histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))`	Time series
Saturation (CPU)	`avg(rate(container_cpu_usage_seconds_total{pod=~"$service-.*"}[5m]))`	Gauge
Saturation (memory)	`avg(container_memory_working_set_bytes{pod=~"$service-.*"} / container_spec_memory_limit_bytes)`	Gauge
Pods running	`sum(kube_pod_status_phase{phase="Running", pod=~"$service-.*"})`	Stat
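
To make the latency rows less of a black box: `histogram_quantile` estimates a quantile by finding the bucket that the target rank falls into and interpolating linearly inside it. The sketch below reimplements that idea in plain Python for classic `le`-bucketed histograms. It is a simplified illustration rather than Prometheus's own code, and the bucket counts are made up.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    `buckets` must include the +Inf bucket and at least one finite bucket,
    as a Prometheus histogram does.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # The rank falls past the last finite bucket: the best available
        # answer is the highest finite upper bound.
        return buckets[-2][0]
    lower = buckets[i - 1][0] if i > 0 else 0.0
    prev_count = buckets[i - 1][1] if i > 0 else 0.0
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)

# Made-up request durations: 40 under 0.1s, 90 under 0.5s, 99 under 1s, 100 total.
example = [(0.1, 40), (0.5, 90), (1.0, 99), (math.inf, 100)]
p50 = histogram_quantile(0.50, example)
p99 = histogram_quantile(0.99, example)
print(f"p50={p50:.3f}s p99={p99:.3f}s")
```

The practical takeaway is that the result can only be as precise as the bucket boundaries: with these buckets, any quantile landing in the 0.5s to 1s bucket is an interpolated guess, which is worth remembering when choosing bucket layouts.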

Variables (the Dropdowns)

Make a dashboard reusable across services and environments by templating:

# Variable: environment
Query type: Label values
Label: environment

# Variable: service (depends on environment)
Query type: Label values
Metric: up{environment="$environment"}
Label: service

Use them in queries:

sum(rate(http_requests_total{environment="$environment", service="$service"}[5m]))

Now the same dashboard works for staging-api, production-api, staging-worker, etc. Pick from the dropdown.
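
Conceptually, Grafana substitutes the selected values into the query text before sending it to the data source. The sketch below imitates that with Python's `string.Template`, which happens to use the same `$name` syntax. It only illustrates the substitution idea and is not how Grafana implements it; the `render` helper is hypothetical.

```python
from string import Template

# The templated query from above, with $environment and $service placeholders.
QUERY = Template(
    'sum(rate(http_requests_total{environment="$environment", service="$service"}[5m]))'
)

def render(environment, service):
    """Fill the dashboard variables into the query text."""
    return QUERY.substitute(environment=environment, service=service)

for env, svc in [("staging", "api"), ("production", "api"), ("staging", "worker")]:
    print(render(env, svc))
```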

Dashboard Design Principles

A dashboard is a product. A few principles:

Top-to-bottom = high level to detail. Service health up top; per-pod drilldown below.
Left-to-right = traffic to errors to latency. Roughly the Golden Signals order.
One signal per panel. Don't pile six lines onto one graph if they have different units.
Y-axis units matter. Set each panel's unit to the metric's base unit (bytes, seconds) and let Grafana scale it; the axis should read "Bytes," not "Bytes (1000s)."
Thresholds on stat/gauge panels. Green/yellow/red so anyone can read it at a glance.
Annotate releases. Configure a Grafana annotation source so dashboards show deploy markers.
Don't overload. A dashboard with 40 panels is two dashboards. Split.
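
Some of these principles can be checked mechanically once dashboards live in a repository as JSON. The following is a hedged sketch of such a lint: `lint_dashboard` is a hypothetical helper, the 40-panel limit simply reuses the number above, and it assumes units and thresholds sit under `fieldConfig.defaults`, as in recent dashboard exports. Panels nested inside collapsed rows are not inspected.

```python
def lint_dashboard(dashboard, max_panels=40):
    """Return human-readable findings for one dashboard JSON model."""
    findings = []
    panels = [p for p in dashboard.get("panels", []) if p.get("type") != "row"]
    if len(panels) >= max_panels:
        findings.append(f"{len(panels)} panels: consider splitting the dashboard")
    for p in panels:
        title = p.get("title", "<untitled>")
        defaults = p.get("fieldConfig", {}).get("defaults", {})
        if not defaults.get("unit"):
            findings.append(f"{title}: no unit set")
        steps = defaults.get("thresholds", {}).get("steps", [])
        # A single step is only the base color, so nothing changes at a glance.
        if p.get("type") in ("stat", "gauge") and len(steps) < 2:
            findings.append(f"{title}: stat/gauge panel without thresholds")
    return findings

example = {"panels": [
    {"type": "stat", "title": "Error rate (%)",
     "fieldConfig": {"defaults": {"unit": "percent", "thresholds": {"steps": [
         {"color": "green", "value": None}, {"color": "red", "value": 1}]}}}},
    {"type": "gauge", "title": "Saturation (CPU)"},
]}
for finding in lint_dashboard(example):
    print(finding)
```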

Three Dashboards Every Service Should Have

Dashboard	Audience	Refresh
Service Overview	Everyone (incl. non-engineers)	30s
Operational Detail	Operating engineers	15s
Infrastructure	Platform team	30s

Service Overview holds Golden Signals plus current alerts and SLO burn. Operational Detail goes deeper — per-endpoint latency, queue depth, downstream dependency health. Infrastructure shows pod/node resource pressure.

Annotations

Grafana annotations are markers drawn across panels — releases, incidents, maintenance windows. Wire them up from your deploy pipeline:

# Add an annotation from CI after a successful deploy
curl -X POST "$GRAFANA_URL/api/annotations" \
  -H "Authorization: Bearer $GRAFANA_API_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"dashboardId\": $DASHBOARD_ID,
    \"tags\": [\"deploy\", \"$SERVICE\", \"$ENVIRONMENT\"],
    \"text\": \"Deploy $VERSION\"
  }"

Now during an incident, the question "did this start when we deployed?" answers itself visually.
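
If the deploy pipeline already runs Python, the same call can be sketched with the standard library. The request mirrors the curl command above; the explicit `time` field (epoch milliseconds) and the function names are additions made for this sketch, and the IDs and version in the example are made up. Building the body with `json.dumps` avoids the shell-quoting pitfalls of interpolating variables into a JSON string.

```python
import json
import time
import urllib.request

def build_annotation(dashboard_id, service, environment, version, when_ms=None):
    """Build the JSON body for POST /api/annotations."""
    return {
        "dashboardId": dashboard_id,
        # Epoch milliseconds; defaults to "now" when the caller passes nothing.
        "time": when_ms if when_ms is not None else int(time.time() * 1000),
        "tags": ["deploy", service, environment],
        "text": f"Deploy {version}",
    }

def post_annotation(grafana_url, token, payload):
    """Send the annotation to Grafana and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{grafana_url}/api/annotations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)

# Example payload only; call post_annotation(...) from CI with real credentials.
payload = build_annotation(42, "api", "production", "v1.2.3", when_ms=1700000000000)
print(json.dumps(payload))
```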

Versioning Dashboards

Click-and-save in the UI is fine for ad-hoc dashboards. For ones that matter:

Export to JSON. Commit it to git.
Provision on startup. grafana/provisioning/dashboards/*.yml points at JSON files; Grafana loads them.
Review changes. Dashboard changes go through PR review like code.

# grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
  - name: 'default'
    folder: ''
    type: file
    options:
      path: /var/lib/grafana/dashboards

Drop the JSON files into /var/lib/grafana/dashboards (mounted from the repo) and they appear in Grafana.
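
Exported JSON tends to produce noisy diffs in review. One possible pre-commit step, sketched below, normalizes each file before it is committed. It assumes the top-level `id` and `version` fields are instance-specific and safe to drop for provisioned dashboards; verify that against your Grafana version. `normalize_dashboard` is a hypothetical helper, not a Grafana tool.

```python
import json

# Assumption: these top-level fields change per instance or per save.
VOLATILE_KEYS = ("id", "version")

def normalize_dashboard(raw_json):
    """Return a stable text form of an exported dashboard for committing to git."""
    dashboard = json.loads(raw_json)
    for key in VOLATILE_KEYS:
        dashboard.pop(key, None)
    # Sorted keys and fixed indentation keep diffs limited to real changes.
    return json.dumps(dashboard, indent=2, sort_keys=True) + "\n"

first_export = '{"id": 7, "version": 3, "title": "Service Overview", "panels": []}'
second_export = '{"panels": [], "title": "Service Overview", "version": 4, "id": 12}'
print(normalize_dashboard(first_export) == normalize_dashboard(second_export))
```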

Useful Community Dashboards

Don't build everything from scratch. Grafana Labs publishes dashboards for common exporters:

Exporter	Dashboard ID
Node Exporter	1860 (Node Exporter Full)
cAdvisor	14282
Kubernetes cluster	7249, 315
PostgreSQL	9628
Redis	11835
NGINX	12708

Import by ID at + → Import dashboard. Use them as a starting point; customize the queries for your environment.

What's Next

You can now build dashboards people actually look at. The last page, Best Practices, covers running this stack well at scale: cardinality, recording rules, retention, and federation.
