Steven's Knowledge

Prometheus & Grafana

The default open-source monitoring stack - from first scrape to production-grade alerting and dashboards

Prometheus is an open-source time-series database and monitoring system. Grafana is a visualization platform that connects to Prometheus (and other data sources) to build dashboards and alerts. Together they're the default open-source monitoring stack.

Why This Stack

Without metrics                       | With Prometheus + Grafana
--------------------------------------+-----------------------------------------------
Reactive: users tell you it's broken  | Proactive: alerts fire before users notice
top and tail -f on a box              | One dashboard for every service
No baseline for "normal"              | Histograms and percentiles for every endpoint
Capacity planning by guess            | Capacity planning from actual usage trends

The Architecture

┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│ App+/metrics│  │Node Exporter│  │  cAdvisor   │
└──────┬──────┘  └──────┬──────┘  └──────┬──────┘
       │                │                │
       └────────────────┼────────────────┘
                        │ pull (scrape)
                 ┌──────▼──────┐
                 │  Prometheus │  (TSDB + PromQL)
                 └──────┬──────┘
                        │ query
              ┌─────────┼─────────┐
              ▼                   ▼
       ┌────────────┐     ┌────────────┐
       │   Grafana  │     │Alertmanager│
       │(Dashboards)│     │  (Alerts)  │
       └────────────┘     └────────────┘

Two ideas to internalize:

  1. Pull, not push. Prometheus scrapes GET /metrics on a schedule. Apps don't push.
  2. Labels are dimensions. A metric http_requests_total with labels {method, path, status} becomes a multi-dimensional dataset you can slice with PromQL.
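
Both ideas can be sketched with nothing but the Python standard library. This is an illustrative toy, not how a real client library is built: the in-memory store, `observe`, `render_metrics`, `sum_by`, and `demo` are hypothetical names, label values are assumed to need no escaping, and `scrape` only mimics the HTTP GET that Prometheus performs on each scrape interval.

```python
from collections import defaultdict
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading
import urllib.request

# Hypothetical in-process store: one entry per label combination (one time series).
LABEL_NAMES = ("method", "path", "status")
http_requests_total = defaultdict(int)


def observe(method, path, status):
    """Increment the counter for one label combination."""
    http_requests_total[(method, path, status)] += 1


def render_metrics():
    """Render the counter in the Prometheus text exposition format."""
    lines = ["# TYPE http_requests_total counter"]
    for values, count in sorted(http_requests_total.items()):
        labels = ",".join(f'{k}="{v}"' for k, v in zip(LABEL_NAMES, values))
        lines.append(f"http_requests_total{{{labels}}} {count}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """The app only answers GET /metrics; it never pushes anything."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the sketch quiet


def scrape(url):
    """Stand-in for one Prometheus scrape: a plain HTTP GET."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def sum_by(label):
    """Rough analogue of PromQL `sum by (<label>) (http_requests_total)`."""
    idx = LABEL_NAMES.index(label)
    totals = defaultdict(int)
    for values, count in http_requests_total.items():
        totals[values[idx]] += count
    return dict(totals)


def demo():
    """Record a few requests, serve /metrics on a free local port, scrape it once."""
    observe("GET", "/api/users", "200")
    observe("GET", "/api/users", "200")
    observe("POST", "/api/users", "500")
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        print(scrape(f"http://127.0.0.1:{server.server_port}/metrics"))
        print(sum_by("status"))
    finally:
        server.shutdown()
        server.server_close()
```

Calling `demo()` prints the scraped text followed by the per-status totals. In a real service you would normally use an official client library instead of hand-rendering the format; the point here is only that the app exposes state and Prometheus comes to collect it, and that each label combination is a separate series you can aggregate along any label.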

Learning Path

Read in this order if you're new — each page builds on the previous one.

When NOT to Use Prometheus

It's the right default, but not the only option:

  • Massive cardinality / per-user metrics? Prometheus struggles, because every distinct label combination is stored as its own time series. Look at ClickHouse-based stacks (Cube, Aperture), or commercial APM (Datadog, New Relic).
  • You want logs + traces + metrics in one place? Grafana's broader stack (Loki for logs, Tempo for traces) or an APM SaaS.
  • You can't run servers? Managed Prometheus exists: Amazon Managed Service for Prometheus, Grafana Cloud, Chronosphere.
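
To make the cardinality concern in the first bullet concrete, here is a small back-of-the-envelope sketch. The label counts are made-up illustrative numbers and `series_upper_bound` is a hypothetical helper; the result is an upper bound, since only label combinations that actually occur become series.

```python
from math import prod


def series_upper_bound(label_values):
    """Upper bound on time series for one metric:
    the product of the number of distinct values per label."""
    return prod(label_values.values())


# Hypothetical label cardinalities for http_requests_total.
bounded = {"method": 5, "path": 50, "status": 10}
per_user = {**bounded, "user_id": 1_000_000}

print(series_upper_bound(bounded))   # 2500
print(series_upper_bound(per_user))  # 2500000000
```

Adding a single unbounded label such as a user ID multiplies the worst case by the number of users, which is why per-user data usually belongs in logs or an analytics store rather than in metric labels.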

Prometheus collects metrics: numeric time series sampled at regular intervals. For logs (string events) use Loki, Elasticsearch, or a SaaS; for traces (request paths through services) use Tempo or Jaeger. Metrics, logs, and traces together form the "three pillars of observability."
