Steven's Knowledge

Data Warehouses & Lakehouses

Snowflake, BigQuery, Databricks, ClickHouse, Redshift - column-oriented storage and compute for analytics over billions of rows

Data Warehouses & Lakehouses

A data warehouse is purpose-built for analytical queries — SELECT region, SUM(revenue) FROM orders WHERE year = 2026 GROUP BY region over a billion rows in seconds. It's the opposite of an OLTP database (Postgres, MySQL) which is built for transactional reads and writes.

Modern data warehouses are cloud-native, separated storage from compute, and columnar. The lines between "warehouse" and "lakehouse" have blurred — Databricks-style lakehouses store data as open formats (Parquet, Iceberg, Delta) on cheap object storage with multiple compute engines layered on top. Snowflake and BigQuery offer similar separation with proprietary formats.

The category is its own beast. It's adjacent to but distinct from:

If your query reads millions of rows and returns aggregates, a data warehouse is the right tool.

Why a Warehouse and Not Postgres

ConcernPostgresWarehouse
Single-row readsMicrosecondsSlow (10-100ms)
Aggregations over millions of rowsSlow without tuningSeconds, by design
Storage formatRow-orientedColumn-oriented (10-100× compression)
Compute scalingVertical (one big box)Horizontal (clusters of workers)
Concurrency modelOLTP MVCCSnapshot reads at massive parallelism
Update-heavy workloadStrongGenerally weak (append, not update)
Cost per TB-month$$ (always-on instance)$ (compute decoupled from storage)

The breaking point is usually single-table size > 1 billion rows + analytical query patterns. Below that, Postgres handles analytics fine with proper indexes.

The Players

Cloud-native warehouses (proprietary)

ToolStrengths
SnowflakeBest-in-class UX; clear separation of compute clusters; multi-cloud; SaaS
BigQuery (GCP)Serverless; per-query pricing; great for ad-hoc; deep ML integration
Redshift (AWS)AWS-native; cheap reserved capacity; RA3 (separated storage) reasonable
Synapse (Azure)Azure-native; SQL + Spark together

Lakehouses (open formats)

ToolStrengths
DatabricksDelta Lake; ML-first; Apache Spark; Unity Catalog
Apache Iceberg + engineOpen table format; works with Trino, Spark, Snowflake, BigQuery, Athena
Apache HudiOpen; transactional; CDC-friendly
Delta LakeOpen format; born at Databricks

OLAP engines (often self-hosted)

ToolStrengths
ClickHouseFastest OSS; columnar; SQL; ingestion at MM/s; insanely fast point lookups
Apache DruidReal-time + batch; pre-computed indexes; sub-second slice/dice
Apache PinotLinkedIn's; real-time analytics; strong at user-facing analytics
DuckDBIn-process; embedded analytics; "SQLite for analytics"

Query engines (decoupled from storage)

ToolStrengths
Trino (PrestoSQL fork)Query S3, Hive, Iceberg, JDBC sources; federated SQL
Spark SQLInside Databricks/EMR; integrates with ML pipelines
Athena (AWS)Managed Presto over S3; pay per query
DuckDBCan query Parquet on S3 directly; surprisingly capable

How to pick:

  • Don't want to run infra, willing to paySnowflake or BigQuery
  • AWS-nativeAthena (light) or Redshift (heavy)
  • Self-hosted, fastest, single-DBClickHouse
  • ML pipelines first-classDatabricks
  • Want open formats / vendor flexibilityIceberg + Trino or Databricks + Delta
  • User-facing analytics dashboardsClickHouse, Pinot, Druid
  • Embedded / local / notebookDuckDB

What "Columnar" Actually Means

OLTP databases store rows together:

[row 1: id=1, name="Alice", revenue=100, region="US"]
[row 2: id=2, name="Bob",   revenue=50,  region="EU"]

When you SELECT name FROM users WHERE id = 1, this is efficient — read one row, return one column.

Columnar storage groups by column:

ids:     [1, 2, 3, 4, ...]
names:   ["Alice", "Bob", "Carol", "Dave", ...]
revenue: [100, 50, 200, 80, ...]
region:  ["US", "EU", "US", "AS", ...]

When you SELECT SUM(revenue) WHERE region='US':

  • Read only the revenue and region columns (skip name, id)
  • Compression on similar values is huge (run-length encoding on region, delta encoding on id)
  • SIMD-vectorized aggregation across the column

The result: 10-100× less I/O for typical analytical queries, and CPU instructions designed for the shape.

The trade: single-row update is slow (rewrite the column). Update-heavy workloads stay OLTP; append-heavy + read-heavy moves to columnar.

Separated Storage and Compute

Modern warehouses store data in cheap object storage (S3, GCS, Azure Blob) and run compute clusters that read from it on demand. Three big consequences:

  1. Storage costs are cheap and decoupled: $20/TB-month vs. $200/TB-month for legacy DB storage.
  2. Compute can scale independently: run a giant query, spin up 100 workers, return result, shut them down.
  3. Multiple compute clusters can read the same data: dev cluster, prod cluster, ML pipeline cluster — all on the same source.

Snowflake popularized this. BigQuery is serverless (you don't see the clusters). Databricks runs Spark clusters on object storage. ClickHouse Cloud also separates them. Older warehouses (Redshift classic, on-prem) couple storage and compute and cost a lot more.

ELT vs. ETL

The data engineering shift of the 2010s:

  • ETL (Extract → Transform → Load): transform in flight, load clean data into a small warehouse. The shape was constrained by the warehouse size.
  • ELT (Extract → Load → Transform): dump raw into a cheap, massive warehouse; transform with SQL inside. Dominant pattern today.

Tools:

  • Fivetran / Airbyte / Estuary — managed extract+load from SaaS and DBs
  • dbt — SQL transformations in the warehouse; the de facto standard for analytics engineering
  • Workflow orchestration (Workflow Orchestration) — runs the dbt models, ingestion jobs

A modern data stack:

[SaaS / OLTP / events] → Fivetran/Airbyte → warehouse (raw schema)
                                                 ↓ dbt models (SQL)
                                            warehouse (analytics schema)

                                          BI tool (Looker/Metabase/Hex)

Learning Path

When You Don't Need a Warehouse

Honest cases:

  • Postgres works fine for your analytics until ~100GB and a few hundred queries/day. Don't over-engineer.
  • Your "analytics" is a few daily reports — a nightly Postgres replica is cheaper.
  • You have small data and want fast iteration — DuckDB embedded into a notebook is a serious analytics environment.
  • You need real-time row-level decisions — that's OLTP, not warehouse.

The warehouse pays off when:

  • Your analytical queries are slowing down OLTP
  • You want to mix data from many sources (SaaS, app DB, events)
  • Multiple teams (analytics, ML, product) want the same data
  • Compliance / audit requires a queryable history of everything

The warehouse you don't think about is the one running well. A warehouse where every query feels fast, dbt builds in minutes, and the bill is predictable is doing its job. The mistake people make is buying the most "powerful" warehouse before they have the data shape that justifies it. Start small (DuckDB, Postgres, ClickHouse self-hosted, Snowflake free tier), grow as needed. The warehouse tier you can outgrow is far cheaper than the warehouse you under-use.

On this page