Snowflake, BigQuery, Databricks, ClickHouse, Redshift - column-oriented storage and compute for analytics over billions of rows

Data Warehouses & Lakehouses

A data warehouse is purpose-built for analytical queries — SELECT region, SUM(revenue) FROM orders WHERE year = 2026 GROUP BY region over a billion rows in seconds. It's the opposite of an OLTP database (Postgres, MySQL) which is built for transactional reads and writes.

Modern data warehouses are cloud-native, separated storage from compute, and columnar. The lines between "warehouse" and "lakehouse" have blurred — Databricks-style lakehouses store data as open formats (Parquet, Iceberg, Delta) on cheap object storage with multiple compute engines layered on top. Snowflake and BigQuery offer similar separation with proprietary formats.

The category is its own beast. It's adjacent to but distinct from:

Time-series databases — time-shaped data, short queries
Search — text retrieval
Vector databases — embedding similarity
OLTP (Postgres, MySQL) — small transactional reads/writes

If your query reads millions of rows and returns aggregates, a data warehouse is the right tool.

Why a Warehouse and Not Postgres

Concern	Postgres	Warehouse
Single-row reads	Microseconds	Slow (10-100ms)
Aggregations over millions of rows	Slow without tuning	Seconds, by design
Storage format	Row-oriented	Column-oriented (10-100× compression)
Compute scaling	Vertical (one big box)	Horizontal (clusters of workers)
Concurrency model	OLTP MVCC	Snapshot reads at massive parallelism
Update-heavy workload	Strong	Generally weak (append, not update)
Cost per TB-month	$$ (always-on instance)	$ (compute decoupled from storage)

The breaking point is usually single-table size > 1 billion rows + analytical query patterns. Below that, Postgres handles analytics fine with proper indexes.

The Players

Cloud-native warehouses (proprietary)

Tool	Strengths
Snowflake	Best-in-class UX; clear separation of compute clusters; multi-cloud; SaaS
BigQuery (GCP)	Serverless; per-query pricing; great for ad-hoc; deep ML integration
Redshift (AWS)	AWS-native; cheap reserved capacity; RA3 (separated storage) reasonable
Synapse (Azure)	Azure-native; SQL + Spark together

Lakehouses (open formats)

Tool	Strengths
Databricks	Delta Lake; ML-first; Apache Spark; Unity Catalog
Apache Iceberg + engine	Open table format; works with Trino, Spark, Snowflake, BigQuery, Athena
Apache Hudi	Open; transactional; CDC-friendly
Delta Lake	Open format; born at Databricks

OLAP engines (often self-hosted)

Tool	Strengths
ClickHouse	Fastest OSS; columnar; SQL; ingestion at MM/s; insanely fast point lookups
Apache Druid	Real-time + batch; pre-computed indexes; sub-second slice/dice
Apache Pinot	LinkedIn's; real-time analytics; strong at user-facing analytics
DuckDB	In-process; embedded analytics; "SQLite for analytics"

Query engines (decoupled from storage)

Tool	Strengths
Trino (PrestoSQL fork)	Query S3, Hive, Iceberg, JDBC sources; federated SQL
Spark SQL	Inside Databricks/EMR; integrates with ML pipelines
Athena (AWS)	Managed Presto over S3; pay per query
DuckDB	Can query Parquet on S3 directly; surprisingly capable

How to pick:

Don't want to run infra, willing to pay → Snowflake or BigQuery
AWS-native → Athena (light) or Redshift (heavy)
Self-hosted, fastest, single-DB → ClickHouse
ML pipelines first-class → Databricks
Want open formats / vendor flexibility → Iceberg + Trino or Databricks + Delta
User-facing analytics dashboards → ClickHouse, Pinot, Druid
Embedded / local / notebook → DuckDB

What "Columnar" Actually Means

OLTP databases store rows together:

[row 1: id=1, name="Alice", revenue=100, region="US"]
[row 2: id=2, name="Bob",   revenue=50,  region="EU"]

When you SELECT name FROM users WHERE id = 1, this is efficient — read one row, return one column.

Columnar storage groups by column:

ids:     [1, 2, 3, 4, ...]
names:   ["Alice", "Bob", "Carol", "Dave", ...]
revenue: [100, 50, 200, 80, ...]
region:  ["US", "EU", "US", "AS", ...]

When you SELECT SUM(revenue) WHERE region='US':

Read only the revenue and region columns (skip name, id)
Compression on similar values is huge (run-length encoding on region, delta encoding on id)
SIMD-vectorized aggregation across the column

The result: 10-100× less I/O for typical analytical queries, and CPU instructions designed for the shape.

The trade: single-row update is slow (rewrite the column). Update-heavy workloads stay OLTP; append-heavy + read-heavy moves to columnar.

Separated Storage and Compute

Modern warehouses store data in cheap object storage (S3, GCS, Azure Blob) and run compute clusters that read from it on demand. Three big consequences:

Storage costs are cheap and decoupled: $20/TB-month vs. $200/TB-month for legacy DB storage.
Compute can scale independently: run a giant query, spin up 100 workers, return result, shut them down.
Multiple compute clusters can read the same data: dev cluster, prod cluster, ML pipeline cluster — all on the same source.

Snowflake popularized this. BigQuery is serverless (you don't see the clusters). Databricks runs Spark clusters on object storage. ClickHouse Cloud also separates them. Older warehouses (Redshift classic, on-prem) couple storage and compute and cost a lot more.

ELT vs. ETL

The data engineering shift of the 2010s:

ETL (Extract → Transform → Load): transform in flight, load clean data into a small warehouse. The shape was constrained by the warehouse size.
ELT (Extract → Load → Transform): dump raw into a cheap, massive warehouse; transform with SQL inside. Dominant pattern today.

Tools:

Fivetran / Airbyte / Estuary — managed extract+load from SaaS and DBs
dbt — SQL transformations in the warehouse; the de facto standard for analytics engineering
Workflow orchestration (Workflow Orchestration) — runs the dbt models, ingestion jobs

A modern data stack:

[SaaS / OLTP / events] → Fivetran/Airbyte → warehouse (raw schema)
                                                 ↓ dbt models (SQL)
                                            warehouse (analytics schema)
                                                 ↓
                                          BI tool (Looker/Metabase/Hex)

Learning Path

1. Getting Started

Run ClickHouse locally; load a public dataset; explore columnar speed; try DuckDB for in-process analytics

2. Patterns

Star schemas, slowly changing dimensions, dbt models, partitioning and clustering, materialized views, change data capture

3. Best Practices

Cost control, query optimization, data quality, governance, GDPR, lakehouse vs warehouse trade-offs

When You Don't Need a Warehouse

Honest cases:

Postgres works fine for your analytics until ~100GB and a few hundred queries/day. Don't over-engineer.
Your "analytics" is a few daily reports — a nightly Postgres replica is cheaper.
You have small data and want fast iteration — DuckDB embedded into a notebook is a serious analytics environment.
You need real-time row-level decisions — that's OLTP, not warehouse.

The warehouse pays off when:

Your analytical queries are slowing down OLTP
You want to mix data from many sources (SaaS, app DB, events)
Multiple teams (analytics, ML, product) want the same data
Compliance / audit requires a queryable history of everything

The warehouse you don't think about is the one running well. A warehouse where every query feels fast, dbt builds in minutes, and the bill is predictable is doing its job. The mistake people make is buying the most "powerful" warehouse before they have the data shape that justifies it. Start small (DuckDB, Postgres, ClickHouse self-hosted, Snowflake free tier), grow as needed. The warehouse tier you can outgrow is far cheaper than the warehouse you under-use.

Data Warehouses & Lakehouses

1. Getting Started

2. Patterns

3. Best Practices

On this page