Data Warehouses & Lakehouses
Snowflake, BigQuery, Databricks, ClickHouse, Redshift - column-oriented storage and compute for analytics over billions of rows
Data Warehouses & Lakehouses
A data warehouse is purpose-built for analytical queries — SELECT region, SUM(revenue) FROM orders WHERE year = 2026 GROUP BY region over a billion rows in seconds. It's the opposite of an OLTP database (Postgres, MySQL) which is built for transactional reads and writes.
Modern data warehouses are cloud-native, separated storage from compute, and columnar. The lines between "warehouse" and "lakehouse" have blurred — Databricks-style lakehouses store data as open formats (Parquet, Iceberg, Delta) on cheap object storage with multiple compute engines layered on top. Snowflake and BigQuery offer similar separation with proprietary formats.
The category is its own beast. It's adjacent to but distinct from:
- Time-series databases — time-shaped data, short queries
- Search — text retrieval
- Vector databases — embedding similarity
- OLTP (Postgres, MySQL) — small transactional reads/writes
If your query reads millions of rows and returns aggregates, a data warehouse is the right tool.
Why a Warehouse and Not Postgres
| Concern | Postgres | Warehouse |
|---|---|---|
| Single-row reads | Microseconds | Slow (10-100ms) |
| Aggregations over millions of rows | Slow without tuning | Seconds, by design |
| Storage format | Row-oriented | Column-oriented (10-100× compression) |
| Compute scaling | Vertical (one big box) | Horizontal (clusters of workers) |
| Concurrency model | OLTP MVCC | Snapshot reads at massive parallelism |
| Update-heavy workload | Strong | Generally weak (append, not update) |
| Cost per TB-month | $$ (always-on instance) | $ (compute decoupled from storage) |
The breaking point is usually single-table size > 1 billion rows + analytical query patterns. Below that, Postgres handles analytics fine with proper indexes.
The Players
Cloud-native warehouses (proprietary)
| Tool | Strengths |
|---|---|
| Snowflake | Best-in-class UX; clear separation of compute clusters; multi-cloud; SaaS |
| BigQuery (GCP) | Serverless; per-query pricing; great for ad-hoc; deep ML integration |
| Redshift (AWS) | AWS-native; cheap reserved capacity; RA3 (separated storage) reasonable |
| Synapse (Azure) | Azure-native; SQL + Spark together |
Lakehouses (open formats)
| Tool | Strengths |
|---|---|
| Databricks | Delta Lake; ML-first; Apache Spark; Unity Catalog |
| Apache Iceberg + engine | Open table format; works with Trino, Spark, Snowflake, BigQuery, Athena |
| Apache Hudi | Open; transactional; CDC-friendly |
| Delta Lake | Open format; born at Databricks |
OLAP engines (often self-hosted)
| Tool | Strengths |
|---|---|
| ClickHouse | Fastest OSS; columnar; SQL; ingestion at MM/s; insanely fast point lookups |
| Apache Druid | Real-time + batch; pre-computed indexes; sub-second slice/dice |
| Apache Pinot | LinkedIn's; real-time analytics; strong at user-facing analytics |
| DuckDB | In-process; embedded analytics; "SQLite for analytics" |
Query engines (decoupled from storage)
| Tool | Strengths |
|---|---|
| Trino (PrestoSQL fork) | Query S3, Hive, Iceberg, JDBC sources; federated SQL |
| Spark SQL | Inside Databricks/EMR; integrates with ML pipelines |
| Athena (AWS) | Managed Presto over S3; pay per query |
| DuckDB | Can query Parquet on S3 directly; surprisingly capable |
How to pick:
- Don't want to run infra, willing to pay → Snowflake or BigQuery
- AWS-native → Athena (light) or Redshift (heavy)
- Self-hosted, fastest, single-DB → ClickHouse
- ML pipelines first-class → Databricks
- Want open formats / vendor flexibility → Iceberg + Trino or Databricks + Delta
- User-facing analytics dashboards → ClickHouse, Pinot, Druid
- Embedded / local / notebook → DuckDB
What "Columnar" Actually Means
OLTP databases store rows together:
[row 1: id=1, name="Alice", revenue=100, region="US"]
[row 2: id=2, name="Bob", revenue=50, region="EU"]When you SELECT name FROM users WHERE id = 1, this is efficient — read one row, return one column.
Columnar storage groups by column:
ids: [1, 2, 3, 4, ...]
names: ["Alice", "Bob", "Carol", "Dave", ...]
revenue: [100, 50, 200, 80, ...]
region: ["US", "EU", "US", "AS", ...]When you SELECT SUM(revenue) WHERE region='US':
- Read only the
revenueandregioncolumns (skipname,id) - Compression on similar values is huge (run-length encoding on
region, delta encoding onid) - SIMD-vectorized aggregation across the column
The result: 10-100× less I/O for typical analytical queries, and CPU instructions designed for the shape.
The trade: single-row update is slow (rewrite the column). Update-heavy workloads stay OLTP; append-heavy + read-heavy moves to columnar.
Separated Storage and Compute
Modern warehouses store data in cheap object storage (S3, GCS, Azure Blob) and run compute clusters that read from it on demand. Three big consequences:
- Storage costs are cheap and decoupled: $20/TB-month vs. $200/TB-month for legacy DB storage.
- Compute can scale independently: run a giant query, spin up 100 workers, return result, shut them down.
- Multiple compute clusters can read the same data: dev cluster, prod cluster, ML pipeline cluster — all on the same source.
Snowflake popularized this. BigQuery is serverless (you don't see the clusters). Databricks runs Spark clusters on object storage. ClickHouse Cloud also separates them. Older warehouses (Redshift classic, on-prem) couple storage and compute and cost a lot more.
ELT vs. ETL
The data engineering shift of the 2010s:
- ETL (Extract → Transform → Load): transform in flight, load clean data into a small warehouse. The shape was constrained by the warehouse size.
- ELT (Extract → Load → Transform): dump raw into a cheap, massive warehouse; transform with SQL inside. Dominant pattern today.
Tools:
- Fivetran / Airbyte / Estuary — managed extract+load from SaaS and DBs
- dbt — SQL transformations in the warehouse; the de facto standard for analytics engineering
- Workflow orchestration (Workflow Orchestration) — runs the dbt models, ingestion jobs
A modern data stack:
[SaaS / OLTP / events] → Fivetran/Airbyte → warehouse (raw schema)
↓ dbt models (SQL)
warehouse (analytics schema)
↓
BI tool (Looker/Metabase/Hex)Learning Path
1. Getting Started
Run ClickHouse locally; load a public dataset; explore columnar speed; try DuckDB for in-process analytics
2. Patterns
Star schemas, slowly changing dimensions, dbt models, partitioning and clustering, materialized views, change data capture
3. Best Practices
Cost control, query optimization, data quality, governance, GDPR, lakehouse vs warehouse trade-offs
When You Don't Need a Warehouse
Honest cases:
- Postgres works fine for your analytics until ~100GB and a few hundred queries/day. Don't over-engineer.
- Your "analytics" is a few daily reports — a nightly Postgres replica is cheaper.
- You have small data and want fast iteration — DuckDB embedded into a notebook is a serious analytics environment.
- You need real-time row-level decisions — that's OLTP, not warehouse.
The warehouse pays off when:
- Your analytical queries are slowing down OLTP
- You want to mix data from many sources (SaaS, app DB, events)
- Multiple teams (analytics, ML, product) want the same data
- Compliance / audit requires a queryable history of everything
The warehouse you don't think about is the one running well. A warehouse where every query feels fast, dbt builds in minutes, and the bill is predictable is doing its job. The mistake people make is buying the most "powerful" warehouse before they have the data shape that justifies it. Start small (DuckDB, Postgres, ClickHouse self-hosted, Snowflake free tier), grow as needed. The warehouse tier you can outgrow is far cheaper than the warehouse you under-use.