Patterns
Star schemas, slowly changing dimensions, dbt models, partitioning and clustering, materialized views, change data capture
Patterns
The patterns that scale warehouse analytics from "one team's reports" to "the source of truth for the company."
Star Schema
The canonical analytics layout:
┌──────────────┐
│ dim_date │
│ - date_key │
│ - day │
│ - month │
│ - quarter │
└──────┬───────┘
│
┌──────────────┐ ┌────┴───────┐ ┌──────────────┐
│ dim_customer │◄───┤ fact_sales ├───►│ dim_product │
│ - cust_key │ │ - sale_id │ │ - prod_key │
│ - name │ │ - cust_key │ │ - name │
│ - segment │ │ - prod_key │ │ - category │
│ - region │ │ - date_key │ │ - price │
└──────────────┘ │ - quantity │ └──────────────┘
│ - revenue │
└────────────┘- Facts: events with metrics (sales, page views, signups). Large; append-mostly.
- Dimensions: descriptive context (customers, products, time). Small; updateable.
- Foreign keys: facts reference dimensions.
Queries follow the pattern:
SELECT
d.month,
c.region,
p.category,
SUM(f.revenue) AS revenue
FROM fact_sales f
JOIN dim_date d ON f.date_key = d.date_key
JOIN dim_customer c ON f.cust_key = c.cust_key
JOIN dim_product p ON f.prod_key = p.prod_key
WHERE d.year = 2026
GROUP BY 1, 2, 3
ORDER BY 1, 4 DESC;Star schema isn't sacred — many modern warehouses are fine with denormalized "wide tables." But star schemas remain the cleanest mental model and the easiest to evolve.
Slowly Changing Dimensions (SCD)
Dimensions change over time. A customer moves regions. How do you record that?
| Type | Behavior | When |
|---|---|---|
| SCD Type 1 | Overwrite. Old value is lost. | When history doesn't matter |
| SCD Type 2 | Insert a new row with valid_from/valid_to. Old row stays. | Most analytical use — preserves history |
| SCD Type 3 | Add a column (current_region, previous_region). | Limited history |
| SCD Type 6 | Hybrid: current value + history table | Complex needs |
SCD Type 2 in practice:
-- Customer with region history
CREATE TABLE dim_customer (
cust_sk BIGSERIAL, -- surrogate key
cust_id INT, -- natural key
name TEXT,
region TEXT,
valid_from TIMESTAMP,
valid_to TIMESTAMP,
is_current BOOLEAN
);
-- A customer changes region
INSERT INTO dim_customer
(cust_id, name, region, valid_from, valid_to, is_current)
VALUES
(42, 'Alice', 'EU', '2026-05-01', NULL, TRUE);
-- Update the previous row
UPDATE dim_customer
SET valid_to = '2026-04-30', is_current = FALSE
WHERE cust_id = 42 AND is_current = TRUE AND cust_sk != /* new SK */;fact_sales references cust_sk (surrogate), so historical facts join to the historical dimension row. Reports for 2025 show Alice in her old region; 2026 reports show her in the new one.
dbt: Transformations as Code
dbt is the standard for SQL-based ELT. Models are SQL files; dbt handles dependencies, materialization, tests, and docs.
-- models/staging/stg_sales.sql
WITH source AS (
SELECT * FROM {{ source('raw', 'sales_events') }}
)
SELECT
event_id AS sale_id,
customer_id,
product_id,
CAST(amount AS NUMERIC) AS revenue,
CAST(event_time AS TIMESTAMP) AS sale_time
FROM source
WHERE event_type = 'purchase'-- models/marts/fact_sales.sql
{{ config(materialized='incremental', unique_key='sale_id', on_schema_change='append_new_columns') }}
SELECT
s.sale_id,
s.customer_id,
s.product_id,
s.revenue,
s.sale_time,
TO_CHAR(s.sale_time, 'YYYY-MM-DD') AS date_key
FROM {{ ref('stg_sales') }} s
{% if is_incremental() %}
WHERE s.sale_time > (SELECT MAX(sale_time) FROM {{ this }})
{% endif %}dbt run rebuilds the DAG in dependency order. {{ ref() }} lets dbt know what's downstream of what. Incremental models append new rows instead of rebuilding the whole table.
Tests
# models/marts/schema.yml
version: 2
models:
- name: fact_sales
columns:
- name: sale_id
tests: [unique, not_null]
- name: customer_id
tests:
- not_null
- relationships: { to: ref('dim_customer'), field: customer_id }
- name: revenue
tests:
- dbt_utils.expression_is_true:
expression: ">= 0"dbt test runs them after each build. Tests fail the pipeline if data quality drops.
dbt Macros (Reuse)
-- macros/clean_phone.sql
{% macro clean_phone(column_name) %}
REGEXP_REPLACE({{ column_name }}, '[^0-9]', '')
{% endmacro %}
-- use in models
SELECT id, {{ clean_phone('phone') }} AS phone_normalized FROM customersMacros eliminate copy-paste SQL across models. Adopt them when you find yourself duplicating logic.
Partitioning and Clustering
A 10-billion-row table without partitioning is painful. Warehouses use:
| Concept | What | Example |
|---|---|---|
| Partition | Physical split by a column; queries can prune entire partitions | PARTITION BY DATE(event_time) |
| Cluster (Snowflake) / Sort key (Redshift) / Order by (ClickHouse) | Order data within a partition for range queries | ORDER BY (region, user_id, event_time) |
In BigQuery:
CREATE TABLE events
PARTITION BY DATE(event_time)
CLUSTER BY user_id, event_type
AS SELECT ... FROM source;A query with WHERE DATE(event_time) BETWEEN ... AND user_id = X prunes partitions and clusters, scanning a fraction of the table. Partition pruning is the difference between a $0.05 query and a $50 query.
In ClickHouse: choose ORDER BY carefully — that's also the primary index. Common: (event_type, event_date, user_id).
Materialized Views and Incremental Aggregations
The "100M rows of source data, queried 1000 times" problem. Pre-compute the rollup:
-- BigQuery materialized view
CREATE MATERIALIZED VIEW mv_daily_sales AS
SELECT
DATE(sale_time) AS day,
region,
SUM(revenue) AS revenue,
COUNT(*) AS sales
FROM fact_sales
GROUP BY 1, 2;The MV refreshes as source changes; queries hit the small aggregated table. Equivalent in Snowflake is MATERIALIZED VIEW, in dbt materialized='incremental'.
A dashboard reading from the MV is sub-second instead of multi-second, and much cheaper.
Change Data Capture (CDC)
Getting OLTP data into the warehouse continuously, not just nightly:
OLTP DB → CDC tool → Kafka → Warehouse
(Debezium, (raw schema)
Fivetran) ↓ dbt
analytics schemaTools: Debezium (open source, log-based), Fivetran (managed, easy), Airbyte (open), Estuary. CDC reads the database's write-ahead log and turns inserts/updates/deletes into events.
This is how you get sub-minute freshness in the warehouse instead of nightly. Trade: more moving pieces, more chance for drift between OLTP and warehouse.
Wide Tables vs. Snowflake Schema
Two extremes of normalization:
| Wide table | Snowflake schema |
|---|---|
| Single denormalized fact table with all the dimensions inlined | Fully normalized; star schema with dimension tables further split into sub-dimensions |
| Fast queries (no joins) | Slower queries (many joins) |
| Storage cost: high (duplicated dimension data) | Storage cost: low |
| Schema changes hurt | Schema changes localized |
| Modern columnar handles wide tables fine | Old-school best practice |
Modern warehouses (Snowflake, BigQuery, ClickHouse) make wide tables practical because:
- Columns you don't query don't get read
- Compression on repeated dimension values is good
- No join cost
A pragmatic middle: star schema in the analytics layer (clean, evolvable), wide table for serving (fast for dashboards).
Time-Travel and Snapshotting
Modern warehouses keep recent history:
-- Snowflake: query as of 2 hours ago
SELECT * FROM orders AT(OFFSET => -7200);
-- BigQuery: query as of timestamp
SELECT * FROM `project.dataset.orders`
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 HOUR);
-- Iceberg (open table format): time travel by snapshot id
SELECT * FROM orders.history;
SELECT * FROM orders FOR SYSTEM_VERSION AS OF 1234567890;Use cases:
- Undo a bad transform: revert to the table state before yesterday's dbt run
- Audit: "what did the data look like on this date?"
- Reproducibility: ML model training reproduces with the exact data snapshot
Time-travel retention is typically days (default) to months (paid feature). Cheap, useful, often saves a Friday.
Lakehouse Pattern (Bronze/Silver/Gold)
Databricks' framing, useful even on other stacks:
Bronze (raw): ingested as-is, immutable, fast to land
↓ light cleanup
Silver (cleaned): parsed, normalized, deduped, sometimes joined
↓ aggregation, business logic
Gold (curated): analytics-ready facts and dimensions; what BI usesEach layer is a set of tables, often in Parquet/Iceberg/Delta. Re-running Bronze→Silver→Gold is cheap because storage is cheap.
dbt's "staging → marts" structure mirrors this. The lakehouse is just the storage + table-format flavor.
Anti-Patterns
Treating the warehouse like OLTP. Single-row inserts every second. Death by overhead. Batch.
Updates over big tables. Columnar storage rewrites a whole partition per update. Append + use the latest row; or use MERGE carefully.
No partitioning on the biggest table. Every query scans the full thing. Bill bloats; queries slow.
dbt sprawl. 500 models, no organization, unclear ownership. Group by domain; enforce naming.
Source-of-truth confusion. Five teams compute "active users" five different ways. Define metrics in one place (dbt models, Cube, Looker LookML).
No tests. Data quality drifts; dashboards lie. Tests on critical columns; check rows-per-day doesn't fall by half overnight.
Querying raw source tables in dashboards. Performance is terrible; raw shape changes break things. Always go through transformed marts.
What's Next
- Best Practices — cost control, query optimization, data quality, governance, lakehouse trade-offs