Steven's Knowledge

Patterns

Star schemas, slowly changing dimensions, dbt models, partitioning and clustering, materialized views, change data capture

Patterns

The patterns that scale warehouse analytics from "one team's reports" to "the source of truth for the company."

Star Schema

The canonical analytics layout:

                  ┌──────────────┐
                  │ dim_date     │
                  │ - date_key   │
                  │ - day        │
                  │ - month      │
                  │ - quarter    │
                  └──────┬───────┘

┌──────────────┐    ┌────┴───────┐    ┌──────────────┐
│ dim_customer │◄───┤ fact_sales ├───►│ dim_product  │
│ - cust_key   │    │ - sale_id  │    │ - prod_key   │
│ - name       │    │ - cust_key │    │ - name       │
│ - segment    │    │ - prod_key │    │ - category   │
│ - region     │    │ - date_key │    │ - price      │
└──────────────┘    │ - quantity │    └──────────────┘
                    │ - revenue  │
                    └────────────┘
  • Facts: events with metrics (sales, page views, signups). Large; append-mostly.
  • Dimensions: descriptive context (customers, products, time). Small; updateable.
  • Foreign keys: facts reference dimensions.

Queries follow the pattern:

SELECT
  d.month,
  c.region,
  p.category,
  SUM(f.revenue) AS revenue
FROM fact_sales f
JOIN dim_date d     ON f.date_key = d.date_key
JOIN dim_customer c ON f.cust_key = c.cust_key
JOIN dim_product p  ON f.prod_key = p.prod_key
WHERE d.year = 2026
GROUP BY 1, 2, 3
ORDER BY 1, 4 DESC;

Star schema isn't sacred — many modern warehouses are fine with denormalized "wide tables." But star schemas remain the cleanest mental model and the easiest to evolve.

Slowly Changing Dimensions (SCD)

Dimensions change over time. A customer moves regions. How do you record that?

TypeBehaviorWhen
SCD Type 1Overwrite. Old value is lost.When history doesn't matter
SCD Type 2Insert a new row with valid_from/valid_to. Old row stays.Most analytical use — preserves history
SCD Type 3Add a column (current_region, previous_region).Limited history
SCD Type 6Hybrid: current value + history tableComplex needs

SCD Type 2 in practice:

-- Customer with region history
CREATE TABLE dim_customer (
  cust_sk     BIGSERIAL,         -- surrogate key
  cust_id     INT,               -- natural key
  name        TEXT,
  region      TEXT,
  valid_from  TIMESTAMP,
  valid_to    TIMESTAMP,
  is_current  BOOLEAN
);

-- A customer changes region
INSERT INTO dim_customer
  (cust_id, name, region, valid_from, valid_to, is_current)
VALUES
  (42, 'Alice', 'EU', '2026-05-01', NULL, TRUE);

-- Update the previous row
UPDATE dim_customer
SET valid_to = '2026-04-30', is_current = FALSE
WHERE cust_id = 42 AND is_current = TRUE AND cust_sk != /* new SK */;

fact_sales references cust_sk (surrogate), so historical facts join to the historical dimension row. Reports for 2025 show Alice in her old region; 2026 reports show her in the new one.

dbt: Transformations as Code

dbt is the standard for SQL-based ELT. Models are SQL files; dbt handles dependencies, materialization, tests, and docs.

-- models/staging/stg_sales.sql
WITH source AS (
  SELECT * FROM {{ source('raw', 'sales_events') }}
)
SELECT
  event_id AS sale_id,
  customer_id,
  product_id,
  CAST(amount AS NUMERIC) AS revenue,
  CAST(event_time AS TIMESTAMP) AS sale_time
FROM source
WHERE event_type = 'purchase'
-- models/marts/fact_sales.sql
{{ config(materialized='incremental', unique_key='sale_id', on_schema_change='append_new_columns') }}

SELECT
  s.sale_id,
  s.customer_id,
  s.product_id,
  s.revenue,
  s.sale_time,
  TO_CHAR(s.sale_time, 'YYYY-MM-DD') AS date_key
FROM {{ ref('stg_sales') }} s
{% if is_incremental() %}
WHERE s.sale_time > (SELECT MAX(sale_time) FROM {{ this }})
{% endif %}

dbt run rebuilds the DAG in dependency order. {{ ref() }} lets dbt know what's downstream of what. Incremental models append new rows instead of rebuilding the whole table.

Tests

# models/marts/schema.yml
version: 2
models:
  - name: fact_sales
    columns:
      - name: sale_id
        tests: [unique, not_null]
      - name: customer_id
        tests:
          - not_null
          - relationships: { to: ref('dim_customer'), field: customer_id }
      - name: revenue
        tests:
          - dbt_utils.expression_is_true:
              expression: ">= 0"

dbt test runs them after each build. Tests fail the pipeline if data quality drops.

dbt Macros (Reuse)

-- macros/clean_phone.sql
{% macro clean_phone(column_name) %}
  REGEXP_REPLACE({{ column_name }}, '[^0-9]', '')
{% endmacro %}

-- use in models
SELECT id, {{ clean_phone('phone') }} AS phone_normalized FROM customers

Macros eliminate copy-paste SQL across models. Adopt them when you find yourself duplicating logic.

Partitioning and Clustering

A 10-billion-row table without partitioning is painful. Warehouses use:

ConceptWhatExample
PartitionPhysical split by a column; queries can prune entire partitionsPARTITION BY DATE(event_time)
Cluster (Snowflake) / Sort key (Redshift) / Order by (ClickHouse)Order data within a partition for range queriesORDER BY (region, user_id, event_time)

In BigQuery:

CREATE TABLE events
PARTITION BY DATE(event_time)
CLUSTER BY user_id, event_type
AS SELECT ... FROM source;

A query with WHERE DATE(event_time) BETWEEN ... AND user_id = X prunes partitions and clusters, scanning a fraction of the table. Partition pruning is the difference between a $0.05 query and a $50 query.

In ClickHouse: choose ORDER BY carefully — that's also the primary index. Common: (event_type, event_date, user_id).

Materialized Views and Incremental Aggregations

The "100M rows of source data, queried 1000 times" problem. Pre-compute the rollup:

-- BigQuery materialized view
CREATE MATERIALIZED VIEW mv_daily_sales AS
SELECT
  DATE(sale_time) AS day,
  region,
  SUM(revenue) AS revenue,
  COUNT(*) AS sales
FROM fact_sales
GROUP BY 1, 2;

The MV refreshes as source changes; queries hit the small aggregated table. Equivalent in Snowflake is MATERIALIZED VIEW, in dbt materialized='incremental'.

A dashboard reading from the MV is sub-second instead of multi-second, and much cheaper.

Change Data Capture (CDC)

Getting OLTP data into the warehouse continuously, not just nightly:

OLTP DB  →  CDC tool   →  Kafka      →  Warehouse
            (Debezium,                  (raw schema)
             Fivetran)                  ↓ dbt
                                        analytics schema

Tools: Debezium (open source, log-based), Fivetran (managed, easy), Airbyte (open), Estuary. CDC reads the database's write-ahead log and turns inserts/updates/deletes into events.

This is how you get sub-minute freshness in the warehouse instead of nightly. Trade: more moving pieces, more chance for drift between OLTP and warehouse.

Wide Tables vs. Snowflake Schema

Two extremes of normalization:

Wide tableSnowflake schema
Single denormalized fact table with all the dimensions inlinedFully normalized; star schema with dimension tables further split into sub-dimensions
Fast queries (no joins)Slower queries (many joins)
Storage cost: high (duplicated dimension data)Storage cost: low
Schema changes hurtSchema changes localized
Modern columnar handles wide tables fineOld-school best practice

Modern warehouses (Snowflake, BigQuery, ClickHouse) make wide tables practical because:

  • Columns you don't query don't get read
  • Compression on repeated dimension values is good
  • No join cost

A pragmatic middle: star schema in the analytics layer (clean, evolvable), wide table for serving (fast for dashboards).

Time-Travel and Snapshotting

Modern warehouses keep recent history:

-- Snowflake: query as of 2 hours ago
SELECT * FROM orders AT(OFFSET => -7200);

-- BigQuery: query as of timestamp
SELECT * FROM `project.dataset.orders`
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 HOUR);

-- Iceberg (open table format): time travel by snapshot id
SELECT * FROM orders.history;
SELECT * FROM orders FOR SYSTEM_VERSION AS OF 1234567890;

Use cases:

  • Undo a bad transform: revert to the table state before yesterday's dbt run
  • Audit: "what did the data look like on this date?"
  • Reproducibility: ML model training reproduces with the exact data snapshot

Time-travel retention is typically days (default) to months (paid feature). Cheap, useful, often saves a Friday.

Lakehouse Pattern (Bronze/Silver/Gold)

Databricks' framing, useful even on other stacks:

Bronze (raw):     ingested as-is, immutable, fast to land
   ↓ light cleanup
Silver (cleaned): parsed, normalized, deduped, sometimes joined
   ↓ aggregation, business logic
Gold (curated):   analytics-ready facts and dimensions; what BI uses

Each layer is a set of tables, often in Parquet/Iceberg/Delta. Re-running Bronze→Silver→Gold is cheap because storage is cheap.

dbt's "staging → marts" structure mirrors this. The lakehouse is just the storage + table-format flavor.

Anti-Patterns

Treating the warehouse like OLTP. Single-row inserts every second. Death by overhead. Batch.

Updates over big tables. Columnar storage rewrites a whole partition per update. Append + use the latest row; or use MERGE carefully.

No partitioning on the biggest table. Every query scans the full thing. Bill bloats; queries slow.

dbt sprawl. 500 models, no organization, unclear ownership. Group by domain; enforce naming.

Source-of-truth confusion. Five teams compute "active users" five different ways. Define metrics in one place (dbt models, Cube, Looker LookML).

No tests. Data quality drifts; dashboards lie. Tests on critical columns; check rows-per-day doesn't fall by half overnight.

Querying raw source tables in dashboards. Performance is terrible; raw shape changes break things. Always go through transformed marts.

What's Next

  • Best Practices — cost control, query optimization, data quality, governance, lakehouse trade-offs

On this page