Steven's Knowledge

Best Practices

Cost control, query optimization, data quality, governance, GDPR, lakehouse vs warehouse trade-offs

Best Practices

The operational realities of running a warehouse that produces value without producing a runaway bill.

Cost Control

The warehouse bill is the most surprising line in many cloud invoices. Levers:

Per-query cost awareness

In BigQuery, every query has a bytesProcessed number. Show it to engineers:

-- Dry-run to see cost
SELECT * FROM `project.dataset.events`;
-- "This query will process 1.2 TB"
-- At $5/TB, that's $6 per run.

A single careless query can cost more than a month of a small Postgres instance. Make this visible in your BI tool, dbt logs, etc.

Cluster sizing (Snowflake)

A small warehouse (compute cluster) is cheap. A massive one is expensive. Right-size per workload:

-- For BI dashboards (many small queries)
ALTER WAREHOUSE BI_WH SET WAREHOUSE_SIZE = 'SMALL';

-- For nightly ETL
ALTER WAREHOUSE ETL_WH SET WAREHOUSE_SIZE = 'LARGE';

-- Auto-suspend after idle
ALTER WAREHOUSE BI_WH SET AUTO_SUSPEND = 60;  -- seconds

Multiple warehouses (clusters) so heavy ETL doesn't starve dashboards. Auto-suspend so you don't pay for idle.

Slot reservation (BigQuery)

Pay-per-query is great until you have many queries. Flat-rate / slot commitments can save 50%+ at scale:

# Commit to 500 slots for 1 year
bq mk --reservation --location=US --slots=500 \
  --plan=ANNUAL my_reservation

Above 5-10TB/day query volume, slots win.

Storage tiering

Cold data goes to cheap storage:

  • Snowflake: time-travel + fail-safe count toward storage. Reduce retention on huge tables.
  • BigQuery: tables become long_term_storage after 90 days (50% cheaper) — automatic.
  • S3/GCS (lakehouse): lifecycle rules to Glacier / Coldline.

A 5TB table you query once a quarter shouldn't sit in hot storage.

Materialized views and caching

Re-running an aggregation that hasn't changed wastes money. MVs and result caching turn N queries into 1 + (N-1) cache hits.

Cost attribution

Tag queries / warehouses to teams. Snowflake has tags + query tags. BigQuery has labels. Show each team their monthly spend. Cost transparency drives accountability.

Connect to FinOps for cross-cutting cost ops.

Query Optimization

The same data, same warehouse, can be 100× faster with better SQL.

Read fewer columns

-- Bad: pulls every column from a 200-column table
SELECT * FROM events WHERE day = '2026-05-21';

-- Good: only what you need
SELECT user_id, event_type, revenue FROM events WHERE day = '2026-05-21';

Columnar storage means SELECT * is often 10-50× more expensive than naming columns.

Filter early

Push WHERE clauses as low as possible:

-- Bad: filters after joining (millions of joined rows)
SELECT ...
FROM big_fact f JOIN dim d ON f.id = d.id
WHERE f.event_date = '2026-05-21';

-- Good (often automatic, but explicit):
SELECT ...
FROM (SELECT * FROM big_fact WHERE event_date = '2026-05-21') f
JOIN dim d ON f.id = d.id;

Most warehouses' optimizers do this automatically. Don't trust — check.

Approximate aggregations

Exact distinct count over a billion rows is expensive. Approximations are usually fine:

-- Snowflake / BigQuery
SELECT APPROX_COUNT_DISTINCT(user_id) FROM events;

-- ClickHouse
SELECT uniq(user_id) FROM events;

99% accuracy, 10× faster. For dashboards, the accuracy is invisible to humans.

EXPLAIN your queries

-- BigQuery
SELECT ... -- click "Execution details" in the UI

-- Snowflake
EXPLAIN USING TEXT SELECT ...

-- ClickHouse
EXPLAIN PLAN SELECT ...

-- DuckDB
EXPLAIN SELECT ...

Look for: huge scans (missing filter), nested joins (re-order), spill to disk (under-provisioned).

Avoid join order pitfalls

In some warehouses (especially older / self-hosted), join order matters. The smaller table goes on the right of an inner join for most engines. The query planner usually fixes this — but check on truly large tables.

Data Quality

A wrong number in the dashboard is worse than no number. Discipline:

Tests at every layer

  • Source freshness: did Fivetran load today's data? (dbt source freshness)
  • Row count sanity: today's daily count is between 80-120% of yesterday's
  • Schema drift: source added a new column, your model ignores it (or fails loudly)
  • Uniqueness: unique test on primary key columns
  • Relationships: foreign keys actually resolve
  • Custom: business invariants ("revenue is never negative")

dbt + dbt-utils + dbt-expectations cover most.

Contracts on shared models

If marketing and analytics both use dim_user, changes ripple. Define a contract:

# dbt model contract
models:
  - name: dim_user
    config:
      contract: { enforced: true }
    columns:
      - name: user_id
        data_type: bigint
        constraints: [{ type: not_null }, { type: primary_key }]
      - name: email
        data_type: varchar

Schema changes that violate the contract fail at build. Downstream teams aren't surprised.

Monitoring

Tools like Monte Carlo, Bigeye, Soda, Anomalo, Elementary (open) detect anomalies in your warehouse — unexpected row count drops, schema changes, freshness misses. The "Datadog for data" category.

For smaller orgs: dbt + a Slack notification on test failure is enough.

Governance

As more teams use the warehouse:

  • Naming conventions: fact_*, dim_*, stg_*, int_*. Predictable.
  • Documentation: every dbt model has a description and meta.owner. The catalog is browsable.
  • Lineage: dbt docs renders the DAG; downstream effects of changes are visible.
  • Access control: role-based (analytics_role, ml_role, exec_role). Tables tagged with sensitivity (pii, financial).
  • Approval for PII access: query an audit log; legal reviews requests.
  • Single source of metrics: a metric layer (dbt metrics, Cube, MetricFlow, Looker LookML) prevents 5 versions of "active users".

The warehouse becomes the operational data system for the company. Govern it like one.

PII and Compliance

The warehouse often contains PII. Approach:

  • Column-level encryption / masking: Snowflake's Dynamic Data Masking, BigQuery's column-level encryption. Only authorized roles see the cleartext.
  • Tokenization: customer emails replaced with tokens; reversible only by an isolated service.
  • PII inventory: catalog which columns are PII; review quarterly.
  • GDPR right-to-be-forgotten: handle deletion. Hard in a warehouse (immutable formats). Patterns:
    • Tombstone table: mark deleted; views filter them out
    • Periodic rebuild: rebuild affected tables monthly without forgotten users
    • Iceberg / Delta merge-delete: modern open-format support
  • Cross-region transfer: GDPR may forbid sending EU data to a US warehouse. Per-region warehouses or aggregation-only export.

Document the PII flow. Auditors ask.

Lakehouse vs. Warehouse Trade-Offs

Both can be the right answer. Choose by:

ConcernPure warehouse (Snowflake/BigQuery)Lakehouse (Databricks/Iceberg+Trino)
Vendor lock-inHigh (proprietary formats)Low (open formats: Parquet, Iceberg, Delta)
Time to valueFastSlower (more pieces to configure)
SQL ergonomicsExcellentGood
ML workflowsOKFirst-class (Spark, MLflow integrated)
Cost at scaleExpensive at top endCheaper (cheap object storage)
Operational burdenLow (managed)Higher (you operate more)
Multi-engineLimitedNative (same data, multiple engines)

A pragmatic mid-path: Snowflake or BigQuery as primary, Iceberg-on-S3 as long-term cold archive. Best of both.

Backup, DR, Migration

The warehouse is your business intelligence; treat it like a critical system. See Disaster Recovery.

  • Time-travel for short-term ($ to days back)
  • Snapshot exports to object storage for longer recovery
  • Cross-region replication (Snowflake replication, BigQuery cross-region, Databricks DR)
  • Documented migration plan: vendor change is hard. Iceberg/Delta on object storage makes future migrations cheaper.

When Models Get Slow

Slowdowns happen as data grows. Diagnose:

  1. Single query slow → query optimization, partitioning, clustering
  2. All queries slow → warehouse size too small or concurrent queries fighting
  3. dbt build slow → incremental models; parallelize independent models
  4. Specific model slow → break into smaller models; materialize as table not view; precompute aggregates

A monthly "slow query review" with the data team keeps things tuned. Stale models from 2 years ago that nobody uses but still run nightly should be deleted.

Adoption Patterns

For organizations adopting a warehouse:

  1. First team: pick one team with real analytical needs (often product analytics). Get them shipping value.
  2. Second team: marketing or finance. Different data, similar pattern.
  3. Centralize on dbt: prevents N versions of similar transforms.
  4. Single metric layer: define metrics in code, used by all BI tools.
  5. Self-serve queries: business users in tools like Hex, Mode, Metabase, Tableau. Analytics engineers maintain the marts.
  6. Productionize: dashboards have SLAs; alerts; on-call rotation for data infra.

This takes years for big orgs. Start small; demonstrate value; expand.

Common Pitfalls

No partitioning. Big table; every query reads it all. Bills surprise; queries slow.

Stale data, no freshness check. Dashboards quietly show yesterday's data because Fivetran broke. Add source freshness tests.

Cost-blind engineers. Writing exploratory queries against TB-tables. Cost guardrails (BigQuery max-bytes-billed limit, Snowflake query timeout + warehouse size limit).

Wide unused tables. Models that nobody queries — but rebuild nightly. Audit dbt model usage; drop dead models.

Warehouse as OLTP. Inserts every request. Use a streaming layer (Kafka → warehouse batch) or a separate OLTP DB.

N versions of a metric. Without a metrics layer, "monthly active users" diverges across teams. Painful re-alignment.

Lift-and-shift from Postgres. Same schema, same queries. The warehouse pattern is different — wide tables, denormalization for serving, aggregated marts. Re-design, don't just port.

Cost transparency only quarterly. Surprise bills. Daily-level cost reporting at team granularity.

Checklist

Data warehouse production readiness:

  • Source data freshness monitored (alerts on staleness)
  • Critical tables have data quality tests
  • dbt or equivalent for transformations (no ad-hoc SQL in production paths)
  • Large tables partitioned + clustered correctly
  • Materialized views or rollups for high-traffic queries
  • PII inventory; column-level masking or tokenization
  • GDPR / data subject request process tested
  • Cost guardrails per user / query (max bytes scanned, warehouse limits)
  • Cost attribution per team; monthly showback
  • Time travel or snapshot strategy for DR
  • Cross-region or cross-cloud replication if business-critical
  • Documented lineage (dbt docs)
  • Metrics defined in one place (metrics layer)
  • Access control with role-based privileges
  • Slow query review monthly; dead model cleanup
  • Engineers can see cost of their queries before running

What's Next

You have a warehouse practice. Connect it to:

On this page