Cost control, query optimization, data quality, governance, GDPR, lakehouse vs warehouse trade-offs

Best Practices

The operational realities of running a warehouse that produces value without producing a runaway bill.

Cost Control

The warehouse bill is the most surprising line in many cloud invoices. Levers:

Per-query cost awareness

In BigQuery, every query has a bytesProcessed number. Show it to engineers:

-- Dry-run to see cost
SELECT * FROM `project.dataset.events`;
-- "This query will process 1.2 TB"
-- At $5/TB, that's $6 per run.

A single careless query can cost more than a month of a small Postgres instance. Make this visible in your BI tool, dbt logs, etc.

Cluster sizing (Snowflake)

A small warehouse (compute cluster) is cheap. A massive one is expensive. Right-size per workload:

-- For BI dashboards (many small queries)
ALTER WAREHOUSE BI_WH SET WAREHOUSE_SIZE = 'SMALL';

-- For nightly ETL
ALTER WAREHOUSE ETL_WH SET WAREHOUSE_SIZE = 'LARGE';

-- Auto-suspend after idle
ALTER WAREHOUSE BI_WH SET AUTO_SUSPEND = 60;  -- seconds

Multiple warehouses (clusters) so heavy ETL doesn't starve dashboards. Auto-suspend so you don't pay for idle.

Slot reservation (BigQuery)

Pay-per-query is great until you have many queries. Flat-rate / slot commitments can save 50%+ at scale:

# Commit to 500 slots for 1 year
bq mk --reservation --location=US --slots=500 \
  --plan=ANNUAL my_reservation

Above 5-10TB/day query volume, slots win.

Storage tiering

Cold data goes to cheap storage:

Snowflake: time-travel + fail-safe count toward storage. Reduce retention on huge tables.
BigQuery: tables become long_term_storage after 90 days (50% cheaper) — automatic.
S3/GCS (lakehouse): lifecycle rules to Glacier / Coldline.

A 5TB table you query once a quarter shouldn't sit in hot storage.

Materialized views and caching

Re-running an aggregation that hasn't changed wastes money. MVs and result caching turn N queries into 1 + (N-1) cache hits.

Cost attribution

Tag queries / warehouses to teams. Snowflake has tags + query tags. BigQuery has labels. Show each team their monthly spend. Cost transparency drives accountability.

Connect to FinOps for cross-cutting cost ops.

Query Optimization

The same data, same warehouse, can be 100× faster with better SQL.

Read fewer columns

-- Bad: pulls every column from a 200-column table
SELECT * FROM events WHERE day = '2026-05-21';

-- Good: only what you need
SELECT user_id, event_type, revenue FROM events WHERE day = '2026-05-21';

Columnar storage means SELECT * is often 10-50× more expensive than naming columns.

Filter early

Push WHERE clauses as low as possible:

-- Bad: filters after joining (millions of joined rows)
SELECT ...
FROM big_fact f JOIN dim d ON f.id = d.id
WHERE f.event_date = '2026-05-21';

-- Good (often automatic, but explicit):
SELECT ...
FROM (SELECT * FROM big_fact WHERE event_date = '2026-05-21') f
JOIN dim d ON f.id = d.id;

Most warehouses' optimizers do this automatically. Don't trust — check.

Approximate aggregations

Exact distinct count over a billion rows is expensive. Approximations are usually fine:

-- Snowflake / BigQuery
SELECT APPROX_COUNT_DISTINCT(user_id) FROM events;

-- ClickHouse
SELECT uniq(user_id) FROM events;

99% accuracy, 10× faster. For dashboards, the accuracy is invisible to humans.

EXPLAIN your queries

-- BigQuery
SELECT ... -- click "Execution details" in the UI

-- Snowflake
EXPLAIN USING TEXT SELECT ...

-- ClickHouse
EXPLAIN PLAN SELECT ...

-- DuckDB
EXPLAIN SELECT ...

Look for: huge scans (missing filter), nested joins (re-order), spill to disk (under-provisioned).

In some warehouses (especially older / self-hosted), join order matters. The smaller table goes on the right of an inner join for most engines. The query planner usually fixes this — but check on truly large tables.

Data Quality

A wrong number in the dashboard is worse than no number. Discipline:

Tests at every layer

Source freshness: did Fivetran load today's data? (dbt source freshness)
Row count sanity: today's daily count is between 80-120% of yesterday's
Schema drift: source added a new column, your model ignores it (or fails loudly)
Uniqueness: unique test on primary key columns
Relationships: foreign keys actually resolve
Custom: business invariants ("revenue is never negative")

dbt + dbt-utils + dbt-expectations cover most.

Contracts on shared models

If marketing and analytics both use dim_user, changes ripple. Define a contract:

# dbt model contract
models:
  - name: dim_user
    config:
      contract: { enforced: true }
    columns:
      - name: user_id
        data_type: bigint
        constraints: [{ type: not_null }, { type: primary_key }]
      - name: email
        data_type: varchar

Schema changes that violate the contract fail at build. Downstream teams aren't surprised.

Monitoring

Tools like Monte Carlo, Bigeye, Soda, Anomalo, Elementary (open) detect anomalies in your warehouse — unexpected row count drops, schema changes, freshness misses. The "Datadog for data" category.

For smaller orgs: dbt + a Slack notification on test failure is enough.

Governance

As more teams use the warehouse:

Naming conventions: fact_*, dim_*, stg_*, int_*. Predictable.
Documentation: every dbt model has a description and meta.owner. The catalog is browsable.
Lineage: dbt docs renders the DAG; downstream effects of changes are visible.
Access control: role-based (analytics_role, ml_role, exec_role). Tables tagged with sensitivity (pii, financial).
Approval for PII access: query an audit log; legal reviews requests.
Single source of metrics: a metric layer (dbt metrics, Cube, MetricFlow, Looker LookML) prevents 5 versions of "active users".

The warehouse becomes the operational data system for the company. Govern it like one.

PII and Compliance

The warehouse often contains PII. Approach:

Column-level encryption / masking: Snowflake's Dynamic Data Masking, BigQuery's column-level encryption. Only authorized roles see the cleartext.
Tokenization: customer emails replaced with tokens; reversible only by an isolated service.
PII inventory: catalog which columns are PII; review quarterly.
GDPR right-to-be-forgotten: handle deletion. Hard in a warehouse (immutable formats). Patterns:
- Tombstone table: mark deleted; views filter them out
- Periodic rebuild: rebuild affected tables monthly without forgotten users
- Iceberg / Delta merge-delete: modern open-format support
Cross-region transfer: GDPR may forbid sending EU data to a US warehouse. Per-region warehouses or aggregation-only export.

Document the PII flow. Auditors ask.

Lakehouse vs. Warehouse Trade-Offs

Both can be the right answer. Choose by:

Concern	Pure warehouse (Snowflake/BigQuery)	Lakehouse (Databricks/Iceberg+Trino)
Vendor lock-in	High (proprietary formats)	Low (open formats: Parquet, Iceberg, Delta)
Time to value	Fast	Slower (more pieces to configure)
SQL ergonomics	Excellent	Good
ML workflows	OK	First-class (Spark, MLflow integrated)
Cost at scale	Expensive at top end	Cheaper (cheap object storage)
Operational burden	Low (managed)	Higher (you operate more)
Multi-engine	Limited	Native (same data, multiple engines)

A pragmatic mid-path: Snowflake or BigQuery as primary, Iceberg-on-S3 as long-term cold archive. Best of both.

Backup, DR, Migration

The warehouse is your business intelligence; treat it like a critical system. See Disaster Recovery.

Time-travel for short-term ($ to days back)
Snapshot exports to object storage for longer recovery
Cross-region replication (Snowflake replication, BigQuery cross-region, Databricks DR)
Documented migration plan: vendor change is hard. Iceberg/Delta on object storage makes future migrations cheaper.

When Models Get Slow

Slowdowns happen as data grows. Diagnose:

Single query slow → query optimization, partitioning, clustering
All queries slow → warehouse size too small or concurrent queries fighting
dbt build slow → incremental models; parallelize independent models
Specific model slow → break into smaller models; materialize as table not view; precompute aggregates

A monthly "slow query review" with the data team keeps things tuned. Stale models from 2 years ago that nobody uses but still run nightly should be deleted.

Adoption Patterns

For organizations adopting a warehouse:

First team: pick one team with real analytical needs (often product analytics). Get them shipping value.
Second team: marketing or finance. Different data, similar pattern.
Centralize on dbt: prevents N versions of similar transforms.
Single metric layer: define metrics in code, used by all BI tools.
Self-serve queries: business users in tools like Hex, Mode, Metabase, Tableau. Analytics engineers maintain the marts.
Productionize: dashboards have SLAs; alerts; on-call rotation for data infra.

This takes years for big orgs. Start small; demonstrate value; expand.

Common Pitfalls

No partitioning. Big table; every query reads it all. Bills surprise; queries slow.

Stale data, no freshness check. Dashboards quietly show yesterday's data because Fivetran broke. Add source freshness tests.

Cost-blind engineers. Writing exploratory queries against TB-tables. Cost guardrails (BigQuery max-bytes-billed limit, Snowflake query timeout + warehouse size limit).

Wide unused tables. Models that nobody queries — but rebuild nightly. Audit dbt model usage; drop dead models.

Warehouse as OLTP. Inserts every request. Use a streaming layer (Kafka → warehouse batch) or a separate OLTP DB.

N versions of a metric. Without a metrics layer, "monthly active users" diverges across teams. Painful re-alignment.

Lift-and-shift from Postgres. Same schema, same queries. The warehouse pattern is different — wide tables, denormalization for serving, aggregated marts. Re-design, don't just port.

Cost transparency only quarterly. Surprise bills. Daily-level cost reporting at team granularity.

Checklist

What's Next

You have a warehouse practice. Connect it to:

Workflow Orchestration — Airflow/Dagster/Prefect orchestrate the dbt runs, ingestion, exports
Stream Processing — real-time feeds into the warehouse via Kafka + Materialize
Time-Series Databases — operational metrics next to analytical data
Object Storage — S3/GCS is the backbone of the lakehouse pattern
FinOps — warehouse cost is often a top-5 cloud line
MLOps — warehouse data feeds ML feature pipelines

Best Practices

Best Practices

Cost Control

Per-query cost awareness

Cluster sizing (Snowflake)

Slot reservation (BigQuery)

Storage tiering

Materialized views and caching

Cost attribution

Query Optimization

Read fewer columns

Filter early

Approximate aggregations

EXPLAIN your queries

Avoid join order pitfalls

Data Quality

Tests at every layer

Contracts on shared models

Monitoring

Governance

PII and Compliance

Lakehouse vs. Warehouse Trade-Offs

Backup, DR, Migration

When Models Get Slow

Adoption Patterns

Common Pitfalls

Checklist

What's Next

On this page