Best Practices
Cost control, query optimization, data quality, governance, GDPR, lakehouse vs warehouse trade-offs
Best Practices
The operational realities of running a warehouse that produces value without producing a runaway bill.
Cost Control
The warehouse bill is the most surprising line in many cloud invoices. Levers:
Per-query cost awareness
In BigQuery, every query has a bytesProcessed number. Show it to engineers:
-- Dry-run to see cost
SELECT * FROM `project.dataset.events`;
-- "This query will process 1.2 TB"
-- At $5/TB, that's $6 per run.A single careless query can cost more than a month of a small Postgres instance. Make this visible in your BI tool, dbt logs, etc.
Cluster sizing (Snowflake)
A small warehouse (compute cluster) is cheap. A massive one is expensive. Right-size per workload:
-- For BI dashboards (many small queries)
ALTER WAREHOUSE BI_WH SET WAREHOUSE_SIZE = 'SMALL';
-- For nightly ETL
ALTER WAREHOUSE ETL_WH SET WAREHOUSE_SIZE = 'LARGE';
-- Auto-suspend after idle
ALTER WAREHOUSE BI_WH SET AUTO_SUSPEND = 60; -- secondsMultiple warehouses (clusters) so heavy ETL doesn't starve dashboards. Auto-suspend so you don't pay for idle.
Slot reservation (BigQuery)
Pay-per-query is great until you have many queries. Flat-rate / slot commitments can save 50%+ at scale:
# Commit to 500 slots for 1 year
bq mk --reservation --location=US --slots=500 \
--plan=ANNUAL my_reservationAbove 5-10TB/day query volume, slots win.
Storage tiering
Cold data goes to cheap storage:
- Snowflake: time-travel + fail-safe count toward storage. Reduce retention on huge tables.
- BigQuery: tables become
long_term_storageafter 90 days (50% cheaper) — automatic. - S3/GCS (lakehouse): lifecycle rules to Glacier / Coldline.
A 5TB table you query once a quarter shouldn't sit in hot storage.
Materialized views and caching
Re-running an aggregation that hasn't changed wastes money. MVs and result caching turn N queries into 1 + (N-1) cache hits.
Cost attribution
Tag queries / warehouses to teams. Snowflake has tags + query tags. BigQuery has labels. Show each team their monthly spend. Cost transparency drives accountability.
Connect to FinOps for cross-cutting cost ops.
Query Optimization
The same data, same warehouse, can be 100× faster with better SQL.
Read fewer columns
-- Bad: pulls every column from a 200-column table
SELECT * FROM events WHERE day = '2026-05-21';
-- Good: only what you need
SELECT user_id, event_type, revenue FROM events WHERE day = '2026-05-21';Columnar storage means SELECT * is often 10-50× more expensive than naming columns.
Filter early
Push WHERE clauses as low as possible:
-- Bad: filters after joining (millions of joined rows)
SELECT ...
FROM big_fact f JOIN dim d ON f.id = d.id
WHERE f.event_date = '2026-05-21';
-- Good (often automatic, but explicit):
SELECT ...
FROM (SELECT * FROM big_fact WHERE event_date = '2026-05-21') f
JOIN dim d ON f.id = d.id;Most warehouses' optimizers do this automatically. Don't trust — check.
Approximate aggregations
Exact distinct count over a billion rows is expensive. Approximations are usually fine:
-- Snowflake / BigQuery
SELECT APPROX_COUNT_DISTINCT(user_id) FROM events;
-- ClickHouse
SELECT uniq(user_id) FROM events;99% accuracy, 10× faster. For dashboards, the accuracy is invisible to humans.
EXPLAIN your queries
-- BigQuery
SELECT ... -- click "Execution details" in the UI
-- Snowflake
EXPLAIN USING TEXT SELECT ...
-- ClickHouse
EXPLAIN PLAN SELECT ...
-- DuckDB
EXPLAIN SELECT ...Look for: huge scans (missing filter), nested joins (re-order), spill to disk (under-provisioned).
Avoid join order pitfalls
In some warehouses (especially older / self-hosted), join order matters. The smaller table goes on the right of an inner join for most engines. The query planner usually fixes this — but check on truly large tables.
Data Quality
A wrong number in the dashboard is worse than no number. Discipline:
Tests at every layer
- Source freshness: did Fivetran load today's data? (
dbt source freshness) - Row count sanity: today's daily count is between 80-120% of yesterday's
- Schema drift: source added a new column, your model ignores it (or fails loudly)
- Uniqueness:
uniquetest on primary key columns - Relationships: foreign keys actually resolve
- Custom: business invariants ("
revenueis never negative")
dbt + dbt-utils + dbt-expectations cover most.
Contracts on shared models
If marketing and analytics both use dim_user, changes ripple. Define a contract:
# dbt model contract
models:
- name: dim_user
config:
contract: { enforced: true }
columns:
- name: user_id
data_type: bigint
constraints: [{ type: not_null }, { type: primary_key }]
- name: email
data_type: varcharSchema changes that violate the contract fail at build. Downstream teams aren't surprised.
Monitoring
Tools like Monte Carlo, Bigeye, Soda, Anomalo, Elementary (open) detect anomalies in your warehouse — unexpected row count drops, schema changes, freshness misses. The "Datadog for data" category.
For smaller orgs: dbt + a Slack notification on test failure is enough.
Governance
As more teams use the warehouse:
- Naming conventions:
fact_*,dim_*,stg_*,int_*. Predictable. - Documentation: every dbt model has a
descriptionandmeta.owner. The catalog is browsable. - Lineage: dbt docs renders the DAG; downstream effects of changes are visible.
- Access control: role-based (analytics_role, ml_role, exec_role). Tables tagged with sensitivity (
pii,financial). - Approval for PII access: query an audit log; legal reviews requests.
- Single source of metrics: a metric layer (dbt metrics, Cube, MetricFlow, Looker LookML) prevents 5 versions of "active users".
The warehouse becomes the operational data system for the company. Govern it like one.
PII and Compliance
The warehouse often contains PII. Approach:
- Column-level encryption / masking: Snowflake's Dynamic Data Masking, BigQuery's column-level encryption. Only authorized roles see the cleartext.
- Tokenization: customer emails replaced with tokens; reversible only by an isolated service.
- PII inventory: catalog which columns are PII; review quarterly.
- GDPR right-to-be-forgotten: handle deletion. Hard in a warehouse (immutable formats). Patterns:
- Tombstone table: mark deleted; views filter them out
- Periodic rebuild: rebuild affected tables monthly without forgotten users
- Iceberg / Delta merge-delete: modern open-format support
- Cross-region transfer: GDPR may forbid sending EU data to a US warehouse. Per-region warehouses or aggregation-only export.
Document the PII flow. Auditors ask.
Lakehouse vs. Warehouse Trade-Offs
Both can be the right answer. Choose by:
| Concern | Pure warehouse (Snowflake/BigQuery) | Lakehouse (Databricks/Iceberg+Trino) |
|---|---|---|
| Vendor lock-in | High (proprietary formats) | Low (open formats: Parquet, Iceberg, Delta) |
| Time to value | Fast | Slower (more pieces to configure) |
| SQL ergonomics | Excellent | Good |
| ML workflows | OK | First-class (Spark, MLflow integrated) |
| Cost at scale | Expensive at top end | Cheaper (cheap object storage) |
| Operational burden | Low (managed) | Higher (you operate more) |
| Multi-engine | Limited | Native (same data, multiple engines) |
A pragmatic mid-path: Snowflake or BigQuery as primary, Iceberg-on-S3 as long-term cold archive. Best of both.
Backup, DR, Migration
The warehouse is your business intelligence; treat it like a critical system. See Disaster Recovery.
- Time-travel for short-term ($ to days back)
- Snapshot exports to object storage for longer recovery
- Cross-region replication (Snowflake replication, BigQuery cross-region, Databricks DR)
- Documented migration plan: vendor change is hard. Iceberg/Delta on object storage makes future migrations cheaper.
When Models Get Slow
Slowdowns happen as data grows. Diagnose:
- Single query slow → query optimization, partitioning, clustering
- All queries slow → warehouse size too small or concurrent queries fighting
- dbt build slow → incremental models; parallelize independent models
- Specific model slow → break into smaller models; materialize as table not view; precompute aggregates
A monthly "slow query review" with the data team keeps things tuned. Stale models from 2 years ago that nobody uses but still run nightly should be deleted.
Adoption Patterns
For organizations adopting a warehouse:
- First team: pick one team with real analytical needs (often product analytics). Get them shipping value.
- Second team: marketing or finance. Different data, similar pattern.
- Centralize on dbt: prevents N versions of similar transforms.
- Single metric layer: define metrics in code, used by all BI tools.
- Self-serve queries: business users in tools like Hex, Mode, Metabase, Tableau. Analytics engineers maintain the marts.
- Productionize: dashboards have SLAs; alerts; on-call rotation for data infra.
This takes years for big orgs. Start small; demonstrate value; expand.
Common Pitfalls
No partitioning. Big table; every query reads it all. Bills surprise; queries slow.
Stale data, no freshness check. Dashboards quietly show yesterday's data because Fivetran broke. Add source freshness tests.
Cost-blind engineers. Writing exploratory queries against TB-tables. Cost guardrails (BigQuery max-bytes-billed limit, Snowflake query timeout + warehouse size limit).
Wide unused tables. Models that nobody queries — but rebuild nightly. Audit dbt model usage; drop dead models.
Warehouse as OLTP. Inserts every request. Use a streaming layer (Kafka → warehouse batch) or a separate OLTP DB.
N versions of a metric. Without a metrics layer, "monthly active users" diverges across teams. Painful re-alignment.
Lift-and-shift from Postgres. Same schema, same queries. The warehouse pattern is different — wide tables, denormalization for serving, aggregated marts. Re-design, don't just port.
Cost transparency only quarterly. Surprise bills. Daily-level cost reporting at team granularity.
Checklist
Data warehouse production readiness:
- Source data freshness monitored (alerts on staleness)
- Critical tables have data quality tests
- dbt or equivalent for transformations (no ad-hoc SQL in production paths)
- Large tables partitioned + clustered correctly
- Materialized views or rollups for high-traffic queries
- PII inventory; column-level masking or tokenization
- GDPR / data subject request process tested
- Cost guardrails per user / query (max bytes scanned, warehouse limits)
- Cost attribution per team; monthly showback
- Time travel or snapshot strategy for DR
- Cross-region or cross-cloud replication if business-critical
- Documented lineage (dbt docs)
- Metrics defined in one place (metrics layer)
- Access control with role-based privileges
- Slow query review monthly; dead model cleanup
- Engineers can see cost of their queries before running
What's Next
You have a warehouse practice. Connect it to:
- Workflow Orchestration — Airflow/Dagster/Prefect orchestrate the dbt runs, ingestion, exports
- Stream Processing — real-time feeds into the warehouse via Kafka + Materialize
- Time-Series Databases — operational metrics next to analytical data
- Object Storage — S3/GCS is the backbone of the lakehouse pattern
- FinOps — warehouse cost is often a top-5 cloud line
- MLOps — warehouse data feeds ML feature pipelines