Steven's Knowledge

Data Quality

Data quality dimensions, testing frameworks (Great Expectations, dbt tests, Soda), anomaly detection, quality gates, data SLAs/SLOs, and alerting.

Data Quality

Bad data is worse than no data. No data fails loudly; bad data fails silently, propagates through every downstream model, and surfaces as a wrong number in an executive dashboard weeks later. Data quality engineering is the practice of catching problems before they reach consumers, and proving to those consumers that the data they depend on is trustworthy.

The ETL/ELT page introduced quality checks as one step in a pipeline. This page goes deep: what to measure, how to test, how to detect the anomalies you did not write a test for, and how to wrap it all in SLAs that make quality a contract rather than a hope.

The Six Dimensions of Data Quality

You cannot improve what you cannot name. Quality issues fall into six classic dimensions. Most real-world incidents map cleanly onto one of them.

DimensionQuestion it answersExample failure
CompletenessIs all the data present?10% of orders missing customer_id
AccuracyDoes the data reflect reality?country = "XX" for a real address
ConsistencyDoes data agree across systems?Revenue in finance DB ≠ revenue in warehouse
TimelinessIs the data fresh enough?Dashboard shows yesterday's data, SLA is 1 hour
UniquenessAre there unexpected duplicates?Same order_id appears 3 times
ValidityDoes data conform to rules/types?email field contains a phone number

Mapping Dimensions to Tests

Each dimension translates into concrete, automatable assertions:

# A mental model: every quality check belongs to a dimension
quality_dimensions = {
    "completeness": ["not_null", "row_count_within_range", "no_missing_partitions"],
    "accuracy":     ["matches_reference_data", "reconciliation_with_source"],
    "consistency":  ["cross_system_totals_match", "referential_integrity"],
    "timeliness":   ["freshness_within_sla", "no_late_arriving_data"],
    "uniqueness":   ["unique_key", "no_duplicate_business_keys"],
    "validity":     ["accepted_values", "regex_match", "type_check", "range_check"],
}

A practical rule: completeness and uniqueness catch most ingestion bugs; validity and accuracy catch most logic bugs; timeliness catches most operational bugs. Cover all six and you have eliminated entire categories of incidents.

Testing Data: Three Frameworks

There is no shortage of tools. The three that dominate are dbt tests (in-warehouse, SQL-native), Great Expectations (Python, framework-grade), and Soda (declarative, YAML-first). They overlap but each has a sweet spot.

dbt Tests

If you already transform with dbt, its built-in tests are the cheapest quality wins available. Tests compile to SQL that returns failing rows; zero rows means a pass.

# models/marts/schema.yml
version: 2

models:
  - name: fct_orders
    description: "One row per completed order"
    tests:
      # Model-level test: row count should not collapse
      - dbt_utils.equal_rowcount:
          compare_model: ref('stg_orders')
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: order_total
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 1000000
      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'shipped', 'delivered', 'cancelled']
      - name: customer_id
        tests:
          - relationships:
              to: ref('dim_customers')
              field: customer_id

For logic that built-in tests cannot express, write a singular test - any SQL query that should return zero rows:

-- tests/assert_revenue_reconciles.sql
-- Fails if warehouse revenue diverges from the source ledger by >$1
WITH warehouse AS (
    SELECT date_trunc('day', ordered_at) AS day, SUM(order_total) AS rev
    FROM {{ ref('fct_orders') }}
    GROUP BY 1
),
ledger AS (
    SELECT day, SUM(amount) AS rev
    FROM {{ source('finance', 'daily_ledger') }}
    GROUP BY 1
)
SELECT w.day, w.rev AS warehouse_rev, l.rev AS ledger_rev
FROM warehouse w
JOIN ledger l USING (day)
WHERE ABS(w.rev - l.rev) > 1.00

Great Expectations

Great Expectations (GX) shines when quality lives outside the warehouse - validating files at ingestion, dataframes mid-pipeline, or sources you do not control. Its vocabulary of "expectations" is expressive and self-documenting.

import great_expectations as gx

context = gx.get_context()
batch = context.sources.pandas_default.read_csv("incoming_payments.csv")

# Completeness
batch.expect_column_values_to_not_be_null("payment_id")
batch.expect_table_row_count_to_be_between(min_value=100, max_value=1_000_000)

# Uniqueness
batch.expect_column_values_to_be_unique("payment_id")

# Validity
batch.expect_column_values_to_match_regex(
    "email", r"^[^@\s]+@[^@\s]+\.[^@\s]+$"
)
batch.expect_column_values_to_be_in_set(
    "currency", ["USD", "EUR", "GBP", "NZD", "AUD"]
)

# Accuracy / consistency between columns
batch.expect_column_pair_values_a_to_be_greater_than_b(
    "paid_at", "created_at", or_equal=True
)

# Distribution (statistical accuracy)
batch.expect_column_mean_to_be_between("amount", min_value=10, max_value=500)

results = batch.validate()
if not results.success:
    raise ValueError(f"Validation failed: {results['statistics']}")

Soda

Soda uses a compact YAML-like checks language (SodaCL) that non-engineers can read and edit. It is a strong choice when analysts and stakeholders need to contribute checks.

# checks/orders.yml
checks for fct_orders:
  - row_count > 0
  - missing_count(customer_id) = 0
  - duplicate_count(order_id) = 0
  - invalid_percent(status) = 0:
      valid values: [pending, shipped, delivered, cancelled]
  - freshness(ordered_at) < 1h
  # Anomaly detection on a metric over time
  - anomaly score for row_count < default
  # Cross-check against another dataset
  - values in (customer_id) must exist in dim_customers (customer_id)

Choosing a Framework

NeedBest fit
Tests on dbt models, in-warehousedbt tests
Validate files/dataframes before loadingGreat Expectations
Non-engineers author checksSoda
Rich data docs / data profilingGreat Expectations
Lightweight, declarative, CI-friendlySoda
Custom reconciliation logicdbt singular tests

Most mature teams run dbt tests for in-warehouse models plus one of GX/Soda for ingestion-edge validation. Do not adopt all three - the maintenance cost of overlapping frameworks exceeds the benefit.

Anomaly Detection

Tests assert rules you can articulate. But the most damaging incidents are the ones you did not anticipate: a row count that quietly drops 40%, an amount distribution that shifts after an upstream bug. Anomaly detection learns the normal shape of your data and alerts on deviation.

Statistical Thresholds

The simplest approach compares today's metric against a rolling baseline:

-- Flag days where row count deviates >3 standard deviations from the
-- trailing 30-day mean (a classic z-score check)
WITH daily AS (
    SELECT
        date_trunc('day', ordered_at) AS day,
        COUNT(*) AS row_count
    FROM fct_orders
    GROUP BY 1
),
stats AS (
    SELECT
        day,
        row_count,
        AVG(row_count)    OVER w AS mean_count,
        STDDEV(row_count) OVER w AS std_count
    FROM daily
    WINDOW w AS (ORDER BY day ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING)
)
SELECT day, row_count, mean_count, std_count
FROM stats
WHERE ABS(row_count - mean_count) > 3 * std_count

Practical Detection Strategy

# What to monitor for anomalies, by metric type
anomaly_targets = {
    "volume":       "row count per partition vs trailing baseline",
    "freshness":    "max(loaded_at) vs expected arrival time",
    "distribution": "mean/median/null-rate of key columns over time",
    "schema":       "column set and types vs last known good",
    "cardinality":  "distinct counts (e.g. active users) vs baseline",
}

Two warnings from experience:

  • Seasonality breaks naive baselines. A 30-day mean will false-alarm every Monday if traffic is weekly-cyclical. Use day-of-week-aware baselines or a tool with seasonality modeling (Monte Carlo, Anomalo, Soda).
  • Tune for signal, not coverage. An anomaly detector that fires daily gets muted within a week. Start with a small set of high-value metrics and tight-but-not-noisy thresholds.

Quality Gates in Pipelines

A test that runs but does not block anything is documentation, not a gate. A quality gate stops bad data from advancing. The pattern: stage data, validate the staged data, and only promote it to the serving layer if validation passes. This is the Write-Audit-Publish (WAP) pattern.

Extract → Load to STAGING → AUDIT (run quality checks) → PUBLISH (swap into PROD)

                                  └── fail → halt, alert, do not publish
# Write-Audit-Publish gate in an orchestrator (see ./orchestration)
def audit_then_publish(table: str):
    """Promote staged data only if it passes quality checks."""
    staging_table = f"staging.{table}"
    prod_table = f"analytics.{table}"

    failures = run_quality_checks(staging_table)  # returns list of failed checks
    blocking = [f for f in failures if f.severity == "error"]

    if blocking:
        send_alert(
            channel="#data-incidents",
            message=f"BLOCKED publish of {table}: {[f.name for f in blocking]}",
        )
        raise QualityGateError(f"{len(blocking)} blocking checks failed")

    # All blocking checks passed - atomic swap into production
    atomic_swap(staging_table, prod_table)

The discipline that makes WAP work: distinguish warning from error. A warning logs and continues (a 2% null rate on an optional field); an error halts the pipeline (a duplicated primary key, a freshness SLA breach). Treat every check as blocking and you will train your team to disable checks. Treat none as blocking and bad data ships.

Data SLAs, SLOs, and SLIs

Borrowing from SRE, quality becomes a contract when you define measurable objectives and publish them to consumers.

  • SLI (Service Level Indicator): the metric you measure - e.g. "percentage of days the orders table arrives by 7 AM."
  • SLO (Service Level Objective): the target for that metric - e.g. "99% of days."
  • SLA (Service Level Agreement): the commitment to consumers, often with consequences if breached.
# data_sla.yml - a publishable contract for a critical dataset
dataset: analytics.fct_orders
owner: data-engineering
consumers: [finance, growth, exec-dashboard]

slos:
  freshness:
    sli: "time between event and availability in warehouse"
    objective: "p95 < 60 minutes"
    measurement_window: 30d
  completeness:
    sli: "percent of expected daily partitions present"
    objective: ">= 99.9%"
  accuracy:
    sli: "revenue reconciliation vs finance ledger"
    objective: "diff < 0.1% on 100% of days"

on_breach:
  notify: ["#data-incidents", "pagerduty:data-oncall"]
  status_page: true

The point of an explicit SLO is not bureaucracy - it is alignment. When finance demands "real-time" data, an SLO forces the conversation: real-time costs streaming infrastructure (see streaming); do you actually need sub-minute freshness, or is hourly fine? Most "real-time" requests evaporate once the cost is named.

Alerting That People Actually Read

The hardest part of data quality is not detection - it is alerting without causing alert fatigue. Rules that work:

  1. Route by severity. Errors page on-call; warnings go to a Slack channel; info goes to a daily digest. Never page for a warning.
  2. Make alerts actionable. A good alert names the dataset, the failed check, the likely blast radius (downstream models), and a runbook link. "fct_orders failed a test" is useless.
  3. Deduplicate and group. One upstream failure that breaks 40 models should be one alert, not 40.
  4. Own every alert. Every check has an owner team in metadata. Unowned alerts get ignored.
  5. Track and prune. Review alert volume monthly. Any check that has never caught a real issue but fires regularly is noise - fix or delete it.
# An actionable alert payload
alert = {
    "severity":       "error",
    "dataset":        "analytics.fct_orders",
    "check":          "freshness_within_sla",
    "detail":         "Latest partition is 3h old; SLO is 1h",
    "blast_radius":   ["exec_dashboard", "finance.daily_revenue", "growth.cohorts"],
    "owner":          "data-engineering",
    "runbook":        "https://wiki/runbooks/orders-freshness",
    "since":          "2026-05-30T07:00:00Z",
}

Putting It Together

A production data quality program has four layers:

  1. Schema and contract enforcement at the edge - reject malformed data before it lands (see data contracts).
  2. Declarative tests on every model - completeness, uniqueness, validity, referential integrity (dbt/GX/Soda).
  3. Anomaly detection on key metrics - catch the failures you did not write a test for.
  4. Quality gates plus SLOs plus actionable alerting - block bad data, measure against commitments, and tell the right people fast.

Quality is not a one-time project; it is an operating discipline. The teams that earn trust in their data are not the ones with zero incidents - they are the ones whose pipelines fail loudly, fail safely, and recover fast. The next step is operationalizing all of this, covered in DataOps & observability.

On this page