Steven's Knowledge

DataOps & Observability

Applying DevOps to data, CI/CD for data pipelines, testing strategies, data lineage, freshness/volume/distribution monitoring, incident response for data, environments and deployment, and cost/governance ops.

Data teams have spent a decade learning a lesson software teams learned before them: you cannot ship reliable systems by hand. DataOps is the application of DevOps practices - version control, automated testing, CI/CD, monitoring, incident response - to data pipelines. Observability is the part that lets you answer "is the data healthy right now, and if not, why?" without manually querying tables.

This page ties together the rest of the data engineering material. It assumes the building blocks - pipelines (ETL & ELT), orchestration (Orchestration), tests (Data Quality), and contracts (Data Contracts) - and covers how to operate them as a disciplined, observable system.

DevOps Applied to Data

The principles transfer, but the failure modes differ. A software service fails loudly with a 500; a data pipeline can "succeed" while producing silently wrong output.

| DevOps practice | Software | Data equivalent |
| --- | --- | --- |
| Version control | App code in Git | Pipeline code, SQL, schemas, contracts in Git |
| Automated testing | Unit/integration tests | Data quality tests, schema checks, unit tests on logic |
| CI/CD | Build, test, deploy on merge | Validate transforms, run tests, deploy DAGs/models |
| Monitoring | Latency, error rate, uptime | Freshness, volume, distribution, schema drift |
| Incident response | On-call, runbooks, postmortems | Data on-call, lineage-driven triage, backfills |
| Environments | dev/staging/prod | dev/staging/prod with isolated data |

The defining DataOps insight: data has a second axis of failure that code does not. Code is either correct or buggy. Data can be structurally valid and semantically wrong. So DataOps adds quality and observability layers that have no exact software analog.
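
As a minimal sketch of that second axis, the example below runs one row through a structural check and a semantic check. The field names, the 100,000 USD ceiling, and the cents-for-dollars scenario are illustrative assumptions, not taken from any real pipeline.

```python
def is_structurally_valid(row: dict) -> bool:
    """Schema check: required fields exist with the expected types."""
    return isinstance(row.get("order_id"), int) and isinstance(row.get("amount_usd"), float)


def is_semantically_valid(row: dict, max_amount_usd: float = 100_000.0) -> bool:
    """Business-rule check: the amount must fall in a plausible range (assumed ceiling)."""
    return 0 <= row["amount_usd"] <= max_amount_usd


# Hypothetical incident: an upstream change starts sending cents instead of dollars.
row = {"order_id": 42, "amount_usd": 1_999_900.0}  # intended value: 19,999.00

structural_ok = is_structurally_valid(row)
semantic_ok = is_semantically_valid(row)
print(structural_ok, semantic_ok)  # True False
```

The row passes every type check and would load without error; only a rule that encodes what the number is supposed to mean catches it.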

CI/CD for Data Pipelines

A data pipeline is code, and it deserves the same gate every other code change gets: nothing reaches production without passing automated checks.

# .github/workflows/data-ci.yml
name: Data Pipeline CI
on: [pull_request]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # (Install dbt, your warehouse adapter, and sqlfluff before these steps.)

      # 1. Lint and parse - catch syntax errors before runtime
      - run: sqlfluff lint models/
      - run: dbt parse           # parses the project; fails on Jinja/YAML errors and unresolved ref()s

      # 2. Build and test in an isolated CI schema (never prod).
      #    'dbt build' runs models and their tests together in DAG order,
      #    so a separate 'dbt test' step is not needed.
      #    'state:modified+' = only changed models and their downstream dependents.
      #    It needs a production manifest to compare against; ./prod-artifacts is
      #    an assumed path where an earlier step downloaded manifest.json.
      #    '--defer' resolves unchanged upstream refs to their production tables.
      - run: dbt build --target ci --select state:modified+ --defer --state ./prod-artifacts

      # 3. Check schema/contract compatibility (see ./data-contracts)
      - run: contract-cli compatibility --against registry://prod --mode backward

The key techniques that make data CI/CD work:

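  • Slim CI with state comparison. state:modified+ builds and tests only what changed and its downstream dependents, not the whole warehouse. dbt decides what changed by comparing the PR against a manifest from the last production run, supplied with --state. Without slim CI, a full build of a large project can take hours, and slow CI gets bypassed.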
  • Isolated CI schemas. Each PR builds into its own throwaway schema so tests never touch production data.
  • Test before deploy. A merge that fails a quality test does not deploy. This is the gate that makes the rest of the system trustworthy.
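
The state comparison behind slim CI can be sketched in a few lines. This is a simplified illustration, not how dbt implements it: the checksum dictionaries stand in for the production and PR manifests, and the model names and `children` map are hypothetical.

```python
from collections import deque


def modified_models(prod_checksums: dict, pr_checksums: dict) -> set:
    """Models that are new in the PR or whose checksum differs from production."""
    return {name for name, checksum in pr_checksums.items()
            if prod_checksums.get(name) != checksum}


def with_downstream(selected: set, children: dict) -> set:
    """Expand a selection with every downstream dependent (the '+' suffix)."""
    result, queue = set(selected), deque(selected)
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in result:
                result.add(child)
                queue.append(child)
    return result


# Hypothetical manifests: only stg_payments changed in the PR.
prod = {"stg_orders": "a1", "stg_payments": "b2", "fct_revenue": "c3", "dim_customers": "d4"}
pr = {"stg_orders": "a1", "stg_payments": "b9", "fct_revenue": "c3", "dim_customers": "d4"}
children = {
    "stg_orders": ["fct_revenue"],
    "stg_payments": ["fct_revenue"],
    "fct_revenue": ["exec_dashboard"],
}

to_build = with_downstream(modified_models(prod, pr), children)
print(sorted(to_build))  # ['exec_dashboard', 'fct_revenue', 'stg_payments']
```

Here stg_orders and dim_customers are skipped entirely, which is where the time savings come from on a large project.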

Testing Strategy

Data testing operates at several levels. A mature pipeline uses all of them; relying on only one leaves gaps.

| Level | What it tests | Tooling |
| --- | --- | --- |
| Unit | Pure transformation logic on fixed inputs | pytest, dbt unit tests |
| Data quality | Real data against rules | dbt tests, Great Expectations (GX), Soda |
| Contract | Schema structure and compatibility | schema registry, contract CI |
| Integration | End-to-end pipeline on sample data | dbt build in CI, dockerized runs |
| Regression | New output vs known-good snapshot | dbt audit_helper, Recce, snapshot diffs |

# A dbt unit test: deterministic logic check with mocked inputs.
# This catches logic bugs without running against live data.
# Unit tests (dbt 1.8+) are defined in YAML under the models directory, not tests/.
# models/unit_tests/test_revenue_logic.yml
unit_tests:
  - name: test_discount_applied_correctly
    model: int_order_revenue
    given:
      - input: ref('stg_orders')
        rows:
          - {order_id: 1, subtotal: 100.00, discount_pct: 10}
          - {order_id: 2, subtotal: 50.00,  discount_pct: 0}
    expect:
      rows:
        - {order_id: 1, final_total: 90.00}
        - {order_id: 2, final_total: 50.00}

The distinction that matters: unit tests check your logic with controlled inputs; data quality tests check reality with live inputs. You need both. Unit tests catch the discount bug before deploy; quality tests catch the upstream source feeding you nulls in production.
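
The same split can be shown outside dbt. In this sketch, `final_total` and `null_rate` are hypothetical helpers, and `live_rows` is made-up sample data standing in for a production batch.

```python
def final_total(subtotal: float, discount_pct: float) -> float:
    """Transformation logic: apply a percentage discount, rounded to cents."""
    return round(subtotal * (1 - discount_pct / 100), 2)


def test_discount_applied_correctly():
    """Unit test: fixed inputs, so a failure can only mean the logic is wrong."""
    assert final_total(100.00, 10) == 90.00
    assert final_total(50.00, 0) == 50.00


def null_rate(rows: list, column: str) -> float:
    """Data quality metric: share of rows where the column is missing."""
    if not rows:
        return 0.0
    return sum(row.get(column) is None for row in rows) / len(rows)


test_discount_applied_correctly()  # logic is fine

# Made-up "live" batch where an upstream source started sending nulls.
live_rows = [
    {"order_id": 1, "subtotal": 100.0},
    {"order_id": 2, "subtotal": None},
    {"order_id": 3, "subtotal": None},
    {"order_id": 4, "subtotal": 20.0},
]
quality_ok = null_rate(live_rows, "subtotal") <= 0.01  # assumed 1% tolerance
print(quality_ok)  # False
```

The unit test passes while the quality check fails, which is exactly the case a single testing level would miss.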

Data Lineage

Lineage is the map of how data flows: which sources feed which models, which models feed which dashboards. It is the foundation of impact analysis - the ability to answer "if this breaks, what breaks downstream?" and "this number looks wrong, where did it come from?"

raw.stripe_payments ──┐
                      ├──> stg_payments ──> fct_revenue ──> exec_dashboard
raw.orders ───────────┘                          │
                                                  └──────> finance.daily_close

Lineage powers three critical operations:

  • Impact analysis (downstream): before changing stg_payments, lineage shows it feeds fct_revenue, the exec dashboard, and the finance close. You now know who to notify.
  • Root cause analysis (upstream): when exec_dashboard shows wrong revenue, lineage walks you back through fct_revenue to stg_payments to the raw Stripe source - dramatically narrowing the search.
  • Blast radius for alerts: an alert that includes the downstream lineage tells responders exactly what consumers are affected (this is the blast_radius field in the Data Quality alert payload).
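
The first two operations are graph traversals in opposite directions. A minimal sketch over the lineage diagram above; the edge list mirrors the diagram, while `reachable` is a hypothetical helper rather than any catalog tool's API.

```python
# Edges copied from the lineage diagram: (upstream, downstream).
EDGES = [
    ("raw.stripe_payments", "stg_payments"),
    ("raw.orders", "stg_payments"),
    ("stg_payments", "fct_revenue"),
    ("fct_revenue", "exec_dashboard"),
    ("fct_revenue", "finance.daily_close"),
]


def reachable(start: str, edges: list, direction: str = "downstream") -> set:
    """Every asset reachable from start, following edges in the given direction."""
    adjacency = {}
    for parent, child in edges:
        src, dst = (parent, child) if direction == "downstream" else (child, parent)
        adjacency.setdefault(src, []).append(dst)
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen


# Impact analysis: what breaks if stg_payments changes?
blast_radius = reachable("stg_payments", EDGES, "downstream")
print(sorted(blast_radius))  # ['exec_dashboard', 'fct_revenue', 'finance.daily_close']

# Root cause analysis: where could a wrong number on exec_dashboard come from?
suspects = reachable("exec_dashboard", EDGES, "upstream")
print(sorted(suspects))  # ['fct_revenue', 'raw.orders', 'raw.stripe_payments', 'stg_payments']
```

Attaching the downstream result to an alert gives responders the affected consumers without a separate lookup.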

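dbt generates model-level lineage automatically from ref() and source() calls. Column-level lineage needs SQL parsing on top of that, which dbt Core does not provide out of the box; catalog tools (DataHub, OpenMetadata, Atlan, Unity Catalog) can add it, and they also stitch lineage across systems beyond a single dbt project.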

Observability: The Pillars

Software observability has metrics, logs, and traces. Data observability has its own pillars - the dimensions you monitor to know whether data is healthy.

| Pillar | Question | What you alert on |
| --- | --- | --- |
| Freshness | Is the data recent? | Last update older than SLA |
| Volume | Is the right amount of data here? | Row count outside expected range |
| Distribution | Do the values look normal? | Null rate, mean, cardinality shifts |
| Schema | Did the structure change? | New/dropped/retyped columns |
| Lineage | What is connected to what? | (Used for triage, not alerts) |

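# Observability checks emitted as metrics after every pipeline run.
# Postgres-flavored SQL; conn is assumed to be a DB-API connection such as psycopg2.
# Push these to your metrics store; alert when any drifts from baseline.
# Anomaly detection on these series is covered in ./data-quality.
def query(conn, sql: str, params=None):
    """Run a query and return the first column of the first row."""
    with conn.cursor() as cur:
        cur.execute(sql, params)
        return cur.fetchone()[0]

def emit_observability_metrics(table: str, conn, key_col: str = "key_col",
                               ts_col: str = "updated_at") -> dict:
    """Capture the four monitorable pillars for a table.

    table, key_col, and ts_col are interpolated as identifiers, so they must
    come from trusted configuration, never from user input.
    """
    bare_table = table.split(".")[-1]  # information_schema stores unqualified names
    return {
        "freshness_minutes": query(conn,
            f"SELECT EXTRACT(EPOCH FROM (NOW() - MAX({ts_col})))/60 FROM {table}"),
        "volume_rows": query(conn,
            f"SELECT COUNT(*) FROM {table}"),
        "distribution_null_rate": query(conn,
            f"SELECT AVG(CASE WHEN {key_col} IS NULL THEN 1 ELSE 0 END) FROM {table}"),
        "schema_fingerprint": query(conn,
            "SELECT md5(string_agg(column_name || ':' || data_type, ',' ORDER BY ordinal_position)) "
            "FROM information_schema.columns WHERE table_name = %s", (bare_table,)),
    }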

The difference between monitoring and observability: monitoring tells you a known metric crossed a threshold (freshness > 1h). Observability lets you investigate an unknown problem - correlate the freshness miss with a schema change upstream and a volume drop, without writing a new query for each question. Managed platforms (Monte Carlo, Anomalo, Metaplane, Bigeye) automate all four pillars across the whole warehouse; you can build the basics yourself with the checks above plus a metrics backend.
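
One way to build the basics is to compare each run's metrics against a rolling baseline and report every pillar that fails, so correlated failures show up together. The sketch below assumes a simple z-score rule and a 60-minute freshness SLA; the thresholds, baseline values, and function names are illustrative, and the metric keys match the dictionary emitted above.

```python
from statistics import mean, stdev


def is_anomalous(history: list, current: float, z_threshold: float = 3.0) -> bool:
    """True when current is more than z_threshold standard deviations from the mean."""
    if len(history) < 2:
        return False  # not enough baseline to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > z_threshold


def failing_pillars(metrics: dict, baselines: dict, freshness_sla_minutes: float) -> list:
    """Return the names of all pillars that look unhealthy for one run."""
    failures = []
    if metrics["freshness_minutes"] > freshness_sla_minutes:
        failures.append("freshness")
    if is_anomalous(baselines["volume_rows"], metrics["volume_rows"]):
        failures.append("volume")
    if is_anomalous(baselines["distribution_null_rate"], metrics["distribution_null_rate"]):
        failures.append("distribution")
    if metrics["schema_fingerprint"] != baselines["schema_fingerprint"]:
        failures.append("schema")
    return failures


# Illustrative values only.
baselines = {
    "volume_rows": [1000, 1020, 980, 1010, 990],
    "distribution_null_rate": [0.010, 0.012, 0.009, 0.011, 0.010],
    "schema_fingerprint": "abc",
}
latest = {
    "freshness_minutes": 95,
    "volume_rows": 400,
    "distribution_null_rate": 0.011,
    "schema_fingerprint": "abd",
}
failures = failing_pillars(latest, baselines, freshness_sla_minutes=60)
print(failures)  # ['freshness', 'volume', 'schema']
```

Seeing a late load, a volume drop, and a changed fingerprint in the same result points toward an upstream schema change rather than three unrelated problems.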

Incident Response for Data

When data breaks, the response follows the same shape as a software incident, adapted for data's silent-failure problem.

Detect → Triage → Mitigate → Resolve → Backfill → Postmortem
  1. Detect. An observability alert or a consumer report. The goal is to detect via monitoring before a consumer notices - if humans always find it first, your observability is inadequate.
  2. Triage. Use lineage to scope the blast radius. Which downstream assets are affected? Who depends on them? Is the bad data already serving consumers?
  3. Mitigate. Stop the bleeding. Often this means halting publish (the quality gate from Data Quality) so consumers see stale-but-correct data rather than fresh-but-wrong data. Stale is usually safer than wrong.
  4. Resolve. Fix the root cause - the upstream schema change, the broken join, the bad source file.
  5. Backfill. Reprocess the affected partitions. This is where idempotent pipelines (see ETL & ELT) and replayable sources (see Stream Processing) pay off - you can rerun a date range safely.
  6. Postmortem. Blameless. Capture the root cause and add a test or monitor so this specific failure cannot recur silently. Every incident should produce a new check.

The cultural rule: a recurring incident with no new test is a process failure. The point of a postmortem is not blame; it is to convert a one-time surprise into a permanent guardrail.

Environments and Deployment

Data needs the same environment isolation as application code, with an extra wrinkle: data itself must be isolated, not just code.

| Environment | Code | Data | Purpose |
| --- | --- | --- | --- |
| dev | Feature branch | Sample / subset / cloned | Build and iterate fast |
| staging | Merged, pre-prod | Prod-like (cloned or recent copy) | Validate against realistic data |
| prod | Released | Real production data | Serve consumers |

# Environment-aware configuration; the same code, different targets
ENVIRONMENTS = {
    "dev":     {"schema": "dev_{user}",  "source": "sampled_1pct",  "alerts": False},
    "staging": {"schema": "staging",     "source": "prod_clone",    "alerts": "slack"},
    "prod":    {"schema": "analytics",   "source": "production",    "alerts": "pagerduty"},
}

Two practices make this work. The first is zero-copy cloning: Snowflake zero-copy clones and Databricks shallow clones let you copy production data into staging quickly and cheaply, so you test against realistic data without duplicating storage. The second is blue-green deployment for tables: build the new version alongside the old, validate it, then atomically swap (the write-audit-publish pattern again). Never test transformations against a tiny synthetic dataset and assume they hold at production scale and skew; they often do not.
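
The blue-green swap can be sketched with SQLite standing in for the warehouse. The `__new` and `__old` suffixes, the audit rule, and the table names are assumptions for illustration; a real warehouse would use its own atomic swap or rename statement.

```python
import sqlite3

# Assumed audit rule: no null or negative amounts may be published.
AUDIT_SQL = "SELECT COUNT(*) FROM {table} WHERE amount IS NULL OR amount < 0"


def publish_blue_green(conn, table: str, build_sql: str, audit_sql: str) -> bool:
    """Write a new version next to the live table, audit it, then swap by rename."""
    staging, retired = f"{table}__new", f"{table}__old"
    conn.execute(f"DROP TABLE IF EXISTS {staging}")
    conn.execute(f"CREATE TABLE {staging} AS {build_sql}")                   # write
    bad_rows = conn.execute(audit_sql.format(table=staging)).fetchone()[0]  # audit
    if bad_rows:
        conn.execute(f"DROP TABLE {staging}")
        return False  # live table untouched: stale but correct
    conn.execute("BEGIN")                                                    # publish
    try:
        conn.execute(f"DROP TABLE IF EXISTS {retired}")
        conn.execute(f"ALTER TABLE {table} RENAME TO {retired}")
        conn.execute(f"ALTER TABLE {staging} RENAME TO {table}")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return True


# isolation_level=None lets us control the transaction explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE stg_orders (order_id INTEGER, amount REAL)")
conn.execute("INSERT INTO stg_orders VALUES (1, 90.0), (2, 50.0)")
conn.execute("CREATE TABLE fct_revenue (order_id INTEGER, amount REAL)")  # current live version

published = publish_blue_green(
    conn, "fct_revenue", "SELECT order_id, amount FROM stg_orders", AUDIT_SQL)
rejected = publish_blue_green(
    conn, "fct_revenue", "SELECT order_id, -amount AS amount FROM stg_orders", AUDIT_SQL)
live_rows = conn.execute("SELECT COUNT(*) FROM fct_revenue").fetchone()[0]
print(published, rejected, live_rows)  # True False 2
```

The second call builds a version with negative amounts, fails the audit, and leaves the previously published table serving consumers.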

Cost and Governance Ops

DataOps is not only about reliability - in cloud warehouses, compute is metered, and an unwatched pipeline can quietly burn money.

  • Cost attribution. Tag queries and jobs by team and pipeline so spend is traceable. A single mis-written incremental model doing full scans can dominate the bill; attribution is how you find it.
  • Cost monitoring. Alert on cost anomalies the same way you alert on data anomalies. A 5x jump in warehouse spend overnight is an incident.
  • Optimization. Partition pruning, incremental models, right-sized warehouses, and killing unused tables. Most warehouse bills have substantial easy savings in models that scan more than they need.
  • Governance. Access control, PII classification, audit logging, and retention policies. Increasingly enforced through the catalog and contracts (see Data Contracts) rather than tribal knowledge.
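-- Find the expensive queries worth optimizing (Snowflake example).
-- QUERY_HISTORY does not expose per-query warehouse credits, and
-- credits_used_cloud_services covers only the cloud services layer, so rank by
-- total execution time and bytes scanned as a proxy for compute cost.
SELECT
    query_tag,
    COUNT(*)                                  AS executions,
    SUM(total_elapsed_time) / 1000            AS total_seconds,
    AVG(total_elapsed_time) / 1000            AS avg_seconds,
    SUM(bytes_scanned) / POWER(1024, 3)       AS gb_scanned,
    SUM(credits_used_cloud_services)          AS cloud_services_credits
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
GROUP BY query_tag
ORDER BY total_seconds DESC
LIMIT 20;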
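
The cost monitoring bullet above can be sketched as a comparison of the latest day's spend against a trailing median. The 5x multiplier comes from the text; the tag format, the credit figures, and the `cost_anomalies` helper are hypothetical.

```python
from statistics import median


def cost_anomalies(daily_credits: dict, multiplier: float = 5.0, min_history: int = 3) -> list:
    """Flag tags whose latest daily spend is at least multiplier times their trailing median."""
    flagged = []
    for tag, series in daily_credits.items():
        if len(series) <= min_history:
            continue  # too little history for a stable baseline
        *history, latest = series
        baseline = median(history)
        if baseline > 0 and latest >= multiplier * baseline:
            flagged.append((tag, baseline, latest))
    return flagged


# Made-up daily credit totals per query tag, oldest first.
spend = {
    "team=finance;pipeline=daily_close": [10.0, 11.0, 9.5, 10.5, 10.2],
    "team=growth;pipeline=attribution": [4.0, 4.2, 3.9, 4.1, 26.0],
}
flagged = cost_anomalies(spend)
for tag, baseline, latest in flagged:
    print(f"{tag}: {latest} credits vs baseline {baseline:.2f}")
```

A median baseline is used here so that one earlier spike does not raise the threshold and hide the next one.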

Putting It Together

DataOps is the operating system for everything else in this section. A mature data platform looks like:

  1. Everything in version control - pipelines, SQL, schemas, contracts, even infrastructure.
  2. CI that builds and tests changes in isolation - nothing reaches prod unvalidated.
  3. Layered tests - unit for logic, quality for data, contracts for structure.
  4. Observability on freshness, volume, distribution, and schema - with lineage for triage.
  5. A real incident process - detect via monitoring, triage via lineage, mitigate by halting, and convert every incident into a new test.
  6. Isolated environments and atomic deployment - test against realistic data, swap safely.
  7. Cost and governance as first-class ops - because in the cloud, reliability and spend are the same conversation.

None of this is exotic - it is the same engineering discipline that made software delivery reliable, applied to a domain where the failures are quieter and the trust is harder-won. Teams that adopt DataOps stop firefighting and start shipping data products their organization can actually depend on.
