DataOps & Observability
Applying DevOps to data, CI/CD for data pipelines, testing strategies, data lineage, freshness/volume/distribution monitoring, incident response for data, environments and deployment, and cost/governance ops.
Data teams have spent a decade learning a lesson software teams learned before them: you cannot ship reliable systems by hand. DataOps is the application of DevOps practices - version control, automated testing, CI/CD, monitoring, incident response - to data pipelines. Observability is the part that lets you answer "is the data healthy right now, and if not, why?" without manually querying tables.
This page ties together the rest of the data engineering material. It assumes the building blocks - pipelines (ETL & ELT), orchestration (Orchestration), tests (Data Quality), and contracts (Data Contracts) - and covers how to operate them as a disciplined, observable system.
DevOps Applied to Data
The principles transfer, but the failure modes differ. A software service fails loudly with a 500; a data pipeline can "succeed" while producing silently wrong output.
| DevOps practice | Software | Data equivalent |
|---|---|---|
| Version control | App code in Git | Pipeline code, SQL, schemas, contracts in Git |
| Automated testing | Unit/integration tests | Data quality tests, schema checks, unit tests on logic |
| CI/CD | Build, test, deploy on merge | Validate transforms, run tests, deploy DAGs/models |
| Monitoring | Latency, error rate, uptime | Freshness, volume, distribution, schema drift |
| Incident response | On-call, runbooks, postmortems | Data on-call, lineage-driven triage, backfills |
| Environments | dev/staging/prod | dev/staging/prod with isolated data |
The defining DataOps insight: data has a second axis of failure that code does not. Code is either correct or buggy. Data can be structurally valid and semantically wrong. So DataOps adds quality and observability layers that have no exact software analog.
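To make the "structurally valid, semantically wrong" case concrete, here is a minimal sketch. The schema, column names, and business rules are hypothetical and chosen only for illustration; real rules come from the domain.

```python
from datetime import date

# Hypothetical schema for illustration: column name -> expected Python type.
ORDER_SCHEMA = {"order_id": int, "amount_usd": float, "order_date": date}


def is_structurally_valid(row: dict) -> bool:
    """Schema check: every expected column is present with the expected type."""
    return set(row) == set(ORDER_SCHEMA) and all(
        isinstance(row[col], typ) for col, typ in ORDER_SCHEMA.items()
    )


def semantic_violations(row: dict, today: date) -> list[str]:
    """Business-rule checks that a schema alone cannot express."""
    problems = []
    if row["amount_usd"] < 0:
        problems.append("negative amount")
    if row["order_date"] > today:
        problems.append("order dated in the future")
    return problems


row = {"order_id": 7, "amount_usd": -125.0, "order_date": date(2031, 1, 1)}
today = date(2024, 6, 1)

print(is_structurally_valid(row))       # the row passes the schema check
print(semantic_violations(row, today))  # yet it breaks two business rules
```

A pipeline that only validates structure would load this row without complaint, which is why the quality and observability layers exist.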
CI/CD for Data Pipelines
A data pipeline is code, and it deserves the same gate every other code change gets: nothing reaches production without passing automated checks.
# .github/workflows/data-ci.yml
# Assumes dbt, sqlfluff, and contract-cli are already installed on the runner.
name: Data Pipeline CI
on: [pull_request]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # 1. Lint and parse - catch syntax errors before runtime
      - run: sqlfluff lint models/
      - run: dbt parse  # parses the project, catches ref/macro errors
      # 2. Build into an isolated CI schema (never prod)
      # 'state:modified+' = only changed models and their downstream dependents.
      # It needs --state pointing at production artifacts; the path here is illustrative.
      - run: dbt build --target ci --select state:modified+ --state ./prod-artifacts
      # 3. Run data quality tests against the CI build
      #    (dbt build already runs tests; this step keeps the gate explicit)
      - run: dbt test --target ci --select state:modified+ --state ./prod-artifacts
      # 4. Check schema/contract compatibility (see ./data-contracts)
      - run: contract-cli compatibility --against registry://prod --mode backward
The key techniques that make data CI/CD work:
- Slim CI with state comparison. state:modified+ builds and tests only what changed and its downstream dependents, not the whole warehouse. Without this, CI takes hours and gets bypassed.
- Isolated CI schemas. Each PR builds into its own throwaway schema so tests never touch production data.
- Test before deploy. A merge that fails a quality test does not deploy. This is the gate that makes the rest of the system trustworthy.
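The idea behind state comparison and per-PR schemas can be sketched in a few lines. This is a conceptual illustration, not how dbt implements it: the checksum dictionaries, the `children` graph, and the `ci_pr_<n>` naming scheme are all assumptions made for the example.

```python
from collections import deque


def select_modified_plus(prod_checksums: dict, pr_checksums: dict, children: dict) -> set:
    """Return models whose checksum changed (or that are new) plus all descendants."""
    modified = {
        name for name, checksum in pr_checksums.items()
        if prod_checksums.get(name) != checksum
    }
    selected, queue = set(modified), deque(modified)
    while queue:
        for child in children.get(queue.popleft(), ()):
            if child not in selected:
                selected.add(child)
                queue.append(child)
    return selected


def ci_schema_name(pr_number: int) -> str:
    """One throwaway schema per pull request (naming scheme is illustrative)."""
    return f"ci_pr_{pr_number}"


# Hypothetical project state: only stg_payments changed in this PR.
prod = {"stg_orders": "a1", "stg_payments": "b1", "fct_revenue": "c1", "dim_customers": "d1"}
pr = {"stg_orders": "a1", "stg_payments": "b2", "fct_revenue": "c1", "dim_customers": "d1"}
children = {
    "stg_orders": ["fct_revenue"],
    "stg_payments": ["fct_revenue"],
    "fct_revenue": ["exec_dashboard"],
}

selected = select_modified_plus(prod, pr, children)
print(sorted(selected), "->", ci_schema_name(482))
```

Only the changed model and its dependents are selected; `stg_orders` and `dim_customers` are skipped, which is what keeps CI fast enough that nobody is tempted to bypass it.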
Testing Strategy
Data testing operates at several levels. A mature pipeline uses all of them; relying on only one leaves gaps.
| Level | What it tests | Tooling |
|---|---|---|
| Unit | Pure transformation logic on fixed inputs | pytest, dbt unit tests |
| Data quality | Real data against rules | dbt tests, GX, Soda |
| Contract | Schema structure and compatibility | schema registry, contract CI |
| Integration | End-to-end pipeline on sample data | dbt build in CI, dockerized runs |
| Regression | New output vs known-good snapshot | dbt audit, recce, snapshot diffs |
# A dbt unit test: deterministic logic check with mocked inputs.
# This catches logic bugs without running against live data.
# tests/unit/test_revenue_logic.yml
unit_tests:
  - name: test_discount_applied_correctly
    model: int_order_revenue
    given:
      - input: ref('stg_orders')
        rows:
          - {order_id: 1, subtotal: 100.00, discount_pct: 10}
          - {order_id: 2, subtotal: 50.00, discount_pct: 0}
    expect:
      rows:
        - {order_id: 1, final_total: 90.00}
        - {order_id: 2, final_total: 50.00}
The distinction that matters: unit tests check your logic with controlled inputs; data quality tests check reality with live inputs. You need both. Unit tests catch the discount bug before deploy; quality tests catch the upstream source feeding you nulls in production.
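The same contrast can be shown outside dbt. The sketch below is a hypothetical Python equivalent: `final_total` stands in for the transformation logic, and `check_not_null` stands in for a data quality rule. The function names and the sample batch are invented for illustration.

```python
def final_total(subtotal: float, discount_pct: float) -> float:
    """Transformation logic under test: apply a percentage discount."""
    return round(subtotal * (1 - discount_pct / 100), 2)


def test_discount_applied_correctly():
    """Unit test: fixed inputs, exact expected outputs, no live data."""
    assert final_total(100.00, 10) == 90.00
    assert final_total(50.00, 0) == 50.00


def null_rate(rows: list[dict], column: str) -> float:
    """Data quality metric: share of rows where the column is missing."""
    if not rows:
        return 0.0
    return sum(r.get(column) is None for r in rows) / len(rows)


def check_not_null(rows: list[dict], column: str, max_null_rate: float = 0.0) -> bool:
    """Data quality test: passes only if the observed null rate is acceptable."""
    return null_rate(rows, column) <= max_null_rate


# The logic is correct, so the unit test passes...
test_discount_applied_correctly()

# ...but a (hypothetical) production batch can still arrive with nulls.
production_batch = [
    {"order_id": 1, "subtotal": 100.0},
    {"order_id": 2, "subtotal": None},
]
print(check_not_null(production_batch, "subtotal"))
```

The unit test can never detect the second problem, and the quality check can never detect a wrong discount formula on clean data, which is why neither replaces the other.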
Data Lineage
Lineage is the map of how data flows: which sources feed which models, which models feed which dashboards. It is the foundation of impact analysis - the ability to answer "if this breaks, what breaks downstream?" and "this number looks wrong, where did it come from?"
raw.stripe_payments ──┐
                      ├──> stg_payments ──> fct_revenue ──> exec_dashboard
raw.orders ───────────┘                          │
                                                 └──────> finance.daily_close
Lineage powers three critical operations:
- Impact analysis (downstream): before changing stg_payments, lineage shows it feeds fct_revenue, the exec dashboard, and the finance close. You now know who to notify.
- Root cause analysis (upstream): when exec_dashboard shows wrong revenue, lineage walks you back through fct_revenue to stg_payments to the raw Stripe source - dramatically narrowing the search.
- Blast radius for alerts: an alert that includes the downstream lineage tells responders exactly what consumers are affected (this is the blast_radius field in the Data Quality alert payload).
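dbt generates model-level lineage automatically from ref() and source() calls; column-level lineage requires parsing the SQL itself, which some catalog tools and dbt's hosted offering add on top. Catalog tools (DataHub, OpenMetadata, Atlan, Unity Catalog) also stitch lineage across systems beyond a single dbt project.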
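Both directions of lineage analysis reduce to graph traversal. The sketch below hard-codes the edges from the diagram above as a plain dictionary; in practice the graph would come from dbt artifacts or a catalog API, which this example does not attempt to model.

```python
from collections import deque

# Edges from the diagram above: upstream asset -> direct downstream assets.
LINEAGE = {
    "raw.stripe_payments": ["stg_payments"],
    "raw.orders": ["stg_payments"],
    "stg_payments": ["fct_revenue"],
    "fct_revenue": ["exec_dashboard", "finance.daily_close"],
}


def invert(graph: dict) -> dict:
    """Flip every edge so the graph can be walked from consumer to source."""
    inverted = {}
    for parent, kids in graph.items():
        for kid in kids:
            inverted.setdefault(kid, []).append(parent)
    return inverted


def reachable(graph: dict, start: str) -> set:
    """Every node reachable from `start`, excluding `start` (breadth-first)."""
    seen, queue = set(), deque([start])
    while queue:
        for nxt in graph.get(queue.popleft(), ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def downstream(asset: str) -> set:
    """Impact analysis: what breaks if this asset breaks?"""
    return reachable(LINEAGE, asset)


def upstream(asset: str) -> set:
    """Root cause analysis: where could a bad value have come from?"""
    return reachable(invert(LINEAGE), asset)


print(sorted(downstream("stg_payments")))
print(sorted(upstream("exec_dashboard")))
```

The `downstream` result is also what an alert would attach as its blast radius.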
Observability: The Pillars
Software observability has metrics, logs, and traces. Data observability has its own pillars - the dimensions you monitor to know whether data is healthy.
| Pillar | Question | What you alert on |
|---|---|---|
| Freshness | Is the data recent? | Last update older than SLA |
| Volume | Is the right amount of data here? | Row count outside expected range |
| Distribution | Do the values look normal? | Null rate, mean, cardinality shifts |
| Schema | Did the structure change? | New/dropped/retyped columns |
| Lineage | What is connected to what? | (Used for triage, not alerts) |
# Observability checks emitted as metrics after every pipeline run
def query(conn, sql: str):
    """Run a query on a DB-API connection and return its single scalar result."""
    cur = conn.cursor()
    cur.execute(sql)
    return cur.fetchone()[0]


def emit_observability_metrics(table: str, conn, key_col: str = "key_col") -> dict:
    """Capture the four monitorable pillars for a table (PostgreSQL syntax).

    `table` and `key_col` are interpolated into SQL, so they must come from
    trusted configuration, never from user input.
    """
    return {
        "freshness_minutes": query(conn,
            f"SELECT EXTRACT(EPOCH FROM (NOW() - MAX(updated_at)))/60 FROM {table}"),
        "volume_rows": query(conn,
            f"SELECT COUNT(*) FROM {table}"),
        "distribution_null_rate": query(conn,
            f"SELECT AVG(CASE WHEN {key_col} IS NULL THEN 1 ELSE 0 END) FROM {table}"),
        "schema_fingerprint": query(conn,
            f"SELECT md5(string_agg(column_name||data_type, ',' ORDER BY ordinal_position)) "
            f"FROM information_schema.columns WHERE table_name = '{table}'"),
    }
# Push these to your metrics store; alert when any drifts from baseline.
# Anomaly detection on these series is covered in ./data-quality.
The difference between monitoring and observability: monitoring tells you a known metric crossed a threshold (freshness > 1h). Observability lets you investigate an unknown problem - correlate the freshness miss with a schema change upstream and a volume drop, without writing a new query for each question. Managed platforms (Monte Carlo, Anomalo, Metaplane, Bigeye) automate all four pillars across the whole warehouse; you can build the basics yourself with the checks above plus a metrics backend.
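As a rough sketch of "alert when any drifts from baseline", the function below compares one run's metrics against a stored baseline. The metric keys match the dictionary emitted above; the thresholds, tolerance defaults, and sample values are assumptions for illustration, and real systems usually replace the fixed tolerances with anomaly detection.

```python
def evaluate_metrics(
    current: dict,
    baseline: dict,
    freshness_sla_minutes: float,
    volume_tolerance: float = 0.5,
    max_null_rate_increase: float = 0.05,
) -> list[str]:
    """Return the names of the pillars that drifted, in a fixed order."""
    alerts = []
    if current["freshness_minutes"] > freshness_sla_minutes:
        alerts.append("freshness")
    expected_rows = baseline["volume_rows"]
    if abs(current["volume_rows"] - expected_rows) > volume_tolerance * expected_rows:
        alerts.append("volume")
    null_increase = current["distribution_null_rate"] - baseline["distribution_null_rate"]
    if null_increase > max_null_rate_increase:
        alerts.append("distribution")
    if current["schema_fingerprint"] != baseline["schema_fingerprint"]:
        alerts.append("schema")
    return alerts


# Hypothetical values: a late, half-empty load that coincides with a schema change.
baseline = {"freshness_minutes": 12, "volume_rows": 10_000,
            "distribution_null_rate": 0.01, "schema_fingerprint": "abc"}
current = {"freshness_minutes": 95, "volume_rows": 4_200,
           "distribution_null_rate": 0.02, "schema_fingerprint": "def"}

print(evaluate_metrics(current, baseline, freshness_sla_minutes=60))
```

Seeing the freshness, volume, and schema signals fire together in one result is a small version of the correlation that observability is meant to provide.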
Incident Response for Data
When data breaks, the response follows the same shape as a software incident, adapted for data's silent-failure problem.
Detect → Triage → Mitigate → Resolve → Backfill → Postmortem
- Detect. An observability alert or a consumer report. The goal is to detect via monitoring before a consumer notices - if humans always find it first, your observability is inadequate.
- Triage. Use lineage to scope the blast radius. Which downstream assets are affected? Who depends on them? Is the bad data already serving consumers?
- Mitigate. Stop the bleeding. Often this means halting publish (the quality gate from Data Quality) so consumers see stale-but-correct data rather than fresh-but-wrong data. Stale is usually safer than wrong.
- Resolve. Fix the root cause - the upstream schema change, the broken join, the bad source file.
- Backfill. Reprocess the affected partitions. This is where idempotent pipelines (see ETL & ELT) and replayable sources (see Stream Processing) pay off - you can rerun a date range safely.
- Postmortem. Blameless. Capture the root cause and add a test or monitor so this specific failure cannot recur silently. Every incident should produce a new check.
The cultural rule: a recurring incident with no new test is a process failure. The point of a postmortem is not blame; it is to convert a one-time surprise into a permanent guardrail.
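The backfill step depends on idempotence: rerunning a date range must give the same result as running it once. A minimal sketch of that property follows, using in-memory dictionaries keyed by date as stand-ins for partitioned tables; the `backfill` function and the sample data are hypothetical.

```python
from datetime import date, timedelta


def date_range(start: date, end: date):
    """Yield each day from start to end, inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


def backfill(store: dict, source: dict, start: date, end: date, transform) -> dict:
    """Recompute each partition from source and overwrite it (never append)."""
    for day in date_range(start, end):
        store[day] = [transform(row) for row in source.get(day, [])]
    return store


def to_cents(row: dict) -> dict:
    """Example transformation: convert a dollar amount to integer cents."""
    return {"amount_cents": row["amount"] * 100}


source = {
    date(2024, 6, 1): [{"amount": 10}, {"amount": 5}],
    date(2024, 6, 2): [{"amount": 7}],
}
# The June 1 partition was written incorrectly during the incident.
store = {date(2024, 6, 1): [{"amount_cents": 10}], date(2024, 6, 2): [{"amount_cents": 700}]}

backfill(store, source, date(2024, 6, 1), date(2024, 6, 2), to_cents)
after_first_run = {day: list(rows) for day, rows in store.items()}
backfill(store, source, date(2024, 6, 1), date(2024, 6, 2), to_cents)
print(store == after_first_run)  # rerunning changes nothing
```

Because each partition is overwritten rather than appended to, a second run cannot create duplicates, so responders can rerun the range without first working out what has already been fixed.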
Environments and Deployment
Data needs the same environment isolation as application code, with an extra wrinkle: data itself must be isolated, not just code.
| Environment | Code | Data | Purpose |
|---|---|---|---|
| dev | Feature branch | Sample / subset / cloned | Build and iterate fast |
| staging | Merged, pre-prod | Prod-like (cloned or recent copy) | Validate against realistic data |
| prod | Released | Real production data | Serve consumers |
# Environment-aware configuration; the same code, different targets
ENVIRONMENTS = {
    "dev":     {"schema": "dev_{user}", "source": "sampled_1pct", "alerts": False},
    "staging": {"schema": "staging",    "source": "prod_clone",   "alerts": "slack"},
    "prod":    {"schema": "analytics",  "source": "production",   "alerts": "pagerduty"},
}
Two practices that make this work: zero-copy clones (Snowflake/Databricks let you clone production data into staging instantly and cheaply, so you test against realistic data without duplicating storage) and blue-green deployment for tables (build the new version alongside the old, validate, then atomically swap - the write-audit-publish pattern again). Never test transformations against a tiny synthetic dataset and assume they hold at production scale and skew; they often do not.
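The blue-green swap can be illustrated with SQLite from the Python standard library, since it supports transactional DDL. This is only a sketch of the pattern: warehouses have their own swap or rename statements, and the table names and validation rule here are made up for the example.

```python
import sqlite3


def blue_green_swap(conn, live: str, candidate: str, validate) -> bool:
    """Promote `candidate` to `live` only if validation passes.

    Table names are interpolated into SQL, so they must come from trusted code.
    """
    if not validate(conn, candidate):
        return False  # audit failed: consumers keep reading the old table
    conn.execute("BEGIN")
    try:
        conn.execute(f"ALTER TABLE {live} RENAME TO {live}_old")
        conn.execute(f"ALTER TABLE {candidate} RENAME TO {live}")
        conn.execute(f"DROP TABLE {live}_old")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return True


def has_rows_and_no_null_totals(conn, table: str) -> bool:
    """Audit step: the candidate must be non-empty with no NULL final_total."""
    total, nulls = conn.execute(
        f"SELECT COUNT(*), SUM(final_total IS NULL) FROM {table}"
    ).fetchone()
    return total > 0 and nulls == 0


# isolation_level=None disables implicit transactions, so BEGIN/COMMIT are explicit.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE fct_revenue (order_id INTEGER, final_total REAL)")
conn.execute("INSERT INTO fct_revenue VALUES (1, 90.0)")
conn.execute("CREATE TABLE fct_revenue_new (order_id INTEGER, final_total REAL)")
conn.execute("INSERT INTO fct_revenue_new VALUES (1, 90.0), (2, 50.0)")

swapped = blue_green_swap(conn, "fct_revenue", "fct_revenue_new", has_rows_and_no_null_totals)
row_count = conn.execute("SELECT COUNT(*) FROM fct_revenue").fetchone()[0]
print(swapped, row_count)
```

Readers of `fct_revenue` see either the old table or the fully validated new one, never a half-built state.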
Cost and Governance Ops
DataOps is not only about reliability - in cloud warehouses, compute is metered, and an unwatched pipeline can quietly burn money.
- Cost attribution. Tag queries and jobs by team and pipeline so spend is traceable. A single mis-written incremental model doing full scans can dominate the bill; attribution is how you find it.
- Cost monitoring. Alert on cost anomalies the same way you alert on data anomalies. A 5x jump in warehouse spend overnight is an incident.
- Optimization. Partition pruning, incremental models, right-sized warehouses, and killing unused tables. Most warehouse bills have substantial easy savings in models that scan more than they need.
- Governance. Access control, PII classification, audit logging, and retention policies. Increasingly enforced through the catalog and contracts (see Data Contracts) rather than tribal knowledge.
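The cost-monitoring bullet above can be sketched as a simple comparison of the latest day's spend against a trailing median, per attribution tag. The tag format, the credit figures, and the 5x factor are illustrative assumptions; the input would normally come from the warehouse's usage views.

```python
from statistics import median


def cost_anomalies(daily_spend: dict[str, list[float]], factor: float = 5.0) -> dict[str, float]:
    """Flag tags whose latest daily spend is at least `factor` times the
    median of the preceding days. Returns tag -> observed ratio."""
    flagged = {}
    for tag, series in daily_spend.items():
        if len(series) < 2:
            continue  # not enough history to form a baseline
        *history, latest = series
        baseline = median(history)
        if baseline > 0 and latest >= factor * baseline:
            flagged[tag] = latest / baseline
    return flagged


# Hypothetical daily credits per query tag, oldest first.
daily_spend = {
    "team=finance;pipeline=fct_revenue": [4.0, 5.0, 4.5, 26.0],
    "team=growth;pipeline=events": [10.0, 11.0, 9.0, 10.5],
}

flagged = cost_anomalies(daily_spend)
for tag, ratio in flagged.items():
    print(f"{tag}: {ratio:.2f}x the trailing median")
```

Using the median rather than the mean keeps one earlier spike from inflating the baseline and hiding the next one.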
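-- Find the expensive queries worth optimizing (Snowflake example).
-- query_history reports only cloud-services credits per query, so total
-- elapsed time is used here as a rough proxy for warehouse compute.
SELECT
    query_tag,
    COUNT(*) AS executions,
    SUM(credits_used_cloud_services) AS cloud_services_credits,
    SUM(total_elapsed_time) / 1000 AS total_seconds,
    AVG(total_elapsed_time) / 1000 AS avg_seconds
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
GROUP BY query_tag
ORDER BY total_seconds DESC
LIMIT 20;
Putting It Together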
DataOps is the operating system for everything else in this section. A mature data platform looks like:
- Everything in version control - pipelines, SQL, schemas, contracts, even infrastructure.
- CI that builds and tests changes in isolation - nothing reaches prod unvalidated.
- Layered tests - unit for logic, quality for data, contracts for structure.
- Observability on freshness, volume, distribution, and schema - with lineage for triage.
- A real incident process - detect via monitoring, triage via lineage, mitigate by halting, and convert every incident into a new test.
- Isolated environments and atomic deployment - test against realistic data, swap safely.
- Cost and governance as first-class ops - because in the cloud, reliability and spend are the same conversation.
None of this is exotic - it is the same engineering discipline that made software delivery reliable, applied to a domain where the failures are quieter and the trust is harder-won. Teams that adopt DataOps stop firefighting and start shipping data products their organization can actually depend on.