Workflow Orchestration
Airflow, Dagster, Prefect, Argo Workflows - scheduling, dependency management, and observability for DAG-based data and infrastructure pipelines
Workflow Orchestration
A workflow orchestrator runs the pipelines: nightly ETL, hourly model rebuilds, weekly reports, retraining jobs, infrastructure tasks. It manages DAGs (directed acyclic graphs of tasks), handles dependencies, schedules runs, retries failures, and lets you see what happened.
This is distinct from the queue-style job systems in Background Jobs. Those handle "user signed up, send welcome email." Orchestrators handle "every day at 03:00 UTC, extract from these 50 sources, run these 200 dbt models, run these 30 quality tests, train these 5 models, refresh these dashboards, and tell me what broke."
It's also distinct from Stream Processing — orchestrators run batches; stream engines run continuously over events.
Why a Real Orchestrator
| Without (cron + scripts) | With orchestrator |
|---|---|
| Tasks succeed independently; dependencies via timing | Dependency graph; downstream waits for upstream |
Failure = silent log in /var/log/... | Failure surfaces in a UI; retries; alerts |
| Reruns require manual intervention | Backfill any time range, any subset |
| "Did yesterday's job actually run?" | History page shows it |
| Each pipeline reinvents schedule/retry/timeout logic | One orchestrator handles it for all |
| 50 cron jobs, no one knows what's where | Cataloged, searchable, owned |
| Cross-team integration via shared databases | Cross-team via task dependencies |
The bar is low: anything more than ~10 cron jobs benefits from an orchestrator. Anything that touches multiple data sources, has dependencies, or matters to the business demands one.
The Players
| Tool | Style | Best for |
|---|---|---|
| Apache Airflow | Pythonic DAGs; Operators for everything | Most popular; mature ecosystem; heavy |
| Dagster | Asset-centric (output-focused); strong types | Modern data platforms; software-engineering culture |
| Prefect | Pythonic; hybrid execution; "negative engineering" focus | Faster Airflow alternative; Cloud + OSS |
| Argo Workflows | Kubernetes-native; YAML/CRDs | K8s-first orgs; ML pipelines |
| Kubeflow Pipelines | Argo + ML SDK | ML training pipelines |
| Temporal | Code-as-workflow; durable; long-running | Application workflows; less data-eng |
| AWS Step Functions | JSON state machine; managed | AWS-native; light orchestration |
| GCP Cloud Workflows / Composer | Managed; Composer = managed Airflow | GCP-native |
| Azure Data Factory | Visual + code | Azure-native ETL |
| Mage | OSS modern Airflow alternative | Newer; Python notebook-style |
| Flyte | Strongly-typed; ML/data | Cross-cloud; type safety; Lyft origin |
How to pick:
- Data engineering / ETL, mature org → Airflow (you'll inherit one anyway)
- Modern data platform, asset-thinking → Dagster
- Python-first, easier developer UX → Prefect
- Kubernetes-native, ML/CI pipelines → Argo Workflows
- Long-running application workflows (sagas, business processes) → Temporal
- AWS-only, light needs → Step Functions
For a fresh data platform in 2026: Dagster is the most modern choice; Airflow is the safe choice; Prefect sits between.
A DAG, Concretely
# Airflow example
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
with DAG(
"daily_analytics",
start_date=datetime(2026, 1, 1),
schedule="0 3 * * *", # 03:00 UTC daily
catchup=False,
default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
) as dag:
extract = PythonOperator(task_id="extract_from_stripe", python_callable=extract_stripe)
transform = PythonOperator(task_id="run_dbt", python_callable=run_dbt)
quality = PythonOperator(task_id="run_dbt_tests", python_callable=run_dbt_tests)
notify = PythonOperator(task_id="post_to_slack", python_callable=post_to_slack)
extract >> transform >> quality >> notifyFive things to notice:
- Schedule: standard cron expression.
- Dependencies:
>>operator. Ifextractfails,transformdoesn't run. - Retries: failures retry automatically.
- Idempotent design: each task should be safely re-runnable.
- Catchup: if the orchestrator was off for 3 days, should it run the missed days? (Usually
False.)
Asset-Centric (Dagster)
Dagster reframes from "tasks" to "assets" — the data that gets produced. Same pipeline:
from dagster import asset
@asset
def stripe_payments() -> DataFrame:
return fetch_stripe()
@asset(deps=[stripe_payments])
def daily_revenue(stripe_payments: DataFrame) -> DataFrame:
return stripe_payments.groupby('day').sum()
@asset(deps=[daily_revenue])
def revenue_dashboard():
refresh_grafana()The DAG is inferred from the data dependencies. Each asset has a lineage, can be backfilled per-asset, has a history of runs and freshness. This is closer to how engineers actually think about pipelines — outputs, not steps.
Common Workflow Patterns
| Pattern | Example |
|---|---|
| Daily ETL | Source → load → transform → publish |
| Hourly model rebuild | Read source updates → re-materialize dbt models |
| Backfill | Re-run a window of historical data (after schema change, bug fix, new column) |
| Sensor-driven | Wait for a file in S3, an event in Kafka, an external API condition |
| Fan-out / map | One task spawns N parallel tasks (one per region, one per customer batch) |
| ML pipeline | Featurize → train → evaluate → register → deploy |
| Approval gate | Pause for human review before continuing |
| Long-running | A task that may take hours; orchestrator handles checkpoints |
What the Orchestrator Owns vs. Doesn't
A common confusion: the orchestrator schedules and observes; it usually doesn't process the data itself.
Orchestrator (Airflow / Dagster / Prefect)
├── triggers → dbt run (which talks to warehouse)
├── triggers → Fivetran sync (which copies data)
├── triggers → Spark job on Databricks (which crunches)
├── triggers → custom Python on a worker (heavy lifting)
└── observes → did each one succeed? retry? notify?The orchestrator's job is dispatch + dependency + visibility. The actual computation lives elsewhere — your warehouse, Spark, an API, a container. This is healthy: the orchestrator's process stays small while heavy work scales out.
Anti-pattern: doing the data processing inside the orchestrator's worker (pandas.read_sql → join 50GB in memory). The worker dies; the orchestrator becomes flaky.
Learning Path
1. Getting Started
Spin up Airflow locally; write a DAG; trigger a dbt run; explore Dagster's asset model; compare execution
2. Patterns
DAG design, sensors, backfills, dynamic task generation, retries and idempotency, secrets, ML pipelines
3. Best Practices
Code organization, testing, observability, capacity, scaling, governance, common pitfalls
When You Don't Need One
Honest cases:
- Under 10 scheduled jobs, no inter-dependencies — cron + a Slack notifier is fine.
- Streaming-only workloads — use a Stream Processing engine.
- App-level business workflows — saga patterns and durable execution are better served by Temporal than a data orchestrator.
- Single-tenant, single-script ETL — dbt + a cron trigger is enough; add an orchestrator when you have more than one pipeline.
The orchestrator is the central nervous system of a mature data platform. Build it sloppily and every team feels the friction. Build it well — clean DAG patterns, good observability, sane retry behavior, owned schedules — and it disappears into the background, the way good infrastructure should. The path is rarely "buy one big tool"; it's "start with what fits, evolve to match the platform's complexity."