Airflow, Dagster, Prefect, Argo Workflows - scheduling, dependency management, and observability for DAG-based data and infrastructure pipelines

Workflow Orchestration

A workflow orchestrator runs the pipelines: nightly ETL, hourly model rebuilds, weekly reports, retraining jobs, infrastructure tasks. It manages DAGs (directed acyclic graphs of tasks), handles dependencies, schedules runs, retries failures, and lets you see what happened.

This is distinct from the queue-style job systems in Background Jobs. Those handle "user signed up, send welcome email." Orchestrators handle "every day at 03:00 UTC, extract from these 50 sources, run these 200 dbt models, run these 30 quality tests, train these 5 models, refresh these dashboards, and tell me what broke."

It's also distinct from Stream Processing — orchestrators run batches; stream engines run continuously over events.

Why a Real Orchestrator

Without (cron + scripts)	With orchestrator
Tasks succeed independently; dependencies via timing	Dependency graph; downstream waits for upstream
Failure = silent log in `/var/log/...`	Failure surfaces in a UI; retries; alerts
Reruns require manual intervention	Backfill any time range, any subset
"Did yesterday's job actually run?"	History page shows it
Each pipeline reinvents schedule/retry/timeout logic	One orchestrator handles it for all
50 cron jobs, no one knows what's where	Cataloged, searchable, owned
Cross-team integration via shared databases	Cross-team via task dependencies

The bar is low: anything more than ~10 cron jobs benefits from an orchestrator. Anything that touches multiple data sources, has dependencies, or matters to the business demands one.

The Players

Tool	Style	Best for
Apache Airflow	Pythonic DAGs; Operators for everything	Most popular; mature ecosystem; heavy
Dagster	Asset-centric (output-focused); strong types	Modern data platforms; software-engineering culture
Prefect	Pythonic; hybrid execution; "negative engineering" focus	Faster Airflow alternative; Cloud + OSS
Argo Workflows	Kubernetes-native; YAML/CRDs	K8s-first orgs; ML pipelines
Kubeflow Pipelines	Argo + ML SDK	ML training pipelines
Temporal	Code-as-workflow; durable; long-running	Application workflows; less data-eng
AWS Step Functions	JSON state machine; managed	AWS-native; light orchestration
GCP Cloud Workflows / Composer	Managed; Composer = managed Airflow	GCP-native
Azure Data Factory	Visual + code	Azure-native ETL
Mage	OSS modern Airflow alternative	Newer; Python notebook-style
Flyte	Strongly-typed; ML/data	Cross-cloud; type safety; Lyft origin

How to pick:

Data engineering / ETL, mature org → Airflow (you'll inherit one anyway)
Modern data platform, asset-thinking → Dagster
Python-first, easier developer UX → Prefect
Kubernetes-native, ML/CI pipelines → Argo Workflows
Long-running application workflows (sagas, business processes) → Temporal
AWS-only, light needs → Step Functions

For a fresh data platform in 2026: Dagster is the most modern choice; Airflow is the safe choice; Prefect sits between.

A DAG, Concretely

# Airflow example
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

with DAG(
    "daily_analytics",
    start_date=datetime(2026, 1, 1),
    schedule="0 3 * * *",          # 03:00 UTC daily
    catchup=False,
    default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract = PythonOperator(task_id="extract_from_stripe", python_callable=extract_stripe)
    transform = PythonOperator(task_id="run_dbt", python_callable=run_dbt)
    quality = PythonOperator(task_id="run_dbt_tests", python_callable=run_dbt_tests)
    notify = PythonOperator(task_id="post_to_slack", python_callable=post_to_slack)

    extract >> transform >> quality >> notify

Five things to notice:

Schedule: standard cron expression.
Dependencies: >> operator. If extract fails, transform doesn't run.
Retries: failures retry automatically.
Idempotent design: each task should be safely re-runnable.
Catchup: if the orchestrator was off for 3 days, should it run the missed days? (Usually False.)

Asset-Centric (Dagster)

Dagster reframes from "tasks" to "assets" — the data that gets produced. Same pipeline:

from dagster import asset

@asset
def stripe_payments() -> DataFrame:
    return fetch_stripe()

@asset(deps=[stripe_payments])
def daily_revenue(stripe_payments: DataFrame) -> DataFrame:
    return stripe_payments.groupby('day').sum()

@asset(deps=[daily_revenue])
def revenue_dashboard():
    refresh_grafana()

The DAG is inferred from the data dependencies. Each asset has a lineage, can be backfilled per-asset, has a history of runs and freshness. This is closer to how engineers actually think about pipelines — outputs, not steps.

Common Workflow Patterns

Pattern	Example
Daily ETL	Source → load → transform → publish
Hourly model rebuild	Read source updates → re-materialize dbt models
Backfill	Re-run a window of historical data (after schema change, bug fix, new column)
Sensor-driven	Wait for a file in S3, an event in Kafka, an external API condition
Fan-out / map	One task spawns N parallel tasks (one per region, one per customer batch)
ML pipeline	Featurize → train → evaluate → register → deploy
Approval gate	Pause for human review before continuing
Long-running	A task that may take hours; orchestrator handles checkpoints

What the Orchestrator Owns vs. Doesn't

A common confusion: the orchestrator schedules and observes; it usually doesn't process the data itself.

Orchestrator (Airflow / Dagster / Prefect)
  ├── triggers → dbt run (which talks to warehouse)
  ├── triggers → Fivetran sync (which copies data)
  ├── triggers → Spark job on Databricks (which crunches)
  ├── triggers → custom Python on a worker (heavy lifting)
  └── observes → did each one succeed? retry? notify?

The orchestrator's job is dispatch + dependency + visibility. The actual computation lives elsewhere — your warehouse, Spark, an API, a container. This is healthy: the orchestrator's process stays small while heavy work scales out.

Anti-pattern: doing the data processing inside the orchestrator's worker (pandas.read_sql → join 50GB in memory). The worker dies; the orchestrator becomes flaky.

Learning Path

1. Getting Started

Spin up Airflow locally; write a DAG; trigger a dbt run; explore Dagster's asset model; compare execution

2. Patterns

DAG design, sensors, backfills, dynamic task generation, retries and idempotency, secrets, ML pipelines

3. Best Practices

Code organization, testing, observability, capacity, scaling, governance, common pitfalls

When You Don't Need One

Honest cases:

Under 10 scheduled jobs, no inter-dependencies — cron + a Slack notifier is fine.
Streaming-only workloads — use a Stream Processing engine.
App-level business workflows — saga patterns and durable execution are better served by Temporal than a data orchestrator.
Single-tenant, single-script ETL — dbt + a cron trigger is enough; add an orchestrator when you have more than one pipeline.

The orchestrator is the central nervous system of a mature data platform. Build it sloppily and every team feels the friction. Build it well — clean DAG patterns, good observability, sane retry behavior, owned schedules — and it disappears into the background, the way good infrastructure should. The path is rarely "buy one big tool"; it's "start with what fits, evolve to match the platform's complexity."

Workflow Orchestration

1. Getting Started

2. Patterns

3. Best Practices

On this page