Steven's Knowledge

Workflow Orchestration

Airflow, Dagster, Prefect, Argo Workflows - scheduling, dependency management, and observability for DAG-based data and infrastructure pipelines

Workflow Orchestration

A workflow orchestrator runs the pipelines: nightly ETL, hourly model rebuilds, weekly reports, retraining jobs, infrastructure tasks. It manages DAGs (directed acyclic graphs of tasks), handles dependencies, schedules runs, retries failures, and lets you see what happened.

This is distinct from the queue-style job systems in Background Jobs. Those handle "user signed up, send welcome email." Orchestrators handle "every day at 03:00 UTC, extract from these 50 sources, run these 200 dbt models, run these 30 quality tests, train these 5 models, refresh these dashboards, and tell me what broke."

It's also distinct from Stream Processing — orchestrators run batches; stream engines run continuously over events.

Why a Real Orchestrator

Without (cron + scripts)With orchestrator
Tasks succeed independently; dependencies via timingDependency graph; downstream waits for upstream
Failure = silent log in /var/log/...Failure surfaces in a UI; retries; alerts
Reruns require manual interventionBackfill any time range, any subset
"Did yesterday's job actually run?"History page shows it
Each pipeline reinvents schedule/retry/timeout logicOne orchestrator handles it for all
50 cron jobs, no one knows what's whereCataloged, searchable, owned
Cross-team integration via shared databasesCross-team via task dependencies

The bar is low: anything more than ~10 cron jobs benefits from an orchestrator. Anything that touches multiple data sources, has dependencies, or matters to the business demands one.

The Players

ToolStyleBest for
Apache AirflowPythonic DAGs; Operators for everythingMost popular; mature ecosystem; heavy
DagsterAsset-centric (output-focused); strong typesModern data platforms; software-engineering culture
PrefectPythonic; hybrid execution; "negative engineering" focusFaster Airflow alternative; Cloud + OSS
Argo WorkflowsKubernetes-native; YAML/CRDsK8s-first orgs; ML pipelines
Kubeflow PipelinesArgo + ML SDKML training pipelines
TemporalCode-as-workflow; durable; long-runningApplication workflows; less data-eng
AWS Step FunctionsJSON state machine; managedAWS-native; light orchestration
GCP Cloud Workflows / ComposerManaged; Composer = managed AirflowGCP-native
Azure Data FactoryVisual + codeAzure-native ETL
MageOSS modern Airflow alternativeNewer; Python notebook-style
FlyteStrongly-typed; ML/dataCross-cloud; type safety; Lyft origin

How to pick:

  • Data engineering / ETL, mature orgAirflow (you'll inherit one anyway)
  • Modern data platform, asset-thinkingDagster
  • Python-first, easier developer UXPrefect
  • Kubernetes-native, ML/CI pipelinesArgo Workflows
  • Long-running application workflows (sagas, business processes) → Temporal
  • AWS-only, light needsStep Functions

For a fresh data platform in 2026: Dagster is the most modern choice; Airflow is the safe choice; Prefect sits between.

A DAG, Concretely

# Airflow example
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

with DAG(
    "daily_analytics",
    start_date=datetime(2026, 1, 1),
    schedule="0 3 * * *",          # 03:00 UTC daily
    catchup=False,
    default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract = PythonOperator(task_id="extract_from_stripe", python_callable=extract_stripe)
    transform = PythonOperator(task_id="run_dbt", python_callable=run_dbt)
    quality = PythonOperator(task_id="run_dbt_tests", python_callable=run_dbt_tests)
    notify = PythonOperator(task_id="post_to_slack", python_callable=post_to_slack)

    extract >> transform >> quality >> notify

Five things to notice:

  1. Schedule: standard cron expression.
  2. Dependencies: >> operator. If extract fails, transform doesn't run.
  3. Retries: failures retry automatically.
  4. Idempotent design: each task should be safely re-runnable.
  5. Catchup: if the orchestrator was off for 3 days, should it run the missed days? (Usually False.)

Asset-Centric (Dagster)

Dagster reframes from "tasks" to "assets" — the data that gets produced. Same pipeline:

from dagster import asset

@asset
def stripe_payments() -> DataFrame:
    return fetch_stripe()

@asset(deps=[stripe_payments])
def daily_revenue(stripe_payments: DataFrame) -> DataFrame:
    return stripe_payments.groupby('day').sum()

@asset(deps=[daily_revenue])
def revenue_dashboard():
    refresh_grafana()

The DAG is inferred from the data dependencies. Each asset has a lineage, can be backfilled per-asset, has a history of runs and freshness. This is closer to how engineers actually think about pipelines — outputs, not steps.

Common Workflow Patterns

PatternExample
Daily ETLSource → load → transform → publish
Hourly model rebuildRead source updates → re-materialize dbt models
BackfillRe-run a window of historical data (after schema change, bug fix, new column)
Sensor-drivenWait for a file in S3, an event in Kafka, an external API condition
Fan-out / mapOne task spawns N parallel tasks (one per region, one per customer batch)
ML pipelineFeaturize → train → evaluate → register → deploy
Approval gatePause for human review before continuing
Long-runningA task that may take hours; orchestrator handles checkpoints

What the Orchestrator Owns vs. Doesn't

A common confusion: the orchestrator schedules and observes; it usually doesn't process the data itself.

Orchestrator (Airflow / Dagster / Prefect)
  ├── triggers → dbt run (which talks to warehouse)
  ├── triggers → Fivetran sync (which copies data)
  ├── triggers → Spark job on Databricks (which crunches)
  ├── triggers → custom Python on a worker (heavy lifting)
  └── observes → did each one succeed? retry? notify?

The orchestrator's job is dispatch + dependency + visibility. The actual computation lives elsewhere — your warehouse, Spark, an API, a container. This is healthy: the orchestrator's process stays small while heavy work scales out.

Anti-pattern: doing the data processing inside the orchestrator's worker (pandas.read_sql → join 50GB in memory). The worker dies; the orchestrator becomes flaky.

Learning Path

When You Don't Need One

Honest cases:

  • Under 10 scheduled jobs, no inter-dependencies — cron + a Slack notifier is fine.
  • Streaming-only workloads — use a Stream Processing engine.
  • App-level business workflows — saga patterns and durable execution are better served by Temporal than a data orchestrator.
  • Single-tenant, single-script ETL — dbt + a cron trigger is enough; add an orchestrator when you have more than one pipeline.

The orchestrator is the central nervous system of a mature data platform. Build it sloppily and every team feels the friction. Build it well — clean DAG patterns, good observability, sane retry behavior, owned schedules — and it disappears into the background, the way good infrastructure should. The path is rarely "buy one big tool"; it's "start with what fits, evolve to match the platform's complexity."

On this page