Steven's Knowledge

Best Practices

Code organization, testing, observability, capacity, scaling, governance, common pitfalls

Best Practices

The operational realities of running an orchestrator across many teams and many pipelines.

Code Organization

DAGs grow into hundreds. Without structure, the repo becomes unmaintainable.

A workable layout:

airflow_repo/
├── dags/
│   ├── domain_a/                   # by business domain
│   │   ├── extract_stripe.py
│   │   ├── transform_revenue.py
│   │   └── ...
│   ├── domain_b/
│   └── platform/                   # platform-wide / cross-team
├── plugins/                         # custom operators
│   └── my_operators.py
├── tests/
│   ├── test_dags.py                # DAG import + linting
│   └── test_operators.py           # unit tests
├── utils/                           # shared helpers
│   ├── slack_alerts.py
│   └── data_validations.py
└── requirements.txt

Patterns:

  • One DAG per file — even a small DAG; easier to find and review.
  • CODEOWNERS by directory — each domain owned by a team.
  • Shared utils in a package; don't copy code across DAGs.
  • Custom operators for repeated patterns (e.g., "send to our Slack with this format" becomes one operator, not 50 copies).

For Dagster, similar but organized by asset/group instead of by DAG.

Testing

DAGs are code. Test them.

DAG-import test

# tests/test_dags.py
from airflow.models.dagbag import DagBag

def test_all_dags_import_cleanly():
    dag_bag = DagBag(dag_folder='dags', include_examples=False)
    assert len(dag_bag.import_errors) == 0, dag_bag.import_errors
    assert len(dag_bag.dags) > 0

Catches: syntax errors, missing imports, circular dependencies, undefined operators. Run in CI on every PR.

Operator unit tests

def test_extract_data():
    result = extract_data_fn(date="2026-05-21")
    assert len(result) > 0
    assert all('region' in r for r in result)

Test the function the operator calls, not the orchestrator. The orchestrator's job is dispatch — your code is what you test.

Integration tests

# Bring up Airflow in a test container, run a DAG end-to-end
def test_full_pipeline():
    airflow = AirflowTestEnv()
    airflow.run_dag("daily_revenue", run_date="2026-05-21")
    assert airflow.task_state("daily_revenue", "publish") == "success"

Slower; run on a schedule, not every PR.

Observability

The orchestrator is itself a system you operate.

Key metrics:

  • DAG success rate (per DAG, per team) — should be >95% baseline
  • Task duration (P50/P99) — alert on regression
  • Scheduler heartbeat — orchestrator process health
  • Task queue length — backlog implies under-scaling
  • Worker utilization — under 30% suggests over-provisioning; over 80% suggests under-provisioning

Wire to Monitoring:

# Airflow's StatsD exporter
[metrics]
statsd_on = True
statsd_host = statsd-exporter
statsd_port = 9125
statsd_prefix = airflow

statsd_exporter translates to Prometheus, then Grafana.

Alert on:

  • A DAG hasn't run for > schedule interval (didn't trigger)
  • Failure rate above team's tolerance
  • Scheduler heartbeat missing
  • Backlog growing for > N minutes

Capacity Planning

Workers run tasks. Plan for peak (often nightly window between 02:00-06:00 UTC).

Rules of thumb:

  • Each Celery worker = 1-8 task slots (configurable). Pods = 1 task each in K8s executor.
  • Sensors in reschedule mode hold no worker.
  • Heavy tasks (large pandas, Spark driver) need bigger workers.
  • Light tasks (trigger external API) need many small.

In Airflow with KubernetesExecutor: each task is its own pod, autoscales naturally. CeleryExecutor with autoscaling Celery: similar effect but heavier.

For Dagster: similar — dagster-k8s for each run as a pod.

A common growth pattern:

  1. Start: single-VM Airflow, LocalExecutor
  2. ~50 DAGs / 1000 tasks/day: CeleryExecutor with 2-4 workers
  3. ~500 DAGs / 50k tasks/day: KubernetesExecutor, autoscaling
  4. Multi-team: separate Airflow per team OR one big Airflow with strong RBAC

Scaling Across Teams

When 10 teams use one orchestrator:

  • RBAC: each team sees only their DAGs (Airflow has role-based access for this)
  • Resource quotas: per-team task-slot limits prevent one team from monopolizing
  • Naming convention: {team}_{purpose} so listings are findable
  • Cost attribution: tag tasks to teams; report monthly
  • Separate environments: dev / staging / prod Airflows; rarely just one

At some scale, "one Airflow for all" breaks. Split:

  • Per-team Airflow — each team owns one; cross-team dependencies via dataset triggers
  • Hub-and-spoke — central platform Airflow, team Airflows for specifics
  • Dagster Cloud / Astronomer / Prefect Cloud — SaaS that handles the scaling

The right size is "biggest single thing your platform team comfortably operates." Push for bigger before splitting.

Governance

The orchestrator becomes a critical system; treat it like one.

  • Code review for DAG changes — PR to merge
  • Change management — production DAGs don't change at 4 PM Friday
  • Documentationdescription= on every DAG; doc_md= for context
  • Ownershipowner field per DAG; CODEOWNERS in Git; on-call when failures
  • Lineage — track which dashboards / models depend on which DAG outputs

Migration Strategy

Migrating from cron / ad-hoc / older Airflow to a new orchestrator is its own project:

  1. Inventory existing schedules
  2. Categorize: production-critical / important / cleanup
  3. Pick one team's pipelines to migrate first
  4. Migrate critical first (motivates everyone), or non-critical first (lower risk) — depends on org
  5. Run new + old in parallel for verification (one validates the other)
  6. Cut over to new; remove old after a quiet period

Never "big bang" migrate — gradual is much safer.

Disaster Recovery

The orchestrator stores metadata (history, schedule state, secrets). Back it up.

  • Airflow's metadata DB (Postgres / MySQL) — standard DB backup
  • Dagster's daemon storage — backup
  • DAG / asset code — already in Git
  • Secrets — in Vault ideally; don't depend solely on Airflow Connections

A 1-day outage on the orchestrator = 1 day of missed pipelines. Plan for it in Disaster Recovery.

Performance

Slow scheduler is a common pain. Causes:

  • Too many DAG files (each one takes time to parse)
  • Heavy top-level code in DAGs (anything outside with DAG(...): runs every parse)
  • Inefficient queries on the metadata DB

Fixes:

  • Move heavy imports / computation inside Python callables, not at module load
  • Use min_file_process_interval (Airflow) to throttle DAG parsing
  • Bigger metadata DB; turn on query optimization
  • Reduce DAG count by combining where it makes sense (but not too much — see anti-patterns)

For Dagster: similar — keep asset functions slim; heavy work inside.

Cost Management

The orchestrator itself isn't expensive; the work it triggers is. Strategies:

  • Schedule wisely: not all jobs need to run at midnight. Spread out → smaller peak compute.
  • Right-size workers: 2-CPU pods for most; bigger only for genuinely heavy tasks.
  • Spot/preemptible workers: orchestrators with K8s executors run on Spot well — Karpenter / Spot.io.
  • Kill dead DAGs: 2-year-old DAGs that nobody knows about still consume resources.

Common Pitfalls

DAG-as-code anti-patterns: heavy work at module top-level (parsed on every schedule loop). Keep DAG definition lean; do work inside operators.

Hard-coded dates: start_date=datetime.now() — always set a fixed start_date.

No retries: transient failures (network, API rate limit) take down nightly pipelines. Sensible default retries everywhere.

Too many retries: an auth failure that retries 100 times wastes compute and obscures the issue. Don't retry permanent failures.

Untyped data flow: XCom is Any. Misuse goes silent. Dagster's typed I/O catches this; in Airflow be disciplined.

Untested DAGs: DAGs are software. Test in CI.

Pipeline drift: DAGs in production no one tests in dev. Use a dev environment with realistic data.

No alert thresholds calibrated: every failure pages someone. After a month, on-call ignores all of them. Tune alert thresholds; suppress noise.

Schedule overlap: a long-running DAG starts before the previous finishes. Set max_active_runs=1 unless you really want overlap.

Checklist

Orchestrator production readiness:

  • DAGs in version control, reviewed via PR
  • CI runs DAG import test on every PR
  • Each DAG has owner, retries, alerts on failure
  • Idempotent: re-running any task produces same result
  • Backfills tested for historically-dated DAGs
  • Sensors used (not "schedule late and hope") for external-condition triggers
  • Heavy compute pushed to warehouse / Spark / dedicated workers, not inside scheduler
  • Secrets via Vault / cloud-native secret manager (not env vars in DAG)
  • Metadata DB backed up
  • Scheduler health + DAG success rate monitored
  • Alerts: missed schedules, repeated failures, scheduler heartbeat
  • Capacity sized for peak (nightly window)
  • RBAC if multi-team; per-team task slot limits
  • Dead DAGs cleaned quarterly
  • Documentation: every DAG has description + ownership
  • Data quality gates on critical pipelines

What's Next

You have an orchestrator practice. Connect it to:

  • Data Warehouses — most pipelines orchestrate dbt + warehouse
  • Stream Processing — batch and stream complement each other
  • MLOps — ML pipelines often live in the orchestrator
  • Monitoring — pipeline metrics + alerts
  • Secrets — Vault holds pipeline credentials
  • FinOps — pipeline cost attribution per team

On this page