Best Practices

The operational realities of running an orchestrator across many teams and many pipelines.

Code Organization

DAGs grow into hundreds. Without structure, the repo becomes unmaintainable.

A workable layout:

airflow_repo/
├── dags/
│   ├── domain_a/                   # by business domain
│   │   ├── extract_stripe.py
│   │   ├── transform_revenue.py
│   │   └── ...
│   ├── domain_b/
│   └── platform/                   # platform-wide / cross-team
├── plugins/                         # custom operators
│   └── my_operators.py
├── tests/
│   ├── test_dags.py                # DAG import + linting
│   └── test_operators.py           # unit tests
├── utils/                           # shared helpers
│   ├── slack_alerts.py
│   └── data_validations.py
└── requirements.txt

Patterns:

One DAG per file — even a small DAG; easier to find and review.
CODEOWNERS by directory — each domain owned by a team.
Shared utils in a package; don't copy code across DAGs.
Custom operators for repeated patterns (e.g., "send to our Slack with this format" becomes one operator, not 50 copies).

For Dagster, similar but organized by asset/group instead of by DAG.

Testing

DAGs are code. Test them.

DAG-import test

# tests/test_dags.py
from airflow.models.dagbag import DagBag

def test_all_dags_import_cleanly():
    dag_bag = DagBag(dag_folder='dags', include_examples=False)
    assert len(dag_bag.import_errors) == 0, dag_bag.import_errors
    assert len(dag_bag.dags) > 0

Catches: syntax errors, missing imports, circular dependencies, undefined operators. Run in CI on every PR.

Operator unit tests

def test_extract_data():
    result = extract_data_fn(date="2026-05-21")
    assert len(result) > 0
    assert all('region' in r for r in result)

Test the function the operator calls, not the orchestrator. The orchestrator's job is dispatch — your code is what you test.

Integration tests

# Bring up Airflow in a test container, run a DAG end-to-end
def test_full_pipeline():
    airflow = AirflowTestEnv()
    airflow.run_dag("daily_revenue", run_date="2026-05-21")
    assert airflow.task_state("daily_revenue", "publish") == "success"

Slower; run on a schedule, not every PR.

Observability

The orchestrator is itself a system you operate.

Key metrics:

DAG success rate (per DAG, per team) — should be >95% baseline
Task duration (P50/P99) — alert on regression
Scheduler heartbeat — orchestrator process health
Task queue length — backlog implies under-scaling
Worker utilization — under 30% suggests over-provisioning; over 80% suggests under-provisioning

Wire to Monitoring:

# Airflow's StatsD exporter
[metrics]
statsd_on = True
statsd_host = statsd-exporter
statsd_port = 9125
statsd_prefix = airflow

statsd_exporter translates to Prometheus, then Grafana.

Alert on:

A DAG hasn't run for > schedule interval (didn't trigger)
Failure rate above team's tolerance
Scheduler heartbeat missing
Backlog growing for > N minutes

Capacity Planning

Workers run tasks. Plan for peak (often nightly window between 02:00-06:00 UTC).

Rules of thumb:

Each Celery worker = 1-8 task slots (configurable). Pods = 1 task each in K8s executor.
Sensors in reschedule mode hold no worker.
Heavy tasks (large pandas, Spark driver) need bigger workers.
Light tasks (trigger external API) need many small.

In Airflow with KubernetesExecutor: each task is its own pod, autoscales naturally. CeleryExecutor with autoscaling Celery: similar effect but heavier.

For Dagster: similar — dagster-k8s for each run as a pod.

A common growth pattern:

Start: single-VM Airflow, LocalExecutor
~50 DAGs / 1000 tasks/day: CeleryExecutor with 2-4 workers
~500 DAGs / 50k tasks/day: KubernetesExecutor, autoscaling
Multi-team: separate Airflow per team OR one big Airflow with strong RBAC

Scaling Across Teams

When 10 teams use one orchestrator:

RBAC: each team sees only their DAGs (Airflow has role-based access for this)
Resource quotas: per-team task-slot limits prevent one team from monopolizing
Naming convention: {team}_{purpose} so listings are findable
Cost attribution: tag tasks to teams; report monthly
Separate environments: dev / staging / prod Airflows; rarely just one

At some scale, "one Airflow for all" breaks. Split:

Per-team Airflow — each team owns one; cross-team dependencies via dataset triggers
Hub-and-spoke — central platform Airflow, team Airflows for specifics
Dagster Cloud / Astronomer / Prefect Cloud — SaaS that handles the scaling

The right size is "biggest single thing your platform team comfortably operates." Push for bigger before splitting.

Governance

The orchestrator becomes a critical system; treat it like one.

Code review for DAG changes — PR to merge
Change management — production DAGs don't change at 4 PM Friday
Documentation — description= on every DAG; doc_md= for context
Ownership — owner field per DAG; CODEOWNERS in Git; on-call when failures
Lineage — track which dashboards / models depend on which DAG outputs

Migration Strategy

Migrating from cron / ad-hoc / older Airflow to a new orchestrator is its own project:

Inventory existing schedules
Categorize: production-critical / important / cleanup
Pick one team's pipelines to migrate first
Migrate critical first (motivates everyone), or non-critical first (lower risk) — depends on org
Run new + old in parallel for verification (one validates the other)
Cut over to new; remove old after a quiet period

Never "big bang" migrate — gradual is much safer.

Disaster Recovery

The orchestrator stores metadata (history, schedule state, secrets). Back it up.

Airflow's metadata DB (Postgres / MySQL) — standard DB backup
Dagster's daemon storage — backup
DAG / asset code — already in Git
Secrets — in Vault ideally; don't depend solely on Airflow Connections

A 1-day outage on the orchestrator = 1 day of missed pipelines. Plan for it in Disaster Recovery.

Performance

Slow scheduler is a common pain. Causes:

Too many DAG files (each one takes time to parse)
Heavy top-level code in DAGs (anything outside with DAG(...): runs every parse)
Inefficient queries on the metadata DB

Fixes:

Move heavy imports / computation inside Python callables, not at module load
Use min_file_process_interval (Airflow) to throttle DAG parsing
Bigger metadata DB; turn on query optimization
Reduce DAG count by combining where it makes sense (but not too much — see anti-patterns)

For Dagster: similar — keep asset functions slim; heavy work inside.

Cost Management

The orchestrator itself isn't expensive; the work it triggers is. Strategies:

Schedule wisely: not all jobs need to run at midnight. Spread out → smaller peak compute.
Right-size workers: 2-CPU pods for most; bigger only for genuinely heavy tasks.
Spot/preemptible workers: orchestrators with K8s executors run on Spot well — Karpenter / Spot.io.
Kill dead DAGs: 2-year-old DAGs that nobody knows about still consume resources.

Common Pitfalls

DAG-as-code anti-patterns: heavy work at module top-level (parsed on every schedule loop). Keep DAG definition lean; do work inside operators.

Hard-coded dates: start_date=datetime.now() — always set a fixed start_date.

No retries: transient failures (network, API rate limit) take down nightly pipelines. Sensible default retries everywhere.

Too many retries: an auth failure that retries 100 times wastes compute and obscures the issue. Don't retry permanent failures.

Untyped data flow: XCom is Any. Misuse goes silent. Dagster's typed I/O catches this; in Airflow be disciplined.

Untested DAGs: DAGs are software. Test in CI.

Pipeline drift: DAGs in production no one tests in dev. Use a dev environment with realistic data.

No alert thresholds calibrated: every failure pages someone. After a month, on-call ignores all of them. Tune alert thresholds; suppress noise.

Schedule overlap: a long-running DAG starts before the previous finishes. Set max_active_runs=1 unless you really want overlap.

Checklist

What's Next

You have an orchestrator practice. Connect it to:

Data Warehouses — most pipelines orchestrate dbt + warehouse
Stream Processing — batch and stream complement each other
MLOps — ML pipelines often live in the orchestrator
Monitoring — pipeline metrics + alerts
Secrets — Vault holds pipeline credentials
FinOps — pipeline cost attribution per team

Best Practices

Code Organization

Testing

DAG-import test

Operator unit tests

Integration tests

Observability

Capacity Planning

Scaling Across Teams

Governance

Migration Strategy

Disaster Recovery

Performance

Cost Management

Common Pitfalls

Checklist

What's Next

Best Practices

On this page