Best Practices
Code organization, testing, observability, capacity, scaling, governance, common pitfalls
Best Practices
The operational realities of running an orchestrator across many teams and many pipelines.
Code Organization
DAGs grow into hundreds. Without structure, the repo becomes unmaintainable.
A workable layout:
airflow_repo/
├── dags/
│ ├── domain_a/ # by business domain
│ │ ├── extract_stripe.py
│ │ ├── transform_revenue.py
│ │ └── ...
│ ├── domain_b/
│ └── platform/ # platform-wide / cross-team
├── plugins/ # custom operators
│ └── my_operators.py
├── tests/
│ ├── test_dags.py # DAG import + linting
│ └── test_operators.py # unit tests
├── utils/ # shared helpers
│ ├── slack_alerts.py
│ └── data_validations.py
└── requirements.txtPatterns:
- One DAG per file — even a small DAG; easier to find and review.
- CODEOWNERS by directory — each domain owned by a team.
- Shared utils in a package; don't copy code across DAGs.
- Custom operators for repeated patterns (e.g., "send to our Slack with this format" becomes one operator, not 50 copies).
For Dagster, similar but organized by asset/group instead of by DAG.
Testing
DAGs are code. Test them.
DAG-import test
# tests/test_dags.py
from airflow.models.dagbag import DagBag
def test_all_dags_import_cleanly():
dag_bag = DagBag(dag_folder='dags', include_examples=False)
assert len(dag_bag.import_errors) == 0, dag_bag.import_errors
assert len(dag_bag.dags) > 0Catches: syntax errors, missing imports, circular dependencies, undefined operators. Run in CI on every PR.
Operator unit tests
def test_extract_data():
result = extract_data_fn(date="2026-05-21")
assert len(result) > 0
assert all('region' in r for r in result)Test the function the operator calls, not the orchestrator. The orchestrator's job is dispatch — your code is what you test.
Integration tests
# Bring up Airflow in a test container, run a DAG end-to-end
def test_full_pipeline():
airflow = AirflowTestEnv()
airflow.run_dag("daily_revenue", run_date="2026-05-21")
assert airflow.task_state("daily_revenue", "publish") == "success"Slower; run on a schedule, not every PR.
Observability
The orchestrator is itself a system you operate.
Key metrics:
- DAG success rate (per DAG, per team) — should be >95% baseline
- Task duration (P50/P99) — alert on regression
- Scheduler heartbeat — orchestrator process health
- Task queue length — backlog implies under-scaling
- Worker utilization — under 30% suggests over-provisioning; over 80% suggests under-provisioning
Wire to Monitoring:
# Airflow's StatsD exporter
[metrics]
statsd_on = True
statsd_host = statsd-exporter
statsd_port = 9125
statsd_prefix = airflowstatsd_exporter translates to Prometheus, then Grafana.
Alert on:
- A DAG hasn't run for > schedule interval (didn't trigger)
- Failure rate above team's tolerance
- Scheduler heartbeat missing
- Backlog growing for > N minutes
Capacity Planning
Workers run tasks. Plan for peak (often nightly window between 02:00-06:00 UTC).
Rules of thumb:
- Each Celery worker = 1-8 task slots (configurable). Pods = 1 task each in K8s executor.
- Sensors in reschedule mode hold no worker.
- Heavy tasks (large pandas, Spark driver) need bigger workers.
- Light tasks (trigger external API) need many small.
In Airflow with KubernetesExecutor: each task is its own pod, autoscales naturally. CeleryExecutor with autoscaling Celery: similar effect but heavier.
For Dagster: similar — dagster-k8s for each run as a pod.
A common growth pattern:
- Start: single-VM Airflow, LocalExecutor
- ~50 DAGs / 1000 tasks/day: CeleryExecutor with 2-4 workers
- ~500 DAGs / 50k tasks/day: KubernetesExecutor, autoscaling
- Multi-team: separate Airflow per team OR one big Airflow with strong RBAC
Scaling Across Teams
When 10 teams use one orchestrator:
- RBAC: each team sees only their DAGs (Airflow has role-based access for this)
- Resource quotas: per-team task-slot limits prevent one team from monopolizing
- Naming convention:
{team}_{purpose}so listings are findable - Cost attribution: tag tasks to teams; report monthly
- Separate environments: dev / staging / prod Airflows; rarely just one
At some scale, "one Airflow for all" breaks. Split:
- Per-team Airflow — each team owns one; cross-team dependencies via dataset triggers
- Hub-and-spoke — central platform Airflow, team Airflows for specifics
- Dagster Cloud / Astronomer / Prefect Cloud — SaaS that handles the scaling
The right size is "biggest single thing your platform team comfortably operates." Push for bigger before splitting.
Governance
The orchestrator becomes a critical system; treat it like one.
- Code review for DAG changes — PR to merge
- Change management — production DAGs don't change at 4 PM Friday
- Documentation —
description=on every DAG;doc_md=for context - Ownership —
ownerfield per DAG; CODEOWNERS in Git; on-call when failures - Lineage — track which dashboards / models depend on which DAG outputs
Migration Strategy
Migrating from cron / ad-hoc / older Airflow to a new orchestrator is its own project:
- Inventory existing schedules
- Categorize: production-critical / important / cleanup
- Pick one team's pipelines to migrate first
- Migrate critical first (motivates everyone), or non-critical first (lower risk) — depends on org
- Run new + old in parallel for verification (one validates the other)
- Cut over to new; remove old after a quiet period
Never "big bang" migrate — gradual is much safer.
Disaster Recovery
The orchestrator stores metadata (history, schedule state, secrets). Back it up.
- Airflow's metadata DB (Postgres / MySQL) — standard DB backup
- Dagster's daemon storage — backup
- DAG / asset code — already in Git
- Secrets — in Vault ideally; don't depend solely on Airflow Connections
A 1-day outage on the orchestrator = 1 day of missed pipelines. Plan for it in Disaster Recovery.
Performance
Slow scheduler is a common pain. Causes:
- Too many DAG files (each one takes time to parse)
- Heavy top-level code in DAGs (anything outside
with DAG(...):runs every parse) - Inefficient queries on the metadata DB
Fixes:
- Move heavy imports / computation inside Python callables, not at module load
- Use
min_file_process_interval(Airflow) to throttle DAG parsing - Bigger metadata DB; turn on query optimization
- Reduce DAG count by combining where it makes sense (but not too much — see anti-patterns)
For Dagster: similar — keep asset functions slim; heavy work inside.
Cost Management
The orchestrator itself isn't expensive; the work it triggers is. Strategies:
- Schedule wisely: not all jobs need to run at midnight. Spread out → smaller peak compute.
- Right-size workers: 2-CPU pods for most; bigger only for genuinely heavy tasks.
- Spot/preemptible workers: orchestrators with K8s executors run on Spot well — Karpenter / Spot.io.
- Kill dead DAGs: 2-year-old DAGs that nobody knows about still consume resources.
Common Pitfalls
DAG-as-code anti-patterns: heavy work at module top-level (parsed on every schedule loop). Keep DAG definition lean; do work inside operators.
Hard-coded dates: start_date=datetime.now() — always set a fixed start_date.
No retries: transient failures (network, API rate limit) take down nightly pipelines. Sensible default retries everywhere.
Too many retries: an auth failure that retries 100 times wastes compute and obscures the issue. Don't retry permanent failures.
Untyped data flow: XCom is Any. Misuse goes silent. Dagster's typed I/O catches this; in Airflow be disciplined.
Untested DAGs: DAGs are software. Test in CI.
Pipeline drift: DAGs in production no one tests in dev. Use a dev environment with realistic data.
No alert thresholds calibrated: every failure pages someone. After a month, on-call ignores all of them. Tune alert thresholds; suppress noise.
Schedule overlap: a long-running DAG starts before the previous finishes. Set max_active_runs=1 unless you really want overlap.
Checklist
Orchestrator production readiness:
- DAGs in version control, reviewed via PR
- CI runs DAG import test on every PR
- Each DAG has owner, retries, alerts on failure
- Idempotent: re-running any task produces same result
- Backfills tested for historically-dated DAGs
- Sensors used (not "schedule late and hope") for external-condition triggers
- Heavy compute pushed to warehouse / Spark / dedicated workers, not inside scheduler
- Secrets via Vault / cloud-native secret manager (not env vars in DAG)
- Metadata DB backed up
- Scheduler health + DAG success rate monitored
- Alerts: missed schedules, repeated failures, scheduler heartbeat
- Capacity sized for peak (nightly window)
- RBAC if multi-team; per-team task slot limits
- Dead DAGs cleaned quarterly
- Documentation: every DAG has description + ownership
- Data quality gates on critical pipelines
What's Next
You have an orchestrator practice. Connect it to:
- Data Warehouses — most pipelines orchestrate dbt + warehouse
- Stream Processing — batch and stream complement each other
- MLOps — ML pipelines often live in the orchestrator
- Monitoring — pipeline metrics + alerts
- Secrets — Vault holds pipeline credentials
- FinOps — pipeline cost attribution per team