Patterns
Multi-region failover, snapshot lifecycle, immutable backups, application-consistent backups, runbooks, GameDay drills
The patterns that turn "we have backups" into a working DR practice.
Active-Active vs Active-Passive
The two main multi-region shapes:
Active-Passive
us-east-1: live traffic (primary)
us-west-2: warm standby, takes traffic if east dies
- Standby database is a read replica
- Application infra exists but doesn't serve user traffic
- Failover involves DNS / health-check change + DB promotion
- RTO: 5-30 minutes typical
- Cost: ~1.5x single region
Active-Active
us-east-1: serves east users
us-west-2: serves west users
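Either can serve both
- Database is multi-master (DynamoDB Global Tables, CockroachDB); Aurora Global Database is single-writer, with secondary regions serving reads and optionally forwarding writes to the primary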
- Or sharded by region with cross-region replication
- Both regions live; failover is a routing change
- RTO: seconds to ~zero (depending on what failed)
- Cost: ~2x single region + multi-master complexity
Active-active is the gold standard for tier-1 services but is genuinely complex — eventual consistency, conflict resolution, schema migrations across multi-master. Many "active-active" claims are actually "primary in one region, read-only standby in another." Verify the architecture matches the claim.
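Conflict resolution is where much of that complexity lives. As a minimal sketch, assuming a last-writer-wins policy (one of several possible strategies, and a lossy one), two regions' versions of the same record could be merged as follows. The `Versioned` type and `merge_lww` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp_ms: int  # wall-clock write time; assumes reasonably synced clocks
    region: str        # deterministic tie-breaker when timestamps are equal


def merge_lww(a: Versioned, b: Versioned) -> Versioned:
    """Return the winning version; the losing write is silently discarded."""
    return max(a, b, key=lambda v: (v.timestamp_ms, v.region))


# The same record was written in both regions within the replication window.
east = Versioned("shipping=express", 1_700_000_000_500, "us-east-1")
west = Versioned("shipping=standard", 1_700_000_000_900, "us-west-2")
winner = merge_lww(east, west)
```

The merge is order-independent, so both regions converge on the same value, but the earlier write is lost. Whether that loss is acceptable is an application decision, which is part of why active-active is hard.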
Application-Consistent vs Crash-Consistent
A snapshot taken without coordination is crash-consistent: it's as if the machine crashed at that moment. Most databases will recover (they're designed to). Some won't, or recover with data loss.
Application-consistent snapshots involve:
- Quiesce the application — flush in-memory state, pause writes briefly
- Snapshot the underlying volume
- Resume writes
For databases, this means:
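# Postgres 14 and earlier, exclusive mode (15+ renames these to pg_backup_start/pg_backup_stop
# and requires both calls in the same session); assumes the default postgres superuser
docker exec pg psql -U postgres -c "SELECT pg_start_backup('snapshot', true);"
# ... take volume snapshot ...
docker exec pg psql -U postgres -c "SELECT pg_stop_backup();"
# MySQL (the read lock is released when the session ends, so snapshot within the same session)
docker exec mysql mysql -e "FLUSH TABLES WITH READ LOCK; ..."
Velero with the velero-plugin-for-csi + CSI driver supporting application freeze does this for databases on Kubernetes. AWS Backup orchestrates the freeze for RDS/EBS.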
If you skip the freeze, a restore may require crash recovery, and in some edge cases the restored data may be corrupt. Apply the application-consistent treatment to anything you can't afford to lose.
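The quiesce, snapshot, resume sequence has one failure mode worth guarding against: a snapshot error that leaves writes paused. A minimal sketch of the pattern, assuming the freeze, snapshot, and thaw operations are supplied as callables (the names `quiesced` and `consistent_snapshot` are illustrative, not part of any tool mentioned above):

```python
from contextlib import contextmanager


@contextmanager
def quiesced(freeze, thaw):
    """Pause writes for the duration of the block and always resume them."""
    freeze()
    try:
        yield
    finally:
        thaw()  # runs even if the snapshot raises, so writes are never left paused


def consistent_snapshot(freeze, snapshot, thaw):
    """Take a snapshot while the application is quiesced."""
    with quiesced(freeze, thaw):
        return snapshot()


# Stand-in callables that only record the call order.
events = []
snapshot_id = consistent_snapshot(
    freeze=lambda: events.append("freeze"),
    snapshot=lambda: events.append("snapshot") or "snap-001",
    thaw=lambda: events.append("thaw"),
)
```

In a real setup the callables would wrap the database freeze commands and the volume snapshot API; the point of the sketch is only the ordering and the guaranteed resume.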
Snapshot Lifecycle
Snapshots are point-in-time references; without lifecycle management, they multiply forever and cost a fortune.
A common scheme (GFS - Grandfather-Father-Son):
| Frequency | Retention |
|---|---|
| Hourly snapshots | Last 24 |
| Daily snapshots | Last 7 |
| Weekly snapshots | Last 4 |
| Monthly snapshots | Last 12 |
| Yearly snapshots | Last 7 |
Implementation:
# AWS Backup plan (the vault name "Default" is a placeholder)
{
  "BackupPlan": {
    "BackupPlanName": "tier1-services",
    "Rules": [
      { "RuleName": "hourly", "TargetBackupVaultName": "Default",
        "ScheduleExpression": "cron(0 * * * ? *)",
        "Lifecycle": { "DeleteAfterDays": 1 } },
      { "RuleName": "daily", "TargetBackupVaultName": "Default",
        "ScheduleExpression": "cron(0 5 * * ? *)",
        "Lifecycle": { "DeleteAfterDays": 7 } },
      { "RuleName": "monthly", "TargetBackupVaultName": "Default",
        "ScheduleExpression": "cron(0 5 1 * ? *)",
        "Lifecycle": { "MoveToColdStorageAfterDays": 30, "DeleteAfterDays": 365 } }
    ]
  }
}
Move older snapshots to cold storage (Glacier, S3 Deep Archive): they stay restorable at much lower storage cost, though retrieval takes longer.
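For systems without a managed backup service, the GFS retention table can be applied directly to a list of snapshot timestamps. The following is a sketch under stated assumptions: each tier keeps the newest snapshot in each of its most recent N calendar buckets, weeks follow the ISO calendar, and `GFS_TIERS` and `snapshots_to_keep` are hypothetical names.

```python
from datetime import datetime, timedelta

# Tier name -> (bucket key function, number of most recent buckets to keep).
GFS_TIERS = {
    "hourly": (lambda t: (t.year, t.month, t.day, t.hour), 24),
    "daily": (lambda t: (t.year, t.month, t.day), 7),
    "weekly": (lambda t: tuple(t.isocalendar())[:2], 4),
    "monthly": (lambda t: (t.year, t.month), 12),
    "yearly": (lambda t: (t.year,), 7),
}


def snapshots_to_keep(timestamps):
    """Return the set of snapshot timestamps retained by any GFS tier."""
    keep = set()
    ordered = sorted(timestamps, reverse=True)  # newest first
    for key_fn, limit in GFS_TIERS.values():
        seen = []
        for ts in ordered:
            key = key_fn(ts)
            if key in seen:
                continue  # an older snapshot in a bucket that is already covered
            if len(seen) == limit:
                break  # this tier has all the buckets it is allowed
            seen.append(key)
            keep.add(ts)
    return keep


base = datetime(2024, 3, 10, 23, 0)
history = [base - timedelta(hours=i) for i in range(72)]  # three days of hourly snapshots
keep = snapshots_to_keep(history)
expired = sorted(set(history) - keep)  # candidates for deletion
```

A snapshot is deleted only when no tier claims it, so one snapshot can serve as the hourly, daily, and monthly copy at once.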
Immutable / Ransomware-Resistant Backups
Ransomware encrypts everything reachable. If your backups are reachable from the application, they get encrypted too. Defenses:
| Mechanism | What it provides |
|---|---|
| S3 Object Lock (Compliance mode) | Object can't be deleted or modified before retention expires, even by root |
| Glacier Vault Lock | Same for Glacier |
| AWS Backup Vault Lock | Same for AWS Backup |
| Air-gapped tape | Physical isolation; tested by major banks |
| Cross-account, cross-cloud copies | An attacker with one account doesn't have the other |
| WORM (write-once-read-many) NAS | Hardware enforcement |
A minimum modern config: backups go to S3 with Object Lock in compliance mode, with a retention of weeks/months. Even compromised admin credentials can't delete them.
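As a rough illustration of that minimum config, the sketch below builds the arguments for an S3 upload with a compliance-mode retention date. It assumes the bucket was created with Object Lock enabled; the bucket name, object key, and the `locked_put_kwargs` helper are hypothetical, and the sketch stops short of making a network call.

```python
from datetime import datetime, timedelta, timezone


def locked_put_kwargs(bucket, key, body, retention_days, now=None):
    """Build put_object arguments that apply a compliance-mode retention period."""
    now = now or datetime.now(timezone.utc)
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "ObjectLockMode": "COMPLIANCE",
        "ObjectLockRetainUntilDate": now + timedelta(days=retention_days),
    }


kwargs = locked_put_kwargs(
    bucket="example-backup-bucket",   # hypothetical bucket name
    key="db/2024-03-10.dump",         # hypothetical object key
    body=b"...backup bytes...",
    retention_days=90,
    now=datetime(2024, 3, 10, tzinfo=timezone.utc),
)
# With boto3 installed and credentials configured, this would be passed as:
#   boto3.client("s3").put_object(**kwargs)
```

A default retention rule on the bucket achieves the same result without per-object arguments; setting it per object is shown here only because it makes the retention date explicit.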
Cross-Region Replication
For databases:
- AWS RDS: read replicas in another region, async; promote on failure
- Aurora Global Database: continuous physical replication; failover in ~1 minute
- DynamoDB Global Tables: multi-master, no failover concept — every region writeable
- Postgres: logical replication (subscription) or physical streaming
- MySQL: classic primary-replica, semi-sync available
For objects:
- S3 Cross-Region Replication (CRR): automatic
- GCS Multi-region buckets: built-in
- Lambda + custom replication (triggered by S3 events): for fine control
For Kubernetes resources:
- Velero with --snapshot-move-data: backup in one region, restore in another
- Multi-cluster apps: GitOps deploys to multiple clusters; cluster failure = traffic shift, not data restore
Runbooks
A runbook is the document an on-call engineer reads at 3 AM when production is on fire. Required sections:
# Region Failover Runbook
## When to use
Primary region (us-east-1) unhealthy for >5 minutes per synthetics.
## Pre-conditions
- Read replica in us-west-2 is healthy and not too far behind (< 30s replica lag)
- Standby application infra is current (deployed via GitOps)
- DNS health checks configured
## Steps
1. Declare incident in Slack #incidents
2. Verify in dashboard: primary region truly down (not just our monitoring)
3. Run: `./scripts/failover.sh us-west-2`
4. Verify: synthetic transactions succeed against us-west-2
5. Communicate: status page, customer-facing message
6. Failback runbook: separate document, run after primary recovers
## Verification
- Check that ... command outputs ...
- Customer journey X works
- Metrics in Datadog show ...
## Rollback (if failover itself fails)
- ...
## On-call escalation
- DBA on-call for PostgreSQL promotion issues
- Networking on-call for DNS / load balancer issues
Runbooks live in Git, are reviewed quarterly, and are tested during GameDay drills. A runbook untested for a year is fiction.
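The "When to use" and "Pre-conditions" sections of the sample runbook are mechanical enough to check in code before anyone runs the failover script. A minimal sketch, assuming the signals are gathered elsewhere (the `FailoverSignals` fields and `failover_blockers` function are hypothetical; the thresholds are the ones from the runbook):

```python
from dataclasses import dataclass


@dataclass
class FailoverSignals:
    primary_unhealthy_seconds: float  # how long synthetics have been failing
    replica_lag_seconds: float        # replication lag on the standby database
    standby_deployed_version: str     # version running in the standby region
    expected_version: str             # version GitOps says should be running
    dns_health_checks_configured: bool


def failover_blockers(s: FailoverSignals) -> list[str]:
    """Return reasons not to fail over; an empty list means preconditions hold."""
    blockers = []
    if s.primary_unhealthy_seconds <= 5 * 60:
        blockers.append("primary has not been unhealthy for more than 5 minutes")
    if s.replica_lag_seconds >= 30:
        blockers.append("replica lag is 30s or more")
    if s.standby_deployed_version != s.expected_version:
        blockers.append("standby application infra is not current")
    if not s.dns_health_checks_configured:
        blockers.append("DNS health checks are not configured")
    return blockers


signals = FailoverSignals(420, 4.0, "v1.8.2", "v1.8.2", True)
print(failover_blockers(signals) or "preconditions met")
```

Returning the full list of blockers, not just the first one, gives the on-call engineer the whole picture in one pass.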
GameDay Drills
The practice that separates real DR from theater: deliberately break production-like environments and execute recovery.
Typical GameDay:
- Pre-day: scope (what fails, what's safe to break)
- Day: fail something at announced time; team executes runbook with real-time metrics
- Post-day: write findings; file tickets for gaps
Common drill scenarios:
| Scenario | What it tests |
|---|---|
| Primary DB unreachable | DB failover; replica promotion; client reconnection |
| Region API endpoints fail | DNS failover; standby app capacity; data consistency post-recovery |
| Single AZ outage | Cross-AZ resilience; load balancer health checks |
| Backup restore | Backup integrity; restore time; data correctness |
| Ransomware simulated | Immutable backup restore; from-scratch infra rebuild |
| Whole region | Full cross-region failover; coordination across teams |
Frequency: quarterly minimum for critical services. Monthly for the highest-impact ones. New services should drill once before they go to production.
This connects to Chaos Engineering: chaos is the technique, GameDay is the format.
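Drill cadence is easy to let slip, so it helps to compute which services are overdue. The sketch below assumes a simple inventory of (service, tier, last drill date) tuples; the tier names, service names, and day counts are illustrative approximations of "monthly" and "quarterly".

```python
from datetime import date, timedelta

# Maximum time between drills per tier; tier names are illustrative.
MAX_DRILL_INTERVAL = {
    "highest-impact": timedelta(days=31),  # roughly monthly
    "critical": timedelta(days=92),        # roughly quarterly
}


def overdue_drills(services, today):
    """services: iterable of (name, tier, last_drill_date or None)."""
    overdue = []
    for name, tier, last_drill in services:
        # A service that has never drilled is always overdue.
        if last_drill is None or today - last_drill > MAX_DRILL_INTERVAL[tier]:
            overdue.append(name)
    return overdue


services = [
    ("payments", "highest-impact", date(2024, 1, 15)),
    ("search", "critical", date(2024, 1, 15)),
    ("new-service", "critical", None),  # never drilled
]
late = overdue_drills(services, today=date(2024, 3, 10))
```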
Backup of Stateful K8s Workloads
For databases running in Kubernetes (often a regret, but common):
- Use operators: Zalando Postgres Operator, CrunchyData PGO, Strimzi Kafka. They orchestrate snapshots + WAL + restore.
- Velero + CSI snapshots: snapshot the PVC; restore creates a new PV from the snapshot.
- Database-native backup to S3: WAL-G for Postgres, mongodump to S3, etc. Often simpler than Velero for stateful workloads.
- Backup to a separate cluster: a "DR cluster" with restore tooling pre-deployed.
Configuration and Secrets Backup
Don't forget non-data:
- Vault / Secrets Manager: secrets are data too. Use Vault's operator raft snapshot save command for HA recovery.
- Cluster certificates and CAs: PKI bootstrap is painful to recreate; back up the root CA.
- Terraform state: state file in S3 with versioning enabled. State loss = "we have no idea what we own."
- CI/CD configuration: GitHub Actions secrets, runner tokens.
- Identity provider / SSO configuration: app registrations, group mappings.
- DNS records: domain registrar dump or DNS provider export.
When you do a full restore, you need all of these in addition to data.
Tested Restore Process
The restore is the only thing that matters. Process:
- Pick a random recent backup
- Stand up a fresh environment (or designated restore environment)
- Restore data
- Run application against restored data
- Verify business-meaningful queries return expected results
- Document elapsed time
This is the actual RTO. Whatever your target says, this is the truth.
Automate at least the snapshot side. The harder parts — verifying business correctness, customer journey — usually require human judgment but the time should be measured.
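Measuring the time can itself be automated even when the steps are manual. A minimal sketch, assuming each restore step is wrapped in a callable (the `timed_restore_drill` function and the step names are hypothetical; the sleeps stand in for real work):

```python
import time


def timed_restore_drill(steps, rto_target_seconds, clock=time.monotonic):
    """Run (name, callable) steps in order and report measured durations."""
    durations = {}
    start = clock()
    for name, step in steps:
        step_start = clock()
        step()  # an exception aborts the drill, which is itself a finding
        durations[name] = clock() - step_start
    total = clock() - start
    return {
        "durations": durations,
        "measured_rto_seconds": total,
        "within_target": total <= rto_target_seconds,
    }


report = timed_restore_drill(
    steps=[
        ("restore data", lambda: time.sleep(0.01)),
        ("start application", lambda: time.sleep(0.01)),
        ("verify business queries", lambda: None),
    ],
    rto_target_seconds=30 * 60,
)
```

The per-step durations show where the time goes, which is usually more actionable than the total alone.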
Anti-Patterns
Backup ≠ DR. Backup is a tool. DR is the practice (backup + runbooks + failover + drills + communication).
The "we'll figure it out" plan. The actual incident is the worst time to plan. Pre-write everything; rehearse.
Backups in the same region. The disaster that hits your primary will hit your in-region backup. Always offsite.
Daily but never restored. Schrödinger's backup: it both works and doesn't work until you check.
Untested runbook. Steps reference a tool that's been deprecated, a person who left, a command that no longer exists. Runbooks rot.
Single-point-of-knowledge. The one person who understands the failover is on vacation when the disaster hits. Spread the practice; rotate GameDay leadership.
RTO/RPO statements without actual measurement. "Our RTO is 30 minutes" — when was it last measured? State your number based on the latest drill.
What's Next
- Best Practices — testing cadence, RTO measurement, encryption, compliance, pitfalls