Multi-region failover, snapshot lifecycle, immutable backups, application-consistent backups, runbooks, GameDay drills

Patterns

The patterns that turn "we have backups" into a working DR practice.

Active-Active vs Active-Passive

The two main multi-region shapes:

Active-Passive

us-east-1: live traffic (primary)
us-west-2: warm standby, takes traffic if east dies

Standby database is a read replica
Application infra exists but doesn't serve user traffic
Failover involves DNS / health-check change + DB promotion
RTO: 5-30 minutes typical
Cost: ~1.5x single region

Active-Active

us-east-1: serves east users
us-west-2: serves west users
Either can serve both

Database is multi-writer (DynamoDB Global Tables, CockroachDB); Aurora Global Database is single-writer, with secondary regions forwarding writes to the primary
Or sharded by region with cross-region replication
Both regions live; failover is a routing change
RTO: seconds to ~zero (depending on what failed)
Cost: ~2x single region + multi-master complexity

Active-active is the gold standard for tier-1 services but is genuinely complex — eventual consistency, conflict resolution, schema migrations across multi-master. Many "active-active" claims are actually "primary in one region, read-only standby in another." Verify the architecture matches the claim.
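To make the conflict-resolution cost concrete, here is a minimal sketch of last-writer-wins, one common strategy among several. The `Version` type and `resolve` function are hypothetical names for illustration, and the approach assumes the regions' clocks are roughly synchronized.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    timestamp_ms: int  # writer's wall-clock time; assumes roughly synchronized clocks
    region: str        # tie-breaker so every region picks the same winner


def resolve(a: Version, b: Version) -> Version:
    """Return the winning version; the result does not depend on argument order."""
    return max(a, b, key=lambda v: (v.timestamp_ms, v.region))


# Two regions accepted a write to the same record at nearly the same time.
east = Version("shipping=express", 1_700_000_000_500, "us-east-1")
west = Version("shipping=standard", 1_700_000_000_200, "us-west-2")

winner = resolve(east, west)
print(winner.region, winner.value)
```

The losing write is silently discarded. That is the trade-off to weigh before calling a design active-active: either the data model tolerates lost updates, or it needs a richer merge strategy.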

Application-Consistent vs Crash-Consistent

A snapshot taken without coordination is crash-consistent: it's as if the machine crashed at that moment. Most databases will recover (they're designed to). Some won't, or recover with data loss.

Application-consistent snapshots involve:

Quiesce the application — flush in-memory state, pause writes briefly
Snapshot the underlying volume
Resume writes

For databases, this means:

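# Postgres 15+ (on 14 and earlier the functions are pg_start_backup / pg_stop_backup)
# Start and stop must run in the same session, so keep one psql session open.
# Assumes the default "postgres" superuser.
docker exec -it pg psql -U postgres
#   SELECT pg_backup_start('snapshot', true);
#   ... take volume snapshot from another terminal ...
#   SELECT pg_backup_stop();

# MySQL: the read lock is released when the session ends, so snapshot while it is still open
docker exec mysql mysql -e "FLUSH TABLES WITH READ LOCK; ..."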

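On Kubernetes, Velero can do this with pre- and post-backup hooks (commands run inside the pod, such as fsfreeze or a database flush) wrapped around a CSI volume snapshot. On AWS, AWS Backup can take application-consistent backups of Windows EC2 instances through VSS; RDS snapshots are taken by the managed service, so there is no freeze step for you to script.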

If you skip the freeze, a restore may require crash recovery, and in some edge cases the data may be corrupted. Apply the application-consistent treatment to anything you can't afford to lose.
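The quiesce, snapshot, resume sequence can be sketched as a context manager that guarantees writes resume even when the snapshot fails. This is an illustrative sketch: `freeze`, `thaw`, and `take_snapshot` are hypothetical callables standing in for whatever your database and storage layer provide.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def quiesced(freeze: Callable[[], None], thaw: Callable[[], None]) -> Iterator[None]:
    """Pause writes for the duration of the block and always resume afterwards."""
    freeze()
    try:
        yield
    finally:
        thaw()  # runs even if the snapshot raises, so writes are never left paused


def consistent_snapshot(
    freeze: Callable[[], None],
    thaw: Callable[[], None],
    take_snapshot: Callable[[], str],
) -> str:
    with quiesced(freeze, thaw):
        return take_snapshot()


# Demo with stand-in callables that only record the order of events.
events: list[str] = []


def take_snapshot() -> str:
    events.append("snapshot")
    return "snap-demo"  # placeholder snapshot ID


snapshot_id = consistent_snapshot(
    freeze=lambda: events.append("freeze"),
    thaw=lambda: events.append("thaw"),
    take_snapshot=take_snapshot,
)
print(events, snapshot_id)
```

Keeping the frozen window as short as possible matters: only the snapshot call belongs inside the `with` block, not uploads or verification.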

Snapshot Lifecycle

Snapshots are point-in-time references; without lifecycle management, they multiply forever and cost a fortune.

A common scheme (GFS - Grandfather-Father-Son):

Frequency	Retention
Hourly snapshots	Last 24
Daily snapshots	Last 7
Weekly snapshots	Last 4
Monthly snapshots	Last 12
Yearly snapshots	Last 7

Implementation:

# AWS Backup plan (each rule requires a target vault; "tier1-vault" is a placeholder name)
{
  "BackupPlan": {
    "BackupPlanName": "tier1-services",
    "Rules": [
      { "RuleName": "hourly", "TargetBackupVaultName": "tier1-vault",
        "ScheduleExpression": "cron(0 * * * ? *)",
        "Lifecycle": { "DeleteAfterDays": 1 } },
      { "RuleName": "daily", "TargetBackupVaultName": "tier1-vault",
        "ScheduleExpression": "cron(0 5 * * ? *)",
        "Lifecycle": { "DeleteAfterDays": 7 } },
      { "RuleName": "monthly", "TargetBackupVaultName": "tier1-vault",
        "ScheduleExpression": "cron(0 5 1 * ? *)",
        "Lifecycle": { "MoveToColdStorageAfterDays": 30, "DeleteAfterDays": 365 } }
    ]
  }
}

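Move older snapshots to cold storage (Glacier, S3 Glacier Deep Archive): the data is the same and the cost is much lower, but retrieval can take hours and minimum storage durations apply, so account for that in your RTO.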
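For tooling without a managed lifecycle, the GFS table above can be sketched as a pruning function. This is a minimal illustration, not a production pruner: it keeps the newest snapshot in each of the most recent N hours, days, ISO weeks, months, and years, and the names `TIERS` and `snapshots_to_keep` are hypothetical.

```python
from datetime import datetime, timedelta

# (tier name, bucket key function, number of most recent buckets to keep)
TIERS = [
    ("hourly", lambda t: t.strftime("%Y-%m-%d %H"), 24),
    ("daily", lambda t: t.strftime("%Y-%m-%d"), 7),
    ("weekly", lambda t: tuple(t.isocalendar()[:2]), 4),  # (ISO year, ISO week)
    ("monthly", lambda t: (t.year, t.month), 12),
    ("yearly", lambda t: t.year, 7),
]


def snapshots_to_keep(snapshots: list[datetime]) -> set[datetime]:
    """Return the snapshots retained by at least one tier; the rest can be deleted."""
    keep: set[datetime] = set()
    newest_first = sorted(snapshots, reverse=True)
    for _name, bucket_of, limit in TIERS:
        seen = []
        for ts in newest_first:
            bucket = bucket_of(ts)
            if bucket in seen:
                continue  # a newer snapshot already represents this bucket
            if len(seen) == limit:
                break  # this tier's retention window is full
            seen.append(bucket)
            keep.add(ts)
    return keep


# Demo: 60 days of hourly snapshots starting 2024-01-01.
start = datetime(2024, 1, 1)
snaps = [start + timedelta(hours=i) for i in range(60 * 24)]
keep = snapshots_to_keep(snaps)
print(f"{len(snaps)} snapshots taken, {len(keep)} retained")
```

A snapshot can satisfy several tiers at once (the newest one is also the current day's, week's, month's, and year's representative), which is why the retained count is smaller than the sum of the tier limits.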

Immutable / Ransomware-Resistant Backups

Ransomware encrypts everything reachable. If your backups are reachable from the application, they get encrypted too. Defenses:

Mechanism	What it provides
S3 Object Lock (Compliance mode)	Object can't be deleted or modified before retention expires, even by root
Glacier Vault Lock	Same for Glacier
AWS Backup Vault Lock	Same for AWS Backup
Air-gapped tape	Physical isolation; tested by major banks
Cross-account, cross-cloud copies	An attacker with one account doesn't have the other
WORM (write-once-read-many) NAS	Hardware enforcement

A minimum modern config: backups go to S3 with Object Lock in compliance mode, with a retention of weeks/months. Even compromised admin credentials can't delete them.
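As a sketch of that minimum config, the helper below builds the arguments for an S3 upload with a compliance-mode lock. `ObjectLockMode` and `ObjectLockRetainUntilDate` are parameters of boto3's `put_object`; the helper name, bucket, key, and 90-day retention are placeholders, and the bucket is assumed to already have Object Lock enabled.

```python
from datetime import datetime, timedelta, timezone


def locked_put_args(bucket: str, key: str, body: bytes, retain_days: int) -> dict:
    """Build keyword arguments for an S3 put_object call with a compliance-mode lock."""
    if retain_days <= 0:
        raise ValueError("retain_days must be positive")
    retain_until = datetime.now(timezone.utc) + timedelta(days=retain_days)
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "ObjectLockMode": "COMPLIANCE",
        "ObjectLockRetainUntilDate": retain_until,
    }


# Placeholder bucket and key names.
args = locked_put_args("example-backup-bucket", "db/2024-01-01.dump", b"...", retain_days=90)

# With boto3 installed and credentials configured, the upload would be:
#   boto3.client("s3").put_object(**args)
print(args["ObjectLockMode"], args["ObjectLockRetainUntilDate"].isoformat())
```

Because compliance mode cannot be shortened or removed once set, try the retention period on a test bucket first; a mistaken multi-year lock is paid for until it expires.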

Cross-Region Replication

For databases:

AWS RDS: read replicas in another region, async; promote on failure
Aurora Global Database: continuous physical replication; failover in ~1 minute
DynamoDB Global Tables: multi-master, no failover concept — every region writeable
Postgres: logical replication (subscription) or physical streaming
MySQL: classic primary-replica, semi-sync available

For objects:

S3 Cross-Region Replication (CRR): automatic
GCS Multi-region buckets: built-in
S3 event notifications + Lambda (custom replication): for fine control

For Kubernetes resources:

Velero with --snapshot-move-data: backup in one region, restore in another
Multi-cluster apps: GitOps deploys to multiple clusters; cluster failure = traffic shift, not data restore

Runbooks

A runbook is the document an on-call engineer reads at 3 AM when production is on fire. Required sections:

# Region Failover Runbook

## When to use
Primary region (us-east-1) unhealthy for >5 minutes per synthetics.

## Pre-conditions
- Read replica in us-west-2 is healthy and not too far behind (< 30s replica lag)
- Standby application infra is current (deployed via GitOps)
- DNS health checks configured

## Steps
1. Declare incident in Slack #incidents
2. Verify in dashboard: primary region truly down (not just our monitoring)
3. Run: `./scripts/failover.sh us-west-2`
4. Verify: synthetic transactions succeed against us-west-2
5. Communicate: status page, customer-facing message
6. Failback runbook: separate document, run after primary recovers

## Verification
- Check that ... command outputs ...
- Customer journey X works
- Metrics in Datadog show ...

## Rollback (if failover itself fails)
- ...

## On-call escalation
- DBA on-call for PostgreSQL promotion issues
- Networking on-call for DNS / load balancer issues

Runbooks live in Git, are reviewed quarterly, and are tested during GameDay drills. A runbook untested for a year is fiction.
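Pre-conditions like the ones in the runbook above are good candidates for a scripted go/no-go check, so the on-call engineer is not evaluating them by eye at 3 AM. The sketch below is illustrative: `StandbyStatus` and `failover_blockers` are hypothetical names, and collecting the status values from your monitoring system is left out.

```python
from dataclasses import dataclass

MAX_REPLICA_LAG_SECONDS = 30  # threshold from the runbook's pre-conditions


@dataclass
class StandbyStatus:
    replica_healthy: bool
    replica_lag_seconds: float
    standby_infra_current: bool
    dns_health_checks_configured: bool


def failover_blockers(status: StandbyStatus) -> list[str]:
    """Return human-readable reasons not to fail over; an empty list means go."""
    blockers = []
    if not status.replica_healthy:
        blockers.append("read replica is unhealthy")
    if status.replica_lag_seconds >= MAX_REPLICA_LAG_SECONDS:
        blockers.append(
            f"replica lag {status.replica_lag_seconds:.0f}s "
            f"is not under {MAX_REPLICA_LAG_SECONDS}s"
        )
    if not status.standby_infra_current:
        blockers.append("standby application infra is out of date")
    if not status.dns_health_checks_configured:
        blockers.append("DNS health checks are not configured")
    return blockers


status = StandbyStatus(
    replica_healthy=True,
    replica_lag_seconds=45.0,
    standby_infra_current=True,
    dns_health_checks_configured=True,
)
for reason in failover_blockers(status):
    print("BLOCKED:", reason)
```

Returning every blocker rather than stopping at the first gives the incident commander the full picture in one run.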

GameDay Drills

The practice that separates real DR from theater: deliberately break production-like environments and execute recovery.

Typical GameDay:

Pre-day: scope (what fails, what's safe to break)
Day: fail something at announced time; team executes runbook with real-time metrics
Post-day: write findings; file tickets for gaps

Common drill scenarios:

Scenario	What it tests
Primary DB unreachable	DB failover; replica promotion; client reconnection
Region API endpoints fail	DNS failover; standby app capacity; data consistency post-recovery
Single AZ outage	Cross-AZ resilience; load balancer health checks
Backup restore	Backup integrity; restore time; data correctness
Ransomware simulated	Immutable backup restore; from scratch infra rebuild
Whole region	Full cross-region failover; coordination across teams

Frequency: quarterly minimum for critical services. Monthly for the highest-impact ones. New services should drill once before they go to production.

GameDays connect to Chaos Engineering: chaos is the technique, GameDay is the format.

Backup of Stateful K8s Workloads

For databases running in Kubernetes (often a regret, but common):

Use operators: Zalando Postgres Operator, CrunchyData PGO, Strimzi Kafka. They orchestrate snapshots + WAL + restore.
Velero + CSI snapshots: snapshot the PVC; restore creates a new PV from the snapshot.
Database-native backup to S3: WAL-G for Postgres, mongodump to S3, etc. Often simpler than Velero for stateful workloads.
Backup to a separate cluster: a "DR cluster" with restore tooling pre-deployed.

Configuration and Secrets Backup

Don't forget non-data:

Vault / Secrets Manager: secrets are data too. For Vault on integrated (Raft) storage, take snapshots with vault operator raft snapshot save.
Cluster certificates and CAs: PKI bootstrap is painful to recreate; back up the root CA.
Terraform state: state file in S3 with versioning enabled. State loss = "we have no idea what we own."
CI/CD configuration: GitHub Actions secrets, runner tokens.
Identity provider / SSO configuration: app registrations, group mappings.
DNS records: domain registrar dump or DNS provider export.

When you do a full restore, you need all of these in addition to data.

Tested Restore Process

The restore is the only thing that matters. Process:

Pick a random recent backup
Stand up a fresh environment (or designated restore environment)
Restore data
Run application against restored data
Verify business-meaningful queries return expected results
Document elapsed time

This is the actual RTO. Whatever your target says, this is the truth.

Automate at least the mechanical steps (restoring the snapshot into a fresh environment). The harder parts, such as verifying business correctness and customer journeys, usually require human judgment, but the time they take should still be measured.
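Measuring the elapsed time is easy to automate even when the steps themselves are manual. The sketch below times each restore step and compares the total to a target; the step names, `time.sleep` stand-ins, and the 30-minute target are hypothetical placeholders for your own tooling and numbers.

```python
import time
from typing import Callable


def run_restore_drill(steps: list[tuple[str, Callable[[], None]]]) -> dict[str, float]:
    """Run each step in order and return elapsed seconds per step."""
    timings: dict[str, float] = {}
    for name, step in steps:
        started = time.monotonic()
        step()  # exceptions propagate: a failed step means a failed drill
        timings[name] = time.monotonic() - started
    return timings


# Stand-in steps; real ones would call your restore tooling and smoke tests.
steps = [
    ("provision environment", lambda: time.sleep(0.02)),
    ("restore data", lambda: time.sleep(0.05)),
    ("verify queries", lambda: time.sleep(0.01)),
]

TARGET_RTO_SECONDS = 30 * 60  # hypothetical target

timings = run_restore_drill(steps)
measured_rto = sum(timings.values())
for name, seconds in timings.items():
    print(f"{name}: {seconds:.2f}s")
print(f"measured RTO: {measured_rto:.2f}s, within target: {measured_rto <= TARGET_RTO_SECONDS}")
```

Recording per-step timings, not just the total, shows where the time actually goes, which is usually where the next improvement is.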

Anti-Patterns

Backup ≠ DR. Backup is a tool. DR is the practice (backup + runbooks + failover + drills + communication).

The "we'll figure it out" plan. The actual incident is the worst time to plan. Pre-write everything; rehearse.

Backups in the same region. The disaster that hits your primary will hit your in-region backup. Always offsite.

Daily but never restored. Schrödinger's backup: it both works and doesn't work, until you check.

Untested runbook. Steps reference a tool that's been deprecated, a person who left, a command that no longer exists. Runbooks rot.

Single-point-of-knowledge. The one person who understands the failover is on vacation when the disaster hits. Spread the practice; rotate GameDay leadership.

RTO/RPO statements without actual measurement. "Our RTO is 30 minutes" — when was it last measured? State your number based on the latest drill.

What's Next

Best Practices — testing cadence, RTO measurement, encryption, compliance, pitfalls
