Steven's Knowledge

Patterns

Multi-region failover, snapshot lifecycle, immutable backups, application-consistent backups, runbooks, GameDay drills

The patterns that turn "we have backups" into a working DR practice.

Active-Active vs Active-Passive

The two main multi-region shapes:

Active-Passive

us-east-1: live traffic (primary)
us-west-2: warm standby, takes traffic if east dies
  • Standby database is a read replica
  • Application infra exists but doesn't serve user traffic
  • Failover involves DNS / health-check change + DB promotion
  • RTO: 5-30 minutes typical
  • Cost: ~1.5x single region

Active-Active

us-east-1: serves east users
us-west-2: serves west users
Either can serve both
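  • Database is multi-master (DynamoDB Global Tables, CockroachDB); Aurora Global Database is often listed here too, but it has a single writer region, with optional write forwarding from the others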
  • Or sharded by region with cross-region replication
  • Both regions live; failover is a routing change
  • RTO: seconds to ~zero (depending on what failed)
  • Cost: ~2x single region + multi-master complexity

Active-active is the gold standard for tier-1 services but is genuinely complex — eventual consistency, conflict resolution, schema migrations across multi-master. Many "active-active" claims are actually "primary in one region, read-only standby in another." Verify the architecture matches the claim.
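The difference between the two shapes shows up in the routing decision. The following is a minimal sketch, assuming region health is already known as a simple map; the function names and the health-map shape are illustrative, not taken from any particular load balancer or DNS product.

```python
from typing import Dict, Optional


def route_active_passive(healthy: Dict[str, bool],
                         primary: str = "us-east-1",
                         standby: str = "us-west-2") -> Optional[str]:
    """All traffic goes to the primary; the standby is used only if it fails."""
    if healthy.get(primary, False):
        return primary
    if healthy.get(standby, False):
        # In a real failover this step also requires promoting the standby DB.
        return standby
    return None  # nothing healthy to route to


def route_active_active(home_region: str,
                        healthy: Dict[str, bool]) -> Optional[str]:
    """Users are served by their home region, else by any other healthy region."""
    if healthy.get(home_region, False):
        return home_region
    for region, ok in healthy.items():
        if ok:
            return region
    return None
```

In the active-passive case the routing change is only half the work, since the database still has to be promoted. In the active-active case the routing change is the whole failover, which is why its RTO can approach zero.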

Application-Consistent vs Crash-Consistent

A snapshot taken without coordination is crash-consistent: it's as if the machine crashed at that moment. Most databases will recover (they're designed to). Some won't, or recover with data loss.

Application-consistent snapshots involve:

  1. Quiesce the application — flush in-memory state, pause writes briefly
  2. Snapshot the underlying volume
  3. Resume writes

For databases, this means:

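# Postgres 14 and earlier, assuming the default "postgres" superuser.
# On 15+ the functions are pg_backup_start/pg_backup_stop, and both calls
# must run in the same, still-open session.
docker exec pg psql -U postgres -c "SELECT pg_start_backup('snapshot', true);"
# ... take volume snapshot ...
docker exec pg psql -U postgres -c "SELECT pg_stop_backup();"

# MySQL (the read lock is released when the session ends, so the snapshot
# has to be taken while this session is still open)
docker exec mysql mysql -e "FLUSH TABLES WITH READ LOCK; ..."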

On Kubernetes, Velero can do this for databases by pairing CSI volume snapshots (velero-plugin-for-csi plus a snapshot-capable CSI driver) with pre- and post-backup hooks that freeze and unfreeze the application. AWS Backup orchestrates the freeze for RDS/EBS.

If you skip the freeze, restoring may require crash recovery, and in some edge cases the restored data may be corrupt. Apply the application-consistent treatment to anything you can't afford to lose.
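The quiesce, snapshot, resume sequence is easy to get wrong in one specific way: a failed snapshot that leaves writes paused. Here is a minimal sketch of the ordering, assuming an application object that exposes flush, pause_writes, and resume_writes. Those method names and the RecordingApp stand-in are hypothetical, not a real library's API.

```python
from contextlib import contextmanager


@contextmanager
def quiesced(app):
    """Pause writes for the duration of the block and always resume them."""
    app.flush()           # step 1: flush in-memory state
    app.pause_writes()    # step 1: pause writes briefly
    try:
        yield
    finally:
        app.resume_writes()  # step 3: runs even if the snapshot raises


def consistent_snapshot(app, take_snapshot):
    """Run take_snapshot() (step 2) while the application is quiesced."""
    with quiesced(app):
        return take_snapshot()


class RecordingApp:
    """Hypothetical stand-in for a real application; records calls in order."""

    def __init__(self):
        self.calls = []

    def flush(self):
        self.calls.append("flush")

    def pause_writes(self):
        self.calls.append("pause_writes")

    def resume_writes(self):
        self.calls.append("resume_writes")
```

The try/finally is the design point: the window in which writes are paused should be bounded by the snapshot call, whatever its outcome.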

Snapshot Lifecycle

Snapshots are point-in-time references; without lifecycle management, they multiply forever and cost a fortune.

A common scheme (GFS - Grandfather-Father-Son):

| Frequency | Retention |
| --- | --- |
| Hourly snapshots | Last 24 |
| Daily snapshots | Last 7 |
| Weekly snapshots | Last 4 |
| Monthly snapshots | Last 12 |
| Yearly snapshots | Last 7 |

Implementation:

# AWS Backup plan (each rule needs a target vault; "tier1-vault" is a placeholder name)
{
  "BackupPlan": {
    "BackupPlanName": "tier1-services",
    "Rules": [
      { "RuleName": "hourly", "TargetBackupVaultName": "tier1-vault",
        "ScheduleExpression": "cron(0 * * * ? *)",
        "Lifecycle": { "DeleteAfterDays": 1 } },
      { "RuleName": "daily", "TargetBackupVaultName": "tier1-vault",
        "ScheduleExpression": "cron(0 5 * * ? *)",
        "Lifecycle": { "DeleteAfterDays": 7 } },
      { "RuleName": "monthly", "TargetBackupVaultName": "tier1-vault",
        "ScheduleExpression": "cron(0 5 1 * ? *)",
        "Lifecycle": { "MoveToColdStorageAfterDays": 30, "DeleteAfterDays": 365 } }
    ]
  }
}

Move older snapshots to cold storage (Glacier, S3 Deep Archive): they remain restorable at much lower storage cost, though restores take longer and cold tiers carry minimum storage durations.
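For snapshot stores that have no managed lifecycle feature, the GFS scheme can be applied by hand. The sketch below is one possible reading of the table above: for each tier, keep the newest snapshot in each of the most recent N time buckets. The function and constant names are illustrative, and real tools may pick the oldest snapshot per bucket instead.

```python
from datetime import datetime, timedelta

# (tier, bucket key for a timestamp, number of buckets to keep)
TIERS = [
    ("hourly", lambda t: (t.year, t.month, t.day, t.hour), 24),
    ("daily", lambda t: (t.year, t.month, t.day), 7),
    ("weekly", lambda t: tuple(t.isocalendar())[:2], 4),  # (ISO year, ISO week)
    ("monthly", lambda t: (t.year, t.month), 12),
    ("yearly", lambda t: (t.year,), 7),
]


def snapshots_to_keep(timestamps):
    """Return the set of snapshot timestamps retained under the GFS tiers."""
    keep = set()
    newest_first = sorted(timestamps, reverse=True)
    for _tier, bucket_of, count in TIERS:
        seen = set()
        for ts in newest_first:
            key = bucket_of(ts)
            if key in seen:
                continue          # a newer snapshot already represents this bucket
            if len(seen) == count:
                break             # this tier has all the buckets it retains
            seen.add(key)
            keep.add(ts)
    return keep


if __name__ == "__main__":
    # Sixty days of hourly snapshots, as an example input.
    snaps = [datetime(2024, 1, 1) + timedelta(hours=h) for h in range(60 * 24)]
    kept = snapshots_to_keep(snaps)
    print(f"{len(kept)} of {len(snaps)} snapshots retained")
```

Everything not in the returned set is a deletion candidate. A snapshot retained by several tiers (the newest one always is) is counted once.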

Immutable / Ransomware-Resistant Backups

Ransomware encrypts everything reachable. If your backups are reachable from the application, they get encrypted too. Defenses:

| Mechanism | What it provides |
| --- | --- |
| S3 Object Lock (Compliance mode) | Object can't be deleted or modified before retention expires, even by root |
| Glacier Vault Lock | Same for Glacier |
| AWS Backup Vault Lock | Same for AWS Backup |
| Air-gapped tape | Physical isolation; tested by major banks |
| Cross-account, cross-cloud copies | An attacker with one account doesn't have the other |
| WORM (write-once-read-many) NAS | Hardware enforcement |

A minimum modern config: backups go to S3 with Object Lock in compliance mode, with a retention of weeks/months. Even compromised admin credentials can't delete them.
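As a rough illustration of that minimum config, the sketch below builds the arguments for an S3 upload with a compliance-mode retention date. It assumes boto3 and a bucket that was created with Object Lock enabled; the helper name, bucket name, and key in the usage comment are placeholders.

```python
from datetime import datetime, timedelta, timezone


def object_lock_put_args(bucket, key, body, retain_days, now=None):
    """Build keyword arguments for an S3 put_object call under Object Lock."""
    now = now or datetime.now(timezone.utc)
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        # COMPLIANCE mode: retention cannot be shortened or removed by any user.
        "ObjectLockMode": "COMPLIANCE",
        "ObjectLockRetainUntilDate": now + timedelta(days=retain_days),
    }


# Usage (requires boto3 and an Object Lock-enabled bucket; names are placeholders):
#   import boto3
#   boto3.client("s3").put_object(**object_lock_put_args(
#       "example-backup-bucket", "db/backup.dump", data, retain_days=90))
```

Because compliance-mode retention cannot be shortened once set, it is worth trying the retention period on a scratch bucket first.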

Cross-Region Replication

For databases:

  • AWS RDS: read replicas in another region, async; promote on failure
  • Aurora Global Database: continuous physical replication; failover in ~1 minute
  • DynamoDB Global Tables: multi-master, no failover concept — every region writeable
  • Postgres: logical replication (subscription) or physical streaming
  • MySQL: classic primary-replica, semi-sync available

For objects:

  • S3 Cross-Region Replication (CRR): automatic
  • GCS Multi-region buckets: built-in
  • Custom replication (for example, S3 event notifications invoking a Lambda function): for fine control

For Kubernetes resources:

  • Velero with --snapshot-move-data: backup in one region, restore in another
  • Multi-cluster apps: GitOps deploys to multiple clusters; cluster failure = traffic shift, not data restore

Runbooks

A runbook is the document an on-call engineer reads at 3 AM when production is on fire. Required sections:

# Region Failover Runbook

## When to use
Primary region (us-east-1) unhealthy for >5 minutes per synthetics.

## Pre-conditions
- Read replica in us-west-2 is healthy and not too far behind (< 30s replica lag)
- Standby application infra is current (deployed via GitOps)
- DNS health checks configured

## Steps
1. Declare incident in Slack #incidents
2. Verify in dashboard: primary region truly down (not just our monitoring)
3. Run: `./scripts/failover.sh us-west-2`
4. Verify: synthetic transactions succeed against us-west-2
5. Communicate: status page, customer-facing message
6. Failback runbook: separate document, run after primary recovers

## Verification
- Check that ... command outputs ...
- Customer journey X works
- Metrics in Datadog show ...

## Rollback (if failover itself fails)
- ...

## On-call escalation
- DBA on-call for PostgreSQL promotion issues
- Networking on-call for DNS / load balancer issues

Runbooks live in Git, are reviewed quarterly, and are tested during GameDay drills. A runbook untested for a year is fiction.
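Pre-conditions like the ones in the sample runbook are good candidates for an automated check that runs before the failover script. This is a minimal sketch, assuming the status values have already been collected from monitoring; the StandbyStatus fields and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import List

MAX_REPLICA_LAG_SECONDS = 30  # threshold from the runbook's pre-conditions


@dataclass
class StandbyStatus:
    replica_healthy: bool
    replica_lag_seconds: float
    infra_current: bool
    dns_health_checks_configured: bool


def failover_blockers(status: StandbyStatus) -> List[str]:
    """Return the unmet pre-conditions; an empty list means failover may proceed."""
    blockers = []
    if not status.replica_healthy:
        blockers.append("read replica is unhealthy")
    if status.replica_lag_seconds >= MAX_REPLICA_LAG_SECONDS:
        blockers.append(
            f"replica lag is {status.replica_lag_seconds:.0f}s, "
            f"limit is {MAX_REPLICA_LAG_SECONDS}s"
        )
    if not status.infra_current:
        blockers.append("standby application infra is not current")
    if not status.dns_health_checks_configured:
        blockers.append("DNS health checks are not configured")
    return blockers
```

Returning the full list of blockers, rather than stopping at the first, gives the on-call engineer the whole picture in one pass.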

GameDay Drills

The practice that separates real DR from theater: deliberately break production-like environments and execute recovery.

Typical GameDay:

  • Pre-day: scope (what fails, what's safe to break)
  • Day: fail something at announced time; team executes runbook with real-time metrics
  • Post-day: write findings; file tickets for gaps

Common drill scenarios:

| Scenario | What it tests |
| --- | --- |
| Primary DB unreachable | DB failover; replica promotion; client reconnection |
| Region API endpoints fail | DNS failover; standby app capacity; data consistency post-recovery |
| Single AZ outage | Cross-AZ resilience; load balancer health checks |
| Backup restore | Backup integrity; restore time; data correctness |
| Ransomware simulated | Immutable backup restore; from-scratch infra rebuild |
| Whole region | Full cross-region failover; coordination across teams |

Frequency: quarterly minimum for critical services. Monthly for the highest-impact ones. New services should drill once before they go to production.
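A cadence only holds if something flags the services that have slipped. The sketch below is one way to express the rule, assuming each service carries a tier label and a last-drill date; the tier names and day counts are illustrative approximations of "monthly" and "quarterly".

```python
from datetime import date
from typing import Optional

# Maximum days allowed between drills, approximating the cadence above.
MAX_DAYS_BETWEEN_DRILLS = {
    "highest-impact": 31,  # monthly
    "critical": 92,        # quarterly
}


def drill_overdue(tier: str, last_drill: Optional[date], today: date) -> bool:
    """True if the service has never drilled or has exceeded its tier's interval."""
    if last_drill is None:
        return True  # never drilled, including new services not yet in production
    return (today - last_drill).days > MAX_DAYS_BETWEEN_DRILLS[tier]
```

For example, `drill_overdue("critical", None, date.today())` returns True, which matches the rule that a new service drills once before launch.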

This connects to Chaos Engineering: chaos is the technique, GameDay is the format.

Backup of Stateful K8s Workloads

For databases running in Kubernetes (often a regret, but common):

  • Use operators: Zalando Postgres Operator, CrunchyData PGO, Strimzi Kafka. They orchestrate snapshots + WAL + restore.
  • Velero + CSI snapshots: snapshot the PVC; restore creates a new PV from the snapshot.
  • Database-native backup to S3: WAL-G for Postgres, mongodump to S3, etc. Often simpler than Velero for stateful workloads.
  • Backup to a separate cluster: a "DR cluster" with restore tooling pre-deployed.

Configuration and Secrets Backup

Don't forget non-data:

  • Vault / Secrets Manager: secrets are data too. For Vault with integrated (Raft) storage, vault operator raft snapshot save captures a snapshot for recovery.
  • Cluster certificates and CAs: PKI bootstrap is painful to recreate; back up the root CA.
  • Terraform state: state file in S3 with versioning enabled. State loss = "we have no idea what we own."
  • CI/CD configuration: GitHub Actions secrets, runner tokens.
  • Identity provider / SSO configuration: app registrations, group mappings.
  • DNS records: domain registrar dump or DNS provider export.

When you do a full restore, you need all of these in addition to data.

Tested Restore Process

The restore is the only thing that matters. Process:

  1. Pick a random recent backup
  2. Stand up a fresh environment (or designated restore environment)
  3. Restore data
  4. Run application against restored data
  5. Verify business-meaningful queries return expected results
  6. Document elapsed time

This is the actual RTO. Whatever your target says, this is the truth.

Automate at least the snapshot side. The harder parts — verifying business correctness, customer journey — usually require human judgment but the time should be measured.
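Measuring the elapsed time is the easiest part to automate. The following is a minimal sketch of a drill harness that times each restore step and compares the total against the stated target; the function name and report shape are assumptions, and the steps themselves are supplied by the caller.

```python
import time
from typing import Callable, List, Tuple


def run_restore_drill(steps: List[Tuple[str, Callable[[], None]]],
                      target_rto_seconds: float,
                      clock: Callable[[], float] = time.monotonic) -> dict:
    """Run the named steps in order and report per-step and total elapsed time."""
    timings = []
    start = clock()
    for name, step in steps:
        step_start = clock()
        step()  # an exception aborts the drill, which is itself a finding
        timings.append((name, clock() - step_start))
    measured = clock() - start
    return {
        "measured_rto_seconds": measured,
        "steps": timings,
        "within_target": measured <= target_rto_seconds,
    }
```

Manual steps can be wrapped as callables that block until a person confirms completion, so the human-judgment parts still count toward the measured time.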

Anti-Patterns

Backup ≠ DR. Backup is a tool. DR is the practice (backup + runbooks + failover + drills + communication).

The "we'll figure it out" plan. The actual incident is the worst time to plan. Pre-write everything; rehearse.

Backups in the same region. The disaster that hits your primary will hit your in-region backup. Always offsite.

Daily but never restored. Schrödinger's backup: it both works and doesn't work until you check.

Untested runbook. Steps reference a tool that's been deprecated, a person who left, a command that no longer exists. Runbooks rot.

Single-point-of-knowledge. The one person who understands the failover is on vacation when the disaster hits. Spread the practice; rotate GameDay leadership.

RTO/RPO statements without actual measurement. "Our RTO is 30 minutes" — when was it last measured? State your number based on the latest drill.

What's Next

  • Best Practices — testing cadence, RTO measurement, encryption, compliance, pitfalls
