Steven's Knowledge

Disaster Recovery & Backup

Velero, Restic, snapshot patterns, cross-region replication, RTO/RPO - getting back online when things go very wrong

Disaster recovery (DR) is the discipline of getting back online after things go very wrong: a region outage, an accidentally dropped database, ransomware encrypting your storage, a deletion that wasn't supposed to happen. Backup is one tool in the DR toolbox; DR also includes replicas, failover patterns, runbooks, and the practice of testing it all.

The hard truth: untested backups don't exist. Every team has a backup process; far fewer have actually restored from one. The first time you find out your backup is incomplete or unrestorable is during the worst day of your career.

Why DR Deliberately

| Without DR practice | With DR practice |
|---|---|
| Region outage = scramble to figure out what to do | Region outage = execute runbook, recover in minutes |
| Accidental DB drop = data loss, panic | Restore from PITR backup; loss bounded |
| Ransomware encrypts production | Restore from immutable offline backup |
| "We have backups", never tested | Quarterly restore drills; verified |
| Single region/AZ: convenient, fragile | Multi-region or cross-region replica with documented failover |
| Recovery point ambiguous | RPO defined, measured, met |
| Customer asks about resilience | Documented BCP, evidence of drills |

The cost of DR isn't huge if you build it in early; it's catastrophic to retrofit during an incident.

RTO and RPO

Two numbers define the recovery target:

| Term | Definition | Example |
|---|---|---|
| RTO (Recovery Time Objective) | How long until back online | "60 minutes from declaration" |
| RPO (Recovery Point Objective) | How much data loss is acceptable | "Up to 5 minutes of writes lost" |

Stronger numbers cost more. Match RTO/RPO to business impact per service tier:

| Tier | RTO | RPO | Approach |
|---|---|---|---|
| Critical (checkout, auth) | 5-15 min | 0-1 min | Active-active multi-region or hot standby |
| Important (admin, reports) | 1-4 hours | 5-15 min | Warm standby + PITR backups |
| Standard (most services) | 4-24 hours | 1 hour | Daily full backup + hourly incrementals + ability to redeploy |
| Best-effort (internal tools) | Days | Days | Backup only |

A Critical-tier service costs far more to operate than a Standard-tier one. Don't apply Critical-tier expectations to everything; you'll burn money on services that don't need it.
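To make a per-tier RPO measurable, one option is to compare the age of the newest restorable backup against the tier's target. The sketch below is illustrative only: the RPO values are the upper bounds from the tier table above, while `TIER_RPO`, `rpo_met`, and the timestamps are hypothetical names and data invented for this example.

```python
from datetime import datetime, timedelta, timezone

# Upper bounds of the RPO column in the tier table. "Best-effort" is left out
# because the table only says "Days".
TIER_RPO = {
    "critical": timedelta(minutes=1),
    "important": timedelta(minutes=15),
    "standard": timedelta(hours=1),
}


def rpo_met(tier: str, last_backup_at: datetime, now: datetime) -> bool:
    """Return True if a failure at `now` would lose no more data than the tier allows."""
    exposure = now - last_backup_at  # writes made since the last restorable point
    return exposure <= TIER_RPO[tier]


# Hypothetical situation: the newest restorable backup is 10 minutes old.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_backup = now - timedelta(minutes=10)

for tier in TIER_RPO:
    status = "ok" if rpo_met(tier, last_backup, now) else "VIOLATED"
    print(f"{tier}: {status}")
```

Run on a schedule against real backup metadata, a check like this turns "RPO defined" into "RPO measured".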

The Players

Kubernetes-native backup

| Tool | Best for |
|---|---|
| Velero | The standard; backs up cluster resources + PVCs to object storage |
| Kasten K10 (Veeam) | Commercial, full-featured, application-aware |
| Stash | OSS, Kubernetes-native |
| Portworx PX-Backup | Enterprise; integrates with Portworx storage |

Block / file backup

| Tool | Notes |
|---|---|
| Restic | OSS, encrypted, deduplicated, cross-platform |
| Borg | OSS, deduplicated, popular for server backups |
| Duplicity | OSS, classic |
| rsync + rsnapshot | Old, reliable for files |

Database backup

Each engine has native tools:

  • Postgres: pg_basebackup, WAL archiving (PITR), logical replication
  • MySQL: mysqldump, xtrabackup, binlog-based PITR
  • MongoDB: mongodump, oplog tailing
  • Cassandra: nodetool snapshot + sstableloader
  • DynamoDB / RDS: AWS-managed PITR, snapshots
  • Elasticsearch: _snapshot API to S3

Cloud DR services

| Provider | Service |
|---|---|
| AWS | AWS Backup; AWS Elastic Disaster Recovery; AWS Resilience Hub |
| Azure | Azure Backup; Azure Site Recovery |
| GCP | Cloud Backup and DR; cross-region replication built into managed services |

Managed services like RDS, Cloud SQL, DynamoDB make backup and PITR a checkbox — use them.

The Backup Bestiary

Not all backups are equal:

| Type | What | When |
|---|---|---|
| Snapshot | Block-level copy at a point in time (LVM, EBS, ZFS) | Fast for VMs and databases; the copy is near-instantaneous, but consistency matters (flush or quiesce writes first) |
| Full | Complete copy of data | Baseline; expensive at scale |
| Incremental | Only what changed since last backup | Fast and small; restore needs base + chain |
| Differential | Only what changed since last full | Bigger than incremental, faster restore |
| PITR (point-in-time) | Base backup + continuous log archiving = restore to any moment | Strongest for DBs; biggest storage cost |
| Replication | Real-time stream to a standby | Lowest RPO; the standby is an always-on cost |
| Immutable | Write-once-read-many; can't be deleted/encrypted by attacker | Ransomware protection |

Most production stacks combine these: snapshots for fast recovery, replication for low RPO, PITR for fine-grained recovery, immutable offsite for ransomware.
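The restore-time difference between incremental and differential backups is easiest to see by working out which backups a restore actually needs. The following is a minimal sketch, not any tool's real algorithm: `Backup`, `restore_chain`, and the integer "day" timeline are hypothetical, and it assumes an incremental holds changes since the most recent backup of any kind.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    day: int   # day the backup was taken (hypothetical integer timeline)
    kind: str  # "full", "incremental", or "differential"


def restore_chain(backups: list[Backup], target_day: int) -> list[Backup]:
    """Return the backups needed, in apply order, to restore to `target_day`."""
    usable = sorted((b for b in backups if b.day <= target_day), key=lambda b: b.day)
    fulls = [b for b in usable if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup at or before the target day")
    base = fulls[-1]
    after_base = [b for b in usable if b.day > base.day]

    chain = [base]
    # A differential holds everything since the last full, so only the newest
    # one is needed; anything taken before it is redundant.
    diffs = [b for b in after_base if b.kind == "differential"]
    start_day = base.day
    if diffs:
        chain.append(diffs[-1])
        start_day = diffs[-1].day
    # Incrementals hold changes since the previous backup, so every one after
    # the starting point must be applied, in order.
    chain += [b for b in after_base if b.kind == "incremental" and b.day > start_day]
    return chain


incr = [Backup(0, "full")] + [Backup(d, "incremental") for d in (1, 2, 3)]
diff = [Backup(0, "full")] + [Backup(d, "differential") for d in (1, 2, 3)]

print([b.day for b in restore_chain(incr, 3)])  # full + every incremental
print([b.day for b in restore_chain(diff, 3)])  # full + newest differential only
```

The incremental schedule needs all four backups intact to reach day 3, while the differential schedule needs two. That is the trade behind "restore needs base + chain": a single damaged incremental breaks every restore point after it.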

The 3-2-1 Rule

Time-tested guidance: 3 copies of data, on 2 different media, with 1 offsite.

Modern interpretation:

  • 3 copies: primary, secondary (replica or snapshot), tertiary (backup)
  • 2 different media/services: e.g., live storage + S3 + cold archive
  • 1 offsite (different region or different cloud)
  • Modern addition: at least 1 is immutable

Why offsite: a region outage takes your primary and your in-region backup. Why immutable: ransomware encrypts everything it can reach.
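The rule is simple enough to check mechanically against an inventory of where your data lives. Here is a small sketch under stated assumptions: `Copy`, `check_3_2_1`, the medium labels, and the region names are all hypothetical, and "offsite" is taken to mean "in a different location than the primary".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    name: str
    medium: str      # e.g. "block-volume", "object-storage", "cold-archive"
    location: str    # region or site identifier
    immutable: bool = False


def check_3_2_1(copies: list[Copy], primary_location: str) -> dict[str, bool]:
    """Evaluate each clause of the 3-2-1 rule plus the immutability addition."""
    return {
        "3 copies": len(copies) >= 3,
        "2 media": len({c.medium for c in copies}) >= 2,
        "1 offsite": any(c.location != primary_location for c in copies),
        "1 immutable": any(c.immutable for c in copies),
    }


copies = [
    Copy("primary", "block-volume", "eu-west-1"),
    Copy("snapshot", "block-volume", "eu-west-1"),
    Copy("backup", "object-storage", "eu-central-1", immutable=True),
]

for clause, ok in check_3_2_1(copies, primary_location="eu-west-1").items():
    print(f"{clause}: {'ok' if ok else 'MISSING'}")
```

Dropping the third copy from the list fails every clause at once, which mirrors the point above: an in-region snapshot alone protects against neither a region outage nor ransomware.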

Multi-Region: The Spectrum

How "multi-region" you go is a slider, not a switch:

single region
    ↓
single region + cross-region backup (RPO: hours)
    ↓
single region + cross-region warm standby (RPO: minutes; RTO: minutes-hours)
    ↓
active-passive multi-region (RPO: seconds; RTO: minutes)
    ↓
active-active multi-region (RPO: zero; RTO: ~zero)
    ↓
multi-cloud + multi-region (sovereignty / vendor)

Each step is more complex and more expensive. The right answer is rarely the most rigorous option; match the level to actual business impact and budget.

Backup Anti-Patterns

No restore test. The backup job runs nightly, but nothing has ever been restored from it. You don't have backups, you have hope. Test restores at least monthly.
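A restore test needs a pass/fail criterion, not just "the job exited 0". One simple option is to record checksums at backup time and compare them after restoring. This is a minimal sketch for file-level backups: `build_manifest` and `verify_restore` are hypothetical helpers, and `shutil.copytree` stands in for whatever your real restore step is.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's relative path to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_restore(manifest: dict[str, str], restored_root: Path) -> list[str]:
    """Return paths that are missing or differ after a restore; empty means verified."""
    restored = build_manifest(restored_root)
    return [path for path, digest in manifest.items() if restored.get(path) != digest]


with tempfile.TemporaryDirectory() as tmp:
    source, restored = Path(tmp, "source"), Path(tmp, "restored")
    source.mkdir()
    (source / "users.csv").write_text("id,name\n1,ada\n")
    (source / "orders.csv").write_text("id,total\n1,9.99\n")
    manifest = build_manifest(source)          # recorded at backup time

    shutil.copytree(source, restored)          # stand-in for a real restore
    clean = verify_restore(manifest, restored)
    (restored / "orders.csv").write_text("")   # simulate a truncated file
    damaged = verify_restore(manifest, restored)

print("clean restore mismatches:", clean)
print("damaged restore mismatches:", damaged)
```

For databases the equivalent check is usually a set of sanity queries (row counts, latest timestamps) against the restored instance rather than file hashes.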

Backups on the same volume as data. The fire that destroys your data destroys the backup too. Keep a copy offsite: a different region, a different cloud, or a different physical location.

Backups without immutability. Ransomware encrypts the backup along with the source. Use S3 Object Lock, Glacier Vault Lock, or air-gapped storage.
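As an illustration of the S3 Object Lock option, the sketch below builds the arguments for an upload with a COMPLIANCE-mode retention period. `ObjectLockMode` and `ObjectLockRetainUntilDate` are parameters of boto3's S3 `put_object`; the helper names, bucket, key, and 30-day retention are hypothetical. It assumes the bucket already has Object Lock enabled, and `upload_immutable` is shown untested because it needs AWS credentials.

```python
from datetime import datetime, timedelta, timezone


def object_lock_params(bucket: str, key: str, body: bytes,
                       retain_days: int, now: datetime) -> dict:
    """Build keyword arguments for S3 put_object with a COMPLIANCE-mode retention."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        # In COMPLIANCE mode the retention cannot be shortened or removed
        # until the retain-until date passes.
        "ObjectLockMode": "COMPLIANCE",
        "ObjectLockRetainUntilDate": now + timedelta(days=retain_days),
    }


def upload_immutable(bucket: str, key: str, body: bytes, retain_days: int = 30):
    """Upload one backup object under Object Lock (requires boto3 and credentials)."""
    import boto3  # imported lazily so the helper above works without AWS access

    params = object_lock_params(bucket, key, body, retain_days,
                                datetime.now(timezone.utc))
    return boto3.client("s3").put_object(**params)


# Hypothetical bucket and key, used only to show the resulting parameters.
params = object_lock_params("example-backups", "db/2024-01-01.dump", b"...",
                            retain_days=30,
                            now=datetime(2024, 1, 1, tzinfo=timezone.utc))
print(params["ObjectLockMode"], params["ObjectLockRetainUntilDate"].isoformat())
```

Choose the retention period deliberately: it should outlast the time an attacker could plausibly sit undetected, but every locked object is storage you cannot delete early.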

Backing up the data but not the schema. Restoring rows is useless without the database schema, the app version that wrote them, and the config that gave them meaning.

Forgetting secrets. Restoring the database is great. Without the encryption key for the data at rest, the restored disk is useless. Back up the system, including all its dependencies.

Tying backup to a vendor that can be locked out. Your account gets suspended; your backups go too. Back up out of any single vendor's domain.

The DR test you don't want to run is the one you most need to run. "What if the whole AWS region we're in dies right now?" Try it. Spin up cold infrastructure in a different region; restore from your offsite backups; time how long until services are usable. The first time is humbling. The second time is informative. By the third time, it is a routine operational competence.
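"Time how long until services are usable" is worth automating so that every drill produces a number to compare against the RTO. Below is a rough sketch of such a harness: `run_drill` and the step names are hypothetical, and the `time.sleep` calls are placeholders for real provisioning, restore, and smoke-test steps.

```python
import time
from typing import Callable


def run_drill(steps: list[tuple[str, Callable[[], None]]], rto_seconds: float) -> dict:
    """Run each drill step in order, time it, and compare the total against the RTO."""
    timings = {}
    started = time.monotonic()
    for name, step in steps:
        step_started = time.monotonic()
        step()  # an exception here aborts the drill, which is itself a finding
        timings[name] = time.monotonic() - step_started
    total = time.monotonic() - started
    return {"timings": timings, "total": total, "rto_met": total <= rto_seconds}


# Placeholder steps; a real drill would provision infrastructure, restore data,
# and run smoke tests here.
steps = [
    ("provision", lambda: time.sleep(0.01)),
    ("restore", lambda: time.sleep(0.01)),
    ("smoke-test", lambda: time.sleep(0.01)),
]

report = run_drill(steps, rto_seconds=3600)
for name, seconds in report["timings"].items():
    print(f"{name}: {seconds:.2f}s")
print("RTO met:", report["rto_met"])
```

Keeping the per-step timings from each drill shows where the recovery time actually goes, which is usually more useful than the total alone.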
