Disaster Recovery & Backup
Velero, Restic, snapshot patterns, cross-region replication, RTO/RPO - getting back online when things go very wrong
Disaster recovery (DR) is the discipline of getting back online after things go very wrong: a region outage, an accidentally-dropped database, a ransomware encryption of your storage, a deletion that wasn't supposed to happen. Backup is one tool in the DR toolbox; DR also includes replicas, failover patterns, runbooks, and the practice of testing it all.
The hard truth: untested backups don't exist. Every team has a backup process; far fewer have actually restored from one. The first time you find out your backup is incomplete or unrestorable is during the worst day of your career.
Why Practice DR Deliberately
| Without DR practice | With DR practice |
|---|---|
| Region outage = scramble to figure out what to do | Region outage = execute runbook, recover in minutes |
| Accidental DB drop = data loss, panic | Restore from PITR backup; loss bounded |
| Ransomware encrypts production | Restore from immutable offline backup |
| "We have backups" — never tested | Quarterly restore drills; verified |
| Single region/AZ: convenient, fragile | Multi-region or cross-region replica with documented failover |
| Recovery point ambiguous | RPO defined, measured, met |
| Customer asks about resilience | Documented BCP, evidence of drills |
The cost of DR isn't huge if you build it in early; it's catastrophic to retrofit during an incident.
RTO and RPO
Two numbers define the recovery target:
| Term | Definition | Example |
|---|---|---|
| RTO (Recovery Time Objective) | How long until back online | "60 minutes from declaration" |
| RPO (Recovery Point Objective) | How much data loss is acceptable | "Up to 5 minutes of writes lost" |
Stronger numbers cost more. Match RTO/RPO to business impact per service tier:
| Tier | RTO | RPO | Approach |
|---|---|---|---|
| Critical (checkout, auth) | 5-15 min | 0-1 min | Active-active multi-region or hot standby |
| Important (admin, reports) | 1-4 hours | 5-15 min | Warm standby + PITR backups |
| Standard (most services) | 4-24 hours | 1 hour | Daily backup + ability to redeploy |
| Best-effort (internal tools) | Days | Days | Backup only |
A tier-1 service costs a lot more to operate than a tier-3 one. Don't apply tier-1 expectations to everything; you'll burn money on the unimportant.
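One way to keep these targets honest is to record them as data and compare every drill against them. The following is a minimal sketch, not a prescribed tool: the names `Target`, `TIERS`, and `check_drill` are hypothetical, the values are the upper bounds from the tier table above, and "Days" is assumed to mean three days.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Target:
    rto: timedelta  # maximum acceptable time to recover
    rpo: timedelta  # maximum acceptable window of lost writes


# Upper bounds from the tier table; "Days" is assumed to mean 3 days.
TIERS = {
    "critical": Target(rto=timedelta(minutes=15), rpo=timedelta(minutes=1)),
    "important": Target(rto=timedelta(hours=4), rpo=timedelta(minutes=15)),
    "standard": Target(rto=timedelta(hours=24), rpo=timedelta(hours=1)),
    "best-effort": Target(rto=timedelta(days=3), rpo=timedelta(days=3)),
}


def check_drill(tier: str, recovery_time: timedelta, data_loss: timedelta) -> list[str]:
    """Return human-readable misses; an empty list means the drill met its targets."""
    target = TIERS[tier]
    misses = []
    if recovery_time > target.rto:
        misses.append(f"RTO missed: took {recovery_time}, target {target.rto}")
    if data_loss > target.rpo:
        misses.append(f"RPO missed: lost {data_loss}, target {target.rpo}")
    return misses


if __name__ == "__main__":
    # A drill for a critical service that took 30 minutes and lost 30 seconds of writes.
    for miss in check_drill("critical", timedelta(minutes=30), timedelta(seconds=30)):
        print(miss)
```

Feeding real drill measurements into a check like this turns "RPO defined, measured, met" into something that can fail visibly.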
The Players
Kubernetes-native backup
| Tool | Best for |
|---|---|
| Velero | The standard; backs up cluster resources + PVCs to object storage |
| Kasten K10 (Veeam) | Commercial, full-featured, application-aware |
| Stash | OSS Kubernetes-native |
| Portworx PX-Backup | Enterprise; integrates with Portworx storage |
Block / file backup
| Tool | Notes |
|---|---|
| Restic | OSS, encrypted, deduplicated, cross-platform |
| Borg | OSS, deduplicated, popular for server backups |
| Duplicity | OSS, classic |
| rsync + rsnapshot | Old, reliable for files |
Database backup
Each engine has native tools:
- Postgres: `pg_basebackup`, WAL archiving (PITR), logical replication
- MySQL: `mysqldump`, `xtrabackup`, binlog-based PITR
- MongoDB: `mongodump`, oplog tailing
- Cassandra: `nodetool snapshot` + `sstableloader`
- DynamoDB / RDS: AWS-managed PITR, snapshots
- Elasticsearch: `_snapshot` API to S3
Cloud DR services
| Provider | Service |
|---|---|
| AWS | AWS Backup; AWS Elastic Disaster Recovery; AWS Resilience Hub |
| Azure | Azure Backup; Azure Site Recovery |
| GCP | Cloud Backup and DR; cross-region replication built into managed services |
Managed services like RDS, Cloud SQL, DynamoDB make backup and PITR a checkbox — use them.
The Backup Bestiary
Not all backups are equal:
| Type | What | When |
|---|---|---|
| Snapshot | Block-level copy at a point in time (LVM, EBS, ZFS) | Fast recovery for VMs and databases; near-instant to take, but the snapshot must be application-consistent |
| Full | Complete copy of data | Baseline; expensive at scale |
| Incremental | Only what changed since last backup | Fast and small; restore needs base + chain |
| Differential | Only what changed since last full | Bigger than incremental, faster restore |
| PITR (point-in-time) | Continuous log archiving + base = restore to any moment | Strongest for DBs; biggest storage cost |
| Replication | Real-time stream to a standby | Lowest RPO; you pay to keep the standby running |
| Immutable | Write-once-read-many; can't be deleted/encrypted by attacker | Ransomware protection |
Most production stacks combine these: snapshots for fast recovery, replication for low RPO, PITR for fine-grained recovery, immutable offsite for ransomware.
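The restore-time difference between incremental and differential backups is easiest to see by working out which backups a restore actually needs. The sketch below is illustrative only: `Backup` and `restore_chain` are hypothetical names, and it assumes an incremental captures changes since the previous backup of any kind, while a differential captures changes since the last full.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    name: str
    kind: str  # "full", "incremental", or "differential"


def restore_chain(history: list[Backup]) -> list[str]:
    """Return the backups to apply, in order, to restore to the end of `history`.

    `history` is ordered oldest to newest and must contain at least one full backup.
    """
    # Start from the most recent full backup.
    last_full = max(i for i, b in enumerate(history) if b.kind == "full")
    chain = [history[last_full]]
    for backup in history[last_full + 1:]:
        if backup.kind == "differential":
            # A differential holds everything since the last full, so it
            # supersedes whatever was applied after that full so far.
            chain = [history[last_full], backup]
        elif backup.kind == "incremental":
            chain.append(backup)
    return [b.name for b in chain]


if __name__ == "__main__":
    incrementals = [
        Backup("sun-full", "full"),
        Backup("mon-inc", "incremental"),
        Backup("tue-inc", "incremental"),
    ]
    differentials = [
        Backup("sun-full", "full"),
        Backup("mon-diff", "differential"),
        Backup("tue-diff", "differential"),
    ]
    print(restore_chain(incrementals))   # full plus every incremental
    print(restore_chain(differentials))  # full plus only the latest differential
```

The incremental chain grows with every backup since the last full, and losing any link breaks the restore. The differential chain is always two items long, at the cost of larger backups.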
The 3-2-1 Rule
Time-tested guidance: 3 copies of data, on 2 different media, with 1 offsite.
Modern interpretation:
- 3 copies: primary, secondary (replica or snapshot), tertiary (backup)
- 2 different media/services: e.g., live storage + S3 + cold archive
- 1 offsite (different region or different cloud)
- Modern addition: at least 1 is immutable
Why offsite: a region outage takes your primary and your in-region backup. Why immutable: ransomware encrypts everything it can reach.
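The rule is mechanical enough to check against an inventory of where your data lives. Here is a minimal sketch under stated assumptions: `Copy` and `check_321` are hypothetical names, the inventory is maintained by hand or exported from your tooling, and "offsite" is approximated as "any location other than the primary's".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    name: str
    medium: str    # e.g. "block-volume", "object-storage", "cold-archive"
    location: str  # region or provider the copy lives in
    immutable: bool = False


def check_321(copies: list[Copy], primary_location: str) -> dict[str, bool]:
    """Evaluate each clause of the 3-2-1 rule plus the immutability addition."""
    return {
        "three_copies": len(copies) >= 3,
        "two_media": len({c.medium for c in copies}) >= 2,
        "one_offsite": any(c.location != primary_location for c in copies),
        "one_immutable": any(c.immutable for c in copies),
    }


if __name__ == "__main__":
    inventory = [
        Copy("primary-db", "block-volume", "region-a"),
        Copy("daily-snapshot", "block-volume", "region-a"),
        Copy("offsite-archive", "object-storage", "region-b", immutable=True),
    ]
    for clause, ok in check_321(inventory, primary_location="region-a").items():
        print(f"{clause}: {'ok' if ok else 'MISSING'}")
```

Region names and media labels here are placeholders; the point is that each clause of the rule becomes a yes/no answer you can review regularly.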
Learning Path
1. Getting Started
Install Velero on kind; backup and restore a workload; PostgreSQL PITR; restic offsite backup; failover exercise
2. Patterns
Multi-region failover, snapshot lifecycle, immutable backups, application-consistent backups, runbooks, GameDay drills
3. Best Practices
Testing cadence, RTO/RPO measurement, encryption, compliance, common pitfalls, scaling, cost optimization
Multi-Region: The Spectrum
How "multi-region" you go is a slider, not a switch:
single region
│
▼
single region + cross-region backup (RPO: hours)
│
▼
single region + cross-region warm standby (RPO: minutes; RTO: minutes-hours)
│
▼
active-passive multi-region (RPO: seconds; RTO: minutes)
│
▼
active-active multi-region (RPO: ~zero; RTO: ~zero)
│
▼
multi-cloud + multi-region (sovereignty / vendor)
Each step is more complex and more expensive. The right answer is rarely the most rigorous; match to actual business impact and budget.
Backup Anti-Patterns
No restore test. The backup process runs nightly but has never been restored. You don't have backups, you have hope. Test restores at least monthly.
Backups on the same volume as data. The fire that destroys your data also destroys the backup. Keep backups offsite: a different region, a different cloud, or a different physical location.
Backups without immutability. Ransomware encrypts the backup along with the source. Use S3 Object Lock, Glacier Vault Lock, or air-gapped storage.
Backing up the data but not the schema. Restoring rows is useless without the database schema, the app version that wrote them, and the config that gave them meaning.
Forgetting secrets. Restoring the database is great. Without the encryption key for the data at rest, the restored disk is useless. Back up the system, including all its dependencies.
Tying backup to a vendor that can lock you out. If your account gets suspended, your backups go with it. Keep at least one copy outside any single vendor's control.
The DR test you don't want to run is the one you most need to run. "What if the whole AWS region we're in dies right now?" Try it. Spin up cold infrastructure in a different region; restore from your offsite backups; time how long until services are usable. The first time is humbling. The second time is informative. By the third time it's an unsurprising operational competence.
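A drill only counts if the restored data is checked, not just present. The sketch below shows one way to time a restore and verify it against a checksum manifest captured at backup time. It rests on assumptions: `build_manifest`, `verify_restore`, and `timed_drill` are hypothetical helpers, the restore step is whatever callable you supply, and files are small enough to hash in memory (stream large files in chunks instead).

```python
import hashlib
import time
from pathlib import Path
from typing import Callable


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def verify_restore(expected: dict[str, str], restored_root: Path) -> list[str]:
    """Return the relative paths that are missing or differ after a restore."""
    actual = build_manifest(restored_root)
    return sorted(p for p, digest in expected.items() if actual.get(p) != digest)


def timed_drill(
    restore: Callable[[], object],
    expected: dict[str, str],
    restored_root: Path,
) -> tuple[float, list[str]]:
    """Run the caller-supplied restore step, verify it, and return (seconds, problems)."""
    start = time.monotonic()
    restore()
    problems = verify_restore(expected, restored_root)
    return time.monotonic() - start, problems
```

In use, `expected` would be saved alongside the backup, `restore` would invoke your actual restore tooling, and the elapsed seconds become the measured recovery time for that drill. An empty problem list is the evidence that the backup was complete.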