Velero, Restic, snapshot patterns, cross-region replication, RTO/RPO - getting back online when things go very wrong

Disaster Recovery & Backup

Disaster recovery (DR) is the discipline of getting back online after things go very wrong: a region outage, an accidentally dropped database, ransomware encrypting your storage, a deletion that wasn't supposed to happen. Backup is one tool in the DR toolbox; DR also includes replicas, failover patterns, runbooks, and the practice of testing it all.

The hard truth: untested backups don't exist. Every team has a backup process; far fewer have actually restored from one. The first time you find out your backup is incomplete or unrestorable is during the worst day of your career.

Why Practice DR Deliberately

Without DR practice	With DR practice
Region outage = scramble to figure out what to do	Region outage = execute runbook, recover in minutes
Accidental DB drop = data loss, panic	Restore from PITR backup; loss bounded
Ransomware encrypts production	Restore from immutable offline backup
"We have backups" — never tested	Quarterly restore drills; verified
Single region/AZ: convenient, fragile	Multi-region or cross-region replica with documented failover
Recovery point ambiguous	RPO defined, measured, met
Customer asks about resilience	Documented BCP, evidence of drills

The cost of DR isn't huge if you build it in early; it's catastrophic to retrofit during an incident.

RTO and RPO

Two numbers define the recovery target:

Term	Definition	Example
RTO (Recovery Time Objective)	How long until back online	"60 minutes from declaration"
RPO (Recovery Point Objective)	How much data loss is acceptable	"Up to 5 minutes of writes lost"

Stronger numbers cost more. Match RTO/RPO to business impact per service tier:

Tier	RTO	RPO	Approach
Critical (checkout, auth)	5-15 min	0-1 min	Active-active multi-region or hot standby
Important (admin, reports)	1-4 hours	5-15 min	Warm standby + PITR backups
Standard (most services)	4-24 hours	1 hour	Daily backup + ability to redeploy
Best-effort (internal tools)	Days	Days	Backup only

A tier-1 service costs a lot more to operate than a tier-3 one. Don't apply tier-1 expectations to everything; you'll burn money on the unimportant.
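To make "RPO defined, measured, met" concrete, here is a minimal sketch of checking one drill's measurements against per-tier targets. The target values are the upper bounds from the tier table above; the `Target` class, the `evaluate_drill` function, and the example timestamps are hypothetical names and data chosen for illustration, not part of any tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Target:
    rto: timedelta  # maximum acceptable time from declaration to service restored
    rpo: timedelta  # maximum acceptable window of lost writes


# Upper bounds from the tier table; "best-effort" is left out because
# "days" is not a precise number.
TARGETS = {
    "critical": Target(rto=timedelta(minutes=15), rpo=timedelta(minutes=1)),
    "important": Target(rto=timedelta(hours=4), rpo=timedelta(minutes=15)),
    "standard": Target(rto=timedelta(hours=24), rpo=timedelta(hours=1)),
}


def evaluate_drill(tier, declared_at, restored_at, last_recoverable_write, failure_at):
    """Compare one drill's measured RTO and RPO against the tier's targets."""
    target = TARGETS[tier]
    measured_rto = restored_at - declared_at
    measured_rpo = failure_at - last_recoverable_write
    return {
        "measured_rto": measured_rto,
        "measured_rpo": measured_rpo,
        "rto_met": measured_rto <= target.rto,
        "rpo_met": measured_rpo <= target.rpo,
    }


# Hypothetical drill: incident declared 10 minutes after the failure, service
# back 2 hours after the failure, newest recoverable write 20 minutes old.
failure = datetime(2024, 1, 1, 12, 0)
result = evaluate_drill(
    tier="important",
    declared_at=failure + timedelta(minutes=10),
    restored_at=failure + timedelta(hours=2),
    last_recoverable_write=failure - timedelta(minutes=20),
    failure_at=failure,
)
print(result["rto_met"], result["rpo_met"])  # True False
```

In this example the drill meets the RTO but misses the RPO, which is the kind of gap that only shows up when both numbers are actually measured.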

The Players

Kubernetes-native backup

Tool	Best for
Velero	The standard; backs up cluster resources + PVCs to object storage
Kasten K10 (Veeam)	Commercial, full-featured, application-aware
Stash	OSS Kubernetes-native
Portworx PX-Backup	Enterprise; integrates with Portworx storage

Block / file backup

Tool	Notes
Restic	OSS, encrypted, deduplicated, cross-platform
Borg	OSS, deduplicated, popular for server backups
Duplicity	OSS, classic
rsync + rsnapshot	Old, reliable for files

Database backup

Each engine has native tools:

Postgres: pg_basebackup, WAL archiving (PITR), logical replication
MySQL: mysqldump, xtrabackup, binlog-based PITR
MongoDB: mongodump, oplog tailing
Cassandra: nodetool snapshot + sstableloader
DynamoDB / RDS: AWS-managed PITR, snapshots
Elasticsearch: _snapshot API to S3

Cloud DR services

Provider	Service
AWS	AWS Backup; AWS Elastic Disaster Recovery; AWS Resilience Hub
Azure	Azure Backup; Azure Site Recovery
GCP	Cloud Backup and DR; cross-region replication built into managed services

Managed services like RDS, Cloud SQL, DynamoDB make backup and PITR a checkbox — use them.

The Backup Bestiary

Not all backups are equal:

Type	What	When
Snapshot	Block-level copy at a point in time (LVM, EBS, ZFS)	Fast for VMs and databases; the copy is taken at a single moment, so application consistency matters
Full	Complete copy of data	Baseline; expensive at scale
Incremental	Only what changed since last backup	Fast and small; restore needs base + chain
Differential	Only what changed since last full	Bigger than incremental, faster restore
PITR (point-in-time)	Continuous log archiving + base = restore to any moment	Strongest for DBs; biggest storage cost
Replication	Real-time stream to a standby	Lowest RPO; the standby is an always-on cost
Immutable	Write-once-read-many; can't be deleted/encrypted by attacker	Ransomware protection

Most production stacks combine these: snapshots for fast recovery, replication for low RPO, PITR for fine-grained recovery, immutable offsite for ransomware.
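The restore trade-off between incremental and differential backups can be sketched as a small function that works out which backups a restore needs. This is an illustrative model only: the `Backup` class, the integer day timeline, and `restore_chain` are hypothetical, and real tools track their chains internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    day: int   # day the backup was taken (hypothetical integer timeline)
    kind: str  # "full", "incremental", or "differential"


def restore_chain(backups, target_day):
    """Return the backups needed, in apply order, to restore to target_day."""
    usable = sorted((b for b in backups if b.day <= target_day), key=lambda b: b.day)
    fulls = [b for b in usable if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup on or before the target day")
    base = fulls[-1]
    later = [b for b in usable if b.day > base.day]
    chain = [base]
    diffs = [b for b in later if b.kind == "differential"]
    if diffs:
        # A differential holds everything since the last full,
        # so only the most recent one is needed.
        latest_diff = diffs[-1]
        chain.append(latest_diff)
        later = [b for b in later if b.day > latest_diff.day]
    # Each incremental holds changes since the previous backup,
    # so every remaining one must be applied in order.
    chain.extend(b for b in later if b.kind == "incremental")
    return chain


incremental_plan = [Backup(0, "full")] + [Backup(d, "incremental") for d in (1, 2, 3)]
differential_plan = [Backup(0, "full")] + [Backup(d, "differential") for d in (1, 2, 3)]

print(len(restore_chain(incremental_plan, 3)))   # 4: full + three incrementals
print(len(restore_chain(differential_plan, 3)))  # 2: full + latest differential
```

The longer the incremental chain, the more pieces must all be intact at restore time, which is one reason to schedule periodic fulls.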

The 3-2-1 Rule

Time-tested guidance: 3 copies of data, on 2 different media, with 1 offsite.

Modern interpretation:

3 copies: primary, secondary (replica or snapshot), tertiary (backup)
2 different media/services: e.g., live storage + S3 + cold archive
1 offsite (different region or different cloud)
Modern addition: at least 1 is immutable

Why offsite: a region outage takes your primary and your in-region backup. Why immutable: ransomware encrypts everything it can reach.
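As a rough sketch, the rule can be checked against an inventory of copies. The `Copy` class, the media labels, and the region names below are assumptions for illustration; "offsite" is simplified here to "in a different region than the primary".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    name: str
    medium: str  # e.g. "block-storage", "object-storage", "cold-archive"
    region: str
    immutable: bool = False


def check_3_2_1(copies, primary_region):
    """Return a dict mapping each rule to whether the inventory satisfies it."""
    return {
        "3 copies": len(copies) >= 3,
        "2 media": len({c.medium for c in copies}) >= 2,
        "1 offsite": any(c.region != primary_region for c in copies),
        "1 immutable": any(c.immutable for c in copies),
    }


# Hypothetical inventory: three copies, but all in one region and none immutable.
inventory = [
    Copy("primary-db", "block-storage", "us-east-1"),
    Copy("read-replica", "block-storage", "us-east-1"),
    Copy("nightly-dump", "object-storage", "us-east-1"),
]
report = check_3_2_1(inventory, primary_region="us-east-1")
failed = [rule for rule, ok in report.items() if not ok]
print(failed)  # ['1 offsite', '1 immutable']
```

A check like this could run in CI against whatever inventory source you maintain, so a missing offsite or immutable copy is caught before an incident rather than during one.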

Learning Path

1. Getting Started

Install Velero on kind; back up and restore a workload; PostgreSQL PITR; restic offsite backup; failover exercise

2. Patterns

Multi-region failover, snapshot lifecycle, immutable backups, application-consistent backups, runbooks, GameDay drills

3. Best Practices

Testing cadence, RTO/RPO measurement, encryption, compliance, common pitfalls, scaling, cost optimization

Multi-Region: The Spectrum

How "multi-region" you go is a slider, not a switch:

single region
    │
    ▼
single region + cross-region backup       (RPO: hours)
    │
    ▼
single region + cross-region warm standby  (RPO: minutes; RTO: minutes-hours)
    │
    ▼
active-passive multi-region                (RPO: seconds; RTO: minutes)
    │
    ▼
active-active multi-region                 (RPO: zero; RTO: ~zero)
    │
    ▼
multi-cloud + multi-region                 (sovereignty / vendor independence)

Each step is more complex and more expensive. The right answer is rarely the most rigorous — match to actual business impact and budget.

Backup Anti-Patterns

No restore test. The backup process runs nightly but has never been restored. You don't have backups, you have hope. Test restores at least monthly.
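A restore test is only useful if something checks the result. One possible approach, sketched below, is to record a checksum manifest at backup time and compare the restored tree against it. The function names and the temporary directories are hypothetical; a real drill would point `verify_restore` at wherever your tool restored the data.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Hash a file in 1 MiB chunks so large files do not load into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path relative to root to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_restore(manifest, restored_root):
    """Compare a restored tree to the manifest recorded at backup time."""
    restored = build_manifest(restored_root)
    shared = manifest.keys() & restored.keys()
    return {
        "missing": sorted(set(manifest) - set(restored)),
        "corrupt": sorted(k for k in shared if manifest[k] != restored[k]),
        "unexpected": sorted(set(restored) - set(manifest)),
    }


# Demo with throwaway directories standing in for the source and the restore.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    Path(src, "a.txt").write_text("hello")
    Path(src, "b.txt").write_text("world")
    manifest = build_manifest(src)
    Path(dst, "a.txt").write_text("hello")
    Path(dst, "b.txt").write_text("w0rld")  # simulate a corrupted restore
    report = verify_restore(manifest, dst)

print(report)  # {'missing': [], 'corrupt': ['b.txt'], 'unexpected': []}
```

File checksums only prove the bytes came back. For databases, also start the engine on the restored data and run a few application-level queries.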

Backups on the same volume as data. The fire that destroys your data destroys the backup too. Keep backups offsite: a different region, a different cloud, or a different physical location.

Backups without immutability. Ransomware encrypts the backup along with the source. Use S3 Object Lock, Glacier Vault Lock, or air-gapped storage.

Backing up the data but not the schema. Restoring rows is useless without the database schema, the app version that wrote them, and the config that made them mean what they meant.

Forgetting secrets. Restoring the database is great. Without the encryption key for the data at rest, the restored disk is useless. Back up the system, including all its dependencies.

Tying backup to a vendor that can lock you out. If your account gets suspended, your backups go with it. Keep at least one copy outside any single vendor's control.

The DR test you don't want to run is the one you most need to run. "What if the whole AWS region we're in dies right now?" Try it. Spin up cold infrastructure in a different region; restore from your offsite backups; time how long until services are usable. The first time is humbling. The second time is informative. By the third time, it is routine operational competence.
