Best Practices

Testing cadence, measuring RTO and RPO, encryption, compliance, common pitfalls, cost optimization

The operational realities of running DR that actually works.

Test, Test, Test

The non-negotiable. Categorize your tests:

Test	Cadence	What it proves
Verify backup completion	Daily, automated	Backup ran, completed, has reasonable size
Spot-check restore	Weekly	One random backup can be opened/parsed
Full restore drill	Monthly	End-to-end restore to a separate env; correctness verified
Component failover	Quarterly	DB failover, region failover, etc.
Full disaster GameDay	Quarterly	Multi-team coordinated drill
Tabletop exercise	Monthly	Talk through scenarios step by step without executing anything

The cheapest test (tabletop) is the most overlooked. You'd be surprised what you learn just by talking through "what if X dies right now."
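The daily "verify backup completion" row can be automated with a small check. The following is a minimal sketch, not a definitive implementation: the `BackupRecord` shape, the 26-hour freshness window, and the "at least half the median size" rule are all assumptions chosen for illustration, and the records would come from whatever your backup tool reports.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import List, Optional


@dataclass
class BackupRecord:
    """One backup run (hypothetical shape, not any specific tool's output)."""
    finished_at: Optional[datetime]  # None means the run never completed
    size_bytes: int


def check_latest_backup(
    history: List[BackupRecord],
    now: datetime,
    max_age: timedelta = timedelta(hours=26),  # assumed: daily schedule plus slack
    min_size_ratio: float = 0.5,               # assumed: "reasonable size" threshold
) -> List[str]:
    """Return problems with the most recent backup; an empty list means healthy."""
    if not history:
        return ["no backups recorded"]

    latest = history[-1]  # history is expected oldest-first
    problems = []

    if latest.finished_at is None:
        problems.append("latest backup did not complete")
    elif now - latest.finished_at > max_age:
        problems.append("latest backup is older than max_age")

    # Compare the latest size against the median of earlier completed runs.
    earlier = [r.size_bytes for r in history[:-1] if r.finished_at is not None]
    if earlier and latest.size_bytes < min_size_ratio * median(earlier):
        problems.append("latest backup is suspiciously small")

    return problems
```

A non-empty result is what should feed the alerting path; the size comparison is deliberately crude and only catches gross anomalies such as an empty dump.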

Measure What You Promise

Don't claim RTO/RPO numbers based on theory:

After every drill, record the actual recovery time
Track the trend over time
If actual diverges from target, either fix or revise the target (don't keep lying)
Communicate to stakeholders: "Tier-1 services have measured RTO of 12 minutes, RPO of 90 seconds, as of last drill on 2026-04-15."

This honesty changes conversations. Customers and auditors trust evidence-based numbers; aspirational numbers undermine trust.
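One way to keep the reported numbers tied to evidence is to derive them from a drill log rather than from the plan. The sketch below assumes a hypothetical `DrillResult` record and made-up field names; it reports the latest measured values, whether they meet the targets, and the worst recovery time seen so far.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, List


@dataclass
class DrillResult:
    """One drill outcome (hypothetical record shape)."""
    drill_date: date
    recovery_time: timedelta     # measured: failure declared -> service verified healthy
    data_loss_window: timedelta  # measured: how much recent data the restore was missing


def summarize(drills: List[DrillResult],
              rto_target: timedelta,
              rpo_target: timedelta) -> Dict[str, object]:
    """Report measured RTO/RPO from the most recent drill against the targets."""
    if not drills:
        raise ValueError("no drills recorded; there is nothing measured to report")
    latest = max(drills, key=lambda d: d.drill_date)
    return {
        "as_of": latest.drill_date.isoformat(),
        "measured_rto": latest.recovery_time,
        "measured_rpo": latest.data_loss_window,
        "rto_met": latest.recovery_time <= rto_target,
        "rpo_met": latest.data_loss_window <= rpo_target,
        # The worst case across all drills shows whether the latest result is typical.
        "worst_rto": max(d.recovery_time for d in drills),
    }
```

If `rto_met` or `rpo_met` comes back false, that is the signal to either fix the recovery path or revise the published target.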

Encryption

Backups are highly sensitive — they're all your data, in one place:

Encrypt in transit: TLS for backup transport.
Encrypt at rest: server-side (S3 SSE-KMS) and/or client-side (restic, age, custom).
Key management: KMS-backed; don't store encryption keys with the backup.
Test recovery of the encryption key: a backup you can't decrypt isn't a backup. Keep keys backed up separately.

Common failure: the encryption key lives in the same account as the encrypted backups, so one compromised or deleted account takes out both, leaving the backups either exposed or undecryptable. Use cross-account or cross-cloud key custody.
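A custody check like this can run against your backup inventory. The sketch below is illustrative only: `BackupLocation` is a hypothetical inventory record, and the only external fact it relies on is the general AWS ARN layout (`arn:partition:service:region:account-id:resource`), from which the key's owning account is read.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BackupLocation:
    """Hypothetical inventory entry for one backup destination."""
    name: str
    storage_account: str  # account ID that owns the backup storage
    kms_key_arn: str      # ARN of the KMS key that encrypts the backup


def kms_key_account(key_arn: str) -> str:
    """Extract the owning account ID from a KMS key ARN."""
    parts = key_arn.split(":")
    if len(parts) < 6 or parts[0] != "arn" or parts[2] != "kms":
        raise ValueError(f"not a KMS key ARN: {key_arn!r}")
    return parts[4]


def shared_custody(locations: List[BackupLocation]) -> List[str]:
    """Names of backups whose key lives in the same account as the data."""
    return [
        loc.name
        for loc in locations
        if kms_key_account(loc.kms_key_arn) == loc.storage_account
    ]
```

Any name returned by `shared_custody` marks a backup where a single account compromise reaches both the data and its key.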

Compliance

Frameworks have specific DR requirements:

Framework	Typical requirement
SOC 2	Documented BCP/DR plan; tested; evidence retained
ISO 27001	A.17 (2013 edition; A.5.29 and A.5.30 in the 2022 edition) - Business continuity controls
HIPAA	Contingency plan; encryption of backups
PCI DSS 4.0	Tested backup of CDE; offline backup for ransomware
GDPR / data sovereignty	Backups may need to stay in region; erasure (right-to-be-forgotten) requests must be re-applied after a restore

Document:

BCP narrative and runbooks
Last drill date + results
Backup retention matched to legal requirements
Encryption methodology
Access control list for backup systems

A clean DR practice is one of the easier compliance items to evidence.

Backup Hygiene

Tags / labels on every backup linking to the source service. Without metadata, "what is this?" can't be answered.
Naming conventions: include service, environment, date, type. Searchable.
Inventory: a list of all backup locations + retention + RTO/RPO commitments. Maintained.
Audit: who can access backups; who can delete; logged.
Cost monitoring: backups are a sneaky cost line; tag them; report monthly.
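A naming convention is easiest to enforce when names are built and parsed by code rather than typed by hand. The following sketch assumes one possible layout, `service_environment_YYYYMMDD_type`, with an illustrative set of backup types; both the layout and the type names are assumptions, not a standard.

```python
import re
from datetime import date, datetime
from typing import Dict

# Assumed layout: <service>_<environment>_<YYYYMMDD>_<type>
_NAME_RE = re.compile(
    r"^(?P<service>[a-z0-9-]+)_(?P<environment>[a-z0-9-]+)"
    r"_(?P<date>\d{8})_(?P<kind>full|incr|snapshot)$"
)


def backup_name(service: str, environment: str, day: date, kind: str) -> str:
    """Build a backup name and reject anything that would not parse back."""
    name = f"{service}_{environment}_{day:%Y%m%d}_{kind}"
    if not _NAME_RE.match(name):
        raise ValueError(f"invalid backup name components: {name!r}")
    return name


def parse_backup_name(name: str) -> Dict[str, object]:
    """Recover service, environment, date, and type from a backup name."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"name does not follow the convention: {name!r}")
    fields: Dict[str, object] = match.groupdict()
    # strptime also rejects impossible dates such as month 13.
    fields["date"] = datetime.strptime(match["date"], "%Y%m%d").date()
    return fields
```

For example, `backup_name("orders-db", "prod", date(2026, 4, 15), "full")` produces `orders-db_prod_20260415_full`, which sorts chronologically within a service and environment and answers "what is this?" without a lookup.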

Cost Optimization

Backups are often a top-10 cost line. Optimize:

Lifecycle tiers: hot for recent, warm (IA) for monthly, cold (Glacier Deep Archive) for yearly.
Deduplicate: tools such as restic and Borg deduplicate at the chunk level; large savings when backups share most of their data.
Compress: most backup tools compress; verify it's on.
Don't back up what's reproducible: build artifacts, scratch volumes, derived caches. Back up the inputs, regenerate the outputs.
Retention review: are you really keeping daily for a year? Step it down.
Review the cost of unused backups quarterly: surprising waste hides there.

A typical optimization: full daily backups of every PVC blindly → tiered backups of important PVCs + selective inclusion → 70% cost reduction with same RPO.
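Stepping retention down can be expressed as a simple selection rule: keep every recent daily backup, then only the newest backup per week, then only the newest per month. The sketch below shows one such scheme; the window lengths (14 days, 8 weeks, 12 months) are illustrative assumptions and should come from your own legal and RPO requirements.

```python
from datetime import date
from typing import Dict, Iterable, Set, Tuple


def backups_to_keep(
    backup_dates: Iterable[date],
    today: date,
    daily_days: int = 14,      # assumed: keep every backup this recent
    weekly_weeks: int = 8,     # assumed: then the newest backup per ISO week
    monthly_months: int = 12,  # assumed: then the newest backup per calendar month
) -> Set[date]:
    """Return the backup dates to retain; everything else is eligible for expiry."""
    keep: Set[date] = set()
    newest_per_week: Dict[Tuple[int, int], date] = {}
    newest_per_month: Dict[Tuple[int, int], date] = {}

    # Iterating oldest-first means later assignments overwrite with newer dates.
    for d in sorted(backup_dates):
        age_days = (today - d).days
        if age_days < 0:
            continue  # ignore dates in the future
        if age_days < daily_days:
            keep.add(d)
        if age_days < weekly_weeks * 7:
            iso = d.isocalendar()
            newest_per_week[(iso[0], iso[1])] = d
        if age_days < monthly_months * 31:  # 31 days per month is a rough upper bound
            newest_per_month[(d.year, d.month)] = d

    keep.update(newest_per_week.values())
    keep.update(newest_per_month.values())
    return keep
```

Applied to a year of daily backups, this keeps a few dozen restore points instead of 365, while the most recent two weeks remain fully covered.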

Failure Modes of DR Itself

The DR system can fail too:

Backup vendor goes bankrupt / cuts off your account. Don't rely on a single vendor.
Backup format changes between tool versions. Test restores after every backup tool upgrade.
The restoration machine is missing dependencies. Restoring to a "clean" environment surfaces this.
Backup volume runs out of space silently. Monitor.
Snapshot retention silently shortens. Audit lifecycle policies.
Restore requires manual steps that have been lost. Test the runbook.

The system requires monitoring like any other production system. Backup failures should page.
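As one example of monitoring the DR system itself, the "backup volume runs out of space silently" case can be caught with a free-space check. This is a sketch under assumptions: the 20% warn and 10% page thresholds are arbitrary, and the returned severity string would be handed to whatever alerting system you already use.

```python
import shutil
from typing import Optional


def free_space_alert(
    total_bytes: int,
    free_bytes: int,
    warn_below: float = 0.20,  # assumed threshold: warn under 20% free
    page_below: float = 0.10,  # assumed threshold: page under 10% free
) -> Optional[str]:
    """Map free space to an alert severity, or None when there is enough room."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    free_ratio = free_bytes / total_bytes
    if free_ratio < page_below:
        return "page"
    if free_ratio < warn_below:
        return "warn"
    return None


def check_backup_volume(path: str) -> Optional[str]:
    """Check the filesystem that holds `path` using the standard library."""
    usage = shutil.disk_usage(path)
    return free_space_alert(usage.total, usage.free)
```

Keeping the threshold logic separate from the filesystem call makes the rule testable without a nearly full disk.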

Communications During DR Event

Often missed: while engineering recovers, who's communicating?

Status page updated on an agreed cadence (at least every 30 minutes during an active incident)
Customer-facing email/Slack for affected accounts
Internal Slack #incidents for cross-team coordination
Executive briefing with specific timing — they want to know
Post-incident: written postmortem within 5 business days, customer-facing summary if customer-impacting

DR isn't only technical. Communications failures during real DR events make tech recoveries pointless.

Documentation Discipline

What auditors and new hires need to see:

DR plan document — what we recover, why, how, who
Runbooks per scenario — step-by-step
Service tier matrix — which services have which RTO/RPO
Drill log — what was drilled, when, what was found, fixes
Restore evidence — screenshots / logs of successful drills
Postmortems — real DR events analyzed

Keep these in the same wiki/repo as your other ops docs. Don't let DR docs be a separate forgotten corner.

Common Pitfalls

"We have backups." No, you don't. You have something you call backups. Until you've restored from them with verifiable correctness, you have nothing.

Backing up to the same provider only. Your single cloud account / AWS Organization is a SPOF. At least one offsite copy outside the primary blast radius.

Restoring to a brand new region. Underestimating the time. AMIs, network setup, certs, IAM bootstrap — a region you've never used takes hours to get to baseline. Have a "warm region" approach.

DR plan written, never reviewed. The plan from 3 years ago references EC2 instances that were replaced by Fargate. Quarterly review minimum.

Single person knows the plan. Bus factor of one. Spread responsibility; rotate drill leaders.

No comms plan during DR. Engineering recovers in 20 minutes; PR is destroyed because nobody told customers what was happening for 4 hours.

Confusing HA and DR. High availability handles small failures (one node, one AZ). DR handles big failures (region, ransomware, mass deletion). They're complementary, not substitutes.

Backup as the only DR strategy. Backup is one layer. Replication for low RPO, HA for resilience to small failures, backup for the worst case. Layered.

Not testing in actual production conditions. Restoring to a tiny test cluster works; restoring 5TB on a real-sized cluster might not. Scale-realistic tests.
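The scale problem in that last pitfall is easy to underestimate, and a back-of-the-envelope estimate helps before the drill. The sketch below computes restore time from data size and sustained throughput; the 200 MB/s figure in the example is an assumption for illustration, not a measurement, and real restores add time for decryption, index rebuilds, and verification.

```python
TB = 10**12  # decimal terabyte, in bytes
MB = 10**6   # decimal megabyte, in bytes


def estimated_restore_hours(
    size_bytes: int,
    throughput_bytes_per_s: float,
    overhead_factor: float = 1.0,  # multiply in extra time for non-transfer work
) -> float:
    """Lower-bound restore time in hours for a given size and sustained throughput."""
    if throughput_bytes_per_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = size_bytes / throughput_bytes_per_s * overhead_factor
    return seconds / 3600


# 5 TB at an assumed sustained 200 MB/s: 25,000 seconds of pure transfer.
print(round(estimated_restore_hours(5 * TB, 200 * MB), 1))  # 6.9
```

If that estimate already exceeds the RTO for the service, no amount of runbook polish will close the gap; the architecture (replication, warm standby) has to change.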

When to Outsource

DR is one of the more legitimate things to buy:

Managed backup services (AWS Backup, Azure Backup, Veeam, Druva) handle scheduling, retention, cross-region.
DR-as-a-service (DRaaS) offerings: VMware Cloud Disaster Recovery, Zerto, AWS Elastic Disaster Recovery. Replicate to a standby and orchestrate failover.
Pure SaaS for things like email, secrets management, identity — those vendors do better DR than you can.

Don't reinvent backup software unless you have a specific reason. Buy the commodity; spend energy on your DR runbooks, drills, and architecture.

What's Next

You have a DR practice. Connect it to:

Chaos Engineering — chaos is the practice; DR drills are the formal event
Monitoring — detect failures fast; trigger DR sooner
FinOps — backup costs are real; optimize the lifecycle
Object Storage — S3 + Object Lock is the backbone of modern DR
Secrets — Vault backups + key custody is part of full DR
