Best Practices
Testing cadence, measuring RTO and RPO, encryption, compliance, common pitfalls, cost optimization
The operational realities of running DR that actually works.
Test, Test, Test
The non-negotiable. Categorize your tests:
| Test | Cadence | What it proves |
|---|---|---|
| Verify backup completion | Daily, automated | Backup ran, completed, has reasonable size |
| Spot-check restore | Weekly | One random backup can be opened/parsed |
| Full restore drill | Monthly | End-to-end restore to a separate env; correctness verified |
| Component failover | Quarterly | DB failover, region failover, etc. |
| Full disaster GameDay | Quarterly | Multi-team coordinated drill |
| Tabletop exercise | Monthly | Walk through scenarios verbally without doing |
The cheapest test (tabletop) is the most overlooked. You'd be surprised what you learn just by talking through "what if X dies right now."
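The daily "verify backup completion" row can be automated with a small check. The following is a minimal sketch, not a definitive implementation: `BackupRecord`, `verify_latest_backup`, the 26-hour age limit, and the 50% size tolerance are all illustrative assumptions, and the records would come from whatever your backup tool reports.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class BackupRecord:
    """One backup run as reported by your backup tool (hypothetical shape)."""
    finished_at: Optional[datetime]  # None means the run never completed
    size_bytes: int


def verify_latest_backup(history, now, max_age=timedelta(hours=26), size_tolerance=0.5):
    """Return a list of problems with the newest backup; an empty list means healthy.

    history: list of BackupRecord, oldest first, newest last.
    size_tolerance: allowed fractional deviation from the median of earlier sizes.
    """
    if not history:
        return ["no backups found"]
    latest, earlier = history[-1], history[:-1]
    problems = []
    if latest.finished_at is None:
        problems.append("latest backup did not complete")
    elif now - latest.finished_at > max_age:
        problems.append("latest backup is older than max_age")
    # Compare against the median of earlier completed runs to catch
    # suspiciously small (or large) backups.
    sizes = [r.size_bytes for r in earlier if r.finished_at is not None]
    if sizes:
        typical = median(sizes)
        if abs(latest.size_bytes - typical) > size_tolerance * typical:
            problems.append("latest backup size deviates from recent median")
    return problems
```

Any non-empty result is a candidate for an alert. The size check uses a median so that one earlier outlier does not skew the baseline.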
Measure What You Promise
Don't claim RTO/RPO numbers based on theory:
- After every drill, record the actual recovery time
- Track the trend over time
- If actual diverges from target, either fix or revise the target (don't keep lying)
- Communicate to stakeholders: "Tier-1 services have measured RTO of 12 minutes, RPO of 90 seconds, as of last drill on 2026-04-15."
This honesty changes conversations. Customers and auditors trust evidence-based numbers; aspirational numbers undermine trust.
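Recording measured values can be as simple as capturing three timestamps per drill. This is a sketch under assumptions: the `DrillResult` fields and the `compare_to_target` helper are hypothetical names, and how you capture the timestamps depends on your tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    """Timestamps captured during one drill (hypothetical record shape)."""
    failure_declared: datetime        # when the simulated failure started
    service_restored: datetime        # when the service passed health checks again
    last_recoverable_write: datetime  # newest data still present after recovery

    @property
    def rto(self) -> timedelta:
        # Measured recovery time: failure to restored service.
        return self.service_restored - self.failure_declared

    @property
    def rpo(self) -> timedelta:
        # Measured data loss window: newest surviving write to failure.
        return self.failure_declared - self.last_recoverable_write


def compare_to_target(drill, rto_target, rpo_target):
    """Return measured values and whether each target was met."""
    return {
        "rto": drill.rto,
        "rto_met": drill.rto <= rto_target,
        "rpo": drill.rpo,
        "rpo_met": drill.rpo <= rpo_target,
    }
```

Storing one `DrillResult` per drill gives you the trend line, and a failed `rto_met` or `rpo_met` is the signal to either fix the system or revise the target.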
Encryption
Backups are highly sensitive — they're all your data, in one place:
- Encrypt in transit: TLS for backup transport.
- Encrypt at rest: server-side (S3 SSE-KMS) and/or client-side (restic, age, custom).
- Key management: KMS-backed; don't store encryption keys with the backup.
- Test recovery of the encryption key: a backup you can't decrypt isn't a backup. Keep keys backed up separately.
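Common failure: the encryption key lives in the same account as the backups. If that account is compromised or deleted, either the attacker holds both the data and the key, or the key is destroyed and the backups become unusable. Use cross-account or cross-cloud key custody.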
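If you keep a backup inventory, co-located keys can be flagged mechanically. A minimal sketch, assuming a hypothetical inventory of dicts with `name`, `backup_account`, and `key_account` fields:

```python
def find_colocated_keys(inventory):
    """Return names of backups whose encryption key lives in the same account.

    inventory: list of dicts with the hypothetical fields
    'name', 'backup_account', and 'key_account'.
    """
    return [
        item["name"]
        for item in inventory
        if item["backup_account"] == item["key_account"]
    ]
```

An empty result means every backup in the inventory has its key held somewhere else; it says nothing about whether that key can actually be recovered, which still needs a restore test.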
Compliance
Frameworks have specific DR requirements:
| Framework | Typical requirement |
|---|---|
| SOC 2 | Documented BCP/DR plan; tested; evidence retained |
| ISO 27001 | Annex A.17 in the 2013 edition (A.5.29 and A.5.30 in the 2022 edition): business continuity controls |
| HIPAA | Contingency plan; encryption of backups |
| PCI DSS 4.0 | Tested backup of CDE; offline backup for ransomware |
| GDPR / data sovereignty | Backups may need to stay in region; erasure (right-to-be-forgotten) requests must be re-applied after a restore |
Document:
- BCP narrative and runbooks
- Last drill date + results
- Backup retention matched to legal requirements
- Encryption methodology
- Access control list for backup systems
A clean DR practice is one of the easier compliance items to evidence.
Backup Hygiene
- Tags / labels on every backup linking to the source service. Without metadata, "what is this?" can't be answered.
- Naming conventions: include service, environment, date, type. Searchable.
- Inventory: a list of all backup locations + retention + RTO/RPO commitments. Maintained.
- Audit: who can access backups; who can delete; logged.
- Cost monitoring: backups are a sneaky cost line; tag them; report monthly.
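A naming convention is easiest to keep when names are built and parsed by code rather than by hand. The pattern below (`<service>_<environment>_<type>_<YYYY-MM-DD>`) and the allowed type values are assumptions chosen for illustration, not a standard:

```python
import re
from datetime import date

# Hypothetical convention: <service>_<environment>_<type>_<YYYY-MM-DD>
NAME_RE = re.compile(
    r"^(?P<service>[a-z0-9-]+)_(?P<environment>[a-z0-9-]+)_"
    r"(?P<type>full|incremental|snapshot)_(?P<date>\d{4}-\d{2}-\d{2})$"
)


def build_backup_name(service, environment, backup_type, day):
    """Build a backup name and reject inputs that break the convention."""
    name = f"{service}_{environment}_{backup_type}_{day.isoformat()}"
    if NAME_RE.match(name) is None:
        raise ValueError(f"name does not follow the convention: {name}")
    return name


def parse_backup_name(name):
    """Return the fields encoded in a backup name, or None if it does not conform."""
    match = NAME_RE.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    try:
        fields["date"] = date.fromisoformat(fields["date"])
    except ValueError:
        return None  # looks like a date but is not a valid calendar day
    return fields
```

Running `parse_backup_name` over a bucket listing is a quick way to find the objects nobody can identify: anything that returns `None` has no recoverable metadata in its name.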
Cost Optimization
Backups are often a top-10 cost line. Optimize:
- Lifecycle tiers: hot for recent, warm (IA) for monthly, cold (Glacier Deep Archive) for yearly.
- Deduplicate: tools such as restic and Borg deduplicate at the chunk level; savings are large when backups contain similar data.
- Compress: most backup tools compress; verify it's on.
- Don't back up what's reproducible: build artifacts, scratch volumes, derived caches. Back up the inputs, regenerate the outputs.
- Retention review: are you really keeping daily for a year? Step it down.
- Review the cost of unused backups quarterly: surprising waste hides there.
A typical optimization: replace blind daily full backups of every PVC (Kubernetes PersistentVolumeClaim) with tiered backups of the important PVCs plus selective inclusion. That can cut cost by around 70% with the same RPO.
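Stepping retention down usually means keeping every recent backup, then one per week, then one per month. The sketch below illustrates that idea; the function name and the default windows (14 days, 8 weeks, 12 months) are assumptions, not recommendations, and tools such as restic and Borg ship their own retention options that you would normally use instead.

```python
from datetime import date, timedelta


def backups_to_keep(backup_dates, today, daily_days=14, weekly_weeks=8, monthly_months=12):
    """Return the set of backup dates to retain under a stepped-down policy.

    Keeps every backup from the last `daily_days` days, the newest backup of
    each ISO week within `weekly_weeks` weeks, and the newest backup of each
    calendar month within roughly `monthly_months` months.
    """
    keep = set()
    newest_per_week = {}
    newest_per_month = {}
    # Iterate oldest to newest so later dates overwrite earlier ones per bucket.
    for day in sorted(backup_dates):
        age = (today - day).days
        if age < 0:
            continue  # ignore dates in the future
        if age < daily_days:
            keep.add(day)
        if age < weekly_weeks * 7:
            newest_per_week[tuple(day.isocalendar())[:2]] = day
        if age < monthly_months * 31:
            newest_per_month[(day.year, day.month)] = day
    keep.update(newest_per_week.values())
    keep.update(newest_per_month.values())
    return keep
```

Everything not in the returned set is a deletion candidate. Run it in report-only mode first and compare the result against legal retention requirements before deleting anything.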
Failure Modes of DR Itself
The DR system can fail too:
- Backup vendor goes bankrupt / cuts off your account. Don't rely on a single vendor.
- Backup format changes between tool versions. Test restores after every backup tool upgrade.
- The restoration machine is missing dependencies. Restoring to a "clean" environment surfaces this.
- Backup volume runs out of space silently. Monitor.
- Snapshot retention silently shortens. Audit lifecycle policies.
- Restore requires manual steps that have been lost. Test the runbook.
The DR system needs monitoring like any other production system. Backup failures should page.
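Two of the silent failures above, a full backup volume and shortened retention, are cheap to check. A minimal sketch: `shutil.disk_usage` is standard library, while the function names, the 85% threshold, and the dict-based policy format are assumptions for illustration.

```python
import shutil


def volume_nearly_full(path, max_used_fraction=0.85):
    """True if the filesystem holding `path` is fuller than the threshold."""
    usage = shutil.disk_usage(path)
    return usage.total > 0 and usage.used / usage.total > max_used_fraction


def audit_retention(expected_days, actual_days):
    """Compare expected retention (in days) per policy against actual values.

    Both arguments are dicts keyed by a hypothetical policy name. Returns a
    list of findings; an empty list means no policy is missing or shortened.
    """
    findings = []
    for name, expected in sorted(expected_days.items()):
        actual = actual_days.get(name)
        if actual is None:
            findings.append(f"{name}: policy missing")
        elif actual < expected:
            findings.append(
                f"{name}: retention shortened from {expected} to {actual} days"
            )
    return findings
```

The expected values belong in version control next to your infrastructure code, so a shortened policy shows up as drift rather than as a surprise during a restore.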
Communications During DR Event
Often missed: while engineering recovers, who's communicating?
- Status page updated on an agreed cadence (at least every 30 minutes during an active incident)
- Customer-facing email/Slack for affected accounts
- Internal Slack #incidents for cross-team coordination
- Executive briefing with specific timing — they want to know
- Post-incident: written postmortem within 5 business days, customer-facing summary if customer-impacting
DR isn't only technical. Communications failures during real DR events make tech recoveries pointless.
Documentation Discipline
What auditors and new hires need to see:
- DR plan document — what we recover, why, how, who
- Runbooks per scenario — step-by-step
- Service tier matrix — which services have which RTO/RPO
- Drill log — what was drilled, when, what was found, fixes
- Restore evidence — screenshots / logs of successful drills
- Postmortems — real DR events analyzed
Keep these in the same wiki/repo as your other ops docs. Don't let DR docs be a separate forgotten corner.
Common Pitfalls
"We have backups." No, you don't. You have something you call backups. Until you've restored from them with verifiable correctness, you have nothing.
Backing up to the same provider only. Your single cloud account / AWS Organization is a single point of failure. Keep at least one offsite copy outside the primary blast radius.
Restoring to a brand new region. Teams underestimate how long this takes. AMIs, network setup, certs, IAM bootstrap: a region you've never used takes hours to reach baseline. Keep a "warm region" ready instead.
DR plan written, never reviewed. The plan from 3 years ago references EC2 instances that were replaced by Fargate. Quarterly review minimum.
Single person knows the plan. Bus factor of one. Spread responsibility; rotate drill leaders.
No comms plan during DR. Engineering recovers in 20 minutes; PR is destroyed because nobody told customers what was happening for 4 hours.
Confusing HA and DR. High availability handles small failures (one node, one AZ). DR handles big failures (region, ransomware, mass deletion). They're complementary, not substitutes.
Backup as the only DR strategy. Backup is one layer. Replication for low RPO, HA for resilience to small failures, backup for the worst case. Layered.
Not testing in actual production conditions. Restoring to a tiny test cluster works; restoring 5 TB on a real-sized cluster might not. Run scale-realistic tests.
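A quick estimate shows why scale matters for the last pitfall. This sketch only models data transfer plus a fixed overhead; the function name and the 200 MB/s throughput used in the note below are assumptions, and a real restore also pays for index rebuilds, log replay, and verification.

```python
def estimated_restore_hours(data_bytes, throughput_bytes_per_s, fixed_overhead_s=0):
    """Rough lower bound on restore time, in hours.

    throughput_bytes_per_s: sustained restore throughput you have measured.
    fixed_overhead_s: provisioning and setup time that does not scale with size.
    """
    if throughput_bytes_per_s <= 0:
        raise ValueError("throughput must be positive")
    return (data_bytes / throughput_bytes_per_s + fixed_overhead_s) / 3600
```

With 5 TB (5e12 bytes) and an assumed sustained 200 MB/s (2e8 bytes per second), `estimated_restore_hours(5e12, 2e8)` comes to about 6.9 hours before any overhead. Replace the assumed throughput with the figure from your own drills; that measurement is the point of a scale-realistic test.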
When to Outsource
DR is one of the more legitimate things to buy:
- Managed backup services (AWS Backup, Azure Backup, Veeam, Druva) handle scheduling, retention, cross-region.
- DR-as-a-service (DRaaS) offerings: VMware Cloud Disaster Recovery, Zerto, AWS Elastic Disaster Recovery. They replicate to a standby and orchestrate failover.
- Pure SaaS for things like email, secrets management, identity: those vendors usually run better DR than you can build yourself.
Don't reinvent backup software unless you have a specific reason. Buy the commodity; spend energy on your DR runbooks, drills, and architecture.
Checklist
DR production readiness:
- Service tier matrix defined with RTO/RPO per tier
- Every tier-1 service has a runbook for primary failure modes
- At least one offsite backup copy (different region or cloud)
- At least one immutable backup (ransomware protection)
- Monthly automated restore verification (one random backup tested)
- Quarterly drill of full failover for tier-1 services
- Backups encrypted in transit and at rest; key custody separate
- Backup completion + size monitored; failures page
- Lifecycle policies in place (cold tier for old backups)
- Documented BCP/DR plan reviewed within last 12 months
- Communications plan: status page, customer comms, exec briefing
- Bus factor > 1 for runbook execution
- Cost of backups tracked monthly; lifecycle reviewed annually
- Configuration / secrets / IaC state backed up alongside data
- Drill results logged; gaps tracked as tickets
What's Next
You have a DR practice. Connect it to:
- Chaos Engineering — chaos is the practice; DR drills are the formal event
- Monitoring — detect failures fast; trigger DR sooner
- FinOps — backup costs are real; optimize the lifecycle
- Object Storage — S3 + Object Lock is the backbone of modern DR
- Secrets — Vault backups + key custody is part of full DR