Mar 2025 · 10 min read

Disaster Recovery (DR) 🌪️🛡️

Driptanil Datta, Software Developer

While High Availability (HA) keeps the system running through minor failures (like a single server crash), Disaster Recovery (DR) is the plan for surviving catastrophic events that take down entire regions or data centers.

References & Disclaimer

This content is adapted from Mastering System Design from Basics to Cracking Interviews (Udemy). It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.

📉 Why Disaster Recovery Matters

Disasters aren't just natural events like floods or earthquakes; they include regional cloud outages, large-scale cyber attacks (ransomware), and massive human errors.

Cost of Downtime: For mission-critical systems (banking, healthcare, e-commerce), every minute of downtime can cost thousands or millions of dollars.
Data Protection: Ensures that data is not only available but remains persistent and uncorrupted.
Compliance: Many regulated industries require a documented and tested DR plan by law.

🏗️ DR for Mission-Critical Applications

Large-scale systems must meet strict RTO (Recovery Time Objective) and RPO (Recovery Point Objective) targets.

Multi-Level Redundancy: Redundancy must exist at every layer: Compute (servers), Storage (databases), and Network (DNS/Load Balancers).
Automated Failover: The switch to a recovery region should be automated to minimize downtime.
Tested Plans: A DR plan is only as good as its last successful drill.

🤝 Failover + Backup = True Resilience

It's common to confuse backup with failover, but true DR requires both.

Strategy	Focus	Protects Against
Backup	Data Recovery	Data corruption, accidental deletion, ransomware.
Failover	Service Continuity	Infrastructure failure, regional power outage, hardware crash.

The Synergy: You use failover to keep the service online during an outage, and you use backups to restore data if that outage involved data corruption.

🧪 Testing & Automation

"If you haven't tested it, you don't have it." — This is the golden rule of DR.

Regular Drills: Periodically simulate failures to ensure the team knows the protocol and the scripts work.
Chaos Engineering: Intentionally injecting failures into production to test the system's automated response.
Automatic Validation: After a restore, automated scripts should validate data integrity and service health before pointing traffic back.
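
As a minimal sketch of the automatic validation step, the snippet below compares checksums recorded at backup time against the restored data and only allows cut-over when the data is intact and the service reports healthy. The manifest format, the object names, and the `safe_to_cut_over` helper are assumptions made for illustration, not part of any particular backup tool.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of a blob."""
    return hashlib.sha256(data).hexdigest()


def validate_restore(expected: dict[str, str], restored: dict[str, bytes]) -> list[str]:
    """Return a list of problems; an empty list means the restore looks intact.

    expected: object name -> checksum recorded at backup time (hypothetical manifest)
    restored: object name -> bytes read back from the restored system
    """
    problems = []
    for name, checksum in expected.items():
        if name not in restored:
            problems.append(f"missing: {name}")
        elif sha256_of(restored[name]) != checksum:
            problems.append(f"corrupt: {name}")
    return problems


def safe_to_cut_over(problems: list[str], service_healthy: bool) -> bool:
    """Point traffic back only when data is intact AND the service is healthy."""
    return not problems and service_healthy


# Hypothetical example data: orders.db came back truncated.
manifest = {"users.db": sha256_of(b"alice,bob"), "orders.db": sha256_of(b"o-1,o-2")}
restored = {"users.db": b"alice,bob", "orders.db": b"o-1"}

problems = validate_restore(manifest, restored)
print(problems)                                        # ['corrupt: orders.db']
print(safe_to_cut_over(problems, service_healthy=True))  # False
```

In a real drill the health flag would come from the service's own health endpoint; here it is passed in directly to keep the sketch self-contained.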

🌍 Geo-Redundancy & Quorum-Based Design

Geo-Redundancy

Deploying your application across multiple physical locations (Regions) worldwide. If an entire region goes offline (e.g., AWS us-east-1), your traffic is routed to another region.
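
The routing decision can be sketched as a priority list of regions, where traffic goes to the first one that passes its health check. This is a simplified illustration: the `Region` type, the `pick_region` function, and the health flags are hypothetical, and real setups usually delegate this decision to DNS or a global load balancer.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    healthy: bool  # assumed to be filled in by an external health check


def pick_region(regions: list[Region]) -> str:
    """Return the first healthy region in priority order.

    Raises RuntimeError if every region is down.
    """
    for region in regions:
        if region.healthy:
            return region.name
    raise RuntimeError("no healthy region available")


# Example values only: the primary region is down, so traffic moves on.
regions = [
    Region("us-east-1", healthy=False),
    Region("us-west-2", healthy=True),
    Region("eu-west-1", healthy=True),
]
print(pick_region(regions))  # us-west-2
```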

Quorum-Based Design

In distributed systems, a Quorum is the minimum number of nodes that must agree on a distributed operation for it to be considered successful.

Formula: Usually ⌊N/2⌋ + 1 nodes, i.e. a strict majority of the N nodes (3 of 5, but also 3 of 4).
Why it matters: It ensures data consistency and prevents "split-brain" scenarios during a network partition. If you have 5 nodes across 3 regions, the system only commits data if at least 3 nodes confirm the write.
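
A small sketch of the majority rule, using integer division so the formula also works for even cluster sizes. The function names are illustrative. It also shows why split-brain cannot happen: two disjoint groups of nodes can never both reach a majority.

```python
def quorum_size(n: int) -> int:
    """Smallest number of nodes that forms a strict majority of n."""
    return n // 2 + 1


def write_committed(acks: int, n: int) -> bool:
    """A write commits only if a majority of the n nodes acknowledged it."""
    return acks >= quorum_size(n)


for n in (3, 4, 5):
    q = quorum_size(n)
    print(f"nodes={n} quorum={q} tolerated_failures={n - q}")
# nodes=3 quorum=2 tolerated_failures=1
# nodes=4 quorum=3 tolerated_failures=1
# nodes=5 quorum=3 tolerated_failures=2

# Network partition of a 5-node cluster into groups of 3 and 2:
print(write_committed(acks=3, n=5))  # True  -> majority side keeps accepting writes
print(write_committed(acks=2, n=5))  # False -> minority side must refuse writes
```

Note that going from 3 to 4 nodes does not increase the number of tolerated failures, which is why odd cluster sizes are the usual choice.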

🚩 Challenges in Geo-Distributed Systems

Latency: Synchronizing data across continents is limited by the speed of light, leading to higher write latency.
Consistency vs. Availability: Per the CAP theorem, a system that chooses consistency (quorum writes) loses availability once too few nodes are reachable to form a quorum.
Data Locality: Strict regulations (like GDPR) might prevent you from moving user data to certain geographical regions.
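
To put a number on the latency point, here is a rough lower-bound calculation. It assumes light travels through optical fiber at about 200,000 km/s (roughly two thirds of its speed in a vacuum) and uses a hypothetical 6,000 km cable path; real round trips are slower because of routing, queuing, and processing.

```python
FIBER_KM_PER_S = 200_000  # approximate speed of light in optical fiber


def min_round_trip_ms(distance_km: float) -> float:
    """Physical lower bound on round-trip time over a fiber path, in milliseconds."""
    return 2 * distance_km * 1000 / FIBER_KM_PER_S


# Hypothetical 6,000 km path between two regions.
print(min_round_trip_ms(6000))  # 60.0
```

A synchronous cross-region write has to wait for at least one such round trip, so under these assumptions no amount of tuning brings it below about 60 ms.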

Interview Questions - Disaster Recovery in Practice 💡

1. What’s the difference between failover and backup?

Answer:

Backup: Storing data copies to protect against loss or corruption. It restores data but doesn't ensure immediate availability.
Failover: Automatic switch to a redundant system during failure. It keeps services running with minimal downtime.
Key Difference: Backups are for data recovery; failover is for service continuity. Best practice is to use both.

2. How do you design DR for a high-traffic web app?

Answer:

Define RTO/RPO: Based on business requirements.
Multi-Region Deployment: Active-active or active-passive instances with geo-replication.
Automated Failover: Use DNS (AWS Route 53) or global load balancing (GCP Global LB) with health checks.
Data Protection: Frequent snapshots, versioning, and offsite storage.
Testing: Automate recovery and run regular DR drills.

3. What is RTO/RPO, and how do you optimize them?

Answer:

RTO (Recovery Time Objective): Max acceptable downtime. Optimize with automated failover and hot standbys.
RPO (Recovery Point Objective): Max acceptable data loss, measured as a window of time. Optimize with real-time replication (CDC, WAL shipping).
Trade-off: Shorter RTO/RPO leads to higher cost and complexity.
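
Both objectives can be checked mechanically against what a drill actually measured. The sketch below assumes periodic backups, where a failure just before the next backup loses one full interval of data; the function names and all figures are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """With periodic backups, the worst case loses one full interval of writes."""
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    return worst_case_data_loss(backup_interval) <= rpo


def meets_rto(measured_recovery: timedelta, rto: timedelta) -> bool:
    return measured_recovery <= rto


rpo = timedelta(minutes=15)
rto = timedelta(minutes=30)

print(meets_rpo(timedelta(hours=1), rpo))      # False: hourly snapshots miss a 15-minute RPO
print(meets_rpo(timedelta(seconds=5), rpo))    # True: a 5-second replication lag meets it
print(meets_rto(timedelta(minutes=12), rto))   # True: a 12-minute drill beats a 30-minute RTO
```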

4. What are challenges with geo-distributed DR systems?

Answer:

Data Consistency: Difficult across regions during network partitions.
Latency: Speed of light limits replication speed.
Split-Brain: Coordinating which region is "Primary" without a single point of failure.
Data Locality: Compliance (GDPR) restricting data movement across borders.
Orchestration: Complexities in syncing state and promoting standby databases.

5. Explain quorum-based design in distributed recovery.

Answer: Quorum ensures decisions (like leader election or write confirmation) are made by a majority consensus (e.g., 3 out of 5 nodes).

Consensus: Prevents split-brain and ensures data integrity during failover.
Used in: Paxos/Raft protocols, etcd, Zookeeper, and databases like CockroachDB.
Benefit: Keeps recovery safe and the system available through a regional failure, as long as a majority of nodes remains reachable.
