

Mar 2025 · 10 min read


Backup & Recovery 💾🔄

Driptanil Datta, Software Developer

No matter how highly available or fault-tolerant a system is, data loss can still occur due to extreme circumstances. Backup & Recovery is the ultimate safety net for ensuring business continuity.

🌍 References & Disclaimer

This content is adapted from Mastering System Design from Basics to Cracking Interviews (Udemy). It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.

🛡️ What is Backup & Recovery?

Backup: The process of creating copies of data so that these additional copies can be used to restore the original after a data loss event.
Recovery: The process of restoring data from backups after a failure, corruption, or cyberattack.

[!IMPORTANT] Replication is not backup! If you accidentally run DROP TABLE users; on the primary database, replication propagates that statement and the table is dropped on every secondary node as well. You need an isolated backup to recover from human error or ransomware.
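The difference can be illustrated with a toy simulation. This is only a sketch: the `Database` class, the `replicate` helper, and the tuple-based statements are hypothetical stand-ins for a real database and its replication stream, and the "backup" is simply a deep copy taken before the mistake.

```python
import copy


class Database:
    """Toy in-memory database: table name -> list of rows (illustrative only)."""

    def __init__(self):
        self.tables = {}

    def apply(self, statement):
        # A statement is ("insert", table, row) or ("drop", table).
        op = statement[0]
        if op == "insert":
            _, table, row = statement
            self.tables.setdefault(table, []).append(row)
        elif op == "drop":
            self.tables.pop(statement[1], None)


def replicate(statement, primary, replicas):
    """Apply a statement to the primary and forward it to every replica."""
    primary.apply(statement)
    for replica in replicas:
        replica.apply(statement)


primary, replica = Database(), Database()
replicate(("insert", "users", {"id": 1, "name": "Ada"}), primary, [replica])

# An isolated backup: a point-in-time copy that later writes cannot touch.
backup = copy.deepcopy(primary.tables)

# Human error: the DROP is replicated just like any other write.
replicate(("drop", "users"), primary, [replica])
lost_on_primary = "users" not in primary.tables   # True
lost_on_replica = "users" not in replica.tables   # True

# Recovery is only possible from the isolated copy.
primary.tables = copy.deepcopy(backup)
print(primary.tables["users"])  # [{'id': 1, 'name': 'Ada'}]
```

The replica faithfully mirrors the mistake, while the copy taken earlier is unaffected because nothing writes to it after it is created.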

Why is it Important?

Hardware or software failure.
Human error (accidental deletion).
Cyber attacks (Ransomware).
Natural disasters.
Compliance & data retention requirements.

🗃️ Types of Backup

Choosing how to copy the data involves trade-offs between storage costs and recovery speed.

1. Full Backup

Copies the entire dataset, regardless of any previous backups.

Pros: Simplest and fastest to restore.
Cons: Highest storage cost and takes the longest time to run.

2. Incremental Backup

Backs up only the data that has changed since the last backup (of any type).

Pros: Fastest to run, smallest storage size.
Cons: Slowest to restore (you must restore the last Full Backup, then every subsequent incremental backup in order).

3. Differential Backup

Backs up all data that has changed since the last Full backup.

Pros: Faster to restore than Incremental (only need the Full Backup + the latest Differential).
Cons: Takes longer to run and uses more storage than Incremental.
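The restore-time difference between the three types can be sketched in a few lines of Python. The `Backup` record and the `restore_chain` function are illustrative names, not part of any backup tool, and the day numbers are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    day: int    # day on which the backup was taken
    kind: str   # "full", "incremental", or "differential"


def restore_chain(backups, target_day):
    """Return the backups to apply, in order, to restore the state at target_day."""
    usable = sorted((b for b in backups if b.day <= target_day), key=lambda b: b.day)
    fulls = [b for b in usable if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup on or before the target day")
    base = fulls[-1]
    later = [b for b in usable if b.day > base.day]

    chain = [base]
    start_day = base.day
    # A differential already contains every change since the full backup,
    # so only the most recent one is needed.
    diffs = [b for b in later if b.kind == "differential"]
    if diffs:
        chain.append(diffs[-1])
        start_day = diffs[-1].day
    # Incrementals must all be replayed, in order, from that point on.
    chain.extend(b for b in later if b.kind == "incremental" and b.day > start_day)
    return chain


full = Backup(0, "full")
with_incrementals = [full] + [Backup(d, "incremental") for d in range(1, 6)]
with_differentials = [full] + [Backup(d, "differential") for d in range(1, 6)]

print(len(restore_chain(with_incrementals, 5)))   # 6
print(len(restore_chain(with_differentials, 5)))  # 2
```

With a full backup on day 0 and one backup per day afterwards, restoring day 5 takes six restore steps with incrementals but only two with differentials, which is the trade-off described above.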

📍 Recovery Types: Cold, Warm, Hot

When disaster strikes a data center, how quickly can you bring up a new one?

Cold Recovery (Cold Site):
- Backups are stored offline or in deep storage. No active hardware is ready. You must provision hardware, install OS, and download data.
- Metrics: High downtime, lowest cost.
Warm Recovery (Warm Site):
- Infrastructure is pre-provisioned, but data might be slightly out of date (needs a final sync), and the site is not currently taking traffic.
- Metrics: Moderate downtime, moderate cost.
Hot Recovery (Hot Site):
- A fully redundant data center kept in sync in real time, typically taking traffic alongside the primary site (Active-Active) or ready to take over immediately.
- Metrics: Near-zero downtime, highest cost.

⏱️ Understanding RTO & RPO

These two metrics dictate your Backup & Disaster Recovery (DR) strategy and budget.

RTO (Recovery Time Objective):
- Question: "How long can we afford to be down?"
- It is the target maximum time within which the system must be restored after a disaster.
RPO (Recovery Point Objective):
- Question: "How much data can we afford to lose?"
- It is the maximum acceptable amount of data loss, measured in time (e.g., "We back up every hour, so we can lose at most one hour of data: our RPO is 1 hour").

[!TIP] Shorter RTO and RPO targets mean sharply higher cost and system complexity.
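As a rough worked example, both metrics reduce to simple time arithmetic. The function names, timestamps, and recovery-step durations below are hypothetical figures chosen for illustration, not measurements from a real system.

```python
from datetime import datetime, timedelta


def data_lost(last_backup_at: datetime, failure_at: datetime) -> timedelta:
    """Everything written after the last good backup is lost."""
    return failure_at - last_backup_at


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Worst case: the failure hits just before the next backup runs."""
    return backup_interval <= rpo


def meets_rto(recovery_steps: dict, rto: timedelta) -> bool:
    """Total recovery time is the sum of every step needed to serve traffic again."""
    return sum(recovery_steps.values(), timedelta()) <= rto


loss = data_lost(datetime(2025, 3, 1, 12, 0), datetime(2025, 3, 1, 12, 45))
print(loss)  # 0:45:00

hourly = timedelta(hours=1)
print(meets_rpo(hourly, rpo=timedelta(hours=1)))     # True
print(meets_rpo(hourly, rpo=timedelta(minutes=15)))  # False

steps = {  # assumed durations for a warm-site recovery
    "detect": timedelta(minutes=10),
    "provision": timedelta(minutes=30),
    "restore": timedelta(minutes=90),
}
print(meets_rto(steps, rto=timedelta(hours=2)))  # False
```

Hourly backups satisfy a 1-hour RPO but not a 15-minute one, and a recovery that takes 130 minutes end to end misses a 2-hour RTO even though each step looks reasonable on its own.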

🏆 Best Practices

Automate: Backups should be automated, and restore procedures should be regularly tested. A backup hasn't actually succeeded until you successfully restore from it.
Encrypt: Encrypt backups both in transit and at rest to protect against theft.
Apply the 3-2-1 Rule:
- Keep 3 copies of your data (1 primary, 2 backups).
- Store them on 2 different types of media (e.g., Disk and Tape/Cloud).
- Keep 1 copy offsite (e.g., in a different AWS region).
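The 3-2-1 rule is easy to check mechanically. The sketch below assumes a hypothetical inventory of copies; the `DataCopy` record, the media labels, and the location names are placeholders rather than real resources.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCopy:
    name: str
    media: str      # e.g. "disk", "tape", "object-storage"
    location: str   # e.g. "dc-primary", "aws-eu-west-1"


def check_3_2_1(copies, primary_location):
    """Report which parts of the 3-2-1 rule the given set of copies satisfies."""
    return {
        "three_copies": len(copies) >= 3,
        "two_media": len({c.media for c in copies}) >= 2,
        "one_offsite": any(c.location != primary_location for c in copies),
    }


copies = [
    DataCopy("primary", "disk", "dc-primary"),
    DataCopy("nightly", "disk", "dc-primary"),
    DataCopy("archive", "object-storage", "aws-eu-west-1"),
]
print(check_3_2_1(copies, "dc-primary"))
# {'three_copies': True, 'two_media': True, 'one_offsite': True}

print(check_3_2_1(copies[:2], "dc-primary"))
# {'three_copies': False, 'two_media': False, 'one_offsite': False}
```

Dropping the offsite archive fails all three checks at once, which shows how much of the rule a single well-placed copy can carry.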

Interview Questions - Backup & Recovery Strategies 💡

1. What is the difference between full, incremental, and differential backups? When would you use each?

Answer:

Full Backup: Backs up all data, regardless of changes. Pros: Easy to restore. Cons: Time-consuming and requires high storage. Use case: Weekly full backups as a foundation.
Incremental Backup: Backs up only the data that changed since the last backup (incremental or full). Pros: Fast and storage-efficient. Cons: Recovery requires the last full backup plus every incremental taken since. Use case: Daily backups after a full weekly backup.
Differential Backup: Backs up changes since the last full backup. Pros: Faster restore than incremental. Cons: Larger than incremental as the week progresses. Use case: Mid-week restore-friendly backups.

2. How do you define and balance RTO and RPO in a large-scale distributed system?

Answer:

RTO (Recovery Time Objective): Max acceptable time to restore service after failure.
RPO (Recovery Point Objective): Max acceptable data loss measured in time (e.g., last 15 mins).
Balancing: Low RTO/RPO requires hot backups, replication, and higher cost. Analyze SLA commitments and criticality of components. Use a tiered strategy:
- Mission-critical: low RTO/RPO (e.g., failover DB, real-time replication).
- Less-critical: higher RTO/RPO (e.g., batch systems, cold backups).

3. Explain cold, warm, and hot recovery strategies with examples.

Answer:

Cold Recovery: No pre-configured resources; systems must be rebuilt from backups. 🧊 Example: Backup stored on tape or cold cloud storage. Slowest, cheapest.
Warm Recovery: Some components (e.g., data, configs) pre-provisioned, but app not running. 🔥 Example: Standby server with recent backups, manual DB restore. Moderate speed & cost.
Hot Recovery: Fully functional redundant system with real-time syncing. ⚡ Example: Active-active databases, failover-ready load-balanced clusters. Fastest, most expensive.

4. How would you implement a backup strategy for a microservices-based application hosted in the cloud?

Answer:

Identify critical services and data stores (DBs, object storage, configs).
Use cloud-native backup tools (e.g., AWS Backup, GCP snapshots).
Apply different backup frequencies per service: weekly full backups with daily incrementals for most data stores, plus real-time replication (in addition to backups, not instead of them) for critical DBs.
Automate with IaC or CI/CD pipelines (e.g., Terraform, GitHub Actions).
Encrypt backups and store in multi-region S3/Blob buckets.
Regularly test restoration (chaos engineering / DR drills).

5. What trade-offs do you consider when designing a backup and recovery system for a high-availability service?

Answer:

Cost vs. recovery speed: Hot backups increase cost.
Complexity vs. maintainability: Incremental backups are efficient but harder to restore.
Storage vs. retention: Long retention increases storage needs.
Compliance vs. agility: Regulatory backups might need immutability and longer archives.
RTO/RPO: Must match business tolerance.
👉 Use a multi-tiered strategy to align criticality with the backup effort.

6. How does cloud storage simplify or complicate backup and recovery strategies?

Answer:

Simplifies:
- Elastic, durable storage (e.g., S3, Azure Blob).
- Built-in snapshotting (e.g., EBS, RDS).
- Lifecycle management (automatic tiering, archival).
- Geo-redundancy support.
Complicates:
- Vendor lock-in risks.
- Costs can spiral with high frequency or long retention.
- Need for access management & encryption.
- Cross-region data compliance challenges.

7. What are some best practices for backup automation and testing in production systems?

Answer:

Automate backups with scheduled jobs or cloud-native tools.
Use infrastructure as code to provision backup policies.
Test restore procedures regularly (runbook + chaos testing).
Monitor backup success/failures via alerts.
Encrypt backups and verify data integrity.
Follow 3-2-1 Rule: 3 copies, 2 different media, 1 offsite (e.g., cloud or DR site).
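Verifying integrity usually means recording a checksum when the backup is written and recomputing it before a restore. The sketch below uses Python's standard `hashlib`; the file name and contents are made up, and a real pipeline would store the recorded digest separately from the backup itself.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(path: Path, expected_digest: str) -> bool:
    """Return True only if the file still matches the digest recorded at backup time."""
    return sha256_of(path) == expected_digest


with tempfile.TemporaryDirectory() as tmp:
    backup_file = Path(tmp) / "users.dump"   # hypothetical backup artifact
    backup_file.write_bytes(b"id,name\n1,Ada\n")
    recorded = sha256_of(backup_file)        # stored alongside the backup metadata

    intact = verify_backup(backup_file, recorded)

    # Simulate silent corruption of the stored backup.
    backup_file.write_bytes(b"id,name\n1,Adx\n")
    corruption_detected = not verify_backup(backup_file, recorded)

print(intact, corruption_detected)  # True True
```

A checksum only proves the bytes are unchanged; it does not replace a full restore test, which is the only way to confirm the backup is actually usable.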

8. How would you handle backup for a database with terabytes of data and minimal allowed downtime?

Answer:

Use point-in-time recovery (PITR) if supported (e.g., MySQL binlogs, PostgreSQL WAL).
Leverage incremental or log-based backup instead of full backups daily.
Take backups from a read replica so the backup workload does not load the primary.
Take online snapshots (e.g., EBS or managed RDS snapshots).
Compress, encrypt, and store to cold + warm tiers.
Use parallelism and dedicated backup windows to minimize performance impact.
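Point-in-time recovery combines a base snapshot with a change log that is replayed up to a chosen moment. The following is a conceptual sketch only: `LogRecord` and `restore_to_point` are hypothetical names, and real engines such as MySQL (binlogs) and PostgreSQL (WAL) use their own log formats and tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogRecord:
    timestamp: int          # seconds on an arbitrary clock
    key: str
    value: Optional[int]    # None represents a deletion


def restore_to_point(snapshot, snapshot_ts, log, target_ts):
    """Rebuild state as of target_ts from a base snapshot plus later log records."""
    if target_ts < snapshot_ts:
        raise ValueError("target is older than the snapshot; use an earlier snapshot")
    state = dict(snapshot)  # never mutate the stored snapshot
    for record in sorted(log, key=lambda r: r.timestamp):
        if snapshot_ts < record.timestamp <= target_ts:
            if record.value is None:
                state.pop(record.key, None)
            else:
                state[record.key] = record.value
    return state


snapshot = {"users": 2}                # base snapshot taken at t=100
log = [
    LogRecord(110, "orders", 5),
    LogRecord(120, "users", 3),
    LogRecord(130, "users", None),     # the accidental deletion
]

# Recover to just before the mistake at t=130.
print(restore_to_point(snapshot, 100, log, target_ts=125))
# {'users': 3, 'orders': 5}
```

Because only the log has to be captured continuously, the large base snapshot can be taken rarely, which is why this approach suits multi-terabyte databases with tight downtime limits.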