++++

Engineering

Mar 2025×10 min read

As applications grow from thousands to millions of users, a single database server becomes a bottleneck. To handle massive traffic and ensure high availability, we must employ advanced scaling and distribution techniques.

Database Scaling, Replication & Sharding 🚀

Driptanil DattaSoftware Developer

🌍

References & Disclaimer

This content is adapted from Mastering System Design from Basics to Cracking Interviews (Udemy). It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.

1. Database Scaling: Vertical vs. Horizontal

Before distributing data, we must decide how to add resources.

Vertical Scaling (Scale-Up): Increasing the capacity of a single server (More CPU, RAM, or faster NVMe storage).
- Pros: Simple, maintained ACID compliance, no application changes.
- Cons: Hard hardware limits, expensive, single point of failure.
Horizontal Scaling (Scale-Out): Adding more nodes (servers) to the database cluster.
- Pros: Virtually unlimited scale, fault tolerance, cost-effective (standard hardware).
- Cons: Complex to manage, requires sharding, challenges with consistency.

2. Database Replication

Replication is the process of keeping multiple copies of the same data on different nodes to improve Availability and Read Performance.

Leader-Follower Replication

Leader: Handles all Writes. Updates followers asynchronously or synchronously.
Followers: Provide Read-Only access. If the leader fails, a follower can be promoted to leader.

Read Replicas

Often used just for load balancing reads. Read replicas do not participate in the write process and are typically used to scale read-heavy workloads (like BI or reporting tools).

3. Database Sharding (Partitioning)

Sharding is the process of splitting data across multiple databases to scale horizontally.

Horizontal Sharding: Rows are distributed across shards (most common).
Vertical Sharding: Tables or columns are split across shards based on function or access patterns.

Sharding Strategies

Range-Based Sharding: Data is split based on ranges of a key (e.g., User IDs 1-1000 go to Shard A).
- Risk: Can lead to "Hot spots" if certain ranges are more active.
Hash-Based Sharding: A hash function is applied to the shard key to determine the server.
- Balance: Distributes load evenly but makes range queries difficult.
Geo-Based Sharding: Data is sharded based on the user's geographic region. Useful for reducing latency in global systems.

4. Consistent Hashing

Consistent Hashing is the "secret sauce" behind highly scalable distributed systems like Amazon's Dynamo or Apache Cassandra. It minimizes the amount of data that needs to be moved when nodes are added or removed from a cluster.

Instead of a simple modulo operation, keys and nodes are mapped onto a logical Ring. This ensures that adding or removing a node only affects a small portion of the keys, significantly improving the availability and elasticity of the system.

5. Polyglot Persistence: The Modern Way

Modern architectures don't use a single "silver bullet" database. Instead, they use the best database for each specific job.

SQL (Postgres): For billing, user identity, and complex transactions.
NoSQL (Cassandra/MongoDB): For high-volume log data or time-series metrics.
Search (Elasticsearch): For full-text search and catalog indexing.
Cache (Redis): For ultra-fast session storage and ephemeral data.

Real-World Examples:

Netflix: Uses Cassandra for geo-distribution, MySQL for billing, and Elasticsearch for movie search.
Uber: Uses PostgreSQL and MySQL for core trips, and Redis for real-time geolocation.

Interview Questions & Answers 💡

1. What is the difference between horizontal and vertical scaling?

Vertical (Scale-up): Adding CPU/RAM to one server. Simpler but limited by hardware capacity.
Horizontal (Scale-out): Adding more nodes. Elastic and fault-tolerant but architecturally complex.

2. Explain leader-follower replication and its impact on consistency.

Writes go to a leader; reads go to followers. It improves availability but can lead to eventual consistency due to replication lag.

3. Compare range-based and hash-based sharding.

Range-based is easy for range queries but creates hot spots. Hash-based distributes load well but makes range queries very hard.

4. Why is consistent hashing important in distributed databases?

It minimizes data re-distribution when adding or removing nodes. In a standard key % N system, changing N moves almost all data. With Consistent Hashing, only a small fraction of keys are remapped.

5. What database model would you choose for specific tasks?

Financial Ledger: SQL (Needs ACID and strong consistency).
Product Catalog: NoSQL Document (Flexible attributes).
Real-time Chat: NoSQL Key-Value/Document (Low latency).

Final Thoughts

Scaling is about picking the right trade-off between Consistency and Availability. Replication is for availability; Sharding is for capacity. Modern systems are always Polyglot.

What's next? Mastering the CAP Theorem

SQL vs. NoSQL Object Storage