Database Scaling, Replication & Sharding π
As applications grow from thousands to millions of users, a single database server becomes a bottleneck. To handle massive traffic and ensure high availability, we must employ advanced scaling and distribution techniques.
This content is adapted from Mastering System Design from Basics to Cracking Interviews (Udemy). It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
1. Database Scaling: Vertical vs. Horizontal
Before distributing data, we must decide how to add resources.
- Vertical Scaling (Scale-Up): Increasing the capacity of a single server (More CPU, RAM, or faster NVMe storage).
- Pros: Simple, maintained ACID compliance, no application changes.
- Cons: Hard hardware limits, expensive, single point of failure.
- Horizontal Scaling (Scale-Out): Adding more nodes (servers) to the database cluster.
- Pros: Virtually unlimited scale, fault tolerance, cost-effective (standard hardware).
- Cons: Complex to manage, requires sharding, challenges with consistency.
2. Database Replication
Replication is the process of keeping multiple copies of the same data on different nodes to improve Availability and Read Performance.
Leader-Follower Replication
- Leader: Handles all Writes. Updates followers asynchronously or synchronously.
- Followers: Provide Read-Only access. If the leader fails, a follower can be promoted to leader.
Read Replicas
Often used just for load balancing reads. Read replicas do not participate in the write process and are typically used to scale read-heavy workloads (like BI or reporting tools).
3. Database Sharding (Partitioning)
Sharding is the process of splitting data across multiple databases to scale horizontally.
- Horizontal Sharding: Rows are distributed across shards (most common).
- Vertical Sharding: Tables or columns are split across shards based on function or access patterns.
Sharding Strategies
- Range-Based Sharding: Data is split based on ranges of a key (e.g., User IDs 1-1000 go to Shard A).
- Risk: Can lead to "Hot spots" if certain ranges are more active.
- Hash-Based Sharding: A hash function is applied to the shard key to determine the server.
- Balance: Distributes load evenly but makes range queries difficult.
- Geo-Based Sharding: Data is sharded based on the user's geographic region. Useful for reducing latency in global systems.
4. Consistent Hashing
Consistent Hashing is the "secret sauce" behind highly scalable distributed systems like Amazon's Dynamo or Apache Cassandra. It minimizes the amount of data that needs to be moved when nodes are added or removed from a cluster.
Instead of a simple modulo operation, keys and nodes are mapped onto a logical Ring. This ensures that adding or removing a node only affects a small portion of the keys, significantly improving the availability and elasticity of the system.
5. Polyglot Persistence: The Modern Way
Modern architectures don't use a single "silver bullet" database. Instead, they use the best database for each specific job.
- SQL (Postgres): For billing, user identity, and complex transactions.
- NoSQL (Cassandra/MongoDB): For high-volume log data or time-series metrics.
- Search (Elasticsearch): For full-text search and catalog indexing.
- Cache (Redis): For ultra-fast session storage and ephemeral data.
Real-World Examples:
- Netflix: Uses Cassandra for geo-distribution, MySQL for billing, and Elasticsearch for movie search.
- Uber: Uses PostgreSQL and MySQL for core trips, and Redis for real-time geolocation.
Interview Questions - Advanced Database Topics π‘
1. What is the difference between horizontal and vertical scaling? When would you prefer one?
Answer:
- Vertical (Scale-up): Adding CPU/RAM to one server. Simpler but limited by hardware capacity. Use for smaller scales or where consistency is non-negotiable.
- Horizontal (Scale-out): Adding more nodes. Elastic and fault-tolerant but architecturally complex. Use for global-scale traffic and "Big Data."
2. Explain leader-follower replication. Impact on consistency?
Answer: Writes go to a leader; reads go to followers. It improves availability but can lead to eventual consistency due to replication lag.
3. What are the pros and cons of using read replicas?
Answer:
- Pros: Scales read-heavy workloads, offloads primary DB, increases fault tolerance.
- Cons: Replication lag leads to stale data, complexity in routing reads vs. writes, no help for write scalability.
4. Compare range-based and hash-based sharding.
Answer: Range-based is easy for range queries but creates hot spots. Hash-based distributes load well but makes range queries very hard.
5. Why is consistent hashing important in distributed databases?
Answer: It minimizes data re-distribution when adding or removing nodes. In a standard key % N system, changing N moves almost all data. With Consistent Hashing, only a small fraction of keys are remapped.
6. What database model would you choose for:
- a. Financial Ledger: SQL (Needs ACID and strong consistency).
- b. Product Catalog: NoSQL Document (Flexible attributes).
- c. Real-time Chat: NoSQL Key-Value/Document (Low latency).
7. How does data modeling differ between SQL and NoSQL systems?
- SQL: Highly normalized, multiple tables with relationships, focus on reducing redundancy.
- NoSQL: Denormalized or nested structures, data is modeled specifically for access patterns (read/write efficiency).
8. How do systems like Uber use polyglot persistence?
Answer: Uber uses PostgreSQL for business logic, Redis for real-time location tracking, and BigQuery for analytical data processing. Using the right tool for each job optimizes both performance and developer productivity.
Summary: Scaling is about picking the right trade-off between Consistency and Availability. Replication is for availability; Sharding is for capacity. Modern systems are always Polyglot.
What's next? Mastering the CAP Theorem