Distributed File Systems & Big Data
A File System defines how data is stored and retrieved on disk. While traditional systems (like NTFS or ext4) are great for single machines, Distributed File Systems (DFS) allow data to be spread across thousands of servers, ensuring massive scale and fault tolerance.
This content is adapted from Mastering System Design from Basics to Cracking Interviews (Udemy). It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
What is a Distributed File System (DFS)?
A DFS allows file access across multiple nodes while appearing as a single, unified file system to the user.
- Scalability: Can scale horizontally by adding commodity servers.
- Reliability: Uses replication to ensure data isn't lost if a server fails.
- Example: HDFS (Hadoop Distributed File System), the foundational storage for big data analytics.
HDFS Architecture: How it Works
HDFS follows a master-slave architecture with two primary components:
- NameNode (The Brain): Manages the file system hierarchy and keeps track of where data blocks are stored across the cluster. If the NameNode fails, the file system becomes inaccessible, making it a single point of failure unless a standby NameNode is configured (HDFS High Availability).
- DataNodes (The Muscle): Store the actual data blocks. They perform low-level read/write requests from the file system's clients.
Replication Factor
To prevent data loss, HDFS replicates blocks across different nodes (and often different racks).
- Default Replication: 3 copies.
- If DataNode A fails, the NameNode detects it and ensures a new replica is created on a healthy node.
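The NameNode/DataNode split and re-replication flow above can be sketched as a toy in-memory simulation. This is not real HDFS code; the class and method names are illustrative, and replica placement is random rather than rack-aware.

```python
import random

REPLICATION_FACTOR = 3  # HDFS default: 3 copies of every block

class NameNode:
    """Tracks which DataNodes hold each block (metadata only, no data)."""
    def __init__(self, datanodes):
        self.datanodes = set(datanodes)
        self.block_map = {}  # block_id -> set of DataNode names

    def write_block(self, block_id):
        # Place replicas on REPLICATION_FACTOR distinct healthy nodes
        targets = random.sample(sorted(self.datanodes), REPLICATION_FACTOR)
        self.block_map[block_id] = set(targets)

    def handle_datanode_failure(self, node):
        # Drop the dead node, then re-replicate every under-replicated block
        self.datanodes.discard(node)
        for holders in self.block_map.values():
            holders.discard(node)
            while len(holders) < REPLICATION_FACTOR:
                candidates = sorted(self.datanodes - holders)
                holders.add(random.choice(candidates))

nn = NameNode(["dn-A", "dn-B", "dn-C", "dn-D"])
nn.write_block("blk_001")
nn.handle_datanode_failure("dn-A")
assert len(nn.block_map["blk_001"]) == 3      # replication restored
assert "dn-A" not in nn.block_map["blk_001"]  # dead node removed
```

Note that the NameNode never touches block contents; in real HDFS the re-replication copy happens DataNode-to-DataNode, with the NameNode only issuing the command.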
Big Data: The 6 V's
Big Data isn't just about size; it's about the complexity and speed of data. We define it using the 6 V's:
- Volume: Massive amounts of data generated from web, sensors, and apps.
- Velocity: The breakneck speed at which data is collected and processed (e.g., stock markets).
- Variety: Heterogeneous formats: Structured (SQL), Semi-structured (JSON), and Unstructured (Video).
- Veracity: The accuracy and trustworthiness of the data (cleaning and validation).
- Value: The business insights and ROI extracted from the data.
- Variability: The inconsistency in data flow or meaning over time (e.g., seasonal trends or evolving language).
Common Big Data Workloads
Distributed systems are optimized for specific types of high-scale workloads:
- Logs & Events: System metrics, server logs, and transaction history.
- Clickstreams: Tracking user navigation paths on websites and apps in real-time.
- IoT Data: Continuous streams from sensors, smart devices, and industrial equipment.
- ML Pipelines: Extracting features and training models on petabytes of historical data.
Processing Big Data: Batch vs. Stream
Storing data is the first step; processing it efficiently is where the value lies.
1. Batch Processing (Hadoop, Spark)
Processes large chunks of data at once. High throughput but higher latency (minutes/hours). Best for historical analysis and nightly aggregations.
2. Stream Processing (Flink, Kafka Streams)
Processes events in real-time as they arrive. Low latency (milliseconds) but lower throughput per event. Best for fraud detection and real-time monitoring.
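The throughput/latency trade-off can be made concrete with a minimal sketch, assuming a tiny dataset of (user, amount) events. The batch path aggregates everything in one pass after the data is complete; the stream path keeps running state that is up to date after every single event.

```python
from collections import defaultdict

events = [("alice", 5), ("bob", 3), ("alice", 7), ("bob", 1)]

# Batch: wait for the whole dataset, then aggregate in one pass
# (high throughput, but results only exist once the job finishes).
def batch_totals(all_events):
    totals = defaultdict(int)
    for user, amount in all_events:
        totals[user] += amount
    return dict(totals)

# Stream: update state per event as it arrives (low latency -- the
# answer is always current, at the cost of per-event overhead).
class StreamTotals:
    def __init__(self):
        self.totals = defaultdict(int)

    def on_event(self, user, amount):
        self.totals[user] += amount
        return self.totals[user]  # available immediately

stream = StreamTotals()
for user, amount in events:
    stream.on_event(user, amount)

# Both paths converge on the same answer; they differ in *when* it exists.
assert batch_totals(events) == dict(stream.totals) == {"alice": 12, "bob": 4}
```

Real engines (Spark for batch, Flink for streams) add distribution, fault tolerance, and windowing on top of exactly this state-per-key idea.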
Modern Data Storage: S3 vs. HDFS
| Feature | HDFS | Amazon S3 |
|---|---|---|
| Coupling | Tightly coupled with compute nodes. | Decoupled storage (Serverless). |
| Scaling | Manual scaling (Add hardware). | Automatic cloud scaling. |
| Use Case | High-throughput, on-prem batch jobs. | Cloud-native analytics & Data Lakes. |
Interview Questions - Distributed Storage & Big Data
1. Why do traditional databases struggle with Big Data workloads?
Answer: Traditional RDBMSs are designed for single-node vertical scaling and rigid schemas. They fail at "scale-out" commodity hardware, struggle with unstructured data (Variety), and hit performance bottlenecks under high-velocity streams.
2. Compare Batch vs. Stream processing. Which for fraud detection?
Answer:
- Batch (Spark): Processes data in large chunks (e.g., nightly reports). High throughput, high latency.
- Stream (Flink/Kafka): Processes events as they arrive (milliseconds).
- Fraud Detection: Use Stream processing because immediate action is required to stop transactions.
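A minimal stream-side fraud rule might look like the following sketch. The rule itself (more than 3 transactions per card within a 60-second window) and all thresholds are hypothetical, chosen only to show why per-event, stateful processing is needed for an immediate block decision.

```python
from collections import deque

WINDOW_SECONDS = 60  # illustrative sliding window
MAX_TXNS = 3         # illustrative threshold

class FraudDetector:
    def __init__(self):
        self.recent = {}  # card_id -> deque of transaction timestamps

    def on_transaction(self, card_id, ts):
        q = self.recent.setdefault(card_id, deque())
        q.append(ts)
        # Evict timestamps that fell out of the sliding window
        while q and ts - q[0] > WINDOW_SECONDS:
            q.popleft()
        return len(q) > MAX_TXNS  # True -> block the transaction now

d = FraudDetector()
assert d.on_transaction("card-1", 0) is False
assert d.on_transaction("card-1", 10) is False
assert d.on_transaction("card-1", 20) is False
assert d.on_transaction("card-1", 30) is True    # 4 txns in 30s: flagged
assert d.on_transaction("card-1", 200) is False  # window has expired
```

In a batch system this check would only run hours later, long after the fraudulent transaction had already cleared; that is the entire argument for stream processing here.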
3. What is Delta Lake and how does it improve data lakes?
Answer: Delta Lake adds a reliability layer on top of S3/HDFS. It brings ACID Transactions, Schema Enforcement, and Time Travel (rolling back to previous versions), preventing the "data swamp" problem where data lakes become disorganized and corrupt.
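The three features named above can be illustrated with a toy in-memory versioned table. This is only a sketch of the concepts: real Delta Lake implements them with an immutable transaction log of Parquet files on S3/HDFS, not Python lists.

```python
class VersionedTable:
    """Toy table with schema enforcement and time travel via versions."""
    def __init__(self, schema):
        self.schema = schema   # set of required column names
        self.versions = [[]]   # version 0 is the empty table

    def commit(self, rows):
        # Schema enforcement: reject rows that don't match the schema
        for row in rows:
            if set(row) != self.schema:
                raise ValueError(f"schema mismatch: {row}")
        # ACID-style commit: append a new immutable snapshot
        self.versions.append(self.versions[-1] + rows)

    def read(self, version=-1):
        # Time travel: read any past version, latest by default
        return self.versions[version]

t = VersionedTable(schema={"id", "amount"})
t.commit([{"id": 1, "amount": 10}])
t.commit([{"id": 2, "amount": 20}])
assert len(t.read()) == 2           # latest version has both rows
assert len(t.read(version=1)) == 1  # time travel back to version 1
try:
    t.commit([{"id": 3}])           # missing "amount": rejected
except ValueError:
    pass
```

The "data swamp" failure mode is exactly what the schema check prevents: without it, every producer can silently write whatever shape of data it likes.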
4. Design a system to process terabytes of log data daily.
Answer:
- Ingestion: Use Kafka to buffer real-time logs.
- Storage: Raw logs in S3 (Data Lake) using Delta Lake for reliability.
- Processing: Use Spark for batch ETL and Flink for real-time alerting.
- Querying: Use Presto/Athena for ad-hoc serverless SQL queries.
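The four stages above can be wired together as a minimal end-to-end sketch. All names are illustrative stand-ins: `raw_buffer` plays the role of a Kafka topic, `batch_etl` of a Spark job, and `query_error_count` of a Presto/Athena query.

```python
raw_buffer = []  # stands in for a Kafka topic

def ingest(log_line):
    raw_buffer.append(log_line)  # 1. Ingestion: buffer real-time logs

def batch_etl(buffer):
    # 2-3. Storage + Processing: parse raw lines into (status, path) rows
    return [tuple(line.split()) for line in buffer]

def query_error_count(table):
    # 4. Querying: ad-hoc "SELECT COUNT(*) WHERE status = '500'"
    return sum(1 for status, _path in table if status == "500")

for line in ["200 /home", "500 /api", "500 /pay"]:
    ingest(line)
table = batch_etl(raw_buffer)
assert query_error_count(table) == 2
```

The key design choice the sketch mirrors is decoupling: each stage only sees the previous stage's output, so any component (say, swapping Spark for Flink) can be replaced without touching the others.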
5. What is the difference between HDFS NameNode and DataNode?
Answer: The NameNode manages metadata (file names, permissions, block locations), while DataNodes store the actual data blocks. It's the classic "Intelligence vs. Storage" split.
Summary & What's next?
- Distributed File Systems are horizontal, fault-tolerant, and high-throughput.
- HDFS is the standard for on-prem Hadoop, while S3 is the standard for the cloud.
- Big Data architectures rely on the separation of Storage and Compute.
Check out related topics: