Distributed File Systems & Big Data
A File System defines how data is stored and retrieved on disk. While traditional systems (like NTFS or ext4) are great for single machines, Distributed File Systems (DFS) allow data to be spread across thousands of servers, ensuring massive scale and fault tolerance.
This content is adapted from Mastering System Design from Basics to Cracking Interviews (Udemy). It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
What is a Distributed File System (DFS)?
A DFS allows file access across multiple nodes while appearing as a single, unified file system to the user.
- Scalability: Can scale horizontally by adding commodity servers.
- Reliability: Uses replication to ensure data isn't lost if a server fails.
- Example: HDFS (Hadoop Distributed File System), the foundational storage for big data analytics.
HDFS Architecture: How it Works
HDFS follows a master-slave architecture with two primary components:
- NameNode (The Brain): Manages the file system hierarchy and keeps track of where data blocks are stored across the cluster. If the NameNode fails, the file system becomes inaccessible, making it a single point of failure unless a standby NameNode is configured (HDFS High Availability).
- DataNodes (The Muscle): Store the actual data blocks. They perform low-level read/write requests from the file system's clients.
Replication Factor
To prevent data loss, HDFS replicates blocks across different nodes (and often different racks).
- Default Replication: 3 copies.
- If DataNode A fails, the NameNode detects it and ensures a new replica is created on a healthy node.
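The NameNode/DataNode split and re-replication flow above can be sketched as a toy in-memory simulation. This is not real HDFS code; the class and method names are illustrative, and replica placement is random rather than rack-aware.

```python
import random

REPLICATION_FACTOR = 3  # HDFS default: 3 copies of every block

class NameNode:
    """Tracks which DataNodes hold each block (metadata only, no data)."""
    def __init__(self, datanodes):
        self.datanodes = set(datanodes)
        self.block_map = {}  # block_id -> set of DataNode names

    def write_block(self, block_id):
        # Place replicas on REPLICATION_FACTOR distinct healthy nodes
        targets = random.sample(sorted(self.datanodes), REPLICATION_FACTOR)
        self.block_map[block_id] = set(targets)

    def handle_datanode_failure(self, node):
        # Drop the dead node, then re-replicate every under-replicated block
        self.datanodes.discard(node)
        for holders in self.block_map.values():
            holders.discard(node)
            while len(holders) < REPLICATION_FACTOR:
                candidates = sorted(self.datanodes - holders)
                holders.add(random.choice(candidates))

nn = NameNode(["dn-A", "dn-B", "dn-C", "dn-D"])
nn.write_block("blk_001")
nn.handle_datanode_failure("dn-A")
assert len(nn.block_map["blk_001"]) == 3      # replication restored
assert "dn-A" not in nn.block_map["blk_001"]  # dead node removed
```

Note that the NameNode never touches block contents; in real HDFS the re-replication copy happens DataNode-to-DataNode, with the NameNode only issuing the command.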
Big Data: The 6 V's
Big Data isn't just about size; it's about the complexity and speed of data. We define it using the 6 V's:
- Volume: Massive amounts of data generated from web, sensors, and apps.
- Velocity: The breakneck speed at which data is collected and processed (e.g., stock markets).
- Variety: Heterogeneous formats: Structured (SQL), Semi-structured (JSON), and Unstructured (Video).
- Veracity: The accuracy and trustworthiness of the data (cleaning and validation).
- Value: The business insights and ROI extracted from the data.
- Variability: The inconsistency in data flow or meaning over time (e.g., seasonal trends or evolving language).
Common Big Data Workloads
Distributed systems are optimized for specific types of high-scale workloads:
- Logs & Events: System metrics, server logs, and transaction history.
- Clickstreams: Tracking user navigation paths on websites and apps in real-time.
- IoT Data: Continuous streams from sensors, smart devices, and industrial equipment.
- ML Pipelines: Extracting features and training models on petabytes of historical data.
Processing Big Data: Batch vs. Stream
Storing data is the first step; processing it efficiently is where the value lies.
1. Batch Processing (Hadoop, Spark)
Processes large chunks of data at once. High throughput but higher latency (minutes/hours). Best for historical analysis and nightly aggregations.
2. Stream Processing (Flink, Kafka Streams)
Processes events in real-time as they arrive. Low latency (milliseconds) but lower throughput per event. Best for fraud detection and real-time monitoring.
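The throughput/latency trade-off can be made concrete with a minimal sketch, assuming a tiny dataset of (user, amount) events. The batch path aggregates everything in one pass after the data is complete; the stream path keeps running state that is up to date after every single event.

```python
from collections import defaultdict

events = [("alice", 5), ("bob", 3), ("alice", 7), ("bob", 1)]

# Batch: wait for the whole dataset, then aggregate in one pass
# (high throughput, but results only exist once the job finishes).
def batch_totals(all_events):
    totals = defaultdict(int)
    for user, amount in all_events:
        totals[user] += amount
    return dict(totals)

# Stream: update state per event as it arrives (low latency -- the
# answer is always current, at the cost of per-event overhead).
class StreamTotals:
    def __init__(self):
        self.totals = defaultdict(int)

    def on_event(self, user, amount):
        self.totals[user] += amount
        return self.totals[user]  # available immediately

stream = StreamTotals()
for user, amount in events:
    stream.on_event(user, amount)

# Both paths converge on the same answer; they differ in *when* it exists.
assert batch_totals(events) == dict(stream.totals) == {"alice": 12, "bob": 4}
```

Real engines (Spark for batch, Flink for streams) add distribution, fault tolerance, and windowing on top of exactly this state-per-key idea.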
Modern Data Storage: S3 vs. HDFS
| Feature | HDFS | Amazon S3 |
|---|---|---|
| Coupling | Tightly coupled with compute nodes. | Decoupled storage (Serverless). |
| Scaling | Manual scaling (Add hardware). | Automatic cloud scaling. |
| Use Case | High-throughput, on-prem batch jobs. | Cloud-native analytics & Data Lakes. |
Interview Questions - Distributed Storage & Big Data
1. Why do traditional databases struggle with Big Data workloads?
Answer: Traditional RDBMSs are designed for single-node vertical scaling and rigid schemas. They fail at "scale-out" commodity hardware, struggle with unstructured data (Variety), and hit performance bottlenecks under high-velocity streams.
2. Compare Batch vs. Stream processing. Which for fraud detection?
Answer:
- Batch (Spark): Processes data in large chunks (e.g., nightly reports). High throughput, high latency.
- Stream (Flink/Kafka): Processes events as they arrive (milliseconds).
- Fraud Detection: Use Stream processing because immediate action is required to stop transactions.
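A minimal stream-side fraud rule might look like the following sketch. The rule itself (more than 3 transactions per card within a 60-second window) and all thresholds are hypothetical, chosen only to show why per-event, stateful processing is needed for an immediate block decision.

```python
from collections import deque

WINDOW_SECONDS = 60  # illustrative sliding window
MAX_TXNS = 3         # illustrative threshold

class FraudDetector:
    def __init__(self):
        self.recent = {}  # card_id -> deque of transaction timestamps

    def on_transaction(self, card_id, ts):
        q = self.recent.setdefault(card_id, deque())
        q.append(ts)
        # Evict timestamps that fell out of the sliding window
        while q and ts - q[0] > WINDOW_SECONDS:
            q.popleft()
        return len(q) > MAX_TXNS  # True -> block the transaction now

d = FraudDetector()
assert d.on_transaction("card-1", 0) is False
assert d.on_transaction("card-1", 10) is False
assert d.on_transaction("card-1", 20) is False
assert d.on_transaction("card-1", 30) is True    # 4 txns in 30s: flagged
assert d.on_transaction("card-1", 200) is False  # window has expired
```

In a batch system this check would only run hours later, long after the fraudulent transaction had already cleared; that is the entire argument for stream processing here.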
3. What is Delta Lake and how does it improve data lakes?
Answer: Delta Lake adds a reliability layer on top of S3/HDFS. It brings ACID Transactions, Schema Enforcement, and Time Travel (rolling back to previous versions), preventing the "data swamp" problem where data lakes become disorganized and corrupt.
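The three features named above can be illustrated with a toy in-memory versioned table. This is only a sketch of the concepts: real Delta Lake implements them with an immutable transaction log of Parquet files on S3/HDFS, not Python lists.

```python
class VersionedTable:
    """Toy table with schema enforcement and time travel via versions."""
    def __init__(self, schema):
        self.schema = schema   # set of required column names
        self.versions = [[]]   # version 0 is the empty table

    def commit(self, rows):
        # Schema enforcement: reject rows that don't match the schema
        for row in rows:
            if set(row) != self.schema:
                raise ValueError(f"schema mismatch: {row}")
        # ACID-style commit: append a new immutable snapshot
        self.versions.append(self.versions[-1] + rows)

    def read(self, version=-1):
        # Time travel: read any past version, latest by default
        return self.versions[version]

t = VersionedTable(schema={"id", "amount"})
t.commit([{"id": 1, "amount": 10}])
t.commit([{"id": 2, "amount": 20}])
assert len(t.read()) == 2           # latest version has both rows
assert len(t.read(version=1)) == 1  # time travel back to version 1
try:
    t.commit([{"id": 3}])           # missing "amount": rejected
except ValueError:
    pass
```

The "data swamp" failure mode is exactly what the schema check prevents: without it, every producer can silently write whatever shape of data it likes.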
4. Design a system to process terabytes of log data daily.
Answer:
- Ingestion: Use Kafka to buffer real-time logs.
- Storage: Raw logs in S3 (Data Lake) using Delta Lake for reliability.
- Processing: Use Spark for batch ETL and Flink for real-time alerting.
- Querying: Use Presto/Athena for ad-hoc serverless SQL queries.
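The four stages above can be wired together as a minimal end-to-end sketch. All names are illustrative stand-ins: `raw_buffer` plays the role of a Kafka topic, `batch_etl` of a Spark job, and `query_error_count` of a Presto/Athena query.

```python
raw_buffer = []  # stands in for a Kafka topic

def ingest(log_line):
    raw_buffer.append(log_line)  # 1. Ingestion: buffer real-time logs

def batch_etl(buffer):
    # 2-3. Storage + Processing: parse raw lines into (status, path) rows
    return [tuple(line.split()) for line in buffer]

def query_error_count(table):
    # 4. Querying: ad-hoc "SELECT COUNT(*) WHERE status = '500'"
    return sum(1 for status, _path in table if status == "500")

for line in ["200 /home", "500 /api", "500 /pay"]:
    ingest(line)
table = batch_etl(raw_buffer)
assert query_error_count(table) == 2
```

The key design choice the sketch mirrors is decoupling: each stage only sees the previous stage's output, so any component (say, swapping Spark for Flink) can be replaced without touching the others.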
5. What is the difference between HDFS NameNode and DataNode?
Answer: The NameNode manages metadata (file names, permissions, block locations), while DataNodes store the actual data blocks. It's the classic "Intelligence vs. Storage" split.
Summary & What's next?
- Distributed File Systems are horizontal, fault-tolerant, and high-throughput.
- HDFS is the standard for on-prem Hadoop, while S3 is the standard for the cloud.
- Big Data architectures rely on the separation of Storage and Compute.
Check out related topics: