Storage Basics: The Foundation of Data
In system design, storage is not just about "keeping data": it is about how that data is structured, accessed, and protected. Every system generates and consumes data; choosing the right storage strategy impacts performance, reliability, and cost.
This content is adapted from Mastering System Design from Basics to Cracking Interviews (Udemy). It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
Structured vs. Unstructured Data
Before choosing a storage medium, you must understand the nature of your data.
- Structured Data: Organized in rows and columns with a predefined schema (e.g., SQL tables, CSVs). Ideal for transactional systems.
- Unstructured Data: No fixed schema; flexible formats like images, videos, logs, and PDF documents. This represents the majority of data generated today.
| Feature | Structured | Unstructured |
|---|---|---|
| Format | Strictly defined (Tables) | Flexible (Files, Objects) |
| Storage | Relational DBs (RDBMS) | Data Lakes, Object Store, NoSQL |
| Scalability | Often Vertical | Usually Horizontal |
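The contrast in the table can be made concrete with a minimal sketch. It uses Python's built-in `sqlite3` module as a stand-in for a relational database and a plain dictionary as a stand-in for a blob store; the `orders` table, `try_insert`, and `put_blob` are hypothetical names chosen for illustration.

```python
import sqlite3

# Structured: the schema is declared up front and enforced on every write.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer TEXT NOT NULL,"
    " total_cents INTEGER NOT NULL)"
)


def try_insert(customer, total_cents):
    """Return True if the row satisfies the schema, False if it is rejected."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute(
                "INSERT INTO orders (customer, total_cents) VALUES (?, ?)",
                (customer, total_cents),
            )
        return True
    except sqlite3.IntegrityError:
        return False


# Unstructured: the store keeps opaque bytes and never inspects them.
blobs = {}


def put_blob(key, data):
    """Store raw bytes under a key and return the number of bytes stored."""
    blobs[key] = bytes(data)
    return len(data)


ok = try_insert("alice", 1999)      # fits the schema
rejected = try_insert(None, 500)    # violates NOT NULL on customer
put_blob("cat.jpg", b"\xff\xd8\xff\xe0")  # any bytes are accepted
```

The structured side can answer questions about individual columns (for example, all orders above a given total), while the blob side can only hand back the bytes it was given.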
Categories of Storage
Systems store data in four main ways. The first three are infrastructure-level primitives; the fourth is a higher-level layer built on top of them:
- Object Storage (e.g., AWS S3): Groups data into "objects" with unique IDs and metadata. Best for static assets (images, backups).
- File Storage (e.g., NFS, EFS): Organizes data in a hierarchical folder-file structure. The familiar model for shared drives and desktop operating systems.
- Block Storage (e.g., AWS EBS): Breaks data into fixed-sized blocks. Fast and efficient for databases and OS boot volumes.
- Database Storage: Higher-level abstraction (SQL/NoSQL) that manages data relationships and queries.
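To show how differently the object and block models address data, here is a toy in-memory sketch. `ToyObjectStore` and `ToyBlockDevice` are hypothetical classes written for this illustration; they do not mirror the API of S3, EBS, or any real service.

```python
class ToyObjectStore:
    """Flat namespace: each key maps to a whole object plus its metadata."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data, **metadata):
        meta = dict(metadata)
        meta["size"] = len(data)
        self._objects[key] = (bytes(data), meta)

    def get(self, key):
        # Objects are read back whole, not modified in place.
        return self._objects[key][0]

    def head(self, key):
        # Metadata can be inspected without fetching the payload.
        return self._objects[key][1]


class ToyBlockDevice:
    """Fixed-size blocks addressed by number; no names and no metadata."""

    def __init__(self, num_blocks, block_size=4096):
        self.block_size = block_size
        self._blocks = [bytes(block_size) for _ in range(num_blocks)]

    def write_block(self, index, data):
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block long")
        self._blocks[index] = bytes(data)

    def read_block(self, index):
        return self._blocks[index]


store = ToyObjectStore()
store.put("images/cat.jpg", b"fake-jpeg-bytes", content_type="image/jpeg")

device = ToyBlockDevice(num_blocks=4, block_size=8)
device.write_block(2, b"12345678")
```

The slash in `images/cat.jpg` is just part of the key: the object store has no real directories. The block device, by contrast, knows nothing about names at all, which is why a file system or database is normally layered on top of it.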
Storage Properties
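To build reliable systems, storage is expected to provide several key properties. Durability and atomicity are two of the ACID transaction guarantees; availability and consistency, as defined here, are the terms used in the CAP theorem: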
- Durability: Data persists even after failures (e.g., power loss or hardware crashes).
- Availability: Data can be accessed whenever the system needs it.
- Consistency: Every read returns the most recent write or an error.
- Atomicity: Operations are all-or-nothing; if one part of a transaction fails, the whole thing fails.
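As a rough illustration of the durability property at the level of a single file, the sketch below uses one common pattern: write to a temporary file, force it to disk with `os.fsync`, then rename it over the target. `durable_write` is a hypothetical helper, and this is only one way to approach the problem. A stricter version on POSIX systems would also fsync the containing directory, which is omitted here.

```python
import os
import tempfile


def durable_write(path, data):
    """Replace the file at `path` with `data` so that a crash leaves
    either the old contents or the new contents, never a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the final rename
    # stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()             # push Python's buffer to the OS
            os.fsync(f.fileno())  # ask the OS to push its cache to the disk
        os.replace(tmp_path, path)  # swap the new file into place
    except BaseException:
        # Clean up the temp file if anything went wrong before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Calling `durable_write("config.json", b"{}")` either fully replaces the file or leaves the previous version in place. Databases apply the same idea at a larger scale with write-ahead logs.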
The Trade-offs in Storage Design
There is no "perfect" storage solution. Architects must trade off between three competing needs:
The "Unicorn" Problem: You can rarely achieve maximum Scalability, Reliability, and Performance simultaneously. For example, highly consistent databases (Reliability) often struggle with ultra-high global scale (Scalability).
Real-World Use Cases
- E-commerce: Product catalogs (Structured), Product images (Object).
- Streaming Services: Video files (Object), User watch history (NoSQL), Subscription billing (SQL).
- Log Aggregation: Time-series or Columnar DBs for fast analytical queries.
Interview Questions & Answers - Storage Fundamentals
1. Why is storage a critical component in system design?
Answer: Storage is essential because it ensures data persistence across sessions and failures. The choice of storage impacts scalability, performance, availability, and cost.
2. How would you differentiate between structured and unstructured data?
Answer:
- Structured: Highly organized, predefined schema (SQL tables). E.g., transaction records.
- Unstructured: No fixed schema, stored as raw files or blobs. E.g., images, videos, documents.
3. What are the different types of storage systems and their use cases?
Answer:
- Databases (SQL/NoSQL): Structured data, queries, and transactions.
- Object Storage: Unstructured media, backups (S3).
- File Storage: Hierarchical shared drives (NFS).
- Block Storage: Raw, high-performance volumes for DBs/VMs (EBS).
4. What do durability, availability, and consistency mean?
Answer:
- Durability: Data remains intact after crashes.
- Availability: System responds to requests even during failures.
- Consistency: Clients see the latest committed data after a write.
5. What is atomicity, and where is it relevant?
Answer: Atomicity ensures operations are all-or-nothing. It's crucial in transactional systems (like banking) where partial updates could cause data corruption.
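The banking case can be sketched with Python's built-in `sqlite3` module. The `accounts` table, the `transfer` and `balances` helpers, and the starting balances are assumptions made for this example. The credit is deliberately applied before the debit so that a failed debit has something to roll back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
)
conn.commit()


def transfer(src, dst, amount):
    """Move `amount` from src to dst; apply both updates or neither."""
    try:
        with conn:  # commits on success, rolls back on any exception
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # Fails the CHECK constraint if src would go negative.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances():
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

If alice tries to send more than she has, the debit fails and the credit already applied to bob is undone along with it, so the total across both accounts never changes.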
6. Photo-sharing app design: photos vs. metadata?
Answer:
- Photos: Object storage (S3), optimized for large unstructured files.
- Metadata: NoSQL or Relational DB, depending on query patterns and consistency needs.
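A minimal sketch of this split, assuming a dictionary as a stand-in for the object store and an in-memory SQLite table for the metadata. The `photos` schema, the content-hash key scheme, and the function names are illustrative choices, not a prescribed design.

```python
import hashlib
import sqlite3

object_store = {}  # stand-in for a bucket in S3 or a similar service

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE photos ("
    " id INTEGER PRIMARY KEY,"
    " owner TEXT NOT NULL,"
    " caption TEXT,"
    " object_key TEXT NOT NULL,"
    " size_bytes INTEGER NOT NULL)"
)


def upload_photo(owner, caption, image_bytes):
    """Store the bytes in the object store and a small row in the database."""
    # Assumption: keys are derived from a hash of the content.
    key = "photos/" + hashlib.sha256(image_bytes).hexdigest()
    object_store[key] = image_bytes
    with db:
        cur = db.execute(
            "INSERT INTO photos (owner, caption, object_key, size_bytes)"
            " VALUES (?, ?, ?, ?)",
            (owner, caption, key, len(image_bytes)),
        )
    return cur.lastrowid


def photos_by_owner(owner):
    """Query metadata only; the image bytes are fetched separately by key."""
    return db.execute(
        "SELECT id, caption, object_key FROM photos WHERE owner = ? ORDER BY id",
        (owner,),
    ).fetchall()
```

Listing a user's photos touches only the small metadata rows; the large image payloads are fetched from the object store by key when a photo is actually displayed.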
7. What storage for an analytics pipeline (logs/metrics)?
Answer:
- Raw logs: Object storage (S3).
- Metrics: Columnar or Time-series DBs (InfluxDB, Apache Druid).
- Batch processing: Distributed file systems (HDFS).
8. Object vs. File vs. Block storage?
Answer:
- Object: Scalable, flat structure, API-based (S3).
- File: Hierarchical, good for shared directories (NFS).
- Block: Raw, fast I/O, best for DB volumes.
9. What database model would you choose for:
- a. Financial Ledger: SQL (PostgreSQL). Needs strict ACID compliance and strong consistency.
- b. Product Catalog: NoSQL Document (MongoDB). Handles schema flexibility and nested attributes well.
- c. Real-time Chat: NoSQL Key-Value/Document (Redis/DynamoDB). Low latency and high throughput are priorities.
Summary & What's next?
- Storage is chosen based on Data Format (Structured/Unstructured).
- Block storage is for speed; Object storage is for scale.
- Reliability guarantees like Durability are non-negotiable for enterprise systems.
What's next? The CAP Theorem: Balancing Distributed Data