Design a Notification System
A notification system delivers critical updates, such as messages, shipping alerts, or security codes, across multiple channels (Email, SMS, Push, In-App).
This content is adapted from Mastering System Design from Basics to Cracking Interviews (Udemy). It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
Introduction
Building a notification system for 10M+ DAU requires more than just calling an API. It necessitates a distributed, resilient architecture capable of handling bursty events, managing user preferences, and ensuring high delivery rates across global providers.

Requirements
Functional Requirements
- Multi-Channel Support: Support for Email, SMS, Push Notifications, and In-App alerts.
- User Preferences: Allow users to opt-in/out of specific event types and channels.
- Templating: Localized, template-based message generation.
- Retry & DLQ: Automatic retries for failed provider calls with dead-letter queueing.
- API Access: Unified APIs for triggering notifications and fetching in-app history.
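As an illustration of the templating requirement, here is a minimal sketch of localized, template-based rendering with Python's standard `string.Template`. The template catalog, the event type `order_shipped`, and the fallback-to-English rule are assumptions made for the example, not part of the original design.

```python
from string import Template

# Hypothetical template catalog keyed by (event_type, locale).
TEMPLATES = {
    ("order_shipped", "en"): Template("Hi $name, your order $order_id has shipped."),
    ("order_shipped", "es"): Template("Hola $name, tu pedido $order_id ha sido enviado."),
}

DEFAULT_LOCALE = "en"  # assumption: fall back to English when a locale is missing


def render(event_type: str, locale: str, params: dict) -> str:
    """Render a localized message, falling back to the default locale."""
    template = TEMPLATES.get((event_type, locale))
    if template is None:
        template = TEMPLATES[(event_type, DEFAULT_LOCALE)]
    # substitute() raises KeyError when a placeholder has no value, which
    # surfaces a bad payload before anything is sent to a provider.
    return template.substitute(params)
```

Calling `render("order_shipped", "es", {"name": "Ana", "order_id": "A1"})` returns the Spanish message, while an unknown locale such as `"fr"` falls back to the English template.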
Non-Functional Requirements
- At-Least-Once Delivery: Prioritize reliability to ensure critical alerts are not lost.
- Low Latency: Near real-time delivery (< 10 seconds for most channels).
- High Scalability: Handle flash-sale spikes (100M+ alerts/day).
- Observability: End-to-end traceability for every notification, from the triggering event to provider delivery.
Scale Estimation
- DAU: 10 Million users.
- Average Load: 5 events/user/day with a 2x channel fan-out = 100 Million notifications/day.
- Peak Load: 100 Million notifications/day averages about 1,160 notifications/second; a 3x multiplier during incidents or sales gives ~3,500 notifications/second.
- Storage: Preferences for 10M users + massive log volumes for delivery auditing.
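The estimates above can be checked with a few lines of arithmetic. The variable names are illustrative; the input figures are the ones stated in the list.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

dau = 10_000_000
events_per_user_per_day = 5
channel_fan_out = 2
peak_multiplier = 3

notifications_per_day = dau * events_per_user_per_day * channel_fan_out
average_per_second = notifications_per_day / SECONDS_PER_DAY
peak_per_second = average_per_second * peak_multiplier

print(f"{notifications_per_day:,} notifications/day")                       # 100,000,000
print(f"{average_per_second:,.0f} notifications/second on average")        # 1,157
print(f"{peak_per_second:,.0f} notifications/second at peak")              # 3,472
```

The peak works out to roughly 3,472 per second, which rounds to the ~3,500 figure used for capacity planning.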
High-Level Architecture
The system is decoupled using a message broker to isolate ingestion from heavy delivery processing.
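To make the decoupling concrete, here is a minimal in-process sketch in which one `queue.Queue` per channel stands in for the broker. The `Event` fields and the `ingest` and `drain` helpers are hypothetical names for illustration; a production system would use a real broker rather than in-memory queues.

```python
import queue
from dataclasses import dataclass

CHANNELS = ("email", "sms", "push", "in_app")

# One in-memory queue per channel stands in for broker topics.
channel_queues = {channel: queue.Queue() for channel in CHANNELS}


@dataclass(frozen=True)
class Event:
    event_id: str
    user_id: str
    event_type: str
    channels: tuple  # channels requested for this event


def ingest(event: Event) -> int:
    """Ingestion side: enqueue the event once per known channel and return the count."""
    enqueued = 0
    for channel in event.channels:
        if channel in channel_queues:
            channel_queues[channel].put(event)
            enqueued += 1
    return enqueued


def drain(channel: str) -> list:
    """Worker side: pull everything currently waiting for one channel."""
    pending = []
    q = channel_queues[channel]
    while not q.empty():
        pending.append(q.get_nowait())
    return pending
```

Because `ingest` only enqueues, it returns quickly even when a provider is slow; each channel's workers consume from their own queue at whatever rate that provider allows.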
The Final Design - Notification System
A comprehensive view of the entire system, from ingestion to external provider integration.

Bottlenecks & Strategic Decisions
- Third-Party Rate Limits: Providers like Twilio or SendGrid have strict throttles. Use Channel-Specific Queues to buffer traffic and implement circuit breakers to avoid cascading failures.
- Preference Lookup Overhead: Querying the main SQL database for every notification is slow. Use Redis to cache user notification settings with a write-through strategy.
- Idempotency: Retrying provider calls can lead to "double-pinging." Use a unique event_id or deduplication_key at the worker level to ensure a specific alert is sent only once per user.
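Two of the decisions above, the circuit breaker and worker-level deduplication, can be sketched together. This is a simplified single-process illustration: the class names, thresholds, TTL, and the `event_id:user_id:channel` key format are assumptions, and a real deployment would keep the deduplication keys in a shared store so that every worker sees them.

```python
import time


class Deduplicator:
    """Remembers deduplication keys for a limited time (in-memory stand-in for a shared store)."""

    def __init__(self, ttl_seconds=3600.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._seen = {}  # key -> expiry timestamp

    def first_time(self, key):
        """Record the key and return True unless it was already seen within the TTL."""
        now = self._clock()
        expiry = self._seen.get(key)
        if expiry is not None and expiry > now:
            return False
        self._seen[key] = now + self._ttl
        return True

    def forget(self, key):
        self._seen.pop(key, None)


class CircuitBreaker:
    """Opens after N consecutive failures and rejects calls until a cool-down has passed."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def allow(self):
        if self._opened_at is None:
            return True
        # After the cool-down, let trial calls through (half-open state).
        return self._clock() - self._opened_at >= self._reset_after

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = self._clock()


def deliver(event_id, user_id, channel, send, dedup, breaker):
    """Return 'sent', 'duplicate', 'deferred' (breaker open), or 'failed'."""
    if not breaker.allow():
        return "deferred"  # leave the message queued instead of hammering the provider
    key = f"{event_id}:{user_id}:{channel}"  # hypothetical deduplication key format
    if not dedup.first_time(key):
        return "duplicate"
    try:
        send()
    except Exception:
        breaker.record_failure()
        dedup.forget(key)  # release the key so a later retry is not treated as a duplicate
        return "failed"
    breaker.record_success()
    return "sent"
```

One caveat with this sketch: if a provider call times out after the provider has already accepted the message, releasing the key allows a duplicate on retry, so at-least-once delivery still implies occasional repeats.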
Top Interview Questions
Q: How do you handle "Quiet Hours"?
A: The Orchestrator queries the Preference Service for the user's local time. If that time falls within the user's quiet hours, the notification is either queued for later or dropped, depending on its severity level (e.g., OTPs bypass quiet hours).
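The quiet-hours decision can be sketched as a small pure function. The severity names (`critical`, `normal`, `low`) and the rule that only low-severity alerts are dropped are assumptions for illustration; the check also covers windows that cross midnight, such as 22:00 to 07:00.

```python
from datetime import time

# Hypothetical severity levels that bypass quiet hours (OTPs would fall in this group).
BYPASS_SEVERITIES = {"critical"}


def in_quiet_hours(local_now: time, start: time, end: time) -> bool:
    """True if local_now is in [start, end), including windows that cross midnight."""
    if start == end:
        return False  # treated here as "no quiet hours configured"
    if start < end:
        return start <= local_now < end
    return local_now >= start or local_now < end


def decide(severity: str, local_now: time, start: time, end: time) -> str:
    """Return 'send', 'defer', or 'drop' for one notification."""
    if severity in BYPASS_SEVERITIES or not in_quiet_hours(local_now, start, end):
        return "send"
    # Assumption: low-severity alerts are dropped; everything else waits until quiet hours end.
    return "drop" if severity == "low" else "defer"
```

For a 22:00 to 07:00 window, `decide("critical", time(23, 30), time(22, 0), time(7, 0))` returns `"send"`, while the same call with `"normal"` returns `"defer"`.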
Q: What is a Dead-letter Queue (DLQ) used for here?
A: If a worker fails to deliver after multiple retries (e.g., invalid phone number), the message is moved to a DLQ for manual audit or further automated analysis without blocking the main worker threads.
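A minimal sketch of the retry-then-DLQ flow follows, using a plain list as the dead-letter queue. The function name, the three-attempt limit, and the exponential backoff schedule are assumptions; the `sleep` parameter is injectable so the logic can be exercised without real delays.

```python
import time


def deliver_with_retry(message, send, dead_letters, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Try send(message) up to max_attempts times, then park the message in the DLQ."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            send(message)
            return True
        except Exception as exc:
            # A real worker would likely skip retries for permanent errors
            # (such as an invalid phone number) and dead-letter them immediately.
            last_error = exc
            if attempt < max_attempts:
                sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...
    # Retries exhausted: record the failure for audit and move on to the next message.
    dead_letters.append(
        {"message": message, "error": str(last_error), "attempts": max_attempts}
    )
    return False
```

Because the failed message is appended to `dead_letters` and the function returns, the worker is free to pick up the next message instead of retrying the same one indefinitely.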