Design a Notification System
Design a notification system that handles critical time-sensitive notifications (like 1:1 chat messages) and promotional notifications (like system-generated content recommendations) with expiration logic. The system should scale to 1M notifications/second with an 80/20 critical/promotional distribution, where promotional notifications can target thousands of users simultaneously.
Asked at: Airbnb, Uber, Meta, Microsoft
A notifications platform (think OneSignal, Firebase Cloud Messaging, or Twilio Notify) lets product teams send critical alerts (e.g., OTPs, security events) and promotional campaigns (e.g., sales, reminders) to users across channels like push, SMS, and email. Users expect critical messages in near real time and should never receive a promo after the offer expires. Interviewers ask this to evaluate whether you can design a high-throughput, multi-tenant, priority-aware pipeline that isolates critical traffic from bulk campaigns, handles fan-out to large audiences, respects provider rate limits, and enforces time windows. They also probe your grasp of durability, idempotency, per-user ordering, retries, and resiliency when downstream providers degrade. You are judged not on client-side delivery but on a backend architecture that can sustain 1M notifications/sec.
Common Functional Requirements
Most candidates end up covering this set of core functionalities
Users should be able to receive critical notifications in near real time with high delivery reliability and per-user ordering.
Users should be able to receive promotional notifications only within the valid promotion window and never after it expires.
Users should be able to manage notification preferences (opt-in/out, channels, quiet hours) that are respected across all message types.
Users should be able to receive targeted campaign notifications at massive scale without degrading the latency or reliability of critical notifications.
Common Deep Dives
Common follow-up questions interviewers like to ask for this question
In a system design interview, demonstrating priority isolation shows you understand workload contention and SLO protection. Critical alerts cannot be delayed by large promotional blasts, so your design should separate ingress and processing paths and provide hard capacity guarantees.
- Consider separate API endpoints, topics/queues, and worker pools per priority class, with capacity reservations and quotas so promo traffic can never starve critical traffic.
- You could implement admission control and backpressure at the API gateway or producer layer; reject, shed, or defer promo requests when the system approaches critical capacity.
- Partition by user/device to preserve per-user ordering while scaling horizontally, and ensure retries do not jump the priority queue and reorder messages.
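The capacity-reservation idea above can be sketched as a small admission controller. This is a minimal illustration, not a production design: the class name `AdmissionController`, the notion of "in-flight slots", and the numbers in the usage note are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class AdmissionController:
    """Shared capacity with a slice that promo traffic may never use.

    All names and units here are hypothetical; a real gateway would track
    capacity per region, tenant, or worker pool.
    """

    total_capacity: int      # max in-flight notifications overall
    critical_reserved: int   # slots reserved exclusively for critical traffic
    in_flight_critical: int = 0
    in_flight_promo: int = 0

    def try_admit(self, priority: str) -> bool:
        in_flight = self.in_flight_critical + self.in_flight_promo
        if priority == "critical":
            # Critical traffic may use any free slot.
            if in_flight < self.total_capacity:
                self.in_flight_critical += 1
                return True
            return False
        # Promo traffic is capped at the unreserved portion of capacity.
        promo_limit = self.total_capacity - self.critical_reserved
        if self.in_flight_promo < promo_limit and in_flight < self.total_capacity:
            self.in_flight_promo += 1
            return True
        return False  # caller should shed or defer the promo request

    def release(self, priority: str) -> None:
        # Call when a notification finishes (delivered, failed, or dropped).
        if priority == "critical":
            self.in_flight_critical = max(0, self.in_flight_critical - 1)
        else:
            self.in_flight_promo = max(0, self.in_flight_promo - 1)
```

With `AdmissionController(total_capacity=10, critical_reserved=8)`, at most two promo requests are admitted at once, while critical requests can still fill every remaining slot.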
Large campaigns are multi-step workflows: audience expansion, scheduling, dispatch, and retries. Interviewers look for time-bound execution, cancellation, and compliance with external throttles.
- You could expand the audience into a durable send list, then dispatch via a scheduler that enforces a TTL/end-time; drop or cancel any work that crosses the expiry boundary.
- Consider per-provider and per-tenant token-bucket rate limiters, with batching and exponential backoff, so you respect limits for SMS, email, and push providers.
- Provide a cancel/pause control that marks in-flight items as canceled and drains queues without sending, plus observability of send window adherence and backlog age.
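A token bucket combined with an expiry check at dispatch time might look like the sketch below. The `TokenBucket` class, the `dispatch` helper, and the `expires_at` field are hypothetical names chosen for illustration; a real deployment would keep bucket state in a shared store so that every worker sees the same limit.

```python
import time
from typing import Callable


class TokenBucket:
    """In-process token bucket; one instance per provider or tenant (assumed)."""

    def __init__(self, rate_per_sec: float, burst: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate_per_sec
        self.burst = burst
        self.clock = clock          # injectable so the bucket can be tested
        self.tokens = burst         # start full
        self.updated = clock()

    def try_acquire(self, n: float = 1.0) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False


def dispatch(item: dict, bucket: TokenBucket, now: float) -> str:
    """Return 'expired', 'deferred', or 'sent' for one send-list item.

    `now` is wall-clock time compared against the campaign end time; the
    bucket uses its own clock for refills.
    """
    if now >= item["expires_at"]:
        return "expired"    # never send past the promotion window
    if not bucket.try_acquire():
        return "deferred"   # requeue; the expiry check runs again on retry
    return "sent"           # placeholder for the actual provider call
```

Checking expiry before taking a token means expired work is dropped without consuming provider quota, and a deferred item is re-checked against its window every time it is retried.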
At-least-once delivery with idempotency is the standard trade-off at scale. Interviewers want to see how you prevent duplicates, maintain order per user/device, and handle retries without breaking semantics.
- You could use idempotency keys (user, channel, messageId) with a short TTL in a fast store to dedupe at the channel boundary and make retries safe.
- Partition your log/queue by user or device so a single consumer shard preserves local ordering; avoid cross-partition reordering by keeping all messages for a key in one partition.
- Define retry policies, dead-letter queues for poison messages, and ensure backoff does not reorder messages within a key when a single send fails.
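The idempotency-key and partitioning ideas above can be sketched as follows. The in-memory dictionary stands in for the "fast store" (in practice something like a key-value cache with native TTL and an atomic set-if-absent operation), and `DedupeStore` and `partition_for` are hypothetical names for this example.

```python
import hashlib


class DedupeStore:
    """In-memory stand-in for a fast TTL store keyed by (user, channel, messageId)."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._seen = {}  # key -> expiry timestamp

    def first_time(self, user_id: str, channel: str, message_id: str, now: float) -> bool:
        """Return True only for the first sighting of a key within its TTL."""
        key = (user_id, channel, message_id)
        expiry = self._seen.get(key)
        if expiry is not None and expiry > now:
            return False  # duplicate inside the TTL window: skip the send
        self._seen[key] = now + self.ttl
        return True


def partition_for(user_id: str, num_partitions: int) -> int:
    """Map a user to a stable partition so one consumer sees all of their messages."""
    # A cryptographic hash is used only because it is stable across processes;
    # Python's built-in hash() is salted per process for strings.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

A worker would call `first_time(...)` immediately before the provider call and skip the send when it returns False. In a shared store, the check and the write need to be a single atomic operation so that two workers retrying the same message cannot both pass.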
Real systems face flaky SMS/email gateways and API throttling. Interviewers expect circuit breakers, health checks, rerouting, and controlled shedding for non-critical work.
- You could implement health probing, circuit breakers, and dynamic routing to fallback providers per channel; fail closed for critical messages only when all providers are down.
- Consider separate retry lanes and DLQs per error class (transient vs. permanent), with exponential backoff and jitter to avoid thundering herds.
- Use SLO-driven autoscaling, queue-lag alerts, and age-of-oldest metrics; shed or soft-fail promo traffic as latencies rise to protect critical SLAs.
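A minimal sketch of a per-provider circuit breaker with fallback routing and jittered backoff is shown below. The thresholds, the provider names in the usage note, and the helper names `pick_provider` and `backoff_delay` are illustrative assumptions. The jitter variant used here is "full jitter" (a uniform draw between zero and the capped exponential delay), which is one common choice among several.

```python
import random
from typing import Dict, List, Optional


class CircuitBreaker:
    """Opens after consecutive failures; allows probe traffic after a cooldown."""

    def __init__(self, failure_threshold: int, cooldown_seconds: float):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown_seconds
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self, now: float) -> bool:
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        # Half-open: once the cooldown has elapsed, let requests probe the provider.
        return now - self.opened_at >= self.cooldown

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self, now: float) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now  # (re)open and restart the cooldown


def pick_provider(providers: List[str], breakers: Dict[str, CircuitBreaker],
                  now: float) -> Optional[str]:
    """Return the first provider, in preference order, whose breaker allows traffic."""
    for name in providers:
        if breakers[name].allow(now):
            return name
    return None  # all providers unavailable: queue, retry later, or alert


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0, rng=random) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0, min(cap, base * 2 ** attempt))
```

For example, with providers `["primary", "fallback"]` and a threshold of two failures, two consecutive failures on `primary` route subsequent sends to `fallback` until the cooldown elapses and `primary` is probed again.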
Relevant Patterns
Relevant patterns that you should know for this question
The system must ingest and dispatch up to 1M notifications/sec with high fan-out. Partitioned, append-only logs and horizontally scalable write paths are essential to avoid bottlenecks while preserving per-user ordering.
Notification delivery is a durable workflow: accept, classify, expand audience, schedule within a window, personalize, rate limit, send, retry, and record. Modeling this as a multi-step process simplifies retries, cancellation on expiry, and observability.
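One way to picture the multi-step workflow is as an explicit state machine per notification, so that retries, cancellation, and expiry are all legal transitions that can be persisted and observed. The states and transition table below are an illustrative assumption, deliberately coarser than the step list above, not a prescribed model.

```python
from enum import Enum


class State(Enum):
    ACCEPTED = "accepted"
    SCHEDULED = "scheduled"
    SENDING = "sending"
    RETRY_WAIT = "retry_wait"
    DELIVERED = "delivered"
    EXPIRED = "expired"
    CANCELED = "canceled"
    FAILED = "failed"


# Allowed transitions (hypothetical). States absent from the keys are terminal.
TRANSITIONS = {
    State.ACCEPTED: {State.SCHEDULED, State.EXPIRED, State.CANCELED},
    State.SCHEDULED: {State.SENDING, State.EXPIRED, State.CANCELED},
    State.SENDING: {State.DELIVERED, State.RETRY_WAIT, State.FAILED},
    State.RETRY_WAIT: {State.SENDING, State.EXPIRED, State.CANCELED, State.FAILED},
}


def advance(current: State, target: State) -> State:
    """Validate a transition and return the new state, or raise ValueError."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Persisting the state on every transition gives each worker a durable checkpoint to resume from, and makes "expired" and "canceled" first-class outcomes rather than silent drops.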
Promotional blasts create hotspots and can starve critical traffic. Priority isolation, sharded queues, quotas, and backpressure are required to prevent head-of-line blocking and protect critical SLOs.
Similar Problems to Practice
Related problems to practice for this question
Promotional messages must be scheduled within validity windows, canceled on expiry, and retried reliably, mirroring the distributed workflow orchestration and time-based triggers of a job scheduler.
Respecting per-provider and per-tenant limits for SMS/email/push is central to notification delivery, paralleling the need for global, consistent rate limiting under high concurrency.
Per-user ordering, at-least-once delivery, fan-out, and multi-device routing are shared concerns with a messaging system, even though notifications are one-way and often provider-mediated.
Question Timeline
See when this question was last asked and where, including any notes left by other candidates.
Late August, 2025: Airbnb, Senior
Early August, 2025: Microsoft, Staff
Early August, 2025: Airbnb, Senior