Design a Distributed Cache System
Design a service that provides distributed caching capabilities.
Asked at: Microsoft, Meta
A distributed cache is an in-memory key-value service (think Redis or Memcached) that sits in front of your databases and APIs to serve data with microseconds-to-milliseconds latency. Users read and write simple records (often JSON or binary blobs) by key, with optional time-to-live (TTL), and the cache offloads the backend while keeping latency predictable. Interviewers ask you to design a distributed cache because it bundles core distributed systems skills into one problem: sharding and client routing, consistency versus freshness, replication and failover, multi-region design, hot-key contention, and the realities of cache invalidation. You’re expected to reason about trade-offs (cache-aside vs write-through), manage thundering herds, and describe concrete mechanisms that keep the system fast and correct under failures at global scale.
Hello Interview Problem Breakdown
Design a Distributed Cache
System design answer key for designing a distributed in-memory cache like Redis, built by FAANG managers and staff engineers.
Common Functional Requirements
Most candidates end up covering this set of core functionalities; a minimal single-node sketch of the API follows the list.
Users should be able to set, get, and delete key–value entries with optional per-key TTLs.
Users should be able to read data with consistently low latency from the nearest available cache node or region.
Users should be able to configure eviction behavior to stay within memory limits (for example, LRU or LFU by default).
Users should be able to invalidate cached data precisely (single key) or broadly (namespace/tag) when underlying data changes.
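To anchor the later sharding and replication discussion, here is one way to sketch this API surface on a single node (class and parameter names are illustrative assumptions; a real distributed cache shards this structure across many nodes):

```python
import time
from collections import OrderedDict

class LRUCache:
    """In-memory key-value store with optional per-key TTL and LRU
    eviction once the entry count exceeds capacity."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self._data = OrderedDict()  # key -> (value, expires_at or None)

    def set(self, key, value, ttl=None):
        expires_at = time.monotonic() + ttl if ttl else None
        self._data[key] = (value, expires_at)
        self._data.move_to_end(key)              # mark most recently used
        while len(self._data) > self.capacity:
            self._data.popitem(last=False)       # evict least recently used

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expires_at = item
        if expires_at is not None and time.monotonic() >= expires_at:
            del self._data[key]                  # lazy TTL expiration
            return None
        self._data.move_to_end(key)              # refresh recency on read
        return value

    def delete(self, key):
        self._data.pop(key, None)
```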
Common Deep Dives
Common follow-up questions interviewers like to ask for this question
Key placement drives both performance and operational stability. Good routing keeps hot keys spread out and lets you add or remove nodes without blowing away most of the cache (and your hit rate). A minimal ring sketch follows this list.
- Prefer consistent hashing with virtual nodes so you remap only a small fraction of keys when the cluster size changes; expose this via a client library to avoid an extra network hop.
- Maintain lightweight cluster metadata (membership, hash ring) in a discovery service and roll out changes gradually; use progressive rebalancing with background copy and serve-while-migrating.
- Warm up new shards (preload the hottest keys or allow a slow-start phase) and throttle migrations to protect p99 latency during reshard events.
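A consistent-hash ring with virtual nodes might look like the sketch below (node names, the vnode count, and the MD5 hash are illustrative choices, not a prescribed implementation):

```python
import bisect
import hashlib

class HashRing:
    """Consistent-hash ring with virtual nodes: adding or removing a node
    remaps only ~1/N of the keyspace instead of rehashing everything."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (hash, node) points on the ring
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def add_node(self, node):
        # Each physical node owns many ring points, which evens out key spread.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def get_node(self, key):
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect(self._ring, (self._hash(key), ""))
        return self._ring[idx % len(self._ring)][1]

ring = HashRing(["cache-1", "cache-2", "cache-3"])
print(ring.get_node("user:42"))  # client library routes straight to this shard
```

Removing "cache-2" leaves keys owned by "cache-1" and "cache-3" exactly where they were, which is what preserves the hit rate during a reshard.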
When a miss falls through to a 100 ms backend, concurrency can amplify into a stampede. You need mechanisms that collapse duplicate work and bound staleness without sacrificing latency. A single-flight sketch follows this list.
- Choose a write policy deliberately: cache-aside for simplicity, write-through for stronger freshness guarantees, and write-back only if you can tolerate durability risk and need ultra-low write latency.
- Implement single-flight/request coalescing so only one miss per key hits the database; others wait briefly or return a stale-but-acceptable value (stale-while-revalidate).
- Use TTL jitter to avoid synchronized expiration, soft TTLs for background refresh, negative caching for known-missing keys, and protect the database with timeouts, backoff, and a circuit breaker.
- Version values (etag, logical timestamp) to prevent stale overwrites and to let refresh workers safely upsert without races.
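One way to sketch single-flight with TTL jitter (the cache.get/cache.set client interface and the per-process locking are assumptions; a multi-node deployment would coalesce within each client or use a distributed lock):

```python
import random
import threading

_key_locks = {}              # key -> lock held by the one loading caller
_guard = threading.Lock()    # protects the lock table itself

def get_with_single_flight(cache, key, loader, ttl=300):
    """Cache-aside read that collapses concurrent misses for one key
    into a single backend call (single-flight / request coalescing)."""
    value = cache.get(key)
    if value is not None:
        return value

    with _guard:
        lock = _key_locks.setdefault(key, threading.Lock())

    with lock:  # only one caller loads; the rest block briefly here
        value = cache.get(key)  # re-check: the winner may have filled it
        if value is None:
            value = loader(key)  # the single trip to the slow backend
            # Jitter the TTL +/-10% so keys cached together don't all
            # expire (and stampede) together.
            cache.set(key, value, ttl=int(ttl * random.uniform(0.9, 1.1)))
        return value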
Global caches should serve reads locally and propagate changes efficiently. Full value replication across WAN links is costly; invalidation-first designs are lighter and usually sufficient. A sketch of an idempotent invalidation consumer follows this list.
- Keep data region-local for reads, and propagate invalidations (key, tag/namespace) via a durable pub/sub bus; accept bounded eventual consistency with a freshness SLO.
- Consider write affinity (pin writes to a home region) to reduce cross-region chatter, and make invalidation messages idempotent and ordered per key/namespace.
- Plan for missed invalidations: periodic snapshot-based reconciliation, version checks on access, or a fallback full refresh path to recover correctness.
- Tune TTLs per region and workload to balance freshness, hit rate, and WAN costs.
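To make "idempotent and ordered per key" concrete, a region-local consumer might track the last applied version per key and treat duplicate or reordered deliveries as no-ops (the event shape and cache.delete call are assumptions for illustration):

```python
# Last version applied per key in this region; a real consumer would
# persist this (or scope it per partition) to survive restarts.
_applied = {}

def on_invalidation(event, cache):
    """Apply one (key, version) invalidation from the cross-region bus.
    Safe to redeliver: stale or duplicate events are ignored."""
    key, version = event["key"], event["version"]
    if _applied.get(key, -1) >= version:
        return  # duplicate or out-of-order delivery: idempotent no-op
    _applied[key] = version
    cache.delete(key)  # next local read misses and refetches fresh data
```

Because most buses deliver at-least-once, this version check is what lets you retry aggressively without ever un-invalidating a newer write.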
Caches must survive node failures seamlessly, but synchronous replication can inflate tail latency. Your design should favor availability while acknowledging that the cache is not the system of record. A read-your-writes routing sketch follows this list.
- Use primary-replica per shard with asynchronous replication to keep writes fast; add health checks and automatic failover (leader election) with minimal split-brain risk.
- After failover, let replicas catch up via replication offsets or change streams; for read-your-writes, route to the primary or enforce a small read-after-write window.
- Decide on durability: most caches skip fsync and accept data loss on failover; if running write-through, the database remains the source of truth.
- Isolate hotspots with additional read replicas or hot-key replication to prevent a single node failure from spiking tail latencies.
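The read-after-write window can be as simple as remembering when each key was last written locally and routing to the primary until the replica has had time to catch up (the primary/replica clients and the 500 ms window are illustrative assumptions; tune the window to observed replication lag):

```python
import time

READ_AFTER_WRITE_WINDOW = 0.5  # seconds; should exceed typical replica lag
_last_write = {}               # key -> monotonic time of last local write

def write(key, value, primary):
    primary.set(key, value)
    _last_write[key] = time.monotonic()

def read(key, primary, replica):
    """Serve from the async replica except right after a write, when it
    may still be behind; that short detour buys read-your-writes without
    paying for synchronous replication on every write."""
    if time.monotonic() - _last_write.get(key, float("-inf")) < READ_AFTER_WRITE_WINDOW:
        return primary.get(key)
    return replica.get(key)
```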
Relevant Patterns
Relevant patterns that you should know for this question
A distributed cache’s primary job is to absorb massive read traffic and deliver sub-millisecond to single-digit millisecond responses. Sharding, replication, client-side routing, and request coalescing are the core techniques to meet aggressive latency and availability SLOs.
Hot keys and synchronized expirations cause thundering herds and tail latency spikes. Single-flight on misses, TTL jitter, negative caching, and selective hot-key replication are essential to keep the system stable under bursty, uneven workloads.
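Negative caching is worth spelling out, since "key does not exist" misses otherwise hit the database on every request. A minimal sketch (the string sentinel and TTL values are illustrative; a production cache would use a dedicated marker value or a flag stored alongside the entry):

```python
_NEGATIVE = "__negative__"  # sentinel meaning "backend said not found"

def get_with_negative_cache(cache, key, loader, ttl=300, neg_ttl=30):
    value = cache.get(key)
    if value == _NEGATIVE:
        return None  # recently confirmed missing: skip the backend entirely
    if value is not None:
        return value
    value = loader(key)
    if value is None:
        # Cache the miss with a short TTL so a later insert shows up quickly.
        cache.set(key, _NEGATIVE, ttl=neg_ttl)
        return None
    cache.set(key, value, ttl=ttl)
    return value
```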
To keep multi-region caches coherent, you need efficient, near–real time propagation of invalidations and updates. A pub/sub or event-stream approach ensures downstream caches react quickly to changes without heavy cross-region reads.
Relevant Technologies
Relevant technologies that could be used to solve this question
Similar Problems to Practice
Related problems to practice for this question
It is effectively the same problem: designing sharding, replication, eviction, persistence trade-offs, and client routing for an in-memory key-value store like Redis.
Both systems battle hot-key contention and strict latency budgets. Sharded counters, consistent hashing, and single-flight patterns apply directly.
A read-heavy product that relies heavily on caching, invalidations, and fan-out. Handling freshness, cache warm-up, and hot-key mitigation parallels multi-region cache design.
Red Flags to Avoid
Common mistakes that can sink candidates in an interview
Question Timeline
See when this question was last asked and where, including any notes left by other candidates.
Late August, 2025 · Manager
"Imagine there is a database that has all the data, and we need to cache it in a caching system that serves 10 different geographic locations with very low latency. How would you set up the cache system, and explain why it will be fast."

Early August, 2025 · Senior Manager

Early August, 2025 · Meta · Staff