Common Problems
Facebook News Feed
Understanding the Problem
Let's assume we're handling uni-directional "follow" relationships as opposed to the bi-directional "friend" relationships which were more important for the earliest versions of Facebook.
Functional Requirements
Core Requirements
- Users should be able to create posts.
- Users should be able to friend/follow people.
- Users should be able to view a feed of posts from people they follow, in chronological order.
- Users should be able to page through their feed.
Below the line (out of scope):
- Users should be able to like and comment on posts.
- Posts can be private or have restricted visibility.
For the sake of this problem (and most system design problems for what it's worth), we can assume that users are already authenticated and that we have their user ID stored in the session or JWT.
Non-Functional Requirements
Core Requirements
- The system should be highly available (prioritizing availability over consistency). We can tolerate up to 2 minutes of staleness from eventual consistency.
- Posting and viewing the feed should be fast, returning in < 500ms.
- The system should be able to handle a massive number of users (2B).
- Users should be able to follow an unlimited number of users, and users should be able to be followed by an unlimited number of users.
Here's how it might look on your whiteboard:
The Set Up
Planning the Approach
The hard part of this design is going to be dealing with the potential of users who are following a massive number of people, or people with lots of followers (a problem known as "fan out"). We'll want to move quickly to satisfy the base requirements so we can dive deep there. For this problem (like many!), following our functional requirements in order provides a natural structure to the interview, so we'll do that.
Defining the Core Entities
Starting with core entities gives us a set of terms to use through the rest of the interview. This also helps us to understand the data model we'll need to support the functional requirements later.
For the News Feed, the primary entities are easy. We'll make an explicit entity for the link between users:
- User: A user in our system.
- Follow: A uni-directional link between users in our system.
- Post: A post made by a user in our system. Posts can be made by any user, and are shown in the feed of users who follow the poster.
In the actual interview, this can be as simple as a short list like this. Just make sure you talk through the entities with your interviewer to ensure you are on the same page.
The API
The API is the primary interface that users will interact with. It's important to define the API early on, as it will guide your high-level design. We just need to define an endpoint for each of our functional requirements.
For our first requirement, we need to create posts:
POST /post/create
Request: { "content": { } }
Response: { "postId": // ... }
We'll leave the content empty to account for rich content or other structured data we might want to include in the post. For authentication, we'll tell our interviewer that authentication tokens are included in the header of the request and avoid going into too much detail there unless requested.
Moving on, we need to be able to follow people.
POST /user/[id]/follow
Request: { }
Here the follow action is binary. We'll assume it's idempotent so it doesn't fail if the user clicks follow twice. We don't need a body but we'll include a stub just in case and keep moving.
Our last requirement is to be able to view our feed.
GET /feed
Response: { "items": Post[] }
This can be a simple GET request. We'll avoid delving into the structure of Posts for the moment to give ourselves time for the juicier parts of the interview.
High-Level Design
1. Users should be able to create posts.
Dealing with this requirement first, we only need to create posts and have them accessible by their ID. We're going to start very basic here and build up complexity as we go.
To start, and since we know this is going to scale, we'll put a horizontally scaled Post Service behind an API gateway/load balancer. We can skip the caching aspect of the pattern since we'll get to that later. Since each host is stateless (it only writes to the database, and we're not dealing with reads yet), we can handle more traffic by simply adding more hosts behind the load balancer.
Users hit our API gateway with a new post request, which gets passed to one of the Post Service hosts, which in turn inserts the post into our database. Easy.
For our database, we can use any key-value store. For this application, let's use DynamoDB for its simplicity and scalability. DynamoDB allows us to provision a nearly limitless amount of capacity provided we spread our load evenly across our partitions.
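As a quick sketch of this write path (using an in-memory dict to stand in for DynamoDB; the table and field names here are illustrative, not prescribed by the design):

```python
import time
import uuid

# In-memory stand-in for the Post table, keyed by postId.
post_table = {}

def create_post(user_id, content):
    # Random UUIDs as partition keys spread writes evenly across
    # partitions, which is exactly the condition DynamoDB needs to scale.
    post_id = str(uuid.uuid4())
    post_table[post_id] = {
        "postId": post_id,
        "userId": user_id,
        "content": content,
        "createdAt": time.time(),
    }
    return post_id

pid = create_post("user-1", {"text": "hello"})
assert post_table[pid]["userId"] == "user-1"
```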
2. Users should be able to friend/follow people.
Functionally, following or friending a person is establishing a relationship between two users. This is a many-to-many relationship, and we can model it as a graph. Typically, for a graph you'll use a purpose-built graph database (like Neo4j) or a triple store. However, for our purposes, we can use a simple key-value store and model the graph ourselves.
Graph databases can be more useful when you need to do traversals or comprehensions of a graph. This usually requires stepping between different nodes, like capturing the friends of your friends or creating an embedding for a recommendation system. We only have simple requirements in this problem so we'll keep our system simple and save ourselves from scarier questions like "how do you scale Neo4j?"
To do this, we'll have a simple Follow table with a partition key of the following user's ID (userFollowing) and a sort key of the followed user's ID (userFollowed). This lets us quickly check whether one user is following another, and also look up all the users they are following economically. We can then add a secondary index keyed on the followed user's ID to allow us to look up all the followers of a given user.
We can put this data in another DynamoDB table for simplicity.
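To make the access patterns concrete, here's a minimal in-memory model of the Follow table and its secondary index (the names and structures are illustrative). Note that re-following is a no-op, which gives us the idempotency we promised in the API:

```python
from collections import defaultdict

# In-memory model of the Follow table's access patterns:
# "partition key" = userFollowing, "sort key" = userFollowed, plus a
# secondary index keyed the other way for "who follows X" lookups.
following = defaultdict(set)   # userFollowing -> {userFollowed}
followers = defaultdict(set)   # secondary index: userFollowed -> {userFollowing}

def follow(src, dst):
    # Idempotent: a duplicate follow is a no-op, so a double click can't fail.
    following[src].add(dst)
    followers[dst].add(src)

follow("alice", "bob")
follow("alice", "bob")  # duplicate click, still a single edge
assert following["alice"] == {"bob"}
assert followers["bob"] == {"alice"}
```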
Ok, we're off to a good start. Let's keep going.
3. Users should be able to view a feed of posts from people they follow.
The challenge of viewing a feed of posts can be broken down into steps:
- We get all of the users who a given user is following.
- We get all of the posts from those users.
- We sort the posts by time and return them to the user.
There is a lot of complexity here, but let's ignore it for now and create a naive base from which we can iteratively solve the problem.
One gap remains: while we have an index on our Follow table to quickly look up all the users a user is following, we don't have an index on our Post table to quickly look up all the posts from a set of users. We can solve this by adding a secondary index on the Post table whose sort key concatenates the user ID and the timestamp of the post. This will allow us to quickly pull all the posts from a set of users in chronological order.
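The composite-key trick can be sketched like this (the key format and padding width are illustrative assumptions): zero-padding the timestamp keeps lexicographic order aligned with chronological order, so a key-range scan returns a user's posts in time order.

```python
# Build a composite sort key of the form "userId#timestamp". Zero-padding
# the timestamp makes string comparison match numeric comparison, which is
# what a key-value store's range query relies on.
def post_sort_key(user_id, ts):
    return f"{user_id}#{ts:013d}"

keys = sorted(post_sort_key("u1", t) for t in [5, 123, 42])
assert keys == ["u1#0000000000005", "u1#0000000000042", "u1#0000000000123"]
```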
Here we have a naive feed service. For a given user, we first request all of the users they follow from the Follow table. Then we request all the posts from those N users from the Post table. Then we sort all of those posts by timestamp and return the results to the user!
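The naive flow can be sketched as follows, with hard-coded stand-ins for the Follow and Post tables (the data shapes are illustrative):

```python
import heapq

# posts_by_user maps userId -> that user's posts, newest-first.
posts_by_user = {
    "bob":   [{"postId": "b2", "ts": 40}, {"postId": "b1", "ts": 10}],
    "carol": [{"postId": "c1", "ts": 30}],
}
follows = {"alice": ["bob", "carol"]}

def build_feed(user_id, limit=25):
    # Step 1: everyone the user follows. Step 2: gather their posts.
    # Step 3: sort by timestamp and return the newest `limit` posts.
    candidates = []
    for followed in follows.get(user_id, []):
        candidates.extend(posts_by_user.get(followed, []))
    return heapq.nlargest(limit, candidates, key=lambda p: p["ts"])

feed = build_feed("alice")
assert [p["postId"] for p in feed] == ["b2", "c1", "b1"]
```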
Several red flags emerge here, and your interviewer is going to expect you to spot them immediately:
- Our user may be following lots of users.
- Each of those users may have lots of posts.
- The total set of posts may be very large because of (1) or (2).
Before we dive into these complexities let's polish off our functional requirements.
4. Users should be able to page through their feed.
We want to support an "infinite scroll"-like experience for our users. To do this, we need to know what they've already seen in their feed and be able to quickly pull the next set of posts. Fortunately, "what they've already seen in their feed" can be described fairly concisely: the timestamp of the oldest post they've looked at. Since users view posts from newest to oldest, that one timestamp tells us where they stopped and provides a great cursor for the next set of posts.
Since we can tolerate up to 2 minutes of eventual consistency, we can accomplish this by keeping a cache of recent posts for the user in a distributed cache.
- When the user requests their feed for the first time (or if we don't have a cache entry), we'll build it using the Post and Follow tables then store a large number of post IDs (say 500) with their timestamps in sorted order in the Feed Cache with a TTL lower than our consistency requirement (in this case, 1 minute).
- When the user requests the next page, they'll send us a timestamp of the oldest post they've seen. We'll look up the user's recent posts in the cache and return posts older than that timestamp to the user. That might look like this:
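A minimal sketch of that cursor logic (data shapes and the page size are illustrative):

```python
# The Feed Cache holds up to ~500 (timestamp, postId) pairs per user,
# sorted newest-first. The client sends the oldest timestamp it has
# seen as the cursor for the next page.
feed_cache = {
    "alice": [(50, "p5"), (40, "p4"), (30, "p3"), (20, "p2"), (10, "p1")],
}

def next_page(user_id, before_ts=None, page_size=2):
    entries = feed_cache.get(user_id, [])
    if before_ts is not None:
        # Only return posts strictly older than the cursor.
        entries = [e for e in entries if e[0] < before_ts]
    return entries[:page_size]

page1 = next_page("alice")                        # first request: no cursor
page2 = next_page("alice", before_ts=page1[-1][0])
assert [p for _, p in page1] == ["p5", "p4"]
assert [p for _, p in page2] == ["p3", "p2"]
```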
This satisfies our initial requirements, but it's not very performant or scalable! Let's optimize.
Potential Deep Dives
1) How do we handle users who are following a large number of users?
If a user is following a large number of users, the queries to the Follow table will take a while to return and we'll be making a large number of queries to the Post table to build their feed. This problem is known as "fan-out": a single request fans out into many more requests. Especially when latency is a concern, fan-out can be a problem, so it makes for a good discussion point in system design interviews.
What can we do instead? Your instinct here should be to think about ways that we can compute the feed results on write or post creation rather than at the time we want to read their feed. And in fact this is a fruitful line of thinking!
Bad Solution: Horizontal scaling via sharding
Great Solution: Adding a feed table
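The core write-time mechanics of a feed table might look something like this sketch (the data shapes are assumptions; the 500-entry cap echoes the cache sizing mentioned earlier): when a post is created, we append its ID to each follower's precomputed feed instead of assembling the feed at read time.

```python
from collections import defaultdict

FEED_CAP = 500  # keep only the newest 500 entries per user

followers = {"bob": ["alice", "carol"]}
feeds = defaultdict(list)  # userId -> [(ts, postId)], newest-first

def on_post_created(author, post_id, ts):
    # Fan-out on write: push the new post into every follower's feed.
    for follower in followers.get(author, []):
        feed = feeds[follower]
        feed.insert(0, (ts, post_id))
        del feed[FEED_CAP:]  # trim to the newest FEED_CAP entries

on_post_created("bob", "b1", 10)
on_post_created("bob", "b2", 20)
assert feeds["alice"] == [(20, "b2"), (10, "b1")]
assert feeds["carol"][0] == (20, "b2")
```

Reads then become a single cheap lookup of the requester's precomputed feed, at the cost of extra work on every write.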
2) How do we handle users with a large number of followers?
When a user has a large number of followers we have a similar fanout problem when we create a post: we need to write to millions of Feed records!
Because we chose to allow some inconsistency in our non-functional requirements, we have a short window of time (< 2 minutes) to perform these writes.
Bad Solution: Shotgun the Requests
Good Solution: Async Workers
Great Solution: Async Workers with Hybrid Feeds
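A hedged sketch of the hybrid idea (the threshold and data shapes are assumptions): accounts over some follower-count threshold are excluded from write-time fan-out, and their recent posts are merged into the precomputed feed at read time instead.

```python
import heapq

CELEB_THRESHOLD = 100_000  # above this, skip fan-out on write

follower_counts = {"megastar": 5_000_000, "bob": 12}
precomputed = {"alice": [(40, "b1")]}          # fan-out-on-write entries
recent_by_author = {"megastar": [(50, "m1")]}  # merged in at read time
big_follows = {"alice": ["megastar"]}          # followed accounts over the threshold

def read_feed(user_id, limit=25):
    # Start from the precomputed feed, then merge in recent posts from
    # high-follower accounts that were never fanned out.
    merged = list(precomputed.get(user_id, []))
    for celeb in big_follows.get(user_id, []):
        merged.extend(recent_by_author.get(celeb, []))
    return heapq.nlargest(limit, merged)  # newest first by timestamp

assert read_feed("alice") == [(50, "m1"), (40, "b1")]
```

This caps the write amplification of any single post while keeping reads cheap for the common case.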
3) How can we handle uneven reads of Posts?
To this point, our feed service has been reading directly from the Post table whenever a user requests to get their feed. For the vast majority of posts, they will be read for a few days and never read again. For some posts (especially those from accounts with lots of followers), the number of reads the post experiences in the first few hours will be massive.
DynamoDB, like many key-value stores, offers nearly infinite scaling provided certain conditions hold, like even load across the keyspace. If Post1 gets 500 requests per second and Post2 through Post1000 get 0 requests per second, that is not even load!
Fortunately for us, Posts are far more likely to be created than they are to be edited. But how do we solve the issue of hot keys in our Post Table?
Good Solution: Post Cache with Large Keyspace
Great Solution: Redundant Post Cache
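One way to realize a redundant post cache, sketched with a plain dict standing in for the distributed cache (the replica count is an assumption): each hot post is written under N distinct keys, and readers pick one at random, spreading load across cache nodes roughly N ways.

```python
import random

REPLICAS = 8  # number of redundant copies per post
cache = {}    # stand-in for a distributed cache

def cache_post(post_id, post):
    # Write the post under REPLICAS distinct keys so the copies land on
    # different cache nodes.
    for i in range(REPLICAS):
        cache[f"{post_id}#{i}"] = post

def read_post(post_id):
    # Pick a replica at random; each key sees ~1/REPLICAS of the reads.
    return cache.get(f"{post_id}#{random.randrange(REPLICAS)}")

cache_post("p1", {"text": "viral"})
assert read_post("p1") == {"text": "viral"}
```

Since posts are rarely edited, the redundant copies almost never need to be invalidated together, which keeps this scheme simple.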
What is Expected at Each Level?
Ok, that was a lot. You may be thinking, “how much of that is actually required from me in an interview?” Let’s break it down.
Mid-level
Breadth vs. Depth: A mid-level candidate will be mostly focused on breadth (80% vs 20%). You should be able to craft a high-level design that meets the functional requirements you've defined, but many of the components will be abstractions with which you only have surface-level familiarity.
Probing the Basics: Your interviewer will spend some time probing the basics to confirm that you know what each component in your system does. For example, if you add an API Gateway, expect that they may ask you what it does and how it works (at a high level). In short, the interviewer is not taking anything for granted with respect to your knowledge.
Mixture of Driving and Taking the Backseat: You should drive the early stages of the interview in particular, but the interviewer doesn’t expect that you are able to proactively recognize problems in your design with high precision. Because of this, it’s reasonable that they will take over and drive the later stages of the interview while probing your design.
The Bar for News Feed: For this question, an E4 candidate will have clearly defined the API endpoints and data model, and landed on a high-level design that is functional and meets the requirements. While they may have some of the "Good" solutions, they would not be expected to cover all the possible scaling edge cases in the Deep Dives.
Senior
Depth of Expertise: As a senior candidate, expectations shift towards more in-depth knowledge — about 60% breadth and 40% depth. This means you should be able to go into technical details in areas where you have hands-on experience. It's crucial that you demonstrate a deep understanding of key concepts and technologies relevant to the task at hand.
Advanced System Design: You should be familiar with advanced system design principles. For example, you should understand how a distributed cache can be used to store precomputed feeds, and be able to discuss detailed scaling strategies (it's ok if this took some probing/hints from the interviewer), including sharding and replication. Your ability to navigate these advanced topics with confidence and clarity is key.
Articulating Architectural Decisions: You should be able to clearly articulate the pros and cons of different architectural choices, especially how they impact scalability, performance, and maintainability. You justify your decisions and explain the trade-offs involved in your design choices.
Problem-Solving and Proactivity: You should demonstrate strong problem-solving skills and a proactive approach. This includes anticipating potential challenges in your designs and suggesting improvements. You need to be adept at identifying and addressing bottlenecks, optimizing performance, and ensuring system reliability.
The Bar for News Feed: For this question, E5 candidates are expected to speed through the initial high-level design so they can spend time discussing, in detail, at least 2 of the deep dives. They should also be able to discuss the pros and cons of different architectural choices, especially how they impact scalability, performance, and maintainability. They would be expected to proactively surface some of the potential issues around fanout and performance bottlenecks.
Staff+
Emphasis on Depth: As a staff+ candidate, the expectation is a deep dive into the nuances of system design — I'm looking for about 40% breadth and 60% depth in your understanding. This level is all about demonstrating that, while you may not have solved this particular problem before, you have solved enough problems in the real world to be able to confidently design a solution backed by your experience.
You should know which technologies to use, not just in theory but in practice, and be able to draw from your past experiences to explain how they’d be applied to solve specific problems effectively. The interviewer knows you know the small stuff (REST API, data normalization, etc) so you can breeze through that at a high level so you have time to get into what is interesting.
High Degree of Proactivity: At this level, an exceptional degree of proactivity is expected. You should be able to identify and solve issues independently, demonstrating a strong ability to recognize and address the core challenges in system design. This involves not just responding to problems as they arise but anticipating them and implementing preemptive solutions. Your interviewer should intervene only to focus, not to steer.
Practical Application of Technology: You should be well-versed in the practical application of various technologies. Your experience should guide the conversation, showing a clear understanding of how different tools and systems can be configured in real-world scenarios to meet specific requirements.
Complex Problem-Solving and Decision-Making: Your problem-solving skills should be top-notch. This means not only being able to tackle complex technical challenges but also making informed decisions that consider various factors such as scalability, performance, reliability, and maintenance.
Advanced System Design and Scalability: Your approach to system design should be advanced, focusing on scalability and reliability, especially under high load conditions. This includes a thorough understanding of distributed systems, load balancing, caching strategies, and other advanced concepts necessary for building robust, scalable systems.
The Bar for News Feed: For a staff+ candidate, expectations are high regarding depth and quality of solutions, particularly for the complex scenarios discussed earlier. A staff candidate will likely cover all of the deep dives (and/or some that we haven't enumerated). They would be expected to surface potential issues in the system and talk about performance tuning for this problem.