Watch the author walk through the problem step-by-step
Watch Video Walkthrough
Watch the author walk through the problem step-by-step
Understanding the Problem
🔍 What is Facebook?
Facebook is a social network centered around “posts” (messages). Users consume posts via a timeline composed of posts from users they follow or more recently, that the algorithm predicts they will enjoy. Posts can be replied, “liked”, or “shared” (sometimes with commentary).
For this problem, we're zooming in on the search experience for Facebook. This question is primed for infrastructure-style interviews where your interviewer wants to understand how deeply you understand data layout, indexing, and scaling.
We're going to assume our interviewer has explicitly told us that we're not allowed to use a search engine like Elasticsearch or a pre-built full-text index (like Postgres Full-Text) for this problem.
This practice of disallowing specific technologies is common enough that you should be prepared for it. The intent here is to test your knowledge of the fundamentals of data organization and indexing, rather than the specifics of a particular technology. Yes, this actually happens!
For our requirements, we not only want to get an idea of what the system will do but also what we don't need it to do. While explicitly enumerating out-of-scope requirements isn't necessary, asking detailed questions about what functionality might be included will help to make sure you avoid any last-minute surprises from your interviewer.
Core Requirements
Users should be able to create and like posts.
Users should be able to search posts by keyword.
Users should be able to get search results sorted by recency or like count.
Below the line (out of scope)
Support fuzzy matching on terms (e.g. search for "bird" matches "ostrich").
Personalization in search results (i.e. the results depend on details of the user searching).
Privacy rules and filters.
Sophisticated relevance algorithms for ranking.
Images and media.
Realtime updates to the search page as new posts come in.
By de-scoping personalization, we dramatically simplify the problem and make caching much more effective. It's a good idea to ask if it's a firm requirement for the system!
In infrastructure-focused questions, non-functional requirements tend to take center stage as your interviewer is looking to see how you identify and solve bottlenecks. For this problem, we know we want search to be fast but we also need to think about how posts are created and made searchable.
Core Requirements
The system must be fast, median queries should return in < 500ms.
The system must support a high volume of requests (we'll estimate this later).
New posts must be searchable in a short amount of time, < 1 minute.
All posts must be discoverable, including old or unpopular posts. (we can take more time for these)
The system should be highly available.
If you have a lot of data in your system, there's a good chance you'll have "hot" and "cold" data. Hot data is readily accessed and needs to be served quickly, often from memory. Cold data is infrequently accessed and can be served from disk, a remote database, or even tape.
In this case the non-functional requirement that we can take more time for old or unpopular posts is foreshadowing a potential design decision later in the interview.
Here's how these might be shorthanded in an interview. Note that out-of-scope requirements usually stem from questions like "do we need to handle privacy?". Interviewers are usually comfortable with you making assertions "I'm going to leave privacy out of scope for the start" and will correct you if needed.
Requirements
Scale estimations
For this problem, it's obvious we're dealing with a large-scale system. What we need to know in order to make informed design decisions is a few parameters: how much data are we storing, how often are we writing to the system, and how frequently are we reading from it?
A common mistake here is to fixate on the search aspect of this problem and assume reads are the most important part. So let's do a bit of estimates to come up with ballpark numbers which will help us decipher the nature of the system we're designing.
We'll assume Facebook has 1B users. On average each user produces 1 post per day (maybe 20% do 5 posts and 80% do 0 posts) and Likes 10 posts. For ease of calculation, we'll assume 100k seconds in a day (rounding up for convenience).
Noteworthy here is that the number of likes is significantly more than the number of post creations. I'm going to note that for my interviewer so they see how I'm thinking about the problem.
While this is a lot (and may burst the 10x this value or more), our system is write-heavy vs read-heavy. Another thing to note.
Finally let's evaluate the storage requirements of the system. First let's assume Facebook is 10 years old and that the full post metadata is 1kb (probably an overestimate).
Wow, that's a lot of storage! We're going to need to find some way to constrain this in our search system.
For this particular problem, doing these estimations helps us to determine the nature of the design. Reflections like those above help your interviewer understand how your estimations are affecting your design and demonstrate your experience - don't estimate for the sake of estimation!
Scale Estimates
The Set Up
Planning the Approach
As we progress in the interview, we'll want to refer back to the requirements to ensure that we're covering everything we established we need to cover. We'll solve each of our functional requirements one by one and then come back to address scale and our non-functional requirements as deep-dives.
Getting a handle on the APIs for our system is going to provide a scaffolding for the rest of our solution. Most of the time, our job is to make the connection between APIs and storage systems or dependencies, and this case is no different.
Our APIs are straightforward. We have two paths: a query path for searching and a write path for creating posts and likes.
In a real system we might be consuming from a Kafka stream or some other event bus for both of these events, but for clarity we'll create separate API endpoints for each and can let our interviewer know that we'd elect to merge these into the rest of the system as appropriate.
API
With these APIs defined, we can start to see the shape of our system. Writes come in via the two write endpoints and are written to a database. Queries come in via the search endpoint and read from the database. Good start!
Remember, for our high-level design we're walking through our functional requirements and trying to build a simple system that satisfies them before we go deep into optimizations in our deep dive. Starting simple will avoid rabbit-holing on unimportant pieces and by covering our requirements quickly we can guarantee the system doesn't have any missing pieces.
1) Users should be able to create and like posts.
Our first requirement is on the write path, allowing users to create and like posts. We need to be able to accept these calls and write them to our database.
Our search system is just a part of the larger social network product, so it's reasonable to assume the existence of an internal "Post Service" and "Like Service" that are responsible for taking client requests. Since we're only designing search, we take for granted that these other services exist and are working as expected.
Simple ingestion-side design
While we had previously calculated that the scale of Likes and Post creation was very different, the operation of our Ingestion service is very simple at this time, so we'll leave it as a single service and note the potential concern to our interviewer.
It's a good idea to acknowledge for your interviewer that this solution is obviously over-simplified and that you'll come back to solve its problems. Interviewers are going to be critiquing your design and looking for flaws - the longer you leave them with unacknowledged issues, the more they assume you didn't see them!
2) Users should be able to search posts by keyword.
Next, we need to allow users to actually search. To do this we need to set up the read leg of our system. We'll start by adding the basic machinery necessary to serve the user requests: an API gateway for authentication and rate limiting and a horizontally scaled search service which accepts the request then queries the index.
Allow Search
The Index is doing a lot of the work for search here. Let's talk about that a bit more. In order to allow users to search posts by keyword, we need to be able to find all of the posts which contain that keyword. With trillions of posts and petabytes of data, this is not a small feat! What are our options?
3) Users should be able to get search results sorted by recency or like count.
Next we'll move to our last requirement which is to be able to sort the results by either recency or like count - that is, if we search for "Taylor" and sort by recency, we want to be able to show the most recent posts that were created. If we sort by likes, we want to see the top liked posts.
With the core functional requirements met, it's time to dig into the non-functional requirements and other optimizations via deep dives.
Ideally, you've identified potential scaling bottlenecks and issues up to this point. Those make excellent anchor points for your deep dives. You should prepare to be quizzed by your interviewer about various relevant aspects of the system, but some interviewers will expect you to spot your own deficiencies and resolve them.
1) How can we handle the large volume of requests from users?
Our in-memory reverse-index based system is quite fast, but we're going to be handling a lot of traffic. We had some convenient requirements earlier that might make our job even easier. Two requirements in particular:
We do not have personalization, so if you and I are searching for the same thing with the same parameters, we should get the same results!
We have up to 1 minute before a post needs to appear in the search results.
Caching sticks out here as the obvious tool for us! Any time we can tolerate stale data and we have duplicate requests coming through, we should consider whether caching is appropriate.
Pattern: Scaling Reads
Caching is a powerful tool to use for systems where we need to scale reads. Read our breakdown on this pattern for more strategies when you're designing a system for heavy read traffic.
2) How can we handle multi-keyword, phrase queries?
While sometimes our users might be searching for a single word like "Dog", they may also be searching for strings like "Taylor Swift" or even full excerpts like "All your base are belong". Many search engines support an entire boolean language for specifying queries like "+Taylor +Swift -Red" - let's assume the request here is simple: how can we answer queries for multi-word phrases?
3) How can we address the large volume of writes?
The write volume for our system is very high. We have two sources of writes: post creation and likes on those posts. Let's talk about them separately.
Post Creation
When a post is created we call to the ingestion service which tokenizes the post (breaks it down into keywords) and then writes the postId to the Creation and Likes indexes. If a post has 100 words, we might trigger 100+ writes, one for each word.
If a lot of posts are created simultaneously, our ingestion service might get overwhelmed. In the worst case, we might lose posts or events.
We can address this by adding more capacity to our ingestion service and partitioning the incoming requests. By using a log/stream like Kafka, we can fan out the creation requests to multiple ingestion instances and partition the load. We can also buffer requests so that we can handle bursts of post creation. Finally, we can scale out our index by sharding the indexes by keyword. This way writes to the indexes are spread across many Redis instances.
Scaling Ingestion
Pattern: Scaling Writes
Scaling writes is a frequent deep-dive question across interviews. It's good to have a toolkit to use for handling these scenarios. In our Scaling Writes pattern breakdown, we walk you through the options for handling these scenarios.
We've partly handled the ingestion of new posts, but our need to index like counts is still the elephant in the room. With the existing system we need to make many writes to our indexes every time a like happens, and, as we discussed earlier, likes happen far more frequently than post creations.
This is a low-hanging opportunity to optimize our system. How can we reduce the number of writes we have to do to our indexes?
4) How can we optimize storage of our system?
This system is indexing an impressive amount of data, but our users are likely only interested in a vanishingly small portion of it. How can we optimize storage?
First, we can put caps and limits on each of our inverted indexes. We probably won't need all 10M posts with "Mark" contained somewhere in their contents. By keeping our indexes to 1k-10k items, we can reduce the necessary storage by orders of magnitude.
Next, most keywords won't be searched often or even at all. Based on our search analytics, we can run a batch job to move rarely used keywords to a less frequently accessed but ultimately cheaper storage. One way to do this is to move these keyword indexes to cold, blob storage like S3 or R2.
On a regular basis we'll determine which keywords were rarely (or not at all) accessed in the past month. We'll move these indexes from our in-memory Redis instance to a blob in our blob storage. When the index needs to be queried, we'll first try to query Redis. If we don't get our keyword there, we can query the index from our blob storage with a small latency penalty.
Blob Storage for Cold Indexes
Our full design might look like this, although most candidates (especially Mid-level) won't get through all of these deep dives.
There’s a lot of meat to this question! Your interviewer may even have you go deeper on specific sections. What might you expect in an actual assessment?
Mid-Level
Breadth vs. Depth: A mid-level candidate will be mostly focused on breadth. As an approximation, you’ll show 80% breadth and 20% depth in your knowledge. You should be able to craft a high-level design that meets the functional requirements you've defined, but the optimality of your solution will be icing on top rather than the focus.
Probing the Basics: Your interviewer will spend some time probing the basics to confirm that you know what each component in your system does. For example, if you're using a cache, expect to be asked about your eviction policy. Your interviewer will not be taking anything for granted with respect to your knowledge.
Mixture of Driving and Taking the Backseat: You should drive the early stages of the interview in particular, but your interviewer won't expect that you are able to proactively recognize problems in your design with high precision. Because of this, it’s reasonable that they take over and drive the later stages of the interview while probing your design.
The Bar for Post Search: For this question, interviewers expect a mid-level candidate to have clearly defined the API endpoints and data model, and created both the sides of the system: ingestion and query. In instances where the candidate uses a “Bad” solution, the interviewer will expect a good discussion but not that the candidate immediately jumps to a great (or sometimes even good) solution.
Senior
Depth of Expertise: As a senior candidate, your interviewer expects a shift towards more in-depth knowledge — about 60% breadth and 40% depth. This means you should be able to go into technical details in areas where you have hands-on experience.
Advanced System Design: You should be familiar with advanced system design principles. Certain aspects of this problem should jump out to experienced engineers (write volume, storage, lots of duplicate inputs) and your interviewer will be expecting you to have reasonable solutions.
Articulating Architectural Decisions: Your interviewer will want you to clearly articulate the pros and cons of different architectural choices, especially how they impact scalability, performance, and maintainability. You should be able to justify your decisions and explain the trade-offs involved in your design choices.
Problem-Solving and Proactivity: You should demonstrate strong problem-solving skills and a proactive approach. This includes anticipating potential challenges in your designs and suggesting improvements. You need to be adept at identifying and addressing bottlenecks, optimizing performance, and ensuring system reliability.
The Bar for Post Search: For this question, a senior candidate is expected to speed through the initial high level design so we can spend time discussing, in detail, how to optimize the critical paths. At a bare minimum the solution should have both ingestion and query detailed, with a proper reverse index, and appropriate caching strategy.
Staff+
Emphasis on Depth: As a staff+ candidate, the expectation is a deep dive into the nuances of system design — the interviewer is looking for about 40% breadth and 60% depth in your understanding. This level is all about demonstrating that "been there, done that" expertise. You should know which technologies to use, not just in theory but in practice, and be able to draw from your past experiences to explain how they’d be applied to solve specific problems effectively. Your interviewer knows you know the small stuff (REST API, data normalization, etc) so you can breeze through that at a high level so we have time to get into what is interesting.
High Degree of Proactivity: At this level, your interviewer expects an exceptional degree of proactivity. You should be able to identify and solve issues independently, demonstrating a strong ability to recognize and address the core challenges in system design. This involves not just responding to problems as they arise but anticipating them and implementing preemptive solutions.
Practical Application of Technology: You should be well-versed in the practical application of various technologies. Your experience should guide the conversation, showing a clear understanding of how different tools and systems can be configured in real-world scenarios to meet specific requirements.
Complex Problem-Solving and Decision-Making: Your problem-solving skills should be top-notch. This means not only being able to tackle complex technical challenges but also making informed decisions that consider various factors such as scalability, performance, reliability, and maintenance.
Advanced System Design and Scalability: Your approach to system design should be advanced, focusing on scalability and reliability, especially under high load conditions. This includes a thorough understanding of distributed systems, load balancing, caching strategies, and other advanced concepts necessary for building robust, scalable systems.
The Bar for Post Search: For a staff+ candidate, expectations are set high regarding depth and quality of solutions, particularly for the complex scenarios discussed earlier. Your interviewer will be looking for you to be getting through several of the deep dives, showcasing not just proficiency but also innovative thinking and optimal solution-finding abilities. A crucial indicator of a staff+ candidate's caliber is the level of insight and knowledge they bring to the table. A good measure for this is if the interviewer comes away from the discussion having gained new understanding or perspectives. There are a lot of different ways to solve this problem.
Test Your Knowledge
Take a quick 15 question quiz to test what you've learned.
Mark as read
Your account is free and you can post anonymously if you choose.
Your account is free and you can post anonymously if you choose.