The System Design Interview: Designing Twitter
By Evan King
Aug 16, 2023
System design interviews can be intimidating, especially if you're like most candidates who have never designed large-scale systems before. Perhaps even more stressful, the last time you interviewed might have been several years ago, and that was for a junior position where system design interviews weren't even part of the process. So, facing this for the first time, it's natural that you might feel apprehensive.
But there's no need to worry. Take a deep breath; I'm here to help.
As a former Staff Engineer at Meta, I conducted hundreds of interviews. Over time, I've seen exactly what makes the difference between getting hired and falling short, and I'm here to let you in on a few of those secrets.
In this blog, we're going to delve into one of the most common system design interview questions out there: design Twitter (or, umm, X?). Whether you're a novice or looking to brush up on your skills, this guide should equip you with the tools you need to excel in your next interview.
What is a System Design interview?
Let's start from the beginning. Unlike other interview formats, System Design interviews are unique in that they are largely candidate-driven. You have the wheel, so it's important to know where you're heading!
You should follow this simple roadmap:
- Understand the Requirements: Before you get started, you need to ensure you know exactly what it is you're building. This includes defining functional and non-functional requirements, asking questions, and estimating the scale of the system.
- High-level Design: Often using a whiteboard (virtual in many cases), you'll draw an overview of the system, outlining its architecture and the interactions between components, providing a general understanding of the structure.
- Data Model: Next, you'll want to define the data model and the relationships between the data, including designing the database schema and selecting the type of database best suited for your system.
- Core Components: These are the heart and soul of your system. Identify the primary modules and functions that are vital to your solution, describing how they interact and collaborate to fulfill the system's purpose.
- Scalability: Explain how the system can grow, outlining methods to handle increased loads, identifying potential bottlenecks, and addressing strategies to ensure that your design can expand to meet future demands.
- Security, Monitoring & Testing: Outline the measures to ensure data integrity, privacy, and protection from threats, and describe how you'll monitor system performance and health, along with the testing approaches to make sure everything works as intended.
Designing a Microblogging Service like Twitter
It's a brisk Wednesday morning, your cup of tea is steaming on your desk, and you're ready to tackle your day. You've just logged in for your System Design interview, and after some small talk, your interviewer jumps straight in: "So, for today's interview, I want you to design a microblogging service, something similar to Twitter." It's game time.
If you want to practice this very question with your own AI interviewer, navigate to Mock Interviews in the navbar and click "Microblogging Service like Twitter." I highly recommend it!
Start by understanding the requirements
Before we get into the nitty-gritty, let's first define what our system is supposed to do. In a system design interview, an interviewer might not always spell out every requirement. It's up to you, as the interviewee, to extract these details through targeted questions. Ensuring that you understand the requirements correctly is a crucial step and one that an interviewer wants to see you perform.
Here, let's walk through a probable set of functional requirements for our microblogging service.
- Users should be able to create an account and log in.
- Users should be able to post short messages (tweets).
- Users should be able to follow other users.
- Users should be able to view tweets from users they follow on their homepage.
- Users should be able to like, retweet, and reply to tweets.
- Users should be able to search for other users or specific content.
Now, let's move onto our non-functional requirements:
- The system should be able to support millions of users.
- The system should be able to handle a high volume of tweets, likes, and retweets.
- The system should be highly available, ensuring users can access the service at all times.
- The system should ensure the security and privacy of the users' data.
- The system should have low latency, ensuring that tweets are delivered quickly to users' feeds.
- The system should ensure data consistency, so that when a user posts a tweet, it becomes visible to all their followers within a short time (eventual consistency is acceptable).
Estimating usage and system capacity
Now that we know what our system is supposed to do, let's estimate how much work it needs to handle. We need to gauge the number of users and the amount of storage our system will require.
Let's assume our platform becomes a hit and we capture a significant share of social media users. If there are around 3.5 billion daily active users across all social media platforms and about 3% of them post short messages each day, we're looking at Daily Active Users (DAU) in the ballpark of 105M.
For storage requirements, let's do some quick math. Assume each user generates around 2.5KB per day: user profile data (about 1KB), tweets (about 1KB on average), and some metadata & analytics (say 500 bytes). At 105M DAU, that's roughly 262GB of new data per day. Keeping in mind backup, redundancy, and future growth, we can safely budget on the order of 800GB of storage per day.
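The estimate above can be sketched as a quick back-of-envelope script, using the per-user component sizes listed (profile ~1KB, tweet ~1KB, metadata ~500 bytes) and assuming a 3x replication factor for redundancy:

```python
# Back-of-envelope capacity estimate for the microblogging service.
SOCIAL_MEDIA_DAU = 3.5e9   # daily active users across all social platforms
CAPTURE_RATE = 0.03        # ~3% of them post short messages each day
REPLICATION_FACTOR = 3     # copies kept for backup/redundancy (assumption)

dau = SOCIAL_MEDIA_DAU * CAPTURE_RATE   # ~105M daily active users

# Per-user data written per day (component sizes from the text).
profile_bytes = 1_000
tweet_bytes = 1_000
metadata_bytes = 500
record_bytes = profile_bytes + tweet_bytes + metadata_bytes   # 2.5 KB

raw_per_day = dau * record_bytes                      # new data per day
with_replication = raw_per_day * REPLICATION_FACTOR   # storage to budget

print(f"DAU: {dau / 1e6:.0f}M")
print(f"New data/day: {raw_per_day / 1e9:.1f} GB raw, "
      f"{with_replication / 1e9:.1f} GB with replication")
```

Running the numbers gives roughly 262GB of raw data per day, or just under 800GB once replication is factored in.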
Let's just take a moment here. Interviewers love to throw curve-balls, and one might come in the form of the CAP theorem. You might get asked where to put emphasis for our system: consistency or availability? For a microblogging service like ours, we prioritize availability. Users expect our service to be up and running, and to respond quickly. It's okay if our system takes a while to become consistent (tweet updates might exhibit slight inconsistencies for a brief period).
In the majority of cases, especially for non-financial systems, availability is prioritized over consistency. When in doubt, favor availability.
Sketching the high-level design
Time for the fun part! Grab your dry erase marker and go to the whiteboard: it's time to sketch the high-level design. Of course, Twitter is a large and complex system; you won't be able to cover everything in under an hour. Instead, concentrate on the most important components of the system and how they scale.
The major components might include:
- Clients: The clients include a Webapp Client, iOS Client, and Android Client. These are the user-facing parts of our system. They are responsible for sending user requests to the server and displaying server responses to the user. The clients are crucial because they represent the user interface of our system. Their performance and usability directly impact the user experience.
- Load Balancer: The Load Balancer is a crucial component that distributes incoming network traffic across multiple servers to ensure no single server becomes a bottleneck. It receives requests from the clients and forwards them to the appropriate services. The Load Balancer ensures that our system can handle large volumes of traffic without any single server getting overwhelmed, thus maintaining the system's performance and availability.
- Rate Limiter: The Rate Limiter controls the rate of requests a client can make to the server. It helps to prevent abuse and keeps our service available by limiting the number of requests a client can make in a certain time frame. The Rate Limiter is important for maintaining the system's stability and preventing resource exhaustion due to malicious attacks or heavy traffic.
- Tweet CRUD Service: This service is responsible for managing tweets, including creating, reading, updating, and deleting tweets. It interacts with the Tweets DB to store and retrieve tweets, and it communicates with other services like the Search Service and the Push Notification Service to update search indices and send notifications. This service is crucial as it handles the core functionality of our system, which is managing tweets.
- Timeline Service: The Timeline Service generates the timeline for each user. It fetches the latest tweets from the users that a particular user follows and interacts with the Timeline & Trending Cache to retrieve and store frequently accessed timelines. This service is critical as it provides the main functionality that users interact with: viewing their timeline.
- Profile Service: The Profile Service manages user profiles. It interacts with the User DB to store and retrieve user profiles, and it communicates with the Authn/Authz component to handle user authentication and authorization. This service is important as it manages user data and ensures that only authenticated and authorized users can access the system.
- DM Service: The DM (Direct Message) Service is responsible for managing direct messages between users. It interacts with the DM DB to store and retrieve direct messages. This service is important as it enables private communication between users, a key feature of our system.
- Databases: We have several databases in our system, each serving a specific purpose. The User DB (SQL) stores user profile information, the Tweets DB (NoSQL) stores tweets, the DM DB (NoSQL) stores direct messages, and the User Relationships (NoSQL, Graph DB) stores the relationships between users (who follows whom). These databases are crucial for storing and retrieving data efficiently and reliably.
- Caches: We use the Timeline & Trending Cache (Redis) to store frequently accessed data like user timelines and trending topics. Caches improve our system's performance by reducing the load on our databases and providing quick access to frequently accessed data.
- CDN (Content Delivery Network): The CDN delivers static assets to the clients. It helps to reduce latency by serving assets from the server closest to the client. The CDN enhances the user experience by ensuring fast delivery of static content like images and scripts.
- Search Service: The Search Service handles search queries. It uses Lucene, a powerful full-text search library, to provide fast and accurate search results. The Search Service is essential for enabling users to find specific content or users.
- Push Notification Service: The Push Notification Service sends real-time notifications to users. It can send notifications for various events like new followers, likes, retweets, and direct messages. This service enhances the user experience by providing real-time updates about activities related to the user.
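To make one of these components concrete, here is a minimal token-bucket sketch of the Rate Limiter described above. The class name, rate, and capacity are illustrative; a production limiter would typically live in a shared store like Redis so limits apply across servers:

```python
import time

class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` per second."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens based on elapsed time, capped at bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per client, keyed by user ID or IP in a real deployment.
bucket = TokenBucket(rate=5, capacity=10)
allowed = [bucket.allow() for _ in range(15)]  # the burst drains the bucket
```

After the first ten requests exhaust the bucket, subsequent requests are rejected until tokens refill, which is exactly the back-pressure behavior the Rate Limiter provides for the system.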
What about the API?
In an interview, translating requirements into API endpoints might seem overly detailed, but it's an exercise that demonstrates how detail-oriented you are in understanding and explaining requirements. Here are some of the major endpoints:
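One hypothetical set of endpoints, derived from the functional requirements and expressed as a simple routing table (the paths, versioning, and naming are illustrative, not a prescribed API):

```python
# A hypothetical REST surface mapping each functional requirement to a route.
API_ENDPOINTS = {
    "POST /v1/users":                "create an account",
    "POST /v1/sessions":             "log in",
    "POST /v1/tweets":               "post a tweet",
    "GET  /v1/users/{id}/timeline":  "view home timeline",
    "POST /v1/users/{id}/follow":    "follow a user",
    "POST /v1/tweets/{id}/like":     "like a tweet",
    "POST /v1/tweets/{id}/retweet":  "retweet",
    "POST /v1/tweets/{id}/replies":  "reply to a tweet",
    "GET  /v1/search?q={query}":     "search users or content",
}

for route, purpose in API_ENDPOINTS.items():
    print(f"{route:32s} -> {purpose}")
```

Each functional requirement from earlier maps to at least one route, which is a quick sanity check that the API actually covers the agreed scope.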
Choosing the Right Database
When designing a microblogging service like Twitter, choosing the right database is a crucial decision that can significantly impact the system's performance and scalability. Given the nature of our service, we will be dealing with a variety of data types, including user profiles, tweets, direct messages, and user relationships. To handle these diverse data types efficiently, we will adopt a polyglot persistence approach, which involves using different databases for different types of data.
- User DB (SQL): User profile information, such as username, email, bio, profile picture, etc., is structured and relational. For example, a user can have multiple email addresses, each of which must map back to exactly one account. Therefore, a relational database like MySQL or PostgreSQL is a suitable choice for storing user profile information. These databases support ACID properties (Atomicity, Consistency, Isolation, Durability), ensuring data consistency and integrity.
- Tweets DB (NoSQL): Tweets, on the other hand, are unstructured and non-relational. They can contain various types of content, including text, images, videos, and links. Moreover, tweets are typically write-heavy and need to be delivered in real-time to a large number of users. Therefore, a NoSQL database like Cassandra or DynamoDB, which provides high write throughput and low latency, is a better fit for storing tweets.
- DM DB (NoSQL): Direct messages (DMs) are similar to tweets in that they are unstructured and non-relational. However, DMs are private between two users and do not need to be delivered in real-time to a large number of users. Therefore, a document-oriented NoSQL database like MongoDB, which provides flexible schemas and efficient querying, is a good choice for storing DMs.
- User Relationships (NoSQL, Graph DB): User relationships, such as who follows whom, are highly relational and can be represented as a graph. Therefore, a graph database like Neo4j or Amazon Neptune, which is designed to store and traverse relationships, is an ideal choice for storing user relationships.
By using a polyglot persistence approach, we can leverage the strengths of each type of database to handle specific types of data efficiently. However, this approach also introduces complexity, as we need to manage multiple databases and ensure data consistency across them. To mitigate this complexity, we can use database management tools and implement robust data pipelines.
Calling out specific technologies (e.g., DynamoDB) can impress the interviewer, but tread carefully. Only mention technologies that you have decent familiarity with. You may face follow-up questions like, "Why DynamoDB instead of Cassandra?" or, even more challenging, "How do you plan to deal with Cassandra's limited secondary index support, compared to DynamoDB's built-in global secondary indexes?" Being prepared to discuss these specific differences and how you might address or leverage them in your design will showcase your deep understanding and ability to think strategically about technology choices. But if you don't have the necessary experience, you're setting yourself up to get egg on your face.
Storing and Retrieving Tweets
Storing and retrieving tweets efficiently is a critical aspect of our microblogging service. This is why we chose to store our Tweets in a NoSQL DB. These databases are designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.
The data model for our tweets would be designed to optimize the most common operations. Tweets could be stored in a collection (or a similar construct based on the NoSQL DB choice) with attributes like TweetID, UserID, TweetText, and TimeStamp.
In databases like MongoDB, the TweetID would be a unique identifier (often the default _id field). For databases like Cassandra, the primary key could be a combination of UserID and TimeStamp, allowing for efficient retrieval of a user's tweets in reverse chronological order.
To support efficient retrieval of individual tweets, databases like DynamoDB allow for secondary indexes on the TweetID attribute, enabling quick fetching of specific tweets.
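As a toy illustration of the layout described above, here is an in-memory stand-in for a wide-column store: tweets are partitioned by UserID, returned newest-first (like a `(UserID, TimeStamp DESC)` primary key in Cassandra), with a secondary index on TweetID for point lookups. All names are illustrative:

```python
from collections import defaultdict

class TweetStore:
    """In-memory sketch of a wide-column store partitioned by UserID."""

    def __init__(self):
        self._by_user = defaultdict(list)  # UserID -> [(TimeStamp, TweetID, Text)]
        self._by_id = {}                   # TweetID -> row (secondary index)

    def put(self, user_id, tweet_id, text, ts):
        self._by_user[user_id].append((ts, tweet_id, text))
        self._by_id[tweet_id] = (user_id, ts, text)

    def recent(self, user_id, limit=20):
        # Clustering order: reverse chronological, newest tweets first.
        return sorted(self._by_user[user_id], reverse=True)[:limit]

    def get(self, tweet_id):
        # Point lookup via the secondary index on TweetID.
        return self._by_id.get(tweet_id)

store = TweetStore()
store.put("alice", "t1", "first!", ts=100)
store.put("alice", "t2", "second", ts=200)
latest = store.recent("alice", limit=1)  # the most recent tweet, t2
```

A real store would sort on write rather than on read, but the access patterns are the same: cheap "latest N tweets for a user" queries plus direct fetches by TweetID.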
To further optimize retrieval, we would use caching. Memcached or Redis could store the most recent or most frequently accessed tweets. This would reduce the load on the database and provide faster response times.
For write-heavy workloads, techniques like write-behind caching or batch processing would be implemented. Write-behind caching delays the writing of data to the database until it's necessary. On the other hand, batch processing groups together multiple write operations to perform them all at once.
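A minimal write-behind sketch of the idea above: writes land in the cache immediately (so readers see them right away) and are flushed to the backing database in batches. The flush trigger and batch size are assumptions; real systems also flush on a timer and handle flush failures:

```python
class WriteBehindCache:
    """Buffer writes in memory and persist them to the backing store in batches."""

    def __init__(self, db: dict, batch_size: int = 3):
        self.db = db          # stand-in for the Tweets DB
        self.cache = {}       # latest values, served to readers immediately
        self.pending = []     # (key, value) writes not yet persisted
        self.batch_size = batch_size

    def write(self, key, value):
        self.cache[key] = value          # reads see the new value at once
        self.pending.append((key, value))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        # In a real system this would be one batched round-trip to the DB.
        for key, value in self.pending:
            self.db[key] = value
        self.pending.clear()

db = {}
cache = WriteBehindCache(db, batch_size=3)
cache.write("t1", "hello")
cache.write("t2", "world")
# Nothing persisted yet; the third write triggers a batch flush to the DB.
cache.write("t3", "!")
```

The trade-off is durability: data sitting in the pending buffer is lost if the cache node dies, which is why write-behind is usually paired with replication or a write-ahead log.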
For scalability, database sharding or partitioning would be essential. This would involve distributing data across multiple nodes or clusters. Depending on the NoSQL database in use, we could partition data based on attributes like UserID or TweetID. This distribution would help balance the load and cater to more requests.
To cater to the need for full-text search on the tweet text, we could integrate with technologies like Elasticsearch (which is built on top of Lucene). This would enable efficient searching through the content of the tweets.
Data Durability and Availability
To ensure data durability and availability, replication across multiple nodes or clusters is crucial. Many NoSQL databases come with built-in replication strategies. This ensures that even if one node or cluster fails, the data remains accessible from others.
Generating timelines
Let's get into the details of generating timelines. After all, it's what our service is all about. The good news? You don't need to reinvent the wheel: a combination of fan-out on write and fan-out on read techniques gets this done for us.
In the fan-out on write strategy, as soon as a user posts a tweet, the system will immediately push this tweet to all the followers' timelines. This approach ensures real-time delivery of tweets, but it might not be scalable if a user has a large number of followers or if there are many tweets being posted at the same time.
The fan-out on read strategy, on the other hand, is more scalable. In this approach, when a user posts a tweet, it's stored in the user's timeline only. When a follower logs in or refreshes their timeline, the system will pull the latest tweets from all the people they follow. This approach reduces the write load on the system, but it might increase the read load and delay the delivery of tweets.
A hybrid approach could be used to balance the trade-offs between these two strategies. For example, we could use the fan-out on write strategy for users with a small number of followers and the fan-out on read strategy for users with a large number of followers. We could also cache the most recent tweets to reduce the read load.
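The hybrid approach might look like this in miniature: fan-out on write for typical users, fan-out on read for accounts above a follower threshold. The threshold of 2 is only so the example fits on a page; in practice it might be tens of thousands of followers:

```python
CELEBRITY_THRESHOLD = 2  # illustrative; real systems use a much larger cutoff

followers = {"alice": ["bob", "carol"], "star": ["bob", "carol", "dave"]}
timelines = {u: [] for u in ["bob", "carol", "dave"]}  # precomputed feeds
celebrity_tweets = {"star": []}                        # pulled at read time

def post_tweet(author, tweet):
    if len(followers[author]) > CELEBRITY_THRESHOLD:
        # Fan-out on read: store once, merge into feeds when followers read.
        celebrity_tweets.setdefault(author, []).append(tweet)
    else:
        # Fan-out on write: push to every follower's timeline immediately.
        for f in followers[author]:
            timelines[f].append((author, tweet))

def read_timeline(user):
    feed = list(timelines[user])
    # Merge in tweets from any followed "celebrity" accounts at read time.
    for celeb, tweets in celebrity_tweets.items():
        if user in followers.get(celeb, []):
            feed.extend((celeb, t) for t in tweets)
    return feed

post_tweet("alice", "hi")       # pushed to bob and carol immediately
post_tweet("star", "big news")  # stored once, merged on read
feed = read_timeline("bob")     # contains both tweets
```

This keeps write amplification bounded (a celebrity tweet is written once, not millions of times) while ordinary users still get cheap, precomputed reads.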
To handle the distribution in a scalable way, we can use a distributed message queue like Kafka or RabbitMQ. The producer (the user who posts a tweet) will send messages (tweets) to the queue, and the consumers (the followers) will consume these messages from the queue. This approach ensures that the system can handle a large volume of tweets and distribute them to all followers in a timely manner.
Ensuring scalability and identifying your bottleneck
Let's circle back to scale: can our service handle the rigors of widespread usage? This is a question we need to answer by focusing on horizontal scaling and caching.
As our user base grows, we need to distribute data across multiple servers to accommodate the higher load. Here's where our handy-dandy NoSQL database shines, due to its propensity for horizontal scaling. For an even distribution of incoming requests across servers, a trusty load balancer is our superhero.
However, just having more servers isn't going to help if our bottleneck is read operations on the database. If our service is read-heavy, hammering those read operations on the database may mean an unhappy user. For this, we look to the superhero's sidekick: caching. A caching layer between our app and database reduces read operations, as frequently accessed tweets and timelines will be stored in and retrieved from the cache. Presto, happier users, less frustrated engineers!
Hungry for more? We can serve our increasing dataset by employing consistent hashing, a technique to distribute the data evenly across multiple database nodes, making for quick data retrieval. With these strategies coupled together, our system should be able to handle even the Mondayest of Monday loads.
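A bare-bones consistent-hash ring, with virtual nodes so keys spread evenly across database nodes. The node names and replica count are illustrative:

```python
import bisect
import hashlib

class HashRing:
    """Map keys to nodes on a hash ring; adding or removing a node moves few keys."""

    def __init__(self, nodes, vnodes=100):
        self.ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):  # virtual nodes smooth the distribution
                h = self._hash(f"{node}#{i}")
                bisect.insort(self.ring, (h, node))

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or after the key's hash.
        h = self._hash(key)
        i = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["db-0", "db-1", "db-2"])
owner = ring.node_for("user:12345")  # the shard that stores this user's data
```

The payoff versus naive `hash(key) % N` sharding is rebalancing: when a node is added or removed, only the keys on the affected arc of the ring move, rather than nearly all of them.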
Remember the potential performance bottleneck we talked about earlier? That pesky latency? Here's where the dynamic duo of caching and read replicas comes back into play. Caching stores and provides quick access to frequently accessed tweets and timelines, significantly reducing database reads. Under extreme loads, read replicas of our database can be created. These replicas distribute the read load across multiple nodes, preventing individual servers from collapsing under the weight.
Securing the system
In a microblogging service like Twitter, implementing robust security measures is of paramount importance to protect user data and maintain user trust. Here are some key security measures we would implement:
- Secure User Authentication and Authorization: We would implement a secure authentication system using protocols like OAuth2.0 or OpenID Connect. These protocols ensure that only legitimate users can access their accounts. Additionally, we would use role-based access control (RBAC) to manage user permissions. This means users can only perform actions they are authorized to do. For instance, a regular user should not be able to delete another user's posts. This measure is crucial in maintaining the integrity of user data and preventing unauthorized actions.
- Data Encryption: Protecting user data is a top priority. To do this, we would encrypt all sensitive user information both at rest and in transit. At rest, data would be encrypted using strong symmetric encryption algorithms like AES. In transit, data would be encrypted using protocols like HTTPS and TLS to protect against man-in-the-middle attacks. This ensures that even if a malicious actor were to gain access to the data, they would not be able to read or use it.
- Secure Password Storage: Passwords would be stored securely using techniques like hashing and salting. This ensures that even if the password database were to be compromised, the actual passwords would not be directly accessible. This measure is crucial in preventing unauthorized access to user accounts.
- Protection against Common Web Attacks: We would implement measures to prevent SQL injection and Cross-Site Scripting (XSS) attacks, such as input validation and parameterized queries. These measures protect our system and user data from common web vulnerabilities.
- Incident Response Plan: We would have a plan in place for handling potential data breaches or attacks, including immediate containment of the breach, investigation, notification of affected users, and measures to prevent future incidents. This measure is crucial in maintaining user trust and ensuring a swift response to security incidents.
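Salted hashing, as described above, can be sketched with Python's standard library. The iteration count here is a placeholder; production systems tune it upward, or reach for a dedicated scheme like bcrypt or Argon2:

```python
import hashlib
import hmac
import os
from typing import Optional

def hash_password(password: str, salt: Optional[bytes] = None) -> tuple:
    """Return (salt, hash); the random salt makes identical passwords hash differently."""
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest

def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    _, digest = hash_password(password, salt)
    # Constant-time comparison guards against timing attacks.
    return hmac.compare_digest(digest, expected)

salt, stored = hash_password("hunter2")      # store both salt and hash
ok = verify_password("hunter2", salt, stored)
bad = verify_password("wrong", salt, stored)
```

Because each user gets a unique salt, an attacker who steals the password table cannot use precomputed rainbow tables, and two users with the same password still have different stored hashes.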
Monitoring the system
To effectively monitor a microblogging service like Twitter, we would focus on three critical performance metrics: throughput, error rates, and latency. Monitoring these metrics is crucial in ensuring a smooth and reliable service for users.
- Throughput: This measures the number of transactions or requests processed by the system per unit of time. It's a key indicator of system performance. We would use real-time analytics tools like Prometheus or Datadog to monitor throughput. These tools provide a live feed of system throughput, enabling us to quickly identify and address any drops in performance. Furthermore, we can set up alerts to notify the team if throughput falls below a certain threshold.
- Error Rates: Error rates can indicate potential issues with the system. They measure the number of failed requests, such as server errors or database errors, relative to the total number of requests. We would use monitoring tools like Grafana or the ELK Stack to track error rates. Similar to throughput, we can set up alerts to notify the team if error rates exceed a certain threshold.
- Latency: Latency, the delay before a transfer of data begins following an instruction, is another crucial performance metric for real-time services like Twitter. Monitoring latency can help us ensure that the system is responding to requests in a timely manner. Tools like Pingdom or New Relic can be used to monitor latency.
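As a small illustration of tracking one of these metrics, here is a rolling latency monitor that reports percentiles and flags threshold breaches. The window size and alert threshold are illustrative; tools like Prometheus do this at scale with histograms:

```python
from collections import deque

class LatencyMonitor:
    """Keep the last `window` request latencies and report percentiles."""

    def __init__(self, window: int = 1000, p99_alert_ms: float = 500.0):
        self.samples = deque(maxlen=window)  # old samples fall off automatically
        self.p99_alert_ms = p99_alert_ms

    def record(self, latency_ms: float):
        self.samples.append(latency_ms)

    def percentile(self, p: float) -> float:
        data = sorted(self.samples)
        idx = min(len(data) - 1, int(len(data) * p / 100))
        return data[idx]

    def should_alert(self) -> bool:
        # Fire when tail latency (p99) exceeds the configured threshold.
        return len(self.samples) > 0 and self.percentile(99) > self.p99_alert_ms

monitor = LatencyMonitor()
for ms in [20, 25, 30, 22, 800]:  # one slow outlier among fast requests
    monitor.record(ms)
```

Note how the median stays healthy while p99 catches the outlier, which is why tail percentiles, not averages, drive latency alerts in practice.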
In addition to these key metrics, we would also implement regular health checks for the system's components, such as the databases and servers, to ensure they are functioning optimally. A comprehensive monitoring setup like this lets us catch and address issues before they impact users.
Designing a microblogging service like Twitter is no small feat, but with careful planning and a deep understanding of system design principles, it's entirely achievable. The key is to focus on understanding the requirements, estimating system usage, choosing the right database, generating timelines efficiently, ensuring scalability, and implementing robust security, monitoring, and testing measures. With these strategies in place, you'll be well on your way to designing a microblogging service that can handle millions of users and their tweets.
You can practice this exact question (and dozens of others) with Hello Interview AI. You'll be able to answer real system design interview questions on an interactive whiteboard and receive instant AI feedback!
Rehearse, revise, and be ready to ace your next system design interview.
Evan, Co-founder of Hello Interview and former Tech Lead at Meta, possesses a unique vantage point, having been on both sides of the tech hiring process. With a track record of conducting hundreds of interviews and securing offers from top tech companies himself, he is now on a mission to help others do the same.