Top-K YouTube Videos
Understanding the Problem
Let's assume we have a very large stream of views on YouTube (our stream is a firehose of VideoIDs). At any given moment we'd like to be able to query, precisely, the top K most viewed videos for a given time period (say 1 hour, 1 day, 1 month, all time) together with their counts.
Our interviewer might give us some quantities to help us understand the scale of the system: YouTube Shorts had 70 billion views per day and approximately 1 hour of YouTube content is uploaded every second. Big!
Functional Requirements
Core Requirements
- Clients should be able to query the top K videos (max 1000) for a given time period.
- Time periods should be limited to 1 {hour, day, month} and all-time.
Below the line (out of scope):
- Arbitrary time periods.
- Arbitrary starting/ending points (we'll assume all queries are looking back from the current moment).
Non-Functional Requirements
Core Requirements
- We'll tolerate at most 1 min delay between when a view occurs and when it should be tabulated.
- Our results must be precise, so we should not approximate. (Note: This would be unusual for most production systems)
- Our system should be able to handle a massive number (TBD - cover this later) of views per second.
- We should support a massive number (TBD - cover this later) of videos.
- We should return results within 10's of milliseconds.
- Our system should be economical. We shouldn't need a 10k host fleet to solve this problem.
Scale Estimation
We've earmarked two quantities important to our design: (a) the number of views per second, and (b) the total number of videos. The first will help us understand the overall throughput of the system while the second is important for bounding the storage we'll need.
First let's look at throughput:
70B views/day / (100k seconds/day) = 700k tps
Woo, that's a lot. We're definitely going to need to look for ways to shard this across many different hosts.
Now, let's talk storage. First we need the number of videos:
Videos/Day = 1 hour content/second / (6 minutes content/video) * (100k seconds/day) = 1M videos/day
Total Videos = 1M videos/day * 365 days/year * 10 years = 3.6B videos
With that let's estimate how big a naive table of IDs and counts would be:
Naive Storage = 4B videos * (8 bytes/ID + 8 bytes/count) = 64 GB
Ok, probably something we can keep in memory if we're clever, especially if we use a number of hosts.
The Set Up
Planning the Approach
Based on our requirements, we know we're going to make some observations for our interviewer:
- First, we need to index data from a very high volume stream. Most quantities will need to be precomputed in order to meet the latency requirements.
- Next, problems like this typically have bottlenecks that are hidden behind bottlenecks: solving one problem creates (at least) one more. So we'll aim to solve the simplest problem first, and then add complexity as we go.
- Finally, we'll note that the sliding time window adds more challenge. So we'll start with all-time and then try to figure out the rest.
Our rough plan is thus:
- Generate a basic (but not scalable) solution to the all-time top K problem.
- Solve the primary issues of our basic solution.
- Add a solution for the time period inputs.
- Deep dive remaining bottlenecks until we run out of time.
Defining the Core Entities
In our problem, we have some basic entities we're going to work with to build our API:
- Video
- View
- Time Window
From a conceptual perspective this problem is straightforward so we're not going to spend any more time here. We might even skip this section to save time.
API or System Interface
Our API guides the rest of the interview, but in this case it's really basic too! We simply need an API to retrieve the top K videos.
GET /views/top?window={WINDOW}&k={K}

Response:
{
  "videos": [
    {
      "videoId": // ...
      "views": // ...
    }
  ]
}
We're not going to dawdle here; let's keep moving on to the meat of the interview.
High-Level Design
1) A Basic Solution for All-Time
Let's start with a simple solution for all-time top K videos which we'll build on a gigantic single host, then we can start to whittle away at optimization.
We can do this by maintaining a table of video IDs and counts. This gets us an up-to-date count of every video, but iterating over all 4B keys to find the largest values is untenable, so we'll keep a heap of the top K videos which we can update with each increment. The vast majority of views will never touch this heap since they'll be below the threshold of the top 1000 (the max K we established in our functional requirements).
The basic function is this: when a request comes in, we atomically increment the counter in the hash table with the incoming ID. We retrieve the updated count and test it against the floor of our heap. If the count is higher than our floor (i.e. the video belongs in the top 1,000) we update/insert it into the heap and heapify. Our clients query directly from that heap to retrieve the top K videos.
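To make this concrete, here's a minimal single-host sketch of that loop, assuming an in-memory hash table of counts plus a bounded top-K set. The class and method names are illustrative, and for simplicity the "heap" here is a plain dict with a linear floor scan; the shape of the algorithm is the same.

```python
import heapq
from collections import defaultdict

class AllTimeTopK:
    """Single-host sketch: exact all-time counts plus a bounded top-K set."""

    def __init__(self, k: int = 1000):
        self.k = k
        self.counts = defaultdict(int)  # videoId -> all-time view count
        self.top = {}                   # videoId -> count, at most k entries

    def record_view(self, video_id: str) -> None:
        self.counts[video_id] += 1
        count = self.counts[video_id]

        # Already in the top set, or the set isn't full yet: just update it.
        if video_id in self.top or len(self.top) < self.k:
            self.top[video_id] = count
            return

        # Otherwise, only videos that beat the current floor enter the top set.
        floor_id = min(self.top, key=self.top.get)
        if count > self.top[floor_id]:
            del self.top[floor_id]
            self.top[video_id] = count

    def query(self, k: int) -> list[tuple[str, int]]:
        """Return the top min(k, K) videos with their counts, highest first."""
        return heapq.nlargest(min(k, self.k), self.top.items(), key=lambda kv: kv[1])
```

The vast majority of views fall through the cheap path (an increment plus one comparison against the floor), which is what makes the single-host version fast in the first place.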
This is really simple and fast because we run it on a single host. And while this is possible conceptually, in memory, on a single host, we wouldn't want to do that: first, because the throughput we can support is likely more than an order of magnitude shy of the 700k TPS we need, and second, because that host becomes a single point of failure. What to do here?
2) Primary Issues of the Basic Solution
We have two issues we need to address: how to maintain reliability in the presence of failures and how to scale the write throughput of the system. Let's talk about them in order.
In order for our system to be reliable, we need to be able to gracefully handle node failures. In our single-node system we're going to be in the job search again if that host fails. No good.
Bad Solution: Write out to a database
Good Solution: Replication
Great Solution: Replicas with Snapshots
Ok, with some replicas and snapshots we're in a much more fault-tolerant state. Next, we need to scale the write throughput of our system, since replicas alone don't solve the problem of a massive firehose of incoming data. Your mind should immediately go to sharding/partitioning here.
Bad Solution: Fixed partitioning by ID
Good Solution: Elastic partitioning
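To sketch what elastic partitioning could look like, here's a toy consistent-hash ring that routes each videoId to a counter shard. The shard names, virtual-node count, and hash function below are assumptions for illustration; consistent hashing is one common way to get this elasticity, not the only one.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Toy consistent-hash ring: adding or removing a shard only remaps a
    small slice of the keyspace instead of reshuffling every videoId."""

    def __init__(self, shards: list[str], vnodes: int = 100):
        points = []
        for shard in shards:
            for i in range(vnodes):
                points.append((self._hash(f"{shard}#{i}"), shard))
        points.sort()
        self._hashes = [h for h, _ in points]
        self._shards = [s for _, s in points]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def shard_for(self, video_id: str) -> str:
        idx = bisect.bisect(self._hashes, self._hash(video_id)) % len(self._hashes)
        return self._shards[idx]

# Hypothetical usage: every incoming view is routed to exactly one counter shard,
# so each video's count lives in one place.
ring = ConsistentHashRing(["counter-1", "counter-2", "counter-3"])
shard = ring.shard_for("some-video-id")
```

Because each video maps to exactly one shard, the global top K can later be computed by merging each shard's local top K.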
Ok cool, now we have a basic in-memory solution which is both fault-tolerant and (somewhat) scalable. But we haven't solved all our functional requirements yet. On to those pesky time windows.
Potential Deep Dives
1) Handling Time Windows
While our "All-Time" solution conveniently can aggregate views forever, to handle time windows we need to age out views that happened outside that window. As an example, if a video got a single view at time T=0 and our time window is 1, then by T=2 we need to make sure that video's count is back to 0.
One advantage we have is that the time windows we're working with are fixed and small: we only have 3. One disadvantage is that they span very different granularities: from 1 hour to 1 month.
This is complicated, so our best strategy is to start with something basic (and probably bad), then use it as inspiration to come up with alternative solutions.
Your interviewer is going to be looking for how you can think through this problem, not (necessarily) that you get the optimal answer. Identifying pinch points, solving them, and not getting stuck is critical. But if you can think of the best solution go for it!
Bad Solution: Naive micro-buckets
Good Solution: Heap expirations
Great Solution: Use two pointers
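One way to read the "two pointers" idea: keep view events in time order and walk two cursors over them, one ingesting new views and one expiring views that have aged out of the window. Here's a rough sketch under that interpretation; the in-memory deque stands in for whatever time-ordered log the real system would use.

```python
from collections import defaultdict, deque

class WindowedCounter:
    """Sketch of aging views out of a fixed window with two cursors:
    record_view advances the 'ingest' end, advance_time trails behind it
    and decrements counts for views older than the window."""

    def __init__(self, window_seconds: int):
        self.window = window_seconds
        self.events = deque()             # (timestamp, video_id), time-ordered
        self.counts = defaultdict(int)    # counts within the current window

    def record_view(self, video_id: str, now: float) -> None:
        self.events.append((now, video_id))
        self.counts[video_id] += 1
        self.advance_time(now)

    def advance_time(self, now: float) -> None:
        # Expire cursor: pop events that have fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window:
            _, video_id = self.events.popleft()
            self.counts[video_id] -= 1
            if self.counts[video_id] == 0:
                del self.counts[video_id]
```

In practice both cursors would presumably read the same durable stream (e.g. a Kafka topic), with the expiring consumer lagging the ingesting one by exactly the window length, and the same top-K maintenance from the all-time solution layered on top of the windowed counts.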
2) Large number of incoming requests
So far we've been talking about how to handle a lot of views/writes, but what about reads? Given we have 1 minute between when a view happens and when it needs to be tabulated, the most natural solution is to add a cache. We can put a 1 minute TTL on the cache so results are never more stale than our requirement. When a request comes in, we either serve it from the cache or query all Counters for the requested window, merge their results, and store the merged values back in the cache.
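A rough sketch of that read path, assuming a simple TTL cache in front of a scatter-gather over the counter shards; `local_top_k` and the shard client below are hypothetical names, not a real API.

```python
import heapq
import time

CACHE_TTL_SECONDS = 60             # matches the 1-minute freshness requirement
_cache = {}                        # (window, k) -> (expires_at, results)

def top_k(window: str, k: int, shards) -> list[tuple[str, int]]:
    """Serve from cache if fresh; otherwise scatter-gather and merge."""
    key = (window, k)
    cached = _cache.get(key)
    if cached and cached[0] > time.time():
        return cached[1]

    # Ask every counter shard for its local top-k, then merge globally.
    # Since each video lives on exactly one shard, the merge is simply
    # "take the k largest of all candidates".
    candidates = []
    for shard in shards:
        candidates.extend(shard.local_top_k(window, k))  # hypothetical shard API

    merged = heapq.nlargest(k, candidates, key=lambda pair: pair[1])
    _cache[key] = (time.time() + CACHE_TTL_SECONDS, merged)
    return merged
```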
What is Expected at Each Level?
Ok, that was a lot. You may be thinking, "how much of that is actually required from me in an interview?" Let's break it down.
Mid-level
Breadth vs. Depth: A mid-level candidate will be mostly focused on breadth (80% vs 20%). You should be able to craft a high-level design that meets the functional requirements you've defined, but many of the components will be abstractions with which you only have surface-level familiarity.
Probing the Basics: Your interviewer will spend some time probing the basics to confirm that you know what each component in your system does. For example, if you add an API Gateway, expect that they may ask you what it does and how it works (at a high level). In short, the interviewer is not taking anything for granted with respect to your knowledge.
Mixture of Driving and Taking the Backseat: You should drive the early stages of the interview in particular, but the interviewer doesn't expect that you are able to proactively recognize problems in your design with high precision. Because of this, it's reasonable that they will take over and drive the later stages of the interview while probing your design.
The Bar for Top K: For this question, a mid-level candidate will be able to come up with an end-to-end solution that probably isn't optimal. They'll have some insights into pinch points of the system and be able to solve some of them. They'll have familiarity with relevant technologies.
Senior
Depth of Expertise: As a senior candidate, expectations shift towards more in-depth knowledge: about 60% breadth and 40% depth. This means you should be able to go into technical details in areas where you have hands-on experience. It's crucial that you demonstrate a deep understanding of key concepts and technologies relevant to the task at hand.
Advanced System Design: You should be familiar with advanced system design principles, for example knowing how to use consistent hashing to elastically scale partitioned data. You'd also be expected to understand how log-based event streaming (e.g., as implemented by Kafka or Redis Streams) functions. Your ability to navigate these advanced topics with confidence and clarity is key.
Articulating Architectural Decisions: You should be able to clearly articulate the pros and cons of different architectural choices, especially how they impact scalability, performance, and maintainability. You justify your decisions and explain the trade-offs involved in your design choices.
Problem-Solving and Proactivity: You should demonstrate strong problem-solving skills and a proactive approach. This includes anticipating potential challenges in your designs and suggesting improvements. You need to be adept at identifying and addressing bottlenecks, optimizing performance, and ensuring system reliability.
The Bar for Top K: For this question, a Senior candidate should be able to come up with an end-to-end solution that is near optimal. They'll identify most bottlenecks and proactively work to resolve them. They'll be familiar with relevant technologies and might even weigh the pros and cons of each.
Staff+
Emphasis on Depth: As a staff+ candidate, the expectation is a deep dive into the nuances of system design: I'm looking for about 40% breadth and 60% depth in your understanding. This level is all about demonstrating that, while you may not have solved this particular problem before, you have solved enough problems in the real world to be able to confidently design a solution backed by your experience.
You should know which technologies to use, not just in theory but in practice, and be able to draw from your past experiences to explain how they'd be applied to solve specific problems effectively. The interviewer knows you know the small stuff (caches, key-value stores, etc.), so you can breeze through that at a high level and save time to get into what is interesting.
High Degree of Proactivity: At this level, an exceptional degree of proactivity is expected. You should be able to identify and solve issues independently, demonstrating a strong ability to recognize and address the core challenges in system design. This involves not just responding to problems as they arise but anticipating them and implementing preemptive solutions. Your interviewer should intervene only to focus, not to steer.
Practical Application of Technology: You should be well-versed in the practical application of various technologies. Your experience should guide the conversation, showing a clear understanding of how different tools and systems can be configured in real-world scenarios to meet specific requirements.
Complex Problem-Solving and Decision-Making: Your problem-solving skills should be top-notch. This means not only being able to tackle complex technical challenges but also making informed decisions that consider various factors such as scalability, performance, reliability, and maintenance.
Advanced System Design and Scalability: Your approach to system design should be advanced, focusing on scalability and reliability, especially under high load conditions. This includes a thorough understanding of distributed systems, load balancing, caching strategies, and other advanced concepts necessary for building robust, scalable systems.
The Bar for Top K: For a staff+ candidate, expectations are high regarding depth and quality of solutions, particularly for the complex scenarios discussed earlier. A staff candidate will expand to cover deep dives that we haven't enumerated.