Design a Local Delivery Service like Gopuff
Understanding the Problem
Functional Requirements
Core Requirements
- Customers should be able to query availability of items, deliverable in 1 hour, by location.
- Customers should be able to order items.
Below the line (out of scope)
- Handling payments/purchases.
- Handling driver routing and deliveries.
- Search functionality and catalog APIs. (The system is strictly concerned with availability and ordering).
- Cancellations and returns.
Non-Functional Requirements
Core Requirements
- Customers should be able to order from any DC within 1 hour delivery time (i.e. the effective availability is the union of all nearby DCs).
- Availability requests should be fast (<100ms) to support use-cases like search.
- Ordering should be strongly consistent: two customers should not be able to purchase the same physical product.
- System should be able to support 10k DCs and 100k items in the catalog across DCs.
- Order volume will be on the order of 1M orders/day.
Below the line (out of scope)
- Privacy and security.
- Disaster recovery.
Here's how these might be shorthanded in an interview. Note that out-of-scope requirements usually stem from questions like "do we need to handle privacy?". Interviewers are usually comfortable with you making assertions like "I'm going to leave privacy out of scope to start" and will correct you if needed.
Scale estimations
Because we need to support 1m orders/day, we can estimate the orders/second:
Orders: 1M orders/day / (100k seconds/day) = 10 orders/second
This might burst considerably higher and is unlikely to be evenly distributed over 24 hours, so we might increase this estimate to 50 or 100. This is fairly easily achievable with almost any standard database system. We can safely move on.
Queries are a bit harder to estimate. All-in, we might estimate that each user will look at 100 items across search, the homepage, etc. before purchasing. In addition, maybe only 5% of these browsing users will end up buying.
Queries: 1M orders/day / (100k seconds/day) / .05 * 100 = 20k queries/second
The read volume here is substantially higher. Weโll need to think through a good caching strategy but we can probably get some advantage due to the fact that these queries are going to be localized by region (more later).
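The arithmetic above is easy to sanity-check in a few lines of Python (the 100k seconds/day figure is the usual rounding of 86,400 for quick mental math):

```python
SECONDS_PER_DAY = 100_000  # rounded from 86,400 for easy estimation

orders_per_day = 1_000_000
orders_per_sec = orders_per_day / SECONDS_PER_DAY  # 10 orders/second

conversion_rate = 0.05  # assume ~5% of browsing users place an order
items_viewed = 100      # assume ~100 items viewed per user before buying
queries_per_sec = orders_per_sec / conversion_rate * items_viewed  # 20k queries/second

print(orders_per_sec, queries_per_sec)
```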
We donโt really need to do any more estimation beyond this. Reflections like those above help your interviewer understand how your estimations are affecting your design and demonstrate your experience - donโt estimate for the sake of estimation!
Set Up
Planning the Approach
In this interview, we need to solve 2 key challenges, aligned with our requirements:
- Answering queries about inventory availability.
- Placing orders.
We'll note that this problem requires us to use location information specified by the client, so our design will need an efficient way to answer geospatial questions like "which DCs can deliver to this user within an hour?"
Defining the API
To begin this problem, we can start with the APIs we need to satisfy - which follow naturally from our functional requirements. Note there are a lot of extraneous details we could shove into these APIs, but weโll start with these before expanding. We may also need to add APIs alongside our high level design, that's ok too!
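A minimal sketch of the two endpoints might look like the following; the names, parameters, and response shapes here are illustrative assumptions, not a fixed contract:

```
// Get availability of items deliverable within 1 hour of a location
GET /availability?lat={lat}&long={long}&itemIds={itemIds}
  -> { items: [{ itemId, quantityAvailable }] }

// Place an order for items, delivered to a location
POST /orders
  { lat, long, itemIds: [...] }
  -> { orderId, status }
```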
Defining the Core Entities
We can also sketch out a rudimentary data model. The important thing here is to find the core entities for the problem that are going to be relevant to our requirements.
Of particular note for this problem is the distinction between an ItemInstance and an Item. While our API consumers are strictly concerned with items (e.g. Cheetos), we need to keep track of where the physical items are actually located. Orders, meanwhile, are concerned with ItemInstances because they need to be able to track the physical items which are being ordered. If we remove the links, we end up with 4 entities in our system:
- ItemInstance: A physical instance of an item, located at a DC. We'll sum up ItemInstances to determine the quantity available.
- Item: A type of item, e.g. Cheetos. These are what our customers will actually care about.
- DistributionCenter: A physical location where items are stored. We'll use these to determine which items are available to a user. ItemInstances are stored in DCs.
- Order: A collection of ItemInstances which have been ordered by a user.
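The four entities can be sketched as simple records. The field names below are illustrative assumptions; only the entity names and relationships come from the list above:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Item:
    """A type of item customers care about, e.g. Cheetos."""
    item_id: int
    name: str

@dataclass
class DistributionCenter:
    """A physical location where ItemInstances are stored."""
    dc_id: int
    lat: float
    long: float

@dataclass
class ItemInstance:
    """A physical instance of an Item, located at a DC."""
    instance_id: int
    item_id: int   # which Item this is a physical copy of
    dc_id: int     # which DC it currently sits in
    status: str    # e.g. "available" or "ordered"

@dataclass
class Order:
    """A collection of ItemInstances ordered by a user."""
    order_id: int
    user_id: int
    item_instance_ids: List[int]
```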
High-Level Design
1) Customers should be able to query availability of items
In order to find the items available to a customer, we need to first find the DCs that are close enough to be able to deliver in 1 hour. Then we can aggregate the items that are available in those DCs to come up with the final quantities we can return to the user.
We can start by building a simple API which takes a LAT and LONG and returns a list of DCs within 1 hour delivery. There are a lot of options for this service, from how the data is stored to how closely we can answer the question of delivery time. In the ideal case, we have a service which is both fast and accurate.
Bad Solution: Simple SQL Distance
Bad Solution: Use a Travel Time Estimation Service Against all DCs
Great Solution: Use a Travel Time Estimation Service Against Nearby DCs
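The "great" solution's two-step shape (a coarse radius filter, then an exact travel-time check) can be sketched in Python. Here `travel_time_fn` stands in for the external Travel Time Service, and the 60-mile / 45-minute thresholds match the walkthrough below:

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/long points."""
    r = 3959  # Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearby_dcs(user_lat, user_lon, dcs, travel_time_fn,
               radius_miles=60, max_minutes=45):
    """Cheap distance pre-filter, then an exact travel-time check per candidate DC."""
    candidates = [dc for dc in dcs
                  if haversine_miles(user_lat, user_lon, dc["lat"], dc["long"]) <= radius_miles]
    return [dc["id"] for dc in candidates
            if travel_time_fn(dc, user_lat, user_lon) <= max_minutes]
```

The pre-filter keeps us from calling the (slow, possibly billed-per-request) travel-time estimator for every DC in the system.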
Once we have the available DCs, the next step is to answer queries about inventory availability. Since our nearby service gives us a list of DCs in scope for a given user, the availability lookup is relatively simple: it aggregates the inventory for the list of DCs and returns the result. We want to make this fast and scalable, and (per our requirements) we can compromise a bit on consistency to get it.
Bad Solution: Query Inventory Directly with Nearby Service Results
Good Solution: Query Inventory Through Cache with Nearby Service Results
Great Solution: Postgres Read Replicas and Partitioning
By choosing the "great" solutions for both challenges we net out with a solution that looks something like this:
We now have:
- Availability Service handles requests from our users for availability given a specific location.
- Nearby Service syncs with the database of nearby DCs and uses an external "Travel Time Service" to calculate travel times from DCs (potentially including traffic).
- Inventory Table a replicated SQL database table partitioned by region which returns the inventory available for each item.
When a user makes a request to get availability for items A, B, and C from latitude X and longitude Y, here's what happens:
- We make a request to the Nearby Service with the user's location X and Y.
- The nearby service accesses its synced list of distribution centers and grabs any DC which is within 60 miles of X and Y.
- For each of these DCs, the nearby service makes a request to the Travel Time Service to get the travel time from the DC to X and Y.
- For all travel times under (say) 45 minutes, we return the IDs of those DCs to the availability service.
- With the DCs available, we query our SQL database for the items A, B, and C with locations in those DCs.
- We sum up the results and return them to our client.
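The final aggregation step can be sketched as a pure function. Here `inventory_rows` stands in for the result of the SQL query against our partitioned inventory table (the query in the docstring is an assumed shape, not a fixed schema):

```python
def get_availability(item_ids, dc_ids, inventory_rows):
    """Sum available inventory for the requested items across in-range DCs.

    inventory_rows stands in for a query like:
      SELECT item_id, dc_id, status FROM item_instances
      WHERE item_id IN (...) AND dc_id IN (...)
    """
    dc_set = set(dc_ids)
    totals = {item_id: 0 for item_id in item_ids}
    for row in inventory_rows:
        if (row["item_id"] in totals
                and row["dc_id"] in dc_set
                and row["status"] == "available"):
            totals[row["item_id"]] += 1
    return totals
```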
2) Customers should be able to order items.
The last step is for us to enable placing orders. For this, we require strong consistency to make sure two users aren't ordering the same item. To do this we need to check inventory, record the order, and update the inventory together atomically. While latency is not a big concern here (our users will tolerate more latency here than they will on the reads), we definitely want to make sure we're not promising the same inventory to two users.
To ensure we're not "double booking", we need some form of locking. We have lots of options here, but we're going to favor the least complex implementation:
Good Solution: Two data stores with three-phase commit
Great Solution: Singular Postgres transaction
By choosing the "great" option and leaning in to our existing Postgres database we can keep our system simple and still meet our requirements. For an order, the process looks like this:
- The user makes a request to the Orders Service to place an order for items A, B, and C.
- The Orders Service creates a single transaction which we submit to our Postgres leader. This transaction:
  a. Checks the inventory for items A, B, and C > 0.
  b. If any of the items are out of stock, the transaction fails.
  c. If all items are in stock, the transaction records the order and updates the status for inventory items A, B, and C to "ordered".
  d. A new row is created in the Orders table (and OrderItems table) recording the order for A, B, and C.
  e. The transaction is committed.
- If the transaction succeeds, we return the order to the user.
There are some downsides to this setup. In particular, if any of the items become unavailable in the user's order, the entire order fails. We'll want to return a more meaningful error message to the user in this case, but this is preferable to succeeding with an order that might not make sense (e.g. a device without its battery).
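As a sketch of the single-transaction approach, here it is with SQLite standing in for Postgres (in Postgres you would additionally want `SELECT ... FOR UPDATE` or serializable isolation to guard against concurrent claims; the schema and helper here are illustrative assumptions):

```python
import sqlite3

def place_order(conn, user_id, item_ids):
    """Atomically claim one available instance of each item, or fail the whole order."""
    try:
        with conn:  # one transaction: commits on success, rolls back on exception
            claimed = []
            for item_id in item_ids:
                row = conn.execute(
                    "SELECT instance_id FROM item_instances "
                    "WHERE item_id = ? AND status = 'available' LIMIT 1",
                    (item_id,)).fetchone()
                if row is None:
                    raise ValueError(f"item {item_id} out of stock")  # fails the order
                conn.execute(
                    "UPDATE item_instances SET status = 'ordered' WHERE instance_id = ?",
                    (row[0],))
                claimed.append(row[0])
            cur = conn.execute("INSERT INTO orders (user_id) VALUES (?)", (user_id,))
            order_id = cur.lastrowid
            for instance_id in claimed:
                conn.execute(
                    "INSERT INTO order_items (order_id, instance_id) VALUES (?, ?)",
                    (order_id, instance_id))
            return order_id
    except ValueError:
        return None  # the with-block already rolled back every change
```

Because the inventory check, the status update, and the order insert all live in one transaction, two concurrent buyers cannot both claim the last instance of an item.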
Putting it all Together
With the two approaches listed, we can now pull everything together for a solution to our problem:
We have three services, one for Availability requests, one for Orders, and a shared service for Nearby DCs. Both our Availability and Orders service use the Nearby service to look up DCs that are close enough to the user. We have a singular Postgres database for inventory and orders, partitioned by region. Our Availability service reads via read replicas, our Orders service writes to the leader using atomic transactions to avoid double-booking. A great foundation!
Deep Dives
1) Product change: Add cart "reservations"
Your interviewer asks you to add a new feature:
Given these requirements, we need some way to "lock" inventory to a user when they add it to the cart. We also need a way to expire these locks after 1 hour or when the user checks out -- whichever comes first. We also need to make sure that our Availability service is aware of these locks and doesn't return them as available inventory. We have a couple options for how to implement this:
Good Solution: Use Redis to "lock" items
Great Solution: Use a new status flag
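The status-flag idea can be sketched in a few lines: a reservation is just another status on the ItemInstance, plus a timestamp, and anything reserved for more than an hour counts as available again. Field names and the in-memory dict are illustrative assumptions:

```python
import time

RESERVATION_TTL_SECONDS = 3600  # cart holds expire after 1 hour

def effective_status(instance, now=None):
    """Treat expired reservations as available again."""
    now = time.time() if now is None else now
    if (instance["status"] == "reserved"
            and now - instance["reserved_at"] > RESERVATION_TTL_SECONDS):
        return "available"
    return instance["status"]

def reserve(instance, now=None):
    """Mark an available (or expired-reservation) instance as reserved for a cart."""
    now = time.time() if now is None else now
    if effective_status(instance, now) != "available":
        return False
    instance["status"] = "reserved"
    instance["reserved_at"] = now
    return True
```

The Availability service filters on `effective_status` rather than the raw column, so expired holds flow back into inventory without any background cleanup job (though one can still sweep stale rows lazily).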
2) Optimization: Late allocation of DCs (Senior+)
Your interviewer asks you to optimize the system:
The fundamental issue surfaced here is that we make assignments between orders and inventory items suboptimally. We have some options:
Good Solution: Use max inventory DCs
Good Solution: Rule engine
Great Solution: Allocation service
What is Expected at Each Level?
Your interviewer may even have you go deeper on specific sections, or ask follow-up questions. What might you expect in an actual assessment?
Mid-Level
- Breadth vs. Depth: A mid-level candidate will be mostly focused on breadth. As an approximation, you'll show 80% breadth and 20% depth in your knowledge. You should be able to craft a high-level design that meets the functional requirements you've defined, but the optimality of your solution will be icing on top rather than the focus.
- Probing the Basics: Your interviewer will spend some time probing the basics to confirm that you know what each component in your system does. For example, if you use DynamoDB, expect to be asked about the indexes available to you. Your interviewer will not be taking anything for granted with respect to your knowledge.
- Mixture of Driving and Taking the Backseat: You should drive the early stages of the interview in particular, but your interviewer won't necessarily expect that you are able to proactively recognize problems in your design with high precision. Because of this, it's reasonable that they take over and drive the later stages of the interview while probing your design.
- The Bar for Inventory Management: For this question, interviewers expect a mid-level candidate to have clearly defined the API endpoints and data model, and created both routes: availability and orders. In instances where the candidate uses a "Bad" solution, the interviewer will expect a good discussion but not that the candidate immediately jumps to a great (or sometimes even good) solution.
Senior
- Depth of Expertise: As a senior candidate, your interviewer expects a shift towards more in-depth knowledge: about 60% breadth and 40% depth. This means you should be able to go into technical details in areas where you have hands-on experience.
- Advanced System Design: You should be familiar with advanced system design principles. Certain aspects of this problem should jump out to experienced engineers (read volume, trivial partitioning) and your interviewer will be expecting you to have reasonable solutions.
- Articulating Architectural Decisions: Your interviewer will want you to clearly articulate the pros and cons of different architectural choices, especially how they impact scalability, performance, and maintainability. You should be able to justify your decisions and explain the trade-offs involved in your design choices.
- Problem-Solving and Proactivity: You should demonstrate strong problem-solving skills and a proactive approach. This includes anticipating potential challenges in your designs and suggesting improvements. You need to be adept at identifying and addressing bottlenecks, optimizing performance, and ensuring system reliability.
- The Bar for Inventory Management: For this question, a senior candidate is expected to speed through the initial high-level design so we can spend time discussing, in detail, how to optimize the critical paths. Senior candidates would be expected to have optimized solutions for both the atomic transactions of the orders service and the scaling of the availability service.
Staff+
- Emphasis on Depth: As a staff+ candidate, the expectation is a deep dive into the nuances of system design: the interviewer is looking for about 40% breadth and 60% depth in your understanding. This level is all about demonstrating that "been there, done that" expertise. You should know which technologies to use, not just in theory but in practice, and be able to draw from your past experiences to explain how they'd be applied to solve specific problems effectively. Your interviewer knows you know the small stuff (REST APIs, data normalization, etc.) so you can breeze through that at a high level so we have time to get into what is interesting.
- High Degree of Proactivity: At this level, your interviewer expects an exceptional degree of proactivity. You should be able to identify and solve issues independently, demonstrating a strong ability to recognize and address the core challenges in system design. This involves not just responding to problems as they arise but anticipating them and implementing preemptive solutions.
- Practical Application of Technology: You should be well-versed in the practical application of various technologies. Your experience should guide the conversation, showing a clear understanding of how different tools and systems can be configured in real-world scenarios to meet specific requirements.
- Complex Problem-Solving and Decision-Making: Your problem-solving skills should be top-notch. This means not only being able to tackle complex technical challenges but also making informed decisions that consider various factors such as scalability, performance, reliability, and maintenance.
- Advanced System Design and Scalability: Your approach to system design should be advanced, focusing on scalability and reliability, especially under high load conditions. This includes a thorough understanding of distributed systems, load balancing, caching strategies, and other advanced concepts necessary for building robust, scalable systems.
- The Bar for Inventory Management: For a staff+ candidate, expectations are set high regarding depth and quality of solutions, particularly for the complex scenarios discussed earlier. Your interviewer will be looking for you to dive deep into at least 2-3 key areas, showcasing not just proficiency but also innovative thinking and optimal solution-finding abilities. They should show unique insights for at least a couple follow-up questions of increasing difficulty. A crucial indicator of a staff+ candidate's caliber is the level of insight and knowledge they bring to the table. A good measure for this is if the interviewer comes away from the discussion having gained new understanding or perspectives.