AI Inference Request Batching

Given a list of AI inference requests with arrival times and token counts, implement a batching strategy that groups requests to maximize throughput while respecting constraints on batch size, token limits, and per-request SLA wait times.

Asked at: Microsoft

Question Timeline


Date: Early April, 2026
Company: Microsoft
Level: Senior

Scenario: You own an AI inference endpoint. To reduce cost and improve throughput, requests can be batched. However, batching increases latency, so you must respect an SLA (Service Level Agreement).

Problem Statement: You are given a list of inference requests. Each request has the following attributes:

- id (string)
- arrivalTimeMs (long)
- tokens (int)

You want to create batches to send to a model.

Batch Rules: A batch must satisfy all constraints:

- Max tokens per batch: The sum of tokens in the batch should not exceed maxBatchTokens.
- Max requests per batch: The number of requests in a batch should not exceed maxBatchSize.
- SLA: Each request must start processing no later than arrivalTimeMs + maxWaitMs.
- Processing time: Each batch takes a fixed time, batchProcessMs (independent of its size), once started. A single worker can process only one batch at a time.

Task: Implement the function:

List<Batch> createBatches(requests, maxBatchSize, maxBatchTokens, maxWaitMs, batchProcessMs)

Where each Batch includes:

- startTimeMs
- list of request ids

The output must respect all constraints and be valid for all inputs.
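One possible approach to the task above, as a minimal sketch and not a reference solution: sort requests by arrival time, then repeatedly start a batch as soon as the worker is free and the earliest pending request has arrived, filling it in arrival order with requests that are already waiting. Because every request shares the same maxWaitMs, the earliest request in a batch has the tightest deadline, so checking it alone covers the whole batch. The Request, Batch, and Batcher class names are hypothetical (the problem names only the fields). Throwing when a request can never fit or the SLA cannot be met is also an assumption, since the statement does not say how infeasible inputs should be handled.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical container types; the problem statement only lists the fields.
final class Request {
    final String id;
    final long arrivalTimeMs;
    final int tokens;

    Request(String id, long arrivalTimeMs, int tokens) {
        this.id = id;
        this.arrivalTimeMs = arrivalTimeMs;
        this.tokens = tokens;
    }
}

final class Batch {
    final long startTimeMs;
    final List<String> requestIds;

    Batch(long startTimeMs, List<String> requestIds) {
        this.startTimeMs = startTimeMs;
        this.requestIds = requestIds;
    }
}

final class Batcher {
    static List<Batch> createBatches(List<Request> requests, int maxBatchSize,
            long maxBatchTokens, long maxWaitMs, long batchProcessMs) {
        if (maxBatchSize < 1) {
            throw new IllegalArgumentException("maxBatchSize must be at least 1");
        }
        // Process in arrival order; the sort is stable, so ties keep input order.
        List<Request> sorted = new ArrayList<>(requests);
        sorted.sort(Comparator.comparingLong((Request r) -> r.arrivalTimeMs));

        List<Batch> batches = new ArrayList<>();
        long workerFreeAtMs = Long.MIN_VALUE; // the single worker starts idle
        int i = 0;
        while (i < sorted.size()) {
            Request first = sorted.get(i);
            if (first.tokens > maxBatchTokens) {
                // Assumption: a request that cannot fit in any batch is an error.
                throw new IllegalArgumentException("request " + first.id + " exceeds maxBatchTokens");
            }
            // Start as early as possible: worker free and first request arrived.
            long startTimeMs = Math.max(workerFreeAtMs, first.arrivalTimeMs);
            if (startTimeMs > first.arrivalTimeMs + maxWaitMs) {
                // Assumption: report an SLA miss instead of emitting an invalid batch.
                throw new IllegalStateException("SLA cannot be met for request " + first.id);
            }
            List<String> ids = new ArrayList<>();
            long tokenSum = 0;
            // Fill with requests that have already arrived by the start time.
            while (i < sorted.size() && ids.size() < maxBatchSize) {
                Request next = sorted.get(i);
                if (next.arrivalTimeMs > startTimeMs || tokenSum + next.tokens > maxBatchTokens) {
                    break;
                }
                ids.add(next.id);
                tokenSum += next.tokens;
                i++;
            }
            batches.add(new Batch(startTimeMs, ids));
            workerFreeAtMs = startTimeMs + batchProcessMs;
        }
        return batches;
    }
}
```

For example, with requests a, b, c arriving at 0, 10, and 12 ms, maxWaitMs = 10, and batchProcessMs = 15, this sketch starts [a] at 0 ms and [b, c] at 15 ms. It favors SLA safety over batch fullness: an idle worker never holds a batch open for later arrivals, so batches grow only when a backlog forms. A variant that waits up to the first request's deadline would produce fuller batches but can miss deadlines this version meets, and neither is claimed to be optimal.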
