Design a Log Aggregation Service
Design a log aggregation service that processes 10TB/hr of data from multiple services, supporting real-time log viewing for debugging, regex-based queries for exact data retrieval, and batch delivery of historical data (24-48 hours) as tar files for analytics teams.
Asked at:
Doordash
Meta
A log aggregation service is a centralized platform where engineering teams ship application logs, search them in near real time, and export historical data for analytics. Think of ELK/OpenSearch or Splunk: developers can tail logs live to debug incidents, run regex searches to pinpoint exact events, and data teams can pull larger time windows for offline analysis. Interviewers ask this to test whether you can design high-throughput ingest (10TB/hour), low-latency search, and cost-effective retention/export pipelines—all while preserving reliability and operability. Strong answers show clarity on partitioning, backpressure, indexing, query safety, and batch workflows, not just picking a tool. The goal is to see you decompose the problem into real-time, search, and batch layers and make explicit trade-offs.
Common Functional Requirements
Most candidates end up covering this set of core functionalities
Users should be able to stream logs from many services reliably and at high throughput without losing data.
Users should be able to view logs in real time (live tail) with second-level latency for debugging.
Users should be able to run regex-based queries on recent logs and receive results quickly.
Analytics users should be able to receive 24–48 hour historical log bundles as downloadable tar files on a predictable schedule.
Common Deep Dives
Common follow-up questions interviewers like to ask for this question
This is where many designs fall over: sustained high write rates create backpressure that can ripple through producers, queues, and consumers. Interviewers want to see partitioning, batching, compression, and clear SLOs for end-to-end latency.
- Partition aggressively by time and source (e.g., service, region) so you can add consumers and avoid hot partitions; you could budget partitions based on peak bytes/sec and a target per-partition throughput.
- Use producer-side batching and compression (e.g., snappy/zstd) and configure backpressure (queue limits, retries with jitter) so your Kafka clients shed load gracefully instead of crashing.
- Separate paths for live tail vs. indexing: you could stream from Kafka to a lightweight tail service for WebSocket/SSE delivery while a separate pipeline indexes into search storage; this decouples live latency from index lag.
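To make the producer-side knobs concrete, here is a minimal sketch assuming a Kafka-backed ingest path and the confluent-kafka Python client; the topic name, key scheme, and numeric limits are illustrative assumptions, not prescriptions.

```python
# Producer tuned for high-throughput log shipping: compressed batches,
# bounded local buffering, and explicit backpressure on overflow.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "kafka:9092",
    "compression.type": "zstd",             # compress batches at the producer
    "linger.ms": 50,                        # wait up to 50 ms to fill a batch
    "batch.size": 1_048_576,                # ~1 MB batches
    "queue.buffering.max.kbytes": 262_144,  # cap the local buffer (~256 MB)
    "acks": "all",
    "retries": 5,
    "retry.backoff.ms": 200,
})

def ship_log(service: str, region: str, line: bytes) -> None:
    # Key by service+region so a single service's burst stays on its own
    # partitions and consumers can be scaled per key range.
    key = f"{service}:{region}".encode()
    try:
        producer.produce("logs.raw", key=key, value=line)
    except BufferError:
        # Local queue is full: block briefly (backpressure) rather than drop.
        producer.flush(5)
        producer.produce("logs.raw", key=key, value=line)
```

In an interview you would pair this with a partition budget: peak bytes/sec divided by a conservative per-partition throughput target gives a starting partition count.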
Regex can be expensive and unbounded searches can take down your cluster. Interviewers look for pragmatic safeguards and index design that narrows the search space before running costly operations.
- Pre-filter by time and metadata (service, env, level) using structured fields; you could route queries via index aliases to only the relevant time-sharded indices.
- Constrain regex: require anchored or bounded patterns, enforce timeouts and shard request limits, and add circuit breakers plus per-tenant query quotas to avoid noisy-neighbor effects.
- Consider n-gram/edge n-gram or keyword subfields for common patterns, and fall back to a raw scan over cold storage for rare, wide regex across large windows (with async results).
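A minimal sketch of the query-side guardrails, assuming Elasticsearch with the official Python client (OpenSearch's query DSL has the same shape via opensearch-py); the index alias, field names, and limits are illustrative assumptions.

```python
# Guarded regex search: cheap term/range filters narrow the candidate set
# before the regexp runs, and server-side limits bound the cost of the query.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://elasticsearch:9200")

query = {
    "bool": {
        "filter": [
            {"term": {"service": "checkout"}},
            {"term": {"env": "prod"}},
            {"range": {"@timestamp": {"gte": "now-15m", "lte": "now"}}},
        ],
        "must": [
            # The regexp only runs on documents that survive the filters above.
            {"regexp": {"message": {"value": "order_id=[0-9]{8}"}}},
        ],
    }
}

resp = es.search(
    index="logs-read",         # alias covering only the relevant time-sharded indices
    query=query,
    size=200,
    timeout="5s",              # server-side timeout; partial results on expiry
    terminate_after=100_000,   # cap documents examined per shard
)
```

Per-tenant quotas and circuit breakers could live in a thin query gateway in front of this call rather than inside the cluster itself.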
Batch export is a long-running workflow that should be decoupled, idempotent, and restartable. Interviewers expect durability, at-least-once delivery with idempotent handling (effectively exactly-once), and backfills that don't reprocess everything.
- Write raw logs to object storage (e.g., time/service partitioned) in parallel with search indexing; then run a scheduled job that composes a manifest and creates tars from immutable files.
- Make the export idempotent: deterministic file naming by window, checkpoints, and a manifest so retries don't duplicate; publish results via pre-signed URLs.
- Throttle and isolate resources (separate compute queues) so exports can't starve the ingest or search clusters; consider chunked tars (e.g., 5–10GB) to bound failure domains.
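A sketch of one idempotent export step, assuming raw logs already land in S3 under time/service-partitioned prefixes and using boto3; bucket names, the prefix layout, and the manifest shape are illustrative assumptions (a production job would also chunk tars and stream rather than buffer in memory).

```python
# Idempotent export of one service/window: deterministic object names mean a
# retried run overwrites the same tar and manifest instead of duplicating them.
import io
import json
import tarfile

import boto3

s3 = boto3.client("s3")
RAW_BUCKET, EXPORT_BUCKET = "logs-raw", "logs-exports"

def export_window(service: str, window: str) -> str:
    """Bundle one window (e.g. '2025-01-15T00') into a tar plus manifest."""
    prefix = f"{service}/dt={window}/"
    keys = [
        obj["Key"]
        for page in s3.get_paginator("list_objects_v2").paginate(
            Bucket=RAW_BUCKET, Prefix=prefix
        )
        for obj in page.get("Contents", [])
    ]

    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for key in keys:
            body = s3.get_object(Bucket=RAW_BUCKET, Key=key)["Body"].read()
            info = tarfile.TarInfo(name=key)
            info.size = len(body)
            tar.addfile(info, io.BytesIO(body))

    tar_key = f"{service}/{window}/part-000.tar.gz"   # deterministic per window
    s3.put_object(Bucket=EXPORT_BUCKET, Key=tar_key, Body=buf.getvalue())

    manifest = {"service": service, "window": window, "files": keys, "tar": tar_key}
    s3.put_object(
        Bucket=EXPORT_BUCKET,
        Key=f"{service}/{window}/manifest.json",
        Body=json.dumps(manifest).encode(),
    )
    # Hand analytics a time-limited download link instead of direct bucket access.
    return s3.generate_presigned_url(
        "get_object", Params={"Bucket": EXPORT_BUCKET, "Key": tar_key}, ExpiresIn=86400
    )
```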
At 10TB/hour, retention is the difference between a feasible and an impossible design. Interviewers look for lifecycle policies, sharding/rollover strategies, and compression choices that match query patterns.
- Use index lifecycle management: roll over time-based indices by size/shard count, move data from hot (fast SSD) to warm/cold tiers, and set TTL-based deletion; tune replication differently per tier.
- Store compressed raw data in object storage as the source of truth; keep only the recent window in Elasticsearch/OpenSearch to meet latency SLOs.
- Reduce index bloat: only index needed fields, prefer doc_values for aggregations, and avoid storing full _source if you can reconstruct it from object storage.
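A sketch of such a lifecycle policy via the Elasticsearch ILM API and Python client (OpenSearch's ISM policies have a similar shape); the tier timings, rollover sizes, and 3-day retention are illustrative assumptions.

```python
# Hot/warm/delete lifecycle: roll hot indices by size or age, demote them to a
# cheaper warm tier, and delete once object storage is the only copy needed.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://elasticsearch:9200")

es.ilm.put_lifecycle(
    name="logs-hot-warm-delete",
    policy={
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {"max_primary_shard_size": "50gb", "max_age": "1h"}
                }
            },
            "warm": {
                "min_age": "6h",
                "actions": {
                    "allocate": {"number_of_replicas": 0},  # cheaper warm tier
                    "forcemerge": {"max_num_segments": 1},
                },
            },
            "delete": {
                "min_age": "3d",  # raw data in object storage stays the source of truth
                "actions": {"delete": {}},
            },
        }
    },
)
```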
Relevant Patterns
Relevant patterns that you should know for this question
The system must ingest 10TB/hour from many producers. You need partitioning, batching, compression, and backpressure to scale the write path safely without data loss or melted nodes.
Live tailing logs for debugging requires pushing updates to clients within seconds. A dedicated real-time path (e.g., SSE/WebSockets) decoupled from indexing keeps latency low during spikes; a sketch of such a tail endpoint follows this list.
Producing 24–48 hour tar bundles is a multi-hour, failure-prone job. Treating it as a durable, idempotent workflow with checkpoints ensures reliable delivery without impacting the hot path.
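As referenced above, here is a minimal live-tail sketch assuming logs flow through Kafka as JSON records and using FastAPI with Server-Sent Events; the topic, record fields, and framework choice are illustrative assumptions.

```python
# Live tail over SSE: an ephemeral consumer starts at "latest" (like tail -f)
# and pushes matching lines to the client, independent of index lag.
import json
import uuid

from confluent_kafka import Consumer
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/tail/{service}")
def tail(service: str):
    consumer = Consumer({
        "bootstrap.servers": "kafka:9092",
        "group.id": f"tail-{uuid.uuid4()}",  # ephemeral group: no shared offsets
        "auto.offset.reset": "latest",       # start from "now"
    })
    consumer.subscribe(["logs.raw"])

    def stream():
        try:
            while True:
                msg = consumer.poll(1.0)
                if msg is None or msg.error():
                    continue
                record = json.loads(msg.value())
                if record.get("service") == service:
                    # SSE framing: "data: <payload>\n\n"
                    yield f"data: {record.get('message', '')}\n\n"
        finally:
            consumer.close()

    return StreamingResponse(stream(), media_type="text/event-stream")
```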
Relevant Technologies
Relevant technologies that could be used to solve this question
Elasticsearch/OpenSearch provides near-real-time indexing and search, including support for text queries and controlled regex. Time-based indices, shard routing, and ILM are a great fit for log search.
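To show how those pieces connect, here is a sketch of a time-based index template that maps only the fields queries filter on and attaches the lifecycle policy from the retention deep dive; the template name, shard count, and field set are illustrative assumptions.

```python
# Index template for rolling logs-* indices: static mappings keep the index
# lean, and lifecycle settings tie new indices into hot/warm/delete rollover.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://elasticsearch:9200")

es.indices.put_index_template(
    name="logs-template",
    index_patterns=["logs-*"],
    template={
        "settings": {
            "number_of_shards": 8,                       # sized to ingest throughput
            "index.lifecycle.name": "logs-hot-warm-delete",
            "index.lifecycle.rollover_alias": "logs-write",
        },
        "mappings": {
            "dynamic": False,                            # avoid bloat from stray fields
            "properties": {
                "@timestamp": {"type": "date"},
                "service": {"type": "keyword"},
                "env": {"type": "keyword"},
                "level": {"type": "keyword"},
                "message": {"type": "text"},
            },
        },
    },
)
```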
Similar Problems to Practice
Related problems to practice for this question
Both systems ingest massive event streams, require durable buffering, near-real-time processing, and often separate hot (queryable) and cold (batch analytics) paths.
Live comments and live log tail both need low-latency fan-out to clients, backpressure handling, and isolation so real-time updates are not blocked by slower downstream consumers.
Designing search over large, time-partitioned data with relevance and latency constraints parallels building indexed log search with careful sharding, routing, and query guardrails.
Red Flags to Avoid
Common mistakes that can sink candidates in an interview
Question Timeline
See when this question was last asked and where, including any notes left by other candidates.
Mid April, 2025
Meta
Principal
Early March, 2025
Doordash
Manager
Mid January, 2025
Doordash
Manager
10TB/hr of data is sent from many services. There are 3 main flows: 1. Teams should be able to see the data in real time for debugging. 2. Teams can get exact data by issuing a regex query against the logs in real time. 3. The analytics team can look at data from 24–48 hrs, which they receive in the form of a tar file.