Design a Log Aggregation Service
Design a log aggregation service that processes 10TB/hr of data from multiple services, supporting real-time log viewing for debugging, regex-based queries for exact data retrieval, and batch delivery of historical data (24-48 hours) as tar files for analytics teams.
Asked at:
Doordash
Meta
A log aggregation service is a centralized platform where engineering teams ship application logs, search them in near real time, and export historical data for analytics. Think of ELK/OpenSearch or Splunk: developers can tail logs live to debug incidents, run regex searches to pinpoint exact events, and data teams can pull larger time windows for offline analysis. Interviewers ask this to test whether you can design high-throughput ingest (10TB/hour), low-latency search, and cost-effective retention/export pipelines—all while preserving reliability and operability. Strong answers show clarity on partitioning, backpressure, indexing, query safety, and batch workflows, not just picking a tool. The goal is to see you decompose the problem into real-time, search, and batch layers and make explicit trade-offs.
Common Functional Requirements
Most candidates end up covering this set of core functionalities
Users should be able to stream logs from many services reliably and at high throughput without losing data.
Users should be able to view logs in real time (live tail) with second-level latency for debugging.
Users should be able to run regex-based queries on recent logs and receive results quickly.
Analytics users should be able to receive 24–48 hour historical log bundles as downloadable tar files on a predictable schedule.
Common Deep Dives
Common follow-up questions interviewers like to ask for this question
This is where many designs fall over: sustained high write rates create backpressure that can ripple through producers, queues, and consumers. Interviewers want to see partitioning, batching, compression, and clear SLOs for end-to-end latency.
- Partition aggressively by time and source (e.g., service, region) so you can add consumers and avoid hot partitions; you could budget partitions based on peak bytes/sec and a target per-partition throughput.
- Use producer-side batching and compression (e.g., snappy/zstd) and configure backpressure (queue limits, retries with jitter) so your Kafka clients shed load gracefully instead of crashing.
- Separate paths for live tail vs. indexing: you could stream from Kafka to a lightweight tail service for WebSocket/SSE delivery while a separate pipeline indexes into search storage; this decouples live latency from index lag.
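To make the producer-side knobs concrete, here is a minimal sketch assuming a Kafka-backed ingest path and the confluent-kafka Python client; the topic name, key scheme, and numeric limits are illustrative assumptions, not prescriptions.

```python
# Producer tuned for high-throughput log shipping: compressed batches,
# bounded local buffering, and explicit backpressure on overflow.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "kafka:9092",
    "compression.type": "zstd",             # compress batches at the producer
    "linger.ms": 50,                        # wait up to 50 ms to fill a batch
    "batch.size": 1_048_576,                # ~1 MB batches
    "queue.buffering.max.kbytes": 262_144,  # cap the local buffer (~256 MB)
    "acks": "all",
    "retries": 5,
    "retry.backoff.ms": 200,
})

def ship_log(service: str, region: str, line: bytes) -> None:
    # Key by service+region so a single service's burst stays on its own
    # partitions and consumers can be scaled per key range.
    key = f"{service}:{region}".encode()
    try:
        producer.produce("logs.raw", key=key, value=line)
    except BufferError:
        # Local queue is full: block briefly (backpressure) rather than drop.
        producer.flush(5)
        producer.produce("logs.raw", key=key, value=line)
```

In an interview you would pair this with a partition budget: peak bytes/sec divided by a conservative per-partition throughput target gives a starting partition count.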
Regex can be expensive and unbounded searches can take down your cluster. Interviewers look for pragmatic safeguards and index design that narrows the search space before running costly operations.
- Pre-filter by time and metadata (service, env, level) using structured fields; you could route queries via index aliases to only the relevant time-sharded indices.
- Constrain regex: require anchored or bounded patterns, enforce timeouts and shard request limits, and add circuit breakers plus per-tenant query quotas to avoid noisy-neighbor effects.
- Consider n-gram/edge n-gram or keyword subfields for common patterns, and fall back to a raw scan over cold storage for rare, wide regex across large windows (with async results).
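A minimal sketch of the query-side guardrails, assuming Elasticsearch with the official Python client (OpenSearch's query DSL has the same shape via opensearch-py); the index alias, field names, and limits are illustrative assumptions.

```python
# Guarded regex search: cheap term/range filters narrow the candidate set
# before the regexp runs, and server-side limits bound the cost of the query.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://elasticsearch:9200")

query = {
    "bool": {
        "filter": [
            {"term": {"service": "checkout"}},
            {"term": {"env": "prod"}},
            {"range": {"@timestamp": {"gte": "now-15m", "lte": "now"}}},
        ],
        "must": [
            # The regexp only runs on documents that survive the filters above.
            {"regexp": {"message": {"value": "order_id=[0-9]{8}"}}},
        ],
    }
}

resp = es.search(
    index="logs-read",         # alias covering only the relevant time-sharded indices
    query=query,
    size=200,
    timeout="5s",              # server-side timeout; partial results on expiry
    terminate_after=100_000,   # cap documents examined per shard
)
```

Per-tenant quotas and circuit breakers could live in a thin query gateway in front of this call rather than inside the cluster itself.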
Batch export is a long-running workflow that should be decoupled, idempotent, and restartable. Interviewers expect durability, at-least-once delivery with idempotent handling (effectively exactly-once), and backfills that don't reprocess everything.
- Write raw logs to object storage (e.g., time/service partitioned) in parallel with search indexing; then run a scheduled job that composes a manifest and creates tars from immutable files.
- Make the export idempotent: deterministic file naming by window, checkpoints, and a manifest so retries don't duplicate; publish results via pre-signed URLs.
- Throttle and isolate resources (separate compute queues) so exports can't starve the ingest or search clusters; consider chunked tars (e.g., 5–10GB) to bound failure domains.
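A sketch of one idempotent export step, assuming raw logs already land in S3 under time/service-partitioned prefixes and using boto3; bucket names, the prefix layout, and the manifest shape are illustrative assumptions (a production job would also chunk tars and stream rather than buffer in memory).

```python
# Idempotent export of one service/window: deterministic object names mean a
# retried run overwrites the same tar and manifest instead of duplicating them.
import io
import json
import tarfile

import boto3

s3 = boto3.client("s3")
RAW_BUCKET, EXPORT_BUCKET = "logs-raw", "logs-exports"

def export_window(service: str, window: str) -> str:
    """Bundle one window (e.g. '2025-01-15T00') into a tar plus manifest."""
    prefix = f"{service}/dt={window}/"
    keys = [
        obj["Key"]
        for page in s3.get_paginator("list_objects_v2").paginate(
            Bucket=RAW_BUCKET, Prefix=prefix
        )
        for obj in page.get("Contents", [])
    ]

    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for key in keys:
            body = s3.get_object(Bucket=RAW_BUCKET, Key=key)["Body"].read()
            info = tarfile.TarInfo(name=key)
            info.size = len(body)
            tar.addfile(info, io.BytesIO(body))

    tar_key = f"{service}/{window}/part-000.tar.gz"   # deterministic per window
    s3.put_object(Bucket=EXPORT_BUCKET, Key=tar_key, Body=buf.getvalue())

    manifest = {"service": service, "window": window, "files": keys, "tar": tar_key}
    s3.put_object(
        Bucket=EXPORT_BUCKET,
        Key=f"{service}/{window}/manifest.json",
        Body=json.dumps(manifest).encode(),
    )
    # Hand analytics a time-limited download link instead of direct bucket access.
    return s3.generate_presigned_url(
        "get_object", Params={"Bucket": EXPORT_BUCKET, "Key": tar_key}, ExpiresIn=86400
    )
```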
At 10TB/hour, retention is the difference between a feasible and an impossible design. Interviewers look for lifecycle policies, sharding/rollover strategies, and compression choices that match query patterns.
- Use index lifecycle management: roll over time-based indices by size/shard count, move data from hot (fast SSD) to warm/cold tiers, and set TTL-based deletion; tune replication differently per tier.
- Store compressed raw data in object storage as the source of truth; keep only the recent window in Elasticsearch/OpenSearch to meet latency SLOs.
- Reduce index bloat: only index needed fields, prefer doc_values for aggregations, and avoid storing full _source if you can reconstruct it from object storage.
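A sketch of such a lifecycle policy via the Elasticsearch ILM API and Python client (OpenSearch's ISM policies have a similar shape); the tier timings, rollover sizes, and 3-day retention are illustrative assumptions.

```python
# Hot/warm/delete lifecycle: roll hot indices by size or age, demote them to a
# cheaper warm tier, and delete once object storage is the only copy needed.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://elasticsearch:9200")

es.ilm.put_lifecycle(
    name="logs-hot-warm-delete",
    policy={
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {"max_primary_shard_size": "50gb", "max_age": "1h"}
                }
            },
            "warm": {
                "min_age": "6h",
                "actions": {
                    "allocate": {"number_of_replicas": 0},  # cheaper warm tier
                    "forcemerge": {"max_num_segments": 1},
                },
            },
            "delete": {
                "min_age": "3d",  # raw data in object storage stays the source of truth
                "actions": {"delete": {}},
            },
        }
    },
)
```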
Relevant Patterns
Relevant patterns that you should know for this question
The system must ingest 10TB/hour from many producers. You need partitioning, batching, compression, and backpressure to scale the write path safely without data loss or melted nodes.
Live tailing logs for debugging requires pushing updates to clients within seconds. A dedicated real-time path (e.g., SSE/WebSockets) decoupled from indexing keeps latency low during spikes; a sketch of such a tail endpoint follows this list.
Producing 24–48 hour tar bundles is a multi-hour, failure-prone job. Treating it as a durable, idempotent workflow with checkpoints ensures reliable delivery without impacting the hot path.
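As referenced above, here is a minimal live-tail sketch assuming logs flow through Kafka as JSON records and using FastAPI with Server-Sent Events; the topic, record fields, and framework choice are illustrative assumptions.

```python
# Live tail over SSE: an ephemeral consumer starts at "latest" (like tail -f)
# and pushes matching lines to the client, independent of index lag.
import json
import uuid

from confluent_kafka import Consumer
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/tail/{service}")
def tail(service: str):
    consumer = Consumer({
        "bootstrap.servers": "kafka:9092",
        "group.id": f"tail-{uuid.uuid4()}",  # ephemeral group: no shared offsets
        "auto.offset.reset": "latest",       # start from "now"
    })
    consumer.subscribe(["logs.raw"])

    def stream():
        try:
            while True:
                msg = consumer.poll(1.0)
                if msg is None or msg.error():
                    continue
                record = json.loads(msg.value())
                if record.get("service") == service:
                    # SSE framing: "data: <payload>\n\n"
                    yield f"data: {record.get('message', '')}\n\n"
        finally:
            consumer.close()

    return StreamingResponse(stream(), media_type="text/event-stream")
```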
Relevant Technologies
Relevant technologies that could be used to solve this question
Elasticsearch/OpenSearch provides near-real-time indexing and search, including support for text queries and controlled regex. Time-based indices, shard routing, and ILM are a great fit for log search.
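To show how those pieces connect, here is a sketch of a time-based index template that maps only the fields queries filter on and attaches the lifecycle policy from the retention deep dive; the template name, shard count, and field set are illustrative assumptions.

```python
# Index template for rolling logs-* indices: static mappings keep the index
# lean, and lifecycle settings tie new indices into hot/warm/delete rollover.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://elasticsearch:9200")

es.indices.put_index_template(
    name="logs-template",
    index_patterns=["logs-*"],
    template={
        "settings": {
            "number_of_shards": 8,                       # sized to ingest throughput
            "index.lifecycle.name": "logs-hot-warm-delete",
            "index.lifecycle.rollover_alias": "logs-write",
        },
        "mappings": {
            "dynamic": False,                            # avoid bloat from stray fields
            "properties": {
                "@timestamp": {"type": "date"},
                "service": {"type": "keyword"},
                "env": {"type": "keyword"},
                "level": {"type": "keyword"},
                "message": {"type": "text"},
            },
        },
    },
)
```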
Similar Problems to Practice
Related problems to practice for this question
Both systems ingest massive event streams, require durable buffering, near-real-time processing, and often separate hot (queryable) and cold (batch analytics) paths.
Live comments and live log tail both need low-latency fan-out to clients, backpressure handling, and isolation so real-time updates are not blocked by slower downstream consumers.
Designing search over large, time-partitioned data with relevance and latency constraints parallels building indexed log search with careful sharding, routing, and query guardrails.
Red Flags to Avoid
Common mistakes that can sink candidates in an interview
Question Timeline
See when this question was last asked and where, including any notes left by other candidates.
Mid April, 2025
Meta
Principal
Early March, 2025
Doordash
Manager
Mid January, 2025
Doordash
Manager
10TB/hr of data is sent from many services. There are 3 main flows: 1. Teams should be able to see the data in real time for debugging. 2. Teams can get exact data by issuing a regex query against the logs in real time. 3. The analytics team can look at data from 24–48 hrs, which they receive in the form of a tar file.