Design an Ad Click Aggregator
Design a real-time ad click aggregator that processes billions of ad events (clicks and impressions) daily, aggregates them with at most 30 seconds of end-to-end latency, and supports both real-time dashboard queries and historical analytics for up to 2 years.
Asked at: Meta, Stripe
An Ad Click Aggregator is a real-time analytics system that turns raw ad events (impressions and clicks) into trustworthy metrics by minute, hour, and day. Think of the stats you see in Google Ads or Meta Ads Manager - impressions, clicks, CTR - broken down by campaign, ad, device, and geo, and updating every few seconds. Interviewers use this problem to test your ability to design high-throughput streaming pipelines with strict freshness SLAs, ensure correctness under retries and out-of-order data, handle hot-key skew, and serve low-latency, multi-dimensional queries over years of history. Expect constraints like 200B events/day and sub-30s freshness that force you to reason about sharding, windowing, backpressure, idempotency, and tiered storage.
Common Functional Requirements
Most candidates end up covering this set of core functionalities
Users should be able to view and click ads while having those clicks and impressions reliably captured and counted exactly once, without degrading the redirect or viewing experience.
Advertisers should be able to view near real-time totals (e.g., clicks, impressions, CTR) with at most 30 seconds end-to-end delay.
Advertisers should be able to query aggregated metrics at 1-minute granularity and roll up to hourly and daily views.
Advertisers should be able to filter and group metrics by key dimensions (e.g., ad, campaign, advertiser, device, geo) and access up to 2 years of historical data.
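To make the read side of these requirements concrete, here is a minimal sketch of what the dashboard's query interface might look like. The type and field names are assumptions for illustration, not part of the problem statement.

```python
# Hypothetical read-API shape implied by the requirements above.
# All names and fields here are assumptions, not part of the original problem.
from dataclasses import dataclass, field
from datetime import datetime
from typing import Literal

Granularity = Literal["minute", "hour", "day"]

@dataclass
class MetricsQuery:
    advertiser_id: str
    start: datetime                       # inclusive, event time (UTC)
    end: datetime                         # exclusive
    granularity: Granularity = "minute"   # roll up to hour/day for long ranges
    group_by: list[str] = field(default_factory=lambda: ["campaign_id"])
    filters: dict[str, str] = field(default_factory=dict)  # e.g. {"device": "ios", "geo": "US"}

@dataclass
class MetricsRow:
    bucket_start: datetime                # start of the minute/hour/day bucket
    dimensions: dict[str, str]            # values for the group_by keys
    impressions: int
    clicks: int

    @property
    def ctr(self) -> float:
        # CTR is derived at read time so partial buckets stay consistent.
        return self.clicks / self.impressions if self.impressions else 0.0
```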
Common Deep Dives
Common follow-up questions interviewers like to ask for this question
This question probes whether you can reason end-to-end about high-throughput streaming: ingestion, partitioning, processing, and sinks. Interviewers want to see where parallelism comes from, how you size components, and how you keep latency predictable under bursty load.
- Partition the ingest log (e.g., Kafka) sufficiently and key by a stable dimension (e.g., adId) with salting for hot keys; tune batch sizes, compression, and acks to balance throughput and durability.
- Use a horizontally scalable stream processor (e.g., Flink) with event-time tumbling windows (1 minute), watermarks, and checkpointing; set operator parallelism to keep per-operator CPU below 70% and processing time well under 10s to leave headroom.
- Make sinks asynchronous and batched; write pre-aggregates instead of raw per-event updates; monitor end-to-end watermark lag and apply backpressure/auto-scaling policies to stay within the 30s SLA.
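As a rough illustration of the windowing piece, here is a minimal single-process sketch of 1-minute event-time tumbling windows with a watermark and allowed lateness. In a real design this logic would live inside a horizontally scaled stream processor such as Flink; the class names, constants, and lateness handling below are assumptions.

```python
# Minimal sketch of 1-minute event-time tumbling windows with a watermark.
# Illustrative only; a production deployment would run this inside a stream
# processor (e.g., Flink) with checkpointing and parallel operators.
from collections import defaultdict

WINDOW_MS = 60_000           # 1-minute tumbling windows
ALLOWED_LATENESS_MS = 5_000  # events later than this fall to the correction path

class MinuteAggregator:
    def __init__(self):
        # (ad_id, window_start_ms) -> running counts for that minute
        self.windows = defaultdict(lambda: {"impressions": 0, "clicks": 0})
        self.watermark = 0  # max event time seen, minus allowed lateness

    def on_event(self, ad_id: str, event_type: str, event_time_ms: int):
        self.watermark = max(self.watermark, event_time_ms - ALLOWED_LATENESS_MS)
        window_start = event_time_ms - (event_time_ms % WINDOW_MS)
        if window_start + WINDOW_MS <= self.watermark:
            # Too late for the real-time path; batch reconciliation corrects it later.
            return []
        counts = self.windows[(ad_id, window_start)]
        counts["clicks" if event_type == "click" else "impressions"] += 1
        return self._flush_closed_windows()

    def _flush_closed_windows(self):
        # Emit and evict windows whose end is at or before the watermark.
        closed = [k for k in self.windows if k[1] + WINDOW_MS <= self.watermark]
        return [(k, self.windows.pop(k)) for k in closed]
```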
Trust in analytics depends on correctness. Networks retry, clients go offline, and events arrive late; your design must be idempotent by construction and resilient to disorder without stalling real-time updates.
- Assign unique, signed IDs (clickId/impressionId) and deduplicate near the edge using a short-TTL cache; make downstream writes idempotent (e.g., upserts, additive counters with versioning, or two-phase-commit sinks).
- Process by event time with watermarks and allowed lateness; emit incremental window updates and retractions when late events land; prefer exactly-once semantics in the stream processor and sinks where feasible.
- Pair the real-time pipeline with periodic batch reconciliation from a durable data lake to re-aggregate and correct drift, then backfill the corrected aggregates.
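A minimal sketch of the dedup-plus-idempotent-write idea, assuming a short-TTL cache keyed by event ID (an in-memory stand-in for something like a Redis SET-if-absent with expiry) and an additive counter store. All names here are illustrative assumptions.

```python
# Sketch of edge deduplication with a short-TTL cache plus an idempotent,
# additive counter update. In-memory structures stand in for Redis and the
# aggregate store to keep the example self-contained.
import time

class TtlDedupCache:
    """In-memory stand-in for a set-if-absent-with-expiry dedup check."""
    def __init__(self, ttl_seconds: int = 600):
        self.ttl = ttl_seconds
        self.seen = {}  # event_id -> expiry timestamp

    def first_time(self, event_id: str) -> bool:
        now = time.time()
        # Evict expired entries lazily so memory stays bounded by the TTL window.
        self.seen = {k: exp for k, exp in self.seen.items() if exp > now}
        if event_id in self.seen:
            return False
        self.seen[event_id] = now + self.ttl
        return True

def apply_event(store: dict, dedup: TtlDedupCache, event: dict) -> None:
    """Idempotent counter update: duplicates (client retries, redelivery) become no-ops."""
    if not dedup.first_time(event["event_id"]):
        return
    bucket = (event["ad_id"], event["minute_bucket"])
    counts = store.setdefault(bucket, {"impressions": 0, "clicks": 0})
    counts["clicks" if event["type"] == "click" else "impressions"] += 1
```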
Skew is common in advertising: viral creatives or major campaigns can create hotspots that melt a single partition or row. Interviewers want to see proactive skew mitigation across both compute and storage.
- Use key-splitting (salting) for hot entities (adId#0..N) and merge the shards downstream; size N adaptively based on observed QPS to keep per-shard throughput within limits.
- Employ a skew-aware partitioner and dynamic rebalancing to spread hot shards across more partitions/executors; keep partition throughput within target p99 latencies.
- Avoid hot counters in storage by bucketing by time (minute/day) and shardId; aggregate on read or via background compactions that consolidate shards.
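A small sketch of key-splitting for hot ads: writes salt the partition key across N shards, and reads merge the shards back into one logical counter. The per-ad shard counts below are hard-coded assumptions; in practice N would be sized from observed QPS.

```python
# Sketch of key-splitting (salting) for hot adIds, with read-time merge.
# The hot-key table and shard fan-out are illustrative assumptions.
import random

HOT_AD_SHARDS = {"ad_viral_123": 16}   # adId -> shard fan-out (assumed, QPS-driven in practice)

def salted_partition_key(ad_id: str) -> str:
    """Spread writes for a hot ad across N shard keys, e.g. 'ad_viral_123#7'."""
    shards = HOT_AD_SHARDS.get(ad_id, 1)
    if shards == 1:
        return ad_id
    return f"{ad_id}#{random.randrange(shards)}"

def merge_shard_counts(read_counter, ad_id: str, minute_bucket: int) -> int:
    """Sum the per-shard counters back into one logical value at read time."""
    shards = HOT_AD_SHARDS.get(ad_id, 1)
    keys = [ad_id] if shards == 1 else [f"{ad_id}#{i}" for i in range(shards)]
    return sum(read_counter(key, minute_bucket) for key in keys)
```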
This tests whether you can separate serving paths for fresh data versus historical queries and avoid scanning raw events. Pre-aggregation, rollups, and smart partitioning are critical to meet latency and cost goals.
- Keep minute-level pre-aggregates in a fast store for recent days, and maintain hourly/daily rollups for older ranges; use TTL to expire fine-grained data and stitch minute and rollup results at query time.
- Partition/index by advertiser/campaign/ad and day; precompute common group-bys and materialize them to avoid large fan-in reads; keep aggregates compact and column-friendly.
- For Top-K, consider streaming heavy-hitter sketches (e.g., Count-Min Sketch + heap) or windowed Top-K operators for fresh results, with offline exact recompute for long ranges.
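For the Top-K path, here is a compact sketch of the Count-Min Sketch plus heap approach mentioned above. The sketch width/depth and K are illustrative assumptions, and long-range Top-K would still be recomputed exactly offline.

```python
# Sketch of a streaming heavy-hitters (Top-K) operator: a Count-Min Sketch for
# approximate counts plus a small candidate set for ranking. Parameters are
# illustrative assumptions.
import hashlib
import heapq

class CountMinSketch:
    def __init__(self, width: int = 2048, depth: int = 4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _hashes(self, key: str):
        for row in range(self.depth):
            digest = hashlib.blake2b(f"{row}:{key}".encode(), digest_size=8).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, key: str, count: int = 1) -> int:
        estimates = []
        for row, col in self._hashes(key):
            self.table[row][col] += count
            estimates.append(self.table[row][col])
        return min(estimates)  # CMS only over-counts, never under-counts

class TopK:
    def __init__(self, k: int = 10):
        self.k, self.sketch, self.counts = k, CountMinSketch(), {}

    def offer(self, ad_id: str):
        """Record one click for ad_id and return the current Top-K estimate."""
        self.counts[ad_id] = self.sketch.add(ad_id)
        if len(self.counts) > 4 * self.k:
            # Keep only the current top candidates to bound memory.
            top = heapq.nlargest(2 * self.k, self.counts.items(), key=lambda kv: kv[1])
            self.counts = dict(top)
        return heapq.nlargest(self.k, self.counts.items(), key=lambda kv: kv[1])
```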
Relevant Patterns
Relevant patterns that you should know for this question
The system ingests millions of events per second with a strict freshness SLA. Partitioned logs, horizontally scalable stream processors, and sharded counters are mandatory to sustain throughput without creating bottlenecks or violating latency guarantees.
Advertisers query large time ranges with multi-dimensional filters. Pre-aggregation, materialized rollups, and read-optimized schemas enable sub-second queries while keeping costs and fan-in under control.
Traffic is highly skewed: a few ads can dominate. Mitigating hot keys via key-splitting, skew-aware partitioning, and sharded time-bucketed counters prevents hotspots in both compute and storage.
Relevant Technologies
Relevant technologies that could be used to solve this question
Similar Problems to Practice
Related problems to practice for this question
Both require streaming aggregation and ranking over high-velocity data, often within time windows. Techniques like heavy-hitter detection, partitioned counting, and windowed Top-K apply directly.
Real-time counting and fan-in aggregation under bursty load with low-latency updates is shared. Handling out-of-order events and ensuring freshness are core challenges in both systems.
Both deal with extremely hot counters and contention. Sharding keys, time bucketing, and avoiding single-row hotspots are essential patterns in each design.
Red Flags to Avoid
Common mistakes that can sink candidates in an interview
Question Timeline
See when this question was last asked and where, including any notes left by other candidates.
Mid September, 2025
Meta
Senior
Early September, 2025
Meta
Manager
Early September, 2025
Meta
Manager
Question: The goal is to design a real-time aggregation system to support an advertiser dashboard. Details:
- The web server logs events to our system using the function call log(ad_id, user_id, event_type).
- The aggregation system should count these events in near real-time, with a maximum latency of 30s between an event occurring and that event affecting CTR on the dashboard.
- There are two event types, "ad was shown" and "ad was clicked on"; CTR = # ads clicked / # ads shown.
- Provide a read API describing how the advertiser dashboard will query data from our system.
- Keep historical data in the system for 2 years.
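As a quick illustration of the interface described in this note, a toy in-memory version of log() and the CTR read path might look like the following. Everything beyond the log(ad_id, user_id, event_type) signature and the CTR formula is an assumption.

```python
# Toy illustration of the log() call and CTR read path from the note above.
# Event type strings and storage layout are assumptions.
SHOWN, CLICKED = "shown", "clicked"
counts: dict[tuple[str, str], int] = {}  # (ad_id, event_type) -> count

def log(ad_id: str, user_id: str, event_type: str) -> None:
    # user_id is unused in this toy counter; a real system would keep it for dedup/audit.
    counts[(ad_id, event_type)] = counts.get((ad_id, event_type), 0) + 1

def ctr(ad_id: str) -> float:
    shown = counts.get((ad_id, SHOWN), 0)
    clicked = counts.get((ad_id, CLICKED), 0)
    return clicked / shown if shown else 0.0
```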