Design a Metrics Aggregator
Design a metrics aggregation system that collects count-based metrics (like user signups, system errors, ad clicks) from services via a client library and displays them as histogram data on a dashboard. The system should support querying aggregated metrics within specific time frames for monitoring and analytics purposes.
Asked at:
Stripe
Meta

A StatsD/Datadog-style metrics aggregation platform that lets services emit count-based events (e.g., user_signups, system_error, ad_click) via a lightweight client library. The backend ingests a high-volume firehose, aggregates counts into time buckets, and powers dashboards that show histograms and time-series charts with filters by tags and time ranges. Interviewers ask this to test your ability to design high-throughput, low-latency ingestion and aggregation pipelines, reason about windowed counters, hot-key contention, and storage layouts for time-series. It also probes trade-offs around consistency vs availability, real-time vs batch rollups, cardinality control, and how you expose fast queries without overwhelming the write path.
Common Functional Requirements
Most candidates end up covering this set of core functionalities
Users should be able to instrument services with a client library to emit count metrics with optional tags (e.g., env=prod, region=us-east).
Users should be able to view near–real-time histograms and time-series graphs for selected metrics over a chosen time range and resolution (e.g., 1s, 10s, 1m).
Users should be able to query aggregated counts filtered by metric name and tags to support monitoring and analytics use cases.
Users should be able to retrieve aggregated metrics within specific time frames with predictable latency, even under high write load.
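To make the client-library requirement concrete, here is a minimal sketch of emitting a tagged count as a StatsD/DogStatsD-style UDP datagram; the agent address, metric name, and tag values are illustrative assumptions, not a specific library's API.

```python
import socket

# Assumed local agent address; real deployments typically send to a sidecar
# or host-level agent that forwards to the ingestion tier.
AGENT_ADDR = ("127.0.0.1", 8125)
_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def emit_count(metric: str, value: int = 1, tags: dict | None = None) -> None:
    """Fire-and-forget a single counter increment in a StatsD-style line format."""
    payload = f"{metric}:{value}|c"
    if tags:
        payload += "|#" + ",".join(f"{k}:{v}" for k, v in tags.items())
    _sock.sendto(payload.encode("utf-8"), AGENT_ADDR)

# Example: count one signup, tagged so dashboards can filter by env and region.
emit_count("user_signups", 1, {"env": "prod", "region": "us-east"})
```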
Common Deep Dives
Common follow-up questions interviewers like to ask for this question
This is where many designs fall over: metrics can arrive in bursts and a few popular metrics create contention. Interviewers want to see if you can absorb spikes, avoid single-writer bottlenecks, and keep end-to-end latency low enough for a dashboard.
- You could front the ingestion with a durable log (e.g., a message bus) to decouple producers from aggregators, and shard aggregation by metric, tag set, and time bucket to scale horizontally.
- You could use sharded or striped counters in a fast in-memory store to reduce contention (multiple keys per bucket, then merge), and flush batches to long-term storage.
- You could batch and compress increments (turn N events into one increment per bucket) to reduce write amplification and network/IO overhead.
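To illustrate the striped-counter idea, here is a minimal sketch assuming a Redis-like in-memory store; the key layout and stripe count are assumptions. Each writer increments one of N stripes for a (metric, tags, bucket) key, and the reader or flusher sums the stripes back into one count.

```python
import random
import redis  # assumes a Redis-like store is available for hot counters

r = redis.Redis(host="localhost", port=6379)
STRIPES = 16  # sub-keys per counter; an assumed tuning knob for hot metrics

def bucket_key(metric: str, tags: str, bucket_start: int) -> str:
    # e.g. "ad_click|env:prod,region:us-east|1700000000"
    return f"{metric}|{tags}|{bucket_start}"

def incr_striped(metric: str, tags: str, bucket_start: int, delta: int = 1) -> None:
    """Spread increments across STRIPES keys so no single key becomes hot."""
    stripe = random.randrange(STRIPES)
    r.incrby(f"{bucket_key(metric, tags, bucket_start)}#{stripe}", delta)

def read_bucket(metric: str, tags: str, bucket_start: int) -> int:
    """Merge the stripes back into one count at read or flush time."""
    keys = [f"{bucket_key(metric, tags, bucket_start)}#{s}" for s in range(STRIPES)]
    return sum(int(v) for v in r.mget(keys) if v is not None)
```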
Counting is deceptively hard when clients retry and networks drop packets. Interviewers look for a pragmatic stance (often at-least-once) and mechanisms to mitigate double counting without sacrificing availability.
- You could adopt at-least-once delivery and design idempotence at the bucket level (e.g., include a producer_id and event_batch_id per time bucket) so aggregators can dedupe.
- You could bound late arrivals by using time windows with watermarks and allow a small correction window (e.g., 2–5 minutes) before you finalize buckets.
- You could periodically reconcile by re-aggregating from a durable log for recent windows to correct drift, trading a bit of freshness for accuracy.
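A minimal sketch of bucket-level idempotence under at-least-once delivery, assuming producers stamp each batch with a (producer_id, batch_id) pair; the in-memory dedupe state is an assumption and would be dropped once a bucket's watermark and correction window pass.

```python
from collections import defaultdict

# Per time bucket, remember which (producer_id, batch_id) pairs were already
# applied so retried batches do not double count.
applied: dict[int, set[tuple[str, str]]] = defaultdict(set)
counts: dict[tuple[str, int], int] = defaultdict(int)

def apply_batch(metric: str, bucket_start: int, producer_id: str,
                batch_id: str, delta: int) -> bool:
    """Apply a batch of increments at most once per (producer, batch, bucket)."""
    key = (producer_id, batch_id)
    if key in applied[bucket_start]:
        return False  # duplicate delivery from the at-least-once pipeline
    applied[bucket_start].add(key)
    counts[(metric, bucket_start)] += delta
    return True

def finalize_bucket(bucket_start: int) -> None:
    """Once the watermark plus correction window passes, drop dedupe state."""
    applied.pop(bucket_start, None)
```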
Dashboards query many points over long ranges; raw per-event writes won’t scale. Interviewers want to see downsampling and schema choices that keep read latency predictable.
- You could pre-aggregate into fixed rollups (e.g., 1s, 10s, 1m, 1h) and store each rollup tier separately with TTLs so older data is only kept at coarser granularity.
- You could use a wide-row time-series schema keyed by metric+tags and partitioned by time bucket to enable sequential writes and range scans.
- You could cache hot windows (last 1–15 minutes) in memory and serve colder ranges from a write-optimized time-series store to balance cost and speed.
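As a sketch of the rollup idea, the function below downsamples counts from a finer bucket size into a coarser tier; the resolutions and the plain dicts standing in for a time-series store are assumptions.

```python
from collections import defaultdict

ROLLUP_TIERS_S = [1, 10, 60, 3600]  # assumed rollup resolutions in seconds

def rollup(fine_counts: dict[int, int], coarse_res: int) -> dict[int, int]:
    """Collapse counts keyed by bucket start time into coarser buckets."""
    coarse: dict[int, int] = defaultdict(int)
    for bucket_start, count in fine_counts.items():
        coarse[bucket_start - (bucket_start % coarse_res)] += count
    return dict(coarse)

# Example: three 1s buckets collapse into a single 10s bucket.
one_second = {1700000000: 4, 1700000001: 2, 1700000003: 5}
print(rollup(one_second, 10))  # {1700000000: 11}
```

Each tier can then live in its own table or keyspace with its own TTL, so old data survives only at coarse granularity.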
High-cardinality dimensions (e.g., user_id, request_id) can explode storage and degrade performance. Strong designs proactively prevent and mitigate this.
- You could enforce allowlists/denylists for tag keys, quotas per team, and sampling or aggregation-on-ingest to cap unique series.
- You could provide a cardinality index and usage reports so teams can see expensive metrics and clean them up before they page SREs.
- You could reject or bucket excessive cardinality at ingestion (e.g., hash to a limited set) with clear client-side errors and documentation.
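A minimal sketch of cardinality control at ingestion, assuming a tag-key allowlist and a per-key value quota (the specific limits are assumptions); values over quota are hashed into a small, bounded set of overflow buckets instead of creating new series.

```python
import hashlib

ALLOWED_TAG_KEYS = {"env", "region", "service"}  # assumed allowlist
MAX_VALUES_PER_KEY = 1000                        # assumed per-key quota
OVERFLOW_BUCKETS = 64                            # assumed overflow cap
_seen: dict[str, set[str]] = {}

def sanitize_tags(tags: dict[str, str]) -> dict[str, str]:
    """Drop disallowed tag keys and cap per-key cardinality at ingest time."""
    clean: dict[str, str] = {}
    for key, value in tags.items():
        if key not in ALLOWED_TAG_KEYS:
            continue  # alternatively, reject with a clear client-side error
        values = _seen.setdefault(key, set())
        if value not in values and len(values) >= MAX_VALUES_PER_KEY:
            # Over quota: hash into a bounded bucket so series count stays capped.
            digest = int(hashlib.sha1(value.encode()).hexdigest(), 16)
            value = f"overflow_{digest % OVERFLOW_BUCKETS}"
        else:
            values.add(value)
        clean[key] = value
    return clean
```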
Relevant Patterns
Relevant patterns that you should know for this question
Metrics ingestion is write-heavy and bursty. Decoupling producers from consumers, batching increments, and horizontally sharding aggregation are essential to sustain millions of events per second while meeting near–real-time SLAs.
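A sketch of batching increments in the client or agent (flush cadence and transport are assumptions): N calls collapse into one delta per (metric, tags) key before anything leaves the process.

```python
import threading
import time
from collections import defaultdict

FLUSH_INTERVAL_S = 10  # assumed flush cadence

class BatchingCounter:
    """Accumulate increments locally and flush one delta per key per interval."""

    def __init__(self, send):
        self._send = send                 # callable that ships a batch upstream
        self._pending = defaultdict(int)  # (metric, tags) -> accumulated delta
        self._lock = threading.Lock()

    def incr(self, metric: str, tags: tuple, delta: int = 1) -> None:
        with self._lock:
            self._pending[(metric, tags)] += delta

    def flush(self) -> None:
        with self._lock:
            batch, self._pending = self._pending, defaultdict(int)
        if batch:
            bucket = int(time.time()) // FLUSH_INTERVAL_S * FLUSH_INTERVAL_S
            self._send(bucket, dict(batch))  # one network call for N events

counter = BatchingCounter(send=lambda bucket, batch: print(bucket, batch))
for _ in range(1000):
    counter.incr("ad_click", (("env", "prod"),))
counter.flush()  # ships a single delta of 1000 instead of 1000 events
```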
Popular metrics create hot keys. Using sharded counters, striped keys, and per-bucket aggregation avoids single-point contention and keeps latency low under skewed traffic.
Dashboards execute many range queries across time windows. Pre-aggregated rollups, caches for hot windows, and read-friendly schemas are required to deliver consistent, low-latency queries.
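One way to make the read path concrete is to split a query range at a hot-window cutoff: recent buckets come from an in-memory cache, older buckets from the time-series store. The cutoff and the read(metric, start, end) interfaces below are assumptions.

```python
import time

HOT_WINDOW_S = 15 * 60  # assumed: last 15 minutes are served from memory

def query_range(metric: str, start: int, end: int, hot_cache, ts_store) -> dict[int, int]:
    """Serve hot buckets from memory and colder buckets from the TSDB."""
    cutoff = int(time.time()) - HOT_WINDOW_S
    result: dict[int, int] = {}
    if start < cutoff:
        # Older part of the range comes from the durable time-series store.
        result.update(ts_store.read(metric, start, min(end, cutoff)))
    if end > cutoff:
        # Recent part comes from the in-memory cache of not-yet-flushed buckets.
        result.update(hot_cache.read(metric, max(start, cutoff), end))
    return result
```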
Relevant Technologies
Relevant technologies that could be used to solve this question
Similar Problems to Practice
Related problems to practice for this question
Both ingest a massive stream of simple events and produce time-windowed counts by dimensions (campaign, region, device). The same sharded counters, rollups, and hot-key mitigation apply.
Rate limiting relies on accurate, low-latency, windowed counters across distributed nodes. Techniques like time-bucketed keys, atomic increments, and contention mitigation directly transfer.
Finding trending items over time involves aggregations, rollups, and sometimes approximate techniques under heavy write loads. The storage and streaming patterns are closely related.
Red Flags to Avoid
Common mistakes that can sink candidates in an interview
Question Timeline
See when this question was last asked and where, including any notes left by other candidates.
Early October, 2025

Amazon
Staff
by staff, I mean L6

Design a real-time analytics dashboard for Twitter that provides content creators and businesses with insights about their tweet performance. The system should display metrics like impressions, likes, retweets, replies, and engagement rates in real-time.
* Display real-time metrics for tweets (impressions, likes, retweets, replies, clicks)
* Show engagement rates and trends over time
* Provide user-level analytics (follower growth, total impressions)
* Support time-based filtering (last hour, day, week, month)
* Export analytics data in various formats

Performance Requirements:
* Handle 500M tweets per day with real-time metric updates
* Support 10M concurrent dashboard users
* Display metrics with <5 second latency for recent data
* 99.9% availability during peak traffic

Scale: Global user base across multiple continents
* Handle viral content spikes (tweets reaching 100M+ impressions)
* Support enterprise customers with high-volume analytics needs
Late September, 2025
Stripe
Manager
Mid September, 2025
Stripe
Senior