Metrics Monitoring
Understanding the Problem
Functional Requirements
- The platform should be able to ingest metrics (CPU, memory, latency, custom counters) from services
- Users should be able to query and visualize metrics on dashboards with filters, aggregations, and time ranges
- Users should be able to define alert rules with thresholds over time windows (e.g., "alert if p99 latency > 500ms for 5 minutes")
- Users should receive notifications when alerts fire (email, Slack, PagerDuty)
Out of scope:
- Log aggregation and full-text search (a separate concern)
- Distributed tracing (spans, traces)
- Anomaly detection via ML
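The alert-rule requirement above (e.g., "alert if p99 latency > 500ms for 5 minutes") can be read as a sustained-threshold check: the rule fires only if every sample in the trailing window breaches the threshold. The following is a minimal sketch of that reading, not the platform's actual evaluator. The names `AlertRule` and `should_fire` are hypothetical, and the choice not to fire when the window has no data is an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical rule shape: threshold sustained over a time window."""
    metric: str
    threshold: float
    window_seconds: int


def should_fire(rule: AlertRule, samples: List[Tuple[int, float]], now: int) -> bool:
    """Return True if every sample in (now - window, now] exceeds the threshold.

    `samples` is a list of (unix_timestamp_seconds, value) pairs for rule.metric.
    Assumption: an empty window does not fire the alert.
    """
    window_start = now - rule.window_seconds
    in_window = [value for ts, value in samples if window_start < ts <= now]
    return bool(in_window) and all(value > rule.threshold for value in in_window)


# "alert if p99 latency > 500ms for 5 minutes"
rule = AlertRule(metric="p99_latency_ms", threshold=500.0, window_seconds=300)

# One sample per minute, all above 500 ms.
sustained = [(60, 520.0), (120, 610.0), (180, 575.0), (240, 530.0), (300, 700.0)]
# Same series, but latency briefly recovers at t=180.
with_dip = [(60, 520.0), (120, 610.0), (180, 450.0), (240, 530.0), (300, 700.0)]

fires_sustained = should_fire(rule, sustained, now=300)
fires_with_dip = should_fire(rule, with_dip, now=300)
fires_no_data = should_fire(rule, [], now=300)
```

Requiring the whole window to breach, rather than a single point, is what keeps a brief spike from paging anyone; a real evaluator would also need a policy for sparse or late samples.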
Non-Functional Requirements
- The system should scale to ingest 5M metrics per second from 500k servers
- Dashboard queries should return within seconds, even for queries spanning days or weeks
- Alerts should evaluate with low latency (< 1 minute from metric emission to alert firing)
- The system should be highly available. We can tolerate eventual consistency for dashboards, but alert evaluation should be reliable.
- The system should handle late or out-of-order data gracefully (network delays are common)
Out of scope:
- Multi-region replication (would add complexity)
- Strong consistency guarantees
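The scale target above (5M metrics per second from 500k servers) is easier to reason about once it is turned into per-server and per-day numbers. The following back-of-envelope sketch does that arithmetic. The 16 bytes per data point (an 8-byte timestamp plus an 8-byte float value, before labels and compression) is an assumption for illustration, not a figure from the requirements.

```python
# Figures taken from the non-functional requirements.
METRICS_PER_SECOND = 5_000_000
SERVERS = 500_000

# Assumption: 8-byte timestamp + 8-byte float value, ignoring labels and compression.
BYTES_PER_POINT = 16
SECONDS_PER_DAY = 86_400

# Average emission rate each server must sustain.
per_server_rate = METRICS_PER_SECOND / SERVERS

# Total data points ingested per day across the fleet.
points_per_day = METRICS_PER_SECOND * SECONDS_PER_DAY

# Raw (uncompressed) storage per day under the byte-size assumption.
raw_bytes_per_day = points_per_day * BYTES_PER_POINT
raw_terabytes_per_day = raw_bytes_per_day / 1e12

print(f"{per_server_rate:.0f} metrics/sec per server")
print(f"{points_per_day:,} points per day")
print(f"{raw_terabytes_per_day:.2f} TB per day (raw, assumed 16 B/point)")
```

Under these assumptions each server emits about 10 metrics per second, and the fleet produces 432 billion points per day, roughly 6.9 TB of raw data before any compression or downsampling.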
The Set Up
Planning the Approach
Defining the Core Entities
Data Flow
API or System Interface
High-Level Design
1) The platform can ingest metrics from services
2) Users can query and visualize metrics on dashboards
3) Users can define alert rules with thresholds
4) Users receive notifications when alerts fire
Potential Deep Dives
1) How do we serve low-latency dashboard queries over weeks of data?
2) How do we reduce alert latency below 1 minute?
3) How do we ensure high availability during spikes and failures?
4) How do we handle cardinality explosion?
What is Expected at Each Level?
Mid-level
Senior
Staff+
