Performance Measurement: SLAs, SLOs & Percentiles
To optimize performance, you must first be able to measure it accurately. Averages are often misleading in distributed systems; instead, we use contractual targets (SLAs) and latency distributions (percentiles) to understand the real-world user experience.
This content is adapted from Mastering System Design from Basics to Cracking Interviews (Udemy). It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
The Reliability Hierarchy: SLA, SLO, SLI
Performance must be trackable and governed by specific targets:
- SLA (Service Level Agreement): A contractual commitment to your customers (e.g., "99.9% uptime"). Failure usually results in financial penalties.
- SLO (Service Level Objective): An internal target that is stricter than the SLA (e.g., "99.95% uptime"). It gives your team a safety margin.
- SLI (Service Level Indicator): The actual measurement of your system at a given moment.
| Type | Definition | Example |
|---|---|---|
| SLA | Legal Contract | 99.9% Availability |
| SLO | Internal Goal | 95% of requests < 200ms |
| SLI | Real-time metric | 93% of requests < 200ms (Failing!) |
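The relationship between these three can be sketched in code: an SLI is computed from raw measurements, then compared against the SLO. The function name, latency values, and thresholds below are illustrative assumptions, not part of the course material.

```python
# Sketch: compute an SLI (fraction of requests under 200 ms) and
# check it against an internal SLO of 95%. All numbers are made up.
def sli_fast_requests(latencies_ms, threshold_ms=200):
    """SLI: fraction of requests faster than threshold_ms."""
    fast = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return fast / len(latencies_ms)

latencies = [120, 180, 90, 250, 300, 150, 170, 400, 100, 130]
sli = sli_fast_requests(latencies)   # 7 of 10 requests are fast -> 0.7
slo = 0.95                           # internal target (stricter than the SLA)

print(f"SLI: {sli:.0%}, SLO: {slo:.0%}, meeting SLO: {sli >= slo}")
```

In a real system the SLI would come from a monitoring pipeline rather than an in-memory list, but the comparison logic is the same: the SLI is the measurement, the SLO is the bar it must clear.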
Understanding Percentiles (P50, P95, P99)
Why not just use the Average? In tail-latency-sensitive systems, an average hides the outliers. If 99% of your users have a 100ms experience but 1% have a 5-second experience, the average only rises to about 149ms and still looks "okay", yet that 1% of your users is frustrated.
- P50 (Median): The midpoint; 50% of requests are faster than this.
- P95: 95% of requests are faster than this threshold.
- P99: Captures the "tail latency", the slowest 1% of requests. This is critical for high-scale applications where 1% can mean thousands of users.
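The contrast above can be made concrete with a toy sample. This is a minimal nearest-rank percentile sketch (real monitoring systems typically use histograms); the sample of 98 fast and 2 slow requests is an invented illustration.

```python
# Sketch: nearest-rank percentiles over a sorted latency sample,
# showing how the average stays low while P99 exposes the slow tail.
def percentile(sorted_vals, p):
    """Nearest-rank percentile: the value at rank ceil-ish p% of the list."""
    idx = max(0, int(round(p / 100 * len(sorted_vals))) - 1)
    return sorted_vals[idx]

# 98 fast requests (100 ms) and 2 very slow ones (5000 ms)
samples = sorted([100] * 98 + [5000] * 2)

avg = sum(samples) / len(samples)   # 198.0 ms -- looks almost fine
p50 = percentile(samples, 50)       # 100 ms -- the typical user
p95 = percentile(samples, 95)       # 100 ms -- still hides the tail
p99 = percentile(samples, 99)       # 5000 ms -- the tail finally surfaces

print(f"avg={avg} p50={p50} p95={p95} p99={p99}")
```

Note how even P95 can miss a 2%-of-traffic problem; this is why high-scale services track P99 (and sometimes P99.9) rather than stopping at the average.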
Performance vs. Cost Trade-offs
Cloud performance isn't free. Optimization decisions often align with budget:
- Provisioned IOPS: Faster disk speed costs more.
- Reserved vs. On-Demand: Reserved is cheaper but less flexible.
- Cold Starts: Serverless is cost-efficient, but a function that has been idle incurs extra startup latency on its next invocation.
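The reserved-vs-on-demand trade-off boils down to a break-even calculation on utilization. The prices and discount below are invented for illustration; real cloud pricing varies by provider, region, and commitment term.

```python
# Toy break-even sketch (made-up prices): on-demand at $0.10/hr vs a
# reservation at a 40% discount that is billed for every hour of the year.
ON_DEMAND_HOURLY = 0.10
RESERVED_HOURLY = 0.06          # discounted, but billed whether used or not
HOURS_PER_YEAR = 8760

def yearly_cost(utilization):
    """Yearly cost of each model at a given utilization (0.0 to 1.0)."""
    on_demand = ON_DEMAND_HOURLY * HOURS_PER_YEAR * utilization
    reserved = RESERVED_HOURLY * HOURS_PER_YEAR   # flat, regardless of use
    return on_demand, reserved

for util in (0.3, 0.6, 0.9):
    od, res = yearly_cost(util)
    winner = "reserved" if res < od else "on-demand"
    print(f"{util:.0%} utilization: on-demand ${od:.0f}, "
          f"reserved ${res:.0f} -> {winner} wins")
```

With these assumed numbers the break-even sits at 60% utilization: below it, on-demand is cheaper; above it, the reservation pays off. This is why teams typically reserve capacity only for the steady baseline load and auto-scale on-demand instances for spikes.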
Interview Questions - Measurement & Monitoring
1. Why are percentiles (like P95, P99) more important than averages?
Answer: Averages hide outliers. If 1% of your users experience 10-second delays, the P99 will surface that problem immediately, while the average might only shift by a few milliseconds. P99 captures the "tail latency" that affects user retention.
2. How do SLAs, SLOs, and SLIs differ?
Answer:
- SLA is the legal promise (99.9% uptime).
- SLO is the internal target (99.95% uptime).
- SLI is what you actually measure (99.92% today).
- Analogy: An SLA is the warranty, the SLO is the factory test target, and the SLI is the current speedometer reading.
3. Explain the trade-offs between performance and cost in the cloud.
Answer: Faster performance (low latency) usually requires larger instances or provisioned resources, increasing cost. We balance this by using Caching, Auto-scaling (to only pay for what we use), and choosing Reserved Instances for baseline loads.
What's next? How to detect bottlenecks and stress-test your system → Performance Testing & Monitoring