Performance Measurement: SLAs, SLOs & Percentiles
To optimize performance, you must first be able to measure it accurately. Averages are often misleading in distributed systems; instead, we use contractual targets (SLAs) and latency distributions (percentiles) to understand the real-world user experience.
This content is adapted from Mastering System Design from Basics to Cracking Interviews (Udemy). It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
The Reliability Hierarchy: SLA, SLO, SLI
Performance must be trackable and governed by specific targets:
- SLA (Service Level Agreement): A contractual commitment to your customers (e.g., "99.9% uptime"). Failure usually results in financial penalties.
- SLO (Service Level Objective): An internal target that is stricter than the SLA (e.g., "99.95% uptime"). It gives your team a safety margin.
- SLI (Service Level Indicator): The actual measurement of your system at a given moment.
| Type | Definition | Example |
|---|---|---|
| SLA | Legal Contract | 99.9% Availability |
| SLO | Internal Goal | 95% of requests < 200ms |
| SLI | Real-time metric | 93% of requests < 200ms (Failing!) |
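The relationship between these three can be sketched in code: an SLI is computed from raw measurements, then compared against the SLO. The function name, latency values, and thresholds below are illustrative assumptions, not part of the course material.

```python
# Sketch: compute an SLI (fraction of requests under 200 ms) and
# check it against an internal SLO of 95%. All numbers are made up.
def sli_fast_requests(latencies_ms, threshold_ms=200):
    """SLI: fraction of requests faster than threshold_ms."""
    fast = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return fast / len(latencies_ms)

latencies = [120, 180, 90, 250, 300, 150, 170, 400, 100, 130]
sli = sli_fast_requests(latencies)   # 7 of 10 requests are fast -> 0.7
slo = 0.95                           # internal target (stricter than the SLA)

print(f"SLI: {sli:.0%}, SLO: {slo:.0%}, meeting SLO: {sli >= slo}")
```

In a real system the SLI would come from a monitoring pipeline rather than an in-memory list, but the comparison logic is the same: the SLI is the measurement, the SLO is the bar it must clear.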
Understanding Percentiles (P50, P95, P99)
Why not just use the Average? In tail-latency-sensitive systems, an average hides the outliers. If 99% of your users have a 100ms experience but 1% have a 5-second experience, the average only rises to about 149ms and still looks "okay", yet that 1% of your users is frustrated.
- P50 (Median): The midpoint; 50% of requests are faster than this.
- P95: 95% of requests are faster than this threshold.
- P99: Captures the "tail latency", the slowest 1% of requests. This is critical for high-scale applications where 1% can mean thousands of users.
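The contrast above can be made concrete with a toy sample. This is a minimal nearest-rank percentile sketch (real monitoring systems typically use histograms); the sample of 98 fast and 2 slow requests is an invented illustration.

```python
# Sketch: nearest-rank percentiles over a sorted latency sample,
# showing how the average stays low while P99 exposes the slow tail.
def percentile(sorted_vals, p):
    """Nearest-rank percentile: the value at rank ceil-ish p% of the list."""
    idx = max(0, int(round(p / 100 * len(sorted_vals))) - 1)
    return sorted_vals[idx]

# 98 fast requests (100 ms) and 2 very slow ones (5000 ms)
samples = sorted([100] * 98 + [5000] * 2)

avg = sum(samples) / len(samples)   # 198.0 ms -- looks almost fine
p50 = percentile(samples, 50)       # 100 ms -- the typical user
p95 = percentile(samples, 95)       # 100 ms -- still hides the tail
p99 = percentile(samples, 99)       # 5000 ms -- the tail finally surfaces

print(f"avg={avg} p50={p50} p95={p95} p99={p99}")
```

Note how even P95 can miss a 2%-of-traffic problem; this is why high-scale services track P99 (and sometimes P99.9) rather than stopping at the average.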
Performance vs. Cost Trade-offs
Cloud performance isn't free. Optimization decisions often align with budget:
- Provisioned IOPS: Faster disk speed costs more.
- Reserved vs. On-Demand: Reserved is cheaper but less flexible.
- Cold Starts: Serverless is cost-efficient, but a function that has been idle incurs extra startup latency on its next invocation.
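The reserved-vs-on-demand trade-off boils down to a break-even calculation on utilization. The prices and discount below are invented for illustration; real cloud pricing varies by provider, region, and commitment term.

```python
# Toy break-even sketch (made-up prices): on-demand at $0.10/hr vs a
# reservation at a 40% discount that is billed for every hour of the year.
ON_DEMAND_HOURLY = 0.10
RESERVED_HOURLY = 0.06          # discounted, but billed whether used or not
HOURS_PER_YEAR = 8760

def yearly_cost(utilization):
    """Yearly cost of each model at a given utilization (0.0 to 1.0)."""
    on_demand = ON_DEMAND_HOURLY * HOURS_PER_YEAR * utilization
    reserved = RESERVED_HOURLY * HOURS_PER_YEAR   # flat, regardless of use
    return on_demand, reserved

for util in (0.3, 0.6, 0.9):
    od, res = yearly_cost(util)
    winner = "reserved" if res < od else "on-demand"
    print(f"{util:.0%} utilization: on-demand ${od:.0f}, "
          f"reserved ${res:.0f} -> {winner} wins")
```

With these assumed numbers the break-even sits at 60% utilization: below it, on-demand is cheaper; above it, the reservation pays off. This is why teams typically reserve capacity only for the steady baseline load and auto-scale on-demand instances for spikes.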
Interview Questions - Measurement & Monitoring
1. Why are percentiles (like P95, P99) more important than averages?
Answer: Averages hide outliers. If 1% of your users experience 10-second delays, the P99 will surface that problem immediately, while the average might only shift by a few milliseconds. P99 captures the "tail latency" that affects user retention.
2. How do SLAs, SLOs, and SLIs differ?
Answer:
- SLA is the legal promise (99.9% uptime).
- SLO is the internal target (99.95% uptime).
- SLI is what you actually measure (99.92% today).
- Analogy: An SLA is the warranty, the SLO is the factory test target, and the SLI is the current speedometer reading.
3. Explain the trade-offs between performance and cost in the cloud.
Answer: Faster performance (low latency) usually requires larger instances or provisioned resources, increasing cost. We balance this by using Caching, Auto-scaling (to only pay for what we use), and choosing Reserved Instances for baseline loads.
What's next? How to detect bottlenecks and stress-test your system → Performance Testing & Monitoring