Mar 2025 · 10 min read

To optimize performance, you must first be able to measure it accurately. Averages are often misleading in distributed systems, so we rely on contracts and targets (SLAs, SLOs) and on latency distributions (percentiles) to understand real-world user experience.

Performance Measurement: SLAs, SLOs & Percentiles 📊

Driptanil Datta
Software Developer
References & Disclaimer

This content is adapted from Mastering System Design from Basics to Cracking Interviews (Udemy). It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.

The Reliability Hierarchy: SLA, SLO, SLI

Performance must be trackable and governed by specific targets:

  1. SLA (Service Level Agreement): A contractual commitment to your customers (e.g., "99.9% uptime"). Failure usually results in financial penalties.
  2. SLO (Service Level Objective): An internal target that is stricter than the SLA (e.g., "99.95% uptime"). It gives your team a safety margin.
  3. SLI (Service Level Indicator): The actual measurement of your system at a given moment.

Type | Definition | Example
SLA | Legal Contract | 99.9% Availability
SLO | Internal Goal | 95% of requests < 200ms
SLI | Real-time metric | 93% of requests < 200ms (Failing!)
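
The SLO and SLI rows can be compared in code. The following is a minimal sketch, assuming latencies are collected per request over some window; the function names and the sample data are hypothetical and chosen only to reproduce the 93% versus 95% case from the table.

```python
def fraction_within(latencies_ms, threshold_ms):
    """SLI: share of requests that completed faster than threshold_ms."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    fast = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return fast / len(latencies_ms)


def meets_slo(sli, slo_target):
    """True when the measured SLI is at or above the SLO target."""
    return sli >= slo_target


# Hypothetical window of 100 requests: 93 fast ones and 7 slow ones.
samples = [120] * 93 + [450] * 7
sli = fraction_within(samples, threshold_ms=200)
print(f"SLI = {sli:.0%}, SLO met: {meets_slo(sli, 0.95)}")
```

With this sample the SLI is 93%, below the 95% target, so the check reports the SLO as missed. In practice the SLI would come from a metrics system rather than an in-memory list.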

Understanding Percentiles (P50, P95, P99) 📈

Why not just use the average? In tail-latency-sensitive systems, an average hides the outliers. If 90% of your users have a 100 ms experience but 10% have a 5-second experience, the average comes out to 590 ms: a single number that describes neither group and hides the fact that 10% of your users are frustrated.

  • P50 (Median): The middle ground; 50% of requests are faster than this.
  • P95: 95% of requests are faster than this threshold.
  • P99: 99% of requests are faster than this threshold. It captures the "tail latency", the slowest 1% of requests, which is critical for high-scale applications where 1% can mean thousands of users.
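
To make the definitions concrete, here is a small sketch that computes the mean and the three percentiles for the 90% / 10% example. It assumes the nearest-rank definition of a percentile (monitoring tools often interpolate instead, so their numbers can differ slightly), and the `percentile` helper is written here for illustration rather than taken from a library.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# 1,000 requests: 90% take 100 ms, 10% take 5,000 ms.
latencies_ms = [100] * 900 + [5000] * 100

mean = sum(latencies_ms) / len(latencies_ms)
print("mean:", mean)
for p in (50, 95, 99):
    print(f"P{p}:", percentile(latencies_ms, p))
```

The mean is 590 ms and P50 is 100 ms, while P95 and P99 are both 5,000 ms. Only the high percentiles show what the slow group actually experiences.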

Performance vs. Cost Trade-offs 💰

Cloud performance isn't free. Optimization decisions are usually constrained by budget:

  • Provisioned IOPS: Faster disk speed costs more.
  • Reserved vs. On-Demand: Reserved is cheaper but less flexible.
  • Cold Starts: Serverless is cost-efficient but adds startup latency when a request arrives after an idle period or when a traffic spike forces new instances to start.
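
The reserved versus on-demand trade-off comes down to utilization. The sketch below is a simplified model with hypothetical hourly prices (not taken from any real provider), and it assumes reserved capacity is billed for every hour of the month whether it is used or not. Real reserved pricing usually involves longer commitments and upfront options that this ignores.

```python
HOURS_PER_MONTH = 730  # assumption: average number of hours in a month


def on_demand_cost(hourly_rate, hours_used):
    """Pay only for the hours the instance actually runs."""
    return hourly_rate * hours_used


def reserved_cost(hourly_rate):
    """Pay for every hour of the month, used or not."""
    return hourly_rate * HOURS_PER_MONTH


def break_even_utilization(on_demand_rate, reserved_rate):
    """Fraction of the month above which the reserved price is cheaper."""
    return reserved_rate / on_demand_rate


# Hypothetical prices in dollars per hour.
ON_DEMAND_RATE = 0.10
RESERVED_RATE = 0.06

threshold = break_even_utilization(ON_DEMAND_RATE, RESERVED_RATE)
print(f"Reserved is cheaper above {threshold:.0%} utilization")
for hours in (200, 730):
    print(
        f"{hours} h/month: on-demand ${on_demand_cost(ON_DEMAND_RATE, hours):.2f}, "
        f"reserved ${reserved_cost(RESERVED_RATE):.2f}"
    )
```

Under these assumed prices, a workload that runs only a few hundred hours a month is cheaper on demand, while an always-on baseline is cheaper reserved.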

Interview Questions & Answers 💡

1. Why are percentiles (like P95, P99) more important than averages?

Averages hide outliers. If a little over 1% of requests take 10 seconds while the rest take around 100 ms, P99 jumps to roughly 10 seconds and surfaces the problem immediately, while the average rises by only about 100 ms and can still look acceptable. P99 captures the "tail latency" that affects user retention.

2. How do SLAs, SLOs, and SLIs differ?

  • SLA is the legal promise (99.9% uptime).
  • SLO is the internal target (99.95% uptime).
  • SLI is what you actually measure (99.92% today).

Analogy: An SLA is the warranty, the SLO is the factory test target, and the SLI is the current speedometer reading.
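
The three uptime figures in this answer translate into concrete downtime. The following sketch assumes a 30-day measurement window, which is an illustrative choice; real contracts define their own window.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes; the window length is an assumption


def allowed_downtime_minutes(availability_pct, window_minutes=MINUTES_PER_30_DAYS):
    """Minutes of downtime that a given availability percentage permits in the window."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return window_minutes * (100 - availability_pct) / 100


for label, pct in (("SLA", 99.9), ("SLO", 99.95), ("SLI", 99.92)):
    print(f"{label} {pct}%: {allowed_downtime_minutes(pct):.1f} minutes per 30 days")
```

At 99.92% the measured downtime is about 34.6 minutes, inside the SLA's 43.2 minutes but over the SLO's 21.6 minutes: the contract is still met, yet the safety margin is already spent.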

3. Explain the trade-offs between performance and cost in the cloud.

Faster performance (low latency) usually requires larger instances or provisioned resources, increasing cost. We balance this by using:

  • Caching: Reduces expensive database queries.
  • Auto-scaling: Scales out only when traffic spikes.
  • Reserved Instances: Cheaper pricing for baseline loads.
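
As a rough illustration of the caching point, the sketch below estimates the expected latency of a read path with a cache in front of a database. The latency figures and the hit ratio are hypothetical, and the model assumes each request is either a pure cache hit or a pure database read.

```python
def expected_latency_ms(hit_ratio, cache_ms, db_ms):
    """Expected latency when hits are served from cache and misses go to the database."""
    if not 0 <= hit_ratio <= 1:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * cache_ms + (1 - hit_ratio) * db_ms


# Hypothetical numbers: 2 ms cache read, 50 ms database query.
for hit_ratio in (0.0, 0.5, 0.9):
    latency = expected_latency_ms(hit_ratio, cache_ms=2, db_ms=50)
    print(f"hit ratio {hit_ratio:.0%}: about {latency:.1f} ms expected")
```

Note that this is an average, so it carries the caveat from the percentile discussion: every miss still pays the full database latency, and the tail needs to be measured separately.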

Final Thoughts

Measurement is the prerequisite for improvement. A high-performance system isn't just one that is fast "on average," but one that provides a consistent experience even for the P99 tail.

What's next? How to detect bottlenecks and stress-test your system: Performance Testing & Monitoring

