
The Problem With Percentiles

Most teams emit a 95th percentile metric to track response times and other performance indicators. But you can't do math on a percentile. You can't average them, you can't sum them; they're unitless and a complete value in themselves. In this talk, you'll learn about SLA buckets, response time histograms, and how to use them.

Ben Zvan

July 22, 2020

Transcript

  1. 95TH PERCENTILE: What Does it Mean?

    The 95th percentile is the value at or below which 95% of the dataset falls.
  2. 95TH PERCENTILE: What Does it Mean?

    For response time: a 95th percentile of 100ms means 95% of requests had a response time of 100ms or faster.
  3. 95TH PERCENTILE: What Does it Mean?

    For test results: scoring in the 95th percentile means YOUR test results are better than 95% of the test results.
  4. 95TH PERCENTILE: How is a percentile calculated? (what is the formula?)

    Response times of 20 requests, sorted (ms): 20, 21, 22, 23, 24, 25, 26, 30, 31, 32, 33, 34, 35, 36, 40, 41, 42, 43, 44, 45
  5. 95TH PERCENTILE: How is a percentile calculated?

    Same 20 requests as slide 4.
    R = P/100 x (N + 1) = .95 x (20 + 1) = 19.95
  6. 95TH PERCENTILE: How is a percentile calculated?

    Same 20 requests as slide 4.
    R = P/100 x (N + 1) = .95 x (20 + 1) = 19.95
    Note for statisticians: the “true” percentile is 44ms (19th position) + 1ms (difference to next) x .95 = 44.95ms, but we’re going to ignore that because 44ms is close enough.
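The rank formula on the slide, R = P/100 x (N + 1), can be sketched in Python as follows. This is a minimal sketch rather than the speaker's code: the function name `percentile_by_rank` and its `interpolate` flag are hypothetical, and it assumes a non-empty list of numbers. With `interpolate=False` it returns the value at the whole-number position (the slide's "close enough" answer); with `interpolate=True` it adds the fractional part of the rank times the gap to the next value.

```python
import math


def percentile_by_rank(values, p, interpolate=False):
    """Percentile via the rank formula R = p/100 * (N + 1).

    Hypothetical helper: positions are 1-based in the sorted data.
    """
    data = sorted(values)
    n = len(data)
    rank = p / 100 * (n + 1)
    # Clamp so very low or very high percentiles stay inside the data.
    rank = min(max(rank, 1), n)
    position = math.floor(rank)      # whole-number part of the rank
    fraction = rank - position       # fractional part of the rank
    lower = data[position - 1]
    if not interpolate or position == n:
        return lower
    upper = data[position]           # next value in sorted order
    return lower + fraction * (upper - lower)


response_times_ms = [20, 21, 22, 23, 24, 25, 26,
                     30, 31, 32, 33, 34, 35, 36,
                     40, 41, 42, 43, 44, 45]

print(percentile_by_rank(response_times_ms, 95))
print(round(percentile_by_rank(response_times_ms, 95, interpolate=True), 2))
```

For the 20 requests on the slide, the first call gives the value at the 19th position and the second gives the interpolated figure from the note for statisticians.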
  7. 95TH PERCENTILE: How is a percentile calculated?

    Dataset 1, sorted (ms): 20, 21, 22, 23, 24, 25, 26, 30, 31, 32, 33, 34, 35, 36, 40, 41, 42, 43, 44, 45
    Dataset 2 (ms): 18 requests at 1, one at 44, one at 6000
  8. 95TH PERCENTILE: Can you average percentiles?

    Dataset 1, sorted (ms): 20, 21, 22, 23, 24, 25, 26, 30, 31, 32, 33, 34, 35, 36, 40, 41, 42, 43, 44, 45
    Dataset 2 (ms): 18 requests at 1, one at 44, one at 600
    average = (44 + 44)/2 = 44ms
    Statistically correct 95th percentiles: 44.95ms (dataset 1) and 572ms (dataset 2)
    Statistically correct average: (572 + 44.95)/2 = 308ms
  9. 95TH PERCENTILE: Can you average percentiles?

    Dataset 1, sorted (ms): 20, 21, 22, 23, 24, 25, 26, 30, 31, 32, 33, 34, 35, 36, 40, 41, 42, 43, 44, 45
    Dataset 2 (ms): 17 requests at 1, one at 43, one at 44, one at 6000
    average = (44 + 44)/2 = 44ms
    Statistically correct average: (572 + 44.95)/2 = 308ms
    Statistically correct 95th: 44.95ms
  10. 95TH PERCENTILE: Can you average percentiles?

    Dataset 1, sorted (ms): 200, 200, 200, 230, 240, 250, 260, 300, 310, 320, 330, 340, 350, 360, 400, 410, 420, 430, 440, 450
    Dataset 2 (ms): 20 requests at 1
    average of the 95th percentiles = (1 + 440)/2 = 220.5ms
  11. 95TH PERCENTILE: Can you average percentiles? NO!

    Dataset 1, sorted (ms): 200, 200, 200, 230, 240, 250, 260, 300, 310, 320, 330, 340, 350, 360, 400, 410, 420, 430, 440, 450
    Dataset 2 (ms): 20 requests at 1
    average of the 95th percentiles = (1 + 440)/2 = 220.5ms
    average = SUM(response times)/40 = 161.5ms
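One way to see why the averaged value is misleading is to compare it with the 95th percentile of all 40 requests pooled together. The sketch below uses the slide's two datasets; the helper `p95` is hypothetical and follows the simplified rule from earlier in the deck (take the value at the whole-number rank, no interpolation).

```python
import math


def p95(values):
    """95th percentile by the simplified rank rule: floor(.95 * (N + 1))."""
    data = sorted(values)
    position = math.floor(0.95 * (len(data) + 1))   # 1-based position
    position = min(max(position, 1), len(data))     # stay inside the data
    return data[position - 1]


dataset_1 = [200, 200, 200, 230, 240, 250, 260,
             300, 310, 320, 330, 340, 350, 360,
             400, 410, 420, 430, 440, 450]
dataset_2 = [1] * 20

average_of_p95s = (p95(dataset_1) + p95(dataset_2)) / 2
pooled_p95 = p95(dataset_1 + dataset_2)

print(average_of_p95s)  # average of the two per-dataset percentiles
print(pooled_p95)       # percentile of all 40 requests taken together
```

Under these assumptions the pooled 95th percentile is 430ms, nearly double the 220.5ms obtained by averaging the two percentiles.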
  12. 95TH PERCENTILE: What about ranges of percentiles?

    Dataset 1, sorted (ms): 200, 200, 200, 230, 240, 250, 260, 300, 310, 320, 330, 340, 350, 360, 400, 410, 420, 430, 440, 450
    Dataset 2 (ms): 20 requests at 1
    range of the 95th percentiles = 1ms - 440ms
  13. 95TH PERCENTILE: What about ranges of percentiles?

    Dataset 1, sorted (ms): 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 43, 440, 450
    Dataset 2 (ms): 20 requests at 1
    range of the 95th percentiles = 1ms - 440ms
  14. 95TH PERCENTILE: What about ranges of percentiles?

    Dataset 1, reported 95th percentiles: 1ms, 2ms x 8, 440ms; range = 1ms - 440ms
    Dataset 2, reported 95th percentiles: 1ms, 440ms x 9; range = 1ms - 440ms
  15. 95TH PERCENTILE: Are you meeting your 95%tile SLA?

    Dataset 1, reported 95th percentiles: 1ms, 2ms x 8, 440ms
    Dataset 2, reported 95th percentiles: 1ms, 440ms x 9

    SLA ->      10 ms   100 ms   400 ms
    dataset 1   ?       ?        ?
    dataset 2   ?       ?        ?
  16. 95TH PERCENTILE: Are you meeting your 95%tile SLA?

    Same 95th percentiles as slide 15.

    SLA ->      10 ms   100 ms   400 ms
    dataset 1   !       !        !
    dataset 2   !       !        !
  17. 95TH PERCENTILE: Are you meeting your 95%tile SLA? assuming even traffic distribution?

    Same 95th percentiles as slide 15.

    SLA ->      10 ms   100 ms   400 ms
    dataset 1   ?       ?        ?
    dataset 2   ?       ?        ?
  18. 95TH PERCENTILE: Are you meeting your 95%tile SLA? assuming even traffic distribution?

    Same 95th percentiles as slide 15.
    dataset 1: minimum of .005 over 440ms, minimum of .855 under 2ms
    dataset 2: minimum of .045 over 440ms, minimum of .095 under 1ms

    SLA ->      10 ms   100 ms   400 ms
    dataset 1
    dataset 2
  19. 95TH PERCENTILE: Are you meeting your 95%tile SLA? assuming even traffic distribution?

    Same 95th percentiles as slide 15.
    dataset 1: minimum of .005 over 440ms, minimum of .855 under 2ms
    dataset 2: minimum of .045 over 440ms, minimum of .095 under 1ms

    SLA ->      10 ms   100 ms   400 ms
    dataset 1   !       !        !
    dataset 2   !       !        !
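The "minimum of" figures follow from what a 95th percentile guarantees: at least 95% of the requests behind it were at or below that value, and at least 5% were at or above it. Here is a rough sketch of that arithmetic, assuming every reported percentile covers the same number of requests (the slide's even traffic assumption). Both function names are hypothetical.

```python
def min_fraction_at_or_below(p95_values, threshold_ms):
    """Lower bound on the share of all requests at or below threshold_ms."""
    covered = sum(1 for v in p95_values if v <= threshold_ms)
    return covered / len(p95_values) * 0.95


def min_fraction_at_or_above(p95_values, threshold_ms):
    """Lower bound on the share of all requests at or above threshold_ms."""
    covered = sum(1 for v in p95_values if v >= threshold_ms)
    return covered / len(p95_values) * 0.05


dataset_1 = [1] + [2] * 8 + [440]   # reported 95th percentiles, in ms
dataset_2 = [1] + [440] * 9

for name, p95s in (("dataset 1", dataset_1), ("dataset 2", dataset_2)):
    for sla_ms in (10, 100, 400):
        fast = min_fraction_at_or_below(p95s, sla_ms)
        slow = min_fraction_at_or_above(p95s, sla_ms)
        print(f"{name} SLA {sla_ms}ms: at least {fast:.3f} within, "
              f"at least {slow:.3f} at or above")
```

For all three SLA thresholds the guaranteed share within the SLA stays below .95 and the guaranteed share above it stays below .05, so under these assumptions the percentiles alone neither confirm nor rule out a 95% SLA for either dataset.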
  20. SLA PERCENT: A meaningful metric

    sla = 200ms throughout (sla_pct is the fraction of requests that took sla or less)
    sla_pct = 1.0, .95, .96, .96, .94, .94, .99, 1.0, .99, .93
  21. SLA PERCENT: Are you meeting your 95%tile SLA? assuming even traffic distribution?

    sla = 200ms throughout
    sla_pct = 1.0, .95, .96, .96, .94, .94, .99, 1.0, .99, .93
  22. SLA PERCENT: Are you meeting your 95%tile SLA? assuming even traffic distribution?

    sla = 200ms throughout
    sla_pct = 1.0, .95, .96, .96, .94, .94, .99, 1.0, .99, .93
    average = .966 of requests meet SLA … so YES!!
  23. SLA PERCENT: Are you meeting your 95%tile SLA? with uneven traffic distribution?

    sla = 200ms throughout

    sla_pct   req_rate
    1.0       90
    .95       10
    .96       100
    .96       50
    .94       95
    .94       62
    .99       120
    1.0       15
    .99       110
    .93       150
    total     802
  24. SLA PERCENT: Are you meeting your 95%tile SLA? with uneven traffic distribution?

    sla = 200ms throughout

    sla_pct   req_rate   requests meeting SLA
    1.0       90         90
    .95       10         9
    .96       100        96
    .96       50         47
    .94       95         89
    .94       62         58
    .99       120        118
    1.0       15         15
    .99       110        108
    .93       150        139
    total     802        769

    769/802 = .958 of requests meet SLA … so YES!!
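Because sla_pct is a percentage of requests, it can be weighted by traffic. The following is a minimal sketch with the slide's numbers, comparing the plain average with the request-weighted one. Variable names are illustrative, and the code keeps fractional request counts, whereas the slide works with whole requests (769 of 802, or .958), so the weighted result differs slightly in the third decimal.

```python
# (sla_pct, req_rate) pairs as listed on the slide
samples = [
    (1.00, 90), (0.95, 10), (0.96, 100), (0.96, 50), (0.94, 95),
    (0.94, 62), (0.99, 120), (1.00, 15), (0.99, 110), (0.93, 150),
]

total_requests = sum(rate for _, rate in samples)
requests_within_sla = sum(pct * rate for pct, rate in samples)

# Plain average: only valid if every sample carries the same traffic.
unweighted = sum(pct for pct, _ in samples) / len(samples)
# Weighted average: each sample counts in proportion to its requests.
weighted = requests_within_sla / total_requests

print(total_requests)
print(round(unweighted, 3))
print(round(weighted, 3))
```

Both figures land above .95 here, but only the weighted one accounts for the busiest sample also being the one with the lowest sla_pct.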
  25. SLA PERCENT: You can still build histograms!

    sla: 200ms
    sla_pct: 0.97      <- 970 requests took 200ms or less
    .5_sla_pct: 0.96   <- 960 requests took 100ms or less
    2x_sla_pct: 0.99   <- 990 requests took 400ms or less
    req_rate: 1000

    Bonus:
    10 requests took between 100 and 200ms
    20 requests took between 200 and 400ms
    10 requests took over 400ms
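The bonus counts come from subtracting neighbouring cumulative buckets. Below is a small sketch of that subtraction, assuming the three thresholds and the req_rate shown on the slide. The function name `buckets_from_cumulative` is hypothetical.

```python
def buckets_from_cumulative(cumulative_pct, req_rate):
    """Turn cumulative 'within X ms' fractions into per-bucket counts.

    cumulative_pct maps an upper bound in ms to the fraction of requests
    at or below it. Returns (label, count) pairs plus an overflow bucket.
    """
    buckets = []
    previous_bound, previous_count = 0, 0
    for bound in sorted(cumulative_pct):
        # round() guards against floating-point noise in pct * rate
        count = round(cumulative_pct[bound] * req_rate)
        buckets.append((f"{previous_bound}-{bound}ms", count - previous_count))
        previous_bound, previous_count = bound, count
    buckets.append((f"over {previous_bound}ms", req_rate - previous_count))
    return buckets


# .5_sla_pct, sla_pct and 2x_sla_pct for sla = 200ms
sla_metrics = {100: 0.96, 200: 0.97, 400: 0.99}
for label, count in buckets_from_cumulative(sla_metrics, req_rate=1000):
    print(label, count)
```

Three cumulative percentages plus the request rate are enough to recover a four-bucket histogram, which is why the SLA metrics can still be charted as a distribution.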
  26. DISTRIBUTION HISTOGRAMS: Alternative: bucket all the things!

    req_5ms_count: 950
    req_10ms_count: 5
    req_15ms_count: 3
    req_20ms_count: 2
    req_110ms_count: 5
    req_115ms_count: 20
    req_120ms_count: 5
    req_450ms_count: 7
    req_490ms_count: 3

    same data:
    sla: 200ms
    sla_pct: 0.97      <- 970 requests took 200ms or less
    .5_sla_pct: 0.96   <- 960 requests took 100ms or less
    2x_sla_pct: 0.99   <- 990 requests took 400ms or less
    req_rate: 1000
  27. DISTRIBUTION HISTOGRAMS: Alternative: bucket all the things! Bonus! percentiles!

    bucket            ds1   ds2   running total   percentile
    req_5ms_count     950   800   1750            .875
    req_10ms_count    5     10    1765            .882
    req_15ms_count    3     5     1773            .886
    req_20ms_count    2     10    1785            .892
    req_110ms_count   5     10    1800            .900
    req_115ms_count   20    125   1945            .972
    req_120ms_count   5     20    1970            .985
    req_450ms_count   7     10    1987            .993
    req_490ms_count   3     10    2000            1.00
  28. DISTRIBUTION HISTOGRAMS: Alternative: bucket all the things! Bonus! percentiles!

    Same buckets, ds1 and ds2 counts, totals and percentiles as slide 27.
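Bucket counts from different sources can be added together, and a percentile can then be read from the running total. The sketch below uses the ds1 and ds2 counts from the slide and assumes each bucket name encodes an upper bound in ms. `percentile_bucket` is a hypothetical helper that reports the first bucket whose cumulative share reaches the requested percentile.

```python
from itertools import accumulate

bucket_upper_ms = [5, 10, 15, 20, 110, 115, 120, 450, 490]
ds1 = [950, 5, 3, 2, 5, 20, 5, 7, 3]
ds2 = [800, 10, 5, 10, 10, 125, 20, 10, 10]

# Counts are additive, so merging two histograms is element-wise addition.
merged = [a + b for a, b in zip(ds1, ds2)]
running_total = list(accumulate(merged))
total = running_total[-1]


def percentile_bucket(p):
    """Upper bound (ms) of the first bucket whose cumulative share >= p/100."""
    for upper, cumulative in zip(bucket_upper_ms, running_total):
        if cumulative * 100 >= p * total:   # integer comparison, no float error
            return upper
    return bucket_upper_ms[-1]


print(running_total)
print(percentile_bucket(95))
```

The answer is only as precise as the bucket boundaries, but unlike a precomputed percentile it stays valid after data from several sources has been merged.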
  29. In Summary

    you can’t do math on percentiles
    you CAN do math on percentages
    you CAN get percentiles from histograms
    you should use SLA buckets or distribution histograms
    change my mind!