DevOpsPorto Meetup 37: Why averages lie by Filipe Oliveira

DevOpsPorto

July 23, 2020

Transcript

  1. why averages lie and what we can do about it

    @fcosta_oliveira / July 2020 1
  2. > whoami - work as performance engineer @Redis Labs -

    improve/develop open source performance/observability tools - some by necessity, some for the fun of it - https://github.com/filipecosta90 2
  3. > taking a step back 3

  4. > why do we write code? 4

  5. > respond to needs... business…. org... society... someone’s... 5

  6. > it all starts with... work: what we actually care

    about metrics: help you characterize the status of a specific work 6 performance: amount of useful work we can accomplish
  7. > this is too abstract. Let's get concrete. work: application

    performance: Understanding application behaviour ( responsiveness ) metrics: Operations per unit of time, success metrics, error metrics, utilization, saturation, latency, and many more... 7
  8. > Understanding latency behaviour define it: length of the operation

    understand it: 8 - naive approach: store raw data and use it to characterize the system
  9. all latency distributions have the same “common case” -

    the average > Understanding latency behaviour define it: length of the operation understand it: - the common case approach: why it “lies” and why it’s folly to chase it... the mean 9 [1] https://en.wikipedia.org/wiki/Anscombe%27s_quartet
  10. > the good, the bad, and the ugly average? 10

  11. > the good, the bad, and the ugly average? 11

  12. > the good, the bad, and the ugly average? 12

  13. > the good, the bad, and the ugly average? 13

    [1] http://en.wikipedia.org/wiki/Multimodal_distribution workflow/pattern 1 peak occurrences workflow/pattern 2 peak occurrences
  14. > the good, the bad, and the ugly average? 14

    [1] http://www.brendangregg.com/FrequencyTrails/modes.html 1) Node.js HTTP server response time (latency) from 50 random production servers, showing around 10,000 responses each. 2) MySQL command latency from 50 random production servers, showing around 10,000 commands for each distribution.
  15. > we know there is a problem… 15 - susceptible

    to outliers... - hides the long tail (high latencies)... - under-estimates the actual user experience... - does not correspond to any actual observed latency….
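
The “susceptible to outliers” failing is easy to demonstrate. A minimal Python sketch (illustrative, not from the talk — the latency values are made up):

```python
# A single pathological request drags the mean, while the median barely moves.
import statistics

base = [1.0] * 99                 # 99 requests at 1 ms
with_outlier = base + [1000.0]    # plus one 1-second stall

print(statistics.mean(with_outlier))    # 10.99 -> "average latency" looks 11x worse
print(statistics.median(with_outlier))  # 1.0   -> typical experience unchanged
```

One outlier in a hundred samples moves the mean by an order of magnitude, yet no user actually experienced ~11 ms.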
  16. > we know that… 16 better understanding... better decisions...

  17. 17 > how can we solve it?

  18. - representative of data - space and time efficient to

    compute - practical to use 18
  19. 19 > percentiles to the rescue… [1] https://en.wikipedia.org/wiki/Percentile value below

    which a given percentage of observations in a group of observations falls...
  20. - representative of data: - what is the latency that

    90% of our users experience? p90 - what is our worst 1% latency interval? did we get worse compared to last week? [p99,p100] - what is the % of users that are served up to 1ms? and 5ms? *** - space and time efficient: - millions of samples -> ~= dozens/hundreds - practical to use: - look at different slices of the distribution depending on your business needs/maturity. startup != twitter 20
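
The p90/p99 questions above can be answered from raw samples with a naive nearest-rank percentile. An illustrative Python sketch — the bimodal workload and the `percentile` helper are hypothetical, not the talk’s code:

```python
import random
import statistics

random.seed(42)
# Hypothetical bimodal workload: 95% fast requests (~1 ms), 5% slow (~20 ms).
latencies = ([random.gauss(1.0, 0.2) for _ in range(9500)] +
             [random.gauss(20.0, 2.0) for _ in range(500)])

def percentile(data, p):
    """Nearest-rank percentile: value below which p% of observations fall."""
    s = sorted(data)
    k = max(0, min(len(s) - 1, round(p / 100.0 * len(s)) - 1))
    return s[k]

print(f"mean = {statistics.mean(latencies):.2f} ms")  # ~2 ms: hides the slow mode
print(f"p90  = {percentile(latencies, 90):.2f} ms")   # still inside the fast mode
print(f"p99  = {percentile(latencies, 99):.2f} ms")   # exposes the ~20 ms tail
```

Note this naive approach keeps every raw sample — representative, but not space-efficient; that gap is what the histogram sketches on the next slides close.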
  21. 21 > how to calculate Percentiles/CDFs...

  22. 22 > we need histograms...

  23. 23 > they come in different colors and flavours... t-digest,

    d-digest, hdrhistogram, “raw”, a lot of different sketches... Not Important! What matters: space, speed, precision...
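
To get a feel for the space/precision trade-off those sketches make, here is a deliberately crude fixed-width histogram in Python (illustrative only — t-digest and HdrHistogram place buckets far more cleverly than this):

```python
# Constant space regardless of sample count; precision limited to bucket width.
class FixedHistogram:
    def __init__(self, max_ms=100.0, buckets=1000):
        self.width = max_ms / buckets        # 0.1 ms per bucket here
        self.counts = [0] * buckets
        self.total = 0

    def record(self, value_ms):
        idx = min(len(self.counts) - 1, int(value_ms / self.width))
        self.counts[idx] += 1
        self.total += 1

    def percentile(self, p):
        """Upper edge of the bucket holding the p-th percentile."""
        target = p / 100.0 * self.total
        seen = 0
        for i, count in enumerate(self.counts):
            seen += count
            if count and seen >= target:
                return (i + 1) * self.width
        return len(self.counts) * self.width

h = FixedHistogram()
for v in [1.25, 1.45, 1.55, 1.65, 2.05, 2.15, 5.05, 16.35]:
    h.record(v)
print(h.percentile(50))  # ~1.7: the median, accurate to one 0.1 ms bucket
```

Whatever the flavour, the shape is the same: record into buckets, then walk the counts to answer percentile/CDF queries.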
  24. 24 [1] https://hdrhistogram.github.io/HdrHistogram/plotFiles.html

  25. - representative of data: - what is the latency that

    90% of our users experience? p90 - what is our worst 1% latency interval? did we get worse compared to last week? [p99,p100] - what is the % of users that are served up to 1ms? and 5ms? *** 25
  26. - representative of data: - what is the latency that

    90% of our users experience? p90 - what is our worst 1% latency interval? did we get worse compared to last week? [p99,p100] - what is the % of users that are served up to 1ms? and 5ms? *** 26
  27. 27 - what is the latency that 90% of our

    users experience? #[Mean = 1.509, StdDeviation = 0.466] #[Max = 16.303, Total count = 1818000]
  28. - representative of data: - what is the latency that

    90% of our users experience? p90 - what is our worst 1% latency interval? did we get worse compared to last week? [p99,p100] - what is the % of users that are served up to 1ms? and 5ms? *** 28
  29. 29 - what is our worst 1% latency interval? 100%

    #[Mean = 1.509, StdDeviation = 0.466] #[Max = 16.303, Total count = 1818000]
  30. 30 - …did we get worse compared to

    last week? YES 100% #[Mean = 1.509, StdDeviation = 0.466] #[Max = 16.303, Total count = 1818000]
  31. - representative of data: - what is the latency that

    90% of our users experience? p90 - what is our worst 1% latency interval? did we get worse compared to last week? [p99,p100] - what is the % of users that are served up to 1ms? and 5ms? *** 31
  32. 32 - what is the % of users that are

    served up to 1ms? and 5ms? CDF(1)~=5%, CDF(5)~=99.8% [1] https://en.wikipedia.org/wiki/Cumulative_distribution_function #[Mean = 1.509, StdDeviation = 0.466] #[Max = 16.303, Total count = 1818000]
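
The CDF question on this slide is just “what fraction of observations fall at or below a bound”. An illustrative Python sketch over a tiny made-up sample (not the talk’s 1.8M-sample distribution):

```python
def cdf(latencies, bound_ms):
    """Empirical CDF: fraction of observations at or below bound_ms."""
    return sum(1 for x in latencies if x <= bound_ms) / len(latencies)

# Hypothetical sample: mostly ~1-2 ms requests with a small slow tail.
sample = [0.8, 1.2, 1.4, 1.5, 1.6, 1.7, 1.9, 2.1, 4.9, 16.3]
print(f"{cdf(sample, 1.0):.0%} served up to 1 ms")  # 10% served up to 1 ms
print(f"{cdf(sample, 5.0):.0%} served up to 5 ms")  # 90% served up to 5 ms
```

The CDF and the percentile function are inverses of each other: CDF(5 ms) ≈ 99.8% is the same statement as p99.8 ≈ 5 ms.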
  33. 33 - Drives business needs (why it’s folly to chase

    means ): - we agreed that 99% of our payment users’ requests are to be below 7.5ms p99 < 7.5ms - we agreed that 99.999% of our payment users’ requests are to be below 10ms p99.999 < 10ms #[Mean = 1.509, StdDeviation = 0.466] #[Max = 16.303, Total count = 1818000] On normal distributions 99.999% of values are within 4.5 standard deviations of the mean: 1.509 + 4.5*0.466 ≈ 3.606 ms, so we chime in: DEAL! [1] image credit to Kyle Kingsbury
  34. 34 > drive business needs??

  35. 35 - we agreed that 99% of our payment users’

    requests are to be below 7.5ms p99 < 7.5ms #[Mean = 1.509, StdDeviation = 0.466] #[Max = 16.303, Total count = 1818000] Service Level Agreement
  36. 36 - we agreed that 99.999% of our payment users’

    requests are to be below 10ms p99.999 < 10ms #[Mean = 1.509, StdDeviation = 0.466] #[Max = 16.303, Total count = 1818000] OK Service Level Agreement NOK Service Level Agreement
  37. 37 - we agreed that 99.999% of our payment users’

    requests are to be below 10ms p99.999 < 10ms #[Mean = 1.509, StdDeviation = 0.466] #[Max = 16.303, Total count = 1818000] OK Service Level Agreement NOK Service Level Agreement Focus our work upon Don’t prematurely optimize what’s OK
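
The two objectives boil down to a trivial check against the measured percentiles. Illustrative Python — `meets_sla` and the sample values are hypothetical; the 7.5 ms and 10 ms thresholds are the ones on the slides:

```python
def meets_sla(p99_ms, p99999_ms):
    """SLA from the slides: p99 < 7.5 ms and p99.999 < 10 ms."""
    return p99_ms < 7.5 and p99999_ms < 10.0

print(meets_sla(3.2, 9.1))   # True: both objectives met
print(meets_sla(3.2, 16.3))  # False: the p99.999 objective is blown
```

This is the point of the OK/NOK split: the percentile that fails tells you exactly which slice of the distribution to focus work on.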
  38. 38 > what is the chance of experiencing P99?

  39. 39 [1] Gil Tene: https://bravenewgeek.com/everything-you-know-about-latency-is-wrong/ [2] https://en.wikipedia.org/wiki/Bernoulli_trial > what

    is the chance of experiencing P99? If it was 1 request: - (1 - 0.99) * 100 = 1% Amazon.com example: - 190 requests - 0.99^190 ~= 0.148 - (1 - 0.148) * 100 ~= 85.2% To get the probability of at least one success you use the opposite-event formula. - The probability of being below P99 is 99% on each attempt (“success”, i.e. experiencing P99, is 1%). - The probability of failure on each attempt is (1 - 99%). - The probability of n failures in a row is (1 - 0.01)^n, and so the probability of at least one success is 1 - (1 - 0.01)^n
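
The opposite-event arithmetic above, as an illustrative Python one-liner reproducing the slide’s numbers:

```python
def p_experience_worst_1pct(n, p=0.01):
    """Probability that at least one of n independent requests exceeds the P99."""
    return 1 - (1 - p) ** n

print(f"{p_experience_worst_1pct(1):.1%}")    # 1.0% for a single request
print(f"{p_experience_worst_1pct(190):.1%}")  # ~85.2% for 190 requests (Amazon example)
```

A page load fanning out into ~190 backend requests makes the “rare” P99 the common case — most users hit it.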
  40. 40 > we know a better way to understand latency

    behaviour and distribution... > no free lunch... > we need to give it context... metrics: Operations per unit of time, success metrics, error metrics, utilization, saturation, latency, and many more...
  41. 41 > I pass the ball to you...

  42. 42 more… - theory/code: - apply it in your company:

    • how not to measure latency - gil tene • frequency trails: modes and modality - brendan gregg • Metrics, Metrics, Everywhere - Coda Hale • lies, damn lies, and metrics - andré arko • most page loads will experience the 99%'lie server response - gil tene • if you are not measuring and/or plotting max, what are you hiding (from)? - gil tene • latency heat maps - brendan gregg • visualizing system latency - brendan gregg • t-digest - ted dunning • hdrhistogram: a high dynamic range histogram • Check with SaaS-based monitoring services that you use like NewRelic, DataDog, etc… • OSS monitoring solution like prometheus at: Histograms and Summaries page
  43. 43 > help is always wanted: Go: filipecosta90/hdrhistogram - Forked

    from codahale/hdrhistogram given that codahale archived the repo - A pure Go implementation of Gil Tene's HDR Histogram. full credits to @codahale and @giltene - ~10.9 ns/op ( C version ~6ns/op ~= >=100M ingestions/sec ) C: RedisBloom / tdigest - Forked from hrbrmstr/tdigest - Descendant of Ted Dunning’s MergingDigest, available at: https://github.com/tdunning/t-digest/ - Contains the work of Andrew Werner originally available at: https://github.com/ajwerner/tdigestc - ~60ns/op - meaning 2X faster inserts than the forked version