DevOpsPorto Meetup 37: Why averages lie by Filipe Oliveira

DevOpsPorto

July 23, 2020

Transcript

  1. why averages lie and what we can do about it

    @fcosta_oliveira / July 2020 1
  2. > whoami - work as performance engineer @Redis Labs -

    improve/develop open source performance/observability tools - some by necessity, some for the fun of it - https://github.com/filipecosta90 2
  3. > taking a step back 3

  4. > why do we write code? 4

  5. > respond to needs... business…. org... society... someone’s... 5

  6. > it all starts with... work: what we actually care

    about metrics: help you characterize the status of a specific work 6 performance: amount of useful work we can accomplish
  7. > this is too abstract. Let's get concrete. work: application

    performance: Understanding application behaviour ( responsiveness ) metrics: Operations per unit of time, success metrics, error metrics, utilization, saturation, latency, and many more... 7
  8. > Understanding latency behaviour define it: length of the operation

    understand it: 8 - naive approach: store raw data and use it to characterize the system
  9. all latency distributions have the same “common case” -

    the average > Understanding latency behaviour define it: length of the operation understand it: - the common case approach: why it “lies” and why it’s folly to chase it... the mean 9 [1] https://en.wikipedia.org/wiki/Anscombe%27s_quartet
  10. > the good, the bad, and the ugly average? 10

  11. > the good, the bad, and the ugly average? 11

  12. > the good, the bad, and the ugly average? 12

  13. > the good, the bad, and the ugly average? 13

    [1] http://en.wikipedia.org/wiki/Multimodal_distribution workflow/pattern 1 peak occurrences workflow/pattern 2 peak occurrences
  14. > the good, the bad, and the ugly average? 14

    [1] http://www.brendangregg.com/FrequencyTrails/modes.html 1) Node.js HTTP server response time (latency) from 50 random production servers, showing around 10,000 responses each. 2) MySQL command latency from 50 random production servers, showing around 10,000 commands for each distribution.
  15. > we know there is a problem… 15 - susceptible

    to outliers... - hides the long tail (high latencies)... - under-estimates the actual user experience... - does not correspond to any actual observed latency….
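
The “susceptible to outliers” failing is easy to demonstrate. A minimal Python sketch (illustrative, not from the talk — the latency values are made up):

```python
# A single pathological request drags the mean, while the median barely moves.
import statistics

base = [1.0] * 99                 # 99 requests at 1 ms
with_outlier = base + [1000.0]    # plus one 1-second stall

print(statistics.mean(with_outlier))    # 10.99 -> "average latency" looks 11x worse
print(statistics.median(with_outlier))  # 1.0   -> typical experience unchanged
```

One outlier in a hundred samples moves the mean by an order of magnitude, yet no user actually experienced ~11 ms.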
  16. > we know that… 16 better understanding... better decisions...

  17. 17 > how can we solve it?

  18. - representative of data - space and time efficient to

    compute - practical to use 18
  19. 19 > percentiles to the rescue… [1] https://en.wikipedia.org/wiki/Percentile value below

    which a given percentage of observations in a group of observations falls...
  20. - representative of data: - what is the latency that

    90% of our users experience? p90 - what is our worst 1% latency interval? did we get worse compared to last week? [p99,p100] - what is the % of users that are served up to 1ms? and 5ms? *** - space and time efficient: - millions of samples -> ~= dozens/hundreds - practical to use: - look at different slices of the distribution depending on your business needs/maturity. startup != twitter 20
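
The p90/p99 questions above can be answered from raw samples with a naive nearest-rank percentile. An illustrative Python sketch — the bimodal workload and the `percentile` helper are hypothetical, not the talk’s code:

```python
import random
import statistics

random.seed(42)
# Hypothetical bimodal workload: 95% fast requests (~1 ms), 5% slow (~20 ms).
latencies = ([random.gauss(1.0, 0.2) for _ in range(9500)] +
             [random.gauss(20.0, 2.0) for _ in range(500)])

def percentile(data, p):
    """Nearest-rank percentile: value below which p% of observations fall."""
    s = sorted(data)
    k = max(0, min(len(s) - 1, round(p / 100.0 * len(s)) - 1))
    return s[k]

print(f"mean = {statistics.mean(latencies):.2f} ms")  # ~2 ms: hides the slow mode
print(f"p90  = {percentile(latencies, 90):.2f} ms")   # still inside the fast mode
print(f"p99  = {percentile(latencies, 99):.2f} ms")   # exposes the ~20 ms tail
```

Note this naive approach keeps every raw sample — representative, but not space-efficient; that gap is what the histogram sketches on the next slides close.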
  21. 21 > how to calculate Percentiles/CDFs...

  22. 22 > we need histograms...

  23. 23 > they come in different colors and flavours... t-digest,

    d-digest, hdrhistogram, “raw”, a lot of different sketches... Not Important! What matters: space, speed, precision...
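
To get a feel for the space/precision trade-off those sketches make, here is a deliberately crude fixed-width histogram in Python (illustrative only — t-digest and HdrHistogram place buckets far more cleverly than this):

```python
# Constant space regardless of sample count; precision limited to bucket width.
class FixedHistogram:
    def __init__(self, max_ms=100.0, buckets=1000):
        self.width = max_ms / buckets        # 0.1 ms per bucket here
        self.counts = [0] * buckets
        self.total = 0

    def record(self, value_ms):
        idx = min(len(self.counts) - 1, int(value_ms / self.width))
        self.counts[idx] += 1
        self.total += 1

    def percentile(self, p):
        """Upper edge of the bucket holding the p-th percentile."""
        target = p / 100.0 * self.total
        seen = 0
        for i, count in enumerate(self.counts):
            seen += count
            if count and seen >= target:
                return (i + 1) * self.width
        return len(self.counts) * self.width

h = FixedHistogram()
for v in [1.25, 1.45, 1.55, 1.65, 2.05, 2.15, 5.05, 16.35]:
    h.record(v)
print(h.percentile(50))  # ~1.7: the median, accurate to one 0.1 ms bucket
```

Whatever the flavour, the shape is the same: record into buckets, then walk the counts to answer percentile/CDF queries.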
  24. 24 [1] https://hdrhistogram.github.io/HdrHistogram/plotFiles.html

  25. - representative of data: - what is the latency that

    90% of our users experience? p90 - what is our worst 1% latency interval? did we get worse compared to last week? [p99,p100] - what is the % of users that are served up to 1ms? and 5ms? *** 25
  26. - representative of data: - what is the latency that

    90% of our users experience? p90 - what is our worst 1% latency interval? did we get worse compared to last week? [p99,p100] - what is the % of users that are served up to 1ms? and 5ms? *** 26
  27. 27 - what is the latency that 90% of our

    users experience? #[Mean = 1.509, StdDeviation = 0.466] #[Max = 16.303, Total count = 1818000]
  28. - representative of data: - what is the latency that

    90% of our users experience? p90 - what is our worst 1% latency interval? did we get worse compared to last week? [p99,p100] - what is the % of users that are served up to 1ms? and 5ms? *** 28
  29. 29 - what is our worst 1% latency interval? 100%

    #[Mean = 1.509, StdDeviation = 0.466] #[Max = 16.303, Total count = 1818000]
  30. 30 - …did we get worse compared to

    last week? YES 100% #[Mean = 1.509, StdDeviation = 0.466] #[Max = 16.303, Total count = 1818000]
  31. - representative of data: - what is the latency that

    90% of our users experience? p90 - what is our worst 1% latency interval? did we get worse compared to last week? [p99,p100] - what is the % of users that are served up to 1ms? and 5ms? *** 31
  32. 32 - what is the % of users that are

    served up to 1ms? and 5ms? CDF(1)~=5%, CDF(5)~=99.8% [1] https://en.wikipedia.org/wiki/Cumulative_distribution_function #[Mean = 1.509, StdDeviation = 0.466] #[Max = 16.303, Total count = 1818000]
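
The CDF question on this slide is just “what fraction of observations fall at or below a bound”. An illustrative Python sketch over a tiny made-up sample (not the talk’s 1.8M-sample distribution):

```python
def cdf(latencies, bound_ms):
    """Empirical CDF: fraction of observations at or below bound_ms."""
    return sum(1 for x in latencies if x <= bound_ms) / len(latencies)

# Hypothetical sample: mostly ~1-2 ms requests with a small slow tail.
sample = [0.8, 1.2, 1.4, 1.5, 1.6, 1.7, 1.9, 2.1, 4.9, 16.3]
print(f"{cdf(sample, 1.0):.0%} served up to 1 ms")  # 10% served up to 1 ms
print(f"{cdf(sample, 5.0):.0%} served up to 5 ms")  # 90% served up to 5 ms
```

The CDF and the percentile function are inverses of each other: CDF(5 ms) ≈ 99.8% is the same statement as p99.8 ≈ 5 ms.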
  33. 33 - Drives business needs (why it’s folly to chase

    means ): - we agreed that 99% of our payment users’ requests are to be below 7.5ms p99 < 7.5ms - we agreed that 99.999% of our payment users’ requests are to be below 10ms p99.999 < 10ms #[Mean = 1.509, StdDeviation = 0.466] #[Max = 16.303, Total count = 1818000] On normal distributions 99.999% of values are within 4.5 standard deviations of the mean: 1.509 + 4.5*0.466 ≈ 3.606 ms, so we chime in: DEAL! [1] image credit to Kyle Kingsbury
  34. 34 > drive business needs??

  35. 35 - we agreed that 99% of our payment users’

    requests are to be below 7.5ms p99 < 7.5ms #[Mean = 1.509, StdDeviation = 0.466] #[Max = 16.303, Total count = 1818000] Service Level Agreement
  36. 36 - we agreed that 99.999% of our payment users’

    requests are to be below 10ms p99.999 < 10ms #[Mean = 1.509, StdDeviation = 0.466] #[Max = 16.303, Total count = 1818000] OK Service Level Agreement NOK Service Level Agreement
  37. 37 - we agreed that 99.999% of our payment users’

    requests are to be below 10ms p99.999 < 10ms #[Mean = 1.509, StdDeviation = 0.466] #[Max = 16.303, Total count = 1818000] OK Service Level Agreement NOK Service Level Agreement Focus our work upon Don’t prematurely optimize what’s OK
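
The two objectives boil down to a trivial check against the measured percentiles. Illustrative Python — `meets_sla` and the sample values are hypothetical; the 7.5 ms and 10 ms thresholds are the ones on the slides:

```python
def meets_sla(p99_ms, p99999_ms):
    """SLA from the slides: p99 < 7.5 ms and p99.999 < 10 ms."""
    return p99_ms < 7.5 and p99999_ms < 10.0

print(meets_sla(3.2, 9.1))   # True: both objectives met
print(meets_sla(3.2, 16.3))  # False: the p99.999 objective is blown
```

This is the point of the OK/NOK split: the percentile that fails tells you exactly which slice of the distribution to focus work on.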
  38. 38 > what is the chance of experiencing P99?

  39. 39 [1] Gil Tene: https://bravenewgeek.com/everything-you-know-about-latency-is-wrong/ [2] https://en.wikipedia.org/wiki/Bernoulli_trial > what

    is the chance of experiencing P99? If it was 1 request: - (1 - 0.99) * 100 = 1% Amazon.com example: - 190 requests - 0.99^190 ~= 0.148 - (1 - 0.148) * 100 ~= 85.2% To get the probability of at least one success you use the opposite-event formula. - The probability of being below P99 is 99% on each attempt (“success”, i.e. experiencing P99, is 1%). - The probability of failure on each attempt is (1 - 99%). - The probability of n failures in a row is (1 - 0.01)^n, and so the probability of at least one success is 1 - (1 - 0.01)^n
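
The opposite-event arithmetic above, as an illustrative Python one-liner reproducing the slide’s numbers:

```python
def p_experience_worst_1pct(n, p=0.01):
    """Probability that at least one of n independent requests exceeds the P99."""
    return 1 - (1 - p) ** n

print(f"{p_experience_worst_1pct(1):.1%}")    # 1.0% for a single request
print(f"{p_experience_worst_1pct(190):.1%}")  # ~85.2% for 190 requests (Amazon example)
```

A page load fanning out into ~190 backend requests makes the “rare” P99 the common case — most users hit it.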
  40. 40 > we know a better way to understand latency

    behaviour and distribution... > no free lunch... > we need to give it context... metrics: Operations per unit of time, success metrics, error metrics, utilization, saturation, latency, and many more...
  41. 41 > I pass the ball to you...

  42. 42 more… - theory/code: - apply it in your company:

    • how not to measure latency - gil tene • frequency trails: modes and modality - brendan gregg • Metrics, Metrics, Everywhere - Coda Hale • lies, damn lies, and metrics - andré arko • most page loads will experience the 99%'lie server response - gil tene • if you are not measuring and/or plotting max, what are you hiding (from)? - gil tene • latency heat maps - brendan gregg • visualizing system latency - brendan gregg • t-digest - ted dunning • hdrhistogram: a high dynamic range histogram • Check with SaaS-based monitoring services that you use like NewRelic, DataDog, etc… • OSS monitoring solution like prometheus at: Histograms and Summaries page
  43. 43 > help is always wanted: Go: filipecosta90/hdrhistogram - Forked

    from codahale/hdrhistogram given that codahale archived the repo - A pure Go implementation of Gil Tene's HDR Histogram. full credits to @codahale and @giltene - ~10.9 ns/op ( C version ~6ns/op ~= >=100M ingestions/sec ) C: RedisBloom / tdigest - Forked from hrbrmstr/tdigest - Descendant of Ted Dunning’s MergingDigest, available at: https://github.com/tdunning/t-digest/ - Contains the work of Andrew Werner originally available at: https://github.com/ajwerner/tdigestc - ~60ns/op - meaning 2X faster inserts than the forked version