
DevOpsPorto Meetup 37: Why averages lie by Filipe Oliveira


DevOpsPorto

July 23, 2020



Transcript

  1. why averages lie
    and what we can do about it
    @fcosta_oliveira / July 2020 1


  2. > whoami
    - work as performance engineer @Redis Labs
    - improve/develop open source performance/observability tools
    - some by necessity, some for the fun of it
    - https://github.com/filipecosta90
    2


  3. > taking a step back
    3


  4. > why do we write code?
    4


  5. > respond to needs
    5
    business….
    org...
    society...
    someone’s...


  6. > it all starts with...
    work: what we actually care about
    metrics: help you characterize the status of a specific work
    6
    performance: amount of useful work we can accomplish


  7. > this is too abstract. Let's get concrete.
    work: application
    performance: Understanding application behaviour ( responsiveness )
    metrics: Operations per unit of time, success metrics, error metrics,
    utilization, saturation, latency, and many more...
    7


  8. > Understanding latency behaviour
    define it: length of the operation
    understand it:
    8
    - naive approach: store raw data and use it to characterize the
    system


  9. > Understanding latency behaviour
    define it: length of the operation
    understand it:
    - the common-case approach: the mean, why it “lies” and why it’s
    folly to chase it...
    all latency distributions have the same “common case” - the average
    9
    [1] https://en.wikipedia.org/wiki/Anscombe%27s_quartet
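Anscombe's quartet makes this point for summary statistics in general; for latency specifically, a quick Python sketch (hypothetical numbers) shows two workloads with an identical average but very different tails:

```python
from statistics import mean

# Two hypothetical latency samples in ms: identical averages, very different tails.
steady = [1.0, 1.25, 0.75, 1.0, 1.0, 1.25, 0.75, 1.0]
spiky = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 4.5]

# Both report a 1.0 ms "common case"; only looking past the mean
# (e.g. at the max) exposes the 4.5 ms outlier in the second sample.
same_mean = mean(steady) == mean(spiky)
```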


  10. > the good, the bad, and the ugly average?
    10


  11. > the good, the bad, and the ugly average?
    11


  12. > the good, the bad, and the ugly average?
    12


  13. > the good, the bad, and the ugly average?
    13
    [1] http://en.wikipedia.org/wiki/Multimodal_distribution
    workflow/pattern 1
    peak occurrences
    workflow/pattern 2
    peak occurrences


  14. > the good, the bad, and the ugly average?
    14
    [1] http://www.brendangregg.com/FrequencyTrails/modes.html
    1) Node.js HTTP server response time (latency) from 50 random production servers, showing around 10,000 responses each.
    2) MySQL command latency from 50 random production servers, showing around 10,000 commands for each distribution.
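The bimodal trap above can be reproduced in a few lines of Python; the two modes (e.g. two workflows such as cache hits vs. misses) and their values are illustrative assumptions:

```python
from statistics import mean

# A toy bimodal latency distribution: two workflows with distinct
# peak occurrences, e.g. a fast path and a slow path.
fast_mode = [1.0] * 50   # workflow/pattern 1: ~1 ms
slow_mode = [9.0] * 50   # workflow/pattern 2: ~9 ms
combined = fast_mode + slow_mode

# The mean lands between the modes: a latency no request actually had.
m = mean(combined)
```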


  15. > we know there is a problem…
    15
    - susceptible to outliers...
    - hides the long tail (high latencies)...
    - underestimates the actual user experience...
    - does not correspond to any discrete latency...


  16. > we know that…
    16
    better understanding...
    better decisions...


  17. 17
    > how can we solve it?


  18. - representative of data
    - space and time efficient to compute
    - practical to use
    18


  19. 19
    > percentiles to the rescue…
    [1] https://en.wikipedia.org/wiki/Percentile
    value below which a given percentage of observations in a group of observations falls...


  20. - representative of the data:
    - what is the latency that 90% of our users experience? p90
    - what is our worst 1% latency interval? did we get worse on it
    compared to last week? [p99,p100]
    - what % of users are served within 1ms? and 5ms? ***
    - space and time efficient:
    - millions of samples -> ~= dozens/hundreds of buckets
    - practical to use:
    - look at different slices of the distribution depending on your
    business needs and maturity. startup != twitter
    20
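As a sketch of the idea (not any particular library's API), a nearest-rank percentile over raw samples looks like this in Python, with hypothetical latencies:

```python
def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of the data at or below it."""
    s = sorted(samples)
    k = max(0, -(-len(s) * p // 100) - 1)  # ceil(n * p / 100) - 1, clamped at 0
    return s[k]

# Hypothetical latencies in ms.
latencies = [0.8, 1.1, 1.2, 1.3, 1.5, 1.6, 1.9, 2.2, 3.0, 16.3]
p50 = percentile(latencies, 50)    # the median
p90 = percentile(latencies, 90)    # what 90% of users experience, or better
p100 = percentile(latencies, 100)  # the max: the top of the [p99, p100] worst interval
```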


  21. 21
    > how to calculate Percentiles/CDFs...


  22. 22
    > we need histograms...


  23. 23
    > they come in different colors and flavours...
    t-digest, d-digest, hdrhistogram, “raw”, a lot of different sketches...
    Not Important!
    What matters:
    space, speed, precision...
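A minimal stand-in for those sketches, assuming simple fixed bucket bounds rather than the adaptive schemes of hdrhistogram or t-digest, shows how a histogram answers CDF questions:

```python
from bisect import bisect_left

# Hypothetical fixed-bucket histogram: bucket i counts values <= bounds[i]
# (and above the previous bound). Real sketches (hdrhistogram, t-digest)
# choose bounds adaptively; the fixed bounds here are an assumption.
bounds = [1, 2, 5, 10, 50]   # bucket upper bounds, in ms
counts = [0] * len(bounds)

def record(latency_ms):
    i = bisect_left(bounds, latency_ms)    # first bucket whose bound >= value
    counts[min(i, len(bounds) - 1)] += 1   # clamp overflows into the last bucket

def cdf(x_ms):
    """Fraction of recorded values <= x_ms: 'what % of users are served within x ms?'"""
    total = sum(counts)
    covered = sum(c for b, c in zip(bounds, counts) if b <= x_ms)
    return covered / total

for v in [0.5, 0.9, 1.5, 3, 7, 20]:
    record(v)
```

Note the trade-off named on the slide: with only five buckets the CDF is exact at bucket boundaries but approximate in between, i.e. space/speed bought with precision.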


  24. 24
    [1] https://hdrhistogram.github.io/HdrHistogram/plotFiles.html


  25. - representative of the data:
    - what is the latency that 90% of our users experience? p90
    - what is our worst 1% latency interval? did we get worse on it
    compared to last week? [p99,p100]
    - what % of users are served within 1ms? and 5ms? ***
    25


  26. - representative of the data:
    - what is the latency that 90% of our users experience? p90
    - what is our worst 1% latency interval? did we get worse on it
    compared to last week? [p99,p100]
    - what % of users are served within 1ms? and 5ms? ***
    26


  27. 27
    - what is the latency that 90% of our users experience?
    #[Mean = 1.509, StdDeviation = 0.466]
    #[Max = 16.303, Total count = 1818000]


  28. - representative of the data:
    - what is the latency that 90% of our users experience? p90
    - what is our worst 1% latency interval? did we get worse on it
    compared to last week? [p99,p100]
    - what % of users are served within 1ms? and 5ms? ***
    28


  29. 29
    - what is our worst 1% latency interval?
    100%
    #[Mean = 1.509, StdDeviation = 0.466]
    #[Max = 16.303, Total count = 1818000]


  30. 30
    - ...did we get worse on it compared to last week? YES
    100%
    #[Mean = 1.509, StdDeviation = 0.466]
    #[Max = 16.303, Total count = 1818000]


  31. - representative of the data:
    - what is the latency that 90% of our users experience? p90
    - what is our worst 1% latency interval? did we get worse on it
    compared to last week? [p99,p100]
    - what % of users are served within 1ms? and 5ms? ***
    31


  32. 32
    - what is the % of users that are served up to 1ms? and 5ms?
    CDF(1)~=5%, CDF(5)~=99.8%
    [1] https://en.wikipedia.org/wiki/Cumulative_distribution_function
    #[Mean = 1.509, StdDeviation = 0.466]
    #[Max = 16.303, Total count = 1818000]


  33. 33
    - Drives business needs (why it’s folly to chase means):
    - we agreed that 99% of our payment users’ requests will be below 7.5ms:
    p99 < 7.5ms
    - we agreed that 99.999% of our payment users’ requests will be below 10ms:
    p99.999 < 10ms
    #[Mean = 1.509, StdDeviation = 0.466]
    #[Max = 16.303, Total count = 1818000]
    On a normal distribution 99.999% of values are within
    4.5 standard deviations of the mean: 1.509 + 4.5*0.466 ~=
    3.606 ms, so we chime in! DEAL!
    [1] image credit to Kyle Kingsbury


  34. 34
    > drive business needs??


  35. 35
    - we agreed that 99% of our payment users’ requests will be below 7.5ms:
    p99 < 7.5ms
    #[Mean = 1.509, StdDeviation = 0.466]
    #[Max = 16.303, Total count = 1818000]
    Service Level Agreement


  36. 36
    - we agreed that 99.999% of our payment users’ requests will be below 10ms:
    p99.999 < 10ms
    #[Mean = 1.509, StdDeviation = 0.466]
    #[Max = 16.303, Total count = 1818000]
    OK Service Level Agreement NOK Service Level Agreement


  37. 37
    - we agreed that 99.999% of our payment users’ requests will be below 10ms:
    p99.999 < 10ms
    #[Mean = 1.509, StdDeviation = 0.466]
    #[Max = 16.303, Total count = 1818000]
    OK Service Level Agreement NOK Service Level Agreement
    Focus our work upon
    Don't prematurely
    optimize what’s OK
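A small Python sketch of such an SLO check (the sample data and targets are hypothetical) mirrors the two agreements above, with the p99 target passing and the p99.999 target failing:

```python
import math
import random

def quantile(samples, q):
    """Nearest-rank quantile for 0 < q <= 1 (q=0.99999 gives p99.999)."""
    s = sorted(samples)
    return s[max(0, math.ceil(len(s) * q) - 1)]

# Hypothetical sample: ~100k mostly-fast requests plus a rare slow tail.
random.seed(7)
latencies = [random.gauss(1.5, 0.3) for _ in range(100_000)] + [12.0] * 5

# The two agreed targets, checked against the observed distribution.
slos = {"p99 < 7.5ms": (0.99, 7.5), "p99.999 < 10ms": (0.99999, 10.0)}
results = {name: quantile(latencies, q) < bound for name, (q, bound) in slos.items()}
```

Only the failing target (the NOK agreement) deserves optimization work; the passing one is already OK.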


  38. 38
    > what is the chance of experiencing P99?


  39. 39
    > what is the chance of experiencing P99?
    If it was 1 request:
    - (1 - 0.99) * 100 = 1%
    Amazon.com example:
    - 190 requests
    - 0.99^190 ~= 0.148
    - (1 - 0.148) * 100 ~= 85.2%
    To get the probability of at least one “success”, use the
    opposite-event formula:
    - the probability of being below P99 on each attempt is 99%
    (“success”, landing above it, is 1%)
    - the probability of n attempts in a row all staying below P99 is
    (1 - 0.01)^n
    - so the probability of at least one success is 1 - (1 - 0.01)^n
    [1] Gil Tene:
    https://bravenewgeek.com/everything-you-know-about-latency-is-wrong/
    [2] https://en.wikipedia.org/wiki/Bernoulli_trial
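The opposite-event arithmetic on this slide translates directly to Python:

```python
# Opposite-event formula: P(at least one request above p99 in n tries)
# = 1 - P(all n stay below) = 1 - 0.99^n.
def chance_of_exceeding_p99(n_requests):
    return 1 - 0.99 ** n_requests

one = chance_of_exceeding_p99(1)       # 1% for a single request
amazon = chance_of_exceeding_p99(190)  # ~85.2% for a 190-request page load
```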


  40. 40
    > we know a better way to understand latency
    behaviour and distribution...
    > no free lunch...
    > we need to give it context...
    metrics: Operations per unit of time, success metrics, error metrics, utilization, saturation,
    latency, and many more...


  41. 41
    > I pass the ball to you...


  42. 42
    more…
    - theory/code:
    ● how not to measure latency - gil tene
    ● frequency trails: modes and modality - brendan gregg
    ● Metrics, Metrics, Everywhere - Coda Hale
    ● lies, damn lies, and metrics - andré arko
    ● most page loads will experience the 99%'lie server response - gil tene
    ● if you are not measuring and/or plotting max, what are you hiding (from)? - gil tene
    ● latency heat maps - brendan gregg
    ● visualizing system latency - brendan gregg
    ● t-digest - ted dunning
    ● hdrhistogram: a high dynamic range histogram
    - apply it in your company:
    ● check with the SaaS-based monitoring services that you use, like NewRelic, DataDog, etc…
    ● OSS monitoring solutions like Prometheus: see the Histograms and Summaries page


  43. 43
    > help is always wanted:
    Go: filipecosta90/hdrhistogram
    - Forked from codahale/hdrhistogram, given that codahale archived the repo
    - A pure Go implementation of Gil Tene's HDR Histogram. full credits to @codahale and
    @giltene
    - ~10.9 ns/op ( C version ~6ns/op ~= >=100M ingestions/sec )
    C: RedisBloom / tdigest
    - Forked from hrbrmstr/tdigest
    - Descendant of Ted Dunning's MergingDigest, available at:
    https://github.com/tdunning/t-digest/
    - Contains the work of Andrew Werner, originally available at:
    https://github.com/ajwerner/tdigestc
    - ~60ns/op - meaning 2X faster inserts than the forked version
