
SLOs that Lie

- Is uptime really the right measure of your reliability?
- What happens when that which monitors downtime has downtime?
- If upstream/downstream is down - how does it impact your numbers?
- Find answers to these questions and more.

Piyush Verma

July 09, 2020

Transcript

  1. SLOs that Lie

  2. Piyush Verma, CTO, last9.io (@realmeson10)

  3. What’s in it for you?

    - How does the world view SLOs?
    - How deep do they go?
    - How do we measure them?
      - What do we leave out?
      - How does that which we don’t notice botch the numbers?
    - What should you do about it?
  4. SLO: Service Level Objectives

  5. SLI: Service Level Indicators

  6. Service Levels

  7. Availability

  8. Speed

  9. Correctness

  10. Freshness

  11. How to measure a level

  12. Availability: hours

  13. Speed: Mbps? ms?

  14. Correctness: ?

  15. Freshness: ?

  16. SLI: Service Level Indicators

  17. SLO: Service Level Objectives

  18. In a given week/day/month, SLI X should be > < = ___ (SLO), or else …. (SLA)
  19. Uptime

  20. Is this 3 9s?

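  21. Allowed downtime by number of 9s

    ┌────────┬───────────┬───────────┬───────────┬───────────┐
    │ 9s     │ Per Day   │ Per Week  │ Per Month │ Per Year  │
    ├────────┼───────────┼───────────┼───────────┼───────────┤
    │ 99     │ 14.4 mins │ 1.7 hours │ 7.3 hours │ 3.7 days  │
    │ 99.9   │ 1.4 mins  │ 10.1 mins │ 43.8 mins │ 8.7 hours │
    │ 99.99  │ 8.6 secs  │ 1 min     │ 4.4 mins  │ 52.6 mins │
    │ 99.999 │ 0.864 sec │ 6.1 secs  │ 26.3 secs │ 5.3 mins  │
    └────────┴───────────┴───────────┴───────────┴───────────┘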
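
The downtime budgets in the table above follow from simple arithmetic: window length times the fraction of time the target leaves uncovered. A minimal sketch, assuming a 365.25-day year and a month of one twelfth of a year (the function and constant names are illustrative, not from the talk):

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: a year is 365.25 days and a month is one twelfth of a year.
WINDOW_SECONDS = {
    "day": SECONDS_PER_DAY,
    "week": 7 * SECONDS_PER_DAY,
    "month": 365.25 * SECONDS_PER_DAY / 12,
    "year": 365.25 * SECONDS_PER_DAY,
}


def allowed_downtime_seconds(target_percent, window):
    """Seconds of downtime a target such as 99.9 permits in one window."""
    return WINDOW_SECONDS[window] * (100.0 - target_percent) / 100.0


if __name__ == "__main__":
    for target in (99, 99.9, 99.99, 99.999):
        budgets = {w: round(allowed_downtime_seconds(target, w), 1) for w in WINDOW_SECONDS}
        print(target, budgets)
```

Under these assumptions, three 9s per week comes to about 605 seconds (roughly 10 minutes), and five 9s per week to about 6 seconds.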
  22. Window

  23. To measure uptime: Total - Down

  24. Downtime

  25. How long was I asleep? Try answering without external observation.

  26. Device SDK

  27. Metrics emission

  28. 99.9% / week ≈ 10 mins ≈ 1.4 mins/day

  30. There are 10K devices, averaging 500 ms to reach. That leaves 1.4 minutes to report AND fix.

  31. If 1% of device SDKs experience an ISP fault, the 99.9% is only 99% true.
  32. 60% of the time, it works every time [Anchorman]

  33. Layered SLOs

  34. Client ⇆ Firewall ⇆ Load Balancer ⇆ Proxy ⇆ Handler

  35. An SLO is an aggregation of all the underlying layers
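
One common way to picture this aggregation is to treat the layers as a serial chain in which a request succeeds only if every layer does. A minimal sketch, assuming independent failures and purely hypothetical per-layer availabilities (the talk gives no figures and does not prescribe this model):

```python
from math import prod


def composite_availability(layer_availabilities):
    """End-to-end availability of layers in series.

    Assumption: failures are independent, so the availabilities multiply.
    """
    return prod(layer_availabilities)


# Hypothetical numbers for illustration only.
layers = {
    "firewall": 0.9999,
    "load_balancer": 0.9999,
    "proxy": 0.9999,
    "handler": 0.9999,
}
end_to_end = composite_availability(layers.values())

if __name__ == "__main__":
    print(f"end-to-end availability: {end_to_end:.6f}")
```

Under this model, four layers that each meet four 9s yield an end-to-end figure below four 9s, which is why a single layer's number says little about what the client experiences.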

  36. It’s possible that the business calls this SLO Uptime, but you may need ErrorRatio!!
  37. 1. Uptime/Downtime monitors

  38. Uptime % = timeUp ÷ (timeUp + timeDown) × 100

  39. Downtime is how the monitor sees it, NOT how the customer sees it

  40. How would a downtime monitor maintain its 100% uptime?

  41. Challenges

  42. The uptime monitor itself offers 99.99% / month. Actual downtime could be your 4 mins AND their 4 mins.
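
The arithmetic behind this slide can be sketched as follows. It assumes the worst case, where the monitor's own permitted downtime never overlaps with the downtime it reports for the service, and a month of one twelfth of a 365.25-day year; the function names are illustrative:

```python
# Assumption: a month is one twelfth of a 365.25-day year.
MONTH_SECONDS = 365.25 * 24 * 60 * 60 / 12


def downtime_budget_seconds(target_percent):
    """Downtime per month that a target such as 99.99 permits."""
    return MONTH_SECONDS * (100.0 - target_percent) / 100.0


def worst_case_availability(service_percent, monitor_percent):
    """Availability if the monitor's blind time is all unseen service downtime.

    Assumption: the two downtime budgets do not overlap, so they add up.
    """
    total_down = downtime_budget_seconds(service_percent) + downtime_budget_seconds(monitor_percent)
    return 100.0 * (1 - total_down / MONTH_SECONDS)


if __name__ == "__main__":
    print(round(downtime_budget_seconds(99.99) / 60, 1), "mins per month at 99.99%")
    print(round(worst_case_availability(99.99, 99.99), 4), "% worst case")
```

With both the service and the monitor at 99.99% per month, the worst case is roughly 8.8 minutes of real downtime, or 99.98%.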
  43. 2. Geo-Balanced Monitors: should a failure in 1 geo be called a downtime?

  44. 3. State-based monitors

  45. Uptime is not requested every ms

  46. OK, Unknown, Down

  47. Pop Quiz: measure the uptime

  48.

    ┌────────┬──────────┬──────────┬──────────┬──────────┐
    │ Status │ OK       │ OK       │ Unknown  │ OK       │
    ├────────┼──────────┼──────────┼──────────┼──────────┤
    │ Time   │ 10:00:01 │ 10:00:10 │ 10:00:11 │ 10:00:21 │
    └────────┴──────────┴──────────┴──────────┴──────────┘

  49.

    ┌────────┬──────────┬──────────┬──────────┬──────────┐
    │ Status │ OK       │ Unknown  │ Down     │ OK       │
    ├────────┼──────────┼──────────┼──────────┼──────────┤
    │ Time   │ 10:00:01 │ 10:00:10 │ 10:00:20 │ 10:00:21 │
    └────────┴──────────┴──────────┴──────────┴──────────┘
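
The quiz has no single answer, because the result depends on how Unknown is counted. A minimal sketch that makes the choice explicit, assuming each observed state holds until the next sample (the function name and the `unknown_policy` parameter are illustrative, not from the talk):

```python
from datetime import datetime


def uptime_percent(samples, unknown_policy="down"):
    """Uptime % from (time, status) samples.

    Assumption: a state holds from its sample until the next sample.
    unknown_policy: "up", "down", or "exclude" (drop Unknown time entirely).
    """
    parsed = [(datetime.strptime(t, "%H:%M:%S"), s.lower()) for t, s in samples]
    up = down = 0.0
    for (start, status), (end, _) in zip(parsed, parsed[1:]):
        seconds = (end - start).total_seconds()
        if status == "unknown":
            if unknown_policy == "exclude":
                continue
            status = "ok" if unknown_policy == "up" else "down"
        if status == "ok":
            up += seconds
        else:
            down += seconds
    if up + down == 0:
        raise ValueError("no measurable time in the window")
    return 100.0 * up / (up + down)


quiz_1 = [("10:00:01", "OK"), ("10:00:10", "OK"), ("10:00:11", "Unknown"), ("10:00:21", "OK")]
quiz_2 = [("10:00:01", "OK"), ("10:00:10", "Unknown"), ("10:00:20", "Down"), ("10:00:21", "OK")]

if __name__ == "__main__":
    for name, quiz in (("quiz 1", quiz_1), ("quiz 2", quiz_2)):
        for policy in ("up", "down", "exclude"):
            print(name, policy, uptime_percent(quiz, policy))
```

Under this hold-until-next-sample assumption, the first timeline reads anywhere from 50% to 100% and the second from 45% to 95%, depending only on the Unknown policy.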
  50. Monitor’s sleep period may overlap with actual downtime

  51. Conclusion

  52. There is no one true SLO. Uptime of one layer could be error of another.

  53. Uptime is mostly and massively aggregated

  54. As the 9s increase, SLOs become more a confirmation of proactivity than a measure of reactivity
  55. Thank you. last9.io/failures, Piyush Verma