Slide 1

Slide 1 text

SLOs that Lie

Slide 2

Slide 2 text

Piyush Verma, CTO, last9.io, @realmeson10

Slide 3

Slide 3 text

What’s in it for you?
- How does the world view SLOs?
- How deep do they go?
- How do we measure them?
  - What do we leave out?
  - How does what we don’t notice botch the numbers?
- What should you do about it?

Slide 4

Slide 4 text

SLO: Service Level Objectives

Slide 5

Slide 5 text

SLI: Service Level Indicators

Slide 6

Slide 6 text

Service Levels

Slide 7

Slide 7 text

Availability

Slide 8

Slide 8 text

Speed

Slide 9

Slide 9 text

Correctness

Slide 10

Slide 10 text

Freshness

Slide 11

Slide 11 text

How to measure a level

Slide 12

Slide 12 text

Availability: hours

Slide 13

Slide 13 text

Speed: Mbps? ms?

Slide 14

Slide 14 text

Correctness: ?

Slide 15

Slide 15 text

Freshness: ?

Slide 16

Slide 16 text

SLI: Service Level Indicators

Slide 17

Slide 17 text

SLO: Service Level Objectives

Slide 18

Slide 18 text

In a given week/day/month:
SLI: X should be ___
SLO: should be > < = ___
SLA: or else …

Slide 19

Slide 19 text

Uptime

Slide 20

Slide 20 text

Is this 3 9s?

Slide 21

Slide 21 text

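Allowed downtime by number of 9s:

┌────────┬────────────┬────────────┬────────────┬────────────┐
│ 9s     │ Per Day    │ Per Week   │ Per Month  │ Per Year   │
├────────┼────────────┼────────────┼────────────┼────────────┤
│ 99     │ 14.4 mins  │ 1.7 hours  │ 7.3 hours  │ 3.7 days   │
│ 99.9   │ 1.4 mins   │ 10.1 mins  │ 43.8 mins  │ 8.76 hours │
│ 99.99  │ 8.6 secs   │ 1 min      │ 4.4 mins   │ 52.6 mins  │
│ 99.999 │ 0.864 sec  │ 6.05 secs  │ 26.3 secs  │ 5.3 mins   │
└────────┴────────────┴────────────┴────────────┴────────────┘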
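
The figures in this table can be re-derived with a few lines of arithmetic. The sketch below is illustrative only: the function name `downtime_budget_seconds` is hypothetical, and the month is assumed to be 1/12 of a 365-day year, which is what the 43.8 mins figure implies.

```python
# Window lengths in seconds. The "month" value is an assumption:
# one twelfth of a 365-day year.
WINDOW_SECONDS = {
    "day": 24 * 3600,
    "week": 7 * 24 * 3600,
    "month": 365 / 12 * 24 * 3600,
    "year": 365 * 24 * 3600,
}


def downtime_budget_seconds(availability_pct, window):
    """Seconds of downtime allowed in `window` at the given availability."""
    return WINDOW_SECONDS[window] * (1 - availability_pct / 100)


if __name__ == "__main__":
    for pct in (99, 99.9, 99.99, 99.999):
        row = ", ".join(
            f"{w}: {downtime_budget_seconds(pct, w):.1f}s" for w in WINDOW_SECONDS
        )
        print(f"{pct}% -> {row}")
```

Note how the budget for 99.999% over a week comes out at roughly six seconds, not six minutes.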

Slide 22

Slide 22 text

Window

Slide 23

Slide 23 text

To measure uptime: Total - Down

Slide 24

Slide 24 text

Downtime

Slide 25

Slide 25 text

How long was I asleep? Try answering without external observation.

Slide 26

Slide 26 text

Device SDK

Slide 27

Slide 27 text

Metrics emission

Slide 28

Slide 28 text

99.9% per week ≈ 10 mins ≈ 1.4 mins/day

Slide 29

Slide 29 text


Slide 30

Slide 30 text

There are 10K devices, averaging 500 ms to reach. That leaves 1.4 minutes to report AND fix.

Slide 31

Slide 31 text

1% of device SDKs experience an ISP fault: the 99.9% is only 99% true.
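
One way to read this slide: a number computed only from devices that managed to report says nothing about the devices that went silent. The sketch below bounds the fleet-wide availability under that reading. The function name and the interpretation (silent devices could have been anywhere between fully up and fully down) are assumptions, not something the slides spell out.

```python
def true_availability_bounds(measured, reporting_fraction):
    """Return (lowest, highest) fleet-wide availability consistent with the data.

    measured: availability computed from devices that reported (0..1)
    reporting_fraction: share of the fleet whose metrics arrived (0..1)
    """
    observed = measured * reporting_fraction
    silent = 1 - reporting_fraction
    # Silent devices may have been fully down (lower bound)
    # or fully up (upper bound); the metrics cannot tell which.
    return observed, observed + silent


if __name__ == "__main__":
    low, high = true_availability_bounds(measured=0.999, reporting_fraction=0.99)
    print(f"true availability lies between {low:.3%} and {high:.3%}")
```

With 1% of devices cut off, the reported 99.9% could hide a real figure as low as about 98.9%.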

Slide 32

Slide 32 text

60% of the time, it works every time [Anchorman]

Slide 33

Slide 33 text

Layered SLOs

Slide 34

Slide 34 text

Client ⇆ Firewall ⇆ Load Balancer ⇆ Proxy ⇆ Handler

Slide 35

Slide 35 text

An SLO is an aggregation of all the underlying layers
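
A rough illustration of how layers aggregate: if a request must pass through every layer in sequence and the layers fail independently (both are simplifying assumptions), the end-to-end availability is the product of the per-layer values. The layer names follow the chain on the earlier slide; the 99.9% figures are made up for the example.

```python
import math

# Hypothetical per-layer availabilities for the chain
# Firewall -> Load Balancer -> Proxy -> Handler.
LAYERS = {
    "firewall": 0.999,
    "load_balancer": 0.999,
    "proxy": 0.999,
    "handler": 0.999,
}


def end_to_end_availability(layer_availability):
    """Product of per-layer availabilities (serial chain, independent failures)."""
    return math.prod(layer_availability.values())


if __name__ == "__main__":
    print(f"client-visible availability: {end_to_end_availability(LAYERS):.4%}")
```

Four layers that each meet three 9s deliver only about 99.6% to the client under these assumptions.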

Slide 36

Slide 36 text

It’s possible that the business calls this SLO Uptime, but you may need ErrorRatio!

Slide 37

Slide 37 text

1. Uptime/Downtime monitors

Slide 38

Slide 38 text

Uptime % = timeUp ÷ (timeUp + timeDown) × 100
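
For illustration, here is the time-based uptime formula next to a request-based error ratio, the alternative the earlier slide hints at. Both function names and the sample numbers are hypothetical.

```python
def uptime_pct(time_up, time_down):
    """Time-based uptime: share of the window the monitor saw the service up."""
    return 100 * time_up / (time_up + time_down)


def error_ratio(failed_requests, total_requests):
    """Request-based view: share of requests that failed."""
    return failed_requests / total_requests


if __name__ == "__main__":
    # One minute of downtime in a day looks small on the clock...
    print(f"uptime: {uptime_pct(time_up=86_340, time_down=60):.3f}%")
    # ...but if that minute fell in a traffic peak, the request view differs.
    print(f"error ratio: {error_ratio(failed_requests=5_000, total_requests=100_000):.1%}")
```

The two numbers answer different questions, which is why the same incident can look fine as Uptime and bad as ErrorRatio.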

Slide 39

Slide 39 text

Downtime is what the monitor sees, NOT what the customer sees

Slide 40

Slide 40 text

How would a downtime monitor maintain its 100% uptime?

Slide 41

Slide 41 text

Challenges

Slide 42

Slide 42 text

The uptime monitor itself offers 99.99% per month. Actual downtime could be your 4 mins AND their 4 mins.
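
The arithmetic behind this slide can be sketched as follows. The assumption is a worst case: the monitor's own allowed downtime falls entirely on moments when the service was also down, so that time goes unreported. The function names are hypothetical and the month is taken as 1/12 of a 365-day year.

```python
MINUTES_PER_MONTH = 365 / 12 * 24 * 60  # about 43,800 minutes (assumed month length)


def blind_minutes(monitor_availability_pct):
    """Minutes per month the monitor itself is allowed to be down."""
    return MINUTES_PER_MONTH * (1 - monitor_availability_pct / 100)


def worst_case_downtime(reported_minutes, monitor_availability_pct):
    """Reported downtime plus everything the monitor could have missed."""
    return reported_minutes + blind_minutes(monitor_availability_pct)


if __name__ == "__main__":
    reported = 4.38  # what the monitor says you were down this month
    total = worst_case_downtime(reported, monitor_availability_pct=99.99)
    print(f"reported {reported:.2f} min, worst case {total:.2f} min")
```

A service that appears to sit exactly on its 99.99% budget could in fact have used twice that.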

Slide 43

Slide 43 text

2. Geo-Balanced Monitors: should a failure in one geo be called a downtime?
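
The answer depends on the policy you pick, and the same probe results can yield different verdicts. The sketch below shows three possible policies; the policy names, the function, and the region names are all hypothetical.

```python
def is_downtime(region_ok, policy="majority"):
    """Decide whether a set of per-region probe results counts as downtime.

    region_ok: mapping of region name -> True if the probe succeeded
    policy: "any" (one failing region is enough),
            "majority" (more than half must fail),
            "all" (every region must fail)
    """
    failures = sum(1 for ok in region_ok.values() if not ok)
    total = len(region_ok)
    if policy == "any":
        return failures >= 1
    if policy == "majority":
        return failures > total / 2
    if policy == "all":
        return failures == total
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    probes = {"us-east": True, "eu-west": False, "ap-south": True}
    for policy in ("any", "majority", "all"):
        print(policy, is_downtime(probes, policy))
```

Customers in the failing region experience an outage under every policy; only the reported number changes.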

Slide 44

Slide 44 text

3. State-based monitors

Slide 45

Slide 45 text

Uptime is not requested every ms

Slide 46

Slide 46 text

OK, Unknown, Down

Slide 47

Slide 47 text

Pop Quiz: Measure the uptime

Slide 48

Slide 48 text

┌────────┬──────────┬──────────┬──────────┬──────────┐
│ Status │ OK       │ OK       │ Unknown  │ OK       │
├────────┼──────────┼──────────┼──────────┼──────────┤
│ Time   │ 10:00:01 │ 10:00:10 │ 10:00:11 │ 10:00:21 │
└────────┴──────────┴──────────┴──────────┴──────────┘

Slide 49

Slide 49 text

┌────────┬──────────┬──────────┬──────────┬──────────┐
│ Status │ OK       │ Unknown  │ Down     │ OK       │
├────────┼──────────┼──────────┼──────────┼──────────┤
│ Time   │ 10:00:01 │ 10:00:10 │ 10:00:20 │ 10:00:21 │
└────────┴──────────┴──────────┴──────────┴──────────┘
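
One possible way to work the quiz: assume each status holds until the next sample arrives (an assumption, since the monitor says nothing about the gaps), then decide how to treat Unknown. The function name and the three treatments ("up", "down", "exclude") are hypothetical.

```python
from datetime import datetime

FMT = "%H:%M:%S"


def uptime_pct(samples, unknown="down"):
    """Uptime % from (status, time) samples, each status held until the next sample.

    unknown: count Unknown intervals as "up", as "down",
             or "exclude" them from the window entirely.
    """
    up = counted = 0.0
    for (status, start), (_, end) in zip(samples, samples[1:]):
        seconds = (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()
        state = status.lower()
        if state == "unknown":
            if unknown == "exclude":
                continue
            state = "ok" if unknown == "up" else "down"
        counted += seconds
        if state == "ok":
            up += seconds
    return 100 * up / counted if counted else float("nan")


# Samples transcribed from the two quiz slides.
QUIZ_1 = [("OK", "10:00:01"), ("OK", "10:00:10"), ("Unknown", "10:00:11"), ("OK", "10:00:21")]
QUIZ_2 = [("OK", "10:00:01"), ("Unknown", "10:00:10"), ("Down", "10:00:20"), ("OK", "10:00:21")]

if __name__ == "__main__":
    for name, quiz in (("quiz 1", QUIZ_1), ("quiz 2", QUIZ_2)):
        for mode in ("up", "down", "exclude"):
            print(name, mode, f"{uptime_pct(quiz, mode):.1f}%")
```

Under these assumptions the second timeline reads as 95%, 45%, or 90% uptime depending only on how Unknown is counted.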

Slide 50

Slide 50 text

The monitor’s sleep period may overlap with actual downtime
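
A small sketch of that overlap, assuming a monitor that probes at fixed intervals starting at t = 0 and sleeps in between. The function name and all numbers are hypothetical.

```python
def polls_seeing_outage(poll_interval, outage_start, outage_duration, horizon):
    """Count probes that land inside the outage window [start, start + duration).

    All arguments are whole seconds; probes fire at 0, poll_interval, 2 * poll_interval, ...
    up to and including `horizon`.
    """
    outage_end = outage_start + outage_duration
    return sum(
        1 for t in range(0, horizon + 1, poll_interval) if outage_start <= t < outage_end
    )


if __name__ == "__main__":
    # A 50-second outage that fits between two probes is never seen.
    print(polls_seeing_outage(poll_interval=60, outage_start=5, outage_duration=50, horizon=300))
    # A 10-second outage that straddles a probe is seen once.
    print(polls_seeing_outage(poll_interval=60, outage_start=55, outage_duration=10, horizon=300))
```

The longer outage goes unrecorded while the shorter one is caught, so what the monitor reports depends on timing as much as on duration.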

Slide 51

Slide 51 text

Conclusion

Slide 52

Slide 52 text

There is no one true SLO. Uptime of one layer could be an error of another.

Slide 53

Slide 53 text

Uptime is mostly and massively aggregated

Slide 54

Slide 54 text

As the 9s increase, SLOs become more a confirmation of proactivity than a measure of reactivity

Slide 55

Slide 55 text

Thank you
last9.io/failures
Piyush Verma