Slide 1

Slide 1 text

SLO Review Takeshi Kondo / @chaspy 2020/01/25 SRE NEXT 2020 #srenext #srenextC

Slide 2

Slide 2 text

Service Level Objectives

Slide 3

Slide 3 text

Questions
• ✋ Do you know what an SLO is?
• ✋ Have you defined SLOs for your service?
• ✋ Do you have an Error Budget Policy for your service?

Slide 4

Slide 4 text

Target
• People who want to learn about SLI/SLO
• People who want to know how to use SLI/SLO
• People who want to maintain both the reliability and the agility of product development

Slide 5

Slide 5 text

Site Reliability Engineering: Measuring and Managing Reliability https://www.coursera.org/learn/site-reliability-engineering-slos

Slide 6

Slide 6 text

tl;dr
• It is worth defining and reviewing SLIs / SLOs
• But SLIs / SLOs are not perfect from the beginning
• Reduce cognitive load and introduce them to the team gradually

Slide 7

Slide 7 text

Agenda
• Learn SLO
  • What / Why / Where
• Case Study in Quipper
• Takeaways
  • Provide Recommended SLIs
  • Make the configuration as code
  • Have a steep learning curve

Slide 8

Slide 8 text

Agenda
• Learn SLO
  • What / Why / Where
• Case Study in Quipper
• Takeaways
  • Provide Recommended SLIs
  • Make the configuration as code
  • Have a steep learning curve

Slide 9

Slide 9 text

What
• SLI / Service Level Indicators
  • A quantifiable measure of service reliability
  • e.g. HTTP success rate, response time
• SLO / Service Level Objectives
  • Set a reliability target for an SLI
  • 99%, 99.9%, 99.99%…
• Error Budget
  • An SLO implies an acceptable level of unreliability
  • This is a budget that can be allocated
Source: The Art of SLOs – Slides / https://docs.google.com/presentation/d/1qcQ6alG_qUg3qWf733ZsDnTggwzqe4PZICrFXZ1zQZs/edit#slide=id.g75945b48fe_0_0

Slide 10

Slide 10 text

SLI should be related to user happiness
SLI (%) = 100 × Good Events / Valid Events

Slide 11

Slide 11 text

SLI should be related to user happiness
SLI (%) = 100 × (HTTP 2xx status count) / (HTTP 2xx status count + HTTP 5xx status count)

Slide 12

Slide 12 text

SLO is a reliability target for an SLI
SLO: 99.9%
SLI (%) = 100 × (HTTP 2xx status count) / (HTTP 2xx status count + HTTP 5xx status count)

Slide 13

Slide 13 text

SLO is a reliability target for an SLI
SLO: 99.9% Present: 99.95%
SLI (%) = 100 × 10000 (2xx count) / (10000 (2xx count) + 5 (5xx count))

Slide 14

Slide 14 text

We can accept Errors as Error Budget
SLO: 99.9% Present: 99.95%
SLI (%) = 100 × 10000 (2xx count) / (10000 (2xx count) + 5 (5xx count))
Error Budget: we can accept 5 more 5xx errors

Slide 15

Slide 15 text

We can accept Errors as Error Budget
SLO: 99.9% Present: 99.95%
SLI (%) = 100 × 10000 (2xx count) / (10000 (2xx count) + 5 (5xx count))
Error Budget: we can accept 5 more 5xx errors
Event-based SLO
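The arithmetic on this slide can be checked with a short sketch. The function names below are illustrative and not from the talk, and the remaining budget is computed under the assumption that the 2xx count stays fixed while more 5xx errors arrive.

```python
from fractions import Fraction


def event_sli(good: int, bad: int) -> float:
    """Return the SLI (%) as good events divided by valid events."""
    valid = good + bad
    if valid == 0:
        raise ValueError("no valid events")
    return 100.0 * good / valid


def remaining_bad_events(good: int, bad: int, slo_percent) -> int:
    """Return how many more bad events fit before the SLI drops below the SLO.

    Assumption: the good-event count stays fixed.
    good / (good + max_bad) >= slo  implies  max_bad <= good * (1 - slo) / slo
    """
    # Fraction keeps the threshold exact (no floating-point rounding).
    slo = Fraction(str(slo_percent)) / 100
    max_bad = (good * (1 - slo)) // slo
    return max(max_bad - bad, 0)


if __name__ == "__main__":
    print(f"SLI: {event_sli(10000, 5):.2f}%")
    print("More 5xx errors we can accept:", remaining_bad_events(10000, 5, "99.9"))
```

With the slide's numbers (10000 successes, 5 errors, SLO 99.9%), the SLI rounds to 99.95% and 5 more errors fit in the budget.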

Slide 16

Slide 16 text

We can accept Errors as Error Budget
SLO: 99.9% Present: 99.95%
SLI (%) = 100 × (time in which the 95th-percentile response time over the last 1 minute is < 100 msec) / (all time in the window)

Slide 17

Slide 17 text

We can accept Errors as Error Budget
SLO: 99.9% Present: 99.95%
SLI (%) = 100 × (time in which the 95th-percentile response time over the last 1 minute is < 100 msec) / (all time in the window)
Time window: 7 days
Error Budget is only 10 minutes in 7 days

Slide 18

Slide 18 text

We can accept Errors as Error Budget
SLO: 99.9% Present: 99.95%
SLI (%) = 100 × (time in which the 95th-percentile response time over the last 1 minute is < 100 msec) / (all time in the window)
Time window: 7 days
Error Budget is only 10 minutes in 7 days
Monitor-based SLO
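The "10 minutes in 7 days" figure follows directly from the SLO and the window length. A minimal sketch, with illustrative function names and an assumed per-minute evaluation of the monitor:

```python
from datetime import timedelta


def error_budget(window: timedelta, slo_percent: float) -> timedelta:
    """Return the amount of time the monitor may be in violation within the window."""
    return window * (1.0 - slo_percent / 100.0)


def monitor_sli(ok_minutes: int, window_minutes: int) -> float:
    """Return the SLI (%) as the share of minutes in which the monitor was OK."""
    if window_minutes <= 0:
        raise ValueError("window must be positive")
    return 100.0 * ok_minutes / window_minutes


if __name__ == "__main__":
    window = timedelta(days=7)
    budget = error_budget(window, 99.9)
    print(f"Error budget: {budget.total_seconds() / 60:.2f} minutes")
    # Hypothetical week in which the monitor was violated for 5 minutes.
    total = int(window.total_seconds() // 60)
    print(f"SLI: {monitor_sli(total - 5, total):.2f}%")
```

Seven days is 10080 minutes, so a 99.9% target leaves about 10.08 minutes of budget. Five violated minutes in that window give an SLI of about 99.95%.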

Slide 19

Slide 19 text

Why
• Fact-based decision making
• Team can develop with a balance between reliability and agility
• Especially important in the microservices architecture

Slide 20

Slide 20 text

Team can develop with a balance between reliability and agility Reliability Agility Ops Keep the reliability Dev Let’s release new feature! SLO

Slide 21

Slide 21 text

Especially important in the microservices architecture
• ServiceA: Success Rate 99.9%
• ServiceB: Success Rate 99%
• ServiceC: Success Rate 99%
Reliability depends on other services
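To see why dependencies matter, here is a rough sketch. It assumes a request must succeed in all three services and that their failures are independent. The slide does not state the exact call topology, so this is only an illustration.

```python
from math import prod


def serial_success_rate(rates_percent):
    """Return the combined success rate (%) when every service must succeed.

    Assumption: failures are independent, so the success rates multiply.
    """
    return 100.0 * prod(rate / 100.0 for rate in rates_percent)


if __name__ == "__main__":
    combined = serial_success_rate([99.9, 99.0, 99.0])
    print(f"Combined success rate: {combined:.2f}%")
```

Under these assumptions the combined success rate is about 97.91%, lower than that of any single service.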

Slide 22

Slide 22 text

Where Synthetics Client Frontend CDN LoadBalancer Application DataStore Many options, Trade-off

Slide 23

Slide 23 text

Where Synthetics Client Frontend CDN LoadBalancer Application DataStore Many options, Trade-off Some requests might not reach to the apps Need more engineering effort to generate E2E tests

Slide 24

Slide 24 text

In Quipper Synthetics Client Frontend CDN LoadBalancer Application DataStore Send everything to Datadog

Slide 25

Slide 25 text

Agenda
• Learn SLO
  • What / Why / Where
• Case Study in Quipper
• Takeaways
  • Provide Recommended SLIs
  • Make the configuration as code
  • Have a steep learning curve

Slide 26

Slide 26 text

Self-Contained “Encourage development teams to be self-contained so that each team can make products more comprehensively, proactively, and efficiently.”

Slide 27

Slide 27 text

SRE Mission for 2020 / Self-Contained
• Product Teams can develop by themselves
  • No need to ask SREs
• We SREs provide the process
  • Design Doc
  • Production Readiness Check
  • Delegate Infrastructure Management (Terraform)
  • SLI/SLO

Slide 28

Slide 28 text

Timeline 2019 2020 Migrated to Kubernetes Define the Ownership Production Readiness Checklist SLO review by myself Set Error Budget Policy Jun. Mar. Mar. Sep. SRE NEXT SLO review with Devs

Slide 29

Slide 29 text

Timeline 2019 2020 Migrated to Kubernetes Define the Ownership Production Readiness Checklist SLO review by myself SLO review with Devs Jun. Mar. Mar. Sep. SRE NEXT Set Error Budget Policy

Slide 30

Slide 30 text

Timeline 2019 2020 Migrated to Kubernetes Define the Ownership Production Readiness Checklist SLO review by myself SLO review with Devs Jun. Mar. Mar. Sep. SRE NEXT Why do we need such steps? Set Error Budget Policy

Slide 31

Slide 31 text

Why do we need such steps?
• Are the SLIs/SLOs we defined appropriate?
  • If not, the Error Budget Policy won’t work well
• Can the product team start the process by itself?
  • If not, we need some scaffolding, preparation, and training

Slide 32

Slide 32 text

Case Study in Quipper • Define the Ownership • SLO review by myself • SLO review with Devs • Set Error Budget Policy

Slide 33

Slide 33 text

Case Study in Quipper • Define the Ownership • SLO review by myself • SLO review with Devs • Set Error Budget Policy

Slide 34

Slide 34 text

Know your systems and organizations
• 2 Products
• 4 Branches
• 97 Kubernetes Deployments
• 84 Developers (including 6 SREs)
• 48 subdomains
Where is the Ownership?

Slide 35

Slide 35 text

Define the Owner

Slide 36

Slide 36 text

Define the Owner
Services / Teams
• Japan: 7
• Global: 8
• Philippines: 3
• Indonesia: 4
• Shared: 1

Slide 37

Slide 37 text

Define Service Owner In Design Doc for new service

Slide 38

Slide 38 text

Case Study in Quipper • Define the Ownership • SLO review by myself • SLO review with Devs • Set Error Budget Policy

Slide 39

Slide 39 text

SLO review by myself
• Establish SLO Review process
  • How to set an SLO?
  • How to monitor an SLO?
  • What action to take on an SLO violation?
  • How to investigate?
• Improve SLI / SLO accuracy
  • How to decide when to revise?

Slide 40

Slide 40 text

How to set and monitor SLO?

Slide 41

Slide 41 text

How to set and monitor SLO?
• Unfortunately, there is no alerting or recording system
• Use a Slack reminder and record the results in a GitHub Issue

Slide 42

Slide 42 text

How to set and monitor SLO?

Slide 43

Slide 43 text

Availability Table https://landing.google.com/sre/sre-book/chapters/availability-table/ Too many errors Target too high Start with this!

Slide 44

Slide 44 text

Realized that “SLO Review” is a good habit
• Good habit?
  • Like Pair-Programming or Unit Tests
• Why?
  • Motivates you to get metrics
  • No burnout, a feeling of relief
  • Makes you aware of the factors that hinder reliability
    • Platform Outage
    • Push notification
    • Resource Capacity
    • Rolling Update

Slide 45

Slide 45 text

Case Study in Quipper • Define the Ownership • SLO review by myself • SLO review with Devs • Set Error Budget Policy

Slide 46

Slide 46 text

Many Problems… • Noisy metrics by dos detector • Developing SLIs • Send http path tag for shared service • No available metrics for microservices SLIs

Slide 47

Slide 47 text

Dos Detector: Rate limiting by Reverse Proxy

Slide 48

Slide 48 text

Dos Detector: Rate limiting by Reverse Proxy
If a large number of requests come from the same client in a short time, the proxy returns 503

Slide 49

Slide 49 text

SLI should be related to user happiness
SLI (%) = 100 × (HTTP 2xx status count) / (HTTP 2xx status count + HTTP 5xx status count)

Slide 50

Slide 50 text

Noisy metrics by dos detector

Slide 51

Slide 51 text

Send http path tag for shared service
• Coaching Team uses example.quipper.com/coaching
• School Team uses example.quipper.com/school

Slide 52

Slide 52 text

Send http path tag for shared service

Slide 53

Slide 53 text

Send http path tag for shared service
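One way to picture the path-tag idea: derive a tag from the first path segment of each request, then compute the success-rate SLI per tag so each team sees only its own traffic. This is a hypothetical sketch, not Quipper's implementation. The function names and the choice of the first path segment as the tag are assumptions.

```python
from collections import defaultdict


def path_tag(path: str) -> str:
    """Return the first path segment, e.g. '/coaching/lessons/1' -> '/coaching'."""
    first = path.split("?", 1)[0].strip("/").split("/", 1)[0]
    return "/" + first


def sli_by_path_tag(requests):
    """Return the success-rate SLI (%) per path tag.

    `requests` is an iterable of (path, status) pairs. Only 2xx and 5xx
    responses count as valid events, matching the SLI formula in the slides.
    """
    good = defaultdict(int)
    valid = defaultdict(int)
    for path, status in requests:
        is_2xx = 200 <= status < 300
        is_5xx = 500 <= status < 600
        if not (is_2xx or is_5xx):
            continue  # e.g. 3xx/4xx are not valid events for this SLI
        tag = path_tag(path)
        valid[tag] += 1
        if is_2xx:
            good[tag] += 1
    return {tag: 100.0 * good[tag] / valid[tag] for tag in valid}


if __name__ == "__main__":
    sample = [
        ("/coaching/lessons/1", 200),
        ("/coaching/lessons/2", 500),
        ("/school/classes", 200),
        ("/school/missing", 404),
    ]
    print(sli_by_path_tag(sample))
```

In a real setup the tag would normally be attached where the metric is emitted (for example at the reverse proxy) rather than computed afterwards from logs.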

Slide 54

Slide 54 text

No available metrics for microservices SLIs

Slide 55

Slide 55 text

No available metrics for microservices SLIs ServiceA ServiceB ServiceC GET http://serviceb GET http://servicec

Slide 56

Slide 56 text

No available metrics for microservices SLIs ServiceA ServiceB ServiceC GET http://serviceb GET http://servicec Side-car container

Slide 57

Slide 57 text

Case Study in Quipper • Define the Ownership • SLO review by myself • SLO review with Devs • Set Error Budget Policy • To be continued…

Slide 58

Slide 58 text

Agenda
• Learn SLO
  • What / Why / Where
• Case Study in Quipper
• Takeaways
  • Provide Recommended SLIs
  • Make the configuration as code
  • Have a steep learning curve

Slide 59

Slide 59 text

Provide Standardized / Recommended SLIs
• Ideally, each Product Team sets its own SLIs, but…
• Start with the defaults first

Slide 60

Slide 60 text

SLI menu
• Availability
  • HTTP success rate
• Latency
  • upstream response time < x msec

Slide 61

Slide 61 text

Make the configuration as code

Slide 62

Slide 62 text

Make the configuration as code
Developers can easily change it via pull request

Slide 63

Slide 63 text

Have a steep learning curve

Slide 64

Slide 64 text

Good Documentation

Slide 65

Slide 65 text

Work together

Slide 66

Slide 66 text

Agenda
• Learn SLO
  • What / Why / Where
• Case Study in Quipper
• Takeaways
  • Provide Recommended SLIs
  • Make the configuration as code
  • Have a steep learning curve

Slide 67

Slide 67 text

Summary
• It is worth defining and reviewing SLIs / SLOs
• But SLIs / SLOs are not perfect from the beginning
• Reduce cognitive load and introduce them to the team gradually

Slide 68

Slide 68 text

Thank You! chaspy chaspy_ Site Reliability Engineer at Quipper Takeshi Kondo SRE Lounge Terraform-jp