Questions
• ✋ Do you know what SLO means?
• ✋ Have you defined SLOs for your service?
• ✋ Do you have an Error Budget Policy for your service?
Slide 4
Slide 4 text
Target
• People who want to know SLI/SLO
• People who want to know how to use SLI/SLO
• People who want to keep both the reliability and the agility of product development
Slide 5
Slide 5 text
Site Reliability Engineering: Measuring and Managing Reliability
https://www.coursera.org/learn/site-reliability-engineering-slos
Slide 6
Slide 6 text
tl;dr
• It is worth defining and reviewing SLIs / SLOs
• But SLIs / SLOs will not be perfect from the beginning
• Reduce cognitive load and introduce them to the team gradually
Slide 7
Slide 7 text
Agenda
• Learn SLO
• What / Why / Where
• Case Study in Quipper
• Takeaways
• Provide Recommended SLIs
• Make the configuration as code
• Have a steep learning curve
Slide 9
Slide 9 text
What
• SLI / Service Level Indicators
• A quantifiable measure of service reliability
• e.g. HTTP success rate, response time
• SLO / Service Level Objectives
• Set a reliability target for an SLI
• 99%, 99.9%, 99.99%…
• Error Budget
• An SLO implies an acceptable level of unreliability
• This is a budget that can be allocated
The Art of SLOs – Slides / https://docs.google.com/presentation/d/1qcQ6alG_qUg3qWf733ZsDnTggwzqe4PZICrFXZ1zQZs/edit#slide=id.g75945b48fe_0_0
Slide 10
Slide 10 text
SLI should be related to user happiness
SLI (%) = Good Events / Valid Events × 100
Slide 11
Slide 11 text
SLI should be related to user happiness
SLI (%) = HTTP 2xx count / (HTTP 2xx count + HTTP 5xx count) × 100
Slide 12
Slide 12 text
SLO is a reliability target for an SLI
SLI (%) = HTTP 2xx count / (HTTP 2xx count + HTTP 5xx count) × 100
SLO: 99.9%
Slide 13
Slide 13 text
SLO is a reliability target for an SLI
SLI (%) = 10000 (2xx count) / (10000 (2xx count) + 5 (5xx count)) × 100
SLO: 99.9%
Present: 99.95%
Slide 14
Slide 14 text
We can accept Errors as Error Budget
SLI (%) = 10000 (2xx count) / (10000 (2xx count) + 5 (5xx count)) × 100
SLO: 99.9%
Present: 99.95%
Error Budget: 99.9% of 10005 valid events allows 10 errors, so we can accept 5 more 5xx errors
Slide 15
Slide 15 text
We can accept Errors as Error Budget
SLI (%) = 10000 (2xx count) / (10000 (2xx count) + 5 (5xx count)) × 100
SLO: 99.9%
Present: 99.95%
Error Budget: 99.9% of 10005 valid events allows 10 errors, so we can accept 5 more 5xx errors
Event-based SLO
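The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the function name `event_error_budget` and the returned keys are made up for illustration, and it assumes that only 2xx and 5xx responses count as valid events, as in the formula above.

```python
import math


def event_error_budget(good: int, bad: int, slo: float) -> dict:
    """Summarise an event-based SLO from good/bad event counts."""
    valid = good + bad
    sli = good / valid
    # Number of bad events the SLO tolerates for this many valid events.
    # round() removes floating-point noise before flooring.
    allowed_bad = math.floor(round(valid * (1 - slo), 9))
    return {
        "sli_percent": round(sli * 100, 2),
        "allowed_bad": allowed_bad,
        "remaining": allowed_bad - bad,
    }


# Numbers from the slide: 10000 2xx responses, 5 5xx responses, SLO 99.9%.
print(event_error_budget(good=10000, bad=5, slo=0.999))
```

With the slide's numbers, 10 errors are allowed and 5 are already spent, which leaves room for 5 more.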
Slide 16
Slide 16 text
We can accept Errors as Error Budget
SLI (%) = (time in which the 95th-percentile response time over the last 1 minute was < 100 ms) / (whole time window) × 100
SLO: 99.9%
Present: 99.95%
Slide 17
Slide 17 text
We can accept Errors as Error Budget
SLI (%) = (time in which the 95th-percentile response time over the last 1 minute was < 100 ms) / (whole time window) × 100
SLO: 99.9%
Present: 99.95%
Time window: 7 days
Error Budget is only about 10 minutes in 7 days
Slide 18
Slide 18 text
We can accept Errors as Error Budget
SLI (%) = (time in which the 95th-percentile response time over the last 1 minute was < 100 ms) / (whole time window) × 100
SLO: 99.9%
Present: 99.95%
Time window: 7 days
Error Budget is only about 10 minutes in 7 days
Monitor-based SLO
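A rough sketch of the time-based calculation, assuming we already have one 95th-percentile latency sample per minute. The function name `monitor_based_slo` and the input shape are assumptions for illustration, not how any monitoring product computes it.

```python
def monitor_based_slo(p95_ms_per_minute, threshold_ms=100.0, slo=0.999):
    """Treat each minute as good when its p95 latency is below the threshold."""
    total = len(p95_ms_per_minute)
    good = sum(1 for value in p95_ms_per_minute if value < threshold_ms)
    bad = total - good
    budget_minutes = total * (1 - slo)  # minutes the SLO lets us be "bad"
    return {
        "sli": good / total,
        "budget_minutes": budget_minutes,
        "remaining_minutes": budget_minutes - bad,
    }


# 7 days = 10080 minutes; pretend 4 of them breached the 100 ms threshold.
window = [50.0] * 10076 + [250.0] * 4
result = monitor_based_slo(window)
print(result)
```

For a 7-day window the budget is 10080 × 0.001 ≈ 10 minutes, which matches the slide. A single slow quarter of an hour would use it all.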
Slide 19
Slide 19 text
Why
• Fact-based decision making
• Team can develop with a balance between reliability and agility
• Especially important in a microservices architecture
Slide 20
Slide 20 text
Team can develop with a balance between reliability and agility
Reliability Agility
Ops
Keep the reliability
Dev
Let’s release new features!
SLO
Slide 21
Slide 21 text
Especially important in a microservices architecture
ServiceA
ServiceB
ServiceC
Success Rate 99.9%
Success Rate 99%
Success Rate 99%
Reliability depends on
other services
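To see why dependencies matter, here is a small sketch that multiplies the three success rates on this slide. It assumes every request has to pass through all three services and that they fail independently; real call graphs are rarely that simple. `serial_success_rate` is a hypothetical helper name.

```python
import math


def serial_success_rate(rates):
    """Combined success rate when a request must succeed in every service."""
    return math.prod(rates)


# Success rates shown on the slide.
combined = serial_success_rate([0.999, 0.99, 0.99])
print(f"{combined:.2%}")
```

Under these assumptions the combined success rate is about 97.9%, lower than any single service's rate.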
Slide 22
Slide 22 text
Where
Synthetics Client
Frontend
CDN LoadBalancer Application DataStore
Many options, Trade-off
Slide 23
Slide 23 text
Where
Synthetics Client
Frontend
CDN LoadBalancer Application DataStore
Many options, Trade-off
Some requests might not reach the apps
Need more engineering effort to create E2E tests
Slide 24
Slide 24 text
In Quipper
Synthetics Client
Frontend
CDN LoadBalancer Application DataStore
Send everything to Datadog
Slide 25
Slide 25 text
Agenda
• Learn SLO
• What / Why / Where
• Case Study in Quipper
• Takeaways
• Provide Recommended SLIs
• Make the configuration as code
• Have a steep learning curve
Slide 26
Slide 26 text
Self-Contained
“Encourage development teams to be self-contained so that each team can make products
more comprehensively, proactively, and efficiently.”
Slide 27
Slide 27 text
SRE Mission for 2020 / Self-Contained
• Product teams can develop by themselves
• No need to ask SREs
• We SREs provide the process
• Design Doc
• Production Readiness Check
• Delegate Infrastructure Management (Terraform)
• SLI/SLO
Slide 28
Slide 28 text
Timeline
2019 2020
Migrated to Kubernetes
Define the Ownership
Production Readiness Checklist
SLO review by myself
Set Error Budget Policy
Jun.
Mar. Mar.
Sep.
SRE NEXT
SLO review with Devs
Slide 29
Slide 29 text
Timeline
2019 2020
Migrated to Kubernetes
Define the Ownership
Production Readiness Checklist
SLO review by myself
SLO review with Devs
Jun.
Mar. Mar.
Sep.
SRE NEXT
Set Error Budget Policy
Slide 30
Slide 30 text
Timeline
2019 2020
Migrated to Kubernetes
Define the Ownership
Production Readiness Checklist
SLO review by myself
SLO review with Devs
Jun.
Mar. Mar.
Sep.
SRE NEXT
Why do we need such steps?
Set Error Budget Policy
Slide 31
Slide 31 text
Why do we need such steps?
• Are the SLIs/SLOs we defined appropriate?
• If not, the Error Budget Policy won’t work well
• Can the product team start the process by itself?
• If not, we need some scaffolding, preparation, and training
Slide 32
Slide 32 text
Case Study in Quipper
• Define the Ownership
• SLO review by myself
• SLO review with Devs
• Set Error Budget Policy
Slide 34
Slide 34 text
Know your systems and organizations
• 2 Products
• 4 Branches
• 97 Kubernetes Deployments
• 84 Developers (including 6 SREs)
• 48 subdomains
Where is the Ownership?
Slide 35
Slide 35 text
Define the Owner
Slide 36
Slide 36 text
Define the Owner
Services / Teams
Japan 7 Global 8
Philippines 3
Indonesia 4
Shared 1
Slide 37
Slide 37 text
Define the service owner in the Design Doc for each new service
Slide 38
Slide 38 text
Case Study in Quipper
• Define the Ownership
• SLO review by myself
• SLO review with Devs
• Set Error Budget Policy
Slide 39
Slide 39 text
SLO review by myself
• Establish the SLO review process
• How to set SLOs?
• How to monitor SLOs?
• What action to take on an SLO violation?
• How to investigate?
• Improve SLI / SLO accuracy
• How to decide on revisions?
Slide 40
Slide 40 text
How to set and monitor SLO?
Slide 41
Slide 41 text
How to set and monitor SLO?
• Unfortunately, there is no alerting or recording system
• Use a Slack reminder and record results in a GitHub issue
Slide 42
Slide 42 text
How to set and monitor SLO?
Slide 43
Slide 43 text
Availability Table
https://landing.google.com/sre/sre-book/chapters/availability-table/
Too many errors
Target too high
Start with this!
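The availability table boils down to one multiplication: allowed downtime = window × (1 − SLO). Here is a minimal sketch using a 30-day window; `allowed_downtime` is an illustrative name, and the three targets are the ones used as examples earlier in this deck, not the rows annotated on this slide.

```python
from datetime import timedelta


def allowed_downtime(slo: float, window: timedelta) -> timedelta:
    """Downtime an availability target tolerates over the given window."""
    return window * (1 - slo)


for slo in (0.99, 0.999, 0.9999):
    minutes = allowed_downtime(slo, timedelta(days=30)).total_seconds() / 60
    print(f"{slo:.2%}: {minutes:.1f} minutes per 30 days")
```

Each extra nine cuts the allowed downtime by a factor of ten, which is why a target that is too high gets expensive quickly.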
Slide 44
Slide 44 text
Realized that “SLO Review” is a good habit
• Good habit?
• Like pair programming or unit testing
• Why?
• Motivates us to collect metrics
• No burnout; a sense of relief
• Makes us aware of the factors that hinder reliability
• Platform Outage
• Push notification
• Resource Capacity
• Rolling Update
Slide 45
Slide 45 text
Case Study in Quipper
• Define the Ownership
• SLO review by myself
• SLO review with Devs
• Set Error Budget Policy
Slide 46
Slide 46 text
Many Problems…
• Noisy metrics caused by the DoS detector
• Developing SLIs
• Send an HTTP path tag for shared services
• No available metrics for microservices SLIs
Slide 47
Slide 47 text
DoS Detector: Rate limiting by the reverse proxy
Slide 48
Slide 48 text
DoS Detector: Rate limiting by the reverse proxy
If a large number of requests are made from the same client in a short time, it returns 503
Slide 49
Slide 49 text
SLI should be related to user happiness
SLI (%) = HTTP 2xx count / (HTTP 2xx count + HTTP 5xx count) × 100
Slide 50
Slide 50 text
Noisy metrics caused by the DoS detector
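Because the rate limiter answers with 503, its responses land in the 5xx count and drag the success-rate SLI down even though the application is healthy. One possible way to remove that noise is sketched below. It assumes each request record carries a flag saying the response came from the rate limiter; the `rate_limited` field and both names are hypothetical and not taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int
    rate_limited: bool = False  # hypothetical tag set for DoS-detector 503s


def success_rate(requests, exclude_rate_limited: bool) -> float:
    """2xx / (2xx + 5xx), optionally ignoring rate-limited responses."""
    good = bad = 0
    for request in requests:
        if exclude_rate_limited and request.rate_limited:
            continue  # not a valid event for the application's SLI
        if 200 <= request.status < 300:
            good += 1
        elif 500 <= request.status < 600:
            bad += 1
    valid = good + bad
    return good / valid if valid else 1.0


traffic = (
    [Request(200)] * 990
    + [Request(500)] * 2
    + [Request(503, rate_limited=True)] * 8
)
print(success_rate(traffic, exclude_rate_limited=False))
print(success_rate(traffic, exclude_rate_limited=True))
```

In this made-up traffic the raw SLI is 99.0%, while the filtered SLI reflects only the two real application errors.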
Slide 51
Slide 51 text
Send HTTP path tag for shared service
Coaching Team uses example.quipper.com/coaching
School Team uses example.quipper.com/school
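When one service is shared by several teams, tagging each request with its path makes it possible to compute an SLI per team. The sketch below assumes requests are available as (path, status) pairs. The `PATH_OWNERS` mapping and the function names are illustrative; only the two path prefixes come from the slide.

```python
from collections import defaultdict

# Hypothetical mapping from URL path prefix to the owning team.
PATH_OWNERS = {"/coaching": "coaching", "/school": "school"}


def owner_of(path: str) -> str:
    """Return the team that owns a request path, or "unknown"."""
    for prefix, team in PATH_OWNERS.items():
        if path == prefix or path.startswith(prefix + "/"):
            return team
    return "unknown"


def success_rate_by_team(requests):
    """requests: iterable of (path, status). Only 2xx and 5xx are valid events."""
    counts = defaultdict(lambda: [0, 0])  # team -> [good, valid]
    for path, status in requests:
        if 200 <= status < 300 or 500 <= status < 600:
            team = owner_of(path)
            counts[team][1] += 1
            if status < 300:
                counts[team][0] += 1
    return {team: good / valid for team, (good, valid) in counts.items()}


sample = [
    ("/coaching/a", 200),
    ("/coaching/b", 500),
    ("/school/x", 200),
    ("/school", 200),
]
print(success_rate_by_team(sample))
```

Each team then sees only the errors on its own paths instead of one blended number for the whole shared service.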
Slide 52
Slide 52 text
Send http path tag for shared service
Slide 53
Slide 53 text
Send http path tag for shared service
Slide 54
Slide 54 text
No available metrics for microservices SLIs
Slide 55
Slide 55 text
No available metrics for microservices SLIs
ServiceA
ServiceB
ServiceC
GET http://serviceb
GET http://servicec
Slide 56
Slide 56 text
No available metrics for microservices SLIs
ServiceA
ServiceB
ServiceC
GET http://serviceb
GET http://servicec
Side-car container
Slide 57
Slide 57 text
Case Study in Quipper
• Define the Ownership
• SLO review by myself
• SLO review with Devs
• Set Error Budget Policy
• To be continued…
Slide 58
Slide 58 text
Agenda
• Learn SLO
• What / Why / Where
• Case Study in Quipper
• Takeaways
• Provide Recommended SLIs
• Make the configuration as code
• Have a steep learning curve
Slide 59
Slide 59 text
Provide Standardized / Recommended SLIs
• Ideally, the Product Team should set its own SLIs, but…
• Start with the defaults first
Slide 60
Slide 60 text
SLI menu
• Availability
• HTTP success rate
• Latency
• upstream response time < x ms
Slide 61
Slide 61 text
Make the configuration as code
Slide 62
Slide 62 text
Make the configuration as code
Developers can easily change it via pull request
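As a rough illustration of what "configuration as code" can look like, here is a sketch that keeps SLO definitions as plain data in the repository and validates them, for example in CI on every pull request. It is not the tooling shown in the talk; `SLODefinition`, its fields, and the example values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLODefinition:
    service: str
    owner: str        # team that owns the service
    sli: str          # e.g. "http_success_rate" from a recommended-SLI menu
    target: float     # e.g. 0.999 for 99.9%
    window_days: int

    def validate(self) -> None:
        """Reject definitions a reviewer should never have to catch by eye."""
        if not self.owner:
            raise ValueError(f"{self.service}: owner is required")
        if not 0 < self.target < 1:
            raise ValueError(f"{self.service}: target must be between 0 and 1")
        if self.window_days <= 0:
            raise ValueError(f"{self.service}: window_days must be positive")


# Changing a target is a one-line diff that goes through normal code review.
SLOS = [
    SLODefinition("example-service", "example-team", "http_success_rate", 0.999, 7),
]

for slo in SLOS:
    slo.validate()
```

Keeping the definitions in version control means every change to a target or window has an author, a review, and a history.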
Slide 63
Slide 63 text
Have a steep learning curve
Slide 64
Slide 64 text
Good Documentation
Slide 65
Slide 65 text
Work together
Slide 66
Slide 66 text
Agenda
• Learn SLO
• What / Why / Where
• Case Study in Quipper
• Takeaways
• Provide Recommended SLIs
• Make the configuration as code
• Have a steep learning curve
Slide 67
Slide 67 text
Summary
• It is worth defining and reviewing SLIs / SLOs
• But SLIs / SLOs will not be perfect from the beginning
• Reduce cognitive load and introduce them to the team gradually
Slide 68
Slide 68 text
Thank You!
chaspy
chaspy_
Site Reliability Engineer
at Quipper
Takeshi Kondo
SRE Lounge Terraform-jp