SLO Review

Takeshi Kondo / @chaspy
2020/01/25, SRE NEXT 2020
#srenext #srenextC

Service Level Objectives

Questions
• ✋ Do you know the meaning of SLO?
• ✋ Have you defined SLOs for your service?
• ✋ Do you have an Error Budget Policy for your service?

Target audience
• People who want to learn about SLIs/SLOs
• People who want to know how to use SLIs/SLOs
• People who want to keep both the reliability and the agility of product development

Site Reliability Engineering: Measuring and Managing Reliability https://www.coursera.org/learn/site-reliability-engineering-slos

tl;dr
• It is worth defining and reviewing SLIs / SLOs
• But SLIs / SLOs are not perfect from the beginning
• Reduce cognitive load and introduce them to the team gradually

Agenda
• Learn SLO
  • What / Why / Where
• Case Study in Quipper
• Takeaways
  • Provide Recommended SLIs
  • Make the configuration as code
  • Have a steep learning curve

What
• SLI / Service Level Indicator
  • A quantifiable measure of service reliability
  • e.g. HTTP success rate, response time
• SLO / Service Level Objective
  • A reliability target for an SLI
  • 99%, 99.9%, 99.99%…
• Error Budget
  • An SLO implies an acceptable level of unreliability
  • This is a budget that can be allocated
Source: The Art of SLOs – Slides / https://docs.google.com/presentation/d/1qcQ6alG_qUg3qWf733ZsDnTggwzqe4PZICrFXZ1zQZs/edit#slide=id.g75945b48fe_0_0

SLI should be related to user happiness
SLI (%) = Good Events / Valid Events × 100

SLI should be related to user happiness
SLI (%) = HTTP 2xx status count / (HTTP 2xx status count + 5xx status count) × 100

SLO is a reliability target for an SLI
SLO: 99.9%
SLI (%) = HTTP 2xx status count / (HTTP 2xx status count + 5xx status count) × 100

SLO is a reliability target for an SLI
SLO: 99.9%
Present: 99.95%
SLI (%) = 10000 (2xx count) / (10000 (2xx count) + 5 (5xx count)) × 100

We can accept errors as Error Budget
SLO: 99.9%
Present: 99.95%
SLI (%) = 10000 (2xx count) / (10000 (2xx count) + 5 (5xx count)) × 100
Error Budget: we can accept 5 more 5xx errors
This is an event-based SLO
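The arithmetic on this slide can be checked with a few lines of Python. This is a minimal sketch, not the tooling used at Quipper: the function names are hypothetical, and it assumes the count of good (2xx) events stays fixed while more 5xx errors arrive.

```python
from fractions import Fraction


def sli(good: int, bad: int) -> Fraction:
    """Event-based SLI: good events divided by valid (good + bad) events."""
    return Fraction(good, good + bad)


def remaining_error_budget(good: int, bad: int, slo: Fraction) -> int:
    """How many more bad events fit before good / (good + bad) drops below slo.

    Assumption: the number of good events does not change.
    good / (good + bad) >= slo  is equivalent to  bad <= good * (1 - slo) / slo
    """
    allowed_bad = (good * (1 - slo)) // slo  # floor division on exact fractions
    return int(allowed_bad) - bad


slo = Fraction("0.999")            # SLO: 99.9%
current = sli(10000, 5)            # 10000 2xx responses, 5 5xx responses
print(f"Present: {float(current):.2%}")
print("Remaining 5xx budget:", remaining_error_budget(10000, 5, slo))
```

Exact fractions are used instead of floats so that the floor at the budget boundary is not affected by rounding.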

Monitor-based SLO
SLI (%) = time windows in which the 95th-percentile response time over the last 1 minute is < 100 msec / all time windows × 100
Present: 99.95%
Time window: 7 days
Error Budget: only about 10 minutes in 7 days
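A monitor-based SLI of this kind can be sketched as follows. The 99.9% target is an assumption carried over from the earlier slides (this slide does not restate it), and the function names and sample data are hypothetical.

```python
def monitor_based_sli(p95_ms_per_minute, threshold_ms=100.0):
    """Fraction of 1-minute windows whose p95 response time is under the threshold."""
    if not p95_ms_per_minute:
        raise ValueError("need at least one window")
    good = sum(1 for p95 in p95_ms_per_minute if p95 < threshold_ms)
    return good / len(p95_ms_per_minute)


def error_budget_minutes(window_days, slo):
    """Minutes in the window during which the monitor may be violated."""
    return window_days * 24 * 60 * (1 - slo)


# Hypothetical week: 10080 one-minute windows, 5 of them too slow.
samples = [80.0] * 10075 + [250.0] * 5
sli = monitor_based_sli(samples)
budget = error_budget_minutes(7, 0.999)   # assumed SLO of 99.9%
print(f"SLI {sli:.2%}, budget {budget:.2f} min, remaining {budget - 5:.2f} min")
```

With a 7-day window and a 99.9% target the budget works out to roughly 10 minutes, which matches the figure on the slide.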

Why
• Fact-based decision making
• Teams can develop with a balance between reliability and agility
• Especially important in a microservices architecture

Teams can develop with a balance between reliability and agility
[Diagram: Reliability (Ops: "Keep the reliability") versus Agility (Dev: "Let's release a new feature!"), balanced by the SLO]

Especially important in a microservices architecture
[Diagram: ServiceA (success rate 99.9%), ServiceB (success rate 99%), ServiceC (success rate 99%)]
Reliability depends on other services
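To see why dependencies matter, here is a rough sketch of the combined success rate. It assumes that every request has to pass through all three services and that their failures are independent; neither assumption is stated on the slide, and the function name is hypothetical.

```python
from math import prod


def serial_success_rate(rates):
    """Success rate of a request that must succeed in every service in the chain.

    Assumes independent failures, so the individual rates simply multiply.
    """
    return prod(rates)


# Success rates from the slide: ServiceA 99.9%, ServiceB 99%, ServiceC 99%.
combined = serial_success_rate([0.999, 0.99, 0.99])
print(f"Combined success rate: {combined:.2%}")
```

Under these assumptions the combined rate is lower than that of any single service, so a service cannot promise more than its dependencies allow.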

Where
Measurement points: Synthetics, Client, Frontend, CDN, LoadBalancer, Application, DataStore
• Many options, each with a trade-off
• Some requests might not reach the apps
• E2E tests need more engineering effort to create

In Quipper
Measurement points: Synthetics, Client, Frontend, CDN, LoadBalancer, Application, DataStore
Send everything to Datadog

Self-Contained
“Encourage development teams to be self-contained so that each team can make products more comprehensively, proactively, and efficiently.”

SRE Mission for 2020 / Self-Contained
• Product teams can develop by themselves
  • No need to ask SREs
• We SREs provide the process
  • Design Doc
  • Production Readiness Check
  • Delegated infrastructure management (Terraform)
  • SLI/SLO

Timeline (2019 to 2020)
[Timeline figure: Migrated to Kubernetes, Define the Ownership, Production Readiness Checklist, SLO review by myself, SLO review with Devs, Set Error Budget Policy; date markers: Jun., Mar., Mar., Sep., SRE NEXT]

Why do we need such steps?
• Are the SLIs/SLOs we defined appropriate?
  • If not, the Error Budget Policy won't work well
• Can the product team start the process by itself?
  • If not, it needs some scaffolding, preparation, and training

Case Study in Quipper
• Define the Ownership
• SLO review by myself
• SLO review with Devs
• Set Error Budget Policy

Know your systems and organizations
• 2 products
• 4 branches
• 97 Kubernetes Deployments
• 84 developers (including 6 SREs)
• 48 subdomains
Where is the ownership?

Define the Owner
Services / Teams
• Japan: 7
• Global: 8
• Philippines: 3
• Indonesia: 4
• Shared: 1

Define the service owner in the Design Doc for each new service

SLO review by myself
• Establish the SLO review process
  • How to set SLOs?
  • How to monitor SLOs?
  • What action to take on an SLO violation?
  • How to investigate?
• Improve SLI / SLO accuracy
  • How to think about revising them?

How to set and monitor SLOs?
• Unfortunately, there is no alerting or recording system
• Use a Slack reminder and record the results in a GitHub Issue

Availability Table
https://landing.google.com/sre/sre-book/chapters/availability-table/
[Table annotations: "Too many errors", "Target too high", "Start with this!"]
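For readers who want to reproduce a table like this for their own targets, the following is a small sketch. The periods, the 365-day year, and the function name are assumptions for illustration and are not taken from the slide.

```python
from datetime import timedelta

# Assumed reporting periods; a "year" is taken as 365 days here.
PERIODS = {
    "day": timedelta(days=1),
    "week": timedelta(weeks=1),
    "30 days": timedelta(days=30),
    "year": timedelta(days=365),
}


def allowed_downtime(availability_percent: float, period: timedelta) -> timedelta:
    """Downtime that still meets the availability target over the given period."""
    return period * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99):
    row = ", ".join(
        f"{name}: {allowed_downtime(target, period)}"
        for name, period in PERIODS.items()
    )
    print(f"{target}% -> {row}")
```

Each extra nine cuts the allowed downtime by a factor of ten, which is one way to judge whether a target is too high for the team to sustain.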

Realized that “SLO Review” is a good habit
• A good habit?
  • Like pair programming or unit testing
• Why?
  • Motivates us to collect metrics
  • No burnout; a sense of relief
  • Makes us aware of the factors that hinder reliability
    • Platform outage
    • Push notification
    • Resource capacity
    • Rolling update

Many Problems…
• Noisy metrics caused by the DoS detector
• Developing SLIs
  • Send an HTTP path tag for shared services
  • No metrics available for microservice SLIs

DoS Detector: rate limiting by the reverse proxy
If a large number of requests come from the same client in a short time, the proxy returns 503

SLI should be related to user happiness
SLI (%) = HTTP 2xx status count / (HTTP 2xx status count + 5xx status count) × 100

Noisy metrics caused by the DoS detector
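The effect of those rate-limit 503s on a 2xx/5xx success-rate SLI can be illustrated with a short sketch. The deck does not say how the noise was handled at Quipper; the `rate_limited` flag, the class, and the numbers below are all hypothetical and only show why the raw metric is noisy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int
    rate_limited: bool = False  # hypothetical flag: the 503 came from the DoS detector


def success_rate(requests, exclude_rate_limited=False):
    """2xx / (2xx + 5xx), optionally ignoring responses produced by rate limiting."""
    valid = [
        r for r in requests
        if (200 <= r.status < 300 or 500 <= r.status < 600)
        and not (exclude_rate_limited and r.rate_limited)
    ]
    if not valid:
        return 1.0
    good = sum(1 for r in valid if 200 <= r.status < 300)
    return good / len(valid)


# Hypothetical traffic: 990 successes, 2 application errors, 8 rate-limit responses.
traffic = (
    [Request(200)] * 990
    + [Request(500)] * 2
    + [Request(503, rate_limited=True)] * 8
)
print(f"raw: {success_rate(traffic):.2%}")
print(f"without rate-limit 503s: {success_rate(traffic, exclude_rate_limited=True):.2%}")
```

In this made-up sample most of the 5xx responses come from the rate limiter rather than from the application, so the raw SLI understates how the service behaves for ordinary users.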

Send an HTTP path tag for shared services
• Coaching Team uses example.quipper.com/coaching
• School Team uses example.quipper.com/school
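The idea of tagging by path so that each team gets its own SLI on a shared host can be sketched as below. This is only an illustration: the function names are hypothetical, the tag is assumed to be the first path segment, and the request list is made up.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def path_tag(url: str) -> str:
    """Use the first path segment as the tag, e.g. 'coaching' or 'school'."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return segments[0] if segments else "(root)"


def success_rate_by_tag(requests):
    """Per-tag 2xx / (2xx + 5xx) for an iterable of (url, status) pairs."""
    counts = defaultdict(lambda: [0, 0])  # tag -> [good, valid]
    for url, status in requests:
        if 200 <= status < 300 or 500 <= status < 600:
            entry = counts[path_tag(url)]
            entry[1] += 1
            if status < 300:
                entry[0] += 1
    return {tag: good / valid for tag, (good, valid) in counts.items()}


# Hypothetical requests against the shared host from the slide.
requests = [
    ("https://example.quipper.com/coaching/a", 200),
    ("https://example.quipper.com/coaching/b", 500),
    ("https://example.quipper.com/school/x", 200),
    ("https://example.quipper.com/school/y", 404),  # 4xx is not a valid event here
]
print(success_rate_by_tag(requests))
```

Without the tag, both teams would share one success rate and neither could tell whose endpoints consumed the error budget.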

No metrics available for microservice SLIs
[Diagram: ServiceA, ServiceB, ServiceC; service-to-service requests GET http://serviceb and GET http://servicec; a side-car container is added]

Case Study in Quipper
• Define the Ownership
• SLO review by myself
• SLO review with Devs
• Set Error Budget Policy
• To be continued…

Provide Standardized / Recommended SLIs
• Ideally, the product team should set its own SLIs, but…
• Start with the defaults first

SLI menu
• Availability
  • HTTP success rate
• Latency
  • Upstream response time < x msec

Make the configuration as code
Developers can easily change it by pull request
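One possible shape for such a configuration is sketched below. The deck does not show the actual format or tool used at Quipper, so the class, field names, and default values are all hypothetical; the point is only that defaults live in code and a change is a small, reviewable diff.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SloConfig:
    """Hypothetical per-service SLO definition kept in the repository."""
    service: str
    owner: str
    availability_target: float = 99.9   # percent; assumed default
    latency_threshold_ms: int = 100     # assumed default for the latency SLI
    window_days: int = 7                # assumed default time window

    def __post_init__(self):
        # Reject values that cannot be a meaningful target.
        if not 0 < self.availability_target < 100:
            raise ValueError("availability_target must be between 0 and 100")
        if self.latency_threshold_ms <= 0 or self.window_days <= 0:
            raise ValueError("latency_threshold_ms and window_days must be positive")


# A team starts from the recommended defaults...
coaching = SloConfig(service="coaching", owner="Coaching Team")
# ...and later proposes a stricter target in a pull request.
stricter = replace(coaching, availability_target=99.95)
print(coaching)
print(stricter)
```

Validation at construction time means an invalid target fails in review or CI instead of silently producing a meaningless monitor.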

Have a steep learning curve

Good Documentation

Work together

Summary
• It is worth defining and reviewing SLIs / SLOs
• But SLIs / SLOs are not perfect from the beginning
• Reduce cognitive load and introduce them to the team gradually

Thank You!
Takeshi Kondo (chaspy / chaspy_)
Site Reliability Engineer at Quipper
SRE Lounge, Terraform-jp
