measure of service reliability • e.g. HTTP success rate, response time • SLO / Service Level Objective • Sets a reliability target for an SLI • 99%, 99.9%, 99.99%… • Error Budget • An SLO implies an acceptable level of unreliability • This is a budget that can be allocated
The Art of SLOs – Slides / https://docs.google.com/presentation/d/1qcQ6alG_qUg3qWf733ZsDnTggwzqe4PZICrFXZ1zQZs/edit#slide=id.g75945b48fe_0_0
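The relationship between an SLI, an SLO, and the error budget can be sketched in a few lines. This is only an illustration: the function names and the request counts are hypothetical and do not come from the slides.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: the fraction of requests that were served successfully."""
    if total_requests == 0:
        # Assumption: a period with no traffic counts as fully reliable.
        return 1.0
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Fraction of the error budget still unspent.

    1.0 means the budget is untouched; 0 or below means it is exhausted.
    """
    allowed_unreliability = 1.0 - slo_target
    if allowed_unreliability <= 0:
        raise ValueError("slo_target must be below 1.0")
    observed_unreliability = 1.0 - sli
    return 1.0 - observed_unreliability / allowed_unreliability


# Hypothetical numbers: 99,950 good requests out of 100,000, SLO target 99.9%.
sli = availability_sli(99_950, 100_000)
remaining = error_budget_remaining(sli, slo_target=0.999)
print(f"SLI = {sli:.4%}, error budget remaining = {remaining:.0%}")
```

With these made-up numbers the service has spent about half of its budget: a 99.9% target tolerates 0.1% failures, and 0.05% were observed.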
Monitor based SLO • Present: 99.95% • 95th percentile response time < 100 ms in the last 1 minute • Overall time window: 7 days • Error Budget is only 10 minutes in 7 days
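The "10 minutes in 7 days" figure can be checked with simple arithmetic. The sketch below assumes the SLO target is 99.9%. The slide does not state the target; 99.9% is inferred because it matches the quoted budget, and "Present: 99.95%" is read here as the currently measured value. The function name is hypothetical.

```python
from datetime import timedelta


def time_error_budget(window: timedelta, slo_target: float) -> timedelta:
    """Total time the SLI may be out of bounds within the window."""
    return window * (1.0 - slo_target)


window = timedelta(days=7)

# Assumed target of 99.9% (not stated on the slide).
budget = time_error_budget(window, slo_target=0.999)
print(f"Error budget: {budget.total_seconds() / 60:.2f} minutes")

# For comparison, a stricter 99.95% target would halve the budget.
stricter = time_error_budget(window, slo_target=0.9995)
print(f"At 99.95%: {stricter.total_seconds() / 60:.2f} minutes")
```

Under that assumption the budget comes to roughly 10 minutes per week, which is why a single bad monitor interval consumes a noticeable share of it.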
develop by themselves • No need to ask SREs • We SREs provide the process • Design Doc • Production Readiness Check • Delegate Infrastructure Management (Terraform) • SLI/SLO
are appropriate? • If not, the Error Budget Policy won’t work well • Can the product team start the process itself? • If not, it needs some scaffolding, preparation, and training
How to set an SLO? • How to monitor an SLO? • What action to take when an SLO is violated? • How to investigate? • Improve SLI / SLO accuracy • How to decide when to revise?
• Like Pair-Programming or Unit Tests • Why? • Motivation to collect metrics • No burnout, feel relief • Awareness of the factors that hinder reliability • Platform Outage • Push notification • Resource Capacity • Rolling Update