• An SLO implies an acceptable level of unreliability • This is a budget that can be allocated (Source: The Art of SLOs – Slides, https://docs.google.com/presentation/d/1qcQ6alG_qUg3qWf733ZsDnTggwzqe4PZICrFXZ1zQZs/edit#slide=id.g75945b48fe_0_0)
SLI (%) • SLO: 99.9% • Present: 99.95% • 95th percentile response time < 100 ms in the last 1 minute • Overall time window: 7 days • Error budget is only about 10 minutes in 7 days • Monitor-based SLO
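The "about 10 minutes in 7 days" figure follows directly from the SLO target and the window. Below is a minimal sketch of that arithmetic. The function names `error_budget` and `budget_consumed` are hypothetical, and the second function assumes the "Present" value is measured over the same 7-day window, which the slide does not state.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Allowed 'bad' time for a time-based SLO over the given window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return window * (1.0 - slo_target)


def budget_consumed(slo_target: float, current_sli: float) -> float:
    """Fraction of the error budget already spent.

    Assumption: current_sli is measured over the same window as the SLO.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - current_sli) / (1.0 - slo_target)


if __name__ == "__main__":
    budget = error_budget(0.999, timedelta(days=7))
    print(f"Error budget: {budget.total_seconds() / 60:.2f} minutes")  # 10.08 minutes
    print(f"Budget consumed: {budget_consumed(0.999, 0.9995):.0%}")    # 50%
```

Under these assumptions, a 99.9% target over 7 days leaves roughly 10 minutes of budget, and a measured 99.95% would mean about half of it is already spent.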
Especially important in a microservices architecture • ServiceA, ServiceB, ServiceC • Success rates: 99.9%, 99%, 99% • Reliability depends on other services
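To illustrate why reliability depends on other services, here is a minimal sketch that combines the three success rates shown above. It assumes every request must pass through all three services in series and that their failures are independent; the slide states neither, and `serial_success_rate` is a hypothetical helper.

```python
from math import prod
from typing import Iterable


def serial_success_rate(rates: Iterable[float]) -> float:
    """End-to-end success rate when a request must succeed in every service.

    Assumption: failures are independent, so the rates multiply.
    """
    rates = list(rates)
    if any(not 0.0 <= r <= 1.0 for r in rates):
        raise ValueError("each rate must be between 0 and 1")
    return prod(rates)


if __name__ == "__main__":
    combined = serial_success_rate([0.999, 0.99, 0.99])
    print(f"Combined success rate: {combined:.2%}")  # 97.91%
```

Under these assumptions the end-to-end success rate is lower than that of any single service, so a caller cannot promise a tighter SLO than its dependencies allow without retries, caching, or fallbacks.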
Where • Synthetics, Client, Frontend, CDN, LoadBalancer, Application, DataStore • Many options, each with trade-offs • Some requests might not reach the apps • Need more engineering effort to generate E2E tests
Self-Contained “Encourage development teams to be self-contained so that each team can make products more comprehensively, proactively, and efficiently.”
Timeline 2019 2020 Migrated to Kubernetes Define the Ownership Production Readiness Checklist SLO review by myself Set Error Budget Policy Jun. Mar. Mar. Sep. SRE NEXT SLO review with Devs
Timeline 2019 2020 Migrated to Kubernetes Define the Ownership Production Readiness Checklist SLO review by myself SLO review with Devs Jun. Mar. Mar. Sep. SRE NEXT Why do we need such steps? Set Error Budget Policy