Agenda
SLOs and SLIs
Reliability Target And Error Budgets
Using SLIs to Measure
Example
Advanced Topics
Slide 3
Service Level Objective (SLO)
An SLO sets a target level of reliability for the
service's customers
It's a tool for making data-driven decisions about
reliability. Features or Reliability?
The most important SRE concept: without SLOs,
there is no need for Site Reliability Engineering
Slide 4
Service Level Indicators (SLI)
Key measurements of availability
Usually request latency and error rate
Request metrics are measured at the load
balancer
An SLO for a UI has multiple SLIs:
Page load time
Browser interaction time (by geo)
Server error rate
JS error rate (by geo)
Slide 5
Reliability Target
Above this threshold, almost all users should be
happy with the service
Below this threshold, users are likely to start
complaining or stop using the service
Ultimately, user happiness is what matters. We
keep our services reliable to keep our customers
happy
Usually owned by Product Owners
Slide 6
Error Budgets
Provide an incentive to balance reliability with
other features
They give teams permission to focus on reliability
when data indicates that reliability is more
important than other product features
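The balance described above can be made concrete with a little arithmetic. The sketch below assumes the common definition of an error budget as the share of requests allowed to fail, that is, 1 minus the SLO target. The function names and the request counts are hypothetical and chosen only for illustration.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Requests that may fail in the window without missing the SLO.

    Assumes error budget = (1 - SLO target) * total requests.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        return 0.0
    return 1.0 - failed_requests / budget


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 1,250 failures, 99.5% target.
    print(error_budget(0.995, 1_000_000))
    print(budget_remaining(0.995, 1_000_000, 1_250))
```

When the remaining fraction approaches zero, the data indicates that reliability work should take priority over new features.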
Slide 7
Non-Goals
An error budget is not intended to serve as a
punishment for missing SLOs
It is not meant to make either SREs or service
owners happy
Slide 8
Using the SLIs to Calculate the
SLOs
Slide 9
Availability
Availability = successful requests / total requests
Calculation for the availability SLI over the previous
4 weeks (non-5xx requests divided by all requests)
sum(rate(http_requests_total{service=~"my-service", status!~"5.*"}[4w]))
/
sum(rate(http_requests_total{service=~"my-service"}[4w]))
More about Aggregate Availability
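As a rough illustration of the availability ratio, the sketch below computes it from per-status request counts, treating every non-5xx response as successful. The function name and the sample counts are assumptions made for this example and do not come from any real service.

```python
from typing import Mapping


def availability(counts_by_status: Mapping[str, int]) -> float:
    """Successful requests / total requests, where 5xx counts as failure."""
    total = sum(counts_by_status.values())
    if total == 0:
        raise ValueError("no requests in the window")
    server_errors = sum(
        count for status, count in counts_by_status.items()
        if status.startswith("5")
    )
    return (total - server_errors) / total


if __name__ == "__main__":
    # Hypothetical four-week totals keyed by HTTP status code.
    sample = {"200": 9_800, "404": 150, "500": 30, "503": 20}
    print(f"{availability(sample):.4f}")
```

Note that 4xx responses count as successful here, since they usually reflect client mistakes and not a service failure. Whether that is right for a given service is a judgment call.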
Slide 10
Latency
Percentage of requests with latency < xyz ms
Calculation for Latency SLIs over the previous 4
weeks
histogram_quantile(0.95, sum(rate(request_duration_seconds_bucket{ingress=~"my-service"}[4w])) by (le))
histogram_quantile(0.99, sum(rate(request_duration_seconds_bucket{ingress=~"my-service"}[4w])) by (le))
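To show how the two views of latency relate (the share of requests under a threshold, and the latency at a given percentile), here is a minimal sketch that works on raw latency samples. It uses a nearest-rank percentile, which is a simplification: `histogram_quantile` estimates the value by interpolating inside histogram buckets, so its results will differ somewhat. The function names and sample values are hypothetical.

```python
import math
from typing import Sequence


def fraction_faster_than(latencies_ms: Sequence[float],
                         threshold_ms: float) -> float:
    """Share of requests whose latency is strictly below the threshold."""
    if not latencies_ms:
        raise ValueError("no samples")
    fast = sum(1 for value in latencies_ms if value < threshold_ms)
    return fast / len(latencies_ms)


def percentile(latencies_ms: Sequence[float], q: float) -> float:
    """Nearest-rank percentile for 0 < q <= 1 (not bucket interpolation)."""
    if not latencies_ms:
        raise ValueError("no samples")
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    ordered = sorted(latencies_ms)
    rank = math.ceil(q * len(ordered))  # 1-based rank of the chosen sample
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical request latencies in milliseconds.
    samples = [100, 120, 150, 200, 250, 300, 350, 380, 390, 1200]
    print(fraction_faster_than(samples, 400))
    print(percentile(samples, 0.95))
```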
Slide 11
Create a proposed SLO
Based upon the previously proposed SLIs, we can
define our SLO for a period of four weeks
SLO
99.5% availability
95% of requests faster than 400ms
99% of requests faster than 950ms
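One way to read the proposed SLO is as three checks against the measured SLIs for the four-week window. The sketch below assumes the availability and the p95/p99 latencies have already been measured. The dictionary keys, the function name, and the measured values are hypothetical; only the three targets come from the list above.

```python
# Targets taken from the proposed SLO above.
PROPOSED_SLO = {
    "availability": 0.995,  # at least 99.5% of requests succeed
    "p95_ms": 400.0,        # 95% of requests faster than 400 ms
    "p99_ms": 950.0,        # 99% of requests faster than 950 ms
}


def evaluate_slo(availability: float, p95_ms: float, p99_ms: float,
                 slo: dict = PROPOSED_SLO) -> dict:
    """Return, per objective, whether the measured SLI meets its target."""
    return {
        "availability": availability >= slo["availability"],
        "p95_latency": p95_ms < slo["p95_ms"],
        "p99_latency": p99_ms < slo["p99_ms"],
    }


if __name__ == "__main__":
    # Hypothetical measurements for one four-week window.
    result = evaluate_slo(availability=0.997, p95_ms=380.0, p99_ms=1010.0)
    for objective, met in result.items():
        print(f"{objective}: {'met' if met else 'missed'}")
```

In this made-up window the availability and p95 objectives are met while the p99 objective is missed, so the SLO as a whole would count as missed.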
Slide 12
Example SLO Document
See SLO Template Example
Slide 13
Advanced Topics
Modeling User Journey
Alerting on SLO violations
SLIs/SLOs for UIs
Deliver a fast, error-free UI
Multiple SLIs: page load time, browser
interaction time (by geo), server error rate, JS
error rate (by geo)
Slide 14
Challenges
SLOs for legacy systems
Too many edge services
SLO reviews