SLOs and Error Budget - Speaker Deck

Slide 1

Slide 1 text

SLOs and Error Budgets

Slide 2

Slide 2 text

Agenda
- SLOs and SLIs
- Reliability Target and Error Budgets
- Using SLIs to Measure
- Example
- Advanced Topics

Slide 3

Slide 3 text

Service Level Objective (SLO)
- An SLO sets a target level of reliability for the service's customers
- It's a tool for making data-driven decisions about reliability: features or reliability?
- It is the most important SRE concept: without SLOs, there is no need for Site Reliability Engineering

Slide 4

Slide 4 text

Service Level Indicators (SLIs)
- Key measurements of availability
- Usually request latency and error rate
- Request metrics are measured at the load balancer
- An SLO for a UI has multiple SLIs:
  - Page load time
  - Browser interaction time (by geo)
  - Server error rate
  - JS error rate (by geo)

Slide 5

Slide 5 text

Reliability Target
- Above this threshold, almost all users should be happy with the service
- Below this threshold, users are likely to start complaining or stop using the service
- Ultimately, user happiness is what matters. We keep our services reliable to keep our customers happy
- Usually owned by Product Owners

Slide 6

Slide 6 text

Error Budgets
- Provide an incentive to balance reliability with other features
- Give teams permission to focus on reliability when the data indicates that reliability is more important than other product features
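To make the size of an error budget concrete, here is a minimal sketch of the arithmetic, assuming the budget is simply 1 minus the SLO target over a fixed window. The function names and the 28-day window are illustrative assumptions, not something the deck defines.

```python
# Hypothetical helpers: names and defaults are assumptions for illustration.
MINUTES_PER_DAY = 24 * 60


def error_budget_fraction(slo_target: float) -> float:
    """Fraction of requests (or time) allowed to fail under the SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return 1.0 - slo_target


def allowed_bad_minutes(slo_target: float, window_days: int = 28) -> float:
    """Minutes of full unavailability the budget permits in the window."""
    return error_budget_fraction(slo_target) * window_days * MINUTES_PER_DAY


def allowed_failed_requests(slo_target: float, total_requests: int) -> float:
    """Failed requests the budget permits for a given request volume."""
    return error_budget_fraction(slo_target) * total_requests


if __name__ == "__main__":
    # A 99.5% target over four weeks leaves a 0.5% budget.
    print(round(allowed_bad_minutes(0.995), 1))
    print(round(allowed_failed_requests(0.995, 1_000_000)))
```

With a 99.5% target, the budget is 0.5% of the window: about 201.6 minutes over four weeks, or roughly 5,000 failed requests per million.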

Slide 7

Slide 7 text

Non-Goals
- The error budget is not intended to serve as a punishment for missing SLOs
- Nor is it meant to make either SREs or service owners happy

Slide 8

Slide 8 text

Using the SLIs to Calculate the SLOs

Slide 9

Slide 9 text

Availability
Availability = successful requests / total requests
Calculation for the availability SLI over the previous 4 weeks (requests with a non-5xx status count as successful):

    sum(rate(http_requests_total{service=~"my-service", status!~"5.*"}[4w]))
    /
    sum(rate(http_requests_total{service=~"my-service"}[4w]))

More about Aggregate Availability
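The same ratio can be sketched outside the monitoring system from raw request counts. This is an illustrative helper, not part of the deck; the function name and the choice to treat a window with no traffic as fully available are assumptions.

```python
def availability(total_requests: int, failed_requests: int) -> float:
    """Availability = successful requests / total requests."""
    if total_requests < 0 or failed_requests < 0:
        raise ValueError("request counts must be non-negative")
    if failed_requests > total_requests:
        raise ValueError("failed_requests cannot exceed total_requests")
    if total_requests == 0:
        # Assumption: a window with no traffic counts as fully available.
        return 1.0
    return (total_requests - failed_requests) / total_requests


if __name__ == "__main__":
    # 5 server errors (5xx) out of 1,000 requests.
    print(availability(1000, 5))
```

Here `failed_requests` plays the role of the 5xx responses excluded by the query, and `total_requests` the unfiltered request count.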

Slide 10

Slide 10 text

Latency
Percentage of requests with latency < xyz ms
Calculation for the latency SLIs over the previous 4 weeks:

    histogram_quantile(0.95, sum(rate(request_duration_seconds_bucket{ingress=~"my-service"}[4w])) by (le))

    histogram_quantile(0.99, sum(rate(request_duration_seconds_bucket{ingress=~"my-service"}[4w])) by (le))
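The slide states the latency SLI as a percentage of requests under a threshold, while the queries return percentile latencies. As a rough sketch of the percentage form, assuming cumulative histogram buckets keyed by their upper bound in seconds (in the style of Prometheus `le` buckets) and a threshold that matches a bucket boundary, the fraction can be read directly from the bucket counts. The function name and the sample data are hypothetical.

```python
from typing import Dict

INF = float("inf")


def fraction_within(buckets: Dict[float, int], threshold_s: float) -> float:
    """Fraction of requests that completed within threshold_s seconds.

    buckets maps each upper bound (seconds) to the cumulative number of
    requests at or below that bound, and must include an INF bucket
    holding the total request count.
    """
    if INF not in buckets:
        raise ValueError("buckets must include the +Inf bucket")
    if threshold_s not in buckets:
        # Assumption: no interpolation, the threshold must be a boundary.
        raise ValueError("threshold must match a bucket boundary")
    total = buckets[INF]
    if total == 0:
        return 1.0  # Assumption: no traffic counts as meeting the target.
    return buckets[threshold_s] / total


if __name__ == "__main__":
    # Made-up sample counts for illustration only.
    sample = {0.1: 700, 0.4: 960, 0.95: 995, INF: 1000}
    print(fraction_within(sample, 0.4))
    print(fraction_within(sample, 0.95))
```

Choosing bucket boundaries that coincide with the SLO thresholds avoids the interpolation error that quantile estimates carry.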

Slide 11

Slide 11 text

Create a proposed SLO
Based on the previously proposed SLIs, we can define our SLO for a period of four weeks.
SLO:
- 99.5% availability
- 95% of requests faster than 400 ms
- 99% of requests faster than 950 ms
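One way to check measurements against this proposed SLO is sketched below. The targets come from the slide; the `Measured` class, the function names, and the sample values are hypothetical, and the percentile latencies are assumed to be reported in seconds, as the `request_duration_seconds` histogram suggests.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Measured:
    availability: float  # fraction of successful requests, 0..1
    p95_seconds: float   # 95th percentile latency
    p99_seconds: float   # 99th percentile latency


# Targets from the proposed SLO over a four-week window.
AVAILABILITY_TARGET = 0.995
P95_TARGET_SECONDS = 0.400
P99_TARGET_SECONDS = 0.950


def evaluate(m: Measured) -> Dict[str, bool]:
    """Return, per objective, whether the measurement meets the target."""
    return {
        "availability": m.availability >= AVAILABILITY_TARGET,
        "p95_latency": m.p95_seconds < P95_TARGET_SECONDS,
        "p99_latency": m.p99_seconds < P99_TARGET_SECONDS,
    }


def budget_consumed(availability: float,
                    target: float = AVAILABILITY_TARGET) -> float:
    """Share of the availability error budget used (1.0 = fully spent)."""
    return (1.0 - availability) / (1.0 - target)


if __name__ == "__main__":
    # Made-up measurements for illustration only.
    window = Measured(availability=0.997, p95_seconds=0.35, p99_seconds=1.2)
    print(evaluate(window))
    print(round(budget_consumed(window.availability), 2))
```

In this made-up window, availability and the 95th percentile are within target, the 99th percentile is not, and about 60% of the availability error budget has been spent.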

Slide 12

Slide 12 text

Example SLO Document
See SLO Template Example

Slide 13

Slide 13 text

Advanced Topics
- Modeling user journeys
- Alerting on SLO violations
- SLIs/SLOs for UIs
  - Delivering a fast, error-free UI
  - Multiple SLIs: page load time, browser interaction time (by geo), server error rate, JS error rate (by geo)