SLOs and Error Budget


Agenda
- SLOs and SLIs
- Reliability Target and Error Budgets
- Using SLIs to Measure
- Example
- Advanced Topics

Service Level Objective (SLO)
- An SLO sets a target level of reliability for the service's customers
- It's a tool for making data-driven decisions about reliability: features or reliability?
- The most important SRE concept: without SLOs, there is no need for Site Reliability Engineering

Service Level Indicators (SLI)
- Key measurements of availability
- Usually request latency and error rate
- Request metrics are measured at the load balancer
- An SLO for a UI has multiple SLIs:
  - Page load time
  - Browser interaction time (by geo)
  - Server error rate
  - JS error rate (by geo)

Reliability Target
- Above this threshold, almost all users should be happy with the service
- Below this threshold, users are likely to start complaining or stop using the service
- Ultimately, user happiness is what matters: we keep our services reliable to keep our customers happy
- Usually owned by Product Owners

Error Budgets
- Provide an incentive to balance reliability with other features
- Give teams permission to focus on reliability when data indicates that reliability is more important than other product features
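The deck does not show how a budget is computed, so the following is a minimal sketch under the common assumption that the error budget is simply the complement of the SLO target applied to the traffic in the window. The function name `error_budget` and its parameters are hypothetical, chosen for illustration only.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize a request-based error budget.

    Assumption (not from the slides): budget = (1 - SLO target) * total requests.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    if total_requests < 0 or failed_requests < 0:
        raise ValueError("request counts must be non-negative")

    # Number of failed requests the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    # Positive: budget left. Negative: the SLO has been missed.
    remaining = allowed_failures - failed_requests

    if allowed_failures > 0:
        consumed_fraction = failed_requests / allowed_failures
    else:
        # No traffic means no budget; any failure counts as fully overspent.
        consumed_fraction = 0.0 if failed_requests == 0 else float("inf")

    return {
        "allowed_failures": allowed_failures,
        "remaining": remaining,
        "consumed_fraction": consumed_fraction,
    }
```

With a 99.5% target and 1,000,000 requests in the window, the budget is roughly 5,000 failed requests. Expressed as time instead of requests, the same target over four weeks (40,320 minutes) allows about 201.6 minutes of unavailability.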

Non-Goals
- The error budget is not intended to serve as a punishment for missing SLOs
- It is not meant to make either SREs or service owners happy

Using the SLIs to Calculate the SLOs

Availability
- Availability = successful requests / total requests
- Calculation for the availability SLI over the previous 4 weeks (requests that did not return a 5xx status, divided by all requests):
    sum(rate(http_requests_total{service=~"my-service", status!~"5.*"}[4w])) / sum(rate(http_requests_total{service=~"my-service"}[4w]))
- More about Aggregate Availability
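To make the availability ratio concrete, here is a small sketch that applies the same definition to raw request counts. It assumes, as the query does, that only 5xx responses count as failures; the function name `availability` is hypothetical.

```python
from typing import Optional


def availability(total_requests: int, server_errors: int) -> Optional[float]:
    """Return successful requests / total requests, or None when there was no traffic.

    Assumption: a request is "successful" unless it returned a 5xx status.
    """
    if total_requests < 0 or server_errors < 0:
        raise ValueError("counts must be non-negative")
    if server_errors > total_requests:
        raise ValueError("server_errors cannot exceed total_requests")
    if total_requests == 0:
        # With no requests the ratio is undefined rather than 0% or 100%.
        return None
    return (total_requests - server_errors) / total_requests
```

Returning `None` for an empty window is a design choice in this sketch: it keeps a quiet service from being reported as either perfectly available or completely down.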

Latency
- Percentage of requests with latency < xyz ms
- Calculation for the latency SLIs over the previous 4 weeks (95th and 99th percentile request duration, in seconds):
    histogram_quantile(0.95, sum(rate(request_duration_seconds_bucket{ingress=~"my-service"}[4w])) by (le))
    histogram_quantile(0.99, sum(rate(request_duration_seconds_bucket{ingress=~"my-service"}[4w])) by (le))
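The quantile queries answer "how slow is the 95th or 99th percentile request?", while the SLI is phrased as a percentage of requests under a threshold. The sketch below shows the second view, assuming cumulative histogram buckets keyed by their upper bound in seconds (the shape Prometheus `le` buckets have). The function name `fraction_within` and the sample bucket layout are hypothetical.

```python
from typing import Dict


def fraction_within(cumulative_buckets: Dict[float, int], threshold_seconds: float) -> float:
    """Fraction of requests whose duration is at most threshold_seconds.

    cumulative_buckets maps each bucket upper bound to the count of requests
    at or below that bound, and must include float("inf") for the total.
    Note that buckets are inclusive (<=), a slight difference from the
    strict "<" wording of the SLI.
    """
    total = cumulative_buckets[float("inf")]
    if total <= 0:
        raise ValueError("no requests observed")
    if threshold_seconds not in cumulative_buckets:
        # Without a matching bucket boundary the exact fraction is unknown.
        raise KeyError(f"no bucket with upper bound {threshold_seconds}")
    return cumulative_buckets[threshold_seconds] / total


# Hypothetical sample data: 1000 requests in total.
sample_buckets = {0.1: 600, 0.4: 960, 0.95: 995, float("inf"): 1000}
```

The sketch refuses thresholds that are not bucket boundaries instead of interpolating, which is one reason to pick histogram buckets that line up with the latency targets.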

Create a proposed SLO
- Based on the SLIs proposed above, we can define our SLO for a period of four weeks
- SLO:
  - 99.5% availability
  - 95% of requests faster than 400ms
  - 99% of requests faster than 950ms
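As an illustration of how the proposed targets could be checked against measured SLIs, here is a minimal sketch. The `Slo` class and `evaluate` function are hypothetical names; the default thresholds are the ones proposed on this slide, and the measured values in the usage note are made up.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Slo:
    """Targets from the proposed SLO, with latencies in seconds."""
    availability: float = 0.995  # 99.5% availability
    p95_seconds: float = 0.400   # 95% of requests faster than 400ms
    p99_seconds: float = 0.950   # 99% of requests faster than 950ms


def evaluate(slo: Slo, availability: float, p95_seconds: float, p99_seconds: float) -> Dict[str, bool]:
    """Return, per objective, whether the measured SLI meets the target."""
    return {
        "availability": availability >= slo.availability,
        "p95_latency": p95_seconds < slo.p95_seconds,
        "p99_latency": p99_seconds < slo.p99_seconds,
    }
```

For example, `evaluate(Slo(), 0.997, 0.310, 1.2)` reports the availability and p95 objectives as met and the p99 objective as missed, since 1.2s is slower than the 950ms target.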

Example SLO Document
- See SLO Template Example

Advanced Topics
- Modeling user journeys
- Alerting on SLO violations
- SLIs/SLOs for UIs
  - Deliver a fast, error-free UI
  - Multiple SLIs: page load time, browser interaction time (by geo), server error rate, JS error rate (by geo)

Challenges
- SLOs for legacy systems
- Too many edge services
- SLO reviews

Site Reliability Engineering

The Site Reliability Workbook


Rafael Jesus
