SLOs and Error Budget

Rafael Jesus
November 12, 2018

Transcript

  1. Service Level Objective (SLO)

    An SLO sets a target level of reliability for the service's customers.
    It's a tool for making data-driven decisions about reliability: features or reliability?
    It is the most important SRE concept: without SLOs, there is no need for Site Reliability Engineering.
  2. Service Level Indicators (SLI)

    Key measurements of availability.
    Usually request latency and error rate.
    Request metrics are measured at the load balancer.
    An SLO for UIs has multiple SLIs:
      Page load time
      Browser interaction time (by geo)
      Server error rate
      JS error rate (by geo)
  3. Reliability Target

    Above this threshold, almost all users should be happy with the service.
    Below this threshold, users are likely to start complaining or stop using the service.
    Ultimately, user happiness is what matters. We keep our services reliable to keep our customers happy.
    Usually owned by Product Owners.
  4. Error Budgets

    Provide an incentive to balance reliability with other features.
    They give teams permission to focus on reliability when data indicates that reliability is more important than other product features.
  5. Non-Goals

    The error budget is not intended to serve as a punishment for missing SLOs.
    Nor is it meant to make either SREs or service owners happy.
  6. Availability

    Availability = (successful requests / total requests)
    Calculation for Availability SLIs over the previous 4 weeks (one minus the share of 5xx responses):

      1 - (
        sum(rate(http_requests_total{service=~"my-service", status=~"5.*"}[4w]))
        /
        sum(rate(http_requests_total{service=~"my-service"}[4w]))
      )

    More about Aggregate Availability
  7. Latency

    Percentage of requests with latency < xyz ms
    Calculation for Latency SLIs over the previous 4 weeks (95th and 99th percentile):

      histogram_quantile(0.95, sum(rate(request_duration_seconds_bucket{ingress=~"my-service"}[4w])) by (le))

      histogram_quantile(0.99, sum(rate(request_duration_seconds_bucket{ingress=~"my-service"}[4w])) by (le))
  8. Create a proposed SLO

    Based on the previously proposed SLIs, we can define our SLO for a period of four weeks.
    SLO:
      99.5% availability
      95% of requests faster than 400ms
      99% of requests faster than 950ms
  9. Advanced Topics

    Modeling the user journey
    Alerting on SLO violations
    SLIs/SLOs for UIs:
      Delivering a fast, error-free UI
      Multiple SLIs: page load time, browser interaction time (by geo), server error rate, JS error rate (by geo)
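
To make the arithmetic behind slides 4, 6, 7 and 8 concrete, here is a minimal Python sketch, not the presenter's tooling. It assumes request counts and request durations have already been exported from the monitoring system. The function names (`budget_report`, `allowed_downtime_minutes`, `fraction_faster_than`, `latency_slo_met`) are hypothetical and exist only for illustration. The targets are the ones from the proposed SLO: 99.5% availability, 95% of requests under 400ms and 99% under 950ms, over four weeks.

```python
import math
from dataclasses import dataclass

# Targets copied from the proposed SLO (slide 8), four-week window.
AVAILABILITY_TARGET = 0.995
# Latency threshold in seconds -> required fraction of requests below it.
LATENCY_TARGETS = {0.400: 0.95, 0.950: 0.99}
WINDOW_MINUTES = 4 * 7 * 24 * 60


@dataclass
class BudgetReport:
    availability: float      # successful requests / total requests
    allowed_failures: float  # failures the SLO tolerates for this traffic volume
    failed: int              # failures actually observed
    budget_consumed: float   # 1.0 means the whole error budget is spent


def budget_report(successful: int, total: int,
                  target: float = AVAILABILITY_TARGET) -> BudgetReport:
    """Turn request counts into an availability SLI and error-budget usage."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")
    failed = total - successful
    allowed = (1 - target) * total
    return BudgetReport(successful / total, allowed, failed, failed / allowed)


def allowed_downtime_minutes(target: float = AVAILABILITY_TARGET,
                             window_minutes: int = WINDOW_MINUTES) -> float:
    """Express the error budget as minutes of full outage in the window."""
    return (1 - target) * window_minutes


def fraction_faster_than(durations_s: list[float], threshold_s: float) -> float:
    """Share of requests whose latency is strictly below the threshold."""
    if not durations_s:
        raise ValueError("no requests to evaluate")
    return sum(1 for d in durations_s if d < threshold_s) / len(durations_s)


def latency_slo_met(durations_s: list[float]) -> bool:
    """True when every latency target in LATENCY_TARGETS is satisfied."""
    return all(fraction_faster_than(durations_s, threshold) >= required
               for threshold, required in LATENCY_TARGETS.items())


if __name__ == "__main__":
    # Made-up traffic numbers, purely for illustration.
    report = budget_report(successful=997_000, total=1_000_000)
    print(f"availability: {report.availability:.4f}")
    print(f"error budget consumed: {report.budget_consumed:.0%}")
    print(f"allowed downtime: {allowed_downtime_minutes():.1f} minutes")
```

A team could compare `budget_consumed` against 1.0 to decide whether the data favors reliability work over new features, which is the trade-off the error budget slide describes. The sample traffic numbers are invented and carry no meaning beyond the example.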