Adopting SLO and Error Budget

Rafael Jesus

October 18, 2019

Transcript

  1. Agenda

    SLO and SLI concepts
    Why SLOs?
    Error budget
    SLO process
    Alerting on SLOs
    Challenges
  2. Service Level Objective (SLO)

    Target level of reliability for the service's end users.
    It's a tool for balancing the addition of new features against reliability.
    A philosophy of monitoring and managing systems.
    The most important SRE concept.
    It's expressed as a percentage or ratio over a rolling time window, e.g. "99.9% for any given 30d".
  3. Why SLOs?

    They provide context that allows engineering teams to make data-driven decisions about service availability and performance.
    They provide clear and accurate statements about the impact of production incidents.
  4. Service Level Indicators (SLI)

    Key measurements of availability.
    Usually request latency and error rate.
    Metrics are measured high up in the stack, e.g. at the load balancer.
    An SLO for a UI has multiple SLIs:
      Page load time
      Browser interaction time (by Geo)
      Server error rate
      JS error rate (by Geo)
  5. Error Budget

    Provides an incentive to balance reliability against new features.
    Gives teams permission to focus on service reliability "for customers".
    It is the total number of failures tolerated by the SLO.
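
As a worked illustration of "total number of failures tolerated by the SLO", here is a minimal sketch. The function names and the request counts are hypothetical and only serve the arithmetic; they do not come from the deck.

```python
def error_budget(slo: float, total_requests: int) -> int:
    """Number of failed requests the SLO tolerates over the window.

    slo is a ratio such as 0.999 (99.9%); total_requests is the number of
    requests served in the SLO window (an assumed figure in this sketch).
    """
    return round((1.0 - slo) * total_requests)


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = (1.0 - slo) * total_requests
    return 1.0 - failed_requests / budget


if __name__ == "__main__":
    # With a 99.9% SLO and 1,000,000 requests, 1,000 failures are tolerated.
    print(error_budget(0.999, 1_000_000))
    # After 250 failures, three quarters of the budget is left.
    print(round(budget_remaining(0.999, 1_000_000, 250), 4))
```

A remaining budget near zero is the signal to favor reliability work over new features; a healthy budget leaves room for riskier releases.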
  6. Non-Goals

    It's not intended to blame or to serve as a punishment for missing SLOs.
    It's not intended to promote infrastructure teams.
    In fact, it's a business and customer concern.
  7. Availability

    availability = successful requests / total requests
    Calculating the availability SLI over the previous 4 weeks:

      sum(rate(istio_requests_total{destination_service_name="svc-k8s", response_code!~"5.*"}[4w]))
      /
      sum(rate(istio_requests_total{destination_service_name="svc-k8s"}[4w]))
  8. Latency

    Percentage of requests with latency < X seconds.
    Calculating the latency SLI (here the 99th-percentile latency of 2xx requests) over the previous 4 weeks:

      histogram_quantile(
        0.99,
        sum(rate(istio_request_duration_seconds_bucket{destination_service_name="svc-k8s", response_code=~"2.*"}[4w])) by (le)
      )
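
The slide states the SLI as a percentage of requests under a threshold, while the query returns a percentile. A complementary way to look at the same histogram data is to read the fraction of requests under a threshold straight from the cumulative buckets. The sketch below assumes Prometheus-style cumulative `le` buckets; the bucket boundaries and counts are made up for illustration.

```python
from typing import Mapping


def fraction_faster_than(buckets: Mapping[float, int], threshold: float) -> float:
    """Fraction of requests that completed within `threshold` seconds.

    `buckets` maps an upper bound (the `le` label) to a cumulative count,
    and must include float("inf") as the total. The threshold has to be an
    existing bucket boundary; this sketch does not interpolate.
    """
    total = buckets[float("inf")]
    if total == 0:
        raise ValueError("no requests observed")
    if threshold not in buckets:
        raise ValueError(f"{threshold} is not a bucket boundary")
    return buckets[threshold] / total


if __name__ == "__main__":
    # Hypothetical cumulative counts per latency bucket (seconds).
    observed = {0.1: 600, 0.4: 960, 0.95: 995, float("inf"): 1000}
    print(fraction_faster_than(observed, 0.4))
    print(fraction_faster_than(observed, 0.95))
```

Reading the ratio this way only works when the SLO threshold coincides with a bucket boundary, which is a reason to choose histogram buckets with the latency targets in mind.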
  9. Create a proposed SLO

    Based on the SLIs proposed above, define the SLO for a period of 4 weeks:
      99.9% availability
      99% of requests faster than 950ms
      95% of requests faster than 400ms
  10. Optimizing SLO queries

    Record the availability ratio over 5m intervals and take the average over time:

      avg(avg_over_time(istio:api_v2_availability:ratio_rate5m[4w]))
  11. Optimizing the availability query

    successful (non-5xx) requests / total requests

      - record: istio:api_v2_availability:ratio_rate5m
        expr: |
          sum(rate(istio_requests_total{destination_service_name="api-v2", response_code!~"5.*"}[5m]))
          /
          sum(rate(istio_requests_total{destination_service_name="api-v2"}[5m]))
  12. Alerting on SLOs

    Notify on a significant event: an event that consumes a large fraction of the error budget.
    Basic: Target Error Rate ≥ SLO Threshold
    Advanced: Multiple Burn Rate Alerts
  13. Target Error Rate ≥ SLO Threshold

    The most trivial solution.
    If the SLO is 99.9% over 30 days, alert if the error rate over the previous 10m is ≥ 0.1%:

      - alert: HighErrorRate
        expr: istio:api_v2_request_errors:ratio_rate10m >= 0.001
  14. The 10-minute average calculation:

      # 5xx requests / total requests over 10m; the name matches the HighErrorRate alert expression
      - record: istio:api_v2_request_errors:ratio_rate10m
        expr: |
          sum(rate(istio_requests_total{destination_service_name="api-v2", response_code=~"5.*"}[10m]))
          /
          sum(rate(istio_requests_total{destination_service_name="api-v2"}[10m]))
  15. Cons

    Low precision: it fires on many events that do not threaten the SLO.
    A 0.1% error rate for 10 minutes would alert while consuming only about 0.02% of the monthly error budget.
    One could receive up to 144 alerts per day, every day, not act upon any of them, and still meet the SLO.
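
The arithmetic behind this low-precision argument can be checked with a short sketch. It assumes a 30-day SLO period and the 10-minute window of the HighErrorRate rule; the function names are hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def budget_consumed(error_rate: float, slo: float, window_minutes: int,
                    period_days: int = 30) -> float:
    """Fraction of the period's error budget spent by a sustained error rate.

    An error rate equal to (1 - slo) spends the budget at exactly the pace
    that would exhaust it at the end of the period, so the fraction spent
    is that pace multiplied by window / period.
    """
    pace = error_rate / (1.0 - slo)
    return pace * window_minutes / (period_days * MINUTES_PER_DAY)


def max_alerts_per_day(window_minutes: int) -> int:
    """Upper bound on alerts per day with non-overlapping windows."""
    return MINUTES_PER_DAY // window_minutes


if __name__ == "__main__":
    spent = budget_consumed(error_rate=0.001, slo=0.999, window_minutes=10)
    print(f"{spent:.4%} of the 30-day budget")   # about 0.02%
    print(max_alerts_per_day(10), "alerts per day at most")
```

Even if every one of those windows alerted for a full day, the service would only be running at the 0.1% error rate the SLO already allows.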
  16. Multiple Burn Rate Alerts

    A burn alert signals that the error budget is being burned rapidly.
  17. Multiple Burn Rate Alerts

    Fast burning is detected quickly, whereas slow burning requires a longer time window.

      Severity  Long Window  Short Window  for Duration  Burn Rate Factor  Error Budget Consumed
      P1        1h           5m            2m            14.4              2%
      P2        6h           30m           15m           6                 5%
      P4        1d           2h            1h            3                 10%
      P4        3d           6h            1h            1                 10%
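
The "Error Budget Consumed" column follows from the burn rate factor and the long window. A minimal sketch of that relationship, assuming a 30-day SLO period (the period for which the table's figures come out exactly); the function names are hypothetical.

```python
def budget_consumed_by_burn(burn_rate: float, window_hours: float,
                            period_days: int = 30) -> float:
    """Fraction of the error budget spent when burning at `burn_rate`
    for `window_hours`. A burn rate of 1 spends the whole budget in
    exactly one SLO period."""
    return burn_rate * window_hours / (period_days * 24)


def alert_threshold(burn_rate: float, slo: float) -> float:
    """Error ratio above which the burn-rate alert condition holds."""
    return burn_rate * (1.0 - slo)


if __name__ == "__main__":
    # (burn rate factor, long window in hours) for the four table rows.
    rows = [(14.4, 1), (6, 6), (3, 24), (1, 72)]
    for rate, hours in rows:
        print(rate, hours, f"{budget_consumed_by_burn(rate, hours):.0%}")
    # Threshold for the fastest row under a 99.9% SLO: 14.4 * 0.001.
    print(round(alert_threshold(14.4, 0.999), 4))
```

The same `burn rate * (1 - SLO)` product is what appears as `14.4 * 0.001` in burn-rate alert expressions for a 99.9% SLO.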
  18. SLO Error Budget Burn Alert

      alert: SloErrorBudgetBurn
      expr: |
        (istio:remote_config_service_request_errors:ratio_rate1h > (14.4 * 0.001))
        and
        (istio:remote_config_service_request_errors:ratio_rate5m > (14.4 * 0.001))
      for: 2m
      labels:
        severity: P1
        long_window: 1h
        service: remote-config-service
        squad: engineering-experience
        tribe: platform
  19. Pros

    Reduces the number of false positives.
    Sustainable on-call shifts.
    Ability to adapt the monitoring config to many situations according to criticality: alert quickly if the error rate is high; alert eventually if the error rate is low but sustained.
    Set up ticket notifications, e.g. for incidents that go unnoticed but can exhaust the error budget if left unchecked.
  20. Cons

    More numbers, window sizes, and thresholds to manage and reason about.
    An even longer reset time, as a result of the 3d window.
    Multiple alerts can be sent if all conditions are true, e.g. 10% budget spend in 5m also means that 5% of the budget was spent in 6h, and 2% of the budget was spent in 1h.
  21. Modeling User Journeys SLO

    Journey: Customers Checkout

      # Ratio of 5xx responses to all responses on the checkout paths
      - record: elb:checkout_availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{path=~"(cart|subscriptions).*", http_code=~"5.*"}[5m]))
          /
          sum(rate(http_requests_total{path=~"(cart|subscriptions).*"}[5m]))
  22. Modeling User Journeys SLO

    Alert: TooManyCustomersCheckoutErrors

      alert: TooManyCustomersCheckoutErrors
      expr: |
        (elb:checkout_availability:ratio_rate1h > (14.4 * 0.001))
        and
        (elb:checkout_availability:ratio_rate5m > (14.4 * 0.001))
      for: 2m
      labels:
        severity: P1
        long_window: 1h
  23. Pipeline Services SLO

    SLO: 99.9% of produced subscriptions should be consumed in less than 5m.
    Monitoring: instrument producing and consuming times.
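
Once producing and consuming times are instrumented, the pipeline SLI is the share of messages consumed within the limit. A minimal sketch under stated assumptions: events are available as (produced_at, consumed_at) pairs, and a message that has not been consumed yet counts against the SLI. The function name and the sample data are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence, Tuple

Event = Tuple[datetime, Optional[datetime]]  # (produced_at, consumed_at or None)


def fraction_consumed_in_time(events: Sequence[Event],
                              limit: timedelta = timedelta(minutes=5)) -> float:
    """Share of produced messages consumed in less than `limit`."""
    if not events:
        return 1.0  # nothing produced, so nothing violated the objective
    on_time = sum(
        1 for produced, consumed in events
        if consumed is not None and consumed - produced < limit
    )
    return on_time / len(events)


if __name__ == "__main__":
    t0 = datetime(2019, 10, 18, 12, 0, 0)
    sample = [
        (t0, t0 + timedelta(minutes=1)),             # on time
        (t0, t0 + timedelta(minutes=4, seconds=59)), # on time
        (t0, t0 + timedelta(minutes=5)),             # too late (not < 5m)
        (t0, None),                                  # never consumed
    ]
    print(fraction_consumed_in_time(sample))  # compare against the 0.999 target
```

In a metrics-based setup the same ratio would typically come from a histogram of consume-minus-produce durations rather than from raw events.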
  24. SLO implementation challenges

    Requires solid monitoring and alerting infrastructure.
    Metrics, alerts, and dashboards should be created via automation, with as little human intervention as possible.
    SLOs without error budget adoption are useless.
    SLO refinements and reviews.
    Complicated edge layer.