Adopting SLO and Error Budget

Slide 1

Slide 1 text

Adopting SLO & Error Budget

Slide 2

Slide 2 text

Agenda
- SLO and SLI Concepts
- Why SLOs?
- Error Budget
- SLO Process
- Alerting on SLOs
- Challenges

Slide 3

Slide 3 text

Service Level Objective (SLO)
- Target level of reliability for the service's end users
- It's a tool for balancing the addition of new features against reliability
- A philosophy of monitoring and managing systems
- The most important SRE concept
- It's expressed as a percentage or ratio over a rolling time window, e.g. "99.9% for any given 30d"
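As a rough illustration of the "99.9% for any given 30d" example, the sketch below converts an SLO into the downtime it tolerates. It assumes the SLO is read as time-based availability; `allowed_downtime` is a hypothetical helper, not part of any tool named in this deck.

```python
from datetime import timedelta


def allowed_downtime(slo: float, window: timedelta) -> timedelta:
    """Total unavailability tolerated by `slo` (e.g. 0.999) over `window`.

    Assumption: availability is measured in time, not in requests.
    """
    if not 0.0 <= slo <= 1.0:
        raise ValueError("slo must be a ratio between 0 and 1")
    return window * (1.0 - slo)


if __name__ == "__main__":
    # 99.9% over a rolling 30-day window tolerates roughly 43 minutes.
    print(allowed_downtime(0.999, timedelta(days=30)))
```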

Slide 4

Slide 4 text

Why SLOs?
- Provide context that allows engineering teams to make data-driven decisions about service availability and performance
- Provide clear and accurate statements about the impact of production incidents

Slide 5

Slide 5 text

Service Level Indicators (SLI)
- Key measurements of availability
- Usually request latency and error rate
- Metrics are measured high up in the stack, e.g. at the load balancer
- SLOs for UIs have multiple SLIs:
  - Page load time
  - Browser interaction time (by Geo)
  - Server error rate
  - JS error rate (by Geo)

Slide 6

Slide 6 text

Error Budget
- Provides an incentive to balance reliability against new features
- Gives teams permission to focus on service reliability "for customers"
- Total number of failures tolerated by the SLO
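To make "total number of failures tolerated by the SLO" concrete, here is a minimal sketch for a request-based SLO. The function names and the traffic figures in the example are hypothetical.

```python
def error_budget_requests(slo: float, total_requests: int) -> int:
    """Number of failed requests the SLO tolerates for the given traffic."""
    return int(round((1.0 - slo) * total_requests))


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = (1.0 - slo) * total_requests
    if budget <= 0:
        raise ValueError("a 100% SLO or zero traffic leaves no error budget")
    return 1.0 - failed_requests / budget


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests against a 99.9% SLO.
    print(error_budget_requests(0.999, 1_000_000))
    print(budget_remaining(0.999, 1_000_000, 250))
```

With these made-up numbers the budget is 1,000 failed requests, and 250 failures leave about three quarters of it unspent.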

Slide 7

Slide 7 text

Non-Goals
- It's not intended to blame or to serve as a punishment for missing SLOs
- It's not intended to promote infrastructure teams
- In fact, it's a business and customer concern

Slide 8

Slide 8 text

Availability
availability = successful requests / total requests
Calculating the availability SLI over the previous 4 weeks:

    sum(
      rate(istio_requests_total{destination_service_name="svc-k8s", response_code!~"5.*"}[4w])
    )
    /
    sum(
      rate(istio_requests_total{destination_service_name="svc-k8s"}[4w])
    )

Slide 9

Slide 9 text

Latency
Percentage of requests with latency < X seconds
Calculating the latency SLI over the previous 4 weeks:

    histogram_quantile(
      0.99,
      sum(
        rate(istio_request_duration_seconds_bucket{destination_service_name="svc-k8s", response_code=~"2.*"}[4w])
      ) by (le)
    )
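The "percentage of requests with latency < X" view can also be read straight off cumulative histogram buckets. The sketch below shows the idea on plain numbers; the bucket bounds and counts are invented, and the threshold is assumed to coincide with a bucket bound (`le` means "less than or equal").

```python
from typing import Mapping

INF = float("inf")


def fraction_within(buckets: Mapping[float, float], threshold: float) -> float:
    """Share of requests with latency <= threshold.

    `buckets` maps each upper bound (`le`) to its cumulative request count,
    in the style of a Prometheus histogram. Assumption: `threshold` is one
    of the bucket bounds, so no interpolation is attempted.
    """
    total = buckets[INF]
    if total <= 0:
        raise ValueError("no requests observed")
    if threshold not in buckets:
        raise KeyError(f"{threshold} is not a bucket bound")
    return buckets[threshold] / total


if __name__ == "__main__":
    # Hypothetical cumulative counts per upper bound, in seconds.
    observed = {0.1: 800, 0.4: 960, 1.0: 995, INF: 1000}
    print(fraction_within(observed, 0.4))
```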

Slide 10

Slide 10 text

Create a proposed SLO
Based on the previously proposed SLIs, define the SLO for a period of 4 weeks:
SLO
- 99.9% availability
- 99% of requests faster than 950ms
- 95% of requests faster than 400ms
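One way to read the latency targets: "99% of requests faster than 950ms" holds when the 99th-percentile latency is below 950ms, and likewise for the 95th percentile and 400ms. The sketch below checks measured values against the three proposed targets; the function name, result keys, and sample measurements are hypothetical.

```python
from typing import Dict


def meets_proposed_slo(availability: float,
                       p99_seconds: float,
                       p95_seconds: float) -> Dict[str, bool]:
    """Compare measured 4-week SLIs with the proposed targets."""
    return {
        "availability": availability >= 0.999,  # 99.9% availability
        "latency_p99": p99_seconds < 0.950,     # 99% faster than 950ms
        "latency_p95": p95_seconds < 0.400,     # 95% faster than 400ms
    }


if __name__ == "__main__":
    # Hypothetical measurements for the last 4 weeks.
    print(meets_proposed_slo(availability=0.9995, p99_seconds=0.8, p95_seconds=0.35))
```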

Slide 11

Slide 11 text

Optimizing SLO queries
Record the availability ratio over 5m intervals and take the average over time:

    avg(
      avg_over_time(istio:api_v2_availability:ratio_rate5m[4w])
    )
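Averaging pre-recorded 5m ratios gives every interval the same weight, regardless of how much traffic it carried. The sketch below contrasts that with a request-weighted ratio on invented numbers; skipping intervals without traffic is an assumption of this sketch, not a statement about how the recording rule behaves.

```python
from typing import Sequence, Tuple

Interval = Tuple[int, int]  # (successful requests, total requests)


def average_of_interval_ratios(intervals: Sequence[Interval]) -> float:
    """Mean of per-interval availability; each interval counts equally."""
    ratios = [ok / total for ok, total in intervals if total > 0]
    if not ratios:
        raise ValueError("no interval with traffic")
    return sum(ratios) / len(ratios)


def request_weighted_ratio(intervals: Sequence[Interval]) -> float:
    """Availability over all requests; busy intervals weigh more."""
    ok = sum(o for o, _ in intervals)
    total = sum(t for _, t in intervals)
    if total == 0:
        raise ValueError("no requests observed")
    return ok / total


if __name__ == "__main__":
    # Hypothetical: one busy healthy interval, one quiet degraded interval.
    sample = [(990, 1000), (5, 10)]
    print(average_of_interval_ratios(sample), request_weighted_ratio(sample))
```

The two figures diverge most when traffic is uneven, which may matter when comparing the averaged value against a request-based SLO.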

Slide 12

Slide 12 text

Optimizing availability query
successful (non-5xx) requests / total requests

    - record: istio:api_v2_availability:ratio_rate5m
      expr: |
        sum(
          rate(istio_requests_total{destination_service_name="api-v2", response_code!~"5.*"}[5m])
        )
        /
        sum(
          rate(istio_requests_total{destination_service_name="api-v2"}[5m])
        )

Slide 13

Slide 13 text

Optimizing SLO queries
Goal
- Alleviate the load on monitoring services
- Faster visualization
Non-Goal
- Send alerts

Slide 14

Slide 14 text

Alerting on SLOs
Notify on a significant event: an event that consumes a large fraction of the error budget
- Basic: Target Error Rate ≥ SLO Threshold
- Advanced: Multiple Burn Rate Alerts

Slide 15

Slide 15 text

Target Error Rate ≥ SLO Threshold
The most trivial solution
If the SLO is 99.9% over 30 days, alert if the error rate over the previous 10m is ≥ 0.1%

    - alert: HighErrorRate
      expr: istio:api_v2_request_errors:ratio_rate10m >= 0.001

Slide 16

Slide 16 text

The 10-minute error ratio calculation:

    - record: istio:api_v2_request_errors:ratio_rate10m
      expr: |
        sum(
          rate(istio_requests_total{destination_service_name="api-v2", response_code=~"5.*"}[10m])
        )
        /
        sum(
          rate(istio_requests_total{destination_service_name="api-v2"}[10m])
        )

Slide 17

Slide 17 text

Target Error Rate ≥ SLO Threshold

Slide 18

Slide 18 text

Pros
- Easy to implement, maintain and reason about

Slide 19

Slide 19 text

Cons
- Low precision: fires on many events that do not threaten the SLO.
- A 0.1% error rate for 10 minutes would alert, while consuming only 0.02% of the monthly error budget.
- One could receive up to 144 alerts per day every day, not act upon any of them, and still meet the SLO.
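The low-precision argument can be checked with a little arithmetic. The sketch below assumes constant traffic and a 30-day SLO period, and takes the alert window as a parameter; both helpers are hypothetical.

```python
def budget_fraction_consumed(error_rate: float, slo: float,
                             window_minutes: float, period_days: int = 30) -> float:
    """Share of the period's error budget burned by `error_rate` sustained
    for `window_minutes`, assuming constant traffic."""
    budget_rate = 1.0 - slo  # tolerated error ratio, e.g. 0.001
    if budget_rate <= 0:
        raise ValueError("a 100% SLO leaves no error budget")
    period_minutes = period_days * 24 * 60
    return (error_rate / budget_rate) * (window_minutes / period_minutes)


def max_alerts_per_day(window_minutes: int) -> int:
    """Number of non-overlapping alert windows that fit in one day."""
    return (24 * 60) // window_minutes


if __name__ == "__main__":
    # Error rate exactly at the 0.1% threshold for one 10m window.
    print(f"{budget_fraction_consumed(0.001, 0.999, 10):.4%}")
    print(max_alerts_per_day(10))
```

With the 10m window used by the HighErrorRate rule, one window at the threshold burns about 0.02% of the monthly budget, and 144 such windows fit in a day.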

Slide 20

Slide 20 text

Multiple Burn Rate Alerts
A burn alert signals that the error budget is being burned rapidly

Slide 21

Slide 21 text

Multiple Burn Rate Alerts
Fast burning is detected quickly, whereas slow burning requires a longer time window.

Severity  Long Window  Short Window  for Duration  Burn Rate Factor  Error Budget Consumed
P1        1h           5m            2m            14.4              2%
P2        6h           30m           15m           6                 5%
P4        1d           2h            1h            3                 10%
P4        3d           6h            1h            1                 10%
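The "Error Budget Consumed" column follows from the burn rate factor and the long window: consumed = burn rate × long window / SLO period. The sketch below reproduces the table's percentages, assuming a 30-day SLO period (the figures in the table match 30 days rather than 4 weeks); `budget_consumed` is a hypothetical helper.

```python
from datetime import timedelta

SLO_PERIOD = timedelta(days=30)  # assumption: 30-day SLO period


def budget_consumed(burn_rate: float, long_window: timedelta,
                    period: timedelta = SLO_PERIOD) -> float:
    """Fraction of the error budget spent when errors arrive at
    `burn_rate` times the tolerated rate for the whole `long_window`."""
    return burn_rate * (long_window / period)


if __name__ == "__main__":
    rows = [
        (14.4, timedelta(hours=1)),
        (6, timedelta(hours=6)),
        (3, timedelta(days=1)),
        (1, timedelta(days=3)),
    ]
    for rate, window in rows:
        print(f"burn rate {rate} over {window}: {budget_consumed(rate, window):.0%}")
```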

Slide 22

Slide 22 text

SLO Error Budget Burn Alert

    alert: SloErrorBudgetBurn
    expr: |
      (istio:remote_config_service_request_errors:ratio_rate1h > (14.4 * 0.001))
      and
      (istio:remote_config_service_request_errors:ratio_rate5m > (14.4 * 0.001))
    for: 2m
    labels:
      severity: P1
      long_window: 1h
      service: remote-config-service
      squad: engineering-experience
      tribe: platform
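The rule's condition can be sketched in plain code: both the long-window and the short-window error ratios must exceed burn rate × (1 - SLO). The long window shows that enough budget was spent, and the short window shows that the burn is still happening. This is an illustration only; it ignores the `for: 2m` hold time, and the sample ratios are invented.

```python
def burn_alert_fires(long_window_error_ratio: float,
                     short_window_error_ratio: float,
                     burn_rate: float = 14.4,
                     slo: float = 0.999) -> bool:
    """True when both windows exceed the burn-rate threshold.

    Assumption: the `for:` duration of the real rule is not modeled.
    """
    threshold = burn_rate * (1.0 - slo)  # 14.4 * 0.001 = 0.0144
    return (long_window_error_ratio > threshold
            and short_window_error_ratio > threshold)


if __name__ == "__main__":
    print(burn_alert_fires(0.02, 0.03))    # ongoing fast burn
    print(burn_alert_fires(0.02, 0.001))   # burn already stopped
    print(burn_alert_fires(0.005, 0.05))   # short spike, little budget spent
```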

Slide 23

Slide 23 text

Pros
- Reduces the number of false positives
- Sustainable on-call shifts
- Ability to adapt the monitoring config to many situations according to criticality: alert quickly if the error rate is high; alert eventually if the error rate is low but sustained
- Ticket notifications can be set up, e.g. for incidents that go unnoticed but can exhaust the error budget if left unchecked

Slide 24

Slide 24 text

Cons
- More numbers, window sizes, and thresholds to manage and reason about.
- An even longer reset time, as a result of the 3d window.
- Multiple alerts can be sent if all conditions are true. Ex: 10% of the budget spent in 5m also means that 5% of the budget was spent in 6h, and 2% of the budget was spent in 1h

Slide 25

Slide 25 text

Advanced Topics
- Modeling user journeys
- Infrastructure, pipeline services
- SLIs/SLOs for UIs

Slide 26

Slide 26 text

Modeling User Journeys
SLO Journey: Customers Checkout

    - record: elb:checkout_availability:ratio_rate5m
      expr: |
        sum(
          rate(http_requests_total{path=~"(cart|subscriptions).*", http_code=~"5.*"}[5m])
        )
        /
        sum(
          rate(http_requests_total{path=~"(cart|subscriptions).*"}[5m])
        )

Slide 27

Slide 27 text

Modeling User Journeys
SLO Alert: TooManyCustomersCheckoutErrors

    alert: TooManyCustomersCheckoutErrors
    expr: |
      (elb:checkout_availability:ratio_rate1h > (14.4 * 0.001))
      and
      (elb:checkout_availability:ratio_rate5m > (14.4 * 0.001))
    for: 2m
    labels:
      severity: P1
      long_window: 1h

Slide 28

Slide 28 text

Pipeline Services SLO #in-492

Slide 29

Slide 29 text

Pipeline Services SLO
SLO: 99.9% of produced subscriptions should be consumed in less than 5m
Monitoring: instrument producing and consuming times
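Once producing and consuming times are instrumented, the SLI is the share of subscriptions consumed within the limit. The sketch below shows one possible calculation; the event shape and the choice to count never-consumed items as misses are assumptions, and the timestamps are invented.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence, Tuple

# (produced_at, consumed_at); consumed_at is None if not consumed yet.
Event = Tuple[datetime, Optional[datetime]]


def fraction_consumed_within(events: Sequence[Event],
                             limit: timedelta = timedelta(minutes=5)) -> float:
    """Share of produced items consumed in less than `limit`.

    Assumption: items that were never consumed count against the SLI.
    """
    if not events:
        raise ValueError("no events to evaluate")
    on_time = sum(
        1 for produced, consumed in events
        if consumed is not None and consumed - produced < limit
    )
    return on_time / len(events)


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 12, 0, 0)  # hypothetical timestamps
    sample = [
        (t0, t0 + timedelta(minutes=1)),
        (t0, t0 + timedelta(minutes=4, seconds=59)),
        (t0, t0 + timedelta(minutes=6)),
        (t0, None),
    ]
    sli = fraction_consumed_within(sample)
    print(sli, sli >= 0.999)
```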

Slide 30

Slide 30 text

SLO implementation challenges
- Requires solid monitoring and alerting infrastructure
- Metrics, alerts and dashboards should be created via automation, with as little human intervention as possible
- SLOs are useless without error budget adoption
- SLO refinements/reviews
- Complicated edge layer

Slide 31

Slide 31 text

Thank you!