service end users. It's a tool for balancing the addition of new features against reliability, and a philosophy of monitoring and managing systems. The most important SRE concept. An SLO is expressed as a percentage or ratio over a rolling time window, e.g. "99.9% for any given 30d".
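A minimal PromQL sketch of checking such a target, assuming Istio's standard istio_requests_total counter and the svc-k8s service used in the latency example below (both the success definition and the service name are assumptions):

    # Availability SLI: fraction of non-5xx responses over a rolling 30d window,
    # compared directly against the 0.999 target
    sum(rate(istio_requests_total{destination_service_name="svc-k8s", response_code!~"5.*"}[30d]))
    /
    sum(rate(istio_requests_total{destination_service_name="svc-k8s"}[30d]))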
request latency and rate of errors. Metrics are measured high up in the stack, e.g. at the load balancer. SLOs for UIs have multiple SLIs: page load time, browser interaction time (by Geo), server error rate, JS error rate (by Geo).
for latency SLIs over the previous 4 weeks:

    histogram_quantile(
      0.99,
      sum by (le) (
        rate(istio_request_duration_seconds_bucket{destination_service_name="svc-k8s", response_code=~"2.*"}[4w])
      )
    )
If the SLO is 99.9% over 30 days, alert if the error rate over the previous 10m is ≥ 0.1%:

    - alert: HighErrorRate
      expr: istio:api_v2_request_errors:ratio_rate10m >= 0.001
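The alert relies on a recording rule; a sketch of how istio:api_v2_request_errors:ratio_rate10m might be defined, assuming istio_requests_total as the source metric and 5xx responses as the error definition (both are assumptions, not from the original setup):

    groups:
      - name: slo-recording-rules
        rules:
          - record: istio:api_v2_request_errors:ratio_rate10m
            expr: |
              sum(rate(istio_requests_total{destination_service_name="svc-k8s", response_code=~"5.*"}[10m]))
              /
              sum(rate(istio_requests_total{destination_service_name="svc-k8s"}[10m]))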
threaten the SLO. A 0.1% error rate sustained for the full 10m window would alert, while consuming only ~0.02% of the monthly error budget. One could receive up to 144 alerts per day every day, act on none of them, and still meet the SLO.
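The arithmetic behind those figures, assuming non-overlapping 10m windows:

    10m / 30d = 10 / 43,200 ≈ 0.023% of the window,
      so one 10m burn at exactly the 0.1% threshold spends ≈ 0.02% of the budget
    1,440m per day / 10m = 144 alert windows per day
    144 × 30d × (10 / 43,200) = 100% of the budget, i.e. the SLO is still (just barely) met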
Ability to adapt the monitoring config to many situations according to criticality: alert quickly if the error rate is high; alert eventually if the error rate is low but sustained. Set up ticket notifications for the low-urgency case, e.g. incidents that go unnoticed but can exhaust the error budget if left unchecked (see the sketch below).
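A sketch of such criticality tiers as burn-rate alerts, assuming 1h and 3d recording rules analogous to the 10m one above; the rule names, burn-rate factors (14.4 ≈ 2% of a 30d budget in 1h, 1 ≈ 10% in 3d), and severity labels are assumptions:

    - alert: FastErrorBudgetBurn
      expr: istio:api_v2_request_errors:ratio_rate1h >= 14.4 * 0.001  # pages a human
      labels:
        severity: page
    - alert: SlowErrorBudgetBurn
      expr: istio:api_v2_request_errors:ratio_rate3d >= 1 * 0.001     # files a ticket
      labels:
        severity: ticket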
reason about. An even longer reset time, as a result of the 3d window. Multiple alerts can be sent if all conditions are true, e.g. spending 10% of the budget in 5m also satisfies the 5%-of-budget-in-6h and 2%-of-budget-in-1h conditions, so all three alerts fire.
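A multiwindow variant mitigates the long reset time: the long window detects sustained burn, while the shorter window stops the alert from firing once the error rate recovers. A sketch, again assuming matching recording rules:

    - alert: ErrorBudgetBurn
      expr: |
        istio:api_v2_request_errors:ratio_rate1h >= 14.4 * 0.001
        and
        istio:api_v2_request_errors:ratio_rate5m >= 14.4 * 0.001
      labels:
        severity: page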
and dashboards should be created via automation, with as little human intervention as possible. SLOs are useless without error-budget adoption. Open points: SLO refinements/reviews; a complicated edge layer.