service end users. It's a tool for balancing the addition of new features against reliability, and a philosophy of monitoring and managing systems. The most important SRE concept. An SLO is expressed as a percentage or ratio over a rolling time window, e.g. "99.9% for any given 30d".
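A minimal PromQL sketch of checking such a target, assuming Istio's standard istio_requests_total counter and the svc-k8s service used in the latency example below (both the success definition and the service name are assumptions):

    # Availability SLI: fraction of non-5xx responses over a rolling 30d window,
    # compared directly against the 0.999 target
    sum(rate(istio_requests_total{destination_service_name="svc-k8s", response_code!~"5.*"}[30d]))
    /
    sum(rate(istio_requests_total{destination_service_name="svc-k8s"}[30d]))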
request latency and rate of errors. Metrics are measured high up in the stack, e.g. at the load balancer. SLOs for UIs have multiple SLIs: page load time, browser interaction time (by Geo), server error rate, JS error rate (by Geo).
for latency SLIs over the previous 4 weeks:

    histogram_quantile(
      0.99,
      sum by (le) (
        rate(istio_request_duration_seconds_bucket{destination_service_name="svc-k8s", response_code=~"2.*"}[4w])
      )
    )
If the SLO is 99.9% over 30 days, alert if the error rate over the previous 10m is ≥ 0.1%:

    - alert: HighErrorRate
      expr: istio:api_v2_request_errors:ratio_rate10m >= 0.001
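The alert relies on a recording rule; a sketch of how istio:api_v2_request_errors:ratio_rate10m might be defined, assuming istio_requests_total as the source metric and 5xx responses as the error definition (both are assumptions, not from the original setup):

    groups:
      - name: slo-recording-rules
        rules:
          - record: istio:api_v2_request_errors:ratio_rate10m
            expr: |
              sum(rate(istio_requests_total{destination_service_name="svc-k8s", response_code=~"5.*"}[10m]))
              /
              sum(rate(istio_requests_total{destination_service_name="svc-k8s"}[10m]))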
threaten the SLO. A 0.1% error rate sustained for the full 10m window would alert, while consuming only ~0.02% of the monthly error budget. One could receive up to 144 alerts per day every day, act on none of them, and still meet the SLO.
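The arithmetic behind those figures, assuming non-overlapping 10m windows:

    10m / 30d = 10 / 43,200 ≈ 0.023% of the window,
      so one 10m burn at exactly the 0.1% threshold spends ≈ 0.02% of the budget
    1,440m per day / 10m = 144 alert windows per day
    144 × 30d × (10 / 43,200) = 100% of the budget, i.e. the SLO is still (just barely) met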
Ability to adapt the monitoring config to many situations according to criticality: alert quickly if the error rate is high; alert eventually if the error rate is low but sustained. Set up ticket notifications for the low-urgency case, e.g. incidents that go unnoticed but can exhaust the error budget if left unchecked (see the sketch below).
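A sketch of such criticality tiers as burn-rate alerts, assuming 1h and 3d recording rules analogous to the 10m one above; the rule names, burn-rate factors (14.4 ≈ 2% of a 30d budget in 1h, 1 ≈ 10% in 3d), and severity labels are assumptions:

    - alert: FastErrorBudgetBurn
      expr: istio:api_v2_request_errors:ratio_rate1h >= 14.4 * 0.001  # pages a human
      labels:
        severity: page
    - alert: SlowErrorBudgetBurn
      expr: istio:api_v2_request_errors:ratio_rate3d >= 1 * 0.001     # files a ticket
      labels:
        severity: ticket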
reason about. An even longer reset time, as a result of the 3d window. Multiple alerts can be sent if all conditions are true, e.g. spending 10% of the budget in 5m also satisfies the 5%-of-budget-in-6h and 2%-of-budget-in-1h conditions, so all three alerts fire.
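A multiwindow variant mitigates the long reset time: the long window detects sustained burn, while the shorter window stops the alert from firing once the error rate recovers. A sketch, again assuming matching recording rules:

    - alert: ErrorBudgetBurn
      expr: |
        istio:api_v2_request_errors:ratio_rate1h >= 14.4 * 0.001
        and
        istio:api_v2_request_errors:ratio_rate5m >= 14.4 * 0.001
      labels:
        severity: page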
and dashboards should be created via automation, with as little human intervention as possible. SLOs are useless without error-budget adoption. Open points: SLO refinements/reviews; a complicated edge layer.