of reliability for the service's customers
It's a tool for making data-driven decisions about reliability: features or reliability?
The most important SRE concept: without SLOs, there is no need for Site Reliability Engineering
request latency and error rate
Request metrics are measured at the load balancer
An SLO for UIs has multiple SLIs:
  Page load time
  Browser interaction time (by geo)
  Server error rate
  JS error rate (by geo)
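One way to picture an SLO built from several SLIs is as a set of named indicators, each with its own limit. The sketch below is illustrative only: the `SLI` class, the `violated_slis` helper, the metric names, and every threshold value are assumptions made for the example, not figures from these slides.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SLI:
    name: str
    max_allowed: float  # hypothetical upper limit; lower measurements are better


# Hypothetical limits for a UI SLO; real values come from the service owner.
UI_SLO = [
    SLI("page_load_time_seconds", 2.0),
    SLI("browser_interaction_time_seconds", 0.1),
    SLI("server_error_rate", 0.001),
    SLI("js_error_rate", 0.005),
]


def violated_slis(slo: List[SLI], measured: Dict[str, float]) -> List[str]:
    """Return the names of the SLIs whose measured value exceeds its limit."""
    return [sli.name for sli in slo if measured[sli.name] > sli.max_allowed]
```

Under these assumptions the SLO is met only when `violated_slis` returns an empty list. For the SLIs tracked by geo, one could pass one `measured` dictionary per region.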
happy with the service
Below this threshold, users are likely to start complaining or stop using the service
Ultimately, user happiness is what matters: we keep our services reliable to keep our customers happy
Usually owned by Product Owners
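Availability SLIs over the previous 4 weeks
1 - (
  sum(rate(http_requests_total{service=~"my-service", status=~"5.*"}[4w]))
  /
  sum(rate(http_requests_total{service=~"my-service"}[4w]))
)
The ratio inside the parentheses is the share of requests that returned a 5xx status; availability is one minus that share.
More about Aggregate Availability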
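The arithmetic behind an aggregate availability SLI can be sketched in a few lines: availability is the fraction of requests that did not return a 5xx status, that is, one minus the ratio of 5xx requests to all requests. The function names and the sample numbers below are assumptions made for illustration.

```python
def availability(total_requests: float, error_requests: float) -> float:
    """Fraction of requests in the window that did not fail (1 - error ratio)."""
    if total_requests <= 0:
        raise ValueError("no traffic in the window; availability is undefined")
    return 1.0 - error_requests / total_requests


def meets_slo(total_requests: float, error_requests: float, target: float) -> bool:
    """True when the measured availability reaches the target (e.g. 0.999)."""
    return availability(total_requests, error_requests) >= target


# Hypothetical 4-week window: 1,000,000 requests, 500 of them 5xx.
window_ok = meets_slo(1_000_000, 500, target=0.999)
```

Because the ratio is computed over all requests in the window, busy periods weigh more than quiet ones, which is what distinguishes aggregate availability from time-based uptime.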
for Latency SLIs over the previous 4 weeks
histogram_quantile(0.95, sum(rate(request_duration_seconds_bucket{ingress=~"my-service"}[4w])) by (le))
histogram_quantile(0.99, sum(rate(request_duration_seconds_bucket{ingress=~"my-service"}[4w])) by (le))
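To make the latency queries less opaque, here is a simplified Python sketch of how a quantile can be estimated from cumulative histogram buckets by linear interpolation, which is the general approach `histogram_quantile` takes. It is not Prometheus's implementation: it assumes positive bucket bounds, a final `+Inf` bucket, and a lower bound of 0 for the first bucket, and the function name and sample bucket counts are made up for the example.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    Assumes positive upper bounds and that the last bucket is +Inf.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        raise ValueError("need at least two buckets, the last one being +Inf")

    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window

    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]

    if upper == math.inf:
        # The quantile falls above the highest finite bound; report that bound.
        return buckets[-2][0]

    lower, previous = buckets[index - 1] if index > 0 else (0.0, 0.0)
    if count == previous:
        return lower  # empty bucket (only reachable when rank is 0)
    # Interpolate linearly inside the bucket that contains the rank.
    return lower + (upper - lower) * (rank - previous) / (count - previous)


# Hypothetical request durations in seconds: 100 observations in 4 buckets.
sample = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
p95 = estimate_quantile(0.95, sample)
```

The estimate is only as precise as the bucket layout: a quantile that lands in a wide bucket, or in the `+Inf` bucket, is a rough figure, so bucket bounds are best chosen close to the latency threshold of the SLO.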