
Effective Alerting

Rafael Jesus
December 11, 2018
Transcript

  1. Agenda
     - Avoid Alert Fatigue
     - Alerting on SLOs
     - Reducing Noisy Alerts
     - Runbooks
     - Monitoring at the Edge
     - Alerting Sensitivity
  2. Avoid Alert Fatigue
     - Pod was restarted
     - Error rate > 1 ops
     - DB query too slow
     - CPU/memory is too high
     - Health check failed
     - Notifications to private Slack channels
     Alerts ring all the time; when they aren't ringing, people assume something is wrong.
  3. Private Slack Channel Notifications
     - They are just notifications
     - Usually too noisy and non-actionable
     - A productivity killer
     - Operators and on-call engineers are not aware of them
     - No post mortems, so no learnings
  4. Alerting on SLOs
     SLI -> Indicator: a key measurement
     - distribution of response times over 5m
     - response error ratio over 5m
     SLO -> Objective: an availability goal
     - 99% of response times below 500ms
     - 99% availability during 4 weeks, i.e. lower than a 1% rate of errors
     Alert when your service level objective (SLO) is affected.
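The SLI measurements referenced above could be precomputed with Prometheus recording rules along the following lines. This is a minimal sketch: the underlying metric names (http_requests_total, http_request_duration_seconds_bucket) and the code label are assumptions, not taken from the deck.

```yaml
groups:
  - name: sli-recording-rules
    rules:
      # Error-ratio SLI: share of 5xx responses over the last 5 minutes.
      # Assumed source metric: http_requests_total with a `code` label.
      - record: http_requests_5xx:rate5m
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m]))
      # Latency SLI: 95th percentile of successful (2xx) requests over 5 minutes.
      # Assumed source metric: http_request_duration_seconds_bucket histogram.
      - record: http_requests_2xx_95th_latency:rate5m
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{code=~"2.."}[5m])) by (le))
```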
  5. Availability Affected
     Alert -> TooManyUsersCheckoutErrors
     Expression -> http_requests_5xx:rate5m > 0.05
     Description -> Alert on the proportion of traffic, i.e. 5% of requests failing as measured over the last 5 minutes.
     Avoid alerting on static thresholds (e.g. > 5 errors seen) and short windows such as 1 minute; they are too noisy.
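Wired into a Prometheus alerting rule, the expression above might look like this; the for duration and severity label are illustrative assumptions.

```yaml
groups:
  - name: checkout-availability
    rules:
      - alert: TooManyUsersCheckoutErrors
        # More than 5% of requests failing, measured over the last 5 minutes.
        expr: http_requests_5xx:rate5m > 0.05
        for: 5m            # assumed: require the condition to hold before paging
        labels:
          severity: page   # assumed label
        annotations:
          description: "More than 5% of checkout requests are failing over the last 5 minutes."
```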
  6. Slow Response Times
     Alert -> TooManySlowUsersQueries
     Expression -> http_requests_2xx_95th_latency:rate5m > 10s
     Description -> Alert on the 95th percentile latency of good (2xx) user requests as measured over the last 5 minutes.
     Avoid alerting on 5xx latency (failed requests are typically slow), on 99th percentiles, and on short windows such as 1 minute; they are too noisy.
  7. Slow is the new Down
     53% of mobile users abandon web pages that take more than 3 seconds to load.
  8. Reducing Noisy Alerts
     During outages both alerts will get triggered. Avoid that by combining them with the unless operator:
     TooManyUsersCheckoutErrors -> http_requests_5xx:rate5m > 0.05 unless http_requests_2xx_95th_latency:rate5m > 10
     TooManySlowUsersQueries -> http_requests_2xx_95th_latency:rate5m > 10 unless http_requests_5xx:rate5m > 0.05
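As a sketch, the two rules with the mutual unless guard could be declared together as below. Note that unless only drops series from the left-hand side when the label sets match on both sides, so the two recording rules need compatible labels (or an on()/ignoring() modifier) for the suppression to take effect; the for durations and severity labels are assumptions.

```yaml
groups:
  - name: checkout-slo-alerts
    rules:
      - alert: TooManyUsersCheckoutErrors
        # High error ratio, unless latency is also above its threshold.
        expr: |
          http_requests_5xx:rate5m > 0.05
            unless
          http_requests_2xx_95th_latency:rate5m > 10
        for: 5m
        labels:
          severity: page
      - alert: TooManySlowUsersQueries
        # High latency, unless the error ratio is also above its threshold.
        expr: |
          http_requests_2xx_95th_latency:rate5m > 10
            unless
          http_requests_5xx:rate5m > 0.05
        for: 5m
        labels:
          severity: page
```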
  9. Runbooks
     Write a minimal viable runbook for each alert:
     - Add dashboards to investigate the issue
     - Explain how to mitigate the issue, with bash scripts and so on
     Hint: if there is neither a clear action nor a dashboard to investigate the alert, just remove the alert.
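One way to make the runbook and dashboards reachable from the alert itself is to attach them as annotations on the rule; the example reuses the earlier checkout alert, and the URLs are placeholders, not links from the deck.

```yaml
groups:
  - name: checkout-availability
    rules:
      - alert: TooManyUsersCheckoutErrors
        expr: http_requests_5xx:rate5m > 0.05
        for: 5m
        annotations:
          summary: "Checkout error ratio above 5% for 5 minutes."
          runbook_url: "https://example.com/runbooks/checkout-errors"   # placeholder
          dashboard_url: "https://example.com/dashboards/checkout"      # placeholder
```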
  10. Monitoring at the Edge
     Measure at the highest level, i.e. the load balancer. If a backend goes away, you still have metrics.
  11. Post Mortem 1: API Gateway Outage
     Since traffic is not routed to the upstream, there are no metrics and no alerts.
  12. Alert for errors under the alert threshold
     Alert -> IncreaseOfHTTP5xxErrorsOnMyApp
     Expression -> ingress:api_v2_http_5xx_increased:rate15m > 0.01
     Description -> The percentage of HTTP 5xx status codes has increased on My App.
     Avoid alerting on short windows such as 1-10 minutes; they are too noisy for this kind of alert.
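A sketch of how the edge-level recording rule and alert could be wired, assuming an NGINX ingress controller in front of the service; the metric name nginx_ingress_controller_requests and the ingress label value are assumptions.

```yaml
groups:
  - name: edge-alerts
    rules:
      # Measured at the ingress / load balancer, so the signal survives even
      # when the backend stops serving (and stops reporting) metrics.
      - record: ingress:api_v2_http_5xx_increased:rate15m
        expr: |
          sum(rate(nginx_ingress_controller_requests{ingress="api-v2", status=~"5.."}[15m]))
            /
          sum(rate(nginx_ingress_controller_requests{ingress="api-v2"}[15m]))
      - alert: IncreaseOfHTTP5xxErrorsOnMyApp
        expr: ingress:api_v2_http_5xx_increased:rate15m > 0.01
        for: 15m             # assumed: longer window keeps this low-urgency alert quiet
        labels:
          severity: ticket   # assumed: below the paging threshold, so open a ticket
        annotations:
          description: "Percentage of HTTP 5xx responses has increased on My App, measured at the edge over 15 minutes."
```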
  13. Ways to Alert on Significant Events
     - Target error rate > SLO threshold
     - Increased alert window
     - Incrementing alert duration
     - Alert on burn rate
     - Multiple burn rate alerts
     - Multiwindow, multi-burn-rate alerts
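As an illustration of the last item, here is a multiwindow, multi-burn-rate alert in the style of the Google SRE Workbook: with a 99% SLO (1% error budget), page when the budget burns 14.4x faster than sustainable (about 2% of a 30-day budget in one hour) over both a long and a short window, and open a ticket for a slower 6x burn. The longer-window recording rules (rate1h, rate6h, rate30m) are assumed to exist alongside the rate5m rule from the earlier slides.

```yaml
groups:
  - name: slo-burn-rate
    rules:
      # Page: 14.4x burn rate over 1 hour, confirmed by the short 5m window
      # so brief spikes do not page anyone.
      - alert: ErrorBudgetBurnTooFast
        expr: |
          (http_requests_5xx:rate1h > 14.4 * 0.01)
            and
          (http_requests_5xx:rate5m > 14.4 * 0.01)
        labels:
          severity: page
      # Ticket: slower 6x burn over 6 hours, confirmed over 30 minutes.
      - alert: ErrorBudgetBurnElevated
        expr: |
          (http_requests_5xx:rate6h > 6 * 0.01)
            and
          (http_requests_5xx:rate30m > 6 * 0.01)
        labels:
          severity: ticket
```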