Slide 1

Effective Alerting

Slide 2

Agenda
- Avoid Alert Fatigue
- Alerting on SLOs
- Reducing Noisy Alerts
- Runbooks
- Monitoring at the Edge
- Alerting Sensitivity

Slide 3

Avoid Alert Fatigue
- Pod was restarted
- Error rate > 1 ops
- DB query too slow
- CPU/memory is too high
- Health check failed
- Notifications to private Slack channels
When alerts always ring, people only assume something is wrong when they stop.

Slide 4

Private Slack Channel Notifications
- They are just notifications
- Usually too noisy and non-actionable
- Productivity killer
- Operators and on-call engineers are not aware of them
- No post-mortems, so no learnings

Slide 5

From Google SRE workshop

Slide 6

Alerting on SLOs
SLI -> Indicator: a key measurement
- distribution of response times over 5m
- response error ratios over 5m
SLO -> Objective: an availability goal
- 99% of responses below 500ms
- 99% during 4 weeks, i.e. lower than a 1% rate of errors
Alert when your service level objective (SLO) is affected.
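A rough sketch of how such an SLI could be captured as a Prometheus recording rule; only the rule name comes from the following slides, while the source metric http_requests_total and its code label are assumptions about the instrumentation.

# Hypothetical recording rule for the error-ratio SLI above.
# http_requests_total and its "code" label are assumed names;
# only the recording-rule name appears in the slides.
groups:
  - name: sli-recordings
    rules:
      # Ratio of 5xx responses to all responses over the last 5 minutes
      - record: http_requests_5xx:rate5m
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m]))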

Slide 7

Availability Affected
Alert -> TooManyUsersCheckoutErrors
Expression -> http_requests_5xx:rate5m > 0.05
Description -> Alert on the proportion of traffic, i.e. 5% of requests are failing as measured over the last 5 minutes.
Avoid alerting on static thresholds (e.g. > 5 errors seen) and short windows such as 1 minute; they are too noisy.
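As a minimal sketch, the alert above could be written as a Prometheus alerting rule like this; the name and expression are from the slide, while the for duration, labels, and annotation are illustrative assumptions.

groups:
  - name: checkout-availability
    rules:
      - alert: TooManyUsersCheckoutErrors
        # Expression from the slide: more than 5% of requests failing over 5 minutes
        expr: http_requests_5xx:rate5m > 0.05
        for: 5m                 # illustrative assumption
        labels:
          severity: critical    # illustrative assumption
        annotations:
          summary: More than 5% of checkout requests are failing (5-minute window).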

Slide 8

Slow Response Times
Alert -> TooManySlowUsersQueries
Expression -> http_requests_2xx_95th_latency:rate5m > 10s
Description -> Alert on the 95th percentile latency of good (2xx) user requests as measured over the last 5 minutes.
Avoid alerting on 5xx responses (they're typically slow), on 99th percentiles, and on short windows such as 1 minute; they are too noisy.
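The slides only show the recording-rule name; as an assumption about how it might be derived, the 95th-percentile latency of successful requests could be computed from a latency histogram restricted to 2xx responses. With this definition, the threshold of 10 in the alert expression means 10 seconds.

# Hypothetical derivation of the latency SLI.
# http_request_duration_seconds_bucket and its "code" label are assumed names.
- record: http_requests_2xx_95th_latency:rate5m
  expr: |
    histogram_quantile(
      0.95,
      sum by (le) (rate(http_request_duration_seconds_bucket{code=~"2.."}[5m]))
    )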

Slide 9

Slow is the new Down
53% of mobile users abandon web pages that take more than 3 seconds to load.

Slide 10

Reducing Noisy Alerts
During outages both alerts will get triggered. Avoid that by using the unless set operator:
TooManyUsersCheckoutErrors -> http_requests_5xx:rate5m > 0.05 unless http_requests_2xx_95th_latency:rate5m > 10
TooManySlowUsersQueries -> http_requests_2xx_95th_latency:rate5m > 10 unless http_requests_5xx:rate5m > 0.05
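Put together, a sketch of both alerting rules using the slide's expressions; the for durations are illustrative assumptions.

groups:
  - name: checkout-slo-alerts
    rules:
      - alert: TooManyUsersCheckoutErrors
        # Fire on the error rate, but stay silent while the latency condition also holds
        expr: http_requests_5xx:rate5m > 0.05 unless http_requests_2xx_95th_latency:rate5m > 10
        for: 5m   # illustrative assumption
      - alert: TooManySlowUsersQueries
        # Fire on latency, but stay silent while the error-rate condition also holds
        expr: http_requests_2xx_95th_latency:rate5m > 10 unless http_requests_5xx:rate5m > 0.05
        for: 5m   # illustrative assumption

Note that unless matches series by label set, so if the two recording rules carry different labels the expressions may need an explicit on() clause.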

Slide 11

Runbooks
- Write a minimal viable runbook for each alert
- Add dashboards to investigate the issue
- Explain how to mitigate the issue, with bash scripts and so on
Hint: if there is neither a clear action nor a dashboard to investigate the alert, just remove the alert.

Slide 12

Creating a Runbook
Visit https://github.com/hellofresh/runbooks
One runbook per alert:
- troubleshooting/thanos-query-error-rate.md
- troubleshooting/thanos-query-high-latency.md
Runbook Description

Slide 13

Architecture Overview

Slide 14

Troubleshooting

Slide 15

Troubleshooting

Slide 16

Monitoring at the Edge
Measure at the highest level, i.e. the load balancer.
If a backend goes away, you still have metrics.
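As an illustration of measuring at the edge (the specifics below are assumptions about the setup, not from the slides), an error-ratio SLI can be recorded from the ingress controller's or load balancer's own metrics, so the signal survives even when a backend stops reporting.

# Sketch: edge-level error ratio recorded from NGINX ingress-controller metrics.
# nginx_ingress_controller_requests is an assumed source; the rule name is hypothetical.
- record: ingress:http_requests_5xx:rate5m
  expr: |
    sum(rate(nginx_ingress_controller_requests{status=~"5.."}[5m]))
      /
    sum(rate(nginx_ingress_controller_requests[5m]))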

Slide 17

Why monitor at the highest level?

Slide 18

Post-Mortem 1: API Gateway Outage
Since traffic is not routed to the upstream, there are no metrics and no alerts.

Slide 19

Post-Mortem 2: Ingress Slowdown
Remote Config Service: slower response times

Slide 20

Post-Mortem 2: Ingress Slowdown
Fragments: too many 499 status codes

Slide 21

Post-Mortem 2: Ingress Slowdown
Every request was getting queued on the ELB

Slide 22

Alerting Sensitivity
Error rate below the threshold for too long

Slide 23

Alert for Errors Under the Alert Threshold
Alert -> IncreaseOfHTTP5xxErrorsOnMyApp
Expression -> ingress:api_v2_http_5xx_increased:rate15m > 0.01
Description -> The percentage of HTTP 5xx status codes has increased on My App.
Avoid alerting on short windows such as 1-10 minutes; they are too noisy for this kind of alert.
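A sketch of that alert as a Prometheus rule; the name and expression come from the slide, while the for duration, severity, and annotation are illustrative assumptions.

- alert: IncreaseOfHTTP5xxErrorsOnMyApp
  # Expression from the slide: proportion of 5xx at the edge over 15 minutes
  expr: ingress:api_v2_http_5xx_increased:rate15m > 0.01
  for: 15m              # illustrative assumption
  labels:
    severity: warning   # illustrative assumption
  annotations:
    summary: Percentage of HTTP 5xx status codes has increased on My App.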

Slide 24

Alerting Sensitivity
Alert on good requests having high response times

Slide 25

Alerting Sensitivity
A high-latency alert might fire before the error-rate alert

Slide 26

Google Page Example

Slide 27

Ways to Alert on Significant Events
- Target error rate > SLO threshold
- Increased alert window
- Incrementing alert duration
- Alert on burn rate
- Multiple burn-rate alerts
- Multiwindow, multi-burn-rate alerts (a sketch of this last approach follows below)
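As a sketch of the last item, a multiwindow, multi-burn-rate alert roughly follows the pattern described in the Google SRE Workbook; the recording-rule names and the 99.9% SLO (0.1% error budget) below are illustrative assumptions.

# Multiwindow, multi-burn-rate sketch: page when the error budget of a 99.9% SLO
# burns at 14.4x (2% of a 30-day budget in 1 hour) or at 6x (5% in 6 hours),
# each confirmed by a shorter window so the alert resets quickly after recovery.
- alert: ErrorBudgetBurn
  expr: |
    (
        job:slo_errors_per_request:ratio_rate1h > (14.4 * 0.001)
      and
        job:slo_errors_per_request:ratio_rate5m > (14.4 * 0.001)
    )
    or
    (
        job:slo_errors_per_request:ratio_rate6h > (6 * 0.001)
      and
        job:slo_errors_per_request:ratio_rate30m > (6 * 0.001)
    )
  labels:
    severity: page   # illustrative assumption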

Slide 28

Thank you