
Effective Alerting

Rafael Jesus
December 11, 2018

Transcript

  1. Agenda
     - Avoid Alert Fatigue
     - Alerting on SLOs
     - Reducing noisy alerts
     - Runbooks
     - Monitoring at the Edge
     - Alerting Sensitivity
  2. Avoid Alert Fatigue
     - Pod was restarted
     - Error rate > 1 op/s
     - DB query too slow
     - CPU/memory is too high
     - Health check failed
     - Notifications to private Slack channels
     - Alerts always ring; when they aren't ringing, people assume something is wrong
  3. Private Slack Channel Notifications
     - They are just notifications
     - Usually too noisy and non-actionable
     - Productivity killer
     - Operators and on-call engineers are not aware of them
     - No post mortems, so no learnings
  4. Alerting on SLOs
     SLI -> Indicator: a key measurement
     - distribution of response time over 5m
     - response error ratio over 5m
     SLO -> Objective: an availability goal
     - 99% of response times below 500ms
     - 99% during 4 weeks, i.e. lower than a 1% error rate
     Alert when your service level objective (SLO) is affected.
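     The SLI measurements above can be precomputed as Prometheus recording rules, which the later
     alert expressions then reference. A minimal sketch, assuming a request counter named
     http_requests_total with a status label and an http_request_duration_seconds histogram
     (both metric names are placeholders):

     groups:
       - name: checkout-slis
         rules:
           # Ratio of 5xx responses over the last 5 minutes.
           - record: http_requests_5xx:rate5m
             expr: |
               sum(rate(http_requests_total{status=~"5.."}[5m]))
                 /
               sum(rate(http_requests_total[5m]))
           # 95th percentile latency of successful (2xx) requests over the last 5 minutes.
           - record: http_requests_2xx_95th_latency:rate5m
             expr: |
               histogram_quantile(0.95,
                 sum by (le) (rate(http_request_duration_seconds_bucket{status=~"2.."}[5m])))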
  5. Availability affected
     Alert -> TooManyUsersCheckoutErrors
     Expression -> http_requests_5xx:rate5m > 0.05
     Description -> Alert on the proportion of traffic, i.e. 5% of requests failing as measured over the last 5 minutes.
     Avoid alerting on static thresholds (> 5 errors seen) and small windows such as 1 min; they are too noisy.
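     As a Prometheus alerting rule this could look as follows; a sketch building on the recording
     rules above (the group name, severity label and annotation text are assumptions, not from the slides):

     groups:
       - name: checkout-alerts
         rules:
           - alert: TooManyUsersCheckoutErrors
             # Fire when more than 5% of requests fail, sustained for 5 minutes.
             expr: http_requests_5xx:rate5m > 0.05
             for: 5m
             labels:
               severity: page
             annotations:
               summary: "More than 5% of checkout requests failed over the last 5 minutes"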
  6. Slow Response Times
     Alert -> TooManySlowUsersQueries
     Expression -> http_requests_2xx_95th_latency:rate5m > 10s
     Description -> Alert on the 95th percentile latency of good (2xx) user requests as measured over the last 5 minutes.
     Avoid alerting on 5xx responses (they're typically slow), on 99th percentiles, and on small windows such as 1 min; they are too noisy.
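     A matching rule sketch for the latency objective; the threshold is a plain 10 because the
     recording rule sketched earlier is computed from a seconds-based histogram:

     groups:
       - name: latency-alerts
         rules:
           - alert: TooManySlowUsersQueries
             # Fire when p95 latency of successful requests stays above 10s for 5 minutes.
             expr: http_requests_2xx_95th_latency:rate5m > 10
             for: 5m
             labels:
               severity: page
             annotations:
               summary: "p95 latency of successful requests is above 10s over the last 5 minutes"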
  7. Slow is the new Down
     53% of mobile users abandon web pages that take more than 3 seconds to load.
  8. Reducing noisy alerts
     During outages both alerts will get triggered. Avoid that by combining them with the unless operator:
     TooManyUsersCheckoutErrors ->
       http_requests_5xx:rate5m > 0.05 unless http_requests_2xx_95th_latency:rate5m > 10
     TooManySlowUsersQueries ->
       http_requests_2xx_95th_latency:rate5m > 10 unless http_requests_5xx:rate5m > 0.05
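     Expressed as alerting rules this becomes the sketch below; note that if the two recorded series
     carry different label sets, the unless clauses additionally need an on()/ignoring() matching clause:

     groups:
       - name: checkout-alerts-deduplicated
         rules:
           - alert: TooManyUsersCheckoutErrors
             # Error-rate alert, suppressed while the latency condition is also true.
             expr: |
               http_requests_5xx:rate5m > 0.05
                 unless http_requests_2xx_95th_latency:rate5m > 10
             for: 5m
           - alert: TooManySlowUsersQueries
             # Latency alert, suppressed while the error-rate condition is also true.
             expr: |
               http_requests_2xx_95th_latency:rate5m > 10
                 unless http_requests_5xx:rate5m > 0.05
             for: 5m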
  9. Runbooks
     - Write a minimal viable runbook for the alert
     - Add dashboards to investigate the issue
     - Describe how to mitigate the issue, with bash scripts and so on
     Hint: if there's neither a clear action nor dashboards to investigate the alert, just remove the alert.
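     Runbooks and dashboards are typically linked from the alert itself via annotations, so whoever
     gets paged lands on them directly. A sketch of the earlier alert extended this way (the annotation
     keys and URLs are placeholders; Prometheus annotations are free-form):

     groups:
       - name: checkout-alerts-with-runbooks
         rules:
           - alert: TooManyUsersCheckoutErrors
             expr: http_requests_5xx:rate5m > 0.05
             for: 5m
             annotations:
               summary: "More than 5% of checkout requests are failing"
               runbook_url: "https://wiki.example.org/runbooks/TooManyUsersCheckoutErrors"
               dashboard: "https://grafana.example.org/d/checkout-overview"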
  10. Monitoring at the Edge
      Measure at the highest level, i.e. the load balancer. If a backend goes away you still have metrics.
  11. Post Mortem 1: API Gateway Outage
      Since traffic is not routed to the upstream, there are no metrics and no alerts.
  12. Alert for errors under the alert threshold
      Alert -> IncreaseOfHTTP5xxErrorsOnMyApp
      Expression -> ingress:api_v2_http_5xx_increased:rate15m > 0.01
      Description -> The percentage of HTTP 5xx status codes has increased on My App.
      Avoid alerting on small windows such as 1-10 min; they are too noisy for this kind of alert.
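      Measured at the edge, the recording rule behind that expression can be built from load balancer
      or ingress controller metrics. A sketch assuming the NGINX ingress controller's
      nginx_ingress_controller_requests counter (the ingress name api-v2 is a placeholder):

      groups:
        - name: edge-alerts
          rules:
            # Share of 5xx responses seen at the ingress over the last 15 minutes.
            - record: ingress:api_v2_http_5xx_increased:rate15m
              expr: |
                sum(rate(nginx_ingress_controller_requests{ingress="api-v2", status=~"5.."}[15m]))
                  /
                sum(rate(nginx_ingress_controller_requests{ingress="api-v2"}[15m]))
            - alert: IncreaseOfHTTP5xxErrorsOnMyApp
              expr: ingress:api_v2_http_5xx_increased:rate15m > 0.01
              for: 15m
              annotations:
                summary: "Percentage of HTTP 5xx responses has increased on My App (measured at the ingress)"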
  13. Ways to Alert on Significant Events
      - Target error rate > SLO threshold
      - Increased alert window
      - Incrementing alert duration
      - Alert on burn rate
      - Multiple burn rate alerts
      - Multiwindow, multi-burn-rate alerts
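      The last item, adapted to the 99% / 4-week objective from slide 4 (a 1% error budget), could look
      roughly like the sketch below. The burn-rate factors 14.4 and 6 and the paired windows follow the
      commonly cited multiwindow, multi-burn-rate pattern; the per-window error-ratio recording rules
      (rate5m, rate30m, rate1h, rate6h) are assumed to exist:

      groups:
        - name: slo-burn-rate-alerts
          rules:
            - alert: CheckoutErrorBudgetBurnRateTooHigh
              # Page when the error budget is being burned ~14x too fast (1h and 5m windows agree)
              # or ~6x too fast (6h and 30m windows agree).
              expr: |
                (    http_requests_5xx:rate1h  > (14.4 * 0.01)
                 and http_requests_5xx:rate5m  > (14.4 * 0.01))
                or
                (    http_requests_5xx:rate6h  > (6 * 0.01)
                 and http_requests_5xx:rate30m > (6 * 0.01))
              labels:
                severity: page
              annotations:
                summary: "Checkout is burning its 4-week error budget too fast"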