alerting-deep-dive.pdf

Rafael Jesus

December 05, 2019

Transcript

  1. Honeycomb Incident Report

    "It is impossible to prevent every memory leak. A high process crash
    (panic/OOM) rate is clearly abnormal in the system and that information
    should be displayed to people debugging issues, even if it is one of many
    potential causes rather than a symptom of user pain. Thus, it should be
    made a diagnostic message rather than a paging alert. And rather than
    outright crash and thereby lose internal telemetry, we will consider
    adding backpressure (returning an unhealthy status code) to incoming
    requests when we are resource constrained, to keep the telemetry with
    that vital data flowing if we're tight on resources."
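    Acting on that advice means alerting on crash/OOM rate as a ticket, not a
    page. A minimal sketch, assuming kube-state-metrics is scraped (the metric,
    threshold and severity value below are assumptions, not from the deck):

      - alert: HighContainerRestartRate
        # Cause-based diagnostic signal, never a page (see slide 6).
        expr: |
          sum by (namespace, pod) (
            increase(kube_pod_container_status_restarts_total[1h])
          ) > 3
        for: 15m
        labels:
          severity: ticket
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} is crash-looping (panic/OOM)"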
  2. Alert Operator Spec

    spec:
      slo: ...
      istioRules: ...
      systemRules: ...
      additionalRules: ...

    In the app chart:

    alerts:
      slo:
        availability: 99.9
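    The availability target drives the error budget used by the generated
    burn-rate rules on the next slide: 1 - 99.9/100 = 0.001, which is the
    0.001 multiplied by the burn-rate factors (e.g. 14.4 * 0.001) in the
    alert thresholds below.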
  3. SLO Alerting Rules

    alert: SloErrorBudgetBurn
    expr: |
      (
        (slo:service_errors:ratio_rate1h > (14.4 * 0.001))
        and
        (slo:service_errors:ratio_rate5m > (14.4 * 0.001))
      )
      or
      (
        (slo:service_read_latency_errors:ratio_rate1h > (14.4 * 0.001))
        and
        (slo:service_read_latency_errors:ratio_rate5m > (14.4 * 0.001))
      )
      or
      (
        (slo:service_write_latency_errors:ratio_rate1h > (14.4 * 0.001))
        and
        (slo:service_write_latency_errors:ratio_rate5m > (14.4 * 0.001))
      )
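    This is the fast-burn (P1) pair from the table on slide 5: 1h and 5m
    windows with a burn-rate factor of 14.4. The slow-burn (P2) companion
    follows the same shape with 6h/30m windows and a factor of 6. A sketch,
    with a made-up alert name and assuming 6h/30m recording rules exist (the
    deck only shows the 5m/1h ones):

      alert: SloErrorBudgetBurnSlow
      labels:
        severity: P2
      expr: |
        (
          (slo:service_errors:ratio_rate6h > (6 * 0.001))
          and
          (slo:service_errors:ratio_rate30m > (6 * 0.001))
        )
        # ... same pattern for the read/write latency error ratios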
  4. Latency Rules

    record: slo:service_read_latency_errors:ratio_rate5m
    expr: |
      1 - (
        sum(rate(incoming_http_requests_latency_bucket{job="service", method="GET", code!~"5..", le="1"}[5m]))
        /
        sum(rate(incoming_http_requests_latency_count{job="service", method="GET"}[5m]))
      )

    record: slo:service_write_latency_errors:ratio_rate5m
    expr: |
      1 - (
        sum(rate(incoming_http_requests_latency_bucket{job="service", method!="GET", code!~"5..", le="2.5"}[5m]))
        /
        sum(rate(incoming_http_requests_latency_count{job="service", method!="GET"}[5m]))
      )
  5. Fast burning is detected quickly, whereas slow burning requires a
    longer time window. See the table below:

    Severity | Long Window | Short Window | Burn Rate Factor | Error Budget Consumed
    P1       | 1h          | 5m           | 14.4             | 2%
    P2       | 6h          | 30m          | 6                | 5%

    Tickets for all the rest.
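    The last column follows from the factor and the long window, assuming a
    30-day (720h) SLO window: 14.4 * 1h / 720h = 2% of the budget, and
    6 * 6h / 720h = 5%.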
  6. Wrap up

    Symptoms -> pages. Causes -> tickets.
    Too many alerts incur too much toil.
    Stick with a few alerts; alert on SLOs.
    Guidelines don't scale; invest in tooling instead.
    Leverage automation to make a product out of alerts.
  7. Grouping Expression

    route:
      group_by: ["tier"]
      group_wait: 30s
      group_interval: 10m
      repeat_interval: 1h

    group_wait      => bundle alerts for the first notification
    group_interval  => send a notification for new or resolved alerts
    repeat_interval => remind users that alerts are still firing
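    In a full Alertmanager config this route sits alongside receivers and
    child routes; a minimal sketch of splitting pages from tickets by
    severity (receiver names and the severity split are assumptions, not
    from the deck):

      route:
        receiver: tickets          # default: cause-based alerts become tickets
        group_by: ["tier"]
        group_wait: 30s
        group_interval: 10m
        repeat_interval: 1h
        routes:
          - match_re:
              severity: P0|P1|P2
            receiver: pager        # symptom / SLO-burn alerts page
      receivers:
        - name: pager
          pagerduty_configs:
            - service_key: <secret>
        - name: tickets
          webhook_configs:
            - url: https://ticketing.example.com/hook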
  8. Inhibiting Alerts

    If a zone is down, inhibit the upcoming alerts from that zone:

    inhibit_rules:
      - source_match:
          severity: P0
        target_match_re:
          severity: P1|P2
        equal:
          - zone
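    Concretely (alert names made up): while a ZoneDown alert with severity=P0
    and zone=eu-west-1a is firing, any P1/P2 alert that also carries
    zone=eu-west-1a is muted, because equal requires the zone label to match
    on both alerts; P1/P2 alerts from other zones keep notifying.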
  9. Wrap up

    Routing alerts is a job for SRE/Ops folks.
    Don't depend exclusively on white-box monitoring for paging alerts.
    To get it done right, it's a full-time job.