alerting-deep-dive.pdf

Alerting Deep Dive

Agenda - Straight to the Point

Symptom Based Pages

Avoid alerting in all services

The worst scenario to be in

As high up the stack as possible

Honeycomb Incident Report
Memory leak within the code

Honeycomb Incident Report
“It is impossible to prevent every memory leak; A high process crash (panic/OOM) rate is clearly abnormal in the system and that information should be displayed to people debugging issues, even if it is one of many potential causes rather than a symptom of user pain. Thus, it should be made a diagnostic message rather than a paging alert. And rather than outright crash and thereby lose internal telemetry, we will consider adding backpressure (returning an unhealthy status code) to incoming requests when we are resource constrained to keep the telemetry with that vital data flowing if we’re tight on resources”
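The backpressure idea in the quote can be sketched in a few lines. This is only an illustration under stated assumptions: `health_status`, the 90% high watermark, and the byte counts are hypothetical names and values, not anything taken from the incident report.

```python
# Hypothetical sketch: report "unhealthy" before the process runs out of
# memory, so callers back off while telemetry keeps flowing.
HIGH_WATERMARK = 0.9  # assumption: shed load at 90% of the memory limit


def health_status(used_bytes: int, limit_bytes: int,
                  high_watermark: float = HIGH_WATERMARK) -> int:
    """Return an HTTP status code for a health check endpoint."""
    if used_bytes >= high_watermark * limit_bytes:
        return 503  # resource constrained: ask callers to back off
    return 200      # healthy: accept traffic


print(health_status(950, 1000))
print(health_status(100, 1000))
```

The point of the design is that a 503 is recoverable and observable, whereas an OOM crash takes the process and its internal telemetry down together.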

Alert Operator
A place to manage alerting rules. Automates multi-window, multi-burn-rate SLO alerts.

Alert Operator Spec

    spec:
      slo: ...
      istioRules: ...
      systemRules: ...
      additionalRules: ...

In the app chart

    alerts:
      slo:
        availability: 99.9

SLO Alerting Rules

    alert: SloErrorBudgetBurn
    expr: |
      (
        (slo:service_errors:ratio_rate1h > (14.4 * 0.001))
        and
        (slo:service_errors:ratio_rate5m > (14.4 * 0.001))
      )
      or
      (
        (slo:service_read_latency_errors:ratio_rate1h > (14.4 * 0.001))
        and
        (slo:service_read_latency_errors:ratio_rate5m > (14.4 * 0.001))
      )
      or
      (
        (slo:service_write_latency_errors:ratio_rate1h > (14.4 * 0.001))
        and
        (slo:service_write_latency_errors:ratio_rate5m > (14.4 * 0.001))
      )
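The logic of this rule can be mirrored in a short sketch: each SLI pages only when both its long (1h) and short (5m) window exceed the burn-rate threshold, and any one SLI is enough to fire. The function names and the sample ratios are hypothetical; the factor 14.4 and the error budget 0.001 come from the rule itself.

```python
BURN_RATE = 14.4      # burn rate factor used in the rule
ERROR_BUDGET = 0.001  # allowed error ratio used in the rule


def window_pair_burning(ratio_1h, ratio_5m,
                        burn_rate=BURN_RATE, budget=ERROR_BUDGET):
    """True when both the long and the short window exceed the threshold."""
    threshold = burn_rate * budget
    return ratio_1h > threshold and ratio_5m > threshold


def slo_error_budget_burn(slis):
    """slis maps an SLI name to its (ratio_1h, ratio_5m) pair.

    Mirrors the 'or' between the three bracketed groups in the rule.
    """
    return any(window_pair_burning(r1h, r5m) for r1h, r5m in slis.values())


# Hypothetical sample: only read latency is burning in both windows.
sample = {
    "errors": (0.002, 0.001),
    "read_latency": (0.05, 0.04),
    "write_latency": (0.0, 0.0),
}
print(slo_error_budget_burn(sample))
```

Requiring the short window as well means the alert stops firing soon after the problem is fixed, instead of lingering until the 1h window drains.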

Latency Rules

    # Latency error ratio = 1 - (non-5xx requests served within the threshold / all requests).
    - record: slo:service_read_latency_errors:ratio_rate5m
      expr: |
        1 - (
          sum(rate(incoming_http_requests_latency_bucket{job="service", method="GET", code!~"5..", le="1"}[5m]))
          /
          sum(rate(incoming_http_requests_latency_count{job="service", method="GET"}[5m]))
        )

    # Assumption: writes are every non-GET method; the slide repeats method="GET" here.
    - record: slo:service_write_latency_errors:ratio_rate5m
      expr: |
        1 - (
          sum(rate(incoming_http_requests_latency_bucket{job="service", method!="GET", code!~"5..", le="2.5"}[5m]))
          /
          sum(rate(incoming_http_requests_latency_count{job="service", method!="GET"}[5m]))
        )
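For the recorded value to be comparable with an error-budget threshold, it has to be the share of requests that missed the latency target, that is, one minus the share served in time without a 5xx. A minimal sketch of that arithmetic follows; `latency_error_ratio` and the sample rates are hypothetical.

```python
def latency_error_ratio(fast_ok_rate: float, total_rate: float) -> float:
    """Share of requests that were too slow or failed with a 5xx.

    fast_ok_rate: per-second rate of non-5xx requests within the latency bucket
    total_rate:   per-second rate of all requests
    """
    if total_rate == 0:
        # Sketch-only choice: treat "no traffic" as "no errors".
        # PromQL itself would yield NaN for 0/0.
        return 0.0
    return 1 - fast_ok_rate / total_rate


# Hypothetical sample: 98 of 100 requests/s were fast and successful.
print(round(latency_error_ratio(98.0, 100.0), 4))
```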

Fast burning is detected quickly, whereas slow burning requires a longer time window. See the table below:

Severity  Long Window  Short Window  Burn Rate Factor  Error Budget Consumed
P1        1h           5m            14.4              2%
P2        6h           30m           6                 5%

Tickets for all the rest.
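The "Error Budget Consumed" figures follow from the burn rate and the long window. A small worked sketch, assuming a 30-day SLO period (the period is not stated on the slide; `budget_consumed` is a hypothetical helper):

```python
SLO_PERIOD_HOURS = 30 * 24  # assumption: 30-day SLO period


def budget_consumed(burn_rate: float, window_hours: float,
                    period_hours: float = SLO_PERIOD_HOURS) -> float:
    """Fraction of the period's error budget spent if the burn rate
    is sustained for the whole window."""
    return burn_rate * window_hours / period_hours


for severity, window_hours, burn_rate in [("P1", 1, 14.4), ("P2", 6, 6)]:
    print(f"{severity}: {budget_consumed(burn_rate, window_hours):.0%}")
```

Under that assumption, 14.4 sustained for 1h spends 14.4 / 720 of the budget and 6 sustained for 6h spends 36 / 720, which matches the 2% and 5% in the table.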

Wrap up
Symptoms -> pages. Causes -> tickets.
Too many alerts incur too much toil.
Stick with a few alerts; alert on SLOs.
Guidelines don't scale; invest in tooling instead.
Leverage automation to make a product out of alerts.

Alert Manager
Aggregate, deduplicate and route alerts

Grouping Alerts

Grouping Expression

    route:
      group_by: ["tier"]
      group_wait: 30s
      group_interval: 10m
      repeat_interval: 1h

group_wait => bundle alerts for the first notification
group_interval => send a notification for new or resolved alerts
repeat_interval => remind users that alerts are still firing
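What `group_by` does can be sketched as bucketing alerts by the values of the listed labels, so that each bucket produces one notification. This is a simplified illustration, not Alert Manager's implementation; `group_alerts` and the sample alerts are hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alert names by the values of the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        labels = alert["labels"]
        # Alerts lacking a label share the empty-string bucket in this sketch.
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels["alertname"])
    return dict(groups)


# Hypothetical firing alerts.
alerts = [
    {"labels": {"alertname": "HighLatency", "tier": "frontend"}},
    {"labels": {"alertname": "HighErrors", "tier": "frontend"}},
    {"labels": {"alertname": "DiskFull", "tier": "backend"}},
]

for key, names in group_alerts(alerts, ["tier"]).items():
    print(key, names)
```

With `group_by: ["tier"]`, the two frontend alerts would arrive as one notification and the backend alert as another.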

Inhibition

Inhibiting Alerts
If a zone is down, inhibit the alerts that follow from it.

    - source_match:
        severity: P0
      target_match_re:
        severity: P1|P2
      equal:
        - zone
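The rule reads as: a P1 or P2 alert is muted while a P0 alert with the same `zone` label is firing. A minimal sketch of that matching logic follows; it is a simplification rather than Alert Manager's code, and `is_inhibited` plus the sample alerts are hypothetical.

```python
import re

# The rule from the slide, expressed as plain Python data.
SOURCE_MATCH = {"severity": "P0"}
TARGET_MATCH_RE = {"severity": "P1|P2"}
EQUAL = ["zone"]


def is_inhibited(target, firing, source_match=SOURCE_MATCH,
                 target_match_re=TARGET_MATCH_RE, equal=EQUAL):
    """True if some firing source alert inhibits the target alert."""
    # The target must match every regex; patterns are treated as fully anchored.
    if not all(re.fullmatch(pattern, target.get(label, ""))
               for label, pattern in target_match_re.items()):
        return False
    for source in firing:
        if source is target:
            continue
        is_source = all(source.get(k) == v for k, v in source_match.items())
        same_scope = all(source.get(k) == target.get(k) for k in equal)
        if is_source and same_scope:
            return True
    return False


# Hypothetical firing alerts (labels only).
zone_down = {"alertname": "ZoneDown", "severity": "P0", "zone": "eu-1"}
latency = {"alertname": "HighLatency", "severity": "P1", "zone": "eu-1"}
other = {"alertname": "HighErrors", "severity": "P2", "zone": "us-1"}
firing = [zone_down, latency, other]

print(is_inhibited(latency, firing))
print(is_inhibited(other, firing))
```

The `equal` list is what keeps a zone outage in one zone from silencing unrelated alerts in another.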

Don't forget to monitor and operate Alert Manager

Wrap up
Routing alerts is the SRE/Ops folks' job.
Don't depend exclusively on white-box monitoring for paging alerts.
To get it done right, it's a full-time job.

Optimizing MTTD with Chaos Engineering

Thank you


Rafael Jesus
