Slide 1

Slide 1 text

Alerting Deep Dive

Slide 2

Slide 2 text

Agenda - Straight to the Point

Slide 3

Slide 3 text

Symptom-Based Pages

Slide 4

Slide 4 text

Avoid alerting in every single service

Slide 5

Slide 5 text

The worst scenario to be in

Slide 6

Slide 6 text

Alert as high up the stack as possible
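
For illustration, one way to page on a symptom at the top of the stack rather than inside each service is a single edge-level error-rate alert. A minimal sketch, assuming Istio's istio_requests_total metric is scraped at the ingress gateway; the rule name, threshold, and labels are placeholders, not rules from this deck:

groups:
  - name: edge-symptoms
    rules:
      - alert: IngressHighErrorRate
        # Symptom-based: users are receiving errors at the edge, whatever the cause.
        expr: |
          sum(rate(istio_requests_total{reporter="source", source_workload="istio-ingressgateway", response_code=~"5.."}[5m]))
          /
          sum(rate(istio_requests_total{reporter="source", source_workload="istio-ingressgateway"}[5m]))
          > 0.01
        for: 5m
        labels:
          severity: P1
        annotations:
          summary: "More than 1% of requests at the edge are failing"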

Slide 7

Slide 7 text

Honeycomb Incident Report: Memory leak within the code

Slide 8

Slide 8 text

Honeycomb Incident Report: "It is impossible to prevent every memory leak. A high process crash (panic/OOM) rate is clearly abnormal in the system and that information should be displayed to people debugging issues, even if it is one of many potential causes rather than a symptom of user pain. Thus, it should be made a diagnostic message rather than a paging alert. And rather than outright crash and thereby lose internal telemetry, we will consider adding backpressure (returning an unhealthy status code) to incoming requests when we are resource constrained, to keep the telemetry with that vital data flowing if we're tight on resources."
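
As a concrete illustration of "diagnostic message rather than a paging alert", a crash/OOM-rate signal can be kept as a low-urgency, cause-based rule. A minimal sketch, assuming kube-state-metrics is deployed; the threshold and the ticket-level severity label are placeholders:

groups:
  - name: diagnostics
    rules:
      - alert: HighContainerRestartRate
        # Cause-based: useful context while debugging, but it does not page anyone.
        expr: sum by (namespace, pod) (increase(kube_pod_container_status_restarts_total[1h])) > 3
        labels:
          severity: ticket
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} restarted more than 3 times in the last hour (possible panic/OOM loop)"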

Slide 9

Slide 9 text

Alert Operator: a place to manage alerting rules. Automates multi-window, multi-burn-rate SLO alerts.

Slide 10

Slide 10 text

Alert Operator Spec

spec:
  slo: ...
  istioRules: ...
  systemRules: ...
  additionalRules: ...

In the app chart:

alerts:
  slo:
    availability: 9.99

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

SLO Alerting Rules

alert: SloErrorBudgetBurn
expr: |
  (
    (slo:service_errors:ratio_rate1h > (14.4 * 0.001))
    and
    (slo:service_errors:ratio_rate5m > (14.4 * 0.001))
  )
  or
  (
    (slo:service_read_latency_errors:ratio_rate1h > (14.4 * 0.001))
    and
    (slo:service_read_latency_errors:ratio_rate5m > (14.4 * 0.001))
  )
  or
  (
    (slo:service_write_latency_errors:ratio_rate1h > (14.4 * 0.001))
    and
    (slo:service_write_latency_errors:ratio_rate5m > (14.4 * 0.001))
  )
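
The expression above is the fast-burn (P1) pair from the table two slides ahead; a slow-burn (P2) companion has the same shape with the 6h/30m windows and a burn-rate factor of 6. A sketch only: the 6h/30m recording rules are assumed to exist and are not shown in this deck:

alert: SloErrorBudgetBurnSlow
expr: |
  (
    (slo:service_errors:ratio_rate6h > (6 * 0.001))
    and
    (slo:service_errors:ratio_rate30m > (6 * 0.001))
  )
labels:
  severity: P2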

Slide 13

Slide 13 text

Latency Rules

# Fraction of GET requests that were slow (> 1s) or failed over the last 5 minutes.
record: slo:service_read_latency_errors:ratio_rate5m
expr: |
  1 - (
    sum(rate(incoming_http_requests_latency_bucket{job="service", method="GET", code!~"5..", le="1"}[5m]))
    /
    sum(rate(incoming_http_requests_latency_count{job="service", method="GET"}[5m]))
  )

# Fraction of non-GET (write) requests that were slow (> 2.5s) or failed over the last 5 minutes.
record: slo:service_write_latency_errors:ratio_rate5m
expr: |
  1 - (
    sum(rate(incoming_http_requests_latency_bucket{job="service", method!="GET", code!~"5..", le="2.5"}[5m]))
    /
    sum(rate(incoming_http_requests_latency_count{job="service", method!="GET"}[5m]))
  )
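
The availability ratio referenced on the previous slide (slo:service_errors:ratio_rate5m) is not spelled out in the deck. A plausible shape, assuming the same histogram's _count series carries the HTTP code label:

record: slo:service_errors:ratio_rate5m
expr: |
  sum(rate(incoming_http_requests_latency_count{job="service", code=~"5.."}[5m]))
  /
  sum(rate(incoming_http_requests_latency_count{job="service"}[5m]))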

Slide 14

Slide 14 text

Fast burning is detected quickly, whereas slow burning requires a longer time window. See the table below:

Severity | Long Window | Short Window | Burn Rate Factor | Error Budget Consumed
P1       | 1h          | 5m           | 14.4             | 2%
P2       | 6h          | 30m          | 6                | 5%

Tickets for all the rest.
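
To see where these percentages come from, assume the common 30-day (720h) SLO window: burning the budget at 14.4x the allowed rate for 1h consumes 14.4 × (1 / 720) = 2% of it, and burning at 6x for 6h consumes 6 × (6 / 720) = 5%.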

Slide 15

Slide 15 text

Wrap up

Symptoms -> pages. Causes -> tickets.
Too many alerts incur too much toil. Stick with a few alerts; alert on SLOs.
Guidelines don't scale; invest in tooling instead.
Leverage automation to make a product out of alerts.

Slide 16

Slide 16 text

Alert Manager: aggregate, deduplicate, and route alerts

Slide 17

Slide 17 text

Grouping Alerts

Slide 18

Slide 18 text

Grouping Expression

route:
  group_by: ["tier"]
  group_wait: 30s
  group_interval: 10m
  repeat_interval: 1h

group_wait => bundle alerts for the first notification
group_interval => send notifications for new or resolved alerts
repeat_interval => remind users that alerts are still firing
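
A sketch of how such a route might fan out to receivers; the receiver names and severity matchers below are illustrative, not taken from this deck:

route:
  receiver: ticket-queue
  group_by: ["tier"]
  group_wait: 30s
  group_interval: 10m
  repeat_interval: 1h
  routes:
    - match_re:
        severity: P0|P1
      receiver: pagerduty
    - match:
        severity: P2
      receiver: slack-oncall
receivers:
  - name: pagerduty
  - name: slack-oncall
  - name: ticket-queue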

Slide 19

Slide 19 text

Inhibition

Slide 20

Slide 20 text

Inhibiting Alerts

If a zone is down, inhibit the lower-severity alerts that follow:

- source_match:
    severity: P0
  target_match_re:
    severity: P1|P2
  equal:
    - zone

Slide 21

Slide 21 text

Don't forget to monitor and operate Alert Manager
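
One common way to monitor the alerting pipeline itself (a sketch, not from this deck) is a dead man's switch: an always-firing "Watchdog" alert routed to an external heartbeat service, so silence from Prometheus or Alert Manager becomes the signal:

groups:
  - name: meta
    rules:
      - alert: Watchdog
        # Always firing; if the external heartbeat stops receiving it,
        # the alerting pipeline is broken.
        expr: vector(1)
        labels:
          severity: none
        annotations:
          summary: "Alerting pipeline heartbeat"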

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

Wrap up

Routing alerts is the SRE/Ops folks' job.
Don't depend exclusively on white-box monitoring for paging alerts.
To get it done right, it's a full-time job.

Slide 25

Slide 25 text

Optimizing MTTD with Chaos Engineering

Slide 26

Slide 26 text

Thank you