Slide 1

Slide 1 text

Fixing your noisy pager in 500 easy steps

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

This is fixable

Slide 6

Slide 6 text

Hi

Slide 7

Slide 7 text

sinjo.dev

Slide 9

Slide 9 text

Infra Engineer

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

[Chart: Pages per month (2023–24). X-axis: Aug to Mar. Y-axis: 0 to 1,500 pages. Series: Daytime, Evening, Night.]

Slide 12

Slide 12 text

[Chart: Pages per month (2023–24). X-axis: Aug to Mar. Y-axis: 0 to 1,500 pages. Series: Daytime, Evening, Night.]

Slide 13

Slide 13 text

How did we get here?

Slide 14

Slide 14 text

[Chart: Pages per month (2023–24). X-axis: Aug to Mar. Y-axis: 0 to 1,500 pages. Series: Daytime, Evening, Night.]

Slide 15

Slide 15 text

[Chart: Pages per month (2023–24). X-axis ticks: Jan, Apr, Jul, Oct, Jan. Y-axis: 0 to 1,500 pages. Series: Daytime, Evening, Night.]

Slide 18

Slide 18 text

Problem: convincing yourselves you have a problem

Slide 19

Slide 19 text

We didn’t have those graphs

Slide 20

Slide 20 text

“Not much to report.” “Nothing interesting from my shift.”

Slide 21

Slide 21 text

If nothing interesting happened, why did a computer wake you up every night?

Slide 22

Slide 22 text

The way out of this situation is data

Slide 23

Slide 23 text

[Chart: Pages per month (2023–24). X-axis ticks: Jan, Apr, Jul, Oct, Jan. Y-axis: 0 to 1,500 pages. Series: Daytime, Evening, Night.]
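A pages-per-month breakdown by time of day can be produced from nothing more than a list of page timestamps. The following is a minimal sketch, not the speaker's tooling: the hour boundaries for Daytime, Evening and Night are assumptions, and the sample timestamps are made up purely to exercise the function.

```python
from collections import Counter
from datetime import datetime


def time_of_day(ts: datetime) -> str:
    """Bucket a page by local hour. The boundaries are assumed, not from the talk."""
    if 8 <= ts.hour < 18:
        return "Daytime"
    if 18 <= ts.hour < 23:
        return "Evening"
    return "Night"


def pages_per_month(timestamps):
    """Count pages per (YYYY-MM, time-of-day bucket)."""
    counts = Counter()
    for ts in timestamps:
        counts[(ts.strftime("%Y-%m"), time_of_day(ts))] += 1
    return counts


# Made-up sample data, only to show the shape of the result.
pages = [
    datetime(2023, 8, 1, 3, 15),
    datetime(2023, 8, 1, 14, 0),
    datetime(2023, 8, 2, 2, 40),
    datetime(2023, 9, 5, 20, 30),
]
counts = pages_per_month(pages)
for (month, bucket), n in sorted(counts.items()):
    print(month, bucket, n)
```

Plotting these counts as stacked bars per month gives a chart of the same shape as the one on the slide.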

Slide 24

Slide 24 text

So how do we fix it?

Slide 25

Slide 25 text

1. Group alerts by name
2. Sort by frequency
3. Categorise each alert

Slide 28

Slide 28 text

Alert frequency

Alert Name                  Count
HttpErrorRateHigh           37
TooManyUnhealthyReplicas    21
AutoscalerMaxedOut          5
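Grouping alerts by name and sorting by frequency takes only a few lines. This is a sketch under assumptions: the `pages` list stands in for whatever export your paging tool provides, and the `alertname` key is a hypothetical field name. The counts mirror the table on the slide.

```python
from collections import Counter


def alert_frequency(pages):
    """Group pages by alert name and sort by descending count."""
    return Counter(p["alertname"] for p in pages).most_common()


# Hypothetical export of past pages; counts mirror the slide's table.
pages = (
    [{"alertname": "HttpErrorRateHigh"}] * 37
    + [{"alertname": "TooManyUnhealthyReplicas"}] * 21
    + [{"alertname": "AutoscalerMaxedOut"}] * 5
)

table = alert_frequency(pages)
for name, count in table:
    print(f"{name:<28}{count:>5}")
```

The third step, categorising each alert, stays manual: it needs someone who knows whether the alert was right.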

Slide 29

Slide 29 text

Categorise how?

Slide 30

Slide 30 text

Alerts that are mostly right vs Alerts that are mostly wrong

Slide 32

Slide 32 text

Easier socially, harder technically

Slide 33

Slide 33 text

Buy-in for fixing bugs and improving automation

Slide 34

Slide 34 text

Alerts that are mostly right vs Alerts that are mostly wrong

Slide 35

Slide 35 text

Easier technically, harder socially

Slide 36

Slide 36 text

Changing people’s minds is harder than changing a couple of lines of PromQL

Slide 37

Slide 37 text

Background noise: alerts that were once useful

Slide 38

Slide 38 text

Compelling reasons
- Pager fatigue: we miss real issues
- Tiredness: people can’t do their best work
- Learned helplessness: we don’t believe we can improve things

Slide 42

Slide 42 text

2 choices

Slide 43

Slide 43 text

Make it more precise or Delete the alert

Slide 45

Slide 45 text

Bonus: alerts that are right, but in an annoying way

Slide 46

Slide 46 text

Excessive urgency: send to Slack or create a ticket
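Lowering urgency amounts to routing on something other than "everything pages". A minimal sketch, assuming each alert carries a `severity` label with the hypothetical values `critical` and `warning`; neither the label nor the destination names come from the talk.

```python
def route(alert: dict) -> str:
    """Pick a destination from an assumed `severity` label."""
    severity = alert.get("labels", {}).get("severity")
    destinations = {"critical": "pager", "warning": "slack"}
    # Anything unlabelled or unknown becomes a ticket rather than a page.
    return destinations.get(severity, "ticket")


print(route({"labels": {"severity": "warning"}}))
```

Defaulting to a ticket rather than a page is a design choice: an alert has to earn the right to wake someone up.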

Slide 48

Slide 48 text

Flappy alerts: calculate rate over a longer window
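The effect of a longer window can be seen on a toy series. This sketch uses synthetic per-minute error ratios and an assumed threshold of 0.2; it is an illustration of the idea, not the speaker's alert rule. In PromQL terms it roughly corresponds to widening the range selector passed to `rate()`.

```python
def windowed_rate(samples, window):
    """Mean of each full trailing window of `window` samples."""
    return [
        sum(samples[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(samples))
    ]


def transitions(rates, threshold):
    """Count how often the alert condition flips between firing and resolved."""
    states = [r > threshold for r in rates]
    return sum(1 for a, b in zip(states, states[1:]) if a != b)


# Synthetic data: a one-minute error spike every fourth minute.
samples = [0.5 if i % 4 == 0 else 0.0 for i in range(40)]

short_window_flaps = transitions(windowed_rate(samples, window=1), threshold=0.2)
long_window_flaps = transitions(windowed_rate(samples, window=8), threshold=0.2)
print(short_window_flaps, long_window_flaps)
```

The trade-off is detection delay: a longer window smooths out brief spikes but also takes longer to cross the threshold when a real, sustained problem starts.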

Slide 50

Slide 50 text

Pager storms: use alert grouping/inhibition
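Alertmanager provides grouping and inhibition as configuration; the sketch below only illustrates the logic behind them. The alert name `ClusterUnreachable` and the `cluster` and `instance` labels are hypothetical, chosen for the example.

```python
from collections import defaultdict


def inhibit(alerts, source_name, target_names, equal=("cluster",)):
    """Drop target alerts while a source alert with the same `equal` labels fires."""
    sources = {
        tuple(a["labels"].get(label) for label in equal)
        for a in alerts
        if a["labels"]["alertname"] == source_name
    }
    return [
        a for a in alerts
        if not (
            a["labels"]["alertname"] in target_names
            and tuple(a["labels"].get(label) for label in equal) in sources
        )
    ]


def group_alerts(alerts, by=("alertname", "cluster")):
    """Collapse alerts that share the `by` labels into one notification each."""
    groups = defaultdict(list)
    for a in alerts:
        groups[tuple(a["labels"].get(label, "") for label in by)].append(a)
    return dict(groups)


alerts = [
    {"labels": {"alertname": "ClusterUnreachable", "cluster": "a"}},
    {"labels": {"alertname": "HttpErrorRateHigh", "cluster": "a", "instance": "1"}},
    {"labels": {"alertname": "HttpErrorRateHigh", "cluster": "a", "instance": "2"}},
    {"labels": {"alertname": "HttpErrorRateHigh", "cluster": "a", "instance": "3"}},
    {"labels": {"alertname": "HttpErrorRateHigh", "cluster": "b", "instance": "1"}},
]

remaining = inhibit(alerts, "ClusterUnreachable", {"HttpErrorRateHigh"})
notifications = group_alerts(remaining)
print(len(alerts), "alerts ->", len(notifications), "notifications")
```

Here five firing alerts become two notifications: the symptom alerts in cluster `a` are suppressed by their likely cause, and the rest are grouped.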

Slide 52

Slide 52 text

Thorny case: Real problems in software outside your control

Slide 53

Slide 53 text

Inside your company, across team boundaries

Slide 54

Slide 54 text

Usually fixed by another team? Then it is currently owned by the wrong team

Slide 56

Slide 56 text

Open source and third-party software

Slide 57

Slide 57 text

Work with the maintainers!

Slide 58

Slide 58 text

https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/1951

Slide 59

Slide 59 text

In the meantime… protect your team

Slide 60

Slide 60 text

Automated remediation 101

Slide 61

Slide 61 text

Restarting the software fixes the software (temporarily)

Slide 63

Slide 63 text

Problem shape
- Recurring problem: happens regularly
- Reliable detection: highly correlated alert
- Mechanical fix: on-caller follows runbook

Slide 67

Slide 67 text

Waking someone up to apply a mechanical fix is a terrible use of their time

Slide 68

Slide 68 text

Mechanical work is what computers are great at!

Slide 69

Slide 69 text

We wrote a tool: auto-repair

Slide 70

Slide 70 text

Write a non-paging alert that goes off before your paging one

Slide 71

Slide 71 text

auto-repair (simplified)

alerts = get("prom:9090/api/v1/alerts")
issues = filter_fixable(alerts)
for i in issues do
  // for most issues, restart process
  apply_fix(i)
end
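The simplified loop on the slide could look roughly like this in Python. It is a sketch under assumptions, not the actual auto-repair tool: the alert name `ProcessStuck`, the `instance` label and the `restart` callback are hypothetical. The only real interface used is Prometheus's `/api/v1/alerts` endpoint, which returns active alerts under `data.alerts`.

```python
import json
from urllib.request import urlopen

# Assumed values: the host matches the slide, the alert name is a placeholder.
PROMETHEUS_ALERTS_URL = "http://prom:9090/api/v1/alerts"
FIXABLE_ALERTS = {"ProcessStuck"}  # hypothetical non-paging alert


def fetch_alerts(url=PROMETHEUS_ALERTS_URL):
    """Return the active alerts from the Prometheus HTTP API."""
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)["data"]["alerts"]


def filter_fixable(alerts):
    """Keep firing alerts that have a known mechanical fix."""
    return [
        a for a in alerts
        if a.get("state") == "firing"
        and a.get("labels", {}).get("alertname") in FIXABLE_ALERTS
    ]


def run_once(fetch=fetch_alerts, restart=print):
    """One pass of the loop: fetch, filter, apply the fix to each issue."""
    issues = filter_fixable(fetch())
    for issue in issues:
        # For most issues the fix is to restart the process.
        restart(issue["labels"].get("instance", ""))
    return issues


# Dry run with canned data instead of a live Prometheus.
fake_alerts = [
    {"state": "firing", "labels": {"alertname": "ProcessStuck", "instance": "db-1"}},
    {"state": "pending", "labels": {"alertname": "ProcessStuck", "instance": "db-2"}},
    {"state": "firing", "labels": {"alertname": "HttpErrorRateHigh", "instance": "web-1"}},
]
restarted = []
run_once(fetch=lambda: fake_alerts, restart=restarted.append)
print(restarted)
```

Passing `fetch` and `restart` in as parameters keeps the loop testable without touching a real cluster.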

Slide 75

Slide 75 text

It’s that simple (kinda)

Slide 77

Slide 77 text

Runaway automation

Slide 78

Slide 78 text

What the tool doesn’t do is more important than what it does do

Slide 79

Slide 79 text

3 limitations

Don’t restart:
- Too many processes with the same issue
- The same instance repeatedly
- Processes that have already paged
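The three limitations can be expressed as a guard that runs before any restart. This is a sketch: the thresholds `MAX_AFFECTED` and `MIN_RESTART_INTERVAL` and the shape of the bookkeeping arguments are assumptions, not values from the talk.

```python
MAX_AFFECTED = 3               # assumed: more than this looks like a wider outage
MIN_RESTART_INTERVAL = 3600.0  # assumed: seconds between restarts of one instance


def should_restart(instance, affected, last_restart, paged, now):
    """Return True only if none of the three limits applies.

    affected: instances currently showing the same issue
    last_restart: mapping of instance -> time of its last automated restart
    paged: instances whose paging alert has already fired
    """
    if len(affected) > MAX_AFFECTED:
        return False  # too many processes with the same issue
    previous = last_restart.get(instance)
    if previous is not None and now - previous < MIN_RESTART_INTERVAL:
        return False  # the same instance repeatedly
    if instance in paged:
        return False  # a human is already involved
    return True


print(should_restart("db-1", {"db-1"}, {}, set(), now=1000.0))
```

Each refusal leaves the paging alert as the fallback, so a person still hears about the cases the automation declines to handle.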

Slide 83

Slide 83 text

This prevents somewhere between the high single digits and the low tens of pages per week

Slide 84

Slide 84 text

(yes, we still file bugs)

Slide 85

Slide 85 text

What did we learn?

Slide 86

Slide 86 text

You need long-term buy-in

Slide 87

Slide 87 text

Talk about how it impacts customers

Slide 88

Slide 88 text

Embrace hacky fixes that help you survive

Slide 89

Slide 89 text

Dumb ideas that work aren’t dumb

Slide 90

Slide 90 text

Good things happen if you make it a habit

Slide 91

Slide 91 text

[Chart: Pages per month (2023–24). X-axis ticks: Jan, Apr, Jul, Oct, Jan. Y-axis: 0 to 1,500 pages. Series: Daytime, Evening, Night.]

Slide 92

Slide 92 text

[Chart: Pages per month (2023–24). X-axis ticks: Jan, Apr, Jul, Oct, Jan, Apr, Jul. Y-axis: 0 to 1,500 pages. Series: Daytime, Evening, Night.]

Slide 94

Slide 94 text

Thank you ✌❤ @planetscaledata sinjo.dev

Slide 95

Slide 95 text

Image credits
• Analog Alarm Clock in Morning Sunlight - Ruslan Sikunov - https://www.pexels.com/photo/analog-alarm-clock-in-morning-sunlight-19188894/

Slide 96

Slide 96 text

No content

Slide 97

Slide 97 text

Questions? ✌❤ @planetscaledata sinjo.dev