[Chart: Pages per month (2023–24). X-axis: Aug to Mar. Y-axis: 0 to 1,500. Series: Daytime, Evening, Night.]
...
Slide 12
Slide 12 text
[Chart: Pages per month (2023–24). X-axis: Aug to Mar. Y-axis: 0 to 1,500. Series: Daytime, Evening, Night.]
Slide 13
Slide 13 text
How did we
get here?
Slide 14
Slide 14 text
[Chart: Pages per month (2023–24). X-axis: Aug to Mar. Y-axis: 0 to 1,500. Series: Daytime, Evening, Night.]
Slide 15
Slide 15 text
[Chart: Pages per month (2023–24). X-axis: Jan, Apr, Jul, Oct, Jan. Y-axis: 0 to 1,500. Series: Daytime, Evening, Night.]
Slide 18
Slide 18 text
Problem:
convincing
yourselves you
have a problem
Slide 19
Slide 19 text
We didn’t
have those
graphs
Slide 20
Slide 20 text
“Not much to report.”
“Nothing interesting
from my shift.”
Slide 21
Slide 21 text
If nothing interesting
happened,
why did a computer
wake you up
every night?
Slide 22
Slide 22 text
The way out of
this situation is
data
Slide 23
Slide 23 text
[Chart: Pages per month (2023–24). X-axis: Jan, Apr, Jul, Oct, Jan. Y-axis: 0 to 1,500. Series: Daytime, Evening, Night.]
Slide 24
Slide 24 text
So how
do we
fix it?
Slide 25
Slide 25 text
1. Group alerts by name
2. Sort by frequency
3. Categorise each alert
Slide 28
Slide 28 text
Alert frequency
Alert Name                  Count
HttpErrorRateHigh              37
TooManyUnhealthyReplicas       21
AutoscalerMaxedOut              5
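The first two steps (group by name, sort by frequency) can be sketched in a few lines of Python. This is only an illustration: the `pages` list and its `alertname` key are assumptions about what an export from a paging tool might look like, and the counts are copied from the table above. The third step, categorising each alert, remains a human judgment.

```python
from collections import Counter


def alert_frequency(pages):
    """Group pages by alert name and sort by how often each alert fired."""
    counts = Counter(page["alertname"] for page in pages)
    # most_common() returns (name, count) pairs, highest count first.
    return counts.most_common()


# Hypothetical export: one dict per page, built to match the table above.
pages = (
    [{"alertname": "HttpErrorRateHigh"}] * 37
    + [{"alertname": "TooManyUnhealthyReplicas"}] * 21
    + [{"alertname": "AutoscalerMaxedOut"}] * 5
)

for name, count in alert_frequency(pages):
    print(f"{name:<28}{count:>5}")
```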
Slide 29
Slide 29 text
Categorise
how?
Slide 30
Slide 30 text
Alerts that are mostly right
vs
Alerts that are mostly wrong
Slide 32
Slide 32 text
Easier socially
Harder technically
Slide 33
Slide 33 text
Buy-in for fixing bugs and
improving automation
Slide 34
Slide 34 text
Alerts that are mostly right
vs
Alerts that are mostly wrong
Slide 35
Slide 35 text
Easier technically
Harder socially
Slide 36
Slide 36 text
Changing people’s minds
is harder than changing
a couple of lines of
PromQL
Slide 37
Slide 37 text
Background noise
Alerts that were
once useful
Slide 38
Slide 38 text
Compelling reasons
- Pager fatigue: we miss real issues
- Tiredness: people can’t do their best work
- Learned helplessness: we don’t believe we
can improve things
Slide 42
Slide 42 text
2
choices
Slide 43
Slide 43 text
Make it more precise
or
Delete the alert
Slide 45
Slide 45 text
Bonus
Alerts that are right,
but in an annoying way
Slide 46
Slide 46 text
Excessive urgency
‑
Send to Slack or
create a ticket
Slide 48
Slide 48 text
Flappy alerts
‑
Calculate rate over
longer window
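To see why a longer window calms a flappy alert, here is a small simulation in Python. It is a sketch rather than PromQL: the samples, the 5% threshold and the window sizes are all made up, and the windowed mean only approximates what widening the range of a rate calculation does.

```python
def firing_states(error_ratios, window, threshold):
    """Alert state per sample: True when the mean of the last `window`
    samples exceeds `threshold`. Stays False until a full window exists."""
    states = []
    for i in range(len(error_ratios)):
        if i + 1 < window:
            states.append(False)  # not enough history yet
            continue
        recent = error_ratios[i + 1 - window : i + 1]
        states.append(sum(recent) / window > threshold)
    return states


def count_notifications(states):
    """Count transitions from not-firing to firing; each one would notify."""
    fired = 0
    previous = False
    for state in states:
        if state and not previous:
            fired += 1
        previous = state
    return fired


# Made-up data: 20 minutes of brief spikes, then 10 minutes of sustained errors.
samples = [0.08, 0.00] * 10 + [0.08] * 10

short_window = count_notifications(firing_states(samples, window=1, threshold=0.05))
long_window = count_notifications(firing_states(samples, window=10, threshold=0.05))
print(short_window, long_window)
```

With these numbers the one-sample window notifies on every spike, while the ten-sample window stays quiet through the spikes and notifies once when the errors become sustained. The trade-off is a slower reaction to the real problem.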
Slide 50
Slide 50 text
Pager storms
‑
Use alert grouping/
inhibition
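The two ideas can be illustrated with plain Python. Alerting tools such as Alertmanager provide grouping and inhibition as configuration, so the functions below are not how you would deploy this; they are a sketch of the behaviour. The `cluster` label and the choice of which alert inhibits which are assumptions made for the example.

```python
from collections import defaultdict


def label_values(alert, names):
    """Return the alert's values for the given label names as a tuple."""
    return tuple(alert["labels"].get(name, "") for name in names)


def group_alerts(alerts, group_by):
    """Grouping: bundle alerts that share the same `group_by` label values,
    so each bundle produces one notification instead of many."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[label_values(alert, group_by)].append(alert)
    return dict(groups)


def inhibit(alerts, source_name, target_name, equal):
    """Inhibition: drop `target_name` alerts while a `source_name` alert
    with the same `equal` label values is also firing."""
    sources = {
        label_values(a, equal)
        for a in alerts
        if a["labels"].get("alertname") == source_name
    }
    return [
        a for a in alerts
        if not (
            a["labels"].get("alertname") == target_name
            and label_values(a, equal) in sources
        )
    ]


def make_alert(name, cluster, instance):
    return {"labels": {"alertname": name, "cluster": cluster, "instance": instance}}


# Hypothetical storm: three instances in cluster "a" fail at once.
storm = [
    make_alert("HttpErrorRateHigh", "a", "a-1"),
    make_alert("HttpErrorRateHigh", "a", "a-2"),
    make_alert("HttpErrorRateHigh", "a", "a-3"),
    make_alert("TooManyUnhealthyReplicas", "a", "a-0"),
    make_alert("HttpErrorRateHigh", "b", "b-1"),
]

grouped = group_alerts(storm, group_by=["alertname", "cluster"])
remaining = inhibit(storm, "TooManyUnhealthyReplicas", "HttpErrorRateHigh", equal=["cluster"])
print(len(storm), len(grouped), len(remaining))
```

Here five alerts collapse into three notifications through grouping alone, and inhibition leaves only two alerts to notify about.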
Slide 52
Slide 52 text
Thorny case:
Real problems in
software
outside your control
Slide 53
Slide 53 text
Inside your company,
across team boundaries
Slide 54
Slide 54 text
Usually fixed by
another team?
‑
Currently owned by the
wrong team
Slide 62
Slide 62 text
Restarting the
software fixes
the software
(temporarily)
Slide 63
Slide 63 text
Problem shape
- Recurring problem: happens regularly
- Reliable detection: highly correlated alert
- Mechanical fix: on-caller follows runbook
Slide 67
Slide 67 text
Waking someone up to apply
a mechanical fix
is a terrible
use of their time
Slide 68
Slide 68 text
Mechanical work
is what computers
are great at!
Slide 69
Slide 69 text
We wrote a tool:
auto-repair
Slide 70
Slide 70 text
Write a non-paging
alert that goes off
before your
paging one
Slide 71
Slide 71 text
auto-repair (simplified)
alerts = get("prom:9090/api/v1/alerts")
issues = filter_fixable(alerts)
for i in issues do
  // for most issues, restart process
  apply_fix(i)
end
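The pseudocode above could look roughly like this in Python. It is a sketch under assumptions, not the real tool: the `auto_repair` label used to mark fixable alerts, the `instance` label, and the `restart` callback are all hypothetical. The only part taken from Prometheus itself is the `/api/v1/alerts` endpoint, which returns active alerts under `data.alerts`.

```python
import json
from urllib.request import urlopen

PROMETHEUS_ALERTS_URL = "http://prom:9090/api/v1/alerts"  # host as on the slide


def fetch_alerts(url=PROMETHEUS_ALERTS_URL):
    """Return the list of active alerts from the Prometheus HTTP API."""
    with urlopen(url, timeout=10) as response:
        return json.load(response)["data"]["alerts"]


def filter_fixable(alerts):
    """Keep firing alerts that opt in through a hypothetical `auto_repair` label."""
    return [
        alert for alert in alerts
        if alert.get("state") == "firing"
        and alert["labels"].get("auto_repair") == "restart"
    ]


def apply_fix(alert, restart):
    """For most issues the fix is a restart; `restart` is supplied by the caller."""
    restart(alert["labels"]["instance"])


def run_once(alerts, restart):
    """One pass of the loop: find fixable issues and apply the fix to each."""
    for issue in filter_fixable(alerts):
        apply_fix(issue, restart)


# Offline demonstration with made-up alerts; a real run would pass fetch_alerts().
sample_alerts = [
    {"state": "firing", "labels": {"alertname": "ProcessWedged", "instance": "app-1", "auto_repair": "restart"}},
    {"state": "pending", "labels": {"alertname": "ProcessWedged", "instance": "app-2", "auto_repair": "restart"}},
    {"state": "firing", "labels": {"alertname": "HttpErrorRateHigh", "instance": "app-3"}},
]
restarted = []
run_once(sample_alerts, restart=restarted.append)
print(restarted)  # prints ['app-1']
```

Passing `restart` in as a function keeps the loop testable without touching real processes.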
Slide 75
Slide 75 text
It’s that
simple
(kinda)
Slide 77
Slide 77 text
Runaway
automation
Slide 78
Slide 78 text
What the tool
doesn’t do is
more important
than what it does do
Slide 79
Slide 79 text
3 limitations
Don’t restart:
- Too many processes with the same issue
- The same instance repeatedly
- Processes that have already paged
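The three limitations can be expressed as a filter that runs before any restart. The sketch below is one possible shape, with assumptions throughout: the two limits, the alert names, and the way recent restarts and paged instances are tracked are all hypothetical.

```python
from collections import Counter

MAX_PER_ALERT = 3               # hypothetical: more than this suggests a wider outage
MAX_RESTARTS_PER_INSTANCE = 2   # hypothetical: restarts allowed in the recent period


def select_restarts(issues, recent_restarts, paged_instances):
    """Return the issues that are safe to restart automatically.

    issues: list of (alert_name, instance) pairs
    recent_restarts: Counter mapping instance -> restarts already done recently
    paged_instances: set of instances whose problem has already paged a human
    """
    per_alert = Counter(name for name, _ in issues)
    safe = []
    for name, instance in issues:
        if per_alert[name] > MAX_PER_ALERT:
            continue  # too many processes with the same issue
        if recent_restarts[instance] >= MAX_RESTARTS_PER_INSTANCE:
            continue  # the same instance repeatedly
        if instance in paged_instances:
            continue  # a human is already looking at this one
        safe.append((name, instance))
    return safe


# Made-up situation: four instances share one issue, three share another.
issues = [
    ("DiskStuck", "d1"), ("DiskStuck", "d2"), ("DiskStuck", "d3"), ("DiskStuck", "d4"),
    ("ProcessWedged", "p1"), ("ProcessWedged", "p2"), ("ProcessWedged", "p3"),
]
safe = select_restarts(
    issues,
    recent_restarts=Counter({"p2": 2}),
    paged_instances={"p3"},
)
print(safe)  # prints [('ProcessWedged', 'p1')]
```

Everything the filter rejects is left for the paging alert, so a human still hears about the cases the tool refuses to handle.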
Slide 83
Slide 83 text
This prevents somewhere between
the high single digits and the low tens
of pages per week
Slide 84
Slide 84 text
(yes, we still
file bugs)
Slide 85
Slide 85 text
What did we
learn?
Slide 86
Slide 86 text
You
need
long-term
buy-in
Slide 87
Slide 87 text
Talk
about
how it
impacts
customers
Slide 88
Slide 88 text
Embrace hacky
fixes that help
you survive
Slide 89
Slide 89 text
Dumb ideas that
work
aren’t dumb
Slide 90
Slide 90 text
Good things
happen if you
make it a habit
Slide 91
Slide 91 text
[Chart: Pages per month (2023–24). X-axis: Jan, Apr, Jul, Oct, Jan. Y-axis: 0 to 1,500. Series: Daytime, Evening, Night.]
Slide 92
Slide 92 text
[Chart: Pages per month (2023–24). X-axis: Jan, Apr, Jul, Oct, Jan, Apr, Jul. Y-axis: 0 to 1,500. Series: Daytime, Evening, Night.]
Slide 94
Slide 94 text
Thank you
✌❤
@planetscaledata
sinjo.dev
Slide 95
Slide 95 text
Image credits
• Analog Alarm Clock in Morning Sunlight - Ruslan Sikunov - https://www.pexels.com/photo/analog-alarm-clock-in-morning-sunlight-19188894/