HOW TO ESTABLISH AN
INCIDENT MANAGEMENT
PROGRAM
@tammybutow
Slide 2
Slide 2 text
What do jelly beans
have to do with incident management?
Slide 3
Slide 3 text
Insert kid crying
Slide 4
Slide 4 text
Insert kid running around
Slide 5
Slide 5 text
Insert calm kid calling on the phone
Slide 6
Slide 6 text
Insert Jelly Beans
Slide 7
Slide 7 text
Insert photo of my mum and me
Slide 8
Slide 8 text
Hi, I’m Tammy Butow,
SRE @ gremlin.com
I’ve worked on high severity incidents my entire life,
and I’ve gotten better at it!
Slide 9
Slide 9 text
10+ years.
Slide 10
Slide 10 text
Gremlin
Dropbox
DigitalOcean
National Australia Bank
Queensland University of Technology
My home in Eastwood, NSW, Australia
Slide 11
Slide 11 text
How do you empower everyone in
your company to identify problems
and get help?
Slide 12
Slide 12 text
Empower
Everyone.
Slide 13
Slide 13 text
Insert illustration of a building
Slide 14
Slide 14 text
Has that ever happened where
you’ve worked?
Slide 15
Slide 15 text
FOR YOUR ENTIRE COMPANY
Slide 16
Slide 16 text
One common misconception…
Slide 17
Slide 17 text
All people who resolve
incidents are heroes.
Slide 18
Slide 18 text
Hero vs Helper
Slide 19
Slide 19 text
I’m a helper.
Slide 20
Slide 20 text
No content
Slide 21
Slide 21 text
What is High Severity Incident Management?
Slide 22
Slide 22 text
SEVs
Slide 23
Slide 23 text
What are the 4 most common types of SEVs?
Slide 24
Slide 24 text
1. The Availability Drop
Slide 25
Slide 25 text
No content
Slide 26
Slide 26 text
2. The Broken Feature
Slide 27
Slide 27 text
No content
Slide 28
Slide 28 text
3. The Loss of Data
Slide 29
Slide 29 text
Cry baby
Slide 30
Slide 30 text
4. The Security Risk
Slide 31
Slide 31 text
No content
Slide 32
Slide 32 text
Let’s take a journey together outside this room
Slide 33
Slide 33 text
Put on your SEV backpack
Slide 34
Slide 34 text
Monday 7pm
Slide 35
Slide 35 text
You’re out on a date enjoying a lovely dinner
Slide 36
Slide 36 text
You start getting errors from the
database for your service.
“MySQL server has gone away”.
Slide 37
Slide 37 text
You use the SEV tool to get help
Slide 38
Slide 38 text
Getting errors, app having issues too.
Not sure what’s happening yet. MySQL?
SEV Reported by you:
Current SEV Level: 1
Slide 39
Slide 39 text
IMOC is auto-paged and on the case
Slide 40
Slide 40 text
The SEV is automatically named
Slide 41
Slide 41 text
SEV 1 Fast Frog
Slide 42
Slide 42 text
The IMOC finds a TLOC to resolve the issue
Slide 43
Slide 43 text
Tons of teams across the company getting alerts
It’s an alert storm!
Slide 44
Slide 44 text
Insert storm pic
Slide 45
Slide 45 text
Everyone across the company
looks in #sevs on Slack
and checks the sevs@ mailing list for updates
Slide 46
Slide 46 text
Threads running is high, the database is hot!
Slide 47
Slide 47 text
No content
Slide 48
Slide 48 text
Database is being hammered!
Slide 49
Slide 49 text
What’s happening?
Slide 50
Slide 50 text
TLOC is looking at the database queries
Slide 51
Slide 51 text
No content
Slide 52
Slide 52 text
Normal queries, nothing has changed
Slide 53
Slide 53 text
More queries than usual
Slide 54
Slide 54 text
Where are they coming from?
Slide 55
Slide 55 text
Our queries have metadata for the service
Slide 56
Slide 56 text
1. It’s the API
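A minimal sketch of how service metadata on queries could be used to find the source of the extra load. The slides do not show the tagging format, so the leading `/* service=... */` comment, the function names, and the sample queries are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed tag format: a SQL comment such as /* service=api */ on each query.
SERVICE_TAG = re.compile(r"/\*\s*service=([\w-]+)\s*\*/")


def service_of(query: str) -> str:
    """Return the service named in the query's comment tag, or 'unknown'."""
    match = SERVICE_TAG.search(query)
    return match.group(1) if match else "unknown"


def queries_per_service(queries):
    """Count queries per originating service, busiest first."""
    return Counter(service_of(q) for q in queries).most_common()


# Hypothetical sample of captured queries.
sample = [
    "/* service=api */ SELECT * FROM files WHERE owner_id = 7",
    "/* service=api */ SELECT * FROM files WHERE owner_id = 7",
    "/* service=www */ SELECT name FROM users WHERE id = 3",
    "SELECT 1",
]
print(queries_per_service(sample))
```

Untagged queries are counted as "unknown" rather than dropped, so gaps in the tagging are visible during a SEV.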
Slide 57
Slide 57 text
PUT THAT EVIDENCE IN
YOUR BACKPACK
Slide 58
Slide 58 text
Alarm!
Availability SLA is breached for WWW and API
Slide 59
Slide 59 text
SEV is upgraded to a SEV 0
Slide 60
Slide 60 text
SEV 0 Fast Frog
Slide 61
Slide 61 text
Automation in full force
Slide 62
Slide 62 text
Executive Leadership Team are auto-emailed
Slide 63
Slide 63 text
We have only 15 min remaining
to resolve the SEV 0
Slide 64
Slide 64 text
15 MINUTES
Slide 65
Slide 65 text
Keep going!
Slide 66
Slide 66 text
Start killing queries to restore service
Slide 67
Slide 67 text
No content
Slide 68
Slide 68 text
Are the queries in the slow log
from one user or many users?
Slide 69
Slide 69 text
2. It’s mostly one user
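One way to answer the "one user or many" question is to measure how concentrated the slow queries are. This sketch assumes the slow log entries have already been reduced to one user id per entry; how that id is obtained, the function name, and the sample ids are hypothetical.

```python
from collections import Counter


def top_user_share(user_ids):
    """Return (user_id, share_of_queries) for the busiest user, or (None, 0.0)."""
    counts = Counter(user_ids)
    if not counts:
        return None, 0.0
    user_id, n = counts.most_common(1)[0]
    return user_id, n / sum(counts.values())


# Hypothetical user ids, one per slow-log entry.
slow_log_users = ["u42", "u42", "u42", "u7", "u42", "u13", "u42", "u42"]
user_id, share = top_user_share(slow_log_users)
print(f"{user_id} issued {share:.0%} of the slow queries")
```

A share close to 100% points at a single user; a share spread evenly across many users points at a broader change in traffic.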
Slide 70
Slide 70 text
PUT THAT EVIDENCE IN
YOUR BACKPACK
Slide 71
Slide 71 text
Is the one user legitimate?
Slide 72
Slide 72 text
What kind of workload are they performing?
Slide 73
Slide 73 text
3. It’s a heavy workload, heavier than
we usually get.
Slide 74
Slide 74 text
PUT THAT EVIDENCE IN
YOUR BACKPACK
Slide 75
Slide 75 text
Do we have rate limiting and throttling?
Slide 76
Slide 76 text
4. It isn’t working well in this situation
Slide 77
Slide 77 text
PUT THAT EVIDENCE IN
YOUR BACKPACK
Slide 78
Slide 78 text
Let’s temporarily kill queries for this user.
We can use a query kill loop
or use the support app.
Then service will return to normal for everyone.
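A sketch of the core of a query kill loop, under stated assumptions. It takes rows shaped like MySQL `SHOW PROCESSLIST` output and builds `KILL QUERY` statements for one user. The `/* user_id=... */` tag used to identify the user's queries is a hypothetical convention, and the sketch only generates statements; it does not connect to a database.

```python
import re

# Hypothetical convention: each query carries a /* user_id=N */ comment.
USER_TAG = re.compile(r"/\*\s*user_id=(\d+)\s*\*/")


def kill_statements(processlist, user_id):
    """Build KILL QUERY statements for the running queries of one user."""
    statements = []
    for row in processlist:
        info = row.get("Info") or ""  # Info is NULL for idle connections
        match = USER_TAG.search(info)
        if row.get("Command") == "Query" and match and int(match.group(1)) == user_id:
            statements.append(f"KILL QUERY {int(row['Id'])};")
    return statements


# Hypothetical rows with the relevant SHOW PROCESSLIST columns.
rows = [
    {"Id": 101, "Command": "Query", "Info": "/* user_id=42 */ SELECT ..."},
    {"Id": 102, "Command": "Sleep", "Info": None},
    {"Id": 103, "Command": "Query", "Info": "/* user_id=7 */ SELECT ..."},
    {"Id": 104, "Command": "Query", "Info": "/* user_id=42 */ UPDATE ..."},
]
for statement in kill_statements(rows, user_id=42):
    print(statement)
```

In a real loop this would be re-run against a fresh process list every few seconds until the workload stops. `KILL QUERY` ends the running statement but leaves the connection open, which keeps the mitigation temporary and narrow.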
Slide 79
Slide 79 text
SLA is back on track
MITIGATED the SEV 0 in 5 minutes!
Slide 80
Slide 80 text
Let’s open up our evidence backpack
Slide 81
Slide 81 text
Our Evidence Backpack
It’s the API
It’s one user
It’s a heavier workload
Our rate limiting & throttling can’t
handle this workload
We temporarily resolved it by killing
queries from this customer
Slide 82
Slide 82 text
Let’s check what rate limiting
and throttling is currently set to
Slide 83
Slide 83 text
We need to fix that, add an action item.
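The slides do not say which rate limiting scheme is in place. As one possible shape for the action item, here is a sketch of a per-user token bucket, a common technique chosen here as an assumption. The class names, rates, and capacities are illustrative only.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.updated = clock()

    def allow(self, cost=1.0):
        """Spend `cost` tokens if available; return whether the request may proceed."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class PerUserLimiter:
    """Keep one bucket per user so a heavy user cannot starve the others."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.buckets = {}

    def allow(self, user_id):
        bucket = self.buckets.get(user_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.clock)
            self.buckets[user_id] = bucket
        return bucket.allow()
```

The `clock` parameter exists so the limiter can be tested with a fake clock instead of waiting in real time.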
Slide 84
Slide 84 text
Let’s also reach out to the customer
and understand this heavy workload
they are performing
Slide 85
Slide 85 text
They do batch-style processing using our API.
They plan to do it Monday 7pm every week.
How can we better support it long-term?
Slide 86
Slide 86 text
That’s what a SEV 0 looks like
Slide 87
Slide 87 text
What are SEV levels?
Slide 88
Slide 88 text
How To Establish SEV levels - Diabetes
SEV Level | Description                 | Target resolution time  | Who is notified
SEV 0     | Catastrophic Service Impact | Resolve within 10 min   | Ambulance
SEV 1     | Critical Service Impact     | Resolve within 8 hours  | Neighbour & Best Friend
SEV 2     | High Service Impact         | Resolve within 24 hours | Best Friend
Slide 89
Slide 89 text
How To Establish SEV levels
SEV Level | Description                 | Target resolution time  | Who is notified
SEV 0     | Catastrophic Service Impact | Resolve within 15 min   | Entire company
SEV 1     | Critical Service Impact     | Resolve within 8 hours  | Teams working on SEV & CTO
SEV 2     | High Service Impact         | Resolve within 24 hours | Teams working on SEV
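The company SEV levels above can be kept as data so that tooling can look up targets and notification lists. This is a sketch only: the descriptions, targets, and audiences come from the slide, while the `SevLevel` type, the `resolve_by` helper, and the example start time are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class SevLevel:
    level: int
    description: str
    target: timedelta  # target resolution time
    notify: tuple      # who is notified


SEV_LEVELS = {
    0: SevLevel(0, "Catastrophic Service Impact", timedelta(minutes=15),
                ("Entire company",)),
    1: SevLevel(1, "Critical Service Impact", timedelta(hours=8),
                ("Teams working on SEV", "CTO")),
    2: SevLevel(2, "High Service Impact", timedelta(hours=24),
                ("Teams working on SEV",)),
}


def resolve_by(level, started_at):
    """Return the time by which a SEV of this level should be resolved."""
    return started_at + SEV_LEVELS[level].target


# Hypothetical start time: a Monday at 7pm.
started = datetime(2024, 1, 1, 19, 0)
print(resolve_by(0, started))
print(SEV_LEVELS[1].notify)
```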
How can you be ready to
sprint to mitigation at any moment?
Slide 95
Slide 95 text
What is the full lifecycle of a SEV?
Slide 96
Slide 96 text
No content
Slide 97
Slide 97 text
How are SEVs measured?
Slide 98
Slide 98 text
% loss * outage duration
Slide 99
Slide 99 text
How do you create SEV levels
for your company?
Slide 100
Slide 100 text
SEV levels for data loss
SEV Level | Data Loss Impact
SEV 0     | Loss of customer data
SEV 1     | Loss of primary backup
SEV 2     | Loss of secondary backup
Slide 101
Slide 101 text
No content
Slide 102
Slide 102 text
What does a SEV look like?
Slide 103
Slide 103 text
No content
Slide 104
Slide 104 text
We measure this SEV as:
0.2% * 30 min (6) for WWW
0.11% * 30 min (3.3) for API
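The measurement is a single multiplication, percentage loss times outage duration. The sketch below reproduces the two figures on the slide; the function name and the reading of the result as "percent-minutes" are assumptions.

```python
def sev_impact(percent_loss, outage_minutes):
    """Return the SEV measure: % loss multiplied by outage duration in minutes."""
    if percent_loss < 0 or outage_minutes < 0:
        raise ValueError("loss and duration must be non-negative")
    return percent_loss * outage_minutes


# Figures from the slide: 0.2% for WWW and 0.11% for API, both over 30 minutes.
for service, loss in (("WWW", 0.2), ("API", 0.11)):
    print(f"{service}: {sev_impact(loss, 30):.1f}")
```

Because both factors count equally, a small loss over a long outage can score the same as a large loss over a short one.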
Slide 105
Slide 105 text
How do you ensure your team operates
effectively during a SEV 0?
Slide 106
Slide 106 text
Incident Manager On-Call (IMOC)
Slide 107
Slide 107 text
Small Rotation of Engineering Leaders
Slide 108
Slide 108 text
One person is on-call in this role
at any point in time
Slide 109
Slide 109 text
Can be paged by emailing imoc-pager@
Slide 110
Slide 110 text
Wide knowledge of services
and engineering teams
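A sketch of how a small IMOC rotation could be scheduled so that exactly one person holds the role at any time. The weekly shift length, the names, and the start date are assumptions; the slides only say the rotation is small and one person is on-call at a time.

```python
from datetime import date


def imoc_for(day, rotation, start):
    """Return who is IMOC on `day`, handing over every 7 days from `start`."""
    if not rotation:
        raise ValueError("rotation must not be empty")
    if day < start:
        raise ValueError("day is before the rotation started")
    weeks_elapsed = (day - start).days // 7
    return rotation[weeks_elapsed % len(rotation)]


# Hypothetical rotation of three engineering leaders, starting on a Monday.
rotation = ("alex", "sam", "priya")
start = date(2024, 1, 1)
print(imoc_for(date(2024, 1, 10), rotation, start))
```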
Slide 111
Slide 111 text
Tech Lead On-Call (TLOC)
Slide 112
Slide 112 text
The engineer responsible for
resolving the SEV
Slide 113
Slide 113 text
Deep knowledge of own
service area
Slide 114
Slide 114 text
Deep knowledge of upstream and
downstream dependencies
Slide 115
Slide 115 text
How do you set up IMOCs for
success during SEV 0s?
Slide 116
Slide 116 text
How do you categorise SEVs?
Slide 117
Slide 117 text
No content
Slide 118
Slide 118 text
How do you empower everyone in
your company to fix things that are broken?
Slide 119
Slide 119 text
No content
Slide 120
Slide 120 text
No content
Slide 121
Slide 121 text
How should you name SEVs?
Slide 122
Slide 122 text
0086343430
Slide 123
Slide 123 text
SEV 0 Fast Frog
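A sketch of automatic naming in the "Fast Frog" style. The word lists and function names are hypothetical. The codename is generated once and kept separate from the level, so the same incident can move from SEV 1 to SEV 0 without being renamed.

```python
import random

# Hypothetical word lists; any memorable adjective and animal pairs would do.
ADJECTIVES = ("Fast", "Quiet", "Brave", "Sunny")
ANIMALS = ("Frog", "Otter", "Falcon", "Koala")


def new_codename(rng=random):
    """Pick a two-word codename that is easier to say than a ticket number."""
    return f"{rng.choice(ADJECTIVES)} {rng.choice(ANIMALS)}"


def display_name(level, codename):
    """Combine the current SEV level with the incident's fixed codename."""
    return f"SEV {level} {codename}"


codename = new_codename()
print(display_name(1, codename))  # as first reported
print(display_name(0, codename))  # after an upgrade, same codename
```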
Slide 124
Slide 124 text
No content
Slide 125
Slide 125 text
What causes SEVs?
Slide 126
Slide 126 text
Pareto Principle
Slide 127
Slide 127 text
Technical & Cultural Issues
Slide 128
Slide 128 text
What are some of the expected issues
you are likely to experience?
Slide 129
Slide 129 text
Technical Issues
Dependency Failure
Region/Zone Failure
Provider Failure
Overheating
PDU failure
Network upgrades
Rack failures
Core Switch failures
Connectivity issues
Flaky DNS
Misconfigured machines
Bugs
Corrupt or unavailable backups
Cultural Issues
Lack of knowledge sharing
Lack of knowledge handover
Lack of on-call training
Lack of chaos engineering
Lack of an incident management program
Lack of documentation and playbooks
Lack of alerts and pages
Lack of effective alerting thresholds
Lack of backup strategy
Slide 130
Slide 130 text
How do you prevent SEVs from repeating?
Slide 131
Slide 131 text
Let’s look at high-impact practices…
Slide 132
Slide 132 text
An Incident Management Program
Slide 133
Slide 133 text
A helpful IMOC Rotation
Slide 134
Slide 134 text
Automation Tooling For Incident Management
Slide 135
Slide 135 text
Chaos Engineering
Slide 136
Slide 136 text
Insert calm kid calling on the phone
Calling for help when an incident happens is awesome!
Slide 137
Slide 137 text
Calling for help when an incident happens is awesome!
Slide 138
Slide 138 text
Create Your Own Incident Management Program
1. Determine how you will measure SEVs
2. Determine your SEV Levels
3. Set your SLOs
4. Create your IMOC rotation
5. Start using automation tooling for SEVs
6. Build a critical service dashboard
Slide 139
Slide 139 text
It’s a beautiful day to start
Slide 140
Slide 140 text
Learn from and help others on this journey:
Join the Chaos & Reliability Community
gremlin.com/community
Thank you
@tammybutow tammy@gremlin.com
gremlin.com/slack