How To Establish A High Severity Incident Management Program

Slide 1

Slide 1 text

HOW TO ESTABLISH AN INCIDENT MANAGEMENT PROGRAM @tammybutow

Slide 2

Slide 2 text

What do jelly beans have to do with incident management?

Slide 3

Slide 3 text

Insert kid crying

Slide 4

Slide 4 text

Insert kid running around

Slide 5

Slide 5 text

Insert calm kid calling on the phone

Slide 6

Slide 6 text

Insert Jelly Beans

Slide 7

Slide 7 text

Insert photo of my mum and me

Slide 8

Slide 8 text

Hi, I’m Tammy Butow, SRE @ gremlin.com. I’ve worked on high severity incidents my entire life, and I’ve gotten better at it!

Slide 9

Slide 9 text

10+ years.

Slide 10

Slide 10 text

Gremlin
Dropbox
DigitalOcean
National Australia Bank
Queensland University of Technology
My home in Eastwood, NSW, Australia

Slide 11

Slide 11 text

How do you empower everyone in your company to identify problems and get help?

Slide 12

Slide 12 text

Empower Everyone.

Slide 13

Slide 13 text

Insert illustration of a building

Slide 14

Slide 14 text

Has that ever happened where you’ve worked?

Slide 15

Slide 15 text

FOR YOUR ENTIRE COMPANY

Slide 16

Slide 16 text

One common misconception…

Slide 17

Slide 17 text

All people who resolve incidents are heroes.

Slide 18

Slide 18 text

Hero vs Helper

Slide 19

Slide 19 text

I’m a helper.

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

What is High Severity Incident Management?

Slide 22

Slide 22 text

SEVs

Slide 23

Slide 23 text

What are the 4 most common types of SEVs?

Slide 24

Slide 24 text

1. The Availability Drop

Slide 25

Slide 25 text

No content

Slide 26

Slide 26 text

2. The Broken Feature

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

3. The Loss of Data

Slide 29

Slide 29 text

Cry baby

Slide 30

Slide 30 text

4. The Security Risk

Slide 31

Slide 31 text

No content

Slide 32

Slide 32 text

Let’s take a journey together outside this room

Slide 33

Slide 33 text

Put on your SEV backpack

Slide 34

Slide 34 text

Monday 7pm

Slide 35

Slide 35 text

You’re out on a date enjoying a lovely dinner

Slide 36

Slide 36 text

You start getting errors from the database for your service: “MySQL server has gone away”.

Slide 37

Slide 37 text

You use the SEV tool to get help

Slide 38

Slide 38 text

Getting errors, app having issues too. Not sure what’s happening yet. MySQL?
SEV reported by: you
Current SEV level: 1

Slide 39

Slide 39 text

IMOC is auto-paged and on the case

Slide 40

Slide 40 text

The SEV is automatically named

Slide 41

Slide 41 text

SEV 1 Fast Frog

Slide 42

Slide 42 text

The IMOC finds a TLOC to resolve the issue

Slide 43

Slide 43 text

Tons of teams across the company are getting alerts. It’s an alert storm!

Slide 44

Slide 44 text

Insert storm pic

Slide 45

Slide 45 text

Everyone across the company looks in #sevs on Slack and checks the sevs@ mailing list for updates

Slide 46

Slide 46 text

Threads running is high, the database is hot!

Slide 47

Slide 47 text

No content

Slide 48

Slide 48 text

Database is being hammered!

Slide 49

Slide 49 text

What’s happening?

Slide 50

Slide 50 text

TLOC is looking at the database queries

Slide 51

Slide 51 text

No content

Slide 52

Slide 52 text

Normal queries, nothing has changed

Slide 53

Slide 53 text

More queries than usual

Slide 54

Slide 54 text

Where are they coming from?

Slide 55

Slide 55 text

Our queries have metadata for the service

Slide 56

Slide 56 text

1. It’s the API
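One way to answer "where are they coming from?" is to tag every query with the calling service and count by tag. This is a minimal sketch that assumes a hypothetical `/* service=NAME */` comment convention; the talk does not say what the real metadata looks like.

```python
import re
from collections import Counter

# Hypothetical convention: each query carries a comment like /* service=api */.
SERVICE_TAG = re.compile(r"/\*\s*service=([\w-]+)\s*\*/")


def count_queries_by_service(queries):
    """Count queries per tagged service; untagged queries count as 'unknown'."""
    counts = Counter()
    for sql in queries:
        match = SERVICE_TAG.search(sql)
        counts[match.group(1) if match else "unknown"] += 1
    return counts


# Made-up sample queries, for illustration only.
sample = [
    "SELECT * FROM files WHERE id = 1 /* service=api */",
    "SELECT * FROM files WHERE id = 2 /* service=api */",
    "SELECT name FROM users WHERE id = 7 /* service=www */",
    "SELECT 1",
]
by_service = count_queries_by_service(sample)
print(by_service.most_common())
```

The service with the highest count is the first piece of evidence for the backpack.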

Slide 57

Slide 57 text

PUT THAT EVIDENCE IN YOUR BACKPACK

Slide 58

Slide 58 text

Alarm! Availability SLA is breached for WWW and API

Slide 59

Slide 59 text

SEV is upgraded to a SEV 0

Slide 60

Slide 60 text

SEV 0 Fast Frog

Slide 61

Slide 61 text

Automation in full-force

Slide 62

Slide 62 text

Executive Leadership Team are auto-emailed

Slide 63

Slide 63 text

We have only 15 min remaining to resolve the SEV 0

Slide 64

Slide 64 text

15 MINUTES

Slide 65

Slide 65 text

Keep going!

Slide 66

Slide 66 text

Start killing queries to restore service

Slide 67

Slide 67 text

No content

Slide 68

Slide 68 text

Are the queries in the slow log from one user or many users?

Slide 69

Slide 69 text

2. It’s mostly one user
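Checking "one user or many?" comes down to a frequency count. The sketch below assumes the user field has already been extracted from each slow-log entry; the user names are made up.

```python
from collections import Counter


def top_user_share(slow_log_users):
    """Return (user, share) for the user with the most slow-log entries."""
    if not slow_log_users:
        return None, 0.0
    counts = Counter(slow_log_users)
    user, hits = counts.most_common(1)[0]
    return user, hits / len(slow_log_users)


# Hypothetical input: one user name per slow-log entry.
entries = ["cust_42"] * 9 + ["cust_7"]
user, share = top_user_share(entries)
print(f"{user} accounts for {share:.0%} of slow queries")
```

A share close to 100% points at a single user; a flat distribution points at a broader problem.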

Slide 70

Slide 70 text

PUT THAT EVIDENCE IN YOUR BACKPACK

Slide 71

Slide 71 text

Is the one user legitimate?

Slide 72

Slide 72 text

What kind of workload are they performing?

Slide 73

Slide 73 text

3. It’s a heavy workload, heavier than we usually get.

Slide 74

Slide 74 text

PUT THAT EVIDENCE IN YOUR BACKPACK

Slide 75

Slide 75 text

Do we have rate limiting and throttling?

Slide 76

Slide 76 text

4. It isn’t working well in this situation
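For readers unfamiliar with rate limiting, a token bucket is one common approach. This is an illustrative sketch only, not the limiter used in the incident described here; the class name and parameters are assumptions. Time is passed in explicitly so the behaviour is easy to test; a real service would typically use a monotonic clock.

```python
class TokenBucket:
    """Per-user token bucket: `rate` tokens per second, burst up to `capacity`."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.last = now

    def allow(self, now, cost=1.0):
        """Refill for the elapsed time, then spend `cost` tokens if available."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(rate=1.0, capacity=3.0)
burst = [bucket.allow(now=0.0) for _ in range(5)]
print(burst)
```

A heavy batch workload can defeat a limiter like this when each request is cheap to admit but expensive for the database, which is one reason to consider charging a higher `cost` for heavier calls.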

Slide 77

Slide 77 text

PUT THAT EVIDENCE IN YOUR BACKPACK

Slide 78

Slide 78 text

Let’s temporarily kill queries for this user. We can use a query kill loop or use the support app. Then service will return to normal for everyone.
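A query kill loop can be sketched as a function that turns process-list rows into `KILL QUERY` statements. The sketch assumes the customer maps to a single MySQL account (the talk does not say so) and takes rows shaped like the output of `SHOW PROCESSLIST`; the account names are hypothetical, and actually executing the statements is left to whatever client you use.

```python
def kill_statements(processlist, user, min_seconds=0):
    """Build KILL QUERY statements for one user's running queries.

    `processlist` is a list of dicts keyed by SHOW PROCESSLIST column
    names (Id, User, Command, Time). Idle connections are skipped.
    """
    return [
        f"KILL QUERY {int(row['Id'])}"
        for row in processlist
        if row["User"] == user
        and row["Command"] == "Query"
        and row["Time"] >= min_seconds
    ]


# Made-up rows, for illustration only.
rows = [
    {"Id": 101, "User": "batch_customer", "Command": "Query", "Time": 42},
    {"Id": 102, "User": "batch_customer", "Command": "Sleep", "Time": 3},
    {"Id": 103, "User": "other_customer", "Command": "Query", "Time": 50},
]
for statement in kill_statements(rows, "batch_customer"):
    print(statement)
```

`KILL QUERY` stops the running statement but leaves the connection open, so running this in a loop sheds load without disconnecting the client.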

Slide 79

Slide 79 text

SLA is back on track. MITIGATED the SEV 0 in 5 minutes!

Slide 80

Slide 80 text

Let’s open up our evidence backpack

Slide 81

Slide 81 text

Our Evidence Backpack
It’s the API
It’s one user
It’s a heavier workload
Our rate limiting & throttling can’t handle this workload
We temp resolved by killing queries from this customer

Slide 82

Slide 82 text

Let’s check what rate limiting and throttling is currently set to

Slide 83

Slide 83 text

We need to fix that, add an action item.

Slide 84

Slide 84 text

Let’s also reach out to the customer and understand this heavy workload they are performing

Slide 85

Slide 85 text

They do batch-style processing using our API. They plan to do it Monday 7pm every week. How can we better support it long-term?

Slide 86

Slide 86 text

That’s what a SEV 0 looks like

Slide 87

Slide 87 text

What are SEV levels?

Slide 88

Slide 88 text

How To Establish SEV levels - Diabetes

SEV Level | Description | Target resolution time | Who is notified
SEV 0 | Catastrophic Service Impact | Resolve within 10 min | Ambulance
SEV 1 | Critical Service Impact | Resolve within 8 hours | Neighbour & Best Friend
SEV 2 | High Service Impact | Resolve within 24 hours | Best Friend

Slide 89

Slide 89 text

How To Establish SEV levels

SEV Level | Description | Target resolution time | Who is notified
SEV 0 | Catastrophic Service Impact | Resolve within 15 min | Entire company
SEV 1 | Critical Service Impact | Resolve within 8 hours | Teams working on SEV & CTO
SEV 2 | High Service Impact | Resolve within 24 hours | Teams working on SEV
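The levels on this slide can be captured as data so that tooling can look up targets and notification lists. This is a sketch; the class and field names are assumptions, and only the values come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SevLevel:
    level: int
    description: str
    target_minutes: int  # target resolution time, in minutes
    notify: tuple        # audiences to notify


SEV_LEVELS = {
    0: SevLevel(0, "Catastrophic Service Impact", 15, ("entire company",)),
    1: SevLevel(1, "Critical Service Impact", 8 * 60, ("teams working on SEV", "CTO")),
    2: SevLevel(2, "High Service Impact", 24 * 60, ("teams working on SEV",)),
}


def who_to_notify(level):
    """Return the audiences for a SEV level; raises KeyError if unknown."""
    return SEV_LEVELS[level].notify


print(who_to_notify(0))
```

Keeping the levels in one place means an upgrade from SEV 1 to SEV 0 only changes the key that the automation looks up.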

Slide 90

Slide 90 text

How do your resolution times impact SLOs/SLAs?

Slide 91

Slide 91 text

What is an SLA of 99.99%?

Slide 92

Slide 92 text

Daily: 8.6s
Weekly: 1m 0.5s
Monthly: 4m 23.0s
Yearly: 52m 35.7s
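These budgets follow from multiplying each period by the allowed failure fraction (0.01% for 99.99%). The sketch below assumes an average year of 365.2425 days and a month of one twelfth of that, which is consistent with the figures on this slide.

```python
def allowed_downtime_seconds(sla_percent, period_seconds):
    """Downtime budget for a period under an availability SLA."""
    return period_seconds * (1 - sla_percent / 100)


DAY = 24 * 60 * 60
YEAR = 365.2425 * DAY  # assumption: average Gregorian year
PERIODS = {
    "Daily": DAY,
    "Weekly": 7 * DAY,
    "Monthly": YEAR / 12,
    "Yearly": YEAR,
}

for name, seconds in PERIODS.items():
    budget = allowed_downtime_seconds(99.99, seconds)
    minutes, secs = divmod(budget, 60)
    print(f"{name}: {int(minutes)}m {secs:.1f}s")
```

With a 15-minute target for a SEV 0, roughly three such incidents would use up most of the yearly budget.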

Slide 93

Slide 93 text

What is 52 minutes in a year? Less than 1 meeting.

Slide 94

Slide 94 text

How can you be ready to sprint to mitigation at any moment?

Slide 95

Slide 95 text

What is the full lifecycle of a SEV?

Slide 96

Slide 96 text

No content

Slide 97

Slide 97 text

How are SEVs measured?

Slide 98

Slide 98 text

% loss * outage duration

Slide 99

Slide 99 text

How do you create SEV levels for your company?

Slide 100

Slide 100 text

SEV levels for data loss

SEV Level | Data Loss Impact
SEV 0 | Loss of customer data
SEV 1 | Loss of primary backup
SEV 2 | Loss of secondary backup

Slide 101

Slide 101 text

No content

Slide 102

Slide 102 text

What does a SEV look like?

Slide 103

Slide 103 text

No content

Slide 104

Slide 104 text

We measure this SEV as:
0.2% * 30 min (6) for WWW
0.11% * 30 min (3.3) for API
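The measurement is the "% loss * outage duration" product from the earlier slide. A minimal sketch follows; the function name is an assumption and the inputs are the figures above.

```python
def sev_impact(percent_loss, duration_minutes):
    """Impact score: percentage of availability lost times outage minutes."""
    return percent_loss * duration_minutes


www_score = sev_impact(0.2, 30)
api_score = sev_impact(0.11, 30)
print(f"WWW: {www_score:.1f}, API: {api_score:.1f}")
```

Scoring each affected service separately makes it possible to compare SEVs over time and to see which service carried most of the impact.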

Slide 105

Slide 105 text

How do you ensure your team operates effectively during a SEV 0?

Slide 106

Slide 106 text

Incident Manager On-Call (IMOC)

Slide 107

Slide 107 text

Small Rotation of Engineering Leaders

Slide 108

Slide 108 text

One person is on-call in this role at any point in time

Slide 109

Slide 109 text

Can be paged by emailing imoc-pager@

Slide 110

Slide 110 text

Wide knowledge of services and engineering teams
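A small rotation with exactly one person on call at a time can be expressed as a lookup by date. The weekly shift length and the names below are assumptions; the talk does not specify how long each shift lasts.

```python
from datetime import date


def imoc_on_call(rotation, start, day, shift_days=7):
    """Return who is on call on `day`, with shifts of `shift_days` from `start`."""
    if day < start:
        raise ValueError("day is before the rotation started")
    shifts_elapsed = (day - start).days // shift_days
    return rotation[shifts_elapsed % len(rotation)]


# Hypothetical rotation of engineering leaders.
rotation = ["leader_a", "leader_b", "leader_c"]
start = date(2024, 1, 1)
print(imoc_on_call(rotation, start, date(2024, 1, 10)))
```

Because the result depends only on the date, any tool can answer "who is the IMOC right now?" without shared state.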

Slide 111

Slide 111 text

Tech Lead On-Call (TLOC)

Slide 112

Slide 112 text

The engineer responsible for resolving the SEV

Slide 113

Slide 113 text

Deep knowledge of own service area

Slide 114

Slide 114 text

Deep knowledge of upstream and downstream dependencies

Slide 115

Slide 115 text

How do you set up IMOCs for success during SEV 0s?

Slide 116

Slide 116 text

How do you categorise SEVs?

Slide 117

Slide 117 text

No content

Slide 118

Slide 118 text

How do you empower everyone in your company to fix things that are broken?

Slide 119

Slide 119 text

No content

Slide 120

Slide 120 text

No content

Slide 121

Slide 121 text

How should you name SEVs?

Slide 122

Slide 122 text

0086343430

Slide 123

Slide 123 text

SEV 0 Fast Frog
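Automatic naming can be as simple as pairing a random adjective with a random animal. In this sketch the word lists are made up and the function names are assumptions. The codename is generated once and kept separate from the level, so an upgrade from SEV 1 to SEV 0 keeps the same name.

```python
import random

# Hypothetical word lists; any memorable, inoffensive words would do.
ADJECTIVES = ["Fast", "Brave", "Quiet", "Sunny"]
ANIMALS = ["Frog", "Koala", "Otter", "Wombat"]


def random_codename(rng=random):
    """Pick a memorable two-word codename for a new SEV."""
    return f"{rng.choice(ADJECTIVES)} {rng.choice(ANIMALS)}"


def sev_title(level, codename):
    """Combine the current level with the fixed codename."""
    return f"SEV {level} {codename}"


codename = random_codename(random.Random(7))
print(sev_title(1, codename))
print(sev_title(0, codename))  # same incident after an upgrade
```

A spoken name is easier to use in chat and on calls than a numeric ID.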

Slide 124

Slide 124 text

No content

Slide 125

Slide 125 text

What causes SEVs?

Slide 126

Slide 126 text

Pareto Principle

Slide 127

Slide 127 text

Technical & Cultural Issues

Slide 128

Slide 128 text

What are some of the expected issues you are likely to experience?

Slide 129

Slide 129 text

Technical Issues
Dependency Failure
Region/Zone Failure
Provider Failure
Overheating
PDU failure
Network upgrades
Rack failures
Core Switch failures
Connectivity issues
Flaky DNS
Misconfigured machines
Bugs
Corrupt or unavailable backups

Cultural Issues
Lack of knowledge sharing
Lack of knowledge handover
Lack of on-call training
Lack of chaos engineering
Lack of an incident management program
Lack of documentation and playbooks
Lack of alerts and pages
Lack of effective alerting thresholds
Lack of backup strategy

Slide 130

Slide 130 text

How do you prevent SEVs from repeating?

Slide 131

Slide 131 text

Let’s look at high impact practices…

Slide 132

Slide 132 text

An Incident Management Program

Slide 133

Slide 133 text

A helpful IMOC Rotation

Slide 134

Slide 134 text

Automation Tooling For Incident Management

Slide 135

Slide 135 text

Chaos Engineering

Slide 136

Slide 136 text

Insert calm kid calling on the phone
Calling for help when an incident happens is awesome!

Slide 137

Slide 137 text

Calling for help when an incident happens is awesome!

Slide 138

Slide 138 text

Create Your Own Incident Management Program
1. Determine how you will measure SEVs
2. Determine your SEV Levels
3. Set your SLOs
4. Create your IMOC rotation
5. Start using automation tooling for SEVs
6. Build a critical service dashboard

Slide 139

Slide 139 text

It’s a beautiful day to start

Slide 140

Slide 140 text

Learn from and help others on this journey:
Join the Chaos & Reliability Community
gremlin.com/community
Thank you
@tammybutow
[email protected]
gremlin.com/slack