Velocity 2018 - How To Establish A High Severity Incident Management Program

Slide 1

Slide 1 text

HOW TO ESTABLISH A HIGH SEVERITY INCIDENT MANAGEMENT PROGRAM. @TAMMYBUTOW @ANA_M_MEDINA GREMLIN

Slide 2

Slide 2 text

AGENDA
09:00 - WELCOME & INTRODUCTIONS
09:30 - ESTABLISHING YOUR SEV PROGRAM
10:30 - MORNING BREAK ☕
11:00 - MEASURING SUCCESS
11:30 - HANDS-ON PRACTICE
12:15 - Q & A
12:30 - THANKS & CYA LATER!

Slide 3

Slide 3 text

TAMMY BÜTOW, Principal SRE, Gremlin, @tammybutow
ANA MEDINA, Chaos Engineer, Gremlin, @ana_m_medina

Slide 4

Slide 4 text

INTRODUCTIONS

Slide 5

Slide 5 text

SURVIVAL SKILLS FROM THE OUTBACK TO THE CITY.

Slide 6

Slide 6 text

“Many fears cloud people’s engagement with our wilderness. The fear of snakes, spiders, becoming lost and being alone are all common fears. Survival skills can replace fear with respect for, and trust in, nature. Such knowledge enables people to walk freely and feel safer in our natural environment.”

Slide 7

Slide 7 text

HOW TO SURVIVE A SNAKE BITE
1. TRUST (NOT FEAR!)
2. CALL FOR HELP
3. BANDAGE & IMMOBILISE LIMB
4. STOP SPREAD OF POISON
5. VENOM DETECTION KIT
6. ANTIVENOM
“SURVIVAL IS A MIND GAME” - BOB COOPER

Slide 8

Slide 8 text

HOW TO SURVIVE A SEV
1. TRUST (NOT FEAR!)
2. CALL FOR HELP
3. APPLY BANDAGE
4. STOP SPREAD
5. DIAGNOSIS OF ISSUE
6. TECHNICAL RESOLUTION
“SURVIVAL IS A MIND GAME” - BOB COOPER

Slide 9

Slide 9 text

KNOW THE 5 KEYS TO WILDERNESS SURVIVAL
1. KNOW HOW TO BUILD A SHELTER
2. KNOW HOW TO SIGNAL FOR HELP
3. KNOW WHAT TO EAT & HOW TO FIND IT
4. KNOW HOW TO BUILD AND MAINTAIN A FIRE
5. KNOW HOW TO FIND WATER AND PREPARE SAFE WATER TO DRINK

Slide 10

Slide 10 text

KNOW THE 5 KEYS TO SEV SURVIVAL
1. KNOW HOW TO FIND SHELTER & WIFI
2. KNOW HOW TO SIGNAL FOR HELP
3. KNOW YOUR CRITICAL SYSTEMS & HOW TO ASSESS THEIR HEALTH
4. KNOW HOW TO BANDAGE ISSUES AND STOP THEIR SPREAD
5. KNOW HOW TO PERFORM TECHNICAL EMERGENCY RESOLUTION

Slide 11

Slide 11 text

THE PRIMARY OBJECTIVE OF THIS WORKSHOP IS TO PROVIDE AN UNDERSTANDING OF HIGH SEVERITY INCIDENT MANAGEMENT AND ITS RELATED PRACTICES IN AN EASY AND SYSTEMIC WAY, INCLUDING PRACTICE AS WELL AS THEORY.

Slide 12

Slide 12 text

SUCCESS IS BASED ON FOUR ASPECTS: TRUST, KNOWLEDGE, PRACTICE & MEASUREMENT

Slide 13

Slide 13 text

SURVEY: CURRENT STATE OF INCIDENT MANAGEMENT https://goo.gl/Yma4d2

Slide 14

Slide 14 text

HOW DO YOU EMPOWER EVERYONE IN YOUR COMPANY TO IDENTIFY PROBLEMS AND SIGNAL FOR HELP?

Slide 15

Slide 15 text

Insert illustration of a building

Slide 16

Slide 16 text

HAS THAT EVER HAPPENED WHERE YOU’VE WORKED?

Slide 17

Slide 17 text

EMPOWER EVERYONE.

Slide 18

Slide 18 text

#velocityconf

Slide 19

Slide 19 text

Slide 20

Slide 20 text

@TAMMYBUTOW @ANA_M_MEDINA GREMLIN HOW TO ESTABLISH A HIGH SEVERITY INCIDENT MANAGEMENT PROGRAM

Slide 21

Slide 21 text

What is High Severity Incident Management?

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

SEVs

Slide 24

Slide 24 text

What are the 4 most common types of SEVs?

Slide 25

Slide 25 text

1. The Availability Drop

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

2. The Broken Feature

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

3. The Loss of Data

Slide 30

Slide 30 text

Cry baby

Slide 31

Slide 31 text

4. The Security Risk

Slide 32

Slide 32 text

No content

Slide 33

Slide 33 text

Let’s take a journey together outside this room

Slide 34

Slide 34 text

Put on your SEV backpack

Slide 35

Slide 35 text

Monday 7pm

Slide 36

Slide 36 text

You’re out having dinner

Slide 37

Slide 37 text

You start getting errors from the database for your service: “MySQL server has gone away”

Slide 38

Slide 38 text

You use your SEV tool to get help

Slide 39

Slide 39 text

Getting errors, app having issues too. Not sure what’s happening yet. MySQL? SEV Reported by you: Current SEV Level: 1

Slide 40

Slide 40 text

IMOC is auto-paged and on the case

Slide 41

Slide 41 text

The SEV is automatically named

Slide 42

Slide 42 text

SEV 1 Fast Frog

Slide 43

Slide 43 text

The IMOC finds a TLOC to resolve the issue

Slide 44

Slide 44 text

Tons of teams across the company are getting alerts. It’s an alert storm!

Slide 45

Slide 45 text

Insert storm pic

Slide 46

Slide 46 text

Everyone across the company looks in #sevs on Slack and checks the sevs@ mailing list for updates

Slide 47

Slide 47 text

Threads running is high, the database is hot!

Slide 48

Slide 48 text

No content

Slide 49

Slide 49 text

Database is being hammered!

Slide 50

Slide 50 text

What’s happening?

Slide 51

Slide 51 text

TLOC is looking at the database queries

Slide 52

Slide 52 text

No content

Slide 53

Slide 53 text

Normal queries, nothing has changed

Slide 54

Slide 54 text

More queries than usual

Slide 55

Slide 55 text

Where are they coming from?

Slide 56

Slide 56 text

Our queries have metadata for the service
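
One common way to attach service metadata to queries is a trailing SQL comment. The sketch below is only an illustration of that idea, not the tooling used in the scenario: the `/* service=... */` comment format, `tag_query` and `count_by_service` are all hypothetical names chosen for this example.

```python
import re
from collections import Counter

# Hypothetical convention: every query carries a comment such as /* service=api */.
SERVICE_TAG = re.compile(r"/\*\s*service=([\w-]+)\s*\*/")


def tag_query(sql: str, service: str) -> str:
    """Append a service tag comment to a SQL statement."""
    return f"{sql} /* service={service} */"


def count_by_service(queries: list[str]) -> Counter:
    """Count queries per tagged service; untagged queries count as 'unknown'."""
    counts = Counter()
    for sql in queries:
        match = SERVICE_TAG.search(sql)
        counts[match.group(1) if match else "unknown"] += 1
    return counts


# Made-up sample of currently running queries.
running = [
    tag_query("SELECT * FROM orders WHERE user_id = 7", "api"),
    tag_query("SELECT * FROM orders WHERE user_id = 7", "api"),
    tag_query("SELECT name FROM users WHERE id = 7", "www"),
    "SELECT 1",
]
print(count_by_service(running).most_common())
```

With tags like these, "where are the extra queries coming from?" becomes a simple count per service.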

Slide 57

Slide 57 text

1. It’s the API

Slide 58

Slide 58 text

PUT THAT EVIDENCE IN YOUR BACKPACK

Slide 59

Slide 59 text

Alarm! Availability SLA is breached for WWW and API

Slide 60

Slide 60 text

SEV is upgraded to a SEV 0

Slide 61

Slide 61 text

SEV 0 Fast Frog

Slide 62

Slide 62 text

Automation in full-force

Slide 63

Slide 63 text

Executive Leadership Team are auto-emailed

Slide 64

Slide 64 text

We have only 15 min remaining to resolve the SEV 0

Slide 65

Slide 65 text

15 MINUTES

Slide 66

Slide 66 text

Keep going!

Slide 67

Slide 67 text

No content

Slide 68

Slide 68 text

Start killing queries to restore service

Slide 69

Slide 69 text

No content

Slide 70

Slide 70 text

Are the queries in the slow log from one user or many users?

Slide 71

Slide 71 text

2. It’s mostly one user

Slide 72

Slide 72 text

PUT THAT EVIDENCE IN YOUR BACKPACK

Slide 73

Slide 73 text

Is the one user legitimate?

Slide 74

Slide 74 text

What kind of workload are they performing?

Slide 75

Slide 75 text

3. It’s a heavy workload, heavier than we usually get.

Slide 76

Slide 76 text

PUT THAT EVIDENCE IN YOUR BACKPACK

Slide 77

Slide 77 text

Do we have rate limiting and throttling?

Slide 78

Slide 78 text

4. It isn’t working well in this situation

Slide 79

Slide 79 text

PUT THAT EVIDENCE IN YOUR BACKPACK

Slide 80

Slide 80 text

Let’s temporarily kill queries for this user. We can use a query kill loop or use the support app. Then service will return to normal for everyone.
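
A query kill loop can be sketched roughly as follows. This is an assumption about what such a loop might look like, not the tool from the scenario: the `Process` record mirrors a few columns of MySQL's process list, and `kill_statements` and the user name `batch_customer` are hypothetical. The sketch only builds `KILL QUERY` statements (which end the running statement but keep the connection open); a real loop would re-read the process list on an interval and execute them.

```python
from dataclasses import dataclass


@dataclass
class Process:
    """A subset of the columns in one MySQL process list row."""
    id: int
    user: str
    command: str
    time_s: int  # seconds the current statement has been running


def kill_statements(processes, user, min_seconds=0):
    """Return KILL QUERY statements for the given user's active queries."""
    return [
        f"KILL QUERY {p.id};"
        for p in processes
        if p.user == user and p.command == "Query" and p.time_s >= min_seconds
    ]


# Made-up snapshot of the process list during the SEV.
snapshot = [
    Process(101, "batch_customer", "Query", 42),
    Process(102, "www", "Query", 1),
    Process(103, "batch_customer", "Sleep", 300),
    Process(104, "batch_customer", "Query", 17),
]
for statement in kill_statements(snapshot, "batch_customer"):
    print(statement)
```

Filtering on one user keeps the mitigation narrow: other customers' queries are left alone.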

Slide 81

Slide 81 text

SLA is back on track. MITIGATED the SEV 0 in 5 minutes!

Slide 82

Slide 82 text

Let’s open up our evidence backpack

Slide 83

Slide 83 text

Our Evidence Backpack
1. It’s the API
2. It’s one user
3. It’s a heavier workload
4. Our rate limiting & throttling can’t handle this workload
5. We temp resolved by killing queries from this customer

Slide 84

Slide 84 text

Let’s check what rate limiting and throttling is currently set to

Slide 85

Slide 85 text

We need to fix that, add an action item.

Slide 86

Slide 86 text

Let’s also reach out to the customer and understand this heavy workload they are performing

Slide 87

Slide 87 text

They do batch-style processing using our API. They plan to do it Monday 7pm every week. How can we better support it long-term?

Slide 88

Slide 88 text

That’s what a SEV 0 looks like

Slide 89

Slide 89 text

What are SEV levels?

Slide 90

Slide 90 text

How To Establish SEV levels

SEV Level | Description | Target resolution time | Who is notified
SEV 0 | Catastrophic Service Impact | Resolve within 15 min | Entire company
SEV 1 | Critical Service Impact | Resolve within 8 hours | Teams working on SEV & CTO
SEV 2 | High Service Impact | Resolve within 24 hours | Teams working on SEV
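
The SEV level table can also be kept as data so that tooling can check a SEV against its target. The following is a minimal sketch under that assumption; `SevLevel`, `SEV_LEVELS` and `is_within_target` are illustrative names, and only the values come from the table.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class SevLevel:
    level: int
    description: str
    target_resolution: timedelta
    notified: str


# Values copied from the SEV level table; the structure itself is an assumption.
SEV_LEVELS = {
    0: SevLevel(0, "Catastrophic Service Impact", timedelta(minutes=15), "Entire company"),
    1: SevLevel(1, "Critical Service Impact", timedelta(hours=8), "Teams working on SEV & CTO"),
    2: SevLevel(2, "High Service Impact", timedelta(hours=24), "Teams working on SEV"),
}


def is_within_target(level: int, elapsed: timedelta) -> bool:
    """True if a SEV of this level is still inside its target resolution time."""
    return elapsed <= SEV_LEVELS[level].target_resolution


print(is_within_target(0, timedelta(minutes=5)))  # SEV 0 mitigated in 5 minutes
print(is_within_target(1, timedelta(hours=9)))    # SEV 1 open for 9 hours
```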

Slide 91

Slide 91 text

How do your resolution times impact SLOs/SLAs?

Slide 92

Slide 92 text

What is an SLA of 99.99%?

Slide 93

Slide 93 text

Daily: 8.6s
Weekly: 1m 0.5s
Monthly: 4m 23.0s
Yearly: 52m 35.7s
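
These downtime budgets follow from multiplying the allowed failure fraction by the length of the period. The sketch below shows the arithmetic; it assumes an average year of 365.2425 days and a month of one twelfth of that, which appears to match the figures above to within rounding. The function names are illustrative.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365.2425  # assumption: average Gregorian year


def allowed_downtime_seconds(availability_pct: float, period_days: float) -> float:
    """Downtime budget in seconds for an availability target over a period."""
    return (100 - availability_pct) / 100 * period_days * SECONDS_PER_DAY


def fmt(seconds: float) -> str:
    """Format seconds as 'Xm Y.Ys'."""
    minutes, rest = divmod(seconds, 60)
    return f"{int(minutes)}m {rest:.1f}s"


periods = {
    "Daily": 1,
    "Weekly": 7,
    "Monthly": DAYS_PER_YEAR / 12,
    "Yearly": DAYS_PER_YEAR,
}
for name, days in periods.items():
    print(f"{name}: {fmt(allowed_downtime_seconds(99.99, days))}")
```

The same function can be used to compare a SEV level's target resolution time against the budget: a single SEV 0 that runs for its full 15 minutes uses more than a quarter of the yearly budget at 99.99%.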

Slide 94

Slide 94 text

What is 52 minutes in a year? Less than 1 meeting

Slide 95

Slide 95 text

How can you be ready to sprint to mitigation at any moment?

Slide 96

Slide 96 text

What should a SEV not look like?

Slide 97

Slide 97 text

No content

Slide 98

Slide 98 text

What is the full lifecycle of a SEV?

Slide 99

Slide 99 text

No content

Slide 100

Slide 100 text

How are SEVs measured?

Slide 101

Slide 101 text

% loss * outage duration

Slide 102

Slide 102 text

How do you create SEV levels for your company?

Slide 103

Slide 103 text

SEV levels for data loss

SEV Level | Data Loss Impact
SEV 0 | Loss of customer data
SEV 1 | Loss of primary backup
SEV 2 | Loss of secondary backup

Slide 104

Slide 104 text

No content

Slide 105

Slide 105 text

What does a SEV look like?

Slide 106

Slide 106 text

No content

Slide 107

Slide 107 text

We measure this SEV as:
0.2% * 30 min (6) for WWW
0.11% * 30 min (3.3) for API
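
The "% loss * outage duration" measure can be written as a one-line function. This sketch assumes, as the WWW and API figures suggest, that the loss is entered as a percentage number and the duration in minutes; `sev_impact` is an illustrative name.

```python
def sev_impact(percent_loss: float, outage_minutes: float) -> float:
    """SEV size as percentage loss multiplied by outage duration in minutes."""
    return percent_loss * outage_minutes


# The two services from the example SEV; results rounded to hide float noise.
print("WWW:", round(sev_impact(0.2, 30), 2))
print("API:", round(sev_impact(0.11, 30), 2))
```

A single number per service makes SEVs comparable over time, although it says nothing about which customers were affected.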

Slide 108

Slide 108 text

How do you ensure your team operates effectively during a SEV 0?

Slide 109

Slide 109 text

Incident Manager On-Call (IMOC)

Slide 110

Slide 110 text

Small Rotation of Engineering Leaders

Slide 111

Slide 111 text

One person is on-call in this role at any point in time

Slide 112

Slide 112 text

Can be paged by emailing imoc-pager@

Slide 113

Slide 113 text

Wide knowledge of services and engineering teams
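
A rotation where exactly one person is the IMOC at any time can be sketched as a lookup from date to name. The weekly handoff, the placeholder names and the start date below are assumptions for illustration; the slides do not specify a rotation length.

```python
from datetime import date


def imoc_on_call(rotation: list[str], rotation_start: date, day: date) -> str:
    """Return the single IMOC for a given day, assuming weekly handoffs."""
    if day < rotation_start:
        raise ValueError("day is before the rotation started")
    weeks_elapsed = (day - rotation_start).days // 7
    return rotation[weeks_elapsed % len(rotation)]


# Hypothetical small rotation of engineering leaders.
rotation = ["leader_a", "leader_b", "leader_c"]
start = date(2018, 6, 4)  # a Monday, chosen arbitrarily

print(imoc_on_call(rotation, start, date(2018, 6, 13)))
```

Because the result depends only on the date, anyone in the company can work out who to page without asking.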

Slide 114

Slide 114 text

Tech Lead On-Call (TLOC)

Slide 115

Slide 115 text

The engineer responsible for resolving the SEV

Slide 116

Slide 116 text

Deep knowledge of own service area

Slide 117

Slide 117 text

Deep knowledge of upstream and downstream dependencies

Slide 118

Slide 118 text

How do you set up IMOCs for success during SEV 0s?

Slide 119

Slide 119 text

How do you categorise SEVs?

Slide 120

Slide 120 text

No content

Slide 121

Slide 121 text

How do you empower everyone in your company to fix things that are broken?

Slide 122

Slide 122 text

gremlin.com/community

Slide 123

Slide 123 text

gremlin.com/community

Slide 124

Slide 124 text

How should you name SEVs?

Slide 125

Slide 125 text

0086343430

Slide 126

Slide 126 text

SEV 0 Fast Frog
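
Automatic naming in the style of "Fast Frog" can be sketched as picking an adjective and an animal at random. The word lists and function names below are made up for illustration. The codename is generated once and kept separate from the level, so that a SEV keeps its name when it is upgraded from SEV 1 to SEV 0.

```python
import random

# Hypothetical word lists; any memorable, easy-to-say words would do.
ADJECTIVES = ["Fast", "Brave", "Quiet", "Sunny", "Lucky"]
ANIMALS = ["Frog", "Koala", "Emu", "Wombat", "Gecko"]


def new_codename(rng: random.Random) -> str:
    """Pick a two-word codename that is easier to remember than a ticket number."""
    return f"{rng.choice(ADJECTIVES)} {rng.choice(ANIMALS)}"


def display_name(level: int, codename: str) -> str:
    """Combine the current SEV level with the fixed codename."""
    return f"SEV {level} {codename}"


codename = new_codename(random.Random())
print(display_name(1, codename))
print(display_name(0, codename))  # same codename after the upgrade
```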

Slide 127

Slide 127 text

What causes SEVs?

Slide 128

Slide 128 text

Pareto Principle

Slide 129

Slide 129 text

Technical & Cultural Issues

Slide 130

Slide 130 text

What are some of the expected issues you are likely to experience?

Slide 131

Slide 131 text

Technical Issues
- Dependency Failure
- Region/Zone Failure
- Provider Failure
- Overheating
- PDU failure
- Network upgrades
- Rack failures
- Core Switch failures
- Connectivity issues
- Flaky DNS
- Misconfigured machines
- Bugs
- Corrupt or unavailable backups

Cultural Issues
- Lack of knowledge sharing
- Lack of knowledge handover
- Lack of on-call training
- Lack of chaos engineering
- Lack of an incident management program
- Lack of documentation and playbooks
- Lack of alerts and pages
- Lack of effective alerting thresholds
- Lack of backup strategy

Slide 132

Slide 132 text

How do you prevent SEVs from repeating?

Slide 133

Slide 133 text

Let’s look at high impact practices...

Slide 134

Slide 134 text

An Incident Management Program

Slide 135

Slide 135 text

A helpful IMOC Rotation

Slide 136

Slide 136 text

Automation Tooling For Incident Management

Slide 137

Slide 137 text

Chaos Engineering

Slide 138

Slide 138 text

Insert calm kid calling on the phone Calling for help when an incident happens is awesome!

Slide 139

Slide 139 text

HANDS-ON EXERCISE (GROUPS OF 3 OR 4)

Slide 140

Slide 140 text

CREATE YOUR OWN INCIDENT MANAGEMENT PROGRAM
1. DETERMINE HOW YOU WILL MEASURE SEVS
2. DETERMINE SEV LEVELS
3. SET YOUR SLOS
4. CREATE YOUR IMOC ROTATION
5. START USING AUTOMATION TOOLING FOR SEVS
6. BUILD A CRITICAL SERVICE DASHBOARD

Slide 141

Slide 141 text

Slide 142

Slide 142 text

ENJOY YOUR MORNING BREAK ☕

Slide 143

Slide 143 text

Slide 144

Slide 144 text

MEASURING THE SUCCESS OF YOUR INCIDENT MANAGEMENT PROGRAM

Slide 145

Slide 145 text

MEASURE YOUR INCIDENT MANAGEMENT PROGRAM
1. ENSURING YOUR TEAM OPERATES EFFECTIVELY DURING A SEV 0
2. SETTING UP IMOCS FOR SUCCESS DURING SEV 0s
3. EMPOWERING EVERYONE IN YOUR COMPANY TO REPORT SEVs
4. SEV CAUSES
5. CATEGORISING SEVs
6. PREVENTING SEVs FROM REPEATING
7. USING CHAOS ENGINEERING FOR SEV PREVENTION

Slide 146

Slide 146 text

GOAL: ENSURING YOUR TEAM OPERATES EFFECTIVELY DURING A SEV 0 MEASURED BY: SURVEY FEEDBACK & TTR

Slide 147

Slide 147 text

GOAL: SETTING UP IMOCS FOR SUCCESS DURING SEV 0s MEASURED BY: IMOC & TLOC SURVEYS

Slide 148

Slide 148 text

GOAL: EMPOWERING EVERYONE IN YOUR COMPANY TO RECORD SEVS MEASURED BY: TTD & COMPANY-WIDE SURVEY

Slide 149

Slide 149 text

GOAL: UNDERSTAND SEV CAUSES MEASURED BY: TAG SEVS BY CAUSES
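
Once SEVs are tagged by cause, the Pareto Principle mentioned earlier can be checked against your own data. The sketch below finds the smallest set of causes that covers a chosen share of SEVs; the tag values are made up, and `top_causes` is an illustrative name.

```python
from collections import Counter


def top_causes(cause_tags: list[str], share: float = 0.8) -> list[str]:
    """Return the most frequent causes that together cover `share` of all SEVs."""
    counts = Counter(cause_tags)
    total = sum(counts.values())
    selected, covered = [], 0
    for cause, n in counts.most_common():
        if covered >= share * total:
            break
        selected.append(cause)
        covered += n
    return selected


# Made-up cause tags, one per SEV.
tags = (
    ["dependency failure"] * 5
    + ["misconfigured machine"] * 3
    + ["flaky DNS", "bug"]
)
print(top_causes(tags))
```

In this sample, two of the four causes account for 80% of the SEVs, which is where prevention work would likely pay off first.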

Slide 150

Slide 150 text

GOAL: CATEGORISE SEVS MEASURED BY: TAGS FOR SERVICE, TEAM, DEPARTMENT ETC.

Slide 151

Slide 151 text

GOAL: PREVENT SEVS FROM REPEATING MEASURED BY: TBF FOR SEVS
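
The time-based measures used in these goals can be computed from three timestamps per SEV. The slides do not expand the acronyms, so this sketch assumes TTD is time to detect, TTR is time to resolve (both measured from the start of impact) and TBF is time between failures (from one SEV's resolution to the next SEV's start). The `Sev` record and the sample times are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Sev:
    started: datetime   # when impact began
    detected: datetime  # when the SEV was reported
    resolved: datetime  # when service was restored


def ttd(sev: Sev) -> timedelta:
    """Assumed definition: time from start of impact to detection."""
    return sev.detected - sev.started


def ttr(sev: Sev) -> timedelta:
    """Assumed definition: time from start of impact to resolution."""
    return sev.resolved - sev.started


def time_between_failures(sevs: list[Sev]) -> list[timedelta]:
    """Gaps between one SEV's resolution and the next SEV's start."""
    ordered = sorted(sevs, key=lambda s: s.started)
    return [b.started - a.resolved for a, b in zip(ordered, ordered[1:])]


# Made-up SEV history.
history = [
    Sev(datetime(2018, 6, 4, 19, 0), datetime(2018, 6, 4, 19, 3), datetime(2018, 6, 4, 19, 20)),
    Sev(datetime(2018, 6, 11, 19, 0), datetime(2018, 6, 11, 19, 1), datetime(2018, 6, 11, 19, 10)),
]
print(ttd(history[0]), ttr(history[0]), time_between_failures(history))
```

A growing time between failures for SEVs with the same cause tag suggests the fix is holding.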

Slide 152

Slide 152 text

GOAL: USE CHAOS ENGINEERING TO EMPOWER YOUR TEAMS TO PREVENT SEVS MEASURED BY: SEVS WHICH HAVE BEEN REPRODUCED THROUGH CHAOS ENGINEERING.

Slide 153

Slide 153 text

WHAT ELSE CAN YOU MEASURE?

Slide 154

Slide 154 text

METRICS FOR YOUR INCIDENT MANAGEMENT PROGRAM
1. A SEV DASHBOARD
2. CREATE & SEND SEV REPORTS
3. CREATE & SEND KPI REPORTS
4. SET GOALS FOR SEV REDUCTION
5. ESTABLISH A MONTHLY SRE SYNC
6. SEV TRAINING INCLUDING GAMEDAYS & CHAOSDAYS

Slide 155

Slide 155 text

Slide 156

Slide 156 text

HANDS-ON PRACTICE

Slide 157

Slide 157 text

BRAINSTORM: HOW DO YOU REDUCE TTD FOR YOUR TOP 5 CRITICAL SERVICES?

Slide 158

Slide 158 text

• Step 0 - Incident classification, including SEV descriptions and levels, the SEV timeline and the TTD timeline
• Step 1 - Organization-wide critical service monitoring, including key dashboards and KPI metrics emails
• Step 2 - Service ownership and metrics, including measuring TTD by service, service triage, service ownership, building a service ownership service (SOS) and service alerting
• Step 3 - On-call principles, including the Pareto principle, rotation structure, alert threshold maintenance and escalation practices
• Step 4 - Chaos Engineering, including chaos days and continuous chaos
• Step 5 - Self-healing systems, including when automation incidents occur, and monitoring and metrics for self-healing system automation

Slide 159

Slide 159 text

PRACTICE A SEV REVIEW

Slide 160

Slide 160 text

BAD POST-SEV REVIEW EXAMPLE

Slide 161

Slide 161 text

No content

Slide 162

Slide 162 text

GOOD POST-SEV REVIEW EXAMPLE

Slide 163

Slide 163 text

No content

Slide 164

Slide 164 text

DETERMINE HOW YOU WOULD CREATE THE FOLLOWING:
1. A SEV DASHBOARD
2. CREATE & SEND SEV REPORTS
3. CREATE & SEND KPI REPORTS
4. SET GOALS FOR SEV REDUCTION
5. ESTABLISH A MONTHLY SRE SYNC
6. SEV TRAINING INCLUDING GAMEDAYS & CHAOSDAYS

Slide 165

Slide 165 text

SHARE SOMETHING YOU WILL TAKE BACK TO YOUR COMPANY WITH EVERYONE

Slide 166

Slide 166 text

TODAY IS A BEAUTIFUL DAY TO START A HIGH SEVERITY INCIDENT MANAGEMENT PROGRAM

Slide 167

Slide 167 text

Learn from & help others on this journey: Join the Chaos & Reliability Community
gremlin.com/community
Thank you
[email protected]
gremlin.com/slack
@TAMMYBUTOW @ANA_M_MEDINA GREMLIN
[email protected]