How To Establish A High Severity Incident Management Program

HOW TO ESTABLISH AN INCIDENT MANAGEMENT PROGRAM @tammybutow

What do jelly beans have to do with incident management?

Insert kid crying

Insert kid running around

Insert calm kid calling on the phone

Insert Jelly Beans

Insert photo of my mum and me

Hi, I’m Tammy Butow, SRE @ gremlin.com.
I’ve worked on high severity incidents my entire life, and I’ve gotten better at it!

10+ years.

Gremlin
Dropbox
DigitalOcean
National Australia Bank
Queensland University of Technology
My home in Eastwood, NSW, Australia

How do you empower everyone in your company to identify
problems and get help?

Empower Everyone.

Insert illustration of a building

Has that ever happened where you’ve worked?

FOR YOUR ENTIRE COMPANY

One common misconception…

All people who resolve incidents are heroes.

Hero vs Helper

I’m a helper.

What is High Severity Incident Management?

What are the 4 most common types of SEVs?

1. The Availability Drop

2. The Broken Feature

3. The Loss of Data

Cry baby

4. The Security Risk

Let’s take a journey together outside this room

Put on your SEV backpack

Monday 7pm

You’re out on a date enjoying a lovely dinner

You start getting errors from the database for your service.
“MySQL server has gone away”.

You use the SEV tool to get help

Getting errors, app having issues too. Not sure what’s happening yet. MySQL?
SEV Reported by: you
Current SEV Level: 1

IMOC is auto-paged and on the case

The SEV is automatically named

SEV 1 Fast Frog

The IMOC finds a TLOC to resolve the issue

Tons of teams across the company are getting alerts. It’s an alert storm!

Insert storm pic

Everyone across the company looks in #sevs on Slack and checks the sevs@ mailing list for updates

Threads running is high, the database is hot!

Database is being hammered!

What’s happening?

TLOC is looking at the database queries

Normal queries, nothing has changed

More queries than usual

Where are they coming from?

Our queries have metadata for the service

1. It’s the API

PUT THAT EVIDENCE IN YOUR BACKPACK

Alarm! Availability SLA is breached for WWW and API

SEV is upgraded to a SEV 0

SEV 0 Fast Frog

Automation in full-force

Executive Leadership Team are auto-emailed

We have only 15 min remaining to resolve the SEV 0

15 MINUTES

Keep going!

Start killing queries to restore service

Are the queries in the slow log from one user
or many users?

2. It’s mostly one user
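The "one user or many users?" question can be answered by counting slow-log entries per user. The sketch below assumes the slow log has already been parsed into dictionaries with a `user` key; the entry shape, the sample data, and the 50% threshold are illustrative assumptions, not details from the talk.

```python
from collections import Counter


def queries_by_user(slow_log_entries):
    """Count slow-log entries per user, most frequent first."""
    counts = Counter(entry["user"] for entry in slow_log_entries)
    return counts.most_common()


def dominant_user(slow_log_entries, threshold=0.5):
    """Return the user behind more than `threshold` of the entries, or None."""
    ranked = queries_by_user(slow_log_entries)
    if not ranked:
        return None
    user, count = ranked[0]
    share = count / len(slow_log_entries)
    return user if share > threshold else None


# Hypothetical parsed slow-log entries, for illustration only.
entries = [
    {"user": "customer_42", "service": "api"},
    {"user": "customer_42", "service": "api"},
    {"user": "customer_42", "service": "api"},
    {"user": "customer_7", "service": "api"},
]

print(queries_by_user(entries))
print(dominant_user(entries))
```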

Is the one user legitimate?

What kind of workload are they performing?

3. It’s a heavy workload, heavier than we usually get.

Do we have rate limiting and throttling?

4. It isn’t working well in this situation

Let’s temporarily kill queries for this user. We can use
a query kill loop or use the support app. Then service will return to normal for everyone.
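One pass of a query kill loop might look like the sketch below. It assumes the rows of MySQL's `SHOW PROCESSLIST` have already been fetched into dictionaries with lowercase `id`, `user`, `command`, and `time` keys; the function name, the row shape, and the sample data are hypothetical. It only builds the `KILL QUERY` statements, so they can be reviewed before anything is executed.

```python
def build_kill_statements(processlist, user, min_seconds=0):
    """Build KILL QUERY statements for one user's running queries.

    `processlist` is assumed to be a list of dicts derived from
    SHOW PROCESSLIST; this shape is an assumption for illustration.
    """
    statements = []
    for row in processlist:
        if row["user"] != user:
            continue  # leave every other user's queries alone
        if row["command"] != "Query":
            continue  # skip idle connections (e.g. Sleep)
        if row["time"] < min_seconds:
            continue  # skip queries that only just started
        statements.append(f"KILL QUERY {int(row['id'])};")
    return statements


# Hypothetical process list, for illustration only.
processlist = [
    {"id": 101, "user": "customer_42", "command": "Query", "time": 12},
    {"id": 102, "user": "internal_job", "command": "Query", "time": 30},
    {"id": 103, "user": "customer_42", "command": "Query", "time": 45},
    {"id": 104, "user": "customer_42", "command": "Sleep", "time": 300},
]

for statement in build_kill_statements(processlist, "customer_42"):
    print(statement)
```

In a real loop, the process list would be re-fetched and this step repeated at a short interval until the workload subsides or rate limiting is fixed.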

SLA is back on track.
MITIGATED the SEV 0 in 5 minutes!

Let’s open up our evidence backpack

Our Evidence Backpack
1. It’s the API
2. It’s one user
3. It’s a heavier workload
4. Our rate limiting & throttling can’t handle this workload
5. We temporarily resolved by killing queries from this customer

Let’s check what rate limiting and throttling are currently set to

We need to fix that. Add an action item.

Let’s also reach out to the customer and understand this
heavy workload they are performing

They do batch-style processing using our API. They plan to
do it Monday 7pm every week. How can we better support it long-term?

That’s what a SEV 0 looks like

What are SEV levels?

How To Establish SEV levels - Diabetes

SEV Level   Description                   Target resolution time    Who is notified
SEV 0       Catastrophic Service Impact   Resolve within 10 min     Ambulance
SEV 1       Critical Service Impact       Resolve within 8 hours    Neighbour & Best Friend
SEV 2       High Service Impact           Resolve within 24 hours   Best Friend

How To Establish SEV levels

SEV Level   Description                   Target resolution time    Who is notified
SEV 0       Catastrophic Service Impact   Resolve within 15 min     Entire company
SEV 1       Critical Service Impact       Resolve within 8 hours    Teams working on SEV & CTO
SEV 2       High Service Impact           Resolve within 24 hours   Teams working on SEV
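The SEV level table can be kept as data so that tooling can look up who to notify and how much time remains. This is a minimal sketch: the values come from the table, while the `SevLevel` class and the `time_remaining` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class SevLevel:
    level: int
    description: str
    target_resolution: timedelta
    notified: str


# Values taken from the SEV level table; the structure is an assumption.
SEV_LEVELS = {
    0: SevLevel(0, "Catastrophic Service Impact", timedelta(minutes=15), "Entire company"),
    1: SevLevel(1, "Critical Service Impact", timedelta(hours=8), "Teams working on SEV & CTO"),
    2: SevLevel(2, "High Service Impact", timedelta(hours=24), "Teams working on SEV"),
}


def time_remaining(level, elapsed):
    """Time left before the target resolution time is missed (negative if overdue)."""
    return SEV_LEVELS[level].target_resolution - elapsed


print(SEV_LEVELS[0].notified)
print(time_remaining(0, timedelta(minutes=5)))
```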

How do your resolution times impact SLOs/SLAs?

What is an SLA of 99.99%?

Daily: 8.6s
Weekly: 1m 0.5s
Monthly: 4m 23.0s
Yearly: 52m 35.7s
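These downtime budgets follow from multiplying each period by the 0.01% that a 99.99% SLA leaves over. The sketch below reproduces the figures, assuming an average Gregorian year of 365.2425 days and a month of one twelfth of that year; those period lengths and the function names are assumptions, not stated in the talk.

```python
DAY = 24 * 60 * 60
WEEK = 7 * DAY
YEAR = 365.2425 * DAY  # assumed average Gregorian year
MONTH = YEAR / 12      # assumed average month


def downtime_budget_seconds(sla_percent, period_seconds):
    """Seconds of downtime allowed in a period at the given SLA percentage."""
    return period_seconds * (1 - sla_percent / 100)


def format_duration(seconds):
    """Format seconds as '8.6s' or '52m 35.7s'."""
    minutes, secs = divmod(seconds, 60)
    if minutes >= 1:
        return f"{int(minutes)}m {secs:.1f}s"
    return f"{secs:.1f}s"


for name, period in [("Daily", DAY), ("Weekly", WEEK), ("Monthly", MONTH), ("Yearly", YEAR)]:
    print(f"{name}: {format_duration(downtime_budget_seconds(99.99, period))}")
```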

What is 52 minutes in a year? Less than 1 meeting

How can you be ready to sprint to mitigation at
any moment?

What is the full lifecycle of a SEV?

How are SEVs measured?

% loss * outage duration

How do you create SEV levels for your company?

SEV levels for data loss

SEV Level   Data Loss Impact
SEV 0       Loss of customer data
SEV 1       Loss of primary backup
SEV 2       Loss of secondary backup

What does a SEV look like?

We measure this SEV as:
0.2% * 30 min (6) for WWW
0.11% * 30 min (3.3) for API
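The measurement is the "% loss * outage duration" formula applied per service. A minimal sketch, with a hypothetical function name and the two figures from this SEV as input:

```python
def sev_impact(percent_loss, outage_minutes):
    """SEV impact score: percentage loss multiplied by outage duration in minutes."""
    return percent_loss * outage_minutes


# Figures from the example SEV above.
for service, loss in [("WWW", 0.2), ("API", 0.11)]:
    print(service, round(sev_impact(loss, 30), 2))
```

A single number per service makes SEVs comparable over time, whatever their cause.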

How do you ensure your team operates effectively during a
SEV 0?

Incident Manager On-Call (IMOC)

Small Rotation of Engineering Leaders

One person is on-call in this role at any point
in time

Can be paged by emailing imoc-pager@

Wide knowledge of services and engineering teams

Tech Lead On-Call (TLOC)

The engineer responsible for resolving the SEV

Deep knowledge of own service area

Deep knowledge of upstream and downstream dependencies

How do you set up IMOCs for success during SEV 0s?

How do you categorise SEVs?

How do you empower everyone in your company to fix things that are broken?

How should you name SEVs?

0086343430

SEV 0 Fast Frog
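The talk does not say how names like "Fast Frog" are generated. One possible approach, sketched here under that caveat, derives an adjective and animal pair from the SEV's numeric ID so the codename stays the same when the level changes. The word lists and function names are hypothetical.

```python
# Hypothetical word lists; a real tool would use much longer ones.
ADJECTIVES = ["Fast", "Calm", "Bright"]
ANIMALS = ["Frog", "Otter", "Heron"]


def codename(sev_id):
    """Map a numeric SEV ID to a stable, memorable adjective + animal pair."""
    adjective = ADJECTIVES[sev_id % len(ADJECTIVES)]
    animal = ANIMALS[(sev_id // len(ADJECTIVES)) % len(ANIMALS)]
    return f"{adjective} {animal}"


def display_name(level, name):
    """Combine the current SEV level with the codename."""
    return f"SEV {level} {name}"


name = codename(0)
print(display_name(1, name))  # when first reported
print(display_name(0, name))  # after the upgrade, same codename
```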

What causes SEVs?

Pareto Principle

Technical & Cultural Issues

What are some of the expected issues you are likely
to experience?

Technical Issues
Dependency Failure
Region/Zone Failure
Provider Failure
Overheating
PDU failure
Network upgrades
Rack failures
Core Switch failures
Connectivity issues
Flaky DNS
Misconfigured machines
Bugs
Corrupt or unavailable backups

Cultural Issues
Lack of knowledge sharing
Lack of knowledge handover
Lack of on-call training
Lack of chaos engineering
Lack of an incident management program
Lack of documentation and playbooks
Lack of alerts and pages
Lack of effective alerting thresholds
Lack of backup strategy

How do you prevent SEVs from repeating?

Let’s look at high impact practices….

An Incident Management Program

A helpful IMOC Rotation

Automation Tooling For Incident Management

Chaos Engineering

Insert calm kid calling on the phone

Calling for help when an incident happens is awesome!

Create Your Own Incident Management Program
1. Determine how you will measure SEVs
2. Determine your SEV Levels
3. Set your SLOs
4. Create your IMOC rotation
5. Start using automation tooling for SEVs
6. Build a critical service dashboard

It’s a beautiful day to start

Learn from and help others on this journey:
Join the Chaos & Reliability Community
gremlin.com/community

Thank you
@tammybutow
[email protected]
gremlin.com/slack
