Velocity 2018 - How To Establish A High Severity Incident Management Program

HOW TO ESTABLISH A HIGH SEVERITY INCIDENT MANAGEMENT PROGRAM. @TAMMYBUTOW
@ANA_M_MEDINA GREMLIN

AGENDA
09:00 - WELCOME & INTRODUCTIONS
09:30 - ESTABLISHING YOUR SEV PROGRAM
10:30 - MORNING BREAK ☕
11:00 - MEASURING SUCCESS
11:30 - HANDS-ON PRACTICE
12:15 - Q & A
12:30 - THANKS & CYA LATER!

TAMMY BÜTOW, Principal SRE, Gremlin, @tammybutow
ANA MEDINA, Chaos Engineer, Gremlin, @ana_m_medina

INTRODUCTIONS

SURVIVAL SKILLS FROM THE OUTBACK TO THE CITY.

“Many fears cloud people’s engagement with our wilderness. The fear of snakes, spiders, becoming lost and being alone are all common fears. Survival skills can replace fear with respect for, and trust in, nature. Such knowledge enables people to walk freely and feel safer in our natural environment.”

HOW TO SURVIVE A SNAKE BITE
1. TRUST (NOT FEAR!)
2. CALL FOR HELP
3. BANDAGE & IMMOBILISE LIMB
4. STOP SPREAD OF POISON
5. VENOM DETECTION KIT
6. ANTIVENOM
“SURVIVAL IS A MIND GAME” - BOB COOPER

HOW TO SURVIVE A SEV
1. TRUST (NOT FEAR!)
2. CALL FOR HELP
3. APPLY BANDAGE
4. STOP SPREAD
5. DIAGNOSIS OF ISSUE
6. TECHNICAL RESOLUTION
“SURVIVAL IS A MIND GAME” - BOB COOPER

KNOW THE 5 KEYS TO WILDERNESS SURVIVAL
1. KNOW HOW TO BUILD A SHELTER
2. KNOW HOW TO SIGNAL FOR HELP
3. KNOW WHAT TO EAT & HOW TO FIND IT
4. KNOW HOW TO BUILD AND MAINTAIN A FIRE
5. KNOW HOW TO FIND WATER AND PREPARE SAFE WATER TO DRINK

KNOW THE 5 KEYS TO SEV SURVIVAL
1. KNOW HOW TO FIND SHELTER & WIFI
2. KNOW HOW TO SIGNAL FOR HELP
3. KNOW YOUR CRITICAL SYSTEMS & HOW TO ASSESS THEIR HEALTH
4. KNOW HOW TO BANDAGE ISSUES AND STOP THEIR SPREAD
5. KNOW HOW TO PERFORM TECHNICAL EMERGENCY RESOLUTION

THE PRIMARY OBJECTIVE OF THIS WORKSHOP IS TO PROVIDE AN UNDERSTANDING OF HIGH SEVERITY INCIDENT MANAGEMENT AND ITS RELATED PRACTICES IN AN EASY AND SYSTEMATIC WAY, INCLUDING PRACTICE AS WELL AS THEORY.

SUCCESS IS BASED ON FOUR ASPECTS: TRUST, KNOWLEDGE, PRACTICE &
MEASUREMENT

SURVEY: CURRENT STATE OF INCIDENT MANAGEMENT https://goo.gl/Yma4d2

HOW DO YOU EMPOWER EVERYONE IN YOUR COMPANY TO IDENTIFY PROBLEMS AND SIGNAL FOR HELP?

Insert illustration of a building

HAS THAT EVER HAPPENED WHERE YOU’VE WORKED?

EMPOWER EVERYONE.

#velocityconf

HOW TO ESTABLISH A HIGH SEVERITY INCIDENT MANAGEMENT PROGRAM

What is High Severity Incident Management?

What are the 4 most common types of SEVs?

1. The Availability Drop

2. The Broken Feature

3. The Loss of Data

Cry baby

4. The Security Risk

Let’s take a journey together outside this room

Put on your SEV backpack

Monday 7pm

You’re out having dinner

You start getting errors from the database for your service.
“MySQL server has gone away”

You use your SEV tool to get help

“Getting errors, app having issues too. Not sure what’s happening yet. MySQL?”
SEV Reported by you
Current SEV Level: 1

IMOC is auto-paged and on the case

The SEV is automatically named

SEV 1 Fast Frog

The IMOC finds a TLOC to resolve the issue

Tons of teams across the company are getting alerts. It’s an alert storm!

Insert storm pic

Everyone across the company looks in #sevs on Slack and checks the sevs@ mailing list for updates

Threads running is high, the database is hot!

Database is being hammered!

What’s happening?

TLOC is looking at the database queries

Normal queries, nothing has changed

More queries than usual

Where are they coming from?

Our queries have metadata for the service

1. It’s the API
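Attributing load to a service is only possible when each query carries some marker of its origin. The deck does not say how the metadata is attached, so the sketch below assumes a hypothetical convention: a SQL comment of the form `/* service=NAME */` on every query. The function name and sample queries are illustrative only.

```python
import re
from collections import Counter

# Assumed (hypothetical) convention: queries carry a comment like /* service=api */.
SERVICE_TAG = re.compile(r"/\*\s*service=([\w-]+)\s*\*/")


def count_queries_by_service(queries):
    """Count queries per originating service; untagged queries count as 'unknown'."""
    counts = Counter()
    for query in queries:
        match = SERVICE_TAG.search(query)
        counts[match.group(1) if match else "unknown"] += 1
    return counts


sample = [
    "/* service=api */ SELECT * FROM orders WHERE user_id = 42",
    "/* service=api */ SELECT * FROM orders WHERE user_id = 42",
    "/* service=www */ SELECT name FROM users WHERE id = 7",
    "SELECT 1",
]
# Services sorted from most to least queries.
print(count_queries_by_service(sample).most_common())
```

Sorting by count makes the noisiest service stand out immediately, which is the question the TLOC is asking at this point in the story.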

PUT THAT EVIDENCE IN YOUR BACKPACK

Alarm! Availability SLA is breached for WWW and API

SEV is upgraded to a SEV 0

SEV 0 Fast Frog
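The automatic upgrade in this story can be thought of as a simple rule: if availability falls below the SLA target, the SEV becomes a SEV 0. The following is a minimal sketch of that rule; the function name and the example availability figures are assumptions, not values from the talk.

```python
def escalate_on_sla_breach(current_level, availability_pct, sla_target_pct):
    """Return SEV level 0 when availability is below the SLA target, else the current level."""
    if availability_pct < sla_target_pct:
        return 0
    return current_level


# Hypothetical readings for a service with a 99.99% target.
print(escalate_on_sla_breach(1, 99.2, 99.99))    # SLA breached, so the SEV is upgraded
print(escalate_on_sla_breach(1, 99.995, 99.99))  # SLA still met, so the level is unchanged
```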

Automation in full-force

Executive Leadership Team are auto-emailed

We have only 15 min remaining to resolve the SEV 0

15 MINUTES

Keep going!

Start killing queries to restore service

Are the queries in the slow log from one user
or many users?

2. It’s mostly one user

Is the one user legitimate?

What kind of workload are they performing?

3. It’s a heavy workload, heavier than we usually get.

Do we have rate limiting and throttling?

4. It isn’t working well in this situation

Let’s temporarily kill queries for this user. We can use a query kill loop or use the support app. Then service will return to normal for everyone.
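A query kill loop can be sketched as: read the process list, pick out the running queries for the offending user, and issue `KILL QUERY` for each. The sketch below only builds the statements so it runs without a database. It assumes the user can be identified by the `USER` column of MySQL's `information_schema.PROCESSLIST`; in practice the customer may only be identifiable through query metadata. The function name, the `batch_customer` account and the sample rows are hypothetical.

```python
def build_kill_statements(processlist, user, min_seconds=0):
    """Return KILL QUERY statements for one user's running queries.

    `processlist` is a list of dicts using the ID, USER, COMMAND and TIME
    columns exposed by information_schema.PROCESSLIST.
    """
    statements = []
    for row in processlist:
        is_target = row["USER"] == user and row["COMMAND"] == "Query"
        if is_target and row["TIME"] >= min_seconds:
            # KILL QUERY stops the running statement but keeps the connection open.
            statements.append(f"KILL QUERY {row['ID']}")
    return statements


# Hypothetical snapshot of the process list during the SEV.
rows = [
    {"ID": 101, "USER": "batch_customer", "COMMAND": "Query", "TIME": 12},
    {"ID": 102, "USER": "batch_customer", "COMMAND": "Sleep", "TIME": 40},
    {"ID": 103, "USER": "other_customer", "COMMAND": "Query", "TIME": 3},
    {"ID": 104, "USER": "batch_customer", "COMMAND": "Query", "TIME": 1},
]
print(build_kill_statements(rows, "batch_customer", min_seconds=5))
```

In a real loop this would be repeated against a fresh process list every few seconds until load returns to normal, and then switched off once the longer-term fix is in place.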

SLA is back on track. We MITIGATED the SEV 0 in 5 minutes!

Let’s open up our evidence backpack

Our Evidence Backpack
1. It’s the API
2. It’s one user
3. It’s a heavier workload
4. Our rate limiting & throttling can’t handle this workload
5. We temp resolved by killing queries from this customer

Let’s check what rate limiting and throttling are currently set to

We need to fix that, add an action item.
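For readers unfamiliar with per-user rate limiting, a token bucket is one common technique. The deck does not describe the mechanism actually in use, so this is an illustrative sketch rather than the presenters' system; the class name, parameters and the `batch-customer` key are assumptions.

```python
import time


class TokenBucket:
    """Per-user token bucket: `rate` tokens refill per second, up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock      # injectable so the behaviour can be tested
        self.tokens = {}        # user -> tokens currently available
        self.last_seen = {}     # user -> time of the last refill

    def allow(self, user, cost=1):
        """Return True and spend `cost` tokens if the user has enough, else False."""
        now = self.clock()
        elapsed = now - self.last_seen.get(user, now)
        available = self.tokens.get(user, self.capacity) + elapsed * self.rate
        available = min(self.capacity, available)
        self.last_seen[user] = now
        if available >= cost:
            self.tokens[user] = available - cost
            return True
        self.tokens[user] = available
        return False


# Demonstration with a fake clock: 1 request/second, bursts of up to 2.
fake_now = [0.0]
bucket = TokenBucket(rate=1, capacity=2, clock=lambda: fake_now[0])
burst = [bucket.allow("batch-customer") for _ in range(3)]
fake_now[0] += 1.0
after_wait = bucket.allow("batch-customer")
print(burst, after_wait)
```

The `cost` argument is one way to make heavy batch-style requests consume more of a user's budget than light ones.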

Let’s also reach out to the customer and understand this
heavy workload they are performing

They do batch-style processing using our API. They plan to
do it Monday 7pm every week. How can we better support it long-term?

That’s what a SEV 0 looks like

What are SEV levels?

How To Establish SEV levels
SEV Level | Description | Target resolution time | Who is notified
SEV 0 | Catastrophic Service Impact | Resolve within 15 min | Entire company
SEV 1 | Critical Service Impact | Resolve within 8 hours | Teams working on SEV & CTO
SEV 2 | High Service Impact | Resolve within 24 hours | Teams working on SEV
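Encoding the SEV levels as data lets tooling look up who to notify and whether a SEV has passed its target resolution time. The sketch below uses the values from the table above; the `SevLevel` type and the `is_past_target` helper are hypothetical names.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class SevLevel:
    level: int
    description: str
    target_resolution: timedelta
    notified: str


# Values taken from the SEV level table.
SEV_LEVELS = {
    0: SevLevel(0, "Catastrophic Service Impact", timedelta(minutes=15), "Entire company"),
    1: SevLevel(1, "Critical Service Impact", timedelta(hours=8), "Teams working on SEV & CTO"),
    2: SevLevel(2, "High Service Impact", timedelta(hours=24), "Teams working on SEV"),
}


def is_past_target(level, elapsed):
    """Return True if a SEV at `level` has been open longer than its target."""
    return elapsed > SEV_LEVELS[level].target_resolution


print(SEV_LEVELS[0].notified)
print(is_past_target(0, timedelta(minutes=20)))
```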

How do your resolution times impact SLOs/SLAs?

What is an SLA of 99.99%?

Daily: 8.6s
Weekly: 1m 0.5s
Monthly: 4m 23.0s
Yearly: 52m 35.7s
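These downtime budgets follow from multiplying the allowed failure fraction (1 minus the SLA) by the length of the period. A minimal sketch, assuming an average Gregorian year of 365.2425 days and a month of one twelfth of that year (the deck does not state which period lengths it used):

```python
DAY = 24 * 60 * 60        # seconds in a day
YEAR = 365.2425 * DAY     # assumption: average Gregorian year
PERIODS = {"Daily": DAY, "Weekly": 7 * DAY, "Monthly": YEAR / 12, "Yearly": YEAR}


def allowed_downtime_seconds(sla_pct, period_seconds):
    """Seconds of downtime permitted in a period at the given SLA percentage."""
    return (1 - sla_pct / 100) * period_seconds


def fmt(seconds):
    """Format seconds as 'Xm Y.Ys', omitting minutes when there are none."""
    minutes, secs = divmod(seconds, 60)
    return f"{int(minutes)}m {secs:.1f}s" if minutes else f"{secs:.1f}s"


for name, period in PERIODS.items():
    print(f"{name}: {fmt(allowed_downtime_seconds(99.99, period))}")
```

Changing `99.99` to another target, such as `99.9`, shows how quickly the budget grows or shrinks with each extra nine.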

What is 52 minutes in a year? Less than 1 meeting

How can you be ready to sprint to mitigation at
any moment?

What should a SEV not look like?

What is the full lifecycle of a SEV?

How are SEVs measured?

% loss * outage duration

How do you create SEV levels for your company?

SEV levels for data loss
SEV Level | Data Loss Impact
SEV 0 | Loss of customer data
SEV 1 | Loss of primary backup
SEV 2 | Loss of secondary backup

What does a SEV look like?

We measure this SEV as:
0.2% * 30 min (6) for WWW
0.11% * 30 min (3.3) for API
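The measurement above is the percentage loss multiplied by the outage duration in minutes. A small sketch of that arithmetic, with a hypothetical function name:

```python
def sev_impact(loss_pct, duration_min):
    """Impact score: percentage loss multiplied by outage duration in minutes."""
    return loss_pct * duration_min


# The two services from the example SEV.
for service, loss_pct in [("WWW", 0.2), ("API", 0.11)]:
    print(service, round(sev_impact(loss_pct, 30), 2))
```

Because the score is a single number per service, SEVs of different shapes (a short deep outage versus a long shallow one) can be compared and summed over a quarter.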

How do you ensure your team operates effectively during a
SEV 0?

Incident Manager On-Call (IMOC)

Small Rotation of Engineering Leaders

One person is on-call in this role at any point
in time

Can be paged by emailing imoc-pager@

Wide knowledge of services and engineering teams
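Since exactly one person holds the IMOC role at any time, the rotation can be computed rather than looked up by hand. This is a minimal sketch assuming week-long shifts; the shift length, names and start date are assumptions, not details from the talk.

```python
from datetime import date


def imoc_on_call(rotation, rotation_start, day):
    """Return who holds the IMOC pager on `day`, assuming week-long shifts."""
    weeks_elapsed = (day - rotation_start).days // 7
    return rotation[weeks_elapsed % len(rotation)]


# Hypothetical small rotation of engineering leaders.
rotation = ["leader-a", "leader-b", "leader-c"]
start = date(2018, 6, 4)  # a Monday
print(imoc_on_call(rotation, start, date(2018, 6, 12)))
```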

Tech Lead On-Call (TLOC)

The engineer responsible for resolving the SEV

Deep knowledge of own service area

Deep knowledge of upstream and downstream dependencies

How do you set up IMOCs for success during SEV 0s?

How do you categorise SEVs?

How do you empower everyone in your company to fix things that are broken?

gremlin.com/community

How should you name SEVs?

0086343430

SEV 0 Fast Frog
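A memorable name such as "Fast Frog" is easier to say and search for than a numeric ID. One way to generate such names is to pair a random adjective with a random animal; the word lists and function names below are made up for illustration. The name is kept separate from the level so it stays the same when a SEV is upgraded or downgraded.

```python
import random

# Hypothetical word lists; a real tool would use longer ones to avoid collisions.
ADJECTIVES = ["Fast", "Brave", "Quiet", "Sunny"]
ANIMALS = ["Frog", "Koala", "Wombat", "Emu"]


def generate_sev_name(rng=random):
    """Return a two-word name such as 'Fast Frog'."""
    return f"{rng.choice(ADJECTIVES)} {rng.choice(ANIMALS)}"


def display_name(level, name):
    """Combine the current level with the stable name, e.g. 'SEV 0 Fast Frog'."""
    return f"SEV {level} {name}"


name = generate_sev_name(random.Random(42))
print(display_name(1, name))
print(display_name(0, name))  # same name after the SEV is upgraded
```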

What causes SEVs?

Pareto Principle

Technical & Cultural Issues

What are some of the expected issues you are likely
to experience?

Technical Issues
•Dependency Failure
•Region/Zone Failure
•Provider Failure
•Overheating
•PDU failure
•Network upgrades
•Rack failures
•Core Switch failures
•Connectivity issues
•Flaky DNS
•Misconfigured machines
•Bugs
•Corrupt or unavailable backups
Cultural Issues
•Lack of knowledge sharing
•Lack of knowledge handover
•Lack of on-call training
•Lack of chaos engineering
•Lack of an incident management program
•Lack of documentation and playbooks
•Lack of alerts and pages
•Lack of effective alerting thresholds
•Lack of backup strategy

How do you prevent SEVs from repeating?

Let’s look at high impact practices...

An Incident Management Program

A helpful IMOC Rotation

Automation Tooling For Incident Management

Chaos Engineering

Insert calm kid calling on the phone
Calling for help when an incident happens is awesome!

HANDS-ON EXERCISE (GROUPS OF 3 OR 4)

CREATE YOUR OWN INCIDENT MANAGEMENT PROGRAM
1. DETERMINE HOW YOU WILL MEASURE SEVS
2. DETERMINE SEV LEVELS
3. SET YOUR SLOS
4. CREATE YOUR IMOC ROTATION
5. START USING AUTOMATION TOOLING FOR SEVS
6. BUILD A CRITICAL SERVICE DASHBOARD

ENJOY YOUR MORNING BREAK ☕

MEASURING THE SUCCESS OF YOUR INCIDENT MANAGEMENT PROGRAM

MEASURE YOUR INCIDENT MANAGEMENT PROGRAM
1. ENSURING YOUR TEAM OPERATES EFFECTIVELY DURING A SEV 0
2. SETTING UP IMOCS FOR SUCCESS DURING SEV 0s
3. EMPOWERING EVERYONE IN YOUR COMPANY TO REPORT SEVs
4. SEV CAUSES
5. CATEGORISING SEVs
6. PREVENTING SEVs FROM REPEATING
7. USING CHAOS ENGINEERING FOR SEV PREVENTION

GOAL: ENSURING YOUR TEAM OPERATES EFFECTIVELY DURING A SEV 0
MEASURED BY: SURVEY FEEDBACK & TTR

GOAL: SETTING UP IMOCS FOR SUCCESS DURING SEV 0s
MEASURED BY: IMOC & TLOC SURVEYS

GOAL: EMPOWERING EVERYONE IN YOUR COMPANY TO RECORD SEVS
MEASURED BY: TTD & COMPANY-WIDE SURVEY

GOAL: UNDERSTAND SEV CAUSES
MEASURED BY: TAG SEVS BY CAUSES

GOAL: CATEGORISE SEVS
MEASURED BY: TAGS FOR SERVICE, TEAM, DEPARTMENT ETC.
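Once SEVs are tagged, counting the tags shows where prevention work will pay off most, in the spirit of the Pareto principle mentioned earlier. The sketch below assumes each SEV record is a dict with a single `cause` tag; the record shape, the 80% threshold and the sample data are assumptions.

```python
from collections import Counter


def top_causes(sevs, share=0.8):
    """Return the most frequent cause tags that together cover `share` of all SEVs."""
    counts = Counter(sev["cause"] for sev in sevs)
    total = sum(counts.values())
    selected, covered = [], 0
    for cause, count in counts.most_common():
        if covered >= share * total:
            break
        selected.append(cause)
        covered += count
    return selected


# Hypothetical tagged SEVs from one quarter.
sevs = (
    [{"cause": "dependency-failure"}] * 6
    + [{"cause": "misconfigured-machines"}] * 2
    + [{"cause": "flaky-dns"}]
    + [{"cause": "bugs"}]
)
print(top_causes(sevs))
```

The same counting works for service, team or department tags by changing the key that is read from each record.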

GOAL: PREVENT SEVS FROM REPEATING
MEASURED BY: TBF FOR SEVS

GOAL: USE CHAOS ENGINEERING TO EMPOWER YOUR TEAMS TO PREVENT SEVS
MEASURED BY: SEVS WHICH HAVE BEEN REPRODUCED THROUGH CHAOS ENGINEERING.
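Several of these goals rely on TTD, TTR and TBF, which the deck does not expand. The sketch below assumes they mean time to detect (start to detection), time to resolve (start to resolution) and time between failures (end of one SEV to the start of the next). The record fields and sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def ttd(sev):
    """Time to detect: from the start of impact to detection (assumed definition)."""
    return sev["detected"] - sev["started"]


def ttr(sev):
    """Time to resolve: from the start of impact to resolution (assumed definition)."""
    return sev["resolved"] - sev["started"]


def mean_tbf(sevs):
    """Mean gap between the end of one SEV and the start of the next, or None."""
    ordered = sorted(sevs, key=lambda sev: sev["started"])
    gaps = [nxt["started"] - prev["resolved"] for prev, nxt in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps) if gaps else None


# Hypothetical SEV records.
sev_log = [
    {"started": datetime(2018, 6, 4, 19, 0), "detected": datetime(2018, 6, 4, 19, 2),
     "resolved": datetime(2018, 6, 4, 19, 20)},
    {"started": datetime(2018, 6, 11, 19, 0), "detected": datetime(2018, 6, 11, 19, 1),
     "resolved": datetime(2018, 6, 11, 19, 10)},
]
print(ttd(sev_log[0]), ttr(sev_log[0]), mean_tbf(sev_log))
```

Tracking these per service makes it possible to see whether TTD and TTR are shrinking and TBF is growing over time.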

WHAT ELSE CAN YOU MEASURE?

METRICS FOR YOUR INCIDENT MANAGEMENT PROGRAM
1. A SEV DASHBOARD
2. CREATE & SEND SEV REPORTS
3. CREATE & SEND KPI REPORTS
4. SET GOALS FOR SEV REDUCTION
5. ESTABLISH A MONTHLY SRE SYNC
6. SEV TRAINING INCLUDING GAMEDAYS & CHAOSDAYS

HANDS-ON PRACTICE

BRAINSTORM: HOW DO YOU REDUCE TTD FOR YOUR TOP 5
CRITICAL SERVICES?

•Step 0 - Incident classification including: SEV descriptions and levels, the SEV timeline and the TTD timeline
•Step 1 - Organization-wide critical service monitoring including: key dashboards and KPI metrics emails
•Step 2 - Service ownership and metrics including: measuring TTD by service, service triage, service ownership, building a service ownership service (SOS) and service alerting
•Step 3 - On-Call Principles including: Pareto principle, rotation structure, alert threshold maintenance and escalation practices
•Step 4 - Chaos Engineering including: chaos days and continuous chaos
•Step 5 - Self-Healing Systems including: when automation incidents occur, monitoring and metrics for self-healing system automation

PRACTICE A SEV REVIEW

BAD POST-SEV REVIEW EXAMPLE

GOOD POST-SEV REVIEW EXAMPLE

DETERMINE HOW YOU WOULD CREATE THE FOLLOWING:
1. A SEV DASHBOARD
2. CREATE & SEND SEV REPORTS
3. CREATE & SEND KPI REPORTS
4. SET GOALS FOR SEV REDUCTION
5. ESTABLISH A MONTHLY SRE SYNC
6. SEV TRAINING INCLUDING GAMEDAYS & CHAOSDAYS

SHARE SOMETHING YOU WILL TAKE BACK TO YOUR COMPANY WITH EVERYONE

TODAY IS A BEAUTIFUL DAY TO START A HIGH SEVERITY
INCIDENT MANAGEMENT PROGRAM

Learn from & help others on this journey: Join the Chaos & Reliability Community
gremlin.com/community
gremlin.com/slack
Thank you
@TAMMYBUTOW @ANA_M_MEDINA GREMLIN
