Chaos management during a major incident

Aish Raj Dahal, Software Engineer, PagerDuty

sudo rm -rf /

Failure free operations require experience with failure - Richard Cook

A story about failure

Chapter I: This is fine

Almost every engineer I knew was on the call

Someone dialed in from a bar

Most people were doing the same task

We had no clue where to start

Chapter II: Dark and Stormy night

“It was a dark and stormy night; the rain fell in torrents — except at occasional intervals, when it was checked by a violent gust of wind…” - Almost every clueless person on the call

Nobody to coordinate

Chapter III: The exec-swoop

“Can you send me a spreadsheet with a list of affected customers?”

Chapter IV: The morning after

What’s wrong with the picture?

Define, Prepare, Measure

Failures should be unique; if not, automate the response.
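One way to picture this: a failure that has been seen before gets a registered, automated response, and only unfamiliar failures page a human. A minimal sketch, assuming hypothetical names throughout (`RUNBOOKS`, `runbook`, `respond`, and the `disk_full` signature are illustrative, not from the talk or from any PagerDuty API):

```python
from typing import Callable, Dict

# Hypothetical registry mapping a known failure signature to its automated fix.
RUNBOOKS: Dict[str, Callable[[], str]] = {}


def runbook(signature: str):
    """Register a function as the automated response for a failure signature."""
    def register(fn: Callable[[], str]) -> Callable[[], str]:
        RUNBOOKS[signature] = fn
        return fn
    return register


@runbook("disk_full")
def clear_tmp() -> str:
    # Placeholder remediation; a real one would act on the host.
    return "rotated logs and cleared temp files"


def respond(signature: str) -> str:
    """Run the automated response if the failure is known, else escalate."""
    action = RUNBOOKS.get(signature)
    if action is None:
        # A failure without a runbook is a new one: it needs people.
        return f"page a human: no runbook for '{signature}'"
    return f"automated: {action()}"
```

Each time a post-mortem finds a repeat failure, adding a runbook for it moves that failure out of the on-call path.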

Single responsibility principle

Role: The Subject Matter Expert

Role: The Incident Commander

Notify that this is a major incident

Verify that all SMEs are present

Divide and conquer

Communicate effectively

Avoid the bystander effect

“Please say yes, if you think it is a good idea to do so.”

“Is there any strong objection to that?”

Role: The Deputy

Assist the Incident Commander

Get all Subject Matter Experts up to speed about what’s happening

Liaise with stakeholders

Role: The Scribe

Documents the timeline of an incident as it progresses
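The scribe's timeline can be as simple as an append-only list of timestamped notes. A minimal sketch, assuming hypothetical names (`TimelineEntry`, `IncidentTimeline`, `record`, `render`); the talk does not prescribe a tool or format:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class TimelineEntry:
    at: datetime   # when the event happened (UTC)
    author: str    # who reported it on the call
    note: str      # what happened or what was decided


@dataclass
class IncidentTimeline:
    entries: List[TimelineEntry] = field(default_factory=list)

    def record(self, author: str, note: str,
               at: Optional[datetime] = None) -> TimelineEntry:
        """Append an entry, stamping it with the current UTC time by default."""
        entry = TimelineEntry(at or datetime.now(timezone.utc), author, note)
        self.entries.append(entry)
        return entry

    def render(self) -> str:
        """Return the timeline in chronological order, one line per entry."""
        ordered = sorted(self.entries, key=lambda e: e.at)
        return "\n".join(
            f"{e.at:%H:%M:%S} [{e.author}] {e.note}" for e in ordered
        )
```

A timeline recorded this way during the incident can be pasted straight into the post-mortem.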

Role: Customer Liaison

Notify customers about the incident

Keep the Incident Commander apprised of any relevant customer information

Transfer of command if necessary

Blameless post-mortem

You can’t fire your way to reliability

Review

happens, prepare for it

Develop on-call empathy

People are your most valuable asset. Don’t burn them out doing something that can be automated.

aishrajdahal

Appendix
• The Big Red Button from Flickr by włodi (CC SA)
• Upside down from Flickr by Akimasa Harada (CC SA)
• Geography from Flickr (CC SA)
• Gene Kranz’s image in the public domain
• That’s all folks image in the public domain

References
• Cook, Richard I. "How complex systems fail." Cognitive Technologies Laboratory, University of Chicago. Chicago IL (1998).
• Incident Response, PagerDuty Incident Response Docs, https://response.pagerduty.com/
• Allspaw, John. "Fault injection in production." Communications of the ACM 55.10 (2012): 48-52.
• Krishnan, Kripa. "Weathering the Unexpected." Communications of the ACM 55.11 (2012): 48-52.
• Limoncelli, Tom, et al. "Resilience engineering: learning to embrace failure." Communications of the ACM 55.11 (2012): 40-47.
