Slide 1

Slide 1 text

Aish Raj Dahal, Software Engineer, PagerDuty
Chaos management during a major incident

Slide 2

Slide 2 text

sudo rm -rf /

Slide 3

Slide 3 text

Failure free operations require experience with failure
 - Richard Cook

Slide 4

Slide 4 text

A story about failure

Slide 5

Slide 5 text

Chapter I: This is fine

Slide 6

Slide 6 text

Almost every engineer I knew was on the call

Slide 7

Slide 7 text

Someone dialed in from a bar

Slide 8

Slide 8 text

Most people were doing the same task

Slide 9

Slide 9 text

We had no clue where to start

Slide 10

Slide 10 text

Chapter II: Dark and Stormy night

Slide 11

Slide 11 text

“It was a dark and stormy night; the rain fell in torrents — except at occasional intervals, when it was checked by a violent gust of wind…..” - Almost every clueless person on the call

Slide 12

Slide 12 text

Nobody to coordinate

Slide 13

Slide 13 text

Chapter III: The exec-swoop

Slide 14

Slide 14 text

“Can you send me a spreadsheet with a list of affected customers?”

Slide 15

Slide 15 text

Chapter IV: The morning after

Slide 16

Slide 16 text

What’s wrong with the picture?

Slide 17

Slide 17 text

Define, Prepare, Measure

Slide 18

Slide 18 text

Failures should be unique. If a failure repeats, automate the response.
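
One way to spot failures that are not unique is to count how often each failure mode recurs. The sketch below is illustrative only: the function name `automation_candidates`, the failure labels, and the threshold of two occurrences are assumptions, not part of the talk.

```python
from collections import Counter


def automation_candidates(failure_labels, threshold=2):
    """Return failure labels seen at least `threshold` times, most frequent first."""
    counts = Counter(failure_labels)
    return [label for label, seen in counts.most_common() if seen >= threshold]


# Hypothetical incident history: one short label per past incident.
past_incidents = [
    "disk-full",
    "cert-expired",
    "disk-full",
    "dns-timeout",
    "disk-full",
    "cert-expired",
]

repeats = automation_candidates(past_incidents)
print(repeats)  # ['disk-full', 'cert-expired']
```

Anything in `repeats` has happened more than once, so it is a candidate for a runbook or an automated remediation rather than another manual response.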

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

Single responsibility principle

Slide 21

Slide 21 text

Role: The Subject Matter Expert

Slide 22

Slide 22 text

Role: The Incident Commander

Slide 23

Slide 23 text

Notify that this is a major incident

Slide 24

Slide 24 text

Verify that all SMEs are present
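
A roll call like this can be sketched as a set difference between the areas that need an expert and the areas already represented on the call. The area names, the people, and the helper `missing_smes` are hypothetical.

```python
def missing_smes(required_areas, present):
    """Return the required areas that have no SME on the call, sorted by name."""
    return sorted(set(required_areas) - set(present))


# Hypothetical roll call: area -> SME who has joined the call.
required = ["database", "network", "api"]
on_the_call = {"database": "alice", "api": "bob"}

still_needed = missing_smes(required, on_the_call)
print(still_needed)  # ['network']
```

An empty result means every required area has someone present and the Incident Commander can move on.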

Slide 25

Slide 25 text

Divide and conquer

Slide 26

Slide 26 text

Communicate effectively

Slide 27

Slide 27 text

Avoid the bystander effect

Slide 28

Slide 28 text

“Please say yes, if you think it is a good idea to do so.”

Slide 29

Slide 29 text

“Is there any strong objection to that?”

Slide 30

Slide 30 text

Role: The Deputy

Slide 31

Slide 31 text

Assist the Incident Commander

Slide 32

Slide 32 text

Get all Subject Matter Experts up to speed about what’s happening

Slide 33

Slide 33 text

Liaise with stakeholders

Slide 34

Slide 34 text

Role: The Scribe

Slide 35

Slide 35 text

Document the timeline of an incident as it progresses
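
A minimal sketch of what a scribe's timeline could look like in code, assuming timestamped free-text notes are enough. The `Timeline` class, its methods, and the sample entries are hypothetical and do not describe any PagerDuty tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Timeline:
    """Append-only list of (timestamp, note) pairs kept by the scribe."""

    entries: list = field(default_factory=list)

    def note(self, text, at=None):
        # Default to the current UTC time when no timestamp is supplied.
        self.entries.append((at or datetime.now(timezone.utc), text))

    def render(self):
        # Sort by timestamp so late-entered notes still appear in order.
        ordered = sorted(self.entries, key=lambda entry: entry[0])
        return "\n".join(f"{at:%H:%M:%S}Z  {text}" for at, text in ordered)


# Hypothetical entries with fixed timestamps so the output is reproducible.
timeline = Timeline()
timeline.note("IC asks database SME to check replication",
              at=datetime(2017, 1, 1, 2, 7, 30, tzinfo=timezone.utc))
timeline.note("Major incident declared",
              at=datetime(2017, 1, 1, 2, 5, 0, tzinfo=timezone.utc))

rendered = timeline.render()
print(rendered)
```

Keeping the notes timestamped and ordered gives the post-mortem a ready-made sequence of events.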

Slide 36

Slide 36 text

Role: Customer Liaison

Slide 37

Slide 37 text

Notify customers about the incident

Slide 38

Slide 38 text

Keep the Incident Commander apprised of any relevant customer information

Slide 39

Slide 39 text

Transfer of command if necessary

Slide 40

Slide 40 text

Blameless post-mortem

Slide 41

Slide 41 text

You can’t fire your way to reliability

Slide 42

Slide 42 text

Review

Slide 43

Slide 43 text

happens, prepare for it

Slide 44

Slide 44 text

Develop on-call empathy

Slide 45

Slide 45 text

People are your most valuable asset. Don’t burn them out doing something that can be automated.

Slide 46

Slide 46 text

aishrajdahal

Slide 47

Slide 47 text

Appendix
• The Big Red Button from Flickr by włodi (CC SA)
• Upside down from Flickr by Akimasa Harada (CC SA)
• Geography from Flickr (CC SA)
• Gene Kranz’s image in the public domain
• That’s all folks image in the public domain

Slide 48

Slide 48 text

References
• Cook, Richard I. "How complex systems fail." Cognitive Technologies Laboratory, University of Chicago. Chicago IL (1998).
• Incident Response, PagerDuty Incident Response Docs, https://response.pagerduty.com/
• Allspaw, John. "Fault injection in production." Communications of the ACM 55.10 (2012): 48-52.
• Krishnan, Kripa. "Weathering the Unexpected." Communications of the ACM 55.11 (2012): 48-52.
• Limoncelli, Tom, et al. "Resilience engineering: learning to embrace failure." Communications of the ACM 55.11 (2012): 40-47.