Slide 1

Slide 1 text

INCIDENT RESPONSE DONE RIGHT: FROM FIRST PAGE TO POSTMORTEM

Slide 2

Slide 2 text

WILL FARRINGTON
@wfarr on the Internet
Ops @ GitHub, 2012-now
Ops @ Rails Machine, 2009-2011

Slide 3

Slide 3 text

LET’S TALK ABOUT INCIDENT RESPONSE

Slide 4

Slide 4 text

INCIDENT: EVENT, NOTIFICATION, IDENTIFICATION, RESOLUTION, POSTMORTEM

Slide 5

Slide 5 text

INCIDENT: EVENT, NOTIFICATION, IDENTIFICATION, RESOLUTION, POSTMORTEM

Slide 6

Slide 6 text

“The recipe to terrible software is simple: just add software.” – ME

Slide 7

Slide 7 text

All software is terrible.

Slide 8

Slide 8 text

All software breaks.

Slide 9

Slide 9 text

INCIDENT: EVENT, NOTIFICATION, IDENTIFICATION, RESOLUTION, POSTMORTEM

Slide 10

Slide 10 text

“Writing your own calendaring and alerting application is a terrible idea.” – ME

Slide 11

Slide 11 text

Use PagerDuty, OpsGenie, smoke signals, or a carrier pigeon.

Slide 12

Slide 12 text

INCIDENT: EVENT, NOTIFICATION, IDENTIFICATION, RESOLUTION, POSTMORTEM

Slide 13

Slide 13 text

WHAT YOU REALLY WANT TO KNOW IS: WHAT IS THE PROBLEM?

Slide 14

Slide 14 text

Diagram: INPUT, APP, OUTPUT

Slide 15

Slide 15 text

Diagram: INPUT, OUTPUT, LB, APP, AUTH, DB, CACHE, APIS

Slide 16

Slide 16 text

Diagram: INPUT, OUTPUT

Slide 17

Slide 17 text

Diagram: INPUT, OUTPUT, ARCHITECTURE

Slide 18

Slide 18 text

HOW WOULD YOU DESCRIBE YOUR PROCESS?

Slide 19

Slide 19 text

“Methodical” and “organized” aren’t often the first words that come to mind.

Slide 20

Slide 20 text

Unfortunately, trying to resolve a problem before identifying what it is usually does more harm than good.

Slide 21

Slide 21 text

A process more refined than guesswork is required.

Slide 22

Slide 22 text

I RECOMMEND THIS BOOK ABOUT CHECKLISTS

Slide 23

Slide 23 text

“It is common to misconceive how checklists function in complex lines of work. They are not comprehensive how-to guides, whether for building a skyscraper or getting a plane out of trouble. They are quick and simple tools aimed to buttress the skills of expert professionals.” – ATUL GAWANDE

Slide 24

Slide 24 text

A GOOD CHECKLIST
• PRECISE
• EFFICIENT
• CONCISE
• PRACTICAL
• EASY TO USE

Slide 25

Slide 25 text

ENGINE FAILURE DURING FLIGHT
• Airspeed: 68 KIAS
• Fuel Shutoff Valve: ON (IN)
• Fuel Selector: BOTH
• Auxiliary Fuel Pump: ON
• Mixture: RICH
• Ignition Switch: BOTH
FLY THE AIRPLANE!

Slide 26

Slide 26 text

Checklists help you eliminate the “obvious” from your mind so you can focus on the hard stuff.

Slide 27

Slide 27 text

Checklists transform the process of identifying problems in a rapidly degrading situation from being haphazard and error-prone to methodical and organized.

Slide 28

Slide 28 text

Occam’s Razor as a Service

Slide 29

Slide 29 text

INCIDENT: EVENT, NOTIFICATION, IDENTIFICATION, RESOLUTION, POSTMORTEM

Slide 30

Slide 30 text

“The first step, in anything, is giving a shit.” – ME

Slide 31

Slide 31 text

“‘That’s not my problem’ is possibly the worst thing people can think.” – ATUL GAWANDE

Slide 32

Slide 32 text

It’s actually the single worst thing anyone can think.

Slide 33

Slide 33 text

IT’S TIME TO FIX THE PROBLEM

Slide 34

Slide 34 text

Do you have a checklist for that?

Slide 35

Slide 35 text

I RECOMMEND THIS BOOK ABOUT CHECKLISTS (AGAIN)

Slide 36

Slide 36 text

ENGINE FAILURE DURING FLIGHT
• Airspeed: 68 KIAS
• Fuel Shutoff Valve: ON (IN)
• Fuel Selector: BOTH
• Auxiliary Fuel Pump: ON
• Mixture: RICH
• Ignition Switch: BOTH
FLY THE AIRPLANE!

Slide 37

Slide 37 text

Let’s say Elasticsearch is split-brained.

Slide 38

Slide 38 text

You should immediately reach for the checklist.

Slide 39

Slide 39 text

ELASTICSEARCH: SPLIT BRAIN
• circuit break search OFF
• disable allocation
• get cluster state
• shutdown all nodes w/ API
• start the cluster
• wait for all members
• enable allocation
UPDATE THE STATUS!
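One way to keep a checklist like this honest is to store it as data and confirm steps strictly in order. The sketch below is only an illustration: the `Checklist` class and its methods are hypothetical names, the steps are copied from the slide, and nothing here talks to Elasticsearch. Each step is still carried out by a human, who then confirms it.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Checklist:
    """An ordered list of short steps, confirmed one at a time (hypothetical helper)."""
    title: str
    steps: List[str]
    done: List[str] = field(default_factory=list)

    def next_step(self) -> Optional[str]:
        """Return the first unconfirmed step, or None when the list is finished."""
        if len(self.done) < len(self.steps):
            return self.steps[len(self.done)]
        return None

    def confirm(self, step: str) -> None:
        """Mark a step as done; refuse to skip ahead or go out of order."""
        expected = self.next_step()
        if step != expected:
            raise ValueError(f"expected {expected!r}, got {step!r}")
        self.done.append(step)


# Steps taken from the slide; the status reminder is kept as the last step.
split_brain = Checklist(
    title="ELASTICSEARCH: SPLIT BRAIN",
    steps=[
        "circuit break search OFF",
        "disable allocation",
        "get cluster state",
        "shutdown all nodes w/ API",
        "start the cluster",
        "wait for all members",
        "enable allocation",
        "update the status",
    ],
)
```

Calling `split_brain.next_step()` tells the on-call engineer what to do next, and `confirm` raises if someone tries to jump ahead, which is the point of having the list in the first place.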

Slide 40

Slide 40 text

Communicate synchronously.

Slide 41

Slide 41 text

MTTR is the name of the game. Reduce it safely, by whatever means.
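To reduce MTTR you first have to measure it. The sketch below assumes MTTR means mean time to resolve, measured from the first page to resolution; the slide does not define it, and the `mttr` function and the sample incidents are made up for illustration.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def mttr(incidents: List[Tuple[datetime, datetime]]) -> timedelta:
    """Mean of (resolved - paged) over a list of (paged, resolved) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((resolved - paged for paged, resolved in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incidents: one took 30 minutes to resolve, the other 90 minutes.
incidents = [
    (datetime(2014, 1, 6, 10, 0), datetime(2014, 1, 6, 10, 30)),
    (datetime(2014, 1, 9, 14, 0), datetime(2014, 1, 9, 15, 30)),
]

print(mttr(incidents))  # mean of 30 and 90 minutes
```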

Slide 42

Slide 42 text

Delegate

Slide 43

Slide 43 text

On-call engineer, Incident Commander, Communicator

Slide 44

Slide 44 text

Take 30s at the start of the hangout to make sure everyone knows who’s doing what. Make sure you say what your role is.

Slide 45

Slide 45 text

Atul Gawande found that the simple act of a surgical team introducing themselves to one another before an operation increased the feeling of teamwork and efficacy across the team. It also enabled people to speak up when they saw something.

Slide 46

Slide 46 text

Communicate to the customer.

Slide 47

Slide 47 text

Do it often! Every 15-20 minutes should be the upper bound.
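A small check can nag the Communicator when an update is due. This is a sketch under assumptions: the 20-minute default is the top of the range on the slide, and the function names and timestamps are hypothetical.

```python
from datetime import datetime, timedelta

# Upper bound between customer updates, taken from the 15-20 minute range above.
MAX_GAP = timedelta(minutes=20)


def next_update_due(last_update: datetime, max_gap: timedelta = MAX_GAP) -> datetime:
    """Latest time by which the next customer-facing update should go out."""
    return last_update + max_gap


def is_overdue(last_update: datetime, now: datetime, max_gap: timedelta = MAX_GAP) -> bool:
    """True when more than max_gap has passed since the last update."""
    return now > next_update_due(last_update, max_gap)


# Hypothetical timestamps for illustration.
last = datetime(2014, 1, 6, 10, 0)
print(is_overdue(last, datetime(2014, 1, 6, 10, 15)))  # 15 minutes since last update
print(is_overdue(last, datetime(2014, 1, 6, 10, 25)))  # 25 minutes since last update
```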

Slide 48

Slide 48 text

Terrible things happen and if you don’t communicate to your customers, they’ll assume the worst.

Slide 49

Slide 49 text

INCIDENT: EVENT, NOTIFICATION, IDENTIFICATION, RESOLUTION, POSTMORTEM

Slide 50

Slide 50 text

WE’VE FIXED THE PROBLEM. NOW LET’S TALK ABOUT IT

Slide 51

Slide 51 text

“Regular postmortems are the closest thing you have to employing a scientific method to the complicated problem of web operations. By gathering real evidence, you can focus your limited resources on solving the issues that are actually causing you problems.” – JESSE ROBBINS

Slide 52

Slide 52 text

A GOOD POSTMORTEM
• DESCRIPTION OF THE INCIDENT
• DESCRIPTION OF THE ROOT CAUSE
• DESCRIPTION OF THE RESOLUTION PROCESS
• TIMELINE OF THE INCIDENT
• HOW THE INCIDENT AFFECTED CUSTOMERS
• REMEDIATIONS OR CORRECTIVE ACTIONS
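The six parts above can double as a template that flags what a draft is still missing. This is a minimal sketch: the `Postmortem` class and its field names are hypothetical, and the draft text is invented for illustration.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class Postmortem:
    """One field per part of a good postmortem, in the order listed above."""
    incident_description: str = ""
    root_cause: str = ""
    resolution_process: str = ""
    timeline: str = ""
    customer_impact: str = ""
    corrective_actions: str = ""

    def missing_sections(self) -> List[str]:
        """Names of the sections that are still empty or whitespace-only."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Hypothetical draft with only two sections written so far.
draft = Postmortem(
    incident_description="Search was unavailable.",
    timeline="10:00 first page; 10:30 resolved.",
)
print(draft.missing_sections())
```

A draft is not ready to circulate until `missing_sections()` returns an empty list.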

Slide 53

Slide 53 text

A GOOD POSTMORTEM
• DESCRIPTION OF THE INCIDENT
• DESCRIPTION OF THE ROOT CAUSE
• DESCRIPTION OF THE RESOLUTION PROCESS
• TIMELINE OF THE INCIDENT
• HOW THE INCIDENT AFFECTED CUSTOMERS
• REMEDIATIONS OR CORRECTIVE ACTIONS

Slide 54

Slide 54 text

A GOOD POSTMORTEM REQUIRES TRUST AND HONESTY

Slide 55

Slide 55 text

No content

Slide 56

Slide 56 text

Blame and punitive measures cannot enter the realm of possibility. Otherwise, you create a conflict of interest about honesty.

Slide 57

Slide 57 text

I RECOMMEND THIS BOOK ABOUT HUMAN ERROR

Slide 58

Slide 58 text

“Different perspectives on a sequence of events: Looking from the outside and hindsight you have knowledge of the outcome and dangers involved. From the inside, you may have neither.” – SIDNEY DEKKER

Slide 59

Slide 59 text

Let’s entertain the thought that we don’t hire mindless automatons. We hire people who can and do think, and who care.

Slide 60

Slide 60 text

Faced with a complex problem in a high-pressure scenario, with a process ill-equipped to effectively help them navigate the situation, their actions were entirely logical and yet doomed to fail.

Slide 61

Slide 61 text

The most important thing is having all the facts.

Slide 62

Slide 62 text

If facts are altered or missing, you cannot effectively remediate.

Slide 63

Slide 63 text

A GOOD POSTMORTEM
• DESCRIPTION OF THE INCIDENT
• DESCRIPTION OF THE ROOT CAUSE
• DESCRIPTION OF THE RESOLUTION PROCESS
• TIMELINE OF THE INCIDENT
• HOW THE INCIDENT AFFECTED CUSTOMERS
• REMEDIATIONS OR CORRECTIVE ACTIONS

Slide 64

Slide 64 text

THE MOST COMMON PROBLEM IS SETTING YOURSELF UP FOR FAILURE

Slide 65

Slide 65 text

Your corrective actions should be aimed at figuring out how your process made the failure possible, and fixing the process.

Slide 66

Slide 66 text

More training and trying harder are never the right answer.

Slide 67

Slide 67 text

PUBLIC POSTMORTEMS

Slide 68

Slide 68 text

Apologize first. Mean it.

Slide 69

Slide 69 text

Own your availability.

Slide 70

Slide 70 text

Own your security.

Slide 71

Slide 71 text

Own your mistakes.

Slide 72

Slide 72 text

Own your ignorance.

Slide 73

Slide 73 text

Know your audience.

Slide 74

Slide 74 text

Don’t bullshit. Ever.

Slide 75

Slide 75 text

I RECOMMEND THIS BLOG POST ABOUT BULLSHIT

Slide 76

Slide 76 text

“The most important part of saying you’re sorry is to project some real empathy. If you can’t put yourself in your users’ shoes, then it’s going to come out wrong.” – DAVID HEINEMEIER HANSSON

Slide 77

Slide 77 text

Post it relatively soon.

Slide 78

Slide 78 text

THE BIG SECRET

Slide 79

Slide 79 text

Nobody does this perfectly. Definitely not us.

Slide 80

Slide 80 text

The point is to get better at it.

Slide 81

Slide 81 text

THANKS