Slide 1

Slide 1 text

STAYING ALIVE PATTERNS FOR FAILURE MANAGEMENT FROM THE BOTTOM OF THE OCEAN 1 — devopsdays MSP 2018 @rondoftw

Slide 2

Slide 2 text

Ronnie Chen @rondoftw

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

TECHNICAL DIVING
▸ longer dive times
▸ deeper dives
▸ overhead ceiling
▸ decompression obligations
▸ more gear. a lot more.
▸ higher pressure
▸ more risks

Slide 10

Slide 10 text

RISKS MAY INCLUDE...
1. hypoxia
2. hyperoxia
3. nitrogen narcosis
4. carbon dioxide buildup
5. oxygen sensor failure
6. deep tissue isobaric counterdiffusion (ICD)
7. high pressure nervous syndrome (HPNS)
8. exhausting your carbon dioxide scrubber
9. carbon dioxide channeling from a poorly packed scrubber
10. carbon buildup causing a spark leading to an oxygen fire. underwater.
11. flooding of breathing loop or circuitry
12. water mixing with the scrubbing agent to produce a toxic caustic soda that will give you chemical burns on your mouth, airway, and lungs
13. plain old decompression sickness

Slide 11

Slide 11 text

"If you own a rebreather for five years, two percent of you are going to die on it."
(Jill Heinerth, underwater explorer)

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

HOLD UP!

Slide 16

Slide 16 text

THIS IS A TALK ABOUT COMMUNICATION AND PROCESS ? ? ?

Slide 17

Slide 17 text

(it was a trap)

Slide 18

Slide 18 text

YOU CAME TO HEAR COOL STORIES...

Slide 19

Slide 19 text

BUT YOU'RE GETTING A MEANDERING MEDITATION ON BEST PRACTICES* WHEN DEALING WITH COMPLEX SYSTEMS INSTEAD
* These guidelines have only been shown to work for life or death situations under the ocean. They have not been proven to work for tech.

Slide 20

Slide 20 text

How failures really happen

Slide 21

Slide 21 text

Complex systems are designed to protect against simple failures.

Slide 22

Slide 22 text

But accidents still happen.

Slide 23

Slide 23 text

CATASTROPHES ARE CAUSED BY A FAILURE CASCADE
▸ you have a rebreather malfunction
▸ which you would have caught if you were testing your equipment on a regular basis
▸ your backup tank had a leak and is running low and that wasn't caught either
▸ and your buddy is too far away and isn't checking in with you
▸ and your dive light that you use to communicate at a distance is out of power
▸ and in the excitement you kick up silt and the visibility drops
▸ and in your panic your air consumption goes up and then you breathe through the last of the air in your tank
▸ so you swim for the surface even though you have a decompression obligation

Slide 24

Slide 24 text

A post-mortem that blames this incident on a simple mechanical malfunction would only cover 12.5% (one of the eight links in the cascade) of the issues that led up to this accident.

Slide 25

Slide 25 text

Complex system failures don't happen because a single part of the system fails. They happen because all the safety procedures that are supposed to stop the failure from cascading didn't work.

Slide 26

Slide 26 text

CORE RULES OF SAFETY SYSTEMS
1. An unused safety system doesn't exist.

Slide 27

Slide 27 text

NORMALIZATION OF DEVIANCE
"That natural human tendency, particularly in pressure circumstances, to take a safety shortcut."
(Colonel Mike Mullane, astronaut)

Slide 28

Slide 28 text

FALSE FEEDBACK: the absence of something bad happening means that it was safe
ADAPTATION: experience is no longer a suitable gauge of risk
SOCIAL PRESSURE: this is just how we do things

Slide 29

Slide 29 text

CORE RULES OF SAFETY SYSTEMS
2. An untested safety system doesn't exist either!

Slide 30

Slide 30 text

CORE RULES OF SAFETY SYSTEMS
3. Unused or untested safety systems are more dangerous than not having one at all. Therefore, safety systems must be tested at regular intervals.

Slide 31

Slide 31 text

The length of this interval should be determined not only by how likely it is for this system to fail but also how great the impact will be if it does.

Slide 32

Slide 32 text

A QUICK SIDENOTE ON ASSESSING RISK
▸ Make assessments based on likelihood of occurrence.
▸ Make assessments based on magnitude of regret.
If you are only evaluating risk based on the chance of it happening, you must be prepared to experience the corresponding level of regret if it does.
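A small Python sketch of the two-axis assessment, offered as an illustration only. The `Risk` class, the thresholds, and every score in the sample list are invented for the example; none of them come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    likelihood: float  # hypothetical 0..1 chance of it happening
    regret: float      # hypothetical 0..1 magnitude of regret if it does


def needs_a_plan(risk: Risk, likelihood_threshold: float = 0.3,
                 regret_threshold: float = 0.7) -> bool:
    """A risk gets a plan if it is likely OR if the regret would be high."""
    return (risk.likelihood >= likelihood_threshold
            or risk.regret >= regret_threshold)


# Made-up scores, purely to show the difference between the two approaches.
risks = [
    Risk("dive light battery dies", likelihood=0.4, regret=0.2),
    Risk("oxygen sensor failure", likelihood=0.05, regret=1.0),
    Risk("fogged mask", likelihood=0.1, regret=0.05),
]

planned = [r.name for r in risks if needs_a_plan(r)]
likelihood_only = [r.name for r in risks if r.likelihood >= 0.3]
```

With these invented numbers, filtering on likelihood alone drops the oxygen sensor failure, which is exactly the unlikely, high-regret case the slide warns about.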

Slide 33

Slide 33 text

failures will happen

Slide 34

Slide 34 text

WHAT IS SAFETY?

Slide 35

Slide 35 text

FAILURE MANAGEMENT
▸ A framework for resiliency
▸ The training and judgment to use it

Slide 36

Slide 36 text

FAILURE MANAGEMENT FOR SYSTEMS
▸ Have redundancy for systems that you cannot survive without.
▸ Have a redundant pathway to success: a procedure for graceful degradation for systems that are important but not critical.
▸ Have a process for changing over from primary to redundant systems.
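Translated loosely into software, the three points above might look like the following Python sketch: try the primary, change over to each redundant source in turn, and fall back to a degraded answer if one is allowed. The helper `first_working` and the example functions are hypothetical names chosen for illustration, not an API from any library.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def first_working(sources: Sequence[Callable[[], T]],
                  degraded: Optional[Callable[[], T]] = None) -> T:
    """Try the primary, then each redundant source, then a degraded fallback."""
    errors = []
    for source in sources:
        try:
            return source()
        except Exception as exc:
            # Record the failure and change over to the next source.
            errors.append(exc)
    if degraded is not None:
        # Important but not critical: return a reduced answer instead of failing.
        return degraded()
    # Critical and out of redundancy: fail loudly with everything we saw.
    raise RuntimeError(f"all {len(sources)} sources failed: {errors}")


def primary() -> str:
    raise ConnectionError("primary is down")


def backup() -> str:
    return "fresh data from backup"


def cached() -> str:
    return "stale data from cache"


result = first_working([primary, backup], degraded=cached)
```

Keeping the changeover in one small function means the switch itself can be exercised in a drill, rather than being discovered during an incident.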

Slide 37

Slide 37 text

FAILURE MANAGEMENT FOR SYSTEMS (CONT)
▸ Keep failures contained so that they don't bring down other systems
▸ Make it easy to do the right thing and hard to do the dangerous things
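For the containment point, one common software technique is a circuit breaker. The talk does not name it, so treat the Python below as one possible reading and a minimal sketch: the class, its parameters, and the default thresholds are all assumptions.

```python
import time


class CircuitBreaker:
    """Stop calling a failing dependency so its failures stay contained."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock              # injectable so drills can fake time
        self.failures = 0
        self.opened_at = None           # None means closed: calls are allowed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: dependency is being skipped")
            # Cool-down has elapsed: allow one trial call, and reopen
            # immediately if that call fails.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result
```

Callers wrap each call to the risky dependency in `breaker.call(...)`. Once the breaker opens, they get a fast, explicit error instead of piling more load onto something that is already failing.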

Slide 38

Slide 38 text

FAILURE MANAGEMENT FOR HUMAN SYSTEMS

Slide 39

Slide 39 text

TRAIN FOR PRESSURE

Slide 40

Slide 40 text

TRAINING: INEXPERIENCED PEOPLE TO THE FRONT
▸ Most inexperienced person leads
▸ Experienced person advises, only intervening when necessary
▸ Team is invested in personal success to ensure mission success

Slide 41

Slide 41 text

TRAINING: INEXPERIENCED PEOPLE TO THE FRONT (CONT)
▸ One of the best ways to equalize a gap in experience
▸ Opportunity to revise and improve problematic systems
▸ Help build good judgment

Slide 42

Slide 42 text

GOOD JUDGMENT
Good judgment enables the reshaping of rules and frameworks to adapt to a changing environment.

Slide 43

Slide 43 text

REFINING JUDGMENT
▸ Post-Mortems
▸ Pre-Mortems
▸ Fire Drills
▸ Revisit Past Decisions

Slide 44

Slide 44 text

POST-MORTEMS
▸ Look at the safety procedures that failed to stop the cascade
▸ Look for opportunities to create new safety systems at critical points

Slide 45

Slide 45 text

PRE-MORTEMS
▸ Don't wait for failures to build safety frameworks
▸ Identify potential avenues of failure and make plans for them
▸ Include both likely failures and high regret failures

Slide 46

Slide 46 text

FIRE DRILLS
▸ Vet your plans and safety systems
▸ Perform targeted training
▸ Evaluate effectiveness of tools and documentation

Slide 47

Slide 47 text

REVISIT PAST DECISIONS
▸ Examine successful operations to see what key insights were helpful
▸ Identify any dependency on luck in previous projects
▸ Share rationale for decisions

Slide 48

Slide 48 text

RECOGNIZING SUCCESS

Slide 49

Slide 49 text

I WANT TO LEARN MORE!
1. Richard I. Cook - How Complex Systems Fail
2. Astronaut Mike Mullane - https://www.youtube.com/watch?v=Ljzj9Msli5o
3. Steve Lewis aka decodoppler (Technical Diving Instructor) - Staying Alive
4. Sidney Dekker - Drift into Failure
5. Diane Vaughan - The Challenger Launch Decision

Slide 50

Slide 50 text

No content

Slide 51

Slide 51 text

Questions? @rondoftw