Slide 1

Slide 1 text

LEARNING IN PRODUCTION
or why the Apollo 11 landing nearly failed
Michiel Rook
@michieltcs

Slide 2

Slide 2 text

1969

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

IT ALMOST DIDN'T HAPPEN

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

1970

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

2020

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

No content

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text

"SpaceX provided audio recordings from the Crew Dragon’s first orbital test flight to help prepare Hurley and Behnken for the ride during launch and re-entry."

Slide 18

Slide 18 text

KEY TAKEAWAYS:

Slide 19

Slide 19 text

TESTING

Slide 20

Slide 20 text

EXPERIMENTATION

Slide 21

Slide 21 text

SIMULATION

Slide 22

Slide 22 text

TRAINING

Slide 23

Slide 23 text

ADAPTATION

Slide 24

Slide 24 text

"BUT THAT IS ROCKET SCIENCE!"

Slide 25

Slide 25 text

"THAT WOULDN'T WORK HERE"

Slide 26

Slide 26 text

"WE DON'T HAVE THE BUDGET"

Slide 27

Slide 27 text

"WE DON'T HAVE THE PEOPLE"

Slide 28

Slide 28 text

"WE DON'T HAVE THE TIME"

Slide 29

Slide 29 text

WHAT CAN WE LEARN FROM SPACE?

Slide 30

Slide 30 text

WE ARE BUILDING

Slide 31

Slide 31 text

COMPLEX DISTRIBUTED SYSTEMS

Slide 32

Slide 32 text

"NON-LINEAR"

Slide 33

Slide 33 text

"HARD TO REASON ABOUT"

Slide 34

Slide 34 text

"NO SINGLE PERSON CAN UNDERSTAND THE SYSTEM"

Slide 35

Slide 35 text

"MODEL DOES NOT MATCH REALITY"

Slide 36

Slide 36 text

"SURPRISING FAILURE MODES"

Slide 37

Slide 37 text

"OKAY, BUT WE CAN BUILD SIMPLE THINGS"

Slide 38

Slide 38 text

"WE SHOULD JUST PLAN BETTER"

Slide 39

Slide 39 text

"WE SHOULD JUST BE MORE CAREFUL"

Slide 40

Slide 40 text

"WE SHOULD JUST NOT MAKE MISTAKES"

Slide 41

Slide 41 text

DAVID WOODS
HTTPS://YOUTU.BE/GNVXFGC-5JW

Slide 42

Slide 42 text

"OKAY, BUT WHAT IF WE JUST TEST MORE"

Slide 43

Slide 43 text

UNIT TESTS
INTEGRATION / CONTRACT TESTS
UI / E2E / VISUAL TESTS
COST / SPEED

Slide 44

Slide 44 text

"Testing shows the presence, not absence, of bugs." EDSGER W. DIJKSTRA

Slide 45

Slide 45 text

No content

Slide 46

Slide 46 text

No content

Slide 47

Slide 47 text

"OKAY, BUT WHAT IF WE STOP CHANGE"

Slide 48

Slide 48 text

"... incidents resulting from change is one of the most effective metrics .... It isn’t a measure of system failures; it’s a measure of departmental failures."

Slide 49

Slide 49 text

"... incidents resulting from change is one of the most effective metrics .... It isn’t a measure of system failures; it’s a measure of departmental failures."

Slide 50

Slide 50 text

"Every week of delay between having an idea and launching it to customers can mean millions of dollars lost in opportunity costs. IT matters." STEVE SMITH

Slide 51

Slide 51 text

No content

Slide 52

Slide 52 text

DEALING WITH THE UNKNOWN

Slide 53

Slide 53 text

BUILD YOUR ADAPTIVE CAPACITY

Slide 54

Slide 54 text

No content

Slide 55

Slide 55 text

No content

Slide 56

Slide 56 text

‣ An adaptive architecture
‣ Incremental deployments
‣ Automated provisioning
‣ Ubiquitous telemetry
‣ Chaos Engineering
‣ You Build It You Run It
‣ Post-incident reviews
HTTPS://WWW.STEVESMITH.TECH/BLOG/BUILD-OPERABILITY-IN/

Slide 57

Slide 57 text

‣ An adaptive architecture
‣ Incremental deployments
‣ Automated provisioning
‣ Ubiquitous telemetry
‣ Chaos Engineering
‣ You Build It You Run It
‣ Post-incident reviews
auto scaling, circuit breakers, health checks
HTTPS://WWW.STEVESMITH.TECH/BLOG/BUILD-OPERABILITY-IN/
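
Of the techniques named on this slide, the circuit breaker is the easiest to show in a few lines. What follows is a minimal sketch, not taken from the talk: the class name `CircuitBreaker`, the `max_failures` and `reset_after` parameters, and the injectable `clock` are all assumptions made for illustration. After a number of consecutive failures the breaker "opens" and rejects calls immediately, then allows a single trial call once a cool-down has passed.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Hypothetical, single-threaded circuit breaker for illustration only."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable so tests can control time
        self.failures = 0
        self.opened_at = None             # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Still cooling down: fail fast instead of hitting the dependency.
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures in a row, (re)opens the breaker.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the breaker and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each outgoing request, for example `breaker.call(fetch_profile, user_id)` with a hypothetical `fetch_profile`, and treat `CircuitOpenError` as a signal to serve a fallback.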

Slide 58

Slide 58 text

‣ An adaptive architecture
‣ Incremental deployments
‣ Automated provisioning
‣ Ubiquitous telemetry
‣ Chaos Engineering
‣ You Build It You Run It
‣ Post-incident reviews
frequent deploys, blue/green, canary, rolling, rollbacks
HTTPS://WWW.STEVESMITH.TECH/BLOG/BUILD-OPERABILITY-IN/

Slide 59

Slide 59 text

‣ An adaptive architecture
‣ Incremental deployments
‣ Automated provisioning
‣ Ubiquitous telemetry
‣ Chaos Engineering
‣ You Build It You Run It
‣ Post-incident reviews
terraform, ansible, packer, etc.
HTTPS://WWW.STEVESMITH.TECH/BLOG/BUILD-OPERABILITY-IN/

Slide 60

Slide 60 text

‣ An adaptive architecture
‣ Incremental deployments
‣ Automated provisioning
‣ Ubiquitous telemetry
‣ Chaos Engineering
‣ You Build It You Run It
‣ Post-incident reviews
observability
HTTPS://WWW.STEVESMITH.TECH/BLOG/BUILD-OPERABILITY-IN/

Slide 61

Slide 61 text

‣ An adaptive architecture
‣ Incremental deployments
‣ Automated provisioning
‣ Ubiquitous telemetry
‣ Chaos Engineering
‣ You Build It You Run It
‣ Post-incident reviews
you OWN it you run it
HTTPS://WWW.STEVESMITH.TECH/BLOG/BUILD-OPERABILITY-IN/

Slide 62

Slide 62 text

‣ An adaptive architecture
‣ Incremental deployments
‣ Automated provisioning
‣ Ubiquitous telemetry
‣ Chaos Engineering
‣ You Build It You Run It
‣ Post-incident reviews
blameless postmortems, knowledge sharing, learning
HTTPS://WWW.STEVESMITH.TECH/BLOG/BUILD-OPERABILITY-IN/

Slide 63

Slide 63 text

INTEGRATE EARLY

Slide 64

Slide 64 text

INTEGRATE OFTEN

Slide 65

Slide 65 text

MAKE THINGS SMALL

Slide 66

Slide 66 text

BIG STEPS

Slide 67

Slide 67 text

FAIL BIG

Slide 68

Slide 68 text

SMALL STEPS

Slide 69

Slide 69 text

FAIL SMALL

Slide 70

Slide 70 text

$ = REALIZED VALUE
CREDITS TO @FGOULDING

Slide 71

Slide 71 text

Accelerate: State of DevOps 2019 | How Do We Compare?
ELITE PERFORMERS
Comparing the elite group against the low performers, we find that elite performers have…
Throughput
‣ 208 TIMES MORE frequent code deployments
‣ 106 TIMES FASTER lead time from commit to deploy
Stability
‣ 2,604 TIMES FASTER time to recover from incidents
‣ 7 TIMES LOWER change failure rate (changes are 1/7 as likely to fail)
Source: 2019 State Of DevOps report

Slide 72

Slide 72 text

OBSERVABILITY AND OPERABILITY

Slide 73

Slide 73 text

No content

Slide 74

Slide 74 text

"a measure of how well internal states of a system can be inferred from knowledge of its external outputs."
HTTPS://EN.WIKIPEDIA.ORG/WIKI/OBSERVABILITY

Slide 75

Slide 75 text

source: laredoute.io

Slide 76

Slide 76 text

"the properties of a system which make it work well in production"
HTTPS://CONFLUXDIGITAL.NET/WHAT-IS-OPERABILITY

Slide 77

Slide 77 text

"You cannot inspect quality into a product." HAROLD S. DODGE

Slide 78

Slide 78 text

No content

Slide 79

Slide 79 text

FEEDBACK LOOPS

Slide 80

Slide 80 text

No content

Slide 81

Slide 81 text

No content

Slide 82

Slide 82 text

EXPECT FAILURE

Slide 83

Slide 83 text

EMBRACE FAILURE

Slide 84

Slide 84 text

INDUCE FAILURE

Slide 85

Slide 85 text

CHAOS ENGINEERING

Slide 86

Slide 86 text

"the facilitation of experiments to uncover systemic weaknesses"

Slide 87

Slide 87 text

"the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production*"
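
The definition above can be made concrete with a toy experiment. This is a minimal sketch under stated assumptions, not something from the talk: the recommendations service, its fallback, the `flaky` fault injector, and the 99% threshold are all hypothetical. The shape is what matters: state a steady-state hypothesis, measure a baseline, inject a fault, and check whether the hypothesis still holds.

```python
import random


def fetch_recommendations(user_id, backend):
    """Hypothetical service call that degrades gracefully when the backend fails."""
    try:
        return backend(user_id)
    except ConnectionError:
        return ["bestsellers"]  # fallback: degraded but still a usable answer


def flaky(backend, failure_rate, rng):
    """Wrap a backend so that a fraction of calls fail (the injected fault)."""
    def wrapped(user_id):
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")
        return backend(user_id)
    return wrapped


def success_ratio(backend, requests=1000):
    """Steady-state metric: share of requests that return a non-empty answer."""
    ok = sum(1 for uid in range(requests) if fetch_recommendations(uid, backend))
    return ok / requests


def run_experiment(failure_rate=0.3, threshold=0.99, seed=42):
    """Hypothesis: at least `threshold` of requests succeed even under the fault."""
    def healthy(user_id):
        return [f"item-{user_id}"]

    baseline = success_ratio(healthy)
    under_fault = success_ratio(flaky(healthy, failure_rate, random.Random(seed)))
    return {
        "baseline": baseline,
        "under_fault": under_fault,
        "hypothesis_holds": under_fault >= threshold,
    }
```

Removing the `except ConnectionError` fallback from `fetch_recommendations` would make the same experiment fail with an unhandled error, which is the kind of weakness such an experiment is meant to surface before a real outage does.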

Slide 88

Slide 88 text

NOT (JUST) ABOUT BREAKING THINGS

Slide 89

Slide 89 text

No content

Slide 90

Slide 90 text

No content

Slide 91

Slide 91 text

NOT (JUST) ABOUT BREAKING PROD

Slide 92

Slide 92 text

START SMALL

Slide 93

Slide 93 text

TEST ACC PROD

Slide 94

Slide 94 text

HTTPS://WWW.YOUTUBE.COM/WATCH?V=NOOGKNBW0GK

Slide 95

Slide 95 text

INCIDENT RESPONSE

Slide 96

Slide 96 text

"Incidents are a fact of life. How well you respond is your choice." JIM SEVERINO

Slide 97

Slide 97 text

"Here's the secret: Incident analysis is not actually about the incident." NORA JONES

Slide 98

Slide 98 text

ROOT CAUSE ANALYSIS?

Slide 99

Slide 99 text

ROOT CAUSE ANALYSIS?

Slide 100

Slide 100 text

"What you call 'root cause' is simply the place where you stop looking any further." SIDNEY DEKKER

Slide 101

Slide 101 text

LEARNING CULTURE

Slide 102

Slide 102 text

BLAMELESS POSTMORTEMS

Slide 103

Slide 103 text

BLAME AWARE POSTMORTEMS

Slide 104

Slide 104 text

OPEN & HONEST

Slide 105

Slide 105 text

ACCOUNTABILITY

Slide 106

Slide 106 text

WHAT & HOW OVER WHO & WHY

Slide 107

Slide 107 text

COLLABORATION

Slide 108

Slide 108 text

HTTPS://VIMEO.COM/370008157

Slide 109

Slide 109 text

No content

Slide 110

Slide 110 text

IN SUMMARY

Slide 111

Slide 111 text

YOU CAN'T TEST EVERYTHING

Slide 112

Slide 112 text

YOU CAN'T PREPARE FOR EVERYTHING

Slide 113

Slide 113 text

YOU CAN LEARN

Slide 114

Slide 114 text

TO BE PREPARED

Slide 115

Slide 115 text

TO DEAL WITH ANYTHING

Slide 116

Slide 116 text

THANK YOU FOR LISTENING!
@michieltcs / [email protected]
www.michielrook.nl