Learning in production (or why Apollo 11 nearly failed)

LEARNING IN PRODUCTION or why the Apollo 11 landing nearly failed
Michiel Rook @michieltcs

IT ALMOST DIDN'T HAPPEN

"SpaceX provided audio recordings from the Crew Dragon’s first orbital test flight to help prepare Hurley and Behnken for the ride during launch and re-entry."

KEY TAKEAWAYS:

TESTING

EXPERIMENTATION

SIMULATION

TRAINING

ADAPTATION

"BUT THAT IS ROCKET SCIENCE!"

"THAT WOULDN'T WORK HERE"

"WE DON'T HAVE THE BUDGET"

"WE DON'T HAVE THE PEOPLE"

"WE DON'T HAVE THE TIME"

WHAT CAN WE LEARN FROM SPACE?

WE ARE BUILDING

COMPLEX DISTRIBUTED SYSTEMS

"NON-LINEAR"

"HARD TO REASON ABOUT"

"NO SINGLE PERSON CAN UNDERSTAND THE SYSTEM"

"MODEL DOES NOT MATCH REALITY"

"SURPRISING FAILURE MODES"

"OKAY, BUT WE CAN BUILD SIMPLE THINGS"

"WE SHOULD JUST PLAN BETTER"

"WE SHOULD JUST BE MORE CAREFUL"

"WE SHOULD JUST NOT MAKE MISTAKES"

David Woods
https://youtu.be/GNVXFGC-5JW

"OKAY, BUT WHAT IF WE JUST TEST MORE"

[Figure: test pyramid with layers for unit tests, integration / contract tests, and UI / E2E / visual tests, annotated with cost and speed]

"Testing shows the presence, not absence, of bugs." Edsger W. Dijkstra

"OKAY, BUT WHAT IF WE STOP CHANGE"

"... incidents resulting from change is one of the most effective metrics .... It isn’t a measure of system failures; it’s a measure of departmental failures."

"Every week of delay between having an idea and launching it to customers can mean millions of dollars lost in opportunity costs. IT matters." Steve Smith

DEALING WITH THE UNKNOWN

BUILD YOUR ADAPTIVE CAPACITY

‣ An adaptive architecture: auto scaling, circuit breakers, health checks
‣ Incremental deployments: frequent deploys, blue/green, canary, rolling, rollbacks
‣ Automated provisioning: terraform, ansible, packer, etc.
‣ Ubiquitous telemetry: observability
‣ Chaos Engineering
‣ You Build It You Run It: you OWN it, you run it
‣ Post-incident reviews: blameless postmortems, knowledge sharing, learning
https://www.stevesmith.tech/blog/build-operability-in/
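
To make one item from the list above concrete, here is a minimal sketch of a circuit breaker, one of the adaptive-architecture techniques named on the slide. It is not taken from the talk: the class name `CircuitBreaker`, the thresholds, and the injectable `clock` parameter are all assumptions chosen for illustration. After a number of consecutive failures the breaker "opens" and fails fast, and after a cool-down it lets a single trial call through.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses to call the protected function."""


class CircuitBreaker:
    """Illustrative circuit breaker (hypothetical names and defaults)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Still cooling down: fail fast without touching the dependency.
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or re-open after a failed trial.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each call to a remote dependency, for example `breaker.call(fetch_profile, user_id)` where `fetch_profile` is any function of your own, and treat `CircuitOpenError` as a signal to serve a fallback.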

INTEGRATE EARLY

INTEGRATE OFTEN

MAKE THINGS SMALL

BIG STEPS

FAIL BIG

SMALL STEPS

FAIL SMALL

$ = REALIZED VALUE
Credits to @fgoulding

Accelerate: State of DevOps 2019 | How Do We Compare?
ELITE PERFORMERS
Comparing the elite group against the low performers, we find that elite performers have…
Throughput:
‣ 208 TIMES MORE frequent code deployments
‣ 106 TIMES FASTER lead time from commit to deploy
Stability:
‣ 2,604 TIMES FASTER time to recover from incidents
‣ 7 TIMES LOWER change failure rate (changes are 1/7 as likely to fail)
Source: 2019 State Of DevOps report

OBSERVABILITY AND OPERABILITY

"a measure of how well internal states of a system can be inferred from knowledge of its external outputs."
https://en.wikipedia.org/wiki/Observability

source: laredoute.io

"the properties of a system which make it work well in production"
https://confluxdigital.net/what-is-operability

"You cannot inspect quality into a product." Harold S. Dodge

FEEDBACK LOOPS

EXPECT FAILURE

EMBRACE FAILURE

INDUCE FAILURE

CHAOS ENGINEERING

"the facilitation of experiments to uncover systemic weaknesses"

"the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production*"
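
The definitions above can be illustrated with a very small experiment, sketched here in Python. Nothing in it comes from the talk: `fetch_price`, `price_with_fallback`, `FlakyDependency` and `run_experiment` are hypothetical names, and the "system" is a single function call rather than a real distributed system. The shape is the point: state a steady-state hypothesis (callers see no errors), inject faults into a dependency, and measure whether the hypothesis still holds.

```python
import random


class FlakyDependency:
    """Wraps a function and makes a fraction of calls fail (fault injection)."""

    def __init__(self, func, failure_rate, rng):
        self.func = func
        self.failure_rate = failure_rate
        self.rng = rng

    def __call__(self, *args, **kwargs):
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected fault")
        return self.func(*args, **kwargs)


def fetch_price(product_id):
    # Hypothetical downstream call; stands in for a remote service.
    return 100 + product_id


def price_with_fallback(dependency, product_id):
    # The behaviour under test: degrade to "price unknown" instead of failing.
    try:
        return dependency(product_id)
    except ConnectionError:
        return None


def run_experiment(handler, failure_rate, requests=1000, seed=42):
    """Return the fraction of requests that surfaced an error to the caller."""
    dependency = FlakyDependency(fetch_price, failure_rate, random.Random(seed))
    errors = 0
    for product_id in range(requests):
        try:
            handler(dependency, product_id)
        except Exception:
            errors += 1
    return errors / requests


if __name__ == "__main__":
    # Steady-state hypothesis: callers see no errors even with 30% faults.
    print("error rate with fallback:", run_experiment(price_with_fallback, 0.3))
```

Running the same experiment against a handler without the fallback shows a non-zero error rate, which is the kind of weakness such an experiment is meant to surface before an incident does.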

NOT (JUST) ABOUT BREAKING THINGS

NOT (JUST) ABOUT BREAKING PROD

START SMALL

TEST ACC PROD

https://www.youtube.com/watch?v=NOOGKNBW0GK

INCIDENT RESPONSE

"Incidents are a fact of life. How well you respond is your choice." Jim Severino

"Here's the secret: Incident analysis is not actually about the incident." Nora Jones

ROOT CAUSE ANALYSIS?

"What you call 'root cause' is simply the place where you stop looking any further." Sidney Dekker

LEARNING CULTURE

BLAMELESS POSTMORTEMS

BLAME AWARE POSTMORTEMS

OPEN & HONEST

ACCOUNTABILITY

WHAT & HOW OVER WHO & WHY

COLLABORATION

https://vimeo.com/370008157

IN SUMMARY

YOU CAN'T TEST EVERYTHING

YOU CAN'T PREPARE FOR EVERYTHING

YOU CAN LEARN

TO BE PREPARED

TO DEAL WITH ANYTHING

THANK YOU FOR LISTENING! @michieltcs / [email protected] www.michielrook.nl
