Learning in production (or why Apollo 11 nearly failed)

Tests and monitoring help us assert the known knowns of our systems. But what about the known unknowns? Or, especially in complex distributed systems, the unknown unknowns? What can we learn from the space program, and from the Apollo 11 landing in particular? How can we prepare for the unknown and build our adaptive capacity?

Michiel Rook

March 16, 2021

Transcript

  1. LEARNING IN PRODUCTION, or why the Apollo 11 landing nearly failed. Michiel Rook @michieltcs
  2. @michieltcs "SpaceX provided audio recordings from the Crew Dragon’s first orbital test flight to help prepare Hurley and Behnken for the ride during launch and re-entry."
  3. @michieltcs DAVID WOODS https://youtu.be/GNVXFGC-5JW
  4. @michieltcs [Diagram labels: UNIT TESTS; INTEGRATION / CONTRACT TESTS; UI / E2E / VISUAL TESTS; axes: COST, SPEED]
  5. @michieltcs "... incidents resulting from change is one of the most effective metrics .... It isn’t a measure of system failures; it’s a measure of departmental failures."
  7. @michieltcs "Every week of delay between having an idea and launching it to customers can mean millions of dollars lost in opportunity costs. IT matters." STEVE SMITH
  8. @michieltcs ‣ An adaptive architecture ‣ Incremental deployments ‣ Automated provisioning ‣ Ubiquitous telemetry ‣ Chaos Engineering ‣ You Build It You Run It ‣ Post-incident reviews https://www.stevesmith.tech/blog/build-operability-in/
  9. @michieltcs ‣ An adaptive architecture (auto scaling, circuit breakers, health checks) ‣ Incremental deployments ‣ Automated provisioning ‣ Ubiquitous telemetry ‣ Chaos Engineering ‣ You Build It You Run It ‣ Post-incident reviews https://www.stevesmith.tech/blog/build-operability-in/
  10. @michieltcs ‣ An adaptive architecture ‣ Incremental deployments (frequent deploys, blue/green, canary, rolling, rollbacks) ‣ Automated provisioning ‣ Ubiquitous telemetry ‣ Chaos Engineering ‣ You Build It You Run It ‣ Post-incident reviews https://www.stevesmith.tech/blog/build-operability-in/
  11. @michieltcs ‣ An adaptive architecture ‣ Incremental deployments ‣ Automated provisioning (terraform, ansible, packer, etc.) ‣ Ubiquitous telemetry ‣ Chaos Engineering ‣ You Build It You Run It ‣ Post-incident reviews https://www.stevesmith.tech/blog/build-operability-in/
  12. @michieltcs ‣ An adaptive architecture ‣ Incremental deployments ‣ Automated provisioning ‣ Ubiquitous telemetry (observability) ‣ Chaos Engineering ‣ You Build It You Run It ‣ Post-incident reviews https://www.stevesmith.tech/blog/build-operability-in/
  13. @michieltcs ‣ An adaptive architecture ‣ Incremental deployments ‣ Automated provisioning ‣ Ubiquitous telemetry ‣ Chaos Engineering ‣ You Build It You Run It (you OWN it, you run it) ‣ Post-incident reviews https://www.stevesmith.tech/blog/build-operability-in/
  14. @michieltcs ‣ An adaptive architecture ‣ Incremental deployments ‣ Automated provisioning ‣ Ubiquitous telemetry ‣ Chaos Engineering ‣ You Build It You Run It ‣ Post-incident reviews (blameless postmortems, knowledge sharing, learning) https://www.stevesmith.tech/blog/build-operability-in/
  15. @michieltcs $ = REALIZED VALUE (credits to @fgoulding)
  16. Accelerate: State of DevOps 2019 | How Do We Compare? ELITE PERFORMERS: Comparing the elite group against the low performers, we find that elite performers have… Throughput: 208 TIMES MORE frequent code deployments; 106 TIMES FASTER lead time from commit to deploy. Stability: 2,604 TIMES FASTER time to recover from incidents; 7 TIMES LOWER change failure rate (changes are 1/7 as likely to fail). Source: 2019 State Of DevOps report
  17. @michieltcs "a measure of how well internal states of a system can be inferred from knowledge of its external outputs." https://en.wikipedia.org/wiki/Observability
  18. @michieltcs "the properties of a system which make it work well in production" https://confluxdigital.net/what-is-operability
  19. @michieltcs "the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production*"
  20. @michieltcs https://www.youtube.com/watch?v=NOOGKNBW0GK
  21. @michieltcs "Incidents are a fact of life. How well you respond is your choice." JIM SEVERINO
  22. @michieltcs "What you call 'root cause' is simply the place where you stop looking any further." SIDNEY DEKKER
  23. @michieltcs https://vimeo.com/370008157
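
Slide 5 names incidents resulting from change as a metric, and slide 16 reports change failure rate as one of the stability measures. The deck does not show how to compute it, so the following is only a minimal sketch under stated assumptions: the `Deployment` record, its fields, and the sample data are hypothetical and not taken from the talk or from the State of DevOps report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    """One production change. Hypothetical record, not from the deck."""
    service: str
    caused_incident: bool


def change_failure_rate(deployments: list[Deployment]) -> float:
    """Fraction of deployments that led to an incident (0.0 if there were none)."""
    if not deployments:
        return 0.0
    failed = sum(1 for d in deployments if d.caused_incident)
    return failed / len(deployments)


# Illustrative data only: 1 failed change out of 4.
history = [
    Deployment("checkout", caused_incident=False),
    Deployment("checkout", caused_incident=True),
    Deployment("search", caused_incident=False),
    Deployment("search", caused_incident=False),
]
print(f"change failure rate: {change_failure_rate(history):.0%}")  # prints "change failure rate: 25%"
```

In practice the hard part is deciding which incidents count as "caused by a change"; the boolean flag here assumes that judgement has already been made, for example during a post-incident review.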