Learning in production (or why Apollo 11 nearly failed)

Tests and monitoring help us assert the known knowns of our systems. But what about the known unknowns? Or, especially in complex distributed systems, the unknown unknowns? What can we learn from the space program and the Apollo 11 landing? How can we prepare for the unknown and build our adaptive capacity?

Michiel Rook

March 16, 2021

Transcript

  1. LEARNING IN PRODUCTION or why the Apollo 11 landing nearly failed Michiel Rook @michieltcs
  2. 1969

  3. None
  4. None
  5. None
  6. @michieltcs IT ALMOST DIDN'T HAPPEN

  7. None
  8. 1970

  9. None
  10. None
  11. None
  12. 2020

  13. None
  14. None
  15. None
  16. @michieltcs

  17. @michieltcs "SpaceX provided audio recordings from the Crew Dragon’s first orbital test flight to help prepare Hurley and Behnken for the ride during launch and re-entry."
  18. @michieltcs KEY TAKEAWAYS:

  19. @michieltcs TESTING

  20. @michieltcs EXPERIMENTATION

  21. @michieltcs SIMULATION

  22. @michieltcs TRAINING

  23. @michieltcs ADAPTATION

  24. @michieltcs "BUT THAT IS ROCKET SCIENCE!"

  25. @michieltcs "THAT WOULDN'T WORK HERE"

  26. @michieltcs "WE DON'T HAVE THE BUDGET"

  27. @michieltcs "WE DON'T HAVE THE PEOPLE"

  28. @michieltcs "WE DON'T HAVE THE TIME"

  29. @michieltcs WHAT CAN WE LEARN FROM SPACE?

  30. @michieltcs WE ARE BUILDING

  31. @michieltcs COMPLEX DISTRIBUTED SYSTEMS

  32. @michieltcs "NON-LINEAR"

  33. @michieltcs "HARD TO REASON ABOUT"

  34. @michieltcs "NO SINGLE PERSON CAN UNDERSTAND THE SYSTEM"

  35. @michieltcs "MODEL DOES NOT MATCH REALITY"

  36. @michieltcs "SURPRISING FAILURE MODES"

  37. @michieltcs "OKAY, BUT WE CAN BUILD SIMPLE THINGS"

  38. @michieltcs "WE SHOULD JUST PLAN BETTER"

  39. @michieltcs "WE SHOULD JUST BE MORE CAREFUL"

  40. @michieltcs "WE SHOULD JUST NOT MAKE MISTAKES"

  41. @michieltcs DAVID WOODS https://youtu.be/GNVXFGC-5JW
  42. @michieltcs "OKAY, BUT WHAT IF WE JUST TEST MORE"

  43. @michieltcs UNIT TESTS, UI / E2E / VISUAL TESTS, INTEGRATION / CONTRACT TESTS, COST, SPEED
  44. @michieltcs "Testing shows the presence, not absence, of bugs." EDSGER W. DIJKSTRA
  45. @michieltcs @michieltcs

  46. @michieltcs @michieltcs

  47. @michieltcs "OKAY, BUT WHAT IF WE STOP CHANGE"

  48. @michieltcs "... incidents resulting from change is one of the most effective metrics .... It isn’t a measure of system failures; it’s a measure of departmental failures."
  49. @michieltcs "... incidents resulting from change is one of the most effective metrics .... It isn’t a measure of system failures; it’s a measure of departmental failures."
  50. @michieltcs "Every week of delay between having an idea and launching it to customers can mean millions of dollars lost in opportunity costs. IT matters." STEVE SMITH
  51. @michieltcs @michieltcs

  52. @michieltcs DEALING WITH THE UNKNOWN

  53. @michieltcs BUILD YOUR ADAPTIVE CAPACITY

  54. @michieltcs @michieltcs

  55. @michieltcs @michieltcs

  56. @michieltcs ‣ An adaptive architecture ‣ Incremental deployments ‣ Automated provisioning ‣ Ubiquitous telemetry ‣ Chaos Engineering ‣ You Build It You Run It ‣ Post-incident reviews https://www.stevesmith.tech/blog/build-operability-in/
  57. @michieltcs ‣ An adaptive architecture ‣ Incremental deployments ‣ Automated provisioning ‣ Ubiquitous telemetry ‣ Chaos Engineering ‣ You Build It You Run It ‣ Post-incident reviews (auto scaling, circuit breakers, health checks) https://www.stevesmith.tech/blog/build-operability-in/
  58. @michieltcs ‣ An adaptive architecture ‣ Incremental deployments ‣ Automated provisioning ‣ Ubiquitous telemetry ‣ Chaos Engineering ‣ You Build It You Run It ‣ Post-incident reviews (frequent deploys, blue/green, canary, rolling, rollbacks) https://www.stevesmith.tech/blog/build-operability-in/
  59. @michieltcs ‣ An adaptive architecture ‣ Incremental deployments ‣ Automated provisioning ‣ Ubiquitous telemetry ‣ Chaos Engineering ‣ You Build It You Run It ‣ Post-incident reviews (terraform, ansible, packer, etc.) https://www.stevesmith.tech/blog/build-operability-in/
  60. @michieltcs ‣ An adaptive architecture ‣ Incremental deployments ‣ Automated provisioning ‣ Ubiquitous telemetry ‣ Chaos Engineering ‣ You Build It You Run It ‣ Post-incident reviews (observability) https://www.stevesmith.tech/blog/build-operability-in/
  61. @michieltcs ‣ An adaptive architecture ‣ Incremental deployments ‣ Automated provisioning ‣ Ubiquitous telemetry ‣ Chaos Engineering ‣ You Build It You Run It ‣ Post-incident reviews (you OWN it, you run it) https://www.stevesmith.tech/blog/build-operability-in/
  62. @michieltcs ‣ An adaptive architecture ‣ Incremental deployments ‣ Automated provisioning ‣ Ubiquitous telemetry ‣ Chaos Engineering ‣ You Build It You Run It ‣ Post-incident reviews (blameless postmortems, knowledge sharing, learning) https://www.stevesmith.tech/blog/build-operability-in/
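The slides list circuit breakers as one ingredient of an adaptive architecture without showing one. As an illustration only, here is a minimal sketch of the idea in Python. The class name `CircuitBreaker`, its parameters `max_failures` and `reset_after`, and the `CircuitOpenError` exception are all hypothetical and not taken from the talk or from any particular library.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Hypothetical minimal circuit breaker; an illustration, not the talk's code."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so tests can control time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of hammering a dependency that is down.
                raise CircuitOpenError("circuit open, call rejected")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each call to a remote dependency, for example `breaker.call(fetch_profile, user_id)` with a hypothetical `fetch_profile`, and treat `CircuitOpenError` as a signal to serve a fallback. Passing the clock in as a parameter is a design choice that keeps the sketch testable without sleeping.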
  63. @michieltcs INTEGRATE EARLY

  64. @michieltcs INTEGRATE OFTEN

  65. @michieltcs MAKE THINGS SMALL

  66. @michieltcs BIG STEPS

  67. @michieltcs FAIL BIG

  68. @michieltcs SMALL STEPS

  69. @michieltcs FAIL SMALL

  70. @michieltcs $ = REALIZED VALUE CREDITS TO @FGOULDING
  71. Accelerate: State of DevOps 2019 | How Do We Compare? ELITE PERFORMERS. Comparing the elite group against the low performers, we find that elite performers have:
    Throughput: 208 TIMES MORE frequent code deployments; 106 TIMES FASTER lead time from commit to deploy.
    Stability: 2,604 TIMES FASTER time to recover from incidents; 7 TIMES LOWER change failure rate (changes are 1/7 as likely to fail).
    Source: 2019 State Of DevOps report
  72. @michieltcs OBSERVABILITY AND OPERABILITY

  73. @michieltcs @michieltcs

  74. @michieltcs "a measure of how well internal states of a system can be inferred from knowledge of its external outputs." https://en.wikipedia.org/wiki/Observability
  75. @michieltcs @michieltcs source: laredoute.io

  76. @michieltcs "the properties of a system which make it work well in production" https://confluxdigital.net/what-is-operability
  77. @michieltcs "You cannot inspect quality into a product." HAROLD S. DODGE
  78. @michieltcs

  79. @michieltcs FEEDBACK LOOPS

  80. @michieltcs @michieltcs

  81. @michieltcs

  82. @michieltcs EXPECT FAILURE

  83. @michieltcs EMBRACE FAILURE

  84. @michieltcs INDUCE FAILURE

  85. @michieltcs CHAOS ENGINEERING

  86. @michieltcs "the facilitation of experiments to uncover systemic weaknesses"

  87. @michieltcs "the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production*"
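To make the two definitions above more concrete, the following is a toy sketch of a chaos experiment in Python: state a steady-state hypothesis, inject a fault into a dependency, and measure whether the hypothesis still holds. Everything in it is an assumption made for illustration, including the names `flaky`, `fetch_price`, `price_with_fallback` and `run_experiment`, the 30% failure rate, and the 95% threshold. Real experiments run against real systems, usually with dedicated tooling.

```python
import random


def flaky(func, failure_rate, rng):
    """Wrap func so a fraction of calls raise, simulating a failing dependency."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")
        return func(*args, **kwargs)
    return wrapper


def fetch_price(item):
    """Stand-in for a remote dependency (hypothetical)."""
    return {"apple": 3, "pear": 4}[item]


def price_with_fallback(fetch, item, retries=3, default=None):
    """System under test: retries the dependency, then degrades to a default."""
    for _ in range(retries):
        try:
            return fetch(item)
        except ConnectionError:
            continue
    return default


def run_experiment(failure_rate, requests=1000, seed=42):
    """Return the fraction of requests that were answered with a real price."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    fetch = flaky(fetch_price, failure_rate, rng)
    answered = sum(
        price_with_fallback(fetch, "apple") is not None for _ in range(requests)
    )
    return answered / requests


# Steady-state hypothesis (assumed for this sketch): at least 95% of requests
# still get a real price while 30% of dependency calls fail.
success_ratio = run_experiment(failure_rate=0.3)
print(f"success ratio under fault injection: {success_ratio:.3f}")
print("hypothesis holds" if success_ratio >= 0.95 else "weakness uncovered")
```

The point of the sketch is the shape of the experiment rather than the numbers: the hypothesis is written down before the fault is injected, and a failed hypothesis counts as a finding, not as a broken test.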
  88. @michieltcs NOT (JUST) ABOUT BREAKING THINGS

  89. @michieltcs @michieltcs

  90. @michieltcs @michieltcs

  91. @michieltcs NOT (JUST) ABOUT BREAKING PROD

  92. @michieltcs START SMALL

  93. @michieltcs TEST ACC PROD

  94. @michieltcs https://www.youtube.com/watch?v=NOOGKNBW0GK
  95. @michieltcs INCIDENT RESPONSE

  96. @michieltcs "Incidents are a fact of life. How well you respond is your choice." JIM SEVERINO
  97. @michieltcs "Here's the secret: Incident analysis is not actually about the incident." NORA JONES
  98. @michieltcs ROOT CAUSE ANALYSIS?

  99. @michieltcs ROOT CAUSE ANALYSIS?

  100. @michieltcs "What you call 'root cause' is simply the place where you stop looking any further." SIDNEY DEKKER
  101. @michieltcs LEARNING CULTURE

  102. @michieltcs BLAMELESS POSTMORTEMS

  103. @michieltcs BLAME AWARE POSTMORTEMS

  104. @michieltcs OPEN & HONEST

  105. @michieltcs ACCOUNTABILITY

  106. @michieltcs WHAT & HOW OVER WHO & WHY

  107. @michieltcs COLLABORATION

  108. @michieltcs https://vimeo.com/370008157
  109. None
  110. @michieltcs IN SUMMARY

  111. @michieltcs YOU CAN'T TEST EVERYTHING

  112. @michieltcs YOU CAN'T PREPARE FOR EVERYTHING

  113. @michieltcs YOU CAN LEARN

  114. @michieltcs TO BE PREPARED

  115. @michieltcs TO DEAL WITH ANYTHING

  116. @michieltcs @michieltcs THANK YOU FOR LISTENING! @michieltcs / michiel@michielrook.nl www.michielrook.nl