Learning in production (or why Apollo 11 nearly failed) (ConFoo 2025)

Tests and monitoring help us assert the known knowns of our systems. But what about the known unknowns? Or, especially in complex distributed systems, the unknown unknowns? What can we learn from the space program, and from the Apollo 11 landing in particular? How can we prepare for the unknown and build our adaptive capacity?

Michiel Rook

February 27, 2025

Transcript

  1. @michieltcs #StandWithUkraine "SpaceX provided audio recordings from the Crew Dragon’s first orbital test flight to help prepare Hurley and Behnken for the ride during launch and re-entry."
  2. David Woods, https://youtu.be/GNVXFGC-5JW
  3. "human failure is the biggest threat to security, therefore we periodically train our employees" (found on LinkedIn)
  4. [Diagram: test pyramid with layers for unit tests, integration / contract tests, and UI / E2E / visual tests, annotated with cost and speed]
  5-6. "... incidents resulting from change is one of the most effective metrics .... It isn’t a measure of system failures; it’s a measure of departmental failures"
  7. "Every week of delay between having an idea and launching it to customers can mean millions of dollars lost in opportunity costs. IT matters." Steve Smith
  8. "We don’t rise to the level of our expectations, we fall to the level of our training." Archilochus, Greek soldier, poet, c. 650 BC
  9-17. https://www.stevesmith.tech/blog/build-operability-in/
    ‣ An adaptive architecture: auto scaling, circuit breakers, health checks
    ‣ Incremental deployments: small & frequent deploys, blue/green, canary, rolling, rollbacks
    ‣ Automated provisioning: Terraform, Ansible, Packer, etc.
    ‣ Ubiquitous telemetry: observability and operability
    ‣ Chaos Engineering: controlled experiments on systems
    ‣ You Build It You Run It: you OWN it you run it
    ‣ Post-incident reviews: blameless postmortems, knowledge sharing, learning
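
The "circuit breakers" item under an adaptive architecture can be illustrated with a minimal sketch. The class name `CircuitBreaker`, its thresholds, and the injectable clock are assumptions made for illustration; they do not come from the talk or from any particular library.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Hypothetical minimal circuit breaker: closed -> open -> half-open."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures  # consecutive failures before tripping
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so tests can control time
        self.failures = 0
        self.opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_after:
            return "half-open"
        return "open"

    def call(self, func: Callable[[], T]) -> T:
        # While open, fail fast instead of hammering a struggling dependency.
        if self.state == "open":
            raise CircuitOpenError("dependency is failing; call rejected")
        try:
            result = func()
        except Exception:
            self.failures += 1
            # Trip the breaker, or re-trip it after a failed trial call.
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Any success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping each outbound call in `breaker.call(...)` lets a caller degrade quickly while a dependency is down, then probe it again once `reset_after` seconds have passed.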
  18. Accelerate: State of DevOps 2019 | How Do We Compare? Elite performers: comparing the elite group against the low performers, we find that elite performers have…
    Throughput: frequent code deployments, 208 times more; lead time from commit to deploy, 106 times faster
    Stability: time to recover from incidents, 2,604 times faster; change failure rate, 7 times lower (changes are 1/7 as likely to fail)
    Source: 2019 State of DevOps report
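
The four measures on this slide could be derived from a team's own deployment history. The sketch below assumes a hypothetical `Deployment` record and a hypothetical `four_key_metrics` helper; the field names and the choice of medians are illustrative and are not taken from the report.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Dict, List, Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False                    # did the change cause an incident?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def hours(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 3600


def four_key_metrics(deploys: List[Deployment],
                     period_days: int) -> Dict[str, Optional[float]]:
    """Compute throughput and stability measures over one reporting period."""
    if not deploys or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")
    failures = [d for d in deploys if d.failed]
    restore_hours = [hours(d.deployed_at, d.restored_at)
                     for d in failures if d.restored_at is not None]
    return {
        # Throughput
        "deploys_per_day": len(deploys) / period_days,
        "median_lead_time_hours": median(
            hours(d.committed_at, d.deployed_at) for d in deploys),
        # Stability
        "change_failure_rate": len(failures) / len(deploys),
        "median_time_to_restore_hours":
            median(restore_hours) if restore_hours else None,
    }
```

Tracking these numbers per period makes it possible to see whether smaller, more frequent deploys are actually improving both throughput and stability.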
  19. "a measure of how well internal states of a system can be inferred from knowledge of its external outputs." https://en.wikipedia.org/wiki/Observability
  20-21. "the properties of a system which make it work well in production" (https://confluxdigital.net/what-is-operability); the operator experience
  22. "the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production"
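
A controlled experiment in this sense can be sketched in a few lines: state a steady-state hypothesis, inject a fault, and check whether the hypothesis still holds. Everything below (`flaky`, `with_fallback`, `run_experiment`, the 30% failure rate, and the 99% threshold) is a hypothetical, in-process illustration, not a real chaos tool or the setup used in the talk.

```python
import random
from typing import Callable

Lookup = Callable[[str], str]


def flaky(func: Lookup, failure_rate: float, rng: random.Random) -> Lookup:
    """Wrap a dependency so a fraction of calls raise, simulating turbulence."""
    def wrapper(key: str) -> str:
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")
        return func(key)
    return wrapper


def with_fallback(primary: Lookup, fallback: Lookup) -> Lookup:
    """Serve a degraded answer when the primary dependency fails."""
    def handler(key: str) -> str:
        try:
            return primary(key)
        except ConnectionError:
            return fallback(key)
    return handler


def success_rate(handler: Lookup, requests: int = 1000) -> float:
    ok = 0
    for i in range(requests):
        try:
            handler(f"user-{i}")
            ok += 1
        except Exception:
            pass
    return ok / requests


def run_experiment(handler_factory: Callable[[Lookup], Lookup],
                   failure_rate: float = 0.3, threshold: float = 0.99,
                   seed: int = 42) -> bool:
    """Hypothesis: the success rate stays at or above threshold under faults."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    dependency = flaky(lambda key: f"profile:{key}", failure_rate, rng)
    handler = handler_factory(dependency)
    return success_rate(handler) >= threshold
```

With a fallback in place, `run_experiment(lambda dep: with_fallback(dep, lambda key: "profile:default"))` confirms the hypothesis, while `run_experiment(lambda dep: dep)` refutes it. A refuted hypothesis is the useful outcome here: it shows where the system lacks adaptive capacity before a real incident does.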
  23. https://www.youtube.com/watch?v=NOOGKNBW0GK
  24. "What you call 'root cause' is simply the place where you stop looking any further." Sidney Dekker
  25. “What is the cause of the accident? This question is just as bizarre as asking what the cause is of not having an accident.” Sidney Dekker
  26. “... the more I realized that this accident occurred not because something extraordinary had happened, but rather just the opposite” Scott A. Snook
  27. "Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand." https://retrospectivewiki.org/index.php?title=The_Prime_Directive