
Learning in production (or why Apollo 11 nearly failed)

Tests and monitoring help us assert the known knowns of our systems. But what about the known unknowns? Or, especially in complex distributed systems, the unknown unknowns? What can we learn from the space program, and from the Apollo 11 landing in particular? How can we prepare for the unknown and build our adaptive capacity?

Michiel Rook

March 16, 2021

Transcript

  1. LEARNING IN
    PRODUCTION



    or why the Apollo 11 landing

    nearly failed
    Michiel Rook

    @michieltcs

    View Slide

  2. 1969

    View Slide

  3. View Slide

  4. View Slide

  5. View Slide

  6. @michieltcs
    IT ALMOST

    DIDN'T HAPPEN

    View Slide

  7. View Slide

  8. 1970

    View Slide

  9. View Slide

  10. View Slide

  11. View Slide

  12. 2020

    View Slide

  13. View Slide

  14. View Slide

  15. View Slide

  16. @michieltcs

    View Slide

  17. @michieltcs
    "SpaceX provided audio recordings
    from the Crew Dragon’s first orbital
    test flight to help prepare Hurley
    and Behnken for the ride during
    launch and re-entry."

    View Slide

  18. @michieltcs
    KEY TAKEAWAYS:

    View Slide

  19. @michieltcs
    TESTING

    View Slide

  20. @michieltcs
    EXPERIMENTATION

    View Slide

  21. @michieltcs
    SIMULATION

    View Slide

  22. @michieltcs
    TRAINING

    View Slide

  23. @michieltcs
    ADAPTATION

    View Slide

  24. @michieltcs
    "BUT THAT IS

    ROCKET SCIENCE!"

    View Slide

  25. @michieltcs
    "THAT WOULDN'T
    WORK

    HERE"

    View Slide

  26. @michieltcs
    "WE DON'T HAVE THE
    BUDGET"

    View Slide

  27. @michieltcs
    "WE DON'T HAVE THE
    PEOPLE"

    View Slide

  28. @michieltcs
    "WE DON'T HAVE THE
    TIME"

    View Slide

  29. @michieltcs
    WHAT CAN WE
    LEARN FROM SPACE?

    View Slide

  30. @michieltcs
    WE ARE BUILDING

    View Slide

  31. @michieltcs
    COMPLEX
    DISTRIBUTED
    SYSTEMS

    View Slide

  32. @michieltcs
    "NON-LINEAR"

    View Slide

  33. @michieltcs
    "HARD TO REASON
    ABOUT"

    View Slide

  34. @michieltcs
    "NO SINGLE PERSON
    CAN UNDERSTAND
    THE SYSTEM"

    View Slide

  35. @michieltcs
    "MODEL DOES NOT
    MATCH REALITY"

    View Slide

  36. @michieltcs
    "SURPRISING
    FAILURE MODES"

    View Slide

  37. @michieltcs
    "OKAY, BUT WE CAN
    BUILD SIMPLE
    THINGS"

    View Slide

  38. @michieltcs
    "WE SHOULD JUST
    PLAN BETTER"

    View Slide

  39. @michieltcs
    "WE SHOULD JUST
    BE MORE CAREFUL"

    View Slide

  40. @michieltcs
    "WE SHOULD JUST
    NOT MAKE MISTAKES"

    View Slide

  41. @michieltcs
    DAVID WOODS
    HTTPS://YOUTU.BE/GNVXFGC-5JW

    View Slide

  42. @michieltcs
    "OKAY, BUT WHAT IF
    WE JUST TEST MORE"

    View Slide

  43. @michieltcs
    Test pyramid (top to bottom):
    UI / E2E / VISUAL TESTS
    INTEGRATION / CONTRACT TESTS
    UNIT TESTS
    Axis labels: COST, SPEED

    View Slide

  44. @michieltcs
    "Testing shows the
    presence, not absence, of
    bugs."
    EDSGER W. DIJKSTRA

    View Slide

  45. @michieltcs

    View Slide

  46. @michieltcs

    View Slide

  47. @michieltcs
    "OKAY, BUT WHAT IF
    WE STOP CHANGE"

    View Slide

  48. @michieltcs
    "... incidents resulting from
    change is one of the most
    effective metrics .... It isn’t a
    measure of system failures; it’s a
    measure of departmental
    failures."

    View Slide

  49. @michieltcs
    "... incidents resulting from
    change is one of the most
    effective metrics .... It isn’t a
    measure of system failures; it’s a
    measure of departmental
    failures."

    View Slide

  50. @michieltcs
    "Every week of delay
    between having an idea
    and launching it to
    customers can mean
    millions of dollars lost in
    opportunity costs. IT
    matters."
    STEVE SMITH

    View Slide

  51. @michieltcs

    View Slide

  52. @michieltcs
    DEALING WITH THE
    UNKNOWN

    View Slide

  53. @michieltcs
    BUILD YOUR


    ADAPTIVE CAPACITY

    View Slide

  54. @michieltcs

    View Slide

  55. @michieltcs

    View Slide

  56. @michieltcs
    ‣ An adaptive architecture


    ‣ Incremental deployments


    ‣ Automated provisioning


    ‣ Ubiquitous telemetry


    ‣ Chaos Engineering


    ‣ You Build It You Run It


    ‣ Post-incident reviews
    HTTPS://WWW.STEVESMITH.TECH/BLOG/BUILD-OPERABILITY-IN/

    View Slide

  57. @michieltcs
    ‣ An adaptive architecture


    ‣ Incremental deployments


    ‣ Automated provisioning


    ‣ Ubiquitous telemetry


    ‣ Chaos Engineering


    ‣ You Build It You Run It


    ‣ Post-incident reviews
    auto scaling, circuit breakers, health checks
    HTTPS://WWW.STEVESMITH.TECH/BLOG/BUILD-OPERABILITY-IN/

    View Slide
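
Of the three techniques named on this slide, the circuit breaker is the easiest to show in a few lines. The following is a minimal sketch, not taken from the talk: the class name `CircuitBreaker`, the `max_failures` and `reset_after` parameters, and the injectable `clock` are all assumptions made for illustration. After a number of consecutive failures it stops calling the dependency and fails fast, then allows a single trial call once a cool-down has passed.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Hypothetical breaker: opens after max_failures consecutive errors."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock              # injectable so it can be tested
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping each call to a dependency in `breaker.call(...)` keeps a failing downstream service from tying up callers while it recovers. The thresholds here are placeholders and would need tuning per dependency.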

  58. @michieltcs
    ‣ An adaptive architecture


    ‣ Incremental deployments


    ‣ Automated provisioning


    ‣ Ubiquitous telemetry


    ‣ Chaos Engineering


    ‣ You Build It You Run It


    ‣ Post-incident reviews
    frequent deploys, blue/green, canary, rolling, rollbacks
    HTTPS://WWW.STEVESMITH.TECH/BLOG/BUILD-OPERABILITY-IN/

    View Slide
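
A canary rollout from this slide can be reduced to one repeated decision: compare the canary's error rate with the baseline and then promote, wait, or roll back. The sketch below is an illustration under stated assumptions, not the speaker's method. The function names, the `max_ratio` and `min_requests` thresholds, and the 0.1% absolute floor are all hypothetical.

```python
def canary_verdict(baseline_errors, baseline_requests,
                   canary_errors, canary_requests,
                   max_ratio=1.5, min_requests=100):
    """Return 'wait', 'promote' or 'rollback' for one canary observation."""
    if canary_requests < min_requests:
        return "wait"  # not enough traffic yet to judge
    baseline_rate = baseline_errors / max(baseline_requests, 1)
    canary_rate = canary_errors / canary_requests
    # Absolute floor (assumed 0.1%) so a near-zero baseline does not
    # turn every single canary error into a rollback.
    allowed = max(baseline_rate * max_ratio, 0.001)
    return "promote" if canary_rate <= allowed else "rollback"


def run_rollout(steps, measure):
    """Walk through traffic percentages; stop at the first step that is not promoted.

    `measure(percent)` is a hypothetical callback returning the four
    counters expected by canary_verdict for that traffic share.
    """
    for percent in steps:
        verdict = canary_verdict(*measure(percent))
        if verdict != "promote":
            return percent, verdict
    return steps[-1], "promote"
```

Called with steps such as `[1, 5, 25, 100]`, a bad release is caught while it serves only a small share of traffic.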

  59. @michieltcs
    ‣ An adaptive architecture


    ‣ Incremental deployments


    ‣ Automated provisioning


    ‣ Ubiquitous telemetry


    ‣ Chaos Engineering


    ‣ You Build It You Run It


    ‣ Post-incident reviews
    Terraform, Ansible, Packer, etc.
    HTTPS://WWW.STEVESMITH.TECH/BLOG/BUILD-OPERABILITY-IN/

    View Slide

  60. @michieltcs
    ‣ An adaptive architecture


    ‣ Incremental deployments


    ‣ Automated provisioning


    ‣ Ubiquitous telemetry


    ‣ Chaos Engineering


    ‣ You Build It You Run It


    ‣ Post-incident reviews
    observability
    HTTPS://WWW.STEVESMITH.TECH/BLOG/BUILD-OPERABILITY-IN/

    View Slide

  61. @michieltcs
    ‣ An adaptive architecture


    ‣ Incremental deployments


    ‣ Automated provisioning


    ‣ Ubiquitous telemetry


    ‣ Chaos Engineering


    ‣ You Build It You Run It


    ‣ Post-incident reviews
    you OWN it, you run it
    HTTPS://WWW.STEVESMITH.TECH/BLOG/BUILD-OPERABILITY-IN/

    View Slide

  62. @michieltcs
    ‣ An adaptive architecture


    ‣ Incremental deployments


    ‣ Automated provisioning


    ‣ Ubiquitous telemetry


    ‣ Chaos Engineering


    ‣ You Build It You Run It


    ‣ Post-incident reviews
    HTTPS://WWW.STEVESMITH.TECH/BLOG/BUILD-OPERABILITY-IN/
    blameless postmortems, knowledge sharing, learning

    View Slide

  63. @michieltcs
    INTEGRATE EARLY

    View Slide

  64. @michieltcs
    INTEGRATE OFTEN

    View Slide

  65. @michieltcs
    MAKE THINGS

    SMALL

    View Slide

  66. @michieltcs
    BIG STEPS

    View Slide

  67. @michieltcs
    FAIL BIG

    View Slide

  68. @michieltcs
    SMALL STEPS

    View Slide

  69. @michieltcs
    FAIL SMALL

    View Slide

  70. @michieltcs
    $ = REALIZED VALUE

    CREDITS TO @FGOULDING

    View Slide

  71. Accelerate: State of DevOps 2019 | How Do We Compare?
    ELITE PERFORMERS
    Comparing the elite group against the low
    performers, we find that elite performers have…
    Throughput:
    208 TIMES MORE frequent code deployments
    106 TIMES FASTER lead time from commit to deploy
    Stability:
    2,604 TIMES FASTER time to recover from incidents
    7 TIMES LOWER change failure rate (changes are 1/7 as likely to fail)
    Source: 2019 State Of DevOps report

    View Slide
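
The four measures compared on this slide can be derived from a plain deployment log. The sketch below shows one way to compute them. The `Deployment` record and the `summarize` function are hypothetical, and the report's own survey-based methodology differs from this direct calculation.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical log record for one production deployment."""
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False                    # did the change cause an incident?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def summarize(deployments, period_days):
    """Compute throughput and stability measures over a period."""
    failures = [d for d in deployments if d.failed]
    restores = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    return {
        # Throughput
        "deploys_per_day": len(deployments) / period_days,
        "median_lead_time": median(lead_times),
        # Stability
        "change_failure_rate": len(failures) / len(deployments),
        "median_time_to_restore": median(restores) if restores else None,
    }
```

Tracking these numbers over time shows whether smaller, more frequent changes are improving both throughput and stability.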

  72. @michieltcs
    OBSERVABILITY


    AND


    OPERABILITY

    View Slide

  73. @michieltcs

    View Slide

  74. @michieltcs
    "a measure of how well
    internal states of
    a system can be inferred
    from knowledge of its
    external outputs."
    HTTPS://EN.WIKIPEDIA.ORG/WIKI/OBSERVABILITY

    View Slide
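
In practice, "external outputs" often means structured events that a system emits about its own work. The following sketch assumes a hypothetical helper pair, `emit` and `observed`, with field names chosen for illustration. It writes one JSON line per operation, recording its outcome and duration, so that internal state can later be inferred from the event stream.

```python
import json
import time


def emit(stream, event, **fields):
    """Write one structured event as a single JSON line."""
    record = {"event": event, "ts": time.time(), **fields}
    stream.write(json.dumps(record, sort_keys=True) + "\n")


def observed(stream, event, func, *args, **fields):
    """Run func(*args) and always emit its outcome and duration."""
    start = time.perf_counter()
    outcome = "ok"
    try:
        return func(*args)
    except Exception as exc:
        # Record the failure type, then let the error propagate unchanged.
        outcome = type(exc).__name__
        raise
    finally:
        duration_ms = round((time.perf_counter() - start) * 1000, 3)
        emit(stream, event, outcome=outcome, duration_ms=duration_ms, **fields)
```

Extra keyword arguments, such as a request ID or customer tier, become queryable fields on every event. That helps when investigating questions nobody anticipated in advance.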

  75. @michieltcs
    source: laredoute.io

    View Slide

  76. @michieltcs
    "the properties of a system
    which make it work well in
    production"
    HTTPS://CONFLUXDIGITAL.NET/WHAT-IS-OPERABILITY

    View Slide

  77. @michieltcs
    "You cannot inspect quality
    into a product."
    HAROLD F. DODGE

    View Slide

  78. @michieltcs

    View Slide

  79. @michieltcs
    FEEDBACK LOOPS

    View Slide

  80. @michieltcs

    View Slide

  81. @michieltcs

    View Slide

  82. @michieltcs
    EXPECT FAILURE

    View Slide

  83. @michieltcs
    EMBRACE FAILURE

    View Slide

  84. @michieltcs
    INDUCE FAILURE

    View Slide

  85. @michieltcs
    CHAOS

    ENGINEERING

    View Slide

  86. @michieltcs
    "the facilitation of
    experiments to uncover
    systemic weaknesses"

    View Slide

  87. @michieltcs
    "the discipline of
    experimenting on a
    distributed system in order
    to build confidence in the
    system’s capability to
    withstand turbulent
    conditions in production*"

    View Slide
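
Both definitions describe an experiment, not random breakage. The sketch below illustrates the usual shape: state a steady-state hypothesis, inject a fault, check whether the hypothesis still holds, and always undo the fault. The names `run_experiment` and `ToyService` are hypothetical, and the in-memory service stands in for a real system.

```python
def run_experiment(steady_state, inject, rollback):
    """Run one chaos experiment and report whether the hypothesis held.

    steady_state: callable returning True when the system behaves normally
    inject:       callable that introduces the fault
    rollback:     callable that removes the fault again
    """
    if not steady_state():
        # Never experiment on a system that is already unhealthy.
        return "aborted: system not in steady state before injection"
    inject()
    try:
        held = steady_state()
    finally:
        rollback()  # limit the blast radius even if the check itself fails
    return "hypothesis held" if held else "weakness found"


class ToyService:
    """Stand-in system: healthy while at least one replica is up."""

    def __init__(self, replicas):
        self.up = set(replicas)

    def healthy(self):
        return len(self.up) >= 1
```

With two replicas, removing one leaves the toy service healthy and the hypothesis holds. With a single replica the same experiment reports a weakness. That result is the learning the experiment is meant to produce.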

  88. @michieltcs
    NOT (JUST) ABOUT

    BREAKING THINGS

    View Slide

  89. @michieltcs

    View Slide

  90. @michieltcs

    View Slide

  91. @michieltcs
    NOT (JUST) ABOUT

    BREAKING PROD

    View Slide

  92. @michieltcs
    START SMALL

    View Slide

  93. @michieltcs
    TEST → ACC → PROD

    View Slide

  94. @michieltcs
    HTTPS://WWW.YOUTUBE.COM/WATCH?V=NOOGKNBW0GK

    View Slide

  95. @michieltcs
    INCIDENT RESPONSE

    View Slide

  96. @michieltcs
    "Incidents are a fact of life.


    How well you respond is your
    choice."
    JIM SEVERINO

    View Slide

  97. @michieltcs
    "Here's the secret:

    Incident analysis is not actually
    about the incident."
    NORA JONES

    View Slide

  98. @michieltcs
    ROOT CAUSE
    ANALYSIS?

    View Slide

  99. @michieltcs
    ROOT CAUSE
    ANALYSIS?

    View Slide

  100. @michieltcs
    "What you call 'root
    cause' is simply the
    place where you stop
    looking any further."
    SIDNEY DEKKER

    View Slide

  101. @michieltcs
    LEARNING


    CULTURE

    View Slide

  102. @michieltcs
    BLAMELESS


    POSTMORTEMS

    View Slide

  103. @michieltcs
    BLAME AWARE


    POSTMORTEMS

    View Slide

  104. @michieltcs
    OPEN & HONEST

    View Slide

  105. @michieltcs
    ACCOUNTABILITY

    View Slide

  106. @michieltcs
    WHAT & HOW


    OVER


    WHO & WHY

    View Slide

  107. @michieltcs
    COLLABORATION

    View Slide

  108. @michieltcs
    HTTPS://VIMEO.COM/370008157

    View Slide

  109. View Slide

  110. @michieltcs
    IN SUMMARY

    View Slide

  111. @michieltcs
    YOU CAN'T TEST
    EVERYTHING

    View Slide

  112. @michieltcs
    YOU CAN'T PREPARE
    FOR EVERYTHING

    View Slide

  113. @michieltcs
    YOU CAN LEARN

    View Slide

  114. @michieltcs
    TO BE PREPARED

    View Slide

  115. @michieltcs
    TO DEAL WITH
    ANYTHING

    View Slide

  116. @michieltcs
    THANK YOU FOR
    LISTENING!


    @michieltcs


    www.michielrook.nl

    View Slide