Failure and Change: Principles of Reliable Systems

Mark Hibberd
November 30, 2018

As we construct larger or more complex systems, failure and change are
ever-present. We need to accept and even embrace these tensions to
build software that works and keeps working. This is a talk on building
and operating reliable systems. We will look at how systems fail,
particularly in the face of complexity or scale, and build up a set of
principles and practices that will help us implement, understand and
verify reliable systems.

Transcript

  1. Failure & Change:
    @markhibberd
    Principles of Reliable Systems

  2. Reliability: The quality of performing consistently well.

  3. P(failure) = 0.1 or 10%

  4. Redundancy, let's do it!

  5. P(individual failure) = 0.1

  6. P(system failure) = 0.1^10

  7. P(system availability) = 1 - 0.1^10

  8. 99.99999999% Availability.

  9. Are failures really independent?

  10. P(failure) = 10% or 0.1

  11. P(individual success) = 1 - 0.1 = 0.9 or 90%

  12. P(all successes) = 0.9^10

  13. P(at least one failure) = 1 - 0.9^10 = 0.65 or 65%

  14. P(mutually assured destruction) = 1

  15. P(at least one failure) = 1 - 0.9^10 = 0.65 or 65%

  16. P(system failure) = 1 - 0.9^10 = 0.65 or 65%
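
A quick sketch of the arithmetic behind these slides, assuming each of ten parts fails independently with p = 0.1: redundancy (the system fails only if every part fails) versus a chain (the system fails if any part fails). A minimal Python illustration, not from the talk:

    # Assumes independent failures, p = 0.1 per part, 10 parts.
    p, n = 0.1, 10

    # Redundant parts: the system fails only if every part fails.
    p_redundant_failure = p ** n              # 0.1^10 = 1e-10
    availability = 1 - p_redundant_failure    # 99.99999999%

    # Chained parts: the system fails if at least one part fails.
    p_all_succeed = (1 - p) ** n              # 0.9^10 ~= 0.35
    p_any_failure = 1 - p_all_succeed         # ~= 0.65 or 65%

    print(availability, p_any_failure)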

  17. Redundancy means embracing more
    failure, but with an increased
    opportunity for handling those failures.

  18. Building systems.

  19. Player pairing.

  20. Player pairing.
    Game play.

  21. Player pairing.
    Game play.
    History.

  22. Player pairing.
    Game play. Analyse games.
    History.

  23. Player pairing.
    Game play. Analyse games.
    History.
    Play engines.

  24. Player pairing.
    Game play. Analyse games.
    History.
    Play engines.
    Cheat detection.

  25. An argument for monoliths?

  26. No guarantee of independence.

  27. Magnification of consequences.

  28. Service architecture provides the
    opportunity to trade likelihood of failure
    against the consequences of failure.

  29. A service approach.

  30. A naive service approach.
    Game
    Player

  31. A naive service approach.
    Game
    Play History
    Chess
    Player
    Analyse

  32. A naive service approach.
    {
      "game-id": 1234,
      "game-state": "waiting-for-pair",
      "player-1": 5678,
      "player-2": nil
    }

  33. A naive service approach.
    {
      "game-id": 1234,
      "game-state": "waiting-for-move",
      "player-1": 5678,
      "player-2": 9123
    }

  34. A naive service approach.
    Game
    Play History
    Chess
    Player
    Analyse

  35. A naive service approach.
    {
      "game-id": 1234,
      "game-state": "finished",
      "player-1": 5678,
      "player-2": 9123
    }

  36. A naive service approach.
    Game
    Play History
    Chess
    Player
    Analyse

  37. Game
    Play History
    Chess
    Player
    Analyse
    A naive service approach.
    What happens when the game
    service has an issue?

  38. A naive service approach.
    We have almost all the
    downsides of the monolithic
    approach, plus we have to
    deal with network interactions
    and the operational complexity
    of multiple services.
    Game
    Play History
    Chess
    Player
    Analyse

  39. A service approach.

  40. Games as a value instead of a service.
    1. e4 e5
    2. Bc4 Nc6
    3. Qh5 Nf6??
    4. Qxf7#
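
One way to read "games as a value": a game is nothing more than its immutable list of moves, and anything else (whose turn it is, whether it is finished) is derived from that value. A minimal Python sketch using the moves above; the representation is an assumption for illustration, not the talk's code:

    from typing import NamedTuple, Tuple

    class Game(NamedTuple):
        """A game is just its identifier and its moves in algebraic notation."""
        game_id: str
        moves: Tuple[str, ...]

        def with_move(self, move: str) -> "Game":
            # Values are never mutated in place; a move produces a new value.
            return self._replace(moves=self.moves + (move,))

    scholars_mate = Game("13376a3e", ("e4", "e5", "Bc4", "Nc6", "Qh5", "Nf6", "Qxf7#"))
    finished = scholars_mate.moves[-1].endswith("#")   # True: derived, not stored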

  41. Play Analyse Engine
    Chess
    Pair History

  42. Pair.
    Pairing can be thought of as negotiating a
    shared identifier.

  43. Pair.
    game-id: 13376a3e

  44. Pair.
    A party who wants to be paired has to be able
    to communicate its constraints and how it
    can be notified.

  45. {
      notify: http://waiting/123,
      constraints: [
        { game-time: 5+3 },
        { min-rating: 1400 },
        { max-rating: 1600 },
        { ideal-rating: 1500 },
        { max-wait: 30 }
      ]
    }
    Pair.

  46. Pair.
    The pairing service maintains an indexed list
    of waiting parties.

  47. Pair.
    Asynchronously pairing compatible partners,
    generating a unique identifier for the pair, and
    notifying both parties.

  48. Pair.
    Independence?

  49. Pair.
    Knows: About players waiting for a game.

  50. Pair.
    Knows: About players waiting for a game.
    Told: When a new player wants to pair.

  51. Pair.
    Knows: About players waiting for a game.
    Told: When a new player wants to pair.
    Asks: Nothing.

  52. Pair.
    Knows: About players waiting for a game.
    Told: When a new player wants to pair.
    Asks: Nothing.
    Response: Vends players their unique game id.
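
A minimal sketch of a pairing service with this shape. The field names follow the earlier constraint slide; the matching rule and notification callback are assumptions, not the talk's implementation:

    import uuid

    class Pair:
        """Knows about waiting players; told when a new player wants to pair; asks nothing."""

        def __init__(self, notify):
            self.waiting = []        # indexed list of waiting parties
            self.notify = notify     # callback, e.g. POST the game id to the party's notify URL

        def enqueue(self, party):
            # party looks like {"notify": ..., "constraints": {...}}
            self.waiting.append(party)

        def compatible(self, a, b):
            # Stand-in constraint check; the real service would also match ratings, wait times, ...
            return a["constraints"]["game-time"] == b["constraints"]["game-time"]

        def run_once(self):
            # One pass of the asynchronous loop: pair compatible partners,
            # generate a unique shared identifier, and notify both parties.
            for i, a in enumerate(self.waiting):
                for b in self.waiting[i + 1:]:
                    if self.compatible(a, b):
                        game_id = uuid.uuid4().hex[:8]
                        self.waiting.remove(a)
                        self.waiting.remove(b)
                        self.notify(a["notify"], game_id)
                        self.notify(b["notify"], game_id)
                        return game_id
            return None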

  53. Play.
    The playing service is responsible for just the
    in-flight games.

  54. Play.
    A playing pair have a shared identifier they
    can use to join a game in an initial state.

  55. Play Analyse Engine
    Chess
    Pair History

  56. Play Analyse Engine
    Chess
    Pair History

  57. Play Analyse Engine
    Chess
    Pair History

  58. Knows: About games in play.
    Play.

  59. Knows: About games in play.
    Told: When a player makes a move.
    Play.

  60. Knows: About games in play.
    Told: When a player makes a move.
    Asks: Nothing.
    Play.

  61. Knows: About games in play.
    Told: When a player makes a move.
    Asks: Nothing.
    Response: Updated game value.
    Play.
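
A sketch of the play service's shape: only in-flight games, keyed by the shared identifier. The chess rules are stubbed out and the end-of-game check is a placeholder:

    in_flight = {}   # game_id -> tuple of moves, for games currently being played only

    def join(game_id):
        # A paired player joins with the shared identifier; the game starts in an initial state.
        return in_flight.setdefault(game_id, ())

    def move(game_id, san_move):
        # Told: a player makes a move. Response: the updated game value.
        game = in_flight[game_id] + (san_move,)
        in_flight[game_id] = game
        if san_move.endswith("#"):     # placeholder for a real end-of-game check
            del in_flight[game_id]     # finished games leave the play service
        return game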

  62. History.
    The history service is just a database of static
    games that can be searched.

  63. Play Analyse Engine
    Chess
    Pair History

  64. Play Analyse Engine
    Chess
    Pair History

  65. Play Analyse Engine
    Chess
    Pair History

  66. History.
    The history service gives out games as values.

  67. Knows: About historical game values.
    History.

  68. Knows: About historical game values.
    Told: New complete game values.
    History.

  69. Knows: About historical game values.
    Told: New complete game values.
    Asks: Nothing.
    History.

  70. Knows: About historical game values.
    Told: New complete game values.
    Asks: Nothing.
    Response: Complete game values.
    History.
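
The history service in the same spirit: an append-only store of complete game values that can be searched, handing out values rather than live state. A sketch; the search predicate is an assumption for illustration:

    history = []   # complete, immutable game values

    def record(game):
        # Told: a new complete game value, e.g. from the play service when a game ends.
        history.append(game)

    def search(predicate):
        # Response: complete game values; callers only ever receive values.
        return [game for game in history if predicate(game)]

    # e.g. every recorded game that opened with 1. e4:
    #   search(lambda game: game and game[0] == "e4")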

  71. Play Analyse Engine
    Chess
    Pair History

  72. Play Analyse Engine
    Chess
    Pair History

  73. Play Analyse Engine
    Chess
    Pair History

  74. Knows: About game analysis techniques.
    Analyse.

  75. Knows: About game analysis techniques.
    Told: Game value to analyse.
    Analyse.

  76. Knows: About game analysis techniques.
    Told: Game value to analyse.
    Asks: Nothing.
    Analyse.

  77. Knows: About game analysis techniques.
    Told: Game value to analyse.
    Asks: Nothing.
    Response: Annotated game values.
    Analyse.
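
And the analyse service: told a game value, it returns an annotated game value, asking nothing of the other services. The annotation itself is a stand-in here:

    def analyse(moves):
        # Told: a game value (its moves). Response: an annotated game value.
        annotated = []
        for ply, move in enumerate(moves, start=1):
            comment = "checkmate" if move.endswith("#") else ""
            annotated.append({"ply": ply, "move": move, "comment": comment})
        return annotated

    # analyse(("e4", "e5", "Bc4", "Nc6", "Qh5", "Nf6", "Qxf7#"))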

  78. Play Analyse Engine
    Chess
    Pair History

  79. Independent responsibilities over
    shared nouns.

  80. Operating Systems.

  81. Play Analyse Engine
    Chess
    Pair History

  82. Play Analyse Engine
    Chess
    Pair History

  83. GET /status
    {
      "name": "engine",
      "version": "c0ffee",
      "stats": {…},
      "status": "ok"
    }

  84. GET /status
    {
      "name": "engine",
      "version": "c0ffee",
      "stats": {…},
      "status": "ko"
    }
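
A sketch of the consuming side of a status endpoint like this: poll each instance and route around anything that does not report "ok". The URL and field names follow the slides; the polling helper is an assumption:

    import json
    import urllib.request

    def healthy(base_url, timeout=2.0):
        # GET /status and treat anything other than an explicit "ok" as unhealthy.
        try:
            with urllib.request.urlopen(base_url + "/status", timeout=timeout) as resp:
                return json.load(resp).get("status") == "ok"
        except (OSError, ValueError):
            return False

    # e.g. keep only the engine instances that currently report "ok":
    #   live = [url for url in engine_urls if healthy(url)]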

  85. 8 Fallacies of Distributed Computing
    - Bill Joy, Tom Lyon, L. Peter Deutsch, James Gosling
    1. The network is reliable.
    2. Latency is zero.
    3. Bandwidth is infinite.
    4. The network is secure.
    5. Topology doesn't change.
    6. There is one administrator.
    7. Transport cost is zero.
    8. The network is homogeneous.

  86. Timeouts save lives.
    timeout!
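
The simplest version of "timeouts save lives": never make a network call without a bound, so a slow dependency becomes a fast, explicit failure the caller can handle. A sketch; the path and the one-second budget are assumptions:

    import urllib.request

    def ask_engine(base_url, timeout=1.0):
        # Bound every network call: an answer within the budget or a known failure, never a hang.
        try:
            with urllib.request.urlopen(base_url + "/bestmove", timeout=timeout) as resp:
                return resp.read()
        except OSError:
            return None   # the caller decides how to degrade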

  87. Engine
    ~14% of Traffic To Each Server

  88. Engine
    100k Requests = ~14k Requests Per Server

  89. Engine
    ~16% of Traffic To Each Server

  90. Engine
    100k Requests = ~16k Requests Per Server

  91. Engine
    14k -> 16k Requests or ~15% Increase

  92. Engine
    14k -> 20k Requests or ~40% Increase

  93. Engine
    14k -> 25k Requests or ~80% Increase
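
The arithmetic behind these slides, assuming 100k requests spread evenly across 7 engine servers (as the ~14% per server implies): each server lost pushes a disproportionately larger share onto the survivors. The slides round these figures slightly differently:

    total_requests, servers = 100_000, 7
    baseline = total_requests / servers                    # ~14k per server

    for lost in (1, 2, 3):
        per_server = total_requests / (servers - lost)
        increase = (per_server / baseline - 1) * 100
        print(f"{lost} down: ~{per_server / 1000:.0f}k per server (+{increase:.0f}%)")

    # 1 down: ~17k (+17%), 2 down: ~20k (+40%), 3 down: ~25k (+75%)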

  94. Play Analyse Engine
    Chess
    Pair History

  95. Serving some is better than serving none.

  96. GET /status
    {
      "name": "engine",
      "version": "c0ffee",
      "stats": {…},
      "status": "ok"
    }

  97. GET /status
    {
      "name": "chess",
      "version": "ae103",
      "stats": {…},
      "status": "ok"
    }

  98. GET /status
    {
      "name": "engine",
      "version": "c0ffee",
      "stats": {…},
      "status": "ko"
    }

  99. GET /status
    {
      "name": "chess",
      "version": "ae103",
      "stats": {…},
      "status": "ko"
    }

  100. Play Analyse Engine
    Chess
    Pair History

  101. Play Analyse Engine
    Chess
    Pair History

  102. Play Analyse Engine
    Chess
    Pair History

  103. Graceful degradation maintains independence.
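
One concrete reading of graceful degradation here (a sketch, not the talk's code): when the engine dependency reports "ko" or times out, keep serving the game and simply omit the analysis rather than failing the whole request:

    def game_with_analysis(game, engine):
        # Serving some is better than serving none: the game is the core response,
        # engine analysis is an optional extra that can be degraded away.
        response = {"game": game, "analysis": None}
        try:
            response["analysis"] = engine.analyse(game)
        except Exception:
            pass   # degraded, but still serving
        return response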

  104. Changing Systems.

  105. Pair Version N

  106. Pair Version N

  107. Pair Version N Pair Version N + 1

  108. Pair Version N Pair Version N + 1
    Chess

  109. Pair Version N Pair Version N + 1

  110. Pair Version N Pair Version N + 1

  111. In-production verification.

  112. Pair Version N Pair Version N + 1
    Read Write
    Read + Write
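
One common shape for this kind of in-production verification (a sketch; the talk's exact read/write split may differ): keep answering from version N, mirror the same request to version N + 1, and compare the answers without letting the new version's response reach a user:

    import logging

    def pair_with_verification(request, version_n, version_n_plus_1):
        result = version_n.pair(request)                # the response the user actually gets
        try:
            shadow = version_n_plus_1.pair(request)     # mirrored call, response discarded
            if shadow != result:
                logging.warning("pair mismatch: %r vs %r", result, shadow)
        except Exception:
            logging.exception("version N + 1 failed in shadow mode")
        return result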

  113. Pair Version N Pair Version N + 1

  114. Pair Version N Pair Version N + 1

  115. Pair Version N Pair Version N + 1

  116. Pair Version N Pair Version N + 1

  117. Pair Version N Pair Version N + 1

  118. Pair Version N Pair Version N + 1

  119. Pair Version N Pair Version N + 1

  120. Incremental deployment.

  121. Pair Version N Pair Version N + 1
    Read + Write Read + Write

  122. Pair Version N Pair Version N + 1

  123. Pair Version N Pair Version N + 1

  124. Pair Version N Pair Version N + 1
    95%

  125. Pair Version N Pair Version N + 1
    5%
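
Incremental deployment as a weighted coin flip at the routing layer (a sketch; in practice the split usually lives in the load balancer rather than application code):

    import random

    def route(request, version_n, version_n_plus_1, new_traffic_percent=5):
        # Send a small, adjustable slice of live traffic to the new version.
        if random.uniform(0, 100) < new_traffic_percent:
            return version_n_plus_1.pair(request)
        return version_n.pair(request)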

  126. Understand success.

  127. Pair Version N Pair Version N + 1
    5%
    200 OK

  128. Pair Version N Pair Version N + 1
    5%
    500 SERVER ERROR

  129. {
      "match-key": 1234231,
      "pair": ["c0ffee", "7ea"]
    }

  130. Average Pair Time: 1.1s
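
"Understand success" means having a definition of healthy before shifting more traffic: compare error rates and pair times for the small cohort on version N + 1 against the baseline. A sketch with assumed thresholds:

    def healthy_enough(cohort, baseline_pair_time=1.1, max_error_rate=0.01):
        # cohort: list of (status_code, pair_time_seconds) observations from the new version.
        errors = sum(1 for status, _ in cohort if status >= 500)
        error_rate = errors / len(cohort)
        avg_pair_time = sum(t for _, t in cohort) / len(cohort)
        # Only widen the rollout if the new version looks at least as good as version N.
        return error_rate <= max_error_rate and avg_pair_time <= baseline_pair_time * 1.1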

  131. Pair Version N Pair Version N + 1
    95% 5%

  132. Pair Version N Pair Version N + 1
    50% 50%

  133. Pair Version N Pair Version N + 1
    5% 95%

  134. Pair Version N Pair Version N + 1
    0% 100%

  135. Pair Version N Pair Version N + 1
    50% 50%

  136. Pair Version N Pair Version N + 1
    100% 0%

  137. Could your system survive shipping a bad
    line of code?

  138. Data and code need to have an independent
    lifecycle.

  139. Unreliable Parts.

  140. “construct reliable systems from
    unreliable parts … from the
    knowledge that any component
    in the system might fail”
    - Holzmann & Joshi, Reliable Software Systems Design

  141. The following is based on a true story.

  142. The context has been changed to
    protect the innocent.

  143. AI Super Engine

  144. choosing to ship unreliable code…
    AI Super Engine

  145. P(failure) = 0.8
    AI Super Engine

  146. Create a proxy to take control of interactions.

  147. Create a more reliable view of data.

  148. On failure we discard state.

  149. Use our journal to rebuild the state.

  150. This gives us the independence we need.
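
A sketch of that proxy idea: every interaction is journalled on the reliable side of the boundary, so when the flaky engine fails we throw its state away, recreate it, and replay. The submit/rebuild interface is an assumption for illustration:

    class EngineProxy:
        """Owns all interaction with an unreliable engine; the journal is the source of truth."""

        def __init__(self, make_engine):
            self.make_engine = make_engine   # factory, so a failed engine can be discarded
            self.journal = []                # reliable, append-only record of every interaction
            self.engine = make_engine()

        def submit(self, command):
            self.journal.append(command)     # journal first, so nothing is lost on failure
            try:
                return self.engine.submit(command)
            except Exception:
                self.rebuild()
                return self.engine.submit(command)

        def rebuild(self):
            # On failure: discard the engine's state and replay the journal into a fresh instance.
            self.engine = self.make_engine()
            for command in self.journal:
                self.engine.submit(command)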

  151. P(failure) = 0.8^2 = 0.64 or 64%

  152. P(failure) = 0.8^10 = 0.10 or 10%

  153. P(failure) = 0.8^20 = 0.01 or 1%

  154. “A beach house isn't just real estate. It's a
    state of mind.”
    Douglas Adams -
    Mostly Harmless (1992)

  155. Avoid failure through more reliable parts.

  156. Be resilient to failure when it occurs by
    controlling the scope and consequences.

  157. Service redundancy means embracing
    more failure, but with an increased
    opportunity for handling those failures.

  158. Service architecture provides the
    opportunity to trade likelihood of failure
    against the consequences of failure.

  159. Leveraging all these opportunities is
    critical to minimising, absorbing and
    reducing the impact of failure through
    our systems.

  160. Failure & Change:
    @markhibberd
    Principles of Reliable Systems
