Failure and Change: Principles of Reliable Systems

Mark Hibberd
November 30, 2018

As we construct larger or more complex systems, failure and change are
ever-present. We need to accept and even embrace these tensions to
build software that works and keeps working. This is a talk on building
and operating reliable systems. We will look at how systems fail,
particularly in the face of complexity or scale, and build up a set of
principles and practices that will help us implement, understand and
verify reliable systems.

Transcript

  1. Failure & Change:
    Principles of Reliable Systems
    @markhibberd

  2. Reliability: The quality of performing consistently well.

  3. Failure.

  4. A service.

  5. P(failure) = 0.1 or 10%

  6. Redundancy, let's do it!

  7. P(individual failure) = 0.1

  8. P(system failure) = 0.1^10

  9. P(system availability) = 1 - 0.1^10

  10. 99.99999999% Availability.

  11. Are failures really independent?

  12. P(failure) = 10% or 0.1

  13. P(individual success) = 1 - 0.1 = 0.9 or 90%

  14. P(all successes) = 0.9^10

  15. P(at least one failure) = 1 - 0.9^10 = 0.65 or 65%

  16. P(mutually assured destruction) = 1

  17. P(at least one failure) = 1 - 0.9^10 = 0.65 or 65%

  18. P(system failure) = 1 - 0.9^10 = 0.65 or 65%
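
The arithmetic on the last few slides is worth pinning down. A quick sketch of both cases (mine, not from the talk), assuming individual failures are independent:

    # p = probability one instance fails; n = number of redundant instances
    p, n = 0.1, 10

    # If any one surviving instance can serve a request, the system only
    # fails when all ten fail together:
    availability = 1 - p ** n          # 0.9999999999, "ten nines"

    # If a single instance failing breaks the system, redundancy works
    # against us: at least one failure becomes very likely:
    p_at_least_one = 1 - (1 - p) ** n  # ~0.65

    print(f"{availability:.10f} {p_at_least_one:.2f}")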

  19. Redundancy means embracing more
    failure, but with an increased
    opportunity for handling those failures.

  20. Building systems.

  22. Player pairing.

  23. Player pairing.
    Game play.

  24. Player pairing.
    Game play.
    History.

  25. Player pairing.
    Game play. Analyse games.
    History.

  26. Player pairing.
    Game play. Analyse games.
    History.
    Play engines.

  27. Player pairing.
    Game play. Analyse games.
    History.
    Play engines.
    Cheat detection.

  28. An argument for monoliths?

  29. No guarantee of independence.

  30. Magnification of consequences.

  31. Service architecture provides the
    opportunity to trade likelihood of failure
    against the consequences of failure.

  32. A service approach.

  33. A naive service approach.
    Game
    Player

  34. A naive service approach.
    Game
    Play History
    Chess
    Player
    Analyse

  35. A naive service approach.
    {
      "game-id": 1234,
      "game-state": "waiting-for-pair",
      "player-1": 5678,
      "player-2": null
    }

  36. A naive service approach.
    {
      "game-id": 1234,
      "game-state": "waiting-for-move",
      "player-1": 5678,
      "player-2": 9123
    }

  37. A naive service approach.
    Game
    Play History
    Chess
    Player
    Analyse

  38. A naive service approach.
    {
      "game-id": 1234,
      "game-state": "finished",
      "player-1": 5678,
      "player-2": 9123
    }

  39. A naive service approach.
    Game
    Play History
    Chess
    Player
    Analyse

  40. Game
    Play History
    Chess
    Player
    Analyse
    A naive service approach.
    What happens when the game
    service has an issue?

  41. A naive service approach.
    We have almost all the
    downsides of the monolithic
    approach, plus we have to
    deal with network interactions
    and the operational complexity
    of multiple services.
    Game
    Play History
    Chess
    Player
    Analyse

  42. A service approach.

  43. Games as a value instead of a service.
    1. e4 e5
    2. Bc4 Nc6
    3. Qh5 Nf6??
    4. Qxf7#
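
A minimal sketch of what a game-as-a-value could look like; the Game shape and method names here are illustrative, not from the talk:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Game:
        game_id: str
        moves: tuple  # SAN moves, e.g. ("e4", "e5", "Bc4", ...)

        def with_move(self, move: str) -> "Game":
            # A move produces a new value; the old one is unchanged, so a
            # game can be copied, stored and shipped between services freely.
            return Game(self.game_id, self.moves + (move,))

    scholars_mate = Game("13376a3e",
                         ("e4", "e5", "Bc4", "Nc6", "Qh5", "Nf6", "Qxf7#"))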

  44. Play Analyse Engine
    Chess
    Pair History

  45. Pair.

  46. Pair.
    Pairing can be thought of as negotiating a
    shared identifier.

  47. Pair.
    game-id: 13376a3e

  48. Pair.
    A party who wants to be paired has to be able
    to communicate its constraints and how it can
    be notified.

  49. {
      "notify": "http://waiting/123",
      "constraints": [
        { "game-time": "5+3" },
        { "min-rating": 1400 },
        { "max-rating": 1600 },
        { "ideal-rating": 1500 },
        { "max-wait": 30 }
      ]
    }
    Pair.

  50. Pair.
    The pairing service maintains an indexed list
    of waiting parties.

  51. Pair.
    Asynchronously pairing compatible partners,
    generating a unique identifier for the pair, and
    notifying both parties.

  52. Pair.
    Independence?

  53. Pair.
    Knows: About players waiting for a game.

  54. Pair.
    Knows: About players waiting for a game.
    Told: When a new player wants to pair.

  55. Pair.
    Knows: About players waiting for a game.
    Told: When a new player wants to pair.
    Asks: Nothing.

  56. Pair.
    Knows: About players waiting for a game.
    Told: When a new player wants to pair.
    Asks: Nothing.
    Response: Vends players their unique game id.
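
Pulling the Pair slides together, a hypothetical sketch of the whole responsibility; the flattened constraint dict and the injected notify callable are simplifications of the request shown earlier:

    import uuid

    waiting = []  # the indexed list of waiting parties (in-memory here)

    def compatible(a, b):
        # Illustrative check only: same time control, mutually acceptable ratings.
        return (a["game-time"] == b["game-time"]
                and a["min-rating"] <= b["ideal-rating"] <= a["max-rating"]
                and b["min-rating"] <= a["ideal-rating"] <= b["max-rating"])

    def pair(request, notify):
        for other in waiting:
            if compatible(request["constraints"], other["constraints"]):
                waiting.remove(other)
                game_id = uuid.uuid4().hex[:8]      # e.g. "13376a3e"
                notify(other["notify"], game_id)    # both parties learn
                notify(request["notify"], game_id)  # the shared identifier
                return game_id
        waiting.append(request)  # no match yet: wait, ask nothing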

  57. Play.

  58. Play.
    The playing service is responsible for just the
    in-flight games.

  59. Play.
    A playing pair have a shared identifier they
    can use to join a game in an initial state.

  60. Play Analyse Engine
    Chess
    Pair History

  61. Play Analyse Engine
    Chess
    Pair History

  62. Play Analyse Engine
    Chess
    Pair History

  63. Knows: About games in play.
    Play.

  64. Knows: About games in play.
    Told: When a player makes a move.
    Play.

  65. Knows: About games in play.
    Told: When a player makes a move.
    Asks: Nothing.
    Play.

  66. Knows: About games in play.
    Told: When a player makes a move.
    Asks: Nothing.
    Response: Updated game value.
    Play.
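
A sketch of the Play contract in the same spirit; the game shape and the checkmate test are toy stand-ins for real chess rules:

    in_flight = {}  # game-id -> game value, only while the game is in play

    def join(game_id: str):
        # A paired player uses the shared identifier to join an initial state.
        in_flight.setdefault(game_id, {"id": game_id, "moves": []})

    def make_move(game_id: str, move: str):
        game = in_flight[game_id]
        updated = {**game, "moves": game["moves"] + [move]}  # a new value
        if move.endswith("#"):        # toy rule: checkmate finishes the game
            del in_flight[game_id]    # finished games leave the service
        else:
            in_flight[game_id] = updated
        return updated                # the response is the updated game value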

  67. History.

  68. History.
    The history service is just a database of static
    games that can be searched.

  69. Play Analyse Engine
    Chess
    Pair History

  70. Play Analyse Engine
    Chess
    Pair History

  71. Play Analyse Engine
    Chess
    Pair History

  72. History.
    The history service gives out games as values.

  73. Knows: About historical game values.
    History.

  74. Knows: About historical game values.
    Told: New complete game values.
    History.

  75. Knows: About historical game values.
    Told: New complete game values.
    Asks: Nothing.
    History.

  76. Knows: About historical game values.
    Told: New complete game values.
    Asks: Nothing.
    Response: Complete game values.
    History.
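
History reduces to a store of immutable values, so a sketch is almost trivial; the dict-backed store is, of course, a stand-in for a real searchable database:

    history = {}  # game-id -> complete game value

    def record(game):
        history[game["id"]] = game  # told: new complete game values

    def search(predicate):
        # Responds with complete game values, handed out exactly as stored.
        return [g for g in history.values() if predicate(g)]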

  77. Analyse.

  78. Play Analyse Engine
    Chess
    Pair History

  79. Play Analyse Engine
    Chess
    Pair History

  80. Play Analyse Engine
    Chess
    Pair History

  81. Knows: About game analysis techniques.
    Analyse.

  82. Knows: About game analysis techniques.
    Told: Game value to analyse.
    Analyse.

  83. Knows: About game analysis techniques.
    Told: Game value to analyse.
    Asks: Nothing.
    Analyse.

  84. Knows: About game analysis techniques.
    Told: Game value to analyse.
    Asks: Nothing.
    Response: Annotated game values.
    Analyse.

  85. Play Analyse Engine
    Chess
    Pair History

  86. Independent responsibilities over
    shared nouns.

  87. Operating Systems.

  88. Play Analyse Engine
    Chess
    Pair History

  89. Play Analyse Engine
    Chess
    Pair History

  90. Engine

  91. Engine

  92. Engine

  93. Engine

  95. GET /status
    {
      "name": "engine",
      "version": "c0ffee",
      "stats": {…},
      "status": "ok"
    }

  96. GET /status
    {
      "name": "engine",
      "version": "c0ffee",
      "stats": {…},
      "status": "ko"
    }
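
One hedged sketch of how a caller might consume this /status contract: poll each node, treat anything other than a timely "ok" as unhealthy, and route around it. The URL shape is assumed from the slides:

    import json, urllib.request

    def healthy(node: str) -> bool:
        try:
            with urllib.request.urlopen(f"http://{node}/status", timeout=1) as r:
                return json.load(r).get("status") == "ok"
        except (OSError, ValueError):
            return False  # unreachable or garbled counts as unhealthy

    def in_rotation(nodes):
        return [n for n in nodes if healthy(n)]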

  98. 8 Fallacies of Distributed Computing
    - Bill Joy, Tom Lyon, L. Peter Deutsch, James Gosling
    1. The network is reliable.
    2. Latency is zero.
    3. Bandwidth is infinite.
    4. The network is secure.
    5. Topology doesn't change.
    6. There is one administrator.
    7. Transport cost is zero.
    8. The network is homogeneous.

  99. Timeouts save lives.
    timeout!
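
The point of the slide, sketched: bound every inter-service call. The endpoint here is hypothetical; the timeout is the part that matters:

    import urllib.request

    def analyse(game: str) -> bytes:
        req = urllib.request.Request("http://engine/analyse",  # assumed URL
                                     data=game.encode("utf-8"))
        # Without a timeout this call can block indefinitely on a sick
        # dependency; with one, the caller fails fast and can degrade.
        with urllib.request.urlopen(req, timeout=2) as response:
            return response.read()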

  100. Engine

  101. Engine
    ~14% of Traffic To Each Server

  102. Engine
    100k Requests = ~14k Requests Per Server

  103. Engine
    ~16% of Traffic To Each Server

  104. Engine
    100k Requests = ~16k Requests Per Server

  105. Engine
    14k -> 16k Requests or ~15% Increase

  106. Engine
    14k -> 20k Requests or ~40% Increase

  107. Engine
    14k -> 25k Requests or ~80% Increase
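
The load numbers on these slides come straight from dividing the traffic over the surviving pool of seven servers (the slides round slightly differently):

    total = 100_000
    baseline = total / 7  # ~14.3k per server with all 7 healthy
    for nodes in (7, 6, 5, 4):
        per_server = total / nodes
        print(f"{nodes} servers: ~{per_server / total:.0%} of traffic, "
              f"~{per_server / 1000:.0f}k requests each "
              f"(+{per_server / baseline - 1:.0%})")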

  108. Engine

  109. Play Analyse Engine
    Chess
    Pair History

  110. Serving some is better than serving none.

  112. GET /status
    {
      "name": "engine",
      "version": "c0ffee",
      "stats": {…},
      "status": "ok"
    }

  113. GET /status
    {
      "name": "chess",
      "version": "ae103",
      "stats": {…},
      "status": "ok"
    }

  114. GET /status
    {
      "name": "engine",
      "version": "c0ffee",
      "stats": {…},
      "status": "ko"
    }

  115. GET /status
    {
      "name": "chess",
      "version": "ae103",
      "stats": {…},
      "status": "ko"
    }

  116. Play Analyse Engine
    Chess
    Pair History

  117. Play Analyse Engine
    Chess
    Pair History

  118. Play Analyse Engine
    Chess
    Pair History

  119. Graceful degradation maintains independence.

  123. Changing Systems.

  129. Pair

  130. Pair Version N

  131. Pair Version N

  132. Pair Version N Pair Version N + 1

  133. Pair Version N Pair Version N + 1
    Chess

  135. Pair Version N Pair Version N + 1

  136. Pair Version N Pair Version N + 1

  146. In-production verification.

  147. Pair Version N Pair Version N + 1
    Read Write
    Read + Write
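
A sketch of that in-production verification wiring; version_n and version_n_plus_1 stand in for the two deployments, with N staying authoritative while N + 1 shadows it:

    import logging

    def handle(request, version_n, version_n_plus_1):
        expected = version_n(request)  # the response users actually get
        try:
            candidate = version_n_plus_1(request)  # shadow call, same input
            if candidate != expected:
                logging.warning("mismatch on %r: %r != %r",
                                request, candidate, expected)
        except Exception:
            logging.exception("version N+1 failed on %r", request)
        return expected  # users never see N+1's answer at this stage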

  148. Pair Version N Pair Version N + 1

  149. Pair Version N Pair Version N + 1

  150. Pair Version N Pair Version N + 1

  151. Pair Version N Pair Version N + 1

  152. Pair Version N Pair Version N + 1

  153. Pair Version N Pair Version N + 1

  154. Pair Version N Pair Version N + 1

  155. Incremental deployment.

  156. Pair Version N Pair Version N + 1
    Read + Write Read + Write

  157. Pair Version N Pair Version N + 1

  158. Pair Version N Pair Version N + 1

  159. Pair Version N Pair Version N + 1
    95%

  160. Pair Version N Pair Version N + 1
    5%

  161. Understand success.

  162. Pair Version N Pair Version N + 1
    5%
    200 OK

  163. Pair Version N Pair Version N + 1
    5%
    500 SERVER ERROR

  164. {
      "match-key": 1234231,
      "pair": ["c0ffee", "7ea"]
    }

  165. Average Pair Time: 1.1s

  166. Pair Version N Pair Version N + 1
    95% 5%

  167. Pair Version N Pair Version N + 1
    50% 50%

  168. Pair Version N Pair Version N + 1
    5% 95%

  169. Pair Version N Pair Version N + 1
    0% 100%

  170. Pair Version N Pair Version N + 1
    50% 50%

  171. Pair Version N Pair Version N + 1
    100% 0%
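
The roll-forward and roll-back on these slides is just weighted routing, sketched here with stand-in handlers; the weight ratchets up as the success metrics hold, or goes straight back down when they don't:

    import random

    def version_n(request):  # stand-ins for the two deployments
        return "paired by N"

    def version_n_plus_1(request):
        return "paired by N+1"

    def route(request, canary_weight=0.05):
        # canary_weight of traffic exercises the new version; the rest
        # stays on the version we already trust.
        if random.random() < canary_weight:
            return version_n_plus_1(request)
        return version_n(request)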

  172. Could your system survive shipping a bad
    line of code?

  173. Data and code need to have an independent
    lifecycle.

  174. Unreliable Parts.

  175. “construct reliable systems from
    unreliable parts … from the
    knowledge that any component
    in the system might fail”
    - Holzmann & Joshi, Reliable Software Systems Design

  176. The following is based on a true story.

  177. The context has been changed to
    protect the innocent.

  178. AI Super Engine

  179. choosing to ship unreliable code…
    AI Super Engine

  180. P(failure) = 0.8
    AI Super Engine

  181. What to do?

  182. Create a proxy to take control of interactions.

  183. Create a more reliable view of data.

  184. On failure we discard state.

  185. Use our journal to rebuild the state.

  186. This gives us the independence we need.
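
A sketch of that proxy; the engine interface (send, restart) and the failure signal are assumptions about the story's unreliable component, and the retry arithmetic matches the slides that follow:

    journal = []  # every accepted request, in order

    class EngineFailure(Exception):
        pass  # stand-in for however the flaky engine signals failure

    def call_engine(engine, request):
        journal.append(request)  # journal first, so state is rebuildable
        try:
            return engine.send(request)
        except EngineFailure:
            engine.restart()             # discard the engine's broken state
            for past in journal[:-1]:
                engine.send(past)        # replay the journal to rebuild it
            return engine.send(request)  # one retry: failure drops to 0.8^2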

  187. P(failure) = 0.8^2 = 0.64 or 64%

  188. P(failure) = 0.8^10 = 0.10 or 10%

  189. P(failure) = 0.8^20 = 0.01 or 1%

  190. “A beach house isn't just real estate. It's a
    state of mind.”
    - Douglas Adams, Mostly Harmless (1992)

  191. Avoid failure through more reliable parts.

  192. Be resilient to failure when it occurs by
    controlling the scope and consequences.

  193. Service redundancy means embracing
    more failure, but with an increased
    opportunity for handling those failures.

  194. Service architecture provides the
    opportunity to trade likelihood of failure
    against the consequences of failure.

  195. Leveraging all these opportunities is
    critical to minimising, absorbing and
    reducing the impact of failure through
    our systems.

  197. Failure & Change:
    Principles of Reliable Systems
    @markhibberd
