Failure and Change: Principles of Reliable Systems

As we construct larger and more complex systems, failure and change are
ever-present. We need to accept, and even embrace, these tensions to
build software that works and keeps working. This is a talk on building
and operating reliable systems. We will look at how systems fail,
particularly in the face of complexity or scale, and build up a set of
principles and practices that will help us implement, understand and
verify reliable systems.

Mark Hibberd

November 30, 2018

Transcript

  1. Failure & Change: @markhibberd Principles of Reliable Systems

  2. Reliability: The quality of performing consistently well.

  3. Failure.

  4. A service.

  5. P(failure) = 0.1 or 10%

  6. Redundancy, let's do it!

  7. P(individual failure) = 0.1

  8. P(system failure) = 0.1^10

  9. P(system availability) = 1 - 0.1^10

  10. 99.99999999% Availability.

  11. Are failures really independent?

  12. P(failure) = 10% or 0.1

  13. P(individual success) = 1 - 0.1 = 0.9 or 90%

  14. P(all successes) = 0.9^10

  15. P(at least one failure) = 1 - 0.9^10 = 0.65 or 65%

  16. P(mutually assured destruction) = 1

  17. P(at least one failure) = 1 - 0.9^10 = 0.65 or 65%

  18. P(system failure) = 1 - 0.9^10 = 0.65 or 65%

  19. Redundancy means embracing more failure, but with an increased opportunity for handling those failures.

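  A quick sketch of the arithmetic on the slides above (a minimal Python
  example; the ten instances and the 0.1 failure probability are the
  talk's own numbers):

      # Ten redundant instances, each failing independently with
      # probability 0.1 (the slides' numbers).
      p_fail = 0.1
      n = 10

      # If any one surviving instance can serve a request, the system
      # fails only when all ten fail together.
      p_system_failure = p_fail ** n
      print(f"availability: {1 - p_system_failure:.10%}")  # 99.9999999990%

      # If instead failures are coupled and one failure anywhere breaks
      # the request, what matters is the chance of at least one failure.
      p_at_least_one = 1 - (1 - p_fail) ** n
      print(f"at least one failure: {p_at_least_one:.0%}")  # 65%
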
  20. Building systems.

  21. (image-only slide)
  22. Player pairing.

  23. Player pairing. Game play.

  24. Player pairing. Game play. History.

  25. Player pairing. Game play. Analyse games. History.

  26. Player pairing. Game play. Analyse games. History. Play engines.

  27. Player pairing. Game play. Analyse games. History. Play engines. Cheat detection.

  28. An argument for monoliths?

  29. No guarantee of independence.

  30. Magnification of consequences.

  31. Service architecture provides the opportunity to trade likelihood of failure against the consequences of failure.

  32. A service approach.

  33. A naive service approach. Game Player

  34. A naive service approach. Game Play History Chess Player Analyse

  35. A naive service approach. { "game-id": 1234, "game-state": "waiting-for-pair", "player-1": 5678, "player-2": nil }

  36. A naive service approach. { "game-id": 1234, "game-state": "waiting-for-move", "player-1": 5678, "player-2": 9123 }

  37. A naive service approach. Game Play History Chess Player Analyse

  38. A naive service approach. { "game-id": 1234, "game-state": "finished", "player-1": 5678, "player-2": 9123 }

  39. A naive service approach. Game Play History Chess Player Analyse

  40. Game Play History Chess Player Analyse A naive service approach. What happens when the game service has an issue?

  41. A naive service approach. We have almost all the downsides of the monolithic approach, plus we have to deal with network interactions and the operational complexity of multiple services. Game Play History Chess Player Analyse

  42. A service approach.

  43. Games as a value instead of a service. 1. e4 e5 2. Bc4 Nc6 3. Qh5 Nf6?? 4. Qxf7#

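  A minimal sketch of a game as a value (Python; the Game type and its
  play method are illustrative, not from the talk). The moves are the
  Scholar's Mate line from the slide:

      from dataclasses import dataclass
      from typing import Tuple

      @dataclass(frozen=True)
      class Game:
          game_id: str
          moves: Tuple[str, ...] = ()

          def play(self, move: str) -> "Game":
              # Playing a move yields a new value; nothing is mutated,
              # so the same game can be stored, analysed and replayed
              # by independent services without coordination.
              return Game(self.game_id, self.moves + (move,))

      game = Game("13376a3e")
      for move in ["e4", "e5", "Bc4", "Nc6", "Qh5", "Nf6", "Qxf7#"]:
          game = game.play(move)
      print(game.moves)
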
  44. Play Analyse Engine Chess Pair History

  45. Pair.

  46. Pair. Pairing can be thought of as negotiating a shared identifier.

  47. Pair. game-id: 13376a3e

  48. Pair. A party who wants to be paired has to be able to communicate its constraints and how they can be notified.

  49. { notify: http://waiting/123, constraints: [ { game-time: 5+3 }, { min-rating: 1400 }, { max-rating: 1600 }, { ideal-rating: 1500 }, { max-wait: 30 } ] } Pair.

  50. Pair. The pairing service maintains an indexed list of waiting parties.

  51. Pair. Asynchronously pairing compatible partners, generating a unique identifier for the pair, and notifying both parties.

  52. Pair. Independence?

  53. Pair. Knows: About players waiting for a game.

  54. Pair. Knows: About players waiting for a game. Told: When a new player wants to pair.

  55. Pair. Knows: About players waiting for a game. Told: When a new player wants to pair. Asks: Nothing.

  56. Pair. Knows: About players waiting for a game. Told: When a new player wants to pair. Asks: Nothing. Response: Vends players their unique game id.

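  A minimal sketch of the pairing service in Python. The
  knows/told/asks/response shape follows slides 53-56; the matching
  rule, the rating field, and the in-memory storage are assumptions for
  illustration:

      import uuid

      class PairService:
          def __init__(self):
              self.waiting = []  # the indexed list of waiting parties

          def tell(self, party):
              # Told: a new player wants to pair. We ask nothing of any
              # other service; we use only what we already know.
              for other in self.waiting:
                  if self.compatible(party, other):
                      self.waiting.remove(other)
                      game_id = uuid.uuid4().hex[:8]  # e.g. "13376a3e"
                      # Response: vend both players their shared game id.
                      self.notify(party, game_id)
                      self.notify(other, game_id)
                      return
              self.waiting.append(party)

          def compatible(self, a, b):
              # Illustrative rule: each player's rating must sit inside
              # the other's constraints.
              return (a["constraints"]["min-rating"] <= b["rating"] <= a["constraints"]["max-rating"]
                      and b["constraints"]["min-rating"] <= a["rating"] <= b["constraints"]["max-rating"])

          def notify(self, party, game_id):
              # Stand-in for a POST to the party's notify address.
              print(f"notify {party['notify']}: game-id {game_id}")
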
  57. Play.

  58. Play. The playing service is responsible for just the in-flight games.

  59. Play. A playing pair have a shared identifier they can use to join a game in an initial state.

  60. Play Analyse Engine Chess Pair History

  61. Play Analyse Engine Chess Pair History

  62. Play Analyse Engine Chess Pair History

  63. Knows: About games in play. Play.

  64. Knows: About games in play. Told: When a player makes a move. Play.

  65. Knows: About games in play. Told: When a player makes a move. Asks: Nothing. Play.

  66. Knows: About games in play. Told: When a player makes a move. Asks: Nothing. Response: Updated game value. Play.

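  A matching sketch for play (illustrative Python): the service holds
  only in-flight games, is told moves, asks nothing, and responds with
  updated game values.

      class PlayService:
          def __init__(self):
              self.in_flight = {}  # game-id -> tuple of moves so far

          def join(self, game_id):
              # A paired couple join using the shared identifier vended
              # by the pairing service; the game starts in an initial state.
              self.in_flight.setdefault(game_id, ())

          def tell_move(self, game_id, move):
              # Told: a player makes a move. Response: the updated game
              # value, which callers can hand to history or analysis
              # without involving this service again.
              self.in_flight[game_id] = self.in_flight[game_id] + (move,)
              return self.in_flight[game_id]
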
  67. History.

  68. History. The history service is just a database of static games that can be searched.

  69. Play Analyse Engine Chess Pair History

  70. Play Analyse Engine Chess Pair History

  71. Play Analyse Engine Chess Pair History

  72. History. The history service gives out games as values.

  73. Knows: About historical game values. History.

  74. Knows: About historical game values. Told: New complete game values. History.

  75. Knows: About historical game values. Told: New complete game values. Asks: Nothing. History.

  76. Knows: About historical game values. Told: New complete game values. Asks: Nothing. Response: Complete game values. History.

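  History (and analyse, which has the same shape on the next slides)
  can be sketched the same way; responses are complete game values as
  data, never references into live state (illustrative Python):

      class HistoryService:
          def __init__(self):
              self.games = {}  # game-id -> complete game value

          def tell(self, game_id, game):
              # Told: a new complete game value.
              self.games[game_id] = game

          def search(self, predicate):
              # Response: complete game values; a caller that slows or
              # fails never holds this service's internals hostage.
              return [g for g in self.games.values() if predicate(g)]
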
  77. Analyse.

  78. Play Analyse Engine Chess Pair History

  79. Play Analyse Engine Chess Pair History

  80. Play Analyse Engine Chess Pair History

  81. Knows: About game analysis techniques. Analyse.

  82. Knows: About game analysis techniques. Told: Game value to analyse. Analyse.

  83. Knows: About game analysis techniques. Told: Game value to analyse. Asks: Nothing. Analyse.

  84. Knows: About game analysis techniques. Told: Game value to analyse. Asks: Nothing. Response: Annotated game values. Analyse.

  85. Play Analyse Engine Chess Pair History

  86. Independent responsibilities over shared nouns.

  87. Operating Systems.

  88. Play Analyse Engine Chess Pair History

  89. Play Analyse Engine Chess Pair History

  90. Engine

  91. Engine

  92. Engine

  93. Engine

  94. (image-only slide)
  95. GET /status { "name": "engine", "version": "c0ffee", "stats": {…}, "status": "ok" }

  96. GET /status { "name": "engine", "version": "c0ffee", "stats": {…}, "status": "ko" }

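  A minimal status endpoint in the shape of these slides (Python
  standard library; the check_health probe and the 503 mapping are
  assumptions, not the talk's):

      import json
      from http.server import BaseHTTPRequestHandler, HTTPServer

      def check_health() -> bool:
          return True  # placeholder for real downstream checks

      class StatusHandler(BaseHTTPRequestHandler):
          def do_GET(self):
              if self.path != "/status":
                  self.send_error(404)
                  return
              healthy = check_health()
              body = json.dumps({"name": "engine", "version": "c0ffee",
                                 "stats": {}, "status": "ok" if healthy else "ko"})
              # Many load balancers key off the status code, so mirror
              # "ko" with a non-200 response.
              self.send_response(200 if healthy else 503)
              self.send_header("Content-Type", "application/json")
              self.end_headers()
              self.wfile.write(body.encode())

      if __name__ == "__main__":
          HTTPServer(("", 8080), StatusHandler).serve_forever()
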
  97. (image-only slide)
  98. 8 Fallacies of Distributed Computing - Bill Joy, Tom Lyon, L. Peter Deutsch, James Gosling
    1. The network is reliable.
    2. Latency is zero.
    3. Bandwidth is infinite.
    4. The network is secure.
    5. Topology doesn't change.
    6. There is one administrator.
    7. Transport cost is zero.
    8. The network is homogeneous.

  99. Timeouts save lives. timeout!
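
  In code, that means every remote call gets a bound on how long we
  will wait. A minimal sketch with the Python standard library (the
  URL and the half-second budget are illustrative assumptions):

      import urllib.error
      import urllib.request

      def fetch_status(url, timeout_seconds=0.5):
          # A bounded wait turns an unresponsive dependency into an
          # error we can handle, instead of a request we lose forever.
          try:
              with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
                  return response.read()
          except (urllib.error.URLError, TimeoutError):
              return None  # the caller can degrade or retry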

  100. Engine

  101. Engine ~14% of Traffic To Each Server

  102. Engine 100k Requests = ~14k Request Per Server

  103. Engine ~16% of Traffic To Each Server

  104. Engine 100k Requests = ~16k Request Per Server

  105. Engine 14k -> 16k Requests or ~15% Increase

  106. Engine 14k -> 20k Requests or ~40% Increase

  107. Engine 14k -> 25k Requests or ~80% Increase

  108. Engine
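
  The arithmetic behind slides 101-107, as a worked example in Python
  (the slides round the percentages slightly differently): 100k
  requests over seven engine servers is roughly 14k each, and every
  lost server pushes the survivors' share up sharply.

      requests = 100_000
      baseline = requests / 7  # even split over seven servers

      for servers in [7, 6, 5, 4]:
          per_server = requests / servers
          increase = per_server / baseline - 1
          print(f"{servers} servers: ~{per_server / 1000:.0f}k each "
                f"(+{increase:.0%} over seven)")

      # 7 servers: ~14k each (+0% over seven)
      # 6 servers: ~17k each (+17% over seven)
      # 5 servers: ~20k each (+40% over seven)
      # 4 servers: ~25k each (+75% over seven)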

  109. Play Analyse Engine Chess Pair History

  110. Serving some is better than serving none.

  111. (image-only slide)
  112. GET /status { "name": "engine", "version": "c0ffee", "stats": {…}, "status": "ok" }

  113. GET /status { "name": "chess", "version": "ae103", "stats": {…}, "status": "ok" }

  114. GET /status { "name": "engine", "version": "c0ffee", "stats": {…}, "status": "ko" }

  115. GET /status { "name": "chess", "version": "ae103", "stats": {…}, "status": "ko" }

  116. Play Analyse Engine Chess Pair History

  117. Play Analyse Engine Chess Pair History

  118. Play Analyse Engine Chess Pair History

  119. Graceful degradation maintains independence.
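
  Degradation in code is often just a guarded call. A sketch (Python;
  the annotation functions are stand-ins) of analysis that still
  answers when the engine service is down:

      class ServiceUnavailable(Exception):
          pass

      def basic_annotations(game):
          return ["material count"]  # stand-in for cheap local analysis

      def engine_annotations(game):
          raise ServiceUnavailable()  # stand-in for a failing remote call

      def analyse(game):
          # If the engine dependency fails, degrade rather than fail:
          # serving some is better than serving none.
          annotations = basic_annotations(game)
          try:
              annotations += engine_annotations(game)
          except ServiceUnavailable:
              pass
          return annotations

      print(analyse(("e4", "e5")))  # ['material count'], engine down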

  120-122. (image-only slides)
  123. Changing Systems.

  124-128. (image-only slides)
  129. Pair

  130. Pair Version N

  131. Pair Version N

  132. Pair Version N Pair Version N + 1

  133. Pair Version N Pair Version N + 1 Chess

  134. (image-only slide)
  135. Pair Version N Pair Version N + 1

  136. Pair Version N Pair Version N + 1

  137-145. (image-only slides)
  146. In-production verification.

  147. Pair Version N Pair Version N + 1 Read Write Read + Write

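  In code, verification like this can be a shadow read: version N
  still produces the answer users see, while the same request is
  mirrored to version N + 1 and divergences are logged. A minimal
  sketch (Python; names are illustrative):

      def shadow_read(request, version_n, version_n1, log=print):
          answer = version_n(request)  # the answer users actually see
          try:
              candidate = version_n1(request)  # mirrored read, result unused
              if candidate != answer:
                  log(f"divergence on {request!r}")
          except Exception as e:
              log(f"version n+1 failed on {request!r}: {e}")
          return answer
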
  148. Pair Version N Pair Version N + 1

  149. Pair Version N Pair Version N + 1

  150. Pair Version N Pair Version N + 1

  151. Pair Version N Pair Version N + 1

  152. Pair Version N Pair Version N + 1

  153. Pair Version N Pair Version N + 1

  154. Pair Version N Pair Version N + 1

  155. Incremental deployment.

  156. Pair Version N Pair Version N + 1 Read + Write Read + Write

  157. Pair Version N Pair Version N + 1

  158. Pair Version N Pair Version N + 1

  159. Pair Version N Pair Version N + 1 95%

  160. Pair Version N Pair Version N + 1 5%

  161. Understand success.

  162. Pair Version N Pair Version N + 1 5% 200 OK

  163. Pair Version N Pair Version N + 1 5% 500 SERVER ERROR

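  A sketch of the traffic split (Python; the 5% fraction matches the
  slides, while the routing rule and error counters are assumptions).
  Understanding success means measuring both versions' responses
  before shifting more traffic:

      import random

      errors = {"n": 0, "n+1": 0}

      def route(request, version_n, version_n1, fraction_n1=0.05):
          name, handler = (("n+1", version_n1) if random.random() < fraction_n1
                           else ("n", version_n))
          try:
              return handler(request)  # e.g. 200 OK from either version
          except Exception:
              # A rising n+1 error rate is the signal to roll back
              # to 100% version N.
              errors[name] += 1
              raise
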
  164. { "match-key": 1234231, "pair": ["c0ffee", "7ea"] }

  165. Average Pair Time: 1.1s

  166. Pair Version N Pair Version N + 1 95% 5%

  167. Pair Version N Pair Version N + 1 50% 50%

  168. Pair Version N Pair Version N + 1 5% 95%

  169. Pair Version N Pair Version N + 1 0% 100%

  170. Pair Version N Pair Version N + 1 50% 50%

  171. Pair Version N Pair Version N + 1 100% 0%

  172. Could your system survive shipping a bad line of code?

  173. Data and code need to have an independent lifecycle.

  174. Unreliable Parts.

  175. “construct reliable systems from unreliable parts … from the knowledge that any component in the system might fail” - Holzmann & Joshi, Reliable Software Systems Design

  176. The following is based on a true story.

  177. The context has been changed to protect the innocent.

  178. AI Super Engine

  179. choosing to ship unreliable code… AI Super Engine

  180. P(failure) = 0.8 AI Super Engine

  181. What to do?

  182. Create a proxy to take control of interactions.

  183. Create a more reliable view of data.

  184. On failure we discard state.

  185. Use our journal to rebuild the state.

  186. This gives us the independence we need.

  187. P(failure) = 0.8^2 = 0.64 or 64%

  188. P(failure) = 0.8^10 = 0.10 or 10%

  189. P(failure) = 0.8^20 = 0.01 or 1%
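
  A sketch of the proxy's retry loop (Python; names are illustrative).
  Discarding state and rebuilding from the journal is what makes the
  attempts independent, so the 0.8^n arithmetic above applies:

      import random

      def call_ai_super_engine():
          # Stand-in for the real, unreliable engine: fails 80% of calls.
          if random.random() < 0.8:
              raise RuntimeError("engine fell over")
          return "move"

      def reliable_call(attempts=20):
          # P(all attempts fail) = 0.8^20, roughly 1%, assuming the
          # failures are independent.
          for _ in range(attempts):
              try:
                  return call_ai_super_engine()
              except RuntimeError:
                  continue  # discard state, rebuild from the journal, retry
          raise RuntimeError("engine unavailable after retries")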

  190. “A beach house isn't just real estate. It's a state of mind.” Douglas Adams - Mostly Harmless (1992)

  191. Avoid failure through more reliable parts.

  192. Be resilient to failure when it occurs by controlling the scope and consequences.

  193. Service redundancy means embracing more failure, but with an increased opportunity for handling those failures.

  194. Service architecture provides the opportunity to trade likelihood of failure against the consequences of failure.

  195. Leveraging all these opportunities is critical to minimising, absorbing and reducing the impact of failure through our systems.

  196. (image-only slide)
  197. Failure & Change: @markhibberd Principles of Reliable Systems