Slide 1

Slide 1 text

Failure & Change: @markhibberd Principles of Reliable Systems

Slide 2

Slide 2 text

Reliability: The quality of performing consistently well.

Slide 3

Slide 3 text

Failure.

Slide 4

Slide 4 text

A service.

Slide 5

Slide 5 text

P(failure) = 0.1 or 10%

Slide 6

Slide 6 text

Redundancy, let's do it!

Slide 7

Slide 7 text

P(individual failure) = 0.1

Slide 8

Slide 8 text

P(system failure) = 0.1^10

Slide 9

Slide 9 text

P(system availability) = 1 - 0.1^10

Slide 10

Slide 10 text

99.99999999% Availability.
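If the ten redundant instances really did fail independently, the availability figure above falls out of one line of arithmetic. A sketch of the calculation (not code from the talk):

```python
# Per-instance failure probability from the earlier slide.
p_fail = 0.1

# With n independent redundant instances, the system is only down
# when all n fail at the same time.
n = 10
p_system_failure = p_fail ** n       # 0.1^10 = 1e-10
availability = 1 - p_system_failure  # the 99.99999999% on this slide
```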

Slide 11

Slide 11 text

Are failures really independent?

Slide 12

Slide 12 text

P(failure) = 10% or 0.1

Slide 13

Slide 13 text

P(individual success) = 1 - 0.1 = 0.9 or 90%

Slide 14

Slide 14 text

P(all successes) = 0.9^10

Slide 15

Slide 15 text

P(at least one failure) = 1 - 0.9^10 = 0.65 or 65%

Slide 16

Slide 16 text

P(mutually assured destruction) = 1

Slide 17

Slide 17 text

P(at least one failure) = 1 - 0.9^10 = 0.65 or 65%

Slide 18

Slide 18 text

P(system failure) = 1 - 0.9^10 = 0.65 or 65%
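The flip from ten nines to 65% is the same complement rule, now applied to a system that is hurt by any single failure rather than saved by any single success. A sketch of the arithmetic:

```python
p_fail = 0.1
p_success = 1 - p_fail             # 0.9 per instance

n = 10
# Every instance must succeed for the system to succeed...
p_all_succeed = p_success ** n     # 0.9^10 ≈ 0.35
# ...so the chance of at least one failure is the complement.
p_at_least_one_failure = 1 - p_all_succeed   # ≈ 0.65, the 65% above
```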

Slide 19

Slide 19 text

Redundancy means embracing more failure, but with an increased opportunity for handling those failures.

Slide 20

Slide 20 text

Building systems.

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

Player pairing.

Slide 23

Slide 23 text

Player pairing. Game play.

Slide 24

Slide 24 text

Player pairing. Game play. History.

Slide 25

Slide 25 text

Player pairing. Game play. Analyse games. History.

Slide 26

Slide 26 text

Player pairing. Game play. Analyse games. History. Play engines.

Slide 27

Slide 27 text

Player pairing. Game play. Analyse games. History. Play engines. Cheat detection.

Slide 28

Slide 28 text

An argument for monoliths?

Slide 29

Slide 29 text

No guarantee of independence.

Slide 30

Slide 30 text

Magnification of consequences.

Slide 31

Slide 31 text

Service architecture provides the opportunity to trade likelihood of failure against the consequences of failure.

Slide 32

Slide 32 text

A service approach.

Slide 33

Slide 33 text

A naive service approach. Game Player

Slide 34

Slide 34 text

A naive service approach. Game Play History Chess Player Analyse

Slide 35

Slide 35 text

A naive service approach. { "game-id": 1234, "game-state": "waiting-for-pair", "player-1": 5678, "player-2": null }

Slide 36

Slide 36 text

A naive service approach. { "game-id": 1234, "game-state": "waiting-for-move", "player-1": 5678, "player-2": 9123 }

Slide 37

Slide 37 text

A naive service approach. Game Play History Chess Player Analyse

Slide 38

Slide 38 text

A naive service approach. { "game-id": 1234, "game-state": "finished", "player-1": 5678, "player-2": 9123 }

Slide 39

Slide 39 text

A naive service approach. Game Play History Chess Player Analyse

Slide 40

Slide 40 text

Game Play History Chess Player Analyse A naive service approach. What happens when the game service has an issue?

Slide 41

Slide 41 text

A naive service approach. We have almost all the downsides of the monolithic approach, plus we have to deal with network interactions and the operational complexity of multiple services. Game Play History Chess Player Analyse

Slide 42

Slide 42 text

A service approach.

Slide 43

Slide 43 text

Games as a value instead of a service. 1. e4 e5 2. Bc4 Nc6 3. Qh5 Nf6?? 4. Qxf7#

Slide 44

Slide 44 text

Play Analyse Engine Chess Pair History

Slide 45

Slide 45 text

Pair.

Slide 46

Slide 46 text

Pair. Pairing can be thought of as negotiating a shared identifier.

Slide 47

Slide 47 text

Pair. game-id: 13376a3e

Slide 48

Slide 48 text

Pair. A party who wants to be paired has to be able to communicate its constraints and how they can be notified.

Slide 49

Slide 49 text

{ "notify": "http://waiting/123", "constraints": [ { "game-time": "5+3" }, { "min-rating": 1400 }, { "max-rating": 1600 }, { "ideal-rating": 1500 }, { "max-wait": 30 } ] } Pair.

Slide 50

Slide 50 text

Pair. The pairing service maintains an indexed list of waiting parties.

Slide 51

Slide 51 text

Pair. Asynchronously pairing compatible partners, generating a unique identifier for the pair, and notifying both parties.
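The matching loop described on these slides might look something like the following sketch. The dict shape mirrors the constraints slide, but the function names and the first-match policy are illustrative assumptions, not the talk's implementation:

```python
import uuid

def compatible(a, b):
    # Two waiting parties match when they want the same time control
    # and each one's rating falls inside the other's constraints.
    return (
        a["game-time"] == b["game-time"]
        and a["min-rating"] <= b["rating"] <= a["max-rating"]
        and b["min-rating"] <= a["rating"] <= b["max-rating"]
    )

def pair_waiting(waiting, notify):
    # Scan the waiting list for the first compatible pair, generate a
    # shared game id, remove both parties, and notify each of them.
    for i, a in enumerate(waiting):
        for b in waiting[i + 1:]:
            if compatible(a, b):
                game_id = uuid.uuid4().hex[:8]   # e.g. "13376a3e"
                waiting.remove(a)
                waiting.remove(b)
                notify(a, game_id)
                notify(b, game_id)
                return game_id
    return None
```

A real pairing service would run this asynchronously off an indexed waiting list rather than a linear scan, but the shape is the same: negotiate one shared identifier, hand it to both sides.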

Slide 52

Slide 52 text

Pair. Independence?

Slide 53

Slide 53 text

Pair. Knows: About players waiting for a game.

Slide 54

Slide 54 text

Pair. Knows: About players waiting for a game. Told: When a new player wants to pair.

Slide 55

Slide 55 text

Pair. Knows: About players waiting for a game. Told: When a new player wants to pair. Asks: Nothing.

Slide 56

Slide 56 text

Pair. Knows: About players waiting for a game. Told: When a new player wants to pair. Asks: Nothing. Response: Vends players their unique game id.

Slide 57

Slide 57 text

Play.

Slide 58

Slide 58 text

Play. The playing service is responsible for just the in-flight games.

Slide 59

Slide 59 text

Play. A playing pair have a shared identifier they can use to join a game in an initial state.

Slide 60

Slide 60 text

Play Analyse Engine Chess Pair History

Slide 61

Slide 61 text

Play Analyse Engine Chess Pair History

Slide 62

Slide 62 text

Play Analyse Engine Chess Pair History

Slide 63

Slide 63 text

Knows: About games in play. Play.

Slide 64

Slide 64 text

Knows: About games in play. Told: When a player makes a move. Play.

Slide 65

Slide 65 text

Knows: About games in play. Told: When a player makes a move. Asks: Nothing. Play.

Slide 66

Slide 66 text

Knows: About games in play. Told: When a player makes a move. Asks: Nothing. Response: Updated game value. Play.

Slide 67

Slide 67 text

History.

Slide 68

Slide 68 text

History. The history service is just a database of static games that can be searched.

Slide 69

Slide 69 text

Play Analyse Engine Chess Pair History

Slide 70

Slide 70 text

Play Analyse Engine Chess Pair History

Slide 71

Slide 71 text

Play Analyse Engine Chess Pair History

Slide 72

Slide 72 text

History. The history service gives out games as values.

Slide 73

Slide 73 text

Knows: About historical game values. History.

Slide 74

Slide 74 text

Knows: About historical game values. Told: New complete game values. History.

Slide 75

Slide 75 text

Knows: About historical game values. Told: New complete game values. Asks: Nothing. History.

Slide 76

Slide 76 text

Knows: About historical game values. Told: New complete game values. Asks: Nothing. Response: Complete game values. History.

Slide 77

Slide 77 text

Analyse.

Slide 78

Slide 78 text

Play Analyse Engine Chess Pair History

Slide 79

Slide 79 text

Play Analyse Engine Chess Pair History

Slide 80

Slide 80 text

Play Analyse Engine Chess Pair History

Slide 81

Slide 81 text

Knows: About game analysis techniques. Analyse.

Slide 82

Slide 82 text

Knows: About game analysis techniques. Told: Game value to analyse. Analyse.

Slide 83

Slide 83 text

Knows: About game analysis techniques. Told: Game value to analyse. Asks: Nothing. Analyse.

Slide 84

Slide 84 text

Knows: About game analysis techniques. Told: Game value to analyse. Asks: Nothing. Response: Annotated game values. Analyse.

Slide 85

Slide 85 text

Play Analyse Engine Chess Pair History

Slide 86

Slide 86 text

Independent responsibilities over shared nouns.

Slide 87

Slide 87 text

Operating Systems.

Slide 88

Slide 88 text

Play Analyse Engine Chess Pair History

Slide 89

Slide 89 text

Play Analyse Engine Chess Pair History

Slide 90

Slide 90 text

Engine

Slide 91

Slide 91 text

Engine

Slide 92

Slide 92 text

Engine

Slide 93

Slide 93 text

Engine

Slide 94

Slide 94 text

No content

Slide 95

Slide 95 text

GET /status { "name": "engine", "version": "c0ffee", "stats": {…}, "status": "ok" }

Slide 96

Slide 96 text

GET /status { "name": "engine", "version": "c0ffee", "stats": {…}, "status": "ko" }
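A load balancer or supervisor can act on that payload directly: keep "ok" nodes in rotation and pull "ko" nodes out. A minimal sketch, assuming the payload shape shown on these slides:

```python
def healthy(status_payload):
    # A node stays in rotation only while /status reports "ok".
    return status_payload.get("status") == "ok"

def in_rotation(status_payloads):
    # Filter the fleet down to the nodes currently reporting healthy.
    return [s for s in status_payloads if healthy(s)]
```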

Slide 97

Slide 97 text

No content

Slide 98

Slide 98 text

8 Fallacies of Distributed Computing - Bill Joy, Tom Lyon, L. Peter Deutsch, James Gosling 1. The network is reliable. 2. Latency is zero. 3. Bandwidth is infinite. 4. The network is secure. 5. Topology doesn't change. 6. There is one administrator. 7. Transport cost is zero. 8. The network is homogeneous.

Slide 99

Slide 99 text

Timeouts save lives. timeout!

Slide 100

Slide 100 text

Engine

Slide 101

Slide 101 text

Engine ~14% of Traffic To Each Server

Slide 102

Slide 102 text

Engine 100k Requests = ~14k Requests Per Server

Slide 103

Slide 103 text

Engine ~16% of Traffic To Each Server

Slide 104

Slide 104 text

Engine 100k Requests = ~16k Requests Per Server

Slide 105

Slide 105 text

Engine 14k -> 16k Requests or ~15% Increase

Slide 106

Slide 106 text

Engine 14k -> 20k Requests or ~40% Increase

Slide 107

Slide 107 text

Engine 14k -> 25k Requests or ~80% Increase
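The numbers on the last few slides follow from simple division, assuming traffic is spread evenly and the fleet starts at seven servers (inferred from the ~14% figure):

```python
def per_server_load(total_requests, live_servers):
    # Even spread: each surviving server absorbs an equal share.
    return total_requests / live_servers

total = 100_000
seven = per_server_load(total, 7)   # ~14.3k each (~14% of traffic)
six = per_server_load(total, 6)     # ~16.7k each after losing one server
five = per_server_load(total, 5)    # 20k each after losing two
four = per_server_load(total, 4)    # 25k each after losing three
```

Each lost server pushes its share onto the survivors, so the per-server increase accelerates as the fleet shrinks; the slides' ~15%/~40%/~80% figures are these ratios rounded from the ~14k baseline.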

Slide 108

Slide 108 text

Engine

Slide 109

Slide 109 text

Play Analyse Engine Chess Pair History

Slide 110

Slide 110 text

Serving some is better than serving none.

Slide 111

Slide 111 text

No content

Slide 112

Slide 112 text

GET /status { "name": "engine", "version": "c0ffee", "stats": {…}, "status": "ok" }

Slide 113

Slide 113 text

GET /status { "name": "chess", "version": "ae103", "stats": {…}, "status": "ok" }

Slide 114

Slide 114 text

GET /status { "name": "engine", "version": "c0ffee", "stats": {…}, "status": "ko" }

Slide 115

Slide 115 text

GET /status { "name": "chess", "version": "ae103", "stats": {…}, "status": "ko" }

Slide 116

Slide 116 text

Play Analyse Engine Chess Pair History

Slide 117

Slide 117 text

Play Analyse Engine Chess Pair History

Slide 118

Slide 118 text

Play Analyse Engine Chess Pair History

Slide 119

Slide 119 text

Graceful degradation maintains independence.

Slide 120

Slide 120 text

No content

Slide 121

Slide 121 text

No content

Slide 122

Slide 122 text

No content

Slide 123

Slide 123 text

Changing Systems.

Slide 124

Slide 124 text

No content

Slide 125

Slide 125 text

No content

Slide 126

Slide 126 text

No content

Slide 127

Slide 127 text

No content

Slide 128

Slide 128 text

No content

Slide 129

Slide 129 text

Pair

Slide 130

Slide 130 text

Pair Version N

Slide 131

Slide 131 text

Pair Version N

Slide 132

Slide 132 text

Pair Version N Pair Version N + 1

Slide 133

Slide 133 text

Pair Version N Pair Version N + 1 Chess

Slide 134

Slide 134 text

No content

Slide 135

Slide 135 text

Pair Version N Pair Version N + 1

Slide 136

Slide 136 text

Pair Version N Pair Version N + 1

Slide 137

Slide 137 text

No content

Slide 138

Slide 138 text

No content

Slide 139

Slide 139 text

No content

Slide 140

Slide 140 text

No content

Slide 141

Slide 141 text

No content

Slide 142

Slide 142 text

No content

Slide 143

Slide 143 text

No content

Slide 144

Slide 144 text

No content

Slide 145

Slide 145 text

No content

Slide 146

Slide 146 text

In-production verification.

Slide 147

Slide 147 text

Pair Version N Pair Version N + 1 Read Write Read + Write
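One reading of the Read/Write split above: version N keeps serving callers while version N+1 receives mirrored traffic whose answers are compared but never returned. A sketch of that shadowing proxy (the names and shape are assumptions, not the talk's code):

```python
def shadow(primary, candidate, record_mismatch):
    # Serve every request from the primary version, mirror it to the
    # candidate version, and record any disagreement for inspection.
    # The candidate's answer (or crash) never reaches the caller.
    def handler(request):
        expected = primary(request)
        try:
            actual = candidate(request)
            if actual != expected:
                record_mismatch(request, expected, actual)
        except Exception as err:
            record_mismatch(request, expected, err)
        return expected
    return handler
```

This is what makes in-production verification safe: the new version sees real traffic, but a bad line of code in it costs a log entry, not an outage.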

Slide 148

Slide 148 text

Pair Version N Pair Version N + 1

Slide 149

Slide 149 text

Pair Version N Pair Version N + 1

Slide 150

Slide 150 text

Pair Version N Pair Version N + 1

Slide 151

Slide 151 text

Pair Version N Pair Version N + 1

Slide 152

Slide 152 text

Pair Version N Pair Version N + 1

Slide 153

Slide 153 text

Pair Version N Pair Version N + 1

Slide 154

Slide 154 text

Pair Version N Pair Version N + 1

Slide 155

Slide 155 text

Incremental deployment.

Slide 156

Slide 156 text

Pair Version N Pair Version N + 1 Read + Write Read + Write

Slide 157

Slide 157 text

Pair Version N Pair Version N + 1

Slide 158

Slide 158 text

Pair Version N Pair Version N + 1

Slide 159

Slide 159 text

Pair Version N Pair Version N + 1 95%

Slide 160

Slide 160 text

Pair Version N Pair Version N + 1 5%

Slide 161

Slide 161 text

Understand success.

Slide 162

Slide 162 text

Pair Version N Pair Version N + 1 5% 200 OK

Slide 163

Slide 163 text

Pair Version N Pair Version N + 1 5% 500 SERVER ERROR

Slide 164

Slide 164 text

{ "match-key": 1234231, "pair": ["c0ffee", "7ea"] }

Slide 165

Slide 165 text

Average Pair Time: 1.1s

Slide 166

Slide 166 text

Pair Version N Pair Version N + 1 95% 5%

Slide 167

Slide 167 text

Pair Version N Pair Version N + 1 50% 50%

Slide 168

Slide 168 text

Pair Version N Pair Version N + 1 5% 95%

Slide 169

Slide 169 text

Pair Version N Pair Version N + 1 0% 100%

Slide 170

Slide 170 text

Pair Version N Pair Version N + 1 50% 50%

Slide 171

Slide 171 text

Pair Version N Pair Version N + 1 100% 0%
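The percentage splits on the preceding slides can be driven by a small weighted router in front of the two versions. A sketch (the weight table is the only real input; the rest is illustrative):

```python
import random

def route(weights, rand=random.random):
    # Pick a version for one request in proportion to its weight,
    # e.g. {"version-n": 0.95, "version-n+1": 0.05} for a 5% canary.
    r = rand()
    cumulative = 0.0
    for version, weight in weights.items():
        cumulative += weight
        if r < cumulative:
            return version
    return version  # float-rounding guard: fall back to the last version
```

Rolling forward is just editing the weight table (95/5 → 50/50 → 5/95 → 0/100), and rolling back is the same edit in reverse.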

Slide 172

Slide 172 text

Could your system survive shipping a bad line of code?

Slide 173

Slide 173 text

Data and code need to have independent lifecycles.

Slide 174

Slide 174 text

Unreliable Parts.

Slide 175

Slide 175 text

“construct reliable systems from unreliable parts … from the knowledge that any component in the system might fail” - Holzmann & Joshi, Reliable Software Systems Design

Slide 176

Slide 176 text

The following is based on a true story.

Slide 177

Slide 177 text

The context has been changed to protect the innocent.

Slide 178

Slide 178 text

AI Super Engine

Slide 179

Slide 179 text

choosing to ship unreliable code… AI Super Engine

Slide 180

Slide 180 text

P(failure) = 0.8 AI Super Engine

Slide 181

Slide 181 text

What to do?

Slide 182

Slide 182 text

Create a proxy to take control of interactions.

Slide 183

Slide 183 text

Create a more reliable view of data.

Slide 184

Slide 184 text

On failure we discard state.

Slide 185

Slide 185 text

Use our journal to rebuild the state.

Slide 186

Slide 186 text

This gives us the independence we need.

Slide 187

Slide 187 text

P(failure) = 0.8^2 = 0.64 or 64%

Slide 188

Slide 188 text

P(failure) = 0.8^10 = 0.10 or 10%

Slide 189

Slide 189 text

P(failure) = 0.8^20 = 0.01 or 1%
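These three slides are the complement trick from earlier, now working in our favour: each independent retry multiplies the failure probability down. A sketch of the arithmetic:

```python
def p_all_attempts_fail(p_single_failure, attempts):
    # The operation only fails outright if every independent
    # attempt (the original call plus each retry) fails.
    return p_single_failure ** attempts

two = p_all_attempts_fail(0.8, 2)      # 0.64
ten = p_all_attempts_fail(0.8, 10)     # ≈ 0.107, roughly 10%
twenty = p_all_attempts_fail(0.8, 20)  # ≈ 0.012, roughly 1%
```

The exponent only helps if the attempts really are independent, which is exactly what discarding state and rebuilding from the journal buys.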

Slide 190

Slide 190 text

“A beach house isn't just real estate. It's a state of mind.” Douglas Adams - Mostly Harmless (1992)

Slide 191

Slide 191 text

Avoid failure through more reliable parts.

Slide 192

Slide 192 text

Be resilient to failure when it occurs by controlling the scope and consequences.

Slide 193

Slide 193 text

Service redundancy means embracing more failure, but with an increased opportunity for handling those failures.

Slide 194

Slide 194 text

Service architecture provides the opportunity to trade likelihood of failure against the consequences of failure.

Slide 195

Slide 195 text

Leveraging all these opportunities is critical to minimising, absorbing, and reducing the impact of failure throughout our systems.

Slide 196

Slide 196 text

No content

Slide 197

Slide 197 text

Failure & Change: @markhibberd Principles of Reliable Systems