Failure & Change: Principles of Reliable Systems
@markhibberd
Slide 2
Slide 2 text
Reliability: The quality of performing consistently well.
Slide 3
Slide 3 text
Failure.
Slide 4
Slide 4 text
A service.
Slide 5
Slide 5 text
P(failure) = 0.1 or 10%
Slide 6
Slide 6 text
Redundancy, let's do it!
Slide 7
Slide 7 text
P(individual failure) = 0.1
Slide 8
Slide 8 text
P(system failure) = 0.1^10
Slide 9
Slide 9 text
P(system availability) = 1 - 0.1^10
Slide 10
Slide 10 text
99.99999999% Availability.
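The arithmetic on the preceding slides can be checked with a short sketch. It assumes, as those slides do, that the ten replicas fail independently of one another; the function name is illustrative only.

```python
def system_failure_probability(p_individual: float, replicas: int) -> float:
    """Probability that every replica fails at once, assuming independent failures."""
    return p_individual ** replicas


p_fail = system_failure_probability(0.1, 10)
availability = 1 - p_fail

print(f"P(system failure) = {p_fail:.0e}")
print(f"P(system availability) = {availability:.8%}")
```

The result only holds if the independence assumption does, which is exactly what the next slide questions.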
Slide 11
Slide 11 text
Are failures really independent?
Slide 12
Slide 12 text
P(failure) = 10% or 0.1
Slide 13
Slide 13 text
P(individual success) = 1 - 0.1 = 0.9 or 90%
Slide 14
Slide 14 text
P(all successes) = 0.9^10
Slide 15
Slide 15 text
P(at least one failure) = 1 - 0.9^10 = 0.65 or 65%
Slide 16
Slide 16 text
P(mutually assured destruction) = 1
Slide 17
Slide 17 text
P(at least one failure) = 1 - 0.9^10 = 0.65 or 65%
Slide 18
Slide 18 text
P(system failure) = 1 - 0.9^10 = 0.65 or 65%
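A minimal sketch of the calculation above, assuming each of the ten instances fails independently with probability 0.1. The function name is an assumption made for illustration.

```python
def p_at_least_one_failure(p_individual: float, instances: int) -> float:
    """Chance that one or more of `instances` independent instances fails."""
    p_all_succeed = (1 - p_individual) ** instances
    return 1 - p_all_succeed


# More instances means more chances for something, somewhere, to fail.
for n in (1, 2, 5, 10):
    print(n, round(p_at_least_one_failure(0.1, n), 2))
```

If a single failure takes everything down, this number is also the probability of system failure, which is the point of the last slide in the sequence.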
Slide 19
Slide 19 text
Redundancy means embracing more
failure, but with an increased
opportunity for handling those failures.
Slide 20
Slide 20 text
Building systems.
Slide 22
Slide 22 text
Player pairing.
Slide 23
Slide 23 text
Player pairing.
Game play.
Slide 24
Slide 24 text
Player pairing.
Game play.
History.
Slide 25
Slide 25 text
Player pairing.
Game play. Analyse games.
History.
Slide 26
Slide 26 text
Player pairing.
Game play. Analyse games.
History.
Play engines.
Slide 27
Slide 27 text
Player pairing.
Game play. Analyse games.
History.
Play engines.
Cheat detection.
Slide 28
Slide 28 text
An argument for monoliths?
Slide 29
Slide 29 text
No guarantee of independence.
Slide 30
Slide 30 text
Magnification of consequences.
Slide 31
Slide 31 text
Service architecture provides the
opportunity to trade likelihood of failure
against the consequences of failure.
Slide 32
Slide 32 text
A service approach.
Slide 33
Slide 33 text
A naive service approach.
Game
Player
Slide 34
Slide 34 text
A naive service approach.
Game
Play History
Chess
Player
Analyse
Slide 35
Slide 35 text
A naive service approach.
{
"game-id": 1234,
"game-state": "waiting-for-pair",
"player-1": 5678,
"player-2": null
}
Slide 36
Slide 36 text
A naive service approach.
{
"game-id": 1234,
"game-state": "waiting-for-move",
"player-1": 5678,
"player-2": 9123
}
Slide 37
Slide 37 text
A naive service approach.
Game
Play History
Chess
Player
Analyse
Slide 38
Slide 38 text
A naive service approach.
{
"game-id": 1234,
"game-state": "finished",
"player-1": 5678,
"player-2": 9123
}
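The three records above trace one game through the naive design. A rough sketch of that lifecycle follows; the state names come from the slides, while the `Game` class, the `advance` helper and the set of allowed transitions are assumptions made for illustration.

```python
from dataclasses import dataclass, replace
from typing import Optional

# State names are from the slides; which transitions are legal is an assumption.
TRANSITIONS = {
    "waiting-for-pair": {"waiting-for-move"},
    "waiting-for-move": {"waiting-for-move", "finished"},
    "finished": set(),
}


@dataclass(frozen=True)
class Game:
    game_id: int
    game_state: str
    player_1: int
    player_2: Optional[int] = None


def advance(game: Game, new_state: str, **changes) -> Game:
    """Return the record in its next state, rejecting transitions not listed above."""
    if new_state not in TRANSITIONS[game.game_state]:
        raise ValueError(f"cannot go from {game.game_state} to {new_state}")
    return replace(game, game_state=new_state, **changes)


game = Game(1234, "waiting-for-pair", 5678)
game = advance(game, "waiting-for-move", player_2=9123)
game = advance(game, "finished")
print(game)
```

In this design every other service has to ask the game service for this record, so the record's owner becomes a dependency of everything else.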
Slide 39
Slide 39 text
A naive service approach.
Game
Play History
Chess
Player
Analyse
Slide 40
Slide 40 text
Game
Play History
Chess
Player
Analyse
A naive service approach.
What happens when the game
service has an issue?
Slide 41
Slide 41 text
A naive service approach.
We have almost all the
downsides of the monolithic
approach, plus we have to
deal with network interactions
and the operational complexity
of multiple services.
Game
Play History
Chess
Player
Analyse
Slide 42
Slide 42 text
A service approach.
Slide 43
Slide 43 text
Games as a value instead of a service.
1. e4 e5
2. Bc4 Nc6
3. Qh5 Nf6??
4. Qxf7#
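One way to read "games as a value" is that the move list itself is the game, and any service can hold and interpret it without calling a central game service. The sketch below is an illustration under that reading; `parse_moves` is a hypothetical helper and does no chess validation.

```python
import re
from typing import Tuple


def parse_moves(movetext: str) -> Tuple[str, ...]:
    """Turn movetext into an immutable tuple of moves.

    Move numbers such as "1." are dropped and trailing annotation
    marks such as "??" are stripped. No legality checking is done.
    """
    tokens = movetext.split()
    return tuple(
        re.sub(r"[?!]+$", "", token)
        for token in tokens
        if not re.fullmatch(r"\d+\.", token)
    )


game = parse_moves("1. e4 e5 2. Bc4 Nc6 3. Qh5 Nf6?? 4. Qxf7#")
print(game)

# A value can be copied, stored and compared freely; extending it yields a new value.
prefix = game[:4]
extended = prefix + ("Qh5",)
print(extended == game[:5])
```

Because the tuple is immutable, a history, analysis or engine service can each keep its own copy without coordinating with the others.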
Slide 44
Slide 44 text
Play Analyse Engine
Chess
Pair History
Slide 45
Slide 45 text
Pair.
Slide 46
Slide 46 text
Pair.
Pairing can be thought of as negotiating a
shared identifier.
Slide 47
Slide 47 text
Pair.
game-id: 13376a3e
Slide 48
Slide 48 text
Pair.
A party who wants to be paired has to be able
to communicate its constraints and how they
can be notified.
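A pairing request along these lines might be sketched as follows. Everything here is an assumption for illustration: the `time_control` constraint, the `notify_url` field, the `try_pair` helper and the use of a truncated UUID as the shared game identifier are not taken from the talk.

```python
import uuid
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class PairRequest:
    player_id: int
    time_control: str  # hypothetical constraint, e.g. "blitz"
    notify_url: str    # hypothetical address where the player can be notified


def try_pair(
    waiting: List[PairRequest], request: PairRequest
) -> Optional[Tuple[str, PairRequest, PairRequest]]:
    """Match `request` with a compatible waiting request, or queue it.

    On a match, returns a freshly negotiated shared identifier and both parties.
    """
    for i, other in enumerate(waiting):
        compatible = other.time_control == request.time_control
        if compatible and other.player_id != request.player_id:
            waiting.pop(i)
            game_id = uuid.uuid4().hex[:8]
            return game_id, other, request
    waiting.append(request)
    return None


waiting: List[PairRequest] = []
first = try_pair(waiting, PairRequest(5678, "blitz", "https://example.invalid/notify/5678"))
second = try_pair(waiting, PairRequest(9123, "blitz", "https://example.invalid/notify/9123"))
print(first, second)
```

The only thing the pairing step hands out is the identifier; both parties would then be told about it at the address they supplied.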
8 Fallacies of Distributed Computing
- Bill Joy, Tom Lyon, L. Peter Deutsch, James Gosling
1. The network is reliable.
2. Latency is zero.
3. Bandwidth is infinite.
4. The network is secure.
5. Topology doesn't change.
6. There is one administrator.
7. Transport cost is zero.
8. The network is homogeneous.
Could your system survive shipping a bad
line of code?
Slide 173
Slide 173 text
Data and code need to have independent
lifecycles.
Slide 174
Slide 174 text
Unreliable Parts.
Slide 175
Slide 175 text
“construct reliable systems from
unreliable parts … from the
knowledge that any component
in the system might fail”
- Holzmann & Joshi, Reliable Software Systems Design
Slide 176
Slide 176 text
The following is based on a true story.
Slide 177
Slide 177 text
The context has been changed to
protect the innocent.
Slide 178
Slide 178 text
AI Super Engine
Slide 179
Slide 179 text
choosing to ship unreliable code…
AI Super Engine
Slide 180
Slide 180 text
P(failure) = 0.8
AI Super Engine
Slide 181
Slide 181 text
What to do?
Slide 182
Slide 182 text
Create a proxy to take control of interactions.
Slide 183
Slide 183 text
Create a more reliable view of data.
Slide 184
Slide 184 text
On failure we discard state.
Slide 185
Slide 185 text
Use our journal to rebuild the state.
Slide 186
Slide 186 text
This gives us the independence we need.
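The proxy-and-journal idea from the last few slides can be sketched roughly as below. The `Engine` and `JournalingProxy` classes are hypothetical stand-ins, and the failure pattern is scripted (one crash on the third call) so the example is deterministic; a real engine would fail unpredictably.

```python
from itertools import chain, repeat
from typing import Callable, List


class Engine:
    """Hypothetical stand-in for the unreliable component."""

    def __init__(self, should_fail: Callable[[], bool]) -> None:
        self.should_fail = should_fail
        self.moves: List[str] = []

    def apply(self, move: str) -> None:
        if self.should_fail():
            raise RuntimeError("engine crashed")
        self.moves.append(move)


class JournalingProxy:
    """Owns all interaction with the engine and keeps its own record of inputs."""

    def __init__(self, make_engine: Callable[[], Engine], max_rebuilds: int = 20) -> None:
        self.make_engine = make_engine
        self.max_rebuilds = max_rebuilds
        self.journal: List[str] = []
        self.engine = make_engine()

    def apply(self, move: str) -> None:
        self.journal.append(move)  # record the input before touching the engine
        try:
            self.engine.apply(move)
        except RuntimeError:
            self._rebuild()

    def _rebuild(self) -> None:
        """Discard the failed engine and replay the journal into a fresh one."""
        for _ in range(self.max_rebuilds):
            engine = self.make_engine()
            try:
                for move in self.journal:
                    engine.apply(move)
            except RuntimeError:
                continue  # this attempt failed too; start again from a clean engine
            self.engine = engine
            return
        raise RuntimeError("could not rebuild engine state from the journal")


# Scripted outcomes: the third call crashes, every other call succeeds.
outcomes = chain([False, False, True], repeat(False))
proxy = JournalingProxy(lambda: Engine(lambda: next(outcomes)))
for move in ["e4", "e5", "Bc4", "Nc6"]:
    proxy.apply(move)
print(proxy.engine.moves)
```

Since each rebuild starts from a fresh engine and the journal alone, one attempt does not inherit state from a previous failed attempt.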
Slide 187
Slide 187 text
P(failure) = 0.8^2 = 0.64 or 64%
Slide 188
Slide 188 text
P(failure) = 0.8^10 ≈ 0.11 or 11%
Slide 189
Slide 189 text
P(failure) = 0.8^20 = 0.01 or 1%
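The figures on these slides follow from raising 0.8 to the number of independent attempts. A small sketch, assuming attempts really are independent, that also computes how many attempts a given target needs; `attempts_needed` is an illustrative name.

```python
import math


def attempts_needed(p_failure: float, target: float) -> int:
    """Smallest n such that p_failure ** n <= target, assuming independent attempts."""
    return math.ceil(math.log(target) / math.log(p_failure))


for n in (2, 10, 20):
    print(n, round(0.8 ** n, 3))

print(attempts_needed(0.8, 0.01))
```

Twenty attempts land just above 1%, so getting at or below 1% takes one more attempt.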
Slide 190
Slide 190 text
“A beach house isn't just real estate. It's a
state of mind.”
- Douglas Adams, Mostly Harmless (1992)
Slide 191
Slide 191 text
Avoid failure through more reliable parts.
Slide 192
Slide 192 text
Be resilient to failure when it occurs by
controlling the scope and consequences.
Slide 193
Slide 193 text
Service redundancy means embracing
more failure, but with an increased
opportunity for handling those failures.
Slide 194
Slide 194 text
Service architecture provides the
opportunity to trade likelihood of failure
against the consequences of failure.
Slide 195
Slide 195 text
Leveraging all these opportunities is
critical to minimising, absorbing and
reducing the impact of failure through
our systems.
Slide 197
Slide 197 text
Failure & Change: Principles of Reliable Systems
@markhibberd