Failure and Change: Principles of Reliable Systems

As we construct larger and more complex systems, failure and change are
ever-present. We need to accept, and even embrace, these tensions to
build software that works and keeps working. This is a talk on building
and operating reliable systems. We will look at how systems fail,
particularly in the face of complexity or scale, and build up a set of
principles and practices that will help us implement, understand and
verify reliable systems.

Mark Hibberd

November 30, 2018

Transcript

  1. Failure & Change: @markhibberd Principles of Reliable Systems

  2. Reliability: The quality of performing consistently well.

  3. Failure.

  4. A service.

  5. P(failure) = 0.1 or 10%

  6. Redundancy, let's do it!

  7. P(individual failure) = 0.1

  8. P(system failure) = 0.1^10

  9. P(system availability) = 1 - 0.1^10

  10. 99.99999999% Availability.

  11. Are failures really independent?

  12. P(failure) = 10% or 0.1

  13. P(individual success) = 1 - 0.1 = 0.9 or 90%

  14. P(all successes) = 0.9^10

  15. P(at least one failure) = 1 - 0.9^10 = 0.65 or 65%

  16. P(mutually assured destruction) = 1

  17. P(at least one failure) = 1 - 0.9^10 = 0.65 or 65%

  18. P(system failure) = 1 - 0.9^10 = 0.65 or 65%

  19. Redundancy means embracing more failure, but with an increased opportunity for handling those failures.

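  A quick sketch of the arithmetic on the slides above (a minimal Python
  example; the ten instances and the 0.1 failure probability are the
  talk's own numbers):

      # Ten redundant instances, each failing independently with
      # probability 0.1 (the slides' numbers).
      p_fail = 0.1
      n = 10

      # If any one surviving instance can serve a request, the system
      # fails only when all ten fail together.
      p_system_failure = p_fail ** n
      print(f"availability: {1 - p_system_failure:.10%}")  # 99.9999999990%

      # If instead failures are coupled and one failure anywhere breaks
      # the request, what matters is the chance of at least one failure.
      p_at_least_one = 1 - (1 - p_fail) ** n
      print(f"at least one failure: {p_at_least_one:.0%}")  # 65%
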
  20. Building systems.

  21. (image-only slide)
  22. Player pairing.

  23. Player pairing. Game play.

  24. Player pairing. Game play. History.

  25. Player pairing. Game play. Analyse games. History.

  26. Player pairing. Game play. Analyse games. History. Play engines.

  27. Player pairing. Game play. Analyse games. History. Play engines. Cheat detection.

  28. An argument for monoliths?

  29. No guarantee of independence.

  30. Magnification of consequences.

  31. Service architecture provides the opportunity to trade likelihood of failure against the consequences of failure.

  32. A service approach.

  33. A naive service approach. Game Player

  34. A naive service approach. Game Play History Chess Player Analyse

  35. A naive service approach. { "game-id": 1234, "game-state": "waiting-for-pair", "player-1": 5678, "player-2": nil }

  36. A naive service approach. { "game-id": 1234, "game-state": "waiting-for-move", "player-1": 5678, "player-2": 9123 }

  37. A naive service approach. Game Play History Chess Player Analyse

  38. A naive service approach. { "game-id": 1234, "game-state": "finished", "player-1": 5678, "player-2": 9123 }

  39. A naive service approach. Game Play History Chess Player Analyse

  40. Game Play History Chess Player Analyse A naive service approach. What happens when the game service has an issue?

  41. A naive service approach. We have almost all the downsides of the monolithic approach, plus we have to deal with network interactions and the operational complexity of multiple services. Game Play History Chess Player Analyse

  42. A service approach.

  43. Games as a value instead of a service. 1. e4 e5 2. Bc4 Nc6 3. Qh5 Nf6?? 4. Qxf7#

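  A minimal sketch of a game as a value (Python; the Game type and its
  play method are illustrative, not from the talk). The moves are the
  Scholar's Mate line from the slide:

      from dataclasses import dataclass
      from typing import Tuple

      @dataclass(frozen=True)
      class Game:
          game_id: str
          moves: Tuple[str, ...] = ()

          def play(self, move: str) -> "Game":
              # Playing a move yields a new value; nothing is mutated,
              # so the same game can be stored, analysed and replayed
              # by independent services without coordination.
              return Game(self.game_id, self.moves + (move,))

      game = Game("13376a3e")
      for move in ["e4", "e5", "Bc4", "Nc6", "Qh5", "Nf6", "Qxf7#"]:
          game = game.play(move)
      print(game.moves)
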
  44. Play Analyse Engine Chess Pair History

  45. Pair.

  46. Pair. Pairing can be thought of as negotiating a shared identifier.

  47. Pair. game-id: 13376a3e

  48. Pair. A party who wants to be paired has to be able to communicate its constraints and how they can be notified.

  49. { notify: http://waiting/123, constraints: [ { game-time: 5+3 }, { min-rating: 1400 }, { max-rating: 1600 }, { ideal-rating: 1500 }, { max-wait: 30 } ] } Pair.

  50. Pair. The pairing service maintains an indexed list of waiting parties.

  51. Pair. Asynchronously pairing compatible partners, generating a unique identifier for the pair, and notifying both parties.

  52. Pair. Independence?

  53. Pair. Knows: About players waiting for a game.

  54. Pair. Knows: About players waiting for a game. Told: When a new player wants to pair.

  55. Pair. Knows: About players waiting for a game. Told: When a new player wants to pair. Asks: Nothing.

  56. Pair. Knows: About players waiting for a game. Told: When a new player wants to pair. Asks: Nothing. Response: Vends players their unique game id.

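  A minimal sketch of the pairing service in Python. The
  knows/told/asks/response shape follows slides 53-56; the matching
  rule, the rating field, and the in-memory storage are assumptions for
  illustration:

      import uuid

      class PairService:
          def __init__(self):
              self.waiting = []  # the indexed list of waiting parties

          def tell(self, party):
              # Told: a new player wants to pair. We ask nothing of any
              # other service; we use only what we already know.
              for other in self.waiting:
                  if self.compatible(party, other):
                      self.waiting.remove(other)
                      game_id = uuid.uuid4().hex[:8]  # e.g. "13376a3e"
                      # Response: vend both players their shared game id.
                      self.notify(party, game_id)
                      self.notify(other, game_id)
                      return
              self.waiting.append(party)

          def compatible(self, a, b):
              # Illustrative rule: each player's rating must sit inside
              # the other's constraints.
              return (a["constraints"]["min-rating"] <= b["rating"] <= a["constraints"]["max-rating"]
                      and b["constraints"]["min-rating"] <= a["rating"] <= b["constraints"]["max-rating"])

          def notify(self, party, game_id):
              # Stand-in for a POST to the party's notify address.
              print(f"notify {party['notify']}: game-id {game_id}")
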
  57. Play.

  58. Play. The playing service is responsible for just the in-flight games.

  59. Play. A playing pair have a shared identifier they can use to join a game in an initial state.

  60. Play Analyse Engine Chess Pair History

  61. Play Analyse Engine Chess Pair History

  62. Play Analyse Engine Chess Pair History

  63. Knows: About games in play. Play.

  64. Knows: About games in play. Told: When a player makes a move. Play.

  65. Knows: About games in play. Told: When a player makes a move. Asks: Nothing. Play.

  66. Knows: About games in play. Told: When a player makes a move. Asks: Nothing. Response: Updated game value. Play.

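  A matching sketch for play (illustrative Python): the service holds
  only in-flight games, is told moves, asks nothing, and responds with
  updated game values.

      class PlayService:
          def __init__(self):
              self.in_flight = {}  # game-id -> tuple of moves so far

          def join(self, game_id):
              # A paired couple join using the shared identifier vended
              # by the pairing service; the game starts in an initial state.
              self.in_flight.setdefault(game_id, ())

          def tell_move(self, game_id, move):
              # Told: a player makes a move. Response: the updated game
              # value, which callers can hand to history or analysis
              # without involving this service again.
              self.in_flight[game_id] = self.in_flight[game_id] + (move,)
              return self.in_flight[game_id]
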
  67. History.

  68. History. The history service is just a database of static games that can be searched.

  69. Play Analyse Engine Chess Pair History

  70. Play Analyse Engine Chess Pair History

  71. Play Analyse Engine Chess Pair History

  72. History. The history service gives out games as values.

  73. Knows: About historical game values. History.

  74. Knows: About historical game values. Told: New complete game values. History.

  75. Knows: About historical game values. Told: New complete game values. Asks: Nothing. History.

  76. Knows: About historical game values. Told: New complete game values. Asks: Nothing. Response: Complete game values. History.

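  History (and analyse, which has the same shape on the next slides)
  can be sketched the same way; responses are complete game values as
  data, never references into live state (illustrative Python):

      class HistoryService:
          def __init__(self):
              self.games = {}  # game-id -> complete game value

          def tell(self, game_id, game):
              # Told: a new complete game value.
              self.games[game_id] = game

          def search(self, predicate):
              # Response: complete game values; a caller that slows or
              # fails never holds this service's internals hostage.
              return [g for g in self.games.values() if predicate(g)]
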
  77. Analyse.

  78. Play Analyse Engine Chess Pair History

  79. Play Analyse Engine Chess Pair History

  80. Play Analyse Engine Chess Pair History

  81. Knows: About game analysis techniques. Analyse.

  82. Knows: About game analysis techniques. Told: Game value to analyse. Analyse.

  83. Knows: About game analysis techniques. Told: Game value to analyse. Asks: Nothing. Analyse.

  84. Knows: About game analysis techniques. Told: Game value to analyse. Asks: Nothing. Response: Annotated game values. Analyse.

  85. Play Analyse Engine Chess Pair History

  86. Independent responsibilities over shared nouns.

  87. Operating Systems.

  88. Play Analyse Engine Chess Pair History

  89. Play Analyse Engine Chess Pair History

  90. Engine

  91. Engine

  92. Engine

  93. Engine

  94. (image-only slide)
  95. GET /status { "name": "engine", "version": "c0ffee", "stats": {…}, "status": "ok" }

  96. GET /status { "name": "engine", "version": "c0ffee", "stats": {…}, "status": "ko" }

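  A minimal status endpoint in the shape of these slides (Python
  standard library; the check_health probe and the 503 mapping are
  assumptions, not the talk's):

      import json
      from http.server import BaseHTTPRequestHandler, HTTPServer

      def check_health() -> bool:
          return True  # placeholder for real downstream checks

      class StatusHandler(BaseHTTPRequestHandler):
          def do_GET(self):
              if self.path != "/status":
                  self.send_error(404)
                  return
              healthy = check_health()
              body = json.dumps({"name": "engine", "version": "c0ffee",
                                 "stats": {}, "status": "ok" if healthy else "ko"})
              # Many load balancers key off the status code, so mirror
              # "ko" with a non-200 response.
              self.send_response(200 if healthy else 503)
              self.send_header("Content-Type", "application/json")
              self.end_headers()
              self.wfile.write(body.encode())

      if __name__ == "__main__":
          HTTPServer(("", 8080), StatusHandler).serve_forever()
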
  97. (image-only slide)
  98. 8 Fallacies of Distributed Computing - Bill Joy, Tom Lyon, L. Peter Deutsch, James Gosling
    1. The network is reliable.
    2. Latency is zero.
    3. Bandwidth is infinite.
    4. The network is secure.
    5. Topology doesn't change.
    6. There is one administrator.
    7. Transport cost is zero.
    8. The network is homogeneous.

  99. Timeouts save lives. timeout!
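
  In code, that means every remote call gets a bound on how long we
  will wait. A minimal sketch with the Python standard library (the
  URL and the half-second budget are illustrative assumptions):

      import urllib.error
      import urllib.request

      def fetch_status(url, timeout_seconds=0.5):
          # A bounded wait turns an unresponsive dependency into an
          # error we can handle, instead of a request we lose forever.
          try:
              with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
                  return response.read()
          except (urllib.error.URLError, TimeoutError):
              return None  # the caller can degrade or retry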

  100. Engine

  101. Engine ~14% of Traffic To Each Server

  102. Engine 100k Requests = ~14k Request Per Server

  103. Engine ~16% of Traffic To Each Server

  104. Engine 100k Requests = ~16k Request Per Server

  105. Engine 14k -> 16k Requests or ~15% Increase

  106. Engine 14k -> 20k Requests or ~40% Increase

  107. Engine 14k -> 25k Requests or ~80% Increase

  108. Engine
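
  The arithmetic behind slides 101-107, as a worked example in Python
  (the slides round the percentages slightly differently): 100k
  requests over seven engine servers is roughly 14k each, and every
  lost server pushes the survivors' share up sharply.

      requests = 100_000
      baseline = requests / 7  # even split over seven servers

      for servers in [7, 6, 5, 4]:
          per_server = requests / servers
          increase = per_server / baseline - 1
          print(f"{servers} servers: ~{per_server / 1000:.0f}k each "
                f"(+{increase:.0%} over seven)")

      # 7 servers: ~14k each (+0% over seven)
      # 6 servers: ~17k each (+17% over seven)
      # 5 servers: ~20k each (+40% over seven)
      # 4 servers: ~25k each (+75% over seven)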

  109. Play Analyse Engine Chess Pair History

  110. Serving some is better than serving none.

  111. (image-only slide)
  112. GET /status { "name": "engine", "version": "c0ffee", "stats": {…}, "status": "ok" }

  113. GET /status { "name": "chess", "version": "ae103", "stats": {…}, "status": "ok" }

  114. GET /status { "name": "engine", "version": "c0ffee", "stats": {…}, "status": "ko" }

  115. GET /status { "name": "chess", "version": "ae103", "stats": {…}, "status": "ko" }

  116. Play Analyse Engine Chess Pair History

  117. Play Analyse Engine Chess Pair History

  118. Play Analyse Engine Chess Pair History

  119. Graceful degradation maintains independence.
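
  Degradation in code is often just a guarded call. A sketch (Python;
  the annotation functions are stand-ins) of analysis that still
  answers when the engine service is down:

      class ServiceUnavailable(Exception):
          pass

      def basic_annotations(game):
          return ["material count"]  # stand-in for cheap local analysis

      def engine_annotations(game):
          raise ServiceUnavailable()  # stand-in for a failing remote call

      def analyse(game):
          # If the engine dependency fails, degrade rather than fail:
          # serving some is better than serving none.
          annotations = basic_annotations(game)
          try:
              annotations += engine_annotations(game)
          except ServiceUnavailable:
              pass
          return annotations

      print(analyse(("e4", "e5")))  # ['material count'], engine down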

  120-122. (image-only slides)
  123. Changing Systems.

  124-128. (image-only slides)
  129. Pair

  130. Pair Version N

  131. Pair Version N

  132. Pair Version N Pair Version N + 1

  133. Pair Version N Pair Version N + 1 Chess

  134. (image-only slide)
  135. Pair Version N Pair Version N + 1

  136. Pair Version N Pair Version N + 1

  137-145. (image-only slides)
  146. In-production verification.

  147. Pair Version N Pair Version N + 1 Read Write Read + Write

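  In code, verification like this can be a shadow read: version N
  still produces the answer users see, while the same request is
  mirrored to version N + 1 and divergences are logged. A minimal
  sketch (Python; names are illustrative):

      def shadow_read(request, version_n, version_n1, log=print):
          answer = version_n(request)  # the answer users actually see
          try:
              candidate = version_n1(request)  # mirrored read, result unused
              if candidate != answer:
                  log(f"divergence on {request!r}")
          except Exception as e:
              log(f"version n+1 failed on {request!r}: {e}")
          return answer
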
  148. Pair Version N Pair Version N + 1

  149. Pair Version N Pair Version N + 1

  150. Pair Version N Pair Version N + 1

  151. Pair Version N Pair Version N + 1

  152. Pair Version N Pair Version N + 1

  153. Pair Version N Pair Version N + 1

  154. Pair Version N Pair Version N + 1

  155. Incremental deployment.

  156. Pair Version N Pair Version N + 1 Read + Write Read + Write

  157. Pair Version N Pair Version N + 1

  158. Pair Version N Pair Version N + 1

  159. Pair Version N Pair Version N + 1 95%

  160. Pair Version N Pair Version N + 1 5%

  161. Understand success.

  162. Pair Version N Pair Version N + 1 5% 200 OK

  163. Pair Version N Pair Version N + 1 5% 500 SERVER ERROR

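  A sketch of the traffic split (Python; the 5% fraction matches the
  slides, while the routing rule and error counters are assumptions).
  Understanding success means measuring both versions' responses
  before shifting more traffic:

      import random

      errors = {"n": 0, "n+1": 0}

      def route(request, version_n, version_n1, fraction_n1=0.05):
          name, handler = (("n+1", version_n1) if random.random() < fraction_n1
                           else ("n", version_n))
          try:
              return handler(request)  # e.g. 200 OK from either version
          except Exception:
              # A rising n+1 error rate is the signal to roll back
              # to 100% version N.
              errors[name] += 1
              raise
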
  164. { "match-key": 1234231, "pair": ["c0ffee", "7ea"] }

  165. Average Pair Time: 1.1s

  166. Pair Version N Pair Version N + 1 95% 5%

  167. Pair Version N Pair Version N + 1 50% 50%

  168. Pair Version N Pair Version N + 1 5% 95%

  169. Pair Version N Pair Version N + 1 0% 100%

  170. Pair Version N Pair Version N + 1 50% 50%

  171. Pair Version N Pair Version N + 1 100% 0%

  172. Could your system survive shipping a bad line of code?

  173. Data and code need to have an independent lifecycle.

  174. Unreliable Parts.

  175. “construct reliable systems from unreliable parts … from the knowledge that any component in the system might fail” - Holzmann & Joshi, Reliable Software Systems Design

  176. The following is based on a true story.

  177. The context has been changed to protect the innocent.

  178. AI Super Engine

  179. choosing to ship unreliable code… AI Super Engine

  180. P(failure) = 0.8 AI Super Engine

  181. What to do?

  182. Create a proxy to take control of interactions.

  183. Create a more reliable view of data.

  184. On failure we discard state.

  185. Use our journal to rebuild the state.

  186. This gives us the independence we need.

  187. P(failure) = 0.8^2 = 0.64 or 64%

  188. P(failure) = 0.8^10 = 0.10 or 10%

  189. P(failure) = 0.8^20 = 0.01 or 1%
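
  A sketch of the proxy's retry loop (Python; names are illustrative).
  Discarding state and rebuilding from the journal is what makes the
  attempts independent, so the 0.8^n arithmetic above applies:

      import random

      def call_ai_super_engine():
          # Stand-in for the real, unreliable engine: fails 80% of calls.
          if random.random() < 0.8:
              raise RuntimeError("engine fell over")
          return "move"

      def reliable_call(attempts=20):
          # P(all attempts fail) = 0.8^20, roughly 1%, assuming the
          # failures are independent.
          for _ in range(attempts):
              try:
                  return call_ai_super_engine()
              except RuntimeError:
                  continue  # discard state, rebuild from the journal, retry
          raise RuntimeError("engine unavailable after retries")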

  190. “A beach house isn't just real estate. It's a state of mind.” Douglas Adams - Mostly Harmless (1992)

  191. Avoid failure through more reliable parts.

  192. Be resilient to failure when it occurs by controlling the scope and consequences.

  193. Service redundancy means embracing more failure, but with an increased opportunity for handling those failures.

  194. Service architecture provides the opportunity to trade likelihood of failure against the consequences of failure.

  195. Leveraging all these opportunities is critical to minimising, absorbing and reducing the impact of failure through our systems.

  196. (image-only slide)
  197. Failure & Change: @markhibberd Principles of Reliable Systems