Failure and Change: Principles of Reliable Systems

Mark Hibberd
November 30, 2018

As we construct larger or more complex systems, failure and change are
ever-present. We need to accept and even embrace these tensions to
build software that works and keeps working. This is a talk on building
and operating reliable systems. We will look at how systems fail,
particularly in the face of complexity or scale, and build up a set of
principles and practices that will help us implement, understand and
verify reliable systems.


  1. A naive service approach. What happens when the game service has an issue? [Diagram labels: Game Play History Chess Player Analyse]
  2. A naive service approach. We have almost all the downsides of the monolithic approach, plus we have to deal with network interactions and the operational complexity of multiple services. [Diagram labels: Game Play History Chess Player Analyse]
  3. Games as a value instead of a service. 1. e4 e5 2. Bc4 Nc6 3. Qh5 Nf6?? 4. Qxf7#
  4. Pair. A party who wants to be paired has to be able to communicate its constraints and how it can be notified.
  5. Pair. { notify: http://waiting/123, constraints: [ { game-time: 5+3 }, { min-rating: 1400 }, { max-rating: 1600 }, { ideal-rating: 1500 }, { max-wait: 30 } ] }
  6. Pair. Knows: About players waiting for a game. Told: When a new player wants to pair. Asks: Nothing.
  7. Pair. Knows: About players waiting for a game. Told: When a new player wants to pair. Asks: Nothing. Response: Vends players their unique game id.
  8. Play. A playing pair has a shared identifier they can use to join a game in an initial state.
  9. Play. Knows: About games in play. Told: When a player makes a move. Asks: Nothing. Response: Updated game value.
  10. History. Knows: About historical game values. Told: New complete game values. Asks: Nothing. Response: Complete game values.
  11. Analyse. Knows: About game analysis techniques. Told: Game value to analyse. Asks: Nothing. Response: Annotated game values.
  12. 8 Fallacies of Distributed Computing - Bill Joy, Tom Lyon, L. Peter Deutsch, James Gosling
      1. The network is reliable.
      2. Latency is zero.
      3. Bandwidth is infinite.
      4. The network is secure.
      5. Topology doesn't change.
      6. There is one administrator.
      7. Transport cost is zero.
      8. The network is homogeneous.
  13. “construct reliable systems from unreliable parts … from the knowledge that any component in the system might fail” - Holzmann & Joshi, Reliable Software Systems Design
  14. “A beach house isn't just real estate. It's a state of mind.” Douglas Adams - Mostly Harmless (1992)
  15. Leveraging all these opportunities is critical to minimising, absorbing and reducing the impact of failure through our systems.
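
The Pair slides describe a component that knows about waiting players, is told when a new player wants to pair, asks nothing of other services, and vends a unique game id. A minimal Python sketch of that shape might look like the following. It is an illustration under assumptions, not the talk's implementation: the `PairRequest` and `Pair` names, the flattened request fields, and the player's own `rating` field are hypothetical, `ideal-rating` and `max-wait` are left out for brevity, and delivering the result to the `notify` address is not shown.

```python
import uuid
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class PairRequest:
    """Hypothetical flattening of the pairing request shown on the slide."""
    notify: str      # where the player wants to be notified
    game_time: str   # time control, e.g. "5+3"
    rating: int      # assumption: the requester's own rating (not on the slide)
    min_rating: int  # lowest acceptable opponent rating
    max_rating: int  # highest acceptable opponent rating


def compatible(a: PairRequest, b: PairRequest) -> bool:
    """True when time controls agree and each rating fits the other's range."""
    return (
        a.game_time == b.game_time
        and a.min_rating <= b.rating <= a.max_rating
        and b.min_rating <= a.rating <= b.max_rating
    )


class Pair:
    """Knows: waiting players. Told: new pair requests. Asks: nothing."""

    def __init__(self) -> None:
        self.waiting: List[PairRequest] = []

    def request(self, new: PairRequest) -> Optional[Tuple[str, PairRequest]]:
        """Return (game id, opponent) on a match, else queue the request."""
        for i, other in enumerate(self.waiting):
            if compatible(new, other):
                del self.waiting[i]
                # Vend a fresh identifier that both players can use to join.
                return uuid.uuid4().hex, other
        self.waiting.append(new)
        return None
```

Because `Pair` depends only on its own list of waiting requests, a failure in play, history or analysis would not stop players from being paired, which is the property the "Asks: Nothing" line on the slides points at.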