Slide 1

Slide 1 text

Horrors of Distributed Systems
Andrew Godwin
@andrewgodwin

Slide 2

Slide 2 text

Hi, I’m Andrew Godwin
• Django core developer
• Senior Software Engineer at
• Channels is a thing, I guess?

Slide 3

Slide 3 text

Distributed systems are HARD
Seriously. They’re really nasty.

Slide 4

Slide 4 text

1. How Computers Hate You
2. How This Makes Distributed Hard

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

Non-Binary Failure

Slide 7

Slide 7 text

“Either it’ll work, or error out nicely!” - Very optimistic programmers everywhere

Slide 8

Slide 8 text

Disks & RAM
Enterprise SSDs: bit flips after as little as a week unpowered
Non-ECC memory: 1 bit flip per gigabyte EVERY TWO HOURS
32GB of RAM is experiencing a bit flip every FOUR MINUTES
Source: Schroeder, Bianca; Pinheiro, Eduardo; Weber, Wolf-Dietrich (2009). “DRAM Errors in the Wild: A Large-Scale Field Study”
Source: Alvin Cox (2015). “JEDEC SSD Specifications Explained”
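
The four-minute figure follows directly from the per-gigabyte rate quoted on the slide. A minimal sketch of the arithmetic, assuming flips scale linearly with the amount of RAM (the function name is made up for illustration):

```python
# Rate quoted on the slide for non-ECC memory: 1 bit flip per gigabyte every two hours.
FLIPS_PER_GB_PER_HOUR = 1 / 2


def minutes_between_flips(ram_gb: float) -> float:
    """Expected minutes between bit flips, assuming the rate scales linearly with RAM."""
    flips_per_hour = ram_gb * FLIPS_PER_GB_PER_HOUR
    return 60 / flips_per_hour


# 32 GB gives 16 flips per hour, i.e. one every 3.75 minutes,
# which the slide rounds to four.
print(minutes_between_flips(32))
```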

Slide 9

Slide 9 text

“The network is either up or down!” - Those optimistic programmers again

Slide 10

Slide 10 text

Networks
How high will you let latency get?
How bad does packet loss have to be?
Do you have bad neighbours?
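
One way to answer "how high will you let latency get?" is to write the answer down as an explicit timeout and retry budget. The sketch below is an assumption-laden illustration, not something from the talk: `call_with_retries`, its parameters, and the default values are all hypothetical, and `operation` stands in for any network call that accepts a timeout in seconds.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[float], T],
    timeout_s: float = 0.5,
    max_attempts: int = 3,
    base_delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation(timeout_s), retrying on timeouts and connection errors."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(timeout_s)
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts:
                raise  # budget exhausted: surface the failure to the caller
            # Exponential backoff with jitter so clients do not retry in lockstep.
            delay = base_delay_s * (2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.5))
    raise AssertionError("unreachable")
```

Retrying like this only makes sense when the operation is safe to repeat, because a request that timed out may still have been processed on the other side.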

Slide 11

Slide 11 text

Time and Space

Slide 12

Slide 12 text

“The speed of light isn’t that important” Ah, the optimists are back again!

Slide 13

Slide 13 text

National Museum of American History

Slide 14

Slide 14 text

Australia is real far away from stuff
Melbourne → US-east-1: 16,000km
At the speed of light: 50ms
Minimum possible round-trip, ever: 100ms
Number of web requests just to open Slack: 96
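
The slide's numbers can be reproduced with a few lines of arithmetic. This sketch uses the vacuum speed of light, so it is a lower bound; light in optical fibre is slower and real routes are longer. The last line is a deliberately pessimistic assumption (every request waits for the previous one), which browsers avoid by issuing requests in parallel.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # in a vacuum


def one_way_ms(distance_km: float) -> float:
    """Minimum one-way latency in milliseconds over the given distance."""
    return distance_km / SPEED_OF_LIGHT_KM_PER_S * 1000


distance_km = 16_000  # Melbourne to us-east-1, per the slide
one_way = one_way_ms(distance_km)
round_trip = 2 * one_way

print(f"one way: {one_way:.0f} ms")        # the slide rounds this to 50 ms
print(f"round trip: {round_trip:.0f} ms")  # the slide rounds this to 100 ms

# Hypothetical worst case: 96 requests, each needing its own sequential round trip.
print(f"96 sequential round trips: {96 * round_trip / 1000:.1f} s")
```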

Slide 15

Slide 15 text

You can solve a lot of distributed problems by waiting for consensus ...but you can be waiting a long time

Slide 16

Slide 16 text

“Time is the same everywhere” I’d like you to meet my friend GENERAL RELATIVITY

Slide 17

Slide 17 text

Your phone corrects for the time dilation of the GPS satellites. Oh, and all clocks drift quite a decent amount.
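
Clock drift has a practical consequence even on a single machine: the wall clock gets corrected from time to time, so it can jump forwards or backwards while a program is running. A minimal sketch, assuming all that is needed is an elapsed duration, is to measure on the monotonic clock instead (`timed` is a made-up helper name):

```python
import time


def timed(func, *args, **kwargs):
    """Return (result, elapsed_seconds), measured on the monotonic clock."""
    # time.monotonic() never goes backwards, unlike time.time(), which follows
    # the system clock and can be stepped when that clock is adjusted.
    start = time.monotonic()
    result = func(*args, **kwargs)
    return result, time.monotonic() - start


total, elapsed = timed(sum, range(1_000_000))
print(f"summed to {total} in {elapsed:.4f} s")
```

This only helps within one process on one machine. Monotonic readings from different machines are not comparable, which is part of why agreeing on time across servers is hard.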

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

Communication

Slide 20

Slide 20 text

Networks are unreliable. Wait, where did the optimists go?

Slide 21

Slide 21 text

You get a choice: At-most-once or At-least-once

Slide 22

Slide 22 text

Well, alright, more of a spectrum.
[Diagram labels: At most once, At least once, Exactly once, Basically never, Eleventy copies, Effort]

Slide 23

Slide 23 text

Do you want to maybe do it twice?
Saving text, liking a tweet
Or maybe never?
Charging money, sending email
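
For operations like charging money, a common way to live with at-least-once delivery is to make the receiver idempotent, so a redelivered message has no extra effect. The sketch below is an illustration under assumptions, not the talk's own code: `PaymentProcessor` and its idempotency key are hypothetical, and state is kept in memory where a real system would need durable storage.

```python
class PaymentProcessor:
    """Hypothetical in-memory processor that deduplicates by idempotency key."""

    def __init__(self) -> None:
        self.charges: list[tuple[str, int]] = []
        self._receipts: dict[str, str] = {}

    def charge(self, idempotency_key: str, account: str, amount_cents: int) -> str:
        # A redelivered message carries the same key, so hand back the earlier
        # receipt instead of charging the account a second time.
        if idempotency_key in self._receipts:
            return self._receipts[idempotency_key]
        self.charges.append((account, amount_cents))
        receipt = f"receipt-{len(self.charges)}"
        self._receipts[idempotency_key] = receipt
        return receipt


processor = PaymentProcessor()
first = processor.charge("order-42", "alice", 500)
again = processor.charge("order-42", "alice", 500)  # simulated redelivery
print(first == again, len(processor.charges))
```

The sender can then retry as often as it needs to, and the effect on the account is as if the message had arrived once.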

Slide 24

Slide 24 text

Consensus

Slide 25

Slide 25 text

Your servers all need to agree...
...over an unreliable network
...with unreliable storage
...and different ideas of what time is
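
The slide does not name a protocol, but one building block that many consensus systems rely on is the majority quorum: a decision counts only once more than half of the servers have acknowledged it, so any two majorities share at least one server. A minimal sketch of that arithmetic, with made-up function names:

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return cluster_size // 2 + 1


def is_committed(acks: int, cluster_size: int) -> bool:
    """True once a strict majority of servers has acknowledged."""
    return acks >= quorum_size(cluster_size)


def tolerated_failures(cluster_size: int) -> int:
    """How many servers can be unreachable while a majority can still form."""
    return cluster_size - quorum_size(cluster_size)


for n in (3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

Note that a four-server cluster tolerates no more failures than a three-server one, which is why odd cluster sizes are the usual choice.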

Slide 26

Slide 26 text

It can happen to YOU! It doesn’t have to be big and fancy to be distributed.

Slide 27

Slide 27 text

Fast Cheap Good

Slide 28

Slide 28 text

Partition Tolerant Available Consistent

Slide 29

Slide 29 text

What to do?

Slide 30

Slide 30 text

Define your interfaces clearly
You’ll never find bugs without knowing how it’s supposed to work

Slide 31

Slide 31 text

Product & design can help
Is there somewhere you can allow inconsistency or lag?

Slide 32

Slide 32 text

Don’t reinvent the wheel
Use existing tech, but know its weaknesses

Slide 33

Slide 33 text

Thanks. Andrew Godwin @andrewgodwin aeracode.org