Horrors of Distributed Systems

A talk I gave at DjangoCon AU 2017.

Andrew Godwin

August 04, 2017

Transcript

  1. Horrors of Distributed Systems
    Andrew Godwin @andrewgodwin
  2. Hi, I’m Andrew Godwin
    • Django core developer
    • Senior Software Engineer at
    • Channels is a thing, I guess?
  3. Seriously. They’re really nasty. Distributed systems are HARD

  4. 1. How Computers Hate You
    2. How This Makes Distributed Hard
  5. None
  6. Non-Binary Failure

  7. “Either it’ll work, or error out nicely!” - Very optimistic programmers everywhere
  8. Disks & RAM
    Enterprise SSDs: bit flips after as little as a week unpowered
    Non-ECC memory: 1 bit flip per gigabyte EVERY TWO HOURS
    32GB of RAM is experiencing a bit flip every FOUR MINUTES
    Source: Schroeder, Bianca; Pinheiro, Eduardo; Weber, Wolf-Dietrich (2009). “DRAM Errors in the Wild: A Large-Scale Field Study”
    Source: Alvin Cox (2015). “JEDEC SSD Specifications Explained”
  9. “The network is either up or down!” - Those optimistic programmers again
  10. Networks
    How high will you let latency get?
    How bad does packet loss have to be?
    Do you have bad neighbours?
  11. Time and Space

  12. “The speed of light isn’t that important”
    Ah, the optimists are back again!
  13. National Museum of American History

  14. Australia is real far away from stuff
    Melbourne → US-east-1: 16,000km
    At the speed of light: 50ms
    Minimum possible round-trip, ever: 100ms
    Number of web requests just to open Slack: 96
  15. You can solve a lot of distributed problems by waiting for consensus
    ...but you can be waiting a long time
  16. “Time is the same everywhere”
    I’d like you to meet my friend GENERAL RELATIVITY
  17. Your phone corrects for the time dilation of the GPS satellites.
    Oh, and all clocks drift quite a decent amount.
  18. None
  19. Communication

  20. Networks are unreliable. Wait, where did the optimists go?

  21. You get a choice: At-most-once or At-least-once

  22. Well, alright, more of a spectrum.
    At most once · At least once · Exactly once
    Basically never · Eleventy copies · Effort
  23. Do you want to maybe do it twice? Saving text, liking a tweet
    Or maybe never? Charging money, sending email
  24. Consensus

  25. Your servers all need to agree...
    ...over an unreliable network
    ...with unreliable storage
    ...and different ideas of what time is
  26. It can happen to YOU!
    It doesn’t have to be big and fancy to be distributed.
  27. Fast Cheap Good

  28. Partition Tolerant Available Consistent

  29. What to do?

  30. Define your interfaces clearly
    You’ll never find bugs without knowing how it’s supposed to work
  31. Product & design can help
    Is there somewhere you can allow inconsistency or lag?
  32. Don’t reinvent the wheel
    Use existing tech, but know its weaknesses
  33. Thanks. Andrew Godwin @andrewgodwin aeracode.org
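
The two RAM figures on slide 8 (Disks & RAM) are consistent with each other, which is easy to check. The sketch below assumes, purely for illustration, that bit flips scale linearly with installed memory; the function name is made up for this note and is not from the talk.

```python
# Rate quoted on slide 8 for non-ECC memory: 1 bit flip per gigabyte every two hours.
FLIPS_PER_GB_PER_HOUR = 1 / 2


def minutes_between_flips(ram_gb: float) -> float:
    """Expected minutes between bit flips for a machine with `ram_gb` of RAM.

    Assumption: the flip rate scales linearly with capacity.
    """
    flips_per_hour = FLIPS_PER_GB_PER_HOUR * ram_gb
    return 60 / flips_per_hour


print(minutes_between_flips(32))  # 3.75, which the slide rounds to "every FOUR MINUTES"
```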
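
The latency floor on slide 14 (Melbourne → US-east-1) comes straight from distance divided by the speed of light. A rough sketch, assuming a straight-line path in a vacuum (real fibre routes are longer and slower, so actual numbers are worse); the function names are illustrative only, and the last line is a deliberately pessimistic case, since real clients issue many requests in parallel.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # speed of light in a vacuum


def one_way_ms(distance_km: float) -> float:
    """Minimum one-way latency in milliseconds over `distance_km`."""
    return distance_km / SPEED_OF_LIGHT_KM_PER_S * 1000


def round_trip_ms(distance_km: float) -> float:
    """Minimum round-trip latency in milliseconds: there and back again."""
    return 2 * one_way_ms(distance_km)


def serial_requests_ms(request_count: int, rtt_ms: float) -> float:
    """Total wait if every request had to finish before the next one started."""
    return request_count * rtt_ms


print(round(one_way_ms(16_000), 1))     # about 53 ms; the slide rounds to 50ms
print(round(round_trip_ms(16_000), 1))  # about 107 ms; the slide rounds to 100ms
print(serial_requests_ms(96, 100))      # 9600 ms if 96 requests ran strictly one after another
```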
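
Slides 21 to 23 frame delivery as a choice between at-most-once and at-least-once. One common way to live with at-least-once delivery is to make the receiver idempotent, so a redelivered message has no second effect. The sketch below is not from the talk: `IdempotentReceiver`, the message IDs, and the in-memory `set` are all hypothetical, and a real system would need durable storage for the seen IDs.

```python
class IdempotentReceiver:
    """Applies each message at most once, even if the sender redelivers it."""

    def __init__(self, handler):
        self._handler = handler
        self._seen_ids = set()  # assumption: in-memory only; lost on restart

    def receive(self, message_id, payload):
        """Return True if the message was applied, False if it was a duplicate."""
        if message_id in self._seen_ids:
            return False
        # Handle first, then record. A crash between these two steps would let
        # a retry apply the message twice, which is why "exactly once" takes effort.
        self._handler(payload)
        self._seen_ids.add(message_id)
        return True


charges = []
receiver = IdempotentReceiver(charges.append)

# An at-least-once sender may deliver the same charge several times.
for _ in range(3):
    receiver.receive("charge-42", 1000)

print(charges)  # [1000]
```

For operations like saving text or liking a tweet, skipping the deduplication and simply doing the work twice is often acceptable; for charging money or sending email, the extra bookkeeping is usually worth it.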