Horrors of Distributed Systems

A talk I gave at DjangoCon AU 2017.

Andrew Godwin

August 04, 2017

Transcript

  1. Horrors of
    Distributed
    Systems
    Andrew Godwin
    @andrewgodwin

  2. Hi, I’m
    Andrew Godwin
    • Django core developer
    • Senior Software Engineer at
    • Channels is a thing, I guess?

  3. Distributed systems are HARD
    Seriously. They’re really nasty.

  4. 1. How Computers Hate You
    2. How This Makes Distributed Hard

  6. Non-Binary Failure

  7. “Either it’ll work, or error out nicely!”
    - Very optimistic programmers everywhere

  8. Disks & RAM
    Enterprise SSDs: bit flips after as little as a week unpowered
    Non-ECC memory: 1 bit flip per gigabyte EVERY TWO HOURS
    32GB of RAM is experiencing a bit flip every FOUR MINUTES
    Source: Alvin Cox (2015). “JEDEC SSD Specifications Explained”
    Source: Schroeder, Bianca; Pinheiro, Eduardo; Weber, Wolf-Dietrich (2009). “DRAM Errors in the Wild: A Large-Scale Field Study”
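
The four-minute figure follows from the per-gigabyte rate quoted on the slide. A minimal sketch of that arithmetic, taking the slide's rate at face value; the function and constant names are illustrative only:

```python
# Rate quoted on the slide: 1 bit flip per gigabyte every two hours.
FLIPS_PER_GB_PER_HOUR = 1 / 2


def minutes_between_flips(ram_gb: float) -> float:
    """Average minutes between bit flips for a machine with ram_gb of non-ECC RAM."""
    flips_per_hour = FLIPS_PER_GB_PER_HOUR * ram_gb
    return 60 / flips_per_hour


print(minutes_between_flips(32))  # 3.75, which the slide rounds to four minutes
```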

  9. “The network is either up or down!”
    - Those optimistic programmers again

  10. Networks
    How high will you let latency get?
    How bad does packet loss have to be?
    Do you have bad neighbours?

  11. Time and Space

  12. “The speed of light isn’t that important”
    Ah, the optimists are back again!

  13. National Museum of American History

  14. Australia is real far away from stuff
    Melbourne → US-east-1: 16,000km
    At the speed of light: 50ms
    Minimum possible round-trip, ever: 100ms
    Number of web requests just to open Slack: 96
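
The latency floor can be checked from the distance on the slide. This is a rough sketch that assumes light travelling in a vacuum along the 16,000 km path; the slide rounds the results down to 50 ms and 100 ms, and real fibre links are slower still. The names are illustrative only:

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # speed of light in a vacuum
DISTANCE_KM = 16_000  # Melbourne to US-east-1, figure taken from the slide


def one_way_ms(distance_km: float) -> float:
    """Fastest physically possible one-way trip over distance_km, in milliseconds."""
    return distance_km / SPEED_OF_LIGHT_KM_PER_S * 1000


one_way = one_way_ms(DISTANCE_KM)
round_trip = 2 * one_way

print(f"one way:    {one_way:.1f} ms")
print(f"round trip: {round_trip:.1f} ms")
```

Any request that has to wait for an answer from the other side pays at least the round-trip figure, no matter how fast the servers are.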

  15. You can solve a lot of distributed problems
    by waiting for consensus
    ...but you can be waiting a long time

  16. “Time is the same everywhere”
    I’d like you to meet my friend GENERAL RELATIVITY

  17. Your phone corrects for the time dilation
    of the GPS satellites.
    Oh, and all clocks drift quite a decent amount.
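
One practical consequence of drifting clocks is that timestamps from two machines cannot be compared naively. A minimal sketch, assuming you can put an upper bound on how far any two clocks disagree; the 0.5 second bound and the function name are hypothetical, not from the talk:

```python
# Assumed upper bound on the disagreement between any two machines' clocks.
MAX_CLOCK_SKEW_S = 0.5


def definitely_before(ts_a: float, ts_b: float,
                      max_skew: float = MAX_CLOCK_SKEW_S) -> bool:
    """True only if event A must have happened before event B.

    ts_a and ts_b are timestamps taken on different machines. If they are
    closer together than the skew bound, the real order is unknowable.
    """
    return ts_b - ts_a > max_skew


print(definitely_before(10.0, 11.0))  # True: the gap exceeds the skew bound
print(definitely_before(10.0, 10.2))  # False: too close to call
```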

  19. Communication

  20. Networks are unreliable.
    Wait, where did the optimists go?

  21. You get a choice:
    At-most-once or At-least-once

  22. Well, alright, more of a spectrum.
    [Diagram: a spectrum running from “At most once” (“Basically never”) to “At least once” (“Eleventy copies”), with “Exactly once” in between; labelled “Effort”]

  23. Do you want to maybe do it twice?
    Saving text, liking a tweet
    Or maybe never?
    Charging money, sending email
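
For operations such as charging money, one common way to live with at-least-once delivery is to make the handler idempotent, so that a redelivered message has no further effect. The sketch below is an illustration under assumptions, not something shown in the talk: the class, the idempotency key, and the in-memory storage are all hypothetical, and a real system would need the key store to be durable.

```python
class PaymentProcessor:
    """Toy charge handler that tolerates duplicate deliveries."""

    def __init__(self):
        self.processed = {}  # idempotency key -> receipt already issued
        self.charges = []    # charges actually applied

    def charge(self, idempotency_key: str, account: str, amount_cents: int) -> str:
        # A redelivered message carries the same key: return the old receipt
        # instead of charging again.
        if idempotency_key in self.processed:
            return self.processed[idempotency_key]
        self.charges.append((account, amount_cents))
        receipt = f"receipt-{len(self.charges)}"
        self.processed[idempotency_key] = receipt
        return receipt


processor = PaymentProcessor()
first = processor.charge("order-42", "alice", 500)
retry = processor.charge("order-42", "alice", 500)  # duplicate delivery
print(first == retry, len(processor.charges))  # True 1
```

Saving text or liking a tweet are already safe to repeat, which is why the slide treats them as the easy case.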

  24. Consensus

  25. Your servers all need to agree...
    ... over an unreliable network
    ...with unreliable storage
    ...and different ideas of what time is

  26. It can happen to YOU!
    It doesn’t have to be big and fancy to be distributed.

  27. Fast, Cheap, Good

  28. Partition Tolerant, Available, Consistent

  29. What to do?

  30. Define your interfaces clearly
    You’ll never find bugs without knowing how it’s supposed to work

  31. Product & design can help
    Is there somewhere you can allow inconsistency or lag?

  32. Don’t reinvent the wheel
    Use existing tech, but know its weaknesses

  33. Thanks.
    Andrew Godwin
    @andrewgodwin
    aeracode.org
