
Consensus in Distributed Systems

Daniel Upton
September 25, 2018


We rely ever more heavily on distributed systems in our daily lives, from spending money on a debit card to posting a tweet to our followers (tweeple?). We’ll dive into the challenges of building such systems, as identified by the CAP theorem, and take a look at a solution offered by “Raft”, the consensus algorithm at the core of projects such as Consul, etcd and CockroachDB.

Image Credits:

Photo of L Peter Deutsch - Parma Recordings (source: https://parmarecordings-news.com/the-inside-story-coro-del-mundo-moto-bello-and-l-peter-deutsch/)

Photo of Eric Brewer - CC BY-SA 4.0 (source: https://en.wikipedia.org/wiki/Eric_Brewer_(scientist)#/media/File:TNW_Con_EU15_-_Eric_Brewer_(scientist)-2.jpg)

Raft Logo - CC 3.0 (source: https://raft.github.io/)

Rest of the Owl Meme - (source: https://www.reddit.com/r/funny/comments/eccj2/how_to_draw_an_owl/)


Transcript

  1. Consensus
    in
    Distributed
    Systems



  2. Brian
    (an ideas guy)



  3. Brian
    (an ideas guy)
    Twitter
    for
    Alsatians?



  4. Brian
    (an ideas guy)
    Uber for
    unicycles?



  5. Brian
    (an ideas guy)


  6. MySQL
    Database
    Ruby on Rails
    API


  7. MySQL
    Database
    Ruby on Rails
    API


  8. Leader
    (master)
    Follower
    (slave)
    Replication


  9. Leader
    (master)
    Follower
    (slave)
    Replication


  10. Leader
    (master)
    Failover


  11. Leader
    (master)
    Follower
    (slave)
    Network


  12. Leader
    (master)
    Follower
    (slave)
    Network
    (actual physical cables and stuff)


  13. Leader
    (master)
    Follower
    (slave)
    Network
    (actual physical cables and stuff)


  14. Fallacies of distributed computing
    #1 The network is
    reliable.
    L Peter Deutsch (et al)


  15. Leader
    (master)
    Follower
    (slave)
    Network

    Order
    #17623


  16. Leader
    (master)
    Follower
    (slave)
    Network

    Order #17623


  17. Leader
    (master)
    Follower
    (slave)
    Network


    Order #17623
    Details of order #17623 please
    Huh?


  18. Options
    You’ve got two of ‘em


  19. Leader
    (master)
    Follower
    (slave)
    Network


    Order #17623
    Details of order #17623 please
    Huh?
    Option A


  20. Leader
    (master)
    Follower
    (slave)
    Network

    Order
    #17623
    No
    Option C (!)


  21. CAP theorem (paraphrased)
    Eric Brewer
    When operating in a catastrophically broken or
    unreliable network
    a distributed system must choose to either risk returning
    stale/outdated data
    or refuse to accept writes/updates


  22. CAP theorem (paraphrased)
    Eric Brewer
    When operating in a catastrophically broken or
    unreliable network (Partition Tolerance)
    a distributed system must choose to either risk returning
    stale/outdated data (Availability)
    or refuse to accept writes/updates (Consistency)


  23. Raft
    Consensus
    Algorithm


  24. Strongly Consistent
    but also
    Highly Available


  25. Quorum
    (and you need an odd number of nodes)

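To make the quorum rule concrete: a quorum is a strict majority, so a cluster of n nodes needs ⌊n/2⌋ + 1 of them to agree. A minimal sketch in Go (the deck's later example language); the `quorum` helper is illustrative, not something from the talk:

```go
package main

import "fmt"

// quorum returns the smallest number of nodes that forms a strict
// majority of a cluster of size n.
func quorum(n int) int {
	return n/2 + 1
}

func main() {
	for _, n := range []int{3, 4, 5} {
		fmt.Printf("cluster of %d: quorum = %d, tolerates %d failure(s)\n",
			n, quorum(n), n-quorum(n))
	}
}
```

This is also why odd cluster sizes are preferred: four nodes tolerate the same single failure as three, so the extra node buys nothing.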

  26. Distributed
    Log


  27. best_programming_language = Ruby
    current_year = 2008
    linux_on_desktop = Maybe
    State Machine
    Distributed Log


  28. best_programming_language = Ruby
    current_year = 2018
    linux_on_desktop = Maybe
    State Machine
    current_year = 2018
    SET


  29. best_programming_language = Go
    current_year = 2018
    linux_on_desktop = Maybe
    State Machine
    best_programming_language = Go
    SET
    current_year = 2018
    SET


  30. best_programming_language = Go
    current_year = 2018
    State Machine
    best_programming_language = Go
    SET
    current_year = 2018
    SET
    linux_on_desktop
    DELETE

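The SET/DELETE builds above translate directly into code. Below is a minimal, hypothetical sketch in Go (not from the talk) of a key-value state machine having committed log entries applied to it; because every node applies the same entries in the same order, every node ends up with the same state:

```go
package main

import "fmt"

// Entry is one command in the replicated log.
type Entry struct {
	Op    string // "SET" or "DELETE"
	Key   string
	Value string
}

// StateMachine is the key-value store the log is applied to.
type StateMachine map[string]string

// Apply executes a single committed log entry against the state machine.
func (sm StateMachine) Apply(e Entry) {
	switch e.Op {
	case "SET":
		sm[e.Key] = e.Value
	case "DELETE":
		delete(sm, e.Key)
	}
}

func main() {
	log := []Entry{
		{Op: "SET", Key: "current_year", Value: "2018"},
		{Op: "SET", Key: "best_programming_language", Value: "Go"},
		{Op: "DELETE", Key: "linux_on_desktop"},
	}

	sm := StateMachine{
		"best_programming_language": "Ruby",
		"current_year":              "2008",
		"linux_on_desktop":          "Maybe",
	}
	for _, e := range log {
		sm.Apply(e)
	}
	fmt.Println(sm) // map[best_programming_language:Go current_year:2018]
}
```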

  31. Getting a majority of servers in a cluster to agree on
    What’s in the
    log


  32. I like my leadership the same way I like my ☕
    Strong.
    — Raft


  33. Leader
    Election



  34. Random
    Timers


  35. Monotonically
    Increasing Terms


  36. every node starts off as a
    Follower
    if a follower doesn’t hear from a leader for a while (random timer) it becomes a
    Candidate
    if the candidate receives votes from a majority of nodes it will become the
    Leader

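A rough sketch of those transitions, with the randomized election timeout and the monotonically increasing term from the earlier slides. The names (`Node`, `tick`, `onVotes`) are illustrative and not taken from any real implementation:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

type Role int

const (
	Follower Role = iota
	Candidate
	Leader
)

type Node struct {
	role          Role
	currentTerm   int       // monotonically increasing
	lastHeartbeat time.Time // last time we heard from a leader
	clusterSize   int
}

// electionTimeout is randomized so that two followers are unlikely to
// become candidates at exactly the same moment.
func electionTimeout() time.Duration {
	return 150*time.Millisecond + time.Duration(rand.Intn(150))*time.Millisecond
}

// tick: a follower that hasn't heard from a leader within the timeout
// becomes a candidate and starts an election in a new, higher term.
func (n *Node) tick(now time.Time, timeout time.Duration) {
	if n.role == Follower && now.Sub(n.lastHeartbeat) > timeout {
		n.role = Candidate
		n.currentTerm++
		fmt.Printf("became candidate for term %d\n", n.currentTerm)
	}
}

// onVotes: a candidate that gathers votes from a strict majority of the
// cluster (including its own vote) becomes the leader.
func (n *Node) onVotes(votes int) {
	if n.role == Candidate && votes > n.clusterSize/2 {
		n.role = Leader
		fmt.Printf("won the election for term %d\n", n.currentTerm)
	}
}

func main() {
	n := &Node{role: Follower, currentTerm: 1, clusterSize: 5,
		lastHeartbeat: time.Now().Add(-time.Second)}
	n.tick(time.Now(), electionTimeout())
	n.onVotes(3) // 3 of 5 is a majority
}
```

On a split vote no candidate reaches a majority, so (as the next slide says) nothing happens until another randomized timeout fires and a fresh election starts in a higher term.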

  37. In the case of a split-vote nodes will simply
    Wait for another
    election


  38. Leader Election


  39. Leader goes AWOL


  40. Log
    Replication


  41. 1. Client sends a command to the Leader.
    2. Leader appends an entry to its own log.
    3. Leader issues an RPC (AppendEntries) to each Follower.
    4. Follower appends the entry to its log and responds to the Leader to
    acknowledge the entry.
    5. Once the entry has been acknowledged by a majority of Followers the
    Leader responds to the Client.
    6. Leader issues a heartbeat RPC (AppendEntries) to each Follower which
    “commits” the entry and applies it to each Follower’s state machine.

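A condensed, leader-side sketch of those six steps, in Go with illustrative names; real implementations (hashicorp/raft, etcd/raft) are asynchronous and considerably more involved:

```go
package main

import "fmt"

// AppendEntriesRequest is the RPC the leader sends to replicate an entry;
// sent without entries it doubles as the heartbeat.
type AppendEntriesRequest struct {
	Term  int
	Entry string
}

// sendAppendEntries stands in for a real RPC to one follower; here every
// follower simply appends the entry and acknowledges it.
func sendAppendEntries(follower int, req AppendEntriesRequest) bool {
	fmt.Printf("follower %d appended %q (term %d)\n", follower, req.Entry, req.Term)
	return true
}

// replicate walks through the leader-side steps from the slide.
func replicate(term int, log *[]string, entry string, followers []int) bool {
	*log = append(*log, entry) // 2. leader appends to its own log

	acks := 1 // the leader's own copy counts toward the majority
	for _, f := range followers {
		if sendAppendEntries(f, AppendEntriesRequest{Term: term, Entry: entry}) {
			acks++ // 4. follower acknowledged the entry
		}
	}

	// 5. once a majority has the entry it is safe to answer the client;
	// 6. the commit is then announced on the next heartbeat (not shown).
	clusterSize := len(followers) + 1
	return acks > clusterSize/2
}

func main() {
	var log []string
	ok := replicate(2, &log, "SET current_year = 2018", []int{1, 2, 3, 4})
	fmt.Println("committed:", ok, "log:", log)
}
```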

  42. Log Replication



  43. Handling Turbulent
    Network Conditions


  44. Safety Guarantees
    Election Safety
    At most one leader can be elected in any given term.
    Leader Append-Only
    A leader never deletes or overwrites entries in its own log.
    Log Matching
    Any two logs containing an entry with the same index and term will contain the same value.
    Leader Completeness
    An entry committed in an earlier term will be present in the logs of leaders in all later terms.
    State Machine Safety
    If a log entry at a given index has been applied to a server’s state machine, no other server will ever apply a different log entry at the same index.

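The Log Matching guarantee is maintained by a consistency check inside AppendEntries: the leader includes the index and term of the entry that precedes the new ones, and a follower rejects the RPC if its own log disagrees at that position, forcing the leader to retry from an earlier point. A small sketch of just that check (the names are mine, not the paper's):

```go
package main

import "fmt"

type Entry struct {
	Term    int
	Command string
}

// consistent reports whether a follower's log has an entry at prevIndex
// with term prevTerm; if not, the AppendEntries call carrying these
// values is rejected.
func consistent(log []Entry, prevIndex, prevTerm int) bool {
	if prevIndex < 0 {
		return true // appending from the very start of the log
	}
	if prevIndex >= len(log) {
		return false // the follower's log is too short
	}
	return log[prevIndex].Term == prevTerm
}

func main() {
	followerLog := []Entry{
		{Term: 1, Command: "SET x=1"},
		{Term: 1, Command: "SET y=2"},
	}

	fmt.Println(consistent(followerLog, 1, 1)) // true: logs agree at index 1
	fmt.Println(consistent(followerLog, 1, 2)) // false: same index, different term
	fmt.Println(consistent(followerLog, 5, 2)) // false: follower is missing entries
}
```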

  45. Preventing Split-Brain
    (five nodes, all at term 1)

  46. Preventing Split-Brain
    (five nodes, all at term 1)

  47. Preventing Split-Brain
    (node terms shown: 2, 2, 2, 1, 1)

  48. Preventing Split-Brain
    (node terms shown: 2, 2, 2, 1, 1; values X=1 and X=2)

  49. Preventing Split-Brain
    (node terms shown: 2, 2, 2, 1, 1; X values shown: 1, 1, 1, 2, 2)

  50. Preventing Split-Brain
    (node terms shown: 2, 2, 2, 1, 1; X values shown: 1, 1, 1, 2, 2)

  51. Preventing Split-Brain
    (node terms shown: 2, 2, 2, 1, 1; X values shown: 1, 1, 1, 2, 2)

  52. Preventing Split-Brain
    (node terms shown: 2, 2, 2, 1, 1; X values shown: 1, 1, 1, 2, 2)
    AppendEntries
    Term: 1
    X = 2

  53. Preventing Split-Brain
    (node terms shown: 2, 2, 2, 1, 1; X values shown: 1, 1, 1, 2, 2)
    NOPE.
    Term is 2 now

  54. Preventing Split-Brain
    (node terms shown: 2, 2, 2, 1, 2; X values shown: 1, 1, 1, 2)

  55. Preventing Split-Brain
    (node terms shown: 2, 2, 2, 1, 2; X values shown: 1, 1, 1, 2)
    AppendEntries
    Term: 2
    X = 1

  56. Preventing Split-Brain
    (all five nodes at term 2; X=1 everywhere)

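The exchange on slides 52 and 53 boils down to two rules: a node rejects any AppendEntries that carries an older term, and a leader that sees a higher term in a reply steps down and adopts it. A hedged sketch (illustrative names, not from a real library):

```go
package main

import "fmt"

type AppendEntriesReply struct {
	Term    int
	Success bool
}

// handleAppendEntries is the follower side: a request from an older term
// is refused, and the reply carries the follower's higher current term.
func handleAppendEntries(currentTerm, requestTerm int) AppendEntriesReply {
	if requestTerm < currentTerm {
		return AppendEntriesReply{Term: currentTerm, Success: false} // "NOPE. Term is 2 now"
	}
	return AppendEntriesReply{Term: requestTerm, Success: true}
}

// onReply is the old leader's side: a higher term in a reply means another
// election has happened, so it steps down to follower and adopts the term.
func onReply(myTerm int, reply AppendEntriesReply) (newTerm int, stillLeader bool) {
	if reply.Term > myTerm {
		return reply.Term, false
	}
	return myTerm, true
}

func main() {
	// The partitioned old leader (term 1) tries to replicate X=2 to a node
	// that has already moved on to term 2.
	reply := handleAppendEntries(2, 1)
	fmt.Printf("follower reply: %+v\n", reply)

	term, leader := onReply(1, reply)
	fmt.Printf("old leader now: term=%d stillLeader=%v\n", term, leader)
}
```

Because the old leader's X=2 entry was never acknowledged by a majority, it was never committed, and it is safely overwritten by the new leader's X=1 once replication resumes (slide 56).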


  57. Snapshots / Log
    Compaction

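The log grows without bound, so nodes periodically snapshot their state machine and throw away the entries the snapshot already covers. A minimal sketch of that idea; the `Snapshot` struct and `compact` helper are hypothetical:

```go
package main

import "fmt"

// Snapshot captures the state machine as of a particular log index so
// that every entry up to and including that index can be discarded.
type Snapshot struct {
	LastIncludedIndex int
	State             map[string]string
}

// compact snapshots the current state and returns the suffix of the log
// that still has to be kept.
func compact(state map[string]string, log []string, lastIndex int) (Snapshot, []string) {
	copied := make(map[string]string, len(state))
	for k, v := range state {
		copied[k] = v
	}
	return Snapshot{LastIncludedIndex: lastIndex, State: copied}, log[lastIndex+1:]
}

func main() {
	state := map[string]string{"current_year": "2018", "best_programming_language": "Go"}
	log := []string{
		"SET current_year = 2018",
		"SET best_programming_language = Go",
		"DELETE linux_on_desktop",
		"SET current_year = 2019",
	}

	// Entries 0-2 are already reflected in state, so snapshot at index 2
	// and keep only the entries the snapshot doesn't cover.
	snap, remaining := compact(state, log, 2)
	fmt.Printf("snapshot at index %d: %v\n", snap.LastIncludedIndex, snap.State)
	fmt.Printf("remaining log: %v\n", remaining)
}
```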


  58. Thanks!
    https://raft.github.io/raft.pdf
    http://thesecretlivesofdata.com/raft/
