
Consensus in Distributed Systems

Daniel Upton
September 25, 2018


We rely ever more heavily on distributed systems in our daily lives, from spending money on a debit card to posting a tweet to our followers (tweeple?). We'll dive into the challenges of building such systems, as identified by the CAP theorem, and take a look at a solution offered by "Raft", the consensus algorithm at the core of projects such as Consul, etcd and CockroachDB.

Image Credits:

Photo of L Peter Deutsch - Parma Recordings (source: https://parmarecordings-news.com/the-inside-story-coro-del-mundo-moto-bello-and-l-peter-deutsch/)

Photo of Eric Brewer - CC BY-SA 4.0 (source: https://en.wikipedia.org/wiki/Eric_Brewer_(scientist)#/media/File:TNW_Con_EU15_-_Eric_Brewer_(scientist)-2.jpg)

Raft Logo - CC 3.0 (source: https://raft.github.io/)

Rest of the Owl Meme - (source: https://www.reddit.com/r/funny/comments/eccj2/how_to_draw_an_owl/)


Transcript

  1. Consensus in Distributed Systems

  2. Brian

  3. Brian (an ideas guy)

  4. Brian (an ideas guy): Twitter for Alsatians?

  5. Brian (an ideas guy): Uber for unicycles?

  6. Brian (an ideas guy)

  7. MySQL Database, Ruby on Rails API

  8. MySQL Database, Ruby on Rails API

  9. Leader (master), Follower (slave): Replication

  10. Leader (master), Follower (slave): Replication

  11. Leader (master): Failover

  12. Leader (master), Follower (slave): Network

  13. Leader (master), Follower (slave): Network (actual physical cables and stuff)

  14. Leader (master), Follower (slave): Network (actual physical cables and stuff)

  15. Fallacies of distributed computing
    #1: The network is reliable.
    L Peter Deutsch (et al.)

  16. Leader (master), Follower (slave), Network
    Order #17623

  17. Leader (master), Follower (slave), Network
    Order #17623

  18. Leader (master), Follower (slave), Network
    Order #17623 / "Details of order #17623 please" / "Huh?"

  19. Options (you've got two of 'em)

  20. Leader (master), Follower (slave), Network
    Order #17623 / "Details of order #17623 please" / "Huh?"
    Option A

  21. Leader (master), Follower (slave), Network
    Order #17623 / "No"
    Option C (!)

  22. CAP theorem (paraphrased)
    Eric Brewer
    When operating in a catastrophically broken or unreliable network,
    a distributed system must choose to either risk returning stale/outdated data
    or refuse to accept writes/updates.

  23. CAP theorem (paraphrased)
    Eric Brewer
    When operating in a catastrophically broken or unreliable network (Partition Tolerance),
    a distributed system must choose to either risk returning stale/outdated data (Availability)
    or refuse to accept writes/updates (Consistency).
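    A minimal sketch of that choice in code (illustrative Go, not code from the talk): while a replica is cut off from the leader, it can either answer reads from possibly stale local data, or refuse until the partition heals.

    // Illustrative only: a replica choosing between Availability and Consistency
    // while partitioned from its leader.
    package replica

    import "errors"

    var ErrUnavailable = errors.New("partitioned from leader: refusing a possibly stale read")

    type Replica struct {
        data        map[string]string
        partitioned bool // true while this node cannot reach the leader
    }

    // ReadAP favours Availability: always answer, accepting the value may be stale.
    func (r *Replica) ReadAP(key string) (string, error) {
        return r.data[key], nil
    }

    // ReadCP favours Consistency: while partitioned, refuse rather than risk a stale answer.
    func (r *Replica) ReadCP(key string) (string, error) {
        if r.partitioned {
            return "", ErrUnavailable
        }
        return r.data[key], nil
    }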

  24. Trade-offs

  25. Raft Consensus Algorithm

  26. Strongly Consistent but also Highly Available

  27. Quorum (and you need an odd number of nodes)
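    A quorum is a strict majority of the cluster, which is why odd cluster sizes are preferred; a quick sketch of the arithmetic:

    package quorum

    // Size returns the number of nodes that must agree: a strict majority.
    func Size(clusterSize int) int {
        return clusterSize/2 + 1
    }

    // Size(3) == 2 and Size(5) == 3, tolerating 1 and 2 failures respectively.
    // Size(4) == 3 still tolerates only 1 failure, so an even-sized cluster
    // raises the quorum without buying any extra fault tolerance.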

  28. (image-only slide)

  29. (image-only slide)

  30. Distributed Log

  31. Distributed Log / State Machine
    State machine:
      best_programming_language = Ruby
      current_year = 2008
      linux_on_desktop = Maybe

  32. Log entry: SET current_year = 2018
    State machine:
      best_programming_language = Ruby
      current_year = 2018
      linux_on_desktop = Maybe

  33. Log entries: SET current_year = 2018, SET best_programming_language = Go
    State machine:
      best_programming_language = Go
      current_year = 2018
      linux_on_desktop = Maybe

  34. Log entries: SET current_year = 2018, SET best_programming_language = Go, DELETE linux_on_desktop
    State machine:
      best_programming_language = Go
      current_year = 2018

  35. (image-only slide)
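    A minimal sketch of the idea (my own illustration in Go, not code from the talk): every node applies the same SET/DELETE log entries, in the same order, to its own key-value state machine, so all nodes end up with identical state.

    package statemachine

    // Entry is one record in the replicated log.
    type Entry struct {
        Op    string // "SET" or "DELETE"
        Key   string
        Value string // ignored for DELETE
    }

    // StateMachine is the materialised view of the log: a simple key-value map.
    type StateMachine struct {
        data map[string]string
    }

    func New() *StateMachine {
        return &StateMachine{data: make(map[string]string)}
    }

    // Apply replays one committed log entry against the state machine.
    func (sm *StateMachine) Apply(e Entry) {
        switch e.Op {
        case "SET":
            sm.data[e.Key] = e.Value
        case "DELETE":
            delete(sm.data, e.Key)
        }
    }

    Replaying SET current_year = 2018, SET best_programming_language = Go and DELETE linux_on_desktop reproduces the state shown on slide 34.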

  36. Getting a majority of servers in a cluster to agree on what's in the log

  37. I like my leadership the same way I like my ☕
    Strong.
    — Raft

  38. (image-only slide)

  39. Leader Election

  40. Random Timers

  41. Monotonically Increasing Terms

  42. Every node starts off as a Follower.
    If a follower doesn't hear from a leader for a while (random timer), it becomes a Candidate.
    If the candidate receives votes from a majority of nodes, it becomes the Leader.

  43. In the case of a split vote, nodes will simply wait for another election.
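    A rough Go sketch of those transitions (illustrative types and an assumed requestVotes helper, not a real implementation):

    package election

    import (
        "math/rand"
        "time"
    )

    type State int

    const (
        Follower State = iota
        Candidate
        Leader
    )

    type Node struct {
        state        State
        currentTerm  int
        clusterSize  int
        heartbeatCh  chan struct{}      // signalled whenever a leader heartbeat arrives
        requestVotes func(term int) int // assumed helper: asks peers for votes, returns votes won
    }

    // randomTimeout staggers elections so nodes rarely time out at the same moment.
    func randomTimeout() time.Duration {
        return 150*time.Millisecond + time.Duration(rand.Intn(150))*time.Millisecond
    }

    // runFollower waits for heartbeats; if none arrive before the timeout fires,
    // the node becomes a candidate.
    func (n *Node) runFollower() {
        select {
        case <-n.heartbeatCh:
            // the leader is alive; remain a follower
        case <-time.After(randomTimeout()):
            n.state = Candidate
        }
    }

    // runCandidate starts a new term and asks peers for votes. A majority wins the
    // election; a split vote just means waiting out another random timeout and trying again.
    func (n *Node) runCandidate() {
        n.currentTerm++                            // terms only ever increase
        votes := 1 + n.requestVotes(n.currentTerm) // vote for ourselves, then ask the others
        if votes >= n.clusterSize/2+1 {
            n.state = Leader
            return
        }
        time.Sleep(randomTimeout()) // split vote: retry in a new term after a fresh timeout
    }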

  44. Leader Election

  45. Leader goes AWOL

  46. Log Replication

  47. 1. Client sends a command to the Leader.
    2. Leader appends an entry to its own log.
    3. Leader issues an RPC (AppendEntries) to each Follower.
    4. Each Follower appends the entry to its log and responds to the Leader to acknowledge it.
    5. Once the entry has been acknowledged by a majority of Followers, the Leader responds to the Client.
    6. Leader issues a heartbeat RPC (AppendEntries) to each Follower, which "commits" the entry and applies it to each Follower's state machine.
    (A rough code sketch of this flow follows below.)
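    A condensed sketch of the leader's side of that flow (illustrative Go; real implementations differ in the details):

    package replication

    // LogEntry is one replicated record.
    type LogEntry struct {
        Term    int
        Index   int
        Command []byte
    }

    // Follower stands in for the RPC interface to one follower node (assumed for illustration).
    type Follower interface {
        AppendEntries(term int, entries []LogEntry) bool
    }

    type Leader struct {
        term      int
        log       []LogEntry
        followers []Follower
    }

    // Propose handles a client command: append it locally (step 2), replicate it via
    // AppendEntries (steps 3-4), and acknowledge the client once a majority has the
    // entry (step 5). Committing on followers happens via the commit index carried
    // on the next heartbeat AppendEntries (step 6), which is omitted here.
    func (l *Leader) Propose(command []byte) bool {
        entry := LogEntry{Term: l.term, Index: len(l.log) + 1, Command: command}
        l.log = append(l.log, entry)

        acks := 1 // the leader's own copy counts towards the majority
        for _, f := range l.followers {
            if f.AppendEntries(l.term, []LogEntry{entry}) {
                acks++
            }
        }

        majority := (len(l.followers)+1)/2 + 1
        return acks >= majority
    }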

  48. Log Replication



  49. Handling Turbulent Network Conditions

  50. Safety Guarantees
    Election Safety: at most one leader will be elected in each term.
    Append-Only Leaders: the leader never deletes or overwrites entries in its own log.
    Log Matching: if two logs contain an entry with the same index and term, that entry stores the same value.
    Leader Completeness: an entry committed in an earlier term will be present in the logs of leaders in all later terms.
    State Machine Safety: if a log entry at a given index has been applied to a server's state machine, no other server will ever apply a different log entry at that index.
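    As an illustration of the Log Matching guarantee (a sketch of the property itself, not of how Raft enforces it): two entries at the same index with the same term must carry the same command.

    package raftcheck

    import "bytes"

    type Entry struct {
        Term    int
        Command []byte
    }

    // matchAt checks the Log Matching property at a single 1-based index: if both
    // logs have an entry there with the same term, the entries must be identical.
    // (Full Raft goes further: all preceding entries are then identical too.)
    func matchAt(a, b []Entry, index int) bool {
        if index > len(a) || index > len(b) {
            return true // one log doesn't have the entry yet, so nothing can conflict
        }
        ea, eb := a[index-1], b[index-1]
        if ea.Term != eb.Term {
            return true // different terms: the property makes no claim here
        }
        return bytes.Equal(ea.Command, eb.Command)
    }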

  51. Preventing Split-Brain
    Five nodes, all in term 1.

  52. Preventing Split-Brain
    Five nodes, all in term 1, with a network partition separating the leader and one follower from the other three nodes.

  53. Preventing Split-Brain
    The three-node majority elects a new leader and moves to term 2; the two partitioned nodes stay in term 1.

  54. Preventing Split-Brain
    Writes arrive on both sides: X=2 at the old leader (term 1) and X=1 at the new leader (term 2).

  55. Preventing Split-Brain
    X=1 is replicated to all three nodes in the majority; X=2 exists only on the two minority nodes and can never reach a quorum.

  56. Preventing Split-Brain
    Unchanged: terms 2/2/2/1/1, values X=1/X=1/X=1/X=2/X=2.

  57. Preventing Split-Brain
    The partition heals, but the two sides still disagree: terms 2/2/2/1/1, values X=1/X=1/X=1/X=2/X=2.

  58. Preventing Split-Brain
    The old leader sends AppendEntries { Term: 1, X = 2 }.

  59. Preventing Split-Brain
    The other nodes reject it: "NOPE. Term is 2 now."

  60. Preventing Split-Brain
    Seeing the higher term, the old leader steps down and adopts term 2.

  61. Preventing Split-Brain
    The new leader sends AppendEntries { Term: 2, X = 1 }, overwriting the old leader's uncommitted X=2.

  62. Preventing Split-Brain
    All five nodes are in term 2 with X=1; the cluster converges without losing a committed write.
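    The mechanism doing the work here is the term check on AppendEntries. A minimal sketch (illustrative names, not a specific library's API):

    package raft

    type LogEntry struct {
        Term    int
        Command []byte
    }

    type AppendEntriesRequest struct {
        Term    int
        Entries []LogEntry
    }

    type AppendEntriesResponse struct {
        Term    int
        Success bool
    }

    type Node struct {
        currentTerm int
        log         []LogEntry
    }

    // HandleAppendEntries rejects requests from stale terms. The response carries the
    // receiver's term, so a deposed leader learns it has been superseded and steps down.
    func (n *Node) HandleAppendEntries(req AppendEntriesRequest) AppendEntriesResponse {
        if req.Term < n.currentTerm {
            // "NOPE. Term is 2 now."
            return AppendEntriesResponse{Term: n.currentTerm, Success: false}
        }
        if req.Term > n.currentTerm {
            n.currentTerm = req.Term // adopt the newer term
        }
        // Simplified: real Raft also checks prevLogIndex/prevLogTerm before appending,
        // and truncates any conflicting (uncommitted) entries.
        n.log = append(n.log, req.Entries...)
        return AppendEntriesResponse{Term: n.currentTerm, Success: true}
    }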


  63. Snapshots / Log Compaction
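    The log can't grow forever, so nodes periodically snapshot the state machine and discard the entries the snapshot already covers. A minimal sketch (illustrative, not a particular library's API):

    package snapshot

    type Entry struct {
        Index   int
        Term    int
        Command []byte
    }

    type Snapshot struct {
        LastIncludedIndex int
        LastIncludedTerm  int
        State             map[string]string // the state machine's contents at that index
    }

    // compact captures the state machine as of lastApplied and truncates the log,
    // keeping only the entries the snapshot does not cover.
    func compact(log []Entry, lastApplied int, state map[string]string) (Snapshot, []Entry) {
        snap := Snapshot{State: state}
        remaining := log
        for i, e := range log {
            if e.Index == lastApplied {
                snap.LastIncludedIndex = e.Index
                snap.LastIncludedTerm = e.Term
                remaining = append([]Entry(nil), log[i+1:]...)
                break
            }
        }
        return snap, remaining
    }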


  64. Thanks!
    https://raft.github.io/raft.pdf
    http://thesecretlivesofdata.com/raft/
