Raft: The Understandable Distributed Consensus Protocol

benbjohnson
September 20, 2013

Raft presentation at Strange Loop 2013.

Video: http://www.infoq.com/presentations/raft

This work is licensed under a Creative Commons Attribution 4.0 International License.

Transcript

  1. Raft
    The Understandable
    Distributed Consensus Protocol
    @benbjohnson

  2. What is
    Distributed Consensus?

  3. Distributed = Many nodes
    Consensus = Agreement

  4. Distributed = Many nodes
    Consensus = Agreement

  5. Leader Election
    Data Replication
    Distributed Locks

  6. A Really Short History Of
    Distributed Consensus Protocols

  7. A Really Short History Of
    Distributed Consensus Protocols
    Paxos (1989)

  8. Paxos In A Nutshell

  9. Paxos In A Nutshell
    Client

  10. Paxos In A Nutshell
    Client Proposer
    Client requests change to system

  11. Paxos In A Nutshell
    Client Proposer
    Acceptor
    Acceptor
    Acceptor
    Proposer tells Acceptors to get ready for a change
    Ready?

  12. Paxos In A Nutshell
    Client Proposer
    Acceptor
    Acceptor
    Acceptor
    Acceptors confirm to Proposer that they’re ready
    Hell yeah!

  13. Paxos In A Nutshell
    Client Proposer
    Acceptor
    Acceptor
    Acceptor
    Proposer sends change to Acceptors
    Here you go

  14. Paxos In A Nutshell
    Client Proposer
    Acceptor
    Acceptor
    Acceptor
    Learner
    Learner
    Learner
    Learner
    Acceptors propagate change to Learners

  15. Paxos In A Nutshell
    Client Proposer
    Acceptor
    Acceptor
    Acceptor
    Learner
    Learner
    Learner
    Learner
    Proposer is now recognized as leader
    Leader

  16. Paxos In A Nutshell
    Client Proposer
    Acceptor
    Acceptor
    Acceptor
    Learner
    Learner
    Learner
    Learner
    Repeat for every new change to the system
    Leader
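
The two-phase exchange on the Paxos slides above ("Ready?", then "Here you go") can be sketched in a few lines of Python. This is a rough illustration rather than anything shown in the talk: the `Acceptor` class, the `propose` helper, and the use of plain integers as proposal numbers are assumptions made for the example, and learners and leader recognition are left out.

```python
from typing import List, Optional, Tuple

Accepted = Optional[Tuple[int, str]]  # (proposal number, value), or None


class Acceptor:
    """Hypothetical single-value Paxos acceptor; all names are illustrative."""

    def __init__(self) -> None:
        self.promised = -1  # highest proposal number promised so far
        self.accepted: Accepted = None

    def prepare(self, n: int) -> Tuple[bool, Accepted]:
        """Handle "Ready?": promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n: int, value: str) -> bool:
        """Handle "Here you go": accept unless a higher number was promised."""
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False


def propose(acceptors: List[Acceptor], n: int, value: str) -> bool:
    """Run both phases; succeed only if a majority answers each one."""
    quorum = len(acceptors) // 2 + 1
    promises = [a.prepare(n) for a in acceptors]
    granted = [prior for ok, prior in promises if ok]
    if len(granted) < quorum:
        return False
    # A proposer must adopt the highest-numbered value already accepted, if any.
    prior = [p for p in granted if p is not None]
    if prior:
        value = max(prior)[1]
    return sum(a.accept(n, value) for a in acceptors) >= quorum
```

Calling `propose([Acceptor(), Acceptor(), Acceptor()], 1, "change")` walks through the same prepare and accept rounds as the slides. A second proposal that reuses number 1 is refused, because every acceptor has already promised that number.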

  17. Fun Raft Facts

  18. Diego Ongaro
    Ph.D. Student
    Stanford University

  19. John Ousterhout
    Professor of Computer Science
    Stanford University
    Diego Ongaro
    Ph.D. Student
    Stanford University

  20. 28 Implementations
    across various languages

  21. In Commercial Use
    CoreOS
    (etcd)
    go-raft

  22. Three Roles:

  23. The Follower

  24. The Candidate

  25. High-Level Example:

  26. C
    F
    F
    Vote for me!
    Vote for me!

  27. L
    F
    F
    Log entries
    Log entries

  28. L
    F
    F
    Heartbeats
    Heartbeats

  29. X
    C
    F
    Vote for me!

  30. X
    L
    F
    Log Entries &
    Heartbeats
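
The high-level example above moves nodes between the follower (F), candidate (C), and leader (L) roles. One way to picture those moves is a small transition table. The sketch below is only an illustration: the `Role` enum, the event names, and the `next_role` helper are assumed for the example and do not come from the talk or from any Raft library.

```python
from enum import Enum


class Role(Enum):
    FOLLOWER = "F"
    CANDIDATE = "C"
    LEADER = "L"


# Role changes keyed by (current role, event). Event names are assumptions.
TRANSITIONS = {
    (Role.FOLLOWER, "election_timeout"): Role.CANDIDATE,   # no heartbeat arrived
    (Role.CANDIDATE, "won_majority"): Role.LEADER,         # enough votes granted
    (Role.CANDIDATE, "election_timeout"): Role.CANDIDATE,  # split vote, try again
    (Role.CANDIDATE, "saw_leader"): Role.FOLLOWER,         # someone else won
    (Role.LEADER, "saw_higher_term"): Role.FOLLOWER,       # a newer leader exists
}


def next_role(role: Role, event: str) -> Role:
    """Return the new role, or the current role if the event does not apply."""
    return TRANSITIONS.get((role, event), role)
```

For instance, when the leader fails, a follower that stops receiving heartbeats goes through `election_timeout` and then `won_majority`, ending up as the new leader.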

  31. Leader Election

  32. Leader Election
    Nodes: F1, F1, F1
    0ms

  33. Leader Election
    Nodes: C2, F1, F1
    Messages: Request Vote, Request Vote (Fails)
    150ms
    One follower becomes a candidate after an election timeout and requests votes

  34. Leader Election
    Nodes: C2, F2, F1
    Messages: Grant Vote
    155ms
    Candidate receives one vote from a peer and one vote from self

  35. Leader Election
    Nodes: L2, F2, F2
    156ms
    Two votes is a majority so candidate becomes leader
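
The majority rule used on this slide can be written down directly. This is a minimal sketch with assumed helper names (`majority`, `wins_election`), not code from the talk.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of nodes that is more than half of the cluster."""
    return cluster_size // 2 + 1


def wins_election(votes: int, cluster_size: int) -> bool:
    """A candidate wins once its votes (including its own) reach a majority."""
    return votes >= majority(cluster_size)
```

With three nodes, `wins_election(2, 3)` is true, which matches the slide. With four nodes, two votes are not enough, which is how a split vote can arise.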

  36. Leader Election
    (Split Vote)

  37. Leader Election
    Nodes: F1, F1, F1, F1
    0ms

  38. Leader Election
    Nodes: C2, F1, C2, F1
    Messages: Request Vote (x2)
    150ms
    Two followers become candidates simultaneously and begin requesting votes

  39. Leader Election
    Nodes: C2, F2, C2, F2
    Messages: Vote Granted (x2)
    155ms
    Each candidate receives a vote from themselves and from one peer

  40. Leader Election
    Nodes: C2, F2, C2, F2
    Messages: Request Vote
    156ms
    Each candidate requests a vote from a peer who has already voted

  41. Leader Election
    Nodes: C2, F2, C2, F2
    Messages: Vote Denied
    160ms
    Vote requests are denied because the follower has already voted

  42. Leader Election
    Nodes: C2, F2, C2, F2
    Messages: Request Vote
    161ms
    Candidates try to request votes from each other

  43. Leader Election
    Nodes: C2, F2, C2, F2
    Messages: Vote Denied
    165ms
    Vote requests are denied because candidates voted for themselves
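
The denials on the last few slides follow from one rule: a node grants at most one vote per term. The sketch below illustrates that rule under stated assumptions. The `Voter` class and its fields are hypothetical names, and the log comparison that a full Raft implementation also performs before granting a vote is left out.

```python
from typing import Optional


class Voter:
    """Hypothetical per-node voting state; names are illustrative."""

    def __init__(self) -> None:
        self.current_term = 0
        self.voted_for: Optional[str] = None

    def request_vote(self, term: int, candidate_id: str) -> bool:
        if term < self.current_term:
            return False  # request comes from an older term
        if term > self.current_term:
            # A newer term starts with a fresh, unused vote.
            self.current_term = term
            self.voted_for = None
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False  # already voted for someone else in this term
```

A candidate counts as having voted for itself, so a request from the other candidate in the same term is denied in the same way.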

  44. Leader Election
    Nodes: C2, F2, C2, F2
    200ms
    Candidates wait for a randomized election timeout to occur (150ms - 300ms)
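
Randomizing the timeout makes it unlikely that both candidates retry at the same instant. A minimal sketch, assuming a helper named `election_timeout_ms` and the 150ms to 300ms range quoted on the slide:

```python
import random


def election_timeout_ms(rng: random.Random, low: float = 150.0, high: float = 300.0) -> float:
    """Pick a fresh timeout uniformly from the range given on the slide."""
    return rng.uniform(low, high)


# Each node draws its own value, so one of them usually times out first.
rng = random.Random()
timeouts = [election_timeout_ms(rng) for _ in range(4)]
```

The node with the smallest draw starts the next term first and can collect votes before the others time out.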

  45. Leader Election
    Nodes: C2, F2, C2, F2
    250ms
    Still waiting...

  46. Leader Election
    Nodes: C3, F2, C2, F2
    Messages: Request Vote (x2)
    300ms
    One candidate begins election term #3

  47. Leader Election
    Nodes: L3, F3, C2, F3
    Messages: Vote Granted (x2)
    305ms
    Candidate receives its own vote and two peer votes, so it becomes leader for election term #3

  48. Leader Election
    Nodes: L3, F3, C3, F3
    Messages: Request Vote (x2)
    306ms
    Second candidate doesn’t know first candidate won the term and begins requesting votes

  49. Leader Election
    Nodes: L3, F3, C3, F3
    Messages: Vote Denied (x2)
    306ms
    Peers already voted so votes are denied

  50. Leader Election
    Nodes: L3, F3, F3, F3
    310ms
    Leader notifies peers of election and other candidate steps down
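
Stepping down comes down to comparing terms when a leader's message arrives. The following sketch uses an assumed function name and plain strings for roles. It only shows the term comparison, not a full message handler.

```python
from typing import Tuple


def on_append_entries(role: str, current_term: int, leader_term: int) -> Tuple[str, int, bool]:
    """Return (new_role, new_term, accepted) after hearing from a leader."""
    if leader_term < current_term:
        # A stale leader is rejected and the receiver keeps its role.
        return role, current_term, False
    # A leader with an equal or newer term is legitimate: follow it.
    return "follower", leader_term, True
```

Here the losing candidate in term 3 receives a message from the term 3 leader, so `on_append_entries("candidate", 3, 3)` returns the follower role.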

  51. Log Replication

  52. Log Replication
    L1, F1, F1: log empty; state <“”>
    0ms

  53. Log Replication
    L1: log 1 “sally”; state <“”>
    F1, F1: log empty; state <“”>
    10ms
    A new uncommitted log entry is added to the leader

  54. Log Replication
    L1, F1, F1: log 1 “sally”; state <“”>
    Messages: Append Entries (x2)
    20ms
    At the next heartbeat, the log entry is replicated to followers

  55. Log Replication
    L1: log 1 “sally”; state <“sally”>
    F1, F1: log 1 “sally”; state <“”>
    Messages: OK
    22ms
    A majority of nodes have written the log entry to disk, so it becomes committed

  56. Log Replication
    L1: log 1 “sally”; state <“sally”>
    F1, F1: log 1 “sally”; state <“”>
    Messages: OK
    25ms

  57. Log Replication
    L1, F1, F1: log 1 “sally”; state <“sally”>
    Messages: Append Entries (x2)
    40ms
    At the next heartbeat, the leader notifies followers of updated committed entries

  58. Log Replication
    L1, F1, F1: log 1 “sally”; state <“sally”>
    50ms

  59. Log Replication
    L1, F1, F1: log 1 “sally”; state <“sally”>
    Messages: Append Entries (x2)
    60ms
    At the next heartbeat, no new log information is sent

  60. Log Replication
    L1, F1, F1: log 1 “sally”; state <“sally”>
    70ms

  61. Log Replication
    L1: log 1 “sally”, 2 “bob”; state <“sally”>
    F1, F1: log 1 “sally”; state <“sally”>
    75ms
    A new uncommitted log entry is added to the leader

  62. Log Replication
    L1, F1, F1: log 1 “sally”, 2 “bob”; state <“sally”>
    Messages: Append Entries (x2)
    80ms
    At the next heartbeat, the entry is replicated to the followers

  63. Log Replication
    L1: log 1 “sally”, 2 “bob”; state <“bob”>
    F1, F1: log 1 “sally”, 2 “bob”; state <“sally”>
    Messages: OK (x2)
    82ms
    The entry is committed once the followers acknowledge the request
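
On the leader's side, "committed" can be computed from how far each node's log is known to match. The sketch below is an illustration under assumptions: the `commit_index` helper and the `match_index` list (one entry per node, counting the leader itself) are names chosen for the example, and the extra condition a full Raft implementation applies, that the entry belongs to the leader's current term, is omitted.

```python
from typing import List


def commit_index(match_index: List[int]) -> int:
    """Highest log index that is stored on a majority of nodes."""
    ordered = sorted(match_index, reverse=True)
    # The value at position n // 2 is held by at least n // 2 + 1 nodes.
    return ordered[len(match_index) // 2]
```

With three nodes that all hold entry 2, `commit_index([2, 2, 2])` is 2. If only two of five nodes hold entry 2, as in `commit_index([2, 2, 1, 1, 1])`, the result stays at 1 because there is no quorum for the newer entry.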

  64. Log Replication
    L1, F1, F1: log 1 “sally”, 2 “bob”; state <“bob”>
    Messages: Append Entries (x2)
    100ms
    At the next heartbeat, the leader notifies the followers of the new committed entry

  65. Log Replication
    (with Network Partitions)

  66. Log Replication
    L1, F1, F1, F1, F1: log empty; state <“”>
    0ms

  67. Log Replication
    L1: log 1 “sally”; state <“”>
    F1, F1, F1, F1: log empty; state <“”>
    10ms
    A new uncommitted log entry is added to the leader

  68. Log Replication
    L1, F1, F1, F1, F1: log 1 “sally”; state <“”>
    Messages: Append Entries
    20ms
    On the next heartbeat, the entry is replicated to the followers

  69. Log Replication
    L1: log 1 “sally”; state <“sally”>
    F1, F1, F1, F1: log 1 “sally”; state <“”>
    Messages: OK
    25ms
    The followers acknowledge the entry and the entry is committed

  70. Log Replication
    L1, F1, F1, F1, F1: log 1 “sally”; state <“sally”>
    Messages: Append Entries
    40ms
    On the next heartbeat, the committed entry is replicated to the followers

  71. Log Replication
    L1, F1, F1, F1, F1: log 1 “sally”; state <“sally”>
    50ms

  72. Log Replication
    L1, F1, F1, F1, F1: log 1 “sally”; state <“sally”>
    60ms
    A network partition makes a majority of nodes inaccessible from the leader

  73. Log Replication
    L1: log 1 “sally”, 2 “bob”; state <“sally”>
    F1, F1, F1, F1: log 1 “sally”; state <“sally”>
    70ms
    A new log entry is added to the leader

  74. Log Replication
    L1, F1: log 1 “sally”, 2 “bob”; state <“sally”>
    F1, F1, F1: log 1 “sally”; state <“sally”>
    Messages: Append Entries
    80ms
    The leader replicates the entry to the only accessible follower

  75. Log Replication
    L1, F1: log 1 “sally”, 2 “bob”; state <“sally”>
    F1, F1, F1: log 1 “sally”; state <“sally”>
    Messages: OK
    85ms
    The follower acknowledges the entry but there is not a quorum

  76. Log Replication
    L1, F1: log 1 “sally”, 2 “bob”; state <“sally”>
    F1, F1, F1: log 1 “sally”; state <“sally”>
    90ms

  77. Log Replication
    L1, F1: log 1 “sally”, 2 “bob”; state <“sally”>
    C2, F1, F1: log 1 “sally”; state <“sally”>
    Messages: Request Vote
    190ms
    After an election timeout, one disconnected follower becomes a candidate

  78. Log Replication
    L1, F1: log 1 “sally”, 2 “bob”; state <“sally”>
    L2, F2, F2: log 1 “sally”; state <“sally”>
    Messages: Vote Granted
    195ms
    The candidate receives a majority of votes and becomes a leader

  79. Log Replication
    L1, F1: log 1 “sally”, 2 “bob”; state <“sally”>
    L2, F2, F2: log 1 “sally”; state <“sally”>
    200ms

  80. Log Replication
    L1, F1: log 1 “sally”, 2 “bob”; state <“sally”>
    L2: log 1 “sally”, 2 “tom”; state <“sally”>
    F2, F2: log 1 “sally”; state <“sally”>
    210ms
    A log entry is added to the new leader

  81. Log Replication
    L1, F1: log 1 “sally”, 2 “bob”; state <“sally”>
    L2, F2, F2: log 1 “sally”, 2 “tom”; state <“sally”>
    Messages: Append Entries
    220ms
    The log entry is replicated to the accessible followers

  82. Log Replication
    L1, F1: log 1 “sally”, 2 “bob”; state <“sally”>
    L2: log 1 “sally”, 2 “tom”; state <“tom”>
    F2, F2: log 1 “sally”, 2 “tom”; state <“sally”>
    Messages: OK
    225ms
    A majority of nodes acknowledge the entry so it becomes committed

  83. Log Replication
    L1, F1: log 1 “sally”, 2 “bob”; state <“sally”>
    L2, F2, F2: log 1 “sally”, 2 “tom”; state <“tom”>
    Messages: Append Entries
    240ms
    On the next heartbeat, the followers are notified the entry is committed

  84. Log Replication
    L1, F1: log 1 “sally”, 2 “bob”; state <“sally”>
    L2, F2, F2: log 1 “sally”, 2 “tom”; state <“tom”>
    250ms

  85. Log Replication
    L1, F1: log 1 “sally”, 2 “bob”; state <“sally”>
    L2, F2, F2: log 1 “sally”, 2 “tom”; state <“tom”>
    255ms
    The network recovers and there is no longer a partition

  86. Log Replication
    L1, F1: log 1 “sally”, 2 “bob”; state <“sally”>
    L2, F2, F2: log 1 “sally”, 2 “tom”; state <“tom”>
    Messages: Append Entries
    260ms
    The new leader sends a heartbeat on the next heartbeat timeout

  87. Log Replication
    F2, F2: log 1 “sally”, 2 “bob”; state <“sally”>
    L2, F2, F2: log 1 “sally”, 2 “tom”; state <“tom”>
    Messages: Append Entries
    260ms
    The leader of term #1 steps down after seeing a new leader in term #2

  88. Log Replication
    F2, F2: log 1 “sally”; state <“sally”>
    L2, F2, F2: log 1 “sally”, 2 “tom”; state <“tom”>
    Messages: Append Entries
    260ms
    Uncommitted entries from disconnected nodes are discarded

  89. Log Replication
    F2, F2 (previously disconnected): log 1 “sally”, 2 “tom”; state <“tom”>
    L2, F2, F2: log 1 “sally”, 2 “tom”; state <“tom”>
    Messages: Append Entries
    New log entries are appended to the previously disconnected nodes
    260ms
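
The last two steps, discarding the stale "bob" entry and appending "tom", can be sketched as a follower-side handler. This is a simplified illustration, not the talk's code: the `append_entries` function, its parameters, and the `(term, value)` tuple layout are assumptions, and commit handling is left out.

```python
from typing import List, Tuple

Entry = Tuple[int, str]  # (term, value); log index 1 is log[0]


def append_entries(log: List[Entry], prev_index: int, prev_term: int,
                   entries: List[Entry]) -> bool:
    """Apply a leader's entries to a follower's log in place."""
    # Reject if the follower lacks the entry that precedes the new ones.
    if prev_index > 0:
        if len(log) < prev_index or log[prev_index - 1][0] != prev_term:
            return False
    for offset, entry in enumerate(entries):
        position = prev_index + offset  # 0-based slot for this entry
        if position < len(log) and log[position][0] != entry[0]:
            del log[position:]  # discard the conflicting, uncommitted suffix
        if position >= len(log):
            log.append(entry)
    return True


# The old leader's log from the slides: "bob" was written in term 1, never committed.
old_leader_log = [(1, "sally"), (1, "bob")]
append_entries(old_leader_log, prev_index=1, prev_term=1, entries=[(2, "tom")])
```

After the call, `old_leader_log` holds `(1, "sally")` followed by `(2, "tom")`, matching the final state shown for the previously disconnected nodes. Repeating the same call leaves the log unchanged.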

  90. Log Replication
    F2, F2, L2, F2, F2: log 1 “sally”, 2 “tom”; state <“tom”>
    260ms

  91. Log Compaction

  92. An unbounded log can grow until
    there’s no more disk space

  93. Recovery time increases
    as log length increases

  94. Three Log Compaction Strategies

  95. #1: Leader-Initiated, Stored in Log
    [start]
    chunk
    entry
    chunk
    entry
    [end]
    Raft Library

  96. #2: Leader-Initiated, Stored Externally
    entry
    entry
    entry
    Snapshot
    Raft Library

  97. #3: Independently-Initiated, Stored Externally
    Application
    entry
    entry
    entry
    Snapshot
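
Whichever strategy is used, compaction replaces a prefix of the log with a snapshot of the state that prefix produces. The sketch below assumes the single-value state machine from the replication slides, where each entry overwrites the stored value. The `compact` function and its parameters are hypothetical, and the bookkeeping a real implementation needs (such as the index and term of the last entry covered by the snapshot) is omitted.

```python
from typing import List, Tuple


def compact(snapshot: str, log: List[str], committed: int) -> Tuple[str, List[str]]:
    """Fold the first `committed` entries into the snapshot and drop them."""
    for value in log[:committed]:
        snapshot = value  # applying an entry overwrites the single stored value
    return snapshot, log[committed:]
```

For example, `compact("", ["sally", "bob", "tom"], 2)` returns the snapshot `"bob"` and a remaining log of `["tom"]`, so recovery only has to replay one entry instead of three.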

  98. Questions?
    Twitter: @benbjohnson
    GitHub: benbjohnson
    [email protected]

  99. Image Attribution
    Database designed by Sergey Shmidt from The Noun Project
    Question designed by Greg Pabst from The Noun Project
    Lock from The Noun Project
    Floppy Disk designed by Mike Wirth from The Noun Project
    Movie designed by Anna Weiss from The Noun Project
