Raft: The Understandable Distributed Consensus Protocol

benbjohnson
September 20, 2013

Raft presentation at Strange Loop 2013.

Video: http://www.infoq.com/presentations/raft

This work is licensed under a Creative Commons Attribution 4.0 International License.

Transcript

  1. Raft
    The Understandable
    Distributed Consensus Protocol
    @benbjohnson

  2. What is
    Distributed Consensus?

  3. Distributed = Many nodes
    Consensus = Agreement

  4. Distributed = Many nodes
    Consensus = Agreement

  5. Leader Election
    Data Replication
    Distributed Locks

  6. A Really Short History Of
    Distributed Consensus Protocols

  7. A Really Short History Of
    Distributed Consensus Protocols
    Paxos (1989)

  8. Paxos In A Nutshell

  9. Paxos In A Nutshell
    Client

  10. Paxos In A Nutshell
    Client Proposer
    Client requests change to system

  11. Paxos In A Nutshell
    Client Proposer
    Acceptor
    Acceptor
    Acceptor
    Proposer tells Acceptors to get ready for a change
    Ready?

  12. Paxos In A Nutshell
    Client Proposer
    Acceptor
    Acceptor
    Acceptor
    Acceptors confirm to Proposer that they’re ready
    Hell yeah!

  13. Paxos In A Nutshell
    Client Proposer
    Acceptor
    Acceptor
    Acceptor
    Proposer sends change to Acceptors
    Here
    you go

  14. Paxos In A Nutshell
    Client Proposer
    Acceptor
    Acceptor
    Acceptor
    Learner
    Learner
    Learner
    Learner
    Acceptors propagate change to Learners

  15. Paxos In A Nutshell
    Client Proposer
    Acceptor
    Acceptor
    Acceptor
    Learner
    Learner
    Learner
    Learner
    Proposer is now recognized as leader
    Leader

  16. Paxos In A Nutshell
    Client Proposer
    Acceptor
    Acceptor
    Acceptor
    Learner
    Learner
    Learner
    Learner
    Repeat for every new change to the system
    Leader

  17. Fun Raft Facts

  18. Created By:

  19. Diego Ongaro
    Ph.D. Student
    Stanford University

  20. John Ousterhout
    Professor of Computer Science
    Stanford University
    Diego Ongaro
    Ph.D. Student
    Stanford University

  21. 28 Implementations
    across various languages

  22. In Commercial Use
    CoreOS
    (etcd)
    go-raft

  23. Raft Basics

  24. Three Roles:

  25. The Leader

  26. The Follower

  27. The Candidate

  28. High-Level Example:

  29. F
    F
    F

  30. C
    F
    F

  31. C
    F
    F
    Vote for me!
    Vote for me!

  32. C
    F
    F
    Ok!
    Ok!

  33. L
    F
    F

  34. L
    F
    F
    Log entries
    Log entries

  35. L
    F
    F
    Heartbeats
    Heartbeats

  36. X
    F
    F

  37. X
    C
    F

  38. X
    C
    F
    Vote for me!

  39. X
    C
    F
    Ok!

  40. X
    L
    F
    Log Entries &
    Heartbeats

  41. Leader Election

  42. Leader Election, t = 0ms
    Nodes: F1, F1, F1

  43. Leader Election, t = 150ms
    Nodes: C2, F1, F1
    Messages: Request Vote, Request Vote (Fails)
    One follower becomes a candidate after an election timeout and requests votes

  44. Leader Election, t = 155ms
    Nodes: C2, F2, F1
    Messages: Grant Vote
    Candidate receives one vote from a peer and one vote from itself

  45. Leader Election, t = 156ms
    Nodes: L2, F2, F2
    Two votes out of three is a majority, so the candidate becomes leader
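
The majority rule in this example can be written down in a few lines. This is a minimal Python sketch, not code from go-raft or any other Raft library; `quorum_size` and `has_won` are hypothetical names chosen for illustration.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest number of votes that is a strict majority of the cluster."""
    return cluster_size // 2 + 1


def has_won(votes_received: int, cluster_size: int) -> bool:
    """A candidate wins once its votes (its own included) reach a quorum."""
    return votes_received >= quorum_size(cluster_size)
```

With three nodes, the candidate's own vote plus one peer vote is enough. With four nodes, two votes are not enough, which is what makes a split vote possible.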

  46. Leader Election
    (Split Vote)

  47. Leader Election, t = 0ms
    Nodes: F1, F1, F1, F1

  48. Leader Election, t = 150ms
    Nodes: C2, F1, C2, F1
    Messages: Request Vote, Request Vote
    Two followers become candidates simultaneously and begin requesting votes

  49. Leader Election, t = 155ms
    Nodes: C2, F2, C2, F2
    Messages: Vote Granted, Vote Granted
    Each candidate receives a vote from themselves and from one peer

  50. Leader Election, t = 156ms
    Nodes: C2, F2, C2, F2
    Messages: Request Vote
    Each candidate requests a vote from a peer who has already voted

  51. Leader Election, t = 160ms
    Nodes: C2, F2, C2, F2
    Messages: Vote Denied
    Vote requests are denied because the follower has already voted

  52. Leader Election, t = 161ms
    Nodes: C2, F2, C2, F2
    Messages: Request Vote
    Candidates try to request votes from each other

  53. Leader Election, t = 165ms
    Nodes: C2, F2, C2, F2
    Messages: Vote Denied
    Vote requests are denied because candidates voted for themselves
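
The vote-granting behavior in the split-vote walkthrough (one vote per node per term, and candidates vote for themselves) can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the `Voter` class and its field names are hypothetical, and the check that a candidate's log is sufficiently up to date, which the full Raft protocol also requires, is left out because the slides do not cover it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Voter:
    """Hypothetical per-node voting state; names are illustrative only."""
    current_term: int = 1
    voted_for: Optional[str] = None

    def handle_request_vote(self, candidate_id: str, term: int) -> bool:
        # A request from an older term is stale, so deny it.
        if term < self.current_term:
            return False
        # A newer term resets the vote cast in the previous term.
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None
        # Grant only if no vote was cast this term (or it went to this candidate).
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False
```

A follower that granted its term-2 vote to one candidate denies the other, and a candidate (which has `voted_for` set to its own id) denies everyone else in that term.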

  54. Leader Election, t = 200ms
    Nodes: C2, F2, C2, F2
    Candidates wait for a randomized election timeout to occur (150ms - 300ms)
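
A randomized timeout in the 150ms to 300ms range from this slide can be drawn as shown below. This is a small Python sketch; the constant and function names are hypothetical, and passing in a `random.Random` instance is a design choice made here so the behavior can be tested with a fixed seed.

```python
import random

# Bounds taken from the slide; the names are illustrative.
ELECTION_TIMEOUT_MIN_MS = 150
ELECTION_TIMEOUT_MAX_MS = 300


def next_election_timeout_ms(rng: random.Random) -> float:
    """Pick a fresh timeout so that nodes rarely time out at the same moment."""
    return rng.uniform(ELECTION_TIMEOUT_MIN_MS, ELECTION_TIMEOUT_MAX_MS)
```

Because each node draws its own value, one candidate usually times out first and can win the next term before the others start a competing election.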

  55. Leader Election, t = 250ms
    Nodes: C2, F2, C2, F2
    Still waiting...

  56. Leader Election, t = 300ms
    Nodes: C3, F2, C2, F2
    Messages: Request Vote, Request Vote
    One candidate begins election term #3

  57. Leader Election, t = 305ms
    Nodes: L3, F3, C2, F3
    Messages: Vote Granted, Vote Granted
    Candidate receives its own vote and two peer votes, so it becomes leader for election term #3

  58. Leader Election, t = 306ms
    Nodes: L3, F3, C3, F3
    Messages: Request Vote, Request Vote
    Second candidate doesn’t know first candidate won the term and begins requesting votes

  59. Leader Election, t = 306ms
    Nodes: L3, F3, C3, F3
    Messages: Vote Denied, Vote Denied
    Peers already voted so votes are denied

  60. Leader Election, t = 310ms
    Nodes: L3, F3, F3, F3
    Leader notifies peers of election and other candidate steps down
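
The step-down behavior can be sketched as a reaction to a leader's message. This is a minimal Python sketch under assumptions: the `Node` class, the role constants, and the reduction of an Append Entries message to just the leader's term are all simplifications made for illustration.

```python
from dataclasses import dataclass

FOLLOWER, CANDIDATE, LEADER = "follower", "candidate", "leader"


@dataclass
class Node:
    """Hypothetical node state reduced to role and term."""
    role: str
    current_term: int

    def handle_append_entries(self, leader_term: int) -> bool:
        # Messages from an older term are rejected; the sender is stale.
        if leader_term < self.current_term:
            return False
        # A leader exists for this term or a newer one: adopt the term
        # and fall back to being a follower.
        self.current_term = leader_term
        self.role = FOLLOWER
        return True
```

The same rule covers a losing candidate hearing from the winner of its own term and an old leader hearing from a leader of a newer term.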

  61. Log Replication

  62. Log Replication, t = 0ms
    L1: <“”>
    F1: <“”>
    F1: <“”>

  63. Log Replication, t = 10ms
    L1: <“”>, log: 1 “sally”
    F1: <“”>
    F1: <“”>
    A new uncommitted log entry is added to the leader

  64. Log Replication, t = 20ms
    L1: <“”>, log: 1 “sally”
    F1: <“”>, log: 1 “sally”
    F1: <“”>, log: 1 “sally”
    Messages: Append Entries, Append Entries
    At the next heartbeat, the log entry is replicated to followers

  65. Log Replication, t = 22ms
    L1: <“sally”>, log: 1 “sally”
    F1: <“”>, log: 1 “sally”
    F1: <“”>, log: 1 “sally”
    Messages: OK
    A majority of nodes have written the log entry to disk, so it becomes committed

  66. Log Replication, t = 25ms
    L1: <“sally”>, log: 1 “sally”
    F1: <“”>, log: 1 “sally”
    F1: <“”>, log: 1 “sally”
    Messages: OK

  67. Log Replication, t = 40ms
    L1: <“sally”>, log: 1 “sally”
    F1: <“sally”>, log: 1 “sally”
    F1: <“sally”>, log: 1 “sally”
    Messages: Append Entries, Append Entries
    At the next heartbeat, the leader notifies followers of updated committed entries

  68. Log Replication, t = 50ms
    L1: <“sally”>, log: 1 “sally”
    F1: <“sally”>, log: 1 “sally”
    F1: <“sally”>, log: 1 “sally”

  69. Log Replication, t = 60ms
    L1: <“sally”>, log: 1 “sally”
    F1: <“sally”>, log: 1 “sally”
    F1: <“sally”>, log: 1 “sally”
    Messages: Append Entries, Append Entries
    At the next heartbeat, no new log information is sent

  70. Log Replication, t = 70ms
    L1: <“sally”>, log: 1 “sally”
    F1: <“sally”>, log: 1 “sally”
    F1: <“sally”>, log: 1 “sally”

  71. Log Replication, t = 75ms
    L1: <“sally”>, log: 1 “sally”, 2 “bob”
    F1: <“sally”>, log: 1 “sally”
    F1: <“sally”>, log: 1 “sally”
    A new uncommitted log entry is added to the leader

  72. Log Replication, t = 80ms
    L1: <“sally”>, log: 1 “sally”, 2 “bob”
    F1: <“sally”>, log: 1 “sally”, 2 “bob”
    F1: <“sally”>, log: 1 “sally”, 2 “bob”
    Messages: Append Entries, Append Entries
    At the next heartbeat, the entry is replicated to the followers

  73. Log Replication, t = 82ms
    L1: <“bob”>, log: 1 “sally”, 2 “bob”
    F1: <“sally”>, log: 1 “sally”, 2 “bob”
    F1: <“sally”>, log: 1 “sally”, 2 “bob”
    Messages: OK, OK
    The entry is committed once the followers acknowledge the request

  74. Log Replication, t = 100ms
    L1: <“bob”>, log: 1 “sally”, 2 “bob”
    F1: <“bob”>, log: 1 “sally”, 2 “bob”
    F1: <“bob”>, log: 1 “sally”, 2 “bob”
    Messages: Append Entries, Append Entries
    At the next heartbeat, the leader notifies the followers of the new committed entry
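
The commit rule used throughout this walkthrough (an entry is committed once a majority of nodes have written it) can be sketched as a function over what each node has stored. This is an illustrative Python sketch: `commit_index` and the `match_index` list are hypothetical names, and the full Raft protocol adds a further restriction about only counting entries from the leader's current term, which is omitted here.

```python
from typing import List


def commit_index(match_index: List[int]) -> int:
    """Highest log index that is stored on a majority of nodes.

    match_index holds one number per node (the leader included): the
    highest log index known to be written on that node.
    """
    ranked = sorted(match_index, reverse=True)
    majority = len(ranked) // 2 + 1
    # The value at position majority-1 is present on at least `majority` nodes.
    return ranked[majority - 1]
```

With three nodes, the leader plus one acknowledging follower is enough to commit entry 1. With five nodes where only two hold entry 2, the commit index stays at 1.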

  75. Log Replication
    (with Network Partitions)

  76. Log Replication, t = 0ms
    L1: <“”>
    F1: <“”>
    F1: <“”>
    F1: <“”>
    F1: <“”>

  77. Log Replication, t = 10ms
    L1: <“”>, log: 1 “sally”
    F1: <“”>
    F1: <“”>
    F1: <“”>
    F1: <“”>
    A new uncommitted log entry is added to the leader

  78. Log Replication, t = 20ms
    L1: <“”>, log: 1 “sally”
    F1: <“”>, log: 1 “sally”
    F1: <“”>, log: 1 “sally”
    F1: <“”>, log: 1 “sally”
    F1: <“”>, log: 1 “sally”
    Messages: Append Entries
    On the next heartbeat, the entry is replicated to the followers

  79. Log Replication, t = 25ms
    L1: <“sally”>, log: 1 “sally”
    F1: <“”>, log: 1 “sally”
    F1: <“”>, log: 1 “sally”
    F1: <“”>, log: 1 “sally”
    F1: <“”>, log: 1 “sally”
    Messages: OK
    The followers acknowledge the entry and the entry is committed

  80. Log Replication, t = 40ms
    L1: <“sally”>, log: 1 “sally”
    F1: <“sally”>, log: 1 “sally”
    F1: <“sally”>, log: 1 “sally”
    F1: <“sally”>, log: 1 “sally”
    F1: <“sally”>, log: 1 “sally”
    Messages: Append Entries
    On the next heartbeat, the committed entry is replicated to the followers

  81. Log Replication, t = 50ms
    L1: <“sally”>, log: 1 “sally”
    F1: <“sally”>, log: 1 “sally”
    F1: <“sally”>, log: 1 “sally”
    F1: <“sally”>, log: 1 “sally”
    F1: <“sally”>, log: 1 “sally”

  82. Log Replication, t = 60ms
    L1: <“sally”>, log: 1 “sally”
    F1: <“sally”>, log: 1 “sally”
    F1: <“sally”>, log: 1 “sally”
    F1: <“sally”>, log: 1 “sally”
    F1: <“sally”>, log: 1 “sally”
    A network partition makes a majority of nodes inaccessible from the leader

  83. Log Replication, t = 70ms
    L1: <“sally”>, log: 1 “sally”, 2 “bob”
    F1: <“sally”>, log: 1 “sally”
    F1: <“sally”>, log: 1 “sally”
    F1: <“sally”>, log: 1 “sally”
    F1: <“sally”>, log: 1 “sally”
    A new log entry is added to the leader

  84. Log Replication, t = 80ms
    L1: <“sally”>, log: 1 “sally”, 2 “bob”
    F1: <“sally”>, log: 1 “sally”, 2 “bob”
    F1: <“sally”>, log: 1 “sally”
    F1: <“sally”>, log: 1 “sally”
    F1: <“sally”>, log: 1 “sally”
    Messages: Append Entries
    The leader replicates the entry to the only accessible follower

  85. Log Replication, t = 85ms
    L1: <“sally”>, log: 1 “sally”, 2 “bob”
    F1: <“sally”>, log: 1 “sally”, 2 “bob”
    F1: <“sally”>, log: 1 “sally”
    F1: <“sally”>, log: 1 “sally”
    F1: <“sally”>, log: 1 “sally”
    Messages: OK
    The follower acknowledges the entry but there is not a quorum

  86. Log Replication, t = 90ms
    L1: <“sally”>, log: 1 “sally”, 2 “bob”
    F1: <“sally”>, log: 1 “sally”, 2 “bob”
    F1: <“sally”>, log: 1 “sally”
    F1: <“sally”>, log: 1 “sally”
    F1: <“sally”>, log: 1 “sally”

  87. Log Replication, t = 190ms
    L1: <“sally”>, log: 1 “sally”, 2 “bob”
    F1: <“sally”>, log: 1 “sally”, 2 “bob”
    C2: <“sally”>, log: 1 “sally”
    F1: <“sally”>, log: 1 “sally”
    F1: <“sally”>, log: 1 “sally”
    Messages: Request Vote
    After an election timeout, one disconnected follower becomes a candidate

  88. Log Replication, t = 195ms
    L1: <“sally”>, log: 1 “sally”, 2 “bob”
    F1: <“sally”>, log: 1 “sally”, 2 “bob”
    L2: <“sally”>, log: 1 “sally”
    F2: <“sally”>, log: 1 “sally”
    F2: <“sally”>, log: 1 “sally”
    Messages: Vote Granted
    The candidate receives a majority of votes and becomes a leader

  89. Log Replication, t = 200ms
    L1: <“sally”>, log: 1 “sally”, 2 “bob”
    F1: <“sally”>, log: 1 “sally”, 2 “bob”
    L2: <“sally”>, log: 1 “sally”
    F2: <“sally”>, log: 1 “sally”
    F2: <“sally”>, log: 1 “sally”

  90. Log Replication, t = 210ms
    L1: <“sally”>, log: 1 “sally”, 2 “bob”
    F1: <“sally”>, log: 1 “sally”, 2 “bob”
    L2: <“sally”>, log: 1 “sally”, 2 “tom”
    F2: <“sally”>, log: 1 “sally”
    F2: <“sally”>, log: 1 “sally”
    A log entry is added to the new leader

  91. Log Replication, t = 220ms
    L1: <“sally”>, log: 1 “sally”, 2 “bob”
    F1: <“sally”>, log: 1 “sally”, 2 “bob”
    L2: <“sally”>, log: 1 “sally”, 2 “tom”
    F2: <“sally”>, log: 1 “sally”, 2 “tom”
    F2: <“sally”>, log: 1 “sally”, 2 “tom”
    Messages: Append Entries
    The log entry is replicated to the accessible followers

  92. Log Replication, t = 225ms
    L1: <“sally”>, log: 1 “sally”, 2 “bob”
    F1: <“sally”>, log: 1 “sally”, 2 “bob”
    L2: <“tom”>, log: 1 “sally”, 2 “tom”
    F2: <“sally”>, log: 1 “sally”, 2 “tom”
    F2: <“sally”>, log: 1 “sally”, 2 “tom”
    Messages: OK
    A majority of nodes acknowledge the entry so it becomes committed

  93. Log Replication, t = 240ms
    L1: <“sally”>, log: 1 “sally”, 2 “bob”
    F1: <“sally”>, log: 1 “sally”, 2 “bob”
    L2: <“tom”>, log: 1 “sally”, 2 “tom”
    F2: <“tom”>, log: 1 “sally”, 2 “tom”
    F2: <“tom”>, log: 1 “sally”, 2 “tom”
    Messages: Append Entries
    On the next heartbeat, the followers are notified that the entry is committed

  94. Log Replication, t = 250ms
    L1: <“sally”>, log: 1 “sally”, 2 “bob”
    F1: <“sally”>, log: 1 “sally”, 2 “bob”
    L2: <“tom”>, log: 1 “sally”, 2 “tom”
    F2: <“tom”>, log: 1 “sally”, 2 “tom”
    F2: <“tom”>, log: 1 “sally”, 2 “tom”

  95. Log Replication, t = 255ms
    L1: <“sally”>, log: 1 “sally”, 2 “bob”
    F1: <“sally”>, log: 1 “sally”, 2 “bob”
    L2: <“tom”>, log: 1 “sally”, 2 “tom”
    F2: <“tom”>, log: 1 “sally”, 2 “tom”
    F2: <“tom”>, log: 1 “sally”, 2 “tom”
    The network recovers and there is no longer a partition

  96. Log Replication, t = 260ms
    L1: <“sally”>, log: 1 “sally”, 2 “bob”
    F1: <“sally”>, log: 1 “sally”, 2 “bob”
    L2: <“tom”>, log: 1 “sally”, 2 “tom”
    F2: <“tom”>, log: 1 “sally”, 2 “tom”
    F2: <“tom”>, log: 1 “sally”, 2 “tom”
    Messages: Append Entries
    The new leader sends a heartbeat on the next heartbeat timeout

  97. Log Replication, t = 260ms
    F2: <“sally”>, log: 1 “sally”, 2 “bob”
    F2: <“sally”>, log: 1 “sally”, 2 “bob”
    L2: <“tom”>, log: 1 “sally”, 2 “tom”
    F2: <“tom”>, log: 1 “sally”, 2 “tom”
    F2: <“tom”>, log: 1 “sally”, 2 “tom”
    Messages: Append Entries
    The leader of term #1 steps down after seeing a new leader in term #2

  98. Log Replication, t = 260ms
    F2: <“sally”>, log: 1 “sally”
    F2: <“sally”>, log: 1 “sally”
    L2: <“tom”>, log: 1 “sally”, 2 “tom”
    F2: <“tom”>, log: 1 “sally”, 2 “tom”
    F2: <“tom”>, log: 1 “sally”, 2 “tom”
    Messages: Append Entries
    Uncommitted entries from disconnected nodes are discarded

  99. Log Replication, t = 260ms
    F2: <“tom”>, log: 1 “sally”, 2 “tom”
    F2: <“tom”>, log: 1 “sally”, 2 “tom”
    L2: <“tom”>, log: 1 “sally”, 2 “tom”
    F2: <“tom”>, log: 1 “sally”, 2 “tom”
    F2: <“tom”>, log: 1 “sally”, 2 “tom”
    Messages: Append Entries
    New log entries are appended to the previously disconnected nodes

  100. Log Replication, t = 260ms
    F2: <“tom”>, log: 1 “sally”, 2 “tom”
    F2: <“tom”>, log: 1 “sally”, 2 “tom”
    L2: <“tom”>, log: 1 “sally”, 2 “tom”
    F2: <“tom”>, log: 1 “sally”, 2 “tom”
    F2: <“tom”>, log: 1 “sally”, 2 “tom”
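
The last few steps (discarding the uncommitted “bob” entry and appending “tom”) can be sketched from the follower's side. This is a simplified Python sketch under assumptions: `Entry` and `append_entries` are hypothetical names, the `prev_index`/`prev_term` consistency check follows the Append Entries request described in the Raft paper rather than anything shown on the slides, and the terms assigned to the entries (“bob” in term 1, “tom” in term 2) are inferred from the L1 and L2 labels.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    term: int
    value: str


def append_entries(log: List[Entry], prev_index: int, prev_term: int,
                   entries: List[Entry]) -> bool:
    """Apply a leader's entries to a follower's log in place.

    Indexes are 1-based; prev_index 0 means "start of the log".
    Returns False when the follower's log does not match at prev_index.
    """
    if prev_index > len(log):
        return False
    if prev_index > 0 and log[prev_index - 1].term != prev_term:
        return False
    for offset, entry in enumerate(entries):
        index = prev_index + 1 + offset
        if index <= len(log):
            if log[index - 1].term == entry.term:
                continue  # already stored
            del log[index - 1:]  # conflict: drop this entry and all after it
        log.append(entry)
    return True
```

For the old leader's log of “sally” and “bob”, a request anchored at index 2 in term 2 is rejected, and a retry anchored at index 1 replaces “bob” with “tom”.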

  101. Log Compaction

  102. Unbounded log can grow until
    there’s no more disk

  103. Recovery time increases
    as log length increases

  104. Three Log Compaction Strategies

  105. #1: Leader-Initiated, Stored in Log
    [start]
    chunk
    entry
    chunk
    entry
    [end]
    Raft Library

  106. #2: Leader-Initiated, Stored Externally
    entry
    entry
    entry
    Snapshot
    Raft Library

  107. #3: Independently-Initiated, Stored Externally
    Application
    entry
    entry
    entry
    Snapshot

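
The idea shared by the externally stored strategies (#2 and #3) is to fold a prefix of the log into a snapshot and then drop those entries. The sketch below is a rough Python illustration only: `CompactingLog` is a hypothetical class, the key/value state machine is an assumption (the deck does not specify one), and the caller is assumed to compact only entries that are already committed.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class CompactingLog:
    """Hypothetical log whose prefix can be replaced by a snapshot."""
    snapshot: Dict[str, str] = field(default_factory=dict)
    snapshot_index: int = 0  # last log index covered by the snapshot
    entries: List[Tuple[str, str]] = field(default_factory=list)

    def append(self, key: str, value: str) -> None:
        self.entries.append((key, value))

    def compact(self, through_index: int) -> None:
        """Fold entries up to through_index (1-based) into the snapshot."""
        count = through_index - self.snapshot_index
        if count <= 0:
            return  # already covered by the snapshot
        if count > len(self.entries):
            raise ValueError("cannot compact past the end of the log")
        for key, value in self.entries[:count]:
            self.snapshot[key] = value
        del self.entries[:count]
        self.snapshot_index = through_index
```

After compaction the log holds only the entries newer than the snapshot, which bounds both disk use and the amount of log that must be replayed on recovery.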

  108. Questions?
    Twitter: @benbjohnson
    GitHub: benbjohnson
    [email protected]

  109. Image Attribution
    Database designed by Sergey Shmidt from The Noun Project
    Question designed by Greg Pabst from The Noun Project
    Lock from The Noun Project
    Floppy Disk designed by Mike Wirth from The Noun Project
    Movie designed by Anna Weiss from The Noun Project
