Raft: The Understandable Distributed Consensus Protocol

benbjohnson
September 20, 2013

Raft presentation at Strange Loop 2013.

Video: http://www.infoq.com/presentations/raft

This work is licensed under a Creative Commons Attribution 4.0 International License.

Transcript

  1. 11.

    Paxos In A Nutshell (Client, Proposer, three Acceptors): Proposer tells Acceptors to get ready for a change. “Ready?”
  2. 12.

    Paxos In A Nutshell (Client, Proposer, three Acceptors): Acceptors confirm to Proposer that they’re ready. “Hell yeah!”
  3. 14.

    Paxos In A Nutshell (Client, Proposer, three Acceptors, four Learners): Acceptors propagate change to Learners.
  4. 15.

    Paxos In A Nutshell (Client, Proposer, three Acceptors, four Learners): Proposer is now recognized as leader.
  5. 16.

    Paxos In A Nutshell (Client, Leader, three Acceptors, four Learners): Repeat for every new change to the system.
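The Paxos slides above compress the protocol into "get ready", "confirm", and "propagate". As a rough illustration only, here is a minimal single-value sketch of the first two steps. The proposal numbers, the `Acceptor` class, and the `propose` helper are assumptions introduced for this example and do not appear in the slides; Learners are left out.

```python
from typing import List, Optional, Tuple

Accepted = Optional[Tuple[int, str]]  # (proposal number, value) or None


class Acceptor:
    """Hypothetical acceptor; the names here are illustrative only."""

    def __init__(self) -> None:
        self.promised = 0            # highest proposal number promised so far
        self.accepted: Accepted = None

    def prepare(self, n: int) -> Tuple[bool, Accepted]:
        """Phase 1 ("Ready?"): promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, self.accepted

    def accept(self, n: int, value: str) -> bool:
        """Phase 2: accept the value unless a higher number was promised."""
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False


def propose(acceptors: List[Acceptor], n: int, value: str) -> Optional[str]:
    """Run one proposal round; return the chosen value, or None on failure."""
    quorum = len(acceptors) // 2 + 1
    replies = [a.prepare(n) for a in acceptors]
    promises = [prior for ok, prior in replies if ok]
    if len(promises) < quorum:
        return None
    # If any acceptor already accepted a value, reuse the one with the
    # highest proposal number instead of the proposer's own value.
    prior = [p for p in promises if p is not None]
    if prior:
        value = max(prior, key=lambda p: p[0])[1]
    acks = sum(1 for a in acceptors if a.accept(n, value))
    return value if acks >= quorum else None
```

With three acceptors, a first proposal is chosen as-is, a later proposal with a higher number ends up re-proposing the already accepted value, and a proposal with a stale number is rejected.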
  6. 29.
  7. 30.
  8. 33.
  9. 36.
  10. 37.
  11. 39.
  12. 43.

    Leader Election, 150ms. Nodes: C2, F1, F1. Request Vote; Request Vote (Fails). One follower becomes a candidate after an election timeout and requests votes.
  13. 44.

    Leader Election, 155ms. Nodes: C2, F2, F1. Grant Vote. Candidate receives one vote from a peer and one vote from self.
  14. 45.

    Leader Election, 156ms. Nodes: L2, F2, F2. Two votes is a majority, so the candidate becomes leader.
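The three-node election above can be sketched in a few lines. This is a simplified illustration, not the implementation behind the slides: the `Node` class, `majority`, and `run_election` are hypothetical names, the failed Request Vote is modelled by simply not delivering it, and the log comparison that Raft also performs before granting a vote is omitted because the slides do not show it.

```python
from dataclasses import dataclass
from typing import List, Optional


def majority(cluster_size: int) -> int:
    """Smallest number of votes that is more than half the cluster."""
    return cluster_size // 2 + 1


@dataclass
class Node:
    name: str
    term: int = 1
    state: str = "follower"
    voted_for: Optional[str] = None

    def start_election(self) -> int:
        """Election timeout fired: bump the term, become a candidate, vote for self."""
        self.term += 1
        self.state = "candidate"
        self.voted_for = self.name
        return 1  # the candidate's own vote

    def handle_request_vote(self, candidate: str, term: int) -> bool:
        """Grant at most one vote per term."""
        if term < self.term:
            return False
        if term > self.term:
            # Newer term: adopt it and forget any vote cast in the old term.
            self.term, self.state, self.voted_for = term, "follower", None
        if self.voted_for in (None, candidate):
            self.voted_for = candidate
            return True
        return False


def run_election(candidate: Node, reachable: List[Node], cluster_size: int) -> int:
    """Request votes from the reachable peers and become leader on a majority."""
    votes = candidate.start_election()
    for peer in reachable:
        if peer.handle_request_vote(candidate.name, candidate.term):
            votes += 1
    if votes >= majority(cluster_size):
        candidate.state = "leader"
    return votes
```

Calling `run_election(a, [b], 3)` on three fresh nodes mirrors the slides: `a` votes for itself, `b` grants its vote, the request to the third node never arrives, and two votes out of three are enough to win term 2.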
  15. 48.

    Leader Election, 150ms. Nodes: C2, F1, C2, F1. Request Vote (x2). Two followers become candidates simultaneously and begin requesting votes.
  16. 49.

    Leader Election, 155ms. Nodes: C2, F2, C2, F2. Vote Granted (x2). Each candidate receives a vote from itself and from one peer.
  17. 50.

    Leader Election, 156ms. Nodes: C2, F2, C2, F2. Request Vote. Each candidate requests a vote from a peer who has already voted.
  18. 51.

    Leader Election, 160ms. Nodes: C2, F2, C2, F2. Vote Denied. Vote requests are denied because the follower has already voted.
  19. 52.

    Leader Election, 161ms. Nodes: C2, F2, C2, F2. Request Vote. Candidates try to request votes from each other.
  20. 53.

    Leader Election, 165ms. Nodes: C2, F2, C2, F2. Vote Denied. Vote requests are denied because candidates voted for themselves.
  21. 54.

    Leader Election, 200ms. Nodes: C2, F2, C2, F2. Candidates wait for a randomized election timeout to occur (150ms - 300ms).
  22. 55.

    Leader Election, 250ms. Nodes: C2, F2, C2, F2. Still waiting...
  23. 56.

    Leader Election, 300ms. Nodes: C3, F2, C2, F2. Request Vote (x2). One candidate begins election term #3.
  24. 57.

    Leader Election, 305ms. Nodes: L3, F3, C2, F3. Vote Granted (x2). Candidate receives a vote from itself and two votes from peers, so it becomes leader for election term #3.
  25. 58.

    Leader Election, 306ms. Nodes: L3, F3, C3, F3. Request Vote (x2). Second candidate doesn’t know first candidate won the term and begins requesting votes.
  26. 59.

    Leader Election, 306ms. Nodes: L3, F3, C3, F3. Vote Denied (x2). Peers already voted, so votes are denied.
  27. 60.

    Leader Election, 310ms. Nodes: L3, F3, F3, F3. Leader notifies peers of election and other candidate steps down.
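The split vote above, and the way a randomized election timeout breaks the tie, can be replayed with a small simulation. This is a sketch under stated assumptions: the node names `a` to `d`, the `Node` class, and `split_vote_demo` are hypothetical; the 150 to 300 ms range comes from the slides; and the step where the second candidate also starts term 3 before hearing from the leader is skipped for brevity.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    term: int = 1
    state: str = "follower"
    voted_for: Optional[str] = None

    def start_election(self) -> int:
        """Bump the term, become a candidate, and vote for self."""
        self.term += 1
        self.state = "candidate"
        self.voted_for = self.name
        return 1

    def handle_request_vote(self, candidate: str, term: int) -> bool:
        """Grant at most one vote per term."""
        if term < self.term:
            return False
        if term > self.term:
            self.term, self.state, self.voted_for = term, "follower", None
        if self.voted_for in (None, candidate):
            self.voted_for = candidate
            return True
        return False

    def handle_heartbeat(self, leader_term: int) -> None:
        """A heartbeat from a leader with a current term makes this node a follower."""
        if leader_term >= self.term:
            self.term, self.state = leader_term, "follower"


def split_vote_demo(seed: int = 0) -> dict:
    rng = random.Random(seed)
    a, b, c, d = (Node(n) for n in "abcd")
    quorum = 4 // 2 + 1  # 3 of 4 nodes

    # Term 2: a and c time out together and each wins exactly one peer vote.
    votes = {"a": a.start_election(), "c": c.start_election()}
    for cand, peer in ((a, b), (c, d), (a, d), (c, b), (a, c), (c, a)):
        votes[cand.name] += int(peer.handle_request_vote(cand.name, cand.term))
    split = dict(votes)

    # Both candidates wait a randomized timeout; the shorter one retries first.
    timeouts = {n.name: rng.uniform(150, 300) for n in (a, c)}
    first, other = (a, c) if timeouts["a"] <= timeouts["c"] else (c, a)
    term3_votes = first.start_election()
    for peer in (b, d):
        term3_votes += int(peer.handle_request_vote(first.name, first.term))
    if term3_votes >= quorum:
        first.state = "leader"
        for peer in (b, d, other):
            peer.handle_heartbeat(first.term)
    return {"split": split, "timeouts": timeouts, "leader": first,
            "other": other, "term3_votes": term3_votes}
```

In term 2 both candidates end with two votes, one short of the three needed in a four-node cluster. Whichever candidate draws the shorter timeout starts term 3 alone, collects both follower votes, and its first heartbeat turns the other candidate back into a follower.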
  28. 62.

    Log Replication, 0ms. Nodes (log, state): L1 [] <“”>; F1 [] <“”>; F1 [] <“”>.
  29. 63.

    Log Replication, 10ms. Nodes (log, state): L1 [1 “sally”] <“”>; F1 [] <“”>; F1 [] <“”>. A new uncommitted log entry is added to the leader.
  30. 64.

    Log Replication, 20ms. Nodes (log, state): L1 [1 “sally”] <“”>; F1 [1 “sally”] <“”>; F1 [1 “sally”] <“”>. Append Entries (x2). At the next heartbeat, the log entry is replicated to followers.
  31. 65.

    Log Replication, 22ms. Nodes (log, state): L1 [1 “sally”] <“sally”>; F1 [1 “sally”] <“”>; F1 [1 “sally”] <“”>. OK. A majority of nodes have written the log entry to disk, so it becomes committed.
  32. 66.

    Log Replication, 25ms. Nodes (log, state): L1 [1 “sally”] <“sally”>; F1 [1 “sally”] <“”>; F1 [1 “sally”] <“”>. OK.
  33. 67.

    Log Replication, 40ms. Nodes (log, state): L1 [1 “sally”] <“sally”>; F1 [1 “sally”] <“sally”>; F1 [1 “sally”] <“sally”>. Append Entries (x2). At the next heartbeat, the leader notifies followers of updated committed entries.
  34. 68.

    Log Replication, 50ms. Nodes (log, state): L1 [1 “sally”] <“sally”>; F1 [1 “sally”] <“sally”>; F1 [1 “sally”] <“sally”>.
  35. 69.

    Log Replication, 60ms. Nodes (log, state): L1 [1 “sally”] <“sally”>; F1 [1 “sally”] <“sally”>; F1 [1 “sally”] <“sally”>. Append Entries (x2). At the next heartbeat, no new log information is sent.
  36. 70.

    Log Replication, 70ms. Nodes (log, state): L1 [1 “sally”] <“sally”>; F1 [1 “sally”] <“sally”>; F1 [1 “sally”] <“sally”>.
  37. 71.

    Log Replication, 75ms. Nodes (log, state): L1 [1 “sally”, 2 “bob”] <“sally”>; F1 [1 “sally”] <“sally”>; F1 [1 “sally”] <“sally”>. A new uncommitted log entry is added to the leader.
  38. 72.

    Log Replication, 80ms. Nodes (log, state): L1 [1 “sally”, 2 “bob”] <“sally”>; F1 [1 “sally”, 2 “bob”] <“sally”>; F1 [1 “sally”, 2 “bob”] <“sally”>. Append Entries (x2). At the next heartbeat, the entry is replicated to the followers.
  39. 73.

    Log Replication, 82ms. Nodes (log, state): L1 [1 “sally”, 2 “bob”] <“bob”>; F1 [1 “sally”] <“sally”>; F1 [1 “sally”] <“sally”>. OK (x2). The entry is committed once the followers acknowledge the request.
  40. 74.

    Log Replication, 100ms. Nodes (log, state): L1 [1 “sally”, 2 “bob”] <“bob”>; F1 [1 “sally”, 2 “bob”] <“bob”>; F1 [1 “sally”, 2 “bob”] <“bob”>. Append Entries (x2). At the next heartbeat, the leader notifies the followers of the new committed entry.
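The replication rounds above follow a fixed rhythm: the leader appends an entry, ships it on the next heartbeat, commits once a majority holds it, and followers learn about the commit one heartbeat later. A minimal sketch of that rhythm follows. The `Node` class and the `heartbeat` and `states` helpers are hypothetical, terms are ignored, and copying the whole log on every heartbeat is a simplification chosen for brevity rather than what the slides or Raft prescribe.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    log: List[str] = field(default_factory=list)
    commit_index: int = 0  # how many log entries are committed

    @property
    def state(self) -> str:
        """Value of the last committed entry, or "" if nothing is committed."""
        return self.log[self.commit_index - 1] if self.commit_index else ""


def heartbeat(leader: Node, followers: List[Node]) -> None:
    """One Append Entries round from the leader to every follower."""
    acks = 1  # the leader already holds its own entries
    for f in followers:
        f.log = list(leader.log)  # simplification: ship the whole log
        # Followers only learn the commit index the leader had when it sent.
        f.commit_index = min(leader.commit_index, len(f.log))
        acks += 1
    cluster_size = len(followers) + 1
    if acks >= cluster_size // 2 + 1:
        leader.commit_index = len(leader.log)


def states(nodes: List[Node]) -> List[str]:
    return [n.state for n in nodes]
```

Appending "sally" to the leader and calling `heartbeat` once leaves the states at `["sally", "", ""]`; a second heartbeat brings the followers to "sally" as well, matching the 22ms and 40ms slides.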
  41. 76.

    Log Replication, 0ms. Nodes (log, state): L1 [] <“”>; F1 [] <“”>; F1 [] <“”>; F1 [] <“”>; F1 [] <“”>.
  42. 77.

    Log Replication, 10ms. Nodes (log, state): L1 [1 “sally”] <“”>; F1 [] <“”>; F1 [] <“”>; F1 [] <“”>; F1 [] <“”>. A new uncommitted log entry is added to the leader.
  43. 78.

    Log Replication, 20ms. Nodes (log, state): L1 [1 “sally”] <“”>; F1 [1 “sally”] <“”>; F1 [1 “sally”] <“”>; F1 [1 “sally”] <“”>; F1 [1 “sally”] <“”>. Append Entries. On the next heartbeat, the entry is replicated to the followers.
  44. 79.

    Log Replication, 25ms. Nodes (log, state): L1 [1 “sally”] <“sally”>; F1 [1 “sally”] <“”>; F1 [1 “sally”] <“”>; F1 [1 “sally”] <“”>; F1 [1 “sally”] <“”>. OK. The followers acknowledge the entry and the entry is committed.
  45. 80.

    Log Replication, 40ms. Nodes (log, state): L1 [1 “sally”] <“sally”>; F1 [1 “sally”] <“sally”>; F1 [1 “sally”] <“sally”>; F1 [1 “sally”] <“sally”>; F1 [1 “sally”] <“sally”>. Append Entries. On the next heartbeat, the committed entry is replicated to the followers.
  46. 81.

    Log Replication, 50ms. Nodes (log, state): L1 [1 “sally”] <“sally”>; F1 [1 “sally”] <“sally”>; F1 [1 “sally”] <“sally”>; F1 [1 “sally”] <“sally”>; F1 [1 “sally”] <“sally”>.
  47. 82.

    Log Replication, 60ms. Nodes (log, state): L1 [1 “sally”] <“sally”>; F1 [1 “sally”] <“sally”>; F1 [1 “sally”] <“sally”>; F1 [1 “sally”] <“sally”>; F1 [1 “sally”] <“sally”>. A network partition makes a majority of nodes inaccessible from the leader.
  48. 83.

    Log Replication, 70ms. Nodes (log, state): L1 [1 “sally”, 2 “bob”] <“sally”>; F1 [1 “sally”] <“sally”>; F1 [1 “sally”] <“sally”>; F1 [1 “sally”] <“sally”>; F1 [1 “sally”] <“sally”>. A new log entry is added to the leader.
  49. 84.

    Log Replication, 80ms. Nodes (log, state): L1 [1 “sally”, 2 “bob”] <“sally”>; F1 [1 “sally”, 2 “bob”] <“sally”>; F1 [1 “sally”] <“sally”>; F1 [1 “sally”] <“sally”>; F1 [1 “sally”] <“sally”>. Append Entries. The leader replicates the entry to the only accessible follower.
  50. 85.

    Log Replication, 85ms. Nodes (log, state): L1 [1 “sally”, 2 “bob”] <“sally”>; F1 [1 “sally”, 2 “bob”] <“sally”>; F1 [1 “sally”] <“sally”>; F1 [1 “sally”] <“sally”>; F1 [1 “sally”] <“sally”>. OK. The follower acknowledges the entry but there is not a quorum.
  51. 86.

    Log Replication, 90ms. Nodes (log, state): L1 [1 “sally”, 2 “bob”] <“sally”>; F1 [1 “sally”, 2 “bob”] <“sally”>; F1 [1 “sally”] <“sally”>; F1 [1 “sally”] <“sally”>; F1 [1 “sally”] <“sally”>.
  52. 87.

    Log Replication, 190ms. Nodes (log, state): L1 [1 “sally”, 2 “bob”] <“sally”>; F1 [1 “sally”, 2 “bob”] <“sally”>; C2 [1 “sally”] <“sally”>; F1 [1 “sally”] <“sally”>; F1 [1 “sally”] <“sally”>. Request Vote. After an election timeout, one disconnected follower becomes a candidate.
  53. 88.

    Log Replication, 195ms. Nodes (log, state): L1 [1 “sally”, 2 “bob”] <“sally”>; F1 [1 “sally”, 2 “bob”] <“sally”>; L2 [1 “sally”] <“sally”>; F2 [1 “sally”] <“sally”>; F2 [1 “sally”] <“sally”>. Vote Granted. The candidate receives a majority of votes and becomes a leader.
  54. 89.

    Log Replication, 200ms. Nodes (log, state): L1 [1 “sally”, 2 “bob”] <“sally”>; F1 [1 “sally”, 2 “bob”] <“sally”>; L2 [1 “sally”] <“sally”>; F2 [1 “sally”] <“sally”>; F2 [1 “sally”] <“sally”>.
  55. 90.

    Log Replication, 210ms. Nodes (log, state): L1 [1 “sally”, 2 “bob”] <“sally”>; F1 [1 “sally”, 2 “bob”] <“sally”>; L2 [1 “sally”, 2 “tom”] <“sally”>; F2 [1 “sally”] <“sally”>; F2 [1 “sally”] <“sally”>. A log entry is added to the new leader.
  56. 91.

    Log Replication, 220ms. Nodes (log, state): L1 [1 “sally”, 2 “bob”] <“sally”>; F1 [1 “sally”, 2 “bob”] <“sally”>; L2 [1 “sally”, 2 “tom”] <“sally”>; F2 [1 “sally”, 2 “tom”] <“sally”>; F2 [1 “sally”, 2 “tom”] <“sally”>. Append Entries. The log entry is replicated to the accessible followers.
  57. 92.

    Log Replication, 225ms. Nodes (log, state): L1 [1 “sally”, 2 “bob”] <“sally”>; F1 [1 “sally”, 2 “bob”] <“sally”>; L2 [1 “sally”, 2 “tom”] <“tom”>; F2 [1 “sally”, 2 “tom”] <“sally”>; F2 [1 “sally”, 2 “tom”] <“sally”>. OK. A majority of nodes acknowledge the entry, so it becomes committed.
  58. 93.

    Log Replication, 240ms. Nodes (log, state): L1 [1 “sally”, 2 “bob”] <“sally”>; F1 [1 “sally”, 2 “bob”] <“sally”>; L2 [1 “sally”, 2 “tom”] <“tom”>; F2 [1 “sally”, 2 “tom”] <“tom”>; F2 [1 “sally”, 2 “tom”] <“tom”>. Append Entries. On the next heartbeat, the followers are notified that the entry is committed.
  59. 94.

    Log Replication, 250ms. Nodes (log, state): L1 [1 “sally”, 2 “bob”] <“sally”>; F1 [1 “sally”, 2 “bob”] <“sally”>; L2 [1 “sally”, 2 “tom”] <“tom”>; F2 [1 “sally”, 2 “tom”] <“tom”>; F2 [1 “sally”, 2 “tom”] <“tom”>.
  60. 95.

    Log Replication, 255ms. Nodes (log, state): L1 [1 “sally”, 2 “bob”] <“sally”>; F1 [1 “sally”, 2 “bob”] <“sally”>; L2 [1 “sally”, 2 “tom”] <“tom”>; F2 [1 “sally”, 2 “tom”] <“tom”>; F2 [1 “sally”, 2 “tom”] <“tom”>. The network recovers and there is no longer a partition.
  61. 96.

    Log Replication, 260ms. Nodes (log, state): L1 [1 “sally”, 2 “bob”] <“sally”>; F1 [1 “sally”, 2 “bob”] <“sally”>; L2 [1 “sally”, 2 “tom”] <“tom”>; F2 [1 “sally”, 2 “tom”] <“tom”>; F2 [1 “sally”, 2 “tom”] <“tom”>. Append Entries. The new leader sends a heartbeat on the next heartbeat timeout.
  62. 97.

    Log Replication, 260ms. Nodes (log, state): F2 [1 “sally”, 2 “bob”] <“sally”>; F2 [1 “sally”, 2 “bob”] <“sally”>; L2 [1 “sally”, 2 “tom”] <“tom”>; F2 [1 “sally”, 2 “tom”] <“tom”>; F2 [1 “sally”, 2 “tom”] <“tom”>. Append Entries. The leader of term #1 steps down after seeing a new leader in term #2.
  63. 98.

    Log Replication, 260ms. Nodes (log, state): F2 [1 “sally”] <“sally”>; F2 [1 “sally”] <“sally”>; L2 [1 “sally”, 2 “tom”] <“tom”>; F2 [1 “sally”, 2 “tom”] <“tom”>; F2 [1 “sally”, 2 “tom”] <“tom”>. Append Entries. Uncommitted entries from disconnected nodes are discarded.
  64. 99.

    Log Replication, 260ms. Nodes (log, state): F2 [1 “sally”, 2 “tom”] <“tom”>; F2 [1 “sally”, 2 “tom”] <“tom”>; L2 [1 “sally”, 2 “tom”] <“tom”>; F2 [1 “sally”, 2 “tom”] <“tom”>; F2 [1 “sally”, 2 “tom”] <“tom”>. Append Entries. New log entries are appended to the previously disconnected nodes.
  65. 100.

    Log Replication, 260ms. Nodes (log, state): F2 [1 “sally”, 2 “tom”] <“tom”>; F2 [1 “sally”, 2 “tom”] <“tom”>; L2 [1 “sally”, 2 “tom”] <“tom”>; F2 [1 “sally”, 2 “tom”] <“tom”>; F2 [1 “sally”, 2 “tom”] <“tom”>.
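The partition walkthrough above hinges on two rules: a leader cannot commit without a quorum, and a follower drops entries that conflict with the current leader's log. The sketch below replays that scenario under stated assumptions. The `Node` class, `append_entries`, and `heartbeat` are hypothetical names, the election of the term-2 leader is set by hand rather than simulated, and the whole log is compared on every heartbeat for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Entry = Tuple[int, str]  # (term, value)


@dataclass
class Node:
    term: int = 1
    role: str = "follower"
    log: List[Entry] = field(default_factory=list)
    commit_index: int = 0  # how many log entries are committed

    @property
    def state(self) -> str:
        """Value of the last committed entry, or "" if nothing is committed."""
        return self.log[self.commit_index - 1][1] if self.commit_index else ""


def append_entries(leader: Node, follower: Node) -> bool:
    """Bring a follower's log in line with the leader's; reject stale leaders."""
    if leader.term < follower.term:
        return False
    # Seeing a current leader makes the receiver a follower in that term.
    follower.term, follower.role = leader.term, "follower"
    # Keep the shared prefix, discard conflicting entries, append the rest.
    keep = 0
    while (keep < len(follower.log) and keep < len(leader.log)
           and follower.log[keep] == leader.log[keep]):
        keep += 1
    follower.log = follower.log[:keep] + leader.log[keep:]
    follower.commit_index = min(leader.commit_index, len(follower.log))
    return True


def heartbeat(leader: Node, reachable: List[Node], cluster_size: int) -> None:
    """Replicate to reachable nodes; commit only with a cluster-wide quorum."""
    acks = 1 + sum(1 for f in reachable if append_entries(leader, f))
    if acks >= cluster_size // 2 + 1:
        leader.commit_index = len(leader.log)
```

In a five-node cluster, a leader that reaches only one follower gathers two acknowledgements, short of the three required, so "bob" stays uncommitted. Once the partition heals, the term-2 leader's heartbeat replaces "bob" with "tom" on the two nodes that had been cut off.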
  66. 109.

    Image Attribution: Database designed by Sergey Shmidt from The Noun Project; Question designed by Greg Pabst from The Noun Project; Lock from The Noun Project; Floppy Disk designed by Mike Wirth from The Noun Project; Movie designed by Anna Weiss from The Noun Project.