Raft: The Understandable Distributed Consensus Protocol

benbjohnson
September 20, 2013

Raft presentation at Strange Loop 2013.

Video: http://www.infoq.com/presentations/raft

This work is licensed under a Creative Commons Attribution 4.0 International License.


Transcript

  1. Raft: The Understandable Distributed Consensus Protocol @benbjohnson

  2. What is Distributed Consensus?

  3. Distributed = Many nodes Consensus = Agreement

  4. Distributed = Many nodes Consensus = Agreement

  5. Leader Election Data Replication Distributed Locks

  6. A Really Short History Of Distributed Consensus Protocols

  7. A Really Short History Of Distributed Consensus Protocols Paxos (1989)

  8. Paxos In A Nutshell

  9. Paxos In A Nutshell Client

  10. Paxos In A Nutshell. Diagram: Client, Proposer. Client requests change to system.

  11. Paxos In A Nutshell. Diagram: Client, Proposer, Acceptor, Acceptor, Acceptor. Proposer tells Acceptors to get ready for a change: "Ready?"

  12. Paxos In A Nutshell. Diagram: Client, Proposer, Acceptor, Acceptor, Acceptor. Acceptors confirm to Proposer that they’re ready: "Hell yeah!"

  13. Paxos In A Nutshell. Diagram: Client, Proposer, Acceptor, Acceptor, Acceptor. Proposer sends change to Acceptors: "Here you go"

  14. Paxos In A Nutshell. Diagram: Client, Proposer, Acceptor, Acceptor, Acceptor, Learner, Learner, Learner, Learner. Acceptors propagate change to Learners.

  15. Paxos In A Nutshell. Diagram: Client, Proposer (Leader), Acceptor, Acceptor, Acceptor, Learner, Learner, Learner, Learner. Proposer is now recognized as leader.

  16. Paxos In A Nutshell. Diagram: Client, Proposer (Leader), Acceptor, Acceptor, Acceptor, Learner, Learner, Learner, Learner. Repeat for every new change to the system.

  17. Fun Raft Facts

  18. Created By:

  19. Diego Ongaro Ph.D. Student Stanford University

  20. John Ousterhout, Professor of Computer Science, Stanford University. Diego Ongaro, Ph.D. Student, Stanford University.

  21. 28 Implementations across various languages

  22. In Commercial Use CoreOS (etcd) go-raft

  23. Raft Basics

  24. Three Roles:

  25. The Leader

  26. The Follower

  27. The Candidate

  28. High-Level Example:

  29. F F F

  30. C F F

  31. C F F Vote for me! Vote for me!

  32. C F F Ok! Ok!

  33. L F F

  34. L F F Log entries Log entries

  35. L F F Heartbeats Heartbeats

  36. X F F

  37. X C F

  38. X C F Vote for me!

  39. X C F Ok!

  40. X L F Log Entries & Heartbeats
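The role changes in the high-level example (F to C to L, and a new candidate appearing once the leader is gone) can be sketched as a small transition table. This is a minimal illustration, not code from the talk: the `Role` enum, the event names, and the `replay` helper are all hypothetical names chosen for this sketch.

```python
from enum import Enum


class Role(Enum):
    FOLLOWER = "F"
    CANDIDATE = "C"
    LEADER = "L"


# Hypothetical event names, invented for this sketch only.
TRANSITIONS = {
    (Role.FOLLOWER, "election_timeout"): Role.CANDIDATE,
    (Role.CANDIDATE, "won_majority"): Role.LEADER,
    (Role.CANDIDATE, "election_timeout"): Role.CANDIDATE,  # start a new election
    (Role.CANDIDATE, "saw_leader"): Role.FOLLOWER,
    (Role.LEADER, "saw_higher_term"): Role.FOLLOWER,
}


def next_role(role, event):
    """Return the role after `event`; unknown events leave the role unchanged."""
    return TRANSITIONS.get((role, event), role)


def replay(role, events):
    """Apply a sequence of events and return the final role."""
    for event in events:
        role = next_role(role, event)
    return role
```

For example, `replay(Role.FOLLOWER, ["election_timeout", "won_majority"])` walks the same F, C, L path as slides 29 to 33.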

  41. Leader Election

  42. Leader Election (0ms). Nodes: F1 F1 F1.

  43. Leader Election (150ms). Nodes: C2 F1 F1. Messages: Request Vote, Request Vote (Fails). One follower becomes a candidate after an election timeout and requests votes.

  44. Leader Election (155ms). Nodes: C2 F2 F1. Messages: Grant Vote. Candidate receives one vote from a peer and one vote from self.

  45. Leader Election (156ms). Nodes: L2 F2 F2. Two votes is a majority so candidate becomes leader.

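The "two votes is a majority" step can be written down as simple arithmetic. The sketch below assumes a fixed cluster size known to every node; the function names `majority` and `wins_election` are hypothetical.

```python
def majority(cluster_size):
    """Smallest number of votes that is more than half of the cluster."""
    return cluster_size // 2 + 1


def wins_election(votes_received, cluster_size):
    """A candidate wins once its votes (including its own) reach a majority."""
    return votes_received >= majority(cluster_size)
```

With three nodes, a self vote plus one peer vote is enough: `wins_election(2, 3)` is true. With four nodes the same two votes are not, which is the split-vote case.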
  46. Leader Election (Split Vote)

  47. Leader Election (0ms). Nodes: F1 F1 F1 F1.

  48. Leader Election (150ms). Nodes: C2 F1 C2 F1. Messages: Request Vote, Request Vote. Two followers become candidates simultaneously and begin requesting votes.

  49. Leader Election (155ms). Nodes: C2 F2 C2 F2. Messages: Vote Granted, Vote Granted. Each candidate receives a vote from themselves and from one peer.

  50. Leader Election (156ms). Nodes: C2 F2 C2 F2. Messages: Request Vote. Each candidate requests a vote from a peer who has already voted.

  51. Leader Election (160ms). Nodes: C2 F2 C2 F2. Messages: Vote Denied. Vote requests are denied because the follower has already voted.

  52. Leader Election (161ms). Nodes: C2 F2 C2 F2. Messages: Request Vote. Candidates try to request votes from each other.

  53. Leader Election (165ms). Nodes: C2 F2 C2 F2. Messages: Vote Denied. Vote requests are denied because candidates voted for themselves.

  54. Leader Election (200ms). Nodes: C2 F2 C2 F2. Candidates wait for a randomized election timeout to occur (150ms - 300ms).

  55. Leader Election (250ms). Nodes: C2 F2 C2 F2. Still waiting...

  56. Leader Election (300ms). Nodes: C3 F2 C2 F2. Messages: Request Vote, Request Vote. One candidate begins election term #3.

  57. Leader Election (305ms). Nodes: L3 F3 C2 F3. Messages: Vote Granted, Vote Granted. Candidate receives vote from itself and two peer votes so it becomes leader for election term #3.

  58. Leader Election (306ms). Nodes: L3 F3 C3 F3. Messages: Request Vote, Request Vote. Second candidate doesn’t know first candidate won the term and begins requesting votes.

  59. Leader Election (306ms). Nodes: L3 F3 C3 F3. Messages: Vote Denied, Vote Denied. Peers already voted so votes are denied.

  60. Leader Election (310ms). Nodes: L3 F3 F3 F3. Leader notifies peers of election and other candidate steps down.

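The split-vote walkthrough rests on two rules: a node grants at most one vote per term, and candidates retry after a randomized timeout in the 150ms to 300ms range. A minimal sketch of both follows. The `Voter` class and its method names are hypothetical, and the sketch deliberately leaves out the log up-to-date check that a full Raft implementation also applies before granting a vote.

```python
import random


class Voter:
    """Tracks the one-vote-per-term rule for a single node (sketch only)."""

    def __init__(self, current_term=1):
        self.current_term = current_term
        self.voted_for = None

    def handle_request_vote(self, candidate_id, term):
        """Return True if the vote is granted, False if it is denied."""
        if term < self.current_term:
            return False  # stale candidate
        if term > self.current_term:
            # A newer term resets the vote.
            self.current_term = term
            self.voted_for = None
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False  # already voted for someone else in this term


def election_timeout_ms(rng=random, low=150, high=300):
    """Pick a randomized election timeout in milliseconds."""
    return rng.uniform(low, high)
```

Because each node draws its own timeout, two candidates are unlikely to time out at the same moment again, which is how the tie in term #2 is broken by term #3.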
  61. Log Replication

  62. Log Replication (0ms). Nodes: L1 F1 F1. State: <“”> <“”> <“”>.

  63. Log Replication (10ms). Nodes: L1 F1 F1. Log: 1 “sally”. State: <“”> <“”> <“”>. A new uncommitted log entry is added to the leader.

  64. Log Replication (20ms). Nodes: L1 F1 F1. Log: 1 “sally” 1 “sally” 1 “sally”. State: <“”> <“”> <“”>. Messages: Append Entries, Append Entries. At the next heartbeat, the log entry is replicated to followers.

  65. Log Replication (22ms). Nodes: L1 F1 F1. Log: 1 “sally” 1 “sally” 1 “sally”. State: <“sally”> <“”> <“”>. Messages: OK. A majority of nodes have written the log entry to disk, so it becomes committed.

  66. Log Replication (25ms). Nodes: L1 F1 F1. Log: 1 “sally” 1 “sally” 1 “sally”. State: <“sally”> <“”> <“”>. Messages: OK.

  67. Log Replication (40ms). Nodes: L1 F1 F1. Log: 1 “sally” 1 “sally” 1 “sally”. State: <“sally”> <“sally”> <“sally”>. Messages: Append Entries, Append Entries. At the next heartbeat, the leader notifies followers of updated committed entries.

  68. Log Replication (50ms). Nodes: L1 F1 F1. Log: 1 “sally” 1 “sally” 1 “sally”. State: <“sally”> <“sally”> <“sally”>.

  69. Log Replication (60ms). Nodes: L1 F1 F1. Log: 1 “sally” 1 “sally” 1 “sally”. State: <“sally”> <“sally”> <“sally”>. Messages: Append Entries, Append Entries. At the next heartbeat, no new log information is sent.

  70. Log Replication (70ms). Nodes: L1 F1 F1. Log: 1 “sally” 1 “sally” 1 “sally”. State: <“sally”> <“sally”> <“sally”>.

  71. Log Replication (75ms). Nodes: L1 F1 F1. Log: 1 “sally” 2 “bob” 1 “sally” 1 “sally”. State: <“sally”> <“sally”> <“sally”>. A new uncommitted log entry is added to the leader.

  72. Log Replication (80ms). Nodes: L1 F1 F1. Log: 1 “sally” 2 “bob” 1 “sally” 2 “bob” 1 “sally” 2 “bob”. State: <“sally”> <“sally”> <“sally”>. Messages: Append Entries, Append Entries. At the next heartbeat, the entry is replicated to the followers.

  73. Log Replication (82ms). Nodes: L1 F1 F1. Log: 1 “sally” 2 “bob” 1 “sally” 1 “sally”. State: <“bob”> <“sally”> <“sally”>. Messages: OK, OK. The entry is committed once the followers acknowledge the request.

  74. Log Replication (100ms). Nodes: L1 F1 F1. Log: 1 “sally” 2 “bob” 1 “sally” 2 “bob” 1 “sally” 2 “bob”. State: <“bob”> <“bob”> <“bob”>. Messages: Append Entries, Append Entries. At the next heartbeat, the leader notifies the followers of the new committed entry.

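The rule "an entry is committed once a majority of nodes have written it" can be sketched as a calculation over how far each node's log has been replicated. The function name `commit_index` and the `match_index` list (highest log index known to be stored on each node, leader included) are assumptions made for this sketch. A full Raft implementation adds a further condition about the entry's term, which is omitted here.

```python
def commit_index(match_index):
    """Highest log index stored on a majority of nodes.

    `match_index` holds, for every node in the cluster (leader included),
    the highest log index that node is known to have written.
    """
    ranked = sorted(match_index, reverse=True)
    # The value at position len // 2 is held by at least a majority of nodes.
    return ranked[len(ranked) // 2]
```

In the three-node walkthrough, `commit_index([1, 1, 0])` is 1: the leader and one follower have entry 1, so it is committed before the second follower answers. In a five-node cluster, `commit_index([2, 2, 1, 1, 1])` stays at 1, matching the partitioned leader that cannot reach a quorum.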
  75. Log Replication (with Network Partitions)

  76. Log Replication (0ms). L1 F1 <“”> <“”> F1 <“”> F1 <“”> F1 <“”>.

  77. Log Replication (10ms). L1 F1 1 “sally” <“”> <“”> F1 <“”> F1 <“”> F1 <“”>. A new uncommitted log entry is added to the leader.

  78. Log Replication (20ms). L1 F1 1 “sally” 1 “sally” <“”> <“”> F1 1 “sally” <“”> F1 1 “sally” <“”> F1 1 “sally” <“”>. Messages: Append Entries. On the next heartbeat, the entry is replicated to the followers.

  79. Log Replication (25ms). L1 F1 1 “sally” 1 “sally” <“sally”> <“”> F1 1 “sally” <“”> F1 1 “sally” <“”> F1 1 “sally” <“”>. Messages: OK. The followers acknowledge the entry and the entry is committed.

  80. Log Replication (40ms). L1 F1 1 “sally” 1 “sally” <“sally”> <“sally”> F1 1 “sally” <“sally”> F1 1 “sally” <“sally”> F1 1 “sally” <“sally”>. Messages: Append Entries. On the next heartbeat, the committed entry is replicated to the followers.

  81. Log Replication (50ms). L1 F1 1 “sally” 1 “sally” <“sally”> <“sally”> F1 1 “sally” <“sally”> F1 1 “sally” <“sally”> F1 1 “sally” <“sally”>.

  82. Log Replication (60ms). L1 F1 1 “sally” 1 “sally” <“sally”> <“sally”> F1 1 “sally” <“sally”> F1 1 “sally” <“sally”> F1 1 “sally” <“sally”>. A network partition makes a majority of nodes inaccessible from the leader.

  83. Log Replication (70ms). L1 F1 1 “sally” 2 “bob” 1 “sally” <“sally”> <“sally”> F1 1 “sally” <“sally”> F1 1 “sally” <“sally”> F1 1 “sally” <“sally”>. A new log entry is added to the leader.

  84. Log Replication (80ms). L1 F1 1 “sally” 2 “bob” 1 “sally” 2 “bob” <“sally”> <“sally”> F1 1 “sally” <“sally”> F1 1 “sally” <“sally”> F1 1 “sally” <“sally”>. Messages: Append Entries. The leader replicates the entry to the only accessible follower.

  85. Log Replication (85ms). L1 F1 1 “sally” 2 “bob” 1 “sally” 2 “bob” <“sally”> <“sally”> F1 1 “sally” <“sally”> F1 1 “sally” <“sally”> F1 1 “sally” <“sally”>. Messages: OK. The follower acknowledges the entry but there is not a quorum.

  86. Log Replication (90ms). L1 F1 1 “sally” 2 “bob” 1 “sally” 2 “bob” <“sally”> <“sally”> F1 1 “sally” <“sally”> F1 1 “sally” <“sally”> F1 1 “sally” <“sally”>.

  87. Log Replication (190ms). L1 F1 1 “sally” 2 “bob” 1 “sally” 2 “bob” <“sally”> <“sally”> C2 1 “sally” <“sally”> F1 1 “sally” <“sally”> F1 1 “sally” <“sally”>. Messages: Request Vote. After an election timeout, one disconnected follower becomes a candidate.

  88. Log Replication (195ms). L1 F1 1 “sally” 2 “bob” 1 “sally” 2 “bob” <“sally”> <“sally”> L2 1 “sally” <“sally”> F2 1 “sally” <“sally”> F2 1 “sally” <“sally”>. Messages: Vote Granted. The candidate receives a majority of votes and becomes a leader.

  89. Log Replication (200ms). L1 F1 1 “sally” 2 “bob” 1 “sally” 2 “bob” <“sally”> <“sally”> L2 1 “sally” <“sally”> F2 1 “sally” <“sally”> F2 1 “sally” <“sally”>.

  90. Log Replication (210ms). L1 F1 1 “sally” 2 “bob” 1 “sally” 2 “bob” <“sally”> <“sally”> L2 1 “sally” 2 “tom” <“sally”> F2 1 “sally” <“sally”> F2 1 “sally” <“sally”>. A log entry is added to the new leader.

  91. Log Replication (220ms). L1 F1 1 “sally” 2 “bob” 1 “sally” 2 “bob” <“sally”> <“sally”> L2 1 “sally” 2 “tom” <“sally”> F2 1 “sally” 2 “tom” <“sally”> F2 1 “sally” 2 “tom” <“sally”>. Messages: Append Entries. The log entry is replicated to the accessible followers.

  92. Log Replication (225ms). L1 F1 1 “sally” 2 “bob” 1 “sally” 2 “bob” <“sally”> <“sally”> L2 1 “sally” 2 “tom” <“tom”> F2 1 “sally” 2 “tom” <“sally”> F2 1 “sally” 2 “tom” <“sally”>. Messages: OK. A majority of nodes acknowledge the entry so it becomes committed.

  93. Log Replication (240ms). L1 F1 1 “sally” 2 “bob” 1 “sally” 2 “bob” <“sally”> <“sally”> L2 1 “sally” 2 “tom” <“tom”> F2 1 “sally” 2 “tom” <“tom”> F2 1 “sally” 2 “tom” <“tom”>. Messages: Append Entries. On the next heartbeat, the followers are notified the entry is committed.

  94. Log Replication (250ms). L1 F1 1 “sally” 2 “bob” 1 “sally” 2 “bob” <“sally”> <“sally”> L2 1 “sally” 2 “tom” <“tom”> F2 1 “sally” 2 “tom” <“tom”> F2 1 “sally” 2 “tom” <“tom”>.

  95. Log Replication (255ms). L1 F1 1 “sally” 2 “bob” 1 “sally” 2 “bob” <“sally”> <“sally”> L2 1 “sally” 2 “tom” <“tom”> F2 1 “sally” 2 “tom” <“tom”> F2 1 “sally” 2 “tom” <“tom”>. The network recovers and there is no longer a partition.

  96. Log Replication (260ms). L1 F1 1 “sally” 2 “bob” 1 “sally” 2 “bob” <“sally”> <“sally”> L2 1 “sally” 2 “tom” <“tom”> F2 1 “sally” 2 “tom” <“tom”> F2 1 “sally” 2 “tom” <“tom”>. Messages: Append Entries. The new leader sends a heartbeat on the next heartbeat timeout.

  97. Log Replication (260ms). F2 F2 1 “sally” 2 “bob” 1 “sally” 2 “bob” <“sally”> <“sally”> L2 1 “sally” 2 “tom” <“tom”> F2 1 “sally” 2 “tom” <“tom”> F2 1 “sally” 2 “tom” <“tom”>. Messages: Append Entries. The leader of term #1 steps down after seeing a new leader in term #2.

  98. Log Replication (260ms). F2 F2 1 “sally” 1 “sally” <“sally”> <“sally”> L2 1 “sally” 2 “tom” <“tom”> F2 1 “sally” 2 “tom” <“tom”> F2 1 “sally” 2 “tom” <“tom”>. Messages: Append Entries. Uncommitted entries from disconnected nodes are discarded.

  99. Log Replication (260ms). F2 F2 1 “sally” 2 “tom” 1 “sally” 2 “tom” <“tom”> <“tom”> L2 1 “sally” 2 “tom” <“tom”> F2 1 “sally” 2 “tom” <“tom”> F2 1 “sally” 2 “tom” <“tom”>. Messages: Append Entries. New log entries are appended to the previously disconnected nodes.

  100. Log Replication (260ms). F2 F2 1 “sally” 2 “tom” 1 “sally” 2 “tom” <“tom”> <“tom”> L2 1 “sally” 2 “tom” <“tom”> F2 1 “sally” 2 “tom” <“tom”> F2 1 “sally” 2 “tom” <“tom”>.

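The recovery steps after the partition heals (the old leader steps down on seeing a higher term, drops its uncommitted "bob" entry, and appends "tom") can be sketched as a receiver-side handler. This is a simplified illustration under stated assumptions: the `Node` class, its fields, and the `handle_append_entries` signature are hypothetical, log entries are modelled as `(term, value)` tuples, and the previous-entry term check used by full Raft implementations is left out.

```python
class Node:
    """One node's term, role and log; entries are (term, value) tuples."""

    def __init__(self, term, role, log):
        self.term = term
        self.role = role
        self.log = list(log)

    def handle_append_entries(self, leader_term, prev_index, entries):
        """Apply an Append Entries message; return True if accepted.

        `prev_index` is the number of log entries that precede `entries`.
        """
        if leader_term < self.term:
            return False  # message from a stale leader is rejected
        # A leader or candidate seeing an equal or newer term steps down.
        self.term = leader_term
        self.role = "follower"
        if prev_index > len(self.log):
            return False  # this node is missing earlier entries
        for offset, entry in enumerate(entries):
            i = prev_index + offset
            if i < len(self.log) and self.log[i] != entry:
                del self.log[i:]  # discard conflicting uncommitted entries
            if i >= len(self.log):
                self.log.append(entry)
        return True
```

Replaying slides 97 to 99: a node created as `Node(1, "leader", [(1, "sally"), (1, "bob")])` that receives `handle_append_entries(2, 1, [(2, "tom")])` ends up as a follower in term 2 with the log `[(1, "sally"), (2, "tom")]`.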
  101. Log Compaction

  102. Unbounded log can grow until there’s no more disk

  103. Recovery time increases as log length increases

  104. Three Log Compaction Strategies

  105. #1: Leader-Initiated, Stored in Log [start] chunk entry chunk entry [end] Raft Library

  106. #2: Leader-Initiated, Stored Externally entry entry entry Snapshot Raft Library

  107. #3: Independently-Initiated, Stored Externally Application entry entry entry Snapshot
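Whichever of the three strategies is used, the common idea is to replace a committed prefix of the log with a snapshot of the state it produced. The sketch below is a minimal illustration, assuming the single-value state machine from the replication slides (applying an entry sets the state to that entry's value). The `Snapshot` dataclass and `compact` function are hypothetical names, not part of any Raft library.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    last_index: int  # index of the last log entry folded into the snapshot
    state: str       # state machine value after applying that entry


def compact(snapshot, log, commit_index):
    """Fold committed entries into a new snapshot.

    `log` holds the entry values for indices snapshot.last_index + 1 onward.
    Returns the new snapshot and the remaining (shorter) log.
    """
    count = commit_index - snapshot.last_index
    if count <= 0:
        return snapshot, log  # nothing new is committed
    applied, remaining = log[:count], log[count:]
    return Snapshot(commit_index, applied[-1]), remaining
```

For example, `compact(Snapshot(0, ""), ["sally", "bob", "tom"], 2)` returns `Snapshot(2, "bob")` and leaves `["tom"]` in the log, so both disk usage and replay time on recovery are bounded by the uncompacted tail.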

  108. Questions? Twitter: @benbjohnson GitHub: benbjohnson ben@skylandlabs.com

  109. Image Attribution: Database designed by Sergey Shmidt from The Noun Project; Question designed by Greg Pabst from The Noun Project; Lock from The Noun Project; Floppy Disk designed by Mike Wirth from The Noun Project; Movie designed by Anna Weiss from The Noun Project