Slide 1

Slide 1 text

Raft: The Understandable Distributed Consensus Protocol @benbjohnson

Slide 2

Slide 2 text

What is Distributed Consensus?

Slide 3

Slide 3 text

Distributed = Many nodes Consensus = Agreement

Slide 4

Slide 4 text

Distributed = Many nodes Consensus = Agreement

Slide 5

Slide 5 text

Leader Election Data Replication Distributed Locks

Slide 6

Slide 6 text

A Really Short History Of Distributed Consensus Protocols

Slide 7

Slide 7 text

A Really Short History Of Distributed Consensus Protocols Paxos (1989)

Slide 8

Slide 8 text

Paxos In A Nutshell

Slide 9

Slide 9 text

Paxos In A Nutshell Client

Slide 10

Slide 10 text

Paxos In A Nutshell Client Proposer Client requests change to system

Slide 11

Slide 11 text

Paxos In A Nutshell Client Proposer Acceptor Acceptor Acceptor Proposer tells Acceptors to get ready for a change Ready?

Slide 12

Slide 12 text

Paxos In A Nutshell Client Proposer Acceptor Acceptor Acceptor Acceptors confirm to Proposer that they’re ready Hell yeah!

Slide 13

Slide 13 text

Paxos In A Nutshell Client Proposer Acceptor Acceptor Acceptor Proposer sends change to Acceptors Here you go

Slide 14

Slide 14 text

Paxos In A Nutshell Client Proposer Acceptor Acceptor Acceptor Learner Learner Learner Learner Acceptors propagate change to Learners

Slide 15

Slide 15 text

Paxos In A Nutshell Client Proposer Acceptor Acceptor Acceptor Learner Learner Learner Learner Proposer is now recognized as leader Leader

Slide 16

Slide 16 text

Paxos In A Nutshell Client Proposer Acceptor Acceptor Acceptor Learner Learner Learner Learner Repeat for every new change to the system Leader
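
The prepare/accept exchange walked through above can be sketched in a few lines of Python. This is a minimal in-memory sketch of single-decree Paxos, not a production implementation: the `Acceptor` class, the `propose` function, and the integer proposal numbers are illustrative names that do not come from the slides, and Learners and the leader step are left out.

```python
class Acceptor:
    """Hypothetical in-memory acceptor for a single Paxos decision."""

    def __init__(self):
        self.promised = 0      # highest proposal number promised so far
        self.accepted = None   # (proposal number, value) or None

    def prepare(self, n):
        # "Ready?": promise to ignore proposals numbered below n.
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n, value):
        # "Here you go": accept unless a higher number was promised.
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False


def propose(acceptors, n, value):
    """Run both phases; return the chosen value, or None on failure."""
    replies = [a.prepare(n) for a in acceptors]
    granted = [prior for ok, prior in replies if ok]
    if len(granted) <= len(acceptors) // 2:
        return None  # no majority of promises
    prior = [p for p in granted if p is not None]
    if prior:
        # A value may already be chosen, so it must be re-proposed.
        value = max(prior, key=lambda p: p[0])[1]
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks > len(acceptors) // 2 else None


acceptors = [Acceptor() for _ in range(3)]
first = propose(acceptors, 1, "x")
second = propose(acceptors, 2, "y")   # must adopt the earlier value
stale = propose(acceptors, 1, "z")    # rejected: number too low
```

In this sketch a later proposer with a different value ends up re-proposing the value that was already accepted, which is the property that keeps the acceptors in agreement.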

Slide 17

Slide 17 text

Fun Raft Facts

Slide 18

Slide 18 text

Created By:

Slide 19

Slide 19 text

Diego Ongaro Ph.D. Student Stanford University

Slide 20

Slide 20 text

John Ousterhout Professor of Computer Science Stanford University Diego Ongaro Ph.D. Student Stanford University

Slide 21

Slide 21 text

28 Implementations across various languages

Slide 22

Slide 22 text

In Commercial Use CoreOS (etcd) go-raft

Slide 23

Slide 23 text

Raft Basics

Slide 24

Slide 24 text

Three Roles:

Slide 25

Slide 25 text

The Leader

Slide 26

Slide 26 text

The Follower

Slide 27

Slide 27 text

The Candidate

Slide 28

Slide 28 text

High-Level Example:

Slide 29

Slide 29 text

F F F

Slide 30

Slide 30 text

C F F

Slide 31

Slide 31 text

C F F Vote for me! Vote for me!

Slide 32

Slide 32 text

C F F Ok! Ok!

Slide 33

Slide 33 text

L F F

Slide 34

Slide 34 text

L F F Log entries Log entries

Slide 35

Slide 35 text

L F F Heartbeats Heartbeats

Slide 36

Slide 36 text

X F F

Slide 37

Slide 37 text

X C F

Slide 38

Slide 38 text

X C F Vote for me!

Slide 39

Slide 39 text

X C F Ok!

Slide 40

Slide 40 text

X L F Log Entries & Heartbeats
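
The walkthrough above moves nodes between the three roles (F, C, L). A minimal Python sketch of those transitions follows; the event names (`election_timeout`, `won_majority`, and so on) are hypothetical labels chosen for illustration, and a crashed node (X) is not modeled as a role.

```python
from enum import Enum


class Role(Enum):
    FOLLOWER = "F"
    CANDIDATE = "C"
    LEADER = "L"


# (current role, event) -> next role. Event names are illustrative.
TRANSITIONS = {
    (Role.FOLLOWER, "election_timeout"): Role.CANDIDATE,
    (Role.CANDIDATE, "election_timeout"): Role.CANDIDATE,  # retry in a new term
    (Role.CANDIDATE, "won_majority"): Role.LEADER,
    (Role.CANDIDATE, "saw_leader"): Role.FOLLOWER,
    (Role.CANDIDATE, "saw_higher_term"): Role.FOLLOWER,
    (Role.LEADER, "saw_higher_term"): Role.FOLLOWER,
}


def next_role(role, event):
    """Return the role after `event`; unknown events leave it unchanged."""
    return TRANSITIONS.get((role, event), role)


role = Role.FOLLOWER
for event in ("election_timeout", "won_majority"):
    role = next_role(role, event)
```

After the two events in the usage lines, `role` is `Role.LEADER`, mirroring the F, then C, then L progression in the example.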

Slide 41

Slide 41 text

Leader Election

Slide 42

Slide 42 text

F1 F1 F1 Leader Election t(ms) 0 200 400 600 800 1000 0ms

Slide 43

Slide 43 text

C2 F1 F1 Leader Election t(ms) 0 200 400 600 800 1000 Request Vote 150ms Request Vote (Fails) One follower becomes a candidate after an election timeout and requests votes

Slide 44

Slide 44 text

C2 F2 F1 Leader Election t(ms) 0 200 400 600 800 1000 Grant Vote 155ms Candidate receives one vote from a peer and one vote from self

Slide 45

Slide 45 text

L2 F2 F2 Leader Election t(ms) 0 200 400 600 800 1000 156ms Two votes is a majority so candidate becomes leader
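
The election above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: `Node`, `handle_request_vote`, and `run_election` are hypothetical names, messages are plain method calls, and the check Raft makes on how up to date the candidate's log is has been omitted.

```python
class Node:
    """Hypothetical Raft node that tracks only its term and its vote."""

    def __init__(self, name, term=1):
        self.name = name
        self.term = term
        self.voted_for = None  # candidate this node voted for in self.term

    def handle_request_vote(self, candidate_name, candidate_term):
        if candidate_term < self.term:
            return False  # stale candidate
        if candidate_term > self.term:
            # A newer term starts with a fresh vote.
            self.term = candidate_term
            self.voted_for = None
        if self.voted_for in (None, candidate_name):
            self.voted_for = candidate_name
            return True
        return False  # already voted for someone else this term


def run_election(candidate, reachable_peers, cluster_size):
    """Start a new term and return True if the candidate wins a majority."""
    candidate.term += 1
    candidate.voted_for = candidate.name
    votes = 1  # the candidate's own vote
    for peer in reachable_peers:
        if peer.handle_request_vote(candidate.name, candidate.term):
            votes += 1
    return votes > cluster_size // 2


a, b, c = Node("a"), Node("b"), Node("c")
won = run_election(a, [b], cluster_size=3)  # the request to c fails
```

With one peer vote plus its own, the candidate holds two of three votes and wins, even though the request to the third node never arrived.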

Slide 46

Slide 46 text

Leader Election (Split Vote)

Slide 47

Slide 47 text

F1 F1 F1 Leader Election t(ms) 0 200 400 600 800 1000 0ms F1

Slide 48

Slide 48 text

C2 F1 C2 Leader Election t(ms) 0 200 400 600 800 1000 F1 Request Vote Request Vote 150ms Two followers become candidates simultaneously and begin requesting votes

Slide 49

Slide 49 text

C2 F2 C2 Leader Election t(ms) 0 200 400 600 800 1000 F2 Vote Granted Vote Granted 155ms Each candidate receives a vote from themselves and from one peer

Slide 50

Slide 50 text

C2 F2 C2 Leader Election t(ms) 0 200 400 600 800 1000 F2 Request Vote 156ms Each candidate requests a vote from a peer who has already voted

Slide 51

Slide 51 text

C2 F2 C2 Leader Election t(ms) 0 200 400 600 800 1000 F2 Vote Denied 160ms Vote requests are denied because the follower has already voted

Slide 52

Slide 52 text

C2 F2 C2 Leader Election t(ms) 0 200 400 600 800 1000 F2 Request Vote 161ms Candidates try to request votes from each other

Slide 53

Slide 53 text

C2 F2 C2 Leader Election t(ms) 0 200 400 600 800 1000 F2 Vote Denied 165ms Vote requests are denied because candidates voted for themselves

Slide 54

Slide 54 text

C2 F2 C2 Leader Election t(ms) 0 200 400 600 800 1000 F2 200ms Candidates wait for a randomized election timeout to occur (150ms - 300ms)
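
The randomized wait is what breaks the tie. A minimal Python sketch, assuming the 150ms to 300ms range shown on the slide; the function names and the seeded generator are illustrative choices, not part of any Raft library:

```python
import random


def election_timeout_ms(rng, low=150.0, high=300.0):
    """Pick a random election timeout in [low, high] milliseconds."""
    return rng.uniform(low, high)


def first_to_retry(timeouts):
    """Return the name of the node whose timeout fires first."""
    return min(timeouts, key=timeouts.get)


rng = random.Random(42)  # seeded only to make the sketch repeatable
timeouts = {
    name: election_timeout_ms(rng)
    for name in ("candidate_a", "candidate_b")
}
winner = first_to_retry(timeouts)
```

Because the two candidates are unlikely to draw the same timeout, one of them usually starts the next term alone and can collect a majority before the other retries.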

Slide 55

Slide 55 text

C2 F2 C2 Leader Election t(ms) 0 200 400 600 800 1000 F2 250ms Still waiting...

Slide 56

Slide 56 text

C3 F2 C2 Leader Election t(ms) 0 200 400 600 800 1000 F2 300ms Request Vote Request Vote One candidate begins election term #3

Slide 57

Slide 57 text

L3 F3 C2 Leader Election t(ms) 0 200 400 600 800 1000 F3 305ms Vote Granted Vote Granted Candidate receives vote from itself and two peer votes so it becomes leader for election term #3

Slide 58

Slide 58 text

L3 F3 C3 Leader Election t(ms) 0 200 400 600 800 1000 F3 306ms Request Vote Request Vote Second candidate doesn’t know first candidate won the term and begins requesting votes

Slide 59

Slide 59 text

L3 F3 C3 Leader Election t(ms) 0 200 400 600 800 1000 F3 306ms Vote Denied Vote Denied Peers already voted so votes are denied

Slide 60

Slide 60 text

L3 F3 F3 Leader Election t(ms) 0 200 400 600 800 1000 F3 310ms Leader notifies peers of election and other candidate steps down
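
The step-down rule in the last slide can be sketched as a small Python function. The function name, the string role labels, and the tuple return shape are assumptions made for illustration:

```python
def on_append_entries(role, current_term, leader_term):
    """Handle a heartbeat from a node claiming leadership in leader_term.

    Returns (new_role, new_term, accepted).
    """
    if leader_term < current_term:
        # Message from a stale leader: reject and keep the current state.
        return role, current_term, False
    # A legitimate leader exists for this term or a newer one: follow it.
    return "follower", leader_term, True


# The losing candidate in term 3 hears from the leader of term 3.
after_heartbeat = on_append_entries("candidate", 3, 3)
```

Here `after_heartbeat` is `("follower", 3, True)`, matching the C3 to F3 change on the slide.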

Slide 61

Slide 61 text

Log Replication

Slide 62

Slide 62 text

L1 F1 F1 Log Replication t(ms) 0 200 400 600 800 1000 0ms <“”> <“”> <“”>

Slide 63

Slide 63 text

L1 F1 F1 Log Replication t(ms) 0 200 400 600 800 1000 10ms 1 “sally” <“”> <“”> <“”> A new uncommitted log entry is added to the leader

Slide 64

Slide 64 text

L1 F1 F1 Log Replication t(ms) 0 200 400 600 800 1000 20ms <“”> <“”> <“”> 1 “sally” 1 “sally” 1 “sally” Append Entries Append Entries At the next heartbeat, the log entry is replicated to followers

Slide 65

Slide 65 text

L1 F1 F1 Log Replication t(ms) 0 200 400 600 800 1000 OK <“sally”> <“”> <“”> 1 “sally” 1 “sally” 1 “sally” 22ms A majority of nodes have written the log entry to disk so it becomes committed

Slide 66

Slide 66 text

L1 F1 F1 Log Replication t(ms) 0 200 400 600 800 1000 <“sally”> <“”> <“”> 1 “sally” 1 “sally” 1 “sally” OK 25ms

Slide 67

Slide 67 text

L1 F1 F1 Log Replication t(ms) 0 200 400 600 800 1000 <“sally”> <“sally”> <“sally”> 1 “sally” 1 “sally” 1 “sally” Append Entries Append Entries 40ms At the next heartbeat, the leader notifies followers of updated committed entries

Slide 68

Slide 68 text

L1 F1 F1 Log Replication t(ms) 0 200 400 600 800 1000 <“sally”> <“sally”> <“sally”> 1 “sally” 1 “sally” 1 “sally” 50ms

Slide 69

Slide 69 text

L1 F1 F1 Log Replication t(ms) 0 200 400 600 800 1000 <“sally”> <“sally”> <“sally”> 1 “sally” 1 “sally” 1 “sally” Append Entries Append Entries 60ms At the next heartbeat, no new log information is sent

Slide 70

Slide 70 text

L1 F1 F1 Log Replication t(ms) 0 200 400 600 800 1000 <“sally”> <“sally”> <“sally”> 1 “sally” 1 “sally” 1 “sally” 70ms

Slide 71

Slide 71 text

L1 F1 F1 Log Replication t(ms) 0 200 400 600 800 1000 1 “sally” 2 “bob” 1 “sally” 1 “sally” <“sally”> <“sally”> <“sally”> 75ms A new uncommitted log entry is added to the leader

Slide 72

Slide 72 text

L1 F1 F1 Log Replication t(ms) 0 200 400 600 800 1000 <“sally”> <“sally”> <“sally”> 1 “sally” 2 “bob” 1 “sally” 2 “bob” 1 “sally” 2 “bob” Append Entries Append Entries 80ms At the next heartbeat, the entry is replicated to the followers

Slide 73

Slide 73 text

L1 F1 F1 Log Replication t(ms) 0 200 400 600 800 1000 OK <“bob”> <“sally”> <“sally”> 1 “sally” 2 “bob” 1 “sally” 1 “sally” OK 82ms The entry is committed once the followers acknowledge the request

Slide 74

Slide 74 text

L1 F1 F1 Log Replication t(ms) 0 200 400 600 800 1000 <“bob”> <“bob”> <“bob”> 1 “sally” 2 “bob” 1 “sally” 2 “bob” 1 “sally” 2 “bob” Append Entries Append Entries 100ms At the next heartbeat, the leader notifies the followers of the new committed entry
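
The commit rule used throughout this walkthrough (an entry is committed once a majority of nodes have stored it) can be sketched in Python. `commit_index` and `match_index` are illustrative names; the sketch also ignores the extra restriction in Raft that a leader only commits entries from its own term by counting replicas.

```python
def commit_index(match_index):
    """Return the highest log index stored on a majority of nodes.

    match_index holds, for every node including the leader, the highest
    log index known to be written on that node.
    """
    ranked = sorted(match_index, reverse=True)
    majority = len(match_index) // 2 + 1
    # The majority-th highest value is stored on at least `majority` nodes.
    return ranked[majority - 1]


# Leader and one follower have written "bob" at index 2.
committed = commit_index([2, 2, 1])
```

With two of three nodes holding index 2, `committed` is 2; if only the leader had it (`[2, 1, 1]`), the result would stay at 1.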

Slide 75

Slide 75 text

Log Replication (with Network Partitions)

Slide 76

Slide 76 text

L1 F1 Log Replication t(ms) 0 200 400 600 800 1000 0ms <“”> <“”> F1 <“”> F1 <“”> F1 <“”>

Slide 77

Slide 77 text

L1 F1 Log Replication t(ms) 0 200 400 600 800 1000 10ms 1 “sally” <“”> <“”> F1 <“”> F1 <“”> F1 <“”> A new uncommitted log entry is added to the leader

Slide 78

Slide 78 text

L1 F1 Log Replication t(ms) 0 200 400 600 800 1000 20ms 1 “sally” 1 “sally” <“”> <“”> F1 1 “sally” <“”> F1 1 “sally” <“”> F1 1 “sally” <“”> Append Entries On the next heartbeat, the entry is replicated to the followers

Slide 79

Slide 79 text

L1 F1 Log Replication t(ms) 0 200 400 600 800 1000 25ms 1 “sally” 1 “sally” <“sally”> <“”> F1 1 “sally” <“”> F1 1 “sally” <“”> F1 1 “sally” <“”> OK The followers acknowledge the entry and the entry is committed

Slide 80

Slide 80 text

L1 F1 Log Replication t(ms) 0 200 400 600 800 1000 40ms 1 “sally” 1 “sally” <“sally”> <“sally”> F1 1 “sally” <“sally”> F1 1 “sally” <“sally”> F1 1 “sally” <“sally”> Append Entries On the next heartbeat, the committed entry is replicated to the followers

Slide 81

Slide 81 text

L1 F1 Log Replication t(ms) 0 200 400 600 800 1000 1 “sally” 1 “sally” <“sally”> <“sally”> F1 1 “sally” <“sally”> F1 1 “sally” <“sally”> F1 1 “sally” <“sally”> 50ms

Slide 82

Slide 82 text

L1 F1 Log Replication t(ms) 0 200 400 600 800 1000 1 “sally” 1 “sally” <“sally”> <“sally”> F1 1 “sally” <“sally”> F1 1 “sally” <“sally”> F1 1 “sally” <“sally”> A network partition makes a majority of nodes inaccessible from the leader 60ms

Slide 83

Slide 83 text

L1 F1 Log Replication t(ms) 0 200 400 600 800 1000 1 “sally” 2 “bob” 1 “sally” <“sally”> <“sally”> F1 1 “sally” <“sally”> F1 1 “sally” <“sally”> F1 1 “sally” <“sally”> A new log entry is added to the leader 70ms

Slide 84

Slide 84 text

L1 F1 Log Replication t(ms) 0 200 400 600 800 1000 1 “sally” 2 “bob” 1 “sally” 2 “bob” <“sally”> <“sally”> F1 1 “sally” <“sally”> F1 1 “sally” <“sally”> F1 1 “sally” <“sally”> The leader replicates the entry to the only accessible follower Append Entries 80ms

Slide 85

Slide 85 text

L1 F1 Log Replication t(ms) 0 200 400 600 800 1000 1 “sally” 2 “bob” 1 “sally” 2 “bob” <“sally”> <“sally”> F1 1 “sally” <“sally”> F1 1 “sally” <“sally”> F1 1 “sally” <“sally”> The follower acknowledges the entry but there is not a quorum OK 85ms

Slide 86

Slide 86 text

L1 F1 Log Replication t(ms) 0 200 400 600 800 1000 1 “sally” 2 “bob” 1 “sally” 2 “bob” <“sally”> <“sally”> F1 1 “sally” <“sally”> F1 1 “sally” <“sally”> F1 1 “sally” <“sally”> 90ms

Slide 87

Slide 87 text

L1 F1 Log Replication t(ms) 0 200 400 600 800 1000 1 “sally” 2 “bob” 1 “sally” 2 “bob” <“sally”> <“sally”> C2 1 “sally” <“sally”> F1 1 “sally” <“sally”> F1 1 “sally” <“sally”> After an election timeout, one disconnected follower becomes a candidate Request Vote 190ms

Slide 88

Slide 88 text

L1 F1 Log Replication t(ms) 0 200 400 600 800 1000 1 “sally” 2 “bob” 1 “sally” 2 “bob” <“sally”> <“sally”> L2 1 “sally” <“sally”> F2 1 “sally” <“sally”> F2 1 “sally” <“sally”> The candidate receives a majority of votes and becomes a leader Vote Granted 195ms

Slide 89

Slide 89 text

L1 F1 Log Replication t(ms) 0 200 400 600 800 1000 1 “sally” 2 “bob” 1 “sally” 2 “bob” <“sally”> <“sally”> L2 1 “sally” <“sally”> F2 1 “sally” <“sally”> F2 1 “sally” <“sally”> 200ms

Slide 90

Slide 90 text

L1 F1 Log Replication t(ms) 0 200 400 600 800 1000 1 “sally” 2 “bob” 1 “sally” 2 “bob” <“sally”> <“sally”> L2 1 “sally” 2 “tom” <“sally”> F2 1 “sally” <“sally”> F2 1 “sally” <“sally”> A log entry is added to the new leader 210ms

Slide 91

Slide 91 text

L1 F1 Log Replication t(ms) 0 200 400 600 800 1000 1 “sally” 2 “bob” 1 “sally” 2 “bob” <“sally”> <“sally”> L2 1 “sally” 2 “tom” <“sally”> F2 1 “sally” 2 “tom” <“sally”> F2 1 “sally” 2 “tom” <“sally”> The log entry is replicated to the accessible followers Append Entries 220ms

Slide 92

Slide 92 text

L1 F1 Log Replication t(ms) 0 200 400 600 800 1000 1 “sally” 2 “bob” 1 “sally” 2 “bob” <“sally”> <“sally”> L2 1 “sally” 2 “tom” <“tom”> F2 1 “sally” 2 “tom” <“sally”> F2 1 “sally” 2 “tom” <“sally”> A majority of nodes acknowledge the entry so it becomes committed OK 225ms
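
The difference between the two sides of the partition comes down to a quorum check. A minimal Python sketch, with `has_quorum` as a hypothetical helper name:

```python
def has_quorum(acks, cluster_size):
    """True if `acks` nodes (counting the leader) form a strict majority."""
    return acks > cluster_size // 2


old_side = has_quorum(2, 5)  # old leader plus its one reachable follower
new_side = has_quorum(3, 5)  # new leader plus two followers

# No split of five nodes gives both sides a majority at once.
both_possible = any(
    has_quorum(k, 5) and has_quorum(5 - k, 5) for k in range(6)
)
```

Only the three-node side can commit, which is why "bob" stays uncommitted while "tom" is committed.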

Slide 93

Slide 93 text

Append Entries L1 F1 Log Replication t(ms) 0 200 400 600 800 1000 1 “sally” 2 “bob” 1 “sally” 2 “bob” <“sally”> <“sally”> L2 1 “sally” 2 “tom” <“tom”> F2 1 “sally” 2 “tom” <“tom”> F2 1 “sally” 2 “tom” <“tom”> On the next heartbeat, the followers are notified the entry is committed 240ms

Slide 94

Slide 94 text

L1 F1 Log Replication t(ms) 0 200 400 600 800 1000 1 “sally” 2 “bob” 1 “sally” 2 “bob” <“sally”> <“sally”> L2 1 “sally” 2 “tom” <“tom”> F2 1 “sally” 2 “tom” <“tom”> F2 1 “sally” 2 “tom” <“tom”> 250ms

Slide 95

Slide 95 text

L1 F1 Log Replication t(ms) 0 200 400 600 800 1000 1 “sally” 2 “bob” 1 “sally” 2 “bob” <“sally”> <“sally”> L2 1 “sally” 2 “tom” <“tom”> F2 1 “sally” 2 “tom” <“tom”> F2 1 “sally” 2 “tom” <“tom”> The network recovers and there is no longer a partition 255ms

Slide 96

Slide 96 text

Append Entries L1 F1 Log Replication t(ms) 0 200 400 600 800 1000 1 “sally” 2 “bob” 1 “sally” 2 “bob” <“sally”> <“sally”> L2 1 “sally” 2 “tom” <“tom”> F2 1 “sally” 2 “tom” <“tom”> F2 1 “sally” 2 “tom” <“tom”> The new leader sends a heartbeat on the next heartbeat timeout 260ms

Slide 97

Slide 97 text

Append Entries F2 F2 Log Replication t(ms) 0 200 400 600 800 1000 1 “sally” 2 “bob” 1 “sally” 2 “bob” <“sally”> <“sally”> L2 1 “sally” 2 “tom” <“tom”> F2 1 “sally” 2 “tom” <“tom”> F2 1 “sally” 2 “tom” <“tom”> The leader of term #1 steps down after seeing a new leader in term #2 260ms

Slide 98

Slide 98 text

Append Entries F2 F2 Log Replication t(ms) 0 200 400 600 800 1000 1 “sally” 1 “sally” <“sally”> <“sally”> L2 1 “sally” 2 “tom” <“tom”> F2 1 “sally” 2 “tom” <“tom”> F2 1 “sally” 2 “tom” <“tom”> Uncommitted entries from disconnected nodes are discarded 260ms

Slide 99

Slide 99 text

Append Entries F2 F2 Log Replication t(ms) 0 200 400 600 800 1000 1 “sally” 2 “tom” 1 “sally” 2 “tom” <“tom”> <“tom”> L2 1 “sally” 2 “tom” <“tom”> F2 1 “sally” 2 “tom” <“tom”> F2 1 “sally” 2 “tom” <“tom”> New log entries are appended to the previously disconnected nodes 260ms

Slide 100

Slide 100 text

F2 F2 Log Replication t(ms) 0 200 400 600 800 1000 1 “sally” 2 “tom” 1 “sally” 2 “tom” <“tom”> <“tom”> L2 1 “sally” 2 “tom” <“tom”> F2 1 “sally” 2 “tom” <“tom”> F2 1 “sally” 2 “tom” <“tom”> 260ms
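
The repair of the previously disconnected nodes can be sketched as the follower side of Append Entries. This is a simplified Python illustration: the function name, the `(term, value)` tuple layout, and the assumption that "bob" was written in term 1 and "tom" in term 2 are introduced here and are not shown on the slides.

```python
def append_entries(log, prev_index, prev_term, entries):
    """Apply a leader's entries to a follower log.

    log and entries are lists of (term, value); indexes are 1-based.
    Returns (ok, new_log).
    """
    # Consistency check: the follower must hold the entry before the new ones.
    if prev_index > len(log):
        return False, log
    if prev_index > 0 and log[prev_index - 1][0] != prev_term:
        return False, log

    new_log = list(log)
    for offset, entry in enumerate(entries):
        pos = prev_index + offset  # 0-based slot for this entry
        if pos < len(new_log):
            if new_log[pos][0] != entry[0]:
                # Conflict: drop this entry and everything after it.
                del new_log[pos:]
                new_log.append(entry)
        else:
            new_log.append(entry)
    return True, new_log


old_leader_log = [(1, "sally"), (1, "bob")]  # "bob" was never committed
ok, repaired = append_entries(
    old_leader_log, prev_index=1, prev_term=1, entries=[(2, "tom")]
)
```

The uncommitted "bob" entry conflicts with the new leader's entry at index 2, so it is discarded and replaced by "tom".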

Slide 101

Slide 101 text

Log Compaction

Slide 102

Slide 102 text

Unbounded log can grow until there’s no more disk

Slide 103

Slide 103 text

Recovery time increases as log length increases

Slide 104

Slide 104 text

Three Log Compaction Strategies

Slide 105

Slide 105 text

#1: Leader-Initiated, Stored in Log [start] chunk entry chunk entry [end] Raft Library

Slide 106

Slide 106 text

#2: Leader-Initiated, Stored Externally entry entry entry Snapshot Raft Library

Slide 107

Slide 107 text

#3: Independently-Initiated, Stored Externally Application entry entry entry Snapshot
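
For the externally stored variants, compaction amounts to folding committed entries into a snapshot and keeping only the tail of the log. The Python sketch below assumes a state machine like the one in the replication slides, where each entry simply overwrites a single value; `compact` and the snapshot dictionary keys are hypothetical names.

```python
def compact(entries, commit_index, base=None):
    """Fold committed entries into a snapshot.

    entries is a list of (index, value); base is an earlier snapshot
    or None. Returns (snapshot, remaining_entries).
    """
    value = base["value"] if base else ""
    last_index = base["last_index"] if base else 0
    remaining = []
    for index, entry_value in entries:
        if index <= commit_index:
            # Committed: apply to the state and drop from the log.
            value, last_index = entry_value, index
        else:
            remaining.append((index, entry_value))
    return {"last_index": last_index, "value": value}, remaining


log = [(1, "sally"), (2, "bob"), (3, "tom")]
snapshot, tail = compact(log, commit_index=2)
```

After compaction the snapshot records the state as of index 2 and only the uncommitted entry at index 3 remains in the log, so both disk usage and recovery time are bounded by the tail length.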

Slide 108

Slide 108 text

Questions? Twitter: @benbjohnson GitHub: benbjohnson ben@skylandlabs.com

Slide 109

Slide 109 text

Image Attribution Database designed by Sergey Shmidt from The Noun Project Question designed by Greg Pabst from The Noun Project Lock from The Noun Project Floppy Disk designed by Mike Wirth from The Noun Project Movie designed by Anna Weiss from The Noun Project