Slide 1

Slide 1 text

Consensus Algorithms. Aman Garg, Fall 2019. "Is it better to be alive and wrong than be correct and dead?" - Jay Kreps

Slide 2

Slide 2 text

What is consensus? A consensus is all about general agreement: get everyone to agree about a resource or a decision, make decisions through a formal process, delay and analyse to reach harmony or else retry, and establish a shared opinion.

Slide 3

Slide 3 text

What is consensus? A consensus is all about general agreement: get everyone to agree about a resource or a decision, make decisions through a formal process, delay and analyse to reach harmony or else retry, and establish a shared opinion. The resulting consensus doesn't have to be unanimous: this person here is clearly unhappy but has consented to the majority decision.

Slide 4

Slide 4 text

But why is it important? Let us take a use case to understand.

Slide 5

Slide 5 text

Map Partition Owner Let's take a partitioned distributed map, say Hazelcast. Assume you are writing a client over the IMap and there's a PUT operation on a certain key K. Assume a simple MOD hash function exists to find the partition where the key resides. Since the map is distributed (over N nodes), we need to find the owner of the given partition. What ways exist for you, as a client, to figure out the owner?
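A minimal sketch of the MOD-style lookup described above. The partition count, the ownership table and the class itself are illustrative assumptions, not Hazelcast's real implementation.

```java
import java.util.Map;

// Sketch: locating the partition (and hence the owner) for a key using a
// simple MOD hash, as assumed on the slide.
public class PartitionLocator {
    private static final int PARTITION_COUNT = 271;      // assumed fixed partition count
    private final Map<Integer, String> partitionOwners;  // partitionId -> node address

    public PartitionLocator(Map<Integer, String> partitionOwners) {
        this.partitionOwners = partitionOwners;
    }

    // MOD hash: map the key's hash onto one of the partitions.
    public int partitionFor(Object key) {
        return Math.floorMod(key.hashCode(), PARTITION_COUNT);
    }

    // The consensus problem in a nutshell: every client and every node
    // must agree on the contents of this ownership table.
    public String ownerOf(Object key) {
        return partitionOwners.get(partitionFor(key));
    }
}
```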

Slide 6

Slide 6 text

Different ways of routing a request to the right node

Slide 7

Slide 7 text

Different ways of routing a request to the right node

Slide 8

Slide 8 text

Different ways of routing a request to the right node: the client sends the request to any node, which either forwards it to the correct node or returns the result itself.

Slide 9

Slide 9 text

Different ways of routing a request to the right node: the client sends the request to any node, which either forwards it to the correct node or returns the result itself.

Slide 10

Slide 10 text

Different ways of routing a request to the right node: the client sends the request to any node, which either forwards it to the correct node or returns the result itself. Or the client sends the request to a partition-aware routing tier, which knows the owner for "foo".

Slide 11

Slide 11 text

Different ways of routing a request to the right node: the client sends the request to any node, which either forwards it to the correct node or returns the result itself. Or the client sends the request to a partition-aware routing tier, which knows the owner for "foo".

Slide 12

Slide 12 text

Different ways of routing a request to the right node: the client sends the request to any node, which either forwards it to the correct node or returns the result itself. Or the client sends the request to a partition-aware routing tier, which knows the owner for "foo". Or the client itself knows the owner node for a particular partition.

Slide 13

Slide 13 text

Why is it a problem though?

Slide 14

Slide 14 text

Why is it a problem though? All participants have to agree on what the correct owner node is for a particular partition. Otherwise, there's no point in guaranteeing a consistent view of the map.

Slide 15

Slide 15 text

Why is it a problem though? All participants have to agree on what the correct owner node is for a particular partition. Otherwise, there's no point in guaranteeing a consistent view of the map. Regardless of whether this information lies with the routing tier, the nodes or the client, we need some sort of coordination here.

Slide 16

Slide 16 text

Why is it a problem though? All participants have to agree on what the correct owner node is for a particular partition. Otherwise, there's no point in guaranteeing a consistent view of the map. Regardless of whether this information lies with the routing tier, the nodes or the client, we need some sort of coordination here. Distributed consensus is a hard problem: easy to reason about, but unfathomably hard to implement, with a lot of edge cases to handle.

Slide 17

Slide 17 text

Why is it a problem though? All participants have to agree on what the correct owner node is for a particular partition. Otherwise, there's no point in guaranteeing a consistent view of the map. Regardless of whether this information lies with the routing tier, the nodes or the client, we need some sort of coordination here. Distributed consensus is a hard problem: easy to reason about, but unfathomably hard to implement, with a lot of edge cases to handle. Many such distributed systems rely on an external service that gives strong coordination and consensus guarantees on the cluster metadata.

Slide 18

Slide 18 text

Why is it a problem though? All participants have to agree on what the correct owner node is for a particular partition. Otherwise, there's no point in guaranteeing a consistent view of the map. Regardless of whether this information lies with the routing tier, the nodes or the client, we need some sort of coordination here. Distributed consensus is a hard problem: easy to reason about, but unfathomably hard to implement, with a lot of edge cases to handle. Many such distributed systems rely on an external service that gives strong coordination and consensus guarantees on the cluster metadata. Our favourite system enters to help us here. Wait for it...

Slide 19

Slide 19 text

Apache Zookeeper Zookeeper tracks information that is to be synchronised across the cluster. Each node registers itself in Zookeeper (ZK). ZK tracks the partitions against a particular set of nodes. Clients can subscribe to the above metadata: whenever nodes are added or removed, the client is notified.
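A hedged sketch of this pattern with the plain ZooKeeper Java client: a node registers itself as an ephemeral znode, and a client watches the parent path to be notified of membership changes. The connect string, timeouts and the /members path are assumptions for illustration.

```java
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class MembershipExample {
    public static void main(String[] args) throws Exception {
        // Connect to ZK; connect string and session timeout are assumptions.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, event -> { });

        // A node registers itself: an ephemeral znode disappears when the
        // node's session dies, so ZK tracks liveness for us.
        // (Assumes the /members parent znode already exists.)
        zk.create("/members/node-1", "host1:5701".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // A client subscribes to the membership metadata: the watch fires
        // once when children change (node added/removed) and must be re-set.
        Watcher membershipWatcher = new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                System.out.println("Cluster membership changed: " + event);
            }
        };
        List<String> members = zk.getChildren("/members", membershipWatcher);
        System.out.println("Current members: " + members);
    }
}
```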

Slide 20

Slide 20 text

Consensus is the key Solving consensus is the key to solving at least the following problems in computer science:

Slide 21

Slide 21 text

Consensus is the key Solving consensus is the key to solving at least the following problems in computer science: Total order broadcast: used in ZAB, the Zookeeper Atomic Broadcast.

Slide 22

Slide 22 text

Consensus is the key Solving consensus is the key to solving at least the following problems in computer science: Total order broadcast: used in ZAB, the Zookeeper Atomic Broadcast. Atomic commit (databases): fulfilling the A and C in the ACID properties.

Slide 23

Slide 23 text

Consensus is the key Solving consensus is the key to solving at least the following problems in computer science: Total order broadcast: used in ZAB, the Zookeeper Atomic Broadcast. Atomic commit (databases): fulfilling the A and C in the ACID properties. Terminating reliable broadcast: sending messages to a list of processes, say in multiplayer gaming.

Slide 24

Slide 24 text

Consensus is the key Solving consensus is the key to solving at least the following problems in computer science: Total order broadcast: used in ZAB, the Zookeeper Atomic Broadcast. Atomic commit (databases): fulfilling the A and C in the ACID properties. Terminating reliable broadcast: sending messages to a list of processes, say in multiplayer gaming. Dynamic group membership: who is the master? Which workers are available? What task is assigned to which worker?

Slide 25

Slide 25 text

Consensus is the key Solving consensus is the key to solving at least the following problems in computer science: Total order broadcast: used in ZAB, the Zookeeper Atomic Broadcast. Atomic commit (databases): fulfilling the A and C in the ACID properties. Terminating reliable broadcast: sending messages to a list of processes, say in multiplayer gaming. Dynamic group membership: who is the master? Which workers are available? What task is assigned to which worker? Stronger shared storage models: like how a concurrent hash map helps concurrent threads reach an agreement.

Slide 26

Slide 26 text

So what is a Consensus Algorithm? Consensus algorithms allow a collection of machines to work as a coherent group that can survive failures of some of its members.

Slide 27

Slide 27 text

Properties of a consensus algorithm

Slide 28

Slide 28 text

Properties of a consensus algorithm Safety: never returns an incorrect result despite network partitions and delays, packet loss, duplication and reordering.

Slide 29

Slide 29 text

Properties of a consensus algorithm Safety: never returns an incorrect result despite network partitions and delays, packet loss, duplication and reordering. Fault tolerance: the system stays available and fully functional despite the failure of some nodes.

Slide 30

Slide 30 text

Properties of a consensus algorithm Safety: never returns an incorrect result despite network partitions and delays, packet loss, duplication and reordering. Fault tolerance: the system stays available and fully functional despite the failure of some nodes. Correctness: performance is not impacted by a minority of slow nodes, and correctness does not depend on timing (consistent clocks).

Slide 31

Slide 31 text

Properties of a consensus algorithm Safety: never returns an incorrect result despite network partitions and delays, packet loss, duplication and reordering. Fault tolerance: the system stays available and fully functional despite the failure of some nodes. Correctness: performance is not impacted by a minority of slow nodes, and correctness does not depend on timing (consistent clocks). Real world: the core algorithm should be understandable and intuitive, its internal workings should seem obvious, and implementing it shouldn't require a major overhaul of the existing architecture.

Slide 32

Slide 32 text

Replicated State Machine A collection of servers computing identical copies of the same state. They keep operating even if a minority of the servers are down.

Slide 33

Slide 33 text

Replicated State Machine A collection of servers computing identical copies of the same state. They keep operating even if a minority of the servers are down.

Slide 34

Slide 34 text

Replicated State Machine A collection of servers computing identical copies of the same state. They keep operating even if a minority of the servers are down.

Slide 35

Slide 35 text

Replicated State Machine A collection of servers computing identical copies of the same state. They keep operating even if a minority of the servers are down.

Slide 36

Slide 36 text

Replicated State Machine A collection of servers computing identical copies of the same state. They keep operating even if a minority of the servers are down.

Slide 37

Slide 37 text

Replicated State Machine A collection of servers computing identical copies of the same state. They keep operating even if a minority of the servers are down.

Slide 38

Slide 38 text

Replicated State Machine A collection of servers computing identical copies of the same state. They keep operating even if a minority of the servers are down.

Slide 39

Slide 39 text

Replicated State Machine A collection of servers computing identical copies of the same state. They keep operating even if a minority of the servers are down. Replicated log <=> replicated state machine: all servers execute the same commands in the same order, and the consensus module ensures proper replication.
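A minimal sketch of the "same commands, same order, same state" idea: a deterministic key-value state machine that applies commands from a replicated log. The command format ("SET key value") and the class are assumptions for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a deterministic state machine: if every server applies the same
// commands in the same order, every copy of this map ends up identical.
public class KeyValueStateMachine {
    private final Map<String, String> state = new HashMap<>();

    // A command is assumed to look like "SET key value" for illustration.
    public void apply(String command) {
        String[] parts = command.split(" ", 3);
        if (parts.length == 3 && parts[0].equals("SET")) {
            state.put(parts[1], parts[2]);
        }
    }

    // Replaying the replicated log in order rebuilds the exact same state.
    public void applyAll(List<String> replicatedLog) {
        replicatedLog.forEach(this::apply);
    }

    public String get(String key) {
        return state.get(key);
    }
}
```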

Slide 40

Slide 40 text

Introduction to Raft

Slide 41

Slide 41 text

Introduction to Raft A consensus algorithm for building fault-tolerant distributed systems using a replicated state machine approach.

Slide 42

Slide 42 text

Introduction to Raft A consensus algorithm for building fault-tolerant distributed systems using a replicated state machine approach. In Raft, each server runs a deterministic state machine that has a given state, takes commands as input, generates outputs, and moves to a new state after generating the output.

Slide 43

Slide 43 text

Introduction to Raft A consensus algorithm for building fault-tolerant distributed systems using a replicated state machine approach. In Raft, each server runs a deterministic state machine that has a given state, takes commands as input, generates outputs, and moves to a new state after generating the output. Built around a centralised topology: a leader-follower style architecture.

Slide 44

Slide 44 text

Introduction to Raft A consensus algorithm for building fault-tolerant distributed systems using a replicated state machine approach. In Raft, each server runs a deterministic state machine that has a given state, takes commands as input, generates outputs, and moves to a new state after generating the output. Built around a centralised topology: a leader-follower style architecture.

Slide 45

Slide 45 text

Raft Consensus Algorithm

Slide 46

Slide 46 text

Raft Consensus Algorithm At the core is a persistent log containing commands issued by clients. This log is local to each server.

Slide 47

Slide 47 text

Raft Consensus Algorithm At the core is a persistent log containing commands issued by clients. This log is local to each server. Each server runs an instance of the Raft consensus module.

Slide 48

Slide 48 text

Raft Consensus Algorithm At the core is a persistent log containing commands issued by clients. This log is local to each server. Each server runs an instance of the Raft consensus module. The consensus module receives commands, appends them to the local log, and communicates them to the modules on the other servers so that all logs store the commands in the same order.

Slide 49

Slide 49 text

Raft Consensus Algorithm At the core is a persistent log containing commands issued by clients. This log is local to each server. Each server runs an instance of the Raft consensus module. The consensus module receives commands, appends them to the local log, and communicates them to the modules on the other servers so that all logs store the commands in the same order. Each server then feeds the commands from its local log into its state machine, generates the same output, and the result is returned to the client.

Slide 50

Slide 50 text

How do we engineers solve problems in general? Problem decomposition: break the original problem into sub-problems, then try to solve and conquer them individually. Minimise the state space: handle multiple problems with a single mechanism, eliminate special cases, maximise coherence, minimise non-determinism.

Slide 51

Slide 51 text

Raft Decomposition

Slide 52

Slide 52 text

Raft Decomposition Leader election: majority voting to select one leader per term; heartbeats and timeouts to detect crashes and elect a new leader; randomized election timeouts to avoid split votes.

Slide 53

Slide 53 text

Raft Decomposition Leader election: majority voting to select one leader per term; heartbeats and timeouts to detect crashes and elect a new leader; randomized election timeouts to avoid split votes. Log replication: the leader accepts commands from clients and appends them to its log; the leader replicates its log to the other servers, forcing them to agree; log inconsistencies are overwritten using consistency checks.

Slide 54

Slide 54 text

Raft Decomposition Leader election: majority voting to select one leader per term; heartbeats and timeouts to detect crashes and elect a new leader; randomized election timeouts to avoid split votes. Log replication: the leader accepts commands from clients and appends them to its log; the leader replicates its log to the other servers, forcing them to agree; log inconsistencies are overwritten using consistency checks. Safety: only servers with up-to-date logs can become leader; a new leader causes conflicting uncommitted entries to be discarded; a leader is always correct.

Slide 55

Slide 55 text

Raft Decomposition Leader election: majority voting to select one leader per term; heartbeats and timeouts to detect crashes and elect a new leader; randomized election timeouts to avoid split votes. Log replication: the leader accepts commands from clients and appends them to its log; the leader replicates its log to the other servers, forcing them to agree; log inconsistencies are overwritten using consistency checks. Safety: only servers with up-to-date logs can become leader; a new leader causes conflicting uncommitted entries to be discarded; a leader is always correct.

Slide 56

Slide 56 text

Leader Election: Normal Operation Election process for the first term in a 5-node cluster. The election timeout is assumed to be >> the broadcast time.

Slide 57

Slide 57 text

Leader Crash Scenario As seen previously, S2 is the leader for term 2. Let us crash the current leader S2 and see what happens.

Slide 58

Slide 58 text

Raft Leader Election: Basics

Slide 59

Slide 59 text

Leader Election: Split Vote If many followers become candidates at the same time, the votes could be split such that no candidate gets a majority. Since there isn't a majority, the only possible course of action is to trigger a re-election; this happens automatically. For term 4, two candidates emerge: node C and node A. Both request votes, but neither can win, as a minimum of 3 votes is required here.

Slide 60

Slide 60 text

Split Vote Scenario

Slide 61

Slide 61 text

Split Vote Scenario Let's see what happens when S1 and S5 both emerge as candidates for the same term.

Slide 62

Slide 62 text

Split Vote Scenario Let's see what happens when S1 and S5 both emerge as candidates for the same term.

Slide 63

Slide 63 text

Split Vote Scenario Let's see what happens when S1 and S5 both emerge as candidates for the same term. Randomized election timeouts ensure that split votes are rare.

Slide 64

Slide 64 text

Split Vote Scenario Let's see what happens when S1 and S5 both emerge as candidates for the same term. Randomized election timeouts ensure that split votes are rare.

Slide 65

Slide 65 text

Leader Election: RequestVote RPC Only one RPC is required to solve leader election in Raft. It is invoked by candidates only. The term is used for safety checks, discussed ahead. The focus is on intuition and simplicity.
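For reference, the RequestVote arguments and result as defined in the Raft paper, written as a plain Java sketch (the field names follow the paper; the wrapper classes themselves are illustrative).

```java
// RequestVote RPC payloads, following Figure 2 of the Raft paper.
public class RequestVote {
    public static class Request {
        long term;           // candidate's term
        String candidateId;  // candidate requesting the vote
        long lastLogIndex;   // index of the candidate's last log entry
        long lastLogTerm;    // term of the candidate's last log entry
    }

    public static class Response {
        long term;           // voter's currentTerm, so the candidate can update itself
        boolean voteGranted; // true means this server granted its vote to the candidate
    }
}
```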

Slide 66

Slide 66 text

Leader Election: Safety For any given term, there can be at most one leader. The latest term number is exchanged between the servers: it is encapsulated in the requests sent by the leader to the follower nodes, and a node rejects any request that carries an old term number. A server will vote for at most one candidate for a particular term, on a first-come-first-served basis (assuming it hasn't already voted for itself), and this vote is persisted to disk. Once a candidate wins, it establishes its authority over the other nodes by broadcasting a message to all of them, letting everyone know who the new leader is.
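A sketch of how a server might apply these voting rules when handling a RequestVote: reject stale terms, grant at most one vote per term on a first-come-first-served basis, and persist the decision. Field and method names here are assumptions; the up-to-date log check is covered later.

```java
// Sketch of the voting rules above.
public class VoteHandler {
    private long currentTerm;
    private String votedFor; // null if we haven't voted in currentTerm

    public synchronized boolean handleRequestVote(long candidateTerm, String candidateId) {
        // Rule: reject any request carrying an old term number.
        if (candidateTerm < currentTerm) {
            return false;
        }
        // A higher term resets our vote for the new term.
        if (candidateTerm > currentTerm) {
            currentTerm = candidateTerm;
            votedFor = null;
        }
        // Rule: at most one vote per term, first come first served.
        if (votedFor == null || votedFor.equals(candidateId)) {
            votedFor = candidateId;
            persistTermAndVote(); // must hit disk before replying
            return true;
        }
        return false;
    }

    private void persistTermAndVote() {
        // Placeholder: real implementations fsync currentTerm and votedFor.
    }
}
```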

Slide 67

Slide 67 text

Logs and Log Replication But first, what really is the log we're talking about? Each entry contains the client command, an index to identify the position of the entry in the log, and a term number to logically identify when the entry was written. Entries must survive crashes, so they are persisted locally. They are replicated by the leader onto the other servers and executed in the state machine once committed.
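The log entry described above, as a small Java sketch (the fields mirror the slide; the class itself is illustrative).

```java
// A Raft log entry: the client command, its position in the log (index),
// and the term in which it was written.
public class LogEntry {
    private final long index;     // position of the entry in the log
    private final long term;      // term when the entry was created by the leader
    private final String command; // client command, applied once committed

    public LogEntry(long index, long term, String command) {
        this.index = index;
        this.term = term;
        this.command = command;
    }

    public long getIndex() { return index; }
    public long getTerm() { return term; }
    public String getCommand() { return command; }
}
```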

Slide 68

Slide 68 text

Log Replication: Happy Flow

Slide 69

Slide 69 text

Log Replication: Happy Flow Assume nodes S1 to S5, with S1 being the leader for term 2. The only other node active in the cluster is S2.

Slide 70

Slide 70 text

Log Replication: Happy Flow Assume nodes S1 to S5, with S1 being the leader for term 2. The only other node active in the cluster is S2. Now let us bring S3 back to life.

Slide 71

Slide 71 text

Log Replication: Repairing Inconsistencies

Slide 72

Slide 72 text

Log Replication: Repairing Inconsistencies Assume S1, the current leader, dies with some uncommitted entries. S2 gets elected and tries to restore consistency.

Slide 73

Slide 73 text

Log Replication: Repairing Inconsistencies Assume S1, the current leader, dies with some uncommitted entries. S2 gets elected and tries to restore consistency. S1 comes back from its hiatus and finds a new leader and a new entry.

Slide 74

Slide 74 text

Log Matching Property Goal: a high level of consistency between the logs on the servers. If log entries on two servers have the same index and term: they store the same command, and the logs are identical in all preceding entries. If a given entry is committed, all the preceding entries are also committed.

Slide 75

Slide 75 text

Log Matching Property Goal: a high level of consistency between the logs on the servers. If log entries on two servers have the same index and term: they store the same command, and the logs are identical in all preceding entries. If a given entry is committed, all the preceding entries are also committed.

Slide 76

Slide 76 text

Log Matching Property Goal: a high level of consistency between the logs on the servers. If log entries on two servers have the same index and term: they store the same command, and the logs are identical in all preceding entries. If a given entry is committed, all the preceding entries are also committed. In the above picture, the entries up to index 6 are considered committed.

Slide 77

Slide 77 text

Log Replication RPC: AppendEntries One RPC replicates log entries across the cluster. The same RPC is used to trigger heartbeats from the leader.
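The AppendEntries arguments and result from the Raft paper, as a Java sketch that reuses the LogEntry sketch shown earlier; an empty entries list turns the same call into a heartbeat.

```java
import java.util.List;

// AppendEntries RPC payloads, following Figure 2 of the Raft paper.
public class AppendEntries {
    public static class Request {
        long term;               // leader's term
        String leaderId;         // so followers can redirect clients
        long prevLogIndex;       // index of the entry immediately preceding the new ones
        long prevLogTerm;        // term of that preceding entry
        List<LogEntry> entries;  // new entries to store (empty for a heartbeat)
        long leaderCommit;       // leader's commit index
    }

    public static class Response {
        long term;       // follower's currentTerm, so the leader can update itself
        boolean success; // true if the follower matched prevLogIndex/prevLogTerm
    }
}
```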

Slide 78

Slide 78 text

AppendEntries RPC: Correctness Check AppendEntries RPCs include the index and term of the preceding entry. The follower must contain a matching entry, otherwise it rejects the request. The leader then has to retry with a lower index until the logs match.
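A sketch of that consistency check on the follower side, under the assumption that the log is held as an in-memory list of the LogEntry sketch from earlier; names and indexing conventions are illustrative.

```java
import java.util.List;

// Sketch of the follower-side AppendEntries consistency check: the request is
// rejected unless the follower has an entry at prevLogIndex with prevLogTerm.
// On rejection the leader retries with a lower prevLogIndex until a match.
public class ConsistencyCheck {

    // The log is 1-indexed conceptually: log.get(i - 1) is entry i.
    public static boolean matches(List<LogEntry> log, long prevLogIndex, long prevLogTerm) {
        if (prevLogIndex == 0) {
            return true; // nothing precedes the very first entry
        }
        if (prevLogIndex > log.size()) {
            return false; // follower's log is too short
        }
        return log.get((int) prevLogIndex - 1).getTerm() == prevLogTerm;
    }
}
```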

Slide 79

Slide 79 text

Raft log Trivia
[Figure: the logs of servers S1-S5, showing the term number (1, 2 or 3) of the entry at each log index 1-9]

Slide 80

Slide 80 text

Raft log Trivia
[Figure: the logs of servers S1-S5, showing the term number (1, 2 or 3) of the entry at each log index 1-9]
Who could be the leader for Term 1, Term 2 and Term 3?

Slide 81

Slide 81 text

Raft log Trivia
[Figure: the logs of servers S1-S5, showing the term number (1, 2 or 3) of the entry at each log index 1-9]
Who could be the leader for Term 1, Term 2 and Term 3? [S1 to S5] [S5] [S3]

Slide 82

Slide 82 text

Raft log Trivia
[Figure: the logs of servers S1-S5, showing the term number (1, 2 or 3) of the entry at each log index 1-9]
Who could be the leader for Term 1, Term 2 and Term 3? [S1 to S5] [S5] [S3]
Till what index is the log consistent across the nodes S1-S5?

Slide 83

Slide 83 text

Raft log Trivia
[Figure: the logs of servers S1-S5, showing the term number (1, 2 or 3) of the entry at each log index 1-9]
Who could be the leader for Term 1, Term 2 and Term 3? [S1 to S5] [S5] [S3]
Till what index is the log consistent across the nodes S1-S5? Index 8

Slide 84

Slide 84 text

Raft log Trivia
[Figure: the logs of servers S1-S5, showing the term number (1, 2 or 3) of the entry at each log index 1-9]
Who could be the leader for Term 1, Term 2 and Term 3? [S1 to S5] [S5] [S3]
Till what index is the log consistent across the nodes S1-S5? Index 8
If S1 becomes candidate for Term 4, who can vote for it?

Slide 85

Slide 85 text

Raft log Trivia
[Figure: the logs of servers S1-S5, showing the term number (1, 2 or 3) of the entry at each log index 1-9]
Who could be the leader for Term 1, Term 2 and Term 3? [S1 to S5] [S5] [S3]
Till what index is the log consistent across the nodes S1-S5? Index 8
If S1 becomes candidate for Term 4, who can vote for it? [S2] [S4] [S5]

Slide 86

Slide 86 text

Raft log Trivia
[Figure: the logs of servers S1-S5, showing the term number (1, 2 or 3) of the entry at each log index 1-9]
Who could be the leader for Term 1, Term 2 and Term 3? [S1 to S5] [S5] [S3]
Till what index is the log consistent across the nodes S1-S5? Index 8
If S1 becomes candidate for Term 4, who can vote for it? [S2] [S4] [S5]
If S2 becomes candidate for Term 4, who can vote for it?

Slide 87

Slide 87 text

Raft log Trivia
[Figure: the logs of servers S1-S5, showing the term number (1, 2 or 3) of the entry at each log index 1-9]
Who could be the leader for Term 1, Term 2 and Term 3? [S1 to S5] [S5] [S3]
Till what index is the log consistent across the nodes S1-S5? Index 8
If S1 becomes candidate for Term 4, who can vote for it? [S2] [S4] [S5]
If S2 becomes candidate for Term 4, who can vote for it? [S5]

Slide 88

Slide 88 text

Raft log Trivia
[Figure: the logs of servers S1-S5, showing the term number (1, 2 or 3) of the entry at each log index 1-9]
Who could be the leader for Term 1, Term 2 and Term 3? [S1 to S5] [S5] [S3]
Till what index is the log consistent across the nodes S1-S5? Index 8
If S1 becomes candidate for Term 4, who can vote for it? [S2] [S4] [S5]
If S2 becomes candidate for Term 4, who can vote for it? [S5]
If S3 becomes candidate for Term 4, who can vote for it?

Slide 89

Slide 89 text

Raft log Trivia
[Figure: the logs of servers S1-S5, showing the term number (1, 2 or 3) of the entry at each log index 1-9]
Who could be the leader for Term 1, Term 2 and Term 3? [S1 to S5] [S5] [S3]
Till what index is the log consistent across the nodes S1-S5? Index 8
If S1 becomes candidate for Term 4, who can vote for it? [S2] [S4] [S5]
If S2 becomes candidate for Term 4, who can vote for it? [S5]
If S3 becomes candidate for Term 4, who can vote for it? [S1] [S2] [S4] [S5]

Slide 90

Slide 90 text

Raft log Trivia
[Figure: the logs of servers S1-S5, showing the term number (1, 2 or 3) of the entry at each log index 1-9]
Who could be the leader for Term 1, Term 2 and Term 3? [S1 to S5] [S5] [S3]
Till what index is the log consistent across the nodes S1-S5? Index 8
If S1 becomes candidate for Term 4, who can vote for it? [S2] [S4] [S5]
If S2 becomes candidate for Term 4, who can vote for it? [S5]
If S3 becomes candidate for Term 4, who can vote for it? [S1] [S2] [S4] [S5]
If S4 becomes candidate for Term 4, who can vote for it?

Slide 91

Slide 91 text

Raft log Trivia
[Figure: the logs of servers S1-S5, showing the term number (1, 2 or 3) of the entry at each log index 1-9]
Who could be the leader for Term 1, Term 2 and Term 3? [S1 to S5] [S5] [S3]
Till what index is the log consistent across the nodes S1-S5? Index 8
If S1 becomes candidate for Term 4, who can vote for it? [S2] [S4] [S5]
If S2 becomes candidate for Term 4, who can vote for it? [S5]
If S3 becomes candidate for Term 4, who can vote for it? [S1] [S2] [S4] [S5]
If S4 becomes candidate for Term 4, who can vote for it? [S1] [S2] [S5]

Slide 92

Slide 92 text

Raft log Trivia
[Figure: the logs of servers S1-S5, showing the term number (1, 2 or 3) of the entry at each log index 1-9]
Who could be the leader for Term 1, Term 2 and Term 3? [S1 to S5] [S5] [S3]
Till what index is the log consistent across the nodes S1-S5? Index 8
If S1 becomes candidate for Term 4, who can vote for it? [S2] [S4] [S5]
If S2 becomes candidate for Term 4, who can vote for it? [S5]
If S3 becomes candidate for Term 4, who can vote for it? [S1] [S2] [S4] [S5]
If S4 becomes candidate for Term 4, who can vote for it? [S1] [S2] [S5]
If S5 becomes candidate for Term 4, who can vote for it?

Slide 93

Slide 93 text

Raft log Trivia
[Figure: the logs of servers S1-S5, showing the term number (1, 2 or 3) of the entry at each log index 1-9]
Who could be the leader for Term 1, Term 2 and Term 3? [S1 to S5] [S5] [S3]
Till what index is the log consistent across the nodes S1-S5? Index 8
If S1 becomes candidate for Term 4, who can vote for it? [S2] [S4] [S5]
If S2 becomes candidate for Term 4, who can vote for it? [S5]
If S3 becomes candidate for Term 4, who can vote for it? [S1] [S2] [S4] [S5]
If S4 becomes candidate for Term 4, who can vote for it? [S1] [S2] [S5]
If S5 becomes candidate for Term 4, who can vote for it? Φ (none)

Slide 94

Slide 94 text

Raft timing and availability Timing is critical in Raft to elect and maintain a steady leader over time and keep the cluster available. Stability is ensured by respecting the timing requirements of the algorithm.

Slide 95

Slide 95 text

Raft timing and availability Timing is critical in Raft to elect and maintain a steady leader over time and keep the cluster available. Stability is ensured by respecting the timing requirements of the algorithm: broadcastTime << electionTimeout << MTBF

Slide 96

Slide 96 text

Raft timing and availability Timing is critical in Raft to elect and maintain a steady leader over time and keep the cluster available. Stability is ensured by respecting the timing requirements of the algorithm: broadcastTime << electionTimeout << MTBF. broadcastTime is the average time it takes a server to send a request to every server in the cluster and receive responses. It depends on the infrastructure; typically 0.5-20 ms.

Slide 97

Slide 97 text

Raft timing and availability Timing is critical in Raft to elect and maintain a steady leader over time and keep the cluster available. Stability is ensured by respecting the timing requirements of the algorithm: broadcastTime << electionTimeout << MTBF. broadcastTime is the average time it takes a server to send a request to every server in the cluster and receive responses. It depends on the infrastructure; typically 0.5-20 ms. electionTimeout is the configurable time after which an election is triggered; typically 150-300 ms. A good value is around 10x the mean network latency.

Slide 98

Slide 98 text

Raft timing and availability Timing is critical in Raft to elect and maintain a steady leader over time and keep the cluster available. Stability is ensured by respecting the timing requirements of the algorithm: broadcastTime << electionTimeout << MTBF. broadcastTime is the average time it takes a server to send a request to every server in the cluster and receive responses. It depends on the infrastructure; typically 0.5-20 ms. electionTimeout is the configurable time after which an election is triggered; typically 150-300 ms. A good value is around 10x the mean network latency. MTBF (Mean Time Between Failures) is the average time between failures for a single server.
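A sketch of how the randomized election timeout is typically drawn and reset; the 150-300 ms range comes from the slide, everything else (class and method names) is illustrative.

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch: each follower picks a fresh random election timeout and resets it on
// every heartbeat, so broadcastTime << electionTimeout keeps elections rare.
public class ElectionTimer {
    private static final long MIN_TIMEOUT_MS = 150;
    private static final long MAX_TIMEOUT_MS = 300;

    private long deadline; // wall-clock time at which we start an election

    public void reset() {
        long timeout = ThreadLocalRandom.current()
                .nextLong(MIN_TIMEOUT_MS, MAX_TIMEOUT_MS + 1);
        deadline = System.currentTimeMillis() + timeout;
    }

    // Called when a valid AppendEntries (heartbeat) arrives from the leader.
    public void onHeartbeat() {
        reset();
    }

    // Polled periodically; true means "become candidate and start an election".
    public boolean expired() {
        return System.currentTimeMillis() >= deadline;
    }
}
```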

Slide 99

Slide 99 text

Raft System Safety Constraints
Election safety: only one leader will be elected per election term.
Leader append-only: a leader never overwrites or deletes entries in its log; it only appends.
Log matching: if two logs contain an entry with the same index and term, the logs are identical in all entries up to that index.
Leader completeness: if a log entry is committed in a term, then that entry will be present in the logs of the leaders for all higher-numbered terms.
State machine safety: if a server has applied a log entry at a particular index to its state machine, no other server will apply a different log entry at that index.

Slide 100

Slide 100 text

Where is Raft being used? CockroachDB: a scalable, survivable, strongly-consistent SQL database. dgraph: a scalable, distributed, low-latency, high-throughput graph database. etcd: a distributed, reliable key-value store for critical data, in Go. tikv: a distributed transactional key-value database powered by Rust and Raft. swarmkit: a toolkit for orchestrating distributed systems at any scale. chain core: software for operating permissioned, multi-asset blockchain networks.

Slide 101

Slide 101 text

Consensus in the wild [Legend for the comparison table: SR: Server Replication, LR: Log Replication, SS: Sync Service, BO: Barrier Orchestration, SD: Service Discovery, LE: Leader Election, MM: Metadata Management, MQ: Message Queues]

Slide 102

Slide 102 text

Summary

Slide 103

Slide 103 text

Summary Raft is divided into 3 parts: leader election, log replication and safety.

Slide 104

Slide 104 text

Summary Raft is divided into 3 parts: leader election, log replication and safety. A node can be in one of three states: Follower, Candidate or Leader.

Slide 105

Slide 105 text

Summary Raft is divided into 3 parts: leader election, log replication and safety. A node can be in one of three states: Follower, Candidate or Leader. Every node starts as a Follower and transitions to the Candidate state after an election timeout.

Slide 106

Slide 106 text

Summary Raft is divided into 3 parts: leader election, log replication and safety. A node can be in one of three states: Follower, Candidate or Leader. Every node starts as a Follower and transitions to the Candidate state after an election timeout. A Candidate will vote for itself and send RequestVote RPCs to all the other nodes.

Slide 107

Slide 107 text

Summary Raft is divided into 3 parts: leader election, log replication and safety. A node can be in one of three states: Follower, Candidate or Leader. Every node starts as a Follower and transitions to the Candidate state after an election timeout. A Candidate will vote for itself and send RequestVote RPCs to all the other nodes. If it gets votes from the majority of the nodes, it becomes the new Leader.

Slide 108

Slide 108 text

Summary Raft is divided into 3 parts: leader election, log replication and safety. A node can be in one of three states: Follower, Candidate or Leader. Every node starts as a Follower and transitions to the Candidate state after an election timeout. A Candidate will vote for itself and send RequestVote RPCs to all the other nodes. If it gets votes from the majority of the nodes, it becomes the new Leader. The leader is the only node responsible for managing the log; followers just add new entries to their logs in response to the leader's AppendEntries RPCs.
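The three states and the transitions listed in the summary, as a compact sketch; the enum and method names are illustrative assumptions.

```java
// The three Raft server states from the summary and their usual transitions.
public enum ServerState {
    FOLLOWER, CANDIDATE, LEADER;

    // Follower -> Candidate when the election timeout fires
    // (a Candidate that times out simply starts a new election).
    public ServerState onElectionTimeout() {
        return this == LEADER ? LEADER : CANDIDATE;
    }

    // Candidate -> Leader on winning a majority of votes.
    public ServerState onMajorityVotes() {
        return this == CANDIDATE ? LEADER : this;
    }

    // Any state -> Follower on seeing a higher term or another valid leader.
    public ServerState onHigherTerm() {
        return FOLLOWER;
    }
}
```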

Slide 109

Slide 109 text

Summary ...

Slide 110

Slide 110 text

Summary ... When the leader receives a command from the client, it first saves this uncommitted entry in its log, then sends it to every follower.

Slide 111

Slide 111 text

Summary ... When the leader receives a command from the client, it first saves this uncommitted entry in its log, then sends it to every follower. When it gets a successful response from the majority of nodes, the command is committed and the client gets a confirmation.

Slide 112

Slide 112 text

Summary ... When the leader receives a command from the client, it first saves this uncommitted entry in its log, then sends it to every follower. When it gets a successful response from the majority of nodes, the command is committed and the client gets a confirmation. In the next AppendEntries RPC sent to the follower (which can carry a new entry or just be a heartbeat), the follower also commits the entry.

Slide 113

Slide 113 text

Summary ... When the leader receives a command from the client, it first saves this uncommitted entry in its log, then sends it to every follower. When it gets a successful response from the majority of nodes, the command is committed and the client gets a confirmation. In the next AppendEntries RPC sent to the follower (which can carry a new entry or just be a heartbeat), the follower also commits the entry. The AppendEntries RPC implements a consistency check to guarantee that the follower's local log is consistent with the leader's log.

Slide 114

Slide 114 text

Summary ... When the leader receives a command from the client, it first saves this uncommitted entry in its log, then sends it to every follower. When it gets a successful response from the majority of nodes, the command is committed and the client gets a confirmation. In the next AppendEntries RPC sent to the follower (which can carry a new entry or just be a heartbeat), the follower also commits the entry. The AppendEntries RPC implements a consistency check to guarantee that the follower's local log is consistent with the leader's log. A follower will grant its vote to a candidate whose log is at least as up-to-date as its own.
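A sketch of that "at least as up-to-date" comparison: compare the last log terms first, and break ties on log length. The class and method names are assumptions.

```java
// Sketch of the up-to-date rule used when granting votes: the candidate's log
// qualifies if its last term is higher, or the terms are equal and its log is
// at least as long as the voter's.
public class LogComparison {
    public static boolean candidateIsUpToDate(long candidateLastTerm, long candidateLastIndex,
                                              long voterLastTerm, long voterLastIndex) {
        if (candidateLastTerm != voterLastTerm) {
            return candidateLastTerm > voterLastTerm;
        }
        return candidateLastIndex >= voterLastIndex;
    }
}
```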

Slide 115

Slide 115 text

Where to go next? The Raft paper should be the first place to start. Read about cluster membership in Raft. You can now read about the ZAB protocol in Zookeeper. Visualize Raft in even simpler terms. Look at sample implementations of Raft, particularly in Java. ZK logs should make even more sense now. Dive into Paxos and compare the models for yourself.