Raft Consensus Algorithm

Consensus Algorithms Consensus Algorithms Aman Garg Fall 2019 “ Is
it better to be alive and wrong than be correct and dead? - Jay Kreps

What is consensus? What is consensus? A general agreement is
what a consensus is all about. Get everyone to agree about a resource / thing Make decisions through a formal process Delay and analyse to gain harmony or else retry Establish a shared opinion

What is consensus? What is consensus? A general agreement is
what a consensus is all about. Get everyone to agree about a resource / thing Make decisions through a formal process Delay and analyse to gain harmony or else retry Establish a shared opinion The resulting consensus doesn't have to be unanimous. This person here is clearly unhappy but has consented to the majority

But why is it But why is it important? important?
Let us take a use case and understand

Map Partition Owner Map Partition Owner Let's take a partitioned
distributed map, say Hazelcast Assume you are writing a client over the IMap and there's a PUT operation on a certain key K. Assume a simple MOD hash function exists to find the partition where the key resides. Since the map is distributed (over N nodes), we need to find the owner for the given partition. What all various ways exist for you as a client to figure out the owner?

Diﬀerent ways of routing a request to the right node

Client forwards to any node which either forwards to the correct node or returns

Client forwards to any node which either forwards to the correct node or returns Client sends to a partition aware routing tier which knows the owner for "foo"

Client forwards to any node which either forwards to the correct node or returns Client sends to a partition aware routing tier which knows the owner for "foo" Client itself knows the owner node for a particular partition

Why is it a problem though? Why is it a
problem though?

problem though? All participants have to agree on what the correct owner node is for a particular partition. There's no point in guaranteeing a consistent view of a map, otherwise.

problem though? All participants have to agree on what the correct owner node is for a particular partition. There's no point in guaranteeing a consistent view of a map, otherwise. Regardless of whether this information lies with the routing tier, nodes or the client, we need some sort of a coordination here.

problem though? All participants have to agree on what the correct owner node is for a particular partition. There's no point in guaranteeing a consistent view of a map, otherwise. Regardless of whether this information lies with the routing tier, nodes or the client, we need some sort of a coordination here. Distributed consensus is a hard problem. Easy to reason about but unfathomably hard to implement. There are a lot of edge cases to handle

problem though? All participants have to agree on what the correct owner node is for a particular partition. There's no point in guaranteeing a consistent view of a map, otherwise. Regardless of whether this information lies with the routing tier, nodes or the client, we need some sort of a coordination here. Distributed consensus is a hard problem. Easy to reason about but unfathomably hard to implement. There are a lot of edge cases to handle Many such distributed systems rely on an external service provider that gives strong coordination and consensus guarantees on the cluster metadata

problem though? All participants have to agree on what the correct owner node is for a particular partition. There's no point in guaranteeing a consistent view of a map, otherwise. Regardless of whether this information lies with the routing tier, nodes or the client, we need some sort of a coordination here. Distributed consensus is a hard problem. Easy to reason about but unfathomably hard to implement. There are a lot of edge cases to handle Many such distributed systems rely on an external service provider that gives strong coordination and consensus guarantees on the cluster metadata Our favourite system enters to help us here. Wait for it....

Apache Zookeeper Apache Zookeeper Zookeeper tracks information that is to
be synchronised across the cluster Each node registers itself in the Zookeeper (ZK) ZK tracks the partitions against a particular set of nodes Clients can subscribe to the above metadata. Whenever new nodes are added or removed, client is notiﬁed

Consensus is the key Consensus is the key Solving consensus
is the key to solving at least the following problems in computer science:

is the key to solving at least the following problems in computer science: Total order broadcast Used in ZAB : Zookeeper Atomic Broadcast

is the key to solving at least the following problems in computer science: Total order broadcast Used in ZAB : Zookeeper Atomic Broadcast Atomic Commit (Databases) Fulﬁlling A and C in ACID properties

is the key to solving at least the following problems in computer science: Total order broadcast Used in ZAB : Zookeeper Atomic Broadcast Atomic Commit (Databases) Fulﬁlling A and C in ACID properties Terminating reliable broadcasts Sending messages to a list of processes, say in multiplayer gaming

is the key to solving at least the following problems in computer science: Total order broadcast Used in ZAB : Zookeeper Atomic Broadcast Atomic Commit (Databases) Fulﬁlling A and C in ACID properties Terminating reliable broadcasts Sending messages to a list of processes, say in multiplayer gaming Dynamic group membership Who is the master? Which workers are available? What task is assigned to the worker?

is the key to solving at least the following problems in computer science: Total order broadcast Used in ZAB : Zookeeper Atomic Broadcast Atomic Commit (Databases) Fulﬁlling A and C in ACID properties Terminating reliable broadcasts Sending messages to a list of processes, say in multiplayer gaming Dynamic group membership Who is the master? Which workers are available? What task is assigned to the worker? Stronger shared stored models Like how a concurrent hashmap helps concurrent threads to reach an agreement

So what is a Consensus So what is a Consensus
Algorithm? Algorithm? Consensus algorithms allow a collection of machines to work as a coherent group that can survive failures of some of its members

Properties of a consensus algorithm Properties of a consensus algorithm

Safety Never returns an incorrect result despite Network partitions and delays Packet loss, duplications and reorder

Safety Never returns an incorrect result despite Network partitions and delays Packet loss, duplications and reorder Fault Tolerant System is available and fully functional in case of failure of nodes.

Safety Never returns an incorrect result despite Network partitions and delays Packet loss, duplications and reorder Fault Tolerant System is available and fully functional in case of failure of nodes. Correctness Performance is not impacted by minority of slow nodes. Does not depend on consistency of time for correctness.

Safety Never returns an incorrect result despite Network partitions and delays Packet loss, duplications and reorder Fault Tolerant System is available and fully functional in case of failure of nodes. Correctness Performance is not impacted by minority of slow nodes. Does not depend on consistency of time for correctness. Real World Core algorithm should be understandable and intuitive. The internal workings should seem obvious. Implementation shouldn't require a major overhaul in existing arch.

Replicated State Machine Replicated State Machine A collection of servers
computing identical copies of the same state They operate even if a minority of servers are down

Replicated State Machine Replicated State Machine A collection of servers
computing identical copies of the same state They operate even if a minority of servers are down Replicated log <=> Replicated state machines All servers execute the same command in order. Consensus module ensures proper replication

Introduction to Raft Introduction to Raft

Introduction to Raft Introduction to Raft Consensus algorithm for building
fault tolerant distributed systems using a replicated state machine approach

fault tolerant distributed systems using a replicated state machine approach In Raft each server runs a deterministic state machine that Has a given state Takes some commands as an input Generates outputs Moves on to a new state after generating the output.

fault tolerant distributed systems using a replicated state machine approach In Raft each server runs a deterministic state machine that Has a given state Takes some commands as an input Generates outputs Moves on to a new state after generating the output. Built around a centralised topology Leader-follower like architecture

Raft Consensus Algorithm Raft Consensus Algorithm

Raft Consensus Algorithm Raft Consensus Algorithm At the core is
persistent log containing commands issue by clients. This log is local to the server

persistent log containing commands issue by clients. This log is local to the server Each server runs an instance of Raft Consensus Module

persistent log containing commands issue by clients. This log is local to the server Each server runs an instance of Raft Consensus Module Consensus module receives commands, adds it to the local log and communicates the same to the module on other servers so that the logs are stored in the same order

persistent log containing commands issue by clients. This log is local to the server Each server runs an instance of Raft Consensus Module Consensus module receives commands, adds it to the local log and communicates the same to the module on other servers so that the logs are stored in the same order Each server feeds commands from its local log into its state machine, generates the same output and result is returned

How do we engineers solve How do we engineers solve
problems in general? problems in general? Problem decomposition Break the original problem into sub-problems Try to solve and conquer them individually Minimise the state space Handle multiple problems with a single mechanism Eliminate special cases Maximise coherence Minimise non determinism

Raft Decomposition Raft Decomposition

Raft Decomposition Raft Decomposition Leader Election Majority voting to select
one leader per term Heartbeats and timeouts to detect crashes and elect new leader Randomized voting to avoid split votes.

one leader per term Heartbeats and timeouts to detect crashes and elect new leader Randomized voting to avoid split votes. Log Replication Leader accepts commands from clients and appends to its log Leaders replicates its log to other servers forcing them to agree Overwrite log consistencies using consistency checks

one leader per term Heartbeats and timeouts to detect crashes and elect new leader Randomized voting to avoid split votes. Log Replication Leader accepts commands from clients and appends to its log Leaders replicates its log to other servers forcing them to agree Overwrite log consistencies using consistency checks Safety Only servers with up to date logs can become leader New leaders will discard any uncommitted entries A leader is always correct

Leader Election: Normal Operation Leader Election: Normal Operation Election process
for the ﬁrst term in a 5 node cluster Election time out is assumed to be >> broadcast timeout

Leader Crash Scenario Leader Crash Scenario As seen previously, S2
is the leader for term 2 Let us crash the current leader S2 and see what happens

Raft Leader Election: Basics Raft Leader Election: Basics

Leader Election: Split Vote Leader Election: Split Vote If many
followers become candidates at the same time, votes would be split such that no candidate gets a majority Since there isn't a majority, the only possible course of action is to trigger a re-election. This is done automatically For the term 4, there emerges 2 candidates: Node C and Node A Both request votes from the majority. Both can't get it as a minimum of 3 votes are required here.

Split Vote Scenario Split Vote Scenario

Split Vote Scenario Split Vote Scenario Let's see what happens
when S1 and S5 both emerge as candidates for term

Split Vote Scenario Split Vote Scenario Let's see what happens
when S1 and S5 both emerge as candidates for term Randomized election timeouts is a boon to ensure that split votes are rare

Leader Election: RequestVote RPC Leader Election: RequestVote RPC Only one
RPC is required to solve leader election in Raft Invoked by candidates only Term is used for safety checks. Will be discussed ahead Focus on intuition and simplicity

Leader Election: Safety Leader Election: Safety For any given term,
there can be at most 1 leader The ongoing latest term number is exchanged between the servers. The term number is encapsulated as part of requests sent by the leader to follower nodes. A node rejects any request received with an old term number. A server will vote for at most one candidate for the particular term. This will be done on ﬁrst-come-ﬁrst-served basis. Once a candidate wins, it establishes its authority over other follower nodes by broadcasting a message to all other nodes. This will let everyone know who the new leader is. A server will give its vote to at most one candidate (if it hasn't voted for itself) of course. This information will be persisted to disk as well.

Logs and Log Replication Logs and Log Replication Contains the
client command speciﬁed Index to identify the position of entry in the log Term number to logically identify when the entry was written But ﬁrst, what is really the log we're talking about? Entries must survive crashes They are therefore persisted locally Replicated by leader onto others Executed in state machine if committed command term

Log Replication: Happy Flow Log Replication: Happy Flow

Log Replication: Happy Flow Log Replication: Happy Flow Assume nodes
from S1 to S5 and S1 being the leader for term 2 The only node active in cluster is S2.

Log Replication: Happy Flow Log Replication: Happy Flow Assume nodes
from S1 to S5 and S1 being the leader for term 2 The only node active in cluster is S2. Assume nodes from S1 to S5 and S1 being the leader for term 2 The only node active in cluster is S2. Now let us bring back S3 to life

Log Replication: Repairing Inconsistencies Log Replication: Repairing Inconsistencies

Log Replication: Repairing Inconsistencies Log Replication: Repairing Inconsistencies Assume S1
the current leader, dies with some uncommitted entries S2 gets elected and tries to bring consistency

Log Replication: Repairing Inconsistencies Log Replication: Repairing Inconsistencies Assume S1
the current leader, dies with some uncommitted entries S2 gets elected and tries to bring consistency Assume S1 the current leader, dies with some uncommitted entries S2 gets elected and tries to bring consistency S1 comes back from its hiatus and ﬁnds a new leader and entry

Log Matching Property Log Matching Property Goal: High level of
consistency between the logs in servers If log entries on two servers have same index and term: They store the same command The logs are identical in all preceding entries If a given entry is committed, all the preceding entries are also committed

Log Matching Property Log Matching Property Goal: High level of
consistency between the logs in servers If log entries on two servers have same index and term: They store the same command The logs are identical in all preceding entries If a given entry is committed, all the preceding entries are also committed In above picture, the entries till index 6 are considered committed

Log Replication RPC: AppendEntries Log Replication RPC: AppendEntries One RPC
for replicating logs across the cluster Same RPC is used to trigger heartbeats from leader

Append Entries RPC: Correctness check Append Entries RPC: Correctness check
AppendEntries RPC include <index, term> of preceding entry Follower must contain matching entry, otherwise it rejects the request Leader then has to retry with a lower index until match

Raft log Trivia Raft log Trivia 1 1 1 1
1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 1 1 1 2 2 3 3 3 S1 S5 S3 S4 S2 1 2 3 4 5 6 7 8 9 cluster index term

1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 1 1 1 2 2 3 3 3 S1 S5 S3 S4 S2 1 2 3 4 5 6 7 8 9 Who could be the leader for Term 1 , Term 2 and Term 3 ? cluster index term

1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 1 1 1 2 2 3 3 3 S1 S5 S3 S4 S2 1 2 3 4 5 6 7 8 9 Who could be the leader for Term 1 , Term 2 and Term 3 ? [ S1 to S5 ] [ S5 ] [ S3 ] cluster index term

1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 1 1 1 2 2 3 3 3 S1 S5 S3 S4 S2 1 2 3 4 5 6 7 8 9 Who could be the leader for Term 1 , Term 2 and Term 3 ? Till what index is the log consistent across the nodes S1-S5 ? [ S1 to S5 ] [ S5 ] [ S3 ] cluster index term

1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 1 1 1 2 2 3 3 3 S1 S5 S3 S4 S2 1 2 3 4 5 6 7 8 9 Who could be the leader for Term 1 , Term 2 and Term 3 ? Till what index is the log consistent across the nodes S1-S5 ? [ S1 to S5 ] [ S5 ] [ S3 ] Index 8 cluster index term

1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 1 1 1 2 2 3 3 3 S1 S5 S3 S4 S2 1 2 3 4 5 6 7 8 9 Who could be the leader for Term 1 , Term 2 and Term 3 ? Till what index is the log consistent across the nodes S1-S5 ? If S1 becomes candidate for Term 4 , who all can vote for it ? [ S1 to S5 ] [ S5 ] [ S3 ] Index 8 cluster index term

1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 1 1 1 2 2 3 3 3 S1 S5 S3 S4 S2 1 2 3 4 5 6 7 8 9 Who could be the leader for Term 1 , Term 2 and Term 3 ? Till what index is the log consistent across the nodes S1-S5 ? If S1 becomes candidate for Term 4 , who all can vote for it ? [ S1 to S5 ] [ S5 ] [ S3 ] Index 8 [ S2 ] [ S4 ] [ S5 ] cluster index term

1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 1 1 1 2 2 3 3 3 S1 S5 S3 S4 S2 1 2 3 4 5 6 7 8 9 Who could be the leader for Term 1 , Term 2 and Term 3 ? Till what index is the log consistent across the nodes S1-S5 ? If S1 becomes candidate for Term 4 , who all can vote for it ? If S2 becomes candidate for Term 4 , who all can vote for it ? [ S1 to S5 ] [ S5 ] [ S3 ] Index 8 [ S2 ] [ S4 ] [ S5 ] cluster index term

1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 1 1 1 2 2 3 3 3 S1 S5 S3 S4 S2 1 2 3 4 5 6 7 8 9 Who could be the leader for Term 1 , Term 2 and Term 3 ? Till what index is the log consistent across the nodes S1-S5 ? If S1 becomes candidate for Term 4 , who all can vote for it ? If S2 becomes candidate for Term 4 , who all can vote for it ? [ S1 to S5 ] [ S5 ] [ S3 ] Index 8 [ S2 ] [ S4 ] [ S5 ] [ S5 ] cluster index term

1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 1 1 1 2 2 3 3 3 S1 S5 S3 S4 S2 1 2 3 4 5 6 7 8 9 Who could be the leader for Term 1 , Term 2 and Term 3 ? Till what index is the log consistent across the nodes S1-S5 ? If S1 becomes candidate for Term 4 , who all can vote for it ? If S2 becomes candidate for Term 4 , who all can vote for it ? If S3 becomes candidate for Term 4 , who all can vote for it ? [ S1 to S5 ] [ S5 ] [ S3 ] Index 8 [ S2 ] [ S4 ] [ S5 ] [ S5 ] cluster index term

1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 1 1 1 2 2 3 3 3 S1 S5 S3 S4 S2 1 2 3 4 5 6 7 8 9 Who could be the leader for Term 1 , Term 2 and Term 3 ? Till what index is the log consistent across the nodes S1-S5 ? If S1 becomes candidate for Term 4 , who all can vote for it ? If S2 becomes candidate for Term 4 , who all can vote for it ? If S3 becomes candidate for Term 4 , who all can vote for it ? [ S1 to S5 ] [ S5 ] [ S3 ] Index 8 [ S2 ] [ S4 ] [ S5 ] [ S5 ] [ S1 ] [ S2 ] [ S4] [ S5 ] cluster index term

1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 1 1 1 2 2 3 3 3 S1 S5 S3 S4 S2 1 2 3 4 5 6 7 8 9 Who could be the leader for Term 1 , Term 2 and Term 3 ? Till what index is the log consistent across the nodes S1-S5 ? If S1 becomes candidate for Term 4 , who all can vote for it ? If S2 becomes candidate for Term 4 , who all can vote for it ? If S3 becomes candidate for Term 4 , who all can vote for it ? If S4 becomes candidate for Term 4 , who all can vote for it ? [ S1 to S5 ] [ S5 ] [ S3 ] Index 8 [ S2 ] [ S4 ] [ S5 ] [ S5 ] [ S1 ] [ S2 ] [ S4] [ S5 ] cluster index term

1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 1 1 1 2 2 3 3 3 S1 S5 S3 S4 S2 1 2 3 4 5 6 7 8 9 Who could be the leader for Term 1 , Term 2 and Term 3 ? Till what index is the log consistent across the nodes S1-S5 ? If S1 becomes candidate for Term 4 , who all can vote for it ? If S2 becomes candidate for Term 4 , who all can vote for it ? If S3 becomes candidate for Term 4 , who all can vote for it ? If S4 becomes candidate for Term 4 , who all can vote for it ? [ S1 to S5 ] [ S5 ] [ S3 ] Index 8 [ S2 ] [ S4 ] [ S5 ] [ S5 ] [ S1 ] [ S2 ] [ S4] [ S5 ] [ S1 ] [ S2 ] [ S5 ] cluster index term

1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 1 1 1 2 2 3 3 3 S1 S5 S3 S4 S2 1 2 3 4 5 6 7 8 9 Who could be the leader for Term 1 , Term 2 and Term 3 ? Till what index is the log consistent across the nodes S1-S5 ? If S1 becomes candidate for Term 4 , who all can vote for it ? If S2 becomes candidate for Term 4 , who all can vote for it ? If S3 becomes candidate for Term 4 , who all can vote for it ? If S4 becomes candidate for Term 4 , who all can vote for it ? If S5 becomes candidater for Term 4 , who all can vote for it ? [ S1 to S5 ] [ S5 ] [ S3 ] Index 8 [ S2 ] [ S4 ] [ S5 ] [ S5 ] [ S1 ] [ S2 ] [ S4] [ S5 ] [ S1 ] [ S2 ] [ S5 ] cluster index term

1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 1 1 1 2 2 3 3 3 S1 S5 S3 S4 S2 1 2 3 4 5 6 7 8 9 Who could be the leader for Term 1 , Term 2 and Term 3 ? Till what index is the log consistent across the nodes S1-S5 ? If S1 becomes candidate for Term 4 , who all can vote for it ? If S2 becomes candidate for Term 4 , who all can vote for it ? If S3 becomes candidate for Term 4 , who all can vote for it ? If S4 becomes candidate for Term 4 , who all can vote for it ? If S5 becomes candidater for Term 4 , who all can vote for it ? [ S1 to S5 ] [ S5 ] [ S3 ] Index 8 [ S2 ] [ S4 ] [ S5 ] [ S5 ] [ S1 ] [ S2 ] [ S4] [ S5 ] [ S1 ] [ S2 ] [ S5 ] Φ cluster index term

Raft timing and availability Raft timing and availability Timing is
critical in Raft to elect and maintain a steady leader over time, in order to have a perfect availability of the cluster. Stability is ensured by respecting the timing requirement of the algorithm

critical in Raft to elect and maintain a steady leader over time, in order to have a perfect availability of the cluster. Stability is ensured by respecting the timing requirement of the algorithm broadcastTime << electionTimeout << MTBF

critical in Raft to elect and maintain a steady leader over time, in order to have a perfect availability of the cluster. Stability is ensured by respecting the timing requirement of the algorithm broadcastTime << electionTimeout << MTBF broadcastTime is the average time it takes a server to send request to every server in the cluster and receive responses. It is relative to the infrastructure. Typically 0.5 - 20 ms

critical in Raft to elect and maintain a steady leader over time, in order to have a perfect availability of the cluster. Stability is ensured by respecting the timing requirement of the algorithm broadcastTime << electionTimeout << MTBF broadcastTime is the average time it takes a server to send request to every server in the cluster and receive responses. It is relative to the infrastructure. Typically 0.5 - 20 ms electionTimeout is the conﬁgurable time post which a election triggers. Typically 150-300ms A good measure is around ~10x the mean network latency

critical in Raft to elect and maintain a steady leader over time, in order to have a perfect availability of the cluster. Stability is ensured by respecting the timing requirement of the algorithm broadcastTime << electionTimeout << MTBF broadcastTime is the average time it takes a server to send request to every server in the cluster and receive responses. It is relative to the infrastructure. Typically 0.5 - 20 ms electionTimeout is the conﬁgurable time post which a election triggers. Typically 150-300ms A good measure is around ~10x the mean network latency MTBF (Mean Time Between Failures) is the average time between failures for a server

Raft System Safety Constraints Raft System Safety Constraints Election safety
A leader never overwrites or delete entries rather appends Leader append only Only one leader will be elected per election term If two logs contain an entry with same index and term, they are consistent. and their logs are identical in all entries up to the index Log Matching If a log entry is committed in a term then that entry will be present in the logs of leaders for all higher numbered terms. Leader completeness If a server has applied a log at a particular index in its state machine, no other server will apply a diﬀerent log to that index State machine safety

Where is Raft being used? Where is Raft being used?
CockroachDB A Scalable, Survivable, Strongly- Consistent SQL Database dgraph A Scalable, Distributed, Low Latency, High Throughput Graph Database etcd A distributed reliable key-value store for critical data in Go tikv A Distributed transactional key value database powered by Rust and Raft swarmkit A toolkit for orchestrating distributed systems at any scale. chain core Software for operating permissioned and multi-asset blockchain networks

Consensus in the wild Consensus in the wild SR: Server
Replication LR: Log Replication SS: Sync Service BO: Barrier Orch SD: Service Discovery LE: Leader Election MM: Metadata Mgmt MQ: Message Queues

Summary Summary

Summary Summary Raft is divided into 3 parts: Leader election,
Log replication and Safety

Log replication and Safety A node can be in one of three states: Follower, Candidate or Leader

Log replication and Safety A node can be in one of three states: Follower, Candidate or Leader Every node starts as a Follower and transitions to candidate state after election timeout

Log replication and Safety A node can be in one of three states: Follower, Candidate or Leader Every node starts as a Follower and transitions to candidate state after election timeout A Candidate will vote for itself and send RequestVote RPCs to all the other nodes

Log replication and Safety A node can be in one of three states: Follower, Candidate or Leader Every node starts as a Follower and transitions to candidate state after election timeout A Candidate will vote for itself and send RequestVote RPCs to all the other nodes If it gets votes from the majority of the nodes, it becomes the new Leader

Log replication and Safety A node can be in one of three states: Follower, Candidate or Leader Every node starts as a Follower and transitions to candidate state after election timeout A Candidate will vote for itself and send RequestVote RPCs to all the other nodes If it gets votes from the majority of the nodes, it becomes the new Leader The leader is the only node responsible for managing the log. Followers just add new entries to their logs in response to the leader AppendEntries RPC

Summary ... Summary ...

Summary ... Summary ... When the leader receives a command
from the client, it ﬁrst saves this uncommitted message in its log, then sends it to every follower

from the client, it ﬁrst saves this uncommitted message in its log, then sends it to every follower When it gets a successful response from the majority of nodes, the command is committed and the client gets a conﬁrmation.

from the client, it ﬁrst saves this uncommitted message in its log, then sends it to every follower When it gets a successful response from the majority of nodes, the command is committed and the client gets a conﬁrmation. In the next AppendEntries RPC sent to the follower (that can be a new entry or just a heartbeat), the follower also commits the message

from the client, it ﬁrst saves this uncommitted message in its log, then sends it to every follower When it gets a successful response from the majority of nodes, the command is committed and the client gets a conﬁrmation. In the next AppendEntries RPC sent to the follower (that can be a new entry or just a heartbeat), the follower also commits the message The AppendEntries RPC implements a consistency check, to guarantee its local log is consistent with the leader's log.

from the client, it ﬁrst saves this uncommitted message in its log, then sends it to every follower When it gets a successful response from the majority of nodes, the command is committed and the client gets a conﬁrmation. In the next AppendEntries RPC sent to the follower (that can be a new entry or just a heartbeat), the follower also commits the message The AppendEntries RPC implements a consistency check, to guarantee its local log is consistent with the leader's log. A follower will grant its vote to a candidate that has a log at least as up to date as its own

Where to go next? Where to go next? should be
the ﬁrst place to start Read about cluster membership in Raft You can now read about ZAB protocol in Zookeeper Raft in even simpler terms Sample implementations of the Raft, particularly in Java ZK logs should make even more sense now Dive into Paxos and compare the models for yourself. Raft paper Visualize

Raft Consensus Algorithm

Raft Consensus Algorithm

Other Decks in Programming

Featured

Transcript