
Raft Consensus Algorithm


Aman Garg

May 14, 2021


Transcript

  1. Consensus Algorithms. Aman Garg, Fall 2019. "Is it better to be alive and wrong than be correct and dead?" - Jay Kreps
  3. What is consensus? A general agreement is what a consensus is all about.
  - Get everyone to agree about a resource / thing.
  - Make decisions through a formal process.
  - Delay and analyse to gain harmony, or else retry.
  - Establish a shared opinion.
  The resulting consensus doesn't have to be unanimous. This person here is clearly unhappy but has consented to the majority.
  4. But why is it important? Let us take a use case and understand.
  5. Map Partition Owner. Let's take a partitioned distributed map, say Hazelcast.
  - Assume you are writing a client over the IMap and there's a PUT operation on a certain key K.
  - Assume a simple MOD hash function exists to find the partition where the key resides.
  - Since the map is distributed over N nodes, we need to find the owner for the given partition.
  - What various ways exist for you, as a client, to figure out the owner?
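
  As a concrete illustration of the MOD-hash lookup described above, here is a minimal Java sketch. The class and field names are made up for illustration and are not the Hazelcast API; it assumes a fixed partition count and an already-populated partition-to-owner table.

      // Hypothetical sketch (not the Hazelcast API): locating a partition owner
      // with a simple MOD hash, given a fixed partition count and a routing table.
      import java.util.Map;

      final class PartitionRouter {
          private final int partitionCount;
          private final Map<Integer, String> partitionToOwner; // partitionId -> node address

          PartitionRouter(int partitionCount, Map<Integer, String> partitionToOwner) {
              this.partitionCount = partitionCount;
              this.partitionToOwner = partitionToOwner;
          }

          int partitionFor(Object key) {
              // Simple MOD hash; Math.floorMod avoids negative hash codes.
              return Math.floorMod(key.hashCode(), partitionCount);
          }

          String ownerFor(Object key) {
              return partitionToOwner.get(partitionFor(key));
          }
      }
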
  10. Different ways of routing a request to the right node:
  - The client forwards the request to any node, which either forwards it to the correct node or returns the result.
  - The client sends the request to a partition-aware routing tier which knows the owner for "foo".
  - The client itself knows the owner node for a particular partition.
  15. Why is it a problem though?
  - All participants have to agree on what the correct owner node is for a particular partition. There's no point in guaranteeing a consistent view of the map otherwise.
  - Regardless of whether this information lives with the routing tier, the nodes, or the client, we need some sort of coordination here.
  - Distributed consensus is a hard problem: easy to reason about but unfathomably hard to implement, with a lot of edge cases to handle.
  - Many such distributed systems rely on an external service that gives strong coordination and consensus guarantees on the cluster metadata.
  - Our favourite system enters to help us here. Wait for it....
  16. Apache Zookeeper. Zookeeper tracks information that is to be synchronised across the cluster.
  - Each node registers itself in Zookeeper (ZK).
  - ZK tracks the partitions against a particular set of nodes.
  - Clients can subscribe to this metadata; whenever nodes are added or removed, the client is notified.
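
  As a sketch of the registration and subscription flow above, the snippet below uses the stock Apache ZooKeeper Java client. The connection string, znode paths, and session timeout are illustrative assumptions, and the parent znodes are assumed to already exist.

      // Minimal sketch using the ZooKeeper Java client: a node registers itself
      // with an ephemeral znode, and a client watches the membership list.
      import org.apache.zookeeper.*;
      import java.util.List;

      public class MembershipWatcher {
          public static void main(String[] args) throws Exception {
              // Connect to a ZK ensemble (address, timeout and paths are illustrative).
              ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {});

              // A node registers itself with an ephemeral znode: it disappears
              // automatically if the node's session dies (parent path assumed to exist).
              zk.create("/cluster/members/node-1", new byte[0],
                        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

              // A client subscribes to membership changes by watching the children.
              List<String> members = zk.getChildren("/cluster/members",
                      event -> System.out.println("Membership changed: " + event));
              System.out.println("Current members: " + members);
          }
      }
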
  22. Consensus is the key. Solving consensus is the key to solving at least the following problems in computer science:
  - Total order broadcast: used in ZAB, the Zookeeper Atomic Broadcast.
  - Atomic commit (databases): fulfilling the A and C in the ACID properties.
  - Terminating reliable broadcast: sending messages to a list of processes, say in multiplayer gaming.
  - Dynamic group membership: who is the master? Which workers are available? What task is assigned to which worker?
  - Stronger shared-storage models: like how a concurrent hashmap helps concurrent threads reach an agreement.
  23. So what is a consensus algorithm? Consensus algorithms allow a collection of machines to work as a coherent group that can survive failures of some of its members.
  27. Properties of a consensus algorithm:
  - Safety: never returns an incorrect result, despite network partitions and delays, packet loss, duplication and reordering.
  - Fault tolerance: the system stays available and fully functional in case of node failures.
  - Correctness: performance is not impacted by a minority of slow nodes, and correctness does not depend on the consistency of time.
  - Real world: the core algorithm should be understandable and intuitive, the internal workings should seem obvious, and the implementation shouldn't require a major overhaul of the existing architecture.
  35. Replicated State Machine. A collection of servers computing identical copies of the same state; they operate even if a minority of servers are down.
  - Replicated log <=> replicated state machine.
  - All servers execute the same commands in the same order.
  - The consensus module ensures proper replication of the log.
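
  The equivalence of a replicated log and a replicated state machine can be made concrete with a small sketch: if every server applies the same commands in the same order to a deterministic machine, all copies stay identical. The key-value machine and the "SET key value" command format below are illustrative assumptions, not part of any particular system.

      // Illustrative sketch: a deterministic key-value state machine that applies
      // commands strictly in log order.
      import java.util.HashMap;
      import java.util.Map;

      final class KvStateMachine {
          private final Map<String, String> state = new HashMap<>();
          private int lastApplied = 0; // index of the last log entry applied

          // A command here is just "SET key value"; real systems use richer commands.
          void apply(int index, String command) {
              if (index != lastApplied + 1) {
                  throw new IllegalStateException("Log entries must be applied in order");
              }
              String[] parts = command.split(" ", 3);
              if (parts[0].equals("SET")) {
                  state.put(parts[1], parts[2]);
              }
              lastApplied = index;
          }

          String get(String key) { return state.get(key); }
      }
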
  39. Introduction to Raft. A consensus algorithm for building fault-tolerant distributed systems using a replicated state machine approach.
  - In Raft, each server runs a deterministic state machine that has a given state, takes commands as input, generates outputs, and moves to a new state after generating the output.
  - Built around a centralised topology: a leader-follower architecture.
  43. Raft Consensus Algorithm. At the core is a persistent log containing commands issued by clients; this log is local to each server.
  - Each server runs an instance of the Raft consensus module.
  - The consensus module receives commands, appends them to the local log, and communicates them to the modules on the other servers so that all logs store the commands in the same order.
  - Each server feeds commands from its local log into its state machine, generates the same output, and the result is returned.
  44. How do we engineers solve problems in general?
  - Problem decomposition: break the original problem into sub-problems and try to solve and conquer them individually.
  - Minimise the state space: handle multiple problems with a single mechanism, eliminate special cases, maximise coherence, minimise non-determinism.
  48. Raft Decomposition.
  - Leader election: majority voting to select one leader per term; heartbeats and timeouts to detect crashes and elect a new leader; randomized election timeouts to avoid split votes.
  - Log replication: the leader accepts commands from clients and appends them to its log; the leader replicates its log to the other servers, forcing them to agree; log inconsistencies are overwritten after consistency checks.
  - Safety: only servers with up-to-date logs can become leader; uncommitted entries that conflict with the new leader's log will be discarded; a leader is always correct.
  49. Leader Election: Normal Operation. The election process for the first term in a 5-node cluster. The election timeout is assumed to be >> the broadcast timeout.
  50. Leader Crash Scenario. As seen previously, S2 is the leader for term 2. Let us crash the current leader S2 and see what happens.
  51. Leader Election: Split Vote. If many followers become candidates at the same time, votes may be split such that no candidate gets a majority. Since there isn't a majority, the only possible course of action is to trigger a re-election; this happens automatically. For term 4, two candidates emerge: Node C and Node A. Both request votes from the majority; neither can get it, as a minimum of 3 votes is required here.
  55. Split Vote Scenario. Let's see what happens when S1 and S5 both emerge as candidates for the same term. Randomized election timeouts are a boon to ensure that split votes are rare.
  56. Leader Election: RequestVote RPC. Only one RPC is required to solve leader election in Raft.
  - Invoked by candidates only.
  - The term is used for safety checks (discussed ahead).
  - The focus is on intuition and simplicity.
  57. Leader Election: Safety Leader Election: Safety For any given term,

    there can be at most 1 leader The ongoing latest term number is exchanged between the servers. The term number is encapsulated as part of requests sent by the leader to follower nodes. A node rejects any request received with an old term number. A server will vote for at most one candidate for the particular term. This will be done on first-come-first-served basis. Once a candidate wins, it establishes its authority over other follower nodes by broadcasting a message to all other nodes. This will let everyone know who the new leader is. A server will give its vote to at most one candidate (if it hasn't voted for itself) of course. This information will be persisted to disk as well.
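
  To make the voting rules above concrete, here is a hedged Java sketch of a vote handler. The request fields mirror the RequestVote RPC described in the Raft paper, but the class names, the in-memory fields and the lack of actual disk persistence are simplifications for illustration.

      // Sketch of the vote-granting check: reject stale terms, vote for at most
      // one candidate per term, and only for a candidate with an up-to-date log.
      final class RequestVoteRequest {
          final int term;            // candidate's term
          final int candidateId;
          final int lastLogIndex;    // index of candidate's last log entry
          final int lastLogTerm;     // term of candidate's last log entry
          RequestVoteRequest(int term, int candidateId, int lastLogIndex, int lastLogTerm) {
              this.term = term; this.candidateId = candidateId;
              this.lastLogIndex = lastLogIndex; this.lastLogTerm = lastLogTerm;
          }
      }

      final class VoteHandler {
          int currentTerm = 0;
          Integer votedFor = null;   // persisted to disk in a real implementation
          int lastLogIndex = 0;
          int lastLogTerm = 0;

          synchronized boolean handleRequestVote(RequestVoteRequest req) {
              if (req.term < currentTerm) {
                  return false;                       // reject requests with an old term
              }
              if (req.term > currentTerm) {
                  currentTerm = req.term;             // step forward to the newer term
                  votedFor = null;
              }
              boolean candidateLogUpToDate =
                      req.lastLogTerm > lastLogTerm
                      || (req.lastLogTerm == lastLogTerm && req.lastLogIndex >= lastLogIndex);
              boolean notVotedYet = (votedFor == null || votedFor == req.candidateId);
              if (notVotedYet && candidateLogUpToDate) {
                  votedFor = req.candidateId;         // first come, first served for this term
                  return true;
              }
              return false;
          }
      }
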
  58. Logs and Log Replication. But first, what really is the log we're talking about? Each log entry contains:
  - The client command specified.
  - An index to identify the position of the entry in the log.
  - A term number to logically identify when the entry was written.
  Entries must survive crashes, so they are persisted locally. They are replicated by the leader onto the other servers, and executed in the state machine once committed.
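
  A log entry as described above can be sketched as a small immutable value. This is illustrative only; a real implementation also needs serialization and durable storage.

      // Illustrative only: the fields a single log entry carries.
      final class LogEntry {
          final long index;       // position of the entry in the log
          final int term;         // term in which the entry was written
          final String command;   // opaque client command, applied once committed

          LogEntry(long index, int term, String command) {
              this.index = index;
              this.term = term;
              this.command = command;
          }
      }
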
  60. Log Replication: Happy Flow. Assume nodes S1 to S5, with S1 being the leader for term 2; the only other node active in the cluster is S2. Now let us bring S3 back to life.
  62. Log Replication: Repairing Inconsistencies. Assume S1, the current leader, dies with some uncommitted entries. S2 gets elected and tries to restore consistency. S1 then comes back from its hiatus and finds a new leader and new entries.
  65. Log Matching Property. Goal: a high level of consistency between the logs on the servers.
  - If log entries on two servers have the same index and term, they store the same command, and the logs are identical in all preceding entries.
  - If a given entry is committed, all the preceding entries are also committed.
  In the slide's picture, the entries up to index 6 are considered committed.
  66. Log Replication RPC: AppendEntries. One RPC is used for replicating logs across the cluster; the same RPC is used to send heartbeats from the leader.
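
  The fields that AppendEntries carries can be sketched as plain data classes. The field set follows the Raft paper's description; the Java shape here is an illustrative assumption, with entries reduced to opaque command strings.

      // Sketch of the AppendEntries request/response shape.
      import java.util.List;

      final class AppendEntriesRequest {
          final int term;                 // leader's term
          final int leaderId;
          final long prevLogIndex;        // index of the entry immediately preceding the new ones
          final int prevLogTerm;          // term of that preceding entry
          final List<String> entries;     // new entries to store (empty for a heartbeat)
          final long leaderCommit;        // leader's commit index

          AppendEntriesRequest(int term, int leaderId, long prevLogIndex, int prevLogTerm,
                               List<String> entries, long leaderCommit) {
              this.term = term; this.leaderId = leaderId;
              this.prevLogIndex = prevLogIndex; this.prevLogTerm = prevLogTerm;
              this.entries = entries; this.leaderCommit = leaderCommit;
          }
      }

      final class AppendEntriesResponse {
          final int term;                 // follower's current term, so the leader can update itself
          final boolean success;          // true if the follower matched prevLogIndex/prevLogTerm
          AppendEntriesResponse(int term, boolean success) { this.term = term; this.success = success; }
      }
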
  67. AppendEntries RPC: Correctness Check.
  - Each AppendEntries RPC includes the <index, term> of the preceding entry.
  - The follower must contain a matching entry, otherwise it rejects the request.
  - The leader then has to retry with a lower index until the logs match.
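
  The follower-side check described above can be sketched as follows. The log is simplified to a list of per-entry terms and the names are illustrative; the point is only the reject-then-retry contract with the leader.

      // Follower-side consistency check: accept entries only if the entry at
      // prevLogIndex carries prevLogTerm; otherwise reject so the leader retries lower.
      import java.util.ArrayList;
      import java.util.List;

      final class FollowerLog {
          private final List<Integer> termAt = new ArrayList<>();   // termAt.get(i-1) = term of entry i

          boolean appendEntries(long prevLogIndex, int prevLogTerm, List<Integer> newEntryTerms) {
              if (prevLogIndex > termAt.size()) {
                  return false;                                      // we don't even have that entry yet
              }
              if (prevLogIndex > 0 && termAt.get((int) prevLogIndex - 1) != prevLogTerm) {
                  return false;                                      // conflicting entry; leader will retry lower
              }
              // Truncate anything after the match point and append the leader's entries.
              while (termAt.size() > prevLogIndex) {
                  termAt.remove(termAt.size() - 1);
              }
              termAt.addAll(newEntryTerms);
              return true;
          }
      }
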
  82. Raft Log Trivia. [Slide diagram, not reproduced: the term of each log entry at indices 1-9 on servers S1-S5.]
  - Who could be the leader for term 1, term 2 and term 3? [S1 to S5], [S5], [S3]
  - Up to what index is the log consistent across the nodes S1-S5? Index 8
  - If S1 becomes a candidate for term 4, who can vote for it? [S2], [S4], [S5]
  - If S2 becomes a candidate for term 4, who can vote for it? [S5]
  - If S3 becomes a candidate for term 4, who can vote for it? [S1], [S2], [S4], [S5]
  - If S4 becomes a candidate for term 4, who can vote for it? [S1], [S2], [S5]
  - If S5 becomes a candidate for term 4, who can vote for it? Φ (nobody)
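
  The voting answers above all come down to Raft's "at least as up to date" comparison: a voter grants its vote only if the candidate's last log term is higher, or the terms are equal and the candidate's log is at least as long. A standalone sketch of that comparison (illustrative names, not tied to any implementation):

      // "At least as up to date": compare last-entry terms first, then log length.
      final class LogComparison {
          static boolean candidateIsUpToDate(int candLastTerm, long candLastIndex,
                                             int voterLastTerm, long voterLastIndex) {
              if (candLastTerm != voterLastTerm) {
                  return candLastTerm > voterLastTerm;   // higher last term wins
              }
              return candLastIndex >= voterLastIndex;    // same term: longer (or equal) log wins
          }
      }
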
  87. Raft Timing and Availability. Timing is critical in Raft: to elect and maintain a steady leader over time, and thus keep the cluster available, the timing requirements of the algorithm must be respected:
  broadcastTime << electionTimeout << MTBF
  - broadcastTime is the average time it takes a server to send a request to every server in the cluster and receive responses. It depends on the infrastructure; typically 0.5-20 ms.
  - electionTimeout is the configurable time after which an election triggers; typically 150-300 ms. A good measure is around ~10x the mean network latency.
  - MTBF (Mean Time Between Failures) is the average time between failures for a single server.
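
  The randomized election timeout mentioned earlier can be sketched as a tiny timer helper. The 150-300 ms range comes from the slide; the class shape and the use of wall-clock time are illustrative simplifications.

      // Each follower picks a fresh random timeout in a configured range, so that
      // candidates rarely time out at the same instant (which avoids split votes).
      import java.util.concurrent.ThreadLocalRandom;

      final class ElectionTimer {
          private static final long MIN_TIMEOUT_MS = 150;
          private static final long MAX_TIMEOUT_MS = 300;

          private long deadline;   // wall-clock time (ms) at which an election should start

          void reset() {           // call on startup and whenever a heartbeat arrives
              long timeout = ThreadLocalRandom.current().nextLong(MIN_TIMEOUT_MS, MAX_TIMEOUT_MS + 1);
              deadline = System.currentTimeMillis() + timeout;
          }

          boolean electionDue() {  // polled periodically by the follower loop
              return System.currentTimeMillis() >= deadline;
          }
      }
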
  88. Raft System Safety Constraints.
  - Election safety: only one leader will be elected per election term.
  - Leader append-only: a leader never overwrites or deletes entries; it only appends.
  - Log matching: if two logs contain an entry with the same index and term, they are consistent, and their logs are identical in all entries up to that index.
  - Leader completeness: if a log entry is committed in a term, then that entry will be present in the logs of the leaders for all higher-numbered terms.
  - State machine safety: if a server has applied a log entry at a particular index to its state machine, no other server will apply a different entry at that index.
  89. Where is Raft being used?
  - CockroachDB: a scalable, survivable, strongly-consistent SQL database.
  - dgraph: a scalable, distributed, low-latency, high-throughput graph database.
  - etcd: a distributed, reliable key-value store for critical data, in Go.
  - tikv: a distributed transactional key-value database powered by Rust and Raft.
  - swarmkit: a toolkit for orchestrating distributed systems at any scale.
  - chain core: software for operating permissioned, multi-asset blockchain networks.
  90. Consensus in the Wild. [Comparison table not reproduced.] Legend: SR: Server Replication, LR: Log Replication, SS: Sync Service, BO: Barrier Orch, SD: Service Discovery, LE: Leader Election, MM: Metadata Mgmt, MQ: Message Queues.
  95. Summary.
  - Raft is divided into 3 parts: leader election, log replication and safety.
  - A node can be in one of three states: Follower, Candidate or Leader.
  - Every node starts as a Follower and transitions to the Candidate state after an election timeout.
  - A Candidate votes for itself and sends RequestVote RPCs to all the other nodes.
  - If it gets votes from the majority of the nodes, it becomes the new Leader.
  - The leader is the only node responsible for managing the log; followers just add new entries to their logs in response to the leader's AppendEntries RPCs.
  100. Summary (continued).
  - When the leader receives a command from a client, it first saves this uncommitted entry in its log, then sends it to every follower.
  - When it gets a successful response from the majority of nodes, the command is committed and the client gets a confirmation.
  - On the next AppendEntries RPC sent to a follower (which can carry a new entry or just be a heartbeat), the follower also commits the entry.
  - The AppendEntries RPC implements a consistency check to guarantee the follower's local log is consistent with the leader's log.
  - A follower will grant its vote only to a candidate whose log is at least as up to date as its own.
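
  The commit rule in this summary (commit once a majority of nodes have acknowledged the entry) can be sketched with simple matchIndex bookkeeping. The names follow common Raft write-ups but are illustrative; a complete implementation also restricts this rule to entries from the leader's current term.

      // Leader-side commit tracking: an entry is committed once a strict majority
      // of the cluster (leader included) is known to have replicated it.
      final class CommitTracker {
          private final long[] matchIndex;   // highest log index known replicated on each follower
          private final int clusterSize;     // followers + the leader itself
          private long commitIndex = 0;

          CommitTracker(int followerCount) {
              this.matchIndex = new long[followerCount];
              this.clusterSize = followerCount + 1;
          }

          void onAppendEntriesAck(int followerId, long ackedIndex) {
              matchIndex[followerId] = Math.max(matchIndex[followerId], ackedIndex);
          }

          long advanceCommitIndex(long leaderLastIndex) {
              for (long candidate = leaderLastIndex; candidate > commitIndex; candidate--) {
                  int replicas = 1;                      // the leader's own log always has the entry
                  for (long m : matchIndex) {
                      if (m >= candidate) replicas++;
                  }
                  if (replicas * 2 > clusterSize) {      // strict majority reached
                      commitIndex = candidate;           // (real Raft also checks the entry's term)
                      break;
                  }
              }
              return commitIndex;
          }
      }
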
  101. Where to go next?
  - The Raft paper should be the first place to start.
  - Read about cluster membership in Raft.
  - You can now read about the ZAB protocol in Zookeeper.
  - Visualize Raft in even simpler terms.
  - Look at sample implementations of Raft, particularly in Java.
  - ZK logs should make even more sense now.
  - Dive into Paxos and compare the models for yourself.