Slide 1

Slide 1 text

Consensus, Raft & Rafter

Slide 2

Slide 2 text

Tom Santero: cats | newports | distributed systems

Slide 3

Slide 3 text

Andrew Stone: cats | bread | distributed systems

Slide 4

Slide 4 text

@tsantero @andrew_j_stone

Slide 5

Slide 5 text

tsantero andrewjstone

Slide 6

Slide 6 text

tsantero astone @ basho.com

Slide 7

Slide 7 text

tsantero astone @ basho.com (notice Andrew’s contact keeps getting shorter?)

Slide 8

Slide 8 text

http://thinkdistributed.io A Chris Meiklejohn Production

Slide 9

Slide 9 text

The Usual Suspects

Slide 10

Slide 10 text

“Strongly Consistent Datastores” MongoDB Redis MySQL others...

Slide 11

Slide 11 text

async { replication, disk persistence }  Failure Detection

Slide 12

Slide 12 text

Problem?

Slide 13

Slide 13 text

Failure Mode 1: Single node with async disk writes. Data is written to the fs buffer, the user is sent an acknowledgement, then power goes out. Data not yet written to disk is LOST; the system is UNAVAILABLE. Single-disk solutions: fsync, battery backup, prayer
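The race in Failure Mode 1 comes down to when the acknowledgement is sent relative to the disk write. A minimal Erlang sketch (module and function names are made up for illustration), assuming Fd is a file descriptor opened in raw/binary append mode:

    -module(disk_write_demo).
    -export([unsafe_write/2, safe_write/2]).

    %% Ack after a buffered write: data may still sit in the fs buffer,
    %% so a power failure here loses it even though the user saw "ok".
    unsafe_write(Fd, Data) ->
        ok = file:write(Fd, Data),
        ok.

    %% Force the write to stable storage before acknowledging.
    safe_write(Fd, Data) ->
        ok = file:write(Fd, Data),
        ok = file:sync(Fd),
        ok.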

Slide 14

Slide 14 text

Failure Mode 2: Master/Slave with asynchronous replication. Data is written by the user and acknowledged. Data is synced on the Primary, but then the Primary crashes before replicating.

Slide 15

Slide 15 text

? Consistent Available

Slide 16

Slide 16 text

? Consistent Available

Slide 17

Slide 17 text

? Consistent Available. Primary failed. Data not yet written to the Secondary; the write was already ack'd to the Client. if (promote_secondary()) { stderr("data loss"); } else { stderr("system unavailable"); }

Slide 18

Slide 18 text

(╯°□°）╯︵ ┻━┻

Slide 19

Slide 19 text

Synchronous Writes FTW?

Slide 20

Slide 20 text

PostgreSQL / Oracle Master / Slave Ack when Slave confirms Write
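The contrast with the earlier async setup can be sketched in a few lines of Erlang (hypothetical module; the secondary is assumed to be a gen_server that applies {replicate, Entry} and replies ok):

    %% Async: ack the client before the secondary has the data (fire and forget).
    write_async(Secondary, Entry) ->
        gen_server:cast(Secondary, {replicate, Entry}),
        {ok, acked_before_replication}.

    %% Sync: block until the secondary confirms the write, then ack.
    write_sync(Secondary, Entry) ->
        ok = gen_server:call(Secondary, {replicate, Entry}, 5000),
        {ok, acked_after_replication}.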

Slide 21

Slide 21 text

Problem?

Slide 22

Slide 22 text

Problem? Failure Detection Automated Failover “split brain” partitions

Slide 23

Slide 23 text

Solution!

Slide 24

Slide 24 text

Solution! Consensus protocols! (Paxos, ZAB, Raft) RYOW (read-your-own-writes) Consistency Safe Serializability

Slide 25

Slide 25 text

What is Consensus?

Slide 26

Slide 26 text

“The problem of reaching agreement among remote processes is one of the most fundamental problems in distributed computing and is at the core of many algorithms for distributed data processing, distributed file management, and fault- tolerant distributed applications.”

Slide 27

Slide 27 text

In a distributed system... multiple processes agreeing on a value despite failures.

Slide 28

Slide 28 text

host0 host1 host2 Replicated Log

Slide 29

Slide 29 text

v0 host0 host1 host2 Replicated Log

Slide 30

Slide 30 text

v0 v1 v2 v3 v4 v5 ... v(n-1) host0 host1 host2 Replicated Log
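Concretely, each slot v0, v1, ... in the replicated log holds an entry that every host stores in the same order. A hedged Erlang sketch of such an entry (field names are illustrative, not taken from Rafter):

    %% Illustrative log entry for a Raft-style replicated log.
    -record(entry, {
        index   :: non_neg_integer(),  %% position in the log: v0, v1, v2, ...
        term    :: non_neg_integer(),  %% election term in which the entry was created
        command :: term()              %% opaque command destined for the state machine
    }).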

Slide 31

Slide 31 text

Consensus {termination agreement validity

Slide 32

Slide 32 text

Consensus {termination agreement validity non faulty processes eventually decide on a value

Slide 33

Slide 33 text

Consensus {termination agreement validity non faulty processes eventually decide on a value processes that decide do so on the same value

Slide 34

Slide 34 text

Consensus {termination agreement validity non faulty processes eventually decide on a value processes that decide do so on the same value value must have been proposed

Slide 35

Slide 35 text

Theoretical Real World

Slide 36

Slide 36 text

Back to 1985...

Slide 37

Slide 37 text

Back to 1985... The FLP Result

Slide 38

Slide 38 text

& Safety Liveness

Slide 39

Slide 39 text

bad things can’t happen

Slide 40

Slide 40 text

good things eventually happen

Slide 41

Slide 41 text

Consensus {termination agreement validity non faulty processes eventually decide on a value processes that decide do so on the same value value must have been proposed

Slide 42

Slide 42 text

{termination agreement validity non faulty processes eventually decide on a value processes that decide do so on the same value value must have been proposed Safety Liveness

Slide 43

Slide 43 text

Safety Liveness {termination agreement validity non faulty processes eventually decide on a value processes that decide do so on the same value value must have been proposed

Slide 44

Slide 44 text

Safety Liveness {termination agreement validity non faulty processes eventually decide on a value processes that decide do so on the same value value must have been proposed

Slide 45

Slide 45 text

Safety Liveness {termination agreement validity non faulty processes eventually decide on a value processes that decide do so on the same value value must have been proposed non-triviality

Slide 46

Slide 46 text

The FLP Result: perfect Safety and Liveness in async consensus is impossible

Slide 47

Slide 47 text

Symmetric vs Asymmetric

Slide 48

Slide 48 text

Raft

Slide 49

Slide 49 text

Motivation: RAMCloud large scale, general purpose, distributed storage all data lives in DRAM strong consistency model https://ramcloud.stanford.edu/

Slide 50

Slide 50 text

Motivation: RAMCloud large scale, general purpose, distributed storage all data lives in DRAM strong consistency model 100 byte object reads in 5μs https://ramcloud.stanford.edu/

Slide 51

Slide 51 text

John Ousterhout Diego Ongaro In Search of an Understandable Consensus Algorithm https://ramcloud.stanford.edu/raft.pdf

Slide 52

Slide 52 text

“Unfortunately, Paxos is quite difficult to understand, in spite of numerous attempts to make it more approachable. Furthermore, its architecture is unsuitable for building practical systems, requiring complex changes to create an efficient and complete solution. As a result, both system builders and students struggle with Paxos.”

Slide 53

Slide 53 text

No content

Slide 54

Slide 54 text

Design Goals: Understandability & Decomposition Strong Leadership Model Joint Consensus for Membership Changes

Slide 55

Slide 55 text

Log SM C Consensus Module Replicated Log State Machine

Slide 56

Slide 56 text

Log SM C Log SM C Log SM C Client C

Slide 57

Slide 57 text

Log SM C Log SM C Log SM C Client 1. client makes request to Leader C

Slide 58

Slide 58 text

Log SM C Log SM C Log SM C Client 2. consensus module manages request C

Slide 59

Slide 59 text

Log SM C Log SM C Log SM C Client 3. persist instruction to local log v C

Slide 60

Slide 60 text

Log SM C Log SM C Log SM C Client v C

Slide 61

Slide 61 text

Log SM C Log SM C Log SM C Client v 4. leader replicates command to other machines C C C

Slide 62

Slide 62 text

Log SM C Log SM C Log SM C Client v C C C

Slide 63

Slide 63 text

Log SM C Log SM C Log SM C Client v C C v v 5. command recorded to local machines’ log C

Slide 64

Slide 64 text

Log SM C Log SM C Log SM C Client v C C v v C

Slide 65

Slide 65 text

Log SM C Log SM C Log SM C Client v C C v v C

Slide 66

Slide 66 text

Log SM C Log SM C Log SM C Client v C C v v C

Slide 67

Slide 67 text

Log SM C Log SM C Log SM C Client v C C v v 7. command forwarded to state machines for processing SM SM SM C

Slide 68

Slide 68 text

Log SM C Log SM C Log SM C Client v C C v v 7. command forwarded to state machines for processing SM SM SM C

Slide 69

Slide 69 text

Log SM C Log SM C Log SM C Client v C C v v SM C

Slide 70

Slide 70 text

Log SM C Log SM C Log SM C Client v C C v v SM 8. SM processes command, ACKs to client C

Slide 71

Slide 71 text

Log SM C Log SM C Log SM C Client v C C v v SM C

Slide 72

Slide 72 text

Why does that work? job of the consensus module to: C manage replicated logs determine when it’s safe to pass to state machine for execution only requires majority participation

Slide 73

Slide 73 text

Why does that work? job of the consensus module to: C manage replicated logs determine when it’s safe to pass to state machine for execution only requires majority participation Safety { Liveness {

Slide 74

Slide 74 text

2F + 1

Slide 75

Slide 75 text

2F + 1 solve for F

Slide 76

Slide 76 text

F + 1 service unavailable
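The arithmetic behind these two slides, as a small sketch: with N = 2F + 1 servers a majority is F + 1, so the cluster tolerates F failures; once F + 1 servers are gone no majority remains and the service is unavailable (it stays consistent, but stops making progress).

    %% Quorum arithmetic for a cluster of N = 2F + 1 servers.
    majority(N) -> N div 2 + 1.             %% majority(5) =:= 3

    tolerated_failures(N) -> (N - 1) div 2. %% tolerated_failures(5) =:= 2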

Slide 77

Slide 77 text

Fail-Stop Behavior

Slide 78

Slide 78 text

What If The Leader DIES?

Slide 79

Slide 79 text

Leader Election!

Slide 80

Slide 80 text

1. Select 1/N servers to act as Leader 2. Leader ensures Safety and Linearizability 3. Detect crashes + Elect new Leader 4. Maintain consistency after Leadership “coups” 5. Depose old Leaders if they return 6. Manage cluster topology

Slide 81

Slide 81 text

Possible Server Roles: Leader Follower Candidate

Slide 82

Slide 82 text

Possible Server Roles: Leader Follower Candidate At most only 1 valid Leader at a time Receives commands from clients Commits entries Sends heartbeats

Slide 83

Slide 83 text

Possible Server Roles: Leader Follower Candidate Replicate state changes Passive member of cluster during normal operation Vote for Candidates

Slide 84

Slide 84 text

Possible Server Roles: Leader Follower Candidate Initiate and coordinate Leader Election Was previously a Follower

Slide 85

Slide 85 text

Terms: each term begins with an election followed by normal operation (Term 1, Term 2, Term 3, Term 4, ...); a split vote can leave a term with no emerging leader.

Slide 86

Slide 86 text

Leader Follower Candidate

Slide 87

Slide 87 text

Leader Follower Candidate times out, starts election

Slide 88

Slide 88 text

Leader Follower Candidate

Slide 89

Slide 89 text

Leader Follower Candidate times out, new election

Slide 90

Slide 90 text

Leader Follower Candidate

Slide 91

Slide 91 text

Leader Follower Candidate receives votes from majority of servers

Slide 92

Slide 92 text

Leader Follower Candidate

Slide 93

Slide 93 text

Leader Follower Candidate discover server with higher term

Slide 94

Slide 94 text

Leader Follower Candidate

Slide 95

Slide 95 text

Leader Follower Candidate discover current leader or higher term

Slide 96

Slide 96 text

Leader Follower Candidate
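The transition diagram on the preceding slides can be condensed into a single pure function; this is a sketch of the rules, not Rafter's actual code:

    %% Raft role transitions, one clause per arrow in the diagram.
    next_role(follower,  election_timeout)                 -> candidate;
    next_role(candidate, election_timeout)                 -> candidate;  %% new election, new term
    next_role(candidate, received_majority_of_votes)       -> leader;
    next_role(candidate, discovered_leader_or_higher_term) -> follower;
    next_role(leader,    discovered_higher_term)           -> follower;
    next_role(Role,      _OtherEvent)                      -> Role.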

Slide 97

Slide 97 text

Potential Use Cases: Distributed Lock Manager Database Transactions Automated Failover Configuration Management http://coreos.com/blog/distributed-configuration-with-etcd/index.html Service Discovery etc...

Slide 98

Slide 98 text

Rafter github.com/andrewjstone/rafter

Slide 99

Slide 99 text

•A labor of love, a work in progress •A library for building strongly consistent distributed systems in Erlang •Implements the raft consensus protocol in Erlang •Fundamental abstraction is the replicated log What:

Slide 100

Slide 100 text

Replicated Log •API operates on log entries •Log entries contain commands •Commands are transparent to Rafter •Systems build on top of rafter with pluggable state machines that process commands upon log entry commit.

Slide 101

Slide 101 text

Erlang

Slide 102

Slide 102 text

Erlang: A Concurrent Language •Processes are the fundamental abstraction •Processes can only communicate by sending each other messages •Processes do not share state •Processes are managed by supervisor processes in a hierarchy

Slide 103

Slide 103 text

Erlang: A Concurrent Language

loop() ->
    receive
        {From, Msg} ->
            From ! Msg,
            loop()
    end.

%% Spawn 100,000 echo servers
Pids = [spawn(fun loop/0) || _ <- lists:seq(1, 100000)].

%% Send a message to the first process (Erlang lists are 1-indexed)
lists:nth(1, Pids) ! {self(), ayo}.

Slide 104

Slide 104 text

Erlang: A Functional Language
• Single Assignment Variables
• Tail Recursion
• Pattern Matching
{op, {set, Key, Val}} = {op, {set, <<"job">>, <<"developer">>}}
• Bit Syntax
Header = <<...>>

Slide 105

Slide 105 text

Erlang: A Distributed Language
Location Transparency: processes can send messages to other processes without having to know whether the other process is local.

%% Send to a local gen_server process
gen_server:cast(peer1, do_something).

%% Send to a gen_server on another machine ({Name, Node} form; node name illustrative)
gen_server:cast({peer1, 'peer1@other.host'}, do_something).

%% Wrapped in a function with a variable name for a clean client API
do_something(Name) -> gen_server:cast(Name, do_something).

%% Using the API
Result = do_something(peer1).

Slide 106

Slide 106 text

Erlang: A Reliable Language •Erlang embraces “Fail-Fast” •Code for the good case. Fail otherwise. •Supervisors relaunch failed processes •Links and Monitors alert other processes of failure •Avoids coding most error paths and helps prevent logic errors from propagating

Slide 107

Slide 107 text

OTP • OTP is a set of modules and standards that simplifies building reliable, well-engineered Erlang applications • The gen_server, gen_fsm and gen_event modules are the most important parts of OTP • They wrap processes as server "behaviors" in order to facilitate building common, standardized distributed applications that integrate well with the Erlang runtime
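For readers who have not used OTP, a minimal gen_server looks roughly like this (a generic echo server, not part of Rafter):

    -module(echo_server).
    -behaviour(gen_server).
    -export([start_link/0, echo/1]).
    -export([init/1, handle_call/3, handle_cast/2]).

    start_link() -> gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

    %% Client API: synchronous call to the registered server process.
    echo(Msg) -> gen_server:call(?MODULE, {echo, Msg}).

    init([]) -> {ok, #{}}.

    handle_call({echo, Msg}, _From, State) -> {reply, Msg, State}.

    handle_cast(_Msg, State) -> {noreply, State}.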

Slide 108

Slide 108 text

Implementation github.com/andrewjstone/rafter

Slide 109

Slide 109 text

Peers •Each peer is made up of two supervised processes •A gen_fsm that implements the raft consensus fsm •A gen_server that wraps the persistent log •An API module hides the implementation
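A hedged sketch of what that per-peer supervision tree might look like; the child specs and start_link arguments here are assumptions, not Rafter's actual supervisor:

    %% Illustrative supervisor init/1 for one peer: a log server plus the consensus fsm.
    init([Name]) ->
        LogServer    = {rafter_log,
                        {rafter_log, start_link, [Name]},
                        permanent, 5000, worker, [rafter_log]},
        ConsensusFsm = {rafter_consensus_fsm,
                        {rafter_consensus_fsm, start_link, [Name]},
                        permanent, 5000, worker, [rafter_consensus_fsm]},
        {ok, {{one_for_all, 5, 10}, [LogServer, ConsensusFsm]}}.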

Slide 110

Slide 110 text

Rafter API
• The entire user API lives in rafter.erl
• rafter:start_node(peer1, kv_sm).
• rafter:set_config(peer1, [peer1, peer2, peer3, peer4, peer5]).
• rafter:op(peer1, {set, <<"Omar">>, <<"gonna get got">>}).
• rafter:op(peer1, {get, <<"Omar">>}).

Slide 111

Slide 111 text

Output State Machines •Commands are applied in order to each peer's state machine as their entries are committed •All peers in a consensus group run the same type of state machine, passed in during start_node/2 •Each state machine must export apply/1

Slide 112

Slide 112 text

Hypothetical KV store

%% API (in module kv_sm)
set(Key, Val) ->
    Peer = get_local_peer(),
    rafter:op(Peer, {set, Key, Val}).

%% State machine callbacks
apply({set, Key, Value}) -> ets:insert(kv_sm_store, {Key, Value});
apply({get, Key}) -> ets:lookup(kv_sm_store, Key).

Slide 113

Slide 113 text

rafter_consensus_fsm •gen_fsm that implements Raft •3 states: follower, candidate, leader •Messages are sent and received between FSMs according to the Raft protocol •State handling functions pattern match on messages to simplify and shorten handler clauses
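Simplified sketch of how the three roles map onto gen_fsm state callbacks; start_election/1, append_to_log/3, has_majority/1, become_leader/1 and replicate/3 are hypothetical helpers, not Rafter's real functions:

    %% Each Raft role is a gen_fsm state function; protocol messages arrive as events.
    follower(election_timeout, State) ->
        {next_state, candidate, start_election(State)};
    follower({append_entries, From, Entries}, State) ->
        {next_state, follower, append_to_log(From, Entries, State)}.

    candidate({vote_granted, _Peer}, State) ->
        case has_majority(State) of
            true  -> {next_state, leader, become_leader(State)};
            false -> {next_state, candidate, State}
        end.

    leader({client_op, From, Command}, State) ->
        {next_state, leader, replicate(From, Command, State)}.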

Slide 114

Slide 114 text

rafter_log.erl • Log API used by rafter_consensus_fsm and rafter_config • Uses binary pattern matching for reading logs • Writes entries to an append-only log • State machine commands are encoded with term_to_binary/1
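The encode/read idea, sketched (the framing format here is an assumption, not Rafter's on-disk layout): each command is serialized with term_to_binary/1, length-prefixed on write, and decoded back with binary pattern matching on read.

    %% Append one state-machine command to an append-only log file.
    append(Fd, Command) ->
        Bin = term_to_binary(Command),
        file:write(Fd, <<(byte_size(Bin)):32, Bin/binary>>).

    %% Decode a chunk of the log back into commands via the length prefix.
    decode(<<Len:32, Bin:Len/binary, Rest/binary>>, Acc) ->
        decode(Rest, [binary_to_term(Bin) | Acc]);
    decode(<<>>, Acc) ->
        lists:reverse(Acc).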

Slide 115

Slide 115 text

rafter_config.erl •Rafter handles dynamic reconfiguration of its clusters at runtime •Depending upon the configuration of the cluster, different code paths need to be navigated, such as deciding whether a majority of votes has been received •Instead of embedding this logic in the consensus fsm, it was abstracted into a module of pure functions

Slide 116

Slide 116 text

rafter_config.erl API

-spec quorum_min(peer(), #config{}, dict()) -> non_neg_integer().
-spec has_vote(peer(), #config{}) -> boolean().
-spec allow_config(#config{}, list(peer())) -> boolean().
-spec voters(#config{}) -> list(peer()).
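For instance, has_vote/2 and voters/1 can be simple list operations over the configuration; this sketch uses a deliberately simplified #config{} record (the real one carries more structure to support joint consensus during reconfiguration):

    %% Simplified, illustrative versions of the pure config helpers.
    -record(config, {voters = [] :: [atom()]}).

    voters(#config{voters = Voters}) -> Voters.

    has_vote(Peer, #config{voters = Voters}) -> lists:member(Peer, Voters).

    %% Smallest number of voters that constitutes a quorum.
    quorum_size(#config{voters = Voters}) -> length(Voters) div 2 + 1.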

Slide 117

Slide 117 text

Testing

Slide 118

Slide 118 text

Property Based Testing •Use Erlang QuickCheck •Too complex to get into now •Come hear me talk about it at Erlang Factory Lite in Berlin! shameless plug

Slide 119

Slide 119 text

Other Raft Implementations https://ramcloud.stanford.edu/wiki/display/logcabin/LogCabin http://coreos.com/blog/distributed-configuration-with-etcd/index.html https://github.com/benbjohnson/go-raft https://github.com/coreos/etcd

Slide 120

Slide 120 text

github.com/andrewjstone/rafter

Slide 121

Slide 121 text

Shameless Plugs (a few more)

Slide 122

Slide 122 text

RICON West http://ricon.io/west.html Études for Erlang http://meetup.com/Erlang-NYC

Slide 123

Slide 123 text

Thanks
Andy Gross - Introducing us to Raft
Diego Ongaro - writing Raft, clarifying Tom's understanding, reviewing slides
Chris Meiklejohn - http://thinkdistributed.io - being an inspiration
Justin Sheehy - reviewing slides, correcting poor assumptions
Reid Draper - helping rubber duck solutions
Kelly McLaughlin - helping rubber duck solutions
John Daily - for his consistent pedantry concerning Tom's abuse of English
Basho - letting us indulge our intellect on the company's dime (we're hiring)

Slide 124

Slide 124 text

Any and all questions can be sent to /dev/null @tsantero @andrew_j_stone