Consensus, Raft and Rafter

Tech talk on consensus protocols, with a focus on Raft and Andrew Stone's Erlang implementation: rafter.

Presented at Erlang NYC: http://www.meetup.com/Erlang-NYC/events/131394712/

Tom Santero

August 01, 2013

Transcript

  1. Consensus, RAFT & RAFTER

  2. Tom Santero
    cats
    I newports
    distributed
    systems

  3. Andrew Stone
    cats
    I bread
    distributed
    systems

  4. @tsantero
    @andrew_j_stone

  5. tsantero
    andrewjstone

  6. tsantero
    astone
    @
    basho.com

  7. tsantero
    astone
    @
    basho.com
    (notice Andrew’s contact keeps getting shorter?)

  8. http://thinkdistributed.io
    A Chris Meiklejohn Production

  9. The Usual
    Suspects

  10. “Strongly Consistent Datastores”
    MongoDB Redis
    MySQL others...

  11. async
    {replication
    disk persistence
    Failure Detection

  12. Problem?

  13. Single Node w/ async disk writes
    Data is written to fs buffer, user is sent
    acknowledgement, power goes out
    Data not yet written to disk is LOST
    System is UNAVAILABLE
    Single Disk Solutions: fsync, battery backup, prayer
    Failure Mode 1

  14. Master/Slave with asynchronous
    replication
    Data is written by user and acknowledged
    Data synced on Primary, but crashes
    Failure Mode 2

  15. ?
    Consistent Available

  16. ?
    Consistent Available

  17. ?
    Consistent Available
    Primary Failed. Data not yet written to Secondary
    Write already ack’d to Client
    if (promote_secondary() == true) {
        fprintf(stderr, "data loss");
    } else {
        fprintf(stderr, "system unavailable");
    }

  18. (╯°□°）╯︵ ┻━┻

  19. Synchronous Writes FTW?

  20. PostgreSQL / Oracle
    Master / Slave
    Ack when Slave confirms Write

  21. Problem?

  22. Problem?
    Failure Detection
    Automated Failover
    “split brain” partitions

  23. Solution!

  24. Solution!
    Consensus protocols!
    (Paxos, ZAB, Raft)
    RYOW (Read Your Own Writes) Consistency
    Safe Serializability

  25. What is
    Consensus?

  26. “The problem of reaching agreement among
    remote processes is one of the most
    fundamental problems in distributed
    computing and is at the core of many
    algorithms for distributed data processing,
    distributed file management, and fault-
    tolerant distributed applications.”

  27. In a distributed system...
    despite failures.
    multiple processes
    agreeing on a value

  28. host0 host1 host2
    Replicated Log

  29. v0
    host0 host1 host2
    Replicated Log

  30. v0 v1 v2 v3 v4 v5 ... v(n-1)
    host0 host1 host2
    Replicated Log

  31. Consensus
    {termination
    agreement
    validity

  32. Consensus
    {termination
    agreement
    validity
    non faulty processes
    eventually decide on a value

  33. Consensus
    {termination
    agreement
    validity
    non faulty processes
    eventually decide on a value
    processes that decide
    do so on the same value

  34. Consensus
    {termination
    agreement
    validity
    non faulty processes
    eventually decide on a value
    processes that decide
    do so on the same value
    value must have been proposed

  35. Theoretical
    Real World

  36. Back to 1985...

  37. Back to 1985... The
    FLP
    Result

  38. Safety & Liveness

  39. bad things can’t happen

  40. good things
    eventually happen

  41. Consensus
    {termination
    agreement
    validity
    non faulty processes
    eventually decide on a value
    processes that decide
    do so on the same value
    value must have been proposed

  42. {termination
    agreement
    validity
    non faulty processes
    eventually decide on a value
    processes that decide
    do so on the same value
    value must have been proposed
    Safety
    Liveness

  43. Safety
    Liveness
    {termination
    agreement
    validity
    non faulty processes
    eventually decide on a value
    processes that decide
    do so on the same value
    value must have been proposed

  44. Safety
    Liveness
    {termination
    agreement
    validity
    non faulty processes
    eventually decide on a value
    processes that decide
    do so on the same value
    value must have been proposed

  45. Safety
    Liveness
    {termination
    agreement
    validity
    non faulty processes
    eventually decide on a value
    processes that decide
    do so on the same value
    value must have been proposed
    non-triviality

  46. The FLP Result:
    perfect Safety and Liveness in
    async consensus is impossible

  47. Symmetric
    vs
    Asymmetric

  48. Raft

  49. Motivation: RAMCloud
    large scale, general purpose, distributed storage
    all data lives in DRAM
    strong consistency model
    https://ramcloud.stanford.edu/

  50. Motivation: RAMCloud
    large scale, general purpose, distributed storage
    all data lives in DRAM
    strong consistency model
    100 byte object
    reads in 5μs
    https://ramcloud.stanford.edu/

  51. John Ousterhout
    Diego Ongaro
    In Search of an
    Understandable
    Consensus Algorithm
    https://ramcloud.stanford.edu/raft.pdf

  52. “Unfortunately, Paxos is quite difficult to
    understand, in spite of numerous attempts to
    make it more approachable. Furthermore, its
    architecture is unsuitable for building
    practical systems, requiring complex changes
    to create an efficient and complete solution.
    As a result, both system builders and students
    struggle with Paxos.”

  53.

  54. Design Goals:
    Understandability & Decomposition
    Strong Leadership Model
    Joint Consensus for Membership Changes

  55. C = Consensus Module
      Log = Replicated Log
      SM = State Machine

  56. [diagram: a Client plus three servers, each with a consensus module (C), a replicated Log and a state machine (SM)]

  57. 1. client makes request to Leader

  58. 2. consensus module manages request

  59. 3. persist instruction to local log

  61. 4. leader replicates command to other machines

  63. 5. command recorded to local machines’ log

  67. 7. command forwarded to state machines for processing

  70. 8. SM processes command, ACKs to client

  72. Why does that work?
      job of the consensus module to:
      manage replicated logs
      determine when it’s safe to pass to state machine for execution
      only requires majority participation

  73. Why does that work?
      job of the consensus module to:
      Safety { manage replicated logs
               determine when it’s safe to pass to state machine for execution
      Liveness { only requires majority participation
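
    To make the “determine when it’s safe” part concrete, here is a minimal sketch (ours, not rafter’s code) of the commit rule a Raft-style consensus module uses: an entry may be handed to the state machine once a majority of servers have persisted it. Function and variable names are illustrative.

    %% Illustrative sketch of the commit rule (not rafter's implementation).
    %% MatchIndexes: highest log index known to be persisted on each server,
    %% e.g. [7, 5, 7, 6, 3] for a 5-server cluster.
    commit_index(MatchIndexes) ->
        ClusterSize = length(MatchIndexes),
        Majority = ClusterSize div 2 + 1,
        Sorted = lists:reverse(lists:sort(MatchIndexes)),
        %% The Majority-th highest index is on at least a majority of servers,
        %% so every entry up to it may be passed to the state machine.
        %% (Real Raft also requires the entry to be from the leader's current term.)
        lists:nth(Majority, Sorted).

    %% commit_index([7, 5, 7, 6, 3]) =:= 6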

  74. 2F + 1

  75. 2F + 1
    solve for F

  76. F + 1
    service
    unavailable
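
    To make the arithmetic concrete (our numbers, not the deck’s): with N = 2F + 1 = 5 servers, F = 2 failures can be tolerated because the surviving 3 still form a majority; once F + 1 = 3 servers are down, no majority exists and the service is unavailable. A tiny sketch:

    %% Sketch: failures tolerated and quorum size for a cluster of Size servers.
    tolerated_failures(Size) -> (Size - 1) div 2.  %% F, where Size = 2F + 1
    quorum(Size)             -> Size div 2 + 1.    %% smallest majority, F + 1

    %% tolerated_failures(5) =:= 2,  quorum(5) =:= 3.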

  77. Fail-Stop
    Behavior

  78. What If The
    Leader
    DIES?

  79. Leader Election!

  80. 1. Select 1 of N servers to act as Leader
    2. Leader ensures Safety and Linearizability
    3. Detect crashes + Elect new Leader
    4. Maintain consistency after Leadership “coups”
    5. Depose old Leaders if they return
    6. Manage cluster topology

  81. Possible Server Roles:
    Leader Follower Candidate

  82. Possible Server Roles:
    Leader Follower Candidate
    At most 1 valid Leader at a time
    Receives commands from clients
    Commits entries
    Sends heartbeats

  83. Possible Server Roles:
    Leader Follower Candidate
    Replicate state changes
    Passive member of cluster
    during normal operation
    Vote for Candidates

  84. Possible Server Roles:
    Leader Follower Candidate
    Initiate and coordinate Leader Election
    Was previously a Follower

  85. Terms:
    election normal operation
    Term 1 Term 2 Term 3 Term 4
    no emerging leader

  86. [diagram: server roles: Follower, Candidate, Leader]

  87. Follower times out, starts election → becomes Candidate

  89. Candidate times out → starts a new election

  91. Candidate receives votes from majority of servers → becomes Leader

  93. Leader discovers server with higher term → steps down to Follower

  95. Candidate discovers current leader or higher term → returns to Follower
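
    The transitions sketched on the slides above, collected into one hedged Erlang summary (illustrative only, not rafter’s code):

    %% Role transitions from the election diagram (illustrative).
    next_role(follower,  timeout)                -> candidate;  %% times out, starts election
    next_role(candidate, timeout)                -> candidate;  %% times out, new election
    next_role(candidate, majority_votes)         -> leader;     %% receives votes from majority
    next_role(candidate, leader_or_higher_term)  -> follower;
    next_role(leader,    higher_term)            -> follower.   %% deposed by a newer term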

  97. Potential Use Cases:
    Distributed Lock Manager
    Database Transactions Automated Failover
    Configuration Management
    http://coreos.com/blog/distributed-configuration-with-etcd/index.html
    Service Discovery etc...

  98. Rafter
    github.com/andrewjstone/rafter

  99. What:
      •A labor of love, a work in progress
      •A library for building strongly consistent distributed systems in Erlang
      •Implements the Raft consensus protocol in Erlang
      •Fundamental abstraction is the replicated log

  100. Replicated Log
    •API operates on log entries
    •Log entries contain commands
    •Commands are transparent to Rafter
    •Systems built on top of rafter use pluggable state machines that process
    commands when log entries are committed.

  101. Erlang

  102. Erlang: A Concurrent Language
    •Processes are the fundamental abstraction
    •Processes can only communicate by sending each
    other messages
    •Processes do not share state
    •Processes are managed by supervisor processes in a
    hierarchy

  103. Erlang: A Concurrent Language
    loop() ->
        receive
            {From, Msg} ->
                From ! Msg,
                loop()
        end.

    %% Spawn 100,000 echo servers
    Pids = [spawn(fun loop/0) || _ <- lists:seq(1, 100000)],
    %% Send a message to the first process (Erlang lists are 1-indexed)
    lists:nth(1, Pids) ! {self(), ayo}.

  104. Erlang: A Functional Language
    • Single Assignment Variables
    • Tail-Recursion
    • Pattern Matching
    {op, {set, Key, Val}} = {op, {set, <<"key">>, <<"value">>}}
    • Bit Syntax
    Header = <<Version:8, Flags:8, Length:32>>

  105. Erlang: A Distributed Language
    Location Transparency: Processes can send messages to other
    processes without having to know if the other process is local.
    %% Send to a local gen_server process
    gen_server:cast(peer1, do_something)
    %% Send to a gen_server on another machine, addressed as {RegisteredName, Node}
    %% (the node name below is illustrative)
    gen_server:cast({peer1, 'rafter@otherhost'}, do_something)
    %% wrapped in a function with a variable name for a clean client API
    do_something(Name) -> gen_server:cast(Name, do_something).
    %% Using the API
    Result = do_something(peer1).

  106. Erlang: A Reliable Language
    •Erlang embraces “Fail-Fast”
    •Code for the good case. Fail otherwise.
    •Supervisors relaunch failed processes
    •Links and Monitors alert other processes of failure
    •Avoids coding most error paths and helps prevent
    logic errors from propagating
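
    A hedged illustration of the fail-fast style (generic Erlang, nothing rafter-specific): the worker codes only the good case, and a monitoring process is told when it crashes.

    %% Worker handles only the good case; anything else crashes it.
    adder() ->
        receive
            {add, A, B} -> io:format("~p~n", [A + B]), adder()
        end.

    watch() ->
        {Pid, Ref} = spawn_monitor(fun adder/0),
        Pid ! {add, 1, not_a_number},        %% badarith: the worker dies
        receive
            {'DOWN', Ref, process, Pid, Reason} ->
                io:format("worker died: ~p~n", [Reason])
        end.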

  107. OTP
    • OTP is a set of modules and standards that simplifies the building of
    reliable, well-engineered Erlang applications.
    • The gen_server, gen_fsm and gen_event modules are the
    most important parts of OTP
    • They wrap processes as server “behaviors” in order to facilitate
    building common, standardized distributed applications that
    integrate well with the Erlang Runtime
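
    For readers new to OTP, a minimal gen_server, just to show the shape of a behaviour module (a generic echo server, nothing rafter-specific):

    -module(echo_server).
    -behaviour(gen_server).
    -export([start_link/0, echo/1]).
    -export([init/1, handle_call/3, handle_cast/2]).

    start_link() -> gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

    %% Client API: a synchronous call into the server process.
    echo(Msg) -> gen_server:call(?MODULE, {echo, Msg}).

    %% gen_server callbacks
    init([]) -> {ok, no_state}.
    handle_call({echo, Msg}, _From, State) -> {reply, Msg, State}.
    handle_cast(_Msg, State) -> {noreply, State}.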

  108. Implementation
    github.com/andrewjstone/rafter

  109. Peers
    •Each peer is made up of two supervised processes
    •A gen_fsm that implements the raft consensus fsm
    •A gen_server that wraps the persistent log
    •An API module hides the implementation
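
    A hedged sketch of what “two supervised processes per peer” could look like as an OTP supervisor. The child modules match ones named later in the deck, but the start_link arities and restart strategy here are assumptions, not rafter’s actual supervision tree.

    %% Illustrative peer supervisor: one consensus fsm + one log server.
    -module(peer_sup).
    -behaviour(supervisor).
    -export([start_link/1, init/1]).

    start_link(Name) -> supervisor:start_link(?MODULE, [Name]).

    init([Name]) ->
        Children =
            [{consensus_fsm, {rafter_consensus_fsm, start_link, [Name]},
              permanent, 5000, worker, [rafter_consensus_fsm]},
             {log, {rafter_log, start_link, [Name]},
              permanent, 5000, worker, [rafter_log]}],
        {ok, {{one_for_all, 5, 10}, Children}}.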

  110. Rafter API
    • The entire user API lives in rafter.erl
    • rafter:start_node(peer1, kv_sm).
    • rafter:set_config(peer1, [peer1, peer2, peer3, peer4, peer5]).
    • rafter:op(peer1, {set, <<"key">>, <<"value">>}).
    • rafter:op(peer1, {get, <<"key">>}).

  111. Output State Machines
    •Commands are applied in order to each peer’s state
    machine as their entries are committed
    •All peers in a consensus group run the same type of state machine,
    passed in during start_node/2
    •Each state machine must export apply/1

  112. Hypothetical KV store
    %% API
    kv_sm:set(Key, Val) ->
        Peer = get_local_peer(),
        rafter:op(Peer, {set, Key, Val}).

    %% State Machine callback
    kv_sm:apply({set, Key, Val}) -> ets:insert(kv_sm_store, {Key, Val});
    kv_sm:apply({get, Key}) -> ets:lookup(kv_sm_store, Key).

  113. rafter_consensus_fsm
    •gen_fsm that implements Raft
    •3 states - follower, candidate, leader
    •Messages sent and received between FSMs according to the Raft protocol
    •State handling functions pattern match on messages
    to simplify and shorten handler clauses.
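
    A hedged sketch of the pattern-matching style described above, in gen_fsm shape. The #state{} fields and helper functions are invented for illustration; this is not rafter’s actual follower code.

    %% Illustrative follower state function (gen_fsm style).
    follower({request_vote, CandidateId, Term},
             State = #state{term = CurrentTerm}) when Term > CurrentTerm ->
        %% Newer term: grant the vote and stay a follower.
        send_vote(CandidateId, Term),
        {next_state, follower, State#state{term = Term, voted_for = CandidateId}};
    follower(timeout, State) ->
        %% Election timeout: become a candidate and start an election.
        {next_state, candidate, start_election(State)}.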

  114. rafter_log.erl
    • Log API used by rafter_consensus_fsm and rafter_config
    • Utilizes Binary pattern matching for reading logs
    • Writes out entries to an append-only log.
    • State machine commands encoded with term_to_binary/1
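
    A hedged sketch of the general technique: commands serialized with term_to_binary/1 and written as length-prefixed entries, read back with binary pattern matching. The field widths here are invented; rafter’s actual on-disk format may differ.

    %% Encode a command as a length-prefixed binary log entry (illustrative).
    encode_entry(Term, Index, Command) ->
        Bin = term_to_binary(Command),
        Size = byte_size(Bin),
        <<Term:64, Index:64, Size:32, Bin/binary>>.

    %% Decode one entry off the front of the log via binary pattern matching.
    decode_entry(<<Term:64, Index:64, Size:32, Bin:Size/binary, Rest/binary>>) ->
        {{Term, Index, binary_to_term(Bin)}, Rest}.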

  115. rafter_config.erl
    •Rafter handles dynamic reconfiguration of its clusters at runtime
    •Depending upon the configuration of the cluster, different code paths
    must be taken, such as determining whether a majority of votes has been
    received.
    •Instead of embedding this logic in the consensus fsm,
    it was abstracted out into a module of pure functions

  116. rafter_config.erl API
    -spec quorum_min(peer(), #config{}, dict()) -> non_neg_integer().
    -spec has_vote(peer(), #config{}) -> boolean().
    -spec allow_config(#config{}, list(peer())) -> boolean().
    -spec voters(#config{}) -> list(peer()).
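
    A hedged sketch of what pure functions behind specs like these might look like. The #config{} fields (oldservers/newservers) are assumptions for illustration and tie back to the joint-consensus membership changes mentioned earlier; they are not necessarily rafter’s actual record.

    %% Illustrative config record and pure helpers (not rafter's code).
    -record(config, {state = stable, oldservers = [], newservers = []}).

    voters(#config{state = stable, oldservers = Old}) -> Old;
    voters(#config{oldservers = Old, newservers = New}) ->
        lists:usort(Old ++ New).          %% joint consensus: both memberships vote

    has_vote(Peer, Config) -> lists:member(Peer, voters(Config)).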

  117. Testing

  118. Property Based Testing
    •Use Erlang QuickCheck
    •Too complex to get into now
    •Come hear me talk about it at Erlang Factory Lite in
    Berlin!
    shameless plug

  119. Other Raft Implementations
    https://ramcloud.stanford.edu/wiki/display/logcabin/LogCabin
    http://coreos.com/blog/distributed-configuration-with-etcd/index.html
    https://github.com/benbjohnson/go-raft
    https://github.com/coreos/etcd

  120. github.com/andrewjstone/rafter

  121. Shameless Plugs
      (a few more)

  122. RICON West
    http://ricon.io/west.html
    Études for Erlang
    http://meetup.com/Erlang-NYC

  123. Thanks File
      Andy Gross - Introducing us to Raft
      Diego Ongaro - writing Raft, clarifying Tom’s understanding, reviewing slides
      Chris Meiklejohn - http://thinkdistributed.io - being an inspiration
      Justin Sheehy - reviewing slides, correcting poor assumptions
      Reid Draper - helping rubber duck solutions
      Kelly McLaughlin - helping rubber duck solutions
      John Daily - for his consistent pedantry concerning Tom’s abuse of English
      Basho - letting us indulge our intellect on the company’s dime (we’re hiring)

  124. Any and all questions
    can be sent to /dev/null
    @tsantero @andrew_j_stone
