Choose Your Own Consistency

Tom Santero
November 23, 2013

A concurrent talk about CRDTs and strong consistency in Riak 2.0, delivered with Chris Meiklejohn (@cmeik) at Erlang Factory Lite Toronto.

Transcript

  1. Choose your own
    Consistency

  2. A concurrent talk by:
    @cmeik & @tsantero

  3. Leveraging Riak’s New Data Types
    Conflict-Free Replicated Datatypes
    Chris Meiklejohn
    @cmeik
    EFL Toronto 2013

  4. cmeiklejohn

  5. Riak Made Consistent
    Strong Consistency in Riak 2.0
    Tom Santero
    @tsantero
    EFL Toronto 2013

  6. @
    basho.com
    tsantero

  7. Riak Overview

  8. Riak Overview
    Erlang implementation of Dynamo

  10. Riak Object

  11. Key Value

  12. Riak Overview
    Consistent hashing

  16. Riak Overview
    Dynamic membership

  17. Riak Overview
    Replication factor

  18. Replica Replica Replica

  19. High Availability
    “...any non-failing node can respond to any request”
    --Gilbert & Lynch

  20. Eventual Consistency
    “Eventual consistency is a consistency model used in
    distributed computing that informally guarantees that, if no
    new updates are made to a given data item, eventually all
    accesses to that item will return the last updated value.”
    --Wikipedia

  21. Riak Overview
    Two Writes: {Writer, Value, Time}

  22. [{a, v1, t1}] [{b, v1, t2}] [{a, v1, t1}]

  23. Riak Overview
    Last Writer Wins
    Allow Mult

  24. Riak Overview
    Last Writer Wins [{b, v1, t2}]
    [{b, v1, t2}]
    [{b, v1, t2}]

  25. Riak Overview
    Allow Mult [{a, v1, t1}, {b, v1, t2}]
    [{a, v1, t1}, {b, v1, t2}]
    [{a, v1, t1}, {b, v1, t2}]

  26. User-specified
    Merge
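
The two resolution strategies on the preceding slides can be sketched as follows. This is a minimal illustration, not Riak's implementation: the module name `sibling_sketch` is hypothetical, and timestamps are assumed to be plain integers standing in for t1 and t2.

```erlang
-module(sibling_sketch).
-export([resolve/2]).

%% A write is modelled as {Writer, Value, Time}, as on the slides.
%% Time is assumed to be an integer for this sketch.
-type write() :: {term(), term(), integer()}.

%% resolve(lww, Writes): keep only the write with the greatest Time.
%% resolve(allow_mult, Writes): keep every distinct write as a sibling.
-spec resolve(lww | allow_mult, [write()]) -> [write()].
resolve(lww, []) ->
    [];
resolve(lww, [First | Rest]) ->
    [lists:foldl(fun newer/2, First, Rest)];
resolve(allow_mult, Writes) ->
    lists:usort(Writes).

%% Returns the write with the larger timestamp; ties keep the accumulator.
newer({_, _, T1} = W1, {_, _, T2}) when T1 > T2 -> W1;
newer(_, W2) -> W2.
```

With last writer wins, the write from `a` is silently discarded; with siblings, both survive and the application (or a CRDT merge) decides what to do with them.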

  27. CRDTs

  28. CRDTs
    Convergent Replicated Data Types

  29. CRDTs
    Commutative Replicated Data Types

  30. CRDTs
    Synchronization-free data structures

  31. CRDTs
    Monotonic and confluent; convergent

  32. CRDTs
    Create siblings; resolve via merge.

  33. The Theory

  37. Bounded Join Semilattices

  38. Bounded Join Semilattices
    Partially ordered set; least upper bound; ACI.

  39. Bounded Join Semilattices
    Associativity: (X · Y) · Z = X · (Y · Z)

  40. Bounded Join Semilattices
    Commutativity: X · Y = Y · X

  41. Bounded Join Semilattices
    Idempotence: X · X = X

  42. Bounded Join Semilattices
    Objects grow over time; merge computes LUB

  43. Bounded Join Semilattices
    Monotonic and confluent; convergent

  44. Bounded Join Semilattices
    Map into another via monotone functions

  45. Bounded Join Semilattices
    Examples

  46. Set; merge function: union.
    {a}  {b}  {c}
    {a, b}  {a, c}  {b, c}
    {a, b, c}

  47. 3 5 7
    5 7
    7
    Increasing natural; merge function: max.

  48. F F T
    F T
    T
    Booleans; merge function: or.
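
The three example merge functions (union, max, or) and the ACI laws from the earlier slides can be checked mechanically. A minimal sketch; the module name `lattice_sketch` and the helper `is_aci/2` are hypothetical, and the check only covers the sample values it is given, so it is a test and not a proof.

```erlang
-module(lattice_sketch).
-export([merge_set/2, merge_max/2, merge_or/2, is_aci/2]).

%% Sets (as ordsets) merge by union.
merge_set(A, B) -> ordsets:union(A, B).

%% Increasing naturals merge by max.
merge_max(A, B) -> max(A, B).

%% Booleans merge by logical or.
merge_or(A, B) -> A orelse B.

%% Checks associativity, commutativity and idempotence of Merge
%% over every triple drawn from Samples.
is_aci(Merge, Samples) ->
    lists:all(
      fun({X, Y, Z}) ->
              Merge(Merge(X, Y), Z) =:= Merge(X, Merge(Y, Z)) andalso
                  Merge(X, Y) =:= Merge(Y, X) andalso
                  Merge(X, X) =:= X
      end,
      [{X, Y, Z} || X <- Samples, Y <- Samples, Z <- Samples]).
```

Addition fails the same check because it is not idempotent (1 + 1 is not 1), which is why a counter cannot simply add replica states together.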

  49. Monotone: x <= y implies f(x) <= f(y)

  50. CvRDT Examples
    OR-SET

  51. [ [{1, a}], [] ] [ [{1, a}], [] ]

  52. [ [{1, a}], [] ] [ [{1, a}], [] ]
    [ [{1, a}], [{1, a}] ]
    [ [{1, a}], [{1, a}] ]

  53. [ [{1, a}], [] ] [ [{1, a}], [] ]
    [ [{1, a}], [{1, a}] ]
    [ [{1, a}], [{1, a}] ]
    [ [{1, a}, {2, a}], [{1, a}] ]

  54. [ [{1, a}], [] ] [ [{1, a}], [] ]
    [ [{1, a}], [{1, a}] ]
    [ [{1, a}], [{1, a}] ]
    [ [{1, a}, {2, a}], [{1, a}] ]
    [ [{1, a}, {2, a}], [{1, a}] ]

  55. [ [{1, a}], [] ] [ [{1, a}], [] ]

  56. [ [{1, a}], [] ] [ [{1, a}], [] ]
    [ [{1, a}, {2, b}], [] ]

  57. [ [{1, a}], [] ] [ [{1, a}], [] ]
    [ [{1, a}, {2, b}], [] ]
    [ [{1, a}], [{1, a}] ]

  58. [ [{1, a}], [] ] [ [{1, a}], [] ]
    [ [{1, a}, {2, b}], [] ]
    [ [{1, a}], [{1, a}] ]
    [ [{1, a}, {2, b}], [{1, a}] ]
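
The OR-Set walkthrough above can be reproduced with a small sketch. It assumes the slide notation means `[Adds, Removes]`, where each entry is a `{Tag, Element}` pair and the caller supplies a unique tag per add. The module name `orset_sketch` is hypothetical; this is not the riak_dt_orset implementation.

```erlang
-module(orset_sketch).
-export([new/0, add/3, remove/2, merge/2, value/1]).

%% State is [Adds, Removes], both ordsets of {Tag, Elem} pairs.
new() -> [[], []].

%% Add Elem under a caller-supplied unique Tag.
add(Tag, Elem, [Adds, Removes]) ->
    [ordsets:add_element({Tag, Elem}, Adds), Removes].

%% Remove only the {Tag, Elem} pairs this replica has observed.
remove(Elem, [Adds, Removes]) ->
    Observed = [Pair || {_, E} = Pair <- Adds, E =:= Elem],
    [Adds, ordsets:union(Removes, ordsets:from_list(Observed))].

%% Merge is the union of both components.
merge([A1, R1], [A2, R2]) ->
    [ordsets:union(A1, A2), ordsets:union(R1, R2)].

%% An element is present if some add of it has not been removed.
value([Adds, Removes]) ->
    lists:usort([E || {_, E} <- ordsets:subtract(Adds, Removes)]).
```

Because a remove only covers tags it has seen, a later or concurrent add under a fresh tag survives the merge.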

  59. riak_dt
    git clone git@github.com:basho/riak_dt.git

  60. -type crdt() :: term().
    -type operation() :: term().
    -type actor() :: term().
    -type value() :: term().
    -type error() :: term().
    -callback new() -> crdt().
    -callback value(crdt()) -> term().
    -callback value(term(), crdt()) -> value().
    -callback update(operation(), actor(), crdt()) ->
        {ok, crdt()} | {error, error()}.
    -callback merge(crdt(), crdt()) -> crdt().
    -callback equal(crdt(), crdt()) -> boolean().
    -callback to_binary(crdt()) -> binary().
    -callback from_binary(binary()) -> crdt().
    riak_dt/src/riak_dt.erl

  61. Riak 1.4

  62. Riak 1.4
    Counters: G-Counter, PN-Counter

  63. Riak 1.4
    Counters: non-idempotent; O(Actors)
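
The O(Actors) cost comes from keeping one count per actor. A minimal G-Counter sketch, assuming an orddict from actor to count; the module name `gcounter_sketch` is hypothetical, and this is an illustration of the technique, not the riak_dt_gcounter source.

```erlang
-module(gcounter_sketch).
-export([new/0, increment/3, value/1, merge/2]).

%% State: orddict mapping Actor -> count contributed by that actor.
new() -> orddict:new().

%% Each actor only ever bumps its own entry.
increment(Actor, By, Counter) when is_integer(By), By > 0 ->
    orddict:update_counter(Actor, By, Counter).

%% The counter value is the sum over all actors.
value(Counter) ->
    lists:sum([Count || {_Actor, Count} <- orddict:to_list(Counter)]).

%% Merge takes the per-actor maximum, so merging is idempotent.
merge(A, B) ->
    orddict:merge(fun(_Actor, CountA, CountB) -> max(CountA, CountB) end, A, B).
```

Merging the same state twice changes nothing, but replaying an `increment` does change the value, which is the sense in which counter operations are non-idempotent.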

  64. riak_dt/src/riak_dt_gcounter.erl
    -module(riak_dt_gcounter).
    -export([new/0, new/2, value/1, value/2, update/3, merge/2,
             equal/2, to_binary/1, from_binary/1]).
    -export_type([gcounter/0, gcounter_op/0]).
    -opaque gcounter() :: orddict:orddict().
    -type gcounter_op() :: increment | {increment, pos_integer()}.

  65. riak_dt/src/riak_dt_pncounter.erl
    -module(riak_dt_pncounter).
    -export([new/0, new/2, value/1, value/2,
             update/3, merge/2, equal/2, to_binary/1, from_binary/1]).
    -export_type([pncounter/0, pncounter_op/0]).
    -opaque pncounter() :: [{Actor::riak_dt:actor(), Inc::pos_integer(),
                             Dec::pos_integer()}].
    -type pncounter_op() :: riak_dt_gcounter:gcounter_op() | decrement_op().
    -type decrement_op() :: decrement | {decrement, pos_integer()}.
    -type pncounter_q() :: positive | negative.

  66. Riak 2.0

  67. Riak 2.0
    Requires bucket types; HTTP and Protocol Buffers APIs

  68. Riak 2.0
    Caveats: 2i, MapReduce, Yokozuna

  69. Riak 2.0
    Sets: Add, Remove, Membership; Idempotent

  70. Riak 2.0
    Sets: Add wins; O(Actors + Elements)

  71. riak_dt/src/riak_dt_gset.erl
    -module(riak_dt_gset).
    -behaviour(riak_dt).
    %% API
    -export([new/0, value/1, update/3, merge/2, equal/2,
             to_binary/1, from_binary/1, value/2]).
    -export_type([gset/0, binary_gset/0, gset_op/0]).
    -opaque gset() :: members().
    -type binary_gset() :: binary().
    -type gset_op() :: {add, member()}.

  72. riak_dt/src/riak_dt_orset.erl
    -module(riak_dt_orset).
    -behaviour(riak_dt).
    %% API
    -export([new/0, value/1, update/3, merge/2, equal/2,
             to_binary/1, from_binary/1, value/2, precondition_context/1]).
    -export_type([orset/0, binary_orset/0, orset_op/0]).
    -opaque orset() :: orddict:orddict().
    -type binary_orset() :: binary(). %% A binary that from_binary/1 will operate on.
    -type orset_op() :: {add, member()} | {remove, member()} |
                        {add_all, [member()]} | {remove_all, [member()]} |
                        {update, [orset_op()]}.
    -type actor() :: riak_dt:actor().
    -type member() :: term().

  73. riak_dt/src/riak_dt_orswot.erl
    -module(riak_dt_orswot).
    -behaviour(riak_dt).
    -export_type([orswot/0, orswot_op/0, binary_orswot/0]).
    -opaque orswot() :: {riak_dt_vclock:vclock(), entries()}.
    -type binary_orswot() :: binary(). %% A binary that from_binary/1 will operate on.
    -type orswot_op() :: {add, member()} | {remove, member()} |
                         {add_all, [member()]} | {remove_all, [member()]} |
                         {update, [orswot_op()]}.
    -type orswot_q() :: size | {contains, term()}.
    -type actor() :: riak_dt:actor().
    -type entries() :: [{member(), minimal_clock()}].
    -type minimal_clock() :: [dot()].
    -type dot() :: {actor(), Count::pos_integer()}.
    -type member() :: term().

  74. Riak 2.0
    Maps: Recursive; Associative Array; Nestable

  75. Riak 2.0
    Maps: Update wins; O(Actors + Elements)

  76. riak_dt/src/riak_dt_map.erl
    -module(riak_dt_map).
    -behaviour(riak_dt).
    %% API
    -export([new/0, value/1, value/2, update/3, merge/2,
             equal/2, to_binary/1, from_binary/1, precondition_context/1]).
    -export_type([map/0, binary_map/0, map_op/0]).
    -type binary_map() :: binary(). %% A binary that from_binary/1 will accept
    -type map() :: {riak_dt_vclock:vclock(), valuelist()}.
    -type field() :: {Name::term(), Type::crdt_mod()}.
    -type crdt_mod() :: riak_dt_pncounter | riak_dt_lwwreg |
                        riak_dt_od_flag |
                        riak_dt_map | riak_dt_orswot.
    -type valuelist() :: [{field(), entry()}].
    -type entry() :: {minimal_clock(), crdt()}.
    -type crdt() :: riak_dt_pncounter:pncounter() | riak_dt_od_flag:od_flag() |
                    riak_dt_lwwreg:lwwreg() |
                    riak_dt_orswot:orswot() |
                    riak_dt_map:map().
    -type map_op() :: {update, [map_field_update() | map_field_op()]}.

  77. Riak 2.0
    Maps: LWW-Register, Booleans, Sets, and Maps

  78. Riak 2.0
    LWW-Register: last writer wins

  79. -module(riak_dt_lwwreg).
    -export([new/0, value/1, value/2, update/3, merge/2,
             equal/2, to_binary/1, from_binary/1]).
    -export_type([lwwreg/0, lwwreg_op/0]).
    -opaque lwwreg() :: {term(), non_neg_integer()}.
    -type lwwreg_op() :: {assign, term(), non_neg_integer()} | {assign, term()}.
    -type lww_q() :: timestamp.
    riak_dt/src/riak_dt_lwwreg.erl
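
A register of the shape `{Value, Timestamp}` merges by keeping the pair with the larger timestamp. A minimal sketch under stated assumptions: the module name `lwwreg_sketch` is hypothetical, the initial state `{undefined, 0}` is a choice made here, and timestamp ties are broken by comparing the values so that merge stays commutative. The real module may differ on all three points.

```erlang
-module(lwwreg_sketch).
-export([new/0, assign/3, value/1, merge/2]).

%% State: {Value, Timestamp}. The initial value is an assumption.
new() -> {undefined, 0}.

%% An assignment only takes effect if its timestamp is newer.
assign(Value, Ts, {_, OldTs}) when Ts > OldTs -> {Value, Ts};
assign(_Value, _Ts, Reg) -> Reg.

value({Value, _Ts}) -> Value.

%% Last writer wins; equal timestamps fall back to term ordering
%% so that both replicas pick the same winner.
merge({_, Ts1} = R1, {_, Ts2}) when Ts1 > Ts2 -> R1;
merge({_, Ts1}, {_, Ts2} = R2) when Ts2 > Ts1 -> R2;
merge(R1, R2) -> max(R1, R2).
```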

  80. Riak 2.0
    Boolean: Enabled, Disabled; O(Actors)

  81. -module(riak_dt_enable_flag).
    -behaviour(riak_dt).
    -export([new/0, value/1, value/2, update/3, merge/2,
             equal/2, from_binary/1, to_binary/1]).
    riak_dt/src/riak_dt_enable_flag.erl

  82. This project is funded by the European Union,
    7th Research Framework Programme, ICT call 10,
    grant agreement n°609551.

  84. Strong Consistency

  85. Strong Consistency
    Why?

  86. Strong Consistency
    Why? atomicity

  87. Strong Consistency
    Why? recency

  88. Strong Consistency
    Why? partial writes

  89. A A A

  90. A A A
    Val = <<"B">>.

  92. B A A

  93. B A A
    riakc_pb_socket:get(PBC, <<"Bucket">>, <<"Key">>).

  95. B A A
    riakc_pb_socket:get(PBC, <<"Bucket">>, <<"Key">>).
    B B

  96. Strong Consistency
    Single key atomic operations

  97. Strong Consistency
    any get sees most recent put

  98. Strong Consistency
    get/modify/put cycle fails if object is changed

  99. Strong Consistency
    put w/o vclock fails if object exists
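
The two failure rules above (a get/modify/put cycle fails if the object changed; a put without a vclock fails if the object exists) amount to a compare-and-set per key. A minimal single-process sketch: the module `sc_put_sketch`, its `read/2` and `write/4` functions, and the integer token standing in for the vclock are all hypothetical, and no replication is modelled.

```erlang
-module(sc_put_sketch).
-export([new/0, read/2, write/4]).

%% Store: map of Key => {Token, Value}. Token stands in for the vclock.
new() -> #{}.

%% Returns {ok, Token, Value} or notfound.
read(Key, Store) ->
    case maps:find(Key, Store) of
        {ok, {Token, Value}} -> {ok, Token, Value};
        error -> notfound
    end.

%% write(Key, undefined, Value, Store): create; fails if Key exists.
%% write(Key, Token, Value, Store): update; fails unless Token is current.
write(Key, undefined, Value, Store) ->
    case maps:is_key(Key, Store) of
        true -> {error, exists};
        false -> {ok, Store#{Key => {1, Value}}}
    end;
write(Key, Token, Value, Store) ->
    case maps:find(Key, Store) of
        {ok, {Token, _Old}} -> {ok, Store#{Key => {Token + 1, Value}}};
        _ -> {error, modified}
    end.
```

A writer holding a stale token gets `{error, modified}` and must re-read before retrying, so no update is silently overwritten.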

  100. Consensus

  101. Distributed Consensus
    “The problem of reaching agreement among remote
    processes is one of the most fundamental problems in
    distributed computing and is at the core of many
    algorithms for distributed data processing, distributed
    file management, and fault-tolerant distributed
    applications.”
    --Fischer, Lynch, Paterson

  102. Consensus
    Guarantees: termination, agreement, validity

  103. Termination
    processes eventually decide on a value

  104. Agreement
    processes that decide do so on the same value

  105. Validity
    values must have been proposed

  106. Consensus Algorithms

  107. Consensus Algorithms
    Paxos, ZAB, Raft

  108. Paxos
    Lamport 1990: “The Part-Time Parliament”

  110. ZAB
    ZooKeeper Atomic Broadcast

  111. Raft
    Ongaro, Ousterhout 2013: “In Search of an Understandable
    Consensus Algorithm”

  112. Paxos
    coordinated requests; leaders

  113. Paxos
    2 round trips / request

  114. Node 1 Node 2 Node 3
    N++
    prepare(N)
    promise(N, V_b)
    promise(N, V_c)
    V_N = f(V_a, V_b, V_c)
    commit(N, V_N)
    accept(N)

  115. Multi-Paxos
    First Request

  116. Node 1 Node 2 Node 3
    N++; I = 0
    prepare(N, I)
    promise(N, I, V_b)
    promise(N, I, V_c)
    V_N = f(V_a, V_b, V_c)
    commit(N, I, V_N)
    accept(N, I)

  117. Multi-Paxos
    Each Additional Request

  118. Node 1 Node 2 Node 3
    I++
    commit(N, I, V)
    accept(N, I)

  119. Multi-Paxos
    Each Request: ship entire state

  120. Riak

  121. Riak
    Key/Value

  122. Riak
    Keys are Independent

  123. Riak
    read repair, active anti-entropy

  124. Riak
    individual key = isolated state

  125. Riak
    multi-paxos per key

  126. Consensus Groups

  127. Consensus Groups
    participants in decisioning; ensembles

  128. Consensus Groups
    preference lists (preflists) in Riak

  129. preflist

  134. Consensus Groups
    one group/ensemble per preflist

  135. Consensus Groups
    ring_size = 256, 256 ensembles

  136. Ensembles

  137. Ensembles
    leader election; Epochs; get/put operations

  138. Get Operations
    leader reads local object; if Epoch old: refresh

  139. Node 1 Node 2 Node 3
    obj.epoch < epoch
    get(key)
    reply(Epoch_b, Seq_b, Val_b)
    reply(Epoch_c, Seq_c, Val_c)
    Val = latest(Val_a, Val_b, Val_c)
    Val.epoch = epoch
    write(Epoch, ++Seq, Val)
    ack(Epoch, Seq)

  140. Node 1 Node 2 Node 3
    obj.epoch == epoch
    Reply = local_get(Key)

  141. Get Operations
    Worst Case: 2 roundtrips / request
    Best Case: 0 roundtrips / request
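
The leader's decision between the two get paths can be sketched as a pure function. Assumptions, none of which come from the slides: objects are `{Epoch, Seq, Val}` tuples, `latest` picks the reply with the greatest `{Epoch, Seq}` pair, and the refreshed object reuses that sequence number plus one. The module name `ensemble_get_sketch` is hypothetical and the network round trips are not modelled.

```erlang
-module(ensemble_get_sketch).
-export([latest/1, leader_get/3]).

%% Pick the object with the greatest {Epoch, Seq} pair.
latest([First | Rest]) ->
    lists:foldl(
      fun({E, S, _} = Obj, {AccE, AccS, _} = Acc) ->
              case {E, S} > {AccE, AccS} of
                  true -> Obj;
                  false -> Acc
              end
      end,
      First, Rest).

%% Best case: the local object already carries the current epoch,
%% so the leader answers from local state with no round trips.
leader_get(Epoch, {Epoch, _Seq, Val}, _PeerReplies) ->
    {local, Val};
%% Worst case: the local object is from an older epoch. Take the latest
%% value among local and peer replies and rewrite it under the new epoch.
leader_get(Epoch, {ObjEpoch, _, _} = Local, PeerReplies) when ObjEpoch < Epoch ->
    {_, Seq, Val} = latest([Local | PeerReplies]),
    {refreshed, {Epoch, Seq + 1, Val}}.
```

In the refreshed case the returned object is what the leader would then send in `write(Epoch, ++Seq, Val)` and wait for acks on, which accounts for the second round trip.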

  142. Put Operations
    leader reads local object; if Epoch old: refresh
    modify object
    commit modified object

  143. Node 1 Node 2 Node 3
    obj.epoch < epoch
    get(key)
    reply(Epoch_b, Seq_b, Val_b)
    reply(Epoch_c, Seq_c, Val_c)
    Latest = latest(Val_a, Val_b, Val_c)
    Val = modify(Latest)
    write(Epoch, ++Seq, Val)
    ack(Epoch, Seq)

  144. Node 1 Node 2 Node 3
    obj.epoch == epoch
    Latest = local_get(Key)
    Val = modify(Latest)
    write(Epoch, ++Seq, Val)
    ack(Epoch, Seq)

  145. Put Operations
    Worst Case: 2 roundtrips / write
    Best Case: 1 roundtrip / write

  146. Failed Quorums
    Leader re-election; new Epoch

  147. Cluster Membership

  148. Cluster Membership
    add/remove nodes

  149. Cluster Membership
    consensus state

  150. Cluster Membership
    multi-paxos joint consensus

  151. Existing Ensemble Joining Ensemble
    riak_01
    riak_02
    riak_03
    riak_07
    riak_08
    riak_09
    [{riak_01}, {riak_02}, {riak_03}] [{riak_07}, {riak_08}, {riak_09}]

  152. Joint-Consensus Ensemble
    [{riak_01}, {riak_02}, {riak_03}, {riak_07}, {riak_08}, {riak_09}]

  154. New Ensemble
    riak_07
    riak_08
    riak_09
    [{riak_07}, {riak_08}, {riak_09}]
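
The slides do not state the quorum rule used while the joint-consensus ensemble is in effect. A minimal sketch, assuming the usual joint-consensus rule that a decision needs a majority of the existing ensemble and a majority of the joining ensemble; the module name `joint_quorum_sketch` and both functions are hypothetical.

```erlang
-module(joint_quorum_sketch).
-export([majority/2, joint_quorum/3]).

%% True if strictly more than half of Members appear in Acks.
majority(Acks, Members) ->
    Count = length([M || M <- Members, lists:member(M, Acks)]),
    2 * Count > length(Members).

%% During the joint phase (assumed rule), both ensembles must agree.
joint_quorum(Acks, OldMembers, NewMembers) ->
    majority(Acks, OldMembers) andalso majority(Acks, NewMembers).
```

Under this rule, all three old nodes plus a single new node is not enough, so neither ensemble can decide on its own while the transition is in progress.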

  155. Q & A
