Choose Your Own Consistency

Tom Santero
November 23, 2013

A concurrent talk about CRDTs and strong consistency in Riak 2.0, delivered with Chris Meiklejohn (@cmeik) at Erlang Factory Lite Toronto.

Transcript

  1. Choose your own Consistency

  2. A concurrent talk by: @cmeik & @tsantero

  3. Leveraging Riak’s New Data Types: Conflict-Free Replicated Datatypes. Chris Meiklejohn (@cmeik), EFL Toronto 2013

  4. cmeiklejohn

  5. Riak Made Consistent: Strong Consistency in Riak 2.0. Tom Santero (@tsantero), EFL Toronto 2013

  6. tsantero @ basho.com

  7. Riak Overview

  8. Riak Overview Erlang implementation of Dynamo

  9. Riak Overview Erlang implementation of Dynamo

  10. Riak Object

  11. Key Value

  12. Riak Overview Consistent hashing

  16. Riak Overview Dynamic membership

  17. Riak Overview Replication factor

  18. Replica Replica Replica
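
The consistent hashing and replication factor slides can be illustrated with a small sketch. It assumes a 160-bit SHA-1 hash space split into `RingSize` equal partitions (with `RingSize` a power of two) and replicas placed on the next `N` partitions around the ring. The module and function names are hypothetical, and Riak's own ring code differs in detail.

```erlang
-module(ring_sketch).
-export([partition/2, preflist/3]).

%% Hypothetical helper: map a {Bucket, Key} pair onto one of RingSize
%% equally sized partitions of a 160-bit SHA-1 hash space.
%% RingSize is assumed to be a power of two so the division is exact.
partition({Bucket, Key}, RingSize) ->
    <<Hash:160/integer>> = crypto:hash(sha, term_to_binary({Bucket, Key})),
    PartitionSize = (1 bsl 160) div RingSize,
    Hash div PartitionSize.

%% The N consecutive partitions, starting at the key's own partition and
%% wrapping around the ring, that would hold the N replicas.
preflist(BKey, RingSize, N) ->
    First = partition(BKey, RingSize),
    [(First + I) rem RingSize || I <- lists:seq(0, N - 1)].
```

With `RingSize = 64` and `N = 3`, every key maps to three neighbouring partitions, which is the "Replica Replica Replica" picture.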

  19. High Availability “...any non-failing node can respond to any request” (Gilbert & Lynch)

  20. Eventual Consistency “Eventual consistency is a consistency model used in distributed computing that informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value.” (Wikipedia)

  21. Riak Overview Two Writes: {Writer, Value, Time}

  22. [{a, v1, t1}] [{b, v1, t2}] [{a, v1, t1}]

  23. Riak Overview Last Writer Wins; Allow Mult

  24. Riak Overview Last Writer Wins: [{b, v1, t2}] [{b, v1, t2}] [{b, v1, t2}]

  25. Riak Overview Allow Mult: [{a, v1, t1}, {b, v1, t2}] [{a, v1, t1}, {b, v1, t2}] [{a, v1, t1}, {b, v1, t2}]

  26. User-specified merge
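
The two resolution strategies above can be sketched as follows, modelling each write as the `{Writer, Value, Time}` triple from the slides. This is an illustration only: the module name and both functions are hypothetical, `Time` is assumed to be comparable (integers here), and the sibling list is assumed to be non-empty.

```erlang
-module(sibling_sketch).
-export([last_writer_wins/1, resolve/2]).

%% Last writer wins: keep only the sibling with the greatest timestamp,
%% silently discarding every other concurrent write.
last_writer_wins([First | Rest]) ->
    lists:foldl(fun({_, _, T} = W, {_, _, BestT}) when T > BestT -> W;
                   (_W, Best) -> Best
                end, First, Rest).

%% Allow mult: all siblings are kept, and the application folds them
%% together with its own merge function MergeFun(Sibling, Acc).
resolve(MergeFun, [First | Rest]) ->
    lists:foldl(MergeFun, First, Rest).
```

For example, `resolve/2` with a set-union fun keeps data from both writers, whereas `last_writer_wins/1` keeps only writer `b`'s value.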

  27. CRDTs

  28. CRDTs Convergent Replicated Data Types

  29. CRDTs Commutative Replicated Data Types

  30. CRDTs Synchronization-free data structures

  31. CRDTs Monotonic and confluent; convergent

  32. CRDTs Create siblings; resolve via merge.

  33. The Theory

  37. Bounded Join Semilattices

  38. Bounded Join Semilattices Partially ordered set; least upper bound; ACI.

  39. Bounded Join Semilattices Associativity: (X · Y) · Z = X · (Y · Z)

  40. Bounded Join Semilattices Commutativity: X · Y = Y · X

  41. Bounded Join Semilattices Idempotence: X · X = X

  42. Bounded Join Semilattices Objects grow over time; merge computes LUB

  43. Bounded Join Semilattices Monotonic and confluent; convergent

  44. Bounded Join Semilattices Map into another via monotone functions

  45. Bounded Join Semilattices Examples

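  46. Set; merge function: union. Lattice diagram: {a}, {b}, {c} join to {a, b}, {a, c}, {b, c}, which join to {a, b, c}.

  47. Increasing natural; merge function: max. Lattice diagram: 3, 5, 7 join to 5, 7, which join to 7.

  48. Booleans; merge function: or. Lattice diagram: F, F, T join to F, T, which join to T.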

  49. Monotone function f: x <= y implies f(x) <= f(y)
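
The three example lattices and the ACI laws can be checked with a short sketch. The module name and the `set | max | bool` tags are assumptions made for illustration. Sets are represented as `ordsets`, so structural equality coincides with set equality.

```erlang
-module(lattice_sketch).
-export([join/3, aci_holds/4]).

%% Least upper bound (merge) for the three example lattices.
join(set, X, Y) -> ordsets:union(X, Y);   % sets under union
join(max, X, Y) -> max(X, Y);             % increasing naturals under max
join(bool, X, Y) -> X orelse Y.           % booleans under or

%% Check associativity, commutativity and idempotence for three sample values.
aci_holds(Type, X, Y, Z) ->
    Assoc = join(Type, join(Type, X, Y), Z) =:= join(Type, X, join(Type, Y, Z)),
    Comm = join(Type, X, Y) =:= join(Type, Y, X),
    Idem = join(Type, X, X) =:= X,
    Assoc andalso Comm andalso Idem.
```

A monotone map between lattices in this setting would be, for instance, `length/1` from the set lattice into the max lattice: a larger set never has a smaller size.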

  50. CvRDT Examples OR-SET

  51. [ [{1, a}], [] ] [ [{1, a}], [] ]

  52. [ [{1, a}], [] ] [ [{1, a}], [] ]

    [ [{1, a}], [{1, a}] ] [ [{1, a}], [{1, a}] ]
  53. [ [{1, a}], [] ] [ [{1, a}], [] ]

    [ [{1, a}], [{1, a}] ] [ [{1, a}], [{1, a}] ] [ [{1, a}, {2, a}], [{1, a}] ]
  54. [ [{1, a}], [] ] [ [{1, a}], [] ]

    [ [{1, a}], [{1, a}] ] [ [{1, a}], [{1, a}] ] [ [{1, a}, {2, a}], [{1, a}] ] [ [{1, a}, {2, a}], [{1, a}] ]
  55. [ [{1, a}], [] ] [ [{1, a}], [] ]

  56. [ [{1, a}], [] ] [ [{1, a}], [] ]

    [ [{1, a}, {2, b}], [] ]
  57. [ [{1, a}], [] ] [ [{1, a}], [] ]

    [ [{1, a}, {2, b}], [] ] [ [{1, a}], [{1, a}] ]
  58. [ [{1, a}], [] ] [ [{1, a}], [] ]

    [ [{1, a}, {2, b}], [] ] [ [{1, a}], [{1, a}] ] [ [{1, a}, {2, b}], [{1, a}] ]
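
The OR-Set states on the preceding slides can be reproduced with a minimal sketch. It assumes the state is a pair of tagged-element sets, `{Adds, Removes}`, matching the `[ Adds, Removes ]` lists shown, and that the caller supplies a unique tag for every add. The module is hypothetical and is not the `riak_dt_orset` implementation.

```erlang
-module(orset_sketch).
-export([new/0, add/3, remove/2, merge/2, value/1]).

new() -> {[], []}.

%% Record the element under a fresh tag (uniqueness is the caller's job here).
add(Tag, Elem, {Adds, Removes}) ->
    {ordsets:add_element({Tag, Elem}, Adds), Removes}.

%% Remove only the tags this replica has observed for Elem.
remove(Elem, {Adds, Removes}) ->
    Observed = [P || {_, E} = P <- Adds, E =:= Elem],
    {Adds, ordsets:union(Removes, ordsets:from_list(Observed))}.

%% Merge is a pairwise union, so it is associative, commutative, idempotent.
merge({A1, R1}, {A2, R2}) ->
    {ordsets:union(A1, A2), ordsets:union(R1, R2)}.

%% An element is present if at least one of its tags has not been removed.
value({Adds, Removes}) ->
    lists:usort([E || {_, E} <- ordsets:subtract(Adds, Removes)]).
```

Starting both replicas at `{[{1, a}], []}`, adding `b` on one and removing `a` on the other, then merging, yields `{[{1, a}, {2, b}], [{1, a}]}` as on slide 58. Re-adding `a` under a new tag after a remove makes it visible again, as on slides 53 and 54.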
  59. riak_dt git clone git@github.com:basho/riak_dt.git

  60. riak_dt/src/riak_dt.erl

        -type crdt() :: term().
        -type operation() :: term().
        -type actor() :: term().
        -type value() :: term().
        -type error() :: term().

        -callback new() -> crdt().
        -callback value(crdt()) -> term().
        -callback value(term(), crdt()) -> value().
        -callback update(operation(), actor(), crdt()) ->
            {ok, crdt()} | {error, error()}.
        -callback merge(crdt(), crdt()) -> crdt().
        -callback equal(crdt(), crdt()) -> boolean().
        -callback to_binary(crdt()) -> binary().
        -callback from_binary(binary()) -> crdt().
  61. Riak 1.4

  62. Riak 1.4 Counters: G-Counter, PN-Counter

  63. Riak 1.4 Counters: non-idempotent; O(Actors)

  64. riak_dt/src/riak_dt_gcounter.erl

        -module(riak_dt_gcounter).

        -export([new/0, new/2, value/1, value/2, update/3, merge/2, equal/2,
                 to_binary/1, from_binary/1]).
        -export_type([gcounter/0, gcounter_op/0]).

        -opaque gcounter() :: orddict:orddict().
        -type gcounter_op() :: increment | {increment, pos_integer()}.
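
A grow-only counter over an `orddict`, as the `gcounter()` type above suggests, might look like the following sketch. It is a simplified illustration under a hypothetical module name, not the `riak_dt_gcounter` source: each actor owns one entry, the value is the sum of the entries, and merge takes the per-actor maximum.

```erlang
-module(gcounter_sketch).
-export([new/0, increment/3, value/1, merge/2]).

new() -> orddict:new().

%% Each actor only ever increases its own entry.
increment(Actor, By, Counter) when is_integer(By), By > 0 ->
    orddict:update_counter(Actor, By, Counter).

%% The counter's value is the sum over all actors.
value(Counter) ->
    lists:sum([N || {_Actor, N} <- orddict:to_list(Counter)]).

%% Per-actor maximum: merging the same state twice changes nothing.
merge(C1, C2) ->
    orddict:merge(fun(_Actor, N1, N2) -> max(N1, N2) end, C1, C2).
```

Note that the merge is idempotent while `increment/3` is not: replaying an increment counts it twice. State is one entry per actor, hence O(Actors).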
  65. riak_dt/src/riak_dt_pncounter.erl

        -module(riak_dt_pncounter).

        -export([new/0, new/2, value/1, value/2,
                 update/3, merge/2, equal/2, to_binary/1, from_binary/1]).
        -export_type([pncounter/0, pncounter_op/0]).

        -opaque pncounter() :: [{Actor::riak_dt:actor(), Inc::pos_integer(),
                                 Dec::pos_integer()}].
        -type pncounter_op() :: riak_dt_gcounter:gcounter_op() | decrement_op().
        -type decrement_op() :: decrement | {decrement, pos_integer()}.
        -type pncounter_q() :: positive | negative.
  66. Riak 2.0

  67. Riak 2.0 Requires bucket types; HTTP and Protocol Buffers API

  68. Riak 2.0 Caveats: 2i, MapReduce, Yokozuna

  69. Riak 2.0 Sets: Add, Remove, Membership; Idempotent

  70. Riak 2.0 Sets: Add wins; O(Actors + Elements)

  71. riak_dt/src/riak_dt_gset.erl

        -module(riak_dt_gset).
        -behaviour(riak_dt).

        %% API
        -export([new/0, value/1, update/3, merge/2, equal/2,
                 to_binary/1, from_binary/1, value/2]).
        -export_type([gset/0, binary_gset/0, gset_op/0]).

        -opaque gset() :: members().
        -type binary_gset() :: binary().
        -type gset_op() :: {add, member()}.
  72. riak_dt/src/riak_dt_orset.erl

        -module(riak_dt_orset).
        -behaviour(riak_dt).

        %% API
        -export([new/0, value/1, update/3, merge/2, equal/2,
                 to_binary/1, from_binary/1, value/2, precondition_context/1]).
        -export_type([orset/0, binary_orset/0, orset_op/0]).

        -opaque orset() :: orddict:orddict().
        -type binary_orset() :: binary(). %% A binary that from_binary/1 will operate on.
        -type orset_op() :: {add, member()} | {remove, member()} |
                            {add_all, [member()]} | {remove_all, [member()]} |
                            {update, [orset_op()]}.
        -type actor() :: riak_dt:actor().
        -type member() :: term().
  73. riak_dt/src/riak_dt_orswot.erl

        -module(riak_dt_orswot).
        -behaviour(riak_dt).

        -export_type([orswot/0, orswot_op/0, binary_orswot/0]).

        -opaque orswot() :: {riak_dt_vclock:vclock(), entries()}.
        -type binary_orswot() :: binary(). %% A binary that from_binary/1 will operate on.
        -type orswot_op() :: {add, member()} | {remove, member()} |
                             {add_all, [member()]} | {remove_all, [member()]} |
                             {update, [orswot_op()]}.
        -type orswot_q() :: size | {contains, term()}.
        -type actor() :: riak_dt:actor().
        -type entries() :: [{member(), minimal_clock()}].
        -type minimal_clock() :: [dot()].
        -type dot() :: {actor(), Count::pos_integer()}.
        -type member() :: term().
  74. Riak 2.0 Maps: Recursive; Associative Array; Nestable

  75. Riak 2.0 Maps: Update wins; O(Actors + Elements)

  76. riak_dt/src/riak_dt_map.erl

        -module(riak_dt_map).
        -behaviour(riak_dt).

        %% API
        -export([new/0, value/1, value/2, update/3, merge/2,
                 equal/2, to_binary/1, from_binary/1, precondition_context/1]).
        -export_type([map/0, binary_map/0, map_op/0]).

        -type binary_map() :: binary(). %% A binary that from_binary/1 will accept
        -type map() :: {riak_dt_vclock:vclock(), valuelist()}.
        -type field() :: {Name::term(), Type::crdt_mod()}.
        -type crdt_mod() :: riak_dt_pncounter | riak_dt_lwwreg |
                            riak_dt_od_flag |
                            riak_dt_map | riak_dt_orswot.
        -type valuelist() :: [{field(), entry()}].
        -type entry() :: {minimal_clock(), crdt()}.
        -type crdt() :: riak_dt_pncounter:pncounter() | riak_dt_od_flag:od_flag() |
                        riak_dt_lwwreg:lwwreg() |
                        riak_dt_orswot:orswot() |
                        riak_dt_map:map().
        -type map_op() :: {update, [map_field_update() | map_field_op()]}.
  77. Riak 2.0 Maps: LWW-Register, Booleans, Sets, and Maps

  78. Riak 2.0 LWW-Register: last writer wins

  79. riak_dt/src/riak_dt_lwwreg.erl

        -module(riak_dt_lwwreg).

        -export([new/0, value/1, value/2, update/3, merge/2,
                 equal/2, to_binary/1, from_binary/1]).
        -export_type([lwwreg/0, lwwreg_op/0]).

        -opaque lwwreg() :: {term(), non_neg_integer()}.
        -type lwwreg_op() :: {assign, term(), non_neg_integer()} | {assign, term()}.
        -type lww_q() :: timestamp.
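
Given the `{term(), non_neg_integer()}` shape of `lwwreg()` above, a last-writer-wins register can be sketched as a value paired with a timestamp, where merge keeps the pair with the larger timestamp. The module name, the `undefined` initial value, and the tie-break on Erlang term order are assumptions for illustration rather than the `riak_dt_lwwreg` behaviour.

```erlang
-module(lwwreg_sketch).
-export([new/0, assign/3, merge/2, value/1]).

new() -> {undefined, 0}.

%% An assignment only takes effect if its timestamp is newer.
assign(Value, Ts, {_Old, OldTs}) when Ts > OldTs -> {Value, Ts};
assign(_Value, _Ts, Reg) -> Reg.

%% Keep the register with the larger timestamp; on a tie fall back to
%% term order so that merge stays commutative.
merge({_, T1} = R1, {_, T2}) when T1 > T2 -> R1;
merge({_, T1}, {_, T2} = R2) when T2 > T1 -> R2;
merge(R1, R2) -> max(R1, R2).

value({Value, _Ts}) -> Value.
```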
  80. Riak 2.0 Boolean: Enabled, Disabled; O(Actors)

  81. riak_dt/src/riak_dt_enable_flag.erl

        -module(riak_dt_enable_flag).
        -behaviour(riak_dt).

        -export([new/0, value/1, value/2, update/3, merge/2, equal/2,
                 from_binary/1, to_binary/1]).
  82. This project is funded by the European Union, 7th Research Framework Programme, ICT call 10, grant agreement n°609551.

  84. Strong Consistency

  85. Strong Consistency Why?

  86. Strong Consistency Why? atomicity

  87. Strong Consistency Why? recency

  88. Strong Consistency Why? partial writes

  89. A A A

  90. A A A Val = <<"B">>.

  91. A A A Val = <<"B">>.

  92. B A A

  93. B A A riakc_pb_socket:get(PBC, <<"Bucket">>, <<"Key">>).

  94. B A A riakc_pb_socket:get(PBC, <<"Bucket">>, <<"Key">>).

  95. B A A riakc_pb_socket:get(PBC, <<"Bucket">>, <<"Key">>). B B

  96. Strong Consistency Single key atomic operations

  97. Strong Consistency any get sees most recent put

  98. Strong Consistency get/modify/put cycle fails if object is changed

  99. Strong Consistency put w/o vclock fails if object exists
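
The single-key guarantees on the last few slides amount to a compare-and-swap on each key. The sketch below models them with an in-memory map and an integer version standing in for the vclock. Everything in it (module, function names, the `undefined` marker for "no vclock", the error atoms) is hypothetical and is not the Riak client API.

```erlang
-module(cas_sketch).
-export([new/0, read/2, write/4]).

new() -> #{}.

%% A read returns the current version along with the value.
read(Key, Store) ->
    case maps:find(Key, Store) of
        {ok, {Version, Value}} -> {ok, Version, Value};
        error -> notfound
    end.

%% A write without a version (no vclock) fails if the object already exists.
write(Key, undefined, Value, Store) ->
    case maps:is_key(Key, Store) of
        true -> {error, exists};
        false -> {ok, Store#{Key => {1, Value}}}
    end;
%% A get/modify/put cycle fails if the object changed since it was read.
write(Key, Version, Value, Store) ->
    case maps:find(Key, Store) of
        {ok, {Version, _Old}} -> {ok, Store#{Key => {Version + 1, Value}}};
        _ -> {error, modified}
    end.
```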

  100. Consensus

  101. Distributed Consensus “The problem of reaching agreement among remote processes is one of the most fundamental problems in distributed computing and is at the core of many algorithms for distributed data processing, distributed file management, and fault-tolerant distributed applications.” (Fischer, Lynch, Paterson)

  102. Consensus Guarantees: termination, agreement, validity

  103. Termination processes eventually decide on a value

  104. Agreement processes that decide do so on the same value

  105. Validity values must have been proposed

  106. Consensus Algorithms

  107. Consensus Algorithms Paxos, ZAB, Raft

  108. Paxos Lamport 1990: “The Part-Time Parliament”

  110. ZAB ZooKeeper Atomic Broadcast

  111. Raft Ousterhout, Ongaro 2013: “In Search of an Understandable Consensus Algorithm”

  112. Paxos coordinated requests; leaders

  113. Paxos 2 round trips / request

  114. Node 1, Node 2, Node 3 (message flow, Node 1 coordinating): N++; prepare(N); promise(N, V_b); promise(N, V_c); V_N = f(V_a, V_b, V_c); commit(N, V_N); accept(N)

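
The participant side of the exchange above can be sketched as a small state machine that uses the slide's message names (`prepare`, `promise`, `commit`, `accept`). This is a deliberately simplified illustration under a hypothetical module name: the state is just `{PromisedN, Value}`, whereas a full Paxos acceptor tracks its accepted ballot separately and persists state.

```erlang
-module(acceptor_sketch).
-export([new/1, handle/2]).

%% State: {highest ballot promised so far, current value}.
new(Value) -> {0, Value}.

%% Round trip 1: promise not to honour lower ballots, and report our value.
handle({prepare, N}, {Promised, Value}) when N > Promised ->
    {{promise, N, Value}, {N, Value}};
handle({prepare, _N}, State) ->
    {nack, State};
%% Round trip 2: take the coordinator's chosen value unless we have
%% since promised a higher ballot to someone else.
handle({commit, N, NewValue}, {Promised, _Value}) when N >= Promised ->
    {{accept, N}, {N, NewValue}};
handle({commit, _N, _NewValue}, State) ->
    {nack, State}.
```

Each call returns `{Reply, NewState}`. The two message types correspond to the two round trips per request noted on the previous slide.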
  115. Multi-Paxos First Request

  116. Node 1, Node 2, Node 3 (message flow, Node 1 coordinating): N++; I = 0; prepare(N, I); promise(N, I, V_b); promise(N, I, V_c); V_N = f(V_a, V_b, V_c); commit(N, I, V_N); accept(N, I)

  117. Multi-Paxos Each Additional Request

  118. Node 1, Node 2, Node 3 (message flow, Node 1 coordinating): I++; commit(N, I, V); accept(N, I)

  119. Multi-Paxos Each Request: ship entire state

  120. Riak

  121. Riak Key/Value

  122. Riak Keys are Independent

  123. Riak read repair, active anti-entropy

  124. Riak individual key = isolated state

  125. Riak multi-paxos per key

  126. Consensus Groups

  127. Consensus Groups participants in decisioning; ensembles

  128. Consensus Groups preference lists (preflists) in Riak

  129. preflist

  134. Consensus Groups one group/ensemble per preflist

  135. Consensus Groups ring_size = 256, 256 ensembles

  136. Ensembles

  137. Ensembles leader election; Epochs; get/put operations

  138. Get Operations leader reads local object; if Epoch old: refresh

  139. Node 1, Node 2, Node 3 (leader is Node 1; obj.epoch < epoch): get(key); reply(Epoch_b, Seq_b, Val_b); reply(Epoch_c, Seq_c, Val_c); Val = latest(Val_a, Val_b, Val_c); Val.epoch = epoch; write(Epoch, ++Seq, Val); ack(Epoch, Seq)

  140. Node 1, Node 2, Node 3 (leader is Node 1; obj.epoch == epoch): Reply = local_get(Key)

  141. Get Operations Worst case: 2 round trips / request. Best case: 0 round trips / request.

  142. Put Operations leader reads local object; if Epoch old: refresh; modify object; commit modified object

  143. Node 1, Node 2, Node 3 (leader is Node 1; obj.epoch < epoch): get(key); reply(Epoch_b, Seq_b, Val_b); reply(Epoch_c, Seq_c, Val_c); Latest = latest(Val_a, Val_b, Val_c); Val = modify(Latest); write(Epoch, ++Seq, Val); ack(Epoch, Seq)

  144. Node 1, Node 2, Node 3 (leader is Node 1; obj.epoch == epoch): Latest = local_get(Key); Val = modify(Latest); write(Epoch, ++Seq, Val); ack(Epoch, Seq)

  145. Put Operations Worst case: 2 round trips / write. Best case: 1 round trip / write.

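
The leader-side get and put logic on the preceding slides can be sketched as pure functions. This is a sketch under stated assumptions: objects are modelled as `{ObjEpoch, Seq, Value}` tuples, `PeerObjs` stands in for the replies a real leader would collect over the network, and each function returns `{Object, NewSeq, RoundTrips}`. The module and function names are hypothetical and do not come from Riak's ensemble code.

```erlang
-module(ensemble_sketch).
-export([leader_get/4, leader_put/5]).

%% Tuple ordering makes lists:max/1 pick the highest epoch, then highest seq.
latest(Objs) -> lists:max(Objs).

%% If the local object is from the current epoch it can be trusted as is;
%% otherwise one round trip is spent fetching the peers' copies.
refresh(Epoch, {Epoch, _, _} = Local, _PeerObjs) -> {Local, 0};
refresh(_Epoch, Local, PeerObjs) -> {latest([Local | PeerObjs]), 1}.

%% Get: a local read in the best case; refresh plus rewrite in the worst.
leader_get(Epoch, Seq, Local, PeerObjs) ->
    case refresh(Epoch, Local, PeerObjs) of
        {Obj, 0} -> {Obj, Seq, 0};
        {{_, _, Val}, 1} -> {{Epoch, Seq + 1, Val}, Seq + 1, 2}
    end.

%% Put: always one write round trip, plus the refresh if the epoch was old.
leader_put(Epoch, Seq, Local, PeerObjs, ModFun) ->
    {{_, _, Latest}, ReadTrips} = refresh(Epoch, Local, PeerObjs),
    {{Epoch, Seq + 1, ModFun(Latest)}, Seq + 1, ReadTrips + 1}.
```

The round-trip counts returned match the best and worst cases stated on the slides: 0 or 2 for gets, 1 or 2 for puts.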
  146. Failed Quorums Leader re-election; new Epoch

  147. Cluster Membership

  148. Cluster Membership add/remove nodes

  149. Cluster Membership consensus state

  150. Cluster Membership multi-paxos joint consensus

  151. Existing Ensemble (riak_01, riak_02, riak_03): [{riak_01}, {riak_02}, {riak_03}]. Joining Ensemble (riak_07, riak_08, riak_09): [{riak_07}, {riak_08}, {riak_09}]

  152. Joint-Consensus Ensemble [{riak_01}, {riak_02}, {riak_03}, {riak_07}, {riak_08}, {riak_09}]

  153. Joint-Consensus Ensemble [{riak_01}, {riak_02}, {riak_03}, {riak_07}, {riak_08}, {riak_09}]

  154. New Ensemble riak_07 riak_08 riak_09 [{riak_07}, {riak_08}, {riak_09}]
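
During the joint-consensus phase shown above, the usual rule (as described for joint consensus in the Raft paper) is that a decision needs a majority of the old membership and a majority of the new membership at the same time. The sketch below assumes that rule. The module and function names are hypothetical, and the slides do not spell out Riak's exact quorum check.

```erlang
-module(joint_quorum_sketch).
-export([has_quorum/2, joint_quorum/3]).

%% A strict majority of Members appears in the list of acknowledging nodes.
has_quorum(Acks, Members) ->
    Count = length([M || M <- Members, lists:member(M, Acks)]),
    Count * 2 > length(Members).

%% Joint consensus: the old and the new ensemble must both reach quorum.
joint_quorum(Acks, Old, New) ->
    has_quorum(Acks, Old) andalso has_quorum(Acks, New).
```

With the ensembles from the slides, acknowledgements from all of `riak_01..03` plus `riak_07` are not enough, because the joining ensemble has only one of its three members on board.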

  155. Q & A