Choose Your Own Consistency

London Erlang User Group, April 2014

Christopher Meiklejohn

April 24, 2014

Transcript

  1. Choose Your Own Consistency. Christopher Meiklejohn (@cmeik), Basho Technologies, Inc. London Erlang User Group, April 2014

  2. Riak Overview

  3. Riak Overview Erlang implementation of Dynamo

  5. Riak Object

  6. Key Value

  7. Riak Overview Consistent hashing

  11. Riak Overview Dynamic membership

  12. Riak Overview Replication factor

  13. Replica Replica Replica

  14. High Availability “...any non-failing node can respond to any request” --Gilbert & Lynch

  15. Eventual Consistency “Eventual consistency is a consistency model used in distributed computing that informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value.” --Wikipedia

  16. Riak Overview Two Writes: {Writer, Value, Time}

  17. Concurrent Writes: [{a, v1, t1}] [{b, v1, t2}] [{a, v1, t1}]

  18. Last Writer Wins: [{a, v1, t1}] [{b, v1, t2}] [{a, v1, t1}] becomes [{b, v1, t2}] [{b, v1, t2}] [{b, v1, t2}]

  19. Allow Mult: [{a, v1, t1}] [{b, v1, t2}] [{a, v1, t1}] becomes [{a, v1, t1}, {b, v1, t2}] [{a, v1, t1}, {b, v1, t2}] [{a, v1, t1}, {b, v1, t2}]

  20. User-specified Merge
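
The three resolution strategies on slides 17 to 20 can be sketched in a few lines of Erlang. This is a minimal illustration, not Riak's implementation: the module name `sibling_sketch`, its function names, and the use of integer timestamps are all assumptions made for the example. Writes are the `{Writer, Value, Time}` triples from the slides.

```erlang
-module(sibling_sketch).
-export([lww/1, allow_mult/1, resolve/2]).

%% Hypothetical sketch: a write is {Writer, Value, Time}, Time an integer.

%% Last writer wins: keep only the write with the greatest timestamp.
%% On equal timestamps the earlier list entry is kept.
lww([First | Rest]) ->
    lists:foldl(fun({_, _, T} = W, {_, _, BestT}) when T > BestT -> W;
                   (_, Best) -> Best
                end, First, Rest).

%% allow_mult: keep every distinct concurrent write as a sibling.
allow_mult(Writes) ->
    lists:usort(Writes).

%% User-specified merge: fold the sibling values with a caller-supplied
%% two-argument function.
resolve(MergeFun, [{_, V, _} | Rest]) ->
    lists:foldl(fun({_, V1, _}, Acc) -> MergeFun(V1, Acc) end, V, Rest).
```

For example, `sibling_sketch:resolve(fun ordsets:union/2, Siblings)` merges siblings whose values are ordered sets; a merge function that is associative, commutative, and idempotent gives the same answer regardless of sibling order.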

  21. CRDTs

  22. CRDTs Convergent Replicated Data Types

  23. CRDTs Commutative Replicated Data Types

  24. CRDTs Synchronization-free data structures

  25. CRDTs Monotonic & confluent; convergent

  26. CRDTs Create siblings; resolve via merge.

  27. The Theory

  31. Bounded Join Semilattices

  32. Bounded Join Semilattices Partially ordered set; least upper bound; ACI.

  33. Bounded Join Semilattices Associativity: (X · Y) · Z = X · (Y · Z)

  34. Bounded Join Semilattices Commutativity: X · Y = Y · X

  35. Bounded Join Semilattices Idempotence: X · X = X

  36. Bounded Join Semilattices Objects grow over time; merge computes LUB

  37. Bounded Join Semilattices Monotonic and confluent; convergent

  38. Bounded Join Semilattices Map into another via monotone functions

  39. Bounded Join Semilattices Examples

  40. Set; merge function: union. Lattice: {a}, {b}, {c}; {a, b}, {a, c}, {b, c}; {a, b, c}

  41. Increasing natural; merge function: max. Lattice: 3, 5, 7; 5, 7; 7

  42. Booleans; merge function: or. Lattice: F, F, T; F, T; T

  43. Monotone function f: x <= y implies f(x) <= f(y)
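
The three example lattices (set with union, increasing natural with max, boolean with or) and the ACI properties can be checked directly. The following is a small sketch under stated assumptions: the module `lattice_sketch` and its function names are invented for illustration, and sets are represented as `ordsets`.

```erlang
-module(lattice_sketch).
-export([merge/3, aci_holds/4, size_is_monotone/2]).

%% Hypothetical sketch: merge/3 computes the least upper bound for the
%% three example lattices from the slides.
merge(set, X, Y) -> ordsets:union(X, Y);
merge(max, X, Y) -> max(X, Y);
merge(bool, X, Y) -> X orelse Y.

%% Check associativity, commutativity and idempotence on sample values.
aci_holds(Type, X, Y, Z) ->
    M = fun(A, B) -> merge(Type, A, B) end,
    M(M(X, Y), Z) =:= M(X, M(Y, Z))
        andalso M(X, Y) =:= M(Y, X)
        andalso M(X, X) =:= X.

%% Set size is a monotone map from the set lattice into the naturals:
%% whenever X is a subset of Y, length(X) =< length(Y).
size_is_monotone(X, Y) ->
    (not ordsets:is_subset(X, Y)) orelse (length(X) =< length(Y)).
```

Calling `lattice_sketch:aci_holds(set, [a], [b], [c])` exercises all three laws for one triple of values; it is a spot check, not a proof.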

  44. CvRDT Examples OR-SET

  45. [ [{1, a}], [] ] [ [{1, a}], [] ]

  46. [ [{1, a}], [] ] [ [{1, a}], [] ]

    [ [{1, a}], [{1, a}] ] [ [{1, a}], [{1, a}] ]
  47. [ [{1, a}], [] ] [ [{1, a}], [] ]

    [ [{1, a}], [{1, a}] ] [ [{1, a}], [{1, a}] ] [ [{1, a}, {2, a}], [{1, a}] ]
  48. [ [{1, a}], [] ] [ [{1, a}], [] ]

    [ [{1, a}], [{1, a}] ] [ [{1, a}], [{1, a}] ] [ [{1, a}, {2, a}], [{1, a}] ] [ [{1, a}, {2, a}], [{1, a}] ]
  49. [ [{1, a}], [] ] [ [{1, a}], [] ]

  50. [ [{1, a}], [] ] [ [{1, a}], [] ]

    [ [{1, a}, {2, b}], [] ]
  51. [ [{1, a}], [] ] [ [{1, a}], [] ]

    [ [{1, a}, {2, b}], [] ] [ [{1, a}], [{1, a}] ]
  52. [ [{1, a}], [] ] [ [{1, a}], [] ]

    [ [{1, a}, {2, b}], [] ] [ [{1, a}], [{1, a}] ] [ [{1, a}, {2, b}], [{1, a}] ]
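
The OR-Set states on slides 45 to 52 pair a list of tagged adds with a list of removed tags. A minimal sketch of that representation follows; it is not the code in `riak_dt_orset`, and the module name `orset_sketch`, the tuple `{Adds, Removes}`, and the caller-supplied tags are assumptions for illustration.

```erlang
-module(orset_sketch).
-export([new/0, add/3, remove/2, value/1, merge/2]).

%% Hypothetical sketch: state is {Adds, Removes}, both ordsets of
%% {Tag, Element} pairs, mirroring [ [{1, a}], [] ] on the slides.
new() -> {[], []}.

%% Add records the element under a tag the caller guarantees is unique.
add(Tag, Elem, {Adds, Removes}) ->
    {ordsets:add_element({Tag, Elem}, Adds), Removes}.

%% Remove tombstones only the tags this replica has observed, so a
%% concurrent add under a fresh tag survives the merge ("add wins").
remove(Elem, {Adds, Removes}) ->
    Observed = [P || {_, E} = P <- Adds, E =:= Elem],
    {Adds, ordsets:union(Removes, ordsets:from_list(Observed))}.

%% An element is present if at least one of its tags is not tombstoned.
value({Adds, Removes}) ->
    lists:usort([E || {_, E} <- ordsets:subtract(Adds, Removes)]).

%% Merge is pairwise set union, hence associative, commutative, idempotent.
merge({A1, R1}, {A2, R2}) ->
    {ordsets:union(A1, A2), ordsets:union(R1, R2)}.
```

Replaying slides 49 to 52: one replica adds `b` under tag 2, the other removes `a`, and merging the two yields `{[{1, a}, {2, b}], [{1, a}]}`, whose value is `[b]`.
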
  53. riak_dt git clone git@github.com:basho/riak_dt.git

  54. riak_dt/src/riak_dt.erl

        -type crdt() :: term().
        -type operation() :: term().
        -type actor() :: term().
        -type value() :: term().
        -type error() :: term().

        -callback new() -> crdt().
        -callback value(crdt()) -> term().
        -callback value(term(), crdt()) -> value().
        -callback update(operation(), actor(), crdt()) ->
            {ok, crdt()} | {error, error()}.
        -callback merge(crdt(), crdt()) -> crdt().
        -callback equal(crdt(), crdt()) -> boolean().
        -callback to_binary(crdt()) -> binary().
        -callback from_binary(binary()) -> crdt().

  55. Riak 1.4

  56. Riak 1.4 Counters: G-Counter, PN-Counter

  57. Riak 1.4 Counters: non-idempotent; O(Actors)
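
A G-Counter keeps one monotonically increasing count per actor, which is where the O(Actors) space cost comes from, and a PN-Counter pairs two of them. The sketch below is a simplified illustration and not the `riak_dt` source; the module `counter_sketch` and all function names are assumptions. Note that merge is idempotent while the increment operation itself is not: replaying an increment counts it twice.

```erlang
-module(counter_sketch).
-export([g_new/0, g_inc/3, g_value/1, g_merge/2,
         pn_new/0, pn_inc/3, pn_dec/3, pn_value/1, pn_merge/2]).

%% Hypothetical sketch: a G-Counter is an orddict of Actor -> Count.
g_new() -> orddict:new().

%% Each actor only ever bumps its own entry.
g_inc(Actor, By, G) when is_integer(By), By > 0 ->
    orddict:update_counter(Actor, By, G).

%% The counter value is the sum over all actors.
g_value(G) ->
    orddict:fold(fun(_Actor, N, Sum) -> Sum + N end, 0, G).

%% Merge takes the per-actor maximum (the least upper bound).
g_merge(G1, G2) ->
    orddict:merge(fun(_Actor, N1, N2) -> max(N1, N2) end, G1, G2).

%% A PN-Counter is a pair of G-Counters: increments and decrements.
pn_new() -> {g_new(), g_new()}.
pn_inc(Actor, By, {P, N}) -> {g_inc(Actor, By, P), N}.
pn_dec(Actor, By, {P, N}) -> {P, g_inc(Actor, By, N)}.
pn_value({P, N}) -> g_value(P) - g_value(N).
pn_merge({P1, N1}, {P2, N2}) -> {g_merge(P1, P2), g_merge(N1, N2)}.
```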

  58. riak_dt/src/riak_dt_gcounter.erl

        -module(riak_dt_gcounter).

        -export([new/0, new/2, value/1, value/2, update/3, merge/2,
                 equal/2, to_binary/1, from_binary/1]).

        -export_type([gcounter/0, gcounter_op/0]).

        -opaque gcounter() :: orddict:orddict().

        -type gcounter_op() :: increment | {increment, pos_integer()}.

  59. riak_dt/src/riak_dt_pncounter.erl

        -module(riak_dt_pncounter).

        -export([new/0, new/2, value/1, value/2,
                 update/3, merge/2, equal/2, to_binary/1, from_binary/1]).

        -export_type([pncounter/0, pncounter_op/0]).

        -opaque pncounter() :: [{Actor::riak_dt:actor(), Inc::pos_integer(),
                                 Dec::pos_integer()}].

        -type pncounter_op() :: riak_dt_gcounter:gcounter_op() | decrement_op().
        -type decrement_op() :: decrement | {decrement, pos_integer()}.
        -type pncounter_q() :: positive | negative.

  60. Riak 2.0

  61. Riak 2.0 Requires bucket types; HTTP and protobuf API

  62. Riak 2.0 Caveats: 2i, MapReduce

  63. Riak 2.0 Sets: Add, Remove, Membership; Idempotent

  64. Riak 2.0 Sets: Add wins; O(Actors + Elements)

  65. riak_dt/src/riak_dt_gset.erl

        -module(riak_dt_gset).

        -behaviour(riak_dt).

        %% API
        -export([new/0, value/1, update/3, merge/2, equal/2,
                 to_binary/1, from_binary/1, value/2]).

        -export_type([gset/0, binary_gset/0, gset_op/0]).

        -opaque gset() :: members().

        -type binary_gset() :: binary().

        -type gset_op() :: {add, member()}.

  66. riak_dt/src/riak_dt_orset.erl

        -module(riak_dt_orset).

        -behaviour(riak_dt).

        %% API
        -export([new/0, value/1, update/3, merge/2, equal/2,
                 to_binary/1, from_binary/1, value/2, precondition_context/1]).

        -export_type([orset/0, binary_orset/0, orset_op/0]).
        -opaque orset() :: orddict:orddict().

        -type binary_orset() :: binary(). %% A binary that from_binary/1 will operate on.

        -type orset_op() :: {add, member()} | {remove, member()} |
                            {add_all, [member()]} | {remove_all, [member()]} |
                            {update, [orset_op()]}.

        -type actor() :: riak_dt:actor().
        -type member() :: term().

  67. riak_dt/src/riak_dt_orswot.erl

        -module(riak_dt_orswot).

        -behaviour(riak_dt).

        -export_type([orswot/0, orswot_op/0, binary_orswot/0]).

        -opaque orswot() :: {riak_dt_vclock:vclock(), entries()}.
        -type binary_orswot() :: binary(). %% A binary that from_binary/1 will operate on.

        -type orswot_op() :: {add, member()} | {remove, member()} |
                             {add_all, [member()]} | {remove_all, [member()]} |
                             {update, [orswot_op()]}.
        -type orswot_q() :: size | {contains, term()}.

        -type actor() :: riak_dt:actor().

        -type entries() :: [{member(), minimal_clock()}].

        -type minimal_clock() :: [dot()].
        -type dot() :: {actor(), Count::pos_integer()}.
        -type member() :: term().

  68. Riak 2.0 Maps: Recursive; Associative Array; Nestable

  69. Riak 2.0 Maps: Update wins; O(Actors + Elements)

  70. riak_dt/src/riak_dt_map.erl

        -module(riak_dt_map).

        -behaviour(riak_dt).

        %% API
        -export([new/0, value/1, value/2, update/3, merge/2,
                 equal/2, to_binary/1, from_binary/1, precondition_context/1]).

        -export_type([map/0, binary_map/0, map_op/0]).

        -type binary_map() :: binary(). %% A binary that from_binary/1 will accept
        -type map() :: {riak_dt_vclock:vclock(), valuelist()}.
        -type field() :: {Name::term(), Type::crdt_mod()}.
        -type crdt_mod() :: riak_dt_pncounter | riak_dt_lwwreg |
                            riak_dt_od_flag |
                            riak_dt_map | riak_dt_orswot.
        -type valuelist() :: [{field(), entry()}].
        -type entry() :: {minimal_clock(), crdt()}.

        -type crdt() :: riak_dt_pncounter:pncounter() | riak_dt_od_flag:od_flag() |
                        riak_dt_lwwreg:lwwreg() |
                        riak_dt_orswot:orswot() |
                        riak_dt_map:map().

        -type map_op() :: {update, [map_field_update() | map_field_op()]}.

  71. Riak 2.0 Maps: LWW-Register, Booleans, Sets, and Maps

  72. Riak 2.0 LWW-Register: last writer wins
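
A last-writer-wins register stores a value together with a timestamp and keeps whichever write carries the larger one. The following is a minimal sketch assuming caller-supplied integer timestamps; the module `lwwreg_sketch`, the `undefined` initial value, and the tie-break on Erlang term order are choices made for this example rather than details taken from `riak_dt_lwwreg`.

```erlang
-module(lwwreg_sketch).
-export([new/0, assign/3, value/1, merge/2]).

%% Hypothetical sketch: the register is {Value, Timestamp}.
new() -> {undefined, 0}.

%% An assignment only takes effect if its timestamp is newer.
assign(Value, TS, {_, OldTS}) when TS > OldTS -> {Value, TS};
assign(_Value, _TS, Reg) -> Reg.

value({Value, _TS}) -> Value.

%% Merge keeps the write with the larger timestamp. On equal timestamps
%% it falls back to term order so that merge stays commutative.
merge({_, TS1} = R1, {_, TS2}) when TS1 > TS2 -> R1;
merge({_, TS1}, {_, TS2} = R2) when TS2 > TS1 -> R2;
merge(R1, R2) -> max(R1, R2).
```

The trade-off is the one the name implies: a concurrent write with a smaller timestamp is silently discarded, so the result depends on how well the writers' clocks agree.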

  73. riak_dt/src/riak_dt_lwwreg.erl

        -module(riak_dt_lwwreg).

        -export([new/0, value/1, value/2, update/3, merge/2,
                 equal/2, to_binary/1, from_binary/1]).

        -export_type([lwwreg/0, lwwreg_op/0]).

        -opaque lwwreg() :: {term(), non_neg_integer()}.

        -type lwwreg_op() :: {assign, term(), non_neg_integer()} | {assign, term()}.

        -type lww_q() :: timestamp.

  74. Riak 2.0 Boolean: Enabled, Disabled; O(Actors)

  75. riak_dt/src/riak_dt_enable_flag.erl

        -module(riak_dt_enable_flag).

        -behaviour(riak_dt).

        -export([new/0, value/1, value/2, update/3, merge/2,
                 equal/2, from_binary/1, to_binary/1]).

  76. This project is funded by the European Union, 7th Research Framework Programme, ICT call 10, grant agreement n°609551.

  78. Strong Consistency

  79. Strong Consistency Why?

  80. Strong Consistency Why? atomicity

  81. Strong Consistency Why? recency

  82. Strong Consistency Why? partial writes

  83. A A A

  84. A A A Val = <<"B">>.

  85. B A A

  86. B A A riakc_pb_socket:get(PBC, <<"Bucket">>, <<"Key">>). B B

  87. Strong Consistency Single key atomic operations

  88. Strong Consistency any get sees most recent put

  89. Strong Consistency get/modify/put cycle fails if object is changed

  90. Strong Consistency put w/o vclock fails if object exists
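
The single-key guarantees on slides 87 to 90 amount to a compare-and-set discipline. The sketch below models them on one in-memory map; it says nothing about how Riak enforces them across replicas. The module `sc_sketch`, the functions `read/2` and `write/4`, and the integer version used in place of a vclock are all assumptions for illustration.

```erlang
-module(sc_sketch).
-export([new/0, read/2, write/4]).

%% Hypothetical sketch: the store maps Key -> {Version, Value}.
new() -> #{}.

%% A read returns the most recent value together with its version.
read(Key, Store) ->
    case maps:find(Key, Store) of
        {ok, {Vsn, Val}} -> {ok, Vsn, Val};
        error -> notfound
    end.

%% write(Key, Context, Val, Store)
%% Context = undefined: blind put, fails if the object already exists.
%% Context = Version from read/2: fails if the object changed since.
write(Key, undefined, Val, Store) ->
    case maps:is_key(Key, Store) of
        true -> {error, failed};
        false -> {ok, Store#{Key => {1, Val}}}
    end;
write(Key, Vsn, Val, Store) ->
    case maps:find(Key, Store) of
        {ok, {Vsn, _}} -> {ok, Store#{Key => {Vsn + 1, Val}}};
        _ -> {error, failed}
    end.
```

A get/modify/put cycle is then `read/2`, compute the new value, `write/4` with the version that was read; a competing writer in between bumps the version and the second write fails instead of silently overwriting.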

  91. Consensus

  92. Distributed Consensus “The problem of reaching agreement among remote processes is one of the most fundamental problems in distributed computing and is at the core of many algorithms for distributed data processing, distributed file management, and fault-tolerant distributed applications.” --Fischer, Lynch, Paterson

  93. Consensus Guarantees: termination, agreement, validity

  94. Termination processes eventually decide on a value

  95. Agreement processes that decide do so on the same value

  96. Validity values must have been proposed

  97. Consensus Algorithms

  98. Consensus Algorithms Paxos, ZAB, Raft

  99. Paxos Lamport 1990: “The Part-Time Parliament”

  101. ZAB ZooKeeper Atomic Broadcast

  102. Raft Ousterhout, Ongaro; 2013: “In Search of an Understandable Consensus Algorithm”

  103. Paxos coordinated requests; leaders

  104. Paxos 2 round trips / request

  105. Node 1, Node 2, Node 3: N++; prepare(N); promise(N, V_b); promise(N, V_c); V_N = f(V_a, V_b, V_c); commit(N, V_N); accept(N)

  106. Multi-Paxos First Request

  107. Node 1, Node 2, Node 3: N++; I = 0; prepare(N, I); promise(N, I, V_b); promise(N, I, V_c); V_N = f(V_a, V_b, V_c); commit(N, I, V_N); accept(N, I)

  108. Multi-Paxos Each Additional Request

  109. Node 1, Node 2, Node 3: I++; commit(N, I, V); accept(N, I)

  110. Multi-Paxos Each Request: ship entire state

  111. Riak

  112. Riak Key/Value

  113. Riak Keys are Independent

  114. Riak read-repair, active anti-entropy

  115. Riak individual key = isolated state

  116. Riak multi-paxos per key

  117. Consensus Groups

  118. Consensus Groups participants in decisioning; ensembles

  119. Consensus Groups preference lists (preflists) in Riak

  120. preflist

  125. Consensus Groups one group/ensemble per preflist

  126. Consensus Groups ring_size = 256, 256 ensembles
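
With a ring of 256 partitions and one ensemble per preflist, each key maps to exactly one consensus group. The sketch below shows one way to derive a partition and a preflist from a key. It is a simplification under stated assumptions: the module `ring_sketch` is invented, the ring size is assumed to be a power of two, the key is hashed with SHA-1 over `term_to_binary/1`, and the preflist is taken as N consecutive partitions starting at the one that owns the hash, which may differ from Riak's exact successor rule.

```erlang
-module(ring_sketch).
-export([partition/2, preflist/3]).

%% Size of the SHA-1 output space.
-define(RING_TOP, (1 bsl 160)).

%% Hypothetical sketch: map an object name such as {Bucket, Key} onto one
%% of RingSize equal partitions. RingSize is assumed to be a power of two.
partition(ObjectName, RingSize) ->
    <<Hash:160/integer>> = crypto:hash(sha, term_to_binary(ObjectName)),
    Hash div (?RING_TOP div RingSize).

%% The preflist is N consecutive partitions, wrapping around the ring.
preflist(ObjectName, RingSize, N) ->
    First = partition(ObjectName, RingSize),
    [(First + I) rem RingSize || I <- lists:seq(0, N - 1)].
```

Because the preflist is determined by its first partition, a ring of 256 partitions yields 256 distinct preflists, which matches the 256 ensembles on the slide.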

  127. Ensembles

  128. Ensembles leader election; Epochs; get/put operations

  129. Get Operations leader reads local object; if Epoch old: refresh

  130. Node 1, Node 2, Node 3 (obj.epoch < epoch): get(key); reply(Epoch_b, Seq_b, Val_b); reply(Epoch_c, Seq_c, Val_c); Val = latest(Val_a, Val_b, Val_c); Val.epoch = epoch; write(Epoch, ++Seq, Val); ack(Epoch, Seq)

  131. Node 1, Node 2, Node 3 (obj.epoch == epoch): Reply = local_get(Key)

  132. Get Operations Worst Case: 2 roundtrips / request; Best Case: 0 roundtrips / request
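
The leader's decision on a get can be written as a pure function over the replies it has collected. This is a sketch of the logic on slides 129 to 132 only, not of the ensemble code itself: the module `ensemble_get_sketch`, the `{Epoch, Seq, Val}` object tuple, and the third element of the return value (the object the leader would still have to write to its peers) are assumptions, and the peer replies are assumed to come from a quorum.

```erlang
-module(ensemble_get_sketch).
-export([leader_get/4]).

%% Hypothetical sketch.
%% leader_get(Epoch, Seq, Local, PeerReplies) -> {Value, NewSeq, ToWrite}
%% Local and each peer reply are {ObjEpoch, ObjSeq, Val} tuples.

%% Best case: the local object was written in the current epoch, so the
%% leader answers from local state with zero round trips.
leader_get(Epoch, Seq, {Epoch, _ObjSeq, Val}, _PeerReplies) ->
    {Val, Seq, none};
%% Worst case: the local object is from an older epoch. Pick the latest
%% copy among all replies and rewrite it under the current epoch with an
%% incremented sequence number (one round trip to read, one to write).
leader_get(Epoch, Seq, {ObjEpoch, _, _} = Local, PeerReplies)
  when ObjEpoch < Epoch ->
    {_, _, Val} = latest([Local | PeerReplies]),
    NewSeq = Seq + 1,
    {Val, NewSeq, {Epoch, NewSeq, Val}}.

%% Tuples compare element by element, so the maximum is the copy with
%% the highest epoch and, within that epoch, the highest sequence number.
latest(Objs) ->
    lists:max(Objs).
```
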
  133. Put Operations leader reads local object; if Epoch old: refresh; modify object; commit modified object

  134. Node 1, Node 2, Node 3 (obj.epoch < epoch): get(key); reply(Epoch_b, Seq_b, Val_b); reply(Epoch_c, Seq_c, Val_c); Latest = latest(Val_a, Val_b, Val_c); Val = modify(Latest); write(Epoch, ++Seq, Val); ack(Epoch, Seq)

  135. Node 1, Node 2, Node 3 (obj.epoch == epoch): Latest = local_get(Key); Val = modify(Latest); write(Epoch, ++Seq, Val); ack(Epoch, Seq)

  136. Put Operations Worst Case: 2 roundtrips / write; Best Case: 1 roundtrip / write

  137. Failed Quorums Leader Re-election; new Epoch

  138. Cluster Membership

  139. Cluster Membership add/remove nodes

  140. Cluster Membership consensus state

  141. Cluster Membership multi-paxos joint consensus

  142. Existing Ensemble: [{riak_01}, {riak_02}, {riak_03}]; Joining Ensemble: [{riak_07}, {riak_08}, {riak_09}]

  143. Joint-Consensus Ensemble [{riak_01}, {riak_02}, {riak_03}, {riak_07}, {riak_08}, {riak_09}]

  145. New Ensemble: [{riak_07}, {riak_08}, {riak_09}]

  146. Q & A