Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Data Structures in Riak (NoSQL Matters Cologne 2013)

Data Structures in Riak (NoSQL Matters Cologne 2013)

Since the beginning, Riak has supported high write-availability using Dynamo-style multi-valued keys – also known as conflicts or siblings. The tradeoff for this type of availability is that the application must include logic to resolve conflicting updates. While it is convenient to say that the application can reason best about conflicts, ad hoc resolution is error-prone and can result in surprising anomalies, like the reappearing item problem in Dynamo’s shopping cart.

What is needed is a more formal and general approach to the problem of conflict resolution for complex data structures. Luckily, there are some formal strategies in recent literature, including Conflict-Free Replicated Data Types (CRDTs) and BloomL lattices. We’ll review these strategies and cover some recent work we’ve done toward adding automatically-convergent data structures to Riak.

Sean Cribbs

April 26, 2013
Tweet

More Decks by Sean Cribbs

Other Decks in Technology

Transcript

  1. No clear winner! Throw one out? Keep both? Consistency? 1

    2 3 B A Cassandra Friday, April 26, 13
  2. No clear winner! Throw one out? Keep both? Consistency? 1

    2 3 B A Cassandra Riak & Voldemort Friday, April 26, 13
  3. Siblings in Riak HTTP/1.1  300  Multiple  Choices X-­‐Riak-­‐Vclock:   a85hYGDgyGDKBVIszMk55zKYEhnzWBlKIniO8kGF2TyvHYIKf0cIszUnMTBzH

    YVKbIhEUl +VK4spDFTPxhHzFyqhEoVQz7wkSAGLMGuz6FSocFIUijE3pt5HlsgCAA== Vary:  Accept,  Accept-­‐Encoding Server:  MochiWeb/1.1  WebMachine/1.9.0  (participate  in  the   frantic) Date:  Fri,  30  Sep  2011  15:24:35  GMT Content-­‐Type:  text/plain Content-­‐Length:  102 Siblings: 16vic4eU9ny46o4KPiDz1f 4v5xOg4bVwUYZdMkqf0d6I 6nr5tDTmhxnwuAFJDd2s6G 6zRSZFUJlHXZ15o9CG0BYl Friday, April 26, 13
  4. Siblings in Riak HTTP/1.1  300  Multiple  Choices X-­‐Riak-­‐Vclock:   a85hYGDgyGDKBVIszMk55zKYEhnzWBlKIniO8kGF2TyvHYIKf0cIszUnMTBzH

    YVKbIhEUl +VK4spDFTPxhHzFyqhEoVQz7wkSAGLMGuz6FSocFIUijE3pt5HlsgCAA== Vary:  Accept,  Accept-­‐Encoding Server:  MochiWeb/1.1  WebMachine/1.9.0  (participate  in  the   frantic) Date:  Fri,  30  Sep  2011  15:24:35  GMT Content-­‐Type:  text/plain Content-­‐Length:  102 Siblings: 16vic4eU9ny46o4KPiDz1f 4v5xOg4bVwUYZdMkqf0d6I 6nr5tDTmhxnwuAFJDd2s6G 6zRSZFUJlHXZ15o9CG0BYl list of siblings Friday, April 26, 13
  5. Siblings in Riak HTTP/1.1  300  Multiple  Choices X-­‐Riak-­‐Vclock:   a85hYGDgyGDKBVIszMk55zKYEhnzWBlKIniO8kGF2TyvHYIKf0cIszUnMTBzHYVKbIhEUl

    +VK4spDFTPxhHzFyqhEoVQz7wkSAGLMGuz6FSocFIUijE3pt5HlsgCAA== Vary:  Accept,  Accept-­‐Encoding Server:  MochiWeb/1.1  WebMachine/1.9.0  (participate  in  the  frantic) Date:  Fri,  30  Sep  2011  15:24:35  GMT Content-­‐Type:  multipart/mixed;  boundary=YinLMzyUR9feB17okMytgKsylvh Content-­‐Length:  766 -­‐-­‐YinLMzyUR9feB17okMytgKsylvh Content-­‐Type:  application/x-­‐www-­‐form-­‐urlencoded Link:  </riak/test>;  rel="up" Etag:  16vic4eU9ny46o4KPiDz1f Last-­‐Modified:  Wed,  10  Mar  2010  18:01:06  GMT {"bar":"baz"} -­‐-­‐YinLMzyUR9feB17okMytgKsylvh Content-­‐Type:  application/json Link:  </riak/test>;  rel="up" Etag:  4v5xOg4bVwUYZdMkqf0d6I Last-­‐Modified:  Wed,  10  Mar  2010  18:00:04  GMT {"bar":"baz"} -­‐-­‐YinLMzyUR9feB17okMytgKsylvh Content-­‐Type:  application/json Link:  </riak/test>;  rel="up" Friday, April 26, 13
  6. Siblings in Riak HTTP/1.1  300  Multiple  Choices X-­‐Riak-­‐Vclock:   a85hYGDgyGDKBVIszMk55zKYEhnzWBlKIniO8kGF2TyvHYIKf0cIszUnMTBzHYVKbIhEUl

    +VK4spDFTPxhHzFyqhEoVQz7wkSAGLMGuz6FSocFIUijE3pt5HlsgCAA== Vary:  Accept,  Accept-­‐Encoding Server:  MochiWeb/1.1  WebMachine/1.9.0  (participate  in  the  frantic) Date:  Fri,  30  Sep  2011  15:24:35  GMT Content-­‐Type:  multipart/mixed;  boundary=YinLMzyUR9feB17okMytgKsylvh Content-­‐Length:  766 -­‐-­‐YinLMzyUR9feB17okMytgKsylvh Content-­‐Type:  application/x-­‐www-­‐form-­‐urlencoded Link:  </riak/test>;  rel="up" Etag:  16vic4eU9ny46o4KPiDz1f Last-­‐Modified:  Wed,  10  Mar  2010  18:01:06  GMT {"bar":"baz"} -­‐-­‐YinLMzyUR9feB17okMytgKsylvh Content-­‐Type:  application/json Link:  </riak/test>;  rel="up" Etag:  4v5xOg4bVwUYZdMkqf0d6I Last-­‐Modified:  Wed,  10  Mar  2010  18:00:04  GMT {"bar":"baz"} -­‐-­‐YinLMzyUR9feB17okMytgKsylvh Content-­‐Type:  application/json Link:  </riak/test>;  rel="up" all the values Friday, April 26, 13
  7. Semantic Resolution • Your app knows the domain - use

    business rules to resolve • Amazon Dynamo’s shopping cart Friday, April 26, 13
  8. Semantic Resolution • Your app knows the domain - use

    business rules to resolve • Amazon Dynamo’s shopping cart BAD Friday, April 26, 13
  9. Semantic Resolution • Your app knows the domain - use

    business rules to resolve • Amazon Dynamo’s shopping cart BAD “Ad hoc approaches have proven brittle and error-prone” Friday, April 26, 13
  10. WARNING This is a lot of math. Side effects may

    include dry mouth, itchy rash, and a desire to go back for a PhD. Friday, April 26, 13
  11. Monotonic Functions • Change in strictly a single direction •

    Consecutive values may be equal • Monotonic: Linear, Exponential • Non-monotonic: Quadratic, Sinusoidal Friday, April 26, 13
  12. Monotonic Functions • Change in strictly a single direction •

    Consecutive values may be equal • Monotonic: Linear, Exponential • Non-monotonic: Quadratic, Sinusoidal Friday, April 26, 13
  13. Monotonic Logic •Existing facts are never refuted •New facts can

    be added •“Knowledge only grows” Friday, April 26, 13
  14. Monotonic Logic •Existing facts are never refuted •New facts can

    be added •“Knowledge only grows” “monotonicity of entailment” Friday, April 26, 13
  15. Bounded Join Semi-Lattice ʪS, ⊔, ⊥ʫ ⊔ is a least-upper

    bound function ∀x, y ∈ S, ∃z ∈ S: x ⊔ y = z Friday, April 26, 13
  16. Bounded Join Semi-Lattice ∀x, y ∈ S: x ≤S y

    㱻 x ⊔ y = y “partial order” ʪS, ⊔, ⊥ʫ ⊔ is a least-upper bound function ∀x, y ∈ S, ∃z ∈ S: x ⊔ y = z Friday, April 26, 13
  17. Bounded Join Semi-Lattice ∀x, y ∈ S: x ≤S y

    㱻 x ⊔ y = y “partial order” ∀x ∈ S: x ⊔ ⊥ = x “identity” ʪS, ⊔, ⊥ʫ ⊔ is a least-upper bound function ∀x, y ∈ S, ∃z ∈ S: x ⊔ y = z Friday, April 26, 13
  18. “Set” Lattice {a} {b} {c} {d} {e} {a,b} {b,c} {c,d}

    {d,e} {a,b,c} {c,d,e} {b,c,d,e} {a,b,c,d} {b,c,d} {a,b,c,d,e} Time S = all finite sets ⊔ = set-union ⊥ = {} Friday, April 26, 13
  19. • Vector clock is a lattice... Vector Clock S =

    all vectors of (Actor, Count) pairs ⊔ = All Actors, each with their max Count ⊥ = [] (empty vector) Friday, April 26, 13
  20. • Vector clock is a lattice... • ...but the associated

    Riak value is non- monotonic, Vector Clock S = all vectors of (Actor, Count) pairs ⊔ = All Actors, each with their max Count ⊥ = [] (empty vector) Friday, April 26, 13
  21. • Vector clock is a lattice... • ...but the associated

    Riak value is non- monotonic, • ...and the vclock is not meaningful to the client. Vector Clock S = all vectors of (Actor, Count) pairs ⊔ = All Actors, each with their max Count ⊥ = [] (empty vector) Friday, April 26, 13
  22. CRDT Flavors • Convergent (state-based) • One replica updates, then

    forwards entire state, downstream merges • Commutative (operation-based) • Only mutations (ops) communicated • Needs a reliable broadcast channel Friday, April 26, 13
  23. CRDT Types Registers LWW, MV Counters Positive, P/N Sets Grow

    only, Two-Phase, Observed-Remove Graphs 2P-2P Lists Growable-array Collaborative editing Treedoc Friday, April 26, 13
  24. Garbage! •Monotonic types are space-inefficient •Garbage-collection is non-monotonic •Strong consensus

    may be needed for GC •Related: vector-clock pruning, key deletion Friday, April 26, 13
  25. Riak DT •Riak Core Application •State-based CRDTS •Independent storage engine

    •HTTP API •CRDT behaviour •QuickCheck convergence property Friday, April 26, 13
  26. •new/0 empty CRDT •value/1 the resolved value •update/3 mutate CRDT

    •merge/2 converge two CRDTs •equal/2 compare internal value CRDT Behaviour Friday, April 26, 13
  27. G-Counter •Simple version vector (28 LoC) [{ActorId,Count}] •Update: increment actor’s

    count •Merge: greatest value per Actor •Value: sum of Counts Friday, April 26, 13
  28. G-Counter new()  -­‐>        []. value(GCnt)  -­‐>  

         sum([Cnt  ||  {_Act,  Cnt}  <-­‐  GCnt]). equal(VA,VB)  -­‐>        lists:sort(VA)  =:=  lists:sort(VB). Friday, April 26, 13
  29. PN-Counter •2 x G-Counter •P - N = value {

       P  =  [{a,10},{b,2}],    N  =  [{a,1},{c,5}] } (10  +  2)  -­‐  (1  +  5)      =  12  -­‐  6      =  6 Friday, April 26, 13
  30. Riak DT In Action •Bitcask storage per vnode •Value /

    Update FSM per request •Webmachine resource(s) e.g. GET  /counters/key •Read-repair Friday, April 26, 13
  31. Trade-Offs •Update: Primary replica only •Secondary/Fallbacks may Merge •PW=DW=1 by

    default •Read-before-Write in the request path Friday, April 26, 13
  32. CRDTs in KV •Store CRDTs as values inside KV “Objects”

    •Requires allow_mult  =  true •Resolve CRDTs separately from opaque values •Counters-only at first Friday, April 26, 13
  33. KV Counters GET  /buckets/b/counters/c    =>  5 POST  /buckets/b/counters/c 10

    GET  /buckets/b/keys/c Stop doing that. Friday, April 26, 13