Slide 1

Slide 1 text

For your edification and entertainment, an audiovisual précis of:
A comprehensive study of Convergent and Commutative Replicated Data Types
by Shapiro, Preguiça, Baquero, and Zawirski (2011)
Chas Emerick (@cemerick, @QuiltProject)
Papers We Love NYC #4, May 15, 2014

Slide 2

Slide 2 text

@papers_we_love NYC #4, 2014-05-15 @cemerick / @QuiltProject
Preface
● “Who is this guy?”
● Not going to follow the order of topics from the paper exactly
● Key topics will be introduced along with section numbers from the paper (e.g. §4.3)
● I'll be drawing upon materials used in the authors' presentations related to this paper (see references for links)

Slide 3

Slide 3 text

Itinerary
● The paper
  – Motivating problem
  – Theoretically- and algebraically-sound solution
  – Specification of practical data type designs
  – Challenges
  – Related work, further reading
● Impact outside of academia
● Questions & discussion

Slide 4

Slide 4 text

Actors within distributed systems exchange and share state
[Diagram: replicas A and B on a time axis t, sharing states x_a, x_b, x_c]

Slide 5

Slide 5 text

The problem
Conflicting concurrent modifications
[Diagram: replicas A and B both hold x; Δa and Δb are applied concurrently, leaving x_Δa ∥ x_Δb and an unknown result (???)]

Slide 6

Slide 6 text

The problem
Conflicting concurrent modifications require consensus
[Diagram: same scenario, with both replicas reaching x_Δab]

Slide 7

Slide 7 text

The problem
Conflicting concurrent modifications require consensus, which is partition-intolerant and affects availability
[Diagram: same scenario, with both replicas reaching x_Δab]

Slide 8

Slide 8 text

Linearizability✝
● Transactional databases, Redis, consensus services (Paxos/Raft)
● Global consensus → consistency
● Total order of all events
● Very expensive & constrains availability
[Diagram: replicas A and B both reaching x_Δab]
✝As well as strict serializability, which has even stronger guarantees.

Slide 9

Slide 9 text

Thus, eventual consistency
● No guarantee of ordering of events
● Maximal availability & performance
● How to reconcile results from concurrent operations?
[Diagram: replicas A and B left with x_Δa ∥ x_Δb and an unknown result (???)]

Slide 10

Slide 10 text

How to reconcile results of concurrent operations?
● “Background” (deferred) consensus
  – Post-hoc resolution or rollback of conflicting updates
● This is what we do today, all the time!
  – Resolving CouchDB conflicts and Riak siblings within applications
  – Merging (semi-)textual content via diffs
● Very difficult to implement correctly, and no guiding formalisms to indicate correctness or warn against problems

Slide 11

Slide 11 text

Proposed solution
Conflict-free Replicated Data Types (CRDTs) deterministically reconcile concurrent updates such that no conflicts arise
– Performance, availability, scale of eventual consistency + reliable reconciliation as if you were using a consensus mechanism
– Provably sound
– Limitations:
  ● No consensus → limitations on what can be stored, replicated, and reconciled (i.e. no global invariants)
  ● Unbounded growth → “garbage collection”

Slide 12

Slide 12 text

Chocolate and vanilla CRDTs
● What is replicated?
  – Entire state of the datatype? State-based, a.k.a. convergent replicated data type, a.k.a. CvRDT
  – Individual operations (+ arguments)? Operation-based, a.k.a. commutative replicated data type, a.k.a. CmRDT
  – Options correspond to the two strategies for implementing optimistic replication✝
● These are formally equivalent §2.4
  – Strategy: understand state-based constructions, move on to operation-based as optimization
✝http://research.microsoft.com/apps/pubs/default.aspx?id=66979

Slide 13

Slide 13 text

State-based (CvRDT) §2.2.1, §2.3.1
All possible states within a CvRDT form a semilattice
– Partially-ordered set (≤) established by a least upper bound (join, ⊔) or greatest lower bound (meet, ⊓)
– Both operations are definitionally commutative, associative, and idempotent
– Each application of join or meet yields a monotonically increasing or decreasing value
[Diagram: semilattice of sets over time t: ∅, {a}, {b}, {a,b}]

Slide 14

Slide 14 text

State-based (CvRDT)
● join and meet are formally equivalent; join is presumed throughout the literature
● Update locally, propagate the resulting state to other replicas, where all states must converge (the 'v' in “CvRDT”)
● Requires only the weakest eventual consistency guarantees to yield convergence among all replicas, since join is associative, commutative, and idempotent
  – “infinitely often” transmission of state
  – Insensitive to reordered/dropped/repeated messages
  – Very expensive worst case, but easier to reason about
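The state-based recipe above can be sketched with a grow-only counter, one of the counter designs from the paper's portfolio. This is a minimal illustration, not the paper's specification: the class name `GCounter` and its method names are assumptions made for this example. The join is an element-wise maximum over per-replica counts, so merging is commutative, associative, and idempotent.

```python
from typing import Dict


class GCounter:
    """Grow-only counter: one non-negative count per replica id (hypothetical API)."""

    def __init__(self, replica_id: str) -> None:
        self.replica_id = replica_id
        self.counts: Dict[str, int] = {}

    def increment(self, n: int = 1) -> None:
        # Local update: a replica only ever raises its own entry.
        if n < 0:
            raise ValueError("a grow-only counter cannot decrease")
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # Join: element-wise maximum, the least upper bound of the two states.
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)


a, b = GCounter("a"), GCounter("b")
a.increment(2)
b.increment(3)

a.merge(b)
a.merge(b)  # a repeated message changes nothing (idempotent)
b.merge(a)  # merge order does not matter (commutative)

print(a.value(), b.value())
```

Because whole states are shipped, a dropped message is harmless as long as some later state eventually arrives.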

Slide 15

Slide 15 text

Language for specifying asynchronous replication
● More than boxes, arrows
● Better than (most) pseudocode: explicit about preconditions, where things happen, (a)synchrony, etc.

Slide 16

Slide 16 text

Language for specifying asynchronous replication

Slide 17

Slide 17 text

A “Portfolio of basic CRDTs” §3
● Counters
● Registers
● Sets
● Sequences
● CRDTs compose and retain their characteristics
  – Sets → maps, multimaps, graphs

Slide 18

Slide 18 text

Designing a register CRDT✝
● Sequential specification:
  – R.set(v) → R.get() == v
● Join relation over R.set(v_a) || R.set(v_b)
  – Linearizable?
  – Error state?
  – Last writer wins?
✝Framework from http://bit.ly/shapiro-msr-talk

Slide 19

Slide 19 text

LWW (last writer wins) register §3.2.1
● Ensures only a single value in register
● Semilattice is ordered by timestamps
[Diagram: replicas A and B start empty (∅); one writes x at t1, the other writes y at t2; both converge on y at t2]

Slide 20

Slide 20 text

Mapping LWW-register to its semilattice
[Diagram: replicas A and B start empty (∅); one writes x at t1, the other writes y at t2; both converge on y at t2]
[Semilattice over time t: ∅ → [t1, x] → [t2, y]]
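The timestamp-ordered semilattice can be sketched as follows. This is an illustrative sketch under stated assumptions: the `LWWRegister` class is hypothetical, and timestamps are modeled as `(clock, replica_id)` pairs so that they are unique and totally ordered, which is an assumption of this example and not a detail taken from the slides.

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass(frozen=True)
class LWWRegister:
    """Immutable last-writer-wins register state (hypothetical API)."""

    # (clock, replica_id): the replica id breaks ties so timestamps are unique.
    timestamp: Tuple[int, str] = (0, "")
    value: Any = None

    def set(self, value: Any, clock: int, replica_id: str) -> "LWWRegister":
        # An assignment is a join with a freshly timestamped state.
        return self.join(LWWRegister((clock, replica_id), value))

    def join(self, other: "LWWRegister") -> "LWWRegister":
        # Least upper bound: the state with the larger timestamp wins.
        return other if other.timestamp > self.timestamp else self


empty = LWWRegister()
at_a = empty.set("x", clock=1, replica_id="A")  # x written at t1
at_b = empty.set("y", clock=2, replica_id="B")  # y written at t2

merged_at_a = at_a.join(at_b)
merged_at_b = at_b.join(at_a)
print(merged_at_a.value, merged_at_b.value)
```

The write of x is silently discarded on both replicas. That is the price of guaranteeing a single value.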

Slide 21

Slide 21 text

MV (multi-value) register §3.2.2
● Assignments carry causal history (e.g. version vector) which defines semilattice's partial order
● join retains all values assigned concurrently; some client can later assign a single value
[Diagram: replicas A and B start empty (∅); concurrent assignments x_Δa and y_Δb are both retained (x_Δa ∥ y_Δb); a client then assigns z with history [Δa, Δb, Δc]]

Slide 22

Slide 22 text

Mapping MV-register to its semilattice
[Semilattice over time t: ∅ → [Δa, x] and [Δb, y] → #{[Δa, x] [Δb, y]} → [[Δa, Δb, Δc], z]]
[Diagram: replicas A and B start empty (∅); concurrent assignments x_Δa and y_Δb are both retained (x_Δa ∥ y_Δb); a client then assigns z with history [Δa, Δb, Δc]]
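A rough sketch of the multi-value register, assuming version vectors are plain dictionaries from replica id to counter. The names `MVRegister` and `dominates` are hypothetical, and the code follows the shape of the idea (keep every value whose history is not superseded) without claiming to match the paper's specification line for line.

```python
from typing import Any, Dict, List, Set, Tuple

VersionVector = Dict[str, int]


def dominates(newer: VersionVector, older: VersionVector) -> bool:
    """True if `newer` has seen everything in `older` and differs from it."""
    return newer != older and all(n <= newer.get(r, 0) for r, n in older.items())


class MVRegister:
    """Multi-value register: a list of (value, version vector) pairs."""

    def __init__(self) -> None:
        self.entries: List[Tuple[Any, VersionVector]] = []

    def assign(self, value: Any, replica_id: str) -> None:
        # The new history covers every value currently visible, plus one
        # new event from the assigning actor.
        vv: VersionVector = {}
        for _, seen in self.entries:
            for r, n in seen.items():
                vv[r] = max(vv.get(r, 0), n)
        vv[replica_id] = vv.get(replica_id, 0) + 1
        self.entries = [(value, vv)]

    def values(self) -> Set[Any]:
        return {value for value, _ in self.entries}

    def merge(self, other: "MVRegister") -> None:
        # Join: keep each pair unless the other side holds a strictly newer one.
        kept = [e for e in self.entries
                if not any(dominates(vv, e[1]) for _, vv in other.entries)]
        for e in other.entries:
            superseded = any(dominates(vv, e[1]) for _, vv in self.entries)
            if not superseded and e not in kept:
                kept.append(e)
        self.entries = kept


a, b = MVRegister(), MVRegister()
a.assign("x", "A")
b.assign("y", "B")       # concurrent with A's assignment
a.merge(b)
b.merge(a)
siblings = a.values()    # both concurrent values are retained

a.assign("z", "C")       # a client resolves the conflict by assigning z
b.merge(a)
print(siblings, b.values())
```

The resolving assignment carries a history that covers both earlier ones, which is why it supersedes them on every replica it reaches.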

Slide 23

Slide 23 text

Designing a set CRDT✝
● Sequential specification:
  – S.add(e) → S.contains(e) == true
  – S.remove(e) → S.contains(e) == false
● Join relation over S.add(e) || S.remove(e)
  – Linearizable?
  – Disallow removals?
  – Error state?
  – Last writer wins?
  – Add wins?
  – Remove wins?
✝Framework from http://bit.ly/shapiro-msr-talk

Slide 24

Slide 24 text

Sets §3.3
● Counterintuitive convergent characterizations
  – G-Set (“grow-only”): can add, cannot remove
  – 2P-Set (“two phase”): once removed, cannot add an element back
    ● Composition of two G-sets
  – LWW-Set
  – PN-Set (positive & negative counters track membership): addition may not yield membership
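The “composition of two G-sets” behind the 2P-Set can be sketched briefly. This is a minimal sketch with hypothetical names (`TwoPhaseSet`, `added`, `removed`); each inner Python `set` plays the role of a G-Set whose join is union.

```python
from typing import Hashable, Set


class TwoPhaseSet:
    """2P-Set built from two grow-only sets (hypothetical API)."""

    def __init__(self) -> None:
        self.added: Set[Hashable] = set()     # G-Set of everything ever added
        self.removed: Set[Hashable] = set()   # G-Set of everything ever removed

    def add(self, element: Hashable) -> None:
        self.added.add(element)

    def remove(self, element: Hashable) -> None:
        # Only an element that is currently present can be removed.
        if self.contains(element):
            self.removed.add(element)

    def contains(self, element: Hashable) -> bool:
        return element in self.added and element not in self.removed

    def merge(self, other: "TwoPhaseSet") -> None:
        # Join of the composite is the pairwise join (union) of its parts.
        self.added |= other.added
        self.removed |= other.removed


s, t = TwoPhaseSet(), TwoPhaseSet()
s.add("e")
t.merge(s)
t.remove("e")
s.add("e")          # re-adding locally looks fine until the removal arrives
s.merge(t)
print(s.contains("e"), t.contains("e"))
```

Once the removal has been merged, no later add of the same element can bring it back, which is the counterintuitive part noted above.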

Slide 25

Slide 25 text

Observed-Remove Set §3.3.5
● Set CRDT with (approximately) intuitive semantics
  – Given S.add(e) || S.remove(e), add wins
● Strategy
  – tag each element uniquely (per actor or per operation): e_τ
  – an operation removing e must include the set of all previously-unremoved tags τ for e
  – Set.contains(e) == true iff an e_τ exists where τ has not been implicated in a removal of e
● Tags are not exposed in userland API

Slide 26

Slide 26 text

Observed-Remove Set Progression & Semilattice
[Diagram: replicas A, B, and C over time t; A holds e_a, B and C hold e_b; internal states progress through {e_a, e_b}, {e_a, -a}, and {e_a, -a, e_b}, with the visible set shown as {e} or ∅ at each step]
● A and B concurrently add e_a and e_b
● A removes e_a; this has no effect on e's membership in C's view of the set because of its knowledge of e_b
● e_b is eventually replicated to A, yielding consistency
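The progression above can be replayed with a small state-based sketch. The `ORSet` class and its fields are hypothetical, and tags are passed explicitly here (`"a"`, `"b"`) only to mirror the slide; by default the sketch generates a random tag per add, which is one of several ways to obtain unique tags.

```python
import uuid
from typing import Hashable, Optional, Set, Tuple

Tagged = Tuple[Hashable, str]  # (element, unique tag)


class ORSet:
    """Observed-remove set with tombstones (hypothetical, state-based sketch)."""

    def __init__(self) -> None:
        self.added: Set[Tagged] = set()
        self.removed: Set[Tagged] = set()  # tombstones for observed tags

    def add(self, element: Hashable, tag: Optional[str] = None) -> None:
        self.added.add((element, tag if tag is not None else uuid.uuid4().hex))

    def remove(self, element: Hashable) -> None:
        # Tombstone only the tags this replica has observed for the element.
        self.removed |= {(e, t) for (e, t) in self.added if e == element}

    def contains(self, element: Hashable) -> bool:
        return any(e == element for (e, _) in self.added - self.removed)

    def merge(self, other: "ORSet") -> None:
        self.added |= other.added
        self.removed |= other.removed


a, b, c = ORSet(), ORSet(), ORSet()
a.add("e", tag="a")          # A and B add e concurrently
b.add("e", tag="b")
c.merge(b)                   # C learns of e_b

a.remove("e")                # A removes e, tombstoning only tag a
a_after_remove = a.contains("e")

c.merge(a)                   # C still sees e because of e_b
a.merge(b)                   # e_b reaches A: add wins
print(a_after_remove, c.contains("e"), a.contains("e"))
```

The tags stay inside the data structure; callers only ever deal with elements.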

Slide 27

Slide 27 text

Graphs §3.4
● Two sets, vertices + edges
● Many different possible constructions given the local invariants one might want to preserve between edges and vertices
● Global invariants cannot be guaranteed because of concurrent operations
  – e.g. cannot prevent cycles

Slide 28

Slide 28 text

Sequences §3.5.2
● Set of (identifier, value) pairs, where identifiers are selected from a dense, totally-ordered set
● Explored deeply in papers on Logoot and Treedoc CRDTs
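To make “dense, totally-ordered identifiers” concrete, here is a deliberately simplified sketch. It assumes identifiers are `(Fraction, replica_id)` pairs and only supports insertion; this is not how Logoot or Treedoc allocate identifiers, and the `Sequence` class is hypothetical. Density means a new position can always be found between two distinct neighbours.

```python
from fractions import Fraction
from typing import List, Set, Tuple

Identifier = Tuple[Fraction, str]  # (position in (0, 1), replica id as tie-breaker)


class Sequence:
    """Insert-only replicated sequence (hypothetical, simplified sketch)."""

    def __init__(self, replica_id: str) -> None:
        self.replica_id = replica_id
        self.entries: Set[Tuple[Identifier, str]] = set()

    def _ordered(self) -> List[Tuple[Identifier, str]]:
        return sorted(self.entries, key=lambda entry: entry[0])

    def insert(self, index: int, value: str) -> None:
        ordered = self._ordered()
        lo = ordered[index - 1][0][0] if index > 0 else Fraction(0)
        hi = ordered[index][0][0] if index < len(ordered) else Fraction(1)
        if lo == hi:
            # Concurrent inserts can share a position; real designs use
            # richer identifiers to keep allocating between such neighbours.
            raise ValueError("neighbours share a position in this sketch")
        self.entries.add((((lo + hi) / 2, self.replica_id), value))

    def merge(self, other: "Sequence") -> None:
        self.entries |= other.entries  # join is set union, as in a G-Set

    def values(self) -> List[str]:
        return [value for _, value in self._ordered()]


a, b = Sequence("A"), Sequence("B")
a.insert(0, "h")
a.insert(1, "l")
a.insert(1, "e")     # lands between "h" and "l"
b.merge(a)

a.insert(3, "!")     # concurrent appends at the same index
b.insert(3, "p")
a.merge(b)
b.merge(a)
print("".join(a.values()), "".join(b.values()))
```

Both replicas agree on the order of the concurrent appends because the replica id breaks the tie between equal positions.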

Slide 29

Slide 29 text

Operation-based CmRDT §2.2.2, §2.3.2
● Requires “reliable broadcast channel”
  – Operations delivered to each replica in causal order <_d
  – All concurrent operations that are unordered with respect to <_d must commute (the 'm' in CmRDT)
● Far more efficient than worst-case state-based specification
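The commutativity requirement can be checked mechanically for a toy case. The sketch below assumes a counter whose operations are hypothetical `("inc", n)` and `("dec", n)` tuples, and it assumes each operation is delivered exactly once; it then applies the same concurrent operations in every possible delivery order.

```python
from functools import reduce
from itertools import permutations
from typing import Tuple

Operation = Tuple[str, int]  # ("inc" | "dec", amount); hypothetical encoding


def apply(state: int, op: Operation) -> int:
    """Apply one delivered operation to a replica's local counter state."""
    kind, amount = op
    if kind == "inc":
        return state + amount
    if kind == "dec":
        return state - amount
    raise ValueError(f"unknown operation: {kind}")


# Three operations issued concurrently, so no causal order constrains them.
concurrent_ops = [("inc", 5), ("dec", 2), ("inc", 1)]

# Every replica may see a different delivery order.
outcomes = {reduce(apply, order, 0) for order in permutations(concurrent_ops)}
print(outcomes)

# Unlike a state-based join, applying an operation twice is NOT harmless,
# which is why the channel must deliver each operation exactly once.
duplicated = reduce(apply, concurrent_ops + [("inc", 5)], 0)
print(duplicated)
```

Only the small operations travel over the network, not the whole state, which is where the efficiency gain comes from.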

Slide 30

Slide 30 text

Operation-based CmRDT: tradeoffs
● More complex, more difficult to reason about
● More challenging to implement
  – Causal relationships between operations must be identified + maintained
  – Generally requires tracking “group membership”

Slide 31

Slide 31 text

Garbage Collection §4
● “Garbage”: additional overhead that accumulates in order to satisfy CRDT semantics
  – “tombstones” (e.g. remove tags in an OR-Set)
  – Unbalanced trees of identifiers in sequences
● Optimistically collecting garbage and rolling back as necessary is an option in some cases
● Other cases appear to require various levels of consensus
● “Garbage” is not always waste
  – The right kind of tombstones is what makes consistent snapshots possible

Slide 32

Slide 32 text

Prior & related work
Lots of prior work had portions of CRDTs' semantics, before “CRDT” was identified as a concept:
– Wuu and Bernstein, 'Efficient solutions to the replicated log and dictionary problems' (1984!)
– Operational transforms
– Any Dynamo-style system uses registers for values
  ● LWW-registers: S3
  ● MV-registers: CouchDB conflicts, Riak siblings

Slide 33

Slide 33 text

Prior & related work
“Consistency as Logical Monotonicity” (CALM theorem)
– s/semilattices/monotonic logic
  ● Stricter semantics than semilattices; no way to characterize non-monotonic operations (remove, etc.) without consensus
– Implemented at the language level by Bloom
  ● Nearly all data structures are monotonic or lattices
  ● Allows for static analysis that identifies parts of your program that aren't monotonic (require synchronization/consensus mechanism to ensure safety)

Slide 34

Slide 34 text

Resources
● Meetup page for this talk: http://bit.ly/pwl-nyc-4
● Shapiro et al. paper: http://bit.ly/shapiro-crdt-pdf
● Shapiro talk @ MSR: http://bit.ly/shapiro-msr-talk
● Chris Meiklejohn's 'Readings in Distributed Systems': http://bit.ly/cmeik-dist-sys-readings
● CRDTs offered in v2.0 of Riak: http://bit.ly/riak-crdts
● Bloom, a Ruby DSL for “disorderly programming”, an implementation of CALM: http://www.bloom-lang.net
● The Quilt Project: http://quilt.org