Chas Emerick @cemerick @QuiltProject Papers We Love NYC #4 May 15, 2014 for your edification and entertainment, an audiovisual précis of: A comprehensive study of Convergent and Commutative Replicated Data Types by Shapiro, Preguiça, Baquero, and Zawirski (2011)
@papers_we_love NYC #4, 2014-05-15 @cemerick / @QuiltProject
Preface
● “Who is this guy?”
● Not going to follow the order of topics from the paper exactly
● Key topics will be introduced along with section numbers from the paper (e.g. §4.3)
● I'll be drawing upon materials used in the authors' presentations related to this paper (see references for links)
Itinerary
● The paper
  – Motivating problem
  – Theoretically- and algebraically-sound solution
  – Specification of practical data type designs
  – Challenges
  – Related work, further reading
● Impact outside of academia
● Questions & discussion
The problem
Conflicting concurrent modifications
[Figure: replicas A and B over time t. Both start from x; A applies Δa and B applies Δb concurrently, leaving x Δa ∥ x Δb with no defined result (???).]
The problem
Conflicting concurrent modifications require consensus
[Figure: replicas A and B over time t. Both start from x; the updates Δa and Δb are agreed upon, and both replicas end at x Δab.]
The problem
Conflicting concurrent modifications require consensus, which is partition-intolerant and affects availability:
[Figure: replicas A and B over time t. Both start from x; the updates Δa and Δb are agreed upon, and both replicas end at x Δab.]
Linearizability✝
● Transactional databases, Redis, consensus services (Paxos/Raft)
● Global consensus → consistency
● Total order of all events
● Very expensive & constrains availability
[Figure: replicas A and B over time t. Both start from x; the updates Δa and Δb are agreed upon, and both replicas end at x Δab.]
✝As well as strict serializability, which has even stronger guarantees.
Thus, eventual consistency
● No guarantee of ordering of events
● Maximal availability & performance
● How to reconcile results from concurrent operations?
[Figure: replicas A and B over time t. Both start from x; A applies Δa and B applies Δb concurrently, leaving x Δa ∥ x Δb with no defined result (???).]
How to reconcile results of concurrent operations?
● “Background” (deferred) consensus
  – Post-hoc resolution or rollback of conflicting updates
● This is what we do today, all the time!
  – Resolving CouchDB conflicts and Riak siblings within applications
  – Merging (semi-)textual content via diffs
● Very difficult to implement correctly, and no guiding formalisms to indicate correctness or warn against problems
Proposed solution
Conflict-free Replicated Data Types (CRDTs) deterministically reconcile concurrent updates such that no conflicts arise
– Performance, availability, scale of eventual consistency + reliable reconciliation as if you were using a consensus mechanism
– Provably sound
– Limitations:
  ● No consensus → limitations on what can be stored, replicated, and reconciled (i.e. no global invariants)
  ● Unbounded growth → “garbage collection”
Chocolate and vanilla CRDTs
● What is replicated?
  – Entire state of the datatype? State-based, a.k.a. convergent replicated data type, a.k.a. CvRDT
  – Individual operations (+ arguments)? Operation-based, a.k.a. commutative replicated data type, a.k.a. CmRDT
  – Options correspond to the two strategies for implementing optimistic replication✝
● These are formally equivalent (§2.4)
  – Strategy: understand state-based constructions, move on to operation-based as optimization
✝http://research.microsoft.com/apps/pubs/default.aspx?id=66979
State-based (CvRDT) §2.2.1, §2.3.1
All possible states within a CvRDT form a semilattice
– Partially-ordered set established by a least upper bound (join, ⊔) or greatest lower bound (meet, ⊓)
– Both operations are definitionally commutative, associative, and idempotent
– Each application of join or meet yields a monotonically increasing or decreasing value
[Figure: semilattice over time t: ø → {a} and {b} → {a,b}]
State-based (CvRDT)
● join and meet are formally equivalent; join is presumed throughout the literature
● Update locally, propagate the resulting state to other replicas, where states must converge (the 'v' in “CvRDT”)
● Requires only the weakest eventual consistency guarantees to yield convergence among all replicas, since join is commutative, associative, and idempotent
  – “infinitely often” transmission of state
  – Insensitive to reordered/dropped/repeated messages
  – Very expensive worst case, but easier to reason about
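The insensitivity to reordered and repeated messages can be checked mechanically. The following is a minimal sketch, assuming the simplest possible CvRDT (a grow-only set whose join is set union); the names `join`, `merge_all`, and the three replica states are illustrative and do not come from the paper.

```python
from itertools import permutations

def join(a: frozenset, b: frozenset) -> frozenset:
    """Least upper bound of two grow-only set states (set union)."""
    return a | b

def merge_all(states) -> frozenset:
    """Fold join over a sequence of received states, starting from the empty set."""
    acc = frozenset()
    for s in states:
        acc = join(acc, s)
    return acc

# Hypothetical states held by three replicas.
states = (frozenset({"a"}), frozenset({"b"}), frozenset({"a", "c"}))

# Deliver the states in every possible order, with every message repeated once.
results = {merge_all(order + order) for order in permutations(states)}
print(results)
```

Every delivery order, including the duplicated messages, collapses to the same single state, which is what commutativity, associativity, and idempotence of join buy.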
Language for specifying asynchronous replication
● More than boxes, arrows
● Better than (most) pseudocode: explicit about preconditions, where things happen, (a)synchrony, etc.
LWW (last writer wins) register §3.2.1
● Ensures only a single value in register
● Semilattice is ordered by timestamps
[Figure: replicas A and B over time t. Both start from ∅; one assigns x at t1, the other assigns y at t2, and both converge on y (t2).]
Mapping LWW-register to its semilattice
[Figure: the replica timeline from the previous slide, alongside its semilattice over time t: ø → [t1,x] → [t2,y]]
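The timestamp-ordered semilattice can be sketched in a few lines. This is a minimal illustration, not the paper's specification: the class name `LWWRegister`, integer timestamps, and the use of 0 for "never assigned" are assumptions, and timestamps are assumed to be unique across replicas so that ties never need breaking.

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class LWWRegister:
    timestamp: int = 0   # assumption: 0 stands for "never assigned"
    value: Any = None

    def merge(self, other: "LWWRegister") -> "LWWRegister":
        """Join: keep whichever state carries the larger timestamp."""
        return other if other.timestamp > self.timestamp else self

    def assign(self, value: Any, timestamp: int) -> "LWWRegister":
        """Local update, expressed as a merge so the state never moves backwards."""
        return self.merge(LWWRegister(timestamp, value))

a = LWWRegister().assign("x", 1)   # one replica writes x at t1
b = LWWRegister().assign("y", 2)   # the other writes y at t2
print(a.merge(b), b.merge(a))      # both merge orders agree on y at t2
```

Because `merge` only ever moves to a larger timestamp, it behaves like `max`, which is commutative, associative, and idempotent.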
MV (multi-value) register §3.2.2
● Assignments carry causal history (e.g. version vector) which defines semilattice's partial order
● join retains all values assigned concurrently; some client can later assign a single value
[Figure: replicas A and B over time t. Both start from ∅; A assigns x (Δa) and B assigns y (Δb) concurrently, so both hold x Δa ∥ y Δb until a client assigns z with history [Δa,Δb,Δc].]
Mapping MV-register to its semilattice
[Figure: the replica timeline from the previous slide, alongside its semilattice over time t: ø → [Δa,x] and [Δb,y] → #{[Δa,x] [Δb,y]} → [[Δa,Δb,Δc],z]]
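The progression above can be approximated with version vectors. This is a rough sketch under stated assumptions: state is a list of (value, version vector) pairs, version vectors are plain dicts keyed by replica name, and the helper names `dominates`, `merge`, and `assign` are invented for illustration rather than taken from the paper.

```python
def dominates(v: dict, w: dict) -> bool:
    """True if version vector v >= w on every entry (missing entries count as 0)."""
    return all(v.get(k, 0) >= n for k, n in w.items())

def merge(s1: list, s2: list) -> list:
    """Join: keep each (value, vv) pair that no other pair strictly dominates."""
    both = s1 + s2
    kept = []
    for val, vv in both:
        dominated = any(dominates(w, vv) and w != vv for _, w in both)
        if not dominated and (val, vv) not in kept:
            kept.append((val, vv))
    return kept

def assign(state: list, replica: str, value) -> list:
    """Overwrite everything seen so far: take the pointwise max, then bump our entry."""
    vv = {}
    for _, w in state:
        for k, n in w.items():
            vv[k] = max(vv.get(k, 0), n)
    vv[replica] = vv.get(replica, 0) + 1
    return [(value, vv)]

a = assign([], "A", "x")            # A assigns x
b = assign([], "B", "y")            # B concurrently assigns y
siblings = merge(a, b)              # neither history dominates: both values kept
c = assign(siblings, "C", "z")      # a client that saw both assigns z
print(siblings, merge(siblings, c))
```

After the client's assignment, z's version vector dominates both earlier ones, so merging in either direction leaves only z.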
Observed-Remove Set §3.3.5
● Set CRDT with ≈ intuitive semantics
  – Given S.add(e) || S.remove(e), add wins
● Strategy
  – tag each element uniquely (per actor or per operation), e_τ
  – operation removing e must include set of all previously-unremoved τ for e
  – Set.contains(e) == true iff an e_τ exists where τ has not been implicated in a removal of e
● Tags are not exposed in userland API
Observed-Remove Set Progression & Semilattice
[Figure: replicas A, B, and C over time t, showing tagged elements e_a, e_b and removal -a at each replica, alongside the user-visible set ({e} or ∅) at each step]
● A and B concurrently add e_a and e_b
● A removes e_a; this has no effect on e's membership in C's view of the set because of its knowledge of e_b
● e_b is eventually replicated to A, yielding consistency
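The tagging strategy can be sketched as a state-based set that keeps tombstones for removed tags. This is a simplified illustration, not the paper's specification: the class name `ORSet`, the use of `uuid4` for unique tags, and the two-set representation are assumptions.

```python
import uuid

class ORSet:
    def __init__(self):
        self.added = set()     # every (element, tag) pair ever added
        self.removed = set()   # tombstones: (element, tag) pairs observed and removed

    def add(self, e):
        # Each add gets a fresh tag, so concurrent adds of e are distinguishable.
        self.added.add((e, uuid.uuid4().hex))

    def remove(self, e):
        # Only tags this replica has observed are removed.
        self.removed |= {(x, t) for (x, t) in self.added if x == e}

    def contains(self, e) -> bool:
        return any(x == e for (x, _) in self.added - self.removed)

    def merge(self, other: "ORSet"):
        # Join: union of both components.
        self.added |= other.added
        self.removed |= other.removed

a, b = ORSet(), ORSet()
a.add("e")          # A adds e with tag a
b.add("e")          # B concurrently adds e with tag b
a.remove("e")       # A removes only the tag it has seen
a.merge(b)
b.merge(a)
print(a.contains("e"), b.contains("e"))
```

The removal at A never saw B's tag, so B's add survives the merge and e remains a member at both replicas: add wins.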
Graphs §3.4
● Two sets, vertices + edges
● Many different possible constructions given the local invariants one might want to preserve between edges and vertices
● Global invariants cannot be guaranteed because of concurrent operations
  – e.g. cannot prevent cycles
Sequences §3.5.2
● Set of (identifier, value) where identifiers are selected from a dense, totally-ordered set
● Explored deeply in papers on Logoot and Treedoc CRDTs
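A toy version of the (identifier, value) idea, assuming rational numbers as the dense identifier space and a replica name as a tie-breaker. This is not how Logoot or Treedoc allocate identifiers; the helpers `between`, `insert`, and `read` are hypothetical, and the sketch does not handle inserting between two elements that ended up sharing the same position.

```python
from fractions import Fraction

START, END = Fraction(0), Fraction(1)   # assumed sentinels bounding the sequence

def between(lo: Fraction, hi: Fraction) -> Fraction:
    """Density: there is always a position strictly between two distinct positions."""
    return (lo + hi) / 2

def insert(state: frozenset, lo: Fraction, hi: Fraction, replica: str, value: str) -> frozenset:
    """Add (position, replica, value); the replica name breaks ties between equal positions."""
    return state | {(between(lo, hi), replica, value)}

def read(state: frozenset) -> list:
    """Sort by identifier to recover the sequence."""
    return [value for _, _, value in sorted(state)]

base = insert(frozenset(), START, END, "A", "H")   # position 1/2
a = insert(base, Fraction(1, 2), END, "A", "i")    # A appends at 3/4
b = insert(base, Fraction(1, 2), END, "B", "!")    # B concurrently appends at 3/4
print(read(a | b), read(b | a))
```

Merging is plain set union, so both replicas read the same sequence regardless of merge order; the concurrent inserts at the same position are ordered by replica name.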
Operation-based CmRDT §2.2.2, §2.3.2
● Requires “reliable broadcast channel”
  – Operations delivered to each replica in causal order <_d
  – All concurrent operations that are unordered with respect to <_d must commute (the 'm' in CmRDT)
● Far more efficient than worst-case state-based specification
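The commutativity requirement, and the reason for the reliable channel, can be illustrated with a counter whose operations are increments and decrements. This is a minimal sketch: the operation encoding and the names `apply_op` and `replay` are assumptions, all three operations are treated as concurrent, and causal delivery itself is not modelled.

```python
from itertools import permutations

def apply_op(state: int, op: tuple) -> int:
    """Apply one ('inc' | 'dec', amount) operation to the counter."""
    kind, amount = op
    return state + amount if kind == "inc" else state - amount

def replay(ops) -> int:
    """Apply a sequence of delivered operations to a fresh replica."""
    state = 0
    for op in ops:
        state = apply_op(state, op)
    return state

ops = [("inc", 5), ("dec", 2), ("inc", 1)]

# Concurrent operations commute: every delivery order gives the same result.
finals = {replay(order) for order in permutations(ops)}

# Operations are not idempotent: a duplicated delivery changes the outcome.
duplicated = replay(ops + [ops[0]])
print(finals, duplicated)
```

Unlike the state-based join, replaying an operation twice gives a different answer, which is why the channel must deliver each operation exactly once.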
Operation-based CmRDT: tradeoffs
● More complex, more difficult to reason about
● More challenging to implement
  – Causal relationships between operations must be identified + maintained
  – Generally requires tracking “group membership”
Garbage Collection §4
● “Garbage”: additional overhead that accumulates in order to satisfy CRDT semantics
  – “tombstones” (e.g. remove tags in an OR-Set)
  – Unbalanced trees of identifiers in sequences
● Optimistically collecting garbage and rolling back as necessary is an option in some cases
● Others appear to require various levels of consensus to achieve
● “Garbage” is not always waste
  – The right kind of tombstones are what make consistent snapshots possible
Prior & related work
Lots of prior work had portions of CRDTs' semantics, before “CRDT” was identified as a concept:
– Wuu and Bernstein, 'Efficient solutions to the replicated log and dictionary problems' (1984!)
– Operational transforms
– Any Dynamo-style system uses registers for values
  ● LWW-registers: S3
  ● MV-registers: CouchDB conflicts, Riak siblings
Prior & related work
“Consistency as Logical Monotonicity” (CALM theorem)
– s/semilattices/monotonic logic
  ● Stricter semantics than semilattices; no way to characterize non-monotonic operations (remove, etc.) without consensus
– Implemented at the language level by Bloom
  ● Nearly all data structures are monotonic or lattices
  ● Allows for static analysis that identifies parts of your program that aren't monotonic (require synchronization/consensus mechanism to ensure safety)