An audiovisual précis of 'A comprehensive study of Convergent and Commutative Replicated Data Types'

Chas Emerick @cemerick @QuiltProject Papers We Love NYC #4 May
15, 2014 for your edification and entertainment, an audiovisual précis of: A comprehensive study of Convergent and Commutative Replicated Data Types by Shapiro, Preguiça, Baquero, and Zawirski (2011)

@papers_we_love NYC #4, 2014-05-15 @cemerick / @QuiltProject Preface • “Who
is this guy?” • Not going to follow the order of topics from the paper exactly • Key topics will be introduced along with section numbers from the paper (e.g. §4.3) • I'll be drawing upon materials used in the authors' presentations related to this paper (see references for links)

@papers_we_love NYC #4, 2014-05-15 @cemerick / @QuiltProject Itinerary • The
paper – Motivating problem – Theoretically- and algebraically-sound solution – Specification of practical data type designs – Challenges – Related work, further reading • Impact outside of academia • Questions & discussion

@papers_we_love NYC #4, 2014-05-15 @cemerick / @QuiltProject Actors within distributed
systems exchange and share state A t B x a x a x b x b x c x c

@papers_we_love NYC #4, 2014-05-15 @cemerick / @QuiltProject The problem Conflicting
concurrent modifications A t B x x x Δb x Δa x Δa ∥x Δb Δ b Δ a x ???

concurrent modifications require consensus A t B x x x Δab x Δa x Δab Δ b Δ a x Δa

concurrent modifications require consensus, which is partition-intolerant and affects availability: A t B x x x Δab x Δa x Δab Δ b Δ a x Δa

@papers_we_love NYC #4, 2014-05-15 @cemerick / @QuiltProject Linearizability✝ • Transactional
databases, Redis, consensus services (Paxos/Raft) • Global consensus → consistency • Total order of all events • Very expensive & constrains availability A t B x x x Δab x Δa x Δab Δ b Δ a x Δa ✝As well as strict serializability, which has even stronger guarantees.

@papers_we_love NYC #4, 2014-05-15 @cemerick / @QuiltProject Thus, eventual consistency
• No guarantee of ordering of events • Maximal availability & performance • How to reconcile results from concurrent operations? A t B x x x Δb x Δa x Δa ∥x Δb Δ b Δ a x ???

@papers_we_love NYC #4, 2014-05-15 @cemerick / @QuiltProject How to reconcile
results of concurrent operations? • “Background” (deferred) consensus – Post-hoc resolution or rollback of conflicting updates • This is what we do today, all the time! – Resolving CouchDB conflicts and Riak siblings within applications – Merging (semi-)textual content via diffs • Very difficult to implement correctly, and no guiding formalisms to indicate correctness or warn against problems

@papers_we_love NYC #4, 2014-05-15 @cemerick / @QuiltProject Proposed solution Conflict-free
Replicated Data Types (CRDTs) deterministically reconcile concurrent updates such that no conflicts arise – Performance, availability, scale of eventual consistency + reliable reconciliation as if you were using a consensus mechanism – Provably sound – Limitations: • No consensus → limitations on what can be stored, replicated, and reconciled (i.e. no global invariants) • Unbounded growth → “garbage collection”

@papers_we_love NYC #4, 2014-05-15 @cemerick / @QuiltProject Chocolate and vanilla
CRDTs • What is replicated? – Entire state of the datatype? State-based, a.k.a. convergent replicated data type, a.k.a. CvRDT – Individual operations (+ arguments)? Operation-based, a.k.a. commutative replicated data type a.k.a. CmRDT – Options correspond to the two strategies for implementing optimistic replication✝ • These are formally equivalent §2.4 – Strategy: understand state-based constructions, move on to operation-based as optimization ✝http://research.microsoft.com/apps/pubs/default.aspx?id=66979

@papers_we_love NYC #4, 2014-05-15 @cemerick / @QuiltProject Statebased (CvRDT) §2.2.1,
§2.3.1 All possible states within a CvRDT form a semilattice – Partially-ordered set established by a least upper bound (join, ) or greatest ≤ lower bound (meet, ) ≥ – Both relations are definitionally commutative, associative, and idempotent – Each application of join or meet yields a monotonically increasing or decreasing value {b} ø {a} {a} {a,b} t

@papers_we_love NYC #4, 2014-05-15 @cemerick / @QuiltProject Statebased (CvRDT) •
join and meet are formally equivalent, join presumed throughout the literature • Update locally, propagate results to other replicas, where it must converge (the 'v' in “CvRDT”) • Requires weakest eventual consistency guarantees to yield convergence among all replicas, since join is associative and commutative – “infinitely often” transmission of state – Insensitive to reordered/dropped/repeated messages – Very expensive worst case, but easier to reason about

@papers_we_love NYC #4, 2014-05-15 @cemerick / @QuiltProject Language for specifying
asynchronous replication • More than boxes, arrows • Better than (most) pseudocode: explicit about preconditions, where things happen, (a)synchrony, etc

@papers_we_love NYC #4, 2014-05-15 @cemerick / @QuiltProject Language for specifying
asynchronous replication

@papers_we_love NYC #4, 2014-05-15 @cemerick / @QuiltProject A “Portfolio of
basic CRDTs” §3 • Counters • Registers • Sets • Sequences • CRDTs compose and retain their characteristics – Sets → maps, multimaps, graphs

@papers_we_love NYC #4, 2014-05-15 @cemerick / @QuiltProject Designing a register
CRDT✝ • Sequential specification: – R.set(v) → R.get() == v • Join relation over R.set(v a ) || R.set(v b ) – Linearizable? – Error state? – Last writer wins? ✝Framework from http://bit.ly/shapiro-msr-talk

@papers_we_love NYC #4, 2014-05-15 @cemerick / @QuiltProject LWW (last writer
wins) register §3.2.1 • Ensures only a single value in register • Semilattice is ordered by timestamps A t B ∅ y t2 x t1 y t2 Δ t2 Δ t1 ∅ y t2

@papers_we_love NYC #4, 2014-05-15 @cemerick / @QuiltProject Mapping LWWregister to
its semilattice A t B ∅ y t2 x t1 y t2 Δ t2 Δ t1 ∅ y t2 [t 2 ,y] ø ø [t 1 ,x] [t 2 ,y] t

@papers_we_love NYC #4, 2014-05-15 @cemerick / @QuiltProject MV (multivalue) register
§3.2.2 • Assignments carry causal history (e.g. version vector) which defines semilattice's partial order • join retains all values assigned concurrently; some client can later assign a single value A t B ∅ y Δb x Δa x Δa ∥y Δb Δ b Δ a ∅ x Δa ∥y Δb client z [Δa,Δb,Δc] z [Δa,Δb,Δc]

@papers_we_love NYC #4, 2014-05-15 @cemerick / @QuiltProject Mapping MVregister to
its semilattice [Δb,y] ø [Δa,x] #{[Δa,x] [Δb,y]} t [[Δa,Δb,Δc],z] A t B ∅ y Δb x Δa x Δa ∥y Δb Δ b Δ a ∅ x Δa ∥y Δb client z [Δa,Δb,Δc] z [Δa,Δb,Δc]

@papers_we_love NYC #4, 2014-05-15 @cemerick / @QuiltProject Designing a set
CRDT✝ • Sequential specification: – S.add(e) → S.contains(e) == true – S.remove(e) → S.contains(e) == false • Join relation over S.add(e) || S.remove(e) – Linearizable? – Disallow removals? – Error state? – Last writer wins? – Add wins? – Remove wins? ✝Framework from http://bit.ly/shapiro-msr-talk

@papers_we_love NYC #4, 2014-05-15 @cemerick / @QuiltProject Sets §3.3 •
Counterintuitive convergent characterizations – G-Set (“grow-only”): can add, cannot remove – 2P-Set (“two phase”): once removed, cannot add an element back • Composition of two G-sets – LWW-Set – PN-Set (positive & negative counters track membership): addition may not yield membership

@papers_we_love NYC #4, 2014-05-15 @cemerick / @QuiltProject ObservedRemove Set §3.3.5
• Set CRDT with intuitive semantics ≈ – Given S.add(e) || S.remove(e), add wins • Strategy – tag each element uniquely (per actor or per operation), e τ – operation removing e must include set of all previously-unremoved τ for e – Set.contains(e) == true iff an e τ exists where τ has not been implicated in a removal of e • Tags are not exposed in userland API

@papers_we_love NYC #4, 2014-05-15 @cemerick / @QuiltProject ObservedRemove Set Progression
& Semilattice e a A e b B e b C e a,b e a,-a e a,-a,b {e} ∅ e b,a,-a {e} {e} {e} ∅ ∅ ∅ {e} {e} t • A and B concurrently add e a and e b • A removes e a ; this has no effect on e's membership in C's view of the set because of its knowledge of e b • e b is eventually replicated to A, yielding consistency

@papers_we_love NYC #4, 2014-05-15 @cemerick / @QuiltProject Graphs §3.4 •
Two sets, vertices + edges • Many different possible constructions given the local invariants one might want to preserve between edges and vertices • Global invariants cannot be guaranteed because of concurrent operations – e.g. cannot prevent cycles

@papers_we_love NYC #4, 2014-05-15 @cemerick / @QuiltProject Sequences §3.5.2 •
Set of (identifier, value) where identifiers are selected from a dense, totally-ordered set • Explored deeply in papers on Logoot and Treedoc CRDTs

@papers_we_love NYC #4, 2014-05-15 @cemerick / @QuiltProject Operationbased CmRDT §2.2.2,
§2.3.2 • Requires “reliable broadcast channel” – Operations delivered to each replica in causal order < d – All concurrent operations that are unordered with respect to < d must commute (the 'm' in CmRDT) • Far more efficient than worst-case state-based specification

@papers_we_love NYC #4, 2014-05-15 @cemerick / @QuiltProject Operationbased CmRDT: tradeoffs
• More complex, more difficult to reason about • More challenging to implement – Causal relationships between operations must be identified + maintained – Generally requires tracking “group membership”

@papers_we_love NYC #4, 2014-05-15 @cemerick / @QuiltProject Garbage Collection §4
• “Garbage”: additional overhead that accumulates in order to satisfy CRDT semantics – “tombstones” (e.g. remove tags in an OR-Set) – Unbalanced trees of identifiers in sequences • Optimistically collecting garbage and rolling back as necessary is an option in some cases • Others appear to require various levels of consensus to achieve • “Garbage” is not always waste – The right kind of tombstones are what makes consistent snapshot possible

@papers_we_love NYC #4, 2014-05-15 @cemerick / @QuiltProject Prior & related
work Lots of prior work had portions of CRDTs' semantics, before “CRDT” was identified as a concept: – Wuu and Bernstein, 'Efficient solutions to the replicated log and dictionary problems' (1984!) – Operational transforms – Any Dynamo-style system uses registers for values • LWW-registers: S3 • MV-registers: CouchDB conflicts, Riak siblings

@papers_we_love NYC #4, 2014-05-15 @cemerick / @QuiltProject Prior & related
work “Consistency as Logical Monotonicity” (CALM theorem) – s/semilattices/monotonic logic • Stricter semantics than semilattices; no way to characterize non-monotonic operations (remove, etc) without consensus – Implemented at the language level by Bloom • Nearly all data structures are monotonic or lattices • Allows for static analysis that identifies parts of your program that aren't monotonic (require synchronization/consensus mechanism to ensure safety)

@papers_we_love NYC #4, 2014-05-15 @cemerick / @QuiltProject Resources • Meetup
page for this talk: http://bit.ly/pwl-nyc-4 • Shapiro et al. paper: http://bit.ly/shapiro-crdt-pdf • Shapiro talk @ MSR: http://bit.ly/shapiro-msr-talk • Chris Meiklejohn's 'Readings in Distributed Systems': http://bit.ly/cmeik-dist-sys-readings • CRDTs offered in v2.0 of Riak: http://bit.ly/riak-crdts • Bloom, a Ruby DSL for “disorderly programming”, an implementation of CALM: http://www.bloom-lang.net • The Quilt Project: http://quilt.org

An audiovisual précis of 'A comprehensive study...

An audiovisual précis of 'A comprehensive study of Convergent and Commutative Replicated Data Types'

More Decks by Chas Emerick

Other Decks in Technology

Featured

Transcript