An early discussion of Progressive Systems, CAP and CALM, using RAMP and Bw-trees as examples of Progressive design. Presented at LinkedIn NYC, 9/30/2015.
System Poetry

“…the consistency mechanisms down to a minimum, move them off the critical path, hide them in a rarely visited corner of the system, and then make it as hard as possible for application developers to get permission to use them.”
—James Hamilton (IBM, MS, Amazon)
(The slide repeats in a later build, relabeling the consistency mechanisms as coordination.)
Writes: X = 1, Y = 1.  Reads: X = ?, Y = ?

Either all or none of each transaction’s updates should be visible to other transactions:
  read X = 0, Y = 0   OR   read X = 1, Y = 1
Seeing X = 1 but Y = 0 is a “FRACTURED READ”.
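As a toy illustration of the anomaly above, here is a minimal sketch (names are invented for illustration) of two independent per-key servers where a transaction’s writes land one at a time, so a concurrent reader can observe a fractured state:

```python
# Hypothetical two-server key-value store: each key lives on its own
# server, and writes land independently (no atomicity across servers).
store_x = {"value": 0}  # server holding X
store_y = {"value": 0}  # server holding Y

# T1 intends to write X=1 and Y=1, but the write to Y is still in flight.
store_x["value"] = 1    # write to X has arrived
# ... write to Y has not arrived yet ...

# A concurrent reader now sees T1's write to X but not its write to Y:
snapshot = (store_x["value"], store_y["value"])
print(snapshot)  # (1, 0)  <- a "fractured read": only part of T1 is visible
```

Nothing here is broken at the level of a single key; the anomaly exists only across keys, which is exactly what atomic visibility rules out.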
• A (one-way) ready bit at each node.
• Every node has a (monotonically increasing) highest timestamp committed.
• Immutable data with (monotonically increasing) timestamps.
• Every transaction is assigned a timestamp from a (monotonically increasing) counter.
(Figure: timestamped versions X=0 @ T10 and X=1 @ T13, read by transaction T13.)
RAMP Transactions: DETECT RACES via intention metadata; REPAIR ATOMICITY via multi-versioning and a ready bit.

DETECT RACES via intention metadata: each version stores [value | timestamp | intention], e.g. [X=1 | T1 | {Y}]: “A transaction called T1 wrote this and also wrote to Y.”

The race: after W(X=1), W(Y=1), a concurrent reader of R(X), R(Y) may get X = 1 but Y = 0 ([Y=0 | T0 | {}]). Where is T1’s write to Y? The intention metadata on X’s version exposes the missing write.

REPAIR ATOMICITY via multi-versioning and the ready bit.

Writes:
1.) Place write on each server.
2.) Set ready bit on each write on each server.

Reads:
1.) Fetch “highest” ready versions.
2.) Fetch any missing writes using metadata.

Ready bit monotonicity: once the ready bit is set, all writes in the transaction are present on their respective servers.
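The two-round write and repairing read described above can be sketched in a few dozen lines. This is a minimal single-process sketch in the spirit of the slides, not the RAMP paper’s implementation; the names (`Server`, `ramp_write`, `ramp_read`) are illustrative:

```python
class Server:
    """One logical server per key: immutable versions plus a ready bit."""
    def __init__(self):
        self.versions = {}  # (key, ts) -> (value, sibling write set)
        self.ready = {}     # key -> highest ready (committed) timestamp

    def prepare(self, key, ts, value, siblings):
        self.versions[(key, ts)] = (value, siblings)

    def commit(self, key, ts):
        # Ready-bit monotonicity: called only after ALL of the transaction's
        # writes were placed; the highest ready timestamp only ever grows.
        self.ready[key] = max(self.ready.get(key, -1), ts)

    def get_ready(self, key):
        ts = self.ready.get(key, -1)
        if ts < 0:
            return -1, None, set()
        value, siblings = self.versions[(key, ts)]
        return ts, value, siblings

def ramp_write(servers, ts, writes):
    keys = set(writes)
    for k, v in writes.items():            # round 1: place the writes
        servers[k].prepare(k, ts, v, keys - {k})
    for k in writes:                       # round 2: flip the ready bits
        servers[k].commit(k, ts)

def ramp_read(servers, keys):
    result, fetched, needed = {}, {}, {}
    for k in keys:                         # round 1: highest ready versions
        ts, value, siblings = servers[k].get_ready(k)
        result[k], fetched[k] = value, ts
        for s in siblings & keys:          # intention metadata names the race
            needed[s] = max(needed.get(s, -1), ts)
    for k in keys:                         # round 2: repair missing writes
        if needed.get(k, -1) > fetched[k]:
            result[k] = servers[k].versions[(k, needed[k])][0]
    return result

# Simulate a writer caught between its two rounds: both writes are placed,
# but only X's ready bit is set.
servers = {"x": Server(), "y": Server()}
ramp_write(servers, 0, {"x": 0, "y": 0})
for k in ("x", "y"):
    servers[k].prepare(k, 1, 1, {"x", "y"} - {k})
servers["x"].commit("x", 1)

print(ramp_read(servers, {"x", "y"}))  # {'x': 1, 'y': 1} after repair
```

A reader of both keys either repairs its way to all of T1’s writes (as above) or, had it missed X’s ready bit too, would see none of them; a fractured result is never returned.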
…<2x benefit
• Improving multi-core scalability? <2x benefit
Solution: reduce # of instructions per transaction. By a LOT!
  10x faster? 90% fewer instructions
  100x faster? 99% fewer instructions
Q: Where are the inner-loop instructions?
A: Index access, especially latching and locking.
Answer: no latches, no locks. i.e. Avoid Coordination.
Diaconu, et al. “Hekaton: SQL Server’s Memory-Optimized OLTP Engine”. SIGMOD 2013.
Bw-Tree Delta Updates

Δ: Insert record 50
Δ: Delete record 48
Δ: Update record 35
Δ: Insert record 60
Consolidated Page P

• Each page update produces a new address (the delta).
• Install the new page address in the map using compare-and-swap (CAS).
• Only one winner on concurrent updates to the same address.
• Eventually install a new consolidated page with the deltas applied.
• Single-page updates are easy; the same approach solved node splits and deletes.

Coordination happens here, via the CAS instruction. A monotonic log of updates; a monotonic accumulation of versions.
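The latch-free update path above can be sketched briefly. This is an illustrative single-process model, not the Bw-tree’s actual code: a mapping table maps page IDs to chain heads, and the only mutation is a CAS on a table slot (simulated here with a lock, since Python has no hardware CAS):

```python
import threading

class MappingTable:
    """Page ID -> current head of the page's delta chain."""
    def __init__(self):
        self._slots = {}
        self._lock = threading.Lock()  # stands in for the atomicity of CAS

    def get(self, pid):
        return self._slots.get(pid)

    def cas(self, pid, expected, new):
        # Succeeds only if the slot still holds `expected`: one winner
        # among concurrent updaters of the same page.
        with self._lock:
            if self._slots.get(pid) is expected:
                self._slots[pid] = new
                return True
            return False

class Delta:
    """An immutable update record prepended to the page's chain."""
    def __init__(self, op, key, next_node):
        self.op, self.key, self.next = op, key, next_node

def prepend_delta(table, pid, op, key):
    # Retry loop: a loser of the CAS race rebuilds its delta on the new head.
    while True:
        head = table.get(pid)
        delta = Delta(op, key, head)
        if table.cas(pid, head, delta):
            return delta
```

Note that the page itself is never modified: updates only grow the chain, and the single mutable word is the mapping-table pointer.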
…consolidation that creates a new “re-organized” base page containing all the entries from the original base page as modified by the updates from the delta chain. We trigger consolidation if an accessor thread, during a page search, notices a delta chain length has exceeded a system threshold. The thread performs consolidation after attempting its update (or read) operation. When consolidating, the thread first creates a new base page (a new block of memory). It then populates the base page with a sorted vector containing the most recent version of a record from either the delta chain or old base page (deleted records are discarded). The thread then installs the new address of the consolidated page in the mapping table with a CAS. If it succeeds, the thread requests garbage collection (memory reclamation) of the old page state. Figure 2(b) provides an example depicting the consolidation of page P that incorporates deltas into a new “Consolidated Page P”. If this CAS fails, the thread abandons the operation by deallocating the new page. The thread does not retry, as a subsequent thread will eventually perform a successful consolidation.

C. Range Scans

A range scan is specified by a key range (low key, high key). Either of the boundary keys can be omitted, meaning that one end of the range is open-ended. A scan will also specify either an ascending or descending key order for delivering the records. Our description here assumes both boundary keys are provided and the ordering is ascending. The other scan options are simple variants. A scan maintains a cursor providing a key indicating how…

Fig. 3. Split example. Dashed arrows represent logical pointers, while solid arrows represent physical pointers. ((a) creating sibling page Q; (b) installing split delta; (c) installing index entry delta.)

…deallocate the old page state while another thread still accesses it. Similar concerns arise when a page is removed from the Bw-tree. That is, other threads may still access the now removed page.
We must protect threads accessing reclaimed and potentially “repurposed” objects by preventing reclamation until such accesses are finished. This is done by a thread executing within an “epoch”. An epoch mechanism is a way of protecting objects being deallocated from being re-used too early. A thread joins an epoch when it wants to protect objects it is using (e.g., searching) from being reclaimed, and exits the epoch when this dependency is finished.
Page “updates” are actually appends to a progressively growing log. Only Ptrs mutate (via the CAS instruction).

Page Splits
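The consolidation step quoted from the paper can be sketched compactly. This is an illustrative stand-alone model (the `slot_get`/`slot_cas` helpers and tuple-encoded deltas are invented for the sketch): fold the chain newest-first into a new sorted base page, then CAS it into the mapping-table slot, abandoning on failure rather than retrying:

```python
def consolidate(slot_get, slot_cas, pid):
    """Fold page `pid`'s delta chain into a fresh base page and CAS it in.

    A delta is a tuple (op, key, value, next); the chain ends at the old
    base page, modeled as a dict key -> value (or None if the page is empty).
    """
    head = slot_get(pid)
    latest, node = {}, head
    while isinstance(node, tuple):
        op, key, value, node = node
        # Walking newest-first, the first mention of a key wins.
        latest.setdefault(key, (op, value))
    for key, value in (node or {}).items():  # entries from the old base page
        latest.setdefault(key, ("insert", value))
    # New base page: most recent version of each record, sorted by key,
    # with deleted records discarded.
    new_base = {k: v for k, (op, v) in sorted(latest.items())
                if op != "delete"}
    # Install via CAS. On failure, abandon: a later thread will consolidate.
    return slot_cas(pid, head, new_base)
```

Run against the delta chain from the earlier slide (insert 50, delete 48, update 35 over a base page holding records 35 and 48), it produces a page with records 35 (updated) and 50, and record 48 gone.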
Fig. 6. Bw-tree and BerkeleyDB throughput (operations/sec, millions) on the Xbox, Synthetic, and Deduplication workloads.

While linked delta chains are good for branch prediction and prefetching in general, the Xbox workload has large 100-byte records, meaning fewer deltas will fit into the L1 cache during a scan. The synthetic workload contains small 8-byte keys, which are more amenable to prefetching and caching. Thus, delta chain lengths can grow longer (to about eight deltas) without performance consequences.

In general, the Bw-tree’s superior performance is attributed to: (1) latch-freedom (BerkeleyDB uses page-level latching, which blocks on updates or reads and reduces concurrency); and (2) CPU cache efficiency (threads append deltas to immutable pages and rarely invalidate cache lines, whereas BerkeleyDB updates pages in place, and updating a B-tree page of ordered records invalidates multiple cache lines).
…ground truth
• Derive data structures via “queries” over the streaming log
  – Use versions, not mutable state
  – ACID 2.0: Associative, Commutative, Idempotent
  – Your ideas go here…
2. Libraries and Languages
  – CRDTs are monotonic data types
    • Have to link them together carefully
  – Bloom and Eve are languages whose compilers can test for monotonicity
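As a concrete taste of the ACID 2.0 properties named above, here is a minimal sketch of a monotonic data type: a grow-only counter CRDT whose merge is Associative, Commutative, and Idempotent, so replicas can gossip state in any order without coordination (the function names are illustrative):

```python
# Grow-only counter CRDT: state is a dict replica_id -> count, and each
# replica only ever increases its own entry, so state grows monotonically.

def merge(a, b):
    # Per-replica max: the least upper bound of the two states.
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in set(a) | set(b)}

def value(state):
    # The counter's value is the sum of all per-replica counts.
    return sum(state.values())

a = {"r1": 3}
b = {"r2": 5}
c = {"r1": 1, "r3": 2}
print(value(merge(merge(a, b), c)))  # 10, in any merge order
```

Because merge is a least upper bound, delivering the same update twice (idempotence) or reordering deliveries (commutativity, associativity) cannot change the converged state; that is the monotonicity CALM-style analyses look for.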
…Declarative Semantics. T. Ameloot. SIGMOD Record, 2014.
Scalable Atomic Visibility with RAMP Transactions. P. Bailis, A. Fekete, A. Ghodsi, J. M. Hellerstein, I. Stoica. SIGMOD 2014.
The Bw-Tree: A B-tree for New Hardware Platforms. J. J. Levandoski, D. B. Lomet, S. Sengupta. ICDE 2013.
http://boom.cs.berkeley.edu
http://bit.ly/progressiveseminar
Operation            Mean (ms)   Std dev   Count
all reads            8.7         376.4     21.5B
single-site commit   72.3        112.8     31.2M
multi-site commit    103.0       52.2      32.1M

Table 6: F1-perceived operation latencies measured over the course of 24 hours.

“…of such tables are extremely uncommon. The F1 team has only seen such behavior when they do untuned bulk data loads as transactions.”

10 TPS! [Corbett, et al. “Spanner:…”, OSDI ’12]
Amazon DynamoDB, Apache Cassandra, Basho Riak, Yahoo! PNUTS, Google App Engine, …consciously choose to violate atomic visibility.

“[Tao] explicitly favors efficiency and availability over consistency… [an edge] may exist without an inverse; these hanging associations are scheduled for repair by an asynchronous job.”
T1: r(x)=0   w(y←1)
T2: r(y)=0   w(x←1)

CONCURRENT EXECUTION IS NOT SERIALIZABLE! In any serial order, one transaction should have seen the other’s write (e.g., r(x)=1). …but the execution respects Atomic Visibility: each reader sees all or none of the other transaction’s writes. Atomic Visibility is not serializability!