
Progressive Systems


An early discussion of Progressive Systems, CAP and CALM, using RAMP and Bw-Trees as examples of Progressive design. Presented at LinkedIn NYC, 9/30/2015.

Joe Hellerstein

September 30, 2015

Transcript

  1. Progressive Systems
    Joe Hellerstein Berkeley/Trifacta


  2. What slows us down?


  3. What slows us down?


  4. What slows us down?


  5. What slows us down?


  6. What slows us down?


  7. What slows us down?
    Coordination
    Signals
    Barriers
    Communication


  8. But we need coordination, right?


  9. This is familiar...
    Coordination
    Locks
    Latches
    Mutexes
    Semaphores
    Compute Barriers
    Distributed Coordination


  10. The CAP Theorem
    {CA} {AP} {CP}
    •  Consistency, Availability, Partition tolerance.
    – Choose 2.
    •  Why?
    – Consistency requires coordination!
    •  And you can’t coordinate without communication.


  11. Partitions don’t happen very often


  12. But coordination still slows us
    down


  13. The first principle of successful scalability is to batter the consistency mechanisms down to a minimum, move them off the critical path, hide them in a rarely visited corner of the system, and then make it as hard as possible for application developers to get permission to use them.
    —James Hamilton (IBM, MS, Amazon)
    System Poetry


  14. The first principle of successful scalability is to batter the consistency mechanisms down to a minimum, move them off the critical path, hide them in a rarely visited corner of the system, and then make it as hard as possible for application developers to get permission to use them.
    —James Hamilton (IBM, MS, Amazon)
    System Poetry


  15. The first principle of successful scalability is to batter the consistency mechanisms down to a minimum, move them off the critical path, hide them in a rarely visited corner of the system, and then make it as hard as possible for application developers to get permission to use them.
    —James Hamilton (IBM, MS, Amazon)
    coordination
    System Poetry


  16. The CAP Theorem
    {CA} {AP} {CP}
    •  Consistency, Availability, Partition tolerance.
    – Choose 2.
    •  Why?
    – Consistency requires coordination!
    •  And you can’t coordinate without communication.
    Latency


  17. The CAP Theorem
    {CA} {AP} {CP}
    Coordination is too expensive.


  18. The CAP Theorem
    {CA} {AP} {CP}
    We have to sacrifice Consistency!
    Closed Closed


  19. Mayhem Ensues


  20. So, when is coordination required?


  21. The CALM Theorem
    KEEP CALM


  22. CALM
    Consistency
    As
    Logical
    Monotonicity


  23. CALM
    Consistency
    As
    Logical
    Monotonicity
    All processes
    respect invariants
    and agree on outcomes
    regardless of message
    ordering.


  24. CALM
    Consistency
    As
    Logical
    Monotonicity
    Program logic ensures
    that all state always
    makes progress in one
    direction.
    Once a fact is known, it
    never changes.

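To make the slide’s point concrete, here is a minimal sketch (plain Python, with invented names) of a monotone program: facts only accumulate, so every delivery order of the same messages yields the same answer.

```python
import itertools

# Hypothetical illustration of a monotone program: facts are never retracted,
# so the query answer is independent of message delivery order.
class MonotoneFriendGraph:
    def __init__(self):
        self.edges = set()              # grow-only set of facts

    def on_message(self, edge):
        self.edges.add(edge)            # add is commutative, associative, idempotent

    def friends_of(self, user):
        return {b for (a, b) in self.edges if a == user}

msgs = [("alice", "bob"), ("alice", "carol"), ("bob", "carol")]
answers = set()
for perm in itertools.permutations(msgs):
    g = MonotoneFriendGraph()
    for m in perm:
        g.on_message(m)
    answers.add(frozenset(g.friends_of("alice")))
assert len(answers) == 1                # same outcome under every ordering
```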

  25. CALM
    FORMALLY
    Theorem (CALM):
    A program specification has a
    consistent, coordination-free
    implementation if and only if
    its logic is monotone.
    Avoids coordination
    Monotone


  26. CALM
    NOTE
    CALM precisely answers
    the question of when one
    can get Consistency
    without Coordination*.
    It does not tell you how
    to achieve this goal!
    *i.e. when CAP does not hold


  27. Progressive Systems
    Systems built upon
    monotonically growing state.
    Logs
    Counters
    Vector Clocks
    Immutable Variables
    Deltas

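As one concrete instance of the monotonically growing state listed above, here is a small hypothetical vector-clock sketch: each entry only moves upward and merge is an element-wise max, so merges can be applied in any order.

```python
# Hypothetical vector clock: entries only increase; merge is element-wise max,
# so merging is commutative and order-insensitive.
def tick(clock, node):
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def merge(a, b):
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

c1 = tick(tick({}, "n1"), "n1")         # {"n1": 2}
c2 = tick({}, "n2")                     # {"n2": 1}
assert merge(c1, c2) == merge(c2, c1) == {"n1": 2, "n2": 1}
```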

  28. Two Recent Examples
    •  RAMP Transactions (Global-scale system)
    •  Bw-Tree (In-Memory Index)


  29. RAMP
    Scalable Atomic Visibility with RAMP
    Transactions. P Bailis, A Fekete, A Ghodsi,
    JM Hellerstein, I Stoica. SIGMOD 2014.
    Slides courtesy Peter Bailis


  30. Social Graph
    1.2B+ vertices
    420B+ edges
    Facebook


  31. Social Graph
    User   Adjacency List
    1      2, 3, 5
    2      1, 3, 5
    3      1, 5, 6
    4      6
    5      1, 2, 3, 6
    6      3, 4, 5
    1.2B+ vertices
    420B+ edges
    Facebook


  32. User 1: 2, 3, 5 (+6)    User 6: 3, 4, 5 (+1)
    To preserve graph,
    should observe either:
    »  Both links
    »  Neither link
    Atomic Visibility!


  33. Atomic Visibility
    One transaction writes X = 1 and Y = 1; another transaction reads X and Y.
    Allowed: the reader observes both writes, OR neither.
    Either all or none of each transaction’s updates should be visible to other transactions.


  34. Atomic Visibility, BUT NOT:
    the reader observes one write without the other (e.g. X = 1 but Y = 0, or Y = 1 but X = 0): “FRACTURED READS”.
    Either all or none of each transaction’s updates should be visible to other transactions.

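A tiny sketch of the fractured read ruled out above, with made-up names: the writer’s two updates land on different servers, and a reader that interleaves between them observes a state no atomic execution could produce.

```python
# Hypothetical fractured read: the writer updates X on one server, then Y on
# another; a reader that runs in between sees X=1 but Y=0.
servers = {"X": 0, "Y": 0}

def write_txn():
    servers["X"] = 1     # first write reaches its server...
    yield                # ...the reader interleaves here...
    servers["Y"] = 1     # ...before the second write lands

w = write_txn()
next(w)                  # apply only the first write, then read
snapshot = (servers["X"], servers["Y"])
assert snapshot == (1, 0)   # fractured: neither "both" nor "neither"
```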

  35. Atomic Visibility
    is pretty useful:
    Maintain an index: each patient Seen By an attending doctor.


  36. RAMP: Basic State
    On each node:
    •  Every transaction has a (one-way) ready bit at each node.
    •  Every node has a (monotonically increasing) highest timestamp committed.
    •  Immutable data with (monotonically increasing) timestamps.
    •  Every transaction is assigned a timestamp from a (monotonically increasing) counter.
    Example: versions X=0 @ T10 and X=1 @ T13; transaction T13’s ready bit is set (✓).

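A minimal sketch of the per-node state just listed, using invented names (versions, ready, highest_committed); the actual data structures are those of the RAMP paper cited on the earlier slide.

```python
# Hypothetical per-node RAMP state, mirroring the slide:
#   - immutable versions keyed by (item, timestamp)
#   - a one-way ready bit per transaction timestamp
#   - a monotonically increasing highest-committed timestamp
class RampNode:
    def __init__(self):
        self.versions = {}            # (item, ts) -> (value, write_set metadata)
        self.ready = set()            # txn timestamps whose ready bit is set
        self.highest_committed = 0    # only ever increases

    def prepare(self, item, ts, value, write_set):
        self.versions[(item, ts)] = (value, write_set)   # never overwritten

    def commit(self, ts):
        self.ready.add(ts)                               # one-way bit
        self.highest_committed = max(self.highest_committed, ts)
```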

  37. Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    X = 1
    W
    Y = 1
    W
    Server 1001
    X=0 Y=0
    Server 1002


  38. Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    X = 1
    W
    Y = 1
    W
    Server 1001
    X=0 Y=0
    Server 1002
    X=1
    X = ?
    R
    Y = ?
    R
    X = 1
    Y = 0
    via intention metadata


  39. Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    X = 1
    W
    Y = 1
    W
    Server 1001
    Y=0
    Server 1002
    X=1
    via intention metadata


  40. value
    Y=0 T0 {}
    intention
    ·
    Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    X = 1
    W
    Y = 1
    W
    value
    X=1 T1 {Y}
    intention
    · T0
    intention
    ·
    via intention metadata
    “A transaction called T1 wrote this and also wrote to Y”


  41. Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    X = 1
    W
    Y = 1
    W
    value
    X=1 T1 {Y}
    intention
    · value
    Y=0 T0 {}
    intention
    ·
    via intention metadata
    X = ?
    R
    Y = ?
    R


  42. Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    value
    X=1 T1 {Y}
    intention
    ·
    via intention metadata
    X = ?
    R
    Y = ?
    R
    X = 1
    Y = 0
    Where is T1’s write to Y?
    value
    Y=0 T0 {}
    intention
    ·
    “A transaction called T1 wrote this and also wrote to Y”
    via multi-versioning, ready bit


  43. Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    X = 1
    W
    Y = 1
    W
    value
    X=1 T1 {Y}
    intention
    ·
    via intention metadata
    via multi-versioning, ready bit
    value
    Y=0 T0 {}
    intention
    ·


  44. Y=1 T1 {X}
    ·
    X=1 T1 {Y}
    ·
    Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    via intention metadata
    value intention
    X=0 T0 {}
    · value intention
    Y=0 T0 {}
    ·
    X = 1
    W
    Y = 1
    W
    1.) Place each write on its server.
    2.) Set the ready bit for each write on its server.
    via multi-versioning, ready bit
    Ready bit monotonicity: once ready bit is set, all writes in
    transaction are present on their respective servers

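A sketch of the two-round write above, building on the hypothetical RampNode class from the “RAMP: Basic State” sketch: round 1 places each write with its write set as intention metadata, round 2 flips the ready bits.

```python
# Hypothetical RAMP write (assumes the RampNode sketch from the
# "RAMP: Basic State" slide above). Once any ready bit is visible, every write
# in the transaction is already present on its server.
def ramp_write(servers, ts, writes):
    write_set = set(writes)                    # intention metadata, e.g. {"X", "Y"}
    for item, value in writes.items():         # round 1: place the writes
        servers[item].prepare(item, ts, value, write_set)
    for item in writes:                        # round 2: set the ready bits
        servers[item].commit(ts)

servers = {"X": RampNode(), "Y": RampNode()}
ramp_write(servers, ts=1, writes={"X": 1, "Y": 1})
```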

  45. Y=1 T1 {X}
    ·
    X=1 T1 {Y}
    ·
    Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    via intention metadata
    via multi-versioning
    value intention
    X=0 T0 {}
    · value intention
    Y=0 T0 {}
    ·
    X = 1
    W
    Y = 1
    W
    X = ?
    R
    Y = ?
    R
    Ready bit monotonicity: once ready bit is set, all writes in
    transaction are present on their respective servers


  46. Y=1 T1 {X}
    ·
    X=1 T1 {Y}
    ·
    Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    via intention metadata
    via multi-versioning
    value intention
    X=0 T0 {}
    · value intention
    Y=0 T0 {}
    ·
    X = ?
    R
    Y = ?
    R
    1.) Fetch “highest” ready versions.
    2.) Fetch any missing writes
    using metadata.
    X = 1
    Y = 0
    Y = 1
    Ready bit monotonicity: once ready bit is set, all writes in
    transaction are present on their respective servers

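And a sketch of the two-round read above, again using the hypothetical RampNode and servers from the previous sketches: fetch the highest ready version of each item, then use the intention metadata to fetch any missing sibling writes by exact timestamp.

```python
# Hypothetical RAMP-Fast read (assumes RampNode and servers from above).
#   round 1: latest ready version of each item
#   round 2: if metadata shows a newer sibling write we missed, fetch that
#            exact version by timestamp.
def ramp_read(servers, items):
    first = {}
    for item in items:
        node = servers[item]
        ts = max((t for (i, t) in node.versions if i == item and t in node.ready),
                 default=0)
        first[item] = (ts, node.versions.get((item, ts), (None, set())))

    required = {}                                   # item -> timestamp we must see
    for item, (ts, (_, write_set)) in first.items():
        for sibling in write_set:
            if sibling in items:
                required[sibling] = max(required.get(sibling, 0), ts)

    result = {}
    for item, (ts, (value, _)) in first.items():
        need = required.get(item, 0)
        if need > ts:                               # round 2: repair the fracture
            value, _ = servers[item].versions[(item, need)]
        result[item] = value
    return result

print(ramp_read(servers, ["X", "Y"]))               # {'X': 1, 'Y': 1}
```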

  47. RAMP Variants
    Algorithm     Write RTTs   Read RTTs (best case)   Read RTTs (worst case)   Metadata
    RAMP-Fast     2            1                       2                        O(txn len): write-set summary
    RAMP-Small    2            2                       2                        O(1): timestamp
    RAMP-Hybrid   2            1+ε                     2                        O(B(ε)): Bloom filter
    REPAIR
    ATOMICITY
    DETECT
    RACES
    via intention metadata
    via multi-versioning, ready bit


  48. Throughput comparison: YCSB Workload A, 95% reads, 1M items, 4 items/txn.
    Compared: No Concurrency Control, RAMP-Fast, RAMP-Hybrid, RAMP-Small, Write Locks Only, Serializable 2PL.


  49. No Coordination On
    This RAMP


  50. Bw-Trees
    The Bw-Tree: A B-tree for New Hardware
    Platforms. JJ Levandoski, DB Lomet, S
    Sengupta. ICDE 2013.


  51. In-Memory SQL Performance Analysis
    •  Improve CPI?  < 2x benefit
    •  Improving multi-core scalability?  < 2x benefit
    Solution: reduce # of instructions per transaction. By a LOT!
      10x faster?  90% fewer instructions
      100x faster?  99% fewer instructions
    Q: Where are the inner-loop instructions?
    A: Index access
    •  especially latching and locking
    Answer: no latches, no locks, i.e. Avoid Coordination.
    Diaconu, et al. “Hekaton: SQL Server’s Memory-Optimized OLTP Engine”. SIGMOD 2013.


  52. The Bw-Tree: What is it?
    A Latch-free, Log-structured B-tree for
    Multi-core Machines with Large Main
    Memories and Flash Storage
    Bw = Buzz Word
    No coordination: Progressive!


  53. Bw-Tree Delta Updates
    Mapping Table: PID → Physical Address
    Page P: Δ Insert record 50, Δ Delete record 48, Δ Update record 35, Δ Insert record 60
    Consolidated Page P
    •  Each page update produces a new address (the delta).
    •  Install the new page address in the map using compare-and-swap.
    •  Only one winner on concurrent update to the same address.
    •  Eventually install a new consolidated page with the deltas applied.
    •  Single-page updates are easy; node splits and deletes are solved too.
    Coordination happens here, via the CAS instruction.
    A monotonic log of updates.
    A monotonic accumulation of versions.

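A sketch of the delta-update scheme above, with invented names; a real Bw-Tree uses a hardware compare-and-swap on the mapping-table slot, which this plain-Python sketch imitates with a lock-protected compare_and_swap.

```python
import threading

# Hypothetical mapping-table slot: the only mutable cell. Pages are immutable;
# an "update" prepends a delta record that points at the previous state.
class Slot:
    def __init__(self, page):
        self._addr = page
        self._lock = threading.Lock()        # stand-in for a hardware CAS

    def load(self):
        return self._addr

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._addr is expected:
                self._addr = new
                return True
            return False                      # another updater won the race

class Delta:
    def __init__(self, op, next_node):
        self.op, self.next = op, next_node    # monotonic chain of updates

def install_delta(slot, op):
    while True:                               # losers re-read and retry
        old = slot.load()
        if slot.compare_and_swap(old, Delta(op, old)):
            return

slot = Slot(page={"base": [35, 48, 50]})
install_delta(slot, ("insert", 60))
install_delta(slot, ("delete", 48))
```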

  54. Page Splits
    From the Bw-Tree paper (consolidation):
    “…consolidation that creates a new “re-organized” base page containing all the entries from the original base page as modified by the updates from the delta chain. We trigger consolidation if an accessor thread, during a page search, notices a delta chain length has exceeded a system threshold. The thread performs consolidation after attempting its update (or read) operation. When consolidating, the thread first creates a new base page (a new block of memory). It then populates the base page with a sorted vector containing the most recent version of a record from either the delta chain or old base page (deleted records are discarded). The thread then installs the new address of the consolidated page in the mapping table with a CAS. If it succeeds, the thread requests garbage collection (memory reclamation) of the old page state. Figure 2(b) provides an example depicting the consolidation of page P that incorporates deltas into a new “Consolidated Page P”. If this CAS fails, the thread abandons the operation by deallocating the new page. The thread does not retry, as a subsequent thread will eventually perform a successful consolidation.”
    Fig. 3 (from the paper). Split example: (a) creating sibling page Q; (b) installing split delta (CAS); (c) installing index entry delta (CAS). Dashed arrows represent logical pointers, while solid arrows represent physical pointers.
    Page “updates” are actually appends to a progressively growing log. Only Ptrs are mutating (via CAS instruction).

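A sketch of the consolidation step described in the excerpt, continuing the hypothetical Slot/Delta classes above: fold the delta chain into a new base page, then try to install it with a single CAS; on failure, abandon the new page rather than retry.

```python
# Hypothetical consolidation (continues the Slot/Delta sketch above): replay the
# delta chain onto a fresh base page and swing the slot to it with one CAS.
def consolidate(slot):
    top = slot.load()
    ops, node = [], top
    while isinstance(node, Delta):            # walk the chain, newest first
        ops.append(node.op)
        node = node.next
    records = set(node["base"])               # node is now the old base page
    for action, key in reversed(ops):         # replay oldest -> newest
        if action == "insert":
            records.add(key)
        else:
            records.discard(key)
    new_base = {"base": sorted(records)}
    slot.compare_and_swap(top, new_base)      # on failure: abandon, don't retry

consolidate(slot)
print(slot.load())                            # {'base': [35, 50, 60]}
```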

  55. Fig. 6. Bw-tree and BerkeleyDB, operations/sec (M), Bw-Tree vs. BerkeleyDB:
    Xbox: 10.40 vs. 0.56    Synthetic: 3.83 vs. 0.66    Deduplication: 2.84 vs. 0.33
    From the paper: “…over linked delta chains are good for branch prediction and prefetching in general, the Xbox workload has large 100-byte records, meaning fewer deltas will fit into the L1 cache during a scan. The synthetic workload contains small 8-byte keys, which are more amenable to prefetching and caching. Thus, delta chain lengths can grow longer (to about eight deltas) without performance consequences.”


  56. Reflection
    •  CAP? CALM.
    – Nothing in PTime requires coordination
    •  Wow
    – But CALM only tells us what’s possible
    •  Not how to do it.
    •  How do we get good at designing
    progressive systems?


  57. Getting Progressive
    1.  Design patterns
    –  Use a log as ground truth
    •  Derive data structures via “queries” over the streaming log
    –  Use versions, not mutable state
    –  ACID 2.0: Associative, Commutative, Idempotent
    –  Your ideas go here...
    2.  Libraries and Languages
    –  CRDTs are monotonic data types
    •  Have to link them together carefully
    –  Bloom and Eve are languages whose compilers can test for
    monotonicity

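A tiny sketch of the “ACID 2.0” idea from the list above: a hypothetical grow-only counter CRDT whose merge is associative, commutative, and idempotent, so replicas converge without coordination.

```python
# Hypothetical G-Counter CRDT: one monotone slot per replica, merged by
# element-wise max. Merge is associative, commutative, and idempotent.
class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}

    def increment(self):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + 1

    def merge(self, other):
        for r, c in other.counts.items():
            self.counts[r] = max(self.counts.get(r, 0), c)

    def value(self):
        return sum(self.counts.values())

a, b = GCounter("a"), GCounter("b")
a.increment(); a.increment(); b.increment()
a.merge(b); b.merge(a); a.merge(b)            # repeated merges are harmless
assert a.value() == b.value() == 3
```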

  58. More?
    Declarative Networking: Recent Theoretical Work on Coordination,
    Correctness, and Declarative Semantics. T Ameloot. SIGMOD Record
    2014.
    Scalable Atomic Visibility with RAMP Transactions. P Bailis, A Fekete,
    A Ghodsi, JM Hellerstein, I Stoica. SIGMOD 2014.
    The Bw-Tree: A B-tree for New Hardware Platforms. JJ Levandoski,
    DB Lomet, S Sengupta. ICDE 2013.
    http://boom.cs.berkeley.edu
    http://bit.ly/progressiveseminar


  59. Backup Slides


  60. Spanner?
    Table 6: F1-perceived operation latencies (ms) measured over the course of 24 hours.
    operation            mean    std dev   count
    all reads            8.7     376.4     21.5B
    single-site commit   72.3    112.8     31.2M
    multi-site commit    103.0   52.2      32.1M
    “…of such tables are extremely uncommon. The F1 team has only seen such behavior when they do untuned bulk data loads as transactions.”
    10 TPS!
    [Corbett, et al. “Spanner: …”, OSDI 2012]


  61. Speed of light
    7 global round-trips per sec

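A back-of-the-envelope check of the figure on this slide, assuming a round-the-world path of roughly 40,000 km and the vacuum speed of light; real fiber is slower and routes are longer, so the practical number is lower.

```python
# Rough arithmetic behind "7 global round-trips per sec" (assumed figures).
earth_circumference_km = 40_000
speed_of_light_km_s = 300_000
print(speed_of_light_km_s / earth_circumference_km)   # 7.5 trips per second
```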

  62. Facebook Tao
    Google Megastore
    LinkedIn Espresso
    Due to coordination overheads…
    Amazon DynamoDB
    Apache Cassandra
    Basho Riak
    Yahoo! PNUTS
    …consciously choose to
    violate atomic visibility
    “[Tao] explicitly favors efficiency and availability over consistency… [an edge] may exist without an inverse; these hanging associations are scheduled for repair by an asynchronous job.”
    Google App Engine


  63. Atomic Visibility is not serializability!
    Two concurrent transactions: T1 does r(x)=0, w(y←1); T2 does r(y)=0, w(x←1).
    If T1 ran first, T2 should have r(y)→1; if T2 ran first, T1 should have r(x)→1.
    CONCURRENT EXECUTION IS NOT SERIALIZABLE!
    …but it respects Atomic Visibility!
