Slide 1

Slide 1 text

Progressive Systems Joe Hellerstein Berkeley/Trifacta

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

What slows us down?

Slide 9

Slide 9 text

What slows us down?

Slide 10

Slide 10 text

What slows us down?

Slide 11

Slide 11 text

What slows us down?

Slide 12

Slide 12 text

What slows us down?

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

What slows us down? Coordination Signals Barriers Communication

Slide 16

Slide 16 text

But we need coordination, right?

Slide 17

Slide 17 text

Or do we?

Slide 18

Slide 18 text

This is familiar... Coordination Locks Latches Mutexes Semaphores Compute Barriers Distributed Coordination

Slide 19

Slide 19 text

The CAP Theorem {CA} {AP} {CP}
•  Consistency, Availability, Partition tolerance.
   – Choose 2.
•  Why?
   – Consistency requires coordination!
•  And you can’t coordinate without communication.

Slide 20

Slide 20 text

Partitions don’t happen very often

Slide 21

Slide 21 text

But coordination still slows us down

Slide 22

Slide 22 text

“The first principle of successful scalability is to batter the consistency mechanisms down to a minimum, move them off the critical path, hide them in a rarely visited corner of the system, and then make it as hard as possible for application developers to get permission to use them.”
—James Hamilton (IBM, MS, Amazon)
System Poetry

Slide 23

Slide 23 text

“The first principle of successful scalability is to batter the consistency mechanisms down to a minimum, move them off the critical path, hide them in a rarely visited corner of the system, and then make it as hard as possible for application developers to get permission to use them.”
—James Hamilton (IBM, MS, Amazon)
System Poetry

Slide 24

Slide 24 text

“The first principle of successful scalability is to batter the consistency mechanisms down to a minimum, move them off the critical path, hide them in a rarely visited corner of the system, and then make it as hard as possible for application developers to get permission to use them.”
—James Hamilton (IBM, MS, Amazon)
[annotation: coordination]
System Poetry

Slide 25

Slide 25 text

The CAP Theorem {CA} {AP} {CP}
•  Consistency, Availability, Partition tolerance.
   – Choose 2.
•  Why?
   – Consistency requires coordination!
•  And you can’t coordinate without communication.
[Diagram annotated with the cost of that communication: Latency]

Slide 26

Slide 26 text

The CAP Theorem {CA} {AP} {CP} Coordination is too expensive.

Slide 27

Slide 27 text

The CAP Theorem {CA} {AP} {CP} We have to sacrifice Consistency! [Diagram: the {CA} and {CP} choices are closed off]

Slide 28

Slide 28 text

Or do we?

Slide 29

Slide 29 text

Mayhem Ensues

Slide 30

Slide 30 text

ACID BASE

Slide 31

Slide 31 text

SQL NoSQL

Slide 32

Slide 32 text

Limits Chaos

Slide 33

Slide 33 text

So, when is coordination required?

Slide 34

Slide 34 text

The CALM Theorem ± KEEP CALM

Slide 35

Slide 35 text

± CALM Consistency As Logical Monotonicity

Slide 36

Slide 36 text

± CALM Consistency As Logical Monotonicity All processes respect invariants and agree on outcomes regardless of message ordering.

Slide 37

Slide 37 text

± CALM Consistency As Logical Monotonicity Program logic ensures that all state always makes progress in one direction. Once a fact is known, it never changes.

Slide 38

Slide 38 text

± CALM FORMALLY Theorem (CALM): A program specification has a consistent, coordination-free implementation if and only if its logic is monotone. [Diagram: Avoids coordination ⟺ Monotone]
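A minimal sketch (mine, not the talk's) of what monotonicity buys: a grow-only set union gives the same answer under every message delivery order, while a non-monotone query ("is this item absent?") can retract its answer as facts arrive, which is why it needs coordination such as a barrier saying "all input has arrived".

```python
# Minimal illustration of logical monotonicity (assumed example, not from the talk).
# A monotone program's output only grows as inputs arrive, so any delivery order
# yields the same final result -- no coordination needed.
from itertools import permutations

def monotone_union(messages):
    """Grow-only set: adding facts never retracts earlier conclusions."""
    known = set()
    for m in messages:
        known |= {m}
    return known

def non_monotone_absent(messages, probe):
    """'Is probe absent?' can flip from True to False as more facts arrive,
    so its answer depends on when you look; it needs a coordination barrier."""
    return probe not in set(messages)

msgs = ["a", "b", "c"]
# Every delivery order gives the same monotone result...
assert all(monotone_union(p) == {"a", "b", "c"} for p in permutations(msgs))
# ...but the non-monotone query gives different answers on prefixes.
print(non_monotone_absent(msgs[:1], "c"))  # True  (premature conclusion)
print(non_monotone_absent(msgs, "c"))      # False (retracted later)
```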

Slide 39

Slide 39 text

± CALM NOTE CALM precisely answers the question of when one can get Consistency without Coordination*. It does not tell you how to achieve this goal! *i.e. when CAP does not hold

Slide 40

Slide 40 text

Progressive Systems: systems built upon monotonically growing state. Logs, Counters, Vector Clocks, Immutable Variables, Deltas.
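As one concrete sketch of "monotonically growing state" (an assumed example, not from the slides): a grow-only counter whose merge is associative, commutative, and idempotent, so replicas converge regardless of gossip order or duplication.

```python
# Sketch of a grow-only counter (G-Counter), one example of monotonically
# growing state. Each replica only bumps its own slot; merge takes the
# element-wise max, which is associative, commutative, and idempotent.
class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.slots = {}                              # replica_id -> count

    def increment(self, n=1):
        self.slots[self.replica_id] = self.slots.get(self.replica_id, 0) + n

    def merge(self, other):
        for rid, cnt in other.slots.items():
            self.slots[rid] = max(self.slots.get(rid, 0), cnt)

    def value(self):
        return sum(self.slots.values())

a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
a.merge(b); b.merge(a)                               # order and duplication don't matter
assert a.value() == b.value() == 5
```

Logs, vector clocks, immutable variables, and deltas all have this same merge-by-union-or-max flavor.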

Slide 41

Slide 41 text

Two Recent Examples •  RAMP Transactions (Global-scale system) •  Bw-Tree (In-Memory Index)

Slide 42

Slide 42 text

RAMP Scalable Atomic Visibility with RAMP Transactions. P Bailis, A Fekete, A Ghodsi, JM Hellerstein, I Stoica. SIGMOD 2014. Slides courtesy Peter Bailis.

Slide 43

Slide 43 text

Social Graph 1.2B+ vertices 420B+ edges Facebook

Slide 44

Slide 44 text

Social Graph (Facebook): 1.2B+ vertices, 420B+ edges

User | Adjacency List
1    | 2, 3, 5
2    | 1, 3, 5
3    | 1, 5, 6
4    | 6
5    | 1, 2, 3, 6
6    | 3, 4, 5

Slide 45

Slide 45 text

Adding the 1–6 edge appends 6 to user 1’s adjacency list (1 → 2, 3, 5, 6) and 1 to user 6’s list (6 → 3, 4, 5, 1). To preserve the graph, a reader should observe either: »  Both links »  Neither link. Atomic Visibility!

Slide 46

Slide 46 text

Atomic Visibility: either all or none of each transaction’s updates should be visible to other transactions. [Diagram: a transaction writes X = 1 and Y = 1; a concurrent reader of X and Y sees either both new values or neither.]

Slide 47

Slide 47 text

BUT NOT: [Diagram: a reader sees one of the transaction’s writes (e.g., X = 1) but an old version of the other item, or vice versa.] Either all or none of each transaction’s updates should be visible to other transactions; such partial observations are “FRACTURED READS”.
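A hypothetical way to state the fractured-reads condition in code, assuming each observed version carries the timestamp of the transaction that wrote it plus that transaction's write set (the intention metadata introduced a few slides later):

```python
# Sketch (assumed) of the fractured-read check: a read set is fractured if it
# observes transaction T's write to one item but an older version of another
# item that T also wrote and that we read.
def has_fractured_read(reads):
    """reads: item -> (writer_txn_timestamp, writer_write_set)"""
    for item, (ts, write_set) in reads.items():
        for sibling in write_set:
            if sibling in reads and reads[sibling][0] < ts:
                return True
    return False

# T1 (timestamp 1) wrote both X and Y; here we saw T1's X but T0's Y -> fractured.
reads = {"X": (1, {"X", "Y"}), "Y": (0, {"Y"})}
print(has_fractured_read(reads))   # True
```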

Slide 48

Slide 48 text

Atomic Visibility is pretty useful: e.g., maintaining an index that links each patient to an attending doctor (“Seen By”).

Slide 49

Slide 49 text

RAMP: Basic State (on each node)
•  Every transaction has a (one-way) ready bit at each node.
•  Every node has a (monotonically increasing) highest timestamp committed.
•  Immutable data with (monotonically increasing) timestamps.
•  Every transaction is assigned a timestamp from a (monotonically increasing) counter.
[Diagram: versions X=0 @ T10 and X=1 @ T13; ready bit ✓ set for T13]
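A rough sketch of this per-node state in Python (names are mine; this is not the RAMP paper's code): immutable timestamped versions, a one-way ready bit per transaction, and a monotone highest-committed timestamp.

```python
# Sketch (assumed names) of the per-node state described on this slide.
class RampNode:
    def __init__(self):
        self.versions = {}        # (item, ts) -> (value, write_set): immutable, only grows
        self.ready = set()        # timestamps whose ready bit is set: only grows
        self.highest_committed = 0

    def prepare(self, item, ts, value, write_set):
        # Place a new immutable version, tagged with the transaction's write set.
        self.versions[(item, ts)] = (value, frozenset(write_set))

    def commit(self, ts):
        self.ready.add(ts)                                         # one-way: never unset
        self.highest_committed = max(self.highest_committed, ts)   # monotone
```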

Slide 50

Slide 50 text

Atomic Visibility via RAMP Transactions REPAIR ATOMICITY DETECT RACES X = 1 W Y = 1 W Server 1001 X=0 Y=0 Server 1002

Slide 51

Slide 51 text

Atomic Visibility via RAMP Transactions REPAIR ATOMICITY DETECT RACES X = 1 W Y = 1 W Server 1001 X=0 Y=0 Server 1002 X=1 X = ? R Y = ? R X = 1 Y = 0 via intention metadata

Slide 52

Slide 52 text

Atomic Visibility via RAMP Transactions REPAIR ATOMICITY DETECT RACES X = 1 W Y = 1 W Server 1001 Y=0 Server 1002 X=1 via intention metadata

Slide 53

Slide 53 text

value Y=0 T0 {} intention · Atomic Visibility via RAMP Transactions REPAIR ATOMICITY DETECT RACES X = 1 W Y = 1 W value X=1 T1 {Y} intention · T0 intention · via intention metadata “A transaction called T1 wrote this and also wrote to Y”

Slide 54

Slide 54 text

Atomic Visibility via RAMP Transactions REPAIR ATOMICITY DETECT RACES X = 1 W Y = 1 W value X=1 T1 {Y} intention · value Y=0 T0 {} intention · via intention metadata X = ? R Y = ? R

Slide 55

Slide 55 text

Atomic Visibility via RAMP Transactions REPAIR ATOMICITY DETECT RACES value X=1 T1 {Y} intention · via intention metadata X = ? R Y = ? R X = 1 Y = 0 Where is T1’s write to Y? value Y=0 T0 {} intention · “A transaction called T1 wrote this and also wrote to Y” via multi-versioning, ready bit

Slide 56

Slide 56 text

Atomic Visibility via RAMP Transactions REPAIR ATOMICITY DETECT RACES X = 1 W Y = 1 W value X=1 T1 {Y} intention · via intention metadata via multi-versioning, ready bit value Y=0 T0 {} intention ·

Slide 57

Slide 57 text

Y=1 T1 {X} · X=1 T1 {Y} · Atomic Visibility via RAMP Transactions REPAIR ATOMICITY DETECT RACES via intention metadata value intention X=0 T0 {} · value intention Y=0 T0 {} · X = 1 W Y = 1 W 1.) Place write on each server. 2.) Set ready bit on each write on server. via multi-versioning, ready bit Ready bit monotonicity: once ready bit is set, all writes in transaction are present on their respective servers
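Continuing the RampNode sketch from the "Basic State" slide (still an assumed illustration, not the paper's implementation): the two-round write shown here places every write with its intention metadata, then flips the ready bits. Because the ready bit is only set after every write is in place, the "ready bit monotonicity" property on the slide holds.

```python
# Continuing the RampNode sketch above (assumed, not the paper's code).
def ramp_write(nodes, ts, writes):
    """nodes: item -> RampNode; writes: item -> value."""
    write_set = set(writes)
    # Round 1: place each write, tagged with the full write set (intention metadata).
    for item, value in writes.items():
        nodes[item].prepare(item, ts, value, write_set)
    # Round 2: set the ready bit everywhere (commit).
    for item in writes:
        nodes[item].commit(ts)
```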

Slide 58

Slide 58 text

Y=1 T1 {X} · X=1 T1 {Y} · Atomic Visibility via RAMP Transactions REPAIR ATOMICITY DETECT RACES via intention metadata via multi-versioning value intention X=0 T0 {} · value intention Y=0 T0 {} · X = 1 W Y = 1 W X = ? R Y = ? R Ready bit monotonicity: once ready bit is set, all writes in transaction are present on their respective servers

Slide 59

Slide 59 text

Y=1 T1 {X} · X=1 T1 {Y} · Atomic Visibility via RAMP Transactions REPAIR ATOMICITY DETECT RACES via intention metadata via multi-versioning value intention X=0 T0 {} · value intention Y=0 T0 {} · X = ? R Y = ? R 1.) Fetch “highest” ready versions. 2.) Fetch any missing writes using metadata. X = 1 Y = 0 Y = 1 Ready bit monotonicity: once ready bit is set, all writes in transaction are present on their respective servers
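And the read side, continuing the same assumed sketch: round one fetches the highest ready version of each item; round two uses the attached write sets to fetch any newer sibling versions it missed. The sketch assumes every item was seeded with an initial committed version.

```python
# Continuing the sketch (assumed): the two-round read from the slide.
def ramp_read(nodes, items):
    result, required_ts = {}, {}
    # Round 1: highest *ready* version per item.
    for item in items:
        node = nodes[item]
        ts = max(t for (i, t) in node.versions if i == item and t in node.ready)
        value, write_set = node.versions[(item, ts)]
        result[item] = (ts, value)
        # Remember the newest transaction known to have written each sibling we also read.
        for sib in write_set & set(items):
            required_ts[sib] = max(required_ts.get(sib, 0), ts)
    # Round 2: fetch the specific newer version for any item we under-read.
    for item, ts in required_ts.items():
        if result[item][0] < ts:
            value, _ = nodes[item].versions[(item, ts)]
            result[item] = (ts, value)
    return {item: value for item, (ts, value) in result.items()}

# Usage of the combined sketch:
nodes = {"X": RampNode(), "Y": RampNode()}
ramp_write(nodes, ts=1, writes={"X": 0, "Y": 0})   # seed initial versions
ramp_write(nodes, ts=2, writes={"X": 1, "Y": 1})
print(ramp_read(nodes, ["X", "Y"]))                 # {'X': 1, 'Y': 1} -- never fractured
```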

Slide 60

Slide 60 text

RAMP Variants

Algorithm   | Write RTT | Read RTT (best case) | Read RTT (worst case) | Metadata
RAMP-Fast   | 2         | 1                    | 2                     | O(txn len) write set summary
RAMP-Small  | 2         | 2                    | 2                     | O(1) timestamp
RAMP-Hybrid | 2         | 1+ε                  | 2                     | O(B(ε)) Bloom filter

REPAIR ATOMICITY via intention metadata; DETECT RACES via multi-versioning, ready bit

Slide 61

Slide 61 text

[Throughput plot] YCSB Workload A, 95% reads, 1M items, 4 items/txn, comparing No Concurrency Control, RAMP-Fast, RAMP-Small, RAMP-Hybrid, Write Locks Only, and Serializable 2PL.

Slide 62

Slide 62 text

No Coordination On This RAMP

Slide 63

Slide 63 text

No content

Slide 64

Slide 64 text

Bw-Trees The Bw-Tree: A B-tree for New Hardware Platforms. JJ Levandoski, DB Lomet, S Sengupta. ICDE 2013.

Slide 65

Slide 65 text

In-Memory SQL Performance Analysis
•  Improve CPI? < 2x benefit
•  Improving multi-core scalability? < 2x benefit
Solution: reduce # of instructions per transaction. By a LOT!
   10x faster? 90% fewer instructions
   100x faster? 99% fewer instructions
Q: Where are the inner-loop instructions?
A: Index access
•  especially latching and locking
Answer: no latches, no locks, i.e. Avoid Coordination.
Diaconu, et al. “Hekaton: SQL Server’s Memory-Optimized OLTP Engine”. SIGMOD 2013.

Slide 66

Slide 66 text

The Bw-Tree: What is it? A Latch-free, Log-structured B-tree for Multi-core Machines with Large Main Memories and Flash Storage Bw = Buzz Word No coordination Progressive!

Slide 67

Slide 67 text

Bw-Tree Delta Updates
[Diagram: Mapping Table maps PID to physical address; Page P’s entry points at a delta chain (Δ: Insert record 50, Δ: Delete record 48, Δ: Update record 35, Δ: Insert record 60) and, after consolidation, at Consolidated Page P]
•  Each page update produces a new address (the delta).
•  Install new page address in map using compare-and-swap.
•  Only one winner on concurrent updates to the same address.
•  Eventually install new consolidated page with deltas applied.
•  Single-page updates are easy; node splits and deletes are also solved (next slide).
Coordination happens here, via the CAS instruction. A monotonic log of updates. A monotonic accumulation of versions.
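A latch-free delta update in miniature (a sketch under my own assumptions; Python has no user-level CAS on references, so a lock stands in for the hardware compare-and-swap on a single mapping-table slot):

```python
# Sketch (assumed, Python stands in for the paper's C++): a Bw-Tree-style delta
# update. Pages are immutable; an update prepends a delta node and installs its
# address in the mapping table with a compare-and-swap, so exactly one of two
# concurrent updaters wins and neither blocks.
import threading

class Delta:
    def __init__(self, op, payload, next_node):
        self.op, self.payload, self.next = op, payload, next_node   # immutable once built

class MappingTable:
    def __init__(self):
        self._map = {}                   # page id -> head of delta chain / base page
        self._lock = threading.Lock()    # stand-in for a hardware CAS instruction

    def compare_and_swap(self, pid, expected, new):
        with self._lock:                 # emulate an atomic CAS on one slot
            if self._map.get(pid) is expected:
                self._map[pid] = new
                return True
            return False

    def get(self, pid):
        return self._map.get(pid)

def delta_update(table, pid, op, payload):
    while True:                          # lock-free retry loop: no latches held
        old_head = table.get(pid)
        new_head = Delta(op, payload, old_head)
        if table.compare_and_swap(pid, old_head, new_head):
            return new_head              # this thread's update won the CAS
        # else: someone else changed the page; rebuild the delta against the new head and retry
```

Because losers simply rebuild their delta against the new head and retry, no thread ever blocks another, which is the "no latches, no locks" goal from the Hekaton slide.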

Slide 68

Slide 68 text

[Excerpt from the Bw-Tree paper: text on page consolidation (build a new base page from the delta chain plus the old base, install it with a CAS, and abandon without retry if the CAS fails) and Fig. 3, the split example: (a) creating sibling page Q, (b) installing a split delta via CAS, (c) installing an index entry delta via CAS. Dashed arrows represent logical pointers; solid arrows represent physical pointers.]
Page Splits: page “updates” are actually appends to a progressively growing log. Only pointers are mutated (via the CAS instruction).
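Continuing that sketch, consolidation as described on the previous slide: replay the delta chain onto a fresh base page and install it with one CAS; if the CAS loses, drop the new page and do not retry, since a later thread will consolidate successfully.

```python
# Continuing the MappingTable/Delta sketch above (assumed, not the paper's code).
class BasePage:
    def __init__(self, records):
        self.records = dict(records)     # immutable snapshot of key -> value

def consolidate(table, pid):
    old_head = table.get(pid)
    deltas, node = [], old_head
    while isinstance(node, Delta):       # collect the chain, newest first
        deltas.append(node)
        node = node.next
    records = dict(node.records) if isinstance(node, BasePage) else {}
    for d in reversed(deltas):           # replay deltas oldest -> newest
        if d.op in ("insert", "update"):
            key, value = d.payload
            records[key] = value
        elif d.op == "delete":
            records.pop(d.payload, None)
    new_base = BasePage(records)
    if not table.compare_and_swap(pid, old_head, new_base):
        return None                      # lost the race: abandon, don't retry
    return new_base
```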

Slide 69

Slide 69 text

[Fig. 6 from the Bw-Tree paper: Bw-tree vs. BerkeleyDB throughput in Operations/Sec (M). Xbox: 10.40 vs 0.56; Synthetic: 3.83 vs 0.66; Deduplication: 2.84 vs 0.33.]

Slide 70

Slide 70 text

Reflection
•  CAP? CALM.
   – Nothing in PTIME requires coordination.
•  Wow.
   – But CALM only tells us what’s possible.
   – Not how to do it.
•  How do we get good at designing progressive systems?

Slide 71

Slide 71 text

Getting Progressive
1.  Design patterns
    –  Use a log as ground truth
       •  Derive data structures via “queries” over the streaming log (see the sketch below)
    –  Use versions, not mutable state
    –  ACID 2.0: Associative, Commutative, Idempotent, Distributed
    –  Your ideas go here...
2.  Libraries and Languages
    –  CRDTs are monotonic data types
       •  Have to link them together carefully
    –  Bloom and Eve are languages whose compilers can test for monotonicity
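A tiny sketch (my example, not from the talk) of the "log as ground truth" pattern referenced above: the append-only log is the only growing state, and every other data structure is a deterministic fold ("query") over it, so it can be rebuilt or incrementally maintained from any prefix.

```python
# Sketch (assumed) of "use a log as ground truth".
log = []                                 # append-only: the ground truth

def append(event):
    log.append(event)                    # the only state that "changes"

def derive_balances(entries):
    """A derived view: current balance per account, folded from the log."""
    balances = {}
    for account, amount in entries:
        balances[account] = balances.get(account, 0) + amount
    return balances

append(("alice", +100))
append(("bob", +50))
append(("alice", -30))
print(derive_balances(log))              # {'alice': 70, 'bob': 50}
```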

Slide 72

Slide 72 text

More? Declarative Networking: Recent Theoretical Work on Coordination, Correctness, and Declarative Semantics. T Ameloot. SIGMOD Record 2014. Scalable Atomic Visibility with RAMP Transactions. P Bailis, A Fekete, A Ghodsi, JM Hellerstein, I Stoica. SIGMOD 2014. The Bw-Tree: A B-tree for New Hardware Platforms. JJ Levandoski, DB Lomet, S Sengupta. ICDE 2013. http://boom.cs.berkeley.edu http://bit.ly/progressiveseminar

Slide 73

Slide 73 text

Backup Slides

Slide 74

Slide 74 text

Spanner?
Table 6 (from the Spanner paper): F1-perceived operation latencies measured over the course of 24 hours.

operation          | mean latency (ms) | std dev | count
all reads          | 8.7               | 376.4   | 21.5B
single-site commit | 72.3              | 112.8   | 31.2M
multi-site commit  | 103.0             | 52.2    | 32.1M

10 TPS! [Corbett, et al. “Spanner: …”, OSDI 12]

Slide 75

Slide 75 text

Speed of light: 7 global round-trips per sec.
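A back-of-envelope check of that number, with constants that are my assumptions rather than the slide's:

```python
# Rough check of "~7 global round-trips per second".
C_KM_PER_S = 299_792                 # speed of light in vacuum, km/s (assumed constant)
EARTH_CIRCUMFERENCE_KM = 40_075      # assumed constant
# A round trip to the far side of the globe and back covers about one circumference.
round_trip_s = EARTH_CIRCUMFERENCE_KM / C_KM_PER_S
print(1 / round_trip_s)              # ~7.5 per second, ignoring fiber slowdown and routing
```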

Slide 76

Slide 76 text

Due to coordination overheads, Facebook Tao, Google Megastore, Google App Engine, LinkedIn Espresso, Amazon DynamoDB, Apache Cassandra, Basho Riak, and Yahoo! PNUTS consciously choose to violate atomic visibility. “[Tao] explicitly favors efficiency and availability over consistency… [an edge] may exist without an inverse; these hanging associations are scheduled for repair by an asynchronous job.”

Slide 77

Slide 77 text

[Diagram: one transaction does r(x)=0 then w(y←1); the other does r(y)=0 then w(x←1). In either serial order, the second transaction’s read should return 1 (should have r(y)→1 or r(x)→1). The concurrent execution in which both reads return 0 is therefore NOT SERIALIZABLE, but it respects Atomic Visibility!] Atomic Visibility is not serializability!