
Progressive Systems

An early discussion of Progressive Systems, CAP and CALM, using RAMP and bW-trees as examples of Progressive design. Presented at LinkedIn NYC 9/30/2015.

Joe Hellerstein

September 30, 2015

Transcript

  1. The CAP Theorem {CA} {AP} {CP} •  Consistency, Availability, Partitioning.

    – Choose 2. •  Why? – Consistency requires coordination! •  And you can’t coordinate without communication.
  2. “The first principle of successful scalability is to batter the consistency mechanisms down to a minimum, move them off the critical path, hide them in a rarely visited corner of the system, and then make it as hard as possible for application developers to get permission to use them.” —James Hamilton (IBM, MS, Amazon). System Poetry
  3.–4. (The same quote, repeated as build slides; slide 4 adds an overlaid annotation: coordination.)
  5. The CAP Theorem {CA} {AP} {CP} • Consistency, Availability, Partitioning. – Choose 2. • Why? – Consistency requires coordination! • And you can’t coordinate without communication. [Diagram: each message in the exchange is labeled with latency L.]
  6. ± CALM Consistency As Logical Monotonicity All processes respect invariants

    and agree on outcomes regardless of message ordering.
  7. ± CALM Consistency As Logical Monotonicity Program logic ensures that

    all state always makes progress in one direction. Once a fact is known, it never changes.
  8. ± CALM FORMALLY Theorem (CALM): A program specification has a

    consistent, coordination-free implementation if and only if its logic is monotone. [Diagram: Monotone ⟺ Avoids coordination]
  9. ± CALM NOTE CALM precisely answers the question of when

    one can get Consistency without Coordination*. It does not tell you how to achieve this goal! *i.e. when CAP does not hold
  10. Progressive Systems. Systems built upon monotonically growing state: Logs, Counters, Vector Clocks, Immutable Variables, Deltas.
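
To make “monotonically growing state” concrete, here is a minimal sketch (mine, not from the deck) of one such structure, a grow-only set: its merge is set union, so replicas agree regardless of message ordering or duplication.

```python
# A minimal sketch of monotone ("progressive") state: a grow-only set.
# Merge is set union, which is associative, commutative, and idempotent,
# so replicas converge no matter how messages are ordered, duplicated,
# or delayed -- no coordination needed.

class GrowOnlySet:
    def __init__(self):
        self.items = set()

    def add(self, x):              # facts only accumulate; nothing is retracted
        self.items.add(x)

    def merge(self, other):        # apply a peer's state (or a delta) in any order
        self.items |= other.items

    def contains(self, x):         # once true, stays true (monotone)
        return x in self.items


# Two replicas see the same adds in different orders and still converge.
a, b = GrowOnlySet(), GrowOnlySet()
a.add("edge(1,6)"); a.add("edge(6,1)")
b.add("edge(6,1)"); b.add("edge(1,6)")
a.merge(b); b.merge(a)
assert a.items == b.items
```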
  11. RAMP Scalable Atomic Visibility with RAMP Transactions. P Bailis, A

    Fekete, A Ghodsi, JM Hellerstein, I Stoica. SIGMOD 2014. Slides courtesy Peter Bailis.
  12. Social Graph (Facebook: 1.2B+ vertices, 420B+ edges)
     User | Adjacency List
     1    | 2, 3, 5
     2    | 1, 3, 5
     3    | 1, 5, 6
     4    | 6
     5    | 1, 2, 3, 6
     6    | 3, 4, 5
  13. Adding an edge between users 1 and 6 appends 6 to user 1’s list (2, 3, 5, 6) and 1 to user 6’s list (3, 4, 5, 1).
     To preserve the graph, a reader should observe either: » Both links » Neither link. Atomic Visibility!
  14. Atomic Visibility: either all or none of each transaction’s updates should be visible to other transactions. [Diagram: a writer issues WRITE X=1, WRITE Y=1; a concurrent reader’s READ X, READ Y should return either both writes (X=1, Y=1) or neither.]
  15. BUT NOT “fractured reads”: a reader that observes one of the transaction’s writes (say X=1) while missing the other (Y=1), or vice versa. Either all or none of each transaction’s updates should be visible to other transactions.
  16. RAMP: Basic State. On each node:
     • Immutable data with (monotonically increasing) timestamps.
     • Every transaction is assigned a timestamp from a (monotonically increasing) counter.
     • Every transaction has a (one-way) ready bit at each node.
     • Every node has a (monotonically increasing) highest timestamp committed.
     [Diagram: versions X=0 @ T10 and X=1 @ T13; transaction T13’s ready bit ✓.]
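
A hedged sketch of the per-node state this slide lists; the class and field names (Version, RampNode, intent, ready, highest_committed) are my own illustration, not RAMP’s actual code.

```python
# Per-node RAMP state, roughly as slide 16 describes it (names are mine).
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Version:
    key: str
    value: object
    ts: int                 # transaction timestamp (monotonically increasing)
    intent: frozenset       # other keys written by the same transaction

@dataclass
class RampNode:
    versions: dict = field(default_factory=dict)   # (key, ts) -> Version, immutable once written
    ready: set = field(default_factory=set)        # timestamps whose ready bit is set (one-way)
    highest_committed: int = 0                     # only ever increases

    def prepare(self, v: Version):
        self.versions[(v.key, v.ts)] = v           # install a new immutable version

    def commit(self, ts: int):
        self.ready.add(ts)                         # flip the one-way ready bit
        self.highest_committed = max(self.highest_committed, ts)
```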
  17. Atomic Visibility via RAMP Transactions REPAIR ATOMICITY DETECT RACES X

    = 1 W Y = 1 W Server 1001 X=0 Y=0 Server 1002
  18. Atomic Visibility via RAMP Transactions REPAIR ATOMICITY DETECT RACES X

    = 1 W Y = 1 W Server 1001 X=0 Y=0 Server 1002 X=1 X = ? R Y = ? R X = 1 Y = 0 via intention metadata
  19. Atomic Visibility via RAMP Transactions REPAIR ATOMICITY DETECT RACES X

    = 1 W Y = 1 W Server 1001 Y=0 Server 1002 X=1 via intention metadata
  20. value Y=0 T0 {} intention · Atomic Visibility via RAMP

    Transactions REPAIR ATOMICITY DETECT RACES X = 1 W Y = 1 W value X=1 T1 {Y} intention · T0 intention · via intention metadata “A transaction called T1 wrote this and also wrote to Y”
  21. Atomic Visibility via RAMP Transactions REPAIR ATOMICITY DETECT RACES X

    = 1 W Y = 1 W value X=1 T1 {Y} intention · value Y=0 T0 {} intention · via intention metadata X = ? R Y = ? R
  22. Atomic Visibility via RAMP Transactions REPAIR ATOMICITY DETECT RACES value

    X=1 T1 {Y} intention · via intention metadata X = ? R Y = ? R X = 1 Y = 0 Where is T1’s write to Y? value Y=0 T0 {} intention · “A transaction called T1 wrote this and also wrote to Y” via multi-versioning, ready bit
  23. Atomic Visibility via RAMP Transactions REPAIR ATOMICITY DETECT RACES X

    = 1 W Y = 1 W value X=1 T1 {Y} intention · via intention metadata via multi-versioning, ready bit value Y=0 T0 {} intention ·
  24. Y=1 T1 {X} · X=1 T1 {Y} · Atomic Visibility

    via RAMP Transactions REPAIR ATOMICITY DETECT RACES via intention metadata value intention X=0 T0 {} · value intention Y=0 T0 {} · X = 1 W Y = 1 W 1.) Place write on each server. 2.) Set ready bit on each write on server. via multi-versioning, ready bit Ready bit monotonicity: once the ready bit is set, all writes in the transaction are present on their respective servers.
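
A sketch of the two-round write this slide walks through, reusing the RampNode/Version sketch above; the structure is my reconstruction of the slide’s two steps, not the paper’s implementation.

```python
# Two-round RAMP write (slide 24's steps). Round 1 places every write, with
# intention metadata, on its server; round 2 flips the ready bits. "Ready bit
# monotonicity": once any ready bit is visible, all of the transaction's
# writes are already in place on their servers.

def ramp_write(nodes, ts, writes):
    """nodes: key -> RampNode; ts: the txn timestamp; writes: key -> value."""
    keys = frozenset(writes)
    # Round 1: prepare -- install immutable versions everywhere first.
    for k, val in writes.items():
        nodes[k].prepare(Version(key=k, value=val, ts=ts,
                                 intent=keys - {k}))   # "I also wrote these keys"
    # Round 2: commit -- set the one-way ready bit on every participant.
    for k in writes:
        nodes[k].commit(ts)
```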
  25. Y=1 T1 {X} · X=1 T1 {Y} · Atomic Visibility

    via RAMP Transactions REPAIR ATOMICITY DETECT RACES via intention metadata via multi-versioning value intention X=0 T0 {} · value intention Y=0 T0 {} · X = 1 W Y = 1 W X = ? R Y = ? R Ready bit monotonicity: once the ready bit is set, all writes in the transaction are present on their respective servers.
  26. Y=1 T1 {X} · X=1 T1 {Y} · Atomic Visibility

    via RAMP Transactions REPAIR ATOMICITY DETECT RACES via intention metadata via multi-versioning value intention X=0 T0 {} · value intention Y=0 T0 {} · X = ? R Y = ? R 1.) Fetch “highest” ready versions. 2.) Fetch any missing writes using metadata. X = 1 Y = 0 Y = 1 Ready bit monotonicity: once the ready bit is set, all writes in the transaction are present on their respective servers.
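
A matching sketch of the read side (closest in spirit to RAMP-Fast): round one fetches the highest ready version per key, intention metadata exposes any missed sibling writes, and round two fetches those exact versions, even if their ready bit is not yet set. Again, this is my reconstruction of the slide’s two steps, not the paper’s code.

```python
# Two-round RAMP read (slide 26's steps), using RampNode/Version above.

def ramp_read(nodes, keys):
    # Round 1: highest *ready* version per key.
    result = {}
    for k in keys:
        ready_versions = [v for (key, ts), v in nodes[k].versions.items()
                          if key == k and ts in nodes[k].ready]
        result[k] = max(ready_versions, key=lambda v: v.ts,
                        default=Version(k, None, 0, frozenset()))
    # Detect races: a sibling write with a higher timestamp that we missed.
    required = {}                          # key -> timestamp we must observe
    for v in result.values():
        for sibling in v.intent:
            if sibling in keys and result[sibling].ts < v.ts:
                required[sibling] = max(required.get(sibling, 0), v.ts)
    # Round 2: repair by fetching the exact missing versions by (key, ts).
    # Ready bit monotonicity guarantees they are already present.
    for k, ts in required.items():
        result[k] = nodes[k].versions[(k, ts)]
    return {k: v.value for k, v in result.items()}
```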
  27. RAMP Variants (REPAIR ATOMICITY via intention metadata; DETECT RACES via multi-versioning, ready bit)
     Algorithm   | Write RTT | Read RTT (best case) | Read RTT (worst case) | Metadata
     RAMP-Fast   | 2         | 1                    | 2                     | O(txn len) write set summary
     RAMP-Small  | 2         | 2                    | 2                     | O(1) timestamp
     RAMP-Hybrid | 2         | 1+ε                  | 2                     | O(B(ε)) Bloom filter
  28. [Throughput chart: YCSB Workload A, 95% reads, 1M items, 4 items/txn. Series: No Concurrency Control, Serializable 2PL, Write Locks Only, RAMP-Fast, RAMP-Small, RAMP-Hybrid.]
  29. Bw-Trees The Bw-Tree: A B-tree for New Hardware Platforms. JJ

    Levandoski, DB Lomet, S Sengupta. ICDE 2013.
  30. In-Memory SQL Performance Analysis • Improve CPI? < 2x benefit. • Improve multi-core scalability? < 2x benefit. Solution: reduce # of instructions per transaction, by a LOT! 10x faster? 90% fewer instructions. 100x faster? 99% fewer instructions. Q: Where are the inner-loop instructions? A: Index access, especially latching and locking. Answer: no latches, no locks, i.e. avoid coordination. [Diaconu, et al. “Hekaton: SQL Server’s Memory-Optimized OLTP Engine”. SIGMOD 2013.]
  31. The Bw-Tree: What is it? A Latch-free, Log-structured B-tree for

    Multi-core Machines with Large Main Memories and Flash Storage Bw = Buzz Word No coordination Progressive!
  32. Delta Updates. [Diagram: a Mapping Table maps PID P to a physical address; the address points at a chain of deltas over Page P (Δ: Insert record 50, Δ: Delete record 48, Δ: Update record 35, Δ: Insert record 60) and eventually at a Consolidated Page P.]
     • Each page update produces a new address (the delta).
     • Install the new page address in the map using compare-and-swap (CAS).
     • Only one winner on concurrent updates to the same address.
     • Eventually install a new consolidated page with deltas applied.
     • Single-page updates are easy; node splits and deletes are solved too.
     Coordination happens here, via the CAS instruction. A monotonic log of updates; a monotonic accumulation of versions.
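
A rough sketch of the delta-update loop described above. The MappingTable, Delta chain, and apply_update names are my own simplified stand-ins, and a lock-guarded compare stands in for the hardware CAS instruction the Bw-tree actually relies on.

```python
# Delta updates via CAS on a mapping table (a simplified stand-in, not the
# Bw-tree's actual code).
import threading

class MappingTable:
    def __init__(self):
        self._slots = {}                  # PID -> head of delta chain (or base page)
        self._lock = threading.Lock()

    def read(self, pid):
        return self._slots.get(pid)

    def cas(self, pid, expected, new):
        with self._lock:                  # stand-in for a single CAS instruction
            if self._slots.get(pid) is expected:
                self._slots[pid] = new
                return True
            return False                  # lost the race; caller decides what to do

class Delta:
    def __init__(self, op, key, value, nxt):
        # Immutable once linked: an append to the page's growing log of updates.
        self.op, self.key, self.value, self.next = op, key, value, nxt

def apply_update(table, pid, op, key, value=None):
    while True:                           # only one concurrent updater wins each CAS
        head = table.read(pid)
        if table.cas(pid, head, Delta(op, key, value, head)):
            return
```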
  33. Page Splits. [Excerpt from the Bw-tree paper, shown with the deck’s annotation: page “updates” are actually appends to a progressively growing log; only the mapping-table pointers are mutated (via the CAS instruction).]
     “…consolidation that creates a new ‘re-organized’ base page containing all the entries from the original base page as modified by the updates from the delta chain. We trigger consolidation if an accessor thread, during a page search, notices a delta chain length has exceeded a system threshold. The thread performs consolidation after attempting its update (or read) operation. When consolidating, the thread first creates a new base page (a new block of memory). It then populates the base page with a sorted vector containing the most recent version of a record from either the delta chain or old base page (deleted records are discarded). The thread then installs the new address of the consolidated page in the mapping table with a CAS. If it succeeds, the thread requests garbage collection (memory reclamation) of the old page state. Figure 2(b) provides an example depicting the consolidation of page P that incorporates deltas into a new ‘Consolidated Page P’. If this CAS fails, the thread abandons the operation by deallocating the new page. The thread does not retry, as a subsequent thread will eventually perform a successful consolidation.”
     [Fig. 3, split example: (a) creating sibling page Q; (b) installing a split delta with CAS; (c) installing an index entry delta with CAS. Dashed arrows represent logical pointers, while solid arrows represent physical pointers.]
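
A matching sketch of the consolidation step the excerpt describes, reusing the MappingTable and Delta stand-ins above: rebuild a fresh base page from the chain, attempt one CAS, and abandon the work if the CAS fails (no retry).

```python
# Page consolidation, per the excerpt above (simplified: the base page is a
# dict of live records rather than a sorted vector).

def consolidate(table, pid):
    head = table.read(pid)
    # Walk the chain newest-to-oldest; keep only the newest state per record.
    page, node = {}, head
    while isinstance(node, Delta):
        if node.key not in page:
            page[node.key] = None if node.op == "delete" else node.value
        node = node.next
    for key, value in (node or {}).items():   # remaining old base-page entries
        page.setdefault(key, value)
    new_base = {k: v for k, v in page.items() if v is not None}   # drop deletes
    # One CAS attempt; on failure, abandon -- a later thread will consolidate.
    table.cas(pid, head, new_base)
    return new_base
```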
  34. [Fig. 6: Bw-tree and BerkeleyDB throughput, operations/sec (M).]
     Workload      | Bw-tree | BerkeleyDB
     Xbox          | 10.40   | 0.56
     Synthetic     | 3.83    | 0.66
     Deduplication | 2.84    | 0.33
     “…over linked delta chains are good for branch prediction and prefetching in general, the Xbox workload has large 100-byte records, meaning fewer deltas will fit into the L1 cache during a scan. The synthetic workload contains small 8-byte keys, which are more amenable to prefetching and caching. Thus, delta chain lengths can grow longer (to about eight deltas) without performance consequences.”
  35. Reflection •  CAP? CALM. – Nothing in PTime requires coordination • 

    Wow – But CALM only tells us what’s possible •  Not how to do it. •  How do we get good at designing progressive systems?
  36. Getting Progressive 1.  Design patterns –  Use a log as

    ground truth • Derive data structures via “queries” over the streaming log – Use versions, not mutable state – ACID 2.0: Associative, Commutative, Idempotent – Your ideas go here... 2. Libraries and Languages – CRDTs are monotonic data types (see the G-Counter sketch below) • Have to link them together carefully – Bloom and Eve are languages whose compilers can test for monotonicity
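
As one concrete instance of the CRDT bullet above, here is a sketch of a grow-only counter (G-Counter); the example is mine, not from the deck. Each replica increments only its own slot, and merge is an element-wise max: associative, commutative, and idempotent, exactly the ACID 2.0 recipe.

```python
# A grow-only counter (G-Counter) CRDT: monotone per-replica slots, merged
# with element-wise max (associative, commutative, idempotent).

class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.slots = {}                     # replica_id -> count (only grows)

    def increment(self, n=1):
        self.slots[self.replica_id] = self.slots.get(self.replica_id, 0) + n

    def value(self):
        return sum(self.slots.values())

    def merge(self, other):
        for rid, count in other.slots.items():
            self.slots[rid] = max(self.slots.get(rid, 0), count)


# Replicas exchange state in any order, possibly redundantly, and agree.
r1, r2 = GCounter("r1"), GCounter("r2")
r1.increment(3); r2.increment(4)
r1.merge(r2); r2.merge(r1); r1.merge(r2)    # duplicate merges are harmless
assert r1.value() == r2.value() == 7
```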
  37. More? Declarative Networking: Recent Theoretical Work on Coordination, Correctness, and

    Declarative Semantics. T Ameloot. SIGMOD Record 2014. Scalable Atomic Visibility with RAMP Transactions. P Bailis, A Fekete, A Ghodsi, JM Hellerstein, I Stoica. SIGMOD 2014. The Bw-Tree: A B-tree for New Hardware Platforms. JJ Levandoski, DB Lomet, S Sengupta. ICDE 2013. http://boom.cs.berkeley.edu http://bit.ly/progressiveseminar
  38. Spanner? [Table 6: F1-perceived operation latencies (ms), measured over the course of 24 hours.]
     operation          | mean  | std dev | count
     all reads          | 8.7   | 376.4   | 21.5B
     single-site commit | 72.3  | 112.8   | 31.2M
     multi-site commit  | 103.0 | 52.2    | 32.1M
     “…of such tables are extremely uncommon. The F1 team has only seen such behavior when they do untuned bulk data loads as transactions.” Annotation: 10 TPS! [Corbett, et al. “Spanner:…”, OSDI12]
  39. Facebook Tao Google Megastore LinkedIn Espresso Due to coordination overheads…

    Amazon DynamoDB Apache Cassandra Basho Riak Yahoo! PNUTS Google App Engine …consciously choose to violate atomic visibility. “[Tao] explicitly favors efficiency and availability over consistency… [an edge] may exist without an inverse; these hanging associations are scheduled for repair by an asynchronous job.”
  40. Atomic Visibility is not serializability! [Diagram: two concurrent transactions, T1 = {r(x)=0, w(y←1)} and T2 = {r(y)=0, w(x←1)}. In serial order T1 then T2, we should have r(y)→1; in serial order T2 then T1, we should have r(x)→1. The concurrent execution in which both reads return 0 is therefore NOT SERIALIZABLE, but it respects Atomic Visibility.]