Scalable Atomic Visibility with RAMP Transactions

Slides for "Scalable Atomic Visibility with RAMP Transactions" by Bailis et al., appearing in SIGMOD 2014

This deck also contains a proposal for implementation in Cassandra. If you're interested in implementing RAMP in your own system, don't hesitate to get in touch: pbailis at cs.berkeley.edu or @pbailis

Paper: http://www.bailis.org/papers/ramp-sigmod2014.pdf
Blog post intro: http://www.bailis.org/blog/scalable-atomic-visibility-with-ramp-transactions/
Source code from paper and "executable pseudocode" in Python: https://github.com/pbailis/ramp-sigmod2014-code

pbailis

June 24, 2014

Transcript

  1. SCALABLE ATOMIC VISIBILITY WITH RAMP TRANSACTIONS Peter Bailis, Alan Fekete,

    Ali Ghodsi, Joseph M. Hellerstein, Ion Stoica UC Berkeley and University of Sydney Overview deck with Cassandra discussion @pbailis
  2. NOSQL: DIDN’T WANT SERIALIZABILITY. POOR PERFORMANCE, DELAY: PEAK THROUGHPUT IS 1/DELAY FOR CONTENDED OPERATIONS (at .5MS, 2K TXN/s; at 50MS, 20 TXN/s)
  3. “NOT ONLY SQL”: STILL DON’T WANT SERIALIZABILITY (DON’T WANT THE COSTS), BUT WANT MORE FEATURES. This paper!
  4. “TAO: Facebook’s Distributed Data Store for the Social Graph” USENIX ATC 2013 Denormalized Friend List Fast reads…
  5. “TAO: Facebook’s Distributed Data Store for the Social Graph” USENIX ATC 2013 Denormalized Friend List Fast reads… …multi-entity updates
  6. “TAO: Facebook’s Distributed Data Store for the Social Graph” USENIX ATC 2013 Denormalized Friend List Fast reads… …multi-entity updates
  7. “TAO: Facebook’s Distributed Data Store for the Social Graph” USENIX ATC 2013 Denormalized Friend List Fast reads… …multi-entity updates
  8. “TAO: Facebook’s Distributed Data Store for the Social Graph” USENIX ATC 2013 Denormalized Friend List Fast reads… …multi-entity updates
  9. “TAO: Facebook’s Distributed Data Store for the Social Graph” USENIX ATC 2013 Denormalized Friend List Fast reads… …multi-entity updates Not cleanly partitionable
  10. FOREIGN KEY DEPENDENCIES “TAO: Facebook’s Distributed Data Store for the

    Social Graph” USENIX ATC 2013 “On Brewing Fresh Espresso: LinkedIn’s Distributed Data Serving Platform” SIGMOD 2013
  11. FOREIGN KEY DEPENDENCIES “TAO: Facebook’s Distributed Data Store for the

    Social Graph” USENIX ATC 2013 “On Brewing Fresh Espresso: LinkedIn’s Distributed Data Serving Platform” SIGMOD 2013 “PNUTS: Yahoo!’s Hosted Data Serving Platform” VLDB 2008
  12. ID: 532 AGE: 42 ID: 123 AGE: 22 ID: 2345

    AGE: 1 ID: 412 AGE: 72 ID: 892 AGE: 13
  13. ID: 532 AGE: 42 ID: 123 AGE: 22 ID: 2345

    AGE: 1 ID: 412 AGE: 72 ID: 892 AGE: 13
  14. ID: 532 AGE: 42 ID: 123 AGE: 22 ID: 2345

    AGE: 1 ID: 412 AGE: 72 ID: 892 AGE: 13
  15. ID: 532 AGE: 42 ID: 123 AGE: 22 ID: 2345

    AGE: 1 ID: 412 AGE: 72 ID: 892 AGE: 13
  16. ID: 532 AGE: 42 ID: 123 AGE: 22 ID: 2345

    AGE: 1 ID: 412 AGE: 72 ID: 892 AGE: 13
  17. ID: 532 AGE: 42 ID: 123 AGE: 22 ID: 2345

    AGE: 1 ID: 412 AGE: 72 ID: 892 AGE: 13 Partition by primary key (ID)
  18. ID: 532 AGE: 42 ID: 123 AGE: 22 ID: 2345

    AGE: 1 ID: 412 AGE: 72 ID: 892 AGE: 13 Partition by primary key (ID) How should we look up by age?
  19. SECONDARY INDEXING Partition by primary key (ID) How should we

    look up by age? Option I: Local Secondary Indexing
  20. SECONDARY INDEXING Partition by primary key (ID) How should we

    look up by age? Option I: Local Secondary Indexing Build indexes co-located with primary data
  21. SECONDARY INDEXING Partition by primary key (ID) How should we

    look up by age? Option I: Local Secondary Indexing Build indexes co-located with primary data WRITE ONE SERVER, READ ALL
  22. SECONDARY INDEXING Partition by primary key (ID) How should we

    look up by age? Option I: Local Secondary Indexing Build indexes co-located with primary data poor scalability WRITE ONE SERVER, READ ALL
  23. SECONDARY INDEXING Partition by primary key (ID) How should we

    look up by age? Option I: Local Secondary Indexing Build indexes co-located with primary data Option II: Global Secondary Indexing Partition indexes by secondary key Partition by secondary attribute poor scalability WRITE ONE SERVER, READ ALL
  24. SECONDARY INDEXING Partition by primary key (ID) How should we

    look up by age? Option I: Local Secondary Indexing Build indexes co-located with primary data Option II: Global Secondary Indexing Partition indexes by secondary key Partition by secondary attribute WRITE 2+ SERVERS, READ ONE poor scalability WRITE ONE SERVER, READ ALL
  25. SECONDARY INDEXING Partition by primary key (ID) How should we

    look up by age? Option I: Local Secondary Indexing Build indexes co-located with primary data Option II: Global Secondary Indexing Partition indexes by secondary key Partition by secondary attribute WRITE 2+ SERVERS, READ ONE scalable lookups poor scalability WRITE ONE SERVER, READ ALL
  26. SECONDARY INDEXING Partition by primary key (ID) How should we

    look up by age? Option I: Local Secondary Indexing Build indexes co-located with primary data Option II: Global Secondary Indexing Partition indexes by secondary key Partition by secondary attribute WRITE 2+ SERVERS, READ ONE scalable lookups poor scalability WRITE ONE SERVER, READ ALL
  27. SECONDARY INDEXING Partition by primary key (ID) How should we

    look up by age? Option I: Local Secondary Indexing Build indexes co-located with primary data Option II: Global Secondary Indexing Partition indexes by secondary key Partition by secondary attribute scalable lookups poor scalability WRITE 2+ SERVERS, READ ONE WRITE ONE SERVER, READ ALL
  28. SECONDARY INDEXING Partition by primary key (ID) How should we

    look up by age? Option I: Local Secondary Indexing Build indexes co-located with primary data Option II: Global Secondary Indexing Partition indexes by secondary key Partition by secondary attribute scalable lookups poor scalability WRITE 2+ SERVERS, READ ONE WRITE ONE SERVER, READ ALL OVERVIEW
  29. SECONDARY INDEXING Partition by primary key (ID) How should we

    look up by age? Option I: Local Secondary Indexing Build indexes co-located with primary data Option II: Global Secondary Indexing Partition indexes by secondary key Partition by secondary attribute scalable lookups poor scalability WRITE 2+ SERVERS, READ ONE WRITE ONE SERVER, READ ALL OVERVIEW INCONSISTENT GLOBAL 2i
  30. SECONDARY INDEXING Partition by primary key (ID) How should we

    look up by age? Option I: Local Secondary Indexing Build indexes co-located with primary data Option II: Global Secondary Indexing Partition indexes by secondary key Partition by secondary attribute scalable lookups poor scalability WRITE 2+ SERVERS, READ ONE WRITE ONE SERVER, READ ALL OVERVIEW INCONSISTENT GLOBAL 2i INCONSISTENT GLOBAL 2i
  31. SECONDARY INDEXING Partition by primary key (ID) How should we

    look up by age? Option I: Local Secondary Indexing Build indexes co-located with primary data Option II: Global Secondary Indexing Partition indexes by secondary key Partition by secondary attribute scalable lookups poor scalability WRITE 2+ SERVERS, READ ONE WRITE ONE SERVER, READ ALL OVERVIEW INCONSISTENT GLOBAL 2i INCONSISTENT GLOBAL 2i (PROPOSED) INCONSISTENT GLOBAL 2i
  32. SECONDARY INDEXING Partition by primary key (ID) How should we

    look up by age? Option I: Local Secondary Indexing Build indexes co-located with primary data Option II: Global Secondary Indexing Partition indexes by secondary key Partition by secondary attribute scalable lookups poor scalability WRITE 2+ SERVERS, READ ONE WRITE ONE SERVER, READ ALL OVERVIEW INCONSISTENT GLOBAL 2i INCONSISTENT GLOBAL 2i (PROPOSED) INCONSISTENT GLOBAL 2i INCONSISTENT GLOBAL 2i
  33. SECONDARY INDEXING Partition by primary key (ID) How should we

    look up by age? Option I: Local Secondary Indexing Build indexes co-located with primary data Option II: Global Secondary Indexing Partition indexes by secondary key Partition by secondary attribute scalable lookups poor scalability WRITE 2+ SERVERS, READ ONE WRITE ONE SERVER, READ ALL OVERVIEW INCONSISTENT GLOBAL 2i INCONSISTENT GLOBAL 2i (PROPOSED) INCONSISTENT GLOBAL 2i INCONSISTENT GLOBAL 2i INCONSISTENT GLOBAL 2i
  34. MATERIALIZED VIEWS TABLE: ALL USERS TABLE: USERS OVER 25 RELEVANT

    RECENT EXAMPLES IN GOOGLE PERCOLATOR TWITTER RAINBIRD LINKEDIN ESPRESSO PAPERS
  35. SNAPSHOT ISOLATION REPEATABLE READ (PL-2.99) CURSOR STABILITY READ UNCOMMITTED READ

    COMMITTED CAUSAL PRAM RYW LINEARIZABILITY EVENTUAL CONSISTENCY SERIALIZABILITY
  36. SNAPSHOT ISOLATION REPEATABLE READ (PL-2.99) CURSOR STABILITY READ UNCOMMITTED READ

    COMMITTED CAUSAL PRAM RYW LINEARIZABILITY EVENTUAL CONSISTENCY SERIALIZABILITY
  37. REPEATABLE READ (PL-2.99) SERIALIZABILITY SNAPSHOT ISOLATION CURSOR STABILITY READ UNCOMMITTED

    READ COMMITTED LINEARIZABILITY CAUSAL PRAM RYW EVENTUAL CONSISTENCY
  38. REPEATABLE READ (PL-2.99) SERIALIZABILITY SNAPSHOT ISOLATION CURSOR STABILITY READ UNCOMMITTED

    READ COMMITTED LINEARIZABILITY MANY SUFFICIENT CAUSAL PRAM RYW EVENTUAL CONSISTENCY
  39. REPEATABLE READ (PL-2.99) SERIALIZABILITY SNAPSHOT ISOLATION CURSOR STABILITY READ UNCOMMITTED

    READ COMMITTED LINEARIZABILITY REQUIRE SYNCHRONOUS COORDINATION MANY SUFFICIENT CAUSAL PRAM RYW EVENTUAL CONSISTENCY
  40. SERIALIZABILITY SNAPSHOT ISOLATION REPEATABLE READ (PL-2.99) CURSOR STABILITY READ UNCOMMITTED

    READ COMMITTED CAUSAL PRAM RYW LINEARIZABILITY EVENTUAL CONSISTENCY REQUIRE SYNCHRONOUS COORDINATION MANY SUFFICIENT COORDINATION-FREE
  41. SERIALIZABILITY SNAPSHOT ISOLATION REPEATABLE READ (PL-2.99) CURSOR STABILITY READ UNCOMMITTED

    READ COMMITTED CAUSAL PRAM RYW LINEARIZABILITY EVENTUAL CONSISTENCY REQUIRE SYNCHRONOUS COORDINATION INSUFFICIENT MANY SUFFICIENT COORDINATION-FREE
  42. SERIALIZABILITY SNAPSHOT ISOLATION REPEATABLE READ (PL-2.99) CURSOR STABILITY LINEARIZABILITY REQUIRE

    SYNCHRONOUS COORDINATION INSUFFICIENT MANY SUFFICIENT COORDINATION-FREE Facebook TAO LinkedIn Espresso Yahoo! PNUTS Google Megastore Google App Engine Twitter Rainbird Amazon DynamoDB CONSCIOUS CHOICES!
  43. SERIALIZABILITY SNAPSHOT ISOLATION REPEATABLE READ (PL-2.99) CURSOR STABILITY READ UNCOMMITTED

    READ COMMITTED CAUSAL PRAM RYW LINEARIZABILITY EVENTUAL CONSISTENCY COORDINATION-FREE INSUFFICIENT REQUIRE SYNCHRONOUS COORDINATION SUFFICIENT
  44. SERIALIZABILITY SNAPSHOT ISOLATION REPEATABLE READ (PL-2.99) CURSOR STABILITY READ UNCOMMITTED

    READ COMMITTED CAUSAL PRAM RYW LINEARIZABILITY EVENTUAL CONSISTENCY COORDINATION-FREE RAMP (THIS PAPER) INSUFFICIENT REQUIRE SYNCHRONOUS COORDINATION SUFFICIENT
  45. Informally: Either all of each transaction’s updates are visible, or

    none are ATOMIC VISIBILITY WRITE X = 1 WRITE Y = 1
  46. Informally: Either all of each transaction’s updates are visible, or

    none are ATOMIC VISIBILITY WRITE X = 1 WRITE Y = 1 READ X = 1 READ Y = 1
  47. Informally: Either all of each transaction’s updates are visible, or

    none are ATOMIC VISIBILITY WRITE X = 1 WRITE Y = 1 READ X = 1 READ Y = 1 OR
  48. Informally: Either all of each transaction’s updates are visible, or

    none are ATOMIC VISIBILITY WRITE X = 1 WRITE Y = 1 READ X = 1 READ Y = 1 READ X = ∅ READ Y = ∅ OR
  49. Informally: Either all of each transaction’s updates are visible, or

    none are ATOMIC VISIBILITY READ X = 1 READ Y = 1 READ X = ∅ READ Y = ∅ OR
  50. Informally: Either all of each transaction’s updates are visible, or

    none are ATOMIC VISIBILITY READ X = 1 READ Y = 1 READ X = ∅ READ Y = ∅ OR BUT NOT READ Y = ∅ READ X = 1
  51. Informally: Either all of each transaction’s updates are visible, or

    none are ATOMIC VISIBILITY READ X = 1 READ Y = 1 READ X = ∅ READ Y = ∅ OR BUT NOT READ X = ∅ READ Y = ∅ READ X = 1 OR READ Y = 1
  52. ATOMIC VISIBILITY, FORMALIZED AS READ ATOMIC ISOLATION. Formally: a transaction T_j exhibits fractured reads if transaction T_i writes versions x_m and y_n (in any order, with x possibly but not necessarily equal to y), T_j reads version x_m and version y_k, and k < n. A system provides Read Atomic isolation (RA) if it prevents fractured reads anomalies and also prevents transactions from reading uncommitted, aborted, or intermediate data. We also provide per-item PRAM guarantees with per-transaction regular semantics (see paper Appendix).
  53. TRANSACTIONS RAMP GUARANTEE ATOMIC VISIBILITY WHILE ENSURING PARTITION INDEPENDENCE clients

    only access servers responsible for data in transactions W(X=1) W(Y=1) X Y Z
  54. TRANSACTIONS RAMP GUARANTEE ATOMIC VISIBILITY WHILE ENSURING PARTITION INDEPENDENCE clients

    only access servers responsible for data in transactions W(X=1) W(Y=1) X Y Z
  55. TRANSACTIONS RAMP GUARANTEE ATOMIC VISIBILITY WHILE ENSURING PARTITION INDEPENDENCE AND

    SYNCHRONIZATION INDEPENDENCE clients only access servers responsible for data in transactions transactions always commit* and no client can cause another client to block
  56. TRANSACTIONS RAMP GUARANTEE ATOMIC VISIBILITY ARE NOT SERIALIZABLE DO NOT

    PREVENT LOST UPDATE DO NOT PREVENT WRITE SKEW ALLOW CONCURRENT UPDATES
  57. TRANSACTIONS RAMP GUARANTEE ATOMIC VISIBILITY ARE NOT SERIALIZABLE DO NOT

    PREVENT LOST UPDATE DO NOT PREVENT WRITE SKEW ALLOW CONCURRENT UPDATES ARE GUIDED BY REAL WORLD USE CASES FOREIGN KEY DEPENDENCIES SECONDARY INDEXING MATERIALIZED VIEWS
  58. TRANSACTIONS RAMP GUARANTEE ATOMIC VISIBILITY ARE NOT SERIALIZABLE DO NOT

    PREVENT LOST UPDATE DO NOT PREVENT WRITE SKEW ALLOW CONCURRENT UPDATES ARE GUIDED BY REAL WORLD USE CASES FOREIGN KEY DEPENDENCIES SECONDARY INDEXING MATERIALIZED VIEWS Facebook TAO LinkedIn Espresso Yahoo! PNUTS Google Megastore Google App Engine Twitter Rainbird Amazon DynamoDB
  59. STRAWMAN: LOCKING X=1 W(X=1) W(Y=1) Y=0 R(X=?) R(Y=?) ATOMIC VISIBILITY

    COUPLED WITH MUTUAL EXCLUSION RTT unavailability!
  60. X=1 W(X=1) W(Y=1) Y=0 R(X=?) R(Y=?) ATOMIC VISIBILITY COUPLED WITH

    MUTUAL EXCLUSION at .5 MS < 2K TPS! unavailable during failures SIMILAR ISSUES IN MVCC, PRE-SCHEDULING SERIALIZABLE OCC, (global timestamp assignment/application) (multi-partition validation, liveness) (scheduling, multi-partition execution) STRAWMAN: LOCKING
  61. X=1 W(X=1) W(Y=1) Y=0 R(X=?) R(Y=?) ATOMIC VISIBILITY COUPLED WITH

    MUTUAL EXCLUSION at .5 MS < 2K TPS! unavailable during failures SIMILAR ISSUES IN MVCC, PRE-SCHEDULING SERIALIZABLE OCC, (global timestamp assignment/application) (multi-partition validation, liveness) (scheduling, multi-partition execution) FUNDAMENTAL TO “STRONG” SEMANTICS STRAWMAN: LOCKING
  62. BASIC IDEA W(X=1) W(Y=1) Y=0 R(X=?) R(Y=?) LET CLIENTS RACE,

    but HAVE READERS “CLEAN UP” X=1 METADATA
  63. BASIC IDEA W(X=1) W(Y=1) Y=0 R(X=?) R(Y=?) LET CLIENTS RACE,

    but HAVE READERS “CLEAN UP” X=1 + LIMITED MULTI-VERSIONING METADATA
  64. BASIC IDEA W(X=1) W(Y=1) Y=0 R(X=?) R(Y=?) LET CLIENTS RACE,

    but HAVE READERS “CLEAN UP” X=1 + LIMITED MULTI-VERSIONING METADATA FOR NOW:
 READ-ONLY, WRITE-ONLY TXNS
  65. last committed stamp for x: 0 RAMP-Fast known versions of

    x last committed stamp for y: 0 known versions of y
  66. last committed stamp for x: 0 RAMP-Fast known versions of

    x last committed stamp for y: 0 known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {}
  67. last committed stamp for x: 0 RAMP-Fast known versions of

    x last committed stamp for y: 0 known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {}
  68. last committed stamp for x: 0 RAMP-Fast known versions of

    x last committed stamp for y: 0 known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {}
  69. last committed stamp for x: 0 RAMP-Fast known versions of

    x last committed stamp for y: 0 known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {}
  70. last committed stamp for x: 0 RAMP-Fast known versions of

    x last committed stamp for y: 0 known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {}
  71. last committed stamp for x: 0 RAMP-Fast known versions of

    x last committed stamp for y: 0 known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {}
  72. last committed stamp for x: 0 RAMP-Fast W(X=1) W(Y=1) known

    versions of x last committed stamp for y: 0 known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {}
  73. last committed stamp for x: 0 RAMP-Fast W(X=1) W(Y=1) known

    versions of x last committed stamp for y: 0 known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 1.) Assign unique (logical) transaction timestamp. timestamp: 242 e.g., time concat client ID concat sequence number
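
As a concrete illustration of the “time concat client ID concat sequence number” scheme on this slide, here is a minimal Python sketch of client-side timestamp generation; the bit widths and the function name are illustrative assumptions, not taken from the paper:

```python
import time

def make_txn_timestamp(client_id: int, sequence_number: int) -> int:
    """Concatenate wall-clock millis, client ID, and a per-client sequence
    number into one integer so timestamps are unique across clients."""
    millis = int(time.time() * 1000)  # coarse time component
    # Illustrative layout: 16 bits for client ID, 16 bits for sequence number.
    return (millis << 32) | ((client_id & 0xFFFF) << 16) | (sequence_number & 0xFFFF)
```
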
  74. last committed stamp for x: 0 RAMP-Fast W(X=1) W(Y=1) known

    versions of x last committed stamp for y: 0 known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 1.) Assign unique (logical) transaction timestamp. timestamp: 242
  75. last committed stamp for x: 0 RAMP-Fast W(X=1) W(Y=1) known

    versions of x last committed stamp for y: 0 known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 1.) Assign unique (logical) transaction timestamp. 2.) Add write to known versions on partition. timestamp: 242
  76. last committed stamp for x: 0 RAMP-Fast W(X=1) W(Y=1) known

    versions of x last committed stamp for y: 0 known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 1.) Assign unique (logical) transaction timestamp. 2.) Add write to known versions on partition. 242 1 timestamp: 242
  77. last committed stamp for x: 0 RAMP-Fast W(X=1) W(Y=1) known

    versions of x last committed stamp for y: 0 known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 1.) Assign unique (logical) transaction timestamp. 2.) Add write to known versions on partition. 242 1 242 1 timestamp: 242
  78. last committed stamp for x: 0 RAMP-Fast W(X=1) W(Y=1) known

    versions of x last committed stamp for y: 0 known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 1.) Assign unique (logical) transaction timestamp. 2.) Add write to known versions on partition. 3.) Commit and update last committed stamp. 242 1 242 1 timestamp: 242
  79. RAMP-Fast W(X=1) W(Y=1) known versions of x last committed stamp

    for y: 0 known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 1.) Assign unique (logical) transaction timestamp. 2.) Add write to known versions on partition. 3.) Commit and update last committed stamp. 242 1 242 1 last committed stamp for x: 242 timestamp: 242
  80. RAMP-Fast W(X=1) W(Y=1) known versions of x known versions of

    y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 1.) Assign unique (logical) transaction timestamp. 2.) Add write to known versions on partition. 3.) Commit and update last committed stamp. 242 1 242 1 last committed stamp for x: 242 last committed stamp for y: 242 timestamp: 242
  81. RAMP-Fast W(X=1) W(Y=1) known versions of x known versions of

    y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 1.) Assign unique (logical) transaction timestamp. 2.) Add write to known versions on partition. 3.) Commit and update last committed stamp. 242 1 242 1 timestamp: 242 R(X=?) R(Y=?) last committed stamp for x: 242 last committed stamp for y: 242
  82. RAMP-Fast W(X=1) W(Y=1) known versions of x known versions of

    y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 1.) Assign unique (logical) transaction timestamp. 2.) Add write to known versions on partition. 3.) Commit and update last committed stamp. 242 1 242 1 timestamp: 242 R(X=1) R(Y=1) last committed stamp for x: 242 last committed stamp for y: 242
  83. RAMP-Fast W(X=1) W(Y=1) known versions of x known versions of

    y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 1.) Assign unique (logical) transaction timestamp. 2.) Add write to known versions on partition. 3.) Commit and update last committed stamp. 242 1 242 1 timestamp: 242 R(X=1) R(Y=1) last committed stamp for x: 242 last committed stamp for y: 242
  84. R(X=?) R(Y=?) RAMP-Fast W(X=1) W(Y=1) known versions of x known

    versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 1.) Assign unique (logical) transaction timestamp. 2.) Add write to known versions on partition. 3.) Commit and update last committed stamp. 242 1 242 1 timestamp: 242 last committed stamp for x: 242 last committed stamp for y: 0
  85. R(X=?) R(Y=?) RAMP-Fast W(X=1) W(Y=1) known versions of x known

    versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 1.) Assign unique (logical) transaction timestamp. 2.) Add write to known versions on partition. 3.) Commit and update last committed stamp. 242 1 242 1 timestamp: 242 last committed stamp for x: 242 last committed stamp for y: 0
  86. R(X=?) R(Y=?) RAMP-Fast W(X=1) W(Y=1) known versions of x known

    versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 1.) Assign unique (logical) transaction timestamp. 2.) Add write to known versions on partition. 3.) Commit and update last committed stamp. 242 1 242 1 timestamp: 242 RACE!!! last committed stamp for x: 242 last committed stamp for y: 0
  87. R(X=?) R(Y=?) RAMP-Fast W(X=1) W(Y=1) known versions of x known

    versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 1.) Assign unique (logical) transaction timestamp. 2.) Add write to known versions on partition. 3.) Commit and update last committed stamp. 242 1 242 1 timestamp: 242 RACE!!! R(X=1) R(Y=0) last committed stamp for x: 242 last committed stamp for y: 0
  88. R(X=?) R(Y=?) RAMP-Fast W(X=1) W(Y=1) known versions of x known

    versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 1.) Assign unique (logical) transaction timestamp. 2.) Add write to known versions on partition. 3.) Commit and update last committed stamp. 242 1 242 1 timestamp: 242 RACE!!! R(X=1) R(Y=0) last committed stamp for x: 242 last committed stamp for y: 0
  89. RAMP-Fast W(X=1) W(Y=1) known versions of x known versions of

    y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 1.) Assign unique (logical) transaction timestamp. 2.) Add write to known versions on partition. 3.) Commit and update last committed stamp. 242 1 242 1 timestamp: 242 RACE!!! R(X=1) R(Y=0) last committed stamp for x: 242 last committed stamp for y: 0
  90. RAMP-Fast W(X=1) W(Y=1) known versions of x known versions of

    y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 1.) Assign unique (logical) transaction timestamp. 2.) Add write to known versions on partition. 3.) Commit and update last committed stamp. 242 1 242 1 timestamp: 242 RACE!!! R(X=1) R(Y=0) last committed stamp for x: 242 last committed stamp for y: 0 RECORD THE ITEMS WRITTEN IN THE TRANSACTION
  91. RAMP-Fast W(X=1) W(Y=1) known versions of x known versions of

    y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 1.) Assign unique (logical) transaction timestamp. 2.) Add write to known versions on partition. 3.) Commit and update last committed stamp. 242 1 242 1 timestamp: 242 RACE!!! R(X=1) R(Y=0) last committed stamp for x: 242 last committed stamp for y: 0 {y} RECORD THE ITEMS WRITTEN IN THE TRANSACTION
  92. RAMP-Fast W(X=1) W(Y=1) known versions of x known versions of

    y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 1.) Assign unique (logical) transaction timestamp. 2.) Add write to known versions on partition. 3.) Commit and update last committed stamp. 242 1 242 1 timestamp: 242 RACE!!! R(X=1) R(Y=0) last committed stamp for x: 242 last committed stamp for y: 0 {y} {x} RECORD THE ITEMS WRITTEN IN THE TRANSACTION
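
Reading slides 72-92 together, the RAMP-Fast write path is a two-round protocol: PREPARE each version on its partition (tagged with the other items written by the same transaction), then COMMIT by advancing each item's last-committed timestamp. The sketch below is a rough, in-memory Python rendering of that idea under assumed names (Partition, prepare, commit); the authors' executable pseudocode is in the GitHub repository linked above.

```python
from dataclasses import dataclass

@dataclass
class Version:
    timestamp: int
    value: object
    metadata: frozenset    # other items written by the same transaction

class Partition:
    """One server: multi-versioned storage plus a last-committed stamp per item."""
    def __init__(self):
        self.versions = {}      # item -> {timestamp: Version}
        self.last_commit = {}   # item -> highest committed timestamp

    def prepare(self, item, version):
        # 2PC round 1: durably store the version, but don't make it visible yet.
        self.versions.setdefault(item, {})[version.timestamp] = version

    def commit(self, item, timestamp):
        # 2PC round 2: make the write visible by advancing the committed stamp.
        self.last_commit[item] = max(self.last_commit.get(item, 0), timestamp)

    def last_committed(self, item):
        ts = self.last_commit.get(item, 0)
        return self.versions.get(item, {}).get(ts, Version(0, None, frozenset()))

    def get_version(self, item, timestamp):
        # Readers may fetch prepared-but-uncommitted versions by exact timestamp,
        # which is why a stalled commit never blocks a reader.
        return self.versions[item][timestamp]

def ramp_fast_write(partitions, write_set, ts):
    """write_set maps item -> value; each stored version records its siblings."""
    for item, value in write_set.items():
        siblings = frozenset(write_set) - {item}
        partitions[item].prepare(item, Version(ts, value, siblings))
    for item in write_set:   # only after every prepare is acknowledged: commit
        partitions[item].commit(item, ts)
```
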
  93. RAMP-Fast known versions of x known versions of y TIMESTAMP

    VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 242 1 242 1 last committed stamp for x: 242 last committed stamp for y: 0 {y} {x} R(X=?) R(Y=?)
  94. RAMP-Fast known versions of x known versions of y TIMESTAMP

    VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 242 1 242 1 last committed stamp for x: 242 last committed stamp for y: 0 {y} {x} R(X=?) R(Y=?)
  95. RAMP-Fast known versions of x known versions of y TIMESTAMP

    VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 242 1 242 1 last committed stamp for x: 242 last committed stamp for y: 0 {y} {x} R(X=?) R(Y=?)
  96. RAMP-Fast known versions of x known versions of y TIMESTAMP

    VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 242 1 242 1 last committed stamp for x: 242 last committed stamp for y: 0 {y} {x} R(X=?) R(Y=?) 1.) Read last committed:
  97. RAMP-Fast known versions of x known versions of y TIMESTAMP

    VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 242 1 242 1 last committed stamp for x: 242 last committed stamp for y: 0 {y} {x} R(X=?) R(Y=?) 1.) Read last committed: X=1 @ 242, {Y}
  98. RAMP-Fast known versions of x known versions of y TIMESTAMP

    VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 242 1 242 1 last committed stamp for x: 242 last committed stamp for y: 0 {y} {x} R(X=?) R(Y=?) 1.) Read last committed: X=1 @ 242, {Y} Y=NULL @ 0, {}
  99. RAMP-Fast known versions of x known versions of y TIMESTAMP

    VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 242 1 242 1 last committed stamp for x: 242 last committed stamp for y: 0 {y} {x} R(X=?) R(Y=?) 1.) Read last committed: 2.) Calculate missing versions: X=1 @ 242, {Y} Y=NULL @ 0, {}
  100. RAMP-Fast known versions of x known versions of y TIMESTAMP

    VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 242 1 242 1 last committed stamp for x: 242 last committed stamp for y: 0 {y} {x} R(X=?) R(Y=?) 1.) Read last committed: 2.) Calculate missing versions: X=1 @ 242, {Y} Y=NULL @ 0, {} ITEM HIGHEST TS X 242 Y 242
  101. RAMP-Fast known versions of x known versions of y TIMESTAMP

    VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 242 1 242 1 last committed stamp for x: 242 last committed stamp for y: 0 {y} {x} R(X=?) R(Y=?) 1.) Read last committed: 2.) Calculate missing versions: X=1 @ 242, {Y} Y=NULL @ 0, {} ITEM HIGHEST TS X 242 Y 242
  102. RAMP-Fast known versions of x known versions of y TIMESTAMP

    VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 242 1 242 1 last committed stamp for x: 242 last committed stamp for y: 0 {y} {x} R(X=?) R(Y=?) 1.) Read last committed: 2.) Calculate missing versions: 3.) Fetch missing versions. X=1 @ 242, {Y} Y=NULL @ 0, {} ITEM HIGHEST TS X 242 Y 242
  103. RAMP-Fast known versions of x known versions of y TIMESTAMP

    VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 242 1 242 1 last committed stamp for x: 242 last committed stamp for y: 0 {y} {x} R(X=?) R(Y=?) 1.) Read last committed: 2.) Calculate missing versions: 3.) Fetch missing versions. X=1 @ 242, {Y} Y=NULL @ 0, {} ITEM HIGHEST TS X 242 Y 242 Y=1 @ 242, {X} (Send required timestamp in request)
  104. RAMP-Fast known versions of x known versions of y TIMESTAMP

    VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 242 1 242 1 last committed stamp for x: 242 last committed stamp for y: 0 {y} {x} R(X=?) R(Y=?) 1.) Read last committed: 2.) Calculate missing versions: 3.) Fetch missing versions. X=1 @ 242, {Y} Y=NULL @ 0, {} ITEM HIGHEST TS X 242 Y 242 Y=1 @ 242, {X} (Send required timestamp in request) 2PC ENSURES NO WAIT AT SERVER
  105. RAMP-Fast known versions of x known versions of y TIMESTAMP

    VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 242 1 242 1 last committed stamp for x: 242 last committed stamp for y: 0 {y} {x} R(X=?) R(Y=?) 1.) Read last committed: 2.) Calculate missing versions: 3.) Fetch missing versions. X=1 @ 242, {Y} Y=NULL @ 0, {} ITEM HIGHEST TS X 242 Y 242 Y=1 @ 242, {X} (Send required timestamp in request) 4.) Return resulting set. R(X=1) R(Y=1) 2PC ENSURES NO WAIT AT SERVER
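
A sketch of the corresponding RAMP-Fast read path from slides 93-105, continuing the Partition sketch above: the first round returns last-committed versions, the attached write-set metadata exposes any missing sibling writes, and a second round fetches those by exact timestamp (which the server can always answer without waiting, thanks to 2PC ordering). Helper names are again assumptions.

```python
def ramp_fast_read(partitions, items):
    # Round 1: fetch each item's last-committed version (value + metadata).
    latest = {item: partitions[item].last_committed(item) for item in items}

    # From the metadata, compute the highest timestamp each item must reflect.
    required = {item: v.timestamp for item, v in latest.items()}
    for v in latest.values():
        for sibling in v.metadata:
            if sibling in required:
                required[sibling] = max(required[sibling], v.timestamp)

    # Round 2: re-fetch, by exact timestamp, any item whose committed copy is stale.
    for item in items:
        if required[item] > latest[item].timestamp:
            latest[item] = partitions[item].get_version(item, required[item])
    return {item: v.value for item, v in latest.items()}
```

For the race in slides 84-92, the second round re-fetches y at timestamp 242 from the prepared versions, so the client observes X=1, Y=1 rather than a fractured read.
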
  106. RAMP-Fast 2 RTT writes: 2PC, without blocking synchronization metadata size

    linear in transaction size ENSURES READERS NEVER WAIT!
  107. RAMP-Fast 2 RTT writes: 2PC, without blocking synchronization metadata size

    linear in transaction size 1 RTT reads: in race-free case ENSURES READERS NEVER WAIT!
  108. RAMP-Fast 2 RTT writes: 2PC, without blocking synchronization metadata size

    linear in transaction size 1 RTT reads: in race-free case 2 RTT reads: otherwise ENSURES READERS NEVER WAIT!
  109. RAMP-Fast 2 RTT writes: 2PC, without blocking synchronization metadata size

    linear in transaction size 1 RTT reads: in race-free case 2 RTT reads: otherwise no fast-path synchronization ENSURES READERS NEVER WAIT!
  110. RAMP-Fast 2 RTT writes: 2PC, without blocking synchronization metadata size

    linear in transaction size 1 RTT reads: in race-free case 2 RTT reads: otherwise no fast-path synchronization ENSURES READERS NEVER WAIT! CAN WE USE LESS METADATA?
  111. RAMP-Small 2 RTT writes: same basic protocol as RAMP-Fast but

    drop all RAMP-Fast metadata 2 RTT reads
  112. RAMP-Small 2 RTT writes: same basic protocol as RAMP-Fast but

    drop all RAMP-Fast metadata 2 RTT reads always
  113. RAMP-Small 2 RTT writes: same basic protocol as RAMP-Fast but

    drop all RAMP-Fast metadata 2 RTT reads 1.) For each item, fetch the highest committed timestamp. 2.) Request highest matching write with timestamp in step 1. always
  114. RAMP-Small 2 RTT writes: same basic protocol as RAMP-Fast but

    drop all RAMP-Fast metadata 2 RTT reads INTUITION: 1.) For each item, fetch the highest committed timestamp. 2.) Request highest matching write with timestamp in step 1. always
  115. RAMP-Small 2 RTT writes: same basic protocol as RAMP-Fast but

    drop all RAMP-Fast metadata 2 RTT reads INTUITION: 1.) For each item, fetch the highest committed timestamp. 2.) Request highest matching write with timestamp in step 1. X time 523 always
  116. RAMP-Small 2 RTT writes: same basic protocol as RAMP-Fast but

    drop all RAMP-Fast metadata 2 RTT reads INTUITION: 1.) For each item, fetch the highest committed timestamp. 2.) Request highest matching write with timestamp in step 1. X time 523 Y time 247 always
  117. RAMP-Small 2 RTT writes: same basic protocol as RAMP-Fast but

    drop all RAMP-Fast metadata 2 RTT reads INTUITION: 1.) For each item, fetch the highest committed timestamp. 2.) Request highest matching write with timestamp in step 1. X time 523 Y time 247 Z time 842 always
  118. RAMP-Small 2 RTT writes: same basic protocol as RAMP-Fast but

    drop all RAMP-Fast metadata 2 RTT reads INTUITION: 1.) For each item, fetch the highest committed timestamp. 2.) Request highest matching write with timestamp in step 1. X time 523 Y time 247 Z time 842 {247, 523, 842} always
  119. RAMP-Small 2 RTT writes: same basic protocol as RAMP-Fast but

    drop all RAMP-Fast metadata 2 RTT reads partial commits will be in this set INTUITION: 1.) For each item, fetch the highest committed timestamp. 2.) Request highest matching write with timestamp in step 1. X time 523 Y time 247 Z time 842 {247, 523, 842} always
  120. RAMP-Small 2 RTT writes: same basic protocol as RAMP-Fast but

    drop all RAMP-Fast metadata 2 RTT reads partial commits will be in this set INTUITION: 1.) For each item, fetch the highest committed timestamp. 2.) Request highest matching write with timestamp in step 1. X time 523 Y time 247 Z time 842 {247, 523, 842} send it to all participating servers always
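
A corresponding sketch of the RAMP-Small read path just described, still using the Partition layout from the earlier sketch: round one collects only the highest committed timestamp per item, and round two sends that whole timestamp set so each partition can return its highest matching version. All names here are illustrative.

```python
def ramp_small_read(partitions, items):
    # Round 1: per item, fetch only the highest committed timestamp.
    ts_set = {partitions[item].last_commit.get(item, 0) for item in items}

    # Round 2: ship the whole timestamp set to every participating partition;
    # each returns its highest-timestamped version whose stamp is in the set.
    results = {}
    for item in items:
        matches = [v for ts, v in partitions[item].versions.get(item, {}).items()
                   if ts in ts_set]
        results[item] = max(matches, key=lambda v: v.timestamp).value if matches else None
    return results
```
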
  121. RAMP Summary Algorithm Write RTT READ RTT (best case) READ

    RTT (worst case) METADATA RAMP-Fast 2 1 2 O(txn len) write set summary RAMP-Small 2 2 2 O(1) timestamp RAMP-Hybrid 2 1+ε 2 O(1) Bloom filter
  122. RAMP Summary Algorithm Write RTT READ RTT (best case) READ

    RTT (worst case) METADATA RAMP-Fast 2 1 2 O(txn len) write set summary RAMP-Small 2 2 2 O(1) timestamp RAMP-Hybrid 2 1+ε 2 O(1) Bloom filter
  123. RAMP Summary Algorithm Write RTT READ RTT (best case) READ

    RTT (worst case) METADATA RAMP-Fast 2 1 2 O(txn len) write set summary RAMP-Small 2 2 2 O(1) timestamp RAMP-Hybrid 2 1+ε 2 O(1) Bloom filter
  124. RAMP Summary Algorithm Write RTT READ RTT (best case) READ

    RTT (worst case) METADATA RAMP-Fast 2 1 2 O(txn len) write set summary RAMP-Small 2 2 2 O(1) timestamp RAMP-Hybrid 2 1+ε 2 O(1) Bloom filter
  125. RAMP Summary Algorithm Write RTT READ RTT (best case) READ

    RTT (worst case) METADATA RAMP-Fast 2 1 2 O(txn len) write set summary RAMP-Small 2 2 2 O(1) timestamp RAMP-Hybrid 2 1+ε 2 O(1) Bloom filter
  126. RAMP Summary Algorithm Write RTT READ RTT (best case) READ

    RTT (worst case) METADATA RAMP-Fast 2 1 2 O(txn len) write set summary RAMP-Small 2 2 2 O(1) timestamp RAMP-Hybrid 2 1+ε 2 O([txn len]·log(1/ε)/log(2)²)
  127. RAMP Summary Algorithm Write RTT READ RTT (best case) READ

    RTT (worst case) METADATA RAMP-Fast 2 1 2 O(txn len) write set summary RAMP-Small 2 2 2 O(1) timestamp RAMP-Hybrid 2 1+ε 2 O([txn len]·log(1/ε)/log(2)²)
  128. RAMP Summary Algorithm Write RTT READ RTT (best case) READ

    RTT (worst case) METADATA RAMP-Fast 2 1 2 O(txn len) write set summary RAMP-Small 2 2 2 O(1) timestamp RAMP-Hybrid 2 1+ε 2 O([txn len]·log(1/ε)/log(2)²) BLOOM FILTER SUMMARIZES WRITE SET; FALSE POSITIVES: EXTRA RTTs
  129. RAMP Summary Algorithm Write RTT READ RTT (best case) READ

    RTT (worst case) METADATA RAMP-Fast 2 1 2 O(txn len) write set summary RAMP-Small 2 2 2 O(1) timestamp RAMP-Hybrid 2 1+ε 2 O([txn len]·log(1/ε)/log(2)²) BLOOM FILTER SUMMARIZES WRITE SET; FALSE POSITIVES: EXTRA RTTs
  130. RAMP Summary Algorithm Write RTT READ RTT (best case) READ

    RTT (worst case) METADATA RAMP-Fast 2 1 2 O(txn len) write set summary RAMP-Small 2 2 2 O(1) timestamp RAMP-Hybrid 2 1+ε 2 O(B) Bloom filter
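
RAMP-Hybrid keeps RAMP-Fast's structure but summarizes the write set with a Bloom filter, so a false positive costs at most an extra read round and never affects correctness. Below is a tiny, self-contained Bloom filter sketch (the parameters and hashing scheme are arbitrary assumptions) showing the membership test a RAMP-H reader would use in place of RAMP-F's exact write-set check.

```python
import hashlib

class BloomFilter:
    """Fixed-size bitset summarizing a write set; no false negatives."""
    def __init__(self, bits=64, hashes=3):
        self.bits, self.hashes, self.value = bits, hashes, 0

    def _positions(self, item):
        for i in range(self.hashes):
            digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, item):
        for pos in self._positions(item):
            self.value |= 1 << pos

    def might_contain(self, item):
        # True may be a false positive (costs an extra read RTT); False is exact.
        return all(self.value & (1 << pos) for pos in self._positions(item))
```

A RAMP-H writer would attach this filter to each version in place of RAMP-F's explicit sibling set; a reader treats might_contain(item) the way RAMP-F treats set membership when deciding which versions to re-fetch.
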
  131. RAMP Summary Algorithm Write RTT READ RTT (best case) READ

    RTT (worst case) METADATA RAMP-Fast 2 1 2 O(txn len) write set summary RAMP-Small 2 2 2 O(1) timestamp RAMP-Hybrid 2 1+ε 2 O(B) Bloom filter • AVOID IN-PLACE UPDATES
  132. RAMP Summary Algorithm Write RTT READ RTT (best case) READ

    RTT (worst case) METADATA RAMP-Fast 2 1 2 O(txn len) write set summary RAMP-Small 2 2 2 O(1) timestamp RAMP-Hybrid 2 1+ε 2 O(B) Bloom filter • AVOID IN-PLACE UPDATES • EMBRACE RACES TO IMPROVE CONCURRENCY
  133. RAMP Summary Algorithm Write RTT READ RTT (best case) READ

    RTT (worst case) METADATA RAMP-Fast 2 1 2 O(txn len) write set summary RAMP-Small 2 2 2 O(1) timestamp RAMP-Hybrid 2 1+ε 2 O(B) Bloom filter • AVOID IN-PLACE UPDATES • EMBRACE RACES TO IMPROVE CONCURRENCY • ALLOW READERS TO REPAIR PARTIAL WRITES
  134. RAMP Summary Algorithm Write RTT READ RTT (best case) READ

    RTT (worst case) METADATA RAMP-Fast 2 1 2 O(txn len) write set summary RAMP-Small 2 2 2 O(1) timestamp RAMP-Hybrid 2 1+ε 2 O(B) Bloom filter • AVOID IN-PLACE UPDATES • EMBRACE RACES TO IMPROVE CONCURRENCY • ALLOW READERS TO REPAIR PARTIAL WRITES • USE 2PC TO AVOID READER STALLS
  135. Additional Details Garbage collection: limit read transaction duration to K

    seconds GC overwritten versions after K seconds
  136. Additional Details Garbage collection: limit read transaction duration to K

    seconds GC overwritten versions after K seconds Replication paper assumes linearizable masters extendable to “AP” systems see HAT by Bailis et al., VLDB 2014
  137. Additional Details Garbage collection: limit read transaction duration to K

    seconds GC overwritten versions after K seconds Failure handling: blocked 2PC rounds do not block clients stalled commits? versions are not GC’d if desirable, use CTP termination protocol Replication paper assumes linearizable masters extendable to “AP” systems see HAT by Bailis et al., VLDB 2014
  138. RAMP PERFORMANCE Algorithm Write RTT READ RTT (best case) READ

    RTT (worst case) METADATA RAMP-Fast 2 1 2 O(txn len) write set summary RAMP-Small 2 2 2 O(1) timestamp RAMP-Hybrid 2 1+ε 2 O(B) Bloom filter
  139. RAMP PERFORMANCE Algorithm Write RTT READ RTT (best case) READ

    RTT (worst case) METADATA RAMP-Fast 2 1 2 O(txn len) write set summary RAMP-Small 2 2 2 O(1) timestamp RAMP-Hybrid 2 1+ε 2 O(B) Bloom filter EVALUATED ON EC2 cr1.8xlarge instances (cluster size: 1-100 servers; default: 5) ! open sourced on GitHub; see link at end of talk
  140. YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn [chart: Throughput (txn/s) vs. Concurrent Clients, 0-10,000]
  141. YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn [chart: Throughput (txn/s) vs. Concurrent Clients] RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control
  142. YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn [chart: Throughput (txn/s) vs. Concurrent Clients] Doesn’t provide atomic visibility RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control
  143. YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn [chart: Throughput (txn/s) vs. Concurrent Clients] RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control
  144. YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn [chart: Throughput (txn/s) vs. Concurrent Clients] RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control LWSR LWLR E-PCI Serializable 2PL
  145. YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn [chart: Throughput (txn/s) vs. Concurrent Clients] RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control LWSR LWLR E-PCI Serializable 2PL NWNR LWNR LWSR LWLR E-PCI Write Locks Only
  146. YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn [chart: Throughput (txn/s) vs. Concurrent Clients] Also doesn’t provide atomic visibility RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control LWSR LWLR E-PCI Serializable 2PL NWNR LWNR LWSR LWLR E-PCI Write Locks Only
  147. YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn [chart: Throughput (txn/s) vs. Concurrent Clients] Representative of coordinated approaches RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control LWSR LWLR E-PCI Serializable 2PL NWNR LWNR LWSR LWLR E-PCI Write Locks Only
  148. YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn [chart: Throughput (txn/s) vs. Concurrent Clients] RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control LWSR LWLR E-PCI Serializable 2PL NWNR LWNR LWSR LWLR E-PCI Write Locks Only
  149. YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn [chart: Throughput (txn/s) vs. Concurrent Clients] RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control LWSR LWLR E-PCI Serializable 2PL NWNR LWNR LWSR LWLR E-PCI Write Locks Only RAMP-F RAMP-S RAMP-Fast
  150. YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn [chart: Throughput (txn/s) vs. Concurrent Clients] Within ~5% of baseline; latency in paper (comparable) RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control LWSR LWLR E-PCI Serializable 2PL NWNR LWNR LWSR LWLR E-PCI Write Locks Only RAMP-F RAMP-S RAMP-Fast
  151. YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn [chart: Throughput (txn/s) vs. Concurrent Clients] RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control LWSR LWLR E-PCI Serializable 2PL NWNR LWNR LWSR LWLR E-PCI Write Locks Only RAMP-F RAMP-S RAMP-Fast
  152. YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn [chart: Throughput (txn/s) vs. Concurrent Clients] RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control LWSR LWLR E-PCI Serializable 2PL NWNR LWNR LWSR LWLR E-PCI Write Locks Only RAMP-F RAMP-S RAMP-Fast RAMP-F RAMP-S RAMP-H RAMP-Small
  153. YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn [chart: Throughput (txn/s) vs. Concurrent Clients] Always needs 2 RTT reads RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control LWSR LWLR E-PCI Serializable 2PL NWNR LWNR LWSR LWLR E-PCI Write Locks Only RAMP-F RAMP-S RAMP-Fast RAMP-F RAMP-S RAMP-H RAMP-Small
  154. YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn [chart: Throughput (txn/s) vs. Concurrent Clients] RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control LWSR LWLR E-PCI Serializable 2PL NWNR LWNR LWSR LWLR E-PCI Write Locks Only RAMP-F RAMP-S RAMP-Fast RAMP-F RAMP-S RAMP-H RAMP-Small
  155. RAMP-F RAMP-S RAMP-H NWNR RAMP-Hybrid YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn [chart: Throughput (txn/s) vs. Concurrent Clients] RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control LWSR LWLR E-PCI Serializable 2PL NWNR LWNR LWSR LWLR E-PCI Write Locks Only RAMP-F RAMP-S RAMP-Fast RAMP-F RAMP-S RAMP-H RAMP-Small
  156. RAMP-F RAMP-S RAMP-H NWNR RAMP-Hybrid YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn [chart: Throughput (txn/s) vs. Concurrent Clients] RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control LWSR LWLR E-PCI Serializable 2PL NWNR LWNR LWSR LWLR E-PCI Write Locks Only RAMP-F RAMP-S RAMP-Fast RAMP-F RAMP-S RAMP-H RAMP-Small
  157. YCSB: WorkloadA, 1M items, 4 items/txn, 5K clients [chart: Throughput (txn/s) vs. Percentage Reads, 0-100]
  158. YCSB: WorkloadA, 1M items, 4 items/txn, 5K clients [chart: Throughput (txn/s) vs. Percentage Reads] RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control
  159. YCSB: WorkloadA, 1M items, 4 items/txn, 5K clients [chart: Throughput (txn/s) vs. Percentage Reads] RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control LWSR LWLR E-PCI Serializable 2PL NWNR LWNR LWSR LWLR E-PCI Write Locks Only
  160. RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control LWSR LWLR E-PCI Serializable 2PL NWNR LWNR LWSR LWLR E-PCI Write Locks Only RAMP-F RAMP-S RAMP-Fast RAMP-F RAMP-S RAMP-H RAMP-Small YCSB: WorkloadA, 1M items, 4 items/txn, 5K clients [chart: Throughput (txn/s) vs. Percentage Reads] RAMP-F RAMP-S RAMP-H NWNR RAMP-Hybrid
  161. RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control LWSR LWLR E-PCI Serializable 2PL NWNR LWNR LWSR LWLR E-PCI Write Locks Only RAMP-F RAMP-S RAMP-Fast RAMP-F RAMP-S RAMP-H RAMP-Small YCSB: WorkloadA, 1M items, 4 items/txn, 5K clients [chart: Throughput (txn/s) vs. Percentage Reads] RAMP-F RAMP-S RAMP-H NWNR RAMP-Hybrid Linear scaling; due to 2RTT writes, races
  162. RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control LWSR LWLR E-PCI Serializable 2PL NWNR LWNR LWSR LWLR E-PCI Write Locks Only RAMP-F RAMP-S RAMP-Fast RAMP-F RAMP-S RAMP-H RAMP-Small YCSB: WorkloadA, 1M items, 4 items/txn, 5K clients [chart: Throughput (txn/s) vs. Percentage Reads] RAMP-F RAMP-S RAMP-H NWNR RAMP-Hybrid
  163. YCSB: uniform access, 1M items, 4 items/txn, 95% reads [chart: Throughput (ops/s) vs. Number of Servers, 0-100]
  164. RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control YCSB: uniform access, 1M items, 4 items/txn, 95% reads [chart: Throughput (ops/s) vs. Number of Servers]
  165. RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control RAMP-F RAMP-S RAMP-Fast RAMP-F RAMP-S RAMP-H RAMP-Small RAMP-F RAMP-S RAMP-H NWNR RAMP-Hybrid YCSB: uniform access, 1M items, 4 items/txn, 95% reads [chart: Throughput (ops/s) vs. Number of Servers]
  166. RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control RAMP-F RAMP-S RAMP-Fast RAMP-F RAMP-S RAMP-H RAMP-Small RAMP-F RAMP-S RAMP-H NWNR RAMP-Hybrid YCSB: uniform access, 1M items, 4 items/txn, 95% reads [chart: operations/s/server vs. Number of Servers]
  167. RAMP PERFORMANCE Algorithm Write RTT READ RTT (best case) READ

    RTT (worst case) METADATA RAMP-Fast 2 1 2 O(txn len) write set summary RAMP-Small 2 2 2 O(1) timestamp RAMP-Hybrid 2 1+ε 2 O(B) Bloom filter
  168. RAMP PERFORMANCE Algorithm Write RTT READ RTT (best case) READ

    RTT (worst case) METADATA RAMP-Fast 2 1 2 O(txn len) write set summary RAMP-Small 2 2 2 O(1) timestamp RAMP-Hybrid 2 1+ε 2 O(B) Bloom filter More results in paper: Transaction length, contention, value size, latency, failures
  169. FOREIGN KEY DEPENDENCIES SECONDARY INDEXING MATERIALIZED VIEWS HOW RAMP HANDLES:

    Maintain list of matching record IDs and versions e.g., HAS_BEARD={52@512, 412@52, 123@512} merge lists on commit/read (LWW by timestamp for conflicts)
  170. FOREIGN KEY DEPENDENCIES SECONDARY INDEXING MATERIALIZED VIEWS HOW RAMP HANDLES:

    Maintain list of matching record IDs and versions e.g., HAS_BEARD={52@512, 412@52, 123@512} merge lists on commit/read (LWW by timestamp for conflicts) LOOKUPs: READ INDEX, THEN FETCH DATA
  171. FOREIGN KEY DEPENDENCIES SECONDARY INDEXING MATERIALIZED VIEWS HOW RAMP HANDLES:

    Maintain list of matching record IDs and versions e.g., HAS_BEARD={52@512, 412@52, 123@512} merge lists on commit/read (LWW by timestamp for conflicts) LOOKUPs: READ INDEX, THEN FETCH DATA SIMILAR FOR SELECT/PROJECT
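
A small sketch of the merge rule described above for secondary-index entries: each entry is a set of (record ID, version timestamp) pairs, and concurrent updates to the same record ID resolve last-writer-wins by timestamp. The function name and pair encoding are assumptions that mirror the HAS_BEARD example.

```python
def merge_index_entries(entries_a, entries_b):
    """Merge two index-entry sets of (record_id, timestamp) pairs, keeping the
    highest timestamp per record ID (last-writer-wins)."""
    highest = {}
    for record_id, ts in list(entries_a) + list(entries_b):
        if ts > highest.get(record_id, -1):
            highest[record_id] = ts
    return {(record_id, ts) for record_id, ts in highest.items()}

# e.g. merging {(52, 512), (412, 52)} with {(412, 60), (123, 512)}
# yields {(52, 512), (412, 60), (123, 512)}
```
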
  172. SERIALIZABILITY SNAPSHOT ISOLATION REPEATABLE READ (PL-2.99) CURSOR STABILITY READ UNCOMMITTED

    READ COMMITTED CAUSAL PRAM RYW LINEARIZABILITY EVENTUAL CONSISTENCY REQUIRE SYNCHRONOUS COORDINATION INSUFFICIENT SUFFICIENT COORDINATION-FREE
  173. SERIALIZABILITY SNAPSHOT ISOLATION REPEATABLE READ (PL-2.99) CURSOR STABILITY READ UNCOMMITTED

    READ COMMITTED CAUSAL PRAM RYW LINEARIZABILITY EVENTUAL CONSISTENCY REQUIRE SYNCHRONOUS COORDINATION INSUFFICIENT SUFFICIENT COORDINATION-FREE
  174. SERIALIZABILITY SNAPSHOT ISOLATION REPEATABLE READ (PL-2.99) CURSOR STABILITY READ UNCOMMITTED

    READ COMMITTED CAUSAL PRAM RYW LINEARIZABILITY EVENTUAL CONSISTENCY COORDINATION-FREE INSUFFICIENT SUFFICIENT REQUIRE SYNCHRONOUS COORDINATION
  175. SERIALIZABILITY SNAPSHOT ISOLATION REPEATABLE READ (PL-2.99) CURSOR STABILITY READ UNCOMMITTED

    READ COMMITTED CAUSAL PRAM RYW LINEARIZABILITY EVENTUAL CONSISTENCY COORDINATION-FREE ATOMIC VISIBILITY VIA RAMP INSUFFICIENT SUFFICIENT REQUIRE SYNCHRONOUS COORDINATION
  176. RAMP IN CASSANDRA STRAIGHTFORWARD USES: •Add atomic visibility to atomic

    batch operations •Expose as CQL isolation level •USING CONSISTENCY READ_ATOMIC •Encourage use in multi-put, multi-get •Treat as basis for global secondary indexing •CREATE GLOBAL INDEX on users (age)
  177. RAMP IN CASSANDRA REQUIREMENTS: •Unique timestamp generation for transactions •Use

    node ID from ring •Other form of UUID •Hash transaction contents* •Limited multi-versioning for prepared and old values •RAMP doesn’t actually require true MVCC •One proposal: keep a lookaside cache
  178. RAMP IN CASSANDRA POSSIBLE IMPLEMENTATION: Lookaside cache for prepared and

    old values ! Standard C* Table stores last committed write 1 52 335 1240 1402 2201
  179. RAMP IN CASSANDRA POSSIBLE IMPLEMENTATION: Lookaside cache for prepared and

    old values ! Standard C* Table stores last committed write Shadow table stores prepared-but-not-committed and overwritten versions 1 52 335 1240 1402 2201 64 335 2201
  180. RAMP IN CASSANDRA POSSIBLE IMPLEMENTATION: Lookaside cache for prepared and

    old values ! Standard C* Table stores last committed write Shadow table stores prepared-but-not-committed and overwritten versions 1 52 335 1240 1402 2201 64 335 2201 Transparent to end-users
  181. RAMP IN CASSANDRA POSSIBLE IMPLEMENTATION: Lookaside cache for prepared and

    old values ! Standard C* Table stores last committed write Shadow table stores prepared-but-not-committed and overwritten versions 1 52 335 1240 1402 2201 64 335 2201 Overwritten versions have TTL set to max read transaction time, do not need durability Transparent to end-users
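
One way to picture the proposed base-table-plus-shadow-table layout is the toy Python model below: the base table keeps only the last committed write per item, while prepared and overwritten versions live in a lookaside/shadow structure keyed by (item, timestamp) with a TTL bounding how long readers might need them. Every name and the TTL default here are illustrative assumptions about the proposal, not Cassandra APIs.

```python
import time

class ShadowStore:
    """Base table: item -> (ts, value). Shadow table: (item, ts) -> (value, expiry)."""
    def __init__(self, ttl_seconds=10):        # TTL = max read transaction time
        self.base, self.shadow, self.ttl = {}, {}, ttl_seconds

    def prepare(self, item, ts, value):
        # Prepared-but-uncommitted versions live only in the shadow table.
        self.shadow[(item, ts)] = (value, time.time() + self.ttl)

    def commit(self, item, ts):
        prepared = self.shadow.get((item, ts))
        if prepared is None:
            return
        current = self.base.get(item)
        if current is None or current[0] < ts:
            if current is not None:
                # The overwritten version stays readable until its TTL expires.
                self.shadow[(item, current[0])] = (current[1], time.time() + self.ttl)
            self.base[item] = (ts, prepared[0])

    def read(self, item, ts=None):
        if ts is None or (item in self.base and self.base[item][0] == ts):
            return self.base.get(item)          # first-round read
        entry = self.shadow.get((item, ts))     # second-round read by exact stamp
        return (ts, entry[0]) if entry and entry[1] > time.time() else None
```
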
  182. RAMP IN CASSANDRA POSSIBLE IMPLEMENTATION: OPERATION CONSISTENCY LEVEL Write Prepare

    CL.QUORUM Write Commit CL.ANY or higher First-round Read CL.ANY/CL.ONE Second-round Read CL.QUORUM
  183. RAMP IN CASSANDRA To avoid stalling, second-round reads must be

    able to access prepared writes POSSIBLE IMPLEMENTATION: OPERATION CONSISTENCY LEVEL Write Prepare CL.QUORUM Write Commit CL.ANY or higher First-round Read CL.ANY/CL.ONE Second-round Read CL.QUORUM
  184. RAMP IN CASSANDRA To avoid stalling, second-round reads must be

    able to access prepared writes POSSIBLE IMPLEMENTATION: OPERATION CONSISTENCY LEVEL Write Prepare CL.QUORUM Write Commit CL.ANY or higher First-round Read CL.ANY/CL.ONE Second-round Read CL.QUORUM
  185. RAMP IN CASSANDRA To avoid stalling, second-round reads must be

    able to access prepared writes POSSIBLE IMPLEMENTATION: OPERATION CONSISTENCY LEVEL Write Prepare CL.QUORUM Write Commit CL.ANY or higher First-round Read CL.ANY/CL.ONE Second-round Read CL.QUORUM
  186. RAMP IN CASSANDRA POSSIBLE IMPLEMENTATION: DC1 DC2 Run algorithms on

    a per-DC basis, with use of CL.LOCAL_QUORUM instead of full CL.QUORUM
  187. RAMP TRANSACTIONS: • Provide atomic visibility, as required for maintaining

    FKs, scalable indexing, mat views • Avoid in-place updates, mutual exclusion, any synchronous/blocking coordination
  188. RAMP TRANSACTIONS: • Provide atomic visibility, as required for maintaining

    FKs, scalable indexing, mat views • Avoid in-place updates, mutual exclusion, any synchronous/blocking coordination • Use metadata with limited multi versioning, reads repair partial writes
  189. RAMP TRANSACTIONS: • Provide atomic visibility, as required for maintaining

    FKs, scalable indexing, mat views • Avoid in-place updates, mutual exclusion, any synchronous/blocking coordination • Use metadata with limited multi versioning, reads repair partial writes • 1-2RTT overhead, pay only during contention
  190. RAMP TRANSACTIONS: • Provide atomic visibility, as required for maintaining

    FKs, scalable indexing, mat views • Avoid in-place updates, mutual exclusion, any synchronous/blocking coordination • Use metadata with limited multi versioning, reads repair partial writes • 1-2RTT overhead, pay only during contention Thanks! http://tiny.cc/ramp-code @pbailis http://tiny.cc/ramp-intro
  191. Punk designed by my name is mud from the Noun

    Project Creative Commons – Attribution (CC BY 3.0) Queen designed by Bohdan Burmich from the Noun Project Creative Commons – Attribution (CC BY 3.0) Guy Fawkes designed by Anisha Varghese from the Noun Project Creative Commons – Attribution (CC BY 3.0) Emperor designed by Simon Child from the Noun Project Creative Commons – Attribution (CC BY 3.0) Baby designed by Les vieux garçons from the Noun Project Creative Commons – Attribution (CC BY 3.0) Baby designed by Les vieux garçons from the Noun Project Creative Commons – Attribution (CC BY 3.0) Gandhi designed by Luis Martins from the Noun Project Creative Commons – Attribution (CC BY 3.0) Database designed by Anton Outkine from the Noun Project Creative Commons – Attribution (CC BY 3.0) Girl designed by Rodrigo Vidinich from the Noun Project Creative Commons – Attribution (CC BY 3.0) Child designed by Gemma Garner from the Noun Project Creative Commons – Attribution (CC BY 3.0) Customer Service designed by Veysel Kara from the Noun Project Creative Commons – Attribution (CC BY 3.0) Punk Rocker designed by Simon Child from the Noun Project Creative Commons – Attribution (CC BY 3.0) Pyramid designed by misirlou from the Noun Project Creative Commons – Attribution (CC BY 3.0) Person designed by Stefania Bonacasa from the Noun Project Creative Commons – Attribution (CC BY 3.0) Record designed by Diogo Trindade from the Noun Project Creative Commons – Attribution (CC BY 3.0) Window designed by Juan Pablo Bravo from the Noun Project Creative Commons – Attribution (CC BY 3.0) Balloon designed by Julien Deveaux from the Noun Project Creative Commons – Attribution (CC BY 3.0) Balloon designed by Julien Deveaux from the Noun Project Creative Commons – Attribution (CC BY 3.0) Balloon designed by Julien Deveaux from the Noun Project Creative Commons – Attribution (CC BY 3.0) Crying designed by Megan Sheehan from the Noun Project Creative Commons – Attribution (CC BY 3.0) Sad designed by Megan Sheehan from the Noun Project Creative Commons – Attribution (CC BY 3.0) Happy designed by Megan Sheehan from the Noun Project Creative Commons – Attribution (CC BY 3.0) Happy designed by Megan Sheehan from the Noun Project Creative Commons – Attribution (CC BY 3.0) User designed by JM Waideaswaran from the Noun Project Creative Commons – Attribution (CC BY 3.0) ! COCOGOOSE font by ZetaFonts COMMON CREATIVE NON COMMERCIAL USE IMAGE/FONT CREDITs