Slide 1

Slide 1 text

SCALABLE ATOMIC VISIBILITY WITH RAMP TRANSACTIONS. Peter Bailis, Alan Fekete, Ali Ghodsi, Joseph M. Hellerstein, Ion Stoica. UC Berkeley and University of Sydney. Overview deck with Cassandra discussion. @pbailis

Slide 2

Slide 2 text

NOSQL

Slide 3

Slide 3 text

NO SQL NOSQL

Slide 4

Slide 4 text

NO SQL DIDN’T WANT SQL NOSQL

Slide 5

Slide 5 text

NO SQL DIDN’T WANT SERIALIZABILITY NOSQL

Slide 6

Slide 6 text

POOR PERFORMANCE NO SQL DIDN’T WANT SERIALIZABILITY NOSQL

Slide 8

Slide 8 text

POOR PERFORMANCE DELAY NO SQL DIDN’T WANT SERIALIZABILITY NOSQL

Slide 9

Slide 9 text

POOR PERFORMANCE DELAY PEAK THROUGHPUT: 1/DELAY FOR CONTENDED OPERATIONS NO SQL DIDN’T WANT SERIALIZABILITY NOSQL

Slide 10

Slide 10 text

POOR PERFORMANCE DELAY PEAK THROUGHPUT: 1/DELAY FOR CONTENDED OPERATIONS at .5MS, 2K TXN/s at 50MS, 20 TXN/s NO SQL DIDN’T WANT SERIALIZABILITY NOSQL
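
The slide's arithmetic (peak contended throughput is 1/delay) can be checked with a short snippet; the function name is ours, not from the deck:

```python
# Peak throughput on a contended item is bounded by 1/delay:
# while one transaction holds the item, no other can make progress on it.
def peak_txns_per_sec(delay_seconds: float) -> float:
    """Upper bound on contended-operation throughput for a given delay."""
    return 1.0 / delay_seconds

print(peak_txns_per_sec(0.0005))  # 0.5 ms delay -> 2,000 txn/s
print(peak_txns_per_sec(0.05))    # 50 ms delay  ->    20 txn/s
```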

Slide 11

Slide 11 text

NO SQL DIDN’T WANT SERIALIZABILITY NOSQL POOR PERFORMANCE HIGH LATENCY

Slide 12

Slide 12 text

NO SQL DIDN’T WANT SERIALIZABILITY NOSQL POOR PERFORMANCE LIMITED AVAILABILITY HIGH LATENCY

Slide 13

Slide 13 text

STILL DON’T WANT SERIALIZABILITY “NOT ONLY SQL”

Slide 14

Slide 14 text

STILL DON’T WANT SERIALIZABILITY “NOT ONLY SQL” (DON’T WANT THE COSTS)

Slide 15

Slide 15 text

STILL DON’T WANT SERIALIZABILITY “NOT ONLY SQL” BUT WANT MORE FEATURES (DON’T WANT THE COSTS)

Slide 16

Slide 16 text

STILL DON’T WANT SERIALIZABILITY “NOT ONLY SQL” BUT WANT MORE FEATURES This paper! (DON’T WANT THE COSTS)

Slide 17

Slide 17 text

“TAO: Facebook’s Distributed Data Store for the Social Graph” USENIX ATC 2013

Slide 21

Slide 21 text

“TAO: Facebook’s Distributed Data Store for the Social Graph” USENIX ATC 2013 FRIENDS FRIENDS

Slide 22

Slide 22 text

“TAO: Facebook’s Distributed Data Store for the Social Graph” USENIX ATC 2013 FRIENDS FRIENDS

Slide 23

Slide 23 text

“TAO: Facebook’s Distributed Data Store for the Social Graph” USENIX ATC 2013

Slide 24

Slide 24 text

“TAO: Facebook’s Distributed Data Store for the Social Graph” USENIX ATC 2013 Denormalized Friend List

Slide 25

Slide 25 text

“TAO: Facebook’s Distributed Data Store for the Social Graph” USENIX ATC 2013 Denormalized Friend List Fast reads…

Slide 26

Slide 26 text

“TAO: Facebook’s Distributed Data Store for the Social Graph” USENIX ATC 2013 Denormalized Friend List Fast reads… …multi-entity updates

Slide 27

Slide 27 text

“TAO: Facebook’s Distributed Data Store for the Social Graph” USENIX ATC 2013 Denormalized Friend List Fast reads… …multi-entity updates

Slide 30

Slide 30 text

“TAO: Facebook’s Distributed Data Store for the Social Graph” USENIX ATC 2013 Denormalized Friend List Fast reads… …multi-entity updates Not cleanly partitionable

Slide 31

Slide 31 text

FOREIGN KEY DEPENDENCIES “TAO: Facebook’s Distributed Data Store for the Social Graph” USENIX ATC 2013

Slide 32

Slide 32 text

FOREIGN KEY DEPENDENCIES “TAO: Facebook’s Distributed Data Store for the Social Graph” USENIX ATC 2013 “On Brewing Fresh Espresso: LinkedIn’s Distributed Data Serving Platform” SIGMOD 2013

Slide 33

Slide 33 text

FOREIGN KEY DEPENDENCIES “TAO: Facebook’s Distributed Data Store for the Social Graph” USENIX ATC 2013 “On Brewing Fresh Espresso: LinkedIn’s Distributed Data Serving Platform” SIGMOD 2013 “PNUTS: Yahoo!’s Hosted Data Serving Platform” VLDB 2008

Slide 35

Slide 35 text

ID: 532 AGE: 42 ID: 123 AGE: 22 ID: 2345 AGE: 1 ID: 412 AGE: 72 ID: 892 AGE: 13

Slide 40

Slide 40 text

ID: 532 AGE: 42 ID: 123 AGE: 22 ID: 2345 AGE: 1 ID: 412 AGE: 72 ID: 892 AGE: 13 Partition by primary key (ID)

Slide 41

Slide 41 text

ID: 532 AGE: 42 ID: 123 AGE: 22 ID: 2345 AGE: 1 ID: 412 AGE: 72 ID: 892 AGE: 13 Partition by primary key (ID) How should we look up by age?

Slide 42

Slide 42 text

SECONDARY INDEXING Partition by primary key (ID) How should we look up by age?

Slide 43

Slide 43 text

SECONDARY INDEXING Partition by primary key (ID) How should we look up by age? Option I: Local Secondary Indexing

Slide 44

Slide 44 text

SECONDARY INDEXING Partition by primary key (ID) How should we look up by age? Option I: Local Secondary Indexing Build indexes co-located with primary data

Slide 45

Slide 45 text

SECONDARY INDEXING Partition by primary key (ID) How should we look up by age? Option I: Local Secondary Indexing Build indexes co-located with primary data WRITE ONE SERVER, READ ALL

Slide 46

Slide 46 text

SECONDARY INDEXING Partition by primary key (ID) How should we look up by age? Option I: Local Secondary Indexing Build indexes co-located with primary data poor scalability WRITE ONE SERVER, READ ALL

Slide 47

Slide 47 text

SECONDARY INDEXING Partition by primary key (ID) How should we look up by age? Option I: Local Secondary Indexing Build indexes co-located with primary data Option II: Global Secondary Indexing Partition indexes by secondary key Partition by secondary attribute poor scalability WRITE ONE SERVER, READ ALL

Slide 48

Slide 48 text

SECONDARY INDEXING Partition by primary key (ID) How should we look up by age? Option I: Local Secondary Indexing Build indexes co-located with primary data Option II: Global Secondary Indexing Partition indexes by secondary key Partition by secondary attribute WRITE 2+ SERVERS, READ ONE poor scalability WRITE ONE SERVER, READ ALL

Slide 49

Slide 49 text

SECONDARY INDEXING Partition by primary key (ID) How should we look up by age? Option I: Local Secondary Indexing Build indexes co-located with primary data Option II: Global Secondary Indexing Partition indexes by secondary key Partition by secondary attribute WRITE 2+ SERVERS, READ ONE scalable lookups poor scalability WRITE ONE SERVER, READ ALL

Slide 51

Slide 51 text

SECONDARY INDEXING Partition by primary key (ID) How should we look up by age? Option I: Local Secondary Indexing Build indexes co-located with primary data Option II: Global Secondary Indexing Partition indexes by secondary key Partition by secondary attribute scalable lookups poor scalability WRITE 2+ SERVERS, READ ONE WRITE ONE SERVER, READ ALL
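
The two indexing options on this slide can be sketched in a few lines of Python; the partition count, hash placement, and helper names are illustrative, not from the deck:

```python
# Sketch (hypothetical data layout): users partitioned by primary key (ID).
# Local 2i: each partition indexes only its own rows, so a lookup by age
# must fan out to every partition. Global 2i: the index is itself
# partitioned by age, so a lookup hits one partition but a write
# must touch 2+ partitions (primary data plus index).
NUM_PARTITIONS = 3
users = {532: 42, 123: 22, 2345: 1, 412: 72, 892: 13}  # id -> age

# Primary data partitioned by ID.
partitions = [dict() for _ in range(NUM_PARTITIONS)]
for uid, age in users.items():
    partitions[uid % NUM_PARTITIONS][uid] = age

def local_lookup_by_age(age):
    # "WRITE ONE SERVER, READ ALL": scatter-gather across every partition.
    return [uid for p in partitions for uid, a in p.items() if a == age]

# Global index partitioned by the secondary attribute (age).
global_index = [dict() for _ in range(NUM_PARTITIONS)]
for uid, age in users.items():
    global_index[age % NUM_PARTITIONS].setdefault(age, []).append(uid)

def global_lookup_by_age(age):
    # "WRITE 2+ SERVERS, READ ONE": a single index partition answers.
    return global_index[age % NUM_PARTITIONS].get(age, [])
```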

Slide 52

Slide 52 text

SECONDARY INDEXING Partition by primary key (ID) How should we look up by age? Option I: Local Secondary Indexing Build indexes co-located with primary data Option II: Global Secondary Indexing Partition indexes by secondary key Partition by secondary attribute scalable lookups poor scalability WRITE 2+ SERVERS, READ ONE WRITE ONE SERVER, READ ALL OVERVIEW

Slide 53

Slide 53 text

SECONDARY INDEXING Partition by primary key (ID) How should we look up by age? Option I: Local Secondary Indexing Build indexes co-located with primary data Option II: Global Secondary Indexing Partition indexes by secondary key Partition by secondary attribute scalable lookups poor scalability WRITE 2+ SERVERS, READ ONE WRITE ONE SERVER, READ ALL OVERVIEW INCONSISTENT GLOBAL 2i

Slide 54

Slide 54 text

SECONDARY INDEXING Partition by primary key (ID) How should we look up by age? Option I: Local Secondary Indexing Build indexes co-located with primary data Option II: Global Secondary Indexing Partition indexes by secondary key Partition by secondary attribute scalable lookups poor scalability WRITE 2+ SERVERS, READ ONE WRITE ONE SERVER, READ ALL OVERVIEW INCONSISTENT GLOBAL 2i INCONSISTENT GLOBAL 2i

Slide 55

Slide 55 text

SECONDARY INDEXING Partition by primary key (ID) How should we look up by age? Option I: Local Secondary Indexing Build indexes co-located with primary data Option II: Global Secondary Indexing Partition indexes by secondary key Partition by secondary attribute scalable lookups poor scalability WRITE 2+ SERVERS, READ ONE WRITE ONE SERVER, READ ALL OVERVIEW INCONSISTENT GLOBAL 2i INCONSISTENT GLOBAL 2i (PROPOSED) INCONSISTENT GLOBAL 2i

Slide 56

Slide 56 text

SECONDARY INDEXING Partition by primary key (ID) How should we look up by age? Option I: Local Secondary Indexing Build indexes co-located with primary data Option II: Global Secondary Indexing Partition indexes by secondary key Partition by secondary attribute scalable lookups poor scalability WRITE 2+ SERVERS, READ ONE WRITE ONE SERVER, READ ALL OVERVIEW INCONSISTENT GLOBAL 2i INCONSISTENT GLOBAL 2i (PROPOSED) INCONSISTENT GLOBAL 2i INCONSISTENT GLOBAL 2i

Slide 57

Slide 57 text

SECONDARY INDEXING Partition by primary key (ID) How should we look up by age? Option I: Local Secondary Indexing Build indexes co-located with primary data Option II: Global Secondary Indexing Partition indexes by secondary key Partition by secondary attribute scalable lookups poor scalability WRITE 2+ SERVERS, READ ONE WRITE ONE SERVER, READ ALL OVERVIEW INCONSISTENT GLOBAL 2i INCONSISTENT GLOBAL 2i (PROPOSED) INCONSISTENT GLOBAL 2i INCONSISTENT GLOBAL 2i INCONSISTENT GLOBAL 2i

Slide 58

Slide 58 text

TABLE: ALL USERS

Slide 59

Slide 59 text

TABLE: ALL USERS

Slide 60

Slide 60 text

TABLE: ALL USERS TABLE: USERS OVER 25

Slide 64

Slide 64 text

MATERIALIZED VIEWS TABLE: ALL USERS TABLE: USERS OVER 25

Slide 65

Slide 65 text

MATERIALIZED VIEWS TABLE: ALL USERS TABLE: USERS OVER 25 RELEVANT RECENT EXAMPLES IN GOOGLE PERCOLATOR TWITTER RAINBIRD LINKEDIN ESPRESSO PAPERS
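
A materialized view like "users over 25" is maintained by writing two places on every update, which is exactly the multi-entity update the talk is concerned with. A minimal sketch, with a hypothetical schema:

```python
# Base table "all_users" plus a derived table "users_over_25" that must be
# updated on every write -- two writes that should be atomically visible.
all_users = {}       # id -> age
users_over_25 = {}   # id -> age, the materialized view

def upsert_user(uid, age):
    all_users[uid] = age                 # write 1: base table
    if age > 25:
        users_over_25[uid] = age         # write 2: view gains the row
    else:
        users_over_25.pop(uid, None)     # keep the view consistent

upsert_user(532, 42)
upsert_user(892, 13)
```

A reader that sees write 1 but not write 2 observes a stale view, which is the fractured-read anomaly the rest of the deck addresses.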

Slide 66

Slide 66 text

FOREIGN KEY DEPENDENCIES SECONDARY INDEXES MATERIALIZED VIEWS HOW SHOULD WE CORRECTLY MAINTAIN

Slide 67

Slide 67 text

SERIALIZABILITY

Slide 68

Slide 68 text

SNAPSHOT ISOLATION REPEATABLE READ (PL-2.99) CURSOR STABILITY READ UNCOMMITTED READ COMMITTED CAUSAL PRAM RYW LINEARIZABILITY EVENTUAL CONSISTENCY SERIALIZABILITY

Slide 70

Slide 70 text

REPEATABLE READ (PL-2.99) SERIALIZABILITY SNAPSHOT ISOLATION CURSOR STABILITY READ UNCOMMITTED READ COMMITTED LINEARIZABILITY CAUSAL PRAM RYW EVENTUAL CONSISTENCY

Slide 71

Slide 71 text

REPEATABLE READ (PL-2.99) SERIALIZABILITY SNAPSHOT ISOLATION CURSOR STABILITY READ UNCOMMITTED READ COMMITTED LINEARIZABILITY MANY SUFFICIENT CAUSAL PRAM RYW EVENTUAL CONSISTENCY

Slide 72

Slide 72 text

REPEATABLE READ (PL-2.99) SERIALIZABILITY SNAPSHOT ISOLATION CURSOR STABILITY READ UNCOMMITTED READ COMMITTED LINEARIZABILITY REQUIRE SYNCHRONOUS COORDINATION MANY SUFFICIENT CAUSAL PRAM RYW EVENTUAL CONSISTENCY

Slide 73

Slide 73 text

SERIALIZABILITY SNAPSHOT ISOLATION REPEATABLE READ (PL-2.99) CURSOR STABILITY READ UNCOMMITTED READ COMMITTED CAUSAL PRAM RYW LINEARIZABILITY EVENTUAL CONSISTENCY REQUIRE SYNCHRONOUS COORDINATION MANY SUFFICIENT COORDINATION-FREE

Slide 74

Slide 74 text

SERIALIZABILITY SNAPSHOT ISOLATION REPEATABLE READ (PL-2.99) CURSOR STABILITY READ UNCOMMITTED READ COMMITTED CAUSAL PRAM RYW LINEARIZABILITY EVENTUAL CONSISTENCY REQUIRE SYNCHRONOUS COORDINATION INSUFFICIENT MANY SUFFICIENT COORDINATION-FREE

Slide 75

Slide 75 text

SERIALIZABILITY SNAPSHOT ISOLATION REPEATABLE READ (PL-2.99) CURSOR STABILITY LINEARIZABILITY REQUIRE SYNCHRONOUS COORDINATION INSUFFICIENT MANY SUFFICIENT COORDINATION-FREE Facebook TAO LinkedIn Espresso Yahoo! PNUTS Google Megastore Google App Engine Twitter Rainbird Amazon DynamoDB CONSCIOUS CHOICES!

Slide 76

Slide 76 text

SERIALIZABILITY SNAPSHOT ISOLATION REPEATABLE READ (PL-2.99) CURSOR STABILITY READ UNCOMMITTED READ COMMITTED CAUSAL PRAM RYW LINEARIZABILITY EVENTUAL CONSISTENCY COORDINATION-FREE INSUFFICIENT REQUIRE SYNCHRONOUS COORDINATION SUFFICIENT

Slide 77

Slide 77 text

SERIALIZABILITY SNAPSHOT ISOLATION REPEATABLE READ (PL-2.99) CURSOR STABILITY READ UNCOMMITTED READ COMMITTED CAUSAL PRAM RYW LINEARIZABILITY EVENTUAL CONSISTENCY COORDINATION-FREE RAMP (THIS PAPER) INSUFFICIENT REQUIRE SYNCHRONOUS COORDINATION SUFFICIENT

Slide 78

Slide 78 text

RAMP TRANSACTIONS: Read Atomic Multi-Partition

Slide 80

Slide 80 text

TRANSACTIONS RAMP EFFICIENTLY MAINTAIN

Slide 81

Slide 81 text

TRANSACTIONS RAMP FOREIGN KEY DEPENDENCIES EFFICIENTLY MAINTAIN

Slide 82

Slide 82 text

TRANSACTIONS RAMP FOREIGN KEY DEPENDENCIES SECONDARY INDEXES EFFICIENTLY MAINTAIN

Slide 83

Slide 83 text

TRANSACTIONS RAMP FOREIGN KEY DEPENDENCIES SECONDARY INDEXES MATERIALIZED VIEWS EFFICIENTLY MAINTAIN

Slide 84

Slide 84 text

TRANSACTIONS RAMP FOREIGN KEY DEPENDENCIES SECONDARY INDEXES MATERIALIZED VIEWS BY PROVIDING ATOMIC VISIBILITY EFFICIENTLY MAINTAIN

Slide 85

Slide 85 text

ATOMIC VISIBILITY

Slide 86

Slide 86 text

Informally: Either all of each transaction’s updates are visible, or none are ATOMIC VISIBILITY

Slide 88

Slide 88 text

Informally: Either all of each transaction’s updates are visible, or none are ATOMIC VISIBILITY WRITE X = 1 WRITE Y = 1

Slide 89

Slide 89 text

Informally: Either all of each transaction’s updates are visible, or none are ATOMIC VISIBILITY WRITE X = 1 WRITE Y = 1 READ X = 1 READ Y = 1

Slide 90

Slide 90 text

Informally: Either all of each transaction’s updates are visible, or none are ATOMIC VISIBILITY WRITE X = 1 WRITE Y = 1 READ X = 1 READ Y = 1 OR

Slide 91

Slide 91 text

Informally: Either all of each transaction’s updates are visible, or none are ATOMIC VISIBILITY WRITE X = 1 WRITE Y = 1 READ X = 1 READ Y = 1 READ X = ∅ READ Y = ∅ OR

Slide 92

Slide 92 text

Informally: Either all of each transaction’s updates are visible, or none are ATOMIC VISIBILITY READ X = 1 READ Y = 1 READ X = ∅ READ Y = ∅ OR

Slide 93

Slide 93 text

Informally: Either all of each transaction’s updates are visible, or none are ATOMIC VISIBILITY READ X = 1 READ Y = 1 READ X = ∅ READ Y = ∅ OR BUT NOT READ Y = ∅ READ X = 1

Slide 94

Slide 94 text

Informally: Either all of each transaction’s updates are visible, or none are ATOMIC VISIBILITY READ X = 1 READ Y = 1 READ X = ∅ READ Y = ∅ OR BUT NOT READ X = ∅ READ Y = ∅ READ X = 1 OR READ Y = 1
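
The informal definition can be stated as a small predicate over one transaction's writes and another's reads (helper name ours):

```python
# A read set is atomically visible with respect to a transaction's writes
# iff it observed all of those writes or none of them.
def atomically_visible(txn_writes: dict, reads: dict) -> bool:
    saw = [reads.get(k) == v for k, v in txn_writes.items()]
    return all(saw) or not any(saw)

writes = {"x": 1, "y": 1}
print(atomically_visible(writes, {"x": 1, "y": 1}))        # all visible: OK
print(atomically_visible(writes, {"x": None, "y": None}))  # none visible: OK
print(atomically_visible(writes, {"x": 1, "y": None}))     # fractured: not OK
```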

Slide 95

Slide 95 text

ATOMIC VISIBILITY We also provide per-item PRAM guarantees with per-transaction regular semantics (see paper Appendix). Formally: A transaction Tj exhibits fractured reads if transaction Ti writes versions xm and yn (in any order, with x possibly but not necessarily equal to y), Tj reads version xm and version yk, and k < n.

Slide 96

Slide 96 text

TRANSACTIONS RAMP GUARANTEE ATOMIC VISIBILITY

Slide 97

Slide 97 text

TRANSACTIONS RAMP GUARANTEE ATOMIC VISIBILITY WHILE ENSURING

Slide 98

Slide 98 text

TRANSACTIONS RAMP GUARANTEE ATOMIC VISIBILITY WHILE ENSURING PARTITION INDEPENDENCE

Slide 99

Slide 99 text

TRANSACTIONS RAMP GUARANTEE ATOMIC VISIBILITY WHILE ENSURING PARTITION INDEPENDENCE clients only access servers responsible for data in transactions

Slide 100

Slide 100 text

TRANSACTIONS RAMP GUARANTEE ATOMIC VISIBILITY WHILE ENSURING PARTITION INDEPENDENCE clients only access servers responsible for data in transactions W(X=1) W(Y=1) X Y Z

Slide 102

Slide 102 text

TRANSACTIONS RAMP GUARANTEE ATOMIC VISIBILITY WHILE ENSURING PARTITION INDEPENDENCE AND SYNCHRONIZATION INDEPENDENCE clients only access servers responsible for data in transactions transactions always commit* and no client can cause another client to block

Slide 103

Slide 103 text

TRANSACTIONS RAMP GUARANTEE ATOMIC VISIBILITY ARE NOT SERIALIZABLE DO NOT PREVENT LOST UPDATE DO NOT PREVENT WRITE SKEW ALLOW CONCURRENT UPDATES

Slide 104

Slide 104 text

TRANSACTIONS RAMP GUARANTEE ATOMIC VISIBILITY ARE NOT SERIALIZABLE DO NOT PREVENT LOST UPDATE DO NOT PREVENT WRITE SKEW ALLOW CONCURRENT UPDATES ARE GUIDED BY REAL WORLD USE CASES FOREIGN KEY DEPENDENCIES SECONDARY INDEXING MATERIALIZED VIEWS

Slide 105

Slide 105 text

TRANSACTIONS RAMP GUARANTEE ATOMIC VISIBILITY ARE NOT SERIALIZABLE DO NOT PREVENT LOST UPDATE DO NOT PREVENT WRITE SKEW ALLOW CONCURRENT UPDATES ARE GUIDED BY REAL WORLD USE CASES FOREIGN KEY DEPENDENCIES SECONDARY INDEXING MATERIALIZED VIEWS Facebook TAO LinkedIn Espresso Yahoo! PNUTS Google Megastore Google App Engine Twitter Rainbird Amazon DynamoDB

Slide 106

Slide 106 text

STRAWMAN: LOCKING X=0 Y=0

Slide 107

Slide 107 text

STRAWMAN: LOCKING X=0 Y=0 W(X=1) W(Y=1)

Slide 110

Slide 110 text

STRAWMAN: LOCKING X=1 Y=1 W(X=1) W(Y=1)

Slide 114

Slide 114 text

STRAWMAN: LOCKING X=1 Y=1 W(X=1) W(Y=1) R(X=1)

Slide 115

Slide 115 text

STRAWMAN: LOCKING X=1 Y=1 W(X=1) W(Y=1) R(X=1) R(Y=1)

Slide 116

Slide 116 text

Y=0 STRAWMAN: LOCKING X=1 W(X=1) W(Y=1)

Slide 118

Slide 118 text

Y=0 STRAWMAN: LOCKING X=1 W(X=1) W(Y=1) R(X=?)

Slide 119

Slide 119 text

Y=0 STRAWMAN: LOCKING X=1 W(X=1) W(Y=1) R(X=?) R(Y=?)

Slide 120

Slide 120 text

Y=0 STRAWMAN: LOCKING X=1 W(X=1) W(Y=1) R(X=?) R(Y=?) ATOMIC VISIBILITY COUPLED WITH MUTUAL EXCLUSION

Slide 121

Slide 121 text

STRAWMAN: LOCKING X=1 W(X=1) W(Y=1) Y=0 R(X=?) R(Y=?) ATOMIC VISIBILITY COUPLED WITH MUTUAL EXCLUSION

Slide 123

Slide 123 text

STRAWMAN: LOCKING X=1 W(X=1) W(Y=1) Y=0 R(X=?) R(Y=?) ATOMIC VISIBILITY COUPLED WITH MUTUAL EXCLUSION RTT

Slide 124

Slide 124 text

STRAWMAN: LOCKING X=1 W(X=1) W(Y=1) Y=0 R(X=?) R(Y=?) ATOMIC VISIBILITY COUPLED WITH MUTUAL EXCLUSION RTT unavailability!

Slide 125

Slide 125 text

X=1 W(X=1) W(Y=1) Y=0 R(X=?) R(Y=?) ATOMIC VISIBILITY COUPLED WITH MUTUAL EXCLUSION at .5 MS < 2K TPS! unavailable during failures SIMILAR ISSUES IN MVCC, PRE-SCHEDULING SERIALIZABLE OCC, (global timestamp assignment/application) (multi-partition validation, liveness) (scheduling, multi-partition execution) STRAWMAN: LOCKING

Slide 126

Slide 126 text

X=1 W(X=1) W(Y=1) Y=0 R(X=?) R(Y=?) ATOMIC VISIBILITY COUPLED WITH MUTUAL EXCLUSION at .5 MS < 2K TPS! unavailable during failures SIMILAR ISSUES IN MVCC, PRE-SCHEDULING SERIALIZABLE OCC, (global timestamp assignment/application) (multi-partition validation, liveness) (scheduling, multi-partition execution) FUNDAMENTAL TO “STRONG” SEMANTICS STRAWMAN: LOCKING

Slide 127

Slide 127 text

BASIC IDEA W(X=1) W(Y=1) Y=0 R(X=?) R(Y=?) X=1

Slide 129

Slide 129 text

BASIC IDEA W(X=1) W(Y=1) Y=0 R(X=?) R(Y=?) LET CLIENTS RACE, but HAVE READERS “CLEAN UP” X=1

Slide 130

Slide 130 text

BASIC IDEA W(X=1) W(Y=1) Y=0 R(X=?) R(Y=?) LET CLIENTS RACE, but HAVE READERS “CLEAN UP” X=1 METADATA

Slide 131

Slide 131 text

BASIC IDEA W(X=1) W(Y=1) Y=0 R(X=?) R(Y=?) LET CLIENTS RACE, but HAVE READERS “CLEAN UP” X=1 + LIMITED MULTI-VERSIONING METADATA

Slide 132

Slide 132 text

BASIC IDEA W(X=1) W(Y=1) Y=0 R(X=?) R(Y=?) LET CLIENTS RACE, but HAVE READERS “CLEAN UP” X=1 + LIMITED MULTI-VERSIONING METADATA FOR NOW:
 READ-ONLY, WRITE-ONLY TXNS

Slide 133

Slide 133 text

last committed stamp for x: 0 RAMP-Fast last committed stamp for y: 0

Slide 136

Slide 136 text

last committed stamp for x: 0 RAMP-Fast known versions of x last committed stamp for y: 0 known versions of y

Slide 137

Slide 137 text

last committed stamp for x: 0 RAMP-Fast known versions of x last committed stamp for y: 0 known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {}

Slide 143

Slide 143 text

last committed stamp for x: 0 RAMP-Fast W(X=1) W(Y=1) known versions of x last committed stamp for y: 0 known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {}

Slide 144

Slide 144 text

last committed stamp for x: 0 RAMP-Fast W(X=1) W(Y=1) known versions of x last committed stamp for y: 0 known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 1.) Assign unique (logical) transaction timestamp. timestamp: 242 e.g., time concat client ID concat sequence number
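
One plausible encoding of the slide's "time concat client ID concat sequence number" timestamp; the field widths and client ID are assumptions, not from the paper:

```python
# Build a unique, roughly time-ordered logical transaction timestamp by
# packing wall-clock milliseconds, a client ID, and a local sequence number.
import itertools
import time

CLIENT_ID = 7           # assumed unique per client in this sketch
_seq = itertools.count()

def next_txn_timestamp() -> int:
    ms = int(time.time() * 1000)
    # pack: [ time (ms) | 12-bit client id | 12-bit sequence ]
    return (ms << 24) | (CLIENT_ID << 12) | (next(_seq) & 0xFFF)

t1, t2 = next_txn_timestamp(), next_txn_timestamp()
```

Two timestamps from the same client differ in the sequence field even within the same millisecond, and timestamps from different clients differ in the client ID field.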

Slide 145

Slide 145 text

last committed stamp for x: 0 RAMP-Fast W(X=1) W(Y=1) known versions of x last committed stamp for y: 0 known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 1.) Assign unique (logical) transaction timestamp. timestamp: 242

Slide 146

Slide 146 text

last committed stamp for x: 0 RAMP-Fast W(X=1) W(Y=1) known versions of x last committed stamp for y: 0 known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 1.) Assign unique (logical) transaction timestamp. 2.) Add write to known versions on partition. timestamp: 242

Slide 147

Slide 147 text

last committed stamp for x: 0 RAMP-Fast W(X=1) W(Y=1) known versions of x last committed stamp for y: 0 known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 1.) Assign unique (logical) transaction timestamp. 2.) Add write to known versions on partition. 242 1 timestamp: 242

Slide 148

Slide 148 text

last committed stamp for x: 0 RAMP-Fast W(X=1) W(Y=1) known versions of x last committed stamp for y: 0 known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 1.) Assign unique (logical) transaction timestamp. 2.) Add write to known versions on partition. 242 1 242 1 timestamp: 242

Slide 149

Slide 149 text

last committed stamp for x: 0 RAMP-Fast W(X=1) W(Y=1) known versions of x last committed stamp for y: 0 known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 1.) Assign unique (logical) transaction timestamp. 2.) Add write to known versions on partition. 3.) Commit and update last committed stamp. 242 1 242 1 timestamp: 242

Slide 150

Slide 150 text

RAMP-Fast W(X=1) W(Y=1) known versions of x last committed stamp for y: 0 known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 1.) Assign unique (logical) transaction timestamp. 2.) Add write to known versions on partition. 3.) Commit and update last committed stamp. 242 1 242 1 last committed stamp for x: 242 timestamp: 242

Slide 151

Slide 151 text

RAMP-Fast W(X=1) W(Y=1) known versions of x known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 1.) Assign unique (logical) transaction timestamp. 2.) Add write to known versions on partition. 3.) Commit and update last committed stamp. 242 1 242 1 last committed stamp for x: 242 last committed stamp for y: 242 timestamp: 242

Slide 152

Slide 152 text

RAMP-Fast W(X=1) W(Y=1) known versions of x known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 1.) Assign unique (logical) transaction timestamp. 2.) Add write to known versions on partition. 3.) Commit and update last committed stamp. 242 1 242 1 timestamp: 242 R(X=?) R(Y=?) last committed stamp for x: 242 last committed stamp for y: 242

Slide 153

Slide 153 text

RAMP-Fast W(X=1) W(Y=1) known versions of x known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 1.) Assign unique (logical) transaction timestamp. 2.) Add write to known versions on partition. 3.) Commit and update last committed stamp. 242 1 242 1 timestamp: 242 R(X=1) R(Y=1) last committed stamp for x: 242 last committed stamp for y: 242

Slide 155

Slide 155 text

R(X=?) R(Y=?) RAMP-Fast W(X=1) W(Y=1) known versions of x known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 1.) Assign unique (logical) transaction timestamp. 2.) Add write to known versions on partition. 3.) Commit and update last committed stamp. 242 1 242 1 timestamp: 242 last committed stamp for x: 242 last committed stamp for y: 0

Slide 157

Slide 157 text

R(X=?) R(Y=?) RAMP-Fast W(X=1) W(Y=1) known versions of x known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 1.) Assign unique (logical) transaction timestamp. 2.) Add write to known versions on partition. 3.) Commit and update last committed stamp. 242 1 242 1 timestamp: 242 RACE!!! last committed stamp for x: 242 last committed stamp for y: 0

Slide 158

Slide 158 text

R(X=?) R(Y=?) RAMP-Fast W(X=1) W(Y=1) known versions of x known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 1.) Assign unique (logical) transaction timestamp. 2.) Add write to known versions on partition. 3.) Commit and update last committed stamp. 242 1 242 1 timestamp: 242 RACE!!! R(X=1) R(Y=0) last committed stamp for x: 242 last committed stamp for y: 0

Slide 160

Slide 160 text

RAMP-Fast W(X=1) W(Y=1) known versions of x known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 1.) Assign unique (logical) transaction timestamp. 2.) Add write to known versions on partition. 3.) Commit and update last committed stamp. 242 1 242 1 timestamp: 242 RACE!!! R(X=1) R(Y=0) last committed stamp for x: 242 last committed stamp for y: 0

Slide 161

Slide 161 text

RAMP-Fast W(X=1) W(Y=1) known versions of x known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 1.) Assign unique (logical) transaction timestamp. 2.) Add write to known versions on partition. 3.) Commit and update last committed stamp. 242 1 242 1 timestamp: 242 RACE!!! R(X=1) R(Y=0) last committed stamp for x: 242 last committed stamp for y: 0 RECORD THE ITEMS WRITTEN IN THE TRANSACTION

Slide 162

Slide 162 text

RAMP-Fast W(X=1) W(Y=1) known versions of x known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 1.) Assign unique (logical) transaction timestamp. 2.) Add write to known versions on partition. 3.) Commit and update last committed stamp. 242 1 242 1 timestamp: 242 RACE!!! R(X=1) R(Y=0) last committed stamp for x: 242 last committed stamp for y: 0 {y} RECORD THE ITEMS WRITTEN IN THE TRANSACTION

Slide 163

Slide 163 text

RAMP-Fast W(X=1) W(Y=1) known versions of x known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 1.) Assign unique (logical) transaction timestamp. 2.) Add write to known versions on partition. 3.) Commit and update last committed stamp. 242 1 242 1 timestamp: 242 RACE!!! R(X=1) R(Y=0) last committed stamp for x: 242 last committed stamp for y: 0 {y} {x} RECORD THE ITEMS WRITTEN IN THE TRANSACTION

Slide 164

Slide 164 text

RAMP-Fast known versions of x known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 242 1 242 1 last committed stamp for x: 242 last committed stamp for y: 0 {y} {x} R(X=?) R(Y=?)

Slide 167

Slide 167 text

RAMP-Fast known versions of x known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 242 1 242 1 last committed stamp for x: 242 last committed stamp for y: 0 {y} {x} R(X=?) R(Y=?) 1.) Read last committed:

Slide 168

Slide 168 text

RAMP-Fast known versions of x known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 242 1 242 1 last committed stamp for x: 242 last committed stamp for y: 0 {y} {x} R(X=?) R(Y=?) 1.) Read last committed: X=1 @ 242, {Y}

Slide 169

Slide 169 text

RAMP-Fast known versions of x known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 242 1 242 1 last committed stamp for x: 242 last committed stamp for y: 0 {y} {x} R(X=?) R(Y=?) 1.) Read last committed: X=1 @ 242, {Y} Y=NULL @ 0, {}

Slide 170

Slide 170 text

RAMP-Fast known versions of x known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 242 1 242 1 last committed stamp for x: 242 last committed stamp for y: 0 {y} {x} R(X=?) R(Y=?) 1.) Read last committed: 2.) Calculate missing versions: X=1 @ 242, {Y} Y=NULL @ 0, {}

Slide 171

Slide 171 text

RAMP-Fast known versions of x known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 242 1 242 1 last committed stamp for x: 242 last committed stamp for y: 0 {y} {x} R(X=?) R(Y=?) 1.) Read last committed: 2.) Calculate missing versions: X=1 @ 242, {Y} Y=NULL @ 0, {} ITEM HIGHEST TS X 242 Y 242

Slide 172

Slide 172 text

RAMP-Fast known versions of x known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 242 1 242 1 last committed stamp for x: 242 last committed stamp for y: 0 {y} {x} R(X=?) R(Y=?) 1.) Read last committed: 2.) Calculate missing versions: X=1 @ 242, {Y} Y=NULL @ 0, {} ITEM HIGHEST TS X 242 Y 242

Slide 173

Slide 173 text

RAMP-Fast known versions of x known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 242 1 242 1 last committed stamp for x: 242 last committed stamp for y: 0 {y} {x} R(X=?) R(Y=?) 1.) Read last committed: 2.) Calculate missing versions: 3.) Fetch missing versions. X=1 @ 242, {Y} Y=NULL @ 0, {} ITEM HIGHEST TS X 242 Y 242

Slide 174

Slide 174 text

RAMP-Fast known versions of x known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 242 1 242 1 last committed stamp for x: 242 last committed stamp for y: 0 {y} {x} R(X=?) R(Y=?) 1.) Read last committed: 2.) Calculate missing versions: 3.) Fetch missing versions. X=1 @ 242, {Y} Y=NULL @ 0, {} ITEM HIGHEST TS X 242 Y 242 Y=1 @ 242, {X} (Send required timestamp in request)

Slide 175

Slide 175 text

RAMP-Fast known versions of x known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 242 1 242 1 last committed stamp for x: 242 last committed stamp for y: 0 {y} {x} R(X=?) R(Y=?) 1.) Read last committed: 2.) Calculate missing versions: 3.) Fetch missing versions. X=1 @ 242, {Y} Y=NULL @ 0, {} ITEM HIGHEST TS X 242 Y 242 Y=1 @ 242, {X} (Send required timestamp in request) 2PC ENSURES NO WAIT AT SERVER

Slide 176

Slide 176 text

RAMP-Fast known versions of x known versions of y TIMESTAMP VALUE METADATA 0 NULL {} TIMESTAMP VALUE METADATA 0 NULL {} 242 1 242 1 last committed stamp for x: 242 last committed stamp for y: 0 {y} {x} R(X=?) R(Y=?) 1.) Read last committed: 2.) Calculate missing versions: 3.) Fetch missing versions. X=1 @ 242, {Y} Y=NULL @ 0, {} ITEM HIGHEST TS X 242 Y 242 Y=1 @ 242, {X} (Send required timestamp in request) 4.) Return resulting set. R(X=1) R(Y=1) 2PC ENSURES NO WAIT AT SERVER
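The four read steps above can be sketched in Python. This is an illustrative single-process model, not the paper's reference code: `Partition`, `prepare`, `commit`, and `get` are invented names, and each partition is just an in-memory dict of known versions plus a last-committed pointer.

```python
class Partition:
    """One server holding all known versions of its items (illustrative)."""
    def __init__(self):
        self.versions = {}        # item -> {timestamp: (value, write_set)}
        self.last_committed = {}  # item -> last committed timestamp

    def prepare(self, item, ts, value, write_set):
        # PREPARE: store the version; it is not yet visible to readers.
        self.versions.setdefault(item, {0: (None, set())})[ts] = (value, write_set)

    def commit(self, item, ts):
        # COMMIT: advance the last-committed pointer.
        self.last_committed[item] = max(self.last_committed.get(item, 0), ts)

    def get(self, item, required_ts=None):
        # Return the last committed version, or the exact version a
        # second-round reader requests by timestamp.
        known = self.versions.get(item, {0: (None, set())})
        ts = self.last_committed.get(item, 0) if required_ts is None else required_ts
        value, write_set = known[ts]
        return ts, value, write_set


def ramp_fast_read(partitions, items):
    # 1.) Round 1: read each item's last committed version.
    first = {i: partitions[i].get(i) for i in items}
    # 2.) Calculate missing versions: the highest timestamp at which any
    #     sibling write's metadata mentions each item.
    required = {i: ts for i, (ts, _, _) in first.items()}
    for ts, _, write_set in first.values():
        for sibling in write_set:
            if sibling in required and ts > required[sibling]:
                required[sibling] = ts
    # 3.) Round 2 (only when racing a writer): fetch the missing versions.
    result = {}
    for i, (ts, value, _) in first.items():
        if required[i] > ts:
            _, value, _ = partitions[i].get(i, required_ts=required[i])
        result[i] = value
    # 4.) Return the resulting set.
    return result
```

Replaying the race from the slide — x's commit has landed, y's has not — the reader still returns the atomically visible pair.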

Slide 177

Slide 177 text

RAMP-Fast

Slide 178

Slide 178 text

RAMP-Fast 2 RTT writes:

Slide 179

Slide 179 text

RAMP-Fast 2 RTT writes: 2PC, without blocking synchronization

Slide 180

Slide 180 text

RAMP-Fast 2 RTT writes: 2PC, without blocking synchronization ENSURES READERS NEVER WAIT!

Slide 181

Slide 181 text

RAMP-Fast 2 RTT writes: 2PC, without blocking synchronization metadata size linear in transaction size ENSURES READERS NEVER WAIT!

Slide 182

Slide 182 text

RAMP-Fast 2 RTT writes: 2PC, without blocking synchronization metadata size linear in transaction size 1 RTT reads: in race-free case ENSURES READERS NEVER WAIT!

Slide 183

Slide 183 text

RAMP-Fast 2 RTT writes: 2PC, without blocking synchronization metadata size linear in transaction size 1 RTT reads: in race-free case 2 RTT reads: otherwise ENSURES READERS NEVER WAIT!

Slide 184

Slide 184 text

RAMP-Fast 2 RTT writes: 2PC, without blocking synchronization metadata size linear in transaction size 1 RTT reads: in race-free case 2 RTT reads: otherwise no fast-path synchronization ENSURES READERS NEVER WAIT!

Slide 185

Slide 185 text

RAMP-Fast 2 RTT writes: 2PC, without blocking synchronization metadata size linear in transaction size 1 RTT reads: in race-free case 2 RTT reads: otherwise no fast-path synchronization ENSURES READERS NEVER WAIT! CAN WE USE LESS METADATA?
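The 2-RTT write side can be sketched as a plain two-round loop. This is an assumption-laden model (partitions are dicts; no networking, durability, or failure handling), meant only to show why readers never wait: no locks are held between PREPARE and COMMIT.

```python
def ramp_write(partitions, writes, ts):
    """Illustrative 2-RTT RAMP write: PREPARE, then COMMIT."""
    items = frozenset(writes)
    # RTT 1 -- PREPARE: each partition stores the new version, tagged
    # with the transaction's other items, but keeps it invisible.
    for item, value in writes.items():
        partitions[item]["prepared"][ts] = (value, items - {item})
    # RTT 2 -- COMMIT: move the version into the visible set and advance
    # the last-committed pointer. A reader that races this loop simply
    # repairs via the sibling metadata; it never blocks.
    for item in writes:
        p = partitions[item]
        p["versions"][ts] = p["prepared"].pop(ts)
        p["last_committed"] = max(p["last_committed"], ts)
```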

Slide 186

Slide 186 text

RAMP-Small

Slide 187

Slide 187 text

RAMP-Small 2 RTT writes:

Slide 188

Slide 188 text

RAMP-Small 2 RTT writes: same basic protocol as RAMP-Fast but drop all RAMP-Fast metadata

Slide 189

Slide 189 text

RAMP-Small 2 RTT writes: same basic protocol as RAMP-Fast but drop all RAMP-Fast metadata 2 RTT reads

Slide 190

Slide 190 text

RAMP-Small 2 RTT writes: same basic protocol as RAMP-Fast but drop all RAMP-Fast metadata 2 RTT reads always

Slide 191

Slide 191 text

RAMP-Small 2 RTT writes: same basic protocol as RAMP-Fast but drop all RAMP-Fast metadata 2 RTT reads 1.) For each item, fetch the highest committed timestamp. 2.) Request highest matching write with timestamp in step 1. always

Slide 192

Slide 192 text

RAMP-Small 2 RTT writes: same basic protocol as RAMP-Fast but drop all RAMP-Fast metadata 2 RTT reads INTUITION: 1.) For each item, fetch the highest committed timestamp. 2.) Request highest matching write with timestamp in step 1. always

Slide 193

Slide 193 text

RAMP-Small 2 RTT writes: same basic protocol as RAMP-Fast but drop all RAMP-Fast metadata 2 RTT reads INTUITION: 1.) For each item, fetch the highest committed timestamp. 2.) Request highest matching write with timestamp in step 1. X time 523 always

Slide 194

Slide 194 text

RAMP-Small 2 RTT writes: same basic protocol as RAMP-Fast but drop all RAMP-Fast metadata 2 RTT reads INTUITION: 1.) For each item, fetch the highest committed timestamp. 2.) Request highest matching write with timestamp in step 1. X time 523 Y time 247 always

Slide 195

Slide 195 text

RAMP-Small 2 RTT writes: same basic protocol as RAMP-Fast but drop all RAMP-Fast metadata 2 RTT reads INTUITION: 1.) For each item, fetch the highest committed timestamp. 2.) Request highest matching write with timestamp in step 1. X time 523 Y time 247 Z time 842 always

Slide 196

Slide 196 text

RAMP-Small 2 RTT writes: same basic protocol as RAMP-Fast but drop all RAMP-Fast metadata 2 RTT reads INTUITION: 1.) For each item, fetch the highest committed timestamp. 2.) Request highest matching write with timestamp in step 1. X time 523 Y time 247 Z time 842 {247, 523, 842} always

Slide 197

Slide 197 text

RAMP-Small 2 RTT writes: same basic protocol as RAMP-Fast but drop all RAMP-Fast metadata 2 RTT reads partial commits will be in this set INTUITION: 1.) For each item, fetch the highest committed timestamp. 2.) Request highest matching write with timestamp in step 1. X time 523 Y time 247 Z time 842 {247, 523, 842} always

Slide 198

Slide 198 text

RAMP-Small 2 RTT writes: same basic protocol as RAMP-Fast but drop all RAMP-Fast metadata 2 RTT reads partial commits will be in this set INTUITION: 1.) For each item, fetch the highest committed timestamp. 2.) Request highest matching write with timestamp in step 1. X time 523 Y time 247 Z time 842 {247, 523, 842} send it to all participating servers always
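The two RAMP-Small read rounds above can be sketched as follows. The API names are invented; the point is that no per-write metadata is stored — round 1 gathers a set of committed timestamps, and round 2 ships that whole set to every participating partition, which returns its highest known version matching the set.

```python
class SmallPartition:
    """Partition for RAMP-Small: versions carry no metadata (illustrative)."""
    def __init__(self):
        self.versions = {}   # item -> {timestamp: value}; ts 0 = "no value yet"
        self.committed = {}  # item -> highest committed timestamp

    def write(self, item, ts, value, commit=True):
        self.versions.setdefault(item, {0: None})[ts] = value
        if commit:
            self.committed[item] = max(self.committed.get(item, 0), ts)

    def committed_ts(self, item):
        return self.committed.get(item, 0)

    def get_matching(self, item, ts_set):
        # Highest known version whose timestamp is in the round-1 set;
        # timestamp 0 (the empty version) always matches.
        known = self.versions.get(item, {0: None})
        return known[max(t for t in known if t == 0 or t in ts_set)]


def ramp_small_read(partitions, items):
    ts_set = {partitions[i].committed_ts(i) for i in items}            # round 1
    return {i: partitions[i].get_matching(i, ts_set) for i in items}   # round 2
```

In the race from earlier — x committed at 242, y only prepared — y's partition sees 242 in the set and returns the partially committed write, preserving atomic visibility.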

Slide 199

Slide 199 text

RAMP Summary Algorithm Write RTT READ RTT (best case) READ RTT (worst case) METADATA RAMP-Fast 2 1 2 O(txn len) write set summary RAMP-Small 2 2 2 O(1) timestamp RAMP-Hybrid 2 1+ε 2 O(1) Bloom filter

Slide 200

Slide 200 text

RAMP Summary Algorithm Write RTT READ RTT (best case) READ RTT (worst case) METADATA RAMP-Fast 2 1 2 O(txn len) write set summary RAMP-Small 2 2 2 O(1) timestamp RAMP-Hybrid 2 1+ε 2 O(1) Bloom filter

Slide 201

Slide 201 text

RAMP Summary Algorithm Write RTT READ RTT (best case) READ RTT (worst case) METADATA RAMP-Fast 2 1 2 O(txn len) write set summary RAMP-Small 2 2 2 O(1) timestamp RAMP-Hybrid 2 1+ε 2 O(1) Bloom filter

Slide 202

Slide 202 text

RAMP Summary Algorithm Write RTT READ RTT (best case) READ RTT (worst case) METADATA RAMP-Fast 2 1 2 O(txn len) write set summary RAMP-Small 2 2 2 O(1) timestamp RAMP-Hybrid 2 1+ε 2 O(1) Bloom filter

Slide 203

Slide 203 text

RAMP Summary Algorithm Write RTT READ RTT (best case) READ RTT (worst case) METADATA RAMP-Fast 2 1 2 O(txn len) write set summary RAMP-Small 2 2 2 O(1) timestamp RAMP-Hybrid 2 1+ε 2 O(1) Bloom filter

Slide 204

Slide 204 text

RAMP Summary Algorithm Write RTT READ RTT (best case) READ RTT (worst case) METADATA RAMP-Fast 2 1 2 O(txn len) write set summary RAMP-Small 2 2 2 O(1) timestamp RAMP-Hybrid 2 1+ε 2 O([txn len]·log(1/ε)) Bloom filter

Slide 205

Slide 205 text

RAMP Summary Algorithm Write RTT READ RTT (best case) READ RTT (worst case) METADATA RAMP-Fast 2 1 2 O(txn len) write set summary RAMP-Small 2 2 2 O(1) timestamp RAMP-Hybrid 2 1+ε 2 O([txn len]·log(1/ε)) Bloom filter

Slide 206

Slide 206 text

RAMP Summary Algorithm Write RTT READ RTT (best case) READ RTT (worst case) METADATA RAMP-Fast 2 1 2 O(txn len) write set summary RAMP-Small 2 2 2 O(1) timestamp RAMP-Hybrid 2 1+ε 2 O([txn len]·log(1/ε)) Bloom filter BLOOM FILTER SUMMARIZES WRITE SET FALSE POSITIVES: EXTRA RTTs

Slide 207

Slide 207 text

RAMP Summary Algorithm Write RTT READ RTT (best case) READ RTT (worst case) METADATA RAMP-Fast 2 1 2 O(txn len) write set summary RAMP-Small 2 2 2 O(1) timestamp RAMP-Hybrid 2 1+ε 2 O([txn len]·log(1/ε)) Bloom filter BLOOM FILTER SUMMARIZES WRITE SET FALSE POSITIVES: EXTRA RTTs
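A minimal Bloom filter of the kind RAMP-Hybrid could use to summarize a write set. The parameters and hashing scheme here are illustrative choices, not the paper's implementation; the property that matters is no false negatives — a true sibling write is never missed, and a false positive only costs an extra read RTT, never an incorrect result.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter summarizing a write set (illustrative parameters)."""
    def __init__(self, m=64, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item):
        # Derive k bit positions by salting a cryptographic hash.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # All k bits set => "maybe present" (possible false positive);
        # any bit clear => definitely absent (no false negatives).
        return all(self.bits & (1 << pos) for pos in self._positions(item))
```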

Slide 208

Slide 208 text

RAMP Summary Algorithm Write RTT READ RTT (best case) READ RTT (worst case) METADATA RAMP-Fast 2 1 2 O(txn len) write set summary RAMP-Small 2 2 2 O(1) timestamp RAMP-Hybrid 2 1+ε 2 O(B) Bloom filter

Slide 209

Slide 209 text

RAMP Summary Algorithm Write RTT READ RTT (best case) READ RTT (worst case) METADATA RAMP-Fast 2 1 2 O(txn len) write set summary RAMP-Small 2 2 2 O(1) timestamp RAMP-Hybrid 2 1+ε 2 O(B) Bloom filter • AVOID IN-PLACE UPDATES

Slide 210

Slide 210 text

RAMP Summary Algorithm Write RTT READ RTT (best case) READ RTT (worst case) METADATA RAMP-Fast 2 1 2 O(txn len) write set summary RAMP-Small 2 2 2 O(1) timestamp RAMP-Hybrid 2 1+ε 2 O(B) Bloom filter • AVOID IN-PLACE UPDATES • EMBRACE RACES TO IMPROVE CONCURRENCY

Slide 211

Slide 211 text

RAMP Summary Algorithm Write RTT READ RTT (best case) READ RTT (worst case) METADATA RAMP-Fast 2 1 2 O(txn len) write set summary RAMP-Small 2 2 2 O(1) timestamp RAMP-Hybrid 2 1+ε 2 O(B) Bloom filter • AVOID IN-PLACE UPDATES • EMBRACE RACES TO IMPROVE CONCURRENCY • ALLOW READERS TO REPAIR PARTIAL WRITES

Slide 212

Slide 212 text

RAMP Summary Algorithm Write RTT READ RTT (best case) READ RTT (worst case) METADATA RAMP-Fast 2 1 2 O(txn len) write set summary RAMP-Small 2 2 2 O(1) timestamp RAMP-Hybrid 2 1+ε 2 O(B) Bloom filter • AVOID IN-PLACE UPDATES • EMBRACE RACES TO IMPROVE CONCURRENCY • ALLOW READERS TO REPAIR PARTIAL WRITES • USE 2PC TO AVOID READER STALLS

Slide 213

Slide 213 text

Additional Details

Slide 214

Slide 214 text

Additional Details Garbage collection: limit read transaction duration to K seconds; GC overwritten versions after K seconds

Slide 215

Slide 215 text

Additional Details Garbage collection: limit read transaction duration to K seconds; GC overwritten versions after K seconds. Replication: paper assumes linearizable masters; extendable to “AP” systems (see HAT by Bailis et al., VLDB 2014)

Slide 216

Slide 216 text

Additional Details Garbage collection: limit read transaction duration to K seconds; GC overwritten versions after K seconds. Failure handling: blocked 2PC rounds do not block clients; stalled commits? versions are not GC’d; if desirable, use CTP termination protocol. Replication: paper assumes linearizable masters; extendable to “AP” systems (see HAT by Bailis et al., VLDB 2014)

Slide 217

Slide 217 text

RAMP PERFORMANCE Algorithm Write RTT READ RTT (best case) READ RTT (worst case) METADATA RAMP-Fast 2 1 2 O(txn len) write set summary RAMP-Small 2 2 2 O(1) timestamp RAMP-Hybrid 2 1+ε 2 O(B) Bloom filter

Slide 218

Slide 218 text

RAMP PERFORMANCE Algorithm Write RTT READ RTT (best case) READ RTT (worst case) METADATA RAMP-Fast 2 1 2 O(txn len) write set summary RAMP-Small 2 2 2 O(1) timestamp RAMP-Hybrid 2 1+ε 2 O(B) Bloom filter EVALUATED ON EC2 cr1.8xlarge instances (cluster size: 1-100 servers; default: 5); open sourced on GitHub; see link at end of talk

Slide 219

Slide 219 text

YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn [chart: Throughput (txn/s) vs. Concurrent Clients]

Slide 220

Slide 220 text

YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn [chart: Throughput (txn/s) vs. Concurrent Clients] RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control

Slide 221

Slide 221 text

YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn [chart: Throughput (txn/s) vs. Concurrent Clients] Doesn’t provide atomic visibility RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control

Slide 222

Slide 222 text

YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn [chart: Throughput (txn/s) vs. Concurrent Clients] RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control

Slide 223

Slide 223 text

YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn [chart: Throughput (txn/s) vs. Concurrent Clients] RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control LWSR LWLR E-PCI Serializable 2PL

Slide 224

Slide 224 text

YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn [chart: Throughput (txn/s) vs. Concurrent Clients] RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control LWSR LWLR E-PCI Serializable 2PL NWNR LWNR LWSR LWLR E-PCI Write Locks Only

Slide 225

Slide 225 text

YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn [chart: Throughput (txn/s) vs. Concurrent Clients] Also doesn’t provide atomic visibility RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control LWSR LWLR E-PCI Serializable 2PL NWNR LWNR LWSR LWLR E-PCI Write Locks Only

Slide 226

Slide 226 text

YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn [chart: Throughput (txn/s) vs. Concurrent Clients] Representative of coordinated approaches RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control LWSR LWLR E-PCI Serializable 2PL NWNR LWNR LWSR LWLR E-PCI Write Locks Only

Slide 227

Slide 227 text

YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn [chart: Throughput (txn/s) vs. Concurrent Clients] RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control LWSR LWLR E-PCI Serializable 2PL NWNR LWNR LWSR LWLR E-PCI Write Locks Only

Slide 228

Slide 228 text

YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn [chart: Throughput (txn/s) vs. Concurrent Clients] RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control LWSR LWLR E-PCI Serializable 2PL NWNR LWNR LWSR LWLR E-PCI Write Locks Only RAMP-F RAMP-S RAMP-Fast

Slide 229

Slide 229 text

YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn [chart: Throughput (txn/s) vs. Concurrent Clients] Within ~5% of baseline; latency in paper (comparable) RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control LWSR LWLR E-PCI Serializable 2PL NWNR LWNR LWSR LWLR E-PCI Write Locks Only RAMP-F RAMP-S RAMP-Fast

Slide 230

Slide 230 text

YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn [chart: Throughput (txn/s) vs. Concurrent Clients] RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control LWSR LWLR E-PCI Serializable 2PL NWNR LWNR LWSR LWLR E-PCI Write Locks Only RAMP-F RAMP-S RAMP-Fast

Slide 231

Slide 231 text

YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn [chart: Throughput (txn/s) vs. Concurrent Clients] RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control LWSR LWLR E-PCI Serializable 2PL NWNR LWNR LWSR LWLR E-PCI Write Locks Only RAMP-F RAMP-S RAMP-Fast RAMP-F RAMP-S RAMP-H RAMP-Small

Slide 232

Slide 232 text

YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn [chart: Throughput (txn/s) vs. Concurrent Clients] Always needs 2 RTT reads RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control LWSR LWLR E-PCI Serializable 2PL NWNR LWNR LWSR LWLR E-PCI Write Locks Only RAMP-F RAMP-S RAMP-Fast RAMP-F RAMP-S RAMP-H RAMP-Small

Slide 233

Slide 233 text

YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn [chart: Throughput (txn/s) vs. Concurrent Clients] RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control LWSR LWLR E-PCI Serializable 2PL NWNR LWNR LWSR LWLR E-PCI Write Locks Only RAMP-F RAMP-S RAMP-Fast RAMP-F RAMP-S RAMP-H RAMP-Small

Slide 234

Slide 234 text

RAMP-F RAMP-S RAMP-H NWNR RAMP-Hybrid YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn [chart: Throughput (txn/s) vs. Concurrent Clients] RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control LWSR LWLR E-PCI Serializable 2PL NWNR LWNR LWSR LWLR E-PCI Write Locks Only RAMP-F RAMP-S RAMP-Fast RAMP-F RAMP-S RAMP-H RAMP-Small

Slide 235

Slide 235 text

RAMP-F RAMP-S RAMP-H NWNR RAMP-Hybrid YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn [chart: Throughput (txn/s) vs. Concurrent Clients] RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control LWSR LWLR E-PCI Serializable 2PL NWNR LWNR LWSR LWLR E-PCI Write Locks Only RAMP-F RAMP-S RAMP-Fast RAMP-F RAMP-S RAMP-H RAMP-Small

Slide 236

Slide 236 text

YCSB: WorkloadA, 1M items, 4 items/txn, 5K clients [chart: Throughput (txn/s) vs. Percentage Reads]

Slide 237

Slide 237 text

YCSB: WorkloadA, 1M items, 4 items/txn, 5K clients [chart: Throughput (txn/s) vs. Percentage Reads] RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control

Slide 238

Slide 238 text

YCSB: WorkloadA, 1M items, 4 items/txn, 5K clients [chart: Throughput (txn/s) vs. Percentage Reads] RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control LWSR LWLR E-PCI Serializable 2PL NWNR LWNR LWSR LWLR E-PCI Write Locks Only

Slide 239

Slide 239 text

RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control LWSR LWLR E-PCI Serializable 2PL NWNR LWNR LWSR LWLR E-PCI Write Locks Only RAMP-F RAMP-S RAMP-Fast RAMP-F RAMP-S RAMP-H RAMP-Small YCSB: WorkloadA, 1M items, 4 items/txn, 5K clients [chart: Throughput (txn/s) vs. Percentage Reads] RAMP-F RAMP-S RAMP-H NWNR RAMP-Hybrid

Slide 240

Slide 240 text

RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control LWSR LWLR E-PCI Serializable 2PL NWNR LWNR LWSR LWLR E-PCI Write Locks Only RAMP-F RAMP-S RAMP-Fast RAMP-F RAMP-S RAMP-H RAMP-Small YCSB: WorkloadA, 1M items, 4 items/txn, 5K clients [chart: Throughput (txn/s) vs. Percentage Reads] RAMP-F RAMP-S RAMP-H NWNR RAMP-Hybrid Linear scaling; due to 2 RTT writes, races

Slide 241

Slide 241 text

RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control LWSR LWLR E-PCI Serializable 2PL NWNR LWNR LWSR LWLR E-PCI Write Locks Only RAMP-F RAMP-S RAMP-Fast RAMP-F RAMP-S RAMP-H RAMP-Small YCSB: WorkloadA, 1M items, 4 items/txn, 5K clients [chart: Throughput (txn/s) vs. Percentage Reads] RAMP-F RAMP-S RAMP-H NWNR RAMP-Hybrid

Slide 242

Slide 242 text

YCSB: uniform access, 1M items, 4 items/txn, 95% reads [chart: Throughput (ops/s) vs. Number of Servers]

Slide 243

Slide 243 text

RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control YCSB: uniform access, 1M items, 4 items/txn, 95% reads [chart: Throughput (ops/s) vs. Number of Servers]

Slide 244

Slide 244 text

RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control RAMP-F RAMP-S RAMP-Fast RAMP-F RAMP-S RAMP-H RAMP-Small RAMP-F RAMP-S RAMP-H NWNR RAMP-Hybrid YCSB: uniform access, 1M items, 4 items/txn, 95% reads [chart: Throughput (ops/s) vs. Number of Servers]

Slide 245

Slide 245 text

RAMP-H NWNR LWNR LWSR LWLR E-PCI No Concurrency Control RAMP-F RAMP-S RAMP-Fast RAMP-F RAMP-S RAMP-H RAMP-Small RAMP-F RAMP-S RAMP-H NWNR RAMP-Hybrid YCSB: uniform access, 1M items, 4 items/txn, 95% reads [chart: operations/s/server vs. Number of Servers]

Slide 246

Slide 246 text

RAMP PERFORMANCE Algorithm Write RTT READ RTT (best case) READ RTT (worst case) METADATA RAMP-Fast 2 1 2 O(txn len) write set summary RAMP-Small 2 2 2 O(1) timestamp RAMP-Hybrid 2 1+ε 2 O(B) Bloom filter

Slide 247

Slide 247 text

RAMP PERFORMANCE Algorithm Write RTT READ RTT (best case) READ RTT (worst case) METADATA RAMP-Fast 2 1 2 O(txn len) write set summary RAMP-Small 2 2 2 O(1) timestamp RAMP-Hybrid 2 1+ε 2 O(B) Bloom filter More results in paper: Transaction length, contention, value size, latency, failures

Slide 248

Slide 248 text

FOREIGN KEY DEPENDENCIES SECONDARY INDEXING MATERIALIZED VIEWS HOW RAMP HANDLES:

Slide 249

Slide 249 text

FOREIGN KEY DEPENDENCIES SECONDARY INDEXING MATERIALIZED VIEWS HOW RAMP HANDLES:

Slide 250

Slide 250 text

FOREIGN KEY DEPENDENCIES SECONDARY INDEXING MATERIALIZED VIEWS HOW RAMP HANDLES:

Slide 251

Slide 251 text

FOREIGN KEY DEPENDENCIES SECONDARY INDEXING MATERIALIZED VIEWS HOW RAMP HANDLES:

Slide 252

Slide 252 text

FOREIGN KEY DEPENDENCIES SECONDARY INDEXING MATERIALIZED VIEWS HOW RAMP HANDLES: MULTI-PUT (DELETES VIA TOMBSTONES)

Slide 253

Slide 253 text

FOREIGN KEY DEPENDENCIES SECONDARY INDEXING MATERIALIZED VIEWS HOW RAMP HANDLES:

Slide 254

Slide 254 text

FOREIGN KEY DEPENDENCIES SECONDARY INDEXING MATERIALIZED VIEWS HOW RAMP HANDLES:

Slide 255

Slide 255 text

FOREIGN KEY DEPENDENCIES SECONDARY INDEXING MATERIALIZED VIEWS HOW RAMP HANDLES:

Slide 256

Slide 256 text

FOREIGN KEY DEPENDENCIES SECONDARY INDEXING MATERIALIZED VIEWS HOW RAMP HANDLES: Maintain list of matching record IDs and versions e.g., HAS_BEARD={52@512, 412@52, 123@512} merge lists on commit/read (LWW by timestamp for conflicts)

Slide 257

Slide 257 text

FOREIGN KEY DEPENDENCIES SECONDARY INDEXING MATERIALIZED VIEWS HOW RAMP HANDLES: Maintain list of matching record IDs and versions e.g., HAS_BEARD={52@512, 412@52, 123@512} merge lists on commit/read (LWW by timestamp for conflicts) LOOKUPs: READ INDEX, THEN FETCH DATA

Slide 258

Slide 258 text

FOREIGN KEY DEPENDENCIES SECONDARY INDEXING MATERIALIZED VIEWS HOW RAMP HANDLES: Maintain list of matching record IDs and versions e.g., HAS_BEARD={52@512, 412@52, 123@512} merge lists on commit/read (LWW by timestamp for conflicts) LOOKUPs: READ INDEX, THEN FETCH DATA SIMILAR FOR SELECT/PROJECT
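The merge step for the versioned index lists above can be sketched in a few lines. The representation is an assumption: each entry maps a record id to the timestamp of the write that last changed its membership (mirroring the HAS_BEARD={52@512, 412@52, 123@512} example), and conflicts resolve last-writer-wins by timestamp, as the slide describes.

```python
def merge_index_entries(a, b):
    """Merge two versioned record-id lists, last-writer-wins per id.

    Entries are {record_id: timestamp}, e.g. {52: 512} for "52@512".
    """
    merged = dict(a)
    for rid, ts in b.items():
        if ts > merged.get(rid, -1):
            merged[rid] = ts  # keep the newer membership decision
    return merged
```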

Slide 259

Slide 259 text

SERIALIZABILITY SNAPSHOT ISOLATION REPEATABLE READ (PL-2.99) CURSOR STABILITY READ UNCOMMITTED READ COMMITTED CAUSAL PRAM RYW LINEARIZABILITY EVENTUAL CONSISTENCY REQUIRE SYNCHRONOUS COORDINATION INSUFFICIENT SUFFICIENT COORDINATION-FREE

Slide 260

Slide 260 text

SERIALIZABILITY SNAPSHOT ISOLATION REPEATABLE READ (PL-2.99) CURSOR STABILITY READ UNCOMMITTED READ COMMITTED CAUSAL PRAM RYW LINEARIZABILITY EVENTUAL CONSISTENCY REQUIRE SYNCHRONOUS COORDINATION INSUFFICIENT SUFFICIENT COORDINATION-FREE

Slide 261

Slide 261 text

SERIALIZABILITY SNAPSHOT ISOLATION REPEATABLE READ (PL-2.99) CURSOR STABILITY READ UNCOMMITTED READ COMMITTED CAUSAL PRAM RYW LINEARIZABILITY EVENTUAL CONSISTENCY COORDINATION-FREE INSUFFICIENT SUFFICIENT REQUIRE SYNCHRONOUS COORDINATION

Slide 262

Slide 262 text

SERIALIZABILITY SNAPSHOT ISOLATION REPEATABLE READ (PL-2.99) CURSOR STABILITY READ UNCOMMITTED READ COMMITTED CAUSAL PRAM RYW LINEARIZABILITY EVENTUAL CONSISTENCY COORDINATION-FREE ATOMIC VISIBILITY VIA RAMP INSUFFICIENT SUFFICIENT REQUIRE SYNCHRONOUS COORDINATION

Slide 263

Slide 263 text

RAMP IN CASSANDRA

Slide 264

Slide 264 text

RAMP IN CASSANDRA USES

Slide 265

Slide 265 text

RAMP IN CASSANDRA USES REQUIREMENTS

Slide 266

Slide 266 text

RAMP IN CASSANDRA USES REQUIREMENTS IMPLEMENTATION

Slide 267

Slide 267 text

RAMP IN CASSANDRA STRAIGHTFORWARD USES: •Add atomic visibility to atomic batch operations •Expose as CQL isolation level •USING CONSISTENCY READ_ATOMIC •Encourage use in multi-put, multi-get •Treat as basis for global secondary indexing •CREATE GLOBAL INDEX on users (age)

Slide 268

Slide 268 text

RAMP IN CASSANDRA REQUIREMENTS: •Unique timestamp generation for transactions •Use node ID from ring •Other form of UUID •Hash transaction contents* •Limited multi-versioning for prepared and old values •RAMP doesn’t actually require true MVCC •One proposal: keep a lookaside cache

Slide 269

Slide 269 text

RAMP IN CASSANDRA POSSIBLE IMPLEMENTATION: Lookaside cache for prepared and old values. Standard C* table stores last committed write 1 52 335 1240 1402 2201

Slide 270

Slide 270 text

RAMP IN CASSANDRA POSSIBLE IMPLEMENTATION: Lookaside cache for prepared and old values. Standard C* table stores last committed write. Shadow table stores prepared-but-not-committed and overwritten versions 1 52 335 1240 1402 2201 64 335 2201

Slide 271

Slide 271 text

RAMP IN CASSANDRA POSSIBLE IMPLEMENTATION: Lookaside cache for prepared and old values. Standard C* table stores last committed write. Shadow table stores prepared-but-not-committed and overwritten versions 1 52 335 1240 1402 2201 64 335 2201 Transparent to end-users

Slide 272

Slide 272 text

RAMP IN CASSANDRA POSSIBLE IMPLEMENTATION: Lookaside cache for prepared and old values. Standard C* table stores last committed write. Shadow table stores prepared-but-not-committed and overwritten versions 1 52 335 1240 1402 2201 64 335 2201 Overwritten versions have TTL set to max read transaction time, do not need durability Transparent to end-users
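The main-table/shadow-table split can be sketched as follows. This is a Python model of the proposal, not Cassandra code: `LookasideStore` and its methods are invented names, and the TTL here stands in for Cassandra's per-row TTL on the shadow table.

```python
import time

K = 5.0  # max read-transaction duration (seconds), the GC bound from the slides


class LookasideStore:
    """Model of the lookaside-cache proposal: main table holds only the
    last committed write; the shadow table holds prepared and overwritten
    versions, the latter expiring after K seconds."""
    def __init__(self):
        self.main = {}    # key -> (timestamp, value)
        self.shadow = {}  # (key, timestamp) -> (value, expires_at or None)

    def prepare(self, key, ts, value):
        # Prepared versions must survive until commit, so no TTL yet.
        self.shadow[(key, ts)] = (value, None)

    def commit(self, key, ts):
        value, _ = self.shadow.pop((key, ts))
        old = self.main.get(key)
        if old is not None:
            # The overwritten version needs no durability; it only has to
            # outlive the longest possible in-flight read transaction.
            self.shadow[(key, old[0])] = (old[1], time.time() + K)
        self.main[key] = (ts, value)

    def gc(self, now=None):
        # Drop expired overwritten versions (prepared ones have no expiry).
        now = time.time() if now is None else now
        self.shadow = {k: v for k, v in self.shadow.items()
                       if v[1] is None or v[1] > now}
```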

Slide 273

Slide 273 text

RAMP IN CASSANDRA POSSIBLE IMPLEMENTATION:

Slide 274

Slide 274 text

RAMP IN CASSANDRA POSSIBLE IMPLEMENTATION: OPERATION CONSISTENCY LEVEL Write Prepare CL.QUORUM Write Commit CL.ANY or higher First-round Read CL.ANY/CL.ONE Second-round Read CL.QUORUM

Slide 275

Slide 275 text

RAMP IN CASSANDRA To avoid stalling, second-round reads must be able to access prepared writes POSSIBLE IMPLEMENTATION: OPERATION CONSISTENCY LEVEL Write Prepare CL.QUORUM Write Commit CL.ANY or higher First-round Read CL.ANY/CL.ONE Second-round Read CL.QUORUM

Slide 276

Slide 276 text

RAMP IN CASSANDRA To avoid stalling, second-round reads must be able to access prepared writes POSSIBLE IMPLEMENTATION: OPERATION CONSISTENCY LEVEL Write Prepare CL.QUORUM Write Commit CL.ANY or higher First-round Read CL.ANY/CL.ONE Second-round Read CL.QUORUM

Slide 277

Slide 277 text

RAMP IN CASSANDRA To avoid stalling, second-round reads must be able to access prepared writes POSSIBLE IMPLEMENTATION: OPERATION CONSISTENCY LEVEL Write Prepare CL.QUORUM Write Commit CL.ANY or higher First-round Read CL.ANY/CL.ONE Second-round Read CL.QUORUM

Slide 278

Slide 278 text

RAMP IN CASSANDRA POSSIBLE IMPLEMENTATION:

Slide 279

Slide 279 text

RAMP IN CASSANDRA POSSIBLE IMPLEMENTATION: DC1

Slide 280

Slide 280 text

RAMP IN CASSANDRA POSSIBLE IMPLEMENTATION: DC1 DC2

Slide 281

Slide 281 text

RAMP IN CASSANDRA POSSIBLE IMPLEMENTATION: DC1 DC2 Run algorithms on a per-DC basis, using CL.LOCAL_QUORUM instead of full CL.QUORUM

Slide 282

Slide 282 text

No content

Slide 283

Slide 283 text

RAMP TRANSACTIONS:

Slide 284

Slide 284 text

RAMP TRANSACTIONS: • Provide atomic visibility, as required for maintaining FKs, scalable indexing, mat views

Slide 285

Slide 285 text

RAMP TRANSACTIONS: • Provide atomic visibility, as required for maintaining FKs, scalable indexing, mat views • Avoid in-place updates, mutual exclusion, any synchronous/blocking coordination

Slide 286

Slide 286 text

RAMP TRANSACTIONS: • Provide atomic visibility, as required for maintaining FKs, scalable indexing, mat views • Avoid in-place updates, mutual exclusion, any synchronous/blocking coordination • Use metadata with limited multi-versioning, reads repair partial writes

Slide 287

Slide 287 text

RAMP TRANSACTIONS: • Provide atomic visibility, as required for maintaining FKs, scalable indexing, mat views • Avoid in-place updates, mutual exclusion, any synchronous/blocking coordination • Use metadata with limited multi-versioning, reads repair partial writes • 1-2 RTT overhead, paid only during contention

Slide 288

Slide 288 text

RAMP TRANSACTIONS: • Provide atomic visibility, as required for maintaining FKs, scalable indexing, mat views • Avoid in-place updates, mutual exclusion, any synchronous/blocking coordination • Use metadata with limited multi-versioning, reads repair partial writes • 1-2 RTT overhead, paid only during contention Thanks! http://tiny.cc/ramp-code @pbailis http://tiny.cc/ramp-intro

Slide 289

Slide 289 text

Punk designed by my name is mud from the Noun Project Creative Commons – Attribution (CC BY 3.0) Queen designed by Bohdan Burmich from the Noun Project Creative Commons – Attribution (CC BY 3.0) Guy Fawkes designed by Anisha Varghese from the Noun Project Creative Commons – Attribution (CC BY 3.0) Emperor designed by Simon Child from the Noun Project Creative Commons – Attribution (CC BY 3.0) Baby designed by Les vieux garçons from the Noun Project Creative Commons – Attribution (CC BY 3.0) Baby designed by Les vieux garçons from the Noun Project Creative Commons – Attribution (CC BY 3.0) Gandhi designed by Luis Martins from the Noun Project Creative Commons – Attribution (CC BY 3.0) Database designed by Anton Outkine from the Noun Project Creative Commons – Attribution (CC BY 3.0) Girl designed by Rodrigo Vidinich from the Noun Project Creative Commons – Attribution (CC BY 3.0) Child designed by Gemma Garner from the Noun Project Creative Commons – Attribution (CC BY 3.0) Customer Service designed by Veysel Kara from the Noun Project Creative Commons – Attribution (CC BY 3.0) Punk Rocker designed by Simon Child from the Noun Project Creative Commons – Attribution (CC BY 3.0) Pyramid designed by misirlou from the Noun Project Creative Commons – Attribution (CC BY 3.0) Person designed by Stefania Bonacasa from the Noun Project Creative Commons – Attribution (CC BY 3.0) Record designed by Diogo Trindade from the Noun Project Creative Commons – Attribution (CC BY 3.0) Window designed by Juan Pablo Bravo from the Noun Project Creative Commons – Attribution (CC BY 3.0) Balloon designed by Julien Deveaux from the Noun Project Creative Commons – Attribution (CC BY 3.0) Balloon designed by Julien Deveaux from the Noun Project Creative Commons – Attribution (CC BY 3.0) Balloon designed by Julien Deveaux from the Noun Project Creative Commons – Attribution (CC BY 3.0) Crying designed by Megan Sheehan from the Noun Project Creative Commons – Attribution (CC BY 3.0) Sad 
designed by Megan Sheehan from the Noun Project Creative Commons – Attribution (CC BY 3.0) Happy designed by Megan Sheehan from the Noun Project Creative Commons – Attribution (CC BY 3.0) Happy designed by Megan Sheehan from the Noun Project Creative Commons – Attribution (CC BY 3.0) User designed by JM Waideaswaran from the Noun Project Creative Commons – Attribution (CC BY 3.0) ! COCOGOOSE font by ZetaFonts COMMON CREATIVE NON COMMERCIAL USE IMAGE/FONT CREDITs