
Scalable Atomic Visibility with RAMP Transactions

pbailis
June 24, 2014


Slides for "Scalable Atomic Visibility with RAMP Transactions" by Bailis et al., appearing in SIGMOD 2014

This deck also contains a proposal for implementation in Cassandra. If you're interested in implementing RAMP in your own system, don't hesitate to get in touch: pbailis at cs.berkeley.edu or @pbailis

Paper: http://www.bailis.org/papers/ramp-sigmod2014.pdf
Blog post intro: http://www.bailis.org/blog/scalable-atomic-visibility-with-ramp-transactions/
Source code from paper and "executable pseudocode" in Python: https://github.com/pbailis/ramp-sigmod2014-code


Transcript

  1. SCALABLE
    ATOMIC VISIBILITY
    WITH
    RAMP TRANSACTIONS
    Peter Bailis, Alan Fekete, Ali Ghodsi,
    Joseph M. Hellerstein, Ion Stoica
    UC Berkeley and University of Sydney

    Overview deck with Cassandra discussion

    @pbailis


  2. NOSQL


  3. NO SQL
    NOSQL


  4. NO SQL
    DIDN’T WANT SQL
    NOSQL


  5. NO SQL
    DIDN’T WANT SERIALIZABILITY
    NOSQL


  6. POOR PERFORMANCE
    NO SQL
    DIDN’T WANT SERIALIZABILITY
    NOSQL



  8. POOR PERFORMANCE
    DELAY
    NO SQL
    DIDN’T WANT SERIALIZABILITY
    NOSQL


  9. POOR PERFORMANCE
    DELAY
    PEAK THROUGHPUT: 1/DELAY
    FOR CONTENDED OPERATIONS
    NO SQL
    DIDN’T WANT SERIALIZABILITY
    NOSQL


  10. POOR PERFORMANCE
    DELAY
    PEAK THROUGHPUT: 1/DELAY
    FOR CONTENDED OPERATIONS
    at 0.5 MS,
    2K TXN/s
    at 50 MS,
    20 TXN/s
    NO SQL
    DIDN’T WANT SERIALIZABILITY
    NOSQL
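The arithmetic behind the numbers on this slide is just an inverse; a rough sketch of my own (not from the deck), assuming every contended operation must hold the item exclusively for the full coordination delay:

```python
def peak_txn_per_sec(delay_s: float) -> float:
    """Throughput ceiling for operations contending on one item:
    only one such operation can hold the item at a time, so at
    most 1/delay of them can complete per second."""
    return 1.0 / delay_s

# 0.5 ms of coordination delay caps contended throughput near 2K txn/s;
# 50 ms (e.g., a wide-area round trip) caps it near 20 txn/s.
print(peak_txn_per_sec(0.0005), peak_txn_per_sec(0.05))
```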


  11. NO SQL
    DIDN’T WANT SERIALIZABILITY
    NOSQL
    POOR PERFORMANCE
    HIGH LATENCY


  12. NO SQL
    DIDN’T WANT SERIALIZABILITY
    NOSQL
    POOR PERFORMANCE
    LIMITED AVAILABILITY
    HIGH LATENCY


  13. STILL DON’T WANT SERIALIZABILITY
    “NOT ONLY SQL”


  14. STILL DON’T WANT SERIALIZABILITY
    “NOT ONLY SQL”
    (DON’T WANT THE COSTS)


  15. STILL DON’T WANT SERIALIZABILITY
    “NOT ONLY SQL”
    BUT WANT MORE FEATURES
    (DON’T WANT THE COSTS)


  16. STILL DON’T WANT SERIALIZABILITY
    “NOT ONLY SQL”
    BUT WANT MORE FEATURES
    This paper!
    (DON’T WANT THE COSTS)


  17. “TAO: Facebook’s Distributed Data Store for the Social Graph”
    USENIX ATC 2013



  21. “TAO: Facebook’s Distributed Data Store for the Social Graph”
    USENIX ATC 2013
    FRIENDS
    FRIENDS


  22. “TAO: Facebook’s Distributed Data Store for the Social Graph”
    USENIX ATC 2013
    FRIENDS
    FRIENDS


  23. “TAO: Facebook’s Distributed Data Store for the Social Graph”
    USENIX ATC 2013


  24. “TAO: Facebook’s Distributed Data Store for the Social Graph”
    USENIX ATC 2013
    Denormalized Friend List


  25. “TAO: Facebook’s Distributed Data Store for the Social Graph”
    USENIX ATC 2013
    Denormalized Friend List
    Fast reads…


  26. “TAO: Facebook’s Distributed Data Store for the Social Graph”
    USENIX ATC 2013
    Denormalized Friend List
    Fast reads…
    …multi-entity updates




  30. “TAO: Facebook’s Distributed Data Store for the Social Graph”
    USENIX ATC 2013
    Denormalized Friend List
    Fast reads…
    …multi-entity updates
    Not cleanly partitionable


  31. FOREIGN KEY DEPENDENCIES
    “TAO: Facebook’s Distributed Data Store for the Social Graph”
    USENIX ATC 2013


  32. FOREIGN KEY DEPENDENCIES
    “TAO: Facebook’s Distributed Data Store for the Social Graph”
    USENIX ATC 2013
    “On Brewing Fresh Espresso: LinkedIn’s Distributed Data
    Serving Platform” SIGMOD 2013


  33. FOREIGN KEY DEPENDENCIES
    “TAO: Facebook’s Distributed Data Store for the Social Graph”
    USENIX ATC 2013
    “On Brewing Fresh Espresso: LinkedIn’s Distributed Data
    Serving Platform” SIGMOD 2013
    “PNUTS: Yahoo!’s Hosted Data Serving Platform”
    VLDB 2008



  35. ID: 532
    AGE: 42
    ID: 123
    AGE: 22
    ID: 2345
    AGE: 1
    ID: 412
    AGE: 72
    ID: 892
    AGE: 13



  40. ID: 532
    AGE: 42
    ID: 123
    AGE: 22
    ID: 2345
    AGE: 1
    ID: 412
    AGE: 72
    ID: 892
    AGE: 13
    Partition by
    primary key (ID)


  41. ID: 532
    AGE: 42
    ID: 123
    AGE: 22
    ID: 2345
    AGE: 1
    ID: 412
    AGE: 72
    ID: 892
    AGE: 13
    Partition by
    primary key (ID) How should we look up by age?


  42. SECONDARY INDEXING
    Partition by
    primary key (ID) How should we look up by age?


  43. SECONDARY INDEXING
    Partition by
    primary key (ID) How should we look up by age?
    Option I: Local Secondary Indexing


  44. SECONDARY INDEXING
    Partition by
    primary key (ID) How should we look up by age?
    Option I: Local Secondary Indexing
    Build indexes co-located with primary data


  45. SECONDARY INDEXING
    Partition by
    primary key (ID) How should we look up by age?
    Option I: Local Secondary Indexing
    Build indexes co-located with primary data
    WRITE ONE SERVER, READ ALL


  46. SECONDARY INDEXING
    Partition by
    primary key (ID) How should we look up by age?
    Option I: Local Secondary Indexing
    Build indexes co-located with primary data
    poor scalability
    WRITE ONE SERVER, READ ALL


  47. SECONDARY INDEXING
    Partition by
    primary key (ID) How should we look up by age?
    Option I: Local Secondary Indexing
    Build indexes co-located with primary data
    Option II: Global Secondary Indexing
    Partition indexes by secondary key
    Partition by
    secondary attribute
    poor scalability
    WRITE ONE SERVER, READ ALL


  48. SECONDARY INDEXING
    Partition by
    primary key (ID) How should we look up by age?
    Option I: Local Secondary Indexing
    Build indexes co-located with primary data
    Option II: Global Secondary Indexing
    Partition indexes by secondary key
    Partition by
    secondary attribute
    WRITE 2+ SERVERS, READ ONE
    poor scalability
    WRITE ONE SERVER, READ ALL


  49. SECONDARY INDEXING
    Partition by
    primary key (ID) How should we look up by age?
    Option I: Local Secondary Indexing
    Build indexes co-located with primary data
    Option II: Global Secondary Indexing
    Partition indexes by secondary key
    Partition by
    secondary attribute
    WRITE 2+ SERVERS, READ ONE
    scalable lookups
    poor scalability
    WRITE ONE SERVER, READ ALL



  51. SECONDARY INDEXING
    Partition by
    primary key (ID) How should we look up by age?
    Option I: Local Secondary Indexing
    Build indexes co-located with primary data
    Option II: Global Secondary Indexing
    Partition indexes by secondary key
    Partition by
    secondary attribute
    scalable lookups
    poor scalability
    WRITE 2+ SERVERS, READ ONE
    WRITE ONE SERVER, READ ALL
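The trade-off on these slides can be sketched in a toy model (all names and structures here are illustrative assumptions of mine, not the paper's code): primary rows are hash-partitioned by ID, each partition keeps a local index over its own rows, and a global index is itself partitioned by the secondary key (age):

```python
NUM_PARTITIONS = 3

def partition_for(key: int) -> int:
    return key % NUM_PARTITIONS

# Primary rows partitioned by ID; each partition also keeps a *local*
# secondary index over its own rows (Option I).
partitions = [{"rows": {}, "local_age_index": {}} for _ in range(NUM_PARTITIONS)]
# A *global* secondary index is partitioned by age (Option II).
global_age_index = [{} for _ in range(NUM_PARTITIONS)]

def insert(user_id: int, age: int) -> None:
    p = partitions[partition_for(user_id)]
    p["rows"][user_id] = age
    # Option I: one server touched (index colocated with the row).
    p["local_age_index"].setdefault(age, set()).add(user_id)
    # Option II: a second server touched (index partitioned by age).
    global_age_index[partition_for(age)].setdefault(age, set()).add(user_id)

def lookup_local(age: int) -> set:
    # Option I read: must fan out to every partition.
    found = set()
    for p in partitions:
        found |= p["local_age_index"].get(age, set())
    return found

def lookup_global(age: int) -> set:
    # Option II read: contact only the partition that owns this age.
    return set(global_age_index[partition_for(age)].get(age, set()))
```

Option I keeps writes on one server but makes reads fan out; Option II makes lookups scalable at the cost of multi-server writes, which is exactly where atomic visibility becomes the problem.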


  52. SECONDARY INDEXING
    Partition by
    primary key (ID) How should we look up by age?
    Option I: Local Secondary Indexing
    Build indexes co-located with primary data
    Option II: Global Secondary Indexing
    Partition indexes by secondary key
    Partition by
    secondary attribute
    scalable lookups
    poor scalability
    WRITE 2+ SERVERS, READ ONE
    WRITE ONE SERVER, READ ALL
    OVERVIEW


  53. SECONDARY INDEXING
    Partition by
    primary key (ID) How should we look up by age?
    Option I: Local Secondary Indexing
    Build indexes co-located with primary data
    Option II: Global Secondary Indexing
    Partition indexes by secondary key
    Partition by
    secondary attribute
    scalable lookups
    poor scalability
    WRITE 2+ SERVERS, READ ONE
    WRITE ONE SERVER, READ ALL
    OVERVIEW
    INCONSISTENT
    GLOBAL 2i


  54. SECONDARY INDEXING
    Partition by
    primary key (ID) How should we look up by age?
    Option I: Local Secondary Indexing
    Build indexes co-located with primary data
    Option II: Global Secondary Indexing
    Partition indexes by secondary key
    Partition by
    secondary attribute
    scalable lookups
    poor scalability
    WRITE 2+ SERVERS, READ ONE
    WRITE ONE SERVER, READ ALL
    OVERVIEW
    INCONSISTENT
    GLOBAL 2i
    INCONSISTENT
    GLOBAL 2i


  55. SECONDARY INDEXING
    Partition by
    primary key (ID) How should we look up by age?
    Option I: Local Secondary Indexing
    Build indexes co-located with primary data
    Option II: Global Secondary Indexing
    Partition indexes by secondary key
    Partition by
    secondary attribute
    scalable lookups
    poor scalability
    WRITE 2+ SERVERS, READ ONE
    WRITE ONE SERVER, READ ALL
    OVERVIEW
    INCONSISTENT
    GLOBAL 2i
    INCONSISTENT
    GLOBAL 2i
    (PROPOSED)
    INCONSISTENT
    GLOBAL 2i


  56. SECONDARY INDEXING
    Partition by
    primary key (ID) How should we look up by age?
    Option I: Local Secondary Indexing
    Build indexes co-located with primary data
    Option II: Global Secondary Indexing
    Partition indexes by secondary key
    Partition by
    secondary attribute
    scalable lookups
    poor scalability
    WRITE 2+ SERVERS, READ ONE
    WRITE ONE SERVER, READ ALL
    OVERVIEW
    INCONSISTENT
    GLOBAL 2i
    INCONSISTENT
    GLOBAL 2i
    (PROPOSED)
    INCONSISTENT
    GLOBAL 2i
    INCONSISTENT
    GLOBAL 2i


  57. SECONDARY INDEXING
    Partition by
    primary key (ID) How should we look up by age?
    Option I: Local Secondary Indexing
    Build indexes co-located with primary data
    Option II: Global Secondary Indexing
    Partition indexes by secondary key
    Partition by
    secondary attribute
    scalable lookups
    poor scalability
    WRITE 2+ SERVERS, READ ONE
    WRITE ONE SERVER, READ ALL
    OVERVIEW
    INCONSISTENT
    GLOBAL 2i
    INCONSISTENT
    GLOBAL 2i
    (PROPOSED)
    INCONSISTENT
    GLOBAL 2i
    INCONSISTENT
    GLOBAL 2i
    INCONSISTENT
    GLOBAL 2i


  58. TABLE:
    ALL USERS



  60. TABLE:
    ALL USERS
    TABLE:
    USERS OVER 25



  64. MATERIALIZED VIEWS
    TABLE:
    ALL USERS
    TABLE:
    USERS OVER 25


  65. MATERIALIZED VIEWS
    TABLE:
    ALL USERS
    TABLE:
    USERS OVER 25
    RELEVANT
    RECENT
    EXAMPLES IN
    GOOGLE PERCOLATOR
    TWITTER RAINBIRD
    LINKEDIN ESPRESSO
    PAPERS


  66. FOREIGN KEY DEPENDENCIES
    SECONDARY INDEXES
    MATERIALIZED VIEWS
    HOW SHOULD WE CORRECTLY MAINTAIN


  67. SERIALIZABILITY


  68. SNAPSHOT ISOLATION
    REPEATABLE READ
    (PL-2.99)
    CURSOR STABILITY
    READ UNCOMMITTED
    READ COMMITTED
    CAUSAL
    PRAM
    RYW
    LINEARIZABILITY
    EVENTUAL CONSISTENCY
    SERIALIZABILITY



  70. REPEATABLE READ
    (PL-2.99)
    SERIALIZABILITY
    SNAPSHOT ISOLATION
    CURSOR STABILITY
    READ UNCOMMITTED
    READ COMMITTED
    LINEARIZABILITY
    CAUSAL
    PRAM
    RYW
    EVENTUAL CONSISTENCY


  71. REPEATABLE READ
    (PL-2.99)
    SERIALIZABILITY
    SNAPSHOT ISOLATION
    CURSOR STABILITY
    READ UNCOMMITTED
    READ COMMITTED
    LINEARIZABILITY
    MANY
    SUFFICIENT
    CAUSAL
    PRAM
    RYW
    EVENTUAL CONSISTENCY


  72. REPEATABLE READ
    (PL-2.99)
    SERIALIZABILITY
    SNAPSHOT ISOLATION
    CURSOR STABILITY
    READ UNCOMMITTED
    READ COMMITTED
    LINEARIZABILITY
    REQUIRE SYNCHRONOUS COORDINATION
    MANY
    SUFFICIENT
    CAUSAL
    PRAM
    RYW
    EVENTUAL CONSISTENCY


  73. SERIALIZABILITY
    SNAPSHOT ISOLATION
    REPEATABLE READ
    (PL-2.99)
    CURSOR STABILITY
    READ UNCOMMITTED
    READ COMMITTED
    CAUSAL
    PRAM
    RYW
    LINEARIZABILITY
    EVENTUAL CONSISTENCY
    REQUIRE SYNCHRONOUS COORDINATION
    MANY
    SUFFICIENT
    COORDINATION-FREE


  74. SERIALIZABILITY
    SNAPSHOT ISOLATION
    REPEATABLE READ
    (PL-2.99)
    CURSOR STABILITY
    READ UNCOMMITTED
    READ COMMITTED
    CAUSAL
    PRAM
    RYW
    LINEARIZABILITY
    EVENTUAL CONSISTENCY
    REQUIRE SYNCHRONOUS COORDINATION
    INSUFFICIENT
    MANY
    SUFFICIENT
    COORDINATION-FREE


  75. SERIALIZABILITY
    SNAPSHOT ISOLATION
    REPEATABLE READ
    (PL-2.99)
    CURSOR STABILITY
    LINEARIZABILITY
    REQUIRE SYNCHRONOUS COORDINATION
    INSUFFICIENT
    MANY
    SUFFICIENT
    COORDINATION-FREE
    Facebook TAO
    LinkedIn Espresso
    Yahoo! PNUTS
    Google Megastore
    Google App Engine
    Twitter Rainbird
    Amazon DynamoDB
    CONSCIOUS
    CHOICES!


  76. SERIALIZABILITY
    SNAPSHOT ISOLATION
    REPEATABLE READ
    (PL-2.99)
    CURSOR STABILITY
    READ UNCOMMITTED
    READ COMMITTED CAUSAL PRAM
    RYW
    LINEARIZABILITY
    EVENTUAL CONSISTENCY
    COORDINATION-FREE
    INSUFFICIENT
    REQUIRE SYNCHRONOUS COORDINATION
    SUFFICIENT


  77. SERIALIZABILITY
    SNAPSHOT ISOLATION
    REPEATABLE READ
    (PL-2.99)
    CURSOR STABILITY
    READ UNCOMMITTED
    READ COMMITTED CAUSAL PRAM
    RYW
    LINEARIZABILITY
    EVENTUAL CONSISTENCY
    COORDINATION-FREE
    RAMP (THIS PAPER)
    INSUFFICIENT
    REQUIRE SYNCHRONOUS COORDINATION
    SUFFICIENT


  78. TRANSACTIONS
    R
    A
    M
    P
    TOMIC
    EAD
    ULTI-
    ARTITION


  79. TRANSACTIONS
    R
    A
    M
    P
    TOMIC
    EAD
    ULTI-
    ARTITION


  80. TRANSACTIONS
    RAMP
    EFFICIENTLY MAINTAIN


  81. TRANSACTIONS
    RAMP
    FOREIGN KEY DEPENDENCIES
    EFFICIENTLY MAINTAIN


  82. TRANSACTIONS
    RAMP
    FOREIGN KEY DEPENDENCIES
    SECONDARY INDEXES
    EFFICIENTLY MAINTAIN


  83. TRANSACTIONS
    RAMP
    FOREIGN KEY DEPENDENCIES
    SECONDARY INDEXES
    MATERIALIZED VIEWS
    EFFICIENTLY MAINTAIN


  84. TRANSACTIONS
    RAMP
    FOREIGN KEY DEPENDENCIES
    SECONDARY INDEXES
    MATERIALIZED VIEWS
    BY PROVIDING
    ATOMIC VISIBILITY
    EFFICIENTLY MAINTAIN


  85. ATOMIC VISIBILITY


  86. Informally:
    Either all of each transaction’s updates are visible,
    or none are
    ATOMIC VISIBILITY



  88. Informally:
    Either all of each transaction’s updates are visible,
    or none are
    ATOMIC VISIBILITY
    WRITE X = 1
    WRITE Y = 1


  89. Informally:
    Either all of each transaction’s updates are visible,
    or none are
    ATOMIC VISIBILITY
    WRITE X = 1
    WRITE Y = 1
    READ X = 1
    READ Y = 1


  90. Informally:
    Either all of each transaction’s updates are visible,
    or none are
    ATOMIC VISIBILITY
    WRITE X = 1
    WRITE Y = 1
    READ X = 1
    READ Y = 1
    OR


  91. Informally:
    Either all of each transaction’s updates are visible,
    or none are
    ATOMIC VISIBILITY
    WRITE X = 1
    WRITE Y = 1
    READ X = 1
    READ Y = 1
    READ X = ∅
    READ Y = ∅
    OR


  92. Informally:
    Either all of each transaction’s updates are visible,
    or none are
    ATOMIC VISIBILITY
    READ X = 1
    READ Y = 1
    READ X = ∅
    READ Y = ∅
    OR


  93. Informally:
    Either all of each transaction’s updates are visible,
    or none are
    ATOMIC VISIBILITY
    READ X = 1
    READ Y = 1
    READ X = ∅
    READ Y = ∅
    OR
    BUT NOT
    READ Y = ∅
    READ X = 1


  94. Informally:
    Either all of each transaction’s updates are visible,
    or none are
    ATOMIC VISIBILITY
    READ X = 1
    READ Y = 1
    READ X = ∅
    READ Y = ∅
    OR
    BUT NOT
    READ X = ∅
    READ Y = ∅
    READ X = 1
    OR
    READ Y = 1


  95. ATOMIC VISIBILITY
    We also provide per-item PRAM guarantees with
    per-transaction regular semantics (see paper Appendix)
    Formally:
    A transaction Tj exhibits fractured reads if transaction Ti writes versions xm and yn
    (in any order, with x possibly but not necessarily equal to y),
    Tj reads version xm and version yk, and k < n.
    A system provides Read Atomic isolation (RA) if it prevents fractured reads anomalies and also
    prevents transactions from reading uncommitted, aborted, or intermediate data.
    FORMALIZED AS
    READ ATOMIC ISOLATION
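One way to make the definition concrete is a small checker; the encoding below is my own, not the paper's pseudocode. Each writing transaction is an item→version map, and a read set is the item→version map a reader observed:

```python
def has_fractured_read(writes_by_txn, read_versions):
    """read_versions maps item -> version number the reader observed.
    writes_by_txn is a list of write sets, each mapping item -> version.
    The reader has a fractured read if some Ti wrote x_m and y_n, and
    the reader observed x_m but an older y_k with k < n."""
    for writes in writes_by_txn:
        for x, m in writes.items():
            if read_versions.get(x) != m:
                continue  # this Ti's write to x was not observed
            for y, n in writes.items():
                k = read_versions.get(y)
                if k is not None and k < n:
                    return True  # saw x_m but missed the sibling y_n
    return False
```

Reading all of a transaction's writes, or none of them, passes the check; a mixed cut fails it.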


  96. TRANSACTIONS
    RAMP
    GUARANTEE
    ATOMIC VISIBILITY


  97. TRANSACTIONS
    RAMP
    GUARANTEE
    ATOMIC VISIBILITY
    WHILE ENSURING


  98. TRANSACTIONS
    RAMP
    GUARANTEE
    ATOMIC VISIBILITY
    WHILE ENSURING
    PARTITION INDEPENDENCE


  99. TRANSACTIONS
    RAMP
    GUARANTEE
    ATOMIC VISIBILITY
    WHILE ENSURING
    PARTITION INDEPENDENCE
    clients only access servers responsible
    for data in transactions


  100. TRANSACTIONS
    RAMP
    GUARANTEE
    ATOMIC VISIBILITY
    WHILE ENSURING
    PARTITION INDEPENDENCE
    clients only access servers responsible
    for data in transactions
    W(X=1)
    W(Y=1)
    X Y Z



  102. TRANSACTIONS
    RAMP
    GUARANTEE
    ATOMIC VISIBILITY
    WHILE ENSURING
    PARTITION INDEPENDENCE
    AND
    SYNCHRONIZATION INDEPENDENCE
    clients only access servers responsible
    for data in transactions
    transactions always commit* and no
    client can cause another client to block


  103. TRANSACTIONS
    RAMP
    GUARANTEE
    ATOMIC VISIBILITY
    ARE NOT SERIALIZABLE
    DO NOT PREVENT LOST UPDATE
    DO NOT PREVENT WRITE SKEW
    ALLOW CONCURRENT UPDATES


  104. TRANSACTIONS
    RAMP
    GUARANTEE
    ATOMIC VISIBILITY
    ARE NOT SERIALIZABLE
    DO NOT PREVENT LOST UPDATE
    DO NOT PREVENT WRITE SKEW
    ALLOW CONCURRENT UPDATES
    ARE GUIDED BY REAL WORLD USE CASES
    FOREIGN KEY DEPENDENCIES
    SECONDARY INDEXING
    MATERIALIZED VIEWS


  105. TRANSACTIONS
    RAMP
    GUARANTEE
    ATOMIC VISIBILITY
    ARE NOT SERIALIZABLE
    DO NOT PREVENT LOST UPDATE
    DO NOT PREVENT WRITE SKEW
    ALLOW CONCURRENT UPDATES
    ARE GUIDED BY REAL WORLD USE CASES
    FOREIGN KEY DEPENDENCIES
    SECONDARY INDEXING
    MATERIALIZED VIEWS
    Facebook TAO
    LinkedIn Espresso
    Yahoo! PNUTS
    Google Megastore
    Google App Engine
    Twitter Rainbird
    Amazon DynamoDB


  106. STRAWMAN: LOCKING
    X=0 Y=0


  107. STRAWMAN: LOCKING
    X=0 Y=0
    W(X=1)
    W(Y=1)



  110. STRAWMAN: LOCKING
    X=1 Y=1
    W(X=1)
    W(Y=1)



  114. STRAWMAN: LOCKING
    X=1 Y=1
    W(X=1)
    W(Y=1)
    R(X=1)


  115. STRAWMAN: LOCKING
    X=1 Y=1
    W(X=1)
    W(Y=1)
    R(X=1)
    R(Y=1)

    View Slide

  116. Y=0
    STRAWMAN: LOCKING
    X=1
    W(X=1)
    W(Y=1)



  118. Y=0
    STRAWMAN: LOCKING
    X=1
    W(X=1)
    W(Y=1)
    R(X=?)


  119. Y=0
    STRAWMAN: LOCKING
    X=1
    W(X=1)
    W(Y=1)
    R(X=?)
    R(Y=?)


  120. Y=0
    STRAWMAN: LOCKING
    X=1
    W(X=1)
    W(Y=1)
    R(X=?)
    R(Y=?)
    ATOMIC VISIBILITY
    COUPLED WITH
    MUTUAL EXCLUSION


  121. STRAWMAN: LOCKING
    X=1
    W(X=1)
    W(Y=1)
    Y=0
    R(X=?)
    R(Y=?)
    ATOMIC VISIBILITY
    COUPLED WITH
    MUTUAL EXCLUSION



  123. STRAWMAN: LOCKING
    X=1
    W(X=1)
    W(Y=1)
    Y=0
    R(X=?)
    R(Y=?)
    ATOMIC VISIBILITY
    COUPLED WITH
    MUTUAL EXCLUSION
    RTT


  124. STRAWMAN: LOCKING
    X=1
    W(X=1)
    W(Y=1)
    Y=0
    R(X=?)
    R(Y=?)
    ATOMIC VISIBILITY
    COUPLED WITH
    MUTUAL EXCLUSION
    RTT
    unavailability!


  125. X=1
    W(X=1)
    W(Y=1)
    Y=0
    R(X=?)
    R(Y=?)
    ATOMIC VISIBILITY
    COUPLED WITH
    MUTUAL EXCLUSION
    at 0.5 MS
    < 2K TPS!
    unavailable
    during
    failures
    SIMILAR ISSUES IN MVCC,
    PRE-SCHEDULING
    SERIALIZABLE OCC,
    (global timestamp assignment/application)
    (multi-partition validation, liveness)
    (scheduling, multi-partition execution)
    STRAWMAN: LOCKING


  126. X=1
    W(X=1)
    W(Y=1)
    Y=0
    R(X=?)
    R(Y=?)
    ATOMIC VISIBILITY
    COUPLED WITH
    MUTUAL EXCLUSION
    at 0.5 MS
    < 2K TPS!
    unavailable
    during
    failures
    SIMILAR ISSUES IN MVCC,
    PRE-SCHEDULING
    SERIALIZABLE OCC,
    (global timestamp assignment/application)
    (multi-partition validation, liveness)
    (scheduling, multi-partition execution)
    FUNDAMENTAL
    TO
    “STRONG”
    SEMANTICS
    STRAWMAN: LOCKING


  127. BASIC IDEA
    W(X=1)
    W(Y=1)
    Y=0
    R(X=?)
    R(Y=?)
    X=1



  129. BASIC IDEA
    W(X=1)
    W(Y=1)
    Y=0
    R(X=?)
    R(Y=?)
    LET CLIENTS RACE, but
    HAVE READERS “CLEAN UP”
    X=1


  130. BASIC IDEA
    W(X=1)
    W(Y=1)
    Y=0
    R(X=?)
    R(Y=?)
    LET CLIENTS RACE, but
    HAVE READERS “CLEAN UP”
    X=1
    METADATA


  131. BASIC IDEA
    W(X=1)
    W(Y=1)
    Y=0
    R(X=?)
    R(Y=?)
    LET CLIENTS RACE, but
    HAVE READERS “CLEAN UP”
    X=1
    + LIMITED
    MULTI-VERSIONING
    METADATA


  132. BASIC IDEA
    W(X=1)
    W(Y=1)
    Y=0
    R(X=?)
    R(Y=?)
    LET CLIENTS RACE, but
    HAVE READERS “CLEAN UP”
    X=1
    + LIMITED
    MULTI-VERSIONING
    METADATA
    FOR NOW:
    READ-ONLY, WRITE-ONLY TXNS


  133. last committed stamp for x: 0
    RAMP-Fast
    last committed stamp for y: 0



  136. last committed stamp for x: 0
    RAMP-Fast
    known versions of x
    last committed stamp for y: 0
    known versions of y


  137. last committed stamp for x: 0
    RAMP-Fast
    known versions of x
    last committed stamp for y: 0
    known versions of y
    TIMESTAMP VALUE METADATA
    0 NULL {}
    TIMESTAMP VALUE METADATA
    0 NULL {}



  143. last committed stamp for x: 0
    RAMP-Fast
    W(X=1)
    W(Y=1)
    known versions of x
    last committed stamp for y: 0
    known versions of y
    TIMESTAMP VALUE METADATA
    0 NULL {}
    TIMESTAMP VALUE METADATA
    0 NULL {}


  144. last committed stamp for x: 0
    RAMP-Fast
    W(X=1)
    W(Y=1)
    known versions of x
    last committed stamp for y: 0
    known versions of y
    TIMESTAMP VALUE METADATA
    0 NULL {}
    TIMESTAMP VALUE METADATA
    0 NULL {}
    1.) Assign unique (logical)
    transaction timestamp.
    timestamp: 242
    e.g., time concat client ID
    concat sequence number
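The timestamp recipe on this slide (time concat client ID concat sequence number) can be sketched as bit-packing; the 20-bit field widths below are my assumption for illustration, not the paper's:

```python
import time

class TimestampGenerator:
    """Unique logical transaction timestamps: wall-clock millis, then
    client ID, then a per-client sequence number, packed into one
    integer so stamps from different clients can never collide.
    (20-bit client ID and sequence fields are assumed widths.)"""

    def __init__(self, client_id: int):
        self.client_id = client_id & 0xFFFFF
        self.seq = 0

    def next(self) -> int:
        self.seq += 1
        millis = int(time.time() * 1000)
        return (millis << 40) | (self.client_id << 20) | (self.seq & 0xFFFFF)
```

Because the client ID and sequence occupy disjoint low bits, two clients can assign timestamps concurrently with no coordination, which is what synchronization independence requires.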


  145. last committed stamp for x: 0
    RAMP-Fast
    W(X=1)
    W(Y=1)
    known versions of x
    last committed stamp for y: 0
    known versions of y
    TIMESTAMP VALUE METADATA
    0 NULL {}
    TIMESTAMP VALUE METADATA
    0 NULL {}
    1.) Assign unique (logical)
    transaction timestamp.
    timestamp: 242


  146. last committed stamp for x: 0
    RAMP-Fast
    W(X=1)
    W(Y=1)
    known versions of x
    last committed stamp for y: 0
    known versions of y
    TIMESTAMP VALUE METADATA
    0 NULL {}
    TIMESTAMP VALUE METADATA
    0 NULL {}
    1.) Assign unique (logical)
    transaction timestamp.
    2.) Add write to known
    versions on partition.
    timestamp: 242


  147. last committed stamp for x: 0
    RAMP-Fast
    W(X=1)
    W(Y=1)
    known versions of x
    last committed stamp for y: 0
    known versions of y
    TIMESTAMP VALUE METADATA
    0 NULL {}
    TIMESTAMP VALUE METADATA
    0 NULL {}
    1.) Assign unique (logical)
    transaction timestamp.
    2.) Add write to known
    versions on partition.
    242 1
    timestamp: 242


  148. last committed stamp for x: 0
    RAMP-Fast
    W(X=1)
    W(Y=1)
    known versions of x
    last committed stamp for y: 0
    known versions of y
    TIMESTAMP VALUE METADATA
    0 NULL {}
    TIMESTAMP VALUE METADATA
    0 NULL {}
    1.) Assign unique (logical)
    transaction timestamp.
    2.) Add write to known
    versions on partition.
    242 1
    242 1
    timestamp: 242


  149. last committed stamp for x: 0
    RAMP-Fast
    W(X=1)
    W(Y=1)
    known versions of x
    last committed stamp for y: 0
    known versions of y
    TIMESTAMP VALUE METADATA
    0 NULL {}
    TIMESTAMP VALUE METADATA
    0 NULL {}
    1.) Assign unique (logical)
    transaction timestamp.
    2.) Add write to known
    versions on partition.
    3.) Commit and update last
    committed stamp.
    242 1
    242 1
    timestamp: 242


  150. RAMP-Fast
    W(X=1)
    W(Y=1)
    known versions of x
    last committed stamp for y: 0
    known versions of y
    TIMESTAMP VALUE METADATA
    0 NULL {}
    TIMESTAMP VALUE METADATA
    0 NULL {}
    1.) Assign unique (logical)
    transaction timestamp.
    2.) Add write to known
    versions on partition.
    3.) Commit and update last
    committed stamp.
    242 1
    242 1
    last committed stamp for x: 242
    timestamp: 242


  151. RAMP-Fast
    W(X=1)
    W(Y=1)
    known versions of x
    known versions of y
    TIMESTAMP VALUE METADATA
    0 NULL {}
    TIMESTAMP VALUE METADATA
    0 NULL {}
    1.) Assign unique (logical)
    transaction timestamp.
    2.) Add write to known
    versions on partition.
    3.) Commit and update last
    committed stamp.
    242 1
    242 1
    last committed stamp for x: 242
    last committed stamp for y: 242
    timestamp: 242


  152. RAMP-Fast
    W(X=1)
    W(Y=1)
    known versions of x
    known versions of y
    TIMESTAMP VALUE METADATA
    0 NULL {}
    TIMESTAMP VALUE METADATA
    0 NULL {}
    1.) Assign unique (logical)
    transaction timestamp.
    2.) Add write to known
    versions on partition.
    3.) Commit and update last
    committed stamp.
    242 1
    242 1
    timestamp: 242
    R(X=?)
    R(Y=?)
    last committed stamp for x: 242
    last committed stamp for y: 242


  153. RAMP-Fast
    W(X=1)
    W(Y=1)
    known versions of x
    known versions of y
    TIMESTAMP VALUE METADATA
    0 NULL {}
    TIMESTAMP VALUE METADATA
    0 NULL {}
    1.) Assign unique (logical)
    transaction timestamp.
    2.) Add write to known
    versions on partition.
    3.) Commit and update last
    committed stamp.
    242 1
    242 1
    timestamp: 242
    R(X=1)
    R(Y=1)
    last committed stamp for x: 242
    last committed stamp for y: 242

  155. R(X=?)
    R(Y=?)
    RAMP-Fast
    W(X=1)
    W(Y=1)
    known versions of x
    known versions of y
    TIMESTAMP VALUE METADATA
    0 NULL {}
    TIMESTAMP VALUE METADATA
    0 NULL {}
    1.) Assign unique (logical)
    transaction timestamp.
    2.) Add write to known
    versions on partition.
    3.) Commit and update last
    committed stamp.
    242 1
    242 1
    timestamp: 242
    last committed stamp for x: 242
    last committed stamp for y: 0

  157. R(X=?)
    R(Y=?)
    RAMP-Fast
    W(X=1)
    W(Y=1)
    known versions of x
    known versions of y
    TIMESTAMP VALUE METADATA
    0 NULL {}
    TIMESTAMP VALUE METADATA
    0 NULL {}
    1.) Assign unique (logical)
    transaction timestamp.
    2.) Add write to known
    versions on partition.
    3.) Commit and update last
    committed stamp.
    242 1
    242 1
    timestamp: 242
    RACE!!!
    last committed stamp for x: 242
    last committed stamp for y: 0

  158. R(X=?)
    R(Y=?)
    RAMP-Fast
    W(X=1)
    W(Y=1)
    known versions of x
    known versions of y
    TIMESTAMP VALUE METADATA
    0 NULL {}
    TIMESTAMP VALUE METADATA
    0 NULL {}
    1.) Assign unique (logical)
    transaction timestamp.
    2.) Add write to known
    versions on partition.
    3.) Commit and update last
    committed stamp.
    242 1
    242 1
    timestamp: 242
    RACE!!!
    R(X=1)
    R(Y=0)
    last committed stamp for x: 242
    last committed stamp for y: 0

  160. RAMP-Fast
    W(X=1)
    W(Y=1)
    known versions of x
    known versions of y
    TIMESTAMP VALUE METADATA
    0 NULL {}
    TIMESTAMP VALUE METADATA
    0 NULL {}
    1.) Assign unique (logical)
    transaction timestamp.
    2.) Add write to known
    versions on partition.
    3.) Commit and update last
    committed stamp.
    242 1
    242 1
    timestamp: 242
    RACE!!!
    R(X=1)
    R(Y=0)
    last committed stamp for x: 242
    last committed stamp for y: 0

  161. RAMP-Fast
    W(X=1)
    W(Y=1)
    known versions of x
    known versions of y
    TIMESTAMP VALUE METADATA
    0 NULL {}
    TIMESTAMP VALUE METADATA
    0 NULL {}
    1.) Assign unique (logical)
    transaction timestamp.
    2.) Add write to known
    versions on partition.
    3.) Commit and update last
    committed stamp.
    242 1
    242 1
    timestamp: 242
    RACE!!!
    R(X=1)
    R(Y=0)
    last committed stamp for x: 242
    last committed stamp for y: 0
    RECORD THE ITEMS
    WRITTEN IN THE
    TRANSACTION

  162. RAMP-Fast
    W(X=1)
    W(Y=1)
    known versions of x
    known versions of y
    TIMESTAMP VALUE METADATA
    0 NULL {}
    TIMESTAMP VALUE METADATA
    0 NULL {}
    1.) Assign unique (logical)
    transaction timestamp.
    2.) Add write to known
    versions on partition.
    3.) Commit and update last
    committed stamp.
    242 1
    242 1
    timestamp: 242
    RACE!!!
    R(X=1)
    R(Y=0)
    last committed stamp for x: 242
    last committed stamp for y: 0
    {y}
    RECORD THE ITEMS
    WRITTEN IN THE
    TRANSACTION

  163. RAMP-Fast
    W(X=1)
    W(Y=1)
    known versions of x
    known versions of y
    TIMESTAMP VALUE METADATA
    0 NULL {}
    TIMESTAMP VALUE METADATA
    0 NULL {}
    1.) Assign unique (logical)
    transaction timestamp.
    2.) Add write to known
    versions on partition.
    3.) Commit and update last
    committed stamp.
    242 1
    242 1
    timestamp: 242
    RACE!!!
    R(X=1)
    R(Y=0)
    last committed stamp for x: 242
    last committed stamp for y: 0
    {y}
    {x}
    RECORD THE ITEMS
    WRITTEN IN THE
    TRANSACTION

  164. RAMP-Fast
    known versions of x
    known versions of y
    TIMESTAMP VALUE METADATA
    0 NULL {}
    TIMESTAMP VALUE METADATA
    0 NULL {}
    242 1
    242 1
    last committed stamp for x: 242
    last committed stamp for y: 0
    {y}
    {x}
    R(X=?)
    R(Y=?)

  167. RAMP-Fast
    known versions of x
    known versions of y
    TIMESTAMP VALUE METADATA
    0 NULL {}
    TIMESTAMP VALUE METADATA
    0 NULL {}
    242 1
    242 1
    last committed stamp for x: 242
    last committed stamp for y: 0
    {y}
    {x}
    R(X=?)
    R(Y=?)
    1.) Read last committed:

  168. RAMP-Fast
    known versions of x
    known versions of y
    TIMESTAMP VALUE METADATA
    0 NULL {}
    TIMESTAMP VALUE METADATA
    0 NULL {}
    242 1
    242 1
    last committed stamp for x: 242
    last committed stamp for y: 0
    {y}
    {x}
    R(X=?)
    R(Y=?)
    1.) Read last committed:
    X=1 @ 242, {Y}

  169. RAMP-Fast
    known versions of x
    known versions of y
    TIMESTAMP VALUE METADATA
    0 NULL {}
    TIMESTAMP VALUE METADATA
    0 NULL {}
    242 1
    242 1
    last committed stamp for x: 242
    last committed stamp for y: 0
    {y}
    {x}
    R(X=?)
    R(Y=?)
    1.) Read last committed:
    X=1 @ 242, {Y}
    Y=NULL @ 0, {}

  170. RAMP-Fast
    known versions of x
    known versions of y
    TIMESTAMP VALUE METADATA
    0 NULL {}
    TIMESTAMP VALUE METADATA
    0 NULL {}
    242 1
    242 1
    last committed stamp for x: 242
    last committed stamp for y: 0
    {y}
    {x}
    R(X=?)
    R(Y=?)
    1.) Read last committed:
    2.) Calculate missing versions:
    X=1 @ 242, {Y}
    Y=NULL @ 0, {}

  171. RAMP-Fast
    known versions of x
    known versions of y
    TIMESTAMP VALUE METADATA
    0 NULL {}
    TIMESTAMP VALUE METADATA
    0 NULL {}
    242 1
    242 1
    last committed stamp for x: 242
    last committed stamp for y: 0
    {y}
    {x}
    R(X=?)
    R(Y=?)
    1.) Read last committed:
    2.) Calculate missing versions:
    X=1 @ 242, {Y}
    Y=NULL @ 0, {}
    ITEM HIGHEST TS
    X 242
    Y 242

  173. RAMP-Fast
    known versions of x
    known versions of y
    TIMESTAMP VALUE METADATA
    0 NULL {}
    TIMESTAMP VALUE METADATA
    0 NULL {}
    242 1
    242 1
    last committed stamp for x: 242
    last committed stamp for y: 0
    {y}
    {x}
    R(X=?)
    R(Y=?)
    1.) Read last committed:
    2.) Calculate missing versions:
    3.) Fetch missing versions.
    X=1 @ 242, {Y}
    Y=NULL @ 0, {}
    ITEM HIGHEST TS
    X 242
    Y 242

  174. RAMP-Fast
    known versions of x
    known versions of y
    TIMESTAMP VALUE METADATA
    0 NULL {}
    TIMESTAMP VALUE METADATA
    0 NULL {}
    242 1
    242 1
    last committed stamp for x: 242
    last committed stamp for y: 0
    {y}
    {x}
    R(X=?)
    R(Y=?)
    1.) Read last committed:
    2.) Calculate missing versions:
    3.) Fetch missing versions.
    X=1 @ 242, {Y}
    Y=NULL @ 0, {}
    ITEM HIGHEST TS
    X 242
    Y 242
    Y=1 @ 242, {X}
    (Send required timestamp in request)

  175. RAMP-Fast
    known versions of x
    known versions of y
    TIMESTAMP VALUE METADATA
    0 NULL {}
    TIMESTAMP VALUE METADATA
    0 NULL {}
    242 1
    242 1
    last committed stamp for x: 242
    last committed stamp for y: 0
    {y}
    {x}
    R(X=?)
    R(Y=?)
    1.) Read last committed:
    2.) Calculate missing versions:
    3.) Fetch missing versions.
    X=1 @ 242, {Y}
    Y=NULL @ 0, {}
    ITEM HIGHEST TS
    X 242
    Y 242
    Y=1 @ 242, {X}
    (Send required timestamp in request)
    2PC ENSURES NO
    WAIT AT SERVER

  176. RAMP-Fast
    known versions of x
    known versions of y
    TIMESTAMP VALUE METADATA
    0 NULL {}
    TIMESTAMP VALUE METADATA
    0 NULL {}
    242 1
    242 1
    last committed stamp for x: 242
    last committed stamp for y: 0
    {y}
    {x}
    R(X=?)
    R(Y=?)
    1.) Read last committed:
    2.) Calculate missing versions:
    3.) Fetch missing versions.
    X=1 @ 242, {Y}
    Y=NULL @ 0, {}
    ITEM HIGHEST TS
    X 242
    Y 242
    Y=1 @ 242, {X}
    (Send required timestamp in request)
    4.) Return resulting set.
    R(X=1)
    R(Y=1)
    2PC ENSURES NO
    WAIT AT SERVER

  177. RAMP-Fast

  178. RAMP-Fast
    2 RTT writes:

  179. RAMP-Fast
    2 RTT writes:
    2PC, without blocking synchronization

  180. RAMP-Fast
    2 RTT writes:
    2PC, without blocking synchronization
    ENSURES READERS
    NEVER WAIT!

  181. RAMP-Fast
    2 RTT writes:
    2PC, without blocking synchronization
    metadata size linear in transaction size
    ENSURES READERS
    NEVER WAIT!

  182. RAMP-Fast
    2 RTT writes:
    2PC, without blocking synchronization
    metadata size linear in transaction size
    1 RTT reads:
    in race-free case
    ENSURES READERS
    NEVER WAIT!

  183. RAMP-Fast
    2 RTT writes:
    2PC, without blocking synchronization
    metadata size linear in transaction size
    1 RTT reads:
    in race-free case
    2 RTT reads:
    otherwise
    ENSURES READERS
    NEVER WAIT!

  184. RAMP-Fast
    2 RTT writes:
    2PC, without blocking synchronization
    metadata size linear in transaction size
    1 RTT reads:
    in race-free case
    2 RTT reads:
    otherwise
    no fast-path
    synchronization
    ENSURES READERS
    NEVER WAIT!

  185. RAMP-Fast
    2 RTT writes:
    2PC, without blocking synchronization
    metadata size linear in transaction size
    1 RTT reads:
    in race-free case
    2 RTT reads:
    otherwise
    no fast-path
    synchronization
    ENSURES READERS
    NEVER WAIT!
    CAN WE USE LESS
    METADATA?

  186. RAMP-Small

  187. RAMP-Small
    2 RTT writes:

  188. RAMP-Small
    2 RTT writes:
    same basic protocol as RAMP-Fast
    but drop all RAMP-Fast metadata

  189. RAMP-Small
    2 RTT writes:
    same basic protocol as RAMP-Fast
    but drop all RAMP-Fast metadata
    2 RTT reads

  190. RAMP-Small
    2 RTT writes:
    same basic protocol as RAMP-Fast
    but drop all RAMP-Fast metadata
    2 RTT reads always

  191. RAMP-Small
    2 RTT writes:
    same basic protocol as RAMP-Fast
    but drop all RAMP-Fast metadata
    2 RTT reads
    1.) For each item, fetch the highest committed timestamp.
    2.) Request highest matching write with timestamp in step 1.
    always

  192. RAMP-Small
    2 RTT writes:
    same basic protocol as RAMP-Fast
    but drop all RAMP-Fast metadata
    2 RTT reads
    INTUITION:
    1.) For each item, fetch the highest committed timestamp.
    2.) Request highest matching write with timestamp in step 1.
    always

  193. RAMP-Small
    2 RTT writes:
    same basic protocol as RAMP-Fast
    but drop all RAMP-Fast metadata
    2 RTT reads
    INTUITION:
    1.) For each item, fetch the highest committed timestamp.
    2.) Request highest matching write with timestamp in step 1.
    X time 523
    always

  194. RAMP-Small
    2 RTT writes:
    same basic protocol as RAMP-Fast
    but drop all RAMP-Fast metadata
    2 RTT reads
    INTUITION:
    1.) For each item, fetch the highest committed timestamp.
    2.) Request highest matching write with timestamp in step 1.
    X time 523
    Y time 247
    always

  195. RAMP-Small
    2 RTT writes:
    same basic protocol as RAMP-Fast
    but drop all RAMP-Fast metadata
    2 RTT reads
    INTUITION:
    1.) For each item, fetch the highest committed timestamp.
    2.) Request highest matching write with timestamp in step 1.
    X time 523
    Y time 247
    Z time 842
    always

  196. RAMP-Small
    2 RTT writes:
    same basic protocol as RAMP-Fast
    but drop all RAMP-Fast metadata
    2 RTT reads
    INTUITION:
    1.) For each item, fetch the highest committed timestamp.
    2.) Request highest matching write with timestamp in step 1.
    X time 523
    Y time 247
    Z time 842
    {247, 523, 842}
    always

  197. RAMP-Small
    2 RTT writes:
    same basic protocol as RAMP-Fast
    but drop all RAMP-Fast metadata
    2 RTT reads
    partial commits will be in this set
    INTUITION:
    1.) For each item, fetch the highest committed timestamp.
    2.) Request highest matching write with timestamp in step 1.
    X time 523
    Y time 247
    Z time 842
    {247, 523, 842}
    always

  198. RAMP-Small
    2 RTT writes:
    same basic protocol as RAMP-Fast
    but drop all RAMP-Fast metadata
    2 RTT reads
    partial commits will be in this set
    INTUITION:
    1.) For each item, fetch the highest committed timestamp.
    2.) Request highest matching write with timestamp in step 1.
    X time 523
    Y time 247
    Z time 842
    {247, 523, 842}
    send it to all participating servers
    always
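The RAMP-Small two-round read can be sketched directly from the intuition above: round 1 collects the highest committed timestamps, round 2 sends that whole set to every participating server. A minimal sketch, assuming each partition is a dict with a plain `versions` map and a `last_committed` stamp (no per-write metadata); the layout and names are ours:

```python
def ramp_small_read(partitions, items):
    """Each partition: {"versions": {ts: value}, "last_committed": ts}."""
    # Round 1: for each item, fetch only the highest committed timestamp.
    ts_set = {partitions[item]["last_committed"] for item in items}
    # Round 2: send the whole set to every participating server; each
    # returns its highest version whose timestamp appears in the set.
    # A partially committed sibling write has its timestamp in this set,
    # so partial commits are always surfaced.
    result = {}
    for item in items:
        versions = partitions[item]["versions"]
        matching = [ts for ts in versions if ts in ts_set]
        result[item] = versions[max(matching)]
    return result
```

In the racy case (y prepared at 242 but its last-committed stamp still 0), x contributes 242 to the set, so y's prepared version is picked up in round 2.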

  199. RAMP Summary
Algorithm     Write RTT   Read RTT (best)   Read RTT (worst)   Metadata
RAMP-Fast     2           1                 2                  O(txn len): write-set summary
RAMP-Small    2           2                 2                  O(1): timestamp
RAMP-Hybrid   2           1+ε               2                  O(1): Bloom filter

  204. RAMP Summary
Algorithm     Write RTT   Read RTT (best)   Read RTT (worst)   Metadata
RAMP-Fast     2           1                 2                  O(txn len): write-set summary
RAMP-Small    2           2                 2                  O(1): timestamp
RAMP-Hybrid   2           1+ε               2                  O([txn len] · log(1/ε) / log²(2)): Bloom filter

  206. RAMP Summary
Algorithm     Write RTT   Read RTT (best)   Read RTT (worst)   Metadata
RAMP-Fast     2           1                 2                  O(txn len): write-set summary
RAMP-Small    2           2                 2                  O(1): timestamp
RAMP-Hybrid   2           1+ε               2                  O([txn len] · log(1/ε) / log²(2)): Bloom filter
BLOOM FILTER SUMMARIZES WRITE SET
FALSE POSITIVES: EXTRA RTTs

  208. RAMP Summary
Algorithm     Write RTT   Read RTT (best)   Read RTT (worst)   Metadata
RAMP-Fast     2           1                 2                  O(txn len): write-set summary
RAMP-Small    2           2                 2                  O(1): timestamp
RAMP-Hybrid   2           1+ε               2                  O(B): Bloom filter
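RAMP-Hybrid's O(B) metadata is a Bloom filter over the transaction's write set: no false negatives, and a false positive only costs an extra read RTT, never correctness. A toy sizing-plus-filter sketch (the hash construction here is ours, purely illustrative):

```python
import hashlib
import math

def bloom_params(n_items, fp_rate):
    # Standard Bloom sizing: m = n * ln(1/eps) / (ln 2)^2 bits,
    # k = (m / n) * ln 2 hash functions.
    m = math.ceil(n_items * math.log(1 / fp_rate) / math.log(2) ** 2)
    k = max(1, round(m / n_items * math.log(2)))
    return m, k

def bloom_add(bits, m, k, item):
    # Set k bit positions derived from k salted hashes of the item.
    for i in range(k):
        bits.add(int(hashlib.sha256(f"{i}:{item}".encode()).hexdigest(), 16) % m)

def bloom_maybe_contains(bits, m, k, item):
    # No false negatives: a sibling write is never missed; a false
    # positive merely triggers an unnecessary second-round fetch.
    return all(
        int(hashlib.sha256(f"{i}:{item}".encode()).hexdigest(), 16) % m in bits
        for i in range(k)
    )
```

For a 4-item transaction at a 1% false-positive rate this sizing yields a filter of a few dozen bits, independent of item-name length.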

  209. RAMP Summary
Algorithm     Write RTT   Read RTT (best)   Read RTT (worst)   Metadata
RAMP-Fast     2           1                 2                  O(txn len): write-set summary
RAMP-Small    2           2                 2                  O(1): timestamp
RAMP-Hybrid   2           1+ε               2                  O(B): Bloom filter
    • AVOID IN-PLACE UPDATES

  210. RAMP Summary
Algorithm     Write RTT   Read RTT (best)   Read RTT (worst)   Metadata
RAMP-Fast     2           1                 2                  O(txn len): write-set summary
RAMP-Small    2           2                 2                  O(1): timestamp
RAMP-Hybrid   2           1+ε               2                  O(B): Bloom filter
    • AVOID IN-PLACE UPDATES
    • EMBRACE RACES TO IMPROVE CONCURRENCY

  211. RAMP Summary
Algorithm     Write RTT   Read RTT (best)   Read RTT (worst)   Metadata
RAMP-Fast     2           1                 2                  O(txn len): write-set summary
RAMP-Small    2           2                 2                  O(1): timestamp
RAMP-Hybrid   2           1+ε               2                  O(B): Bloom filter
    • AVOID IN-PLACE UPDATES
    • EMBRACE RACES TO IMPROVE CONCURRENCY
    • ALLOW READERS TO REPAIR PARTIAL WRITES

  212. RAMP Summary
Algorithm     Write RTT   Read RTT (best)   Read RTT (worst)   Metadata
RAMP-Fast     2           1                 2                  O(txn len): write-set summary
RAMP-Small    2           2                 2                  O(1): timestamp
RAMP-Hybrid   2           1+ε               2                  O(B): Bloom filter
    • AVOID IN-PLACE UPDATES
    • EMBRACE RACES TO IMPROVE CONCURRENCY
    • ALLOW READERS TO REPAIR PARTIAL WRITES
    • USE 2PC TO AVOID READER STALLS

  213. Additional Details

  214. Additional Details
    Garbage collection:
    limit read transaction duration to K seconds
    GC overwritten versions after K seconds

  215. Additional Details
    Garbage collection:
    limit read transaction duration to K seconds
    GC overwritten versions after K seconds
Replication:
    paper assumes linearizable masters
    extendable to “AP” systems
    see HAT by Bailis et al., VLDB 2014

  216. Additional Details
    Garbage collection:
    limit read transaction duration to K seconds
    GC overwritten versions after K seconds
    Failure handling:
    blocked 2PC rounds do not block clients
    stalled commits? versions are not GC’d
    if desirable, use CTP termination protocol
Replication:
    paper assumes linearizable masters
    extendable to “AP” systems
    see HAT by Bailis et al., VLDB 2014
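The K-second garbage-collection rule can be sketched in Python. The version layout (timestamp -> (value, wall-clock commit time)) and timings here are illustrative assumptions, not the paper's code:

```python
def gc_overwritten(versions, last_committed, now, k_seconds):
    """versions: {ts: (value, wall-clock commit time)}. Readers are limited
    to K-second transactions, so an overwritten version older than K seconds
    can no longer be requested by any live reader and is safe to drop."""
    return {
        ts: v
        for ts, v in versions.items()
        if ts == last_committed or now - v[1] < k_seconds
    }
```

The last-committed version is always retained; only overwritten versions age out.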

  217. RAMP PERFORMANCE
Algorithm     Write RTT   Read RTT (best)   Read RTT (worst)   Metadata
RAMP-Fast     2           1                 2                  O(txn len): write-set summary
RAMP-Small    2           2                 2                  O(1): timestamp
RAMP-Hybrid   2           1+ε               2                  O(B): Bloom filter

  218. RAMP PERFORMANCE
Algorithm     Write RTT   Read RTT (best)   Read RTT (worst)   Metadata
RAMP-Fast     2           1                 2                  O(txn len): write-set summary
RAMP-Small    2           2                 2                  O(1): timestamp
RAMP-Hybrid   2           1+ε               2                  O(B): Bloom filter
    EVALUATED ON EC2
    cr1.8xlarge instances
    (cluster size: 1-100 servers; default: 5)
    open sourced on GitHub; see link at end of talk

  219. YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn
[chart: Throughput (txn/s), 0-180K, vs. Concurrent Clients, 0-10,000]

  220. YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn
[chart: Throughput (txn/s), 0-180K, vs. Concurrent Clients, 0-10,000]
    RAMP-H NWNR LWNR LWSR LWLR E-PCI
    No Concurrency Control

  221. YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn
[chart: Throughput (txn/s), 0-180K, vs. Concurrent Clients, 0-10,000]
    Doesn’t provide
    atomic visibility
    RAMP-H NWNR LWNR LWSR LWLR E-PCI
    No Concurrency Control

  223. YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn
[chart: Throughput (txn/s), 0-180K, vs. Concurrent Clients, 0-10,000]
    RAMP-H NWNR LWNR LWSR LWLR E-PCI
    No Concurrency Control
    LWSR LWLR E-PCI
    Serializable 2PL

  224. YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn
[chart: Throughput (txn/s), 0-180K, vs. Concurrent Clients, 0-10,000]
    RAMP-H NWNR LWNR LWSR LWLR E-PCI
    No Concurrency Control
    LWSR LWLR E-PCI
    Serializable 2PL
    NWNR LWNR LWSR LWLR E-PCI
    Write Locks Only

  225. YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn
[chart: Throughput (txn/s), 0-180K, vs. Concurrent Clients, 0-10,000]
    Also doesn’t provide
    atomic visibility
    RAMP-H NWNR LWNR LWSR LWLR E-PCI
    No Concurrency Control
    LWSR LWLR E-PCI
    Serializable 2PL
    NWNR LWNR LWSR LWLR E-PCI
    Write Locks Only

  226. YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn
[chart: Throughput (txn/s), 0-180K, vs. Concurrent Clients, 0-10,000]
    Representative of
    coordinated approaches
    RAMP-H NWNR LWNR LWSR LWLR E-PCI
    No Concurrency Control
    LWSR LWLR E-PCI
    Serializable 2PL
    NWNR LWNR LWSR LWLR E-PCI
    Write Locks Only

  228. YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn
[chart: Throughput (txn/s), 0-180K, vs. Concurrent Clients, 0-10,000]
    RAMP-H NWNR LWNR LWSR LWLR E-PCI
    No Concurrency Control
    LWSR LWLR E-PCI
    Serializable 2PL
    NWNR LWNR LWSR LWLR E-PCI
    Write Locks Only
    RAMP-F RAMP-S
    RAMP-Fast

  229. YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn
[chart: Throughput (txn/s), 0-180K, vs. Concurrent Clients, 0-10,000]
Within ~5% of baseline. Latency in paper (comparable).
    RAMP-H NWNR LWNR LWSR LWLR E-PCI
    No Concurrency Control
    LWSR LWLR E-PCI
    Serializable 2PL
    NWNR LWNR LWSR LWLR E-PCI
    Write Locks Only
    RAMP-F RAMP-S
    RAMP-Fast

  231. YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn
[chart: Throughput (txn/s), 0-180K, vs. Concurrent Clients, 0-10,000]
    RAMP-H NWNR LWNR LWSR LWLR E-PCI
    No Concurrency Control
    LWSR LWLR E-PCI
    Serializable 2PL
    NWNR LWNR LWSR LWLR E-PCI
    Write Locks Only
    RAMP-F RAMP-S
    RAMP-Fast
    RAMP-F RAMP-S RAMP-H
    RAMP-Small

  232. YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn
[chart: Throughput (txn/s), 0-180K, vs. Concurrent Clients, 0-10,000]
    Always needs
    2 RTT reads
    RAMP-H NWNR LWNR LWSR LWLR E-PCI
    No Concurrency Control
    LWSR LWLR E-PCI
    Serializable 2PL
    NWNR LWNR LWSR LWLR E-PCI
    Write Locks Only
    RAMP-F RAMP-S
    RAMP-Fast
    RAMP-F RAMP-S RAMP-H
    RAMP-Small

  234. RAMP-F RAMP-S RAMP-H NWNR
    RAMP-Hybrid
    YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn
[chart: Throughput (txn/s), 0-180K, vs. Concurrent Clients, 0-10,000]
    RAMP-H NWNR LWNR LWSR LWLR E-PCI
    No Concurrency Control
    LWSR LWLR E-PCI
    Serializable 2PL
    NWNR LWNR LWSR LWLR E-PCI
    Write Locks Only
    RAMP-F RAMP-S
    RAMP-Fast
    RAMP-F RAMP-S RAMP-H
    RAMP-Small

  236. YCSB: WorkloadA, 1M items, 4 items/txn, 5K clients
[chart: Throughput (txn/s), 0-180K, vs. Percentage Reads, 0-100]

  237. YCSB: WorkloadA, 1M items, 4 items/txn, 5K clients
[chart: Throughput (txn/s), 0-180K, vs. Percentage Reads, 0-100]
    RAMP-H NWNR LWNR LWSR LWLR E-PCI
    No Concurrency Control

  238. YCSB: WorkloadA, 1M items, 4 items/txn, 5K clients
[chart: Throughput (txn/s), 0-180K, vs. Percentage Reads, 0-100]
    RAMP-H NWNR LWNR LWSR LWLR E-PCI
    No Concurrency Control
    LWSR LWLR E-PCI
    Serializable 2PL
    NWNR LWNR LWSR LWLR E-PCI
    Write Locks Only

  239. RAMP-H NWNR LWNR LWSR LWLR E-PCI
    No Concurrency Control
    LWSR LWLR E-PCI
    Serializable 2PL
    NWNR LWNR LWSR LWLR E-PCI
    Write Locks Only
    RAMP-F RAMP-S
    RAMP-Fast
    RAMP-F RAMP-S RAMP-H
    RAMP-Small
    YCSB: WorkloadA, 1M items, 4 items/txn, 5K clients
[chart: Throughput (txn/s), 0-180K, vs. Percentage Reads, 0-100]
    RAMP-F RAMP-S RAMP-H NWNR
    RAMP-Hybrid

  240. RAMP-H NWNR LWNR LWSR LWLR E-PCI
    No Concurrency Control
    LWSR LWLR E-PCI
    Serializable 2PL
    NWNR LWNR LWSR LWLR E-PCI
    Write Locks Only
    RAMP-F RAMP-S
    RAMP-Fast
    RAMP-F RAMP-S RAMP-H
    RAMP-Small
    YCSB: WorkloadA, 1M items, 4 items/txn, 5K clients
[chart: Throughput (txn/s), 0-180K, vs. Percentage Reads, 0-100]
    RAMP-F RAMP-S RAMP-H NWNR
    RAMP-Hybrid
    Linear scaling; due to
    2RTT writes, races

  242. YCSB: uniform access, 1M items, 4 items/txn, 95% reads
[chart: Throughput (ops/s), 0-8M, vs. Number of Servers, 0-100]

  243. RAMP-H NWNR LWNR LWSR LWLR E-PCI
    No Concurrency Control
    YCSB: uniform access, 1M items, 4 items/txn, 95% reads
[chart: Throughput (ops/s), 0-8M, vs. Number of Servers, 0-100]

  244. RAMP-H NWNR LWNR LWSR LWLR E-PCI
    No Concurrency Control RAMP-F RAMP-S
    RAMP-Fast
    RAMP-F RAMP-S RAMP-H
    RAMP-Small
    RAMP-F RAMP-S RAMP-H NWNR
    RAMP-Hybrid
    YCSB: uniform access, 1M items, 4 items/txn, 95% reads
[chart: Throughput (ops/s), 0-8M, vs. Number of Servers, 0-100]

  245. RAMP-H NWNR LWNR LWSR LWLR E-PCI
    No Concurrency Control RAMP-F RAMP-S
    RAMP-Fast
    RAMP-F RAMP-S RAMP-H
    RAMP-Small
    RAMP-F RAMP-S RAMP-H NWNR
    RAMP-Hybrid
    YCSB: uniform access, 1M items, 4 items/txn, 95% reads
[chart: operations/s/server, 0-200K, vs. Number of Servers, 0-100]

  246. RAMP PERFORMANCE
Algorithm     Write RTT   Read RTT (best)   Read RTT (worst)   Metadata
RAMP-Fast     2           1                 2                  O(txn len): write-set summary
RAMP-Small    2           2                 2                  O(1): timestamp
RAMP-Hybrid   2           1+ε               2                  O(B): Bloom filter

  247. RAMP PERFORMANCE
Algorithm     Write RTT   Read RTT (best)   Read RTT (worst)   Metadata
RAMP-Fast     2           1                 2                  O(txn len): write-set summary
RAMP-Small    2           2                 2                  O(1): timestamp
RAMP-Hybrid   2           1+ε               2                  O(B): Bloom filter
    More results in paper:
    Transaction length, contention,
    value size, latency, failures

    View Slide
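The table above counts round trips for the read protocol itself. As a rough single-process illustration of how RAMP-Fast achieves 1 RTT reads in the best case and 2 RTT under contention (a sketch only; the names and classes here are illustrative, and the repo's Python "executable pseudocode" is the authoritative version):

```python
# Simplified, single-process sketch of RAMP-Fast reads and writes.
# Writes attach their write-set as metadata; readers use that metadata
# to detect fractured reads and repair them with a second round.

class Version:
    def __init__(self, key, value, ts, writeset):
        self.key, self.value, self.ts = key, value, ts
        self.writeset = writeset  # keys written by the same transaction

class Partition:
    def __init__(self):
        self.versions = {}   # (key, ts) -> Version: limited multi-versioning
        self.committed = {}  # key -> highest committed timestamp

    def prepare(self, v):
        # Round 1 of a write: store the version but don't reveal it yet.
        self.versions[(v.key, v.ts)] = v

    def commit(self, key, ts):
        # Round 2 of a write: make the version visible to first-round reads.
        self.committed[key] = max(self.committed.get(key, 0), ts)

    def get(self, key, ts=None):
        # Latest committed version, or a specific (possibly prepared) version.
        if ts is None:
            ts = self.committed.get(key, 0)
        return self.versions.get((key, ts))

def ramp_fast_read(partitions, keys):
    # Round 1: independent reads of the latest committed versions.
    result = {k: partitions[k].get(k) for k in keys}
    # Use write-set metadata to find siblings newer than what we read.
    required = {}
    for v in result.values():
        if v is None:
            continue
        for sibling in v.writeset:
            if sibling in result:
                have = result[sibling]
                if have is None or have.ts < v.ts:
                    required[sibling] = max(required.get(sibling, 0), v.ts)
    # Round 2 (only under contention): fetch the exact missing versions.
    for k, ts in required.items():
        result[k] = partitions[k].get(k, ts)
    return {k: (v.value if v else None) for k, v in result.items()}
```

With a transaction prepared everywhere but committed on only one partition, a reader still sees all of it or none of it: the metadata from the committed half forces a second-round fetch of the prepared half.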

  248. FOREIGN KEY DEPENDENCIES
    SECONDARY INDEXING
    MATERIALIZED VIEWS
    HOW RAMP HANDLES:

    View Slide

  252. FOREIGN KEY DEPENDENCIES
    SECONDARY INDEXING
    MATERIALIZED VIEWS
    HOW RAMP HANDLES:
    MULTI-PUT
    (DELETES VIA
    TOMBSTONES)

    View Slide

  258. FOREIGN KEY DEPENDENCIES
    SECONDARY INDEXING
    MATERIALIZED VIEWS
    HOW RAMP HANDLES:
    Maintain list of matching record IDs and versions
    e.g., HAS_BEARD={52@512, 412@52, 123@512}
    merge lists on commit/read (LWW by timestamp for conflicts)
    LOOKUPs: READ INDEX, THEN FETCH DATA
    SIMILAR FOR
    SELECT/PROJECT

    View Slide
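The index-merge rule above (per-record-ID last-writer-wins by timestamp) can be sketched as follows; the function name and the (record_id, timestamp) tuple layout are illustrative, not from the paper:

```python
# LWW merge of secondary-index entry lists, as in HAS_BEARD={52@512, ...}:
# each entry pairs a record ID with the version (timestamp) that matched,
# and conflicts for the same record ID keep the highest timestamp.

def merge_index_entries(*entry_lists):
    """Each entry is (record_id, timestamp); keep max timestamp per ID."""
    merged = {}
    for entries in entry_lists:
        for record_id, ts in entries:
            if ts > merged.get(record_id, -1):
                merged[record_id] = ts
    return sorted(merged.items())

# e.g., entry lists seen at commit and at read time:
at_commit = [(52, 512), (412, 52)]
at_read = [(412, 600), (123, 512)]
# merge_index_entries(at_commit, at_read)
# -> [(52, 512), (123, 512), (412, 600)]
```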

  259. SERIALIZABILITY
    SNAPSHOT ISOLATION
    REPEATABLE READ
    (PL-2.99)
    CURSOR STABILITY
    READ UNCOMMITTED
    READ COMMITTED
    CAUSAL
    PRAM
    RYW
    LINEARIZABILITY
    EVENTUAL CONSISTENCY
    REQUIRE SYNCHRONOUS COORDINATION
    INSUFFICIENT
    SUFFICIENT
    COORDINATION-FREE

    View Slide

  262. SERIALIZABILITY
    SNAPSHOT ISOLATION
    REPEATABLE READ
    (PL-2.99)
    CURSOR STABILITY
    READ UNCOMMITTED
    READ COMMITTED CAUSAL
    PRAM
    RYW
    LINEARIZABILITY
    EVENTUAL CONSISTENCY
    COORDINATION-FREE
    ATOMIC VISIBILITY VIA RAMP
    INSUFFICIENT
    SUFFICIENT
    REQUIRE SYNCHRONOUS COORDINATION

    View Slide

  266. RAMP IN CASSANDRA
    USES
    REQUIREMENTS
    IMPLEMENTATION

    View Slide

  267. RAMP IN CASSANDRA
    STRAIGHTFORWARD USES:
    •Add atomic visibility to atomic batch operations
    •Expose as CQL isolation level
    • USING CONSISTENCY READ_ATOMIC
    •Encourage use in multi-put, multi-get
    •Treat as basis for global secondary indexing
    •CREATE GLOBAL INDEX on users (age)

    View Slide

  268. RAMP IN CASSANDRA
    REQUIREMENTS:
    •Unique timestamp generation for transactions
    •Use node ID from ring
    •Other form of UUID
    •Hash transaction contents*
    •Limited multi-versioning for prepared and old values
    •RAMP doesn’t actually require true MVCC
    •One proposal: keep a lookaside cache

    View Slide
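The unique-timestamp requirement above could be met, for example, by packing a local sequence number with the node's ring ID; the bit width and helper below are hypothetical choices for illustration, not a proposed Cassandra API:

```python
# Globally unique transaction timestamps from (local sequence, node ID):
# two nodes can never collide because the low bits always differ.

import itertools

NODE_ID_BITS = 16  # arbitrary width for this sketch

def make_timestamp_generator(node_id):
    assert 0 <= node_id < (1 << NODE_ID_BITS)
    counter = itertools.count(1)
    def next_ts():
        # High bits: monotonically increasing local sequence number.
        # Low bits: the node's ID (e.g., taken from the ring).
        return (next(counter) << NODE_ID_BITS) | node_id
    return next_ts

ts_a = make_timestamp_generator(node_id=7)
ts_b = make_timestamp_generator(node_id=9)
assert ts_a() != ts_b()
```

Seeding the high bits from a coarse wall clock instead of a plain counter would additionally give timestamps a rough real-time order, which helps LWW merges behave sensibly.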

  272. RAMP IN CASSANDRA
    POSSIBLE IMPLEMENTATION:
    Lookaside cache for prepared and old values
    Standard C* table stores
    last committed write
    Shadow table stores
    prepared-but-not-committed
    and overwritten versions
    [Diagram: main table holds committed versions at timestamps 1, 52, 335, 1240, 1402, 2201; shadow table holds versions 64, 335, 2201]
    Overwritten versions have a
    TTL set to the max read-transaction
    time and do not need durability
    Transparent to end-users

    View Slide
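The lookaside design above might be sketched in plain Python as follows (illustrative only, not Cassandra internals; the class, method names, and MAX_READ_TXN_SECONDS bound are assumptions of this sketch):

```python
# Main table keeps only the last committed write; a shadow table holds
# prepared-but-not-committed versions (no TTL) and overwritten versions
# (TTL bounded by the maximum read-transaction duration).

import time

MAX_READ_TXN_SECONDS = 5.0  # assumed bound on read-transaction duration

class LookasideStore:
    def __init__(self):
        self.main = {}    # key -> (ts, value): last committed write only
        self.shadow = {}  # (key, ts) -> (value, expires_at or None)

    def prepare(self, key, ts, value):
        # Prepared-but-not-committed versions live in the shadow table.
        self.shadow[(key, ts)] = (value, None)

    def commit(self, key, ts):
        value, _ = self.shadow.pop((key, ts))
        if key in self.main:
            # Demote the overwritten version to the shadow table with a TTL;
            # it only needs to survive long enough for in-flight readers.
            old_ts, old_value = self.main[key]
            self.shadow[(key, old_ts)] = (old_value, time.time() + MAX_READ_TXN_SECONDS)
        self.main[key] = (ts, value)

    def get(self, key, ts=None):
        # Latest committed value, or a specific version by timestamp;
        # second-round reads can reach prepared/overwritten versions here.
        if key in self.main and (ts is None or self.main[key][0] == ts):
            return self.main[key][1]
        entry = self.shadow.get((key, ts))
        if entry and (entry[1] is None or entry[1] > time.time()):
            return entry[0]
        return None
```

First-round reads only ever touch the main table, so the scheme stays transparent to clients that never ask for a specific version.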

  275. RAMP IN CASSANDRA
    To avoid stalling, second-round reads must
    be able to access prepared writes
    POSSIBLE IMPLEMENTATION:
    Operation         | Consistency Level
    Write Prepare     | CL.QUORUM
    Write Commit      | CL.ANY or higher
    First-round Read  | CL.ANY/CL.ONE
    Second-round Read | CL.QUORUM

    View Slide

  281. RAMP IN CASSANDRA
    POSSIBLE IMPLEMENTATION:
    DC1 DC2
    Run algorithms on a per-DC basis, with use of
    CL.LOCAL_QUORUM instead of full CL.QUORUM

    View Slide


  288. RAMP TRANSACTIONS:
    • Provide atomic visibility, as required for
    maintaining FKs, scalable indexing, mat views
    • Avoid in-place updates, mutual exclusion, any
    synchronous/blocking coordination
    • Use metadata with limited multi-versioning;
    reads repair partial writes
    • 1-2 RTT overhead, paid only under contention
    Thanks!
    http://tiny.cc/ramp-code
    @pbailis
    http://tiny.cc/ramp-intro

    View Slide

  289. Punk designed by my name is mud from the Noun Project Creative Commons – Attribution (CC BY 3.0)
    Queen designed by Bohdan Burmich from the Noun Project Creative Commons – Attribution (CC BY 3.0)
    Guy Fawkes designed by Anisha Varghese from the Noun Project Creative Commons – Attribution (CC BY 3.0)
    Emperor designed by Simon Child from the Noun Project Creative Commons – Attribution (CC BY 3.0)
    Baby designed by Les vieux garçons from the Noun Project Creative Commons – Attribution (CC BY 3.0)
    Baby designed by Les vieux garçons from the Noun Project Creative Commons – Attribution (CC BY 3.0)
    Gandhi designed by Luis Martins from the Noun Project Creative Commons – Attribution (CC BY 3.0)
    Database designed by Anton Outkine from the Noun Project Creative Commons – Attribution (CC BY 3.0)
    Girl designed by Rodrigo Vidinich from the Noun Project Creative Commons – Attribution (CC BY 3.0)
    Child designed by Gemma Garner from the Noun Project Creative Commons – Attribution (CC BY 3.0)
    Customer Service designed by Veysel Kara from the Noun Project Creative Commons – Attribution (CC BY 3.0)
    Punk Rocker designed by Simon Child from the Noun Project Creative Commons – Attribution (CC BY 3.0)
    Pyramid designed by misirlou from the Noun Project Creative Commons – Attribution (CC BY 3.0)
    Person designed by Stefania Bonacasa from the Noun Project Creative Commons – Attribution (CC BY 3.0)
    Record designed by Diogo Trindade from the Noun Project Creative Commons – Attribution (CC BY 3.0)
    Window designed by Juan Pablo Bravo from the Noun Project Creative Commons – Attribution (CC BY 3.0)
    Balloon designed by Julien Deveaux from the Noun Project Creative Commons – Attribution (CC BY 3.0)
    Balloon designed by Julien Deveaux from the Noun Project Creative Commons – Attribution (CC BY 3.0)
    Balloon designed by Julien Deveaux from the Noun Project Creative Commons – Attribution (CC BY 3.0)
    Crying designed by Megan Sheehan from the Noun Project Creative Commons – Attribution (CC BY 3.0)
    Sad designed by Megan Sheehan from the Noun Project Creative Commons – Attribution (CC BY 3.0)
    Happy designed by Megan Sheehan from the Noun Project Creative Commons – Attribution (CC BY 3.0)
    Happy designed by Megan Sheehan from the Noun Project Creative Commons – Attribution (CC BY 3.0)
    User designed by JM Waideaswaran from the Noun Project Creative Commons – Attribution (CC BY 3.0)
    COCOGOOSE font by ZetaFonts, Creative Commons Non-Commercial Use
    IMAGE/FONT CREDITS

    View Slide