Coordination Avoidance In Distributed Databases

pbailis
January 01, 2015

Job talk from early 2015

The rise of Internet-scale geo-replicated services has led to considerable upheaval in the design of modern data management systems. Namely, given the availability, latency, and throughput penalties associated with classic mechanisms such as serializable transactions, a broad class of systems (e.g., “NoSQL”) has sought weaker alternatives that reduce the use of expensive coordination during system operation, often at the cost of application integrity. When can we safely forgo the cost of this expensive coordination, and when must we pay the price?

In this talk, I will discuss the potential for coordination avoidance — the use of as little coordination as possible while ensuring application integrity — in several modern data-intensive domains. Specifically, I will demonstrate how to leverage the semantic requirements of applications in data serving, transaction processing, and statistical analytics to enable more efficient distributed algorithms and system designs. The prototype systems I have built demonstrate order-of-magnitude speedups compared to their traditional, coordinated counterparts on a variety of tasks, including referential integrity and index maintenance, transaction execution under common isolation models, and asynchronous convex optimization. I will also discuss our experiences studying and optimizing a range of open source applications and systems, which exhibit similar results.

Transcript

  1. COORDINATION AVOIDANCE IN DISTRIBUTED DATABASES
    PETER BAILIS
    UC Berkeley

  2. DATA TODAY: UNPRECEDENTED
    SCALE

  3. DATA TODAY: UNPRECEDENTED
    SCALE
      Billion-user Internet services
      3B Internet users in 2014
      2.3B Mobile broadband users
    [Ericsson Mobility Report, UN International Telecommunication Union, Facebook, Google, NSA]

  4. DATA TODAY: UNPRECEDENTED
    SCALE
      Billion-user Internet services
      3B Internet users in 2014
      2.3B Mobile broadband users
    VOLUME
      Facebook RocksDB: 9B ops/sec
      Google BigTable: 600M ops/sec
      LinkedIn Kafka: 2.5M ops/sec
    [Ericsson Mobility Report, UN International Telecommunication Union, Facebook, Google, NSA, @RocksDB, @AKPurtell, Martin Kleppmann]

  5. DATA TODAY: UNPRECEDENTED
    SCALE
      Billion-user Internet services
      3B Internet users in 2014
      2.3B Mobile broadband users
    VOLUME
      Facebook RocksDB: 9B ops/sec
      Google BigTable: 600M ops/sec
      LinkedIn Kafka: 2.5M ops/sec
    INTERACTIVITY
      Impatient users want low latency
      Always-on responsiveness
      Personalized user experiences
    [Ericsson Mobility Report, UN International Telecommunication Union, Facebook, Google, NSA, @RocksDB, @AKPurtell, Martin Kleppmann]

  6. DATA TODAY: UNPRECEDENTED
    SCALE
    VOLUME
    INTERACTIVITY

  7. DATA TODAY: UNPRECEDENTED
    SCALE
    VOLUME
    INTERACTIVITY
    AND GROWING!

  8. “post on timeline”
    “accept friend request”

  9. How should we design database systems
    that enable applications to scale?
    “post on timeline”
    “accept friend request”

  10. CLASSIC: ACID

  11. CLASSIC: ACID
    serializable transactions
    “accept friend request”
    “post on timeline”

  12. CLASSIC: ACID
    serializable transactions
    “accept friend request”
    “post on timeline”

  13. CLASSIC: ACID
    serializable transactions

  14. serializability: equivalence to some serial execution

  15. serializability: equivalence to some serial execution
    “post on timeline”

  16. serializability: equivalence to some serial execution
    “post on timeline”
    “accept friend request”

  17. serializability: equivalence to some serial execution
    “post on timeline”
    “accept friend request”
    very general!

  18. r(y)
    w(x←1)
    r(x)
    w(y←1)
    very general!
    serializability: equivalence to some serial execution


  19. r(y)
    w(x←1)
    r(x)
    w(y←1)
    very general!
    …but restricts concurrency
    serializability: equivalence to some serial execution


  20. serializability: equivalence to some serial execution
    very general!
    …but restricts concurrency


  21. serializability: equivalence to some serial execution
    very general!
    …but restricts concurrency
    CONCURRENT EXECUTION


  22. serializability: equivalence to some serial execution
    r(x)=0
    very general!
    …but restricts concurrency
    CONCURRENT EXECUTION


  23. serializability: equivalence to some serial execution
    r(x)=0
    r(y)=0
    very general!
    …but restricts concurrency
    CONCURRENT EXECUTION


  24. serializability: equivalence to some serial execution
    r(x)=0
    w(y←1)
    r(y)=0
    very general!
    …but restricts concurrency
    CONCURRENT EXECUTION


  25. serializability: equivalence to some serial execution
    r(x)=0
    w(x←1)
    w(y←1)
    r(y)=0
    very general!
    …but restricts concurrency
    CONCURRENT EXECUTION


  26. serializability: equivalence to some serial execution
    r(x)=0
    w(x←1)
    w(y←1)
    r(y)=0
    very general!
    …but restricts concurrency
    r(y)=0
    w(x←1)
    2
    r(x)=0
    w(y←1)
    1
    CONCURRENT EXECUTION


  27. serializability: equivalence to some serial execution
    r(x)=0
    w(x←1)
    w(y←1)
    r(y)=0
    very general!
    …but restricts concurrency
    Should have r(y)→1
    r(y)=0
    w(x←1)
    2
    r(x)=0
    w(y←1)
    1
    CONCURRENT EXECUTION


  28. serializability: equivalence to some serial execution
    r(x)=0
    w(x←1)
    w(y←1)
    r(y)=0
    very general!
    …but restricts concurrency
    Should have r(y)→1
    r(y)=0
    w(x←1)
    2
    r(x)=0
    w(y←1)
    1
    CONCURRENT EXECUTION


  29. serializability: equivalence to some serial execution
    r(x)=0
    w(x←1)
    w(y←1)
    r(y)=0
    very general!
    …but restricts concurrency
    Should have r(y)→1
    r(y)=0
    w(x←1)
    2
    r(x)=0
    w(y←1)
    1
    r(y)=0
    w(x←1)
    1
    r(x)=0
    w(y←1)
    2
    CONCURRENT EXECUTION


  30. serializability: equivalence to some serial execution
    r(x)=0
    w(x←1)
    w(y←1)
    r(y)=0
    very general!
    …but restricts concurrency
    Should have r(y)→1
    r(y)=0
    w(x←1)
    2
    r(x)=0
    w(y←1)
    1
    Should have r(x)→1
    r(y)=0
    w(x←1)
    1
    r(x)=0
    w(y←1)
    2
    CONCURRENT EXECUTION


  31. serializability: equivalence to some serial execution
    r(x)=0
    w(x←1)
    w(y←1)
    r(y)=0
    very general!
    …but restricts concurrency
    Should have r(y)→1
    r(y)=0
    w(x←1)
    2
    r(x)=0
    w(y←1)
    1
    Should have r(x)→1
    r(y)=0
    w(x←1)
    1
    r(x)=0
    w(y←1)
    2
    CONCURRENT EXECUTION


  32. serializability: equivalence to some serial execution
    r(x)=0
    w(x←1)
    w(y←1)
    r(y)=0
    very general!
    …but restricts concurrency
    Should have r(y)→1
    r(y)=0
    w(x←1)
    2
    r(x)=0
    w(y←1)
    1
    Should have r(x)→1
    r(y)=0
    w(x←1)
    1
    r(x)=0
    w(y←1)
    2
    CONCURRENT EXECUTION
    IS NOT SERIALIZABLE!


  33. serializability: equivalence to some serial execution
    r(x)=0
    w(x←1)
    w(y←1)
    r(y)=0
    very general!
    …but restricts concurrency
    transactions cannot make progress independently
    Serializability requires Coordination
    Should have r(y)→1
    r(y)=0
    w(x←1)
    2
    r(x)=0
    w(y←1)
    1
    Should have r(x)→1
    r(y)=0
    w(x←1)
    1
    r(x)=0
    w(y←1)
    2
    CONCURRENT EXECUTION
    IS NOT SERIALIZABLE!
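
The check this slide walks through can be sketched in a few lines. This is a minimal illustration, not code from the talk: the transaction labels `T1` and `T2`, the helper names, and the initial state x = 0, y = 0 are assumptions made for the example. It tries every serial order and compares the values each transaction would read against the values read in the concurrent execution.

```python
from itertools import permutations

# Each operation is ("r", item) or ("w", item, value).
# T1 and T2 are hypothetical labels for the two transactions on the slide.
T1 = [("r", "y"), ("w", "x", 1)]
T2 = [("r", "x"), ("w", "y", 1)]


def run_serially(transactions, initial):
    """Run transactions one after another and return the values each one read."""
    db = dict(initial)
    reads = {}
    for name, ops in transactions:
        for op in ops:
            if op[0] == "r":
                reads[(name, op[1])] = db[op[1]]
            else:
                db[op[1]] = op[2]
    return reads


def is_serializable(observed_reads, transactions, initial):
    """True if some serial order yields exactly the observed reads."""
    return any(
        run_serially(order, initial) == observed_reads
        for order in permutations(transactions)
    )


initial = {"x": 0, "y": 0}
txns = [("T1", T1), ("T2", T2)]

# The concurrent execution from the slide: both reads return 0.
concurrent_reads = {("T1", "y"): 0, ("T2", "x"): 0}
print(is_serializable(concurrent_reads, txns, initial))
```

In either serial order the second transaction must read the value written by the first, so no serial order reproduces two reads of 0 and the check reports that the execution is not serializable.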


  34. transactions cannot make progress independently
    Serializability requires Coordination


  35. transactions cannot make progress independently
    Serializability requires Coordination
    Two-Phase Locking
    Optimistic Concurrency Control
    Multi-Version Concurrency Control
    Pre-Scheduling

  36. transactions cannot make progress independently
    Serializability requires Coordination
    Two-Phase Locking
    Optimistic Concurrency Control
    Multi-Version Concurrency Control
    Pre-Scheduling
    Blocking
    Waiting
    Aborts

  37. transactions cannot make progress independently
    Serializability requires Coordination
    Two-Phase Locking
    Optimistic Concurrency Control
    Multi-Version Concurrency Control
    Pre-Scheduling
    Blocking
    Waiting
    Aborts
    Costs of Coordination Between Concurrent Transactions

  38. transactions cannot make progress independently
    Serializability requires Coordination
    Two-Phase Locking
    Optimistic Concurrency Control
    Multi-Version Concurrency Control
    Pre-Scheduling
    Blocking
    Waiting
    Aborts
    Costs of Coordination Between Concurrent Transactions
    1. Decreased performance

  39. [Plot: Maximum Throughput (txns/s), 0 to 1200, vs. Number of Servers in Transaction, 2 to 8]
    Local datacenter (Amazon EC2)
    Based on [Bobtail, Xu et al., NSDI 13]
    For conflicting transactions

  40. [Plot: Maximum Throughput (txns/s), 0 to 1200, vs. Number of Servers in Transaction, 2 to 8]
    Local datacenter (Amazon EC2)
    Based on [Bobtail, Xu et al., NSDI 13]
    For conflicting transactions

  41. [Plot: Maximum Throughput (txns/s), 0 to 1200, vs. Number of Servers in Transaction, 2 to 8]
    Local datacenter (Amazon EC2)
    Based on [Bobtail, Xu et al., NSDI 13]
    [Plot: Maximum Throughput (txn/s), 2 to 12, vs. Participating Datacenters (+VA): +OR, +CA, +IR, +SP, +TO, +SI, +SY]
    Multi-datacenter (Amazon EC2)
    Based on [HAT, Bailis et al., VLDB 14]
    For conflicting transactions

  42. [Plot: Maximum Throughput (txns/s), 0 to 1200, vs. Number of Servers in Transaction, 2 to 8]
    Local datacenter (Amazon EC2)
    Based on [Bobtail, Xu et al., NSDI 13]
    [Plot: Maximum Throughput (txn/s), 2 to 12, vs. Participating Datacenters (+VA): +OR, +CA, +IR, +SP, +TO, +SI, +SY]
    Multi-datacenter (Amazon EC2)
    Based on [HAT, Bailis et al., VLDB 14]
    For conflicting transactions

  43. [Plot: Maximum Throughput (txns/s), 0 to 1200, vs. Number of Servers in Transaction, 2 to 8]
    Local datacenter (Amazon EC2)
    Based on [Bobtail, Xu et al., NSDI 13]
    [Plot: Maximum Throughput (txn/s), 2 to 12, vs. Participating Datacenters (+VA): +OR, +CA, +IR, +SP, +TO, +SI, +SY]
    Multi-datacenter (Amazon EC2)
    Based on [HAT, Bailis et al., VLDB 14]
    For conflicting transactions

  44. transactions cannot make progress independently
    Serializability requires Coordination
    Costs of Coordination Between Concurrent Transactions
    1. Decreased performance
    » due to waiting, communication delays, aborts
    » exacerbated in distributed environment!
    2. Decreased availability during failures

  45. transactions cannot make progress independently
    Serializability requires Coordination
    Costs of Coordination Between Concurrent Transactions
    1. Decreased performance
    » due to waiting, communication delays, aborts
    » exacerbated in distributed environment!
    2. Decreased availability during failures

  46. transactions cannot make progress independently
    Serializability requires Coordination
    Costs of Coordination Between Concurrent Transactions
    1. Decreased performance
    » due to waiting, communication delays, aborts
    » exacerbated in distributed environment!
    2. Decreased availability during failures

  47. transactions cannot make progress independently
    Serializability requires Coordination
    Costs of Coordination Between Concurrent Transactions
    1. Decreased performance
    » due to waiting, communication delays, aborts
    » exacerbated in distributed environment!
    2. Decreased availability during failures

  48. transactions cannot make progress independently
    Serializability requires Coordination
    Costs of Coordination Between Concurrent Transactions
    1. Decreased performance
    » due to waiting, communication delays, aborts
    » exacerbated in distributed environment!
    2. Decreased availability during failures
    Well-known for decades; cf. “CAP”

  49. How should we design database systems
    that enable applications to scale?


  50. Serializability
    COORDINATION
    REQUIRED
    How should we design database systems
    that enable applications to scale?


  51. Serializability
    COORDINATION
    REQUIRED
    “NoSQL”
    COORDINATION
    FREE
    How should we design database systems
    that enable applications to scale?


  52. Eventual Consistency
    “if no new updates are made to the [database],
    eventually all accesses will return the last updated value[s]”
    — Werner Vogels, Amazon CTO


  53. Eventual Consistency
    “if no new updates are made to the [database],
    eventually all accesses will return the last updated value[s]”
    — Werner Vogels, Amazon CTO


  54. Eventual Consistency
    “if no new updates are made to the [database],
    eventually all accesses will return the last updated value[s]”
    — Werner Vogels, Amazon CTO


  55. Eventual Consistency
    “if no new updates are made to the [database],
    eventually all accesses will return the last updated value[s]”
    — Werner Vogels, Amazon CTO


  56. Eventual Consistency
    “if no new updates are made to the [database],
    eventually all accesses will return the last updated value[s]”
    — Werner Vogels, Amazon CTO
    provides no safety: what happens in the meantime?
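
A toy sketch of the gap the quoted definition leaves open. Everything here is an assumption for illustration: the `Replica` class, the last-writer-wins merge rule, and the `anti_entropy` helper are hypothetical and do not describe any particular system. Two replicas converge once they exchange state, but a read issued before that exchange can return the older value.

```python
class Replica:
    """A replica holding one last-writer-wins register."""

    def __init__(self):
        self.timestamp = 0
        self.value = None

    def write(self, timestamp, value):
        # Keep only the write with the highest timestamp.
        if timestamp > self.timestamp:
            self.timestamp, self.value = timestamp, value

    def read(self):
        return self.value


def anti_entropy(a, b):
    """Exchange state so that both replicas hold the newest write."""
    a.write(b.timestamp, b.value)
    b.write(a.timestamp, a.value)


r1, r2 = Replica(), Replica()
r1.write(1, "old")
r2.write(1, "old")

r1.write(2, "new")        # the update has reached r1 only
stale = r2.read()         # r2 still answers with the earlier value

anti_entropy(r1, r2)      # no new updates arrive; replicas exchange state
converged = r2.read()     # r2 now answers with the last update
```

Nothing in the definition bounds how long the window before `anti_entropy` lasts or what a reader may observe during it.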


  57. [VLDB 2012, VLDB Journal 2014 “Best of VLDB 2012”,
    SIGMOD 2013 (Demo), CACM Research Highlight]
    Probabilistically Bounded Staleness (PBS)


  58. [VLDB 2012, VLDB Journal 2014 “Best of VLDB 2012”,
    SIGMOD 2013 (Demo), CACM Research Highlight]
    Probabilistically Bounded Staleness (PBS)
    » Monte Carlo analysis of protocol behavior


  59. [VLDB 2012, VLDB Journal 2014 “Best of VLDB 2012”,
    SIGMOD 2013 (Demo), CACM Research Highlight]
    Probabilistically Bounded Staleness (PBS)
    » Monte Carlo analysis of protocol behavior
    » Key finding: frequently “correct” results…


  60. [VLDB 2012, VLDB Journal 2014 “Best of VLDB 2012”,
    SIGMOD 2013 (Demo), CACM Research Highlight]
    Probabilistically Bounded Staleness (PBS)
    » Monte Carlo analysis of protocol behavior
    » Key finding: frequently “correct” results…
    PBS: Voldemort Database at LinkedIn
    99% of reads return the last update 23ms after write


  61. [VLDB 2012, VLDB Journal 2014 “Best of VLDB 2012”,
    SIGMOD 2013 (Demo), CACM Research Highlight]
    Probabilistically Bounded Staleness (PBS)
    » Monte Carlo analysis of protocol behavior
    » Key finding: frequently “correct” results…
    PBS: Voldemort Database at LinkedIn
    99% of reads return the last update 23ms after write
    32-90% decrease in 99.9th percentile latency


  62. [VLDB 2012, VLDB Journal 2014 “Best of VLDB 2012”,
    SIGMOD 2013 (Demo), CACM Research Highlight]
    Probabilistically Bounded Staleness (PBS)
    » Monte Carlo analysis of protocol behavior
    » Key finding: frequently “correct” results…
    PBS: Voldemort Database at LinkedIn
    99% of reads return the last update 23ms after write
    32-90% decrease in 99.9th percentile latency


  63. [VLDB 2012, VLDB Journal 2014 “Best of VLDB 2012”,
    SIGMOD 2013 (Demo), CACM Research Highlight]
    Probabilistically Bounded Staleness (PBS)
    » Monte Carlo analysis of protocol behavior
    » Key finding: frequently “correct” results…
    PBS: Voldemort Database at LinkedIn
    99% of reads return the last update 23ms after write
    32-90% decrease in 99.9th percentile latency
    …BUT NO GUARANTEES!
    ⇒ DIFFICULT TO PROGRAM
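
A rough sketch of the kind of Monte Carlo analysis named on this slide, loosely following the four message delays modeled in the PBS paper (write propagation, write acknowledgment, read request, read response). The exponential delay distributions, their means, the function name, and the parameter names are illustrative assumptions, not LinkedIn measurements or values from the talk.

```python
import random


def consistency_probability(n, r, w, t_ms, trials=2000, seed=0,
                            mean_w=5.0, mean_a=1.0, mean_r=1.0, mean_s=1.0):
    """Estimate P(a read started t_ms after a write commits sees that write).

    n: replicas, r: read quorum size, w: write quorum size.
    mean_*: assumed mean one-way delays in milliseconds.
    """
    rng = random.Random(seed)

    def delays(mean):
        return [rng.expovariate(1.0 / mean) for _ in range(n)]

    consistent = 0
    for _ in range(trials):
        w_delay, a_delay = delays(mean_w), delays(mean_a)
        r_delay, s_delay = delays(mean_r), delays(mean_s)

        # The write commits once the w-th fastest acknowledgment arrives.
        commit = sorted(wd + ad for wd, ad in zip(w_delay, a_delay))[w - 1]
        read_start = commit + t_ms

        # The reader uses the first r responses it receives.
        responders = sorted(range(n),
                            key=lambda i: r_delay[i] + s_delay[i])[:r]

        # A replica returns the new value if the write reached it
        # before the read request did.
        if any(w_delay[i] <= read_start + r_delay[i] for i in responders):
            consistent += 1
    return consistent / trials


# Partial quorums (r + w <= n) immediately after commit.
print(consistency_probability(n=3, r=1, w=1, t_ms=0.0))
```

With r + w > n the estimate is 1.0 by construction, because every read quorum overlaps the replicas that acknowledged the write. With smaller quorums the estimate rises toward 1.0 as `t_ms` grows, but it remains a probability rather than a guarantee.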


  64. “…sometimes the [write] is retrieved from the datastore and sometimes it is not.”

  65. Serializability
    COORDINATION
    REQUIRED
    GUARANTEED
    SAFETY
    Eventual
    Consistency
    COORDINATION
    FREE
    NO SAFETY


  66. Serializability
    COORDINATION
    REQUIRED
    GUARANTEED
    SAFETY
    Eventual
    Consistency
    COORDINATION
    FREE
    NO SAFETY
    PBS
    VLDB12, VLDBJ14,
    SIGMOD13, CACM14


  67. Serializability
    COORDINATION
    REQUIRED
    GUARANTEED
    SAFETY
    Eventual
    Consistency
    COORDINATION
    FREE
    NO SAFETY
    COORDINATION AVOIDANCE
    PBS
    VLDB12, VLDBJ14,
    SIGMOD13, CACM14
    MY WORK:


  68. Serializability
    COORDINATION
    REQUIRED
    GUARANTEED
    SAFETY
    Eventual
    Consistency
    COORDINATION
    FREE
    NO SAFETY
    COORDINATION AVOIDANCE
    GUARANTEED SAFETY WITHOUT COORDINATION
    PBS
    VLDB12, VLDBJ14,
    SIGMOD13, CACM14
    MY WORK:


  69. The Far Side,
    Gary Larson


  70. WHAT THE APPLICATION SAYS
    “post on timeline”
    “accept friend request”

  71. WHAT THE APPLICATION SAYS
    “post on timeline”
    “accept friend request”
    WHAT THE DATABASE HEARS
    [Figure: an undifferentiated stream of read and write operations]

  72. MY APPROACH:
    DESIGN DATABASE SYSTEMS
    THAT EXPLOIT SEMANTICS OF
    HIGH-VALUE USE CASES

  73. MY APPROACH:
    DESIGN DATABASE SYSTEMS
    THAT EXPLOIT SEMANTICS OF
    HIGH-VALUE USE CASES
    Study practical database use cases

  74. MY APPROACH:
    DESIGN DATABASE SYSTEMS
    THAT EXPLOIT SEMANTICS OF
    HIGH-VALUE USE CASES
    Study practical database use cases
    Derive principles and algorithms

  75. MY APPROACH:
    DESIGN DATABASE SYSTEMS
    THAT EXPLOIT SEMANTICS OF
    HIGH-VALUE USE CASES
    Study practical database use cases
    Derive principles and algorithms
    Build systems to realize the benefits

  76. MY APPROACH:
    DESIGN DATABASE SYSTEMS
    THAT EXPLOIT SEMANTICS OF
    HIGH-VALUE USE CASES
    Study practical database use cases
    Derive principles and algorithms
    Build systems to realize the benefits

  77. Serializability
    COORDINATION
    REQUIRED
    GUARANTEED
    SAFETY
    Eventual
    Consistency
    COORDINATION
    FREE
    NO SAFETY
    COORDINATION AVOIDANCE
    GUARANTEED SAFETY WITHOUT COORDINATION
    PBS
    VLDB12, VLDBJ14,
    SIGMOD13, CACM14


  78. Serializability
    COORDINATION
    REQUIRED
    GUARANTEED
    SAFETY
    Eventual
    Consistency
    COORDINATION
    FREE
    NO SAFETY
    COORDINATION AVOIDANCE
    GUARANTEED SAFETY WITHOUT COORDINATION
    MORE SAFETY
    PBS
    VLDB12, VLDBJ14,
    SIGMOD13, CACM14


  79. Serializability
    COORDINATION
    REQUIRED
    GUARANTEED
    SAFETY
    Eventual
    Consistency
    COORDINATION
    FREE
    NO SAFETY
    COORDINATION AVOIDANCE
    GUARANTEED SAFETY WITHOUT COORDINATION
    MORE SEMANTICS
    MORE SAFETY
    PBS
    VLDB12, VLDBJ14,
    SIGMOD13, CACM14


  80. Causality
    SOCC12, SIGMOD13
    Serializability
    COORDINATION
    REQUIRED
    GUARANTEED
    SAFETY
    Eventual
    Consistency
    COORDINATION
    FREE
    NO SAFETY
    COORDINATION AVOIDANCE
    GUARANTEED SAFETY WITHOUT COORDINATION
    MORE SEMANTICS
    MORE SAFETY
    PBS
    VLDB12, VLDBJ14,
    SIGMOD13, CACM14


  81. Weak Isolation
    HotOS13, VLDB14
    Causality
    SOCC12, SIGMOD13
    Serializability
    COORDINATION
    REQUIRED
    GUARANTEED
    SAFETY
    Eventual
    Consistency
    COORDINATION
    FREE
    NO SAFETY
    COORDINATION AVOIDANCE
    GUARANTEED SAFETY WITHOUT COORDINATION
    MORE SEMANTICS
    MORE SAFETY
    PBS
    VLDB12, VLDBJ14,
    SIGMOD13, CACM14


  82. Atomic Visibility
    SIGMOD14
    Weak Isolation
    HotOS13, VLDB14
    Causality
    SOCC12, SIGMOD13
    Serializability
    COORDINATION
    REQUIRED
    GUARANTEED
    SAFETY
    Eventual
    Consistency
    COORDINATION
    FREE
    NO SAFETY
    COORDINATION AVOIDANCE
    GUARANTEED SAFETY WITHOUT COORDINATION
    MORE SEMANTICS
    MORE SAFETY
    PBS
    VLDB12, VLDBJ14,
    SIGMOD13, CACM14


  83. Atomic Visibility
    SIGMOD14
    Database
    Constraints
    VLDB15, SIGMOD15
    Weak Isolation
    HotOS13, VLDB14
    Causality
    SOCC12, SIGMOD13
    Serializability
    COORDINATION
    REQUIRED
    GUARANTEED
    SAFETY
    Eventual
    Consistency
    COORDINATION
    FREE
    NO SAFETY
    COORDINATION AVOIDANCE
    GUARANTEED SAFETY WITHOUT COORDINATION
    MORE SEMANTICS
    MORE SAFETY
    PBS
    VLDB12, VLDBJ14,
    SIGMOD13, CACM14


  84. Atomic Visibility
    SIGMOD14
    Database
    Constraints
    VLDB15, SIGMOD15
    Weak Isolation
    HotOS13, VLDB14
    Causality
    SOCC12, SIGMOD13
    COORDINATION AVOIDANCE
    GUARANTEED SAFETY WITHOUT COORDINATION


  85. Atomic Visibility
    SIGMOD14
    Database
    Constraints
    VLDB15, SIGMOD15
    Weak Isolation
    HotOS13, VLDB14
    Causality
    SOCC12, SIGMOD13
    COORDINATION AVOIDANCE
    GUARANTEED SAFETY WITHOUT COORDINATION
    Data Serving and Transactions


  86. Atomic Visibility
    SIGMOD14
    Database
    Constraints
    VLDB15, SIGMOD15
    Weak Isolation
    HotOS13, VLDB14
    Causality
    SOCC12, SIGMOD13
    COORDINATION AVOIDANCE
    GUARANTEED SAFETY WITHOUT COORDINATION
    Data Serving and Transactions
    Model Prediction
    and Training
    CIDR15, TBA
    Analytics


  87. Atomic Visibility
    SIGMOD14
    Database
    Constraints
    VLDB15, SIGMOD15
    Model Prediction
    and Training
    CIDR15, TBA
    Weak Isolation
    HotOS13, VLDB14
    Causality
    SOCC12, SIGMOD13
    Serializability
    COORDINATION
    REQUIRED
    GUARANTEED
    SAFETY
    Eventual
    Consistency
    COORDINATION
    FREE
    NO SAFETY
    COORDINATION AVOIDANCE
    GUARANTEED SAFETY WITHOUT COORDINATION
    MORE SEMANTICS
    MORE SAFETY
    PBS
    VLDB12, VLDBJ14,
    SIGMOD13, CACM14


  88. Atomic Visibility
    SIGMOD14
    Database
    Constraints
    VLDB15, SIGMOD15
    Model Prediction
    and Training
    CIDR15, TBA
    Weak Isolation
    HotOS13, VLDB14
    Causality
    SOCC12, SIGMOD13
    Serializability
    COORDINATION
    REQUIRED
    GUARANTEED
    SAFETY
    Eventual
    Consistency
    COORDINATION
    FREE
    NO SAFETY
    COORDINATION AVOIDANCE
    GUARANTEED SAFETY WITHOUT COORDINATION
    MORE SEMANTICS
    MORE SAFETY
    PBS
    VLDB12, VLDBJ14,
    SIGMOD13, CACM14


  89. Atomic Visibility
    SIGMOD14
    Database
    Constraints
    VLDB15, SIGMOD15
    Model Prediction
    and Training
    CIDR15, TBA
    Weak Isolation
    HotOS13, VLDB14
    Causality
    SOCC12, SIGMOD13
    Serializability
    COORDINATION
    REQUIRED
    GUARANTEED
    SAFETY
    Eventual
    Consistency
    COORDINATION
    FREE
    NO SAFETY
    COORDINATION AVOIDANCE
    GUARANTEED SAFETY WITHOUT COORDINATION
    MORE SEMANTICS
    MORE SAFETY
    PBS
    VLDB12, VLDBJ14,
    SIGMOD13, CACM14
    COORDINATION FREE


  90. (Abridged) Related Work


  91. (Abridged) Related Work
    » Semantics-based concurrency control: esp.
    commutativity and CALM analysis, laws of order
    » Available storage systems: optimistic replication,
    causal memory, CRDTs, eventually consistent transactions
    » Distributed computing: CAP, FLP, NBAC, quorums


  92. (Abridged) Related Work
    » Semantics-based concurrency control: esp.
    commutativity and CALM analysis, laws of order
    » Available storage systems: optimistic replication,
    causal memory, CRDTs, eventually consistent transactions
    » Distributed computing: CAP, FLP, NBAC, quorums
    » Here: focus on necessary coordination for
    common, modern data-intensive apps


  93. Serializability
    COORDINATION
    REQUIRED
    GUARANTEED
    SAFETY
    Eventual
    Consistency
    COORDINATION
    FREE
    NO SAFETY
    Atomic Visibility
    SIGMOD14
    Database
    Constraints
    VLDB15, SIGMOD15
    Model Prediction
    and Training
    CIDR15, TBA
    Weak Isolation
    HotOS13, VLDB14
    Causality
    SOCC12, SIGMOD13
    COORDINATION AVOIDANCE
    GUARANTEED SAFETY WITHOUT COORDINATION
    MORE SEMANTICS
    MORE SAFETY
    PBS
    VLDB12, VLDBJ14,
    SIGMOD13, CACM14
    COORDINATION FREE


  94. Serializability
    COORDINATION
    REQUIRED
    GUARANTEED
    SAFETY
    Eventual
    Consistency
    COORDINATION
    FREE
    NO SAFETY
    Atomic Visibility
    SIGMOD14
    Database
    Constraints
    VLDB15, SIGMOD15
    Model Prediction
    and Training
    CIDR15, TBA
    Weak Isolation
    HotOS13, VLDB14
    Causality
    SOCC12, SIGMOD13
    COORDINATION AVOIDANCE
    GUARANTEED SAFETY WITHOUT COORDINATION
    MORE SEMANTICS
    MORE SAFETY
    PBS
    VLDB12, VLDBJ14,
    SIGMOD13, CACM14
    COORDINATION FREE
    1


  95. Serializability
    COORDINATION
    REQUIRED
    GUARANTEED
    SAFETY
    Eventual
    Consistency
    COORDINATION
    FREE
    NO SAFETY
    Atomic Visibility
    SIGMOD14
    Database
    Constraints
    VLDB15, SIGMOD15
    Model Prediction
    and Training
    CIDR15, TBA
    Weak Isolation
    HotOS13, VLDB14
    Causality
    SOCC12, SIGMOD13
    COORDINATION AVOIDANCE
    GUARANTEED SAFETY WITHOUT COORDINATION
    MORE SEMANTICS
    MORE SAFETY
    PBS
    VLDB12, VLDBJ14,
    SIGMOD13, CACM14
    COORDINATION FREE
    1
    2


  96. Serializability
    COORDINATION
    REQUIRED
    GUARANTEED
    SAFETY
    Eventual
    Consistency
    COORDINATION
    FREE
    NO SAFETY
    Atomic Visibility
    SIGMOD14
    Database
    Constraints
    VLDB15, SIGMOD15
    Model Prediction
    and Training
    CIDR15, TBA
    Weak Isolation
    HotOS13, VLDB14
    Causality
    SOCC12, SIGMOD13
    COORDINATION AVOIDANCE
    GUARANTEED SAFETY WITHOUT COORDINATION
    MORE SEMANTICS
    MORE SAFETY
    PBS
    VLDB12, VLDBJ14,
    SIGMOD13, CACM14
    COORDINATION FREE
    1
    2 3


  97. Serializability
    COORDINATION
    REQUIRED
    GUARANTEED
    SAFETY
    Eventual
    Consistency
    COORDINATION
    FREE
    NO SAFETY
    Atomic Visibility
    SIGMOD14
    Database
    Constraints
    VLDB15, SIGMOD15
    Model Prediction
    and Training
    CIDR15, TBA
    Weak Isolation
    HotOS13, VLDB14
    Causality
    SOCC12, SIGMOD13
    COORDINATION AVOIDANCE
    GUARANTEED SAFETY WITHOUT COORDINATION
    MORE SEMANTICS
    MORE SAFETY
    PBS
    VLDB12, VLDBJ14,
    SIGMOD13, CACM14
    COORDINATION FREE
    1


  98. Social Graph


  99. Social Graph


  100. Social Graph
    Facebook


  101. Social Graph
    1.2B+ vertices
    Facebook


  102. Social Graph
    1.2B+ vertices
    420B+ edges
    Facebook


  103. Social Graph
    1.2B+ vertices
    420B+ edges
    Facebook


  104. Social Graph
    Facebook
    1.2B+ vertices
    420B+ edges
    User: 1, 2, 3, 4, 5, 6

  105. Social Graph
    Facebook
    1.2B+ vertices
    420B+ edges
    User | Adjacency List
    1    | 2, 3, 5
    2    | 1, 3, 5
    3    | 1, 5, 6
    4    | 6
    5    | 1, 2, 3, 6
    6    | 3, 4, 5

  106. Social Graph
    Facebook
    1.2B+ vertices
    420B+ edges
    User | Adjacency List
    1    | 2, 3, 5
    2    | 1, 3, 5
    3    | 1, 5, 6
    4    | 6
    5    | 1, 2, 3, 6
    6    | 3, 4, 5

  107. User 1: 2, 3, 5
    User 6: 3, 4, 5

  108. User 1: 2, 3, 5
    User 6: 3, 4, 5

  109. User 1: 2, 3, 5, 6
    User 6: 3, 4, 5, 1

  110. User 1: 2, 3, 5, 6
    User 6: 3, 4, 5, 1
    To preserve graph, should observe either:
    » Both links
    » Neither link

  111. User 1: 2, 3, 5, 6
    User 6: 3, 4, 5, 1
    To preserve graph, should observe either:
    » Both links
    » Neither link
    Atomic Visibility
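
The requirement on this slide can be written as a small check. This is an illustrative sketch only: the dictionary layout and the helper names `link_is_consistent` and `with_writes` are assumptions, and the data covers just the two adjacency lists shown (users 1 and 6).

```python
def link_is_consistent(adj, u, v):
    """True if a reader sees both directions of the u-v link or neither."""
    return (v in adj[u]) == (u in adj[v])


def with_writes(adj, writes):
    """Return a copy of adj with the given (owner, friend) writes applied."""
    copy = {user: set(friends) for user, friends in adj.items()}
    for owner, friend in writes:
        copy[owner].add(friend)
    return copy


# Adjacency lists for users 1 and 6 before the friend request is accepted.
before = {1: {2, 3, 5}, 6: {3, 4, 5}}

# Accepting the request writes one entry into each user's list.
friendship = [(1, 6), (6, 1)]

partial = with_writes(before, friendship[:1])   # only the first write visible
after = with_writes(before, friendship)         # both writes visible

print(link_is_consistent(before, 1, 6))
print(link_is_consistent(partial, 1, 6))
print(link_is_consistent(after, 1, 6))
```

Only the `partial` state, where one of the two writes is visible, fails the check; that is the state atomic visibility rules out.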


  112. Atomic Visibility


  113. Atomic Visibility
    either all or none of each transaction’s updates
    should be visible to other transactions


  114. Atomic Visibility
    either all or none of each transaction’s updates
    should be visible to other transactions


  115. Atomic Visibility
    X = 1
    WRITE
    Y = 1
    WRITE
    either all or none of each transaction’s updates
    should be visible to other transactions

    View full-size slide

  116. Atomic Visibility
    OR
    X = 1
    READ
    Y = 1
    READ
    READ X =
    READ Y =
    X = 1
    WRITE
    Y = 1
    WRITE
    either all or none of each transaction’s updates
    should be visible to other transactions

  118. Atomic Visibility
    OR
    X = 1
    READ
    Y = 1
    READ
    READ X =
    READ Y =
    either all or none of each transaction’s updates
    should be visible to other transactions

  119. BUT NOT
    Atomic Visibility
    OR
    X = 1
    READ
    Y = 1
    READ
    READ X =
    READ Y =
    either all or none of each transaction’s updates
    should be visible to other transactions
    OR
    X = 1
    READ
    Y = 1
    READ
    READ X =
    READ Y =

  120. BUT NOT
    Atomic Visibility
    OR
    X = 1
    READ
    Y = 1
    READ
    READ X =
    READ Y =
    either all or none of each transaction’s updates
    should be visible to other transactions
    OR
    X = 1
    READ
    Y = 1
    READ
    READ X =
    READ Y =
    “FRACTURED READS”
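
As a rough illustration of the definition, the check below flags a fractured read. It is a sketch that assumes each written value differs from the item's previous value (here `None` stands for "not yet written"); `is_fractured` is a hypothetical helper, not part of any system named in the talk.

```python
def is_fractured(write_set, observed):
    """True if the reader saw some, but not all, of one transaction's writes."""
    seen = [observed[key] == value for key, value in write_set.items()]
    return any(seen) and not all(seen)

# The transaction from the slides writes X = 1 and Y = 1.
write_set = {"X": 1, "Y": 1}

print(is_fractured(write_set, {"X": 1, "Y": 1}))        # all updates visible
print(is_fractured(write_set, {"X": None, "Y": None}))  # no updates visible
print(is_fractured(write_set, {"X": 1, "Y": None}))     # only X visible
```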

  121. Atomic Visibility
    is sufficient to correctly maintain:
    social graph structure

  122. r(x)=0
    w(x←1)
    w(y←1)
    r(y)=0
    Should have r(y)→1
    r(y)=0
    w(x←1)
    2
    r(x)=0
    w(y←1)
    1
    Should have r(x)→1
    r(y)=0
    w(x←1)
    1
    r(x)=0
    w(y←1)
    2
    CONCURRENT EXECUTION
    IS NOT SERIALIZABLE!
    Atomic Visibility
    is not serializability!

  123. r(x)=0
    w(x←1)
    w(y←1)
    r(y)=0
    Should have r(y)→1
    r(y)=0
    w(x←1)
    2
    r(x)=0
    w(y←1)
    1
    Should have r(x)→1
    r(y)=0
    w(x←1)
    1
    r(x)=0
    w(y←1)
    2
    CONCURRENT EXECUTION
    IS NOT SERIALIZABLE!
    Atomic Visibility
    is not serializability!
    …but respects
    Atomic Visibility!
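
The claim can be checked by brute force. The sketch below, with hypothetical functions `t1` and `t2` standing in for the two transactions on the slide, runs both serial orders and confirms that the concurrent outcome (both reads return 0) matches neither.

```python
from itertools import permutations

def t1(db):
    # T1: r(x), then w(y <- 1)
    seen = db["x"]
    db["y"] = 1
    return ("r(x)", seen)

def t2(db):
    # T2: r(y), then w(x <- 1)
    seen = db["y"]
    db["x"] = 1
    return ("r(y)", seen)

def serial_outcomes():
    """Collect the read results of every serial order of T1 and T2."""
    outcomes = set()
    for order in permutations([t1, t2]):
        db = {"x": 0, "y": 0}
        outcomes.add(frozenset(txn(db) for txn in order))
    return outcomes

# The concurrent execution from the slide: both transactions read 0.
concurrent = frozenset({("r(x)", 0), ("r(y)", 0)})
print(concurrent in serial_outcomes())
```

Each transaction here writes a single item, so no reader can observe only part of a transaction's updates; that is why the execution still respects atomic visibility.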

  124. Atomic Visibility compared
    | | Fractured Reads | Item Anti-Dependency Cycles | Anti-Dependency Cycles |
    |---|---|---|---|
    | Serializability | Prevents | Prevents | Prevents |
    | Snapshot Isolation | Prevents | Prevents | Doesn’t prevent |
    | Atomic Visibility via Read Atomic | Prevents | Doesn’t prevent | Doesn’t prevent |
    | Eventual Consistency | Doesn’t prevent | Doesn’t prevent | Doesn’t prevent |

  125. Atomic Visibility compared
    | | Fractured Reads | Item Anti-Dependency Cycles | Anti-Dependency Cycles |
    |---|---|---|---|
    | Serializability | Prevents | Prevents | Prevents |
    | Snapshot Isolation | Prevents | Prevents | Doesn’t prevent |
    | Atomic Visibility via Read Atomic | Prevents | Doesn’t prevent | Doesn’t prevent |
    | Eventual Consistency | Doesn’t prevent | Doesn’t prevent | Doesn’t prevent |
    WANT TO PREVENT

  131. Atomic Visibility compared
    | | Fractured Reads | Item Anti-Dependency Cycles | Anti-Dependency Cycles |
    |---|---|---|---|
    | Serializability | Prevents | Prevents | Prevents |
    | Snapshot Isolation | Prevents | Prevents | Doesn’t prevent |
    | Atomic Visibility via Read Atomic | Prevents | Doesn’t prevent | Doesn’t prevent |
    | Eventual Consistency | Doesn’t prevent | Doesn’t prevent | Doesn’t prevent |
    Require coordination to prevent! [VLDB 2014]
    WANT TO PREVENT

  134. Atomic Visibility
    is sufficient to correctly maintain:
    social graph structure

  135. Also applies to other
    relationships

  136. Also applies to other
    relationships
    an attending
    doctor
    should
    have
    each
    patient

  137. Atomic Visibility
    is sufficient to correctly maintain:
    social graph structure

  138. Atomic Visibility
    is sufficient to correctly maintain:
    referential integrity
    secondary indexes
    materialized views
    social graph structure

  139. Atomic Visibility
    is sufficient to correctly maintain:
    referential integrity
    secondary indexes
    materialized views
    social graph structure
    despite being weaker than serializability

  140. Atomic Visibility via Locking

  141. Atomic Visibility via Locking
    X=0 Y=0
    X = 1
    W
    Y = 1
    W

  142. Atomic Visibility via Locking
    X = 1
    W
    Y = 1
    W
    X=1 Y=1

  143. Atomic Visibility via Locking
    X = 1
    R
    Y = 1
    R
    X = 1
    W
    Y = 1
    W
    X=1 Y=1

  144. Atomic Visibility via Locking
    X = 1
    W
    Y = 1
    W
    Y=0
    X=1

  145. Atomic Visibility via Locking
    X = ?
    R
    X = 1
    W
    Y = 1
    W
    Y=0
    Y = ?
    R
    X=1

  146. Atomic Visibility via Locking
    X = ?
    R
    X = 1
    W
    Y = 1
    W
    Y=0
    Y = ?
    R
    X=1
    Server 1001 Server 1002

  149. LOCKING
    W(Y)
    R(X)
    R(Y)
    W(X)
    TIME

  150. LOCKING
    W(Y)
    R(X)
    R(Y)
    W(X)
    ATOMICITY VIOLATED!
    TIME

  152. LOCKING
    W(Y)
    R(X)
    R(Y)
    W(X)
    W(Y)
    R(X)
    R(Y)
    W(X)
    ATOMICITY VIOLATED!
    TIME
    OPTIMISTIC

  153. Y
    X
    LOCKING
    W(Y)
    R(X)
    R(Y)
    W(X)
    W(Y)
    R(X)
    R(Y)
    W(X)
    ATOMICITY VIOLATED!
    TIME
    OPTIMISTIC
    VALIDATE ATOMICITY

  154. Y
    X
    LOCKING
    VIOLATED?
    ABORT
    W(Y)
    R(X)
    R(Y)
    W(X)
    W(Y)
    R(X)
    R(Y)
    W(X)
    ATOMICITY VIOLATED!
    TIME
    OPTIMISTIC
    VALIDATE ATOMICITY

  156. Y
    X
    LOCKING
    VIOLATED?
    ABORT
    W(Y)
    R(X)
    R(Y)
    W(X)
    W(Y)
    R(X)
    R(Y)
    W(X)
    ATOMICITY VIOLATED!
    TIME
    OPTIMISTIC
    VALIDATE ATOMICITY
    BOTH RELY
    ON
    COORDINATION
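
To make the coordination cost concrete, here is a toy sketch of the locking approach. It assumes a single in-process `threading.Lock` standing in for the locks that would be held across Server 1001 and Server 1002; `LockedStore` is a hypothetical class, not the design of any system in the talk.

```python
import threading

class LockedStore:
    """Toy store where one lock guards both X and Y."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {"X": 0, "Y": 0}

    def write_all(self, updates):
        # The writer holds the lock across all of its writes...
        with self._lock:
            for key, value in updates.items():
                self._data[key] = value

    def read_all(self, keys):
        # ...so a reader has to wait, and never sees X=1 together with Y=0.
        with self._lock:
            return {key: self._data[key] for key in keys}

store = LockedStore()
writer = threading.Thread(target=store.write_all, args=({"X": 1, "Y": 1},))
writer.start()
snapshot = store.read_all(["X", "Y"])   # either the old or the new state
writer.join()
print(snapshot)
```

The reader is correct only because it can be blocked by the writer, which is exactly the waiting that the rest of the talk tries to avoid.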

  157. Due to coordination overheads…

  158. Due to coordination overheads…
    Facebook Tao
    Google Megastore
    LinkedIn Espresso
    Amazon DynamoDB
    Apache Cassandra
    Basho Riak
    Yahoo! PNUTS
    Google App Engine

  159. Due to coordination overheads…
    Facebook Tao
    Google Megastore
    LinkedIn Espresso
    Amazon DynamoDB
    Apache Cassandra
    Basho Riak
    Yahoo! PNUTS
    Google App Engine
    …consciously choose to violate atomic visibility

  160. Due to coordination overheads…
    Facebook Tao
    Google Megastore
    LinkedIn Espresso
    Amazon DynamoDB
    Apache Cassandra
    Basho Riak
    Yahoo! PNUTS
    Google App Engine
    …consciously choose to violate atomic visibility
    “[Tao] explicitly favors efficiency and availability over consistency…[an edge] may exist without an inverse; these hanging associations are scheduled for repair by an asynchronous job.”

  161. Our contributions:
    to maintain
    social graph structure
    referential integrity
    secondary indexes
    materialized views
    [SIGMOD 2014, selected for “Best of SIGMOD” ACM TODS]

  162. Our contributions:
    1. A new model: atomic visibility (via Read Atomic isolation) is (provably) sufficient
    to maintain
    social graph structure
    referential integrity
    secondary indexes
    materialized views
    [SIGMOD 2014, selected for “Best of SIGMOD” ACM TODS]

  163. Our contributions:
    1. A new model: atomic visibility (via Read Atomic isolation) is (provably) sufficient
    to maintain
    social graph structure
    referential integrity
    secondary indexes
    materialized views
    2. Efficient protocols: RAMP transactions enforce atomic visibility without coordination
    [SIGMOD 2014, selected for “Best of SIGMOD” ACM TODS]

  164. WHAT THE APPLICATION SAYS
    “accept
    friend
    request”
    “update
    index
    entry”
    write
    write
    read
    write
    read
    write
    read
    read
    read
    read
    read
    write
    write
    read
    WHAT THE DATABASE HEARS
    read
    read read write
    read
    write

  165. “accept
    friend
    request”
    “update
    index
    entry”
    write
    write
    read
    write
    read
    write
    read
    read
    read
    read
    read
    write
    write
    write
    read

  166. “accept
    friend
    request”
    “update
    index
    entry”
    ATOMIC VISIBILITY
    write
    write
    read
    write
    read
    write
    read
    read
    read
    read
    read
    write
    write
    write
    read

  167. “accept
    friend
    request”
    “update
    index
    entry”
    RAMP
    TRANSACTION
    ATOMIC VISIBILITY
    write
    write
    read
    write
    read
    write
    read
    read
    read
    read
    read
    write
    write
    write
    read

  168. “accept
    friend
    request”
    “update
    index
    entry”
    RAMP
    TRANSACTION
    RAMP
    TRANSACTION
    ATOMIC VISIBILITY
    write
    write
    read
    write
    read
    write
    read
    read
    read
    read
    read
    write
    write
    write
    read

  169. ATOMICITY
    VIOLATED!
    Y
    X
    LOCKING
    W(Y)
    R(X)
    R(Y)
    W(X)
    W(Y)
    R(X)
    R(Y)
    W(X)
    OPTIMISTIC
    TIME
    VIOLATED?
    ABORT
    VALIDATE
    ATOMICITY

  170. ATOMICITY
    VIOLATED!
    Y
    X
    LOCKING
    W(Y)
    R(X)
    R(Y)
    W(X)
    W(Y)
    R(X)
    R(Y)
    W(X)
    OPTIMISTIC RAMP TRANSACTIONS
    TIME
    VIOLATED?
    ABORT
    VALIDATE
    ATOMICITY

  171. ATOMICITY
    VIOLATED!
    Y
    X
    LOCKING
    W(Y)
    R(X)
    R(Y)
    W(X)
    W(Y)
    R(X)
    R(Y)
    W(X)
    OPTIMISTIC RAMP TRANSACTIONS
    TIME
    Without coordination, atomicity violations will (initially) occur!
    VIOLATED?
    ABORT
    VALIDATE
    ATOMICITY

  172. ATOMICITY
    VIOLATED!
    Y
    X
    LOCKING
    W(Y)
    R(X)
    R(Y)
    W(X)
    W(Y)
    R(X)
    R(Y)
    W(X)
    OPTIMISTIC RAMP TRANSACTIONS
    W(Y)
    R(X)
    R(Y)
    W(X)
    TIME
    Without coordination, atomicity violations will (initially) occur!
    VIOLATED?
    ABORT
    VALIDATE
    ATOMICITY

  173. ATOMICITY
    VIOLATED!
    Y
    X
    LOCKING
    W(Y)
    R(X)
    R(Y)
    W(X)
    W(Y)
    R(X)
    R(Y)
    W(X)
    OPTIMISTIC RAMP TRANSACTIONS
    W(Y)
    R(X)
    R(Y)
    W(X)
    TIME
    Without coordination, atomicity violations will (initially) occur!
    Don’t panic! Don’t abort!
    VIOLATED?
    ABORT
    VALIDATE
    ATOMICITY

  174. ATOMICITY
    VIOLATED!
    Y
    X
    LOCKING
    W(Y)
    R(X)
    R(Y)
    W(X)
    W(Y)
    R(X)
    R(Y)
    W(X)
    OPTIMISTIC RAMP TRANSACTIONS
    W(Y)
    R(X)
    R(Y)
    W(X)
    DETECT
    RACES
    TIME
    Without coordination, atomicity violations will (initially) occur!
    Don’t panic! Don’t abort!
    VIOLATED?
    ABORT
    VALIDATE
    ATOMICITY

  175. ATOMICITY
    VIOLATED!
    Y
    X
    LOCKING
    W(Y)
    R(X)
    R(Y)
    W(X)
    W(Y)
    R(X)
    R(Y)
    W(X)
    OPTIMISTIC RAMP TRANSACTIONS
    W(Y)
    R(X)
    R(Y)
    W(X)
    REPAIR
    ATOMICITY
    DETECT
    RACES
    TIME
    Without coordination, atomicity violations will (initially) occur!
    Don’t panic! Don’t abort!
    VIOLATED?
    ABORT
    VALIDATE
    ATOMICITY

  176. ATOMICITY
    VIOLATED!
    Y
    X
    LOCKING
    W(Y)
    R(X)
    R(Y)
    W(X)
    W(Y)
    R(X)
    R(Y)
    W(X)
    OPTIMISTIC RAMP TRANSACTIONS
    W(Y)
    R(X)
    R(Y)
    W(X)
    REPAIR
    ATOMICITY
    DETECT
    RACES
    R(Y)
    TIME
    Without coordination, atomicity violations will (initially) occur!
    Don’t panic! Don’t abort!
    VIOLATED?
    ABORT
    VALIDATE
    ATOMICITY

  177. RAMP
    TRANSACTIONS
    REPAIR
    ATOMICITY
    DETECT
    RACES

  178. Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES

  180. Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    X = 1
    W
    Y = 1
    W
    Server 1001
    X=0 Y=0
    Server 1002

  181. Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    X = 1
    W
    Y = 1
    W
    Server 1001
    X=0 Y=0
    Server 1002
    X=1

  182. Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    X = 1
    W
    Y = 1
    W
    Server 1001
    X=0 Y=0
    Server 1002
    X=1
    X = ?
    R
    Y = ?
    R
    X = 1
    Y = 0

  185. Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    X = 1
    W
    Y = 1
    W
    Server 1001
    X=0 Y=0
    Server 1002
    X=1
    X = ?
    R
    Y = ?
    R
    X = 1
    Y = 0
    via intention
    metadata

  186. Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    X = 1
    W
    Y = 1
    W
    Server 1001
    Y=0
    Server 1002
    X=1
    via intention
    metadata

  187. Y=0 T0 {}
    intention
    ·
    Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    X = 1
    W
    Y = 1
    W
    X=1 T1 {Y}
    intention
    · T0
    intention
    ·
    via intention
    metadata

  188. value
    Y=0 T0 {}
    intention
    ·
    Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    X = 1
    W
    Y = 1
    W
    value
    X=1 T1 {Y}
    intention
    · T0
    intention
    ·
    via intention
    metadata

  190. value
    Y=0 T0 {}
    intention
    ·
    Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    X = 1
    W
    Y = 1
    W
    value
    X=1 T1 {Y}
    intention
    · T0
    intention
    ·
    via intention
    metadata
    “A transaction called T1 wrote this and also wrote to Y”
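
One way to picture the intention metadata is as a small record stored next to each value. The `Version` dataclass and `describe` helper below are hypothetical names chosen for illustration; the fields mirror the slide's `X=1 T1 {Y}` notation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Version:
    value: int
    txn: str               # writing transaction, e.g. "T1"
    intention: frozenset   # other items written by the same transaction

# The two records shown on the slide.
x = Version(value=1, txn="T1", intention=frozenset({"Y"}))
y = Version(value=0, txn="T0", intention=frozenset())

def describe(item, version):
    """Spell out what the metadata attached to one version says."""
    if version.intention:
        others = ", ".join(sorted(version.intention))
        return f"A transaction called {version.txn} wrote {item} and also wrote to {others}"
    return f"A transaction called {version.txn} wrote {item} and nothing else"

print(describe("X", x))
```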

  191. Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    X = 1
    W
    Y = 1
    W
    value
    X=1 T1 {Y}
    intention
    · value
    Y=0 T0 {}
    intention
    ·
    via intention
    metadata

  193. Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    X = 1
    W
    Y = 1
    W
    value
    X=1 T1 {Y}
    intention
    · value
    Y=0 T0 {}
    intention
    ·
    via intention
    metadata
    X = ?
    R
    Y = ?
    R

  194. Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    value
    X=1 T1 {Y}
    intention
    ·
    via intention
    metadata
    X = ?
    R
    Y = ?
    R
    X = 1
    W
    Y = 1
    W
    value
    Y=0 T0 {}
    intention
    ·

  195. Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    value
    X=1 T1 {Y}
    intention
    ·
    via intention
    metadata
    X = ?
    R
    R
    X = 1
    W
    Y = 1
    W
    X = 1
    Y = 0
    value
    Y=0 T0 {}
    intention
    ·
    “A transaction called T1 wrote this and also wrote to Y”

  197. Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    value
    X=1 T1 {Y}
    intention
    ·
    via intention
    metadata
    X = ?
    R
    R
    X = 1
    W
    Y = 1
    W
    X = 1
    Y = 0
    Where is T1’s write to Y?
    value
    Y=0 T0 {}
    intention
    ·

  200. Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    value
    X=1 T1 {Y}
    intention
    ·
    via intention
    metadata
    X = ?
    R
    R
    X = 1
    W
    Y = 1
    W
    X = 1
    Y = 0
    Where is T1’s write to Y?
    value
    Y=0 T0 {}
    intention
    ·
    via
    multi-versioning,
    ready bit
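
The question "Where is T1’s write to Y?" can be answered mechanically from the metadata. The sketch below assumes transaction names map to comparable integer timestamps (T0 as 0, T1 as 1); `find_missing` is a hypothetical helper.

```python
def find_missing(reads):
    """reads: item -> (value, ts, intention).

    Returns item -> timestamp for every item whose returned version is older
    than a sibling write that some other returned version refers to.
    """
    required = {}
    for item, (_, ts, intention) in reads.items():
        for sibling in intention:
            if sibling in reads and reads[sibling][1] < ts:
                required[sibling] = max(ts, required.get(sibling, ts))
    return required

# The race from the slide: X=1 written by T1 (which also wrote Y), Y=0 from T0.
reads = {"X": (1, 1, {"Y"}), "Y": (0, 0, set())}
print(find_missing(reads))
```

A non-empty result tells the reader exactly which version to ask for next, so it can repair the read instead of aborting.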

  201. Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    X = 1
    W
    Y = 1
    W
    value
    X=1 T1 {Y}
    intention
    ·
    via intention
    metadata
    via
    multi-versioning,
    ready bit
    value
    Y=0 T0 {}
    intention
    ·

  202. Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    via intention
    metadata
    value intention
    X=0 T0 {}
    · value intention
    Y=0 T0 {}
    ·
    X = 1
    W
    Y = 1
    W
    via
    multi-versioning,
    ready bit

  203. Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    via intention
    metadata
    value intention
    X=0 T0 {}
    · value intention
    Y=0 T0 {}
    ·
    X = 1
    W
    Y = 1
    W
    ready
    ready
    via
    multi-versioning,
    ready bit

  204. Y=1 T1 {X}
    ·
    X=1 T1 {Y}
    ·
    Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    via intention
    metadata
    value intention
    X=0 T0 {}
    · value intention
    Y=0 T0 {}
    ·
    X = 1
    W
    Y = 1
    W
    ready
    ready
    1.) Place write on each server.
    via
    multi-versioning,
    ready bit

  205. Y=1 T1 {X}
    ·
    X=1 T1 {Y}
    ·
    Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    via intention
    metadata
    value intention
    X=0 T0 {}
    · value intention
    Y=0 T0 {}
    ·
    X = 1
    W
    Y = 1
    W
    ready
    ready
    1.) Place write on each server.
    2.) Set ready bit on each
    write on server.
    via
    multi-versioning,
    ready bit

  206. Y=1 T1 {X}
    ·
    X=1 T1 {Y}
    ·
    Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    via intention
    metadata
    value intention
    X=0 T0 {}
    · value intention
    Y=0 T0 {}
    ·
    X = 1
    W
    Y = 1
    W
    ready
    ready
    1.) Place write on each server.
    2.) Set ready bit on each
    write on server.
    via
    multi-versioning,
    ready bit
    Ready bit invariant: if ready bit is set, all writes in transaction
    are present on their respective servers
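
The two write steps can be sketched as two passes over the servers. This is an in-memory illustration under stated assumptions: `Server`, `prepare`, `commit`, and `ramp_write` are hypothetical names, timestamps are integers, and each loop stands in for one round trip.

```python
class Server:
    def __init__(self):
        self.versions = {}   # (item, ts) -> (value, intention)
        self.ready = {}      # item -> highest ts whose ready bit is set

    def prepare(self, item, ts, value, intention):
        # Step 1: place the write; it is stored but not yet marked ready.
        self.versions[(item, ts)] = (value, frozenset(intention))

    def commit(self, item, ts):
        # Step 2: set the ready bit for this write.
        self.ready[item] = max(ts, self.ready.get(item, ts))

def ramp_write(servers, ts, writes):
    """servers: item -> Server; writes: item -> value."""
    items = set(writes)
    for item, value in writes.items():   # round trip 1: place every write
        servers[item].prepare(item, ts, value, items - {item})
    for item in writes:                  # round trip 2: set every ready bit
        servers[item].commit(item, ts)

servers = {"X": Server(), "Y": Server()}
ramp_write(servers, 1, {"X": 1, "Y": 1})
print(servers["X"].versions[("X", 1)], servers["Y"].ready)
```

Because no ready bit is set until every write has been placed, the ready bit invariant holds by construction in this sketch.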

  207. Y=1 T1 {X}
    ·
    X=1 T1 {Y}
    ·
    Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    via intention
    metadata
    via
    multi-versioning
    value intention
    X=0 T0 {}
    · value intention
    Y=0 T0 {}
    ·
    X = 1
    W
    Y = 1
    W
    ready
    ready
    Ready bit invariant: if ready bit is set, all writes in transaction
    are present on their respective servers

  209. Y=1 T1 {X}
    ·
    X=1 T1 {Y}
    ·
    Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    via intention
    metadata
    via
    multi-versioning
    value intention
    X=0 T0 {}
    · value intention
    Y=0 T0 {}
    ·
    X = 1
    W
    Y = 1
    W
    ready
    ready
    X = ?
    R
    Y = ?
    R
    Ready bit invariant: if ready bit is set, all writes in transaction
    are present on their respective servers

  210. Y=1 T1 {X}
    ·
    X=1 T1 {Y}
    ·
    Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    via intention
    metadata
    via
    multi-versioning
    value intention
    X=0 T0 {}
    · value intention
    Y=0 T0 {}
    ·
    ready
    ready
    X = ?
    R
    Y = ?
    R
    Ready bit invariant: if ready bit is set, all writes in transaction
    are present on their respective servers

  211. Y=1 T1 {X}
    ·
    X=1 T1 {Y}
    ·
    Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    via intention
    metadata
    via
    multi-versioning
    value intention
    X=0 T0 {}
    · value intention
    Y=0 T0 {}
    ·
    ready
    ready
    X = ?
    R
    Y = ?
    R
    1.) Fetch “highest” ready versions.
    Ready bit invariant: if ready bit is set, all writes in transaction
    are present on their respective servers

  212. Y=1 T1 {X}
    ·
    X=1 T1 {Y}
    ·
    Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    via intention
    metadata
    via
    multi-versioning
    value intention
    X=0 T0 {}
    · value intention
    Y=0 T0 {}
    ·
    ready
    ready
    X = ?
    R
    Y = ?
    R
    1.) Fetch “highest” ready versions.
    X = 1
    Ready bit invariant: if ready bit is set, all writes in transaction
    are present on their respective servers

  213. Y=1 T1 {X}
    ·
    X=1 T1 {Y}
    ·
    Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    via intention
    metadata
    via
    multi-versioning
    value intention
    X=0 T0 {}
    · value intention
    Y=0 T0 {}
    ·
    ready
    ready
    X = ?
    R
    Y = ?
    R
    1.) Fetch “highest” ready versions.
    X = 1
    Y = 0
    Ready bit invariant: if ready bit is set, all writes in transaction
    are present on their respective servers

  214. Y=1 T1 {X}
    ·
    X=1 T1 {Y}
    ·
    Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    via intention
    metadata
    via
    multi-versioning
    value intention
    X=0 T0 {}
    · value intention
    Y=0 T0 {}
    ·
    ready
    ready
    X = ?
    R
    Y = ?
    R
    1.) Fetch “highest” ready versions.
    2.) Fetch any missing writes
    using metadata.
    X = 1
    Y = 0
    Ready bit invariant: if ready bit is set, all writes in transaction
    are present on their respective servers

  216. Y=1 T1 {X}
    ·
    X=1 T1 {Y}
    ·
    Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    via intention
    metadata
    via
    multi-versioning
    value intention
    X=0 T0 {}
    · value intention
    Y=0 T0 {}
    ·
    ready
    ready
    X = ?
    R
    Y = ?
    R
    1.) Fetch “highest” ready versions.
    2.) Fetch any missing writes
    using metadata.
    X = 1
    Y = 0
    Y = 1
    Ready bit invariant: if ready bit is set, all writes in transaction
    are present on their respective servers
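
The two read steps can be sketched against the server state shown on the slides. The layout below is an assumption made for illustration (integer timestamps, one dict for all versions, `last_ready` for the ready bits), and `ramp_read` is a hypothetical name.

```python
# State mid-commit: T1's write is ready on X's server but only placed
# (not yet ready) on Y's server.
versions = {
    ("X", 0): (0, frozenset()), ("X", 1): (1, frozenset({"Y"})),
    ("Y", 0): (0, frozenset()), ("Y", 1): (1, frozenset({"X"})),
}
last_ready = {"X": 1, "Y": 0}

def ramp_read(items):
    # Step 1: fetch the "highest" ready version of each item.
    got = {item: last_ready[item] for item in items}
    # Use the intention metadata to find the timestamp each item must reflect.
    need = dict(got)
    for item, ts in got.items():
        for sibling in versions[(item, ts)][1]:
            if sibling in need and need[sibling] < ts:
                need[sibling] = ts
    # Step 2 (only after a race): fetch the missing writes by timestamp.
    rounds = 1 if need == got else 2
    values = {item: versions[(item, need[item])][0] for item in items}
    return values, rounds

values, rounds = ramp_read(["X", "Y"])
print(values, rounds)
```

The second fetch never has to wait: X's version is ready, so by the ready bit invariant T1's write to Y is already present on its server.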

  217. RAMP Details
    | Write RTT | Read RTT (best case) | Read RTT (worst case) | Metadata |
    |---|---|---|---|
    | 2 | 1 | 2 | O(txn len) write set summary |
    DETECT RACES via intention metadata
    REPAIR ATOMICITY via multi-versioning, ready bit

  219. RAMP Details
    | Write RTT | Read RTT (best case) | Read RTT (worst case) | Metadata |
    |---|---|---|---|
    | 2 | 1 | 2 | O(txn len) write set summary |
    DETECT RACES via intention metadata
    REPAIR ATOMICITY via multi-versioning, ready bit
    Ensures that readers never have to wait

    View full-size slide

  222. RAMP Details
    Write RTT: 2
    READ RTT (best case): 1
    READ RTT (worst case): 2
    METADATA: O(txn len) write set summary
    DETECT RACES via intention metadata
    REPAIR ATOMICITY via multi-versioning, ready bit
    Ensures that readers never have to wait
    2nd RTT for repair, in the event of a race

  225. RAMP Details
    Write RTT: 2
    READ RTT (best case): 1
    READ RTT (worst case): 2
    METADATA: O(txn len) write set summary
    DETECT RACES via intention metadata
    REPAIR ATOMICITY via multi-versioning, ready bit
    Transaction IDs: sequence number and client ID
    » Also use to order overwrites!

  226. RAMP Details
    Write RTT: 2
    READ RTT (best case): 1
    READ RTT (worst case): 2
    METADATA: O(txn len) write set summary
    DETECT RACES via intention metadata
    REPAIR ATOMICITY via multi-versioning, ready bit
    Garbage collection of old versions:
    » Set timeout (TTL) for overwritten versions
    » Limit read transaction duration to TTL
    Transaction IDs: sequence number and client ID
    » Also use to order overwrites!
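The transaction ID and garbage collection notes above can be sketched as follows. This is a hedged illustration: the `(sequence number, client ID)` tuple encoding, the `Client` class, and the helpers `latest` and `collect_garbage` are assumptions made for this example, and times are plain numbers rather than real clocks.

```python
import itertools


class Client:
    """Hypothetical client that mints transaction IDs locally, with no coordination."""

    def __init__(self, client_id):
        self.client_id = client_id
        self._seq = itertools.count(1)

    def next_txn_id(self):
        # (sequence number, client ID): unique across clients, totally ordered.
        return (next(self._seq), self.client_id)


def latest(versions):
    """Order overwrites by transaction ID: the highest ID wins."""
    return versions[max(versions)]


def collect_garbage(versions, overwritten_at, now, ttl):
    """Keep versions that are current or were overwritten at most `ttl` ago."""
    return {
        txn_id: value
        for txn_id, value in versions.items()
        if txn_id not in overwritten_at or now - overwritten_at[txn_id] <= ttl
    }


a, b = Client(1), Client(2)
t1, t2, t3 = a.next_txn_id(), b.next_txn_id(), a.next_txn_id()
versions = {t1: "a1", t2: "b1", t3: "a2"}

# t1 was overwritten at time 0 and t2 at time 5; t3 is the current version.
kept = collect_garbage(versions, {t1: 0.0, t2: 5.0}, now=10.0, ttl=6.0)
print(latest(versions), sorted(kept))
```

Limiting read transactions to the TTL matters because a reader's second round asks for a specific older version, which must still exist when the request arrives.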

  229. RAMP Details
    Write RTT: 2
    READ RTT (best case): 1
    READ RTT (worst case): 2
    METADATA: O(txn len) write set summary
    DETECT RACES via intention metadata
    REPAIR ATOMICITY via multi-versioning, ready bit
    Can we use less metadata for intent?

  230. RAMP Variants
    Algorithm    Write RTT  READ RTT (best case)  READ RTT (worst case)  METADATA
    RAMP-Fast    2          1                     2                      O(txn len) write set summary
    RAMP-Small   2          2                     2                      O(1) timestamp
    RAMP-Hybrid  2          1+ε                   2                      O(1) Bloom filter
    DETECT RACES via intention metadata
    REPAIR ATOMICITY via multi-versioning, ready bit

  233. RAMP Variants
    Algorithm    Write RTT  READ RTT (best case)  READ RTT (worst case)  METADATA
    RAMP-Fast    2          1                     2                      O(txn len) write set summary
    RAMP-Small   2          2                     2                      O(1) timestamp
    RAMP-Hybrid  2          1+ε                   2                      O(1) Bloom filter
    DETECT RACES via intention metadata
    REPAIR ATOMICITY via multi-versioning, ready bit
    Always attempt to repair… no metadata needed!

  234. RAMP Variants
    Algorithm    Write RTT  READ RTT (best case)  READ RTT (worst case)  METADATA
    RAMP-Fast    2          1                     2                      O(txn len) write set summary
    RAMP-Small   2          2                     2                      O(1) timestamp
    RAMP-Hybrid  2          1+ε                   2                      O(B(ε)) Bloom filter
    DETECT RACES via intention metadata
    REPAIR ATOMICITY via multi-versioning, ready bit

  235. RAMP Variants
    Algorithm    Write RTT  READ RTT (best case)  READ RTT (worst case)  METADATA
    RAMP-Fast    2          1                     2                      O(txn len) write set summary
    RAMP-Small   2          2                     2                      O(1) timestamp
    RAMP-Hybrid  2          1+ε                   2                      O(B(ε)) Bloom filter
    DETECT RACES via intention metadata
    REPAIR ATOMICITY via multi-versioning, ready bit
    Bloom filter summarizes intent
    False positives: extra read RTTs
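To make the Bloom filter trade-off concrete, here is a small sketch of a filter that summarizes a transaction's write set. The `BloomFilter` class, its sizes, and the SHA-256 based hashing are choices made for this example, not details from the talk.

```python
import hashlib


class BloomFilter:
    """Fixed-size summary of a set of keys; may report false positives, never false negatives."""

    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0                      # bit array stored in one integer

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


# Writer side: attach a constant-size summary of the write set {X, Y} to each version.
summary = BloomFilter()
for key in ("X", "Y"):
    summary.add(key)

# Reader side: a hit means "this transaction may also have written the key",
# so the reader issues a second-round fetch for it.
print(summary.might_contain("X"), summary.might_contain("Y"))
```

A key that was never added can still hit by chance; in that case the reader pays for a second round it did not need, which is the extra read RTT noted on the slide. A key that was added always hits, so a racing write is never missed.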

  236. RAMP Overview
    SYSTEM KNOWS SEMANTICS
    ⇒ CLIENTS CAN COOPERATE
    WITHOUT WAITING FOR EACH OTHER

  237. RAMP Overview
    SYSTEM KNOWS SEMANTICS
    ⇒ CLIENTS CAN COOPERATE
    WITHOUT WAITING FOR EACH OTHER
    KEY IDEA: DETECT RACES
    Storing intention in metadata allows readers to check for missing writes

  238. RAMP Overview
    SYSTEM KNOWS SEMANTICS
    ⇒ CLIENTS CAN COOPERATE
    WITHOUT WAITING FOR EACH OTHER
    KEY IDEA: DETECT RACES
    Storing intention in metadata allows readers to check for missing writes
    KEY IDEA: REPAIR ATOMICITY
    Transactions “hide” writes until others can reliably complete them (ready bit)

  239. RAMP Overview
    SYSTEM KNOWS SEMANTICS
    ⇒ CLIENTS CAN COOPERATE
    WITHOUT WAITING FOR EACH OTHER
    KEY IDEA: DETECT RACES
    Storing intention in metadata allows readers to check for missing writes
    KEY IDEA: REPAIR ATOMICITY
    Transactions “hide” writes until others can reliably complete them (ready bit)
    coordination free: transactions do not wait for any others to complete
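The write side of the two key ideas can be sketched as a two-round protocol: first place every write (with its intention metadata) on its partition, then set the ready bits. `Partition` and `ramp_write` are hypothetical names for this illustration, and network round trips are modeled as plain loops.

```python
class Partition:
    """Hypothetical server holding all versions of one key."""

    def __init__(self):
        self.versions = {}     # ts -> (value, sibling keys)
        self.ready = set()     # timestamps whose ready bit is set

    def prepare(self, ts, value, siblings):
        # The version is stored but stays hidden from "highest ready" reads.
        self.versions[ts] = (value, siblings)

    def commit(self, ts):
        self.ready.add(ts)

    def visible(self):
        """Value of the highest ready version, or None if nothing is ready."""
        if not self.ready:
            return None
        return self.versions[max(self.ready)][0]


def ramp_write(partitions, ts, writes):
    keys = frozenset(writes)
    # Round 1 (prepare): every write lands on its partition with metadata
    # naming the other keys in the transaction.
    for key, value in writes.items():
        partitions[key].prepare(ts, value, keys - {key})
    # Round 2 (commit): set ready bits only after all writes are present,
    # which is what upholds the ready bit invariant.
    for key in writes:
        partitions[key].commit(ts)


partitions = {"X": Partition(), "Y": Partition()}
ramp_write(partitions, ts=0, writes={"X": 0, "Y": 0})
ramp_write(partitions, ts=1, writes={"X": 1, "Y": 1})

hidden = Partition()
hidden.prepare(1, 1, frozenset())      # prepared but never committed
print(partitions["X"].visible(), partitions["Y"].visible(), hidden.visible())
```

The writer never waits on readers or other writers; a reader that catches the transaction between the two rounds can complete it on its own using the stored metadata.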

  240. RAMP Evaluation

  242. RAMP Evaluation
    1. What is the overhead of the RAMP protocols?

  243. RAMP Evaluation
    1. What is the overhead of the RAMP protocols?
    2. What is the benefit of coordination-free execution?

  244. RAMP Evaluation
    1. What is the overhead of the RAMP protocols?
    2. What is the benefit of coordination-free execution?
    3. How do the RAMP protocols scale?

  245. RAMP Evaluation
    evaluated on Amazon EC2 cr1.8xlarge servers
    (1-100 servers; default: 5)
    1. What is the overhead of the RAMP protocols?
    2. What is the benefit of coordination-free execution?
    3. How do the RAMP protocols scale?

  246. YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn
    [Plot: throughput (txn/s, 0 to 180K) vs. concurrent clients (0 to 10,000)]

  247. YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn
    [Plot: throughput (txn/s, 0 to 180K) vs. concurrent clients (0 to 10,000)]
    Series: No Concurrency Control

  248. YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn
    [Plot: throughput (txn/s, 0 to 180K) vs. concurrent clients (0 to 10,000)]
    Series: No Concurrency Control (doesn’t enforce atomic visibility)

  249. YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn
    [Plot: throughput (txn/s, 0 to 180K) vs. concurrent clients (0 to 10,000)]
    Series: No Concurrency Control, Serializable 2PL

  250. YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn
    [Plot: throughput (txn/s, 0 to 180K) vs. concurrent clients (0 to 10,000)]
    Series: No Concurrency Control, Serializable 2PL, Write Locks Only

  251. YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn
    [Plot: throughput (txn/s, 0 to 180K) vs. concurrent clients (0 to 10,000)]
    Series: No Concurrency Control, Serializable 2PL, Write Locks Only, RAMP-Fast

  252. YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn
    [Plot: throughput (txn/s, 0 to 180K) vs. concurrent clients (0 to 10,000)]
    Series: No Concurrency Control, Serializable 2PL, Write Locks Only, RAMP-Fast
    RAMP-Fast: within 5% of baseline

  253. YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn
    [Plot: throughput (txn/s, 0 to 180K) vs. concurrent clients (0 to 10,000)]
    Series: No Concurrency Control, Serializable 2PL, Write Locks Only, RAMP-Fast, RAMP-Small

  254. YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn
    [Plot: throughput (txn/s, 0 to 180K) vs. concurrent clients (0 to 10,000)]
    Series: No Concurrency Control, Serializable 2PL, Write Locks Only, RAMP-Fast, RAMP-Small
    RAMP-Small: always needs 2 RTT reads

  255. YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn
    [Plot: throughput (txn/s, 0 to 180K) vs. concurrent clients (0 to 10,000)]
    Series: No Concurrency Control, Serializable 2PL, Write Locks Only, RAMP-Fast, RAMP-Small, RAMP-Hybrid

  256. YCSB: uniform access, 1M items, 4 items/txn, 95% reads
    [Plot: throughput (ops/s, 0 to 8M) vs. number of servers (0 to 100)]

  257. YCSB: uniform access, 1M items, 4 items/txn, 95% reads
    [Plot: throughput (ops/s, 0 to 8M) vs. number of servers (0 to 100)]
    Series: No Concurrency Control

  258. YCSB: uniform access, 1M items, 4 items/txn, 95% reads
    [Plot: throughput (ops/s, 0 to 8M) vs. number of servers (0 to 100)]
    Series: No Concurrency Control, RAMP-Fast, RAMP-Small, RAMP-Hybrid

  259. “accept
    friend
    request”
    “update
    index
    entry”
    RAMP
    TRANSACTION
    RAMP
    TRANSACTION
    ATOMIC VISIBILITY
    write
    write
    read
    write
    read
    write
    read
    read
    read
    read
    read
    write
    write
    write
    read

  260. COORDINATION AVOIDANCE
    GUARANTEED SAFETY WITHOUT COORDINATION
    Serializability: COORDINATION REQUIRED, GUARANTEED SAFETY
    Eventual Consistency: COORDINATION FREE, NO SAFETY
    MORE SEMANTICS, MORE SAFETY
    Atomic Visibility (SIGMOD14)
    Database Constraints (VLDB15, SIGMOD15)
    Model Prediction and Training (CIDR15, TBA)
    Weak Isolation (HotOS13, VLDB14)
    Causality (SOCC12, SIGMOD13)
    PBS (VLDB12, VLDBJ14, SIGMOD13, CACM14)
    COORDINATION FREE

  263. write read
    write
    read
    write
    write
    read
    write
    write
    write
    read
    write
    WHAT THE DATABASE HEARS
    read
    read
    read
    read
    read
    read
    WHAT THE APPLICATION SAYS
    my billing
    application
    is “correct”
    my new
    social app
    “does the
    right thing”

  264. Database users express
    correctness criteria
    via database constraints

  265. “usernames should be unique”
    “account balances should remain positive”
    “there should only be one administrator”
    Database users express
    correctness criteria
    via database constraints

  266. Typical database constraints and operations (SQL)
    Constraint            Operation
    Equality, Inequality  Any
    Generate unique ID    Any
    Specify unique ID     Insert
    >                     Increment
    >                     Decrement
    <                     Decrement
    <                     Increment
    Foreign Key           Insert
    Foreign Key           Delete
    Secondary Indexing    Any
    Materialized Views    Any
    AUTO_INCREMENT        Insert

  267. adopt-a-hydrant
    alchemy_cms
    amahi
    bostonrb
    boxroom
    brevidy
    browsercms
    bucketwise
    calagator
    canvas-lms
    carter
    chiliproject
    citizenry
    comas
    comfortable-mexican-sofa
    communityengine
    copycopter-server
    danbooru
    diaspora
    discourse
    enki
    fat_free_crm
    fedena
    forem
    fulcrum
    gitlab-ci
    gitlabhq
    govsgo
    heaven
    inkwell
    insoshi
    jobsworth
    juvia
    kandan
    linuxfr.org
    lobsters
    lovd-by-less
    nimbleshop
    obtvse
    onebody
    opal
    opencongress
    opengovernment
    openproject
    piggybak
    publify
    radiant
    railscollab
    redmine
    refinerycms
    ror_ecommerce
    rucksack
    saasy
    salor-retail
    selfstarter
    sharetribe
    skyline
    spot-us
    spree
    sprintapp
    squaresquash
    sugar
    teambox
    tracks
    tryshoppe
    wallgig

  268. adopt-a-hydrant
    alchemy_cms
    amahi
    bostonrb
    boxroom
    brevidy
    browsercms
    bucketwise
    calagator
    canvas-lms
    carter
    chiliproject
    citizenry
    comas
    comfortable-mexican-sofa
    communityengine
    copycopter-server
    danbooru
    diaspora
    discourse
    enki
    fat_free_crm
    fedena
    forem
    fulcrum
    gitlab-ci
    gitlabhq
    govsgo
    heaven
    inkwell
    insoshi
    jobsworth
    juvia
    kandan
    linuxfr.org
    lobsters
    lovd-by-less
    nimbleshop
    obtvse
    onebody
    opal
    opencongress
    opengovernment
    openproject
    piggybak
    publify
    radiant
    railscollab
    redmine
    refinerycms
    ror_ecommerce
    rucksack
    saasy
    salor-retail
    selfstarter
    sharetribe
    skyline
    spot-us
    spree
    sprintapp
    squaresquash
    sugar
    teambox
    tracks
    tryshoppe
    wallgig
    zena
    67 projects 1.77M LoC 1957 tables
    [SIGMOD 2015]

  269. adopt-a-hydrant
    alchemy_cms
    amahi
    bostonrb
    boxroom
    brevidy
    browsercms
    bucketwise
    calagator
    canvas-lms
    carter
    chiliproject
    citizenry
    comas
    comfortable-mexican-sofa
    communityengine
    copycopter-server
    danbooru
    diaspora
    discourse
    enki
    fat_free_crm
    fedena
    forem
    fulcrum
    gitlab-ci
    gitlabhq
    govsgo
    heaven
    inkwell
    insoshi
    jobsworth
    juvia
    kandan
    linuxfr.org
    lobsters
    lovd-by-less
    nimbleshop
    obtvse
    onebody
    opal
    opencongress
    opengovernment
    openproject
    piggybak
    publify
    radiant
    railscollab
    redmine
    refinerycms
    ror_ecommerce
    rucksack
    saasy
    salor-retail
    selfstarter
    sharetribe
    skyline
    spot-us
    spree
    sprintapp
    squaresquash
    sugar
    teambox
    tracks
    tryshoppe
    wallgig
    zena
    67 projects 1.77M LoC 1957 tables
    259 total; avg. 0.13 per table
    [SIGMOD 2015]

  270. adopt-a-hydrant
    alchemy_cms
    amahi
    bostonrb
    boxroom
    brevidy
    browsercms
    bucketwise
    calagator
    canvas-lms
    carter
    chiliproject
    citizenry
    comas
    comfortable-mexican-sofa
    communityengine
    copycopter-server
    danbooru
    diaspora
    discourse
    enki
    fat_free_crm
    fedena
    forem
    fulcrum
    gitlab-ci
    gitlabhq
    govsgo
    heaven
    inkwell
    insoshi
    jobsworth
    juvia
    kandan
    linuxfr.org
    lobsters
    lovd-by-less
    nimbleshop
    obtvse
    onebody
    opal
    opencongress
    opengovernment
    openproject
    piggybak
    publify
    radiant
    railscollab
    redmine
    refinerycms
    ror_ecommerce
    rucksack
    saasy
    salor-retail
    selfstarter
    sharetribe
    skyline
    spot-us
    spree
    sprintapp
    squaresquash
    sugar
    teambox
    tracks
    tryshoppe
    wallgig
    zena
    67 projects 1.77M LoC 1957 tables
    9986 total; avg. 5.1 per table
    259 total; avg. 0.13 per table
    [SIGMOD 2015]

  271. CONSTRAINTS 37x MORE COMMON
    adopt-a-hydrant
    alchemy_cms
    amahi
    bostonrb
    boxroom
    brevidy
    browsercms
    bucketwise
    calagator
    canvas-lms
    carter
    chiliproject
    citizenry
    comas
    comfortable-mexican-sofa
    communityengine
    copycopter-server
    danbooru
    diaspora
    discourse
    enki
    fat_free_crm
    fedena
    forem
    fulcrum
    gitlab-ci
    gitlabhq
    govsgo
    heaven
    inkwell
    insoshi
    jobsworth
    juvia
    kandan
    linuxfr.org
    lobsters
    lovd-by-less
    nimbleshop
    obtvse
    onebody
    opal
    opencongress
    opengovernment
    openproject
    piggybak
    publify
    radiant
    railscollab
    redmine
    refinerycms
    ror_ecommerce
    rucksack
    saasy
    salor-retail
    selfstarter
    sharetribe
    skyline
    spot-us
    spree
    sprintapp
    squaresquash
    sugar
    teambox
    tracks
    tryshoppe
    wallgig
    zena
    67 projects 1.77M LoC 1957 tables
    9986 total; avg. 5.1 per table
    259 total; avg. 0.13 per table
    [SIGMOD 2015]

  272. write read
    write
    read
    write
    write
    read
    write
    write
    write
    read
    write
    WHAT THE DATABASE HEARS
    read
    read
    read
    read
    read
    read
    WHAT THE APPLICATION SAYS
    “no
    duplicate
    users”

  273. write read
    write
    read
    write
    write
    read
    write
    write
    write
    read
    write
    WHAT THE DATABASE HEARS
    read
    read
    read
    read
    read
    read
    WHAT THE APPLICATION SAYS
    “no
    duplicate
    users”
    TODAY:
    ENFORCEMENT
    VIA
    COORDINATION

  274. write read
    write
    read
    write
    write
    read
    write
    write
    write
    read
    write
    WHAT THE DATABASE HEARS
    read
    read
    read
    read
    read
    read
    WHAT THE APPLICATION SAYS
    “no
    duplicate
    users”
    CAN WE USE
    CONSTRAINTS
    TO
    AVOID
    COORDINATION?

  275. WHAT THE APPLICATION SAYS
    “no
    duplicate
    users”
    constraint
    WHAT THE DATABASE HEARS
    constraint
    constraint
    constraint
    constraint
    constraint
    constraint
    constraint
    “no
    duplicate
    users”
    CAN WE USE
    CONSTRAINTS
    TO
    AVOID
    COORDINATION?

  276. Key idea: Check if constraints can be violated by
    “merging” independent operations

  277. Key idea: Check if constraints can be violated by
    “merging” independent operations
    ICT: Invariant Confluence Test

  278. CONSTRAINT: User IDs are unique
    OPERATION: Add users
    MERGE: Set union
    Key idea: Check if constraints can be violated by
    “merging” independent operations
    ICT: Invariant Confluence Test

  279. ICT: Invariant Confluence Test
    Key idea: Check if constraints can be violated by
    “merging” independent operations
    CONSTRAINT: User IDs are unique
    OPERATION: Add users
    MERGE: Set union
    {} → add {Stu,ID=1}
    {} → add {Ann,ID=1}
    MERGE → {{Stu,ID=1}, {Ann,ID=1}}
    Constraint violated!

  280. Key idea: Check if constraints can be violated by
    “merging” independent operations
    CONSTRAINT: User IDs are positive
    OPERATION: Add users
    MERGE: Set union
    ICT: Invariant Confluence Test

  281. ICT: Invariant Confluence Test
    Key idea: Check if constraints can be violated by
    “merging” independent operations
    CONSTRAINT: User IDs are positive
    OPERATION: Add users
    MERGE: Set union
    {} → add {Stu,ID=1}
    {} → add {Ann,ID=1}
    MERGE → {{Stu,ID=1}, {Ann,ID=1}}
    Constraint holds!
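The two worked examples (unique IDs versus positive IDs) can be replayed in a few lines of Python. This sketch checks one pair of diverged states only; the actual test must hold for every pair of valid states the operations can reach. The function names and the `(name, id)` tuple encoding are assumptions for illustration.

```python
def unique_ids(db):
    """Invariant: no two users share an ID."""
    ids = [user_id for _, user_id in db]
    return len(ids) == len(set(ids))


def positive_ids(db):
    """Invariant: every user ID is positive."""
    return all(user_id > 0 for _, user_id in db)


def merge_keeps_invariant(invariant, replica_a, replica_b):
    """Merge two independently valid states by set union and re-check the invariant."""
    if not (invariant(replica_a) and invariant(replica_b)):
        raise ValueError("each replica must be valid before the merge")
    return invariant(replica_a | replica_b)


# Both replicas start from {} and each adds one user without coordinating.
replica_a = frozenset({("Stu", 1)})
replica_b = frozenset({("Ann", 1)})

print(merge_keeps_invariant(unique_ids, replica_a, replica_b))    # constraint violated
print(merge_keeps_invariant(positive_ids, replica_a, replica_b))  # constraint holds
```

Each replica is valid on its own in both cases; only the uniqueness constraint breaks after the merge, which is why it needs coordination and the positivity constraint does not.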

  283. Key idea: Check if constraints can be violated by
    “merging” independent operations
    OUR CONTRIBUTION:
    [VLDB 2015]
    ICT: Invariant Confluence Test

  284. Key idea: Check if constraints can be violated by
    “merging” independent operations
    OUR CONTRIBUTION:
    Theorem. A globally I-valid system can execute a set of
    transactions T with coordination-freedom, transactional availability,
    and convergence if and only if T are I-confluent with respect to I.
    [VLDB 2015]
    ICT ⟺ safe, coordination-free execution possible
    ICT: Invariant Confluence Test
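The condition named in the theorem can be written out as a formula. This is a paraphrase of the I-confluence definition from the cited VLDB 2015 paper; the symbols D_i and D_j (database states) and ⊔ (the merge operator) are notation assumed here and do not appear on the slide.

```latex
% T is I-confluent with respect to invariant I when merging any two
% states reached independently through valid executions of T
% always yields a valid state.
\[
\forall\, D_i, D_j \ \text{$I$-$T$-reachable from a common ancestor state}:
\quad I(D_i) \wedge I(D_j) \;\Longrightarrow\; I(D_i \sqcup D_j)
\]
```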

  285. Key idea: Check if constraints can be violated by
    “merging” independent operations
    OUR CONTRIBUTION:
    Generalizes classic partitioning-based indistinguishability arguments
    Theorem. A globally I-valid system can execute a set of
    transactions T with coordination-freedom, transactional availability,
    and convergence if and only if T are I-confluent with respect to I.
    [VLDB 2015]
    ICT ⟺ safe, coordination-free execution possible
    ICT: Invariant Confluence Test

  286. Typical database constraints and operations (SQL)
    Under set merge
    Constraint            Operation  OK?
    Equality, Inequality  Any        ???
    Generate unique ID    Any        ???
    Specify unique ID     Insert     ???
    >                     Increment  ???
    >                     Decrement  ???
    <                     Decrement  ???
    <                     Increment  ???
    Foreign Key           Insert     ???
    Foreign Key           Delete     ???
    Secondary Indexing    Any        ???
    Materialized Views    Any        ???
    AUTO_INCREMENT        Insert     ???

  287. Typical database constraints and operations (SQL)
    Under set merge
    Constraint            Operation  OK?
    Equality, Inequality  Any        Y
    Generate unique ID    Any        Y
    Specify unique ID     Insert     N
    >                     Increment  Y
    >                     Decrement  N
    <                     Decrement  Y
    <                     Increment  N
    Foreign Key           Insert     Y
    Foreign Key           Delete     Y*
    Secondary Indexing    Any        Y
    Materialized Views    Any        Y
    AUTO_INCREMENT        Insert     N
    [VLDB 2015]
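The difference between the "> Increment" (Y) and "> Decrement" (N) rows can be checked with a small example. As an assumption for this sketch, the database is modeled as a set of `(transaction, delta)` records merged by set union, and the invariant is that the summed balance stays above zero.

```python
def balance(ops):
    """Current value: the sum of all recorded deltas."""
    return sum(delta for _, delta in ops)


def stays_positive(ops):
    """Invariant: balance > 0."""
    return balance(ops) > 0


start = frozenset({("t0", 100)})

# Two replicas each apply a decrement that is safe locally (100 - 60 = 40).
dec_a = start | {("t1", -60)}
dec_b = start | {("t2", -60)}
dec_merged = dec_a | dec_b            # 100 - 60 - 60 = -20

# Two replicas each apply an increment instead.
inc_a = start | {("t1", 60)}
inc_b = start | {("t2", 60)}
inc_merged = inc_a | inc_b            # 100 + 60 + 60 = 220

print(stays_positive(dec_a), stays_positive(dec_b), stays_positive(dec_merged))
print(stays_positive(inc_merged))
```

Concurrent decrements are each valid in isolation but can jointly cross the bound, so they fail the test; concurrent increments can only move the value further from it.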

  288. Typical database constraints and operations (SQL)
    Under set merge
    Constraint            Operation  OK?
    Equality, Inequality  Any        Y
    Generate unique ID    Any        Y
    Specify unique ID     Insert     N
    >                     Increment  Y
    >                     Decrement  N
    <                     Decrement  Y
    <                     Increment  N
    Foreign Key           Insert     Y
    Foreign Key           Delete     Y*
    Secondary Indexing    Any        Y
    Materialized Views    Any        Y
    AUTO_INCREMENT        Insert     N
    [VLDB 2015]
    RAMP

  290. adopt-a-hydrant
    alchemy_cms
    amahi
    bostonrb
    boxroom
    brevidy
    browsercms
    bucketwise
    calagator
    canvas-lms
    carter
    chiliproject
    citizenry
    comas
    comfortable-mexican-sofa
    communityengine
    copycopter-server
    danbooru
    diaspora
    discourse
    enki
    fat_free_crm
    fedena
    forem
    fulcrum
    gitlab-ci
    gitlabhq
    govsgo
    heaven
    inkwell
    insoshi
    jobsworth
    juvia
    kandan
    linuxfr.org
    lobsters
    lovd-by-less
    nimbleshop
    obtvse
    onebody
    opal
    opencongress
    opengovernment
    openproject
    piggybak
    publify
    radiant
    railscollab
    redmine
    refinerycms
    ror_ecommerce
    rucksack
    saasy
    salor-retail
    selfstarter
    sharetribe
    skyline
    spot-us
    spree
    sprintapp
    squaresquash
    sugar
    teambox
    tracks
    tryshoppe
    wallgig
    zena
    67 projects 1.77M LoC 1957 tables
    9986 total; avg. 5.1 per table
    259 total; avg. 0.13 per table
    86.9% PASS ICT
    [SIGMOD 2015]

  291. 14/16 CONSTRAINTS PASS ICT
    TPC-C

  292. 14/16 CONSTRAINTS PASS ICT
    TPC-C
    6-11x faster than ACID/serializability
    [Plot: throughput (txns/s; ticks 40K, 100K, 600K) vs. number of warehouses (8, 16, 32, 48, 64)]
    Series: Coordination-Avoiding, Serializable (2PL)

  293. 14/16 CONSTRAINTS PASS ICT
    TPC-C
    scale to over 25x best listed result
    [Plot: total throughput (txn/s, 2M to 14M) vs. number of servers (0 to 200)]
    [Plot: throughput (txn/s/server, 0 to 80K) vs. number of servers (0 to 200)]
    6-11x faster than ACID/serializability
    [Plot: throughput (txns/s; ticks 40K, 100K, 600K) vs. number of warehouses (8, 16, 32, 48, 64)]
    Series: Coordination-Avoiding, Serializable (2PL)

  294. WHAT THE APPLICATION SAYS
    “no
    duplicate
    users”
    constraint
    WHAT THE DATABASE HEARS
    constraint
    constraint
    constraint
    constraint
    constraint
    constraint
    constraint
    “no
    duplicate
    users”
    CAN WE USE
    CONSTRAINTS
    TO
    AVOID
    COORDINATION?

  295. COORDINATION AVOIDANCE
    GUARANTEED SAFETY WITHOUT COORDINATION
    Serializability: COORDINATION REQUIRED, GUARANTEED SAFETY
    Eventual Consistency: COORDINATION FREE, NO SAFETY
    MORE SEMANTICS, MORE SAFETY
    Atomic Visibility (SIGMOD14)
    Database Constraints (VLDB15, SIGMOD15)
    Model Prediction and Training (CIDR15, TBA)
    Weak Isolation (HotOS13, VLDB14)
    Causality (SOCC12, SIGMOD13)
    PBS (VLDB12, VLDBJ14, SIGMOD13, CACM14)
    COORDINATION FREE

  298. Key idea: Exploit statistical robustness in system designs

  299. PLASMA: ASYNCHRONOUS LEARNING
    [Ongoing]
    Key idea: Exploit statistical robustness in system designs

  300. PLASMA: ASYNCHRONOUS LEARNING
    [Ongoing]
    Bulk Synch Parallel (figure: execution over TIME)
    Key idea: Exploit statistical robustness in system designs

  301. PLASMA: ASYNCHRONOUS LEARNING
    [Ongoing]
    ML task: Express algorithms via async iterator (e.g., ADMM)
    Bulk Async Parallel (figure: execution over TIME)
    Bulk Synch Parallel (figure: execution over TIME)
    Key idea: Exploit statistical robustness in system designs
    Break dataflow barriers using new iterator model
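
The barrier cost contrasted on this slide can be sketched with a toy throughput model. This is only an illustration under stated assumptions, not Plasma's iterator API: the function names, the per-worker step times, and the time budget are all hypothetical. It counts how many update steps finish in a fixed time budget when workers wait for the slowest peer each round (bulk synchronous) versus when they never wait (asynchronous). It says nothing about convergence; the slide's claim is that the statistical robustness of the learning task tolerates the staleness that asynchrony introduces.

```python
from typing import List


def bsp_updates(step_times: List[float], budget: float) -> int:
    """Updates completed under bulk synchronous execution.

    Every round ends at a barrier, so each round costs as much
    as the slowest worker's step.
    """
    round_time = max(step_times)
    rounds = int(budget // round_time)
    return rounds * len(step_times)


def async_updates(step_times: List[float], budget: float) -> int:
    """Updates completed when no worker ever waits at a barrier."""
    return sum(int(budget // t) for t in step_times)


# Hypothetical cluster: three fast workers and one straggler.
step_times = [1.0, 1.0, 1.0, 4.0]
budget = 12.0

print(bsp_updates(step_times, budget))    # 12
print(async_updates(step_times, budget))  # 39
```

In this toy setting a single straggler holds every worker to its pace under the barrier, while the asynchronous schedule lets the fast workers keep producing updates.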

  302. VELOX: FAST ONLINE PREDICTIONS
    [CIDR 2015]
    PLASMA: ASYNCHRONOUS LEARNING
    [Ongoing]
    ML task: Express algorithms via async iterator (e.g., ADMM)
    Bulk Async Parallel (figure: execution over TIME)
    Bulk Synch Parallel (figure: execution over TIME)
    Key idea: Exploit statistical robustness in system designs
    Break dataflow barriers using new iterator model

  303. VELOX: FAST ONLINE PREDICTIONS
    [CIDR 2015]
    Fast incremental personalization
    Batch retrain shared features
    PLASMA: ASYNCHRONOUS LEARNING
    [Ongoing]
    ML task: Express algorithms via async iterator (e.g., ADMM)
    Bulk Async Parallel (figure: execution over TIME)
    Bulk Synch Parallel (figure: execution over TIME)
    Key idea: Exploit statistical robustness in system designs
    Break dataflow barriers using new iterator model

  304. VELOX: FAST ONLINE PREDICTIONS
    [CIDR 2015]
    Fast incremental personalization
    Batch retrain shared features
    PLASMA: ASYNCHRONOUS LEARNING
    [Ongoing]
    ML task: Express algorithms via async iterator (e.g., ADMM)
    Bulk Async Parallel (figure: execution over TIME)
    Bulk Synch Parallel (figure: execution over TIME)
    Key idea: Exploit statistical robustness in system designs
    Prioritize model maintenance by robustness
    Break dataflow barriers using new iterator model

  305. VELOX: FAST ONLINE PREDICTIONS
    [CIDR 2015]
    Fast incremental personalization
    Batch retrain shared features
    PLASMA: ASYNCHRONOUS LEARNING
    [Ongoing]
    ML task: Express algorithms via async iterator (e.g., ADMM)
    Bulk Async Parallel (figure: execution over TIME)
    Bulk Synch Parallel (figure: execution over TIME)
    Key idea: Exploit statistical robustness in system designs
    Prioritize model maintenance by robustness
    ML task: Split models according to robustness
    Break dataflow barriers using new iterator model
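
One way to read "split models according to robustness" is sketched below. It is a minimal illustration, not Velox's implementation: the class `SplitModel`, its methods, the learning rate, and the choice of a plain SGD step are all assumptions made for the example. The model keeps a small per-user weight vector that is cheap to update on every observation (the fast incremental personalization path) on top of a shared feature function that is only replaced by an offline batch retrain.

```python
from typing import Callable, Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


class SplitModel:
    """Hypothetical split model: prediction = user_weights . featurize(item)."""

    def __init__(self, featurize: Callable[[float], Vector],
                 dim: int, lr: float = 0.1) -> None:
        self.featurize = featurize  # shared by all users, retrained in batch
        self.dim = dim
        self.lr = lr                # assumed SGD step size
        self.user_weights: Dict[str, Vector] = {}

    def predict(self, user: str, item: float) -> float:
        w = self.user_weights.setdefault(user, [0.0] * self.dim)
        return dot(w, self.featurize(item))

    def observe(self, user: str, item: float, label: float) -> None:
        """Fast path: one SGD step that touches only this user's weights."""
        x = self.featurize(item)
        err = self.predict(user, item) - label
        w = self.user_weights[user]
        for i in range(self.dim):
            w[i] -= self.lr * err * x[i]

    def swap_features(self, featurize: Callable[[float], Vector]) -> None:
        """Slow path: install features from an offline batch retrain.

        Assumes the new features keep the same dimension, so existing
        user weights remain usable until they are updated again.
        """
        self.featurize = featurize


model = SplitModel(lambda item: [1.0, item], dim=2)
for _ in range(50):
    model.observe("alice", 2.0, 5.0)
print(model.predict("alice", 2.0))  # close to 5.0
print(model.predict("bob", 2.0))    # 0.0: other users are unaffected
```

Online updates stay local to one user's weights, so they need no coordination with other users or with the batch job that rebuilds the shared features.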

  306. Serializability
    COORDINATION REQUIRED
    GUARANTEED SAFETY
    Eventual Consistency
    COORDINATION FREE
    NO SAFETY
    Atomic Visibility (SIGMOD14)
    Database Constraints (VLDB15, SIGMOD15)
    Model Prediction and Training (CIDR15, TBA)
    Weak Isolation (HotOS13, VLDB14)
    Causality (SOCC12, SIGMOD13)
    COORDINATION AVOIDANCE
    GUARANTEED SAFETY WITHOUT COORDINATION
    MORE SEMANTICS
    MORE SAFETY
    PBS (VLDB12, VLDBJ14, SIGMOD13, CACM14)
    COORDINATION FREE

  308. MY APPROACH:
    DESIGN DATABASE SYSTEMS THAT EXPLOIT SEMANTICS OF HIGH-VALUE USE CASES
    Study practical database use cases
    Derive principles and algorithms
    Build systems to realize the benefits

  309. PBS: Integrated into Cassandra 1.2 release
    + recent extensions at a major Internet company

  310. PBS: Integrated into Cassandra 1.2 release
    RAMP: Proposed feature in Cassandra 3.0
    (Reportedly) on roadmap for Facebook Apollo, IBM Cloudant
    + recent extensions at a major Internet company

  311. PBS: Integrated into Cassandra 1.2 release
    RAMP: Proposed feature in Cassandra 3.0
    (Reportedly) on roadmap for Facebook Apollo, IBM Cloudant
    + recent extensions at a major Internet company
    HAT Isolation: part of Kleppmann@LinkedIn’s Hermitage testing suite

  312. PBS: Integrated into Cassandra 1.2 release
    RAMP: Proposed feature in Cassandra 3.0
    (Reportedly) on roadmap for Facebook Apollo, IBM Cloudant
    + recent extensions at a major Internet company
    HAT Isolation: part of Kleppmann@LinkedIn’s Hermitage testing suite
    Active dialogue with developer, NoSQL community
    via invited talks, blogging, social media

  313. Current Practice
    PBS VLDB12, SIGMOD13, VLDBJ14, CACM14
    EC Today CACM/Queue13
    Consistency without Borders SoCC13
    Network Partitions CACM/Queue14
    Feral Concurrency Control SIGMOD15
    Principles
    I-Confluence VLDB15
    HATs HotOS13, VLDB14
    Explicit Causality SoCC12
    Systems
    Bolt-On SIGMOD13
    RAMP + Indexing SIGMOD14
    Velox CIDR15
    Plasma + BAP Ongoing
    MY WORK:
    COORDINATION AVOIDANCE

  315. FUTURE WORK
    Automatically coordinated applications

  316. FUTURE WORK
    Automatically coordinated applications
    Bespoke analysis and coordination synthesis

  317. FUTURE WORK
    Automatically coordinated applications
    Bespoke analysis and coordination synthesis
    “Query optimization” for transaction execution

  318. FUTURE WORK
    Automatically coordinated applications
    Bespoke analysis and coordination synthesis
    “Query optimization” for transaction execution
    DB meets “Big Data” Learning

  319. FUTURE WORK
    Automatically coordinated applications
    Bespoke analysis and coordination synthesis
    “Query optimization” for transaction execution
    DB meets “Big Data” Learning
    View materialization and selection for model maintenance

  320. FUTURE WORK
    Automatically coordinated applications
    Bespoke analysis and coordination synthesis
    “Query optimization” for transaction execution
    DB meets “Big Data” Learning
    View materialization and selection for model maintenance
    Bounded divergence control for coordinating learners

  321. FUTURE WORK
    Automatically coordinated applications
    Bespoke analysis and coordination synthesis
    “Query optimization” for transaction execution
    DB meets “Big Data” Learning
    View materialization and selection for model maintenance
    Bounded divergence control for coordinating learners
    Next-Generation Data Applications

  322. FUTURE WORK
    Automatically coordinated applications
    Bespoke analysis and coordination synthesis
    “Query optimization” for transaction execution
    DB meets “Big Data” Learning
    View materialization and selection for model maintenance
    Bounded divergence control for coordinating learners
    Next-Generation Data Applications
    Next 10-100x growth in data volume due to sensors, apps

  323. FUTURE WORK
    Automatically coordinated applications
    Bespoke analysis and coordination synthesis
    “Query optimization” for transaction execution
    DB meets “Big Data” Learning
    View materialization and selection for model maintenance
    Bounded divergence control for coordinating learners
    Next-Generation Data Applications
    Next 10-100x growth in data volume due to sensors, apps
    New interfaces for increased coordination costs, heterogeneity

  324. WHAT THE APPLICATION SAYS
    “post on timeline”
    “accept friend request”
    WHAT THE DATABASE HEARS
    (figure: an undifferentiated stream of individual read and write operations)

  325. Eventual Consistency
    COORDINATION FREE
    NO SAFETY
    Atomic Visibility (SIGMOD14)
    Database Constraints (VLDB15, SIGMOD15)
    Model Prediction and Training (CIDR15, TBA)
    Weak Isolation (HotOS13, VLDB14)
    Causality (SOCC12, SIGMOD13)
    COORDINATION AVOIDANCE
    GUARANTEED SAFETY WITHOUT COORDINATION
    MORE SEMANTICS
    MORE SAFETY
    PBS (VLDB12, VLDBJ14, SIGMOD13, CACM14)
    COORDINATION FREE

  326. Eventual Consistency
    COORDINATION FREE
    NO SAFETY
    Atomic Visibility (SIGMOD14)
    Database Constraints (VLDB15, SIGMOD15)
    Model Prediction and Training (CIDR15, TBA)
    Weak Isolation (HotOS13, VLDB14)
    Causality (SOCC12, SIGMOD13)
    COORDINATION AVOIDANCE
    GUARANTEED SAFETY WITHOUT COORDINATION
    MORE SEMANTICS
    MORE SAFETY
    PBS (VLDB12, VLDBJ14, SIGMOD13, CACM14)
    COORDINATION FREE
    Joint work with Ali Ghodsi, Joe Hellerstein,
    Ion Stoica, Mike Franklin, Michael Jordan,
    Alan Fekete, Dan Crankshaw, Shivaram
    Venkataraman, Neil Conway, Peter Alvaro,
    Aaron Davidson, Joey Gonzalez, Kyle Kingsbury,
    Haoyuan Li, and Zhao Zhang

  328. Many illustrations by the Noun Project (CC-Attribution):
    surprised by Julian Derveaux
    world by Wayne Tyler Sall
    database by Austin Condiff
    earth by Martin Vanco
    Woman by Simon Child
    Man by Simon Child
    Doctor by Simon Child
    David-Hockney by Simon Child
    Server by Simon Child
    clock by christoph robausch
