
Coordination Avoidance In Distributed Databases

pbailis
January 01, 2015

Job talk from early 2015

The rise of Internet-scale geo-replicated services has led to considerable upheaval in the design of modern data management systems. Namely, given the availability, latency, and throughput penalties associated with classic mechanisms such as serializable transactions, a broad class of systems (e.g., “NoSQL”) has sought weaker alternatives that reduce the use of expensive coordination during system operation, often at the cost of application integrity. When can we safely forgo the cost of this expensive coordination, and when must we pay the price?

In this talk, I will discuss the potential for coordination avoidance — the use of as little coordination as possible while ensuring application integrity — in several modern data-intensive domains. Specifically, I will demonstrate how to leverage the semantic requirements of applications in data serving, transaction processing, and statistical analytics to enable more efficient distributed algorithms and system designs. The prototype systems I have built demonstrate order-of-magnitude speedups compared to their traditional, coordinated counterparts on a variety of tasks, including referential integrity and index maintenance, transaction execution under common isolation models, and asynchronous convex optimization. I will also discuss our experiences studying and optimizing a range of open source applications and systems, which exhibit similar results.


Transcript

  1. COORDINATION AVOIDANCE IN DISTRIBUTED DATABASES. PETER BAILIS, UC Berkeley

  2. None
  3-7. DATA TODAY: UNPRECEDENTED
    SCALE: Billion-user Internet services; 3B Internet users in 2014; 2.3B mobile broadband users
    VOLUME: Facebook RocksDB: 9B ops/sec; Google BigTable: 600M ops/sec; LinkedIn Kafka: 2.5M ops/sec
    INTERACTIVITY: Impatient users want low latency; always-on responsiveness; personalized user experiences
    Sources: Ericsson Mobility Report, UN International Telecommunication Union, Facebook, Google, NSA, @RocksDB, @AKPurtell, Martin Kleppmann

  8. DATA TODAY: UNPRECEDENTED SCALE, VOLUME, INTERACTIVITY

  9. DATA TODAY: UNPRECEDENTED SCALE, VOLUME, INTERACTIVITY, AND GROWING!

  10. None
  11. None
  12. “post on timeline” “accept friend request”

  13. How should we design database systems that enable applications to scale? “post on timeline” “accept friend request”
  14. None
  15. CLASSIC: ACID

  16-17. CLASSIC: ACID serializable transactions. “accept friend request” “post on timeline”

  18. CLASSIC: ACID serializable transactions

  19-22. serializability: equivalence to some serial execution
    “post on timeline” “accept friend request”
    very general!
  23-25. serializability: equivalence to some serial execution
    Transactions: r(y) w(x←1) and r(x) w(y←1)
    very general! …but restricts concurrency
  26-38. serializability: equivalence to some serial execution. very general! …but restricts concurrency
    CONCURRENT EXECUTION: r(y)=0 w(x←1) alongside r(x)=0 w(y←1)
    Serial order A: [1] r(x)=0 w(y←1), then [2] r(y)=0 w(x←1). Should have r(y)→1
    Serial order B: [1] r(y)=0 w(x←1), then [2] r(x)=0 w(y←1). Should have r(x)→1
    CONCURRENT EXECUTION IS NOT SERIALIZABLE!
    transactions cannot make progress independently. Serializability requires Coordination
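
The argument on these slides can be checked mechanically. The following is a minimal sketch, not taken from the talk: it replays the two transactions from the slides in every serial order and compares the values read and the final state against the concurrent interleaving. The names `run`, `serial_outcomes`, and `is_serializable` are hypothetical, and comparing reads plus final state is a simplifying assumption about what "equivalence" means here.

```python
from itertools import permutations

# Operations are ("r", key) or ("w", key, value), as on the slides.
T1 = [("r", "y"), ("w", "x", 1)]
T2 = [("r", "x"), ("w", "y", 1)]
TXNS = {"T1": T1, "T2": T2}
INITIAL = {"x": 0, "y": 0}

def run(schedule, initial):
    """Execute (txn_name, op) pairs in order; return (reads per txn, final state)."""
    db = dict(initial)
    reads = {}
    for name, op in schedule:
        seen = reads.setdefault(name, {})
        if op[0] == "r":
            seen[op[1]] = db[op[1]]
        else:
            db[op[1]] = op[2]
    return reads, db

def serial_outcomes(txns, initial):
    """Outcomes of running the transactions one after another, in every order."""
    outcomes = []
    for order in permutations(txns.items()):
        schedule = [(name, op) for name, ops in order for op in ops]
        outcomes.append(run(schedule, initial))
    return outcomes

def is_serializable(schedule, txns, initial):
    """True if the schedule's reads and final state match some serial order."""
    return run(schedule, initial) in serial_outcomes(txns, initial)

# The interleaving from the slides: both reads happen before either write.
CONCURRENT = [("T1", T1[0]), ("T2", T2[0]), ("T1", T1[1]), ("T2", T2[1])]
print(is_serializable(CONCURRENT, TXNS, INITIAL))  # prints False
```

In the concurrent schedule both transactions read 0, whereas in either serial order the second transaction would read 1, which is why the check fails.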
  39-43. Serializability requires Coordination: transactions cannot make progress independently
    Two-Phase Locking, Optimistic Concurrency Control, Pre-Scheduling, Multi-Version Concurrency Control
    Blocking, Waiting, Aborts
    Costs of Coordination Between Concurrent Transactions: 1. Decreased performance
  44. None
  45. None
  46. None
  47-48. [Plot: maximum throughput (txns/s) vs. number of servers in transaction, for conflicting transactions. Local datacenter (Amazon EC2). Based on [Bobtail, Xu et al., NSDI 13]]

  49-51. [Plots, for conflicting transactions. Left: maximum throughput (txns/s) vs. number of servers in transaction; local datacenter (Amazon EC2); based on [Bobtail, Xu et al., NSDI 13]. Right: maximum throughput (txn/s) vs. participating datacenters (+VA): +OR +CA +IR +SP +TO +SI +SY; multi-datacenter (Amazon EC2); based on [HAT, Bailis et al., VLDB 14]]
  52-56. Costs of Coordination Between Concurrent Transactions
    1. Decreased performance
      » due to waiting, communication delays, aborts
      » exacerbated in distributed environment!
    2. Decreased availability during failures
    transactions cannot make progress independently. Serializability requires Coordination
    Well-known for decades; cf. “CAP”
  57. How should we design database systems that enable applications to scale?

  58. How should we design database systems that enable applications to scale? Serializability: COORDINATION REQUIRED

  59. How should we design database systems that enable applications to scale? Serializability: COORDINATION REQUIRED. “NoSQL”: COORDINATION FREE
  60. NoSQL

  62. None
  63-67. Eventual Consistency: “if no new updates are made to the [database], eventually all accesses will return the last updated value[s]” (Werner Vogels, Amazon CTO)
    provides no safety: what happens in the meantime?
  68-74. Probabilistically Bounded Staleness (PBS) [VLDB 2012, VLDB Journal 2014 “Best of VLDB 2012”, SIGMOD 2013 (Demo), CACM Research Highlight]
    » Monte Carlo analysis of protocol behavior
    » Key finding: frequently “correct” results…
    PBS: Voldemort Database at LinkedIn: 99% of reads return the last update 23ms after write; 32-90% decrease in 99.9th percentile latency
    …BUT NO GUARANTEES! ⇒ DIFFICULT TO PROGRAM
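
To make "Monte Carlo analysis of protocol behavior" concrete, here is a toy sketch. It is not the model from the cited PBS papers and is far simpler than any real replication protocol. It assumes, purely for illustration, that a write reaches each replica after an independent exponentially distributed delay and that a read at time `t_ms` contacts `read_quorum` replicas chosen at random. The function name, parameters, and delay distribution are all hypothetical.

```python
import random

def estimate_fresh_read_probability(t_ms, n_replicas=3, read_quorum=1,
                                    mean_delay_ms=5.0, trials=20000, seed=0):
    """Estimate P(a read issued t_ms after a write sees that write).

    Toy assumptions (not from the talk): per-replica write delays are
    independent exponentials with the given mean; the read contacts
    `read_quorum` replicas uniformly at random and returns the newest value.
    """
    rng = random.Random(seed)
    fresh = 0
    for _ in range(trials):
        # Time at which each replica applies the write.
        arrival = [rng.expovariate(1.0 / mean_delay_ms) for _ in range(n_replicas)]
        contacted = rng.sample(range(n_replicas), read_quorum)
        # The read is fresh if any contacted replica already has the write.
        if any(arrival[i] <= t_ms for i in contacted):
            fresh += 1
    return fresh / trials

if __name__ == "__main__":
    for t in (1.0, 5.0, 25.0):
        print(t, estimate_fresh_read_probability(t))
```

A sweep over `t_ms` gives a curve of staleness probability against time since the write, which is the kind of "how stale, how often" question the slides pose. As the slides stress, such an estimate is a probability, not a guarantee.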
  75. None
  76. None
  77. “…sometimes the [write] is retrieved from the datastore and sometimes it is not.”
  78-81. Serializability: COORDINATION REQUIRED, GUARANTEED SAFETY
    Eventual Consistency: COORDINATION FREE, NO SAFETY
    PBS: VLDB12, VLDBJ14, SIGMOD13, CACM14
    MY WORK: COORDINATION AVOIDANCE: GUARANTEED SAFETY WITHOUT COORDINATION
  82. The Far Side, Gary Larson

  83. None
  84-85. WHAT THE APPLICATION SAYS: “post on timeline” “accept friend request”
    WHAT THE DATABASE HEARS: write read write read write write read write write write read write read read read read read read
  86. None
  87-91. MY APPROACH: DESIGN DATABASE SYSTEMS THAT EXPLOIT SEMANTICS OF HIGH-VALUE USE CASES
    Study practical database use cases
    Derive principles and algorithms
    Build systems to realize the benefits
  92-98. Serializability: COORDINATION REQUIRED, GUARANTEED SAFETY
    Eventual Consistency: COORDINATION FREE, NO SAFETY
    PBS: VLDB12, VLDBJ14, SIGMOD13, CACM14
    COORDINATION AVOIDANCE: GUARANTEED SAFETY WITHOUT COORDINATION. MORE SEMANTICS, MORE SAFETY
    Causality: SOCC12, SIGMOD13
    Weak Isolation: HotOS13, VLDB14
    Atomic Visibility: SIGMOD14
    Database Constraints: VLDB15, SIGMOD15

  99-101. COORDINATION AVOIDANCE: GUARANTEED SAFETY WITHOUT COORDINATION
    Data Serving and Transactions: Causality (SOCC12, SIGMOD13); Weak Isolation (HotOS13, VLDB14); Atomic Visibility (SIGMOD14); Database Constraints (VLDB15, SIGMOD15)
    Analytics: Model Prediction and Training (CIDR15, TBA)

  102-104. Serializability: COORDINATION REQUIRED, GUARANTEED SAFETY
    Eventual Consistency: COORDINATION FREE, NO SAFETY
    PBS: VLDB12, VLDBJ14, SIGMOD13, CACM14
    COORDINATION AVOIDANCE: GUARANTEED SAFETY WITHOUT COORDINATION. MORE SEMANTICS, MORE SAFETY
    Causality: SOCC12, SIGMOD13
    Weak Isolation: HotOS13, VLDB14
    Atomic Visibility: SIGMOD14
    Database Constraints: VLDB15, SIGMOD15
    Model Prediction and Training: CIDR15, TBA
    COORDINATION FREE
  105-107. (Abridged) Related Work
    » Semantics-based concurrency control: esp. commutativity and CALM analysis, laws of order
    » Available storage systems: optimistic replication, causal memory, CRDTs, eventually consistent transactions
    » Distributed computing: CAP, FLP, NBAC, quorums
    » Here: focus on necessary coordination for common, modern data-intensive apps
  108-112. Serializability: COORDINATION REQUIRED, GUARANTEED SAFETY
    Eventual Consistency: COORDINATION FREE, NO SAFETY
    PBS: VLDB12, VLDBJ14, SIGMOD13, CACM14
    COORDINATION AVOIDANCE: GUARANTEED SAFETY WITHOUT COORDINATION. MORE SEMANTICS, MORE SAFETY. COORDINATION FREE
    Causality: SOCC12, SIGMOD13
    Weak Isolation: HotOS13, VLDB14
    Atomic Visibility: SIGMOD14
    Database Constraints: VLDB15, SIGMOD15
    Model Prediction and Training: CIDR15, TBA
    Markers shown: none (slide 108); 1 (slide 109); 1 2 (slide 110); 1 2 3 (slide 111); 1 (slide 112)
  113-121. Social Graph (Facebook): 1.2B+ vertices, 420B+ edges

    User   Adjacency List
    1      2, 3, 5
    2      1, 3, 5
    3      1, 5, 6
    4      6
    5      1, 2, 3, 6
    6      3, 4, 5
  122-123. User 1: 2, 3, 5. User 6: 3, 4, 5

  124-126. User 1: 2, 3, 5, 6. User 6: 3, 4, 5, 1
    To preserve graph, should observe either: » Both links » Neither link
    Atomic Visibility
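
A small sketch of why the two-link update is delicate. It uses only the adjacency-list entries for users 1 and 6 from the slides. The helper names `befriend_steps` and `links_seen` are hypothetical, and a plain in-memory dict stands in for what would be two separate writes in a real store.

```python
# Adjacency-list entries for users 1 and 6, as shown on the slides.
adjacency = {1: [2, 3, 5], 6: [3, 4, 5]}

def befriend_steps(adj, a, b):
    """Apply the two writes of a new friendship one at a time.

    Yields a snapshot of the store after each individual write, i.e. the
    states a concurrent reader could observe if the writes are not atomic.
    """
    adj[a] = adj[a] + [b]
    yield {user: list(friends) for user, friends in adj.items()}
    adj[b] = adj[b] + [a]
    yield {user: list(friends) for user, friends in adj.items()}

def links_seen(snapshot, a, b):
    """Return (a lists b, b lists a) for a given snapshot."""
    return (b in snapshot[a], a in snapshot[b])

snapshots = list(befriend_steps(adjacency, 1, 6))
print(links_seen(snapshots[0], 1, 6))  # prints (True, False): only one link visible
print(links_seen(snapshots[1], 1, 6))  # prints (True, True): both links visible
```

The intermediate snapshot is exactly the state the slides say a reader should never observe: one link without its inverse.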
  127-135. Atomic Visibility: either all or none of each transaction’s updates should be visible to other transactions
    X = 1 WRITE, Y = 1 WRITE
    X = 1 READ, Y = 1 READ OR READ X = , READ Y =
    BUT NOT (“FRACTURED READS”): X = 1 READ Y = 1 READ READ X = READ Y =
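
The "all or none" rule can be phrased as a check over what a reader observed. The sketch below is an illustration under stated assumptions, not the talk's formalism: each write transaction is identified by an integer timestamp, `0` stands for the initial version of every item, and `has_fractured_read` is a hypothetical helper name.

```python
def has_fractured_read(read_versions, write_sets):
    """Detect a fractured read.

    read_versions: item -> timestamp of the version the reader observed
                   (0 means the initial version; an assumption of this sketch).
    write_sets:    timestamp -> set of items written by that transaction.

    The reader is fractured if it saw a transaction's write to one item but
    an older version of another item that the same transaction also wrote.
    """
    for item, ts in read_versions.items():
        for sibling in write_sets.get(ts, ()):
            if sibling in read_versions and read_versions[sibling] < ts:
                return True
    return False

# One transaction (timestamp 1) writes X = 1 and Y = 1, as on the slides.
writes = {1: {"x", "y"}}

print(has_fractured_read({"x": 1, "y": 1}, writes))  # prints False: saw all updates
print(has_fractured_read({"x": 0, "y": 0}, writes))  # prints False: saw none
print(has_fractured_read({"x": 1, "y": 0}, writes))  # prints True: fractured read
```

Note that the check says nothing about ordering between different transactions, which is consistent with the later slides stating that atomic visibility is weaker than serializability.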
  136. Atomic Visibility is sufficient to correctly maintain: social graph structure

  137-138. Atomic Visibility is not serializability!
    CONCURRENT EXECUTION: r(y)=0 w(x←1) alongside r(x)=0 w(y←1)
    Serial order A: [1] r(x)=0 w(y←1), then [2] r(y)=0 w(x←1). Should have r(y)→1
    Serial order B: [1] r(y)=0 w(x←1), then [2] r(x)=0 w(y←1). Should have r(x)→1
    CONCURRENT EXECUTION IS NOT SERIALIZABLE! …but respects Atomic Visibility!
  139-148. Atomic Visibility compared

                                         Fractured Reads   Item Anti-Dependency Cycles   Anti-Dependency Cycles
    Serializability                      Prevents          Prevents                      Prevents
    Snapshot Isolation                   Prevents          Prevents                      Doesn’t prevent
    Atomic Visibility via Read Atomic    Prevents          Doesn’t prevent               Doesn’t prevent
    Eventual Consistency                 Doesn’t prevent   Doesn’t prevent               Doesn’t prevent

    WANT TO PREVENT
    Require coordination to prevent! [VLDB 2014]
  149. Atomic Visibility is sufficient to correctly maintain: social graph structure

  150-151. Also applies to other relationships: an attending doctor should have each patient

  152-154. Atomic Visibility is sufficient to correctly maintain: social graph structure, referential integrity, secondary indexes, materialized views, despite being weaker than serializability
  155-158. Atomic Visibility via Locking
    State: X=0, Y=0. Writes: X = 1 W, Y = 1 W
    State: X=1, Y=1. Reads: X = 1 R, Y = 1 R

  159-163. Atomic Visibility via Locking
    Writes: X = 1 W, Y = 1 W. State: X=1, Y=0
    Reads: X = ? R, Y = ? R
    Server 1001, Server 1002
  164. None
  165. [Timeline diagram; axis: TIME]

  166. LOCKING

    [Timeline diagram; axis: TIME] Operations on items X and Y: W(X), W(Y), R(X), R(Y)
  167-168. LOCKING

    [Timeline diagram; axis: TIME] Operations on items X and Y: W(X), W(Y), R(X), R(Y). ATOMICITY VIOLATED!
  169. LOCKING, OPTIMISTIC

    [Timeline diagram; axis: TIME] The operations W(X), W(Y), R(X), R(Y) are shown for each scheme. ATOMICITY VIOLATED!
  170. LOCKING, OPTIMISTIC

    [Timeline diagram; axis: TIME; items X and Y] OPTIMISTIC: VALIDATE ATOMICITY. ATOMICITY VIOLATED!
  171-172. LOCKING, OPTIMISTIC

    [Timeline diagram; axis: TIME; items X and Y] OPTIMISTIC: VALIDATE ATOMICITY. VIOLATED? ABORT. ATOMICITY VIOLATED!
  173. LOCKING, OPTIMISTIC

    [Timeline diagram; axis: TIME; items X and Y] OPTIMISTIC: VALIDATE ATOMICITY. VIOLATED? ABORT. ATOMICITY VIOLATED! BOTH RELY ON COORDINATION
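The violation these timeline slides illustrate can be reproduced with a small, deterministic sketch. The schedule format and the helper names (`run`, `atomically_visible`) are assumptions made for illustration, not part of the talk; the "serial" schedule stands in for what locking would enforce by making the reader wait.

```python
def run(schedule):
    """Apply a list of (op, item) steps and return what the reader saw."""
    store = {"X": 0, "Y": 0}          # initial state from the slides
    observed = {}
    for op, item in schedule:
        if op == "W":
            store[item] = 1           # the writer transaction sets the item to 1
        else:
            observed[item] = store[item]  # the reader transaction reads the item
    return observed


def atomically_visible(observed):
    """The reader must see both of the writer's updates or neither."""
    return observed["X"] == observed["Y"]


# Uncoordinated interleaving: the reader runs between the two writes.
racy = [("W", "X"), ("R", "X"), ("R", "Y"), ("W", "Y")]
# Lock-style schedule: the writer finishes before the reader starts.
serial = [("W", "X"), ("W", "Y"), ("R", "X"), ("R", "Y")]

racy_view = run(racy)
serial_view = run(serial)
print(racy_view, atomically_visible(racy_view))      # {'X': 1, 'Y': 0} False
print(serial_view, atomically_visible(serial_view))  # {'X': 1, 'Y': 1} True
```

The racy schedule returns X=1 with Y=0, which is the fractured read that locking and optimistic validation both prevent by coordinating.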
  174. Due to coordination overheads…

  175. Due to coordination overheads…

    Facebook Tao, Google Megastore, LinkedIn Espresso, Amazon DynamoDB, Apache Cassandra, Basho Riak, Yahoo! PNUTS, Google App Engine
  176. Due to coordination overheads…

    Facebook Tao, Google Megastore, LinkedIn Espresso, Amazon DynamoDB, Apache Cassandra, Basho Riak, Yahoo! PNUTS, Google App Engine …consciously choose to violate atomic visibility
  177. Due to coordination overheads…

    Facebook Tao, Google Megastore, LinkedIn Espresso, Amazon DynamoDB, Apache Cassandra, Basho Riak, Yahoo! PNUTS, Google App Engine …consciously choose to violate atomic visibility
    “[Tao] explicitly favors efficiency and availability over consistency…[an edge] may exist without an inverse; these hanging associations are scheduled for repair by an asynchronous job.”
  178. Our contributions [SIGMOD 2014, selected for “Best of SIGMOD” ACM TODS]

    To maintain: social graph structure, referential integrity, secondary indexes, materialized views
  179. Our contributions [SIGMOD 2014, selected for “Best of SIGMOD” ACM TODS]

    1. A new model: atomic visibility (via Read Atomic isolation) is (provably) sufficient to maintain social graph structure, referential integrity, secondary indexes, materialized views
  180. Our contributions [SIGMOD 2014, selected for “Best of SIGMOD” ACM TODS]

    1. A new model: atomic visibility (via Read Atomic isolation) is (provably) sufficient to maintain social graph structure, referential integrity, secondary indexes, materialized views
    2. Efficient protocols: RAMP transactions enforce atomic visibility without coordination
  181. WHAT THE APPLICATION SAYS: “accept friend request”, “update index entry”

    WHAT THE DATABASE HEARS: [diagram: stream of individual reads and writes]
  182. “accept friend request”, “update index entry”

    [Diagram: stream of individual reads and writes]
  183. “accept friend request”, “update index entry”

    [Diagram: stream of individual reads and writes; label: ATOMIC VISIBILITY]
  184. “accept friend request”, “update index entry”

    [Diagram: stream of individual reads and writes; labels: RAMP TRANSACTION, ATOMIC VISIBILITY]
  185. “accept friend request”, “update index entry”

    [Diagram: stream of individual reads and writes; labels: RAMP TRANSACTION (twice), ATOMIC VISIBILITY]
  186. LOCKING, OPTIMISTIC

    [Timeline diagram; axis: TIME; operations W(X), W(Y), R(X), R(Y) on items X and Y] ATOMICITY VIOLATED! OPTIMISTIC: VALIDATE ATOMICITY. VIOLATED? ABORT
  187. LOCKING, OPTIMISTIC, RAMP TRANSACTIONS

    [Same timeline diagram]
  188. LOCKING, OPTIMISTIC, RAMP TRANSACTIONS

    [Same timeline diagram] Without coordination, atomicity violations will (initially) occur!
  189. LOCKING, OPTIMISTIC, RAMP TRANSACTIONS

    [Same timeline diagram, with W(X), W(Y), R(X), R(Y) also shown for RAMP] Without coordination, atomicity violations will (initially) occur!
  190. LOCKING, OPTIMISTIC, RAMP TRANSACTIONS

    [Same timeline diagram] Without coordination, atomicity violations will (initially) occur! Don’t panic! Don’t abort!
  191. LOCKING, OPTIMISTIC, RAMP TRANSACTIONS

    [Same timeline diagram] Without coordination, atomicity violations will (initially) occur! Don’t panic! Don’t abort! DETECT RACES
  192. LOCKING, OPTIMISTIC, RAMP TRANSACTIONS

    [Same timeline diagram] Without coordination, atomicity violations will (initially) occur! Don’t panic! Don’t abort! DETECT RACES, REPAIR ATOMICITY
  193. LOCKING, OPTIMISTIC, RAMP TRANSACTIONS

    [Same timeline diagram, with an additional R(Y) for RAMP] Without coordination, atomicity violations will (initially) occur! Don’t panic! Don’t abort! DETECT RACES, REPAIR ATOMICITY
  194. RAMP TRANSACTIONS: DETECT RACES, REPAIR ATOMICITY

  195-196. Atomic Visibility via RAMP Transactions: DETECT RACES, REPAIR ATOMICITY

  197. Atomic Visibility via RAMP Transactions: DETECT RACES, REPAIR ATOMICITY

    [Diagram] Server 1001, Server 1002; state: X=0, Y=0. Writes: W(X=1), W(Y=1)
  198. Atomic Visibility via RAMP Transactions: DETECT RACES, REPAIR ATOMICITY

    [Diagram] Server 1001, Server 1002; state: X=0, Y=0, then X=1. Writes: W(X=1), W(Y=1)
  199-201. Atomic Visibility via RAMP Transactions: DETECT RACES, REPAIR ATOMICITY

    [Diagram] Same servers and writes. Reads R(X)=?, R(Y)=? return X = 1, Y = 0
  202. Atomic Visibility via RAMP Transactions: DETECT RACES via intention metadata, REPAIR ATOMICITY

    [Diagram] Same servers and writes. Reads R(X)=?, R(Y)=? return X = 1, Y = 0
  203. Atomic Visibility via RAMP Transactions: DETECT RACES via intention metadata, REPAIR ATOMICITY

    [Diagram] Server 1001, Server 1002; state: X=1, Y=0. Writes: W(X=1), W(Y=1)
  204-206. Atomic Visibility via RAMP Transactions: DETECT RACES via intention metadata, REPAIR ATOMICITY

    [Diagram] Each stored version has a value and an intention. Versions: X=1, T1, {Y}; Y=0, T0, {}. Writes: W(X=1), W(Y=1)
  207. Atomic Visibility via RAMP Transactions: DETECT RACES via intention metadata, REPAIR ATOMICITY

    [Diagram] Versions: X=1, T1, {Y}; Y=0, T0, {}. Writes: W(X=1), W(Y=1)
    “A transaction called T1 wrote this and also wrote to Y”
  208-209. Atomic Visibility via RAMP Transactions: DETECT RACES via intention metadata, REPAIR ATOMICITY

    [Diagram] Versions: X=1, T1, {Y}; Y=0, T0, {}. Writes: W(X=1), W(Y=1)
  210-211. Atomic Visibility via RAMP Transactions: DETECT RACES via intention metadata, REPAIR ATOMICITY

    [Diagram] Versions: X=1, T1, {Y}; Y=0, T0, {}. Writes: W(X=1), W(Y=1). Reads: R(X)=?, R(Y)=?
  212-213. Atomic Visibility via RAMP Transactions: DETECT RACES via intention metadata, REPAIR ATOMICITY

    [Diagram] Versions: X=1, T1, {Y}; Y=0, T0, {}. Reads return X = 1, Y = 0
    “A transaction called T1 wrote this and also wrote to Y”
  214-216. Atomic Visibility via RAMP Transactions: DETECT RACES via intention metadata, REPAIR ATOMICITY

    [Diagram] Versions: X=1, T1, {Y}; Y=0, T0, {}. Reads return X = 1, Y = 0
    Where is T1’s write to Y?
  217. Atomic Visibility via RAMP Transactions: DETECT RACES via intention metadata, REPAIR ATOMICITY via multi-versioning, ready bit

    [Diagram] Versions: X=1, T1, {Y}; Y=0, T0, {}. Reads return X = 1, Y = 0
    Where is T1’s write to Y?
  218. Atomic Visibility via RAMP Transactions: DETECT RACES via intention metadata, REPAIR ATOMICITY via multi-versioning, ready bit

    [Diagram] Versions: X=1, T1, {Y}; Y=0, T0, {}. Writes: W(X=1), W(Y=1)
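A minimal sketch of the race check that the intention metadata enables, using the X/Y example from the slides. The `Version` record, the integer transaction IDs (0 for T0, 1 for T1), and the function name `missing_writes` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: int
    txn: int              # ID of the writing transaction (assumed totally ordered)
    siblings: frozenset   # intention: other items the same transaction wrote


def missing_writes(reads):
    """Return {item: txn} for every item whose returned version is older
    than a sibling write announced by another returned version."""
    needed = {}
    for version in reads.values():
        for item in version.siblings:
            if item in reads and reads[item].txn < version.txn:
                needed[item] = max(needed.get(item, 0), version.txn)
    return needed


# The reads from the slides: X = 1 (written by T1, which also wrote Y), Y = 0 (T0).
reads = {
    "X": Version(value=1, txn=1, siblings=frozenset({"Y"})),
    "Y": Version(value=0, txn=0, siblings=frozenset()),
}
print(missing_writes(reads))  # {'Y': 1}
```

The result answers the slide's question "Where is T1's write to Y?": the reader learns it must still obtain T1's version of Y.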
  219. Atomic Visibility via RAMP Transactions: DETECT RACES via intention metadata, REPAIR ATOMICITY via multi-versioning, ready bit

    [Diagram] Version tables with columns value and intention: X=0, T0, {}; Y=0, T0, {}. Writes: W(X=1), W(Y=1)
  220. Atomic Visibility via RAMP Transactions: DETECT RACES via intention metadata, REPAIR ATOMICITY via multi-versioning, ready bit

    [Diagram] Version tables with columns value, intention, ready: X=0, T0, {}; Y=0, T0, {}. Writes: W(X=1), W(Y=1)
  221. Atomic Visibility via RAMP Transactions: DETECT RACES via intention metadata, REPAIR ATOMICITY via multi-versioning, ready bit

    [Diagram] Version tables: X=0, T0, {}; X=1, T1, {Y}; Y=0, T0, {}; Y=1, T1, {X}. Writes: W(X=1), W(Y=1)
    1.) Place write on each server.
  222. Atomic Visibility via RAMP Transactions: DETECT RACES via intention metadata, REPAIR ATOMICITY via multi-versioning, ready bit

    [Same diagram]
    1.) Place write on each server.
    2.) Set ready bit on each write on server.
  223. Atomic Visibility via RAMP Transactions: DETECT RACES via intention metadata, REPAIR ATOMICITY via multi-versioning, ready bit

    [Same diagram]
    1.) Place write on each server.
    2.) Set ready bit on each write on server.
    Ready bit invariant: if ready bit is set, all writes in transaction are present on their respective servers
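The two write steps and the ready bit invariant can be sketched as follows. This is an in-memory illustration under assumptions: the `Server` class, `ramp_write`, and `ready_invariant_holds` are hypothetical names, one server holds one item, and network rounds are modeled as plain loops.

```python
class Server:
    """One partition; keeps every version of its items (multi-versioning)."""

    def __init__(self):
        self.versions = {}  # (item, txn) -> (value, sibling items)
        self.ready = set()  # (item, txn) pairs whose ready bit is set

    def prepare(self, item, txn, value, siblings):
        self.versions[(item, txn)] = (value, siblings)

    def commit(self, item, txn):
        self.ready.add((item, txn))


def ramp_write(servers, txn, writes):
    """`writes` maps item -> value; `servers` maps item -> Server."""
    items = set(writes)
    # Round 1: place the write (with its intention metadata) on each server.
    for item, value in writes.items():
        servers[item].prepare(item, txn, value, items - {item})
    # Round 2: only after every write is placed, set the ready bits.
    for item in writes:
        servers[item].commit(item, txn)


def ready_invariant_holds(servers):
    """If a ready bit is set, all sibling writes are present on their servers."""
    for server in servers.values():
        for item, txn in server.ready:
            _, siblings = server.versions[(item, txn)]
            if any((sib, txn) not in servers[sib].versions for sib in siblings):
                return False
    return True


servers = {"X": Server(), "Y": Server()}
ramp_write(servers, txn=1, writes={"X": 1, "Y": 1})
print(ready_invariant_holds(servers))  # True
```

Because no ready bit is set until round 1 has finished everywhere, the invariant also holds at every intermediate point of round 2, which is what lets a reader fetch a missing sibling write without waiting.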
  224-225. Atomic Visibility via RAMP Transactions: DETECT RACES via intention metadata, REPAIR ATOMICITY via multi-versioning

    [Diagram] Version tables with columns value, intention, ready: X=0, T0, {}; X=1, T1, {Y}; Y=0, T0, {}; Y=1, T1, {X}. Writes: W(X=1), W(Y=1)
    Ready bit invariant: if ready bit is set, all writes in transaction are present on their respective servers
  226. Atomic Visibility via RAMP Transactions: DETECT RACES via intention metadata, REPAIR ATOMICITY via multi-versioning

    [Same diagram] Writes: W(X=1), W(Y=1). Reads: R(X)=?, R(Y)=?
    Ready bit invariant: if ready bit is set, all writes in transaction are present on their respective servers
  227. Atomic Visibility via RAMP Transactions: DETECT RACES via intention metadata, REPAIR ATOMICITY via multi-versioning

    [Same diagram] Reads: R(X)=?, R(Y)=?
    Ready bit invariant: if ready bit is set, all writes in transaction are present on their respective servers
  228. Atomic Visibility via RAMP Transactions: DETECT RACES via intention metadata, REPAIR ATOMICITY via multi-versioning

    [Same diagram] Reads: R(X)=?, R(Y)=?
    1.) Fetch “highest” ready versions.
    Ready bit invariant: if ready bit is set, all writes in transaction are present on their respective servers
  229. Atomic Visibility via RAMP Transactions: DETECT RACES via intention metadata, REPAIR ATOMICITY via multi-versioning

    [Same diagram] Reads: R(X)=?, R(Y)=?
    1.) Fetch “highest” ready versions. Returned: X = 1
    Ready bit invariant: if ready bit is set, all writes in transaction are present on their respective servers
  230. Atomic Visibility via RAMP Transactions: DETECT RACES via intention metadata, REPAIR ATOMICITY via multi-versioning

    [Same diagram] Reads: R(X)=?, R(Y)=?
    1.) Fetch “highest” ready versions. Returned: X = 1, Y = 0
    Ready bit invariant: if ready bit is set, all writes in transaction are present on their respective servers
  231-232. Atomic Visibility via RAMP Transactions: DETECT RACES via intention metadata, REPAIR ATOMICITY via multi-versioning

    [Same diagram] Reads: R(X)=?, R(Y)=?
    1.) Fetch “highest” ready versions. Returned: X = 1, Y = 0
    2.) Fetch any missing writes using metadata.
    Ready bit invariant: if ready bit is set, all writes in transaction are present on their respective servers
  233. Atomic Visibility via RAMP Transactions: DETECT RACES via intention metadata, REPAIR ATOMICITY via multi-versioning

    [Same diagram] Reads: R(X)=?, R(Y)=?
    1.) Fetch “highest” ready versions. Returned: X = 1, Y = 0
    2.) Fetch any missing writes using metadata. Returned: Y = 1
    Ready bit invariant: if ready bit is set, all writes in transaction are present on their respective servers
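The two read steps can be sketched on the same X/Y state shown in the slides. The `versions` layout, integer transaction IDs, and the names `highest_ready` and `ramp_read` are assumptions for illustration; a real deployment would issue each round as parallel requests to the servers.

```python
# versions[item] maps txn ID -> (value, sibling items, ready bit)
versions = {
    "X": {0: (0, set(), True), 1: (1, {"Y"}, True)},
    "Y": {0: (0, set(), True), 1: (1, {"X"}, False)},  # T1's ready bit not yet set on Y
}


def highest_ready(item):
    """Round-1 server logic: the highest-numbered version whose ready bit is set."""
    txn = max(t for t, (_, _, ready) in versions[item].items() if ready)
    return txn, versions[item][txn]


def ramp_read(items):
    # Round 1: fetch the "highest" ready version of every requested item.
    first = {item: highest_ready(item) for item in items}
    # Use the intention metadata to find, per item, the newest transaction
    # that some returned version says also wrote that item.
    required = {item: txn for item, (txn, _) in first.items()}
    for txn, (_, siblings, _) in first.values():
        for sib in siblings & set(items):
            required[sib] = max(required[sib], txn)
    # Round 2: fetch by transaction ID any write that round 1 missed. The
    # ready bit invariant guarantees that version is already on its server.
    result = {}
    for item, (txn, (value, _, _)) in first.items():
        if required[item] > txn:
            value = versions[item][required[item]][0]
        result[item] = value
    return result


print(ramp_read(["X", "Y"]))  # {'X': 1, 'Y': 1}
```

Round 1 alone would return X = 1, Y = 0, as on the slides; the second round repairs the read to X = 1, Y = 1 without blocking on the writer.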
  234-235. RAMP Details

    Write RTT   Read RTT (best case)   Read RTT (worst case)   Metadata
    2           1                      2                       O(txn len) write set summary
    DETECT RACES via intention metadata; REPAIR ATOMICITY via multi-versioning, ready bit
  236-238. RAMP Details

    [Same table] Ensures that readers never have to wait
  239-240. RAMP Details

    [Same table] Ensures that readers never have to wait. 2nd RTT for repair, in the event of a race
  241. RAMP Details

    [Same table]
  242. RAMP Details

    [Same table]
    Transaction IDs: sequence number and client ID
      » Also use to order overwrites!
  243. RAMP Details

    [Same table]
    Garbage collection of old versions:
      » Set timeout (TTL) for overwritten versions
      » Limit read transaction duration to TTL
    Transaction IDs: sequence number and client ID
      » Also use to order overwrites!
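One way to read the two details above, sketched under assumptions: transaction IDs are modeled as (sequence number, client ID) tuples compared lexicographically, and each overwritten version records when it was overwritten. The field names, the tuple ordering, and the helper names are illustrative choices, not taken from the talk.

```python
from typing import NamedTuple


class TxnId(NamedTuple):
    seq: int     # sequence number
    client: int  # client ID, used here to break ties between clients


def latest(versions):
    """Order overwrites by transaction ID: the largest TxnId wins."""
    return max(versions, key=lambda v: v["txn"])


def collect_garbage(versions, now, ttl):
    """Keep the current version and any version overwritten within the TTL.
    Reads must finish within `ttl` so they never need a collected version."""
    return [
        v for v in versions
        if v["overwritten_at"] is None or now - v["overwritten_at"] <= ttl
    ]


versions = [
    {"txn": TxnId(1, 7), "value": "a", "overwritten_at": 10.0},
    {"txn": TxnId(2, 3), "value": "b", "overwritten_at": 50.0},
    {"txn": TxnId(2, 7), "value": "c", "overwritten_at": None},
]
print(latest(versions)["value"])  # c
print([v["value"] for v in collect_garbage(versions, now=100.0, ttl=60.0)])  # ['b', 'c']
```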
  244-245. RAMP Details

    Write RTT   Read RTT (best case)   Read RTT (worst case)   Metadata
    2           1                      2                       O(txn len) write set summary
    DETECT RACES via intention metadata; REPAIR ATOMICITY via multi-versioning, ready bit
  246. RAMP Details

    [Same table] Can we use less metadata for intent?
  247-249. RAMP Variants

    Algorithm     Write RTT   Read RTT (best case)   Read RTT (worst case)   Metadata
    RAMP-Fast     2           1                      2                       O(txn len) write set summary
    RAMP-Small    2           2                      2                       O(1) timestamp
    RAMP-Hybrid   2           1+ε                    2                       O(1) Bloom filter
    DETECT RACES via intention metadata; REPAIR ATOMICITY via multi-versioning, ready bit
  250. RAMP Variants

    [Same table] Always attempt to repair… …no metadata needed!
  251. RAMP Variants

    [Same table, with RAMP-Hybrid metadata shown as O(B(ε)) Bloom filter]
  252. RAMP Variants

    [Same table, with RAMP-Hybrid metadata shown as O(B(ε)) Bloom filter]
    Bloom filter summarizes intent
    False positives: extra read RTTs
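A small sketch of how a Bloom filter can summarize a transaction's write set in fixed space. The filter size, the number of hash functions, and the SHA-256-based hashing are arbitrary choices for illustration, not parameters from the talk.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit field; no false negatives, occasional false positives."""

    def __init__(self, bits=64, hashes=3):
        self.bits, self.hashes, self.field = bits, hashes, 0

    def _positions(self, key):
        # Derive `hashes` bit positions from salted SHA-256 digests of the key.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, key):
        for pos in self._positions(key):
            self.field |= 1 << pos

    def might_contain(self, key):
        return all((self.field >> pos) & 1 for pos in self._positions(key))


# The writer attaches a filter over its write set instead of the full item list.
intent = BloomFilter()
for item in ("X", "Y"):
    intent.add(item)

# A reader that fetched a stale Y asks: might this transaction have written Y?
print(intent.might_contain("Y"))  # True
```

An item the transaction really wrote always tests positive, so races are never missed. An unrelated item may occasionally test positive too, and the only cost is an unnecessary second read round, matching "False positives: extra read RTTs".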
  253. RAMP Overview

    SYSTEM KNOWS SEMANTICS ⇒ CLIENTS CAN COOPERATE WITHOUT WAITING FOR EACH OTHER
  254. RAMP Overview

    SYSTEM KNOWS SEMANTICS ⇒ CLIENTS CAN COOPERATE WITHOUT WAITING FOR EACH OTHER
    KEY IDEA: DETECT RACES. Storing intention in metadata allows readers to check for missing writes
  255. RAMP Overview

    SYSTEM KNOWS SEMANTICS ⇒ CLIENTS CAN COOPERATE WITHOUT WAITING FOR EACH OTHER
    KEY IDEA: DETECT RACES. Storing intention in metadata allows readers to check for missing writes
    KEY IDEA: REPAIR ATOMICITY. Transactions “hide” writes until others can reliably complete them (ready bit)
  256. RAMP Overview

    SYSTEM KNOWS SEMANTICS ⇒ CLIENTS CAN COOPERATE WITHOUT WAITING FOR EACH OTHER
    KEY IDEA: DETECT RACES. Storing intention in metadata allows readers to check for missing writes
    KEY IDEA: REPAIR ATOMICITY. Transactions “hide” writes until others can reliably complete them (ready bit)
    coordination free: transactions do not wait for any others to complete
  257-258. RAMP Evaluation

  259. RAMP Evaluation

    1. What is the overhead of the RAMP protocols?
  260. RAMP Evaluation

    1. What is the overhead of the RAMP protocols?
    2. What is the benefit of coordination-free execution?
  261. RAMP Evaluation

    1. What is the overhead of the RAMP protocols?
    2. What is the benefit of coordination-free execution?
    3. How do the RAMP protocols scale?
  262. RAMP Evaluation

    Evaluated on Amazon EC2 cr1.8xlarge servers (1-100 servers; default: 5)
    1. What is the overhead of the RAMP protocols?
    2. What is the benefit of coordination-free execution?
    3. How do the RAMP protocols scale?
  263. [Plot] YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn

    x-axis: Concurrent Clients (0 to 10000); y-axis: Throughput (txn/s) (0 to 180K)
  264. [Same plot] Legend: RAMP-F, RAMP-S, RAMP-H, NWNR, LWNR, LWSR, LWLR, E-PCI

    Highlighted: No Concurrency Control
  265. [Same plot]

    Highlighted: No Concurrency Control. Doesn’t enforce atomic visibility
  266. [Same plot]

    Highlighted: No Concurrency Control; Serializable 2PL
  267. [Same plot]

    Highlighted: No Concurrency Control; Serializable 2PL; Write Locks Only
  268. [Same plot]

    Highlighted: No Concurrency Control; Serializable 2PL; Write Locks Only; RAMP-Fast
  269. [Same plot]

    Highlighted: No Concurrency Control; Serializable 2PL; Write Locks Only; RAMP-Fast. Within 5% of baseline
  270. [Same plot]

    Highlighted: No Concurrency Control; Serializable 2PL; Write Locks Only; RAMP-Fast; RAMP-Small
  271. [Same plot]

    Highlighted: No Concurrency Control; Serializable 2PL; Write Locks Only; RAMP-Fast; RAMP-Small. Always needs 2RTT reads
  272. [Same plot]

    Highlighted: No Concurrency Control; Serializable 2PL; Write Locks Only; RAMP-Fast; RAMP-Small; RAMP-Hybrid
  273. [Plot] YCSB: uniform access, 1M items, 4 items/txn, 95% reads

    x-axis: Number of Servers (0 to 100); y-axis: Throughput (ops/s) (0 to 8M)
  274. [Same plot] Legend: RAMP-H, NWNR, LWNR, LWSR, LWLR, E-PCI

    Highlighted: No Concurrency Control
  275. [Same plot]

    Highlighted: No Concurrency Control; RAMP-Fast; RAMP-Small; RAMP-Hybrid
  276. “accept friend request”, “update index entry”

    [Diagram: stream of individual reads and writes; labels: RAMP TRANSACTION (twice), ATOMIC VISIBILITY]
  277-279. COORDINATION AVOIDANCE: GUARANTEED SAFETY WITHOUT COORDINATION

    [Diagram; labels: MORE SEMANTICS, MORE SAFETY]
    Serializability: COORDINATION REQUIRED, GUARANTEED SAFETY
    Eventual Consistency: COORDINATION FREE, NO SAFETY
    COORDINATION FREE:
      Atomic Visibility (SIGMOD14)
      Database Constraints (VLDB15, SIGMOD15)
      Model Prediction and Training (CIDR15, TBA)
      Weak Isolation (HotOS13, VLDB14)
      Causality (SOCC12, SIGMOD13)
      PBS (VLDB12, VLDBJ14, SIGMOD13, CACM14)
  280. WHAT THE APPLICATION SAYS: my billing application is “correct”; my new social app “does the right thing”

    WHAT THE DATABASE HEARS: [diagram: stream of individual reads and writes]
  282. Database users express correctness criteria via database constraints

  283. Database users express correctness criteria via database constraints

    “usernames should be unique”
    “account balances should remain positive”
    “there should only be one administrator”
  284. Typical database constraints and operations (SQL)

    Constraint             Operation
    Equality, Inequality   Any
    Generate unique ID     Any
    Specify unique ID      Insert
    >                      Increment
    >                      Decrement
    <                      Decrement
    <                      Increment
    Foreign Key            Insert
    Foreign Key            Delete
    Secondary Indexing     Any
    Materialized Views     Any
    AUTO_INCREMENT         Insert
  286. [Project list]

    adopt-a-hydrant, alchemy_cms, amahi, bostonrb, boxroom, brevidy, browsercms, bucketwise, calagator, canvas-lms, carter, chiliproject, citizenry, comas, comfortable-mexican-sofa, communityengine, copycopter-server, danbooru, diaspora, discourse, enki, fat_free_crm, fedena, forem, fulcrum, gitlab-ci, gitlabhq, govsgo, heaven, inkwell, insoshi, jobsworth, juvia, kandan, linuxfr.org, lobsters, lovd-by-less, nimbleshop, obtvse, onebody, opal, opencongress, opengovernment, openproject, piggybak, publify, radiant, railscollab, redmine, refinerycms, ror_ecommerce, rucksack, saasy, salor-retail, selfstarter, sharetribe, skyline, spot-us, spree, sprintapp, squaresquash, sugar, teambox, tracks, tryshoppe, wallgig
  287. [Project list as on slide 286, plus zena] [SIGMOD 2015]

    67 projects, 1.77M LoC, 1957 tables
  288. [Project list as on slide 286, plus zena] [SIGMOD 2015]

    67 projects, 1.77M LoC, 1957 tables. 259 total; avg. 0.13 per table
  289. [Project list as on slide 286, plus zena] [SIGMOD 2015]

    67 projects, 1.77M LoC, 1957 tables. 9986 total; avg. 5.1 per table. 259 total; avg. 0.13 per table
  290. CONSTRAINTS MORE COMMON: 37x [SIGMOD 2015]

    [Project list as on slide 286, plus zena] 67 projects, 1.77M LoC, 1957 tables. 9986 total; avg. 5.1 per table. 259 total; avg. 0.13 per table
  291. WHAT THE APPLICATION SAYS: “no duplicate users”

    WHAT THE DATABASE HEARS: [diagram: stream of individual reads and writes]
  292. WHAT THE APPLICATION SAYS: “no duplicate users”

    WHAT THE DATABASE HEARS: [diagram: stream of individual reads and writes]. TODAY: ENFORCEMENT VIA COORDINATION
  293. WHAT THE APPLICATION SAYS: “no duplicate users”

    WHAT THE DATABASE HEARS: [diagram: stream of individual reads and writes]. CAN WE USE CONSTRAINTS TO AVOID COORDINATION?
  294. WHAT THE APPLICATION SAYS: “no duplicate users”

    WHAT THE DATABASE HEARS: “no duplicate users” [diagram: each operation labeled "constraint"]. CAN WE USE CONSTRAINTS TO AVOID COORDINATION?
  295. Key idea: Check if constraints can be violated by “merging”

    independent operations
  296. Key idea: Check if constraints can be violated by “merging”

    independent operations ICT: Invariant Confluence Test
  297. CONSTRAINT: User IDs are unique OPERATION: Add users MERGE: Set

    union Key idea: Check if constraints can be violated by “merging” independent operations ICT: Invariant Confluence Test
  298. ICT: Invariant Confluence Test

    Key idea: Check if constraints can be violated by “merging” independent operations
    CONSTRAINT: User IDs are unique
    OPERATION: Add users
    MERGE: Set union
    Example: starting from {}, one replica runs add {Stu,ID=1} and another runs add {Ann,ID=1}; MERGE yields {{Stu,ID=1}, {Ann,ID=1}}. Constraint violated!
  299. Key idea: Check if constraints can be violated by “merging”

    independent operations CONSTRAINT: User IDs are positive OPERATION: Add users MERGE: Set union ICT: Invariant Confluence Test
  300. ICT: Invariant Confluence Test

    Key idea: Check if constraints can be violated by “merging” independent operations
    CONSTRAINT: User IDs are positive
    OPERATION: Add users
    MERGE: Set union
    Example: starting from {}, one replica runs add {Stu,ID=1} and another runs add {Ann,ID=1}; MERGE yields {{Stu,ID=1}, {Ann,ID=1}}. Constraint holds!
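The two examples on these slides can be replayed as a small brute-force check. This is only a sketch under stated assumptions: the function names (`passes_ict`, `ids_unique`, `ids_positive`) and the extra user `("Bo", 2)` are hypothetical, and testing a finite sample of inserts can find a violation but cannot prove that none exists.

```python
from itertools import combinations

def merge(a, b):
    """Set-union merge, the merge operator named on the slides."""
    return a | b

def ids_unique(db):
    """Invariant: no two user records share an ID."""
    ids = [uid for _, uid in db]
    return len(ids) == len(set(ids))

def ids_positive(db):
    """Invariant: every user ID is greater than zero."""
    return all(uid > 0 for _, uid in db)

def passes_ict(invariant, start, candidate_inserts):
    """Bounded check: apply each pair of inserts on separate replicas of
    `start`; if both replicas are valid, the merged state must be too."""
    for t1, t2 in combinations(candidate_inserts, 2):
        left = start | {t1}    # replica A after its insert
        right = start | {t2}   # replica B after its insert
        if invariant(left) and invariant(right):
            if not invariant(merge(left, right)):
                return False   # found a pair whose merge violates the invariant
    return True

# Hypothetical sample inserts: (name, id) pairs.
users = [("Stu", 1), ("Ann", 1), ("Bo", 2)]
print(passes_ict(ids_unique, frozenset(), users))    # False
print(passes_ict(ids_positive, frozenset(), users))  # True
```

The uniqueness check fails on the Stu/Ann pair, exactly as on the slide, while positivity survives every merge because set union never changes an existing ID.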
  301. Key idea: Check if constraints can be violated by “merging”

    independent operations ICT: Invariant Confluence Test
  302. Key idea: Check if constraints can be violated by “merging”

    independent operations OUR CONTRIBUTION: [VLDB 2015] ICT: Invariant Confluence Test
  303. Key idea: Check if constraints can be violated by “merging”

    independent operations OUR CONTRIBUTION: Theorem. A globally I-valid system can execute a set of transactions T with coordination-freedom, transactional availability, and convergence if and only if T are I-confluent with respect to I. [VLDB 2015] ICT ⟺ safe, coordination-free execution possible ICT: Invariant Confluence Test
  304. ICT: Invariant Confluence Test

    Key idea: Check if constraints can be violated by “merging” independent operations
    OUR CONTRIBUTION [VLDB 2015]: Generalizes classic partitioning-based indistinguishability arguments
    Theorem. A globally I-valid system can execute a set of transactions T with coordination-freedom, transactional availability, and convergence if and only if T are I-confluent with respect to I.
    ICT ⟺ safe, coordination-free execution possible
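One way to write down the condition the theorem refers to, as a hedged paraphrase rather than the slide's own notation: assume D_0 is a common ancestor state, D_i and D_j are states reached from it by running transactions from T on separate replicas, I(D) means that state D satisfies the invariant, and ⊔ is the merge operator (set union in the examples above). These symbols are assumptions introduced for illustration.

```latex
\[
T \text{ is } I\text{-confluent} \iff
\forall D_i, D_j \text{ reachable from a common } D_0 \text{ via } T:\;
\bigl( I(D_i) \wedge I(D_j) \bigr) \implies I(D_i \sqcup D_j)
\]
```

Read this way, the uniqueness example fails because two individually valid states merge into an invalid one, while the positivity example passes because no merge of valid states can introduce a non-positive ID.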
  305. Constraint Operation OK? Equality, Inequality Any ??? Generate unique ID

    Any ??? Specify unique ID Insert ??? > Increment ??? > Decrement ??? < Decrement ??? < Increment ??? Foreign Key Insert ??? Foreign Key Delete ??? Secondary Indexing Any ??? Materialized Views Any ??? AUTO_INCREMENT Insert ??? Typical database constraints and operations (SQL) Under set merge
  306. Typical database constraints and operations (SQL), under set merge [VLDB 2015]

    Constraint           | Operation | OK?
    Equality, Inequality | Any       | Y
    Generate unique ID   | Any       | Y
    Specify unique ID    | Insert    | N
    >                    | Increment | Y
    >                    | Decrement | N
    <                    | Decrement | Y
    <                    | Increment | N
    Foreign Key          | Insert    | Y
    Foreign Key          | Delete    | Y*
    Secondary Indexing   | Any       | Y
    Materialized Views   | Any       | Y
    AUTO_INCREMENT       | Insert    | N
  307. Typical database constraints and operations (SQL), under set merge [VLDB 2015]

    Constraint           | Operation | OK?
    Equality, Inequality | Any       | Y
    Generate unique ID   | Any       | Y
    Specify unique ID    | Insert    | N
    >                    | Increment | Y
    >                    | Decrement | N
    <                    | Decrement | Y
    <                    | Increment | N
    Foreign Key          | Insert    | Y
    Foreign Key          | Delete    | Y*
    Secondary Indexing   | Any       | Y
    Materialized Views   | Any       | Y
    AUTO_INCREMENT       | Insert    | N
    RAMP
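The ">" rows of the table can be illustrated with a counter. A minimal sketch, assuming the counter is modeled as a start value plus each replica's list of signed deltas, and that merging applies both replicas' deltas; this modeling choice and the helper names (`locally_valid`, `merge_is_safe`) are assumptions, not taken from the slides.

```python
def locally_valid(start, deltas, invariant):
    """True if the invariant holds after every delta applied on one replica."""
    value = start
    for d in deltas:
        value += d
        if not invariant(value):
            return False
    return True

def merged_value(start, deltas_a, deltas_b):
    """Merge two replicas by applying both sets of deltas to the common start."""
    return start + sum(deltas_a) + sum(deltas_b)

def merge_is_safe(start, deltas_a, deltas_b, invariant):
    """Given two locally valid replicas, does the merged counter stay valid?"""
    if not (locally_valid(start, deltas_a, invariant)
            and locally_valid(start, deltas_b, invariant)):
        raise ValueError("each replica must satisfy the invariant on its own")
    return invariant(merged_value(start, deltas_a, deltas_b))

def positive(value):
    """Invariant: counter > 0."""
    return value > 0

print(merge_is_safe(1, [+1], [+1], positive))  # True: merged value is 3
print(merge_is_safe(2, [-1], [-1], positive))  # False: merged value is 0
```

Increments can only move the counter further above the bound, so independent replicas never conflict; two decrements that are each safe locally can jointly cross the bound, which matches the Y and N entries for ">" in the table.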
  309. adopt-a-hydrant alchemy_cms amahi bostonrb boxroom brevidy browsercms bucketwise calagator canvas-lms

    carter chiliproject citizenry comas comfortable-mexican-sofa communityengine copycopter-server danbooru diaspora discourse enki fat_free_crm fedena forem fulcrum gitlab-ci gitlabhq govsgo heaven inkwell insoshi jobsworth juvia kandan linuxfr.org lobsters lovd-by-less nimbleshop obtvse onebody opal opencongress opengovernment openproject piggybak publify radiant railscollab redmine refinerycms ror_ecommerce rucksack saasy salor-retail selfstarter sharetribe skyline spot-us spree sprintapp squaresquash sugar teambox tracks tryshoppe wallgig zena 67 projects 1.77M LoC 1957 tables 9986 total; avg. 5.1 per table 259 total; avg. 0.13 per table 86.9% PASS ICT [SIGMOD 2015]
  311. TPC-C

  312. 14/16 CONSTRAINTS PASS ICT TPC-C

  313. TPC-C: 14/16 CONSTRAINTS PASS ICT

    6-11x faster than ACID/serializability
    [Figure: Throughput (txns/s) vs. Number of Warehouses, comparing Coordination-Avoiding and Serializable (2PL)]
  314. TPC-C: 14/16 CONSTRAINTS PASS ICT

    6-11x faster than ACID/serializability; scale to over 25x best listed result
    [Figure: Throughput (txns/s) vs. Number of Warehouses, comparing Coordination-Avoiding and Serializable (2PL)]
    [Figure: Total Throughput (txn/s) and Throughput (txn/s/server) vs. Number of Servers]
  315. WHAT THE APPLICATION SAYS “no duplicate users” constraint WHAT THE

    DATABASE HEARS constraint constraint constraint constraint constraint constraint constraint “no duplicate users” CAN WE USE CONSTRAINTS TO AVOID COORDINATION?
  316. Serializability COORDINATION REQUIRED GUARANTEED SAFETY Eventual Consistency COORDINATION FREE NO

    SAFETY Atomic Visibility SIGMOD14 Database Constraints VLDB15, SIGMOD15 Model Prediction and Training CIDR15, TBA Weak Isolation HotOS13, VLDB14 Causality SOCC12, SIGMOD13 COORDINATION AVOIDANCE GUARANTEED SAFETY WITHOUT COORDINATION MORE SEMANTICS MORE SAFETY PBS VLDB12, VLDBJ14, SIGMOD13, CACM14 COORDINATION FREE
  319. Key idea: Exploit statistical robustness in system designs

  320. PLASMA: ASYNCHRONOUS LEARNING [Ongoing] Key idea: Exploit statistical robustness in

    system designs
  321. PLASMA: ASYNCHRONOUS LEARNING [Ongoing] TIME Bulk Synch Parallel Key idea:

    Exploit statistical robustness in system designs
  322. PLASMA: ASYNCHRONOUS LEARNING [Ongoing] ML task: Express algorithms via async

    iterator (e.g., ADMM) Bulk Async Parallel TIME TIME Bulk Synch Parallel Key idea: Exploit statistical robustness in system designs Break dataflow barriers using new iterator model
  323. VELOX: FAST ONLINE PREDICTIONS [CIDR 2015] PLASMA: ASYNCHRONOUS LEARNING [Ongoing]

    ML task: Express algorithms via async iterator (e.g., ADMM) Bulk Async Parallel TIME TIME Bulk Synch Parallel Key idea: Exploit statistical robustness in system designs Break dataflow barriers using new iterator model
  324. VELOX: FAST ONLINE PREDICTIONS [CIDR 2015] Fast incremental personalization Batch

    retrain shared features PLASMA: ASYNCHRONOUS LEARNING [Ongoing] ML task: Express algorithms via async iterator (e.g., ADMM) Bulk Async Parallel TIME TIME Bulk Synch Parallel Key idea: Exploit statistical robustness in system designs Break dataflow barriers using new iterator model
  325. VELOX: FAST ONLINE PREDICTIONS [CIDR 2015] Fast incremental personalization Batch

    retrain shared features PLASMA: ASYNCHRONOUS LEARNING [Ongoing] ML task: Express algorithms via async iterator (e.g., ADMM) Bulk Async Parallel TIME TIME Bulk Synch Parallel Key idea: Exploit statistical robustness in system designs Prioritize model maintenance by robustness Break dataflow barriers using new iterator model
  326. Key idea: Exploit statistical robustness in system designs

    PLASMA: ASYNCHRONOUS LEARNING [Ongoing]
    Break dataflow barriers using new iterator model
    ML task: Express algorithms via async iterator (e.g., ADMM)
    [Figure: timelines contrasting Bulk Synch Parallel and Bulk Async Parallel]
    VELOX: FAST ONLINE PREDICTIONS [CIDR 2015]
    Prioritize model maintenance by robustness
    ML task: Split models according to robustness
    Fast incremental personalization; Batch retrain shared features
  327. Serializability COORDINATION REQUIRED GUARANTEED SAFETY Eventual Consistency COORDINATION FREE NO

    SAFETY Atomic Visibility SIGMOD14 Database Constraints VLDB15, SIGMOD15 Model Prediction and Training CIDR15, TBA Weak Isolation HotOS13, VLDB14 Causality SOCC12, SIGMOD13 COORDINATION AVOIDANCE GUARANTEED SAFETY WITHOUT COORDINATION MORE SEMANTICS MORE SAFETY PBS VLDB12, VLDBJ14, SIGMOD13, CACM14 COORDINATION FREE
  329. DESIGN DATABASE SYSTEMS THAT EXPLOIT SEMANTICS OF HIGH-VALUE USE CASES

    MY APPROACH: Study practical database use cases Derive principles and algorithms Build systems to realize the benefits
  331. PBS: Integrated into Cassandra 1.2 release + recent extensions at

    a major Internet company
  332. PBS: Integrated into Cassandra 1.2 release RAMP: Proposed feature in

    Cassandra 3.0 (Reportedly) on roadmap for Facebook Apollo, IBM Cloudant + recent extensions at a major Internet company
  333. PBS: Integrated into Cassandra 1.2 release RAMP: Proposed feature in

    Cassandra 3.0 (Reportedly) on roadmap for Facebook Apollo, IBM Cloudant + recent extensions at a major Internet company HAT Isolation: part of Kleppmann@LinkedIn’s Hermitage testing suite
  334. PBS: Integrated into Cassandra 1.2 release RAMP: Proposed feature in

    Cassandra 3.0 (Reportedly) on roadmap for Facebook Apollo, IBM Cloudant + recent extensions at a major Internet company HAT Isolation: part of Kleppmann@LinkedIn’s Hermitage testing suite Active dialogue with developer, NoSQL community via invited talks, blogging, social media
  335. MY WORK: COORDINATION AVOIDANCE

    Current Practice: PBS (VLDB12, SIGMOD13, VLDBJ14, CACM14); EC Today (CACM/Queue13); Consistency without Borders (SoCC13); Network Partitions (CACM/Queue14); Feral Concurrency Control (SIGMOD15)
    Principles: I-Confluence (VLDB15); HATs (HotOS13, VLDB14); Explicit Causality (SoCC12)
    Systems: Bolt-On (SIGMOD13); RAMP + Indexing (SIGMOD14); Velox (CIDR15); Plasma + BAP (Ongoing)
  338. FUTURE WORK

  339. FUTURE WORK Automatically coordinated applications

  340. FUTURE WORK Automatically coordinated applications Bespoke analysis and coordination synthesis

  341. FUTURE WORK Automatically coordinated applications Bespoke analysis and coordination synthesis

    “Query optimization” for transaction execution
  342. FUTURE WORK Automatically coordinated applications Bespoke analysis and coordination synthesis

    “Query optimization” for transaction execution DB meets “Big Data” Learning
  343. FUTURE WORK Automatically coordinated applications Bespoke analysis and coordination synthesis

    “Query optimization” for transaction execution DB meets “Big Data” Learning View materialization and selection for model maintenance
  344. FUTURE WORK Automatically coordinated applications Bespoke analysis and coordination synthesis

    “Query optimization” for transaction execution DB meets “Big Data” Learning View materialization and selection for model maintenance Bounded divergence control for coordinating learners
  345. FUTURE WORK Automatically coordinated applications Bespoke analysis and coordination synthesis

    “Query optimization” for transaction execution DB meets “Big Data” Learning View materialization and selection for model maintenance Bounded divergence control for coordinating learners Next-Generation Data Applications
  346. FUTURE WORK Automatically coordinated applications Bespoke analysis and coordination synthesis

    “Query optimization” for transaction execution DB meets “Big Data” Learning View materialization and selection for model maintenance Bounded divergence control for coordinating learners Next-Generation Data Applications Next 10-100x growth in data volume due to sensors, apps
  347. FUTURE WORK

    Automatically coordinated applications: Bespoke analysis and coordination synthesis; “Query optimization” for transaction execution
    DB meets “Big Data” Learning: View materialization and selection for model maintenance; Bounded divergence control for coordinating learners
    Next-Generation Data Applications: Next 10-100x growth in data volume due to sensors, apps; New interfaces for increased coordination costs, heterogeneity
  348. WHAT THE APPLICATION SAYS “post on timeline” “accept friend request”

    write read write read write write read write write write read write WHAT THE DATABASE HEARS read read read read read read
  349. Eventual Consistency COORDINATION FREE NO SAFETY Atomic Visibility SIGMOD14 Database

    Constraints VLDB15, SIGMOD15 Model Prediction and Training CIDR15, TBA Weak Isolation HotOS13, VLDB14 Causality SOCC12, SIGMOD13 COORDINATION AVOIDANCE GUARANTEED SAFETY WITHOUT COORDINATION MORE SEMANTICS MORE SAFETY PBS VLDB12, VLDBJ14, SIGMOD13, CACM14 COORDINATION FREE
  350. Eventual Consistency COORDINATION FREE NO SAFETY Atomic Visibility SIGMOD14 Database

    Constraints VLDB15, SIGMOD15 Model Prediction and Training CIDR15, TBA Weak Isolation HotOS13, VLDB14 Causality SOCC12, SIGMOD13 COORDINATION AVOIDANCE GUARANTEED SAFETY WITHOUT COORDINATION MORE SEMANTICS MORE SAFETY PBS VLDB12, VLDBJ14, SIGMOD13, CACM14 COORDINATION FREE Joint work with Ali Ghodsi, Joe Hellerstein, Ion Stoica, Mike Franklin, Michael Jordan, Alan Fekete, Dan Crankshaw, Shivaram Venkataraman, Neil Conway, Peter Alvaro, Aaron Davidson, Joey Gonzalez, Kyle Kingsbury, Haoyuan Li, and Zhao Zhang
  352. Many illustrations by the Noun Project (CC-Attribution)

    surprised by Julian Derveaux; world by Wayne Tyler Sall; database by Austin Condiff; earth by Martin Vanco; Woman by Simon Child; Man by Simon Child; Doctor by Simon Child; David-Hockney by Simon Child; Server by Simon Child; clock by christoph robausch