Coordination Avoidance In Distributed Databases

COORDINATION AVOIDANCE IN DISTRIBUTED DATABASES
Peter Bailis, UC Berkeley

DATA TODAY: UNPRECEDENTED SCALE, VOLUME, INTERACTIVITY (AND GROWING!)

SCALE: Billion-user Internet services
» 3B Internet users in 2014
» 2.3B Mobile broadband users

VOLUME
» Facebook RocksDB: 9B ops/sec
» Google BigTable: 600M ops/sec
» LinkedIn Kafka: 2.5M ops/sec

INTERACTIVITY
» Impatient users want low latency
» Always-on responsiveness
» Personalized user experiences

Sources: Ericsson Mobility Report, UN International Telecommunication Union, Facebook, Google, NSA, @RocksDB, @AKPurtell, Martin Kleppmann

How should we design database systems that enable applications to scale?
Example application operations: “post on timeline”, “accept friend request”

CLASSIC: ACID serializable transactions
Example operations: “post on timeline”, “accept friend request”

serializability: equivalence to some serial execution
» very general!
» …but restricts concurrency

CONCURRENT EXECUTION
» One transaction: r(y)=0, w(x←1)
» The other transaction: r(x)=0, w(y←1)

Neither serial order matches these reads:
» r(x)=0, w(y←1) runs first, then r(y), w(x←1): should have r(y)→1
» r(y)=0, w(x←1) runs first, then r(x), w(y←1): should have r(x)→1

CONCURRENT EXECUTION IS NOT SERIALIZABLE!
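The non-serializability argument above can be checked mechanically. The following is a minimal sketch, not part of the talk: it enumerates every serial order of the two transactions and compares only the values each transaction reads. The names T1, T2, `run_serial`, and `is_serializable` are hypothetical labels introduced for illustration.

```python
from itertools import permutations

# Each transaction is a list of operations on a shared key-value state.
# ("r", key) records the value read; ("w", key, value) overwrites the key.
T1 = [("r", "y"), ("w", "x", 1)]
T2 = [("r", "x"), ("w", "y", 1)]

def run_serial(order, initial):
    """Run whole transactions one after another; return {txn name: reads}."""
    state = dict(initial)
    reads = {}
    for name, ops in order:
        seen = []
        for op in ops:
            if op[0] == "r":
                seen.append((op[1], state[op[1]]))
            else:
                state[op[1]] = op[2]
        reads[name] = seen
    return reads

def is_serializable(observed, txns, initial):
    """True if some serial order of txns produces exactly the observed reads."""
    return any(run_serial(order, initial) == observed
               for order in permutations(txns.items()))

initial = {"x": 0, "y": 0}
txns = {"T1": T1, "T2": T2}

# The concurrent execution from the slides: both reads return 0.
observed = {"T1": [("y", 0)], "T2": [("x", 0)]}
print(is_serializable(observed, txns, initial))  # False
```

Brute-force enumeration only works for a handful of transactions; it is meant to make the definition concrete, not to suggest how a database would enforce it.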

Serializability requires Coordination: transactions cannot make progress independently

» Two-Phase Locking
» Optimistic Concurrency Control
» Pre-Scheduling
» Multi-Version Concurrency Control
Consequences: Blocking, Waiting, Aborts

Costs of Coordination Between Concurrent Transactions
1. Decreased performance

[Figure: Maximum Throughput (txns/s) vs. Number of Servers in Transaction (2 to 8), for conflicting transactions. Local datacenter (Amazon EC2). Based on [Bobtail, Xu et al., NSDI 13].]

[Figure: Maximum Throughput (txn/s) vs. Participating Datacenters (+VA, +OR, +CA, +IR, +SP, +TO, +SI, +SY), for conflicting transactions. Multi-datacenter (Amazon EC2). Based on [HAT, Bailis et al., VLDB 14].]

Costs of Coordination Between Concurrent Transactions
(Serializability requires Coordination: transactions cannot make progress independently)
1. Decreased performance
» due to waiting, communication delays, aborts
» exacerbated in distributed environment!
2. Decreased availability during failures
Well-known for decades; cf. “CAP”

How should we design database systems that enable applications to scale?
» Serializability: COORDINATION REQUIRED
» “NoSQL”: COORDINATION FREE

Eventual Consistency
“if no new updates are made to the [database], eventually all accesses will return the last updated value[s]” (Werner Vogels, Amazon CTO)
provides no safety: what happens in the meantime?

Probabilistically Bounded Staleness (PBS)
[VLDB 2012, VLDB Journal 2014 “Best of VLDB 2012”, SIGMOD 2013 (Demo), CACM Research Highlight]
» Monte Carlo analysis of protocol behavior
» Key finding: frequently “correct” results…

PBS: Voldemort Database at LinkedIn
» 99% of reads return the last update 23ms after write
» 32-90% decrease in 99.9th percentile latency

…BUT NO GUARANTEES! ⇒ DIFFICULT TO PROGRAM
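To give a feel for what a Monte Carlo analysis of staleness looks like, here is a toy sketch. It is not the PBS model from the cited papers: the exponential propagation delay, the replica count, the read quorum size, and the function name `prob_fresh_read` are all assumptions chosen for illustration, and it does not reproduce the LinkedIn numbers above.

```python
import random

def prob_fresh_read(t_ms, n_replicas=3, read_quorum=1, mean_delay_ms=5.0,
                    trials=20000, seed=0):
    """Estimate P(a read issued t_ms after a write returns that write)."""
    rng = random.Random(seed)
    fresh = 0
    for _ in range(trials):
        # Time at which each replica applies the write (hypothetical delay model).
        arrival = [rng.expovariate(1.0 / mean_delay_ms) for _ in range(n_replicas)]
        # The read contacts read_quorum replicas chosen uniformly at random.
        contacted = rng.sample(range(n_replicas), read_quorum)
        # The read is fresh if any contacted replica already has the write.
        if any(arrival[i] <= t_ms for i in contacted):
            fresh += 1
    return fresh / trials

for t in (0, 5, 10, 25):
    print(t, prob_fresh_read(t))
```

The estimate rises toward 1 as the wait after the write grows, which is the "frequently correct, but no guarantee" behavior the slide describes.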

“…sometimes the [write] is retrieved from the datastore and sometimes
it is not.”

Serializability: COORDINATION REQUIRED, GUARANTEED SAFETY
Eventual Consistency: COORDINATION FREE, NO SAFETY
PBS (VLDB12, VLDBJ14, SIGMOD13, CACM14)

MY WORK: COORDINATION AVOIDANCE: GUARANTEED SAFETY WITHOUT COORDINATION

The Far Side, Gary Larson

WHAT THE APPLICATION SAYS: “post on timeline”, “accept friend request”

WHAT THE DATABASE HEARS: write, read, write, read, write, write, read, … (only individual reads and writes)

MY APPROACH: DESIGN DATABASE SYSTEMS THAT EXPLOIT SEMANTICS OF HIGH-VALUE USE CASES
» Study practical database use cases
» Derive principles and algorithms
» Build systems to realize the benefits

COORDINATION AVOIDANCE: GUARANTEED SAFETY WITHOUT COORDINATION
MORE SEMANTICS, MORE SAFETY

» Eventual Consistency: COORDINATION FREE, NO SAFETY
» PBS (VLDB12, VLDBJ14, SIGMOD13, CACM14)
» Causality (SOCC12, SIGMOD13)
» Weak Isolation (HotOS13, VLDB14)
» Atomic Visibility (SIGMOD14)
» Database Constraints (VLDB15, SIGMOD15)
» Model Prediction and Training (CIDR15, TBA)
» Serializability: COORDINATION REQUIRED, GUARANTEED SAFETY

Application areas: Data Serving and Transactions; Analytics (Model Prediction and Training)

All of the COORDINATION AVOIDANCE work listed here is COORDINATION FREE.

(Abridged) Related Work
» Semantics-based concurrency control: esp. commutativity and CALM analysis, laws of order
» Available storage systems: optimistic replication, causal memory, CRDTs, eventually consistent transactions
» Distributed computing: CAP, FLP, NBAC, quorums
» Here: focus on necessary coordination for common, modern data-intensive apps

[Roadmap: the same COORDINATION AVOIDANCE diagram, with three topics marked 1, 2, and 3; the talk continues with topic 1.]

Social Graph (Facebook): 1.2B+ vertices, 420B+ edges

User   Adjacency List
1      2, 3, 5
2      1, 3, 5
3      1, 5, 6
4      6
5      1, 2, 3, 6
6      3, 4, 5

Linking users 1 and 6 takes two writes, one to each adjacency list:

User   Adjacency List
1      2, 3, 5, 6
6      3, 4, 5, 1

To preserve graph, should observe either:
» Both links
» Neither link

This property is Atomic Visibility.
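A small sketch of why the two-write update matters, using the adjacency lists from the slide. The helper `link_is_consistent` is a hypothetical name introduced here; the point is only that an observer who looks between the two writes sees one link without its inverse.

```python
# Adjacency lists copied from the slide (user -> set of friends).
graph = {
    1: {2, 3, 5},
    2: {1, 3, 5},
    3: {1, 5, 6},
    4: {6},
    5: {1, 2, 3, 6},
    6: {3, 4, 5},
}

def link_is_consistent(g, a, b):
    """A link is consistent if both directions exist or neither does."""
    return (b in g[a]) == (a in g[b])

assert link_is_consistent(graph, 1, 6)   # neither link: fine

graph[1].add(6)                           # first write lands
half_done = link_is_consistent(graph, 1, 6)

graph[6].add(1)                           # second write lands
done = link_is_consistent(graph, 1, 6)

print(half_done, done)  # False True
```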

Atomic Visibility
either all or none of each transaction’s updates should be visible to other transactions

A transaction performs WRITE X = 1 and WRITE Y = 1. Another transaction may observe:
» READ X = 1 and READ Y = 1 (all of the updates)
» OR READ X and READ Y both returning the prior values (none of the updates)

BUT NOT “FRACTURED READS”:
» one read returning the new value while the other returns the prior value
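The all-or-none rule can be phrased as a check over what a reader saw. This is a sketch under assumptions: versions are identified by integer timestamps, one per writing transaction, and `has_fractured_read` and `write_sets` are hypothetical names used only for illustration.

```python
def has_fractured_read(read_versions, write_sets):
    """
    read_versions: item -> timestamp of the version the reader saw.
    write_sets:    timestamp -> set of items written by that transaction.
    Returns True if the reader saw some transaction's write to one item
    but an older version of another item that the same transaction wrote.
    """
    for item, ts in read_versions.items():
        for sibling in write_sets.get(ts, set()):
            if sibling in read_versions and read_versions[sibling] < ts:
                return True
    return False

# Timestamp 0 stands for the initial load, timestamp 1 for the writer of X and Y.
write_sets = {0: {"X", "Y"}, 1: {"X", "Y"}}

print(has_fractured_read({"X": 1, "Y": 1}, write_sets))  # False: all of the updates
print(has_fractured_read({"X": 0, "Y": 0}, write_sets))  # False: none of the updates
print(has_fractured_read({"X": 1, "Y": 0}, write_sets))  # True: fractured read
```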

Atomic Visibility is sufficient to correctly maintain: social graph structure

Atomic Visibility is not serializability!
» The concurrent execution r(y)=0, w(x←1) alongside r(x)=0, w(y←1) IS NOT SERIALIZABLE: one serial order should have r(y)→1, the other should have r(x)→1
» …but it respects Atomic Visibility!

Atomic Visibility compared

                                    Fractured Reads    Item Anti-Dependency Cycles    Anti-Dependency Cycles
Serializability                     Prevents           Prevents                       Prevents
Snapshot Isolation                  Prevents           Prevents                       Doesn’t prevent
Atomic Visibility via Read Atomic   Prevents           Doesn’t prevent                Doesn’t prevent
Eventual Consistency                Doesn’t prevent    Doesn’t prevent                Doesn’t prevent

» Fractured Reads: WANT TO PREVENT
» Item Anti-Dependency Cycles, Anti-Dependency Cycles: Require coordination to prevent! [VLDB 2014]

Also applies to other relationships: each patient should have an attending doctor

Atomic Visibility is sufficient to correctly maintain:
» social graph structure
» referential integrity
» secondary indexes
» materialized views
despite being weaker than serializability

Atomic Visibility via Locking
» X and Y are stored on different servers (Server 1001, Server 1002); initially X=0, Y=0
» A transaction writes X = 1 and Y = 1
» Once both writes are applied (X=1, Y=1), a reader gets X = 1, Y = 1
» While only one write is applied (X=1, Y=0), readers of X and Y get no answer yet (X = ?, Y = ?)

[Timeline diagram: a writer issuing W(X), W(Y) races a reader issuing R(X), R(Y). If the reads interleave with the writes: ATOMICITY VIOLATED!]
» LOCKING: locks on X and Y keep the reads and writes from interleaving
» OPTIMISTIC: VALIDATE ATOMICITY. VIOLATED? ABORT
BOTH RELY ON COORDINATION

Due to coordination overheads…
» Facebook Tao
» Google Megastore
» LinkedIn Espresso
» Amazon DynamoDB
» Apache Cassandra
» Basho Riak
» Yahoo! PNUTS
» Google App Engine
…consciously choose to violate atomic visibility

“[Tao] explicitly favors efficiency and availability over consistency…[an edge] may exist without an inverse; these hanging associations are scheduled for repair by an asynchronous job.”

Our contributions [SIGMOD 2014, selected for “Best of SIGMOD” ACM TODS]:
1. A new model: atomic visibility (via Read Atomic isolation) is (provably) sufficient to maintain social graph structure, referential integrity, secondary indexes, and materialized views
2. Efficient protocols: RAMP transactions enforce atomic visibility without coordination

WHAT THE APPLICATION SAYS: “accept friend request”, “update index entry”

WHAT THE DATABASE HEARS: write, write, read, write, read, write, read, … (only individual reads and writes)

With ATOMIC VISIBILITY, the reads and writes of each application operation are grouped into a RAMP TRANSACTION.

[Timeline diagram: LOCKING, OPTIMISTIC, and RAMP TRANSACTIONS handling a writer (W(X), W(Y)) racing a reader (R(X), R(Y)).]
» OPTIMISTIC: VALIDATE ATOMICITY. VIOLATED? ABORT
» RAMP TRANSACTIONS: Without coordination, atomicity violations will (initially) occur! Don’t panic! Don’t abort!
» Instead: DETECT RACES, then REPAIR ATOMICITY with an extra read (here, a second R(Y))

Atomic Visibility via RAMP Transactions
» DETECT RACES via intention metadata
» REPAIR ATOMICITY via multi-versioning, ready bit

The race: X and Y are stored on different servers (Server 1001, Server 1002), initially X=0 and Y=0. A transaction writes X = 1 and Y = 1. If only the write to X has been applied, a reader of X and Y gets X = 1, Y = 0.

DETECT RACES via intention metadata

value   transaction   intention
X=1     T1            {Y}
Y=0     T0            {}

» The entry for X=1 says: “A transaction called T1 wrote this and also wrote to Y”
» A reader that gets X = 1 and Y = 0 can therefore ask: where is T1’s write to Y?

REPAIR ATOMICITY via multi-versioning, ready bit

Servers keep multiple versions. While T1 is writing, the ready versions are still X=0 (T0, {}) and Y=0 (T0, {}); the new versions X=1 (T1, {Y}) and Y=1 (T1, {X}) are stored alongside them.

Writes:
1.) Place write on each server.
2.) Set ready bit on each write on server.

Ready bit invariant: if ready bit is set, all writes in transaction are present on their respective servers

Reads:
1.) Fetch “highest” ready versions. (Example: X = 1, Y = 0)
2.) Fetch any missing writes using metadata. (Example: Y = 1)
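The write and read steps above can be sketched in a few dozen lines. This is a single-process illustration under assumptions, not the authors' implementation: servers are plain Python objects, timestamps are integers, there is no network or failure handling, and the names `Server`, `ramp_write`, and `ramp_read` are hypothetical.

```python
class Server:
    """One partition: multi-versioned store plus a 'ready' pointer per item."""
    def __init__(self):
        self.versions = {}      # item -> {ts: (value, sibling items)}
        self.latest_ready = {}  # item -> highest ts whose ready bit is set

    def prepare(self, item, ts, value, siblings):
        # Store the version and its intention metadata; not yet visible.
        self.versions.setdefault(item, {})[ts] = (value, frozenset(siblings))

    def commit(self, item, ts):
        # Set the ready bit: this version may now be returned to readers.
        self.latest_ready[item] = max(self.latest_ready.get(item, -1), ts)

    def get(self, item, ts=None):
        """Return (ts, value, siblings); ts=None means highest ready version."""
        if ts is None:
            ts = self.latest_ready[item]  # assumes the item was loaded earlier
        value, siblings = self.versions[item][ts]
        return ts, value, siblings


def ramp_write(servers, ts, writes):
    items = set(writes)
    for item, value in writes.items():      # 1.) place write on each server
        servers[item].prepare(item, ts, value, items - {item})
    for item in writes:                      # 2.) set ready bit on each write
        servers[item].commit(item, ts)


def ramp_read(servers, items):
    first = {i: servers[i].get(i) for i in items}   # 1.) highest ready versions
    needed = {i: first[i][0] for i in items}
    for ts, _, siblings in first.values():
        for sib in siblings:
            if sib in needed and needed[sib] < ts:
                needed[sib] = ts                     # a sibling write is missing
    result = {}
    for i in items:                                  # 2.) fetch missing writes
        if needed[i] != first[i][0]:
            result[i] = servers[i].get(i, needed[i])[1]
        else:
            result[i] = first[i][1]
    return result


servers = {"X": Server(), "Y": Server()}
ramp_write(servers, 0, {"X": 0})    # initial versions, empty intention
ramp_write(servers, 0, {"Y": 0})

# Simulate the race: T1 (ts=1) has placed both writes, but so far the
# ready bit is set only on X.
servers["X"].prepare("X", 1, 1, {"Y"})
servers["Y"].prepare("Y", 1, 1, {"X"})
servers["X"].commit("X", 1)

print(ramp_read(servers, ["X", "Y"]))  # {'X': 1, 'Y': 1}
```

The second-round fetch succeeds because of the ready bit invariant: the reader only learns about T1 from a ready version, and by then T1's write to Y is already stored on its server.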

RAMP Details
» DETECT RACES via intention metadata
» REPAIR ATOMICITY via multi-versioning, ready bit

Write RTT   READ RTT (best case)   READ RTT (worst case)   METADATA
2           1                      2                       O(txn len) write set summary

» Ensures that readers never have to wait
» 2nd RTT for repair, in the event of a race

Transaction IDs: sequence number and client ID
» Also use to order overwrites!

Garbage collection of old versions:
» Set timeout (TTL) for overwritten versions
» Limit read transaction duration to TTL
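One way to picture transaction IDs built from a sequence number and a client ID is as a tuple that compares field by field. The field order (sequence number first) and the name `TxnId` are assumptions for illustration; the slides only state which two components are used.

```python
from typing import NamedTuple

class TxnId(NamedTuple):
    seq: int        # per-client sequence number
    client_id: int  # breaks ties between clients that pick the same seq

a = TxnId(seq=7, client_id=1)
b = TxnId(seq=7, client_id=2)
c = TxnId(seq=8, client_id=1)

# Tuples compare field by field, so every pair of distinct IDs is ordered.
assert a < b < c

# The same order decides which overwrite of an item is the "highest" version.
latest = max([a, b, c])
print(latest)  # TxnId(seq=8, client_id=1)
```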

Can we use less metadata for intent?

RAMP Variants
» DETECT RACES via intention metadata
» REPAIR ATOMICITY via multi-versioning, ready bit

Algorithm     Write RTT   READ RTT (best case)   READ RTT (worst case)   METADATA
RAMP-Fast     2           1                      2                       O(txn len) write set summary
RAMP-Small    2           2                      2                       O(1) timestamp
RAMP-Hybrid   2           1+ε                    2                       O(B(ε)) Bloom filter

» RAMP-Small: Always attempt to repair… …no metadata needed!
» RAMP-Hybrid: Bloom filter summarizes intent. False positives: extra read RTTs
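To see why a Bloom filter can summarize intent at the cost of occasional extra reads, here is a generic Bloom filter sketch. It is not the RAMP-Hybrid code: the bit width, the number of hash functions, the use of SHA-256, and the class name `BloomFilter` are assumptions for illustration.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for k in range(self.num_hashes):
            digest = hashlib.sha256(f"{k}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # True for every added item; may also be True for others (false positive).
        return all(self.bits >> pos & 1 for pos in self._positions(item))

# Summarize a transaction's write set {X, Y} in a fixed number of bits.
intent = BloomFilter()
for item in ("X", "Y"):
    intent.add(item)

assert intent.might_contain("X") and intent.might_contain("Y")
```

A filter never misses an item that was added, so a reader never overlooks a sibling write; a false positive only makes the reader issue a repair read that turns out to be unnecessary, which matches the "extra read RTTs" note above.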

RAMP Overview
SYSTEM KNOWS SEMANTICS ⇒ CLIENTS CAN COOPERATE WITHOUT WAITING FOR EACH OTHER
» KEY IDEA: DETECT RACES. Storing intention in metadata allows readers to check for missing writes
» KEY IDEA: REPAIR ATOMICITY. Transactions “hide” writes until others can reliably complete them (ready bit)
» coordination free: transactions do not wait for any others to complete

RAMP Evaluation
1. What is the overhead of the RAMP protocols?
2. What is the benefit of coordination-free execution?
3. How do the RAMP protocols scale?
evaluated on Amazon EC2 cr1.8xlarge servers (1-100 servers; default: 5)

YCSB: WorkloadA, 95% reads, 1M items, 4 items/txn
[Figure: Throughput (txn/s, 0 to 180K) versus Concurrent Clients (0 to 10000). Series: RAMP-F, RAMP-S, RAMP-H, NWNR, LWNR, LWSR, LWLR, E-PCI.]
No Concurrency Control (NWNR): doesn’t enforce atomic visibility
Serializable 2PL (LWLR)
Write Locks Only (LWNR)
RAMP-Fast (RAMP-F): within 5% of baseline
RAMP-Small (RAMP-S): always needs 2RTT reads
RAMP-Hybrid (RAMP-H)

YCSB: uniform access, 1M items, 4 items/txn, 95% reads
[Figure: Throughput (ops/s, 0 to 8M) versus Number of Servers (0 to 100). Series: RAMP-F, RAMP-S, RAMP-H, NWNR, LWNR, LWSR, LWLR, E-PCI. Callouts: No Concurrency Control, RAMP-Fast, RAMP-Small, RAMP-Hybrid.]

“accept friend request”: RAMP TRANSACTION
“update index entry”: RAMP TRANSACTION
ATOMIC VISIBILITY
[Illustration: interleaved read and write operations]

WHAT THE APPLICATION SAYS: my billing application is “correct”; my new social app “does the right thing”
WHAT THE DATABASE HEARS: write read write read write write read write write write read write read read read read read read

Database users express correctness criteria via database constraints
“usernames should be unique”
“account balances should remain positive”
“there should only be one administrator”

Typical database constraints and operations (SQL)
Constraint            Operation
Equality, Inequality  Any
Generate unique ID    Any
Specify unique ID     Insert
>                     Increment
>                     Decrement
<                     Decrement
<                     Increment
Foreign Key           Insert
Foreign Key           Delete
Secondary Indexing    Any
Materialized Views    Any
AUTO_INCREMENT        Insert

67 projects, 1.77M LoC, 1957 tables [SIGMOD 2015]:
adopt-a-hydrant, alchemy_cms, amahi, bostonrb, boxroom, brevidy, browsercms, bucketwise, calagator, canvas-lms, carter, chiliproject, citizenry, comas, comfortable-mexican-sofa, communityengine, copycopter-server, danbooru, diaspora, discourse, enki, fat_free_crm, fedena, forem, fulcrum, gitlab-ci, gitlabhq, govsgo, heaven, inkwell, insoshi, jobsworth, juvia, kandan, linuxfr.org, lobsters, lovd-by-less, nimbleshop, obtvse, onebody, opal, opencongress, opengovernment, openproject, piggybak, publify, radiant, railscollab, redmine, refinerycms, ror_ecommerce, rucksack, saasy, salor-retail, selfstarter, sharetribe, skyline, spot-us, spree, sprintapp, squaresquash, sugar, teambox, tracks, tryshoppe, wallgig, zena
Constraints: 9986 total; avg. 5.1 per table
Transactions: 259 total; avg. 0.13 per table
CONSTRAINTS 37x MORE COMMON

WHAT THE APPLICATION SAYS: “no duplicate users”
WHAT THE DATABASE HEARS: read write read read read read read read
TODAY: ENFORCEMENT VIA COORDINATION

CAN WE USE CONSTRAINTS TO AVOID COORDINATION?
WHAT THE APPLICATION SAYS: “no duplicate users” (constraint)
WHAT THE DATABASE HEARS: “no duplicate users” (constraint)

ICT: Invariant Confluence Test
Key idea: Check if constraints can be violated by “merging” independent operations

CONSTRAINT: User IDs are unique
OPERATION: Add users
MERGE: Set union
{} diverges into add {Stu,ID=1} and add {Ann,ID=1}; MERGE gives {{Stu,ID=1}, {Ann,ID=1}}
Constraint violated!

CONSTRAINT: User IDs are positive
OPERATION: Add users
MERGE: Set union
{} diverges into add {Stu,ID=1} and add {Ann,ID=1}; MERGE gives {{Stu,ID=1}, {Ann,ID=1}}
Constraint holds!

OUR CONTRIBUTION [VLDB 2015]
Theorem. A globally I-valid system can execute a set of transactions T with coordination-freedom, transactional availability, and convergence if and only if T are I-confluent with respect to I.
ICT ⟺ safe, coordination-free execution possible
Generalizes classic partitioning-based indistinguishability arguments
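
The two user-ID examples can be replayed with a small brute-force check. This is only a sketch under stated assumptions: states are modeled as sets of `(name, id)` tuples, merge is set union as on the slides, and the function names (`ids_unique`, `ids_positive`, `merge_keeps_invariant`) are invented for illustration. It checks one starting state and a given list of inserts, so a `True` result is evidence for those inputs rather than a proof of invariant confluence.

```python
from itertools import combinations


def ids_unique(users):
    """Constraint: user IDs are unique."""
    ids = [user_id for _, user_id in users]
    return len(ids) == len(set(ids))


def ids_positive(users):
    """Constraint: user IDs are positive."""
    return all(user_id > 0 for _, user_id in users)


def merge_keeps_invariant(invariant, start, inserts):
    """Apply each insert on its own branch, then merge branches pairwise."""
    branches = [start | {row} for row in inserts]
    # Only branches that are valid on their own would have been accepted locally.
    valid = [branch for branch in branches if invariant(branch)]
    return all(invariant(a | b) for a, b in combinations(valid, 2))


start = frozenset()
inserts = [("Stu", 1), ("Ann", 1)]

print(merge_keeps_invariant(ids_unique, start, inserts))    # uniqueness constraint
print(merge_keeps_invariant(ids_positive, start, inserts))  # positivity constraint
```

Each insert is valid in isolation under both constraints. The difference only appears after the merge, which is why uniqueness needs coordination and positivity does not.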

Typical database constraints and operations (SQL), under set merge [VLDB 2015]
Constraint            Operation  OK?
Equality, Inequality  Any        Y
Generate unique ID    Any        Y
Specify unique ID     Insert     N
>                     Increment  Y
>                     Decrement  N
<                     Decrement  Y
<                     Increment  N
Foreign Key           Insert     Y
Foreign Key           Delete     Y*
Secondary Indexing    Any        Y
Materialized Views    Any        Y
AUTO_INCREMENT        Insert     N
Callout: RAMP
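
The ">" rows of the table can be illustrated with a counter kept as a set of deltas, so that merge is still set union. The representation, the starting balance of 100, and the names `balance`, `stays_positive` and `merge_ok` are assumptions made for this sketch only.

```python
def balance(deltas):
    """Current value: the sum of all recorded (operation_id, amount) deltas."""
    return sum(amount for _, amount in deltas)


def stays_positive(deltas):
    """Constraint: balance > 0."""
    return balance(deltas) > 0


def merge_ok(start, op_a, op_b, invariant=stays_positive):
    """Run op_a and op_b on separate replicas, then merge by set union.

    Returns None if either operation would already be rejected locally.
    """
    left, right = start | {op_a}, start | {op_b}
    if not (invariant(left) and invariant(right)):
        return None
    return invariant(left | right)


start = frozenset({("init", 100)})

print(merge_ok(start, ("a", +60), ("b", +60)))  # two concurrent increments
print(merge_ok(start, ("a", -60), ("b", -60)))  # two concurrent decrements
```

Each decrement of 60 leaves a positive balance on its own replica, but the merged state holds both and drops below zero. Concurrent increments can only move the balance further from the bound, matching the Y and N entries for Increment and Decrement under ">".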

67 projects, 1.77M LoC, 1957 tables [SIGMOD 2015]
Constraints: 9986 total; avg. 5.1 per table
Transactions: 259 total; avg. 0.13 per table
86.9% PASS ICT

TPC-C
14/16 CONSTRAINTS PASS ICT
6-11x faster than ACID/serializability
[Figure: Throughput (txns/s; axis marks 40K, 100K, 600K) versus Number of Warehouses (8 to 64). Series: Coordination-Avoiding, Serializable (2PL).]
scale to over 25x best listed result
[Figure: Total Throughput (txn/s, 2M to 14M) and Throughput (txn/s/server, 0 to 80K) versus Number of Servers (0 to 200).]

CAN WE USE CONSTRAINTS TO AVOID COORDINATION?
WHAT THE APPLICATION SAYS: “no duplicate users” (constraint)
WHAT THE DATABASE HEARS: “no duplicate users” (constraint)

Key idea: Exploit statistical robustness in system designs

PLASMA: ASYNCHRONOUS LEARNING [Ongoing]
Break dataflow barriers using new iterator model
ML task: Express algorithms via async iterator (e.g., ADMM)
[Illustration: timelines contrasting Bulk Synch Parallel with Bulk Async Parallel]

VELOX: FAST ONLINE PREDICTIONS [CIDR 2015]
Prioritize model maintenance by robustness
ML task: Split models according to robustness
Fast incremental personalization
Batch retrain shared features

MY APPROACH:
Study practical database use cases
Derive principles and algorithms
Build systems to realize the benefits

PBS: Integrated into Cassandra 1.2 release + recent extensions at a major Internet company
RAMP: Proposed feature in Cassandra 3.0; (Reportedly) on roadmap for Facebook Apollo, IBM Cloudant
HAT Isolation: part of Kleppmann@LinkedIn’s Hermitage testing suite
Active dialogue with developer, NoSQL community via invited talks, blogging, social media

MY WORK: COORDINATION AVOIDANCE

Current Practice
PBS: VLDB12, SIGMOD13, VLDBJ14, CACM14
EC Today: CACM/Queue13
Consistency without Borders: SoCC13
Network Partitions: CACM/Queue14
Feral Concurrency Control: SIGMOD15

Principles
I-Confluence: VLDB15
HATs: HotOS13, VLDB14
Explicit Causality: SoCC12

Systems
Bolt-On: SIGMOD13
RAMP + Indexing: SIGMOD14
Velox: CIDR15
Plasma + BAP: Ongoing

FUTURE WORK

Automatically coordinated applications
  Bespoke analysis and coordination synthesis
  “Query optimization” for transaction execution

DB meets “Big Data” Learning
  View materialization and selection for model maintenance
  Bounded divergence control for coordinating learners

Next-Generation Data Applications
  Next 10-100x growth in data volume due to sensors, apps
  New interfaces for increased coordination costs, heterogeneity

WHAT THE DATABASE HEARS: write read write read write write read write write write read write read read read read read read

Eventual Consistency: COORDINATION FREE, NO SAFETY
PBS: VLDB12, VLDBJ14, SIGMOD13, CACM14

COORDINATION AVOIDANCE: GUARANTEED SAFETY WITHOUT COORDINATION (COORDINATION FREE)
Atomic Visibility: SIGMOD14
Database Constraints: VLDB15, SIGMOD15
Model Prediction and Training: CIDR15, TBA
Weak Isolation: HotOS13, VLDB14
Causality: SOCC12, SIGMOD13
MORE SEMANTICS, MORE SAFETY

Joint work with Ali Ghodsi, Joe Hellerstein, Ion Stoica, Mike Franklin, Michael Jordan, Alan Fekete, Dan Crankshaw, Shivaram Venkataraman, Neil Conway, Peter Alvaro, Aaron Davidson, Joey Gonzalez, Kyle Kingsbury, Haoyuan Li, and Zhao Zhang

Many illustrations by the Noun Project (CC-Attribution): surprised by Julian Derveaux; world by Wayne Tyler Sall; database by Austin Condiff; earth by Martin Vanco; Woman by Simon Child; Man by Simon Child; Doctor by Simon Child; David-Hockney by Simon Child; Server by Simon Child; clock by christoph robausch
