Slide 1

Slide 1 text

Highly Available Transactions: Virtues and Limitations. Peter Bailis, Aaron Davidson, Alan Fekete, Ali Ghodsi, Joe Hellerstein, Ion Stoica. UC Berkeley & University of Sydney. VLDB 2014, Hangzhou, China, 4 Sept. 2014

Slide 2

Slide 2 text

July 2000: CAP Theorem

Slide 3

Slide 3 text

High Availability

Slide 4

Slide 4 text

High Availability [Gilbert and Lynch, ACM SIGACT News 2002] System guarantees a response, even during network partitions (async network)
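To make the definition concrete, here is a minimal hypothetical sketch (ours, not Gilbert and Lynch's formalism, and not code from the talk): a highly available read answers from whichever replica the client can reach, even with a stale value, while a read that demands the latest value must reach a master and simply cannot respond while a partition cuts it off.

```python
# Hypothetical sketch of the availability distinction (not from the talk).
import random

class Replica:
    """Toy replica; `reachable` flips to False during a partition."""
    def __init__(self):
        self.store = {}
        self.reachable = True

    def read(self, key):
        if not self.reachable:
            raise ConnectionError("partitioned away from this replica")
        return self.store.get(key)

    def install(self, key, value):
        if not self.reachable:
            raise ConnectionError("partitioned away from this replica")
        self.store[key] = value

def ha_read(replicas, key):
    """Highly available: answer from any reachable replica, possibly
    with a stale value; partitioned replicas are simply skipped."""
    for replica in random.sample(replicas, len(replicas)):
        try:
            return replica.read(key)
        except ConnectionError:
            continue
    raise ConnectionError("client cannot reach any replica at all")

def linearizable_read(master, key):
    """'CP'-style: must contact the master; while a partition separates
    client and master, no correct response can be returned."""
    return master.read(key)  # raises ConnectionError during a partition
```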

Slide 5

Slide 5 text

High Availability [Gilbert and Lynch, ACM SIGACT News 2002] System guarantees a response, even during network partitions (async network)

Slide 6

Slide 6 text

High Availability [Gilbert and Lynch, ACM SIGACT News 2002] System guarantees a response, even during network partitions (async network)

Slide 7

Slide 7 text

High Availability [Gilbert and Lynch, ACM SIGACT News 2002] System guarantees a response, even during network partitions (async network)

Slide 8

Slide 8 text

High Availability [Gilbert and Lynch, ACM SIGACT News 2002] System guarantees a response, even during network partitions (async network)

Slide 9

Slide 9 text

High Availability [Gilbert and Lynch, ACM SIGACT News 2002] System guarantees a response, even during network partitions (async network)

Slide 10

Slide 10 text

High Availability [Gilbert and Lynch, ACM SIGACT News 2002] System guarantees a response, even during network partitions (async network)

Slide 11

Slide 11 text

High Availability [Gilbert and Lynch, ACM SIGACT News 2002] System guarantees a response, even during network partitions (async network)

Slide 12

Slide 12 text

network partitions

Slide 13

Slide 13 text

NETWORK PARTITIONS

Slide 14

Slide 14 text

“Network partitions should be rare but net gear continues to cause more issues than it should.” --James Hamilton, Amazon Web Services [perspectives.mvdirona.com, 2010] NETWORK PARTITIONS

Slide 15

Slide 15 text

MSFT LAN: avg. 40.8 failures/day (95th %ile: 136); 5 min median time to repair (up to 1 week) [SIGCOMM 2011]
UC WAN: avg. 16.2–302.0 failures/link/year; avg. downtime of 24–497 minutes/link/year [SIGCOMM 2011]
HP LAN: 67.1% of support tickets are due to network; median incident duration 114–188 min [HP Labs 2012]
“Network partitions should be rare but net gear continues to cause more issues than it should.” --James Hamilton, Amazon Web Services [perspectives.mvdirona.com, 2010]
NETWORK PARTITIONS

Slide 16

Slide 16 text

From “The Network Is Reliable,” by Peter Bailis and Kyle Kingsbury, CACM, September 2014 issue (DOI: 10.1145/2643130); article development led by queue.acm.org. An informal survey of real-world communications failures:
“THE NETWORK IS RELIABLE” tops Peter Deutsch’s classic list of “Eight fallacies of distributed computing,” all [of which] “prove to be false in the long run and all [of which] cause big trouble and painful learning experiences” (https://blogs.oracle.com/jag/resource/Fallacies.html). Accounting for and understanding the implications of network behavior is key to designing robust distributed programs; in fact, six of Deutsch’s “fallacies” directly pertain to limitations on networked communications. This should be unsurprising: the ability (and often requirement) to communicate over a shared channel [...] possibility and impossibility of performing distributed computations under particular sets of network conditions. For example, the celebrated FLP impossibility result [9] demonstrates the inability to guarantee consensus in an asynchronous network (that is, one facing indefinite communication partitions between processes) with one faulty process. This means that, in the presence of unreliable (untimely) message delivery, basic operations such as modifying the set of machines in a cluster (that is, maintaining group membership, as systems such as ZooKeeper are tasked with today) are not guaranteed to complete in the event of both network asynchrony and individual server failures. Related results describe the inability to guarantee the progress of serializable transactions [7], linearizable reads/writes [11], and a variety of useful, programmer-friendly guarantees under adverse conditions [3]. The implications of these results are not simply academic: these impossibility results have motivated a proliferation of systems and designs offering a range of alternative guarantees in the event of network failures [5]. However, under a friendlier, more reliable network that guarantees timely message delivery, FLP and many of these related results no longer hold [8]: by making stronger guarantees about network behavior, we can circumvent the programmability implications of these impossibility proofs. Therefore, the degree of reliability in deployment environments is critical in robust systems design and directly determines the kinds of operations that systems can reliably perform without waiting. Unfortunately, the degree to which networks are actually reliable in the real world is the subject of considerable and evolving debate. Some have claimed that networks are reliable (or that partitions are rare enough in practice) and that we are too concerned with designing for theoretical failure [...]

Slide 17

Slide 17 text

High Availability System guarantees a response, even during network partitions (async network) [Gilbert and Lynch, ACM SIGACT News 2002]

Slide 18

Slide 18 text

High Availability System guarantees a response, even during network partitions (async network) [Gilbert and Lynch, ACM SIGACT News 2002] [“PACELC,” Abadi, IEEE Computer 2012] Corollary: low latency, especially over WAN

Slide 19

Slide 19 text

low latency

Slide 20

Slide 20 text

Average latency from 1 week on EC2 (http://www.bailis.org/blog/communication-costs-in-real-world-networks/):
LAN: 0.5 ms (1x)
Co-located WAN: 1–3.5 ms (2–7x)
WAN: 22–360 ms (44–720x)
LOW LATENCY

Slide 21

Slide 21 text

Average latency from 1 week on EC2 (http://www.bailis.org/blog/communication-costs-in-real-world-networks/):
LAN: 0.5 ms (1x)
Co-located WAN: 1–3.5 ms (2–7x)
WAN: 22–360 ms (44–720x)
LOW LATENCY

Slide 22

Slide 22 text

Average latency from 1 week on EC2 (http://www.bailis.org/blog/communication-costs-in-real-world-networks/):
LAN: 0.5 ms (1x)
Co-located WAN: 1–3.5 ms (2–7x)
WAN: 22–360 ms (44–720x)
LOW LATENCY

Slide 23

Slide 23 text

Average latency from 1 week on EC2 (http://www.bailis.org/blog/communication-costs-in-real-world-networks/):
LAN: 0.5 ms (1x)
Co-located WAN: 1–3.5 ms (2–7x)
WAN: 22–360 ms (44–720x)
LOW LATENCY

Slide 24

Slide 24 text

Average latency from 1 week on EC2 (http://www.bailis.org/blog/communication-costs-in-real-world-networks/):
LAN: 0.5 ms (1x)
Co-located WAN: 1–3.5 ms (2–7x)
WAN: 22–360 ms (44–720x)
LOW LATENCY

Slide 25

Slide 25 text

THOSE LIGHT CONES_

Slide 26

Slide 26 text

July 2000: CAP Theorem

Slide 27

Slide 27 text

“AP” is fundamentally about

Slide 28

Slide 28 text

“AP” is fundamentally about avoiding coordination

Slide 29

Slide 29 text

“AP” is fundamentally about avoiding coordination: Availability

Slide 30

Slide 30 text

“AP” is fundamentally about avoiding coordination: Availability, Low Latency

Slide 31

Slide 31 text

“AP” is fundamentally about avoiding coordination: Availability, Low Latency

Slide 32

Slide 32 text

“AP” is fundamentally about avoiding coordination: Availability, Low Latency, High Throughput

Slide 33

Slide 33 text

“AP” is fundamentally about avoiding coordination: Availability, Low Latency, High Throughput, Aggressive Scale-out

Slide 34

Slide 34 text

“AP” is fundamentally about avoiding coordination: Availability, Low Latency, High Throughput, Aggressive Scale-out. cf. “Coordination Avoidance in Database Systems,” to appear in VLDB 2015

Slide 35

Slide 35 text

No content

Slide 36

Slide 36 text

CONSISTENCY vs COORDINATION

Slide 37

Slide 37 text

CONSISTENCY vs AVAILABILITY

Slide 38

Slide 38 text

CONSISTENCY vs AVAILABILITY: Linearizability (“Atomic,” the C in CAP)

Slide 39

Slide 39 text

CONSISTENCY vs AVAILABILITY: Eventual vs. Linearizability (“Atomic,” the C in CAP)

Slide 40

Slide 40 text

CONSISTENCY vs AVAILABILITY: Eventual vs. Linearizability (“Atomic,” the C in CAP)

Slide 41

Slide 41 text

NoSQL

Slide 42

Slide 42 text

NoSQL: Strong consistency is expensive; avoid whenever possible!

Slide 43

Slide 43 text

NoSQL: Common (mis)conception: “Strong consistency is expensive; avoid whenever possible!” “CAP implies transactions are unavailable.”

Slide 44

Slide 44 text

No content

Slide 45

Slide 45 text

CAP is about linearizability. CAP doesn’t mention transactions.

Slide 46

Slide 46 text

Was the NoSQL movement right?

Slide 47

Slide 47 text

Was the NoSQL movement right? Are all transactions unavailable?

Slide 48

Slide 48 text

Is serializability achievable with HA?

Slide 49

Slide 49 text

Is serializability achievable with HA?

Slide 50

Slide 50 text

Is serializability achievable with HA?

Slide 51

Slide 51 text

No content

Slide 52

Slide 52 text

No content

Slide 53

Slide 53 text

Serializability is expensive

Slide 54

Slide 54 text

Serializability is expensive! Use weaker models instead.

Slide 55

Slide 55 text

HANA

Slide 56

Slide 56 text

do not support serializability. HANA

Slide 57

Slide 57 text

do not support serializability. HANA
Serializability supported?
Actian Ingres: YES
Aerospike: NO
Persistit: NO
Clustrix: NO
Greenplum: YES
IBM DB2: YES
IBM Informix: YES
MySQL: YES
MemSQL: NO
MS SQL Server: YES
NuoDB: NO
Oracle 11G: NO
Oracle BDB: YES
Oracle BDB JE: YES
Postgres 9.2.2: YES
SAP Hana: NO
ScaleDB: NO
VoltDB: YES
8/18 databases surveyed did not; 15/18 used weak models by default

Slide 58

Slide 58 text

serializability

Slide 59

Slide 59 text

serializability

Slide 60

Slide 60 text

[Diagram: hierarchy of isolation models: serializability, snapshot isolation, read committed, repeatable read, cursor stability, read uncommitted, monotonic view, update serializability]

Slide 61

Slide 61 text

[Diagram: hierarchy of isolation models: serializability, snapshot isolation, read committed, repeatable read, cursor stability, read uncommitted, monotonic view, update serializability]

Slide 62

Slide 62 text

[Diagram: hierarchy of isolation models: serializability, snapshot isolation, read committed, repeatable read, cursor stability, read uncommitted, monotonic view, update serializability]

Slide 63

Slide 63 text

[Diagram: hierarchy of isolation models: serializability, snapshot isolation, read committed, repeatable read, cursor stability, read uncommitted, monotonic view, update serializability, each annotated “HA?”]

Slide 64

Slide 64 text

[Diagram: hierarchy of isolation models: serializability, snapshot isolation, read committed, repeatable read, cursor stability, read uncommitted, monotonic view, update serializability, each annotated “HA?”] Highly Available Transactions

Slide 65

Slide 65 text

[Diagram: hierarchy of isolation models: serializability, snapshot isolation, read committed, repeatable read, cursor stability, read uncommitted, monotonic view, update serializability, each annotated “HA?”] HATs

Slide 66

Slide 66 text

[Atul Adya, Ph.D. Thesis, MIT 1999]

Slide 67

Slide 67 text

No content

Slide 68

Slide 68 text

Challenge: traditional implementations are unavailable

Slide 69

Slide 69 text

Challenge: traditional implementations are unavailable

Slide 70

Slide 70 text

[Diagram: guarantees classified as Unavailable, Sticky Available, or Highly Available. Legend: † prevents lost update; ‡ prevents write skew; ⊕ requires recency guarantees]

Slide 71

Slide 71 text

No content

Slide 72

Slide 72 text

Existing Database Isolation

Slide 73

Slide 73 text

Existing Database Isolation, Distributed Registers

Slide 74

Slide 74 text

Existing Database Isolation, Session Guarantees, Distributed Registers

Slide 75

Slide 75 text

[Diagram: guarantees classified as Unavailable, Sticky Available, or Highly Available. Legend: † prevents lost update; ‡ prevents write skew; ⊕ requires recency guarantees]

Slide 76

Slide 76 text

[Diagram: guarantees classified as Unavailable, Sticky Available, or Highly Available. Legend: † prevents lost update; ‡ prevents write skew; ⊕ requires recency guarantees]

Slide 77

Slide 77 text

[Diagram: guarantees classified as Unavailable, Sticky Available, or Highly Available. Legend: † prevents lost update; ‡ prevents write skew; ⊕ requires recency guarantees]

Slide 78

Slide 78 text

No content

Slide 79

Slide 79 text

Read Committed (RC)

Slide 80

Slide 80 text

Read Committed (RC) Replicas never serve dirty or non-final writes

Slide 81

Slide 81 text

Read Committed (RC) Transactions buffer writes until commit time Replicas never serve dirty or non-final writes

Slide 82

Slide 82 text

Read Committed (RC) Transactions buffer writes until commit time Replicas never serve dirty or non-final writes
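To make the buffering rule concrete, here is a minimal hypothetical sketch (ours, not the talk's or the paper's implementation) of a client-side Read Committed transaction; it reuses the toy `Replica` class from the earlier sketch. Because writes never leave the client before commit, no replica can ever serve a dirty or non-final value, and no coordination is needed.

```python
class RCTransaction:
    """Hypothetical sketch of HA Read Committed: writes stay in a
    client-side buffer until commit, so replicas never hold dirty
    (uncommitted) or non-final (overwritten-in-txn) values."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.buffer = {}  # key -> value; later writes replace earlier ones

    def write(self, key, value):
        self.buffer[key] = value  # nothing leaves the client yet

    def read(self, key):
        if key in self.buffer:           # see our own buffered writes
            return self.buffer[key]
        for replica in self.replicas:    # any replica: committed data only
            try:
                return replica.read(key)
            except ConnectionError:
                continue
        raise ConnectionError("no replica reachable")

    def commit(self):
        # Install only the final value of each key; unreachable replicas
        # converge later (e.g., via anti-entropy), preserving availability.
        for replica in self.replicas:
            try:
                for key, value in self.buffer.items():
                    replica.install(key, value)
            except ConnectionError:
                continue
```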

Slide 83

Slide 83 text

Read Committed (RC) Transactions buffer writes until commit time Replicas never serve dirty or non-final writes ANSI Repeatable Read (RR)

Slide 84

Slide 84 text

Read Committed (RC) Transactions buffer writes until commit time Replicas never serve dirty or non-final writes ANSI Repeatable Read (RR) Transactions read from a snapshot of DB

Slide 85

Slide 85 text

Read Committed (RC) Transactions buffer writes until commit time Replicas never serve dirty or non-final writes ANSI Repeatable Read (RR) Transactions buffer reads from replicas Transactions read from a snapshot of DB

Slide 86

Slide 86 text

Read Committed (RC) Transactions buffer writes until commit time Replicas never serve dirty or non-final writes ANSI Repeatable Read (RR) Transactions buffer reads from replicas Transactions read from a snapshot of DB
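The read-buffering rule admits an equally small hypothetical sketch, extending the `RCTransaction` sketch above: pin the first value read for each item, and every later read of that item returns the same value.

```python
class RRTransaction(RCTransaction):
    """Hypothetical sketch of HA ANSI Repeatable Read: the first value
    read for each item is pinned client-side, so repeated reads within
    the transaction see a stable, item-level snapshot."""

    def __init__(self, replicas):
        super().__init__(replicas)
        self.snapshot = {}  # key -> value fixed at the first read

    def read(self, key):
        if key in self.buffer:        # our own writes win, as in RC
            return self.buffer[key]
        if key not in self.snapshot:  # first read: fetch once and pin
            self.snapshot[key] = super().read(key)
        return self.snapshot[key]
```

Note the pinning here is per item; extending the same idea to predicate-based reads would require caching query results as well.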

Slide 87

Slide 87 text

Read Committed (RC) Transactions buffer writes until commit time Replicas never serve dirty or non-final writes ANSI Repeatable Read (RR) Transactions buffer reads from replicas Transactions read from a snapshot of DB Unavailable implementations ⇏ unavailable semantics

Slide 88

Slide 88 text

Unavailable Sticky Available Highly Available Legend prevents lost update†, prevents write skew‡, requires recency guarantees⊕ Sticky Available Unavailable Highly Available

Slide 89

Slide 89 text

+ ANSI Repeatable Read: snapshot reads of database state (database does not change), including predicate-based reads
Read Atomic Isolation (+TA): observe all or none of another txn’s updates
+ Causal Consistency: read your writes; time doesn’t go backwards; writes follow reads
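The “all or none” condition of Read Atomic can be stated as a check over a transaction's read set. Below is a simplified hypothetical sketch in the spirit of the RAMP work mentioned at the end of the talk; `Version`, `txn_id`, and `siblings` are our illustrative names, and we assume per-key versions are ordered by transaction id.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Version:
    value: object
    txn_id: int           # id of the transaction that wrote this version
    siblings: frozenset   # every key written by that transaction

def is_read_atomic(reads):
    """Hypothetical 'all or none' check: if we saw txn T's write to some
    key, we must not have seen an older version of any other key T also
    wrote -- otherwise the read set is fractured."""
    seen = {key: version.txn_id for key, version in reads.items()}
    for version in reads.values():
        for sibling in version.siblings:
            if sibling in seen and seen[sibling] < version.txn_id:
                return False
    return True

# Example: txn 2 wrote both x and y, but we read txn 2's x alongside an
# older y from txn 1 -- a fractured read, so not read atomic.
reads = {
    "x": Version(1, txn_id=2, siblings=frozenset({"x", "y"})),
    "y": Version(0, txn_id=1, siblings=frozenset({"y"})),
}
assert not is_read_atomic(reads)
```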

Slide 90

Slide 90 text

Experimental Validation: Thrift-based sharded key-value store with LevelDB for persistence. Focus on “CP” vs. HAT overheads. [Diagram: cluster A, cluster B] https://github.com/pbailis/hat-vldb2014-code

Slide 91

Slide 91 text

2 clusters in us-east; 5 servers/cluster; transactions of length 8; 50% reads, 50% writes
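The workload described here is easy to picture as a generator. This is a hypothetical reconstruction (parameter names are ours), not code from the hat-vldb2014-code repository.

```python
import random

def make_txn(keys, length=8, read_fraction=0.5):
    """One benchmark-style transaction: `length` operations over random
    keys, each a read with probability `read_fraction`, otherwise a
    write of a random value."""
    ops = []
    for _ in range(length):
        key = random.choice(keys)
        if random.random() < read_fraction:
            ops.append(("read", key))
        else:
            ops.append(("write", key, random.random()))
    return ops

# e.g., an 8-operation, 50/50 read-write transaction over 1000 keys:
print(make_txn([f"k{i}" for i in range(1000)]))
```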

Slide 92

Slide 92 text

2 clusters in us-east; 5 servers/cluster; transactions of length 8; 50% reads, 50% writes. [Plot: avg. latency (ms) for Eventual, RC, TA, Master]

Slide 93

Slide 93 text

2 clusters in us-east; 5 servers/cluster; transactions of length 8; 50% reads, 50% writes. [Plot: avg. latency (ms) for Eventual, RC, TA, Master]

Slide 94

Slide 94 text

2 clusters in us-east; 5 servers/cluster; transactions of length 8; 50% reads, 50% writes. [Plot: avg. latency (ms) for Eventual, RC, TA, Master] Mastered: 2x latency of HATs

Slide 95

Slide 95 text

2 clusters in us-east; 5 servers/cluster; transactions of length 8; 50% reads, 50% writes. [Plot: avg. latency (ms) for Eventual, RC, TA, Master] Mastered: 2x latency of HATs. 128K ops/s

Slide 96

Slide 96 text

Clusters in us-east, us-west; 5 servers/DC; transactions of length 8; 50% reads, 50% writes

Slide 97

Slide 97 text

Clusters in us-east, us-west; 5 servers/DC; transactions of length 8; 50% reads, 50% writes. [Plot: avg. latency (ms) for Eventual, RC, TA, Master]

Slide 98

Slide 98 text

Clusters in us-east, us-west; 5 servers/DC; transactions of length 8; 50% reads, 50% writes. [Plot: avg. latency (ms) for Eventual, RC, TA, Master] 300ms

Slide 99

Slide 99 text

Clusters in us-east, us-west; 5 servers/DC; transactions of length 8; 50% reads, 50% writes. [Plot: avg. latency (ms) for Eventual, RC, TA, Master] 300ms. Mastered: 2-70x latency of HATs

Slide 100

Slide 100 text

CA, VA, OR, Ireland, Singapore; 5 servers/DC; transactions of length 8; 50% reads, 50% writes

Slide 101

Slide 101 text

CA, VA, OR, Ireland, Singapore; 5 servers/DC; transactions of length 8; 50% reads, 50% writes. [Plot: avg. latency (ms) for Eventual, RC, TA, Master]

Slide 102

Slide 102 text

CA, VA, OR, Ireland, Singapore; 5 servers/DC; transactions of length 8; 50% reads, 50% writes. [Plot: avg. latency (ms) for Eventual, RC, TA, Master] 800ms

Slide 103

Slide 103 text

CA, VA, OR, Ireland, Singapore; 5 servers/DC; transactions of length 8; 50% reads, 50% writes. [Plot: avg. latency (ms) for Eventual, RC, TA, Master] 800ms. Mastered: 8-186x latency of HATs

Slide 104

Slide 104 text

Also in paper:
In-depth discussion of isolation guarantees
Extending “AP” to transactional context
Sticky availability and sessions
Discussion of atomicity and durability
More evaluation

Slide 105

Slide 105 text

This paper: all about coordination + isolation levels (some surprising results!)
How else can databases benefit? How do we address whole programs?
Our experience: isolation levels are unintuitive!

Slide 106

Slide 106 text

RAMP Transactions: new isolation model and coordination-free implementation of indexing, matviews, multi-put [SIGMOD 2014]
I-confluence: which integrity constraints are enforceable without coordination? OLTPBench suite plus general theory [VLDB 2015]
Real-world applications: analysis of open-source applications for coordination requirements; similar results [In preparation]
Distributed optimization: numerical convex programs have close analogues to transaction-processing techniques [In preparation]

Slide 107

Slide 107 text

PUNCHLINE: Coordination is avoidable surprisingly often.
Need to understand use cases + semantics.
The use cases are staggering in number and often in plain sight. Hint: look to applications, big systems in the wild.
We have a huge opportunity to improve theory and practice by understanding what’s possible.