Slide 1

Don't Give Up on Serializability Just Yet
Splitting and Replicating Data for Fast Transactions
Neha Narula, MIT CSAIL
CRAFT Budapest, April 2015

Slide 2

@neha
•  PhD candidate at MIT
•  Formerly at Google
•  Research in fast transactions for multi-core databases and distributed systems


Slide 4

Distributed Database
Partition data across multiple servers for higher performance.

Slide 5

Problem
Applications experience write contention on popular data.

Slide 6

Serial Execution on the Same Record in a Distributed Database
[Diagram: servers 0, 1, and 2 each issue INCR(x,1); the increments execute one at a time along a time axis]
Increments on the same record execute one at a time and require coordination:
1) Network calls
2) Waiting for locks

Slide 7

Replicated Counter
Store local counters on each server: v0, v1, v2
counter value = v0 + v1 + v2
[Diagram: each server applies +1 to its own local counter]
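
To make this concrete, here is a minimal Go sketch of a replicated counter (Go is the language PhaseDB itself is written in, per the implementation slide). The type and method names are illustrative, not PhaseDB's actual code.

package main

import (
	"fmt"
	"sync/atomic"
)

// ReplicatedCounter models one local counter per server.
// Increments touch only the caller's slot; a read must visit every slot.
type ReplicatedCounter struct {
	counts []int64
}

func NewReplicatedCounter(nServers int) *ReplicatedCounter {
	return &ReplicatedCounter{counts: make([]int64, nServers)}
}

// Incr runs on server srv and needs no cross-server coordination.
func (c *ReplicatedCounter) Incr(srv int, n int64) {
	atomic.AddInt64(&c.counts[srv], n)
}

// Value reconciles all local copies: counter value = v0 + v1 + v2 + ...
func (c *ReplicatedCounter) Value() int64 {
	var total int64
	for i := range c.counts {
		total += atomic.LoadInt64(&c.counts[i])
	}
	return total
}

func main() {
	c := NewReplicatedCounter(3)
	c.Incr(0, 1) // +1 on server 0
	c.Incr(1, 1) // +1 on server 1
	c.Incr(2, 1) // +1 on server 2
	fmt.Println(c.Value()) // 3 = v0 + v1 + v2
}

Note that the fast path (Incr) never synchronizes across servers; all coordination is pushed into the read, which is exactly the trade-off the next slides explore.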

Slide 8

Increments on the Same Record Can Execute in Parallel
[Diagram: servers 0, 1, and 2 concurrently run INCR(x0,1), INCR(x1,1), and INCR(x2,1) on local copies of x]
•  Increments on the same record can proceed in parallel on local counters
•  No network calls, no shared locks
Plan: use replicated counters for x

Slide 9

Challenge
What about more complex functions?

Slide 10

Retweeting does several things at once:
•  increment retweet count
•  insert to my timeline
•  insert to my followers' timelines
•  insert to list of retweeters

Slide 11

Retweet

Retweet(tweet, user) {
  x := GET(tweet)
  if !x {
    return // tweet deleted
  }
  INCR(rt_count:tweet, 1)          // increment
  INSERT(rt_list:tweet, user)      // inserts
  INSERT(timeline:user, x)
  followers := GET(following:user)
  for f := range followers {
    INSERT(timeline:f, x)
  }
}

Slide 12

Many Operations in Retweet Can Execute in Parallel
count0: 2    rts0: {alice, bob}
count1: 1    rts1: {eve}
count2: 0    rts2: { }
retweet count = count0 + count1 + count2
rt list = rts0 ∪ rts1 ∪ rts2

Slide 13

Retweet(tweet, user) {
  x := GET(tweet)
  if !x {
    return // tweet deleted
  }
  INCR(rt_count:tweet, 1)
  INSERT(rt_list:tweet, user)
  INSERT(timeline:user, x)
  followers := GET(follow:user)
  for f := range followers {
    INSERT(timeline:f, x)
  }
}

DeleteTweet(tweet, user) {
  x := GET(tweet)
  rts := GET(rt_list:tweet)
  DELETE(rt_list:tweet)
  followers := GET(follow:user)
  DELETE(rt_count:tweet)
  for u := range rts {
    REMOVE(timeline:u, x)
  }
  for f := range followers {
    REMOVE(timeline:f, x)
  }
  DELETE(tweet)
}

Slide 14

One bad interleaving of Retweet and DeleteTweet:

Retweet starts:
  x := GET(tweet)
  if !x {
    return // tweet still exists, so continue
  }

DeleteTweet runs to completion:
  x := GET(tweet)
  rts := GET(rt_list:tweet)
  DELETE(rt_list:tweet)
  followers := GET(follow:user)
  DELETE(rt_count:tweet)
  for u := range rts {
    REMOVE(timeline:u, x)
  }
  for f := range followers {
    REMOVE(timeline:f, x)
  }
  DELETE(tweet)

Retweet continues, inserting the now-deleted tweet:
  INCR(rt_count:tweet, 1)
  INSERT(rt_list:tweet, user)
  INSERT(timeline:user, x)
  followers := GET(follow:user)
  for f := range followers {
    INSERT(timeline:f, x)
  }

Result: Deleted tweets left around in timelines!

Slide 15

Problem
•  Difficult to reason about concurrent interleavings
•  Might result in incorrect, unrecoverable state

Slide 16

Talk
•  Serializability (ACID Transactions)
•  PhaseDB
•  Experimental Results

Slide 17

ACID Transactions
Atomic: whole thing happens or not
Consistent: application-defined correctness
Isolated: other transactions do not interfere
Durable: can recover correctly from a crash

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
BEGIN TRANSACTION
...
COMMIT

Slide 18

mysql> BEGIN TRANSACTION
       RETWEET(...)
       COMMIT

mysql> BEGIN TRANSACTION
       DELETE_TWEET(...)
       COMMIT

Slide 19

RetweetTxn(tweet, user) {
  x := GET(tweet)
  if !x {
    return // tweet deleted
  }
  INCR(rt_count:tweet, 1)
  INSERT(rt_list:tweet, user)
  INSERT(timeline:user, x)
  followers := GET(follow:user)
  for f := range followers {
    INSERT(timeline:f, x)
  }
}

DeleteTweetTxn(tweet, user) {
  x := GET(tweet)
  rts := GET(rt_list:tweet)
  DELETE(rt_list:tweet)
  followers := GET(follow:user)
  DELETE(rt_count:tweet)
  for u := range rts {
    REMOVE(timeline:u, x)
  }
  for f := range followers {
    REMOVE(timeline:f, x)
  }
  DELETE(tweet)
}

Slide 20

[Diagram: RETWEET and DELETE TWEET are submitted concurrently and their execution may be interleaved or parallel, but the transactions appear atomic: the outcome is as if RETWEET ran entirely before DELETE TWEET]

Slide 21

[Diagram: the same two transactions, this time appearing as if DELETE TWEET ran entirely before RETWEET; RETWEET then observes the deletion and returns early:]

x := GET(tweet)
if !x {
  return // tweet deleted
}

Slide 22

What is Serializability?
The result of executing a set of transactions is the same as if those transactions had executed one at a time, in some serial order.
If each transaction preserves correctness, the DB will be in a correct state.
We can pretend there's no concurrency!

Slide 23

Benefits of Serializability
•  Do not have to reason about interleavings
•  Do not have to express invariants separately from the code!

Slide 24

Serializability Costs
•  On a multi-core database: serialization and cache line transfers
•  On a distributed database: serialization and network calls
Concurrency control: locking and coordination

Slide 25

Talk
•  Serializability (ACID Transactions)
•  PhaseDB
•  Experimental Results

Slide 26

Key Insight
•  Many records are mostly accessed one way:
   – Reads
   – Updating aggregates
   – Index inserts
•  Plan to perform these operations without coordination
•  Coordinate for incompatible operations

Slide 27

Replicate for Reads
Plan: store local copies of records; mark the record as read-only
[Diagram: GET(x) becomes GET(x0), GET(x1), or GET(x2) against each server's local copy]

Slide 28

When Replicated, Writes Are Slower
Writers have to lock all copies
[Diagram: PUT(x) must acquire locks on every local copy x0, x1, x2]

Slide 29

PhaseDB
A (research) distributed, transactional database
•  Choose an execution plan based on operations on popular records
•  Split a record into local copies on every server
•  Coordination-free execution in the common case
•  Coordinate in the uncommon case to maintain serializability

Slide 30

Operation Model
Developers write transactions as stored procedures, which are composed of operations on keys and values:

Traditional key/value operations:
  value GET(k)
  void PUT(k,v)

Operations on numeric values which modify the existing value:
  void INCR(k,n)
  void MAX(k,n)
  void MULT(k,n)

Ordered PUT, insert to an ordered list, user-defined functions:
  void OPUT(k,v,o)
  void TOPK_INSERT(k,v,o)
  void UDF(k,v,a)
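
As a rough sketch, this operation model could be expressed as a Go interface like the one below. The signatures are assumptions made for illustration; PhaseDB's actual API may differ.

package phasedb

// Value is an opaque stored value (an assumption for this sketch).
type Value []byte

// Txn is a hypothetical interface mirroring the operation model above.
type Txn interface {
	// Traditional key/value operations.
	Get(k string) (Value, error)
	Put(k string, v Value) error

	// Operations on numeric values which modify the existing value.
	Incr(k string, n int64) error
	Max(k string, n int64) error
	Mult(k string, n int64) error

	// Ordered PUT, insert to an ordered list, and user-defined functions.
	OPut(k string, v Value, order int64) error
	TopKInsert(k string, v Value, order int64) error
	UDF(k string, v Value, args []Value) error
}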

Slide 31

Execution Plans in PhaseDB
•  Replicate for reads (GET)
•  Replicate for commutative operations (increments, inserts)
•  Track last write
•  Log operations; apply later

Slide 32

Operation Model
The same operations, annotated with PhaseDB's execution plans: GET replicates for reads; INCR, MAX, MULT, and inserts replicate for commutative operations; writes save the last write; user-defined functions are logged and applied later.

Traditional key/value operations:
  value GET(k)
  void PUT(k,v)

Operations on numeric values which modify the existing value:
  void INCR(k,n)
  void MAX(k,n)
  void MULT(k,n)

Ordered PUT, insert to an ordered list, user-defined functions:
  void OPUT(k,v,o)
  void TOPK_INSERT(k,v,o)
  void UDF(k,v,a)

Slide 33

What Kinds of Applications Benefit?
•  Twitter
•  TPC-C
•  Auction website RUBiS

BidTxn(bidder, amount, item) {
  INCR(item.num_bids, 1)
  MAX(item.max_bid, amount)
  OPUT(item.max_bidder, bidder, amount)
  PUT(NewBidKey(), Bid{bidder, amount, item})
}

Slide 34

Challenge #1
Correctly executing transactions with incompatible operations

Slide 35

Doppel, OSDI 2014

Slide 36

Challenge #2
Popular data changes over time and is unpredictable


Slide 38

PhaseDB Uses Dynamic Execution Plans
•  Start out with no plans and no split data
•  Sample remote operations on records and lock wait times
•  Initiate a plan based on the most common operation
•  Stop the plan if the common operation changes
PhaseDB handles dynamic, changing workloads.

Slide 39

Sample Transactions During Execution
[Diagram: servers 0-3 issue GET(x) along with PUT(y,2) and PUT(z,1); each remote GET(x) bumps a counter (+1) at x's home server, which eventually decides to split x for reads]
•  Suppose x is on server 0 (its home server)
•  The home server watches remote accesses
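
A minimal sketch, assuming invented counters and thresholds, of the decision the home server might make from its samples; PhaseDB's real policy also weighs lock wait times.

package phasedb

// Plan is an execution plan for a contended record.
type Plan int

const (
	NoPlan Plan = iota
	ReplicateForReads
	ReplicateForCommutative
)

// recordStats holds sampled remote accesses for one record.
type recordStats struct {
	remoteGets  int // remote GETs seen at the home server
	remoteIncrs int // remote commutative operations seen
}

// choosePlan splits a record once remote traffic is dominated by one kind
// of operation, and reverts to NoPlan when no operation dominates.
// The threshold and the 2x dominance factor are illustrative assumptions.
func choosePlan(s recordStats) Plan {
	const threshold = 100
	switch {
	case s.remoteGets > threshold && s.remoteGets > 2*s.remoteIncrs:
		return ReplicateForReads
	case s.remoteIncrs > threshold && s.remoteIncrs > 2*s.remoteGets:
		return ReplicateForCommutative
	default:
		return NoPlan
	}
}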

Slide 40

Summary of PhaseDB
•  Changes the data layout and allowed operations to align with common operations
•  Executes common operations in parallel when a record is split
•  Samples to automatically determine a good plan for contended records and adjust to changing workloads

Slide 41

Talk
•  Serializability (ACID Transactions)
•  PhaseDB
•  Experimental Results

Slide 42

Implementation
•  PhaseDB implemented as a multithreaded Go server
•  Transactions are procedures written in Go
•  Experiments on 8 servers using two cores each
•  All data fits in memory

Slide 43

All PhaseDB Plans Have Higher Throughput Than 2PL
[Bar chart: throughput (txns/sec) of PhaseDB vs. 2PL under the read, commute, overwrite, and log plans; PhaseDB is higher in every case, from reduced network calls and increments executing in parallel]
Workload: distributed transactions with GET, INCR, PUT, and UDF operations on a popular record

Slide 44

TPC-C
•  Warehouses, items for sale, stock, orders
•  NewOrderTxn: issue an order for 5-15 items
•  10% of the time requires a distributed transaction to retrieve stock from a remote warehouse

Slide 45

TPC-C NewOrderTxn

order_id := GET(req.DistrictNextOIDKey)
order_id = order_id + 1
PUT(req.DistrictNextOIDKey, order_id)
for i := range req.Items {
  item := GET(req.Items[i])                           // mostly read
  Restock(req.StockKey[i], req.Amount[i])             // user-defined function
  INCR(req.StockYTDKey[i], req.Amount[i]*item.Price)  // commutative operation
  if req.Swid[i] != wid {
    INCR(req.StockRemoteKey[i], 1)                    // commutative operation
  }
  // Construct orderline to insert
  PUT(order_line_id, order_line)
}
. . .

Slide 46

TPC-C Performance Improves Over Time By Replicating Data
[Line graph: throughput (txns/sec) vs. time (seconds) for PhaseDB and 2PL; PhaseDB's throughput climbs over the run while 2PL stays flat]
PhaseDB detects contended operations and splits records

Slide 47

TPC-C NewOrderTxn, annotated with the plans PhaseDB chooses:

order_id := GET(req.DistrictNextOIDKey)
order_id = order_id + 1
PUT(req.DistrictNextOIDKey, order_id)
for i := range req.Items {
  item := GET(req.Items[i])                           // replicated for reads
  Restock(req.StockKey[i], req.Amount[i])             // logged operations
  INCR(req.StockYTDKey[i], req.Amount[i]*item.Price)  // replicated for ADD
  if req.Swid[i] != wid {
    INCR(req.StockRemoteKey[i], 1)                    // replicated for ADD
  }
  // Construct orderline to insert
  PUT(order_line_id, order_line)
}
. . .

Slide 48

Improving Serializability Performance

Technique: Systems
Single-partition transactions: Megastore
Transaction chopping: Lynx, ROCOCO
Commutative locking: Escrow transactions, abstract data types, Doppel, PhaseDB
Deterministic ordering: Granola, Calvin

Commutativity in Distributed Systems: Red/Blue Consistency, Commutative Replicated Data Types, counting sets (Walter)

Slide 49

Conclusion
•  If it performs well enough, use SERIALIZABLE
•  Workloads are regular; we can optimize for the common case
•  Still many opportunities to improve performance while retaining easy-to-understand semantics

http://nehanaru.la   @neha

Slide 50

Replicate for Commutative Operations
Update local copies of records
[Diagram: INCR(x,1) becomes INCR(x0,1), INCR(x1,1), or INCR(x2,1) against each server's local copy]

Slide 51

When Replicated for Increment, Reads Are Slower
Readers have to lock all copies
[Diagram: GET(x) must lock and combine every local copy x0, x1, x2]

Slide 52

Batching Amortizes the Cost of Reconciliation
[Diagram: servers 0-3 increment local copies of x (alongside INCR(y,2) and INCR(z,1)); arriving GET(x) operations are stashed during the split phase, then answered together after one reconciliation cycle]
•  Wait to accumulate stashed transactions; batch them for one cycle
•  Amortize the cost of reconciliation over many transactions
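
A sketch of the stashing idea, with hypothetical types invented for illustration: reads that arrive while x is split are queued, and a single reconciliation cycle answers all of them.

package phasedb

// stashedRead is a read that arrived while its key was split.
type stashedRead struct {
	key   string
	reply chan int64
}

// homeServer owns the authoritative value of its records.
type homeServer struct {
	values  map[string]int64
	split   map[string]bool // keys currently split across servers
	stashed []stashedRead
}

// Get answers immediately for non-split keys; otherwise it stashes the
// read until the next reconciliation cycle.
func (h *homeServer) Get(key string) chan int64 {
	reply := make(chan int64, 1)
	if h.split[key] {
		h.stashed = append(h.stashed, stashedRead{key, reply})
		return reply // answered later, in endCycle
	}
	reply <- h.values[key]
	return reply
}

// endCycle runs after all local copies have been merged back into values;
// one cycle's cost is amortized over every stashed read.
func (h *homeServer) endCycle() {
	for _, r := range h.stashed {
		r.reply <- h.values[r.key]
	}
	h.stashed = nil
}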

Slide 53

Split Execution
[Diagram: before the split, all INCR(x,1) operations target x's home server; after, each server i runs INCR(xi,1) on its local copy, while PUT(y,2) and PUT(z,1) proceed as usual]
•  When a record (x) is split, operations on it are transformed into operations on local copies (x0, x1, x2, x3)
•  The home server sends copies to the other servers
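
A sketch of that fast-path dispatch on each server, assuming hypothetical fields and a stubbed coordinated path: operations on a split record go to the local copy; everything else takes the normal locking path.

package phasedb

// server holds each server's local copies of split records.
type server struct {
	id        int
	splitKeys map[string]bool  // records currently split
	localCopy map[string]int64 // this server's copy, e.g. x2 on server 2
}

// Incr applies an increment to the local copy when the record is split;
// no network calls and no shared locks on that path.
func (s *server) Incr(key string, n int64) {
	if s.splitKeys[key] {
		s.localCopy[key] += n // INCR(x,1) becomes INCR(xi,1)
		return
	}
	s.incrWith2PL(key, n)
}

// incrWith2PL stands in for the coordinated path: lock the record at its
// home server under 2PL and commit with 2PC.
func (s *server) incrWith2PL(key string, n int64) {
	// Omitted: locking, network round trip, two-phase commit.
}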

Slide 54

•  Transactions can operate on both split and non-split records
•  The rest of the records (y, z) use 2PL+2PC
•  2PL+2PC ensures serializability for the non-split parts of the transaction
[Diagram: INCR(xi,1) on local copies runs alongside PUT(y,2) and PUT(z,1) under 2PL+2PC]

Slide 55

•  Split records have assigned operations
•  A read of x cannot be processed correctly in the current state
•  Block the operation; execute it after reconciliation
[Diagram: a GET(x) arrives while x is split for increments and is held until the copies are reconciled]

Slide 56

•  The home server initiates a cycle
•  All servers hear that they should reconcile their local copies of x
•  They stop processing local-copy operations
[Diagram: the cycle begins while the GET(x) remains blocked]

Slide 57

•  Reconcile state to the owning server: x = x + x0 + x1 + x2 + x3
•  Wait until all servers have finished reconciliation
•  Unblock x for other operations
[Diagram: during the cycle, the home server fetches each local copy (GET(x1), GET(x2), GET(x3)) and merges it into x; the blocked GET(x) can then run]
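
A sketch of the merge step for a counter, assuming hypothetical map-based copies: the home server folds every local copy into the base value and clears the copies for the next phase.

package phasedb

// reconcileCounter merges each server's local copy back into the home
// value: x = x + x0 + x1 + x2 + x3. After the merge, local copies are
// cleared and blocked reads of x can be unblocked.
func reconcileCounter(home map[string]int64, key string, copies []map[string]int64) {
	for _, c := range copies {
		home[key] += c[key] // x += xi (missing keys read as 0)
		delete(c, key)      // reset the local copy for the next phase
	}
}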

Slide 59

•  Inform the other servers that the cycle is completed
•  The other servers can begin using local copies again
[Diagram: x is split again; INCR(x0,1) and INCR(x1,1) resume on local copies, and the previously stashed GET(x) has been answered]

Slide 60

PhaseDB Has Higher Throughput and Better Latency than 2PL+2PC
[Graph: write latency (ms) vs. total throughput (txn/sec) for PhaseDB writes and 2PL writes; PhaseDB sustains higher throughput at lower write latency]