Splitting and Replicating Data for Fast Transactions

Neha
April 24, 2015

This talk and the GOTO talk cover similar topics.

Transcript

  1. Splitting and Replicating Data for Fast Transactions Neha Narula MIT

    CSAIL CRAFT Budapest April 2015 1   Don’t Give Up on Serializability Just Yet
  2. @neha 2   •  PhD candidate at MIT •  Formerly

    at Google •  Research in fast transactions for multi-core databases and distributed systems
  3. Serial Execution on the Same Records in a Distributed Database

    server 0 server 1 server 2 INCR(x,1) INCR(x,1) INCR(x,1) 6   Increments on the same records execute one at a time and require coordination 1) Network calls 2) Waiting for locks time
  4. Replicated Counter 7   v0 v1 v2 Store local counters

    on each server counter value = v0 + v1 + v2 +1 +1 +1
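
A minimal Go sketch of the replicated-counter idea on this slide; the ReplicatedCounter type and its methods are illustrative, not PhaseDB's actual API:

    // One local counter slot per server. Increments touch only the local
    // slot; a read reconciles by summing every slot (v0 + v1 + v2).
    type ReplicatedCounter struct {
        local []int64 // one entry per server
    }

    // Incr runs on the given server with no coordination.
    func (c *ReplicatedCounter) Incr(server int, n int64) {
        c.local[server] += n
    }

    // Value reconciles the local counters into the true total.
    func (c *ReplicatedCounter) Value() int64 {
        var total int64
        for _, v := range c.local {
            total += v
        }
        return total
    }

With three servers each calling Incr(i, 1), Value() returns 3, matching the v0 + v1 + v2 sum shown on the slide.
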
  5. Increments on the Same Record Can Execute in Parallel server

    0 server 1 server 2 INCR(x0 ,1) INCR(x1 ,1) INCR(x2 ,1) 8   •  Increments on the same record can proceed in parallel on local counters •  No network calls, no shared locks 1 1 1 Use replicated counters for x time
  6. 10   increment retweet count insert to my timeline insert

    to my follower’s timelines insert to list of retweeters
  7. Retweet(tweet, user) { x := GET(tweet) if !x { return

    // tweet deleted } INCR(rt_count:tweet, 1) INSERT(rt_list:tweet, user) INSERT(timeline:user, x) followers := GET(following:user) for f := range followers { INSERT(timeline:f, x) } } 11   Retweet Increment Inserts
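
For readability, here is the slide's Retweet pseudocode as a Go sketch against a toy in-memory store; the GET, INCR, and INSERT helpers and the map layout are stand-ins assumed for illustration, not the talk's storage API:

    // Toy in-memory tables so the sketch type-checks.
    var (
        tweets   = map[string]string{}   // tweet id -> tweet body
        counters = map[string]int{}      // e.g. rt_count:<tweet>
        lists    = map[string][]string{} // e.g. rt_list:<tweet>, timeline:<user>, following:<user>
    )

    func GET(k string) (string, bool) { v, ok := tweets[k]; return v, ok }
    func INCR(k string, n int)        { counters[k] += n }
    func INSERT(k, v string)          { lists[k] = append(lists[k], v) }

    // Retweet: one increment plus several inserts, as on the slide.
    func Retweet(tweet, user string) {
        x, ok := GET(tweet)
        if !ok {
            return // tweet deleted
        }
        INCR("rt_count:"+tweet, 1)     // bump the retweet counter
        INSERT("rt_list:"+tweet, user) // record who retweeted
        INSERT("timeline:"+user, x)    // my own timeline
        for _, f := range lists["following:"+user] {
            INSERT("timeline:"+f, x) // fan out to followers
        }
    }
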
  8. Many Operations in Retweet Can Execute in Parallel 12  

    retweet count = count0 + count1 + count2; rt list = rts0 U rts1 U rts2 (e.g. count0 : 2, rts0 : {alice, bob}; count1 : 1, rts1 : {eve}; count2 : 0, rts2 : { })
  9. Retweet(tweet, user) { x := GET(tweet) if !x { return

    // tweet deleted } INCR(rt_count:tweet, 1) INSERT(rt_list:tweet, user) INSERT(timeline:user, x) followers := GET(follow:user) for f := range followers { INSERT(timeline:f, x) } } 13   DeleteTweet(tweet, user) { x := GET(tweet) rts := GET(rt_list:tweet) DELETE(rt_list:tweet) followers := GET(follow:user) DELETE(rt_count:tweet) for u := range rts { REMOVE(timeline:u, x) } for f := range followers { REMOVE(timeline:f, x) } DELETE(tweet) }
  10. x := GET(tweet) if !x { return // tweet deleted

    } INCR(rt_count:tweet, 1) INSERT(rt_list:tweet, user) INSERT(timeline:user, x) followers := GET(follow:user) for f := range followers { INSERT(timeline:f, x) } 14   x := GET(tweet) rts := GET(rt_list:tweet) DELETE(rt_list:tweet) followers := GET(follow:user) DELETE(rt_count:tweet) for u := range rts { REMOVE(timeline:u, x) } for f := range followers { REMOVE(timeline:f, x) } DELETE(tweet) Result: Deleted tweets left around in timelines!
  11. ACID Transactions Atomic Consistent Isolated Durable 17   Whole thing

    happens or not Application-defined correctness Other transactions do not interfere Can recover correctly from a crash SET TRANSACTION ISOLATION LEVEL SERIALIZABLE BEGIN TRANSACTION ... COMMIT
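
The SQL shown here maps directly onto Go's standard database/sql package; a minimal sketch, where the counters table, the id parameter, and the helper name are hypothetical:

    import (
        "context"
        "database/sql"
    )

    // incrementSerializably runs the work inside a SERIALIZABLE transaction:
    // begin, read and write, commit, exactly as in the slide's SQL.
    func incrementSerializably(ctx context.Context, db *sql.DB, id int) error {
        tx, err := db.BeginTx(ctx, &sql.TxOptions{Isolation: sql.LevelSerializable})
        if err != nil {
            return err
        }
        defer tx.Rollback() // returns ErrTxDone (ignored) once Commit succeeds

        // ... reads and writes that must appear atomic and isolated ...
        if _, err := tx.ExecContext(ctx,
            "UPDATE counters SET n = n + 1 WHERE id = $1", id); err != nil {
            return err
        }
        return tx.Commit()
    }

How strictly SERIALIZABLE is enforced still depends on the backing database engine.
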
  12. RetweetTxn(tweet, user) { x := GET(tweet) if !x { return

    // tweet deleted } INCR(rt_count:tweet, 1) INSERT(rt_list:tweet, user) INSERT(timeline:user, x) followers := GET(follow:user) for f := range followers { INSERT(timeline:f, x) } } 19   DeleteTweetTxn(tweet, user) { x := GET(tweet) rts := GET(rt_list:tweet) DELETE(rt_list:tweet) followers := GET(follow:user) DELETE(rt_count:tweet) for u := range rts { REMOVE(timeline:u, x) } for f := range followers { REMOVE(timeline:f, x) } DELETE(tweet) }
  13. 21   RETWEET DELETE TWEET submitted concurrently interleaved/ parallel execution

    transactions appear atomic DELETE TWEET RETWEET x := GET(tweet) if !x { return // tweet deleted }
  14. What is Serializability? The result of executing a set of

    transactions is the same as if those transactions had executed one at a time, in some serial order. If each transaction preserves correctness, the DB will be in a correct state. We can pretend like there’s no concurrency! 22  
  15. Benefits of Serializability •  Do not have to reason about

    interleavings •  Do not have to express invariants separately from the code! 23  
  16. Serializability Costs •  On a multi-core database, serialization and cache

    line transfers •  On a distributed database, serialization and network calls. Concurrency control: Locking and coordination 24  
  17. Key Insight •  Many records are mostly accessed one way

    – Reads – Updating aggregates – Index inserts •  Plan to perform these operations without coordination •  Coordinate for incompatible operations 26  
  18. x x0 x1 x GET(x0 ) GET(x1 ) GET(x) Replicate

    for Reads 27   Plan: store local copies of records; mark record as read-only x2 GET(x2 )
  19. x x0 x1 x PUT(x) When Replicated, Writes Are Slower

    28   Writers have to lock all copies x2 x0 x1 x2
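
A small Go sketch of why the read plan makes writes slower: a PUT must lock and update every copy before readers can proceed. The Copy type and the all-copies locking protocol are assumptions for illustration:

    import "sync"

    type Copy struct {
        mu  sync.Mutex
        val string
    }

    // PutReplicated writes a record that has been split for reads:
    // lock every copy, overwrite each one, then release the locks.
    func PutReplicated(copies []*Copy, v string) {
        for _, c := range copies {
            c.mu.Lock()
        }
        for _, c := range copies {
            c.val = v
        }
        for _, c := range copies {
            c.mu.Unlock()
        }
    }
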
  20. PhaseDB A (research) distributed, transactional database •  Choose an execution

    plan based on operations on popular records •  Split a record into local copies on every server •  Coordination-free execution in the common case •  Coordinate in the uncommon case to maintain serializability. 29  
  21. Ordered PUT, insert to an ordered list, user- defined functions

    Operation Model Developers write transactions as stored procedures which are composed of operations on keys and values: 30   value GET(k) void PUT(k,v) void INCR(k,n) void MAX(k,n) void MULT(k,n) void OPUT(k,v,o) void TOPK_INSERT(k,v,o) void UDF(k,v,a) Traditional key/value operations Operations on numeric values which modify the existing value
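
Read as Go, the operation model might look like the interface below; the method signatures and the Value and Args types are guesses from the slide, not PhaseDB's published API:

    type Value []byte // opaque stored value (an assumption)
    type Args []byte  // opaque UDF arguments (an assumption)

    // Store is one server's view of the key/value operation model.
    type Store interface {
        // Traditional key/value operations.
        GET(k string) Value
        PUT(k string, v Value)

        // Operations on numeric values which modify the existing value.
        INCR(k string, n int64)
        MAX(k string, n int64)
        MULT(k string, n int64)

        // Ordered PUT, insert to an ordered list, user-defined function.
        OPUT(k string, v Value, o int64)
        TOPK_INSERT(k string, v Value, o int64)
        UDF(k string, v Value, a Args)
    }
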
  22. Execution Plans in PhaseDB •  Replicate for reads •  Replicate

    for commutative operations •  Track last write •  Log operations; apply later 31   GET() Increment, inserts
  23. Ordered PUT, insert to an ordered list, user- defined functions

    Operation Model Developers write transactions as stored procedures which are composed of operations on keys and values: 32   value GET(k) void PUT(k,v) void INCR(k,n) void MAX(k,n) void MULT(k,n) void OPUT(k,v,o) void TOPK_INSERT(k,v,o) void UDF(k,v,a) Traditional key/value operations Operations on numeric values which modify the existing value Replicate for reads Save last write Replicate for commutative operations Log operations
  24. What Kinds of Applications Benefit? •  Twitter •  TPC-C • 

    Auction website RUBiS 33   BidTxn(bidder, amount, item) { INCR(item.num_bids,1) MAX(item.max_bid, amount) OPUT(item.max_bidder, bidder, amount) PUT(NewBidKey(), Bid{bidder, amount, item}) }
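
The same BidTxn spelled out as a Go sketch, with a comment noting which execution plan each operation could run under; the key names, the Bid type, and the newBidKey helper are all hypothetical:

    // Stand-ins for the talk's operations; a real database would route
    // each to the plan noted in the comment.
    func INCR(k string, n int64)               {} // replicate for commutative operations
    func MAX(k string, n int64)                {} // replicate for commutative operations
    func OPUT(k string, v string, order int64) {} // save last write
    func PUT(k string, v Bid)                  {} // log operations

    type Bid struct {
        Bidder string
        Amount int64
        Item   string
    }

    // newBidKey is a hypothetical unique-key generator for new bids.
    func newBidKey() string { return "bid:<unique id>" }

    // The slide's RUBiS bid transaction: every operation either commutes
    // or overwrites, so the contended item record can stay split.
    func BidTxn(bidder string, amount int64, item string) {
        INCR("num_bids:"+item, 1)
        MAX("max_bid:"+item, amount)
        OPUT("max_bidder:"+item, bidder, amount)
        PUT(newBidKey(), Bid{bidder, amount, item})
    }
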
  25. PhaseDB Uses Dynamic Execution Plans •  Start out with no

    plans and no split data •  Sample remote operations on records and lock wait times •  Initiate plan based on most common operation •  Stop plan if common operation changes 38   PhaseDB handles dynamic, changing workloads
  26. Sample Transactions During Execution server 0 server 1 server 2

    GET(x) GET(x) PUT(y,2) GET(x) PUT(z,1) 39   server 3 GET(x) PUT(y,2) •  Suppose x is on server 0 (home server). •  Home server watches remote accesses time +1 +1 +1 Split x for reads
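
A sketch of the sampling decision: the home server tallies which operation dominates remote accesses to a hot record and, past some threshold, starts the matching plan. The threshold, the tally struct, and the plan names here are assumptions, not PhaseDB's actual heuristics:

    // Per-record tallies of sampled remote operations.
    type sampleStats struct {
        gets, incrs, puts int
    }

    // choosePlan is called by a record's home server after sampling.
    func choosePlan(s sampleStats) string {
        const threshold = 100 // assumed sampling threshold
        switch {
        case s.gets > threshold && s.gets > s.incrs && s.gets > s.puts:
            return "replicate for reads"
        case s.incrs > threshold && s.incrs > s.gets && s.incrs > s.puts:
            return "replicate for commutative operations"
        case s.puts > threshold && s.puts > s.gets && s.puts > s.incrs:
            return "save last write"
        default:
            return "no plan" // keep using 2PL+2PC
        }
    }
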
  27. Summary of PhaseDB •  Changes the data layout and allowed

    operations to align with common operations •  Executes common operations in parallel when a record is split •  Samples to automatically determine a good plan for contended records and adjust to changing workloads 40  
  28. Implementation •  PhaseDB implemented as a multithreaded Go server • 

    Transactions are procedures written in Go •  Experiments on 8 servers using two cores each •  All data fits in memory 42  
  29. All PhaseDB Plans Have Higher Throughput Than 2PL 43  

    [Bar chart: total throughput for the read, commute, overwrite, and log plans, PhaseDB vs. 2PL; PhaseDB reduces network calls and runs increments in parallel] Distributed transactions with GET, INCR, PUT, and UDF operations on a popular record
  30. TPC-C •  Warehouses, items for sale, stock, orders •  NewOrderTxn:

    Issue an order for 5-15 items •  10% of the time requires a distributed transaction to retrieve stock from a remote warehouse 44  
  31. order_id := GET(req.DistrictNextOIDKey) order_id = order_id + 1 PUT(req.DistrictNextOIDKey, order_id)

    for i := range req.Items { item := GET(req.Items[i]) Restock(req.StockKey[i], req.Amount[i]) INCR(req.StockYTDKey[i], req.Amount[i]*item.Price) if req.Swid[i] != wid { INCR(req.StockRemoteKey[i], 1) } // Construct orderline to insert PUT(order_line_id, order_line) } . . . 45   TPC-C NewOrderTxn Mostly read User-Defined Function Commutative operation Commutative operation
  32. TPC-C Performance Improves Over Time By Replicating Data

    [Line chart: throughput (txns/sec) vs. time (seconds), PhaseDB vs. 2PL] 46   PhaseDB detects contended operations and splits records
  33. order_id := GET(req.DistrictNextOIDKey) order_id = order_id + 1 PUT(req.DistrictNextOIDKey, order_id)

    for i := range req.Items { item := GET(req.Items[i]) Restock(req.StockKey[i], req.Amount[i]) INCR(req.StockYTDKey[i], req.Amount[i]*item.Price) if req.Swid[i] != wid { INCR(req.StockRemoteKey[i], 1) } // Construct orderline to insert PUT(order_line_id, order_line) } . . . 47   TPC-C NewOrderTxn Replicated for reads Replicated for ADD Replicated for ADD Logged operations
  34. Improving Serializability Performance 48   Technique / Systems: •  Single-partition transactions: Megastore

    •  Transaction chopping: Lynx, ROCOCO •  Commutative locking: Escrow transactions, abstract data types, Doppel, PhaseDB •  Deterministic ordering: Granola, Calvin. Commutativity in Distributed Systems topics: Red/Blue Consistency, Commutative Replicated Datatypes, Counting sets (Walter)
  35. Conclusion •  If it performs well enough, use SERIALIZABLE • 

    Workloads are regular; we can optimize for the common case. •  Still many opportunities to improve performance while retaining easy to understand semantics. 49   http://nehanaru.la @neha
  36. x x0 x1 x INCR(x0 ,1) INCR(x,1) INCR(x1 ,1) Replicate

    for Commutative Operations 50   Update local copies of records x2 INCR(x2 ,1)
  37. x x0 x1 x GET(x) When Replicated for Increment, Reads

    Are Slower 51   Readers have to lock all copies x2 x0 x1 x2
  38. time Batching Amortizes the Cost of Reconciliation 52   server

    0 server 1 server 2 INCR(x0 ,1) INCR(x1 ,1) INCR(y,2) INCR(x2 ,1) INCR(z,1) server 3 INCR(x3 ,1) INCR(y,2) GET(x) •  Wait to accumulate stashed transactions, batch for cycle •  Amortize the cost of reconciliation over many transactions INCR(x1 ,1) INCR(x2 ,1) INCR(z,1) GET(x) GET(x) GET(x) GET(x) GET(x) split cycle
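
A sketch of the stashing that makes batching work: operations incompatible with the current split (here, reads of x) are queued until a reconciliation cycle produces a value, then all of them are answered at once. The types and the channel protocol are assumptions for illustration:

    // A stashed read waits for the next cycle and then receives the
    // reconciled value on its reply channel.
    type stashedGet struct {
        key   string
        reply chan int64
    }

    var stash []stashedGet // reads blocked while the record is split

    // At the end of a cycle the reconciled values are known, so every
    // stashed read can be answered at once, amortizing the cycle's cost.
    func drainStash(reconciled map[string]int64) {
        for _, g := range stash {
            g.reply <- reconciled[g.key]
        }
        stash = nil
    }
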
  39. Split Execution server 0 server 1 server 2 INCR(x,1) INCR(x,1)

    PUT(y,2) INCR(x,1) PUT(z,1) 53   server 3 INCR(x,1) PUT(y,2) server 0 server 1 server 2 INCR(x0 ,1) INCR(x1 ,1) PUT(y,2) INCR(x2 ,1) PUT(z,1) server 3 INCR(x3 ,1) PUT(y,2) •  When a record (x) is split operations on it are transformed into operations on local copies (x0 , x1 , x2 , x3 ) •  Home server sends copies to other servers split time
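
A sketch of the transformation this slide describes: on a server where x is split, INCR(x,1) becomes an increment of the local copy x_i, while non-split records keep using the ordinary locking path. The server type and routing logic are illustrative only:

    type server struct {
        localCopies map[string]int64 // per-server copies of split records
        split       map[string]bool  // which records are currently split
    }

    func (s *server) isSplit(key string) bool { return s.split[key] }

    // incrWithLocks stands in for the ordinary 2PL+2PC write path.
    func (s *server) incrWithLocks(key string, n int64) { /* lock, write, unlock */ }

    // incr routes an increment either to this server's local copy of a
    // split record or to the ordinary locking path for everything else.
    func (s *server) incr(key string, n int64) {
        if s.isSplit(key) {
            s.localCopies[key] += n // x -> x_i on this server, no coordination
            return
        }
        s.incrWithLocks(key, n) // non-split records (y, z) still use 2PL+2PC
    }
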
  40. •  Transactions can operate on split and non-split records • 

    Rest of the records use 2PL+2PC (y, z) •  2PL+2PC ensures serializability for the non-split parts of the transaction 54   server 0 server 1 server 2 INCR(x0 ,1) INCR(x1 ,1) PUT(y,2) INCR(x2 ,1) PUT(z,1) server 3 INCR(x3 ,1) PUT(y,2) split time
  41. •  Split records have assigned operations •  Cannot correctly process

    a read of x in the current state •  Block operation to execute after reconciliation 55   server 0 server 1 server 2 INCR(x0 ,1) INCR(x1 ,1) PUT(y,2) INCR(x2 ,1) PUT(z,1) server 3 INCR(x3 ,1) PUT(y,2) split INCR(x1 ,1) INCR(x2 ,1) INCR(x1 ,1) time GET(x)
  42. time 56   server 0 server 1 server 2 INCR(x0

    ,1) INCR(x1 ,1) PUT(y,2) INCR(x2 ,1) PUT(z,1) server 3 INCR(x3 ,1) PUT(y,2) split •  Home server initiates a cycle. •  All servers hear they should reconcile their local copies of x •  Stop processing local copy operations GET(x) INCR(x1 ,1) INCR(x2 ,1) INCR(x1 ,1)
  43. time •  Reconcile state to owning server •  Wait until

    all servers have finished reconciliation •  Unblock x for other operations 57   server 0 server 1 server 2 server 3 cycling x = x + x0 + x1 + x2 + x3 GET(x) GET(x1 ) GET(x2 ) GET(x3 )
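
A sketch of the reconciliation step on the owning server: fetch each server's local copy, fold it into the global value, and only then serve the blocked GET(x). The fetchCopy callback stands in for the per-server RPC and is an assumption:

    // reconcile folds every server's local copy of x into the home
    // server's value: x = x + x0 + x1 + x2 + x3.
    func reconcile(x int64, fetchCopy func(server int) int64, nServers int) int64 {
        for i := 0; i < nServers; i++ {
            x += fetchCopy(i) // e.g. a GET(x_i) call to server i
        }
        return x // safe to serve GET(x) and unblock stashed reads now
    }
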
  44. time 58   server 0 server 1 server 2 server

    3 cycling GET(x) •  Reconcile state to owning server •  Wait until all servers have finished reconciliation •  Unblock x for other operations x = x + x0 + x1 + x2 + x3 GET(x1 ) GET(x2 ) GET(x3 )
  45. time 59   server 0 server 1 server 2 server

    3 •  Inform other servers the cycle is completed •  Other servers can begin using local copies again split INCR(x0 ,1) INCR(x1 ,1) GET(x)
  46. PhaseDB Has Higher Throughput and Better Latency than 2PL+2PC 60

    [Chart: write latency (ms) vs. total throughput (txn/sec), PhaseDB writes vs. 2PL writes]