Splitting and Replicating Data for Fast Transactions

Neha
April 24, 2015

This talk and the GOTO talk cover similar topics.

Transcript

  1. Splitting and Replicating Data for Fast Transactions Neha Narula MIT

    CSAIL CRAFT Budapest April 2015 1   Don’t Give Up on Serializability Just Yet
  2. @neha 2   •  PhD candidate at MIT •  Formerly

    at Google •  Research in fast transactions for multi-core databases and distributed systems
  3. Serial Execution on the Same Records in a Distributed Database

    server 0 server 1 server 2 INCR(x,1) INCR(x,1) INCR(x,1) 6   Increments on the same records execute one at a time and require coordination 1) Network calls 2) Waiting for locks time
  4. Replicated Counter 7   v0 v1 v2 Store local counters

    on each server counter value = v0 + v1 + v2 +1 +1 +1
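
A minimal Go sketch of the replicated-counter idea on this slide; the ReplicatedCounter type and its methods are illustrative, not PhaseDB's actual API:

    // One local counter slot per server. Increments touch only the local
    // slot; a read reconciles by summing every slot (v0 + v1 + v2).
    type ReplicatedCounter struct {
        local []int64 // one entry per server
    }

    // Incr runs on the given server with no coordination.
    func (c *ReplicatedCounter) Incr(server int, n int64) {
        c.local[server] += n
    }

    // Value reconciles the local counters into the true total.
    func (c *ReplicatedCounter) Value() int64 {
        var total int64
        for _, v := range c.local {
            total += v
        }
        return total
    }

With three servers each calling Incr(i, 1), Value() returns 3, matching the v0 + v1 + v2 sum shown on the slide.
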
  5. Increments on the Same Record Can Execute in Parallel server

    0 server 1 server 2 INCR(x0 ,1) INCR(x1 ,1) INCR(x2 ,1) 8   •  Increments on the same record can proceed in parallel on local counters •  No network calls, no shared locks 1 1 1 Use replicated counters for x time
  6. 10   increment retweet count insert to my timeline insert

    to my follower’s timelines insert to list of retweeters
  7. Retweet(tweet, user) { x := GET(tweet) if !x { return

    // tweet deleted } INCR(rt_count:tweet, 1) INSERT(rt_list:tweet, user) INSERT(timeline:user, x) followers := GET(following:user) for f := range followers { INSERT(timeline:f, x) } } 11   Retweet Increment Inserts
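
For readability, here is the slide's Retweet pseudocode as a Go sketch against a toy in-memory store; the GET, INCR, and INSERT helpers and the map layout are stand-ins assumed for illustration, not the talk's storage API:

    // Toy in-memory tables so the sketch type-checks.
    var (
        tweets   = map[string]string{}   // tweet id -> tweet body
        counters = map[string]int{}      // e.g. rt_count:<tweet>
        lists    = map[string][]string{} // e.g. rt_list:<tweet>, timeline:<user>, following:<user>
    )

    func GET(k string) (string, bool) { v, ok := tweets[k]; return v, ok }
    func INCR(k string, n int)        { counters[k] += n }
    func INSERT(k, v string)          { lists[k] = append(lists[k], v) }

    // Retweet: one increment plus several inserts, as on the slide.
    func Retweet(tweet, user string) {
        x, ok := GET(tweet)
        if !ok {
            return // tweet deleted
        }
        INCR("rt_count:"+tweet, 1)     // bump the retweet counter
        INSERT("rt_list:"+tweet, user) // record who retweeted
        INSERT("timeline:"+user, x)    // my own timeline
        for _, f := range lists["following:"+user] {
            INSERT("timeline:"+f, x) // fan out to followers
        }
    }
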
  8. Many Operations in Retweet Can Execute in Parallel 12  

    retweet count = count0 + count1 + count2; rt list = rts0 U rts1 U rts2 (e.g. count0 : 2, rts0 : {alice, bob}; count1 : 1, rts1 : {eve}; count2 : 0, rts2 : { })
  9. Retweet(tweet, user) { x := GET(tweet) if !x { return

    // tweet deleted } INCR(rt_count:tweet, 1) INSERT(rt_list:tweet, user) INSERT(timeline:user, x) followers := GET(follow:user) for f := range followers { INSERT(timeline:f, x) } } 13   DeleteTweet(tweet, user) { x := GET(tweet) rts := GET(rt_list:tweet) DELETE(rt_list:tweet) followers := GET(follow:user) DELETE(rt_count:tweet) for u := range rts { REMOVE(timeline:u, x) } for f := range followers { REMOVE(timeline:f, x) } DELETE(tweet) }
  10. x := GET(tweet) if !x { return // tweet deleted

    } INCR(rt_count:tweet, 1) INSERT(rt_list:tweet, user) INSERT(timeline:user, x) followers := GET(follow:user) for f := range followers { INSERT(timeline:f, x) } 14   x := GET(tweet) rts := GET(rt_list:tweet) DELETE(rt_list:tweet) followers := GET(follow:user) DELETE(rt_count:tweet) for u := range rts { REMOVE(timeline:u, x) } for f := range followers { REMOVE(timeline:f, x) } DELETE(tweet) Result: Deleted tweets left around in timelines!
  11. ACID Transactions Atomic Consistent Isolated Durable 17   Whole thing

    happens or not Application-defined correctness Other transactions do not interfere Can recover correctly from a crash SET TRANSACTION ISOLATION LEVEL SERIALIZABLE BEGIN TRANSACTION ... COMMIT
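
The SQL shown here maps directly onto Go's standard database/sql package; a minimal sketch, where the counters table, the id parameter, and the helper name are hypothetical:

    import (
        "context"
        "database/sql"
    )

    // incrementSerializably runs the work inside a SERIALIZABLE transaction:
    // begin, read and write, commit, exactly as in the slide's SQL.
    func incrementSerializably(ctx context.Context, db *sql.DB, id int) error {
        tx, err := db.BeginTx(ctx, &sql.TxOptions{Isolation: sql.LevelSerializable})
        if err != nil {
            return err
        }
        defer tx.Rollback() // returns ErrTxDone (ignored) once Commit succeeds

        // ... reads and writes that must appear atomic and isolated ...
        if _, err := tx.ExecContext(ctx,
            "UPDATE counters SET n = n + 1 WHERE id = $1", id); err != nil {
            return err
        }
        return tx.Commit()
    }

How strictly SERIALIZABLE is enforced still depends on the backing database engine.
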
  12. RetweetTxn(tweet, user) { x := GET(tweet) if !x { return

    // tweet deleted } INCR(rt_count:tweet, 1) INSERT(rt_list:tweet, user) INSERT(timeline:user, x) followers := GET(follow:user) for f := range followers { INSERT(timeline:f, x) } } 19   DeleteTweetTxn(tweet, user) { x := GET(tweet) rts := GET(rt_list:tweet) DELETE(rt_list:tweet) followers := GET(follow:user) DELETE(rt_count:tweet) for u := range rts { REMOVE(timeline:u, x) } for f := range followers { REMOVE(timeline:f, x) } DELETE(tweet) }
  13. 21   RETWEET DELETE TWEET submitted concurrently interleaved/ parallel execution

    transactions appear atomic DELETE TWEET RETWEET x := GET(tweet) if !x { return // tweet deleted }
  14. What is Serializability? The result of executing a set of

    transactions is the same as if those transactions had executed one at a time, in some serial order. If each transaction preserves correctness, the DB will be in a correct state. We can pretend like there’s no concurrency! 22  
  15. Benefits of Serializability •  Do not have to reason about

    interleavings •  Do not have to express invariants separately from the code! 23  
  16. Serializability Costs •  On a multi-core database, serialization and cache

    line transfers •  On a distributed database, serialization and network calls. Concurrency control: Locking and coordination 24  
  17. Key Insight •  Many records are mostly accessed one way

    – Reads – Updating aggregates – Index inserts •  Plan to perform these operations without coordination •  Coordinate for incompatible operations 26  
  18. x x0 x1 x GET(x0 ) GET(x1 ) GET(x) Replicate

    for Reads 27   Plan: store local copies of records; mark record as read-only x2 GET(x2 )
  19. x x0 x1 x PUT(x) When Replicated, Writes Are Slower

    28   Writers have to lock all copies x2 x0 x1 x2
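
A small Go sketch of why the read plan makes writes slower: a PUT must lock and update every copy before readers can proceed. The Copy type and the all-copies locking protocol are assumptions for illustration:

    import "sync"

    type Copy struct {
        mu  sync.Mutex
        val string
    }

    // PutReplicated writes a record that has been split for reads:
    // lock every copy, overwrite each one, then release the locks.
    func PutReplicated(copies []*Copy, v string) {
        for _, c := range copies {
            c.mu.Lock()
        }
        for _, c := range copies {
            c.val = v
        }
        for _, c := range copies {
            c.mu.Unlock()
        }
    }
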
  20. PhaseDB A (research) distributed, transactional database •  Choose an execution

    plan based on operations on popular records •  Split a record into local copies on every server •  Coordination-free execution in the common case •  Coordinate in the uncommon case to maintain serializability. 29  
  21. Ordered PUT, insert to an ordered list, user- defined functions

    Operation Model Developers write transactions as stored procedures which are composed of operations on keys and values: 30   value GET(k) void PUT(k,v) void INCR(k,n) void MAX(k,n) void MULT(k,n) void OPUT(k,v,o) void TOPK_INSERT(k,v,o) void UDF(k,v,a) Traditional key/value operations Operations on numeric values which modify the existing value
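
Read as Go, the operation model might look like the interface below; the method signatures and the Value and Args types are guesses from the slide, not PhaseDB's published API:

    type Value []byte // opaque stored value (an assumption)
    type Args []byte  // opaque UDF arguments (an assumption)

    // Store is one server's view of the key/value operation model.
    type Store interface {
        // Traditional key/value operations.
        GET(k string) Value
        PUT(k string, v Value)

        // Operations on numeric values which modify the existing value.
        INCR(k string, n int64)
        MAX(k string, n int64)
        MULT(k string, n int64)

        // Ordered PUT, insert to an ordered list, user-defined function.
        OPUT(k string, v Value, o int64)
        TOPK_INSERT(k string, v Value, o int64)
        UDF(k string, v Value, a Args)
    }
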
  22. Execution Plans in PhaseDB •  Replicate for reads •  Replicate

    for commutative operations •  Track last write •  Log operations; apply later 31   GET() Increment, inserts
  23. Ordered PUT, insert to an ordered list, user- defined functions

    Operation Model Developers write transactions as stored procedures which are composed of operations on keys and values: 32   value GET(k) void PUT(k,v) void INCR(k,n) void MAX(k,n) void MULT(k,n) void OPUT(k,v,o) void TOPK_INSERT(k,v,o) void UDF(k,v,a) Traditional key/value operations Operations on numeric values which modify the existing value Replicate for reads Save last write Replicate for commutative operations Log operations
  24. What Kinds of Applications Benefit? •  Twitter •  TPC-C • 

    Auction website RUBiS 33   BidTxn(bidder, amount, item) { INCR(item.num_bids,1) MAX(item.max_bid, amount) OPUT(item.max_bidder, bidder, amount) PUT(NewBidKey(), Bid{bidder, amount, item}) }
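
The same BidTxn spelled out as a Go sketch, with a comment noting which execution plan each operation could run under; the key names, the Bid type, and the newBidKey helper are all hypothetical:

    // Stand-ins for the talk's operations; a real database would route
    // each to the plan noted in the comment.
    func INCR(k string, n int64)               {} // replicate for commutative operations
    func MAX(k string, n int64)                {} // replicate for commutative operations
    func OPUT(k string, v string, order int64) {} // save last write
    func PUT(k string, v Bid)                  {} // log operations

    type Bid struct {
        Bidder string
        Amount int64
        Item   string
    }

    // newBidKey is a hypothetical unique-key generator for new bids.
    func newBidKey() string { return "bid:<unique id>" }

    // The slide's RUBiS bid transaction: every operation either commutes
    // or overwrites, so the contended item record can stay split.
    func BidTxn(bidder string, amount int64, item string) {
        INCR("num_bids:"+item, 1)
        MAX("max_bid:"+item, amount)
        OPUT("max_bidder:"+item, bidder, amount)
        PUT(newBidKey(), Bid{bidder, amount, item})
    }
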
  25. PhaseDB Uses Dynamic Execution Plans •  Start out with no

    plans and no split data •  Sample remote operations on records and lock wait times •  Initiate plan based on most common operation •  Stop plan if common operation changes 38   PhaseDB handles dynamic, changing workloads
  26. Sample Transactions During Execution server 0 server 1 server 2

    GET(x) GET(x) PUT(y,2) GET(x) PUT(z,1) 39   server 3 GET(x) PUT(y,2) •  Suppose x is on server 0 (home server). •  Home server watches remote accesses time +1 +1 +1 Split x for reads
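
A sketch of the sampling decision: the home server tallies which operation dominates remote accesses to a hot record and, past some threshold, starts the matching plan. The threshold, the tally struct, and the plan names here are assumptions, not PhaseDB's actual heuristics:

    // Per-record tallies of sampled remote operations.
    type sampleStats struct {
        gets, incrs, puts int
    }

    // choosePlan is called by a record's home server after sampling.
    func choosePlan(s sampleStats) string {
        const threshold = 100 // assumed sampling threshold
        switch {
        case s.gets > threshold && s.gets > s.incrs && s.gets > s.puts:
            return "replicate for reads"
        case s.incrs > threshold && s.incrs > s.gets && s.incrs > s.puts:
            return "replicate for commutative operations"
        case s.puts > threshold && s.puts > s.gets && s.puts > s.incrs:
            return "save last write"
        default:
            return "no plan" // keep using 2PL+2PC
        }
    }
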
  27. Summary of PhaseDB •  Changes the data layout and allowed

    operations to align with common operations •  Executes common operations in parallel when a record is split •  Samples to automatically determine a good plan for contended records and adjust to changing workloads 40  
  28. Implementation •  PhaseDB implemented as a multithreaded Go server • 

    Transactions are procedures written in Go •  Experiments on 8 servers using two cores each •  All data fits in memory 42  
  29. All PhaseDB Plans Have Higher Throughput Than 2PL 43  

    [Bar chart: total throughput for the read, commute, overwrite, and log plans, PhaseDB vs. 2PL; PhaseDB reduces network calls and runs increments in parallel] Distributed transactions with GET, INCR, PUT, and UDF operations on a popular record
  30. TPC-C •  Warehouses, items for sale, stock, orders •  NewOrderTxn:

    Issue an order for 5-15 items •  10% of the time requires a distributed transaction to retrieve stock from a remote warehouse 44  
  31. order_id := GET(req.DistrictNextOIDKey) order_id = order_id + 1 PUT(req.DistrictNextOIDKey, order_id)

    for i := range req.Items { item := GET(req.Items[i]) Restock(req.StockKey[i], req.Amount[i]) INCR(req.StockYTDKey[i], req.Amount[i]*item.Price) if req.Swid[i] != wid { INCR(req.StockRemoteKey[i], 1) } // Construct orderline to insert PUT(order_line_id, order_line) } . . . 45   TPC-C NewOrderTxn Mostly read User-Defined Function Commutative operation Commutative operation
  32. TPC-C Performance Improves Over Time By Replicating Data

    [Line chart: throughput (txns/sec) vs. time (seconds), PhaseDB vs. 2PL] 46   PhaseDB detects contended operations and splits records
  33. order_id := GET(req.DistrictNextOIDKey) order_id = order_id + 1 PUT(req.DistrictNextOIDKey, order_id)

    for i := range req.Items { item := GET(req.Items[i]) Restock(req.StockKey[i], req.Amount[i]) INCR(req.StockYTDKey[i], req.Amount[i]*item.Price) if req.Swid[i] != wid { INCR(req.StockRemoteKey[i], 1) } // Construct orderline to insert PUT(order_line_id, order_line) } . . . 47   TPC-C NewOrderTxn Replicated for reads Replicated for ADD Replicated for ADD Logged operations
  34. Improving Serializability Performance 48   Technique / Systems: •  Single-partition transactions: Megastore

    •  Transaction chopping: Lynx, ROCOCO •  Commutative locking: Escrow transactions, abstract data types, Doppel, PhaseDB •  Deterministic ordering: Granola, Calvin. Commutativity in Distributed Systems topics: Red/Blue Consistency, Commutative Replicated Datatypes, Counting sets (Walter)
  35. Conclusion •  If it performs well enough, use SERIALIZABLE • 

    Workloads are regular; we can optimize for the common case. •  Still many opportunities to improve performance while retaining easy to understand semantics. 49   http://nehanaru.la @neha
  36. x x0 x1 x INCR(x0 ,1) INCR(x,1) INCR(x1 ,1) Replicate

    for Commutative Operations 50   Update local copies of records x2 INCR(x2 ,1)
  37. x x0 x1 x GET(x) When Replicated for Increment, Reads

    Are Slower 51   Readers have to lock all copies x2 x0 x1 x2
  38. time Batching Amortizes the Cost of Reconciliation 52   server

    0 server 1 server 2 INCR(x0 ,1) INCR(x1 ,1) INCR(y,2) INCR(x2 ,1) INCR(z,1) server 3 INCR(x3 ,1) INCR(y,2) GET(x) •  Wait to accumulate stashed transactions, batch for cycle •  Amortize the cost of reconciliation over many transactions INCR(x1 ,1) INCR(x2 ,1) INCR(z,1) GET(x) GET(x) GET(x) GET(x) GET(x) split cycle
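
A sketch of the stashing that makes batching work: operations incompatible with the current split (here, reads of x) are queued until a reconciliation cycle produces a value, then all of them are answered at once. The types and the channel protocol are assumptions for illustration:

    // A stashed read waits for the next cycle and then receives the
    // reconciled value on its reply channel.
    type stashedGet struct {
        key   string
        reply chan int64
    }

    var stash []stashedGet // reads blocked while the record is split

    // At the end of a cycle the reconciled values are known, so every
    // stashed read can be answered at once, amortizing the cycle's cost.
    func drainStash(reconciled map[string]int64) {
        for _, g := range stash {
            g.reply <- reconciled[g.key]
        }
        stash = nil
    }
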
  39. Split Execution server 0 server 1 server 2 INCR(x,1) INCR(x,1)

    PUT(y,2) INCR(x,1) PUT(z,1) 53   server 3 INCR(x,1) PUT(y,2) server 0 server 1 server 2 INCR(x0 ,1) INCR(x1 ,1) PUT(y,2) INCR(x2 ,1) PUT(z,1) server 3 INCR(x3 ,1) PUT(y,2) •  When a record (x) is split operations on it are transformed into operations on local copies (x0 , x1 , x2 , x3 ) •  Home server sends copies to other servers split time
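
A sketch of the transformation this slide describes: on a server where x is split, INCR(x,1) becomes an increment of the local copy x_i, while non-split records keep using the ordinary locking path. The server type and routing logic are illustrative only:

    type server struct {
        localCopies map[string]int64 // per-server copies of split records
        split       map[string]bool  // which records are currently split
    }

    func (s *server) isSplit(key string) bool { return s.split[key] }

    // incrWithLocks stands in for the ordinary 2PL+2PC write path.
    func (s *server) incrWithLocks(key string, n int64) { /* lock, write, unlock */ }

    // incr routes an increment either to this server's local copy of a
    // split record or to the ordinary locking path for everything else.
    func (s *server) incr(key string, n int64) {
        if s.isSplit(key) {
            s.localCopies[key] += n // x -> x_i on this server, no coordination
            return
        }
        s.incrWithLocks(key, n) // non-split records (y, z) still use 2PL+2PC
    }
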
  40. •  Transactions can operate on split and non-split records • 

    Rest of the records use 2PL+2PC (y, z) •  2PL+2PC ensures serializability for the non-split parts of the transaction 54   server 0 server 1 server 2 INCR(x0 ,1) INCR(x1 ,1) PUT(y,2) INCR(x2 ,1) PUT(z,1) server 3 INCR(x3 ,1) PUT(y,2) split time
  41. •  Split records have assigned operations •  Cannot correctly process

    a read of x in the current state •  Block operation to execute after reconciliation 55   server 0 server 1 server 2 INCR(x0 ,1) INCR(x1 ,1) PUT(y,2) INCR(x2 ,1) PUT(z,1) server 3 INCR(x3 ,1) PUT(y,2) split INCR(x1 ,1) INCR(x2 ,1) INCR(x1 ,1) time GET(x)
  42. time 56   server 0 server 1 server 2 INCR(x0

    ,1) INCR(x1 ,1) PUT(y,2) INCR(x2 ,1) PUT(z,1) server 3 INCR(x3 ,1) PUT(y,2) split •  Home server initiates a cycle. •  All servers hear they should reconcile their local copies of x •  Stop processing local copy operations GET(x) INCR(x1 ,1) INCR(x2 ,1) INCR(x1 ,1)
  43. time •  Reconcile state to owning server •  Wait until

    all servers have finished reconciliation •  Unblock x for other operations 57   server 0 server 1 server 2 server 3 cycling x = x + x0 + x1 + x2 + x3 GET(x) GET(x1 ) GET(x2 ) GET(x3 )
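
A sketch of the reconciliation step on the owning server: fetch each server's local copy, fold it into the global value, and only then serve the blocked GET(x). The fetchCopy callback stands in for the per-server RPC and is an assumption:

    // reconcile folds every server's local copy of x into the home
    // server's value: x = x + x0 + x1 + x2 + x3.
    func reconcile(x int64, fetchCopy func(server int) int64, nServers int) int64 {
        for i := 0; i < nServers; i++ {
            x += fetchCopy(i) // e.g. a GET(x_i) call to server i
        }
        return x // safe to serve GET(x) and unblock stashed reads now
    }
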
  44. time 58   server 0 server 1 server 2 server

    3 cycling GET(x) •  Reconcile state to owning server •  Wait until all servers have finished reconciliation •  Unblock x for other operations x = x + x0 + x1 + x2 + x3 GET(x1 ) GET(x2 ) GET(x3 )
  45. time 59   server 0 server 1 server 2 server

    3 •  Inform other servers the cycle is completed •  Other servers can begin using local copies again split INCR(x0 ,1) INCR(x1 ,1) GET(x)
  46. PhaseDB Has Higher Throughput and Better Latency than 2PL+2PC 60

    [Chart: write latency (ms) vs. total throughput (txn/sec), PhaseDB writes vs. 2PL writes]