Splitting and Replicating Data for Fast Transactions

Neha
April 24, 2015

This talk and the GOTO talk cover similar topics.


Transcript

  1. Splitting and Replicating Data for Fast Transactions
    Neha Narula
    MIT CSAIL
    CRAFT Budapest, April 2015
    Don’t Give Up on Serializability Just Yet

  2. @neha
    •  PhD candidate at MIT
    •  Formerly at Google
    •  Research in fast transactions for multi-core databases and distributed systems

  3. (image slide)

  4. Distributed Database
    Partition data across multiple servers for higher performance

  5. Problem
    Applications experience write contention on popular data

  6. Serial Execution on the Same Records in a Distributed Database
    server 0: INCR(x,1)
    server 1: INCR(x,1)
    server 2: INCR(x,1)
    Increments on the same records execute one at a time and require coordination:
    1) Network calls
    2) Waiting for locks

  7. Replicated Counter
    Store local counters v0, v1, v2 on each server; each +1 updates only the local copy
    counter value = v0 + v1 + v2
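    As a minimal sketch of this idea in Go (the language the talk’s system is written in): each server increments only its own local copy, and a read reconciles by summing all copies. The ReplicatedCounter type and its methods are illustrative assumptions, not PhaseDB’s actual API.

    package main

    import (
        "fmt"
        "sync"
        "sync/atomic"
    )

    // ReplicatedCounter keeps one local counter per server.
    type ReplicatedCounter struct {
        locals []int64
    }

    func NewReplicatedCounter(nServers int) *ReplicatedCounter {
        return &ReplicatedCounter{locals: make([]int64, nServers)}
    }

    // Incr runs on server s and touches only that server's local copy,
    // so increments on different servers never contend.
    func (c *ReplicatedCounter) Incr(s int, n int64) {
        atomic.AddInt64(&c.locals[s], n)
    }

    // Value reconciles: the counter's value is the sum of all local copies.
    func (c *ReplicatedCounter) Value() int64 {
        var total int64
        for i := range c.locals {
            total += atomic.LoadInt64(&c.locals[i])
        }
        return total
    }

    func main() {
        c := NewReplicatedCounter(3)
        var wg sync.WaitGroup
        for s := 0; s < 3; s++ {
            wg.Add(1)
            go func(s int) { // each "server" increments in parallel
                defer wg.Done()
                c.Incr(s, 1)
            }(s)
        }
        wg.Wait()
        fmt.Println(c.Value()) // prints 3
    }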

  8. Increments on the Same Record Can Execute in Parallel
    server 0: INCR(x0,1)
    server 1: INCR(x1,1)
    server 2: INCR(x2,1)
    •  Increments on the same record can proceed in parallel on local counters
    •  No network calls, no shared locks
    Use replicated counters for x

  9. Challenge
    What about more complex functions?

  10. (diagram: everything a retweet does)
    increment retweet count
    insert to my timeline
    insert to my followers’ timelines
    insert to list of retweeters

  11. Retweet
    Retweet(tweet, user) {
      x := GET(tweet)
      if !x {
        return // tweet deleted
      }
      INCR(rt_count:tweet, 1)     // increment
      INSERT(rt_list:tweet, user) // inserts
      INSERT(timeline:user, x)
      followers := GET(following:user)
      for f := range followers {
        INSERT(timeline:f, x)
      }
    }

  12. Many Operations in Retweet Can Execute in Parallel
    count0: 2   rts0: {alice, bob}
    count1: 1   rts1: {eve}
    count2: 0   rts2: { }
    retweet count = count0 + count1 + count2
    rt list = rts0 ∪ rts1 ∪ rts2
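    A hedged Go sketch of that reconciliation: per-server counts are summed (INCR commutes) and per-server retweeter sets are unioned (inserting into a set commutes). The localState type and reconcile function are hypothetical, for illustration only.

    package main

    import "fmt"

    // localState is one server's split copy of a tweet's retweet state.
    type localState struct {
        count int
        rts   map[string]bool // set of retweeters
    }

    // reconcile merges all local copies: counts add and sets union,
    // so the merged result is independent of execution order.
    func reconcile(locals []localState) (int, map[string]bool) {
        total := 0
        all := map[string]bool{}
        for _, l := range locals {
            total += l.count
            for u := range l.rts {
                all[u] = true
            }
        }
        return total, all
    }

    func main() {
        locals := []localState{
            {2, map[string]bool{"alice": true, "bob": true}}, // server 0
            {1, map[string]bool{"eve": true}},                // server 1
            {0, map[string]bool{}},                           // server 2
        }
        count, rts := reconcile(locals)
        fmt.Println(count, len(rts)) // prints 3 3
    }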

  13. Retweet(tweet, user) {
      x := GET(tweet)
      if !x {
        return // tweet deleted
      }
      INCR(rt_count:tweet, 1)
      INSERT(rt_list:tweet, user)
      INSERT(timeline:user, x)
      followers := GET(follow:user)
      for f := range followers {
        INSERT(timeline:f, x)
      }
    }

    DeleteTweet(tweet, user) {
      x := GET(tweet)
      rts := GET(rt_list:tweet)
      DELETE(rt_list:tweet)
      followers := GET(follow:user)
      DELETE(rt_count:tweet)
      for u := range rts {
        REMOVE(timeline:u, x)
      }
      for f := range followers {
        REMOVE(timeline:f, x)
      }
      DELETE(tweet)
    }

  14. (the same Retweet and DeleteTweet bodies, interleaved concurrently)
    Retweet’s GET(tweet) can succeed just before DeleteTweet removes the tweet
    and its rt_list; Retweet then inserts x into timelines that DeleteTweet has
    already swept.
    Result: Deleted tweets left around in timelines!

  15. Problem
    •  Difficult to reason about concurrent interleavings
    •  Might result in incorrect, unrecoverable state

  16. Talk
    •  Serializability (ACID Transactions)
    •  PhaseDB
    •  Experimental Results

  17. ACID Transactions
    Atomic: the whole thing happens or not
    Consistent: application-defined correctness
    Isolated: other transactions do not interfere
    Durable: can recover correctly from a crash

    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
    BEGIN TRANSACTION
    ...
    COMMIT

  18. mysql> BEGIN TRANSACTION
    RETWEET(...)
    COMMIT

    mysql> BEGIN TRANSACTION
    DELETE_TWEET(...)
    COMMIT

  19. RetweetTxn(tweet, user) {
      x := GET(tweet)
      if !x {
        return // tweet deleted
      }
      INCR(rt_count:tweet, 1)
      INSERT(rt_list:tweet, user)
      INSERT(timeline:user, x)
      followers := GET(follow:user)
      for f := range followers {
        INSERT(timeline:f, x)
      }
    }

    DeleteTweetTxn(tweet, user) {
      x := GET(tweet)
      rts := GET(rt_list:tweet)
      DELETE(rt_list:tweet)
      followers := GET(follow:user)
      DELETE(rt_count:tweet)
      for u := range rts {
        REMOVE(timeline:u, x)
      }
      for f := range followers {
        REMOVE(timeline:f, x)
      }
      DELETE(tweet)
    }

  20. RETWEET and DELETE TWEET submitted concurrently
    → interleaved/parallel execution
    → transactions appear atomic: RETWEET, then DELETE TWEET

  21. RETWEET and DELETE TWEET submitted concurrently
    → interleaved/parallel execution
    → transactions appear atomic: DELETE TWEET, then RETWEET
    RETWEET sees the delete and returns early:
      x := GET(tweet)
      if !x {
        return // tweet deleted
      }

  22. What is Serializability?
    The result of executing a set of transactions is the same as if those
    transactions had executed one at a time, in some serial order.

    If each transaction preserves correctness, the DB will be in a correct state.

    We can pretend there’s no concurrency!

  23. Benefits of Serializability
    •  Do not have to reason about interleavings
    •  Do not have to express invariants separately from the code!

  24. Serializability Costs
    •  On a multi-core database: serialization and cache line transfers
    •  On a distributed database: serialization and network calls
    Concurrency control: locking and coordination

  25. Talk
    •  Serializability (ACID Transactions)
    •  PhaseDB
    •  Experimental Results

  26. Key Insight
    •  Many records are mostly accessed one way:
       –  Reads
       –  Updating aggregates
       –  Index inserts
    •  Plan to perform these operations without coordination
    •  Coordinate for incompatible operations

  27. Replicate for Reads
    Plan: store local copies of the record (x0, x1, x2) on each server; mark the record as read-only
    GET(x) on any server becomes a read of its local copy: GET(x0), GET(x1), GET(x2)

  28. When Replicated, Writes Are Slower
    Writers have to lock all copies: PUT(x) must lock x0, x1, x2
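    A hedged Go sketch of this trade-off, assuming a simple per-copy lock (the readReplicated type is illustrative, not PhaseDB’s protocol): a GET touches only its local copy, while a PUT must lock every copy before installing the new value.

    package main

    import (
        "fmt"
        "sync"
    )

    // readReplicated is a record replicated for reads:
    // one copy per server, each guarded by its own lock.
    type readReplicated struct {
        mu     []sync.RWMutex
        copies []string
    }

    // Get runs on server s and reads only the local copy:
    // no network call, no shared lock.
    func (r *readReplicated) Get(s int) string {
        r.mu[s].RLock()
        defer r.mu[s].RUnlock()
        return r.copies[s]
    }

    // Put is the slow path: the writer locks *all* copies so no
    // reader can observe a mix of old and new values.
    func (r *readReplicated) Put(v string) {
        for i := range r.mu {
            r.mu[i].Lock()
        }
        for i := range r.copies {
            r.copies[i] = v
        }
        for i := range r.mu {
            r.mu[i].Unlock()
        }
    }

    func main() {
        r := &readReplicated{
            mu:     make([]sync.RWMutex, 3),
            copies: []string{"v1", "v1", "v1"},
        }
        fmt.Println(r.Get(1)) // fast local read on server 1
        r.Put("v2")           // slow: locks all three copies
        fmt.Println(r.Get(2)) // prints v2
    }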

  29. PhaseDB
    A (research) distributed, transactional database
    •  Choose an execution plan based on operations on popular records
    •  Split a record into local copies on every server
    •  Coordination-free execution in the common case
    •  Coordinate in the uncommon case to maintain serializability

  30. Operation Model
    Developers write transactions as stored procedures which are composed of
    operations on keys and values:

    value GET(k)               // traditional key/value operations
    void PUT(k,v)

    void INCR(k,n)             // operations on numeric values
    void MAX(k,n)              // which modify the existing value
    void MULT(k,n)

    void OPUT(k,v,o)           // ordered PUT, insert to an ordered
    void TOPK_INSERT(k,v,o)    // list, user-defined functions
    void UDF(k,v,a)
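    One way to picture this operation model is as a small interface that stored procedures are written against. A hedged Go sketch follows; the interface, method signatures, and value types are assumptions for illustration, not PhaseDB’s actual API.

    package phasedb

    // Ops is a hypothetical Go rendering of the slide's operation model.
    type Ops interface {
        Get(k string) (interface{}, bool)            // value GET(k)
        Put(k string, v interface{})                 // void PUT(k,v)
        Incr(k string, n int64)                      // void INCR(k,n)
        Max(k string, n int64)                       // void MAX(k,n)
        Mult(k string, n int64)                      // void MULT(k,n)
        OPut(k string, v interface{}, o int64)       // void OPUT(k,v,o)
        TopKInsert(k string, v interface{}, o int64) // void TOPK_INSERT(k,v,o)
        UDF(k string, v interface{}, a interface{})  // void UDF(k,v,a)
    }

    // A stored procedure is then just a function over this interface,
    // e.g. func RetweetTxn(ops Ops, tweet, user string) { ... }.
    type Txn func(ops Ops)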

  31. Execution Plans in PhaseDB
    •  Replicate for reads (GET)
    •  Replicate for commutative operations (increment, inserts)
    •  Track last write
    •  Log operations; apply later

  32. Operation Model
    Developers write transactions as stored procedures which are composed of
    operations on keys and values:

    value GET(k)               // replicate for reads
    void PUT(k,v)              // save last write

    void INCR(k,n)             // replicate for
    void MAX(k,n)              // commutative
    void MULT(k,n)             // operations

    void OPUT(k,v,o)           // log operations
    void TOPK_INSERT(k,v,o)
    void UDF(k,v,a)

  33. What Kinds of Applications Benefit?
    •  Twitter
    •  TPC-C
    •  Auction website RUBiS

    BidTxn(bidder, amount, item) {
      INCR(item.num_bids, 1)
      MAX(item.max_bid, amount)
      OPUT(item.max_bidder, bidder, amount)
      PUT(NewBidKey(), Bid{bidder, amount, item})
    }

  34. Challenge #1
    Correctly executing transactions with incompatible operations

  35. Doppel, OSDI 2014

  36. Challenge #2
    Popular data changes over time and is unpredictable

  37. (image slide)

  38. PhaseDB Uses Dynamic Execution Plans
    •  Start out with no plans and no split data
    •  Sample remote operations on records and lock wait times
    •  Initiate a plan based on the most common operation (see the sketch below)
    •  Stop the plan if the common operation changes
    PhaseDB handles dynamic, changing workloads
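    A hedged sketch of that plan selection in Go: tally the kinds of operations sampled on one contended record and pick the matching plan once one kind dominates. The sampler type, the 90% threshold, and the plan names are invented for illustration, not PhaseDB’s actual policy.

    package main

    import "fmt"

    type plan int

    const (
        noPlan plan = iota
        replicateForReads
        replicateForCommutative
        logOperations
    )

    // sampler tallies sampled operations on one contended record.
    type sampler struct {
        gets, incrs, udfs, total int
    }

    func (s *sampler) observe(op string) {
        s.total++
        switch op {
        case "GET":
            s.gets++
        case "INCR":
            s.incrs++
        default:
            s.udfs++
        }
    }

    // choose initiates a plan once one operation kind makes up
    // at least 90% of a big enough sample.
    func (s *sampler) choose() plan {
        if s.total < 100 {
            return noPlan // not enough samples yet
        }
        switch {
        case s.gets*10 >= s.total*9:
            return replicateForReads
        case s.incrs*10 >= s.total*9:
            return replicateForCommutative
        case s.udfs*10 >= s.total*9:
            return logOperations
        }
        return noPlan
    }

    func main() {
        var s sampler
        for i := 0; i < 95; i++ {
            s.observe("GET")
        }
        for i := 0; i < 5; i++ {
            s.observe("INCR")
        }
        fmt.Println(s.choose() == replicateForReads) // prints true
    }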

  39. Sample Transactions During Execution
    server 0: GET(x)
    server 1: GET(x) PUT(y,2)
    server 2: GET(x) PUT(z,1)
    server 3: GET(x) PUT(y,2)
    •  Suppose x is on server 0 (its home server)
    •  The home server watches remote accesses, tallying each one (+1)
    → Split x for reads

  40. Summary of PhaseDB
    •  Changes the data layout and allowed operations to align with common operations
    •  Executes common operations in parallel when a record is split
    •  Samples to automatically determine a good plan for contended records and adjust to changing workloads

  41. Talk
    •  Serializability (ACID Transactions)
    •  PhaseDB
    •  Experimental Results

  42. Implementation
    •  PhaseDB implemented as a multithreaded Go server
    •  Transactions are procedures written in Go
    •  Experiments on 8 servers using two cores each
    •  All data fits in memory

  43. All PhaseDB Plans Have Higher Throughput Than 2PL
    (bar chart: throughput in txns/sec of PhaseDB vs 2PL for the read,
    commute, overwrite, and log plans)
    Reduced network calls; increments in parallel
    Distributed transactions with GET, INCR, PUT, and UDF operations on a popular record

  44. TPC-C
    •  Warehouses, items for sale, stock, orders
    •  NewOrderTxn: issue an order for 5-15 items
    •  10% of the time requires a distributed transaction to retrieve stock from a remote warehouse

  45. TPC-C NewOrderTxn
    order_id := GET(req.DistrictNextOIDKey)              // mostly read
    order_id = order_id + 1
    PUT(req.DistrictNextOIDKey, order_id)
    for i := range req.Items {
      item := GET(req.Items[i])
      Restock(req.StockKey[i], req.Amount[i])            // user-defined function
      INCR(req.StockYTDKey[i], req.Amount[i]*item.Price) // commutative operation
      if req.Swid[i] != wid {
        INCR(req.StockRemoteKey[i], 1)                   // commutative operation
      }
      // Construct orderline to insert
      PUT(order_line_id, order_line)
    }
    . . .

  46. TPC-C Performance Improves Over Time By Replicating Data
    (chart: throughput in txns/sec over time in seconds for PhaseDB and 2PL;
    PhaseDB’s throughput climbs as the run progresses)
    PhaseDB detects contended operations and splits records

  47. TPC-C NewOrderTxn
    order_id := GET(req.DistrictNextOIDKey)              // replicated for reads
    order_id = order_id + 1
    PUT(req.DistrictNextOIDKey, order_id)
    for i := range req.Items {
      item := GET(req.Items[i])
      Restock(req.StockKey[i], req.Amount[i])            // logged operations
      INCR(req.StockYTDKey[i], req.Amount[i]*item.Price) // replicated for ADD
      if req.Swid[i] != wid {
        INCR(req.StockRemoteKey[i], 1)                   // replicated for ADD
      }
      // Construct orderline to insert
      PUT(order_line_id, order_line)
    }
    . . .

  48. Improving Serializability Performance

    Technique                       Systems
    Single-partition transactions   Megastore
    Transaction chopping            Lynx, ROCOCO
    Commutative locking             Escrow transactions, abstract data types, Doppel, PhaseDB
    Deterministic ordering          Granola, Calvin

    Commutativity in Distributed Systems: Topics
    •  Red/Blue Consistency
    •  Commutative Replicated Datatypes
    •  Counting sets (Walter)

  49. Conclusion
    •  If it performs well enough, use SERIALIZABLE
    •  Workloads are regular; we can optimize for the common case
    •  Still many opportunities to improve performance while retaining easy-to-understand semantics

    http://nehanaru.la
    @neha

  50. Replicate for Commutative Operations
    Update local copies of records: INCR(x,1) on any server becomes an
    increment of its local copy (INCR(x0,1), INCR(x1,1), INCR(x2,1))
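    The same splitting idea works for any commutative operation, not just addition. A hedged Go sketch with MAX: each server keeps a local maximum, and reconciliation takes the max of the local maxima. The maxRegister type is illustrative only.

    package main

    import "fmt"

    // maxRegister is a record split for the commutative MAX operation,
    // with one local maximum per server.
    type maxRegister struct {
        locals []int64
    }

    // Max runs on server s against its local copy only.
    func (m *maxRegister) Max(s int, n int64) {
        if n > m.locals[s] {
            m.locals[s] = n
        }
    }

    // Value reconciles: because MAX commutes, the order of local
    // updates never matters, and merging is just another MAX.
    func (m *maxRegister) Value() int64 {
        var v int64
        for _, l := range m.locals {
            if l > v {
                v = l
            }
        }
        return v
    }

    func main() {
        m := &maxRegister{locals: make([]int64, 3)}
        m.Max(0, 7) // server 0
        m.Max(1, 3) // server 1
        m.Max(2, 9) // server 2
        fmt.Println(m.Value()) // prints 9
    }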

  51. When Replicated for Increment, Reads Are Slower
    Readers have to lock all copies: GET(x) must lock x0, x1, x2

  52. Batching Amortizes the Cost of Reconciliation
    While x is split, servers keep executing local INCR(xi,1) operations;
    incoming GET(x) transactions are stashed until the next cycle.
    •  Wait to accumulate stashed transactions; batch them for a cycle
    •  Amortize the cost of reconciliation over many transactions

  53. Split Execution
    Before the split:
      server 0: INCR(x,1)
      server 1: INCR(x,1) PUT(y,2)
      server 2: INCR(x,1) PUT(z,1)
      server 3: INCR(x,1) PUT(y,2)
    After the split:
      server 0: INCR(x0,1)
      server 1: INCR(x1,1) PUT(y,2)
      server 2: INCR(x2,1) PUT(z,1)
      server 3: INCR(x3,1) PUT(y,2)
    •  When a record (x) is split, operations on it are transformed into operations on local copies (x0, x1, x2, x3)
    •  The home server sends copies to the other servers

  54. •  Transactions can operate on split and non-split records
    •  The rest of the records (y, z) use 2PL+2PC
    •  2PL+2PC ensures serializability for the non-split parts of the transaction
      server 0: INCR(x0,1)
      server 1: INCR(x1,1) PUT(y,2)
      server 2: INCR(x2,1) PUT(z,1)
      server 3: INCR(x3,1) PUT(y,2)

  55. •  Split records have assigned operations
    •  A GET(x) cannot be processed correctly in the current state
    •  Block the operation to execute after reconciliation (see the sketch below)
      server 0: INCR(x0,1)           GET(x) (blocked)
      server 1: INCR(x1,1) PUT(y,2)  INCR(x1,1)
      server 2: INCR(x2,1) PUT(z,1)  INCR(x2,1)
      server 3: INCR(x3,1) PUT(y,2)
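    A hedged Go sketch of the blocking behavior, using a condition variable as a stand-in for the real protocol: while the record is split for INCR, a GET is stashed and only answered after a reconciliation cycle merges the local copies. The splitRecord type is illustrative, not PhaseDB’s implementation.

    package main

    import (
        "fmt"
        "sync"
    )

    // splitRecord is split for INCR; reads are incompatible with the
    // split state and must wait for a reconciliation cycle.
    type splitRecord struct {
        mu     sync.Mutex
        cond   *sync.Cond
        split  bool
        value  int64   // reconciled value
        locals []int64 // per-server copies while split
    }

    func newSplitRecord(nServers int) *splitRecord {
        r := &splitRecord{split: true, locals: make([]int64, nServers)}
        r.cond = sync.NewCond(&r.mu)
        return r
    }

    // Incr is the assigned (compatible) operation: purely local.
    func (r *splitRecord) Incr(s int, n int64) {
        r.mu.Lock()
        r.locals[s] += n
        r.mu.Unlock()
    }

    // Get is incompatible while split: it blocks until a cycle runs.
    func (r *splitRecord) Get() int64 {
        r.mu.Lock()
        defer r.mu.Unlock()
        for r.split {
            r.cond.Wait()
        }
        return r.value
    }

    // Reconcile is the cycle: merge the local copies into the record's
    // value, leave the split state, and unblock stashed reads.
    func (r *splitRecord) Reconcile() {
        r.mu.Lock()
        for i, l := range r.locals {
            r.value += l
            r.locals[i] = 0
        }
        r.split = false
        r.cond.Broadcast()
        r.mu.Unlock()
    }

    func main() {
        r := newSplitRecord(4)
        for s := 0; s < 4; s++ {
            r.Incr(s, 1) // coordination-free local increments
        }
        go r.Reconcile()     // the home server initiates a cycle
        fmt.Println(r.Get()) // blocks until the cycle completes; prints 4
    }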

  56. •  The home server initiates a cycle
    •  All servers hear that they should reconcile their local copies of x
    •  They stop processing local-copy operations

  57. •  Reconcile state to the owning server: x = x + x0 + x1 + x2 + x3
    •  Wait until all servers have finished reconciliation
    •  Unblock x for other operations
    (server 0 gathers GET(x1), GET(x2), GET(x3) during the cycle)

  58. (cycle continues: once x = x + x0 + x1 + x2 + x3 has been computed and
    all servers have finished reconciliation, the blocked GET(x) executes
    against the reconciled value)

  59. •  Inform the other servers that the cycle is completed
    •  The other servers can begin using local copies again:
      server 0: INCR(x0,1)   GET(x)
      server 1: INCR(x1,1)

  60. PhaseDB Has Higher Throughput and Better Latency than 2PL+2PC
    (chart: write latency in ms vs total throughput in txn/sec for
    PhaseDB writes and 2PL writes)