Serial Execution on the Same Records in a Distributed Database
[Diagram: servers 0, 1, and 2 each issue INCR(x,1); the increments run one after another along the time axis]
Increments on the same records execute one at a time and require coordination:
1) Network calls
2) Waiting for locks
Increments on the Same Record Can Execute in Parallel
[Diagram: x is split into local counters x0, x1, x2; servers 0, 1, and 2 apply INCR(x0,1), INCR(x1,1), INCR(x2,1) at the same time, each adding 1 locally]
• Increments on the same record can proceed in parallel on local counters
• No network calls, no shared locks
• Use replicated counters for x (a sketch of the idea follows)
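As a rough illustration of the replicated-counter idea (not PhaseDB's code), the Go sketch below keeps one local copy of the counter per server: increments touch only that server's copy, and a read has to combine every copy. The shardedCounter type and its methods are invented for this sketch.

package main

import (
    "fmt"
    "sync"
)

// shardedCounter: one local copy per server, so increments never contend
// on a shared lock; only a read has to gather all copies.
type shardedCounter struct {
    mu     []sync.Mutex
    copies []int64
}

func newShardedCounter(nServers int) *shardedCounter {
    return &shardedCounter{
        mu:     make([]sync.Mutex, nServers),
        copies: make([]int64, nServers),
    }
}

// Incr runs entirely against server s's local copy.
func (c *shardedCounter) Incr(s int, n int64) {
    c.mu[s].Lock()
    c.copies[s] += n
    c.mu[s].Unlock()
}

// Read is the expensive, coordinated path: it combines all local copies.
func (c *shardedCounter) Read() int64 {
    var total int64
    for s := range c.copies {
        c.mu[s].Lock()
        total += c.copies[s]
        c.mu[s].Unlock()
    }
    return total
}

func main() {
    x := newShardedCounter(3)
    x.Incr(0, 1)
    x.Incr(1, 1)
    x.Incr(2, 1)
    fmt.Println(x.Read()) // 3
}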
// Retweet: bump the retweet count and push the tweet to the retweeter's and followers' timelines
x := GET(tweet)
if !x {
    return // tweet deleted
}
INCR(rt_count:tweet, 1)
INSERT(rt_list:tweet, user)
INSERT(timeline:user, x)
followers := GET(follow:user)
for f := range followers {
    INSERT(timeline:f, x)
}
// Delete tweet: remove the tweet and scrub it from retweeters' and followers' timelines
x := GET(tweet)
rts := GET(rt_list:tweet)
DELETE(rt_list:tweet)
followers := GET(follow:user)
DELETE(rt_count:tweet)
for u := range rts {
    REMOVE(timeline:u, x)
}
for f := range followers {
    REMOVE(timeline:f, x)
}
DELETE(tweet)

Result: when retweet and delete interleave, deleted tweets are left around in timelines!
ACID Transactions
• Atomic: whole thing happens or not
• Consistent: application-defined correctness
• Isolated: other transactions do not interfere
• Durable: can recover correctly from a crash

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
BEGIN TRANSACTION
...
COMMIT
What is Serializability?
The result of executing a set of transactions is the same as if those transactions had executed one at a time, in some serial order.
If each transaction preserves correctness, the DB will be in a correct state.
Serializability Costs
• On a multi-core database: serialization and cache line transfers
• On a distributed database: serialization and network calls
• Concurrency control: locking and coordination
Key Insight
• Many records are mostly accessed one way
– Reads
– Updating aggregates
– Index inserts
• Plan to perform these operations without coordination
• Coordinate for incompatible operations
PhaseDB
A (research) distributed, transactional database
• Choose an execution plan based on operations on popular records
• Split a record into local copies on every server
• Coordination-free execution in the common case
• Coordinate in the uncommon case to maintain serializability.
Operation Model
Developers write transactions as stored procedures which are composed of operations on keys and values (a sketch of such a procedure follows the list):

• value GET(k), void PUT(k,v): traditional key/value operations. Plans: replicate for reads (GET); save last write (PUT)
• void INCR(k,n), void MAX(k,n), void MULT(k,n): operations on numeric values which modify the existing value. Plan: replicate for commutative operations
• void OPUT(k,v,o), void TOPK_INSERT(k,v,o), void UDF(k,v,a): ordered PUT, insert to an ordered list, user-defined functions. Plan: log operations
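To make the operation model concrete, here is a hedged Go sketch of what such a stored procedure might look like. The Tx interface, its method names, and the use of TOPK_INSERT for timeline inserts are assumptions made for this example, not PhaseDB's published API.

package txn

type Key string

// Tx is a hypothetical transaction handle exposing part of the operation model.
type Tx interface {
    Get(k Key) interface{}
    Incr(k Key, n int64)
    TopKInsert(k Key, v interface{}, o int64)
}

// RetweetTxn is the earlier retweet procedure expressed against the operation
// model: a commutative INCR on the aggregate plus ordered-list inserts.
func RetweetTxn(tx Tx, tweet, user Key) {
    x := tx.Get(tweet)
    if x == nil {
        return // tweet deleted
    }
    tx.Incr("rt_count:"+tweet, 1)            // commutative update
    tx.TopKInsert("rt_list:"+tweet, user, 0) // ordered-list insert (assumed)
    tx.TopKInsert("timeline:"+user, x, 0)
    followers, _ := tx.Get("follow:" + user).([]Key)
    for _, f := range followers {
        tx.TopKInsert("timeline:"+f, x, 0)
    }
}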
PhaseDB Uses Dynamic Execution Plans
• Start out with no plans and no split data
• Sample remote operations on records and lock wait times
• Initiate a plan based on the most common operation
• Stop the plan if the common operation changes
PhaseDB handles dynamic, changing workloads.
Sample Transactions During Execution
[Diagram: servers 0-3 run transactions such as GET(x), GET(x) PUT(y,2), and GET(x) PUT(z,1); each remote access to x adds +1 to a counter at its home server, which eventually decides "Split x for reads"]
• Suppose x is on server 0 (its home server)
• The home server watches remote accesses (a sketch of the plan-selection step follows)
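The following Go sketch shows one way the plan-selection step could work, under assumptions: the home server tallies sampled remote operations on a record by type and initiates a plan once a single type clearly dominates. The plan names mirror the read/commute/overwrite/log plans in the slides; the thresholds and type names are invented.

package sampler

// OpType classifies a sampled remote operation on a record.
type OpType int

const (
    OpGet OpType = iota
    OpPut
    OpIncr
    OpOther // OPUT, TOPK_INSERT, UDF, ...
)

// Plan is the execution plan that sampling would initiate.
type Plan int

const (
    PlanNone      Plan = iota
    PlanRead           // replicate for reads
    PlanOverwrite      // save last write
    PlanCommute        // replicate for commutative operations
    PlanLog            // log operations
)

// choosePlan inspects the sampled counts for one record and returns a plan
// if one operation type accounts for most of the remote accesses.
func choosePlan(counts map[OpType]int, minSamples int, dominance float64) Plan {
    total := 0
    for _, c := range counts {
        total += c
    }
    if total < minSamples {
        return PlanNone
    }
    for op, c := range counts {
        if float64(c)/float64(total) >= dominance {
            switch op {
            case OpGet:
                return PlanRead
            case OpPut:
                return PlanOverwrite
            case OpIncr:
                return PlanCommute
            default:
                return PlanLog
            }
        }
    }
    return PlanNone
}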
Summary of PhaseDB
• Changes the data layout and allowed operations to align with common operations
• Executes common operations in parallel when a record is split
• Samples to automatically determine a good plan for contended records and adjust to changing workloads
Implementation
• PhaseDB implemented as a multithreaded Go server
• Transactions are procedures written in Go
• Experiments on 8 servers using two cores each
• All data fits in memory
All PhaseDB Plans Have Higher Throughput Than 2PL
[Chart: throughput of the read, commute, overwrite, and log plans versus 2PL; the annotations attribute PhaseDB's advantage to reduced network calls and to increments executing in parallel]
Workload: distributed transactions with GET, INCR, PUT, and UDF operations on a popular record
TPC-C
• Warehouses, items for sale, stock, orders
• NewOrderTxn: Issue an order for 5-15 items
• 10% of the time requires a distributed transaction to retrieve stock from a remote warehouse
Conclusion
• If it performs well enough, use SERIALIZABLE
• Workloads are regular; we can optimize for the common case
• Still many opportunities to improve performance while retaining easy-to-understand semantics
http://nehanaru.la
@neha
Batching Amortizes the Cost of Reconciliation
[Diagram: while x is split, servers 0-3 keep applying INCR to their local copies x0..x3 and to other records (y, z); incoming GET(x) transactions are stashed until the next cycle reconciles the copies]
• Wait to accumulate stashed transactions, then batch them for one cycle (sketched below)
• Amortize the cost of reconciliation over many transactions
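Here is a single-process Go sketch of the stashing and batching idea, under assumptions (one record, a hypothetical splitRecord type, in-memory channels instead of RPCs): reads of a split record are queued, and one reconciliation answers the whole batch.

package stash

import "sync"

// splitRecord holds a counter that may currently be split into per-server copies.
type splitRecord struct {
    mu      sync.Mutex
    split   bool         // is the record split right now?
    value   int64        // reconciled value, valid when !split
    stashed []chan int64 // readers waiting for the next cycle
}

// Get answers immediately if the record is not split; otherwise the read is
// stashed and blocks until the next reconciliation cycle.
func (r *splitRecord) Get() int64 {
    r.mu.Lock()
    if !r.split {
        v := r.value
        r.mu.Unlock()
        return v
    }
    ch := make(chan int64, 1)
    r.stashed = append(r.stashed, ch)
    r.mu.Unlock()
    return <-ch
}

// Reconcile folds the per-server copies into the record and answers every
// stashed read in one batch, amortizing the cycle's cost.
func (r *splitRecord) Reconcile(localCopies []int64) {
    r.mu.Lock()
    for _, c := range localCopies {
        r.value += c
    }
    r.split = false
    waiting := r.stashed
    r.stashed = nil
    v := r.value
    r.mu.Unlock()
    for _, ch := range waiting {
        ch <- v
    }
}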
Split Execution
[Diagram: before the split, servers 0-3 issue INCR(x,1) alongside PUT(y,2) and PUT(z,1); after the split, the same transactions apply INCR(x0,1), INCR(x1,1), INCR(x2,1), INCR(x3,1) to per-server local copies]
• When a record (x) is split, operations on it are transformed into operations on local copies (x0, x1, x2, x3); see the routing sketch below
• The home server sends copies to the other servers
• Transactions can operate on split and non-split records
• The rest of the records (y, z) use 2PL+2PC
• 2PL+2PC ensures serializability for the non-split parts of the transaction
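A minimal sketch of that routing step, with invented names and no real locking: once x is split, an INCR on x is rewritten into an INCR on this server's local copy x_i, while records that are not split still take the 2PL path.

package splitexec

import "fmt"

type server struct {
    id     int
    split  map[string]bool  // records currently split for commutative operations
    local  map[string]int64 // this server's local copies (x -> x_i)
    locked map[string]int64 // records reached through the 2PL+2PC path (y, z, ...)
}

// incr applies INCR(k, n) to the local copy if k is split, otherwise it would
// acquire locks and run under 2PL+2PC (elided here).
func (s *server) incr(k string, n int64) {
    if s.split[k] {
        localKey := fmt.Sprintf("%s_%d", k, s.id) // INCR(x, n) becomes INCR(x_i, n)
        s.local[localKey] += n
        return
    }
    s.locked[k] += n // placeholder for the locking path
}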
[Diagram: while x is split, servers keep applying INCR to their local copies; a GET(x) arrives and is blocked]
• Split records have assigned operations
• A read of x cannot be processed correctly in the current state
• Block the operation so it executes after reconciliation
[Diagram: the home server (server 0) starts a cycle for x; servers 1-3 stop applying INCR to their local copies while the GET(x) stays blocked]
• Home server initiates a cycle
• All servers hear they should reconcile their local copies of x
• Servers stop processing local-copy operations
[Diagram: during the cycle, each server sends its local copy of x back to the home server, which combines them as x = x + x0 + x1 + x2 + x3; the blocked GET(x) can then run]
• Reconcile state to the owning server: x = x + x0 + x1 + x2 + x3
• Wait until all servers have finished reconciliation
• Unblock x for other operations (a toy simulation of the whole cycle follows)
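To tie the cycle steps together, here is a toy single-process Go simulation that uses channels in place of network messages. The message flow (start cycle, collect local copies, sum at the home server, announce completion) follows the slides, but every name and detail is an assumption for illustration.

package main

import (
    "fmt"
    "sync"
)

func main() {
    const nRemotes = 3
    x := int64(100) // reconciled value of x at its home server

    startCycle := make(chan struct{}) // home -> remotes: reconcile now
    contributions := make(chan int64) // remotes -> home: local copies of x
    cycleDone := make(chan struct{})  // home -> remotes: resume using local copies

    var wg sync.WaitGroup
    for i := 0; i < nRemotes; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            local := int64(id + 1) // pretend these increments accumulated locally
            <-startCycle           // stop processing local-copy operations
            contributions <- local // reconcile state to the owning server
            <-cycleDone            // cycle completed: may use a fresh local copy again
        }(i)
    }

    // Home server initiates the cycle and waits for every contribution.
    close(startCycle)
    for i := 0; i < nRemotes; i++ {
        x += <-contributions // x = x + x0 + x1 + x2 + x3
    }
    fmt.Println("GET(x) after reconciliation:", x) // blocked reads can run now
    close(cycleDone)
    wg.Wait()
}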
[Diagram: once the cycle has completed, x is split again; servers resume INCR(x0,1), INCR(x1,1) on their local copies, and a new GET(x) arrives]
• Inform the other servers that the cycle is completed
• The other servers can begin using local copies again