PWLSF#4 => Joel VanderWerf on Calvin

Joel VanderWerf stops by to talk about Calvin. This time we have 3 relevant papers:

• Calvin: Fast Distributed Transactions for Partitioned Database Systems, SIGMOD 2012 by Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, and Daniel J. Abadi.

• Consistency Tradeoffs in Modern Distributed Database System Design, 2012 by Daniel J. Abadi.

• Modularity and Scalability in Calvin, IEEE 2013 by Alexander Thomson and Daniel J. Abadi.

Video link coming soon!

Papers_We_Love

June 05, 2014

Transcript

  1. The Case for Determinism in Database Systems (Thomson, Abadi, 2010)
     Calvin: Fast Distributed Transactions for Partitioned Database Systems (Thomson, Diamond, Weng, Ren, Shao, Abadi, 2012)
     Modularity and Scalability in Calvin (Thomson, Abadi, 2013)
  2. Consensus early in the pipeline.
     Replicate inputs rather than effects.
     Read/write sets known before execution.
     Transaction reordering with stricter constraints.
     Commit is a predicate on data.
     Minimize coordination.
     Batching.
  3. Good for: • OLTP • Simple CRUD • High throughput • Cross-partition • WAN • Consistent replicas • Modularity
     Watch out for: • High-latency IO • Availability • Large read sets • Snapshot reads • Complex transactions, esp. with contention
  4. [Diagram: Clients send requests to four nodes, R1N1, R1N2, R2N1, R2N2; each node runs a Sequencer that feeds the rest of the pipeline.]
     The "R" in R1N1 (Replica 1) determines geographical location ("R = Region = Replica"); the "N" (Node 1) determines the partition (e.g., table set).
  5. [Diagram: Transaction requests T1, T2, T3, ... arrive at node R1N1. The Sequencer divides time into 10ms epochs (Epoch 1, Epoch 2, Epoch 3, Epoch 4). At 10ms the node writes the Epoch 1 batch {T1, T2} to the Log at key (R1N1, 1), which serves as the batch's GUID, and hands it to the rest of the pipeline.]
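
A minimal Ruby sketch of the batching described on slide 5: requests accumulate for the current epoch, and every 10ms the batch is written to the Log under the (node, epoch) GUID. The Sequencer class, its method names, and the log's put interface are illustrative assumptions, not code from Calvin or Spinoza.

    # Illustrative sketch only; the Log is assumed to expose put(key, value).
    class Sequencer
      EPOCH_MS = 10   # an external timer calls close_epoch every EPOCH_MS milliseconds

      def initialize(node_id, log)
        @node_id = node_id   # e.g. "R1N1"
        @log     = log       # eventually consistent store (Cassandra-like)
        @epoch   = 0
        @batch   = []
      end

      # Called for every transaction request arriving during the current epoch.
      def submit(txn_request)
        @batch << txn_request
      end

      # Called at each epoch boundary: replicate the *inputs*, keyed by GUID.
      def close_epoch
        @epoch += 1
        guid = [@node_id, @epoch]    # e.g. ["R1N1", 1]
        @log.put(guid, @batch)
        @batch = []
        guid                         # the GUID is later ordered in the Meta Log
      end
    end

With a toy in-memory log, submitting T1 and T2 and then closing the epoch leaves the log mapping ["R1N1", 1] to [T1, T2], matching the slide.
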
  6. The Log: Replication of Inputs
     Distributed key-value store, replicated on each node RiNj.
     Eventually consistent; suitable for high volume.
     ...Cassandra, for example.
  7. [Diagram: transaction requests arrive at R1N1, R1N2, R2N1, R2N2 over the epochs at 0ms, 10ms, 20ms, 30ms; each node writes its batch under a GUID such as (R1N1, 1).]
     The Log eventually contains all keys and batched requests: (R1N1, 1), (R1N2, 1), ...; (R1N1, 2), (R1N2, 2), ...
     (But keys may arrive in different orders at different nodes.)
  8. The Log contains:
     (R1N1, 1) → { T(1,1,1), T(1,1,2), ... }
     (R2N1, 1) → { T(2,1,1), T(2,1,2), ... }
     How to merge? What if some RiN1 is missing? Need consensus...
  9. The Meta Log: Ordering of Batched Requests
     Distributed key-value store, replicated on each node RiNj.
     Sequentially consistent; suitable for low volume.
     ...Zookeeper, for example.
  10. All nodes in the replication group {R1N1, R2N1} agree on a linear order, such as: R1N1, R2N1.
      Batches are merged as: B(1,1) = { T(1,1,1), T(1,1,2), ..., T(2,1,1), T(2,1,2), ... }
      Note: this preserves arrival order at any one node.
  11. Batches from different replication groups?
      R*N1 and R*N2 merge in a static order: B(1) = B(1,1) + B(2,1)
      (The merged batch is indexed by epoch; the per-group batches by (group, epoch).)
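
A small Ruby sketch of the merges on slides 10 and 11, assuming the Meta Log has already yielded the agreed linear order of GUIDs for the epoch; the function names and the Hash-like log interface are hypothetical.

    # Within a replication group: concatenate batches in the agreed GUID order.
    # This preserves arrival order at any one node.
    def merge_group_batches(log, ordered_guids)
      ordered_guids.flat_map { |guid| log.fetch(guid, []) }
    end

    # Across replication groups (N1, N2, ...): concatenate in a static order.
    def merge_epoch(group_batches_in_static_order)
      group_batches_in_static_order.flat_map { |batch| batch }
    end

    # Epoch 1, as on the slides:
    # b_1_1 = merge_group_batches(log, [["R1N1", 1], ["R2N1", 1]])   # group N1
    # b_2_1 = merge_group_batches(log, [["R1N2", 1], ["R2N2", 1]])   # group N2
    # b_1   = merge_epoch([b_1_1, b_2_1])
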
  12. The Sequencer is done with the epoch!
      Batches of requests at the epoch are replicated.
      Each batch is accessible by GUID.
      GUIDs are ordered in the Meta Log.
      The Meta Log is replicated to all nodes.
      Each node has the same ordered sequence of T's.
      No more consensus in the rest of the pipeline.
  13. At node R1N1: consume T(i,j,k) in order.
      Execute on the local store, with concurrency, preserving the invariant.
      No more consensus; only "light" coordination with other nodes.
  14. [Diagram, on R1N1: the Scheduler's Lock Manager feeds Executor threads 1, 2, ...; there is no lock awareness below that line. Each Executor has a Readcaster that broadcasts read results to peers such as R1N2. This broadcast is the only coordination in the rest of the pipeline.]
  15. Lock Manager
      Can be separate from storage.
      Logical (row) rather than physical (page) locks.
      The LM on R1N1 locks only the rows stored at R1N1. (Enough to preserve determinism.)
      Invariant: if T1 comes before T2 and both T1, T2 access row R, then lock(T1, R) is granted before lock(T2, R).
      Locks are held until the Executor completes the transaction.
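
Because the Scheduler requests locks in the agreed global order, one FIFO queue per row is enough to satisfy the invariant on slide 15. A minimal Ruby sketch with illustrative names (not the Calvin lock manager):

    class LockManager
      def initialize
        @queues = Hash.new { |h, row| h[row] = [] }   # row -> FIFO queue of txn ids
      end

      # Called by the Scheduler, in sequence order, for locally stored rows only.
      def request(txn_id, rows)
        rows.each { |row| @queues[row] << txn_id }
      end

      # A transaction may run once it is at the head of every queue it joined.
      def granted?(txn_id, rows)
        rows.all? { |row| @queues[row].first == txn_id }
      end

      # Called when the Executor completes the transaction; successors may unblock.
      def release(txn_id, rows)
        rows.each { |row| @queues[row].delete(txn_id) }
      end
    end

If T1 precedes T2 in the global order, T1's request call happens first, so T1 sits ahead of T2 in every shared row's queue and lock(T1, R) is granted before lock(T2, R).
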
  16. Concurrency: contention footprint
      • Locks are held around disk and network IO.
      • Normally, the only network IO is:
        – unidirectional broadcasts
        – within a replica (i.e., same region / DC).
      • Compare 2PC:
        – Locks are held around the propose and accept phases
          • two consensus rounds across the WAN.
        – This also limits batching in 2PC:
          • you can't batch what you can't lock (due to conflicts).
  17. Executor on R1N1
      • Receive T from the Scheduler (in the context of its locks).
      • Read all N1 rows in the read set of T.
      • Broadcast to peers. ("Readcaster" – next slide)
      • Receive peer broadcasts, until T is satisfied.
      • Execute the logic and the N1 writes.
        – Logic must be deterministic.
        – Factor out rand() and time() before the sequencer.
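
A sketch of the executor steps on slide 17, in Ruby; the txn, local_store, and readcaster interfaces are assumptions made for illustration, not the actual Calvin APIs.

    # One transaction, executed on node R1N1 while its locks are held.
    def execute(txn, local_store, readcaster)
      # Read the locally stored part of the read set.
      local_reads = txn.read_set_local.map { |row| [row, local_store.read(row)] }.to_h

      # Unidirectional broadcast of this node's reads to the other participants.
      readcaster.broadcast(txn.id, local_reads)

      # Block until the remote part of the read set has arrived from peers.
      remote_reads = readcaster.collect(txn.id, txn.read_set_remote)

      # Deterministic logic only: rand()/time() were factored out before the sequencer.
      writes = txn.logic.call(local_reads.merge(remote_reads))

      # Apply only the writes that land on rows stored at this node.
      writes.each { |row, value| local_store.write(row, value) if local_store.owns?(row) }
    end
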
  18. Readcaster: Coordinates Read Results
      Needs to receive rows from a subset of peers. Not commit acks.
      Commit success is predicated on data, not on independent decisions.
  19. T = {
        s = read(table A, row 1, salary)
        abort if s < 0
        update(table B, row 2, balance += s)
      }
      [Diagram: Table A lives on the N1 nodes; Table B lives on the N2 nodes. R1N1 broadcasts s=42 to the N2 nodes; if R1N1 fails, R2N1 supplies s=42 instead.]
      Commit at N2 iff P(N1 data, N2 data).
      It doesn't matter who sent you the data! Compare 2PC!
  20. In Calvin: commit at R1N2 iff P(N1 data, N2 data).
      In 2PC: commit at R1N2 iff P(R1N1 ack, R1N2 ack, R2N1 ack, R2N2 ack).
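
The contrast on slide 20, sketched as two Ruby predicates; the field names and ack layout are illustrative.

    # Calvin: the decision is a pure predicate over data values that every replica
    # sees identically, so every replica reaches the same verdict with no voting.
    def calvin_commit?(n1_data, _n2_data)
      # For the slide-19 transaction: abort iff the salary read from table A is negative.
      n1_data[:salary] >= 0
    end

    # 2PC: the decision is a predicate over independent votes, so it needs
    # two consensus rounds across the WAN while locks are held.
    def two_pc_commit?(acks)
      acks.values_at(:R1N1, :R1N2, :R2N1, :R2N2).all?
    end
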
  21. Concurrency: transaction ordering
      Less concurrency than 2PC, but more than sequential execution.
      Compare the constraints...
  22. [Diagram: vertical lines represent the rows accessed by T1, T2, T3. The Scheduler hands T1 to Executor 1 and T2 to Executor 2; there is no lock awareness below that line.]
  23. [Diagram: T2 is blocked, so the Scheduler reorders: Executor 1 works on T1 and Executor 2 works on T3. The locking invariant is preserved, since the row sets are disjoint.]
  24. [Diagram: T1, T2, T3 again.] 2PC has more opportunities to reorder.
      T3 is blocked for Calvin, but 2PC can reorder it: without an agreed global order, there's no reason not to.
      That can increase throughput (slow disk), but replicas diverge.
  25. Consistent Replication
      Even with concurrency and reordering: if all of T1...Tn have executed, then R1N1 agrees with R2N1.
      Not so with 2PC, due to its weaker constraint on order. Hence log-shipping (or alternatives).
  26. Dependent Transactions
      • The LM needs to know the r/w sets of T before the Executor handles T.
      • What if a row ID depends on a previous read?
          row_id ← read(secondary_idx, name='fred')
          update(row_id, spouse='wilma')
      • Split, optimistically, and literalize row_id:
          T1: row_id ← read(…)
          T2: abort unless row_id == read(…)
              update(row_id, …)
        (row_id is now a literal number)
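
A Ruby sketch of the optimistic split on slide 26, under an assumed db interface: T1 is a reconnaissance read that discovers row_id, and T2 declares that literal row_id in its read/write sets up front, re-checks the index, and aborts (to be retried) if the dependency has moved.

    # T1: reconnaissance query, run outside the deterministic pipeline.
    def recon_query(db)
      db.read_index(:secondary_idx, name: 'fred')        # row_id <- read(...)
    end

    # T2: built with row_id as a literal, so its r/w sets are known up front.
    def build_t2(row_id)
      {
        read_set:  [[:secondary_idx, 'fred'], row_id],
        write_set: [row_id],
        logic: lambda do |db|
          # Abort (and retry from T1) if the index changed between T1 and T2.
          return :abort unless db.read_index(:secondary_idx, name: 'fred') == row_id
          db.update(row_id, spouse: 'wilma')
          :commit
        end
      }
    end
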
  27. Failure modes and availability
      • No better than Zookeeper (quorum).
      • Replication group failure:
        – When all R*N2 die.
        – Nondeterministic DBs:
          • can still run transactions that are disjoint from the data in N2.
        – Calvin:
          • T0 is blocked if it touches N2.
          • Suppose T0 and T1 both write to the same row in N1:
            – T1 can't be reordered around T0.
            – Blockage propagates transitively.
  28. Other Benefits
      • Pre-warm the cache using read/write sets
        – at the sequencer, ~500ms upstream of execution.
      • Node recovery
        – replay, or checkpoint + delta replay.
      • Read-only queries
        – fast, consistent results from a single, local replica.
  29. Purpose
      • Expository
      • Not really an analysis of correctness or performance
        – but latency is modeled.
      • Investigate interesting corner cases:
        – transaction concurrency / reordering
        – node failure
        – slow link
      • How do latency parameters affect throughput?
  30. Model
      • Many nodes and processes
      • Networks with non-zero latency
      • System software:
        – Log, MetaLog, DB, Locks
        – latency for Log and MetaLog
          • time to "durability"/quorum; time to full replication
      • Time
        – a sequence of instantaneous events
        – concurrency is interleaving of events
        – elapsing time is scheduling an event later
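
A toy Ruby sketch of the event model on slide 30: time is a sequence of instantaneous events, concurrency is their interleaving, and elapsing time just means scheduling an event later. A sorted array stands in here for Spinoza's red-black trees, and the class and method names are illustrative.

    class Simulator
      Event = Struct.new(:time, :action)

      def initialize
        @now    = 0.0
        @future = []                       # kept sorted by event time
      end

      # Schedule a block to run `delay` time units from the current time.
      def at(delay, &action)
        ev  = Event.new(@now + delay, action)
        idx = @future.index { |e| e.time > ev.time } || @future.size
        @future.insert(idx, ev)            # ties keep insertion order
      end

      def run
        while (ev = @future.shift)
          @now = ev.time                   # events are instantaneous
          ev.action.call                   # concurrency = interleaving of events
        end
      end
    end

    # Example: a 10ms epoch timer and a message with 2ms network latency.
    # sim = Simulator.new
    # sim.at(10) { puts "close epoch" }
    # sim.at(2)  { puts "message delivered" }
    # sim.run                              # prints "message delivered", then "close epoch"
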
  31. Implementation
      • All within one thread
        – omniscient view of the distributed system
      • In-memory SQLite databases (one per node)
      • Ruby 2
      • Red-black trees for future and past events
        – assert/inspect what happened (or is scheduled) in a time interval
      • https://github.com/vjoel/spinoza