Consistency without Clocks: The FaunaDB Distributed Transaction Protocol

Jeferson David Ossa

May 17, 2019

Transcript

  1. FaunaDB

    A distributed, indexed document store based on the Calvin transaction protocol that provides snapshot isolation up to strict serializability and does not rely on physical clock synchronization to maintain consistency. ACID transactions with up to serializable isolation. Linearizable, consistent operations across geographically distributed replicas.
  3. Calvin: Fast Distributed Transactions for Partitioned Database Systems

    Diagram: transactions T1, T2, T3, ..., Tn arrive at a sequencer, which places them in a log in the order T1, T2, T3, ..., Tn. The log feeds Replica 1, Replica 2, ..., Replica X.

    Ordering transactions, and actually executing those transactions, are separable problems.
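The separation between ordering and execution can be illustrated with a minimal sketch. The `Sequencer` and `Replica` classes and the toy transactions below are hypothetical names chosen for illustration, not Calvin's or FaunaDB's API, and the sketch assumes every transaction is a deterministic function of the state it runs against.

```python
from typing import Callable, Dict, List

State = Dict[str, int]
Txn = Callable[[State], None]  # a deterministic state transition (assumption)


class Sequencer:
    """Hypothetical sequencer: its only job is to fix one global order."""

    def __init__(self) -> None:
        self.log: List[Txn] = []

    def submit(self, txn: Txn) -> int:
        self.log.append(txn)
        return len(self.log) - 1  # position of the transaction in the log


class Replica:
    """Hypothetical replica: executes the log in order, at its own pace."""

    def __init__(self) -> None:
        self.state: State = {"x": 0}
        self.applied = 0  # number of log entries already executed

    def catch_up(self, log: List[Txn]) -> None:
        while self.applied < len(log):
            log[self.applied](self.state)
            self.applied += 1


def add(n: int) -> Txn:
    def txn(state: State) -> None:
        state["x"] += n
    return txn


def double(state: State) -> None:
    state["x"] *= 2


sequencer = Sequencer()
for t in (add(3), double, add(1)):
    sequencer.submit(t)

replica_1, replica_2 = Replica(), Replica()
replica_1.catch_up(sequencer.log[:2])  # replica 1 lags behind at first
replica_2.catch_up(sequencer.log)
replica_1.catch_up(sequencer.log)      # ...and later catches up

print(replica_1.state, replica_2.state)
```

Because both replicas execute the same log in the same order, they end in the same state even though they progressed at different speeds.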
  4. Snapshot Isolation

    Each transaction appears to operate on an independent, consistent snapshot of the database, and transactions are externally consistent (Tn-1 is visible to Tn). If transaction T1 has modified an object X, and another transaction T2 committed a write to X after T1’s snapshot began and before T1’s commit, then T1 must abort.
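The abort rule above can be sketched as a "first committer wins" check. This is a toy single-process model under stated assumptions: the `Store`, `Transaction`, and `WriteConflict` names are hypothetical, timestamps are a simple commit counter, and nothing here reflects FaunaDB's actual implementation.

```python
from typing import Dict, Tuple


class WriteConflict(Exception):
    """Raised when a concurrent transaction already committed a write to the same key."""


class Store:
    def __init__(self) -> None:
        self.commit_ts = 0
        # key -> (value, commit timestamp of the last write)
        self.data: Dict[str, Tuple[int, int]] = {}

    def begin(self) -> "Transaction":
        return Transaction(self)


class Transaction:
    def __init__(self, store: Store) -> None:
        self.store = store
        self.snapshot_ts = store.commit_ts
        self.snapshot = dict(store.data)  # private, consistent view
        self.writes: Dict[str, int] = {}

    def read(self, key: str) -> int:
        if key in self.writes:
            return self.writes[key]
        return self.snapshot[key][0]

    def write(self, key: str, value: int) -> None:
        self.writes[key] = value

    def commit(self) -> None:
        # Abort if any key we wrote was committed by someone else after our snapshot.
        for key in self.writes:
            _, last_write_ts = self.store.data.get(key, (0, 0))
            if last_write_ts > self.snapshot_ts:
                raise WriteConflict(key)
        self.store.commit_ts += 1
        for key, value in self.writes.items():
            self.store.data[key] = (value, self.store.commit_ts)


store = Store()
store.data["X"] = (10, 0)

t1 = store.begin()
t2 = store.begin()
t2.write("X", 20)
t2.commit()  # T2 commits a write to X after T1's snapshot began

t1.write("X", t1.read("X") + 1)
try:
    t1.commit()
    t1_aborted = False
except WriteConflict:
    t1_aborted = True

print(t1_aborted, store.data["X"])
```

Here T1 must abort because T2 committed a write to X between T1's snapshot and T1's commit, so X keeps the value written by T2.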
  5. Strict Serializability

    Serializability: transactions appear to have occurred in some total order.

    + Linearizability: each operation appears to take place atomically, in some order, consistent with the real-time ordering of those operations.

    = Transactions with a total order and real-time constraints.
  6. Serializable Isolation

    The system can process many transactions in parallel, but the final result is equivalent to processing them one after another. For most database systems, the order is not determined in advance. Instead, transactions are run in parallel, and some variant of locking is used to ensure that the final result is equivalent to some serial order.
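"Equivalent to some serial order" can be made concrete with a brute-force check. The sketch below is a deliberate simplification: it only compares final states (so it ignores what each transaction read), and the function names and toy transactions are hypothetical.

```python
from itertools import permutations
from typing import Callable, Dict, Sequence, Set, Tuple

State = Dict[str, int]
Txn = Callable[[State], None]


def serial_outcomes(initial: State, txns: Sequence[Txn]) -> Set[Tuple]:
    """Final states reachable by running the transactions one after another."""
    outcomes = set()
    for order in permutations(txns):
        state = dict(initial)
        for txn in order:
            txn(state)
        outcomes.add(tuple(sorted(state.items())))
    return outcomes


def is_serializable(final: State, initial: State, txns: Sequence[Txn]) -> bool:
    """True if `final` matches the result of at least one serial order."""
    return tuple(sorted(final.items())) in serial_outcomes(initial, txns)


def add_ten(state: State) -> None:
    state["x"] += 10


def double(state: State) -> None:
    state["x"] *= 2


initial = {"x": 5}
txns = [add_ten, double]

# add_ten then double gives 30; double then add_ten gives 20.
print(is_serializable({"x": 30}, initial, txns))
# A lost update (both read x = 5, add_ten's write lands last) leaves x = 15.
print(is_serializable({"x": 15}, initial, txns))
```

The lost-update result matches no serial order, which is the kind of outcome serializable isolation rules out.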
  7. FaunaDB’s replication protocol

    Caveat: FaunaDB’s replication protocol uses consensus, not wall clocks, to construct its transaction logs but still relies on wall clocks to decide when to seal time windows in the log, which means that clock skew can delay transaction processing.
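  8. Transaction Log & Snapshots

    Log entries and their times: T1 (1:00 P.M.), T2 (1:01 P.M.), T3 (1:03 P.M.), T4 (1:07 P.M.).

    | Time Stamp | Customer ID | Credit |
    |------------|-------------|--------|
    | T4         | 1           | 50     |
    | T1         | 2           | 100    |
    | T3         | 3           | 200    |
    | T2         | 2           | 50     |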
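A snapshot read over versioned rows like those on this slide can be sketched as follows. The rows are taken from the slide with time stamp Tn written as the integer n; the `read_as_of` helper is a hypothetical name, and the rule that a snapshot sees the latest version at or before its time stamp is an assumption about how the slide's table is meant to be read.

```python
from typing import List, Optional, Tuple

# (time stamp, customer ID, credit) rows from the slide, with Tn written as n.
VERSIONS: List[Tuple[int, int, int]] = [
    (4, 1, 50),
    (1, 2, 100),
    (3, 3, 200),
    (2, 2, 50),
]


def read_as_of(versions: List[Tuple[int, int, int]],
               customer_id: int,
               snapshot_ts: int) -> Optional[int]:
    """Return the credit of the newest version written at or before snapshot_ts."""
    visible = [
        (ts, credit)
        for ts, cid, credit in versions
        if cid == customer_id and ts <= snapshot_ts
    ]
    if not visible:
        return None  # the customer did not exist yet at that snapshot
    return max(visible)[1]  # the highest time stamp wins


print(read_as_of(VERSIONS, customer_id=2, snapshot_ts=1))
print(read_as_of(VERSIONS, customer_id=2, snapshot_ts=3))
print(read_as_of(VERSIONS, customer_id=1, snapshot_ts=3))
```

Older versions stay readable, so a transaction reading at T1 still sees customer 2 with credit 100 even after T2 has lowered it to 50.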
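  9. Replication

    Diagram: Replica ABC and Replica XYZ each consist of Node 1 and Node 2. Four versioned tables are shown:

    | Time Stamp | Ticket | Price | Stock |
    |------------|--------|-------|-------|
    | T3         | 1      | 100   | 12    |

    | Time Stamp | Customer ID | Credit |
    |------------|-------------|--------|
    | T1         | 2           | 200    |
    | T4         | 2           | 100    |
    | T2         | 6           | 300    |

    | Time Stamp | Ticket | Price | Stock |
    |------------|--------|-------|-------|
    | T2         | 3      | 80    | 2     |
    | T3         | 3      | 80    | 1     |

    | Time Stamp | Customer ID | Credit |
    |------------|-------------|--------|
    | T0         | 7           | 50     |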
  10. The FaunaDB Distributed Transaction Protocol

    Transaction submitted to Replica ABC:

    1. Read ticket 3, validate there’s at least 1 in stock, check price.
    2. Read customer 2, validate credit is enough to buy ticket 3.
    3. Subtract one from ticket 3’s stock.
    4. Subtract price from customer 2’s credit.

    A similar transaction is submitted to Replica XYZ for customer 6 at the same time.
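The four steps above can be sketched as a function that reads from a snapshot and records its writes in a buffer instead of applying them. The `Buffer` type, the `buy_ticket` function, and the dictionary layout are hypothetical and chosen for illustration; the values (ticket 3 at price 80 with 1 in stock, customer 2 with credit 100, customer 6 with credit 300) are those shown in the deck's tables.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

Key = Tuple[str, int]  # e.g. ("ticket", 3) or ("customer", 2)


@dataclass
class Buffer:
    reads: Dict[Key, dict] = field(default_factory=dict)
    writes: Dict[Key, dict] = field(default_factory=dict)


def buy_ticket(snapshot: Dict[Key, dict], ticket_id: int, customer_id: int) -> Buffer:
    buf = Buffer()

    # Step 1: read the ticket, validate stock, check price.
    ticket_key = ("ticket", ticket_id)
    ticket = snapshot[ticket_key]
    buf.reads[ticket_key] = ticket
    if ticket["stock"] < 1:
        raise ValueError("ticket is sold out")

    # Step 2: read the customer, validate credit.
    customer_key = ("customer", customer_id)
    customer = snapshot[customer_key]
    buf.reads[customer_key] = customer
    if customer["credit"] < ticket["price"]:
        raise ValueError("insufficient credit")

    # Steps 3 and 4: buffer the writes; nothing is applied to storage yet.
    buf.writes[ticket_key] = {**ticket, "stock": ticket["stock"] - 1}
    buf.writes[customer_key] = {**customer, "credit": customer["credit"] - ticket["price"]}
    return buf


snapshot = {
    ("ticket", 3): {"price": 80, "stock": 1},
    ("customer", 2): {"credit": 100},
    ("customer", 6): {"credit": 300},
}

abc = buy_ticket(snapshot, ticket_id=3, customer_id=2)  # submitted to Replica ABC
xyz = buy_ticket(snapshot, ticket_id=3, customer_id=6)  # submitted to Replica XYZ

print(abc.writes)
print(xyz.writes)
```

Both transactions pass validation against the same snapshot and both buffer a write that takes the last ticket, which is why a later conflict check is needed.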
  11. Replica ABC: Running Transaction

    Storage tables: as on slide 9. Log: T0, T1, T2, T3, T4.

    Rows read by the running transaction: ticket 3 (time stamp T3, price 80, stock 1) and customer 2 (time stamp T4, credit 100).
  12. Replica XYZ: Running Transaction

    Storage tables: as on slide 9. Log: T0, T1, T2, T3, T4.

    Rows read by the running transaction: ticket 3 (time stamp T3, price 80, stock 1) and customer 6 (time stamp T2, credit 300).
  13. Replica ABC: Coordinator’s buffer

    Storage tables: as on slide 9. Log: T0, T1, T2, T3, T4.

    Reads: ticket 3 (time stamp T3, price 80, stock 1); customer 2 (time stamp T4, credit 100).

    Buffered writes: ticket 3 (price 80, stock 0); customer 2 (credit 20).
  14. Replica XYZ: Coordinator’s buffer

    Storage tables: as on slide 9. Log: T0, T1, T2, T3, T4.

    Reads: ticket 3 (time stamp T3, price 80, stock 1); customer 6 (time stamp T2, credit 300).

    Buffered writes: ticket 3 (price 80, stock 0); customer 6 (credit 220).
  15. Replica ABC Coordinator and Replica XYZ Coordinator

    Replica ABC Coordinator: reads ticket 3 (time stamp T3, price 80, stock 1) and customer 2 (time stamp T4, credit 100); buffered writes ticket 3 (price 80, stock 0) and customer 2 (credit 20).

    Replica XYZ Coordinator: reads ticket 3 (time stamp T3, price 80, stock 1) and customer 6 (time stamp T2, credit 300); buffered writes ticket 3 (price 80, stock 0) and customer 6 (credit 220).

    Log: T0, T1, T2, T3, T4, T5, T6.
  16. Replica ABC: T5’s buffered writes

    Storage tables: as on slide 9. Log: T0, T1, T2, T3, T4, T5, T6.

    Two sets of rows are shown for comparison:

    (a) ticket 3 (time stamp T3, price 80, stock 1) and customer 2 (time stamp T4, credit 100), together with the buffered writes ticket 3 (price 80, stock 0) and customer 2 (credit 20);

    (b) ticket 3 (time stamp T3, price 80, stock 1) and customer 2 (time stamp T4, credit 100).
  17. Replica XYZ: T5’s buffered writes

    Storage tables: as on slide 9. Log: T0, T1, T2, T3, T4, T5, T6.

    Two sets of rows are shown for comparison:

    (a) ticket 3 (time stamp T3, price 80, stock 1) and customer 2 (time stamp T4, credit 100), together with the buffered writes ticket 3 (price 80, stock 0) and customer 2 (credit 20);

    (b) ticket 3 (time stamp T3, price 80, stock 1) and customer 2 (time stamp T4, credit 100).
  18. Replica ABC: Commit T5 (SAME)

    Storage tables: as on slide 9, except that customer 2 now shows (time stamp T5, credit 20) in place of (T4, credit 100), and ticket 3 now shows (time stamp T5, price 80, stock 0) in place of (T3, price 80, stock 1). Log: T0, T1, T2, T3, T4, T5, T6.

    The two compared sets of rows are the same, ticket 3 (time stamp T3, price 80, stock 1) and customer 2 (time stamp T4, credit 100), so T5 commits with the buffered writes ticket 3 (price 80, stock 0) and customer 2 (credit 20).
  19. Replica XYZ: Commit T5 (SAME)

    Storage tables: as on slide 9, except that customer 2 now shows (time stamp T5, credit 20) in place of (T4, credit 100), and ticket 3 now shows (time stamp T5, price 80, stock 0) in place of (T3, price 80, stock 1). Log: T0, T1, T2, T3, T4, T5, T6.

    The two compared sets of rows are the same, ticket 3 (time stamp T3, price 80, stock 1) and customer 2 (time stamp T4, credit 100), so T5 commits with the buffered writes ticket 3 (price 80, stock 0) and customer 2 (credit 20).
  20. Replica ABC: T5’s buffered writes

    Storage tables: as on slide 9, except that customer 2 now shows (time stamp T5, credit 20) in place of (T4, credit 100), and ticket 3 now shows (time stamp T5, price 80, stock 0) in place of (T3, price 80, stock 1). Log: T0, T1, T2, T3, T4, T5, T6.

    Two sets of rows are shown for comparison:

    (a) ticket 3 (time stamp T5, price 80, stock 0) and customer 6 (time stamp T2, credit 300), together with the buffered writes ticket 3 (price 80, stock 0) and customer 2 (credit 20);

    (b) ticket 3 (time stamp T3, price 80, stock 1) and customer 6 (time stamp T2, credit 300).
  21. Replica XYZ: T5’s buffered writes

    Storage tables: as on slide 9, except that customer 2 now shows (time stamp T5, credit 20) in place of (T4, credit 100), and ticket 3 now shows (time stamp T5, price 80, stock 0) in place of (T3, price 80, stock 1). Log: T0, T1, T2, T3, T4, T5, T6.

    Two sets of rows are shown for comparison:

    (a) ticket 3 (time stamp T5, price 80, stock 0) and customer 6 (time stamp T2, credit 300), together with the buffered writes ticket 3 (price 80, stock 0) and customer 2 (credit 20);

    (b) ticket 3 (time stamp T3, price 80, stock 1) and customer 6 (time stamp T2, credit 300).
  22. Replica ABC: ABORT (DIFFERENT)

    Storage tables: as on slide 9, except that customer 2 now shows (time stamp T5, credit 20) in place of (T4, credit 100), and ticket 3 now shows (time stamp T5, price 80, stock 0) in place of (T3, price 80, stock 1). Log: T0, T1, T2, T3, T4, T5, T6.

    The two compared sets of rows are different: ticket 3 appears as (time stamp T5, price 80, stock 0) in one set and as (time stamp T3, price 80, stock 1) in the other, while customer 6 (time stamp T2, credit 300) is the same in both. Result: abort.
  23. Replica XYZ: ABORT (DIFFERENT)

    Storage tables: as on slide 9, except that customer 2 now shows (time stamp T5, credit 20) in place of (T4, credit 100), and ticket 3 now shows (time stamp T5, price 80, stock 0) in place of (T3, price 80, stock 1). Log: T0, T1, T2, T3, T4, T5, T6.

    The two compared sets of rows are different: ticket 3 appears as (time stamp T5, price 80, stock 0) in one set and as (time stamp T3, price 80, stock 1) in the other, while customer 6 (time stamp T2, credit 300) is the same in both. Result: abort.
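The commit and abort decisions in the preceding slides can be reproduced with a small sketch. It rests on assumptions that the deck does not spell out: each log entry carries the time stamps of the rows its transaction read, and a replica commits the entry only if those rows still carry the same time stamps when the entry is reached. The `apply_log` function and the data layout are hypothetical; the values come from the slides, with Tn written as the integer n.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, int]
Versioned = Tuple[int, dict]  # (time stamp of last write, value)
LogEntry = Tuple[int, Dict[Key, int], Dict[Key, dict]]


def initial_storage() -> Dict[Key, Versioned]:
    return {
        ("ticket", 3): (3, {"price": 80, "stock": 1}),
        ("customer", 2): (4, {"credit": 100}),
        ("customer", 6): (2, {"credit": 300}),
    }


# Each entry: (log time stamp, time stamps seen at read time, buffered writes).
LOG: List[LogEntry] = [
    (5, {("ticket", 3): 3, ("customer", 2): 4},
        {("ticket", 3): {"price": 80, "stock": 0}, ("customer", 2): {"credit": 20}}),
    (6, {("ticket", 3): 3, ("customer", 6): 2},
        {("ticket", 3): {"price": 80, "stock": 0}, ("customer", 6): {"credit": 220}}),
]


def apply_log(storage: Dict[Key, Versioned], log: List[LogEntry]) -> Dict[int, str]:
    """Process log entries in order; commit only if every read is still current."""
    outcomes: Dict[int, str] = {}
    for ts, read_versions, writes in log:
        same = all(storage[key][0] == seen for key, seen in read_versions.items())
        if same:
            for key, value in writes.items():
                storage[key] = (ts, value)
            outcomes[ts] = "commit"
        else:
            outcomes[ts] = "abort"
    return outcomes


replica_abc, replica_xyz = initial_storage(), initial_storage()
outcomes_abc = apply_log(replica_abc, LOG)
outcomes_xyz = apply_log(replica_xyz, LOG)

print(outcomes_abc, outcomes_xyz)
```

Each replica makes its decision locally from the shared log, and because the check is deterministic both replicas reach the same outcome for T5 and T6 without further coordination.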
  24. Multi-Region Global Replica Consistency

    Once a transaction commits, it is guaranteed that any subsequent read-write transaction, no matter which replica is processing it, will read all data that was written by the earlier transaction.
  25. Summary

    1. Reads are performed as of a recent snapshot, and writes are buffered.
    2. A consensus protocol (Raft) is used to insert the transaction into a distributed log. This is the only point at which global consensus is required.
    3. Each replica checks the transaction for potential violations of serializability guarantees.
  26. Performance implications

    - Transactions that update data only go through a single round of global consensus.
    - FaunaDB does not require clock synchronization or bounds on clock skew uncertainty across machines in a deployment.
    - FaunaDB has a global notion of "FaunaDB time" that is agreed upon by every node in the system.
    - FaunaDB supports serializable snapshot reads with no consensus or locking, so they complete with local datacenter latency.
  27. References

    - https://fauna.com/faunadb
    - http://jepsen.io/analyses/faunadb-2.5.4
    - https://jepsen.io/consistency
    - http://cs.yale.edu/homes/thomson/publications/calvin-sigmod12.pdf
    - https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-95-51.pdf
    - http://cs.brown.edu/~mph/HerlihyW90/p463-herlihy.pdf
    - http://web.cecs.pdx.edu/~len/sql1999.pdf
    - http://pmg.csail.mit.edu/papers/adya-phd.pdf