
Zeus: Locality-aware Distributed Transactions [EuroSys '21 presentation]

Antonios Katsarakis

April 21, 2021

Transcript

  1. Zeus: Locality-aware distributed transactions. A. Katsarakis*, Y. Ma†, Z. Tan§, A. Bainbridge, M. Balkwill, A. Dragojevic, B. Grot*, B. Radunovic, Y. Zhang. *University of Edinburgh, †Fudan University, §UCLA, Microsoft Research. zeus-protocol.com
  2–6. Modern distributed datastores are the backbone of transactional cloud applications. They keep data in-memory, replicated, and sharded across the nodes of a datacenter, and they demand reliable distributed transactions (txs) that are strongly consistent, fault-tolerant, and high-performance. Traditional distributed txs are well known to be expensive.
  7–11. Observation: many tx applications exhibit dynamic locality (network functions, peer-to-peer payments, ...). Example: the cellular control plane manages phone connectivity and handovers among base stations. Locality: every phone user repeats txs on the same phone and its nearest base-station. But locality is dynamic and changes at run-time; e.g., when a user commutes, the base-station changes (a handover from base-station A to base-station B). Can state-of-the-art datastores exploit this dynamic locality?
  12–15. State-of-the-art reliable datastores use static sharding (e.g., consistent hashing): objects are placed randomly on fixed nodes. This makes objects easy to locate and access and supports reliable txs regardless of access pattern, but those reliable txs are expensive: mostly remote accesses, some blocking (control flow, pointer chasing) because related objects land on different shards, and a costly distributed commit driven by the tx coordinator. The example tx "if (p) b++" (adapted from FaSST [OSDI’16]) incurs remote accesses plus a distributed commit. Static sharding cannot exploit locality → expensive reliable txs.
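To make that cost concrete, here is a toy, self-contained sketch of the "if (p) b++" tx under static sharding (hypothetical Python with made-up names, not FaSST or FaRM code): the control-flow dependency blocks the second access behind the first remote read, and the write needs a commit involving another shard.

    # Toy model: shards are in-process dicts; in a real deployment every access
    # below is a network round trip to the node that the hash placed the object on.
    shards = {"p": "node0", "b": "node1"}                  # object -> home node
    stores = {"node0": {"p": True}, "node1": {"b": 41}}    # per-node in-memory data

    def remote_read(obj):
        # Remote access to whichever node holds the object.
        return stores[shards[obj]][obj]

    def distributed_commit(writes):
        # Stand-in for a 2PC-style commit: every written object's home node (and,
        # in a reliable datastore, its replicas) must participate before completion.
        for obj, value in writes.items():
            stores[shards[obj]][obj] = value

    def tx_if_p_then_inc_b():
        p = remote_read("p")                   # remote access #1
        if p:                                  # control flow blocks the next access
            b = remote_read("b")               # remote access #2, on a different shard
            distributed_commit({"b": b + 1})   # costly distributed commit

    tx_if_p_then_inc_b()
    print(stores["node1"]["b"])                # 42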
  16–20. Enter Zeus: a distributed datastore that exploits locality for fast reliable txs, inspired by multiprocessor hardware transactional memory. Basic idea: each object has a single node as its owner (ownership = data + exclusive write access); the owner changes dynamically and is tracked by a replicated directory. A coordinator executes a tx by acquiring ownership of all of its objects → single-node commit. Ownership stays with the coordinator → future txs on these objects enjoy local accesses. What are the exact steps?
  21–23. Locality-aware txs in Zeus: 1. Execute as the owner: (a) at each object access, if not the owner, get ownership; (b) access locally. 2. Local commit commits the tx: a traditional single-node commit (updates not yet replicated). 3. Reliable commit completes the tx: updates the replicas for availability. But how is ownership acquired reliably?
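A minimal sketch of this three-phase path from the coordinator's side, in illustrative Python (class and function names are assumptions, not the paper's implementation); get_ownership() is a stub for the ownership protocol that the next slides describe.

    # Ownership transfer and replication are reduced to stubs so the example is runnable.
    PREVIOUS_OWNER = {"phone_42": {"attached_to": "base_station_A"}}  # toy remote state

    class Coordinator:
        def __init__(self, replicas=()):
            self.local_store = {}            # objects this node currently owns
            self.replicas = list(replicas)

        def owns(self, obj):
            return obj in self.local_store

        def get_ownership(self, obj):
            # Placeholder for the ownership protocol (next slides): afterwards the
            # object's latest value lives here with exclusive write access.
            self.local_store[obj] = PREVIOUS_OWNER.pop(obj)

        def run_tx(self, tx_logic):
            # 1. Execute as the owner: acquire ownership lazily, then access locally.
            def access(obj):
                if not self.owns(obj):
                    self.get_ownership(obj)
                return self.local_store[obj]

            writes = tx_logic(access)
            # 2. Local commit: traditional single-node commit (not yet replicated).
            self.local_store.update(writes)
            # 3. Reliable commit: push the updates to the replicas for availability.
            for replica in self.replicas:
                replica.update(writes)

    coordinator = Coordinator(replicas=[{}])
    def handover(access):
        record = dict(access("phone_42"))
        record["attached_to"] = "base_station_B"
        return {"phone_42": record}
    coordinator.run_tx(handover)   # first tx pays for the ownership transfer...
    coordinator.run_tx(handover)   # ...later txs on the same objects are all-local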
  24–31. Ownership protocol: (1) the coordinator gets ownership from the current owner and (2) keeps the directory replicas consistent. Steps: 1. The coordinator sends object ownership invalidations (through a directory replica) to all arbiters. 2. The arbiters acknowledge the coordinator directly. 3. The coordinator sends validations informing the arbiters of the acquisition. Ownership is then acquired and the coordinator proceeds with the tx. Conflicts are resolved with logical timestamps; fault tolerance relies on idempotent replays, as in Hermes [ASPLOS’20]. Correctness is verified under conflicts and faults.
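A hedged sketch of that message flow (invalidations, acks, validations) with a logical-timestamp conflict rule, in illustrative Python. For brevity the invalidation goes straight to every arbiter rather than through a directory replica, and all names are assumptions; the actual protocol and its fault handling are specified in the paper and its TLA+ model.

    class Arbiter:
        def __init__(self):
            # Directory replica entry for one object: current owner + logical timestamp.
            self.entry = {"owner": "node_B", "ts": (0, "")}

        def invalidate(self, ts, requester):
            # Step 1: an INV with a higher logical timestamp wins; otherwise NACK,
            # which is how concurrent acquisitions (conflicts) get ordered.
            if ts > self.entry["ts"]:
                self.entry.update(owner=None, ts=ts)
                return True
            return False

        def validate(self, requester):
            # Step 3: VAL makes the new owner visible at this directory replica.
            self.entry["owner"] = requester

    # Fault tolerance (not shown): re-sending the same INV/VAL is idempotent, so a
    # new coordinator can safely re-drive an acquisition that was left in flight.
    def acquire_ownership(requester, arbiters, ts):
        acks = [a.invalidate(ts, requester) for a in arbiters]   # step 1: INVs out
        if all(acks):                                            # step 2: ACKs collected
            for a in arbiters:
                a.validate(requester)                            # step 3: VALs out
            return True
        return False   # lost a conflict: retry later with a higher timestamp

    arbiters = [Arbiter() for _ in range(3)]
    print(acquire_ownership("node_A", arbiters, ts=(1, "node_A")))   # True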
  32–37. Locality-aware txs in Zeus, revisited: 1. Execute as the owner: (a) at each object access, if not the owner, get ownership; (b) access locally. Because locality is high and ownership stays with the coordinator, the common case needs no ownership acquisition at all. 2. Local commit commits the tx: a traditional single-node commit. 3. Reliable commit completes the tx: updates the replicas for availability. Great! But how efficient is reliable commit?
  38–41. Reliable commit: 1. A committed tx has no conflicts → fast tx completion: the coordinator sends updates to the replicas and waits for ACKs; read-only txs have no updates → no reliable commit. 2. No conflicts → no aborts → pipelined txs (no waiting for replication): subsequent txs can use local state with certainty and issue their updates; the coordinator sequences the updates, which replicas apply in order. Fault tolerance: idempotent replays. Very efficient, with correctness verified under faults.
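A small sketch of the pipelined, in-order replication described above (illustrative Python; the sequencing and duplicate filtering are simplified assumptions). The owner commits locally and keeps issuing txs without waiting, stamping each update with a per-owner sequence number; replicas apply updates in order and treat re-sent (replayed) updates as no-ops, which is what makes replays after a fault idempotent.

    class Replica:
        def __init__(self):
            self.store = {}
            self.applied_seq = 0

        def apply(self, seq, writes):
            if seq <= self.applied_seq:        # duplicate / replayed update: no-op
                return
            assert seq == self.applied_seq + 1, "updates are applied in order"
            self.store.update(writes)
            self.applied_seq = seq

    class Owner:
        def __init__(self, replicas):
            self.store, self.seq, self.replicas = {}, 0, list(replicas)

        def commit(self, writes):
            self.store.update(writes)          # local commit: the tx is committed here
            self.seq += 1
            for r in self.replicas:            # reliable commit: pipelined, no aborts,
                r.apply(self.seq, writes)      # so the next tx need not wait for ACKs

    replicas = [Replica(), Replica()]
    owner = Owner(replicas)
    owner.commit({"x": 1})                     # a following tx can read x locally at once
    owner.commit({"x": 2, "y": 3})
    replicas[0].apply(2, {"x": 2, "y": 3})     # a replayed update is ignored (idempotent)
    print(replicas[0].store)                   # {'x': 2, 'y': 3}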
  42–46. Performance (6 nodes, 3-way replication; Zeus runs over a 40Gb network without RDMA). On the Handovers benchmark, Zeus with real-world locality is within 9% of an ideal all-local configuration in million txs/sec. On TATP, sweeping the percentage of write txs that need ownership, Zeus reaches up to 40M txs/s and up to 2x the state-of-the-art FaSST [OSDI’16] and FaRM [SOSP’15], which use 56Gb RDMA. The paper has more benchmarks, ownership details, latency results, and more.
  47–51. Conclusion: State-of-the-art reliable txs over static sharding cannot exploit dynamic locality: they pay for remote accesses and a costly distributed commit. Zeus' reliable txs exploit locality via dynamic ownership: local accesses in the common case and a single-node commit that is local for read-only txs and fast, pipelined for write txs. Performance: tens of millions of txs/second, up to 2x the state-of-the-art. Bonus: programmability! TLA+ specification and more at zeus-protocol.com. Reliable txs with locality? Use Zeus!