Invalidation-Based Protocols for Replicated Datastores

PhD viva slides of Antonios Katsarakis. The University of Edinburgh, Fall 2021.

Antonios Katsarakis

September 13, 2021

Transcript

  1. Distributed datastores
     - Data: in-memory, sharded across servers within a datacenter (DC)
     - Offer a single-object read/write or a multi-object transaction API
     - Backbone of online services and cloud applications
     - Must provide high performance and fault tolerance
     - These requirements mandate data replication

  2. Replication 101
     - Performance: a single node may not keep up with the load
     - Fault tolerance: data remain available despite failures (typically 3 to 7 replicas)
     - Consistency — weak: performance but nasty surprises; strong: intuitive, supports the broadest spectrum of apps
     - Replication protocols define the actions that execute reads/writes or transactions (txs) → they determine the datastore's performance, and give strong consistency even under faults, if fault tolerant
     Can strongly consistent protocols offer fault tolerance and high performance?

  3. Strongly consistent replication
     - Multiprocessor (cache coherence / HTM): data replicated across caches; performance via Invalidations over a low-latency interconnect — reads/writes keep concurrency & speed, txs fully exploit locality
     - Datastores (replication protocols): data replicated across nodes; fault tolerance — but reads/writes sacrifice concurrency or speed, and txs cannot exploit locality
     - Replication protocols inside a DC: the network is fast (remote direct memory access, RDMA), and faults are rare within a replica group (a server fails at most twice a year), so fault-free operation >> operation under faults
     The common operation of replication protocols resembles the multiprocessor!

  4. Thesis overview
     Adapting multiprocessor-inspired invalidating protocols to intra-DC replicated datastores enables strong consistency, fault tolerance, and high performance.
     Primary contributions: 4 invalidating protocols → the 3 most common replication uses in datastores (Galene and Zeus are each summarized in one slide; Hermes is covered in depth):
     - Scale-out ccNUMA [EuroSys'18]: Galene protocol — performant read/write replication (skew)
     - Hermes [ASPLOS'20]: Hermes protocol — fast fault-tolerant read/write replication
     - Zeus [EuroSys'21]: Zeus ownership + Zeus reliable commit — replicated fault-tolerant distributed txs

  5. Performant read/write replication for skew
     - Many workloads exhibit skewed data accesses: a few servers are overloaded while most are underutilized
     - State-of-the-art skew mitigation distributes accesses across all servers and uses RDMA; no locality: most requests need a remote access → increased latency, bottlenecked by network bandwidth
     - Symmetric caching: every server caches the same hottest objects; throughput scales with the number of servers, and network bandwidth drops since most requests are served locally
     - Challenge: efficiently keep the caches consistent — existing protocols serialize writes at a physical point, which becomes a hotspot
     - Galene protocol: invalidations + logical timestamps = fully distributed writes (see the sketch below)
     Symmetric caching and Galene: hundreds of millions of ops/sec & up to 3x the state of the art!

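     To make the caching scheme above concrete, here is a minimal, single-process sketch of symmetric caching with invalidation-based, logically timestamped writes. It is illustrative only: the names (Server, local_read, apply_invalidation, …) are invented for this sketch and do not mirror Galene's actual interface, and RDMA messaging is modeled as direct method calls.

```python
# Minimal sketch: every server caches the same hot objects; any server can
# drive a write by invalidating all cached copies with a logical timestamp,
# then validating them with the new value. Illustrative names, not Galene's API.
from dataclasses import dataclass

@dataclass
class CacheLine:
    value: object = None
    ts: tuple = (0, 0)           # (logical counter, writer node id)
    valid: bool = False

class Server:
    def __init__(self, node_id):
        self.node_id = node_id
        self.peers = []          # all servers, including self
        self.cache = {}          # hot objects only; the backing store is omitted

    def local_read(self, key):
        line = self.cache.get(key)
        return line.value if line and line.valid else None   # None = miss/invalid

    def write(self, key, value):
        # Fully distributed writes: no single serialization point.
        old = self.cache.get(key, CacheLine())
        ts = (old.ts[0] + 1, self.node_id)
        for peer in self.peers:
            peer.apply_invalidation(key, ts)
        for peer in self.peers:
            peer.apply_validation(key, value, ts)

    def apply_invalidation(self, key, ts):
        line = self.cache.setdefault(key, CacheLine())
        if ts > line.ts:                       # higher-timestamp write wins
            line.ts, line.valid = ts, False

    def apply_validation(self, key, value, ts):
        line = self.cache.setdefault(key, CacheLine())
        if ts == line.ts:                      # only the matching write validates
            line.value, line.valid = value, True

servers = [Server(i) for i in range(3)]
for s in servers:
    s.peers = servers
servers[0].write("hot_key", 42)
assert all(s.local_read("hot_key") == 42 for s in servers)   # served locally everywhere
```
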
  6. Hmmm … invalidating protocols give good read/write performance when replicating under skew.
     Can they maintain that high read/write performance while also providing fault tolerance?
     (Reliable = strongly consistent + fault tolerant.)
     2nd primary contribution: Hermes! But first: what is the issue with existing reliable protocols?

  7. Paxos
     - The gold standard for strong consistency and fault tolerance
     - Low performance: reads → inter-replica communication; writes → multiple RTTs over the network
     - Common-case performance (i.e., no faults) is as bad as worst-case performance (under faults)
     State-of-the-art replication protocols instead exploit failure-free operation for performance.

  8. Performance of state-of-the-art protocols
     - ZAB (leader-based broadcast): writes serialize on the leader → low throughput; local reads from all replicas → fast
     - CRAQ (chain replication, head → tail): writes traverse the length of the chain → high latency; local reads from all replicas → fast
     (Diagram legend: write, read, bcast = broadcast, ucast = unicast.)
     Bottom line: fast reads but poor write performance. (See the back-of-envelope model below.)

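     As a rough illustration of the two write paths, here is a back-of-envelope message-delay count under deliberately simplified assumptions (uniform one-way delays, no batching or pipelining); it is not drawn from the measured results later in the deck.

```python
# Simplified message-delay counts for the two write paths sketched above.
def zab_write_delays(replicas: int) -> int:
    # client -> leader -> followers -> leader -> client: about four one-way
    # delays regardless of group size, but every write funnels through the
    # single leader, which is what caps throughput.
    return 4

def craq_write_delays(replicas: int) -> int:
    # client -> head -> ... -> tail -> client: the value hops down the whole
    # chain before the write is acknowledged, so latency grows with its length.
    return 1 + (replicas - 1) + 1

for n in (3, 5, 7):
    print(f"{n} replicas: ZAB ~{zab_write_delays(n)} delays, CRAQ ~{craq_write_delays(n)} delays")
```
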
  9. Goal: low latency + high throughput — key protocol features for high performance
     - Reads: local from all replicas (avoid long latencies)
     - Writes: fast — minimize network hops; decentralized — no serialization points (avoid write serialization at a leader); fully concurrent — any replica can service a write
     Existing replication protocols are deficient in one or more of these features.

  10. Enter Hermes
     A broadcast-based, invalidating replication protocol inspired by multiprocessor cache-coherence protocols. Object states: Valid, Invalid. Fault-free operation for a write (e.g., write(A=3)):
     1. The coordinator (the replica servicing the write) broadcasts Invalidations — from this point no stale reads can be served: strong consistency!
     2. Followers acknowledge the Invalidation
     3. The coordinator broadcasts Validations (commit) — all replicas can again serve reads for this object
     Strongest consistency (linearizability), with local reads from all replicas, since valid objects hold the latest value. (A single-process sketch of this write path follows.) But what about concurrent writes?

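     The sketch below walks through the Invalidate → Ack → Validate sequence for a write. It is a simplification for illustration only: message passing is modeled as direct method calls, membership is fixed, and the names do not come from the Hermes codebase.

```python
# Minimal sketch of the Hermes fault-free write path: the coordinator
# invalidates every replica, waits for all acks, then validates (commits).
VALID, INVALID = "Valid", "Invalid"

class Replica:
    def __init__(self, node_id, group):
        self.node_id = node_id
        self.group = group                        # all replicas, including self
        self.store = {}                           # key -> [state, timestamp, value]

    def read(self, key):
        state, _ts, value = self.store.get(key, [VALID, (0, 0), None])
        if state != VALID:
            raise RuntimeError("object is Invalid: a write is in flight")
        return value                              # local read at any replica

    def write(self, key, value):                  # this replica acts as coordinator
        _st, (counter, _), _val = self.store.get(key, [VALID, (0, 0), None])
        ts = (counter + 1, self.node_id)          # per-object logical timestamp
        acks = [r.on_invalidation(key, ts) for r in self.group]
        if all(acks):                             # every replica is invalidated
            for r in self.group:
                r.on_validation(key, ts, value)   # commit: object valid everywhere

    def on_invalidation(self, key, ts):
        entry = self.store.setdefault(key, [VALID, (0, 0), None])
        if ts > entry[1]:                         # higher timestamp wins
            entry[0], entry[1] = INVALID, ts
        return True                               # ack

    def on_validation(self, key, ts, value):
        entry = self.store.get(key)
        if entry and entry[1] == ts:              # only the matching write validates
            entry[0], entry[2] = VALID, value

group = []
for i in range(3):
    group.append(Replica(i, group))
group[1].write("A", 3)                            # any replica can coordinate
assert all(r.read("A") == 3 for r in group)       # local reads return the latest value
```
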
  11. Concurrent writes = challenge
     Challenge: how to efficiently order concurrent writes to the same object?
     Solution: store a logical timestamp (TS) along with each object:
     - Upon a write, the coordinator increments the TS and sends it with the Invalidations
     - Upon receiving an Invalidation, a follower updates the object's TS
     - When two writes to the same object race, the writers' node IDs break the tie
     Broadcast + Invalidations + TS → high-performance writes. (An encoding of this ordering rule is sketched below.)

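     One way to encode the ordering rule above is a Lamport-style pair compared lexicographically, so the node ID only breaks ties between writes with the same counter. The names below are illustrative, not taken from the Hermes implementation.

```python
from typing import NamedTuple

class WriteTS(NamedTuple):
    counter: int      # incremented by the coordinator on each write to the object
    node_id: int      # breaks ties between concurrent writes

def next_ts(current: WriteTS, my_node_id: int) -> WriteTS:
    return WriteTS(current.counter + 1, my_node_id)

def merge_on_invalidation(stored: WriteTS, incoming: WriteTS) -> WriteTS:
    # A follower keeps the highest timestamp it has seen for the object.
    return max(stored, incoming)

# Two writes racing over the same base version, coordinated by nodes 1 and 4:
base = WriteTS(0, 0)
inv_from_node1, inv_from_node4 = next_ts(base, 1), next_ts(base, 4)
assert inv_from_node1 < inv_from_node4    # same counter -> node id decides, at every replica
```
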
  12. Writes in Hermes (Broadcast + Invalidations + TS)
     1. Decentralized: fully distributed write ordering at the endpoints
     2. Fully concurrent: any replica can coordinate a write, and writes to different objects proceed in parallel
     3. Fast: commit in 1 RTT; writes never abort
     Awesome! But what about fault tolerance?

  13. Handling faults in Hermes
     Problem: a failure in the middle of a write can permanently leave a replica in the Invalid state (e.g., the coordinator of write(A=3) fails after invalidating, blocking read(A)).
     Idea: allow any invalidated replica to replay the write and unblock. How?
     Insight: to replay a write, a replica needs the write's original TS (for ordering) and the write's value; the TS is already sent with the Invalidation, but the value is not.
     Solution: send the write value with the Invalidation, e.g., Inv(3, TS) → early value propagation enables write replays. (A sketch of the replay path follows.)

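     Building on the Replica sketch shown earlier (and reusing its VALID/INVALID constants), the variant below lets the Invalidation carry the value so any invalidated follower can replay a stalled write. The real protocol additionally needs membership and lease machinery that this sketch omits.

```python
# Early value propagation + write replay, extending the earlier Replica sketch.
class ReplayableReplica(Replica):
    def write(self, key, value):
        _st, (counter, _), _val = self.store.get(key, [VALID, (0, 0), None])
        ts = (counter + 1, self.node_id)
        for r in self.group:
            r.on_invalidation(key, ts, value)     # the value rides on the Invalidation
        # ... if the coordinator crashed here, every follower would be stuck Invalid ...
        for r in self.group:
            r.on_validation(key, ts)

    def on_invalidation(self, key, ts, value):
        entry = self.store.setdefault(key, [VALID, (0, 0), None])
        if ts > entry[1]:
            entry[:] = [INVALID, ts, value]       # remember TS and value for replay
        return True

    def on_validation(self, key, ts):
        entry = self.store.get(key)
        if entry and entry[1] == ts:
            entry[0] = VALID

    def replay(self, key):
        # Any replica left Invalid can re-coordinate the pending write, reusing
        # the original timestamp so the write order is unchanged.
        state, ts, value = self.store[key]
        if state == INVALID:
            for r in self.group:
                r.on_invalidation(key, ts, value)
            for r in self.group:
                r.on_validation(key, ts)
```
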
  14. Evaluation
     - Evaluated protocols: ZAB, CRAQ, Hermes
     - State-of-the-art hardware testbed: 5 servers; 56 Gb/s InfiniBand NICs; 2x 10-core Intel Xeon E5-2630v4 per server
     - KVS workload: uniform access distribution; a million key-value pairs (8 B keys, 32 B values)

  15. Performance
     [Figures: throughput (million requests/sec) and write latency (normalized to Hermes) versus write ratio (%), contrasting local reads only, concurrent writes + local reads, and high-performance writes + local reads; annotated gains of 4x and 6x at the 5% and 40% write-ratio marks.]
     Write performance matters even at low write ratios.
     Hermes: highest throughput & lowest latency.

  16. Hermes recap (Broadcast + Invalidations + TS + early value propagation)
     - Strong consistency through multiprocessor-inspired Invalidations
     - Fault tolerance: write replays via early value propagation
     - High performance: local reads at all replicas; high-performance writes (fast, decentralized, fully concurrent)
     What about reliable txs? … 3rd primary contribution (1 slide)!

  17. Reliable replicated transactions (Zeus)
     - Many tx workloads exhibit locality in their accesses
     - State-of-the-art datastores rely on static sharding: reliable txs regardless of access pattern, but objects are randomly sharded onto fixed nodes → remote accesses to execute and an expensive distributed commit → costly txs that cannot exploit locality
       (running example adapted from FaSST [OSDI'16]: "tx: if (p) b++;" requires remote accesses and a distributed commit)
     - Zeus — locality-aware reliable txs:
       - Each object has a node owner (holds the data + exclusive write access) that changes dynamically
       - The coordinator becomes the owner of all of a tx's objects → single-node commit
       - Ownership stays with the coordinator → future txs on the same objects are local accesses
       - Reliable ownership (1.5 RTT) alters replica placement and access levels
       - Reliable commit: read-only txs are local at all replicas; fast write txs are pipelined, 1 RTT to commit
     - Zeus comprises two invalidating protocols (ownership and reliable commit)
     Tens of millions of txs/sec & up to 2x the state of the art! (A simplified sketch of the ownership idea follows.)

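     The following is a heavily simplified, single-process sketch of the locality idea only: a tx coordinator first pulls ownership of every object it touches, then commits locally, so repeat txs on the same objects need no remote work. The reliable-ownership and pipelined reliable-commit protocols themselves (and replication to backups) are elided, and all names are illustrative rather than Zeus's actual API.

```python
# Sketch of locality-aware transactions via dynamic ownership (illustrative only).
class Node:
    def __init__(self, name):
        self.name = name
        self.owned = {}                         # objects this node currently owns

class OwnershipDirectory:
    def __init__(self):
        self.owner = {}                         # key -> owning Node

    def acquire(self, key, requester):
        current = self.owner.get(key)
        if current is requester:
            return 0                            # already owned: purely local access
        if current is not None:
            requester.owned[key] = current.owned.pop(key)   # move data + exclusive write access
        else:
            requester.owned.setdefault(key, None)
        self.owner[key] = requester
        return 1                                # one ownership transfer (~1.5 RTT in Zeus)

def run_tx(directory, coordinator, updates):
    # Phase 1: acquire ownership of everything the tx touches.
    transfers = sum(directory.acquire(key, coordinator) for key in updates)
    # Phase 2: every object is now local -> single-node commit
    # (replicating the commit to backup replicas is omitted in this sketch).
    coordinator.owned.update(updates)
    return transfers

directory, node_a, node_b = OwnershipDirectory(), Node("A"), Node("B")
node_b.owned["b"] = 0
directory.owner["b"] = node_b
print(run_tx(directory, node_a, {"p": True, "b": 1}))   # 2 transfers: "p" (new) and "b" (from B)
print(run_tx(directory, node_a, {"b": 2}))              # 0 transfers: ownership stayed with A
```
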
  18. Thesis summary
     Replicated datastores powered by multiprocessor-inspired invalidating protocols can deliver strong consistency, fault tolerance, and high performance.
     4 invalidating protocols → the 3 most common replication uses in datastores:
     - High performance (10s–100s of millions of ops/sec)
     - Strong consistency under concurrency & faults (formally verified in TLA+)
     - Scale-out ccNUMA [EuroSys'18]: Galene protocol — performant read/write replication for skew
     - Hermes [ASPLOS'20]: Hermes protocol — fast reliable read/write replication
     - Zeus [EuroSys'21]: Zeus ownership + Zeus reliable commit — locality-aware reliable txs with dynamic sharding
     Is this the end?

  19. Follow-up research
     - The L2AW theorem [to be submitted]
     - Hardware offloading
     - Replication across datacenters
     - Single-shot reliable writes from external clients
     - Non-blocking reconfiguration on node crashes
     …
     Thank you! Questions?