Hermes Reliable Replication Protocol - [ASPLOS'20]

The presentation slides as presented at ASPLOS'20 for the paper
"Hermes: A Fast, Fault-Tolerant and Linearizable Replication Protocol".


Antonios Katsarakis

April 30, 2020

Transcript

  1. Hermes: A Fast, Fault-tolerant and Linearizable Replication Protocol
    Antonios Katsarakis, V. Gavrielatos, S. Katebzadeh, A. Joshi*, B. Grot, V. Nagarajan, A. Dragojevic†
    University of Edinburgh, *Intel, †Microsoft Research
    hermes-protocol.com
  2. Distributed Datastore
    In-memory, with a read/write API; the backbone of online services.
    Need: high performance and fault tolerance → mandates data replication.
  3. Replication 101
    Typically 3 to 7 replicas.
    Consistency: weak gives performance but nasty surprises; strong is programmable and intuitive.
    Reliable replication protocols:
    • provide strong consistency even under faults
    • define the actions to execute reads and writes → these determine a datastore's performance
    Can reliable protocols provide high performance?
  4. Paxos
    The gold standard for strong consistency and fault tolerance, but low performance:
    reads → inter-replica communication; writes → multiple RTTs over the network.
    Common-case performance (i.e., no faults) is as bad as worst-case (under faults).
    State-of-the-art reliable protocols exploit failure-free operation for performance.
  5. Performance of state-of-the-art protocols
    ZAB (leader-based): local reads from all replicas → fast; but writes serialize on the leader → low throughput.
    CRAQ (chain, head to tail): local reads from all replicas → fast; but writes traverse the length of the chain → high latency.
    Both offer fast reads but poor write performance.
  6. Key protocol features for high performance
    Goal: low latency + high throughput.
    Reads: local from all replicas.
    Writes: fast (minimize network hops), decentralized (no serialization points), fully concurrent (any replica can service a write).
    Existing replication protocols are deficient.
  7. Enter Hermes
    A broadcast-based, invalidating replication protocol, inspired by multiprocessor cache-coherence protocols.
    Each object is in one of two states: Valid or Invalid.
    Fault-free operation (the coordinator is the replica servicing a write, e.g., write(A=3)):
    1. The coordinator broadcasts Invalidations. Once a follower is Invalidated, no stale reads can be served → strong consistency.
    2. Followers Acknowledge the Invalidation.
    3. After all Acks, the coordinator commits and broadcasts Validations; all replicas can again serve reads for this object.
    Strongest consistency (linearizability): local reads from all replicas, since a Valid object holds the latest value.
    What about concurrent writes?
  8. Concurrent writes = challenge
    Challenge: how to efficiently order concurrent writes to an object (e.g., racing write(A=3) and write(A=1))?
    Solution: store a logical timestamp (TS) along with each object.
    - Upon a write: the coordinator increments the TS and sends it with the Invalidations.
    - Upon receiving an Invalidation: a follower updates the object's TS.
    - When two writes to the same object race: use the node ID to order them.
    Broadcast + Invalidations + TS → high-performance writes.
  9. Writes in Hermes (Broadcast + Invalidations + TS)
    1. Decentralized: fully distributed write ordering at the endpoints.
    2. Fully concurrent: any replica can coordinate a write, and writes to different objects proceed in parallel.
    3. Fast: writes commit in 1 RTT and never abort.
    Awesome! But what about fault tolerance?
  10. Handling faults in Hermes
    Problem: a failure in the middle of a write can permanently leave a replica in the Invalid state.
    For example, the coordinator of write(A=3) fails after sending Inv(TS) but before the Validations, blocking read(A) at the Invalidated followers.
    Idea: allow any Invalidated replica to replay the write and unblock.
    How? Insight: to replay a write, a replica needs the write's original TS (for ordering) and the write's value.
    The TS is sent with the Invalidation, but the write value is not.
    Solution: send the write value with the Invalidation, i.e., Inv(3, TS) → early value propagation.
    Early value propagation enables write replays: an Invalidated follower re-broadcasts Inv(3, TS), collects Acks, and completes the write.
  11. Hermes recap (Broadcast + Invalidations + TS + early value propagation)
    Strong consistency: through cache-coherence-inspired Invalidations.
    Fault tolerance: write replays via early value propagation.
    High performance: local reads at all replicas; fast, decentralized, fully distributed writes.
    In the paper: protocol details, RMWs, other goodies.
  12. Evaluation
    State-of-the-art hardware testbed: 5 servers, each with 2x 10-core Intel Xeon E5-2630v4 CPUs and 56 Gb/s InfiniBand NICs.
    KVS workload: uniform access distribution; a million KV pairs (8 B keys, 32 B values).
    Evaluated protocols: ZAB, CRAQ, Hermes.
  13. Performance
    Throughput (million requests/sec), comparing high-performance writes + local reads (Hermes) vs. concurrent writes + local reads vs. local reads only: Hermes delivers up to 4x and 40% higher throughput than the baselines.
    Write performance matters even at low write ratios: at a 5% write ratio, write latency (normalized to Hermes) is up to 6x higher for the baselines.
    Hermes: highest throughput and lowest latency.
  14. Conclusion
    Hermes = Broadcast + Invalidations + TS + early value propagation.
    Strong consistency; fault tolerance via write replays; high performance: local reads from all replicas plus fast, decentralized, fully concurrent writes.
    hermes-protocol.com: code available, TLA+ verification.
    Need reliability and performance? Choose Hermes!
    Q&A