HyParView: A Membership Protocol for Reliable Gossip-based Broadcast

Papers We Love, Madrid
May 2016

Christopher Meiklejohn

May 11, 2016

Transcript

  1. HyParView: A Membership Protocol for Reliable Gossip-based Broadcast. João Leitao, José Pereira, Luís Rodrigues. Universidade de Lisboa, TR-07-13, May 2007
  2. None
  3. Introduction: Gossip Protocols
  4-6. Gossip Protocols
    • Highly scalable and resilient: eases the implementation of reliable broadcast protocols
    • Simple protocol: select t nodes at random and send them the message; on first receipt, forward to t nodes
    • Load distribution: distributes the load evenly amongst all of the nodes in the system
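The simple protocol above can be sketched as a small simulation. This is only an illustration under assumed conditions (full membership, no failures, no message loss); the function name `gossip_broadcast` and its parameters are hypothetical and do not come from the talk.

```python
import random

def gossip_broadcast(n, t, seed=0):
    """Simulate one broadcast among n nodes with fanout t.

    Each node forwards the message to t random peers the first time it
    receives it. Returns how many nodes received the message.
    """
    rng = random.Random(seed)
    delivered = {0}                  # node 0 is an arbitrary origin
    frontier = [0]
    while frontier:
        node = frontier.pop()
        peers = [p for p in range(n) if p != node]
        for target in rng.sample(peers, min(t, len(peers))):
            if target not in delivered:      # forward only on first receipt
                delivered.add(target)
                frontier.append(target)
    return len(delivered)
```

Calling `gossip_broadcast(1000, 4)` a few times with different seeds gives a feel for how the fanout t trades redundancy against the chance of missing nodes.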
  7-8. Full Membership
    • Full membership: assumes each node knows the full membership when targeting nodes in a gossip round
    • High cost: the cost of keeping membership up to date can be prohibitive in large networks
  9-10. Partial Membership
    • Partial views: each node maintains a partial view of the system, a subset of the full membership
    • Fanout t: randomly select t nodes from the partial view for broadcast
  11. The partial view must ensure that random selection from it provides the same resiliency as random selection from the entire membership.
  12-14. Partial Membership Problems
    • Node failures: it is easier for partial views to become disconnected
    • Repair: restoring the views may take several rounds of the membership protocol
    • Dissemination during failure: negative impact on message delivery during failure and repair periods
  15. Contributions: HyParView
  16. A gossip strategy based on a reliable transport protocol (TCP), which relieves the gossip protocol of having to address network omissions directly.
  17. Nodes maintain a symmetric active view that is used to flood the network with messages.
  18. TCP is used as a failure detector.
  19. Nodes maintain a low-cost passive view that is used to replace failed nodes in the active view.
  20. A membership protocol controls promotion into the active view.
  21. Smaller fanout, but with stronger resilience.

  22. Related Work: Partial Views
  23-24. Partial Views
    • Partial views: a small subset of the node identifiers in the system; typically of size log(n)
    • Membership protocol: changes the view in response to dynamic changes in the system
  25-26. Abridged Algorithm
    • Join algorithm: a node creates its own partial view and adds itself to others' partial views
    • Leave algorithm: a node removes itself from the partial views in the system
  27-28. Overlay Network
    • Neighbor relation: partial views establish a neighbor relation between nodes in the system
    • Directed graph: the overlay network is a directed graph that captures the neighbor relation between all nodes in the system
  29-30. View Maintenance
    • Reactive strategy: the view changes in response to events occurring in the system, such as join/leave (e.g. Scamp)
    • Cyclic strategy: the view is updated at a fixed interval, even if global membership is stable
  31. Reactive strategies rely on a failure detector.
  32. Reactive strategies can be faster than cyclic strategies, given a fast and accurate failure detector.
  33. Related Work: Partial View Properties
  34-35. Connectedness
    • Connected: the overlay should be connected
    • Isolated nodes: if it is not, isolated nodes will not receive any broadcast messages
  36-37. Degree Distribution
    • In-degree: number of nodes that have a given node in their active view; a measure of reachability
    • Out-degree: number of nodes in a node’s active view; a measure of contribution and importance
  38-40. Assuming a uniform distribution of failures, balanced in-degree and out-degree measurements result in better resilience.
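A minimal sketch of how the two degree measures could be computed from a snapshot of every node's view. The `views` dictionary (node to set of neighbors) is an assumed representation chosen for illustration, not something defined in the talk.

```python
def degree_distribution(views):
    """Return (in_degree, out_degree) dicts for a mapping node -> set of neighbors."""
    out_degree = {node: len(neigh) for node, neigh in views.items()}
    in_degree = {node: 0 for node in views}
    for node, neigh in views.items():
        for peer in neigh:
            # peer is known by `node`, so its reachability goes up by one
            in_degree[peer] = in_degree.get(peer, 0) + 1
    return in_degree, out_degree
```

With symmetric views the two dictionaries are identical, which is one way to see why symmetry yields a balanced distribution.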
  41-43. Average Path Length
    • Path: the set of edges that a message has to cross to reach a node
    • Average path length: the mean of the shortest path lengths between all pairs of nodes
    • Latency: related to message delivery latency, as it reflects the path that must be taken to reach a node
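As a rough sketch, the average path length of a small overlay can be computed with a breadth-first search from every node. The adjacency-dictionary representation and the choice to average over ordered pairs of reachable nodes are assumptions made for this example.

```python
from collections import deque

def average_path_length(graph):
    """Mean shortest-path length, in hops, over all ordered pairs of reachable nodes.

    `graph` maps each node to an iterable of its neighbors.
    """
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:                      # plain BFS from `source`
            node = queue.popleft()
            for peer in graph[node]:
                if peer not in dist:
                    dist[peer] = dist[node] + 1
                    queue.append(peer)
        total += sum(dist.values())
        pairs += len(dist) - 1            # exclude the source itself
    return total / pairs if pairs else 0.0
```

For a three-node line a, b, c this gives 4/3: two pairs at distance 2 and four at distance 1, counted in both directions.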
  44-46. Clustering Coefficient
    • Coefficient: number of edges between a node's neighbors divided by the maximum possible number of edges between those neighbors
    • Redundancy: related to the number of redundant messages that will be received during dissemination
    • Clustering: high values of this coefficient make it easier for nodes to become isolated
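The coefficient can be illustrated with a short sketch for an undirected overlay. The dictionary-of-sets representation and the convention of returning 0.0 for nodes with fewer than two neighbors are assumptions of this example.

```python
from itertools import combinations

def clustering_coefficient(graph, node):
    """Local clustering coefficient of `node` in an undirected graph (dict of sets)."""
    neighbors = graph[node]
    k = len(neighbors)
    if k < 2:
        return 0.0                        # no neighbor pairs to connect
    links = sum(1 for a, b in combinations(neighbors, 2) if b in graph[a])
    return links / (k * (k - 1) / 2)      # edges present / edges possible

def average_clustering(graph):
    """Average of the local coefficients over all nodes."""
    return sum(clustering_coefficient(graph, n) for n in graph) / len(graph)
```

A triangle scores 1.0 (every neighbor pair is linked), while a star scores 0.0.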
  47-49. Accuracy
    • Accuracy: number of a node's neighbors that have not failed divided by its total number of neighbors
    • Low values: failed nodes will be targeted for gossip more frequently
    • Fanout: a higher fanout value is required to mask these failures
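A tiny sketch of the accuracy measure for a single node. Treating an empty view as fully accurate is an assumption of this example, and `alive` stands in for ground-truth knowledge that only a simulator would have.

```python
def accuracy(view, alive):
    """Fraction of the neighbors in `view` that have not failed."""
    if not view:
        return 1.0                        # assumption: an empty view has nothing wrong in it
    return sum(1 for peer in view if peer in alive) / len(view)
```

With a fanout of 4 and an accuracy of 0.75, on average only 3 of the 4 gossip targets are live, which is why a lower accuracy pushes the required fanout up.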
  50. Related Work: Membership Protocols
  51-53. Scamp (Reactive)
    • “PartialView”: targets of gossip messages; not fixed in size
    • “InView”: nodes that messages are received from
    • Isolated nodes: detected with heartbeating; a node is required to rejoin the cluster on isolation
  54-56. Cyclon (Cyclic)
    • Partial view: nodes maintain a fixed-length partial view
    • Random walk: nodes join the cluster by performing a random walk of the overlay
    • Shuffle: periodically, a node shuffles its view with the oldest member of its partial view
  57. Related Work: Gossip Protocols
  58-59. NeEM
    • NeEM: “Network Friendly Epidemic Multicast”
    • TCP: leverages TCP flow control to eliminate correlated message loss
  60-62. CREW
    • Flash dissemination: epidemic dissemination of data from multiple sources (files, etc.)
    • Bandwidth estimation: TCP is used to estimate available bandwidth from available connections
    • Cache: uses open TCP connections, discovered by performing a random walk of the overlay, to avoid the penalty of opening a new connection
  63. Gossip Reliability
  64. Percentage of nodes that deliver the message.
  65. 100% = atomic broadcast
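A minimal sketch of the reliability metric as stated on these slides. Counting only nodes that are still active in the denominator is an assumption of this example, and the function and argument names are illustrative.

```python
def reliability(delivered, active):
    """Percentage of active nodes that delivered the message."""
    active = set(active)
    if not active:
        return 100.0                      # assumption: vacuously reliable with no active nodes
    return 100.0 * len(active & set(delivered)) / len(active)
```

A value of 100.0 corresponds to the atomic broadcast case: every active node delivered the message.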

  66. Motivation: HyParView Protocol
  67-69. Motivation
    • Fanout: constrained by the desired fault tolerance level and target reliability
    • Quality of views: view quality affects the required fanout value
    • TCP and “better” failure detection: results in more cost-effective gossip protocols
  70. High failure rates can have a strong negative impact on partial view quality.
  71. Therefore, we need membership protocols with fast healing properties.

  72. None
  73. None
  74-76. Increasing fanout from 4 to 6 can result in 99% redundant messages, but it is required to reach 99% delivery.
  77. Contribution: HyParView Protocol
  78-80. Two Views
    • Active view: symmetric; of size fanout + 1
    • Passive view: larger than log(n); used to ensure connectivity during faults
    • Minimal overhead: connections are not maintained to the members of the passive view
  81-84. Active View
    • Overlay: the active view defines an overlay network
    • Symmetric: links are symmetric; neighboring nodes have each other in their active views
    • TCP as transport: TCP is used as the network transmission protocol
    • Connection caching: connections are kept open, tested at each step, and used for failure detection of active view members
  85-87. Active View Broadcast
    • Broadcast: message broadcasts are flooded across the active views
    • Deterministic: target selection at each gossip step is deterministic instead of random
    • Overlay: the overlay itself is created at random from the global membership, through the use of the membership protocol
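The flooding step can be sketched as follows. This is a simplified, single-process illustration: it assumes a static, symmetric overlay with no failures, and that a node does not send the message back to the peer it received it from. The `active_views` mapping is a hypothetical representation.

```python
def flood(active_views, origin):
    """Flood one message over symmetric active views.

    Returns (set of nodes that delivered the message, number of messages sent).
    """
    delivered = {origin}
    sent = 0
    pending = [(origin, None)]            # (node, peer it received the message from)
    while pending:
        node, sender = pending.pop()
        for peer in active_views[node]:
            if peer == sender:            # do not echo back to the sender
                continue
            sent += 1
            if peer not in delivered:     # duplicates are dropped on receipt
                delivered.add(peer)
                pending.append((peer, node))
    return delivered, sent
```

On a four-node ring this delivers to every node with five messages, so only one of them is redundant; there is no random target selection anywhere in the broadcast itself.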
  88-90. Active View Maintenance
    • Join: nodes are added to the active view on join
    • Failure: nodes are removed from the active view on failure
    • Implicitly tested: nodes are checked for failure at every step of the protocol; this results in very fast failure detection
  91-95. Passive View
    • Passive: larger, and not used for message dissemination
    • Replacement: the passive view is maintained as a list of possible replacement nodes for the active view
    • Cyclic strategy: while the active view is maintained using a reactive strategy, the passive view is maintained using a cyclic strategy*
    • Shuffle: at each interval defined by the cyclic strategy, a “shuffle” is performed with another node in the system
    * not completely true
  96-97. Passive View Shuffle
    • Selection: the shuffle list contains the node's own identifier, a random selection of nodes from its active view, and members from its passive view
    • Increases replacement probability: increases the probability of having “active” replacement nodes in another node’s passive view
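A small sketch of how a node might assemble its shuffle list. The defaults `k_a=3` and `k_p=4` mirror the shuffle parameters listed in the experiment settings later in the talk; the function name and the choice to sample the passive-view members at random are assumptions of this example.

```python
import random

def build_shuffle_list(self_id, active_view, passive_view, k_a=3, k_p=4, rng=random):
    """Assemble the identifiers a node sends in a shuffle request."""
    # sorted() turns the sets into sequences, which random.sample requires
    from_active = rng.sample(sorted(active_view), min(k_a, len(active_view)))
    from_passive = rng.sample(sorted(passive_view), min(k_p, len(passive_view)))
    return [self_id] + from_active + from_passive
```

Including the sender's own identifier and some of its active neighbors is what seeds other nodes' passive views with peers that are known to be alive.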
  98-100. Membership Operations: Join
    • Join: a node joins the cluster through an existing node in the system, the “contact” node
    • “Contact” node: always accepts the join and adds the new node to its active view, evicting a member if necessary
    • Forward join request: the request is forwarded along a random walk; nodes on the walk add the new node to their active view based on view size, then on the walk's TTL, and finally a single node adds it to its passive view at a point determined by the walk TTL
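The forwarded join can be sketched as a per-node handler. This is a hedged reading of the slide rather than the authors' pseudocode: the exact conditions (accept when the TTL reaches 0 or the active view has at most one member, insert into the passive view when the TTL equals PRWL) are assumptions, as are the dictionary-based node representation and the function name. Eviction from a full active view is omitted.

```python
import random

ARWL, PRWL = 6, 3   # active/passive random walk lengths from the experiment settings

def handle_forward_join(node, new_node, ttl, sender, rng=random):
    """Process a forwarded join at `node` (a dict with 'active' and 'passive' sets).

    Returns (next_hop, new_ttl) if the walk continues, or None if it ends here.
    """
    if ttl == 0 or len(node["active"]) <= 1:
        node["active"].add(new_node)       # walk ends: accept into the active view
        return None
    if ttl == PRWL:
        node["passive"].add(new_node)      # the single passive-view insertion
    candidates = sorted(node["active"] - {sender, new_node})
    if not candidates:
        node["active"].add(new_node)       # assumption: nowhere to forward, so accept
        return None
    return rng.choice(candidates), ttl - 1
```

In this reading, the contact node would start the walk by sending the request to its active view members with the TTL set to ARWL.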
  101-104. Active View Maintenance
    • Suspicion or knowledge of failure: select random peers from the passive view until one is found that can replace the failed node
    • Eviction: evict a peer from the passive view when it can’t be contacted to act as a replacement*
    • Priorities: nodes only accept new nodes as neighbors if they have an open slot, or if the requesting node has no members in its active view
    * can result in permanently isolated nodes
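The replacement procedure and the priority rule can be sketched together. Everything named here is hypothetical: `try_connect` stands in for opening a TCP connection and sending a neighbor request, and is assumed to return one of the strings "accepted", "rejected", or "unreachable".

```python
import random

def accepts_neighbor(active_view, capacity, high_priority):
    """Accept a neighbor request if there is an open slot or the request is high priority."""
    return high_priority or len(active_view) < capacity

def replace_failed(active, passive, failed, try_connect, rng=random):
    """Drop `failed` from the active view and promote a reachable passive peer.

    Returns the promoted peer, or None if no replacement was found.
    """
    active.discard(failed)
    candidates = sorted(passive)
    rng.shuffle(candidates)                   # try passive peers in random order
    for peer in candidates:
        high_priority = len(active) == 0      # requester has no active neighbors left
        outcome = try_connect(peer, high_priority)
        if outcome == "unreachable":
            passive.discard(peer)             # evict peers that cannot be contacted
        elif outcome == "accepted":
            passive.discard(peer)
            active.add(peer)
            return peer
        # "rejected": the peer is alive but full, so it stays in the passive view
    return None
```

The high-priority flag is what lets a node with an empty active view force its way back into the overlay instead of staying isolated.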
  105-107. Passive View Maintenance
    • Shuffle operation: propagated using a random walk as well
    • Exchange selection: the remote node will shuffle from its passive view only
    • Integration: only integrate into the passive view nodes that are not already contained in the sender’s active view
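One possible way to integrate received identifiers into a bounded passive view is sketched below. The filtering against the local node's own identifier and views, the default size of 30 (taken from the experiment settings), and the eviction preference for identifiers that were just sent to the peer are assumptions of this example, not details stated on the slide.

```python
import random

def integrate_shuffle(self_id, active, passive, received, sent, max_passive=30, rng=random):
    """Merge identifiers received in a shuffle into the local passive view."""
    for peer in received:
        if peer == self_id or peer in active or peer in passive:
            continue                          # nothing new to learn from this entry
        if len(passive) >= max_passive:
            # prefer evicting identifiers we just sent to the other node
            victims = sorted(p for p in sent if p in passive) or sorted(passive)
            passive.discard(rng.choice(victims))
        passive.add(peer)
```

Evicting what was just sent keeps the exchange close to a swap, so the two passive views drift apart less than they would with purely random eviction.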
  108. View Maintenance
  109. Nodes can move from the passive view to the active view in order to fill the active view (in response to node failures, etc.).
  110. When a node moves to the passive view, its probability of being included in a shuffle increases, therefore increasing the probability that it will be used as a replacement for a failed node in another's active view.
  111. [Diagram: transitions of a node n between the active and passive views; edge labels: Join, ForwardJoin; ForwardJoin; Shuffle; Replacement; Failure (after replacement); Evict on failed replacement]
  112. Evaluation: HyParView Protocol
  113-115. Experiment Overview
    • Membership protocols: evaluates HyParView, Scamp, Cyclon (and CyclonAcked)
    • Gossip protocol: a generic gossip protocol that can be used with any of these membership services
    • Clustering: the cluster is formed at the beginning of the test, without interleaved membership rounds
  116-117. Experiment Setting
    • PeerSim: PeerSim is used to test a 10,000-node cluster
    • HyParView: active view: 5; passive view: 30; ARWL: 6; PRWL: 3; shuffle k_p: 4; shuffle k_a: 3
  118. None
  119-121. Effect of Failures
    • Resilience: 100% reliability up to 95%
    • Degradation: 90% at 95% failure rates
    • Scamp / Cyclon: begin experiencing problems at 50% failure rate
  122. None
  123-125. Failure Detection
    • Improves performance: performing view maintenance and failure detection at each step increases how fast the system can react
    • Cyclon/Scamp: no failure detector; membership is only repaired at the next membership interval
    • Scamp / Cyclon: begin experiencing problems at 50% failure rate
  126-127. (A)Symmetric Views
    • Asymmetric views: Cyclon results in nodes with outgoing links but no incoming links
    • Symmetric views: guarantee that a node is reachable by other nodes in the system
  128. None
  129-131. Good Properties
    • Low clustering coefficient: harder for nodes to become isolated
    • Small average shortest path: reduces latency in message delivery
    • Balanced degree distribution: increased fault tolerance with balanced reachability and importance
  132-134. HyParView Properties
    • Low clustering coefficient: the active view per node is smaller than in alternative protocols
    • Small active view: results in fewer distinct paths that can be used for dissemination of messages
    • Larger average shortest path: doesn’t affect latency, because every single link in the overlay can be used to disseminate messages
  135. None
  136. None
  137-139. Cyclon/Scamp Degree Distribution
    • Wide distribution: some very popular nodes, some unknown nodes
    • Redundant messages: high probability of seeing redundant messages because of the distribution
    • Missing messages: low probability of seeing a message just once
  140-142. HyParView Degree Distribution
    • Symmetric views: ensure all nodes are known by the maximum possible number of nodes
    • Similar number of deliveries: with high probability, nodes receive a message approximately the same number of times
    • No missing messages: only with low probability will a node fail to receive a single copy of a message
  143. Summary: HyParView Protocol
  144. Speed of failure detection is extremely important for high availability.
  145. Gossip with reliable transport and failure detection, on a fixed overlay, delivers the best possible performance.
  146. If the overlay is connected, using all available links for message dissemination aims at 100% delivery.
  147. Smaller fanouts can be used if you do not need to worry about masking failures and network omissions.
  148. Therefore, a hybrid approach that contains a small active view and a larger (low-cost) passive view, maintained by different strategies, offers better resilience and better resource usage than using a single (large) view with higher fanout.
  149. Finally, TCP flow control can cause blockages with slow links, and it may be preferable to consider slow nodes as failed to preserve liveness in the system.
  150. None
  151. Thanks! Christopher Meiklejohn, @cmeik