HyParView: A Membership Protocol for Reliable Gossip-based Broadcast

Papers We Love, Madrid
May 2016

Christopher Meiklejohn

May 11, 2016

Transcript

  1. HyParView: A Membership Protocol for Reliable Gossip-based Broadcast
     João Leitao, José Pereira, Luís Rodrigues. Universidade de Lisboa, TR-07-13, May 2007
  3. Gossip Protocols
     • Highly scalable and resilient: eases the implementation of reliable broadcast protocols
     • Simple protocol: select t nodes at random and send them the message; on initial receipt, forward to t nodes
     • Load distribution: distributes the load evenly among all of the nodes in the system
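As a rough illustration of the simple protocol on this slide, the sketch below simulates push gossip over full membership. The function name gossip_broadcast, the in-memory node list, and the seeded random generator are assumptions made for the example; they do not come from the talk or the paper.

```python
import random

def gossip_broadcast(nodes, origin, fanout, rng):
    """Simulate push gossip: on first receipt, a node forwards to `fanout` random peers.

    Returns the set of nodes that received the message and the number of
    messages sent (including redundant ones).
    """
    delivered = {origin}
    pending = [origin]          # nodes that still have to forward the message
    messages_sent = 0
    while pending:
        sender = pending.pop()
        peers = [n for n in nodes if n != sender]
        for target in rng.sample(peers, min(fanout, len(peers))):
            messages_sent += 1
            if target not in delivered:   # initial receipt: deliver, then forward later
                delivered.add(target)
                pending.append(target)
    return delivered, messages_sent

nodes = list(range(20))
delivered, sent = gossip_broadcast(nodes, origin=0, fanout=4, rng=random.Random(1))
print(len(delivered), "of", len(nodes), "nodes reached using", sent, "messages")
```

With a fanout of n - 1 every node is reached in a single hop, at the cost of n * (n - 1) messages; smaller fanouts trade delivery probability for fewer redundant messages.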
  5. Full Membership
     • Full membership: assumes each node knows the full membership when targeting nodes in a gossip round
     • High cost: the cost of keeping membership up to date can be prohibitive in large networks
  7. Partial Membership
     • Partial views: maintain a partial view of the nodes in the system, a subset of the full membership
     • Fanout t: randomly select t nodes from the partial view for broadcast
  8. Random selection from the partial view must provide the same resiliency as random selection from the entire membership.
  10. Partial Membership Problems
     • Node failures: it is easier for partial views to become disconnected
     • Repair: repair may take more rounds of the membership protocol
     • Dissemination during failure: negative impact on message delivery during failure and repair periods
  11. Gossip strategy based on a reliable transport protocol (TCP) that alleviates the need for the gossip protocol to directly address network omissions.
  12. Nodes maintain a low-cost passive view that is used to replace failed nodes in the active view.
  13. Partial Views
     • Partial views: a small subset of the node identifiers in the system; typically log(n)
     • Membership protocol: changes the view based on dynamic changes in the system
  14. Abridged Algorithm
     • Join algorithm: create own partial view; add itself to others' partial views
     • Leave algorithm: remove itself from partial views in the system
  16. Overlay Network
     • Neighbor relation: partial views establish a neighbor relation between nodes in the system
     • Directed graph: the overlay network is a directed graph that captures the neighbor relation between all nodes in the system
  17. View Maintenance
     • Reactive strategy: respond to events occurring in the system (join/leave); e.g. Scamp
     • Cyclic strategy: views are updated at an interval, even if global membership is stable
  18. Connectedness
     • Connected: the overlay should be connected
     • Isolated nodes: if it is not, isolated nodes will not receive any broadcast messages
  20. Degree Distribution
     • In-degree: number of nodes that have a node in their active view; a measure of reachability
     • Out-degree: number of nodes in a node’s active view; a measure of contribution and importance
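The two measures can be computed directly from the views. The sketch below assumes the overlay is given as a Python dict mapping each node to the set of nodes in its view; the function name degree_distribution and the three-node example are illustrative only.

```python
def degree_distribution(active_views):
    """Return (in_degree, out_degree) dicts for a mapping node -> set of neighbors."""
    # Out-degree: how many nodes this node knows about.
    out_degree = {node: len(view) for node, view in active_views.items()}
    # In-degree: how many nodes know about this node.
    in_degree = {node: 0 for node in active_views}
    for view in active_views.values():
        for neighbor in view:
            in_degree[neighbor] = in_degree.get(neighbor, 0) + 1
    return in_degree, out_degree

views = {"a": {"b", "c"}, "b": {"a"}, "c": set()}
in_degree, out_degree = degree_distribution(views)
print(in_degree, out_degree)
```

In this example node c is reachable (in-degree 1) but contributes nothing to dissemination (out-degree 0), which is the kind of imbalance the slide refers to.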
  22. Assuming a uniform distribution of failures, balanced in-degree and out-degree measurements result in better resilience.
  25. Average Path Length
     • Path: the set of edges that a message has to cross to reach a node
     • Average path length: the average of the shortest path lengths between all pairs of nodes
     • Latency: related to message delivery latency, as it represents the path that must be taken to reach a node
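A minimal sketch of how the average path length could be measured on a small overlay, using breadth-first search from every node. It assumes an unweighted graph stored as a dict of adjacency lists and averages over reachable ordered pairs only; both choices, and the function name, are assumptions for the example.

```python
from collections import deque

def average_path_length(graph):
    """Mean shortest-path length over all ordered pairs (u, v), u != v, with v reachable from u."""
    total = 0
    pairs = 0
    for source in graph:
        # Breadth-first search gives shortest hop counts in an unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in graph.get(node, ()):
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        total += sum(dist.values())
        pairs += len(dist) - 1      # exclude the source itself
    return total / pairs if pairs else 0.0

line = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(average_path_length(line))
```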
  28. Clustering Coefficient
     • Coefficient: number of edges between a node's neighbors divided by the maximum possible number of edges among those neighbors
     • Redundancy: related to the number of redundant messages that will be received during dissemination
     • Clustering: high values of this coefficient make it easier for nodes to become isolated
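The coefficient can be sketched for an undirected (symmetric) overlay as follows. The dict-of-sets representation, the function names, and the convention of returning 0.0 for nodes with fewer than two neighbors are assumptions for the example.

```python
from itertools import combinations

def clustering_coefficient(graph, node):
    """Edges among the neighbors of `node`, divided by the maximum possible number of such edges."""
    neighbors = graph[node]
    k = len(neighbors)
    if k < 2:
        return 0.0                      # no pair of neighbors to connect
    links = sum(1 for u, v in combinations(neighbors, 2) if v in graph[u])
    return links / (k * (k - 1) / 2)

def average_clustering(graph):
    """Mean of the per-node coefficients over the whole overlay."""
    return sum(clustering_coefficient(graph, n) for n in graph) / len(graph)

triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
star = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
print(average_clustering(triangle), average_clustering(star))
```

In the triangle every neighbor pair is connected, so each message is likely to arrive redundantly; in the star no neighbors are connected to each other.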
  31. Accuracy
     • Accuracy: number of neighbors that have not failed divided by the total number of neighbors
     • Low values: failed nodes will be targeted for gossip more frequently
     • Fanout: a higher fanout value is required to mask these failures
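The accuracy ratio is simple enough to state as code. This sketch assumes the set of failed nodes is known to the observer (as it would be in a simulation); the function name and the error on empty views are choices made for the example.

```python
def accuracy(view, failed):
    """Fraction of the neighbors in `view` that have not failed."""
    if not view:
        raise ValueError("accuracy is undefined for an empty view")
    alive = [peer for peer in view if peer not in failed]
    return len(alive) / len(view)

# One of four neighbors has failed, so a random gossip target is dead 25% of the time.
print(accuracy({"a", "b", "c", "d"}, failed={"d"}))
```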
  33. Scamp (Reactive)
     • “PartialView”: targets of gossip messages; not fixed in size
     • “InView”: nodes that messages are received from
     • Isolated nodes: detected with heartbeating; required to rejoin the cluster on isolation
  35. Cyclon (Cyclic)
     • Partial view: nodes maintain a fixed-length partial view
     • Random walk: nodes join the cluster by performing a random walk of the overlay
     • Shuffle: periodically, the view is shuffled with the oldest member of the partial view
  36. NeEM
     • NeEM: “Network Friendly Epidemic Multicast”
     • TCP: leverages TCP flow control to eliminate correlated message loss
  38. CREW
     • Flash dissemination: epidemic dissemination of data (files, etc.) from multiple sources
     • Bandwidth estimation: TCP is used to estimate the available bandwidth from available connections
     • Cache: uses open TCP connections discovered by performing a random walk of the overlay, to avoid the penalty of opening a new connection
  40. Motivation
     • Fanout: constrained by the desired fault tolerance level and target reliability
     • Quality of views: quality affects the required fanout value
     • TCP and “better” failure detection: results in more cost-effective gossip protocols
  42. Increasing fanout from 4 to 6 can result in 99% redundant messages, but it is required to reach 99% delivery.
  44. Two Views
     • Active view: fanout + 1 in size; symmetric
     • Passive view: larger than log(n); used to ensure connectivity during faults
     • Minimal overhead: connections are not maintained to the members of the passive view
  47. Active View
     • Overlay: the active view defines an overlay network
     • Symmetric: links are symmetric; nodes have each other in their views
     • TCP as transport: TCP is used as the network transmission protocol
     • Connection caching: connections are maintained, tested at each step, and used for failure detection of active view members
  49. Active View Broadcast
     • Broadcast: active views are flooded with message broadcasts
     • Deterministic: target selection at each gossip step is deterministic instead of random
     • Overlay: the overlay is created at random from the global membership, through the use of the membership protocol
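To make the flooding idea concrete, here is a small sketch that forwards a message over every active-view link except the one it arrived on. The dict-of-sets overlay, the function name flood, and the in-process simulation (no real TCP connections) are assumptions for the example.

```python
def flood(active_views, origin):
    """Flood a message over symmetric active views; returns (delivered nodes, messages sent)."""
    delivered = {origin}
    frontier = [(origin, None)]         # (node, peer the message was received from)
    sent = 0
    while frontier:
        node, received_from = frontier.pop()
        # Deterministic targets: every active-view member except the sender.
        for peer in sorted(active_views[node]):
            if peer == received_from:
                continue
            sent += 1
            if peer not in delivered:   # first receipt: deliver and forward
                delivered.add(peer)
                frontier.append((peer, node))
    return delivered, sent

overlay = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}, "d": set()}
print(flood(overlay, "a"))
```

Node d has an empty active view, so it is never reached: flooding delivers to exactly the nodes connected to the origin, which is why the membership protocol must keep the overlay connected.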
  51. Active View Maintenance
     • Join: nodes are added to the active view on join
     • Failure: nodes are removed from the active view on failure
     • Implicitly tested: nodes are checked for failure at every step of the protocol; this results in very fast failure detection
  55. Passive View
     • Passive: larger, and not used for message dissemination
     • Replacement: the passive view keeps a list of possible replacement nodes for the active view
     • Cyclic strategy: while the active view is maintained using a reactive strategy, the passive view is maintained using a cyclic strategy*
     • Shuffle: at each interval defined by the cyclic strategy, a “shuffle” is performed with another node in the system
     * not completely true
  57. Passive View Shuffle
     • Selection: the node's own identifier, a random selection of nodes from the active view, and members of the passive view
     • Increases replacement probability: increases the probability of having “active” replacement nodes in another node’s passive view
  60. Membership Operations: Join
     • Join: when a node joins the cluster, it must join through an existing node in the system: the “contact” node
     • “Contact” node: always accepts the join and adds the new node to its active view, evicting a member if necessary
     • Forward join request: the request is forwarded using a random walk that results in additions to active views (based on size, then TTL), and finally to a single passive view based on a walk TTL
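The join procedure on this slide can be sketched as a single-process simulation. This is one reading of the slide rather than the authors' code: the class name Overlay, the integer node ids, the choice of a random eviction victim, and the handling of dead-end walks are all assumptions. ARWL and PRWL stand for the active and passive random walk lengths.

```python
import random

class Overlay:
    """Toy in-memory overlay with symmetric active views and passive views."""

    def __init__(self, active_size=5, passive_size=30, arwl=6, prwl=3, seed=0):
        self.active_size, self.passive_size = active_size, passive_size
        self.arwl, self.prwl = arwl, prwl
        self.rng = random.Random(seed)
        self.active = {}    # node id -> set of neighbor ids
        self.passive = {}   # node id -> set of backup ids

    def _add_passive(self, node, peer):
        view = self.passive[node]
        if peer == node or peer in self.active[node] or peer in view:
            return
        if len(view) >= self.passive_size:
            view.discard(self.rng.choice(sorted(view)))
        view.add(peer)

    def _add_active(self, node, peer):
        view = self.active[node]
        if peer == node or peer in view:
            return
        if len(view) >= self.active_size:
            # Evict a random member; both ends demote each other to passive.
            victim = self.rng.choice(sorted(view))
            view.discard(victim)
            self.active[victim].discard(node)
            self._add_passive(node, victim)
            self._add_passive(victim, node)
        self.passive[node].discard(peer)
        view.add(peer)

    def link(self, a, b):
        """Create a symmetric active-view link."""
        self._add_active(a, b)
        self._add_active(b, a)

    def join(self, new_node, contact=None):
        self.active[new_node], self.passive[new_node] = set(), set()
        if contact is None:
            return                                  # first node in the system
        self.link(contact, new_node)                # the contact always accepts
        for peer in sorted(self.active[contact] - {new_node}):
            self._forward_join(peer, new_node, self.arwl, contact)

    def _forward_join(self, node, new_node, ttl, sender):
        candidates = sorted(self.active[node] - {sender, new_node})
        if ttl == 0 or len(self.active[node]) == 1 or not candidates:
            self.link(node, new_node)               # walk ends: add to active view
            return
        if ttl == self.prwl:
            self._add_passive(node, new_node)       # single passive-view insertion
        self._forward_join(self.rng.choice(candidates), new_node, ttl - 1, node)

overlay = Overlay(seed=1)
overlay.join(0)
for n in range(1, 60):
    overlay.join(n, contact=overlay.rng.randrange(n))
print(sorted(overlay.active[59]))
```

The sketch omits the repair step that would refill an active view after an eviction, so it only demonstrates how join traffic spreads a new node across views.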
  64. Active View Maintenance
     • Suspicion or knowledge of failure: select random peers from the passive view until one can be found to replace the failed node
     • Eviction: evict a node from the passive view when it cannot be contacted to be a replacement*
     • Priorities: nodes only accept new neighbors if they have an open slot, or if the requesting node has no members in its active view
     * can result in permanently isolated nodes
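The replacement loop and the priority rule can be sketched as two small functions. The callbacks is_reachable and request_neighbor stand in for opening a TCP connection and sending a neighbor request; those names, like the function names, are hypothetical and exist only for this example.

```python
import random

def accepts_neighbor(active_view, active_size, high_priority):
    """Receiver side: accept if there is an open slot, or if the requester is isolated."""
    return high_priority or len(active_view) < active_size

def replace_failed(active, passive, failed, is_reachable, request_neighbor, rng):
    """Initiator side: promote a passive-view member to replace `failed`.

    Returns the promoted peer, or None if no candidate accepted.
    """
    active.discard(failed)
    for candidate in rng.sample(sorted(passive), len(passive)):
        if not is_reachable(candidate):
            passive.discard(candidate)          # evict candidates that cannot be contacted
            continue
        high_priority = len(active) == 0        # isolated nodes ask with high priority
        if request_neighbor(candidate, high_priority):
            passive.discard(candidate)
            active.add(candidate)
            return candidate
        # A rejected request leaves the candidate in the passive view.
    return None

active, passive = {"a"}, {"b", "c", "d"}
promoted = replace_failed(
    active, passive, failed="a",
    is_reachable=lambda peer: peer != "b",
    request_neighbor=lambda peer, high: high,   # this peer only accepts isolated requesters
    rng=random.Random(0),
)
print(promoted, active, passive)
```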
  66. Passive View Maintenance
     • Shuffle operation: propagated using a random walk as well
     • Exchange selection: the remote node will shuffle from its passive view only
     • Integration: only integrate into the passive view nodes that are not already contained in the sender’s active view
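A rough sketch of the two local steps of a shuffle: building the exchange list and integrating the list received in return. The random walk that carries the message is left out. The function names, the rule of skipping identifiers the node already knows, and the preference for evicting identifiers that were just sent are assumptions for the example, not statements from the slides.

```python
import random

def build_shuffle_list(me, active, passive, k_a, k_p, rng):
    """Own identifier, up to k_a active-view members, up to k_p passive-view members."""
    picks_active = rng.sample(sorted(active), min(k_a, len(active)))
    picks_passive = rng.sample(sorted(passive), min(k_p, len(passive)))
    return [me] + picks_active + picks_passive

def integrate(me, active, passive, received, sent, passive_size, rng):
    """Merge received identifiers into the passive view, evicting when it is full."""
    for peer in received:
        if peer == me or peer in active or peer in passive:
            continue                            # skip self and already-known nodes
        if len(passive) >= passive_size:
            # Assumption: drop ids we just sent to the other side first, else a random one.
            droppable = [p for p in sent if p in passive] or sorted(passive)
            passive.discard(rng.choice(droppable))
        passive.add(peer)

rng = random.Random(0)
active = {"a", "b", "c", "d"}
passive = {"p1", "p2", "p3", "p4", "p5", "p6"}
print(build_shuffle_list("me", active, passive, k_a=3, k_p=4, rng=rng))
```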
  67. Nodes can move from the passive view to the active view in order to fill the active view (in response to node failures, etc.).
  68. When a node moves to the passive view, its probability of being included in a shuffle increases, therefore increasing the probability that it will be used as a replacement for a failed node in another node's active view.
  69. [Diagram: movement of a node n between the active and passive views. Labels: Join, ForwardJoin; ForwardJoin; Shuffle; Replacement; Failure (after replacement); Evict on failed replacement]
  71. Experiment Overview
     • Membership protocols: evaluates HyParView, Scamp, Cyclon (and CyclonAcked)
     • Gossip protocol: a generic gossip protocol that can be used with any of these membership services
     • Clustering: clustered at the beginning of the test, without interleaved membership rounds
  72. Experiment Setting
     • PeerSim: PeerSim used to test a 10,000-node cluster
     • HyParView: active view: 5; passive view: 30; ARWL: 6; PRWL: 3; shuffle k_p: 4; shuffle k_a: 3
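For reference, the parameters on this slide can be collected into one configuration object. The class and field names below are invented for the example; only the numeric values come from the slide, and the fanout relation follows the "fanout + 1" sizing of the active view stated earlier in the talk.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HyParViewConfig:
    """Experiment parameters from the slide; field names are illustrative."""
    active_view_size: int = 5
    passive_view_size: int = 30
    arwl: int = 6           # active random walk length
    prwl: int = 3           # passive random walk length
    shuffle_k_a: int = 3    # active-view members sent in a shuffle
    shuffle_k_p: int = 4    # passive-view members sent in a shuffle

    @property
    def fanout(self):
        # The active view is sized as fanout + 1.
        return self.active_view_size - 1

    @property
    def shuffle_list_size(self):
        # Own identifier plus the sampled active and passive members.
        return 1 + self.shuffle_k_a + self.shuffle_k_p

config = HyParViewConfig()
print(config.fanout, config.shuffle_list_size)
```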
  74. Effect of Failures
     • Resilience: 100% reliability at failure rates up to 95%
     • Degradation: 90% reliability at a 95% failure rate
     • Scamp / Cyclon: begin experiencing problems at a 50% failure rate
  77. Failure Detection
     • Improves performance: view maintenance and failure detection performed at each step increase how fast the system can react
     • Cyclon/Scamp: no failure detector; membership is only repaired at the next membership interval
     • Scamp / Cyclon: begin experiencing problems at a 50% failure rate
  78. (A)Symmetric Views
     • Asymmetric views: Cyclon results in nodes with outgoing links but no incoming links
     • Symmetric views: guarantee that a node is reachable by other nodes in the system
  80. Good Properties
     • Low clustering coefficient: harder for nodes to become isolated
     • Small average shortest path: reduces latency in message delivery
     • Balanced degree distribution: increased fault tolerance with balanced reachability and importance
  82. HyParView Properties
     • Low clustering coefficient: the active view per node is smaller than in alternative protocols
     • Small active view: results in fewer distinct paths that can be used for dissemination of messages
     • Larger average shortest path: doesn’t affect latency, because every single link in the overlay can be used to disseminate messages
  84. Cyclon/Scamp Degree Distribution
     • Wide distribution: some very popular nodes, some unknown nodes
     • Redundant messages: high probability of seeing redundant messages because of the distribution
     • Missing messages: low probability of seeing a message just once
  87. HyParView Degree Distribution
     • Symmetric views: ensure all nodes are known by the maximum possible number of nodes
     • Similar number of deliveries: with high probability, nodes receive a message approximately the same number of times
     • No missing messages: only with low probability will a node not receive a single copy of a message
  88. Gossip with reliable transport and failure detection, on a fixed overlay, delivers the best possible performance.
  89. If the overlay is connected, using all available links for message dissemination aims at 100% delivery.
  90. Smaller fanouts can be used if you do not need to worry about masking failures and network omissions.
  91. Therefore, a hybrid approach that combines a small active view and a larger (low-cost) passive view, maintained by different strategies, offers better resilience and better resource usage than a single (large) view with a higher fanout.
  92. Finally, TCP flow control can cause blockages with slow links, and it may be preferable to consider slow nodes as failed to preserve liveness in the system.