Slide 1

Slide 1 text

HyParView: A Membership Protocol for Reliable Gossip-based Broadcast
João Leitão, José Pereira, Luís Rodrigues
Universidade de Lisboa, TR-07-13, May 2007

Slide 3

Slide 3 text

Introduction: Gossip Protocols

Slide 6

Slide 6 text

Gossip Protocols
• Highly scalable and resilient
  Eases the implementation of reliable broadcast protocols
• Simple protocol
  Select t nodes at random, send the message; on initial receipt, forward to t nodes
• Load distribution
  Distributes the load evenly amongst all of the nodes in the system
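
The simple protocol on this slide can be sketched as a small simulation. This is a minimal sketch under stated assumptions, not the authors' code: the name `gossip_broadcast`, the `rng` parameter, and the choice to draw targets from the full membership are all introduced here for illustration.

```python
import random

def gossip_broadcast(nodes, fanout, origin, rng):
    """Simulate one broadcast; return the set of nodes that received it."""
    delivered = {origin}
    frontier = [origin]
    while frontier:
        sender = frontier.pop()
        # Assumption: full membership, so any other node may be a target.
        candidates = [n for n in nodes if n != sender]
        targets = rng.sample(candidates, min(fanout, len(candidates)))
        for target in targets:
            if target not in delivered:  # forward only on first receipt
                delivered.add(target)
                frontier.append(target)
    return delivered

reached = gossip_broadcast(list(range(1000)), 4, 0, random.Random(42))
print(f"{len(reached)} of 1000 nodes reached")
```

With a small fanout some nodes may be missed, which is why the fanout is normally chosen from the desired reliability.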

Slide 8

Slide 8 text

Full Membership
• Full membership
  Assumes each node knows the full membership when targeting nodes in a gossip round
• High cost
  The cost of keeping the membership up to date can be prohibitive in large networks

Slide 10

Slide 10 text

Partial Membership
• Partial views
  Each node maintains a partial view of the system: a subset of the full membership
• Fanout t
  Randomly select t nodes from the partial view for broadcast

Slide 11

Slide 11 text

A partial view must ensure that random selection from it provides the same resilience as random selection from the full membership.

Slide 14

Slide 14 text

Partial Membership Problems
• Node failures
  Easier for partial views to become disconnected
• Repair
  Repairing the views may take several membership rounds
• Dissemination during failure
  Negative impact on message delivery during failure and repair periods

Slide 15

Slide 15 text

Contributions: HyParView

Slide 16

Slide 16 text

Gossip strategy based on a reliable transport protocol (TCP) that alleviates the need for the gossip protocol to directly address network omissions.

Slide 17

Slide 17 text

Nodes maintain a symmetric active view used to flood the network with messages.

Slide 18

Slide 18 text

TCP is used as a failure detector.

Slide 19

Slide 19 text

Nodes maintain a low-cost passive view that is used to replace failed nodes in the active view.

Slide 20

Slide 20 text

Membership protocol to control promotion into the active view.

Slide 21

Slide 21 text

Smaller fanout but with stronger resilience.

Slide 22

Slide 22 text

Related Work: Partial Views

Slide 24

Slide 24 text

Partial Views
• Partial views
  Small subset of the identifiers in the system; typically of size log(n)
• Membership protocol
  Changes the view in response to dynamic changes in the system

Slide 26

Slide 26 text

Abridged Algorithm
• Join algorithm
  A node creates its own partial view and adds itself to other nodes’ partial views
• Leave algorithm
  A node removes itself from the partial views in the system

Slide 28

Slide 28 text

Overlay Network
• Neighbor relation
  Partial views establish a neighbor relation between nodes in the system
• Directed graph
  The overlay network is a directed graph that captures the neighbor relation between all nodes in the system

Slide 30

Slide 30 text

View Maintenance
• Reactive strategy
  Views change in response to events in the system, such as joins and leaves (e.g. Scamp)
• Cyclic strategy
  Views are updated at a fixed interval, even if the global membership is stable

Slide 31

Slide 31 text

Reactive strategies rely on a failure detector.

Slide 32

Slide 32 text

Given a fast and accurate failure detector, reactive strategies can be faster than cyclic strategies.

Slide 33

Slide 33 text

Related Work: Partial View Properties

Slide 35

Slide 35 text

Connectedness
• Connected
  The overlay should be connected
• Isolated nodes
  If it is not, isolated nodes will not receive any broadcast messages

Slide 37

Slide 37 text

Degree Distribution
• In-degree
  Number of nodes that have a given node in their active view; a measure of reachability
• Out-degree
  Number of nodes in a node’s active view; a measure of contribution and importance
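
As a rough illustration of the two definitions, both degrees can be computed from a mapping of node to view. The `views` dictionary layout and the `degrees` helper are assumptions made for this sketch.

```python
from collections import Counter

def degrees(views):
    """views maps each node to the set of nodes in its view (directed edges)."""
    # Out-degree: how many nodes this node knows about.
    out_degree = {node: len(neighbors) for node, neighbors in views.items()}
    # In-degree: how many views this node appears in.
    in_degree = Counter({node: 0 for node in views})
    for neighbors in views.values():
        for target in neighbors:
            in_degree[target] += 1
    return dict(in_degree), out_degree

in_deg, out_deg = degrees({"a": {"b", "c"}, "b": {"c"}, "c": set()})
print(in_deg, out_deg)
```

In this toy overlay, node "a" has out-degree 2 but in-degree 0, so it contributes to dissemination but can never be reached.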

Slide 40

Slide 40 text

Assuming a uniform distribution of failures, balanced in-degree and out-degree result in better resilience.

Slide 43

Slide 43 text

Average Path Length
• Path
  Set of edges that a message has to cross to reach a node
• Average path length
  Average of the shortest path lengths between all pairs of nodes
• Latency
  Related to message delivery latency, as it reflects the number of hops needed to reach a node
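
One way to compute this metric is a breadth-first search from every node. The sketch below assumes unweighted edges and averages over reachable ordered pairs only; the `views` layout and the function name are illustrative.

```python
from collections import deque

def average_path_length(views):
    """Mean shortest-path length (in hops) over all reachable ordered pairs."""
    total, pairs = 0, 0
    for source in views:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search gives shortest hop counts
            node = queue.popleft()
            for nxt in views.get(node, ()):
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1  # exclude the source itself
    return total / pairs if pairs else 0.0

ring = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
print(average_path_length(ring))
```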

Slide 46

Slide 46 text

Clustering Coefficient
• Coefficient
  Number of edges between a node’s neighbors divided by the maximum possible number of edges between those neighbors
• Redundancy
  Related to the number of redundant messages received during dissemination
• Clustering
  High values of this coefficient make it easier for nodes to become isolated
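
The coefficient of a single node can be sketched as follows. The sketch treats a link in either direction as an edge, which is an assumption; the `views` layout and the function name are also illustrative.

```python
def clustering_coefficient(views, node):
    """Fraction of possible links among a node's neighbors that exist."""
    neighbors = list(views[node])
    k = len(neighbors)
    if k < 2:
        return 0.0  # fewer than two neighbors: no pair to connect
    links = sum(
        1
        for i, u in enumerate(neighbors)
        for v in neighbors[i + 1:]
        # Assumption: a link in either direction counts as an edge.
        if v in views.get(u, ()) or u in views.get(v, ())
    )
    return links / (k * (k - 1) / 2)

triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
print(clustering_coefficient(triangle, 0))
```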

Slide 49

Slide 49 text

Accuracy
• Accuracy
  Number of neighbors that have not failed divided by the total number of neighbors
• Low values
  Low values mean that failed nodes will be targeted for gossip more frequently
• Fanout
  A higher fanout is required to mask these failures
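
The accuracy definition translates directly into a ratio. In this sketch, `alive` stands for ground-truth knowledge of which nodes have not failed (available in a simulator, not to a real node), and returning 0.0 for an empty view is an assumption.

```python
def accuracy(view, alive):
    """Fraction of the nodes in `view` that have not failed."""
    if not view:
        return 0.0  # assumption: an empty view has no useful neighbors
    return sum(1 for node in view if node in alive) / len(view)

print(accuracy({"a", "b", "c", "d"}, alive={"a", "b", "c"}))
```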

Slide 50

Slide 50 text

Related Work: Membership Protocols

Slide 53

Slide 53 text

Scamp (Reactive)
• “PartialView”
  Targets of gossip messages; not fixed in size
• “InView”
  Nodes that messages are received from
• Isolated nodes
  Detected with heartbeats; an isolated node is required to rejoin the cluster

Slide 56

Slide 56 text

Cyclon (Cyclic)
• Partial view
  Nodes maintain a fixed-length partial view
• Random walk
  Nodes join the cluster by performing a random walk of the overlay
• Shuffle
  Periodically, a node shuffles its view with the oldest member of its partial view

Slide 57

Slide 57 text

Related Work: Gossip Protocols

Slide 59

Slide 59 text

NeEM
• NeEM
  “Network Friendly Epidemic Multicast”
• TCP
  Leverages TCP flow control to eliminate correlated message loss

Slide 62

Slide 62 text

CREW
• Flash dissemination
  Epidemic dissemination of data (files, etc.) from multiple sources
• Bandwidth estimation
  TCP is used to estimate the bandwidth available on open connections
• Cache
  Reuses open TCP connections, discovered by a random walk of the overlay, to avoid the penalty of opening a new connection

Slide 63

Slide 63 text

Gossip Reliability

Slide 64

Slide 64 text

Percentage of nodes that deliver the message.

Slide 65

Slide 65 text

100% = atomic broadcast
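
As a small worked example of this metric: the function below computes reliability as a percentage. Restricting the denominator to nodes that are still alive is an assumption of this sketch, as are the names used.

```python
def reliability(delivered, alive):
    """Percentage of alive nodes that delivered a given broadcast."""
    alive = set(alive)
    if not alive:
        return 0.0  # assumption: no alive nodes means nothing to measure
    return 100.0 * len(set(delivered) & alive) / len(alive)

print(reliability(delivered={1, 2, 3}, alive={1, 2, 3, 4}))
```

A value of 100.0 means every alive node delivered the message, which is the atomic broadcast case.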

Slide 66

Slide 66 text

Motivation: HyParView Protocol

Slide 69

Slide 69 text

Motivation
• Fanout
  Constrained by the desired fault tolerance level and the target reliability
• Quality of views
  View quality affects the required fanout value
• TCP and “better” failure detection
  Results in more cost-effective gossip protocols

Slide 70

Slide 70 text

High failure rates can have a strong negative impact on partial view quality.

Slide 71

Slide 71 text

Therefore, we need membership protocols with fast healing properties.

Slide 76

Slide 76 text

Increasing the fanout from 4 to 6 can result in 99% redundant messages, but it is required to reach 99% delivery.

Slide 77

Slide 77 text

Contribution: HyParView Protocol

Slide 80

Slide 80 text

Two Views
• Active view
  Symmetric; size is fanout + 1
• Passive view
  Larger than log(n); used to ensure connectivity during faults
• Minimal overhead
  Connections are not maintained to the members of the passive view

Slide 84

Slide 84 text

Active View
• Overlay
  The active view defines an overlay network
• Symmetric
  Links are symmetric; linked nodes have each other in their active views
• TCP as transport
  TCP is used as the network transmission protocol
• Connection caching
  Connections to active view members are kept open, tested at each step, and used for failure detection

Slide 87

Slide 87 text

Active View Broadcast
• Broadcast
  Active views are flooded with message broadcasts
• Deterministic
  Target selection at each gossip step is deterministic instead of random
• Overlay
  The overlay is created at random from the global membership, through the use of the membership protocol
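
Flooding over symmetric active views can be sketched as below. The sketch assumes that a node does not send a message back to the peer it received it from, and it ignores failures; the `active_views` layout and the `flood` name are illustrative.

```python
def flood(active_views, origin):
    """Flood one message over the active-view overlay.

    Returns the set of nodes that delivered it and the number of sends."""
    delivered = {origin}
    pending = [(origin, None)]  # (node, peer the message came from)
    sent = 0
    while pending:
        node, came_from = pending.pop()
        for peer in active_views[node]:
            if peer == came_from:
                continue  # assumption: never echo back to the sender
            sent += 1
            if peer not in delivered:  # forward only on first receipt
                delivered.add(peer)
                pending.append((peer, node))
    return delivered, sent

ring = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
print(flood(ring, 0))
```

No random choice is made at gossip time: all randomness sits in how the overlay itself was built.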

Slide 90

Slide 90 text

Active View Maintenance
• Join
  Nodes are added to the active view on join
• Failure
  Nodes are removed from the active view on failure
• Implicitly tested
  Nodes are checked for failure at every step of the protocol; this results in very fast failure detection

Slide 95

Slide 95 text

Passive View
• Passive
  Larger, and not used for message dissemination
• Replacement
  Maintained as a list of possible replacement nodes for the active view
• Cyclic strategy
  While the active view is maintained using a reactive strategy, the passive view is maintained using a cyclic strategy*
• Shuffle
  At each interval defined by the cyclic strategy, a “shuffle” is performed with another node in the system
* not completely true

Slide 97

Slide 97 text

Passive View Shuffle
• Selection
  The shuffle list contains the node’s own identifier, a random selection of nodes from its active view, and a selection of members from its passive view
• Increases replacement probability
  Increases the probability that another node’s passive view contains “active” (live) replacement nodes
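
The selection step can be sketched as follows, using the shuffle parameters k_a and k_p that appear in the experiment settings later in the deck. The function name, the `rng` parameter, and the uniform sampling from the passive view are assumptions of this sketch.

```python
import random

def build_shuffle_list(self_id, active, passive, k_a, k_p, rng):
    """Own identifier, up to k_a active-view nodes, up to k_p passive-view nodes."""
    # sorted() turns the sets into sequences, which rng.sample requires.
    from_active = rng.sample(sorted(active), min(k_a, len(active)))
    from_passive = rng.sample(sorted(passive), min(k_p, len(passive)))
    return [self_id] + from_active + from_passive

shuffle_list = build_shuffle_list(
    0, {1, 2, 3, 4, 5}, set(range(10, 40)), k_a=3, k_p=4, rng=random.Random(7)
)
print(shuffle_list)
```

Including live active-view members is what raises the odds that the receiver ends up with usable replacement candidates.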

Slide 100

Slide 100 text

Membership Operations: Join
• Join
  A node joins the cluster through an existing node in the system: the “contact” node
• “Contact” node
  Always accepts the join and adds the new node to its active view, evicting a member if necessary
• Forward join request
  The request is forwarded using a random walk that adds the new node to active views (based on view size, then TTL) and, finally, to a single passive view based on a walk TTL
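
One reading of the forward-join bullet is sketched below as a per-node handler. It is an interpretation, not the authors' pseudocode: the name `on_forward_join`, the `prwl` threshold argument (named after the PRWL setting listed later), and the omission of eviction and of the symmetric link setup are all assumptions.

```python
import random

def on_forward_join(active, passive, new_node, ttl, sender, prwl, rng):
    """Handle one hop of a forwarded join.

    Returns the next hop of the random walk, or None if the walk ends here."""
    # "Based on size, then TTL": accept if this node is nearly alone
    # or if the walk has run out of hops.
    if ttl == 0 or len(active) <= 1:
        active.add(new_node)  # sketch only: eviction when full is omitted
        return None
    if ttl == prwl:
        passive.add(new_node)  # remember the joiner as a future replacement
    candidates = sorted(n for n in active if n != sender)
    return rng.choice(candidates) if candidates else None

active, passive = {1, 2, 3}, set()
next_hop = on_forward_join(active, passive, 99, 3, 1, 3, random.Random(0))
print(next_hop, passive)
```

The caller would decrement the TTL and deliver the request to the returned node.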

Slide 104

Slide 104 text

Active View Maintenance
• Suspicion or knowledge of failure
  Select random peers from the passive view until one is found that can replace the failed node
• Eviction
  Evict a node from the passive view when it cannot be contacted to serve as a replacement*
• Priorities
  Nodes only accept new neighbors if they have an open slot, or if the requesting node has no members in its active view
* can result in permanently isolated nodes
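
The repair loop and the acceptance rule can be sketched together. Everything named here is hypothetical: `try_connect` stands in for opening a TCP connection and sending a neighbor request, and is assumed to return "accepted", "rejected", or "unreachable".

```python
import random

def accepts_neighbor(active_size, max_active, high_priority):
    """Accept if there is an open slot, or if the requester is isolated."""
    return high_priority or active_size < max_active

def repair_active_view(active, passive, failed, try_connect, rng):
    """Replace a failed active-view member with a node from the passive view."""
    active.discard(failed)
    candidates = sorted(passive)
    rng.shuffle(candidates)  # try passive-view members in random order
    for peer in candidates:
        high_priority = len(active) == 0  # no active neighbors left
        result = try_connect(peer, high_priority)
        if result == "unreachable":
            passive.discard(peer)  # evict nodes that cannot be contacted
        elif result == "accepted":
            passive.discard(peer)
            active.add(peer)
            return peer
        # "rejected": the peer is alive, so it stays in the passive view
    return None
```

If every candidate is unreachable the passive view empties out, which is how the isolation noted in the footnote can arise.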

Slide 107

Slide 107 text

Passive View Maintenance
• Shuffle operation
  Also propagated using a random walk
• Exchange selection
  The remote node replies with nodes from its passive view only
• Integration
  Integrate into the passive view only those nodes that are not already contained in the sender’s active view
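
The integration step might look like the sketch below, written from the point of view of the node doing the integrating. The eviction policy when the passive view is full (prefer dropping identifiers that were just sent to the peer, otherwise a random one) is an assumption, as are all names; `max_passive` is assumed to be at least 1.

```python
import random

def integrate_shuffle(self_id, active, passive, received, sent, max_passive, rng):
    """Merge identifiers received in a shuffle into the passive view."""
    for peer in received:
        # Skip ourselves and anything we already track in either view.
        if peer == self_id or peer in active or peer in passive:
            continue
        if len(passive) >= max_passive:
            # Assumption: prefer evicting identifiers we just sent to the peer.
            preferred = sorted(p for p in passive if p in sent)
            victim = rng.choice(preferred or sorted(passive))
            passive.discard(victim)
        passive.add(peer)

passive = {1, 2, 3}
integrate_shuffle(0, {9}, passive, [9, 0, 4], [2], 3, random.Random(5))
print(sorted(passive))
```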

Slide 108

Slide 108 text

View Maintenance

Slide 109

Slide 109 text

Nodes can move from the passive view to the active view in order to fill the active view (in response to node failures, etc.).

Slide 110

Slide 110 text

When a node moves to the passive view, its probability of being included in a shuffle increases, therefore increasing the probability that it will be used as a replacement for a failed node in another node's active view.

Slide 111

Slide 111 text

[Diagram: transitions of a node n between the active and passive views. Edge labels: Join, ForwardJoin; ForwardJoin; Shuffle; Replacement; Failure (after replacement); Evict on failed replacement.]

Slide 112

Slide 112 text

Evaluation: HyParView Protocol

Slide 115

Slide 115 text

Experiment Overview
• Membership protocols
  Evaluates HyParView, Scamp, Cyclon (and CyclonAcked)
• Gossip protocol
  A generic gossip protocol that can be used with any of these membership services
• Clustering
  Nodes are clustered at the beginning of the test, without interleaved membership rounds

Slide 117

Slide 117 text

Experiment Setting
• PeerSim
  PeerSim is used to simulate a 10,000-node cluster
• HyParView
  Active view: 5; passive view: 30; ARWL: 6; PRWL: 3; shuffle k_p: 4; shuffle k_a: 3

Slide 121

Slide 121 text

Effect of Failures
• Resilience
  100% reliability for failure rates below 95%
• Degradation
  Reliability drops to 90% at a 95% failure rate
• Scamp / Cyclon
  Begin experiencing problems at a 50% failure rate

Slide 125

Slide 125 text

Failure Detection
• Improves performance
  Performing view maintenance and failure detection at each step increases how fast the system can react
• Cyclon/Scamp
  No failure detector; membership is only repaired at the next membership interval
• Scamp / Cyclon
  Begin experiencing problems at a 50% failure rate

Slide 127

Slide 127 text

(A)Symmetric Views
• Asymmetric views
  Cyclon results in nodes with outgoing links but no incoming links
• Symmetric views
  Guarantee that a node is reachable by other nodes in the system

Slide 131

Slide 131 text

Good Properties
• Low clustering coefficient
  Harder for nodes to become isolated
• Small average shortest path
  Reduces latency in message delivery
• Balanced degree distribution
  Increased fault tolerance with balanced reachability and importance

Slide 134

Slide 134 text

HyParView Properties
• Low clustering coefficient
  The active view per node is smaller than in alternative protocols
• Small active view
  Results in fewer distinct paths that can be used for dissemination of messages
• Larger average shortest path
  Doesn’t affect latency, because every single link in the overlay can be used to disseminate messages

Slide 139

Slide 139 text

Cyclon/Scamp Degree Distribution
• Wide distribution
  Some very popular nodes, some unknown nodes
• Redundant messages
  High probability of seeing redundant messages because of the distribution
• Missing messages
  Low probability of seeing a message just once

Slide 142

Slide 142 text

HyParView Degree Distribution
• Symmetric views
  Ensure that all nodes are known by the maximum possible number of nodes
• Similar number of deliveries
  With high probability, nodes receive each message approximately the same number of times
• No missing messages
  The probability that a node receives no copy of a message is low

Slide 143

Slide 143 text

Summary: HyParView Protocol

Slide 144

Slide 144 text

Speed of failure detection is extremely important for high availability.

Slide 145

Slide 145 text

Gossip, with reliable transport and failure detection, on a fixed overlay delivers the best possible performance.

Slide 146

Slide 146 text

If the overlay is connected, using all available links for message dissemination aims at 100% delivery.

Slide 147

Slide 147 text

Smaller fanouts can be used if you do not need to worry about masking failures and network omissions.

Slide 148

Slide 148 text

Therefore, a hybrid approach that combines a small active view and a larger (low-cost) passive view, maintained by different strategies, offers better resilience and better resource usage than a single (large) view with a higher fanout.

Slide 149

Slide 149 text

Finally, TCP flow control can cause blockages with slow links, and it may be preferable to consider slow nodes as failed to preserve liveness in the system.

Slide 151

Slide 151 text

Thanks!
Christopher Meiklejohn, @cmeik