Slide 1

Slide 1 text

Data Dissemination: from Academia to Industry João Leitão NOVA-LINCS & NOVA University of Lisbon Jordan West BASHO Inc. RICON Las Vegas October 2014

Slide 2

Slide 2 text

Outline 1 Motivation 2 The Academic View: Data Dissemination 3 Embedded Tree: Plumtree Protocol 4 The Industry View: Cluster Metadata Management 5 Cluster Metadata Management In Riak 6 The Academic View: When One Tree is not Enough 7 The Thicket Protocol 8 The Industry View: Improving Cluster Metadata 9 The Future of Cluster Metadata in Riak 10 Summary

Slide 4

Slide 4 text

Motivation Data Dissemination Classical Distributed Systems Challenge How to disseminate information across a large number of participants? Some intuitive requirements: Reliable. Efficient. Scalable.

Slide 6

Slide 6 text

Motivation Data Dissemination Applications Notification systems. Streaming multimedia content. Cluster Management. In practice... When I started to think about this problem, I was mostly focused on peer-to-peer systems.

Slide 9

Slide 9 text

The Academic View: Data Dissemination Design Alternatives: One to All Let's start simple... If you have information to disseminate, send it to everyone! One to All

Slide 18

Slide 18 text

The Academic View: Data Dissemination Design Alternatives: One to All One to All Positive Aspects: Straightforward. No redundancy (i.e., each node receives each message a single time). Negative Aspects: Requires each node to know the full membership. Not Scalable (no load distribution). Not Resilient to Faults.

Slide 24

Slide 24 text

The Academic View: Data Dissemination Design Alternatives: Spanning Tree Dealing with Load Distribution Organize participants/nodes in a tree and forward messages across the tree. Spanning Tree

Slide 37

Slide 37 text

The Academic View: Data Dissemination Design Alternatives: Spanning Tree Spanning Tree Positive Aspects: Load Distribution. No redundancy (i.e., each node receives each message a single time). Negative Aspects: Complexity in Managing the Topology (Hinders Scalability). (Still) Not Resilient to Faults.
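
A rough way to see the load-distribution argument is to count how many messages each node has to send. The sketch below is illustrative only: `one_to_all_sends` and `tree_sends` are hypothetical helper names, and the tree is assumed to be a complete k-ary tree laid out in array order, which the slides do not specify.

```python
def one_to_all_sends(n):
    """Messages sent per node when node 0 sends directly to everyone else."""
    return [n - 1] + [0] * (n - 1)


def tree_sends(n, k):
    """Messages sent per node in a complete k-ary tree stored in array order.

    Node i forwards to children k*i + 1 .. k*i + k (only those below n).
    """
    return [max(0, min(k * i + k, n - 1) - k * i) for i in range(n)]


if __name__ == "__main__":
    n, k = 13, 3
    print("one-to-all max load:", max(one_to_all_sends(n)))
    print("tree max load:", max(tree_sends(n, k)))
```

Both strategies send n - 1 messages in total, so neither is redundant; the difference is that the tree spreads the sending across interior nodes instead of concentrating it on the source.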

Slide 45

Slide 45 text

The Academic View: Data Dissemination Design Alternatives: Flood Dealing with Fault Tolerance Organize participants/nodes in a random, highly connected, graph and forward messages across all links. Flood

Slide 57

Slide 57 text

The Academic View: Data Dissemination Design Alternatives: Flood Flood Positive Aspects: Load Distribution. Simple Design, and a random overlay is easier to maintain than a tree. Robust (due to inherent redundancy). Negative Aspects: No longer that efficient (nodes receive and are required to process several copies of each message).
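
The cost of flooding can be made concrete with a small simulation. This is a minimal sketch under stated assumptions: the overlay is a ring plus random extra links (the ring is only there to guarantee connectivity), and `random_overlay` and `flood` are hypothetical names, not part of any protocol implementation discussed in the talk.

```python
import random
from collections import deque


def random_overlay(n, extra_links, seed=0):
    """Build a connected undirected overlay: a ring plus random extra links."""
    rng = random.Random(seed)
    neighbors = {i: set() for i in range(n)}
    for i in range(n):
        j = (i + 1) % n
        neighbors[i].add(j)
        neighbors[j].add(i)
    for _ in range(extra_links):
        a, b = rng.sample(range(n), 2)
        neighbors[a].add(b)
        neighbors[b].add(a)
    return neighbors


def flood(neighbors, source):
    """Forward on every link except the one the message arrived on.

    Returns (nodes reached, total transmissions).
    """
    delivered = {source}
    queue = deque([(source, None)])
    transmissions = 0
    while queue:
        node, sender = queue.popleft()
        for peer in neighbors[node]:
            if peer == sender:
                continue
            transmissions += 1
            if peer not in delivered:
                delivered.add(peer)
                queue.append((peer, node))
    return len(delivered), transmissions


if __name__ == "__main__":
    overlay = random_overlay(50, 100)
    reached, sent = flood(overlay, 0)
    print(reached, "nodes reached with", sent, "transmissions;",
          sent - (len(overlay) - 1), "were redundant")
```

Only n - 1 transmissions are strictly needed to reach n nodes, so everything beyond that is the redundancy that buys robustness.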

Slide 59

Slide 59 text

The Academic View: Data Dissemination Design Alternatives: Gossip Addressing Efficiency Organize participants/nodes in a random, highly connected, graph and have nodes forward each message across a subset (f) of the links. Gossip

Slide 62

Slide 62 text

The Academic View: Data Dissemination Design Alternatives: Gossip Gossip (fanout = 2)

Slide 73

Slide 73 text

The Academic View: Data Dissemination Design Alternatives: Gossip Gossip Positive Aspects: Load Distribution. Simple Design, and a random overlay is easier to maintain than a tree. Robust (due to inherent redundancy). Negative Aspects: Slightly inefficient (but better than flood). No longer predictable (each message will be disseminated across different links).
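
A small simulation illustrates both the bounded cost and the unpredictability of gossip. The sketch below makes a simplifying assumption that is not in the slides: each node picks its `fanout` targets uniformly from the full membership rather than from overlay neighbors. The function name `gossip` and its parameters are hypothetical.

```python
import random


def gossip(n, fanout, source=0, seed=0):
    """Eager push gossip: on first receipt, forward to `fanout` random peers.

    Returns (nodes reached, total transmissions). Coverage is probabilistic,
    so the number of nodes reached can be lower than n.
    """
    rng = random.Random(seed)
    delivered = {source}
    pending = [source]
    transmissions = 0
    while pending:
        node = pending.pop()
        others = [p for p in range(n) if p != node]
        for peer in rng.sample(others, fanout):
            transmissions += 1
            if peer not in delivered:
                delivered.add(peer)
                pending.append(peer)
    return len(delivered), transmissions


if __name__ == "__main__":
    for seed in range(3):
        print(gossip(100, 2, seed=seed))
```

Every node that delivers the message sends exactly `fanout` copies, so the cost is bounded, but different seeds reach different sets of nodes over different links.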

Slide 76

Slide 76 text

The Academic View: Data Dissemination Design Alternatives: Embedded Tree All these design alternatives: One-to-All. Spanning Tree. Flood. Gossip.

Slide 77

Slide 77 text

The Academic View: Data Dissemination Design Alternatives: Embedded Tree All these design alternatives: Spanning Tree. Flood. Gossip.

Slide 78

Slide 78 text

The Academic View: Data Dissemination Design Alternatives: Embedded Tree All these design alternatives: Spanning Tree [Efficiency]. Flood. Gossip.

Slide 79

Slide 79 text

The Academic View: Data Dissemination Design Alternatives: Embedded Tree All these design alternatives: Spanning Tree [Efficiency]. Flood [Robustness]. Gossip [Robustness].

Slide 80

Slide 80 text

The Academic View: Data Dissemination Design Alternatives: Embedded Tree At some point you start to see trees everywhere...

Slide 86

Slide 86 text

The Academic View: Data Dissemination Design Alternatives: Embedded Tree Gossip (fanout = 2)

Slide 91

Slide 91 text

The Academic View: Data Dissemination Design Alternatives: Embedded Tree Observation: The closure of the paths that lead to the first delivery of a message to each node forms a tree. This observation: Led us to design the Plumtree protocol.
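
The observation can be checked on a toy overlay by recording, for each node, who delivered the message to it first. This is only a sketch: the five-node overlay, the random delivery order used to mimic network timing, and the names `first_delivery_tree` and `path_to_root` are all assumptions made for illustration.

```python
import random


def first_delivery_tree(neighbors, source, seed=0):
    """Flood a message and record which sender delivered it first to each node."""
    rng = random.Random(seed)
    parent = {source: None}
    in_flight = [(source, peer) for peer in neighbors[source]]
    while in_flight:
        # Deliver in-flight copies in random order to mimic network timing.
        sender, receiver = in_flight.pop(rng.randrange(len(in_flight)))
        if receiver in parent:
            continue  # redundant copy, not a first delivery
        parent[receiver] = sender
        in_flight.extend(
            (receiver, p) for p in neighbors[receiver] if p != sender
        )
    return parent


def path_to_root(parent, node):
    """Follow first-delivery links back to the source."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


if __name__ == "__main__":
    overlay = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2, 4], 4: [3]}
    tree = first_delivery_tree(overlay, 0)
    for node in overlay:
        print(node, "<-", path_to_root(tree, node))
```

Each node other than the source has exactly one first-delivery link, and following those links always ends at the source, which is what makes the structure a tree.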

Slide 93

Slide 93 text

Embedded Trees: Plumtree Protocol Epidemic Broadcast Trees. João Leitão, José Pereira and Luís Rodrigues. Proceedings of the 26th IEEE International Symposium on Reliable Distributed Systems, Beijing, China, October 2007.

Slide 94

Slide 94 text

Embedded Trees: Plumtree Protocol Gossip modes of operation There are 1001 ways to Gossip... Eager push Nodes send the message payload to randomly selected peers as soon as they receive a message for the first time. Lazy push When a node receives a message for the first time, it only sends that message's identifier to randomly selected peers. If those peers have not received the message, they make an explicit request.
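
The trade-off between the two modes can be expressed as a simple byte count. This is a back-of-the-envelope sketch, not a measurement: the function names, the assumption that a request costs one identifier, and the example sizes are all hypothetical.

```python
def eager_push_bytes(fanout, payload_size):
    """Eager push: the full payload goes to every selected peer."""
    return fanout * payload_size


def lazy_push_bytes(fanout, payload_size, id_size, peers_missing):
    """Lazy push: identifiers go to every peer; payloads only on request."""
    announcements = fanout * id_size
    requests = peers_missing * id_size      # assumed: a request carries the id
    payloads = peers_missing * payload_size
    return announcements + requests + payloads


if __name__ == "__main__":
    print("eager:", eager_push_bytes(4, 1000))
    print("lazy: ", lazy_push_bytes(4, 1000, 16, peers_missing=1))
```

Lazy push saves bandwidth whenever most peers already have the message, at the price of an extra round trip for the peers that do not.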

Slide 97

Slide 97 text

Embedded Trees: Plumtree Protocol Overview Plumtree: push-lazy-push multicast tree. It operates like any gossip protocol: each node gossips with f neighbors on top of a random overlay. Decentralized. Creates (and maintains) an embedded tree on a random overlay. It combines two distinct gossip strategies: Eager Push: In the random overlay links that belong to the embedded spanning tree. Lazy Push: In the remaining random overlay links. TCP is used between nodes as it offers better reliability and an additional fault detection mechanism.

Slide 100

Slide 100 text

Embedded Trees: Plumtree Protocol Building the Tree Intuition: Use the links in the overlay that generated a message delivery. After initialization all links in the overlay are candidates to belong to the spanning tree. When a message is broadcast, all links used to disseminate a redundant message are pruned from the tree. If the random overlay is connected, after the dissemination of a message the tree will cover all nodes.
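
The pruning step can be sketched as a small message-passing simulation. This is a simplified model under stated assumptions: messages are processed from a single FIFO queue, lazy-push announcements are omitted, and the message names `GOSSIP` and `PRUNE` as well as the function `plumtree_broadcast` are illustrative rather than taken from an actual implementation.

```python
from collections import deque


def plumtree_broadcast(neighbors, source):
    """Run one broadcast and prune every link that carried a redundant copy.

    Returns (eager sets, lazy sets, delivered nodes).
    """
    eager = {n: set(peers) for n, peers in neighbors.items()}  # all links start eager
    lazy = {n: set() for n in neighbors}
    delivered = {source}
    queue = deque(("GOSSIP", source, p) for p in sorted(eager[source]))
    while queue:
        kind, sender, receiver = queue.popleft()
        if kind == "PRUNE":
            # The sender saw a redundant copy from us: demote the link.
            eager[receiver].discard(sender)
            lazy[receiver].add(sender)
        elif receiver in delivered:
            # Redundant payload: demote the link and tell the sender to do so too.
            eager[receiver].discard(sender)
            lazy[receiver].add(sender)
            queue.append(("PRUNE", receiver, sender))
        else:
            # First delivery: keep the link eager and forward on eager links.
            delivered.add(receiver)
            for p in sorted(eager[receiver] - {sender}):
                queue.append(("GOSSIP", receiver, p))
    return eager, lazy, delivered


if __name__ == "__main__":
    overlay = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2, 4}, 4: {3}}
    eager, lazy, delivered = plumtree_broadcast(overlay, 0)
    print("eager:", eager)
    print("lazy: ", lazy)
```

On this five-node overlay the two links that carried redundant copies (1-2 and 2-3) end up in the lazy sets, and the four remaining eager links form a spanning tree.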

Slide 112

Slide 112 text

Embedded Trees: Plumtree Protocol Recovering the Tree If a node fails some nodes might become disconnected from the embedded tree. When a node receives an announcement for a message it has not received, it starts a timer. If the timer expires before the message arrives, the node explicitly requests it from the node that sent the announcement. It also reintroduces the link to that node into the embedded tree. The node that had sent the announcement sends the message when requested. It also reintroduces that link into the embedded tree. At the end of this process the embedded tree is repaired.
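
The repair steps can be sketched as event handlers on a node. This is a minimal model, assuming timers are driven by explicit calls instead of a real clock; the class `Node`, the message names `IHAVE`, `GRAFT`, and `GOSSIP`, and the `outbox` list are illustrative choices, not an actual implementation.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.eager = set()    # peers on embedded tree links
        self.lazy = set()     # peers on the remaining overlay links
        self.received = {}    # message id -> payload
        self.missing = {}     # message id -> announcer, while the timer runs
        self.outbox = []      # (kind, destination, message id, payload)

    def on_ihave(self, sender, msg_id):
        """Announcement for a message: start a timer if we do not have it."""
        if msg_id not in self.received:
            self.missing.setdefault(msg_id, sender)

    def on_gossip(self, sender, msg_id, payload):
        """Payload arrived: store it and cancel any pending timer."""
        self.received[msg_id] = payload
        self.missing.pop(msg_id, None)

    def on_timer_expired(self, msg_id):
        """Timer fired first: request the message and re-add the link."""
        announcer = self.missing.pop(msg_id, None)
        if announcer is None:
            return
        self.lazy.discard(announcer)
        self.eager.add(announcer)
        self.outbox.append(("GRAFT", announcer, msg_id, None))

    def on_graft(self, sender, msg_id):
        """Peer requested a message: re-add the link and send the payload."""
        self.lazy.discard(sender)
        self.eager.add(sender)
        if msg_id in self.received:
            self.outbox.append(("GOSSIP", sender, msg_id, self.received[msg_id]))


if __name__ == "__main__":
    a, b = Node("a"), Node("b")
    a.received["m1"] = "payload"
    b.on_ihave("a", "m1")
    b.on_timer_expired("m1")
    a.on_graft("b", "m1")
    print(b.outbox, a.outbox)
```

After the exchange both endpoints hold the link in their eager sets again, which is how the tree is repaired without any global coordination.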

Slide 116

Slide 116 text

Embedded Trees: Plumtree Protocol Experimental Evaluation Evaluation was conducted with the PeerSim simulator. 10,000-node system. Hundreds and hundreds of simulations. Compare the performance between: A standard (push) Gossip Protocol named Eager. Several configurations of Plumtree.

Slide 122

Slide 122 text

The Industry View: Cluster Metadata Management Global Information, Needed Locally Operational Information Membership, Ownership, Transfers, Capabilities Configuration Replica Counts, Backends, Is Feature X Enabled? Monitoring Data Transfer Statistics, Background Work Tracking, Self-Monitoring

Slide 123

Slide 123 text

The Industry View: Cluster Metadata Management Requirements Failure Tolerance Disable malicious clients despite network partitions Add new nodes despite others being down Quick Convergence Become aware of new nodes ASAP Keep monitoring data as fresh as possible Minimal Resource Usage Goal is to speed up, not slow down, foreground work Network is needed for the data! AP typically, CP rarely necessary

Slide 128

Slide 128 text

Cluster Metadata Management In Riak Implementations 0.x Peer Service: Classic Anti-Entropy Protocol Data Dissemination: Same 1.x Peer Service: Static Graph Overlay w/ Anti-Entropy Data Dissemination: Same 2.x Peer Service: Static Graph Overlay w/ Anti-Entropy Data Dissemination: Sender-Based Embedded Trees

Slide 129

Slide 129 text

Cluster Metadata Management In Riak Motivations 0.x to 1.0 Anti-Entropy is Reliable but Slow to Converge Improve Convergence Time, Specifically in Failure Scenarios 1.x to 2.0 Static Graph Overlay/Anti-Entropy have several issues: Too Much Network Overhead Improved Convergence Only Needed for Subset of Data Cold Rumor Mongering Improve Resource Usage and Stability

Slide 130

Slide 130 text

Cluster Metadata Management In Riak Plumtree: From Academia to Industry Connectivity “all nodes should have in their partial views...another correct node” is not an invariant we can maintain System Model Must consider dropped, delayed, duplicated* messages in addition to node failures Topology Peer Service is Fully Connected (as is Erlang) Membership Operator-controlled instead of reactive (failure detector is implemented separately from peer service)

Slide 131

Slide 131 text

Cluster Metadata Management In Riak Extending Plumtree What happens if Lazy Messages are Dropped? Add to Outstanding Set. Remove on Graft or Ack Handling Extreme Failures Back protocol by random anti-entropy with peers that are not in eager or lazy sets Shared vs. Sender-Based Trees Sender-Based. Minimize overhead w/ Lazy Construction
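
The outstanding-set idea can be sketched as follows. This is a hypothetical illustration, not the Riak code: the class `LazySender`, its method names, and the periodic `resend_tick` that re-announces unconfirmed messages are assumptions; the slide only states that lazy messages are added to an outstanding set and removed on graft or ack.

```python
class LazySender:
    """Tracks lazy announcements until the peer confirms them."""

    def __init__(self):
        self.outstanding = set()  # (peer, message id) not yet confirmed

    def send_lazy(self, peer, msg_id):
        """Announce a message and remember that it is unconfirmed."""
        self.outstanding.add((peer, msg_id))
        return ("IHAVE", peer, msg_id)

    def on_ack(self, peer, msg_id):
        """Peer already has the message: nothing more to do."""
        self.outstanding.discard((peer, msg_id))

    def on_graft(self, peer, msg_id):
        """Peer asked for the payload, so the announcement clearly arrived."""
        self.outstanding.discard((peer, msg_id))

    def resend_tick(self):
        """Assumed periodic step: re-announce anything still unconfirmed."""
        return [("IHAVE", peer, msg_id) for peer, msg_id in sorted(self.outstanding)]


if __name__ == "__main__":
    sender = LazySender()
    sender.send_lazy("node2", "m1")
    sender.send_lazy("node3", "m1")
    sender.on_ack("node2", "m1")
    print(sender.resend_tick())
```

A dropped announcement stays in the set and is retried on the next tick, so a lost lazy message no longer silently disconnects a peer.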

Slide 132

Slide 132 text

Cluster Metadata Management In Riak Measurements Reduced convergence time (measured in LDH) by 66%. Reduced needless message overhead in stable state by 99.9%. Percentage of failures tolerated before needing to rely on anti-entropy: good for small clusters (near 90%) but decreased quickly for larger clusters (under 50%).

Slide 133

Slide 133 text

Cluster Metadata Management In Riak Peer Service Tuning Want to increase percent failures we can tolerate before relying on anti-entropy Fanout of tree given to us by peer service matters For clusters > 10 nodes, fanout = ln(cluster size) + 1 With our use of multiple trees, more peer service tuning may be necessary
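
The fanout rule for larger clusters is easy to tabulate. The sketch below only encodes the formula stated on the slide; the function name `tree_fanout` is hypothetical, and because the slide does not say how the value is rounded or what is used for clusters of 10 nodes or fewer, the function returns the unrounded value and rejects smaller clusters.

```python
import math


def tree_fanout(cluster_size):
    """fanout = ln(cluster size) + 1, stated for clusters larger than 10 nodes."""
    if cluster_size <= 10:
        raise ValueError("the slide gives no rule for clusters of 10 nodes or fewer")
    return math.log(cluster_size) + 1


if __name__ == "__main__":
    for size in (11, 50, 100, 1000):
        print(size, round(tree_fanout(size), 2))
```

Because the fanout grows logarithmically, going from 100 to 1000 nodes adds only a little over two to the value.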

Slide 135

Slide 135 text

The Academic View: When One Tree is not Enough Understanding The Problem We had Plumtree: Efficient. Very Robust. Could even rebalance itself to improve latency (height of the tree). The reality about trees... Trees do not distribute the load evenly among participants.

Slide 139

Slide 139 text

The Academic View: When One Tree is not Enough Understanding The Problem 4 in 13 nodes

Slide 140

Slide 140 text

The Academic View: When One Tree is not Enough Understanding The Problem 4 in 13 nodes 9 in 13 nodes
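
One tree shape that matches these counts is a complete ternary tree of depth 2, where 4 of the 13 nodes are interior (they forward messages) and 9 are leaves (they only receive). The original figure is not available here, so the shape is an assumption, and the helper names below are hypothetical.

```python
def complete_tree(n, k):
    """Children lists for a complete k-ary tree with n nodes in array order."""
    return {
        i: [c for c in range(k * i + 1, k * i + k + 1) if c < n]
        for i in range(n)
    }


def interior_and_leaf_counts(children):
    """Count nodes that forward (interior) and nodes that only receive (leaves)."""
    interior = sum(1 for kids in children.values() if kids)
    return interior, len(children) - interior


if __name__ == "__main__":
    print(interior_and_leaf_counts(complete_tree(13, 3)))
```

In this shape fewer than a third of the nodes carry the entire forwarding load, which is the imbalance that motivates using more than one tree.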

Slide 141

Slide 141 text

The Academic View: When One Tree is not Enough Naive Solutions How hard can this be? We started by experimenting with two simple approaches: NUTS: Naive Unstructured spliTStream BOLTS: Basic multiple Overlay-TreeS

Slide 142

Slide 142 text

The Academic View: When One Tree is not Enough Naive Solutions: NUTS Pick t nodes at random and create t Plumtrees; each rooted at a different node.

Slide 145

Slide 145 text

The Academic View: When One Tree is not Enough Naive Solutions: BOLTS Create t different unstructured overlays and create a Plumtree using each overlay.

Slide 147

Slide 147 text

The Academic View: When One Tree is not Enough Naive Solutions: NUTS & BOLTS

Slide 148

Slide 148 text

The Academic View: When One Tree is not Enough Understanding The Goal From one tree to multiple trees Load balancing: every node should be interior in the same number of trees. Use all resources: each node should be interior in at least one tree. Fault-tolerance: each node should be interior in at most one tree.
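
Taken together, "at least one" and "at most one" mean each node should be interior in exactly one tree. The sketch below checks that property for a set of trees; the function name `interior_counts` and the two four-node example trees are hypothetical and chosen only to make the check concrete.

```python
def interior_counts(trees, nodes):
    """For each node, count the trees in which it has at least one child."""
    counts = {node: 0 for node in nodes}
    for children in trees:
        for node, kids in children.items():
            if kids:
                counts[node] += 1
    return counts


if __name__ == "__main__":
    nodes = [0, 1, 2, 3]
    tree_a = {0: [1], 1: [2, 3], 2: [], 3: []}   # interior nodes: 0 and 1
    tree_b = {2: [3], 3: [0, 1], 0: [], 1: []}   # interior nodes: 2 and 3
    print(interior_counts([tree_a, tree_b], nodes))
```

In the example every node forwards in one tree and is a leaf in the other, so losing any single node interrupts forwarding in only one of the two trees.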

Slide 150

Slide 150 text

The Thicket Protocol Thicket: A Protocol for Building and Maintaining Multiple Trees in a P2P Overlay. Mário Ferreira, João Leitão, and Luís Rodrigues. Proceedings of the 29th IEEE Symposium on Reliable Distributed Systems (SRDS), New Delhi, India, 31 October-3 November 2010.

Slide 151

Slide 151 text

The Thicket Protocol Deploying the Trees We use the same principle as in Plumtree to define trees, however: Overlay links are explicitly added to each tree (instead of removed). Each tree is deployed taking the other trees into consideration. Finding the minimal coordination necessary to achieve our goal is tricky. Full tree coverage cannot be guaranteed at tree deployment time.

Slide 154

Slide 154 text

The Thicket Protocol Recovering Trees Global coverage of trees is provided by the repair mechanism. All messages exchanged among nodes convey information about the load of the sender: how many messages it sends per time unit, and how many child nodes it has in each tree. Tree repair uses less-loaded nodes.
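
A minimal sketch of load-aware repair, assuming each node keeps the last load it heard from every neighbour. The slide only says that repair uses less-loaded nodes; ranking by total child count first and message rate second is an assumption made here for illustration, and `Load` and `pick_repair_parent` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Load:
    """Load information piggybacked on messages (fields as listed on the slide)."""
    msgs_per_time_unit: float
    children_per_tree: dict = field(default_factory=dict)

    def total_children(self):
        return sum(self.children_per_tree.values())


def pick_repair_parent(candidates, known_load):
    """Choose the candidate neighbour with the lowest advertised load.

    Assumed ordering: fewest children overall, then lowest message rate.
    """
    def key(node):
        load = known_load[node]
        return (load.total_children(), load.msgs_per_time_unit)

    return min(candidates, key=key)


known_load = {
    "a": Load(5.0, {"t1": 2, "t2": 1}),
    "b": Load(9.0, {"t1": 1}),
    "c": Load(2.0, {"t1": 1}),
}
chosen = pick_repair_parent(["a", "b", "c"], known_load)
```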

Slide 156

Slide 156 text

The Thicket Protocol Experimental Results Simulated a 10,000-node overlay using PeerSim. Comparison with NUTS and BOLTS (and Plumtree). Same underlying unstructured overlay network. Assumes reliable FIFO channels.

Slide 157

Slide 157 text

The Thicket Protocol Experimental Results

Slide 160

Slide 160 text

The Industry View: Improving Cluster Metadata Measuring Interior Nodes
10 Nodes: 4 of 10 nodes are interior in 3 of 10 trees; 3 of 10 nodes are interior in 4 of 10 trees; 3 of 10 nodes are interior in 5 of 10 trees.
20 Nodes: 4 nodes are interior only when roots; 16 nodes are interior in 5 or more trees.
80 Nodes: 4 nodes are interior only when roots; 36 of 80 nodes are interior in > 25% of trees.

Slide 162

Slide 162 text

The Industry View: Improving Cluster Metadata Improving Interior Node Counts Coupling between fanout, number of trees, and how many times a node must be interior. Height of initial trees constructed by peer service.
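
The coupling mentioned above can be made concrete with a back-of-the-envelope calculation. It assumes idealized trees in which every interior node has exactly `fanout` children, so that `fanout * interior = n_nodes - 1`. Real trees built by the peer service need not look like this, and the function names are invented for this sketch.

```python
def interior_nodes_per_tree(n_nodes, fanout):
    """Interior nodes in an idealized tree.

    Every non-root node is the child of exactly one interior node, and each
    interior node is assumed to have exactly `fanout` children.
    """
    return (n_nodes - 1) / fanout


def avg_interior_trees_per_node(n_nodes, fanout, n_trees):
    """Average number of trees in which a node must act as interior."""
    return n_trees * interior_nodes_per_tree(n_nodes, fanout) / n_nodes


low_fanout = avg_interior_trees_per_node(n_nodes=80, fanout=2, n_trees=10)
high_fanout = avg_interior_trees_per_node(n_nodes=80, fanout=10, n_trees=10)
```

Under this idealization the average is roughly `n_trees / fanout`, so it only falls below one when the fanout is at least the number of trees.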

Slide 164

Slide 164 text

The Future of Cluster Metadata in Riak Expand Usage Exploring Thicket Research Area Migration & New Data Move data from 1.x subsystem to 2.x subsystem Store data we couldn’t before WAN Replication Integrate with Riak’s Multi-Site Replication Abstract Messaging Layer

Slide 168

Slide 168 text

Summary Joining academia and industry is powerful. Riak’s implementation has largely followed academic research, inadvertently... ...even when the research was in a somewhat different field... ...and we have now begun to work together with academia.

Slide 171

Slide 171 text

Thanks. Jordan West - @jrwest João Leitão - @jcaleitao