Slide 1

Slide 1 text

Social content matching in MapReduce
Gianmarco De Francisci Morales, Aristides Gionis, Mauro Sozio
10 March 2011

Slide 2

Slide 2 text

Outline
Application scenario
Problem statement
Algorithms
Experimental evaluation
Conclusions

Slide 3

Slide 3 text

Social Media

Slide 4

Slide 4 text

Social Media

Slide 5

Slide 5 text

Social Media

Slide 6

Slide 6 text

Social Media

Slide 7

Slide 7 text

A needle in a haystack
User-generated content
Users are both consumers and producers
Large volume
Diverse
Difficult to navigate
Difficult to find interesting things

Slide 8

Slide 8 text

Featured Item
Be proactive!
Propose content to users (interesting photos, open questions, etc.)
Increase the engagement of the user with the platform
Effectiveness of the system depends on:
  Relevance (consumer)
  Exposure (producer)

Slide 9

Slide 9 text

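Graph b-matching
Given a set of items $T$, a set of consumers $C$, a bipartite graph between them with edge weights $w(t_i, c_j)$, and capacity constraints $b(t_i)$ and $b(c_j)$
The goal is to find a matching $M = \{(t, c)\}$, where $M(v)$ denotes the edges of $M$ incident to node $v$, such that:
  (i) $|M(t_i)| \le b(t_i)$ for every item $t_i$
  (ii) $|M(c_j)| \le b(c_j)$ for every consumer $c_j$
  (iii) the total value $w(M)$ of the matching is maximized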
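The two capacity constraints and the objective can be checked mechanically. The sketch below is only an illustration: the node names, weights and capacities are made up, and it assumes item and consumer identifiers never collide so that one dictionary can hold both sets of capacities.

```python
from collections import Counter

def is_feasible(matching, capacity):
    """Return True if no node is used by more edges than its capacity b(v)."""
    degree = Counter()
    for item, consumer in matching:
        degree[item] += 1
        degree[consumer] += 1
    return all(degree[node] <= capacity[node] for node in degree)

def matching_value(matching, weights):
    """Total value w(M): the sum of the weights of the chosen edges."""
    return sum(weights[edge] for edge in matching)

# Hypothetical toy instance: two items, two consumers, unit capacities.
weights = {
    ("t1", "c1"): 0.9, ("t1", "c2"): 0.4,
    ("t2", "c1"): 0.7, ("t2", "c2"): 0.2,
}
capacity = {"t1": 1, "t2": 1, "c1": 1, "c2": 1}
matching = [("t1", "c1"), ("t2", "c2")]

print(is_feasible(matching, capacity))
print(matching_value(matching, weights))
```

With unit capacities this reduces to ordinary bipartite matching; larger values of b(v) let a node take part in several edges.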

Slide 10

Slide 10 text

Graph b-matching Items Consumers

Slide 11

Slide 11 text

Graph b-matching Items Consumers

Slide 12

Slide 12 text

Graph b-matching Items Consumers

Slide 13

Slide 13 text

Contributions
Investigate the b-matching problem in the context of social content distribution
Devise a fully MapReduce framework to address it
  StackMR
  GreedyMR
  Use SSJ-2R to build the graph
Large-scale experiments with real-world datasets

Slide 14

Slide 14 text

System overview
The application operates in consecutive phases (each phase lasts from hours to days)
Before the beginning of the i-th phase, the application makes a tentative allocation of items to users
Capacity constraints
  Users: an estimate of the number of logins during the i-th phase
  Items: proportional to a quality assessment, or constant
  $B = \sum_{c \in C} b(c) = \sum_{t \in T} b(t)$

Slide 15

Slide 15 text

Graph building
Edge weight is the cosine similarity between vector representations of the item and the consumer: $w(t_i, c_j) = v(t_i) \cdot v(c_j)$ (a dot product, which equals the cosine similarity for unit-length vectors)
Prune the $O(|T||C|)$ candidate edges by discarding low-weight edges (the goal is to maximize the total weight)
Similarity join between T and C in MapReduce
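As a rough illustration of the edge weighting and pruning, the sketch below computes cosine similarity on sparse bag-of-words dictionaries and keeps only edges at or above a threshold. It is a brute-force loop over all item-consumer pairs, not the MapReduce similarity join used in the talk, and the names `build_edges` and `min_weight` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def build_edges(items, consumers, min_weight):
    """Keep only the candidate edges whose weight reaches min_weight."""
    edges = {}
    for t, vt in items.items():
        for c, vc in consumers.items():
            w = cosine(vt, vc)
            if w >= min_weight:
                edges[(t, c)] = w
    return edges

# Hypothetical tag vectors for two photos and two users.
items = {"p1": {"cat": 1, "pet": 1}, "p2": {"car": 1}}
consumers = {"u1": {"cat": 1}, "u2": {"car": 2}}
edges = build_edges(items, consumers, min_weight=0.5)
print(sorted(edges))
```

The quadratic loop is exactly the cost that a similarity join is meant to avoid, so this only makes sense for toy inputs.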

Slide 16

Slide 16 text

StackMR
Primal-dual formulation of the problem (Integer Linear Programming)
Compute a maximal $\lceil \varepsilon b \rceil$-matching in parallel
Push it onto the stack, update the dual variables and remove covered edges
When there are no more edges, pop the whole stack and include edges in the solution layer by layer
For efficiency, allows the capacity constraints to be violated by a factor of at most $(1+\varepsilon)$

Slide 17

Slide 17 text

StackMR Example

Slide 18

Slide 18 text

StackMR Example

Slide 19

Slide 19 text

StackMR Example

Slide 20

Slide 20 text

StackMR Example

Slide 21

Slide 21 text

StackMR Example

Slide 22

Slide 22 text

StackMR Example

Slide 23

Slide 23 text

StackMR Example

Slide 24

Slide 24 text

StackMR Example

Slide 25

Slide 25 text

StackMR Example

Slide 26

Slide 26 text

StackMR Example

Slide 27

Slide 27 text

StackMR Example

Slide 28

Slide 28 text

StackMR Example

Slide 29

Slide 29 text

StackMR Example

Slide 30

Slide 30 text

GreedyMR
MapReduce adaptation of a classical greedy algorithm (sort the edges by weight, include the current edge if it does not violate the constraints, and update the capacities)
At each round, each node v proposes its b(v) highest-weight edges to its neighbors
The intersection between the proposals of each node and those of its neighbors is included in the solution
Capacities are updated in parallel
Yields a feasible, though possibly suboptimal, solution at each round
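The round structure described on this slide can be simulated sequentially. The sketch below is not the authors' MapReduce implementation: it runs in a single process, the function name `greedy_rounds` is hypothetical, and it assumes ties between equal weights are broken by comparing edge identifiers so that both endpoints rank edges consistently.

```python
from collections import defaultdict

def greedy_rounds(weights, capacity):
    """Simulate propose-and-intersect rounds; return (solution, rounds)."""
    cap = dict(capacity)
    # Edges touching a node with no capacity can never be selected.
    remaining = {e: w for e, w in weights.items()
                 if cap[e[0]] > 0 and cap[e[1]] > 0}
    solution, rounds = [], 0
    while remaining:
        rounds += 1
        incident = defaultdict(list)
        for edge in remaining:
            incident[edge[0]].append(edge)
            incident[edge[1]].append(edge)
        # Each node proposes its b(v) heaviest remaining edges.
        proposals = defaultdict(int)
        for node, node_edges in incident.items():
            node_edges.sort(key=lambda e: (-remaining[e], e))
            for edge in node_edges[:cap[node]]:
                proposals[edge] += 1
        # An edge enters the solution when both endpoints proposed it.
        for edge, count in proposals.items():
            if count == 2:
                solution.append(edge)
                cap[edge[0]] -= 1
                cap[edge[1]] -= 1
                del remaining[edge]
        # Drop edges that now touch a saturated node.
        remaining = {e: w for e, w in remaining.items()
                     if cap[e[0]] > 0 and cap[e[1]] > 0}
    return solution, rounds

# Hypothetical toy instance with unit capacities.
weights = {
    ("t1", "c1"): 0.9, ("t1", "c2"): 0.4,
    ("t2", "c1"): 0.7, ("t2", "c2"): 0.2,
}
capacity = {"t1": 1, "t2": 1, "c1": 1, "c2": 1}
solution, rounds = greedy_rounds(weights, capacity)
print(solution, rounds)
```

Because the heaviest remaining edge is always proposed by both of its endpoints, every round adds at least one edge, and the partial solution after any round respects the capacities.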

Slide 31

Slide 31 text

StackGreedyMR
Hybrid approach
Same structure as StackMR
Uses a greedy heuristic in one of the randomized phases, when choosing the edges to propose
We also tried a proportional heuristic, but its results were always worse than those of the greedy one
Mixed results overall

Slide 32

Slide 32 text

Algorithms summary
Algorithm   Approximation guarantee   MR rounds          Capacity violations
StackMR     ⅙                         poly-logarithmic   1+ε
GreedyMR    ½                         linear             no

Slide 33

Slide 33 text

Datasets
3 datasets: flickr-small, flickr-large, yahoo-answers
User capacities: α-proportional to user activity
Item capacities: proportional to #favorites for flickr, constant for Yahoo! Answers
Weight threshold σ to sparsify the graph
Dataset         |T|          |C|          |E|
flickr-small    2 817        526          550 667
flickr-large    373 373      32 707       1 995 123 827
yahoo-answers   4 852 689    1 149 714    18 847 281 236
User capacity: $b(u) = \alpha \, n(u)$
Item capacity (flickr): $b(p) = f(p) \, \frac{\sum_u \alpha \, n(u)}{\sum_q f(q)}$
Item capacity (Yahoo! Answers): $b(q) = \frac{\sum_u \alpha \, n(u)}{|Q|}$
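One way to read the capacity rules on this slide: user capacities scale activity by α, and item capacities split the same total budget either in proportion to favorites or evenly. The sketch below follows that reading. The function names and the toy numbers are hypothetical, capacities are left as real values because the slide does not say how they are rounded, and the total number of favorites is assumed to be positive.

```python
def user_capacities(activity, alpha):
    """b(u) = alpha * n(u), with n(u) the activity of user u."""
    return {u: alpha * n for u, n in activity.items()}

def proportional_item_capacities(favorites, total_budget):
    """Split the budget across items in proportion to their favorites f(p)."""
    total_favorites = sum(favorites.values())
    return {p: total_budget * f / total_favorites
            for p, f in favorites.items()}

def constant_item_capacities(items, total_budget):
    """Split the budget evenly across all items."""
    return {q: total_budget / len(items) for q in items}

# Hypothetical toy data.
activity = {"u1": 10, "u2": 30}
b_users = user_capacities(activity, alpha=0.5)
budget = sum(b_users.values())

b_photos = proportional_item_capacities({"p1": 1, "p2": 3}, budget)
b_questions = constant_item_capacities(["q1", "q2", "q3", "q4"], budget)
print(b_users, b_photos, b_questions)
```

In both variants the item capacities add up to the user budget, which matches the condition that total consumer capacity equals total item capacity.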

Slide 34

Slide 34 text

Vector representation
Bag-of-words model
flickr users: set of tags used in all photos
flickr items (photos): set of tags
Y! Answers users: set of words used in all answers
Y! Answers items (questions): set of words
Y! Answers: stopword removal, stemming, tf-idf
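A minimal sketch of the bag-of-words preprocessing for the Y! Answers case follows. It is an assumption-laden illustration: the stopword list is a tiny hypothetical one, stemming is omitted to keep the example dependency-free, and the tf-idf variant (raw term count times log(N/df)) is one common choice, not necessarily the one used in the experiments.

```python
import math
from collections import Counter

# Tiny hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "is", "to", "of", "how", "do", "i"}

def tokenize(text):
    """Lowercase, split on whitespace and drop stopwords (no stemming)."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

def tfidf_vectors(docs):
    """Map each document name to a sparse {word: tf * idf} vector."""
    tokens = {name: tokenize(text) for name, text in docs.items()}
    df = Counter()
    for words in tokens.values():
        df.update(set(words))  # document frequency of each word
    n = len(docs)
    vectors = {}
    for name, words in tokens.items():
        tf = Counter(words)
        vectors[name] = {w: count * math.log(n / df[w])
                         for w, count in tf.items()}
    return vectors

docs = {"q1": "how do i train a dog", "q2": "how to train a cat"}
vectors = tfidf_vectors(docs)
print(vectors)
```

Words that occur in every document get weight zero under this variant, so they contribute nothing to the similarity between an item and a consumer.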

Slide 35

Slide 35 text

Measures
Quality = b-matching value
Efficiency = number of MR rounds
Evaluation of capacity violations for StackMR
Evaluation of convergence speed for GreedyMR
Parameter exploration for α and σ

Slide 36

Slide 36 text

No content

Slide 37

Slide 37 text

No content

Slide 38

Slide 38 text

Conclusions
2 algorithms with different trade-offs between result quality and efficiency
StackMR scales to very large datasets, has provable poly-logarithmic complexity and is faster in practice; its capacity violations are negligible
GreedyMR yields higher-quality results, has a ½ approximation guarantee and can be stopped at any time

Slide 39

Slide 39 text

Thanks!