Social Content Matching in MapReduce

Presentation of my article at VLDB'11

Transcript

  1. Social content matching in MapReduce
    Gianmarco De Francisci Morales, Aristides Gionis, Mauro Sozio
    10 March 2011
  2. Outline
    Application scenario
    Problem statement
    Algorithms
    Experimental evaluation
    Conclusions

  3. Social Media

  4. Social Media

  5. Social Media

  6. Social Media

  7. A needle in a haystack
    User-generated content
    Users are both consumers and producers
    Large volume
    Diverse
    Difficult to navigate
    Difficult to find interesting things
  8. Featured Item
    Be proactive!
    Propose content to users (interesting photos, open questions, etc.)
    Increase the engagement of the user with the platform
    Effectiveness of the system depends on:
    Relevance (consumer)
    Exposure (producer)
  9. Graph b-matching
    Given a set of items $T$, a set of consumers $C$, a bipartite graph between them, edge weights $w(t_i, c_j)$, and capacity constraints $b(t_i)$ and $b(c_j)$
    The goal is to find a matching $M = \{(t, c)\}$ such that:
    (i) $|M(t_i)| \le b(t_i)$
    (ii) $|M(c_j)| \le b(c_j)$
    (iii) the total value $w(M)$ of the matching is maximized
  10. Graph b-matching Items Consumers

  11. Graph b-matching Items Consumers

  12. Graph b-matching Items Consumers
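The b-matching definition above can be written as an integer linear program. This is only a sketch of the standard formulation: the edge set $E$ and the indicator variables $x_{t,c}$ (1 if edge $(t,c)$ is in the matching $M$, 0 otherwise) are notation assumed here, not taken from the slides.

```latex
\begin{align}
\text{maximize}   \quad & \sum_{(t,c) \in E} w(t,c)\, x_{t,c} \\
\text{subject to} \quad & \sum_{c \,:\, (t,c) \in E} x_{t,c} \le b(t) && \forall t \in T \\
                        & \sum_{t \,:\, (t,c) \in E} x_{t,c} \le b(c) && \forall c \in C \\
                        & x_{t,c} \in \{0, 1\} && \forall (t,c) \in E
\end{align}
```

The two families of inequalities correspond to conditions (i) and (ii), and the objective is the total value $w(M)$ of condition (iii).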

  13. Contributions
    Investigate the b-matching problem in the context of social content distribution
    Devise a fully MapReduce framework to address it: StackMR, GreedyMR
    Use SSJ-2R to build the graph
    Large-scale experiments with real-world datasets
  14. System overview
    The application operates in consecutive phases (each phase lasting from hours to days)
    Before the beginning of the i-th phase, the application makes a tentative allocation of items to users
    Capacity constraints
    User: an estimate of the number of logins during the i-th phase
    Items: proportional to a quality assessment, or constant
    $B = \sum_{c \in C} b(c) = \sum_{t \in T} b(t)$
  15. Graph building
    Edge weight is the cosine similarity between vector representations of the item and the consumer: $w(t_i, c_j) = v(t_i) \cdot v(c_j)$
    Prune the $O(|T||C|)$ candidate edges by discarding low-weight edges (we want to maximize the total weight)
    Similarity join between $T$ and $C$ in MapReduce
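As a rough illustration of the graph-building step, the sketch below computes cosine similarities by brute force and keeps only edges at or above a weight threshold. It is not SSJ-2R and not a MapReduce job: it enumerates all |T||C| pairs on one machine, which is exactly what the similarity join avoids. The function names, the sparse `{term: weight}` dictionaries, and the `>=` comparison against the threshold `sigma` are assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v[term] for term, w in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_u * norm_v)


def build_edges(items, consumers, sigma):
    """Brute-force candidate generation: keep edges with weight >= sigma.

    items and consumers map an identifier to a sparse vector.
    Returns {(item_id, consumer_id): weight}.
    """
    edges = {}
    for t, vt in items.items():
        for c, vc in consumers.items():
            w = cosine(vt, vc)
            if w >= sigma:
                edges[(t, c)] = w
    return edges


if __name__ == "__main__":
    # Toy, made-up vectors; real ones would come from tags or tf-idf weights.
    items = {"t1": {"cat": 1.0, "dog": 1.0}, "t2": {"car": 1.0}}
    consumers = {"c1": {"cat": 1.0}, "c2": {"car": 2.0}}
    print(build_edges(items, consumers, sigma=0.5))
```

Raising `sigma` makes the graph sparser, at the cost of discarding edges that could have contributed to the total matching weight.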
  16. StackMR
    Primal-dual formulation of the problem (Integer Linear Programming)
    Compute a maximal $\lceil \epsilon b \rceil$-matching in parallel
    Push it onto the stack, update the dual variables, and remove covered edges
    When there are no more edges, pop the whole stack and include edges in the solution layer by layer
    For efficiency, allows $(1+\epsilon)$ violations of the capacity constraints
  17. StackMR Example

  18. StackMR Example

  19. StackMR Example

  20. StackMR Example

  21. StackMR Example

  22. StackMR Example

  23. StackMR Example

  24. StackMR Example

  25. StackMR Example

  26. StackMR Example

  27. StackMR Example

  28. StackMR Example

  29. StackMR Example

  30. GreedyMR
    Adaptation to MapReduce of a classical greedy algorithm (sort the edges by weight, include the current edge if it maintains the constraints, and update the capacities)
    At each round, each node v proposes its b(v) highest-weight edges to its neighbors
    The intersection between the proposals of each node and those of its neighbors is included in the solution
    Capacities are updated in parallel
    Yields a feasible suboptimal solution at each round
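The round structure of GreedyMR can be simulated on a single machine as follows. This is a minimal sketch under stated assumptions, not the authors' MapReduce code: each loop iteration stands in for one round, item and consumer identifiers are assumed to be distinct strings, capacities are non-negative integers, ties are broken deterministically by edge identifier, and edges touching a saturated node are dropped after every round. The name `greedy_mr` is made up for this example.

```python
from collections import defaultdict


def greedy_mr(edges, capacity):
    """Simulate GreedyMR rounds sequentially.

    edges: {(item_id, consumer_id): weight}
    capacity: {node_id: non-negative int} for every item and consumer
    Returns the matching as {(item_id, consumer_id): weight}.
    """
    remaining = dict(edges)
    cap = dict(capacity)
    matching = {}
    while remaining:
        # Group the live edges by endpoint (the per-node view of one round).
        incident = defaultdict(list)
        for edge, w in remaining.items():
            incident[edge[0]].append((edge, w))
            incident[edge[1]].append((edge, w))
        # Each node proposes its b(v) heaviest edges; ties broken by edge id.
        proposals = {}
        for v, inc in incident.items():
            inc.sort(key=lambda ew: (-ew[1], ew[0]))
            proposals[v] = {edge for edge, _ in inc[:cap[v]]}
        # An edge enters the solution only if both endpoints proposed it.
        selected = [e for e in remaining
                    if e in proposals[e[0]] and e in proposals[e[1]]]
        for e in selected:
            matching[e] = remaining.pop(e)
            cap[e[0]] -= 1
            cap[e[1]] -= 1
        # Edges touching a saturated node can never be added; drop them.
        remaining = {e: w for e, w in remaining.items()
                     if cap[e[0]] > 0 and cap[e[1]] > 0}
    return matching


if __name__ == "__main__":
    edges = {("t1", "c1"): 0.9, ("t1", "c2"): 0.8,
             ("t2", "c1"): 0.7, ("t2", "c2"): 0.1}
    capacity = {"t1": 1, "t2": 1, "c1": 1, "c2": 1}
    print(greedy_mr(edges, capacity))
```

In the toy instance, the first round selects (t1, c1) and the second selects (t2, c2), for a total weight of about 1.0, while the best matching, (t1, c2) and (t2, c1), weighs 1.5. The result is suboptimal but feasible, and the partial matching held after any round already satisfies the capacity constraints.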
  31. StackGreedyMR
    Hybrid approach
    Same structure as StackMR
    Uses a greedy heuristic in one of the randomized phases, when choosing the edges to propose
    We also tried a proportional heuristic, but the results were always worse than with the greedy one
    Mixed results overall
  32. Algorithms summary
    Algorithm   Approximation guarantee   MR rounds          Capacity violations
    StackMR     ⅙                         poly-logarithmic   1+ε
    GreedyMR    ½                         linear             no
  33. Datasets
    3 datasets: flickr-small, flickr-large, yahoo-answers
    User capacities: α-proportional to user activity, $b(u) = \alpha\, n(u)$
    Item capacities: proportional to #favorites for flickr, $b(p) = f(p) \, \frac{\sum_u \alpha\, n(u)}{\sum_q f(q)}$; constant for Yahoo! Answers, $b(q) = \frac{\sum_u \alpha\, n(u)}{|Q|}$
    Weight threshold σ to sparsify the graph
    Dataset         |T|         |C|         |E|
    flickr-small    2 817       526         550 667
    flickr-large    373 373     32 707      1 995 123 827
    yahoo-answers   4 852 689   1 149 714   18 847 281 236
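The capacity formulas on this slide can be checked with a few lines of Python. This is a sketch with hypothetical function names and toy numbers; it keeps capacities as real values and leaves out any rounding to integers, which the slides do not specify.

```python
def user_capacities(activity, alpha):
    """b(u) = alpha * n(u), where n(u) is the activity of user u."""
    return {u: alpha * n for u, n in activity.items()}


def item_capacities_proportional(favorites, total_budget):
    """b(p) = f(p) * B / sum_q f(q): the budget is split proportionally to favorites."""
    total_favorites = sum(favorites.values())
    return {p: f * total_budget / total_favorites for p, f in favorites.items()}


def item_capacities_constant(questions, total_budget):
    """b(q) = B / |Q|: every question receives the same share of the budget."""
    return {q: total_budget / len(questions) for q in questions}


if __name__ == "__main__":
    # Toy, made-up numbers for illustration only.
    b_users = user_capacities({"u1": 10, "u2": 30}, alpha=0.5)
    budget = sum(b_users.values())  # B = sum_u alpha * n(u)
    print(item_capacities_proportional({"p1": 1, "p2": 3}, budget))
    print(item_capacities_constant(["q1", "q2", "q3", "q4"], budget))
```

Both item schemes distribute the same total budget as the user side, so the consumer and item capacities sum to the same value B.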
  34. Vector representation
    Bag-of-words model
    flickr users: set of tags used in all photos
    flickr items (photos): set of tags
    Y! Answers users: set of words used in all answers
    Y! Answers items (questions): set of words
    Y! Answers: stopword removal, stemming, tf-idf
  35. Measures
    Quality = b-matching value
    Efficiency = number of MR rounds
    Evaluation of capacity violations for StackMR
    Evaluation of convergence speed for GreedyMR
    Parameter exploration for α and σ
  38. Conclusions
    2 algorithms with different trade-offs between result quality and efficiency
    StackMR scales to very large datasets, has provable poly-logarithmic complexity, and is faster in practice; capacity violations are negligible
    GreedyMR yields higher-quality results, has a ½-approximation guarantee, and can be stopped at any time
  39. Thanks!