Social Content Matching in MapReduce

Presentation of my article at VLDB'11

Transcript

  1. Social content matching
    in MapReduce
    Gianmarco De Francisci Morales
    Aristides Gionis
    Mauro Sozio
    10 March 2011

  2. Outline
    Application scenario
    Problem statement
    Algorithms
    Experimental evaluation
    Conclusions

  3. Social Media

  4. Social Media

  5. Social Media

  6. Social Media

  7. A needle in a haystack
    User generated content
    Users are both consumers and producers
    Large volume
    Diverse
    Difficult to navigate
    Difficult to find interesting things

  8. Featured Item
    Be proactive! Propose content to users
    (interesting photos, open questions, etc.)
    Increase the engagement of the user with the platform
    Effectiveness of the system depends on:
    Relevance (consumer)
    Exposure (producer)

  9. Graph b-matching
    Given a set of items T, a set of consumers C, a bipartite graph on T and C,
    edge weights $w(t_i, c_j)$, and capacity constraints $b(t_i)$ and $b(c_j)$
    The goal is to find a matching M = {(t, c)} such that:
    (i) $|M(t_i)| \le b(t_i)$
    (ii) $|M(c_j)| \le b(c_j)$
    (iii) the total value w(M) of the matching is maximized
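
The three conditions can be made concrete with a small sketch. It assumes a matching is stored as a set of (item, consumer) pairs and that M(v) means the edges of M incident to node v; the function names and the toy data are hypothetical and only illustrate the definition.

```python
from collections import Counter

def matching_value(matching, weights):
    """Total weight w(M) of the edges in the matching (condition iii's objective)."""
    return sum(weights[edge] for edge in matching)

def is_feasible(matching, item_capacity, consumer_capacity):
    """Check conditions (i) and (ii): no node is matched more often than its capacity."""
    item_degree = Counter(t for t, _ in matching)
    consumer_degree = Counter(c for _, c in matching)
    items_ok = all(item_degree[t] <= item_capacity[t] for t in item_degree)
    consumers_ok = all(consumer_degree[c] <= consumer_capacity[c] for c in consumer_degree)
    return items_ok and consumers_ok

# Toy instance (made-up numbers, every capacity equal to 1).
weights = {("t1", "c1"): 0.9, ("t1", "c2"): 0.4,
           ("t2", "c1"): 0.7, ("t2", "c2"): 0.2}
item_capacity = {"t1": 1, "t2": 1}
consumer_capacity = {"c1": 1, "c2": 1}

m_good = {("t1", "c2"), ("t2", "c1")}   # respects every capacity
m_bad = {("t1", "c1"), ("t2", "c1")}    # c1 is matched twice

print(is_feasible(m_good, item_capacity, consumer_capacity))
print(is_feasible(m_bad, item_capacity, consumer_capacity))
print(matching_value(m_good, weights))
```

The sketch only verifies a candidate matching; finding the one with maximum value is the job of the algorithms presented later in the deck.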

  10. Graph b-matching
    Items Consumers

  11. Graph b-matching
    Items Consumers

  12. Graph b-matching
    Items Consumers

  13. Contributions
    Investigate the b-matching problem in the context of
    social content distribution
    Devise a fully-MapReduce framework to address it
    StackMR
    GreedyMR
    Use SSJ-2R to build the graph
    Large scale experiments with real-world datasets

  14. System overview
    The application operates in consecutive phases
    (each phase in the range from hours to days)
    Before the beginning of the ith phase,
    the application makes a tentative allocation of items to users
    Capacity constraints
    User: an estimate of the number of logins during the ith phase
    Items: proportional to a quality assessment or constant
    $B = \sum_{c \in C} b(c) = \sum_{t \in T} b(t)$

  15. Graph building
    Edge weight is the cosine similarity between some
    vector representations of the item and the consumer
    $w(t_i, c_j) = v(t_i) \cdot v(c_j)$
    Prune the O(|T||C|) candidate edges by discarding low-weight
    edges (we want to maximize the total weight)
    Similarity join between T and C in MapReduce
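
As a rough illustration of the edge weights and the pruning step, here is a naive in-memory sketch. It is not the MapReduce similarity join (SSJ-2R) used in the talk: it simply compares all pairs. It assumes sparse vectors stored as dictionaries, normalizes them so that the dot product equals the cosine similarity, and keeps an edge when its weight is at least a threshold `sigma`; the function names, the `>=` comparison, and the toy data are assumptions.

```python
import math

def normalize(vec):
    """Scale a sparse vector (dict: term -> weight) to unit length."""
    norm = math.sqrt(sum(x * x for x in vec.values()))
    return {k: x / norm for k, x in vec.items()} if norm > 0 else {}

def dot(u, v):
    """Dot product of two sparse vectors, iterating over the shorter one."""
    if len(u) > len(v):
        u, v = v, u
    return sum(x * v.get(k, 0.0) for k, x in u.items())

def build_edges(items, consumers, sigma):
    """Naive all-pairs join: keep edge (t, c) when cosine similarity >= sigma."""
    items_n = {t: normalize(v) for t, v in items.items()}
    consumers_n = {c: normalize(v) for c, v in consumers.items()}
    edges = {}
    for t, vt in items_n.items():
        for c, vc in consumers_n.items():
            w = dot(vt, vc)
            if w >= sigma:
                edges[(t, c)] = w
    return edges

# Toy bag-of-tags vectors (made-up data).
items = {"photo1": {"cat": 1, "sunset": 1}, "photo2": {"car": 1}}
consumers = {"alice": {"cat": 2, "dog": 1}, "bob": {"car": 1, "sunset": 1}}

edges = build_edges(items, consumers, sigma=0.6)
print(sorted(edges))
```

The all-pairs loop costs O(|T||C|) comparisons, which is exactly what a similarity join is meant to avoid at scale; the sketch only shows what the output of that join looks like.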

  16. StackMR
    Primal-dual formulation of the problem
    (Integer Linear Programming)
    Compute a maximal ⌈∊b⌉-matching in parallel
    Push it onto the stack, update the dual variables and
    remove covered edges
    When there are no more edges, pop the whole stack
    and include edges in the solution layer by layer
    For efficiency, capacity constraints may be violated by a factor of at most (1+∊)

  17. StackMR Example

  18. StackMR Example

  19. StackMR Example

  20. StackMR Example

  21. StackMR Example

  22. StackMR Example

  23. StackMR Example

  24. StackMR Example

  25. StackMR Example

  26. StackMR Example

  27. StackMR Example

  28. StackMR Example

  29. StackMR Example

  30. GreedyMR
    Adaptation in MR of a classical greedy algorithm
    (sort the edges by weight, include the current edge if
    it maintains the constraints and update the capacities)
    At each round, each node v proposes its b(v) highest-weight
    edges to its neighbors
    An edge is included in the solution when both of its
    endpoints propose it (the intersection of their proposals)
    Capacities are updated in parallel
    Yields a feasible sub-optimal solution at each round
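
The round structure can be sketched as a single-machine simulation. This is not the MapReduce implementation from the talk: the function name `greedy_rounds`, the deterministic tie-breaking by edge identifier, and the removal of saturated nodes at the end of each round are assumptions made for illustration. Node identifiers for items and consumers are assumed to be distinct.

```python
def greedy_rounds(weights, capacity):
    """Simulate proposal rounds in memory.

    weights:  dict {(u, v): w} for the edges of the bipartite graph
    capacity: dict {node: b(node)} with integer capacities
    Returns the selected edges and the number of rounds used.
    """
    cap = dict(capacity)
    # Only edges whose endpoints both have spare capacity are candidates.
    remaining = {e: w for e, w in weights.items()
                 if cap[e[0]] > 0 and cap[e[1]] > 0}
    solution = set()
    rounds = 0
    while remaining:
        rounds += 1
        # Group the remaining edges by endpoint.
        incident = {}
        for e in remaining:
            incident.setdefault(e[0], []).append(e)
            incident.setdefault(e[1], []).append(e)
        # Each node proposes its cap[node] heaviest incident edges
        # (ties broken by edge identifier so that all nodes agree).
        proposals = {}
        for node, edges in incident.items():
            edges.sort(key=lambda e: (-remaining[e], e))
            proposals[node] = set(edges[:cap[node]])
        # Keep the edges proposed by both endpoints.
        accepted = {e for e in remaining
                    if e in proposals[e[0]] and e in proposals[e[1]]}
        solution |= accepted
        for u, v in accepted:
            cap[u] -= 1
            cap[v] -= 1
        # Drop accepted edges and edges touching a saturated node.
        remaining = {e: w for e, w in remaining.items()
                     if e not in accepted and cap[e[0]] > 0 and cap[e[1]] > 0}
    return solution, rounds

# Toy instance (made-up weights, every capacity equal to 1).
weights = {("t1", "c1"): 0.9, ("t1", "c2"): 0.8,
           ("t2", "c1"): 0.7, ("t2", "c2"): 0.1}
capacity = {"t1": 1, "t2": 1, "c1": 1, "c2": 1}
solution, rounds = greedy_rounds(weights, capacity)
print(sorted(solution), rounds)
```

On this toy instance the simulation selects (t1, c1) and then (t2, c2), for a total weight of 1.0, while the optimum pairs t1 with c2 and t2 with c1 for 1.5. The partial solution after every round respects the capacities, which is why the procedure can be stopped early.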

  31. StackGreedyMR
    Hybrid approach
    Same structure as StackMR
    Uses a greedy heuristic in one of the randomized
    phases, when choosing the edges to propose
    We also tried a proportional heuristic, but the
    results were always worse than with the greedy one
    Mixed results overall

  32. Algorithms summary
    Algorithm   Approximation guarantee   MR rounds          Capacity violations
    StackMR                               poly-logarithmic   1+∊
    GreedyMR    ½                         linear             no
  33. Datasets
    3 datasets:
    flickr-small, flickr-large, yahoo-answers
    user capacities:
    α-proportional to user activity
    item capacities:
    proportional to #favorites for flickr,
    constant for Yahoo! Answers
    weight threshold σ to sparsify the graph
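    Dataset         |T|          |C|          |E|
    flickr-small    2 817        526          550 667
    flickr-large    373 373      32 707       1 995 123 827
    yahoo-answers   4 852 689    1 149 714    18 847 281 236
    User capacity: $b(u) = \alpha \, n(u)$
    Photo capacity (flickr): $b(p) = f(p) \, \frac{\sum_u \alpha \, n(u)}{\sum_q f(q)}$
    Question capacity (Yahoo! Answers): $b(q) = \frac{\sum_u \alpha \, n(u)}{|Q|}$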
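
A small sketch of how the capacities on this slide could be computed, assuming n(u) is a user's activity count, f(p) is a photo's number of favorites, and the total user capacity is shared among the items either in proportion to favorites or evenly. The function names and the toy numbers are hypothetical, and the sketch keeps fractional capacities rather than guessing a rounding rule.

```python
def user_capacities(activity, alpha):
    """b(u) = alpha * n(u) for every user u."""
    return {u: alpha * n for u, n in activity.items()}

def proportional_item_capacities(favorites, total_budget):
    """b(p) = f(p) * B / sum_q f(q): share the budget in proportion to favorites."""
    total_favorites = sum(favorites.values())
    return {p: f * total_budget / total_favorites for p, f in favorites.items()}

def constant_item_capacities(questions, total_budget):
    """b(q) = B / |Q|: share the budget evenly among the questions."""
    return {q: total_budget / len(questions) for q in questions}

# Toy data (made-up numbers).
b_users = user_capacities({"u1": 10, "u2": 30}, alpha=0.5)
budget = sum(b_users.values())                      # total user capacity B
b_photos = proportional_item_capacities({"p1": 1, "p2": 3}, budget)
b_questions = constant_item_capacities(["q1", "q2", "q3", "q4"], budget)

print(b_users, b_photos, b_questions)
```

In both cases the item capacities add up to the total user capacity, so the two sides of the bipartite graph offer the same number of slots.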

  34. Vector representation
    Bag-of-words model
    flickr users: set of tags used in all photos
    flickr items (photos): set of tags
    Y! Answers users: set of words used in all answers
    Y! Answers items (questions): set of words
    Y! Answers: stopword removal, stemming, tf-idf

  35. Measures
    Quality = b-matching value
    Efficiency = number of MR rounds
    Evaluation of capacity violations for StackMR
    Evaluation of convergence speed for GreedyMR
    Parameter exploration for α and σ

  36. View Slide

  37. View Slide

  38. Conclusions
    Two algorithms with different trade-offs between result
    quality and efficiency
    StackMR scales to very large datasets, has provable
    poly-logarithmic complexity, and is faster in practice;
    its capacity violations are negligible
    GreedyMR yields higher-quality results, has a ½-approximation
    guarantee, and can be stopped at any time

  39. Thanks!
