
Big Data and the Web: Algorithms for Data Intensive Scalable Computing


Presentation of my Ph.D. defense at IMT


Transcript

  1. Big Data and the Web:
    Algorithms for Data Intensive
    Scalable Computing
    Gianmarco De Francisci Morales
    IMT Institute for Advanced Studies, Lucca
    ISTI-CNR, Pisa
    Supervisors:
    Claudio Lucchese
    Ranieri Baraglia


  2. Big Data...
    “Data whose size forces us to look beyond the tried-
    and-true methods that are prevalent at the
    time” (Jacobs 2009)
    “When the size of the data itself becomes part of the
    problem and traditional techniques for working with
    data run out of steam” (Loukides 2010)
    3V: Volume, Variety, Velocity (Gartner 2011)


  3. ...and the Web
    Largest publicly accessible data source in the world
    Economic, socio-political and scientific importance
    Center of our digital lives, digital footprint
    Data is large, noisy, diverse, fast
    3 main models for data:
    Bags, Graphs, Streams


  4. Big Data Mining
    (Data Mining) Data mining is the process of inspecting
    data in order to extract useful information
    (Data Exhaust) The quality of the information extracted
    benefits from the availability of extensive datasets
    (Data Deluge) The size of these datasets calls for
    parallel solutions: Data Intensive Scalable Computing


  5. DISC
    Data Intensive Scalable Computing systems
    Parallel, scalable, cost effective, fault tolerant
    Not general-purpose: data-parallel, restricted
    computing interface for the sake of performance
    2 main computational models: MapReduce, Streaming


  6. MapReduce
    [Diagram: inputs read from the DFS feed map tasks; the shuffle
    (partition & sort, then merge & group) feeds the reduce tasks,
    which write their outputs back to the DFS]
    Map : [k1, v1] → [k2, v2]
    Reduce : {k2 : [v2]} → [k3, v3]
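The two signatures above can be illustrated with a minimal in-memory simulation (illustrative only, not the Hadoop API):

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map : [k1, v1] -> [k2, v2]
    intermediate = []
    for k1, v1 in inputs:
        intermediate.extend(map_fn(k1, v1))
    # Shuffle: partition & sort, then merge & group by k2
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # Reduce : {k2 : [v2]} -> [k3, v3]
    output = []
    for k2, values in sorted(groups.items()):
        output.extend(reduce_fn(k2, values))
    return output

# Word count, the canonical MapReduce example
docs = [("d1", "A A B C"), ("d2", "B D D")]
counts = run_mapreduce(
    docs,
    map_fn=lambda doc_id, text: [(w, 1) for w in text.split()],
    reduce_fn=lambda word, ones: [(word, sum(ones))],
)
```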


  7. Streaming (Actor Model)
    [Diagram: live streams are routed as events to a network of
    processing elements (PEs); outputs go to an external persister]
    PE : [s1, k1, v1] → [s2, k2, v2]
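The PE abstraction can be illustrated with a toy keyed-counter element (illustrative only, not the S4 API; all names are invented for this sketch):

```python
class CounterPE:
    """Toy processing element: consumes keyed events, emits keyed events.
    PE : <s1, k1, v1> -> [<s2, k2, v2>]"""

    def __init__(self, emit):
        self.totals = {}
        self.emit = emit  # event-routing callback to downstream PEs

    def process(self, stream, key, value):
        # Update per-key state and emit the new total downstream
        self.totals[key] = self.totals.get(key, 0) + value
        self.emit("totals", key, self.totals[key])

out = []
pe = CounterPE(emit=lambda s, k, v: out.append((s, k, v)))
pe.process("words", "A", 1)
pe.process("words", "A", 1)
pe.process("words", "B", 1)
```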



  9. Research Goal
    Design algorithms for Web mining
    that efficiently harness the power of
    Data Intensive Scalable Computing


  10. Contributions
    [Diagram: contributions arranged by algorithm structure and data complexity]
    Contribution                                Algorithm Structure   Data Complexity
    Similarity Self-Join                        MR-Optimized          Bags
    Social Content Matching                     MR-Iterative          Graphs
    Personalized Online News Recommendation     S4-Streaming & MR     Streams & Graphs


  11. Similarity Self-Join
    Discover all those pairs of objects
    whose similarity is above a threshold
    2 new MapReduce algorithms:
    SSJ-2 and SSJ-2R
    Exact solution with efficient pruning
    Test on a large Web corpus from TREC
    4.5x faster than state-of-the-art
    R. Baraglia, G. De Francisci Morales, C. Lucchese
    “Document Similarity Self-Join with MapReduce”
    IEEE International Conference on Data Mining 2010
    R. Baraglia, G. De Francisci Morales, C. Lucchese
    “Scaling out All Pairs Similarity Search with MapReduce”
    ACM Workshop on Large Scale Distributed Systems for IR 2010
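As a reference point for the problem statement, a naive all-pairs baseline can be sketched in a few lines (this is the quadratic computation that SSJ-2/SSJ-2R are designed to avoid; cosine similarity over bags of words, toy documents taken from the example slides later in the deck):

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two term-frequency bags
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm

def similarity_self_join(docs, threshold):
    # Naive baseline: compares every pair of documents
    vecs = {d: Counter(text.split()) for d, text in docs.items()}
    result = {}
    for di, dj in combinations(sorted(vecs), 2):
        s = cosine(vecs[di], vecs[dj])
        if s >= threshold:
            result[(di, dj)] = s
    return result

pairs = similarity_self_join(
    {"d1": "A A B C", "d2": "B D D", "d3": "A B B C"}, threshold=0.8)
```

Here d1 and d3 share A, B and C with a raw dot product of 5, matching the ⟨(d1,d3), 5⟩ output in the example slides.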


  12. Motivation


  13. SSJ-2R


  14–16. SSJ-2R Example
    [Diagram, built up over three slides: the two-phase SSJ-2R pipeline on the
    corpus d1 = "A A B C", d2 = "B D D", d3 = "A B B C".
    Indexing: each map tokenizes a document and emits ⟨term, (doc, frequency)⟩;
    the reduces build inverted lists such as ⟨B, [(d1,1), (d3,2)]⟩. The pruned
    prefixes (d1 "A A", d3 "A", d2 "B") go into the Remainder File, shipped to
    every node via the Distributed Cache.
    Similarity: maps join the postings into partial scores ⟨(d1,d3), 2⟩ and
    ⟨(d1,d3), 1⟩, and forward whole documents as ⟨(d1,!), "A A B C"⟩ and
    ⟨(d3,!), "A B B C"⟩; the reduce sums the partial scores with the
    remainder-file contribution and outputs ⟨(d1,d3), 5⟩]

  17. Experiments
    4 workers with 16 cores, 8 GB memory, 2 TB disks
    WT10G samples
    Metric: running time
    Table II: samples from the TREC WT10G collection
                      D17K         D30K         D63K
    # documents       17,024       30,683       63,126
    # terms           183,467      297,227      580,915
    # all pairs       289,816,576  941,446,489  3,984,891,876
    # similar pairs   94,220       138,816      189,969


  18. Results
    [Plot: running time (seconds, 0–60,000) vs. number of vectors
    (15,000–65,000) for ELSA, VERN, SSJ-2 and SSJ-2R]

  19. Results
    [Log-log plots: distribution of inverted list lengths;
    ELSA max = 6600, SSJ-2R max = 1729]

  20. Results
    [Plot: per-mapper running time (seconds, 0–2000) vs. mapper ID (0–50)
    for ELSA, VERN, SSJ-2R without bucketing, SSJ-2R with bucketing]

  21. Social Content
    Matching
    Select a subset of the edges of a weighted graph,
    maximizing the total weight of the solution,
    while obeying capacity constraints on the nodes
    StackMR: ⅙ approx., poly-logarithmic, (1+∊) violations
    GreedyMR: ½ approx., linear worst case, no violations
    Validation on 2 large datasets coming from real world
    systems: flickr and Yahoo! Answers
    SSJ-2R to build the weighted bipartite graphs
    G. De Francisci Morales, A. Gionis, M. Sozio
    “Social Content Matching in MapReduce”
    International Conference on Very Large Data Bases 2011


  22–25. Motivation
    [Figure-only build-up slides]

  26. Problem: graph b-matching
    Given a set of items T, consumers C, a bipartite graph,
    weights w(ti, cj), and capacity constraints b(ti) and b(cj)
    Find a matching M = {(t, c)} such that
    - |M(ti)| ≤ b(ti)
    - |M(cj)| ≤ b(cj)
    - w(M) is maximized
    [Diagram: bipartite graph of items and consumers]
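The constraints above are easy to state in code; here is a minimal checker (hypothetical helper, not from the thesis) that validates a candidate matching and computes its value w(M):

```python
from collections import Counter

def matching_value(M, w, b):
    """Verify |M(v)| <= b(v) for every node and return w(M).
    M: list of (t, c) pairs; w: {(t, c): weight}; b: {node: capacity}."""
    degree = Counter()
    for t, c in M:
        degree[t] += 1
        degree[c] += 1
    if any(degree[v] > b[v] for v in degree):
        raise ValueError("capacity constraint violated")
    return sum(w[e] for e in M)

# Toy instance: t1 may be matched twice, every other node once
w = {("t1", "c1"): 0.9, ("t1", "c2"): 0.8, ("t2", "c1"): 0.7}
b = {"t1": 2, "t2": 1, "c1": 1, "c2": 1}
value = matching_value([("t1", "c1"), ("t1", "c2")], w, b)
```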


  27. Graph processing in MR
    [Diagram: alternating Map and Reduce phases over the graph]

  28. Experiments
    3 datasets:
    Quality = b-matching value
    Efficiency = number of MR iterations
    Evaluation of capacity violations for StackMR
    Evaluation of convergence speed for GreedyMR
    Dataset         |T|         |C|         |E|
    flickr-small    2,817       526         550,667
    flickr-large    373,373     32,707      1,995,123,827
    yahoo-answers   4,852,689   1,149,714   18,847,281,236


  29–30. [Figure-only slides]

  31. Personalized Online
    News Recommendation
    Deliver personalized news recommendations based on
    a model built from the Twitter profile of users
    Learn personalized ranking function from 3 signals:
    Social, Content, Popularity
    Deep personalization via entity extraction
    Test on 1 month of Y! News + Twitter + Toolbar logs
    Predicts the user click within the top-10 positions 20% of the time
    G. De Francisci Morales, A. Gionis, C. Lucchese
    “From Chatter to Headlines: Harnessing the Real-Time Web for Personalized News Recommendations”
    ACM International Conference on Web Search and Data Mining 2012


  32. Motivation


  33. Why Twitter?
    Timeliness
    Personalization
    [Plot: hourly number of mentions of “Osama Bin Laden” from May 1 to May 3,
    in news, twitter, and clicks]

  34. FEATURED FROM YOUR TWITTER ACCOUNT
    Recommended from Twitter!

  35. System Overview
    Designed to be streaming and lightweight
    Recommendation model is updated in real-time
    [Diagram: user tweets and followee tweets from Twitter, together with news
    articles, feed the T.Rex user model, which produces a personalized ranked
    list of news articles]
    Rτ(u, n) = α · Στ(u, n) + β · Γτ(u, n) + γ · Πτ(n)

  36. Automatic evaluation, aim for precision
    Frame as a click prediction problem, at time τ:
    Given a user model and a stream of published news
    Predict which news the user clicks on
    Clicks from Y! Toolbar and news from Y! News
    1 month of English Tweets + crawled follower network
    Experiments


  37. Evaluation Metrics
    MRR = (1/|Q|) · Σ_{i=1}^{|Q|} 1/r(ni)
    where r(ni) is the rank of the clicked news article
    at the i-th event and Q is the set of tests
    DCG[j] = G[j] if j = 1;  DCG[j] = DCG[j−1] + G[j]/log2 j if j > 1
    where G[j] is the relevance of the document at
    position j in the i-th ranking,
    G[j] = ⌈Jaccard(ni, ni_j) × 5⌉ ÷ 5 (Jaccard similarity quantized into five levels)
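Both metrics are straightforward to compute; a small sketch following the definitions above (log base 2 as in the DCG formula; inputs are made up):

```python
from math import log2

def mrr(ranks):
    # ranks: the rank r(n_i) of the clicked article in each test event
    return sum(1.0 / r for r in ranks) / len(ranks)

def dcg(gains):
    # DCG[1] = G[1]; DCG[j] = DCG[j-1] + G[j]/log2(j) for j > 1
    total = gains[0]
    for j, g in enumerate(gains[1:], start=2):
        total += g / log2(j)
    return total
```

For example, clicked ranks [1, 2, 4] give MRR = (1 + 1/2 + 1/4)/3 = 7/12.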


  38. Results
    Entity overlap between clicked and suggested news article
    [Plot: average DCG vs. rank (1–20) for T.Rex+, T.Rex, Popularity,
    Content, Social, Recency, Click count]

  39. Conclusions
    Tackle the big data problem on the Web by designing
    large scale Web mining algorithms for DISC systems
    Address classical problems like similarity, matching and
    recommendation in the context of Web mining with
    large, heterogeneous datasets
    Provide novel, efficient and scalable solutions for the
    MapReduce and streaming programming models


  40. Conclusions
    Similarity of bags of Web pages:
    SSJ-2 and SSJ-2R 4.5x faster than state-of-the-art
    Importance of careful design of MR algorithms
    Matching of Web 2.0 content on graphs:
    StackMR and GreedyMR iterative MR algorithms with
    provable approximation guarantees
    First solution to b-matching problem in MR
    Scalable computation pattern for graph mining in MR
    Personalized recommendation of news from streams:
    T.Rex predicts user interest from real-time social Web
    Parallelizable online stream + graph mining


  41. Thanks


  42. Similarity Self-Join


  43. SSJ-2 Example


  44–45. SSJ-2 Example
    [Diagram, built up over two slides, on the corpus d1 = "A A B C",
    d2 = "B D D", d3 = "A B B C".
    Indexing: maps emit ⟨term, (doc, frequency)⟩ and reduces build inverted
    lists such as ⟨B, [(d1,1), (d3,2)]⟩.
    Similarity: maps emit partial scores ⟨(d1,d3), 2⟩ and ⟨(d1,d3), 1⟩, the
    shuffle groups them as ⟨(d1,d3), [2,1]⟩, and the reducer fetches
    d1 "A A B C" and d3 "A B B C" from HDFS to score the pruned terms and
    output ⟨(d1,d3), 5⟩]

  46–50. SSJ-2
    [Diagram, built up over five slides: each document di is split at
    boundary bi into a pruned prefix and an indexed suffix of its term
    vector of length |L|]
    • Indexing & Prefix filtering
    • Need to retrieve pruned part
    • Actually, retrieve the whole documents
    • 2 remote (DFS) I/O per pair

  51. SSJ-2R
    ⟨di ; (di,dj), W^A_ij ; (di,dj), W^B_ij ; (di,dk), W^A_ik ; …⟩ group by key di
    ⟨dj ; (dj,dk), W^A_jk ; (dj,dk), W^B_jk ; (dj,dl), W^A_jl ; …⟩ group by key dj
    Remainder file = pruned part of the input
    Pre-load remainder file in memory, no further disk I/O
    Shuffle the input together with the partial similarity scores
    [Diagram: pruned/indexed split of di and dj as in the SSJ-2 slides]

  52–54. SSJ-2R Reducer
    Reduce input:
    (d0,!),[t1,t2,t3 …]   ← whole document, shuffled via MR
    (d0,d1),[w1,w2,w3 …]
    (d0,d2),[w1,w2,w3 …]
    (d0,d3),[w1,w2,w3 …]
    (d0,d4),[w1,w2,w3 …]
    Sort pairs on both IDs, group on first (Secondary Sort)
    Only 1 reducer reads d0
    Remainder file contains only the useful portion of the other
    documents (about 10%), preloaded in memory
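The secondary-sort trick can be simulated in plain Python (hypothetical record layout; None plays the role of the "!" marker, so the document itself sorts first within its group):

```python
from itertools import groupby

# Reduce input records: key (doc_id, other_id); other_id=None stands in for
# the "!" marker and must sort before every real document ID.
records = [
    (("d0", "d3"), [0.2, 0.1]),
    (("d0", None), ["t1", "t2", "t3"]),  # the document d0 itself
    (("d0", "d1"), [0.4]),
]

# Secondary sort: order by (doc_id, marker-last flag, other_id),
# then group on doc_id only
records.sort(key=lambda r: (r[0][0], r[0][1] is not None, r[0][1] or ""))

for doc_id, group in groupby(records, key=lambda r: r[0][0]):
    group = list(group)
    terms = group[0][1]   # first record of the group: d0's own terms
    partials = group[1:]  # then the partial scores, one entry per neighbor
```

In Hadoop this ordering/grouping split is what a custom partitioner and grouping comparator provide; each reducer sees d0 exactly once, before all of its pair records.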


  55. Results
    [Plot: running time (seconds, 0–60,000) vs. number of documents
    (15,000–65,000) for Elsayed et al., SSJ-2, Vernica et al., SSJ-2R]

  56. Partitioning
    [Diagram: the key space of d0 split into K = 2 slices, each shuffled with
    its own copy of (d0,!),[t1,t2,t3 …]]
    • Split in K slices
    • Each reducer needs to load only 1/K of the remainder file
    • Need to replicate the input K times
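A sketch of the slicing idea (the hash function and all names are made up for illustration):

```python
K = 2  # number of slices of the remainder file

def slice_of(doc_id):
    # Deterministic assignment of a document to one of the K slices
    return sum(map(ord, doc_id)) % K

remainder = {"d0": "A A", "d1": "B", "d2": "A"}

# Each reducer is responsible for one slice, so it loads only ~1/K of the
# remainder file into memory; in exchange, the map input is replicated K times.
slices = {s: {d: t for d, t in remainder.items() if slice_of(d) == s}
          for s in range(K)}
```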


  57. Map phase
    [Plots (a), (c): average map running time (seconds, 0–8000) vs. number of
    documents (15,000–65,000) for Elsayed et al., SSJ-2, Vernica et al., SSJ-2R]

  58. Map phase
    [Log-log plots: distribution of inverted list lengths;
    Elsayed et al. max = 6600, SSJ-2R max = 1729]

  59. Map phase
    [Plot: per-mapper map times (threshold = 0.9), time (seconds, 0–2000)
    vs. mapper (0–50) for Elsayed et al., SSJ-2R without bucketing,
    SSJ-2R with bucketing]

  60. Reduce phase
    [Plots (b), (d): average reduce running time (seconds, 0–18,000) vs.
    number of documents (15,000–65,000) for Elsayed et al., SSJ-2,
    Vernica et al., SSJ-2R]

  61. Social Content Matching


  62. System overview
    The application operates in consecutive phases
    (each phase in the range from hours to days)
    Before the beginning of the i-th phase,
    the application makes a tentative allocation of items to users
    Capacity constraints
    Users: an estimate of the number of logins during the i-th phase
    Items: proportional to a quality assessment, or constant
    B = Σ_{c∈C} b(c) = Σ_{t∈T} b(t)

  63. Graph building
    Edge weight is the cosine similarity between vector representations
    of the item and the consumer: w(ti, cj) = v(ti) · v(cj)
    Prune the O(|T||C|) candidate edges by discarding low-weight
    edges (we want to maximize the total weight)
    Similarity join between T and C in MapReduce

  64. StackMR
    Primal-dual formulation of the problem
    (Integer Linear Programming)
    Compute a maximal ⌈∊b⌉-matching in parallel
    Push it in the stack, update dual variables and
    remove covered edges
    When there are no more edges, pop the whole stack
    and include edges in the solution layer by layer
    For efficiency, allows (1+∊) violations on capacity constraints


  65–77. StackMR Example
    [Animation: figure-only build-up slides]

  78. GreedyMR
    Adaptation in MR of a classical greedy algorithm
    (sort the edges by weight, include the current edge if
    it maintains the constraints and update the capacities)
    At each round, each node proposes its b(v) top-weight
    edges to its neighbors
    The intersection between the proposal of each node and the
    ones of its neighbors is included in the solution
    Capacities are updated in parallel
    Yields a feasible sub-optimal solution at each round
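One proposal round can be sketched as follows (a simplification of the description above: the full algorithm also updates the capacities and iterates until no edges remain):

```python
def greedymr_round(adj, w, b):
    """One proposal round: every node offers its b(v) heaviest incident
    edges; an edge is accepted when both endpoints propose it."""
    proposals = {}
    for v, nbrs in adj.items():
        ranked = sorted(nbrs, key=lambda u: -w[frozenset((v, u))])
        proposals[v] = set(ranked[: b[v]])
    return {frozenset((v, u))
            for v, ps in proposals.items() for u in ps
            if v in proposals[u]}

# Toy bipartite instance; every node has capacity 1
adj = {"t1": {"c1", "c2"}, "t2": {"c1"}, "c1": {"t1", "t2"}, "c2": {"t1"}}
w = {frozenset(("t1", "c1")): 3, frozenset(("t1", "c2")): 2,
     frozenset(("t2", "c1")): 1}
b = {"t1": 1, "t2": 1, "c1": 1, "c2": 1}
accepted = greedymr_round(adj, w, b)
```

Only the edge (t1, c1) is proposed by both of its endpoints, so it alone enters the solution in this round.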


  79. StackGreedyMR
    Hybrid approach
    Same structure as StackMR
    Uses a greedy heuristic in one of the randomized
    phases, when choosing the edges to propose
    We also tried a proportional heuristic, but the
    results were always worse than with the greedy one
    Mixed results overall


  80. Algorithms summary
                Approximation guarantee   MR rounds          Capacity violations
    StackMR     ⅙                         poly-logarithmic   1+∊
    GreedyMR    ½                         linear             no

  81. Vector representation
    Bag-of-words model
    flickr users: set of tags used in all photos
    flickr items (photos): set of tags
    Y! Answers users: set of words used in all answers
    Y! Answers items (questions): set of words
    Y! Answers: stopword removal, stemming, tf-idf


  82. Conclusions
    2 algorithms with different trade-offs between result
    quality and efficiency
    StackMR scales to very large datasets, has provable
    poly-logarithmic complexity and is faster in practice,
    capacity violations are negligible
    GreedyMR yields higher quality results, has ½
    approximation and can be stopped at any time


  83. Personalized Online News
    Recommendation


  84. News Get Old Soon
    90% of the clicks happen within 2 days from publication
    [Plot: news-click delay in minutes (log scale, 1–10,000), with the
    number of occurrences and its cumulative distribution]

  85. Builds a user model from Twitter
    Signals from user generated content, social neighbors and
    popularity across Twitter and news
    Deep personalization based on entities (overcomes
    vocabulary mismatch, easier to model relevance)
    Learn a personalized news ranking function
    Pick up candidates from a pool of related or popular fresh
    news, rank them and present top-k to the user
    T.Rex
    Twitter-based news recommendation system


  86. Ranking function is user and time dependent
    Social model + Content model + Popularity model
    Social model weights the content model of neighbors
    by a truncated PageRank on the Twitter network
    Content model measures relatedness of user’s tweet
    stream and news article represented as bag-of-entities
    Popularity model tracks entity popularity by the number
    of mentions in Twitter and news (exponential forgetting)
    Rτ (u, n) = α · Στ (u, n) + β · Γτ (u, n) + γ · Πτ (n)
    Recommendation Model
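The ranking function is a weighted sum of the three signals; a minimal sketch (signal values and weights are made up for illustration):

```python
def r_score(weights, signals):
    # R_tau(u, n) = alpha*Sigma_tau(u, n) + beta*Gamma_tau(u, n) + gamma*Pi_tau(n)
    alpha, beta, gamma = weights
    social, content, popularity = signals
    return alpha * social + beta * content + gamma * popularity

def recommend(candidates, weights, k):
    # candidates: {news_id: (social, content, popularity)}; return top-k by score
    return sorted(candidates, key=lambda n: -r_score(weights, candidates[n]))[:k]

ranking = recommend(
    {"n1": (0.9, 0.1, 0.2), "n2": (0.1, 0.8, 0.3), "n3": (0.0, 0.0, 1.0)},
    weights=(0.2, 0.7, 0.1), k=2)
```

With these weights the content-heavy article n2 outranks the socially relevant n1; the next slides describe how the weights α, β, γ are learned.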


  87. Recommendation Model
    [Diagram: matrices linking users (U), tweets (T), entities and news (N) —
    tweet matrix, news matrix, authorship matrix A, social matrix S, entity
    popularity vector.
    Content Model Γ: relatedness of the user’s tweets to the entities of a
    news article (Wikipedia pages are used as the entity space).
    Social Model Σ: the content models of a user’s neighbors, weighted by
    social influence.
    Popularity Model Π: popularity of the entities in a news article, updated
    by tracking mentions in news and twitter with exponential decay.]
    We model the social-network aspect first. The social component is induced
    by the twitter following relationship: in the social network adjacency
    matrix, S(i, j) is the inverse of the number of users followed by user ui
    if ui follows uj. We also adopt a functional ranking (Baeza-Yates et al.)
    that spreads the interests of a user among its neighbors recursively, up
    to a maximum hop distance d.
    Definition (Social influence S∗). Given a set of users U = {u0, u1, …} and
    a social network where each user may express an interest in the content
    produced by another user, the social influence model S∗, where S∗(i, j)
    measures the interest of user ui in the content of uj, is computed as
    S∗ = Σ_{i=1}^{d} σi · S^i
    where S is the normalized adjacency matrix of the social network, d is the
    distance up to which users may influence their neighbors, and σ is a
    damping factor.

  88. Learning the Weights
    Learning-to-rank approach with SVM
    Each time the user clicks on a news article, we learn a set of
    preferences (clicked_news > non_clicked_news):
    if τ ≤ c(ni) < c(nj) then Rτ(u, ni) > Rτ(u, nj)
    Prune the number of constraints for scalability:
    only news published in the last 2 days
    only take the top-k news for each ranking component
    T.Rex+ includes additional features: click count, age

  89. Predicting Clicked News
    User generated content is a good predictor, albeit sparse
    Click count is a strong baseline, but does not help T.Rex+
    Table 5.2: MRR (Mean Reciprocal Rank), precision and coverage
    Algorithm     MRR    P@1    P@5    P@10   Coverage
    RECENCY       0.020  0.002  0.018  0.036  1.000
    CLICKCOUNT    0.059  0.024  0.086  0.135  1.000
    SOCIAL        0.017  0.002  0.018  0.036  0.606
    CONTENT       0.107  0.029  0.171  0.286  0.158
    POPULARITY    0.008  0.003  0.005  0.012  1.000
    T.REX         0.107  0.073  0.130  0.168  1.000
    T.REX+        0.109  0.062  0.146  0.189  1.000
    RECENCY ranks news articles by time of publication (most recent first);
    CLICKCOUNT ranks news articles by click count (highest count first)