Big Data and the Web: Algorithms for Data Intensive Scalable Computing


Presentation of my Ph.D. defense at IMT

Transcript

  1. Big Data and the Web: Algorithms for Data Intensive Scalable

    Computing Gianmarco De Francisci Morales IMT Institute for Advanced Studies, Lucca ISTI-CNR, Pisa Supervisors: Claudio Lucchese Ranieri Baraglia
  2. Big Data... “Data whose size forces us to look beyond

    the tried-and-true methods that are prevalent at the time” (Jacobs 2009) “When the size of the data itself becomes part of the problem and traditional techniques for working with data run out of steam” (Loukides 2010) 3V: Volume, Variety, Velocity (Gartner 2011)
  3. ...and the Web Largest publicly accessible data source in the

    world Economical, socio-political and scientific importance Center of our digital lives, digital footprint Data is large, noisy, diverse, fast 3 main models for data: Bags, Graphs, Streams
  4. Big Data Mining (Data Mining) Data mining is the process

    of inspecting data in order to extract useful information (Data Exhaust) The quality of the information extracted benefits from the availability of extensive datasets (Data Deluge) The size of these datasets calls for parallel solutions: Data Intensive Scalable Computing
  5. DISC Data Intensive Scalable Computing systems Parallel, scalable, cost effective,

    fault tolerant Non general purpose, data-parallel, restricted computing interface for the sake of performance 2 main computational models: MapReduce, Streaming
  6. MapReduce [Diagram: DFS inputs → Map → Partition & Sort → Shuffle → Merge & Group → Reduce → DFS outputs]

    Map: [⟨k1, v1⟩] → [⟨k2, v2⟩]   Reduce: ⟨k2, [v2]⟩ → [⟨k3, v3⟩]
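The two primitives above can be sketched as a plain in-memory simulation; this is an illustrative word-count instance, and the `run_mapreduce` driver and function names are assumptions, not part of the deck:

```python
from collections import defaultdict

# Map: [<k1, v1>] -> [<k2, v2>]  (here: <doc_id, text> -> [<term, 1>])
def map_fn(doc_id, text):
    return [(term, 1) for term in text.split()]

# Reduce: <k2, [v2]> -> [<k3, v3>]  (here: <term, counts> -> [<term, total>])
def reduce_fn(term, counts):
    return [(term, sum(counts))]

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Shuffle: group intermediate values by key, as the framework would
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    out = []
    for k2, values in sorted(groups.items()):
        out.extend(reduce_fn(k2, values))
    return out

print(run_mapreduce([("d1", "a a b"), ("d2", "b c")], map_fn, reduce_fn))
# [('a', 2), ('b', 2), ('c', 1)]
```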
  7. Streaming (Actor Model) PE: [s1, ⟨k1, v1⟩] →

    [s2, ⟨k2, v2⟩] [Diagram: live streams routed by event key to Processing Elements (PEs); an external persister collects the outputs]
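A Processing Element of this kind can be sketched as a small stateful class that consumes keyed events and emits new keyed events; the `CountPE` name and the running-count logic are illustrative assumptions, not part of S4:

```python
# A minimal S4-style Processing Element: consumes keyed events,
# keeps per-key state s, and emits new keyed events downstream.
class CountPE:
    def __init__(self):
        self.state = {}  # s: per-key running count

    def process(self, key, value):
        # [s1, <k1, v1>] -> [s2, <k2, v2>]: update state, emit running total
        self.state[key] = self.state.get(key, 0) + value
        return [("total", (key, self.state[key]))]

pe = CountPE()
events = [("x", 1), ("y", 2), ("x", 3)]
out = [e for k, v in events for e in pe.process(k, v)]
print(out)  # [('total', ('x', 1)), ('total', ('y', 2)), ('total', ('x', 4))]
```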
  8. Research Goal

  9. Research Goal Design algorithms for Web mining that efficiently harness

    the power of Data Intensive Scalable Computing
  10. Contributions

    Contribution                               Algorithm structure   Data
    Similarity Self-Join                       MR-Optimized          Bags
    Social Content Matching                    MR-Iterative          Graphs
    Personalized Online News Recommendation    S4-Streaming & MR     Streams & Graphs
  11. Similarity Self-Join Discover all those pairs of objects whose similarity

    is above a threshold 2 new MapReduce algorithms: SSJ-2 and SSJ-2R Exact solution with efficient pruning Tested on a large Web corpus from TREC 4.5x faster than the state of the art
    R. Baraglia, G. De Francisci Morales, C. Lucchese, “Document Similarity Self-Join with MapReduce”, IEEE International Conference on Data Mining, 2010
    R. Baraglia, G. De Francisci Morales, C. Lucchese, “Scaling Out All Pairs Similarity Search with MapReduce”, ACM Workshop on Large Scale Distributed Systems for IR, 2010
  12. Motivation

  13. SSJ-2R

  14. SSJ-2R Example (Indexing) [Diagram: maps over d1 "A A B C", d2 "B D D", d3 "A B B C" emit per-term postings ⟨B,(d1,1)⟩, ⟨C,(d1,1)⟩, ⟨D,(d2,2)⟩, ⟨B,(d3,2)⟩, ⟨C,(d3,1)⟩; the shuffle groups them by term; reducers build the inverted lists ⟨B,[(d1,1),(d3,2)]⟩, ⟨C,[(d1,1),(d3,1)]⟩, ⟨D,[(d2,2)]⟩; the pruned prefixes d1 "A A", d3 "A", d2 "B" form the remainder file]

  15. SSJ-2R Example (continued) [Diagram: same indexing phase, with the remainder file shipped to all nodes via the distributed cache]

  16. SSJ-2R Example (Similarity) [Diagram: maps emit partial scores ⟨(d1,d3),2⟩, ⟨(d1,d3),1⟩ together with the document payloads ⟨(d1,!),"A A B C"⟩ and ⟨(d3,!),"A B B C"⟩; the reducer combines them with the remainder file from the distributed cache to output ⟨(d1,d3),5⟩]
  17. Experiments 4 workers with 16 cores, 8 GB memory, 2

    TB disks WT10G samples Metric: running time

    Table II: Samples from the TREC WT10G collection
                      D17K         D30K         D63K
    # documents       17,024       30,683       63,126
    # terms           183,467      297,227      580,915
    # all pairs       289,816,576  941,446,489  3,984,891,876
    # similar pairs   94,220       138,816      189,969
  18. Results [Plot: running time (seconds) vs. number of vectors (15,000–65,000) for ELSA, VERN, SSJ-2, SSJ-2R]
  19. Results [Plot: distribution of inverted list lengths; ELSA max = 6600, SSJ-2R max = 1729]
  20. Results [Plot: per-mapper running time (seconds) for ELSA, VERN, SSJ-2R without bucketing, SSJ-2R with bucketing]
  21. Social Content Matching Select a subset of the edges of

    a weighted graph, maximizing the total weight of the solution while obeying the capacity constraints of the nodes StackMR: ⅙ approx., poly-logarithmic, (1+∊) violations GreedyMR: ½ approx., linear worst case, no violations Validation on 2 large datasets from real-world systems: flickr and Yahoo! Answers SSJ-2R used to build the weighted bipartite graphs
    G. De Francisci Morales, A. Gionis, M. Sozio, “Social Content Matching in MapReduce”, International Conference on Very Large Data Bases, 2011
  22. Motivation

  23. Motivation

  24. Motivation

  25. Motivation

  26. Problem: graph b-matching Given a set of items T, consumers

    C, a bipartite graph, weights w(ti, cj), and capacity constraints b(ti) and b(cj), find a matching M = {(t, c)} such that |M(ti)| ≤ b(ti), |M(cj)| ≤ b(cj), and w(M) is maximized
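The constraints above can be sketched directly, assuming a matching represented as a list of (item, consumer) pairs; the helper names and toy instance are illustrative assumptions:

```python
from collections import Counter

def is_feasible(M, b):
    """Check the capacity constraints |M(t)| <= b(t) and |M(c)| <= b(c)."""
    deg = Counter()
    for t, c in M:
        deg[t] += 1
        deg[c] += 1
    return all(deg[v] <= b[v] for v in deg)

def weight(M, w):
    """Total weight w(M) of the matching."""
    return sum(w[(t, c)] for t, c in M)

# Toy instance: 2 items, 2 consumers (names are illustrative)
w = {("t1", "c1"): 3.0, ("t1", "c2"): 1.0, ("t2", "c1"): 2.0}
b = {"t1": 1, "t2": 1, "c1": 2, "c2": 1}
M = [("t1", "c1"), ("t2", "c1")]  # both edges fit, since b(c1) = 2
print(is_feasible(M, b), weight(M, w))  # True 5.0
```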
  27. Graph processing in MR Map Reduce

  28. Experiments 3 datasets: Quality = b-matching value Efficiency = number

    of MR iterations Evaluation of capacity violations for StackMR Evaluation of convergence speed for GreedyMR

    Dataset        |T|        |C|        |E|
    flickr-small   2,817      526        550,667
    flickr-large   373,373    32,707     1,995,123,827
    yahoo-answers  4,852,689  1,149,714  18,847,281,236
  29. (image-only slide)
  30. (image-only slide)
  31. Personalized Online News Recommendation Deliver personalized news recommendations based on

    a model built from the Twitter profile of users Learn a personalized ranking function from 3 signals: Social, Content, Popularity Deep personalization via entity extraction Tested on 1 month of Y! News + Twitter + Toolbar logs Predicts the user's click within the top-10 positions 20% of the time
    G. De Francisci Morales, A. Gionis, C. Lucchese, “From Chatter to Headlines: Harnessing the Real-Time Web for Personalized News Recommendations”, ACM International Conference on Web Search and Data Mining, 2012
  32. Motivation

  33. Why Twitter? Timeliness Personalization

    [Plot: number of mentions of “Osama Bin Laden” in news, Twitter, and clicks, from May 1 h20 through May 3 h08]
  34. Recommended from Twitter! [Screenshot: “Featured from your Twitter account” news module]

  35. System Overview Designed to be streaming and lightweight Recommendation model is updated

    in real-time [Diagram: tweets from the user and followees, plus news articles, feed the T.Rex user model, which produces a personalized ranked list of news articles] Rτ(u, n) = α · Στ(u, n) + β · Γτ(u, n) + γ · Πτ(n)
  36. Experiments Automatic evaluation, aiming for precision Frame as a click-prediction

    problem: at time τ, given a user model and a stream of published news, predict which news the user clicks on Clicks from Y! Toolbar and news from Y! News 1 month of English Tweets + crawled follower network
  37. Evaluation Metrics

    MRR = (1/|Q|) Σ_{i=1}^{|Q|} 1/r(n*_i), where r(n*_i) is the rank of the clicked news article at the i-th event and Q is the set of tests

    DCG[j] = G[j] if j = 1; DCG[j−1] + G[j]/log2 j if j > 1, where G[j] is the relevance of the document n^i_j at position j in the i-th ranking

    G[j] = ⌊Jaccard(n*_i, n^i_j) × 5⌋ / 5
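Both metrics can be computed directly from the definitions above; a minimal sketch with illustrative function names and toy inputs:

```python
import math

def mrr(ranks):
    """MRR = (1/|Q|) * sum over tests of 1/r(n*_i); ranks are 1-based."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def dcg(gains):
    """DCG[1] = G[1]; DCG[j] = DCG[j-1] + G[j]/log2(j) for j > 1."""
    total = gains[0]
    for j in range(2, len(gains) + 1):
        total += gains[j - 1] / math.log2(j)
    return total

print(mrr([1, 2, 4]))        # (1 + 1/2 + 1/4) / 3
print(dcg([1.0, 0.6, 0.2]))  # 1.0 + 0.6/log2(2) + 0.2/log2(3)
```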
  38. Results Entity overlap between clicked and suggested news article [Plot: average DCG by rank (1–20) for T.Rex+, T.Rex, Popularity, Content, Social, Recency, Click count]
  39. Conclusions Tackle the big data problem on the Web by

    designing large scale Web mining algorithms for DISC systems Address classical problems like similarity, matching and recommendation in the context of Web mining with large, heterogeneous datasets Provide novel, efficient and scalable solutions for the MapReduce and streaming programming models
  40. Conclusions Similarity of bags of Web pages: SSJ-2 and SSJ-2R

    4.5x faster than state-of-the-art Importance of careful design of MR algorithms Matching of Web 2.0 content on graphs: StackMR and GreedyMR iterative MR algorithms with provable approximation guarantees First solution to b-matching problem in MR Scalable computation pattern for graph mining in MR Personalized recommendation of news from streams: T.Rex predicts user interest from real-time social Web Parallelizable online stream + graph mining
  41. Thanks

  42. Similarity Self-Join

  43. SSJ-2 Example

  44. SSJ-2 Example (Indexing) [Diagram: maps over d1 "A A B C", d2 "B D D", d3 "A B B C" emit per-term postings; the shuffle groups them by term; reducers build the inverted lists ⟨B,[(d1,1),(d3,2)]⟩, ⟨C,[(d1,1),(d3,1)]⟩, ⟨D,[(d2,2)]⟩]

  45. SSJ-2 Example (Similarity) [Diagram: maps emit partial scores ⟨(d1,d3),2⟩ and ⟨(d1,d3),1⟩; the reducer groups them as ⟨(d1,d3),[2,1]⟩, fetches d1 "A A B C" and d3 "A B B C" from HDFS, and outputs ⟨(d1,d3),5⟩]
  46. SSJ-2 [Figure: documents di and dj split into a pruned part and an indexed part at boundaries bi, bj; |L| = inverted list length]
  47. SSJ-2 • Indexing & Prefix filtering
  48. SSJ-2 • Need to retrieve pruned part
  49. SSJ-2 • Actually, retrieve the whole documents
  50. SSJ-2 • 2 remote (DFS) I/Os per pair
  51. SSJ-2R Shuffle the input together with the partial similarity scores:

    ⟨di⟩; ⟨(di, dj), W^A_ij⟩; ⟨(di, dj), W^B_ij⟩; ⟨(di, dk), W^A_ik⟩; …  (group by key di)
    ⟨dj⟩; ⟨(dj, dk), W^A_jk⟩; ⟨(dj, dk), W^B_jk⟩; ⟨(dj, dl), W^A_jl⟩; …  (group by key dj)
    Remainder file = pruned part of the input Pre-load the remainder file in memory, no further disk I/O [Figure: pruned vs. indexed parts of di and dj at boundaries bi, bj]
  52. SSJ-2R Reducer Reduce input: (d0,d1),[w1,w2,w3…]; (d0,d2),[w1,w2,w3…]; (d0,d3),[w1,w2,w3…]; (d0,d4),[w1,w2,w3…]; (d0,!),[t1,t2,t3…] Sort pairs on both IDs, group on first (Secondary Sort) Only 1 reducer reads d0 Remainder file contains only the useful portion of the other documents (about 10%)
  53. SSJ-2R Reducer • Whole document shuffled via MR
  54. SSJ-2R Reducer • Remainder file preloaded in memory
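The secondary-sort trick can be simulated in plain Python: sort on the full composite key (first ID, second ID) so that the payload record marked "!" (which sorts before any document ID in ASCII) arrives first, then group on the first ID only. The record layout here is an illustrative assumption:

```python
from itertools import groupby

def reduce_input(records):
    """Simulate the reduce-side view after a secondary sort: per first ID,
    return (document payload, [(other_id, partial_scores), ...])."""
    records = sorted(records, key=lambda r: r[0])  # sort on (id1, id2)
    out = {}
    for first_id, group in groupby(records, key=lambda r: r[0][0]):
        group = list(group)
        payload = group[0][1]  # the (d0, "!") record arrives first
        pairs = [(key[1], val) for key, val in group[1:]]
        out[first_id] = (payload, pairs)
    return out

records = [
    (("d0", "d3"), [0.2, 0.1]),         # partial similarity scores
    (("d0", "!"), ["t1", "t2", "t3"]),  # document payload for d0
    (("d0", "d1"), [0.5]),
]
print(reduce_input(records))
# {'d0': (['t1', 't2', 't3'], [('d1', [0.5]), ('d3', [0.2, 0.1])])}
```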
  55. Results [Plot: running time (seconds) vs. number of documents (15,000–65,000) for Elsayed et al., SSJ-2, Vernica et al., SSJ-2R]
  56. Partitioning (K=2) [Diagram: reduce input split into K slices, with the payload record (d0,!),[t1,t2,t3…] replicated in each slice] • Split in K slices • Each reducer needs to load only 1/K of the remainder file • Need to replicate the input K times
  57. Map phase [Plot: average map running time (seconds) vs. number of documents for Elsayed et al., SSJ-2, Vernica et al., SSJ-2R]
  58. Map phase [Plot: distribution of inverted list lengths; Elsayed et al. max = 6600, SSJ-2R max = 1729]
  59. Map phase [Plot: per-mapper map times (threshold = 0.9) for Elsayed et al., SSJ-2R without bucketing, SSJ-2R with bucketing]
  60. Reduce phase [Plot: average reduce running time (seconds) vs. number of documents for Elsayed et al., SSJ-2, Vernica et al., SSJ-2R]
  61. Social Content Matching

  62. System overview The application operates in consecutive phases (each phase

    in the range from hours to days) Before the beginning of the i-th phase, the application makes a tentative allocation of items to users Capacity constraints: Users: an estimate of the number of logins during the i-th phase Items: proportional to a quality assessment, or constant B = Σ_{c∈C} b(c) = Σ_{t∈T} b(t)
  63. Graph building Edge weight is the cosine similarity between the

    vector representations of the item and the consumer: w(ti, cj) = v(ti) · v(cj) Prune the O(|T||C|) candidate edges by discarding low-weight edges (we want to maximize the total weight) Similarity join between T and C in MapReduce
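The edge weight can be sketched as a plain cosine over sparse term-weight dictionaries; the toy tag vectors below are illustrative assumptions:

```python
import math

def cosine(v1, v2):
    """w(t_i, c_j) = v(t_i) . v(c_j) over L2-normalized sparse vectors."""
    dot = sum(v1[k] * v2.get(k, 0.0) for k in v1)
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

item = {"sunset": 1.0, "beach": 1.0}      # e.g. a photo's tags
consumer = {"beach": 1.0, "surf": 1.0}    # e.g. a user's tag profile
print(cosine(item, consumer))  # dot = 1, norms = sqrt(2) each -> 0.5
```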
  64. StackMR Primal-dual formulation of the problem (Integer Linear Programming) Compute

    a maximal ⌈∊b⌉-matching in parallel Push it onto the stack, update the dual variables, and remove covered edges When there are no more edges, pop the whole stack and include edges in the solution layer by layer For efficiency, allows (1+∊) violations of the capacity constraints
  65. StackMR Example (animation frames over slides 65–77; image-only)
  78. GreedyMR Adaptation in MR of a classical greedy algorithm (sort

    the edges by weight, include the current edge if it maintains the constraints, and update the capacities) At each round, each node proposes its top-weight b(v) edges to its neighbors The intersection between the proposals of each node and those of its neighbors is included in the solution Capacities are updated in parallel Yields a feasible sub-optimal solution at each round
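The classical sequential greedy that GreedyMR adapts can be sketched as follows; this is the sequential baseline (a known ½-approximation for b-matching), not the MR adaptation itself, and the toy instance is illustrative:

```python
from collections import Counter

def greedy_b_matching(edges, b):
    """Scan edges by decreasing weight; keep an edge if both endpoints
    still have residual capacity. Classical 1/2-approximation."""
    used = Counter()
    matching = []
    for w, t, c in sorted(edges, reverse=True):
        if used[t] < b[t] and used[c] < b[c]:
            matching.append((t, c))
            used[t] += 1
            used[c] += 1
    return matching

edges = [(3.0, "t1", "c1"), (2.0, "t2", "c1"), (1.0, "t1", "c2")]
b = {"t1": 1, "t2": 1, "c1": 2, "c2": 1}
print(greedy_b_matching(edges, b))  # [('t1', 'c1'), ('t2', 'c1')]
```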
  79. StackGreedyMR Hybrid approach Same structure as StackMR Uses a greedy

    heuristic in one of the randomized phases, when choosing the edges to propose We also tried a proportional heuristic, but the results were always worse than with the greedy one Mixed results overall
  80. Algorithms summary

                Approximation guarantee   MR rounds          Capacity violations
    StackMR     ⅙                         poly-logarithmic   1+∊
    GreedyMR    ½                         linear             no
  81. Vector representation Bag-of-words model flickr users: set of tags used

    in all photos flickr items (photos): set of tags Y! Answers users: set of words used in all answers Y! Answers items (questions): set of words Y! Answers: stopword removal, stemming, tf-idf
  82. Conclusions 2 algorithms with different trade-offs between result quality

    and efficiency StackMR scales to very large datasets, has provable poly-logarithmic complexity and is faster in practice; capacity violations are negligible GreedyMR yields higher-quality results, has a ½ approximation guarantee, and can be stopped at any time
  83. Personalized Online News Recommendation

  84. News Gets Old Soon 90% of the clicks happen within 2 days from publication

    [Plot: distribution of news-click delay in minutes (1 to 10,000, log scale)]
  85. T.Rex Twitter-based news recommendation system Builds a user model from Twitter Signals from user-generated

    content, social neighbors, and popularity across Twitter and news Deep personalization based on entities (overcomes vocabulary mismatch, easier to model relevance) Learns a personalized news ranking function Picks candidates from a pool of related or popular fresh news, ranks them, and presents the top-k to the user
  86. Ranking function is user and time dependent Social model +

    Content model + Popularity model Social model weights the content model of neighbors by a truncated PageRank on the Twitter network Content model measures relatedness of user’s tweet stream and news article represented as bag-of-entities Popularity model tracks entity popularity by the number of mentions in Twitter and news (exponential forgetting) Rτ (u, n) = α · Στ (u, n) + β · Γτ (u, n) + γ · Πτ (n) Recommendation Model
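The linear combination Rτ above can be sketched directly; the weights α, β, γ and the per-article component scores below are illustrative placeholders, not values from the thesis:

```python
def score(alpha, beta, gamma, social, content, popularity):
    """R_tau(u, n) = alpha*Sigma_tau(u, n) + beta*Gamma_tau(u, n) + gamma*Pi_tau(n)."""
    return alpha * social + beta * content + gamma * popularity

# Hypothetical candidate articles with (Sigma, Gamma, Pi) component scores
candidates = {"n1": (0.2, 0.8, 0.1), "n2": (0.5, 0.1, 0.9)}
ranked = sorted(candidates,
                key=lambda n: score(0.5, 0.3, 0.2, *candidates[n]),
                reverse=True)
print(ranked)  # ['n2', 'n1']  (0.46 vs. 0.36)
```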
  87. Recommendation Model Entities extracted from news and tweets; Wikipedia pages are used as the entity space

    Content Model Γ: Γ(u, n) measures the relevance of the content of news n for user u, via tweet-to-entity and entity-to-news relatedness matrices over bags of entities
    Social Model Σ: Σ(u, n) measures the social relevance of news n for user u, combining the authorship matrix and the social-interest matrix
    Popularity Model Π: Π(n) measures the popularity of news article n, updated by tracking entity mentions in news and Twitter with exponential decay
    Social influence S*: the social network is induced by the Twitter following relationship; its adjacency matrix S has S(i, j) equal to 1 over the number of users followed by ui if ui follows uj, and 0 otherwise. Given a maximum hop distance d, S*(i, j) measures the interest of user ui in the content produced by uj: S* = Σ_{i=1}^{d} σ^i S^i, where σ damps influence with distance
  88. Learning the Weights Learning-to-rank approach with SVM Each time the user

    clicks on a news article, we learn a set of preferences (clicked_news > non_clicked_news): if τ ≤ c(ni) < c(nj) then Rτ(u, ni) > Rτ(u, nj) Prune the number of constraints for scalability: only news published in the last 2 days, only the top-k news for each ranking component T.Rex+ includes additional features: click count, age
  89. Predicting Clicked News User-generated content is a good predictor, albeit sparse Click

    count is a strong baseline but does not help T.Rex+

    Table 5.2: MRR, precision and coverage
    Algorithm    MRR    P@1    P@5    P@10   Coverage
    RECENCY      0.020  0.002  0.018  0.036  1.000
    CLICKCOUNT   0.059  0.024  0.086  0.135  1.000
    SOCIAL       0.017  0.002  0.018  0.036  0.606
    CONTENT      0.107  0.029  0.171  0.286  0.158
    POPULARITY   0.008  0.003  0.005  0.012  1.000
    T.REX        0.107  0.073  0.130  0.168  1.000
    T.REX+       0.109  0.062  0.146  0.189  1.000

    RECENCY ranks news articles by time of publication (most recent first); CLICKCOUNT ranks news articles by click count (highest count first)