
Big Data and the Web: Algorithms for Data Intensive Scalable Computing


Presentation of my Ph.D. defense at IMT


Transcript

  1. Big Data and the Web:
    Algorithms for Data Intensive
    Scalable Computing
    Gianmarco De Francisci Morales
    IMT Institute for Advanced Studies, Lucca
    ISTI-CNR, Pisa
    Supervisors:
    Claudio Lucchese
    Ranieri Baraglia


  2. Big Data...
    “Data whose size forces us to look beyond the tried-
    and-true methods that are prevalent at the
    time” (Jacobs 2009)
    “When the size of the data itself becomes part of the
    problem and traditional techniques for working with
    data run out of steam” (Loukides 2010)
    3V: Volume, Variety, Velocity (Gartner 2011)


  3. ...and the Web
    Largest publicly accessible data source in the world
    Economic, socio-political and scientific importance
    Center of our digital lives, digital footprint
    Data is large, noisy, diverse, fast
    3 main models for data:
    Bags, Graphs, Streams


  4. Big Data Mining
    (Data Mining) Data mining is the process of inspecting
    data in order to extract useful information
    (Data Exhaust) The quality of the information extracted
    benefits from the availability of extensive datasets
    (Data Deluge) The size of these datasets calls for
    parallel solutions: Data Intensive Scalable Computing


  5. DISC
    Data Intensive Scalable Computing systems
    Parallel, scalable, cost effective, fault tolerant
    Not general-purpose: data-parallel, restricted
    computing interface for the sake of performance
    2 main computational models: MapReduce, Streaming


  6. MapReduce
    [Diagram: inputs read from the DFS feed map tasks; the shuffle
    (partition & sort, then merge & group) feeds the reduce tasks,
    which write their outputs back to the DFS]
    Map : [k1, v1] → [k2, v2]
    Reduce : {k2 : [v2]} → [k3, v3]
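The two signatures above can be illustrated with a minimal in-memory simulation (illustrative only, not the Hadoop API):

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map : [k1, v1] -> [k2, v2]
    intermediate = []
    for k1, v1 in inputs:
        intermediate.extend(map_fn(k1, v1))
    # Shuffle: partition & sort, then merge & group by k2
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # Reduce : {k2 : [v2]} -> [k3, v3]
    output = []
    for k2, values in sorted(groups.items()):
        output.extend(reduce_fn(k2, values))
    return output

# Word count, the canonical MapReduce example
docs = [("d1", "A A B C"), ("d2", "B D D")]
counts = run_mapreduce(
    docs,
    map_fn=lambda doc_id, text: [(w, 1) for w in text.split()],
    reduce_fn=lambda word, ones: [(word, sum(ones))],
)
```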


  7. Streaming (Actor Model)
    [Diagram: live streams are routed as events to a network of
    processing elements (PEs); outputs go to an external persister]
    PE : [s1, k1, v1] → [s2, k2, v2]
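The PE abstraction can be illustrated with a toy keyed-counter element (illustrative only, not the S4 API; all names are invented for this sketch):

```python
class CounterPE:
    """Toy processing element: consumes keyed events, emits keyed events.
    PE : <s1, k1, v1> -> [<s2, k2, v2>]"""

    def __init__(self, emit):
        self.totals = {}
        self.emit = emit  # event-routing callback to downstream PEs

    def process(self, stream, key, value):
        # Update per-key state and emit the new total downstream
        self.totals[key] = self.totals.get(key, 0) + value
        self.emit("totals", key, self.totals[key])

out = []
pe = CounterPE(emit=lambda s, k, v: out.append((s, k, v)))
pe.process("words", "A", 1)
pe.process("words", "A", 1)
pe.process("words", "B", 1)
```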



  9. Research Goal
    Design algorithms for Web mining
    that efficiently harness the power of
    Data Intensive Scalable Computing


  10. Contributions
    [Diagram: contributions arranged by algorithm structure and data complexity]
    Contribution                                Algorithm Structure   Data Complexity
    Similarity Self-Join                        MR-Optimized          Bags
    Social Content Matching                     MR-Iterative          Graphs
    Personalized Online News Recommendation     S4-Streaming & MR     Streams & Graphs


  11. Similarity Self-Join
    Discover all those pairs of objects
    whose similarity is above a threshold
    2 new MapReduce algorithms:
    SSJ-2 and SSJ-2R
    Exact solution with efficient pruning
    Test on a large Web corpus from TREC
    4.5x faster than state-of-the-art
    R. Baraglia, G. De Francisci Morales, C. Lucchese
    “Document Similarity Self-Join with MapReduce”
    IEEE International Conference on Data Mining 2010
    R. Baraglia, G. De Francisci Morales, C. Lucchese
    “Scaling out All Pairs Similarity Search with MapReduce”
    ACM Workshop on Large Scale Distributed Systems for IR 2010
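As a reference point for the problem statement, a naive all-pairs baseline can be sketched in a few lines (this is the quadratic computation that SSJ-2/SSJ-2R are designed to avoid; cosine similarity over bags of words, toy documents taken from the example slides later in the deck):

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two term-frequency bags
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm

def similarity_self_join(docs, threshold):
    # Naive baseline: compares every pair of documents
    vecs = {d: Counter(text.split()) for d, text in docs.items()}
    result = {}
    for di, dj in combinations(sorted(vecs), 2):
        s = cosine(vecs[di], vecs[dj])
        if s >= threshold:
            result[(di, dj)] = s
    return result

pairs = similarity_self_join(
    {"d1": "A A B C", "d2": "B D D", "d3": "A B B C"}, threshold=0.8)
```

Here d1 and d3 share A, B and C with a raw dot product of 5, matching the ⟨(d1,d3), 5⟩ output in the example slides.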


  12. Motivation


  13. SSJ-2R


  14–16. SSJ-2R Example
    [Diagram, built up over three slides: the two-phase SSJ-2R pipeline on the
    corpus d1 = "A A B C", d2 = "B D D", d3 = "A B B C".
    Indexing: each map tokenizes a document and emits ⟨term, (doc, frequency)⟩;
    the reduces build inverted lists such as ⟨B, [(d1,1), (d3,2)]⟩. The pruned
    prefixes (d1 "A A", d3 "A", d2 "B") go into the Remainder File, shipped to
    every node via the Distributed Cache.
    Similarity: maps join the postings into partial scores ⟨(d1,d3), 2⟩ and
    ⟨(d1,d3), 1⟩, and forward whole documents as ⟨(d1,!), "A A B C"⟩ and
    ⟨(d3,!), "A B B C"⟩; the reduce sums the partial scores with the
    remainder-file contribution and outputs ⟨(d1,d3), 5⟩]

  17. Experiments
    4 workers with 16 cores, 8 GB memory, 2 TB disks
    WT10G samples
    Metric: running time
    Table II: samples from the TREC WT10G collection
                      D17K         D30K         D63K
    # documents       17,024       30,683       63,126
    # terms           183,467      297,227      580,915
    # all pairs       289,816,576  941,446,489  3,984,891,876
    # similar pairs   94,220       138,816      189,969


  18. Results
    [Plot: running time (seconds, 0–60,000) vs. number of vectors
    (15,000–65,000) for ELSA, VERN, SSJ-2 and SSJ-2R]

  19. Results
    [Log-log plots: distribution of inverted list lengths;
    ELSA max = 6600, SSJ-2R max = 1729]

  20. Results
    [Plot: per-mapper running time (seconds, 0–2000) vs. mapper ID (0–50)
    for ELSA, VERN, SSJ-2R without bucketing, SSJ-2R with bucketing]

  21. Social Content
    Matching
    Select a subset of the edges of a weighted graph,
    maximizing the total weight of the solution,
    while obeying capacity constraints on the nodes
    StackMR: ⅙ approx., poly-logarithmic, (1+∊) violations
    GreedyMR: ½ approx., linear worst case, no violations
    Validation on 2 large datasets coming from real world
    systems: flickr and Yahoo! Answers
    SSJ-2R to build the weighted bipartite graphs
    G. De Francisci Morales, A. Gionis, M. Sozio
    “Social Content Matching in MapReduce”
    International Conference on Very Large Data Bases 2011


  22–25. Motivation
    [Figure-only build-up slides]

  26. Problem: graph b-matching
    Given a set of items T, consumers C, a bipartite graph,
    weights w(ti, cj), and capacity constraints b(ti) and b(cj)
    Find a matching M = {(t, c)} such that
    - |M(ti)| ≤ b(ti)
    - |M(cj)| ≤ b(cj)
    - w(M) is maximized
    [Diagram: bipartite graph of items and consumers]
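The constraints above are easy to state in code; here is a minimal checker (hypothetical helper, not from the thesis) that validates a candidate matching and computes its value w(M):

```python
from collections import Counter

def matching_value(M, w, b):
    """Verify |M(v)| <= b(v) for every node and return w(M).
    M: list of (t, c) pairs; w: {(t, c): weight}; b: {node: capacity}."""
    degree = Counter()
    for t, c in M:
        degree[t] += 1
        degree[c] += 1
    if any(degree[v] > b[v] for v in degree):
        raise ValueError("capacity constraint violated")
    return sum(w[e] for e in M)

# Toy instance: t1 may be matched twice, every other node once
w = {("t1", "c1"): 0.9, ("t1", "c2"): 0.8, ("t2", "c1"): 0.7}
b = {"t1": 2, "t2": 1, "c1": 1, "c2": 1}
value = matching_value([("t1", "c1"), ("t1", "c2")], w, b)
```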


  27. Graph processing in MR
    [Diagram: alternating Map and Reduce phases over the graph]

  28. Experiments
    3 datasets:
    Quality = b-matching value
    Efficiency = number of MR iterations
    Evaluation of capacity violations for StackMR
    Evaluation of convergence speed for GreedyMR
    Dataset         |T|         |C|         |E|
    flickr-small    2,817       526         550,667
    flickr-large    373,373     32,707      1,995,123,827
    yahoo-answers   4,852,689   1,149,714   18,847,281,236


  29–30. [Figure-only slides]

  31. Personalized Online
    News Recommendation
    Deliver personalized news recommendations based on
    a model built from the Twitter profile of users
    Learn personalized ranking function from 3 signals:
    Social, Content, Popularity
    Deep personalization via entity extraction
    Test on 1 month of Y! News + Twitter + Toolbar logs
    Predicts the user click within the top-10 positions 20% of the time
    G. De Francisci Morales, A. Gionis, C. Lucchese
    “From Chatter to Headlines: Harnessing the Real-Time Web for Personalized News Recommendations”
    ACM International Conference on Web Search and Data Mining 2012


  32. Motivation


  33. Why Twitter?
    Timeliness
    Personalization
    [Plot: hourly number of mentions of “Osama Bin Laden” from May 1 to May 3,
    in news, twitter, and clicks]

  34. FEATURED FROM YOUR TWITTER ACCOUNT
    Recommended from Twitter!

  35. System Overview
    Designed to be streaming and lightweight
    Recommendation model is updated in real-time
    [Diagram: user tweets and followee tweets from Twitter, together with news
    articles, feed the T.Rex user model, which produces a personalized ranked
    list of news articles]
    Rτ(u, n) = α · Στ(u, n) + β · Γτ(u, n) + γ · Πτ(n)

  36. Automatic evaluation, aim for precision
    Frame as a click prediction problem, at time τ:
    Given a user model and a stream of published news
    Predict which news the user clicks on
    Clicks from Y! Toolbar and news from Y! News
    1 month of English Tweets + crawled follower network
    Experiments


  37. Evaluation Metrics
    MRR = (1/|Q|) · Σ_{i=1}^{|Q|} 1/r(ni)
    where r(ni) is the rank of the clicked news article
    at the i-th event and Q is the set of tests
    DCG[j] = G[j] if j = 1;  DCG[j] = DCG[j−1] + G[j]/log2 j if j > 1
    where G[j] is the relevance of the document at
    position j in the i-th ranking,
    G[j] = ⌈Jaccard(ni, ni_j) × 5⌉ ÷ 5 (Jaccard similarity quantized into five levels)
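Both metrics are straightforward to compute; a small sketch following the definitions above (log base 2 as in the DCG formula; inputs are made up):

```python
from math import log2

def mrr(ranks):
    # ranks: the rank r(n_i) of the clicked article in each test event
    return sum(1.0 / r for r in ranks) / len(ranks)

def dcg(gains):
    # DCG[1] = G[1]; DCG[j] = DCG[j-1] + G[j]/log2(j) for j > 1
    total = gains[0]
    for j, g in enumerate(gains[1:], start=2):
        total += g / log2(j)
    return total
```

For example, clicked ranks [1, 2, 4] give MRR = (1 + 1/2 + 1/4)/3 = 7/12.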


  38. Results
    Entity overlap between clicked and suggested news article
    [Plot: average DCG vs. rank (1–20) for T.Rex+, T.Rex, Popularity,
    Content, Social, Recency, Click count]

  39. Conclusions
    Tackle the big data problem on the Web by designing
    large scale Web mining algorithms for DISC systems
    Address classical problems like similarity, matching and
    recommendation in the context of Web mining with
    large, heterogeneous datasets
    Provide novel, efficient and scalable solutions for the
    MapReduce and streaming programming models


  40. Conclusions
    Similarity of bags of Web pages:
    SSJ-2 and SSJ-2R 4.5x faster than state-of-the-art
    Importance of careful design of MR algorithms
    Matching of Web 2.0 content on graphs:
    StackMR and GreedyMR iterative MR algorithms with
    provable approximation guarantees
    First solution to b-matching problem in MR
    Scalable computation pattern for graph mining in MR
    Personalized recommendation of news from streams:
    T.Rex predicts user interest from real-time social Web
    Parallelizable online stream + graph mining


  41. Thanks


  42. Similarity Self-Join


  43. SSJ-2 Example


  44–45. SSJ-2 Example
    [Diagram, built up over two slides, on the corpus d1 = "A A B C",
    d2 = "B D D", d3 = "A B B C".
    Indexing: maps emit ⟨term, (doc, frequency)⟩ and reduces build inverted
    lists such as ⟨B, [(d1,1), (d3,2)]⟩.
    Similarity: maps emit partial scores ⟨(d1,d3), 2⟩ and ⟨(d1,d3), 1⟩, the
    shuffle groups them as ⟨(d1,d3), [2,1]⟩, and the reducer fetches
    d1 "A A B C" and d3 "A B B C" from HDFS to score the pruned terms and
    output ⟨(d1,d3), 5⟩]

  46–50. SSJ-2
    [Diagram, built up over five slides: each document di is split at
    boundary bi into a pruned prefix and an indexed suffix of its term
    vector of length |L|]
    • Indexing & Prefix filtering
    • Need to retrieve pruned part
    • Actually, retrieve the whole documents
    • 2 remote (DFS) I/O per pair

  51. SSJ-2R
    ⟨di ; (di,dj), W^A_ij ; (di,dj), W^B_ij ; (di,dk), W^A_ik ; …⟩ group by key di
    ⟨dj ; (dj,dk), W^A_jk ; (dj,dk), W^B_jk ; (dj,dl), W^A_jl ; …⟩ group by key dj
    Remainder file = pruned part of the input
    Pre-load remainder file in memory, no further disk I/O
    Shuffle the input together with the partial similarity scores
    [Diagram: pruned/indexed split of di and dj as in the SSJ-2 slides]

  52–54. SSJ-2R Reducer
    Reduce input:
    (d0,!),[t1,t2,t3 …]   ← whole document, shuffled via MR
    (d0,d1),[w1,w2,w3 …]
    (d0,d2),[w1,w2,w3 …]
    (d0,d3),[w1,w2,w3 …]
    (d0,d4),[w1,w2,w3 …]
    Sort pairs on both IDs, group on first (Secondary Sort)
    Only 1 reducer reads d0
    Remainder file contains only the useful portion of the other
    documents (about 10%), preloaded in memory
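The secondary-sort trick can be simulated in plain Python (hypothetical record layout; None plays the role of the "!" marker, so the document itself sorts first within its group):

```python
from itertools import groupby

# Reduce input records: key (doc_id, other_id); other_id=None stands in for
# the "!" marker and must sort before every real document ID.
records = [
    (("d0", "d3"), [0.2, 0.1]),
    (("d0", None), ["t1", "t2", "t3"]),  # the document d0 itself
    (("d0", "d1"), [0.4]),
]

# Secondary sort: order by (doc_id, marker-last flag, other_id),
# then group on doc_id only
records.sort(key=lambda r: (r[0][0], r[0][1] is not None, r[0][1] or ""))

for doc_id, group in groupby(records, key=lambda r: r[0][0]):
    group = list(group)
    terms = group[0][1]   # first record of the group: d0's own terms
    partials = group[1:]  # then the partial scores, one entry per neighbor
```

In Hadoop this ordering/grouping split is what a custom partitioner and grouping comparator provide; each reducer sees d0 exactly once, before all of its pair records.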


  55. Results
    [Plot: running time (seconds, 0–60,000) vs. number of documents
    (15,000–65,000) for Elsayed et al., SSJ-2, Vernica et al., SSJ-2R]

  56. Partitioning
    [Diagram: the key space of d0 split into K = 2 slices, each shuffled with
    its own copy of (d0,!),[t1,t2,t3 …]]
    • Split in K slices
    • Each reducer needs to load only 1/K of the remainder file
    • Need to replicate the input K times
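A sketch of the slicing idea (the hash function and all names are made up for illustration):

```python
K = 2  # number of slices of the remainder file

def slice_of(doc_id):
    # Deterministic assignment of a document to one of the K slices
    return sum(map(ord, doc_id)) % K

remainder = {"d0": "A A", "d1": "B", "d2": "A"}

# Each reducer is responsible for one slice, so it loads only ~1/K of the
# remainder file into memory; in exchange, the map input is replicated K times.
slices = {s: {d: t for d, t in remainder.items() if slice_of(d) == s}
          for s in range(K)}
```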


  57. Map phase
    [Plots (a), (c): average map running time (seconds, 0–8000) vs. number of
    documents (15,000–65,000) for Elsayed et al., SSJ-2, Vernica et al., SSJ-2R]

  58. Map phase
    [Log-log plots: distribution of inverted list lengths;
    Elsayed et al. max = 6600, SSJ-2R max = 1729]

  59. Map phase
    [Plot: per-mapper map times (threshold = 0.9), time (seconds, 0–2000)
    vs. mapper (0–50) for Elsayed et al., SSJ-2R without bucketing,
    SSJ-2R with bucketing]

  60. Reduce phase
    [Plots (b), (d): average reduce running time (seconds, 0–18,000) vs.
    number of documents (15,000–65,000) for Elsayed et al., SSJ-2,
    Vernica et al., SSJ-2R]

  61. Social Content Matching


  62. System overview
    The application operates in consecutive phases
    (each phase in the range from hours to days)
    Before the beginning of the i-th phase,
    the application makes a tentative allocation of items to users
    Capacity constraints
    Users: an estimate of the number of logins during the i-th phase
    Items: proportional to a quality assessment, or constant
    B = Σ_{c∈C} b(c) = Σ_{t∈T} b(t)

  63. Graph building
    Edge weight is the cosine similarity between vector representations
    of the item and the consumer: w(ti, cj) = v(ti) · v(cj)
    Prune the O(|T||C|) candidate edges by discarding low-weight
    edges (we want to maximize the total weight)
    Similarity join between T and C in MapReduce

  64. StackMR
    Primal-dual formulation of the problem
    (Integer Linear Programming)
    Compute a maximal ⌈∊b⌉-matching in parallel
    Push it in the stack, update dual variables and
    remove covered edges
    When there are no more edges, pop the whole stack
    and include edges in the solution layer by layer
    For efficiency, allows (1+∊) violations on capacity constraints


  65–77. StackMR Example
    [Animation: figure-only build-up slides]

  78. GreedyMR
    Adaptation in MR of a classical greedy algorithm
    (sort the edges by weight, include the current edge if
    it maintains the constraints and update the capacities)
    At each round, each node proposes its b(v) top-weight
    edges to its neighbors
    The intersection between the proposal of each node and the
    ones of its neighbors is included in the solution
    Capacities are updated in parallel
    Yields a feasible sub-optimal solution at each round
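One proposal round can be sketched as follows (a simplification of the description above: the full algorithm also updates the capacities and iterates until no edges remain):

```python
def greedymr_round(adj, w, b):
    """One proposal round: every node offers its b(v) heaviest incident
    edges; an edge is accepted when both endpoints propose it."""
    proposals = {}
    for v, nbrs in adj.items():
        ranked = sorted(nbrs, key=lambda u: -w[frozenset((v, u))])
        proposals[v] = set(ranked[: b[v]])
    return {frozenset((v, u))
            for v, ps in proposals.items() for u in ps
            if v in proposals[u]}

# Toy bipartite instance; every node has capacity 1
adj = {"t1": {"c1", "c2"}, "t2": {"c1"}, "c1": {"t1", "t2"}, "c2": {"t1"}}
w = {frozenset(("t1", "c1")): 3, frozenset(("t1", "c2")): 2,
     frozenset(("t2", "c1")): 1}
b = {"t1": 1, "t2": 1, "c1": 1, "c2": 1}
accepted = greedymr_round(adj, w, b)
```

Only the edge (t1, c1) is proposed by both of its endpoints, so it alone enters the solution in this round.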


  79. StackGreedyMR
    Hybrid approach
    Same structure as StackMR
    Uses a greedy heuristic in one of the randomized
    phases, when choosing the edges to propose
    We also tried a proportional heuristic, but the
    results were always worse than with the greedy one
    Mixed results overall


  80. Algorithms summary
                Approximation guarantee   MR rounds          Capacity violations
    StackMR     ⅙                         poly-logarithmic   1+∊
    GreedyMR    ½                         linear             no

  81. Vector representation
    Bag-of-words model
    flickr users: set of tags used in all photos
    flickr items (photos): set of tags
    Y! Answers users: set of words used in all answers
    Y! Answers items (questions): set of words
    Y! Answers: stopword removal, stemming, tf-idf


  82. Conclusions
    2 algorithms with different trade-offs between result
    quality and efficiency
    StackMR scales to very large datasets, has provable
    poly-logarithmic complexity and is faster in practice,
    capacity violations are negligible
    GreedyMR yields higher quality results, has ½
    approximation and can be stopped at any time


  83. Personalized Online News
    Recommendation


  84. News Get Old Soon
    90% of the clicks happen within 2 days from publication
    [Plot: news-click delay in minutes (log scale, 1–10,000), with the
    number of occurrences and its cumulative distribution]

  85. Builds a user model from Twitter
    Signals from user generated content, social neighbors and
    popularity across Twitter and news
    Deep personalization based on entities (overcomes
    vocabulary mismatch, easier to model relevance)
    Learn a personalized news ranking function
    Pick up candidates from a pool of related or popular fresh
    news, rank them and present top-k to the user
    T.Rex
    Twitter-based news recommendation system


  86. Ranking function is user and time dependent
    Social model + Content model + Popularity model
    Social model weights the content model of neighbors
    by a truncated PageRank on the Twitter network
    Content model measures relatedness of user’s tweet
    stream and news article represented as bag-of-entities
    Popularity model tracks entity popularity by the number
    of mentions in Twitter and news (exponential forgetting)
    Rτ (u, n) = α · Στ (u, n) + β · Γτ (u, n) + γ · Πτ (n)
    Recommendation Model
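The ranking function is a weighted sum of the three signals; a minimal sketch (signal values and weights are made up for illustration):

```python
def r_score(weights, signals):
    # R_tau(u, n) = alpha*Sigma_tau(u, n) + beta*Gamma_tau(u, n) + gamma*Pi_tau(n)
    alpha, beta, gamma = weights
    social, content, popularity = signals
    return alpha * social + beta * content + gamma * popularity

def recommend(candidates, weights, k):
    # candidates: {news_id: (social, content, popularity)}; return top-k by score
    return sorted(candidates, key=lambda n: -r_score(weights, candidates[n]))[:k]

ranking = recommend(
    {"n1": (0.9, 0.1, 0.2), "n2": (0.1, 0.8, 0.3), "n3": (0.0, 0.0, 1.0)},
    weights=(0.2, 0.7, 0.1), k=2)
```

With these weights the content-heavy article n2 outranks the socially relevant n1; the next slides describe how the weights α, β, γ are learned.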


  87. Recommendation Model
    [Diagram: matrices linking users (U), tweets (T), entities and news (N) —
    tweet matrix, news matrix, authorship matrix A, social matrix S, entity
    popularity vector.
    Content Model Γ: relatedness of the user’s tweets to the entities of a
    news article (Wikipedia pages are used as the entity space).
    Social Model Σ: the content models of a user’s neighbors, weighted by
    social influence.
    Popularity Model Π: popularity of the entities in a news article, updated
    by tracking mentions in news and twitter with exponential decay.]
    We model the social-network aspect first. The social component is induced
    by the twitter following relationship: in the social network adjacency
    matrix, S(i, j) is the inverse of the number of users followed by user ui
    if ui follows uj. We also adopt a functional ranking (Baeza-Yates et al.)
    that spreads the interests of a user among its neighbors recursively, up
    to a maximum hop distance d.
    Definition (Social influence S∗). Given a set of users U = {u0, u1, …} and
    a social network where each user may express an interest in the content
    produced by another user, the social influence model S∗, where S∗(i, j)
    measures the interest of user ui in the content of uj, is computed as
    S∗ = Σ_{i=1}^{d} σi · S^i
    where S is the normalized adjacency matrix of the social network, d is the
    distance up to which users may influence their neighbors, and σ is a
    damping factor.

  88. Learning the Weights
    Learning-to-rank approach with SVM
    Each time the user clicks on a news article, we learn a set of
    preferences (clicked_news > non_clicked_news):
    if τ ≤ c(ni) < c(nj) then Rτ(u, ni) > Rτ(u, nj)
    Prune the number of constraints for scalability:
    only news published in the last 2 days
    only take the top-k news for each ranking component
    T.Rex+ includes additional features: click count, age

  89. Predicting Clicked News
    User generated content is a good predictor, albeit sparse
    Click count is a strong baseline, but does not help T.Rex+
    Table 5.2: MRR (Mean Reciprocal Rank), precision and coverage
    Algorithm     MRR    P@1    P@5    P@10   Coverage
    RECENCY       0.020  0.002  0.018  0.036  1.000
    CLICKCOUNT    0.059  0.024  0.086  0.135  1.000
    SOCIAL        0.017  0.002  0.018  0.036  0.606
    CONTENT       0.107  0.029  0.171  0.286  0.158
    POPULARITY    0.008  0.003  0.005  0.012  1.000
    T.REX         0.107  0.073  0.130  0.168  1.000
    T.REX+        0.109  0.062  0.146  0.189  1.000
    RECENCY ranks news articles by time of publication (most recent first);
    CLICKCOUNT ranks news articles by click count (highest count first)