Slide 1

Slide 1 text

Big Data and the Web: Algorithms for Data Intensive Scalable Computing
Gianmarco De Francisci Morales
IMT Institute for Advanced Studies, Lucca
ISTI-CNR, Pisa
Supervisors: Claudio Lucchese, Ranieri Baraglia

Slide 2

Slide 2 text

Big Data...
“Data whose size forces us to look beyond the tried-and-true methods that are prevalent at the time” (Jacobs 2009)
“When the size of the data itself becomes part of the problem and traditional techniques for working with data run out of steam” (Loukides 2010)
3V: Volume, Variety, Velocity (Gartner 2011)

Slide 3

Slide 3 text

...and the Web
Largest publicly accessible data source in the world
Economic, socio-political, and scientific importance
Center of our digital lives, digital footprint
Data is large, noisy, diverse, fast
3 main models for data: Bags, Graphs, Streams

Slide 4

Slide 4 text

Big Data Mining
(Data Mining) Data mining is the process of inspecting data in order to extract useful information
(Data Exhaust) The quality of the information extracted benefits from the availability of extensive datasets
(Data Deluge) The size of these datasets calls for parallel solutions: Data Intensive Scalable Computing

Slide 5

Slide 5 text

DISC
Data Intensive Scalable Computing systems
Parallel, scalable, cost effective, fault tolerant
Not general purpose: data-parallel, with a restricted computing interface for the sake of performance
2 main computational models: MapReduce, Streaming

Slide 6

Slide 6 text

MapReduce
[Figure: dataflow from DFS input splits through Map tasks (Partition & Sort), the Shuffle, Merge & Group, and Reduce tasks, back to DFS output]
Map: [⟨k1, v1⟩] → [⟨k2, v2⟩]
Reduce: {k2: [v2]} → [⟨k3, v3⟩]
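
As an illustration of this programming model (not code from the thesis), here is a minimal word-count sketch in Python; map_fn, reduce_fn, and run_mapreduce are hypothetical names, and the sequential "shuffle" stands in for the framework's distributed one.

from itertools import groupby
from operator import itemgetter

# Map: [<k1, v1>] -> [<k2, v2>]   (here: <doc_id, text> -> <term, 1>)
def map_fn(doc_id, text):
    for term in text.split():
        yield term, 1

# Reduce: <k2, [v2]> -> [<k3, v3>]   (here: <term, [1, 1, ...]> -> <term, count>)
def reduce_fn(term, counts):
    yield term, sum(counts)

def run_mapreduce(docs):
    # Mappers would run in parallel on input splits; simulated sequentially here.
    intermediate = [kv for doc_id, text in docs for kv in map_fn(doc_id, text)]
    # Shuffle: partition and sort by key, then merge and group for the reducers.
    intermediate.sort(key=itemgetter(0))
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reduce_fn(key, [v for _, v in group]))
    return output

print(run_mapreduce([("d1", "A A B C"), ("d2", "B D D"), ("d3", "A B B C")]))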

Slide 7

Slide 7 text

Streaming (Actor Model)
PE: [s1, ⟨k1, v1⟩] → [s2, ⟨k2, v2⟩]
[Figure: live streams are routed by key (event routing) to a network of Processing Elements (PEs); an external persister writes the outputs]
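
A minimal sketch of a processing element that keeps per-key state and emits new keyed events downstream, illustrating the PE signature above (hypothetical class and names, not the S4 API).

class CounterPE:
    """Toy processing element: maintains per-key counts as its local state."""
    def __init__(self, emit):
        self.counts = {}   # local state s1
        self.emit = emit   # routes events to downstream PEs / streams

    def process(self, key, value):
        # state transition [s1, <k1, v1>] -> [s2, <k2, v2>]
        self.counts[key] = self.counts.get(key, 0) + value
        self.emit("counts_stream", key, self.counts[key])

# usage: events are routed by key to PE instances
pe = CounterPE(emit=lambda stream, k, v: print(stream, k, v))
for key, value in [("A", 1), ("B", 1), ("A", 1)]:
    pe.process(key, value)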

Slide 8

Slide 8 text

Research Goal

Slide 9

Slide 9 text

Research Goal
Design algorithms for Web mining that efficiently harness the power of Data Intensive Scalable Computing

Slide 10

Slide 10 text

Contributions
                                           Algorithm Structure   Data Complexity
Similarity Self-Join                       MR-Optimized          Bags
Social Content Matching                    MR-Iterative          Graphs
Personalized Online News Recommendation    S4-Streaming & MR     Streams & Graphs

Slide 11

Slide 11 text

Similarity Self-Join
Discover all pairs of objects whose similarity is above a given threshold
2 new MapReduce algorithms: SSJ-2 and SSJ-2R
Exact solution with efficient pruning
Tested on a large Web corpus from TREC: 4.5x faster than the state of the art
R. Baraglia, G. De Francisci Morales, C. Lucchese, “Document Similarity Self-Join with MapReduce”, IEEE International Conference on Data Mining, 2010
R. Baraglia, G. De Francisci Morales, C. Lucchese, “Scaling out All Pairs Similarity Search with MapReduce”, ACM Workshop on Large Scale Distributed Systems for IR, 2010
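
For reference, the naive quadratic version of the problem that SSJ-2 and SSJ-2R avoid materializing, sketched in Python with cosine similarity over term-frequency vectors (illustrative code, not the MapReduce implementation).

import math
from itertools import combinations
from collections import Counter

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm

def naive_similarity_self_join(docs, threshold):
    vectors = {d: Counter(text.split()) for d, text in docs.items()}
    results = []
    # All-pairs comparison: O(n^2) candidate pairs; SSJ-2/SSJ-2R prune most of them.
    for di, dj in combinations(vectors, 2):
        sim = cosine(vectors[di], vectors[dj])
        if sim >= threshold:
            results.append((di, dj, sim))
    return results

print(naive_similarity_self_join({"d1": "A A B C", "d2": "B D D", "d3": "A B B C"}, 0.6))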

Slide 12

Slide 12 text

Motivation

Slide 13

Slide 13 text

SSJ-2R

Slide 14

Slide 14 text

SSJ-2R
[Figure: two-phase MapReduce pipeline. An Indexing phase maps the example documents d1 = "A A B C", d2 = "B D D", d3 = "A B B C", and produces a Remainder File (d1 "A A", d3 "A", d2 "B") stored in the Distributed Cache. A Similarity phase shuffles partial scores such as <(d1,d3), 2> and <(d1,d3), 1> together with the documents (e.g. <(d1,!), "A A B C">), and the reducer emits the final score <(d1,d3), 5>.]

Slide 15

Slide 15 text

SSJ-2R
[Figure: same SSJ-2R pipeline as the previous slide (animation step)]

Slide 16

Slide 16 text

SSJ-2R
[Figure: same SSJ-2R pipeline as the previous slides (animation step)]

Slide 17

Slide 17 text

Experiments
4 workers with 16 cores, 8 GB memory, 2 TB disks
WT10G samples; metric: running time

Samples from the TREC WT10G collection:
                 D17K         D30K         D63K
# documents      17,024       30,683       63,126
# terms          183,467      297,227      580,915
# all pairs      289,816,576  941,446,489  3,984,891,876
# similar pairs  94,220       138,816      189,969

Slide 18

Slide 18 text

Results
[Figure: running time (seconds) vs. number of vectors for ELSA, VERN, SSJ-2, and SSJ-2R]

Slide 19

Slide 19 text

Results
[Figure: distribution of inverted list lengths; maximum list length 6600 for ELSA vs. 1729 for SSJ-2R]

Slide 20

Slide 20 text

Results
[Figure: per-mapper running time (seconds) for ELSA, VERN, and SSJ-2R with and without bucketing]

Slide 21

Slide 21 text

Social Content Matching
Select a subset of the edges of a weighted graph, maximizing the total weight of the solution while obeying the capacity constraints of the nodes
StackMR: ⅙ approximation, poly-logarithmic rounds, (1+∊) capacity violations
GreedyMR: ½ approximation, linear worst case, no violations
Validation on 2 large datasets from real-world systems: flickr and Yahoo! Answers
SSJ-2R used to build the weighted bipartite graphs
G. De Francisci Morales, A. Gionis, M. Sozio, “Social Content Matching in MapReduce”, International Conference on Very Large Data Bases, 2011

Slide 22

Slide 22 text

Motivation

Slide 23

Slide 23 text

Motivation

Slide 24

Slide 24 text

Motivation

Slide 25

Slide 25 text

Motivation

Slide 26

Slide 26 text

Problem: graph b-matching
Given a set of items T, a set of consumers C, a bipartite graph between them, edge weights w(ti, cj), and capacity constraints b(ti) and b(cj)
Find a matching M = {(t, c)} such that
- |M(ti)| ≤ b(ti)
- |M(cj)| ≤ b(cj)
- w(M) is maximized
[Figure: bipartite graph between Items and Consumers]
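
The same problem written as an integer program (a standard formulation, added here for clarity; it is not taken verbatim from the slides):

\max_{x} \; \sum_{(t_i, c_j) \in E} w(t_i, c_j)\, x_{ij}
\quad \text{s.t.} \quad
\sum_{j : (t_i, c_j) \in E} x_{ij} \le b(t_i) \;\; \forall t_i \in T, \qquad
\sum_{i : (t_i, c_j) \in E} x_{ij} \le b(c_j) \;\; \forall c_j \in C, \qquad
x_{ij} \in \{0, 1\}.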

Slide 27

Slide 27 text

Graph processing in MR
[Figure: one iteration of graph processing expressed as a Map step followed by a Reduce step]

Slide 28

Slide 28 text

Experiments
3 datasets; quality = b-matching value; efficiency = number of MR iterations
Evaluation of capacity violations for StackMR
Evaluation of convergence speed for GreedyMR

Dataset        |T|        |C|        |E|
flickr-small   2,817      526        550,667
flickr-large   373,373    32,707     1,995,123,827
yahoo-answers  4,852,689  1,149,714  18,847,281,236

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

No content

Slide 31

Slide 31 text

Personalized Online News Recommendation
Deliver personalized news recommendations based on a model built from the Twitter profile of each user
Learn a personalized ranking function from 3 signals: Social, Content, Popularity
Deep personalization via entity extraction
Tested on 1 month of Y! News + Twitter + Toolbar logs
Predicts the user's click within the top-10 positions 20% of the time
G. De Francisci Morales, A. Gionis, C. Lucchese, “From Chatter to Headlines: Harnessing the Real-Time Web for Personalized News Recommendations”, ACM International Conference on Web Search and Data Mining, 2012

Slide 32

Slide 32 text

Motivation

Slide 33

Slide 33 text

Why Twitter?
Timeliness and personalization
[Figure: number of mentions of “Osama Bin Laden” from May 1 to May 3 in news, Twitter, and clicks]

Slide 34

Slide 34 text

Recommended from Twitter!
[Screenshot: news module “Featured from your Twitter account”]

Slide 35

Slide 35 text

System Overview
Designed to be streaming and lightweight; the recommendation model is updated in real time
[Figure: tweets from the user and from their followees, together with news articles, feed the T.Rex user model, which produces a personalized ranked list of news articles]
Rτ(u, n) = α · Στ(u, n) + β · Γτ(u, n) + γ · Πτ(n)

Slide 36

Slide 36 text

Experiments
Automatic evaluation, aiming for precision
Framed as a click-prediction problem: at time τ, given a user model and a stream of published news, predict which news the user clicks on
Clicks from Y! Toolbar and news from Y! News
1 month of English tweets + crawled follower network

Slide 37

Slide 37 text

Evaluation Metrics
MRR = (1/|Q|) · Σ_{i=1}^{|Q|} 1 / r(n*_i)
where r(n*_i) is the rank of the clicked news article n*_i at the i-th event and Q is the set of tests
DCG[j] = G[j] if j = 1;  DCG[j−1] + G[j] / log2(j) if j > 1
where G[j] is the relevance of the document n^i_j at position j in the i-th ranking, G[j] = ⌈Jaccard(n*_i, n^i_j) × 5⌉ / 5
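
A small Python sketch of how these metrics can be computed from the test rankings (illustrative code with assumed variable names, not the thesis evaluation harness).

import math

def mrr(ranks_of_clicked):
    # ranks_of_clicked[i] = r(n*_i): the 1-based rank of the clicked article in test i
    return sum(1.0 / r for r in ranks_of_clicked) / len(ranks_of_clicked)

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def dcg(clicked_entities, ranked_entity_lists):
    # G[j] is the entity overlap between the clicked article and the article at rank j,
    # quantized to multiples of 1/5 and discounted by log2(j) for j > 1
    total = 0.0
    for j, entities in enumerate(ranked_entity_lists, start=1):
        gain = math.ceil(jaccard(clicked_entities, entities) * 5) / 5
        total += gain if j == 1 else gain / math.log2(j)
    return total

print(mrr([1, 3, 10]))                        # reciprocal ranks of clicked articles
print(dcg({"obama"}, [{"obama"}, {"nba"}]))   # entity overlap along one ranking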

Slide 38

Slide 38 text

Results
Entity overlap between clicked and suggested news articles
[Figure: average DCG by rank (1 to 20) for T.Rex+, T.Rex, Popularity, Content, Social, Recency, and Click count]

Slide 39

Slide 39 text

Conclusions
Tackle the big data problem on the Web by designing large-scale Web mining algorithms for DISC systems
Address classical problems like similarity, matching and recommendation in the context of Web mining with large, heterogeneous datasets
Provide novel, efficient and scalable solutions for the MapReduce and streaming programming models

Slide 40

Slide 40 text

Conclusions
Similarity of bags of Web pages: SSJ-2 and SSJ-2R
- 4.5x faster than the state of the art
- Importance of careful design of MR algorithms
Matching of Web 2.0 content on graphs: StackMR and GreedyMR
- Iterative MR algorithms with provable approximation guarantees
- First solution to the b-matching problem in MR
- Scalable computation pattern for graph mining in MR
Personalized recommendation of news from streams: T.Rex
- Predicts user interest from the real-time social Web
- Parallelizable online stream + graph mining

Slide 41

Slide 41 text

Thanks

Slide 42

Slide 42 text

Similarity Self-Join

Slide 43

Slide 43 text

SSJ-2 Example

Slide 44

Slide 44 text

SSJ-2 Example
[Figure: Indexing phase. The documents d1 = "A A B C", d2 = "B D D", d3 = "A B B C" are mapped, shuffled, and reduced to build the inverted index]

Slide 45

Slide 45 text

SSJ-2 Example
[Figure: Similarity phase. The partial scores <(d1,d3), 2> and <(d1,d3), 1> from the Indexing phase are shuffled and grouped as <(d1,d3), [2,1]>; the reducer fetches d1 = "A A B C" and d3 = "A B B C" from HDFS and emits the final score <(d1,d3), 5>]

Slide 46

Slide 46 text

SSJ-2
[Figure: documents di and dj split into a pruned part and an indexed part at boundaries bi and bj over the lexicon |L|]

Slide 47

Slide 47 text

SSJ-2
[Figure: pruned vs. indexed parts of di and dj]
• Indexing & Prefix filtering

Slide 48

Slide 48 text

SSJ-2
[Figure: pruned vs. indexed parts of di and dj]
• Indexing & Prefix filtering
• Need to retrieve pruned part

Slide 49

Slide 49 text

SSJ-2
[Figure: pruned vs. indexed parts of di and dj]
• Indexing & Prefix filtering
• Need to retrieve pruned part
• Actually, retrieve the whole documents

Slide 50

Slide 50 text

SSJ-2
[Figure: pruned vs. indexed parts of di and dj]
• Indexing & Prefix filtering
• Need to retrieve pruned part
• Actually, retrieve the whole documents
• 2 remote (DFS) I/O per pair

Slide 51

Slide 51 text

SSJ-2R
⟨di⟩; ⟨(di, dj), W^A_ij⟩; ⟨(di, dj), W^B_ij⟩; ⟨(di, dk), W^A_ik⟩; …   (group by key di)
⟨dj⟩; ⟨(dj, dk), W^A_jk⟩; ⟨(dj, dk), W^B_jk⟩; ⟨(dj, dl), W^A_jl⟩; …   (group by key dj)
Remainder file = pruned part of the input
Pre-load the remainder file in memory, no further disk I/O
Shuffle the input together with the partial similarity scores
[Figure: pruned vs. indexed parts of di and dj]

Slide 52

Slide 52 text

SSJ-2R Reducer
Reduce input:
(d0,d1),[w1,w2,w3,...]
(d0,d3),[w1,w2,w3,...]
(d0,d4),[w1,w2,w3,...]
(d0,d2),[w1,w2,w3,...]
(d0,!),[t1,t2,t3,...]
Sort pairs on both IDs, group on the first (Secondary Sort)
Only 1 reducer reads d0
Remainder file contains only the useful portion of the other documents (about 10%)

Slide 53

Slide 53 text

SSJ-2R Reducer
[Reduce input as in the previous slide]
Sort pairs on both IDs, group on the first (Secondary Sort)
Only 1 reducer reads d0
Whole document shuffled via MR
Remainder file contains only the useful portion of the other documents (about 10%)

Slide 54

Slide 54 text

SSJ-2R Reducer
[Reduce input as in the previous slide]
Sort pairs on both IDs, group on the first (Secondary Sort)
Only 1 reducer reads d0
Whole document shuffled via MR
Remainder file preloaded in memory
Remainder file contains only the useful portion of the other documents (about 10%)
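
A single-machine sketch of the secondary-sort idea behind this reducer (assumed names, not the Hadoop implementation): records are sorted on the composite key so that the special (d0, "!") record carrying the whole document is seen before the partial scores for d0, while grouping happens on the first ID only.

from itertools import groupby

# Hypothetical reduce input: composite keys (d0, dj) with lists of partial weights,
# plus the special marker "!" whose value is the document content of d0.
records = [
    (("d0", "d3"), [0.2, 0.1]),
    (("d0", "!"),  ["t1", "t2", "t3"]),
    (("d0", "d1"), [0.4]),
    (("d0", "d2"), [0.3, 0.3]),
]

# Secondary sort: order by (first id, second id), with "!" sorting before any doc id.
records.sort(key=lambda kv: (kv[0][0], "" if kv[0][1] == "!" else kv[0][1]))

# Group on the first id only: one reducer call handles all pairs (d0, *).
for d0, group in groupby(records, key=lambda kv: kv[0][0]):
    group = list(group)
    doc_terms = group[0][1]             # the (d0, "!") record seen first
    for (_, dj), weights in group[1:]:
        print(d0, dj, sum(weights))     # combine the partial similarity scores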

Slide 55

Slide 55 text

Results
[Figure: running time (seconds) vs. number of documents for Elsayed et al., SSJ-2, Vernica et al., and SSJ-2R]

Slide 56

Slide 56 text

Partitioning
[Figure: the (d0, dj) score records and the (d0,!) document record are split into K = 2 slices]
• Split in K slices
• Each reducer needs to load only 1/K of the remainder file
• Need to replicate the input K times

Slide 57

Slide 57 text

Map phase
[Figure: number of map output records and average map running time (seconds) vs. number of documents for Elsayed et al., SSJ-2, Vernica et al., and SSJ-2R]

Slide 58

Slide 58 text

Map phase
[Figure: distribution of inverted list lengths; maximum list length 6600 for Elsayed et al. vs. 1729 for SSJ-2R]

Slide 59

Slide 59 text

Map phase
[Figure: per-mapper map times (threshold = 0.9) for Elsayed et al. and SSJ-2R with and without bucketing]

Slide 60

Slide 60 text

Reduce phase
[Figure: average reduce running time (seconds) vs. number of documents for Elsayed et al., SSJ-2, Vernica et al., and SSJ-2R]

Slide 61

Slide 61 text

Social Content Matching

Slide 62

Slide 62 text

System overview
The application operates in consecutive phases (each phase in the range from hours to days)
Before the beginning of the i-th phase, the application makes a tentative allocation of items to users
Capacity constraints:
- Users: an estimate of the number of logins during the i-th phase
- Items: proportional to a quality assessment, or constant
B = Σ_{c∈C} b(c) = Σ_{t∈T} b(t)
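
A small sketch (assumed names and toy data) of how item capacities could be rescaled so that both sides of B = Σ_{c∈C} b(c) = Σ_{t∈T} b(t) balance; rounding makes the equality approximate.

def balance_capacities(user_logins, item_quality):
    # User capacities: estimated number of logins during the phase.
    b_c = dict(user_logins)
    B = sum(b_c.values())
    # Item capacities: proportional to a quality score, rescaled to the same total B.
    total_quality = sum(item_quality.values())
    b_t = {t: round(B * q / total_quality) for t, q in item_quality.items()}
    return b_c, b_t, B

b_c, b_t, B = balance_capacities({"u1": 3, "u2": 5}, {"t1": 1.0, "t2": 3.0})
print(B, b_c, b_t)   # B = 8; item capacities roughly 2 and 6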

Slide 63

Slide 63 text

Graph building
Edge weight is the cosine similarity between vector representations of the item and the consumer: w(ti, cj) = v(ti) · v(cj)
Prune the O(|T||C|) candidate edges by discarding low-weight edges (we want to maximize the total weight)
Similarity join between T and C in MapReduce

Slide 64

Slide 64 text

StackMR
Primal-dual formulation of the problem (Integer Linear Programming)
Compute a maximal ⌈∊b⌉-matching in parallel
Push it onto the stack, update the dual variables, and remove covered edges
When there are no more edges, pop the whole stack and include edges in the solution layer by layer
For efficiency, allow (1+∊) violations of the capacity constraints

Slide 65

Slide 65 text

StackMR Example

Slide 66

Slide 66 text

StackMR Example

Slide 67

Slide 67 text

StackMR Example

Slide 68

Slide 68 text

StackMR Example

Slide 69

Slide 69 text

StackMR Example

Slide 70

Slide 70 text

StackMR Example

Slide 71

Slide 71 text

StackMR Example

Slide 72

Slide 72 text

StackMR Example

Slide 73

Slide 73 text

StackMR Example

Slide 74

Slide 74 text

StackMR Example

Slide 75

Slide 75 text

StackMR Example

Slide 76

Slide 76 text

StackMR Example

Slide 77

Slide 77 text

StackMR Example

Slide 78

Slide 78 text

GreedyMR
Adaptation to MR of a classical greedy algorithm (sort the edges by weight; include the current edge if it maintains the constraints, and update the capacities), sketched below
At each round, each node proposes its top-weight b(v) edges to its neighbors
The intersection between the proposals of a node and those of its neighbors is included in the solution
Capacities are updated in parallel
Yields a feasible sub-optimal solution at each round
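
For reference, the classical sequential greedy ½-approximation that GreedyMR adapts, in a short Python sketch (illustrative, not the MapReduce implementation).

def greedy_b_matching(edges, capacity):
    """edges: list of (weight, t, c); capacity: dict node -> b(v).
    Scan edges by decreasing weight and keep an edge whenever both
    endpoints still have residual capacity."""
    matching = []
    remaining = dict(capacity)
    for weight, t, c in sorted(edges, reverse=True):
        if remaining[t] > 0 and remaining[c] > 0:
            matching.append((t, c, weight))
            remaining[t] -= 1
            remaining[c] -= 1
    return matching

edges = [(0.9, "t1", "c1"), (0.8, "t1", "c2"), (0.5, "t2", "c1")]
print(greedy_b_matching(edges, {"t1": 1, "t2": 1, "c1": 2, "c2": 1}))
# [('t1', 'c1', 0.9), ('t2', 'c1', 0.5)]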

Slide 79

Slide 79 text

StackGreedyMR
Hybrid approach with the same structure as StackMR
Uses a greedy heuristic in one of the randomized phases, when choosing the edges to propose
We also tried a proportional heuristic, but the results were always worse than with the greedy one
Mixed results overall

Slide 80

Slide 80 text

Algorithms summary
           Approximation guarantee   MR rounds          Capacity violations
StackMR    ⅙                         poly-logarithmic   1+∊
GreedyMR   ½                         linear             no

Slide 81

Slide 81 text

Vector representation
Bag-of-words model
flickr users: set of tags used in all photos
flickr items (photos): set of tags
Y! Answers users: set of words used in all answers
Y! Answers items (questions): set of words
Y! Answers: stopword removal, stemming, tf-idf
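
A compact tf-idf vectorization over bags of words, sketching how such vectors could be built (assumed preprocessing; the tokens are expected to be already stopword-filtered and stemmed).

import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: dict id -> list of tokens; returns dict id -> {token: tf-idf weight}."""
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    n = len(docs)
    vectors = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        vectors[doc_id] = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return vectors

print(tfidf_vectors({"q1": ["install", "python"], "q2": ["python", "error"]}))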

Slide 82

Slide 82 text

Conclusions
2 algorithms with different trade-offs between result quality and efficiency
StackMR scales to very large datasets, has provable poly-logarithmic complexity, and is faster in practice; capacity violations are negligible
GreedyMR yields higher-quality results, has a ½ approximation guarantee, and can be stopped at any time

Slide 83

Slide 83 text

Personalized Online News Recommendation

Slide 84

Slide 84 text

News Get Old Soon
90% of the clicks happen within 2 days from publication
[Figure: distribution of the news-click delay (minutes, log scale) vs. number of occurrences]

Slide 85

Slide 85 text

T.Rex: Twitter-based news recommendation system
Builds a user model from Twitter
Signals from user-generated content, social neighbors, and popularity across Twitter and news
Deep personalization based on entities (overcomes vocabulary mismatch, easier to model relevance)
Learns a personalized news ranking function
Picks candidates from a pool of related or popular fresh news, ranks them, and presents the top-k to the user

Slide 86

Slide 86 text

Recommendation Model
The ranking function is user and time dependent: social model + content model + popularity model
Rτ(u, n) = α · Στ(u, n) + β · Γτ(u, n) + γ · Πτ(n)
The social model weights the content models of the neighbors by a truncated PageRank on the Twitter network
The content model measures the relatedness of the user's tweet stream and the news article, both represented as bags of entities
The popularity model tracks entity popularity by the number of mentions in Twitter and news (exponential forgetting)
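
A minimal sketch of the combined ranking function Rτ(u, n) = α·Στ(u, n) + β·Γτ(u, n) + γ·Πτ(n) with placeholder component models (the function names and toy scores below are assumptions, not the T.Rex implementation).

def rank_news(user, candidates, social, content, popularity, alpha, beta, gamma):
    """social(u, n), content(u, n), popularity(n) are the three component models."""
    def score(n):
        return (alpha * social(user, n) +
                beta * content(user, n) +
                gamma * popularity(n))
    # Candidates sorted by the combined, user- and time-dependent score.
    return sorted(candidates, key=score, reverse=True)

# toy usage with constant component models
ranked = rank_news("u1", ["n1", "n2"],
                   social=lambda u, n: 0.1,
                   content=lambda u, n: 0.7 if n == "n2" else 0.2,
                   popularity=lambda n: 0.5,
                   alpha=0.3, beta=0.5, gamma=0.2)
print(ranked)   # ['n2', 'n1']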

Slide 87

Slide 87 text

Recommendation Model
[Figure: matrix formulation of the model components.
Content model Γ: Γ(i,j) is the content relevance of news n_j for user u_i, computed from the authorship matrix A (A(t,u) = 1 if u is the author of tweet t), the tweet-to-entity relatedness matrix, and the entity-to-news relatedness matrix; Wikipedia pages are used as the entity space.
Social model Σ: Σ(i,j) is the social relevance of news n_j for user u_i, obtained by combining the content model with the social influence matrix S* = Σ_{i=1}^{d} σ_i S^i, where S is the normalized adjacency matrix of the Twitter follower network and d is the maximum hop distance up to which users may influence their neighbors.
Popularity model Π: Π(j) is the popularity of news article n_j, given by an entity-popularity vector updated by tracking mentions in news and Twitter with exponential decay.]

Slide 88

Slide 88 text

Learning the Weights
Learning-to-rank approach with SVM
Each time the user clicks on a news article we learn a set of preferences (clicked news ranked above non-clicked news):
if τ ≤ c(ni) < c(nj) then Rτ(u, ni) > Rτ(u, nj)
Prune the number of constraints for scalability:
- only news published in the last 2 days
- only take the top-k news for each ranking component
T.Rex+ includes additional features: click count, age
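
A sketch of how the pairwise preference constraints could be generated before feeding them to a ranking SVM, under the pruning rules above (hypothetical names and data; not the actual training pipeline).

def pairwise_preferences(clicked, candidates, publish_time, tau, max_age=2 * 24 * 3600):
    """Emit (preferred, other) pairs: each clicked article should outrank
    every non-clicked candidate published within the last 2 days."""
    prefs = []
    recent = [n for n in candidates if tau - publish_time[n] <= max_age]
    for ni in clicked:
        for nj in recent:
            if nj not in clicked:
                prefs.append((ni, nj))   # constraint: R_tau(u, ni) > R_tau(u, nj)
    return prefs

prefs = pairwise_preferences(clicked={"n1"}, candidates=["n1", "n2", "n3"],
                             publish_time={"n1": 0, "n2": 50, "n3": 100}, tau=200)
print(prefs)   # [('n1', 'n2'), ('n1', 'n3')]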

Slide 89

Slide 89 text

Predicting Clicked News
User-generated content is a good predictor, albeit sparse
Click count is a strong baseline but does not help T.Rex+

MRR, precision, and coverage:
Algorithm    MRR    P@1    P@5    P@10   Coverage
RECENCY      0.020  0.002  0.018  0.036  1.000
CLICKCOUNT   0.059  0.024  0.086  0.135  1.000
SOCIAL       0.017  0.002  0.018  0.036  0.606
CONTENT      0.107  0.029  0.171  0.286  0.158
POPULARITY   0.008  0.003  0.005  0.012  1.000
T.REX        0.107  0.073  0.130  0.168  1.000
T.REX+       0.109  0.062  0.146  0.189  1.000

RECENCY ranks news articles by time of publication (most recent first); CLICKCOUNT ranks them by click count (highest count first)