
Similarity Self-Join with MapReduce


Presentation of my article at ICDM'10

Gianmarco De Francisci Morales

December 17, 2010

Transcript

  1. Document Similarity
    Self-Join with MapReduce
    G. De Francisci Morales, C. Lucchese, R. Baraglia
    ISTI-CNR Pisa && IMT Lucca, Italy


  2. Similarity Self-Join


  3. Similarity Self-Join


  4. Similarity Self-Join
    Discover all those pairs of objects whose similarity is above a certain threshold

  5. Similarity Self-Join
    Discover all those pairs of objects whose similarity is above a certain threshold
    Also known as the “All Pairs” problem

  6. Similarity Self-Join
    Discover all those pairs of objects whose similarity is above a certain threshold
    Also known as the “All Pairs” problem
    Useful for near-duplicate detection, recommender systems, spam detection, etc.

  7. Overview
    2 new algorithms: SSJ-2 and SSJ-2R
    Exact solution to the document SSJ problem
    Parallel execution using MapReduce
    Draw from state-of-the-art serial algorithms
    SSJ-2R is up to 4.5x faster than the best known algorithms

  8. Assumptions
    Vector space model, bag of words
    Unit-normalized vectors
    Symmetric similarity function

    \cos(d_i, d_j) = \frac{\sum_{0 \le t < |L|} d_i[t] \cdot d_j[t]}{\|d_i\| \, \|d_j\|} \ge \sigma
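
A minimal sketch of this similarity test (helper names are illustrative; sparse term-weight dictionaries, unit-normalized as the slide assumes):

```python
import math

def normalize(d):
    """Scale a bag-of-words vector to unit length."""
    norm = math.sqrt(sum(w * w for w in d.values()))
    return {t: w / norm for t, w in d.items()}

def cosine(di, dj):
    """Dot product of two sparse vectors; with unit-normalized
    vectors this is exactly their cosine similarity."""
    if len(dj) < len(di):                  # iterate over the shorter one
        di, dj = dj, di
    return sum(w * dj.get(t, 0.0) for t, w in di.items())

d1 = normalize({"A": 2, "B": 1, "C": 1})   # "A A B C"
d3 = normalize({"A": 1, "B": 2, "E": 1})   # "A B B E"
sigma = 0.5
print(cosine(d1, d3) >= sigma)             # True: 4/6 >= 0.5
```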


  9. MapReduce
    [Diagram: inputs on the DFS → map tasks → shuffle (partition & sort, merge & group) → reduce tasks → outputs on the DFS]

    \text{Map} : [k_1, v_1] \to [k_2, v_2]
    \text{Reduce} : \{k_2 : [v_2]\} \to [k_3, v_3]


  10. Filtering approach
    Generate “signatures” for documents
    Group candidates by signature
    Only documents that share a signature may be part of the solution
    Compute similarities in each group
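
A generic sketch of this pattern (illustrative; `signatures` and `similarity` are parameters, e.g. the term-based signatures of the next slides):

```python
from collections import defaultdict
from itertools import combinations

def filtering_join(docs, signatures, similarity, sigma):
    """Group documents by signature, verify pairs only within groups."""
    groups = defaultdict(set)
    for doc_id, d in docs.items():
        for s in signatures(d):            # e.g. terms, prefixes, hashes
            groups[s].add(doc_id)
    seen, results = set(), {}
    for ids in groups.values():
        for pair in combinations(sorted(ids), 2):
            if pair in seen:               # a pair may share many signatures
                continue
            seen.add(pair)
            score = similarity(docs[pair[0]], docs[pair[1]])
            if score >= sigma:
                results[pair] = score
    return results
```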


  11. “Full Filtering”
    [Figure 2 of the paper: computing pairwise similarity of a toy collection of 3 documents, d1 = “A A B C”, d2 = “B D D”, d3 = “A B B E”, with the simple term weighting scheme w_{t,d} = tf_{t,d}. The Indexing job maps each document to (term, (doc, tf)) pairs and reduces them into an inverted index; the Pairwise Similarity job emits the individual term contributions to each inner product, and its reducer sums them into the final similarity scores: ((d1,d2), 1), ((d1,d3), 4), ((d2,d3), 2).]
    [Plot: computation time (minutes) with signature = terms; series “Build inverted index” and “Compute similarity”; fitted with R² = 0.997.]

  12. “Full Filtering”
    (same figure and plot as slide 11)
    Zipfian distribution of terms

  13. “Full Filtering”
    (same figure and plot as slide 11)
    Zipfian distribution of terms
    Computes low-score similarities
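
The two jobs in the figure, sketched with the toy map_reduce runner from the MapReduce slide (inlined so the snippet stands alone; weights are raw term frequencies, as in the figure):

```python
from collections import Counter
from itertools import combinations, groupby
from operator import itemgetter

def map_reduce(inputs, mapper, reducer):   # toy runner, as sketched earlier
    mapped = sorted((kv for record in inputs for kv in mapper(*record)),
                    key=itemgetter(0))
    return [out for key, group in groupby(mapped, key=itemgetter(0))
            for out in reducer(key, [v for _, v in group])]

# Job 1 (Indexing): term -> posting list of (doc_id, weight)
def index_mapper(doc_id, text):
    for term, tf in Counter(text.split()).items():
        yield term, (doc_id, tf)

def index_reducer(term, postings):
    yield term, postings

# Job 2 (Pairwise Similarity): one partial product per shared term,
# summed per pair by the reducer
def pairs_mapper(term, postings):
    for (di, wi), (dj, wj) in combinations(sorted(postings), 2):
        yield (di, dj), wi * wj

def pairs_reducer(pair, contributions):
    yield pair, sum(contributions)

docs = [("d1", "A A B C"), ("d2", "B D D"), ("d3", "A B B E")]
index = map_reduce(docs, index_mapper, index_reducer)
print(map_reduce(index, pairs_mapper, pairs_reducer))
# [(('d1', 'd2'), 1), (('d1', 'd3'), 4), (('d2', 'd3'), 2)]
```

Every co-occurring term generates shuffle traffic, which is why the Zipfian head terms dominate the cost and most of the computed pairs end up below σ.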


  14. Prefix Filtering
    Signatures = subset of terms
    Global ordering of terms by decreasing frequency
    Upper bound on similarity with the rest of the input

    \hat{d}[t] = \max_{d \in D} d[t]
    S(d) = \{\, b(d) \le t < |L| \mid d[t] \ne 0 \,\}
    b(d) = \max \Big\{\, b \;\Big|\; \sum_{0 \le t < b} d[t] \cdot \hat{d}[t] < \sigma \,\Big\}
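
A sketch of these definitions (illustrative; vectors are dense lists indexed by term rank in the global ordering, and b(d) is read as the longest prefix whose best-case contribution stays below σ):

```python
def max_weights(docs):
    """hat_d[t]: the largest weight of term t anywhere in the collection."""
    hat = [0.0] * len(docs[0])
    for d in docs:
        hat = [max(h, w) for h, w in zip(hat, d)]
    return hat

def boundary(d, hat, sigma):
    """b(d): longest prefix that cannot alone reach sigma against any
    document, so it can be pruned from the index without losing results."""
    acc, b = 0.0, 0
    while b < len(d) and acc + d[b] * hat[b] < sigma:
        acc += d[b] * hat[b]
        b += 1
    return b

def signature(d, b):
    """S(d): the non-zero terms from the boundary onward."""
    return {t for t in range(b, len(d)) if d[t] != 0}
```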


  15. SSJ-2
    [Diagram: each document d_i is split at its boundary b_i into a pruned part [0, b_i) and an indexed part [b_i, |L|)]

  16. SSJ-2
    (same pruned/indexed diagram)
    • Indexing & Prefix filtering

  17. SSJ-2
    (same pruned/indexed diagram)
    • Indexing & Prefix filtering
    • Need to retrieve pruned part

  18. SSJ-2
    (same pruned/indexed diagram)
    • Indexing & Prefix filtering
    • Need to retrieve pruned part
    • Actually, retrieve the whole documents

  19. SSJ-2
    (same pruned/indexed diagram)
    • Indexing & Prefix filtering
    • Need to retrieve pruned part
    • Actually, retrieve the whole documents
    • 2 remote (DFS) I/Os per pair

  20. SSJ-2R
    d_i: ⟨(d_i, d_j), W^A_{ij}⟩; ⟨(d_i, d_j), W^B_{ij}⟩; ⟨(d_i, d_k), W^A_{ik}⟩; … → group by key d_i
    d_j: ⟨(d_j, d_k), W^A_{jk}⟩; ⟨(d_j, d_k), W^B_{jk}⟩; ⟨(d_j, d_l), W^A_{jl}⟩; … → group by key d_j
    Remainder file = pruned part of the input
    Pre-load remainder file in memory, no further disk I/O
    Shuffle the input together with the partial similarity scores
    (same pruned/indexed diagram as SSJ-2)
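
A sketch of the map side of this phase (illustrative names; partial weights come from the pruned inverted index, and every document is also shipped once under a special key so it reaches the same reducer as its scores):

```python
from itertools import combinations

def similarity_mapper(term, postings):
    """Over the pruned index: one partial product per candidate pair,
    keyed so all pairs led by the same document meet at one reducer."""
    for (di, wi), (dj, wj) in combinations(sorted(postings), 2):
        yield (di, dj), wi * wj

def document_mapper(doc_id, weights):
    """Over the input: shuffle the whole document once, under a key the
    secondary sort places before all of its (doc_id, d_j) score keys."""
    yield (doc_id, "!"), weights
```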


  21. SSJ-2R Reducer
    Reduce input (after the shuffle):
    ⟨(d0, !), [t1, t2, t3, …]⟩
    ⟨(d0, d1), [w1, w2, w3, …]⟩
    ⟨(d0, d2), [w1, w2, w3, …]⟩
    ⟨(d0, d3), [w1, w2, w3, …]⟩
    ⟨(d0, d4), [w1, w2, w3, …]⟩
    Sort pairs on both IDs, group on first (secondary sort)
    Only 1 reducer reads d0
    Remainder file contains only the useful portion of the other documents (about 10%)

  22. SSJ-2R Reducer
    Whole document shuffled via MR
    (same reduce input and notes as above)

  23. SSJ-2R Reducer
    Whole document shuffled via MR
    Remainder file preloaded in memory
    (same reduce input and notes as above)
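
A sketch of this reducer (illustrative, single-machine; it assumes the runtime delivers groups in secondary-sort order, so the "!" entry carrying d0 itself arrives first, and that the remainder file is already in memory):

```python
def ssj2r_reducer(groups, remainder, sigma):
    """groups: [((d0, '!'), [dict of d0's term weights]),
                ((d0, dj), [partial scores from the pruned index]), ...]
    remainder: {doc_id: pruned term weights}, preloaded from the
    distributed cache."""
    d0_weights = None
    for (d0, second), values in groups:
        if second == "!":                   # the shuffled document itself
            d0_weights = values[0]
            continue
        partial = sum(values)               # indexed-part contributions
        pruned = remainder.get(second, {})  # pruned part of the other doc
        score = partial + sum(w * pruned.get(t, 0.0)
                              for t, w in d0_weights.items())
        if score >= sigma:
            yield (d0, second), score

# Toy run matching the example on the next slides (tf weights):
remainder = {"d1": {"A": 2}, "d2": {"B": 1}, "d3": {"A": 1}}
groups = [(("d1", "!"), [{"A": 2, "B": 1, "C": 1}]),   # d1 = "A A B C"
          (("d1", "d3"), [2, 1])]                      # partial scores
print(list(ssj2r_reducer(groups, remainder, sigma=0)))
# [(('d1', 'd3'), 5)]
```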


  24. SSJ-2R Example
    [Diagram, Indexing phase: mappers read d1 = “A A B C”, d2 = “B D D”, d3 = “A B B C”, emit (term, (doc, tf)) pairs; the shuffle groups them by term and the reducers write the pruned inverted lists]

  25. SSJ-2R Example
    [Same indexing diagram; the pruned parts form the remainder file (d1: “A A”, d2: “B”, d3: “A”), which is loaded into the distributed cache]

  26. SSJ-2R Example
    [Full pipeline: after Indexing, the Similarity phase shuffles the partial scores ⟨(d1,d3), 2⟩ and ⟨(d1,d3), 1⟩ together with the whole documents ⟨(d1,!), “A A B C”⟩ and ⟨(d3,!), “A B B C”⟩; with the remainder file (d1: “A A”, d2: “B”, d3: “A”) in the distributed cache, the reducer outputs ⟨(d1,d3), 5⟩]

  27. Running time
    [Plot: total running time in seconds (0–60000) vs. number of documents (15000–65000) for Elsayed et al., SSJ-2, Vernica et al., and SSJ-2R]

  28. Map phase
    [Plots (a), (c): average map running time in seconds vs. number of documents (15000–65000) for Elsayed et al., SSJ-2, Vernica et al., and SSJ-2R]

  29. Map phase
    [Log-log plots: number of inverted lists vs. inverted list length; maximum list length 6600 for Elsayed et al. vs. 1729 for SSJ-2R]

  30. Reduce phase
    [Plots (b), (d): average reduce running time in seconds (up to 18000) vs. number of documents (15000–65000) for Elsayed et al., SSJ-2, Vernica et al., and SSJ-2R]

  31. Conclusions
    Effective distributed index pruning on MapReduce
    Leverage different communication patterns
    Up to 4.5x faster than state-of-the-art
    Scalable, configurable memory footprint


  32. Thanks
