
# Similarity Self-Join with MapReduce

Presentation of my article at ICDM'10

## Gianmarco De Francisci Morales

December 17, 2010

## Transcript

1. ### Document Similarity Self-Join with MapReduce G. De Francisci Morales, C. Lucchese, R. Baraglia (ISTI-CNR Pisa & IMT Lucca, Italy)

4. ### Similarity Self-Join Discover all pairs of objects whose similarity is above a given threshold
5. ### Similarity Self-Join Discover all pairs of objects whose similarity is above a given threshold. Also known as the “All Pairs” problem
6. ### Similarity Self-Join Discover all pairs of objects whose similarity is above a given threshold. Also known as the “All Pairs” problem. Useful for near-duplicate detection, recommender systems, spam detection, etc.
7. ### Overview 2 new algorithms: SSJ-2 and SSJ-2R. Exact solution to the document SSJ problem. Parallel execution using MapReduce, drawing on state-of-the-art serial algorithms. SSJ-2R is up to 4.5x faster than the best known algorithms
8. ### Assumptions Vector space model, bag of words. Unit-normalized vectors. Symmetric similarity function: cos(dᵢ, dⱼ) = (Σ_{0 ≤ t < |L|} dᵢ[t] · dⱼ[t]) / (‖dᵢ‖ ‖dⱼ‖) ≥ σ
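As a concrete illustration of these assumptions, here is a minimal Python sketch (not from the paper; names are mine) that builds unit-normalized tf vectors from bags of words, so that cosine similarity reduces to a plain dot product:

```python
from collections import Counter
from math import sqrt

def unit_vector(text):
    """Bag-of-words tf vector, normalized to unit length."""
    tf = Counter(text.split())
    norm = sqrt(sum(w * w for w in tf.values()))
    return {t: w / norm for t, w in tf.items()}

def cosine(di, dj):
    """For unit vectors, cosine similarity is just the dot product."""
    return sum(w * dj.get(t, 0.0) for t, w in di.items())

d1, d2 = unit_vector("A A B C"), unit_vector("A B B E")
print(round(cosine(d1, d2), 4))  # → 0.6667
```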
9. ### MapReduce [Diagram: DFS input splits → Map → shuffle (partition & sort, merge & group) → Reduce → DFS outputs] Map: (k1, v1) → [(k2, v2)]; Reduce: (k2, [v2]) → [(k3, v3)]
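The map/reduce signatures above can be simulated in a few lines. The following toy Python harness (an illustration, not Hadoop) pushes a word count through the map → shuffle → reduce pipeline:

```python
from collections import defaultdict

def run_mapreduce(inputs, mapper, reducer):
    """Toy in-memory MapReduce: map, shuffle (group by key), reduce."""
    groups = defaultdict(list)
    for record in inputs:
        for k2, v2 in mapper(record):      # Map: (k1, v1) -> [(k2, v2)]
            groups[k2].append(v2)          # Shuffle: group values by key
    out = []
    for k2, vs in sorted(groups.items()):  # sorted, as the runtime would
        out.extend(reducer(k2, vs))        # Reduce: (k2, [v2]) -> [(k3, v3)]
    return out

# Word count, the canonical example
docs = ["A A B C", "B D D"]
counts = run_mapreduce(
    docs,
    mapper=lambda doc: [(t, 1) for t in doc.split()],
    reducer=lambda term, ones: [(term, sum(ones))],
)
print(counts)  # → [('A', 2), ('B', 2), ('C', 1), ('D', 2)]
```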
10. ### Filtering approach Generate “signatures” for documents. Group candidates by signature: only documents that share a signature may be part of the solution. Compute similarities within each group
11. ### “Full Filtering” [Figure: two MapReduce phases on a toy collection of 3 documents (“A A B C”, “B D D”, “A B B E”) with weights wt,d = tft,d. Indexing: the maps emit (term, (doc, weight)) tuples and the reducers build the posting lists, e.g. (A, [(d1, 2), (d3, 1)]). Pairwise Similarity: the maps emit a partial score ((di, dj), di[t] · dj[t]) for every pair of postings of each term, and the reducers sum the contributions per pair, e.g. ((d1, d3), [2, 2]) → ((d1, d3), 4)]
12. ### “Full Filtering” [Same figure] Zipfian distribution of terms
13. ### “Full Filtering” [Same figure] Computes low score similarities
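The two phases can be sketched in plain Python (an illustration of the scheme, not the paper's Hadoop code); on the toy collection it reproduces the slide's scores:

```python
from collections import Counter, defaultdict
from itertools import combinations

def full_filtering(docs):
    """Two MapReduce-style phases: build an inverted index, then sum
    per-term partial products for every pair of documents sharing a term."""
    # Phase 1 (Indexing): term -> [(doc_id, tf weight)]
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.split()).items():
            index[term].append((doc_id, tf))
    # Phase 2 (Pairwise Similarity): emit a partial score per posting pair,
    # then the "reduce" sums the contributions for each document pair
    scores = defaultdict(int)
    for postings in index.values():
        for (di, wi), (dj, wj) in combinations(sorted(postings), 2):
            scores[(di, dj)] += wi * wj
    return dict(scores)

docs = {"d1": "A A B C", "d2": "B D D", "d3": "A B B E"}
print(full_filtering(docs))
# (d1,d2): 1, (d1,d3): 4, (d2,d3): 2 — as in the slide's figure
```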
14. ### Prefix Filtering Signatures = subset of terms. Global ordering of terms by decreasing frequency. Upper bound on the similarity with the rest of the input: d̂[t] = max_{d ∈ D} d[t]; S(d) = { b(d) ≤ t < |L| : d[t] ≠ 0 }, where b(d) is the largest boundary such that Σ_{0 ≤ t < b(d)} d[t] · d̂[t] < σ
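The boundary computation can be sketched in a few lines of Python (function names and the dict-based document representation are my assumptions for illustration):

```python
def prefix_boundaries(docs, sigma):
    """For each document, find b(d): the largest t such that the prefix
    [0, b(d)) cannot on its own reach similarity sigma against any other
    document, using the per-term maxima d_hat as an upper bound.
    Documents are dicts {term_id: weight}; term ids 0..|L|-1 are assumed
    already assigned in decreasing order of frequency."""
    num_terms = 1 + max(t for d in docs.values() for t in d)
    d_hat = [0.0] * num_terms
    for d in docs.values():
        for t, w in d.items():
            d_hat[t] = max(d_hat[t], w)
    boundaries = {}
    for doc_id, d in docs.items():
        acc, b = 0.0, 0
        # advance while the prefix's upper-bound score stays below sigma
        while b < num_terms and acc + d.get(b, 0.0) * d_hat[b] < sigma:
            acc += d.get(b, 0.0) * d_hat[b]
            b += 1
        boundaries[doc_id] = b  # index only the terms t >= b with d[t] != 0
    return boundaries
```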
16. ### SSJ-2 [Diagram: documents di and dj split at boundaries bi and bj into a pruned prefix and an indexed suffix over the term space 0..|L|] • Indexing & Prefix filtering
17. ### SSJ-2 [Same diagram] • Indexing & Prefix filtering • Need to retrieve the pruned part
18. ### SSJ-2 [Same diagram] • Indexing & Prefix filtering • Need to retrieve the pruned part • Actually, retrieve the whole documents
19. ### SSJ-2 [Same diagram] • Indexing & Prefix filtering • Need to retrieve the pruned part • Actually, retrieve the whole documents • 2 remote (DFS) I/Os per pair
20. ### SSJ-2R [Diagram: reducer input grouped by key, e.g. ⟨di; ((di, dj), W^A_ij); ((di, dj), W^B_ij); ((di, dk), W^A_ik); …⟩ for key di, and likewise for dj] Remainder file = pruned part of the input. Pre-load the remainder file in memory, no further disk I/O. Shuffle the input together with the partial similarity scores
21. ### SSJ-2R Reducer [Reduce input for d0: ⟨(d0, d1), [w1, w2, w3, …]⟩, ⟨(d0, d2), [w1, w2, w3, …]⟩, ⟨(d0, d3), [w1, w2, w3, …]⟩, ⟨(d0, d4), [w1, w2, w3, …]⟩, ⟨(d0, !), [t1, t2, t3, …]⟩] Sort pairs on both IDs, group on the first (Secondary Sort). Only 1 reducer reads d0. The remainder file contains only the useful portion of the other documents (about 10%)
22. ### SSJ-2R Reducer [Same reduce input] Whole document shuffled via MR. Sort pairs on both IDs, group on the first (Secondary Sort). Only 1 reducer reads d0. The remainder file contains only the useful portion of the other documents (about 10%)
23. ### SSJ-2R Reducer [Same reduce input] Whole document shuffled via MR. Remainder file preloaded in memory. Sort pairs on both IDs, group on the first (Secondary Sort). Only 1 reducer reads d0. The remainder file contains only the useful portion of the other documents (about 10%)
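The secondary-sort trick can be illustrated in Python: composite keys are sorted on both components but grouped on the first, and a marker key (the slide's “!”) sorts before any real document ID, so the reducer sees d0's full document before any of its pair scores. The IDs and weights below are made up for illustration:

```python
from itertools import groupby

MARKER = ""  # sorts before any real doc id, like the “!” on the slide

# (composite key, value): pair keys carry partial scores, the marker key
# carries the whole document shuffled through MapReduce
records = [
    (("d0", "d3"), [0.2, 0.1]),
    (("d0", MARKER), "A A B C"),
    (("d0", "d1"), [0.4]),
    (("d1", MARKER), "B D D"),
    (("d1", "d2"), [0.3]),
]

records.sort(key=lambda r: r[0])  # sort on both IDs...

grouped = {}
for first, group in groupby(records, key=lambda r: r[0][0]):  # ...group on the first
    group = list(group)
    doc = group[0][1]  # the marker record arrives first: the full document
    grouped[first] = (doc, [(k[1], v) for k, v in group[1:]])

print(grouped)
```

In Hadoop this is implemented with a custom partitioner and grouping comparator rather than an in-memory sort, but the effect on the reducer's view of the data is the same.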
24. ### SSJ-2R Example [Indexing phase on d1 “A A B C”, d2 “B D D”, d3 “A B B C”: the maps emit the unpruned postings ⟨B, (d1, 1)⟩, ⟨C, (d1, 1)⟩, ⟨D, (d2, 2)⟩, ⟨B, (d3, 2)⟩, ⟨C, (d3, 1)⟩; after the shuffle, the reducers build the lists ⟨B, [(d1, 1), (d3, 2)]⟩, ⟨C, [(d1, 1), (d3, 1)]⟩, ⟨D, [(d2, 2)]⟩]
25. ### SSJ-2R Example [Same indexing phase] The remainder file (d1 “A A”, d2 “B”, d3 “A”) is distributed via the Distributed Cache
26. ### SSJ-2R Example [Similarity phase: the maps emit the partial scores ⟨(d1, d3), 2⟩ (from B) and ⟨(d1, d3), 1⟩ (from C), plus the whole documents ⟨(d1, !), “A A B C”⟩ and ⟨(d3, !), “A B B C”⟩; the reducer for d1 adds the remainder contribution of d3 and outputs ⟨(d1, d3), 5⟩]
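The example can be replayed in a short Python sketch. This mirrors only the slide's toy data: partial scores come from the unpruned index, and the reducer adds full(di) · pruned(dj) from the remainder file. The complete algorithm also accounts for terms pruned in di but indexed in dj, which happen to contribute nothing in this example:

```python
from collections import Counter, defaultdict
from itertools import combinations

docs = {"d1": "A A B C", "d2": "B D D", "d3": "A B B C"}
pruned = {"d1": "A A", "d2": "B", "d3": "A"}  # remainder file

# Index only the unpruned part of each document
indexed = {d: Counter(docs[d].split()) - Counter(pruned[d].split())
           for d in docs}

# Indexing + partial scores from the indexed parts
index = defaultdict(list)
for d, vec in indexed.items():
    for t, w in vec.items():
        index[t].append((d, w))
partial = defaultdict(int)
for postings in index.values():
    for (di, wi), (dj, wj) in combinations(sorted(postings), 2):
        partial[(di, dj)] += wi * wj

# Reducer: add remainder(dj) · full(di), where full(di) was shuffled
# alongside the partial scores and pruned(dj) comes from the remainder file
scores = {}
for (di, dj), s in partial.items():
    full_di = Counter(docs[di].split())
    rem_dj = Counter(pruned[dj].split())
    scores[(di, dj)] = s + sum(full_di[t] * w for t, w in rem_dj.items())
print(scores)  # → {('d1', 'd3'): 5}, matching the slide
```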
27. ### Running time [Plot: running time (seconds) vs. number of documents (15,000 to 65,000) for Elsayed et al., SSJ-2, Vernica et al., and SSJ-2R]
28. ### Map phase [Plots: average map running time (seconds) vs. number of documents for Elsayed et al., SSJ-2, Vernica et al., and SSJ-2R]
29. ### Map phase [Histograms: number of inverted lists vs. inverted list length; maximum list length 6600 for Elsayed et al. vs. 1729 for SSJ-2R]
30. ### Reduce phase [Plot: average reduce running time (seconds) vs. number of documents for Elsayed et al., SSJ-2, Vernica et al., and SSJ-2R]
31. ### Conclusions Effective distributed index pruning on MapReduce. Leverages different communication patterns. Up to 4.5x faster than the state of the art. Scalable, configurable memory footprint