Similarity Self-Join with MapReduce

Presentation of my article at ICDM'10

Gianmarco De Francisci Morales

December 17, 2010

Transcript

  1. Document Similarity Self-Join with MapReduce G. De Francisci Morales, C.
    Lucchese, R. Baraglia, ISTI-CNR Pisa & IMT Lucca, Italy
  2. Similarity Self-Join

  3. Similarity Self-Join

  4. Similarity Self-Join Discover all those pairs of objects whose similarity

    is above a certain threshold
  5. Similarity Self-Join Discover all those pairs of objects whose similarity

    is above a certain threshold Also known as “All Pairs” problem
  6. Similarity Self-Join Discover all those pairs of objects whose similarity
    is above a certain threshold Also known as the “All Pairs” problem Useful for near-duplicate detection, recommender systems, spam detection, etc.
  7. Overview 2 new algorithms: SSJ-2 and SSJ-2R. Exact solution to
    the document SSJ problem. Parallel execution using MapReduce. Draws on state-of-the-art serial algorithms. SSJ-2R is up to 4.5x faster than the best known algorithms.
  8. Assumptions Vector space model, bag of words. Unit-normalized vectors.
    Symmetric similarity function: cos(d_i, d_j) = ( Σ_{0 ≤ t < |L|} d_i[t] · d_j[t] ) / ( ‖d_i‖ ‖d_j‖ ) ≥ σ
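On unit-normalized vectors the cosine reduces to a plain dot product over shared terms; a minimal sketch of the assumption above (toy vectors and function names are mine, not from the talk):

```python
import math

def normalize(vec):
    """Unit-normalize a term-weight vector (dict: term -> weight)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()}

def cosine(di, dj):
    """For unit vectors, cos(di, dj) is the dot product over shared terms."""
    return sum(di[t] * dj[t] for t in di.keys() & dj.keys())

# Toy documents with tf weights: d1 = "A A B C", d3 = "A B B E"
d1 = normalize({"A": 2, "B": 1, "C": 1})
d3 = normalize({"A": 1, "B": 2, "E": 1})
print(round(cosine(d1, d3), 4))  # 0.6667
```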
  9. MapReduce [Dataflow diagram: DFS input splits → Map → Partition & Sort →
    Shuffle → Merge & Group → Reduce → DFS output] Map: [k1, v1] → [k2, v2]. Reduce: {k2: [v2]} → [k3, v3]
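The Map and Reduce signatures above can be simulated in a few lines; a single-process sketch (word count chosen as the illustrative job, names are mine):

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(inputs, mapper, reducer):
    """Toy single-process MapReduce: map, shuffle (partition & sort,
    merge & group), then reduce -- mirroring Map: [k1, v1] -> [k2, v2]
    and Reduce: {k2: [v2]} -> [k3, v3]."""
    intermediate = [kv for record in inputs for kv in mapper(*record)]
    intermediate.sort(key=itemgetter(0))          # shuffle: sort by key
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reducer(key, [v for _, v in group]))
    return output

# Word count as the canonical example
docs = [("d1", "A A B C"), ("d2", "B D D")]
wc_map = lambda doc_id, text: [(w, 1) for w in text.split()]
wc_reduce = lambda word, counts: [(word, sum(counts))]
print(run_mapreduce(docs, wc_map, wc_reduce))
# [('A', 2), ('B', 2), ('C', 1), ('D', 2)]
```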
  10. Filtering approach Generate “signatures” for documents. Group candidates
    by signature: only documents that share a signature may be part of the solution. Compute similarities in each group.
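The filtering step above can be sketched as a bucketing pass, assuming each document already comes with a set of signatures (with full filtering, every term is a signature; function names are mine):

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(signatures):
    """Bucket documents by signature: only pairs sharing at least one
    signature can clear the threshold, so only those become candidates."""
    buckets = defaultdict(set)
    for doc_id, sigs in signatures.items():
        for s in sigs:
            buckets[s].add(doc_id)
    pairs = set()
    for docs in buckets.values():
        pairs.update(combinations(sorted(docs), 2))
    return pairs

# With full filtering every term acts as a signature
sigs = {"d1": {"A", "B", "C"}, "d2": {"B", "D"}, "d3": {"A", "B", "E"}}
print(sorted(candidate_pairs(sigs)))
# [('d1', 'd2'), ('d1', 'd3'), ('d2', 'd3')]
```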
  11. “Full Filtering” [Figure: computing pairwise similarity of a toy
    collection of 3 documents, d1 = “A A B C”, d2 = “B D D”, d3 = “A B B E”, with a simple term weighting scheme (w_{t,d} = tf_{t,d}). Indexing job: map emits ⟨term, (doc, weight)⟩, e.g. ⟨A, (d1, 2)⟩; reduce groups them into posting lists, e.g. ⟨A, [(d1, 2), (d3, 1)]⟩. Pairwise Similarity job: map emits a partial score per document pair in each posting list, e.g. ⟨(d1, d3), 2⟩; reduce sums the contributions into the final scores ⟨(d1, d2), 1⟩, ⟨(d1, d3), 4⟩, ⟨(d2, d3), 2⟩.]
  12. “Full Filtering” [Same toy-collection figure as slide 11] Zipfian
    distribution of terms
  13. “Full Filtering” [Same toy-collection figure as slide 11] Zipfian
    distribution of terms Computes low score similarities
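The two full-filtering jobs on the toy collection can be simulated sequentially; a sketch assuming tf weighting as in the figure (the real algorithm runs both phases as MapReduce jobs; names are mine):

```python
from collections import defaultdict

def indexing_job(docs):
    """Job 1 (Indexing): map emits (term, (doc_id, weight)) with
    w_{t,d} = tf_{t,d}; reduce groups them into posting lists."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        tf = defaultdict(int)
        for term in text.split():
            tf[term] += 1
        for term, w in tf.items():
            index[term].append((doc_id, w))
    return index

def similarity_job(index):
    """Job 2 (Pairwise Similarity): map emits ((di, dj), wi * wj) for
    every pair in each posting list; reduce sums the partial scores."""
    scores = defaultdict(int)
    for postings in index.values():
        for i in range(len(postings)):
            for j in range(i + 1, len(postings)):
                (di, wi), (dj, wj) = postings[i], postings[j]
                scores[tuple(sorted((di, dj)))] += wi * wj
    return dict(scores)

docs = {"d1": "A A B C", "d2": "B D D", "d3": "A B B E"}
scores = similarity_job(indexing_job(docs))
print(sorted(scores.items()))
# [(('d1', 'd2'), 1), (('d1', 'd3'), 4), (('d2', 'd3'), 2)]
```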
  14. Prefix Filtering Signatures = subset of terms. Global ordering of
    terms by decreasing frequency. Upper bound on similarity with the rest of the input: d̂[t] = max_{d ∈ D} d[t]. S(d) = { b(d) ≤ t < |L| | d[t] ≠ 0 }, where b(d) is the largest boundary such that Σ_{0 ≤ t < b(d)} d[t] · d̂[t] < σ
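A sketch of the boundary computation under the definitions above, on unnormalized toy tf weights (in the paper the vectors are unit-normalized and σ is a cosine threshold; the example threshold and names are mine):

```python
def prefix_boundary(d, d_hat, order, sigma):
    """b(d): the largest boundary such that the pruned prefix alone
    cannot reach sigma: sum_{t < b(d)} d[t] * d_hat[t] < sigma."""
    acc, b = 0.0, 0
    for t in order:                  # global ordering, most frequent first
        contrib = d.get(t, 0.0) * d_hat.get(t, 0.0)
        if acc + contrib >= sigma:
            break
        acc += contrib
        b += 1
    return b

def signature(d, order, sigma, d_hat):
    """S(d): the non-zero terms of d at or after the boundary b(d)."""
    b = prefix_boundary(d, d_hat, order, sigma)
    return {t for t in order[b:] if d.get(t, 0.0) != 0.0}

# Toy tf weights: d1 = "A A B C", d2 = "B D D", d3 = "A B B E"
docs = {"d1": {"A": 2, "B": 1, "C": 1},
        "d2": {"B": 1, "D": 2},
        "d3": {"A": 1, "B": 2, "E": 1}}
order = ["B", "A", "C", "D", "E"]    # decreasing document frequency
d_hat = {t: max(d.get(t, 0) for d in docs.values()) for t in order}
print(sorted(signature(docs["d1"], order, 3.0, d_hat)))  # ['A', 'C']
```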
  15. SSJ-2 [Figure: documents d_i and d_j over the term space [0, |L|),
    each split at its boundary (b_i, b_j) into a pruned prefix and an indexed suffix]
  16. SSJ-2 [Same pruned/indexed figure] • Indexing & Prefix filtering
  17. SSJ-2 [Same pruned/indexed figure] • Indexing & Prefix filtering • Need
    to retrieve pruned part
  18. SSJ-2 [Same pruned/indexed figure] • Indexing & Prefix filtering • Need
    to retrieve pruned part • Actually, retrieve the whole documents
  19. SSJ-2 [Same pruned/indexed figure] • Indexing & Prefix filtering • Need
    to retrieve pruned part • Actually, retrieve the whole documents • 2 remote (DFS) I/Os per pair
  20. SSJ-2R [Same pruned/indexed figure as slide 15] Shuffle the input
    together with the partial similarity scores, grouped by key: ⟨d_i ; ((d_i, d_j), W^A_ij) ; ((d_i, d_j), W^B_ij) ; ((d_i, d_k), W^A_ik) ; ...⟩ grouped by key d_i, ⟨d_j ; ((d_j, d_k), W^A_jk) ; ((d_j, d_k), W^B_jk) ; ((d_j, d_l), W^A_jl) ; ...⟩ grouped by key d_j. Remainder file = pruned part of the input. Pre-load the remainder file in memory: no further disk I/O.
  21. SSJ-2R Reducer Reduce input: ⟨(d0, d1), [w1, w2, w3 ...]⟩ ⟨(d0, d2),
    [w1, w2, w3 ...]⟩ ⟨(d0, d3), [w1, w2, w3 ...]⟩ ⟨(d0, d4), [w1, w2, w3 ...]⟩ ⟨(d0, !), [t1, t2, t3 ...]⟩. Sort pairs on both IDs, group on the first (secondary sort). Only 1 reducer reads d0. The remainder file contains only the useful portion of the other documents (about 10%).
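The secondary sort can be imitated with an ordinary sort on the composite key followed by grouping on the first component; a sketch where the '!' tag marking the whole document sorts before any document ID (in Hadoop this is done with a custom partitioner and grouping comparator, not a plain sort; names are mine):

```python
from itertools import groupby
from operator import itemgetter

def ssj2r_reduce_groups(records):
    """Sort on the composite key (first id, second id) but group only on
    the first id, so one reducer call sees d0's own content (tagged '!',
    which sorts before any doc id) followed by all its partial scores."""
    records.sort(key=itemgetter(0, 1))
    for di, group in groupby(records, key=itemgetter(0)):
        yield di, [(second, payload) for _, second, payload in group]

records = [
    ("d1", "d3", 2),           # partial score from one indexed term
    ("d1", "!", "A A B C"),    # the whole document, shuffled via MR
    ("d1", "d3", 1),           # partial score from another indexed term
]
for di, group in ssj2r_reduce_groups(records):
    print(di, group)
# d1 [('!', 'A A B C'), ('d3', 2), ('d3', 1)]
```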
  22. SSJ-2R Reducer [Same reduce input as slide 21] Whole document shuffled
    via MR. Sort pairs on both IDs, group on the first (secondary sort). Only 1 reducer reads d0. The remainder file contains only the useful portion of the other documents (about 10%).
  23. SSJ-2R Reducer [Same reduce input as slide 21] Whole document shuffled
    via MR. Remainder file preloaded in memory. Sort pairs on both IDs, group on the first (secondary sort). Only 1 reducer reads d0. The remainder file contains only the useful portion of the other documents (about 10%).
  24. SSJ-2R Example Indexing: map d1 “A A B C” → ⟨B, (d1, 1)⟩ ⟨C, (d1, 1)⟩;
    map d2 “B D D” → ⟨D, (d2, 2)⟩; map d3 “A B B C” → ⟨B, (d3, 2)⟩ ⟨C, (d3, 1)⟩; shuffle; reduce → ⟨B, [(d1, 1), (d3, 2)]⟩ ⟨C, [(d1, 1), (d3, 1)]⟩ ⟨D, [(d2, 2)]⟩
  25. SSJ-2R Example [Same indexing as slide 24] Remainder file (the pruned
    parts), shipped via the Distributed Cache: d1 “A A”, d2 “B”, d3 “A”
  26. SSJ-2R Example [Same indexing as slide 24, remainder file in the
    Distributed Cache] Similarity: map emits the partial scores ⟨(d1, d3), 2⟩ (from B) and ⟨(d1, d3), 1⟩ (from C), plus the whole documents ⟨(d1, !), “A A B C”⟩ and ⟨(d3, !), “A B B C”⟩; shuffle; reduce combines them with the remainder file → ⟨(d1, d3), 5⟩
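The reduce step of the example can be checked by hand: the indexed partial scores for (d1, d3) sum to 3, and dotting d1 with d3's pruned prefix from the remainder file adds the missing 2. A sketch (function names are mine):

```python
from collections import Counter

def final_score(doc, partial_scores, remainder):
    """SSJ-2R reduce sketch: sum the indexed partial scores for each
    (d0, dj), then dot d0 with dj's pruned prefix from the in-memory
    remainder file to recover the missing contribution."""
    d0 = Counter(doc.split())
    results = {}
    for dj, scores in partial_scores.items():
        prefix = Counter(remainder.get(dj, "").split())
        missing = sum(d0[t] * w for t, w in prefix.items())
        results[dj] = sum(scores) + missing
    return results

# Toy from the slides: d1 = "A A B C", d3 = "A B B C"
remainder = {"d1": "A A", "d2": "B", "d3": "A"}
print(final_score("A A B C", {"d3": [2, 1]}, remainder))  # {'d3': 5}
```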
  27. Running time [Plot: total running time (seconds) vs. number of
    documents, 15,000 to 65,000, for Elsayed et al., SSJ-2, Vernica et al., and SSJ-2R]
  28. Map phase [Plots: average map running time (seconds) vs. number of
    documents for Elsayed et al., SSJ-2, Vernica et al., and SSJ-2R]
  29. Map phase [Plots: distribution of inverted list lengths; Elsayed et
    al. reaches a maximum list length of 6600, SSJ-2R only 1729]
  30. Reduce phase [Plots: average reduce running time (seconds) vs. number
    of documents for Elsayed et al., SSJ-2, Vernica et al., and SSJ-2R]
  31. Conclusions Effective distributed index pruning on MapReduce.
    Leverages different communication patterns. Up to 4.5x faster than the state-of-the-art. Scalable, configurable memory footprint.
  32. Thanks