
# Similarity Self-Join with MapReduce

Presentation of my article at ICDM'10

## Gianmarco De Francisci Morales

December 17, 2010

## Transcript

1. Document Similarity Self-Join with MapReduce
G. De Francisci Morales, C. Lucchese, R. Baraglia
ISTI-CNR Pisa & IMT Lucca, Italy

6. Similarity Self-Join
Discover all pairs of objects whose similarity is above a given threshold
Also known as the “All Pairs” problem
Useful for near-duplicate detection, recommender systems, spam detection, etc.

7. Overview
2 new algorithms: SSJ-2 and SSJ-2R
Exact solution to the document SSJ problem
Parallel execution using MapReduce
Draw from state-of-the-art serial algorithms
SSJ-2R is up to 4.5x faster than the best known algorithms

8. Assumptions
Vector space model, bag of words
Unit-normalized vectors
Symmetric similarity function
cos(d_i, d_j) = ( ∑_{0 ≤ t < |L|} d_i[t] · d_j[t] ) / ( ‖d_i‖ ‖d_j‖ ) ≥ σ
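This cosine similarity over unit-normalized bag-of-words vectors can be sketched in Python (an illustrative sketch with plain term-frequency weights, not the paper's code; `vectorize` and `cosine` are hypothetical helper names):

```python
from collections import Counter
from math import sqrt

def vectorize(text):
    """Bag-of-words vector with tf weights, unit-normalized."""
    tf = Counter(text.split())
    norm = sqrt(sum(w * w for w in tf.values()))
    return {t: w / norm for t, w in tf.items()}

def cosine(di, dj):
    """With unit vectors, cosine similarity is just the dot product."""
    return sum(w * dj.get(t, 0.0) for t, w in di.items())

sigma = 0.5  # similarity threshold
d1, d3 = vectorize("A A B C"), vectorize("A B B E")
print(cosine(d1, d3) >= sigma)  # prints True (similarity is 4/6)
```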

9. MapReduce
[Diagram: inputs read from the DFS → Map tasks → shuffle (Partition & Sort, Merge & Group) → Reduce tasks → outputs written to the DFS]
Map : [k1, v1] → [k2, v2]
Reduce : {k2 : [v2]} → [k3, v3]

10. Filtering approach
Generate “signatures” for documents
Group candidates by signature
Only documents that share a signature may be part of the solution
Compute similarities in each group
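The candidate-generation step can be sketched as follows (hypothetical helper names; with "full filtering" every distinct term of a document acts as a signature):

```python
from itertools import combinations

def candidate_pairs(docs, signatures):
    """Group documents by signature; only pairs that share at least
    one signature can be part of the solution."""
    groups = {}
    for doc_id, text in docs.items():
        for sig in signatures(text):
            groups.setdefault(sig, set()).add(doc_id)
    pairs = set()
    for members in groups.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs

docs = {"d1": "A A B C", "d2": "B D D", "d3": "A B B E"}
# Full filtering: every distinct term is a signature
print(candidate_pairs(docs, signatures=lambda t: set(t.split())))
```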

13. “Full Filtering”
[Figure: pairwise similarity of a toy collection of 3 documents ("A A B C", "B D D", "A B B E") computed with two MapReduce jobs, Indexing and Pairwise Similarity; term weighting w_{t,d} = tf_{t,d}; the reducer sums the individual term contributions for each pair into the final similarity score]
Signature = terms
Build inverted index
Compute similarity
Zipfian distribution of terms
Computes low-score similarities

14. Prefix Filtering
Signatures = subset of terms
Global ordering of terms by decreasing frequency
Upper bound on similarity with the rest of the input
d̂[t] = max_{d ∈ D} d[t]
S(d) = { b(d) ≤ t < |L| : d[t] ≠ 0 }
b(d) = max { b : ∑_{0 ≤ t < b} d[t] · d̂[t] < σ }
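The boundary b(d) and signature S(d) can be computed with a single scan; a sketch assuming terms are already in the global order and d̂ holds the per-term maximum weights (function names are illustrative):

```python
def boundary(d, d_hat, sigma):
    """Largest b with sum_{0 <= t < b} d[t] * d_hat[t] < sigma.
    The prefix [0, b) can be pruned from the index: even against
    the maximum weights it cannot reach the threshold alone."""
    acc, b = 0.0, 0
    for t in range(len(d)):
        if acc + d[t] * d_hat[t] >= sigma:
            break
        acc += d[t] * d_hat[t]
        b = t + 1
    return b

def signature(d, b):
    """S(d): non-zero terms from the boundary onward."""
    return [t for t in range(b, len(d)) if d[t] != 0]

d = [0.8, 0.0, 0.5, 0.3]      # weights in global term order
d_hat = [0.9, 0.7, 0.6, 0.5]  # max weight of each term in the collection
b = boundary(d, d_hat, 0.8)
print(b, signature(d, b))
```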

19. SSJ-2
[Diagram: each document d_i split at its boundary b_i into a pruned part [0, b_i) and an indexed part [b_i, |L|)]
• Indexing & Prefix filtering
• Need to retrieve pruned part
• Actually, retrieve the whole documents
• 2 remote (DFS) I/Os per pair

20. SSJ-2R
d_i ; ((d_i, d_j), W_ij^A) ; ((d_i, d_j), W_ij^B) ; ((d_i, d_k), W_ik^A) ; ... (group by key d_i)
d_j ; ((d_j, d_k), W_jk^A) ; ((d_j, d_k), W_jk^B) ; ((d_j, d_l), W_jl^A) ; ... (group by key d_j)
Remainder file = pruned part of the input
Kept in memory, no further disk I/O
Shuffle the input together with the partial similarity scores
[Diagram: pruned and indexed parts of d_i and d_j]

23. SSJ-2R Reducer
Reduce input:
(d0, d1), [w1, w2, w3, ...]
(d0, d3), [w1, w2, w3, ...]
(d0, d4), [w1, w2, w3, ...]
(d0, d2), [w1, w2, w3, ...]
(d0, !), [t1, t2, t3, ...]
Sort pairs on both IDs, group on first (Secondary Sort)
Whole document shuffled via MR
Remainder file contains only the useful portion of the other documents
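The reducer's use of the remainder file might be sketched like this (an in-memory imitation with hypothetical names, not the paper's Hadoop code; plain tf weights keep the arithmetic readable). It assumes the secondary sort delivers the full document, marked "!", before the partial scores:

```python
from collections import Counter

def ssj2r_reduce(di, records, remainder):
    """records: [(other_id, value)], with ('!', full_text) first;
    other values are partial scores from the indexed parts.
    remainder: doc_id -> pruned part, from the distributed cache."""
    full_di = None
    partial = {}
    for other, value in records:
        if other == "!":
            full_di = Counter(value.split())
        else:
            partial[other] = partial.get(other, 0) + value
    results = []
    for dj, score in partial.items():
        # Recover the pruned part of d_j from the remainder file
        pruned_dj = Counter(remainder[dj].split())
        score += sum(w * pruned_dj.get(t, 0) for t, w in full_di.items())
        results.append(((di, dj), score))
    return results

remainder = {"d1": "A A", "d2": "B", "d3": "A"}
out = ssj2r_reduce("d1", [("!", "A A B C"), ("d3", 2), ("d3", 1)], remainder)
print(out)  # [(('d1', 'd3'), 5)]
```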

26. SSJ-2R Example
Indexing:
map: d1 "A A B C" → <B,(d1,1)>, <C,(d1,1)>
map: d2 "B D D" → <D,(d2,2)>
map: d3 "A B B C" → <B,(d3,2)>, <C,(d3,1)>
shuffle
reduce: <B,[(d1,1),(d3,2)]>, <C,[(d1,1),(d3,1)]>, <D,[(d2,2)]>
Remainder File (Distributed Cache): d1 "A A", d2 "B", d3 "A"
Similarity:
map: <(d1,d3), 2>, <(d1,d3), 1>, <(d1,!),"A A B C">, <(d3,!),"A B B C">
shuffle
reduce: <(d1,d3), 5>
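The final score in the example checks out against a direct computation (tf weights, as in the toy collection): the indexed terms B and C contribute the partial scores, and the pruned term A is recovered through the remainder file.

```python
from collections import Counter

d1 = Counter("A A B C".split())  # A:2, B:1, C:1
d3 = Counter("A B B C".split())  # A:1, B:2, C:1

indexed = d1["B"] * d3["B"] + d1["C"] * d3["C"]  # from the inverted index
pruned = d1["A"] * d3["A"]                       # from the remainder file
print(indexed, pruned, indexed + pruned)  # 3 2 5
```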

27. Running time
[Plot: total running time (seconds) vs. number of documents (15,000 to 65,000), comparing Elsayed et al., SSJ-2, Vernica et al., and SSJ-2R]

28. Map phase
[Plot: average map running time (seconds) vs. number of documents, comparing Elsayed et al., SSJ-2, Vernica et al., and SSJ-2R]

29. Map phase
[Plot: distribution of inverted list lengths (log-log); maximum list length is 6600 for Elsayed et al. vs. 1729 for SSJ-2R]

30. Reduce phase
[Plot: average reduce running time (seconds) vs. number of documents, comparing Elsayed et al., SSJ-2, Vernica et al., and SSJ-2R]

31. Conclusions
Effective distributed index pruning on MapReduce
Leverage different communication patterns
Up to 4.5x faster than state-of-the-art
Scalable, conﬁgurable memory footprint

32. Thanks