[Figure 2 shows two MapReduce jobs applied to the toy documents d1 = “A A B C”, d2 = “B D D”, and d3 = “A B B E”.
Indexing (map, shuffle, reduce): the mappers emit one (term, (document, tf)) tuple per distinct term in each document, and the reducers assemble the postings A: [(d1,2), (d3,1)]; B: [(d1,1), (d2,1), (d3,2)]; C: [(d1,1)]; D: [(d2,2)]; E: [(d3,1)].
Pairwise Similarity (map, shuffle, reduce): the mappers emit ((d1,d3),2) from the postings of A and ((d1,d2),1), ((d1,d3),2), ((d2,d3),2) from the postings of B; after the shuffle groups these by document pair, the reducers output ((d1,d2),1), ((d1,d3),4), ((d2,d3),2).]

Figure 2: Computing pairwise similarity of a toy collection of 3 documents. A simple term weighting scheme (w_{t,d} = tf_{t,d}) is chosen for illustration.
ual term contributions to the final inner product. The MapReduce runtime sorts the tuples, and the reducer then sums all the individual score contributions for a pair to generate the final similarity score.
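As a rough illustration of the two phases, the following single-process Python sketch reproduces the toy example of Figure 2 under the term weighting w_{t,d} = tf_{t,d}. It is not Hadoop code and not the implementation evaluated in this paper: the function names `build_index` and `pairwise_similarity` are hypothetical, and ordinary dictionaries stand in for the shuffle and sort that the MapReduce runtime would perform.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_index(docs):
    """Indexing phase: emit (term, (doc_id, tf)) and group the tuples by term."""
    postings = defaultdict(list)
    for doc_id, text in sorted(docs.items()):
        # "Map": one tuple per distinct term, weighted by its term frequency.
        for term, tf in Counter(text.split()).items():
            # "Shuffle/reduce": collect all tuples that share a term.
            postings[term].append((doc_id, tf))
    return dict(postings)


def pairwise_similarity(postings):
    """Similarity phase: sum per-term weight products for every document pair."""
    scores = defaultdict(int)
    for plist in postings.values():
        # "Map": each pair of postings for a term contributes one partial score.
        for (di, wi), (dj, wj) in combinations(sorted(plist), 2):
            # "Reduce": accumulate the contributions that share a pair key.
            scores[(di, dj)] += wi * wj
    return dict(scores)


# Toy collection from Figure 2.
docs = {"d1": "A A B C", "d2": "B D D", "d3": "A B B E"}
postings = build_index(docs)
similarity = pairwise_similarity(postings)

for pair, score in sorted(similarity.items()):
    print(pair, score)
# ('d1', 'd2') 1
# ('d1', 'd3') 4
# ('d2', 'd3') 2
```

Terms that occur in only one document (C, D, and E here) generate no pairs, so they contribute nothing to any score. The pair (d1, d3) receives 2 from term A and 2 from term B, which gives the final score of 4.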
4 Experimental Evaluation
In our experiments, we used Hadoop version 0.16.0,³ an open-source Java implementation
[Figure: plot with y-axis “Computation Time (minutes)” (0 to 140) and an x-axis running from 0 to 100; linear fit with R^2 = 0.997. Other labels recovered from the figure: “Signature = terms”, “Build inverted index”, “Compute similarity”.]