the document SSJ problem
Parallel execution using MapReduce
Draw from state-of-the-art serial algorithms
SSJ-2R is 4.5x faster than best known algorithms
terms by decreasing frequency
Upper bound on similarity with the rest of the input:
$\hat{d}[t] = \max_{d \in D} d[t]$
$S(d) = \{\, b(d) \le t < |L| \;\mid\; d[t] \neq 0 \,\}$
$b(d)$: $\sum_{0 \le t < b(d)} d[t] \cdot \hat{d}[t] < \sigma$
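A minimal Python sketch of the pruning bound may help make the definitions concrete. It assumes dense weight vectors whose term positions are already sorted by decreasing frequency, and it takes the signature to be the nonzero terms at or after the boundary. The function names `max_weights`, `boundary`, and `signature` are illustrative and do not come from the slides.

```python
from typing import List


def max_weights(docs: List[List[float]]) -> List[float]:
    """Per-term maximum weight over the collection: d_hat[t] = max over d of d[t]."""
    return [max(column) for column in zip(*docs)]


def boundary(d: List[float], d_hat: List[float], sigma: float) -> int:
    """Largest b such that the sum over 0 <= t < b of d[t] * d_hat[t] stays below sigma."""
    acc = 0.0
    for t, (weight, upper) in enumerate(zip(d, d_hat)):
        acc += weight * upper
        if acc >= sigma:
            # Including term t would reach the threshold, so the prefix ends at t.
            return t
    # The bound never reaches sigma: the whole document can be pruned.
    return len(d)


def signature(d: List[float], b: int) -> List[int]:
    """Term positions at or after the boundary that carry a nonzero weight."""
    return [t for t in range(b, len(d)) if d[t] != 0]


if __name__ == "__main__":
    docs = [
        [0.5, 0.5, 0.0, 0.7],
        [0.0, 0.6, 0.8, 0.0],
        [0.1, 0.0, 0.9, 0.4],
    ]
    d_hat = max_weights(docs)
    for d in docs:
        b = boundary(d, d_hat, sigma=0.5)
        print(b, signature(d, b))
```

Terms before `b(d)` cannot by themselves push any pair over the threshold, which is why only the positions returned by `signature` need to be indexed in this sketch.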
$(d_i, d_j), W^B_{ij}$; $(d_i, d_k), W^A_{ik}$; ... (group by key $d_i$)
$d_j$; $(d_j, d_k), W^A_{jk}$; $(d_j, d_k), W^B_{jk}$; $(d_j, d_l), W^A_{jl}$; ... (group by key $d_j$)
Remainder file = pruned part of the input
Pre-load remainder file in memory, no further disk I/O
Shuffle the input together with the partial similarity scores
[Figure: documents $d_i$ and $d_j$ laid out over term positions $0$ to $|L|$, each split at its boundary ($b_i$, $b_j$) into a pruned part and an indexed part]
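The reduce step described here could be sketched roughly as follows in Python. This is an illustration under stated assumptions, not the authors' implementation: documents are sparse dicts from term position to weight, the remainder file is modelled as an in-memory dict holding the pruned part of each document, and each pair is assumed to be keyed on the document whose boundary is not larger than the other's, so that the full key document plus the other document's pruned part suffice to finish the score. The name `reduce_similarity` and its parameters are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

SparseDoc = Dict[int, float]  # term position -> weight


def reduce_similarity(
    d_i: SparseDoc,
    partials: Iterable[Tuple[int, float]],
    remainder: Dict[int, SparseDoc],
    sigma: float,
) -> Dict[int, float]:
    """Finish the similarity scores for all pairs grouped under one key document.

    d_i       -- the full key document, shuffled together with the scores
    partials  -- (other document id, partial weight) values from the index
    remainder -- pre-loaded pruned part of every document (assumed layout)
    sigma     -- similarity threshold
    """
    # Sum the partial contributions that arrived separately for the same pair.
    scores: Dict[int, float] = defaultdict(float)
    for j, weight in partials:
        scores[j] += weight

    results: Dict[int, float] = {}
    for j, score in scores.items():
        # Add the contribution of the pruned terms of d_j, read from memory.
        pruned_j = remainder.get(j, {})
        score += sum(w * d_i.get(t, 0.0) for t, w in pruned_j.items())
        if score >= sigma:
            results[j] = score
    return results


if __name__ == "__main__":
    d_i = {0: 0.5, 2: 0.5, 3: 0.5}
    partials = [(7, 0.25), (7, 0.25), (9, 0.125)]
    remainder = {7: {0: 0.5}, 9: {1: 1.0}}
    print(reduce_similarity(d_i, partials, remainder, sigma=0.5))
```

Keeping the remainder in memory means each reducer call only performs dictionary lookups, which matches the "no further disk I/O" point on the slide.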
[Figure: two panels, Elsayed et al. (max=6600) and SSJ-2R (max=1729); axes: Inverted list length and Number of lists, with ticks at 1, 10, 100, 1000]