
Similarity Self-Join with MapReduce


Presentation of my article at ICDM'10

Gianmarco De Francisci Morales

December 17, 2010

Transcript

  1. Document Similarity
    Self-Join with MapReduce
    G. De Francisci Morales, C. Lucchese, R. Baraglia
    ISTI-CNR Pisa && IMT Lucca, Italy


  2. Similarity Self-Join


  3. Similarity Self-Join


  4. Similarity Self-Join
    Discover all those pairs of objects whose similarity is above a certain threshold

  5. Similarity Self-Join
    Discover all those pairs of objects whose similarity is above a certain threshold
    Also known as the “All Pairs” problem

  6. Similarity Self-Join
    Discover all those pairs of objects whose similarity is above a certain threshold
    Also known as the “All Pairs” problem
    Useful for near-duplicate detection, recommender systems, spam detection, etc.

  7. Overview
    2 new algorithms: SSJ-2 and SSJ-2R
    Exact solution to the document SSJ problem
    Parallel execution using MapReduce
    Draw from state-of-the-art serial algorithms
    SSJ-2R is up to 4.5x faster than the best known algorithms

  8. Assumptions
    Vector space model, bag of words
    Unit-normalized vectors
    Symmetric similarity function

    \cos(d_i, d_j) = \frac{\sum_{0 \le t < |L|} d_i[t] \cdot d_j[t]}{\|d_i\| \, \|d_j\|} \ge \sigma
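
A minimal sketch of this similarity test (helper names are illustrative; sparse term-weight dictionaries, unit-normalized as the slide assumes):

```python
import math

def normalize(d):
    """Scale a bag-of-words vector to unit length."""
    norm = math.sqrt(sum(w * w for w in d.values()))
    return {t: w / norm for t, w in d.items()}

def cosine(di, dj):
    """Dot product of two sparse vectors; with unit-normalized
    vectors this is exactly their cosine similarity."""
    if len(dj) < len(di):                  # iterate over the shorter one
        di, dj = dj, di
    return sum(w * dj.get(t, 0.0) for t, w in di.items())

d1 = normalize({"A": 2, "B": 1, "C": 1})   # "A A B C"
d3 = normalize({"A": 1, "B": 2, "E": 1})   # "A B B E"
sigma = 0.5
print(cosine(d1, d3) >= sigma)             # True: 4/6 >= 0.5
```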


  9. MapReduce
    [Diagram: inputs on the DFS → map tasks → shuffle (partition & sort, merge & group) → reduce tasks → outputs on the DFS]

    \text{Map} : [k_1, v_1] \to [k_2, v_2]
    \text{Reduce} : \{k_2 : [v_2]\} \to [k_3, v_3]


  10. Filtering approach
    Generate “signatures” for documents
    Group candidates by signature
    Only documents that share a signature may be part of the solution
    Compute similarities in each group
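
A generic sketch of this pattern (illustrative; `signatures` and `similarity` are parameters, e.g. the term-based signatures of the next slides):

```python
from collections import defaultdict
from itertools import combinations

def filtering_join(docs, signatures, similarity, sigma):
    """Group documents by signature, verify pairs only within groups."""
    groups = defaultdict(set)
    for doc_id, d in docs.items():
        for s in signatures(d):            # e.g. terms, prefixes, hashes
            groups[s].add(doc_id)
    seen, results = set(), {}
    for ids in groups.values():
        for pair in combinations(sorted(ids), 2):
            if pair in seen:               # a pair may share many signatures
                continue
            seen.add(pair)
            score = similarity(docs[pair[0]], docs[pair[1]])
            if score >= sigma:
                results[pair] = score
    return results
```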


  11. “Full Filtering”
    [Figure 2 of the paper: computing pairwise similarity of a toy collection of 3 documents, d1 = “A A B C”, d2 = “B D D”, d3 = “A B B E”, with the simple term weighting scheme w_{t,d} = tf_{t,d}. The Indexing job maps each document to (term, (doc, tf)) pairs and reduces them into an inverted index; the Pairwise Similarity job emits the individual term contributions to each inner product, and its reducer sums them into the final similarity scores: ((d1,d2), 1), ((d1,d3), 4), ((d2,d3), 2).]
    [Plot: computation time (minutes) with signature = terms; series “Build inverted index” and “Compute similarity”; fitted with R² = 0.997.]

  12. “Full Filtering”
    (same figure and plot as slide 11)
    Zipfian distribution of terms

  13. “Full Filtering”
    (same figure and plot as slide 11)
    Zipfian distribution of terms
    Computes low-score similarities
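
The two jobs in the figure, sketched with the toy map_reduce runner from the MapReduce slide (inlined so the snippet stands alone; weights are raw term frequencies, as in the figure):

```python
from collections import Counter
from itertools import combinations, groupby
from operator import itemgetter

def map_reduce(inputs, mapper, reducer):   # toy runner, as sketched earlier
    mapped = sorted((kv for record in inputs for kv in mapper(*record)),
                    key=itemgetter(0))
    return [out for key, group in groupby(mapped, key=itemgetter(0))
            for out in reducer(key, [v for _, v in group])]

# Job 1 (Indexing): term -> posting list of (doc_id, weight)
def index_mapper(doc_id, text):
    for term, tf in Counter(text.split()).items():
        yield term, (doc_id, tf)

def index_reducer(term, postings):
    yield term, postings

# Job 2 (Pairwise Similarity): one partial product per shared term,
# summed per pair by the reducer
def pairs_mapper(term, postings):
    for (di, wi), (dj, wj) in combinations(sorted(postings), 2):
        yield (di, dj), wi * wj

def pairs_reducer(pair, contributions):
    yield pair, sum(contributions)

docs = [("d1", "A A B C"), ("d2", "B D D"), ("d3", "A B B E")]
index = map_reduce(docs, index_mapper, index_reducer)
print(map_reduce(index, pairs_mapper, pairs_reducer))
# [(('d1', 'd2'), 1), (('d1', 'd3'), 4), (('d2', 'd3'), 2)]
```

Every co-occurring term generates shuffle traffic, which is why the Zipfian head terms dominate the cost and most of the computed pairs end up below σ.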


  14. Prefix Filtering
    Signatures = subset of terms
    Global ordering of terms by decreasing frequency
    Upper bound on similarity with the rest of the input

    \hat{d}[t] = \max_{d \in D} d[t]
    S(d) = \{\, b(d) \le t < |L| \mid d[t] \ne 0 \,\}
    b(d) = \max \Big\{\, b \;\Big|\; \sum_{0 \le t < b} d[t] \cdot \hat{d}[t] < \sigma \,\Big\}
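
A sketch of these definitions (illustrative; vectors are dense lists indexed by term rank in the global ordering, and b(d) is read as the longest prefix whose best-case contribution stays below σ):

```python
def max_weights(docs):
    """hat_d[t]: the largest weight of term t anywhere in the collection."""
    hat = [0.0] * len(docs[0])
    for d in docs:
        hat = [max(h, w) for h, w in zip(hat, d)]
    return hat

def boundary(d, hat, sigma):
    """b(d): longest prefix that cannot alone reach sigma against any
    document, so it can be pruned from the index without losing results."""
    acc, b = 0.0, 0
    while b < len(d) and acc + d[b] * hat[b] < sigma:
        acc += d[b] * hat[b]
        b += 1
    return b

def signature(d, b):
    """S(d): the non-zero terms from the boundary onward."""
    return {t for t in range(b, len(d)) if d[t] != 0}
```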


  15. SSJ-2
    [Diagram: each document d_i is split at its boundary b_i into a pruned part [0, b_i) and an indexed part [b_i, |L|)]

  16. SSJ-2
    (same pruned/indexed diagram)
    • Indexing & Prefix filtering

  17. SSJ-2
    (same pruned/indexed diagram)
    • Indexing & Prefix filtering
    • Need to retrieve pruned part

  18. SSJ-2
    (same pruned/indexed diagram)
    • Indexing & Prefix filtering
    • Need to retrieve pruned part
    • Actually, retrieve the whole documents

  19. SSJ-2
    (same pruned/indexed diagram)
    • Indexing & Prefix filtering
    • Need to retrieve pruned part
    • Actually, retrieve the whole documents
    • 2 remote (DFS) I/Os per pair

  20. SSJ-2R
    d_i: ⟨(d_i, d_j), W^A_{ij}⟩; ⟨(d_i, d_j), W^B_{ij}⟩; ⟨(d_i, d_k), W^A_{ik}⟩; … → group by key d_i
    d_j: ⟨(d_j, d_k), W^A_{jk}⟩; ⟨(d_j, d_k), W^B_{jk}⟩; ⟨(d_j, d_l), W^A_{jl}⟩; … → group by key d_j
    Remainder file = pruned part of the input
    Pre-load remainder file in memory, no further disk I/O
    Shuffle the input together with the partial similarity scores
    (same pruned/indexed diagram as SSJ-2)
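
A sketch of the map side of this phase (illustrative names; partial weights come from the pruned inverted index, and every document is also shipped once under a special key so it reaches the same reducer as its scores):

```python
from itertools import combinations

def similarity_mapper(term, postings):
    """Over the pruned index: one partial product per candidate pair,
    keyed so all pairs led by the same document meet at one reducer."""
    for (di, wi), (dj, wj) in combinations(sorted(postings), 2):
        yield (di, dj), wi * wj

def document_mapper(doc_id, weights):
    """Over the input: shuffle the whole document once, under a key the
    secondary sort places before all of its (doc_id, d_j) score keys."""
    yield (doc_id, "!"), weights
```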


  21. SSJ-2R Reducer
    Reduce input (after the shuffle):
    ⟨(d0, !), [t1, t2, t3, …]⟩
    ⟨(d0, d1), [w1, w2, w3, …]⟩
    ⟨(d0, d2), [w1, w2, w3, …]⟩
    ⟨(d0, d3), [w1, w2, w3, …]⟩
    ⟨(d0, d4), [w1, w2, w3, …]⟩
    Sort pairs on both IDs, group on first (secondary sort)
    Only 1 reducer reads d0
    Remainder file contains only the useful portion of the other documents (about 10%)

  22. SSJ-2R Reducer
    Whole document shuffled via MR
    (same reduce input and notes as above)

  23. SSJ-2R Reducer
    Whole document shuffled via MR
    Remainder file preloaded in memory
    (same reduce input and notes as above)
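
A sketch of this reducer (illustrative, single-machine; it assumes the runtime delivers groups in secondary-sort order, so the "!" entry carrying d0 itself arrives first, and that the remainder file is already in memory):

```python
def ssj2r_reducer(groups, remainder, sigma):
    """groups: [((d0, '!'), [dict of d0's term weights]),
                ((d0, dj), [partial scores from the pruned index]), ...]
    remainder: {doc_id: pruned term weights}, preloaded from the
    distributed cache."""
    d0_weights = None
    for (d0, second), values in groups:
        if second == "!":                   # the shuffled document itself
            d0_weights = values[0]
            continue
        partial = sum(values)               # indexed-part contributions
        pruned = remainder.get(second, {})  # pruned part of the other doc
        score = partial + sum(w * pruned.get(t, 0.0)
                              for t, w in d0_weights.items())
        if score >= sigma:
            yield (d0, second), score

# Toy run matching the example on the next slides (tf weights):
remainder = {"d1": {"A": 2}, "d2": {"B": 1}, "d3": {"A": 1}}
groups = [(("d1", "!"), [{"A": 2, "B": 1, "C": 1}]),   # d1 = "A A B C"
          (("d1", "d3"), [2, 1])]                      # partial scores
print(list(ssj2r_reducer(groups, remainder, sigma=0)))
# [(('d1', 'd3'), 5)]
```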


  24. SSJ-2R Example
    [Diagram, Indexing phase: mappers read d1 = “A A B C”, d2 = “B D D”, d3 = “A B B C”, emit (term, (doc, tf)) pairs; the shuffle groups them by term and the reducers write the pruned inverted lists]

  25. SSJ-2R Example
    [Same indexing diagram; the pruned parts form the remainder file (d1: “A A”, d2: “B”, d3: “A”), which is loaded into the distributed cache]

  26. SSJ-2R Example
    [Full pipeline: after Indexing, the Similarity phase shuffles the partial scores ⟨(d1,d3), 2⟩ and ⟨(d1,d3), 1⟩ together with the whole documents ⟨(d1,!), “A A B C”⟩ and ⟨(d3,!), “A B B C”⟩; with the remainder file (d1: “A A”, d2: “B”, d3: “A”) in the distributed cache, the reducer outputs ⟨(d1,d3), 5⟩]

  27. Running time
    [Plot: total running time in seconds (0–60000) vs. number of documents (15000–65000) for Elsayed et al., SSJ-2, Vernica et al., and SSJ-2R]

  28. Map phase
    [Plots (a), (c): average map running time in seconds vs. number of documents (15000–65000) for Elsayed et al., SSJ-2, Vernica et al., and SSJ-2R]

  29. Map phase
    [Log-log plots: number of inverted lists vs. inverted list length; maximum list length 6600 for Elsayed et al. vs. 1729 for SSJ-2R]

  30. Reduce phase
    [Plots (b), (d): average reduce running time in seconds (up to 18000) vs. number of documents (15000–65000) for Elsayed et al., SSJ-2, Vernica et al., and SSJ-2R]

  31. Conclusions
    Effective distributed index pruning on MapReduce
    Leverage different communication patterns
    Up to 4.5x faster than state-of-the-art
    Scalable, configurable memory footprint


  32. Thanks
