ICDM2020

Presentation slides for "Dynamic Similarity Search on Integer Sketches" at ICDM 2020

Shunsuke Kanda

November 18, 2020

Transcript

  1. Dynamic Similarity Search on Integer Sketches
    Shunsuke Kanda and Yasuo Tabei
    RIKEN Center for Advanced Intelligence Project, Japan
    20th IEEE International Conference on Data Mining (ICDM)
    November 17–20, 2020, Sorrento, Italy (virtual)

  2. Similarity-preserving Hashing
    • Core technique for fast similarity searches
    ▹ Randomly map vectors in a metric space into sketches in the Hamming space
    [Figure: hashing maps a vector in a metric space (e.g. Cosine or Jaccard; high dimension, ~10^3 to ~10^6) to a sketch in the Hamming space (discrete strings; low dimension, 32 or 64)]
    • Many similarity search problems can be solved as Hamming distance problems

  3. Issues on Modern Similarity Search
    • Generality
    ▹ Traditional hashing algorithms produce binary sketches
    ▹ Modern hashing algorithms produce integer sketches
    – Such as b-bit minhash [Li+, WWW10], 0-bit CWS [Li, KDD15], and GCWS [Li, KDD17]
    ▹ But, most search methods are designed for binary sketches
    • Dynamics
    ▹ Modern real-world datasets are dynamic (i.e., updated over time)
    – Such as Web pages and time series data
    ▹ But, most search methods are limited to static datasets or inefficient for dynamic datasets
    Our challenge
    Develop an efficient dynamic search method for both binary sketches (e.g., 001101001001) and integer sketches (e.g., 236301499231)

  4. Problem Statement
    • Sketch x of length m is an m-dimensional vector of non-negative integers
    • We have a dataset X = {x_1, x_2, …, x_n}, which is a dynamic set of n sketches
    • Given sketch y and Hamming radius r as a query, we want to quickly find similar sketches such that {x_i : H(x_i, y) ≤ r}
    ▹ H(∙, ∙) is the Hamming distance (i.e., the number of dimensions in which two sketches differ)
    Example: dataset X (n = 4) and query y = 111021 with r = 1
      x_1 = 111020   H(x_1, y) = 1 ≤ r   similar
      x_2 = 001020   H(x_2, y) = 3 > r
      x_3 = 032021   H(x_3, y) = 3 > r
      x_4 = 113021   H(x_4, y) = 1 ≤ r   similar
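    As an illustration of the problem statement, here is a minimal Python sketch of the query answered by a plain linear scan. It is not the authors' C++ implementation, and the function names `hamming` and `range_search` are hypothetical. Sketches are written as digit strings, as on the slide.

```python
def hamming(x, y):
    """Number of positions at which two equal-length sketches differ."""
    if len(x) != len(y):
        raise ValueError("sketches must have the same length")
    return sum(a != b for a, b in zip(x, y))


def range_search(dataset, query, radius):
    """Return the ids of all sketches within the Hamming radius (linear scan)."""
    return [name for name, sketch in dataset.items()
            if hamming(sketch, query) <= radius]


# Example data taken from the slide.
X = {"x1": "111020", "x2": "001020", "x3": "032021", "x4": "113021"}
print(range_search(X, "111021", 1))  # ['x1', 'x4']
```

    The scan touches every sketch, so its cost grows linearly with n; this is the baseline that index structures try to beat.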

  5. State-of-the-art Similarity Search Methods
    • Most methods use hash tables, but they are inefficient for dynamic datasets
    • Recently, Eghbali et al. [IEEE TPAMI19] addressed this issue by using a search tree,
    but it is not applicable to integer sketches
    • We propose new methods DyFTs for dynamic datasets of integer sketches, which
    leverage a trie data structure

  6. Trie and Similarity Search
    • Trie is a labeled tree built by merging
    common prefixes of sketches
    • The downgoing path from the root to a
    leaf represents the associated sketch
    [Figure: trie built from sketches x_1 to x_8 with edge labels in {0, 1, 2, 3}; for example, the path from the root to the leaf of x_3 spells x_3 = 032021]

  7. Trie and Similarity Search
    • Similarity search is performed by traversing nodes while counting the number of errors against the query sketch
    • If the number of errors exceeds the radius, we stop descending and skip all descendants of that node
    • The time complexity is O(m^(r+2)), which does not depend on the dataset size n
    Example: searching for y = 111020 with r = 1 finds x_1 and x_7 as similar
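    The pruned traversal described on this slide can be sketched in a few lines of Python. This is only an illustration under simple assumptions (a full trie with one node per character, dictionary-based children); the class `TrieNode` and the functions `trie_insert` and `trie_search` are hypothetical names, not part of the authors' code.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # edge label (one character) -> TrieNode
        self.ids = []       # ids of sketches ending at this node


def trie_insert(root, name, sketch):
    """Add one sketch to the trie, sharing common prefixes."""
    node = root
    for ch in sketch:
        node = node.children.setdefault(ch, TrieNode())
    node.ids.append(name)


def trie_search(node, query, radius, depth=0, errors=0):
    """Collect ids whose root-to-leaf path has at most `radius` mismatches."""
    if depth == len(query):
        return list(node.ids)
    found = []
    for ch, child in node.children.items():
        e = errors + (ch != query[depth])
        if e <= radius:  # otherwise prune the whole subtree
            found.extend(trie_search(child, query, radius, depth + 1, e))
    return found


# The eight sketches used in the slides.
X = {"x1": "111020", "x2": "001020", "x3": "032021", "x4": "113021",
     "x5": "333110", "x6": "330110", "x7": "311020", "x8": "030120"}
root = TrieNode()
for name, sketch in X.items():
    trie_insert(root, name, sketch)

print(sorted(trie_search(root, "111020", 1)))  # ['x1', 'x7']
```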

  8. Dynamic Filter Trie (DyFT)
    • Trie-based similarity search for binary and integer sketches
    ▹ Store only some of trie nodes around the root for memory efficiency
    ▹ Exploit the trie search algorithm for filtering out dissimilar sketches
    Example: search for y = 111020 with r = 1
    Database X:
      x_1 = 111020
      x_2 = 001020
      x_3 = 032021
      x_4 = 113021
      x_5 = 333110
      x_6 = 330110
      x_7 = 311020
      x_8 = 030120
    [Figure: filter trie storing only the nodes near the root; the trie search yields the candidate solutions x_1, x_4, and x_7, which are then verified]
    Verification:
      H(x_1, y) = 0 ≤ r   similar
      H(x_4, y) = 2 > r   dissimilar
      H(x_7, y) = 1 ≤ r   similar
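    The filter-then-verify idea can be illustrated with a small Python sketch. The truncated trie below is written by hand and its exact shape is an assumption read off the slide's figure (inner nodes are dicts keyed by character, leaves are posting lists of ids); `candidates` and `verify` are hypothetical helper names.

```python
def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))


def candidates(node, query, radius, depth=0, errors=0):
    """Traverse the truncated trie and return ids in every reachable leaf."""
    if isinstance(node, list):  # leaf: posting list of sketch ids
        return list(node)
    found = []
    for ch, child in node.items():
        e = errors + (ch != query[depth])
        if e <= radius:  # prune subtrees that already exceed the radius
            found.extend(candidates(child, query, radius, depth + 1, e))
    return found


def verify(database, ids, query, radius):
    """Keep only the candidates whose full Hamming distance is within the radius."""
    return [i for i in ids if hamming(database[i], query) <= radius]


X = {"x1": "111020", "x2": "001020", "x3": "032021", "x4": "113021",
     "x5": "333110", "x6": "330110", "x7": "311020", "x8": "030120"}

# Assumed shape of the filter trie: only nodes near the root are stored.
filter_trie = {
    "0": {"0": ["x2"], "3": ["x3", "x8"]},
    "1": ["x1", "x4"],
    "3": {"1": ["x7"], "3": ["x5", "x6"]},
}

cands = candidates(filter_trie, "111020", 1)
print(cands)                          # ['x1', 'x4', 'x7']
print(verify(X, cands, "111020", 1))  # ['x1', 'x7']
```

    Because the trie only stores short prefixes, it cannot rule out x4 by itself; the verification step computes the full distance and discards it.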

  9. Update Procedure
    • Visit the deepest reachable leaf node v using new sketch x_i
    • Append x_i to the posting list L_v of leaf node v
    • If the length of L_v (or |L_v|) exceeds threshold τ, split L_v and create new leaf nodes
    Example: insert x_9 = 030110
      1. Append: the path 0, 3 reaches leaf v, whose posting list L_v holds x_3 and x_8; x_9 is appended to L_v
      2. Split (if |L_v| > τ): v gets child edges 0 and 2, and the sketches in L_v are distributed to the new leaves
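    A minimal Python sketch of this update procedure follows. It assumes a single fixed threshold `tau` for illustration, creates a fresh leaf when an inner node has no child for the next character, and splits only one level at a time; the names `Node` and `dyft_insert` are hypothetical and do not come from the authors' implementation.

```python
class Node:
    def __init__(self):
        self.children = None  # None means this node is a leaf
        self.posting = []     # ids of sketches stored at this leaf


def dyft_insert(root, database, name, tau):
    """Insert sketch `name`; split the reached leaf if its list exceeds tau."""
    sketch = database[name]
    node, depth = root, 0
    # Descend to the deepest reachable leaf (creating a new leaf if needed).
    while node.children is not None:
        node = node.children.setdefault(sketch[depth], Node())
        depth += 1
    node.posting.append(name)
    # Split: distribute the posting list by the next character.
    if len(node.posting) > tau and depth < len(sketch):
        node.children = {}
        for other in node.posting:
            child = node.children.setdefault(database[other][depth], Node())
            child.posting.append(other)
        node.posting = []


X = {"x1": "111020", "x2": "001020", "x3": "032021", "x4": "113021",
     "x5": "333110", "x6": "330110", "x7": "311020", "x8": "030120",
     "x9": "030110"}
root = Node()
for name in X:  # insertion order x1, x2, ..., x9
    dyft_insert(root, X, name, tau=2)

leaf_v = root.children["0"].children["3"]  # the leaf that received x9
print(leaf_v.children["0"].posting)  # ['x8', 'x9']
print(leaf_v.children["2"].posting)  # ['x3']
```

    With `tau=2` and this insertion order, inserting x9 splits the leaf under the path 0, 3 into children labeled 0 and 2, as in the slide's example.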

  10. What is a Reasonable Splitting Threshold τ?
    • A reasonable value of τ can be determined depending on the configuration of the dataset and given parameters
    • But it is impossible to search for such a reasonable value for dynamic datasets
    ▹ If τ is large: large verification time
    ▹ If τ is small: large traversal time
    [Figure annotations: "The best values are reversed!", "One order of magnitude!"]

  11. Optimal Threshold τ*
    • First, construct a search cost model assuming that sketches are uniformly
    distributed in the Hamming space
    • Then, determine an optimal threshold τ* minimizing the search cost
    ▹ Keep the leaf if |L_v| ≤ τ*; split it if |L_v| > τ*
    • The search cost for node v is defined as (reach probability) × (computational cost)
    • τ* selects whichever case (keep or split) maintains the smaller cost

  12. Reach Probability for Node v at Level ℓ
    • Consider the probability of reaching node v within r errors using a random sketch x ∈ {0, 1, …, σ − 1}^ℓ from a uniform distribution
    • The number of all possible sketches of length ℓ is σ^ℓ
    • The number of all possible sketches that can reach a node at level ℓ within r errors is
      $N(\ell) = \sum_{k=0}^{r} \binom{\ell}{k} (\sigma - 1)^k$
    • The reach probability is therefore
      $P(\ell) = \frac{N(\ell)}{\sigma^{\ell}}$
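    The two quantities on this slide are straightforward to evaluate. The following Python sketch uses exact rational arithmetic; the function names `n_reach` and `p_reach` are hypothetical, and the parameter names mirror the slide's symbols (level ℓ, radius r, alphabet size σ).

```python
from fractions import Fraction
from math import comb


def n_reach(level, r, sigma):
    """N(l): number of length-l sketches within r errors of a fixed path."""
    return sum(comb(level, k) * (sigma - 1) ** k for k in range(r + 1))


def p_reach(level, r, sigma):
    """P(l) = N(l) / sigma^l for a uniformly random sketch."""
    return Fraction(n_reach(level, r, sigma), sigma ** level)


print(p_reach(1, 1, 4))  # 1
print(p_reach(2, 1, 4))  # 7/16
print(p_reach(3, 1, 4))  # 5/32
```

    For levels ℓ ≤ r the probability is exactly 1, since every sketch of that length is within r errors; it then decreases as the level grows.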

  13. Search Cost of Inner Node v at Level ℓ
    • If v is an inner node, we try to descend to the children of node v
    ▹ Case 1: v is reached with fewer than r errors; check all the children in O(σ) time
    ▹ Case 2: v is reached with r errors; directly look up the child in O(1) time
    • The number of all possible sketches that reach v with exactly r errors is
      $N_2(\ell) = \binom{\ell}{r} (\sigma - 1)^r$
    • The search cost of inner node v:
      $C_{in}(v) = P(\ell) \times \left\{ \left(1 - \frac{N_2(\ell)}{N(\ell)}\right) \times \sigma + \frac{N_2(\ell)}{N(\ell)} \times 1 \right\}$
      where the first term corresponds to Case 1 and the second to Case 2

  14. Search Cost of Leaf Node v at Level ℓ
    • If v is a leaf node, we verify all sketches associated with L_v
    ▹ Hamming distance can be computed by performing ⌈log2 σ⌉ sets of bitwise-XOR and -popcount operations [Zhang+, SSDBM13]
    [Figure: given a sketch y, Ham(x_1, y), Ham(x_4, y), Ham(x_6, y), and Ham(x_7, y) are computed for the sketches in L_v]
    • The search cost of leaf node v:
      $C_{leaf}(v) = P(\ell) \times \{ |L_v| \times \lceil \log_2 \sigma \rceil \}$
      where the braced term is the verification time
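    To make the bitwise verification concrete, here is a Python sketch of one possible bit-plane layout. It is an assumption for illustration and may differ from the layout in [Zhang+, SSDBM13] or in the authors' code: each sketch is stored as ⌈log2 σ⌉ integers, where plane j holds bit j of every character. The names `to_planes` and `hamming_planes` are hypothetical.

```python
def to_planes(sketch, bits):
    """Pack a sketch (list of ints < 2**bits) into `bits` bit planes."""
    planes = [0] * bits
    for i, c in enumerate(sketch):
        for j in range(bits):
            if (c >> j) & 1:
                planes[j] |= 1 << i  # bit i of plane j = bit j of character i
    return planes


def hamming_planes(px, py):
    """Hamming distance from bit planes: XOR each plane, OR, then popcount."""
    diff = 0
    for a, b in zip(px, py):
        diff |= a ^ b  # bit i is set if character i differs in any plane
    return bin(diff).count("1")


sigma = 4
bits = (sigma - 1).bit_length()  # equals ceil(log2(sigma)) for sigma >= 2
y = to_planes([1, 1, 1, 0, 2, 0], bits)
x4 = to_planes([1, 1, 3, 0, 2, 1], bits)
x7 = to_planes([3, 1, 1, 0, 2, 0], bits)
print(hamming_planes(x4, y))  # 2
print(hamming_planes(x7, y))  # 1
```

    The work per comparison is proportional to the number of planes, which matches the ⌈log2 σ⌉ factor in the leaf cost.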

  15. Optimal Threshold τ*
    • Given leaf v at level ℓ, we compare the search costs in the two cases:
    ▹ If not splitting leaf v, then the search cost is $C_{leaf}(v)$
    ▹ If splitting leaf v into children u_1, u_2, …, u_k, then the new search cost is $C_{in}(v) + \sum_{i} C_{leaf}(u_i)$
    • We can derive the condition under which splitting maintains the smaller cost:
      $|L_v| > \frac{P(\ell)}{P(\ell) - P(\ell + 1)} \times \frac{\left(1 - \frac{N_2(\ell)}{N(\ell)}\right) \times \sigma + \frac{N_2(\ell)}{N(\ell)}}{\lceil \log_2 \sigma \rceil} =: \tau^*$
    ▹ τ* is precomputable :)
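    Since τ* depends only on the level ℓ, the radius r, and the alphabet size σ, it can be tabulated ahead of time. The Python sketch below evaluates the threshold formula from this slide with exact fractions. The function names are hypothetical, and returning infinity when P(ℓ) = P(ℓ + 1) (which happens for ℓ < r, where the model sees no benefit in splitting) is an assumption of this sketch rather than something stated on the slides.

```python
from fractions import Fraction
from math import comb, inf


def n_reach(level, r, sigma):
    """N(l): sketches of length l within r errors of a fixed path."""
    return sum(comb(level, k) * (sigma - 1) ** k for k in range(r + 1))


def p_reach(level, r, sigma):
    """P(l) = N(l) / sigma^l."""
    return Fraction(n_reach(level, r, sigma), sigma ** level)


def optimal_threshold(level, r, sigma):
    """tau* for a leaf at `level`, following the slide's formula."""
    p_here = p_reach(level, r, sigma)
    p_next = p_reach(level + 1, r, sigma)
    if p_here == p_next:  # assumption: treat a zero denominator as "never split"
        return inf
    # Fraction of reaching sketches that arrive with exactly r errors.
    n2_ratio = Fraction(comb(level, r) * (sigma - 1) ** r,
                        n_reach(level, r, sigma))
    inner_cost = (1 - n2_ratio) * sigma + n2_ratio
    bits = (sigma - 1).bit_length()  # ceil(log2(sigma)) for sigma >= 2
    return p_here / (p_here - p_next) * inner_cost / bits


print(optimal_threshold(1, 1, 2))  # 6
print(optimal_threshold(2, 1, 4))  # 10/9
```

    A table of `optimal_threshold(level, r, sigma)` for all levels up to the sketch length m could then be consulted at insertion time in place of a hand-tuned τ.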

  16. Summary of DyFT
    • Trie-based similarity search method for integer sketches
    ▹ Store only some of trie nodes around the root for
    memory efficiency
    ▹ Exploit the trie search algorithm for filtering out
    dissimilar sketches
    ▹ Grow the data structure while maintaining fast
    searches using optimal threshold τ*
    • Other techniques (not presented in this slide)
    ▹ Switching trie search and linear search based on the cost model
    ▹ Weighting factor for practical computational costs
    ▹ Efficient node implementation MART (modified adaptive radix tree)

  17. Experimental Setup
    • Dataset
    ▹ 216 million compound-protein pairs
    – Each pair is represented as a 3.6 million dimensional binary fingerprint
    ▹ We converted the fingerprints into binary and integer sketches using Li’s minhash
    algorithm for Jaccard similarity [Li+, WWW10]
    ▹ We constructed an index by inserting sketches in random order
    • Queryset
    ▹ We randomly sampled 1000 sketches from the dataset
    • Code
    ▹ We implemented all data structures using C++17
    ▹ Source code is available at https://github.com/kampersanda/dyft

  18. Analysis for Optimal Threshold τ*
    [Figure: search time (ms/query) for binary sketches and integer sketches]
    • The optimal threshold τ* is the fastest in most cases
    • The search times with fixed thresholds τ = 1, 10, 100 are reversed according to the dataset size n

  19. Comparison with State-of-the-Arts
    [Figure: search time (ms/query), update time (sec), and memory usage (GB) compared with state-of-the-art methods, for binary sketches and integer sketches; annotations: "1600x faster", "13x smaller", "Nearly equal", "Always faster", "Always smaller", "Practically fast"]
