AIP Open Seminar #6

Shunsuke Kanda
February 04, 2021

Presentation slide on AIP Open Seminar #6

Transcript

  1. Dynamic Similarity Search on Integer Sketches
    Shunsuke Kanda and Yasuo Tabei
    Succinct Information Processing Unit
    (Presented at ICDM20)

  2. Contents
    1. Background & Contribution
    2. Preliminary: Trie-based Similarity Search
    3. New method: Dynamic Filter Trie (DyFT)
    i. Node reduction technique
    ii. Node implementation technique
    4. Experiments

  3. Contents
    1. Background & Contribution
    2. Preliminary: Trie-based Similarity Search
    3. New method: Dynamic Filter Trie (DyFT)
    i. Node reduction technique
    ii. Node implementation technique
    4. Experiments

  4. Similarity-preserving Hashing
    • Core technique for fast similarity searches
    ▹ Randomly map vectors in a metric space into sketches in the Hamming space
    [Figure: hashing maps a high-dimensional vector in a metric space (e.g., Cosine or Jaccard; ~10^3 to ~10^6 dimensions) to a low-dimensional sketch in the Hamming space (32 or 64 dimensions)]
    Many similarity search problems can be solved as Hamming distance problems on discrete strings!
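To make the mapping from a metric space to sketches concrete, here is a minimal sketch of b-bit minhash for Jaccard similarity over sets of non-negative integers. The function name `make_bbit_minhash`, the affine hash family, and the parameters `m`, `b`, and `seed` are illustrative assumptions, not the hashing code used in the talk.

```python
import random


def make_bbit_minhash(m, b, seed=0):
    """Return a function mapping a non-empty set of ints to m b-bit integers.

    Assumption: each dimension uses a random affine hash modulo a prime,
    which is a common textbook choice rather than the authors' setup.
    """
    prime = (1 << 61) - 1  # Mersenne prime used as the hash modulus
    rng = random.Random(seed)
    # One (a, c) pair per sketch dimension: h(e) = (a * e + c) mod prime.
    params = [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(m)]
    mask = (1 << b) - 1  # keeps only the lowest b bits of each minimum

    def sketch(items):
        if not items:
            raise ValueError("items must be non-empty")
        return [min((a * e + c) % prime for e in items) & mask for a, c in params]

    return sketch


to_sketch = make_bbit_minhash(m=32, b=2, seed=1)
example_sketch = to_sketch({3, 17, 42, 99})  # 32 integers, each in {0, 1, 2, 3}
```

With `b = 1` this yields a binary sketch; with `b = 2` it yields integers below 4, and sets with a higher Jaccard similarity tend to agree in more dimensions.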

  5. Modern Issues on Similarity Search
    • Generality
    ▹ Traditional hashing algorithms produce binary sketches (e.g., 001101001001)
    ▹ Modern hashing algorithms produce integer sketches (e.g., 236301499231)
    – Such as b-bit minhash [Li+, WWW10], 0-bit CWS [Li, KDD15], and GCWS [Li, KDD17]
    ▹ But most search methods are designed for binary sketches
    • Dynamics
    ▹ Modern real-world datasets are dynamic (i.e., updated over time)
    – Such as Web pages and time series data
    ▹ But most search methods are limited to static datasets or inefficient for dynamic datasets
    [Figure: a new sketch x is inserted into the dataset]
    Our challenge: develop an efficient dynamic search method for both binary and integer sketches

  6. Problem Statement
    • Sketch x of length m is an m-dimensional vector of non-negative integers
    • We have a dataset X = {x_1, x_2, …, x_n}, which is a dynamic set of n sketches
    • Given sketch y and Hamming radius r as a query, we want to quickly find similar
    sketches such that {x_i : H(x_i, y) ≤ r}
    ▹ H(∙, ∙) is the Hamming distance (i.e., the number of dimensions in which two sketches differ)
    Query: y = 111021, r = 1
    Dataset X (n = 4):
    x1 = 111020, H(x1, y) = 1 ≤ r (similar)
    x2 = 001020, H(x2, y) = 3
    x3 = 032021, H(x3, y) = 3
    x4 = 113021, H(x4, y) = 1 ≤ r (similar)
    [Figure annotations: integer-valued sketches relate to Generality; the changing dataset size n relates to Dynamics]
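The problem statement can be checked directly with a brute-force scan, which is the baseline that an index tries to beat. This is a minimal sketch; the names `hamming` and `linear_search` are illustrative, and the data is the four-sketch example from this slide.

```python
def hamming(x, y):
    """Number of dimensions in which two equal-length sketches differ."""
    if len(x) != len(y):
        raise ValueError("sketches must have the same length")
    return sum(a != b for a, b in zip(x, y))


def linear_search(dataset, y, r):
    """Return the indices i such that H(dataset[i], y) <= r (O(n * m) time)."""
    return [i for i, x in enumerate(dataset) if hamming(x, y) <= r]


# Sketches are sequences of non-negative integers; digits are used for brevity.
X = [[int(c) for c in s] for s in ("111020", "001020", "032021", "113021")]
y = [int(c) for c in "111021"]
similar = linear_search(X, y, r=1)  # indices of x1 and x4
```

The scan touches every sketch, so its cost grows linearly with n.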

  7. State-of-the-art Similarity Search Methods
    • Most methods use hash tables, but they are inefficient for dynamic datasets
    • Recently, Eghbali et al. [IEEE TPAMI19] addressed this issue by using a search tree,
    but it is not applicable to integer sketches
    • We propose new methods DyFTs for dynamic datasets of integer sketches, which
    leverage a trie data structure

  8. Contents
    1. Background & Contribution
    2. Preliminary: Trie-based Similarity Search
    3. New method: Dynamic Filter Trie (DyFT)
    i. Node reduction technique
    ii. Node implementation technique
    4. Experiments

  9. Trie-based Similarity Search
    • Trie is a labeled tree built by merging
    common prefixes of sketches
    • The downgoing path from the root to a
    leaf represents the associated sketch
    [Figure: trie built from sketches x1 to x8; for example, the path from the root to the leaf of x3 spells x3 = 032021]

  10. Trie-based Similarity Search
    • Trie is a labeled tree built by merging
    common prefixes of sketches
    • The downgoing path from the root to a
    leaf represents the associated sketch
    [Figure: the trie over sketches x1 to x8, used for the search example below]
    • Similarity search is performed by
    traversing nodes while counting #errors
    to the query sketch
    • If #errors exceeds the radius, we stop
    descending and prune all descendants
    • The time complexity is O(m^(r+2)),
    which does not depend on the dataset size n
    Example: searching for y = 111020 with r = 1 finds that x1 and x7 are similar
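The traversal with error counting and pruning can be sketched as follows. This is an illustrative pointer-based trie using Python dictionaries, assuming all sketches have the same length m; the names `TrieNode`, `trie_insert`, and `trie_search` are hypothetical, and the eight sketches are the example database shown later in the deck.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # label -> TrieNode
        self.ids = []       # ids of the sketches that end at this node


def trie_insert(root, sketch, sketch_id):
    """Follow (and create) one node per label, merging common prefixes."""
    node = root
    for label in sketch:
        node = node.children.setdefault(label, TrieNode())
    node.ids.append(sketch_id)


def trie_search(root, query, r):
    """Return ids of sketches within Hamming distance r of the query."""
    results = []
    stack = [(root, 0, 0)]  # (node, depth, errors counted so far)
    while stack:
        node, depth, errors = stack.pop()
        if depth == len(query):
            results.extend(node.ids)
            continue
        for label, child in node.children.items():
            new_errors = errors + (label != query[depth])
            if new_errors <= r:  # otherwise the whole subtree is pruned
                stack.append((child, depth + 1, new_errors))
    return sorted(results)


database = ["111020", "001020", "032021", "113021",
            "333110", "330110", "311020", "030120"]
root = TrieNode()
for i, s in enumerate(database, start=1):  # ids 1..8 stand for x1..x8
    trie_insert(root, [int(c) for c in s], i)
hits = trie_search(root, [int(c) for c in "111020"], r=1)  # ids of x1 and x7
```

Because a subtree is abandoned as soon as the error count exceeds r, the number of visited nodes is governed by m and r rather than by how many sketches are stored.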

  11. Two Issues on Trie Implementation
    • Scalability
    ▹ A trie for a large database maintains many pointers and consumes a huge amount of memory
    ▹ Reducing redundant nodes is an often-used solution
    ▹ But there is no reduction technique for similarity searches
    • Generality
    ▹ Sketches consist of integers from {0,1,…,σ–1}
    ▹ σ is a given parameter depending on the hashing technique
    – σ ≤ 4 is recommended in MinHash
    – σ ≥ 16 is recommended in CWS
    ▹ But existing trie implementations have been designed for byte sketches, i.e., σ = 256
    Our DyFT is a new similarity search method that solves these issues

  12. Contents
    1. Background & Contribution
    2. Preliminary: Trie-based Similarity Search
    3. New method: Dynamic Filter Trie (DyFT)
    i. Node reduction technique (for the scalability issue)
    ii. Node implementation technique
    4. Experiments

  13. Dynamic Filter Trie (DyFT)
    • Trie-based similarity search for binary and integer sketches
    ▹ Store only some of the trie nodes around the root for memory efficiency
    ▹ Exploit the trie search algorithm for filtering out dissimilar sketches
    Database X:
    x1 = 111020
    x2 = 001020
    x3 = 032021
    x4 = 113021
    x5 = 333110
    x6 = 330110
    x7 = 311020
    x8 = 030120
    [Figure: DyFT keeps only the trie nodes near the root (edge labels 0, 1, 3 at the first level); the leaves hold the sketches x1 to x8]
    Search for y = 111020 with r = 1
    Candidate solutions from the trie are then verified:
    H(x1, y) = 0 ≤ r (similar)
    H(x4, y) = 2 > r (dissimilar)
    H(x7, y) = 1 ≤ r (similar)
  14. Update Procedure
    • Visit the deepest reachable leaf node v using new sketch x_i
    • Append x_i to the posting list L_v of leaf node v
    • If the length of L_v (or |L_v|) exceeds threshold τ, split L_v and create new leaf nodes
    [Figure: inserting x9 = 030110. Append: x9 is added to the posting list L_v = {x3, x8} of the leaf v reached via labels 0 and 3. Split (if |L_v| > τ): v becomes an inner node with new leaf nodes labeled 0 and 2]
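The append-and-split update, together with the filter-then-verify search, can be sketched as below. This is a simplified illustration with a fixed threshold `tau`, not the authors' C++ implementation; the class `FilterTrie`, the one-level split, and the choice to create a fresh leaf when an inner node has no child for a label are assumptions made for the example.

```python
def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))


class Node:
    def __init__(self, depth):
        self.depth = depth
        self.children = None  # None for a leaf, dict label -> Node otherwise
        self.postings = []    # sketch ids stored in a leaf


class FilterTrie:
    def __init__(self, m, tau):
        self.m, self.tau = m, tau
        self.sketches = []    # full sketches, needed for verification
        self.root = Node(0)

    def insert(self, sketch):
        if len(sketch) != self.m:
            raise ValueError("sketch must have length m")
        sid = len(self.sketches)
        self.sketches.append(tuple(sketch))
        node = self.root
        # Descend to the deepest reachable leaf (assumption: a missing
        # child is created as a new empty leaf).
        while node.children is not None:
            node = node.children.setdefault(sketch[node.depth], Node(node.depth + 1))
        node.postings.append(sid)
        # Split one level when the posting list exceeds the threshold.
        if len(node.postings) > self.tau and node.depth < self.m:
            node.children = {}
            for pid in node.postings:
                label = self.sketches[pid][node.depth]
                child = node.children.setdefault(label, Node(node.depth + 1))
                child.postings.append(pid)
            node.postings = []
        return sid

    def search(self, query, r):
        results = []
        stack = [(self.root, 0)]  # (node, errors on the path so far)
        while stack:
            node, errors = stack.pop()
            if node.children is None:
                # Leaf: its postings are candidates and must be verified.
                results.extend(pid for pid in node.postings
                               if hamming(self.sketches[pid], query) <= r)
                continue
            for label, child in node.children.items():
                new_errors = errors + (label != query[node.depth])
                if new_errors <= r:
                    stack.append((child, new_errors))
        return sorted(results)
```

For instance, with `tau = 2`, inserting x3 = 032021, x8 = 030120, and then x9 = 030110 turns the root from a leaf into an inner node on the third insertion.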

  15. What is a Reasonable Splitting Threshold τ?
    • A reasonable value of τ can be determined depending on the configuration of the
    dataset and given parameters of hashing techniques
    • But it is impossible to search for such a reasonable value for dynamic datasets
    ▹ If τ is large: large verification time
    ▹ If τ is small: large traversal time
    [Figure: search times for different values of τ; annotations: "The best values are reversed :(" and "One order of magnitude!"]

  16. Optimal Threshold τ*
    • First, construct a search cost model
    • Then, determine an optimal threshold τ* minimizing the search cost
    [Figure: should leaf v with posting list L_v be kept (if |L_v| ≤ τ*) or split (if |L_v| > τ*)? τ* selects the case that maintains the smaller cost, so the fastest search can always be achieved]

  17. Definition of Search Cost SC(v)
    • The search cost for node v is defined by SC(v) = RP(v) × CC(v)
    ▹ RP(v) is the Reach Probability defined for a random sketch from a uniform distribution
    ▹ CC(v) is the Computational Cost defined for inner and leaf nodes separately
    • Inner node v: given a random sketch, SC_in(v) = RP(v) × CC_in(v), where CC_in(v) is the cost of checking the children of v
    • Leaf node v: given a random sketch, SC_leaf(v) = RP(v) × CC_leaf(v), where CC_leaf(v) is the cost of verifying the sketches in L_v

  18. Optimal Threshold τ*
    • Given leaf v, compare the search costs in the two cases:
    ▹ If keeping leaf v with posting list L_v, the search cost is SC_leaf(v)
    ▹ If splitting leaf v into new leaves u_1, u_2, …, u_k, the new search cost is SC_in(v) + ∑ SC_leaf(u_i)
    • Can derive the condition |L_v| > τ*(r, ℓ, σ) under which the splitting case
    can maintain a smaller search cost; τ* is precomputable :)
    DyFT can grow while maintaining fast similarity searches with few node pointers
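The keep-versus-split comparison can be illustrated numerically. The sketch below is not the cost model from the talk: it assumes that RP(v) at depth ℓ is the probability that a uniformly random sketch has at most r mismatches along the path (a binomial tail with mismatch probability 1 − 1/σ), that CC_in is proportional to the number of children, and that CC_leaf is proportional to the posting-list length. The names `reach_probability` and `should_split` and the unit costs `c_in` and `c_leaf` are hypothetical.

```python
from math import comb


def reach_probability(depth, r, sigma):
    """P(at most r mismatches in `depth` positions), assuming each position
    mismatches independently with probability 1 - 1/sigma."""
    p = 1.0 - 1.0 / sigma
    return sum(comb(depth, k) * p**k * (1.0 - p)**(depth - k)
               for k in range(min(r, depth) + 1))


def should_split(child_list_lens, depth, r, sigma, c_in=1.0, c_leaf=1.0):
    """Compare SC_leaf(v) with SC_in(v) + sum of SC_leaf(u_i).

    child_list_lens holds |L_{u_i}| for the leaves a split would create,
    so |L_v| is their sum. The cost terms are assumptions of this sketch.
    """
    rp_v = reach_probability(depth, r, sigma)
    rp_u = reach_probability(depth + 1, r, sigma)
    keep_cost = rp_v * c_leaf * sum(child_list_lens)
    split_cost = (rp_v * c_in * len(child_list_lens)
                  + sum(rp_u * c_leaf * n for n in child_list_lens))
    return split_cost < keep_cost


# A leaf at depth 3 holding 100 sketches that would split evenly into 4 leaves.
decision = should_split([25, 25, 25, 25], depth=3, r=1, sigma=4)
```

Under these assumptions a long posting list is split because verifying it costs more than checking a few children, while a list holding a single sketch is kept. Since the reach probabilities depend only on r, ℓ, and σ, they can be tabulated once, which is in the spirit of the precomputable threshold mentioned on the slide.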

  19. Contents
    1. Background & Contribution
    2. Preliminary: Trie-based Similarity Search
    3. New method: Dynamic Filter Trie (DyFT)
    i. Node reduction technique
    ii. Node implementation technique (for the generality issue)
    4. Experiments

  20. How to implement DyFT efficiently?
    • Good point :)
    ▹ There are many trie implementations
    • Bad point :(
    ▹ They are designed for byte strings
    ▹ But, sketches consist of general integers
    • Our approach
    ▹ Reconstruct integer sketches into byte ones
    ▹ Represent them using an adaptive radix tree
    (space-efficient trie implementation)
    x = 2 3 6 3 0 1 2 11 2…
    x’ = 0xF2 0xAE 0x53…
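One way to reconstruct integer sketches into byte sketches is to pack several small integers into each byte. The following is a minimal sketch under assumed conventions (a bit width that divides 8, high bits first, zero padding at the end); the functions `bits_per_symbol` and `pack_sketch` are hypothetical, and the bytes they produce are not claimed to match the slide's x' example.

```python
def bits_per_symbol(sigma):
    """Smallest width in {1, 2, 4, 8} that can hold values 0..sigma-1."""
    for bits in (1, 2, 4, 8):
        if sigma <= (1 << bits):
            return bits
    raise ValueError("sigma must be at most 256")


def pack_sketch(sketch, sigma):
    """Pack integers from {0, ..., sigma-1} into bytes, high bits first."""
    bits = bits_per_symbol(sigma)
    per_byte = 8 // bits
    out = bytearray()
    for i in range(0, len(sketch), per_byte):
        group = list(sketch[i:i + per_byte])
        group += [0] * (per_byte - len(group))  # zero-pad the final group
        byte = 0
        for value in group:
            if not 0 <= value < sigma:
                raise ValueError("value out of range for sigma")
            byte = (byte << bits) | value
        out.append(byte)
    return bytes(out)


packed = pack_sketch([2, 3, 6, 3, 0, 1, 2, 11, 2], sigma=16)
```

With σ = 16 each byte holds two integers, so a byte-oriented trie needs half as many levels. A search over packed bytes would still have to count mismatches per original integer rather than per byte to preserve the Hamming distance.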

  21. Adaptive Radix Tree [Leis+, ICDE13]
    • Adaptively select a space-efficient node implementation depending on #children
    • The data structure is modified for node traversal in similarity search
    ▹ For a node with few children, use a list-based data structure
    ▹ For a node with a moderate number of children, use a hybrid data structure of a list and an array
    ▹ For a node with many children, use an array-based data structure
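The three node layouts can be sketched in one class that switches representation as children are added. This is a simplified illustration in the spirit of the adaptive radix tree, not the modified structure used in DyFT; the class `AdaptiveNode` and its capacity limits (16 and 48, borrowed from ART's node sizes) are assumptions for the example, and child values must not be `None`.

```python
class AdaptiveNode:
    LIST_MAX = 16   # at most this many children in the list layout
    INDEX_MAX = 48  # at most this many children in the hybrid layout

    def __init__(self):
        self.kind = "list"
        self.keys = []      # list layout: labels, parallel to children
        self.children = []  # compact child list, or 256 slots in array layout
        self.index = None   # hybrid layout: label -> position in children

    def find(self, label):
        if self.kind == "list":
            for key, child in zip(self.keys, self.children):
                if key == label:
                    return child
            return None
        if self.kind == "indexed":
            pos = self.index[label]
            return self.children[pos] if pos >= 0 else None
        return self.children[label]

    def add(self, label, child):
        if self.find(label) is not None:
            raise KeyError("label already present")
        if self.kind == "list" and len(self.children) == self.LIST_MAX:
            # Grow: list -> hybrid (256-entry index over a compact list).
            self.index = [-1] * 256
            for pos, key in enumerate(self.keys):
                self.index[key] = pos
            self.keys, self.kind = [], "indexed"
        elif self.kind == "indexed" and len(self.children) == self.INDEX_MAX:
            # Grow: hybrid -> plain 256-slot array.
            slots = [None] * 256
            for lab, pos in enumerate(self.index):
                if pos >= 0:
                    slots[lab] = self.children[pos]
            self.children, self.index, self.kind = slots, None, "array"
        if self.kind == "list":
            self.keys.append(label)
            self.children.append(child)
        elif self.kind == "indexed":
            self.index[label] = len(self.children)
            self.children.append(child)
        else:
            self.children[label] = child

    def items(self):
        """Yield (label, child) pairs, as a similarity search visits all children."""
        if self.kind == "list":
            yield from zip(self.keys, self.children)
        elif self.kind == "indexed":
            for label, pos in enumerate(self.index):
                if pos >= 0:
                    yield label, self.children[pos]
        else:
            for label, child in enumerate(self.children):
                if child is not None:
                    yield label, child
```

Exact-match lookups only need `find`, whereas a similarity search must enumerate every child with `items`, which is one reason the node layouts may need to be adapted for this use.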

  22. Summary of Our Method
    • What is an issue in modern similarity search?
    ▹ There is no efficient dynamic data structure for integer sketches
    • What are issues on trie-based similarity search?
    ▹ Scalability: There is no node reduction technique for similarity search
    ▹ Generality: There is no node implementation technique for integer sketches
    We developed DyFT based on a trie data structure
    We constructed a search cost model, defined an optimal threshold, and
    reduced DyFT nodes while maintaining fast similarity searches
    We reconstructed integer sketches into byte sketches to
    leverage an existing trie implementation technique

  23. Contents
    1. Background & Contribution
    2. Preliminary: Trie-based Similarity Search
    3. New method: Dynamic Filter Trie (DyFT)
    i. Node reduction technique
    ii. Node implementation technique
    4. Experiments

  24. Experimental Setup
    • Dataset
    ▹ 216 million compound-protein pairs
    – Each pair is represented as a 3.6 million dimensional binary fingerprint
    ▹ We converted the fingerprints into binary and integer sketches using Li’s minhash
    algorithm for Jaccard similarity [Li+, WWW10]
    ▹ We constructed an index by inserting sketches in random order
    • Queryset
    ▹ We randomly sampled 1000 sketches from the dataset
    • Code
    ▹ We implemented all data structures using C++17
    ▹ Source code is available at https://github.com/kampersanda/dyft
    [Figure: example compounds (Aspirin, Caffeic Acid)]

  25. Analysis for Optimal Threshold τ*
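    [Figure: search time (ms/query) versus dataset size n, in two panels (Binary Sketch, Integer Sketch), comparing fixed thresholds τ = 1, 10, 100 with the optimal threshold τ*]
    • The optimal threshold τ* is the fastest in most cases
    • The search times with fixed thresholds τ = 1, 10, 100 are reversed according to the dataset size n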

  26. Comparison with State-of-the-Art Methods (Binary Sketches)
    DyFT was
    • four orders of magnitude faster on the search time
    • competitive on the update time
    • one order of magnitude smaller on the memory
    [Figure: search time (ms/query), update time (sec), and memory usage (GB); annotations: "1600x faster", "Competitive", "13x smaller"]

  27. Comparison with State-of-the-Art Methods (Integer Sketches)
    DyFT was
    • always faster on the search time
    • competitive on the update time
    • always smaller on the memory
    [Figure: search time (ms/query), update time (sec), and memory usage (GB); annotations: "Always faster", "Competitive", "Always smaller"]

  28. Summary of Our Method
    • What is an issue in modern similarity search?
    ▹ There is no efficient dynamic data structure for integer sketches
    • What are issues on trie-based similarity search?
    ▹ Scalability: There is no node reduction technique for similarity search
    ▹ Generality: There is no node implementation technique for integer sketches
    We developed DyFT based on a trie data structure
    We reconstructed integer sketches into byte sketches to
    leverage an existing trie implementation technique
    We constructed a search cost model, defined an optimal threshold, and
    reduced DyFT nodes while maintaining fast similarity searches
