
# ICDM2020

Presentation slides for "Dynamic Similarity Search on Integer Sketches" at ICDM 2020

## Shunsuke Kanda

November 18, 2020

## Transcript

1. Dynamic Similarity Search on Integer Sketches
Shunsuke Kanda and Yasuo Tabei
RIKEN Center for Advanced Intelligence Project, Japan
20th IEEE International Conference on Data Mining (ICDM)
November 17–20, 2020, Sorrento, Italy (virtual)

2. Similarity-preserving Hashing
• Core technique for fast similarity searches
▹ Randomly map vectors in a metric space into sketches in the Hamming space
Figure: hashing maps vectors in a high-dimensional metric space (e.g., Cosine or Jaccard; ~10^3 to ~10^6 dimensions) to sketches, i.e., discrete strings, in a low-dimensional Hamming space (32 or 64 dimensions).
Many similarity search problems can be solved as Hamming distance problems!

3. Issues on Modern Similarity Search
• Generality
▹ Traditional hashing algorithms produce binary sketches
▹ Modern hashing algorithms produce integer sketches
– Such as b-bit minhash [Li+, WWW10], 0-bit CWS [Li, KDD15], and GCWS [Li, KDD17]
▹ But, most search methods are designed for binary sketches
• Dynamics
▹ Modern real-world datasets are dynamic (i.e., updated over time)
– Such as Web pages and time series data
▹ But, most search methods are limited to static datasets or inefficient for dynamic datasets
Our challenge
Develop an efficient dynamic search method for both binary sketches (e.g., 001101001001) and integer sketches (e.g., 236301499231)

4. Problem Statement
• Sketch x of length m is an m-dimensional vector of non-negative integers
• We have a dataset X = {x1, x2, …, xn}, which is a dynamic set of n sketches
• Given sketch y and Hamming radius r as a query, we want to quickly find similar sketches such that {xi : H(xi, y) ≤ r}
▹ H(∙, ∙) is the Hamming distance (i.e., the number of dimensions in which two sketches differ)
Example: dataset X = {x1, x2, x3, x4}, query y = 111021, r = 1
x1 = 111020, H(x1, y) = 1 ≤ r (similar)
x2 = 001020, H(x2, y) = 3
x3 = 032021, H(x3, y) = 3
x4 = 113021, H(x4, y) = 1 ≤ r (similar)
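
The problem statement can be checked with a brute-force scan. The following is a minimal sketch, assuming sketches are stored as digit strings; the names `hamming` and `linear_search` are illustrative and are not taken from the authors' code.

```python
def hamming(x, y):
    """Number of positions at which two equal-length sketches differ."""
    if len(x) != len(y):
        raise ValueError("sketches must have the same length")
    return sum(a != b for a, b in zip(x, y))


def linear_search(dataset, y, r):
    """Return the ids of all sketches within Hamming radius r of query y."""
    return [name for name, x in dataset.items() if hamming(x, y) <= r]


# The four sketches and the query from the slide.
X = {"x1": "111020", "x2": "001020", "x3": "032021", "x4": "113021"}
result = linear_search(X, "111021", 1)
print(result)  # ['x1', 'x4']
```

This scan takes time proportional to n for every query, which is the cost that the index structures on the following slides try to avoid.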

5. State-of-the-art Similarity Search Methods
• Most methods use hash tables, but they are inefficient for dynamic datasets
• Recently, Eghbali et al. [IEEE TPAMI19] addressed this issue by using a search tree, but it is not applicable to integer sketches
• We propose new methods, DyFTs, for dynamic datasets of integer sketches, which leverage a trie data structure

6. Trie and Similarity Search
• Trie is a labeled tree built by merging common prefixes of sketches
• The downgoing path from the root to a leaf represents the associated sketch
Figure: trie storing sketches x1 to x8; for example, the path labeled 0, 3, 2, 0, 2, 1 from the root leads to the leaf of x3 = 032021.

7. Trie and Similarity Search
• Similarity search is performed by traversing nodes while counting #errors to the query sketch
• If #errors exceeds the radius, we stop traversing and skip all descendants of that node
• The time complexity is $O(m^{r+2})$, not depending on dataset size n
Example: search for y = 111020 with r = 1; x1 and x7 are similar
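
The traversal with error counting can be sketched as follows. This is an illustrative version, assuming a nested-dict trie over digit strings of equal length and the eight example sketches from the deck; `build_trie` and `trie_search` are hypothetical names.

```python
def build_trie(dataset):
    """Build a nested-dict trie; a leaf keeps sketch ids under the key None."""
    root = {}
    for name, sketch in dataset.items():
        node = root
        for symbol in sketch:
            node = node.setdefault(symbol, {})
        node.setdefault(None, []).append(name)
    return root


def trie_search(node, y, r, depth=0, errors=0):
    """Collect ids of sketches within Hamming radius r of query y."""
    if depth == len(y):
        return list(node.get(None, []))
    found = []
    for symbol, child in node.items():
        if symbol is None:
            continue
        e = errors + (symbol != y[depth])
        if e <= r:  # otherwise the whole subtree is skipped
            found.extend(trie_search(child, y, r, depth + 1, e))
    return found


X = {"x1": "111020", "x2": "001020", "x3": "032021", "x4": "113021",
     "x5": "333110", "x6": "330110", "x7": "311020", "x8": "030120"}
similar = sorted(trie_search(build_trie(X), "111020", 1))
print(similar)  # ['x1', 'x7']
```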

8. Dynamic Filter Trie (DyFT)
• Trie-based similarity search for binary and integer sketches
▹ Store only some of trie nodes around the root for memory efficiency
▹ Exploit the trie search algorithm for filtering out dissimilar sketches
Database X
x1 111020
x2 001020
x3 032021
x4 113021
x5 333110
x6 330110
x7 311020
x8 030120
Example: search for y = 111020 with r = 1
Candidate solutions from the trie: x1, x4, x7
Verification
H(x1, y) = 0 ≤ r (similar)
H(x4, y) = 2 > r (dissimilar)
H(x7, y) = 1 ≤ r (similar)

9. Update Procedure
• Visit the deepest reachable leaf node v using new sketch xi
• Append xi to the posting list Lv of leaf node v
• If the length of Lv (or |Lv|) exceeds threshold τ, split Lv and create new leaf nodes
Example: insert x9 = 030110
Append: x9 reaches leaf v via the path 0, 3 and is appended to Lv = {x3, x8}, giving Lv = {x3, x8, x9}
Split (if |Lv| > τ): v becomes an inner node with new leaf children labeled 0 (x8, x9) and 2 (x3)
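
The filtering, verification, and update steps can be put together in a small sketch. It is a simplified illustration, not the authors' implementation: the class names are hypothetical, the threshold τ is fixed at 2, a split goes only one level deep, and an empty leaf is created when an inner node has no child for the next symbol.

```python
def hamming(x, y):
    """Number of positions where two equal-length sketches differ."""
    return sum(a != b for a, b in zip(x, y))


class Node:
    def __init__(self):
        self.children = {}    # symbol -> Node, used once the node is inner
        self.postings = []    # ids of sketches stored at a leaf (the list L_v)
        self.is_leaf = True


class FilterTrie:
    """Hypothetical, simplified filter trie with a fixed split threshold tau."""

    def __init__(self, tau):
        self.tau = tau
        self.root = Node()
        self.sketches = {}    # id -> full sketch, kept for verification

    def insert(self, name, sketch):
        self.sketches[name] = sketch
        node, depth = self.root, 0
        # Walk down to a leaf, creating an empty leaf if the path ends early.
        while not node.is_leaf:
            node = node.children.setdefault(sketch[depth], Node())
            depth += 1
        node.postings.append(name)
        # Split one level when the posting list exceeds tau.
        if len(node.postings) > self.tau and depth < len(sketch):
            node.is_leaf = False
            for other in node.postings:
                symbol = self.sketches[other][depth]
                node.children.setdefault(symbol, Node()).postings.append(other)
            node.postings = []

    def candidates(self, y, r):
        """Filtering: ids in every leaf reachable with at most r errors."""
        found, stack = [], [(self.root, 0, 0)]
        while stack:
            node, depth, errors = stack.pop()
            if node.is_leaf:
                found.extend(node.postings)
                continue
            for symbol, child in node.children.items():
                e = errors + (symbol != y[depth])
                if e <= r:
                    stack.append((child, depth + 1, e))
        return sorted(found)

    def search(self, y, r):
        """Verification: keep candidates whose true distance is at most r."""
        return [n for n in self.candidates(y, r)
                if hamming(self.sketches[n], y) <= r]


data = {"x1": "111020", "x2": "001020", "x3": "032021", "x4": "113021",
        "x5": "333110", "x6": "330110", "x7": "311020", "x8": "030120",
        "x9": "030110"}
index = FilterTrie(tau=2)
for name, sketch in data.items():
    index.insert(name, sketch)
print(index.candidates("111020", 1))  # ['x1', 'x4', 'x7']
print(index.search("111020", 1))      # ['x1', 'x7']
```

With τ = 2 and this insertion order, inserting x9 overflows the leaf reached by the path 0, 3 and splits it into children labeled 0 and 2, matching the example on the slide.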

10. What is a Reasonable Splitting Threshold τ?
• A reasonable value of τ can be determined depending on the configuration of the dataset and given parameters
• But, it is impossible to search such a reasonable value for dynamic datasets
If τ is large: large verification time
If τ is small: large traversal time
Figure annotations: The best values are reversed! One order of magnitude!

11. Optimal Threshold τ*
• First, construct a search cost model assuming that sketches are uniformly distributed in the Hamming space
• Then, determine an optimal threshold τ* minimizing the search cost
Keep the leaf (if |Lv| ≤ τ*) or split it (if |Lv| > τ*)?
The search cost for node v is defined as (Reach Probability) × (Computational Cost)
τ* offers the case that can maintain the smaller cost

12. Reach Probability for Node v at Level ℓ
• Consider the probability of reaching node v within r errors using a random sketch $x \in \{0, 1, \ldots, \sigma - 1\}^{\ell}$ from a uniform distribution
# of all possible sketches of length ℓ is $\sigma^{\ell}$
# of all possible sketches reachable to a node at level ℓ within r errors is

$$N(\ell) = \sum_{k=0}^{r} \binom{\ell}{k} (\sigma - 1)^k$$

The probability of reaching node v within r errors is

$$P(\ell) = \frac{N(\ell)}{\sigma^{\ell}}$$
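
The counting argument can be checked numerically. The sketch below, with hypothetical function names, evaluates N(ℓ) = Σ_{k=0..r} C(ℓ, k)(σ − 1)^k and P(ℓ) = N(ℓ) / σ^ℓ, and compares N(ℓ) against a brute-force enumeration of all length-ℓ sketches for small parameters.

```python
from itertools import product
from math import comb


def num_reachable(level, r, sigma):
    """N(level): sketches of length `level` within r errors of a fixed path."""
    return sum(comb(level, k) * (sigma - 1) ** k for k in range(r + 1))


def reach_probability(level, r, sigma):
    """P(level) = N(level) / sigma**level under the uniform assumption."""
    return num_reachable(level, r, sigma) / sigma ** level


def brute_force_count(level, r, sigma):
    """Enumerate every sketch and count those within r errors of 00...0."""
    path = (0,) * level
    return sum(
        sum(a != b for a, b in zip(x, path)) <= r
        for x in product(range(sigma), repeat=level)
    )


print(num_reachable(6, 1, 4))      # 19
print(brute_force_count(6, 1, 4))  # 19
print(reach_probability(1, 1, 4))  # 1.0
```

Note that P(ℓ) = 1 whenever ℓ ≤ r, since no path that short can accumulate more than r errors.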

13. Search Cost of Inner Node v at Level ℓ
• If v is an inner node, we try to descend to the children of node v
Case 1: v is reached with fewer than r errors, so we check all the children in $O(\sigma)$ time
Case 2: v is reached with exactly r errors, so we directly look up the child in $O(1)$ time
• The number of all possible sketches reachable to v with r errors is

$$N_2(\ell) = \binom{\ell}{r} (\sigma - 1)^r$$

The search cost of inner node v, where the first term covers Case 1 and the second covers Case 2:

$$C_{in}(v) = P(\ell) \times \left\{ \left(1 - \frac{N_2(\ell)}{N(\ell)}\right) \times \sigma + \frac{N_2(\ell)}{N(\ell)} \times 1 \right\}$$

14. Search Cost of Leaf Node v at Level ℓ
• If v is a leaf node, we verify all sketches in the posting list $L_v$
Hamming distance can be computed by performing $\lceil \log_2 \sigma \rceil$ sets of bitwise-XOR and -popcount operations [Zhang+, SSDBM13]
Example: given a sketch y and $L_v$ = {x1, x4, x6, x7}, compute Ham(x1, y), Ham(x4, y), Ham(x6, y), and Ham(x7, y)
The search cost of leaf node v, where the braced term is the verification time:

$$C_{leaf}(v) = P(\ell) \times \left\{ |L_v| \times \lceil \log_2 \sigma \rceil \right\}$$
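
One common way to obtain a cost proportional to ⌈log2 σ⌉ is a vertical layout: store bit j of every symbol in bit plane j, XOR the planes pairwise, OR the results, and popcount once. The sketch below assumes this layout and uses Python integers as bit vectors; it may differ in detail from the method in [Zhang+, SSDBM13], and the function names are hypothetical.

```python
def to_planes(sketch, sigma):
    """Split a sketch into ceil(log2(sigma)) bit planes stored as integers."""
    bits = max(1, (sigma - 1).bit_length())  # equals ceil(log2(sigma)) for sigma >= 2
    planes = [0] * bits
    for i, symbol in enumerate(sketch):
        for j in range(bits):
            planes[j] |= ((symbol >> j) & 1) << i
    return planes


def hamming_planes(px, py):
    """Hamming distance between two sketches given as bit planes."""
    diff = 0
    for a, b in zip(px, py):
        diff |= a ^ b  # a position differs if any of its bits differs
    return bin(diff).count("1")  # popcount


sigma = 4
x3 = to_planes([0, 3, 2, 0, 2, 1], sigma)
y = to_planes([1, 1, 1, 0, 2, 1], sigma)
print(hamming_planes(x3, y))  # 3
```

For sketches of length up to the machine word size, each plane fits in one word, so the loop runs ⌈log2 σ⌉ word operations per comparison.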

15. Optimal Threshold τ*
• Given leaf v at level ℓ, we compare the search costs in the two cases:
If not splitting leaf v, then the search cost is

$$C_{leaf}(v)$$

If splitting leaf v into new leaves $u_1, u_2, \ldots, u_k$, then the new search cost is

$$C_{in}(v) + \sum_{i=1}^{k} C_{leaf}(u_i)$$

• We can derive the condition to maintain the smaller cost; splitting is cheaper when

$$|L_v| > \frac{P(\ell)}{P(\ell) - P(\ell + 1)} \times \frac{\left(1 - \frac{N_2(\ell)}{N(\ell)}\right) \times \sigma + \frac{N_2(\ell)}{N(\ell)}}{\lceil \log_2 \sigma \rceil} =: \tau^*$$

τ* is precomputable :)
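
Because τ* depends only on ℓ, r, and σ, it can be tabulated before any data arrives. The following sketch evaluates τ* = P(ℓ) / (P(ℓ) − P(ℓ+1)) × ((1 − N2/N)·σ + N2/N) / ⌈log2 σ⌉ with exact fractions. Returning infinity when P(ℓ) = P(ℓ+1), i.e., when splitting cannot lower the modeled cost, is an assumption of this sketch, and the function names are hypothetical.

```python
from fractions import Fraction
from math import comb, inf


def n_within(level, r, sigma):
    """N(level): sketches reachable within r errors."""
    return sum(comb(level, k) * (sigma - 1) ** k for k in range(r + 1))


def n_exact(level, r, sigma):
    """N2(level): sketches reachable with exactly r errors."""
    return comb(level, r) * (sigma - 1) ** r


def optimal_threshold(level, r, sigma):
    """tau* for a leaf at `level` under the uniform cost model."""
    n = n_within(level, r, sigma)
    p = Fraction(n, sigma ** level)
    p_next = Fraction(n_within(level + 1, r, sigma), sigma ** (level + 1))
    if p <= p_next:
        return inf  # assumption: never split when it cannot reduce the cost
    ratio = Fraction(n_exact(level, r, sigma), n)
    inner_cost = (1 - ratio) * sigma + ratio
    bits = max(1, (sigma - 1).bit_length())  # ceil(log2(sigma)) for sigma >= 2
    return float(p / (p - p_next) * inner_cost / bits)


# Precompute a table for sketches of length 6, radius 1, alphabet size 4.
table = [optimal_threshold(level, 1, 4) for level in range(6)]
print(table[0])  # inf
print(table[1])
```

For ℓ = 1, r = 1, σ = 4 the value works out to (16/9) × (7/4) / 2 = 14/9, so under this model a leaf at level 1 would be split as soon as it holds two sketches.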

16. Summary of DyFT
• Trie-based similarity search method for integer sketches
▹ Store only some of trie nodes around the root for memory efficiency
▹ Exploit the trie search algorithm for filtering out dissimilar sketches
▹ Grow the data structure while maintaining fast searches using optimal threshold τ*
• Other techniques (not presented in this slide)
▹ Switching trie search and linear search based on the cost model
▹ Weighting factor for practical computational costs

17. Experimental Setup
• Dataset
▹ 216 million compound-protein pairs
– Each pair is represented as a 3.6 million dimensional binary fingerprint
▹ We converted the fingerprints into binary and integer sketches using Li's minhash algorithm for Jaccard similarity [Li+, WWW10]
▹ We constructed an index by inserting sketches in random order
• Queryset
▹ We randomly sampled 1000 sketches from the dataset
• Code
▹ We implemented all data structures using C++17
▹ Source code is available at https://github.com/kampersanda/dyft

18. Analysis for Optimal Threshold τ*
Figure: search time (ms/query) for binary sketches and integer sketches.
Optimal threshold τ* is the fastest in most cases
The search times with fixed thresholds τ = 1, 10, 100 are reversed according to the dataset size n

19. Comparison with State-of-the-Arts
Figure: search time (ms/query), update time (sec), and memory usage (GB), each for binary sketches and integer sketches.
Figure annotations: 1600x faster; 13x smaller; Nearly equal; Always faster; Always smaller; Practically fast