Dynamic Similarity Search on Integer Sketches
Shunsuke Kanda and Yasuo Tabei
RIKEN Center for Advanced Intelligence Project, Japan
20th IEEE International Conference on Data Mining (ICDM), November 17–20, 2020, Sorrento, Italy (virtual)

Similarity-preserving Hashing
• Core technique for fast similarity search
  ▹ Randomly maps vectors in a metric space (e.g., cosine or Jaccard) to sketches (discrete strings) in the Hamming space
  ▹ Input: high-dimensional vectors (~10^3 to ~10^6 dimensions), e.g., (0.2, 0.7, 0.1, 0.5, …)
  ▹ Output: low-dimensional sketches (32 or 64 dimensions), e.g., 011…0
• Many similarity search problems can therefore be solved as Hamming distance problems

Issues on Modern Similarity Search
• Generality
  ▹ Traditional hashing algorithms produce binary sketches (e.g., 001101001001)
  ▹ Modern hashing algorithms produce integer sketches (e.g., 236301499231)
    – Such as b-bit minhash [Li+, WWW10], 0-bit CWS [Li, KDD15], and GCWS [Li, KDD17]
  ▹ However, most search methods are designed for binary sketches
• Dynamics
  ▹ Modern real-world datasets are dynamic (i.e., updated over time as new sketches are inserted)
    – Such as Web pages and time series data
  ▹ However, most search methods are limited to static datasets or are inefficient for dynamic datasets
• Our challenge: develop an efficient dynamic search method for both binary and integer sketches

Problem Statement
• A sketch x of length m is an m-dimensional vector of non-negative integers (generality)
• We have a dataset X = {x1, x2, …, xn}, which is a dynamic set of n sketches (dynamics)
• Given a sketch y and a Hamming radius r as a query, we want to quickly find the similar sketches {xi : H(xi, y) ≤ r}
  ▹ H(∙, ∙) is the Hamming distance (i.e., the number of positions at which two sketches differ)
• Example: query y = 111021 with r = 1 on a dataset X of n = 4 sketches
  ▹ x1 = 111020: H(x1, y) = 1 ≤ r (similar)
  ▹ x2 = 001020: H(x2, y) = 3
  ▹ x3 = 032021: H(x3, y) = 3
  ▹ x4 = 113021: H(x4, y) = 1 ≤ r (similar)
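
The problem statement can be made concrete with a brute-force baseline. The following is only a minimal sketch: it represents sketches as digit strings, and the names `hamming` and `range_query` are illustrative rather than taken from the authors' code.

```python
def hamming(x, y):
    """Number of positions at which two equal-length sketches differ."""
    assert len(x) == len(y)
    return sum(a != b for a, b in zip(x, y))


def range_query(dataset, y, r):
    """Return the ids of all sketches within Hamming radius r of y (linear scan)."""
    return [sid for sid, x in dataset.items() if hamming(x, y) <= r]


# The four-sketch example from the slide.
X = {"x1": "111020", "x2": "001020", "x3": "032021", "x4": "113021"}
similar = range_query(X, "111021", 1)
print(similar)
```

This scan touches every sketch, so its cost grows linearly with n; it serves only as a reference point for the trie-based methods.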

State-of-the-art Similarity Search Methods
• Most methods use hash tables, but they are inefficient for dynamic datasets
• Recently, Eghbali et al. [IEEE TPAMI19] addressed this issue by using a search tree, but their method is not applicable to integer sketches
• We propose new methods, DyFTs, for dynamic datasets of integer sketches, which leverage a trie data structure

Trie and Similarity Search
• A trie is a labeled tree built by merging the common prefixes of sketches
• The downward path from the root to a leaf represents the associated sketch
• Similarity search is performed by traversing nodes while counting the number of errors against the query sketch
• If the number of errors exceeds the radius, we stop descending and skip all descendants of that node
• The time complexity is $O(m^{r+2})$, which does not depend on the dataset size n
• Example: searching for y = 111020 with r = 1 finds x1 and x7 as similar
[Figure: trie over the sketches x1, …, x8, with edges labeled by sketch characters]
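
The traversal with error counting can be sketched as follows. This is an illustrative full trie, not the authors' implementation: the `Node` class and function names are assumptions, and the eight sketches are the example database listed on the DyFT slide.

```python
class Node:
    def __init__(self):
        self.children = {}  # edge label (one sketch character) -> child node
        self.ids = []       # ids of sketches ending at this node


def insert(root, sid, sketch):
    """Insert a sketch by following or creating one edge per character."""
    node = root
    for c in sketch:
        node = node.children.setdefault(c, Node())
    node.ids.append(sid)


def search(node, query, r, depth=0, errors=0):
    """Collect ids whose root-to-leaf path differs from query in at most r positions."""
    if errors > r:
        return []  # prune: no descendant can be within radius r
    if depth == len(query):
        return list(node.ids)
    found = []
    for c, child in node.children.items():
        found += search(child, query, r, depth + 1, errors + (c != query[depth]))
    return found


database = {
    "x1": "111020", "x2": "001020", "x3": "032021", "x4": "113021",
    "x5": "333110", "x6": "330110", "x7": "311020", "x8": "030120",
}
root = Node()
for sid, sketch in database.items():
    insert(root, sid, sketch)

print(sorted(search(root, "111020", 1)))
```

Pruning happens as soon as the error count passes r, which is why the work done depends on m and r rather than on n.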

Dynamic Filter Trie (DyFT) (proposed)
• Trie-based similarity search for binary and integer sketches
  ▹ Store only some of the trie nodes around the root for memory efficiency
  ▹ Exploit the trie search algorithm to filter out dissimilar sketches
• Database X: x1 = 111020, x2 = 001020, x3 = 032021, x4 = 113021, x5 = 333110, x6 = 330110, x7 = 311020, x8 = 030120
• Example: searching for y = 111020 with r = 1
  ▹ Filtering on the partial trie leaves the candidate solutions x1, x4, and x7
  ▹ Verification: H(x1, y) = 0 ≤ r (similar), H(x4, y) = 2 > r (dissimilar), H(x7, y) = 1 ≤ r (similar)
[Figure: partial trie storing only the nodes around the root, with posting lists at the leaves]

Update Procedure (proposed)
• Visit the deepest reachable leaf node v using the new sketch xi
• Append xi to the posting list Lv of leaf node v
• If the length of Lv (i.e., |Lv|) exceeds threshold τ, split Lv and create new leaf nodes
• Example: insert x9 = 030110
  ▹ The leaf v reached via the path 0, 3 holds x3 and x8 in its posting list Lv
  ▹ x9 is appended to Lv
  ▹ If |Lv| > τ, v is split and new leaves are created under the labels 0 and 2

What is a Reasonable Splitting Threshold τ? (proposed)
• A reasonable value of τ can be determined from the configuration of the dataset and the given parameters
• However, it is impossible to search for such a value for dynamic datasets
  ▹ If τ is large: large verification time
  ▹ If τ is small: large traversal time
[Figure: search time under different values of τ; annotations: "The best values are reversed!" and "One order of magnitude!"]

Optimal Threshold τ* (proposed)
• First, construct a search cost model assuming that sketches are uniformly distributed in the Hamming space
• Then, determine an optimal threshold τ* that minimizes the search cost
  ▹ The search cost for node v is defined as (reach probability) × (computational cost)
  ▹ Keep leaf v if |Lv| ≤ τ*; split it if |Lv| > τ*
  ▹ τ* selects whichever case maintains the smaller cost

Reach Probability for Node v at Level ℓ (proposed)
• Consider the probability of reaching node v within r errors using a random sketch x ∈ {0, 1, …, σ − 1}^ℓ drawn from a uniform distribution
• The number of all possible sketches of length ℓ is $\sigma^\ell$
• The number of all possible sketches that can reach a node at level ℓ within r errors is
  $N(\ell) = \sum_{k=0}^{r} \binom{\ell}{k} (\sigma - 1)^k$
• The reach probability is therefore
  $P(\ell) = N(\ell) / \sigma^\ell$
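
The two quantities can be evaluated exactly with integer arithmetic. This is a small sketch; the function names are illustrative, and the values σ = 4, r = 1, ℓ = 6 are chosen only to match the alphabet and length of the example sketches.

```python
from fractions import Fraction
from math import comb


def n_reachable(level, r, sigma):
    """N(l): number of length-l strings within r errors of a fixed path."""
    return sum(comb(level, k) * (sigma - 1) ** k for k in range(r + 1))


def reach_probability(level, r, sigma):
    """P(l) = N(l) / sigma^l under the uniform-distribution assumption."""
    return Fraction(n_reachable(level, r, sigma), sigma ** level)


print(n_reachable(6, 1, 4), reach_probability(6, 1, 4))
```

For levels ℓ ≤ r every string is reachable, so P(ℓ) = 1 there; the probability only starts to drop below the first r levels.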

Search Cost of Inner Node v at Level ℓ (proposed)
• If v is an inner node, we try to descend to the children of node v
  ▹ Case 1 (v is reached with fewer than r errors): check all the children in O(σ) time
  ▹ Case 2 (v is reached with exactly r errors): directly look up the matching child in O(1) time
• The number of all possible sketches that reach v with exactly r errors is
  $N_2(\ell) = \binom{\ell}{r} (\sigma - 1)^r$
• The search cost of inner node v is
  $C_{in}(v) = P(\ell) \times \left\{ \left(1 - \frac{N_2(\ell)}{N(\ell)}\right) \times \sigma + \frac{N_2(\ell)}{N(\ell)} \times 1 \right\}$
  where the first term corresponds to Case 1 and the second to Case 2
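
As a quick sanity check of the inner-node cost, here is a worked instance for σ = 4, r = 1, ℓ = 2. These values are chosen only for illustration and do not come from the slides; the exponent in N_2 is taken to be r, i.e., N_2 is the k = r term of N(ℓ).

```latex
\begin{align*}
N(2)   &= \binom{2}{0}\,3^{0} + \binom{2}{1}\,3^{1} = 1 + 6 = 7, \\
P(2)   &= \frac{N(2)}{4^{2}} = \frac{7}{16}, \\
N_2(2) &= \binom{2}{1}\,3^{1} = 6, \\
C_{in}(v) &= \frac{7}{16}\left\{\left(1 - \frac{6}{7}\right)\cdot 4 + \frac{6}{7}\cdot 1\right\}
           = \frac{7}{16}\cdot\frac{10}{7} = \frac{5}{8}.
\end{align*}
```

Most strings that reach level 2 have already used their single error (6 of 7), so the expected per-visit work is close to the O(1) lookup rather than the O(σ) scan.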

Search Cost of Leaf Node v at Level ℓ (proposed)
• If v is a leaf node, we verify all sketches associated with Lv against the given sketch y (e.g., Ham(x1, y), Ham(x4, y), Ham(x6, y), and Ham(x7, y) for Lv = {x1, x4, x6, x7})
• The Hamming distance can be computed by performing ⌈log2 σ⌉ sets of bitwise-XOR and popcount operations [Zhang+, SSDBM13]
• The search cost of leaf node v is
  $C_{leaf}(v) = P(\ell) \times \{ |L_v| \times \lceil \log_2 \sigma \rceil \}$
  where the braced term is the verification time
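
One common way to get a cost proportional to ⌈log2 σ⌉ is a vertical (bit-plane) layout: bit j of every character is packed into plane j, so a comparison needs one XOR per plane. The sketch below follows that idea under assumptions: it OR-combines the XOR results and applies a single popcount, and its details may differ from the cited method and from the authors' code.

```python
def to_planes(sketch, sigma):
    """Pack a sketch (list of ints in [0, sigma)) into ceil(log2 sigma) bit planes."""
    bits = max(1, (sigma - 1).bit_length())  # equals ceil(log2 sigma) for sigma >= 2
    planes = [0] * bits
    for i, c in enumerate(sketch):
        for j in range(bits):
            if (c >> j) & 1:
                planes[j] |= 1 << i          # bit i of plane j = bit j of character i
    return planes


def hamming_vertical(px, py):
    """Hamming distance from two plane lists: XOR per plane, OR together, popcount."""
    diff = 0
    for a, b in zip(px, py):
        diff |= a ^ b                        # position i is set if any bit differs
    return bin(diff).count("1")


y = to_planes([1, 1, 1, 0, 2, 0], sigma=4)   # y  = 111020
x4 = to_planes([1, 1, 3, 0, 2, 1], sigma=4)  # x4 = 113021
print(hamming_vertical(x4, y))
```

With σ = 4 each sketch occupies two planes, so verifying one candidate costs two XORs, matching the ⌈log2 σ⌉ factor in the leaf cost.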

Optimal Threshold τ* (proposed)
• Given leaf v at level ℓ, we compare the search costs in the two cases:
  ▹ If not splitting leaf v, the search cost is $C_{leaf}(v)$
  ▹ If splitting leaf v into children $u_1, u_2, \ldots, u_k$, the new search cost is $C_{in}(v) + \sum_i C_{leaf}(u_i)$
• We can derive the condition under which splitting maintains the smaller cost:
  $|L_v| > \frac{P(\ell)}{P(\ell) - P(\ell + 1)} \times \frac{\left(1 - \frac{N_2(\ell)}{N(\ell)}\right) \times \sigma + \frac{N_2(\ell)}{N(\ell)}}{\lceil \log_2 \sigma \rceil} =: \tau^*$
• τ* depends only on ℓ, r, and σ, so it is precomputable
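
Because τ* depends only on ℓ, r, and σ, it can be tabulated per level ahead of time. The following sketch evaluates the formula exactly with rationals; the function names are illustrative, and returning infinity when P(ℓ) = P(ℓ + 1) (where the formula's denominator vanishes) is an assumption of this sketch, not something stated on the slide.

```python
from fractions import Fraction
from math import comb


def n_reachable(level, r, sigma):
    return sum(comb(level, k) * (sigma - 1) ** k for k in range(r + 1))


def reach_probability(level, r, sigma):
    return Fraction(n_reachable(level, r, sigma), sigma ** level)


def inner_cost_factor(level, r, sigma):
    """(1 - N2/N) * sigma + N2/N, with N2 = C(level, r) * (sigma - 1)^r."""
    ratio = Fraction(comb(level, r) * (sigma - 1) ** r, n_reachable(level, r, sigma))
    return (1 - ratio) * sigma + ratio


def tau_star(level, r, sigma):
    """Threshold above which splitting a leaf at this level lowers the modeled cost."""
    p_here = reach_probability(level, r, sigma)
    p_next = reach_probability(level + 1, r, sigma)
    if p_here == p_next:
        return float("inf")  # assumption: splitting cannot pay off, so never split
    bits = max(1, (sigma - 1).bit_length())  # ceil(log2 sigma) for sigma >= 2
    return p_here / (p_here - p_next) * inner_cost_factor(level, r, sigma) / bits


# Precompute a per-level table for sketches of length 6 over {0, 1, 2, 3}, r = 1.
table = [tau_star(level, 1, 4) for level in range(6)]
print(table)
```

For example, `tau_star(2, 1, 4)` evaluates to 10/9: P(2) = 7/16, P(3) = 5/32, the inner cost factor is 10/7, and ⌈log2 4⌉ = 2.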

Summary of DyFT (proposed)
• Trie-based similarity search method for integer sketches
  ▹ Store only some of the trie nodes around the root for memory efficiency
  ▹ Exploit the trie search algorithm to filter out dissimilar sketches
  ▹ Grow the data structure while maintaining fast searches using the optimal threshold τ*
• Other techniques (not presented in these slides)
  ▹ Switching between trie search and linear search based on the cost model
  ▹ Weighting factor for practical computational costs
  ▹ Efficient node implementation, MART (modified adaptive radix tree)
[Figure: partial trie for the search y = 111020 with r = 1, with candidate solutions at the leaves]

Experimental Setup
• Dataset
  ▹ 216 million compound-protein pairs
    – Each pair is represented as a 3.6-million-dimensional binary fingerprint
  ▹ We converted the fingerprints into binary and integer sketches using Li's minhash algorithm for Jaccard similarity [Li+, WWW10]
  ▹ We constructed an index by inserting sketches in random order
• Queryset
  ▹ We randomly sampled 1000 sketches from the dataset
• Code
  ▹ We implemented all data structures in C++17
  ▹ Source code is available at https://github.com/kampersanda/dyft
[Figure: compound structures labeled Aspirin and Caffeic Acid]

Analysis for Optimal Threshold τ*
• The optimal threshold τ* is the fastest in most cases
• The search times with fixed thresholds τ = 1, 10, 100 are reversed according to the dataset size n
[Figure: search time (ms/query) for binary sketches and integer sketches]