Tabei
RIKEN Center for Advanced Intelligence Project, Japan
20th IEEE International Conference on Data Mining (ICDM)
November 17–20, 2020, Sorrento, Italy (virtual)

Randomly map vectors in a metric space into sketches in the Hamming space
[Figure: hashing maps a high-dimensional vector (~10^3 to ~10^6 dimensions, e.g., 0.2 0.7 0.1 0.5 …) in a metric space (e.g., cosine or Jaccard) to a low-dimensional sketch (32 or 64 dimensions, e.g., 0 1 1 … 0) of discrete symbols in the Hamming space]
Many similarity search problems can be solved as Hamming distance problems on discrete strings.

algorithms produce binary sketches
▹ Modern hashing algorithms produce integer sketches
– Such as b-bit minhash [Li+, WWW10], 0-bit CWS [Li, KDD15], and GCWS [Li, KDD17]
▹ But most search methods are designed for binary sketches
• Dynamics
▹ Modern real-world datasets are dynamic (i.e., updated over time)
– Such as Web pages and time series data
▹ But most search methods are limited to static datasets or are inefficient for dynamic datasets
Our challenge: develop an efficient dynamic search method for both binary sketches (e.g., 001101001001) and integer sketches (e.g., 236301499231)
[Figure: a sketch x being inserted into the dataset]

m-dimensional vector of non-negative integers
• We have a dataset X = {x_1, x_2, …, x_n}, which is a dynamic set of n sketches (this addresses both generality and dynamics)
• Given a sketch y and a Hamming radius r as a query, we want to quickly find the similar sketches {x_i : H(x_i, y) ≤ r}
▹ H(∙, ∙) is the Hamming distance (i.e., the number of dimensions in which the two sketches differ)
Example: query y = 111021 with r = 1 on dataset X
Sketch   Value    H(x_i, y)   Result
x1       111020   1           ≤ r, similar
x2       001020   3           > r
x3       032021   3           > r
x4       113021   1           ≤ r, similar
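The problem statement above can be checked with a brute-force scan. The following is a minimal sketch, assuming sketches are stored as equal-length strings of digit symbols; the function names `hamming` and `range_search` are illustrative and do not come from the slides.

```python
def hamming(x, y):
    """Number of positions at which two equal-length sketches differ."""
    if len(x) != len(y):
        raise ValueError("sketches must have the same length")
    return sum(a != b for a, b in zip(x, y))


def range_search(dataset, y, r):
    """Linear scan: names of all sketches within Hamming radius r of y."""
    return [name for name, x in dataset.items() if hamming(x, y) <= r]


# Dataset and query taken from the slide's example.
X = {"x1": "111020", "x2": "001020", "x3": "032021", "x4": "113021"}
result = range_search(X, "111021", 1)
print(result)  # ['x1', 'x4']
```

This scan costs time proportional to n per query, which is the baseline that an index is meant to improve on.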

but they are inefficient for dynamic datasets
• Recently, Eghbali et al. [IEEE TPAMI19] addressed this issue by using a search tree, but it is not applicable to integer sketches
• We propose new methods, DyFTs, for dynamic datasets of integer sketches, which leverage a trie data structure

built by merging common prefixes of sketches
• The downward path from the root to a leaf represents the associated sketch
[Figure: trie over sketches x1 to x8, with edges labeled by sketch symbols]
• Similarity search is performed by traversing nodes while counting the number of errors (mismatches) against the query sketch
• If the number of errors exceeds the radius, we stop and skip all descendants of that node
• The time complexity is O(m^(r+2)), which does not depend on the dataset size n
Example: search for y = 111020 with r = 1; x1 and x7 are similar
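The trie search described above can be sketched in a few lines. This is an illustrative version only, assuming fixed-length string sketches and a plain dictionary-based trie; `Node`, `build_trie` and `trie_search` are hypothetical names, and the eight sketches are the ones listed in the slides' example database.

```python
class Node:
    def __init__(self):
        self.children = {}  # symbol -> child Node
        self.ids = []       # names of sketches ending at this node


def build_trie(dataset):
    """Merge common prefixes of the sketches into a trie."""
    root = Node()
    for name, sketch in dataset.items():
        node = root
        for symbol in sketch:
            node = node.children.setdefault(symbol, Node())
        node.ids.append(name)
    return root


def trie_search(root, y, r):
    """Traverse the trie, pruning any branch with more than r errors."""
    results = []
    stack = [(root, 0, 0)]  # (node, depth, errors so far)
    while stack:
        node, depth, errors = stack.pop()
        if depth == len(y):
            results.extend(node.ids)
            continue
        for symbol, child in node.children.items():
            child_errors = errors + (symbol != y[depth])
            if child_errors <= r:  # otherwise skip all descendants
                stack.append((child, depth + 1, child_errors))
    return sorted(results)


X = {"x1": "111020", "x2": "001020", "x3": "032021", "x4": "113021",
     "x5": "333110", "x6": "330110", "x7": "311020", "x8": "030120"}
similar = trie_search(build_trie(X), "111020", 1)
print(similar)  # ['x1', 'x7']
```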

and integer sketches
▹ Store only some of the trie nodes around the root for memory efficiency
▹ Exploit the trie search algorithm for filtering out dissimilar sketches
Proposed
Database X:
x1 111020
x2 001020
x3 032021
x4 113021
x5 333110
x6 330110
x7 311020
x8 030120
Search for y = 111020 with r = 1
[Figure: partial trie near the root (edge labels 0 1 3 0 3 1 3) whose leaves hold x1 to x8]
Candidate solutions: x1, x4, x7
Verification:
H(x1, y) = 0 ≤ r (similar)
H(x4, y) = 2 > r (dissimilar)
H(x7, y) = 1 ≤ r (similar)
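The filter-then-verify idea on this slide can be illustrated with a deliberately simplified stand-in for the partial trie: a fixed-depth prefix index. The fixed `depth` parameter and all function names below are assumptions made for illustration; the actual DyFT structure grows its trie adaptively rather than cutting at a fixed depth.

```python
from collections import defaultdict


def hamming(x, y):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(x, y))


def build_prefix_index(dataset, depth):
    """Group sketch names by their first `depth` symbols."""
    index = defaultdict(list)
    for name, sketch in dataset.items():
        index[sketch[:depth]].append(name)
    return index


def filter_candidates(index, y, r, depth):
    """Keep only groups whose prefix is within r errors of the query prefix."""
    candidates = []
    for prefix, names in index.items():
        if hamming(prefix, y[:depth]) <= r:
            candidates.extend(names)
    return sorted(candidates)


def verify(dataset, candidates, y, r):
    """Compute the full Hamming distance for each surviving candidate."""
    return [name for name in candidates if hamming(dataset[name], y) <= r]


X = {"x1": "111020", "x2": "001020", "x3": "032021", "x4": "113021",
     "x5": "333110", "x6": "330110", "x7": "311020", "x8": "030120"}
index = build_prefix_index(X, depth=2)
candidates = filter_candidates(index, "111020", 1, depth=2)
answers = verify(X, candidates, "111020", 1)
print(candidates)  # ['x1', 'x4', 'x7']
print(answers)     # ['x1', 'x7']
```

With a prefix depth of 2 the filter reproduces the slide's candidate set, and verification then discards x4.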

new sketch x_i
• Append x_i to the posting list L_v of leaf node v
• If the length of L_v (i.e., |L_v|) exceeds threshold τ, split L_v and create new leaf nodes under v
Example: insert x9 = 030110
▹ Append: x9 follows the path 0 → 3 to leaf v, whose posting list {x3, x8} becomes {x3, x8, x9}
▹ Split (if |L_v| > τ): v becomes an inner node with new child leaves for the next symbols 0 and 2
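The append-and-split step can be sketched as follows. This is a minimal illustration, assuming string sketches, a fixed threshold `tau`, and a one-level split per insertion; `Node`, `DynamicTrie` and the choice `tau = 2` are hypothetical and not taken from the authors' implementation.

```python
class Node:
    def __init__(self, depth):
        self.depth = depth     # number of symbols consumed on the path here
        self.children = {}     # symbol -> Node (used once the node is inner)
        self.postings = []     # sketches stored at this node while it is a leaf
        self.is_leaf = True


class DynamicTrie:
    def __init__(self, m, tau):
        self.m = m             # sketch length
        self.tau = tau         # split threshold on the posting list length
        self.root = Node(0)

    def insert(self, sketch):
        node = self.root
        while not node.is_leaf:  # descend to the leaf matching the prefix
            symbol = sketch[node.depth]
            if symbol not in node.children:
                node.children[symbol] = Node(node.depth + 1)
            node = node.children[symbol]
        node.postings.append(sketch)
        if len(node.postings) > self.tau and node.depth < self.m:
            self._split(node)

    def _split(self, node):
        """Turn a leaf into an inner node, distributing by the next symbol."""
        node.is_leaf = False
        for sketch in node.postings:
            child = node.children.setdefault(sketch[node.depth],
                                             Node(node.depth + 1))
            child.postings.append(sketch)
        node.postings = []


trie = DynamicTrie(m=6, tau=2)
for s in ["111020", "001020", "032021", "113021", "333110",
          "330110", "311020", "030120", "030110"]:  # x1..x8, then x9
    trie.insert(s)

v = trie.root.children["0"].children["3"]
print(sorted(v.children))  # ['0', '2']
```

Inserting x1 to x8 in order with `tau = 2` leaves x3 and x8 in the leaf at path 0 → 3; inserting x9 then pushes that leaf over the threshold and splits it on the symbols 0 and 2, as in the slide's example.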

reasonable value of τ can be determined depending on the configuration of the dataset and the given parameters
• But it is impossible to search for such a reasonable value for dynamic datasets
▹ If τ is large: large verification time
▹ If τ is small: large traversal time
[Figure annotations: the best values of τ are reversed; the difference is one order of magnitude]

assuming that sketches are uniformly distributed in the Hamming space
• Then, determine an optimal threshold τ* minimizing the search cost
▹ Keep leaf v (if |L_v| ≤ τ*) or split L_v (if |L_v| > τ*)?
▹ τ* offers the case that can maintain the smaller cost
• The search cost for node v is defined as
(search cost of v) = (reach probability) × (computational cost)

the probability of reaching node v within r errors using a random sketch x ∈ {0, 1, …, σ − 1}^ℓ drawn from a uniform distribution
• Question: given a random sketch x and radius r, what is the probability of reaching node v at level ℓ within r errors?
• The number of all possible sketches of length ℓ is $\sigma^{\ell}$
• The number of all possible sketches that can reach a node at level ℓ within r errors is
$$N(\ell) = \sum_{k=0}^{r} \binom{\ell}{k} (\sigma - 1)^{k}$$
• The reach probability is therefore
$$P(\ell) = \frac{N(\ell)}{\sigma^{\ell}}$$
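The two counts above translate directly into code. The following is a small sketch using exact rational arithmetic; the function names are illustrative, and the parameters `level`, `sigma` and `r` stand for ℓ, σ and r.

```python
from fractions import Fraction
from math import comb


def num_reachable(level, sigma, r):
    """N(l): number of length-l sketches within r errors of a fixed path."""
    return sum(comb(level, k) * (sigma - 1) ** k for k in range(r + 1))


def reach_probability(level, sigma, r):
    """P(l) = N(l) / sigma^l for a uniformly random sketch."""
    return Fraction(num_reachable(level, sigma, r), sigma ** level)


# Binary sketches (sigma = 2), radius 1, node at level 3:
# N(3) = 1 + 3 = 4 of the 2^3 = 8 possible prefixes reach the node.
p = reach_probability(3, 2, 1)
print(p)  # 1/2
```

Note that P(ℓ) = 1 whenever ℓ ≤ r, since every prefix of length at most r is within r errors of any path.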

If v is an inner node, we try to descend to the children of node v
• Case 1: v is reached with fewer than r errors
▹ Check all the children in O(σ) time
• Case 2: v is reached with exactly r errors
▹ Directly look up the single matching child in O(1) time
• The number of all possible sketches that reach v with exactly r errors is
$$N_2(\ell) = \binom{\ell}{r} (\sigma - 1)^{r}$$
• The search cost of inner node v:
$$C_{\mathrm{in}}(v) = P(\ell) \times \left\{ \left(1 - \frac{N_2(\ell)}{N(\ell)}\right) \times \sigma + \frac{N_2(\ell)}{N(\ell)} \times 1 \right\}$$
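As a rough check of the inner-node cost, the formula can be evaluated numerically. This sketch assumes that N_2(ℓ) is the k = r term of N(ℓ), i.e., the count of sketches reaching the node with exactly r errors; the function names are illustrative.

```python
from fractions import Fraction
from math import comb


def num_reachable(level, sigma, r):
    """N(l): sketches reaching a level-l node within r errors."""
    return sum(comb(level, k) * (sigma - 1) ** k for k in range(r + 1))


def num_exact(level, sigma, r):
    """N_2(l): sketches reaching a level-l node with exactly r errors."""
    return comb(level, r) * (sigma - 1) ** r


def inner_cost(level, sigma, r):
    """C_in(v) = P(l) * ((1 - N_2/N) * sigma + (N_2/N) * 1)."""
    n = num_reachable(level, sigma, r)
    p = Fraction(n, sigma ** level)
    exact_ratio = Fraction(num_exact(level, sigma, r), n)
    return p * ((1 - exact_ratio) * sigma + exact_ratio * 1)


# sigma = 2, r = 1, level 3: P = 1/2, N_2/N = 3/4,
# so C_in = 1/2 * (1/4 * 2 + 3/4) = 5/8.
c_in = inner_cost(3, 2, 1)
print(c_in)  # 5/8
```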

If v is a leaf node, we verify all sketches in the posting list L_v associated with v
• Given a sketch y, compute Ham(x_i, y) for each x_i in L_v (e.g., Ham(x1, y), Ham(x4, y), Ham(x6, y), Ham(x7, y) for L_v = {x1, x4, x6, x7})
• Hamming distance can be computed by performing ⌈log2 σ⌉ sets of bitwise-XOR and popcount operations [Zhang+, SSDBM13]
• The search cost of leaf node v:
$$C_{\mathrm{leaf}}(v) = P(\ell) \times \left\{ |L_v| \times \lceil \log_2 \sigma \rceil \right\}$$
where the braced term is the verification time
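One common way to obtain a cost proportional to ⌈log2 σ⌉ is to store each sketch as that many bit planes and XOR them plane by plane. The sketch below is an assumption about how such a scheme can look, not a reproduction of the exact operation sequence in [Zhang+, SSDBM13]: it ORs the per-plane XOR results and applies a single popcount at the end. Sketches are given as lists of integers in {0, …, σ − 1}.

```python
def num_planes(sigma):
    """ceil(log2(sigma)) bits per symbol, with at least one plane."""
    return max(1, (sigma - 1).bit_length())


def to_bit_planes(sketch, sigma):
    """Plane j holds bit j of every symbol; position i maps to bit i."""
    planes = [0] * num_planes(sigma)
    for i, symbol in enumerate(sketch):
        for j in range(len(planes)):
            if (symbol >> j) & 1:
                planes[j] |= 1 << i
    return planes


def hamming_planes(px, py):
    """Positions differ iff some bit plane differs there."""
    diff = 0
    for a, b in zip(px, py):
        diff |= a ^ b               # one XOR per plane
    return bin(diff).count("1")     # popcount of the mismatch mask


sigma = 4
y = to_bit_planes([1, 1, 1, 0, 2, 1], sigma)   # query 111021
x1 = to_bit_planes([1, 1, 1, 0, 2, 0], sigma)  # 111020
x3 = to_bit_planes([0, 3, 2, 0, 2, 1], sigma)  # 032021
print(hamming_planes(x1, y), hamming_planes(x3, y))  # 1 3
```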

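compare the search costs in the two cases for a leaf v at level ℓ:
• If not splitting leaf v, the search cost is
$$C_{\mathrm{leaf}}(v)$$
• If splitting leaf v into children u_1, u_2, …, u_k, the new search cost is
$$C_{\mathrm{in}}(v) + \sum_{i=1}^{k} C_{\mathrm{leaf}}(u_i)$$
• We can derive the condition under which splitting maintains the smaller cost:
$$|L_v| > \frac{P(\ell)}{P(\ell) - P(\ell + 1)} \times \frac{\left(1 - \frac{N_2(\ell)}{N(\ell)}\right) \times \sigma + \frac{N_2(\ell)}{N(\ell)}}{\lceil \log_2 \sigma \rceil} =: \tau^*$$
▹ τ* depends only on ℓ, σ, and r, so it is precomputable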
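The threshold can be tabulated ahead of time from ℓ, σ and r alone. The following is a minimal sketch of that computation under the cost model stated on these slides; the function names are illustrative, and returning infinity when P(ℓ) = P(ℓ + 1) (so that splitting never pays off at that level) is an assumption of this sketch rather than something the slides specify.

```python
from fractions import Fraction
from math import comb, inf


def num_reachable(level, sigma, r):
    """N(l): sketches reaching a level-l node within r errors."""
    return sum(comb(level, k) * (sigma - 1) ** k for k in range(r + 1))


def reach_probability(level, sigma, r):
    """P(l) = N(l) / sigma^l."""
    return Fraction(num_reachable(level, sigma, r), sigma ** level)


def optimal_threshold(level, sigma, r):
    """tau*: split a level-l leaf once its posting list is longer than this."""
    p_here = reach_probability(level, sigma, r)
    p_next = reach_probability(level + 1, sigma, r)
    if p_here == p_next:
        return inf  # children are reached as often as v: never split
    n = num_reachable(level, sigma, r)
    exact_ratio = Fraction(comb(level, r) * (sigma - 1) ** r, n)
    traversal = (1 - exact_ratio) * sigma + exact_ratio
    bits = max(1, (sigma - 1).bit_length())  # ceil(log2(sigma))
    return p_here / (p_here - p_next) * traversal / bits


# sigma = 2, r = 1, level 3:
# P(3) = 1/2, P(4) = 5/16, traversal term = 5/4, one bit per symbol,
# so tau* = (1/2) / (3/16) * (5/4) = 10/3.
tau = optimal_threshold(3, 2, 1)
print(tau)  # 10/3
```

A table of these values indexed by level can be built once per (σ, r) and consulted on every insertion.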

sketches
▹ Store only some of the trie nodes around the root for memory efficiency
▹ Exploit the trie search algorithm for filtering out dissimilar sketches
▹ Grow the data structure while maintaining fast searches using the optimal threshold τ*
[Figure: partial trie over x1 to x8; search for y = 111020 with r = 1 yields candidate solutions]
• Other techniques (not presented in these slides)
▹ Switching between trie search and linear search based on the cost model
▹ Weighting factor for practical computational costs
▹ Efficient node implementation MART (modified adaptive radix tree)

Each pair is represented as a 3.6-million-dimensional binary fingerprint
▹ We converted the fingerprints into binary and integer sketches using Li's minhash algorithm for Jaccard similarity [Li+, WWW10]
▹ We constructed an index by inserting sketches in random order
• Queryset
▹ We randomly sampled 1000 sketches from the dataset
• Code
▹ We implemented all data structures in C++17
▹ Source code is available at https://github.com/kampersanda/dyft
[Figure labels: Aspirin, Caffeic Acid]

[Figure: time (ms/query)]
• The optimal threshold τ* is the fastest in most cases
• The search times with fixed thresholds τ = 1, 10, 100 are reversed according to the dataset size n