AIP Open Seminar #6 - Speaker Deck

Slide 1

Slide 1 text

Dynamic Similarity Search on Integer Sketches
Shunsuke Kanda and Yasuo Tabei
Succinct Information Processing Unit
(Presented at ICDM20)

Slide 2

Slide 2 text

Contents
1. Background & Contribution
2. Preliminary: Trie-based Similarity Search
3. New method: Dynamic Filter Trie (DyFT)
   i. Node reduction technique
   ii. Node implementation technique
4. Experiments

Slide 3

Slide 3 text

Slide 4

Slide 4 text

Similarity-preserving Hashing
• Core technique for fast similarity searches
  ▹ Randomly map vectors in a metric space into sketches in the Hamming space
[Figure: hashing maps a high-dimensional vector (about 10^3 to 10^6 dimensions) in a metric space (e.g., cosine or Jaccard) to a low-dimensional sketch (32 or 64 dimensions) in the Hamming space.]
Many similarity search problems can be solved as Hamming distance problems on discrete strings.

Slide 5

Slide 5 text

Modern Issues on Similarity Search
• Generality
  ▹ Traditional hashing algorithms produce binary sketches (e.g., 001101001001)
  ▹ Modern hashing algorithms produce integer sketches (e.g., 236301499231)
    – Such as b-bit minhash [Li+, WWW10], 0-bit CWS [Li, KDD15], and GCWS [Li, KDD17]
  ▹ But most search methods are designed for binary sketches
• Dynamics
  ▹ Modern real-world datasets are dynamic (i.e., updated over time)
    – Such as Web pages and time series data
  ▹ But most search methods are limited to static datasets or are inefficient for dynamic datasets
Our challenge: develop an efficient dynamic search method for both binary and integer sketches
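To make "integer sketch" concrete, here is a minimal sketch of b-bit minhash for sets of non-negative integer ids. It is an illustration only: the hash family `(a*x + c) mod PRIME`, the helper names `make_hashes` and `bbit_minhash`, and the parameters are assumptions chosen for this example, not the configuration used in the talk.

```python
import random

PRIME = (1 << 61) - 1  # a large Mersenne prime for the (assumed) hash family


def make_hashes(m, seed=0):
    """Draw m random (a, c) pairs defining h(x) = (a*x + c) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(m)]


def bbit_minhash(items, hashes, b):
    """Map a non-empty set of integer ids to a sketch of len(hashes) integers.

    Each position keeps only the lowest b bits of the minimum hash value,
    so every entry lies in {0, ..., 2**b - 1}.
    """
    mask = (1 << b) - 1
    sketch = []
    for a, c in hashes:
        smallest = min((a * x + c) % PRIME for x in items)
        sketch.append(smallest & mask)
    return sketch


hashes = make_hashes(m=32, seed=42)
s1 = bbit_minhash({1, 5, 9, 20}, hashes, b=2)  # integer sketch with sigma = 4
s2 = bbit_minhash({1, 5, 9, 21}, hashes, b=2)  # a similar set
```

With b = 1 the same function yields a binary sketch, which is why a search method for integer sketches also covers the binary case.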

Slide 6

Slide 6 text

Problem Statement
• A sketch x of length m is an m-dimensional vector of non-negative integers
• We have a dataset X = {x1, x2, …, xn}, which is a dynamic set of n sketches
• Given a sketch y and a Hamming radius r as a query, we want to quickly find the similar sketches {xi : H(xi, y) ≤ r}
  ▹ H(∙, ∙) is the Hamming distance (i.e., the number of positions at which two sketches differ)
Example: query y = 111021 with r = 1
  x1 = 111020   H(x1, y) = 1 ≤ r   similar
  x2 = 001020   H(x2, y) = 3
  x3 = 032021   H(x3, y) = 3
  x4 = 113021   H(x4, y) = 1 ≤ r   similar
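The problem statement can be checked with a brute-force baseline. The following is a minimal sketch that scans the whole dataset, using the four example sketches and the query from this slide; the function names `hamming` and `linear_search` are made up for this illustration.

```python
def hamming(x, y):
    """Number of positions at which two equal-length sketches differ."""
    if len(x) != len(y):
        raise ValueError("sketches must have the same length")
    return sum(a != b for a, b in zip(x, y))


def linear_search(dataset, y, r):
    """Return the names of all sketches within Hamming radius r of y."""
    return [name for name, x in dataset.items() if hamming(x, y) <= r]


dataset = {
    "x1": "111020",
    "x2": "001020",
    "x3": "032021",
    "x4": "113021",
}
result = linear_search(dataset, "111021", 1)
print(result)  # ['x1', 'x4']
```

This scan costs time proportional to n for every query, which is the cost that an index is meant to avoid.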

Slide 7

Slide 7 text

State-of-the-art Similarity Search Methods
• Most methods use hash tables, but they are inefficient for dynamic datasets
• Recently, Eghbali et al. [IEEE TPAMI19] addressed this issue by using a search tree, but it is not applicable to integer sketches
• We propose new methods, DyFTs, for dynamic datasets of integer sketches, which leverage a trie data structure

Slide 8

Slide 8 text

Slide 9

Slide 9 text

Slide 10

Slide 10 text

Trie-based Similarity Search
• A trie is a labeled tree built by merging common prefixes of sketches
• The downward path from the root to a leaf represents the associated sketch
• Similarity search is performed by traversing nodes while counting the number of errors against the query sketch
• If the number of errors exceeds the radius, we stop and skip all descendants of the current node
• The time complexity is O(m^(r+2)), which does not depend on the dataset size n
[Figure: trie over sketches x1 to x8; a search for y = 111020 with r = 1 finds x1 and x7 as similar.]
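The traversal described above can be sketched as follows. This is a plain pointer-based trie written for illustration, not the authors' implementation; the class and function names are hypothetical, and the eight sketches are the ones listed on the DyFT slide of this deck.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # label -> TrieNode
        self.ids = []       # names of sketches ending at this node


def trie_insert(root, name, x):
    """Insert sketch x under the given name, sharing common prefixes."""
    node = root
    for label in x:
        node = node.children.setdefault(label, TrieNode())
    node.ids.append(name)


def trie_search(node, y, r, depth=0, errors=0):
    """Collect sketches within Hamming radius r of y, pruning on errors."""
    if depth == len(y):
        return list(node.ids)
    found = []
    for label, child in node.children.items():
        e = errors + (label != y[depth])
        if e <= r:  # otherwise skip the whole subtree
            found.extend(trie_search(child, y, r, depth + 1, e))
    return found


sketches = {
    "x1": "111020", "x2": "001020", "x3": "032021", "x4": "113021",
    "x5": "333110", "x6": "330110", "x7": "311020", "x8": "030120",
}
root = TrieNode()
for name, x in sketches.items():
    trie_insert(root, name, x)

similar = sorted(trie_search(root, "111020", 1))
print(similar)  # ['x1', 'x7']
```

Pruning happens as soon as the error count exceeds r, so subtrees such as the one starting with label 0 are abandoned after two mismatches.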

Slide 11

Slide 11 text

Two Issues on Trie Implementation
• Scalability
  ▹ A trie for a large database maintains many pointers and consumes a huge amount of memory
  ▹ Reducing redundant nodes is an often-used solution
  ▹ But there is no reduction technique for similarity searches
• Generality
  ▹ Sketches consist of integers from {0, 1, …, σ–1}
  ▹ σ is a given parameter depending on the hashing technique
    – σ ≤ 4 is recommended in MinHash
    – σ ≥ 16 is recommended in CWS
  ▹ But existing trie implementations have been designed for byte sketches, i.e., σ = 256
Our DyFT is a new similarity search method that solves these issues

Slide 12

Slide 12 text

Slide 13

Slide 13 text

Dynamic Filter Trie (DyFT)
• Trie-based similarity search for binary and integer sketches
  ▹ Store only some of the trie nodes around the root for memory efficiency
  ▹ Exploit the trie search algorithm for filtering out dissimilar sketches
Database X
  x1 = 111020   x2 = 001020   x3 = 032021   x4 = 113021
  x5 = 333110   x6 = 330110   x7 = 311020   x8 = 030120
Search for y = 111020 with r = 1
  Candidate solutions from the trie: x1, x4, x7
  Verification:
    H(x1, y) = 0 ≤ r   similar
    H(x4, y) = 2 > r   dissimilar
    H(x7, y) = 1 ≤ r   similar

Slide 14

Slide 14 text

Update Procedure
• Visit the deepest reachable leaf node v using the new sketch xi
• Append xi to the posting list Lv of leaf node v
• If the length of Lv (i.e., |Lv|) exceeds the threshold τ, split Lv and create new leaf nodes
[Figure: inserting x9 = 030110 reaches the leaf v on path 0, 3, whose posting list holds x3 and x8. x9 is appended to Lv; if |Lv| > τ, v is split into new leaves with labels 0 and 2.]
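The filter-then-verify search and the update procedure can be combined in a small sketch. It is a simplified illustration under stated assumptions: a fixed threshold `tau` stands in for the threshold discussed in the talk, nodes are plain Python objects rather than a compact trie, and the names `Node`, `DyFTSketch`, and `hamming` are hypothetical.

```python
def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))


class Node:
    def __init__(self, depth):
        self.depth = depth
        self.children = None  # None marks a leaf
        self.postings = []    # sketch ids, used by leaves only


class DyFTSketch:
    def __init__(self, tau):
        self.tau = tau        # fixed split threshold (an assumption here)
        self.root = Node(0)
        self.sketches = []

    def insert(self, x):
        i = len(self.sketches)
        self.sketches.append(x)
        v = self.root
        while v.children is not None:  # walk to the deepest reachable leaf
            label = x[v.depth]
            if label not in v.children:
                v.children[label] = Node(v.depth + 1)
            v = v.children[label]
        v.postings.append(i)
        if len(v.postings) > self.tau and v.depth < len(x):
            self._split(v)
        return i

    def _split(self, v):
        """Turn leaf v into an inner node and redistribute its postings."""
        v.children = {}
        for i in v.postings:
            label = self.sketches[i][v.depth]
            if label not in v.children:
                v.children[label] = Node(v.depth + 1)
            v.children[label].postings.append(i)
        v.postings = []

    def search(self, y, r):
        candidates = []
        stack = [(self.root, 0)]
        while stack:  # filtering: traverse the stored prefix nodes
            v, errors = stack.pop()
            if v.children is None:
                candidates.extend(v.postings)
                continue
            for label, u in v.children.items():
                e = errors + (label != y[v.depth])
                if e <= r:
                    stack.append((u, e))
        # verification: full Hamming distance on the surviving candidates
        return sorted(i for i in candidates if hamming(self.sketches[i], y) <= r)


index = DyFTSketch(tau=2)
data = ["111020", "001020", "032021", "113021", "333110",
        "330110", "311020", "030120", "030110"]
for x in data:
    index.insert(x)

hits = index.search("111020", 1)  # ids 0 and 6, i.e., x1 and x7
```

Because leaves hold posting lists instead of full suffix paths, only the nodes near the root are stored, and the verification step removes the false positives that the shallow trie lets through.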

Slide 15

Slide 15 text

What is a Reasonable Splitting Threshold τ?
• A reasonable value of τ can be determined depending on the configuration of the dataset and the given parameters of the hashing technique
• But it is impossible to search for such a reasonable value on dynamic datasets
[Figure: search time for different values of τ. If τ is large, the verification time is large; if τ is small, the traversal time is large. The best values are reversed, with a difference of one order of magnitude.]

Slide 16

Slide 16 text

Optimal Threshold τ*
• First, construct a search cost model
• Then, determine an optimal threshold τ* minimizing the search cost
  ▹ Keep leaf v if |Lv| ≤ τ*; split it if |Lv| > τ*
  ▹ τ* selects whichever case maintains the smaller cost
With τ*, DyFT can always achieve the fastest search

Slide 17

Slide 17 text

Definition of Search Cost SC(v)
• The search cost for node v is defined by SC(v) = RP(v) × CC(v)
  ▹ RP(v) is the Reach Probability, defined for a random sketch from a uniform distribution
  ▹ CC(v) is the Computational Cost, defined for inner and leaf nodes separately
• Inner node v (check children): SCin(v) = RP(v) × CCin(v)
• Leaf node v (verify the sketches in Lv): SCleaf(v) = RP(v) × CCleaf(v)

Slide 18

Slide 18 text

Optimal Threshold τ*
• Given leaf v, compare the search costs in the two cases:
  ▹ If keeping leaf v, the search cost is SCleaf(v)
  ▹ If splitting leaf v into children u1, u2, …, uk, the new search cost is SCin(v) + ∑ SCleaf(ui)
• We can derive the condition |Lv| > τ*(r, ℓ, σ) under which splitting maintains a smaller search cost
  ▹ τ*(r, ℓ, σ) is precomputable :)
DyFT can grow while maintaining fast similarity searches with few node pointers

Slide 19

Slide 19 text

Slide 20

Slide 20 text

How to implement DyFT efficiently?
• Good point :)
  ▹ There are many trie implementations
• Bad point :(
  ▹ They are designed for byte strings
  ▹ But sketches consist of general integers
• Our approach
  ▹ Reconstruct integer sketches into byte ones
    – e.g., x = 2 3 6 3 0 1 2 11 2… becomes x' = 0xF2 0xAE 0x53…
  ▹ Represent them using an adaptive radix tree (a space-efficient trie implementation)
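One way to reconstruct an integer sketch into bytes is to pack several small integers into each byte. The sketch below assumes a fixed width of `bits` per integer and a low-bits-first layout; both are assumptions for illustration, and the actual DyFT encoding may differ. The byte values on the slide look illustrative, so the output here is not expected to match them.

```python
def pack_sketch(x, bits):
    """Pack integers in [0, 2**bits) into bytes, 8 // bits values per byte.

    The first value of each group occupies the lowest bits of its byte.
    """
    if bits not in (1, 2, 4, 8):
        raise ValueError("bits must divide 8")
    per_byte = 8 // bits
    out = bytearray()
    for start in range(0, len(x), per_byte):
        byte = 0
        for k, value in enumerate(x[start:start + per_byte]):
            if not 0 <= value < (1 << bits):
                raise ValueError("value does not fit in the given width")
            byte |= value << (k * bits)
        out.append(byte)
    return bytes(out)


packed = pack_sketch([2, 3, 6, 3, 0, 1, 2, 11], bits=4)  # sigma = 16
print(packed.hex())  # 323610b2
```

Note that one mismatching byte can hide several mismatching integers, so a search over packed bytes still has to count errors per original integer rather than per byte.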

Slide 21

Slide 21 text

Adaptive Radix Tree [Leis+, ICDE13]
• Adaptively select a space-efficient node implementation depending on the number of children
  ▹ For a node with few children, use a list-based data structure
  ▹ For a node with a moderate number of children, use a hybrid data structure of a list and an array
  ▹ For a node with many children, use an array-based data structure
• The data structure is modified for node traversal in similarity search

Slide 22

Slide 22 text

Summary of Our Method
• What is an issue in modern similarity search?
  ▹ There is no efficient dynamic data structure for integer sketches
• What are the issues in trie-based similarity search?
  ▹ Scalability: There is no node reduction technique for similarity search
  ▹ Generality: There is no node implementation technique for integer sketches
Our answers:
• We developed DyFT based on a trie data structure
• We constructed a search cost model, defined an optimal threshold, and reduced DyFT nodes while maintaining fast similarity searches
• We reconstructed integer sketches into byte sketches to leverage an existing trie implementation technique

Slide 23

Slide 23 text

Slide 24

Slide 24 text

Experimental Setup
• Dataset
  ▹ 216 million compound-protein pairs
    – Each pair is represented as a 3.6 million dimensional binary fingerprint
  ▹ We converted the fingerprints into binary and integer sketches using Li's minhash algorithm for Jaccard similarity [Li+, WWW10]
  ▹ We constructed an index by inserting sketches in random order
• Query set
  ▹ We randomly sampled 1000 sketches from the dataset
• Code
  ▹ We implemented all data structures in C++17
  ▹ Source code is available at https://github.com/kampersanda/dyft

Slide 25

Slide 25 text

Analysis for Optimal Threshold τ*
[Figure: search time (ms/query) versus dataset size n for binary sketches and integer sketches, comparing τ* with fixed thresholds τ = 1, 10, 100.]
• The optimal threshold τ* is the fastest in most cases
• The search times with fixed thresholds τ = 1, 10, 100 are reversed according to the dataset size n

Slide 26

Slide 26 text

Comparison with the State of the Art (Binary Sketches)
[Figure: search time (ms/query), update time (sec), and memory usage (GB); annotations: 1600x faster, competitive, 13x smaller.]
DyFT was
• four orders of magnitude faster on the search time
• competitive on the update time
• one order of magnitude smaller on the memory

Slide 27

Slide 27 text

Comparison with the State of the Art (Integer Sketches)
[Figure: search time (ms/query), update time (sec), and memory usage (GB); annotations: always faster, competitive, always smaller.]
DyFT was
• always faster on the search time
• competitive on the update time
• always smaller on the memory

Slide 28

Slide 28 text

Summary of Our Method
• What is an issue in modern similarity search?
  ▹ There is no efficient dynamic data structure for integer sketches
• What are the issues in trie-based similarity search?
  ▹ Scalability: There is no node reduction technique for similarity search
  ▹ Generality: There is no node implementation technique for integer sketches
Our answers:
• We developed DyFT based on a trie data structure
• We reconstructed integer sketches into byte sketches to leverage an existing trie implementation technique
• We constructed a search cost model, defined an optimal threshold, and reduced DyFT nodes while maintaining fast similarity searches