Background
- Finding similar objects in high-dimensional vector data is a fundamental task in modern data analysis
- Similarity-preserving hashing maps the original high-dimensional space (q, xi), where comparisons are costly, to a low-dimensional Hamming space (q’, x’i), where comparisons are lightweight
- e.g., near-duplicate detection in Web pages [Manku et al., WWW07]
  - Web pages are converted into 64-bit vectors using SimHash
  - Jaccard similarity between Web pages is approximated by Hamming distance between 64-bit vectors

Problem definition
- We have n binary strings x1, …, xn, each of length m
- Given a query string q and a threshold t, the goal is to report all string ids i such that Ham(q, xi) ≤ t
  - where Ham(·, ·) is the Hamming distance between two strings

Example: given q = 0000 0111 and t = 2

  id   string       Ham(q, xi)
  x1   0000 0000    3 > t
  x2   0000 0111    0 ≤ t
  x3   0000 1111    1 ≤ t
  x4   1001 1111    3 > t

Results = {2, 3}

How to solve?
- Brute-force linear scan: O(n) time
  - Ham(x, y) can be computed as popcnt(x xor y) in O(1) time when a string fits within a machine word
- Modern solutions use inverted-index-based approaches

Example: Ham(0000 0111, 1001 1111) = 3
  0000 0111 xor 1001 1111 = 1001 1000
  popcnt(1001 1000) = 3
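
The linear scan above can be sketched as follows. This is only an illustration, assuming each string fits in one Python integer; the names `hamming` and `linear_scan` are made up for this example and do not come from the slides.

```python
def hamming(x: int, y: int) -> int:
    """Hamming distance between two bit strings packed into integers."""
    # XOR leaves a 1 exactly where the strings differ; count those bits.
    return bin(x ^ y).count("1")


def linear_scan(dataset, q: int, t: int):
    """Return the 1-based ids i with Ham(q, x_i) <= t by checking every string."""
    return [i for i, x in enumerate(dataset, start=1) if hamming(q, x) <= t]


# Dataset x1..x4 from the running example.
dataset = [0b00000000, 0b00000111, 0b00001111, 0b10011111]

print(hamming(0b00000111, 0b10011111))      # 3
print(linear_scan(dataset, 0b00000111, 2))  # [2, 3]
```

Every query touches all n strings, which is the cost that the index-based approaches try to avoid.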

Inverted-index-based solution
- Approach
  - Build an inverted index from the strings
  - Generate the set of strings whose Hamming distance from the query q is no more than t, Q = {q’ ∈ {0,1}^m : Ham(q, q’) ≤ t}, called signatures
  - Find the string ids whose keys are in Q by looking them up in the index
- Problem
  - |Q| becomes too large for long strings and large thresholds: $|Q| = \sum_{k=0}^{t} \binom{m}{k}$

Index

  key          id
  0000 0000    1
  0000 0111    2
  0000 1111    3
  1001 1111    4

Example: given q = 0000 0111 and t = 1, generate
  Q = { 0000 0111, 1000 0111, 0100 0111, 0010 0111, 0001 0111, 0000 1111, 0000 0011, 0000 0101, 0000 0110 }
  Results = {2, 3}
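
A minimal sketch of the signature-based lookup described above, assuming strings are packed into Python integers and the index is a plain dictionary. The function names (`signatures`, `build_index`, `search`, `num_signatures`) are hypothetical and chosen only for this illustration.

```python
from itertools import combinations
from math import comb


def signatures(q: int, m: int, t: int):
    """All m-bit strings q' with Ham(q, q') <= t."""
    sigs = set()
    for k in range(t + 1):
        # Flip every possible choice of k bit positions.
        for positions in combinations(range(m), k):
            s = q
            for p in positions:
                s ^= 1 << p
            sigs.add(s)
    return sigs


def build_index(dataset):
    """Inverted index: string -> list of 1-based ids."""
    index = {}
    for i, x in enumerate(dataset, start=1):
        index.setdefault(x, []).append(i)
    return index


def search(index, q: int, m: int, t: int):
    """Look up every signature of q in the index and collect the ids."""
    ids = []
    for s in signatures(q, m, t):
        ids.extend(index.get(s, []))
    return sorted(ids)


def num_signatures(m: int, t: int) -> int:
    """|Q| = sum of C(m, k) for k = 0..t."""
    return sum(comb(m, k) for k in range(t + 1))


dataset = [0b00000000, 0b00000111, 0b00001111, 0b10011111]
index = build_index(dataset)

print(len(signatures(0b00000111, 8, 1)))  # 9
print(search(index, 0b00000111, 8, 1))    # [2, 3]
print(num_signatures(64, 3))              # 43745
```

The last line shows how quickly |Q| grows: for 64-bit strings, a threshold of only 3 already requires tens of thousands of lookups per query.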

Multi-index approach
- Aim
  - To make the index approach practical for large parameters as well
- Preprocessing
  - Partition each string xi into b disjoint blocks
  - Build an inverted index for each block

Example: when b = 2,

  id   1st    2nd
  x1   0000   0000
  x2   0000   0111
  x3   0000   1111
  x4   1001   1111

  1st index          2nd index
  key    ids         key    ids
  0000   1, 2, 3     0000   1
  1001   4           0111   2
                     1111   3, 4

Multi-index approach
- Query processing: filter-and-verification strategy
  - Partition q into b disjoint blocks q1, q2, …, qb
  - (Filter phase): Obtain candidates by looking up each index with Qj = {q’ ∈ {0,1}^(m/b) : Ham(qj, q’) ≤ ⌊t/b⌋} for each block j
    - The threshold ⌊t/b⌋ is based on the pigeonhole principle
  - (Verification phase): Verify the candidates against the original strings x1, …, xn by computing the Hamming distance

Example: given q = 0000 0111 and t = 1 on the dataset x1 = 0000 0000, x2 = 0000 0111, x3 = 0000 1111, x4 = 1001 1111
  Candidates = {1, 2, 3}
  Verification: Ham(q, x1) = 3 > t, Ham(q, x2) = 0 ≤ t, Ham(q, x3) = 1 ≤ t
  Results = {2, 3}

Question: Does the threshold ⌊t/b⌋ guarantee that false negatives never occur?
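
The filter-and-verification steps above can be sketched as follows. This is an illustrative sketch under stated assumptions: m is divisible by b, strings are packed into Python integers, and each block index is a plain dictionary. All function names are hypothetical.

```python
from itertools import combinations


def hamming(x: int, y: int) -> int:
    return bin(x ^ y).count("1")


def split_blocks(x: int, m: int, b: int):
    """Split an m-bit string into b blocks of m // b bits, leftmost block first."""
    w = m // b
    mask = (1 << w) - 1
    return [(x >> (w * (b - 1 - j))) & mask for j in range(b)]


def signatures(q: int, w: int, t: int):
    """All w-bit strings within Hamming distance t of q."""
    sigs = set()
    for k in range(t + 1):
        for positions in combinations(range(w), k):
            s = q
            for p in positions:
                s ^= 1 << p
            sigs.add(s)
    return sigs


def build_multi_index(dataset, m: int, b: int):
    """One inverted index (block value -> 1-based ids) per block position."""
    indexes = [{} for _ in range(b)]
    for i, x in enumerate(dataset, start=1):
        for j, block in enumerate(split_blocks(x, m, b)):
            indexes[j].setdefault(block, []).append(i)
    return indexes


def search(dataset, indexes, q: int, m: int, b: int, t: int):
    """Return (candidates, results) for query q and threshold t."""
    w = m // b
    candidates = set()
    # Filter phase: per-block threshold floor(t / b) from the basic pigeonhole principle.
    for j, qj in enumerate(split_blocks(q, m, b)):
        for s in signatures(qj, w, t // b):
            candidates.update(indexes[j].get(s, []))
    # Verification phase: check each candidate against the full string.
    results = [i for i in sorted(candidates) if hamming(q, dataset[i - 1]) <= t]
    return sorted(candidates), results


dataset = [0b00000000, 0b00000111, 0b00001111, 0b10011111]
indexes = build_multi_index(dataset, m=8, b=2)
candidates, results = search(dataset, indexes, 0b00000111, m=8, b=2, t=1)
print(candidates)  # [1, 2, 3]
print(results)     # [2, 3]
```

With t = 1 and b = 2 the per-block threshold is 0, so each block index is probed with a single signature, and the false positive x1 is removed only in the verification phase.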

Pigeonhole principle
- If n items are placed in m boxes, then at least one box contains no more than ⌊n/m⌋ items
- Many existing solutions are based on this principle
  - Google (WWW2007), HEngine (ICDE2013), HmSearch (SSDBM2013), MIH (CVPR2012), PartAlloc (VLDB2015), multi-index* (SIGIR2016), and so on

Example: 4 boxes and 7 items
  q = 0000 1111 0000 1111
  x = 0011 0011 0011 1011
  Ham(q4, x4) ≤ ⌊7/4⌋ = 1

Qin’s claim in ICDE18
- Is the threshold assignment based on the (basic) pigeonhole principle tight?
- Tightness means that
  - (correctness) false negatives never occur with the threshold assignment, and
  - (minimality) there is no other threshold assignment whose values are smaller (simplified)
- Smaller thresholds can reduce filter and verification costs
- Unfortunately, the existing assignment is not tight
- The general pigeonhole principle offers such a tight assignment

Flexible pigeonhole principle
- Lemma (real)
  - Given strings q = q1, q2, …, qb and x = x1, x2, …, xb, and a threshold t
  - Consider thresholds t1, t2, …, tb such that each tj is a real number and ∑ tj = t
  - If Ham(q, x) ≤ t, there exists at least one block j such that Ham(qj, xj) ≤ ⌊tj⌋
- Proof
  - Assume there is no block j such that Ham(qj, xj) ≤ ⌊tj⌋
  - Then every block satisfies Ham(qj, xj) ≥ ⌊tj⌋ + 1 > tj, because block distances are integers
  - Hence Ham(q, x) = ∑ Ham(qj, xj) > ∑ tj = t, which contradicts Ham(q, x) ≤ t

Example: when t = 4,

        1st    2nd    3rd    4th
  qj    0000   1111   0000   1111
  xj    0001   0011   0011   1111
  tj    0.8    0.8    0.8    1.6     ∑ tj = t = 4
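
The lemma can be checked on the example above with a short sketch. The helper names (`passing_blocks`, `lemma_holds`, `filter_lookups`) are hypothetical, and the exhaustive check is only an illustration for small inputs, not part of any published implementation. Note that in this example Ham(q, x) = 5 > t, so x passes the filter through its 4th block and would be discarded in the verification phase.

```python
from itertools import combinations
from math import comb, floor


def hamming(x: int, y: int) -> int:
    return bin(x ^ y).count("1")


def passing_blocks(q_blocks, x_blocks, thresholds):
    """0-based block indices j with Ham(q_j, x_j) <= floor(t_j)."""
    return [j for j, (qj, xj, tj) in enumerate(zip(q_blocks, x_blocks, thresholds))
            if hamming(qj, xj) <= floor(tj)]


def lemma_holds(q_blocks, w: int, thresholds, t: int) -> bool:
    """Exhaustively check: every x with Ham(q, x) <= t passes at least one block."""
    b = len(q_blocks)
    m = w * b
    q = 0
    for blk in q_blocks:  # pack the blocks, leftmost block first
        q = (q << w) | blk
    mask = (1 << w) - 1
    for k in range(t + 1):
        for positions in combinations(range(m), k):
            x = q
            for p in positions:
                x ^= 1 << p
            x_blocks = [(x >> (w * (b - 1 - j))) & mask for j in range(b)]
            if not passing_blocks(q_blocks, x_blocks, thresholds):
                return False
    return True


def filter_lookups(w: int, thresholds) -> int:
    """Total number of signatures probed across all block indexes."""
    return sum(sum(comb(w, k) for k in range(floor(tj) + 1)) for tj in thresholds)


q_blocks = [0b0000, 0b1111, 0b0000, 0b1111]
x_blocks = [0b0001, 0b0011, 0b0011, 0b1111]
thresholds = [0.8, 0.8, 0.8, 1.6]  # real-valued, summing to t = 4

print([hamming(a, c) for a, c in zip(q_blocks, x_blocks)])  # [1, 2, 2, 0]
print(passing_blocks(q_blocks, x_blocks, thresholds))       # [3]
print(lemma_holds(q_blocks, 4, thresholds, 4))              # True
print(filter_lookups(4, [1, 1, 1, 1]))                      # 20
print(filter_lookups(4, thresholds))                        # 8
```

The last two lines compare the number of signature lookups under the basic assignment ⌊t/b⌋ = 1 per block with the assignment from the example, whose floors are 0, 0, 0, 1.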

Results
- Method
  - GPH: Multi-index based on the general principle
  - MIH: Multi-index based on the basic principle
- Dataset
  - SIFT: a billion binary strings of length 128

Next stage
- Utilizing data skewness
  - Varying block lengths depending on the data skewness
  - Qin et al., “GPH: Similarity Search in Hamming Space,” ICDE, 2018
- Utilizing adjacent thresholds
  - To shorten the verification time with stronger constraints
  - Qin and Xiao, “Pigeonring: A Principle for Faster Thresholded Similarity Search,” VLDB, 2018