
# StringBeginners#1

General Pigeonhole Principle

#### Shunsuke Kanda

November 02, 2018

## Transcript

3. ### Background

- Finding similar objects in high-dimensional vector data is a fundamental task in modern data analysis.
- Example: near-duplicate detection in Web pages [Manku et al., WWW07]
  - Web pages are converted into 64-bit vectors using SimHash.
  - Jaccard similarity between Web pages (costly) is approximated by Hamming distance between 64-bit vectors (lightweight).

Figure: similarity-preserving hashing maps the original space (high dimension; q, xi) to a Hamming space (low dimension; q', x'i).
4. ### Problem definition

- We have n binary strings x1, …, xn, each of length m.
- Given a query string q and a threshold t, the goal is to report all string ids i such that Ham(q, xi) ≤ t,
  - where Ham(·, ·) is the Hamming distance between two strings.

Dataset: x1 = 0000 0000, x2 = 0000 0111, x3 = 0000 1111, x4 = 1001 1111

Example: given q = 0000 0111 and t = 2,

- Ham(0000 0111, 0000 0000) = 3 > t
- Ham(0000 0111, 0000 0111) = 0 ≤ t
- Ham(0000 0111, 0000 1111) = 1 ≤ t
- Ham(0000 0111, 1001 1111) = 3 > t

so the reported ids are 2 and 3.
5. ### How to solve?

- Brute-force linear scan: O(n) time
  - Ham(x, y) can be computed as popcnt(x xor y) in O(1) time when a string fits within a machine word.
- Modern solutions use inverted-index-based approaches.

Dataset (scanned in O(n)): x1 = 0000 0000, x2 = 0000 0111, x3 = 0000 1111, x4 = 1001 1111

Example: Ham(0000 0111, 1001 1111) = 3, because 0000 0111 xor 1001 1111 = 1001 1000 and popcnt(1001 1000) = 3.
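The popcount trick and the linear scan can be sketched in a few lines of Python. This is only an illustration, assuming each string is packed into a Python integer; the function names `hamming` and `linear_scan` are hypothetical and not taken from the slides.

```python
def hamming(x: int, y: int) -> int:
    """Hamming distance between two bit strings packed into integers."""
    return bin(x ^ y).count("1")  # popcount of the XOR


def linear_scan(dataset, q, t):
    """Return the 1-based ids i with Ham(q, x_i) <= t by checking every string."""
    return [i for i, x in enumerate(dataset, start=1) if hamming(q, x) <= t]


# Dataset x1..x4 from the slides, written as 8-bit literals.
dataset = [0b00000000, 0b00000111, 0b00001111, 0b10011111]

print(hamming(0b00000111, 0b10011111))      # 3
print(linear_scan(dataset, 0b00000111, 2))  # [2, 3]
```

The scan touches every string once, which is the O(n) cost that the index-based approaches try to avoid.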
6. ### Inverted-index-based solution

- Approach
  - Build an inverted index from the strings.
  - Generate the set of strings whose Hamming distance to the query q is no more than t, Q = {q' ∈ {0,1}^m : Ham(q, q') ≤ t}, called signatures.
  - Find the set of string ids whose key is in Q by looking up the index.
- Problem
  - |Q| becomes too large for long strings and large thresholds:

$$|Q| = \sum_{k=0}^{t} \binom{m}{k}$$

Dataset: x1 = 0000 0000, x2 = 0000 0111, x3 = 0000 1111, x4 = 1001 1111

Index (key → id): 0000 0000 → 1, 0000 0111 → 2, 0000 1111 → 3, 1001 1111 → 4

Example: given q = 0000 0111 and t = 1, generate Q = { 0000 0111, 1000 0111, 0100 0111, 0010 0111, 0001 0111, 0000 1111, 0000 0011, 0000 0101, 0000 0110 }. Results = {2, 3}.
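As a rough sketch of the signature approach, the following Python enumerates Q by flipping up to t bits of q and looks each signature up in a dictionary-based index. The helper names `signatures` and `search` are hypothetical, and a plain `dict` stands in for whatever index structure a real system would use.

```python
from itertools import combinations
from math import comb


def signatures(q: int, m: int, t: int):
    """Yield every m-bit string within Hamming distance t of q."""
    for k in range(t + 1):
        for positions in combinations(range(m), k):
            s = q
            for p in positions:
                s ^= 1 << p  # flip bit p
            yield s


def search(index, q, m, t):
    """Collect the ids stored under any signature of q."""
    ids = []
    for s in signatures(q, m, t):
        ids.extend(index.get(s, []))
    return sorted(ids)


dataset = [0b00000000, 0b00000111, 0b00001111, 0b10011111]
index = {}
for i, x in enumerate(dataset, start=1):
    index.setdefault(x, []).append(i)

q = 0b00000111
sigs = list(signatures(q, 8, 1))
print(len(sigs), sum(comb(8, k) for k in range(2)))  # 9 9
print(search(index, q, 8, 1))                        # [2, 3]
```

The number of generated signatures matches the binomial sum for |Q|, which is why this approach degrades quickly as m and t grow.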
7. ### Multi-index approach

- Aim
  - To leverage the index approach also for large parameters
- Preprocessing
  - Partition each string xi into b disjoint blocks.
  - Build an inverted index for each block.

Example with b = 2:

- Dataset (1st block, 2nd block): x1 = 0000 0000, x2 = 0000 0111, x3 = 0000 1111, x4 = 1001 1111
- 1st index (key → ids): 0000 → 1, 2, 3; 1001 → 4
- 2nd index (key → ids): 0000 → 1; 0111 → 2; 1111 → 3, 4
8. ### Multi-index approach

- Query processing: filter-and-verification strategy
  - Partition q into b disjoint blocks q1, q2, …, qb.
  - (Filter phase) Obtain candidates by looking up each index with Qj = {q' ∈ {0,1}^(m/b) : Ham(qj, q') ≤ ⌊t/b⌋} for each block j.
    - The threshold ⌊t/b⌋ is based on the pigeonhole principle.

Example: given q = 0000 0111 and t = 1,

- 1st index (key → ids): 0000 → 1, 2, 3; 1001 → 4
- 2nd index (key → ids): 0000 → 1; 0111 → 2; 1111 → 3, 4
- 1st block: q1 = 0000, ⌊t/b⌋ = ⌊1/2⌋ = 0, so Q1 = {0000}
- 2nd block: q2 = 0111, ⌊t/b⌋ = 0, so Q2 = {0111}
- Candidates = {1, 2, 3}
9. ### Multi-index approach

- Query processing: filter-and-verification strategy
  - Partition q into b disjoint blocks q1, q2, …, qb.
  - (Filter phase) Obtain candidates by looking up each index with Qj = {q' ∈ {0,1}^(m/b) : Ham(qj, q') ≤ ⌊t/b⌋} for each block j.
    - The threshold ⌊t/b⌋ is based on the pigeonhole principle.
  - (Verification phase) Verify those candidates against the original strings x1, …, xn by computing the Hamming distance.

Dataset: x1 = 0000 0000, x2 = 0000 0111, x3 = 0000 1111, x4 = 1001 1111

Example: given q = 0000 0111 and t = 1, Candidates = {1, 2, 3}. Verification:

- Ham(q, x1) = 3 > t
- Ham(q, x2) = 0 ≤ t
- Ham(q, x3) = 1 ≤ t

Results = {2, 3}

Question: Does the threshold ⌊t/b⌋ never allow false negatives?
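The filter-and-verification strategy can be sketched as follows. This is a minimal illustration rather than the implementation of any cited system: it assumes m is divisible by b, packs strings into Python integers, and uses hypothetical helper names (`split_blocks`, `build_indexes`, `query`).

```python
from itertools import combinations


def hamming(x, y):
    return bin(x ^ y).count("1")


def split_blocks(x, m, b):
    """Split an m-bit integer into b blocks of m // b bits, leftmost block first."""
    w = m // b
    mask = (1 << w) - 1
    return [(x >> (w * (b - 1 - j))) & mask for j in range(b)]


def signatures(q, m, t):
    """Yield every m-bit string within Hamming distance t of q."""
    for k in range(t + 1):
        for positions in combinations(range(m), k):
            s = q
            for p in positions:
                s ^= 1 << p
            yield s


def build_indexes(dataset, m, b):
    """One inverted index (block value -> set of ids) per block position."""
    indexes = [dict() for _ in range(b)]
    for i, x in enumerate(dataset, start=1):
        for j, blk in enumerate(split_blocks(x, m, b)):
            indexes[j].setdefault(blk, set()).add(i)
    return indexes


def query(dataset, indexes, q, m, b, t):
    w = m // b
    candidates = set()
    # Filter phase: per-block lookup with threshold floor(t / b).
    for j, qj in enumerate(split_blocks(q, m, b)):
        for s in signatures(qj, w, t // b):
            candidates |= indexes[j].get(s, set())
    # Verification phase: check candidates against the full strings.
    results = sorted(i for i in candidates if hamming(q, dataset[i - 1]) <= t)
    return sorted(candidates), results


dataset = [0b00000000, 0b00000111, 0b00001111, 0b10011111]
indexes = build_indexes(dataset, m=8, b=2)
candidates, results = query(dataset, indexes, 0b00000111, m=8, b=2, t=1)
print(candidates)  # [1, 2, 3]
print(results)     # [2, 3]
```

Each block lookup enumerates signatures over only m/b bits with threshold ⌊t/b⌋, so the signature sets stay small; the price is the extra verification pass over the candidates.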
10. ### Pigeonhole principle () 10 } If n items are contained

in m boxes, then at least one box has no more than ⌊n/m⌋ items 4 boxes 7 items
11. ### Pigeonhole principle () 11 } If n items are contained

in m boxes, then at least one box has no more than ⌊n/m⌋ items 4 boxes 7 items ⌊n/m⌋ = ⌊7/4⌋ = 1
12. ### Pigeonhole principle () 12 } If n items are contained

in m boxes, then at least one box has no more than ⌊n/m⌋ items } Many existing solutions based on the principle } Google (WWW2007), HEngine (ICDE2013), HmSearch (SSDBM2013), MIH (CVPR2012), PartAlloc (VLDB2015), multi- index* (SIGIR2016) and so on… q = 0000 1111 0000 1111 x = 0011 0011 0011 1011 4 boxes and 7 items Ham(q4 , x4 ) ≤ ⌊7/4⌋ = 1
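The example can be checked numerically. The short Python sketch below treats each 4-bit block as a box and each mismatched bit as an item; the helper name `block_distances` is hypothetical.

```python
def block_distances(q_blocks, x_blocks):
    """Per-block Hamming distances between two block-partitioned strings."""
    return [bin(a ^ b).count("1") for a, b in zip(q_blocks, x_blocks)]


# q = 0000 1111 0000 1111 and x = 0011 0011 0011 1011, split into 4 blocks.
q_blocks = [0b0000, 0b1111, 0b0000, 0b1111]
x_blocks = [0b0011, 0b0011, 0b0011, 0b1011]

d = block_distances(q_blocks, x_blocks)
print(d, sum(d))                   # [2, 2, 2, 1] 7
print(min(d) <= sum(d) // len(d))  # True
```

The fourth block carries only one mismatch, which is the box with no more than ⌊7/4⌋ = 1 items.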
13. ### Qin's claim in ICDE18

- Is the threshold assignment based on the (basic) pigeonhole principle tight?
- Tightness means that
  - (correctness) a false negative never occurs with the threshold assignment, and
  - (minimality) there does not exist another threshold assignment whose values are smaller (simplified).
- Smaller thresholds can reduce filter and verification costs.
- Unfortunately, the existing assignment is not tight.
- The general pigeonhole principle offers such a tight assignment.
14. ### Flexible pigeonhole principle

- Lemma (integer)
  - Given strings q = q1, q2, …, qb and x = x1, x2, …, xb, and a threshold t
  - Consider thresholds t1, t2, …, tb such that the tj are integers and ∑ tj = t.
  - If Ham(q, x) ≤ t, there exists at least one block j such that Ham(qj, xj) ≤ tj.
- Proof
  - Assume there is no block j such that Ham(qj, xj) ≤ tj, that is, Ham(qj, xj) > tj for every j.
  - Then Ham(q, x) = ∑ Ham(qj, xj) > ∑ tj = t, which contradicts Ham(q, x) ≤ t.

Example: when t = 4, ∑ tj = t = 4

| | 1st | 2nd | 3rd | 4th |
|---|---|---|---|---|
| qj | 0000 | 1111 | 0000 | 1111 |
| xj | 0001 | 0011 | 0011 | 1111 |
| tj | 1 | 1 | 1 | 1 |
15. ### Flexible pigeonhole principle

- Lemma (real)
  - Given strings q = q1, q2, …, qb and x = x1, x2, …, xb, and a threshold t
  - Consider thresholds t1, t2, …, tb such that the tj are real numbers and ∑ tj = t.
  - If Ham(q, x) ≤ t, there exists at least one block j such that Ham(qj, xj) ≤ ⌊tj⌋.
- Proof
  - Assume there is no block j such that Ham(qj, xj) ≤ ⌊tj⌋. Because Ham(qj, xj) is an integer, Ham(qj, xj) ≥ ⌊tj⌋ + 1 > tj for every j.
  - Then Ham(q, x) = ∑ Ham(qj, xj) > ∑ tj = t, which contradicts Ham(q, x) ≤ t.

Example: when t = 4, ∑ tj = t = 4

| | 1st | 2nd | 3rd | 4th |
|---|---|---|---|---|
| qj | 0000 | 1111 | 0000 | 1111 |
| xj | 0001 | 0011 | 0011 | 1111 |
| tj | 0.8 | 0.8 | 0.8 | 1.6 |
16. ### Flexible pigeonhole principle

- Lemma (real)
  - Given strings q = q1, q2, …, qb and x = x1, x2, …, xb, and a threshold t
  - Consider thresholds t1, t2, …, tb such that the tj are real numbers and ∑ tj = t.
  - If Ham(q, x) ≤ t, there exists at least one block j such that Ham(qj, xj) ≤ ⌊tj⌋.
- Proof
  - Assume there is no block j such that Ham(qj, xj) ≤ ⌊tj⌋. Because Ham(qj, xj) is an integer, Ham(qj, xj) ≥ ⌊tj⌋ + 1 > tj for every j.
  - Then Ham(q, x) = ∑ Ham(qj, xj) > ∑ tj = t, which contradicts Ham(q, x) ≤ t.

Example: when t = 4,

| | 1st | 2nd | 3rd | 4th |
|---|---|---|---|---|
| qj | 0000 | 1111 | 0000 | 1111 |
| xj | 0001 | 0011 | 0011 | 1111 |
| tj | 0.8 | 0.8 | 0.8 | 1.6 |
| ⌊tj⌋ (integer reduction) | 0 | 0 | 0 | 1 |
| ⌊t/b⌋ | 1 | 1 | 1 | 1 |
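The real-valued lemma can be sanity-checked by brute force for the example thresholds. The sketch below is only an illustration under stated assumptions: it enumerates every possible vector of per-block distances for four 4-bit blocks, uses `Fraction` so that 0.8 and 1.6 sum to exactly 4, and relies on a hypothetical helper `passes_filter`.

```python
from fractions import Fraction
from itertools import product
from math import floor


def passes_filter(dists, thresholds):
    """True if some block j satisfies d_j <= floor(t_j)."""
    return any(d <= floor(tj) for d, tj in zip(dists, thresholds))


t = 4
thresholds = [Fraction(4, 5)] * 3 + [Fraction(8, 5)]  # 0.8, 0.8, 0.8, 1.6
floors = [floor(tj) for tj in thresholds]
print(sum(thresholds) == t, floors)  # True [0, 0, 0, 1]

# Every distance vector with total <= t must pass at least one block filter.
all_ok = all(
    passes_filter(d, thresholds)
    for d in product(range(5), repeat=4)  # each 4-bit block differs in 0..4 bits
    if sum(d) <= t
)
print(all_ok)  # True
```

The floored thresholds 0, 0, 0, 1 are smaller than the basic assignment 1, 1, 1, 1, yet the check finds no string within distance t that every block filter would miss.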
17. ### General pigeonhole principle

- Theorem
  - Given strings q = q1, q2, …, qb and x = x1, x2, …, xb, and a threshold t
  - Consider thresholds t1, t2, …, tb such that the tj are integers and ∑ tj = t - b + 1.
  - If Ham(q, x) ≤ t, there exists at least one block j such that Ham(qj, xj) ≤ tj.
- Proof

| | 1st | 2nd | … | (b-1)th | bth | sum |
|---|---|---|---|---|---|---|
| tj | t1 | t2 | … | t_{b-1} | tb | t - b + 1 |
18. ### General pigeonhole principle

- Theorem
  - Given strings q = q1, q2, …, qb and x = x1, x2, …, xb, and a threshold t
  - Consider thresholds t1, t2, …, tb such that the tj are integers and ∑ tj = t - b + 1.
  - If Ham(q, x) ≤ t, there exists at least one block j such that Ham(qj, xj) ≤ tj.
- Proof

| | 1st | 2nd | … | (b-1)th | bth | sum |
|---|---|---|---|---|---|---|
| tj | t1 | t2 | … | t_{b-1} | tb | t - b + 1 |
| change | +1 | +1 | … | +1 | +0 | |
| t'j | t1+1 | t2+1 | … | t_{b-1}+1 | tb | t (Lemma integer) |
19. ### General pigeonhole principle

- Theorem
  - Given strings q = q1, q2, …, qb and x = x1, x2, …, xb, and a threshold t
  - Consider thresholds t1, t2, …, tb such that the tj are integers and ∑ tj = t - b + 1.
  - If Ham(q, x) ≤ t, there exists at least one block j such that Ham(qj, xj) ≤ tj.
- Proof (ε: small positive real number)

| | 1st | 2nd | … | (b-1)th | bth | sum |
|---|---|---|---|---|---|---|
| tj | t1 | t2 | … | t_{b-1} | tb | t - b + 1 |
| change | +1 | +1 | … | +1 | +0 | |
| t'j | t1+1 | t2+1 | … | t_{b-1}+1 | tb | t (Lemma integer) |
| change | -ε | -ε | … | -ε | +(b-1)ε | |
| t''j | t1+1-ε | t2+1-ε | … | t_{b-1}+1-ε | tb+(b-1)ε | t (Lemma real) |
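20. ### General pigeonhole principle

- Theorem
  - Given strings q = q1, q2, …, qb and x = x1, x2, …, xb, and a threshold t
  - Consider thresholds t1, t2, …, tb such that the tj are integers and ∑ tj = t - b + 1.
  - If Ham(q, x) ≤ t, there exists at least one block j such that Ham(qj, xj) ≤ tj.
- Proof (ε: small positive real number)

| | 1st | 2nd | … | (b-1)th | bth | sum |
|---|---|---|---|---|---|---|
| tj | t1 | t2 | … | t_{b-1} | tb | t - b + 1 |
| change | +1 | +1 | … | +1 | +0 | |
| t'j | t1+1 | t2+1 | … | t_{b-1}+1 | tb | t (Lemma integer) |
| change | -ε | -ε | … | -ε | +(b-1)ε | |
| t''j | t1+1-ε | t2+1-ε | … | t_{b-1}+1-ε | tb+(b-1)ε | t (Lemma real) |
| ⌊t''j⌋ (integer reduction) | t1 | t2 | … | t_{b-1} | tb | t - b + 1 |

Since ⌊t''j⌋ = tj for every j when (b-1)ε < 1, applying Lemma (real) to the thresholds t''j yields a block j with Ham(qj, xj) ≤ ⌊t''j⌋ = tj.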
21. ### General pigeonhole principle

- Theorem
  - Given strings q = q1, q2, …, qb and x = x1, x2, …, xb, and a threshold t
  - Consider thresholds t1, t2, …, tb such that the tj are integers and ∑ tj = t - b + 1.
  - If Ham(q, x) ≤ t, there exists at least one block j such that Ham(qj, xj) ≤ tj.

Example: when t = 4 and b = 4, ∑ tj = t - b + 1 = 1

| | 1st | 2nd | 3rd | 4th |
|---|---|---|---|---|
| qj | 0000 | 1111 | 0000 | 1111 |
| xj | 0001 | 0011 | 0011 | 1111 |
| tj | 0 | 0 | 0 | 1 |
| ⌊t/b⌋ | 1 | 1 | 1 | 1 |
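A brute-force check over all per-block distance vectors illustrates both halves of tightness for this example. The sketch assumes four blocks of 4 bits each, and the helper `has_false_negative` is a hypothetical name introduced here, not something from the GPH paper.

```python
from itertools import product


def has_false_negative(thresholds, t, width):
    """True if some distance vector with total <= t passes no block filter."""
    for d in product(range(width + 1), repeat=len(thresholds)):
        if sum(d) <= t and not any(dj <= tj for dj, tj in zip(d, thresholds)):
            return True
    return False


t, b, width = 4, 4, 4
general = [0, 0, 0, 1]  # sums to t - b + 1 = 1
basic = [t // b] * b    # [1, 1, 1, 1], sums to t

print(sum(general) == t - b + 1)                       # True
print(has_false_negative(general, t, width))           # False
print(has_false_negative(basic, t, width))             # False
print(has_false_negative([0, 0, 0, 0], t, width))      # True
```

Both assignments are correct, but the general one uses smaller thresholds. Lowering its last threshold by one breaks correctness: a string whose blocks each differ in exactly one bit has total distance 4 ≤ t and would be missed.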
22. ### Results

- Method
  - GPH: Multi-index based on the general principle
  - MIH: Multi-index based on the basic principle
- Dataset
  - SIFT: a billion binary strings of length 128
23. ### Next stage

- Utilizing data skewness
  - Varying block lengths depending on the data skewness
  - Qin et al., "GPH: Similarity Search in Hamming Space," ICDE, 2018
- Utilizing adjacent thresholds
  - To shorten the verification time with stronger constraints
  - Qin and Xiao, "Pigeonring: A Principle for Faster Thresholded Similarity Search," VLDB, 2018