Slide 1

Slide 1 text

General Pigeonhole Principle
Shunsuke Kanda (proper string beginner)
StringBeginners#1

Slide 2

Slide 2 text

Literature
- IEEE International Conference on Data Engineering (ICDE), 2018

Slide 3

Slide 3 text

Background
- Finding similar objects in high-dimensional vector data is a fundamental task in modern data analysis
- Similarity-preserving hashing maps the original space (high dimension; query q, objects xi), where comparison is costly, to a Hamming space (low dimension; q', x'i), where comparison is lightweight
- e.g., near-duplicate detection in Web pages [Manku et al., WWW07]
  - Web pages are converted into 64-bit vectors using SimHash
  - Similarity between Web pages is approximated by the Hamming distance between their 64-bit vectors

Slide 4

Slide 4 text

Problem definition
- We have n binary strings x1, …, xn, each of length m
- Given a query string q and a threshold t, the goal is to report all string ids i such that Ham(q, xi) ≤ t
  - where Ham(·, ·) is the Hamming distance between two strings
Dataset
  x1  0000 0000
  x2  0000 0111
  x3  0000 1111
  x4  1001 1111
Given q = 0000 0111 and t = 2
  Ham(0000 0111, 0000 0000) = 3 > t
  Ham(0000 0111, 0000 0111) = 0 ≤ t
  Ham(0000 0111, 0000 1111) = 1 ≤ t
  Ham(0000 0111, 1001 1111) = 3 > t
Results = {2, 3}

Slide 5

Slide 5 text

How to solve?
- Brute-force linear scan: O(n) time
  - Ham(x, y) can be computed as popcnt(x xor y) in O(1) time when a string fits within a machine word
- Modern solutions use inverted-index-based approaches
Dataset (scanned in O(n) time)
  x1  0000 0000
  x2  0000 0111
  x3  0000 1111
  x4  1001 1111
Example: Ham(0000 0111, 1001 1111) = 3
  0000 0111 xor 1001 1111 = 1001 1000
  popcnt(1001 1000) = 3
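
The linear scan on this slide can be sketched in a few lines of Python. This is only an illustration: the strings are packed into Python integers, and the names `hamming` and `linear_scan` are chosen here, not taken from the talk.

```python
def hamming(x: int, y: int) -> int:
    """Hamming distance of two bit strings packed into integers."""
    return bin(x ^ y).count("1")  # popcount of the XOR


def linear_scan(dataset, q, t):
    """Return the 1-based ids i with Ham(q, x_i) <= t, checking every string."""
    return [i for i, x in enumerate(dataset, start=1) if hamming(q, x) <= t]


# The four-string dataset from the slides.
dataset = [0b00000000, 0b00000111, 0b00001111, 0b10011111]
q = 0b00000111
results = linear_scan(dataset, q, t=2)
print(results)
```

Each comparison is cheap, but the scan still touches all n strings, which is what the index-based approaches try to avoid.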

Slide 6

Slide 6 text

Inverted-index-based solution
- Approach
  - Build an inverted index from the strings
  - Generate the set of strings whose Hamming distance from the query q is at most t, Q = {q' ∈ {0,1}^m : Ham(q, q') ≤ t}, called signatures
  - Find the ids of the strings whose key is in Q by looking them up in the index
- Problem
  - |Q| becomes too large for long strings and large thresholds: |Q| = \sum_{k=0}^{t} \binom{m}{k}
Dataset
  x1  0000 0000
  x2  0000 0111
  x3  0000 1111
  x4  1001 1111
Index
  key        id
  0000 0000  1
  0000 0111  2
  0000 1111  3
  1001 1111  4
Given q = 0000 0111 and t = 1, generate
  Q = { 0000 0111, 1000 0111, 0100 0111, 0010 0111, 0001 0111, 0000 1111, 0000 0011, 0000 0101, 0000 0110 }
Results = {2, 3}
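
A minimal sketch of signature enumeration and lookup, assuming the index is a plain Python dict from string value to id list. The function names are illustrative, not from the paper.

```python
from itertools import combinations
from math import comb


def signatures(q: int, m: int, t: int):
    """Yield every m-bit string within Hamming distance t of q."""
    for k in range(t + 1):
        for positions in combinations(range(m), k):
            s = q
            for p in positions:
                s ^= 1 << p  # flip bit p
            yield s


def build_index(dataset):
    """Inverted index: string value -> list of 1-based ids."""
    index = {}
    for i, x in enumerate(dataset, start=1):
        index.setdefault(x, []).append(i)
    return index


def search(index, q, m, t):
    """Look up every signature of q and collect the matching ids."""
    ids = []
    for s in signatures(q, m, t):
        ids.extend(index.get(s, []))
    return sorted(ids)


dataset = [0b00000000, 0b00000111, 0b00001111, 0b10011111]
index = build_index(dataset)
q, m, t = 0b00000111, 8, 1
sig_count = sum(1 for _ in signatures(q, m, t))
result = search(index, q, m, t)
print(sig_count, result)
```

The number of lookups equals |Q|, so the cost grows with the binomial sum even when the dataset is small.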

Slide 7

Slide 7 text

Multi-index approach
- Aim
  - To make the index approach usable for large parameters as well
- Preprocessing
  - Partition each string xi into b disjoint blocks
  - Build an inverted index for each block
When b = 2:
        1st   2nd
  x1    0000  0000
  x2    0000  0111
  x3    0000  1111
  x4    1001  1111
1st index
  key   id
  0000  1, 2, 3
  1001  4
2nd index
  key   id
  0000  1
  0111  2
  1111  3, 4
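
The preprocessing step might look like the following sketch. It assumes strings are given as Python `str` values of '0'/'1' characters and that m is divisible by b. The helper names are made up for this example.

```python
def split_blocks(x: str, b: int):
    """Split a bit string into b equal-length disjoint blocks (assumes b divides len(x))."""
    width = len(x) // b
    return [x[j * width:(j + 1) * width] for j in range(b)]


def build_multi_index(dataset, b):
    """One inverted index per block: block value -> list of 1-based ids."""
    indexes = [{} for _ in range(b)]
    for i, x in enumerate(dataset, start=1):
        for j, block in enumerate(split_blocks(x, b)):
            indexes[j].setdefault(block, []).append(i)
    return indexes


dataset = ["00000000", "00000111", "00001111", "10011111"]
indexes = build_multi_index(dataset, b=2)
for j, idx in enumerate(indexes, start=1):
    print(j, idx)
```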

Slide 8

Slide 8 text

Multi-index approach
- Query processing: filter-and-verification strategy
  - Partition q into b disjoint blocks q1, q2, …, qb
  - (Filter phase): Obtain candidates by looking up each index with Qj = {q' ∈ {0,1}^{m/b} : Ham(qj, q') ≤ ⌊t/b⌋} for each block j
    - ⌊t/b⌋ is based on the pigeonhole principle
Given q = 0000 0111 and t = 1
  1st block: q1 = 0000, ⌊t/b⌋ = ⌊1/2⌋ = 0, Q1 = {0000}
  2nd block: q2 = 0111, ⌊t/b⌋ = 0, Q2 = {0111}
1st index
  key   id
  0000  1, 2, 3
  1001  4
2nd index
  key   id
  0000  1
  0111  2
  1111  3, 4
Candidates = {1, 2, 3}

Slide 9

Slide 9 text

Multi-index approach
- Query processing: filter-and-verification strategy
  - Partition q into b disjoint blocks q1, q2, …, qb
  - (Filter phase): Obtain candidates by looking up each index with Qj = {q' ∈ {0,1}^{m/b} : Ham(qj, q') ≤ ⌊t/b⌋} for each block j
    - ⌊t/b⌋ is based on the pigeonhole principle
  - (Verification phase): Verify the candidates against the original strings x1, …, xn by computing the Hamming distance
Dataset
  x1  0000 0000
  x2  0000 0111
  x3  0000 1111
  x4  1001 1111
Given q = 0000 0111 and t = 1, Candidates = {1, 2, 3}
  Ham(q, x1) = 3 > t
  Ham(q, x2) = 0 ≤ t
  Ham(q, x3) = 1 ≤ t
Results = {2, 3}
Question: Does the threshold ⌊t/b⌋ guarantee that no false negatives occur?
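
The filter-and-verification strategy can be sketched end to end as below. This is a toy version under stated assumptions: strings are Python `str` values, b divides m, and the per-block indexes are rebuilt inside `search` for brevity (a real system would build them once). All names are illustrative.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    return sum(c != d for c, d in zip(a, b))


def split_blocks(x: str, b: int):
    width = len(x) // b
    return [x[j * width:(j + 1) * width] for j in range(b)]


def signatures(block: str, radius: int):
    """All strings within Hamming distance `radius` of `block`."""
    for k in range(radius + 1):
        for positions in combinations(range(len(block)), k):
            bits = list(block)
            for p in positions:
                bits[p] = "1" if bits[p] == "0" else "0"
            yield "".join(bits)


def search(dataset, q, t, b):
    # Preprocessing: one inverted index per block.
    indexes = [{} for _ in range(b)]
    for i, x in enumerate(dataset, start=1):
        for j, block in enumerate(split_blocks(x, b)):
            indexes[j].setdefault(block, set()).add(i)
    # Filter phase: per-block lookups with threshold floor(t / b).
    radius = t // b
    candidates = set()
    for j, qj in enumerate(split_blocks(q, b)):
        for s in signatures(qj, radius):
            candidates |= indexes[j].get(s, set())
    # Verification phase: check the full Hamming distance of each candidate.
    results = sorted(i for i in candidates if hamming(q, dataset[i - 1]) <= t)
    return sorted(candidates), results


dataset = ["00000000", "00000111", "00001111", "10011111"]
candidates, results = search(dataset, "00000111", t=1, b=2)
print(candidates, results)
```

Here x1 survives the filter only because its first block matches; verification then discards it.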

Slide 10

Slide 10 text

Pigeonhole principle
- If n items are contained in m boxes, then at least one box has no more than ⌊n/m⌋ items
Example: 4 boxes, 7 items

Slide 11

Slide 11 text

Pigeonhole principle
- If n items are contained in m boxes, then at least one box has no more than ⌊n/m⌋ items
Example: 4 boxes, 7 items, ⌊n/m⌋ = ⌊7/4⌋ = 1

Slide 12

Slide 12 text

Pigeonhole principle
- If n items are contained in m boxes, then at least one box has no more than ⌊n/m⌋ items
- Many existing solutions are based on this principle
  - Google (WWW2007), HEngine (ICDE2013), HmSearch (SSDBM2013), MIH (CVPR2012), PartAlloc (VLDB2015), multi-index* (SIGIR2016) and so on…
Example: 4 boxes (blocks) and 7 items (mismatched bits)
  q = 0000 1111 0000 1111
  x = 0011 0011 0011 1011
  Ham(q, x) = 7, and Ham(q4, x4) ≤ ⌊7/4⌋ = 1
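
The example on this slide can be checked directly. The snippet below is a small sketch that treats each 4-bit block as a box and each mismatched bit as an item. The function name is chosen here for illustration.

```python
def block_distances(q: str, x: str):
    """Per-block Hamming distances for space-separated bit strings."""
    return [
        sum(c != d for c, d in zip(qb, xb))
        for qb, xb in zip(q.split(), x.split())
    ]


q = "0000 1111 0000 1111"
x = "0011 0011 0011 1011"
dists = block_distances(q, x)
total = sum(dists)            # items: mismatched bits overall
bound = total // len(dists)   # floor(n / m) with m = number of blocks
print(dists, total, bound)
```

At least one block distance is at most `bound`, which is what lets a per-block lookup with threshold ⌊t/b⌋ find every true result.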

Slide 13

Slide 13 text

Qin's claim in ICDE18
- Is the threshold assignment based on the (basic) pigeonhole principle tight?
- Tightness means that
  - (correctness) no false negative ever occurs with the threshold assignment, and
  - (minimality) no other correct threshold assignment has smaller values (simplified)
- Smaller thresholds can reduce filter and verification costs
- Unfortunately, the existing assignment is not tight
- The general pigeonhole principle offers such a tight assignment

Slide 14

Slide 14 text

Flexible pigeonhole principle
- Lemma (integer)
  - Given strings q = q1, q2, …, qb and x = x1, x2, …, xb, and a threshold t
  - Consider thresholds t1, t2, …, tb such that the tj are integers and ∑ tj = t
  - If Ham(q, x) ≤ t, there exists at least one block j such that Ham(qj, xj) ≤ tj
- Proof
  - Assume there is no block j such that Ham(qj, xj) ≤ tj
  - Then Ham(q, x) = ∑ Ham(qj, xj) > ∑ tj = t, which contradicts Ham(q, x) ≤ t
Example: when t = 4, ∑ tj = t = 4
        1st   2nd   3rd   4th
  qj    0000  1111  0000  1111
  xj    0001  0011  0011  1111
  tj    1     1     1     1

Slide 15

Slide 15 text

Flexible pigeonhole principle
- Lemma (real)
  - Given strings q = q1, q2, …, qb and x = x1, x2, …, xb, and a threshold t
  - Consider thresholds t1, t2, …, tb such that the tj are real numbers and ∑ tj = t
  - If Ham(q, x) ≤ t, there exists at least one block j such that Ham(qj, xj) ≤ ⌊tj⌋
- Proof
  - Assume there is no block j such that Ham(qj, xj) ≤ ⌊tj⌋
  - Since Hamming distances are integers, Ham(qj, xj) ≥ ⌊tj⌋ + 1 > tj for every j
  - Then Ham(q, x) = ∑ Ham(qj, xj) > ∑ tj = t, which contradicts Ham(q, x) ≤ t
Example: when t = 4, ∑ tj = t = 4
        1st   2nd   3rd   4th
  qj    0000  1111  0000  1111
  xj    0001  0011  0011  1111
  tj    0.8   0.8   0.8   1.6

Slide 16

Slide 16 text

Flexible pigeonhole principle
- Lemma (real) and its proof, as stated on Slide 15
Example: when t = 4, integer reduction of the real thresholds
          1st   2nd   3rd   4th
  qj      0000  1111  0000  1111
  xj      0001  0011  0011  1111
  tj      0.8   0.8   0.8   1.6
  ⌊tj⌋    0     0     0     1
  ⌊t/b⌋   1     1     1     1
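
For small parameters the real-valued lemma can be checked exhaustively. The sketch below enumerates every possible vector of per-block distances rather than actual strings, which is an assumption made for brevity (any such vector is realizable with 4-bit blocks). Exact fractions are used so that the thresholds 0.8 and 1.6 sum to exactly 4. The name `never_misses` is invented for this example.

```python
from fractions import Fraction
from itertools import product
from math import floor


def never_misses(thresholds, t, block_len):
    """True if every distance vector with sum <= t has some d_j <= floor(t_j)."""
    for dists in product(range(block_len + 1), repeat=len(thresholds)):
        within_t = sum(dists) <= t
        filtered_out = all(d > floor(tj) for d, tj in zip(dists, thresholds))
        if within_t and filtered_out:
            return False  # a true result would be lost: false negative
    return True


real = [Fraction(4, 5)] * 3 + [Fraction(8, 5)]   # 0.8, 0.8, 0.8, 1.6
floors = [floor(tj) for tj in real]
ok_real = never_misses(real, t=4, block_len=4)
ok_zero = never_misses([0, 0, 0, 0], t=4, block_len=4)
print(floors, ok_real, ok_zero)
```

The all-zero assignment fails because the distance vector (1, 1, 1, 1) sums to 4 yet matches no block exactly.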

Slide 17

Slide 17 text

General pigeonhole principle
- Theorem
  - Given strings q = q1, q2, …, qb and x = x1, x2, …, xb, and a threshold t
  - Consider thresholds t1, t2, …, tb such that the tj are integers and ∑ tj = t − b + 1
  - If Ham(q, x) ≤ t, there exists at least one block j such that Ham(qj, xj) ≤ tj
- Proof
          1st   2nd   …   (b−1)th   bth
  tj      t1    t2    …   t_{b−1}   tb        ∑ tj = t − b + 1

Slide 18

Slide 18 text

General pigeonhole principle
- Theorem (as stated on Slide 17)
- Proof
          1st    2nd    …   (b−1)th     bth
  tj      t1     t2     …   t_{b−1}     tb        ∑ tj = t − b + 1
          +1     +1         +1          +0
  t'j     t1+1   t2+1   …   t_{b−1}+1   tb        ∑ t'j = t (Lemma integer)

Slide 19

Slide 19 text

General pigeonhole principle
- Theorem (as stated on Slide 17)
- Proof (ε: small positive real number)
          1st      2nd      …   (b−1)th       bth
  tj      t1       t2       …   t_{b−1}       tb            ∑ tj = t − b + 1
          +1       +1           +1            +0
  t'j     t1+1     t2+1     …   t_{b−1}+1     tb            ∑ t'j = t (Lemma integer)
          −ε       −ε           −ε            +(b−1)ε
  t''j    t1+1−ε   t2+1−ε   …   t_{b−1}+1−ε   tb+(b−1)ε     ∑ t''j = t (Lemma real)

Slide 20

Slide 20 text

General pigeonhole principle
- Theorem (as stated on Slide 17)
- Proof (ε: small positive real number)
            1st      2nd      …   (b−1)th       bth
  tj        t1       t2       …   t_{b−1}       tb            ∑ tj = t − b + 1
            +1       +1           +1            +0
  t'j       t1+1     t2+1     …   t_{b−1}+1     tb            ∑ t'j = t (Lemma integer)
            −ε       −ε           −ε            +(b−1)ε
  t''j      t1+1−ε   t2+1−ε   …   t_{b−1}+1−ε   tb+(b−1)ε     ∑ t''j = t (Lemma real)
  ⌊t''j⌋    t1       t2       …   t_{b−1}       tb            Integer reduction
  - By Lemma (real), some block j satisfies Ham(qj, xj) ≤ ⌊t''j⌋ = tj
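
The integer-reduction step can be written out explicitly. The slide only says that ε is a small positive real number; the bound 0 < ε < 1/(b−1) for b ≥ 2 used below is an assumption added here so that both floors come out as stated.

```latex
\begin{align*}
\sum_{j=1}^{b} t''_j
  &= \sum_{j=1}^{b-1} (t_j + 1 - \varepsilon) + t_b + (b-1)\varepsilon
   = \sum_{j=1}^{b} t_j + (b-1)
   = (t - b + 1) + (b - 1) = t, \\
\lfloor t''_j \rfloor
  &= \lfloor t_j + 1 - \varepsilon \rfloor = t_j
  \quad (1 \le j \le b-1,\ \text{since } 0 < \varepsilon < 1), \\
\lfloor t''_b \rfloor
  &= \lfloor t_b + (b-1)\varepsilon \rfloor = t_b
  \quad (\text{since } 0 < (b-1)\varepsilon < 1).
\end{align*}
```

With these values, Lemma (real) applied to the t''j gives a block j with Ham(qj, xj) ≤ ⌊t''j⌋ = tj, which is the statement of the theorem.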

Slide 21

Slide 21 text

General pigeonhole principle
- Theorem
  - Given strings q = q1, q2, …, qb and x = x1, x2, …, xb, and a threshold t
  - Consider thresholds t1, t2, …, tb such that the tj are integers and ∑ tj = t − b + 1
  - If Ham(q, x) ≤ t, there exists at least one block j such that Ham(qj, xj) ≤ tj
Example: when t = 4 and b = 4, ∑ tj = t − b + 1 = 1
          1st   2nd   3rd   4th
  qj      0000  1111  0000  1111
  xj      0001  0011  0011  1111
  tj      0     0     0     1
  ⌊t/b⌋   1     1     1     1
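
To compare the two assignments on this slide, the sketch below enumerates all per-block distance vectors for b = 4 blocks of 4 bits. Counting vectors uniformly is a simplification made here (real data is not uniform), and the function names are illustrative. It checks that neither assignment loses a true result and counts how many vectors pass each filter.

```python
from itertools import product


def passes_filter(dists, thresholds):
    """A string becomes a candidate if some block is within its threshold."""
    return any(d <= tj for d, tj in zip(dists, thresholds))


def never_misses(thresholds, t, block_len):
    """True if every distance vector with sum <= t passes the filter."""
    return all(
        passes_filter(dists, thresholds)
        for dists in product(range(block_len + 1), repeat=len(thresholds))
        if sum(dists) <= t
    )


def count_passing(thresholds, block_len):
    """Number of distance vectors (true results or not) that pass the filter."""
    return sum(
        passes_filter(dists, thresholds)
        for dists in product(range(block_len + 1), repeat=len(thresholds))
    )


t, b, block_len = 4, 4, 4
basic = [t // b] * b       # floor(t / b) per block: 1, 1, 1, 1
general = [0, 0, 0, 1]     # sums to t - b + 1 = 1
too_small = [0, 0, 0, 0]   # sums to t - b = 0

safe = {name: never_misses(th, t, block_len)
        for name, th in [("basic", basic), ("general", general), ("too_small", too_small)]}
passing = {"basic": count_passing(basic, block_len),
           "general": count_passing(general, block_len)}
print(safe, passing)
```

Both the basic and the general assignment are safe, the general one admits fewer distance vectors, and lowering the sum by one more (all zeros) starts to lose true results.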

Slide 22

Slide 22 text

Results
- Method
  - GPH: Multi-index based on the general principle
  - MIH: Multi-index based on the basic principle
- Dataset
  - SIFT: a billion binary strings of length 128

Slide 23

Slide 23 text

Next stage
- Utilizing data skewness
  - Varying block lengths depending on the data skewness
  - Qin et al., “GPH: Similarity Search in Hamming Space,” ICDE, 2018
- Utilizing adjacent thresholds
  - To shorten the verification time with stronger constraints
  - Qin and Xiao, “Pigeonring: A Principle for Faster Thresholded Similarity Search,” VLDB, 2018