StringBeginners#1

Shunsuke Kanda
November 02, 2018

General Pigeonhole Principle

Transcript

  1. General Pigeonhole Principle. Shunsuke Kanda (proper string beginner), StringBeginners#1

  2. Literature

    - IEEE International Conference on Data Engineering (ICDE), 2018
  3. Background

    - Finding similar objects in high-dimensional vector data is a fundamental task in recent data analysis
    - Similarity-preserving hashing maps a query q and objects xi in the original space (high dimension) to q’ and x’i in a Hamming space (low dimension)
    - e.g., near-duplicate detection in Web pages [Manku et al., WWW07]
      - Web pages are converted into 64-bit vectors using SimHash
      - Jaccard similarity between Web pages (costly) is approximated by the Hamming distance between 64-bit vectors (lightweight)
  4. Problem definition

    - We have n binary strings x1, …, xn of length m each
    - Given a query string q and a threshold t, the goal is to report all string ids i such that Ham(q, xi) ≤ t
      - where Ham(·, ·) is the Hamming distance between two strings

    Dataset:
      x1  0000 0000
      x2  0000 0111
      x3  0000 1111
      x4  1001 1111

    Given q = 0000 0111 and t = 2:
      Ham(0000 0111, 0000 0000) = 3 > t
      Ham(0000 0111, 0000 0111) = 0 ≤ t
      Ham(0000 0111, 0000 1111) = 1 ≤ t
      Ham(0000 0111, 1001 1111) = 3 > t
    Results = {2, 3}
  5. How to solve?

    - Brute-force linear scan: O(n) time
      - Ham(x, y) can be computed as popcnt(x xor y) in O(1) time when a string fits in a machine word
    - Modern solutions use inverted-index-based approaches

    Dataset (scanned in O(n)):
      x1  0000 0000
      x2  0000 0111
      x3  0000 1111
      x4  1001 1111

    Example: Ham(0000 0111, 1001 1111) = 3
      0000 0111 xor 1001 1111 = 1001 1000
      popcnt(1001 1000) = 3
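
The popcount trick and the linear scan on this slide can be sketched in a few lines of Python. This is only an illustration: strings are assumed to be stored as integers, and the names `hamming` and `linear_scan` are hypothetical, not taken from the slides.

```python
def hamming(x: int, y: int) -> int:
    """Hamming distance of two bit strings stored as integers."""
    return bin(x ^ y).count("1")  # popcount of the XOR


def linear_scan(dataset, q, t):
    """Return the 1-based ids i with Ham(q, x_i) <= t by checking every string."""
    return [i for i, x in enumerate(dataset, start=1) if hamming(q, x) <= t]


# Dataset from the slides: x1..x4
dataset = [0b00000000, 0b00000111, 0b00001111, 0b10011111]

print(hamming(0b00000111, 0b10011111))      # 3
print(linear_scan(dataset, 0b00000111, 2))  # [2, 3]
```
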
  6. Inverted-index-based solution

    - Approach
      - Build an inverted index from the strings
      - Generate the set of strings whose Hamming distance to the query q is no more than t, Q = {q’ ∈ {0,1}^m : Ham(q, q’) ≤ t}, called signatures
      - Find the set of string ids whose key is in Q by looking up the index
    - Problem
      - |Q| becomes too large for long strings and large thresholds: $|Q| = \sum_{k=0}^{t} \binom{m}{k}$

    Dataset:
      x1  0000 0000
      x2  0000 0111
      x3  0000 1111
      x4  1001 1111

    Index:
      key        id
      0000 0000  1
      0000 0111  2
      0000 1111  3
      1001 1111  4

    Given q = 0000 0111 and t = 1, generate
      Q = { 0000 0111, 1000 0111, 0100 0111, 0010 0111, 0001 0111, 0000 1111, 0000 0011, 0000 0101, 0000 0110 }
    Results = {2, 3}
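
As a rough sketch of the signature approach, the following Python enumerates every string within distance t of the query and looks each one up in a dictionary. The function names and the dict-based index are assumptions made for illustration. `math.comb` needs Python 3.8 or later.

```python
from itertools import combinations
from math import comb


def signatures(q: int, m: int, t: int):
    """Yield every m-bit string q' with Ham(q, q') <= t."""
    for k in range(t + 1):
        for positions in combinations(range(m), k):
            sig = q
            for p in positions:
                sig ^= 1 << p  # flip bit p
            yield sig


def search(index, q, m, t):
    """Look up every signature in a dict mapping string -> list of ids."""
    ids = set()
    for sig in signatures(q, m, t):
        ids.update(index.get(sig, []))
    return sorted(ids)


# Single inverted index over the four strings from the slides
index = {0b00000000: [1], 0b00000111: [2], 0b00001111: [3], 0b10011111: [4]}
m, t, q = 8, 1, 0b00000111

num_signatures = sum(comb(m, k) for k in range(t + 1))
print(num_signatures)          # 9
print(search(index, q, m, t))  # [2, 3]
```

The signature count grows quickly: for 64-bit strings and t = 3 the same sum already gives 43,745 lookups per query.
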
  7. Multi-index approach

    - Aim
      - To leverage the index approach also for large parameters
    - Preprocessing
      - Partition each string xi into b disjoint blocks
      - Build an inverted index for each block

    When b = 2:
          1st   2nd
      x1  0000  0000
      x2  0000  0111
      x3  0000  1111
      x4  1001  1111

    1st index:
      key   id
      0000  1, 2, 3
      1001  4

    2nd index:
      key   id
      0000  1
      0111  2
      1111  3, 4
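
The preprocessing step might look like the sketch below. It assumes strings are stored as integers, that m is divisible by b, and that blocks are numbered from the leftmost bits. `split_blocks` and `build_multi_index` are hypothetical helper names.

```python
from collections import defaultdict


def split_blocks(x: int, m: int, b: int):
    """Split an m-bit string into b blocks of m // b bits, leftmost first."""
    w = m // b
    mask = (1 << w) - 1
    return [(x >> (w * (b - 1 - j))) & mask for j in range(b)]


def build_multi_index(dataset, m, b):
    """Return one dict per block, mapping block value -> list of 1-based ids."""
    indexes = [defaultdict(list) for _ in range(b)]
    for i, x in enumerate(dataset, start=1):
        for j, block in enumerate(split_blocks(x, m, b)):
            indexes[j][block].append(i)
    return indexes


dataset = [0b00000000, 0b00000111, 0b00001111, 0b10011111]
first, second = build_multi_index(dataset, m=8, b=2)

print(dict(first))   # {0: [1, 2, 3], 9: [4]}
print(dict(second))  # {0: [1], 7: [2], 15: [3, 4]}
```

The integer keys 9, 7 and 15 correspond to the blocks 1001, 0111 and 1111 in the tables above.
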
  8. Multi-index approach

    - Query processing: filter-and-verification strategy
      - Partition q into b disjoint blocks q1, q2, …, qb
      - (Filter phase): Obtain candidates by looking up each index with Qj = {q’ ∈ {0,1}^(m/b) : Ham(qj, q’) ≤ ⌊t/b⌋} for each block j
        - ⌊t/b⌋ is based on the pigeonhole principle

    1st index:
      key   id
      0000  1, 2, 3
      1001  4

    2nd index:
      key   id
      0000  1
      0111  2
      1111  3, 4

    Given q = 0000 0111 and t = 1:
      1st block: q1 = 0000, ⌊t/b⌋ = ⌊1/2⌋ = 0, Q1 = {0000}
      2nd block: q2 = 0111, ⌊t/b⌋ = 0, Q2 = {0111}
    Candidates = {1, 2, 3}
  9. Multi-index approach

    - Query processing: filter-and-verification strategy
      - Partition q into b disjoint blocks q1, q2, …, qb
      - (Filter phase): Obtain candidates by looking up each index with Qj = {q’ ∈ {0,1}^(m/b) : Ham(qj, q’) ≤ ⌊t/b⌋} for each block j
        - ⌊t/b⌋ is based on the pigeonhole principle
      - (Verification phase): Verify those candidates against the original strings x1, …, xn by computing the Hamming distance

    Dataset:
      x1  0000 0000
      x2  0000 0111
      x3  0000 1111
      x4  1001 1111

    Given q = 0000 0111 and t = 1, Candidates = {1, 2, 3}. Verification:
      Ham(q, x1) = 3 > t
      Ham(q, x2) = 0 ≤ t
      Ham(q, x3) = 1 ≤ t
    Results = {2, 3}

    Question: Does the threshold ⌊t/b⌋ never allow false negatives?
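
The filter-and-verification query can be sketched end to end as follows. This is a minimal illustration under the same assumptions as the slides' toy example (integer-encoded strings, m divisible by b, per-block threshold ⌊t/b⌋). All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def hamming(x, y):
    """Hamming distance via popcount of the XOR."""
    return bin(x ^ y).count("1")


def split_blocks(x, m, b):
    """Split an m-bit integer into b equal blocks, leftmost first."""
    w = m // b
    mask = (1 << w) - 1
    return [(x >> (w * (b - 1 - j))) & mask for j in range(b)]


def signatures(q, width, t):
    """Yield every width-bit string within Hamming distance t of q."""
    for k in range(t + 1):
        for positions in combinations(range(width), k):
            sig = q
            for p in positions:
                sig ^= 1 << p
            yield sig


def build_multi_index(dataset, m, b):
    """One inverted index (block value -> ids) per block."""
    indexes = [defaultdict(list) for _ in range(b)]
    for i, x in enumerate(dataset, start=1):
        for j, block in enumerate(split_blocks(x, m, b)):
            indexes[j][block].append(i)
    return indexes


def query(dataset, indexes, q, m, b, t):
    """Filter with per-block threshold t // b, then verify on full strings."""
    candidates = set()
    for j, qj in enumerate(split_blocks(q, m, b)):
        for sig in signatures(qj, m // b, t // b):
            candidates.update(indexes[j].get(sig, []))
    results = sorted(i for i in candidates if hamming(q, dataset[i - 1]) <= t)
    return sorted(candidates), results


dataset = [0b00000000, 0b00000111, 0b00001111, 0b10011111]
indexes = build_multi_index(dataset, m=8, b=2)
candidates, results = query(dataset, indexes, 0b00000111, m=8, b=2, t=1)
print(candidates)  # [1, 2, 3]
print(results)     # [2, 3]
```
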
  10. Pigeonhole principle

    - If n items are contained in m boxes, then at least one box has no more than ⌊n/m⌋ items

    Example: 4 boxes, 7 items
  11. Pigeonhole principle

    - If n items are contained in m boxes, then at least one box has no more than ⌊n/m⌋ items

    Example: 4 boxes, 7 items, ⌊n/m⌋ = ⌊7/4⌋ = 1
  12. Pigeonhole principle

    - If n items are contained in m boxes, then at least one box has no more than ⌊n/m⌋ items
    - Many existing solutions are based on the principle
      - Google (WWW2007), HEngine (ICDE2013), HmSearch (SSDBM2013), MIH (CVPR2012), PartAlloc (VLDB2015), multi-index* (SIGIR2016) and so on…

    Example: 4 boxes and 7 items
      q = 0000 1111 0000 1111
      x = 0011 0011 0011 1011
      Ham(q4, x4) ≤ ⌊7/4⌋ = 1
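
The example on this slide can be checked directly: the 7 mismatching bits are the items and the 4 blocks are the boxes. The sketch below assumes integer-encoded 16-bit strings. The helper names are hypothetical.

```python
def hamming(x, y):
    """Hamming distance via popcount of the XOR."""
    return bin(x ^ y).count("1")


def split_blocks(x, m, b):
    """Split an m-bit integer into b equal blocks, leftmost first."""
    w = m // b
    mask = (1 << w) - 1
    return [(x >> (w * (b - 1 - j))) & mask for j in range(b)]


q = 0b0000_1111_0000_1111
x = 0b0011_0011_0011_1011
m, b = 16, 4

# Mismatches per block (items per box)
dists = [hamming(qj, xj) for qj, xj in zip(split_blocks(q, m, b), split_blocks(x, m, b))]
n = sum(dists)

print(dists)   # [2, 2, 2, 1]
print(n // b)  # 1, and the 4th block indeed has no more than 1 mismatch
```
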
  13. Qin’s claim in ICDE18

    - Is the threshold assignment based on the (basic) pigeonhole principle tight?
    - Tightness means that
      - (correctness) false negatives never occur with the threshold assignment, and
      - (minimality) there does not exist another threshold assignment whose values are smaller (simplified)
    - Smaller thresholds can reduce filter and verification costs
    - Unfortunately, the existing assignment is not tight
    - The general pigeonhole principle offers such a tight assignment
  14. Flexible pigeonhole principle

    - Lemma (integer)
      - Given strings q = q1, q2, …, qb and x = x1, x2, …, xb, and threshold t
      - Consider thresholds t1, t2, …, tb such that the tj are integers and ∑ tj = t
      - If Ham(q, x) ≤ t, there exists at least one block j such that Ham(qj, xj) ≤ tj
    - Proof
      - Assume there is no block j such that Ham(qj, xj) ≤ tj
      - Then Ham(q, x) = ∑ Ham(qj, xj) > ∑ tj = t, which contradicts Ham(q, x) ≤ t

    When t = 4:
          1st   2nd   3rd   4th
      qj  0000  1111  0000  1111
      xj  0001  0011  0011  1111
      tj  1     1     1     1      ∑ tj = t = 4
  15. Flexible pigeonhole principle

    - Lemma (real)
      - Given strings q = q1, q2, …, qb and x = x1, x2, …, xb, and threshold t
      - Consider thresholds t1, t2, …, tb such that the tj are real numbers and ∑ tj = t
      - If Ham(q, x) ≤ t, there exists at least one block j such that Ham(qj, xj) ≤ ⌊tj⌋
    - Proof
      - Assume there is no block j such that Ham(qj, xj) ≤ ⌊tj⌋
      - Since Hamming distances are integers, Ham(qj, xj) > ⌊tj⌋ implies Ham(qj, xj) > tj
      - Then Ham(q, x) = ∑ Ham(qj, xj) > ∑ tj = t, which contradicts Ham(q, x) ≤ t

    When t = 4:
          1st   2nd   3rd   4th
      qj  0000  1111  0000  1111
      xj  0001  0011  0011  1111
      tj  0.8   0.8   0.8   1.6    ∑ tj = t = 4
  16. Flexible pigeonhole principle

    - Lemma (real)
      - Given strings q = q1, q2, …, qb and x = x1, x2, …, xb, and threshold t
      - Consider thresholds t1, t2, …, tb such that the tj are real numbers and ∑ tj = t
      - If Ham(q, x) ≤ t, there exists at least one block j such that Ham(qj, xj) ≤ ⌊tj⌋
    - Proof
      - Assume there is no block j such that Ham(qj, xj) ≤ ⌊tj⌋
      - Since Hamming distances are integers, Ham(qj, xj) > ⌊tj⌋ implies Ham(qj, xj) > tj
      - Then Ham(q, x) = ∑ Ham(qj, xj) > ∑ tj = t, which contradicts Ham(q, x) ≤ t

    When t = 4:
             1st   2nd   3rd   4th
      qj     0000  1111  0000  1111
      xj     0001  0011  0011  1111
      tj     0.8   0.8   0.8   1.6
      ⌊tj⌋   0     0     0     1      (integer reduction)
      ⌊t/b⌋  1     1     1     1
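
The real-valued lemma can be sanity-checked by brute force on a small instance. The sketch below is an assumption-laden illustration, not the slides' own example: it uses 8-bit strings with four 2-bit blocks so that every pair (q, x) can be enumerated, and it uses `Fraction` so the thresholds 0.8, 0.8, 0.8, 1.6 sum to exactly 4.

```python
from fractions import Fraction
from itertools import product
from math import floor


def hamming(x, y):
    """Hamming distance via popcount of the XOR."""
    return bin(x ^ y).count("1")


def split_blocks(x, m, b):
    """Split an m-bit integer into b equal blocks, leftmost first."""
    w = m // b
    mask = (1 << w) - 1
    return [(x >> (w * (b - 1 - j))) & mask for j in range(b)]


def passes_filter(q, x, m, b, thresholds):
    """True if some block j satisfies Ham(q_j, x_j) <= floor(t_j)."""
    pairs = zip(split_blocks(q, m, b), split_blocks(x, m, b), thresholds)
    return any(hamming(qj, xj) <= floor(tj) for qj, xj, tj in pairs)


m, b, t = 8, 4, 4
thresholds = [Fraction(4, 5)] * 3 + [Fraction(8, 5)]  # 0.8, 0.8, 0.8, 1.6
assert sum(thresholds) == t

# Pairs within distance t that no block lets through would be false negatives
false_negatives = [
    (q, x)
    for q, x in product(range(1 << m), repeat=2)
    if hamming(q, x) <= t and not passes_filter(q, x, m, b, thresholds)
]

print([floor(tj) for tj in thresholds])  # [0, 0, 0, 1]
print(len(false_negatives))              # 0
```
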
  17. General pigeonhole principle

    - Theorem
      - Given strings q = q1, q2, …, qb and x = x1, x2, …, xb, and threshold t
      - Consider thresholds t1, t2, …, tb such that the tj are integers and ∑ tj = t–b+1
      - If Ham(q, x) ≤ t, there exists at least one block j such that Ham(qj, xj) ≤ tj
    - Proof

          1st  2nd  …  (b–1)th  bth
      tj  t1   t2   …  tb-1     tb     ∑ tj = t–b+1
  18. General pigeonhole principle

    - Theorem
      - Given strings q = q1, q2, …, qb and x = x1, x2, …, xb, and threshold t
      - Consider thresholds t1, t2, …, tb such that the tj are integers and ∑ tj = t–b+1
      - If Ham(q, x) ≤ t, there exists at least one block j such that Ham(qj, xj) ≤ tj
    - Proof

           1st   2nd   …  (b–1)th  bth
      tj   t1    t2    …  tb-1     tb     ∑ tj = t–b+1
           +1    +1       +1       +0
      t’j  t1+1  t2+1  …  tb-1+1   tb     ∑ t’j = t (Lemma integer)
  19. General pigeonhole principle

    - Theorem
      - Given strings q = q1, q2, …, qb and x = x1, x2, …, xb, and threshold t
      - Consider thresholds t1, t2, …, tb such that the tj are integers and ∑ tj = t–b+1
      - If Ham(q, x) ≤ t, there exists at least one block j such that Ham(qj, xj) ≤ tj
    - Proof

            1st     2nd     …  (b–1)th    bth
      tj    t1      t2      …  tb-1       tb          ∑ tj = t–b+1
            +1      +1         +1         +0
      t’j   t1+1    t2+1    …  tb-1+1     tb          ∑ t’j = t (Lemma integer)
            –ε      –ε         –ε         +(b–1)ε
      t’’j  t1+1–ε  t2+1–ε  …  tb-1+1–ε   tb+(b–1)ε   ∑ t’’j = t (Lemma real)

      ε: small positive real number
  20. General pigeonhole principle

    - Theorem
      - Given strings q = q1, q2, …, qb and x = x1, x2, …, xb, and threshold t
      - Consider thresholds t1, t2, …, tb such that the tj are integers and ∑ tj = t–b+1
      - If Ham(q, x) ≤ t, there exists at least one block j such that Ham(qj, xj) ≤ tj
    - Proof

              1st     2nd     …  (b–1)th    bth
      tj      t1      t2      …  tb-1       tb          ∑ tj = t–b+1
              +1      +1         +1         +0
      t’j     t1+1    t2+1    …  tb-1+1     tb          ∑ t’j = t (Lemma integer)
              –ε      –ε         –ε         +(b–1)ε
      t’’j    t1+1–ε  t2+1–ε  …  tb-1+1–ε   tb+(b–1)ε   ∑ t’’j = t (Lemma real)
      ⌊t’’j⌋  t1      t2      …  tb-1       tb          (integer reduction)

      ε: small positive real number, small enough that (b–1)ε < 1, so that ⌊t’’j⌋ = tj for every block j
  21. General pigeonhole principle

    - Theorem
      - Given strings q = q1, q2, …, qb and x = x1, x2, …, xb, and threshold t
      - Consider thresholds t1, t2, …, tb such that the tj are integers and ∑ tj = t–b+1
      - If Ham(q, x) ≤ t, there exists at least one block j such that Ham(qj, xj) ≤ tj

    When t = 4 and b = 4:
             1st   2nd   3rd   4th
      qj     0000  1111  0000  1111
      xj     0001  0011  0011  1111
      tj     0     0     0     1      ∑ tj = t–b+1 = 1
      ⌊t/b⌋  1     1     1     1
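
To see the theorem at work, the sketch below enumerates all pairs of 8-bit strings (four 2-bit blocks, t = 4) and compares three integer assignments: the basic ⌊t/b⌋ one, a general one summing to t–b+1, and one summing to only t–b. The small instance and all names are assumptions chosen so the check stays exhaustive; it illustrates the statement and does not reproduce the GPH implementation.

```python
from itertools import product


def hamming(x, y):
    """Hamming distance via popcount of the XOR."""
    return bin(x ^ y).count("1")


def split_blocks(x, m, b):
    """Split an m-bit integer into b equal blocks, leftmost first."""
    w = m // b
    mask = (1 << w) - 1
    return [(x >> (w * (b - 1 - j))) & mask for j in range(b)]


def passes_filter(q, x, m, b, thresholds):
    """True if some block j satisfies Ham(q_j, x_j) <= t_j."""
    pairs = zip(split_blocks(q, m, b), split_blocks(x, m, b), thresholds)
    return any(hamming(qj, xj) <= tj for qj, xj, tj in pairs)


def count_pairs(m, b, t, thresholds):
    """Count (candidate pairs, false negatives) over all pairs of m-bit strings."""
    candidates = false_negatives = 0
    for q, x in product(range(1 << m), repeat=2):
        passed = passes_filter(q, x, m, b, thresholds)
        candidates += passed
        false_negatives += (hamming(q, x) <= t) and not passed
    return candidates, false_negatives


m, b, t = 8, 4, 4
assignments = {
    "basic": [t // b] * b,      # [1, 1, 1, 1], sums to t
    "general": [0, 0, 0, 1],    # sums to t - b + 1 = 1
    "too small": [0, 0, 0, 0],  # sums to t - b = 0
}

results = {name: count_pairs(m, b, t, ts) for name, ts in assignments.items()}
for name, (candidates, false_negatives) in results.items():
    print(name, candidates, false_negatives)
```

On this instance the basic and general assignments both produce zero false negatives, while the general one admits fewer candidate pairs. Lowering the sum to t–b does lose results: a string that differs from q in exactly one bit per block is within distance 4 but matches no block exactly.
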
  22. Results

    - Method
      - GPH: Multi-index based on the general principle
      - MIH: Multi-index based on the basic principle
    - Dataset
      - SIFT: a billion binary strings of length 128
  23. Next stage

    - Utilizing data skewness
      - Varying block lengths depending on the data skewness
      - Qin et al., “GPH: Similarity Search in Hamming Space,” ICDE, 2018
    - Utilizing adjacent thresholds
      - To shorten the verification time with stronger constraints
      - Qin and Xiao, “Pigeonring: A Principle for Faster Thresholded Similarity Search,” VLDB, 2018