
# StringBeginners#1

General Pigeonhole Principle

## Shunsuke Kanda

November 02, 2018

## Transcript

1. General Pigeonhole Principle
Shunsuke Kanda (proper string beginner)

2. Literature
- IEEE International Conference on Data Engineering (ICDE), 2018

3. Background
- Finding similar objects in high-dimensional vector data is a fundamental task in modern data analysis
- [Figure: similarity-preserving hashing maps a query q and objects x_i in the original space (high dimension, costly) to q' and x'_i in the Hamming space (low dimension, lightweight)]
- e.g., near-duplicate detection of Web pages [Manku et al., WWW07]
  - Web pages are converted into 64-bit vectors using SimHash
  - Jaccard similarity between Web pages is approximated by the Hamming distance between the 64-bit vectors

4. Problem definition
- We have n binary strings x_1, …, x_n, each of length m
- Given a query string q and a threshold t, the goal is to report all string ids i such that Ham(q, x_i) ≤ t
  - where Ham(·, ·) is the Hamming distance between two strings

Example: given q = 0000 0111 and t = 2

| id | string | Ham(q, x_i) | reported? |
|---|---|---|---|
| x_1 | 0000 0000 | 3 > t | no |
| x_2 | 0000 0111 | 0 ≤ t | yes |
| x_3 | 0000 1111 | 1 ≤ t | yes |
| x_4 | 1001 1111 | 3 > t | no |

Results = {2, 3}

5. How to solve?
- Brute-force linear scan over the dataset: O(n) time
  - Ham(x, y) can be computed as popcnt(x xor y) in O(1) time when a string fits in a machine word
- Modern solutions use inverted-index-based approaches

Example: Ham(0000 0111, 1001 1111) = popcnt(0000 0111 xor 1001 1111) = popcnt(1001 1000) = 3
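
The popcnt(x xor y) trick and the linear scan can be sketched in a few lines of Python. This is a minimal illustration on the running four-string dataset, not code from the talk; the function names `hamming` and `linear_scan` are chosen here for the example, and strings are stored as Python integers.

```python
def hamming(x: int, y: int) -> int:
    """Hamming distance between two equal-length bit strings stored as ints."""
    # XOR leaves a 1 exactly where the strings differ; counting the 1s is popcnt.
    return bin(x ^ y).count("1")


def linear_scan(dataset: list[int], q: int, t: int) -> list[int]:
    """Return the 1-based ids i with Ham(q, x_i) <= t by checking every string."""
    return [i for i, x in enumerate(dataset, start=1) if hamming(q, x) <= t]


# Running example from the slides: x_1 .. x_4.
dataset = [0b0000_0000, 0b0000_0111, 0b0000_1111, 0b1001_1111]
q = 0b0000_0111

print(hamming(q, 0b1001_1111))     # 3
print(linear_scan(dataset, q, 2))  # [2, 3]
```

Every query touches all n strings, which is what the index-based approaches try to avoid.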

6. Inverted-index-based solution
- Approach
  - Build an inverted index from the strings
  - Generate the set of strings whose Hamming distance from the query q is at most t, Q = {q' ∈ {0,1}^m : Ham(q, q') ≤ t}, called signatures
  - Find the set of string ids whose key is in Q by looking up the index
- Problem
  - |Q| becomes too large for long strings and large thresholds: $|Q| = \sum_{k=0}^{t} \binom{m}{k}$

Index:

| key | id |
|---|---|
| 0000 0000 | 1 |
| 0000 0111 | 2 |
| 0000 1111 | 3 |
| 1001 1111 | 4 |

Example: given q = 0000 0111 and t = 1, generate

Q = { 0000 0111, 1000 0111, 0100 0111, 0010 0111, 0001 0111, 0000 1111, 0000 0011, 0000 0101, 0000 0110 }

Results = {2, 3}
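
The signature-enumeration approach can be sketched as follows. This is an illustrative sketch, not the implementation of any of the cited systems; `signatures`, `build_index`, and `search` are names invented for this example, and a Python `dict` stands in for the inverted index.

```python
from itertools import combinations
from math import comb


def signatures(q: int, m: int, t: int) -> set[int]:
    """All m-bit strings within Hamming distance t of q."""
    sigs = set()
    for k in range(t + 1):                       # flip exactly k bits
        for positions in combinations(range(m), k):
            s = q
            for p in positions:
                s ^= 1 << p                      # flip bit p
            sigs.add(s)
    return sigs


def build_index(dataset: list[int]) -> dict[int, list[int]]:
    """Inverted index: string -> list of 1-based ids."""
    index: dict[int, list[int]] = {}
    for i, x in enumerate(dataset, start=1):
        index.setdefault(x, []).append(i)
    return index


def search(index: dict[int, list[int]], q: int, m: int, t: int) -> list[int]:
    """Look up every signature of q in the index."""
    ids = set()
    for s in signatures(q, m, t):
        ids.update(index.get(s, []))
    return sorted(ids)


dataset = [0b0000_0000, 0b0000_0111, 0b0000_1111, 0b1001_1111]
index = build_index(dataset)
Q = signatures(0b0000_0111, m=8, t=1)

print(len(Q), sum(comb(8, k) for k in range(2)))  # 9 9
print(search(index, 0b0000_0111, m=8, t=1))       # [2, 3]
```

The number of lookups equals |Q|, so the cost follows the binomial sum and grows quickly with m and t.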

7. Multi-index approach
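- Aim
  - Make the index approach usable for large parameters as well
- Preprocessing
  - Partition each string x_i into b disjoint blocks
  - Build an inverted index for each block

Example with b = 2:

| id | 1st block | 2nd block |
|---|---|---|
| x_1 | 0000 | 0000 |
| x_2 | 0000 | 0111 |
| x_3 | 0000 | 1111 |
| x_4 | 1001 | 1111 |

1st-block index:

| key | ids |
|---|---|
| 0000 | 1, 2, 3 |
| 1001 | 4 |

2nd-block index:

| key | ids |
|---|---|
| 0000 | 1 |
| 0111 | 2 |
| 1111 | 3, 4 |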
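
The preprocessing step can be sketched like this. It is a minimal illustration that assumes the string length is divisible by b; `split_blocks` and `build_multi_index` are hypothetical helper names, and strings are kept as Python `str` values so that the blocks are easy to read.

```python
def split_blocks(x: str, b: int) -> list[str]:
    """Split a bit string into b disjoint blocks of equal length."""
    w = len(x) // b  # block width; assumes len(x) is divisible by b
    return [x[j * w:(j + 1) * w] for j in range(b)]


def build_multi_index(dataset: list[str], b: int) -> list[dict[str, list[int]]]:
    """One inverted index per block: block value -> list of 1-based ids."""
    indexes: list[dict[str, list[int]]] = [{} for _ in range(b)]
    for i, x in enumerate(dataset, start=1):
        for j, block in enumerate(split_blocks(x, b)):
            indexes[j].setdefault(block, []).append(i)
    return indexes


dataset = ["00000000", "00000111", "00001111", "10011111"]
first, second = build_multi_index(dataset, b=2)

print(first)   # {'0000': [1, 2, 3], '1001': [4]}
print(second)  # {'0000': [1], '0111': [2], '1111': [3, 4]}
```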

8. Multi-index approach
- Query processing: filter-and-verification strategy
  - Partition q into b disjoint blocks q_1, q_2, …, q_b
  - (Filter phase): Obtain candidates by looking up each block index with Q_j = {q' ∈ {0,1}^(m/b) : Ham(q_j, q') ≤ ⌊t/b⌋} for each block j
    - ⌊t/b⌋ is based on the pigeonhole principle

Example: given q = 0000 0111 and t = 1 (b = 2)

| block | q_j | ⌊t/b⌋ | Q_j | matching ids |
|---|---|---|---|---|
| 1st | 0000 | ⌊1/2⌋ = 0 | {0000} | 1, 2, 3 |
| 2nd | 0111 | ⌊1/2⌋ = 0 | {0111} | 2 |

Candidates = {1, 2, 3}

9. Multi-index approach
- (Verification phase): Verify the candidates against the original strings x_1, …, x_n by computing the Hamming distance

Example: given q = 0000 0111 and t = 1, Candidates = {1, 2, 3}

| id | string | Ham(q, x_i) | result? |
|---|---|---|---|
| x_1 | 0000 0000 | 3 > t | no |
| x_2 | 0000 0111 | 0 ≤ t | yes |
| x_3 | 0000 1111 | 1 ≤ t | yes |

Results = {2, 3}

Question: Does the threshold ⌊t/b⌋ never allow false negatives?
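
The filter-and-verification strategy can be put together in one short sketch. This is an illustration under the assumptions already stated (equal-length blocks, per-block threshold ⌊t/b⌋), not the code of MIH or any other cited system; all function names are invented for the example.

```python
from itertools import combinations


def ham(a: str, b: str) -> int:
    """Hamming distance between two equal-length bit strings."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))


def split_blocks(x: str, b: int) -> list[str]:
    w = len(x) // b  # assumes len(x) is divisible by b
    return [x[j * w:(j + 1) * w] for j in range(b)]


def signatures(block: str, radius: int) -> set[str]:
    """All strings within Hamming distance `radius` of `block`."""
    sigs = set()
    for k in range(radius + 1):
        for positions in combinations(range(len(block)), k):
            bits = list(block)
            for p in positions:
                bits[p] = "1" if bits[p] == "0" else "0"
            sigs.add("".join(bits))
    return sigs


def search(dataset: list[str], q: str, t: int, b: int):
    # Preprocessing: one inverted index per block.
    indexes = [{} for _ in range(b)]
    for i, x in enumerate(dataset, start=1):
        for j, block in enumerate(split_blocks(x, b)):
            indexes[j].setdefault(block, []).append(i)

    # Filter phase: per-block lookups with threshold floor(t / b).
    radius = t // b
    candidates = set()
    for j, qj in enumerate(split_blocks(q, b)):
        for s in signatures(qj, radius):
            candidates.update(indexes[j].get(s, []))

    # Verification phase: full Hamming distance on the candidates only.
    results = sorted(i for i in candidates if ham(q, dataset[i - 1]) <= t)
    return sorted(candidates), results


dataset = ["00000000", "00000111", "00001111", "10011111"]
candidates, results = search(dataset, "00000111", t=1, b=2)
print(candidates)  # [1, 2, 3]
print(results)     # [2, 3]
```

In a real system the indexes would be built once rather than per query; they are rebuilt inside `search` here only to keep the sketch self-contained.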

10. Pigeonhole principle
- If n items are contained in m boxes, then at least one box has no more than ⌊n/m⌋ items
- Example: 4 boxes, 7 items

11. Pigeonhole principle
- With 4 boxes and 7 items, ⌊n/m⌋ = ⌊7/4⌋ = 1, so at least one box has no more than 1 item

12. Pigeonhole principle
- Many existing solutions are based on this principle
  - Google (WWW2007), HEngine (ICDE2013), HmSearch (SSDBM2013), MIH (CVPR2012), PartAlloc (VLDB2015), multi-index* (SIGIR2016), and so on

Example: 4 boxes (blocks) and 7 items (differing bits)

q = 0000 1111 0000 1111
x = 0011 0011 0011 1011

Ham(q_4, x_4) ≤ ⌊7/4⌋ = 1
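
The example can be checked numerically: the 7 differing bits are the items and the 4 blocks are the boxes. The sketch below uses a helper name, `block_distances`, chosen for this illustration only.

```python
def block_distances(q: str, x: str, b: int) -> list[int]:
    """Hamming distance of each of the b equal-length blocks (spaces ignored)."""
    q, x = q.replace(" ", ""), x.replace(" ", "")
    w = len(q) // b  # assumes the length is divisible by b
    return [
        sum(c1 != c2 for c1, c2 in zip(q[j * w:(j + 1) * w], x[j * w:(j + 1) * w]))
        for j in range(b)
    ]


d = block_distances("0000 1111 0000 1111", "0011 0011 0011 1011", b=4)
print(d, sum(d))               # [2, 2, 2, 1] 7
print(min(d) <= sum(d) // 4)   # True: the 4th block has distance 1 <= floor(7/4)
```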

13. Qin’s claim in ICDE18
- Is the threshold assignment based on the (basic) pigeonhole principle tight?
- Tightness means that
  - (correctness) no false negative ever occurs with the threshold assignment, and
  - (minimality) there is no other correct threshold assignment whose values are smaller (simplified)
- Smaller thresholds can reduce filter and verification costs
- Unfortunately, the existing assignment is not tight
- The general pigeonhole principle offers such a tight assignment

14. Flexible pigeonhole principle
- Lemma (integer)
  - Given strings q = q_1, q_2, …, q_b and x = x_1, x_2, …, x_b, and a threshold t
  - Consider thresholds t_1, t_2, …, t_b such that the t_j are integers and ∑ t_j = t
  - If Ham(q, x) ≤ t, there exists at least one block j such that Ham(q_j, x_j) ≤ t_j
- Proof
  - Assume there is no block j such that Ham(q_j, x_j) ≤ t_j
  - Then Ham(q, x) = ∑ Ham(q_j, x_j) > ∑ t_j = t, which contradicts Ham(q, x) ≤ t

Example thresholds when t = 4 (∑ t_j = t = 4):

| | 1st | 2nd | 3rd | 4th |
|---|---|---|---|---|
| q_j | 0000 | 1111 | 0000 | 1111 |
| x_j | 0001 | 0011 | 0011 | 1111 |
| t_j | 1 | 1 | 1 | 1 |

15. Flexible pigeonhole principle
- Lemma (real)
  - Given strings q = q_1, q_2, …, q_b and x = x_1, x_2, …, x_b, and a threshold t
  - Consider thresholds t_1, t_2, …, t_b such that the t_j are real numbers and ∑ t_j = t
  - If Ham(q, x) ≤ t, there exists at least one block j such that Ham(q_j, x_j) ≤ ⌊t_j⌋
- Proof
  - Assume there is no block j such that Ham(q_j, x_j) ≤ ⌊t_j⌋
  - Since Hamming distances are integers, Ham(q_j, x_j) ≥ ⌊t_j⌋ + 1 > t_j for every block j
  - Then Ham(q, x) = ∑ Ham(q_j, x_j) > ∑ t_j = t, which contradicts Ham(q, x) ≤ t

Example thresholds when t = 4 (∑ t_j = t = 4):

| | 1st | 2nd | 3rd | 4th |
|---|---|---|---|---|
| q_j | 0000 | 1111 | 0000 | 1111 |
| x_j | 0001 | 0011 | 0011 | 1111 |
| t_j | 0.8 | 0.8 | 0.8 | 1.6 |

16. Flexible pigeonhole principle
Integer reduction of the real thresholds when t = 4:

| | 1st | 2nd | 3rd | 4th |
|---|---|---|---|---|
| q_j | 0000 | 1111 | 0000 | 1111 |
| x_j | 0001 | 0011 | 0011 | 1111 |
| t_j | 0.8 | 0.8 | 0.8 | 1.6 |
| ⌊t_j⌋ | 0 | 0 | 0 | 1 |
| ⌊t/b⌋ | 1 | 1 | 1 | 1 |
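
For small parameters, the real-valued lemma can be checked exhaustively. The sketch below is an illustration, not part of the talk: it enumerates every 16-bit difference d = q xor x with popcount at most t and confirms that some 4-bit block stays within its floored threshold. The names `block_popcounts` and `has_false_negative` are hypothetical.

```python
from math import floor


def popcount(v: int) -> int:
    return bin(v).count("1")


def block_popcounts(d: int, b: int, w: int) -> list[int]:
    """Popcount of each w-bit block of d = q xor x (1st block = most significant)."""
    mask = (1 << w) - 1
    return [popcount((d >> (w * (b - 1 - j))) & mask) for j in range(b)]


def has_false_negative(thresholds: list[float], t: int, b: int, w: int) -> bool:
    """True if some d with popcount <= t exceeds floor(t_j) in every block."""
    floors = [floor(tj) for tj in thresholds]
    for d in range(1 << (b * w)):
        if popcount(d) <= t:
            dists = block_popcounts(d, b, w)
            if all(dj > fj for dj, fj in zip(dists, floors)):
                return True
    return False


thresholds = [0.8, 0.8, 0.8, 1.6]            # real thresholds from the slide
print([floor(tj) for tj in thresholds])      # [0, 0, 0, 1]
print(has_false_negative(thresholds, t=4, b=4, w=4))  # False
```

Only the difference q xor x matters for Hamming distances, so checking all differences covers all pairs (q, x).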

17. General pigeonhole principle
- Theorem
  - Given strings q = q_1, q_2, …, q_b and x = x_1, x_2, …, x_b, and a threshold t
  - Consider thresholds t_1, t_2, …, t_b such that the t_j are integers and ∑ t_j = t − b + 1
  - If Ham(q, x) ≤ t, there exists at least one block j such that Ham(q_j, x_j) ≤ t_j
- Proof
  - Start from integer thresholds with ∑ t_j = t − b + 1

| | 1st | 2nd | … | (b−1)th | bth |
|---|---|---|---|---|---|
| t_j | t_1 | t_2 | … | t_{b−1} | t_b |

18. General pigeonhole principle
- Proof (step 1: add 1 to each of the first b − 1 thresholds)

| | 1st | 2nd | … | (b−1)th | bth |
|---|---|---|---|---|---|
| t_j | t_1 | t_2 | … | t_{b−1} | t_b |
| change | +1 | +1 | … | +1 | +0 |
| t'_j | t_1 + 1 | t_2 + 1 | … | t_{b−1} + 1 | t_b |

  - ∑ t_j = t − b + 1
  - ∑ t'_j = t (Lemma integer)

19. General pigeonhole principle
- Proof (step 2: move a small amount ε from each of the first b − 1 thresholds to the last one)

| | 1st | 2nd | … | (b−1)th | bth |
|---|---|---|---|---|---|
| t'_j | t_1 + 1 | t_2 + 1 | … | t_{b−1} + 1 | t_b |
| change | −ε | −ε | … | −ε | +(b−1)ε |
| t''_j | t_1 + 1 − ε | t_2 + 1 − ε | … | t_{b−1} + 1 − ε | t_b + (b−1)ε |

  - ε: a small positive real number
  - ∑ t'_j = t (Lemma integer)
  - ∑ t''_j = t (Lemma real)

20. General pigeonhole principle
- Proof (step 3: integer reduction)

| | 1st | 2nd | … | (b−1)th | bth |
|---|---|---|---|---|---|
| t''_j | t_1 + 1 − ε | t_2 + 1 − ε | … | t_{b−1} + 1 − ε | t_b + (b−1)ε |
| ⌊t''_j⌋ | t_1 | t_2 | … | t_{b−1} | t_b |

  - ε: a small positive real number, small enough that (b − 1)ε < 1, so that ⌊t''_j⌋ = t_j for every block j
  - ∑ t''_j = t, so by Lemma (real) there exists a block j with Ham(q_j, x_j) ≤ ⌊t''_j⌋ = t_j
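
The arithmetic behind the ε construction can be written out as a worked equation, together with a direct counting argument that reaches the same conclusion without ε. This is a sketch in the slides' notation; the direct argument is added here for illustration and does not appear on the slides.

```latex
% Sum of the shifted real thresholds: the epsilon terms cancel.
\begin{align*}
\sum_{j=1}^{b} t''_j
  &= \sum_{j=1}^{b-1} \bigl(t_j + 1 - \varepsilon\bigr) + t_b + (b-1)\varepsilon \\
  &= \sum_{j=1}^{b} t_j + (b-1)
   = (t - b + 1) + (b - 1)
   = t .
\end{align*}

% Direct argument: if every block exceeded its integer threshold,
% i.e. Ham(q_j, x_j) >= t_j + 1 for all j, then
\begin{align*}
\mathrm{Ham}(q, x)
  = \sum_{j=1}^{b} \mathrm{Ham}(q_j, x_j)
  \ge \sum_{j=1}^{b} (t_j + 1)
  = (t - b + 1) + b
  = t + 1 > t ,
\end{align*}
% which contradicts Ham(q, x) <= t.
```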

21. General pigeonhole principle
Example when t = 4 and b = 4 (∑ t_j = t − b + 1 = 1):

| | 1st | 2nd | 3rd | 4th |
|---|---|---|---|---|
| q_j | 0000 | 1111 | 0000 | 1111 |
| x_j | 0001 | 0011 | 0011 | 1111 |
| t_j | 0 | 0 | 0 | 1 |
| ⌊t/b⌋ | 1 | 1 | 1 | 1 |
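
The thresholds in this example can be compared with a small exhaustive check. The sketch below is illustrative only and is not the GPH implementation: it confirms that the assignment (0, 0, 0, 1) misses nothing for t = 4 on four 4-bit blocks, shows that one smaller assignment, (0, 0, 0, 0), does produce a false negative, and counts how many signatures each assignment would enumerate. The names `first_miss` and `num_signatures` are invented here, and testing a single smaller assignment illustrates minimality rather than proving it.

```python
from math import comb


def popcount(v: int) -> int:
    return bin(v).count("1")


def block_popcounts(d: int, b: int, w: int) -> list[int]:
    """Popcount of each w-bit block of d = q xor x (1st block = most significant)."""
    mask = (1 << w) - 1
    return [popcount((d >> (w * (b - 1 - j))) & mask) for j in range(b)]


def first_miss(thresholds: list[int], t: int, b: int, w: int):
    """Smallest difference d with popcount <= t that no block would catch, or None."""
    for d in range(1 << (b * w)):
        if popcount(d) <= t:
            dists = block_popcounts(d, b, w)
            if all(dj > tj for dj, tj in zip(dists, thresholds)):
                return d
    return None


def num_signatures(thresholds: list[int], w: int) -> int:
    """Total signatures over all blocks: block j enumerates sum_{k <= t_j} C(w, k)."""
    return sum(sum(comb(w, k) for k in range(tj + 1)) for tj in thresholds)


t, b, w = 4, 4, 4
basic = [t // b] * b        # [1, 1, 1, 1], basic pigeonhole principle
general = [0, 0, 0, 1]      # sums to t - b + 1 = 1
too_small = [0, 0, 0, 0]    # sums to t - b = 0

print(first_miss(basic, t, b, w))                    # None
print(first_miss(general, t, b, w))                  # None
print(first_miss(too_small, t, b, w) is not None)    # True: a false negative exists
print(num_signatures(basic, w), num_signatures(general, w))  # 20 8
```

Fewer signatures per query means fewer index lookups and usually fewer candidates to verify, which is the cost reduction the tighter assignment is after.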

22. Results
- Method
  - GPH: Multi-index based on the general principle
  - MIH: Multi-index based on the basic principle
- Dataset
  - SIFT: a billion binary strings of length 128

23. Next stage
- Utilizing data skewness
  - Varying block lengths depending on the data skewness
- Qin et al., “GPH: Similarity Search in Hamming Space,” ICDE, 2018