StringBeginners#1

Shunsuke Kanda
November 02, 2018

General Pigeonhole Principle

Transcript

  1. General Pigeonhole Principle
    Shunsuke Kanda (proper string beginner)
    StringBeginners#1

  2. Literature
    } IEEE International Conference on Data Engineering (ICDE), 2018

  3. Background
    } Finding similar objects in high-dimensional vector data is a
      fundamental task in modern data analysis
    [Figure: similarity-preserving hashing maps q and x_i from the original space (high dimension, costly) to q’ and x’_i in the Hamming space (low dimension, lightweight)]
    } e.g., near-duplicate detection in Web pages [Manku et al., WWW07]
      } Web pages are converted into 64-bit vectors using SimHash
      } Jaccard similarity between Web pages is approximated by Hamming
        distance between 64-bit vectors

  4. Problem definition
    } We have n binary strings x_1, …, x_n of length m each
    } Given a query string q and threshold t, the goal is to
      report all string ids i such that Ham(q, x_i) ≤ t
      } where Ham(·, ·) is the Hamming distance between two strings
    Dataset
      x_1  0000 0000
      x_2  0000 0111
      x_3  0000 1111
      x_4  1001 1111
    Given q = 0000 0111 and t = 2
      Ham(0000 0111, 0000 0000) = 3 > t
      Ham(0000 0111, 0000 0111) = 0 ≤ t
      Ham(0000 0111, 0000 1111) = 1 ≤ t
      Ham(0000 0111, 1001 1111) = 3 > t
    Results = {2, 3}

  5. How to solve?
    } Brute-force linear scan: O(n) time
      } Ham(x, y) can be computed by popcnt(x xor y) in O(1) when a
        string fits within a machine word
    } Modern solutions use inverted-index-based approaches
    Dataset (scanned in full: O(n))
      x_1  0000 0000
      x_2  0000 0111
      x_3  0000 1111
      x_4  1001 1111
    Example: Ham(0000 0111, 1001 1111) = 3
      0000 0111 xor 1001 1111 = 1001 1000
      popcnt(1001 1000) = 3
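
The linear scan with popcnt(x xor y) can be sketched in a few lines of Python. This is only an illustration: the names `hamming` and `linear_scan` are made up here, and the strings are assumed to be packed into Python integers rather than fixed-width machine words.

```python
def hamming(x, y):
    """Hamming distance of two bit strings packed into integers."""
    # XOR leaves a 1 exactly where the two strings differ;
    # counting those 1s plays the role of popcnt.
    return bin(x ^ y).count("1")


def linear_scan(dataset, q, t):
    """Return the 1-based ids i with Ham(q, x_i) <= t by checking every string."""
    return [i for i, x in enumerate(dataset, start=1) if hamming(q, x) <= t]


# Dataset from the slides: x_1 .. x_4 as 8-bit strings.
dataset = [0b00000000, 0b00000111, 0b00001111, 0b10011111]
results = linear_scan(dataset, q=0b00000111, t=2)
print(results)  # [2, 3]
```

Each comparison is cheap, but the scan still touches all n strings, which is the cost the index-based approaches try to avoid.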

  6. Inverted-index-based solution
    } Approach
      } Build an inverted index from the strings
      } Generate the set of strings whose Hamming distance from query q is no
        more than t, Q = {q’ ∈ {0,1}^m : Ham(q, q’) ≤ t}, called signatures
      } Find the set of string ids whose key is in Q by looking up the index
    } Problem
      } |Q| becomes too large for long strings and large thresholds:
        |Q| = \sum_{k=0}^{t} \binom{m}{k}
    Dataset
      x_1  0000 0000
      x_2  0000 0111
      x_3  0000 1111
      x_4  1001 1111
    Index
      key        id
      0000 0000  1
      0000 0111  2
      0000 1111  3
      1001 1111  4
    Given q = 0000 0111 and t = 1,
    generate Q = { 0000 0111, 1000 0111, 0100 0111, 0010 0111, 0001 0111,
                   0000 1111, 0000 0011, 0000 0101, 0000 0110 }
    Results = {2, 3}
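
A minimal Python sketch of the signature-based lookup follows. The function name `signatures` and the dictionary-based index are assumptions made for illustration, not taken from any of the cited systems.

```python
from itertools import combinations
from math import comb


def signatures(q, m, t):
    """All m-bit strings within Hamming distance t of q (the set Q)."""
    sigs = set()
    for k in range(t + 1):
        # Choose k of the m bit positions and flip them.
        for positions in combinations(range(m), k):
            flipped = q
            for p in positions:
                flipped ^= 1 << p
            sigs.add(flipped)
    return sigs


dataset = [0b00000000, 0b00000111, 0b00001111, 0b10011111]
m, t, q = 8, 1, 0b00000111

# Inverted index: key = whole string, value = list of 1-based ids.
index = {}
for i, x in enumerate(dataset, start=1):
    index.setdefault(x, []).append(i)

Q = signatures(q, m, t)
results = sorted(i for sig in Q for i in index.get(sig, []))

print(len(Q), sum(comb(m, k) for k in range(t + 1)))  # 9 9
print(results)  # [2, 3]
```

The same counting formula shows how quickly Q grows: for m = 64 and t = 3 it already gives 43,745 signatures per query.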

  7. Multi-index approach
    } Aim
      } To leverage the index approach for large parameters as well
    } Preprocessing
      } Partition each string x_i into b disjoint blocks
      } Build an inverted index for each block
    When b = 2,
    Dataset
           1st   2nd
      x_1  0000  0000
      x_2  0000  0111
      x_3  0000  1111
      x_4  1001  1111
    1st-block index
      key   ids
      0000  1, 2, 3
      1001  4
    2nd-block index
      key   ids
      0000  1
      0111  2
      1111  3, 4
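
The preprocessing step might look like the sketch below. The helper names `split_blocks` and `build_multi_index` are hypothetical, and the code assumes that b divides m so that all blocks have equal width.

```python
def split_blocks(x, m, b):
    """Split an m-bit string (as an int) into b equal blocks, leftmost first."""
    width = m // b  # assumption: b divides m
    mask = (1 << width) - 1
    return [(x >> (width * (b - 1 - j))) & mask for j in range(b)]


def build_multi_index(dataset, m, b):
    """One inverted index per block: block value -> list of 1-based ids."""
    indexes = [{} for _ in range(b)]
    for i, x in enumerate(dataset, start=1):
        for j, block in enumerate(split_blocks(x, m, b)):
            indexes[j].setdefault(block, []).append(i)
    return indexes


dataset = [0b00000000, 0b00000111, 0b00001111, 0b10011111]
indexes = build_multi_index(dataset, m=8, b=2)
for j, idx in enumerate(indexes, start=1):
    print(j, {format(key, "04b"): ids for key, ids in idx.items()})
# 1 {'0000': [1, 2, 3], '1001': [4]}
# 2 {'0000': [1], '0111': [2], '1111': [3, 4]}
```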

  8. Multi-index approach
    } Query processing: filter-and-verification strategy
      } Partition q into b disjoint blocks q_1, q_2, …, q_b
      } (Filter phase): Obtain candidates by looking up each index with
        Q_j = {q’ ∈ {0,1}^{m/b} : Ham(q_j, q’) ≤ ⌊t/b⌋} for each block j
        } ⌊t/b⌋ is based on the pigeonhole principle
    Given q = 0000 0111 and t = 1
    1st-block index
      key   ids
      0000  1, 2, 3
      1001  4
    2nd-block index
      key   ids
      0000  1
      0111  2
      1111  3, 4
    1st block: q_1 = 0000, ⌊t/b⌋ = ⌊1/2⌋ = 0, Q_1 = {0000}
    2nd block: q_2 = 0111, ⌊t/b⌋ = 0, Q_2 = {0111}
    Candidates = {1, 2, 3}

  9. Multi-index approach
    } Query processing: filter-and-verification strategy
      } Partition q into b disjoint blocks q_1, q_2, …, q_b
      } (Filter phase): Obtain candidates by looking up each index with
        Q_j = {q’ ∈ {0,1}^{m/b} : Ham(q_j, q’) ≤ ⌊t/b⌋} for each block j
        } ⌊t/b⌋ is based on the pigeonhole principle
      } (Verification phase): Verify those candidates against the original
        strings x_1, …, x_n by computing the Hamming distance
    Given q = 0000 0111 and t = 1
    Dataset
      x_1  0000 0000
      x_2  0000 0111
      x_3  0000 1111
      x_4  1001 1111
    Candidates = {1, 2, 3}
    Verification
      Ham(q, x_1) = 3 > t
      Ham(q, x_2) = 0 ≤ t
      Ham(q, x_3) = 1 ≤ t
    Results = {2, 3}
    Question: Does the threshold ⌊t/b⌋ never allow false negatives?
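
Putting the filter and verification phases together, a self-contained sketch could look as follows. All function names are illustrative, the blocks are assumed to have equal width (b divides m), and every block uses the basic threshold ⌊t/b⌋.

```python
from itertools import combinations


def hamming(x, y):
    """Hamming distance via popcount of the XOR."""
    return bin(x ^ y).count("1")


def split_blocks(x, m, b):
    """Split an m-bit string into b equal blocks, leftmost first."""
    width = m // b  # assumption: b divides m
    mask = (1 << width) - 1
    return [(x >> (width * (b - 1 - j))) & mask for j in range(b)]


def signatures(q, width, t):
    """All width-bit strings within Hamming distance t of q."""
    sigs = set()
    for k in range(t + 1):
        for positions in combinations(range(width), k):
            s = q
            for p in positions:
                s ^= 1 << p
            sigs.add(s)
    return sigs


def search(dataset, m, b, q, t):
    # Preprocessing: one inverted index per block.
    indexes = [{} for _ in range(b)]
    for i, x in enumerate(dataset, start=1):
        for j, block in enumerate(split_blocks(x, m, b)):
            indexes[j].setdefault(block, set()).add(i)
    # Filter phase: an id becomes a candidate if any block matches
    # a signature within the per-block threshold floor(t / b).
    candidates = set()
    for j, qj in enumerate(split_blocks(q, m, b)):
        for sig in signatures(qj, m // b, t // b):
            candidates |= indexes[j].get(sig, set())
    # Verification phase: check the full Hamming distance.
    results = sorted(i for i in candidates if hamming(q, dataset[i - 1]) <= t)
    return sorted(candidates), results


dataset = [0b00000000, 0b00000111, 0b00001111, 0b10011111]
candidates, results = search(dataset, m=8, b=2, q=0b00000111, t=1)
print(candidates)  # [1, 2, 3]
print(results)     # [2, 3]
```

In this example only two index lookups are needed (one signature per block) instead of the nine signatures of the single-index approach, at the price of verifying one false positive (id 1).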

  10. Pigeonhole principle
    } If n items are contained in m boxes, then at least one box
      has no more than ⌊n/m⌋ items
    Example: 7 items in 4 boxes

  11. Pigeonhole principle
    } If n items are contained in m boxes, then at least one box
      has no more than ⌊n/m⌋ items
    Example: 7 items in 4 boxes, ⌊n/m⌋ = ⌊7/4⌋ = 1

  12. Pigeonhole principle
    } If n items are contained in m boxes, then at least one box
      has no more than ⌊n/m⌋ items
    } Many existing solutions are based on this principle
      } Google (WWW2007), HEngine (ICDE2013), HmSearch (SSDBM2013),
        MIH (CVPR2012), PartAlloc (VLDB2015), multi-index* (SIGIR2016), and so on
    q = 0000 1111 0000 1111
    x = 0011 0011 0011 1011
    4 boxes and 7 items: Ham(q_4, x_4) ≤ ⌊7/4⌋ = 1
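
The example can be checked mechanically. In the sketch below the blocks play the role of boxes and the mismatching bits play the role of items; the helper `block_distances` is a name introduced here for illustration.

```python
def block_distances(q_blocks, x_blocks):
    """Per-block Hamming distances of two strings given as lists of bit-string blocks."""
    return [
        sum(a != b for a, b in zip(qb, xb))
        for qb, xb in zip(q_blocks, x_blocks)
    ]


q = "0000 1111 0000 1111".split()
x = "0011 0011 0011 1011".split()

dists = block_distances(q, x)
total = sum(dists)           # number of items (mismatching bits)
bound = total // len(dists)  # floor(n / m) with m = 4 boxes

print(dists, total, bound)  # [2, 2, 2, 1] 7 1
assert min(dists) <= bound  # at least one block is within the bound
```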

  13. Qin’s claim in ICDE18
    } Is the threshold assignment based on the (basic) pigeonhole
      principle tight?
    } Tightness means that
      } (correctness) false negatives never occur with the threshold
        assignment, and
      } (minimality) there is no other threshold assignment
        whose values are smaller (simplified)
    } Smaller thresholds can reduce filter and verification costs
    } Unfortunately, the existing assignment is not tight
    } The general pigeonhole principle offers such a tight assignment

  14. Flexible pigeonhole principle
    } Lemma (integer)
      } Given strings q = q_1, q_2, …, q_b and x = x_1, x_2, …, x_b, and threshold t
      } Consider thresholds t_1, t_2, …, t_b such that t_j are integers and ∑ t_j = t
      } If Ham(q, x) ≤ t, there exists at least one block j such that Ham(q_j, x_j) ≤ t_j
    } Proof
      } Assume there is no block j such that Ham(q_j, x_j) ≤ t_j
      } Ham(q, x) = ∑ Ham(q_j, x_j) > ∑ t_j = t contradicts Ham(q, x) ≤ t
    When t = 4,
             1st   2nd   3rd   4th
      q_j    0000  1111  0000  1111
      x_j    0001  0011  0011  1111
      t_j    1     1     1     1      ∑ t_j = t = 4

  15. Flexible pigeonhole principle
    } Lemma (real)
      } Given strings q = q_1, q_2, …, q_b and x = x_1, x_2, …, x_b, and threshold t
      } Consider thresholds t_1, t_2, …, t_b such that t_j are real numbers and ∑ t_j = t
      } If Ham(q, x) ≤ t, there exists at least one block j such that Ham(q_j, x_j) ≤ ⌊t_j⌋
    } Proof
      } Assume there is no block j such that Ham(q_j, x_j) ≤ ⌊t_j⌋
      } Ham(q, x) = ∑ Ham(q_j, x_j) > ∑ t_j = t contradicts Ham(q, x) ≤ t
    When t = 4,
             1st   2nd   3rd   4th
      q_j    0000  1111  0000  1111
      x_j    0001  0011  0011  1111
      t_j    0.8   0.8   0.8   1.6    ∑ t_j = t = 4

  16. Flexible pigeonhole principle
    } Lemma (real)
      } Given strings q = q_1, q_2, …, q_b and x = x_1, x_2, …, x_b, and threshold t
      } Consider thresholds t_1, t_2, …, t_b such that t_j are real numbers and ∑ t_j = t
      } If Ham(q, x) ≤ t, there exists at least one block j such that Ham(q_j, x_j) ≤ ⌊t_j⌋
    } Proof
      } Assume there is no block j such that Ham(q_j, x_j) ≤ ⌊t_j⌋
      } Ham(q, x) = ∑ Ham(q_j, x_j) > ∑ t_j = t contradicts Ham(q, x) ≤ t
    When t = 4,
               1st   2nd   3rd   4th
      q_j      0000  1111  0000  1111
      x_j      0001  0011  0011  1111
      t_j      0.8   0.8   0.8   1.6
      ⌊t_j⌋    0     0     0     1      (integer reduction)
      ⌊t/b⌋    1     1     1     1
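
The strict inequality in the proof relies on Hamming distances being integers, a step the slide leaves implicit. One way to spell it out (written with the `align*` environment, which assumes the amsmath package):

```latex
\begin{align*}
\mathrm{Ham}(q_j, x_j) > \lfloor t_j \rfloor
  &\;\Longrightarrow\; \mathrm{Ham}(q_j, x_j) \ge \lfloor t_j \rfloor + 1 > t_j
  && \text{(the distance is an integer)} \\
\mathrm{Ham}(q, x) = \sum_{j=1}^{b} \mathrm{Ham}(q_j, x_j)
  &> \sum_{j=1}^{b} t_j = t
  && \text{(summing over all blocks)}
\end{align*}
```

The second line contradicts Ham(q, x) ≤ t, so at least one block must satisfy Ham(q_j, x_j) ≤ ⌊t_j⌋.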

  17. General pigeonhole principle
    } Theorem
      } Given strings q = q_1, q_2, …, q_b and x = x_1, x_2, …, x_b, and threshold t
      } Consider thresholds t_1, t_2, …, t_b such that t_j are integers and ∑ t_j = t–b+1
      } If Ham(q, x) ≤ t, there exists at least one block j such that Ham(q_j, x_j) ≤ t_j
    } Proof
                1st       2nd       …   (b–1)th       bth
      t_j       t_1       t_2       …   t_{b-1}       t_b          ∑ t_j = t–b+1

  18. General pigeonhole principle
    } Theorem
      } Given strings q = q_1, q_2, …, q_b and x = x_1, x_2, …, x_b, and threshold t
      } Consider thresholds t_1, t_2, …, t_b such that t_j are integers and ∑ t_j = t–b+1
      } If Ham(q, x) ≤ t, there exists at least one block j such that Ham(q_j, x_j) ≤ t_j
    } Proof
                1st       2nd       …   (b–1)th       bth
      t_j       t_1       t_2       …   t_{b-1}       t_b          ∑ t_j = t–b+1
                +1        +1        …   +1            +0
      t’_j      t_1+1     t_2+1     …   t_{b-1}+1     t_b          ∑ t’_j = t (Lemma integer)

  19. General pigeonhole principle
    } Theorem
      } Given strings q = q_1, q_2, …, q_b and x = x_1, x_2, …, x_b, and threshold t
      } Consider thresholds t_1, t_2, …, t_b such that t_j are integers and ∑ t_j = t–b+1
      } If Ham(q, x) ≤ t, there exists at least one block j such that Ham(q_j, x_j) ≤ t_j
    } Proof
                1st       2nd       …   (b–1)th       bth
      t_j       t_1       t_2       …   t_{b-1}       t_b          ∑ t_j = t–b+1
                +1        +1        …   +1            +0
      t’_j      t_1+1     t_2+1     …   t_{b-1}+1     t_b          ∑ t’_j = t (Lemma integer)
                –ε        –ε        …   –ε            +(b–1)ε
      t’’_j     t_1+1–ε   t_2+1–ε   …   t_{b-1}+1–ε   t_b+(b–1)ε   ∑ t’’_j = t (Lemma real)
      ε: small positive real number

  20. General pigeonhole principle
    } Theorem
      } Given strings q = q_1, q_2, …, q_b and x = x_1, x_2, …, x_b, and threshold t
      } Consider thresholds t_1, t_2, …, t_b such that t_j are integers and ∑ t_j = t–b+1
      } If Ham(q, x) ≤ t, there exists at least one block j such that Ham(q_j, x_j) ≤ t_j
    } Proof
                1st       2nd       …   (b–1)th       bth
      t_j       t_1       t_2       …   t_{b-1}       t_b          ∑ t_j = t–b+1
                +1        +1        …   +1            +0
      t’_j      t_1+1     t_2+1     …   t_{b-1}+1     t_b          ∑ t’_j = t (Lemma integer)
                –ε        –ε        …   –ε            +(b–1)ε
      t’’_j     t_1+1–ε   t_2+1–ε   …   t_{b-1}+1–ε   t_b+(b–1)ε   ∑ t’’_j = t (Lemma real)
      ⌊t’’_j⌋   t_1       t_2       …   t_{b-1}       t_b          (integer reduction)
      ε: small positive real number
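
The theorem, and the fact that the sum t–b+1 cannot be lowered further, can be sanity-checked by brute force for small parameters. The sketch below is an illustration only: it enumerates per-block distance vectors directly instead of actual strings, restricts itself to non-negative thresholds, and uses the hypothetical helper names `holds` and `assignments`.

```python
from itertools import product


def holds(thresholds, t, max_dist):
    """True if every distance vector with total <= t has a block with d_j <= t_j."""
    b = len(thresholds)
    for d in product(range(max_dist + 1), repeat=b):
        if sum(d) <= t and all(dj > tj for dj, tj in zip(d, thresholds)):
            return False  # this vector would be a false negative
    return True


def assignments(total, b):
    """Non-negative integer threshold vectors of length b that sum to total."""
    return [ts for ts in product(range(total + 1), repeat=b) if sum(ts) == total]


t, b, width = 6, 3, 8  # threshold, number of blocks, bits per block

general = assignments(t - b + 1, b)  # sum = t - b + 1, as in the theorem
smaller = assignments(t - b, b)      # one less in total

print(all(holds(ts, t, width) for ts in general))  # True
print(any(holds(ts, t, width) for ts in smaller))  # False
```

The second result reflects the counterexample d_j = t_j + 1 for every block: its total is exactly t, yet no block stays within its threshold.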

  21. General pigeonhole principle
    } Theorem
      } Given strings q = q_1, q_2, …, q_b and x = x_1, x_2, …, x_b, and threshold t
      } Consider thresholds t_1, t_2, …, t_b such that t_j are integers and ∑ t_j = t–b+1
      } If Ham(q, x) ≤ t, there exists at least one block j such that Ham(q_j, x_j) ≤ t_j
    When t = 4 and b = 4,
               1st   2nd   3rd   4th
      q_j      0000  1111  0000  1111
      x_j      0001  0011  0011  1111
      t_j      0     0     0     1      ∑ t_j = t–b+1 = 1
      ⌊t/b⌋    1     1     1     1
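
To get a feel for what the smaller thresholds buy, one can count the signatures that the filter phase enumerates under each assignment. This is a rough cost proxy assumed here for illustration, not a measurement from the paper; `num_signatures` is a hypothetical helper.

```python
from math import comb


def num_signatures(width, threshold):
    """Number of width-bit strings within Hamming distance `threshold` of a block."""
    return sum(comb(width, k) for k in range(threshold + 1))


width = 4               # bits per block (m = 16, b = 4)
basic = [1, 1, 1, 1]    # floor(t / b) for t = 4, b = 4
general = [0, 0, 0, 1]  # sums to t - b + 1 = 1

basic_cost = sum(num_signatures(width, tj) for tj in basic)
general_cost = sum(num_signatures(width, tj) for tj in general)

print(basic_cost, general_cost)  # 20 8
```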

  22. Results
    } Method
      } GPH: Multi-index based on the general principle
      } MIH: Multi-index based on the basic principle
    } Dataset
      } SIFT: a billion binary strings of length 128

  23. Next stage
    } Utilizing data skewness
      } Varying block lengths depending on the data skewness
      } Qin et al., “GPH: Similarity Search in Hamming Space,” ICDE, 2018
    } Utilizing adjacent thresholds
      } To shorten the verification time with stronger constraints
      } Qin and Xiao, “Pigeonring: A Principle for Faster Thresholded
        Similarity Search,” VLDB, 2018