
SIGSPATIAL20


Presentation slides for "Succinct Trit-array Trie for Scalable Trajectory Similarity Search" at SIGSPATIAL20

Shunsuke Kanda

November 05, 2020

Transcript

  1. Succinct Trit-array Trie for
    Scalable Trajectory Similarity Search
    Shunsuke Kanda (1), Koh Takeuchi (2, 1), Keisuke Fujii (3, 1), Yasuo Tabei (1)
    (1) RIKEN AIP, (2) Kyoto Univ., (3) Nagoya Univ.
    28th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems
    November 3–6, 2020, Seattle, Washington, USA (virtual)


  2. Background & Contribution
    • Background
    ▹ Massive datasets of spatial trajectories are ubiquitous in research and industry
    ▹ Similarity search of a huge collection of trajectories is indispensable for turning these
    datasets into knowledge
    • Our contribution
    ▹ Develop an efficient trajectory similarity search method
    – Powerful measure: Fréchet distance
    – Fast search: Locality sensitive hashing (LSH) + Trie search algorithm
    – Scalability: Compressed trie implementation using succinct data structures
    ▹ Experiments on huge real-world datasets
    – Demonstrate that our method outperforms state-of-the-art ones


  3. (Discrete) Fréchet Distance
    • Often explained by the metaphor of an owner walking a dog on a leash
    ▹ Both walk along their own trajectories at their own speeds, but cannot go backward
    ▹ The Fréchet distance is the minimum leash length needed to complete the walk
    • The computation time is O(traj-length^2) by dynamic programming
    [Figure: owner and dog trajectories, annotated "max = Fréchet(owner, dog)"]
    The computational cost makes it difficult
    to design an efficient exact solution :(
    LSH enables us to quickly solve such
    difficult search problems :)
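
    The quadratic dynamic program mentioned above can be sketched as follows. This is a minimal textbook formulation of the discrete Fréchet distance, not code from the slides; the function name `discrete_frechet` and the use of Euclidean point distance are assumptions made for illustration.

```python
import math


def discrete_frechet(p, q):
    """Discrete Frechet distance between two point sequences (hypothetical helper).

    dp[i][j] is the smallest leash length that lets the owner reach p[i]
    and the dog reach q[j] when neither is allowed to move backward.
    Runs in O(len(p) * len(q)) time, i.e. quadratic in trajectory length.
    """
    if not p or not q:
        raise ValueError("trajectories must be non-empty")
    n, m = len(p), len(q)
    dp = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(p[i], q[j])  # Euclidean distance (assumption)
            if i == 0 and j == 0:
                dp[i][j] = d
            elif i == 0:
                dp[i][j] = max(dp[0][j - 1], d)
            elif j == 0:
                dp[i][j] = max(dp[i - 1][0], d)
            else:
                # Either walker, or both, advanced by one step.
                best_prev = min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
                dp[i][j] = max(best_prev, d)
    return dp[n - 1][m - 1]


owner = [(0, 0), (1, 0), (2, 0)]
dog = [(0, 1), (1, 1), (2, 1)]
leash = discrete_frechet(owner, dog)
```

    For the two parallel three-point paths in the usage lines, both walkers can advance in lockstep, so the leash never needs to be longer than the gap between the paths.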


  4. Approximate Approach
    1. Map trajectories on the Fréchet space into integer vectors (i.e., sketches) on the
    Hamming space
    2. Retrieve candidate solutions for sketches using Hamming distance
    3. Remove false positives from the candidate solutions
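    [Figure: LSH maps trajectories P1–P4 and query Q to sketches S1–S4 and T; comparing the sketches by Hamming distance is the main problem]
    Sketches and Hamming distances to T, from similar to dissimilar:
    S1 = 10230122, Ham(S1, T) = 2
    S2 = 10220132, Ham(S2, T) = 2
    S3 = 22030132, Ham(S3, T) = 4
    S4 = 12031123, Ham(S4, T) = 6
    T  = 10230332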
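
    As a small check of the figure's numbers, the Hamming distances can be recomputed directly. The helper name `hamming` is hypothetical, and the assignment of the five digit strings to S1–S4 and T follows the order in which they appear in the figure.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sketches differ."""
    if len(a) != len(b):
        raise ValueError("sketches must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Sketches as listed in the figure (label assignment is an assumption).
SKETCHES = {
    "S1": "10230122",
    "S2": "10220132",
    "S3": "22030132",
    "S4": "12031123",
}
T = "10230332"

distances = {name: hamming(s, T) for name, s in SKETCHES.items()}
```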


  5. Approximate Trajectory Similarity Search Problem
    • Input
    ▹ Database of n sketches S1, S2, …, Sn
    ▹ Query sketch T
    ▹ Hamming distance threshold K
    • Output
    ▹ All sketches Si whose Hamming distance to T is within K
    – i.e., { Si : Ham(Si, T) ≤ K }
    • Issues :(
    ▹ Most existing methods are designed for binary sketches and are inefficient for integer ones
    ▹ Existing methods for integer sketches are memory-inefficient
    We develop a novel similarity search method called tSTAT
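
    The problem statement above can be made concrete with a plain linear scan, which answers the query correctly but touches every sketch. This is only an illustrative baseline under assumed names (`within`, `linear_search`), not the tSTAT method; the example database reuses the four sketches from the Approximate Approach slide.

```python
def within(s, t, k):
    """True if Ham(s, t) <= k, stopping as soon as the budget k is exceeded."""
    errors = 0
    for a, b in zip(s, t):
        if a != b:
            errors += 1
            if errors > k:
                return False
    return True


def linear_search(database, query, k):
    """Return the indices i with Ham(database[i], query) <= k."""
    return [i for i, s in enumerate(database) if within(s, query, k)]


database = ["10230122", "10220132", "22030132", "12031123"]
query = "10230332"
hits_k3 = linear_search(database, query, 3)
hits_k4 = linear_search(database, query, 4)
```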


  6. tSTAT
    (trajectory-indexing Succinct Trit-Array Trie)
    Proposed Method
    Step1: Partition each sketch into B blocks based on the multi-index approach
    ▹ Enables us to divide the Hamming distance problem with large threshold K into B sub-problems with small threshold ⌊K/B⌋
    ▹ Example with Block1 and Block2: threshold K = 3 becomes sub-thresholds ⌊3/2⌋ = 1
    ▹ Although false positives can arise, they can be safely removed in the final verification
    [Figure: sketches split into Block1 and Block2, each indexed by a trie stored in arrays H, G and V]
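
    The filter-and-verify idea behind Step1 can be sketched as below. This is a simplified illustration under stated assumptions: the function names are hypothetical, the sketch length is assumed to be divisible by B, and each block is scanned linearly here, whereas the slides index each block with a trie.

```python
def hamming(a, b):
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def split_blocks(sketch, num_blocks):
    """Cut a sketch into num_blocks equal-length blocks (length must divide evenly)."""
    size = len(sketch) // num_blocks
    return [sketch[i * size:(i + 1) * size] for i in range(num_blocks)]


def multi_index_search(database, query, k, num_blocks):
    """Return (candidates, results) for threshold k using num_blocks blocks.

    If Ham(s, query) <= k, at least one block pair must differ in at most
    floor(k / num_blocks) positions, so filtering on blocks loses no answer.
    """
    sub_k = k // num_blocks
    q_blocks = split_blocks(query, num_blocks)
    candidates = []
    for i, s in enumerate(database):
        s_blocks = split_blocks(s, num_blocks)
        if any(hamming(sb, qb) <= sub_k for sb, qb in zip(s_blocks, q_blocks)):
            candidates.append(i)
    # Final verification removes false positives.
    results = [i for i in candidates if hamming(database[i], query) <= k]
    return candidates, results


database = ["10230122", "10220132", "22030132", "12031123"]
candidates, results = multi_index_search(database, "10230332", k=3, num_blocks=2)
```

    With K = 3 and B = 2 on the sketches from the Approximate Approach slide, the third sketch passes the block filter (its second block differs from the query's in one position) but is rejected by verification because its full distance is 4.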


  7. tSTAT
    (trajectory-indexing Succinct Trit-Array Trie)
    Proposed Method
    Step2: Index each block using a trie where redundant nodes are eliminated
    ▹ The Hamming distance problem can be solved by traversing trie nodes while counting #errors for the query
    ▹ Example: threshold K = 3 becomes sub-thresholds ⌊3/2⌋ = 1, so stop traversing down when #errors > 1
    ▹ The search takes O(B·(L/B)^(K/B+2)) time, where B is #blocks and L is sketch length
    [Figure: tries for Block1 and Block2 traversed to find similar sketches; query: 0 0 2 3]
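
    The traversal with error counting can be sketched on an ordinary pointer-based trie. This is an illustrative simplification: the class and function names are hypothetical, and the sketch does not eliminate redundant nodes or use the compressed representation described in the slides.

```python
class Node:
    """Plain trie node: children keyed by symbol, ids of blocks ending here."""

    def __init__(self):
        self.children = {}
        self.ids = []


def build_trie(blocks):
    """Insert equal-length blocks into a trie, remembering each block's index."""
    root = Node()
    for i, block in enumerate(blocks):
        node = root
        for symbol in block:
            node = node.children.setdefault(symbol, Node())
        node.ids.append(i)
    return root


def search(node, query, k, depth=0, errors=0, out=None):
    """Collect ids of blocks within Hamming distance k of query.

    A branch is abandoned as soon as its mismatch count exceeds k.
    """
    if out is None:
        out = []
    if depth == len(query):
        out.extend(node.ids)
        return out
    for symbol, child in node.children.items():
        new_errors = errors + (symbol != query[depth])
        if new_errors <= k:
            search(child, query, k, depth + 1, new_errors, out)
    return out


# First blocks of the four example sketches, sub-threshold 1.
trie = build_trie(["1023", "1022", "2203", "1203"])
found = sorted(search(trie, "1023", k=1))
```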


  8. tSTAT
    (trajectory-indexing Succinct Trit-Array Trie)
    Proposed Method
    Step3: Implement the trie index using a novel data structure, STAT, in compressed space
    ▹ Leverage succinct data structures (compressed ones supporting various data operations)
    [Figure: tries for Block1 and Block2 stored in arrays H, G and V]


  9. Compressed Data Structure: STAT
    (Succinct Trit-Array Trie)
    Proposed Method
    • Trie nodes are represented using directly addressable tables H
    • Tree navigation can be performed in O(1) time by Rank/Select queries over H
    • H can be implemented by a succinct trit array in σ·N_in·log_2 3 + o(N_in) bits of compressed space
    (σ: #kinds of integers, N_in: #inner nodes)
    ▹ Close to the theoretical lower bound on space
    • We developed an efficient implementation of the succinct trit array supporting Rank/Select
    [Figure: table H with Rank and Select operations used for navigation]
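
    To make the operations concrete, here is a toy trit array with Access, Rank and Select. It is only a sketch under loud assumptions: the class name `TritArray`, the choice of packing five trits per byte (3^5 = 243 fits in 8 bits, giving 1.6 bits per trit against log_2 3 ≈ 1.585), and the linear-time Rank/Select loops are illustrative and are not taken from the slides. A real succinct structure would add o(n)-bit auxiliary indexes to answer Rank/Select in O(1) time.

```python
class TritArray:
    """Toy array over {0, 1, 2} packed five trits per byte (assumed layout)."""

    TRITS_PER_BYTE = 5  # 3**5 = 243 <= 256

    def __init__(self, trits):
        self.n = len(trits)
        self.data = bytearray()
        for start in range(0, self.n, self.TRITS_PER_BYTE):
            chunk = trits[start:start + self.TRITS_PER_BYTE]
            byte = 0
            # Horner's rule in base 3; chunk[0] ends up as the lowest digit.
            for t in reversed(chunk):
                if t not in (0, 1, 2):
                    raise ValueError("trits must be 0, 1 or 2")
                byte = byte * 3 + t
            self.data.append(byte)

    def access(self, i):
        """Return the trit stored at position i."""
        byte = self.data[i // self.TRITS_PER_BYTE]
        return (byte // 3 ** (i % self.TRITS_PER_BYTE)) % 3

    def rank(self, c, i):
        """Number of occurrences of c in positions [0, i). Linear time here."""
        return sum(1 for j in range(i) if self.access(j) == c)

    def select(self, c, k):
        """Position of the k-th (0-based) occurrence of c, or -1 if absent."""
        seen = 0
        for j in range(self.n):
            if self.access(j) == c:
                if seen == k:
                    return j
                seen += 1
        return -1


values = [0, 2, 1, 0, 0, 1, 2, 2]
arr = TritArray(values)
decoded = [arr.access(i) for i in range(len(values))]
```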


  10. tSTAT
    (trajectory-indexing Succinct Trit-Array Trie)
    Proposed Method
    [Figure: tries for Block1 and Block2 stored in arrays H, G and V]
    Step1: Partition each sketch into B blocks based on the multi-index approach
    Step2: Index each block using a trie where redundant nodes are eliminated
    Step3: Implement the trie index using a novel data structure, STAT, in compressed space


  11. Experiments
    • Dataset: 3.3 million NBA player trajectories from 636 games in the 2015/16 season
    • Queryset: 1000 trajectories randomly extracted from the dataset
    • Competitors
    ▹ LS: Strawman baseline with linear search (without any auxiliary data structure)
    ▹ HmSearch: State-of-the-art similarity search method for integer sketches [SSDBM13]
    ▹ FRESH: State-of-the-art approximate trajectory similarity search method [WADS19]
    • Memory usage: tSTAT is 17x smaller than FRESH and 10x smaller than HmSearch
    [Figure: memory usage (GiB) for Fréchet radii R chosen to find 1, 10, and 100 solutions on average per query]


  12. Experiments
    • Dataset: 3.3 million NBA player trajectories from 636 games in the 2015/16 season
    • Queryset: 1000 trajectories randomly extracted from the dataset
    • Competitors
    ▹ LS: Strawman baseline with linear search (without any auxiliary data structure)
    ▹ HmSearch: State-of-the-art similarity search method for integer sketches [SSDBM13]
    ▹ FRESH: State-of-the-art approximate trajectory similarity search method [WADS19]
    • Search time: tSTAT is 34x faster than FRESH and 12x faster than HmSearch
    [Figure: average search time (ms/query) for Fréchet radii R chosen to find 1, 10, and 100 solutions on average per query]


  13. Example of querying similar movements using tSTAT
    • For a short movement of Rajon Rondo in Kings vs. Thunder on Dec. 6, 2015
    ▹ Query
    – Date: 12/06/2015, Match: SAC vs OKC, PlayerName: Rajon Rondo (No 9)
    – Q4 – 07:09.74, Q4 – 07:00.29
    ▹ Searched in a database of 3.3 million trajectories
    ▹ Result 1
    – Date: 10/31/2015, Match: NOP vs GSW, PlayerName: Toney Douglas (No 16)
    – Distance: 0.363737
    – Q3 – 00:36.15, Q3 – 00:31.75
    ▹ Result 2
    – Date: 12/09/2015, Match: SAS vs TOR, PlayerName: Tim Duncan (No 21)
    – Distance: 0.423995
    – Q1 – 09:48.32, Q1 – 09:43.59
    ▹ Result 3
    – Date: 01/12/2016, Match: PHX vs IND, PlayerName: P. J. Tucker (No 17)
    – Distance: 0.395999
    – Q4 – 06:20.51, Q4 – 06:17.35
    • Conclusion
    ▹ Proposed a novel similarity search method tSTAT
    ▹ Showed the efficiency through experiments using real-world datasets
