SPIRE 2017

Shunsuke Kanda
September 26, 2017

Transcript

  1. PRACTICAL IMPLEMENTATION OF
    SPACE-EFFICIENT DYNAMIC
    KEYWORD DICTIONARIES
    Shunsuke Kanda, Kazuhiro Morita and Masao Fuketa
    26th–29th Sep. 2017, Palermo, ITALY
    24th International Symposium on
    String Processing and Information Retrieval
    Tokushima Univ.
    JAPAN

  2. Keyword Dictionaries
    ¨ Associative array with string keys
    ¤ Such as std::map in C++
    ¨ Recent target data are massive [Martínez-Prieto+ 16]
    ¤ NLP, IR, Web Graph, RDF, Bioinformatics, etc.
    Keyword       Value
    String        1
    Processing    2
    And           3
    Information   4
    Retrieval     5
    Example lookup: Information → 4

  3. Keyword Dictionaries (cont.)
    ¨ Research objective
    ¤ Engineering a practical implementation of space-efficient
    dynamic keyword dictionaries
    [Figure: existing implementations arranged by type (static / dynamic) and dictionary size (large / small)]
    Static: libCSD, path_decomposed_tries, Marisa-trie, Xcdat
    Dynamic: Judy, Hat-Trie, libart, Cedar

  4. Data structures related to our work
    Trie & Path Decomposition

  5. Trie
    ¨ Edge-labeled tree for storing a set of strings
    ¤ Search time: O(k) time where k denotes the query length
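    Keyword       Value
    ideal$        1
    ideology$     2
    tec$          3
    technology$   4
    tie$          5
    trie$         6

    (root)
    ├─ ide
    │   ├─ al$ → 1
    │   └─ ology$ → 2
    └─ t
        ├─ ec
        │   ├─ $ → 3
        │   └─ hnology$ → 4
        ├─ ie$ → 5
        └─ rie$ → 6

    Example: searching technology$ follows t, ec, hnology$ and returns 4
    ※ Strictly speaking, this tree is a Patricia trie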
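
The O(k) search can be illustrated with a minimal sketch. It assumes a plain character-by-character trie rather than the Patricia trie drawn on the slide, and the names TrieNode, trie_insert and trie_search are hypothetical, not taken from the talk. Children are kept in a std::map for brevity, so each step costs O(log σ) for alphabet size σ rather than constant time.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Hypothetical node type for this sketch: one node per character.
struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;
    int value = 0;        // meaningful only when is_key is true
    bool is_key = false;  // true if a stored keyword ends at this node
};

// Inserts `key` with `value`, creating missing nodes along the path.
void trie_insert(TrieNode& root, const std::string& key, int value) {
    assert(!key.empty());
    TrieNode* node = &root;
    for (char c : key) {
        std::unique_ptr<TrieNode>& child = node->children[c];
        if (!child) child.reset(new TrieNode());
        node = child.get();
    }
    node->is_key = true;
    node->value = value;
}

// Follows one edge per character of `key` (k steps for a key of length k).
// Returns a pointer to the stored value, or nullptr if `key` is absent.
const int* trie_search(const TrieNode& root, const std::string& key) {
    const TrieNode* node = &root;
    for (char c : key) {
        auto it = node->children.find(c);
        if (it == node->children.end()) return nullptr;
        node = it->second.get();
    }
    return node->is_key ? &node->value : nullptr;
}
```

With the six keywords of this slide inserted, trie_search(root, "technology$") visits one node per character and yields a pointer to 4, while an absent key such as "tech$" yields nullptr.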

  6. Path Decomposition [Ferragina+ 08]
    ¨ Procedure of transforming a trie to improve the cache
    efficiency by lowering the height
    ¤ Application: Static compressed dictionaries [Grossi+ 14]
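    [Figure: the trie from the previous slide next to its Path-Decomposed Trie (PDT); the PDT root stores the path technology$, with child nodes attached by branching characters such as i, r and $]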

  7. New implementation of space-efficient dynamic
    keyword dictionaries
    DynPDT: Dynamic Path-Decomposed Trie

  8. Incremental Path Decomposition
    ¨ Incrementally constructing a PDT
    EX1: Add key technology$ to an empty dictionary
    Result: a single node v1 with label technology$

  9. Incremental Path Decomposition (cont.)
    ¨ Incrementally constructing a PDT
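    EX2: Add key technics$
    Position:     012345
    Label of v1:  technology$
    New key:      technics$
    The key first differs from the label of v1 at position 5 (character i),
    so a new node v2 with label cs$ is attached to v1 by edge (5, i)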

  10. Incremental Path Decomposition (cont.)
    ¨ Incrementally constructing a PDT
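    EX3: Add key technique$
    Position:     012345
    Label of v1:  technology$
    New key:      technique$
    The key differs from the label of v1 at position 5 (character i),
    so the search follows edge (5, i) to v2 with the remaining suffix que$
    que$ differs from the label of v2 (cs$) at position 0 (character q),
    so a new node v3 with label ue$ is attached to v2 by edge (0, q)
    Result: Dynamic Path-Decomposed Trie (DynPDT) with v1 = technology$, v2 = cs$, v3 = ue$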
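
The three insertion examples can be reproduced with a small pointer-based sketch. It is only an illustration under stated assumptions: every key ends with a single terminator '$', nodes are ordinary heap objects (the talk stores the tree in m-Bonsai instead), and the names PdtNode and DynPdtSketch are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical node: a label plus children keyed by edge label (position, char).
struct PdtNode {
    std::string label;  // key suffix stored at this node
    int value;
    std::map<std::pair<std::size_t, char>, std::unique_ptr<PdtNode>> children;
    PdtNode(std::string l, int v) : label(std::move(l)), value(v) {}
};

class DynPdtSketch {
public:
    // Inserts `key`, which must end with its only '$'.
    // Returns false if the key was already stored.
    bool insert(const std::string& key, int value) {
        assert(!key.empty() && key.find('$') == key.size() - 1);
        if (!root_) {
            root_.reset(new PdtNode(key, value));
            return true;
        }
        PdtNode* node = root_.get();
        std::size_t offset = 0;  // start of the still unmatched suffix of key
        for (;;) {
            const std::size_t i = mismatch(node->label, key, offset);
            if (i == node->label.size()) return false;  // suffix equals label
            assert(offset + i < key.size());  // holds thanks to the terminator
            const std::pair<std::size_t, char> edge(i, key[offset + i]);
            auto it = node->children.find(edge);
            if (it == node->children.end()) {
                // New node labelled with what follows the branching character.
                node->children[edge].reset(
                    new PdtNode(key.substr(offset + i + 1), value));
                return true;
            }
            node = it->second.get();
            offset += i + 1;
        }
    }

    // Returns a pointer to the value of `key`, or nullptr if it is absent.
    const int* find(const std::string& key) const {
        const PdtNode* node = root_.get();
        std::size_t offset = 0;
        while (node) {
            const std::size_t i = mismatch(node->label, key, offset);
            if (i == node->label.size())
                return offset + i == key.size() ? &node->value : nullptr;
            if (offset + i >= key.size()) return nullptr;
            auto it = node->children.find(std::make_pair(i, key[offset + i]));
            if (it == node->children.end()) return nullptr;
            node = it->second.get();
            offset += i + 1;
        }
        return nullptr;
    }

    const PdtNode* root() const { return root_.get(); }

private:
    // Length of the common prefix of `label` and key[offset..].
    static std::size_t mismatch(const std::string& label,
                                const std::string& key, std::size_t offset) {
        std::size_t i = 0;
        while (i < label.size() && offset + i < key.size() &&
               label[i] == key[offset + i]) {
            ++i;
        }
        return i;
    }

    std::unique_ptr<PdtNode> root_;
};
```

Inserting technology$, technics$ and technique$ in that order yields a root labelled technology$, a child labelled cs$ under edge (5, i), and a grandchild labelled ue$ under edge (0, q), matching EX1 to EX3.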

  11. Implementation Approaches
    ¨ Tree representation (with edge labels)
    ¤ Using a compact dynamic trie, m-Bonsai [Poyias+ 14]
    ¨ Node label management
    ¤ Separately storing the labels from the tree structure
    ¤ Pointer management is important
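    [Figure: the DynPDT for technology$, technics$ and technique$; the tree structure (nodes v1, v2, v3 with edges (5, i) and (0, q)) is stored separately from the node labels technology$, cs$ and ue$]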

  13. Plain Label Management (PLAIN)
    ¨ Introducing pointers to each label
    ¤ Using an array P where P[v] holds the pointer to the label of node v
    ¤ GOOD: Accessing a label in constant time
    ¤ BAD: Using 64 bits for each slot (too large)
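    [Figure: pointer array P with slots for nodes v4, v1, v2, v6 and v3, each pointing to a separately stored label (technology$, cs$, ue$, lly$, cal$)]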
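
A minimal sketch of the PLAIN idea, assuming node IDs are slot indices and using std::unique_ptr<std::string> as a stand-in for the raw 64-bit pointers on the slide. The class name PlainLabels and its members are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical PLAIN-style store: one pointer per node slot.
class PlainLabels {
public:
    explicit PlainLabels(std::size_t num_slots) : ptrs_(num_slots) {}

    // Stores `label` for node slot v.
    void set(std::size_t v, const std::string& label) {
        assert(v < ptrs_.size());
        ptrs_[v].reset(new std::string(label));
    }

    // Constant-time access: one array lookup. Returns nullptr for unused slots.
    const std::string* get(std::size_t v) const {
        assert(v < ptrs_.size());
        return ptrs_[v].get();
    }

    // Pointer overhead in bytes; it is paid by every slot, used or not.
    std::size_t pointer_bytes() const {
        return ptrs_.size() * sizeof(std::unique_ptr<std::string>);
    }

private:
    std::vector<std::unique_ptr<std::string>> ptrs_;
};
```

The pointer_bytes() helper makes the drawback visible: the overhead grows with the number of slots, regardless of how many of them actually carry a label.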

  14. Compact Label Management (BITMAP)
    ¨ Reducing the pointer overhead
    ¤ Grouping the node labels into groups of ℓ consecutive node IDs
    ¤ Concatenating the labels for each group
    ■ The number of pointers is divided by ℓ
    ¤ Using bit array B such that B[v] = 1 if node v has a label
    Example for ℓ = 4
    Slots     v4  -  v1  v2 |  -  v6  -  v3
    B          1  0   1   1 |  0   1  0   1
    Group 1   3lly10technology2cs
    Group 2   3cal2ue
    P holds one pointer per group

  15. Compact Label Management (BITMAP)
    ¨ Access procedure
    ¤ Calculating the target label position in the group
    ■ Constant time using Popcnt
    ¤ Scanning the concatenated label string until the target label
    ■ O(ℓ) time using the skipping technique (or constant time)
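    Example for ℓ = 4
    Slots     v4  -  v1  v2 |  -  v6  -  v3
    B          1  0   1   1 |  0   1  0   1
    P’        3lly10technology2cs | 3cal2ue
    Popcnt(B[v4..v1]) = 2, so the label of node v1 is the 2nd label in its group
    Skipping: jump over 3lly using its length prefix, then read 10technology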
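
The access procedure can be sketched as follows. This is an illustration under assumptions, not the talk's implementation: the store is built once from a fixed list of slots instead of being updated dynamically, each label is prefixed by a one-byte length, an empty string marks a slot without a label, and one std::string per group stands in for one pointer per group. The class name BitmapLabels is hypothetical. The rank below counts labelled slots strictly before the target, so it is one less than the inclusive Popcnt on the slide.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical BITMAP-style store with a fixed group size.
class BitmapLabels {
public:
    static constexpr std::size_t kGroup = 4;  // group size (l = 4 on the slide)

    // slots[v] is the label of node slot v; an empty string means "no label".
    explicit BitmapLabels(const std::vector<std::string>& slots)
        : bits_((slots.size() + kGroup - 1) / kGroup),
          groups_((slots.size() + kGroup - 1) / kGroup) {
        for (std::size_t v = 0; v < slots.size(); ++v) {
            if (slots[v].empty()) continue;
            assert(slots[v].size() < 256);  // one-byte length prefix (assumption)
            bits_[v / kGroup] |= std::uint32_t(1) << (v % kGroup);
            groups_[v / kGroup].push_back(static_cast<char>(slots[v].size()));
            groups_[v / kGroup] += slots[v];
        }
    }

    // Returns the label of slot v, or an empty string if the slot has none.
    std::string get(std::size_t v) const {
        assert(v / kGroup < bits_.size());
        const std::uint32_t bits = bits_[v / kGroup];
        const std::size_t r = v % kGroup;
        if (((bits >> r) & 1u) == 0) return std::string();
        // Popcount of the bits before v: how many labels precede the target.
        const std::uint32_t before = bits & ((std::uint32_t(1) << r) - 1);
        std::size_t to_skip = std::bitset<32>(before).count();
        // Skipping: jump over each preceding label using its length prefix.
        const std::string& s = groups_[v / kGroup];
        std::size_t pos = 0;
        while (to_skip-- > 0) pos += 1 + static_cast<unsigned char>(s[pos]);
        return s.substr(pos + 1, static_cast<unsigned char>(s[pos]));
    }

private:
    std::vector<std::uint32_t> bits_;   // bit array B, one word per group
    std::vector<std::string> groups_;   // concatenated labels of each group
};
```

Built from the slots {lly, -, technology, cs, -, cal, -, ue}, get(2) skips one label (lly) and returns technology, and get(7) skips cal and returns ue.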

  16. Space & Time
    Experiments

  17. Settings
    ¨ Machine
    ¤ Intel Xeon E5540 @ 2.53 GHz CPU, 32 GB RAM
    ¤ Ubuntu Server 16.04 LTS
    ¨ Datasets
    ¤ Wiki: Titles from English Wikipedia
    ■ Size: 227 MiB, Keys: 11.5M, Ave. length: 20.7 bytes
    ¤ WebBase: URLs from WebBase crawler
    ■ Size: 6.6 GiB, Keys: 118.1M, Ave. length: 60.2 bytes
    ¤ LUBM: URIs from LUBM benchmark
    ■ Size: 3.1 GiB, Keys: 52.6M, Ave. length: 63.7 bytes

  18. Settings (cont.)
    ¨ Data structures
    ¤ DynPDT: PLAIN and BITMAP (ℓ = 16)
    ¤ m-Bonsai: As a naïve trie (not as a dictionary)
    ¤ Judy: Trie dictionary developed by HP-Lab.
    ¤ HAT-trie: Hybrid dictionary of trie and array hashing
    ¤ Cedar: Minimal-prefix double-array trie dictionary
    ¨ Details
    ¤ Language: C++
    ¤ Associated value type: int (4 bytes)
    ¤ Keyword order: random

  19. Results for Space Usage
    Bytes per keyword
                      Wiki   WebBase   LUBM
    DynPDT+PLAIN      46.6   47.5      45.0
    DynPDT+BITMAP     18.8   21.0      13.8
    m-Bonsai          23.6   29.3      11.4
    Judy              50.5   53.5      33.9
    HAT-trie          40.2   68.9      64.7
    Cedar             41.1   N/A       29.7
    Chart annotations: 3.3x, 4.7x

  20. Results for Insertion Time
    Microseconds per keyword
                      Wiki   WebBase   LUBM
    DynPDT+PLAIN      1.14   2.37      1.65
    DynPDT+BITMAP     1.57   2.93      1.99
    m-Bonsai          2.22   7.69      4.80
    Judy              1.06   2.94      1.53
    HAT-trie          1.13   1.75      2.58
    Cedar             1.07   N/A       2.50
    Chart annotation: 2.6x

  21. Results for Search Time
    Microseconds per keyword
                      Wiki   WebBase   LUBM
    DynPDT+PLAIN      1.13   2.20      1.12
    DynPDT+BITMAP     1.61   2.74      1.43
    m-Bonsai          2.06   8.30      3.08
    Judy              0.88   2.42      0.79
    HAT-trie          0.35   0.80      0.51
    Cedar             0.69   N/A       0.69
    Chart annotation: 4.6x

  22. Summary
    ¨ Proposed a new dictionary structure, DynPDT
    ¤ GOOD: Space efficiency, BAD: Time performance
    ¤ The traversal speed of m-Bonsai is a bottleneck
    ■ Xorshift random number generator can solve this problem
    ¨ Future work
    ¤ To improve m-Bonsai or engineer an alternative trie
    ¤ To support more complex operations (possible in principle)
    ■ Invertible mapping between keywords and unique IDs
    ■ Prefix-based operations
    ¤ To develop and publish a useful dictionary library

  23. Thank you for your attention!
    My English skills are limited
    If you have any questions,
    please speak slowly and clearly :)
    My experimental implementation is available at
    https://github.com/kampersanda/dynpdt
