SPIRE 2017

Shunsuke Kanda
September 26, 2017

Transcript

  1. 1.

    PRACTICAL IMPLEMENTATION OF SPACE-EFFICIENT DYNAMIC KEYWORD DICTIONARIES
    Shunsuke Kanda, Kazuhiro Morita and Masao Fuketa (Tokushima Univ., Japan)
    24th International Symposium on String Processing and Information Retrieval
    26th–29th Sep. 2017, Palermo, Italy
  2. 2.

    Keyword Dictionaries
    - Associative array with string keys
      - Such as std::map<std::string, type> in C++
    - Recent target data are massive [Martínez-Prieto+ 16]
      - NLP, IR, Web Graph, RDF, Bioinformatics, etc.
    [Figure: a keyword/value table mapping String, Processing, And, Information, Retrieval to 1, 2, 3, 4, 5; looking up "Information" returns 4.]
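
As a point of reference for the slide above, here is a minimal sketch of a keyword dictionary built on std::map. The alias `Dictionary` and the helpers `make_example_dictionary` and `lookup` are hypothetical names chosen for illustration; the entries are the five keyword/value pairs from the slide's figure.

```cpp
#include <map>
#include <string>

// Keyword dictionary: an associative array with string keys.
using Dictionary = std::map<std::string, int>;

// Builds the five-entry example from the slide's figure.
Dictionary make_example_dictionary() {
    return {{"String", 1}, {"Processing", 2}, {"And", 3},
            {"Information", 4}, {"Retrieval", 5}};
}

// Returns a pointer to the value stored for `key`, or nullptr if absent.
const int* lookup(const Dictionary& dict, const std::string& key) {
    auto it = dict.find(key);
    return it == dict.end() ? nullptr : &it->second;
}
```

With this sketch, `lookup(dict, "Information")` points at the value 4, and a missing keyword yields nullptr. std::map stores every key in full in its own tree node, which is the kind of per-key overhead that space-efficient dictionaries aim to reduce.
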
  3. 3.

    Keyword Dictionaries (cont.)
    - Research objective
      - Engineering a practical implementation of space-efficient dynamic keyword dictionaries
    [Figure: existing libraries arranged by type (static or dynamic) and by dictionary size (small to large): libCSD, path_decomposed_tries, Marisa-trie, Xcdat, Judy, Hat-Trie, libart, Cedar.]
  4. 5.

    Trie
    - Edge-labeled tree for storing a set of strings
      - Search time: O(k), where k denotes the query length
    [Figure: a trie storing ideal$ → 1, ideology$ → 2, tec$ → 3, technology$ → 4, tie$ → 5, trie$ → 6, with edge labels t, ide, rie$, ec, hnology$, ology$, ie$, al$ and $; the search for technology$ is highlighted.]
    ※ Strictly speaking, this tree is a Patricia trie.
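
The O(k) search bound can be illustrated with a minimal sketch of a plain character-level trie (not the Patricia trie drawn on the slide, and not the authors' implementation). The names `TrieNode` and `Trie` are hypothetical, -1 is used here as an assumed "not found" value, and the `$` terminator is replaced by a value field on each node.

```cpp
#include <map>
#include <memory>
#include <string>

struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;  // one edge per character
    int value = -1;  // -1 means no keyword ends at this node
};

class Trie {
public:
    // Creates the missing nodes along the path spelled by `key`.
    void insert(const std::string& key, int value) {
        TrieNode* node = &root_;
        for (char c : key) {
            std::unique_ptr<TrieNode>& child = node->children[c];
            if (!child) child.reset(new TrieNode());
            node = child.get();
        }
        node->value = value;
    }

    // Follows one edge per query character, so a query of length k
    // visits at most k nodes below the root.
    int search(const std::string& key) const {
        const TrieNode* node = &root_;
        for (char c : key) {
            auto it = node->children.find(c);
            if (it == node->children.end()) return -1;
            node = it->second.get();
        }
        return node->value;
    }

private:
    TrieNode root_;
};
```

Each step here also pays for a std::map lookup among the children, so the sketch only shows that the number of visited nodes is bounded by the query length.
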
  5. 6.

    Path Decomposition [Ferragina+ 08]
    - Procedure of transforming a trie to improve the cache efficiency by lowering the height
      - Application: Static compressed dictionaries [Grossi+ 14]
    [Figure: the example trie transformed into a Path-Decomposed Trie (PDT); the visible node label is technology$.]
  6. 8.
  7. 9.

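    Incremental Path Decomposition (cont.)
    - Incrementally constructing a PDT
    EX2: Add key technics$
    [Figure: technics$ is compared with the label technology$ of node v1 over positions 0 to 5; the two differ at position 5, so a new node v2 with label cs$ is attached to v1 by an edge labeled (5, i).]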
  8. 10.

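    Incremental Path Decomposition (cont.)
    - Incrementally constructing a PDT
    EX3: Add key technique$
    [Figure: technique$ differs from technology$ at position 5 and follows the edge (5, i) from v1 to v2; the remainder que$ differs from the label cs$ at position 0, so a new node v3 with label ue$ is attached to v2 by an edge labeled (0, q). The result is a Dynamic Path-Decomposed Trie (DynPDT).]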
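
The insertion steps in EX2 and EX3 can be sketched as follows. This is a pointer-based illustration under stated assumptions, not the authors' m-Bonsai-based implementation: the names `PdtNode` and `DynPdt` are hypothetical, edges are kept in a std::map keyed by (mismatch position, branching character), and keys are assumed not to contain the terminator `$`.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>

struct PdtNode {
    std::string label;  // key suffix stored at this node, ending in '$'
    int value;
    // Edge (i, c): the key left `label` at position i with character c.
    std::map<std::pair<std::size_t, char>, std::unique_ptr<PdtNode>> children;
    PdtNode(const std::string& l, int v) : label(l), value(v) {}
};

class DynPdt {
public:
    // Inserts `key`; returns false if it is already stored.
    bool insert(const std::string& key, int value) {
        assert(key.find('$') == std::string::npos);  // '$' is the terminator
        std::string rest = key + '$';
        if (!root_) { root_.reset(new PdtNode(rest, value)); return true; }
        PdtNode* node = root_.get();
        for (;;) {
            const std::size_t i = first_mismatch(rest, node->label);
            if (i == rest.size()) return false;  // identical: already present
            const std::pair<std::size_t, char> edge(i, rest[i]);
            rest = rest.substr(i + 1);  // the part after the branching character
            auto it = node->children.find(edge);
            if (it == node->children.end()) {
                node->children[edge].reset(new PdtNode(rest, value));
                return true;
            }
            node = it->second.get();
        }
    }

    // Returns a pointer to the stored value, or nullptr if `key` is absent.
    const int* search(const std::string& key) const {
        std::string rest = key + '$';
        const PdtNode* node = root_.get();
        while (node) {
            const std::size_t i = first_mismatch(rest, node->label);
            if (i == rest.size()) return &node->value;
            auto it = node->children.find(std::make_pair(i, rest[i]));
            if (it == node->children.end()) return nullptr;
            rest = rest.substr(i + 1);
            node = it->second.get();
        }
        return nullptr;
    }

    const PdtNode* root() const { return root_.get(); }

private:
    // Index of the first position where a and b differ.
    static std::size_t first_mismatch(const std::string& a, const std::string& b) {
        std::size_t i = 0;
        while (i < a.size() && i < b.size() && a[i] == b[i]) ++i;
        return i;
    }
    std::unique_ptr<PdtNode> root_;
};
```

Inserting technology, technics and technique in that order reproduces the figure: the root holds technology$, the edge (5, i) leads to a node labeled cs$, and from there the edge (0, q) leads to a node labeled ue$. Because every stored string ends in `$`, no label can be a proper prefix of another, so a comparison either matches completely or stops at a real mismatch.
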
  9. 11.

    Implementation Approaches
    - Tree representation (with edge labels)
      - Using a compact dynamic trie, or m-Bonsai [Poyias+ 14]
    - Node label management
      - Separately storing the labels from the tree structure
      - Pointer management is important
    [Figure: the DynPDT for technology$, technics$ and technique$, with nodes v1, v2, v3 labeled technology$, cs$, ue$ and edges (5, i), (0, q); the tree structure and the node labels are shown separately.]
  10. 12.

  11. 13.

    Plain Label Management (PLAIN)
    - Introducing pointers to each label
      - Using array P where P[v] holds the pointer to the label of node v
      - GOOD: Accessing a label in constant time
      - BAD: Using 64 bits for each slot (too large)
    [Figure: pointer array P whose slots for nodes v4, v1, v2, v6, v3 point to the separately stored labels technology$, cs$, ue$, lly$, cal$.]
  12. 14.

    Compact Label Management (BITMAP)
    - Reducing the pointer overhead
      - Grouping the node labels into groups of ℓ consecutive node IDs
      - Concatenating the labels for each group
        - #pointers is divided by ℓ
      - Using bit array B such that B[v] = 1 if node v has a label
    [Figure: for ℓ = 4, B = 1 0 1 1 0 1 0 1 with nodes v4, v1, v2, v6, v3 at the set bits; P points to the concatenated strings 3lly10technology2cs (Group 1) and 3cal2ue (Group 2).]
  13. 15.

    Compact Label Management (BITMAP)
    - Access procedure
      - Calculating the target label position in the group
        - Constant time using Popcnt
      - Scanning the concatenated label string until the target label
        - O(ℓ) time using Skipping technique (or constant time)
    [Figure: for ℓ = 4, B = 1 0 1 1 0 1 0 1 and P' points to 3lly10technology2cs and 3cal2ue; Popcnt(B[v4..v1]) = 2, so the label of node v1 is the second one in its group and is reached by skipping the first.]
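
The access procedure above can be sketched as follows, under several assumptions that are not taken from the slides: the class name `BitmapLabelStore` is hypothetical, each group keeps its ℓ bits of B in one 64-bit word, each label is preceded by a single length byte (so labels are shorter than 256 bytes), and the store is built once from a map rather than updated dynamically.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

constexpr std::size_t kGroupSize = 4;  // the group size (l = 4 in the figure)

class BitmapLabelStore {
public:
    // `labels` maps a node ID (slot) to its label; other slots have none.
    BitmapLabelStore(std::size_t num_slots,
                     const std::map<std::size_t, std::string>& labels)
        : bitmap_((num_slots + kGroupSize - 1) / kGroupSize, 0),
          groups_(bitmap_.size()) {
        for (const auto& entry : labels) {  // std::map iterates in ID order
            const std::size_t v = entry.first;
            assert(v < num_slots && entry.second.size() < 256);
            bitmap_[v / kGroupSize] |= std::uint64_t(1) << (v % kGroupSize);
            std::string& group = groups_[v / kGroupSize];
            group.push_back(static_cast<char>(entry.second.size()));
            group += entry.second;
        }
    }

    // Copies the label of node v into *out; returns false if B[v] = 0.
    bool get(std::size_t v, std::string* out) const {
        const std::size_t g = v / kGroupSize, offset = v % kGroupSize;
        if (g >= bitmap_.size()) return false;
        const std::uint64_t word = bitmap_[g];
        if (((word >> offset) & 1u) == 0) return false;
        // Popcount of the group's bits below v: labels to skip.
        const std::uint64_t below = word & ((std::uint64_t(1) << offset) - 1);
        std::size_t skip = std::bitset<64>(below).count();
        const std::string& s = groups_[g];
        std::size_t pos = 0;
        for (; skip > 0; --skip) {
            pos += 1 + static_cast<unsigned char>(s[pos]);  // jump via length byte
        }
        out->assign(s, pos + 1, static_cast<unsigned char>(s[pos]));
        return true;
    }

private:
    std::vector<std::uint64_t> bitmap_;  // bit array B, one word per group
    std::vector<std::string> groups_;    // concatenated labels per group
};
```

With the figure's data (slots 0, 2, 3, 5, 7 holding lly, technology, cs, cal, ue), `get(2, &out)` skips one label and yields technology. The sketch counts the set bits strictly below v, which is one less than the inclusive Popcnt value 2 shown on the slide. The skip loop visits at most ℓ - 1 length bytes, which matches the O(ℓ) scan mentioned above.
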
  14. 17.

    Settings
    - Machine
      - Intel Xeon E5540 @ 2.53 GHz CPU, 32 GB RAM
      - Ubuntu Server 16.04 LTS
    - Datasets
      - Wiki: Titles from English Wikipedia
        - Size: 227 MiB, Keys: 11.5M, Ave. length: 20.7 bytes
      - WebBase: URLs from WebBase crawler
        - Size: 6.6 GiB, Keys: 118.1M, Ave. length: 60.2 bytes
      - LUBM: URIs from LUBM benchmark
        - Size: 3.1 GiB, Keys: 52.6M, Ave. length: 63.7 bytes
  15. 18.

    Settings (cont.)
    - Data structures
      - DynPDT: PLAIN and BITMAP (ℓ = 16)
      - m-Bonsai: As a naïve trie (not as a dictionary)
      - Judy: Trie dictionary developed by HP Labs
      - HAT-trie: Hybrid dictionary of trie and array hashing
      - Cedar: Minimal-prefix double-array trie dictionary
    - Details
      - Language: C++
      - Associated value type: int (4 bytes)
      - Keyword order: random
  16. 19.

    Results for Space Usage
    [Figure: bar chart of bytes per keyword on Wiki, WebBase and LUBM for DynPDT+PLAIN, DynPDT+BITMAP, m-Bonsai, Judy, HAT-trie and Cedar; one bar is marked N/A. Data labels in extraction order: 46.6, 47.5, 45.0, 18.8, 21.0, 13.8, 23.6, 29.3, 11.4, 50.5, 53.5, 33.9, 40.2, 68.9, 64.7, 41.1, 29.7. Annotated ratios: 3.3x, 4.7x.]
  17. 20.

    Results for Insertion Time
    [Figure: bar chart of microseconds per keyword on Wiki, WebBase and LUBM for DynPDT+PLAIN, DynPDT+BITMAP, m-Bonsai, Judy, HAT-trie and Cedar; one bar is marked N/A. Data labels in extraction order: 1.14, 2.37, 1.65, 1.57, 2.93, 1.99, 2.22, 7.69, 4.80, 1.06, 2.94, 1.53, 1.13, 1.75, 2.58, 1.07, 2.50. Annotated ratio: 2.6x.]
  18. 21.

    Results for Search Time
    [Figure: bar chart of microseconds per keyword on Wiki, WebBase and LUBM for DynPDT+PLAIN, DynPDT+BITMAP, m-Bonsai, Judy, HAT-trie and Cedar; one bar is marked N/A. Data labels in extraction order: 1.13, 2.20, 1.12, 1.61, 2.74, 1.43, 2.06, 8.30, 3.08, 0.88, 2.42, 0.79, 0.35, 0.80, 0.51, 0.69, 0.69. Annotated ratio: 4.6x.]
  19. 22.

    Summary
    - Proposing a new dictionary structure, DynPDT
      - GOOD: Space efficiency, BAD: Time performance
      - The traversal speed of m-Bonsai is a bottleneck
        - Xorshift random number generator can solve this problem
    - Future work
      - To improve m-Bonsai or engineer an alternative trie
      - To support more complex operations (possible in principle)
        - Invertible mapping between keywords and unique IDs
        - Prefix-based operations
      - To develop and publish a useful dictionary library
  20. 23.

    Thank you for your attention!
    My English skills are limited. If you have any questions, please speak slowly and clearly :)
    My experimental implementation is available at https://github.com/kampersanda/dynpdt