Morita and Masao Fuketa
Tokushima Univ., JAPAN
24th International Symposium on String Processing and Information Retrieval
26th–29th Sep. 2017, Palermo, ITALY
Such as std::map<std::string, type> in C++
¨ Recent target data are massive [Martínez-Prieto+ 16]
¤ NLP, IR, Web Graph, RDF, Bioinformatics, etc.
[Figure: keywords (String, Processing, And, Information, Retrieval) mapped to values 1 to 5; looking up "Information" returns 4]
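As a baseline, the keyword dictionary interface can be sketched with std::map. This is only an illustration of the abstract data type; the names `KeywordDict`, `insert_key`, and `lookup` are hypothetical and do not come from the talk.

```cpp
#include <map>
#include <optional>
#include <string>

// Baseline keyword dictionary: each keyword is associated with one value.
// (Illustrative alias; the value type int follows the experiments' setting.)
using KeywordDict = std::map<std::string, int>;

// Register a keyword, overwriting the value if it already exists.
inline void insert_key(KeywordDict& dict, const std::string& key, int value) {
    dict[key] = value;
}

// Return the associated value, or std::nullopt for an unregistered keyword.
inline std::optional<int> lookup(const KeywordDict& dict, const std::string& key) {
    const auto it = dict.find(key);
    if (it == dict.end()) return std::nullopt;
    return it->second;
}
```

With the keywords from the figure registered in order, `lookup(dict, "Information")` yields 4. A node-based std::map typically spends several pointers plus a separate string allocation per key, which is the space overhead that compact tries aim to avoid.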
implementation of space-efficient dynamic keyword dictionaries
[Figure: existing libraries arranged by type (Static, Dynamic) and dictionary size (LARGE to SMALL): libCSD, path_decomposed_tries, Marisa-trie, Xcdat, Judy, Hat-Trie, libart, Cedar]
trie to improve the cache efficiency by lowering the height
¤ Application: Static compressed dictionaries [Grossi+ 14]
[Figure: an example trie and its Path-Decomposed Trie (PDT); node labels include technology$]
Using a compact dynamic trie, or m-Bonsai [Poyias+ 14]
¨ Node label management
¤ Separately storing the labels from the tree structure
¤ Pointer management is important
[Figure: keys technology$, technics$, technique$ stored as node labels v1 = technology$, v2 = cs$, v3 = ue$, with edge (5, i) from v1 to v2 and edge (0, q) from v2 to v3]
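The figure's example can be reproduced with a minimal sketch of incremental path decomposition. It assumes that keys never contain '$', that an edge is identified by (parent node, mismatch offset in the parent's label, branching character), and it uses std::map as a stand-in for the compact trie (m-Bonsai in the talk). The class name `SimpleDynPdt` and its methods are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <tuple>
#include <vector>

// Sketch only: labels are kept apart from the tree structure, as on the slide.
class SimpleDynPdt {
public:
    // Inserts key; returns false if it was already registered.
    bool insert(const std::string& key) {
        assert(key.find('$') == std::string::npos);  // '$' is the terminator
        std::string rest = key + '$';
        if (labels_.empty()) {  // first key becomes the root label
            labels_.push_back(rest);
            return true;
        }
        std::size_t node = 0;
        while (true) {
            const std::size_t i = mismatch(labels_[node], rest);
            if (i == rest.size()) return false;  // identical to the node label
            const Edge edge{node, i, rest[i]};
            rest = rest.substr(i + 1);           // suffix after the branching char
            const auto it = edges_.find(edge);
            if (it == edges_.end()) {            // new node holds the remaining suffix
                edges_.emplace(edge, labels_.size());
                labels_.push_back(rest);
                return true;
            }
            node = it->second;
        }
    }

    bool contains(const std::string& key) const {
        if (labels_.empty() || key.find('$') != std::string::npos) return false;
        std::string rest = key + '$';
        std::size_t node = 0;
        while (true) {
            const std::size_t i = mismatch(labels_[node], rest);
            if (i == rest.size()) return true;
            const auto it = edges_.find(Edge{node, i, rest[i]});
            if (it == edges_.end()) return false;
            rest = rest.substr(i + 1);
            node = it->second;
        }
    }

    const std::string& label(std::size_t node) const { return labels_.at(node); }
    std::size_t num_nodes() const { return labels_.size(); }

private:
    using Edge = std::tuple<std::size_t, std::size_t, char>;

    // First position where the two strings differ. Both end with '$' (or are
    // empty), so one can only be a prefix of the other when they are equal.
    static std::size_t mismatch(const std::string& a, const std::string& b) {
        std::size_t i = 0;
        while (i < a.size() && i < b.size() && a[i] == b[i]) ++i;
        assert(i == b.size() ? i == a.size() : i < a.size());
        return i;
    }

    std::vector<std::string> labels_;    // node label storage
    std::map<Edge, std::size_t> edges_;  // tree structure
};
```

Inserting technology, technics, and technique in this order gives node labels technology$, cs$, and ue$, connected by the edges (5, i) and (0, q).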
label
¤ Using array P where P[v] holds the pointer to the label of node v
¤ GOOD: Accessing a label in constant time
¤ BAD: Using 64 bits for each slot (too large)
[Figure: array P with slots for nodes v4, v1, v2, v6, v3 pointing to the labels technology$, cs$, ue$, lly$, cal$]
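The plain scheme above amounts to one pointer per slot. A minimal sketch, assuming node IDs are slot indices in [0, num_slots); the class name `PlainLabelStore` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Sketch only: array P where P[v] points to the label of node v.
class PlainLabelStore {
public:
    explicit PlainLabelStore(std::size_t num_slots) : P_(num_slots) {}

    void set(std::size_t v, std::string label) {
        assert(v < P_.size());
        P_[v] = std::make_unique<std::string>(std::move(label));
    }

    // Constant-time access; nullptr means node v has no label.
    const std::string* get(std::size_t v) const {
        assert(v < P_.size());
        return P_[v].get();
    }

    // Bits spent on the pointer array alone, including empty slots.
    std::size_t pointer_bits() const {
        return P_.size() * sizeof(P_[0]) * 8;
    }

private:
    std::vector<std::unique_ptr<std::string>> P_;
};
```

Every slot costs a full pointer whether or not the node exists, which on a typical 64-bit platform is the 64 bits per slot noted on the slide.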
¤ Grouping the node labels into groups of ℓ consecutive node IDs
¤ Concatenating the labels for each group
▪ #pointers is divided by ℓ
¤ Using bit array B such that B[v] = 1 if node v has a label
[Figure: with ℓ = 4, slots holding v4, v1, v2, v6, v3 and B = 1 0 1 1 0 1 0 1; P points to Group 1 = 3lly10technology2cs and Group 2 = 3cal2ue]
the target label position in the group
▪ Constant time using Popcnt
¤ Scanning the concatenated label string until the target label
▪ O(ℓ) time using Skipping technique (or constant time)
[Figure: with ℓ = 4, B = 1 0 1 1 0 1 0 1 and P’ = 3lly10technology2cs | 3cal2ue; Popcnt(B[v4..v1]) = 2, so the label of node v1 is the second in its group and is reached by skipping the first]
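The bitmap scheme can be sketched as follows. The sketch makes several simplifying assumptions that are not taken from the talk: ℓ ≤ 64 so that each group's part of B fits in one 64-bit word, a single-byte length header per label (labels shorter than 256 bytes), and std::string as the per-group buffer. The class name `BitmapLabelStore` is hypothetical.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Sketch only: one concatenated label buffer per group of ell slots.
class BitmapLabelStore {
public:
    BitmapLabelStore(std::size_t num_slots, std::size_t ell)
        : ell_(checked(ell)),
          bits_((num_slots + ell_ - 1) / ell_, 0),
          groups_(bits_.size()) {}

    bool has_label(std::size_t v) const {
        const std::size_t g = v / ell_;
        return g < bits_.size() && ((bits_[g] >> (v % ell_)) & 1) != 0;
    }

    // Stores the label of node v (which must not have one yet).
    void set(std::size_t v, const std::string& label) {
        assert(v / ell_ < bits_.size() && !has_label(v));
        assert(label.size() < 256);  // single-byte length header
        const std::size_t g = v / ell_;
        const std::size_t j = v % ell_;
        std::string entry(1, static_cast<char>(label.size()));
        entry += label;
        groups_[g].insert(skip(g, rank(g, j)), entry);
        bits_[g] |= std::uint64_t{1} << j;
    }

    std::optional<std::string> get(std::size_t v) const {
        if (!has_label(v)) return std::nullopt;
        const std::size_t g = v / ell_;
        const std::size_t pos = skip(g, rank(g, v % ell_));
        const std::size_t len = static_cast<unsigned char>(groups_[g][pos]);
        return groups_[g].substr(pos + 1, len);
    }

    // One pointer per group instead of one per slot.
    std::size_t num_pointers() const { return groups_.size(); }

private:
    static std::size_t checked(std::size_t ell) {
        assert(ell >= 1 && ell <= 64);
        return ell;
    }

    // Number of labelled slots before position j in group g (popcount).
    std::size_t rank(std::size_t g, std::size_t j) const {
        const std::uint64_t below = bits_[g] & ((std::uint64_t{1} << j) - 1);
        return std::bitset<64>(below).count();
    }

    // Byte offset reached after skipping k labels via their length headers.
    std::size_t skip(std::size_t g, std::size_t k) const {
        std::size_t pos = 0;
        for (std::size_t i = 0; i < k; ++i) {
            pos += 1 + static_cast<unsigned char>(groups_[g][pos]);
        }
        return pos;
    }

    std::size_t ell_;
    std::vector<std::uint64_t> bits_;  // bit array B, one word per group
    std::vector<std::string> groups_;  // concatenated labels, one buffer per group
};
```

With 8 slots and ℓ = 4, storing lly, technology, cs at slots 0, 2, 3 and cal, ue at slots 5, 7 reproduces the two groups in the figure with only two pointers. The rank here counts the labelled slots strictly before the target, so it equals the number of labels to skip.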
BITMAP (ℓ = 16)
¤ m-Bonsai: As a naïve trie (not as a dictionary)
¤ Judy: Trie dictionary developed by HP-Lab.
¤ HAT-trie: Hybrid dictionary of trie and array hashing
¤ Cedar: Minimal-prefix double-array trie dictionary
¨ Details
¤ Language: C++
¤ Associated value type: int (4 bytes)
¤ Keyword order: random
¤ GOOD: Space efficiency, BAD: Time performance
¤ The traversal speed of m-Bonsai is a bottleneck
▪ Xorshift random number generator can solve this problem
¨ Future work
¤ To improve m-Bonsai or engineer an alternative trie
¤ To support more complex operations (possible in principle)
▪ Invertible mapping between keywords and unique IDs
▪ Prefix-based operations
¤ To develop and publish a useful dictionary library
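The slides name Xorshift without showing how it is wired into m-Bonsai. For reference, the generic 64-bit Xorshift generator (shift triple 13, 7, 17) is sketched below; the class name `Xorshift64` is hypothetical, and how the generator would be applied to the trie is not specified here.

```cpp
#include <cassert>
#include <cstdint>

// Generic 64-bit Xorshift generator; the state must never be zero.
class Xorshift64 {
public:
    explicit Xorshift64(std::uint64_t seed) : state_(seed) {
        assert(seed != 0);
    }

    // Advances the state with three shift-and-xor steps and returns it.
    std::uint64_t next() {
        state_ ^= state_ << 13;
        state_ ^= state_ >> 7;
        state_ ^= state_ << 17;
        return state_;
    }

private:
    std::uint64_t state_;
};
```

Each step is an invertible linear map on the 64-bit state, so a nonzero state never becomes zero and no two states map to the same successor.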
limited
If you have any questions, please speak slowly and clearly :)
My experimental implementation is available at https://github.com/kampersanda/dynpdt