SPIRE 2017

PRACTICAL IMPLEMENTATION OF SPACE-EFFICIENT DYNAMIC KEYWORD DICTIONARIES
Shunsuke Kanda, Kazuhiro Morita and Masao Fuketa
Tokushima Univ., JAPAN
24th International Symposium on String Processing and Information Retrieval
26th–29th Sep. 2017, Palermo, ITALY

Keyword Dictionaries

- Associative array with string keys
  - Such as std::map<std::string, type> in C++
- Recent target data are massive [Martínez-Prieto+ 16]
  - NLP, IR, Web Graph, RDF, Bioinformatics, etc.

Example: the keywords String, Processing, And, Information, Retrieval are stored with the values 1, 2, 3, 4, 5; looking up Information returns 4.
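The example on this slide can be written directly with std::map, the container the slide itself names. This is only a baseline sketch: the helper names build_dictionary and lookup are made up for illustration, and the value -1 for a missing keyword is an assumption.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: assigns the values 1, 2, 3, ... to the keywords in order.
std::map<std::string, int> build_dictionary(const std::vector<std::string>& keywords) {
    std::map<std::string, int> dict;
    int next_value = 1;
    for (const std::string& keyword : keywords) {
        const bool inserted = dict.emplace(keyword, next_value).second;
        assert(inserted && "keywords are assumed to be distinct");
        (void)inserted;  // silences the unused-variable warning when NDEBUG is set
        ++next_value;
    }
    return dict;
}

// Hypothetical helper: returns the value of `keyword`, or `missing` if it is absent.
int lookup(const std::map<std::string, int>& dict, const std::string& keyword,
           int missing = -1) {
    const auto it = dict.find(keyword);
    return it == dict.end() ? missing : it->second;
}
```

With the five keywords of the slide, lookup(dict, "Information") gives 4. A std::map stores every key in full in its own tree node, which is the kind of space overhead the structures in this talk aim to reduce.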

Keyword Dictionaries (cont.)

- Research objective
  - Engineering practical implementation of space-efficient dynamic keyword dictionaries

[Figure: existing implementations arranged by type (Static or Dynamic) and by dictionary size (LARGE to SMALL): libCSD, path_decomposed_tries, Marisa-trie, Xcdat, Judy, Hat-Trie, libart, Cedar.]

Data structures related to our work: Trie & Path Decomposition

Trie

- Edge-labeled tree for storing a set of strings
  - Search time: O(k), where k denotes the query length

Keyword        Value
ideal$         1
ideology$      2
tec$           3
technology$    4
tie$           5
trie$          6

[Figure: trie for the six keywords, with edge labels such as t, ide, rie$, ec, hnology$, ology$, ie$, al$ and $; the search for technology$ is highlighted.]

Note: strictly speaking, this tree is a Patricia trie.
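A minimal sketch of the O(k) search, using the six keywords of the slide. For brevity it is a plain character-per-edge trie rather than the Patricia trie drawn in the figure. The class name SimpleTrie, the std::map child tables and the value -1 for "not found" are assumptions made for illustration. Each of the k steps is one child lookup, which costs O(log σ) in a std::map over an alphabet of size σ.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Illustrative character-per-edge trie; node 0 is the root.
class SimpleTrie {
public:
    SimpleTrie() : nodes_(1) {}

    // Keys are assumed to end with the terminator '$', as on the slide.
    void insert(const std::string& key, int value) {
        assert(!key.empty() && key.back() == '$');
        int v = 0;
        for (char c : key) {
            const auto it = nodes_[v].children.find(c);
            if (it == nodes_[v].children.end()) {
                const int child = static_cast<int>(nodes_.size());
                nodes_[v].children[c] = child;  // register the edge first,
                nodes_.emplace_back();          // then create the child node
                v = child;
            } else {
                v = it->second;
            }
        }
        nodes_[v].value = value;
    }

    // Follows one edge per query character: k steps for a query of length k.
    // Returns -1 if the key is not stored.
    int search(const std::string& key) const {
        int v = 0;
        for (char c : key) {
            const auto it = nodes_[v].children.find(c);
            if (it == nodes_[v].children.end()) return -1;
            v = it->second;
        }
        return nodes_[v].value;
    }

private:
    struct Node {
        std::map<char, int> children;  // edge character -> child node id
        int value = -1;                // -1 marks "no keyword ends here"
    };
    std::vector<Node> nodes_;
};
```

After inserting the six keywords with the values from the table, search("technology$") returns 4 and search("tech$") returns -1.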

Path Decomposition [Ferragina+ 08]

- Procedure of transforming a trie to improve the cache efficiency by lowering the height
  - Application: Static compressed dictionaries [Grossi+ 14]

[Figure: the example trie transformed into a Path-Decomposed Trie (PDT).]

New implementation of space-efficient dynamic keyword dictionaries
DynPDT: Dynamic Path-Decomposed Trie

Incremental Path Decomposition

- Incrementally constructing a PDT

EX1: Add key technology$ to an empty dictionary.
The PDT consists of the single node v1 with label technology$.

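Incremental Path Decomposition (cont.)

- Incrementally constructing a PDT

EX2: Add key technics$.
The key first differs from the label of v1, technology$, at position 5 (counting from 0), where the key has the character i. A new node v2 with label cs$, the rest of the key, is attached to v1 by the edge (5, i).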

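Incremental Path Decomposition (cont.)

- Incrementally constructing a PDT

EX3: Add key technique$.
The key differs from the label of v1 at position 5 with the character i, so the edge (5, i) is followed to v2 and the suffix que$ remains. This suffix differs from the label of v2, cs$, at position 0, where it has the character q. A new node v3 with label ue$ is attached to v2 by the edge (0, q).

The resulting structure is the Dynamic Path-Decomposed Trie (DynPDT).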
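The three insertions EX1 to EX3 can be replayed with the following sketch. It is not the authors' implementation: the class name DynPdtSketch, the std::map child tables and the 0-based node ids (so v1 is node 0) are assumptions made for illustration, whereas the talk stores the tree in m-Bonsai. Keys are assumed to contain '$' only as their final character.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

class DynPdtSketch {
public:
    struct Node {
        std::string label;                                     // key suffix held by the node
        std::map<std::pair<std::size_t, char>, int> children;  // edge (i, c) -> child id
        int value = -1;
    };

    // Inserts `key` and returns the id of the node that holds it.
    int insert(const std::string& key, int value) {
        assert(!key.empty() && key.find('$') == key.size() - 1);
        if (nodes_.empty()) return add_node(key, value);
        int v = 0;
        std::size_t pos = 0;  // start of the part of `key` not matched so far
        for (;;) {
            const std::size_t i = mismatch(key, pos, nodes_[v].label);
            if (i == nodes_[v].label.size()) {  // key is already stored: update it
                nodes_[v].value = value;
                return v;
            }
            const std::pair<std::size_t, char> edge(i, key[pos + i]);
            pos += i + 1;  // skip the matched part and the branching character
            const auto it = nodes_[v].children.find(edge);
            if (it == nodes_[v].children.end()) {
                const int child = add_node(key.substr(pos), value);
                nodes_[v].children[edge] = child;
                return child;
            }
            v = it->second;
        }
    }

    // Returns the value stored for `key`, or -1 if it is absent.
    int search(const std::string& key) const {
        if (nodes_.empty()) return -1;
        int v = 0;
        std::size_t pos = 0;
        for (;;) {
            const std::size_t i = mismatch(key, pos, nodes_[v].label);
            if (i == nodes_[v].label.size())
                return pos + i == key.size() ? nodes_[v].value : -1;
            if (pos + i >= key.size()) return -1;  // query ended inside a label
            const auto it = nodes_[v].children.find(std::make_pair(i, key[pos + i]));
            if (it == nodes_[v].children.end()) return -1;
            pos += i + 1;
            v = it->second;
        }
    }

    const Node& node(int v) const { return nodes_[v]; }

private:
    // Length of the common prefix of key[pos..] and label.
    static std::size_t mismatch(const std::string& key, std::size_t pos,
                                const std::string& label) {
        std::size_t i = 0;
        while (i < label.size() && pos + i < key.size() && key[pos + i] == label[i]) ++i;
        return i;
    }

    int add_node(const std::string& label, int value) {
        Node n;
        n.label = label;
        n.value = value;
        nodes_.push_back(n);
        return static_cast<int>(nodes_.size()) - 1;
    }

    std::vector<Node> nodes_;
};
```

Inserting technology$, technics$ and technique$ in this order creates nodes 0, 1 and 2 with the labels technology$, cs$ and ue$, connected by the edges (5, i) and (0, q), matching the figures.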

Implementation Approaches

- Tree representation (with edge labels)
  - Using a compact dynamic trie, m-Bonsai [Poyias+ 14]
- Node label management
  - Separately storing the labels from the tree structure
  - Pointer management is important

[Figure: the DynPDT for technology$, technics$ and technique$, with nodes v1, v2, v3, edges (5, i) and (0, q), and node labels technology$, cs$, ue$ stored apart from the tree.]


Plain Label Management (PLAIN)

- Introducing pointers to each label
  - Using array P where P[v] has the pointer of node v
  - GOOD: Accessing a label in constant time
  - BAD: Using 64 bits for each slot (too large)

[Figure: array P with slots for the nodes v4, v1, v2, v6, v3, pointing to the labels lly$, technology$, cs$, cal$, ue$.]
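A minimal sketch of the PLAIN layout, assuming one heap-allocated, NUL-terminated string per label. The class name PlainLabels and its member functions are hypothetical, and pointer_bytes() simply counts one native pointer per slot, which is 64 bits on the 64-bit machine assumed by the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <string>
#include <utility>
#include <vector>

class PlainLabels {
public:
    explicit PlainLabels(std::size_t num_slots) : P_(num_slots) {}

    // Stores a private copy of `label` for node v.
    void set(std::size_t v, const std::string& label) {
        assert(v < P_.size());
        std::unique_ptr<char[]> copy(new char[label.size() + 1]);
        std::memcpy(copy.get(), label.c_str(), label.size() + 1);
        P_[v] = std::move(copy);
    }

    // Constant-time access: one array lookup. Returns nullptr if v has no label.
    const char* get(std::size_t v) const {
        assert(v < P_.size());
        return P_[v].get();
    }

    // Pointer overhead: every slot pays for a pointer, labelled or not.
    std::size_t pointer_bytes() const { return P_.size() * sizeof(char*); }

private:
    std::vector<std::unique_ptr<char[]>> P_;  // P[v] points to the label of node v
};
```

Empty slots still cost a full pointer, which is why the overhead grows with the number of slots rather than with the number of labels.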

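Compact Label Management (BITMAP)

- Reducing the pointer overhead
  - Grouping the node labels into groups of ℓ consecutive node IDs
  - Concatenating the labels for each group
    - The number of pointers is divided by ℓ
  - Using bit array B such that B[v] = 1 if node v has a label

[Figure, with ℓ = 4: B = 1 0 1 1 0 1 0 1, where the set bits belong to v4, v1, v2, v6, v3 in this order. P has one pointer per group: Group 1 points to 3lly10technology2cs and Group 2 points to 3cal2ue. Each label is preceded by its length.]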

Compact Label Management (BITMAP)

- Access procedure
  - Calculating the target label position in the group
    - Constant time using Popcnt
  - Scanning the concatenated label string until the target label
    - O(ℓ) time using Skipping technique (or constant time)

[Figure, with ℓ = 4: B = 1 0 1 1 0 1 0 1 and P' pointing to 3lly10technology2cs and 3cal2ue. Popcnt(B[v4..v1]) = 2, so the label of node v1 is the second one in its group; it is reached by skipping over the first label.]
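The access procedure can be sketched as follows. This is an illustration under several assumptions rather than the authors' code: each group keeps its ℓ bits of B in one 64-bit word (so ℓ ≤ 64), every label is stored as a one-byte length followed by its bytes (so labels are at most 255 bytes), and the names BitmapLabels, rank and skip are invented here. The sketch counts the labelled slots strictly before the target, which is one less than the inclusive Popcnt shown in the figure. std::bitset::count stands in for a hardware popcount instruction.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

class BitmapLabels {
public:
    BitmapLabels(std::size_t num_slots, std::size_t ell)
        : ell_(ell), groups_((num_slots + ell - 1) / ell) {
        assert(ell >= 1 && ell <= 64);
    }

    // Stores `label` for slot v, which must not have a label yet.
    void set(std::size_t v, const std::string& label) {
        assert(v / ell_ < groups_.size() && label.size() <= 255);
        Group& g = groups_[v / ell_];
        const std::size_t j = v % ell_;
        assert(((g.bits >> j) & 1) == 0);
        const std::size_t pos = skip(g.labels, rank(g.bits, j));
        std::string record(1, static_cast<char>(label.size()));
        record += label;
        g.labels.insert(pos, record);
        g.bits |= std::uint64_t{1} << j;
    }

    // Copies the label of slot v into `out`; returns false if v has no label.
    bool get(std::size_t v, std::string& out) const {
        assert(v / ell_ < groups_.size());
        const Group& g = groups_[v / ell_];
        const std::size_t j = v % ell_;
        if (((g.bits >> j) & 1) == 0) return false;
        const std::size_t pos = skip(g.labels, rank(g.bits, j));
        const std::size_t len = static_cast<unsigned char>(g.labels[pos]);
        out = g.labels.substr(pos + 1, len);
        return true;
    }

    // Raw concatenated label string of group g (exposed for inspection).
    const std::string& group(std::size_t g) const { return groups_[g].labels; }

private:
    struct Group {
        std::uint64_t bits = 0;  // bit j is set iff slot g * ell + j has a label
        std::string labels;      // [length byte][label bytes] repeated
    };

    // Number of labelled slots before offset j in the group (one popcount).
    static std::size_t rank(std::uint64_t bits, std::size_t j) {
        const std::uint64_t mask = (std::uint64_t{1} << j) - 1;
        return std::bitset<64>(bits & mask).count();
    }

    // Byte offset of record number r: hops over r length-prefixed records.
    static std::size_t skip(const std::string& s, std::size_t r) {
        std::size_t pos = 0;
        while (r-- > 0) pos += 1 + static_cast<unsigned char>(s[pos]);
        return pos;
    }

    std::size_t ell_;
    std::vector<Group> groups_;
};
```

With 8 slots and ℓ = 4, storing lly, technology, cs, cal and ue at slots 0, 2, 3, 5 and 7 reproduces the two groups of the figure, and get(2, out) skips one record before reading technology. Because the length prefix lets the scan jump from record to record, at most ℓ - 1 hops are needed per access.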

Space & Time Experiments

Settings

- Machine
  - Intel Xeon E5540 @2.53 GHz CPU, 32GB RAM
  - Ubuntu Server 16.04 LTS
- Datasets
  - Wiki: Titles from English Wikipedia
    - Size: 227MiB, Keys: 11.5M, Ave. length: 20.7 bytes
  - WebBase: URLs from WebBase crawler
    - Size: 6.6GiB, Keys: 118.1M, Ave. length: 60.2 bytes
  - LUBM: URIs from LUBM benchmark
    - Size: 3.1GiB, Keys: 52.6M, Ave. length: 63.7 bytes

Settings (cont.)

- Data structures
  - DynPDT: PLAIN and BITMAP (ℓ = 16)
  - m-Bonsai: As a naïve trie (not as a dictionary)
  - Judy: Trie dictionary developed by HP Labs
  - HAT-trie: Hybrid dictionary of trie and array hashing
  - Cedar: Minimal-prefix double-array trie dictionary
- Details
  - Language: C++
  - Associated value type: int (4 bytes)
  - Keyword order: random

Results for Space Usage

Bytes per keyword:

                 Wiki    WebBase   LUBM
DynPDT+PLAIN     46.6    47.5      45.0
DynPDT+BITMAP    18.8    21.0      13.8
m-Bonsai         23.6    29.3      11.4
Judy             50.5    53.5      33.9
HAT-trie         40.2    68.9      64.7
Cedar            41.1 and 29.7 (N/A for one dataset)

Annotations on the chart: 3.3x, 4.7x

Results for Insertion Time

Microseconds per keyword:

                 Wiki    WebBase   LUBM
DynPDT+PLAIN     1.14    2.37      1.65
DynPDT+BITMAP    1.57    2.93      1.99
m-Bonsai         2.22    7.69      4.80
Judy             1.06    2.94      1.53
HAT-trie         1.13    1.75      2.58
Cedar            1.07 and 2.50 (N/A for one dataset)

Annotation on the chart: 2.6x

Results for Search Time

Microseconds per keyword:

                 Wiki    WebBase   LUBM
DynPDT+PLAIN     1.13    2.20      1.12
DynPDT+BITMAP    1.61    2.74      1.43
m-Bonsai         2.06    8.30      3.08
Judy             0.88    2.42      0.79
HAT-trie         0.35    0.80      0.51
Cedar            0.69 and 0.69 (N/A for one dataset)

Annotation on the chart: 4.6x

Summary

- Proposed a new dictionary structure, DynPDT
  - GOOD: Space efficiency, BAD: Time performance
  - The traversal speed of m-Bonsai is a bottleneck
    - Xorshift random number generator can solve this problem
- Future work
  - To improve m-Bonsai or engineer an alternative trie
  - To support more complex operations (possible in principle)
    - Invertible mapping between keywords and unique IDs
    - Prefix-based operations
  - To develop and publish a useful dictionary library

Thank you for your attention!

My English skills are limited. If you have any questions, please speak slowly and clearly :)

My experimental implementation is available at https://github.com/kampersanda/dynpdt
