Slide 1

Slide 1 text

PRACTICAL IMPLEMENTATION OF SPACE-EFFICIENT DYNAMIC KEYWORD DICTIONARIES
Shunsuke Kanda, Kazuhiro Morita and Masao Fuketa
Tokushima Univ., JAPAN
24th International Symposium on String Processing and Information Retrieval
26th–29th Sep. 2017, Palermo, ITALY

Slide 2

Slide 2 text

Keyword Dictionaries
- Associative array with string keys
  - Such as std::map in C++
- Recent target data are massive [Martínez-Prieto+ 16]
  - NLP, IR, Web Graph, RDF, Bioinformatics, etc.

Keyword      Value
String       1
Processing   2
And          3
Information  4
Retrieval    5

Example lookup: Information → 4
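The associative-array interface on this slide can be sketched with std::map, as the slide itself suggests. The helper names make_example_dictionary and lookup are hypothetical, and returning -1 for a missing keyword is an assumed convention, not something the slides specify.

```cpp
#include <map>
#include <string>

// Hypothetical helper: builds the five-entry example dictionary from the slide.
std::map<std::string, int> make_example_dictionary() {
    return {
        {"String", 1}, {"Processing", 2}, {"And", 3},
        {"Information", 4}, {"Retrieval", 5},
    };
}

// Looks up a keyword; returns -1 when it is absent (assumed convention).
int lookup(const std::map<std::string, int>& dict, const std::string& keyword) {
    const auto it = dict.find(keyword);
    return it == dict.end() ? -1 : it->second;
}
```

Calling lookup(make_example_dictionary(), "Information") yields the value 4 from the slide's example.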

Slide 3

Slide 3 text

Keyword Dictionaries (cont.)
- Research objective
  - Engineering a practical implementation of space-efficient dynamic keyword dictionaries

[Figure: existing dictionary libraries arranged by type (Static / Dynamic) and dictionary size (SMALL to LARGE). Static: libCSD, path_decomposed_tries, Marisa-trie, Xcdat. Dynamic: Judy, Hat-Trie, libart, Cedar.]

Slide 4

Slide 4 text

Data structures related to our work
Trie & Path Decomposition

Slide 5

Slide 5 text

Trie
- Edge-labeled tree for storing a set of strings
  - Search time: O(k), where k denotes the query length

Keyword      Value
ideal$       1
ideology$    2
tec$         3
technology$  4
tie$         5
trie$        6

[Figure: trie over the keywords above, with edge labels t, ide, rie$, ec, hnology$, ology$, ie$, al$, $ and leaf values 1 to 6; the search path for technology$ is highlighted.]
※ Strictly speaking, this tree is a Patricia trie.
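A minimal sketch of trie insertion and search, assuming a plain character-by-character trie rather than the Patricia trie drawn on the slide. The class name TrieSketch is hypothetical. Search visits one node per query character; with std::map children each step costs O(log σ) for alphabet size σ, so the O(k) bound holds when the alphabet size is treated as a constant.

```cpp
#include <map>
#include <memory>
#include <string>

// Plain trie: one node per character (assumption: no edge-label compaction).
class TrieSketch {
private:
    struct Node {
        std::map<char, std::unique_ptr<Node>> children;
        bool has_value = false;
        int value = 0;
    };
    Node root_;

public:
    // Inserts key or overwrites its value; walks one node per character.
    void insert(const std::string& key, int value) {
        Node* node = &root_;
        for (char c : key) {
            std::unique_ptr<Node>& child = node->children[c];
            if (!child) child.reset(new Node());
            node = child.get();
        }
        node->has_value = true;
        node->value = value;
    }

    // Returns a pointer to the stored value, or nullptr if key is absent.
    const int* find(const std::string& key) const {
        const Node* node = &root_;
        for (char c : key) {
            const auto it = node->children.find(c);
            if (it == node->children.end()) return nullptr;
            node = it->second.get();
        }
        return node->has_value ? &node->value : nullptr;
    }
};
```

Inserting the six keywords from the slide and calling find("technology$") returns a pointer to the value 4.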

Slide 6

Slide 6 text

Path Decomposition [Ferragina+ 08]
- Procedure of transforming a trie to improve cache efficiency by lowering the height
  - Application: Static compressed dictionaries [Grossi+ 14]

[Figure: the trie with edge labels t, ide, rie$, ec, hnology$, ology$, ie$, al$, $ and the corresponding Path-Decomposed Trie (PDT), whose root node holds the path label technology$.]

Slide 7

Slide 7 text

New implementation of space-efficient dynamic keyword dictionaries
DynPDT: Dynamic Path-Decomposed Trie

Slide 8

Slide 8 text

Incremental Path Decomposition
- Incrementally constructing a PDT

EX1: Add key technology$ to an empty dictionary
[Figure: a single node v1 with label technology$.]

Slide 9

Slide 9 text

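Incremental Path Decomposition (cont.)
- Incrementally constructing a PDT

EX2: Add key technics$
[Figure: technics$ is matched against the label technology$ of v1 (positions 0 to 5 marked); node v2 with label cs$ is attached to v1 by edge (5, i).]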

Slide 10

Slide 10 text

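Incremental Path Decomposition (cont.)
- Incrementally constructing a PDT

EX3: Add key technique$
[Figure: technique$ follows edge (5, i) from v1 (technology$) to v2 (cs$); the remainder que$ mismatches cs$ at position 0, so node v3 with label ue$ is attached to v2 by edge (0, q). The result is a Dynamic Path-Decomposed Trie (DynPDT).]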
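The three insertions EX1 to EX3 can be reproduced with a small pointer-based sketch. This is not the m-Bonsai-based layout described later in the slides: the class name DynPdtSketch, the std::map of children, and the returned value pointer are all assumptions made for illustration. Every key is assumed to end with the terminator $, so no key is a proper prefix of another.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Pointer-based sketch of incremental path decomposition (illustrative only).
class DynPdtSketch {
private:
    // Edge = (mismatch position in the parent label, branching character).
    typedef std::pair<std::size_t, char> Edge;
    struct Node {
        std::string label;
        int value;
        std::map<Edge, std::unique_ptr<Node>> children;
        Node(const std::string& l, int v) : label(l), value(v) {}
    };
    std::unique_ptr<Node> root_;

    // Length of the common prefix of label and key[begin..].
    static std::size_t mismatch(const std::string& label, const std::string& key,
                                std::size_t begin) {
        std::size_t i = 0;
        while (i < label.size() && begin + i < key.size() &&
               label[i] == key[begin + i]) ++i;
        return i;
    }

public:
    // Returns true if key was newly added, false if it was already present.
    bool insert(const std::string& key, int value) {
        if (!root_) {
            root_.reset(new Node(key, value));
            return true;
        }
        Node* node = root_.get();
        std::size_t begin = 0;  // start of the not-yet-matched suffix of key
        for (;;) {
            const std::size_t i = mismatch(node->label, key, begin);
            if (begin + i == key.size()) {
                assert(i == node->label.size());  // holds for '$'-terminated keys
                return false;
            }
            const Edge edge(i, key[begin + i]);
            const auto it = node->children.find(edge);
            if (it == node->children.end()) {
                // The new node keeps the rest of the key after the branching character.
                node->children[edge].reset(new Node(key.substr(begin + i + 1), value));
                return true;
            }
            node = it->second.get();
            begin += i + 1;
        }
    }

    // Returns a pointer to the stored value, or nullptr if key is absent.
    const int* find(const std::string& key) const {
        const Node* node = root_.get();
        std::size_t begin = 0;
        while (node) {
            const std::size_t i = mismatch(node->label, key, begin);
            if (begin + i == key.size()) {
                return i == node->label.size() ? &node->value : nullptr;
            }
            const auto it = node->children.find(Edge(i, key[begin + i]));
            if (it == node->children.end()) return nullptr;
            node = it->second.get();
            begin += i + 1;
        }
        return nullptr;
    }
};
```

Inserting technology$, technics$ and technique$ in that order creates the labels technology$, cs$ and ue$ with edges (5, i) and (0, q), matching the figures.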

Slide 11

Slide 11 text

Implementation Approaches
- Tree representation (with edge labels)
  - Using a compact dynamic trie, namely m-Bonsai [Poyias+ 14]
- Node label management
  - Storing the labels separately from the tree structure
  - Pointer management is important

[Figure: tree with nodes v1, v2, v3 and edges (5, i), (0, q); the node labels technology$, cs$, ue$ are stored separately from the tree; the registered keys are technology$, technics$, technique$.]

Slide 13

Slide 13 text

Plain Label Management (PLAIN)
- Introducing a pointer to each label
  - Using array P, where P[v] holds the pointer to the label of node v
  - GOOD: Accessing a label in constant time
  - BAD: Using 64 bits for each slot (too large)

[Figure: array P with slots for nodes v4, v1, v2, v6, v3, each pointing to one of the labels technology$, cs$, ue$, lly$, cal$.]
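A minimal sketch of the PLAIN idea, assuming node IDs index directly into P and each label is a separately allocated, NUL-terminated character array. The class name PlainLabels and the pointer_bytes helper are hypothetical; they only illustrate why one full pointer per slot is costly on a 64-bit machine.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// PLAIN sketch: P[v] owns a pointer to the label of node v (null if v has no label).
class PlainLabels {
public:
    explicit PlainLabels(std::size_t num_slots) : p_(num_slots) {}

    // Stores a private copy of label for node v.
    void set(std::size_t v, const std::string& label) {
        assert(v < p_.size());
        std::unique_ptr<char[]> copy(new char[label.size() + 1]);
        std::memcpy(copy.get(), label.c_str(), label.size() + 1);
        p_[v] = std::move(copy);
    }

    // Constant-time access: one array lookup, then the pointer itself.
    const char* get(std::size_t v) const {
        assert(v < p_.size());
        return p_[v].get();
    }

    // Bytes spent on pointers alone, counted for every slot, labeled or not.
    std::size_t pointer_bytes() const { return p_.size() * sizeof(char*); }

private:
    std::vector<std::unique_ptr<char[]>> p_;
};
```

With 8 slots on a typical 64-bit platform, pointer_bytes() reports 64 bytes before any label bytes are counted, which is the overhead the BAD bullet refers to.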

Slide 14

Slide 14 text

Compact Label Management (BITMAP)
- Reducing the pointer overhead
  - Grouping the node labels into groups of ℓ consecutive node IDs
  - Concatenating the labels for each group
    - The number of pointers is divided by ℓ
  - Using bit array B such that B[v] = 1 if node v has a label

Example with ℓ = 4:
node  v4  .   v1  v2  |  .   v6  .   v3
B     1   0   1   1   |  0   1   0   1
P     Group 1: 3lly10technology2cs
      Group 2: 3cal2ue

Slide 15

Slide 15 text

Compact Label Management (BITMAP)
- Access procedure
  - Calculating the position of the target label within its group
    - Constant time using Popcnt
  - Scanning the concatenated label string until the target label
    - O(ℓ) time using the skipping technique (or constant time)

Example with ℓ = 4: Popcnt(B[v4..v1]) = 2, so the label of node v1 is the second label in its group; skipping over 3lly reaches 10technology.
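The grouping and access procedure can be sketched as follows. Several details are assumptions, not taken from the slides: the class name BitmapLabels, a one-byte length prefix (so labels are shorter than 256 bytes), one 64-bit bitmap word per group (so ℓ ≤ 64), and a std::string per group standing in for the pointer array P. Labels are stored without the $ terminator, as in the slide's example. The rank here counts labeled slots strictly before v, which is the slide's inclusive Popcnt minus one.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// BITMAP sketch: one concatenated, length-prefixed label string per group of ell IDs.
class BitmapLabels {
public:
    BitmapLabels(std::size_t num_slots, std::size_t ell) : ell_(ell) {
        assert(ell >= 1 && ell <= 64);
        bits_.assign((num_slots + ell - 1) / ell, 0);
        groups_.resize(bits_.size());
    }

    // B[v]: true if node v has a label.
    bool has(std::size_t v) const {
        assert(v / ell_ < bits_.size());
        return ((bits_[v / ell_] >> (v % ell_)) & 1u) != 0;
    }

    // Inserts the label of node v at its rank position inside the group string.
    void set(std::size_t v, const std::string& label) {
        assert(!has(v) && label.size() < 256);
        std::string& group = groups_[v / ell_];
        const std::size_t pos = skip(group, rank(v));
        group.insert(pos, std::string(1, static_cast<char>(label.size())) + label);
        bits_[v / ell_] |= std::uint64_t(1) << (v % ell_);
    }

    // Popcount gives the rank in the group; skipping reaches the target label.
    std::string get(std::size_t v) const {
        assert(has(v));
        const std::string& group = groups_[v / ell_];
        const std::size_t pos = skip(group, rank(v));
        return group.substr(pos + 1, static_cast<unsigned char>(group[pos]));
    }

private:
    // Number of labeled slots before v inside v's group.
    std::size_t rank(std::size_t v) const {
        const std::uint64_t mask = (std::uint64_t(1) << (v % ell_)) - 1;
        return std::bitset<64>(bits_[v / ell_] & mask).count();
    }

    // Skips k length-prefixed labels; returns the offset of the next length byte.
    static std::size_t skip(const std::string& group, std::size_t k) {
        std::size_t pos = 0;
        for (; k > 0; --k) pos += 1 + static_cast<unsigned char>(group[pos]);
        return pos;
    }

    std::size_t ell_;
    std::vector<std::uint64_t> bits_;
    std::vector<std::string> groups_;
};
```

With ℓ = 4 and labels set at IDs 0, 2, 3, 5 and 7, the two group strings hold the same concatenations as the slide (3lly10technology2cs and 3cal2ue, with the lengths stored as single bytes). Under these assumptions the pointer cost drops from 64 bits per slot to roughly 64/ℓ + 1 bits per slot, length bytes aside.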

Slide 16

Slide 16 text

Space & Time Experiments

Slide 17

Slide 17 text

Settings
- Machine
  - Intel Xeon E5540 @ 2.53 GHz CPU, 32 GB RAM
  - Ubuntu Server 16.04 LTS
- Datasets
  - Wiki: Titles from English Wikipedia
    - Size: 227 MiB, Keys: 11.5M, Ave. length: 20.7 bytes
  - WebBase: URLs from WebBase crawler
    - Size: 6.6 GiB, Keys: 118.1M, Ave. length: 60.2 bytes
  - LUBM: URIs from LUBM benchmark
    - Size: 3.1 GiB, Keys: 52.6M, Ave. length: 63.7 bytes

Slide 18

Slide 18 text

Settings (cont.)
- Data structures
  - DynPDT: PLAIN and BITMAP (ℓ = 16)
  - m-Bonsai: As a naïve trie (not as a dictionary)
  - Judy: Trie dictionary developed by HP Labs
  - HAT-trie: Hybrid dictionary of trie and array hashing
  - Cedar: Minimal-prefix double-array trie dictionary
- Details
  - Language: C++
  - Associated value type: int (4 bytes)
  - Keyword order: random

Slide 19

Slide 19 text

Results for Space Usage
Bytes per keyword (Wiki / WebBase / LUBM)
- DynPDT+PLAIN: 46.6 / 47.5 / 45.0
- DynPDT+BITMAP: 18.8 / 21.0 / 13.8
- m-Bonsai: 23.6 / 29.3 / 11.4
- Judy: 50.5 / 53.5 / 33.9
- HAT-trie: 40.2 / 68.9 / 64.7
- Cedar: 41.1 and 29.7 (N/A for one dataset)
Chart annotations: 3.3x, 4.7x

Slide 20

Slide 20 text

Results for Insertion Time
Microseconds per keyword (Wiki / WebBase / LUBM)
- DynPDT+PLAIN: 1.14 / 2.37 / 1.65
- DynPDT+BITMAP: 1.57 / 2.93 / 1.99
- m-Bonsai: 2.22 / 7.69 / 4.80
- Judy: 1.06 / 2.94 / 1.53
- HAT-trie: 1.13 / 1.75 / 2.58
- Cedar: 1.07 and 2.50 (N/A for one dataset)
Chart annotation: 2.6x

Slide 21

Slide 21 text

Results for Search Time
Microseconds per keyword (Wiki / WebBase / LUBM)
- DynPDT+PLAIN: 1.13 / 2.20 / 1.12
- DynPDT+BITMAP: 1.61 / 2.74 / 1.43
- m-Bonsai: 2.06 / 8.30 / 3.08
- Judy: 0.88 / 2.42 / 0.79
- HAT-trie: 0.35 / 0.80 / 0.51
- Cedar: 0.69 and 0.69 (N/A for one dataset)
Chart annotation: 4.6x

Slide 22

Slide 22 text

Summary
- Proposing a new dictionary structure, DynPDT
  - GOOD: Space efficiency, BAD: Time performance
  - The traversal speed of m-Bonsai is a bottleneck
    - Xorshift random number generator can solve this problem
- Future work
  - To improve m-Bonsai or engineer an alternative trie
  - To support more complex operations (possible in principle)
    - Invertible mapping between keywords and unique IDs
    - Prefix-based operations
  - To develop and publish a useful dictionary library

Slide 23

Slide 23 text

Thank you for your attention!
My English skills are limited. If you have any questions, please speak slowly and clearly :)
My experimental implementation is available at https://github.com/kampersanda/dynpdt