Ph.D. Thesis

Space- and Time-Efficient String Dictionaries
Shunsuke Kanda (D2) @ Fuketa Lab.

Background
- Management of massive data is a fundamental problem
- Most such data are handled as strings
  - Such as documents, Web pages, and genomics data
- Data structures and algorithms for space-efficient string processing have been developed by many researchers

String Dictionaries
- Data structure for storing a set of strings
  - Such as std::map<std::string, type> in C++
- Classical applications
  - Manage document vocabulary in NLP and IR tasks
[Figure: Heaps' Law. Vocabulary size versus document length; the vocabulary grows as O(n^β) for 0.4 < β < 0.6, so a terabyte-size document yields a megabyte-size dictionary.]
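As a rough illustration of the Heaps' Law figure, the growth can be written with a constant K. The constant K and the concrete numbers below are assumptions chosen only to make the arithmetic easy; they are not values from the slides.

```latex
V(n) = K\,n^{\beta}, \qquad 0.4 < \beta < 0.6
\\[4pt]
n = 10^{12},\quad \beta = 0.5,\quad K = 1
\;\Longrightarrow\;
V(n) = \left(10^{12}\right)^{0.5} = 10^{6}
```

Under these assumed values, a collection six orders of magnitude larger than its vocabulary matches the "terabyte-size document, megabyte-size dictionary" message of the figure.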

String Dictionaries (cont)
- Recent applications [Martínez-Prieto+ 16]
  - Natural language applications handling Web collections and N-grams, such as Web search engines and IMEs
  - RDF stores in Semantic Web managing URIs, blank nodes, and literal values
  - Web graphs and crawlers storing massive URLs
  - Alignment software in Bioinformatics indexing large k-mers
  - NoSQL handling massive key-value stores
  - GISs storing large geographic data within a limited-resource navigation system
- The size will easily exceed gigabytes!

Our Contributions
- We developed some novel data structures to solve existing problems of string dictionaries
  - Static compressed double-array tries supporting fast lookup [Kanda+, KAIS 16] [Kanda+, KAIS 17]
  - Lightweight dictionary compression using dictionary encoding, instead of using powerful text compression [Kanda+, Innovate-Data 17] [Kanda+, DBSJ 18]
  - Dynamic path-decomposed tries in compact space [Kanda+, SPIRE 17]
  - Practical rearrangement methods of dynamic double-array tries [Kanda+, SPE 18]
- Note: Due to the time limitation, only some of these contributions are covered in this presentation

Static Compressed Double-Array Tries Supporting Fast Lookup
The contents are based on the paper in the journal Knowledge and Information Systems (KAIS), 2017

Static String Dictionaries (SSD)
- Provide a bijection between a set of strings X and integer IDs in the range [1, |X|]
  - Lookup(x) returns the ID if x ∈ X
  - Access(i) restores the string from the given ID i
- Example: the set {idea, ism, teach, tech} with IDs 1, 2, 3, 4 provides
  - Lookup(teach) = 3
  - Access(3) = teach
  - PrefixLookup(te) = {teach, tech}
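The three operations can be sketched with a sorted array of strings. This is only a minimal illustration of the interface, assuming lexicographic order defines the IDs; the class name `SortedStringDictionary` and its method names are hypothetical, and nothing here is compressed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical teaching class: an uncompressed static string dictionary.
class SortedStringDictionary {
public:
    explicit SortedStringDictionary(std::vector<std::string> keys) : keys_(std::move(keys)) {
        std::sort(keys_.begin(), keys_.end());
        keys_.erase(std::unique(keys_.begin(), keys_.end()), keys_.end());
    }

    // Returns the 1-based ID of x, or 0 if x is not in the set.
    std::size_t lookup(const std::string& x) const {
        const auto it = std::lower_bound(keys_.begin(), keys_.end(), x);
        if (it == keys_.end() || *it != x) return 0;
        return static_cast<std::size_t>(it - keys_.begin()) + 1;
    }

    // Restores the string for an ID in [1, |X|].
    const std::string& access(std::size_t id) const {
        assert(id >= 1 && id <= keys_.size());
        return keys_[id - 1];
    }

    // Returns every stored string that starts with the given prefix.
    std::vector<std::string> prefix_lookup(const std::string& prefix) const {
        std::vector<std::string> result;
        auto it = std::lower_bound(keys_.begin(), keys_.end(), prefix);
        for (; it != keys_.end() && it->compare(0, prefix.size(), prefix) == 0; ++it) {
            result.push_back(*it);
        }
        return result;
    }

private:
    std::vector<std::string> keys_;
};
```

With the four strings idea, ism, teach, and tech, `lookup("teach")` gives 3 and `access(3)` gives teach, matching the example on the slide. Binary search costs O(k lg |X|) per lookup for a key of length k, which is one reason tries are attractive.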

State-of-the-art Compressed SSDs
- Based on closed hashing
  - HashDAC-rp [Martínez-Prieto+ 16]
- Based on Front Coding
  - HTFC-rp [Martínez-Prieto+ 16]
- Based on suffix arrays
  - FM-index [Ferragina+ 07]
- Based on tries
  - MARISA [Yata 11]
  - XBW [Ferragina+ 09]
  - Cent-rp [Grossi+ 16]
  - LZ string dictionaries [Arz+ 14]
  - XCDA [Kanda+ 17] (new)
[Figure: the approaches are compared in terms of space, time, and functionality. Note: the comparison depends largely on datasets.]

Tries [Fredkin 60, Knuth 88]
- Edge-labeled tree for storing a set of strings
- Lookup/Access are supported in O(k) optimal time
  - k is the length of the target string
- Example: the set {idea, ism, teach, tech, tie, trie}
  - Lookup(tech) = 4
  - Access(4) = tech
  - PrefixLookup(te) = {teach, tech}
[Figure: trie storing idea, ism, teach, tech, tie, and trie.]
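A pointer-based trie makes the O(k) bounds concrete: Lookup walks down one edge per character, and Access walks up parent links from the node where the key ends. The sketch below is an assumption-laden illustration (the class `TrieSketch`, its members, and the insertion-order ID assignment are all hypothetical), not one of the compact representations discussed in the talk.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical teaching class: an uncompressed trie with parent links.
class TrieSketch {
public:
    TrieSketch() : nodes_(1) {}  // node 0 is the root

    // Inserts key and returns its ID; new keys get IDs 1, 2, ... in insertion order.
    std::size_t insert(const std::string& key) {
        std::size_t node = 0;
        for (char c : key) {
            const auto it = nodes_[node].children.find(c);
            if (it != nodes_[node].children.end()) {
                node = it->second;
                continue;
            }
            const std::size_t fresh = nodes_.size();
            nodes_[node].children.emplace(c, fresh);
            nodes_.emplace_back();
            nodes_[fresh].parent = node;
            nodes_[fresh].label = c;
            node = fresh;
        }
        if (nodes_[node].id == 0) {
            terminals_.push_back(node);
            nodes_[node].id = terminals_.size();
        }
        return nodes_[node].id;
    }

    // Walks down one edge per character: O(k) node visits. Returns 0 if absent.
    std::size_t lookup(const std::string& key) const {
        std::size_t node = 0;
        for (char c : key) {
            const auto it = nodes_[node].children.find(c);
            if (it == nodes_[node].children.end()) return 0;
            node = it->second;
        }
        return nodes_[node].id;
    }

    // Walks up the parent links and reverses the collected labels: O(k) node visits.
    std::string access(std::size_t id) const {
        assert(id >= 1 && id <= terminals_.size());
        std::string key;
        for (std::size_t node = terminals_[id - 1]; node != 0; node = nodes_[node].parent) {
            key.push_back(nodes_[node].label);
        }
        std::reverse(key.begin(), key.end());
        return key;
    }

private:
    struct Node {
        std::size_t parent = 0;
        char label = 0;
        std::size_t id = 0;  // 0 means that no key ends at this node
        std::map<char, std::size_t> children;
    };
    std::vector<Node> nodes_;
    std::vector<std::size_t> terminals_;  // terminals_[id - 1] is the node of key id
};
```

Inserting idea, ism, teach, tech, tie, and trie in that order reproduces the slide's example: `lookup("tech")` gives 4 and `access(4)` gives tech.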

Double Arrays [Aoe 89]
- Trie representation technique using two arrays
  - BASE stores offsets to children
  - CHECK stores back pointers to parents
- Provide the fastest Lookup/Access in tries
- Child(s, c) = BASE[s] + c iff CHECK[BASE[s] + c] = s
- Parent(t) = CHECK[t]
- Note: XOR is often used instead of PLUS in practice
[Figure: node s with children t1 and t2 reached by labels c1 and c2; BASE[s] = bs, t1 = bs + c1, t2 = bs + c2, and CHECK[t1] = CHECK[t2] = s.]
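The Child and Parent expressions can be exercised with a small builder. The sketch below assumes the PLUS variant, a first-fit search for each BASE value, and a zero byte appended to every key as a terminator; the class `DoubleArraySketch` and all of its members are hypothetical, and practical implementations use much faster placement strategies. It requires C++17.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical teaching class: a naive double array (PLUS variant, first-fit placement).
class DoubleArraySketch {
public:
    explicit DoubleArraySketch(const std::vector<std::string>& keys) {
        // Step 1: build a plain trie; every key is closed by the terminator label.
        std::vector<std::map<unsigned char, int>> trie(1);
        for (const std::string& key : keys) {
            assert(key.find('\0') == std::string::npos);  // zero byte is reserved
            int node = 0;
            for (std::size_t i = 0; i <= key.size(); ++i) {
                const unsigned char c = label_at(key, i);
                const auto it = trie[node].find(c);
                if (it != trie[node].end()) {
                    node = it->second;
                    continue;
                }
                const int fresh = static_cast<int>(trie.size());
                trie[node].emplace(c, fresh);
                trie.emplace_back();
                node = fresh;
            }
        }
        // Step 2: give each trie node an address; the root lives at address 0.
        base_.assign(1, 0);
        check_.assign(1, kEmpty);
        std::vector<std::pair<int, int>> todo{{0, 0}};  // (trie node, address)
        while (!todo.empty()) {
            const auto [node, s] = todo.back();
            todo.pop_back();
            if (trie[node].empty()) continue;  // leaf: no BASE value needed
            int b = 1;
            while (!fits(b, trie[node])) ++b;  // first offset whose slots are all free
            base_[s] = b;
            for (const auto& [c, next] : trie[node]) {
                const int t = b + c;
                if (static_cast<std::size_t>(t) >= check_.size()) {
                    base_.resize(t + 1, 0);
                    check_.resize(t + 1, kEmpty);
                }
                check_[t] = s;  // back pointer to the parent
                todo.emplace_back(next, t);
            }
        }
    }

    // Child(s, c) = BASE[s] + c iff CHECK[BASE[s] + c] = s; returns -1 otherwise.
    int child(int s, unsigned char c) const {
        const std::size_t t = static_cast<std::size_t>(base_[s]) + c;
        return (t < check_.size() && check_[t] == s) ? static_cast<int>(t) : kEmpty;
    }

    // Parent(t) = CHECK[t]; the root reports -1.
    int parent(int t) const { return check_[t]; }

    bool contains(const std::string& key) const {
        int s = 0;
        for (std::size_t i = 0; i <= key.size() && s != kEmpty; ++i) {
            s = child(s, label_at(key, i));
        }
        return s != kEmpty;
    }

private:
    static constexpr int kEmpty = -1;
    static constexpr unsigned char kTerminator = 0;

    static unsigned char label_at(const std::string& key, std::size_t i) {
        return i < key.size() ? static_cast<unsigned char>(key[i]) : kTerminator;
    }

    bool fits(int b, const std::map<unsigned char, int>& children) const {
        for (const auto& entry : children) {
            const std::size_t t = static_cast<std::size_t>(b) + entry.first;
            if (t < check_.size() && check_[t] != kEmpty) return false;
        }
        return true;
    }

    std::vector<int> base_;
    std::vector<int> check_;
};
```

Each step of `contains` is one addition, one array read, and one comparison, which is where the lookup speed of double arrays comes from. The unused slots left behind by first-fit placement are the "empty elements" that make the arrays longer than the number of nodes.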

Shortcoming of Double Arrays
- BASE and CHECK are not compact: they take Ω(n lg n) bits
  - Information-theoretical lower bound is n lg σ + O(n) bits for a trie with n nodes over an alphabet of size σ
- Existing compressed forms have critical problems
  - CDA [Yata+ 07] is still large and sacrifices Access
  - DALF [Kanda+ 16] is smaller than CDA but sacrifices Access
[Figure: BASE and CHECK arrays, each with Ω(n) elements. Note: some empty elements can be included.]

Novel Compressed Data Structure
- XOR-compressed double arrays (XCDA)
  - Represent each BASE/CHECK element in 8 bits empirically through a simple approach
  - The space efficiency is nearly equal to that of DALF
  - Support both Lookup and Access
[Figure: a double array stores each BASE/CHECK element in 32/64 bits; XCDA stores each BASE'/CHECK' element in about 8 bits.]

Basic Idea
- Property of double arrays
  - Node addresses can be freely arranged as long as the following expressions are satisfied for all nodes
- Child(s, c) = BASE[s] + c iff CHECK[BASE[s] + c] = s
- Parent(t) = CHECK[t]
[Figure: node s with children t1 and t2 reached by labels c1 and c2; BASE[s] = bs, t1 = bs + c1, t2 = bs + c2, and CHECK[t1] = CHECK[t2] = s.]

Basic Idea (cont)
- Property of double arrays
  - Node addresses can be freely arranged as long as the Child and Parent expressions are satisfied for all nodes
- By taking CHECK[s] XOR s instead of CHECK[s], we can obtain values most of which are within one byte
  - Note: BASE is also the same
- Representation: modified version [Kanda+ 17] of DACs [Brisaboa+ 11]
[Figure: two scatter plots over addresses s from 0 to 250000. Left: CHECK[s] versus s. Right: CHECK[s] XOR s versus s.]
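The effect of storing CHECK[s] XOR s can be sketched with a one-byte array plus an exception table. This is a deliberate simplification and not DACs or the modified version used by XCDA: the class `XorByteArraySketch`, the escape value 255, and the hash-map overflow table are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical teaching class: stores values[s] XOR s in one byte when possible.
class XorByteArraySketch {
public:
    explicit XorByteArraySketch(const std::vector<std::uint32_t>& values)
        : bytes_(values.size()) {
        for (std::size_t s = 0; s < values.size(); ++s) {
            const std::uint32_t x = values[s] ^ static_cast<std::uint32_t>(s);
            if (x < kEscape) {
                bytes_[s] = static_cast<std::uint8_t>(x);  // common case: fits in one byte
            } else {
                bytes_[s] = kEscape;                       // rare case: keep the full value aside
                overflow_.emplace(s, x);
            }
        }
    }

    // Recovers the original value by undoing the XOR with the address.
    std::uint32_t operator[](std::size_t s) const {
        assert(s < bytes_.size());
        const std::uint32_t x = bytes_[s] == kEscape ? overflow_.at(s) : bytes_[s];
        return x ^ static_cast<std::uint32_t>(s);
    }

    std::size_t overflow_count() const { return overflow_.size(); }

private:
    static constexpr std::uint8_t kEscape = 255;
    std::vector<std::uint8_t> bytes_;
    std::unordered_map<std::size_t, std::uint32_t> overflow_;
};
```

When parent and child addresses share their high bits, the XOR cancels those bits and the remainder fits in the byte array; only the few far-apart pairs fall into the overflow table. DACs play a similar role with a level-wise layout instead of a hash map.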

Experimental Settings
- Machine
  - Quad-Core Intel Xeon 2 x 2.4 GHz CPU
  - 16 GB RAM
- Language
  - C++
  - Apple LLVM version 7.0.2 (clang-700)
  - Optimization -O3
- Datasets
  - Geographic names on the asciiname column from GeoNames dump
  - URLs of a 2005 crawl by the UbiCrawler on the .uk domain

Experimental Settings (cont)
- Based on closed hashing
  - HashDAC-rp [Martínez-Prieto+ 16]
- Based on Front Coding
  - HTFC-rp [Martínez-Prieto+ 16]
- Based on suffix arrays
  - FM-index [Ferragina+ 07]
- Based on tries
  - MARISA [Yata 11]
  - XBW [Ferragina+ 09]
  - Cent-rp [Grossi+ 16]
  - LZ string dictionaries [Arz+ 14]
  - XCDA [Kanda+ 17]
- Double-array baselines
  - Original double array, DA [Aoe 89]
  - Previous compressed double array, DALF [Kanda+ 16]
[Figure: the approaches are compared in terms of space, time, and functionality.]

Experimental Results
- Geographic names on the asciiname column from GeoNames
  - Raw size: 106.1 MiB
  - # of strings: 6,784,722
  - Ave. length: 15.6 bytes

              Constr (sec)      Cmpr (%)       Lookup (μs/str)   Access (μs/str)
XCDA            5.7 (base)     52.8 (base)     0.93 (base)       1.29 (base)
DA              4.9 (0.9x)     95.8 (1.8x)     0.61 (0.7x)       0.95 (0.7x)
DALF            9.5 (1.7x)     52.8 (1.0x)     0.80 (0.9x)       -
Cent-rp        33.8 (5.9x)     31.5 (0.6x)     2.10 (2.3x)       2.17 (1.7x)
HTFC-rp       125.1 (21.9x)    34.4 (0.7x)     3.50 (3.8x)       1.79 (1.4x)
HashDAC-rp    298.9 (52.4x)    48.0 (0.9x)     1.28 (1.4x)       0.92 (0.7x)

Note: To measure Lookup/Access, we extracted a million strings at random

Experimental Results (cont)
- URLs of a 2005 crawl by the UbiCrawler on the .uk domain
  - Raw size: 2,855.5 MiB
  - # of strings: 39,459,925
  - Ave. length: 72.4 bytes

              Constr (sec)         Cmpr (%)       Lookup (μs/str)   Access (μs/str)
XCDA             71.7 (base)      25.2 (base)     2.70 (base)       3.54 (base)
DA               65.9 (0.9x)      43.8 (1.7x)     1.95 (0.7x)       2.93 (0.8x)
DALF            110.1 (1.5x)      24.1 (1.0x)     6.00 (2.2x)       -
Cent-rp         472.7 (6.6x)      17.5 (0.7x)     4.02 (1.5x)       4.47 (1.3x)
HTFC-rp       12598.4 (175.7x)    18.3 (0.7x)     7.96 (2.9x)       4.41 (1.2x)
HashDAC-rp    Could not be constructed in practical time

Note: To measure Lookup/Access, we extracted a million strings at random

Dynamic Path-Decomposed Tries
The contents are based on the paper in the proceedings of the 24th International Symposium on String Processing and Information Retrieval (SPIRE), 2017

Dynamic String Dictionaries (DSD)
- Provide a mapping from a set of strings X to values of any type
  - Search(x) returns the associated value if x ∈ X
  - Insert(x, v) adds the string x associated with value v to the dictionary
  - Delete(x) erases the string x from the dictionary
- Example: Search(teach) = 12
[Figure: the strings idea, ism, teach, and tech mapped to associated values; the values 12, 51, and 87 are shown.]
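The DSD interface can be sketched as a thin wrapper over the standard hash map. The class `HashStringDictionary`, the method name `erase` (used because `delete` is a C++ keyword), and the value 51 for tech are assumptions for illustration; only Search(teach) = 12 comes from the slide. It requires C++17.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical teaching class: the DSD operations on top of std::unordered_map.
template <typename Value>
class HashStringDictionary {
public:
    // Search(x): returns the associated value if x is stored.
    std::optional<Value> search(const std::string& x) const {
        const auto it = map_.find(x);
        if (it == map_.end()) return std::nullopt;
        return it->second;
    }

    // Insert(x, v): adds x with value v, replacing any previous value.
    void insert(const std::string& x, Value v) { map_.insert_or_assign(x, std::move(v)); }

    // Delete(x): erases x and reports whether it was present.
    bool erase(const std::string& x) { return map_.erase(x) > 0; }

private:
    std::unordered_map<std::string, Value> map_;
};

// Usage example; the value for "tech" is made up.
inline void dsd_usage_example() {
    HashStringDictionary<int> dict;
    dict.insert("teach", 12);
    dict.insert("tech", 51);
    assert(dict.search("teach") == 12);
    assert(dict.erase("tech"));
    assert(!dict.search("tech").has_value());
}
```

A wrapper like this is fast but stores every key in full plus per-entry overhead, which is the kind of space cost the trie-based structures in this part aim to reduce.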

State-of-the-art DSDs
- Based on hashing
  - Google sparse hash [Google 07]
  - Array hash [Askitis+ 05]
- Based on tries
  - Judy [Baskins 02]
  - HAT-trie [Askitis+ 07]
  - ART [Leis+ 13]
  - Cedar [Yoshinaga+ 14]
  - DynPDT [Kanda+ 17] (new)
- Recent works
  - Provided theoretical results for dynamic tries, not in practice [Jansson+ 15] [Arroyuelo+ 16]
- Main results: DynPDT is much smaller than all the others

Novel Data Structure of DSDs
- Dynamic Path-Decomposed Tries (DynPDT)
- Path decomposition: Procedure to build cache-friendly tries
  - Application: Compressed SSDs (Cent-rp) [Grossi+ 14]
  - Apply it to implement DSDs with a different approach
[Figure: a trie containing the key technology$ is converted by path decomposition into a Path-Decomposed Trie (PDT) whose root node holds the whole string technology$.]

Basic Idea
- Incrementally construct a PDT
  1. Add key technology$ to an empty dictionary
[Figure: a single root node labeled technology$ with value v1.]

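Basic Idea (cont)
- Incrementally construct a PDT
  2. Add key technics$
[Figure: the root node technology$ (value v1) with character positions 0 to 5 marked. The key technics$ first differs at position 5 with character i, so an edge labeled (5, i) leads to a new node labeled cs$ (value v2).]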

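Basic Idea (cont)
- Incrementally construct a PDT
  3. Add key technique$
[Figure: the key technique$ follows the edge (5, i) from the root technology$ (value v1) to the node cs$ (value v2). The remaining suffix que$ first differs from cs$ at position 0 with character q, so an edge labeled (0, q) leads to a new node labeled ue$ (value v3). The result is a Dynamic Path-Decomposed Trie (DynPDT).]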
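The three insertion steps can be replayed with a pointer-based sketch in which each node keeps a label string and its outgoing edges are keyed by (mismatch position, mismatching character). This only illustrates the incremental construction under stated assumptions: the class `DynPdtSketch` is hypothetical, it uses ordinary pointers and `std::map` rather than a compact trie, and every key must end with a `$` that appears nowhere else in the key. It requires C++17.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <optional>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical teaching class: incremental path-decomposed trie with pointer nodes.
class DynPdtSketch {
public:
    void insert(std::string_view key, int value) {
        assert(!key.empty() && key.find('$') == key.size() - 1);  // '$' only at the end
        if (!root_) {
            root_ = std::make_unique<Node>(key, value);
            return;
        }
        Node* node = root_.get();
        while (true) {
            const std::size_t i = first_mismatch(node->label, key);
            if (i == node->label.size() || i == key.size()) {
                // Label and key are identical (the only case under the '$' rule): overwrite.
                if (i == node->label.size() && i == key.size()) node->value = value;
                return;
            }
            const Edge edge{i, key[i]};
            key.remove_prefix(i + 1);  // keep only the suffix after the branching character
            const auto it = node->children.find(edge);
            if (it == node->children.end()) {
                node->children.emplace(edge, std::make_unique<Node>(key, value));
                return;
            }
            node = it->second.get();
        }
    }

    std::optional<int> search(std::string_view key) const {
        const Node* node = root_.get();
        while (node != nullptr) {
            const std::size_t i = first_mismatch(node->label, key);
            if (i == node->label.size() || i == key.size()) {
                if (i == node->label.size() && i == key.size()) return node->value;
                return std::nullopt;
            }
            const Edge edge{i, key[i]};
            key.remove_prefix(i + 1);
            const auto it = node->children.find(edge);
            node = (it == node->children.end()) ? nullptr : it->second.get();
        }
        return std::nullopt;
    }

private:
    using Edge = std::pair<std::size_t, char>;  // (mismatch position, branching character)

    struct Node {
        std::string label;
        int value;
        std::map<Edge, std::unique_ptr<Node>> children;
        Node(std::string_view l, int v) : label(l), value(v) {}
    };

    static std::size_t first_mismatch(std::string_view a, std::string_view b) {
        std::size_t i = 0;
        while (i < a.size() && i < b.size() && a[i] == b[i]) ++i;
        return i;
    }

    std::unique_ptr<Node> root_;
};
```

Inserting technology$, technics$, and technique$ in that order creates the nodes technology$, cs$, and ue$ connected by the edges (5, i) and (0, q), as in the three steps above. A search visits one node per branching point rather than one node per character, which is the source of the cache friendliness attributed to path decomposition.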

How to Implement?
- Representation of trie topology
  - Use a compact dynamic trie, or m-Bonsai [Poyias+ 16]
  - Trie can be represented in (n/α)(lg σ + O(1)) bits
    - Where n is the number of nodes, σ is the alphabet size, and 0 < α < 1
  - Close to the information-theoretically optimal n(lg σ + O(1)) bits
- Management of node labels
  - Plain implementation takes O(n lg n) bits for pointers
  - Our compact one takes O(n) bits, although access time is O(lg n)
[Figure: node labels technology$, cs$, ue$, lly$, and cal$.]

Experimental Settings
- Machine
  - Intel Xeon E5540 2.53 GHz CPU
  - 32 GB RAM (L2 cache: 1 MB, L3 cache: 8 MB)
- Language
  - C++
  - g++ (version 5.4.0)
  - Optimization -O9
- Datasets
  - Geographic names on the asciiname column from GeoNames dump
  - URLs of a 2005 crawl by the UbiCrawler on the .uk domain

Experimental Settings (cont)
- Based on hashing
  - Google sparse hash [Google 07]
  - Array hash [Askitis+ 05]
- Based on tries
  - Judy [Baskins 02]
  - HAT-trie [Askitis+ 07]
  - ART [Leis+ 13]
  - Cedar [Yoshinaga+ 14]
  - DynPDT [Kanda+ 17]

Experimental Results
- Geographic names on the asciiname column from GeoNames
  - Raw size: 106.1 MiB
  - # of strings: 6,784,722
  - Ave. length: 15.6 bytes

              Space (bytes/str)   Insert (μs/str)   Search (μs/str)
DynPDT        16.8 (base)         1.37 (base)       1.38 (base)
Sparsehash    62.3 (3.71x)        4.31 (3.14x)      0.34 (0.24x)
Judy          47.6 (2.83x)        0.93 (0.68x)      0.70 (0.50x)
HAT-trie      35.4 (2.11x)        0.96 (0.70x)      0.31 (0.23x)
ART           87.1 (5.18x)        1.07 (0.78x)      0.81 (0.59x)
Cedar         30.5 (1.82x)        1.05 (0.76x)      0.42 (0.30x)

Note: To measure Insert/Search, we extracted a million strings at random

Experimental Results (cont)
- URLs of a 2005 crawl by the UbiCrawler on the .uk domain
  - Raw size: 2,855.5 MiB
  - # of strings: 39,459,925
  - Ave. length: 72.4 bytes

              Space (bytes/str)   Insert (μs/str)   Search (μs/str)
DynPDT         28.2 (base)        2.29 (base)       2.47 (base)
Sparsehash    131.0 (4.65x)       9.13 (3.99x)      0.67 (0.27x)
Judy           60.3 (2.14x)       2.15 (0.94x)      2.02 (0.82x)
HAT-trie       82.3 (2.92x)       1.63 (0.71x)      0.61 (0.25x)
ART           140.9 (5.00x)       2.20 (0.96x)      1.84 (0.75x)
Cedar          58.4 (2.07x)       2.56 (1.12x)      2.51 (1.02x)

Note: To measure Insert/Search, we extracted a million strings at random

Concluding Remarks
- Our contributions (in the presentation)
  - XCDA: Novel compressed double-array structure for static string dictionaries supporting fast lookup
  - DynPDT: Novel space-efficient data structure for dynamic string dictionaries
- Open problems
  - We are not aware of any fast implementation supporting substring-based operations
    - FM-index and XBW are very slow compared to other dictionaries
  - We are not aware of any succinct dynamic string dictionaries
