Ph.D. Thesis - Speaker Deck

Slide 1

Slide 1 text

Space- and Time-Efficient String Dictionaries
Shunsuke Kanda (D2) @ Fuketa Lab.

Slide 2

Slide 2 text

Background
- Management of massive data is a fundamental problem
- Most such data are handled as strings
  - Such as documents, Web pages, and genomics data
- Data structures and algorithms for space-efficient string processing have been developed by many researchers

Slide 3

Slide 3 text

String Dictionaries
- Data structure for storing a set of strings
  - Such as std::map in C++
- Classical applications
  - Manage document vocabulary in NLP and IR tasks

[Figure: Heaps' Law. Vocabulary size plotted against document length; the vocabulary grows as O(n^β) for 0.4 < β < 0.6. Annotations: terabyte-size document, megabyte-size dictionary.]

Slide 4

Slide 4 text

String Dictionaries (cont)
- Recent applications [Martínez-Prieto+ 16]
  - Natural language applications handling Web collections and N-grams, such as Web search engines and IMEs
  - RDF stores in Semantic Web managing URIs, blank nodes, and literal values
  - Web graphs and crawlers storing massive URLs
  - Alignment software in Bioinformatics indexing large k-mers
  - NoSQL handling massive key-value stores
  - GISs storing large geographic data within a limited-resource navigation system
The size will easily exceed gigabytes!

Slide 5

Slide 5 text

Our Contributions
- We developed novel data structures to solve existing problems of string dictionaries
  - Static compressed double-array tries supporting fast lookup [Kanda+, KAIS 16] [Kanda+, KAIS 17]
  - Lightweight dictionary compression using dictionary encoding, instead of using powerful text compression [Kanda+, Innovate-Data 17] [Kanda+, DBSJ 18]
  - Dynamic path-decomposed tries in compact space [Kanda+, SPIRE 17]
  - Practical rearrangement methods of dynamic double-array tries [Kanda+, SPE 18]
Due to the time limitation, this presentation covers only the static compressed double-array tries and the dynamic path-decomposed tries

Slide 6

Slide 6 text

Static Compressed Double-Array Tries Supporting Fast Lookup
The contents are based on the paper in the journal Knowledge and Information Systems (KAIS), 2017

Slide 7

Slide 7 text

Static String Dictionaries (SSD)
- Provide a bijection between a set of strings X and integer IDs in range [1,|X|]
  - Lookup(x) returns the ID if x ∈ X
  - Access(i) restores the string from the given ID i

Example: the set {idea, ism, teach, tech} with IDs 1, 2, 3, 4 provides
  Lookup(teach) = 3
  Access(3) = teach
  PrefixLookup(te) = {teach, tech}
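The three operations can be sketched as follows. This is only a minimal illustration that assumes a sorted std::vector as the backing store, not any of the compressed structures discussed in this talk; the class name SortedDictionary and its method names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical illustration of the SSD interface on top of a sorted vector.
// IDs are 1-based lexicographic ranks; 0 is used here to signal "not found".
class SortedDictionary {
 public:
  explicit SortedDictionary(std::vector<std::string> keys) : keys_(std::move(keys)) {
    std::sort(keys_.begin(), keys_.end());
    keys_.erase(std::unique(keys_.begin(), keys_.end()), keys_.end());
  }

  // Lookup(x): returns the ID of x, or 0 if x is not stored.
  std::size_t lookup(const std::string& x) const {
    auto it = std::lower_bound(keys_.begin(), keys_.end(), x);
    if (it == keys_.end() || *it != x) return 0;
    return static_cast<std::size_t>(it - keys_.begin()) + 1;
  }

  // Access(i): restores the string with ID i, where 1 <= i <= size().
  const std::string& access(std::size_t i) const {
    assert(i >= 1 && i <= keys_.size());
    return keys_[i - 1];
  }

  // PrefixLookup(p): returns every stored string that starts with p.
  std::vector<std::string> prefix_lookup(const std::string& p) const {
    std::vector<std::string> out;
    for (auto it = std::lower_bound(keys_.begin(), keys_.end(), p);
         it != keys_.end() && it->compare(0, p.size(), p) == 0; ++it) {
      out.push_back(*it);
    }
    return out;
  }

  std::size_t size() const { return keys_.size(); }

 private:
  std::vector<std::string> keys_;
};
```

Lookup costs a binary search here and the strings are stored uncompressed, which is exactly what the compressed SSDs on the following slides try to improve.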

Slide 8

Slide 8 text

State-of-the-art Compressed SSDs
- Based on closed hashing
  - HashDAC-rp [Martínez-Prieto+ 16]
- Based on Front Coding
  - HTFC-rp [Martínez-Prieto+ 16]
- Based on suffix arrays
  - FM-index [Ferragina+ 07]
- Based on tries
  - MARISA [Yata 11]
  - XBW [Ferragina+ 09]
  - Cent-rp [Grossi+ 16]
  - LZ string dictionaries [Arz+ 14]
  - XCDA [Kanda+ 17] (New)

[Figure labels: Space, Time, Functionality]
Note: Depending largely on datasets

Slide 9

Slide 9 text

Tries [Fredkin 60, Knuth 88]
- Edge-labeled tree for storing a set of strings
- Lookup/Access are supported in O(k) optimal time
  - k is the length of the target string

[Figure: trie storing the strings idea, ism, teach, tech, tie, trie]
  Lookup(tech) = 4
  Access(4) = tech
  PrefixLookup(te) = {teach, tech}
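A trie Lookup can be sketched as below. This is a plain pointer-based illustration with hypothetical names (TrieNode, Trie); it assumes IDs are assigned by the caller and that 0 means "no string ends here". Children are kept in a std::map, so each step costs O(lg σ) rather than O(1); that is an artifact of this sketch, not of tries in general.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>

// Hypothetical pointer-based trie node: one outgoing edge per character.
struct TrieNode {
  std::map<char, std::unique_ptr<TrieNode>> children;
  std::size_t id = 0;  // ID of the string ending at this node, 0 if none
};

class Trie {
 public:
  // Stores key with the given non-zero ID, creating nodes along the path.
  void insert(const std::string& key, std::size_t id) {
    assert(id != 0);  // 0 is reserved for "not found" in this sketch
    TrieNode* node = &root_;
    for (char c : key) {
      std::unique_ptr<TrieNode>& child = node->children[c];
      if (!child) child = std::make_unique<TrieNode>();
      node = child.get();
    }
    node->id = id;
  }

  // Lookup(key): follows one edge per character, so it visits k + 1 nodes
  // for a key of length k. Returns 0 if the key is not stored.
  std::size_t lookup(const std::string& key) const {
    const TrieNode* node = &root_;
    for (char c : key) {
      auto it = node->children.find(c);
      if (it == node->children.end()) return 0;
      node = it->second.get();
    }
    return node->id;
  }

 private:
  TrieNode root_;
};
```

Every node here carries a std::map and heap pointers, which is why compact array-based representations such as double arrays are of interest.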

Slide 10

Slide 10 text

Double Arrays [Aoe 89]
- Trie representation technique using two arrays
  - BASE stores offsets to children
  - CHECK stores back pointers to parents
- Provide the fastest Lookup/Access in tries

[Figure: node s has children t1 and t2 on edge labels c1 and c2; BASE[s] = bs, t1 = bs + c1, t2 = bs + c2, and CHECK[t1] = CHECK[t2] = s]
  Child(s, c) = BASE[s] + c iff CHECK[BASE[s] + c] = s
  Parent(t) = CHECK[t]
Note: XOR is often used instead of PLUS in practice
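The Child and Parent expressions can be tried out with the following sketch. It is a minimal illustration under several assumptions: the class and method names are hypothetical, keys are terminated with a '\0' byte (so they must not contain it), BASE values are found by a naive first-fit search, and the terminal node address stands in for the string ID. It uses PLUS, as in the expressions above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical static double array built with a naive first-fit placement.
class DoubleArray {
 public:
  explicit DoubleArray(const std::vector<std::string>& keys) {
    // 1. Temporary trie: trie[node] maps an edge label to a child node.
    std::vector<std::map<unsigned char, int>> trie(1);
    for (const std::string& key : keys) {
      int node = 0;
      for (std::size_t i = 0; i <= key.size(); ++i) {
        const unsigned char c = i < key.size() ? static_cast<unsigned char>(key[i]) : 0;
        auto it = trie[node].find(c);
        if (it != trie[node].end()) { node = it->second; continue; }
        const int next = static_cast<int>(trie.size());
        trie[node][c] = next;
        trie.emplace_back();
        node = next;
      }
    }
    // 2. Place nodes breadth-first; the root gets address 0.
    base_.assign(1, 0);
    check_.assign(1, -1);
    std::vector<bool> used(1, true);
    std::queue<std::pair<int, int>> todo;  // (trie node, address)
    todo.push({0, 0});
    while (!todo.empty()) {
      const int node = todo.front().first;
      const int addr = todo.front().second;
      todo.pop();
      const std::map<unsigned char, int>& edges = trie[node];
      if (edges.empty()) continue;  // terminal node: no BASE needed
      int base = 1;                 // smallest BASE whose child slots are free
      while (!fits(base, edges, used)) ++base;
      const std::size_t need = base + edges.rbegin()->first + 1;
      if (need > used.size()) {
        base_.resize(need, 0);
        check_.resize(need, -1);  // -1 marks an empty element
        used.resize(need, false);
      }
      base_[addr] = base;
      for (const auto& edge : edges) {
        const int t = base + edge.first;
        used[t] = true;
        check_[t] = addr;  // back pointer to the parent
        todo.push({edge.second, t});
      }
    }
  }

  // Child(s, c) = BASE[s] + c iff CHECK[BASE[s] + c] = s; -1 otherwise.
  int child(int s, unsigned char c) const {
    const std::size_t t = base_[s] + c;
    return (t < check_.size() && check_[t] == s) ? static_cast<int>(t) : -1;
  }

  // Parent(t) = CHECK[t].
  int parent(int t) const { return check_[t]; }

  // Lookup: address of the key's terminal node, or -1 if absent.
  int lookup(const std::string& key) const {
    int s = 0;
    for (std::size_t i = 0; i <= key.size() && s >= 0; ++i) {
      s = child(s, i < key.size() ? static_cast<unsigned char>(key[i]) : 0);
    }
    return s;
  }

  // Access: restores the key by walking CHECK back pointers up to the root.
  std::string access(int t) const {
    assert(t > 0 && static_cast<std::size_t>(t) < check_.size());
    std::string key;
    while (t != 0) {
      const int p = check_[t];
      const int c = t - base_[p];  // edge label recovered from the addresses
      if (c != 0) key.push_back(static_cast<char>(c));  // skip the terminator
      t = p;
    }
    std::reverse(key.begin(), key.end());
    return key;
  }

 private:
  static bool fits(int base, const std::map<unsigned char, int>& edges,
                   const std::vector<bool>& used) {
    for (const auto& edge : edges) {
      const std::size_t t = base + edge.first;
      if (t < used.size() && used[t]) return false;
    }
    return true;
  }

  std::vector<int> base_, check_;
};
```

Access works only because CHECK keeps the parent address, which is why compressed variants that drop or alter CHECK lose this operation.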

Slide 11

Slide 11 text

Shortcoming of Double Arrays
- BASE and CHECK are not compact, taking Ω(n lg n) bits
  - Information-theoretical lower bound is n lg σ + O(n) bits
- Existing compressed forms have critical problems
  - CDA [Yata+ 07] is still large and sacrifices Access
  - DALF [Kanda+ 16] is smaller than CDA but sacrifices Access

[Figure: BASE and CHECK arrays of length Ω(n) for a trie with n nodes from an alphabet of size σ]
Note: Some empty elements can be included

Slide 12

Slide 12 text

Novel Compressed Data Structure
- XOR-compressed double arrays (XCDA)
  - Represent each BASE/CHECK element in 8 bits empirically through a simple approach
  - The space efficiency is nearly equal to that of DALF
  - Support both Lookup and Access

[Figure: a double array stores each BASE and CHECK element in 32/64 bits; XCDA stores each BASE' and CHECK' element in about 8 bits]

Slide 13

Slide 13 text

Basic Idea
- Property of double arrays
  - Node addresses can be freely arranged as long as the following expressions are satisfied for all nodes

  Child(s, c) = BASE[s] + c iff CHECK[BASE[s] + c] = s
  Parent(t) = CHECK[t]

Slide 14

Slide 14 text

Basic Idea (cont)
- Property of double arrays
  - Node addresses can be freely arranged as long as the following expressions are satisfied for all nodes

[Figure: two scatter plots over addresses s from 0 to 250000. The first plots CHECK[s] against s. The second, obtained by XOR, plots CHECK[s] XOR s against s; most of the values are within one byte.]
Note: BASE is also the same
Modified version [Kanda+ 17] of DACs [Brisaboa+ 11]
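The effect of XORing each element with its own address can be sketched as follows. Note the substitution: this sketch does not implement DACs or the modified version cited above. It swaps in a much simpler escape-byte scheme, where values that fit in one byte are stored inline and the rest go to a side table, purely to show why small XORed values save space. The class name XorPackedArray and the escape value 255 are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical one-byte packing of v[s] XOR s with an overflow side table.
class XorPackedArray {
 public:
  explicit XorPackedArray(const std::vector<std::uint32_t>& values)
      : bytes_(values.size()) {
    for (std::uint32_t s = 0; s < values.size(); ++s) {
      const std::uint32_t x = values[s] ^ s;  // small if value and address are close
      if (x < kEscape) {
        bytes_[s] = static_cast<std::uint8_t>(x);
      } else {
        bytes_[s] = kEscape;  // marker: the real value lives in overflow_
        overflow_[s] = x;
      }
    }
  }

  // Restores the original element: XOR with the address is its own inverse.
  std::uint32_t get(std::uint32_t s) const {
    assert(s < bytes_.size());
    std::uint32_t x = bytes_[s];
    if (x == kEscape) x = overflow_.at(s);
    return x ^ s;
  }

  // Number of elements that did not fit in one byte.
  std::size_t overflow_count() const { return overflow_.size(); }

 private:
  static constexpr std::uint8_t kEscape = 255;
  std::vector<std::uint8_t> bytes_;
  std::unordered_map<std::uint32_t, std::uint32_t> overflow_;
};
```

For example, a CHECK array {1, 0, 3, 2, 1000, 4} yields XORed values {1, 1, 1, 1, 1004, 1}, so only the element at address 4 needs more than one byte.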

Slide 15

Slide 15 text

Experimental Settings
- Machine
  - Quad-Core Intel Xeon 2 x 2.4 GHz CPU
  - 16 GB RAM
- Language
  - C++
  - Apple LLVM version 7.0.2 (clang-700)
  - Optimization -O3
- Datasets
  - Geographic names on the asciiname column from GeoNames dump
  - URLs of a 2005 crawl by the UbiCrawler on the .uk domain

Slide 16

Slide 16 text

Experimental Settings (cont)
- Based on closed hashing
  - HashDAC-rp [Martínez-Prieto+ 16]
- Based on Front Coding
  - HTFC-rp [Martínez-Prieto+ 16]
- Based on suffix arrays
  - FM-index [Ferragina+ 07]
- Based on tries
  - MARISA [Yata 11]
  - XBW [Ferragina+ 09]
  - Cent-rp [Grossi+ 16]
  - LZ string dictionaries [Arz+ 14]
  - XCDA [Kanda+ 17]
- Original double array, DA [Aoe 89]
- Previous compressed double array, DALF [Kanda+ 16]

[Figure labels: Space, Time, Functionality]

Slide 17

Slide 17 text

Experimental Results
- Geographic names on the asciiname column from GeoNames
  - Raw size: 106.1 MiB
  - # of strings: 6,784,722
  - Ave. length: 15.6 bytes

Structure     Constr (sec)      Cmpr (%)       Lookup (μs/str)  Access (μs/str)
XCDA            5.7 (base)      52.8 (base)    0.93 (base)      1.29 (base)
DA              4.9 (0.9x)      95.8 (1.8x)    0.61 (0.7x)      0.95 (0.7x)
DALF            9.5 (1.7x)      52.8 (1.0x)    0.80 (0.9x)      –
Cent-rp        33.8 (5.9x)      31.5 (0.6x)    2.10 (2.3x)      2.17 (1.7x)
HTFC-rp       125.1 (21.9x)     34.4 (0.7x)    3.50 (3.8x)      1.79 (1.4x)
HashDAC-rp    298.9 (52.4x)     48.0 (0.9x)    1.28 (1.4x)      0.92 (0.7x)

Values in parentheses are ratios relative to XCDA (base)
Lookup/Access were measured on a million strings extracted at random

Slide 18

Slide 18 text

Experimental Results (cont)
- URLs of a 2005 crawl by the UbiCrawler on the .uk domain
  - Raw size: 2,855.5 MiB
  - # of strings: 39,459,925
  - Ave. length: 72.4 bytes

Structure     Constr (sec)         Cmpr (%)       Lookup (μs/str)  Access (μs/str)
XCDA             71.7 (base)       25.2 (base)    2.70 (base)      3.54 (base)
DA               65.9 (0.9x)       43.8 (1.7x)    1.95 (0.7x)      2.93 (0.8x)
DALF            110.1 (1.5x)       24.1 (1.0x)    6.00 (2.2x)      –
Cent-rp         472.7 (6.6x)       17.5 (0.7x)    4.02 (1.5x)      4.47 (1.3x)
HTFC-rp       12598.4 (175.7x)     18.3 (0.7x)    7.96 (2.9x)      4.41 (1.2x)
HashDAC-rp    Could not be constructed in practical time

Values in parentheses are ratios relative to XCDA (base)
Lookup/Access were measured on a million strings extracted at random

Slide 19

Slide 19 text

Dynamic Path-Decomposed Tries
The contents are based on the paper in the proceedings of the 24th International Symposium on String Processing and Information Retrieval (SPIRE), 2017

Slide 20

Slide 20 text

Dynamic String Dictionaries (DSD)
- Provide a mapping from a set of strings X to values of any type
  - Search(x) returns the associated value if x ∈ X
  - Insert(x, v) adds the string x associated with value v to the dictionary
  - Delete(x) erases the string x from the dictionary

[Figure: the strings idea, ism, teach, tech associated with values such as 12, 51, 87]
  Search(teach) = 12

Slide 21

Slide 21 text

State-of-the-art DSDs
- Based on hashing
  - Google sparse hash [Google 07]
  - Array hash [Askitis+ 05]
- Based on tries
  - Judy [Baskins 02]
  - HAT-trie [Askitis+ 07]
  - ART [Leis+ 13]
  - Cedar [Yoshinaga+ 14]
  - DynPDT [Kanda+ 17] (New)
- Recent works
  - Provide theoretical results for dynamic tries, but not practical implementations [Jansson+ 15] [Arroyuelo+ 16]
Main results: DynPDT is much smaller than all the others

Slide 22

Slide 22 text

Novel Data Structure of DSDs
- Dynamic Path-Decomposed Tries (DynPDT)
  - Path decomposition: Procedure to build cache-friendly tries
    - Application: Compressed SSDs (Cent-rp) [Grossi+ 14]
  - Apply it to implement DSDs with a different approach

[Figure: path decomposition converts a trie containing the key technology$ into a Path-Decomposed Trie (PDT)]

Slide 23

Slide 23 text

Basic Idea
- Incrementally construct a PDT
  1. Add key technology$ to an empty dictionary

[Figure: a single root node labeled technology$ with value v1]

Slide 24

Slide 24 text

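Basic Idea (cont)
- Incrementally construct a PDT
  2. Add key technics$

[Figure: technics$ is compared with the root label technology$ over positions 0 to 5. They first differ at position 5, where the key has i, so a child node labeled with the remaining suffix cs$ and value v2 is attached through an edge labeled (5, i).]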

Slide 25

Slide 25 text

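Basic Idea (cont)
- Incrementally construct a PDT
  3. Add key technique$

[Figure: technique$ first differs from the root label technology$ at position 5 with i, so the edge (5, i) is followed to the node cs$ (value v2). The remaining suffix que$ differs from cs$ at position 0 with q, so a child node labeled ue$ with value v3 is attached through an edge labeled (0, q). The result is a Dynamic Path-Decomposed Trie (DynPDT).]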
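The three insertion steps can be replayed with the sketch below. It is only an illustration under stated assumptions: the class and helper names are hypothetical, nodes are linked with pointers and a std::map instead of a compact trie representation, keys must not contain the terminator '$' (it is appended internally), values are plain ints, and Delete is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical pointer-based path-decomposed trie with incremental insertion.
class PathDecomposedTrie {
 public:
  // Insert(x, v): returns false if x is already stored (value unchanged).
  bool insert(const std::string& key, int value) {
    std::string rest = terminate(key);
    if (!root_) {
      root_ = std::make_unique<Node>(rest, value);
      return true;
    }
    Node* node = root_.get();
    for (;;) {
      const std::size_t i = first_mismatch(node->label, rest);
      if (i == rest.size()) return false;  // rest equals the node label
      // Branch on (mismatch position, mismatching character of the key).
      std::unique_ptr<Node>& child = node->children[Edge(i, rest[i])];
      rest.erase(0, i + 1);  // keep only the suffix after the mismatch
      if (!child) {
        child = std::make_unique<Node>(rest, value);
        return true;
      }
      node = child.get();
    }
  }

  // Search(x): pointer to the associated value, or nullptr if absent.
  const int* search(const std::string& key) const {
    std::string rest = terminate(key);
    const Node* node = root_.get();
    while (node != nullptr) {
      const std::size_t i = first_mismatch(node->label, rest);
      if (i == rest.size()) return &node->value;
      auto it = node->children.find(Edge(i, rest[i]));
      if (it == node->children.end()) return nullptr;
      rest.erase(0, i + 1);
      node = it->second.get();
    }
    return nullptr;
  }

 private:
  using Edge = std::pair<std::size_t, char>;
  struct Node {
    Node(std::string l, int v) : label(std::move(l)), value(v) {}
    std::string label;  // suffix stored at this node, ending with '$'
    int value;
    std::map<Edge, std::unique_ptr<Node>> children;
  };

  // Appends '$' so that no stored key is a prefix of another.
  static std::string terminate(const std::string& key) {
    assert(key.find('$') == std::string::npos);
    return key + '$';
  }

  // Index of the first position where a and b differ.
  static std::size_t first_mismatch(const std::string& a, const std::string& b) {
    std::size_t i = 0;
    while (i < a.size() && i < b.size() && a[i] == b[i]) ++i;
    return i;
  }

  std::unique_ptr<Node> root_;
};
```

Inserting technology, technics, and technique in this order creates the nodes technology$, cs$ via edge (5, i), and ue$ via edge (0, q), matching the three steps above.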

Slide 26

Slide 26 text

How to Implement?
- Representation of trie topology
  - Use a compact dynamic trie, m-Bonsai [Poyias+ 16]
  - Trie can be represented in (n/α)(lg σ + O(1)) bits
    - Where n is #nodes, σ is alphabet size, and α satisfies 0 < α < 1
    - Close to the information-theoretically optimal n(lg σ + O(1)) bits
- Management of node labels
  - Plain implementation takes O(n lg n) bits for pointers
  - Our compact one takes O(n) bits, although access time is O(lg n)

[Figure: node labels technology$, cs$, ue$, lly$, cal$]

Slide 27

Slide 27 text

Experimental Settings
- Machine
  - Intel Xeon E5540 2.53 GHz CPU
  - 32 GB RAM (L2 cache: 1 MB, L3 cache: 8 MB)
- Language
  - C++
  - g++ (version 5.4.0)
  - Optimization -O9
- Datasets
  - Geographic names on the asciiname column from GeoNames dump
  - URLs of a 2005 crawl by the UbiCrawler on the .uk domain

Slide 28

Slide 28 text

Experimental Settings (cont)
- Based on hashing
  - Google sparse hash [Google 07]
  - Array hash [Askitis+ 05]
- Based on tries
  - Judy [Baskins 02]
  - HAT-trie [Askitis+ 07]
  - ART [Leis+ 13]
  - Cedar [Yoshinaga+ 14]
  - DynPDT [Kanda+ 17]

Slide 29

Slide 29 text

Experimental Results
- Geographic names on the asciiname column from GeoNames
  - Raw size: 106.1 MiB
  - # of strings: 6,784,722
  - Ave. length: 15.6 bytes

Structure     Space (bytes/str)  Insert (μs/str)  Search (μs/str)
DynPDT         16.8 (base)       1.37 (base)      1.38 (base)
Sparsehash     62.3 (3.71x)      4.31 (3.14x)     0.34 (0.24x)
Judy           47.6 (2.83x)      0.93 (0.68x)     0.70 (0.50x)
HAT-trie       35.4 (2.11x)      0.96 (0.70x)     0.31 (0.23x)
ART            87.1 (5.18x)      1.07 (0.78x)     0.81 (0.59x)
Cedar          30.5 (1.82x)      1.05 (0.76x)     0.42 (0.30x)

Values in parentheses are ratios relative to DynPDT (base)
Insert/Search were measured on a million strings extracted at random

Slide 30

Slide 30 text

Experimental Results (cont)
- URLs of a 2005 crawl by the UbiCrawler on the .uk domain
  - Raw size: 2,855.5 MiB
  - # of strings: 39,459,925
  - Ave. length: 72.4 bytes

Structure     Space (bytes/str)  Insert (μs/str)  Search (μs/str)
DynPDT         28.2 (base)       2.29 (base)      2.47 (base)
Sparsehash    131.0 (4.65x)      9.13 (3.99x)     0.67 (0.27x)
Judy           60.3 (2.14x)      2.15 (0.94x)     2.02 (0.82x)
HAT-trie       82.3 (2.92x)      1.63 (0.71x)     0.61 (0.25x)
ART           140.9 (5.00x)      2.20 (0.96x)     1.84 (0.75x)
Cedar          58.4 (2.07x)      2.56 (1.12x)     2.51 (1.02x)

Values in parentheses are ratios relative to DynPDT (base)
Insert/Search were measured on a million strings extracted at random

Slide 31

Slide 31 text

Concluding Remarks
- Our contributions (in the presentation)
  - XCDA: Novel compressed double-array structure for static string dictionaries supporting fast lookup
  - DynPDT: Novel space-efficient data structure for dynamic string dictionaries
- Open problems
  - We are not aware of any fast implementation supporting substring-based operations
    - FM-index and XBW are very slow compared to other dictionaries
  - We are not aware of any succinct dynamic string dictionaries