problem
} Most such data are handled as strings
  } Such as documents, Web pages, genomics data
} Data structures and algorithms for space-efficient string processing have been developed by many researchers
Natural language applications handling Web collections and N-grams, such as Web search engines and IMEs
} RDF stores in Semantic Web managing URIs, blank nodes, and literal values
} Web graphs and crawlers storing massive URLs
} Alignment software in Bioinformatics indexing large k-mers
} NoSQL handling massive key-value stores
} GISs storing large geographic data within a limited-resource navigation system
The size will easily exceed gigabytes!
to solve existing problems of string dictionaries
} Static compressed double-array tries supporting fast lookup [Kanda+, KAIS 16] [Kanda+, KAIS 17]
} Lightweight dictionary compression using dictionary encoding, instead of using powerful text compression [Kanda+, Innovate-Data 17] [Kanda+, DBSJ 18]
} Dynamic path-decomposed tries in compact space [Kanda+, SPIRE 17]
} Practical rearrangement methods of dynamic double-array tries [Kanda+, SPE 18]
Due to the time limitation
a set of strings X and integer IDs in range [1,|X|]
} Lookup(x) returns the ID if x ∈ X
} Access(i) restores the string from the given ID i
Example: X = {idea, ism, teach, tech} with IDs 1, 2, 3, 4 provides
  Lookup(teach) = 3
  Access(3) = teach
  PrefixLookup(te) = {teach, tech}
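The three operations can be pinned down with a deliberately simple baseline. The sketch below is an assumption for illustration only (the class name `SortedDictionary` and its methods are hypothetical, and a sorted array is not one of the structures proposed in the talk): IDs are assigned in lexicographic order, and `lookup` returns 0 for strings outside X, since valid IDs start at 1.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical baseline: a sorted array of distinct strings.
// IDs are 1..|X| in lexicographic order.
class SortedDictionary {
public:
    explicit SortedDictionary(std::vector<std::string> keys) : keys_(std::move(keys)) {
        std::sort(keys_.begin(), keys_.end());
        keys_.erase(std::unique(keys_.begin(), keys_.end()), keys_.end());
    }

    // Lookup(x): the ID of x if x is in X, otherwise 0.
    std::size_t lookup(const std::string& x) const {
        auto it = std::lower_bound(keys_.begin(), keys_.end(), x);
        if (it == keys_.end() || *it != x) return 0;
        return static_cast<std::size_t>(it - keys_.begin()) + 1;
    }

    // Access(i): the string with ID i; throws std::out_of_range for invalid IDs.
    const std::string& access(std::size_t i) const { return keys_.at(i - 1); }

    // PrefixLookup(p): all strings in X that start with p.
    std::vector<std::string> prefix_lookup(const std::string& p) const {
        std::vector<std::string> out;
        // Strings sharing the prefix p are contiguous, starting at the first key >= p.
        for (auto it = std::lower_bound(keys_.begin(), keys_.end(), p);
             it != keys_.end() && it->compare(0, p.size(), p) == 0; ++it) {
            out.push_back(*it);
        }
        return out;
    }

private:
    std::vector<std::string> keys_;
};
```

Built over {idea, ism, teach, tech}, this gives the same answers as the example on the slide. Each lookup costs a binary search over whole strings, which is one of the costs trie-based dictionaries are meant to avoid.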
storing a set of strings
} Lookup/Access are supported in O(k) optimal time
  } k is the length of the target string
[Figure: trie for X = {idea, ism, teach, tech, tie, trie}]
Lookup(tech) = 4
Access(4) = tech
PrefixLookup(te) = {teach, tech}
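A pointer-based trie makes the character-by-character descent explicit. This is a minimal sketch under stated assumptions: the names `Trie`, `contains`, and `prefix_lookup` are hypothetical, membership is reported instead of an ID, and children are kept in a `std::map`, so each step costs O(lg σ) where the slide's O(k) bound assumes constant-time child access.

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;  // ordered by label
    bool terminal = false;                               // true if a key ends here
};

class Trie {
public:
    void insert(const std::string& key) {
        TrieNode* node = &root_;
        for (char c : key) {
            std::unique_ptr<TrieNode>& child = node->children[c];
            if (!child) child.reset(new TrieNode());
            node = child.get();
        }
        node->terminal = true;
    }

    // One child step per character of the key.
    bool contains(const std::string& key) const {
        const TrieNode* node = descend(key);
        return node != nullptr && node->terminal;
    }

    // All stored keys that start with the given prefix, in lexicographic order.
    std::vector<std::string> prefix_lookup(const std::string& prefix) const {
        std::vector<std::string> out;
        const TrieNode* node = descend(prefix);
        if (node == nullptr) return out;
        std::string buffer = prefix;
        collect(node, buffer, out);
        return out;
    }

private:
    const TrieNode* descend(const std::string& s) const {
        const TrieNode* node = &root_;
        for (char c : s) {
            auto it = node->children.find(c);
            if (it == node->children.end()) return nullptr;
            node = it->second.get();
        }
        return node;
    }

    static void collect(const TrieNode* node, std::string& buffer,
                        std::vector<std::string>& out) {
        if (node->terminal) out.push_back(buffer);
        for (const auto& kv : node->children) {
            buffer.push_back(kv.first);
            collect(kv.second.get(), buffer, out);
            buffer.pop_back();
        }
    }

    TrieNode root_;
};
```

The per-node `std::map` and heap pointers are exactly the overhead that array-based and succinct trie representations try to remove.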
arrays
} BASE stores offsets to children
} CHECK stores back pointers to parents
} Provides the fastest Lookup/Access in tries
[Figure: node s with BASE[s] = bs; its children t1, t2 with labels c1, c2 sit at addresses bs + c1 and bs + c2, and CHECK[t1] = CHECK[t2] = s]
Child(s, c) = BASE[s] + c iff CHECK[BASE[s] + c] = s
Parent(t) = CHECK[t]
Note: XOR is often used instead of PLUS in practice
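The two expressions translate almost directly into code. The following is a rough sketch, not the implementation measured in the talk: the class name `DoubleArray`, the sentinel `kNone`, the `'\0'` end-of-key label, and the naive first-fit construction are all assumptions made here, and the PLUS variant is used instead of XOR.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

const int kNone = -1;  // marks a free slot or a missing node (assumption of this sketch)

class DoubleArray {
public:
    // Keys must not contain '\0', which serves as the end-of-key label.
    explicit DoubleArray(const std::vector<std::string>& keys)
        : base_(1, kNone), check_(1, 0) {  // address 0 is the root
        // 1) Temporary pointer trie; children always get larger indices than parents.
        std::vector<std::map<unsigned char, int> > trie(1);
        for (const std::string& key : keys) {
            int node = 0;
            for (std::size_t i = 0; i <= key.size(); ++i) {
                unsigned char c = i < key.size() ? static_cast<unsigned char>(key[i]) : 0;
                auto it = trie[node].find(c);
                if (it != trie[node].end()) { node = it->second; continue; }
                int fresh = static_cast<int>(trie.size());
                trie.push_back(std::map<unsigned char, int>());
                trie[node][c] = fresh;
                node = fresh;
            }
        }
        // 2) First-fit placement: all children of s go to BASE[s] + c.
        std::vector<int> addr(trie.size(), kNone);
        addr[0] = 0;
        for (std::size_t n = 0; n < trie.size(); ++n) {
            if (trie[n].empty()) continue;  // leaf: BASE stays kNone
            int b = 1;                      // BASE >= 1 keeps address 0 for the root
            while (!fits(b, trie[n])) ++b;
            base_[addr[n]] = b;
            for (const auto& kv : trie[n]) {
                std::size_t t = static_cast<std::size_t>(b) + kv.first;
                if (t >= check_.size()) {
                    base_.resize(t + 1, kNone);
                    check_.resize(t + 1, kNone);
                }
                check_[t] = addr[n];
                addr[kv.second] = static_cast<int>(t);
            }
        }
    }

    // Child(s, c) = BASE[s] + c iff CHECK[BASE[s] + c] = s
    int child(int s, unsigned char c) const {
        if (base_[s] == kNone) return kNone;
        std::size_t t = static_cast<std::size_t>(base_[s]) + c;
        return (t < check_.size() && check_[t] == s) ? static_cast<int>(t) : kNone;
    }

    // Parent(t) = CHECK[t]
    int parent(int t) const { return check_[t]; }

    bool contains(const std::string& key) const {
        int s = 0;
        for (std::size_t i = 0; i < key.size() && s != kNone; ++i)
            s = child(s, static_cast<unsigned char>(key[i]));
        return s != kNone && child(s, 0) != kNone;
    }

private:
    bool fits(int b, const std::map<unsigned char, int>& children) const {
        for (const auto& kv : children) {
            std::size_t t = static_cast<std::size_t>(b) + kv.first;
            if (t < check_.size() && check_[t] != kNone) return false;
        }
        return true;
    }

    std::vector<int> base_;
    std::vector<int> check_;
};
```

A child step is one addition, one array read, and one comparison, which is where the lookup speed comes from. The price is two full-width integers per slot plus any slots left empty by placement.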
compact in Ω(n lg n) bits
} Information-theoretic lower bound is n lg σ + O(n) bits
} Existing compressed forms have critical problems
  } CDA [Yata+ 07] is still large and sacrifices Access
  } DALF [Kanda+ 16] is smaller than CDA but sacrifices Access
[Figure: BASE and CHECK arrays of length Ω(n) for a trie with n nodes from an alphabet of size σ]
Note: Some empty elements can be included
Empirically represents each BASE/CHECK element in about 8 bits through a simple approach
} The space efficiency is nearly equal to that of DALF
} Supports both Lookup and Access
[Figure: Double Array (BASE, CHECK; 32/64 bits per element) compared with XCDA (BASE’, CHECK’; ~8 bits per element)]
can be freely arranged as long as the following expressions are satisfied for all nodes
[Figure: node s with children t1, t2 placed at BASE[s] + c1 and BASE[s] + c2]
Child(s, c) = BASE[s] + c iff CHECK[BASE[s] + c] = s
Parent(t) = CHECK[t]
addresses can be freely arranged as long as the following expressions are satisfied for all nodes
Note: BASE is also the same
[Figure: two scatter plots over addresses s from 0 to 250000, one of CHECK[s] against s and one of CHECK[s] XOR s against s]
After XOR, most of the values are within one byte
We can obtain
Modified version [Kanda+ 17] of DACs [Brisaboa+ 11]
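The effect of XORing each element with its own address can be tried in a few lines. This sketch covers only the transform and a one-byte count; the function names `xor_with_address` and `count_one_byte` are hypothetical, and the node arrangement and the DACs-style encoding that XCDA applies afterwards are not reproduced here.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Replaces each element values[s] by values[s] XOR s.
// XOR is its own inverse, so applying the function twice restores the input.
std::vector<std::uint32_t> xor_with_address(const std::vector<std::uint32_t>& values) {
    std::vector<std::uint32_t> out(values.size());
    for (std::size_t s = 0; s < values.size(); ++s) {
        out[s] = values[s] ^ static_cast<std::uint32_t>(s);
    }
    return out;
}

// Number of elements that fit in a single byte.
std::size_t count_one_byte(const std::vector<std::uint32_t>& values) {
    std::size_t n = 0;
    for (std::uint32_t v : values) {
        if (v < 256) ++n;
    }
    return n;
}
```

For instance, a node at address 70010 whose parent sits at address 70000 stores 70000 XOR 70010 = 10, because the two addresses differ only in their low bits. Any parent and child that fall in the same 256-aligned block yield a value below 256, so the original value is recovered by XORing the stored byte with the address again.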
x 2.4 GHz CPU
} 16 GB RAM
} Language
  } C++
  } Apple LLVM version 7.0.2 (clang-700)
  } Optimization -O3
} Datasets
  } Geographic names on the asciiname column from GeoNames dump
  } URLs of a 2005 crawl by the UbiCrawler on the .uk domain
by the UbiCrawler on the .uk domain
} Raw size: 2,855.5 MiB
} # of strings: 39,459,925
} Ave. length: 72.4 bytes

            Constr (sec)       Cmpr (%)      Lookup (μs/str)  Access (μs/str)
XCDA        71.7 (base)        25.2 (base)   2.70 (base)      3.54 (base)
DA          65.9 (0.9x)        43.8 (1.7x)   1.95 (0.7x)      2.93 (0.8x)
DALF        110.1 (1.5x)       24.1 (1.0x)   6.00 (2.2x)      –
Cent-rp     472.7 (6.6x)       17.5 (0.7x)   4.02 (1.5x)      4.47 (1.3x)
HTFC-rp     12598.4 (175.7x)   18.3 (0.7x)   7.96 (2.9x)      4.41 (1.2x)
HashDAC-rp  Could not be constructed in practical time

To measure Lookup/Access, extracted a million strings at random
of strings X to values of any type
} Search(x) returns the associated value if x ∈ X
} Insert(x, v) adds the string x associated with value v to the dictionary
} Delete(x) erases the string x from the dictionary
[Figure: strings idea, ism, teach, tech with associated values 12, 51, 87]
provides Search(teach) = 12
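The dynamic interface can be written down against a standard container to fix the semantics. This is only a reference sketch: `HashDictionary` is a hypothetical wrapper around `std::unordered_map`, chosen here for brevity, and it says nothing about how the space-efficient structures in the talk are implemented.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>

template <typename Value>
class HashDictionary {
public:
    // Insert(x, v): returns false, leaving the stored value unchanged,
    // if x is already present.
    bool insert(const std::string& x, Value v) {
        return table_.emplace(x, std::move(v)).second;
    }

    // Search(x): pointer to the associated value, or nullptr if x is not stored.
    const Value* search(const std::string& x) const {
        auto it = table_.find(x);
        return it == table_.end() ? nullptr : &it->second;
    }

    // Delete(x): returns true if x was present and has been erased.
    bool erase(const std::string& x) { return table_.erase(x) > 0; }

    std::size_t size() const { return table_.size(); }

private:
    std::unordered_map<std::string, Value> table_;
};
```

Every key is stored as a separate full string in such a table, which gives a sense of why general-purpose hash tables are the baseline that compact trie-based dictionaries aim to beat on space.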
hash [Google 07]
  } Array hash [Askitis+ 05]
} Based on tries
  } Judy [Baskins 02]
  } HAT-trie [Askitis+ 07]
  } ART [Leis+ 13]
  } Cedar [Yoshinaga+ 14]
  } DynPDT [Kanda+ 17] (New)
} Recent works
  } Provided theoretical results for dynamic tries, not in practice [Jansson+ 15] [Arroyuelo+ 16]
Main results: DynPDT is much smaller than all the others
(DynPDT)
} Path decomposition: Procedure to build cache-friendly tries
  } Application: Compressed SSDs (Cent-rp) [Grossi+ 14]
} Apply it to implement DSDs with a different approach
[Figure: path decomposition turning a trie into a Path-Decomposed Trie (PDT), with node labels such as technology$]
Use a compact dynamic trie, m-Bonsai [Poyias+ 16]
} Trie can be represented in (n/α)(lg σ + O(1)) bits
  } Where n is the number of nodes, σ is the alphabet size, and α is a parameter with 0 < α < 1
  } Close to the information-theoretically optimal n(lg σ + O(1)) bits
} Management of node labels
  } Plain implementation takes O(n lg n) bits for pointers
  } Our compact one takes O(n) bits, although access time is O(lg n)
[Figure: node labels technology$, cs$, ue$, lly$, cal$]
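To get a feel for how close the two bounds are, the constant hidden in each O(1) term can be written as c and assumed equal in both expressions (an assumption of this note; the slide does not give the constants). The ratio then depends only on α, and the value α = 0.8 below is an illustrative choice, not a figure from the talk.

```latex
\frac{(n/\alpha)\,(\lg\sigma + c)}{n\,(\lg\sigma + c)} = \frac{1}{\alpha},
\qquad
\alpha = 0.8 \;\Longrightarrow\; \frac{1}{\alpha} = 1.25 .
```

Under that assumption the trie costs 25% more than the optimum at α = 0.8, and the overhead shrinks as α approaches 1.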
GHz CPU
} 32 GB RAM (L2 cache: 1 MB, L3 cache: 8 MB)
} Language
  } C++
  } g++ (version 5.4.0)
  } Optimization -O9
} Datasets
  } Geographic names on the asciiname column from GeoNames dump
  } URLs of a 2005 crawl by the UbiCrawler on the .uk domain
from GeoNames
} Raw size: 106.1 MiB
} # of strings: 6,784,722
} Ave. length: 15.6 bytes

            Space (bytes/str)  Insert (μs/str)  Search (μs/str)
DynPDT      16.8 (base)        1.37 (base)      1.38 (base)
Sparsehash  62.3 (3.71x)       4.31 (3.14x)     0.34 (0.24x)
Judy        47.6 (2.83x)       0.93 (0.68x)     0.70 (0.50x)
HAT-trie    35.4 (2.11x)       0.96 (0.70x)     0.31 (0.23x)
ART         87.1 (5.18x)       1.07 (0.78x)     0.81 (0.59x)
Cedar       30.5 (1.82x)       1.05 (0.76x)     0.42 (0.30x)

To measure Insert/Search, extracted a million strings at random
by the UbiCrawler on the .uk domain
} Raw size: 2,855.5 MiB
} # of strings: 39,459,925
} Ave. length: 72.4 bytes

            Space (bytes/str)  Insert (μs/str)  Search (μs/str)
DynPDT      28.2 (base)        2.29 (base)      2.47 (base)
Sparsehash  131.0 (4.65x)      9.13 (3.99x)     0.67 (0.27x)
Judy        60.3 (2.14x)       2.15 (0.94x)     2.02 (0.82x)
HAT-trie    82.3 (2.92x)       1.63 (0.71x)     0.61 (0.25x)
ART         140.9 (5.00x)      2.20 (0.96x)     1.84 (0.75x)
Cedar       58.4 (2.07x)       2.56 (1.12x)     2.51 (1.02x)

To measure Insert/Search, extracted a million strings at random
XCDA: Novel compressed double-array structure for static string dictionaries supporting fast lookup
} DynPDT: Novel space-efficient data structure for dynamic string dictionaries
} Open problems
  } We are not aware of any fast implementation supporting substring-based operations
    } FM-index and XBW are very slow compared to other dictionaries
  } We are not aware of any succinct dynamic string dictionaries