Ph.D. Thesis

Shunsuke Kanda

March 02, 2018
Transcript

  1. 2.

    Background

    - Management of massive data is a fundamental problem
    - Most such data are handled as strings
      - Such as documents, Web pages, and genomics data
    - Data structures and algorithms for space-efficient string processing have been developed by many researchers
  2. 3.

    String Dictionaries

    - Data structure for storing a set of strings
      - Such as std::map<std::string, type> in C++
    - Classical applications
      - Manage document vocabulary in NLP and IR tasks

    [Figure: Heaps' Law. Vocabulary size versus document length; the vocabulary grows as O(n^β) for 0.4 < β < 0.6. Terabyte-size document, megabyte-size dictionary.]
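As a rough illustration of the Heaps' Law growth rate, the sketch below evaluates V = k * n^β. The constant `k = 10.0` and the default `beta = 0.5` are assumptions chosen only for this example; the slide gives the range 0.4 < β < 0.6 and no constant.

```python
def heaps_vocabulary(n_tokens, k=10.0, beta=0.5):
    """Estimate vocabulary size under Heaps' Law, V = k * n^beta.

    k and beta are illustrative assumptions, not values from the slides.
    """
    return k * n_tokens ** beta


if __name__ == "__main__":
    # A collection with 10^12 tokens and the assumed constants above.
    print(heaps_vocabulary(10**12))
```

Under these assumed constants, a collection of 10^12 tokens has a vocabulary on the order of 10^7 distinct words, which is why the dictionary stays far smaller than the text itself.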
  3. 4.

    String Dictionaries (cont)

    - Recent applications [Martínez-Prieto+ 16]
      - Natural language applications handling Web collections and N-grams, such as Web search engines and IMEs
      - RDF stores in Semantic Web managing URIs, blank nodes, and literal values
      - Web graphs and crawlers storing massive URLs
      - Alignment software in Bioinformatics indexing large k-mers
      - NoSQL handling massive key-value stores
      - GISs storing large geographic data within a limited-resource navigation system
    - The size will easily exceed gigabytes!
  4. 5.

    Our Contributions

    - We developed some novel data structures to solve existing problems of string dictionaries
      - Static compressed double-array tries supporting fast lookup [Kanda+, KAIS 16] [Kanda+, KAIS 17]
      - Lightweight dictionary compression using dictionary encoding, instead of using powerful text compression [Kanda+, Innovate-Data 17] [Kanda+, DBSJ 18]
      - Dynamic path-decomposed tries in compact space [Kanda+, SPIRE 17]
      - Practical rearrangement methods of dynamic double-array tries [Kanda+, SPE 18]
    - Due to the time limitation, only some of these are presented
  5. 6.

    Static Compressed Double-Array Tries Supporting Fast Lookup

    The contents are based on the paper in the journal Knowledge and Information Systems (KAIS), 2017
  6. 7.

    Static String Dictionaries (SSD)

    - Provide a bijection between a set of strings X and integer IDs in range [1,|X|]
      - Lookup(x) returns the ID if x ∈ X
      - Access(i) restores the string from the given ID i

    Example: the dictionary {idea, ism, teach, tech} with IDs 1, 2, 3, 4 provides
      Lookup(teach) = 3
      Access(3) = teach
      PrefixLookup(te) = {teach, tech}
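The interface above can be sketched with a plain sorted list, which is a reference implementation for the operations only and not one of the compressed structures discussed in the talk. The class name `SortedStringDictionary` and the method names are hypothetical; IDs are assumed to follow lexicographic order, as in the slide example.

```python
from bisect import bisect_left


class SortedStringDictionary:
    """Uncompressed reference SSD: IDs 1..|X| follow lexicographic order."""

    def __init__(self, strings):
        self.keys = sorted(set(strings))

    def lookup(self, x):
        """Return the ID of x, or None if x is not in the set."""
        i = bisect_left(self.keys, x)
        if i < len(self.keys) and self.keys[i] == x:
            return i + 1
        return None

    def access(self, i):
        """Return the string whose ID is i (1-based)."""
        if not 1 <= i <= len(self.keys):
            raise IndexError("ID out of range")
        return self.keys[i - 1]

    def prefix_lookup(self, prefix):
        """Return all stored strings that start with prefix."""
        i = bisect_left(self.keys, prefix)
        result = []
        while i < len(self.keys) and self.keys[i].startswith(prefix):
            result.append(self.keys[i])
            i += 1
        return result


if __name__ == "__main__":
    d = SortedStringDictionary(["idea", "ism", "teach", "tech"])
    print(d.lookup("teach"), d.access(3), d.prefix_lookup("te"))
```

Binary search makes Lookup cost O(k lg |X|) character comparisons for a string of length k, which is one reason trie-based structures are attractive.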
  7. 8.

    State-of-the-art Compressed SSDs

    - Based on closed hashing
      - HashDAC-rp [Martínez-Prieto+ 16]
    - Based on Front Coding
      - HTFC-rp [Martínez-Prieto+ 16]
    - Based on suffix arrays
      - FM-index [Ferragina+ 07]
    - Based on tries
      - MARISA [Yata 11]
      - XBW [Ferragina+ 09]
      - Cent-rp [Grossi+ 16]
      - LZ string dictionaries [Arz+ 14]
      - XCDA [Kanda+ 17] (New)

    [Table: each dictionary rated on Space, Time, and Functionality. Note: depending largely on datasets.]
  8. 9.

    Tries [Fredkin 60, Knuth 88]

    - Edge-labeled tree for storing a set of strings
    - Lookup/Access are supported in O(k) optimal time
      - k is the length of the target string

    [Figure: trie storing idea, ism, teach, tech, tie, trie.]
      Lookup(tech) = 4
      Access(4) = tech
      PrefixLookup(te) = {teach, tech}
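A minimal pointer-based trie, assuming one character per edge and IDs assigned in lexicographic order, shows why both operations touch only k nodes. The class `Trie` and its fields are hypothetical names for this sketch; the figure on the slide appears to merge some edges, which this version does not do.

```python
class Trie:
    """Pointer-based trie with one character per edge (illustrative sketch)."""

    def __init__(self, strings):
        self.children = [{}]          # children[node]: edge label -> child node
        self.parent = [(None, "")]    # parent[node]: (parent node, edge label)
        self.node_id = {}             # terminal node -> string ID
        self.terminal = [None]        # terminal[ID] -> terminal node (index 0 unused)
        for i, s in enumerate(sorted(set(strings)), start=1):
            node = 0
            for ch in s:
                if ch not in self.children[node]:
                    self.children.append({})
                    self.parent.append((node, ch))
                    self.children[node][ch] = len(self.children) - 1
                node = self.children[node][ch]
            self.node_id[node] = i
            self.terminal.append(node)

    def lookup(self, x):
        """Walk down one edge per character: O(k) steps."""
        node = 0
        for ch in x:
            node = self.children[node].get(ch)
            if node is None:
                return None
        return self.node_id.get(node)

    def access(self, i):
        """Walk up from the terminal node to the root: O(k) steps."""
        node, chars = self.terminal[i], []
        while node != 0:
            node, ch = self.parent[node]
            chars.append(ch)
        return "".join(reversed(chars))


if __name__ == "__main__":
    t = Trie(["idea", "ism", "teach", "tech", "tie", "trie"])
    print(t.lookup("tech"), t.access(4))
```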
  9. 10.

    Double Arrays [Aoe 89]

    - Trie representation technique using two arrays
      - BASE stores offsets to children
      - CHECK stores back pointers to parents
    - Provide the fastest Lookup/Access in tries

    [Figure: node s with children t1 and t2 on edge labels c1 and c2; BASE[s] = bs, t1 = bs + c1, t2 = bs + c2, and CHECK[t1] = CHECK[t2] = s.]
      Child(s, c) = BASE[s] + c iff CHECK[BASE[s] + c] = s
      Parent(t) = CHECK[t]

    Note: XOR is often used instead of PLUS in practice
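The Child and Parent expressions can be tried out with a small sketch. It uses PLUS as in the expressions above, places each node's children by brute-force search for the first free offset, and appends a NUL byte as a terminator; the placement strategy, the terminator, the `EMPTY`/`ROOT` markers, and all function names are assumptions of this example, not the construction used in the cited work.

```python
EMPTY = -1   # CHECK value of an unused address
ROOT = -2    # CHECK value that marks address 0 as taken by the root


def build_double_array(keys):
    """Build BASE/CHECK for the given keys (assumes no key contains NUL)."""
    # Step 1: ordinary trie; trie[node] maps a byte code to a child node.
    trie = [{}]
    for key in keys:
        node = 0
        for code in key.encode() + b"\0":
            if code not in trie[node]:
                trie.append({})
                trie[node][code] = len(trie) - 1
            node = trie[node][code]

    # Step 2: give every node an address so that Child/Parent hold.
    base, check = [0], [ROOT]
    stack = [(0, 0)]                      # (trie node, its address)
    while stack:
        node, s = stack.pop()
        codes = sorted(trie[node])
        if not codes:
            continue
        b = 0                             # first offset whose child slots are all free
        while any(b + c < len(check) and check[b + c] != EMPTY for c in codes):
            b += 1
        grow = b + codes[-1] + 1 - len(check)
        if grow > 0:
            base.extend([0] * grow)
            check.extend([EMPTY] * grow)
        base[s] = b
        for c in codes:
            check[b + c] = s              # back pointer to the parent
            stack.append((trie[node][c], b + c))
    return base, check


def lookup(base, check, key):
    """Return the address of the terminal node of key, or None."""
    s = 0
    for c in key.encode() + b"\0":
        t = base[s] + c                   # Child(s, c) = BASE[s] + c
        if t >= len(check) or check[t] != s:
            return None
        s = t
    return s


def access(base, check, t):
    """Restore the key from its terminal address using Parent(t) = CHECK[t]."""
    codes = []
    while t != 0:
        s = check[t]
        codes.append(t - base[s])         # edge label c = t - BASE[s]
        t = s
    return bytes(reversed(codes[1:])).decode()   # codes[0] is the terminator


if __name__ == "__main__":
    base, check = build_double_array(["idea", "ism", "teach", "tech", "tie", "trie"])
    t = lookup(base, check, "tech")
    print(t, access(base, check, t))
```

Here the terminal address plays the role of the ID; a real static dictionary would map it to the range [1,|X|].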
  10. 11.

    Shortcoming of Double Arrays

    - BASE and CHECK are not compact, taking Ω(n lg n) bits for a trie with n nodes over an alphabet of size σ
      - Information-theoretical lower bound is n lg σ + O(n) bits
    - Existing compressed forms have critical problems
      - CDA [Yata+ 07] is still large and sacrifices Access
      - DALF [Kanda+ 16] is smaller than CDA but sacrifices Access

    [Figure: BASE and CHECK arrays of length Ω(n). Note: some empty elements can be included.]
  11. 12.

    Novel Compressed Data Structure

    - XOR-compressed double arrays (XCDA)
      - Represent each BASE/CHECK element in about 8 bits empirically, through a simple approach
      - The space efficiency is nearly equal to that of DALF
      - Support both Lookup and Access

    [Figure: a double array stores BASE and CHECK in 32/64 bits per element; XCDA stores BASE' and CHECK' in ~8 bits per element.]
  12. 13.

    Basic Idea

    - Property of double arrays
      - Node addresses can be freely arranged as long as the following expressions are satisfied for all nodes

    [Figure: node s with children t1 and t2 on edge labels c1 and c2; BASE[s] = bs, t1 = bs + c1, t2 = bs + c2, and CHECK[t1] = CHECK[t2] = s.]
      Child(s, c) = BASE[s] + c iff CHECK[BASE[s] + c] = s
      Parent(t) = CHECK[t]
  13. 14.

    Basic Idea (cont)

    - Property of double arrays
      - Node addresses can be freely arranged as long as the following expressions are satisfied for all nodes

    [Figure: two plots over addresses s from 0 to 250,000. The first shows CHECK[s] against s. The second, obtained by XOR, shows CHECK[s] XOR s against s; most of the values are within one byte. Note: BASE is also the same.]

    - Representation: a modified version [Kanda+ 17] of DACs [Brisaboa+ 11]
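The effect of the XOR step can be sketched as follows. When a pointer is close to the address it is stored at, the two share their high bits, so their XOR is a small number. The escape byte 255 and the Python dict for outliers are simplifying assumptions of this example and stand in for the modified DACs representation; the function names are hypothetical.

```python
ESCAPE = 255  # assumed marker: the real value lives in the overflow table


def xor_compress(pointers):
    """Store pointers[s] XOR s in one byte per element where possible."""
    small = bytearray()
    overflow = {}                 # address -> XORed value that needs more than a byte
    for s, p in enumerate(pointers):
        x = p ^ s
        if x < ESCAPE:
            small.append(x)
        else:
            small.append(ESCAPE)
            overflow[s] = x
    return small, overflow


def xor_get(small, overflow, s):
    """Recover the original pointer stored at address s."""
    x = small[s]
    if x == ESCAPE:
        x = overflow[s]
    return x ^ s


if __name__ == "__main__":
    check = [3, 2, 0, 1, 6, 300, 4]          # made-up CHECK values for illustration
    small, overflow = xor_compress(check)
    print(list(small), overflow)
    print([xor_get(small, overflow, s) for s in range(len(check))])
```

In this made-up array only the element at address 5 needs the overflow table; every other element fits in one byte regardless of how large the addresses themselves are.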
  14. 15.

    Experimental Settings

    - Machine
      - Quad-Core Intel Xeon 2 x 2.4 GHz CPU
      - 16 GB RAM
    - Language
      - C++
      - Apple LLVM version 7.0.2 (clang-700)
      - Optimization -O3
    - Datasets
      - Geographic names on the asciiname column from GeoNames dump
      - URLs of a 2005 crawl by the UbiCrawler on the .uk domain
  15. 16.

    Experimental Settings (cont)

    - Based on closed hashing
      - HashDAC-rp [Martínez-Prieto+ 16]
    - Based on Front Coding
      - HTFC-rp [Martínez-Prieto+ 16]
    - Based on suffix arrays
      - FM-index [Ferragina+ 07]
    - Based on tries
      - MARISA [Yata 11]
      - XBW [Ferragina+ 09]
      - Cent-rp [Grossi+ 16]
      - LZ string dictionaries [Arz+ 14]
      - XCDA [Kanda+ 17]
    - Original double array, DA [Aoe 89]
    - Previous compressed double array, DALF [Kanda+ 16]

    [Table: each dictionary rated on Space, Time, and Functionality.]
  16. 17.

    Experimental Results

    - Geographic names on the asciiname column from GeoNames
      - Raw size: 106.1 MiB
      - # of strings: 6,784,722
      - Ave. length: 15.6 bytes

                 Constr (sec)     Cmpr (%)       Lookup (μs/str)   Access (μs/str)
    XCDA           5.7 (base)     52.8 (base)    0.93 (base)       1.29 (base)
    DA             4.9 (0.9x)     95.8 (1.8x)    0.61 (0.7x)       0.95 (0.7x)
    DALF           9.5 (1.7x)     52.8 (1.0x)    0.80 (0.9x)       –
    Cent-rp       33.8 (5.9x)     31.5 (0.6x)    2.10 (2.3x)       2.17 (1.7x)
    HTFC-rp      125.1 (21.9x)    34.4 (0.7x)    3.50 (3.8x)       1.79 (1.4x)
    HashDAC-rp   298.9 (52.4x)    48.0 (0.9x)    1.28 (1.4x)       0.92 (0.7x)

    Lookup/Access times were measured on a million strings extracted at random
  17. 18.

    Experimental Results (cont)

    - URLs of a 2005 crawl by the UbiCrawler on the .uk domain
      - Raw size: 2,855.5 MiB
      - # of strings: 39,459,925
      - Ave. length: 72.4 bytes

                 Constr (sec)         Cmpr (%)       Lookup (μs/str)   Access (μs/str)
    XCDA            71.7 (base)       25.2 (base)    2.70 (base)       3.54 (base)
    DA              65.9 (0.9x)       43.8 (1.7x)    1.95 (0.7x)       2.93 (0.8x)
    DALF           110.1 (1.5x)       24.1 (1.0x)    6.00 (2.2x)       –
    Cent-rp        472.7 (6.6x)       17.5 (0.7x)    4.02 (1.5x)       4.47 (1.3x)
    HTFC-rp      12598.4 (175.7x)     18.3 (0.7x)    7.96 (2.9x)       4.41 (1.2x)
    HashDAC-rp   Could not be constructed in practical time

    Lookup/Access times were measured on a million strings extracted at random
  18. 19.

    Dynamic Path-Decomposed Tries

    The contents are based on the paper in the proceedings of the 24th International Symposium on String Processing and Information Retrieval (SPIRE), 2017
  19. 20.

    Dynamic String Dictionaries (DSD)

    - Provide a mapping from a set of strings X to values of any type
      - Search(x) returns the associated value if x ∈ X
      - Insert(x, v) adds the string x associated with value v to the dictionary
      - Delete(x) erases the string x from the dictionary

    [Figure: the strings idea, ism, teach, tech mapped to values such as 12, 51, 87; for example, Search(teach) = 12.]
  20. 21.

    State-of-the-art DSDs

    - Based on hashing
      - Google sparse hash [Google 07]
      - Array hash [Askitis+ 05]
    - Based on tries
      - Judy [Baskins 02]
      - HAT-trie [Askitis+ 07]
      - ART [Leis+ 13]
      - Cedar [Yoshinaga+ 14]
      - DynPDT [Kanda+ 17] (New)
    - Recent works
      - Provide theoretical results for dynamic tries, but not practical implementations [Jansson+ 15] [Arroyuelo+ 16]
    - Main results
      - DynPDT is much smaller than all the others
  21. 22.

    Novel Data Structure of DSDs

    - Dynamic Path-Decomposed Tries (DynPDT)
      - Path decomposition: Procedure to build cache-friendly tries
      - Application: Compressed SSDs (Cent-rp) [Grossi+ 14]
      - Apply it to implement DSDs with a different approach

    [Figure: path decomposition transforms a trie into a Path-Decomposed Trie (PDT); the node for the selected path holds technology$.]
  22. 23.

    Basic Idea

    - Incrementally construct a PDT
      1. Add key technology$ to an empty dictionary

    [Figure: a single node labeled technology$ with value v1.]
  23. 24.

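    Basic Idea (cont)

    - Incrementally construct a PDT
      2. Add key technics$

    [Figure: technics$ first differs from technology$ at position 5 (positions counted 0 to 5), where it has the character i. A new node labeled cs$ with value v2 is attached to the node technology$ (value v1) by the edge (5, i).]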
  24. 25.

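    Basic Idea (cont)

    - Incrementally construct a PDT
      3. Add key technique$

    [Figure: technique$ follows the edge (5, i) from technology$ (value v1) to the node cs$ (value v2). The remaining que$ first differs from cs$ at position 0, where it has the character q, so a new node labeled ue$ with value v3 is attached by the edge (0, q). The result is a Dynamic Path-Decomposed Trie (DynPDT).]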
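The three insertion steps can be reproduced with a short sketch. Each node keeps the remaining suffix of the key that created it, and children are keyed by the pair (mismatch position, mismatch character). The classes `Node` and `DynPDT`, the Python dicts used for the topology, and appending "$" inside `insert` are assumptions of this example; the compact topology and label storage described in the talk are not modeled.

```python
class Node:
    def __init__(self, label, value):
        self.label = label        # suffix stored in this node, ending with "$"
        self.value = value
        self.children = {}        # (mismatch position, mismatch character) -> Node


def mismatch(a, b):
    """Return the first position where a and b differ (or the shorter length)."""
    i = 0
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    return i


class DynPDT:
    def __init__(self):
        self.root = None

    def insert(self, key, value):
        rest = key + "$"          # assumes "$" never occurs inside a key
        if self.root is None:
            self.root = Node(rest, value)
            return
        node = self.root
        while True:
            i = mismatch(node.label, rest)
            if i == len(rest):    # key already stored: overwrite its value
                node.value = value
                return
            edge = (i, rest[i])
            child = node.children.get(edge)
            if child is None:     # branch off: the new node keeps the rest of the key
                node.children[edge] = Node(rest[i + 1:], value)
                return
            node, rest = child, rest[i + 1:]

    def search(self, key):
        node, rest = self.root, key + "$"
        while node is not None:
            i = mismatch(node.label, rest)
            if i == len(rest):
                return node.value
            node, rest = node.children.get((i, rest[i])), rest[i + 1:]
        return None


if __name__ == "__main__":
    d = DynPDT()
    for v, k in enumerate(["technology", "technics", "technique"], start=1):
        d.insert(k, v)
    print(d.root.label, sorted(d.root.children))
    print(d.search("technique"))
```

Existing nodes are never split or relabeled on insertion, which is what makes the incremental construction simple: a new key only ever adds one node.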
  25. 26.

    How to Implement?

    - Representation of trie topology
      - Use a compact dynamic trie, m-Bonsai [Poyias+ 16]
      - Trie can be represented in (n/α)(lg σ + O(1)) bits
        - Where n is the number of nodes, σ is the alphabet size, and 0 < α < 1
      - Close to the information-theoretically optimal n(lg σ + O(1)) bits
    - Management of node labels
      - Plain implementation takes O(n lg n) bits for pointers
      - Our compact one takes O(n) bits, although access time is O(lg n)

    [Figure: node labels technology$, cs$, ue$, lly$, cal$.]
  26. 27.

    Experimental Settings

    - Machine
      - Intel Xeon E5540 2.53 GHz CPU
      - 32 GB RAM (L2 cache: 1 MB, L3 cache: 8 MB)
    - Language
      - C++
      - g++ (version 5.4.0)
      - Optimization -O9
    - Datasets
      - Geographic names on the asciiname column from GeoNames dump
      - URLs of a 2005 crawl by the UbiCrawler on the .uk domain
  27. 28.

    Experimental Settings (cont)

    - Based on hashing
      - Google sparse hash [Google 07]
      - Array hash [Askitis+ 05]
    - Based on tries
      - Judy [Baskins 02]
      - HAT-trie [Askitis+ 07]
      - ART [Leis+ 13]
      - Cedar [Yoshinaga+ 14]
      - DynPDT [Kanda+ 17]
  28. 29.

    Experimental Results

    - Geographic names on the asciiname column from GeoNames
      - Raw size: 106.1 MiB
      - # of strings: 6,784,722
      - Ave. length: 15.6 bytes

                 Space (bytes/str)   Insert (μs/str)   Search (μs/str)
    DynPDT       16.8 (base)         1.37 (base)       1.38 (base)
    Sparsehash   62.3 (3.71x)        4.31 (3.14x)      0.34 (0.24x)
    Judy         47.6 (2.83x)        0.93 (0.68x)      0.70 (0.50x)
    HAT-trie     35.4 (2.11x)        0.96 (0.70x)      0.31 (0.23x)
    ART          87.1 (5.18x)        1.07 (0.78x)      0.81 (0.59x)
    Cedar        30.5 (1.82x)        1.05 (0.76x)      0.42 (0.30x)

    Insert/Search times were measured on a million strings extracted at random
  29. 30.

    Experimental Results (cont)

    - URLs of a 2005 crawl by the UbiCrawler on the .uk domain
      - Raw size: 2,855.5 MiB
      - # of strings: 39,459,925
      - Ave. length: 72.4 bytes

                 Space (bytes/str)   Insert (μs/str)   Search (μs/str)
    DynPDT        28.2 (base)        2.29 (base)       2.47 (base)
    Sparsehash   131.0 (4.65x)       9.13 (3.99x)      0.67 (0.27x)
    Judy          60.3 (2.14x)       2.15 (0.94x)      2.02 (0.82x)
    HAT-trie      82.3 (2.92x)       1.63 (0.71x)      0.61 (0.25x)
    ART          140.9 (5.00x)       2.20 (0.96x)      1.84 (0.75x)
    Cedar         58.4 (2.07x)       2.56 (1.12x)      2.51 (1.02x)

    Insert/Search times were measured on a million strings extracted at random
  30. 31.

    Concluding Remarks

    - Our contributions (in the presentation)
      - XCDA: Novel compressed double-array structure for static string dictionaries supporting fast lookup
      - DynPDT: Novel space-efficient data structure for dynamic string dictionaries
    - Open problems
      - We are not aware of any fast implementation supporting substring-based operations
        - FM-index and XBW are very slow compared to other dictionaries
      - We are not aware of any succinct dynamic string dictionaries