Ph.D. Thesis

Shunsuke Kanda

March 02, 2018

Transcript

  1. Space- and Time-Efficient
    String Dictionaries
    Shunsuke Kanda (D2) @ Fuketa Lab.

  2. Background
    } Management of massive data is a fundamental problem
    } Most such data are handled as strings
    } Such as documents, Web pages, and genomic data
    } Data structures and algorithms for space-efficient string
    processing have been developed by many researchers

  3. String Dictionaries
    } Data structure for storing a set of strings
    } Such as std::map in C++
    } Classical applications
    } Manage document vocabulary in NLP and IR tasks
    [Figure: Heaps' Law. The vocabulary size grows as O(n^β) with the document length n, for 0.4 < β < 0.6, so a terabyte-size document yields a megabyte-size dictionary]

  4. String Dictionaries (cont)
    } Recent applications [Martínez-Prieto+ 16]
    } Natural language applications handling Web collections and
    N-grams, such as Web search engines and IMEs
    } RDF stores in Semantic Web managing URIs, blank nodes, and
    literal values
    } Web graphs and crawlers storing massive URLs
    } Alignment software in Bioinformatics indexing large k-mers
    } NoSQL handling massive key-value stores
    } GISs storing large geographic data within a limited-resource
    navigation system
    The size will easily exceed gigabytes!

  5. Our Contributions
    } We developed some novel data structures to solve
    existing problems of string dictionaries
    } Static compressed double-array tries supporting fast lookup
    [Kanda+, KAIS 16] [Kanda+, KAIS 17]
    } Lightweight dictionary compression using dictionary encoding,
    instead of using powerful text compression
    [Kanda+, Innovate-Data 17] [Kanda+, DBSJ 18]
    } Dynamic path-decomposed tries in compact space [Kanda+,
    SPIRE 17]
    } Practical rearrangement methods of dynamic double-array tries
    [Kanda+, SPE 18]
    Due to time limitations, this presentation covers only the first and third items

  6. Static Compressed Double-Array
    Tries Supporting Fast Lookup
    The contents are based on the paper in the journal of
    Knowledge and Information Systems (KAIS), 2017

  7. Static String Dictionaries (SSD)
    } Provide a bijection between a set of strings X and integer
    IDs in range [1,|X|]
    } Lookup(x) returns the ID if x ∈ X
    } Access(i) restores the string from the given ID i
    Example: X = {idea, ism, teach, tech} with IDs 1, 2, 3, 4
    Lookup(teach) = 3
    Access(3) = teach
    PrefixLookup(te) = {teach, tech}
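
The three operations can be sketched with a plain sorted array. This is only an uncompressed illustration of the interface, not any of the compressed structures discussed in the deck; the class name SortedDictionary and its method names are assumptions made for this sketch, and 0 is used here to signal "not found".

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Uncompressed stand-in for a static string dictionary (hypothetical name).
// IDs are 1-based ranks in lexicographic order, so strings and IDs in
// [1, |X|] are in one-to-one correspondence.
class SortedDictionary {
    std::vector<std::string> keys_;

public:
    explicit SortedDictionary(std::vector<std::string> keys) : keys_(std::move(keys)) {
        std::sort(keys_.begin(), keys_.end());
        keys_.erase(std::unique(keys_.begin(), keys_.end()), keys_.end());
    }

    // Lookup(x): returns the ID of x, or 0 if x is not in the set.
    std::size_t lookup(const std::string& x) const {
        auto it = std::lower_bound(keys_.begin(), keys_.end(), x);
        if (it == keys_.end() || *it != x) return 0;
        return static_cast<std::size_t>(it - keys_.begin()) + 1;
    }

    // Access(i): restores the string with ID i; throws if i is out of range.
    const std::string& access(std::size_t id) const { return keys_.at(id - 1); }

    // PrefixLookup(p): returns every stored string that starts with p.
    std::vector<std::string> prefix_lookup(const std::string& prefix) const {
        std::vector<std::string> out;
        auto it = std::lower_bound(keys_.begin(), keys_.end(), prefix);
        for (; it != keys_.end() && it->compare(0, prefix.size(), prefix) == 0; ++it)
            out.push_back(*it);
        return out;
    }
};
```

Built from idea, ism, teach, and tech, lookup("teach") gives 3 and access(3) gives teach, matching the example on the slide.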

  8. State-of-the-art Compressed SSDs
    } Based on closed hashing
    } HashDAC-rp [Martínez-Prieto+ 16]
    } Based on Front Coding
    } HTFC-rp [Martínez-Prieto+ 16]
    } Based on suffix arrays
    } FM-index [Ferragina+ 07]
    } Based on tries
    } MARISA [Yata 11]
    } XBW [Ferragina+ 09]
    } Cent-rp [Grossi+ 16]
    } LZ string dictionaries [Arz+ 14]
    } XCDA [Kanda+ 17]
    Compared in terms of space, time, and functionality; XCDA is new
    Note: The comparison depends largely on the dataset

  9. Tries [Fredkin 60, Knuth 88]
    } Edge-labeled tree for storing a set of strings
    } Lookup/Access are supported in optimal O(k) time
    } where k is the length of the target string
    [Figure: trie storing idea, ism, teach, tech, tie, and trie]
    Lookup(tech) = 4
    Access(4) = tech
    PrefixLookup(te) = {teach, tech}
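
A minimal pointer-based trie illustrates why Lookup visits one node per character. The class name Trie, the std::map child table, and the rule that IDs follow the sorted input order are assumptions of this sketch; Access is left out because it would additionally need parent links.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Pointer-based trie (hypothetical name); each edge carries one character.
class Trie {
    struct Node {
        std::map<char, std::unique_ptr<Node>> children;
        std::size_t id = 0;  // 0 means that no string ends at this node
    };
    Node root_;

public:
    // Assumption: keys arrive sorted and without duplicates, so the
    // i-th key receives ID i.
    explicit Trie(const std::vector<std::string>& sorted_keys) {
        std::size_t next_id = 1;
        for (const std::string& key : sorted_keys) {
            Node* node = &root_;
            for (char c : key) {
                std::unique_ptr<Node>& child = node->children[c];
                if (!child) child.reset(new Node());
                node = child.get();
            }
            node->id = next_id++;
        }
    }

    // Follows one edge per character, so a string of length k visits
    // at most k nodes below the root. Returns 0 if x is not stored.
    std::size_t lookup(const std::string& x) const {
        const Node* node = &root_;
        for (char c : x) {
            auto it = node->children.find(c);
            if (it == node->children.end()) return 0;
            node = it->second.get();
        }
        return node->id;
    }
};
```

With the six strings from the figure, lookup("tech") returns 4, while the prefix te, which is not itself a stored string, returns 0.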

  10. Double Arrays [Aoe 89]
    } Trie representation technique using two arrays
    } BASE stores offsets to children
    } CHECK stores back pointers to parents
    } Provide the fastest Lookup/Access among trie representations
    [Figure: node s has children t1 and t2 via labels c1 and c2, where BASE[s] = bs, t1 = bs + c1, t2 = bs + c2, and CHECK[t1] = CHECK[t2] = s]
    Child(s, c) = BASE[s] + c iff CHECK[BASE[s] + c] = s
    Parent(t) = CHECK[t]
    Note: In practice, XOR is often used instead of addition
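
The Child and Parent expressions can be tried on a tiny hand-built double array. The keys ab and ac, the label codes (a = 1, b = 2, c = 3), the use of -1 for "no such node", and the names DoubleArray and make_example are all assumptions chosen for this sketch; it uses the additive form shown above, not the XOR variant.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Minimal double array over hand-assigned node addresses (hypothetical names).
struct DoubleArray {
    std::vector<int> base;
    std::vector<int> check;

    // Child(s, c) = BASE[s] + c iff CHECK[BASE[s] + c] = s; -1 otherwise.
    int child(int s, int c) const {
        int t = base[s] + c;
        if (t < 0 || t >= static_cast<int>(check.size()) || check[t] != s) return -1;
        return t;
    }

    // Parent(t) = CHECK[t].
    int parent(int t) const { return check[t]; }

    // Walks from the root (node 0); returns the reached node or -1.
    int walk(const std::string& key) const {
        int s = 0;
        for (char ch : key) {
            s = child(s, ch - 'a' + 1);  // assumed label code: a = 1, b = 2, ...
            if (s < 0) return -1;
        }
        return s;
    }
};

// Trie for the keys "ab" and "ac":
// node 1 is reached by "a", node 2 by "ab", node 3 by "ac".
inline DoubleArray make_example() {
    return DoubleArray{
        /*base=*/{0, 0, 0, 0},     // BASE of the leaves 2 and 3 is unused here
        /*check=*/{-1, 0, 1, 1}};  // -1 marks the root, which has no parent
}
```

Here walk("ab") reaches node 2, while walk("b") fails because CHECK[2] is 1 rather than 0: the slot is occupied, but by a child of a different parent.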

  11. Shortcoming of Double Arrays
    } BASE and CHECK are not compact: they take Ω(n lg n) bits
    } The information-theoretic lower bound is n lg σ + O(n) bits
    } Existing compressed forms have critical problems
    } CDA [Yata+ 07] is still large and sacrifices Access
    } DALF [Kanda+ 16] is smaller than CDA but sacrifices Access
    For a trie with n nodes over an alphabet of size σ, BASE and CHECK each have Ω(n) elements
    Note: Some empty elements can be included

  12. Novel Compressed Data Structure
    } XOR-compressed double arrays (XCDA)
    } Represent each BASE/CHECK element in about 8 bits (empirically)
    through a simple approach
    } The space efficiency is nearly equal to that of DALF
    } Support both Lookup and Access
    [Figure: the double array stores BASE and CHECK in 32/64 bits per element; XCDA stores BASE' and CHECK' in about 8 bits per element]

  13. Basic Idea
    } Property of double arrays
    } Node addresses can be freely arranged as long as the following
    expressions are satisfied for all nodes
    [Figure: node s with children t1 = bs + c1 and t2 = bs + c2, where BASE[s] = bs and CHECK[t1] = CHECK[t2] = s]
    Child(s, c) = BASE[s] + c iff CHECK[BASE[s] + c] = s
    Parent(t) = CHECK[t]

  14. Basic Idea (cont)
    } Property of double arrays
    } Node addresses can be freely arranged as long as the following
    expressions are satisfied for all nodes
    [Figure: plots of CHECK[s] and of CHECK[s] XOR s against the address s]
    } After taking the XOR with the address s, most values fit within one byte
    } The resulting values are stored with a modified version [Kanda+ 17] of
    DACs [Brisaboa+ 11]
    Note: The same holds for BASE
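
The effect of XOR-ing each element with its own address can be sketched as follows. This is not the XCDA implementation: the per-index exception table is a deliberately simplified stand-in for the modified DACs, and the name XorByteArray is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Stores values[i] XOR i in one byte when it fits, otherwise in an
// exception table (a simplified stand-in for DACs; hypothetical name).
class XorByteArray {
    std::vector<std::uint8_t> bytes_;
    std::vector<bool> is_exception_;
    std::unordered_map<std::size_t, std::uint32_t> exceptions_;

public:
    explicit XorByteArray(const std::vector<std::uint32_t>& values)
        : bytes_(values.size(), 0), is_exception_(values.size(), false) {
        for (std::size_t i = 0; i < values.size(); ++i) {
            std::uint32_t x = values[i] ^ static_cast<std::uint32_t>(i);
            if (x < 256) {
                bytes_[i] = static_cast<std::uint8_t>(x);
            } else {
                is_exception_[i] = true;
                exceptions_[i] = x;
            }
        }
    }

    // Recovers the original value by XOR-ing the address back in.
    std::uint32_t get(std::size_t i) const {
        std::uint32_t x = is_exception_[i] ? exceptions_.at(i) : bytes_[i];
        return x ^ static_cast<std::uint32_t>(i);
    }

    // Number of elements that did not fit in one byte.
    std::size_t num_exceptions() const { return exceptions_.size(); }
};
```

When a value shares all but its lowest 8 bits with its own address, the XOR cancels the shared high bits and the remainder fits in one byte. Only values that point far from their own address land in the exception table.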

  15. Experimental Settings
    } Machine
    } Quad-Core Intel Xeon 2 x 2.4 GHz CPU
    } 16 GB RAM
    } Language
    } C++
    } Apple LLVM version 7.0.2 (clang-700)
    } Optimization -O3
    } Datasets
    } Geographic names on the asciiname column from GeoNames dump
    } URLs of a 2005 crawl by the UbiCrawler on the .uk domain

  16. Experimental Settings (cont)
    } Based on closed hashing
    } HashDAC-rp [Martínez-Prieto+ 16]
    } Based on Front Coding
    } HTFC-rp [Martínez-Prieto+ 16]
    } Based on suffix arrays
    } FM-index [Ferragina+ 07]
    } Based on tries
    } MARISA [Yata 11]
    } XBW [Ferragina+ 09]
    } Cent-rp [Grossi+ 16]
    } LZ string dictionaries [Arz+ 14]
    } XCDA [Kanda+ 17]
    } Also compared
    } Original double array, DA [Aoe 89]
    } Previous compressed double array, DALF [Kanda+ 16]

  17. Experimental Results
    } Geographic names on the asciiname column from GeoNames
    } Raw size: 106.1 MiB
    } # of strings: 6,784,722
    } Ave. length: 15.6 bytes
    Structure    Constr (sec)     Cmpr (%)      Lookup (μs/str)  Access (μs/str)
    XCDA         5.7 (base)       52.8 (base)   0.93 (base)      1.29 (base)
    DA           4.9 (0.9x)       95.8 (1.8x)   0.61 (0.7x)      0.95 (0.7x)
    DALF         9.5 (1.7x)       52.8 (1.0x)   0.80 (0.9x)      –
    Cent-rp      33.8 (5.9x)      31.5 (0.6x)   2.10 (2.3x)      2.17 (1.7x)
    HTFC-rp      125.1 (21.9x)    34.4 (0.7x)   3.50 (3.8x)      1.79 (1.4x)
    HashDAC-rp   298.9 (52.4x)    48.0 (0.9x)   1.28 (1.4x)      0.92 (0.7x)
    Lookup/Access times were measured on a million strings extracted at random

  18. Experimental Results (cont)
    } URLs of a 2005 crawl by the UbiCrawler on the .uk domain
    } Raw size: 2,855.5 MiB
    } # of strings: 39,459,925
    } Ave. length: 72.4 bytes
    Structure    Constr (sec)       Cmpr (%)      Lookup (μs/str)  Access (μs/str)
    XCDA         71.7 (base)        25.2 (base)   2.70 (base)      3.54 (base)
    DA           65.9 (0.9x)        43.8 (1.7x)   1.95 (0.7x)      2.93 (0.8x)
    DALF         110.1 (1.5x)       24.1 (1.0x)   6.00 (2.2x)      –
    Cent-rp      472.7 (6.6x)       17.5 (0.7x)   4.02 (1.5x)      4.47 (1.3x)
    HTFC-rp      12598.4 (175.7x)   18.3 (0.7x)   7.96 (2.9x)      4.41 (1.2x)
    HashDAC-rp   Could not be constructed in practical time
    Lookup/Access times were measured on a million strings extracted at random

  19. Dynamic Path-Decomposed Tries
    The contents are based on the paper in the proceedings of
    24th International Symposium on String Processing and
    Information Retrieval (SPIRE), 2017

  20. Dynamic String Dictionaries (DSD)
    } Provide a mapping from a set of strings X to values of any type
    } Search(x) returns the associated value if x ∈ X
    } Insert(x, v) adds the string x associated with value v to the dictionary
    } Delete(x) erases the string x from the dictionary
    [Figure: the strings idea, ism, teach, and tech mapped to values such as 12, 51, and 87]
    Search(teach) = 12
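
The three operations map directly onto std::map, which the deck mentions earlier as a familiar string dictionary. The wrapper below only names them; MapDictionary is a hypothetical class, and returning a null pointer for a missing key is a choice made for this sketch.

```cpp
#include <cassert>
#include <map>
#include <string>

// Thin wrapper naming the three DSD operations on top of std::map
// (hypothetical name; not one of the structures compared in the deck).
class MapDictionary {
    std::map<std::string, int> map_;

public:
    // Insert(x, v): adds x with value v, overwriting any previous value.
    void insert(const std::string& x, int v) { map_[x] = v; }

    // Search(x): returns a pointer to the value of x, or nullptr if absent.
    const int* search(const std::string& x) const {
        auto it = map_.find(x);
        return it == map_.end() ? nullptr : &it->second;
    }

    // Delete(x): erases x and reports whether it was present.
    bool erase(const std::string& x) { return map_.erase(x) > 0; }
};
```

After insert("teach", 12), search("teach") points at 12, as in the example above. A balanced-tree map stores every key in full alongside node pointers, which is the kind of overhead the compact structures in this part aim to avoid.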

  21. State-of-the-art DSDs
    } Based on hashing
    } Google sparse hash [Google 07]
    } Array hash [Askitis+ 05]
    } Based on tries
    } Judy [Baskins 02]
    } HAT-trie [Askitis+ 07]
    } ART [Leis+ 13]
    } Cedar [Yoshinaga+ 14]
    } DynPDT [Kanda+ 17]
    } Recent works
    } Provide theoretical results for dynamic tries, but not practical ones
    [Jansson+ 15] [Arroyuelo+ 16]
    Main result: DynPDT (new) is much smaller than all the others

  22. Novel Data Structure of DSDs
    } Dynamic Path-Decomposed Tries (DynPDT)
    } Path decomposition: Procedure to build cache-friendly tries
    } Application: Compressed SSDs (Cent-rp) [Grossi+ 14]
    } Apply it to implement DSDs with a different approach
    [Figure: a trie and the Path-Decomposed Trie (PDT) obtained from it by path decomposition]

  23. Basic Idea
    } Incrementally construct a PDT
    1. Add key technology$ to an empty dictionary
    [Figure: a single node v1 labeled technology$]

  24. Basic Idea (cont)
    } Incrementally construct a PDT
    2. Add key technics$
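    } technics$ first differs from technology$ at position 5 (counting from 0), where it has i instead of o
    } A new node v2 labeled cs$ is attached to v1 by the edge (5, i)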

  25. Basic Idea (cont)
    } Incrementally construct a PDT
    3. Add key technique$
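    } technique$ follows the edge (5, i) from v1 to v2, leaving que$
    } que$ first differs from cs$ at position 0, so a new node v3 labeled ue$ is attached to v2 by the edge (0, q)
    [Figure: the resulting Dynamic Path-Decomposed Trie (DynPDT)]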
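
The three insertion steps can be reproduced with a small pointer-based sketch. It is not the DynPDT implementation, which uses a compact trie and compact label storage; the class name SimplePDT, the std::map edge table, and the integer values are assumptions made for illustration. Every key is assumed to end with the terminator $ and to contain it nowhere else.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Pointer-based sketch of incremental PDT construction (hypothetical name).
// Assumption: every key ends with '$' and contains '$' nowhere else.
class SimplePDT {
public:
    struct Node {
        std::string label;  // part of the inserted key left after the last branch
        int value = 0;
        // Edge (i, c): the key first differs from label at position i, with character c.
        std::map<std::pair<std::size_t, char>, std::unique_ptr<Node>> children;
    };

    void insert(std::string key, int value) {
        assert(!key.empty() && key.back() == '$');
        std::unique_ptr<Node>* slot = &root_;
        while (*slot) {
            Node& node = **slot;
            if (key == node.label) { node.value = value; return; }  // already stored
            std::size_t i = first_difference(key, node.label);
            slot = &node.children[std::make_pair(i, key[i])];
            key.erase(0, i + 1);  // keep only what follows the branching character
        }
        slot->reset(new Node());
        (*slot)->label = std::move(key);
        (*slot)->value = value;
    }

    // Returns the node that stores key, or nullptr if key is absent.
    const Node* find(std::string key) const {
        const Node* node = root_.get();
        while (node && key != node->label) {
            std::size_t i = first_difference(key, node->label);
            if (i >= key.size()) return nullptr;  // key is a proper prefix of label
            auto it = node->children.find(std::make_pair(i, key[i]));
            node = (it == node->children.end()) ? nullptr : it->second.get();
            key.erase(0, i + 1);
        }
        return node;
    }

private:
    // Position of the first character at which a and b differ.
    static std::size_t first_difference(const std::string& a, const std::string& b) {
        std::size_t i = 0;
        while (i < a.size() && i < b.size() && a[i] == b[i]) ++i;
        return i;
    }

    std::unique_ptr<Node> root_;
};
```

Inserting technology$, technics$, and technique$ in that order produces nodes labeled technology$, cs$, and ue$, connected by the edges (5, i) and (0, q) as in the steps above.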

  26. How to Implement?
    } Representation of trie topology
    } Use a compact dynamic trie, m-Bonsai [Poyias+ 16]
    } The trie can be represented in (n/α)(lg σ + O(1)) bits
    } where n is the number of nodes, σ is the alphabet size, and 0 < α < 1
    } Close to the information-theoretic optimum of n(lg σ + O(1)) bits
    } Management of node labels
    } A plain implementation takes O(n lg n) bits for pointers
    } Our compact one takes O(n) bits, although the access time is O(lg n)
    [Figure: a DynPDT with node labels technology$, cs$, ue$, lly$, and cal$]

  27. Experimental Settings
    } Machine
    } Intel Xeon E5540 2.53 GHz CPU
    } 32 GB RAM (L2 cache: 1 MB, L3 cache: 8 MB)
    } Language
    } C++
    } g++ (version 5.4.0)
    } Optimization -O9
    } Datasets
    } Geographic names on the asciiname column from GeoNames dump
    } URLs of a 2005 crawl by the UbiCrawler on the .uk domain

  28. Experimental Settings (cont)
    } Based on hashing
    } Google sparse hash [Google 07]
    } Array hash [Askitis+ 05]
    } Based on tries
    } Judy [Baskins 02]
    } HAT-trie [Askitis+ 07]
    } ART [Leis+ 13]
    } Cedar [Yoshinaga+ 14]
    } DynPDT [Kanda+ 17]

  29. Experimental Results
    } Geographic names on the asciiname column from GeoNames
    } Raw size: 106.1 MiB
    } # of strings: 6,784,722
    } Ave. length: 15.6 bytes
    Structure    Space (bytes/str)   Insert (μs/str)   Search (μs/str)
    DynPDT       16.8 (base)         1.37 (base)       1.38 (base)
    Sparsehash   62.3 (3.71x)        4.31 (3.14x)      0.34 (0.24x)
    Judy         47.6 (2.83x)        0.93 (0.68x)      0.70 (0.50x)
    HAT-trie     35.4 (2.11x)        0.96 (0.70x)      0.31 (0.23x)
    ART          87.1 (5.18x)        1.07 (0.78x)      0.81 (0.59x)
    Cedar        30.5 (1.82x)        1.05 (0.76x)      0.42 (0.30x)
    Insert/Search times were measured on a million strings extracted at random

  30. Experimental Results (cont)
    } URLs of a 2005 crawl by the UbiCrawler on the .uk domain
    } Raw size: 2,855.5 MiB
    } # of strings: 39,459,925
    } Ave. length: 72.4 bytes
    Structure    Space (bytes/str)   Insert (μs/str)   Search (μs/str)
    DynPDT       28.2 (base)         2.29 (base)       2.47 (base)
    Sparsehash   131.0 (4.65x)       9.13 (3.99x)      0.67 (0.27x)
    Judy         60.3 (2.14x)        2.15 (0.94x)      2.02 (0.82x)
    HAT-trie     82.3 (2.92x)        1.63 (0.71x)      0.61 (0.25x)
    ART          140.9 (5.00x)       2.20 (0.96x)      1.84 (0.75x)
    Cedar        58.4 (2.07x)        2.56 (1.12x)      2.51 (1.02x)
    Insert/Search times were measured on a million strings extracted at random

  31. Concluding Remarks
    } Our contributions (in the presentation)
    } XCDA: Novel compressed double-array structure for static
    string dictionaries supporting fast lookup
    } DynPDT: Novel space-efficient data structure for dynamic
    string dictionaries
    } Open problems
    } We are not aware of any fast implementation supporting
    substring-based operations
    } FM-index and XBW are very slow compared to other dictionaries
    } We are not aware of any succinct dynamic string dictionaries
