InnovateData2017


Shunsuke Kanda

August 21, 2017

Transcript

  1. Practical String Dictionary Compression Using String Dictionary Encoding
    Shunsuke Kanda, Kazuhiro Morita and Masao Fuketa, Tokushima Univ., JAPAN
    3rd Innovate-Data in Prague, 21–23 Aug. 2017
  2. String Dictionaries
    - Data structure for storing a set of strings
    - Mapping between strings and unique IDs
    - With two primitive operations:
      - Lookup obtains the ID from a given string
      - Access obtains the string from a given ID
    Example dictionary: 1 and, 2 Applications, 3 Big, 4 Data, 5 Innovations
    Lookup(Big) = 3, Access(5) = Innovations
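
The two operations can be illustrated with a naive, uncompressed dictionary. This is only a sketch of the interface, not one of the implementations discussed in the talk; the class name `StringDictionary` and the case-insensitive ordering (chosen so that the IDs match the slide's example) are assumptions.

```python
class StringDictionary:
    """Naive string dictionary: IDs are 1-based ranks in sorted order."""

    def __init__(self, strings):
        # Case-insensitive order (an assumption) reproduces the slide's IDs.
        self._strings = sorted(set(strings), key=lambda s: (s.lower(), s))
        self._ids = {s: i for i, s in enumerate(self._strings, start=1)}

    def lookup(self, string):
        """Return the ID of a string, or None if it is not stored."""
        return self._ids.get(string)

    def access(self, string_id):
        """Return the string that has the given 1-based ID."""
        return self._strings[string_id - 1]


d = StringDictionary(["Big", "Data", "Innovations", "and", "Applications"])
print(d.lookup("Big"))  # ID of "Big"
print(d.access(5))      # string with ID 5
```

A hash map plus an array like this stores every string in full, which is the kind of naive layout the later slides aim to replace.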
  3. String Dictionaries
    - Application example: Text encoding
      - Encoding a text to integers using a string dictionary
      - Such an integer sequence is generally more compact than the original text
      - Arbitrary words can be directly decoded
    Example: "Big Data Innovations and Applications" is encoded as 3 4 5 1 2 using the dictionary, and decoded back using the same dictionary
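
As a rough sketch of the text-encoding application, the snippet below replaces each word with its dictionary ID and decodes a single position without touching the rest. The helper names (`build_dictionary`, `encode`, `decode`) and whitespace tokenization are assumptions made for illustration.

```python
def build_dictionary(words):
    """Return (ordered word list, word -> 1-based ID) for a set of words."""
    ordered = sorted(set(words), key=lambda w: (w.lower(), w))
    to_id = {w: i for i, w in enumerate(ordered, start=1)}
    return ordered, to_id


def encode(text, to_id):
    """Encode a whitespace-separated text as a list of word IDs (Lookup)."""
    return [to_id[w] for w in text.split()]


def decode(ids, ordered):
    """Decode a list of word IDs back into text (Access)."""
    return " ".join(ordered[i - 1] for i in ids)


text = "Big Data Innovations and Applications"
ordered, to_id = build_dictionary(text.split())
codes = encode(text, to_id)
print(codes)
print(decode([codes[2]], ordered))  # decode only the third word
```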
  4. Background
    - Space efficiency is significant in many recent applications:
      - Word lexicons in NLP and IR,
      - Management of URLs in Web graphs,
      - RDF dictionaries in the Semantic Web,
      - Gazetteers in GISs, and so on…
    - For example, recent RDF systems consider storing a 14 GB URI dataset in main memory [Mavlyutov+ 14]
      - Such management is infeasible with naïve data structures, at least on ordinary personal computers
    - Compact dictionary implementation is very important
      - Needless to say, time performance is also important
  5. Background
    - Two significant choices for efficient implementation
      - 1st choice: the implementation technique is Trie or Front Coding
      - 2nd choice: the compression strategy is RePair
    - State-of-the-art dictionary implementations using them
      - Trie with RePair [Grossi+ 14]
      - Front Coding with RePair [Martínez-Prieto+ 16]
  6. Problem and our work
    - Shortcoming of the RePair compression
      - The construction cost is very large and not practical
      - The cost can be reduced, but only by sacrificing some space efficiency
    - Our contribution
      - Proposing an alternative compression strategy that uses string dictionary encoding instead of RePair
      - Presenting new string dictionary implementations for our strategy
      - Enabling considerably faster construction, while remaining competitive in space efficiency and operation speed
  7. Trie
    - Edge-labeled tree for storing a set of strings
    - Built by merging prefixes of strings
    - Operation time depends only on the target string length
    * Specifically, this form is known as Patricia Trie
    Example strings: 1 ideal, 2 ideology, 3 tea, 4 technology, 5 tie, 6 trie
    Lookup(ideology) = 2, Access(6) = trie
    [Figure: Patricia trie of the example strings, with edge labels such as "ide", "al", "ology", "t", "e", "a", "chnology", "ie", "rie"]
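
The prefix-merging idea can be sketched with a plain character trie. This is a simplification: the slide shows a Patricia trie with multi-character edge labels, whereas the sketch below uses one character per edge. Assigning IDs as lexicographic ranks through per-node subtree counts is also an assumption, chosen because it reproduces the example IDs.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # char -> TrieNode
        self.terminal = False  # True if a stored string ends here
        self.count = 0         # number of stored strings in this subtree


def build_trie(strings):
    root = TrieNode()
    for s in set(strings):
        node = root
        node.count += 1
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
            node.count += 1
        node.terminal = True
    return root


def lookup(root, string):
    """Return the 1-based lexicographic rank of string, or None if absent."""
    node, rank = root, 0
    for ch in string:
        if node.terminal:
            rank += 1  # a stored proper prefix sorts before the query
        for c, child in node.children.items():
            if c < ch:
                rank += child.count  # whole subtrees that sort earlier
        if ch not in node.children:
            return None
        node = node.children[ch]
    return rank + 1 if node.terminal else None


def access(root, string_id):
    """Return the string whose 1-based rank is string_id, or None."""
    if not 1 <= string_id <= root.count:
        return None
    node, remaining, out = root, string_id, []
    while True:
        if node.terminal:
            if remaining == 1:
                return "".join(out)
            remaining -= 1
        for c in sorted(node.children):
            child = node.children[c]
            if remaining <= child.count:
                out.append(c)
                node = child
                break
            remaining -= child.count
        else:
            return None  # unreachable if the counts are consistent


root = build_trie(["ideal", "ideology", "tea", "technology", "tie", "trie"])
print(lookup(root, "ideology"), access(root, 6))
```

Both operations walk a single root-to-node path, so their cost is driven by the length of the target string rather than by the number of stored strings.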
  8. Path Decomposition [Ferragina+ 08]
    - Trie transformation technique to reduce the number of random accesses on search by lowering the tree height
    [Figure: the example trie and the corresponding Path-Decomposed Trie (PDT), whose node labels include "1t2e1chnology"]
  9. Compression Strategies of PDT
    - Existing [Grossi+ 14]
      - Compressing node labels using the lightweight RePair
    - Ours
      - Compressing node labels using dictionary encoding
    * The tree structure is compactly represented using DFUDS.
    [Figure: PDT node labels ("de1al", "ϵ", "ie", "e", "logy", "1t2e1chnology") replaced with dictionary IDs (A to F) by dictionary encoding]
  10. Front Coding
    - Compression technique for sorted strings
      - Dividing strings into buckets of constant size
      - Storing the header (first string) of each bucket as is
      - Replacing each internal string with the length of the longest common prefix with its predecessor, followed by the remaining suffix
    Example strings: 1 ideal, 2 ideas, 3 ideology, 4 tea, 5 techie, …
    Encoded form: ideal | 4 s | 3 ology | 0 tea | techie | … (ideal and techie are headers; the rest are internal strings)
    Access(3): decoding internal strings from the header, ideal → ideas → ideology
    * Lookup is performed by binary search for headers
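
A minimal sketch of Front Coding follows, assuming a bucket size of 4 so that it reproduces the slide's example. The function names and the tuple-based bucket layout are illustrative choices; a real implementation would pack the lengths and suffixes into a byte array.

```python
import bisect


def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def front_code(sorted_strings, bucket_size):
    """Return a list of (header, [(lcp, suffix), ...]) buckets."""
    buckets = []
    for start in range(0, len(sorted_strings), bucket_size):
        group = sorted_strings[start:start + bucket_size]
        prev, internal = group[0], []
        for s in group[1:]:
            lcp = common_prefix_len(prev, s)
            internal.append((lcp, s[lcp:]))
            prev = s
        buckets.append((group[0], internal))
    return buckets


def access(buckets, bucket_size, string_id):
    """Decode the string with the given 1-based ID, starting at its header."""
    bucket, offset = divmod(string_id - 1, bucket_size)
    current, internal = buckets[bucket]
    for lcp, suffix in internal[:offset]:
        current = current[:lcp] + suffix
    return current


def lookup(buckets, bucket_size, string):
    """Binary search over headers, then a sequential scan of one bucket."""
    headers = [h for h, _ in buckets]
    b = bisect.bisect_right(headers, string) - 1
    if b < 0:
        return None
    current, internal = buckets[b]
    if current == string:
        return b * bucket_size + 1
    for i, (lcp, suffix) in enumerate(internal):
        current = current[:lcp] + suffix
        if current == string:
            return b * bucket_size + i + 2
    return None


strings = ["ideal", "ideas", "ideology", "tea", "techie"]
buckets = front_code(strings, bucket_size=4)
print(buckets[0])
print(access(buckets, 4, 3), lookup(buckets, 4, "tea"))
```

The bucket size trades space for time: larger buckets store fewer full headers but require more sequential decoding per operation.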
  11. Compression Strategies of Front Coding
    - Existing [Martínez-Prieto+ 16]
      - Compressing internal strings using the original RePair
    - Ours
      - Compressing internal strings using dictionary encoding
    Example: ideal | 4 s | 3 ology | 0 tea | techie | … becomes ideal | 4 C | 3 B | 0 D | techie | … after dictionary encoding of the internal strings
  12. Auxiliary String Dictionaries (ASDs)
    - For our compression strategies
      - Only Access is required for decoding; Lookup is not needed
      - Operation speed is especially important because an ASD is called multiple times for each Lookup/Access of the original string dictionary
      - The target strings for compression have many similar suffixes because both Trie and Front Coding are built by merging prefixes
    Example ASD: A nology, B ology, C s, D tea
    Decoding ideal | 4 C | 3 B | 0 D | techie | 4 A using Access on the ASD: C => s, B => ology, D => tea
    Three types of ASDs are presented
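
The decoding path can be sketched as follows: the front-coded buckets store suffix IDs, and every decoding step calls Access on the ASD. Here the ASD is just a Python list, standing in for any of the three structures presented next, and integer IDs 0 to 3 play the role of A to D. The bucket contents follow the slide's example; the bucket size of 4 and the function name are assumptions.

```python
# ASD: suffix ID -> suffix (IDs 0..3 correspond to A..D on the slide).
asd = ["nology", "ology", "s", "tea"]

# Front-coded buckets whose suffixes were replaced by ASD IDs:
# (header, [(length of common prefix with predecessor, suffix ID), ...])
encoded = [
    ("ideal", [(4, 2), (3, 1), (0, 3)]),  # ideal | 4 C | 3 B | 0 D
    ("techie", [(4, 0)]),                 # techie | 4 A
]


def access(encoded, asd, bucket_size, string_id):
    """Decode the string with the given 1-based ID using only ASD Access."""
    bucket, offset = divmod(string_id - 1, bucket_size)
    current, internal = encoded[bucket]
    for lcp, suffix_id in internal[:offset]:
        current = current[:lcp] + asd[suffix_id]  # one ASD Access per step
    return current


print([access(encoded, asd, 4, i) for i in range(1, 7)])
```

Because one dictionary operation may trigger several ASD accesses, the ASD's Access speed directly affects the speed of the whole dictionary.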
  13. 1. TAIL: Plain Concatenation and Sharing
    - Data structure
      - Concatenating strings with the special terminator '$'
      - Shareable strings are merged, such as 'nology' and 'ology'
      - The starting position of each string serves as its ID
    - Features
      - The decoding speed is the fastest
      - Space efficiency is low compared to the other ASDs
    Example strings: A ie, B nology, C ology, D rie, E tea
    Concatenated form: rie$nology$tea$ (D points to 'r', A to the 'i' of rie, B to 'n', C to the first 'o' of nology, E to 't')
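
A possible way to build such a concatenation is sketched below. Strings are visited in descending order of their reversed form, so a string that is a suffix of another is always visited after a string that contains it and can point into the text already written. This ordering is an assumption for the sketch; it produces a different layout from the slide's figure (nology first rather than rie) but the same amount of sharing. Strings are assumed to be non-empty and free of '$'.

```python
def build_tail(strings, terminator="$"):
    """Return (concatenated text, {string: starting position})."""
    parts, positions = [], {}
    last, last_start, length = "", 0, 0
    for s in sorted(set(strings), key=lambda x: x[::-1], reverse=True):
        if last.endswith(s):
            # s is a suffix of the most recently written string: share it.
            positions[s] = last_start + len(last) - len(s)
        else:
            positions[s] = length
            parts.append(s + terminator)
            last, last_start = s, length
            length += len(s) + 1
    return "".join(parts), positions


def access(text, position, terminator="$"):
    """Decode the string starting at position, up to the next terminator."""
    return text[position:text.index(terminator, position)]


text, positions = build_tail(["ie", "nology", "ology", "rie", "tea"])
print(text, positions)
print(access(text, positions["ology"]))
```

Decoding is a single scan up to the next terminator, which is why this layout decodes quickly, while every unshared character is still stored verbatim.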
  14. 2. Reverse Path-Decomposed Trie (RPDT)
    - Reverse Trie
      - Merging suffixes of strings in reverse
      - Strings are decoded by traversing leaf-to-root paths
    - RPDT
      - Result of path decomposition of Reverse Trie
      - High space efficiency while maintaining cache efficiency of path decomposition
    Example strings: A ie, B itie, C nology, D ology, E rie, F tea
    [Figure: Reverse Trie of the example strings and the corresponding RPDT]
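
The leaf-to-root decoding can be sketched with an uncompressed reverse trie that stores one character per node plus a parent pointer. Path decomposition and the compact encoding of RPDT are deliberately left out; the array layout and function names are assumptions.

```python
def build_reverse_trie(strings):
    """Insert each string back to front; return (parent, label, node_of)."""
    parent, label, children = [-1], [""], [{}]  # node 0 is the root
    node_of = {}
    for s in strings:
        node = 0
        for ch in reversed(s):
            nxt = children[node].get(ch)
            if nxt is None:
                nxt = len(parent)
                parent.append(node)
                label.append(ch)
                children.append({})
                children[node][ch] = nxt
            node = nxt
        node_of[s] = node  # node holding the first character of s
    return parent, label, node_of


def access(parent, label, node):
    """Walk from a node up to the root; the labels spell the string forward."""
    out = []
    while node != 0:
        out.append(label[node])
        node = parent[node]
    return "".join(out)


strings = ["ie", "itie", "nology", "ology", "rie", "tea"]
parent, label, node_of = build_reverse_trie(strings)
print(len(parent) - 1, "nodes for", sum(map(len, strings)), "characters")
print(access(parent, label, node_of["itie"]))
```

Since the trie is built over reversed strings, walking upward from a node emits the characters in their original left-to-right order, so no final reversal is needed.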
  15. 3. Back Coding
    - Suffix version of Front Coding
      - Suffixes, instead of prefixes, are replaced with integers
    - Fast Back Coding (FBC)
      - Using differences with headers, instead of predecessors
      - Unnecessary to decode internal strings other than the target string
      - The number of memory copies never exceeds 2
      - The decoding speed is very fast
    Example strings: A ide, B ie, C rie, D itie, E otogy
    Back Coding: ide | i 1 | r 2 | it 2 | otogy
    FBC: ide | i 1 | ri 1 | iti 1 | otogy
    Access(D) with FBC: skip B and C, then output "iti" + "e" (the last character of the header)
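
The FBC variant can be sketched as below, assuming the input is already ordered so that strings with similar suffixes are adjacent (here, sorted by reversed string) and using a bucket size of 4 to match the example. The function names, 0-based indices and tuple layout are illustrative.

```python
def common_suffix_len(a, b):
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


def fbc_encode(strings, bucket_size):
    """Return (header, [(prefix, shared suffix length with header), ...])."""
    buckets = []
    for start in range(0, len(strings), bucket_size):
        group = strings[start:start + bucket_size]
        header, internal = group[0], []
        for s in group[1:]:
            k = common_suffix_len(header, s)  # compared with the header
            internal.append((s[:len(s) - k], k))
        buckets.append((header, internal))
    return buckets


def fbc_access(buckets, bucket_size, index):
    """Decode the string at a 0-based index with at most two copies."""
    bucket, offset = divmod(index, bucket_size)
    header, internal = buckets[bucket]
    if offset == 0:
        return header
    prefix, k = internal[offset - 1]
    return prefix + header[len(header) - k:]  # own prefix + header suffix


strings = ["ide", "ie", "rie", "itie", "otogy"]  # A..E on the slide
buckets = fbc_encode(strings, bucket_size=4)
print(buckets[0])
print(fbc_access(buckets, 4, 3))  # Access(D)
```

Measuring the shared suffix against the header rather than the predecessor usually shares fewer characters, but it removes the chain of dependent decodes: any internal string is rebuilt from its own prefix and one slice of the header.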
  16. Settings
    - Machine
      - Intel Xeon E5540 @2.53 GHz, running Ubuntu Server 16.04
      - 32 GB of RAM (L2 cache: 1 MB, L3 cache: 8 MB)
    - Code
      - C++11 compiled with g++
      - Optimization option is -O9 (fastest and smallest)
    - Datasets
      - All page titles from English Wikipedia
        - 227.2 MiB, 11 519 354 strings
      - URLs of a crawl by the UbiCrawler on the .uk domain
        - 2 723.3 MiB, 39 459 925 strings
  17. Settings
    - Dictionary implementations
      - Base structures: PDT, Front Coding (bucket size is 8)
      - Compression using RePair (baseline)
      - Compression using TAIL, RPDT or FBC (bucket size is 4) (ours)
    Which combinations provide high performance in time and space?
  18. Results for PDT
    Dataset  Method  Construction (sec)  Compression (%)  Lookup (µs / str)  Access (µs / ID)
    Wiki     RePair  62.1                31.6             2.59               2.50
    Wiki     TAIL    8.3 (0.13x)         41.7 (1.32x)     1.72 (0.66x)       1.76 (0.70x)
    Wiki     RPDT    11.8 (0.19x)        32.4 (1.03x)     2.40 (0.93x)       2.26 (0.90x)
    Wiki     FBC     10.6 (0.17x)        37.0 (1.17x)     2.29 (0.88x)       2.38 (0.95x)
    URLs     RePair  437.2               17.5             4.00               3.98
    URLs     TAIL    55.7 (0.13x)        20.7 (1.18x)     2.59 (0.65x)       2.94 (0.74x)
    URLs     RPDT    58.6 (0.13x)        16.4 (0.94x)     4.16 (1.04x)       3.91 (0.98x)
    URLs     FBC     53.5 (0.12x)        17.3 (0.99x)     3.24 (0.81x)       3.55 (0.89x)
    * The smaller these results are, the better; ratios in parentheses are relative to RePair
    Construction: up to 8.2x faster. Compression: up to 1.3x bigger. Lookup: up to 1.5x faster. Access: up to 1.4x faster.
  19. Results for Front Coding
    Dataset  Method  Construction (sec)  Compression (%)  Lookup (µs / str)  Access (µs / ID)
    Wiki     RePair  667.2               36.5             2.63               1.59
    Wiki     TAIL    6.1 (0.009x)        45.3 (1.24x)     1.51 (0.57x)       0.66 (0.42x)
    Wiki     RPDT    9.6 (0.014x)        38.7 (1.06x)     2.23 (0.85x)       1.21 (0.76x)
    Wiki     FBC     8.6 (0.013x)        42.7 (1.17x)     2.00 (0.76x)       1.09 (0.69x)
    URLs     RePair  11835.2             22.3             4.28               2.41
    URLs     TAIL    28.0 (0.002x)       31.4 (1.41x)     2.32 (0.54x)       0.77 (0.32x)
    URLs     RPDT    50.0 (0.004x)       27.1 (1.22x)     3.64 (0.85x)       1.73 (0.72x)
    URLs     FBC     43.3 (0.004x)       28.1 (1.26x)     3.02 (0.71x)       1.47 (0.61x)
    * The smaller these results are, the better; ratios in parentheses are relative to RePair
    Construction: up to 422x faster. Compression: up to 1.4x bigger. Lookup: up to 1.8x faster. Access: up to 3.1x faster.
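
The "up to … faster" construction figures appear to be the RePair time divided by the best time among the proposed methods. The short check below recomputes them from the construction times reported on the two result slides; the dictionary layout and function name are only for illustration.

```python
# Construction times in seconds, copied from the two result slides.
construction_sec = {
    "PDT": {
        "Wiki": {"RePair": 62.1, "TAIL": 8.3, "RPDT": 11.8, "FBC": 10.6},
        "URLs": {"RePair": 437.2, "TAIL": 55.7, "RPDT": 58.6, "FBC": 53.5},
    },
    "Front Coding": {
        "Wiki": {"RePair": 667.2, "TAIL": 6.1, "RPDT": 9.6, "FBC": 8.6},
        "URLs": {"RePair": 11835.2, "TAIL": 28.0, "RPDT": 50.0, "FBC": 43.3},
    },
}


def best_speedup(datasets):
    """Largest RePair-to-ours construction time ratio over all datasets."""
    return max(
        times["RePair"] / t
        for times in datasets.values()
        for method, t in times.items()
        if method != "RePair"
    )


speedups = {name: best_speedup(d) for name, d in construction_sec.items()}
for name, value in speedups.items():
    print(f"{name}: up to {value:.1f}x faster construction")
```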
  20. Summary
    - Our contributions
      - Proposing an alternative compression strategy that uses string dictionary encoding instead of RePair
      - Enabling considerably faster construction, while remaining competitive in space efficiency and operation speed
        - Lightweight reconstruction becomes possible in many applications
    - Future work
      - Comparing ours with other text compression techniques, such as Huffman Coding and Online Grammar Compression
      - Applying the string dictionary encoding to other data structures for string processing
  21. Thank you for your attention!
    My English skills are limited. Please speak slowly if you have any questions.
  22. 2. Implementation of RPDT
    - Construction procedure
      - Serializing the node and edge labels in breadth-first order
      - Giving each edge label a pointer to its parent
      - The pointers become monotonically increasing integers
      - Elias-Fano codes can represent them in compact space
    - Features
      - The space usage is close to the theoretical lower bound
      - while supporting cache-friendly traversal thanks to path decomposition
    [Figure: serialized RPDT labels with parent pointers for the example strings; Access(B) = itie]
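
To illustrate why monotone pointers compress well, here is a minimal Elias-Fano sketch. Each value is split into low bits, stored verbatim, and high bits, stored in a unary-coded bit sequence. The pointer values below are hypothetical, plain Python lists stand in for packed bit vectors, and the linear scan stands in for the constant-time select structure a real implementation would use.

```python
def elias_fano_encode(values, universe):
    """Encode a non-decreasing list of integers in [0, universe)."""
    n = len(values)
    # floor(log2(universe / n)) low bits per value, at least 0.
    low_bits = max(0, (universe // n).bit_length() - 1) if n else 0
    mask = (1 << low_bits) - 1
    lows = [v & mask for v in values]
    upper = [0] * (n + (universe >> low_bits) + 1)
    for i, v in enumerate(values):
        upper[(v >> low_bits) + i] = 1  # i-th one bit encodes the high part
    return low_bits, lows, upper


def elias_fano_access(low_bits, lows, upper, i):
    """Recover the i-th value (0-based) by locating the i-th one bit."""
    ones = 0
    for pos, bit in enumerate(upper):
        if bit:
            if ones == i:
                return ((pos - i) << low_bits) | lows[i]
            ones += 1
    raise IndexError(i)


pointers = [3, 4, 7, 13, 14, 15, 21, 43]  # hypothetical parent pointers
low_bits, lows, upper = elias_fano_encode(pointers, universe=64)
decoded = [elias_fano_access(low_bits, lows, upper, i) for i in range(len(pointers))]
print(low_bits, decoded)
print(len(pointers) * low_bits + len(upper), "bits in this toy layout")
```

With n values below a universe u, this layout takes roughly n * floor(log2(u / n)) + 2n bits, which is how the pointer array can stay near the information-theoretic minimum for a monotone sequence.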