InnovateData2017 - Speaker Deck

Slide 1

Slide 1 text

Practical String Dictionary Compression Using String Dictionary Encoding
Shunsuke Kanda, Kazuhiro Morita and Masao Fuketa
Tokushima Univ., Japan
3rd Innovate-Data in Prague, 21–23 Aug. 2017

Slide 2

Slide 2 text

String Dictionaries
- Data structure for storing a set of strings
- Maps between strings and unique IDs
- Two primitive operations:
  - Lookup obtains the ID of a given string
  - Access obtains the string for a given ID
Example: and = 1, Applications = 2, Big = 3, Data = 4, Innovations = 5
Lookup(Big) = 3, Access(5) = Innovations
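
A minimal sketch of the two operations, using a plain list and a hash map rather than any of the compressed structures in this talk. The class name NaiveStringDictionary and its method names are assumptions made for illustration.

```python
class NaiveStringDictionary:
    """Uncompressed string dictionary with 1-based IDs (illustrative only)."""

    def __init__(self, strings):
        # IDs follow the order in which the strings are supplied.
        self._by_id = list(strings)
        self._to_id = {s: i for i, s in enumerate(self._by_id, start=1)}

    def lookup(self, string):
        """Return the ID of `string`, or None if it is not stored."""
        return self._to_id.get(string)

    def access(self, string_id):
        """Return the string with ID `string_id`, or None if out of range."""
        if 1 <= string_id <= len(self._by_id):
            return self._by_id[string_id - 1]
        return None


dictionary = NaiveStringDictionary(
    ["and", "Applications", "Big", "Data", "Innovations"]
)
big_id = dictionary.lookup("Big")    # 3, as on the slide
fifth = dictionary.access(5)         # "Innovations", as on the slide
```

This layout stores every string in full twice (once per direction), which is exactly the kind of naive representation the compact implementations aim to replace.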

Slide 3

Slide 3 text

String Dictionaries
- Application example: text encoding
  - A text is encoded as a sequence of integers using a string dictionary
  - Such an integer sequence is generally more compact than the original text
  - Arbitrary words can be decoded directly
Example: "Big Data Innovations and Applications" is encoded as 3 4 5 1 2 using the dictionary, and decoded back to the original text using the same dictionary
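
The encoding example can be sketched as follows, assuming whitespace-separated words and the word-to-ID assignment from the previous slide. The helper names encode and decode are hypothetical.

```python
# Word-to-ID assignment taken from the slide example (1-based IDs).
WORDS = ["and", "Applications", "Big", "Data", "Innovations"]
WORD_TO_ID = {word: i for i, word in enumerate(WORDS, start=1)}


def encode(text):
    """Replace every whitespace-separated word by its dictionary ID (Lookup)."""
    return [WORD_TO_ID[word] for word in text.split()]


def decode(ids):
    """Map IDs back to words (Access) and rejoin them with spaces."""
    return " ".join(WORDS[i - 1] for i in ids)


encoded = encode("Big Data Innovations and Applications")  # [3, 4, 5, 1, 2]
third_word = WORDS[encoded[2] - 1]  # decode one position only: "Innovations"
```

Because each position holds an independent ID, a single word can be decoded without touching the rest of the sequence.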

Slide 4

Slide 4 text

Background
- Space efficiency matters in many recent applications:
  - Word lexicons in NLP and IR,
  - Management of URLs in Web graphs,
  - RDF dictionaries in the Semantic Web,
  - Gazetteers in GISs, and so on
- For example, recent RDF systems consider storing a 14 GB URI dataset in main memory [Mavlyutov+ 14]
  - This is infeasible with naïve data structures, at least on ordinary personal computers
- Compact dictionary implementation is therefore very important
  - Needless to say, time performance is also important

Slide 5

Slide 5 text

Background
- Two significant choices for efficient implementation
  - 1st choice: the implementation technique, Trie or Front Coding
  - 2nd choice: the compression strategy, RePair
- State-of-the-art dictionary implementations using them
  - Trie with RePair [Grossi+ 14]
  - Front Coding with RePair [Martínez-Prieto+ 16]

Slide 6

Slide 6 text

Problem and our work
- Shortcoming of RePair compression
  - The construction cost is very high, which makes it impractical
  - The cost can be reduced, but only by sacrificing some space efficiency
- Our contribution
  - Proposing an alternative compression strategy that uses string dictionary encoding instead of RePair
  - Presenting new string dictionary implementations for our strategy
  - Enabling considerably faster construction while remaining competitive in space efficiency and operation speed

Slide 7

Slide 7 text

String Dictionary Implementations and Compression Strategies
Trie- and Front-Coding-based implementations
Old and new compression strategies

Slide 8

Slide 8 text

Trie
- Edge-labeled tree for storing a set of strings
- Built by merging the common prefixes of strings
- Operation time depends only on the length of the target string
* Specifically, this form is known as a Patricia trie
[Figure: Patricia trie over 1 ideal, 2 ideology, 3 tea, 4 technology, 5 tie, 6 trie, with edge labels ide, al, ology, t, e, a, chnology, ie, rie]
Lookup(ideology) = 2, Access(6) = trie
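
A rough sketch of Lookup on a trie, assuming IDs are assigned in lexicographic order as in the figure. For brevity this is a plain character-per-edge trie, not the Patricia form shown on the slide, and the names TrieNode, build_trie and lookup are hypothetical.

```python
class TrieNode:
    """One trie node; `string_id` is set only where a stored string ends."""

    def __init__(self):
        self.children = {}     # maps one character to the child node
        self.string_id = None


def build_trie(strings):
    """Insert the strings in sorted order, giving them IDs 1, 2, 3, ..."""
    root = TrieNode()
    for string_id, string in enumerate(sorted(strings), start=1):
        node = root
        for ch in string:
            node = node.children.setdefault(ch, TrieNode())
        node.string_id = string_id
    return root


def lookup(root, string):
    """Follow one edge per character; the cost depends only on len(string)."""
    node = root
    for ch in string:
        node = node.children.get(ch)
        if node is None:
            return None
    return node.string_id


trie = build_trie(["ideal", "ideology", "tea", "technology", "tie", "trie"])
ideology_id = lookup(trie, "ideology")  # 2, matching Lookup(ideology) = 2
```

The loop in lookup visits at most one node per character, which is why the operation time is independent of the number of stored strings.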

Slide 9

Slide 9 text

Path Decomposition [Ferragina+ 08]
- Trie transformation technique that reduces the number of random accesses during search by lowering the tree height
[Figure: the trie from the previous slide and its Path-Decomposed Trie (PDT), whose root node holds the label 1t2e1chnology]

Slide 10

Slide 10 text

Compression Strategies of PDT
- Existing [Grossi+ 14]
  - Compressing node labels using the lightweight RePair
- Ours
  - Compressing node labels using dictionary encoding
* The tree structure is compactly represented using DFUDS.
[Figure: PDT node labels (1t2e1chnology, de1al, ie, e, logy, ϵ) replaced by the codes A to F through dictionary encoding]

Slide 11

Slide 11 text

Front Coding
- Compression technique for sorted strings
  - Dividing the strings into buckets of constant size
  - Storing the header (first string) of each bucket as-is
  - Replacing each internal string with the length of its longest common prefix with its predecessor, plus the remaining suffix
Example: 1 ideal, 2 ideas, 3 ideology, 4 tea, 5 techie, ... is stored as
  [ideal] (4, s) (3, ology) (0, tea) [techie] ...
where bracketed strings are headers and the pairs are internal strings
Access(3) decodes the internal strings from the header: ideal, ideas, ideology
* Lookup is performed by binary search over the headers
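
The bucket layout and the Access walk can be sketched as below. This is an illustrative version that keeps Python tuples instead of a packed byte stream; front_encode, front_access and lcp_len are hypothetical names, and a bucket size of 4 is assumed to match the slide example.

```python
def lcp_len(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def front_encode(sorted_strings, bucket_size):
    """Return a list of (header, [(lcp, suffix), ...]) buckets."""
    buckets = []
    for start in range(0, len(sorted_strings), bucket_size):
        chunk = sorted_strings[start:start + bucket_size]
        header, internal, prev = chunk[0], [], chunk[0]
        for string in chunk[1:]:
            k = lcp_len(prev, string)
            internal.append((k, string[k:]))  # shared prefix length + rest
            prev = string
        buckets.append((header, internal))
    return buckets


def front_access(buckets, bucket_size, string_id):
    """Decode the string with the given 1-based ID."""
    bucket, offset = divmod(string_id - 1, bucket_size)
    header, internal = buckets[bucket]
    string = header
    # Each internal string depends on its predecessor, so decode in order.
    for k, suffix in internal[:offset]:
        string = string[:k] + suffix
    return string


buckets = front_encode(["ideal", "ideas", "ideology", "tea", "techie"], 4)
third = front_access(buckets, 4, 3)  # "ideology", via ideal -> ideas
```

Larger buckets store fewer full headers but lengthen the sequential decode inside a bucket, which is the usual space/time trade-off of this scheme.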

Slide 12

Slide 12 text

Compression Strategies of Front Coding
- Existing [Martínez-Prieto+ 16]
  - Compressing internal strings using the original RePair
- Ours
  - Compressing internal strings using dictionary encoding
Example: the suffixes of the internal strings are replaced by codes
  before: [ideal] (4, s) (3, ology) (0, tea) [techie] ...
  after:  [ideal] (4, C) (3, B) (0, D) [techie] ...

Slide 13

Slide 13 text

Auxiliary String Dictionaries
String dictionaries for our strategy

Slide 14

Slide 14 text

Auxiliary String Dictionaries (ASDs)
- For our compression strategies
  - Only Access is required for decoding; Lookup is not needed
  - Operation speed is especially important because an ASD is called multiple times for each Lookup/Access on the original string dictionary
  - The strings to be compressed have many similar suffixes, because both Trie and Front Coding are built by merging prefixes
Example: ASD with A = nology, B = ology, C = s, D = tea, used to decode
  [ideal] (4, C) (3, B) (0, D) [techie] (4, A)
  Access(C) => s, Access(B) => ology, Access(D) => tea
- Three types of ASDs are presented
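
To illustrate why the ASD only needs Access, the sketch below decodes the two front-coded buckets from the example, with a plain Python dict standing in for the ASD. The names asd_access and decode_bucket are hypothetical, and a real ASD would be one of the compact structures on the following slides.

```python
# Stand-in for the auxiliary string dictionary: code -> suffix.
ASD = {"A": "nology", "B": "ology", "C": "s", "D": "tea"}


def asd_access(code):
    """Access on the ASD: obtain the suffix string for a code."""
    return ASD[code]


def decode_bucket(header, internal):
    """Decode every string in a bucket of (lcp, code) pairs."""
    decoded = [header]
    for lcp, code in internal:
        # One ASD Access per internal string that has to be rebuilt.
        decoded.append(decoded[-1][:lcp] + asd_access(code))
    return decoded


first_bucket = decode_bucket("ideal", [(4, "C"), (3, "B"), (0, "D")])
second_bucket = decode_bucket("techie", [(4, "A")])
```

Decoding the last string of a bucket triggers one ASD Access per preceding internal string, which is why the ASD's decoding speed dominates the cost of the outer dictionary's operations.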

Slide 15

Slide 15 text

1. TAIL: Plain Concatenation and Sharing
- Data structure
  - Strings are concatenated, each followed by the special terminator '$'
  - A string that is a suffix of another shares its storage, e.g. 'ology' is stored inside 'nology'
  - The start position of each string serves as its ID
- Features
  - The decoding speed is the fastest
  - Space efficiency is low compared with the other ASDs
Example: A ie, B nology, C ology, D rie, E tea are stored as
  r i e $ n o l o g y $ t e a $
with D starting at 'r', A at 'i', B at 'n', C at the first 'o', and E at 't'
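
A small sketch of the TAIL idea, assuming that no stored string contains the terminator. The construction below uses a simple quadratic suffix check and places longer strings first, so the resulting layout differs from the one drawn on the slide; build_tail and tail_access are hypothetical names.

```python
def build_tail(strings, terminator="$"):
    """Concatenate strings, letting suffixes share storage.

    Returns the text buffer and a map from each string to its ID,
    where the ID is the start position inside the buffer.
    """
    buffer = ""
    placed = []  # (string, start position) for strings stored in full
    ids = {}
    # Longest first, so a string is placed before any of its suffixes.
    for string in sorted(set(strings), key=lambda s: (-len(s), s)):
        for longer, start in placed:
            if longer.endswith(string):
                # Share storage: point into the tail of the longer string.
                ids[string] = start + len(longer) - len(string)
                break
        else:
            ids[string] = len(buffer)
            placed.append((string, len(buffer)))
            buffer += string + terminator
    return buffer, ids


def tail_access(buffer, string_id, terminator="$"):
    """Read characters from the ID position up to the next terminator."""
    end = buffer.index(terminator, string_id)
    return buffer[string_id:end]


buffer, ids = build_tail(["ie", "nology", "ology", "rie", "tea"])
# buffer is "nology$rie$tea$": 15 characters instead of 24 without sharing
```

Access is a single contiguous read, which is consistent with TAIL being the fastest ASD, while any string that is not a suffix of another is stored in full.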

Slide 16

Slide 16 text

2. Reverse Path-Decomposed Trie (RPDT)
- Reverse Trie
  - Built by merging the suffixes of strings, read in reverse
  - Strings are decoded by traversing leaf-to-root paths
- RPDT
  - The result of applying path decomposition to the Reverse Trie
  - High space efficiency while keeping the cache efficiency of path decomposition
[Figure: Reverse Trie and RPDT for A ie, B itie, C nology, D ology, E rie, F tea]

Slide 17

Slide 17 text

3. Back Coding
- Suffix version of Front Coding
  - Shared suffixes, rather than prefixes, are replaced with integers
- Fast Back Coding (FBC)
  - Uses differences from the bucket header instead of from the predecessor
  - There is no need to decode internal strings other than the target string
  - The number of memory copies is never more than 2
  - The decoding speed is very fast
Example: A ide, B ie, C rie, D itie, E otogy
  Back Coding: [ide] (i, 1) (r, 2) (it, 2) [otogy]
  FBC:         [ide] (i, 1) (ri, 1) (iti, 1) [otogy]
  Access(D) with FBC: skip B and C, then iti + e = itie
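
The FBC variant can be sketched as follows, under the assumption that strings are ordered by their reversed form and that each internal entry stores its unshared prefix together with the length of the suffix it shares with the bucket header. The names fbc_encode and fbc_access are hypothetical, and positions are 0-based here.

```python
def common_suffix_len(a, b):
    """Length of the longest common suffix of a and b."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


def fbc_encode(strings, bucket_size):
    """Return (header, [(prefix, shared_suffix_len), ...]) buckets."""
    ordered = sorted(strings, key=lambda s: s[::-1])  # sort by reversed form
    buckets = []
    for start in range(0, len(ordered), bucket_size):
        chunk = ordered[start:start + bucket_size]
        header = chunk[0]
        internal = []
        for string in chunk[1:]:
            k = common_suffix_len(header, string)  # compare with the header
            internal.append((string[:len(string) - k], k))
        buckets.append((header, internal))
    return buckets


def fbc_access(buckets, bucket_size, position):
    """Decode the string at a 0-based position without touching siblings."""
    bucket, offset = divmod(position, bucket_size)
    header, internal = buckets[bucket]
    if offset == 0:
        return header
    prefix, k = internal[offset - 1]
    # At most two copies: the stored prefix, then the header's last k chars.
    return prefix + header[len(header) - k:]


fbc = fbc_encode(["ide", "ie", "rie", "itie", "otogy"], 4)
d_string = fbc_access(fbc, 4, 3)  # "iti" + "e" = "itie"
```

Since every entry refers only to the header, decoding jumps straight to the target entry. The price is that shared suffixes are typically shorter than with predecessor-based Back Coding, as the (ri, 1) versus (r, 2) entries show.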

Slide 18

Slide 18 text

Experiments and Summary
Results of PDT and Front Coding

Slide 19

Slide 19 text

Settings
- Machine
  - Intel Xeon E5540 @ 2.53 GHz, running Ubuntu Server 16.04
  - 32 GB of RAM (L2 cache: 1 MB, L3 cache: 8 MB)
- Code
  - C++11 compiled with g++
  - Optimization option is -O9 (fastest and smallest)
- Datasets
  - All page titles from English Wikipedia
    - 227.2 MiB, 11 519 354 strings
  - URLs of a crawl by the UbiCrawler on the .uk domain
    - 2 723.3 MiB, 39 459 925 strings

Slide 20

Slide 20 text

Settings
- Dictionary implementations: PDT or Front Coding (bucket size is 8), each using one of
  - RePair (baseline)
  - TAIL, RPDT, or FBC (bucket size is 4) (ours)
- Which combinations provide high performance in time and space?

Slide 21

Slide 21 text

Results for PDT
Each cell shows the measured value, followed in parentheses by its ratio to the RePair baseline.

Dataset  Method  Construction (sec)  Compression (%)  Lookup (µs/str)  Access (µs/ID)
Wiki     RePair   62.1 (–)           31.6 (–)         2.59 (–)         2.50 (–)
Wiki     TAIL      8.3 (0.13x)       41.7 (1.32x)     1.72 (0.66x)     1.76 (0.70x)
Wiki     RPDT     11.8 (0.19x)       32.4 (1.03x)     2.40 (0.93x)     2.26 (0.90x)
Wiki     FBC      10.6 (0.17x)       37.0 (1.17x)     2.29 (0.88x)     2.38 (0.95x)
URLs     RePair  437.2 (–)           17.5 (–)         4.00 (–)         3.98 (–)
URLs     TAIL     55.7 (0.13x)       20.7 (1.18x)     2.59 (0.65x)     2.94 (0.74x)
URLs     RPDT     58.6 (0.13x)       16.4 (0.94x)     4.16 (1.04x)     3.91 (0.98x)
URLs     FBC      53.5 (0.12x)       17.3 (0.99x)     3.24 (0.81x)     3.55 (0.89x)

* The smaller these results are, the better
Relative to RePair: construction up to 8.2x faster, size up to 1.3x bigger, Lookup up to 1.5x faster, Access up to 1.4x faster

Slide 22

Slide 22 text

Results for Front Coding
Each cell shows the measured value, followed in parentheses by its ratio to the RePair baseline.

Dataset  Method  Construction (sec)  Compression (%)  Lookup (µs/str)  Access (µs/ID)
Wiki     RePair    667.2 (–)         36.5 (–)         2.63 (–)         1.59 (–)
Wiki     TAIL        6.1 (0.009x)    45.3 (1.24x)     1.51 (0.57x)     0.66 (0.42x)
Wiki     RPDT        9.6 (0.014x)    38.7 (1.06x)     2.23 (0.85x)     1.21 (0.76x)
Wiki     FBC         8.6 (0.013x)    42.7 (1.17x)     2.00 (0.76x)     1.09 (0.69x)
URLs     RePair  11835.2 (–)         22.3 (–)         4.28 (–)         2.41 (–)
URLs     TAIL       28.0 (0.002x)    31.4 (1.41x)     2.32 (0.54x)     0.77 (0.32x)
URLs     RPDT       50.0 (0.004x)    27.1 (1.22x)     3.64 (0.85x)     1.73 (0.72x)
URLs     FBC        43.3 (0.004x)    28.1 (1.26x)     3.02 (0.71x)     1.47 (0.61x)

* The smaller these results are, the better
Relative to RePair: construction up to 422x faster, size up to 1.4x bigger, Lookup up to 1.8x faster, Access up to 3.1x faster
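
As a quick check of how the parenthesized ratios and the "up to" factors relate, the sketch below recomputes them for the Front Coding construction times, using only the RePair and TAIL values reported on this slide. The variable names are hypothetical.

```python
# Construction times in seconds for Front Coding, copied from the slide.
construction_sec = {
    "Wiki": {"RePair": 667.2, "TAIL": 6.1},
    "URLs": {"RePair": 11835.2, "TAIL": 28.0},
}

ratios = {}    # method time / baseline time, the "(0.009x)" style figures
speedups = {}  # baseline time / method time, the "Nx faster" style figures
for dataset, times in construction_sec.items():
    ratios[dataset] = times["TAIL"] / times["RePair"]
    speedups[dataset] = times["RePair"] / times["TAIL"]

best_speedup = max(speedups.values())  # about 422.7, the "up to 422x faster"
```

The two figures are reciprocals of each other, so a ratio of about 0.002x on the URLs dataset corresponds to a speed-up of roughly 422x.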

Slide 23

Slide 23 text

Summary
- Our contributions
  - Proposing an alternative compression strategy that uses string dictionary encoding instead of RePair
  - Enabling considerably faster construction while remaining competitive in space efficiency and operation speed
  - Lightweight reconstruction thus becomes possible in many applications
- Future work
  - Comparing our approach with other text compression techniques, such as Huffman Coding and Online Grammar Compression
  - Applying string dictionary encoding to other data structures for string processing

Slide 24

Slide 24 text

Thank you for your attention! My English skills are limited. Please speak slowly if you have any questions.

Slide 25

Slide 25 text

2. Implementation of RPDT
- Construction procedure
  - Serialize the node and edge labels in breadth-first order
  - Give each edge label a pointer to its parent
  - The pointers form a monotonically increasing integer sequence
  - Elias-Fano codes can represent them in compact space
- Features
  - Space usage is close to the theoretical lower bound
  - while supporting cache-friendly traversal thanks to path decomposition
[Figure: serialized RPDT labels with parent pointers; Access(B) = itie]
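
The Elias-Fano step can be sketched as below for a non-decreasing integer sequence. This is a simplified illustration, not the talk's implementation: bits are held in Python lists, the select operation is a linear scan where practical libraries use a constant-time select structure, and the names ef_encode and ef_access are hypothetical.

```python
def ef_encode(values, universe):
    """Elias-Fano encode a non-decreasing list with 0 <= value < universe."""
    n = len(values)
    # Number of low bits: floor(log2(universe / n)), or 0 when n >= universe.
    low_bits = max(0, (universe // n).bit_length() - 1) if n else 0
    mask = (1 << low_bits) - 1
    lows = [v & mask for v in values]          # low parts, stored verbatim
    upper = [0] * (n + (universe >> low_bits) + 1)
    for i, v in enumerate(values):
        upper[(v >> low_bits) + i] = 1         # high part i: bit at high + i
    return low_bits, lows, upper


def ef_access(encoded, i):
    """Return the i-th value (0-based) from the encoded sequence."""
    low_bits, lows, upper = encoded
    seen = -1
    for pos, bit in enumerate(upper):          # linear-scan select of i-th 1
        seen += bit
        if bit and seen == i:
            return ((pos - i) << low_bits) | lows[i]
    raise IndexError(i)


pointers = [2, 3, 5, 7, 11, 13, 24]   # example monotone sequence (made up)
encoded = ef_encode(pointers, universe=25)
fifth = ef_access(encoded, 4)         # 11
```

The upper bit array has about two bits per element and the lower array about log2(universe / n) bits per element, which is why this code suits monotonically increasing pointer sequences.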