InnovateData2017

Practical String Dictionary Compression Using String Dictionary Encoding
Shunsuke Kanda, Kazuhiro Morita and Masao Fuketa
Tokushima Univ., Japan
3rd Innovate-Data in Prague, 21–23 Aug. 2017

String Dictionaries
- Data structure for storing a set of strings
- Mapping between strings and unique IDs
- With two primitive operations:
  - Lookup obtains the ID from a given string
  - Access obtains the string from a given ID
Example: and = 1, Applications = 2, Big = 3, Data = 4, Innovations = 5
  Lookup(Big) = 3
  Access(5) = Innovations
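The two operations can be sketched with a deliberately naive structure. This is only an illustration of the interface, not one of the compact implementations discussed in the talk; the class name `NaiveStringDictionary` and the case-insensitive ordering (chosen to reproduce the IDs in the example) are assumptions.

```python
class NaiveStringDictionary:
    """Uncompressed string dictionary: a sorted list plus a hash map (illustrative only)."""

    def __init__(self, strings):
        # Assumption: IDs start at 1 and follow case-insensitive sorted order,
        # which reproduces the numbering used in the slide's example.
        self._strings = sorted(set(strings), key=str.lower)
        self._ids = {s: i for i, s in enumerate(self._strings, start=1)}

    def lookup(self, s):
        """Return the ID of string s, or None if s is not stored."""
        return self._ids.get(s)

    def access(self, string_id):
        """Return the string that has the given ID."""
        return self._strings[string_id - 1]


d = NaiveStringDictionary(["Big", "Data", "Innovations", "and", "Applications"])
print(d.lookup("Big"), d.access(5))
```

Such a structure keeps every string in full, which is exactly the space cost the compressed dictionaries aim to avoid.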

String Dictionaries
- Application example: Text encoding
  - Encoding a text to integers using a string dictionary
  - In general, the integer sequence is more compact than the original text
  - Arbitrary words can be directly decoded
Example:
  Encoding using the dictionary: Big Data Innovations and Applications → 3 4 5 1 2
  Decoding using the dictionary: 3 4 5 1 2 → Big Data Innovations and Applications
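A minimal sketch of this text encoding, assuming whitespace tokenization and the word-to-ID mapping of the running example (both the helper names and the tokenization are assumptions made for illustration):

```python
# Word list in ID order (ID = position + 1), as in the running example.
words = ["and", "Applications", "Big", "Data", "Innovations"]
word_to_id = {w: i for i, w in enumerate(words, start=1)}


def encode(text):
    """Lookup each whitespace-separated word and return the ID sequence."""
    return [word_to_id[w] for w in text.split()]


def decode(ids):
    """Access each ID and join the words back into a text."""
    return " ".join(words[i - 1] for i in ids)


ids = encode("Big Data Innovations and Applications")
print(ids)
print(decode(ids))
print(decode(ids[2:3]))  # any single position can be decoded on its own
```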

Background
- Space efficiency matters in many recent applications:
  - Word lexicons in NLP and IR,
  - Management of URLs of Web graphs,
  - RDF dictionaries of the Semantic Web,
  - Gazetteers of GISs, and so on
- For example, recent RDF systems consider storing a 14 GB URI dataset in main memory [Mavlyutov+ 14]
  - This is infeasible with naïve data structures, at least on ordinary personal computers
- A compact dictionary implementation is very important
  - Needless to say, time performance is also important

Background
- Two significant choices for efficient implementation
  - 1st choice: the implementation technique, Trie or Front Coding
  - 2nd choice: the compression strategy, RePair
- State-of-the-art dictionary implementations using them
  - Trie with RePair [Grossi+ 14]
  - Front Coding with RePair [Martínez-Prieto+ 16]

Problem and our work
- Shortcoming of the RePair compression
  - The construction cost is too high to be practical
  - The cost can be reduced, but only by sacrificing some space efficiency
- Our contribution
  - Proposing an alternative compression strategy using string dictionary encoding, instead of using RePair
  - Presenting new string dictionary implementations for our strategy
  - Enabling considerably faster construction, while being competitive in space efficiency and operation speed

String Dictionary Implementations and Compression Strategies
Trie- and Front-Coding-based implementations
Old and new compression strategies

Trie
- Edge-labeled tree for storing a set of strings
  - Built by merging prefixes of strings
  - Operation time depends only on the target string length
* Specifically, this form is known as Patricia Trie
[Figure: Patricia trie storing the six strings below]
Strings: 1 ideal, 2 ideology, 3 tea, 4 technology, 5 tie, 6 trie
  Lookup(ideology) = 2
  Access(6) = trie
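The prefix merging and the length-bounded Lookup can be sketched as follows. This is a plain character-per-edge trie built from nested dicts, not the Patricia form shown on the slide, and the `"$id"` marker key is an assumption of this sketch (it cannot collide with an edge label because edge labels are single characters).

```python
def build_trie(strings):
    """Build a character trie; each stored string gets its 1-based rank in sorted order as ID."""
    root = {}
    for string_id, s in enumerate(sorted(strings), start=1):
        node = root
        for ch in s:
            # Shared prefixes reuse existing child nodes.
            node = node.setdefault(ch, {})
        node["$id"] = string_id  # hypothetical terminal marker
    return root


def lookup(root, s):
    """Follow one edge per character of s, so the work is bounded by len(s)."""
    node = root
    for ch in s:
        node = node.get(ch)
        if node is None:
            return None
    return node.get("$id")


trie = build_trie(["ideal", "ideology", "tea", "technology", "tie", "trie"])
print(lookup(trie, "ideology"), lookup(trie, "trie"), lookup(trie, "te"))
```

Access (ID to string) is omitted here because it needs a way to navigate from an ID back to its node, which the succinct tree representations used in practice provide.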

Path Decomposition [Ferragina+ 08]
- Trie transformation technique to reduce the number of random accesses on search by lowering the tree height
[Figure: the example trie and its Path-Decomposed Trie (PDT), whose node labels include 1t2e1chnology]

Compression Strategies of PDT
- Existing [Grossi+ 14]
  - Compressing node labels using the lightweight RePair
- Ours
  - Compressing node labels using dictionary encoding
* The tree structure is compactly represented using DFUDS.
[Figure: PDT node labels such as de1al, ϵ, ie, e, logy and 1t2e1chnology are replaced by the IDs A to F through dictionary encoding]

Front Coding
- Compression technique for sorted strings
  - Dividing strings into buckets of constant size
  - Storing the header (first string) of each bucket as is
  - Replacing each internal string with the length of the longest common prefix with its predecessor, plus the remaining suffix
* Lookup is performed by binary search on the headers
Example (strings: 1 ideal, 2 ideas, 3 ideology, 4 tea, 5 techie):
  header: ideal | internal strings: (4, s) (3, ology) (0, tea) | next header: techie
  Access(3) decodes internal strings from the header: ideal → ideas → ideology
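The bucket layout and the Access procedure can be sketched as below. The function names and the list-of-tuples representation are assumptions made for illustration; a real implementation would store the buckets in a packed byte array.

```python
def lcp_len(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def front_encode(sorted_strings, bucket_size):
    """Return a list of (header, [(lcp, suffix), ...]) buckets."""
    buckets = []
    for start in range(0, len(sorted_strings), bucket_size):
        chunk = sorted_strings[start:start + bucket_size]
        internal = []
        for prev, cur in zip(chunk, chunk[1:]):
            k = lcp_len(prev, cur)
            internal.append((k, cur[k:]))  # keep only the part not shared with the predecessor
        buckets.append((chunk[0], internal))
    return buckets


def front_access(buckets, bucket_size, string_id):
    """Decode the string with the given 1-based ID by replaying its bucket from the header."""
    header, internal = buckets[(string_id - 1) // bucket_size]
    s = header
    for k, suffix in internal[:(string_id - 1) % bucket_size]:
        s = s[:k] + suffix
    return s


strings = ["ideal", "ideas", "ideology", "tea", "techie"]
buckets = front_encode(strings, bucket_size=4)
print(buckets)
print(front_access(buckets, 4, 3))
```

With a bucket size of 4 this reproduces the layout in the example: Access(3) replays two internal entries after the header.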

Compression Strategies of Front Coding
- Existing [Martínez-Prieto+ 16]
  - Compressing internal strings using the original RePair
- Ours
  - Compressing internal strings using dictionary encoding
Example of dictionary encoding of the internal strings:
  before: ideal | (4, s) (3, ology) (0, tea) | techie
  after:  ideal | (4, C) (3, B) (0, D) | techie

Auxiliary String Dictionaries
String dictionaries for our strategy

Auxiliary String Dictionaries (ASDs)
- For our compression strategies
  - Requiring only Access for decoding, without Lookup
  - Operation speed is especially important because an ASD is called multiple times for each Lookup/Access of the original string dictionary
  - The target strings for compression have many similar suffixes because both Trie and Front Coding are built by merging prefixes
- Three types of ASDs are presented
Example: ASD with A = nology, B = ology, C = s, D = tea
  Decoding "ideal 4 C 3 B 0 D techie 4 A" using Access on the ASD: C => s, B => ology, D => tea
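A small sketch of how front-coded suffixes could be replaced by ASD identifiers and decoded again. The ASD is a plain Python dict here, standing in for TAIL, RPDT or FBC; the single-letter IDs, the function names, and the sixth string "technology" (inferred from the "techie 4 A" entry in the example) are assumptions.

```python
import string


def dictionary_encode(buckets):
    """Replace each suffix in front-coded buckets by an ID into an auxiliary dictionary."""
    suffixes = sorted({suffix for _, internal in buckets for _, suffix in internal})
    if len(suffixes) > len(string.ascii_uppercase):
        raise ValueError("this toy sketch supports at most 26 distinct suffixes")
    asd = dict(zip(string.ascii_uppercase, suffixes))  # ID -> suffix, Access only
    ids = {suffix: sid for sid, suffix in asd.items()}
    encoded = [(header, [(k, ids[suffix]) for k, suffix in internal])
               for header, internal in buckets]
    return asd, encoded


def access(asd, encoded, bucket_size, string_id):
    """Decode a 1-based string ID; every internal entry costs one ASD Access."""
    header, internal = encoded[(string_id - 1) // bucket_size]
    s = header
    for k, sid in internal[:(string_id - 1) % bucket_size]:
        s = s[:k] + asd[sid]
    return s


# Front-coded buckets (bucket size 4) for: ideal, ideas, ideology, tea, techie, technology
buckets = [("ideal", [(4, "s"), (3, "ology"), (0, "tea")]), ("techie", [(4, "nology")])]
asd, encoded = dictionary_encode(buckets)
print(asd)
print(encoded)
print(access(asd, encoded, 4, 6))
```

The loop in `access` shows why ASD speed matters: one Access on the outer dictionary triggers up to bucket size minus one Access calls on the ASD.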

1. TAIL: Plain Concatenation and Sharing
- Data structure
  - Concatenating strings with a special terminator '$'
  - Shareable strings are merged, such as 'nology' and 'ology'
  - The beginning position of each string becomes its ID
- Features
  - The decoding speed is the fastest
  - Space efficiency is low compared to other ASDs
Example (strings: A ie, B nology, C ology, D rie, E tea):
  rie$nology$tea$
  D and A point into "rie$", B and C point into "nology$", and E points to "tea$"
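One way to obtain the sharing is to process the strings in decreasing order of their reversed form, so that a string that is a suffix of another is handled right after a string containing it. The sketch below follows that idea; the ordering trick and the function names are assumptions, and the resulting layout differs from the one in the example although the total length is the same.

```python
def build_tail(strings, terminator="$"):
    """Concatenate strings, letting a string that is a suffix of another share its bytes.

    Returns (buffer, positions) where positions[s] is the start offset (the ID) of s.
    """
    # Descending order of reversed strings puts each string after one that ends with it.
    order = sorted(set(strings), key=lambda s: s[::-1], reverse=True)
    parts, length, positions = [], 0, {}
    prev, prev_pos = None, 0
    for s in order:
        if prev is not None and prev.endswith(s):
            # Share: point into the tail of the last string that was physically written.
            positions[s] = prev_pos + len(prev) - len(s)
        else:
            positions[s] = length
            parts.append(s + terminator)
            length += len(s) + 1
            prev, prev_pos = s, positions[s]
    return "".join(parts), positions


def tail_access(buffer, position, terminator="$"):
    """Read characters from position up to the next terminator."""
    return buffer[position:buffer.index(terminator, position)]


buffer, positions = build_tail(["ie", "nology", "ology", "rie", "tea"])
print(buffer, positions)
```

Decoding is a single forward scan with no indirection, which is consistent with TAIL being the fastest ASD to decode.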

2. Reverse Path-Decomposed Trie (RPDT)
- Reverse Trie
  - Merging suffixes of strings in reverse
  - Strings are decoded by traversing leaf-to-root paths
- RPDT
  - Result of path decomposition of Reverse Trie
  - High space efficiency while maintaining cache efficiency of path decomposition
[Figure: Reverse Trie and RPDT for the strings A ie, B itie, C nology, D ology, E rie, F tea]
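The leaf-to-root decoding can be sketched with a plain reverse trie stored as parent pointers. This covers only the Reverse Trie part, without path decomposition, and the array-based layout and function names are assumptions made for illustration.

```python
def build_reverse_trie(strings):
    """Insert each string back to front; return (labels, parents, node_of_string)."""
    labels, parents = [""], [-1]      # node 0 is the root
    children = {}                      # (parent node, character) -> child node
    node_of = {}
    for s in strings:
        node = 0
        for ch in reversed(s):
            key = (node, ch)
            if key not in children:    # common suffixes reuse existing nodes
                children[key] = len(labels)
                labels.append(ch)
                parents.append(node)
            node = children[key]
        node_of[s] = node              # the node of the string's first character acts as its ID
    return labels, parents, node_of


def reverse_trie_access(labels, parents, node):
    """Walk from the node up to the root; characters come out in forward order."""
    out = []
    while node != 0:
        out.append(labels[node])
        node = parents[node]
    return "".join(out)


strings = ["ie", "itie", "nology", "ology", "rie", "tea"]
labels, parents, node_of = build_reverse_trie(strings)
print(len(labels) - 1, "nodes for", sum(map(len, strings)), "characters")
print(reverse_trie_access(labels, parents, node_of["itie"]))
```

Only parent links are needed for Access, which fits the ASD requirement of supporting decoding without Lookup.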

3. Back Coding
- Suffix version of Front Coding
  - Common suffixes are replaced with integers, instead of common prefixes
- Fast Back Coding (FBC)
  - Using differences with headers, instead of predecessors
  - Unnecessary to decode internal strings other than the target string
  - The number of memory copies is never more than 2
  - The decoding speed is very fast
Example (strings: A ide, B ie, C rie, D itie, E otogy):
  Back Coding: ide | (i, 1) (r, 2) (it, 2) | otogy
  FBC:         ide | (i, 1) (ri, 1) (iti, 1) | otogy
  Access(D) with FBC: skip B and C, then iti + e (the last character of the header) = itie
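A sketch of FBC under the description above: every internal string is coded against its bucket header, so Access copies at most two pieces (the stored prefix and a tail of the header). The tuple layout and function names are assumptions of this sketch.

```python
def common_suffix_len(a, b):
    """Length of the longest common suffix of a and b."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


def fbc_encode(strings, bucket_size):
    """Return buckets (header, [(prefix, shared_suffix_len), ...]) coded against the header."""
    buckets = []
    for start in range(0, len(strings), bucket_size):
        header, *rest = strings[start:start + bucket_size]
        internal = []
        for s in rest:
            k = common_suffix_len(header, s)
            internal.append((s[:len(s) - k], k))
        buckets.append((header, internal))
    return buckets


def fbc_access(buckets, bucket_size, index):
    """Decode the string at a 0-based index without touching its bucket neighbours."""
    header, internal = buckets[index // bucket_size]
    offset = index % bucket_size
    if offset == 0:
        return header
    prefix, k = internal[offset - 1]
    return prefix + header[len(header) - k:]  # two copies: prefix, then header tail


strings = ["ide", "ie", "rie", "itie", "otogy"]  # A, B, C, D, E
buckets = fbc_encode(strings, bucket_size=4)
print(buckets)
print(fbc_access(buckets, 4, 3))  # D
```

Compared with predecessor-based coding, the shared suffixes are shorter (1 instead of 2 for C and D here), which trades some space for direct decoding.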

Experiments and Summary
Results of PDT and Front Coding

Settings
- Machine
  - Intel Xeon E5540 @2.53 GHz, running Ubuntu Server 16.04
  - 32 GB of RAM (L2 cache: 1 MB, L3 cache: 8 MB)
- Code
  - C++11 compiled with g++
  - Optimization option is -O9 (fastest and smallest)
- Datasets
  - All page titles from English Wikipedia
    - 227.2 MiB, 11 519 354 strings
  - URLs of a crawl by the UbiCrawler on the .uk domain
    - 2 723.3 MiB, 39 459 925 strings

Settings
- Dictionary implementations
  - PDT or Front Coding (bucket size is 8), using one of:
    - Baseline: RePair
    - Ours: TAIL, RPDT, or FBC (bucket size is 4)
- Which combinations provide high performance in time and space?

Results for PDT

Dataset  Method  Construction (sec)  Compression (%)  Lookup (µs / str)  Access (µs / ID)
Wiki     RePair   62.1 –             31.6 –           2.59 –             2.50 –
Wiki     TAIL      8.3 (0.13x)       41.7 (1.32x)     1.72 (0.66x)       1.76 (0.70x)
Wiki     RPDT     11.8 (0.19x)       32.4 (1.03x)     2.40 (0.93x)       2.26 (0.90x)
Wiki     FBC      10.6 (0.17x)       37.0 (1.17x)     2.29 (0.88x)       2.38 (0.95x)
URLs     RePair  437.2 –             17.5 –           4.00 –             3.98 –
URLs     TAIL     55.7 (0.13x)       20.7 (1.18x)     2.59 (0.65x)       2.94 (0.74x)
URLs     RPDT     58.6 (0.13x)       16.4 (0.94x)     4.16 (1.04x)       3.91 (0.98x)
URLs     FBC      53.5 (0.12x)       17.3 (0.99x)     3.24 (0.81x)       3.55 (0.89x)

* The smaller these results are, the better. Values in parentheses are relative to RePair.
- Construction: up to 8.2x faster
- Compression: up to 1.3x bigger
- Lookup: up to 1.5x faster
- Access: up to 1.4x faster

Results for Front Coding

Dataset  Method  Construction (sec)  Compression (%)  Lookup (µs / str)  Access (µs / ID)
Wiki     RePair    667.2 –           36.5 –           2.63 –             1.59 –
Wiki     TAIL        6.1 (0.009x)    45.3 (1.24x)     1.51 (0.57x)       0.66 (0.42x)
Wiki     RPDT        9.6 (0.014x)    38.7 (1.06x)     2.23 (0.85x)       1.21 (0.76x)
Wiki     FBC         8.6 (0.013x)    42.7 (1.17x)     2.00 (0.76x)       1.09 (0.69x)
URLs     RePair  11835.2 –           22.3 –           4.28 –             2.41 –
URLs     TAIL       28.0 (0.002x)    31.4 (1.41x)     2.32 (0.54x)       0.77 (0.32x)
URLs     RPDT       50.0 (0.004x)    27.1 (1.22x)     3.64 (0.85x)       1.73 (0.72x)
URLs     FBC        43.3 (0.004x)    28.1 (1.26x)     3.02 (0.71x)       1.47 (0.61x)

* The smaller these results are, the better. Values in parentheses are relative to RePair.
- Construction: up to 422x faster
- Compression: up to 1.4x bigger
- Lookup: up to 1.8x faster
- Access: up to 3.1x faster

Summary
- Our contributions
  - Proposing an alternative compression strategy using string dictionary encoding, instead of using RePair
  - Enabling considerably faster construction, while being competitive in space efficiency and operation speed
  - Light reconstruction is allowed in many applications
- Future work
  - Comparing ours with other text compression techniques, such as Huffman Coding and Online Grammar Compression
  - Applying the string dictionary encoding to other data structures for string processing

Thank you for your attention!
My English skills are limited. Please speak slowly if you have any questions.

2. Implementation of RPDT
- Construction procedure
  - Serializing the node and edge labels in breadth-first order
  - Giving each edge label a pointer to its parent
  - The pointers become monotonically increasing integers
  - Elias-Fano codes can represent them in compact space
- Features
  - The space usage is close to the theoretical lower bound
  - while supporting cache-friendly traversal from path decomposition
[Figure: serialized labels and parent pointers of the RPDT, with Access(B) = itie]
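To illustrate why monotone pointers compress well, here is a minimal Elias-Fano sketch for a non-decreasing integer sequence. It uses Python lists instead of packed bit vectors and a linear scan instead of a constant-time select structure, and the sample values are made up for the example rather than taken from the slide.

```python
def elias_fano_encode(values, universe):
    """Encode non-decreasing integers in [0, universe) as (low_bits, lows, high)."""
    n = len(values)
    # About log2(universe / n) low bits per value are stored verbatim.
    low_bits = max(0, (universe // n).bit_length() - 1) if n else 0
    mask = (1 << low_bits) - 1
    lows = [v & mask for v in values]
    # The high parts are stored in unary: the i-th value sets bit (v >> low_bits) + i.
    high = [0] * (n + (universe >> low_bits) + 1)
    for i, v in enumerate(values):
        high[(v >> low_bits) + i] = 1
    return low_bits, lows, high


def elias_fano_access(low_bits, lows, high, i):
    """Recover the i-th value (0-based) from the position of the (i+1)-th set bit."""
    seen = -1
    for pos, bit in enumerate(high):
        seen += bit
        if bit and seen == i:
            return ((pos - i) << low_bits) | lows[i]
    raise IndexError(i)


pointers = [0, 3, 3, 7, 12, 20]  # illustrative monotone values
low_bits, lows, high = elias_fano_encode(pointers, universe=21)
print(low_bits, lows, high)
print([elias_fano_access(low_bits, lows, high, i) for i in range(len(pointers))])
```

The encoding uses roughly 2 + log2(universe / n) bits per value, which is what makes it attractive for storing the parent pointers.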
