strings
- Mapping between strings and unique IDs
- With two primitive operations:
  - Lookup obtains the ID from a given string
  - Access obtains the string from a given ID
[Figure: IDs 1 to 5 are assigned to "and", "Applications", "Big", "Data", "Innovations"; Lookup(Big) = 3, Access(5) = Innovations]
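The two primitive operations can be sketched as follows. This is only a naive sorted-array baseline, not one of the compact implementations discussed in these slides. The class name `SortedStringDictionary` is hypothetical, and the sketch assumes IDs are 1-based ranks in case-insensitive sorted order, which is what the five-word example suggests.

```python
from bisect import bisect_left


class SortedStringDictionary:
    """Naive string dictionary: a sorted array plus binary search."""

    def __init__(self, strings):
        # Assumption: strings stay distinct after case folding.
        self._strings = sorted(set(strings), key=str.casefold)
        self._keys = [s.casefold() for s in self._strings]

    def lookup(self, s):
        """Return the 1-based ID of s, or None if s is not stored."""
        i = bisect_left(self._keys, s.casefold())
        if i < len(self._strings) and self._strings[i] == s:
            return i + 1
        return None

    def access(self, i):
        """Return the string whose 1-based ID is i."""
        return self._strings[i - 1]


d = SortedStringDictionary(["Big", "Data", "Innovations", "and", "Applications"])
print(d.lookup("Big"), d.access(5))
```

A sorted array keeps every string in full, so it is exactly the kind of naive structure that the compact implementations aim to replace.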
a text to integers using a string dictionary
- Basically, such an integer sequence is more compact than the original
- Arbitrary words can be directly decoded
[Figure: "Big Data Innovations and Applications" is encoded into 3 4 5 1 2 using the string dictionary, and decoded back using it]
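A minimal sketch of this encoding, assuming whitespace tokenization and the 1-based IDs of the example (and = 1, Applications = 2, Big = 3, Data = 4, Innovations = 5). The function names `encode` and `decode` are illustrative only.

```python
def encode(text, vocabulary):
    """Replace each word with its 1-based ID (a Lookup per word)."""
    word_to_id = {w: i for i, w in enumerate(vocabulary, start=1)}
    return [word_to_id[w] for w in text.split()]


def decode(ids, vocabulary):
    """Replace each ID with its word (an Access per ID)."""
    return " ".join(vocabulary[i - 1] for i in ids)


vocabulary = ["and", "Applications", "Big", "Data", "Innovations"]
codes = encode("Big Data Innovations and Applications", vocabulary)
print(codes)                      # the integer sequence
print(vocabulary[codes[2] - 1])   # decoding only the third word
```

The last line shows the "directly decoded" property: any single position can be turned back into its word without touching the rest of the sequence.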
recent applications:
- Word lexicon of NLP and IR,
- Management of URLs of Web graph,
- RDF dictionary of Semantic Web,
- Gazetteer of GISs, and so on…
- For example, recent RDF systems consider storing a URI dataset of 14 GB in main memory [Mavlyutov+ 14]
- Such management is impossible with naïve data structures, at least on general personal computers
- Compact dictionary implementation is very important
- Needless to say, time performance is also important
State-of-the-art dictionary implementations using them
- Trie with RePair [Grossi+ 14]
- Front Coding with RePair [Martínez-Prieto+ 16]
1st choice: the implementation technique is Trie or Front Coding
2nd choice: the compression strategy is RePair
compression
- The construction cost is very large and not practical
- While the cost can be improved, some space efficiency is sacrificed
- Our contribution
  - Proposing an alternative compression strategy using string dictionary encoding, instead of using RePair
  - Presenting new string dictionary implementations for our strategy
  - Enabling considerably faster construction, while being competitive in space efficiency and operation speed
strings
- Built by merging prefixes of strings
- Operation time depends only on the target string length
* Specifically, this form is known as Patricia Trie
[Figure: a trie over 1 ideal, 2 ideology, 3 tea, 4 technology, 5 tie, 6 trie, with edge labels "ide", "al", "ology", "t", "e", "a", "chnology", "ie", "rie"; Lookup(ideology) = 2, Access(6) = trie]
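As a rough illustration of Lookup and Access on a trie, the sketch below uses a plain character-per-node trie rather than the Patricia form shown on the slide, and it assumes IDs are 1-based lexicographic ranks. The per-node `count` field and all names are choices made for this sketch, not details taken from the slides.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # character -> TrieNode
        self.terminal = False  # True if a stored string ends here
        self.count = 0         # number of stored strings in this subtree


def build(strings):
    root = TrieNode()
    for s in set(strings):
        node = root
        node.count += 1
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
            node.count += 1
        node.terminal = True
    return root


def lookup(root, s):
    """Return the 1-based lexicographic rank of s, or None."""
    node, rank = root, 0
    for ch in s:
        if ch not in node.children:
            return None
        if node.terminal:
            rank += 1  # a proper prefix of s is stored and sorts first
        # Strings under smaller sibling characters also sort first.
        rank += sum(c.count for k, c in node.children.items() if k < ch)
        node = node.children[ch]
    return rank + 1 if node.terminal else None


def access(root, i):
    """Return the string whose 1-based lexicographic rank is i."""
    if not 1 <= i <= root.count:
        raise IndexError(i)
    node, out = root, []
    while True:
        if node.terminal:
            if i == 1:
                return "".join(out)
            i -= 1
        for ch in sorted(node.children):
            child = node.children[ch]
            if i <= child.count:
                out.append(ch)
                node = child
                break
            i -= child.count


root = build(["ideal", "ideology", "tea", "technology", "tie", "trie"])
print(lookup(root, "ideology"), access(root, 6))
```

Both operations walk a single root-to-node path, so the number of node visits is bounded by the length of the target string.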
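reduce the number of random accesses on search by lowering the tree height
[Figure: the example trie (edge labels "t", "ide", "rie", "e", "a", "chnology", "al", "ology", "ie") is converted into a Path-Decomposed Trie (PDT) with branching characters i, r, a, i and node labels such as "1t2e1chnology"]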
Compressing node labels using the lightweight RePair
- Ours
  - Compressing node labels using dictionary encoding
* The tree structure is compactly represented using DFUDS.
[Figure: Dictionary Encoding replaces the PDT node labels "de1al", "ϵ", "ie", "e", "logy", "1t2e1chnology" with the IDs A to F, keeping the branching characters i, r, a, i, o]
Dividing strings into buckets of constant size
- Storing the header (first string) of each bucket as is
- Replacing each internal string with the length of the longest common prefix with its predecessor and the remaining suffix
* Lookup is performed by binary search for headers
[Figure: 1 ideal, 2 ideas, 3 ideology, 4 tea, 5 techie are stored as header "ideal", internal strings "4 s", "3 ology", "0 tea", and the next header "techie"; Access(3) decodes the internal strings from the header: ideal, ideas, ideology]
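The bucketed scheme above can be sketched as follows, assuming a bucket size of 4 (which reproduces the five-string example) and 1-based IDs in sorted order. The function names and the tuple layout are illustrative, and the suffixes are kept as plain strings with no further compression.

```python
from bisect import bisect_right


def lcp_len(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def build(strings, bucket_size=4):
    strings = sorted(set(strings))
    headers, bodies = [], []
    for start in range(0, len(strings), bucket_size):
        chunk = strings[start:start + bucket_size]
        headers.append(chunk[0])  # header is stored in full
        # Each internal string becomes (LCP with predecessor, remaining suffix).
        body = []
        for prev, cur in zip(chunk, chunk[1:]):
            k = lcp_len(prev, cur)
            body.append((k, cur[k:]))
        bodies.append(body)
    return headers, bodies


def access(headers, bodies, i, bucket_size=4):
    """Decode the string with 1-based ID i by scanning its bucket."""
    b, offset = divmod(i - 1, bucket_size)
    s = headers[b]
    for k, suffix in bodies[b][:offset]:
        s = s[:k] + suffix
    return s


def lookup(headers, bodies, s, bucket_size=4):
    """Binary search over headers, then a sequential scan of one bucket."""
    b = bisect_right(headers, s) - 1
    if b < 0:
        return None
    cur = headers[b]
    if cur == s:
        return b * bucket_size + 1
    for j, (k, suffix) in enumerate(bodies[b], start=2):
        cur = cur[:k] + suffix
        if cur == s:
            return b * bucket_size + j
    return None


headers, bodies = build(["ideal", "ideas", "ideology", "tea", "techie"])
print(bodies[0])                   # [(4, 's'), (3, 'ology'), (0, 'tea')]
print(access(headers, bodies, 3))  # ideology
```

A larger bucket size stores fewer full headers but makes each Access decode more predecessors, which is the usual space/time trade-off of Front Coding.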
- Compressing internal strings using the original RePair
- Ours
  - Compressing internal strings using dictionary encoding
[Figure: Dictionary Encoding turns the internal strings of "ideal | 4 s | 3 ology | 0 tea | techie" into "ideal | 4 C | 3 B | 0 D | techie"]
- Requiring only Access for decoding, without Lookup
- Operation speed is especially important because an ASD is called multiple times for each Lookup/Access of the original string dictionary
- The target strings for compression have many similar suffixes because both Trie and Front Coding are built by merging prefixes
Three types of ASDs are presented
[Figure: the ASD stores A nology, B ology, C s, D tea; "ideal | 4 C | 3 B | 0 D | techie | 4 A" is decoded using Access: C => s, B => ology, D => tea]
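To make the role of the ASD concrete, the sketch below decodes the front-coded example in which each suffix has been replaced by an ASD ID. The ASD is modeled as a plain Python mapping, which is only a stand-in: the slides present dedicated compact ASD structures. The names `ASD`, `asd_access`, and `decode_bucket` are hypothetical.

```python
# Hypothetical stand-in for an ASD: only Access (ID -> string) is needed.
ASD = {"A": "nology", "B": "ology", "C": "s", "D": "tea"}


def asd_access(code):
    return ASD[code]


def decode_bucket(header, internal):
    """Decode a front-coded bucket whose suffixes are stored as ASD IDs."""
    decoded = [header]
    for lcp, code in internal:
        # One ASD Access per internal string that is decoded.
        decoded.append(decoded[-1][:lcp] + asd_access(code))
    return decoded


first = decode_bucket("ideal", [(4, "C"), (3, "B"), (0, "D")])
second = decode_bucket("techie", [(4, "A")])
print(first, second)
```

Decoding one bucket calls `asd_access` once per internal string, which illustrates why the Access speed of the ASD matters so much for the overall dictionary.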
- Concatenating strings with special terminator ‘$’
- Shareable strings are merged, such as ‘nology’ and ‘ology’
- The beginning position of each string becomes its ID
- Features
  - The decoding speed is the fastest
  - Space efficiency is low compared to other ASDs
[Figure: A ie, B nology, C ology, D rie, E tea are stored as "rie$nology$tea$"; the IDs D, A, B, C, E point to the beginning positions of rie, ie, nology, ology, tea]
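One possible way to build such a concatenation is sketched below. It assumes that "shareable" means one string being a suffix of another, processes longer strings first, and uses 0-based text positions as IDs; the placement order therefore differs from the slide's figure. The function names are illustrative, and the terminator is assumed not to occur inside any string.

```python
def build(strings, terminator="$"):
    """Concatenate strings, reusing the tail of an earlier string when possible."""
    pieces, length = [], 0
    suffix_pos = {}  # suffix of a placed string -> its start position in the text
    ids = {}         # string -> ID (its start position)
    # Longer strings first, so a shorter string can share a longer one's tail.
    for s in sorted(set(strings), key=lambda x: (-len(x), x)):
        if s not in suffix_pos:
            for j in range(len(s)):
                suffix_pos.setdefault(s[j:], length + j)
            pieces.append(s + terminator)
            length += len(s) + 1
        ids[s] = suffix_pos[s]
    return "".join(pieces), ids


def access(text, pos, terminator="$"):
    """Read from the start position up to the next terminator."""
    return text[pos:text.index(terminator, pos)]


text, ids = build(["ie", "nology", "ology", "rie", "tea"])
print(text)
print({s: access(text, p) for s, p in ids.items()})
```

Access is a single contiguous read, which matches the slide's point that decoding is fastest here, while only whole-suffix overlaps are shared, so the space savings stay limited.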
Merging suffixes of strings in reverse
- Strings are decoded by traversing leaf-to-root paths
- RPDT
  - Result of path decomposition of Reverse Trie
  - High space efficiency while maintaining cache efficiency of path decomposition
[Figure: Reverse Trie and RPDT built over A ie, B itie, C nology, D ology, E rie, F tea; the edge labels are reversed suffixes such as "ygolo", "ei", "n", "r", "ti"]
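The leaf-to-root decoding can be illustrated with a simplified reverse trie that keeps one character per node, without merging unary paths and without path decomposition. In this sketch the ID of a string is the index of the node where its reversed form ends, which is an assumption made for illustration; all names are hypothetical.

```python
def build_reverse_trie(strings):
    """Insert each string right-to-left so that common suffixes share nodes."""
    parent, label, children = [-1], [""], [{}]  # node 0 is the root
    ids = {}
    for s in sorted(set(strings)):
        node = 0
        for ch in reversed(s):
            nxt = children[node].get(ch)
            if nxt is None:
                nxt = len(parent)
                parent.append(node)
                label.append(ch)
                children.append({})
                children[node][ch] = nxt
            node = nxt
        ids[s] = node  # node reached after consuming the first character of s
    return parent, label, ids


def access(parent, label, node):
    """Walk up to the root; the labels read in this order spell the string."""
    out = []
    while node != 0:
        out.append(label[node])
        node = parent[node]
    return "".join(out)


parent, label, ids = build_reverse_trie(
    ["ie", "itie", "nology", "ology", "rie", "tea"])
print({s: access(parent, label, n) for s, n in ids.items()})
```

Because the walk only follows parent pointers, Access needs no child navigation at all, which fits an ASD that never has to support Lookup.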
- Suffixes are replaced with integers, instead of prefixes
- Fast Back Coding (FBC)
  - Using differences with headers, instead of predecessors
  - Unnecessary to decode internal strings other than the target string
  - The number of memory copies is never more than 2
  - The decoding speed is very fast
[Figure: for A ide, B ie, C rie, D itie, E otogy, Back Coding stores "ide | i 1 | r 2 | it 2 | otogy" and FBC stores "ide | i 1 | ri 1 | iti 1 | otogy"; with FBC, Access(D) skips the other internal strings and returns iti + e]
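A minimal sketch of FBC under some assumptions: strings are sorted by their reversed form, the bucket size is 4, and IDs are 0-based positions in that order (A = 0, ..., E = 4 in the example). The function names and tuple layout are illustrative.

```python
def common_suffix_len(a, b):
    """Length of the longest common suffix of a and b."""
    n = 0
    for x, y in zip(reversed(a), reversed(b)):
        if x != y:
            break
        n += 1
    return n


def build(strings, bucket_size=4):
    strings = sorted(set(strings), key=lambda s: s[::-1])
    buckets = []
    for start in range(0, len(strings), bucket_size):
        header, *rest = strings[start:start + bucket_size]
        internal = []
        for s in rest:
            # Difference is taken against the header, not the predecessor.
            k = common_suffix_len(header, s)
            internal.append((s[:len(s) - k], k))
        buckets.append((header, internal))
    return buckets


def access(buckets, i, bucket_size=4):
    """Decode the string with 0-based ID i using at most two copies."""
    header, internal = buckets[i // bucket_size]
    j = i % bucket_size
    if j == 0:
        return header
    prefix, k = internal[j - 1]
    return prefix + header[len(header) - k:]  # stored prefix + header tail


buckets = build(["ide", "ie", "rie", "itie", "otogy"])
print(buckets[0])          # ('ide', [('i', 1), ('ri', 1), ('iti', 1)])
print(access(buckets, 3))  # itie
```

Since every internal string refers only to its header, `access` never decodes a neighbor, at the cost of typically shorter shared suffixes than predecessor-based Back Coding would find.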
running Ubuntu Server 16.04
- 32 GB of RAM (L2 cache: 1 MB, L3 cache: 8 MB)
- Code
  - C++11 compiled with g++
  - Optimization option is -O9 (fastest and smallest)
- Datasets
  - All page titles from English Wikipedia
    - 227.2 MiB, 11 519 354 strings
  - URLs of a crawl by the UbiCrawler on the .uk domain
    - 2 723.3 MiB, 39 459 925 strings
strategy using string dictionary encoding, instead of using RePair
- Enabling considerably faster construction, while being competitive in space efficiency and operation speed
- Light reconstruction is allowed in many applications
- Future work
  - Comparing ours with other text compression techniques, such as Huffman Coding and Online Grammar Compression
  - Applying the string dictionary encoding to other data structures for string processing
the node and edge labels in breadth-first order
- Giving each edge label a pointer to its parent
- The pointers become monotonically increasing integers
- Elias-Fano codes can represent them in compact space
- Features
  - The space usage is close to the theoretical lower bound
  - while supporting cache-friendly traversal from path decomposition
[Figure: labels laid out in breadth-first order with parent pointers; Access(B) = itie]
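For readers unfamiliar with Elias-Fano codes, the sketch below shows the general idea on an arbitrary non-decreasing sequence, standing in for the monotone parent pointers. It is not the slides' implementation: the bits are kept in ordinary Python lists instead of packed bit vectors, select is a linear scan instead of an indexed operation, and the class name and sample values are made up for illustration.

```python
class EliasFano:
    """Elias-Fano representation of a non-decreasing integer sequence."""

    def __init__(self, values, universe):
        # Assumption: every value lies in [0, universe).
        n = len(values)
        # About log2(universe / n) low bits per value are stored verbatim.
        self.low_bits = max(0, (universe // n).bit_length() - 1) if n else 0
        mask = (1 << self.low_bits) - 1
        self.lows = [v & mask for v in values]
        # High parts are unary-coded: the i-th value sets bit (high + i).
        self.highs = [0] * (n + (universe >> self.low_bits) + 1)
        for i, v in enumerate(values):
            self.highs[(v >> self.low_bits) + i] = 1

    def access(self, i):
        """Return the i-th value (0-based)."""
        seen = -1
        for pos, bit in enumerate(self.highs):  # linear-scan select
            seen += bit
            if bit and seen == i:
                return ((pos - i) << self.low_bits) | self.lows[i]
        raise IndexError(i)


pointers = [3, 4, 7, 13, 14, 15, 21, 43]
ef = EliasFano(pointers, universe=44)
print([ef.access(i) for i in range(len(pointers))])
```

With n values below u, this layout uses roughly n * log2(u / n) bits for the low parts plus about 2n bits for the high parts, which is where the compactness for monotone pointer sequences comes from.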