from discodb import DiscoDB

def extract_words():
    # FILES is assumed to be a list of file paths defined elsewhere
    for fname in FILES:
        for word in open(fname).read().split():
            yield word, fname

db = DiscoDB(extract_words())

db['dog']            # files that mention 'dog'
db.keys()            # all distinct words
db.unique_values()   # all distinct filenames
db.items()           # all (word, iter(fname)) pairs

DiscoDB Chunk

→ [Value ID, ...]
Keys: Key ID → Key
Values: Value ID → Value

Perfect hashing by CMPH, guaranteed O(1)
The list of Value IDs is delta-encoded
Values are compressed with a global Huffman codebook
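
The delta encoding mentioned above can be illustrated with a minimal sketch. This is not DiscoDB's actual on-disk format, which the slide does not show; it only demonstrates the general idea of storing a sorted list of IDs as gaps between neighbours, which tend to be small numbers that compress well. The function names delta_encode and delta_decode and the sample IDs are hypothetical.

```python
def delta_encode(ids):
    """Turn a sorted list of non-negative ints into a list of gaps.

    The first element is stored as its distance from zero.
    """
    deltas = []
    prev = 0
    for value in ids:
        deltas.append(value - prev)
        prev = value
    return deltas


def delta_decode(deltas):
    """Rebuild the original sorted list by accumulating the gaps."""
    ids = []
    total = 0
    for gap in deltas:
        total += gap
        ids.append(total)
    return ids


# Hypothetical value IDs, chosen only for illustration
value_ids = [3, 7, 8, 15, 42]
encoded = delta_encode(value_ids)   # [3, 4, 1, 7, 27]
assert delta_decode(encoded) == value_ids
```

The gaps are never larger than the IDs they replace, so a variable-length integer coding applied afterwards would typically need fewer bits per entry.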

Querying with Conjunctive Normal Form

from discodb.query import Q

Q("A & B")
Q("A | B")
Q("(A & B) | C")

[Diagram: a DiscoDB with the value labels Apple, Banana and Orange next to the queries; one result is shown as → [Banana, Melon]]
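
What evaluating a query in conjunctive normal form means can be sketched in plain Python, without the discodb library. This is an illustration of the technique only, not how discodb implements Q. A CNF query is an AND of clauses, where each clause is an OR of keys. The helper eval_cnf and the sample index are hypothetical and are not taken from the slide.

```python
def eval_cnf(index, clauses):
    """Evaluate a CNF query against a dict mapping key -> set of values.

    clauses is a list of lists of keys: keys inside a clause are OR-ed
    (set union), and the clauses themselves are AND-ed (set intersection).
    """
    result = None
    for clause in clauses:
        clause_values = set()
        for key in clause:
            clause_values |= index.get(key, set())
        result = clause_values if result is None else result & clause_values
    return result if result is not None else set()


# Hypothetical index, chosen only for illustration
index = {
    "A": {"apple", "banana"},
    "B": {"banana", "orange"},
    "C": {"melon"},
}

a_and_b = eval_cnf(index, [["A"], ["B"]])   # A & B
a_or_b = eval_cnf(index, [["A", "B"]])      # A | B
# (A & B) | C is not in CNF; distributing the OR gives (A | C) & (B | C)
mixed = eval_cnf(index, [["A", "C"], ["B", "C"]])
```

Any Boolean query can be rewritten into this form, which is why a single union-then-intersect loop is enough to evaluate it.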