SUMMARIZATION SUMMARY
‣ Summarization can reduce storage and improve query speed
  • On average, a 40x reduction in storage with our own data
‣ We’ve lost info about individual prices
‣ Data summarization is not always trivial
HYPERLOGLOG
‣ A “sketch” of the data that represents some result that we care about
‣ Analogy: imagine we wanted to know how many times we flipped a coin
  • ~50% heads/tails
  • We could store the result of every coin flip as it occurs (HHTTTHTHHT)
  • Or we could just store the number of times heads appeared as we ingest data and use the magic of probability
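To make the analogy concrete, here is a toy Python sketch (an illustration, not from the talk): a single running counter answers the heads/tails question without retaining any individual flips.

```python
import random

flips = []   # raw approach: keep every outcome, storage grows with the data
heads = 0    # sketch approach: one integer, updated as data is ingested
total = 0

for _ in range(10_000):
    flip = "H" if random.random() < 0.5 else "T"
    flips.append(flip)        # O(n) storage
    heads += (flip == "H")    # O(1) storage
    total += 1

print(heads / total)  # ~0.5, recovered without replaying the raw flips
```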
‣ Each bucket is storing a number
‣ Each time we see a user, we only update a bucket value if a specific phenomenon is seen
‣ The phenomenon we care about is based on how bits are distributed when we hash a username
‣ We are looking for the position of the first ‘1’ bit
‣ Update a bucket if this position is greater than the existing value
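A minimal Python sketch of this update rule follows; the hash function, bucket count, and bit widths here are illustrative assumptions, not the talk's implementation.

```python
import hashlib

NUM_BUCKETS = 64              # assumption: illustrative register count (2^6)
buckets = [0] * NUM_BUCKETS   # each bucket stores one small number

def first_one_position(value: int, width: int) -> int:
    """1-based position of the first '1' bit in a fixed-width bit string."""
    return width - value.bit_length() + 1 if value else width + 1

def observe(username: str) -> None:
    """Hash a username, pick a bucket, and keep the largest first-'1' position seen."""
    h = int.from_bytes(hashlib.sha1(username.encode()).digest()[:8], "big")
    bucket = h & (NUM_BUCKETS - 1)      # low 6 bits choose the bucket
    pattern = h >> 6                    # remaining 58 bits carry the bit pattern
    pos = first_one_position(pattern, 58)
    if pos > buckets[bucket]:           # only update when we beat the stored value
        buckets[bucket] = pos
```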
‣ 50% of hashed values will look like this: 1xxxxx…x
‣ 25% of hashed values will look like this: 01xxxx…x
‣ 12.5% of hashed values will look like this: 001xxx…x
‣ 6.25% of hashed values will look like this: 0001xx…x
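These proportions are easy to verify with a quick simulation; the uniform random 32-bit values below are an assumption standing in for real hash outputs.

```python
import random
from collections import Counter

TRIALS = 1_000_000
counts = Counter()
for _ in range(TRIALS):
    h = random.getrandbits(32)                  # stand-in for a uniform hash value
    pos = 32 - h.bit_length() + 1 if h else 33  # position of the first '1' bit
    counts[pos] += 1

for pos in range(1, 5):
    print(f"first '1' at position {pos}: {counts[pos] / TRIALS:.4f}")
# ~0.5000, ~0.2500, ~0.1250, ~0.0625
```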
‣ Use the highest index of ‘1’ to determine cardinality
  • If the highest index of ‘1’ is 2, we saw 4 unique values
  • If the highest index of ‘1’ is 4, we saw 16 unique values
‣ For better accuracy, the highest index of ‘1’ is stored in a series of buckets
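As a sketch of how the estimate falls out: a single highest position p suggests roughly 2^p unique values, and the bucketed version combines registers with a harmonic mean. The bias-correction constant below follows the published HyperLogLog formula for large bucket counts; treat it as illustrative rather than the talk's exact method.

```python
def naive_estimate(max_pos: int) -> int:
    # one register: highest first-'1' position p suggests ~2^p unique values
    return 2 ** max_pos

def hll_estimate(buckets: list[int]) -> float:
    # harmonic mean across buckets smooths out a single lucky hash
    m = len(buckets)
    alpha = 0.7213 / (1 + 1.079 / m)  # HyperLogLog bias correction for m >= 128
    return alpha * m * m / sum(2.0 ** -b for b in buckets)

print(naive_estimate(2), naive_estimate(4))  # 4 16, matching the examples above
```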
CONCLUSIONS
‣ Summaries: faster/less storage
  • 100% accuracy
‣ Sketches for cardinality/distribution: 1-2 orders of magnitude faster/less storage than raw
  • 98% accuracy
‣ 40x lower costs can be make or break
‣ Interactive queries that are accurate enough