SUMMARIZATION SUMMARY
‣ We’ve lost info about individual events
‣ Drastically reduce storage and improve query speed
  • on average, 40x reduction in storage with our own data
‣ We’ve lost info about individual prices
‣ Data summarization is not always trivial
HYPERLOGLOG
‣ Instead of storing all the data, let’s store a “sketch” of the data that represents some result that we care about
‣ Analogy: Imagine we wanted to know how many times we flipped a coin
  • ~50% heads/tails
  • We could store the result of every coin flip as it occurs (HHTTTHTHHT)
  • Or we could just store the number of times heads appeared as we ingest data and use the magic of probability
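A minimal, hypothetical Python illustration of the coin-flip analogy (the variable names and flip count are made up, not from the deck): rather than appending every flip to a log, we keep one running count of heads, which is all we need to answer the question.

```python
import random

# Hypothetical illustration of the coin-flip analogy (not from the deck).
# Instead of storing every flip as it occurs ("HHTTTHTHHT..."), store only
# a running count of heads -- the "sketch" here is a single integer.
heads = 0
flips = 1_000_000

for _ in range(flips):
    if random.choice("HT") == "H":
        heads += 1

print(f"heads: {heads}/{flips} (~{heads / flips:.1%})")  # roughly 50%
```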
‣ HyperLogLog maintains a series of buckets
‣ Each bucket is storing a number
‣ Each time we see a user, we only update a bucket value if a specific phenomenon is seen
‣ The phenomenon we care about is based on how bits are distributed when we hash a username
‣ We are looking for the position of the first ‘1’ bit
‣ Update a bucket if this position is greater than the existing value
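A minimal sketch of the bucket update just described, assuming a SHA-1 hash, 64 buckets, and a 64-bit hash value; the hash function, bucket count, and function names are illustrative assumptions rather than the deck’s actual implementation.

```python
import hashlib

NUM_BUCKETS = 64                             # assumption: a power of two, so low hash bits pick a bucket
BUCKET_BITS = NUM_BUCKETS.bit_length() - 1   # 6 bits select one of 64 buckets
buckets = [0] * NUM_BUCKETS

def first_one_position(value: int, width: int) -> int:
    """1-based position of the first '1' bit, scanning from the most significant bit."""
    if value == 0:
        return width + 1                     # all zeros: treat as one past the end
    return width - value.bit_length() + 1

def observe(username: str) -> None:
    # Hash the username so its bits are (approximately) uniformly distributed.
    h = int.from_bytes(hashlib.sha1(username.encode()).digest()[:8], "big")
    bucket = h & (NUM_BUCKETS - 1)           # low bits choose which bucket to update
    rest = h >> BUCKET_BITS                  # remaining bits are where we look for the first '1'
    pos = first_one_position(rest, 64 - BUCKET_BITS)
    # Only update the bucket if this position beats the value already stored there.
    if pos > buckets[bucket]:
        buckets[bucket] = pos
```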
EXACT SOLUTION
‣ Storage: Linear
‣ Computation: Linear
‣ Accuracy: 100%
‣ Problem: Storing raw values can often be more expensive than storing the rest of the row.
‣ Solution: Store an approximate representation!
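For contrast, the exact approach amounts to remembering every distinct value, for example in a set; a minimal Python sketch (illustrative names, not the deck’s code):

```python
# Exact distinct count: remember every value we've seen.
# Accuracy is 100%, but storage grows linearly with the number of unique values.
seen: set[str] = set()

def observe_exact(username: str) -> None:
    seen.add(username)

def exact_cardinality() -> int:
    return len(seen)
```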
CONCLUSIONS
‣ Summarization: substantially (e.g. ~40x for us) faster/less storage
  • 100% accuracy
‣ Sketches for cardinality/distribution: 1-2 orders of magnitude faster/less storage than raw
  • 97% accuracy
‣ 40x lower costs is make or break
‣ Interactive queries that are accurate enough
HYPERLOGLOG
‣ 50% of hashed values will look like this: 1xxxxx…x
‣ 25% of hashed values will look like this: 01xxxx…x
‣ 12.5% of hashed values will look like this: 001xxx…x
‣ 6.25% of hashed values will look like this: 0001xx…x
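A quick hypothetical simulation of the pattern above (not from the deck): for uniformly random bits, the chance that the first ‘1’ appears at position k is 2**-k, which is exactly the 50% / 25% / 12.5% / 6.25% sequence.

```python
import random
from collections import Counter

# Hypothetical simulation: for a uniformly random bit string,
# P(first '1' is at position k) = 2**-k, matching the percentages above.
trials = 1_000_000
counts = Counter()

for _ in range(trials):
    bits = random.getrandbits(32)
    pos = 33 - bits.bit_length()   # 1-based position of the first '1' from the left
    counts[pos] += 1

for k in range(1, 5):
    print(f"first '1' at position {k}: {counts[k] / trials:.2%}")   # ~ 2**-k
```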
HYPERLOGLOG
‣ Use the highest index of ‘1’ to determine cardinality
  • If highest index of ‘1’ is 2, we saw ~4 unique values
  • If highest index of ‘1’ is 4, we saw ~16 unique values
‣ For better accuracy, the highest index of ‘1’ is stored in a series of buckets
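Continuing the hypothetical bucket sketch from earlier: the single-bucket intuition is simply 2**max_position, and the standard HyperLogLog estimator combines all the buckets with a bias-corrected harmonic mean (the small- and large-range corrections from the HyperLogLog paper are omitted here).

```python
# Hypothetical continuation of the earlier sketch: combine the per-bucket maxima
# into one cardinality estimate via HyperLogLog's bias-corrected harmonic mean.
def estimate_cardinality(buckets: list[int]) -> float:
    m = len(buckets)
    alpha = 0.7213 / (1 + 1.079 / m)            # bias-correction constant (for larger m)
    harmonic = sum(2.0 ** -b for b in buckets)  # empty buckets (0) contribute 2**0 = 1
    return alpha * m * m / harmonic
```

With the 64 buckets filled by observe() above, estimate_cardinality(buckets) returns an approximate distinct-user count; more buckets trade a little extra storage for better accuracy.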