can be used to map digital data of arbitrary size to digital data of fixed size, with slight differences in input data producing very big differences in output data.” - 62 United States Attorneys’ Bulletin 44-82
different hash functions
  ◦ Bit array of size m with all bits initialized to 0
• Insertion
  ◦ Hash the input with all k hash functions
  ◦ Set each corresponding bit in the array to 1
• Query
  ◦ Hash the input with all k hash functions
  ◦ All corresponding bits are set to 1 ? “Yes” (possibly a false positive) : “No” (definitely not present)
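The insertion and query steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the class name `BloomFilter`, the method names, and the trick of deriving the k hash functions by salting SHA-256 with an index are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Bit array of size m, probed by k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = [0] * m  # all bits initialized to 0

    def _positions(self, item: str):
        # Assumption: simulate k different hash functions by
        # prefixing the input with the function index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # Insertion: set each corresponding bit to 1.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # Query: "yes" only if every corresponding bit is 1.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=1024, k=3)
for word in ("apple", "banana", "cherry"):
    bf.add(word)

print(bf.might_contain("apple"))   # True: inserted items are always found
print(bf.might_contain("durian"))  # usually False, but a false positive is possible
```

A "no" answer is always correct, because a bit that is still 0 proves the item was never inserted. A "yes" answer may be wrong when other items happened to set the same k bits.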
k different hash functions
  ◦ Matrix of size m × k with all values initialized to 0
• Insertion
  ◦ Hash the input with all k hash functions
  ◦ Increment each corresponding value by 1
• Query
  ◦ Hash the input with all k hash functions
  ◦ Return the minimum of all the corresponding values
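A rough Python sketch of the counting structure described above follows. The class name `CountMinSketch`, the method names, and the salted SHA-256 hashing are assumptions for illustration; the matrix is stored here as k rows of m counters, one row per hash function.

```python
import hashlib


class CountMinSketch:
    """k hash functions, each owning one row of m counters."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.table = [[0] * m for _ in range(k)]  # all values start at 0

    def _cells(self, item: str):
        # Assumption: simulate k different hash functions by
        # prefixing the input with the row index.
        for row in range(self.k):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # Insertion: increment each corresponding value by 1.
        for row, col in self._cells(item):
            self.table[row][col] += 1

    def estimate(self, item: str) -> int:
        # Query: return the minimum of the corresponding values.
        return min(self.table[row][col] for row, col in self._cells(item))


cms = CountMinSketch(m=256, k=4)
for word in ["a"] * 5 + ["b"] * 2:
    cms.add(word)

print(cms.estimate("a"))  # at least 5; collisions can only inflate the count
print(cms.estimate("b"))  # at least 2
```

Taking the minimum works because collisions only ever add to a counter, so the smallest of the k values is the one least polluted by other items. The estimate never undercounts.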
  ◦ Bit array of size m with all bits initialized to 0
• Insertion
  ◦ Hash the input with the hash function
  ◦ Set the corresponding bit in the array to 1
• Query
  ◦ Count the number of 1s in the array
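The single-hash bit array above can be sketched as follows. The class name `LinearCounter` and its methods are assumptions for illustration. The `ones()` method is the raw count from the slide. The `estimate()` method is an addition not stated above: it applies the standard linear-counting correction, -m · ln(fraction of zero bits), which compensates for distinct items that collided on the same bit.

```python
import hashlib
import math


class LinearCounter:
    """One hash function and a bit array of size m."""

    def __init__(self, m: int) -> None:
        self.m = m
        self.bits = [0] * m  # all bits initialized to 0

    def add(self, item: str) -> None:
        # Insertion: hash the input and set the corresponding bit to 1.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        self.bits[int.from_bytes(digest[:8], "big") % self.m] = 1

    def ones(self) -> int:
        # Query as stated on the slide: count the number of 1s.
        return sum(self.bits)

    def estimate(self) -> float:
        # Addition (not on the slide): correct for hash collisions.
        zeros = self.m - self.ones()
        if zeros == 0:
            return float("inf")  # array is saturated; m was too small
        return -self.m * math.log(zeros / self.m)


lc = LinearCounter(m=4096)
for i in range(100):
    lc.add(f"user-{i}")
    lc.add(f"user-{i}")  # duplicates hit the same bit and change nothing

print(lc.ones())      # at most 100, lower if two users collided
print(lc.estimate())  # close to 100
```

The raw count of 1s is a lower bound on the cardinality. It drifts further below the true value as the array fills up, which is why m has to be sized in proportion to the expected number of distinct items.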
(pseudo) random, evenly distributed output
• In such a random, evenly distributed series, you would expect to see a value with at least k leading zeros once in every 2^k elements
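A quick experiment can make the leading-zeros claim concrete. The sketch below hashes 2^16 distinct strings to 32-bit values and compares how many have at least k leading zeros with the expected n / 2^k. The helper names `hash32` and `leading_zeros`, the 32-bit width, and the use of truncated SHA-256 as the hash are assumptions for illustration.

```python
import hashlib

HASH_BITS = 32  # assumption: treat the first 4 digest bytes as a 32-bit hash


def hash32(item: str) -> int:
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def leading_zeros(value: int, bits: int = HASH_BITS) -> int:
    # Number of zero bits before the first 1 in a fixed-width value.
    return bits - value.bit_length()


n = 1 << 16
zeros = [leading_zeros(hash32(f"item-{i}")) for i in range(n)]

for k in range(1, 9):
    observed = sum(z >= k for z in zeros)
    expected = n // 2**k
    print(f"k={k}: observed {observed}, expected about {expected}")

# Reading the rule backwards gives a (very noisy) cardinality guess:
# if the longest run seen is r, roughly 2^r distinct values were hashed.
print("longest run of leading zeros:", max(zeros))
```

A single maximum is a high-variance estimator, which is why practical schemes split the input across many buckets and average the per-bucket results.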
    - Toss out largest 30%
• HyperLogLog
  ◦ Use harmonic mean instead of geometric mean
  ◦ Special case logic for low cardinality
• HyperLogLog++
  ◦ Google’s version of HLL
  ◦ Use 64-bit hash function to count larger cardinalities
  ◦ Different special case logic for low cardinality
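The HyperLogLog bullets above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not a reference implementation: the class name and the precision parameter `p` are hypothetical, the 64-bit hash is truncated SHA-256, the bias constant is the published approximation for 128 or more registers, and the low-cardinality special case falls back to linear counting over the empty registers as in the original HyperLogLog algorithm. The additional bias corrections of HyperLogLog++ are not included.

```python
import hashlib
import math


class HyperLogLog:
    """m = 2**p registers, each holding the longest zero run seen (+1)."""

    def __init__(self, p: int = 10) -> None:
        if p < 7:
            raise ValueError("this sketch only handles p >= 7 (m >= 128)")
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, m >= 128

    def add(self, item: str) -> None:
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")      # 64-bit hash value
        index = x >> (64 - self.p)                 # first p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[index] = max(self.registers[index], rank)

    def estimate(self) -> float:
        # Harmonic mean of 2**register across all registers.
        raw = self.alpha * self.m**2 / sum(2.0**-r for r in self.registers)
        empty = self.registers.count(0)
        if raw <= 2.5 * self.m and empty:
            # Low cardinality: linear counting over the empty registers.
            return self.m * math.log(self.m / empty)
        return raw


hll = HyperLogLog(p=10)
for i in range(50_000):
    hll.add(f"visitor-{i}")
print(round(hll.estimate()))  # close to 50000; typical error is a few percent
```

With m = 1024 registers the expected relative error is about 1.04 / sqrt(m), roughly 3%. The harmonic mean damps the effect of a few registers with unusually long zero runs, which is the same problem the "toss out the largest 30%" rule addresses more crudely.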