630

# HyperLogLog feature of ClickHouse

November 14, 2019

## Transcript

1. ### HyperLogLog feature of ClickHouse HyperLogLog feature of ClickHouse Haruki Okada

of a dataset O(1) time & space 0.81% standard error with 16384 buckets Can be merged with another HLL Easily parallelized
3. ### Count-distinct problem Count-distinct problem Generally, calculating count-distinct requires O(N) space

The space may not t on single machine for massive datasets By relaxing exactness, there are some algorithms to approximate count-distinct
4. ### HLL Intuition HLL Intuition Choose a 64bit integer at random

The probability of "leftmost '1' is at bit N" is 1/(2^N) In other words, we have to choose an integer at least 2^N times to get an integer such that "letmost '1' is at bit N"
5. ### HLL Intuition HLL Intuition Map each element in a dataset

to 64bit integer space by applying hash function Equivalent to choosing 64bit integer at random N(unique elements in the dataset) times Conversely, if we have an integer such that "leftmost '1' is at bit N" as a result of applying hash function, it indicates that "the dataset contains 2^N unique elements" Now we approximated count-distinct by only "leftmost position of '1' bit"

uniqHLL12
7. ### uniqCombined uniqCombined Switches 3 algorithms depending on number of elements

count <= 16 : Array 16 < count <= 8192: HashSet Assume the HLL_precision is default (17) 8192 < count: HLL refs: AggregateFunctionUniqCombined.h#L84
8. ### uniqCombined uniqCombined :) select uniqCombined(cookie_id) from page_views where domain =

'example.com'; ┌─uniqCombined(cookie_id)─┐ │ 128751 │ └─────────────────────────┘ :) -- specify HLL_precision (bucket size) :) select uniqCombined(14)(cookie_id) from conversions where advertiser_id = 55301; ┌─uniqCombined(cookie_id)─┐ │ 7752 │ └─────────────────────────┘
9. ### uniqCombinedState/Merge uniqCombinedState/Merge Returns an intermediate state of the unique aggregation

(i.e. HLL sketch) :) SELECT uniqCombinedMerge(state) FROM ( SELECT uniqCombinedState(cookie_id) AS state FROM page_views WHERE domain = 'example.com' UNION ALL SELECT uniqCombinedState(cookie_id) AS state FROM page_views WHERE domain = 'example2.com'); ┌─uniqCombinedMerge(state)─┐ │ 195763 │ └──────────────────────────┘
10. ### ClickHouse HLL Implementation ClickHouse HLL Implementation Linear Counting and HyperLogLog++

"HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm" (S. Heule et al., 2013)
11. ### Comparing to Redis Comparing to Redis Redis also provides HLL

through PF* command family Transition of Redis HLL: 2.8.9: HLL++ and linear counting 4.0.0: LogLog-Beta 5.0.0: New HLL algorithm
12. ### New HLL algorithm Redis currently adopts New HLL algorithm Redis

currently adopts Single estimation formula for all carinality range No bias correction table needed "New cardinality estimation algorithms for HyperLogLog sketches" (Otmar Ertl, 2017)
13. ### Conclusion Conclusion ClickHouse provides HLL in a con gurable and

useful way Seems there's still a room for updating HLL algorithm ?