Upgrade to Pro — share decks privately, control downloads, hide ads and more …

HyperLogLog feature of ClickHouse

Okada Haruki
November 14, 2019

HyperLogLog feature of ClickHouse

Okada Haruki

November 14, 2019
Tweet

More Decks by Okada Haruki

Other Decks in Technology

Transcript

  1. HyperLogLog Introduction HyperLogLog Introduction Ef cient algorithm to approximate count-distinct

    of a dataset O(1) time & space 0.81% standard error with 16384 buckets Can be merged with another HLL Easily parallelized
  2. Count-distinct problem Count-distinct problem Generally, calculating count-distinct requires O(N) space

    The space may not t on single machine for massive datasets By relaxing exactness, there are some algorithms to approximate count-distinct
  3. HLL Intuition HLL Intuition Choose a 64bit integer at random

    The probability of "leftmost '1' is at bit N" is 1/(2^N) In other words, we have to choose an integer at least 2^N times to get an integer such that "letmost '1' is at bit N"
  4. HLL Intuition HLL Intuition Map each element in a dataset

    to 64bit integer space by applying hash function Equivalent to choosing 64bit integer at random N(unique elements in the dataset) times Conversely, if we have an integer such that "leftmost '1' is at bit N" as a result of applying hash function, it indicates that "the dataset contains 2^N unique elements" Now we approximated count-distinct by only "leftmost position of '1' bit"
  5. uniqCombined uniqCombined Switches 3 algorithms depending on number of elements

    count <= 16 : Array 16 < count <= 8192: HashSet Assume the HLL_precision is default (17) 8192 < count: HLL refs: AggregateFunctionUniqCombined.h#L84
  6. uniqCombined uniqCombined :) select uniqCombined(cookie_id) from page_views where domain =

    'example.com'; ┌─uniqCombined(cookie_id)─┐ │ 128751 │ └─────────────────────────┘ :) -- specify HLL_precision (bucket size) :) select uniqCombined(14)(cookie_id) from conversions where advertiser_id = 55301; ┌─uniqCombined(cookie_id)─┐ │ 7752 │ └─────────────────────────┘
  7. uniqCombinedState/Merge uniqCombinedState/Merge Returns an intermediate state of the unique aggregation

    (i.e. HLL sketch) :) SELECT uniqCombinedMerge(state) FROM ( SELECT uniqCombinedState(cookie_id) AS state FROM page_views WHERE domain = 'example.com' UNION ALL SELECT uniqCombinedState(cookie_id) AS state FROM page_views WHERE domain = 'example2.com'); ┌─uniqCombinedMerge(state)─┐ │ 195763 │ └──────────────────────────┘
  8. ClickHouse HLL Implementation ClickHouse HLL Implementation Linear Counting and HyperLogLog++

    "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm" (S. Heule et al., 2013)
  9. Comparing to Redis Comparing to Redis Redis also provides HLL

    through PF* command family Transition of Redis HLL: 2.8.9: HLL++ and linear counting 4.0.0: LogLog-Beta 5.0.0: New HLL algorithm
  10. New HLL algorithm Redis currently adopts New HLL algorithm Redis

    currently adopts Single estimation formula for all carinality range No bias correction table needed "New cardinality estimation algorithms for HyperLogLog sketches" (Otmar Ertl, 2017)
  11. Conclusion Conclusion ClickHouse provides HLL in a con gurable and

    useful way Seems there's still a room for updating HLL algorithm ?