hama_du
October 05, 2018

# Distinct Query using HyperLogLog

Understanding the intuition behind HyperLogLog, using Distinct Query as an example

## Transcript

1. Memory-Efficient Distinct Query
with HyperLogLog
SDD study session @r-n-i
2018/10/05

2. Distinct Query

3. Query examples
• Distinct(['A', 'B']) = 2
• Distinct(['A', 'B', 'C', 'A', 'C']) = 3

4. [The usual implementation]
Put everything into a Set and take its size

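5. Problems with putting everything into a Set
• Painful when the data sequence is large
• Uses memory proportional to the length (in the worst case, every element is distinct)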
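
The usual approach from slides 4 and 5 can be sketched in Python as follows. This is only a minimal illustration; the function name `distinct` is chosen here to mirror the slides' `Distinct(...)` notation and does not come from any library.

```python
def distinct(items):
    """Exact distinct count: put every element into a set and take its size."""
    seen = set()
    for item in items:
        seen.add(item)  # the set keeps one copy of every distinct element
    return len(seen)


print(distinct(['A', 'B']))                 # 2
print(distinct(['A', 'B', 'C', 'A', 'C']))  # 3
```

The set has to hold every distinct element at once, which is where the memory cost comes from.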

6. Query example - a concrete case
• Given a sequence of user IDs, the number of unique users is

7. The number of unique users is…
35915!!

8. The number of unique users is…
35915!!

9. The number of unique users is…
35915!!
Do we really need this?

10. The exact value
sometimes isn't all that important

11. Estimation using hash values

12. Computing hash values
hash(AB) = 0x36f… = 0011 0110 1111 …
hash(CD) = 0xc90… = 1100 1001 0000 …
hash(EF) = 0x01e… = 0000 0001 1110 …

13. How many leading 0s are there?
zero(hash(AB))
= zero(0011 0110 1111…)
= 2
zero(hash(CD))
= zero(1100 1001 0000…)
= 0
zero(hash(EF))
= zero(0000 0001 1110…)
= 7
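
The `zero(...)` function on this slide can be sketched in Python as below. The slide shows only the first 12 bits of each hash (the rest is elided), so this sketch assumes those visible prefixes are complete 12-bit values; the `bits` parameter is an addition made here for that purpose.

```python
def zero(h, bits):
    """Number of leading 0 bits of h, viewed as a `bits`-wide unsigned integer."""
    return bits - h.bit_length()


# Visible 12-bit prefixes from the slide, treated as whole values (an assumption).
for name, prefix in [("AB", 0x36F), ("CD", 0xC90), ("EF", 0x01E)]:
    print(name, format(prefix, "012b"), zero(prefix, 12))
# AB 001101101111 2
# CD 110010010000 0
# EF 000000011110 7
```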

14. Take the maximum of these
D = max(
zero(hash(AB)),
zero(hash(CD)),
zero(hash(EF))
)
= max(2, 0, 7)
= 7

15. Conversely…

16. Suppose we only know the maximum
D = 7

17. In other words…
D = max(?, ?, …, 7, …, ?, ?)
zero(hash(?))
= zero(0000 0001 …)
= 7

18. In other words…
D = max(?, ?, …, 7, …, ?, ?)
zero(hash(?))
= zero(0000 0001 …)
= 7
Pretty rare!

19. How rare?
D = max(?, ?, …, 7, …, ?, ?)
zero(hash(?))
= zero(0000 0001 …)
= 7
1/2^7 = 1/128
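
The 1/128 figure can be checked by brute force, assuming hash bits are uniformly distributed: it is the share of values that start with at least 7 zeros. The sketch below enumerates every value of a toy 12-bit width (the width is an assumption for illustration; the ratio does not depend on it as long as the width is at least 7).

```python
from fractions import Fraction

BITS = 12  # toy hash width, chosen only so that enumeration is cheap


def zero(h, bits=BITS):
    """Number of leading 0 bits of h as a `bits`-wide unsigned integer."""
    return bits - h.bit_length()


def fraction_with_at_least(k, bits=BITS):
    """Exact fraction of all `bits`-wide values with at least k leading zeros."""
    hits = sum(1 for h in range(2 ** bits) if zero(h, bits) >= k)
    return Fraction(hits, 2 ** bits)


print(fraction_with_at_least(7))  # 1/128
```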

20. How many unique hashes have we seen?
D = max(?, ?, …, 7, …, ?, ?)
1/2^7 = 1/128
About 128 on average?

21. We can roughly estimate the number of
distinct elements (hashes)
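
Putting slides 12 to 21 together gives a very rough single-number estimator: remember only the maximum leading-zero count D and guess 2^D. The sketch below is an illustration under assumptions made here, not the deck's implementation: it takes the first 64 bits of SHA-256 as the hash, and the names `hash64` and `rough_distinct` are hypothetical.

```python
import hashlib

HASH_BITS = 64  # assumption: use the first 64 bits of SHA-256 as the hash


def hash64(item):
    """Deterministic 64-bit hash of an item's string form."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def zero(h, bits=HASH_BITS):
    """Number of leading 0 bits of h as a `bits`-wide unsigned integer."""
    return bits - h.bit_length()


def rough_distinct(items):
    """Rough estimate 2**D, where D is the largest leading-zero count seen."""
    d = max((zero(hash64(x)) for x in items), default=0)
    return 2 ** d


users = [f"user{i}" for i in range(1000)]
print(rough_distinct(users))          # a power of two; may be far from 1000
print(rough_distinct(users + users))  # duplicates do not change the result
```

Only the single integer D is kept in memory, but the answer is always a power of two and can be badly off.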

22. HyperLogLog

23. Route by the tail of the hash
hash(AB) = 0x36f… = 0011 0110 … 1010
D: [figure: array of 16 registers indexed 0 to 15; the hash ends in 1010, so it is routed to register 10, which currently holds 1]

24. Update with the larger value!
hash(AB) = 0x36f… = 0011 0110 … 1010
D: [figure: the same 16 registers; register 10 is updated from 1 to 2, because zero(hash(AB)) = 2 is larger than the stored value]
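
The routing and update step can be sketched as follows. The slide elides the middle of hash(AB), so the 16-bit value used here is a made-up stand-in that only matches the visible head (0011 0110) and tail (1010); the helper `update` is likewise hypothetical.

```python
NUM_INDEX_BITS = 4                   # the last 4 bits pick one of 16 registers
NUM_REGISTERS = 1 << NUM_INDEX_BITS


def zero(h, bits):
    """Number of leading 0 bits of h as a `bits`-wide unsigned integer."""
    return bits - h.bit_length()


def update(registers, h, bits):
    """Route hash h by its last 4 bits and keep the larger zero count."""
    index = h & (NUM_REGISTERS - 1)
    registers[index] = max(registers[index], zero(h, bits))
    return index


D = [0] * NUM_REGISTERS
D[10] = 1                            # register 10 holds 1 before the update

# Hypothetical 16-bit stand-in for hash(AB): 0011 0110 0000 1010.
h = 0b0011_0110_0000_1010
index = update(D, h, bits=16)
print(index, D[index])               # 10 2
```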

25. Estimating the number of elements
• How rare is the state of D, on average?
• Harmonic mean!

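26. Estimating the number of elements
D holds the values 2, 1, 1, 4 (4 boxes)
$$C \times 4 \times \underbrace{\frac{4}{\frac{1}{2^2} + \frac{1}{2^1} + \frac{1}{2^1} + \frac{1}{2^4}}}_{\text{density per box}}$$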
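
The formula on this slide can be evaluated directly. In the sketch below, the correction constant C is left as a parameter because the slide gives no value for it; `c=1.0` in the usage line is a placeholder, not a recommended setting, and the function name `estimate` is hypothetical.

```python
def estimate(registers, c):
    """Compute c * m * (harmonic mean of 2**D[j]) over m registers."""
    m = len(registers)
    # Harmonic mean of 2**d: the "density per box" on the slide.
    density_per_box = m / sum(2.0 ** -d for d in registers)
    return c * m * density_per_box


print(estimate([2, 1, 1, 4], c=1.0))  # 256/21, roughly 12.19
```

The harmonic mean is dominated by the small terms, so a single box with an unusually large value (the 4 here) pulls the estimate up far less than it would with an arithmetic mean.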

27. Errors tend to be larger when the cardinality is small

28. Space complexity (memory usage)
• Memory used: just an array of integers!
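
To make the memory claim concrete, the back-of-the-envelope sketch below compares a register array with the payload a set must hold. All parameter values are illustrative assumptions made here (14 index bits with 6-bit registers, and 8 bytes per user ID); the deck does not state them.

```python
def sketch_bytes(index_bits, bits_per_register):
    """Bytes for 2**index_bits registers of bits_per_register each, bit-packed."""
    total_bits = (1 << index_bits) * bits_per_register
    return (total_bits + 7) // 8


def set_bytes_lower_bound(n_distinct, bytes_per_id=8):
    """A set keeps every distinct ID, so it needs at least this much payload."""
    return n_distinct * bytes_per_id


print(sketch_bytes(4, 8))            # 16: a 16-register toy array, one byte each
print(sketch_bytes(14, 6))           # 12288
print(set_bytes_lower_bound(35915))  # 287320
```

The register array has a fixed size regardless of how many elements are seen, whereas the set's footprint keeps growing with the number of distinct IDs.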

29. References
• HyperLogLog in Practice: Algorithmic
Engineering of a State of The Art Cardinality
Estimation Algorithm