Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Distinct Query using HyperLogLog
Search
hama_du
October 05, 2018
Science
2
81
Distinct Query using HyperLogLog
Distinct Queryを例にHyperLogLogのお気持ちを理解する
hama_du
October 05, 2018
Tweet
Share
More Decks by hama_du
See All by hama_du
Google File System
hamadu
0
80
木の上を歩こう
hamadu
1
970
linear-algebra-in-n-minutes
hamadu
0
240
Other Decks in Science
See All in Science
学術講演会中央大学学員会府中支部
tagtag
0
260
CV_5_3dVision
hachama
0
140
安心・効率的な医療現場の実現へ ~オンプレAI & ノーコードワークフローで進める業務改革~
siyoo
0
220
Trend Classification of InSAR Displacement Time Series Using SAE–CNN
satai
3
390
局所保存性・相似変換対称性を満たす機械学習モデルによる数値流体力学
yellowshippo
1
260
機械学習 - SVM
trycycle
PRO
1
800
白金鉱業Meetup Vol.16_【初学者向け発表】 数理最適化のはじめの一歩 〜身近な問題で学ぶ最適化の面白さ〜
brainpadpr
11
2.2k
機械学習 - K-means & 階層的クラスタリング
trycycle
PRO
0
830
butterfly_effect/butterfly_effect_in-house
florets1
1
180
観察研究における因果推論
nearme_tech
PRO
1
250
WCS-LA-2024
lcolladotor
0
240
Gemini Prompt Engineering: Practical Techniques for Tangible AI Outcomes
mfonobong
2
120
Featured
See All Featured
The Invisible Side of Design
smashingmag
299
50k
How GitHub (no longer) Works
holman
314
140k
Become a Pro
speakerdeck
PRO
28
5.4k
RailsConf 2023
tenderlove
30
1.1k
Cheating the UX When There Is Nothing More to Optimize - PixelPioneers
stephaniewalter
280
13k
[RailsConf 2023] Rails as a piece of cake
palkan
55
5.6k
Visualization
eitanlees
146
16k
Rebuilding a faster, lazier Slack
samanthasiow
81
9k
How to Think Like a Performance Engineer
csswizardry
24
1.7k
Understanding Cognitive Biases in Performance Measurement
bluesmoon
29
1.8k
The Art of Programming - Codeland 2020
erikaheidi
54
13k
Dealing with People You Can't Stand - Big Design 2015
cassininazir
367
26k
Transcript
HyperLogLogͰ লϝϞϦͳDistinct Query SDDษڧձ@r-n-i 2018/10/05
Distinct Query
ΫΤϦͷྫ • Distinct([‘A’, ‘B’]) = 2 • Distinct([‘A’, ‘B’, ‘C’,
‘A’, ‘C’]) = 3
[࣮ํ๏] SetʹಥͬࠐΜͰେ͖͞ΛऔΔ ;ͭ͏ͷ
SetʹಥͬࠐΉࡍͷ • σʔλྻͷαΠζ͕େ͖͍ͱਏ͍ • ͞ͷ͚ͩϝϞϦ৯͏
ΫΤϦͷྫ - ۩ମྫ • ϢʔβIDͷྻʹରͯ͠ɺϢχʔΫϢʔβ
ϢχʔΫϢʔβ… 35915ਓͰͨ͠ʂʂ
ϢχʔΫϢʔβ… 35915ਓͰͨ͠ʂʂ
ϢχʔΫϢʔβ… 35915ਓͰͨ͠ʂʂ ͜Ε͍Δʁ
ਖ਼֬ͳ ͦΜͳʹେࣄ͡Όͳ͍ ͜ͱ͋Δ
HashΛ༻͍ͨਪఆ
ϋογϡͷܭࢉ hash(AB) = 0x36f… = 0011 0110 1111 … hash(CD)
= 0xc90… = 1100 1001 0000 … hash(EF) = 0x01e… = 0000 0001 1110 …
ઌ಄ʹ͍ͭ͘ 0 ͕͍ͭͯΔʁ zero(hash(AB)) = zero(0011 0110 1111…) = 2
zero(hash(CD)) = zero(1100 1001 0000…) = 0 zero(hash(EF)) = zero(0000 0001 1110…) = 7
͜ΕΒͷ࠷େΛऔΔ D = max( zero(hash(AB)), zero(hash(CD)), zero(hash(EF)) ) = max(2,
0, 7) = 7
ٯʹ…
࠷େ͚ͩΘ͔ͬͯΔͱ͢Δ D = 7
ͭ·Γ… D = max(?, ?, …, 7, …, ?, ?)
zero(hash(?)) = zero(0000 0001 …) = 7
ͭ·Γ… D = max(?, ?, …, 7, …, ?, ?)
zero(hash(?)) = zero(0000 0001 …) = 7 ݁ߏϨΞʂ
ͲͷఔϨΞʁ D = max(?, ?, …, 7, …, ?, ?)
zero(hash(?)) = zero(0000 0001 …) = 7 1/2^7 = 1/128
ϢχʔΫͳHashΛ͍ͭ͘ݟͨʁ D = max(?, ?, …, 7, …, ?, ?)
1/2^7 = 1/128 ฏۉ128ݸʁ
Distinct ͳཁૉ(Hash)Λ େࡶʹ༧Ͱ͖Δ
HyperLogLog
HashͷඌͰৼΓ͚ hash(AB) = 0x36f… = 0011 0110 … 1010 D:
0 1 9 10 11 14 15 1 1 0 2 0 0 0 … …
େ͖͍Ͱߋ৽ʂ hash(AB) = 0x36f… = 0011 0110 … 1010 D:
0 1 9 10 11 14 15 2 1 0 2 0 0 0 … …
ཁૉͷਪఆ • Dͷঢ়گ͕ฏۉͲͷఔϨΞ͔ʁ • ௐฏۉʂ
ཁૉͷਪఆ 1 1 2 4 C × 4 × 4
1 22 + 1 21 + 1 21 + 1 24 ശ1ͭ͋ͨΓͷೱ
ن͕খ͍͞ͱޡࠩଟΊ
ۭؒܭࢉྔ(༻ϝϞϦ) • ༻ϝϞϦ: ܕͷྻ͚ͩʂ
ࢀߟจݙ • HyperLogLog in Practice: Algorithmic Engineering of a State
of The Art Cardinality Estimation Algorithm