Distinct Query using HyperLogLog
hama_du
October 05, 2018
Understanding the intuition behind HyperLogLog, using Distinct Query as an example
Transcript
Memory-Efficient Distinct Query with HyperLogLog
SDD Study Group @ r-n-i, 2018/10/05
Distinct Query
Query examples
• Distinct(['A', 'B']) = 2
• Distinct(['A', 'B', 'C', 'A', 'C']) = 3
[Implementation] The ordinary way: put everything into a Set and take its size
Problems with putting everything into a Set
• Painful when the data sequence is large
• Eats memory in proportion to that size
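The Set-based approach can be sketched in Python as follows. This is a minimal illustration rather than the deck's own code, and the function name `distinct_exact` is hypothetical.

```python
def distinct_exact(items):
    """Count distinct items exactly by collecting them into a set."""
    seen = set()
    for item in items:
        seen.add(item)  # the set keeps one entry per distinct item
    return len(seen)

print(distinct_exact(['A', 'B']))
print(distinct_exact(['A', 'B', 'C', 'A', 'C']))
```

The set holds every distinct item at once, which is where the memory cost comes from.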
Query example: a concrete case
• Given a sequence of user IDs, count the unique users
Unique users... there were 35915!!
Do we need this?
Sometimes the exact value is not that important
Estimation using hashes
Computing hashes
hash(AB) = 0x36f… = 0011 0110 1111 …
hash(CD) = 0xc90… = 1100 1001 0000 …
hash(EF) = 0x01e… = 0000 0001 1110 …
How many 0s are at the front?
zero(hash(AB)) = zero(0011 0110 1111…) = 2
zero(hash(CD)) = zero(1100 1001 0000…) = 0
zero(hash(EF)) = zero(0000 0001 1110…) = 7
Take the maximum of these
D = max(zero(hash(AB)), zero(hash(CD)), zero(hash(EF))) = max(2, 0, 7) = 7
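A small sketch of the `zero` step and the maximum, treating each hash as a fixed-width bit string. The full hashes are elided in the example, so this uses only the 12 visible bits of each one; passing the width as an explicit `bits` parameter is an assumption of this sketch.

```python
def zero(value: int, bits: int) -> int:
    """Number of leading 0 bits when value is written with `bits` binary digits."""
    if value < 0 or value >= (1 << bits):
        raise ValueError("value does not fit in the given width")
    return bits - value.bit_length()

# Only the first 12 bits of each hash are visible, so work with those prefixes.
prefixes = {"AB": 0x36F, "CD": 0xC90, "EF": 0x01E}
zeros = {key: zero(prefix, 12) for key, prefix in prefixes.items()}
D = max(zeros.values())
print(zeros, D)
```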
Conversely...
Suppose we know only the maximum: D = 7
That means...
D = max(?, ?, …, 7, …, ?, ?)
zero(hash(?)) = zero(0000 0001 …) = 7
Pretty rare!
How rare?
D = max(?, ?, …, 7, …, ?, ?)
zero(hash(?)) = zero(0000 0001 …) = 7
1/2^7 = 1/128
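One way to see where 1/2^7 comes from, assuming each hash bit behaves like an independent fair coin flip (an idealization of a good hash function): the first seven bits must all be 0, so

$$\Pr[\text{zero}(h) \ge 7] = \left(\frac{1}{2}\right)^{7} = \frac{1}{128}$$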
How many unique hashes did we see?
D = max(?, ?, …, 7, …, ?, ?)
1/2^7 = 1/128
About 128 on average?
The number of distinct elements (hashes) can be roughly estimated
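This rough estimate can be sketched as a single-value estimator: track only the largest zero count D and report 2^D. The hash function (BLAKE2b truncated to 32 bits via `hashlib`) and the names `hash32` and `rough_distinct` are assumptions of this sketch, not taken from the deck.

```python
import hashlib

HASH_BITS = 32  # assumption: work with 32-bit hashes

def hash32(item: str) -> int:
    """Hash a string to a 32-bit integer (assumed hash, not the deck's)."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=4).digest()
    return int.from_bytes(digest, "big")

def zero(value: int, bits: int = HASH_BITS) -> int:
    """Number of leading 0 bits in a `bits`-wide value."""
    return bits - value.bit_length()

def rough_distinct(items) -> int:
    """Keep only the largest zero count D and report 2**D."""
    d = 0
    for item in items:
        d = max(d, zero(hash32(item)))
    return 2 ** d

# 5000 user IDs, of which 1000 are distinct.
estimate = rough_distinct(f"user-{i % 1000}" for i in range(5000))
print(estimate)
```

The result is always a power of two and depends on a single lucky hash, so one run can be far off. Repeated items hash to the same value and never change D.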
HyperLogLog
Distribute by the tail of the hash
hash(AB) = 0x36f… = 0011 0110 … 1010
D:
0 1 … 9 10 11 … 14 15
1 1 … 0 2 0 … 0 0
Update with the larger value!
hash(AB) = 0x36f… = 0011 0110 … 1010
D:
0 1 … 9 10 11 … 14 15
2 1 … 0 2 0 … 0 0
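The bucketed update might look like the sketch below. It assumes 16 buckets selected by the last 4 bits of the hash, matching the indices 0 to 15, and, as a further assumption, counts leading zeros over the remaining high bits. The 16-bit sample values are hypothetical: `0x36FA` only reuses the visible head `0011 0110` and tail `1010` of the elided hash.

```python
BUCKET_BITS = 4
NUM_BUCKETS = 1 << BUCKET_BITS  # 16 buckets, indices 0..15

def update(registers: list, hash_value: int, hash_bits: int) -> None:
    """Route a hash to a bucket by its tail bits and keep the larger zero count."""
    index = hash_value & (NUM_BUCKETS - 1)   # last 4 bits choose the bucket
    head = hash_value >> BUCKET_BITS         # remaining high bits
    zeros = (hash_bits - BUCKET_BITS) - head.bit_length()
    registers[index] = max(registers[index], zeros)

D = [0] * NUM_BUCKETS
update(D, 0x36FA, 16)  # head 0011 0110 1111, tail 1010 -> bucket 10, 2 zeros
print(D)
```

A later hash landing in the same bucket only changes the stored value if its zero count is larger.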
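Estimating the number of elements
• How rare is the state of D on average?
• Harmonic mean!
Estimating the number of elements
D: 1 1 2 4

$$C \times 4 \times \underbrace{\frac{4}{\frac{1}{2^{2}} + \frac{1}{2^{1}} + \frac{1}{2^{1}} + \frac{1}{2^{4}}}}_{\text{cardinality per bucket}}$$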
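The formula can be written as a short function: the harmonic mean of 2^D over the buckets, multiplied by the number of buckets and a constant C. The deck does not give a value for C, so it is left as a parameter here; `c=1.0` in the example is only a placeholder.

```python
def estimate(registers, c: float) -> float:
    """c * m * harmonic mean of 2**d over the m buckets."""
    m = len(registers)
    harmonic_mean = m / sum(2.0 ** -d for d in registers)  # per-bucket cardinality
    return c * m * harmonic_mean

# The four bucket values from the example, with a placeholder constant.
print(estimate([2, 1, 1, 4], c=1.0))
```

With `c=1.0` this evaluates 16 / 1.3125, roughly 12.19.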
Error is larger when the scale is small
Space complexity (memory used)
• Memory used: just a single typed array!
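Putting the pieces together, the sketch below keeps only a fixed-size `bytearray` regardless of how many items arrive. Several choices are assumptions of this sketch and not from the deck: 64-bit BLAKE2b hashes, 1024 buckets, and the constant `C`, which uses the bias constant commonly quoted for HyperLogLog, doubled because this sketch stores the zero count instead of the zero count plus one.

```python
import hashlib

HASH_BITS = 64
BUCKET_BITS = 10                       # assumption: 2**10 = 1024 buckets
NUM_BUCKETS = 1 << BUCKET_BITS
# Assumption: common HyperLogLog bias constant (for 128 or more buckets),
# doubled because registers hold the zero count, not zero count + 1.
C = 2 * 0.7213 / (1 + 1.079 / NUM_BUCKETS)

def hash64(item: str) -> int:
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def add(registers: bytearray, item: str) -> None:
    h = hash64(item)
    index = h & (NUM_BUCKETS - 1)      # tail bits choose the bucket
    head = h >> BUCKET_BITS            # remaining 54 high bits
    zeros = (HASH_BITS - BUCKET_BITS) - head.bit_length()
    if zeros > registers[index]:
        registers[index] = zeros       # at most 54, so it fits in one byte

def estimate(registers: bytearray) -> float:
    m = len(registers)
    return C * m * (m / sum(2.0 ** -d for d in registers))

registers = bytearray(NUM_BUCKETS)     # 1024 bytes in total
for i in range(100_000):
    add(registers, f"user-{i % 50_000}")   # 50,000 distinct IDs, each seen twice
approx = estimate(registers)
print(len(registers), round(approx))
```

An exact set would hold all 50,000 IDs, while the registers stay at 1024 bytes. With fewer buckets or far fewer items the estimate becomes noticeably less reliable, and this sketch omits the small-range correction that full implementations add.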
References
• HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm