Distinct Query using HyperLogLog
hama_du
October 05, 2018
Understanding the intuition behind HyperLogLog, using a Distinct Query as the example
Transcript
Memory-efficient Distinct Query with HyperLogLog, SDD study session @r-n-i, 2018/10/05
Distinct Query
Query examples • Distinct(['A', 'B']) = 2 • Distinct(['A', 'B', 'C', 'A', 'C']) = 3
[Implementation] Put everything into a Set and take its size (the usual way)
Problems with putting everything into a Set • Painful when the data sequence is large • Consumes memory in proportion to its length
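The Set-based approach from the slides can be sketched in Python as follows. The function name `distinct_exact` is made up for illustration. The set keeps one entry per distinct value, which is where the memory cost comes from.

```python
def distinct_exact(items):
    """Exact distinct count: store every distinct item in a set, then take its size."""
    seen = set()
    for item in items:
        seen.add(item)  # memory grows with the number of distinct items
    return len(seen)


print(distinct_exact(['A', 'B']))
print(distinct_exact(['A', 'B', 'C', 'A', 'C']))
```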
Query example: a concrete case • For a sequence of user IDs, count the unique users
Unique users… there were 35915!! Do we really need this?
Sometimes the exact value is not all that important
Estimation using a hash
Computing the hashes
hash(AB) = 0x36f… = 0011 0110 1111 …
hash(CD) = 0xc90… = 1100 1001 0000 …
hash(EF) = 0x01e… = 0000 0001 1110 …
How many 0s are at the front?
zero(hash(AB)) = zero(0011 0110 1111…) = 2
zero(hash(CD)) = zero(1100 1001 0000…) = 0
zero(hash(EF)) = zero(0000 0001 1110…) = 7
Take the maximum of these
D = max(zero(hash(AB)), zero(hash(CD)), zero(hash(EF))) = max(2, 0, 7) = 7
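The leading-zero count and the maximum D can be reproduced with a small sketch. The bit strings are the prefixes shown on the slides (the elided tails are left out), and `leading_zeros` is a hypothetical helper name, not something the slides define.

```python
def leading_zeros(bits: str) -> int:
    """Count the '0' characters that appear before the first '1'."""
    bits = bits.replace(" ", "")
    return len(bits) - len(bits.lstrip("0"))


# Hash prefixes copied from the slides; the real hashes continue past these bits.
hashes = {
    "AB": "0011 0110 1111",
    "CD": "1100 1001 0000",
    "EF": "0000 0001 1110",
}

zeros = {key: leading_zeros(bits) for key, bits in hashes.items()}
D = max(zeros.values())
print(zeros, D)
```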
Conversely…
Suppose we only know the maximum: D = 7
In other words… D = max(?, ?, …, 7, …, ?, ?)
zero(hash(?)) = zero(0000 0001 …) = 7: pretty rare!
How rare? 1/2^7 = 1/128
How many unique hashes have we seen? D = max(?, ?, …, 7, …, ?, ?)
1/2^7 = 1/128, so about 128 on average?
We can roughly predict the number of distinct elements (hashes)
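The single-maximum idea above (a maximum of D leading zeros suggests roughly 2^D distinct hashes) might look like the sketch below. The slides do not name a hash function, so SHA-1 truncated to 64 bits is an assumption, and `hash64` and `rough_estimate` are hypothetical names. Duplicates hash to the same value, so they never change the maximum.

```python
import hashlib

HASH_BITS = 64  # assumption: keep only the first 64 bits of SHA-1


def hash64(item: str) -> int:
    """Map an item to a 64-bit integer (stand-in for the slides' hash)."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def leading_zeros(value: int, width: int = HASH_BITS) -> int:
    """Number of zero bits before the first one bit in a fixed-width value."""
    return width - value.bit_length()


def rough_estimate(items) -> int:
    """Very coarse distinct-count guess: 2 ** (largest leading-zero count)."""
    d = max((leading_zeros(hash64(x)) for x in items), default=0)
    return 2 ** d


users = [f"user{i}" for i in range(1000)]
print(rough_estimate(users), rough_estimate(users * 3))
```

The result is always a power of two and can be far off, which is the motivation for splitting the hashes into many boxes.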
HyperLogLog
Distribute by the tail of the hash
hash(AB) = 0x36f… = 0011 0110 … 1010
D (index): 0 1 … 9 10 11 … 14 15
D (value): 1 1 … 0 2 0 … 0 0
Update with the larger value!
hash(AB) = 0x36f… = 0011 0110 … 1010
D (index): 0 1 … 9 10 11 … 14 15
D (value): 2 1 … 0 2 0 … 0 0
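The box update can be sketched as below, assuming 16 boxes selected by the last 4 bits of the hash, as the slide suggests. The 16-bit value `0b0011_0110_1111_1010` is a made-up toy hash that only shares its leading bits and trailing `1010` with the slide; the real hash is elided there. `update` is a hypothetical name.

```python
NUM_BOXES = 16   # 4 index bits, matching indices 0..15 on the slide
HASH_BITS = 16   # assumption: toy hash width, for illustration only


def update(boxes, h):
    """Route hash h to a box by its low 4 bits and keep the larger zero count."""
    index = h & (NUM_BOXES - 1)            # tail of the hash picks the box
    zeros = HASH_BITS - h.bit_length()     # leading zeros at the front
    boxes[index] = max(boxes[index], zeros)
    return index, zeros


boxes = [0] * NUM_BOXES
first = update(boxes, 0b0011_0110_1111_1010)   # toy value, see note above
second = update(boxes, 0b1100_0000_0000_1010)  # same box, fewer leading zeros
print(first, second, boxes[10])
```

The second call lands in the same box but does not overwrite it, because only a larger value replaces the stored one.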
Estimating the number of elements • How rare, on average, is the state of D? • Harmonic mean!
Estimating the number of elements
Boxes: 1 1 2 4
C × 4 × 4 / (1/2^2 + 1/2^1 + 1/2^1 + 1/2^4)
(the fraction 4 / (…) is the density per box)
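The harmonic-mean estimate for the four boxes on the slide can be evaluated with a short sketch. The slides do not give a value for the constant C, so `c=1.0` is only a placeholder, and `hll_estimate` is a hypothetical name.

```python
def hll_estimate(boxes, c=1.0):
    """C * m * (harmonic mean of 2**box over the m boxes)."""
    m = len(boxes)
    harmonic_mean = m / sum(2.0 ** -b for b in boxes)  # density per box
    return c * m * harmonic_mean


print(hll_estimate([1, 1, 2, 4]))
```

With the placeholder C = 1 this comes to roughly 12.19. The harmonic mean keeps a single unusually large box (the 4 here) from dominating the result.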
The error tends to be larger when the scale is small
Space complexity (memory used) • Memory used: just a typed array!
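To illustrate the memory claim, the sketch below keeps the boxes in a fixed-size `bytearray` and compares that with an exact set. The class name `Boxes`, the 64-bit truncated SHA-1 hash, and the one-byte-per-box layout are all assumptions made for this example, not details from the slides.

```python
import hashlib


class Boxes:
    """Fixed-size array of boxes; its length never depends on the input size."""

    def __init__(self, index_bits: int = 4):
        self.m = 1 << index_bits
        self.data = bytearray(self.m)  # one byte per box

    def add(self, item: str) -> None:
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")   # assumed 64-bit hash
        index = h & (self.m - 1)                # tail of the hash picks the box
        zeros = 64 - h.bit_length()             # leading zeros, at most 64
        if zeros > self.data[index]:
            self.data[index] = zeros


boxes = Boxes(index_bits=4)
exact = set()
for i in range(10000):
    user = f"user{i}"
    boxes.add(user)
    exact.add(user)

print(len(boxes.data), len(exact))
```

The set ends up holding every distinct user, while the box array stays at 16 bytes no matter how many items are added.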
References • HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm