Distinct Queryを例にHyperLogLogのお気持ちを理解する
HyperLogLogͰলϝϞϦͳDistinct QuerySDDษڧձ@r-n-i2018/10/05
View Slide
Distinct Query
ΫΤϦͷྫ• Distinct([‘A’, ‘B’]) = 2• Distinct([‘A’, ‘B’, ‘C’, ‘A’, ‘C’]) = 3
[࣮ํ๏]SetʹಥͬࠐΜͰେ͖͞ΛऔΔ;ͭ͏ͷ
SetʹಥͬࠐΉࡍͷ• σʔλྻͷαΠζ͕େ͖͍ͱਏ͍• ͞ͷ͚ͩϝϞϦ৯͏
ΫΤϦͷྫ - ۩ମྫ• ϢʔβIDͷྻʹରͯ͠ɺϢχʔΫϢʔβ
ϢχʔΫϢʔβ…35915ਓͰͨ͠ʂʂ
ϢχʔΫϢʔβ…35915ਓͰͨ͠ʂʂ͜Ε͍Δʁ
ਖ਼֬ͳͦΜͳʹେࣄ͡Όͳ͍͜ͱ͋Δ
HashΛ༻͍ͨਪఆ
ϋογϡͷܭࢉhash(AB) = 0x36f… = 0011 0110 1111 …hash(CD) = 0xc90… = 1100 1001 0000 …hash(EF) = 0x01e… = 0000 0001 1110 …
ઌ಄ʹ͍ͭ͘ 0 ͕͍ͭͯΔʁzero(hash(AB))= zero(0011 0110 1111…)= 2zero(hash(CD))= zero(1100 1001 0000…)= 0zero(hash(EF))= zero(0000 0001 1110…)= 7
͜ΕΒͷ࠷େΛऔΔD = max(zero(hash(AB)),zero(hash(CD)),zero(hash(EF)))= max(2, 0, 7)= 7
ٯʹ…
࠷େ͚ͩΘ͔ͬͯΔͱ͢ΔD = 7
ͭ·Γ…D = max(?, ?, …, 7, …, ?, ?)zero(hash(?))= zero(0000 0001 …)= 7
ͭ·Γ…D = max(?, ?, …, 7, …, ?, ?)zero(hash(?))= zero(0000 0001 …)= 7݁ߏϨΞʂ
ͲͷఔϨΞʁD = max(?, ?, …, 7, …, ?, ?)zero(hash(?))= zero(0000 0001 …)= 71/2^7 = 1/128
ϢχʔΫͳHashΛ͍ͭ͘ݟͨʁD = max(?, ?, …, 7, …, ?, ?)1/2^7 = 1/128ฏۉ128ݸʁ
Distinct ͳཁૉ(Hash)Λେࡶʹ༧Ͱ͖Δ
HyperLogLog
HashͷඌͰৼΓ͚hash(AB) = 0x36f… = 0011 0110 … 1010D:0 1 9 10 11 14 151102 0 0 0… …
େ͖͍Ͱߋ৽ʂhash(AB) = 0x36f… = 0011 0110 … 1010D:0 1 9 10 11 14 152102 0 0 0… …
ཁૉͷਪఆ• Dͷঢ়گ͕ฏۉͲͷఔϨΞ͔ʁ• ௐฏۉʂ
ཁૉͷਪఆ112 4C × 4 ×4122+ 121+ 121+ 124ശ1ͭ͋ͨΓͷೱ
ن͕খ͍͞ͱޡࠩଟΊ
ۭؒܭࢉྔ(༻ϝϞϦ)• ༻ϝϞϦ: ܕͷྻ͚ͩʂ
ࢀߟจݙ• HyperLogLog in Practice: AlgorithmicEngineering of a State of The Art CardinalityEstimation Algorithm