Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Speaker Deck
PRO
Sign in
Sign up
for free
確率的データ構造を Java で扱いたい! #JJUG
KOMIYA Atsushi
August 23, 2017
Programming
6
1.8k
確率的データ構造を Java で扱いたい! #JJUG
JJUG ナイト・セミナー 「ビール片手にLT&納涼会 2017」 の発表資料です。
https://jjug.doorkeeper.jp/events/63719
KOMIYA Atsushi
August 23, 2017
Tweet
Share
More Decks by KOMIYA Atsushi
See All by KOMIYA Atsushi
komiya_atsushi
5
3.1k
komiya_atsushi
0
140
komiya_atsushi
1
10k
komiya_atsushi
4
2.5k
komiya_atsushi
1
2.7k
komiya_atsushi
4
15k
komiya_atsushi
5
1k
komiya_atsushi
0
580
komiya_atsushi
2
11k
Other Decks in Programming
See All in Programming
bkuhlmann
4
270
dbrumann
0
100
amaotone
15
7.8k
oleindesign
1
200
hackz_inc
0
310
panini
1
160
mrtc0
2
900
williln
0
210
mihyaeru21
0
350
danilop
0
220
pirosikick
4
930
numeroanddev
1
210
Featured
See All Featured
brad_frost
157
6.4k
thoeni
4
590
mthomps
39
2.3k
holman
461
280k
sferik
610
54k
destraynor
223
47k
bkeepers
408
58k
addyosmani
1347
190k
moore
125
21k
maltzj
501
36k
malarkey
119
16k
afnizarnur
176
14k
Transcript
֬తσʔλߏΛ Java Ͱѻ͍͍ͨʂ 2017-08-23 JJUG night seminar LT KOMIYA Atsushi
@komiya_atsushi
Today’s topic
֬తσʔλߏ
֬తσʔλߏͱʁ • ֬తಛੑΛར༻ͨ͠σʔλߏ • ͋ΔΛɺ࣌ؒతۭؒ͘͠తʹޮΑ͘ (≅লϝϞϦͰ) ղ͘͜ͱΛతͱ͢Δ • ࠓճʮۭؒޮͷΑ͍σʔλߏʯʹண •
σʔλߏʹΑͬͯɺݫີղͰͳۙ͘ࣅղ ͕ಘΒΕΔ͜ͱ͕͋Δ • ਫ਼ͱۭؒޮτϨʔυΦϑͷؔ
ͲΜͳͱ͖ʹ͏ͷ͔ʁ
ͲΜͳͱ͖ʹ͏ͷ͔ʁ • ϦΞϧλΠϜ͔ͭେྔʹൃੜ͢ΔσʔλΛ ΦϯϥΠϯͰॲཧ͍ͨ͠ • ϝϞϦʹऩ·Γ͖Βͳ͍େنͳσʔλΛ ඇྗͳ PC Ͱॲཧ͍ͨ͠ •
ࢄॲཧͰ͖Δڥ͕͋ΔͳΒɺ͋͑ͯ ֬తσʔλߏΛ͏ඞཁͳ͍
Java Ͱ ֬తσʔλߏΛѻ͏
ࣗલ࣮ʁ ϥΠϒϥϦ͏ʁ • ଟ͘ͷ֬తσʔλߏɺͦͷจ͕͙͙ ΕӾཡՄೳͳঢ়ଶͰ͙͢ʹݟ͔ͭΔ • ͦΕΛಡΜͰࣗલ࣮͢ΔͷΑ͠ • ҰํͰ Maven
central ʹ͍ͭ͘ͷطଘ࣮ ͕ଘࡏ͍ͯ͠Δ • ڊਓͷݞͷ্ʹཱͭͷ͕ݡ͍Γํ
֬తσʔλߏͷ Java ࣮ • stream-lib ‘com.addthis:stream-lib’ • Membership query /
cardinality estimation / frequency counting / quantile estimation • Google Guava ‘com.google.guava:guava’ • Membership query • java-hll ‘net.agkn:hll’ • Cardinality estimation • t-digest ‘com.tdunning:t-digest’ • Quantile estimation
֬తσʔλߏͷ Java ࣮ • stream-lib ‘com.addthis:stream-lib’ • Membership query /
cardinality estimation / frequency counting / quantile estimation • Google Guava ‘com.google.guava:guava’ • Membership query • java-hll ‘net.agkn:hll’ • Cardinality estimation • t-digest ‘com.tdunning:t-digest’ • Quantile estimation
stream-lib ʹΑΔ ֬తσʔλߏͷར༻ํ๏
http://bit.ly/JJUG-2017-08- probds-code
Membership query
ཁૉ͕ू߹ʹଐ͢Δ͔൱͔Λఆ͢Δ
ཁૉ͕ू߹ʹଐ͢Δ͔൱͔Λఆ͢Δ Set<T> Λ༻ҙͯ͠ Set#contains(T) Ͱଘ൱Λఆ͠ Set#add(T) Ͱू߹ʹཁૉΛՃ͢Δ
Bloom filter • ֬తʹؒҧͬͨ͑ʢଘ൱݁ՌʣΛฦ͢ • ِཅੑ (ଘࡏ͠ͳ͍ͷΛଘࡏ͢Δͱޡೝ͢ Δࣄ) ੜ͡Δ͕ɺِӄੑੜ͡ͳ͍ •
ʮఆ͞ΕΔཁૉͷछྨʯʮڐ༰Ͱ͖Δِ ཅੑͷ֬ʯΛࢦఆͯ͠ɺώʔϓ༻ྔΛ੍ޚ Ͱ͖Δ • ཁૉͷՃͰ͖Δ͕ɺআ͍͠
stream-lib ͷ Bloom filter
stream-lib ͷ Bloom filter ཁૉͱِཅੑ֬Λࢦఆͯ͠ BloomFilter Λ༻ҙ͠ BloomFilter#isPresent(String) Ͱଘ൱Λఆ Set
ͱಉ༷ʹ add() ͢Δ
ώʔϓ༻ྔΛ֬ೝͯ͠ΈΔ • “Lorem ipsum” ͷςΩετΛྫʹɺJOL (Java Object Layout) Ͱώʔϓ༻ྔΛଌఆ •
http://openjdk.java.net/projects/code- tools/jol/ • Set: 6,032 bytes • stream-lib BloomFilter: 136 bytes 97.8% smaller !
Cardinality estimation
ҟͳΓΛٻΊΔ
ҟͳΓΛٻΊΔ Set<T> Λ༻ҙ͠ɺ Set#add() Ͱͻͨ͢ΒಥͬࠐΉ Set#size() ͰҟͳΓ͕ಘΒΕΔ
HyperLogLog++ (1/2) • ҟͳΓΛਪఆ͢Δσʔλߏ • ಘΒΕΔਪఆɺຊདྷͷҟͳΓʹର্ͯ͠ৼΕɾԼৼ Εͱʹى͜Γ͏Δ • Redshift /
BigQuery / Presto ͳͲͰɺCOUNT(DISTINCT x) Λۙࣅ͢Δखஈͱͯ͠ΘΕ͍ͯΔ • https://aws.amazon.com/jp/about-aws/whats-new/ 2013/11/11/amazon-redshift-new-performance-data- loading-security-features/ • https://cloud.google.com/blog/big-data/2017/07/ counting-uniques-faster-in-bigquery-with-hyperloglog
HyperLogLog++ (2/2) • ʮਪఆͷਫ਼ pʯΛௐ͢Δ͜ͱͰɺώʔϓ༻ྔΛ੍ ޚ͢Δ͜ͱ͕Ͱ͖Δ • Λେ͖͘͢Δͱਫ਼͕ߴ͘ͳΔ & ۭؒޮѱԽ͢Δ
• ఆ͞ΕΔҟͳΓඞཁͱ͞ΕΔਫ਼ɺώʔϓͷ੍ Λߟྀͯ͠ p Λܾఆ͢Δ • HyperLogLog ͷΈΛཧղ͢ΔʹɺҎԼͷϒϩάΤϯ τϦ͕͓͢͢Ί • http://blog.brainpad.co.jp/entry/2016/06/27/110000
stream-lib ͷ HyperLogLog++
stream-lib ͷ HyperLogLog++ ਫ਼Λࢦఆͯ͠ HyperLogLogPlus() Λ༻ҙ͢Δ HyperLogLogPlus#offer() ͰཁૉΛՃ͍ͯ͘͠ HyperLogLogPlus#cardinality() ͰҟͳΓ͕ಘΒΕΔ
Frequency counting
ཁૉͷසΛ্͑͛Δ
ཁૉͷසΛ্͑͛Δ Map Ͱཁૉ͝ͱͷΧϯλΛදݱ͢Δ ͻͨ͢Βཁૉ͝ͱʹ্͑͛Δ
Count-min sketch (1/2) • ཁૉͷසΛਪఆ͢ΔσʔλߏͷҰͭ • ࣮ࡍͷසΑΓେ͖͍ਪఆΛฦ͢͜ͱ͕ ͋ΔҰํͰɺখ͍͞ਪఆΛฦ͢͜ͱͳ͍ • ස͕খ͍͞ཁૉ΄Ͳɺ͜ͷόΠΞεͷӨ
ڹΛड͚͘͢ͳΔ
Count-min sketch (2/2) • width ͱ depth ͷೋͭͷύϥϝʔλͰɺۭؒ ޮਫ਼Λ੍ޚ͢Δ •
width * depth ͷݸͷΧϯλ͕࡞ΒΕΔ • Χϯλ 2࣍ݩྻͰදݱ • depth ͷ͚ͩϋογϡ͕࣮ؔߦ͞ΕΔͷ ͰɺతͳύϑΥʔϚϯεʹӨڹΛ༩͑Δ
stream-lib ͷ Count-min sketch
stream-lib ͷ Count-min sketch width:10 * depth:30 ͷΧϯλʹΑΔ Count-Min sketch
Λ༻ҙ͢Δ CountMinSketch#add(String, int) ͰΧϯτ͍ͯ͘͠
Quantile estimation
ύʔηϯλΠϧΛٻΊΔ
ύʔηϯλΠϧΛٻΊΔ ιʔτ͞Εͨঢ়ଶͰྻԽ͢Δ ͋ͱ n ύʔηϯλΠϧΛࢀর͢Δ͚ͩ
t-digest • ྻͷҐΛਪఆ͢Δσʔλߏ • ܦݧΛۙࣅతʹදݱ͢Δ • ύʔηϯλΠϧɺ͜ͷܦݧͷۙࣅදݱ͔Βૠ Λ༻͍ͯࢉग़͞ΕΔ • ʮѹॖύϥϝʔλʯʹΑͬͯɺਫ਼ͱۭؒޮͷτϨʔυ
ΦϑΛௐ͢Δ • Λେ͖͘͢Δ͜ͱͰɺਫ਼ΛߴΊΔ͜ͱ͕Ͱ͖Δ
stream-lib ͷ t-digest
stream-lib ͷ t-digest ѹॖύϥϝʔλΛࢦఆͯ͠ TDigest Λ༻ҙ͢Δ TDigest#add(double) ͰΛՃ͍ͯ͘͠ TDigest#quantile(double) ͰύʔηϯλΠϧΛಘΔ
·ͱΊ
·ͱΊ • ֬తσʔλߏΛ༻͍Δ͜ͱͰɺେنσʔλॲཧ ΦϯϥΠϯॲཧΛޮతʹ࣮ݱͰ͖Δʢ͔ʣ • Java Ͱ֬తσʔλߏΛ͓खܰʹѻ͍͍ͨͳΒɺ ·ͣstream-lib ͷར༻Λݕ౼ͯ͠ΈΔ •
ਪఆਫ਼ͱۭؒޮͷτϨʔυΦϑΛ੍ޚ͢Δ ύϥϝʔλͷௐɺ৬ਓܳʹͳΓ͕ͪ • JOL JMH Λ༻͍ͯɺ࣮ࡍͷۭؒޮͱ࣌ؒޮΛ ͖ͪΜͱଌఆ͠ͳ͕Βௐ͢Δ͜ͱΛ͓͢͢Ί͍ͨ͠
Thank you!