Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
確率的データ構造を Java で扱いたい! #JJUG
Search
KOMIYA Atsushi
August 23, 2017
Programming
6
2.1k
確率的データ構造を Java で扱いたい! #JJUG
JJUG ナイト・セミナー 「ビール片手にLT&納涼会 2017」 の発表資料です。
https://jjug.doorkeeper.jp/events/63719
KOMIYA Atsushi
August 23, 2017
Tweet
Share
More Decks by KOMIYA Atsushi
See All by KOMIYA Atsushi
#JJUG Java における乱数生成器とのつき合い方
komiya_atsushi
5
4.8k
#JJUG Fork/Join フレームワークを効率的に正しく使いたい
komiya_atsushi
0
410
[#JSUG] SmartNews における container friendly な Spring Boot アプリケーション開発
komiya_atsushi
1
11k
Java のデータ圧縮ライブラリを極める #jjug_ccc #ccc_c7
komiya_atsushi
4
4.2k
#devsumi 自然言語処理・機械学習によるファクトチェック業務の支援
komiya_atsushi
1
4k
SmartNews Ads における機械学習の活用とその運用 #mlops
komiya_atsushi
3
19k
GBDT によるクリック率予測を高速化したい #オレシカナイト vol.4
komiya_atsushi
5
1.2k
Maven central repository の artifact をランキングする #渋谷java
komiya_atsushi
0
1.1k
High-performance Jackson #渋谷Java
komiya_atsushi
2
16k
Other Decks in Programming
See All in Programming
Documentation for users with AsciiDoc and Antora
ahus1
0
370
大規模UIKitベースアプリへのTCAの段階的導入/gradual-adoption-of-tca-in-a-large-scale-uikit-based-app
takehilo
2
210
Micro Frontends for Java Microservices - Utah JUG 2024
mraible
PRO
1
110
Milestoner
bkuhlmann
1
410
Elm Form Validation
bkuhlmann
0
510
Goのmultiple errorsについて (2024年4月版)
syumai
4
1.2k
CREってこういうこと? 体験入社 - 提案資料 - / what-is-cre-trial-employment
shinden
1
530
Introducing Kotlin Multiplatform in an existing mobile app - Workshop Edition | AndroidMakers Paris
prof18
0
160
Deep Dive into React Stream/Serialize
mugi_uno
3
700
業務ツールとして使うPostman
msys75
0
110
単体テストを書かない技術 #phpcon_odawara
o0h
PRO
27
8.5k
Apache Hive 4 on Treasure Data
ryukobayashi
1
420
Featured
See All Featured
RailsConf & Balkan Ruby 2019: The Past, Present, and Future of Rails at GitHub
eileencodes
126
32k
Principles of Awesome APIs and How to Build Them.
keavy
121
16k
How to Create Impact in a Changing Tech Landscape [PerfNow 2023]
tammyeverts
21
1.6k
5 minutes of I Can Smell Your CMS
philhawksworth
199
19k
Imperfection Machines: The Place of Print at Facebook
scottboms
261
12k
Side Projects
sachag
451
41k
What’s in a name? Adding method to the madness
productmarketing
PRO
17
2.7k
For a Future-Friendly Web
brad_frost
172
9k
Code Reviewing Like a Champion
maltzj
515
39k
Atom: Resistance is Futile
akmur
260
25k
Infographics Made Easy
chrislema
238
18k
Producing Creativity
orderedlist
PRO
338
39k
Transcript
֬తσʔλߏΛ Java Ͱѻ͍͍ͨʂ 2017-08-23 JJUG night seminar LT KOMIYA Atsushi
@komiya_atsushi
Today’s topic
֬తσʔλߏ
֬తσʔλߏͱʁ • ֬తಛੑΛར༻ͨ͠σʔλߏ • ͋ΔΛɺ࣌ؒతۭؒ͘͠తʹޮΑ͘ (≅লϝϞϦͰ) ղ͘͜ͱΛతͱ͢Δ • ࠓճʮۭؒޮͷΑ͍σʔλߏʯʹண •
σʔλߏʹΑͬͯɺݫີղͰͳۙ͘ࣅղ ͕ಘΒΕΔ͜ͱ͕͋Δ • ਫ਼ͱۭؒޮτϨʔυΦϑͷؔ
ͲΜͳͱ͖ʹ͏ͷ͔ʁ
ͲΜͳͱ͖ʹ͏ͷ͔ʁ • ϦΞϧλΠϜ͔ͭେྔʹൃੜ͢ΔσʔλΛ ΦϯϥΠϯͰॲཧ͍ͨ͠ • ϝϞϦʹऩ·Γ͖Βͳ͍େنͳσʔλΛ ඇྗͳ PC Ͱॲཧ͍ͨ͠ •
ࢄॲཧͰ͖Δڥ͕͋ΔͳΒɺ͋͑ͯ ֬తσʔλߏΛ͏ඞཁͳ͍
Java Ͱ ֬తσʔλߏΛѻ͏
ࣗલ࣮ʁ ϥΠϒϥϦ͏ʁ • ଟ͘ͷ֬తσʔλߏɺͦͷจ͕͙͙ ΕӾཡՄೳͳঢ়ଶͰ͙͢ʹݟ͔ͭΔ • ͦΕΛಡΜͰࣗલ࣮͢ΔͷΑ͠ • ҰํͰ Maven
central ʹ͍ͭ͘ͷطଘ࣮ ͕ଘࡏ͍ͯ͠Δ • ڊਓͷݞͷ্ʹཱͭͷ͕ݡ͍Γํ
֬తσʔλߏͷ Java ࣮ • stream-lib ‘com.addthis:stream-lib’ • Membership query /
cardinality estimation / frequency counting / quantile estimation • Google Guava ‘com.google.guava:guava’ • Membership query • java-hll ‘net.agkn:hll’ • Cardinality estimation • t-digest ‘com.tdunning:t-digest’ • Quantile estimation
֬తσʔλߏͷ Java ࣮ • stream-lib ‘com.addthis:stream-lib’ • Membership query /
cardinality estimation / frequency counting / quantile estimation • Google Guava ‘com.google.guava:guava’ • Membership query • java-hll ‘net.agkn:hll’ • Cardinality estimation • t-digest ‘com.tdunning:t-digest’ • Quantile estimation
stream-lib ʹΑΔ ֬తσʔλߏͷར༻ํ๏
http://bit.ly/JJUG-2017-08- probds-code
Membership query
ཁૉ͕ू߹ʹଐ͢Δ͔൱͔Λఆ͢Δ
ཁૉ͕ू߹ʹଐ͢Δ͔൱͔Λఆ͢Δ Set<T> Λ༻ҙͯ͠ Set#contains(T) Ͱଘ൱Λఆ͠ Set#add(T) Ͱू߹ʹཁૉΛՃ͢Δ
Bloom filter • ֬తʹؒҧͬͨ͑ʢଘ൱݁ՌʣΛฦ͢ • ِཅੑ (ଘࡏ͠ͳ͍ͷΛଘࡏ͢Δͱޡೝ͢ Δࣄ) ੜ͡Δ͕ɺِӄੑੜ͡ͳ͍ •
ʮఆ͞ΕΔཁૉͷछྨʯʮڐ༰Ͱ͖Δِ ཅੑͷ֬ʯΛࢦఆͯ͠ɺώʔϓ༻ྔΛ੍ޚ Ͱ͖Δ • ཁૉͷՃͰ͖Δ͕ɺআ͍͠
stream-lib ͷ Bloom filter
stream-lib ͷ Bloom filter ཁૉͱِཅੑ֬Λࢦఆͯ͠ BloomFilter Λ༻ҙ͠ BloomFilter#isPresent(String) Ͱଘ൱Λఆ Set
ͱಉ༷ʹ add() ͢Δ
ώʔϓ༻ྔΛ֬ೝͯ͠ΈΔ • “Lorem ipsum” ͷςΩετΛྫʹɺJOL (Java Object Layout) Ͱώʔϓ༻ྔΛଌఆ •
http://openjdk.java.net/projects/code- tools/jol/ • Set: 6,032 bytes • stream-lib BloomFilter: 136 bytes 97.8% smaller !
Cardinality estimation
ҟͳΓΛٻΊΔ
ҟͳΓΛٻΊΔ Set<T> Λ༻ҙ͠ɺ Set#add() Ͱͻͨ͢ΒಥͬࠐΉ Set#size() ͰҟͳΓ͕ಘΒΕΔ
HyperLogLog++ (1/2) • ҟͳΓΛਪఆ͢Δσʔλߏ • ಘΒΕΔਪఆɺຊདྷͷҟͳΓʹର্ͯ͠ৼΕɾԼৼ Εͱʹى͜Γ͏Δ • Redshift /
BigQuery / Presto ͳͲͰɺCOUNT(DISTINCT x) Λۙࣅ͢Δखஈͱͯ͠ΘΕ͍ͯΔ • https://aws.amazon.com/jp/about-aws/whats-new/ 2013/11/11/amazon-redshift-new-performance-data- loading-security-features/ • https://cloud.google.com/blog/big-data/2017/07/ counting-uniques-faster-in-bigquery-with-hyperloglog
HyperLogLog++ (2/2) • ʮਪఆͷਫ਼ pʯΛௐ͢Δ͜ͱͰɺώʔϓ༻ྔΛ੍ ޚ͢Δ͜ͱ͕Ͱ͖Δ • Λେ͖͘͢Δͱਫ਼͕ߴ͘ͳΔ & ۭؒޮѱԽ͢Δ
• ఆ͞ΕΔҟͳΓඞཁͱ͞ΕΔਫ਼ɺώʔϓͷ੍ Λߟྀͯ͠ p Λܾఆ͢Δ • HyperLogLog ͷΈΛཧղ͢ΔʹɺҎԼͷϒϩάΤϯ τϦ͕͓͢͢Ί • http://blog.brainpad.co.jp/entry/2016/06/27/110000
stream-lib ͷ HyperLogLog++
stream-lib ͷ HyperLogLog++ ਫ਼Λࢦఆͯ͠ HyperLogLogPlus() Λ༻ҙ͢Δ HyperLogLogPlus#offer() ͰཁૉΛՃ͍ͯ͘͠ HyperLogLogPlus#cardinality() ͰҟͳΓ͕ಘΒΕΔ
Frequency counting
ཁૉͷසΛ্͑͛Δ
ཁૉͷසΛ্͑͛Δ Map Ͱཁૉ͝ͱͷΧϯλΛදݱ͢Δ ͻͨ͢Βཁૉ͝ͱʹ্͑͛Δ
Count-min sketch (1/2) • ཁૉͷසΛਪఆ͢ΔσʔλߏͷҰͭ • ࣮ࡍͷසΑΓେ͖͍ਪఆΛฦ͢͜ͱ͕ ͋ΔҰํͰɺখ͍͞ਪఆΛฦ͢͜ͱͳ͍ • ස͕খ͍͞ཁૉ΄Ͳɺ͜ͷόΠΞεͷӨ
ڹΛड͚͘͢ͳΔ
Count-min sketch (2/2) • width ͱ depth ͷೋͭͷύϥϝʔλͰɺۭؒ ޮਫ਼Λ੍ޚ͢Δ •
width * depth ͷݸͷΧϯλ͕࡞ΒΕΔ • Χϯλ 2࣍ݩྻͰදݱ • depth ͷ͚ͩϋογϡ͕࣮ؔߦ͞ΕΔͷ ͰɺతͳύϑΥʔϚϯεʹӨڹΛ༩͑Δ
stream-lib ͷ Count-min sketch
stream-lib ͷ Count-min sketch width:10 * depth:30 ͷΧϯλʹΑΔ Count-Min sketch
Λ༻ҙ͢Δ CountMinSketch#add(String, int) ͰΧϯτ͍ͯ͘͠
Quantile estimation
ύʔηϯλΠϧΛٻΊΔ
ύʔηϯλΠϧΛٻΊΔ ιʔτ͞Εͨঢ়ଶͰྻԽ͢Δ ͋ͱ n ύʔηϯλΠϧΛࢀর͢Δ͚ͩ
t-digest • ྻͷҐΛਪఆ͢Δσʔλߏ • ܦݧΛۙࣅతʹදݱ͢Δ • ύʔηϯλΠϧɺ͜ͷܦݧͷۙࣅදݱ͔Βૠ Λ༻͍ͯࢉग़͞ΕΔ • ʮѹॖύϥϝʔλʯʹΑͬͯɺਫ਼ͱۭؒޮͷτϨʔυ
ΦϑΛௐ͢Δ • Λେ͖͘͢Δ͜ͱͰɺਫ਼ΛߴΊΔ͜ͱ͕Ͱ͖Δ
stream-lib ͷ t-digest
stream-lib ͷ t-digest ѹॖύϥϝʔλΛࢦఆͯ͠ TDigest Λ༻ҙ͢Δ TDigest#add(double) ͰΛՃ͍ͯ͘͠ TDigest#quantile(double) ͰύʔηϯλΠϧΛಘΔ
·ͱΊ
·ͱΊ • ֬తσʔλߏΛ༻͍Δ͜ͱͰɺେنσʔλॲཧ ΦϯϥΠϯॲཧΛޮతʹ࣮ݱͰ͖Δʢ͔ʣ • Java Ͱ֬తσʔλߏΛ͓खܰʹѻ͍͍ͨͳΒɺ ·ͣstream-lib ͷར༻Λݕ౼ͯ͠ΈΔ •
ਪఆਫ਼ͱۭؒޮͷτϨʔυΦϑΛ੍ޚ͢Δ ύϥϝʔλͷௐɺ৬ਓܳʹͳΓ͕ͪ • JOL JMH Λ༻͍ͯɺ࣮ࡍͷۭؒޮͱ࣌ؒޮΛ ͖ͪΜͱଌఆ͠ͳ͕Βௐ͢Δ͜ͱΛ͓͢͢Ί͍ͨ͠
Thank you!