Slide 1

Slide 1 text

I want to use probabilistic data structures in Java!
2017-08-23 JJUG night seminar LT
KOMIYA Atsushi

Slide 2

Slide 2 text

@komiya_atsushi

Slide 3

Slide 3 text

Today’s topic

Slide 4

Slide 4 text

Probabilistic data structures

Slide 5

Slide 5 text

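What are probabilistic data structures?
• Data structures that exploit probabilistic properties
• Their goal is to solve a problem efficiently in time or in space (≅ with little memory)
• This talk focuses on space-efficient data structures
• Depending on the data structure, you may get an approximate answer rather than an exact one
• Accuracy and space efficiency are in a trade-off relationship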

Slide 6

Slide 6 text

When do you use them?

Slide 7

Slide 7 text

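When do you use them?
• You want to process data that arrives in real time and in large volume, online
• You want to process a large data set that does not fit in memory on an underpowered PC
• If you have an environment that can do distributed processing, there is no need to go out of your way to use probabilistic data structures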

Slide 8

Slide 8 text

Working with probabilistic data structures in Java

Slide 9

Slide 9 text

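Implement it yourself? Use a library?
• For many probabilistic data structures, a quick web search turns up the paper in freely readable form
• Reading it and writing your own implementation is a fine option
• On the other hand, Maven Central already hosts a number of existing implementations
• Standing on the shoulders of giants is the smart way to go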

Slide 10

Slide 10 text

Java implementations of probabilistic data structures
• stream-lib ‘com.addthis:stream-lib’
  • Membership query / cardinality estimation / frequency counting / quantile estimation
• Google Guava ‘com.google.guava:guava’
  • Membership query
• java-hll ‘net.agkn:hll’
  • Cardinality estimation
• t-digest ‘com.tdunning:t-digest’
  • Quantile estimation

Slide 11

Slide 11 text

Java implementations of probabilistic data structures
• stream-lib ‘com.addthis:stream-lib’
  • Membership query / cardinality estimation / frequency counting / quantile estimation
• Google Guava ‘com.google.guava:guava’
  • Membership query
• java-hll ‘net.agkn:hll’
  • Cardinality estimation
• t-digest ‘com.tdunning:t-digest’
  • Quantile estimation

Slide 12

Slide 12 text

How to use probabilistic data structures with stream-lib

Slide 13

Slide 13 text

http://bit.ly/JJUG-2017-08-probds-code

Slide 14

Slide 14 text

Membership query

Slide 15

Slide 15 text

Determining whether an element belongs to a set

Slide 16

Slide 16 text

Determining whether an element belongs to a set
Prepare a Set,
check for presence with Set#contains(T),
and add elements to the set with Set#add(T)
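The exact, Set-based baseline described on this slide can be sketched as follows. The code shown on the original slide is not part of this transcript, so the class and method names here (`ExactMembership`, `addIfAbsent`) are hypothetical and only illustrate the `Set#contains` / `Set#add` pattern.

```java
import java.util.HashSet;
import java.util.Set;

// Exact membership test backed by a HashSet (names are illustrative only).
class ExactMembership {
    /** Returns true the first time an element is seen, false on repeats. */
    static boolean addIfAbsent(Set<String> seen, String element) {
        if (seen.contains(element)) { // exact answer: no false positives or negatives
            return false;
        }
        seen.add(element);            // memory grows with every distinct element
        return true;
    }

    public static void main(String[] args) {
        Set<String> seen = new HashSet<>();
        for (String word : "lorem ipsum dolor sit amet lorem".split(" ")) {
            System.out.println(word + " -> first time? " + addIfAbsent(seen, word));
        }
    }
}
```

The answer is always exact, but the Set keeps every distinct element on the heap, which is the cost the Bloom filter trades away.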

Slide 17

Slide 17 text

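Bloom filter
• Returns a wrong answer (presence/absence result) with some probability
• False positives (reporting something absent as present) can occur, but false negatives cannot
• Heap usage can be controlled by specifying the expected number of distinct elements and the acceptable false-positive probability
• Elements can be added, but deleting them is difficult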
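To make these properties concrete, here is a minimal Bloom filter sketch in plain Java. It is not stream-lib's implementation: the class name, the FNV-1a hash, and the double-hashing scheme are assumptions chosen for brevity, and the sizing uses the textbook formulas m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Illustrative Bloom filter; NOT stream-lib's implementation.
class BloomFilterSketch {
    private final BitSet bits;
    private final int numBits;   // m: size of the bit array
    private final int numHashes; // k: bit positions set per element

    // expectedElements (n) must be > 0 and falsePositiveRate (p) must be in (0, 1).
    BloomFilterSketch(int expectedElements, double falsePositiveRate) {
        double ln2 = Math.log(2);
        // Textbook sizing: m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2.
        this.numBits = Math.max(1,
                (int) Math.ceil(-expectedElements * Math.log(falsePositiveRate) / (ln2 * ln2)));
        this.numHashes = Math.max(1,
                (int) Math.round((double) numBits / expectedElements * ln2));
        this.bits = new BitSet(numBits);
    }

    void add(String item) {
        byte[] data = item.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(data, i));
        }
    }

    // false means "definitely absent"; true means "probably present".
    boolean mightContain(String item) {
        byte[] data = item.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(data, i))) {
                return false;
            }
        }
        return true;
    }

    int sizeInBits() {
        return numBits;
    }

    // Double hashing: the i-th position is (h1 + i * h2) mod m.
    private int index(byte[] data, int i) {
        int h1 = fnv1a(data, 0);
        int h2 = fnv1a(data, 0x5bd1e995);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    // Seeded 32-bit FNV-1a hash, chosen only because it is short.
    private static int fnv1a(byte[] data, int seed) {
        int h = 0x811c9dc5 ^ seed;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return h;
    }

    public static void main(String[] args) {
        BloomFilterSketch filter = new BloomFilterSketch(1000, 0.01);
        filter.add("lorem");
        filter.add("ipsum");
        System.out.println(filter.mightContain("lorem")); // added elements are never reported absent
        System.out.println(filter.sizeInBits() + " bits allocated");
    }
}
```

Because bits are only ever set and several elements can share a bit, an added element can never be reported absent, while clearing the bits of one element could silently remove others. That is why deletion is hard.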

Slide 18

Slide 18 text

Bloom filter in stream-lib

Slide 19

Slide 19 text

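Bloom filter in stream-lib
Prepare a BloomFilter by specifying the number of elements and the false-positive probability,
check for presence with BloomFilter#isPresent(String),
and call add() just as you would with a Set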

Slide 20

Slide 20 text

Checking heap usage
• Heap usage measured with JOL (Java Object Layout), using the “Lorem ipsum” text as an example
• http://openjdk.java.net/projects/code-tools/jol/
• Set: 6,032 bytes
• stream-lib BloomFilter: 136 bytes
97.8% smaller!

Slide 21

Slide 21 text

Cardinality estimation

Slide 22

Slide 22 text

Counting distinct values

Slide 23

Slide 23 text

Counting distinct values
Prepare a Set,
keep pushing elements into it with Set#add(),
and Set#size() gives the number of distinct values
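The exact approach from this slide can be sketched as below. The original slide code is not in the transcript, so `ExactDistinctCount` and `countDistinct` are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Exact distinct count: every distinct element is retained in memory.
class ExactDistinctCount {
    static int countDistinct(Iterable<String> items) {
        Set<String> distinct = new HashSet<>();
        for (String item : items) {
            distinct.add(item); // duplicates are ignored by the Set
        }
        return distinct.size();
    }

    public static void main(String[] args) {
        System.out.println(countDistinct(Arrays.asList("a", "b", "a", "c", "b"))); // prints 3
    }
}
```

Memory use here is proportional to the number of distinct elements, which is the motivation for a cardinality estimator.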

Slide 24

Slide 24 text

HyperLogLog++ (1/2)
• A data structure that estimates the number of distinct values
• The estimate can come out either above or below the true number of distinct values
• Also used in Redshift / BigQuery / Presto and others as a way to approximate COUNT(DISTINCT x)
  • https://aws.amazon.com/jp/about-aws/whats-new/2013/11/11/amazon-redshift-new-performance-data-loading-security-features/
  • https://cloud.google.com/blog/big-data/2017/07/counting-uniques-faster-in-bigquery-with-hyperloglog

Slide 25

Slide 25 text

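HyperLogLog++ (2/2)
• Heap usage can be controlled by adjusting the precision p of the estimate
  • A larger value gives higher accuracy and worse space efficiency
  • Choose p by considering the expected number of distinct values, the accuracy you need, and your heap constraints
• To understand how HyperLogLog works, the following blog entry is recommended
  • http://blog.brainpad.co.jp/entry/2016/06/27/110000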
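As a rough guide for choosing p, the sketch below tabulates the usual HyperLogLog relationships: m = 2^p registers and a relative standard error of about 1.04 / sqrt(m). The byte figure assumes a dense layout with 6 bits per register, which is an assumption for illustration; stream-lib's actual heap usage may differ (for example because of a sparse representation) and should be measured rather than taken from this table.

```java
// Back-of-the-envelope helper for picking the HyperLogLog precision p.
class HllPrecision {
    // Number of registers for precision p: m = 2^p.
    static int registers(int p) {
        return 1 << p;
    }

    // Approximate relative standard error of HyperLogLog: 1.04 / sqrt(m).
    static double relativeStandardError(int p) {
        return 1.04 / Math.sqrt(registers(p));
    }

    // Rough dense-mode size, ASSUMING 6 bits per register (not a stream-lib guarantee).
    static long approxDenseBytes(int p) {
        return (long) registers(p) * 6 / 8;
    }

    public static void main(String[] args) {
        for (int p = 10; p <= 16; p += 2) {
            System.out.printf("p=%d registers=%d stderr=%.4f approxBytes=%d%n",
                    p, registers(p), relativeStandardError(p), approxDenseBytes(p));
        }
    }
}
```

Each increment of p doubles the register count but improves the error only by a factor of sqrt(2), which is the trade-off the slide describes.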

Slide 26

Slide 26 text

HyperLogLog++ in stream-lib

Slide 27

Slide 27 text

HyperLogLog++ in stream-lib
Prepare a HyperLogLogPlus() by specifying the precision,
add elements with HyperLogLogPlus#offer(),
and HyperLogLogPlus#cardinality() gives the number of distinct values

Slide 28

Slide 28 text

Frequency counting

Slide 29

Slide 29 text

Counting element frequencies

Slide 30

Slide 30 text

Counting element frequencies
Represent a per-element counter with a Map
and simply keep counting element by element
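A minimal version of this Map-based counting might look like the following. The slide's own code is not in the transcript, so `ExactFrequency` and `countAll` are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Exact frequency counting: one counter per distinct element.
class ExactFrequency {
    static Map<String, Integer> countAll(Iterable<String> items) {
        Map<String, Integer> counts = new HashMap<>();
        for (String item : items) {
            counts.merge(item, 1, Integer::sum); // insert 1, or add 1 to the existing count
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countAll(Arrays.asList("a", "b", "a", "c", "a"));
        System.out.println(counts.get("a")); // prints 3
    }
}
```

The Map holds one entry per distinct element, so its size is unbounded when the key space is large.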

Slide 31

Slide 31 text

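Count-min sketch (1/2)
• One of the data structures for estimating element frequencies
• It may return an estimate larger than the actual frequency, but never a smaller one
  • The lower an element's frequency, the more it is affected by this bias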

Slide 32

Slide 32 text

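Count-min sketch (2/2)
• Two parameters, width and depth, control space efficiency and accuracy
• width * depth counters are created
  • The counters are represented as a two-dimensional array
• A hash function is executed depth times per operation, so depth affects speed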
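The width/depth structure can be sketched in plain Java as below. This is an illustration only, not stream-lib's `CountMinSketch`: the class name and the seeded FNV-1a style hash are assumptions, and the "never underestimates" property holds only when all added counts are non-negative.

```java
import java.nio.charset.StandardCharsets;

// Illustrative count-min sketch; NOT stream-lib's implementation.
class CountMinSketchDemo {
    private final long[][] counters; // depth rows, each holding width counters
    private final int depth;
    private final int width;

    CountMinSketchDemo(int depth, int width) {
        this.depth = depth;
        this.width = width;
        this.counters = new long[depth][width];
    }

    // Adds a non-negative count; one counter per row is incremented.
    void add(String item, long count) {
        byte[] data = item.getBytes(StandardCharsets.UTF_8);
        for (int row = 0; row < depth; row++) {
            counters[row][index(data, row)] += count;
        }
    }

    // The minimum over all rows: collisions can only inflate a counter,
    // so the estimate is never below the true count.
    long estimate(String item) {
        byte[] data = item.getBytes(StandardCharsets.UTF_8);
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, counters[row][index(data, row)]);
        }
        return min;
    }

    // One hash evaluation per row, which is why depth affects speed.
    private int index(byte[] data, int row) {
        int h = 0x811c9dc5 ^ (row * 0x9e3779b9); // row-specific seed (arbitrary choice)
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return Math.floorMod(h, width);
    }

    public static void main(String[] args) {
        CountMinSketchDemo sketch = new CountMinSketchDemo(4, 64);
        for (String word : "lorem ipsum lorem dolor lorem".split(" ")) {
            sketch.add(word, 1);
        }
        System.out.println("lorem >= 3: " + (sketch.estimate("lorem") >= 3));
    }
}
```

Memory is fixed at width * depth counters regardless of how many distinct elements are added. The cost is that rare elements are the most likely to have their estimate inflated by collisions with frequent ones.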

Slide 33

Slide 33 text

Count-min sketch in stream-lib

Slide 34

Slide 34 text

Count-min sketch in stream-lib
Prepare a Count-Min sketch backed by width:10 * depth:30 counters
and count with CountMinSketch#add(String, int)

Slide 35

Slide 35 text

Quantile estimation

Slide 36

Slide 36 text

Finding percentile values

Slide 37

Slide 37 text

Finding percentile values
Put the values into an array in sorted order,
then simply look up the n-th percentile
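The sort-and-index approach can be sketched as follows. The slide does not say which percentile definition it used, so this sketch assumes the nearest-rank definition, and the names `ExactPercentile` and `percentile` are hypothetical.

```java
import java.util.Arrays;

// Exact percentile by sorting; assumes the nearest-rank definition.
class ExactPercentile {
    /** Returns the p-th percentile (0 <= p <= 100) of a non-empty array. */
    static double percentile(double[] values, int p) {
        if (values.length == 0) {
            throw new IllegalArgumentException("values must not be empty");
        }
        double[] sorted = values.clone(); // keep the caller's array untouched
        Arrays.sort(sorted);
        // Nearest rank: ceil(p / 100 * n), computed with integers to avoid rounding issues.
        int rank = (p * sorted.length + 99) / 100;
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        double[] values = {7, 1, 9, 3, 10, 2, 8, 4, 6, 5};
        System.out.println(percentile(values, 50)); // prints 5.0
        System.out.println(percentile(values, 90)); // prints 9.0
    }
}
```

This needs the whole data set in memory and a sort, which is what a quantile sketch such as t-digest avoids.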

Slide 38

Slide 38 text

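t-digest
• A data structure that estimates quantiles of a sequence of numbers
• Represents the empirical distribution approximately
  • Percentile values are computed by interpolation from this approximate representation of the empirical distribution
• A compression parameter adjusts the trade-off between accuracy and space efficiency
  • A larger value gives higher accuracy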

Slide 39

Slide 39 text

t-digest in stream-lib

Slide 40

Slide 40 text

t-digest in stream-lib
Prepare a TDigest by specifying the compression parameter,
add numbers with TDigest#add(double),
and get percentile values with TDigest#quantile(double)

Slide 41

Slide 41 text

Summary

Slide 42

Slide 42 text

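Summary
• Probabilistic data structures can make large-scale data processing and online processing efficient (maybe)
• If you want an easy way to work with probabilistic data structures in Java, consider stream-lib first
• Tuning the parameters that control the trade-off between estimation accuracy and space efficiency tends to become a matter of craftsmanship
• I recommend tuning while properly measuring the actual space and time efficiency with JOL and JMH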

Slide 43

Slide 43 text

Thank you!