I want to handle probabilistic data structures in Java! #JJUG

Slides presented at the JJUG Night Seminar "LT with a beer in hand & summer social 2017".
https://jjug.doorkeeper.jp/events/63719

KOMIYA Atsushi

August 23, 2017
Transcript

  1. I want to handle probabilistic data structures in Java! 2017-08-23 JJUG night seminar LT KOMIYA Atsushi

  2. @komiya_atsushi

  3. Today’s topic

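  4. Probabilistic data structures

  5. What is a probabilistic data structure? • A data structure that exploits probabilistic properties • Aims to solve a problem efficiently in time or in space (≅ with little memory) • This talk focuses on space-efficient data structures • Depending on the data structure, you may get an approximate answer rather than an exact one • Accuracy and space efficiency are a trade-off
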
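  6. When should you use them?

  7. When should you use them? • You want to process, online, data that arrives in real time and in large volumes • You want to process large-scale data that does not fit in memory on an underpowered PC • If you have an environment that supports distributed processing, there is no need to go out of your way to use probabilistic data structures
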
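  8. Handling probabilistic data structures in Java

  9. Implement it yourself or use a library? • For many probabilistic data structures, a freely readable copy of the paper turns up quickly if you google it • Reading it and implementing the structure yourself is fine • On the other hand, Maven Central already hosts a number of existing implementations • Standing on the shoulders of giants is the smart way
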
  10. Java implementations of probabilistic data structures • stream-lib ‘com.addthis:stream-lib’ • Membership query / cardinality estimation / frequency counting / quantile estimation • Google Guava ‘com.google.guava:guava’ • Membership query • java-hll ‘net.agkn:hll’ • Cardinality estimation • t-digest ‘com.tdunning:t-digest’ • Quantile estimation

  12. Using probabilistic data structures with stream-lib

  13. http://bit.ly/JJUG-2017-08-probds-code

  14. Membership query

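  15. Determining whether an element belongs to a set

  16. Determining whether an element belongs to a set: prepare a Set<T>, test for presence with Set#contains(T), and add elements to the set with Set#add(T)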

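  17. Bloom filter • Returns answers (presence results) that are wrong with some probability • False positives (reporting an absent element as present) can occur, but false negatives cannot • Heap usage can be controlled by specifying the expected number of distinct elements and the acceptable false-positive probability • Elements can be added, but deleting them is difficult
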
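  18. Bloom filter in stream-lib

  19. Bloom filter in stream-lib: create a BloomFilter by specifying the number of elements and the false-positive probability, test for presence with BloomFilter#isPresent(String), and add() elements just as with a Set
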
  20. Checking heap usage • Heap usage measured with JOL (Java Object Layout), using the “Lorem ipsum” text as an example • http://openjdk.java.net/projects/code-tools/jol/ • Set: 6,032 bytes • stream-lib BloomFilter: 136 bytes 97.8% smaller !

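The Bloom filter behaviour described in slides 17 to 19 can be sketched in plain Java. This is a minimal illustration, not stream-lib's implementation: the class name `SimpleBloomFilter`, the FNV-1a based hash, and the double-hashing scheme are assumptions made for the example, and the sizing uses the textbook formulas rather than whatever stream-lib does internally.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Hypothetical minimal Bloom filter; an illustration, not stream-lib's BloomFilter.
class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SimpleBloomFilter(int expectedElements, double falsePositiveProbability) {
        double ln2 = Math.log(2);
        // Textbook sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hash functions.
        numBits = Math.max(1, (int) Math.ceil(
                -expectedElements * Math.log(falsePositiveProbability) / (ln2 * ln2)));
        numHashes = Math.max(1, (int) Math.round((double) numBits / expectedElements * ln2));
        bits = new BitSet(numBits);
    }

    void add(String element) {
        long h = hash64(element);
        int h1 = (int) h;
        int h2 = (int) (h >>> 32);
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(h1, h2, i));
        }
    }

    boolean isPresent(String element) {
        long h = hash64(element);
        int h1 = (int) h;
        int h2 = (int) (h >>> 32);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(h1, h2, i))) {
                return false; // one unset bit proves the element was never added
            }
        }
        return true; // all bits set: present, or a false positive
    }

    int sizeInBits() {
        return numBits;
    }

    // Double hashing: derive the i-th bit position from two 32-bit halves of one hash.
    private int index(int h1, int h2, int i) {
        return Math.floorMod(h1 + i * h2, numBits);
    }

    // FNV-1a over the UTF-8 bytes, followed by an extra mixing step.
    private static long hash64(String s) {
        long h = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        h ^= h >>> 33;
        h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33;
        return h;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1000, 0.01);
        filter.add("lorem");
        filter.add("ipsum");
        System.out.println(filter.isPresent("lorem")); // added elements are always reported present
        System.out.println(filter.isPresent("dolor")); // usually false, but a false positive is possible
        System.out.println(filter.sizeInBits() + " bits");
    }
}
```

Only bits are stored, never the elements themselves, which is why heap usage depends on the two constructor parameters rather than on the size of the strings, and why removing an element is not straightforward.
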
  21. Cardinality estimation

  22. Counting distinct values

  23. Counting distinct values: prepare a Set<T>, keep pushing elements in with Set#add(), and Set#size() gives the number of distinct values

  24. HyperLogLog++ (1/2) • A data structure that estimates the number of distinct values • The estimate can come out either above or below the true number of distinct values • Redshift / BigQuery / Presto and others also use it as a way to approximate COUNT(DISTINCT x) • https://aws.amazon.com/jp/about-aws/whats-new/2013/11/11/amazon-redshift-new-performance-data-loading-security-features/ • https://cloud.google.com/blog/big-data/2017/07/counting-uniques-faster-in-bigquery-with-hyperloglog

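  25. HyperLogLog++ (2/2) • Heap usage can be controlled by adjusting the precision p of the estimate • A larger value gives higher accuracy and worse space efficiency • Choose p by considering the expected number of distinct values, the required accuracy, and the heap constraints • To understand how HyperLogLog works, the following blog entry is recommended • http://blog.brainpad.co.jp/entry/2016/06/27/110000
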
  26. HyperLogLog++ in stream-lib

  27. HyperLogLog++ in stream-lib: create a HyperLogLogPlus() by specifying the precision, add elements with HyperLogLogPlus#offer(), and obtain the number of distinct values with HyperLogLogPlus#cardinality()
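To make the role of the precision p concrete, here is a rough sketch of plain HyperLogLog in Java. It is deliberately not HyperLogLog++ and not stream-lib's HyperLogLogPlus: the class name `SimpleHyperLogLog`, the choice to accept `long` values, and the mixing function are assumptions for illustration, and the bias corrections and sparse representation of HyperLogLog++ are left out.

```java
// Hypothetical minimal HyperLogLog (the plain algorithm, without the ++ refinements).
class SimpleHyperLogLog {
    private final int p;            // precision: number of bits used to pick a register
    private final byte[] registers; // m = 2^p one-byte registers

    SimpleHyperLogLog(int p) {
        if (p < 4 || p > 16) {
            throw new IllegalArgumentException("p must be between 4 and 16 in this sketch");
        }
        this.p = p;
        this.registers = new byte[1 << p];
    }

    void offer(long value) {
        long h = mix64(value);
        int index = (int) (h >>> (64 - p)); // top p bits select the register
        long rest = h << p;                 // remaining 64 - p bits, left-aligned
        // Rank = position of the first 1 bit in the remaining bits (1-based).
        int rank = Math.min(Long.numberOfLeadingZeros(rest), 64 - p) + 1;
        if (rank > registers[index]) {
            registers[index] = (byte) rank;
        }
    }

    long cardinality() {
        int m = registers.length;
        double sum = 0;
        int zeros = 0;
        for (byte r : registers) {
            sum += 1.0 / (1L << r);
            if (r == 0) {
                zeros++;
            }
        }
        double alpha = m == 16 ? 0.673 : m == 32 ? 0.697 : m == 64 ? 0.709
                : 0.7213 / (1 + 1.079 / m);
        double estimate = alpha * m * m / sum; // harmonic-mean based raw estimate
        if (estimate <= 2.5 * m && zeros > 0) {
            estimate = m * Math.log((double) m / zeros); // small-range correction (linear counting)
        }
        return Math.round(estimate);
    }

    int registerCount() {
        return registers.length;
    }

    // SplitMix64-style mixing so that consecutive inputs spread over all 64 bits.
    private static long mix64(long z) {
        z += 0x9e3779b97f4a7c15L;
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }

    public static void main(String[] args) {
        SimpleHyperLogLog hll = new SimpleHyperLogLog(14);
        for (long i = 0; i < 100_000; i++) {
            hll.offer(i);
        }
        System.out.println("registers: " + hll.registerCount());
        System.out.println("estimate:  " + hll.cardinality() + " (true value: 100000)");
    }
}
```

Memory is fixed at 2^p registers no matter how many values are offered, so raising p by one doubles the footprint while the typical relative error shrinks roughly in proportion to 1/sqrt(2^p).
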

  28. Frequency counting

  29. Counting element frequencies

  30. Counting element frequencies: represent a per-element counter with a Map and simply keep counting for each element

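  31. Count-min sketch (1/2) • One of the data structures for estimating element frequencies • It may return an estimate larger than the actual frequency, but never a smaller one • The lower an element's frequency, the more it is affected by this bias
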
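  32. Count-min sketch (2/2) • Two parameters, width and depth, control space efficiency and accuracy • width * depth counters are created • The counters are represented as a two-dimensional array • As many hash functions as depth are evaluated, which affects speed
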
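The width and depth mechanics above can be sketched as follows, next to the exact Map-based counter from slide 30. This is an illustrative toy, not stream-lib's CountMinSketch: the class name `SimpleCountMinSketch`, the constructor argument order, and the per-row hash derived from `String#hashCode()` are all assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal Count-Min sketch; an illustration, not stream-lib's CountMinSketch.
class SimpleCountMinSketch {
    private final int width;
    private final int depth;
    private final long[][] counters; // depth rows, each with width counters

    SimpleCountMinSketch(int width, int depth) {
        this.width = width;
        this.depth = depth;
        this.counters = new long[depth][width];
    }

    void add(String item, long count) {
        // One hash evaluation and one counter update per row.
        for (int row = 0; row < depth; row++) {
            counters[row][column(item, row)] += count;
        }
    }

    long estimateCount(String item) {
        // Collisions only ever inflate a counter, so the minimum across rows
        // is the tightest estimate and never undercounts.
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, counters[row][column(item, row)]);
        }
        return min;
    }

    // Derive a different hash per row by mixing the row number into the item's hash code.
    private int column(String item, int row) {
        long z = item.hashCode() + 0x9e3779b97f4a7c15L * (row + 1);
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        z ^= z >>> 31;
        return (int) Math.floorMod(z, (long) width);
    }

    public static void main(String[] args) {
        SimpleCountMinSketch sketch = new SimpleCountMinSketch(64, 4);
        Map<String, Long> exact = new HashMap<>(); // exact baseline for comparison
        for (String word : "lorem ipsum dolor sit amet lorem ipsum lorem".split(" ")) {
            sketch.add(word, 1);
            exact.merge(word, 1L, Long::sum);
        }
        for (Map.Entry<String, Long> e : exact.entrySet()) {
            System.out.println(e.getKey() + ": exact=" + e.getValue()
                    + " estimate=" + sketch.estimateCount(e.getKey()));
        }
    }
}
```

The Map grows with the number of distinct elements, while the sketch stays at width * depth counters; the price is that rarely seen elements can pick up counts from unrelated elements that share their cells.
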
  33. Count-min sketch in stream-lib

  34. Count-min sketch in stream-lib: prepare a Count-Min sketch with width:10 * depth:30 counters and count with CountMinSketch#add(String, int)

  35. Quantile estimation

  36. Finding percentile values

  37. Finding percentile values: put the values into a sorted array, then just look up the n-th percentile

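  38. t-digest • A data structure that estimates quantiles of a sequence of numbers • Represents the empirical distribution approximately • Percentile values are computed by interpolation from this approximate representation of the empirical distribution • A compression parameter adjusts the trade-off between accuracy and space efficiency • A larger value gives higher accuracy
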
  39. t-digest in stream-lib

  40. t-digest in stream-lib: create a TDigest by specifying the compression parameter, add numbers with TDigest#add(double), and obtain percentile values with TDigest#quantile(double)
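For comparison with t-digest, the exact approach from slide 37 (sort everything, then index into the array) might look like the sketch below. The class name `ExactPercentile` and the nearest-rank definition are assumptions; other percentile definitions interpolate between neighbouring values, as t-digest does on its approximate distribution.

```java
import java.util.Arrays;

// Hypothetical exact baseline: keeps every value in memory, unlike t-digest.
class ExactPercentile {
    /**
     * Nearest-rank percentile for q in [0, 1]: the smallest value such that
     * at least a fraction q of the data is less than or equal to it.
     */
    static double percentile(double[] values, double q) {
        if (values.length == 0) {
            throw new IllegalArgumentException("values must not be empty");
        }
        if (q < 0 || q > 1) {
            throw new IllegalArgumentException("q must be between 0 and 1");
        }
        double[] sorted = values.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(q * sorted.length); // 1-based rank
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        double[] latencies = {5, 1, 4, 2, 3};
        System.out.println(percentile(latencies, 0.5)); // median of the five values
        System.out.println(percentile(latencies, 1.0)); // maximum
    }
}
```

This needs memory proportional to the number of values and a full sort, which is the cost that t-digest trades away for an approximate answer.
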

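  41. Summary

  42. Summary • Probabilistic data structures can (perhaps) make large-scale data processing and online processing efficient • If you want an easy way to handle probabilistic data structures in Java, consider stream-lib first • Tuning the parameters that control the trade-off between estimation accuracy and space efficiency tends to become a craft • I recommend tuning while properly measuring the actual space and time efficiency with JOL and JMH
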
  43. Thank you!