I Want to Work with Probabilistic Data Structures in Java! #JJUG

Slides presented at the JJUG Night Seminar "LT & Summer Social with Beer in Hand 2017".
https://jjug.doorkeeper.jp/events/63719

KOMIYA Atsushi

August 23, 2017

Transcript

  1. I Want to Work with Probabilistic Data Structures in Java!
    2017-08-23 JJUG night seminar LT
    KOMIYA Atsushi

  2. @komiya_atsushi

  3. Today’s topic

  4. Probabilistic data structures

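  5. What are probabilistic data structures?
    • Data structures that exploit probabilistic properties
    • They aim to solve a problem efficiently in time or in space (≅ with little memory)
    • This talk focuses on space-efficient data structures
    • Depending on the data structure, you may get an approximate answer rather than an exact one
    • Accuracy and space efficiency are a trade-off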

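  6. When do you use them?

  7. When do you use them?
    • You want to process, online, data that arrives in real time and in large volumes
    • You want to process large-scale data that does not fit in memory on an underpowered PC
    • If you have an environment where distributed processing is available, there is no need to go out of your way to use probabilistic data structures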

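  8. Working with probabilistic data structures in Java

  9. Roll your own, or use a library?
    • For many probabilistic data structures, a quick web search turns up the original paper in freely readable form
    • Reading it and implementing the structure yourself is a fine option
    • On the other hand, Maven Central already hosts several existing implementations
    • Standing on the shoulders of giants is the smart way to go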

  10. Java implementations of probabilistic data structures
    • stream-lib ‘com.addthis:stream-lib’
      • Membership query / cardinality estimation / frequency counting / quantile estimation
    • Google Guava ‘com.google.guava:guava’
      • Membership query
    • java-hll ‘net.agkn:hll’
      • Cardinality estimation
    • t-digest ‘com.tdunning:t-digest’
      • Quantile estimation

  11. Java implementations of probabilistic data structures
    • stream-lib ‘com.addthis:stream-lib’
      • Membership query / cardinality estimation / frequency counting / quantile estimation
    • Google Guava ‘com.google.guava:guava’
      • Membership query
    • java-hll ‘net.agkn:hll’
      • Cardinality estimation
    • t-digest ‘com.tdunning:t-digest’
      • Quantile estimation

  12. Using probabilistic data structures with stream-lib

  13. http://bit.ly/JJUG-2017-08-probds-code

  14. Membership query

  15. Determine whether an element belongs to a set

  16. Determine whether an element belongs to a set
    Prepare a Set,
    check for presence with Set#contains(T),
    and add elements to the set with Set#add(T)
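
The exact approach described on this slide can be sketched with the standard `java.util.HashSet`. The class and method names (`ExactMembership`, `firstSeen`) are made up for illustration and do not come from the talk.

```java
import java.util.HashSet;
import java.util.Set;

final class ExactMembership {
    /** Returns true the first time a word is seen and false for repeats. */
    static boolean firstSeen(Set<String> seen, String word) {
        if (seen.contains(word)) {   // Set#contains(T): exact membership test
            return false;
        }
        seen.add(word);              // Set#add(T): remember the element
        return true;
    }

    public static void main(String[] args) {
        Set<String> seen = new HashSet<>();
        for (String w : "lorem ipsum dolor sit amet lorem".split(" ")) {
            System.out.println(w + " -> first seen: " + firstSeen(seen, w));
        }
    }
}
```

The set stores every element, so its heap usage grows with the number of distinct elements; that cost is what the Bloom filter on the next slide trades for a small error probability.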

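  17. Bloom filter
    • Can return a wrong (membership) answer with some probability
    • False positives (reporting an absent element as present) can occur, but false negatives cannot
    • Heap usage can be controlled by specifying the expected number of distinct elements and the acceptable false-positive probability
    • Elements can be added, but deleting them is difficult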
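
To make these properties concrete, here is a minimal Bloom filter sketch in plain Java. It is not stream-lib's implementation: the class name `SimpleBloomFilter`, the FNV-1a based hash, and the double-hashing scheme are assumptions chosen for brevity. The sizing uses the textbook formulas m = -n ln p / (ln 2)^2 bits and k = (m / n) ln 2 hash functions.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Illustrative Bloom filter; hypothetical class, not stream-lib's BloomFilter. */
final class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SimpleBloomFilter(int expectedElements, double falsePositiveProbability) {
        double ln2 = Math.log(2);
        // m = -n ln p / (ln 2)^2 bits, k = (m / n) ln 2 hash functions
        this.numBits = Math.max(1, (int) Math.ceil(
                -expectedElements * Math.log(falsePositiveProbability) / (ln2 * ln2)));
        this.numHashes = Math.max(1,
                (int) Math.round((double) numBits / expectedElements * ln2));
        this.bits = new BitSet(numBits);
    }

    void add(String item) {
        long h = hash64(item);
        int h1 = (int) h;
        int h2 = (int) (h >>> 32);
        for (int i = 0; i < numHashes; i++) {
            bits.set(Math.floorMod(h1 + i * h2, numBits)); // double hashing
        }
    }

    /** May return true for an element never added; never false for an added one. */
    boolean isPresent(String item) {
        long h = hash64(item);
        int h1 = (int) h;
        int h2 = (int) (h >>> 32);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(Math.floorMod(h1 + i * h2, numBits))) {
                return false;
            }
        }
        return true;
    }

    int sizeInBits() {
        return numBits;
    }

    /** FNV-1a over the UTF-8 bytes, followed by a 64-bit mixing step. */
    private static long hash64(String s) {
        long h = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        h = (h ^ (h >>> 30)) * 0xBF58476D1CE4E5B9L;
        h = (h ^ (h >>> 27)) * 0x94D049BB133111EBL;
        return h ^ (h >>> 31);
    }
}
```

Bits are only ever set, never cleared, which is why an added element is always found and why deletion is hard: clearing a bit could erase evidence of other elements that share it.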

  18. Bloom filter in stream-lib

  19. Bloom filter in stream-lib
    Prepare a BloomFilter, specifying the number of elements and the false-positive probability
    Check for presence with BloomFilter#isPresent(String)
    Call add() just as with a Set

  20. Checking heap usage
    • Measured heap usage with JOL (Java Object Layout), using the “Lorem ipsum” text as an example
    • http://openjdk.java.net/projects/code-tools/jol/
    • Set: 6,032 bytes
    • stream-lib BloomFilter: 136 bytes
    97.8% smaller!

  21. Cardinality estimation

  22. Counting distinct elements

  23. Counting distinct elements
    Prepare a Set,
    keep pushing elements in with Set#add(),
    and Set#size() gives the number of distinct elements

  24. HyperLogLog++ (1/2)
    • A data structure that estimates the number of distinct elements
    • The estimate can come out either above or below the true number of distinct elements
    • Also used in Redshift / BigQuery / Presto and others as a way to approximate COUNT(DISTINCT x)
    • https://aws.amazon.com/jp/about-aws/whats-new/2013/11/11/amazon-redshift-new-performance-data-loading-security-features/
    • https://cloud.google.com/blog/big-data/2017/07/counting-uniques-faster-in-bigquery-with-hyperloglog

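  25. HyperLogLog++ (2/2)
    • Heap usage can be controlled by adjusting the precision p of the estimate
    • A larger value gives higher accuracy and worse space efficiency
    • Choose p based on the expected number of distinct elements, the required accuracy, and the heap constraints
    • To understand how HyperLogLog works, the following blog entry is recommended
    • http://blog.brainpad.co.jp/entry/2016/06/27/110000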
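
As a rule of thumb for choosing p (stated here for the original HyperLogLog estimator and not taken from the slides), the number of registers m and the typical relative error are related as follows; treat the numbers as approximate guidance, not as a guarantee for any particular library.

```latex
m = 2^{p}, \qquad \text{relative standard error} \approx \frac{1.04}{\sqrt{m}}

% Worked example for p = 14:
m = 2^{14} = 16384, \qquad \frac{1.04}{\sqrt{16384}} = \frac{1.04}{128} \approx 0.0081 \;(\text{about } 0.8\%)
```

Each increment of p doubles the number of registers while reducing the error only by a factor of about 1.41, which is the accuracy versus space trade-off the slide describes.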

  26. HyperLogLog++ in stream-lib

  27. HyperLogLog++ in stream-lib
    Prepare a HyperLogLogPlus(), specifying the precision
    Add elements with HyperLogLogPlus#offer()
    HyperLogLogPlus#cardinality() gives the number of distinct elements
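
For intuition about what happens behind `offer()` and `cardinality()`, the following is a minimal sketch of the plain HyperLogLog algorithm in standard Java. It is not HyperLogLog++ and not stream-lib's code: the class name `SimpleHyperLogLog`, the restriction to `long` inputs, and the mixing function are assumptions made to keep the example short. It omits the sparse representation and bias correction that HyperLogLog++ adds.

```java
/** Illustrative plain HyperLogLog; hypothetical class, not stream-lib's HyperLogLogPlus. */
final class SimpleHyperLogLog {
    private final int p;             // precision: number of index bits
    private final int m;             // number of registers, 2^p
    private final byte[] registers;  // max rank observed per register

    SimpleHyperLogLog(int p) {
        if (p < 7 || p > 16) {
            throw new IllegalArgumentException("p must be in [7, 16]");
        }
        this.p = p;
        this.m = 1 << p;
        this.registers = new byte[m];
    }

    void offer(long value) {
        long h = mix64(value);
        int index = (int) (h >>> (64 - p));   // first p bits pick the register
        long rest = h << p;                    // remaining bits, left-aligned
        int rank = (rest == 0) ? (64 - p + 1) : Long.numberOfLeadingZeros(rest) + 1;
        if (rank > registers[index]) {
            registers[index] = (byte) rank;
        }
    }

    long cardinality() {
        double sum = 0.0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.scalb(1.0, -r);        // 2^(-r)
            if (r == 0) {
                zeros++;
            }
        }
        double alpha = 0.7213 / (1.0 + 1.079 / m);   // valid for m >= 128
        double estimate = alpha * m * m / sum;
        if (estimate <= 2.5 * m && zeros > 0) {
            estimate = m * Math.log((double) m / zeros); // small-range correction
        }
        return Math.round(estimate);
    }

    /** SplitMix64-style mixing so that sequential inputs spread over all bits. */
    private static long mix64(long z) {
        z += 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }
}
```

Offering the same value twice leaves the registers unchanged, so duplicates do not affect the estimate, and the memory footprint is fixed at 2^p registers regardless of how many elements are offered.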

  28. Frequency counting

  29. Counting element frequencies

  30. Counting element frequencies
    Represent a per-element counter with a Map
    Then simply keep counting, element by element
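
The exact Map-based counting described on this slide might look like the following; `ExactFrequency` and `count` are illustrative names, not taken from the talk.

```java
import java.util.HashMap;
import java.util.Map;

final class ExactFrequency {
    /** Counts how often each element occurs, keeping one counter per distinct element. */
    static Map<String, Long> count(Iterable<String> items) {
        Map<String, Long> counts = new HashMap<>();
        for (String item : items) {
            counts.merge(item, 1L, Long::sum); // insert 1 or add 1 to the existing counter
        }
        return counts;
    }
}
```

The map holds one entry per distinct element, so memory grows with the number of distinct elements in the stream.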

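  31. Count-min sketch (1/2)
    • One of the data structures for estimating element frequencies
    • It may return an estimate larger than the true frequency, but never a smaller one
    • The lower an element's frequency, the more it is affected by this bias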

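  32. Count-min sketch (2/2)
    • Two parameters, width and depth, control space efficiency and accuracy
    • width * depth counters are created
    • The counters are represented as a 2-dimensional array
    • As many hash functions as depth are evaluated per operation, so depth affects speed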
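
The structure described above (a depth × width counter array with one hash function per row) can be sketched in plain Java as follows. This is an illustration under stated assumptions, not stream-lib's implementation: the class name `SimpleCountMinSketch`, the seed handling, and the hash mixing are made up for this example, and counts are assumed to be non-negative.

```java
import java.util.Random;

/** Illustrative Count-min sketch; hypothetical class, not stream-lib's CountMinSketch. */
final class SimpleCountMinSketch {
    private final int depth;       // number of rows, one hash function each
    private final int width;       // counters per row
    private final long[][] table;  // depth x width counters
    private final long[] seeds;    // per-row hash seeds

    SimpleCountMinSketch(int depth, int width, long seed) {
        this.depth = depth;
        this.width = width;
        this.table = new long[depth][width];
        this.seeds = new long[depth];
        Random random = new Random(seed);
        for (int row = 0; row < depth; row++) {
            seeds[row] = random.nextLong();
        }
    }

    /** Adds a non-negative count to one counter in every row. */
    void add(String item, long count) {
        for (int row = 0; row < depth; row++) {
            table[row][bucket(item, row)] += count;
        }
    }

    /** Returns the minimum over the rows; collisions can only inflate it. */
    long estimateCount(String item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, table[row][bucket(item, row)]);
        }
        return min;
    }

    private int bucket(String item, int row) {
        long h = item.hashCode() ^ seeds[row];
        h = (h ^ (h >>> 30)) * 0xBF58476D1CE4E5B9L;
        h = (h ^ (h >>> 27)) * 0x94D049BB133111EBL;
        h ^= (h >>> 31);
        return (int) Math.floorMod(h, (long) width);
    }
}
```

Every counter an element maps to includes that element's own count plus whatever collided with it, so taking the minimum can overestimate but never underestimate. A wider table reduces collisions; a deeper one gives more chances for a collision-free row at the cost of more hashing per operation.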

  33. Count-min sketch in stream-lib

  34. Count-min sketch in stream-lib
    Prepare a Count-Min sketch with width:10 * depth:30 counters
    Count with CountMinSketch#add(String, int)

  35. Quantile estimation

  36. Finding percentile values

  37. Finding percentile values
    Put the values into an array in sorted order
    Then just look up the n-th percentile
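
A minimal sketch of this exact approach, assuming the nearest-rank definition of a percentile (the slide does not say which definition the talk used); `ExactPercentile` is an illustrative name.

```java
import java.util.Arrays;

final class ExactPercentile {
    /** Nearest-rank percentile of the given values; percent is expected in (0, 100]. */
    static double percentile(double[] values, double percent) {
        if (values.length == 0) {
            throw new IllegalArgumentException("values must not be empty");
        }
        double[] sorted = values.clone();   // keep the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(percent * sorted.length / 100.0);
        int clamped = Math.max(1, Math.min(rank, sorted.length));
        return sorted[clamped - 1];         // ranks are 1-based
    }
}
```

This needs every value in memory and a sort, which is the cost that a quantile sketch such as t-digest avoids.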

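  38. t-digest
    • A data structure that estimates quantiles of a sequence of numbers
    • Represents the empirical distribution approximately
    • Percentile values are computed by interpolation from this approximate representation of the empirical distribution
    • A "compression parameter" adjusts the trade-off between accuracy and space efficiency
    • Increasing the value improves accuracy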

  39. t-digest in stream-lib

  40. t-digest in stream-lib
    Prepare a TDigest, specifying the compression parameter
    Add numbers with TDigest#add(double)
    Get percentile values with TDigest#quantile(double)

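  41. Summary

  42. Summary
    • Probabilistic data structures can (maybe) make large-scale data processing and online processing efficient
    • If you want an easy way to work with probabilistic data structures in Java, consider stream-lib first
    • Tuning the parameters that control the trade-off between estimation accuracy and space efficiency tends to become a craftsman's art
    • Tune while properly measuring the actual space and time efficiency with JOL and JMH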

  43. Thank you!