Slides from the JJUG Night Seminar "LT & Summer Party with a Beer in Hand 2017". https://jjug.doorkeeper.jp/events/63719
I want to use probabilistic data structures in Java!
2017-08-23 JJUG night seminar LT
KOMIYA Atsushi
@komiya_atsushi
Today’s topic
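Probabilistic data structures
What is a probabilistic data structure?
• A data structure that exploits probabilistic properties
• Its goal is to solve a given problem efficiently in time or in space (≅ with little memory)
• This talk focuses on space-efficient data structures
• Depending on the data structure, you may get an approximate answer rather than an exact one
• Accuracy and space efficiency trade off against each other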
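When would you use them?
• You want to process data that arrives in real time and in large volume, online
• You want to process a large dataset that does not fit in memory on an underpowered PC
• If you have an environment that can do distributed processing, there is no need to go out of your way to use probabilistic data structures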
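Using probabilistic data structures in Java
Roll your own? Use a library?
• For many probabilistic data structures, the original paper is freely readable and easy to find
• Reading the paper and implementing it yourself is a fine option
• On the other hand, a number of existing implementations are available on Maven Central
• Standing on the shoulders of giants is the smart way to go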
Java implementations of probabilistic data structures
• stream-lib ‘com.addthis:stream-lib’
  • Membership query / cardinality estimation / frequency counting / quantile estimation
• Google Guava ‘com.google.guava:guava’
  • Membership query
• java-hll ‘net.agkn:hll’
  • Cardinality estimation
• t-digest ‘com.tdunning:t-digest’
  • Quantile estimation
How to use probabilistic data structures with stream-lib
http://bit.ly/JJUG-2017-08-probds-code
Membership query
Determining whether an element belongs to a set
• Prepare a Set
• Test for presence with Set#contains(T)
• Add elements to the set with Set#add(T)
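The exact, Set-based baseline described above can be sketched as follows. This is a minimal illustration using only the JDK; the class name `ExactMembership`, the `index` helper, and the sample text are assumptions made for the example, not taken from the slides.

```java
import java.util.HashSet;
import java.util.Set;

class ExactMembership {
    // Builds an exact set of the distinct (lower-cased) words in a text.
    static Set<String> index(String text) {
        Set<String> words = new HashSet<>();
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                words.add(word);          // Set#add: insert the element
            }
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> words = index("Lorem ipsum dolor sit amet");
        System.out.println(words.contains("ipsum")); // true
        System.out.println(words.contains("java"));  // false
    }
}
```

The answers are always exact, but every distinct element is retained on the heap, which is the cost a Bloom filter is meant to avoid.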
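Bloom filter
• Returns a wrong answer (membership result) with some probability
• False positives (mistaking a non-existent element for an existing one) can occur, but false negatives never do
• Heap usage can be controlled by specifying the expected number of distinct elements and the acceptable false-positive probability
• Elements can be added, but removing them is difficult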
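To make the properties above concrete, here is a minimal Bloom filter sketch built on `java.util.BitSet`. It is an illustration under stated assumptions, not the stream-lib implementation: the class name `SimpleBloomFilter`, the FNV-1a-style hashing with double hashing, and the textbook sizing formulas m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2 are choices made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    // Sizes the bit array and hash count from the expected number of
    // elements (n) and the acceptable false-positive probability (p).
    SimpleBloomFilter(int expectedElements, double falsePositiveProbability) {
        if (expectedElements <= 0 || falsePositiveProbability <= 0 || falsePositiveProbability >= 1) {
            throw new IllegalArgumentException("need n > 0 and 0 < p < 1");
        }
        double ln2 = Math.log(2);
        this.numBits = Math.max(1, (int) Math.ceil(
                -expectedElements * Math.log(falsePositiveProbability) / (ln2 * ln2)));
        this.numHashes = Math.max(1,
                (int) Math.round((double) numBits / expectedElements * ln2));
        this.bits = new BitSet(numBits);
    }

    int sizeInBits() {
        return numBits;
    }

    // Sets one bit per hash function. Bits are never cleared,
    // which is why removal is not supported.
    void add(String element) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(element, i));
        }
    }

    // false means "definitely absent"; true means "probably present".
    boolean mightContain(String element) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(element, i))) {
                return false;
            }
        }
        return true;
    }

    // Double hashing: derives the i-th bit position from two base hashes.
    private int index(String element, int i) {
        byte[] bytes = element.getBytes(StandardCharsets.UTF_8);
        int h1 = hash(bytes, 0x811C9DC5);
        int h2 = hash(bytes, 0x01000193) | 1; // odd step size
        return Math.floorMod(h1 + i * h2, numBits);
    }

    // FNV-1a-style hash with a caller-supplied starting value.
    private static int hash(byte[] data, int start) {
        int h = start;
        for (byte b : data) {
            h ^= (b & 0xFF);
            h *= 16777619;
        }
        return h;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1_000, 0.01);
        filter.add("lorem");
        filter.add("ipsum");
        System.out.println(filter.mightContain("ipsum")); // always true: no false negatives
        System.out.println(filter.mightContain("java"));  // usually false, but may be a false positive
    }
}
```

With n = 1,000 and p = 0.01 the formulas give a bit array of roughly 9,600 bits (about 1.2 KB) regardless of how long the stored strings are.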
stream-lib's Bloom filter
• Create a BloomFilter by specifying the number of elements and the false-positive probability
• Test for presence with BloomFilter#isPresent(String)
• Add elements with add(), just as with a Set
Checking heap usage
• Heap usage measured with JOL (Java Object Layout), using the "Lorem ipsum" text as an example
• http://openjdk.java.net/projects/code-tools/jol/
• Set: 6,032 bytes
• stream-lib BloomFilter: 136 bytes
• About 97.7% smaller!
Cardinality estimation
Counting distinct elements
• Prepare a Set and simply keep pushing elements into it with Set#add()
• Set#size() then gives the number of distinct elements
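The exact approach above amounts to only a few lines of JDK code. The sketch below is illustrative; the class and method names are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class ExactDistinctCount {
    // Pushes every element into a Set and reports its size.
    static int countDistinct(Iterable<String> stream) {
        Set<String> seen = new HashSet<>();
        for (String element : stream) {
            seen.add(element);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(countDistinct(Arrays.asList("a", "b", "a", "c", "b"))); // 3
    }
}
```

Memory grows linearly with the number of distinct elements, which is what a cardinality estimator such as HyperLogLog++ trades away accuracy to avoid.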
HyperLogLog++ (1/2)
• A data structure that estimates the number of distinct elements
• The estimate can come out either higher or lower than the true distinct count
• Used in Redshift / BigQuery / Presto and others as a way to approximate COUNT(DISTINCT x)
• https://aws.amazon.com/jp/about-aws/whats-new/2013/11/11/amazon-redshift-new-performance-data-loading-security-features/
• https://cloud.google.com/blog/big-data/2017/07/counting-uniques-faster-in-bigquery-with-hyperloglog
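HyperLogLog++ (2/2)
• Heap usage can be controlled by tuning the estimation precision p
• A larger value gives higher accuracy and worse space efficiency
• Choose p by weighing the expected number of distinct elements, the accuracy you need, and your heap constraints
• The following blog entry is recommended for understanding how HyperLogLog works
• http://blog.brainpad.co.jp/entry/2016/06/27/110000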
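The effect of p can be estimated before running anything. The sketch below assumes the usual HyperLogLog setup of m = 2^p registers and uses the standard-error approximation 1.04 / sqrt(m) from the original HyperLogLog analysis; the class name is hypothetical, and the real error and heap usage of a particular implementation may differ, so treat the numbers as a starting point only.

```java
class HllPrecision {
    // Number of registers for precision p (m = 2^p).
    static int registers(int p) {
        return 1 << p;
    }

    // Approximate relative standard error: 1.04 / sqrt(m).
    static double standardError(int p) {
        return 1.04 / Math.sqrt(registers(p));
    }

    public static void main(String[] args) {
        for (int p = 10; p <= 16; p += 2) {
            System.out.printf("p=%d registers=%d stderr=%.2f%%%n",
                    p, registers(p), 100 * standardError(p));
        }
    }
}
```

For example, p = 14 gives 16,384 registers and an expected relative error of about 0.81%; each step of +2 in p quadruples the register count while only halving the error.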
stream-lib's HyperLogLog++
• Create a HyperLogLogPlus() by specifying the precision
• Add elements with HyperLogLogPlus#offer()
• HyperLogLogPlus#cardinality() gives the number of distinct elements
Frequency counting
Counting element frequencies
• Represent a counter per element with a Map
• Simply keep counting up for each element
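A minimal JDK-only sketch of the Map-based counter described above; the class and method names are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class ExactFrequency {
    // Keeps one exact counter per distinct element.
    static Map<String, Long> count(Iterable<String> stream) {
        Map<String, Long> counts = new HashMap<>();
        for (String element : stream) {
            counts.merge(element, 1L, Long::sum); // insert 1 or add 1 to the existing count
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = count(Arrays.asList("a", "b", "a"));
        System.out.println(counts.get("a")); // 2
        System.out.println(counts.get("b")); // 1
    }
}
```

As with the Set-based approaches, the map holds one entry per distinct element, so heap usage grows with the variety of the input.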
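Count-min sketch (1/2)
• One of the data structures for estimating element frequencies
• It may return an estimate larger than the true frequency, but never a smaller one
• The lower an element's frequency, the more it is affected by this bias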
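Count-min sketch (2/2)
• Two parameters, width and depth, control space efficiency and accuracy
• width * depth counters are created
• The counters are represented as a two-dimensional array
• A hash function is evaluated depth times per operation, so depth affects runtime performance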
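The structure described above (a depth × width counter array with one hash per row) can be sketched in plain Java as follows. This is an illustrative implementation, not stream-lib's: the class name `SimpleCountMinSketch`, the per-row seeds, and the bit-mixing hash are assumptions made for the example, and counts are assumed to be non-negative.

```java
import java.util.Random;

class SimpleCountMinSketch {
    private final long[][] counters; // depth rows x width columns
    private final int[] seeds;       // one hash seed per row
    private final int width;

    SimpleCountMinSketch(int depth, int width, long seed) {
        if (depth <= 0 || width <= 0) {
            throw new IllegalArgumentException("depth and width must be positive");
        }
        this.width = width;
        this.counters = new long[depth][width];
        this.seeds = new int[depth];
        Random random = new Random(seed);
        for (int row = 0; row < depth; row++) {
            seeds[row] = random.nextInt();
        }
    }

    // Increments one counter in every row (depth hash evaluations).
    void add(String item, long count) {
        for (int row = 0; row < counters.length; row++) {
            counters[row][column(item, row)] += count;
        }
    }

    // Takes the minimum over the rows. Collisions can only inflate a
    // counter, so the estimate is never below the true count.
    long estimateCount(String item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < counters.length; row++) {
            min = Math.min(min, counters[row][column(item, row)]);
        }
        return min;
    }

    // Maps an item to a column in the given row using a seeded, mixed hash.
    private int column(String item, int row) {
        int h = item.hashCode() ^ seeds[row];
        h ^= (h >>> 16);
        h *= 0x85EBCA6B;
        h ^= (h >>> 13);
        return Math.floorMod(h, width);
    }

    public static void main(String[] args) {
        SimpleCountMinSketch sketch = new SimpleCountMinSketch(4, 64, 42L);
        sketch.add("lorem", 3);
        sketch.add("ipsum", 1);
        System.out.println(sketch.estimateCount("lorem")); // at least 3
    }
}
```

Heap usage is fixed at depth × width counters no matter how many distinct items are added; widening the array reduces collisions, while adding rows makes it less likely that every row collides for the same item.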
stream-lib's Count-min sketch
• Create a Count-Min sketch backed by width:10 * depth:30 counters
• Count with CountMinSketch#add(String, int)
Quantile estimation
Computing percentiles
• Put the values into an array in sorted order
• After that, simply look up the n-th percentile value
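A minimal sketch of the exact approach above. The slide does not say which percentile definition is used, so this example assumes the nearest-rank definition (rank = ceil(p / 100 × n)); the class and method names are hypothetical.

```java
import java.util.Arrays;

class ExactPercentile {
    // Nearest-rank percentile: sort a copy, then index into it.
    static double percentile(double[] values, double p) {
        if (values.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need a non-empty array and 0 < p <= 100");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length); // 1-based rank
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        double[] latencies = {5, 1, 4, 2, 3};
        System.out.println(percentile(latencies, 50));  // 3.0
        System.out.println(percentile(latencies, 100)); // 5.0
    }
}
```

This needs every value in memory and an O(n log n) sort, which is the cost a quantile sketch such as t-digest avoids.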
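t-digest
• A data structure that estimates rank-based values (order statistics) of a sequence of numbers
• It represents the empirical distribution approximately
• Percentile values are computed from this approximate empirical distribution using interpolation
• A compression parameter adjusts the trade-off between accuracy and space efficiency
• A larger value gives higher accuracy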
stream-lib's t-digest
• Create a TDigest by specifying the compression parameter
• Add values with TDigest#add(double)
• Get percentile values with TDigest#quantile(double)
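Summary
• Probabilistic data structures can make large-scale data processing and online processing efficient (maybe)
• If you want an easy way to use probabilistic data structures in Java, consider stream-lib first
• Tuning the parameters that control the trade-off between estimation accuracy and space efficiency tends to become a craftsman's art
• I recommend tuning while properly measuring actual space and time efficiency with JOL and JMH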
Thank you!