Blog / https://kakakakakku.hatenablog.com/entry/2016/10/14/220055
Bloom Filter: Handy to Know
2016-10-14, in-house study session
@kakakakakku
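Bloom Filter
• Devised in 1970
• Inventor: Burton Howard Bloom
• A space-efficient probabilistic data structure
• Used to test whether an element is contained in a set
• Answers a query quickly in O(k) time, where k is the number of hash functions, regardless of how much data is stored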
[Figure: a set, with candidate elements labeled "contained?" and "not contained?"]
Knowing some real-world uses should make Bloom Filter feel much more familiar!
Bloom Filter use cases 1
• Cassandra
  • Writes a Bloom Filter into each SSTable to reduce I/O when looking up a key
• HBase
  • Uses a Bloom Filter to check that an HFile does not contain specific data
Bloom Filter use cases 2
• H2O
  • To avoid sending needless server pushes, compresses a Bloom Filter of the browser's cache state and returns it over HTTP: CASPER (Cache-Aware Server PushER)
• Bitcoin
  • Used for transaction lookup? (not verified in detail)
Bloom Filter use cases 3
• pixiv
  • Used to check whether tags attached to works exist in the tag set of the pixiv encyclopedia?
  http://inside.pixiv.net/entry/2014/07/22/132103
\ How a Bloom Filter behaves /
Bits (index 0 to 15): 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
• Prepare an array of m bits
• Here the array size is m = 16
• Initialize every bit to 0
Bits (index 0 to 15): 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
• Prepare some hash functions
• Here we use k = 2 functions
• h1(key) = (key * 1) mod m
• h2(key) = (key * 2) mod m
Bits (index 0 to 15): 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
• Add key = 1000
• h1(1000) = 8
• h2(1000) = 0
Set: [ 1000 ]
Bits (index 0 to 15): 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0
• Add key = 1001
• h1(1001) = 9
• h2(1001) = 2
Set: [ 1000, 1001 ]
Bits (index 0 to 15): 1 0 1 0 0 0 0 0 1 1 0 0 1 0 0 0
• Add key = 1004
• h1(1004) = 12
• h2(1004) = 8
• This collides with h1(1000) = 8
• The bit simply stays 1
Set: [ 1000, 1001, 1004 ]
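The three insertions above can be replayed in a few lines of Python. This is only a minimal sketch of the slides' toy setup (m = 16, h1 and h2 as defined earlier); the names `M`, `add`, and `set_positions` are made up for illustration, and real filters use far better hash functions.

```python
M = 16  # number of bits, as in the slides


def h1(key):
    return (key * 1) % M


def h2(key):
    return (key * 2) % M


def add(bits, key):
    # Set the bit at every hashed position; a bit that is already 1 stays 1.
    for h in (h1, h2):
        bits[h(key)] = 1


bits = [0] * M
for key in (1000, 1001, 1004):
    add(bits, key)

set_positions = [i for i, b in enumerate(bits) if b]
print(set_positions)  # [0, 2, 8, 9, 12]
```

Only five bits end up set for six hash computations, because h2(1004) lands on the same position as h1(1000).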
Bits (index 0 to 15): 1 0 1 0 0 0 0 0 1 1 0 0 1 0 0 0
• Query: does key = 1005 exist?
• h1(1005) = 13
• h2(1005) = 10
• The bits at positions 13 and 10 are both 0
• We can say for certain that it "does not exist"
Set: [ 1000, 1001, 1004 ]
Bits (index 0 to 15): 1 0 1 0 0 0 0 0 1 1 0 0 1 0 0 0
• Query: does key = 1000 exist?
• h1(1000) = 8
• h2(1000) = 0
• The bits at positions 8 and 0 are both 1
• It "may exist"
• 1000 really is in the set
Set: [ 1000, 1001, 1004 ]
Bits (index 0 to 15): 1 0 1 0 0 0 0 0 1 1 0 0 1 0 0 0
• Query: does key = 1020 exist?
• h1(1020) = 12
• h2(1020) = 8
• The bits at positions 12 and 8 are both 1
• It "may exist"
• 1020 is not actually in the set
Set: [ 1000, 1001, 1004 ]
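The three queries can be checked with a small sketch under the same toy assumptions (m = 16, the two modular hashes from the slides). The helper names `might_contain` and `results` are hypothetical.

```python
M = 16
# The slides' two hash functions: (key * 1) mod m and (key * 2) mod m.
HASHES = (lambda key: (key * 1) % M, lambda key: (key * 2) % M)


def add(bits, key):
    for h in HASHES:
        bits[h(key)] = 1


def might_contain(bits, key):
    # False means "definitely absent"; True only means "possibly present".
    return all(bits[h(key)] == 1 for h in HASHES)


bits = [0] * M
for key in (1000, 1001, 1004):
    add(bits, key)

results = {key: might_contain(bits, key) for key in (1005, 1000, 1020)}
print(results)  # {1005: False, 1000: True, 1020: True}
```

The answer for 1020 is `True` even though it was never added, because positions 12 and 8 were already set by 1004.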
Huh? That answer was wrong, wasn't it...?
• False Positive: judging that something "exists" when it does not
• False Negative: judging that something "does not exist" when it does
↑ Of these two, a Bloom Filter can produce False Positives
The possibility of False Positives
• The price of fast O(k) queries is that False Positives can occur
• So a key such as 1020 may be wrongly reported as "exists"
• However, False Negatives are 100% impossible
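Both claims can be checked by brute force on the toy filter. The sketch below assumes the slides' setup (m = 16, k = 2, keys 1000, 1001, 1004); the probed key range 1000 to 1031 is an arbitrary choice made for illustration.

```python
M = 16
HASHES = (lambda key: (key * 1) % M, lambda key: (key * 2) % M)


def build(keys):
    bits = [0] * M
    for key in keys:
        for h in HASHES:
            bits[h(key)] = 1
    return bits


def might_contain(bits, key):
    return all(bits[h(key)] for h in HASHES)


members = {1000, 1001, 1004}
bits = build(members)

# Every added key sets all of its own bits, so it can never be reported absent.
false_negatives = [k for k in members if not might_contain(bits, k)]

# Keys that were never added but whose bits all happen to be set.
false_positives = [
    k for k in range(1000, 1032)
    if k not in members and might_contain(bits, k)
]

print(false_negatives)  # []
print(false_positives)  # [1008, 1016, 1017, 1020, 1024]
```

The false positive rate is high here because the filter is tiny and the two hashes are strongly correlated; it is not representative of a properly sized filter.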
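Advantages
• Time complexity is O(k)
  • Linear search is O(N)
  • Binary search is O(log N)
  • A hash table is O(1), so isn't that even faster?
  • With k = 1, a Bloom Filter is O(1) as well
• No need to store the data itself, so it is space-efficient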
\ A Bloom Filter cannot delete elements /
Bits (index 0 to 15): 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0
• If we delete 1000 by clearing its bits
  • h1(1000) = 8
  • h2(1000) = 0
• then 1004 gets deleted as well !!!
  • h1(1004) = 12
  • h2(1004) = 8
Set before deletion: [ 1000, 1001, 1004 ]
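The breakage is easy to reproduce. The sketch below uses a hypothetical `remove_naive` that simply clears the bits of a key, which is exactly the operation a plain Bloom Filter cannot support safely; the setup (m = 16, two modular hashes) follows the slides.

```python
M = 16
HASHES = (lambda key: (key * 1) % M, lambda key: (key * 2) % M)


def add(bits, key):
    for h in HASHES:
        bits[h(key)] = 1


def remove_naive(bits, key):
    # Unsafe: a cleared bit may be shared with another key.
    for h in HASHES:
        bits[h(key)] = 0


def might_contain(bits, key):
    return all(bits[h(key)] for h in HASHES)


bits = [0] * M
for key in (1000, 1001, 1004):
    add(bits, key)

remove_naive(bits, 1000)  # clears positions 8 and 0

still_has_1001 = might_contain(bits, 1001)
still_has_1004 = might_contain(bits, 1004)
print(still_has_1001, still_has_1004)  # True False
```

1004 now looks absent because it shared position 8 with 1000, which turns the filter's "no False Negatives" guarantee into a lie.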
\ To delete elements, use a Counting Filter /
An algorithm that extends the Bloom Filter
Counters (index 0 to 15): 1 0 1 0 0 0 0 0 2 1 0 0 1 0 0 0
• Add key = 1004
• h1(1004) = 12
• h2(1004) = 8
• On a collision, increment the counter
Set: [ 1000, 1001, 1004 ]
Unlike a Bloom Filter, each slot is a counter rather than a single bit
Counters (index 0 to 15): 0 0 1 0 0 0 0 0 1 1 0 0 1 0 0 0
• Delete key = 1000
• h1(1000) = 8
• h2(1000) = 0
• Decrement the counters
Set before deletion: [ 1000, 1001, 1004 ]
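A Counting Filter version of the same walk-through might look like the sketch below. It assumes the slides' toy parameters (m = 16, two modular hashes); the function names are hypothetical, and it assumes `remove` is only ever called for keys that were actually added.

```python
M = 16
HASHES = (lambda key: (key * 1) % M, lambda key: (key * 2) % M)


def add(counters, key):
    # Increment instead of setting a bit, so collisions are remembered.
    for h in HASHES:
        counters[h(key)] += 1


def remove(counters, key):
    # Assumption: the key was added before; otherwise counters get corrupted.
    for h in HASHES:
        counters[h(key)] -= 1


def might_contain(counters, key):
    return all(counters[h(key)] > 0 for h in HASHES)


counters = [0] * M
for key in (1000, 1001, 1004):
    add(counters, key)
after_add = list(counters)   # position 8 holds 2

remove(counters, 1000)
after_remove = list(counters)

print(after_add)
print(after_remove)
print(might_contain(counters, 1004), might_contain(counters, 1000))  # True False
```

The cost is memory: each slot now needs several bits instead of one.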
\ False Positive probability /
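A Bloom Filter's False Positive probability can be computed from formulas
• m : array size (bits)
• n : number of registered elements
• [Formula image: approximation of the optimal number of hash functions k that minimizes False Positives]
• [Formula image: False Positive probability when the optimal k is used]
See Wikipedia for details! https://ja.wikipedia.org/wiki/ブルームフィルタ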
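The formula images did not survive in this transcript. The sketch below uses the standard textbook approximations, which are assumed (not confirmed) to be what the slide showed: the optimal k is about (m / n) ln 2, and the False Positive probability for a given k is about (1 - e^(-kn/m))^k. Both assume independent, uniformly distributed hash functions, which the toy h1 and h2 from the walk-through are not.

```python
import math


def optimal_k(m, n):
    # Approximate number of hash functions that minimizes false positives.
    return (m / n) * math.log(2)


def false_positive_rate(m, n, k):
    # Standard approximation, assuming independent uniform hashes.
    return (1 - math.exp(-k * n / m)) ** k


m, n = 16, 3                      # the toy filter from the walk-through
k_best = optimal_k(m, n)
p_with_k2 = false_positive_rate(m, n, 2)
p_with_best = false_positive_rate(m, n, k_best)

print(k_best, p_with_k2, p_with_best)
```

At the optimal k the expression simplifies to (1/2)^k, roughly 0.6185^(m/n), so each extra bit per element multiplies the False Positive probability by about 0.62. In practice k must be an integer, so the result is rounded.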
\ No need to worry /
With the optimal k, the False Positive probability can be made extremely low
\ Summary /
False Positives are possible and elements cannot be deleted, but by making the most of that trade-off you get fast and space-efficient processing!
\ Bloom Filter: Handy to Know /