A Study on Speeding Up Full-Text Search Tools with a Lightweight Index Mechanism / wsa6_sifter

2020.04.26 Web System Architecture Research Group (WSA研) #6
https://websystemarchitecture.hatenablog.jp/entry/2019/12/11/165624

monochromegane

April 26, 2020
Transcript

  1. A Study on Speeding Up Full-Text Search Tools with a Lightweight Index Mechanism
     Yusuke Miyake (三宅悠介) / Pepabo R&D Institute, GMO Pepabo, Inc.
     2020.04.26 Web System Architecture Research Group (WSA研) #6
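  2. Full-text search tools
     • Command-line tools that search one or more text files for a specified string
     • grep, ag, pt, etc.
     • Used to search the source code under a project
     • Differentiated by a wide variety of options
       • Colored results, display of surrounding lines, honoring gitignore, character-encoding support, and so on
     • The main differentiating factor is search speed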
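  3. Speeding up full-text search tools
     • Recursive full-text search consists of three elements: find, grep, and print
     • Each element offers its own opportunities for speedup [1]
       • find: fewer stat system calls by using readdirent, parallelization
       • grep: searching fixed-length chunks instead of line by line and restoring the lines afterwards, SIMD, efficient algorithms, parallelization (the OS file cache also helps considerably)
       • print: buffering, so that output does not become a bottleneck for the preceding stages
     • Use computing resources efficiently and to the fullest, including the degree of parallelism [2]
     References:
     [1] Yusuke Miyake, Optimization for Number of goroutines Using Feedback Control, GopherCon, Marriott Marquis San Diego Marina, California, July.
     [2] Yusuke Miyake, the_platinum_searcher, https://github.com/monochromegane/the_platinum_searcher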
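  4. Speeding up full-text search engines (inverted index)
     • A dictionary that maps each term to the list of IDs of the documents in which it appears
     • Term frequency and term positions can also be tracked
     • Well suited to intersection queries
     • The index grows in proportion to the number of terms and the number of documents
     • However, many compression techniques appear to exist [3] (note: needs a survey)
     References:
     [3] Christopher D. Manning, Prabhakar Raghavan, Hinrich Schutze; translated by 岩野和生, 黒川利明, 濱田誠司, 村上明子, 情報検索の基礎 (Japanese edition of Introduction to Information Retrieval), 共立出版.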
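The term-to-document-ID dictionary described on this slide can be sketched in a few lines of Go. This is a minimal illustration, not code from the talk: the whitespace tokenizer, the names `BuildIndex` and `Intersect`, and the sample documents are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// InvertedIndex maps a term to the sorted IDs of the documents containing it.
type InvertedIndex map[string][]int

// BuildIndex tokenizes each document on whitespace and records, per term,
// the documents in which it occurs. Document IDs are slice positions, so
// every posting list comes out sorted in ascending order.
func BuildIndex(docs []string) InvertedIndex {
	idx := InvertedIndex{}
	for id, doc := range docs {
		seen := map[string]bool{}
		for _, term := range strings.Fields(doc) {
			if !seen[term] {
				seen[term] = true
				idx[term] = append(idx[term], id)
			}
		}
	}
	return idx
}

// Intersect returns the IDs present in both sorted posting lists.
func Intersect(a, b []int) []int {
	out := []int{}
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] == b[j]:
			out = append(out, a[i])
			i++
			j++
		case a[i] < b[j]:
			i++
		default:
			j++
		}
	}
	return out
}

func main() {
	docs := []string{"grep is fast", "pt is fast too", "grep and pt"}
	idx := BuildIndex(docs)
	// Documents containing both "grep" and "fast".
	fmt.Println(Intersect(idx["grep"], idx["fast"])) // prints [0]
}
```

Because every distinct term gets its own posting list, the map grows with both the vocabulary and the number of documents, which is the size concern raised above.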
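  5. Speeding up full-text search engines (Bloom filter)
     • A probabilistic data structure for asking whether an arbitrary element is contained in a set
     • The size of the filter does not depend on the number of terms
     • Adding an element and querying for an element both take constant time
     • However, a query can produce false positives
     • Build one Bloom filter per document, then search this set of filters for the documents that contain a term
     • A greatly improved version of this approach was used in Bing's search engine (BitFunnel) [4][5]
     References:
     [4] Bing検索の裏側―BitFunnelのアルゴリズム (Behind Bing search: the BitFunnel algorithm), https://developer.hatenastaff.com/entry/…
     [5] Bob Goodwin, Michael Hopcroft, Dan Luu, Alex Clemmer, Mihaela Curmei, Sameh Elnikety, and Yuxiong He. BitFunnel: Revisiting Signatures for Search. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). Association for Computing Machinery, New York, NY, USA. DOI: https://doi.org/…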
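  6. Bloom filter (adding an element)
     • A Bloom filter consists of an array of m bits
     • An element is converted into a set of array index positions obtained from k hash functions
     • A set is represented as the array in which the bits at the union of all elements' indices are set to 1
     [Figure: Bloom filter (m = 10), hash functions (k = 2). H1(element1) = 0 and H2(element1) = 9, so the bits at indices 0 and 9 are set to 1.]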
  7. Bloom filter (adding an element, continued)
     [Figure: Bloom filter (m = 10), hash functions (k = 2). H1(element2) = 1 and H2(element2) = 9, so the bit at index 1 is set in addition; the bits at indices 0, 1, and 9 are now 1.]
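The add and query operations on these two slides can be sketched as follows. This is a minimal illustration under stated assumptions: the slides do not say which hash functions are used, so the sketch derives the k functions by seeding FNV-1a with the function number, and the names `Bloom`, `Add`, and `MayContain` are hypothetical. The bit positions it produces will therefore differ from the H1/H2 values in the figures.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Bloom is a Bloom filter with m bits (len(bits)) and k hash functions.
type Bloom struct {
	bits []bool
	k    int
}

// NewBloom allocates an empty filter of m bits that uses k hash functions.
func NewBloom(m, k int) *Bloom {
	return &Bloom{bits: make([]bool, m), k: k}
}

// index returns the position chosen by the i-th hash function.
// Seeding FNV-1a with the function number is an assumption of this sketch.
func (b *Bloom) index(i int, element string) int {
	h := fnv.New64a()
	h.Write([]byte{byte(i)})
	h.Write([]byte(element))
	return int(h.Sum64() % uint64(len(b.bits)))
}

// Add sets the k bits selected for the element.
func (b *Bloom) Add(element string) {
	for i := 0; i < b.k; i++ {
		b.bits[b.index(i, element)] = true
	}
}

// MayContain returns false only when the element was definitely never added;
// true means "probably present" and can be a false positive.
func (b *Bloom) MayContain(element string) bool {
	for i := 0; i < b.k; i++ {
		if !b.bits[b.index(i, element)] {
			return false
		}
	}
	return true
}

func main() {
	f := NewBloom(10, 2) // m = 10, k = 2 as in the figures
	f.Add("element1")
	f.Add("element2")
	fmt.Println(f.MayContain("element1"), f.MayContain("element2")) // prints true true
}
```

Both operations touch exactly k bits regardless of how many elements were added, which is the constant-time property the deck relies on.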
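  8. Requirements for an index mechanism in full-text search tools
     • Lightweight
       • The smaller the index, the faster it loads (that is, the faster a query starts)
     • Fast
       • At build time: if the index can be built quickly, it can follow changes in the search target without imposing a burden
       • At search time: if the index can be queried quickly, the wall-clock time of the whole full-text search is shortened
     • General-purpose
       • Usefulness increases when the index does not depend on a specific tool and can be combined with any of them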
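  9. Design of an index mechanism for full-text search tools
     • Lightweight / fast
       • Use Bloom filters
       • Exploit the properties that the size does not depend on the number of terms and that a query takes constant time
     • General-purpose
       • Provide a command that searches the index and returns the list of files that contain the keyword
       • Any full-text search tool then runs its search with that list as its arguments (Args)
       • False detections caused by false positives are filtered out by the full-text search tool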
  10. monochromegane/sifter
     • A lightweight index for full text search tools using bloom filter. [6]
     • A Go implementation of the proposed method (WIP)
     • "sifter" means a kitchen sieve, or a person who sorts things out
     References:
     [6] monochromegane/sifter, https://github.com/monochromegane/sifter
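  11. monochromegane/sifter
     • At query time every Bloom filter would have to be read, so the filters are stored as bit-sliced signatures to reduce the amount of data read and speed up the query
     $ sifter -m 5 -k 3 build
     [Figure: three Bloom filters (1, 1, 0, 0, 0 / 1, 0, 1, 0, 0 / 1, 0, 0, 1, 0). For H1 = 4, each filter is ANDed with 00001: a query is issued against every Bloom filter.]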
  12. monochromegane/sifter (continued)
     $ sifter -m 5 -k 3 build
     [Figure: the same three Bloom filters, each ANDed with 00001 for H1 = 4. In practice only the bit at the hashed index appears to be used.]
  13. monochromegane/sifter (continued)
     $ sifter -m 5 -k 3 build
     [Figure: the three Bloom filters (1, 1, 0, 0, 0 / 1, 0, 1, 0, 0 / 1, 0, 0, 1, 0) are rearranged into five bit slices (1, 1, 1 / 1, 0, 0 / 0, 1, 0 / 0, 0, 1 / 0, 0, 0). Only the bits at each index are gathered together; viewing the set of Bloom filters as an |F| × m matrix, this corresponds to its m × |F| transpose.]
  14. monochromegane/sifter (continued)
     $ sifter -m 5 -k 3 build
     [Figure: the five bit slices (1, 1, 1 / 1, 0, 0 / 0, 1, 0 / 0, 0, 1 / 0, 0, 0). For H1 = 4, only the slice gathered for that index is ANDed with 111: the query needs to touch only that slice.]
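The rearrangement shown across these slides is a matrix transpose, which the following Go sketch reproduces with the same three 5-bit filters. The function name `Transpose` and the use of `[]int` for bits are choices made for readability in this example, not details taken from sifter.

```go
package main

import "fmt"

// Transpose turns |F| filters of m bits each into m slices of |F| bits each.
// slices[i][f] is bit i of filter f.
func Transpose(filters [][]int) [][]int {
	if len(filters) == 0 {
		return nil
	}
	m := len(filters[0])
	slices := make([][]int, m)
	for i := range slices {
		slices[i] = make([]int, len(filters))
		for f := range filters {
			slices[i][f] = filters[f][i]
		}
	}
	return slices
}

func main() {
	// The three Bloom filters (m = 5) from the figures.
	filters := [][]int{
		{1, 1, 0, 0, 0},
		{1, 0, 1, 0, 0},
		{1, 0, 0, 1, 0},
	}
	slices := Transpose(filters)
	// A query for index 4 reads only slices[4] instead of all three filters.
	fmt.Println(slices[0], slices[4]) // prints [1 1 1] [0 0 0]
}
```

After the transpose, the data needed for one hashed index is contiguous, so the amount read per index is |F| bits rather than the whole set of filters.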
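  15. monochromegane/sifter (query)
     • At query time the pattern string is split into 3-grams, and the query is made with the union of the indices obtained from the hash functions for each 3-gram
     $ sifter -m 5 -k 3 find PATTERN
     [Figure: PATTERN is split into 3-grams (PAT, ATT, TER, ERN are shown), the hash functions Hk yield the index set 1, 1, 0, 0, 0, and the bit slices at those indices (from 1, 1, 1 / 1, 0, 0 / 0, 1, 0 / 0, 0, 1 / 0, 0, 0) are ANDed with 111. Result: ⭕❌❌. The file marked ⭕ "probably" contains the pattern, and the file name corresponding to it is output.]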
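The query flow on this slide (split the pattern into 3-grams, hash each one, AND the selected bit slices) can be sketched end to end in Go. This is a rough illustration rather than sifter's implementation: the seeded FNV-1a hashing, the constants m = 64 and k = 3, byte-wise 3-grams, and the names `Build` and `Find` are all assumptions.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const (
	m = 64 // bits per Bloom filter (the slide uses -m 5 for illustration)
	k = 3  // hash functions per token
)

// trigrams splits s into overlapping 3-byte tokens.
func trigrams(s string) []string {
	var out []string
	for i := 0; i+3 <= len(s); i++ {
		out = append(out, s[i:i+3])
	}
	return out
}

// indices returns the k bit positions of one token (seeded FNV-1a, an assumption).
func indices(token string) []int {
	out := make([]int, k)
	for i := range out {
		h := fnv.New64a()
		h.Write([]byte{byte(i)})
		h.Write([]byte(token))
		out[i] = int(h.Sum64() % m)
	}
	return out
}

// Build returns m bit slices; slices[i][f] is bit i of the filter of file f.
func Build(files []string) [][]bool {
	slices := make([][]bool, m)
	for i := range slices {
		slices[i] = make([]bool, len(files))
	}
	for f, content := range files {
		for _, t := range trigrams(content) {
			for _, i := range indices(t) {
				slices[i][f] = true
			}
		}
	}
	return slices
}

// Find ANDs the slices selected by the pattern's 3-grams and returns the
// numbers of the files that may contain the pattern (false positives possible).
func Find(slices [][]bool, nFiles int, pattern string) []int {
	match := make([]bool, nFiles)
	for f := range match {
		match[f] = true
	}
	for _, t := range trigrams(pattern) {
		for _, i := range indices(t) {
			for f := range match {
				match[f] = match[f] && slices[i][f]
			}
		}
	}
	var out []int
	for f, ok := range match {
		if ok {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	files := []string{"a PATTERN here", "nothing to see", "PAT only"}
	slices := Build(files)
	// File 0 is always among the candidates; other files appear only as false positives.
	fmt.Println(Find(slices, len(files), "PATTERN"))
}
```

Only the slices selected by the pattern's hashes are touched, so the work per query scales with the number of 3-grams and k rather than with m.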
  16. Evaluation environment
     • CentOS Linux release 8.1.1911 (Core) on Vagrant
     • CPU: 4, Memory: 5,120MB
     • https://github.com/torvalds/linux (c578ddb)
     • Total number of files: 67,947
     • Bloom filter (k = 3, m = 10,000)
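  17. Evaluation: combination with full-text search tools
     • Search keyword: 'GPL-2.0-or-later' (8,168/67,947 = approx. 12%)
     [Table: execution time in seconds, without cache and with cache, for grep, grep + sifter, pt, and pt + sifter; the numeric values are missing from this transcript.]
     By narrowing down the search targets in advance with the proposed method, a substantial improvement in overall search speed was confirmed, even after accounting for the execution time of sifter itself. Note that sifter returns its candidates within seconds (the candidate count and timings are missing from this transcript).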
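  18. Evaluation: combination with full-text search tools (continued)
     • Search keyword: 'GPL-2.0-or-later' (8,168/67,947 = approx. 12%)
     [Table: same layout as on the previous slide; the numeric values are missing from this transcript.]
     • Search keyword: '#define BYT_RT5640_MAP(quirk)' (2/67,947 = approx. 0.003%)
     [Table: execution time in seconds, without cache and with cache, for grep, grep + sifter, pt, and pt + sifter; the numeric values are missing from this transcript.]
     When the pre-filtering by the proposed method is highly effective, an even more pronounced reduction in execution time was confirmed (the sifter candidate count and timings are missing from this transcript). Note that the improvement seen for plain pt comes from an implementation optimization that skips detailed inspection when the pattern does not match.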
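  19. Evaluation: index size
     • 67,947 bits = 8,494 bytes per bit slice; 8,494 bytes × 10,000 slices = 84.94 MB
     • du -h linux: 1.2G
     • Even in total, the index is sufficiently small compared with the size of the repository
     • At query time only k × 8,494 bytes need to be read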
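The figures on this slide follow from the evaluation parameters (67,947 files, m = 10,000, k = 3). The short derivation below spells out the arithmetic; it assumes one bit per file in each bit slice, rounded up to whole bytes, and decimal megabytes.

```latex
\begin{align*}
\text{bytes per bit slice} &= \left\lceil \frac{67{,}947}{8} \right\rceil = 8{,}494 \\
\text{index size} &= 8{,}494 \times m = 8{,}494 \times 10{,}000 = 84{,}940{,}000 \text{ bytes} \approx 84.94 \text{ MB} \\
\text{data read for } k = 3 &= k \times 8{,}494 = 25{,}482 \text{ bytes}
\end{align*}
```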
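  20. Evaluation: index construction
     • Construction currently takes about 20 minutes for the 1.2G repository, so improvement is needed
     • The bottleneck is the hash functions [7][8]
       • For each character, the hash functions are executed for the {1,2,3}-grams × k (3) times (= 0.01 ms)
       • By this estimate, a 98 KB file takes roughly 1 s
       • $g_i(x) = h_1(x) + i\,h_2(x) \bmod m$
     • To speed up index construction, caching of hash results, use of a fast hash function [9], and more efficient tokenization need to be considered
     References:
     [7] Kirsch, Adam, and Michael Mitzenmacher. Less hashing, same performance: building a better bloom filter. European Symposium on Algorithms. Springer, Berlin, Heidelberg.
     [8] GolangでBloom Filterを実装してみた (Implementing a Bloom filter in Golang), https://cipepser.hatenablog.com/entry/…
     [9] MurmurHash, https://tanjent.livejournal.com/….html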
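The formula on this slide, g_i(x) = h1(x) + i·h2(x) mod m, derives all k indices from two base hashes, so only one or two real hash computations are needed per token. The Go sketch below is one way to realize it; taking h1 and h2 as the two 32-bit halves of a single 64-bit FNV-1a digest is an assumption of this example, not something the slide specifies.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// baseHashes derives h1 and h2 from one 64-bit FNV-1a digest by splitting it
// into its upper and lower 32-bit halves (an assumption of this sketch).
func baseHashes(x string) (uint64, uint64) {
	h := fnv.New64a()
	h.Write([]byte(x))
	sum := h.Sum64()
	return sum >> 32, sum & 0xffffffff
}

// Indices returns g_i(x) = (h1(x) + i*h2(x)) mod m for i = 0..k-1.
func Indices(x string, k int, m uint64) []uint64 {
	h1, h2 := baseHashes(x)
	out := make([]uint64, k)
	for i := range out {
		out[i] = (h1 + uint64(i)*h2) % m
	}
	return out
}

func main() {
	// k = 3 and m = 10,000 as in the evaluation environment.
	fmt.Println(Indices("PAT", 3, 10000))
}
```

With this scheme the cost per token is one digest plus k cheap additions and modulo operations, which is relevant to the hash-function bottleneck described above.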
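  21. Summary
     • Proposed a lightweight, fast, and general-purpose index mechanism that full-text search tools can use
     • Adopting Bloom filters made the index lightweight and the queries fast
     • Providing a separate tool that returns only the candidates increased generality
     • On the other hand, the execution time of the hash functions is a bottleneck and building the index for a large repository takes a long time, so further improvement is needed
     • Going forward, we would like to extend this research into the field of Web systems by also considering integration with architectures in which the query itself incurs overhead [10]
     References:
     [10] 阿部博, 島慶一, 宮本大輔, 関谷勇司, 石原知洋, 岡田和也, 中村遼, 松浦知史, 篠田陽一. 時間軸検索に最適化したスケールアウト可能な高速ログ検索エンジンの実現と評価 (Implementation and evaluation of a scalable high-speed log search engine optimized for time-axis search). 情報処理学会論文誌, Vol. …, No. …, pp. ….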