The first step self made full text search

ryokato
September 17, 2019

Transcript

  1. Introduction to Building Your Own Search Engine (PyCon JP)

  2. About me
     • Ryo Kato (@_ryook)
     • Server side / API / search
     • python, Django, Elasticsearch

  3. Promotion: https://search-tech.connpass.com/ (searchtechjp)

  4. Promotion: we are looking for speakers and for venue and after-party sponsors
     • Speaker application form
     • Sponsor application form

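  5. About this talk
     • An introduction to building your own search engine
     • The story of getting started with writing a search engine from scratch
     • Surely everyone wants to build a search engine at least once
     • A talk that shares, one-sidedly, what I learned building a search engine for the first time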

  6. Audience
     • People who do not know full-text search, or know it only vaguely
     • People interested in search
     • People who want to try building a search engine

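  7. What this talk does not cover
     • Search applications built on full-text search engines such as Elasticsearch/Solr, and tips for using them from python
       • If you want to know about those, see the earlier PyCon JP talk (※1) or come to the Search Tech study group
       • Or let's talk during a free slot
     • Crawling and scraping
       • If you are curious, read a book or see the blogs of @shinyorke (※2) and @vaaaaanquish (※3)
     • ※1 https://slideship.com/users/@iktakahiro/presentations/…
     • ※2 https://shinyorke.hatenablog.com/entry/… (kowakunai crawl and scraping)
     • ※3 https://vaaaaaanquish.hatenablog.com/entry/…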
  8. Contents
     • About full-text search
     • Building a simple search engine
     • Further improvements
     • Summary

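  9. About full-text search
     • What is full-text search?
       • "When the search target is every sentence of a document made of text, performing a search over that document" (※1)
     • What is a search engine?
       • "A general term for systems and software that find (search for) documents in a collection that match an information need expressed as words, questions and so on" (※1)
     • ※1 「検索エンジン自作入門」 (Introduction to Building Your Own Search Engine), Gijutsu-Hyoron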
  10. A Python search engine you can build in minutes
      • Features
        • Text can be added
        • Keyword search works

  11. A Python search engine you can build in minutes

          def add_text(text: str):
              with open("db.txt", "a") as f:
                  f.write(text + "\n")


          def search(keyword: str):
              with open("db.txt", "r") as f:
                  return [l.strip() for l in f if keyword in l]


          if __name__ == "__main__":
              texts = [
                  "Beautiful is better than ugly.",
                  "Explicit is better than implicit.",
                  "Simple is better than complex."
              ]
              for text in texts:
                  add_text(text)
              results = search("Simple")
              for result in results:
                  print(result)
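  12. About full-text search
      • Grep type (※1)
        • Performs a linear scan
        • Some argue that on today's computers this is enough for simple queries over a collection the size of Shakespeare's complete works (about one million words) (※2)
      • Index type (※1)
        • Scans the target documents in advance and builds index data
      • Vector type
        • Builds feature vectors and computes the distance between them
      • ※1 https://ja.wikipedia.org/wiki/全文検索
      • ※2 「情報検索の基礎」 (Introduction to Information Retrieval), Kyoritsu Shuppan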
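  13. Index type
      • Scan the target documents in advance and prepare index data
      • Creating the index file is called indexing; the data it produces is called the index
      • There are many ways to build an index, but the most common is the inverted index
        • Adopted by many full-text search engines
        • Think of the index at the back of a book
        • Keywords and pages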
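  14. Inverted index
      • An index-type data structure that holds a mapping from each word to the documents that contain it (※1)
      • Made up of a dictionary and postings
      • [Figure: an inverted index. The dictionary holds terms such as Python and Elasticsearch; each term points to a postings list, and each entry in the list is a posting (※2)]
      • ※1 wikipedia
      • ※2 Figure from 「情報検索の基礎」 (Introduction to Information Retrieval), Kyoritsu Shuppan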
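The dictionary-plus-postings structure described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three sample documents, the `build_inverted_index` name and the whitespace tokenization are all hypothetical and do not come from the talk.

```python
from collections import defaultdict


def build_inverted_index(docs: dict) -> dict:
    """Map each whitespace-separated term to a sorted list of document IDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    # The sorted ID lists play the role of postings lists.
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical sample documents (not from the talk).
docs = {
    1: "Python Elasticsearch",
    2: "Python Django",
    3: "Elasticsearch search",
}
index = build_inverted_index(docs)
print(index["Python"])           # postings list for the term "Python"
print(index.get("python", []))   # a differently cased key finds nothing
```

The dictionary keys are the terms and each value is one postings list, so a lookup costs one dictionary access instead of a scan over every document.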
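  15. Index granularity of an inverted index
      • Record-level inverted index
        • For each word, holds a list of the documents (document IDs) that contain it
        • e.g. Python → [list of document IDs]
        • Pros: simple and easy to implement; uses little disk space
        • Cons: limited functionality
      • Word-level inverted index
        • For each word, holds the documents (document IDs) that contain it plus extra information (e.g. the positions where the word occurs)
        • e.g. Python → [list of document IDs, each with positions]
        • Pros: more functionality; for example, phrase search becomes possible
        • Cons: uses more disk space
      • ※ 「情報検索の基礎」 (Introduction to Information Retrieval), Kyoritsu Shuppan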
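  16. Inverted index
      • In python, this is a dictionary
      • A value can only be retrieved when the key matches exactly
      • So the unit used to build the index is important
      • Inverted index entry: Python → [list of document IDs]
        • Query "Python": ○
        • Query "python": ✕
        • Query "Py": ✕
        • Query "パイソン": ✕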
  17. Ways to extract index terms
      • Word splitting (tokenization)
        • Whitespace
        • Morphological analysis (Japanese)
      • N-gram
      • Signatures
      • Suffix arrays
      • etc.
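  18. Morphological analysis (whitespace) vs. N-gram
      Each row reads: morphological analysis / N-gram
      • Tokenization: e.g. 「東京」「都知事」 / e.g. 「東京」「京都」「都知」「知事」
      • Number of tokens: few / many
      • Index size: small / large
      • Missed matches: many / few
      • Handles new words: ✕ / ○
      • Noise: low / high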
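To make the comparison concrete, here is a minimal sketch of character N-gram tokenization next to whitespace tokenization. The function names are hypothetical; the bigram input is the 「東京都知事」 example implied by the tokens on the slide.

```python
def whitespace_tokens(text: str) -> list:
    """Split on runs of whitespace (works for English-like text)."""
    return text.split()


def ngram_tokens(text: str, n: int = 2) -> list:
    """Return every contiguous n-character window of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


bigrams = ngram_tokens("東京都知事")
words = whitespace_tokens("Simple is better than complex.")
print(bigrams)
print(words)
```

The bigram output contains 「京都」 even though the text is not about Kyoto, which is the kind of noise the comparison refers to; in exchange, no dictionary is needed, so unknown words are still indexed.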
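  19. Index generation
      • When generating an index, the text needs processing beyond tokenization
      • What is needed depends on the kind of search engine
      • e.g. an engine that searches HTML
        • Targeting only the content, like google
          • HTML tags are unnecessary → strip the HTML tags
        • Targeting the code, like Github
          • HTML tags are needed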
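  20. Index generation steps
      • Process the text as a whole (char filter)
        • e.g. strip html tags, convert to lowercase (or uppercase)
      • Split the given text (tokenizer)
      • Process each resulting token (token filter)
        • e.g. normalize, remove stop words
      • Together, these steps are called an Analyzer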
  21. Analyzer processing
      • Text: an HTML heading element containing "Simple is better than complex"
      • Char filter → "simple is better than complex"
      • Tokenizer → ["simple", "is", "better", "than", "complex"]
      • Token filter: normalize "better" to "good", then drop the stop word "is" → ["simple", "good", "than", "complex"]
      • Text → Analyzer → Tokens
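The three Analyzer stages can be sketched with the standard library alone. This is a simplified illustration, not the implementation shown later in the talk: the `<h1>` tag in the input, the one-word stop list and the `better` → `good` normalization table are assumptions chosen to reproduce the example on this slide.

```python
import re

STOPWORDS = {"is"}                 # assumed stop word list
NORMALIZE = {"better": "good"}     # assumed normalization table


def char_filter(text: str) -> str:
    """Strip HTML tags, then lowercase the whole text."""
    return re.sub(r"<[^>]*>", "", text).lower()


def tokenizer(text: str) -> list:
    """Split the filtered text on whitespace."""
    return text.split()


def token_filter(tokens: list) -> list:
    """Normalize each token, then drop stop words."""
    normalized = [NORMALIZE.get(t, t) for t in tokens]
    return [t for t in normalized if t not in STOPWORDS]


def analyze(text: str) -> list:
    return token_filter(tokenizer(char_filter(text)))


tokens = analyze("<h1>Simple is better than complex</h1>")
print(tokens)
```

Keeping the stages as separate functions means an analyzer for another language only has to swap the tokenizer, which is the design the later slides follow with classes.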
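  22. Flow of search processing: indexing
      • Receive a document
      • Split the document
        • Process the whole text
        • Split into tokens
        • Process each token
      • Save the document and token information
      • Create the postings list
      • Update and save the inverted index
      • [Diagram: Document → Analyzer → Indexer → Storage]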
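  23. Flow of search processing: searching
      • Receive the query
      • Parse the query
      • Bring the query into the same token form used at indexing time
      • Fetch the postings lists and merge them
      • Fetch the documents
      • Sort
      • Return the results
      • [Diagram: Query → Parser → Analyzer → Merge → fetch → Sorter → Results]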
  24. Building a simple search engine
      • Demo deployment on AWS Elastic Beanstalk (ap-northeast region)

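  25. Building a simple search engine
      • Requirements
        • PyCon JP talks can be searched
          • Title and details
        • Supports boolean search (and, or, not)
        • Results are returned in score (TF-IDF) order
        • No phrase search
        • No search across multiple fields
        • No document updates
        • No documents added partway through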
  26. Building a simple search engine: overall picture
      [Architecture diagram: Analyzer, Indexer, Storage, Parser, Analyzer, Searcher (fetch/merge), Sorter; Document, Query, Results]
  27. Building a simple search engine
      [Architecture diagram repeated: Analyzer, Indexer, Storage, Parser, Analyzer, Searcher (fetch/merge), Sorter; Document, Query, Results]
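  28. Analyzer implementation
      • Create separate Analyzers for English and Japanese
      • Common
        • Strip html tags
        • Remove stop words
        • Convert all alphabetic characters to lowercase
        • Stemming
          • e.g. dogs → dog
      • Japanese
        • Split by morphological analysis
        • Remove particles, adverbs and symbols
      • English
        • Split on whitespace
        • Particles and the like are handled with stop words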
  29. Char filter
      • Create a common interface
      • Remove html tags with a regular expression
      • Convert alphabetic characters to lowercase

          import re


          class CharacterFilter:
              @classmethod
              def filter(cls, text: str):
                  raise NotImplementedError


          class HtmlStripFilter(CharacterFilter):
              @classmethod
              def filter(cls, text: str):
                  # Non-greedy match of anything between "<" and ">"
                  html_pattern = re.compile(r"<[^>]*?>")
                  return html_pattern.sub("", text)


          class LowercaseFilter(CharacterFilter):
              @classmethod
              def filter(cls, text: str):
                  return text.lower()
  30. Tokenizer
      • Prepare a common interface
      • Use Janome (※1) for morphological analysis
      • Caution
        • Initializing Janome's Tokenizer object is expensive, so reuse a single instance (※2)

          import re

          from janome.tokenizer import Tokenizer


          class BaseTokenizer:
              @classmethod
              def tokenize(cls, text):
                  raise NotImplementedError


          class JanomeTokenizer(BaseTokenizer):
              # Created once and shared, because initialization is expensive
              tokenizer = Tokenizer()

              @classmethod
              def tokenize(cls, text):
                  return (t for t in cls.tokenizer.tokenize(text))


          class WhitespaceTokenizer(BaseTokenizer):
              @classmethod
              def tokenize(cls, text):
                  return (t[0] for t in re.finditer(r"[^ \t\r\n]+", text))

      • ※1 https://github.com/mocobeta/janome
      • ※2 https://mocobeta.github.io/janome/ (Tokenizer section)
  31. Token filter
      • Prepare a common interface
      • Because of POSFilter, the token argument accepts either a str or a janome token object

          from janome.tokenizer import Token

          STOPWORDS = ("is", "was", "to", "the")


          def is_token_instance(token):
              return isinstance(token, Token)


          class TokenFilter:
              @classmethod
              def filter(cls, token):
                  """ in: string or janome.tokenizer.Token """
                  raise NotImplementedError


          class StopWordFilter(TokenFilter):
              @classmethod
              def filter(cls, token):
                  # janome tokens carry their text in .surface
                  if isinstance(token, Token):
                      if token.surface in STOPWORDS:
                          return None
                  if token in STOPWORDS:
                      return None
                  return token
  32. Token filter
      • Stemming
        • Extracts the stem of a word
        • e.g. "dogs" → "dog"
        • Uses the nltk.stem package (※1)

          from nltk.stem.porter import PorterStemmer

          ps = PorterStemmer()


          class Stemmer(TokenFilter):
              @classmethod
              def filter(cls, token: str):
                  if token:
                      return ps.stem(token)

      • ※1 http://www.nltk.org/api/nltk.stem.html
  33. Token filter
      • POSFilter
        • Excludes specific parts of speech
        • Decided from part_of_speech on the janome token object

          class POSFilter(TokenFilter):
              """ Filter that removes Japanese particles, adverbs and symbols """

              @classmethod
              def filter(cls, token):
                  """ in: janome token """
                  stop_pos_list = ("助詞", "副詞", "記号")
                  if any([token.part_of_speech.startswith(pos) for pos in stop_pos_list]):
                      return None
                  return token
  34. Analyzer
      • Base class
      • Configurable through class variables
        • tokenizer
        • char_filters
        • token_filters
      • Filters are lists because several can be specified
      • They are applied in order from the front

          class Analyzer:
              tokenizer = None
              char_filters = []
              token_filters = []

              @classmethod
              def analyze(cls, text: str):
                  text = cls._char_filter(text)
                  tokens = cls.tokenizer.tokenize(text)
                  filtered_token = (cls._token_filter(token) for token in tokens)
                  # parse_token is a helper that is not shown on the slide
                  return [parse_token(t) for t in filtered_token if t]

              @classmethod
              def _char_filter(cls, text):
                  for char_filter in cls.char_filters:
                      text = char_filter.filter(text)
                  return text

              @classmethod
              def _token_filter(cls, token):
                  for token_filter in cls.token_filters:
                      token = token_filter.filter(token)
                      # A filter returns None to drop the token; stop here so
                      # later filters never receive None
                      if token is None:
                          break
                  return token
  35. Analyzer
      • Concrete Analyzers
        • JapaneseAnalyzer for Japanese
        • EnglishAnalyzer for English
      • Other Analyzers can be created just as easily

          class JapaneseAnalyzer(Analyzer):
              tokenizer = JanomeTokenizer
              char_filters = [HtmlStripFilter, LowercaseFilter]
              token_filters = [StopWordFilter, POSFilter, Stemmer]


          class EnglishAnalyzer(Analyzer):
              tokenizer = WhitespaceTokenizer
              char_filters = [HtmlStripFilter, LowercaseFilter]
              token_filters = [StopWordFilter, Stemmer]
  36. Building a simple search engine
      [Architecture diagram repeated: Analyzer, Indexer, Storage, Parser, Analyzer, Searcher (fetch/merge), Sorter; Document, Query, Results]
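  37. Indexer
      • Receives documents and builds the inverted index
      • Holds a temporary inverted index in memory (as a dictionary)
      • Once a certain number of inverted index entries have accumulated in memory, saves them to storage

          class InvertedIndex:
              def __init__(
                  self, token_id: int, token: str, postings_list=None, docs_count=0
              ) -> None:
                  self.token_id = token_id
                  self.token = token
                  # None instead of [] avoids a shared mutable default argument
                  self.postings_list = postings_list if postings_list is not None else []
                  self.__hash_handle = {}
                  self.docs_count = docs_count


          def add_document(doc: str):
              """ Add a document to the database and build the inverted index """
              if not doc:
                  return
              # Build a mini inverted index from the document ID and its text
              text_to_postings_lists(doc)
              # Once enough mini inverted indexes have accumulated, merge them into storage
              if len(TEMP_INVERT_INDEX) >= LIMIT:
                  for inverted_index in TEMP_INVERT_INDEX.values():
                      save_index(inverted_index)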
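The buffer-then-flush idea on this slide can be shown in isolation. The sketch below is an assumption-laden toy: `LIMIT`, `buffer`, `storage` and `flush` are hypothetical names, a plain dict stands in for the database, and the buffer is cleared after each flush, which the slide's code does not show.

```python
LIMIT = 2      # hypothetical threshold: flush once this many terms are buffered
buffer = {}    # in-memory mini inverted index: term -> list of doc IDs
storage = {}   # stands in for the database


def flush() -> None:
    """Append every buffered postings list to storage, then empty the buffer."""
    for term, postings in buffer.items():
        storage.setdefault(term, []).extend(postings)
    buffer.clear()


def add_document(doc_id: int, tokens: list) -> None:
    for term in set(tokens):
        buffer.setdefault(term, []).append(doc_id)
    if len(buffer) >= LIMIT:
        flush()


add_document(1, ["python"])             # one term buffered, below the limit
add_document(2, ["python", "search"])   # two terms buffered, triggers a flush
print(storage)
```

Raising `LIMIT` trades memory for fewer writes to storage, which is the trade-off the talk returns to when it discusses memory versus storage.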
  38. Indexer
      • Creating the postings list
        • Create tokens with the analyzer
        • Compute the total number of tokens in the document in advance, for scoring
        • Build a "word-level inverted index" in reduced form
          • Positions are not stored, because phrase search is not supported
          • However, the number of times the token occurs in the document is counted and kept in the index, for scoring
          • e.g. Python → [doc: count, doc: count]

          from collections import Counter


          def text_to_postings_lists(text) -> None:
              """ Build the postings for one document """
              tokens = JapaneseAnalyzer.analyze(text)
              token_count = len(tokens)
              document_id = save_document(text, token_count)
              cnt = Counter(tokens)
              for token, c in cnt.most_common():
                  token_to_posting_list(token, document_id, c)


          def token_to_posting_list(token: str, document_id: int, token_count: int):
              """ Create a posting from a token and add it to the posting list """
              token_id = get_token_id(token)
              index = TEMP_INVERT_INDEX.get(token_id)
              if not index:
                  index = InvertedIndex(token_id, token)
              posting = "{}: {}".format(str(document_id), str(token_count))
              index.add_posting(posting)
              TEMP_INVERT_INDEX[token_id] = index
  39. What the Indexer produces
      • Documents
        • 「私は私です。」
        • 「私とpython。」
      • Tokens
        • 私
        • python
      • Token count
        • stored for each document
      • Postings lists
        • Python → [posting for the second document]
        • 私 → [postings for both documents]
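As a rough illustration of the data this slide describes, the sketch below builds postings with per-document counts for the two example documents. The token lists are an assumption: they pretend the analyzer keeps only 「私」 and 「python」, which may differ from what the real Janome-based analyzer returns, and the IDs 1 and 2 are hypothetical.

```python
from collections import Counter, defaultdict

# Assumed analyzer output for the two example documents (hypothetical IDs).
analyzed_docs = {
    1: ["私", "私"],        # from 「私は私です。」
    2: ["私", "python"],    # from 「私とpython。」
}

postings = defaultdict(list)   # token -> [(doc_id, occurrences), ...]
token_counts = {}              # doc_id -> total number of tokens

for doc_id in sorted(analyzed_docs):
    tokens = analyzed_docs[doc_id]
    token_counts[doc_id] = len(tokens)
    for token, count in Counter(tokens).items():
        # Documents are visited in ID order, so each list stays sorted by ID.
        postings[token].append((doc_id, count))

print(dict(postings))
print(token_counts)
```

The occurrence count stored in each posting and the total token count per document are exactly the two values needed later to compute the term frequency.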
  40. Key points of the Indexer
      • Postings lists are sorted by ID
      • When the in-memory inverted index exceeds a certain size, it is saved to storage

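  42. Postings lists are sorted by ID
      • This makes searching efficient
        • Not covered in this implementation
        • e.g. skip pointers (※1)
      • There are no updates this time, so simply appending is enough
      • ※1 「情報検索の基礎」 (Introduction to Information Retrieval), Kyoritsu Shuppan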
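One reason sorted postings lists make search efficient is that two lists can be intersected in a single pass. The sketch below shows that standard two-pointer merge; it is not part of the talk's implementation (which uses Python sets), and the `intersect` name and sample lists are hypothetical.

```python
def intersect(p1: list, p2: list) -> list:
    """Intersect two postings lists that are sorted by document ID."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1   # advance the list with the smaller ID
        else:
            j += 1
    return result


common = intersect([1, 3, 5, 8], [2, 3, 8, 9])
print(common)
```

Each list is walked at most once, so the cost is proportional to the combined length of the two lists; skip pointers reduce this further by jumping over runs of IDs that cannot match.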
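  43. Inverted index in memory and in storage
      • In a full-text search engine, storage I/O accounts for most performance bottlenecks
      • There is a trade-off between memory usage and speed
        • Keeping the index in memory performs better but raises memory usage
        • Accessing storage frequently lowers memory usage but is slower
      • ※ 「情報検索の基礎」 (Introduction to Information Retrieval), Kyoritsu Shuppan
      • ※ 「検索エンジン自作入門」 (Introduction to Building Your Own Search Engine), Gijutsu-Hyoron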
  44. Building a simple search engine
      [Architecture diagram repeated: Analyzer, Indexer, Storage, Parser, Analyzer, Searcher (fetch/merge), Sorter; Document, Query, Results]
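  45. Storage
      • This time: sqlite + sqlalchemy
      • Anything would do
        • NoSQL would make it easier
        • sqlite was chosen so that, as far as possible, python alone is enough
      • If you want more efficiency, you need to implement the storage yourself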
  46. Storage
      • Database schema
      • Text is saved in the Documents table
        • The text of the fields to be searched is kept in the text field

          # declarative_base lives in sqlalchemy.orm from SQLAlchemy 1.4 on
          from sqlalchemy import Column, Integer, String
          from sqlalchemy.orm import declarative_base

          Base = declarative_base()


          class Documents(Base):
              __tablename__ = "documents"
              id = Column(Integer, primary_key=True)
              text = Column(String)
              token_count = Column(Integer)
              date = Column(String)
              time = Column(String)
              room = Column(String)
              title = Column(String)
              abstract = Column(String)
              speaker = Column(String)
              self_intro = Column(String)
              detail = Column(String)
              session_type = Column(String)


          class Tokens(Base):
              __tablename__ = "tokens"
              id = Column(Integer, primary_key=True)
              token = Column(String)


          class InvertedIndexDB(Base):
              __tablename__ = "index"
              id = Column(Integer, primary_key=True)
              token = Column(String)
              postings_list = Column(String)
              docs_count = Column(Integer)
              token_count = Column(Integer)
  47. Storage
      • Implement utility functions for database operations
        • Add and fetch tokens
        • Add and fetch documents
        • Add, update and fetch inverted index entries

          def add_token(token: str) -> int:
              SESSION = get_session()
              token = Tokens(token=token)
              SESSION.add(token)
              SESSION.commit()
              # Read the generated ID before the session is closed
              token_id = token.id
              SESSION.close()
              return token_id


          def fetch_doc(doc_id):
              SESSION = get_session()
              doc = SESSION.query(Documents).filter(Documents.id == doc_id).first()
              SESSION.close()
              # first() already returns None when nothing matches
              return doc
  48. Building a simple search engine
      [Architecture diagram repeated: Analyzer, Indexer, Storage, Parser, Analyzer, merge/fetch, Sorter; Document, Query, Results]
  49. Searcher
      • Receives the query, then:
        • Parses the query (Parser/Analyzer)
        • Gets the doc_ids of the results (Merger)
        • Fetches the documents from storage (Fetcher)
        • Sorts them by score (Sorter)

          def search_by_query(query):
              if not query:
                  return []
              # parse
              parsed_query = tokenize(query)
              parsed_query = analyzed_query(parsed_query)
              rpn_tokens = parse_rpn(parsed_query)
              # merge
              doc_ids, query_postings = merge(rpn_tokens)
              # fetch
              docs = [fetch_doc(doc_id) for doc_id in doc_ids]
              # sort
              sorted_docs = sort(docs, query_postings)
              return [_parse_doc(doc) for doc, _ in sorted_docs]
  50. Building a simple search engine
      [Architecture diagram repeated: Analyzer, Indexer, Storage, Parser, Analyzer, Searcher (fetch/merge), Sorter; Document, Query, Results]
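  51. Parser
      • Parses the query entered by the user
      • Needed when boolean search operators such as AND, OR and NOT are used
        • Not needed without boolean search
      • This time the query is converted to reverse Polish notation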
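  52. Reverse Polish notation (postfix notation)
      • A notation that places each operator after its operands (※1)
      • e.g.
        • an infix arithmetic expression → the same expression with each operator after its operands
        • 「python AND 検索」→「python 検索 AND」
      • Advantages
        • Evaluating (computing) the expression becomes simple
        • It only has to be evaluated from the front
      • ※1 wikipedia https://ja.wikipedia.org/wiki/逆ポーランド記法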
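  53. Shunting-yard algorithm
      • The conversion to reverse Polish notation uses the shunting-yard algorithm
      • For the details of the algorithm, see the wikipedia page on the shunting-yard algorithm
      • ※1 wikipedia https://ja.wikipedia.org/wiki/操車場アルゴリズム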

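  54. Parser
      • Implement the algorithm
        • It only turns infix notation into postfix notation, so the only things to handle are strings, operators and parentheses
      • Split on spaces and parentheses with a regular expression
        • e.g. ["A", "AND", "(", "B", "OR", "C", ")"]
      • Specify the operators to use, with their precedence and associativity
        • Precedence: AND (3) > OR (2) > NOT (1)
        • Associativity
          • AND, OR: left-associative
          • NOT: right-associative
      • 「A AND (B OR C)」 → [A, B, C, OR, AND]

          import re

          REGEX_PATTERN = r"\s*(\d+|\w+|.)"
          SPLITTER = re.compile(REGEX_PATTERN)
          LEFT = True
          RIGHT = False
          # operator -> (precedence, is_left_associative)
          OPERATER = {"AND": (3, LEFT), "OR": (2, LEFT), "NOT": (1, RIGHT)}


          def tokenize(text):
              return SPLITTER.findall(text)


          def parse_rpn(tokens: list):
              # implementation of the algorithm (omitted on the slide)
              ...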
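The slide leaves the body of `parse_rpn` out. The following is one possible sketch of the shunting-yard conversion, not the author's code: it reuses the slide's regular expression and operator table (renamed `OPERATORS` here), while the function name `to_rpn` and the error handling are assumptions.

```python
import re

SPLITTER = re.compile(r"\s*(\d+|\w+|.)")
LEFT, RIGHT = True, False
# operator -> (precedence, is_left_associative), values taken from the slide
OPERATORS = {"AND": (3, LEFT), "OR": (2, LEFT), "NOT": (1, RIGHT)}


def to_rpn(tokens: list) -> list:
    """Convert infix query tokens to reverse Polish notation."""
    output, stack = [], []
    for token in tokens:
        if token in OPERATORS:
            prec, left_assoc = OPERATORS[token]
            # Pop operators that bind tighter (or equally, if left-associative).
            while stack and stack[-1] in OPERATORS:
                top_prec = OPERATORS[stack[-1]][0]
                if top_prec > prec or (top_prec == prec and left_assoc):
                    output.append(stack.pop())
                else:
                    break
            stack.append(token)
        elif token == "(":
            stack.append(token)
        elif token == ")":
            while stack and stack[-1] != "(":
                output.append(stack.pop())
            if not stack:
                raise ValueError("unbalanced parentheses")
            stack.pop()  # discard the matching "("
        else:
            output.append(token)  # operand
    while stack:
        op = stack.pop()
        if op == "(":
            raise ValueError("unbalanced parentheses")
        output.append(op)
    return output


rpn = to_rpn(SPLITTER.findall("A AND (B OR C)"))
print(rpn)
```

Because `\w` matches Unicode word characters in Python 3, the same tokenizer also handles Japanese operands such as 検索 without changes.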
  55. Building a simple search engine
      [Architecture diagram repeated: Analyzer, Indexer, Storage, Parser, Analyzer, Searcher (fetch/merge), Sorter; Document, Query, Results]
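  56. Analyzer
      • The search must use the same tokens that went into the dictionary when the inverted index was built
        • Using the Analyzer from indexing time as-is is fine
      • Both the Japanese and the English tokenizer have to be covered
        • At search time, English words are searched as they are
        • The whitespace tokenizer is not taken into account
      • This time, the tokens produced by splitting a term are treated as an OR search
        • 「機械学習 AND python」
        • →「機械 OR 学習 AND python」
      • Analyze before converting to reverse Polish notation

          def analyzed_query(parsed_query):
              return_val = []
              for q in parsed_query:
                  if q in OPRS:
                      # operators pass through untouched
                      return_val.append(q)
                  else:
                      analyzed_q = JapaneseAnalyzer.analyze(q)
                      if analyzed_q:
                          # join the tokens of one term with OR
                          tmp = " OR ".join(analyzed_q)
                          return_val += tmp.split(" ")
              return return_val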
  57. Merge
      [Architecture diagram repeated: Analyzer, Indexer, Storage, Parser, Analyzer, Searcher (fetch/merge), Sorter; Document, Query, Results]
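  58. Evaluating reverse Polish notation
      • Evaluate the reverse Polish notation that was built
      • Steps, for 「3 4 + 1 2 - *」
        • Operand: push 3 onto the stack → stack [3]
        • Operand: push 4 onto the stack → stack [3, 4]
        • Operator: pop the top two from the stack → stack []; compute 3 + 4 and push the result → stack [7]
        • Operand: push 1 onto the stack → stack [7, 1]
        • Operand: push 2 onto the stack → stack [7, 1, 2]
        • Operator: pop the top two from the stack → stack [7]; compute 1 - 2 and push the result → stack [7, -1]
        • Operator: pop the top two from the stack → stack []; compute 7 * -1 and push the result → stack [-7]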
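The push, pop and compute steps translate directly into a small stack loop. This sketch assumes the arithmetic expression `3 4 + 1 2 - *` as its example and supports only three operators; `eval_rpn` and `OPS` are hypothetical names.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def eval_rpn(tokens: list) -> int:
    """Evaluate a reverse Polish expression from left to right."""
    stack = []
    for token in tokens:
        if token in OPS:
            right = stack.pop()   # the most recent operand is the right-hand side
            left = stack.pop()
            stack.append(OPS[token](left, right))
        else:
            stack.append(int(token))
    return stack.pop()


value = eval_rpn("3 4 + 1 2 - *".split())
print(value)
```

No parentheses or precedence rules are needed at evaluation time, since the ordering of the tokens already encodes them.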
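  59. Evaluating reverse Polish notation
      • Merge = evaluating the reverse Polish notation
      • Evaluation has three patterns
        • "push onto the stack", "pop from the stack", "compute"
      • A computation only ever involves two operands
        • Merge the postings lists of two tokens
      • What gets pushed onto the stack is the postings list fetched for each token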
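  60. Merge
      • Implemented exactly as the evaluation steps for reverse Polish notation
        • If the token is an operand (a search token), fetch its postings list and push it onto the stack
        • If the token is an operator, pop the top two from the stack, merge them, and push the result back
      • Postings lists are kept in a dictionary for score calculation

          def merge(tokens: list):
              target_posting = {}
              stack = []
              for token in tokens:
                  if token not in OPRS:
                      token_id = get_token_id(token)
                      postings_list = fetch_postings_list(token_id)
                      # keep for score calculation
                      target_posting[token] = postings_list
                      # push only the doc_ids onto the stack
                      doc_ids = set([p[0] for p in postings_list])
                      stack.append(doc_ids)
                  # the token is an operator
                  else:
                      if not stack:
                          raise ValueError("operator without operands")
                      if len(stack) == 1:
                          # only NOT is allowed with a single operand
                          if token == "NOT":
                              # NOT handling
                              doc_ids = stack.pop()
                              not_doc_ids = fetch_not_docs_id(doc_ids)
                              return not_doc_ids, {}
                          else:
                              raise ValueError("binary operator with one operand")
                      doc_ids1 = stack.pop()
                      doc_ids2 = stack.pop()
                      stack.append(merge_posting(token, doc_ids1, doc_ids2))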
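  61. Merge
      • Handling 「NOT hoge」
        • NOT is the only operator allowed to be evaluated with something other than two operands
        • When the token is an operator, decide from the stack size and the token itself

          # handle "NOT hoge"
          if len(stack) == 1:
              # only NOT is allowed
              if token == "NOT":
                  # NOT handling
                  doc_ids = stack.pop()
                  not_doc_ids = fetch_not_docs_id(doc_ids)
                  return not_doc_ids, {}
              else:
                  raise ValueError("binary operator with one operand")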
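  62. Merge
      • Handling queries where the user writes no operator
        • 「python 機械学習」
        • →「python OR 機械学習」?
        • →「python AND 機械学習」?
        • OR is used this time
      • With no operator written, the number of operands and the number of evaluations do not match
        • Even after evaluating to the end, the stack still holds two or more entries
        • After all tokens have been evaluated, repeat an OR merge until the stack holds one entry

          for token in tokens:
              ...  # evaluation, as on the previous slides

          # "> 1" instead of "!= 1" so an empty stack cannot loop or crash
          while len(stack) > 1:
              doc_ids1 = stack.pop()
              doc_ids2 = stack.pop()
              stack.append(merge_posting("OR", doc_ids1, doc_ids2))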
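The merge steps can be pulled together in a self-contained sketch that uses Python sets in place of stored postings lists. Everything here is illustrative: `POSTINGS`, `ALL_DOCS` and the document IDs are made up, unary NOT is taken to mean "all documents except these", and binary NOT is assumed to mean set difference, which the slides do not specify.

```python
# Hypothetical postings: token -> set of document IDs.
POSTINGS = {
    "python": {1, 2},
    "機械": {2, 3},
    "学習": {3, 4},
}
ALL_DOCS = {1, 2, 3, 4, 5}
OPRS = ("AND", "OR", "NOT")


def merge(rpn_tokens: list) -> set:
    """Evaluate a boolean query in reverse Polish notation over doc-ID sets."""
    stack = []
    for token in rpn_tokens:
        if token not in OPRS:
            stack.append(POSTINGS.get(token, set()))
        elif token == "NOT" and len(stack) == 1:
            stack.append(ALL_DOCS - stack.pop())   # unary "NOT x"
        else:
            if len(stack) < 2:
                raise ValueError(f"missing operand for {token}")
            right, left = stack.pop(), stack.pop()
            if token == "AND":
                stack.append(left & right)
            elif token == "OR":
                stack.append(left | right)
            else:
                stack.append(left - right)         # binary NOT (assumed meaning)
    # Operands left without an operator are combined with OR.
    while len(stack) > 1:
        stack.append(stack.pop() | stack.pop())
    return stack[0] if stack else set()


print(merge(["python", "機械", "AND"]))
print(merge(["python", "学習"]))
```

Set operations keep the example short; a production engine would more likely merge sorted postings lists so that scores and positions can travel with each document ID.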
  63. Merge
      [Architecture diagram repeated: Analyzer, Indexer, Storage, Parser, Analyzer, Searcher (fetch/merge), Sorter; Document, Query, Results]
  64. Building a simple search engine
      [Architecture diagram repeated: Analyzer, Indexer, Storage, Parser, Analyzer, Searcher (fetch/merge), Sorter; Document, Query, Results]
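  65. Scoring
      • TF-IDF
      • TF (term frequency)
        • The proportion of the document taken up by the term
        • TF = t / T, where t is the number of occurrences of the term in the document and T is the number of tokens in the document
      • IDF (inverse document frequency)
        • Based on the proportion of all documents that contain the term
        • IDF = log(A / D), where D is the number of documents containing the term and A is the total number of documents
      • A term that appears often in a document and rarely in other documents gets a higher score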
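Written out with the slide's symbols (t, T, D, A), the score of one term in one document can be expressed as below. The base-10 logarithm and the numbers in the worked example are assumptions for illustration only.

```latex
\mathrm{TF} = \frac{t}{T}, \qquad
\mathrm{IDF} = \log_{10}\frac{A}{D}, \qquad
\mathrm{TF\text{-}IDF} = \mathrm{TF} \times \mathrm{IDF}

% Worked example (assumed numbers): t = 3, T = 100, D = 10, A = 1000
\mathrm{TF} = \frac{3}{100} = 0.03, \qquad
\mathrm{IDF} = \log_{10}\frac{1000}{10} = 2, \qquad
\mathrm{TF\text{-}IDF} = 0.03 \times 2 = 0.06
```

A term contained in every document has IDF = log(1) = 0, so it contributes nothing to the ranking under this plain form of the formula.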
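  66. Scoring
      • IDF in the context of search
        • With a single keyword, the IDF is the same for every document
        • TF alone would seem to be enough
        • However, the meaning changes once AND/OR queries are involved
        • Think of it as a weighting for comparing documents across multiple query terms
      • The TF-IDF value of a document α for the query 「A AND B」
        • TF-IDF of document α = TF-IDF for query term A + TF-IDF for query term B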
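The per-term sum can be sketched as a short function. The names `tf_idf` and `score` and all the numbers are hypothetical, and the base-10 logarithm is an assumption.

```python
import math


def tf_idf(count: int, doc_len: int, docs_with_term: int, all_docs: int) -> float:
    """TF-IDF of one term in one document."""
    tf = count / doc_len
    idf = math.log10(all_docs / docs_with_term)
    return tf * idf


def score(doc_len: int, term_stats: list, all_docs: int) -> float:
    """Sum TF-IDF over query terms; term_stats holds (count_in_doc, docs_with_term)."""
    return sum(tf_idf(c, doc_len, d, all_docs) for c, d in term_stats)


# Query "A AND B" against a 100-token document in a 1000-document collection:
# A occurs 3 times and appears in 10 documents, B occurs 5 times and appears in 100.
s = score(100, [(3, 10), (5, 100)], 1000)
print(round(s, 2))
```

Term A occurs less often in the document than B, yet contributes more to the score because it is rarer across the collection; this is the weighting effect the slide describes.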
  67. Sorter
      • Almost all values used for scoring were already collected at indexing time
        • Number of occurrences of the search token in the document (t)
          • Held in the posting
        • Number of tokens in the document (T)
          • Saved together with the document in Documents
          • doc.token_count
        • Number of documents containing the search token (D)
          • Length of the token's postings list
        • Total number of documents (A)
          • Row count of the Documents table

          import math


          def sort(doc_ids, query_postings):
              docs = []
              all_docs = count_all_docs()
              for doc_id in doc_ids:
                  doc = fetch_doc(doc_id)
                  doc_tfidf = 0
                  for token, postings_list in query_postings.items():
                      # IDF = log10(A / D) + 1
                      idf = math.log10(all_docs / len(postings_list)) + 1
                      posting = [p for p in postings_list if p[0] == doc.id]
                      if posting:
                          # TF = t / T
                          tf = round(posting[0][1] / doc.token_count, 2)
                      else:
                          tf = 0
                      token_tfidf = tf * idf
                      doc_tfidf += token_tfidf
                  docs.append((doc, doc_tfidf))
              return sorted(docs, key=lambda x: x[1], reverse=True)
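  68. Deployment tips
      • Provision plenty of memory
        • A reasonably capable instance is required
        • Janome needs several hundred MB of memory when pip builds it (※1)
      • heroku is tough
        • free and hobby: ✕
        • Standard: probably ✕ as well
      • EC2 micro instance: ✕
      • EC2 small instance: ○
      • ※1 https://mocobeta.github.io/janome/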
  69. DEMO: the DB code

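  70. Possible improvements
      • Faster indexing and search
        • Compress the postings lists
        • Use more efficient algorithms and data structures
      • Dynamic index creation
      • Query extensions
        • Phrase search
        • Multiple fields
        • More operators
      • Implement the storage layer (file system)
      • Vector search
      • Distribution
      • etc.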
  71. References for further improvement
      • Books
        • 「検索エンジン自作入門」, Gijutsu-Hyoron
        • 「情報検索の基礎」, Kyoritsu Shuppan
        • 「高速文字列解析の世界」, Iwanami Shoten
        • "Deep Learning for Search", Manning
      • OSS
        • Whoosh
          • https://bitbucket.org/mchaput/whoosh/src/default/docs/source/intro.rst
        • Elasticsearch
          • https://github.com/elastic/elasticsearch
        • Apache Lucene
          • http://lucene.apache.org/core/documentation.html
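  72. Summary
      • Covered everything from how a search engine works to implementing a basic one
      • Even a search engine that looks simple raises many questions once you implement it
      • For study and as a hobby, reinventing the wheel is worthwhile
      • Building a search engine is fun
      • Let's all build "the ultimate search engine I came up with"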