The first step self made full text search

ryokato
September 17, 2019

Transcript

  1. Introduction:
    Building Your Own Search Engine
    PyCon JP 2019

  2. About me
    Ryo Kato (加藤遼)
    @_ryook
    Server side / API / search
    Python / Django
    Elasticsearch

  3. Announcement
    https://search-tech.connpass.com/
    #searchtechjp

  4. Announcement
    We are looking for speakers, venues, and after-party sponsors.
    Speaker application form / Sponsor application form

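  5. About this talk
    • An introduction to building your own search engine
    • The story of how I got started building a search engine myself
    • Surely everyone wants to build a search engine at least once
    • A talk that shares, one-way, what I learned while building a search engine for the first time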

  6. Audience
    • People who know nothing, or only a little, about full-text search
    • People interested in search
    • People who want to try building a search engine

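  7. What this talk does not cover
    • Search applications built on full-text search engines such as Elasticsearch or Solr, and tips for using them from Python
      • If you want to know more, see the PyCon JP talk (*1) or come to the Search Technology Study Group (検索技術勉強会)
      • Or come and talk to me whenever you have a spare moment
    • Crawling and scraping
      • If you are interested, read a book, or see the blogs of @shinyorke (*2) and @vaaaaanquish (*3)
    *1 Presentation by @iktakahiro on slideship.com
    *2 Article on crawling and scraping at shinyorke.hatenablog.com
    *3 Article at vaaaaaanquish.hatenablog.com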

  8. Contents
    • About full-text search
    • Building a simple search engine
    • Further improvements
    • Summary

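  9. About full-text search
    What is full-text search?
    • "When the search target is every sentence of a document made up of text, running a search against that document." (*)
    What is a search engine?
    • "A general term for systems and software that find (search for) documents matching an information need, expressed as words or questions, within a collection of documents." (*)
    * 「検索エンジン自作入門」 (技術評論社)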

  10. A Python search engine you can write in minutes
    • Features
      • Text can be added
      • Keyword search works

  11. A Python search engine you can write in minutes
    def add_text(text: str):
        # Append one document per line to a flat text file.
        with open("db.txt", "a") as f:
            f.write(text + "\n")

    def search(keyword: str):
        # Linear scan: return every stored line that contains the keyword.
        with open("db.txt", "r") as f:
            return [l.strip() for l in f if keyword in l]

    if __name__ == "__main__":
        texts = [
            "Beautiful is better than ugly.",
            "Explicit is better than implicit.",
            "Simple is better than complex.",
        ]
        for text in texts:
            add_text(text)
        results = search("Simple")
        for result in results:
            print(result)

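  12. About full-text search
    Grep type (*1)
    • Performs a linear scan
    • One view holds that, on today's computers, this is enough for simple queries over a text the size of Shakespeare's complete works (*2)
    Index type (*1)
    • Scans the target documents in advance and builds index data
    Vector type
    • Builds feature vectors and computes the distance between them
    *1 https://ja.wikipedia.org/wiki/全文検索
    *2 「情報検索の基礎」 (共立出版)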

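  13. Index type
    • A method that scans the target documents in advance and prepares index data
    • Building the index file is called indexing; the data it produces is called an index
    • There are many ways to build an index, but the most common is the inverted index
      • Used by many full-text search engines
    • Think of the index at the back of a book
      • Keywords and pages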

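  14. Inverted index
    • An index data structure that holds a mapping from each word to the documents that contain it (*1)
    • Consists of a dictionary and postings
    [Figure (*2): the dictionary holds terms such as "Python" and "Elasticsearch"; each term points to a postings list, and each entry in that list is a posting]
    *1 Wikipedia
    *2 Figure from 「情報検索の基礎」 (共立出版)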

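  15. Granularity of an inverted index
    • Record-level inverted index
      • Holds, for each word, a list of the documents (document ids) that contain it
      • e.g. Python: [list of document ids]
      • Pros: simple and easy to implement; uses little disk space
      • Cons: limited functionality
    • Word-level inverted index
      • Holds, for each word, the documents (document ids) that contain it plus extra information (e.g. the positions where the word occurs)
      • e.g. Python: [document ids with positions]
      • Pros: more functionality; for example, phrase search becomes possible
      • Cons: uses more disk space
    * 「情報検索の基礎」 (共立出版)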
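
The difference between the two granularities can be sketched with plain dictionaries. This is a minimal illustration, not the talk's implementation: the two sample documents, the whitespace split, and the names `record_index` and `word_index` are all assumptions made for the example.

```python
from collections import defaultdict

# Two toy documents (hypothetical data), keyed by document id.
docs = {
    1: "python is fun",
    2: "search with python",
}

# Record-level: term -> list of document ids.
record_index = defaultdict(list)
# Word-level: term -> {document id: [positions of the term in that document]}.
word_index = defaultdict(dict)

for doc_id, text in docs.items():
    for position, term in enumerate(text.split()):
        if doc_id not in record_index[term]:
            record_index[term].append(doc_id)
        word_index[term].setdefault(doc_id, []).append(position)

print(record_index["python"])  # [1, 2]
print(word_index["python"])    # {1: [0], 2: [2]}
```

The word-level structure keeps positions, which is what phrase search would need; the record-level one only answers "which documents contain this term".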

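  16. Inverted index
    • In Python, a dictionary
    • A value cannot be retrieved unless the key matches exactly
    • The unit used to build the index is what matters
    [Figure: an inverted index holding the key "Python"; the query "Python" matches (○), while "python", "Py", and "パイソン" do not (✕)]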

  17. Ways to extract index terms
    • Word segmentation (tokenization)
      • Whitespace
      • Morphological analysis (Japanese)
    • N-gram
    • Signatures
    • Suffix arrays
    • etc.

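  18. Morphological analysis (whitespace) and N-gram
                      | Morphological analysis  | N-gram
    Tokenization      | e.g. 「東京」「都知事」  | e.g. 「東京」「京都」「都知」「知事」
    Number of tokens  | few                     | many
    Index size        | small                   | large
    Missed matches    | many                    | few
    New-word support  | ✕                       | ○
    Noise             | little                  | much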
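
The N-gram column can be reproduced in a couple of lines. A minimal sketch, assuming character bigrams over the string 東京都知事 (the text implied by the example tokens); the function name `ngrams` is chosen for illustration.

```python
def ngrams(text: str, n: int = 2) -> list:
    """Return every run of n consecutive characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(ngrams("東京都知事"))  # ['東京', '京都', '都知', '知事']
```

The bigram 「京都」 (Kyoto) appears even though the text never mentions Kyoto, which is the kind of noise the comparison refers to. In exchange, no dictionary is needed, so new words are still indexed.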

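  19. Index generation
    • Index generation needs more than tokenization; the text itself also has to be processed
    • What processing is needed depends on the kind of search engine
    • e.g. an engine that searches HTML
      • When only the content is targeted, as with Google
        • HTML tags are unnecessary → remove the HTML tags
      • When code is targeted, as with GitHub
        • HTML tags are needed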

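  20. Index generation processing
    • Process the text as a whole (char filter)
      • e.g. remove HTML tags, convert to lowercase (or uppercase)
    • Split the given text (tokenizer)
    • Process each resulting token (token filter)
      • e.g. normalize, remove stop words
    • Together these steps are called an Analyzer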

  21. Analyzer processing
    [Figure: Text → Analyzer (Char filter → Tokenizer → Token filter) → Tokens. Example input: an HTML heading containing "Simple is better than complex."; after analysis: "simple is better than complex"]
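
The three analyzer stages can be sketched with plain functions before looking at the class-based version later in the talk. This is a simplified illustration under stated assumptions: the stop-word set, the `\w+` tokenizer, and the `<h1>` wrapper in the sample input are choices made for this example, not taken from the slides.

```python
import re

STOPWORDS = {"is", "than"}  # illustrative stop words, not the talk's list

def strip_html(text: str) -> str:
    """Char filter: drop anything that looks like an HTML tag."""
    return re.sub(r"<[^>]*?>", "", text)

def analyze(text: str) -> list:
    # 1. char filters: work on the whole text
    text = strip_html(text).lower()
    # 2. tokenizer: split into runs of word characters
    tokens = re.findall(r"\w+", text)
    # 3. token filter: drop stop words
    return [t for t in tokens if t not in STOPWORDS]

print(analyze("<h1>Simple is better than complex.</h1>"))
# ['simple', 'better', 'complex']
```

The same function has to be applied to queries at search time, otherwise the query tokens will not match the dictionary keys.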

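  22. Search processing flow
    • Indexing
      1. Receive a document
      2. Split the document
         • Process the text as a whole
         • Split it into tokens
         • Process each token
      3. Save the document and its token information
      4. Build the postings list
      5. Update and save the inverted index
    [Figure: Document → Analyzer → Indexer → Storage]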

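  23. Search processing flow
    • Searching
      1. Receive the query
      2. Parse the query
      3. Convert it to the same token form used at indexing time
      4. Fetch and merge the postings lists
      5. Fetch the documents
      6. Sort them
      7. Return the results
    [Figure: Query → Parser → Analyzer → Merge → fetch → Sorter → Results]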

  24. Building a simple search engine
    Demo site: a "pyconsearch dev" environment hosted on elasticbeanstalk.com (ap-northeast region)

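  25. Building a simple search engine
    • Requirements
      • Search PyCon JP talks
        • Title and details
      • Support Boolean search (AND / OR / NOT)
      • Return results in score (TF-IDF) order
      • No phrase search
      • No search across multiple fields
      • No document updates
      • No documents added partway through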

  26. Building a simple search engine
    [Figure: overall architecture. Indexing: Document → Analyzer → Indexer → Storage. Searching: Query → Parser → Analyzer → Searcher (fetch / merge) → Sorter → Results]

  27. Building a simple search engine
    [Figure: architecture diagram, as on slide 26]

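  28. Analyzer implementation
    • Create one Analyzer for English and one for Japanese
    • Common to both
      • Remove HTML tags
      • Remove stop words
      • Convert all alphabetic characters to lowercase
      • Stemming
        • e.g. dogs → dog
    • Japanese
      • Split by morphological analysis
      • Remove particles, adverbs, and symbols
    • English
      • Split on whitespace
      • Particles and the like are handled with stop words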

  29. Char filter
    • Define a common interface
    • Remove HTML tags with a regular expression
    • Convert alphabetic characters to lowercase
    import re

    class CharacterFilter:
        """Common interface: takes text, returns filtered text."""
        @classmethod
        def filter(cls, text: str):
            raise NotImplementedError

    class HtmlStripFilter(CharacterFilter):
        @classmethod
        def filter(cls, text: str):
            # Drop anything that looks like an HTML tag.
            html_pattern = re.compile(r"<[^>]*?>")
            return html_pattern.sub("", text)

    class LowercaseFilter(CharacterFilter):
        @classmethod
        def filter(cls, text: str):
            return text.lower()

  30. Tokenizer
    • Prepare a common interface
    • Janome (*1) is used for morphological analysis
    • Caution
      • Initializing a Janome Tokenizer object is expensive, so reuse a single instance (*2)
    import re
    from janome.tokenizer import Tokenizer

    # One shared instance: creating a Janome Tokenizer is costly.
    _janome = Tokenizer()

    class BaseTokenizer:
        @classmethod
        def tokenize(cls, text):
            raise NotImplementedError

    class JanomeTokenizer(BaseTokenizer):
        # Expose the shared instance as cls.tokenizer.
        tokenizer = _janome

        @classmethod
        def tokenize(cls, text):
            return (t for t in cls.tokenizer.tokenize(text))

    class WhitespaceTokenizer(BaseTokenizer):
        @classmethod
        def tokenize(cls, text):
            # Each run of non-whitespace characters is one token.
            return (t[0] for t in re.finditer(r"[^ \t\r\n]+", text))
    *1 https://github.com/mocobeta/janome
    *2 https://mocobeta.github.io/janome/ (section on the Tokenizer)

  31. Token filter
    • Prepare a common interface
    • Because of POSFilter, the token argument may be either a str or a Janome Token object
    from janome.tokenizer import Token

    STOPWORDS = ("is", "was", "to", "the")

    def is_token_instance(token):
        return isinstance(token, Token)

    class TokenFilter:
        @classmethod
        def filter(cls, token):
            """
            in: string or janome.tokenizer.Token
            """
            raise NotImplementedError

    class StopWordFilter(TokenFilter):
        @classmethod
        def filter(cls, token):
            # Compare the surface form for Janome tokens, the string itself otherwise.
            surface = token.surface if is_token_instance(token) else token
            if surface in STOPWORDS:
                return None
            return token

  32. Token filter
    • Stemming
      • Extracts the stem of a word
      • e.g. "dogs" → "dog"
      • Uses the nltk.stem package (*)
    from nltk.stem.porter import PorterStemmer

    ps = PorterStemmer()

    class Stemmer(TokenFilter):
        @classmethod
        def filter(cls, token):
            if token:
                # Janome tokens carry their text in .surface; plain strings are used as-is.
                return ps.stem(getattr(token, "surface", token))
    * http://www.nltk.org/api/nltk.stem.html

  33. Token filter
    • POSFilter
      • Excludes specific parts of speech
      • Decided from the part_of_speech attribute of the Janome token object
    class POSFilter(TokenFilter):
        """
        Filter that removes Japanese particles, adverbs, and symbols
        """
        @classmethod
        def filter(cls, token):
            """
            in: janome token
            """
            stop_pos_list = ("助詞", "副詞", "記号")
            if any(token.part_of_speech.startswith(pos) for pos in stop_pos_list):
                return None
            return token

  34. Analyzer
    • Base class
      • Configurable through class variables
        • tokenizer
        • char_filters
        • token_filters
      • Filters are lists, because several can be specified
        • They are applied in order, from the front
    class Analyzer:
        tokenizer = None
        char_filters = []
        token_filters = []

        @classmethod
        def analyze(cls, text: str):
            text = cls._char_filter(text)
            tokens = cls.tokenizer.tokenize(text)
            filtered_token = (cls._token_filter(token) for token in tokens)
            # parse_token (not shown on the slide) turns a token into its final form
            return [parse_token(t) for t in filtered_token if t]

        @classmethod
        def _char_filter(cls, text):
            for char_filter in cls.char_filters:
                text = char_filter.filter(text)
            return text

        @classmethod
        def _token_filter(cls, token):
            for token_filter in cls.token_filters:
                token = token_filter.filter(token)
                if token is None:
                    # A filter dropped the token; later filters must not see None.
                    break
            return token

  35. Analyzer
    • Individual Analyzers
      • JapaneseAnalyzer for Japanese
      • EnglishAnalyzer for English
      • Other Analyzers are easy to add
    class JapaneseAnalyzer(Analyzer):
        tokenizer = JanomeTokenizer
        char_filters = [HtmlStripFilter, LowercaseFilter]
        token_filters = [StopWordFilter, POSFilter, Stemmer]

    class EnglishAnalyzer(Analyzer):
        tokenizer = WhitespaceTokenizer
        char_filters = [HtmlStripFilter, LowercaseFilter]
        token_filters = [StopWordFilter, Stemmer]

  36. Building a simple search engine
    [Figure: architecture diagram, as on slide 26]

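  37. Indexer
    • Receives documents and builds the inverted index
    • Keeps a temporary inverted index in memory (a dictionary)
    • Once a set number of inverted indexes has accumulated in memory, saves them to storage
    class InvertedIndex:
        def __init__(
            self, token_id: int, token: str, postings_list=None, docs_count=0
        ) -> None:
            self.token_id = token_id
            self.token = token
            # Avoid a shared mutable default: each index gets its own list.
            self.postings_list = postings_list if postings_list is not None else []
            self.__hash_handle = {}
            self.docs_count = docs_count

    def add_document(doc: str):
        """
        Add a document to the database and build the inverted index
        """
        if not doc:
            return
        # Build a mini inverted index from the document id and its text
        text_to_postings_lists(doc)
        # Once enough mini inverted indexes have accumulated, merge them into storage
        if len(TEMP_INVERT_INDEX) >= LIMIT:
            for inverted_index in TEMP_INVERT_INDEX.values():
                save_index(inverted_index)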

  38. Indexer
    • Building the postings list
      • Create tokens with the analyzer
      • Compute the total number of tokens in the document up front, for scoring
      • Build a record-level inverted index
        • Because phrase search is not supported
        • However, for scoring, count how many times each token occurs in the document and keep that in the index too
        • e.g. Python: [document_id:count, ...]
    from collections import Counter

    def text_to_postings_lists(text) -> None:
        """
        Build the postings for one document
        """
        tokens = JapaneseAnalyzer.analyze(text)
        token_count = len(tokens)
        document_id = save_document(text, token_count)
        cnt = Counter(tokens)
        for token, c in cnt.most_common():
            token_to_posting_list(token, document_id, c)

    def token_to_posting_list(token: str, document_id: int, token_count: int):
        """
        Build a posting ("document_id:count") for a token and add it to the token's postings list
        """
        token_id = get_token_id(token)
        index = TEMP_INVERT_INDEX.get(token_id)
        if not index:
            index = InvertedIndex(token_id, token)
        posting = "{}:{}".format(str(document_id), str(token_count))
        index.add_posting(posting)
        TEMP_INVERT_INDEX[token_id] = index

  39. What the Indexer produces
    • Documents
      • doc1: 「私は私です。」
      • doc2: 「私とpython。」
    • Tokens
      • token1: 私
      • token2: python
    • Token counts
      • Stored per document (doc1, doc2)
    • Postings lists (document_id:count)
      • python: [2:1]
      • 私: [1:2, 2:1]
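
The "document_id:count" postings can be reproduced with `collections.Counter`. This is a minimal sketch, not the talk's Indexer: it assumes whitespace tokenization, in-memory dictionaries instead of storage, and two made-up English documents.

```python
from collections import Counter, defaultdict

# Hypothetical documents, keyed by document id.
docs = {1: "search engines search text", 2: "python search"}

index = defaultdict(list)  # token -> ["doc_id:count", ...]
token_counts = {}          # doc_id -> total number of tokens (used later for TF)

for doc_id in sorted(docs):  # ascending ids keep each postings list sorted
    tokens = docs[doc_id].split()
    token_counts[doc_id] = len(tokens)
    for token, count in Counter(tokens).items():
        index[token].append(f"{doc_id}:{count}")

print(index["search"])  # ['1:2', '2:1']
print(index["python"])  # ['2:1']
print(token_counts)     # {1: 4, 2: 2}
```

Because documents are only ever appended in id order, each postings list stays sorted without any extra work.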

  40. Key points of the Indexer
    • Keep postings lists sorted by id
    • Save the in-memory inverted index to storage once it exceeds a set size


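  42. Keep postings lists sorted by id
    • So that searches can be carried out efficiently
      • Not covered in this implementation
      • e.g. skip pointers (*)
    • Nothing is updated this time, so appending is all that is needed
    * 「情報検索の基礎」 (共立出版)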
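
One reason sorted postings help: two sorted lists can be intersected in a single pass, without building sets. The sketch below shows the standard two-pointer merge; it is not code from the talk (which merges sets of doc ids), and the function name `intersect` is chosen for illustration.

```python
def intersect(a: list, b: list) -> list:
    """Intersect two ascending lists of document ids in O(len(a) + len(b))."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot appear later in b
        else:
            j += 1  # b[j] cannot appear later in a
    return result

print(intersect([1, 3, 5, 8], [2, 3, 8, 9]))  # [3, 8]
```

Skip pointers build on the same loop by letting `i` or `j` jump ahead several entries at once when the gap between the two lists is large.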

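  43. Inverted index in memory and in storage
    • In a full-text search engine, storage I/O accounts for most performance bottlenecks
    • There is a trade-off between memory use and speed
      • Keeping the index in memory performs better but raises memory use
      • Accessing storage frequently lowers memory use but is slower
    * 「情報検索の基礎」 (共立出版)
    * 「検索エンジン自作入門」 (技術評論社)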
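
The trade-off can be made concrete with a small buffer that collects postings in memory and flushes them to storage when a limit is reached. This is a hypothetical sketch: the class `BufferedIndex`, its `limit` parameter, and the dictionary standing in for storage are all assumptions for illustration, not the talk's code.

```python
class BufferedIndex:
    """Collect postings in memory; flush to storage when the buffer is full."""

    def __init__(self, limit: int):
        self.limit = limit   # max number of distinct tokens held in memory
        self.buffer = {}     # in-memory postings: token -> [doc ids]
        self.storage = {}    # stand-in for a database or file
        self.flushes = 0     # how many times storage was written

    def add(self, token: str, doc_id: int) -> None:
        self.buffer.setdefault(token, []).append(doc_id)
        if len(self.buffer) >= self.limit:
            self.flush()

    def flush(self) -> None:
        for token, doc_ids in self.buffer.items():
            self.storage.setdefault(token, []).extend(doc_ids)
        self.buffer.clear()
        self.flushes += 1


idx = BufferedIndex(limit=2)
idx.add("a", 1)
idx.add("a", 2)
idx.add("b", 2)  # second distinct token: triggers a flush
```

A larger `limit` means fewer writes to storage (faster indexing) at the cost of more memory held by `buffer`.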

  44. Building a simple search engine
    [Figure: architecture diagram, as on slide 26]

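  45. Storage
    • This time sqlite / sqlalchemy is used
      • Anything would do
      • A NoSQL store would make things easy
      • sqlite was chosen so that, as far as possible, Python alone is enough
    • For more efficiency you would need to implement the storage yourself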

  46. Storage
    • Database schema
      • Text is saved in the Documents table
      • The text of the searchable fields is kept in the text field
    from sqlalchemy import Column, Integer, String
    from sqlalchemy.orm import declarative_base  # SQLAlchemy 1.4 or later

    Base = declarative_base()

    class Documents(Base):
        __tablename__ = "documents"
        id = Column(Integer, primary_key=True)
        text = Column(String)            # concatenated text of the searchable fields
        token_count = Column(Integer)    # total tokens in the document, used for TF
        date = Column(String)
        time = Column(String)
        room = Column(String)
        title = Column(String)
        abstract = Column(String)
        speaker = Column(String)
        self_intro = Column(String)
        detail = Column(String)
        session_type = Column(String)

    class Tokens(Base):
        __tablename__ = "tokens"
        id = Column(Integer, primary_key=True)
        token = Column(String)

    class InvertedIndexDB(Base):
        __tablename__ = "index"
        id = Column(Integer, primary_key=True)
        token = Column(String)
        postings_list = Column(String)
        docs_count = Column(Integer)
        token_count = Column(Integer)

  47. Storage
    • Implement utility functions for database operations
      • Add and fetch tokens
      • Add and fetch documents
      • Add, update, and fetch inverted indexes
    def add_token(token: str) -> int:
        session = get_session()
        token = Tokens(token=token)
        session.add(token)
        session.commit()
        # Read the generated id before the session is closed.
        token_id = token.id
        session.close()
        return token_id

    def fetch_doc(doc_id):
        session = get_session()
        doc = session.query(Documents).filter(Documents.id == doc_id).first()
        session.close()
        # first() already returns None when no row matches.
        return doc

  48. Building a simple search engine
    [Figure: architecture diagram, as on slide 26]

  49. Searcher
    • Receives the query, then
      • parses the query (Parser / Analyzer)
      • obtains the matching doc_ids (Merger)
      • fetches the documents from storage (Fetcher)
      • sorts them by score (Sorter)
    def search_by_query(query):
        if not query:
            return []
        # parse
        parsed_query = tokenize(query)
        parsed_query = analyzed_query(parsed_query)
        rpn_tokens = parse_rpn(parsed_query)
        # merge
        doc_ids, query_postings = merge(rpn_tokens)
        # fetch
        docs = [fetch_doc(doc_id) for doc_id in doc_ids]
        # sort
        sorted_docs = sort(docs, query_postings)
        return [_parse_doc(doc) for doc, _ in sorted_docs]

  50. Building a simple search engine
    [Figure: architecture diagram, as on slide 26]

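  51. Parser
    • Parses the user's input query
    • Needed when Boolean search operators such as AND, OR, and NOT are used
      • Not needed without Boolean search
    • This time the query is converted to reverse Polish notation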

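  52. Reverse Polish notation (postfix notation)
    • A notation that places the operator after its operands (*)
    • e.g.
      • "3 + 4" → "3 4 +"
      • "(3 + 4) × (1 − 2)" → "3 4 + 1 2 − ×"
      • "python AND 検索" → "python 検索 AND"
    • Advantages
      • Evaluating (computing) the expression becomes simple
      • It only has to be evaluated from the front
    * Wikipedia: https://ja.wikipedia.org/wiki/逆ポーランド記法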

  53. Shunting-yard algorithm
    • The conversion to reverse Polish notation uses the shunting-yard algorithm
    • For details of the algorithm, see the Wikipedia page on the shunting-yard algorithm (*)
    * Wikipedia: https://ja.wikipedia.org/wiki/操車場アルゴリズム

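  54. Parser
    • Implement the algorithm
      • Only infix-to-postfix conversion is needed, so the implementation handles just strings, operators, and parentheses
    • Split on spaces and parentheses with a regular expression (REGEX_PATTERN in the code)
    • Specify the operators to use, with their precedence and associativity
      • Precedence
        • NOT < OR < AND
      • Associativity
        • AND, OR: left-associative
        • NOT: right-associative
    • "A AND (B OR C)"
      • → [A B C OR AND]
    import re

    REGEX_PATTERN = r"\s*(\d+|\w+|.)"
    SPLITTER = re.compile(REGEX_PATTERN)
    LEFT = True
    RIGHT = False
    # operator -> (precedence, associativity)
    OPERATER = {"AND": (3, LEFT), "OR": (2, LEFT), "NOT": (1, RIGHT)}

    def tokenize(text):
        return SPLITTER.findall(text)

    def parse_rpn(tokens: list):
        ...  # shunting-yard implementation (omitted on the slide)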
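
Since the slide leaves the body of `parse_rpn` out, here is one way the conversion could look. This is a sketch of the textbook shunting-yard algorithm using the precedence and associativity values from the slide. The function name `to_rpn`, the error handling, and the skipping of whitespace-only tokens are assumptions, not the author's code.

```python
import re

LEFT, RIGHT = True, False
# Precedence and associativity as given on the slide.
OPERATORS = {"AND": (3, LEFT), "OR": (2, LEFT), "NOT": (1, RIGHT)}
SPLITTER = re.compile(r"\s*(\d+|\w+|.)")

def to_rpn(query: str) -> list:
    output, stack = [], []
    for token in SPLITTER.findall(query):
        if not token.strip():
            continue  # the "." alternative can match stray whitespace
        if token in OPERATORS:
            prec, assoc = OPERATORS[token]
            # Pop operators that bind tighter (or equally, if left-associative).
            while stack and stack[-1] in OPERATORS:
                top_prec = OPERATORS[stack[-1]][0]
                if top_prec > prec or (top_prec == prec and assoc is LEFT):
                    output.append(stack.pop())
                else:
                    break
            stack.append(token)
        elif token == "(":
            stack.append(token)
        elif token == ")":
            while stack and stack[-1] != "(":
                output.append(stack.pop())
            if not stack:
                raise ValueError("unbalanced parentheses")
            stack.pop()  # discard the "("
        else:
            output.append(token)  # operand
    while stack:
        op = stack.pop()
        if op == "(":
            raise ValueError("unbalanced parentheses")
        output.append(op)
    return output

print(to_rpn("A AND (B OR C)"))  # ['A', 'B', 'C', 'OR', 'AND']
```

With these precedences, `NOT hoge` becomes `['hoge', 'NOT']`, which is the single-operand case the Merge step treats specially.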

  55. Building a simple search engine
    [Figure: architecture diagram, as on slide 26]

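  56. Analyzer
    • Searches must use the same tokens that went into the dictionary when the inverted index was built
      • Reusing the Analyzer from indexing as-is is fine
    • Handle both JapaneseAnalyzer and EnglishAnalyzer
      • At search time, English words are searched as they are
      • The whitespace tokenizer is not considered
    • This time, the tokens a term is split into are treated as an OR search
      • "機械学習 AND python"
      • → "機械 OR 学習 AND python"
    • Analyze before converting to reverse Polish notation
    def analyzed_query(parsed_query):
        return_val = []
        for q in parsed_query:
            if q in OPRS:
                # Operators pass through unchanged.
                return_val.append(q)
            else:
                analyzed_q = JapaneseAnalyzer.analyze(q)
                if analyzed_q:
                    # A term split into several tokens becomes an OR of those tokens.
                    tmp = " OR ".join(analyzed_q)
                    return_val += tmp.split(" ")
        return return_val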

  57. Merge
    [Figure: architecture diagram, as on slide 26]

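  58. Evaluating reverse Polish notation
    • Evaluate the reverse Polish expression that was built
    • Procedure for "3 4 + 1 2 − ×"
      1. Operand: push the operand (3) onto the stack → stack [3]
      2. Operand: push the operand (4) onto the stack → stack [3 4]
      3. Operator: pop the top two off the stack → stack []; compute (3 + 4) and push the result → stack [7]
      4. Operand: push the operand (1) onto the stack → stack [7 1]
      5. Operand: push the operand (2) onto the stack → stack [7 1 2]
      6. Operator: pop the top two off the stack → stack [7]; compute (1 − 2) and push the result → stack [7 −1]
      7. Operator: pop the top two off the stack → stack []; compute (7 × −1) and push the result → stack [−7]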
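
The stack procedure fits in a few lines of Python. A minimal sketch, assuming the worked expression is `3 4 + 1 2 - *`, that is (3 + 4) × (1 − 2), and supporting only the three arithmetic operators needed for it; `eval_rpn` is a name chosen for illustration.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def eval_rpn(tokens: list) -> int:
    stack = []
    for token in tokens:
        if token in OPS:
            # Operator: pop the top two values, compute, push the result.
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[token](left, right))
        else:
            # Operand: push it onto the stack.
            stack.append(int(token))
    return stack.pop()

print(eval_rpn("3 4 + 1 2 - *".split()))  # -7
```

The search engine uses the same loop, with postings lists in place of numbers and AND / OR / NOT in place of the arithmetic operators.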

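  59. Evaluating reverse Polish notation
    • Merge = evaluating the reverse Polish expression
    • Evaluation has three kinds of step
      • "push onto the stack", "pop from the stack", and "compute"
    • Each computation only involves two operands
      • Merge the postings lists of two tokens
    • What goes onto the stack is the postings list fetched for each token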

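  60. Merge
    • Implemented exactly as in the reverse Polish evaluation procedure
      • If the token is an operand (a search token), fetch its postings list and push it onto the stack
      • If the token is an operator, pop the top two entries off the stack, merge them, and push the result
    • Postings lists are also kept in a dictionary for score calculation
    def merge(tokens: list):
        target_posting = {}
        stack = []
        for token in tokens:
            if token not in OPRS:
                token_id = get_token_id(token)
                postings_list = fetch_postings_list(token_id)
                # keep the postings for score calculation
                target_posting[token] = postings_list
                # push only the doc_ids onto the stack
                doc_ids = set([p[0] for p in postings_list])
                stack.append(doc_ids)
            # the token is an operator
            else:
                if not stack:
                    raise ValueError("operator without operands")
                if len(stack) == 1:
                    # only NOT is allowed with a single operand
                    if token == "NOT":
                        doc_ids = stack.pop()
                        not_doc_ids = fetch_not_docs_id(doc_ids)
                        return not_doc_ids, {}
                    else:
                        raise ValueError("binary operator with one operand")
                doc_ids1 = stack.pop()
                doc_ids2 = stack.pop()
                stack.append(merge_posting(token, doc_ids1, doc_ids2))
        # the remaining stack entry is the result (return value inferred from search_by_query)
        return stack.pop(), target_posting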

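  61. Merge
    • Handling "NOT hoge"
      • NOT is the only operator allowed to be evaluated with anything other than two operands
      • When the token is an operator, decide from the size of the stack and the token itself
    # handling "NOT hoge"
    if len(stack) == 1:
        # only NOT is allowed here
        if token == "NOT":
            # NOT: every document except the ones on the stack
            doc_ids = stack.pop()
            not_doc_ids = fetch_not_docs_id(doc_ids)
            return not_doc_ids, {}
        else:
            raise ValueError("binary operator with one operand")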

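  62. Merge
    • Handling queries where the user writes no operator
      • "python 機械学習"
        • → "python OR 機械学習"
        • → "python AND 機械学習"
      • This time OR is used
    • Without operators, the number of operands does not match the number of evaluations
      • Even after evaluating to the end, two or more entries remain on the stack
    • After all tokens have been evaluated, repeat an OR merge until the stack holds a single entry
    for token in tokens:
        ...  # evaluation, as on the previous slides
    # OR-merge leftover operands until a single result remains
    while len(stack) > 1:
        doc_ids1 = stack.pop()
        doc_ids2 = stack.pop()
        stack.append(merge_posting("OR", doc_ids1, doc_ids2))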
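
Putting the Merge slides together, the whole evaluation can be sketched with Python sets. This is an illustration under stated assumptions rather than the talk's `merge`: postings are passed in as a plain dictionary instead of being fetched from storage, AND / OR map to set intersection / union, and a NOT with two operands is read as "left but not right". The names `evaluate` and `postings` are hypothetical.

```python
OPERATORS = {"AND", "OR", "NOT"}

def evaluate(rpn_tokens: list, postings: dict) -> set:
    """Evaluate a reverse Polish query over token -> set-of-doc-ids postings."""
    all_ids = set().union(*postings.values())
    stack = []
    for token in rpn_tokens:
        if token not in OPERATORS:
            stack.append(set(postings.get(token, ())))
        elif token == "NOT" and len(stack) == 1:
            # Unary "NOT x": every document except those containing x.
            stack.append(all_ids - stack.pop())
        else:
            if len(stack) < 2:
                raise ValueError(f"missing operand for {token}")
            right, left = stack.pop(), stack.pop()
            if token == "AND":
                stack.append(left & right)
            elif token == "OR":
                stack.append(left | right)
            else:  # binary NOT (assumed meaning: left but not right)
                stack.append(left - right)
    # Operands written without an operator are OR-ed together.
    while len(stack) > 1:
        stack.append(stack.pop() | stack.pop())
    return stack.pop() if stack else set()


postings = {"python": {1, 2}, "search": {2, 3}, "django": {4}}
print(evaluate(["python", "search", "AND"], postings))  # {2}
```

For example, `evaluate(["python", "search"], postings)` has no operator, so the two postings sets are OR-ed into `{1, 2, 3}`.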

  63. Merge
    [Figure: architecture diagram, as on slide 26]

  64. Building a simple search engine
    [Figure: architecture diagram, as on slide 26]

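  65. Scoring
    • TF-IDF
      • TF (term frequency)
        • The proportion of the document taken up by the term
        • $\mathrm{TF} = t / T$, where $t$ is the number of occurrences of the term and $T$ is the number of words in the document
      • IDF (inverse document frequency)
        • Based on the proportion of all documents that contain the term
        • $\mathrm{IDF} = \log(A / D)$, where $D$ is the number of documents containing the term and $A$ is the total number of documents
    • The more often a term appears in a document and the less it appears in other documents, the higher the score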

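  66. Scoring
    • IDF in the context of search
      • For a single-keyword search, the IDF is the same for every document
        • TF alone would seem to be enough
      • However, its meaning changes for AND / OR searches
        • Think of it as a weight for comparing documents across several query terms
    • TF-IDF of document α for the search "A AND B"
      • TF-IDF of document α = TF-IDF for query term A + TF-IDF for query term B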
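
A small worked example makes the weighting concrete. The numbers below are invented for illustration; the formula follows the definitions above, with base-10 logarithm and the `+ 1` on the IDF taken from the Sorter code in the talk. The function name `tf_idf` is an assumption.

```python
import math

def tf_idf(t: int, T: int, D: int, A: int) -> float:
    """t: term count in the document, T: tokens in the document,
    D: documents containing the term, A: all documents."""
    tf = t / T
    idf = math.log10(A / D) + 1  # "+ 1" mirrors the Sorter code in the talk
    return tf * idf

# Query "A AND B" against one 10-token document in a 100-document corpus.
score_a = tf_idf(t=2, T=10, D=10, A=100)   # rare term, appears twice
score_b = tf_idf(t=1, T=10, D=100, A=100)  # term found in every document
print(score_a, score_b, score_a + score_b)
```

Here the rare term contributes 0.4 and the ubiquitous one only 0.1, so the document's score for the whole query is about 0.5. A term present in every document still counts, but only through its TF.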

  67. Sorter
    • Almost every value needed for scoring was already collected at indexing time
      • Occurrences of the search token in the document (t)
        • Held in the posting
      • Number of tokens in the document (T)
        • Saved together with the document in Documents
        • doc.token_count
      • Number of documents containing the search token (D)
        • Length of the token's postings list
      • Total number of documents (A)
        • Row count of the Documents table
    import math

    def sort(docs, query_postings):
        # docs: documents already fetched by the Searcher
        scored = []
        all_docs = count_all_docs()
        for doc in docs:
            doc_tfidf = 0
            for token, postings_list in query_postings.items():
                idf = math.log10(all_docs / len(postings_list)) + 1
                # posting for this document, if the token occurs in it
                posting = [p for p in postings_list if p[0] == doc.id]
                if posting:
                    tf = round(posting[0][1] / doc.token_count, 2)
                else:
                    tf = 0
                token_tfidf = tf * idf
                doc_tfidf += token_tfidf
            scored.append((doc, doc_tfidf))
        # highest score first
        return sorted(scored, key=lambda x: x[1], reverse=True)

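  68. Deployment tips
    • Give it plenty of memory
      • Performance improves
      • Janome needs a substantial amount of memory while pip builds it (*)
    • Heroku is tough
      • free and hobby: ✕
      • Standard: probably ✕
    • EC2 micro instance: also ✕
    • EC2 small instance: ○
    * https://mocobeta.github.io/janome/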

  69. Demo
    DB code
    Demo

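  70. Possible improvements
    • Faster indexing and search
      • Compress the postings lists
      • Use more efficient algorithms and data structures
    • Dynamic index construction
    • Query extensions
      • Phrase search
      • Multiple fields
      • More operators
    • Implement the storage layer (file system)
    • Vector search
    • Distribution
    • etc.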

  71. References for further improvement
    • Books
      • 「検索エンジン自作入門」 (技術評論社)
      • 「情報検索の基礎」 (共立出版)
      • 「高速文字列解析の世界」 (岩波書店)
      • "Deep Learning for Search" (Manning)
    • OSS
      • Whoosh
        • https://bitbucket.org/mchaput/whoosh/src/default/docs/source/intro.rst
      • Elasticsearch
        • https://github.com/elastic/elasticsearch
      • Apache Lucene
        • http://lucene.apache.org/core/documentation.html

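  72. Summary
    • Covered everything from how a search engine works to implementing a basic one
    • Even a search engine that looks simple raises many design questions once you implement it
    • For study and as a hobby, reinventing the wheel is worthwhile
    • Building a search engine is fun
    • Let's all build "the ultimate search engine I came up with myself"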