• On crawling and scraping
• If you want to know more, read a book, or see the blog posts by @shinyorke (※) and @vaaaaanquish (※)
• ※ https://slideship.com/users/@iktakahiro/presentations/…
• ※ https://shinyorke.hatenablog.com/entry/…
• ※ https://vaaaaaanquish.hatenablog.com/entry/…
+ "\n") def search(keyword: str): with open("db.txt", "r") as f: return [l.strip() for l in f if keyword in l] if __name__ == "__main__": texts = [ "Beautiful is better than ugly.", "Explicit is better than implicit.", "Simple is better than complex." ] for text in texts: add_text(text) results = search("Simple") for result in results: print(result)
• Merit: simple and easy to implement; uses little disk space
• Demerit: limited functionality
• Word-level inverted index
• For each word, store the documents that contain it (document IDs) plus extra information, e.g. the positions where the word occurs (sketched below)
• e.g. Python: […]
• Merit: rich functionality; phrase search, for example, becomes possible
• Demerit: uses more disk space
• ※「情報検索の基礎」(Kyoritsu Shuppan)
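As a rough sketch of the word-level inverted index described above (illustrative only; the dict layout and helper names are mine, not the talk's implementation):

from collections import defaultdict

# word -> {document id -> [positions where the word occurs]}
index = defaultdict(dict)

def add_document(doc_id, tokens):
    # record every token of a document together with its position
    for position, token in enumerate(tokens):
        index[token].setdefault(doc_id, []).append(position)

add_document(1, ["python", "is", "simple"])
add_document(2, ["simple", "is", "better"])

print(index["simple"])  # {1: [2], 2: [0]} -> positions are what make phrase search possible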
• Reuse the Tokenizer instance (※)

import re

from janome.tokenizer import Tokenizer


class BaseTokenizer:
    @classmethod
    def tokenize(cls, text):
        raise NotImplementedError


class JanomeTokenizer(BaseTokenizer):
    # create the janome Tokenizer once and reuse it for every call
    tokenizer = Tokenizer()

    @classmethod
    def tokenize(cls, text):
        return (t for t in cls.tokenizer.tokenize(text))


class WhitespaceTokenizer(BaseTokenizer):
    @classmethod
    def tokenize(cls, text):
        # split on whitespace; each match object's [0] is the matched word
        return (t[0] for t in re.finditer(r"[^ \t\r\n]+", text))

※ https://github.com/mocobeta/janome
※ https://mocobeta.github.io/janome/…
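A quick usage sketch for the two tokenizers above (the sample sentences are mine; janome Token objects expose surface and part_of_speech):

# JanomeTokenizer yields janome Token objects
for token in JanomeTokenizer.tokenize("全文検索エンジンを作る"):
    print(token.surface, token.part_of_speech)

# WhitespaceTokenizer yields plain strings
print(list(WhitespaceTokenizer.tokenize("full text search engine")))
# ['full', 'text', 'search', 'engine']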
• An Analyzer combines char filters, a tokenizer, and token filters (a concrete wiring sketch follows the code)
• The filters are lists so that more than one can be specified
• They are applied in order, starting from the first one

class Analyzer:
    tokenizer = None
    char_filters = []
    token_filters = []

    @classmethod
    def analyze(cls, text: str):
        # char filters -> tokenizer -> token filters, in that order
        text = cls._char_filter(text)
        tokens = cls.tokenizer.tokenize(text)
        filtered_token = (cls._token_filter(token) for token in tokens)
        # parse_token is defined elsewhere in the talk
        return [parse_token(t) for t in filtered_token if t]

    @classmethod
    def _char_filter(cls, text):
        for char_filter in cls.char_filters:
            text = char_filter.filter(text)
        return text

    @classmethod
    def _token_filter(cls, token):
        for token_filter in cls.token_filters:
            token = token_filter.filter(token)
        return token
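A hedged sketch of how the Analyzer above might be wired up. The filter classes and the parse_token stub below are illustrative stand-ins, not the ones from the talk:

def parse_token(token):
    # stand-in for the talk's parse_token; here tokens are already strings
    return token

class LowercaseCharFilter:
    def filter(self, text):
        # char filters rewrite the raw text before tokenization
        return text.lower()

class StopWordTokenFilter:
    stop_words = {"is", "than"}

    def filter(self, token):
        # token filters can drop a token by returning a falsy value
        return None if token in self.stop_words else token

class EnglishAnalyzer(Analyzer):
    tokenizer = WhitespaceTokenizer
    char_filters = [LowercaseCharFilter()]
    token_filters = [StopWordTokenFilter()]

print(EnglishAnalyzer.analyze("Simple is better THAN complex"))
# -> ['simple', 'better', 'complex']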
• What about searching with English words at query time?
• The WhitespaceTokenizer is not considered here
• This time, the tokens produced by the split are treated as an OR search (usage sketch after the code)
• 「機械学習 AND python」
• → 「機械 OR 学習 AND python」
• Analyze the query before converting it to reverse Polish notation

def analyzed_query(parsed_query):
    return_val = []
    for q in parsed_query:
        if q in OPRS:
            # operators pass through unchanged
            return_val.append(q)
        else:
            # analyze each term; if it splits into several tokens, join them with OR
            analyzed_q = JapaneseAnalyzer.analyze(q)
            if analyzed_q:
                tmp = " OR ".join(analyzed_q)
                return_val += tmp.split(" ")
    return return_val
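A usage sketch of analyzed_query. OPRS is assumed here to be the operator set defined elsewhere in the talk, and the expected output follows the 「機械学習 AND python」 example above (the actual split depends on the analyzer's dictionary):

OPRS = {"AND", "OR", "NOT", "(", ")"}  # assumed operator set

parsed_query = ["機械学習", "AND", "python"]
print(analyzed_query(parsed_query))
# -> ['機械', 'OR', '学習', 'AND', 'python']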
→ 「python AND 機械学習」
• This time, OR is adopted
• If operators are not written, the number of operands and the number of evaluations do not match
• After evaluating to the end, two or more entries remain on the stack
• After processing all tokens, repeat the OR merge until the stack holds a single entry (a merge_posting sketch follows the code)

for token in tokens:
    # evaluation of each token in reverse Polish notation
    ...

# operands left over because operators were omitted:
# keep OR-merging until only one posting list remains
while len(stack) != 1:
    doc_ids1 = stack.pop()
    doc_ids2 = stack.pop()
    stack.append(merge_posting("OR", doc_ids1, doc_ids2))
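merge_posting itself is defined elsewhere in the talk; as a rough sketch under the assumption that postings are sorted lists of document IDs, an OR merge could look like this:

def merge_posting(operator, doc_ids1, doc_ids2):
    # minimal sketch: OR is the union, AND the intersection, of two posting lists
    if operator == "OR":
        return sorted(set(doc_ids1) | set(doc_ids2))
    if operator == "AND":
        return sorted(set(doc_ids1) & set(doc_ids2))
    raise ValueError(f"unsupported operator: {operator}")

print(merge_posting("OR", [1, 3, 5], [2, 3]))  # -> [1, 2, 3, 5]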
「Deep Learning for Search」(MANNING)
• OSS
• 「Whoosh」
• https://bitbucket.org/mchaput/whoosh/src/default/docs/source/intro.rst
• 「Elasticsearch」
• https://github.com/elastic/elasticsearch
• 「Apache Lucene」
• http://lucene.apache.org/core/documentation.html