Slide 38
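Indexer
• Creating the posting lists
• Create tokens with the analyzer
• For score calculation, compute the total number of tokens in the document in advance
• Build a "word-level inverted index"
  • Because phrase search is not performed
  • However, for score calculation, count how many times each token occurs in the document and keep that count in the index
• Python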
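from collections import Counter

# JapaneseAnalyzer, save_document, get_token_id, InvertedIndex and
# TEMP_INVERT_INDEX are assumed to be defined elsewhere.

def text_to_postings_lists(text: str) -> None:
    """
    Build the posting lists for one text.
    """
    tokens = JapaneseAnalyzer.analyze(text)
    # Total number of tokens in the document, stored for score calculation.
    token_count = len(tokens)
    document_id = save_document(text, token_count)
    # Occurrences of each distinct token within this document.
    cnt = Counter(tokens)
    for token, c in cnt.most_common():
        token_to_posting_list(token, document_id, c)


def token_to_posting_list(token: str, document_id: int,
                          token_count: int) -> None:
    """
    Build a posting list from a token.

    Here token_count is the number of times the token occurs in the document.
    """
    token_id = get_token_id(token)
    index = TEMP_INVERT_INDEX.get(token_id)
    if index is None:
        index = InvertedIndex(token_id, token)
    # A posting is stored as "document_id:count".
    posting = "{}:{}".format(document_id, token_count)
    index.add_posting(posting)
    TEMP_INVERT_INDEX[token_id] = index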
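
The listing above depends on helpers that are not shown on this slide. As a rough, self-contained sketch of the same flow, the version below swaps in hypothetical stand-ins: a whitespace tokenizer instead of JapaneseAnalyzer, plain dictionaries instead of the real document and token storage, and a list of strings instead of InvertedIndex. The term_frequency helper is also an assumption. The slide does not say which scoring formula is used, so this only illustrates why both the per-token count and the document's token total are kept.

```python
from collections import Counter

# Hypothetical in-memory stand-ins for the storage used on the slide.
DOCUMENTS = {}          # document_id -> (text, total token count)
TOKEN_IDS = {}          # token -> token_id
TEMP_INVERT_INDEX = {}  # token_id -> list of "document_id:count" postings


def analyze(text):
    """Stand-in for JapaneseAnalyzer.analyze: lowercase and split on whitespace."""
    return text.lower().split()


def save_document(text, token_count):
    """Store the text with its total token count and return a new document ID."""
    document_id = len(DOCUMENTS) + 1
    DOCUMENTS[document_id] = (text, token_count)
    return document_id


def get_token_id(token):
    """Return the ID for a token, assigning the next free ID on first sight."""
    return TOKEN_IDS.setdefault(token, len(TOKEN_IDS) + 1)


def add_document(text):
    """Index one text: one posting per distinct token, with its in-document count."""
    tokens = analyze(text)
    document_id = save_document(text, len(tokens))
    for token, count in Counter(tokens).items():
        postings = TEMP_INVERT_INDEX.setdefault(get_token_id(token), [])
        postings.append("{}:{}".format(document_id, count))
    return document_id


def term_frequency(token, document_id):
    """Assumed score input: occurrences of token divided by the document's token total."""
    token_id = TOKEN_IDS.get(token)
    for posting in TEMP_INVERT_INDEX.get(token_id, []):
        doc, count = posting.split(":")
        if int(doc) == document_id:
            return int(count) / DOCUMENTS[document_id][1]
    return 0.0


first = add_document("to be or not to be")
second = add_document("to index or not")
print(TEMP_INVERT_INDEX[get_token_id("to")])  # ['1:2', '2:1']
print(term_frequency("to", second))           # 0.25
```

Each posting records only a document ID and a count, with no token positions, which matches the slide's note that phrase search is not performed.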