
CharacterTextSplitter(LangChain)

LiberalArts
September 02, 2024


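This deck publishes the sections on LangChain's CharacterTextSplitter and TextLoader from the book below.

・簡易RAG構築を通して学ぶLangChain超入門 (A Super-Introduction to LangChain, Learned by Building a Simple RAG)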
https://booth.pm/ja/items/5938488

LangChain is a feature-rich library, but getting a firm grip on how to use it takes some effort, so this excerpt should be a useful reference for getting a rough overview.


Transcript

  1. Chapter 2: LangChain Super-Introduction (第2章 LangChain超入門)

        # (The start of this program lies outside this excerpt.)
            {'question': "If I am 6 ft 4 inches, how tall am I in centimeters?"},
            {'question': "Who was the 12th person on the moon?"}
        ]
        res = llm_chain.invoke(qs)
        print(res)

    Execution result:

        The New Orleans Saints did not win the Super Bowl in the 2010 season. It was actually the Pittsburgh Steelers who won the Super Bowl XLV in that year, defeating the Green Bay Packers.

        To convert your height from feet and inches to centimeters: First, multiply your height in inches by 2.54 (since one inch is equal to 2.54 centimeters). Then add any extra inches converted fully into centimeters. So for someone who is 6 ft 4 inches tall, it would be calculated as follows:
        (6 feet * 12 inches/foot) + 4 inches = 76 inches
        76 inches * 2.54 cm/inch = 193.04 centimeters
        Therefore, a person who is 6 ft 4 inches tall is approximately 193 centimeters tall.

        Neil Armstrong was the first person to step onto the moon's surface during the Apollo 11 mission in July 1969 and he wasn't specifically the 12th man on a lunar expedition, but as part of the Apollo 11 crew with Buzz Aldrin.

    2.3 CharacterTextSplitter

    LangChain provides text-splitting features (Text Splitters) for use when implementing RAG and similar applications. Program 2.8 is sample code for CharacterTextSplitter, which splits text according to character count.*3

    *3 From here on, the programs in Section 2.3 omit the line from langchain_text_splitters import CharacterTextSplitter.

  2. Program 2.8: CharacterTextSplitter ①

        from langchain_text_splitters import CharacterTextSplitter

        # "I am a cat. As yet I have no name."
        text = "吾輩は猫である。名前はまだない。"

        text_splitter = CharacterTextSplitter(
            separator="。",
            chunk_size=10,
            chunk_overlap=2,
            length_function=len,
            is_separator_regex=False,
        )

        texts = text_splitter.split_text(text)
        print(texts)

    Execution result:

        ['吾輩は猫である', '名前はまだない']

    CharacterTextSplitter first splits the input text at each separator, then concatenates the resulting pieces so that no chunk exceeds the number of characters given by chunk_size. Program 2.8 sets the maximum size of a chunk (the unit of a split document) to 10, so 「吾輩は猫である」 and 「名前はまだない」 are output as separate chunks.

    To check how the characters are counted, Program 2.9 varies chunk_size and observes how the output changes.

    Program 2.9: CharacterTextSplitter ②

        # Try chunk_size values from 12 to 16 on the same text.
        for i in range(12, 17):
            text_splitter = CharacterTextSplitter(
                separator="。",
                chunk_size=i,
                chunk_overlap=2,
                length_function=len,
                is_separator_regex=False,
            )
            texts = text_splitter.split_text(text)
            print(i, texts)

    Execution result:

        12 ['吾輩は猫である', '名前はまだない']
        13 ['吾輩は猫である', '名前はまだない']
        14 ['吾輩は猫である', '名前はまだない']
        15 ['吾輩は猫である。名前はまだない']
        16 ['吾輩は猫である。名前はまだない']
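The split-then-merge behaviour described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not LangChain's implementation: the function name split_then_merge is hypothetical, chunk_overlap is ignored, and a single piece longer than chunk_size is simply kept as its own chunk.

```python
def split_then_merge(text: str, separator: str, chunk_size: int) -> list[str]:
    """Hypothetical, simplified stand-in for split-then-merge (no overlap)."""
    # Step 1: cut the text at every separator and drop empty pieces.
    pieces = [p for p in text.split(separator) if p]

    chunks: list[str] = []
    current = ""
    for piece in pieces:
        # Step 2: try to join the next piece onto the current chunk,
        # re-inserting the separator between the two pieces.
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            # The joined text would exceed chunk_size: close the current chunk.
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


text = "吾輩は猫である。名前はまだない。"
for size in range(12, 17):
    print(size, split_then_merge(text, "。", size))
```

With this sample text, each sentence is 7 characters long, so the two pieces are only merged once chunk_size reaches 7 + 1 + 7 = 15.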
  3. The output of Program 2.9 shows that the strings are concatenated once chunk_size is 15 or more. Counting the characters of 「吾輩は猫である。名前はまだない」 confirms that it is exactly 15 characters long, including the 「。」. Note also that the sentences are joined with the 「。」 given as separator, while no 「。」 is appended after the last sentence.

    Next, consider chunk_overlap. Running Program 2.10 shows how the split result changes with the value of chunk_overlap.

    Program 2.10: CharacterTextSplitter ③

        # "red, blue, green, purple, white, yellow, light blue"
        text = "赤、青、緑、紫、白、黄色、水色"

        for i in range(0, 3):
            text_splitter = CharacterTextSplitter(
                separator="、",
                chunk_size=5,
                chunk_overlap=i,
                length_function=len,
                is_separator_regex=False,
            )
            texts = text_splitter.split_text(text)
            print(i, texts)

    Execution result:

        0 ['赤、青、緑', '紫、白', '黄色、水色']
        1 ['赤、青、緑', '緑、紫、白', '白、黄色', '水色']
        2 ['赤、青、緑', '緑、紫、白', '白、黄色', '黄色、水色']

    The output of Program 2.10 shows that adjacent chunks overlap by up to the number of characters given in chunk_overlap. With chunk_overlap=1, 「緑」 and 「白」 appear in two chunks each; with chunk_overlap=2, 「緑」, 「白」, and 「黄色」 do. Splitting a document with overlap in this way raises the probability that related sentences, such as "There is a problem X. Therefore Y is carried out.", end up in the same chunk.

    Section 2.3 has used only the split_text method of CharacterTextSplitter, but keep in mind that CharacterTextSplitter provides other methods besides split_text.

  4. For details on the methods other than split_text, consult the LangChain documentation, as shown in Figure 2.2.

    Figure 2.2: List of CharacterTextSplitter methods
    https://api.python.langchain.com/en/latest/character/langchain_text_splitters.character.CharacterTextSplitter.html

    Of the methods in Figure 2.2, knowing split_text, create_documents, and split_documents is enough for most tasks. create_documents is used as in Program 2.11.

    Program 2.11: create_documents

        import pprint  # needed for pprint.pp below

        text = "赤、青、緑、紫、白、黄色、水色。赤、青、緑、紫、白、黄色、水色。"

        text_splitter1 = CharacterTextSplitter(
            separator="。",
            chunk_size=20,
            chunk_overlap=0,
            length_function=len,
            is_separator_regex=False,
        )

        # split_text returns strings; create_documents wraps them in Document objects.
        texts = text_splitter1.split_text(text)
        docs = text_splitter1.create_documents(texts)

        print(texts)
        print(type(texts[0]))
        print("===")
        pprint.pp(docs)
        print(type(docs[0]))

  5. Execution result:

        ['赤、青、緑、紫、白、黄色、水色', '赤、青、緑、紫、白、黄色、水色']
        <class 'str'>
        ===
        [Document(page_content='赤、青、緑、紫、白、黄色、水色'),
         Document(page_content='赤、青、緑、紫、白、黄色、水色')]
        <class 'langchain_core.documents.base.Document'>

    The output of Program 2.11 shows that the elements of the list texts, before create_documents is applied, are <class 'str'>, and that the elements of the list docs, after it is applied, are <class 'langchain_core.documents.base.Document'>. Text read with TextLoader in Section 2.5 takes the same form, so it is reasonable to think of <class 'langchain_core.documents.base.Document'> as the default format for handling chunks in LangChain. In addition, split_documents can split the already-split docs further, as in Program 2.12.

    Program 2.12: split_documents

        text_splitter2 = CharacterTextSplitter(
            separator="、",
            chunk_size=5,
            chunk_overlap=0,
            length_function=len,
            is_separator_regex=False,
        )

        # Split the Document objects from Program 2.11 a second time.
        docs2 = text_splitter2.split_documents(docs)

        pprint.pp(docs)
        print(type(docs[0]))
        print("===")
        pprint.pp(docs2)
        print(type(docs2[0]))
  6. Execution result:

        [Document(page_content='赤、青、緑、紫、白、黄色、水色'),
         Document(page_content='赤、青、緑、紫、白、黄色、水色')]
        <class 'langchain_core.documents.base.Document'>
        ===
        [Document(page_content='赤、青、緑'),
         Document(page_content='紫、白'),
         Document(page_content='黄色、水色'),
         Document(page_content='赤、青、緑'),
         Document(page_content='紫、白'),
         Document(page_content='黄色、水色')]
        <class 'langchain_core.documents.base.Document'>

    2.4 RecursiveCharacterTextSplitter*

    The CharacterTextSplitter covered in Section 2.3 splits text using a single separator that specifies the delimiter character. RecursiveCharacterTextSplitter can split text even when no separator is specified or when several are specified. Program 2.13 runs RecursiveCharacterTextSplitter.

    Program 2.13: RecursiveCharacterTextSplitter

        import pprint  # needed for pprint.pp below

        from langchain_text_splitters import RecursiveCharacterTextSplitter

        text = "赤、青、緑、紫、白、黄色、水色。赤、青、緑、紫、白、黄色、水色。"

        text_splitter = RecursiveCharacterTextSplitter(
            separators=[
                "、",
                "。"
            ],
            chunk_size=5,
            chunk_overlap=0,
            length_function=len,
        )

        texts = text_splitter.create_documents([text])

        print(len(texts))
        pprint.pp(texts)

  7. Execution result:

        8
        [Document(page_content='赤、青、緑'),
         Document(page_content='、紫、白'),
         Document(page_content='、黄色'),
         Document(page_content='、水色。赤'),
         Document(page_content='、青、緑'),
         Document(page_content='、紫、白'),
         Document(page_content='、黄色'),
         Document(page_content='、水色。')]

    The input text of Program 2.13 is the same as the text used in Programs 2.11 and 2.12. Comparing the two results, CharacterTextSplitter appears to be the easier one to tune.*4 The methods of RecursiveCharacterTextSplitter are listed in Figure 2.3 below.

    Figure 2.3: List of RecursiveCharacterTextSplitter methods
    https://api.python.langchain.com/en/latest/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html

    *4 Most tasks can be handled with CharacterTextSplitter, so using CharacterTextSplitter by default is probably fine.

  8. Keep in mind that the methods of RecursiveCharacterTextSplitter do not necessarily match those of CharacterTextSplitter.*5

    2.5 TextLoader

    TextLoader is the feature used to read external text files. A source text file is needed, so create a texts directory and save the following as ./texts/sample.txt.*6

    ./texts/sample.txt

        The nearest neighbor algorithm was one of the first algorithms used to solve the travelling salesman problem approximately. In that problem, the salesman starts at a random city and repeatedly visits the nearest city until all have been visited. The algorithm quickly yields a short tour, but usually not the optimal one.

        In computer science, Monte Carlo tree search (MCTS) is a heuristic search algorithm for some kinds of decision processes, most notably those employed in software that plays board games. In that context MCTS is used to solve the game tree.

        In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the 'outcome' or 'response' variable, or a 'label' in machine learning parlance) and one or more independent variables (often called 'predictors', 'covariates', 'explanatory variables' or 'features'). The most common form of regression analysis is linear regression, in which one finds the line (or a more complex linear combination) that most closely fits the data according to a specific mathematical criterion. For example, the method of ordinary least squares computes the unique line (or hyperplane) that minimizes the sum of squared differences between the true data and that line (or hyperplane).

    *5 Appendix A covers splitting Python code; the from_language method used there exists on RecursiveCharacterTextSplitter but not on CharacterTextSplitter. The features of RecursiveCharacterTextSplitter and CharacterTextSplitter are similar but do not necessarily match, which is worth keeping in mind.

    *6 This text is taken from the body of the Wikipedia articles on the nearest neighbor method, MCTS, and regression analysis.
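The step of creating a texts directory and saving sample.txt can also be done from Python. The following is a minimal sketch using only the standard library; the helper name save_sample and the placeholder content are hypothetical. The demo writes into a temporary directory so that running it does not overwrite an existing file; passing "." as base_dir would produce ./texts/sample.txt as in the text above.

```python
import tempfile
from pathlib import Path


def save_sample(base_dir: str, content: str) -> Path:
    """Hypothetical helper: write content to <base_dir>/texts/sample.txt."""
    target_dir = Path(base_dir) / "texts"
    # Create the texts directory if it does not exist yet.
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "sample.txt"
    target.write_text(content, encoding="utf-8")
    return target


# Placeholder content; in practice, paste the sample text shown above.
sample = "First paragraph of the sample text.\n\nSecond paragraph."

base = tempfile.mkdtemp()  # temporary location used only for this demo
path = save_sample(base, sample)
print(path.relative_to(base))
```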
  9. The prepared text can be read with langchain_community.document_loaders.TextLoader, as in Program 2.14.

    Program 2.14: Reading the text

        from langchain_community.document_loaders import TextLoader

        # load() returns a list of Document objects.
        loader = TextLoader("./texts/sample.txt")
        documents = loader.load()

        print(type(documents))
        print(len(documents))
        print(documents)

    Execution result:

        <class 'list'>
        1
        [Document(metadata={'source': './texts/sample.txt'}, page_content="The nearest neighbor algorithm was one of the first algorithms used to solve the travelling salesman problem approximately. In that problem, the salesman starts at a random city and repeatedly visits the nearest city until all have been visited. The algorithm quickly yields a short tour, but usually not the optimal one.\n\nIn computer science, ...

    To inspect the contents of documents from Program 2.14, run Program 2.15.

    Program 2.15: Inspecting the contents of documents from Program 2.14

        print(documents[0].metadata)
        print(len(documents[0].page_content))

    Execution result:

        {'source': './texts/sample.txt'}
        1322

    As Program 2.15 shows, accessing metadata and page_content lets you check the source of the file that was read and the text that was loaded from it.
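To make the shape of the loaded data concrete, the following plain-Python sketch mimics the result observed above: a list containing one object with page_content and a metadata dictionary holding the source path. It is an illustration only, not how TextLoader is implemented; SimpleDocument and load_text are hypothetical names, and the file content is a placeholder written to a temporary directory.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text(path: str) -> list[SimpleDocument]:
    """Hypothetical loader: read one file into a one-element list."""
    content = Path(path).read_text(encoding="utf-8")
    return [SimpleDocument(page_content=content, metadata={"source": path})]


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder file standing in for ./texts/sample.txt.
    sample_path = str(Path(tmp) / "sample.txt")
    Path(sample_path).write_text(
        "first paragraph\n\nsecond paragraph", encoding="utf-8"
    )
    documents = load_text(sample_path)

print(type(documents))
print(len(documents))
print(documents[0].metadata)
print(len(documents[0].page_content))
```

The whole file becomes a single element, which is why a splitter is typically applied afterwards to break it into smaller chunks.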