
Japan OSS Promotion Forum | AI Study Group | How to Evaluate LLMs

Logbii
August 30, 2025


Transcript

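  1. Logbii, Inc. Profile: Matsuda, self-introduction

    Rakuten: Application Engineer, May 2009 to April 2013
      - Rakuten Card merchant site
      - Rakuten Ichiba Bookmark
    Viibar (now VideoTouch): Head of Technology, April 2013 to April 2015
      - Video crowdsourcing
    Logbii: CEO and CTO, May 2015 to present
      - AI/IT solutions
      - Engineer evaluation system
    2012: Rakuten Technology Conference, organizing committee member
    2017 to present: organizer of a gathering of former Rakuten development department members
    2024: Japan CTO Association, CTO Networking PM
    Hobbies: running and planning communities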
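  2. Story: Today's story

    - Ki (setup): a task dumped by the manager: "The president seems to want to do something with AI"
    - Sho (development): How do you evaluate AI?
    - Ten (turn): The requests keep growing
    - Ketsu (conclusion): Gratitude
    Characters:
      - Manager
      - Taro (pseudonym): 3 years as an engineer, currently studying generative AI
      - Senior colleague: 10 years as an engineer, 5 years in AI
    Note: the story is fiction.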
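  3. Intro: How it began

    Taro (to himself): I have no idea where to start, so I may as well ask my senior colleague right away...
    Taro: My manager asked me to handle such-and-such. Where do you think I should start?
    Senior colleague: Let me see. Rather than looking only at recent generative AI, I would start with how AI was evaluated before it, and build up from there to an understanding of LLM evaluation.
    Note: this talk covers LLMs.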
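  4. Evaluation: Examples of AI evaluation metrics (before generative AI)

    Task | Metric | Summary
    Classification | Accuracy | Share of all images whose predicted label matches the actual one (for example, predicted "dog" and actually a dog)
    Classification | Recall | Share of actual dog images that were predicted as "dog" (how few are missed)
    Classification | Precision | Share of images predicted as "dog" that actually are dogs (how little is wasted)
    Regression | Mean squared error | Sum of the squared differences between actual house prices and predicted values, divided by the number of samples (smaller is better)
    Taro: The correct answer is clear-cut, so these are easy to evaluate!
    Note: only the supervised case is covered here.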
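The four pre-generative-AI metrics on this slide (accuracy, recall, precision, mean squared error) can be sketched in a few lines of plain Python. The labels and house prices below are made-up toy data, and the function names are illustrative rather than taken from the talk.

```python
# Hypothetical toy labels: 1 = "dog", 0 = "not dog".
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]


def confusion_counts(actual, predicted):
    """Return (tp, fp, fn, tn) for binary labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn


tp, fp, fn, tn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / len(actual)  # correct predictions over all images
recall = tp / (tp + fn)             # how few real dogs were missed
precision = tp / (tp + fp)          # how few "dog" predictions were wrong


def mean_squared_error(actual_values, predicted_values):
    """Average of squared differences; smaller is better."""
    squared = [(a - p) ** 2 for a, p in zip(actual_values, predicted_values)]
    return sum(squared) / len(squared)


# Hypothetical house prices (any consistent unit).
mse = mean_squared_error([300, 450, 500, 250], [320, 430, 500, 250])

print(accuracy, recall, precision, mse)
```

Recall and precision divide by different totals (actual dogs versus predicted dogs), which is why the slide glosses them as "few misses" and "little waste" respectively.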
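  5. Azure: Examples of Azure's LLM evaluation metrics

    Task | Type | What is evaluated | Metric | Summary
    Conversation | LLM as a Judge | Quality | Groundedness | How well the answer is supported by the given context
    Conversation | LLM as a Judge | Quality | Relevance | How accurately the answer addresses the question
    Conversation | LLM as a Judge | Quality | Coherence | Whether the text flows naturally and is easy to read
    Conversation | LLM as a Judge | Safety | Hate/Unfair | Whether there are discriminatory expressions
    Conversation | LLM as a Judge | Safety | Protected material | Whether copyright is infringed
    Conversation | LLM as a Judge | Safety | Code vulnerability | Whether the program code contains vulnerabilities
    Taro: So an LLM evaluates an LLM!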
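  6. Prompt: Example prompt for LLM-as-a-Judge

    # Your role
    You are a rigorous fact checker. You evaluate the factual consistency (Groundedness) of RESPONSE according to whether it stays within the scope of the given CONTEXT.

    # Evaluation criteria
    - The response consists only of information explicitly contained in CONTEXT
    - Claims, numbers, and proper nouns not found in CONTEXT are treated as "unsupported"
    - Deduct points for any contradictory claims

    # Scoring rubric (1-5)
    5: Every claim is supported by CONTEXT (fully grounded)
    4: Mostly grounded (only minor omissions or paraphrases)
    3: Partly grounded, but mixed with insufficiently supported claims
    2: Mostly insufficiently supported, or partly contradictory
    1: Almost or entirely not grounded

    # Output format (JSON)
    - score: integer from 1 to 5
    - verdict: "pass" or "fail" (threshold = 3; pass if 3 or higher)
    - reasons: reasons for the judgment (concise)
    - citations: sentence numbers or excerpts from CONTEXT that serve as evidence (if possible)

    # Input
    QUERY: {query}
    CONTEXT: {context}
    RESPONSE: {response}

    # Procedure
    1) List the claims in RESPONSE
    2) Map each claim to its supporting evidence in CONTEXT
    3) List the unsupported claims and contradictions
    4) Decide the score according to the rubric and return only the JSON

    Taro: I see.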
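Because the judge prompt asks for JSON only, with an integer score from 1 to 5 and a pass threshold of 3, the judge's reply can be validated before it is trusted. The following is a minimal sketch under that assumption; `parse_judge_output` and the sample reply are hypothetical, and no LLM is called.

```python
import json

PASS_THRESHOLD = 3  # from the prompt: a score of 3 or higher is a pass


def parse_judge_output(raw: str, threshold: int = PASS_THRESHOLD) -> dict:
    """Parse a judge reply and recompute the verdict from the score."""
    data = json.loads(raw)
    score = data.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"score must be an integer from 1 to 5, got {score!r}")
    expected = "pass" if score >= threshold else "fail"
    return {
        "score": score,
        "verdict": expected,
        # False when the judge's own verdict disagrees with its score.
        "verdict_consistent": data.get("verdict") == expected,
        "reasons": data.get("reasons", ""),
        "citations": data.get("citations", []),
    }


# Hypothetical judge reply whose verdict contradicts its score.
sample = '{"score": 2, "verdict": "pass", "reasons": "two unsupported numbers", "citations": []}'
result = parse_judge_output(sample)
print(result["verdict"], result["verdict_consistent"])
```

Recomputing the verdict from the score, rather than reading the judge's own `verdict` field, keeps the threshold under the evaluator's control.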
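  7. Azure: Examples of Azure's LLM evaluation metrics (continued)

    Task | Type | What is evaluated | Metric | Summary
    Conversation | NLP/formula-based | Quality | ROUGE | How much the N-grams overlap
    Conversation | NLP/formula-based | Quality | Similarity | How close the vectors of the reference answer and the AI answer are
    Taro: So there are also quantitative, formula-based evaluations.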
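The two formula-based ideas on this slide, N-gram overlap and vector closeness, can be illustrated as follows. This is a simplified sketch: it computes a recall-style ROUGE-N on whitespace tokens and a plain cosine similarity, which is not necessarily how Azure implements either metric. The sentences and vectors are made up.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(reference: str, candidate: str, n: int = 1) -> float:
    """Share of reference n-grams that also appear in the candidate (clipped counts)."""
    ref = ngrams(reference.split(), n)
    cand = ngrams(candidate.split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


reference = "summer leave is four days"
candidate = "summer leave is three days"
rouge1 = rouge_n_recall(reference, candidate, n=1)
rouge2 = rouge_n_recall(reference, candidate, n=2)
print(rouge1, rouge2)
```

The example also shows a limit of overlap metrics: the candidate gets the number of days wrong yet still shares most unigrams with the reference.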
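  8. Problem: Challenges in evaluating RAG

    Taro: Following your advice, I looked into Azure's LLM evaluation a little, and it is starting to make sense! By the way, when it comes to evaluating RAG, I felt that, for such-and-such reasons, LLM-as-a-Judge alone is not enough.
    Senior colleague: Good catch. With RAG, you need to evaluate whether the top results retrieved from the vector DB are appropriate for the question, separately from LLM-as-a-Judge. There are ranking metrics and also RAG-specific metrics, so looking into them should broaden your view!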
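  9. Ranking: Examples of metrics for evaluating ranking results

    Task | Metric | Summary
    Search, recommendation | Recall@k | Of all "relevant" items, the share that appears in the top k results
    Search, recommendation | Precision@k | Of the top k results, the share that is "relevant"
    Search, recommendation | nDCG@k | Used when graded relevance (for example 0 to 5) is available
    RAG | Context Precision | Of the contexts passed to the LLM, the share that was "useful context"
    Taro: There are many ways to evaluate search and recommendation too.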
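The ranking metrics in this table can be written out directly. The sketch below makes two assumptions not stated on the slide: nDCG uses linear gains with a log2 position discount (other gain formulas exist), and the ideal ordering is built only from the gains present in the ranked list. The document IDs and gains are made up.

```python
import math


def precision_at_k(ranked_ids, relevant_ids, k):
    """Share of the top k results that are relevant."""
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k


def recall_at_k(ranked_ids, relevant_ids, k):
    """Share of all relevant items that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / len(relevant_ids)


def dcg_at_k(gains, k):
    """Discounted cumulative gain: gain at position i is divided by log2(i + 2)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_gains, k):
    """DCG normalized by the DCG of the best possible ordering of the same gains."""
    ideal = dcg_at_k(sorted(ranked_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal else 0.0


ranked = ["d3", "d1", "d7", "d2", "d5"]  # system output, best first
relevant = {"d1", "d2", "d9"}            # binary relevance labels
graded = [3, 0, 2]                       # graded relevance of the top 3 results

print(precision_at_k(ranked, relevant, 5))
print(recall_at_k(ranked, relevant, 5))
print(ndcg_at_k(graded, 3))
```

Precision@k and Recall@k need only binary labels, while nDCG@k rewards placing highly graded items earlier, which is why the slide reserves it for cases where graded relevance exists.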
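  10. RAG: Overview of a RAG system for company regulations

    [Figure: an employee asks a question about the company regulations -> the app runs a vector search against a vector DB built from the regulations -> top related information is returned -> the question plus the top related information is sent to the LLM for answer generation -> the answer is returned to the employee. A set of sample Q&A (question/answer pairs) on the regulations is provided by the General Affairs department.]
    Taro: So this is roughly what we are dealing with.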
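  11. RAGAS: Example implementation of RAG evaluation with RAGAS

    # Imports are not shown on the slide; these are assumed and may differ by ragas version.
    from datasets import Dataset
    from langchain_openai import ChatOpenAI
    from ragas import evaluate
    from ragas.llms import LangchainLLMWrapper
    from ragas.metrics import answer_relevancy, context_precision, faithfulness

    data_samples = {
        # Questions about the company regulations
        "question": [
            "How many days of summer leave are there?",
            "How many days of year-end and New Year leave are there?",
        ],
        # Model answers prepared by General Affairs
        "ground_truth": ["4 days", "4 days"],
        # Answers from the LLM
        "answer": ["3 days.", "5 days."],
        # Top contexts retrieved from the vector DB
        "contexts": [
            [
                "Summer leave is 4 days, chosen from business days between July and September.",
                "Summer leave is granted separately from paid leave.",
            ],
            [
                "Year-end and New Year leave is 4 days, chosen from business days between November and January.",
                "Year-end and New Year leave is granted separately from paid leave.",
            ],
        ],
    }
    dataset = Dataset.from_dict(data_samples)

    # Use GPT-4o as the LLM-as-a-Judge
    llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o", temperature=0))

    # Evaluate Faithfulness (roughly Groundedness), Relevancy, and Context Precision
    result = evaluate(
        dataset,
        metrics=[faithfulness, answer_relevancy, context_precision],
        llm=llm,
    )

    Taro: I see.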
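  12. Problem: Challenges when there are no answers or contexts

    Taro: Oh, you're back! That's a relief. By the way, I'm stuck on such-and-such. I understand that there are recommendation metrics such as nDCG@k, but in this case there are no relevance values, so it looks like I can't use them.
    Senior colleague: You're getting all sorts of requests, that's tough. When there are no answers or contexts, as in this case, it is probably best to evaluate user reactions and opinions. Concretely, you can interview users or run surveys, or collect logs in the app and verify the effect. If you define relevance from the user logs, I think you can use nDCG@k and similar metrics too.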
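The suggestion to define relevance from user logs can be sketched as a mapping from logged events to graded gains, which then feed a metric such as nDCG@k. Everything here is an assumption for illustration: the log schema, the event names, and the grading (acquired = 2, clicked = 1, shown = 0) are hypothetical and do not come from the talk.

```python
# Hypothetical log rows: (user_id, recommended_item_id, rank, event)
logs = [
    ("u1", "coupon_a", 1, "acquired"),
    ("u1", "coupon_c", 3, "clicked"),
    ("u1", "coupon_b", 2, "shown"),
    ("u2", "coupon_b", 1, "shown"),
    ("u2", "coupon_a", 2, "acquired"),
]

# Assumed grading of user reactions; tune this to the product's goals.
EVENT_GAIN = {"acquired": 2, "clicked": 1, "shown": 0}


def gains_by_user(rows):
    """Return each user's relevance gains ordered by recommendation rank."""
    per_user = {}
    for user, _item, _rank, event in sorted(rows, key=lambda r: (r[0], r[2])):
        per_user.setdefault(user, []).append(EVENT_GAIN.get(event, 0))
    return per_user


gains = gains_by_user(logs)
print(gains)
```

Each per-user list is in rank order, so it can be passed as the graded relevance input to an nDCG@k computation.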
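  13. Verification: Examples of verification when there are no answers or contexts

    Approach | Methods | Example of use
    A/B testing | A/B test | Compare coupon acquisition between users who use the chat and users who do not
    Causal inference | Propensity score matching; difference-in-differences | Applied when an A/B test is not possible or when bias correction is needed
    Ranking evaluation | Recall@k; Precision@k; nDCG@k | Treat the coupon-acquisition action as relevance and evaluate
    User research | Surveys; interviews | Investigate how much the recommendations helped users acquire coupons
    Taro: I'll have to ask the app development department to collect the logs...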
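Of the causal-inference methods listed, difference-in-differences is the easiest to show as arithmetic: it subtracts the change in the comparison group from the change in the treated group. The sketch below uses made-up coupon-acquisition counts and assumes both groups would have followed parallel trends without the chat feature, which real data may not satisfy.

```python
def acquisition_rate(acquired: int, users: int) -> float:
    """Share of users who acquired a coupon."""
    return acquired / users


def difference_in_differences(treated_before, treated_after,
                              control_before, control_after):
    """Change in the treated group minus change in the control group."""
    return (treated_after - treated_before) - (control_after - control_before)


# Hypothetical counts per 1000 users, before and after the chat launch.
chat_before = acquisition_rate(150, 1000)
chat_after = acquisition_rate(230, 1000)
no_chat_before = acquisition_rate(100, 1000)
no_chat_after = acquisition_rate(120, 1000)

# Naive comparison after launch ignores that chat users already acquired more.
naive_gap = chat_after - no_chat_after  # roughly 0.11

# Difference-in-differences removes the pre-existing gap and the shared trend.
did_effect = difference_in_differences(
    chat_before, chat_after, no_chat_before, no_chat_after
)  # roughly 0.06

print(round(naive_gap, 4), round(did_effect, 4))
```

The gap between the two numbers is the kind of bias the slide refers to when it recommends causal inference where a clean A/B test is not available.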