Japan OSS Promotion Forum | AI Study Session | How to Evaluate LLMs
Logbii
August 30, 2025
Transcript
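Logbii, Inc. 1
Logbii, Inc. (株式会社ログビー)
Representative Director, CEO and CTO: Atsuyoshi Matsuda (松田 敦義)
August 21, 2025
Japan OSS Promotion Forum, AI Study Session
How to Evaluate LLMs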
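Logbii, Inc. 2
Matsuda: Self-Introduction (Profile)
Rakuten: Application Engineer, May 2009 to April 2013
• Rakuten Card merchant site
• Rakuten Ichiba Bookmark
Viibar (now VideoTouch): Head of Technology, April 2013 to April 2015
• Video crowdsourcing
Logbii: CEO and CTO, May 2015 to present
• AI/IT solutions
• Engineer evaluation system
2012: Rakuten Technology Conference, executive committee member
2017 to present: organizer of a gathering of former Rakuten development department members
2024: Japan CTO Association, CTO Networking PM
Hobbies: community management and planning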
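Logbii, Inc. 3
Today's Story (Story)
• Ki (introduction): A task dumped by the boss: "The company seems to want to do something with AI"
• Sho (development): How do we evaluate AI?
• Ten (twist): The requests keep coming
• Ketsu (conclusion): Thanks
Characters
• Boss
• Taro (pseudonym): engineer for 3 years, currently studying generative AI
• Senior colleague: engineer for 10 years, 5 years of AI experience
Note: the story is fiction.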
Logbii, Inc. 4
Ki: A Task Dumped by the Boss (INTRODUCTION)
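Logbii, Inc. 5
How It All Started (Intro)
Boss: The company says it wants to do something with AI, and our department has been asked to research how to evaluate AI. The only people who understand this are our senior colleague and you, Taro. Our senior colleague is busy with a project, so could you take the lead? Feel free to ask them questions as needed.
Taro: Seriously? I will be learning as I go, but with our senior colleague around I should be fine. Understood for now!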
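Logbii, Inc. 6
How It All Started (Intro)
Taro (thinking): I have no idea where to start. It is early, but I will just go and ask my senior colleague.
Taro: My boss asked me to do such-and-such. Where do you think I should start?
Senior colleague: Let me see. Rather than limiting yourself to recent generative AI, start with how AI has been evaluated all along, and then work toward understanding LLM evaluation.
Note: this talk focuses on LLMs.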
Logbii, Inc. 7
Sho: How Do We Evaluate AI? (DEVELOPMENT)
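Logbii, Inc. 8
Examples of Tasks AI Can Do (Capability)
[Figure: example tasks in two groups, "Before generative AI" and "Generative AI"; the figure content is not in the transcript]
Taro: I see.
Note: this talk covers supervised cases.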
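Logbii, Inc. 9
Examples of AI Evaluation Metrics (Before Generative AI) (Evaluation)
Task | Metric | Overview
Classification | Accuracy | Of all images, the share classified correctly (for example, predicted "dog" and actually a dog)
Classification | Recall | Of the images that are actually dogs, the share predicted as "dog" (how few are missed)
Classification | Precision | Of the images predicted as "dog", the share that are actually dogs (how little is wasted)
Regression | Mean squared error | The squared errors between actual and predicted house prices, divided by the number of samples (smaller is better)
Taro: The correct answer is clear-cut, so evaluation is easy!
Note: this talk covers supervised cases.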
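The four metrics in the table can be computed by hand. The following is a minimal sketch; the function names, the dog/cat labels, and the house prices are illustrative assumptions, not data from the slides.

```python
def classification_metrics(y_true, y_pred, positive="dog"):
    """Accuracy, recall, and precision for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }


def mean_squared_error(actual, predicted):
    """Average of the squared differences; smaller is better."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical labels and predictions for eight images.
labels = ["dog", "dog", "dog", "cat", "cat", "cat", "cat", "dog"]
preds = ["dog", "dog", "cat", "cat", "dog", "cat", "cat", "dog"]
metrics = classification_metrics(labels, preds)

# Hypothetical actual vs. predicted house prices.
mse = mean_squared_error([300, 450, 500], [310, 430, 500])
print(metrics, mse)
```

In practice a library such as scikit-learn would normally be used; the hand-written version is only meant to make the definitions concrete.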
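Logbii, Inc. 10
The Challenge of Evaluating Generative AI (Problem)
Taro (thinking): Before generative AI the answer was clear, but with generative AI, how do you evaluate generated text or images? I will ask my senior colleague again!
Taro: With generative AI, people interpret the generated text and images differently. How should they be evaluated?
Senior colleague: Good question. Azure, which you already have experience with, provides a mechanism for evaluating LLMs, so why not start by looking into that?
Note: this talk focuses on LLMs.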
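Logbii, Inc. 11
Examples of LLM Evaluation Metrics in Azure (Azure)
Task | Type | Aspect | Metric | Overview
Conversation | LLM as a Judge | Quality | Groundedness | How well the answer is supported by the given context
Conversation | LLM as a Judge | Quality | Relevance | How accurately the answer addresses the question
Conversation | LLM as a Judge | Quality | Coherence | Whether the text flows naturally and is easy to read
Conversation | LLM as a Judge | Safety | Hate/Unfair | Whether there are discriminatory expressions
Conversation | LLM as a Judge | Safety | Protected material | Whether copyright is infringed
Conversation | LLM as a Judge | Safety | Code vulnerability | Whether the program code contains vulnerabilities
Taro: So an LLM evaluates an LLM!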
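Logbii, Inc. 12
Example Prompt for LLM-as-a-Judge (Prompt)
# Your role
You are a strict fact checker. You evaluate the factual consistency (Groundedness) of RESPONSE based on whether it stays within the given CONTEXT.
# Evaluation criteria
- The response consists only of information explicitly contained in CONTEXT
- Claims, numbers, and proper nouns that are not in CONTEXT are treated as "unsupported"
- Deduct points for contradictory claims
# Scoring rubric (1-5)
5: Every claim is supported by CONTEXT (fully grounded)
4: Mostly grounded (only minor omissions or paraphrasing)
3: Partly grounded, but mixed with insufficiently supported claims
2: Mostly insufficiently supported, or partly contradictory
1: Almost or entirely not grounded
# Output format (JSON)
- score: an integer from 1 to 5
- verdict: "pass" or "fail" (threshold = 3; pass if 3 or higher)
- reasons: the reasons for the verdict (concise)
- citations: sentence numbers or excerpts from CONTEXT that support the verdict (if possible)
# Input
QUERY: {query}
CONTEXT: {context}
RESPONSE: {response}
# Procedure
1) List the claims in RESPONSE
2) Map each claim to its supporting evidence in CONTEXT
3) List unsupported claims and contradictions
4) Decide the score according to the rubric and return only JSON
Taro: I see.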
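A judge prompt like this is usually wrapped in a small amount of glue code: fill in the three placeholders, send the prompt to a model, and validate the JSON that comes back. The sketch below covers only the filling and validation steps. The condensed template wording, the function names, and the sample reply are assumptions for illustration, and no model API is called.

```python
import json

# Condensed, illustrative version of the judge prompt; not the slide's full text.
PROMPT_TEMPLATE = (
    "You are a strict fact checker. Rate the groundedness of RESPONSE "
    "against CONTEXT on a 1-5 scale and return JSON only, with the keys "
    "score, verdict, reasons, citations.\n"
    "QUERY: {query}\nCONTEXT: {context}\nRESPONSE: {response}"
)

PASS_THRESHOLD = 3  # as in the prompt: a score of 3 or higher passes


def build_prompt(query: str, context: str, response: str) -> str:
    """Fill the three placeholders of the judge prompt."""
    return PROMPT_TEMPLATE.format(query=query, context=context, response=response)


def parse_judgement(raw: str) -> dict:
    """Validate the judge's JSON and recompute the verdict from the score."""
    data = json.loads(raw)
    score = int(data["score"])
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return {
        "score": score,
        "verdict": "pass" if score >= PASS_THRESHOLD else "fail",
        "reasons": data.get("reasons", ""),
        "citations": data.get("citations", []),
    }


prompt = build_prompt(
    query="How many days of summer leave are there?",
    context="Summer leave: choose 4 business days between July and September.",
    response="Summer leave is 3 days.",
)

# Hypothetical judge reply whose verdict disagrees with its own score.
sample_reply = '{"score": 2, "verdict": "pass", "reasons": "Day count contradicts CONTEXT.", "citations": [1]}'
judgement = parse_judgement(sample_reply)
print(judgement)
```

Recomputing the verdict from the score, rather than trusting the returned field, is one way to guard against a judge whose output is internally inconsistent.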
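Logbii, Inc. 13
Examples of LLM Evaluation Metrics in Azure (Azure)
Task | Type | Aspect | Metric | Overview
Conversation | NLP/formula-based | Quality | ROUGE | How much the N-grams overlap
Conversation | NLP/formula-based | Quality | Similarity | How close the vectors of the reference answer and the AI answer are
Taro: So there are also quantitative, formula-based evaluations.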
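Both formula-based metrics reduce to short computations. The sketch below shows a simplified ROUGE-N recall (whitespace tokenization, no stemming) and cosine similarity between two vectors. The sentences and the vectors are made-up values, and a real setup would obtain the vectors from an embedding model.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(reference: str, candidate: str, n: int = 1) -> float:
    """Share of the reference n-grams that also appear in the candidate."""
    ref = ngrams(reference.split(), n)
    cand = ngrams(candidate.split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum((ref & cand).values())  # clipped n-gram matches
    return overlap / total


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


reference = "summer leave is four days"
candidate = "summer leave is three days"
rouge_1 = rouge_n_recall(reference, candidate, n=1)
rouge_2 = rouge_n_recall(reference, candidate, n=2)

# Toy vectors standing in for embeddings of the reference and the AI answer.
similarity = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(rouge_1, rouge_2, similarity)
```

The example also shows a limitation of overlap metrics: the candidate gets a high unigram score even though the one word that matters ("four" vs. "three") is wrong.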
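Logbii, Inc. 14
The Challenge of Evaluating RAG (Problem)
Taro (thinking): Come to think of it, I studied RAG recently. With RAG, if the vector DB returns the wrong results in the first place, LLM-as-a-Judge alone probably will not lead to good results for the user.
[Diagram, RAG overview: the app receives a question, runs a vector search against the vector DB, and gets the top related information; it sends the question plus the top related information to the LLM, which generates the answer. LLM-as-a-Judge uses the retrieved information as its context.]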
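Logbii, Inc. 15
The Challenge of Evaluating RAG (Problem)
Taro: Following your advice, I looked into Azure's LLM evaluation a little and it helped me understand a lot! By the way, for RAG evaluation, I felt that LLM-as-a-Judge alone is not enough, for such-and-such reasons.
Senior colleague: Good catch. With RAG, whether the top information retrieved from the vector DB is appropriate for the question has to be evaluated separately from LLM-as-a-Judge. There are ranking metrics and RAG-specific metrics, so looking into them should help.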
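Logbii, Inc. 16
Examples of Evaluation Metrics for Ranking Results (Ranking)
Task | Metric | Overview
Search, recommendation | Recall@k | Of all the "relevant" items, the share that appears in the top k results
Search, recommendation | Precision@k | Of the top k results, the share that is "relevant"
Search, recommendation | nDCG@k | Used when graded relevance (for example, 0 to 5) is available
RAG | Context Precision | Of the contexts passed to the LLM, the share that was "useful context"
Taro: There are all sorts of evaluation metrics for search and recommendation.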
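The ranking metrics in the table can be sketched in a few lines. The document IDs and relevance grades below are hypothetical. The nDCG variant shown uses linear gain with a log2 discount; some implementations use an exponential gain (2^rel - 1) instead.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Share of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Share of all relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)


def dcg(gains):
    """Discounted cumulative gain: lower ranks are discounted by log2(rank + 1)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked, grades, k):
    """DCG of the actual top-k divided by the DCG of the ideal ordering."""
    actual = dcg([grades.get(doc, 0) for doc in ranked[:k]])
    ideal = dcg(sorted(grades.values(), reverse=True)[:k])
    return actual / ideal if ideal else 0.0


# Hypothetical search result and relevance judgments.
ranked = ["d1", "d2", "d3", "d4", "d5"]
grades = {"d1": 3, "d3": 2, "d6": 1}  # graded relevance; d6 was never retrieved
relevant = set(grades)

p3 = precision_at_k(ranked, relevant, 3)
r3 = recall_at_k(ranked, relevant, 3)
n3 = ndcg_at_k(ranked, grades, 3)
print(p3, r3, n3)
```

Precision@k and Recall@k only need a binary relevant/not-relevant label, while nDCG@k also rewards placing the most relevant items near the top.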
Logbii, Inc. 17
Ten: The Requests Keep Coming (TWIST)
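Logbii, Inc. 18
A Request from General Affairs (Consultation)
Taro: Hello. I have looked into LLM evaluation, as we discussed!
Boss: Thank you. By the way, the first generative AI project has been decided: we are going to build a Q&A bot for the internal company rules. General Affairs gave us the document files and sample Q&A data and asked us to use them for evaluation. Could you develop a simple evaluation mock-up?
Taro: Understood!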
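Logbii, Inc. 19
RAG Overview for the Internal Rules (RAG)
[Diagram, internal rules RAG overview: an employee asks the app a question about the internal rules; the app runs a vector search against the vector DB, which holds the internal rules, and gets the top related information; it sends the question plus the top related information to the LLM, which generates the answer.]
Sample Q&A from General Affairs (internal rules sample):
Question | Answer
XXXXX | XXXXX
XXXXX | XXXXX
XXXXX | XXXXX
Taro: So this is roughly what it looks like this time.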
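Logbii, Inc. 20
Implementing a RAG Mock-up (RAGAS)
Taro (thinking): What should I use to build a simple mock-up? When in trouble, ask the senior colleague. Except they are away on a business trip! This is exactly the time to ask AI.
AI: For a simple mock-up, RAGAS would be a good choice.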
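Logbii, Inc. 21
Example Implementation of RAG Evaluation with RAGAS (RAGAS)
    # Imports are not shown on the slide; these assume a ragas 0.1/0.2-style API.
    from datasets import Dataset
    from langchain_openai import ChatOpenAI
    from ragas import evaluate
    from ragas.llms import LangchainLLMWrapper
    from ragas.metrics import answer_relevancy, context_precision, faithfulness

    data_samples = {
        # Questions about the internal company rules
        "question": [
            "How many days of summer leave are there?",
            "How many days of year-end and New Year leave are there?",
        ],
        # Model answers prepared by General Affairs
        "ground_truth": ["4 days", "4 days"],
        # Answers from the LLM
        "answer": ["It is 3 days.", "It is 5 days."],
        # Top contexts retrieved from the vector DB
        "contexts": [
            [
                "Summer leave: choose 4 business days between July and September.",
                "Summer leave is granted separately from paid leave.",
            ],
            [
                "Year-end and New Year leave: choose 4 business days between November and January.",
                "Year-end and New Year leave is granted separately from paid leave.",
            ],
        ],
    }
    dataset = Dataset.from_dict(data_samples)

    # Use GPT-4o as the LLM-as-a-Judge
    llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o", temperature=0))

    # Evaluate Faithfulness (roughly Groundedness), Relevancy, and Context Precision
    result = evaluate(
        dataset,
        metrics=[faithfulness, answer_relevancy, context_precision],
        llm=llm,
    )
Taro: I see.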
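Logbii, Inc. 22
A Request from Information Systems (Consultation)
Taro: Hello. I have developed a simple mock-up for the evaluation!
Boss: Thank you. By the way, the Information Systems team, which runs the coupon search system for employee benefits, wants to build an AI-powered coupon recommendation chatbot and has asked what kind of evaluation they should do. Could you think about it?
Taro: Evaluating a recommendation chatbot!? Understood for now...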
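Logbii, Inc. 23
The Challenge When There Is No Answer or Context (Problem)
Taro: Oh, you are back! That is a relief. By the way, I am stuck on something. I understand that there are recommendation metrics such as nDCG@k, but in this case there are no relevance values, so it looks like they cannot be used.
Senior colleague: You have your hands full with all these requests. When there is no answer or context, as in this case, it is better to evaluate user reactions and opinions. Concretely, you can run interviews or surveys, or collect logs in the app and verify the effect. If you define relevance from the user logs, metrics such as nDCG@k become usable as well.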
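Logbii, Inc. 24
Examples of Verification When There Is No Answer or Context (Verification)
Approach | Methods | How it is used
A/B testing | A/B test | Compare the coupon acquisition rate of users who use the chat with that of users who do not
Causal inference | Propensity score matching, difference-in-differences | Used when an A/B test is not possible or bias correction is needed
Ranking evaluation | Recall@k, Precision@k, nDCG@k | Evaluate by treating coupon acquisition behavior as the relevance signal
User research | Surveys, interviews | Investigate how much the recommendations helped users acquire coupons
Taro: I will have to ask the app development department to collect the logs...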
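Two rows of the table come down to simple arithmetic. The sketch below compares coupon acquisition rates between two groups with a two-proportion z-test, and computes a difference-in-differences estimate. All counts and rates are invented for illustration; the function names are hypothetical, and a real analysis would also need to check sample size and group assignment.

```python
import math


def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Lift of group B over group A, with z statistic and two-sided p-value."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return p_b - p_a, 0.0, 1.0
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided, normal approximation
    return p_b - p_a, z, p_value


def difference_in_differences(treat_before, treat_after, ctrl_before, ctrl_after):
    """Change in the treated group minus the change in the control group."""
    return (treat_after - treat_before) - (ctrl_after - ctrl_before)


# Hypothetical A/B test: users without chat (A) vs. with chat (B).
lift, z, p_value = two_proportion_z(conv_a=120, n_a=1000, conv_b=150, n_b=1000)

# Hypothetical acquisition rates before and after the chatbot launch.
did = difference_in_differences(
    treat_before=0.10, treat_after=0.16, ctrl_before=0.10, ctrl_after=0.12
)
print(lift, z, p_value, did)
```

Difference-in-differences subtracts the trend seen in the control group, which is why it is a fallback when users cannot be randomly assigned as in an A/B test.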
Logbii, Inc. 25
Ketsu: Thanks (CONCLUSION)
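Logbii, Inc. 26
Thanks (Thanks)
Boss: Thank you for everything this time! Thanks to you, both General Affairs and Information Systems were very grateful!
Taro: Really? It is all thanks to the advice my senior colleague gave me. (blushes)
Senior colleague: No, no, it is the result of your hard work, Taro. Anyway, time for a celebration!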
Logbii, Inc. 27
The Celebration (Party)
Boss, Taro, senior colleague: Cheers!!