Slide 1

Building JGLUE, and What Comes Next for Japanese LLM Evaluation
Daisuke Kawahara (Waseda University)
W&B Tokyo Meetup #8 (2023/11/15)

Slide 2

Progress of Large Language Models (LLMs)
https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/

Slide 3

A Taxonomy of LLMs
[Figure: Transformer diagram from https://jalammar.github.io/illustrated-transformer/ (6 encoder layers, 6 decoder layers; attention within the source language, within the target language, and between source and target), contrasting three architecture families: encoder-decoder (e.g., T5), encoder-only (BERT family), and decoder-only (GPT family).]

Slide 4

Language Understanding Benchmark: GLUE
General Language Understanding Evaluation [Wang+ 2018]

Task: description
• SST-2: sentiment analysis of movie reviews (positive/negative)
• CoLA: whether a sentence is linguistically acceptable
• MRPC: whether two sentences have the same meaning
• STS-B: similarity of two sentences (1-5)
• QQP: whether two questions have the same meaning
• MNLI: recognizing the entailment relation between two sentences (entailment/contradiction/neutral)
• QNLI: whether a sentence contains the answer to a question (SQuAD)
• RTE: whether two sentences stand in an entailment relation
• WNLI: whether two sentences stand in an entailment relation (Winograd Schema Challenge)

Slide 5

GLUE Leaderboard (as of November 2019)
• T5: 89.7
• Human performance: 87.1
• BERT: 80.5
• Baseline (ELMo): 70.0

Slide 6

JGLUE: A Japanese Language Understanding Benchmark
[Kurihara+ 2022] [Kurihara+ 2023]
Kentaro Kurihara, Tomohide Shibata, Daisuke Kawahara

Slide 7

Background (1/2)
• Benchmarks such as GLUE [Wang+ 2018] are indispensable for comprehensive evaluation and analysis of LLMs
• Benchmarks have been built for languages other than English as well
  • French FLUE [Le+ 2020], Chinese CLUE [Xu+ 2020], Korean KLUE [Park+ 2021], ...
→ We built JGLUE, a Japanese language understanding benchmark
[Figure: a virtuous cycle between building ever harder benchmarks and improving LLM performance.]

Slide 8

Background (2/2)
• Issues with existing Japanese datasets
  1. Translation-based (e.g., JSNLI [Yoshikoshi+ 2020], JSICK [Yanaka+ 2021])
    • Unnatural Japanese resulting from machine or human translation
    • Regional and cultural gaps with Japan (e.g., many sentences about American place names, politicians, etc.)
    → Build from scratch in Japanese
  2. Domain-specific
    • e.g., JRTE [Hayashibe+ 2020]: hotel reviews
    → Build on a general domain
• Build JGLUE, a Japanese language understanding benchmark, and promote language understanding research

Slide 9

Composition of JGLUE
• Designed to broadly cover the tasks of GLUE and SuperGLUE
• Built using Yahoo! Crowdsourcing

Task / dataset / train / dev / test
• Text classification: MARC-ja 187,528 / 5,654 / 5,639; JCoLA [Someya+ 2022] - / - / -
• Sentence-pair classification: JSTS 12,451 / 1,457 / 1,589; JNLI 20,073 / 2,434 / 2,508
• QA: JSQuAD 62,859 / 4,442 / 4,420; JCommonsenseQA 8,939 / 1,119 / 1,118

Slide 10

Slide 10 text

Ϋϥ΢υιʔγϯάͰͷճ౴ positive: 0, negative: 10 10 ৭΋ཤ͖৺஍΋࠷ߴͰ͢ɻࢲͷ৔߹͸Ն৔ͷધ௼Γʹ ࢖͍ͬͯ·͢ɻ MARC-ja JSTS/JNLI จ֗தͷಓ࿏Λେ͖ͳόε͕૸͍ͬͯ·͢ɻ จಓ࿏Λେ͖ͳόε͕૸͍ͬͯ·͢ɻ positive Կ͜ͷ4%ΧʔυσʔλʔҠͤͳ͍͠ίϚϯυͰௐ΂ͨ ΒΤϥʔग़Δ͜͠Ε͸ͳ͍ negative ୯७ʹҰिؒͷఱؾΛ஌Γ͍ͨͷͰ͋Ε͹͜ΕͰे෼ɻ ͕ͩ͜ͷఔ౓Ͱ༗ྉ͸͍͔͕ͳ΋ͷ͔ positive → negative ྨࣅ౓: 4.4, ਪ࿦ؔ܎: entailment จςʔϒϧʹྉཧ͕ͳΒ΂ΒΕ͍ͯ·͢ɻ จςʔϒϧʹ৯΂͔͚ͷྉཧ͕͋Γ·͢ɻ ྨࣅ౓: 3.0, ਪ࿦ؔ܎: neutral จ໺ٿબख͕όοτΛεΠϯά͍ͯ͠·͢ɻ จ໺ٿબख͕ΩϟονϘʔϧΛ͍ͯ͠·͢ɻ ྨࣅ౓: 2.0, ਪ࿦ؔ܎: contradiction JGLUEσʔλྫ

Slide 11

JGLUE Data Examples (translated from the Japanese originals)

JSQuAD
[Title] Tokaido Shinkansen
Passage: "With the division and privatization of JNR on April 1, 1987 (Showa 62), JR Central took over operation. Through services run onto the Sanyo Shinkansen, which was taken over by West Japan Railway Company (JR West), so rolling stock owned by JR West is sometimes used even on trains that run only within the Tokaido Shinkansen section. As of March 2020 (Reiwa 2), the journey between Tokyo Station and Shin-Osaka Station takes as little as 2 hours 21 minutes, operating at a top speed of 285 km/h."
• Question: In 2020, what was the fastest journey time between Tokyo and Shin-Osaka? Answer: 2 hours 21 minutes
• Question: Which line runs through services with the Tokaido Shinkansen? Answer: the Sanyo Shinkansen

JCommonsenseQA
• Question: What do you call the person with ultimate responsibility for a company? Choices: teacher, department manager, company president, subordinate, part-timer
• Question: What utensil do you use to eat soup? Choices: spoon, menu, plate, fork, chopsticks

Slide 12

Construction Flow of JSTS and JNLI
[Figure: starting from image caption sentences (e.g., "A blue car is driving", "A blue car is driving along the coast"), sentence pairs are sampled and annotated with similarity scores (JSTS-A/B/C, e.g., score 3.8, 4.5, 1.2); inference relations (entailment/neutral) are then annotated on the same pairs (JNLI-A), and contradiction sentences are written by hand (e.g., "lit by the sunset" → "lit by moonlight") to obtain contradiction pairs (JNLI-C). Image sources: Irasutoya (https://www.irasutoya.com/), ONWA Illust (https://onwa-illust.com/).]

Slide 13

How Each Task Is Solved
• Single-sentence classification (MARC-ja): encode "[CLS] この PC は 丈夫 で 軽い 。 [SEP]" and classify (e.g., positive)
• Sentence-pair classification/regression (JSTS, JNLI): encode "[CLS] sentence1 [SEP] sentence2 [SEP]" and classify (e.g., entailment) or regress
• Span extraction (JSQuAD): encode "[CLS] question [SEP] passage [SEP]" and predict the start/end positions of the answer span (e.g., 東京)
• Multiple choice (JCommonsenseQA): encode "[CLS] question [SEP] choice_i [SEP]" for each of the 5 choices, score each encoding, and take a softmax over the 5 scores (see the sketch below)
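
To make the multiple-choice setup concrete, here is a minimal sketch using Hugging Face's AutoModelForMultipleChoice. The model name is a placeholder for the Japanese BERT baselines, and the preprocessing is an assumption, not the authors' exact code:

```python
# Minimal sketch of the JCommonsenseQA multiple-choice setup:
# each (question, choice) pair is encoded as "[CLS] question [SEP] choice [SEP]",
# the model scores every pair, and a softmax over the 5 scores picks the answer.
import torch
from transformers import AutoModelForMultipleChoice, AutoTokenizer

model_name = "cl-tohoku/bert-base-japanese-whole-word-masking"  # assumed baseline
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMultipleChoice.from_pretrained(model_name)

question = "スープを飲む時に使う道具は?"
choices = ["スプーン", "メニュー", "皿", "フォーク", "はし"]

# Encode the question against each of the 5 choices.
enc = tokenizer([question] * len(choices), choices,
                return_tensors="pt", padding=True)
# AutoModelForMultipleChoice expects shape (batch, num_choices, seq_len).
inputs = {k: v.unsqueeze(0) for k, v in enc.items()}

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, 5)
probs = logits.softmax(dim=-1)
print(choices[probs.argmax().item()])
```

The other tasks differ only in the head: a classification or regression head over [CLS] for MARC-ja/JSTS/JNLI, and start/end span logits over the passage tokens for JSQuAD.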

Slide 14

Experimental Results (dev set)
https://github.com/yahoojapan/JGLUE#baseline-scores

Slide 15

JCommonsenseQA 2.0: Improving a Commonsense Reasoning Dataset through Human-Computer Collaboration [Kurihara+ 2023]

V1 → V2 example (distractor generation, then question rewriting; translated):
• V1 — Question: Which stationery item is a popular present? Choices: fountain pen, outdoors, moon, cup, rice bowl
• After distractor generation — Question: Which stationery item is a popular present? Choices: fountain pen, pencil, origami paper, cup, rice bowl
• After question rewriting — Question: What do you write with by refilling its ink? Choices: fountain pen, pencil, origami paper, cup, rice bowl

Accuracy: V1 / V2 (distractor generation) / V2 (question rewriting)
• Human: 0.988 / 0.997 / 0.996
• Tohoku Univ. BERT-BASE: 0.782 / 0.571 / 0.678
• Tohoku Univ. BERT-LARGE: 0.822 / 0.617 / 0.736
• Waseda Univ. RoBERTa-BASE: 0.849 / 0.551 / 0.672
• Waseda Univ. RoBERTa-LARGE: 0.901 / 0.807 / 0.865

[Figure: the cycle of building ever harder benchmarks and improving LLM performance.]

Slide 16

Progress of Decoder-Based Generative LLMs
[awesome-japanese-llm]

Slide 17

Benchmarks in the Generative LLM Era: English
MMLU, lm-evaluation-harness, Open LLM Leaderboard, AlpacaEval, Chatbot Arena, MT-Bench

Slide 18

MMLU
Measuring Massive Multitask Language Understanding [Hendrycks+ 2021]
• Four-choice questions across 57 subjects, including mathematics, physics, law, and history
• Includes questions from the GRE, the US medical licensing exam, and others
• Typically answered and evaluated few-shot (see the prompt sketch below)

Sample questions (from the paper):
• Microeconomics: One of the reasons that the government discourages and regulates monopolies is that (A) producer surplus is lost and consumer surplus is gained. (B) monopoly prices ensure productive efficiency but cost society allocative efficiency. (C) monopoly firms do not engage in significant research and development. (D) consumer surplus is lost with higher prices and lower levels of output.
• Conceptual Physics: When you drop a ball from rest it accelerates downward at 9.8 m/s². If you instead throw it downward assuming no air resistance, its acceleration immediately after leaving your hand is (A) 9.8 m/s² (B) more than 9.8 m/s² (C) less than 9.8 m/s² (D) Cannot say unless the speed of throw is given.
• College Mathematics: In the complex z-plane, the set of points satisfying the equation z² = |z|² is a (A) pair of points (B) circle (C) half-line (D) line
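
As an illustration of the few-shot protocol, here is a sketch of the prompt format commonly used for MMLU; the header string follows the widespread convention, and the details are assumptions rather than the original evaluation code:

```python
# Sketch of the usual MMLU few-shot prompt: k solved exemplars followed by
# the target question; the model's next token ("A"-"D") is compared against
# the gold label.
def format_example(question, choices, answer=None):
    s = question + "\n"
    for letter, choice in zip("ABCD", choices):
        s += f"{letter}. {choice}\n"
    s += "Answer:"
    if answer is not None:          # exemplars include the gold answer
        s += f" {answer}\n\n"
    return s

def build_prompt(subject, shots, target_question, target_choices):
    prompt = (f"The following are multiple choice questions "
              f"(with answers) about {subject}.\n\n")
    for question, choices, answer in shots:  # k few-shot exemplars
        prompt += format_example(question, choices, answer)
    return prompt + format_example(target_question, target_choices)
```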

Slide 19

lm-evaluation-harness (EleutherAI)
• Evaluates generative LLMs uniformly across 200+ datasets
• ARC, BIG-Bench, BLiMP, CrowS-Pairs, DROP, LAMBADA, MGSM, MMLU, PAWS-X, QNLI, SQuAD v2, SWAG, TruthfulQA, XCOPA, XWinograd, ...
https://github.com/EleutherAI/lm-evaluation-harness
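
For reference, a hedged sketch of driving the harness programmatically; the entry point, backend name, and task names have changed across versions and forks, so every identifier here is an assumption to be checked against the README of the version used:

```python
# Sketch of a harness run via the Python API (the CLI `python main.py
# --model hf-causal --tasks ... --num_fewshot ...` is the equivalent).
from lm_eval import evaluator  # assumed module layout

results = evaluator.simple_evaluate(
    model="hf-causal",                               # Hugging Face causal-LM backend
    model_args="pretrained=EleutherAI/pythia-1.4b",  # placeholder model
    tasks=["hellaswag", "lambada_openai"],           # placeholder task names
    num_fewshot=0,
)
print(results["results"])
```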

Slide 20

Open LLM Leaderboard (Hugging Face)
• Uses lm-evaluation-harness
• Ranks models by their average score across the benchmark tasks
• Models can be filtered
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

Slide 21

AlpacaEval
• 805 diverse questions (generation task)
• Automatic evaluation (GPT-4, Claude)
• Ranking by win rate from pairwise comparisons
• Compared against text-davinci-003
https://tatsu-lab.github.io/alpaca_eval/

Slide 22

Chatbot Arena (LMSYS)
• Human evaluation (crowdsourcing)
• Ranking by Elo ratings computed from pairwise comparisons
https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard
https://lmsys.org/blog/2023-05-03-arena/

Slide 23

MT-Bench (LMSYS) [Zheng+ 2023]
• Automatic evaluation (GPT-4) or human evaluation
• Absolute scores (1-10) or pairwise comparison
• 80 questions probing multi-turn dialogue ability and instruction following
• 8 categories (10 questions each, 2 turns): writing, roleplay, reasoning, math, coding, extraction, knowledge I (STEM), knowledge II (humanities/social science)
[Figure 20 in [Zheng+ 2023]: category-wise scores of 6 models (GPT-4, Claude-v1, GPT-3.5-turbo, Vicuna-13B, Alpaca-13B, LLaMA-13B) on MT-bench.]
https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard

Slide 24

MT-Bench: Sample Multi-Turn Questions [Zheng+ 2023]

From Table 1 of the paper (sample multi-turn questions in MT-bench):
• Writing — 1st turn: "Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions." 2nd turn: "Rewrite your previous response. Start every sentence with the letter A."
• Math — 1st turn: "Given that f(x) = 4x³ − 9x − 14, find the value of f(2)." 2nd turn: "Find x such that f(x) = 0."
• Knowledge — 1st turn: "Provide insights into the correlation between economic indicators such as GDP, inflation, and unemployment rates. Explain how fiscal and monetary policies ..." 2nd turn: "Now, explain them again like I'm five."

Slide 25

Prompt for Absolute-Score Evaluation

[System]
Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question. Your evaluation should consider correctness and helpfulness. You will be given a reference answer and the assistant's answer. Your evaluation should focus on the assistant's answer to the second question. Begin your evaluation by comparing the assistant's answer with the reference answer. Identify and correct any mistakes. Be as objective as possible. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: "[[rating]]", for example: "Rating: [[5]]".

<|The Start of Reference Answer|>
### User: {question_1}
### Reference answer: {ref_answer_1}
### User: {question_2}
### Reference answer: {ref_answer_2}
<|The End of Reference Answer|>

<|The Start of Assistant A's Conversation with User|>
### User: {question_1}
### Assistant A: {answer_1}
### User: {question_2}
### Assistant A: {answer_2}
<|The End of Assistant A's Conversation with User|>
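
A small sketch of the plumbing around such a template: filling the {question_1}-style placeholders and extracting the bracketed verdict. The regex and helper are illustrative assumptions, not the MT-Bench reference implementation:

```python
# Sketch: fill the judge template and extract the "Rating: [[n]]" verdict.
import re

def fill_template(template: str, **fields: str) -> str:
    # The prompt above uses {question_1}, {ref_answer_1}, ... as placeholders.
    return template.format(**fields)

def parse_rating(judge_output: str) -> float | None:
    # The prompt instructs the judge to emit "Rating: [[n]]".
    m = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", judge_output)
    return float(m.group(1)) if m else None

print(parse_rating("The assistant's answer is correct. Rating: [[9]]"))  # 9.0
```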

Slide 26

Slide 26 text

LLMʹΑΔࣗಈධՁͷ՝୊ • Position bias • ࠷ॳʹఏࣔ͞Εͨճ౴͕ΑΓΑ͍ͱ൑அͯ͠͠·͏ → A-B, B-Aͷ2छྨͷఏࣔॱͰධՁ • Name bias • “Assistant A”Λ“Assistant B”ΑΓڧ͍ͱ൑அͯ͠͠·͏ • Verbosity bias (length bias) • ΑΓ௕͍ճ౴ΛΑ͍ͱ൑அͯ͠͠·͏ • Self-enhancement bias • ධՁLLMʹΑΔࣗ෼ࣗ਎ͷੜ੒͕Α͍ͱ൑அͯ͠͠·͏ • (਺ֶɾਪ࿦ೳྗͷݶք) 26 [Zheng+ 2023]
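
A sketch of the order-swapping mitigation; the agree-or-tie aggregation rule used here is one common choice and an assumption, with `judge` being any callable that returns "A" or "B" for (question, first_answer, second_answer):

```python
# Sketch of the A-B / B-A mitigation for position bias: query the judge
# twice with swapped presentation order and count a win only if both
# verdicts agree; disagreements are treated as ties.
def debiased_verdict(judge, question, answer_a, answer_b):
    first = judge(question, answer_a, answer_b)   # A shown first
    second = judge(question, answer_b, answer_a)  # B shown first
    second = {"A": "B", "B": "A"}[second]         # map back to original labels
    return first if first == second else "tie"
```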

Slide 27

Slide 27 text

GPT-4ʹΑΔධՁͷ੍໿ • OpenAIͷར༻ن໿ ʹΑΔͱ... 2. Usage Requirements (c) Restrictions You may not (i) use the Services in a way that infringes, misappropriates or violates any person’s rights; (ii) reverse assemble, reverse compile, decompile, translate or otherwise attempt to discover the source code or underlying components of models, algorithms, and systems of the Services (except to the extent such restrictions are contrary to applicable law); (iii) use output from the Services to develop models that compete with OpenAI; (iv) ... • (OpenAIʹڝ߹͢Δ) LLMͷ։ൃऀ͸ɺGPT-4ͷग़ྗ(=ධՁ݁Ռ) Λ࢖ͬͯ͸͍͚ͳ͍ 27

Slide 28

Benchmarks in the Generative LLM Era: Japanese

Slide 29

Recent Japanese Leaderboards and Benchmarks

Benchmark / original questions? / # questions / task type / evaluation
• lm-evaluation-harness: no / - / classification + generation / automatic
• Nejumi: no / - / classification / automatic
• Rakuda: yes / 40 / generation / automatic
• Japanese VicunaQA: yes / 80 / generation / automatic
• Japanese MT-Bench: yes / 80 / generation / automatic
• ELYZA-tasks-100: yes / 100 / generation / human (+ automatic)

Slide 30

lm-evaluation-harness (Stability AI)
• Japanese version of EleutherAI/lm-evaluation-harness
• Supported datasets: JGLUE (JCommonsenseQA, JNLI, MARC-ja, JSQuAD), JAQKET v2, XLSum (ja), XWinograd (ja), MGSM
• Evaluated few-shot (2- or 3-shot)
• Ranking (leaderboard) by average accuracy
https://github.com/Stability-AI/lm-evaluation-harness/

Slide 31

Slide 31 text

Nejumi (Weights & Biases) • JGLUEΛϦʔμʔϘʔυԽ • MARC-ja, JNLI, JSQuAD, JCommonsenseQA • zero-shot ͰධՁ • ਫ਼౓ͷฏۉʹΑΔϥϯΩϯά • lm-evaluation-harnessͱͷҧ͍ 31 https://note.com/wandb_jp/n/n2464e3d85c1a • MARC-ja, JNLI, JCommonsenseQAͷςετʹ͓͍ͯɺStability AIͷධՁํ๏Ͱ͸બ୒ࢶʹؚ·ΕΔ ީิͷத͔Βର਺໬౓࠷େͷ΋ͷΛճ౴ͱ͢Δ͍Θ͹෼ྨثతͳΞϓϩʔνΛ࠾༻͍ͯ͠ΔͨΊʹ ແؔ܎ͳճ౴΍ϑΥʔϚοτΤϥʔɺεϖϧϛεͳͲ͕ى͜Γಘͳ͍ͷʹରͯ͠ɺࢲͨͪ͸શͯͷ ϘΩϟϒϥϦ͔Βࣗ༝ʹग़ྗ͍ͤͯ͞ΔͨΊʹ͜ΕΒΛࠀ෰͠ͳ͍ͱಘ఺Ͱ͖ͳ͍ɻ • JSQuADͷςετʹ͓͍ͯɺStability AIͷධՁํ๏Ͱ͸ਖ਼ղͷτʔΫϯ਺Λ༩͑ͯͦͷ෼͚ͩग़ྗ͞ ͍ͤͯΔͷʹରͯ͠ɺࢲͨͪͷධՁͰ͸ࣗྗͰଧͪ੾Βͳ͚Ε͹ͳΒͳ͍ɻ https://wandb.me/nejumi
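
To make the distinction concrete, here is a sketch of the "classifier-like" log-likelihood protocol: the model only ranks the given choices, so out-of-format answers are impossible by construction. The model name is a placeholder and tokenization boundary effects are ignored:

```python
# Sketch of log-likelihood scoring over fixed choices: append each candidate
# to the prompt, sum the log-probs of its tokens, and pick the argmax.
# Free-form generation would instead sample from the whole vocabulary.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "rinna/japanese-gpt-neox-3.6b"  # placeholder model
tok = AutoTokenizer.from_pretrained(name)
lm = AutoModelForCausalLM.from_pretrained(name)

def choice_logprob(prompt: str, choice: str) -> float:
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = lm(full_ids).logits.log_softmax(dim=-1)
    total = 0.0
    # Sum log-probs of the choice tokens only (positions after the prompt);
    # logits at position pos-1 predict the token at position pos.
    for pos in range(prompt_ids.shape[1], full_ids.shape[1]):
        total += logprobs[0, pos - 1, full_ids[0, pos]].item()
    return total

prompt = "質問: スープを飲む時に使う道具は?\n回答: "
choices = ["スプーン", "メニュー", "皿", "フォーク", "はし"]
print(max(choices, key=lambda c: choice_logprob(prompt, c)))
```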

Slide 32

Rakuda (YuzuAI)
• 40 questions on Japanese geography, politics, history, and society (written by hand)
• Automatic evaluation (GPT-4)
• Pairwise comparison (both presentation orders)
• Scored with Bradley-Terry strengths (a refinement of Elo ratings) to form a leaderboard (see the fitting sketch below)
https://yuzuai.jp/benchmark
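
A sketch of fitting Bradley-Terry strengths from pairwise outcomes with the classic minorization-maximization updates; initialization, tie handling, and convergence checks are simplified assumptions:

```python
# Sketch of Bradley-Terry strength estimation from pairwise results using
# MM updates: p_i <- W_i / sum_j (n_ij / (p_i + p_j)), where wins[i][j] is
# how often model i beat model j and n_ij is their total games. Ties are
# ignored for simplicity.
def bradley_terry(wins, iters=200):
    n = len(wins)
    p = [1.0] * n  # strengths, defined up to scale
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p.append(total_wins / denom if denom > 0 else p[i])
        s = sum(new_p)
        p = [x / s * n for x in new_p]  # renormalize for numerical stability
    return p

# Toy example: model 0 beats model 1 in 8 of 10 games -> strengths ~ 4:1.
print(bradley_terry([[0, 8], [2, 0]]))
```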

Slide 33

Example Rakuda Questions (translated)
• Name the places at the northernmost and southernmost points of Japan, and state which prefecture each belongs to.
• Name one of the most influential politicians in postwar Japanese politics and describe their contributions in detail.
• Describe the characteristics of the aristocratic society that formed in the Heian period, and discuss how it influenced Japanese culture (literature, art, religion, etc.).
• Describe Japan's "Trinity Reform" and explain its impact on the economy.

Slide 34

Slide 34 text

Japanese VicunaQA (ژେ) • Ұൠɺ஌ࣝɺϩʔϧϓϨΠɺৗࣝɺϑΣϧϛਪఆɺ൓࣮Ծ૝ɺίʔσΟϯάɺ ਺ֶɺϥΠςΟϯάʹؔ͢Δ80໰ • MT-Benchͷલ਎Ͱ͋ΔVicuna Eval 80໰ͷ຋༁ • ࣗಈධՁ (GPT-4) • ϖΞൺֱ (2छྨͷఏࣔॱ)ʹجͮ͘উ཰Λܭࢉ • ྫ • ετϨεͱ্खʹ෇͖߹͏ʹ͸ɺͲͷΑ͏ͳํ๏͕͋Γ·͔͢ʁ • ΠϯΫϧʔγϒͰΞΫηγϒϧͳެڞަ௨γεςϜΛઃܭ͢ΔࡍɺͲͷΑ͏ͳཁૉΛߟ ྀ͠·͔͢ʁ • ͋ͳ͕ͨ΋͠ւ଑ͷધ௕ͩͬͨΒɺๅ୳͠ͷϞνϕʔγϣϯΛߴΊΔͨΊ৐૊һʹͲΜ ͳݴ༿Λ͔͚·͔͢ʁ 34

Slide 35

Japanese MT-Bench (Stability AI)
• 80 questions probing multi-turn dialogue ability and instruction following
• 8 categories (10 questions each, 2 turns): writing, roleplay, reasoning, math, coding, extraction, knowledge I (STEM), knowledge II (humanities/social science)
• MT-Bench translated and adapted to fit Japanese culture
• Automatic evaluation (GPT-4)
• Absolute scores (1-10)
• Evaluation run by shi3z: https://note.com/shi3zblog/n/n6b2ac5874021

Slide 36

Example Japanese MT-Bench Questions (translated)
• Write a guide on business e-mail etiquette for new employees. Include the correct use of honorific language and points to watch in Japanese business culture.
  • Evaluate the guide you wrote objectively and point out any room for improvement.
• Play "Nobita" from Doraemon and start a conversation. Begin with the following question: "After washing your hands, do you think an air dryer is necessary?"
  • Let's have a meal together in town. Shall we take the bus there together?
• To your left you see a beautiful red house, to your right a magical greenhouse, and in front of you a charming pink place. So, where is the white house?
  • Does the original question contain any clue that definitively determines the location of the white house?

Slide 37

ELYZA-tasks-100 (ELYZA)
• 100 questions involving complex instructions and tasks
• Comes with reference answers and grading criteria
• Mainly human evaluation (5-point scale, 3 graders)
• Evaluation result sheet available
• Examples (translated):
  • List five ideas for getting one's enthusiasm for work back.
  • Read the following passages and rate how angry each writer is on a scale of 1 to 10 (1 = not angry, 10 = extremely angry). 1. "Failed the test again? You really are..." 2. "Failed the test? It was a hard one this time, wasn't it."
  • Reply to the following e-mail: "Thank you for your work. Due to feeling unwell today, I expect to arrive a little later than scheduled. I should be there shortly after 13:00 at the latest. I apologize for the inconvenience and ask for your kind understanding."

Slide 38

The Future of Japanese LLM Evaluation

Slide 39

Perspectives on LLM Evaluation
• Seen/Unseen: whether the model was trained on the task with supervision
  • Conventional benchmarks such as GLUE use the seen setting
  • Recent leaderboards are almost always implicitly unseen (zero-/few-shot)
• Contamination
  • The evaluation data may have been used in training
  • cf. "Catch me if you can! How to beat GPT-4 with a 13B model" [Blog]
• Task type: classification (understanding) vs. generation
• Evaluation method: automatic vs. human
  • Classification tasks: automatic evaluation
  • Generation tasks: both (automatic evaluation of generation mostly relies on GPT-4, however...)
• Model type
  • Training method: pretrained, fine-tuned (SFT), RLHF
  • Number of parameters
  • Training language(s)

Slide 40

Issues with Current Japanese LLM Evaluation
• Evaluating only on classification (understanding) tasks such as JGLUE is one-sided
• Issues with current generation datasets
  • Few large-scale datasets exist
    • News article summarization: XLSum (ja) [Hasan+ 2021]
    • Daily dialogue corpus: Japanese Daily Dialogue [Akama+ 2023]
• Issues with generation questions for LLM evaluation
  • Only a few dozen to ~100 questions per dataset
  • Evaluation is either human or automatic via GPT-4

Slide 41

Toward Benchmarks Suited to LLM Evaluation
• Expanding classification (understanding) datasets
  • Translating MMLU into Japanese (Kawahara Lab at Waseda / RIKEN AIP)
  • The llm-jp-eval and jaster efforts (LLM-jp study group)
• Expanding generation datasets
  • Automatic evaluation is indispensable
    • We want to avoid evaluation by GPT-4
    • Build an evaluator by annotating good/bad generations and fine-tuning on that data (a sketch follows below)
      • cf. BLEURT [Sellam+ 2020], COMET [Rei+ 2020]
  • Highly open-ended tasks such as chitchat dialogue are hard to evaluate (automatically)
    • Summarization and QA are candidates (JGLUE v2)
  • Considering allowing in-company text as annotation targets, in addition to open text
• Rethinking evaluation settings
  • Unseen setting? Few-shot setting?
  • Coping with prompt sensitivity
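
A minimal sketch of the fine-tuned-evaluator idea in the BLEURT/COMET spirit: regress human quality scores from (instruction, model output) pairs. The encoder name and data are placeholders, and this is not the BLEURT or COMET training code:

```python
# Sketch of a learned evaluator: fine-tune an encoder with a regression head
# on (instruction, output, human_score) triples, then use its prediction as
# an automatic quality score for new generations.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "cl-tohoku/bert-base-japanese-whole-word-masking"  # placeholder encoder
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(
    name, num_labels=1, problem_type="regression")

# One annotated example: a task instruction, a model output, a human score.
batch = tok(["仕事の熱意を取り戻すためのアイデアを5つ挙げてください。"],
            ["1. 休暇を取る 2. 目標を見直す ..."],
            return_tensors="pt", padding=True, truncation=True)
target = torch.tensor([[4.0]])  # hypothetical human rating on a 1-5 scale

loss = model(**batch, labels=target).loss  # MSE under problem_type="regression"
loss.backward()  # one step of the fine-tuning loop (optimizer omitted)
```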

Slide 42

Toward Benchmarks Suited to LLM Evaluation: Looking Further Ahead
[Figure 1 of [Guo+ 2023]: a proposed taxonomy of major categories and sub-categories of LLM evaluation, spanning knowledge and capability (question answering, tool learning, reasoning, knowledge completion), alignment (ethics and morality, bias, toxicity, truthfulness), safety (robustness evaluation, risk evaluation), specialized LLMs (biology and medicine, education, legislation, computer science, finance), and evaluation organization (benchmarks for holistic evaluation, for knowledge and reasoning, for NLU and NLG).]
[Awesome-LLMs-Evaluation-Papers]