
JGLUEの構築そして日本語LLM評価のこれから (Building JGLUE and the Future of Japanese LLM Evaluation)

Keisuke Kamata
November 15, 2023

Transcript

1. Types of LLMs
   • Encoder-decoder models (e.g., T5): attention within the source language, within the target language, and between source and target
   • Encoder-only models (BERT family)
   • Decoder-only models (GPT family)
   [Figure: encoder-decoder translation example ("je suis étudiant" → "I am a student"; 6 encoder layers, 6 decoder layers; the next word in the target language is predicted from the source sentence and the words previously generated), adapted from https://jalammar.github.io/illustrated-transformer/]

2. Language understanding benchmark: GLUE (General Language Understanding Evaluation) [Wang+ 2018]

   Task | Description
   SST-2 | Sentiment analysis of movie reviews (positive/negative)
   CoLA | Whether a sentence is linguistically acceptable
   MRPC | Whether two sentences have the same meaning
   STS-B | Similarity of two sentences (1-5)
   QQP | Whether two questions have the same meaning
   MNLI | Recognizing the entailment relation between two sentences (entailment/contradiction/neutral)
   QNLI | Whether a sentence contains the answer to a question (SQuAD)
   RTE | Whether two sentences stand in an entailment relation
   WNLI | Whether two sentences stand in an entailment relation (Winograd Schema Challenge)

3. Background (1/2)
   • Benchmarks such as GLUE [Wang+ 2018] are indispensable for comprehensive evaluation and analysis of LLMs
   • Benchmarks have also been built for languages other than English: French FLUE [Le+ 2020], Chinese CLUE [Xu+ 2020], Korean KLUE [Park+ 2021], ...
   → We built JGLUE, a Japanese language understanding benchmark
   (Building ever harder benchmarks and improving LLM performance form a mutually reinforcing cycle)

4. Background (2/2)
   • Issues with existing Japanese datasets
     1. Built by translation (e.g., JSNLI [Yoshikoshi+ 2020], JSICK [Yanaka+ 2021])
        • Unnatural Japanese from machine or human translation
        • Regional and cultural gaps with Japan (e.g., many sentences about American place names, politicians, etc.)
        → JGLUE is built from scratch in Japanese
     2. Restricted to a specific domain (e.g., JRTE [Hayashibe+ 2020]: hotel reviews)
        → JGLUE is built on a general domain
   • We built the Japanese language understanding benchmark JGLUE to promote language understanding research

5. Composition of JGLUE
   • Designed to broadly cover the tasks of GLUE and SuperGLUE
   • Built using Yahoo! Crowdsourcing

   Task | Dataset | train | dev | test
   Text classification | MARC-ja | 187,528 | 5,654 | 5,639
   Text classification | JCoLA [Someya+ 2022] | - | - | -
   Sentence-pair classification | JSTS | 12,451 | 1,457 | 1,589
   Sentence-pair classification | JNLI | 20,073 | 2,434 | 2,508
   QA | JSQuAD | 62,859 | 4,442 | 4,420
   QA | JCommonsenseQA | 8,939 | 1,119 | 1,118

6. JGLUE data examples (1): MARC-ja and JSTS/JNLI
   MARC-ja (sentiment classification of product reviews):
   • "The color and the fit are both great. I use them for boat fishing in summer." → positive
   • "What is this? The card data won't transfer past 4%, and checking with a command just throws an error. Unacceptable." → negative
   • "If you simply want to know the week's weather, this is enough, but is it really worth paying for?" → originally positive; relabeled negative from the crowdsourced votes (positive: 0, negative: 10)
   JSTS/JNLI (sentence-pair similarity and inference):
   • "A large bus is driving down a street in town." / "A large bus is driving down a road." → similarity: 4.4, inference relation: entailment
   • "Dishes are laid out on the table." / "There is half-eaten food on the table." → similarity: 3.0, inference relation: neutral
   • "A baseball player is swinging a bat." / "A baseball player is playing catch." → similarity: 2.0, inference relation: contradiction

7. JGLUE data examples (2): JSQuAD and JCommonsenseQA
   JSQuAD (extractive QA over Wikipedia):
   [Title] Tokaido Shinkansen
   Passage: With the breakup and privatization of JNR on April 1, 1987, JR Central took over operation. Through service is operated with the Sanyo Shinkansen, which was taken over by West Japan Railway Company (JR West), so JR West rolling stock is sometimes used even on trains that run only within the Tokaido Shinkansen section. As of March 2020, the fastest travel time between Tokyo and Shin-Osaka is 2 hours 21 minutes, operating at a maximum speed of 285 km/h.
   • Question: As of 2020, what is the fastest travel time between Tokyo and Shin-Osaka? → Answer: 2 hours 21 minutes
   • Question: Which line has through service with the Tokaido Shinkansen? → Answer: the Sanyo Shinkansen
   JCommonsenseQA (five-choice commonsense QA):
   • Question: What do you call the person with ultimate responsibility for a company? Choices: teacher, department manager, company president, subordinate, part-timer
   • Question: What utensil do you use to drink soup? Choices: spoon, menu, plate, fork, chopsticks

8. Construction flow of JSTS and JNLI
   [Flow diagram: starting from sets of image captions (e.g., 1-1: "A blue car is driving", 1-2: "A blue car is driving along the coast", ..., i-1: "A man lit by the sunset"), sentence pairs are formed and annotated. Similarity scores are assigned to caption pairs from the same image (JSTS-A, e.g., 3.8, 4.5), to pairs from different images (JSTS-B, e.g., 1.2), and to pairs with manually written contradiction sentences (JSTS-C). Inference relations are assigned to same-image pairs (JNLI-A, e.g., entailment, neutral), and contradiction pairs are created by rewriting a caption (JNLI-C, e.g., "lit by the sunset" vs. "lit by the moonlight" → contradiction). Image sources: Irasutoya (https://www.irasutoya.com/), ONWA illust (https://onwa-illust.com/)]

9. How each task is answered
   • Single-sentence classification (MARC-ja): encode [CLS] sentence [SEP] (e.g., [CLS] この PC は 丈夫 ##で 軽い 。 [SEP]) and classify from the [CLS] representation → positive
   • Sentence-pair classification/regression (JSTS, JNLI): encode [CLS] sentence1 [SEP] sentence2 [SEP] and predict the label or score → entailment
   • Span extraction (JSQuAD): encode [CLS] question [SEP] passage [SEP] and predict the start/end positions of the answer span (e.g., "東京")
   • Multiple choice (JCommonsenseQA): encode [CLS] question [SEP] choice_i [SEP] for each of the five choices, score each, and take a softmax over the five scores (see the sketch below)

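A minimal sketch of the multiple-choice setup described for JCommonsenseQA above: each (question, choice) pair is encoded as [CLS] question [SEP] choice [SEP], scored by a classification head, and the scores are softmaxed over the five choices. The checkpoint name is only a placeholder, and its multiple-choice head is randomly initialized here; in practice the model is fine-tuned on the JCommonsenseQA training set first.

```python
import torch
from transformers import AutoModelForMultipleChoice, AutoTokenizer

MODEL = "bert-base-multilingual-cased"  # placeholder; any BERT-style Japanese encoder works
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMultipleChoice.from_pretrained(MODEL)  # head is untrained until fine-tuned

question = "スープを飲む時に使う道具は？"
choices = ["スプーン", "メニュー", "皿", "フォーク", "はし"]

# Encode the question against every choice -> tensors of shape (num_choices, seq_len)
enc = tokenizer([question] * len(choices), choices, padding=True, return_tensors="pt")
# The multiple-choice head expects (batch, num_choices, seq_len)
inputs = {k: v.unsqueeze(0) for k, v in enc.items()}

with torch.no_grad():
    logits = model(**inputs).logits            # shape (1, num_choices)
probs = logits.softmax(dim=-1).squeeze(0)      # distribution over the five choices
print(choices[int(probs.argmax())])            # predicted choice (meaningful after fine-tuning)
```
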
10. JCommonsenseQA 2.0: improving a commonsense reasoning dataset through human-computer collaboration [Kurihara+ 2023]
   • V1 question: "What stationery item is popular as a present?" Choices: fountain pen, outdoors, moon, cup, rice bowl
   • V2 after adversarial distractor generation: Choices: fountain pen, pencil, origami, cup, rice bowl
   • V2 after question rewriting: "What do you write with that is refilled with ink?"

   Accuracy | V1 | V2 (distractor generation) | V2 (question rewriting)
   Human | 0.988 | 0.997 | 0.996
   Tohoku Univ. BERT-base | 0.782 | 0.571 | 0.678
   Tohoku Univ. BERT-large | 0.822 | 0.617 | 0.736
   Waseda RoBERTa-base | 0.849 | 0.551 | 0.672
   Waseda RoBERTa-large | 0.901 | 0.807 | 0.865

   V1 → V2 is an instance of the cycle of building harder benchmarks as LLM performance improves.

11. MMLU: Measuring Massive Multitask Language Understanding [Hendrycks+ 2021]
   • Four-choice questions covering 57 subjects such as mathematics, physics, law, and history
   • Includes questions from the GRE, the US Medical Licensing Examination, etc.
   • Usually answered and evaluated few-shot (a prompt-construction sketch follows below)
   Examples:
   • Microeconomics: "One of the reasons that the government discourages and regulates monopolies is that (A) producer surplus is lost and consumer surplus is gained. (B) monopoly prices ensure productive efficiency but cost society allocative efficiency. (C) monopoly firms do not engage in significant research and development. (D) consumer surplus is lost with higher prices and lower levels of output."
   • Conceptual Physics: "When you drop a ball from rest it accelerates downward at 9.8 m/s². If you instead throw it downward, assuming no air resistance, its acceleration immediately after leaving your hand is (A) 9.8 m/s² (B) more than 9.8 m/s² (C) less than 9.8 m/s² (D) Cannot say unless the speed of throw is given."
   • College Mathematics: "In the complex z-plane, the set of points satisfying the equation z² = |z|² is a (A) pair of points (B) circle (C) half-line (D) line"

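As a rough illustration of the few-shot protocol mentioned above, the sketch below assembles an MMLU-style prompt from worked examples and picks whichever answer letter the model assigns the highest log-likelihood. The `loglikelihood` argument is a stand-in for a real model call (for example, the scoring primitive an evaluation harness would provide); the dummy scorer exists only so the sketch runs on its own.

```python
LETTERS = "ABCD"

def format_example(question, options, answer=None):
    lines = [question] + [f"{l}. {o}" for l, o in zip(LETTERS, options)]
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)

def build_prompt(few_shot, target_question, target_options):
    shots = [format_example(q, opts, ans) for q, opts, ans in few_shot]
    return "\n\n".join(shots + [format_example(target_question, target_options)])

def pick_answer(loglikelihood, prompt):
    # Score each answer letter as a continuation of the prompt and take the best one.
    return max(LETTERS, key=lambda l: loglikelihood(prompt, " " + l))

if __name__ == "__main__":
    few_shot = [("2 + 2 = ?", ["3", "4", "5", "6"], "B")]               # one worked example
    prompt = build_prompt(few_shot, "3 * 3 = ?", ["6", "9", "12", "27"])
    dummy = lambda p, cont: {" A": -2.0, " B": -0.1, " C": -1.5, " D": -3.0}[cont]
    print(pick_answer(dummy, prompt))                                    # -> "B"
```
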
12. lm-evaluation-harness (EleutherAI)
   • Evaluates generative LLMs in a unified way across more than 200 datasets
   • ARC, BIG-Bench, BLiMP, CrowS-Pairs, DROP, LAMBADA, MGSM, MMLU, PAWS-X, QNLI, SQuAD v2, SWAG, TruthfulQA, XCOPA, XWinograd, ...
   https://github.com/EleutherAI/lm-evaluation-harness

13. AlpacaEval
   • 805 diverse questions (generation tasks)
   • Automatic evaluation (GPT-4, Claude)
   • Ranking by win rate from pairwise comparisons against text-davinci-003 (a minimal win-rate sketch follows below)
   https://tatsu-lab.github.io/alpaca_eval/

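The win-rate ranking reduces to a very small computation once the judge's verdicts are available. The sketch below is a simplification under stated assumptions: `judge` stands in for the GPT-4/Claude call, and counting a tie as half a win is a convention of this sketch rather than AlpacaEval's exact rule.

```python
def win_rate(model_answers, reference_answers, judge):
    """Fraction of pairwise comparisons won against the reference model's answers."""
    credit = {"win": 1.0, "tie": 0.5, "lose": 0.0}
    total = sum(credit[judge(ans, ref)] for ans, ref in zip(model_answers, reference_answers))
    return total / len(model_answers)

if __name__ == "__main__":
    # Dummy judge that simply prefers the longer answer, so the sketch runs standalone.
    dummy_judge = lambda a, b: "win" if len(a) > len(b) else "lose"
    print(win_rate(["a long answer", "hi"], ["short", "a longer reference"], dummy_judge))  # 0.5
```
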
14. MT-Bench (LMSYS) [Zheng+ 2023]
   • Automatic evaluation (GPT-4) or human evaluation
   • Absolute scores (1-10) or pairwise comparison
   • 80 questions probing multi-turn dialogue ability and instruction following
   • 8 categories (10 questions each, 2 turns): writing, roleplay, reasoning, math, coding, extraction, knowledge I (STEM), knowledge II (humanities/social science)
   [Figure 20 of [Zheng+ 2023]: category-wise MT-Bench scores (0-10) of six models: GPT-4, Claude-v1, GPT-3.5-turbo, Vicuna-13B, Alpaca-13B, LLaMA-13B]
   https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard

15. MT-Bench: example multi-turn questions [Zheng+ 2023]
   From the paper: "... by combining the existing capability-based benchmarks and the new preference-based benchmarks with LLM-as-a-judge, one can swiftly and automatically evaluate both the core capabilities and human alignment of models. We publicly release 80 MT-bench questions, 3K expert votes, and 30K conversations with human preferences for future study."
   Table 1: Sample multi-turn questions in MT-bench
   • Writing: (1st turn) "Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions." (2nd turn) "Rewrite your previous response. Start every sentence with the letter A."
   • Math: (1st turn) "Given that f(x) = 4x³ − 9x − 14, find the value of f(2)." (2nd turn) "Find x such that f(x) = 0."
   • Knowledge: (1st turn) "Provide insights into the correlation between economic indicators such as GDP, inflation, and unemployment rates. Explain how fiscal and monetary policies ..." (2nd turn) "Now, explain them again like I'm five."

16. Prompt for absolute-score evaluation (a sketch of filling and parsing this template follows below)

   [System]
   Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question. Your evaluation should consider correctness and helpfulness. You will be given a reference answer and the assistant's answer. Your evaluation should focus on the assistant's answer to the second question. Begin your evaluation by comparing the assistant's answer with the reference answer. Identify and correct any mistakes. Be as objective as possible. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: "[[rating]]", for example: "Rating: [[5]]".

   <|The Start of Reference Answer|>
   ### User: {question_1}
   ### Reference answer: {ref_answer_1}
   ### User: {question_2}
   ### Reference answer: {ref_answer_2}
   <|The End of Reference Answer|>

   <|The Start of Assistant A's Conversation with User|>
   ### User: {question_1}
   ### Assistant A: {answer_1}
   ### User: {question_2}
   ### Assistant A: {answer_2}
   <|The End of Assistant A's Conversation with User|>

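A small sketch of how a template like the one above might be used: fill in the two-turn conversation, send it to the judge model, and parse the "Rating: [[n]]" pattern the prompt demands. The template here is an abbreviated stand-in for the full prompt, and `call_judge` is a placeholder for an actual GPT-4 API call; only the filling and parsing steps are shown.

```python
import re

# Abbreviated stand-in for the full judge prompt shown above.
JUDGE_TEMPLATE = (
    "[System] Please act as an impartial judge ... rate the response on a scale of 1 to 10 "
    'by strictly following this format: "[[rating]]", for example: "Rating: [[5]]".\n\n'
    "### User: {question_1}\n### Reference answer: {ref_answer_1}\n"
    "### User: {question_2}\n### Reference answer: {ref_answer_2}\n\n"
    "### Assistant A: {answer_1}\n### Assistant A: {answer_2}\n"
)

def parse_rating(judge_output: str):
    """Extract the numeric rating from a '[[...]]' pattern, or None if absent."""
    m = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", judge_output)
    return float(m.group(1)) if m else None

def score_answer(call_judge, **fields):
    prompt = JUDGE_TEMPLATE.format(**fields)
    return parse_rating(call_judge(prompt))

if __name__ == "__main__":
    dummy_judge = lambda prompt: "The answer is mostly correct but misses a step. Rating: [[7]]"
    print(score_answer(dummy_judge,
                       question_1="q1", ref_answer_1="r1",
                       question_2="q2", ref_answer_2="r2",
                       answer_1="a1", answer_2="a2"))  # -> 7.0
```
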
17. Issues with automatic evaluation by LLMs [Zheng+ 2023]
   • Position bias: the judge tends to rate the answer presented first as better → evaluate with both presentation orders, A-B and B-A (see the sketch below)
   • Name bias: the judge tends to rate "Assistant A" as stronger than "Assistant B"
   • Verbosity bias (length bias): the judge tends to rate longer answers as better
   • Self-enhancement bias: the judge tends to rate its own generations as better
   • (Limited mathematical and reasoning ability)

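The position-bias mitigation in the first bullet can be made concrete in a few lines: run the pairwise judge in both presentation orders and only award a win when the two verdicts agree; otherwise call it a tie. `judge` is a placeholder that returns "first" or "second" for whichever displayed answer it prefers, and treating disagreement as a tie is one common convention rather than the only possible rule.

```python
def debiased_compare(judge, answer_a, answer_b):
    """Compare two answers with both presentation orders to cancel position bias."""
    v1 = judge(answer_a, answer_b)   # order A-B
    v2 = judge(answer_b, answer_a)   # order B-A
    if v1 == "first" and v2 == "second":
        return "A"                   # A preferred regardless of position
    if v1 == "second" and v2 == "first":
        return "B"                   # B preferred regardless of position
    return "tie"                     # inconsistent verdicts are treated as a tie

if __name__ == "__main__":
    # A maximally position-biased dummy judge: it always prefers whatever is shown first.
    biased_judge = lambda first, second: "first"
    print(debiased_compare(biased_judge, "answer A", "answer B"))  # -> "tie"
```
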
18. Constraints on evaluation with GPT-4
   • According to OpenAI's Terms of Use:
     "2. Usage Requirements (c) Restrictions. You may not (i) use the Services in a way that infringes, misappropriates or violates any person's rights; (ii) reverse assemble, reverse compile, decompile, translate or otherwise attempt to discover the source code or underlying components of models, algorithms, and systems of the Services (except to the extent such restrictions are contrary to applicable law); (iii) use output from the Services to develop models that compete with OpenAI; (iv) ..."
   • In other words, developers of LLMs (that compete with OpenAI) may not use GPT-4 output, i.e., its evaluation results

19. Recent Japanese leaderboards and benchmarks

   Benchmark | Own questions | # questions | Task type | Evaluation
   lm-evaluation-harness | no | - | classification, generation | automatic
   Nejumi | no | - | classification | automatic
   Rakuda | yes | 40 | generation | automatic
   Japanese VicunaQA | yes | 80 | generation | automatic
   Japanese MT-Bench | yes | 80 | generation | automatic
   ELYZA-tasks-100 | yes | 100 | generation | human, (automatic)

20. lm-evaluation-harness (Stability AI)
   • Japanese version of EleutherAI/lm-evaluation-harness
   • Supported datasets: JGLUE (JCommonsenseQA, JNLI, MARC-ja, JSQuAD), JAQKET v2, XLSum (ja), XWinograd (ja), MGSM
   • Evaluated few-shot (2- or 3-shot)
   • Ranking (leaderboard) by average accuracy
   https://github.com/Stability-AI/lm-evaluation-harness/

21. Nejumi (Weights & Biases)
   • JGLUE as a leaderboard: MARC-ja, JNLI, JSQuAD, JCommonsenseQA
   • Evaluated zero-shot
   • Ranking by average accuracy
   • Differences from the Stability AI lm-evaluation-harness, as described in the Nejumi article (contrasted in the sketch below):
     • For the MARC-ja, JNLI, and JCommonsenseQA tests, Stability AI's evaluation takes a classifier-like approach that answers with the highest-log-likelihood candidate among the given choices, so irrelevant answers, format errors, and spelling mistakes cannot occur; Nejumi instead lets the model generate freely over the full vocabulary, so it cannot score unless it overcomes these.
     • For the JSQuAD test, Stability AI's evaluation tells the model the number of tokens in the gold answer and has it output exactly that many, whereas in Nejumi's evaluation the model has to stop by itself.
   https://note.com/wandb_jp/n/n2464e3d85c1a
   https://wandb.me/nejumi

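The quoted difference can be seen side by side in the sketch below: one style restricts the answer to the given choices by log-likelihood, the other accepts whatever string the model generates. Both `loglikelihood` and `generate` are stand-ins for real model calls, and the dummy implementations exist only to make the contrast concrete.

```python
def answer_by_loglikelihood(prompt, choices, loglikelihood):
    """Classifier-style scoring: the answer is always one of the valid labels."""
    return max(choices, key=lambda c: loglikelihood(prompt, c))

def answer_by_generation(prompt, generate):
    """Free-form generation: format errors and misspellings count against the model."""
    return generate(prompt).strip()

if __name__ == "__main__":
    choices = ["entailment", "contradiction", "neutral"]
    gold = "entailment"
    dummy_ll = lambda p, c: {"entailment": -1.0, "contradiction": -4.0, "neutral": -2.0}[c]
    dummy_gen = lambda p: " entailmnet "   # free-form output with a spelling error
    print(answer_by_loglikelihood("premise ... hypothesis ...", choices, dummy_ll) == gold)  # True
    print(answer_by_generation("premise ... hypothesis ...", dummy_gen) == gold)             # False
```
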
22. Rakuda (YuzuAI)
   • 40 manually written questions about Japanese geography, politics, history, and society
   • Automatic evaluation (GPT-4)
   • Pairwise comparison (both presentation orders)
   • Scored with Bradley-Terry strengths (a refinement of Elo ratings) to build the leaderboard (a fitting sketch follows below)
   https://yuzuai.jp/benchmark

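A hedged sketch of fitting Bradley-Terry strengths from pairwise judge verdicts, in the spirit of what Rakuda describes: unlike sequential Elo updates, all comparisons are fit jointly. This uses the standard minorization-maximization (MM) iteration and simply ignores ties; it is an illustration, not Rakuda's actual implementation.

```python
from collections import defaultdict

def bradley_terry(comparisons, iters=200):
    """comparisons: list of (winner, loser) pairs of model names."""
    wins = defaultdict(float)      # total wins per model
    games = defaultdict(float)     # games played per unordered pair
    models = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    strengths = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            denom = sum(games[frozenset((i, j))] / (strengths[i] + strengths[j])
                        for j in models if j != i)
            updated[i] = wins[i] / denom if denom > 0 else strengths[i]
        total = sum(updated.values())
        strengths = {m: s / total for m, s in updated.items()}  # normalize each round
    return dict(sorted(strengths.items(), key=lambda kv: -kv[1]))

if __name__ == "__main__":
    verdicts = [("gpt-4", "model-a"), ("gpt-4", "model-b"),
                ("model-a", "model-b"), ("model-b", "model-a")]
    print(bradley_terry(verdicts))   # gpt-4 gets the highest strength
```
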
23. Japanese VicunaQA (Kyoto University)
   • 80 questions on generic topics, knowledge, roleplay, common sense, Fermi estimation, counterfactuals, coding, math, and writing
   • A translation of the 80 Vicuna Eval questions, the predecessor of MT-Bench
   • Automatic evaluation (GPT-4)
   • Win rate computed from pairwise comparisons (both presentation orders)
   • Examples:
     • "What are some good ways to cope with stress?"
     • "What factors would you consider when designing an inclusive and accessible public transportation system?"
     • "If you were a pirate captain, what would you say to your crew to motivate them to hunt for treasure?"

24. Japanese MT-Bench (Stability AI)
   • 80 questions probing multi-turn conversation ability and instruction following
   • 8 categories (10 questions each, 2 turns): writing, roleplay, reasoning, math, coding, extraction, knowledge I (STEM), knowledge II (humanities/social science)
   • MT-Bench translated and adapted to fit Japanese culture
   • Automatic evaluation (GPT-4), absolute scores (1-10)
   • Evaluation run by shi3z: https://note.com/shi3zblog/n/n6b2ac5874021

25. Example questions from Japanese MT-Bench
   • "Write a guide on business email etiquette for new employees. Include the correct use of honorific language (keigo) and points to watch out for in Japanese business culture."
     2nd turn: "Objectively evaluate the guide you just wrote and point out anything that could be improved."
   • "Role-play as Nobita from Doraemon and start a conversation. Begin with the following question: 'After washing your hands, do you think an air dryer is necessary?'"
     2nd turn: "Let's eat out together in town. Shall we take the bus there together?"
   • "To your left you can see a beautiful red house, to your right a fantastical greenhouse, and in front of you an attractive pink place. So, where is the white house?"
     2nd turn: "Does the original question contain any clue that definitively determines the location of the white house?"

26. ELYZA-tasks-100 (ELYZA)
   • 100 questions involving complex instructions and tasks
   • Comes with example answers and grading criteria
   • Mainly human evaluation (5-point scale, 3 raters); an evaluation results sheet is published
   • Example questions:
     • "List five ideas for getting your enthusiasm for work back."
     • "Read the following texts and rate how angry each writer is on a scale of 1 to 10 (1 = not angry, 10 = extremely angry): 1. 'Failed the test again? You really never ...' 2. 'Failed the test? It was a tough one this time.'"
     • "Reply to the following email: 'Hello. I am not feeling well today, so I will arrive a little later than planned. I expect to be there shortly after 1 p.m. at the latest. I apologize for the inconvenience and appreciate your understanding.'"

27. Perspectives on LLM evaluation
   • Seen/Unseen: whether the model has been trained on the task with supervision
     • Traditional benchmarks such as GLUE assume a seen setting
     • Most recent leaderboards implicitly assume an unseen (zero-/few-shot) setting
   • Contamination: the evaluation data may have leaked into the training data (a crude overlap check is sketched below)
     • cf. "Catch me if you can! How to beat GPT-4 with a 13B model" [blog]
   • Task type: classification (understanding) vs. generation
   • Evaluation method: automatic vs. human
     • Classification tasks: automatic evaluation
     • Generation tasks: both (automatic evaluation of generation mostly relies on GPT-4, though ...)
   • Model type: training method (pretrained, fine-tuned (SFT), RLHF), number of parameters, training languages

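As a crude illustration of the contamination concern flagged above, the sketch below measures n-gram overlap between an evaluation example and a set of training documents. Real contamination analyses are considerably more involved (tokenization, normalization, scale); the whitespace tokenization and the choice of n here are arbitrary simplifications.

```python
def ngrams(text, n=8):
    tokens = text.split()                       # naive whitespace tokenization
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(eval_example, training_docs, n=8):
    """Fraction of the example's n-grams that also appear in the training documents."""
    eval_grams = ngrams(eval_example, n)
    if not eval_grams:
        return 0.0
    train_grams = set().union(*(ngrams(doc, n) for doc in training_docs))
    return len(eval_grams & train_grams) / len(eval_grams)

if __name__ == "__main__":
    example = "the tokaido shinkansen is operated by jr central between tokyo and shin osaka"
    corpus = [example + " and through service is run with the sanyo shinkansen"]
    print(f"overlapping 5-gram ratio: {contamination_rate(example, corpus, n=5):.2f}")  # 1.00 -> likely seen
```
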
28. Issues with current Japanese LLM evaluation
   • Evaluating only classification (understanding) tasks such as JGLUE gives a one-sided picture
   • Issues with existing generation datasets
     • Few large-scale datasets: news summarization XLSum (ja) [Hasan+ 2021], daily-dialogue corpus Japanese Daily Dialogue [Akama+ 2023]
   • Issues with generation questions built for LLM evaluation
     • Only a few dozen to about 100 questions per dataset
     • Evaluation is either manual or automatic via GPT-4

29. Toward benchmarks suited to LLM evaluation
   • Expand classification (understanding) datasets
     • Japanese translation of MMLU (Kawahara Lab at Waseda University, RIKEN AIP)
     • llm-jp-eval and jaster activities (the LLM-jp study group)
   • Expand generation datasets
     • Automatic evaluation is indispensable, but we want to avoid relying on GPT-4 as the judge
     • Build an evaluator by annotating good/bad generations and fine-tuning on them (cf. BLEURT [Sellam+ 2020], COMET [Rei+ 2020]); a minimal sketch follows below
     • Highly open-ended tasks such as chit-chat are hard to evaluate (automatically); summarization and QA are candidates (JGLUE v2)
     • Considering allowing in-house corporate text, in addition to open text, as annotation targets
   • Rethink the evaluation setting
     • Unseen setting? Few-shot setting?
     • Dealing with prompt sensitivity

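A minimal sketch of the fine-tuned-evaluator idea mentioned above (in the spirit of BLEURT/COMET): a regression head on top of a pretrained encoder is trained to map (input, generated output) pairs to annotated quality scores. The checkpoint name is a placeholder, the two training pairs are toy data, and the loop is deliberately bare; a usable evaluator needs a real annotation set and proper training.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "bert-base-multilingual-cased"   # placeholder; a Japanese encoder would be used in practice
tokenizer = AutoTokenizer.from_pretrained(MODEL)
# num_labels=1 with float labels gives a regression head trained with MSE loss.
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=1)

pairs = [("質問: 日本の首都は？", "東京です。"),          # good generation
         ("質問: 日本の首都は？", "カレーが好きです。")]    # bad generation
scores = torch.tensor([[5.0], [1.0]])                      # annotated quality on a 1-5 scale

enc = tokenizer([src for src, _ in pairs], [out for _, out in pairs],
                padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):                        # a few toy epochs on the toy data
    loss = model(**enc, labels=scores).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.eval()
with torch.no_grad():
    predicted = model(**enc).logits.squeeze(-1)   # predicted quality scores
print(predicted.tolist())
```
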
30. Toward benchmarks suited to LLM evaluation: looking further ahead
   [Figure 1 of [Guo+ 2023] (Awesome-LLMs-Evaluation-Papers): a proposed taxonomy of LLM evaluation, with major categories such as knowledge and capability (question answering, tool learning, reasoning, knowledge completion), alignment (ethics and morality, bias, toxicity, truthfulness), safety (robustness evaluation, risk evaluation), specialized LLMs (biology and medicine, education, legislation, computer science, finance), and evaluation organization (benchmarks for NLU and NLG, for knowledge and reasoning, and for holistic evaluation)]