Building JGLUE and the Future of Japanese LLM Evaluation

Keisuke Kamata
November 15, 2023

Transcript

  1. Building JGLUE and the Future of Japanese LLM Evaluation
     Daisuke Kawahara
     Waseda University
     W&B Tokyo Meetup #8 (2023/11/15)

  2. Progress of Large Language Models (LLMs)
     https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/

  3. Classification of LLMs
     • Encoder-decoder (e.g., T5): attention in the source language, attention
       in the target language, and attention between the source and target
       languages.
       [Figure from https://jalammar.github.io/illustrated-transformer/: a
       6-layer encoder and 6-layer decoder translating "je suis étudiant" into
       "I am a student"; the input is a sentence in the source language, the
       output is the next word in the target language, conditioned on the words
       previously generated.]
     • Encoder-only (BERT family)
     • Decoder-only (GPT family)

  4. Language understanding benchmark: GLUE
     General Language Understanding Evaluation [Wang+ 2018]

     Task   Description
     SST-2  Sentiment analysis of movie reviews (positive/negative)
     CoLA   Whether a sentence is linguistically acceptable
     MRPC   Whether two sentences have the same meaning
     STS-B  Similarity of two sentences (1-5)
     QQP    Whether two questions have the same meaning
     MNLI   Entailment recognition between two sentences
            (entailment/contradiction/neutral)
     QNLI   Whether a sentence contains the answer to a question (from SQuAD)
     RTE    Whether one sentence entails another
     WNLI   Whether one sentence entails another (Winograd Schema Challenge)

  5. GLUE Leaderboard (as of November 2019)
     Human performance: 87.1
     BERT: 80.5
     Baseline (ELMo): 70.0
     T5: 89.7

  6. Japanese language understanding benchmark: JGLUE
     [Kurihara+ 2022] [Kurihara+ 2023]
     Kentaro Kurihara, Tomohide Shibata, Daisuke Kawahara

  7. Background (1/2)
     • A benchmark like GLUE [Wang+ 2018] is indispensable for comprehensive
       evaluation and analysis of LLMs
     • Benchmarks have also been built for languages other than English
       • French FLUE [Le+ 2020], Chinese CLUE [Xu+ 2020],
         Korean KLUE [Park+ 2021], ...
     → We built JGLUE, a Japanese language understanding benchmark

     Building harder benchmarks ⇄ improving LLM performance

  8. Background (2/2)
     • Issues with existing Japanese datasets
       1. Translated data (e.g., JSNLI [Yoshikoshi+ 2020], JSICK [Yanaka+ 2021])
          • Unnatural Japanese produced by machine or human translation
          • Regional and cultural gaps with Japan (e.g., many texts about
            American place names, politicians, etc.)
          → JGLUE is built from scratch in Japanese
       2. Domain-specific data
          • e.g., JRTE [Hayashibe+ 2020]: hotel reviews
          → JGLUE is built on general domains
     • By building JGLUE, a Japanese language understanding benchmark, we aim
       to promote language understanding research

  9. Composition of JGLUE
     • Designed to broadly cover the tasks of GLUE and SuperGLUE
     • Built using Yahoo! Crowdsourcing

     Task                           Dataset                  train     dev    test
     Text classification            MARC-ja                187,528   5,654   5,639
                                    JCoLA [Someya+ 2022]         -       -       -
     Sentence-pair classification   JSTS                    12,451   1,457   1,589
                                    JNLI                    20,073   2,434   2,508
     QA                             JSQuAD                  62,859   4,442   4,420
                                    JCommonsenseQA           8,939   1,119   1,118

  10. JGLUE data examples
      MARC-ja
      • "The color and the fit are superb. I use them for boat fishing in
        summer." → positive
      • "What's with this 4%? The card data won't transfer, and checking with a
        command just gives errors. Useless." → negative
      • "If you simply want to know the week's weather, this is enough. But is
        this really worth paying for?" → label changed from positive to
        negative based on the crowdsourced answers (positive: 0, negative: 10)

      JSTS/JNLI
      • Sentence: "A large bus is driving down a street in town." /
        Sentence: "A large bus is driving down a road."
        → similarity: 4.4, inference relation: entailment
      • Sentence: "Dishes are laid out on the table." /
        Sentence: "There are half-eaten dishes on the table."
        → similarity: 3.0, inference relation: neutral
      • Sentence: "A baseball player is swinging a bat." /
        Sentence: "A baseball player is playing catch."
        → similarity: 2.0, inference relation: contradiction

  11. JGLUE data examples

      JSQuAD
      [Title] Tokaido Shinkansen
      "With the breakup and privatization of JNR on April 1, 1987 (Showa 62),
      JR Central took over its operation. Through service is operated with the
      Sanyo Shinkansen, which was taken over by West Japan Railway Company
      (JR West), so rolling stock owned by JR West is sometimes used even on
      trains that run only within the Tokaido Shinkansen section. As of March
      2020 (Reiwa 2), the travel time between Tokyo Station and Shin-Osaka
      Station is at best 2 hours 21 minutes, with trains running at a top speed
      of 285 km/h."
      Question: In 2020, what was the fastest travel time between Tokyo and
      Shin-Osaka?
      Answer: 2 hours 21 minutes
      Question: Which line runs through service with the Tokaido Shinkansen?
      Answer: The Sanyo Shinkansen

      JCommonsenseQA
      Question: What do you call the top executive of a company?
      Choices: teacher, department manager, company president, subordinate,
      part-time worker
      Question: What utensil do you use to drink soup?
      Choices: spoon, menu, plate, fork, chopsticks

  12. Construction flow of JSTS and JNLI
      [Figure: starting from image captions (e.g., "A blue car is driving",
      "A blue car is driving along the coast"), caption pairs from the same
      image are annotated with similarity scores (JSTS-A; e.g., scores 3.8 and
      4.5) and with inference relations such as entailment/neutral (JNLI-A);
      pairs drawn from different images are annotated with similarity (JSTS-B;
      e.g., "A man lit by the sunset..." vs. "A white dog...", score 1.2);
      contradiction sentences are written by hand (e.g., "lit by the sunset" →
      "lit by the moonlight") and annotated with similarity (JSTS-C) and with
      the contradiction label (JNLI-C).]
      Image credits: Irasutoya (https://www.irasutoya.com/),
      ONWA Illust (https://onwa-illust.com/)

  13. How each task is answered
      • Single-sentence classification (MARC-ja): encode
        "[CLS] この PC は 丈夫 ##で 軽い ... [SEP]" and classify from the [CLS]
        representation (e.g., positive)
      • Sentence-pair classification/regression (JSTS, JNLI): encode
        "[CLS] sentence1 [SEP] sentence2 [SEP]" and predict the label
        (e.g., entailment) or the similarity score
      • Span extraction (JSQuAD): encode "[CLS] question [SEP] passage [SEP]"
        and predict the start/end positions of the answer span (e.g., 東京)
      • Multiple choice (JCommonsenseQA): encode
        "[CLS] question [SEP] choice_k [SEP]" separately for each of the five
        choices, compute a score for each (score1 ... score5), and take a
        softmax over the five scores
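    For concreteness, the following is a minimal sketch of the multiple-choice
    scheme above using Hugging Face Transformers. The model name is only an
    example, and the freshly initialized multiple-choice head would first need
    fine-tuning on JCommonsenseQA for the prediction to be meaningful.

    ```python
    import torch
    from transformers import AutoModelForMultipleChoice, AutoTokenizer

    model_name = "cl-tohoku/bert-base-japanese-whole-word-masking"  # example only
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMultipleChoice.from_pretrained(model_name)  # head needs fine-tuning

    question = "スープを飲む時に使う道具は?"
    choices = ["スプーン", "メニュー", "皿", "フォーク", "はし"]

    # Encode the question paired with each of the five choices.
    enc = tokenizer([question] * len(choices), choices, padding=True,
                    return_tensors="pt")
    # The model expects tensors of shape (batch, num_choices, seq_len).
    inputs = {k: v.unsqueeze(0) for k, v in enc.items()}

    with torch.no_grad():
        logits = model(**inputs).logits   # shape (1, num_choices)
    probs = logits.softmax(dim=-1)[0]     # softmax over the five per-choice scores
    print(choices[probs.argmax().item()])
    ```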

  14. Experimental results (dev set)
      https://github.com/yahoojapan/JGLUE#baseline-scores

  15. JCommonsenseQA 2.0: improving the commonsense reasoning dataset through
      human-computer collaboration [Kurihara+ 2023]

      V1 question: What stationery item is popular as a present?
      V1 choices: fountain pen, outdoors, moon, cup, rice bowl
      V2, after adversarial distractor generation:
      choices: fountain pen, pencil, origami, cup, rice bowl
      V2, after question rewriting: What do you write with by refilling its ink?

      Accuracy                    V1      V2 (distractor   V2 (question
                                          generation)      rewriting)
      Human                       0.988   0.997            0.996
      Tohoku Univ. BERT-base      0.782   0.571            0.678
      Tohoku Univ. BERT-large     0.822   0.617            0.736
      Waseda Univ. RoBERTa-base   0.849   0.551            0.672
      Waseda Univ. RoBERTa-large  0.901   0.807            0.865

      V1 → V2: building a harder benchmark ⇄ improving LLM performance

  16. Progress of decoder-based generative LLMs
      [awesome-japanese-llm]

  17. Benchmarks in the generative-LLM era: English
      MMLU, lm-evaluation-harness, Open LLM Leaderboard, AlpacaEval,
      Chatbot Arena, MT-Bench

  18. MMLU: Measuring Massive Multitask Language Understanding [Hendrycks+ 2021]
      • Four-choice questions across 57 subjects, including mathematics,
        physics, law, and history
      • Includes questions from the GRE, the US medical licensing exam, etc.
      • Typically answered and evaluated few-shot

      Examples (from the paper):
      Microeconomics: One of the reasons that the government discourages and
      regulates monopolies is that
      (A) producer surplus is lost and consumer surplus is gained.
      (B) monopoly prices ensure productive efficiency but cost society
          allocative efficiency.
      (C) monopoly firms do not engage in significant research and development.
      (D) consumer surplus is lost with higher prices and lower levels of output.

      Conceptual Physics: When you drop a ball from rest it accelerates downward
      at 9.8 m/s². If you instead throw it downward assuming no air resistance,
      its acceleration immediately after leaving your hand is
      (A) 9.8 m/s²
      (B) more than 9.8 m/s²
      (C) less than 9.8 m/s²
      (D) Cannot say unless the speed of throw is given.

      College Mathematics: In the complex z-plane, the set of points satisfying
      the equation z² = |z|² is a
      (A) pair of points
      (B) circle
      (C) half-line
      (D) line
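    As a sketch of what "answered few-shot" means here: k solved questions are
    concatenated in front of the test question, the causal LM is asked for the
    next token, and the answer letter with the highest probability wins. The
    prompt format and model below are illustrative, not the official MMLU
    evaluation code.

    ```python
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")        # placeholder model
    lm = AutoModelForCausalLM.from_pretrained("gpt2")

    def answer(fewshot: str, question: str) -> str:
        """fewshot: k solved examples, each ending in 'Answer: X'.
        question: the test question, ending in 'Answer:'."""
        ids = tok(fewshot + question, return_tensors="pt").input_ids
        with torch.no_grad():
            next_logits = lm(input_ids=ids).logits[0, -1]   # next-token logits
        letters = ["A", "B", "C", "D"]
        # Compare the model's scores for the four answer letters.
        scores = torch.stack([next_logits[tok.encode(" " + l)[0]] for l in letters])
        return letters[scores.argmax().item()]
    ```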

  19. lm-evaluation-harness (EleutherAI)
      • Evaluates generative LLMs uniformly across 200+ datasets
      • ARC, BIG-Bench, BLiMP, CrowS-Pairs, DROP, LAMBADA, MGSM, MMLU,
        PAWS-X, QNLI, SQuAD v2, SWAG, TruthfulQA, XCOPA, XWinograd, ...
      https://github.com/EleutherAI/lm-evaluation-harness
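    A minimal sketch of driving the harness from Python follows; the entry
    point, backend names, and task names vary across harness versions, so
    treat the arguments as illustrative.

    ```python
    import lm_eval

    # Evaluate a Hugging Face model on two tasks, zero-shot, and print metrics.
    results = lm_eval.simple_evaluate(
        model="hf",                         # Hugging Face backend
        model_args="pretrained=gpt2",       # placeholder model
        tasks=["lambada_openai", "hellaswag"],
        num_fewshot=0,
    )
    print(results["results"])
    ```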

  20. Open LLM Leaderboard (Hugging Face)
      • Uses lm-evaluation-harness
      • Ranking by the average score across the tasks
      • Models can be filtered
      https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

  21. AlpacaEval
      • 805 diverse questions (generation tasks)
      • Automatic evaluation (GPT-4, Claude)
      • Ranking by win rate based on pairwise comparison
      • Compared against text-davinci-003
      https://tatsu-lab.github.io/alpaca_eval/

  22. Chatbot Arena (LMSYS)
      • Human evaluation (crowdsourced)
      • Ranking by Elo ratings computed from pairwise comparisons
      https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard
      https://lmsys.org/blog/2023-05-03-arena/
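    As a worked example of the Elo ranking: after each battle the winner takes
    rating points from the loser in proportion to how unexpected the result
    was. This is the standard Elo update; Chatbot Arena's exact K-factor and
    aggregation details may differ.

    ```python
    def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
        """score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
        expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
        r_a_new = r_a + k * (score_a - expected_a)
        r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
        return r_a_new, r_b_new

    # Two models start at 1000; A beats B once.
    print(elo_update(1000.0, 1000.0, 1.0))   # -> (1016.0, 984.0)
    ```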

  23. MT-Bench (LMSYS) [Zheng+ 2023]
      • Automatic evaluation (GPT-4) or human evaluation
      • Absolute scores (1-10) or pairwise comparison
      • 80 questions probing multi-turn dialogue ability and
        instruction-following ability
      • 8 categories (10 questions each, 2 turns): writing, roleplay,
        reasoning, math, coding, extraction, knowledge I (STEM),
        knowledge II (humanities/social science)
      [Figure 20 of the paper: category-wise scores of GPT-4, Claude-v1,
      GPT-3.5-turbo, Vicuna-13B, Alpaca-13B, and LLaMA-13B on MT-bench.]
      https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard

  24. MT-Bench: example multi-turn questions [Zheng+ 2023]
      Table 1: Sample multi-turn questions in MT-bench.
      Writing
        1st turn: Compose an engaging travel blog post about a recent trip to
                  Hawaii, highlighting cultural experiences and must-see
                  attractions.
        2nd turn: Rewrite your previous response. Start every sentence with
                  the letter A.
      Math
        1st turn: Given that f(x) = 4x³ − 9x − 14, find the value of f(2).
        2nd turn: Find x such that f(x) = 0.
      Knowledge
        1st turn: Provide insights into the correlation between economic
                  indicators such as GDP, inflation, and unemployment rates.
                  Explain how fiscal and monetary policies ...
        2nd turn: Now, explain them again like I'm five.

  25. Prompt for absolute-score evaluation

      [System]
      Please act as an impartial judge and evaluate the quality of the response
      provided by an AI assistant to the user question. Your evaluation should
      consider correctness and helpfulness. You will be given a reference
      answer and the assistant's answer. Your evaluation should focus on the
      assistant's answer to the second question. Begin your evaluation by
      comparing the assistant's answer with the reference answer. Identify and
      correct any mistakes. Be as objective as possible. After providing your
      explanation, you must rate the response on a scale of 1 to 10 by strictly
      following this format: "[[rating]]", for example: "Rating: [[5]]".

      <|The Start of Reference Answer|>
      ### User:
      {question_1}
      ### Reference answer:
      {ref_answer_1}
      ### User:
      {question_2}
      ### Reference answer:
      {ref_answer_2}
      <|The End of Reference Answer|>

      <|The Start of Assistant A's Conversation with User|>
      ### User:
      {question_1}
      ### Assistant A:
      {answer_1}
      ### User:
      {question_2}
      ### Assistant A:
      {answer_2}
      <|The End of Assistant A's Conversation with User|>
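    The judge is forced into the strict "Rating: [[5]]" format precisely so
    that the verdict can be extracted mechanically. A small sketch of that
    extraction step (the judge call itself is omitted):

    ```python
    import re

    def parse_rating(judgment: str) -> float | None:
        """Pull the numeric score out of a judge response like 'Rating: [[7]]'."""
        m = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", judgment)
        return float(m.group(1)) if m else None

    print(parse_rating("The answer is mostly correct ... Rating: [[7]]"))  # 7.0
    ```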

  26. Issues with LLM-based automatic evaluation [Zheng+ 2023]
      • Position bias
        • The judge favors the response presented first
        → evaluate with both presentation orders, A-B and B-A
      • Name bias
        • The judge rates "Assistant A" as stronger than "Assistant B"
      • Verbosity bias (length bias)
        • The judge favors longer responses
      • Self-enhancement bias
        • The judge LLM favors its own generations
      • (Limited mathematical and reasoning ability)
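    A sketch of the A-B/B-A mitigation above: judge each pair in both
    presentation orders and keep the verdict only when the two runs agree; one
    conservative variant treats inconsistent verdicts as ties. Here `judge` is
    a placeholder for a judge-LLM call that returns "A", "B", or "tie" for the
    pair as presented.

    ```python
    from typing import Callable

    def debiased_verdict(judge: Callable[[str, str], str],
                         answer_1: str, answer_2: str) -> str:
        first = judge(answer_1, answer_2)    # answer_1 is shown as "A"
        swapped = judge(answer_2, answer_1)  # presentation order reversed
        # Map the swapped verdict back to the original labeling.
        swapped = {"A": "B", "B": "A", "tie": "tie"}[swapped]
        if first == swapped:
            return first                     # consistent under both orders
        return "tie"                         # inconsistent -> treat as a tie
    ```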

  27. Restrictions on evaluation with GPT-4
      • According to OpenAI's Terms of Use...
        2. Usage Requirements
        (c) Restrictions
        You may not (i) use the Services in a way that infringes,
        misappropriates or violates any person's rights; (ii) reverse
        assemble, reverse compile, decompile, translate or otherwise attempt
        to discover the source code or underlying components of models,
        algorithms, and systems of the Services (except to the extent such
        restrictions are contrary to applicable law); (iii) use output from
        the Services to develop models that compete with OpenAI; (iv) ...
      • Developers of LLMs (that compete with OpenAI) must not use GPT-4's
        output, i.e., its evaluation results

  28. Benchmarks in the generative-LLM era: Japanese

  29. Recent Japanese leaderboards and benchmarks

      Benchmark              New questions  #Questions  Task type                  Evaluation
      lm-evaluation-harness  ×              -           classification/generation  automatic
      Nejumi                 ×              -           classification             automatic
      Rakuda                 ○              40          generation                 automatic
      Japanese VicunaQA      ○              80          generation                 automatic
      Japanese MT-Bench      ○              80          generation                 automatic
      ELYZA-tasks-100        ○              100         generation                 human, (automatic)

  30. lm-evaluation-harness (Stability AI)
      • Japanese version of EleutherAI/lm-evaluation-harness
      • Supported datasets: JGLUE (JCommonsenseQA, JNLI, MARC-ja, JSQuAD),
        JAQKET v2, XLSum (ja), XWinograd (ja), MGSM
      • Few-shot evaluation (2- or 3-shot)
      • Ranking by average accuracy (leaderboard)
      https://github.com/Stability-AI/lm-evaluation-harness/

  31. Nejumi (Weights & Biases)
      • JGLUE turned into a leaderboard
        • MARC-ja, JNLI, JSQuAD, JCommonsenseQA
      • Zero-shot evaluation
      • Ranking by average accuracy
      • Differences from lm-evaluation-harness:
        • On the MARC-ja, JNLI, and JCommonsenseQA tests, Stability AI's
          evaluation takes a classifier-like approach that answers with the
          maximum-log-likelihood candidate among the given choices, so
          irrelevant answers, format errors, and misspellings cannot occur;
          Nejumi instead lets the model generate freely over the full
          vocabulary, so it must overcome all of these to score.
        • On the JSQuAD test, Stability AI's evaluation gives the model the
          token count of the gold answer and has it output exactly that many
          tokens, whereas in Nejumi's evaluation the model must stop on its
          own.
      https://note.com/wandb_jp/n/n2464e3d85c1a
      https://wandb.me/nejumi
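    A minimal sketch of the classifier-style scoring described above: score
    each candidate label by the log-likelihood the causal LM assigns to it
    after the prompt and answer with the argmax, so irrelevant answers and
    format errors cannot occur by construction. Model and prompt are
    placeholders.

    ```python
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")        # placeholder model
    lm = AutoModelForCausalLM.from_pretrained("gpt2")

    def choice_loglik(prompt: str, choice: str) -> float:
        """Sum of log-probabilities of the choice tokens, given the prompt."""
        n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
        full = tok(prompt + choice, return_tensors="pt").input_ids
        with torch.no_grad():
            logp = lm(full).logits.log_softmax(dim=-1)
        # Logits at position pos-1 predict the token at position pos.
        return sum(logp[0, pos - 1, full[0, pos]].item()
                   for pos in range(n_prompt, full.shape[1]))

    prompt = "Review: great shoes, very comfortable. Sentiment:"
    labels = [" positive", " negative"]
    print(max(labels, key=lambda c: choice_loglik(prompt, c)))
    ```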

  32. Rakuda (YuzuAI)
      • 40 questions about Japanese geography, politics, history, and society
        (human-written)
      • Automatic evaluation (GPT-4)
      • Pairwise comparison (both presentation orders)
      • Scored with Bradley-Terry strengths (a refinement of Elo ratings) to
        build the leaderboard
      https://yuzuai.jp/benchmark
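    As a sketch of Bradley-Terry strengths: the model assumes
    P(i beats j) = p_i / (p_i + p_j) and fits the strengths p to the observed
    pairwise wins by maximum likelihood, e.g. with the classic
    minorization-maximization update below. Rakuda's exact implementation may
    differ.

    ```python
    import numpy as np

    def bradley_terry(wins: np.ndarray, iters: int = 100) -> np.ndarray:
        """wins[i, j] = number of times model i beat model j."""
        n = wins.shape[0]
        p = np.ones(n)
        games = wins + wins.T                 # matches played for each pair
        for _ in range(iters):
            for i in range(n):
                denom = sum(games[i, j] / (p[i] + p[j])
                            for j in range(n) if j != i)
                p[i] = wins[i].sum() / denom  # MM update for strength of model i
            p /= p.sum()                      # fix the overall scale
        return p

    # Example: model 0 beat model 1 eight times and lost twice, etc.
    wins = np.array([[0, 8, 9],
                     [2, 0, 6],
                     [1, 4, 0]])
    print(bradley_terry(wins))                # higher strength = stronger model
    ```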

  33. Example Rakuda questions
      • Name the places at the northernmost and southernmost points of Japan,
        and state which prefecture each belongs to.
      • Name the politician who was most influential in postwar Japanese
        politics, and describe his or her contributions in detail.
      • Describe the characteristics of the aristocratic society established
        in the Heian period, and discuss how it influenced Japanese culture
        (literature, art, religion, etc.).
      • Describe Japan's "Trinity Reform" and explain its impact on the
        economy.

  34. Japanese VicunaQA (Kyoto University)
      • 80 questions on general topics, knowledge, roleplay, common sense,
        Fermi estimation, counterfactuals, coding, math, and writing
      • A translation of the 80 Vicuna Eval questions, the predecessor of
        MT-Bench
      • Automatic evaluation (GPT-4)
      • Win rate computed from pairwise comparisons (both presentation orders)
      • Examples
        • What are some good ways to cope with stress?
        • What factors would you consider when designing an inclusive and
          accessible public transportation system?
        • If you were a pirate captain, what would you say to your crew to
          motivate them to search for treasure?

  35. Japanese MT-Bench (Stability AI)
      • 80 questions probing multi-turn dialogue ability and
        instruction-following ability
      • 8 categories (10 questions each, 2 turns): writing, roleplay,
        reasoning, math, coding, extraction, knowledge I (STEM),
        knowledge II (humanities/social science)
      • MT-Bench translated and revised to fit Japanese culture
      • Automatic evaluation (GPT-4)
      • Absolute scores (1-10)
      Evaluation run by shi3z: https://note.com/shi3zblog/n/n6b2ac5874021

  36. Example Japanese MT-Bench questions
      • Write a guide for new employees on business-email etiquette. Include
        the correct use of honorifics and points to watch in Japanese business
        culture.
        • Evaluate the guide you wrote objectively and point out anything that
          could be improved.
      • Roleplay as Nobita from Doraemon and start a conversation. Begin with
        the following question: "After washing your hands, do you think an air
        dryer is necessary?"
        • Let's eat together in town. Won't you come with me by bus?
      • To your left you see a beautiful red house, to your right a fantastic
        greenhouse, and in front of you an attractive pink place. So, where is
        the white house?
        • Does the original question contain any clues that definitively
          determine the position of the white house?

  37. ELYZA-tasks-100 (ELYZA)
      • 100 questions involving complex instructions and tasks
      • With example answers and grading criteria
      • Mainly human evaluation (5-point scale, 3 annotators)
      • Evaluation result sheet available
      • Example questions
        • List five ideas for regaining enthusiasm for your work.
        • Read the following text and rate how angry the writer is on a scale
          of 1-10 (1 = not angry, 10 = extremely angry). 1. A failing grade on
          the test again? You really are... 2. A failing grade on the test? It
          was a difficult one this time.
        • Reply to the following email: "Thank you for your work. Due to
          feeling unwell today, my arrival will be a little later than
          planned. I expect to arrive by shortly after 13:00 at the latest. I
          apologize for the inconvenience and ask for your understanding."

  38. The future of Japanese LLM evaluation

  39. Perspectives on LLM evaluation
      • Seen/Unseen: whether supervised training has been done on the task
        • Traditional benchmarks such as GLUE assume the seen setting
        • Recent leaderboards are mostly, implicitly, unseen (zero/few-shot)
      • Contamination
        • The possibility that evaluation data was used in training
        • cf. "Catch me if you can! How to beat GPT-4 with a 13B model" [Blog]
      • Task type: classification (understanding) vs. generation
      • Evaluation method: automatic vs. human
        • Classification tasks are evaluated automatically
        • Generation tasks use both (automatic evaluation of generation mainly
          relies on GPT-4, however...)
      • Model type
        • Training method: pretrained, fine-tuned (SFT), RLHF
        • Number of parameters
        • Training languages

  40. Issues in current Japanese LLM evaluation
      • Evaluating only classification (understanding) tasks such as JGLUE is
        one-sided
      • Issues with current generation datasets
        • Few large-scale datasets exist
          • News article summarization: XLSum (ja) [Hasan+ 2021]
          • Daily dialogue corpus: Japanese Daily Dialogue [Akama+ 2023]
        • Issues with generation questions for LLM evaluation
          • Only a few dozen to ~100 questions per dataset
          • Evaluation is either human or automatic via GPT-4

  41. Toward benchmarks suited to LLM evaluation
      • Expanding classification (understanding) datasets
        • Translating MMLU into Japanese (Waseda Kawahara Lab / RIKEN AIP)
        • llm-jp-eval and jaster activities (LLM-jp study group)
      • Expanding generation datasets
        • Automatic evaluation is indispensable
          • We want to avoid evaluation by GPT-4
          • Annotate good and bad generations and build an evaluator by
            fine-tuning (see the sketch after this list)
          • cf. BLEURT [Sellam+ 2020], COMET [Rei+ 2020]
        • Highly open-ended tasks such as chitchat are hard to evaluate
          (automatically)
          • Summarization and QA are candidates (JGLUE v2)
        • Considering allowing in-company text, in addition to open text, as
          annotation targets
      • Rethinking the evaluation setting
        • Unseen setting? Few-shot setting?
        • Dealing with prompt sensitivity
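    A minimal sketch of the fine-tuned evaluator idea above, in the spirit of
    BLEURT/COMET rather than their actual code: a pretrained encoder with a
    regression head is trained on human-scored generations and then used to
    score new ones. Model name and training data are placeholders.

    ```python
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    model_name = "cl-tohoku/bert-base-japanese-whole-word-masking"  # example only
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=1)            # single regression output
    optim = torch.optim.AdamW(model.parameters(), lr=2e-5)

    # Toy annotations: (source text, system output, human quality score in [0, 1])
    data = [("要約対象の記事...", "良い要約...", 0.9),
            ("要約対象の記事...", "無関係な文...", 0.1)]

    model.train()
    for src, out, score in data:
        enc = tok(src, out, truncation=True, return_tensors="pt")
        pred = model(**enc).logits.squeeze()             # scalar quality estimate
        loss = torch.nn.functional.mse_loss(pred, torch.tensor(score))
        loss.backward()
        optim.step()
        optim.zero_grad()
    # After training, model(**tok(src, hyp, return_tensors="pt")).logits scores
    # a new (source, hypothesis) pair.
    ```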

  42. Toward benchmarks suited to LLM evaluation: going further
      [Figure 1 of [Guo+ 2023] (Awesome-LLMs-Evaluation-Papers): a proposed
      taxonomy of the major categories and sub-categories of LLM evaluation:
      knowledge and capability (question answering, tool learning, reasoning,
      knowledge completion); alignment evaluation (ethics and morality, bias,
      toxicity, truthfulness); safety (robustness evaluation, risk
      evaluation); specialized-LLM evaluation (biology and medicine,
      education, legislation, computer science, finance); and evaluation
      organization (benchmarks for holistic evaluation, benchmarks for
      knowledge and reasoning, benchmarks for NLU and NLG).]