
[Reading group slides] LoRA: Low-Rank Adaptation of Large Language Models

Slides explaining LoRA, a method that matches the performance of fine-tuning the whole model by introducing and training low-rank matrices with only a small number of parameters on top of a frozen pre-trained model, together with the paper behind it. Starting from the historical development of deep-learning-based natural language processing and its surrounding techniques, the deck carefully walks through the background that made LoRA necessary.

Hayato Tsukagoshi

April 18, 2023

Transcript

  1. LoRA: Low-Rank Adaptation of Large Language Models
     Presenter: Hayato Tsukagoshi
     Graduate School of Informatics, Nagoya University, Japan.
     Paper authors: Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen
     ICLR 2022
     https://arxiv.org/abs/2106.09685

  2. Paper overview
     • Proposes a method that fine-tunes only a small subset of parameters on top of a frozen pre-trained model
     • Introduces two low-rank matrices A and B
       • Matrix A: (r × d), matrix B: (d × r)
       • r can be a very small number such as 2
     • Compared with full fine-tuning:
       • Drastically reduces the number of trainable parameters
       • Achieves comparable performance
     • Allows fine-tuning the model without increasing inference cost

  3. Why this paper
     • A method whose adoption has grown rapidly in recent years
       • Awareness shot up thanks to the boom in image and illustration generation
     • A method that could become the de facto standard of the LLM era
       • Application to LLMs is already extremely active
       • e.g. Alpaca and Alpaca-LoRA on top of Meta's LLaMA
     • Also a good opportunity to survey the history of training methods for deep learning models
       • Before BERT: train everything yourself
       • Pre-training → fine-tuning
       • Partial (parameter-efficient) fine-tuning, or few-shot / zero-shot learning

  4. Table of contents
     Background
     • Basics of deep-learning-based natural language processing
     • Attention mechanism, Transformer, BERT, …
     LoRA
     • Problems in fine-tuning large language models and how to solve them
     • Related work
     • Proposed method and experiments
     How to use LoRA
     • PEFT: Parameter-Efficient Fine-Tuning

  5. Table of contents (same as slide 4)

  6. Background

  7. Basics of deep-learning-based natural language processing
     • Language models built with deep learning are the central players
     • "A model of the probability with which words and sentences of natural language are generated"*
       • Recently, though, the term tends to be used loosely to mean "a model that can handle language"
     • The foundation of deep-learning-based NLP
       • Example models: BERT, GPT-2, GPT-3, GPT-4, LLaMA, Alpaca, Vicuna, Dolly, …
       • These are also called foundation models
     How language models are used
     • Convert a sentence into a sequence of word vectors or a single sentence vector
     • Predict the next word or sentence that follows a given text
     *Quoted from the IT Text textbook 『自然言語処理の基礎』 (Fundamentals of Natural Language Processing)

  8. Language models
     • Most recent models are built from the Transformer, which is based on the attention mechanism
     • Many variants exist
     Many other kinds of language models exist as well, e.g. XLNet, ELECTRA, UL2, …
     [Figure: overview of BERT]

  9. Language models (continued from slide 8)
     Autoregressive language models (causal LMs)
     • Trained to predict words from left to right
     • Examples: GPT, GPT-2, GPT-3, …
     Masked language models (masked LMs)
     • Trained by masking parts of the text and predicting them
     • Examples: BERT, RoBERTa, DeBERTa, …
     Many other kinds of language models exist as well, e.g. XLNet, ELECTRA, UL2, …
     [Figure: overview of BERT]
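The two training objectives above can be poked at directly with off-the-shelf checkpoints. The snippet below is a minimal illustration (not from the slides) using the Hugging Face pipeline API; the checkpoint names are just common defaults chosen for the example.

```python
from transformers import pipeline

# Masked LM: BERT fills in a blanked-out token using both left and right context.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The capital of France is [MASK].")[0]["token_str"])  # e.g. "paris"

# Causal LM: GPT-2 continues the text strictly left to right.
generate = pipeline("text-generation", model="gpt2")
print(generate("The capital of France is", max_new_tokens=5)[0]["generated_text"])
```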

  10. Attention mechanism
      • A mechanism that takes a sequence of vectors as input and outputs a sequence of vectors
      • The input is split into Q (Query), K (Key), and V (Value) for the computation
        • K, V: n d-dimensional vectors
        • Q: m d-dimensional vectors
      Figure from Jaegle et al., Perceiver IO: A General Architecture for Structured Inputs & Outputs, ICLR 2022.
      An accessible explanation (in Japanese): 【深層学習】Attention - 全領域に応用され最高精度を叩き出す注意機構の仕組み【ディープラーニングの世界 vol. 24】

  11. Attention mechanism (continued from slide 10)
      • Computes, for each vector in Q, the importance of each vector in K
        • Attention weights: the (m × n) matrix obtained from this computation
      • Self-attention: Q, K, and V are built from the same sequence of vectors (i.e. n = m)
      • Cross-attention: Q and (K, V) are built from different sequences of vectors
      Figure from Jaegle et al., Perceiver IO: A General Architecture for Structured Inputs & Outputs, ICLR 2022.
      An accessible explanation (in Japanese): 【深層学習】Attention - 全領域に応用され最高精度を叩き出す注意機構の仕組み【ディープラーニングの世界 vol. 24】
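As a concrete reference for the shapes mentioned above (Q: m × d, K and V: n × d, attention weights: m × n), here is a minimal scaled dot-product attention sketch in PyTorch. It illustrates the standard formulation and is not code from the slides.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """q: (m, d), k: (n, d), v: (n, d) -> output: (m, d)."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d**0.5   # (m, n) attention scores
    weights = F.softmax(scores, dim=-1)          # each query's weights over the n keys sum to 1
    return weights @ v                           # weighted average of the value vectors

q, k, v = torch.randn(4, 16), torch.randn(6, 16), torch.randn(6, 16)
out = scaled_dot_product_attention(q, k, v)      # shape (4, 16): one output vector per query
```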

  12. Transformer
      • A model architecture built solely from attention mechanisms
        • It did away with the RNNs, LSTMs, and CNNs that had been common in NLP until then
      • A mechanism that takes a sequence of vectors as input and outputs a sequence of vectors
        • Takes interactions between the input vectors into account
      Vaswani et al., Attention Is All You Need, NeurIPS 2017.
      An accessible explanation (in Japanese): 【深層学習】Transformer - Multi-Head Attentionを理解してやろうじゃないの【ディープラーニングの世界 vol. 28】
      [Figure: architecture overview — Encoder / Decoder]

  13. Transformer (continued from slide 12)
      • Two components exist: the Encoder and the Decoder
        • Encoder-only: BERT, LUKE, …
        • Decoder-only: GPT, GPT-2, GPT-3, …
        • Encoder-Decoder: BART, T5, UL2, …
      Vaswani et al., Attention Is All You Need, NeurIPS 2017.
      An accessible explanation (in Japanese): 【深層学習】Transformer - Multi-Head Attentionを理解してやろうじゃないの【ディープラーニングの世界 vol. 28】
      [Figure: architecture overview — Encoder / Decoder]

  14. BERT: Bidirectional Encoder Representations from Transformers
      • A model that stacks multiple Transformer Encoder layers and pre-trains them at scale
        • base has 12 layers (110M parameters), large has 24 layers (330M parameters)
      • Popularized the pre-training → fine-tuning paradigm
      Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL 2019.

  15. NLP with BERT
      • Before BERT: laboriously train a model on each task's dataset
      • With BERT: fine-tune BERT on each task's dataset
      Key point
      • Through pre-training, BERT acquires knowledge about what language is
        • Because it reuses that knowledge, each individual task only needs a "small" adjustment
      • High performance even with relatively little training data (roughly 100k+ examples → 1k+)
        • Training cost (dataset collection, training time) drops dramatically
      Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL 2019.

  16. GPT-3
      • Trains an extremely large autoregressive language model with 175B (175 billion) parameters
        • For reference: BERT-base has 110M (= 0.11B) parameters
      • Acquires the ability known as few-shot / zero-shot learning
        • Seeing only a handful (0–100) of examples is enough to solve a task reasonably well

  17. GPT-3 (continued from slide 16)
      Scaling laws
      • The empirical rule that performance keeps improving as the model, the data, and the compute are scaled up
      • No limit is in sight yet
        • Can human-level intellectual ability be surpassed simply by adding compute?
        • A step toward the long-dreamed-of "artificial general intelligence"?

  18. Few-shot / zero-shot learning
      • One way of doing inference with large language models
        • Also called in-context learning
      • Merely showing input–output examples of a task lets the model solve it reasonably correctly
        • The examples "condition" the language model so that it can solve the task
      • The preliminary input used for this conditioning is called a prompt
      Figures from the GPT-3 paper
      [Figure: zero-shot learning vs. few-shot learning]

  19. Prompts / prompt engineering
      • In most cases a prompt is just a string
        • Some approaches use continuous vectors as prompts instead
        • These are called soft prompts
        • e.g. Prefix-Tuning and Prompt Tuning, described later
      • Performance on a given task depends heavily on the prompt
        • The prompt has to be refined until it works well
        • This process is called prompt engineering
        • Methodologies such as Chain-of-Thought exist
      • Compilations of prompt-engineering know-how also exist
      Figure from Kojima et al., Large Language Models are Zero-Shot Reasoners, NeurIPS 2022.

  20. Large language models (LLMs)
      • Exploded in popularity with the release of OpenAI's ChatGPT API and GPT-4 and Meta's LLaMA
        • ChatGPT Plus, which gives access to GPT-4, is also nice
      • Both the range of LLM applications and the tooling for running LLMs locally are booming
        • Development of surrounding tools such as llama.cpp, LangChain, and LlamaIndex is also very active
        • Many projects tune LLMs with LoRA, e.g. Alpaca-LoRA
      • There is also an active movement to release free large language models
        • OPT, Alpaca, Vicuna, Dolly, RWKV, ChatRWKV, …
          (RWKV: a somewhat mysterious RNN-based LLM; its paper was not yet published at the time)

  21. Reference: Japanese pre-trained language models
      • Many models pre-trained specifically for Japanese also exist
        • Tohoku University BERT
        • Waseda University RoBERTa
        • Studio Ousia Japanese LUKE
        • Kyoto University DeBERTa-v2
        • rinna GPT-2 (1.3B)
        • Waseda University GPT-2 (1.5B)
        • ABEJA GPT-NeoX-Japanese (2.7B)
      Further reading (in Japanese):
      「フリーで使える日本語の主な大規模言語モデル（LLM）まとめ」
      「結局日本語大規模言語モデル（LLM）ってどれを使えばいいの？JGLUEベンチマーク非公式まとめ」

  22. Table of contents (same as slide 4)

  23. LoRA

  24. Problems in fine-tuning large language models
      • Storing fine-tuned models is a real burden
        • e.g. with GPT-3, every fine-tuning run produces another 175B-parameter model
        • A single 175B model takes roughly 350GB of storage*
      • Large models are hard to fit on GPUs
        • e.g. fine-tuning GPT-3 requires about 1.2TB of GPU memory
          (A100 80GB (roughly 1.5 million yen each) × 15 🙄)
      *Figures as reported in the LoRA paper

  25. Problems in fine-tuning large language models (continued from slide 24)
      • Few-shot / zero-shot learning needs no model updates, but…
        • In general, properly fine-tuning the model gives higher performance
      • So we would at least like to reduce the number of parameters that must be stored and optimized
        • It would be great if tasks could be solved by updating only a small part of the model

  26. Partial fine-tuning
      • With the rise of foundation models, research on partial (parameter-efficient) fine-tuning has flourished
        • LoRA, the topic of this deck, is one such method
      • All of these methods share the concept of "adjusting the foundation model only slightly"
      Prior work
      • BitFit: update only the model's bias terms
      • Adapter: add task-specific layers to the model
      • Prefix-Tuning: task-specific input vectors, plus updates to a small part of the model
      • Prompt Tuning: update only task-specific input vectors

  27. Adapter
      • Inserts trainable MLP layers into the Transformer
        • The MLP layers include a non-linear transformation
      • Several variants exist, depending on where the MLP layers are inserted
      • LoRA, introduced in this deck, is a kind of Adapter
      Houlsby et al., Parameter-Efficient Transfer Learning for NLP, ICML 2019.
      [Figure: where Adapters are inserted / the Adapter architecture — a simple MLP]
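As a rough picture of the Adapter idea (a sketch, not the exact Houlsby et al. module), a minimal bottleneck-MLP in PyTorch might look like this; the hidden size and activation below are arbitrary choices.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck MLP inserted after a frozen sublayer, with a residual connection."""
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)  # project down
        self.up = nn.Linear(bottleneck, d_model)    # project back up
        self.act = nn.GELU()                        # the non-linearity that LoRA later drops

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(self.act(self.down(h)))  # residual: starts close to identity

h = torch.randn(2, 10, 768)          # (batch, sequence, hidden)
print(Adapter(768)(h).shape)         # torch.Size([2, 10, 768])
```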

  28. Prefix-Tuning / Prompt Tuning
      • Prepend task-specific vectors (soft prompts) to the input
        • Optimize the task vectors (plus, in Prefix-Tuning, parts of the intermediate layers)
      • The two methods were proposed around the same time
        • Prompt Tuning updates only the soft prompts
      • This is in contrast to the "discrete" prompts used with LLMs
      Li and Liang, Prefix-Tuning: Optimizing Continuous Prompts for Generation, ACL-IJCNLP 2021.
      Lester et al., The Power of Scale for Parameter-Efficient Prompt Tuning, EMNLP 2021.
      [Figure: Prompt Tuning vs. Prefix-Tuning]
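To make "continuous vectors as prompts" concrete, here is a minimal conceptual sketch (assumptions: an embedding-level view, 20 virtual tokens, and a frozen base model); real Prompt Tuning implementations, e.g. the one in PEFT, also handle attention masks and generation details.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learnable 'virtual token' embeddings prepended to the real token embeddings."""
    def __init__(self, num_virtual_tokens: int, d_model: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(num_virtual_tokens, d_model) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, d_model) -> (batch, num_virtual_tokens + seq_len, d_model)
        batch = token_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, token_embeds], dim=1)

embeds = torch.randn(4, 12, 768)            # embeddings of 12 real tokens
print(SoftPrompt(20, 768)(embeds).shape)    # torch.Size([4, 32, 768]); only self.prompt is trained
```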

  29. Why does this reduce memory usage?
      • Updating parameters requires backpropagation
        • The cost of storing the intermediate computation results needed for backpropagation is actually enormous
        • The parts that need no backpropagation are not the bottleneck
      https://huggingface.co/docs/transformers/main/en/performance
      [Figure: memory usage during training]

  30. Why does this reduce memory usage? (continued from slide 29)
      • Only after piling up various tricks does memory usage finally come down
      • The model itself is comparatively small
      https://huggingface.co/docs/transformers/main/en/performance
      [Figure: memory usage during training]
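The practical consequence is that freezing most parameters removes their gradients and optimizer state from memory. A minimal sketch of the freezing step (a generic illustration, not tied to any particular library) might look like this:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 768))

# Freeze everything except the layers we want to adapt (here, just the last Linear).
for p in model.parameters():
    p.requires_grad = False
for p in model[-1].parameters():
    p.requires_grad = True

trainable = [p for p in model.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable), "trainable /",
      sum(p.numel() for p in model.parameters()), "total")

# The optimizer only keeps state (e.g. Adam moments) for the trainable parameters.
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```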

  31. Problems with the existing methods
      Increased inference cost
      • Existing Adapters add layers in series
        • The resulting "waiting on computation" prevents the GPU's parallelism from being used effectively
        • They introduce extra compute on top of the base model

  32. Problems with the existing methods (continued from slide 31)
      Optimization is difficult
      • Prefix-Tuning in particular is hard to optimize, and its performance is hard to predict
        • Adding more trainable parameters does not always improve performance (more on this later)
      • Getting good performance requires around 128 prefix tokens
        • which eats substantially into the available input sequence length

  33. LoRA: Low-Rank Adaptation of Large Language Models
      • Adds a difference matrix ΔW = BA alongside the base model's linear layers
      • Introduces two low-rank matrices A and B
        • Matrix A: (r × d)
        • Matrix B: (d × r)
        • r can be a very small number such as 2

  34. LoRA: Low-Rank Adaptation of Large Language Models (continued from slide 33)
      • Compared with full fine-tuning:
        • Drastically reduces the number of trainable parameters
        • Achieves comparable performance
      • The linear layers inside the base model stay frozen

  35. LoRA: Low-Rank Adaptation of Large Language Models (continued from slide 34)
      • The LoRA branch is just a linear transformation whose output is added on
      • Applied to linear layers inside the model, e.g. the attention projections Wq and Wv in the Transformer
        • Note: it adapts some of the model's linear layers, not the model as a whole
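The sketch below is a minimal, self-contained LoRA linear layer in PyTorch written to match the description above (frozen base weight, ΔW = BA, with A Gaussian-initialized and B zero-initialized as explained on slide 38). It illustrates the idea and is not the reference implementation; details such as the α scaling factor and dropout used in the paper and in the PEFT library are omitted.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = W0 x + B(A x), with W0 frozen and only A, B trainable."""
    def __init__(self, d_in: int, d_out: int, r: int = 2):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.base.weight.requires_grad = False               # the pre-trained weight stays fixed
        self.base.bias.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # (r × d_in), small Gaussian init
        self.B = nn.Parameter(torch.zeros(d_out, r))         # (d_out × r), zeros: ΔW = BA starts at 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + x @ self.A.T @ self.B.T        # W0 x + B A x

layer = LoRALinear(768, 768, r=2)
print(layer(torch.randn(4, 768)).shape)                      # torch.Size([4, 768])
```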

  36. What makes LoRA appealing?
      Easy task switching
      • Prepare one set of LoRA modules per task
      • Keep the base model as it is and simply swap in the LoRA modules for the task
        • The base model can stay loaded on the GPU the whole time
      Small storage footprint
      • The base model never changes; only the LoRA modules need to be saved
      No increase in inference cost
      • A LoRA module is nothing but a linear transformation
        → it can be merged into the original linear layer
      • Doing the merge once as a preprocessing step before inference is enough:
        h = W0 x + ΔW x = W0 x + BA x = (W0 + BA) x = W′ x
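Continuing the hypothetical LoRALinear sketch from above, the merge described by the equation is a one-time weight update; afterwards, inference runs through a single ordinary linear layer.

```python
import torch

@torch.no_grad()
def merge(lora_layer):
    """Fold ΔW = BA into the frozen base weight: W' = W0 + BA."""
    lora_layer.base.weight += lora_layer.B @ lora_layer.A   # (d_out × r) @ (r × d_in) = (d_out × d_in)
    return lora_layer.base                                   # a plain nn.Linear with the merged weight
```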

  37. How useful is LoRA?
      Example of applying it to GPT-3 (r = 4)
      • Training VRAM (GPU memory) drops from 1.2TB to 350GB
        • Could likely be cut further with other tricks such as gradient checkpointing
      • The parameters that need to be saved shrink from 350GB to 35MB
        • Storage and I/O costs drop dramatically
        • Creating models customized per user or per task becomes easy
      • Training is roughly 25% faster
        • The modest speedup may mean something else is the bottleneck
          (transferring the training data, or inter-node communication?)

  38. Why does LoRA work well?
      Initialization trick
      • A is initialized from a zero-mean Gaussian, B as a zero matrix
        • So ΔW = BA does nothing at the start
        • The LoRA modules do not interfere with the original model
      It does not change the model architecture
      • As the number of trainable parameters grows…
        • LoRA: with r equal to the rank of the original weight matrices, it essentially becomes training the original model
        • Adapter: equivalent to fully fine-tuning a model with extra MLP layers added
        • Prompt Tuning: equivalent to fully fine-tuning a model with a slightly shorter input length

  39. LoRA's novelty and caveats
      • LoRA itself is not an especially novel idea…
        • It feels like nothing more than an Adapter with the non-linearity removed
      Interesting points
      • It reaches good performance with even fewer parameters than Adapters
      • It reaches good performance without introducing non-linearity the way Adapters do

  40. LoRA's novelty and caveats (continued from slide 39)
      Caveats
      • For models above 10B parameters, even the LoRA modules can exceed 20M parameters
      • It may not work well unless the model is sufficiently large
        • e.g. reports of it working at 6.7B but failing to fine-tune well on a 1B model

  41. LoRA's novelty and caveats (continued from slide 40)
      • (On the 20M+ parameters:) not that small after all…
      • Mixing different tasks in the same batch is also somewhat tricky
      • For smaller models it is somewhat unclear how much hyperparameter tuning is needed and how sensitive the method is

  42. How LoRA relates to existing methods
      • LoRA is a kind of Adapter
        • It can be seen as a special case of the Adapter with the non-linearity removed
      • It has more trainable parameters than Prefix-Tuning / Prompt Tuning

      Method                    | fine-tuning | Adapter                      | LoRA              | Prefix-Tuning / Prompt Tuning | few-shot / zero-shot learning
      What is optimized         | everything  | MLP (multi-layer perceptron) | low-rank matrices | soft prompts (+ α)            | -
      Inference cost increases? | No          | Yes                          | No                | Yes (sequence length grows)   | -

  43. Aside: an intuitive view of matrix (tensor) computation
      • Matrix computation is indispensable in deep learning
        • Linear layers (weight matrix + bias) are especially ubiquitous
      • Intuitively, a linear layer is like the Gulliver Tunnel (the Doraemon gadget that resizes whatever passes through to the shape of its exit)
        • Put a vector in and it comes out shaped like the exit
        • A mechanism that can reshape a vector into any size you like, e.g. nn.Linear(5, 3)
        • Example: for classification you reshape with (embedding dimension, number of classes)
      Image from TV Asahi's 『ひみつ道具カタログ』 (Secret Gadget Catalog)
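The nn.Linear(5, 3) mentioned on the slide is literally that tunnel: a 5-dimensional vector goes in and a 3-dimensional one comes out. A tiny runnable illustration:

```python
import torch
import torch.nn as nn

tunnel = nn.Linear(5, 3)          # weight: (3, 5), bias: (3,)
x = torch.randn(8, 5)             # a batch of eight 5-dimensional vectors
y = tunnel(x)                     # y = x @ W.T + b
print(y.shape)                    # torch.Size([8, 3]): same vectors, reshaped to 3 dimensions
```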

  44. Experiments
      • Validate LoRA's usefulness against fine-tuning, Adapters, and other baselines
      • Multiple kinds of tasks
        • Natural language understanding (NLU) tasks
        • Natural language generation (NLG) tasks
      • Multiple models
        • RoBERTa base (125M) / large (355M)
        • DeBERTa XXL (1.5B)
        • GPT-2 medium (355M) / large (774M)
        • GPT-3 (175B)

  45. NLU: RoBERTa base / large & DeBERTa XXL
      • Evaluated on GLUE; overall, LoRA gives the best results on most tasks
        • If anything, performance is higher than with full fine-tuning

  46. NLG: GPT-2 medium / large
      • Evaluated on the E2E NLG Challenge; LoRA gives the best results in most settings
        • LoRA appears effective even for comparatively small models

  47. NLU & NLG: GPT-3 175B
      • Evaluated on WikiSQL (natural language → SQL), MNLI (natural language inference), and SAMSum (dialogue summarization); LoRA performs strongly
        • For several values of r it even outperforms full fine-tuning

  48. NLU & NLG: GPT-3 175B (continued from slide 47)
      • Fine-tuning is far stronger than few-shot learning → fine-tuning itself is not weak; it remains effective
      Why does LoRA match full fine-tuning?
      • The knowledge needed for the task already exists in the original model
      • LoRA merely amplifies it*
        • Perhaps that alone was already enough?
      *See Section 7 of the original paper

  49. Trainable parameter count vs. performance
      • LoRA-tune GPT-3 and observe how performance relates to the number of trainable parameters
      • LoRA's performance is comparatively stable, which should make it easy to work with
        • Prefix-Tuning's performance is unstable

  50. Trainable parameter count vs. performance (continued from slide 49)
      [Figure annotations: over 256 tokens / 32 tokens or fewer / LoRA stays stable]

  51. Which layers should be adapted?
      • LoRA-tune GPT-3 with the trainable parameter budget held fixed
        • Observe how performance changes across different layers and ranks
      • A small rank on several weight matrices beats a large rank on a single one
        • The diversity of the adapted layers seems to be what matters

  52. How large should r be?
      • LoRA-tune GPT-3 and observe how performance changes with r
        • Even very small ranks such as r = 2 turn out to give high performance
        • Perhaps even a tiny r already spans a "subspace" broad enough for the task?

  53. Summary and impressions
      • Proposes a method that fine-tunes only part of a model's parameters
      • Introduces two low-rank matrices A and B
      • Drastically reduces the number of trainable parameters while matching full fine-tuning
      Impressions
      • A method that puts its weight on a straightforward, practical design rather than on research novelty (my impression)
      • The appendix hints at an extraordinary amount of effort spent on hyperparameter tuning
        • Though, in my experience, the method does not seem that sensitive to hyperparameters…
      • The purely linear construction that lets it be fused back into the base model is a nice touch
      • Being such a simple method, its range of application looks very broad

  54. Table of contents (same as slide 4)

  55. How to use LoRA

  56. Recap: Transformers 🤗
      • A library for easily using the deep learning models that have risen to prominence in recent years
        • Provides implementations of well-known models and architectures
        • Sharing and downloading of pre-trained model weights
        • Simplifies training and inference with predefined, pre-trained models
      • Downloading a model and loading its weights takes a single line
      • The name is confusingly similar to the Transformer architecture itself
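That "single line" looks like this in practice (a minimal example; bert-base-uncased is just an arbitrary checkpoint chosen for illustration):

```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("bert-base-uncased")          # downloads and loads the weights
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
outputs = model(**tokenizer("LoRA is neat.", return_tensors="pt"))
print(outputs.last_hidden_state.shape)                            # (batch, tokens, hidden size)
```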

  57. PEFT: Parameter-Efficient Fine-Tuning 🤗
      • A library provided by Hugging Face
        • https://github.com/huggingface/peft
        • Implements LoRA, Prompt Tuning, AdaLoRA (a follow-up to LoRA), and more

  58. PEFT: Parameter-Efficient Fine-Tuning 🤗 (continued from slide 57)
      • The workflow for LoRA fine-tuning with PEFT (see the sketch below):
        1. Prepare a base model
        2. Replace some of the base model's layers with LoRA layers
        3. Freeze every layer other than the LoRA layers
        4. Train only the LoRA parts
      • Only the added layers are saved (tiny compared to the base model)
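A minimal sketch of that workflow using the PEFT API; the model name, hyperparameters, and task type are placeholder choices, not values from the slides.

```python
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model

# 1. Prepare a base model
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

# 2.-3. Wrap it: LoRA layers are inserted and everything else is frozen automatically
config = LoraConfig(task_type=TaskType.SEQ_2_SEQ_LM, r=8, lora_alpha=32, lora_dropout=0.1)
model = get_peft_model(model, config)
model.print_trainable_parameters()   # prints trainable vs. total parameter counts

# 4. Train as usual (Trainer or a plain loop), then save only the LoRA weights
model.save_pretrained("my-lora-adapter")
```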

  59. PEFT: the actual code
      [Figure: code screenshot]

  60. PEFT: the actual code
      [Figure: code screenshot — loading and preparing the base model]

  61. PEFT: the actual code
      [Figure: code screenshot — specifying the LoRA config]

  62. PEFT: the actual code
      [Figure: code screenshot — replacing part of the base model with LoRA layers]

  63. PEFT: the actual code
      [Figure: code screenshot]
      • The layers LoRA is applied to can also be specified via the target_modules argument
        • Be careful: some models will not work without it
      • Modules can also be selected with a regular expression
        • Example: .*Attention.(q|k|v|o)$
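As a hedged illustration of that last point: in PEFT's LoraConfig, target_modules can, as far as I know, be given either as a list of module-name suffixes or as a single regex string; module names differ from model to model, so the values below are only examples.

```python
from peft import LoraConfig, TaskType

# List form: match modules whose names end with "q" or "v" (typical for T5-style attention).
config_list = LoraConfig(task_type=TaskType.SEQ_2_SEQ_LM, r=8, target_modules=["q", "v"])

# Regex form: a single string is interpreted as a regular expression over full module names.
config_regex = LoraConfig(task_type=TaskType.SEQ_2_SEQ_LM, r=8,
                          target_modules=r".*Attention\.(q|k|v|o)$")
```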

  64. PEFT: impressions from actually using it
      • Applied PEFT to Google's Flan-UL2 (20B) for an in-house task
        • Performance was good, checkpoints are small, and training is fast
        • Example (r = 16): 25M trainable parameters, 55GiB of VRAM
          (batch size 32, BF16, gradient checkpointing)
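A sketch of such a setup under the stated assumptions (BF16, gradient checkpointing, r = 16); the exact flags and memory figures will depend on the environment, and the hyperparameters below are illustrative.

```python
import torch
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-ul2", torch_dtype=torch.bfloat16)
model.gradient_checkpointing_enable()   # trade extra compute for activation memory

config = LoraConfig(task_type=TaskType.SEQ_2_SEQ_LM, r=16, lora_alpha=32, lora_dropout=0.05)
model = get_peft_model(model, config)
model.print_trainable_parameters()      # on the order of tens of millions of trainable parameters
```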

  65. PEFT: impressions from actually using it (continued from slide 64)
      Asides
      • Flan-instruction-tuned models such as Flan-T5 XXL (11B) are extremely strong
        • Very high zero-shot performance on my tasks, and LoRA tuning also works well
        • There seems to be something here in common with InstructGPT and ChatGPT
      • For English tasks, Flan-T5 XXL / Flan-UL2 plus LoRA looks like a reasonable default choice
      • Both Flan-UL2 and Flan-T5 are T5-based encoder-decoder models, so they are very flexible
        • You can use only the encoder, BERT-style, or let them generate text

  66. PEFT: impressions from actually using it (continued from slide 65)
      • With T5 (11B), training fits in 40GB of VRAM and runs on an A6000!
      • They feel like models you can spin up right away as a drop-in replacement for BERT