
[Reading Group Slides] LoRA: Low-Rank Adaptation of Large Language Models

These slides explain LoRA, a method that introduces and trains low-rank matrices with very few parameters on top of a frozen pre-trained model, yet matches the performance of fine-tuning the whole model, along with the paper that proposed it.
They also walk carefully through the history of deep-learning-based NLP and its surrounding techniques, up to the background that made LoRA necessary.

Hayato Tsukagoshi

April 18, 2023



Transcript

1. LoRA: Low-Rank Adaptation of Large Language Models
Presenter: Hayato Tsukagoshi, Graduate School of Informatics, Nagoya University, Japan.
Paper: Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen. ICLR 2022.
https://arxiv.org/abs/2106.09685
2. Paper Overview
• Proposes a method that fine-tunes only a small subset of parameters of a frozen pre-trained model
• Introduces two low-rank matrices A and B
  • A is an (r × d) matrix and B is a (d × r) matrix, so the update ΔW = BA has the same shape as the frozen weight
  • r can be very small, e.g. 2
• Compared to full fine-tuning (see the parameter-count sketch below):
  • Drastically fewer trainable parameters
  • Comparable performance
• Fine-tunes the model without adding any inference cost
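To make the reduction concrete, here is a small back-of-the-envelope sketch (not from the slides; d = 4096 and r = 2 are illustrative values) comparing the trainable-parameter count of a single d × d weight under full fine-tuning and under LoRA.

```python
# Illustrative parameter-count comparison for one d x d linear layer.
d, r = 4096, 2                       # hidden size, LoRA rank (both illustrative)
full_ft_params = d * d               # full fine-tuning updates W directly
lora_params = d * r + r * d          # LoRA trains B (d x r) and A (r x d)
print(f"full fine-tuning: {full_ft_params:,} params")            # 16,777,216
print(f"LoRA (r={r}):     {lora_params:,} params")                # 16,384
print(f"reduction:        {full_ft_params / lora_params:.0f}x")   # ~1024x
```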
3. Fundamentals of Deep-Learning-Based NLP
• Language models built with deep learning are the central players
• A language model is "a model of the probability with which natural-language words and sentences are generated"*
  • Recently, though, the term is often used loosely to mean "a model that can handle language"
• They are the foundational technology of deep-learning-based NLP
  • Example models: BERT, GPT-2, GPT-3, GPT-4, LLaMA, Alpaca, Vicuna, Dolly, …
  • These are also called Foundation Models
• How language models are used:
  • Convert a sentence into a sequence of word vectors or a sentence vector
  • Predict the word or sentence that follows a given text
*Quoted from the IT Text book 『自然言語処理の基礎』 (Fundamentals of Natural Language Processing)
4. Language Models
• Most recent models are built from Transformers, which are based on the attention mechanism
• Several flavors exist:
  • Causal (autoregressive) LMs: trained to predict words from left to right. Examples: GPT, GPT-2, GPT-3, …
  • Masked LMs: trained by hiding and predicting parts of a sentence. Examples: BERT, RoBERTa, DeBERTa, …
• Many other variants exist as well, e.g. XLNet, ELECTRA, UL2, …
Figure: overview of BERT.
5. Attention Mechanism
• A mechanism that takes a sequence of vectors as input and outputs a sequence of vectors
• The input is split into Q (Query), K (Key), and V (Value)
  • K, V: n d-dimensional vectors
  • Q: m d-dimensional vectors
Figure quoted from Jaegle et al., Perceiver IO: A General Architecture for Structured Inputs & Outputs, ICLR 2022.
Accessible Japanese video explainer: 「【深層学習】Attention」 (ディープラーニングの世界 vol. 24).
6. Attention Mechanism (cont.)
• Same setup as the previous slide: Q is m d-dimensional vectors; K and V are n d-dimensional vectors
• Computes how important each of K's vectors is to each of Q's vectors (see the sketch below)
  • Attention weights: the (m × n) matrix obtained from this computation
• Self-attention: Q, K, and V are built from the same vector sequence (i.e. n = m)
• Cross-attention: Q is built from a different vector sequence than K and V
Figure quoted from Jaegle et al., Perceiver IO, ICLR 2022.
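As a concrete illustration of the Q/K/V computation described above, here is a minimal single-head scaled dot-product attention in PyTorch; it is a sketch for intuition, not the Perceiver IO or Transformer implementation itself.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Minimal single-head attention: q is (m, d); k and v are (n, d)."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # (m, n) importance of each key to each query
    weights = F.softmax(scores, dim=-1)           # attention weights: rows sum to 1
    return weights @ v                            # (m, d) output vectors

q = torch.randn(3, 8)    # m = 3 query vectors, d = 8
k = torch.randn(5, 8)    # n = 5 key vectors
v = torch.randn(5, 8)    # n = 5 value vectors
print(scaled_dot_product_attention(q, k, v).shape)   # torch.Size([3, 8])
```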
7. Transformer
• A model architecture built entirely from attention
  • Drops the RNNs, LSTMs, and CNNs that had been the standard tools in NLP
• Maps a sequence of vectors to a sequence of vectors
  • Models the interactions between the input vectors
Vaswani et al., Attention Is All You Need, NeurIPS 2017.
Figure: architecture overview (Encoder / Decoder).
Accessible Japanese video explainer: 「【深層学習】Transformer - Multi-Head Attention」 (ディープラーニングの世界 vol. 28).
8. Transformer (cont.)
• Built from attention only; maps vector sequences to vector sequences (as on the previous slide)
• Comes in two halves, Encoder and Decoder, which models mix and match:
  • Encoder only: BERT, LUKE, …
  • Decoder only: GPT, GPT-2, GPT-3, …
  • Encoder-Decoder: BART, T5, UL2, …
Vaswani et al., Attention Is All You Need, NeurIPS 2017.
9. BERT: Bidirectional Encoder Representations from Transformers
• Stacks multiple Transformer Encoder layers and pre-trains them at scale
  • base has 12 layers (110M parameters); large has 24 layers (330M parameters)
• Popularized the pre-training → fine-tuning paradigm
Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL 2019.
10. NLP with BERT
• Before BERT: train a model from scratch, with considerable effort, on each task's dataset
• With BERT: fine-tune BERT on each task's dataset (a minimal sketch follows this slide)
• The point:
  • BERT acquires knowledge of what language is through pre-training
  • Because that knowledge is reused, each individual task only needs a "small" adjustment
  • High performance becomes reachable even with relatively little training data (from ~100k examples down to ~1k)
  • Training cost (dataset collection, training time) drops dramatically
Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL 2019.
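A minimal sketch of the pre-training → fine-tuning paradigm using HuggingFace Transformers; the model name, toy example, label, and learning rate are illustrative choices, not taken from the slides.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load pre-trained BERT with a fresh classification head for the downstream task.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One toy fine-tuning step on a hypothetical sentiment example.
batch = tokenizer(["a delightful movie"], return_tensors="pt")
labels = torch.tensor([1])
loss = model(**batch, labels=labels).loss   # cross-entropy against the label
loss.backward()
optimizer.step()
```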
11. GPT-3
• An enormous autoregressive language model with 175B (175 billion) parameters
  • For reference, BERT-base has 110M (= 0.11B) parameters
• Acquired the ability known as few-shot / zero-shot learning
  • Seeing only a handful of examples (0-100) is enough to solve a task reasonably well
12. GPT-3 (cont.)
• Same headline facts as the previous slide: 175B parameters, few-shot / zero-shot ability
• Scaling law
  • The empirical rule that performance keeps improving as model size, data, and compute grow
  • No ceiling is in sight yet
  • Could scaling compute alone surpass human intellectual ability?
  • A step toward the long-dreamed-of "artificial general intelligence"?
13. Few-shot / Zero-shot Learning
• One way of doing inference with a large language model
  • Also called in-context learning
• Just showing input/output examples of a task lets the model solve it reasonably well
  • The examples "condition" the language model so that it can solve the task (see the prompt-building sketch below)
• The text fed in for this conditioning is called a prompt
Figure quoted from the GPT-3 paper: zero-shot learning vs. few-shot learning.
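A hypothetical example of how a few-shot prompt is assembled as plain text; the task, demonstrations, and wording are invented for illustration and no parameters are updated anywhere.

```python
# Few-shot prompting: the demonstrations condition the language model.
examples = [
    ("This movie was fantastic.", "positive"),
    ("The food was cold and bland.", "negative"),
]
query = "The concert exceeded all my expectations."

prompt = "Classify the sentiment of each review.\n\n"
for text, label in examples:                      # show input/output pairs
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"          # the model continues from here
print(prompt)
```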
14. Prompt / Prompt Engineering
• In most cases a prompt is just a string
  • There is also a school that uses continuous vectors as prompts
  • These are called Soft Prompts, used by Prefix-Tuning and Prompt Tuning (covered later)
• Performance on a given task depends heavily on the prompt
  • The prompt has to be iteratively improved; this process is called prompt engineering
  • Methodologies such as Chain-of-Thought exist
  • There are also compilations of prompt engineering techniques
Figure quoted from Kojima et al., Large Language Models are Zero-Shot Reasoners, NeurIPS 2022.
15. Large Language Models (LLMs)
• Exploded in popularity with OpenAI's ChatGPT API and GPT-4 and Meta's release of LLaMA
  • ChatGPT Plus, which gives access to GPT-4, is also well worth it
• The range of LLM applications, and the tooling for running LLMs locally, are both growing explosively
  • Surrounding technologies such as llama.cpp, LangChain, and LlamaIndex are being developed at a rapid pace
  • Many projects, such as Alpaca-LoRA, tune LLMs with LoRA
• There is also an active movement to release free large language models
  • OPT, Alpaca, Vicuna, Dolly, RWKV, ChatRWKV, …
  • (RWKV: a mysterious RNN-based LLM, paper not yet published at the time)
16. Reference: Japanese Pre-trained Language Models
• Many models pre-trained specifically for Japanese also exist:
  • Tohoku University BERT
  • Waseda University RoBERTa
  • Studio Ousia Japanese LUKE
  • Kyoto University DeBERTa-v2
  • rinna GPT-2 (1.3B)
  • Waseda University GPT-2 (1.5B)
  • ABEJA GPT-NeoX-Japanese (2.7B)
• Further reading (in Japanese): a roundup of freely available Japanese LLMs, and an unofficial summary of Japanese LLMs on the JGLUE benchmark

17. Problems with Fine-tuning Large Language Models
• Storing fine-tuned models is painful
  • Example: with GPT-3, every fine-tuning run produces another 175B-parameter model
  • A single 175B model takes roughly 350 GB of storage* (see the arithmetic below)
• Large models are hard to fit on GPUs
  • Example: fine-tuning GPT-3 needs about 1.2 TB of GPU memory, i.e. roughly fifteen 80 GB A100s (about ¥1.5M each) 🙄
• Few-shot / zero-shot learning needs no model updates, but…
  • properly tuning the model generally gives better performance
• So we would at least like to shrink the set of parameters we optimize and store
  • It would be great if updating only a small part of the model were enough to solve the task
*Figures as reported in the LoRA paper.
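A quick sanity check of the storage figure quoted above, assuming the 175B parameters are stored in 16-bit precision (the precision is an assumption; the 350 GB figure itself comes from the LoRA paper).

```python
# Back-of-the-envelope storage cost of one fine-tuned GPT-3-sized checkpoint.
params = 175e9                 # 175B parameters
bytes_per_param = 2            # FP16 / BF16 (assumed precision)
checkpoint_gb = params * bytes_per_param / 1e9
print(f"{checkpoint_gb:.0f} GB per fine-tuned copy")   # 350 GB
```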
18. Prefix-Tuning / Prompt Tuning
• Prepend task-specific vectors (Soft Prompts) to the input
  • Optimize the task vectors (plus, in Prefix-Tuning, parts of the intermediate layers)
• Two methods proposed around the same time
  • Prompt Tuning updates only the Soft Prompts (a minimal sketch follows this slide)
• In contrast to the "discrete" prompts used with LLMs
Li and Liang, Prefix-Tuning: Optimizing Continuous Prompts for Generation, ACL-IJCNLP 2021.
Lester et al., The Power of Scale for Parameter-Efficient Prompt Tuning, EMNLP 2021.
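A minimal sketch of the Soft Prompt idea: a handful of trainable "virtual token" embeddings are prepended to the frozen model's input embeddings. The class name, prompt length, and initialization are illustrative, not the Prefix-Tuning or Prompt Tuning reference implementation.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Prompt Tuning-style sketch: only these k 'virtual token' embeddings are trained."""
    def __init__(self, k: int, d: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(k, d) * 0.02)   # the only trainable weights

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, d) -> (batch, k + seq_len, d)
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

soft = SoftPrompt(k=20, d=768)
x = torch.randn(2, 10, 768)        # stand-in for the frozen model's token embeddings
print(soft(x).shape)               # torch.Size([2, 30, 768]) -- the prompt eats into the sequence
```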
19. Problems with Existing Methods
• Increased inference cost
  • Existing Adapters insert extra layers in series
  • The resulting "waiting on computation" wastes the GPU's parallel throughput
  • They add extra compute on top of the base model
• Difficult optimization
  • Prefix-Tuning in particular is hard to optimize and its performance is hard to predict
  • Adding more trainable parameters sometimes fails to improve performance (more on this later)
  • Around 128 prefix tokens are needed for good performance, which eats substantially into the input sequence length
20. LoRA: Low-Rank Adaptation of Large Language Models
• Add a difference matrix ΔW = BA next to a linear layer of the base model
• Introduce two low-rank matrices A and B
  • A: (r × d) matrix
  • B: (d × r) matrix
  • r can be very small, e.g. 2
21. LoRA (cont.)
• As on the previous slide: the linear layers of the base model stay frozen, and the low-rank update ΔW = BA (A: r × d, B: d × r, small r) is added next to them
• Compared to full fine-tuning:
  • Drastically fewer trainable parameters
  • Comparable performance
22. LoRA (cont.)
• Again: frozen base linear layer plus the low-rank update ΔW = BA
• The update is just a linear transformation whose output is added to the frozen path (see the module sketch below)
• Applied to some of the linear layers in the Transformer's attention mechanism, such as Wq and Wv
• Note: LoRA adapts a subset of the linear layers inside the model, not the model as a whole
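A minimal PyTorch sketch of a LoRA-adapted linear layer following the description above: the base weight is frozen and only A and B are trained, with the low-rank path simply added to the output. The class name, initialization scale, and the alpha/r scaling convention are illustrative rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update ΔW = BA."""
    def __init__(self, base: nn.Linear, r: int = 2, alpha: float = 1.0):
        super().__init__()
        d_out, d_in = base.weight.shape
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                           # freeze the pre-trained weights
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)    # A: (r x d_in), Gaussian init
        self.B = nn.Parameter(torch.zeros(d_out, r))          # B: (d_out x r), zero init
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W x  +  B (A x): the low-rank path is just added to the frozen path.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(768, 768), r=2)
x = torch.randn(4, 768)
print(layer(x).shape)      # torch.Size([4, 768])
```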
23. Why LoRA Helps
• Example numbers when applied to GPT-3 (r = 4):
  • Training-time VRAM (GPU memory) drops from 1.2 TB to 350 GB
    • Other tricks such as gradient checkpointing could cut this further
  • The parameters that must be stored shrink from 350 GB to 35 MB
    • Storage and I/O costs drop dramatically, making it easy to keep customized models per user or per task
  • Training is about 25% faster
    • Perhaps the speedup is modest because something else is the bottleneck (training-data transfer or inter-node communication?)
24. Why Does LoRA Work So Well?
• Careful initialization
  • A is initialized from a zero-mean Gaussian; B is initialized to the zero matrix
  • So ΔW = BA does nothing at first (see the check below)
  • The LoRA layers therefore do not interfere with the original model at the start of training
• It does not change the model architecture
  • As the number of trainable parameters grows:
    • LoRA with r equal to the rank of the original matrix ≈ the original model (full fine-tuning)
    • Adapter ≈ full fine-tuning of a model with extra MLP layers added
    • Prompt Tuning ≈ full fine-tuning of a model with a slightly shorter input length
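A quick numerical check of the initialization argument above: with B set to the zero matrix, ΔW = BA is exactly zero, so the adapted layer behaves identically to the frozen base at step 0. The dimensions are illustrative.

```python
import torch

d, r = 768, 2
A = torch.randn(r, d) * 0.01    # zero-mean Gaussian init
B = torch.zeros(d, r)           # zero init
delta_w = B @ A                 # (d x d) update matrix
print(torch.count_nonzero(delta_w))   # tensor(0) -> no change to the base model at step 0
```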
25. Novelty and Caveats of LoRA
• LoRA itself is not that novel an idea
  • It can feel like nothing more than an Adapter with the non-linearity removed
• The interesting points:
  • It reaches the same performance with fewer parameters than Adapters
  • It reaches that performance without the non-linearity that Adapters introduce
• Caveats:
  • For models above 10B parameters, even the LoRA layers can exceed 20M parameters (not that small…)
  • It may not work well unless the model is large enough
    • e.g. reports of 6.7B models fine-tuning well with LoRA while 1B models do not
  • Mixing different tasks in the same batch is also somewhat awkward
  • For smaller models it is somewhat unclear how much hyperparameter tuning is needed / how sensitive LoRA is
26. LoRA and Existing Methods
• LoRA is a kind of Adapter
  • It can be viewed as a special case of Adapters with the non-linearity removed
• It uses more trainable parameters than Prefix- / Prompt Tuning
• Comparison:
  • fine-tuning: optimizes everything; extra inference cost: No
  • Adapter: optimizes MLPs (multi-layer perceptrons); extra inference cost: Yes
  • LoRA: optimizes low-rank matrices; extra inference cost: No
  • Prefix-Tuning / Prompt Tuning: optimizes Soft Prompts (+ α); extra inference cost: Yes (longer sequence)
  • few-shot / zero-shot learning: optimizes nothing; extra inference cost: -
27. Aside: An Intuition for Matrix (Tensor) Computation
• Matrix computation is indispensable in deep learning
  • Linear layers (weight matrix + bias) are especially ubiquitous
• Intuitively, a linear layer is like the "Gulliver Tunnel" (a Doraemon secret gadget that resizes whatever passes through to match its exit)
  • Put a vector in and it comes out reshaped to the size of the exit
  • A mechanism that can turn a vector into any desired shape, e.g. nn.Linear(5, 3) (see the demo below)
  • Example: for classification, reshape from (embedding dimension) to (number of classes)
Image quoted from TV Asahi's 『ひみつ道具カタログ』 (secret gadget catalog).
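The nn.Linear(5, 3) example from the slide, spelled out: a linear layer reshapes 5-dimensional inputs into 3-dimensional outputs, which is all the "tunnel" metaphor is describing.

```python
import torch
import torch.nn as nn

layer = nn.Linear(5, 3)        # weight (3 x 5) + bias (3)
x = torch.randn(2, 5)          # two 5-dimensional vectors go in...
print(layer(x).shape)          # torch.Size([2, 3])  ...and come out 3-dimensional
```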
28. Evaluation Setup
• Validates LoRA against fine-tuning, Adapters, and other baselines
• Multiple task types:
  • Natural Language Understanding (NLU) tasks
  • Natural Language Generation (NLG) tasks
• Multiple models:
  • RoBERTa base (125M) / large (355M)
  • DeBERTa XXL (1.5B)
  • GPT-2 medium (355M) / large (774M)
  • GPT-3 (175B)
29. NLU & NLG Results: GPT-3 175B
• Evaluated on WikiSQL (natural language → SQL), MNLI (natural language inference), and SAMSum (dialogue summarization); LoRA performs strongly
  • Fine-tuning itself is far stronger than few-shot learning, so fine-tuning is not a weak baseline
• Why does LoRA match fine-tuning?
  • The knowledge needed for the task already exists in the base model
  • LoRA merely amplifies it* — perhaps that alone was enough
*See Section 7 of the original paper.
30. Summary and Impressions
• Proposes a method that fine-tunes only part of a model's parameters
• Introduces two low-rank matrices A and B
• Drastically reduces the number of trainable parameters while matching fine-tuning performance
• Impressions:
  • The emphasis seems to be on a straightforward, practical design rather than research novelty
  • The appendix hints at an extraordinary amount of hyperparameter-tuning effort, although in practice the method does not feel that hyperparameter-sensitive
  • Because the update is linear-only, it can be merged back into the base model, which is a nice property
  • The method is so simple that its range of application should be very broad
31. PEFT: Parameter-Efficient Fine-Tuning 🤗
• A library provided by HuggingFace
  • https://github.com/huggingface/peft
  • Implements LoRA, Prompt Tuning, AdaLoRA (a follow-up to LoRA), and more
• Workflow for LoRA fine-tuning with PEFT (a sketch follows this list):
  1. Prepare a base model
  2. Replace some of the base model's layers with LoRA layers
  3. Freeze every layer other than the LoRA layers
  4. Train only the LoRA part
• Only the added layers are saved (tiny compared with the base model)
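A minimal sketch of the four-step workflow above using the PEFT library; the base model (gpt2), the target module name, and the LoRA hyperparameters are illustrative choices, not the slide's exact setup.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# 1. prepare a base model (gpt2 is just an illustrative choice)
model = AutoModelForCausalLM.from_pretrained("gpt2")

# 2.-3. wrap the attention projections with LoRA layers and freeze everything else;
#       target_modules depends on the architecture (c_attn for GPT-2)
config = LoraConfig(task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=16,
                    lora_dropout=0.05, target_modules=["c_attn"],
                    fan_in_fan_out=True)   # GPT-2 stores its attention weights as Conv1D
model = get_peft_model(model, config)
model.print_trainable_parameters()         # only the LoRA matrices are trainable

# 4. train as usual, then save just the adapter weights (a few MB)
model.save_pretrained("my-lora-adapter")
```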
32. PEFT: Hands-on Impressions
• Applied PEFT to Google's Flan-UL2 (20B) on an in-house task
  • Performance was good, the saved adapter is small, and training is fast
  • Example (r = 16): 25M trainable parameters, 55 GiB of VRAM (batch size 32, BF16, gradient checkpointing)
• Asides:
  • Flan instruction-tuned models such as Flan-T5 XXL (11B) are extremely strong
    • Very high zero-shot performance on the in-house task, and LoRA tuning also works well
    • This seems related to what makes InstructGPT and ChatGPT work
  • For English tasks, Flan-T5 XXL / Flan-UL2 + LoRA is a reasonable default to consider
  • Both Flan-UL2 and Flan-T5 are T5-based Encoder-Decoder models, so they are flexible: use only the Encoder as with BERT, or generate text
33. PEFT: Hands-on Impressions (cont.)
• Same content as the previous slide, with two additions:
  • T5 (11B) can be trained in about 40 GB of VRAM, so it runs on an A6000!
  • These feel like models you can spin up immediately as a drop-in replacement for BERT (see the sketch below)
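A hedged sketch of the kind of setup described on these last two slides: loading a Flan-T5-class model in BF16 with gradient checkpointing and wrapping its attention query/value projections with LoRA via PEFT. The checkpoint name, target module names ("q" and "v" in HF's T5 implementation), and hyperparameters are assumptions for illustration, not the presenter's exact configuration.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

name = "google/flan-t5-xxl"                   # illustrative Flan checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name, torch_dtype=torch.bfloat16)
model.gradient_checkpointing_enable()          # trade compute for memory, as on the slide

config = LoraConfig(task_type=TaskType.SEQ_2_SEQ_LM, r=16, lora_alpha=32,
                    lora_dropout=0.05, target_modules=["q", "v"])
model = get_peft_model(model, config)
model.print_trainable_parameters()             # only the LoRA matrices will be trained
```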