Slide 1

Slide 1 text

LoRA: Low-Rank Adaptation of Large Language Models
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen
ICLR 2022
https://arxiv.org/abs/2106.09685
Presenter: Hayato Tsukagoshi (Graduate School of Informatics, Nagoya University, Japan)

Slide 2

Slide 2 text

Paper overview
• Proposes a method that fine-tunes only a small subset of a pre-trained model's parameters
• Introduces two low-rank matrices A and B
  • A is an (r × d) matrix and B a (d × r) matrix, so that ΔW = BA
  • r can be a very small number such as 2
• Compared with full fine-tuning
  • Dramatically fewer trainable parameters
  • Comparable performance
• The model can be fine-tuned without increasing inference cost

Slide 3

Slide 3 text

Why this paper
• A method whose user base has grown very rapidly in recent years
  • Recognition shot up with the boom in image and illustration generation
• Likely to become a de facto standard technique in the era of large language models
  • Application to LLMs is already extremely active
  • e.g., Alpaca and Alpaca-LoRA built on Meta's LLaMA
• Also a good opportunity to survey the history of training methods for deep models
  • Before BERT: train everything yourself
  • Pre-training → fine-tuning
  • Partial fine-tuning methods, or few-shot / zero-shot learning

Slide 4

Slide 4 text

Table of contents
Background
• Basics of NLP with deep learning
• Attention mechanism, Transformer, BERT, …
LoRA
• Problems in fine-tuning large language models and how to solve them
• Related work
• Proposed method and evaluation experiments
Using LoRA
• PEFT: Parameter-Efficient Fine-Tuning

Slide 5

Slide 5 text

Table of contents
Background
• Basics of NLP with deep learning
• Attention mechanism, Transformer, BERT, …
LoRA
• Problems in fine-tuning large language models and how to solve them
• Related work
• Proposed method and evaluation experiments
Using LoRA
• PEFT: Parameter-Efficient Fine-Tuning

Slide 6

Slide 6 text

Background

Slide 7

Slide 7 text

Basics of NLP with deep learning
• Language models built with deep learning now play the central role
• "A model of the probability with which natural-language words and sentences are generated"*
  • Lately, though, the term tends to be used loosely to mean "a model that can handle language"
• The foundation of NLP with deep learning
  • Example models: BERT, GPT-2, GPT-3, GPT-4, LLaMA, Alpaca, Vicuna, Dolly, …
  • These are also called foundation models
How language models are used
• Convert a sentence into a sequence of word vectors or into a sentence vector
• Predict the word or sentence that follows a given text
*Quoted from the IT Text textbook "Fundamentals of Natural Language Processing"

Slide 8

Slide 8 text

Language models
• Most recent models are built from Transformers, which are based on the attention mechanism
• Many varieties exist
Many other kinds of language model also exist, e.g., XLNet, ELECTRA, UL2, …
(Figure: overview of BERT)

Slide 9

Slide 9 text

Language models
• Most recent models are built from Transformers, which are based on the attention mechanism
• Many varieties exist (the two main objectives are sketched in code below)
Autoregressive language models (causal LM)
• Trained to predict words from left to right
• Examples: GPT, GPT-2, GPT-3, …
Masked language models (masked LM)
• Trained by masking and then predicting parts of a sentence
• Examples: BERT, RoBERTa, DeBERTa, …
Many other kinds also exist, e.g., XLNet, ELECTRA, UL2, …
(Figure: overview of BERT)
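To make the two training objectives concrete, here is a minimal sketch using the Transformers pipelines; the checkpoints bert-base-uncased and gpt2 are illustrative examples, not models discussed on this slide:

```python
from transformers import pipeline

# Masked LM: hide a token and have the model fill it in (the BERT-style objective).
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The capital of France is [MASK]."))

# Causal LM: continue the text left to right (the GPT-style objective).
generate = pipeline("text-generation", model="gpt2")
print(generate("The capital of France is", max_new_tokens=5))
```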

Slide 10

Slide 10 text

Attention mechanism
• A mechanism that takes a sequence of vectors as input and outputs a sequence of vectors
• The input is split into Q (Query), K (Key), and V (Value) for the computation
  • K, V: n d-dimensional vectors
  • Q: m d-dimensional vectors
Figure quoted from Jaegle et al., Perceiver IO: A General Architecture for Structured Inputs & Outputs, ICLR 2022.
An accessible explanation (Japanese video): "[Deep Learning] Attention" (Deep Learning no Sekai, vol. 24)

Slide 11

Slide 11 text

Attention mechanism
• A mechanism that takes a sequence of vectors as input and outputs a sequence of vectors
• The input is split into Q (Query), K (Key), and V (Value) for the computation
  • K, V: n d-dimensional vectors
  • Q: m d-dimensional vectors
• Computes the importance of each K vector with respect to each Q vector
  • Attention weights: the (m × n) matrix obtained from this computation
• Self-attention: Q, K, and V are built from the same vector sequence (i.e., n = m)
• Cross-attention: Q and (K, V) are built from different vector sequences
Figure quoted from Jaegle et al., Perceiver IO: A General Architecture for Structured Inputs & Outputs, ICLR 2022.
An accessible explanation (Japanese video): "[Deep Learning] Attention" (Deep Learning no Sekai, vol. 24)
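A minimal sketch of the Q/K/V computation described above (PyTorch; a single head, no masking, shapes as on the slide):

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    """Scaled dot-product attention: q (m, d), k (n, d), v (n, d) -> (m, d)."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5  # (m, n): importance of each K vector per Q vector
    weights = F.softmax(scores, dim=-1)          # the (m, n) attention weights
    return weights @ v                           # weighted sum of the V vectors

q = torch.randn(4, 64)   # m = 4 query vectors of dimension d = 64
k = torch.randn(10, 64)  # n = 10 key vectors
v = torch.randn(10, 64)  # n = 10 value vectors
print(attention(q, k, v).shape)  # torch.Size([4, 64]); self-attention uses the same sequence for q, k, v
```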

Slide 12

Slide 12 text

Transformer
• A model architecture composed entirely of attention
  • Dispenses with the RNNs, LSTMs, and CNNs that had been standard in NLP until then
• Takes a sequence of vectors as input and outputs a sequence of vectors
  • Accounts for the interactions among the input vectors
Vaswani et al., Attention Is All You Need, NeurIPS 2017.
An accessible explanation (Japanese video): "[Deep Learning] Transformer - understanding Multi-Head Attention" (Deep Learning no Sekai, vol. 28)
(Figure: architecture overview, Encoder / Decoder)

Slide 13

Slide 13 text

Transformer
• A model architecture composed entirely of attention
  • Dispenses with the RNNs, LSTMs, and CNNs that had been standard in NLP until then
• Takes a sequence of vectors as input and outputs a sequence of vectors
  • Accounts for the interactions among the input vectors
• Two components exist: the Encoder and the Decoder
  • Encoder only: BERT, LUKE, …
  • Decoder only: GPT, GPT-2, GPT-3, …
  • Encoder-Decoder: BART, T5, UL2, …
Vaswani et al., Attention Is All You Need, NeurIPS 2017.
An accessible explanation (Japanese video): "[Deep Learning] Transformer - understanding Multi-Head Attention" (Deep Learning no Sekai, vol. 28)
(Figure: architecture overview, Encoder / Decoder)

Slide 14

Slide 14 text

BERT: Bidirectional Encoder Representations from Transformers
• A model built by stacking multiple Transformer encoder layers and pre-training them at scale
  • base has 12 layers (110M parameters); large has 24 layers (330M parameters)
• Popularized the pre-training → fine-tuning paradigm
Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL 2019.

Slide 15

Slide 15 text

NLP with BERT
• Before BERT: laboriously train a model from scratch on each individual task's dataset
• With BERT: fine-tune BERT on each individual task's dataset
Key points
• Through pre-training, BERT acquires knowledge of what language is
  • Because that knowledge is reused, each task needs only a "slight" adjustment
• High performance becomes reachable with comparatively little training data (from ~100,000 examples down to ~1,000)
  • Training cost (dataset collection, training time) drops dramatically
Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL 2019.

Slide 16

Slide 16 text

GPT-3
• Trains an extremely large autoregressive language model with 175B (175 billion) parameters
  • For reference, BERT-base has 110M (= 0.11B) parameters
• Acquires the ability known as few-shot / zero-shot learning
  • It can solve tasks reasonably well after seeing only a handful (0-100) of examples

Slide 17

Slide 17 text

GPT-3
• Trains an extremely large autoregressive language model with 175B (175 billion) parameters
  • For reference, BERT-base has 110M (= 0.11B) parameters
• Acquires the ability known as few-shot / zero-shot learning
  • It can solve tasks reasonably well after seeing only a handful (0-100) of examples
Scaling law
• The empirical rule that performance keeps improving as models, data, and compute are scaled up
• **No limit is in sight yet**
  • Could scaling compute alone be enough to surpass human intellectual ability?
  • A step toward the long-dreamed-of "artificial general intelligence"?

Slide 18

Slide 18 text

Few-shot / Zero-shot learning
• One way of running inference with large language models
  • Also called in-context learning
• Just showing example inputs and outputs for a task lets the model solve the task reasonably well
  • The examples "condition" the language model so that it can solve the task
• The pre-input used for this conditioning is called a prompt
Figures quoted from the GPT-3 paper: zero-shot learning / few-shot learning

Slide 19

Slide 19 text

Prompt / Prompt Engineering
• In most cases a prompt is just a string of text
  • Some lines of work use continuous vectors as prompts instead
  • These are called soft prompts
  • e.g., Prefix-Tuning and Prompt Tuning, described later
• Performance on a given task depends heavily on the prompt
  • The prompt has to be tweaked until it works well
  • This process is called prompt engineering
  • Methodologies such as Chain-of-Thought exist
• Curated summaries of prompt-engineering techniques also exist
Figure quoted from Kojima et al., Large Language Models are Zero-Shot Reasoners, NeurIPS 2022.

Slide 20

Slide 20 text

Large Language Models (LLMs)
• Went mainstream with OpenAI's ChatGPT API and GPT-4 and with Meta's release of LLaMA
  • ChatGPT Plus, which provides access to GPT-4, is also nice
• The range of LLM applications and the tooling for running LLMs locally are growing explosively
  • Development of surrounding tools such as llama.cpp, LangChain, and LlamaIndex is intense
  • Many projects tune LLMs with LoRA, e.g., Alpaca-LoRA
• Releasing freely available large language models is also an active trend
  • OPT, Alpaca, Vicuna, Dolly, RWKV, ChatRWKV, …
(RWKV: a somewhat mysterious RNN-based LLM; paper not yet published at the time of this talk)

Slide 21

Slide 21 text

Reference: Japanese pre-trained language models
• Many models pre-trained specifically for Japanese also exist
  • Tohoku University BERT
  • Waseda University RoBERTa
  • Studio Ousia Japanese LUKE
  • Kyoto University DeBERTa-v2
  • rinna GPT-2 (1.3B)
  • Waseda University GPT-2 (1.5B)
  • ABEJA GPT-NeoX-Japanese (2.7B)
References (Japanese articles): "A summary of the main freely usable Japanese large language models (LLMs)"; "Which Japanese LLM should you actually use? An unofficial JGLUE benchmark summary"

Slide 22

Slide 22 text

Table of contents
Background
• Basics of NLP with deep learning
• Attention mechanism, Transformer, BERT, …
LoRA
• Problems in fine-tuning large language models and how to solve them
• Related work
• Proposed method and evaluation experiments
Using LoRA
• PEFT: Parameter-Efficient Fine-Tuning

Slide 23

Slide 23 text

LoRA

Slide 24

Slide 24 text

Problems in fine-tuning large language models
• Storing the fine-tuned models is a real burden
  • e.g., with GPT-3, every fine-tuning run produces yet another 175B-parameter model
  • A single 175B model takes roughly 350GB of storage*
• Large models are hard to fit on GPUs
  • e.g., fine-tuning GPT-3 requires about 1.2TB of GPU memory
*As reported in the LoRA paper
(That is roughly A100 80GB (about 1.5 million yen each) × 15 🙄)

Slide 25

Slide 25 text

Problems in fine-tuning large language models
• Storing the fine-tuned models is a real burden
  • e.g., with GPT-3, every fine-tuning run produces yet another 175B-parameter model
  • A single 175B model takes roughly 350GB of storage*
• Large models are hard to fit on GPUs
  • e.g., fine-tuning GPT-3 requires about 1.2TB of GPU memory
• Few-shot / zero-shot learning needs no model updates, but…
  • In general, properly tuning the model still yields higher performance
• So we would at least like to shrink the set of parameters that must be stored and optimized
  • It would be great if tasks could be solved by updating only part of the model
*As reported in the LoRA paper
(That is roughly A100 80GB (about 1.5 million yen each) × 15 🙄)

Slide 26

Slide 26 text

Partial fine-tuning
• With the rise of foundation models, research on partial fine-tuning methods has flourished
  • LoRA, the method covered in this deck, belongs to this family
• All of these methods share the concept of "adjusting the foundation model only slightly"
Existing work
• BitFit: update only the model's bias terms (see the sketch below)
• Adapter: add task-specific layers to the model
• Prefix-Tuning: update task-specific input vectors plus a small part of the model
• Prompt Tuning: update only task-specific input vectors
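To make "adjust only slightly" concrete, here is a BitFit-style sketch; the toy model is illustrative, and the method amounts to freezing everything except the bias terms:

```python
import torch.nn as nn

def apply_bitfit(model: nn.Module) -> None:
    """Freeze every parameter except bias terms (BitFit-style partial fine-tuning)."""
    for name, param in model.named_parameters():
        param.requires_grad = name.endswith("bias")

# Example on a toy model: only the two bias vectors remain trainable.
toy = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
apply_bitfit(toy)
trainable = sum(p.numel() for p in toy.parameters() if p.requires_grad)
print(trainable, "/", sum(p.numel() for p in toy.parameters()))  # 36 / 676
```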

Slide 27

Slide 27 text

Adapter
• A method that inserts trainable MLP layers into the Transformer
  • The MLP layers include a non-linear transformation
• Several variants exist, differing in where the MLP layers are inserted
• LoRA, covered in this deck, is a kind of Adapter
Houlsby et al., Parameter-Efficient Transfer Learning for NLP, ICML 2019.
(Figures: where Adapters are inserted; the Adapter structure, a simple MLP)
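A minimal sketch of such a bottleneck Adapter (dimensions are illustrative; the key parts are the non-linearity and the residual connection):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck MLP inserted after a Transformer sub-layer (Houlsby-style)."""
    def __init__(self, d_model: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)  # project down to a small dimension
        self.act = nn.GELU()                        # the non-linearity that LoRA later drops
        self.up = nn.Linear(bottleneck, d_model)    # project back up

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # residual: starts close to the identity

print(Adapter()(torch.randn(2, 10, 768)).shape)  # torch.Size([2, 10, 768])
```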

Slide 28

Slide 28 text

Prefix-Tuning / Prompt Tuning
• Prepend task-specific vectors (soft prompts) to the input
  • Optimize the task vectors (plus part of the intermediate layers)
• Prompt Tuning, proposed around the same time, updates only the soft prompts
(In contrast to the "discrete" prompts used with LLMs)
Li & Liang, Prefix-Tuning: Optimizing Continuous Prompts for Generation, ACL-IJCNLP 2021.
Lester et al., The Power of Scale for Parameter-Efficient Prompt Tuning, EMNLP 2021.
(Figures: Prompt Tuning / Prefix-Tuning)

Slide 29

Slide 29 text

Why can memory usage be reduced?
• Updating a parameter requires backpropagation
  • The cost of storing intermediate results for backpropagation is in fact enormous
  • Parts of the model that never need backpropagation stop being the bottleneck
https://huggingface.co/docs/transformers/main/en/performance
(Figure: memory usage during training)

Slide 30

Slide 30 text

Why can memory usage be reduced?
• Updating a parameter requires backpropagation
  • The cost of storing intermediate results for backpropagation is in fact enormous
  • Parts of the model that never need backpropagation stop being the bottleneck
https://huggingface.co/docs/transformers/main/en/performance
(Figure: memory usage during training; only after piling on various tricks does memory usage finally come down, and the model weights themselves are comparatively small)

Slide 31

Slide 31 text

Problems with existing methods
Increased inference cost
• Existing Adapters add layers in series
  • The resulting "waiting on computation" keeps the GPU's parallel throughput from being used well
  • They introduce extra computation into the base model

Slide 32

Slide 32 text

Problems with existing methods
Increased inference cost
• Existing Adapters add layers in series
  • The resulting "waiting on computation" keeps the GPU's parallel throughput from being used well
  • They introduce extra computation into the base model
Difficult optimization
• Prefix-Tuning in particular is hard to optimize, and its performance is hard to predict
  • Increasing the number of trainable parameters does not always improve performance (shown later)
• Roughly 128 prefix tokens are needed to get good performance
  • This eats substantially into the available input sequence length

Slide 33

Slide 33 text

LoRA: Low-Rank Adaptation of Large Language Models
• Adds a difference matrix ΔW = BA alongside a linear layer of the base model
• Introduces two low-rank matrices A and B
  • Matrix A: (r × d)
  • Matrix B: (d × r)
  • r can be a very small number such as 2

Slide 34

Slide 34 text

LoRA: Low-Rank Adaptation of Large Language Models
• Adds a difference matrix ΔW = BA alongside a linear layer of the base model
• Introduces two low-rank matrices A and B
  • Matrix A: (r × d)
  • Matrix B: (d × r)
  • r can be a very small number such as 2
• Compared with full fine-tuning
  • Dramatically fewer trainable parameters
  • Comparable performance
(The linear layers inside the base model stay frozen)

Slide 35

Slide 35 text

LoRA: Low-Rank Adaptation of Large Language Models
• Adds a difference matrix ΔW = BA alongside a linear layer of the base model
• Introduces two low-rank matrices A and B
  • Matrix A: (r × d)
  • Matrix B: (d × r)
  • r can be a very small number such as 2
• Compared with full fine-tuning
  • Dramatically fewer trainable parameters
  • Comparable performance
(It only applies a linear map and adds the result)
(Applied to linear layers inside the Transformer, e.g., the attention projections Wq, Wv; the linear layers of the base model stay frozen)
(Note: LoRA is attached not to the model as a whole but to some of the linear layers within it — see the code sketch below)
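A minimal sketch of such a LoRA-augmented linear layer; this is a simplified illustration rather than the official implementation, and the LoRALinear name and the 0.01 init scale are my own choices:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base linear layer W0 plus a trainable low-rank update ΔW = BA."""
    def __init__(self, base: nn.Linear, r: int = 2, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                          # W0 (and its bias) stay frozen
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # A: small zero-mean Gaussian init
        self.B = nn.Parameter(torch.zeros(d_out, r))         # B: zeros, so ΔW = BA = 0 at first
        self.scale = alpha / r

    def forward(self, x):
        # h = W0 x + (BA) x ; only A and B receive gradients
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768), r=2)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 3072 trainable vs. 590592 frozen
```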

Slide 36

Slide 36 text

What makes LoRA attractive?
Switching tasks is easy
• Keep one set of LoRA layers per task
• Leave the base model untouched and just swap the LoRA layers
Small storage footprint
• The base model never changes; only the LoRA layers need to be saved
No increase in inference cost
• The LoRA layers only apply a linear transformation
  → they can be merged into the original linear layer
• Do the merge once as a pre-processing step before inference
h = W0 x + ΔW x = W0 x + BA x = (W0 + BA) x = W' x
(The base model can stay loaded on the GPU the whole time)
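A sketch of that merge step, continuing the hypothetical LoRALinear above:

```python
import torch

@torch.no_grad()
def merge_lora(layer: "LoRALinear") -> torch.nn.Linear:
    """Fold the LoRA update into the base weight: W' = W0 + BA (no extra inference cost)."""
    layer.base.weight += layer.scale * (layer.B @ layer.A)  # (d_out, r) @ (r, d_in) = ΔW
    return layer.base  # an ordinary nn.Linear; the LoRA branch can now be discarded
```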

Slide 37

Slide 37 text

How useful is LoRA?
Example of applying it to GPT-3 (r = 4)
• Training-time VRAM (GPU memory) drops from 1.2TB to 350GB
  • Other tricks such as gradient checkpointing could likely reduce it further
• The parameters stored per task drop from 350GB to 35MB
  • Smaller checkpoints make storage and I/O dramatically cheaper
  • Building models customized per user or per task becomes easy
• Training is about 25% faster
  • Perhaps it is not faster than that because the bottleneck lies elsewhere
  (transferring the training data, or inter-node communication?)
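A quick sanity check on the 350GB figure, under the assumption (mine, not stated on the slide) of 16-bit weights:

```python
gpt3_params = 175e9      # GPT-3 parameter count
bytes_per_param = 2      # assumption: FP16 storage
print(gpt3_params * bytes_per_param / 1e9, "GB")  # -> 350.0 GB per full checkpoint
```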

Slide 38

Slide 38 text

Why does LoRA work so well?
Initialization trick
• A is initialized from a zero-mean Gaussian and B as the zero matrix
  • So ΔW = BA does nothing at first
  • The LoRA layers do not disturb the original model
The model architecture is not really changed
• As the number of trainable parameters is increased…
  • LoRA: when r matches the original matrix, it is essentially the same as the original model (full fine-tuning)
  • Adapter: equivalent to full fine-tuning of a model with extra MLP layers added
  • Prompt Tuning: equivalent to full fine-tuning of a model whose input is slightly shorter

Slide 39

Slide 39 text

LoRA's novelty and caveats
• LoRA itself is not an especially novel idea…
  • It feels rather like "Adapter with the non-linearity taken out"
Interesting points
• It reaches good performance with fewer parameters than Adapter
• It reaches good performance without introducing a non-linearity the way Adapter does

Slide 40

Slide 40 text

LoRA's novelty and caveats
• LoRA itself is not an especially novel idea…
  • It feels rather like "Adapter with the non-linearity taken out"
Interesting points
• It reaches good performance with fewer parameters than Adapter
• It reaches good performance without introducing a non-linearity the way Adapter does
Caveats
• For models of 10B parameters or more, even the LoRA layers can exceed 20M parameters
• It may not work well unless the model is sufficiently large
  • e.g., a reported case where it works at 6.7B but fails to fine-tune a 1B model properly

Slide 41

Slide 41 text

LoRA's novelty and caveats
• LoRA itself is not an especially novel idea…
  • It feels rather like "Adapter with the non-linearity taken out"
Interesting points
• It reaches good performance with fewer parameters than Adapter
• It reaches good performance without introducing a non-linearity the way Adapter does
Caveats
• For models of 10B parameters or more, even the LoRA layers can exceed 20M parameters
• It may not work well unless the model is sufficiently large
  • e.g., a reported case where it works at 6.7B but fails to fine-tune a 1B model properly
(Margin notes: 20M is not all that small…; mixing different tasks in the same batch is also a bit awkward; for smaller models it is somewhat unclear how much hyperparameter tuning is needed / how sensitive it is)

Slide 42

Slide 42 text

How LoRA relates to existing methods
• LoRA is a kind of Adapter
  • It can be viewed as a special case of Adapter with the non-linearity removed
• It uses more trainable parameters than Prefix-Tuning / Prompt Tuning

| Method                          | What is optimized                | Inference cost increases?  |
| fine-tuning                     | everything                       | No                         |
| Adapter                         | MLPs (multi-layer perceptrons)   | Yes                        |
| LoRA                            | low-rank matrices                | No                         |
| Prefix-Tuning / Prompt Tuning   | soft prompts (+ α)               | Yes (longer sequence)      |
| few-shot / zero-shot learning   | -                                | -                          |

Slide 43

Slide 43 text

Aside: an intuitive picture of matrix (tensor) computation
• Matrix computation is indispensable in deep learning
  • Linear layers (weight matrix + bias) appear especially often
• Intuitively it is like the "Gulliver Tunnel" (a Doraemon gadget)
  • Put a vector in, and it comes out shaped like the exit
  • A mechanism that can reshape a vector into any shape you like
  • e.g., a classification head reshapes with (vector dimension, number of classes)
Image quoted from TV Asahi's "Himitsu Dōgu Catalog" (Doraemon secret-gadget catalog)
(Figure: the Gulliver Tunnel, a gadget that scales things to the shape of the exit; nn.Linear(5, 3), written out in code below)
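The nn.Linear(5, 3) example from the slide, written out:

```python
import torch
import torch.nn as nn

layer = nn.Linear(5, 3)   # weight: (3, 5), bias: (3,)
x = torch.randn(8, 5)     # a batch of eight 5-dimensional vectors goes in
print(layer(x).shape)     # torch.Size([8, 3]) -- it comes out "shaped like the exit"
```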

Slide 44

Slide 44 text

Evaluation experiments
• Verify LoRA's usefulness by comparing it with fine-tuning, Adapter, and others
• Multiple kinds of task
  • Natural language understanding (NLU) tasks
  • Natural language generation (NLG) tasks
• Multiple models
  • RoBERTa base (125M) / large (355M)
  • DeBERTa XXL (1.5B)
  • GPT-2 medium (355M) / large (774M)
  • GPT-3 (175B)

Slide 45

Slide 45 text

NLU: RoBERTa base / large & DeBERTa XXL
• Evaluated on GLUE; overall, LoRA gives the best results on most tasks
  • If anything, performance is higher than with fine-tuning

Slide 46

Slide 46 text

NLG: GPT-2 medium / large
• Evaluated on the E2E NLG Challenge; LoRA gives the best results in almost all cases
  • LoRA appears effective even for relatively small models

Slide 47

Slide 47 text

NLU & NLG: GPT-3 175B
• Evaluated on WikiSQL, MNLI, and SAMSum; LoRA achieves high performance
  • For several values of r it even beats fine-tuning
(Tasks: natural language → SQL; dialogue summarization; natural language inference)

Slide 48

Slide 48 text

NLU & NLG: GPT-3 175B
• Evaluated on WikiSQL, MNLI, and SAMSum; LoRA achieves high performance
  • Fine-tuning is far stronger than few-shot →
  • so it is not that fine-tuning is weak; fine-tuning itself is effective
Why does LoRA match fine-tuning?
• The knowledge needed for the task already exists in the original model
• LoRA merely amplifies it*
  • Perhaps that alone was enough?
*See Section 7 of the original paper
(Tasks: natural language → SQL; dialogue summarization; natural language inference)

Slide 49

Slide 49 text

Number of trainable parameters vs. performance
• Tune GPT-3 with LoRA and observe how performance relates to the number of trainable parameters
• LoRA's performance is comparatively stable, which makes it easy to work with
  • Prefix-Tuning's performance is unstable

Slide 50

Slide 50 text

Number of trainable parameters vs. performance
• Tune GPT-3 with LoRA and observe how performance relates to the number of trainable parameters
• LoRA's performance is comparatively stable, which makes it easy to work with
  • Prefix-Tuning's performance is unstable
(Figure annotations: over 256 tokens; 32 tokens or fewer; LoRA is stable)

Slide 51

Slide 51 text

Which layers should be fine-tuned?
• Tune GPT-3 with LoRA while holding the number of trainable parameters fixed
  • Observe how performance changes across different layers and ranks
• Small ranks on several weight matrices beat a large rank on a single matrix
  • The "diversity of the layers being optimized" seems to matter

Slide 52

Slide 52 text

How large does r need to be?
• Tune GPT-3 with LoRA and observe how performance changes with r
  • Even a very small rank such as r = 2 can achieve high performance
  • Perhaps even such a low-rank update spans a "subspace" broad enough for the task?

Slide 53

Slide 53 text

Summary and impressions
• Proposes a method that fine-tunes only part of a model's parameters
• Introduces two low-rank matrices A and B
• Dramatically reduces the number of trainable parameters while matching fine-tuning performance
Impressions
• The emphasis seems to lie less on research novelty than on an honest, convenient design (my impression)
• The appendix hints at an extraordinary amount of effort spent on hyperparameter tuning
  • Even though, in practice, the method does not seem all that hyperparameter-sensitive… (subjective feel)
• Building it from linear maps only, so that it can be fused back into the base model, is a nice touch
• Being such a simple method, its range of applications looks quite broad

Slide 54

Slide 54 text

Table of contents
Background
• Basics of NLP with deep learning
• Attention mechanism, Transformer, BERT, …
LoRA
• Problems in fine-tuning large language models and how to solve them
• Related work
• Proposed method and evaluation experiments
Using LoRA
• PEFT: Parameter-Efficient Fine-Tuning

Slide 55

Slide 55 text

How to use LoRA

Slide 56

Slide 56 text

Refresher: Transformers 🤗
• A library for easily using the deep learning models that have risen to prominence in recent years
  • Implementations of well-known deep models and architectures
  • Sharing and downloading of pre-trained model parameters
  • Simplified training and inference with pre-defined, pre-trained models
• Downloading a model and loading its weights takes a single line (see the sketch below)
• The name is confusingly similar to the Transformer architecture itself
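The "single line" claim, sketched; the checkpoint name is just an example:

```python
from transformers import AutoModel, AutoTokenizer

# Downloads the checkpoint on first use and returns a ready-to-run model and tokenizer.
model = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
```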

Slide 57

Slide 57 text

PEFT: Parameter-Efficient Fine-Tuning 🤗
• A library provided by HuggingFace
  • https://github.com/huggingface/peft
  • Implements LoRA, Prompt Tuning, AdaLoRA (a successor to LoRA), and more

Slide 58

Slide 58 text

PEFT: Parameter-Efficient Fine-Tuning 🤗
• A library provided by HuggingFace
  • https://github.com/huggingface/peft
  • Implements LoRA, Prompt Tuning, AdaLoRA (a successor to LoRA), and more
• The flow of LoRA fine-tuning with PEFT (see the code sketch below):
  1. Prepare a base model
  2. Replace some of the base model's layers with LoRA layers
  3. Freeze all layers other than the LoRA layers
  4. Train only the LoRA part
• Only the added layers are saved (extremely small compared with the base model)
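A minimal sketch of this four-step flow with the peft library; the model name, task type, and hyperparameters are illustrative, not values taken from the slides:

```python
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, get_peft_model, TaskType

# 1. Prepare the base model
base_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

# 2.-3. Wrap it: LoRA layers are injected and all other parameters are frozen
config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()  # only the LoRA parameters are trainable

# 4. Train with your usual loop or Trainer, then save only the adapter weights
model.save_pretrained("my-lora-adapter")  # a few tens of MB, not the full base model
```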

Slide 59

Slide 59 text

PEFT: the actual code

Slide 60

Slide 60 text

PEFT: the actual code — loading and preparing the base model

Slide 61

Slide 61 text

PEFT: the actual code — specifying the LoRA config

Slide 62

Slide 62 text

PEFT: the actual code — replacing part of the base model with LoRA layers
 LoRA૚ʹஔ͖׵͑

Slide 63

Slide 63 text

PEFT: the actual code — the layers LoRA is applied to can also be specified via the target_modules argument
• Note: without it, some models will not work
• Modules can also be specified with a regular expression, e.g., .*Attention.(q|k|v|o)$ (see the sketch below)
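A sketch of specifying target_modules explicitly; the module names and the regex are illustrative, and the correct values depend on the base model's layer naming:

```python
from peft import LoraConfig

# Explicit list of module names ...
config = LoraConfig(r=8, lora_alpha=16, target_modules=["q", "v"])

# ... or a regular expression matched against full module paths,
# e.g. only the q/k/v/o projections inside attention blocks.
config_regex = LoraConfig(r=8, lora_alpha=16, target_modules=r".*Attention\.(q|k|v|o)$")
```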

Slide 64

Slide 64 text

PEFT: impressions from actually using it
• Applied PEFT to Google's Flan-UL2 (20B) on a task of ours
  • Performance was good, the saved weights are small, and training is fast
  • Example (r = 16): 25M trainable parameters, 55GiB of VRAM
(Batch size 32, BF16, gradient checkpointing)

Slide 65

Slide 65 text

PEFT: impressions from actually using it
• Applied PEFT to Google's Flan-UL2 (20B) on a task of ours
  • Performance was good, the saved weights are small, and training is fast
  • Example (r = 16): 25M trainable parameters, 55GiB of VRAM
Aside
• Flan-instruction-tuned models such as Flan-T5 XXL (11B) are extremely strong
  • Very high zero-shot performance on our tasks, and LoRA tuning also works well
  • There seems to be something here in common with InstructGPT and ChatGPT
• For English tasks, Flan-T5 XXL / UL2 + LoRA is a reasonable default to consider
• Both Flan-UL2 and Flan-T5 are T5-based encoder-decoder models, so they are very flexible
  • You can use only the encoder, as with BERT, or have them generate text
(Batch size 32, BF16, gradient checkpointing)

Slide 66

Slide 66 text

PEFT: impressions from actually using it
• Applied PEFT to Google's Flan-UL2 (20B) on a task of ours
  • Performance was good, the saved weights are small, and training is fast
  • Example (r = 16): 25M trainable parameters, 55GiB of VRAM
Aside
• Flan-instruction-tuned models such as Flan-T5 XXL (11B) are extremely strong
  • Very high zero-shot performance on our tasks, and LoRA tuning also works well
  • There seems to be something here in common with InstructGPT and ChatGPT
• For English tasks, Flan-T5 XXL / UL2 + LoRA is a reasonable default to consider
• Both Flan-UL2 and Flan-T5 are T5-based encoder-decoder models, so they are very flexible
  • You can use only the encoder, as with BERT, or have them generate text
(Batch size 32, BF16, gradient checkpointing)
T5 (11B) can be trained with 40GB of VRAM, so it runs on an A6000!
It feels like a model you can use right away as a drop-in replacement for BERT