Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Word Embeddings Are Steers
for Language Models

Hayato Tsukagoshi
September 29, 2024
62

Word Embeddings Are Steers
for Language Models

出力単語埋め込み層に付加した線形層Wのみを学習する微調整手法LM-Steerと、それによって得られるSteer matrix Wがもつ面白い性質について分析した論文の解説資料です。

2024-09-30: ACL2024読み会@名大
http://cr.fvcrc.i.nagoya-u.ac.jp/~oshika/acl2024nagoya/

Hayato Tsukagoshi

September 29, 2024
Tweet

Transcript

  1. Word Embeddings Are Steers
 for Language Models D2, Graduate School

    of Informatics, Nagoya University, Japan Hayato Tsukagoshi Chi Han, Jialiang Xu, Manling Li, Yi Fung, Chenkai Sun, Nan Jiang, Tarek Abdelzaher, Heng Ji ACL 2024 Outstanding paper award
 https://aclanthology.org/2024.acl-long.864/
  2. •୯ޠຒΊࠐΈʹΑΓݴޠϞσϧΛ
 ʮૢॎʯͰ͖Δ͜ͱΛࣔͨ͠ • Steer = ଩ΛऔΔ⎈ •࣮ଶ͸ύϥϝʔλޮ཰ͷΑ͍ඍௐ੔ख๏ • ग़ྗ୯ޠຒΊࠐΈʹͷΈઢܗ૚
 Λ෇Ճͯ͠ϞσϧΛ܇࿅

    • LoRAͳͲAdapterख๏ͱ࣮૷͸͍ۙ •໘നϙΠϯτ • ઢܗ૚ͷ࡞༻ͷͤ͞ํΛม͑Δͱੜ੒݁Ռ͕࿈ଓతʹมԽ • ܇࿅ͨ͠ઢܗ૚͕Ϟσϧඇґଘʹར༻Մೳ ֓ཁ 2 ͱ͜ΖͰ໊ࢺͱͯ͠ͷsteerʹ͸଩ͱ͍͏ҙຯ߹͍͕ͳͦ͞͏ͳͷͰɺएׯෆࣗવͳλΠτϧʹͳ͍ͬͯΔΈ͍ͨ
 Word Embeddings Steer Language Models ͳΒࣗવͳӳޠͬΆ͍ͷ͕ͩ
  3. ࣄલ஌ࣝ: Transformerʹجͮ͘ݴޠϞσϧͷ֓ཁਤ 4 ୯ޠ༧ଌ֬཰ I like to eat pizza with

    Transformer … ୯ޠ༧ଌ૚ softmax จ຺Խ୯ޠຒΊࠐΈ
  4. ࣄલ஌ࣝ: Transformerʹجͮ͘ݴޠϞσϧͷ֓ཁਤ 6 ୯ޠ༧ଌ֬཰ I like to eat pizza with

    Transformer … ֤ग़ྗ୯ޠຒΊࠐΈ
 ͱͷ಺ੵΛܭࢉ softmax ग़ྗ୯ޠຒΊࠐΈ૚
  5. •ϞσϧશମΛඍௐ੔͢Δ୅ΘΓʹग़ྗຒΊࠐΈʹઢܗ૚Λ෇Ճֶͯ͠श •෇Ճͨ͠ઢܗ૚ͷΈύϥϝʔλߋ৽ɺϞσϧ΍ຒΊࠐΈ૚͸ߋ৽͠ͳ͍ •ࠜڌ (c.f. Appendix B ͓Αͼ Appendix C) •

    ༷ʑͳԾఆͷ΋ͱͰελΠϧͷม׵͸୯ޠຒΊࠐΈͷઢܗม׵ͱΈͳͤΔ • Ծఆͷେࡶ೺ͳཁ໿: ελΠϧͱͦͷଞͷ৘ใ͕෼཭Մೳ LM-Steer: ग़ྗ୯ޠຒΊࠐΈΛର৅ͱ͢Δඍௐ੔ख๏ 8 ୯ޠvͷग़ྗ୯ޠຒΊࠐΈ 🔥 ❄ ❄
  6. •ϞσϧશମΛඍௐ੔͢Δ୅ΘΓʹग़ྗຒΊࠐΈʹઢܗ૚Λ෇Ճֶͯ͠श •෇Ճͨ͠ઢܗ૚ͷΈύϥϝʔλߋ৽ɺϞσϧ΍ຒΊࠐΈ૚͸ߋ৽͠ͳ͍ •ࠜڌ (c.f. Appendix B ͓Αͼ Appendix C) •

    ༷ʑͳԾఆͷ΋ͱͰελΠϧͷม׵͸୯ޠຒΊࠐΈͷઢܗม׵ͱΈͳͤΔ • Ծఆͷେࡶ೺ͳཁ໿: ελΠϧͱͦͷଞͷ৘ใ͕෼཭Մೳ LM-Steer: ग़ྗ୯ޠຒΊࠐΈΛର৅ͱ͢Δඍௐ੔ख๏ 9 ୯ޠvͷग़ྗ୯ޠຒΊࠐΈ 🔥 ❄ ❄ Steer matrix Steering value
  7. 12 ୯ޠ༧ଌ֬཰ I like to eat pizza with Transformer …

    softmax ग़ྗ୯ޠຒΊࠐΈ૚ TransformerΛ༻͍ͨ௨ৗͷจੜ੒
  8. Steer-LMΛ༻͍ͨจੜ੒ 13 ୯ޠ༧ଌ֬཰ I like to eat pizza with Transformer

    … softmax W 🔥 + ε … ૢ࡞ޙͷ୯ޠຒΊࠐΈ૚
  9. Steer-LMΛ༻͍ͨจੜ੒ 14 ୯ޠ༧ଌ֬཰ I like to eat pizza with Transformer

    … softmax W 🔥 + ε … ૢ࡞ޙͷ୯ޠຒΊࠐΈ૚ ཁ͸୯ޠ༧ଌ૚ͷࠩ͠ସ͑
  10. LM-SteerΛར༻͢Δ࣌ͷΠϝʔδ 16 This movie is ok bad LM good ༧ଌ୯ޠ

    ೖྗจ Positiveͳํ޲΁ͷૢॎ Steer matrixͰ
 ग़ྗຒΊࠐΈΛճస
  11. LM-SteerΛར༻͢Δ࣌ͷΠϝʔδ 17 This movie is ok bad LM good ༧ଌ୯ޠ

    ೖྗจ Positiveͳํ޲΁ͷૢॎ Steer matrixͰ
 ग़ྗຒΊࠐΈΛճస
  12. LM-SteerΛར༻͢Δ࣌ͷΠϝʔδ 18 This movie is ok bad LM good ༧ଌ୯ޠ

    ೖྗจ ग़ྗ୯ޠຒΊࠐΈͷํ͕ճస͞Ε
 ݁Ռͱͯ͠ੜ੒୯ޠ͕ૢ࡞͞ΕΔ Positiveͳํ޲΁ͷૢॎ Steer matrixͰ
 ग़ྗຒΊࠐΈΛճస
  13. LM-SteerΛར༻͢Δ࣌ͷΠϝʔδ 19 This movie is ok bad LM good ༧ଌ୯ޠ

    ೖྗจ Steering valueΛϚΠφεʹ͢Δ
 ͜ͱͰٯํ޲΁ͷૢ࡞΋Մೳ -Positiveͳํ޲΁ͷૢॎ
  14. •ੜ੒จͷελΠϧΛૢ࡞Ͱ͖Δ͔Ͳ͏͔ͰධՁ •ධՁλεΫ • Language Detoxi fi cation • Sentiment Control

    ෼ੳ •Steering valueͱੜ੒จͷؔ܎ɾSteer matrixͷߏ੒ੑͷݕূ •ֶशσʔλྔͱੑೳͷؔ܎ •ֶश͞ΕͨSteer matrix͕ݴޠඇґଘʹಈ࡞͢Δ͜ͱͷݕূ ධՁ࣮ݧ 20
  15. •༩͑ΒΕͨಟੑϥϕϧʹదͨ͠ੜ੒͕Ͱ͖Δ͔ධՁ͢ΔλεΫ σʔληοτ •܇࿅: Jigsaw Unintended Bias •ධՁ: RealToxicityPromptsσʔληοτ • σʔληοτதͷ10kจΛpromptͱͯ͠25จੜ੒ͯ͠ಟੑΛࣗಈධՁ

    ධՁࢦඪ: Toxicity •ಟੑ෼ྨϞσϧʹΑΓੜ੒จ͕toxicͱ෼ྨ͞ΕΔ֬཰ •෼ྨϞσϧ: Google API comment analyzer ධՁ࣮ݧ: Language Detoxi fi cation 21 ൃදऀ஫: ධՁʹͲͷϞσϧΛ࢖͕ͬͨ࿦จதʹ໌֬ʹ͸هࡌ͞Ε͍ͯͳ͔ͬͨͨΊɺGitHubͷධՁ༻ίʔυ͔Β൑அ
  16. •༩͑ΒΕͨײ৘ϥϕϧʹదͨ͠ੜ੒͕Ͱ͖Δ͔ධՁ͢ΔλεΫ σʔληοτ •܇࿅: Stanford Sentiment Treebank (SST-5) •ධՁ: OpenWebText Corpus

    • Ұ෦Λϓϩϯϓτͱͯ͠༻͍ͯ25จΛੜ੒͠ײ৘ۃੑΛࣗಈධՁ ධՁࢦඪ: Positivity •ײ৘ۃੑ෼ྨϞσϧʹΑΓੜ੒จ͕positiveͱ෼ྨ͞ΕΔ֬཰ •෼ྨϞσϧ: distilbert/distilbert-base-uncased- fi netuned-sst-2-english ධՁ࣮ݧ: Sentiment Control 22 ൃදऀ஫: ධՁʹͲͷϞσϧΛ࢖͕ͬͨ࿦จதʹ໌֬ʹ͸هࡌ͞Ε͍ͯͳ͔ͬͨͨΊɺGitHubͷධՁ༻ίʔυ͔Β൑அ
  17. ৚݅෇͖ੜ੒ख๏ •Soft Word Blacklist (SWB): logitͷo ff set஋ͷΈΛֶश •DExperts: ಛఆଐੑͷจΛੜ੒͢ΔΑ͏܇࿅͞ΕͨิॿϞσϧΛ༻͍ੜ੒

    •DAPT: ԼྲྀλεΫͷ܇࿅σʔλΛϥϕϧͳ͠ςΩετͱͯ͠௥Ճࣄલֶश •PPLM: ײ৘ۃੑ෼ྨث౳͔ΒಘΒΕΔޯ഑Λ༻͍ͯӅΕঢ়ଶΛߋ৽ˠੜ੒ •GeDi: ײ৘ۃੑϥϕϧͷࣄޙ֬཰ΛϕΠζͷఆཧͰมܗ͠৚݅෇͖ੜ੒ •MuCoLa: ෼ྨثͷείΞΛͱϥϯδϡόϯαϯϓϦϯάΛ༻͍ͯੜ੒ •PromptT5: T5ʹϥϕϧΛϓϩϯϓτͱͯ͠ೖΕͯੜ੒ •LoRA: Ϟσϧதͷઢܗ૚ʹ௿ϥϯΫ૚Λ෇Ճͯ͠௿ϥϯΫ૚ͷΈඍௐ੔ ൺֱख๏ 23
  18. •ײ৘ۃੑ෼ྨͰpositiveͳจΛੜ੒
 ͢ΔΑ͏Steering matrixΛ܇࿅ •܇࿅ޙʹSteering value (ε)Λ༷ʑ
 ʹௐ੔ͯ͠positivityΛධՁ •Steering valueΛڧ͘͢Δ΄Ͳ
 ࿈ଓతʹpositivity͕ڧ͘ͳͬͯ


    ͍Δ͜ͱ͕؍࡯Ͱ͖Δ • Steering valueΛϚΠφεʹ͢Δͱ
 negativeͳจ͕ੜ੒͞Ε͍ͯΔ
 Steering valueΛมԽͤͨ࣌͞ͷग़ྗͷ܏޲ 26 -5ε0 5ε0 0ε0
  19. •ײ৘ۃੑ෼ྨλεΫͱಟੑ෼ྨλεΫͰ
 ͰͦΕͧΕ܇࿅ͨ͠Steering matrix
 Λ༻ҙ •ͦΕͧΕͷSteering valueΛมԽͤͨ͞
 ࣌ͷ ײ৘ۃੑ (৭) ͱ

    ಟੑ (ߴ͞) Λ
 ϓϩοτ •ײ৘ۃੑͱಟੑ͕͋Δఔ౓ಠཱͯ͠
 มԽ͍ͯ͠Δ͜ͱ͕֬ೝͰ͖Δ ߏ੒ੑͷݕূ: ෳ਺ͷSteerͷ૊Έ߹Θͤ 27 ͦ΋ͦ΋ײ৘ۃੑͱಟੑͬͯ៉ྷʹ෼཭Ͱ͖ͳ͍࣠Ͱ͸ʁͱ͍͏ؾ΋͢Δ͕ ײ৘ۃੑ ಟ ੑ
  20. •Steer matrix͸ϞσϧΛม͑ͯ΋
 ద༻ՄೳͰ͋Δ͜ͱΛݕূ खॱ •ҟͳΔݴޠϞσϧؒͷ୯ޠ
 ຒΊࠐΈͷϚοϐϯάHΛಘΔ •ֶशࡁΈSteer matrix Λ
 ʹஔ͖׵͑ͯλʔήοτ


    ݴޠϞσϧʹͦͷ··ద༻ •Steer matrixΛ࢖͍ճͯ͠΋ग़ྗͷ੍ޚ͕Մೳ •ҙຯ: ֤ϞσϧͰಉ͡ʮ֦ॖɾճసʯ͕ຒΊࠐΈૢ࡞্ͷࣅͨҙຯΛ࣋ͭ W HTWH ϞσϧΛލ͍ͩసҠੑೳ 29
  21. •ग़ྗ୯ޠຒΊࠐΈͷΈΛର৅ͱͨ͠
 ඍௐ੔ख๏ LM-Steer ΛఏҊ •ݴޠϞσϧͷੜ੒ελΠϧ੍ޚ
 λεΫͰධՁ • LoRAΛ͸͡Ίͱ͢Δطଘख๏Λ
 ্ճΔੑೳΛୡ੒ ײ૝

    •ֶश͞ΕͨSteer matrix͕ଞϞσϧʹస༻Ͱ͖Δ఺͸ඇৗʹ໘ന͍ • ϞσϧඇґଘͰSteer matrixʹΑΔʮ֦ॖɾճసʯ͕ࣅͨҙຯΛ࣋ͭ •ҰํͰελΠϧΑΓ΋ෳࡶͳλεΫͰͷద༻͸͔ͳΓ೉ͦ͠͏ ·ͱΊ 30