Slide 1

Slide 1 text

Word Embeddings Are Steers for Language Models
Chi Han, Jialiang Xu, Manling Li, Yi Fung, Chenkai Sun, Nan Jiang, Tarek Abdelzaher, Heng Ji
ACL 2024 Outstanding Paper Award
https://aclanthology.org/2024.acl-long.864/
Presented by Hayato Tsukagoshi (D2, Graduate School of Informatics, Nagoya University, Japan)

Slide 2

Slide 2 text

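Overview
• Shows that a language model can be "steered" through its word embeddings
  • Steer = to take the helm
• In practice, it is a parameter-efficient fine-tuning method
  • A linear layer is added only to the output word embeddings, and the model is trained with it
  • The implementation is close to adapter methods such as LoRA
• Interesting points
  • Changing how the linear layer is applied changes the generated text continuously
  • The trained linear layer can be used independently of the model
Note: as a noun, "steer" does not seem to carry the sense of "rudder", so the title reads as slightly unnatural. "Word Embeddings Steer Language Models" would sound like more natural English.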

Slide 3

Slide 3 text

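Background: overview of a Transformer-based language model
[Figure: the input "I like to eat pizza with" passes through the Transformer, then the word prediction layer and softmax, yielding word prediction probabilities]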

Slide 4

Slide 4 text

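Background: overview of a Transformer-based language model
[Figure: same diagram; the Transformer output is labeled as the contextualized word embedding, which feeds the word prediction layer and softmax]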

Slide 5

Slide 5 text

Background: overview of a Transformer-based language model
[Figure: same diagram; the word prediction layer is labeled as the output word embedding layer, followed by softmax]

Slide 6

Slide 6 text

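Background: overview of a Transformer-based language model
[Figure: same diagram; the output word embedding layer computes the inner product with each output word embedding, and softmax turns the results into word prediction probabilities]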
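The prediction step in the diagram can be sketched in a few lines. This is only an illustration: the 2-dimensional vectors and the three candidate words are made-up values, not taken from any real model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]

def next_word_probs(context, output_embeddings):
    """Inner product of the contextualized embedding with each
    output word embedding, followed by softmax."""
    words = list(output_embeddings)
    logits = [sum(c * e for c, e in zip(context, output_embeddings[w]))
              for w in words]
    return dict(zip(words, softmax(logits)))

# Hypothetical contextualized embedding for "I like to eat pizza with"
context = [1.0, 0.5]
# Hypothetical output word embeddings
output_embeddings = {
    "pizza": [2.0, 0.0],
    "friends": [1.0, 1.0],
    "chopsticks": [0.0, 1.0],
}
probs = next_word_probs(context, output_embeddings)
```

Words whose output embedding has a larger inner product with the context receive a higher probability, which is why editing the output embeddings changes what the model generates.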

Slide 7

Slide 7 text

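Controlling the generation style of language models
Existing methods
• Some methods attach an external classifier at generation time; others fine-tune the model
• These are parameter-inefficient, slow to train, and computationally expensive
Proposed method: LM-Steer
• Examines the role of word embeddings in language models
• Proposes a smaller and more efficient style-control method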

Slide 8

Slide 8 text

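LM-Steer: a fine-tuning method that targets the output word embeddings
• Instead of fine-tuning the whole model, a linear layer is added to the output embeddings and trained
• Only the added linear layer has its parameters updated; the model and the embedding layer are not updated
• Justification (cf. Appendix B and Appendix C)
  • Under various assumptions, a change of style can be regarded as a linear transformation of the word embeddings
  • Rough summary of the assumptions: style and the remaining information are separable
[Figure: formula for the output word embedding of word v, with one component marked 🔥 (trained) and two marked ❄ (frozen)]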

Slide 9

Slide 9 text

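LM-Steer: a fine-tuning method that targets the output word embeddings
• Instead of fine-tuning the whole model, a linear layer is added to the output embeddings and trained
• Only the added linear layer has its parameters updated; the model and the embedding layer are not updated
• Justification (cf. Appendix B and Appendix C)
  • Under various assumptions, a change of style can be regarded as a linear transformation of the word embeddings
  • Rough summary of the assumptions: style and the remaining information are separable
[Figure: formula for the output word embedding of word v, with one component marked 🔥 (trained) and two marked ❄ (frozen); the trained component is labeled "Steer matrix" and the scalar is labeled "Steering value"]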
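A minimal sketch of "update only the added linear layer", assuming the steered logit takes the form logit_v = c · (I + εW) e_v, where c is the contextualized embedding, e_v the frozen output embedding of word v, W the steer matrix, and ε the steering value. That form is my reading of the method and is an assumption here; the toy vectors, learning rate, and target index are hypothetical. The gradient of the cross-entropy loss is taken with respect to W alone.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]

def steered_logits(context, embeddings, W, eps):
    """logit_v = c . (e_v + eps * W e_v) for every word v (assumed form)."""
    logits = []
    for e in embeddings:
        We = [sum(w * x for w, x in zip(row, e)) for row in W]
        steered = [x + eps * y for x, y in zip(e, We)]
        logits.append(sum(c * s for c, s in zip(context, steered)))
    return logits

def loss_and_grad(context, embeddings, W, eps, target):
    """Cross-entropy loss and its gradient with respect to W only."""
    probs = softmax(steered_logits(context, embeddings, W, eps))
    loss = -math.log(probs[target])
    d = len(context)
    # residual[j] = sum_v (p_v - y_v) * e_v[j]
    residual = [
        sum((probs[v] - (1.0 if v == target else 0.0)) * embeddings[v][j]
            for v in range(len(embeddings)))
        for j in range(d)
    ]
    # dL/dW[i][j] = eps * c[i] * residual[j]
    grad = [[eps * context[i] * residual[j] for j in range(d)]
            for i in range(d)]
    return loss, grad

# Hypothetical frozen quantities: one context vector, three output embeddings
context = [1.0, 0.0]
embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
W = [[0.0, 0.0], [0.0, 0.0]]      # the only trainable parameters
eps, lr, target = 1.0, 0.5, 1     # steer toward word index 1

losses = []
for _ in range(20):
    loss, grad = loss_and_grad(context, embeddings, W, eps, target)
    losses.append(loss)
    W = [[W[i][j] - lr * grad[i][j] for j in range(2)] for i in range(2)]
```

The context vector and the embedding table are never written to; only the d × d matrix W changes, which is where the parameter efficiency comes from.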

Slide 10

Slide 10 text

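Output control with LM-Steer
Steer matrix
• Controls how the output word embeddings are transformed
Steering value
• Controls how strongly the output word embeddings are transformed
[Figure: the same formula, with the trained (🔥) steer matrix, the steering value, and the frozen (❄) components labeled]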

Slide 11

Slide 11 text

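Output control with LM-Steer
Steer matrix
• Controls how the output word embeddings are transformed
Steering value
• Controls how strongly the output word embeddings are transformed
[Figure: the same formula, with the trained (🔥) steer matrix, the steering value, and the frozen (❄) components labeled; annotation: the output word embeddings are updated]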

Slide 12

Slide 12 text

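Ordinary text generation with a Transformer
[Figure: the input "I like to eat pizza with" passes through the Transformer, then the output word embedding layer and softmax, yielding word prediction probabilities]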

Slide 13

Slide 13 text

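Text generation with LM-Steer
[Figure: the same pipeline, but the output word embedding layer is replaced by a modified word embedding layer built from W (🔥, trained) and ε; softmax then yields word prediction probabilities]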

Slide 14

Slide 14 text

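Text generation with LM-Steer
[Figure: the same pipeline, but the output word embedding layer is replaced by a modified word embedding layer built from W (🔥, trained) and ε; softmax then yields word prediction probabilities]
In short, the word prediction layer is swapped out.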
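"Swapping out the word prediction layer" can be pictured as building a new embedding table from the old one. The sketch below assumes the update e_v → e_v + εW e_v, which matches the "W" and "+ ε" labels in the figure but is my assumption about the exact form; `steer_embeddings` is a hypothetical helper name and the numbers are made up.

```python
def steer_embeddings(embeddings, W, eps):
    """Return a new table with e_v replaced by e_v + eps * W e_v.
    The original table is left untouched."""
    steered = {}
    for word, e in embeddings.items():
        We = [sum(w * x for w, x in zip(row, e)) for row in W]
        steered[word] = [x + eps * y for x, y in zip(e, We)]
    return steered

# Hypothetical 2-d output embeddings and steer matrix
embeddings = {"a": [1.0, 2.0], "b": [0.0, -1.0]}
W = [[0.0, 1.0], [1.0, 0.0]]

steered = steer_embeddings(embeddings, W, eps=0.5)
unchanged = steer_embeddings(embeddings, W, eps=0.0)
```

With ε = 0 the table is identical to the original, so the unsteered model is recovered without storing a second copy of anything but W.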

Slide 15

Slide 15 text

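Intuition for using LM-Steer
[Figure: the input sentence "This movie is" goes into the LM; the candidate predicted words are "ok", "bad", and "good"]
Ordinary text generation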

Slide 16

Slide 16 text

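Intuition for using LM-Steer
[Figure: the input sentence "This movie is" goes into the LM; the candidate predicted words are "ok", "bad", and "good"]
Steering in the positive direction: the steer matrix rotates the output embeddings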

Slide 17

Slide 17 text

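Intuition for using LM-Steer
[Figure: the input sentence "This movie is" goes into the LM; the candidate predicted words are "ok", "bad", and "good"]
Steering in the positive direction: the steer matrix rotates the output embeddings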

Slide 18

Slide 18 text

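Intuition for using LM-Steer
[Figure: the input sentence "This movie is" goes into the LM; the candidate predicted words are "ok", "bad", and "good"]
Steering in the positive direction: the steer matrix rotates the output embeddings
It is the output word embeddings that are rotated, and as a result the generated word changes.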

Slide 19

Slide 19 text

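Intuition for using LM-Steer
[Figure: the input sentence "This movie is" goes into the LM; the candidate predicted words are "ok", "bad", and "good"]
Steering in the negative-positive direction: making the steering value negative steers in the opposite direction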
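The "This movie is" picture can be reproduced with a toy example. Everything here is hypothetical: the 2-d embeddings (second coordinate standing in for sentiment), the hand-picked steer matrix, and the assumed update e_v → e_v + εW e_v. A trained steer matrix would be learned from data rather than written by hand.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]

def next_word_probs(context, embeddings, W, eps):
    """Softmax over c . (e_v + eps * W e_v) for each candidate word."""
    words = list(embeddings)
    logits = []
    for w in words:
        e = embeddings[w]
        We = [sum(a * x for a, x in zip(row, e)) for row in W]
        steered = [x + eps * y for x, y in zip(e, We)]
        logits.append(sum(c * s for c, s in zip(context, steered)))
    return dict(zip(words, softmax(logits)))

# Hypothetical context embedding for "This movie is"
context = [1.0, 0.0]
# Hypothetical output embeddings: [generic fit, sentiment]
embeddings = {"good": [1.0, 1.0], "ok": [1.0, 0.0], "bad": [1.0, -1.0]}
# Hand-picked steer matrix that feeds the sentiment coordinate into the logit
W = [[0.0, 1.0], [0.0, 0.0]]

neutral = next_word_probs(context, embeddings, W, eps=0.0)
positive = next_word_probs(context, embeddings, W, eps=2.0)
negative = next_word_probs(context, embeddings, W, eps=-2.0)
```

With ε = 0 the three words are equally likely; a positive ε favors "good" and the same magnitude with a negative sign favors "bad", using the same W in both cases.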

Slide 20

Slide 20 text

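Experiments
• Evaluated by whether the style of generated text can be controlled
• Evaluation tasks
  • Language Detoxification
  • Sentiment Control
Analysis
• Relationship between the steering value and the generated text; verification of the compositionality of steer matrices
• Relationship between the amount of training data and performance
• Verification that a trained steer matrix works independently of the language model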

Slide 21

Slide 21 text

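Experiment: Language Detoxification
• A task that evaluates whether the model can generate text matching a given toxicity label
Datasets
• Training: Jigsaw Unintended Bias
• Evaluation: RealToxicityPrompts
  • 10k sentences from the dataset are used as prompts, 25 sentences are generated, and toxicity is scored automatically
Metric: Toxicity
• The probability that a toxicity classifier labels the generated text as toxic
• Classifier: Google API comment analyzer
Presenter's note: the paper does not clearly state which model was used for evaluation, so this was determined from the evaluation code on GitHub.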
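One common way to aggregate such classifier scores, assuming 25 generations per prompt, is to take the worst (maximum) score per prompt and then average across prompts, alongside the fraction of prompts with at least one generation above a threshold. Whether the paper aggregates exactly this way is an assumption on my part; `toxicity_summary`, the 0.5 threshold, and the scores below are hypothetical.

```python
def toxicity_summary(scores_per_prompt, threshold=0.5):
    """scores_per_prompt: one list of classifier toxicity scores per prompt.
    Returns (average of per-prompt maximum score,
             fraction of prompts with any generation at or above threshold)."""
    max_scores = [max(scores) for scores in scores_per_prompt]
    avg_max = sum(max_scores) / len(max_scores)
    toxic_fraction = (
        sum(1 for m in max_scores if m >= threshold) / len(max_scores)
    )
    return avg_max, toxic_fraction

# Hypothetical scores: 2 prompts x 3 generations (the setup uses 25 per prompt)
scores = [
    [0.10, 0.20, 0.05],
    [0.70, 0.30, 0.10],
]
avg_max, toxic_fraction = toxicity_summary(scores)
```

Lower is better for both numbers when the goal is detoxification.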

Slide 22

Slide 22 text

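Experiment: Sentiment Control
• A task that evaluates whether the model can generate text matching a given sentiment label
Datasets
• Training: Stanford Sentiment Treebank (SST-5)
• Evaluation: OpenWebText Corpus
  • Part of the corpus is used as prompts, 25 sentences are generated, and sentiment polarity is scored automatically
Metric: Positivity
• The probability that a sentiment classifier labels the generated text as positive
• Classifier: distilbert/distilbert-base-uncased-finetuned-sst-2-english
Presenter's note: the paper does not clearly state which model was used for evaluation, so this was determined from the evaluation code on GitHub.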

Slide 23

Slide 23 text

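Compared methods
Conditional generation methods
• Soft Word Blacklist (SWB): learns only offset values for the logits
• DExperts: generates with auxiliary models trained to produce text with a specific attribute
• DAPT: continues pretraining on the downstream task's training data, treated as unlabeled text
• PPLM: updates hidden states using gradients from a sentiment classifier or similar, then generates
• GeDi: rewrites the posterior probability of the sentiment label with Bayes' theorem for conditional generation
• MuCoLa: generates using classifier scores and Langevin sampling
• PromptT5: feeds the label to T5 as a prompt and generates
• LoRA: adds low-rank layers to the model's linear layers and fine-tunes only the low-rank layers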

Slide 24

Slide 24 text

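Experiment: toxicity, fluency, diversity
• Toxicity improves, while fluency degrades fairly noticeably
  • Manipulating only the output word embeddings already gives a good deal of control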

Slide 25

Slide 25 text

Experiment: sentiment polarity, fluency, diversity
• LM-Steer generated text with higher positivity than the other methods

Slide 26

Slide 26 text

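Output trends as the steering value varies
• The steer matrix is trained on sentiment classification data to generate positive text
• After training, the steering value (ε) is set to various values and positivity is evaluated
• The larger the steering value, the more positive the output becomes, and the change is continuous
  • With a negative steering value, negative text is generated
[Figure: outputs at steering values -5ε0, 0ε0, and 5ε0]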

Slide 27

Slide 27 text

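Verifying compositionality: combining multiple steers
• Two steer matrices are prepared, one trained on the sentiment classification task and one on the toxicity classification task
• Sentiment polarity (color) and toxicity (height) are plotted as each steering value is varied
• Sentiment polarity and toxicity can be seen to change independently to some extent
Note: one might wonder whether sentiment polarity and toxicity are axes that can be cleanly separated in the first place.
[Figure: plot with axes "sentiment polarity" and "toxicity"]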
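One natural way to combine two steers, assumed here rather than taken from the slides, is to add their scaled matrices: e_v → (I + ε1 W1 + ε2 W2) e_v. The toy 3-d embedding (base fit, sentiment, toxicity) and both hand-written matrices are hypothetical and chosen so that each steer touches a different coordinate.

```python
def composed_logit(context, e, steers):
    """c . (I + sum_k eps_k * W_k) e for a list of (eps, W) pairs."""
    steered = list(e)
    for eps, W in steers:
        We = [sum(w * x for w, x in zip(row, e)) for row in W]
        steered = [s + eps * y for s, y in zip(steered, We)]
    return sum(c * s for c, s in zip(context, steered))

context = [1.0, 0.0, 0.0]
e = [1.0, 0.5, -2.0]  # hypothetical: [base fit, sentiment, toxicity]

# Hypothetical steer matrices, each reading a different coordinate
W_sent = [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
W_tox = [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

base = composed_logit(context, e, [])
only_sent = composed_logit(context, e, [(2.0, W_sent)])
only_tox = composed_logit(context, e, [(-1.0, W_tox)])
both = composed_logit(context, e, [(2.0, W_sent), (-1.0, W_tox)])
```

Because the combination is linear in the logits, the effect of applying both steers is the sum of their individual effects in this toy case. Real trained steer matrices need not be this cleanly separated, which is the concern raised in the note above.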

Slide 28

Slide 28 text

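Relationship between training data size and steering performance / perplexity
• LM-Steer is evaluated while varying the amount of training data
• LM-Steer performs well even with only about 30 examples
• Compared with the baseline methods, it reaches comparable performance with about 1% of the trainable parameters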

Slide 29

Slide 29 text

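Transfer performance across models
• Verifies that a steer matrix remains applicable when the model is changed
Procedure
• Obtain a mapping H between the word embeddings of the two language models
• Replace the trained steer matrix W with H^T W H and apply it to the target language model as is
• The steer matrix can be reused and still controls the output
• Implication: in each model, the same "scaling and rotation" has a similar meaning as an operation on the embeddings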
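The replacement W → H^T W H can be checked numerically. The sketch assumes H maps target-model embeddings into the source-model embedding space (shape d_source × d_target); how H is actually fitted is not covered here, and the matrices and vectors are hypothetical. Under that assumption, steering the target model with H^T W H gives the same bilinear term as mapping both vectors with H and applying W in the source space.

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def transfer_steer(W, H):
    """Steer matrix for the target model: H^T W H."""
    return matmul(transpose(H), matmul(W, H))

# Hypothetical mapping from a 3-d target space into a 2-d source space
H = [[1.0, 0.0, 1.0],
     [0.0, 2.0, 0.0]]
# Hypothetical steer matrix trained on the source model (2 x 2)
W = [[0.0, 1.0],
     [1.0, 0.0]]

W_target = transfer_steer(W, H)  # 3 x 3, usable in the target model

# Hypothetical target-model context and output embedding
c_t = [1.0, 2.0, 3.0]
e_t = [0.5, 1.0, -1.0]

lhs = dot(c_t, matvec(W_target, e_t))                     # c'^T (H^T W H) e'
rhs = dot(matvec(H, c_t), matvec(W, matvec(H, e_t)))      # (H c')^T W (H e')
```

The transferred matrix has the target model's dimensionality, so no retraining on the target model is needed for it to be plugged in.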

Slide 30

Slide 30 text

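Summary
• Proposes LM-Steer, a fine-tuning method that targets only the output word embeddings
• Evaluated on tasks that control the generation style of language models
  • Outperforms existing methods, including LoRA
Impressions
• It is very interesting that a trained steer matrix can be reused in other models
  • The "scaling and rotation" applied by the steer matrix has a similar meaning regardless of the model
• On the other hand, applying it to tasks more complex than style looks quite difficult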