
What is the Transformer actually doing?

This deck explains, mainly through figures, what the Transformer is ultimately doing.
It is part of the material used at our lab's paper-reading session.


Yusuke-Takagi-Q

April 24, 2021

Transcript

  1. What is the Transformer actually doing?
     Yusuke Takagi, Nagoya Institute of Technology, Takeuchi & Karasuyama Lab, 2021/04/19
  2. Introduction
     • All figures are either my own or quoted from the paper "Attention Is All You Need"
     • These slides are part of a presentation of "Attention Is All You Need" at a paper-reading session, so the explanation follows the original paper
     • The focus is on what the Transformer is doing, so training procedures, post-training inference, and settings other than machine translation are not covered
  3. Overview: Training flow
     • Before explaining the Transformer's structure, let's review the training flow
     • In the machine-translation setting, training follows the flow in the figure below
     [Figure: the Encoder reads "I have a pen ." from the dataset; the Decoder reads the shifted answer "<BOS> 私 は ペン を 持つ 。" and the Transformer predicts "私 は ペン を 持つ 。 <EOS>"]
  4. Overview: Training flow
     • For example, when the Transformer predicts the first word, the information in "I have a pen ." and "<BOS>" is available
     • <BOS> is a special token that marks the beginning of a sentence
     [Figure: the same Encoder–Decoder flow, highlighting the first prediction step]
  5. Overview: Training flow
     • When the Transformer predicts the second word, the information in "I have a pen ." and "<BOS> 私" is available
     [Figure: the same Encoder–Decoder flow, highlighting the second prediction step]
  6. Overview: Model architecture
     • As the figure below shows, the Transformer is a stack of self-attention and fully connected layers
     • The following slides explain the five numbered parts in detail
     [Figure 1: The Transformer - model architecture, with parts ①–⑤ marked]
  7. Part 1: Overall
     • Review: the overall picture of the Transformer
     [Figure 1: The Transformer - model architecture, with parts ①–⑤ marked]
  8. Part 1: Embedding and Positional Encoding
     • To feed a sentence into the Transformer, each word must be converted into a vector
     • Here, learned embeddings convert the input and output words into d_model-dimensional vectors
     • <EOS> is a special token that marks the end of a sentence
     • <pad> is a padding token used to make sentences the same length
     [Figure: "I have a pen ." embedded as the matrix "I have a pen . <EOS> <pad> <pad>" with d_model columns]
  9. Part 1: Embedding and Positional Encoding
     • Relative or absolute word-position information must be added to the sentence matrix
     • This is because the Transformer has no recurrence, convolution, or any other operation that can take order into account
     • Without it, the difference between "I have a pen ." and "I pen . a have" cannot be encoded
     • The paper uses the following formulas for the Positional Encoding:
       PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
       PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
     • pos is the word position and i is the dimension of the vector
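The two formulas above can be computed directly; a minimal NumPy sketch (the function name and the small sizes are my own, for illustration):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Build the (seq_len, d_model) sinusoidal positional-encoding matrix."""
    pos = np.arange(seq_len)[:, None]          # word positions 0..seq_len-1
    i = np.arange(d_model // 2)[None, :]       # dimension-pair index i
    angles = pos / 10000 ** (2 * i / d_model)  # pos / 10000^(2i/d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions: sin
    pe[:, 1::2] = np.cos(angles)               # odd dimensions: cos
    return pe

pe = positional_encoding(8, 16)
print(pe.shape)  # (8, 16)
```

This matrix is simply added element-wise to the embedded sentence before the first encoder layer.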
  10. Part 1: Embedding and Positional Encoding
     [Figure: the positional-encoding matrix (d_model columns) is added element-wise to the embedded sentence "I have a pen . <EOS> <pad> <pad>"; e.g. PE(7, 2i) = sin(7 / 10000^(2i/d_model)), PE(6, 2i+1) = cos(6 / 10000^(2i/d_model))]
  11. Part 1: Embedding and Positional Encoding
     • Why use these functions for the Positional Encoding?
       ⇒ For any fixed offset k, PE_{pos+k} can be written as a linear function of PE_{pos}, so the authors hypothesize that this lets the model easily learn relative positional relationships
     • It does not necessarily have to be these exact functions
       u_i = 1 / 10000^(2i/d_model)
       PE(pos, 2i)   = sin(pos · u_i)
       PE(pos, 2i+1) = cos(pos · u_i)
       PE(pos+k, 2i) = sin((pos + k) · u_i) = PE(pos, 2i) · cos(k · u_i) + PE(pos, 2i+1) · sin(k · u_i)
  12. Part 2: Overall
     • Review: the overall picture of the Transformer
     [Figure 1: The Transformer - model architecture, with parts ①–⑤ marked]
  13. Part 2: Encoder self-attention

  14. Part 2: Encoder self-attention
     • Attention can be described as mapping a query and a set of key–value pairs to an output
     • The query, keys, values, and output are all vectors
     • The output is computed as a weighted sum of the values
     • The weight assigned to each value is computed from the query and the corresponding key
     • This is hard to grasp in words alone, so the following slides walk through the self-attention computation with figures
  15. Part 2: Encoder self-attention
     • First, compute the query, key, and value from the Source and Target using the parameters W^q, W^k, W^v
     • Source = Target ⇒ self-attention
     [Figure: with the same input "I have a pen . <EOS> <pad> <pad>" as both Source S and Target T: Value = S W^v, Key = S W^k, Query = T W^q]
  16. Part 2: Encoder self-attention
     • Next, compute softmax(QK⊤)
     • Intuitively, the resulting matrix represents the relationships between the words of the input sentence
     • For example: "I" and "have" are probably strongly related, while "I" and "a" is probably a weaker pair
     [Figure: softmax(QK⊤) as a word-by-word matrix over "I have a pen . <EOS> <pad> <pad>"]
  17. Part 2: Encoder self-attention
     • Finally, compute softmax(QK⊤)V and add the Target (residual connection)
     • The resulting matrix is the original Target information plus, for each word, the information of the words related to it
     [Figure: softmax(QK⊤)V added to the Target matrix via the residual connection]
  18. Part 2: Encoder self-attention
     • Looking at the computation of softmax(QK⊤)V in a little more detail:
     [Figure: the row of softmax(QK⊤) for "I" represents the relationships between "I" and the input words; multiplying it by Value and summing gives the weighted value vector for "I"]
  19. Part 2: Encoder self-attention
     • Summary of the encoder self-attention flow
     [Figure: Source/Target → Q = T W^q, K = S W^k, V = S W^v → softmax(QK⊤) → softmax(QK⊤)V → add Target]
  20. Part 2: Scaled Dot-Product Attention
     • The explanation so far used softmax(QK⊤)V, but in practice it must be scaled: softmax(QK⊤ / √d_k)V
     • d_k: the dimension of the queries and keys
     • Why is the scaling necessary? ⇒ To avoid vanishing gradients
     • As the vector dimension grows, the dot products tend to grow too, and without scaling the softmax becomes almost the same operation as max
  21. Part 2: Multi-Head Attention
     • The paper additionally uses Multi-Head Attention
     • The following slides explain Multi-Head Attention
  22. Part 2: Multi-Head Attention
     • First, compute the queries, keys, and values from the Source and Target using the parameters W^q_i, W^k_i, W^v_i
     • For h-head attention, W^q_i, W^k_i ∈ R^{d_model×d_k} and W^v_i ∈ R^{d_model×d_v}
     • d_k = d_v = d_model / h, i = 1, …, h
     • In other words, with h heads the dimensions of the queries, keys, and values are 1/h of the single-head case
     [Figure: per-head projections Value_i = S W^v_i, Key_i = S W^k_i, Query_i = T W^q_i for the input "I have a pen . <EOS> <pad> <pad>"]
  23. Part 2: Multi-Head Attention
     • Next, compute softmax(Q_i K_i⊤) in each head
     [Figure: softmax(Q_1 K_1⊤) in Head 1 and softmax(Q_2 K_2⊤) in Head 2]
  24. Part 2: Multi-Head Attention
     • Finally, compute Concat(softmax(Q_i K_i⊤)V_i)W^O and add the Target
     • W^O ∈ R^{h·d_v × d_model} is a parameter
     [Figure: softmax(Q_i K_i⊤)V_i per head → Concat(head_1, head_2)W^O → add Target]
  25. Part 2: Multi-Head Attention
     • The overall flow of Multi-Head Attention is basically the same
     • The difference is that one attention is computed per head, and the results are concatenated and projected back to the original dimension
     • The advantage of Multi-Head Attention is that, with several attentions computed in parallel, more complex relationships can be encoded
  26. Part 2: Layer Normalization

  27. Part 2: Layer Normalization
     • The Transformer applies layer normalization after each residual connection
     • Each word is normalized separately
     • β, γ: parameters
     • µ_i, σ_i: the mean and standard deviation of each word
     • ϵ: a small constant to avoid division by zero
       LayerNorm(x_i) = (x_i − µ_i) / (σ_i + ϵ) · γ + β
     [Figure: µ_i, σ_i computed per word over "<BOS> 私 は ペン を 持つ 。 <EOS>"]
  28. Part 2: Position-wise Feed-Forward Networks

  29. Part 2: Position-wise Feed-Forward Networks
     • Each attention sub-layer is followed by a fully connected feed-forward network
     • W_1, W_2, b_1, b_2: parameters
       FFN(x) = max(0, xW_1 + b_1)W_2 + b_2
  30. Part 2: Summary
     • Review: the overall picture of Part 2
     • This is the basic layer of the Transformer
     • In the encoder, this block is stacked N = 6 times
  31. Part 3: Overall
     • Review: the overall picture of the Transformer
     [Figure 1: The Transformer - model architecture, with parts ①–⑤ marked]
  32. Part 3: Masked self-attention
     • First, compute the query, key, and value from the Source and Target using the parameters W^q, W^k, W^v
     [Figure: the shifted target "<BOS> 私 は ペン を 持つ 。 <EOS>" from the embedding serves as both Source and Target: Value = S W^v, Key = S W^k, Query = T W^q]
  33. Part 3: Masked self-attention
     • Next, compute softmax(QK⊤ − Mask)
     • The Mask is used to prevent the use of future information
     • When predicting the (n+1)-th word, the Transformer may only use words 1 through n
     • When predicting "ペン" in the translation "私 は ペン を 持つ 。", only "<BOS> 私 は" may be used
     [Figure: Mask is a matrix with ∞ above the diagonal and 0 elsewhere, subtracted from QK⊤ to prevent the use of future information]
  34. Part 3: Masked self-attention
     • Next, compute softmax(QK⊤ − Mask)
     • The Mask is used to prevent the use of future information
     • When predicting the (n+1)-th word, the Transformer may only use words 1 through n
     • When predicting "ペン" in the translation "私 は ペン を 持つ 。", only "<BOS> 私 は" may be used
     [Figure: softmax(QK⊤ − Mask) as a lower-triangular attention matrix over "<BOS> 私 は ペン を 持つ 。 <EOS>"]
  35. Part 3: Masked self-attention
     • Finally, compute softmax(QK⊤ − Mask)V and add the Target (residual connection)
     [Figure: softmax(QK⊤ − Mask)V added to the Target matrix via the residual connection]
  36. Part 4: Overall
     • Review: the overall picture of the Transformer
     [Figure 1: The Transformer - model architecture, with parts ①–⑤ marked]
  37. Part 4: Decoder attention
     • First, compute the query, key, and value from the Source and Target using the parameters W^q, W^k, W^v
     • In this attention, the Source comes from the encoder and the Target comes from Part 3 (the masked self-attention layer)
     [Figure: Source "I have a pen . <EOS> <pad> <pad>" from the encoder; Target "<BOS> 私 は ペン を 持つ 。 <EOS>" from the masked self-attention layer; Value = S W^v, Key = S W^k, Query = T W^q]
  38. Part 4: Decoder attention
     • Next, compute softmax(QK⊤)
     • Intuitively, the resulting matrix represents the relationships needed to translate the Source into the Target
     • For example: "I" and "私" should be strongly related in order to predict the next word "は"
     [Figure: softmax(QK⊤) as a Target-by-Source matrix between "<BOS> 私 は ペン を 持つ 。 <EOS>" and "I have a pen . <EOS> <pad> <pad>"]
  39. Part 4: Decoder attention
     • Finally, compute softmax(QK⊤)V and add the Target
     • The resulting matrix is the original Target information plus, for each Target word, the information of the related Source words
     [Figure: softmax(QK⊤)V added to the Target matrix via the residual connection]
  40. Part 3, 4: Summary
     • Review: the overall picture of Parts 3 and 4
     • In the decoder, too, this block is stacked N = 6 times
     [Figure: the decoder block, with the encoder output feeding the decoder attention]
  41. Part 5: Overall
     • Review: the overall picture of the Transformer
     [Figure 1: The Transformer - model architecture, with parts ①–⑤ marked]
  42. Part 5: Predict
     • Each row of the matrix output by the decoder layer is used to predict the next word
     [Figure: decoder output for "<BOS> 私 は ペン を 持つ 。 <EOS>" → Linear → Softmax → probabilities over all words in the dataset → predicted next words "私 は ペン を 持つ 。 <EOS>"]