
Transformer

Nariaki Tateiwa
June 12, 2022

Transcript

  1. Transformer: a model proposed by Google Brain in "Attention is all you need" (2017)^1. It reached SOTA on many tasks with an encoder-decoder architecture built from attention and feed-forward layers, and has spawned variants such as the Conformer (speech) and the Vision Transformer (images). 1: Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems 30 (2017).
  2. DeepMind's generalist AI agent^1 is also built on a Transformer. 1: Reed, Scott, et al. "A Generalist Agent." arXiv preprint arXiv:2205.06175 (2022).
  3. References: [1] (MLP series), 2015. [2] AIcia Solid Project, "Transformer - Multi-Head Attention vol.28", YouTube, 2021. [3] Arithmer, "Transformer", 2021.
  4. Major deep-learning architectures: RNN (sequential data), GAN (generation), CNN (images), and the Transformer. Training data is a set of input-label pairs D = {(x_i, y_i)}, where x_i is an input and y_i its label.
  5. A neural network is built from units that switch ON or OFF depending on their inputs^1. 1: Figure from the MIC White Paper on Information and Communications (AI section), https://www.soumu.go.jp/johotsusintokei/whitepaper/ja/r01/html/nd113210.html
  6. Attention: given items {z_i} and a query q, compute a relevance score r_i = r(z_i, q) for each item, turn the scores into weights (a_1, ..., a_N) = softmax(r_1, ..., r_N)^1, and output the weighted sum F = Σ_i a_i z_i (a NumPy sketch follows below). 1: softmax: a_i = e^{r_i} / Σ_j e^{r_j}, so Σ_i a_i = 1, a_i ≥ 0, and r_i ≥ r_j implies a_i ≥ a_j.
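
The weighted-sum view of attention on this slide fits in a few lines of NumPy; the function names and the choice of the dot product as the relevance function r are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def softmax(r):
    e = np.exp(r - r.max())            # subtract max for numerical stability
    return e / e.sum()

def attend(Z, q):
    scores = Z @ q                      # r_i = r(z_i, q), here the dot product
    a = softmax(scores)                 # (a_1, ..., a_N), sums to 1
    return a @ Z                        # F = sum_i a_i z_i

Z = np.array([[0., 1., 3.], [3., 4., -1.], [1., 0., -4.], [-3., 2., 1.]])
q = np.array([1., 0., 1.])
print(attend(Z, q))                     # items most aligned with q dominate F
```
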
  7. Attention (continued): the weights a_i assigned to the items z_i give the output F = Σ_i a_i z_i. When the items z_i come from a source sequence and the query comes from a target sequence, this is source-to-target attention; when the query is drawn from the same sequence as the z_i, it is self-attention.
  8. Attention over text: the sentence is first split into tokens (tokenization), and each token is mapped to a vector by a word embedding (sketch below): sentence → [token_0, token_1, token_2, token_3] → {z_i}: z_0 = [0, 1, 3], z_1 = [3, 4, -1], z_2 = [1, 0, -4], z_3 = [-3, 2, 1].
  9. A good word embedding maps words a and b with similar meanings to nearby vectors z_a and z_b, so the inner product ⟨z_a, z_b⟩ can serve as a similarity measure between words (example below).
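
For instance, with the toy embeddings above, the inner product ranks some token pairs as more related than others; which pairs are compared here is just an assumption.

```python
import numpy as np

z0 = np.array([0., 1., 3.])
z2 = np.array([1., 0., -4.])
z3 = np.array([-3., 2., 1.])

# Larger <z_a, z_b> means the embedding treats the two tokens as more similar.
print(np.dot(z0, z3))   #  5.0  -> relatively similar
print(np.dot(z0, z2))   # -12.0 -> dissimilar
```
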
  10. In the Transformer, the relevance between an item z_i and a query q_j (both n-dimensional) is the scaled dot product r(z_i, q_j) = ⟨z_i, q_j⟩/√d with d = n, and the output for query q_j is F_j = Σ_i a_i z_i = softmax(q_j Z^T/√d) Z^1 (sketch below). This is scaled dot-product attention. 1: Z is the matrix whose i-th row is z_i.
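
A sketch of this per-query form, reusing the toy embeddings from slide 8; the values and function names are illustrative.

```python
import numpy as np

def softmax(r):
    e = np.exp(r - r.max())
    return e / e.sum()

def sdp_attention_single(q_j, Z):
    d = Z.shape[1]                        # d = n, the vector dimension
    scores = Z @ q_j / np.sqrt(d)         # r(z_i, q_j) = <z_i, q_j> / sqrt(d)
    a = softmax(scores)                   # attention weights a_i
    return a @ Z                          # F_j = sum_i a_i z_i

Z = np.array([[0., 1., 3.], [3., 4., -1.], [1., 0., -4.], [-3., 2., 1.]])
q = np.array([1., 0., 1.])
print(sdp_attention_single(q, Z))
```
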
  11. Key-Value attention: the Transformer splits each item into a key and a value, like a key-value store (a Python dict or a C++ map). The query is matched against the keys, and the resulting weights mix the values: with k_i, q_i, v_i all n-dimensional, F_j = softmax(q_j K^T/√d) V (sketch below).
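
Attention therefore behaves like a soft dict lookup: instead of returning the single value whose key matches, it returns a softmax-weighted mix of all values. The toy keys, values, and query below are assumptions.

```python
import numpy as np

def softmax(r):
    e = np.exp(r - r.max())
    return e / e.sum()

def kv_attention_single(q_j, K, V):
    d = K.shape[1]
    a = softmax(K @ q_j / np.sqrt(d))     # match the query against the keys
    return a @ V                          # F_j = softmax(q_j K^T / sqrt(d)) V

K = np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])   # keys
V = np.array([[10., 0.], [0., 10.], [5., 5.]])             # values
q = np.array([4., 0., 0.])                                  # query close to key 0
print(kv_attention_single(q, K, V))       # output dominated by the first value
```
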
  12. Scaled dot-product attention as used in practice: from the embedded input x, compute Q ← xW_Q, K ← xW_K, V ← xW_V and F = softmax(QK^T/√d) V, where W_Q, W_K, W_V are learned projection matrices (sketch below).
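
A sketch with random weights standing in for the learned W_Q, W_K, W_V; the sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(s, axis=-1):
    e = np.exp(s - s.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sdp_attention(x, W_Q, W_K, W_V):
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V       # project the embeddings
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)             # (S, S) score matrix
    return softmax(scores) @ V                # F = softmax(QK^T / sqrt(d)) V

S, E = 4, 3                                   # sequence length, embedding dim
x = rng.normal(size=(S, E))                   # embedded tokens
W_Q, W_K, W_V = (rng.normal(size=(E, E)) for _ in range(3))
print(sdp_attention(x, W_Q, W_K, W_V).shape)  # (4, 3)
```
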
  13. Scaled dot-product attention in tensor form, with B: batch size, S: sequence length, E: embedding dim. With Q, K, V of shape (B, S, E): s = QK^T, t = softmax(s/√d), F = tV (sketch below).
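
The same computation written directly on (B, S, E) tensors; the batch and sequence sizes here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
B, S, E = 2, 5, 8                              # batch, sequence, embedding dim
Q, K, V = (rng.normal(size=(B, S, E)) for _ in range(3))

s = Q @ K.transpose(0, 2, 1)                   # s = QK^T, shape (B, S, S)
s = s / np.sqrt(E)
t = np.exp(s - s.max(axis=-1, keepdims=True))
t = t / t.sum(axis=-1, keepdims=True)          # t = softmax(s / sqrt(d)) over keys
F = t @ V                                      # F = tV, shape (B, S, E)
print(F.shape)
```
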
  14. Multi-Head attention, with B: batch size, S: sequence length, E: embedding dim, h: number of heads. Q, K, V are each split into h pieces of size E/h, scaled dot-product attention runs independently in each head, and the h outputs are concatenated back together (sketch below).
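
A sketch of the split → per-head attention → concatenation flow; the shapes and the omission of the usual input/output projections are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
B, S, E, h = 2, 5, 8, 4                       # E must be divisible by h
Q, K, V = (rng.normal(size=(B, S, E)) for _ in range(3))

def split_heads(x):                           # (B, S, E) -> (B, h, S, E/h)
    return x.reshape(B, S, h, E // h).transpose(0, 2, 1, 3)

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

Qh, Kh, Vh = map(split_heads, (Q, K, V))
scores = Qh @ Kh.transpose(0, 1, 3, 2) / np.sqrt(E // h)   # (B, h, S, S)
heads = softmax(scores) @ Vh                               # (B, h, S, E/h)
F = heads.transpose(0, 2, 1, 3).reshape(B, S, E)           # concatenate the heads
print(F.shape)                                             # (2, 5, 8)
```
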
  15. Residual connection: instead of feeding x through a layer such as a ReLU block f alone[1], the output is taken as z = f(x) + x, i.e. the input x is added back onto f(x); this skip path is the residual connection. Why? If the block has nothing useful to add, f(x) can simply shrink toward 0 and z falls back to x, which keeps deep stacks of layers easy to optimize (sketch below).
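
A sketch of a residual connection around a small feed-forward block f; the weights are random and the block structure is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
E = 8
W1 = rng.normal(size=(E, E)) * 0.01
W2 = rng.normal(size=(E, E)) * 0.01

def f(x):
    return np.maximum(0.0, x @ W1) @ W2    # ReLU feed-forward sub-layer

x = rng.normal(size=(E,))
z = f(x) + x                               # residual connection
# With small weights f(x) is near 0, so z stays close to x (the identity),
# which is why the skip path keeps deep residual stacks easy to train.
print(np.linalg.norm(z - x))
```
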