
Transformer

Nariaki Tateiwa

June 12, 2022

Transcript

  1. Transformer: proposed by the Google Brain team in "Attention is all you need" (2017). 1
     It reaches SOTA results with an Encoder-Decoder architecture built from attention and FeedForward layers, and has spawned variants such as the Conformer (speech) and the Vision Transformer (images).
     1: Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).
  2. DeepMind's generalist AI agent is also built on the Transformer. 1
     1: Reed, Scott, et al. "A Generalist Agent." arXiv preprint arXiv:2205.06175 (2022).
  3. References:
     [1] (MLP series), 2015.
     [2] AIcia Solid Project, "Transformer - Multi-Head Attention vol.28", YouTube, 2021.
     [3] Arithmer, "Transformer", 2021.
  4. Major architectures: RNN (recurrent neural network), GAN (generative adversarial network), CNN (convolutional neural network), and the Transformer. Supervised training data is a set of input-label pairs D = {(x_i, y_i)}, where x_i is an input and y_i its label.
  5. [Figure: a neuron turning ON or OFF] 1
     1: Source: Ministry of Internal Affairs and Communications white paper on AI, https://www.soumu.go.jp/johotsusintokei/whitepaper/ja/r01/html/nd113210.html
  6. Attention > the basic mechanism: given a set of vectors {z_i} and a query q, compute a relevance score r_i = r(z_i, q) for each z_i, turn the scores into weights (a_1, ..., a_N) = softmax(r_1, ..., r_N), and return the weighted sum F = Σ_i a_i z_i.
     1: softmax: a_i = e^{r_i} / Σ_j e^{r_j}, so that Σ_i a_i = 1, a_i ≥ 0, and r_i ≥ r_j implies a_i ≥ a_j.
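
     A minimal NumPy sketch of this weighting scheme; the dot product used as the relevance function r is an illustrative assumption (the slide leaves r generic), and the example vectors reuse the numbers from slide 8:

        import numpy as np

        def softmax(r):
            # a_i = e^{r_i} / sum_j e^{r_j}; subtract the max for numerical stability
            e = np.exp(r - r.max())
            return e / e.sum()

        def attention(Z, q, relevance=np.dot):
            # Z: (N, d) array of vectors z_i, q: (d,) query
            r = np.array([relevance(z, q) for z in Z])   # r_i = r(z_i, q)
            a = softmax(r)                               # (a_1, ..., a_N)
            return a @ Z                                 # F = sum_i a_i z_i

        Z = np.array([[0., 1., 3.], [3., 4., -1.], [1., 0., -4.], [-3., 2., 1.]])
        q = np.array([1., 0., 1.])
        F = attention(Z, q)
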
  7. Attention > source and target: the output F = Σ_i a_i z_i mixes the vectors z_i with the weights a_i. When the z_i come from a source sequence and the query comes from a target sequence, this is source-to-target attention; when source and target coincide (the queries are taken from the z_i themselves), it is self-attention, as in the sketch below.
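
     A short sketch of the two modes under the same assumptions as above (dot-product relevance, random illustrative vectors):

        import numpy as np

        def attend(Z, q):
            # attention over the set Z for a single query q
            r = Z @ q
            a = np.exp(r - r.max()); a /= a.sum()
            return a @ Z

        rng = np.random.default_rng(0)
        Z_source = rng.standard_normal((5, 4))   # source vectors z_i
        Z_target = rng.standard_normal((3, 4))   # target vectors

        # source-to-target attention: the queries come from the target side
        F_st = np.stack([attend(Z_source, q) for q in Z_target])

        # self-attention: the queries come from the same set {z_i}
        F_self = np.stack([attend(Z_source, q) for q in Z_source])
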
  8. Attention > from text to vectors: a sentence is split into tokens (tokenized), and each token is turned into a vector by word embedding:
     sentence → [token_0, token_1, token_2, token_3]
             → z_0 = [0, 1, 3], z_1 = [3, 4, -1], z_2 = [1, 0, -4], z_3 = [-3, 2, 1],
     which gives the set {z_i}.
  9. Attention > why word embeddings: once words a and b are represented by embedding vectors v_a and v_b, how related they are can be measured numerically, e.g. by the inner product ⟨v_a, v_b⟩.
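
     For example (with made-up embedding vectors):

        import numpy as np

        v_a = np.array([0., 1., 3.])    # embedding of word a
        v_b = np.array([3., 4., -1.])   # embedding of word b
        v_c = np.array([-3., 2., 1.])   # embedding of word c

        # larger inner product -> the words are treated as more related
        print(np.dot(v_a, v_b), np.dot(v_a, v_c))
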
  10. Attention > scaled dot-product attention: in the Transformer, each position i carries a vector z_i and a query q_i, and the relevance function is the scaled dot product r(z_i, q_j) = ⟨z_i, q_j⟩ / √d, where d is the dimension of the vectors. Stacking the z_i as the rows of a matrix Z, the output for query q_j is
      F_j = Σ_i a_i z_i = softmax(q_j Z^T / √d) Z.
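
     A sketch of this formula in NumPy, assuming the queries q_j are stacked as the rows of a matrix Q:

        import numpy as np

        def softmax(x):
            e = np.exp(x - x.max())
            return e / e.sum()

        def scaled_dot_product_attention(Z, Q):
            # Z: (N, d) vectors z_i, Q: (M, d) queries q_j
            d = Z.shape[1]
            # F_j = softmax(q_j Z^T / sqrt(d)) Z, computed per query
            return np.stack([softmax(q @ Z.T / np.sqrt(d)) @ Z for q in Q])
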
  11. Attention > key-value attention: the Transformer further splits each element into a key k_i (what the query is matched against) and a value v_i (what is actually returned), much like a key-value store (a python dict or a c++ map). The query is compared with the keys, and the corresponding values are mixed:
      F_j = softmax(q_j K^T / √d) V.
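
     The same computation with separate keys and values, as a sketch (shapes are illustrative):

        import numpy as np

        def softmax(x):
            e = np.exp(x - x.max())
            return e / e.sum()

        def kv_attention(Q, K, V):
            # Q, K: (N, d), V: (N, d_v)
            # like a dict lookup, but every value is returned with a soft weight
            d = K.shape[1]
            return np.stack([softmax(q @ K.T / np.sqrt(d)) @ V for q in Q])
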
  12. Scaled dot-product attention: from the embedded input x, compute
      Q ← x W_Q,  K ← x W_K,  V ← x W_V,
      F = softmax(Q K^T / √d) V,
      where W_Q, W_K, W_V are learned weight matrices.
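
     A sketch of the full layer; the random initialisation stands in for learned weights, and the sizes are illustrative:

        import numpy as np

        rng = np.random.default_rng(0)
        S, E = 4, 8                          # sequence length, embedding dim
        x = rng.standard_normal((S, E))      # embedded input

        W_Q, W_K, W_V = (rng.standard_normal((E, E)) for _ in range(3))

        Q, K, V = x @ W_Q, x @ W_K, x @ W_V
        s = Q @ K.T / np.sqrt(E)                        # scores, (S, S)
        t = np.exp(s - s.max(axis=-1, keepdims=True))
        t /= t.sum(axis=-1, keepdims=True)              # row-wise softmax
        F = t @ V                                       # (S, E)
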
  13. Scaled dot-product attention on batched tensors. B: batch size, S: sequence length, E: embedding dim. The computation is
      s = Q K^T,  t = softmax(s / √d),  F = t V.
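
     The batched form as a sketch, keeping the (B, S, E) shapes from the slide:

        import numpy as np

        def batched_attention(Q, K, V):
            # Q, K, V: (B, S, E)
            d = Q.shape[-1]
            s = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d)     # scores, (B, S, S)
            t = np.exp(s - s.max(axis=-1, keepdims=True))
            t /= t.sum(axis=-1, keepdims=True)              # softmax over the last axis
            return t @ V                                    # F, (B, S, E)

        B, S, E = 2, 5, 8
        rng = np.random.default_rng(1)
        F = batched_attention(*(rng.standard_normal((B, S, E)) for _ in range(3)))
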
  14. Multi-Head attention. B: batch size, S: sequence length, E: embedding dim, h: number of heads. Q, K, V are each split into h heads, attention is computed per head, and the h outputs are joined back together (concatenation).
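
     A sketch of the split-attend-concatenate pattern (the per-head and output projection matrices used in the full Transformer are omitted here):

        import numpy as np

        def multi_head_attention(Q, K, V, h):
            # Q, K, V: (B, S, E); the embedding dim E is split into h heads of size E // h
            B, S, E = Q.shape
            d = E // h

            def split(x):                     # (B, S, E) -> (B, h, S, d)
                return x.reshape(B, S, h, d).transpose(0, 2, 1, 3)

            Qh, Kh, Vh = split(Q), split(K), split(V)
            s = Qh @ np.swapaxes(Kh, -1, -2) / np.sqrt(d)   # (B, h, S, S)
            t = np.exp(s - s.max(axis=-1, keepdims=True))
            t /= t.sum(axis=-1, keepdims=True)
            heads = t @ Vh                                   # (B, h, S, d)
            # concatenate the h heads back into (B, S, E)
            return heads.transpose(0, 2, 1, 3).reshape(B, S, E)
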
  15. Residual connection [1]: the output of a layer f(x) is added back to its input x, z = f(x) + x; this skip path is the residual connection. Why does it help? The layer only has to learn the residual f(x), and when f(x) is close to 0 the block reduces to the identity, which makes deep networks easier to train.
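
     A sketch of a residual block wrapped around a small ReLU layer (the layer itself is illustrative):

        import numpy as np

        def residual_block(f, x):
            # z = f(x) + x: the layer only has to learn the residual f(x);
            # if f(x) is close to 0, the block is close to the identity map
            return f(x) + x

        rng = np.random.default_rng(0)
        W = rng.standard_normal((8, 8)) * 0.01
        f = lambda x: np.maximum(0.0, x @ W)    # a small ReLU layer
        x = rng.standard_normal(8)
        z = residual_block(f, x)
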