Transformer - Speaker Deck

Transformer

by Nariaki Tateiwa

Slide 1

Slide 1 text

Transformer 2022-06-07

Slide 2

Slide 2 text

Transformer Google brain Attention is all you need, 2017 , SOTA attention FeedForward Encoder Decoder (Conformer) (Vision Transformer) 1 2 1: Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017). 2: 2

Slide 3

Slide 3 text

DeepMind AI Transformer 1 1: Reed, Scott, et al. "A Generalist Agent." arXiv preprint arXiv:2205.06175 (2022). 3

Slide 4

Slide 4 text

[1] . 2 (MLP ). , 2015. [2] AIcia Solid Project, Transformer - Multi-Head Attention vol.28 , youtube, 2021 [3] Arithmer, Transformer , 2021 4

Slide 5

Slide 5 text

attention transformer ([3] ) , BERT, GPT Transofor positional encoding, normalize 5

Slide 6

Slide 6 text

Attention Transformer Scaled dot-product attention Multi-head attention Residual connection Encoder/Decoder Transformer : 6

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

> RNN( ) GAN( ) CNN( ) Transformer( ) D = {(x , y )} i i x i y i 8

Slide 9

Slide 9 text

> ON, OFF 1: : > > > > AI , https://www.soumu.go.jp/johotsusintokei/whitepaper/ja/r01/html/nd113210.html 9

Slide 10

Slide 10 text

> 1. 2. ( ) y = f(x) 10

Slide 11

Slide 11 text

--> Attention 11

Slide 12

Slide 12 text

Attention

Slide 13

Slide 13 text

Attention " (query) " [1, Chapter7.2] ( )" " 13

Slide 14

Slide 14 text

Attention > : query query ( ) query {z } i z i q r r = i r(z , q) i r z i q (a , ..., a ) = 1 N softmax(r , ..., r ) 1 N 1 F = a z ∑ i i 1: softmax a = i r /( e ) i ∑ r i a = ∑ i 1 a ≥ i 0 r ≥ i r j a ≥ i a j 14

Slide 15

Slide 15 text

Attention > query query → query ( ) a i z i F = a z ∑ i i z i source, query target source-to-target attention source target self-attention z i 15

Slide 16

Slide 16 text

Attention > ( ; token) tokenize token word embedding " " → [" ", " ", " ", " "] → = [0, 1, 3], # = [3, 4, -1], # = [1, 0, -4], # = [-3, 2, 1], # {z } i z 0 z 1 z 2 z 3 16

Slide 17

Slide 17 text

Attention > word embedding word embedding → a b z a z b ⟨v , v ⟩ a n 1 1: ( ) ( ) ( ) 17

Slide 18

Slide 18 text

Attention > ( ) ( ) query Transformer scaled dot-product attention z i q i i n z = i q i r(z , q ) i j ⟨z , q ⟩/d i j 1 d = n q j F = j a z = ∑ i i softmax( )Z d q Z j T 2 1: i j 2: Z z i 18

Slide 19

Slide 19 text

Attention > Key-Value Transformer (key) (value) ( ) key-value (python dict c++ map ) key value query key value ? k , q , v i i i i n F = j softmax( )V d q K j T 19

Slide 20

Slide 20 text

Transformer

Slide 21

Slide 21 text

Transormer Multi-Head Attention Self-Attention Add & Norm (Residual connection) Encoder/ Decoder 21

Slide 22

Slide 22 text

Scaled dot-product attention embeding x Q ← xW Q K ← xW K V ← xW V F = softmax( )V d Q K T W , W , W Q K V 22

Slide 23

Slide 23 text

Scaled dot-product attention B: , S: , E: (embeded dim) s = Q K, t = T softmax(s/d), F = tV 23

Slide 24

Slide 24 text

Multi-Head attention Q, K, V scaled dot-product attention (concat) ( ) attention 24

Slide 25

Slide 25 text

Multi-Head attention B: , S: , E: (embeded dim), h: ; Q, K, V h h attention (concatration) 25

Slide 26

Slide 26 text

Multi-Head attention 26

Slide 27

Slide 27 text

Residual connection "ReLU "[1] z = f(x) + x , f(x) x residual connection ? f(x) 0 27