
Transformer

Nariaki Tateiwa
June 12, 2022

Transcript

  1. Transformer: a model proposed by Google Brain in "Attention is all you need" (2017)^1. It reached SOTA on many tasks with an encoder-decoder architecture built from attention and feed-forward layers, and has spawned variants such as the Conformer (speech) and the Vision Transformer (images). 1: Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems 30 (2017).
  2. DeepMind's generalist AI agent^1 is also built on a Transformer. 1: Reed, Scott, et al. "A Generalist Agent." arXiv preprint arXiv:2205.06175 (2022).
  3. References: [1] (MLP series), 2015. [2] AIcia Solid Project, "Transformer - Multi-Head Attention vol.28", YouTube, 2021. [3] Arithmer, "Transformer", 2021.
  4. Major deep-learning architectures: RNN (sequential data), GAN (generation), CNN (images), and the Transformer. Training data is a set of input-label pairs D = {(x_i, y_i)}, where x_i is an input and y_i its label.
  5. A neural network is built from units that switch ON or OFF depending on their inputs^1. 1: Figure from the MIC White Paper on Information and Communications (AI section), https://www.soumu.go.jp/johotsusintokei/whitepaper/ja/r01/html/nd113210.html
  6. Attention: given items {z_i} and a query q, compute a relevance score r_i = r(z_i, q) for each item, turn the scores into weights (a_1, ..., a_N) = softmax(r_1, ..., r_N)^1, and output the weighted sum F = Σ_i a_i z_i (a NumPy sketch follows below). 1: softmax: a_i = e^{r_i} / Σ_j e^{r_j}, so Σ_i a_i = 1, a_i ≥ 0, and r_i ≥ r_j implies a_i ≥ a_j.
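
The weighted-sum view of attention on this slide fits in a few lines of NumPy; the function names and the choice of the dot product as the relevance function r are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def softmax(r):
    e = np.exp(r - r.max())            # subtract max for numerical stability
    return e / e.sum()

def attend(Z, q):
    scores = Z @ q                      # r_i = r(z_i, q), here the dot product
    a = softmax(scores)                 # (a_1, ..., a_N), sums to 1
    return a @ Z                        # F = sum_i a_i z_i

Z = np.array([[0., 1., 3.], [3., 4., -1.], [1., 0., -4.], [-3., 2., 1.]])
q = np.array([1., 0., 1.])
print(attend(Z, q))                     # items most aligned with q dominate F
```
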
  7. Attention (continued): the weights a_i assigned to the items z_i give the output F = Σ_i a_i z_i. When the items z_i come from a source sequence and the query comes from a target sequence, this is source-to-target attention; when the query is drawn from the same sequence as the z_i, it is self-attention.
  8. Attention over text: the sentence is first split into tokens (tokenization), and each token is mapped to a vector by a word embedding (sketch below): sentence → [token_0, token_1, token_2, token_3] → {z_i}: z_0 = [0, 1, 3], z_1 = [3, 4, -1], z_2 = [1, 0, -4], z_3 = [-3, 2, 1].
  9. A good word embedding maps words a and b with similar meanings to nearby vectors z_a and z_b, so the inner product ⟨z_a, z_b⟩ can serve as a similarity measure between words (example below).
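
For instance, with the toy embeddings above, the inner product ranks some token pairs as more related than others; which pairs are compared here is just an assumption.

```python
import numpy as np

z0 = np.array([0., 1., 3.])
z2 = np.array([1., 0., -4.])
z3 = np.array([-3., 2., 1.])

# Larger <z_a, z_b> means the embedding treats the two tokens as more similar.
print(np.dot(z0, z3))   #  5.0  -> relatively similar
print(np.dot(z0, z2))   # -12.0 -> dissimilar
```
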
  10. In the Transformer, the relevance between an item z_i and a query q_j (both n-dimensional) is the scaled dot product r(z_i, q_j) = ⟨z_i, q_j⟩/√d with d = n, and the output for query q_j is F_j = Σ_i a_i z_i = softmax(q_j Z^T/√d) Z^1 (sketch below). This is scaled dot-product attention. 1: Z is the matrix whose i-th row is z_i.
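
A sketch of this per-query form, reusing the toy embeddings from slide 8; the values and function names are illustrative.

```python
import numpy as np

def softmax(r):
    e = np.exp(r - r.max())
    return e / e.sum()

def sdp_attention_single(q_j, Z):
    d = Z.shape[1]                        # d = n, the vector dimension
    scores = Z @ q_j / np.sqrt(d)         # r(z_i, q_j) = <z_i, q_j> / sqrt(d)
    a = softmax(scores)                   # attention weights a_i
    return a @ Z                          # F_j = sum_i a_i z_i

Z = np.array([[0., 1., 3.], [3., 4., -1.], [1., 0., -4.], [-3., 2., 1.]])
q = np.array([1., 0., 1.])
print(sdp_attention_single(q, Z))
```
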
  11. Key-Value attention: the Transformer splits each item into a key and a value, like a key-value store (a Python dict or a C++ map). The query is matched against the keys, and the resulting weights mix the values: with k_i, q_i, v_i all n-dimensional, F_j = softmax(q_j K^T/√d) V (sketch below).
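
Attention therefore behaves like a soft dict lookup: instead of returning the single value whose key matches, it returns a softmax-weighted mix of all values. The toy keys, values, and query below are assumptions.

```python
import numpy as np

def softmax(r):
    e = np.exp(r - r.max())
    return e / e.sum()

def kv_attention_single(q_j, K, V):
    d = K.shape[1]
    a = softmax(K @ q_j / np.sqrt(d))     # match the query against the keys
    return a @ V                          # F_j = softmax(q_j K^T / sqrt(d)) V

K = np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])   # keys
V = np.array([[10., 0.], [0., 10.], [5., 5.]])             # values
q = np.array([4., 0., 0.])                                  # query close to key 0
print(kv_attention_single(q, K, V))       # output dominated by the first value
```
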
  12. Scaled dot-product attention as used in practice: from the embedded input x, compute Q ← xW_Q, K ← xW_K, V ← xW_V and F = softmax(QK^T/√d) V, where W_Q, W_K, W_V are learned projection matrices (sketch below).
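
A sketch with random weights standing in for the learned W_Q, W_K, W_V; the sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(s, axis=-1):
    e = np.exp(s - s.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sdp_attention(x, W_Q, W_K, W_V):
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V       # project the embeddings
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)             # (S, S) score matrix
    return softmax(scores) @ V                # F = softmax(QK^T / sqrt(d)) V

S, E = 4, 3                                   # sequence length, embedding dim
x = rng.normal(size=(S, E))                   # embedded tokens
W_Q, W_K, W_V = (rng.normal(size=(E, E)) for _ in range(3))
print(sdp_attention(x, W_Q, W_K, W_V).shape)  # (4, 3)
```
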
  13. Scaled dot-product attention in tensor form, with B: batch size, S: sequence length, E: embedding dim. With Q, K, V of shape (B, S, E): s = QK^T, t = softmax(s/√d), F = tV (sketch below).
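
The same computation written directly on (B, S, E) tensors; the batch and sequence sizes here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
B, S, E = 2, 5, 8                              # batch, sequence, embedding dim
Q, K, V = (rng.normal(size=(B, S, E)) for _ in range(3))

s = Q @ K.transpose(0, 2, 1)                   # s = QK^T, shape (B, S, S)
s = s / np.sqrt(E)
t = np.exp(s - s.max(axis=-1, keepdims=True))
t = t / t.sum(axis=-1, keepdims=True)          # t = softmax(s / sqrt(d)) over keys
F = t @ V                                      # F = tV, shape (B, S, E)
print(F.shape)
```
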
  14. Multi-Head attention, with B: batch size, S: sequence length, E: embedding dim, h: number of heads. Q, K, V are each split into h pieces of size E/h, scaled dot-product attention runs independently in each head, and the h outputs are concatenated back together (sketch below).
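
A sketch of the split → per-head attention → concatenation flow; the shapes and the omission of the usual input/output projections are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
B, S, E, h = 2, 5, 8, 4                       # E must be divisible by h
Q, K, V = (rng.normal(size=(B, S, E)) for _ in range(3))

def split_heads(x):                           # (B, S, E) -> (B, h, S, E/h)
    return x.reshape(B, S, h, E // h).transpose(0, 2, 1, 3)

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

Qh, Kh, Vh = map(split_heads, (Q, K, V))
scores = Qh @ Kh.transpose(0, 1, 3, 2) / np.sqrt(E // h)   # (B, h, S, S)
heads = softmax(scores) @ Vh                               # (B, h, S, E/h)
F = heads.transpose(0, 2, 1, 3).reshape(B, S, E)           # concatenate the heads
print(F.shape)                                             # (2, 5, 8)
```
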
  15. Residual connection: instead of feeding x through a layer such as a ReLU block f alone[1], the output is taken as z = f(x) + x, i.e. the input x is added back onto f(x); this skip path is the residual connection. Why? If the block has nothing useful to add, f(x) can simply shrink toward 0 and z falls back to x, which keeps deep stacks of layers easy to optimize (sketch below).
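
A sketch of a residual connection around a small feed-forward block f; the weights are random and the block structure is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
E = 8
W1 = rng.normal(size=(E, E)) * 0.01
W2 = rng.normal(size=(E, E)) * 0.01

def f(x):
    return np.maximum(0.0, x @ W1) @ W2    # ReLU feed-forward sub-layer

x = rng.normal(size=(E,))
z = f(x) + x                               # residual connection
# With small weights f(x) is near 0, so z stays close to x (the identity),
# which is why the skip path keeps deep residual stacks easy to train.
print(np.linalg.norm(z - x))
```
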