[Deep Learning] 06: RNN in Practice and Transformers

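NCCU Applied Mathematics. Mathematical Software Applications. 蔡炎龍 (Yen-Lung Tsai). RNN in Practice and Transformers. Department of Applied Mathematics, National Chengchi University. Introduction to Deep Learning 06.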

15. RNN in Practice

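Slide 314: Hands-on. Sentiment analysis of IMDB reviews.

Slide 315: A review goes into $f$, which outputs "positive" or "negative".

Slide 316: How do we input text? In natural language processing, the most basic question is how to "input" language into a model. (Diagram: a piece of text → $f$.)

Slide 317: We give every character a representative number. Text is numbered before it is fed in. (Diagram: 龍 → a number assigned to this character → $f$.)

Slide 318: Numbering the characters! The most common approach is to sort characters by how often they appear: the more frequent the character, the smaller its number. Example: 的 → 1, 一 → 2, 了 → 3, 是 → 4, 我 → 5.

Slide 319: Tokenizer. We write a function that assigns these numbers for us, and we call it a tokenizer. (Diagram: a character → tokenizer → 5.)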
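The frequency-ranked numbering described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Keras `Tokenizer`; the function names `build_char_index` and `encode_chars` and the toy corpus are made up for this example.

```python
from collections import Counter


def build_char_index(texts):
    """Map each character to a rank: the most frequent character gets 1."""
    counts = Counter(ch for text in texts for ch in text if not ch.isspace())
    # most_common() sorts by descending count; ties keep first-seen order.
    return {ch: rank for rank, (ch, _) in enumerate(counts.most_common(), start=1)}


def encode_chars(text, char_index):
    """Turn a string into a list of IDs, skipping characters never seen."""
    return [char_index[ch] for ch in text if ch in char_index]


corpus = ["我是我的", "我的"]  # toy corpus (hypothetical)
char_index = build_char_index(corpus)
print(char_index)
print(encode_chars("是我的", char_index))
```

In this toy corpus 我 appears three times, 的 twice, and 是 once, so they receive the IDs 1, 2, and 3 respectively.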

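Slide 320: Then one-hot encoding! Each character gets a one-hot vector: 的 (1) → [1, 0, 0, 0, 0, …], 一 (2) → [0, 1, 0, 0, 0, …], 了 (3) → [0, 0, 1, 0, 0, …], 是 (4) → [0, 0, 0, 1, 0, …], 我 (5) → [0, 0, 0, 0, 1, …]. Note that after one-hot encoding this is still nothing more than an ID number!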
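A minimal sketch of one-hot encoding, assuming IDs start at 1 as in the numbering above. The helper name `one_hot` is hypothetical.

```python
def one_hot(token_id, vocab_size):
    """Return a list of length vocab_size with a single 1 at position token_id - 1."""
    if not 1 <= token_id <= vocab_size:
        raise ValueError("token_id must be between 1 and vocab_size")
    vector = [0] * vocab_size
    vector[token_id - 1] = 1  # IDs start at 1, list positions at 0
    return vector


for token_id in range(1, 6):
    print(token_id, one_hot(token_id, 5))
```

Each vector carries exactly the same information as the ID itself, which is why the slide stresses that it is "still just a number".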

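Slide 321: Implementing word embedding. But this way, if there are 10,000 common characters, each character needs a 10,000-dimensional vector to "remember" it.

Slide 322: How do we input text? Usually each character (or word) is given a representative number (or vector). (Diagram: 龍 → $E$ → [94, 87, 87].) A function like this is called a word embedding. * In mathematics an embedding must be a one-to-one function that preserves some structure. We are not that strict here, but the spirit is roughly the same.

Slide 323: Implementing word embedding. The original one-hot encoding $x$ is $V$-dimensional (for example 10,000); the embedding maps it to $N$ dimensions (for example 128).

Slide 324: Implementing word embedding. In TensorFlow we can use a new kind of layer called Embedding(V, N).

Slide 325: 01. Import the deep learning packages. (Callouts on the code: the utility that processes input sequences; the embedding just mentioned; LSTM!) It doesn't look very different from before!

Slide 326: 02. Load the IMDB dataset. We usually set an upper limit on the number of distinct words (rarer words are ignored).

Slide 327: 03. Make every sentence the same length. Wasn't this supposed to be sequence-to-sequence!? * We can indeed build a seq2seq model with inputs and outputs of arbitrary length, but for computational efficiency and similar reasons, an RNN usually still takes input sequences of equal length. In other words, every sample has the same number of time steps!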
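Making every sequence the same length can be sketched as below. This is a plain-Python illustration, not the Keras utility; it pads and truncates at the front, which I believe matches the default behavior of Keras `pad_sequences`, but treat that as an assumption. The name `pad_to_length` and the toy IDs are hypothetical.

```python
def pad_to_length(sequences, maxlen, pad_value=0):
    """Left-pad short sequences with pad_value; keep the last maxlen items of long ones."""
    if maxlen <= 0:
        raise ValueError("maxlen must be positive")
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # truncate from the front if too long
        padded.append([pad_value] * (maxlen - len(seq)) + seq)
    return padded


reviews = [[5, 2, 9], [7], [1, 2, 3, 4, 5, 6]]  # toy token IDs (hypothetical)
print(pad_to_length(reviews, maxlen=4))
```

After padding, every sample has the same number of time steps, so the samples can be stacked into one batch.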

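Slide 328: 04. Modeling trilogy, part one: build our function learning machine. It feels simple!

Slide 329: 04. Modeling trilogy, part one: build our function learning machine. For binary classification (two classes), we usually use binary_crossentropy as the loss.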
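For reference, binary cross-entropy over $n$ samples with labels $y_i \in \{0, 1\}$ and predicted probabilities $p_i$ is $-\frac{1}{n}\sum_i \left[ y_i \log p_i + (1 - y_i)\log(1 - p_i) \right]$. The sketch below computes it by hand on made-up numbers; the clipping constant `eps` is an assumption added to avoid `log(0)`.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


labels = [1, 0, 1]           # 1 = positive review, 0 = negative (toy data)
confident = [0.9, 0.1, 0.8]  # predictions close to the labels
unsure = [0.6, 0.5, 0.5]     # predictions near 0.5
print(binary_crossentropy(labels, confident))
print(binary_crossentropy(labels, unsure))
```

The confident, correct predictions give the smaller loss, which is what training pushes toward.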

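Slide 330: Note the shape of the RNN layer's input! The output of the Embedding layer is what gets fed into the RNN layer.

Slide 331: Note the shape of the RNN layer's input!

Slide 332: 05. Modeling trilogy, part two: training. Validation is also done for us during training! Then comes a (compared with before) long wait...

Slide 333: 06. Modeling trilogy, part three: prediction. We don't know how tf.Keras numbered each word of the IMDB reviews (that is, we don't have the tokenizer for this dataset), so how do we feed in our own review?

Slide 334: 06. Modeling trilogy, part three: prediction. It turns out tf.Keras has this ready for us!

Slide 335: 06. Modeling trilogy, part three: prediction. Look up the token of a given word. Note: use all lowercase and remove punctuation.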
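The lowercase-and-strip-punctuation step can be sketched as follows. The word-to-ID table here is invented for illustration; in practice it would come from the dataset's own word index, and the resulting list would still need to be padded to the model's input length.

```python
import string


def encode_review(text, word_index):
    """Lowercase, strip punctuation, split on spaces, and look up each word."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [word_index[w] for w in cleaned.split() if w in word_index]


# Hypothetical word-to-ID table; a real one would come from the dataset.
toy_word_index = {"this": 11, "movie": 17, "is": 6, "great": 84}
print(encode_review("This movie is GREAT!!", toy_word_index))
```

Words missing from the table are simply skipped in this sketch; a real pipeline may map them to a dedicated out-of-vocabulary ID instead.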

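Slide 336: 06. Modeling trilogy, part three: prediction. Quite accurate! Then again, the example is too easy...

Slide 337: Hands-on. Build your own tokenizer. I want to do sentiment analysis in Chinese!

Slide 338: 01. Import Tokenizer. Basically, we are training a Tokenizer function learning machine.

Slide 339: 02. Read in the text file. Read the text and do some necessary processing.

Slide 340: 03. Trilogy, part one: open the function learning machine. char_level originally encodes letter by letter, and Chinese conveniently goes character by character.

Slide 341: 04. Trilogy, part two: training. Pay attention to the input format.

Slide 342: 05. Trilogy, part three: prediction. Pay attention to the input format.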

16. Encoder-Decoder Structure

Slide 344: Seq2seq model. Remember the seq2seq model from our chatbot? (Diagram labels: word 1, word 2, …; reply 1, reply 2, …, reply k; EOS; $c$.) Reference: Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (pp. 3104-3112).

Slide 345: Encoder-Decoder. We can think of this as an encoder-decoder structure. Here $c$ is the final hidden state output by the encoder. (Same diagram, with the encoder and decoder halves marked.)

Slide 346: Encoder-Decoder. Put more plainly, the structure is: the customer's message → encoder → $c$ → decoder → the response.

Slide 347: Encoder-Decoder. Let us unify the notation: the inputs are written $x_1, x_2, \ldots, x_T$ and the outputs $y_1, y_2, \ldots, y_k$.

Slide 348: Encoder-Decoder. If we call our RNN $f$, we have $h_t = f(h_{t-1}, x_t)$. The decoder works the same way, $h_t = f(h_{t-1}, y_t)$, but to avoid confusion we rename its hidden states: $s_t = f(s_{t-1}, y_t)$.

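Slide 349: Discussion. The final $h$ output by this encoding, $c$, is the only information we have, and it has to represent the entire preceding sentence! We then have to generate the whole response (translation, article, …) from this single vector.

Slide 350: The key representation vector $c$. The vector $c$ can be viewed as the representation vector of the input sentence (or article).

Slide 351: Adventure 01. The encoder works basically the same way as before! For inputs $x_1, x_2, \ldots, x_T$: $h_t = f_e(h_{t-1}, x_t)$ and $c = h_T = f_e(h_{T-1}, x_T)$, the key representation vector.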
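The encoder recurrence above can be sketched with a toy cell. Everything concrete here is an assumption for illustration: the hidden state is a single number, and $f_e$ is taken to be $\tanh(w_h h_{t-1} + w_x x_t + b)$ with made-up weights, whereas a real encoder would use vectors and a trained LSTM or GRU.

```python
import math


def f_e(h_prev, x, w_h=0.5, w_x=1.0, b=0.0):
    """One step of a toy scalar RNN cell: h_t = tanh(w_h*h_{t-1} + w_x*x_t + b)."""
    return math.tanh(w_h * h_prev + w_x * x + b)


def run_encoder(xs, h0=0.0):
    """Run the cell over x_1..x_T; return all hidden states and c = h_T."""
    hidden_states = []
    h = h0
    for x in xs:
        h = f_e(h, x)
        hidden_states.append(h)
    return hidden_states, h


xs = [0.2, -0.4, 0.9]  # toy inputs (hypothetical)
hidden_states, c = run_encoder(xs)
print(hidden_states)
print(c)
```

The returned `c` is simply the last entry of `hidden_states`, which is exactly the bottleneck the following slides question.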

Slide 352: Adventure 01. The decoder refers to the original $c$ at every step!! $s_t = f_d(s_{t-1}, y_t, c)$, where $c$ is the key representation vector. Reference: K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv:1406.1078, 2014.

Slide 353: Adventure 01. The decoder refers to the original $c$ at every step!! Does this $c$ have to stay fixed? $s_t = f_d(s_{t-1}, y_t, c)$.

Slide 354: Adventure 01. Our summary $c$ does not actually have to be the last hidden state of the input sequence. It can also be computed from the hidden states at all input steps: $c = q(h_1, h_2, \ldots, h_T)$, with $s_t = f_d(s_{t-1}, y_t, c)$. Reference: D. Bahdanau, K. Cho, Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473, 2014.

Slide 355: Adventure 01. Can $c$ be something that is not fixed? Different stages of the decoder may care about different parts of the input!

16. Attention

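Slide 357: Attention. When we are about to generate $y_i$, the weight we give to each earlier input word $x_1, x_2, \ldots, x_{T-1}, x_T$ may differ!

Slide 358: Attention. Placing the focus on a few points like this is called attention. Reference: D. Bahdanau, K. Cho, Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473, 2014.

Slide 359: It looks complicated when written out: the encoder states $h_1, h_2, \ldots, h_{T-1}, h_T$ are weighted by $\alpha_1, \alpha_2, \ldots, \alpha_{T-1}, \alpha_T$ and summed, $c_t = \sum_{j=1}^{T} \alpha_j h_j$, and the decoder updates $s_t = f_d(s_{t-1}, y_t, c_t)$. (Diagram: $y_t$, $y_{t+1}$, $s_{t-1}$, $s_t$.)

Slide 360: How is attention actually computed? Let's explain what is happening here.

Slide 361: How is attention actually computed? Our "attention" to each position differs in size, so each position gets a different weight: $c_t = \alpha_1 h_1 + \alpha_2 h_2 + \cdots + \alpha_T h_T$, where the $h_j$ are the values. We need to decide how the weights $\alpha_1, \alpha_2, \ldots, \alpha_T$ are determined!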

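Slide 362: How is attention actually computed? The decoder state $s_i$ is the query currently in focus, and the encoder states $h_1, h_2, \ldots, h_T$ are the keys. We compute the relevance strength for each key one by one: $e_j = a(s_i, h_j)$. Attention thus gives $e_1, e_2, \ldots, e_T$. (Diagram: encoder with $h_1, h_2, \ldots, h_T$; decoder with $y_i$, $y_{i+1}$, $s_i$, $s_{i+1}$.)

Slide 363: How is attention actually computed? How can attention be designed? $e_j = a(s_i, h_j)$ can be a single neuron, or even just a dot product: $e_j = s_i \cdot h_j$.

Slide 364: How is attention actually computed? This gives the attention strength of the current state toward every position: $e_1, e_2, \ldots, e_T$. We want the resulting weights to add up to 1, so our old friend softmax takes the stage...

Slide 365: How is attention actually computed? With the alignment model $e_{ij} = a(s_{i-1}, h_j)$, the weights are $\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})}$. This is just softmax! It is in fact a kind of "alignment".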
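The three steps above (score each key, apply softmax, take the weighted sum) can be sketched in plain Python. The scoring function here is the dot-product variant mentioned on the slides; the vectors are toy values and the helper names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Exponentiate and normalize; subtracting the max keeps exp() stable."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(s, hs):
    """Return (alphas, context) for decoder state s over encoder states hs."""
    scores = [dot(s, h) for h in hs]  # e_j = a(s, h_j), here a dot product
    alphas = softmax(scores)          # weights that sum to 1
    dim = len(hs[0])
    context = [sum(a * h[d] for a, h in zip(alphas, hs)) for d in range(dim)]
    return alphas, context


hs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy encoder states h_1..h_3
s = [2.0, 0.0]                             # toy decoder state (the query)
alphas, context = attend(s, hs)
print(alphas)
print(context)
```

The first and third encoder states score equally against this query, so they share most of the weight, and the context vector is the corresponding blend of the $h_j$.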

19. Transformer

Slide 367: Attention is All You Need. Can we do attention without an RNN? "Attention is All You Need" is the paper that set off the transformer wave. Reference: A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NIPS), 2017.

Slide 368: Self-attention solves the problem of recurrence: the encoder inputs $x_1, x_2, \ldots, x_T$ and the decoder inputs $y_1, y_2, \ldots, y_{k-1}, y_k$ are each fed in all at once.

Slide 369: Encoder. One encoder layer has two sublayers: Multi-Head (Self) Attention and Dense. The original encoder stacks 6 layers.

Slide 370: Encoder. One encoder layer: Multi-Head (Self) Attention, then Dense. Every sublayer has a ResNet-style connection.

Slide 371: Decoder. One decoder layer has three sublayers: Masked Multi-Head (Self) Attention, Multi-Head Attention (with K and V coming from the encoder), and Dense. The original decoder also stacks 6 layers.

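Slide 372: The Query-Key-Value concept. The mysterious Q, K, V: Q for queries, K for keys, V for values.

Slide 373: It is really the same as the RNN attention earlier! The query $q_i$ currently in focus attends to all keys $k_1, k_2, \ldots, k_T$, giving strengths $e_1, e_2, \ldots, e_T$, which determine the weights used to linearly combine the values: $\alpha_1 v_1 + \alpha_2 v_2 + \cdots + \alpha_T v_T$.

Slide 374: The problem is... without an RNN, where do the hidden states that serve as Q-K-V vectors come from?

Slide 375: Generate your own Q-K-V! In self-attention, the q, k, v vectors are all computed from the input embedding $x_p$: query $q_p = x_p W^Q$, key $k_p = x_p W^K$, value $v_p = x_p W^V$.

Slide 376: Generate your own Q-K-V! In both the encoder's Multi-Head (Self) Attention and the decoder's Masked Multi-Head (Self) Attention, the self-attention produces q, k, v in-house from the input vectors.

Slide 377: Only the middle sublayer of the decoder is a little different. In that Multi-Head Attention, q comes from the previous layer's output, while k and v come from the encoder.

Slide 378: Entering the attention procedure. Then we enter the "normal" attention procedure: feed in a query $q_i$, do attention with every key $k_1, k_2, \ldots, k_T$, and after softmax the resulting "strengths" become the coefficients of the values $v_j$. In this paper the attention is just an inner product.

Slide 379: Entering the attention procedure. Google loves inner products, so of course the attention supports only the inner product: $a(q_i, k_j) = q_i \cdot k_j^T$.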

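Slide 380: Q-K-V matrices. $Q = \begin{bmatrix} q_1 \\ q_2 \\ q_3 \\ \vdots \\ q_T \end{bmatrix}$, $K = \begin{bmatrix} k_1 \\ k_2 \\ k_3 \\ \vdots \\ k_T \end{bmatrix}$, $V = \begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ \vdots \\ v_T \end{bmatrix}$. The Q, K, V matrices simply collect the corresponding vectors. Note that Google loves row vectors.

Slide 381: Q-K-V matrices. The attention (inner product) of one $q_i$ with all keys is just one matrix multiplication: $q_i K^T = q_i \begin{bmatrix} k_1^T & k_2^T & k_3^T & \cdots & k_T^T \end{bmatrix} = \begin{bmatrix} e_1 & e_2 & e_3 & \cdots & e_T \end{bmatrix}$.

Slide 382: Q-K-V matrices. Applying softmax then gives the coefficients for linearly combining the values: $\mathrm{softmax}\left(\begin{bmatrix} e_1 & e_2 & e_3 & \cdots & e_T \end{bmatrix}\right) = \begin{bmatrix} \alpha_1 & \alpha_2 & \alpha_3 & \cdots & \alpha_T \end{bmatrix}$.

Slide 383: Q-K-V matrices. Linearly combining the row vectors of V is again a simple matrix multiplication: $\begin{bmatrix} \alpha_1 & \alpha_2 & \alpha_3 & \cdots & \alpha_T \end{bmatrix} V = \alpha_1 v_1 + \alpha_2 v_2 + \cdots + \alpha_T v_T$.

Slide 384: Q-K-V matrices. Writing all the $q_i$ and all the attention at once gives $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$. Here $d_k$ is the dimension of a key vector. Why divide by this mysterious number?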

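Slide 385: Q-K-V matrices. Google found that attention by inner product does not do as well as attention by a trained neuron. (Diagram: the inner product $a(q_i, k_j) = q_i \cdot k_j^T$ versus a neuron $a$ that takes $q_i$ and $k_j$ and outputs the score; the neuron wins.) For inner-product-loving Google, this was a heavy blow!

Slide 386: Q-K-V matrices. It turns out the problem is that some inner products are too large, so softmax easily becomes winner-take-all! (Diagram: $e_1, e_2, e_3, \ldots, e_T$ → softmax → $\alpha_1, \alpha_2, \alpha_3, \ldots, \alpha_T$.)

Slide 387: Q-K-V matrices. Softmax of the scores 3.9, 3.2, 1, 0.3, 1.1 gives 61%, 30%, 3%, 2%, 4%. The scores felt only slightly different to begin with...

Slide 388: Q-K-V matrices. Divide every score by $\tau = \sqrt{d_k}$: the scores become 1.74, 1.43, 0.45, 0.13, 0.49, and softmax now gives 40%, 29%, 11%, 8%, 11%. Of course $\tau$ does not have to be $\sqrt{d_k}$; it can be treated as a hyperparameter!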
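The numbers on these two slides can be reproduced with a few lines of Python. One assumption is needed: the slides do not state $d_k$, but the scaled scores 1.74, 1.43, 0.45, 0.13, 0.49 are consistent with dividing by $\sqrt{5} \approx 2.24$, so the sketch uses $d_k = 5$.

```python
import math


def softmax(scores, tau=1.0):
    """Softmax of scores / tau; a larger tau gives a flatter distribution."""
    scaled = [s / tau for s in scores]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


scores = [3.9, 3.2, 1.0, 0.3, 1.1]
d_k = 5  # assumption: inferred from the slide's scaled scores

plain = [round(100 * p) for p in softmax(scores)]
scaled = [round(100 * p) for p in softmax(scores, tau=math.sqrt(d_k))]
print(plain)
print(scaled)
```

This prints `[61, 30, 3, 2, 4]` and `[40, 29, 11, 8, 11]`, matching the percentages above: the scaling keeps the ranking but stops the largest score from taking almost everything.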

Slide 389: Q-K-V matrices. Look at the transformer's attention formula once more. Isn't it clear now? $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$.
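As a sketch, the whole formula fits in a few small functions. This version uses plain Python lists so that it has no dependencies; a real implementation would use a tensor library. Matrices are lists of rows, following the row-vector convention of the slides, and the example matrices are toy values.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Apply softmax to every row independently."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V, with one query/key/value per row."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    return matmul(softmax_rows(scaled), v)


Q = [[1.0, 0.0], [0.0, 1.0]]  # toy queries
K = [[1.0, 0.0], [0.0, 1.0]]  # toy keys
V = [[1.0, 2.0], [3.0, 4.0]]  # toy values
for row in attention(Q, K, V):
    print(row)
```

Each output row is a weighted average of the rows of `V`, with the query's best-matching key contributing the most.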

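Slide 390: Multihead Attention. Thinking about it seriously, there is no reason to have only one kind of attention, so we can define the $n$-th attention: $\mathrm{Attention}(QW^Q_n, KW^K_n, VW^V_n)$.

Slide 391: Masked Multihead Attention. Transformer encoder: note that a transformer outputs exactly as many vectors as it receives (character or word vectors) as input.

Slide 392: Masked Multihead Attention. Transformer decoder: the same holds for the decoder, except that inputs not yet available at the start are masked.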
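One common way to implement this masking is to set the scores of not-yet-available positions to $-\infty$ before the softmax, so that they receive zero weight. The sketch below shows that idea on toy scores; the helper names are hypothetical.

```python
import math


def causal_mask(scores):
    """Set scores[i][j] to -inf for j > i so position i cannot see later positions."""
    return [
        [s if j <= i else float("-inf") for j, s in enumerate(row)]
        for i, row in enumerate(scores)
    ]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


scores = [[0.0, 0.0, 0.0] for _ in range(3)]  # toy attention scores
weights = [softmax(row) for row in causal_mask(scores)]
for row in weights:
    print(row)
```

With equal scores, the first position attends only to itself, the second splits its weight over the first two positions, and the third spreads it over all three.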

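Slide 393: Perhaps we should now say neural networks have a "Big Four": DNN, CNN, RNN, and the newest member, Transformer!

Slide 394: In fact, what Google originally meant was... "Attention is All You Need" means all we need is transformers. (Cartoon: Transformer tells CNN and RNN, "You two can pack your bags and go!", with DNN as the little servant.)

Slide 395: But RNNs have not really disappeared! It is true that many applications that used RNNs (especially in NLP) have switched to transformers. But some people still use RNNs (including in Google's research), and more importantly, some models that are not traditionally called RNNs still use the RNN idea of feeding in earlier temporal data.

Slide 396: As for replacing CNNs… In principle it makes some sense for transformers to replace CNNs, but the Vision Transformer (ViT) was only published at ICLR 2021 (although it is not the only work of its kind). Reference: A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby (Google). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. International Conference on Learning Representations (ICLR), 2021. arXiv:2010.11929.

Slide 397: Even going all-MLP turns out to be strong! No CNN, no transformer, and we still process images! Posted to arXiv on 2021/05/04. Yann LeCun says it is not really free of CNNs... Reference: I. Tolstikhin, N. Houlsby, A. Kolesnikov, L. Beyer, X. Zhai, T. Unterthiner, J. Yung, D. Keysers, J. Uszkoreit, M. Lucic, and A. Dosovitskiy (Google). MLP-Mixer: An all-MLP Architecture for Vision. arXiv:2105.01601.

Slide 398: What Q-K-V teaches us. Key-value pairs can be seen as our "memory"; the query is what is actually happening or being attended to. Although key-value-query did not originate with the transformer, it inspires this way of thinking. These are all ways of representing our data. Isn't the real point to look more deliberately for a suitable representation?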

17. Finding Semantic Embeddings

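Slide 400: Text must go through word embedding before entering the computer. In natural language processing, the most basic question is how to "input" language into a model. (Diagram: a piece of text → $f$.)

Slide 401: But basically we cannot prepare training data for this, unless the example is as obvious as this one: input 龍 → $f$ → output [94, 87].

Slide 402: Recap. Text is given a number before it is fed in. (Diagram: 龍 → $f$; this too must become a number before it can enter the computer.)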


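Slide 405: Then one-hot encoding! Each character gets a one-hot vector: 的 (1) → [1, 0, 0, 0, 0, …], 一 (2) → [0, 1, 0, 0, 0, …], 了 (3) → [0, 0, 1, 0, 0, …], 是 (4) → [0, 0, 0, 1, 0, …], 我 (5) → [0, 0, 0, 0, 1, …]. Note that after one-hot encoding this is still nothing more than an ID number!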

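Slide 406: Word2Vec. Let's use the famous Word2Vec to see how word embedding is done! Similar words end up close together! Google's official site: https://code.google.com/archive/p/word2vec/

Slide 407: Word2Vec. Once trained, it has many fancy capabilities. (Diagram: Paris, France, Italy, Rome; king, man, woman, queen.) Reference: T. Mikolov, K. Chen, G. Corrado, J. Dean. Efficient Estimation of Word Representations in Vector Space. Proceedings of Workshop at ICLR, 2013.

Slide 408: Pretext task. We can have the computer do some small task, one that we believe it can only complete if it "understands the meaning of the text". A task like this, which is not our real final goal and usually serves to train a good representation vector, is called a pretext task. Embedding: we decide how many dimensions to embed into, say 128, and then put 128 neurons in a hidden layer in the middle of the neural network!

Slide 409: Word2Vec's small tasks. Word2Vec designs two tasks. CBOW model: use the surrounding words $w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}$ to predict the middle word $w_t$.

Slide 410: Word2Vec's small tasks. CBOW model: $w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}$ → $w_t$. From this we can find the embedding of a word!

Slide 411: Word2Vec's small tasks. Or, fancier still, train a function like this! Skip-Gram model: use the middle word $w_t$ to predict the surrounding words $w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}$.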
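The training examples for these two pretext tasks can be generated as sketched below. This only builds the (input, target) pairs with a window of 2 on each side, as drawn on the slides; it does not train anything, and the function names and the toy sentence are made up for illustration.

```python
def cbow_pairs(tokens, window=2):
    """(context words, center word) pairs: predict the middle from its neighbors."""
    pairs = []
    for t, center in enumerate(tokens):
        left = tokens[max(0, t - window):t]
        right = tokens[t + 1:t + 1 + window]
        pairs.append((left + right, center))
    return pairs


def skipgram_pairs(tokens, window=2):
    """(center word, one context word) pairs: predict neighbors from the middle."""
    return [
        (center, ctx)
        for context, center in cbow_pairs(tokens, window)
        for ctx in context
    ]


sentence = ["i", "drink", "coffee", "every", "day"]  # toy sentence
print(cbow_pairs(sentence)[2])
print(skipgram_pairs(sentence)[:3])
```

For the middle word "coffee", CBOW gets the four neighbors as input, while skip-gram turns the same window into four separate (center, neighbor) examples.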

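Slide 412: Memory or understanding. Word2vec gives us an insight: the weights themselves can be the part we use later, and can be regarded as a kind of "memory". $W = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1N} \\ w_{21} & w_{22} & \cdots & w_{2N} \\ \vdots & \vdots & & \vdots \\ w_{i1} & w_{i2} & \cdots & w_{iN} \\ \vdots & \vdots & & \vdots \\ w_{V1} & w_{V2} & \cdots & w_{VN} \end{bmatrix}$. We can even think of this small piece of the network as "understanding".

Slide 413: Memory or understanding. With a one-hot input $x = [0, 0, \ldots, 1, \ldots, 0]^T$ (the 1 in position $i$), $W^T x = h = [w_{i1}, w_{i2}, \ldots, w_{iN}]^T$, which is the $i$-th row of $W$. For word2vec, this can also be seen as the output of the hidden layer!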
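A quick numerical check of this identity: multiplying $W^T$ by a one-hot vector just picks out one row of $W$. The matrix values below are arbitrary toy numbers, and `matvec_transposed` is a hypothetical helper written out by hand.

```python
def matvec_transposed(w, x):
    """Compute W^T x for W given as a list of V rows of length N."""
    n = len(w[0])
    return [sum(w[i][j] * x[i] for i in range(len(w))) for j in range(n)]


W = [  # toy V x N weight matrix with V = 4, N = 3 (hypothetical values)
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]
x = [0, 0, 1, 0]  # one-hot vector for the third word

h = matvec_transposed(W, x)
print(h)
print(h == W[2])
```

This is why an embedding layer can be implemented as a table lookup rather than a full matrix product.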

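Slide 414: Common NLP concepts. While we are at it, let's introduce two commonly used NLP concepts: Bag of Words (詞袋模型) and n-Gram (n元語法).

Slide 415: Bag of Words (BOW). Suppose we have a sentence $\{w_1, w_2, \ldots, w_T\}$. Each word goes into the "bag" for that word, $B_1, B_2, B_3, \ldots, B_V$. Finally we count how many words each bag holds, for example 2, 0, 3, …, 0, and use the counts to represent the sentence. The sentence is then represented as the vector $[2, 0, 3, \ldots, 0]$.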
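A minimal bag-of-words sketch, with a made-up four-word vocabulary chosen so that the counts echo the example vector above:

```python
from collections import Counter


def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs; word order is discarded."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


vocabulary = ["coffee", "tea", "drink", "milk"]  # toy vocabulary, V = 4
sentence = ["drink", "coffee", "drink", "coffee", "drink"]
print(bag_of_words(sentence, vocabulary))
```

This prints `[2, 0, 3, 0]`. Any reordering of the same words gives the same vector, which is the main limitation of the representation.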

Slide 416: n-Gram. This means considering neighboring words together. Suppose our sentence is again $\{w_1, w_2, \ldots, w_T\}$ and we want a 2-gram representation; that gives $[[w_1, w_2], [w_2, w_3], \ldots, [w_{T-1}, w_T]]$.
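The same construction for general $n$ is a one-line slice. The function name `ngrams` and the placeholder tokens are just for illustration.

```python
def ngrams(tokens, n=2):
    """Return every run of n consecutive tokens, in order."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]


sentence = ["w1", "w2", "w3", "w4"]
print(ngrams(sentence, 2))
print(ngrams(sentence, 3))
```

A sentence of $T$ tokens yields $T - n + 1$ n-grams, and none at all if it is shorter than $n$.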

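Slide 417: Traditional word embedding still has a drawback. With word embedding, a fixed character (or word) basically has a fixed feature vector. But... compare "這個人的個性有點天天。" (here 天天 describes a person's personality) with "我天天都會喝一杯咖啡。" (I drink a cup of coffee every day, where 天天 means "every day"). A character or a word may mean different things in different places.

Slide 418: Semantic word embedding! Encode by meaning! (Diagram: a particular meaning → $f$ → code.) Can this really be done?

Slide 419: ELMo, which opened the "Sesame Street era" of natural language processing! From AI2. Reference: M.E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer. Deep contextualized word representations. NAACL 2018. arXiv:1802.05365v2.

Slide 420: We actually already have it! The hidden states of the chatbot model are very good embeddings! (Diagram: an RNN reading <BOS> 我 天 天 … and predicting characters such as 喝, 咖, 啡; its hidden states are the embeddings we want.)

Slide 421: Nobody says we can have only one layer! (Diagram: two stacked layers, LSTM1 and LSTM2, each with its own hidden states.)

Slide 422: So we get a more "customized" embedding: the token vector and the hidden states $h_i$ of the two layers are combined with weights $w_1, w_2, w_3$. We learn $w_1, w_2, w_3$ only when we actually use it, and the result is the "real" embedding. The earlier parts, which need large amounts of training data, do not have to be touched!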

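Slide 423: BERT, leading a new era of natural language processing. From Google. Reference: J. Devlin, M.W. Chang, K. Lee, K. Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805v2.

Slide 424: Transformer. Uses self-attention to avoid the drawbacks of RNNs! Reference: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998-6008).

Slide 425: Transformer. BERT's architecture is basically the transformer encoder. One of its training methods is a cloze (fill-in-the-blank) task: given "我天天都會喝一杯__。" (I drink a cup of __ every day.), BERT fills in 咖啡 (coffee).

Slide 426: Standard use of BERT. Remember that BERT uses a transformer, so there are as many outputs as inputs. An application such as sentiment analysis looks like this: the Transformer Encoder receives [cls], word 1, …, word T.

Slide 427: Feel the power of BERT. https://github.com/google-research/bert is the official BERT release, including a Chinese version! (BERT: "I can handle almost every natural language task!")

Slide 428: Of course, it's not just BERT… https://simpletransformers.ai/ is Simple Transformers, even simpler than Transformers. https://github.com/huggingface/transformers is Hugging Face's famous Transformers, which gathers models from all the big names.

Slide 429: The transformer version of ELMo. BERT set off another transformer wave, and even a transformer version of ELMo (from AI2) appeared... Reference: M. E. Peters, M. Neumann, L. Zettlemoyer, W.-T. Yih. Dissecting Contextual Word Embeddings: Architecture and Representation. EMNLP 2018.

Slide 430: The world-shocking GPT "king of bluffing" series. GPT-2, from OpenAI, is basically the transformer decoder. It improves on BERT's weakness at generating text, and was (at the time) frighteningly large: 1,500* parameters. * All units are millions. Reference: A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 2019.

Slide 431: The world-shocking GPT "king of bluffing" series. The famous made-up article about "discovering unicorns". Better Language Models and Their Implications, https://openai.com/blog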

Slide 432: XLNet. Uses Transformer-XL and a permutation training method. Reference: Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, Q. V. Le. XLNet: Generalized Autoregressive Pretraining for Language Understanding. NeurIPS 2019.

Slide 433: RoBERTa. A BERT with strengthened training, from Facebook. The more books it reads, the stronger it gets!

Slide 434: MegatronLM. A super display of firepower: 8,300 (million parameters)!! Reference: M. Shoeybi, M. M. A. Patwary, P. Puri, P. LeGresley, J. Casper, B. Catanzaro. Megatron-LM: Training Multi-Billion Parameter Language Models Using GPU Model Parallelism. arXiv:1909.08053, 2019.

Slide 435: Talk to Transformer. https://talktotransformer.com/ Try out the power of a transformer! It now runs Megatron!

Slide 436: Talk to Transformer. Zero-Shot Learning.

Slide 437: GPT-3, the third-generation king of bluffing. https://openai.com/blog/openai-api/ With the new GPT-3 you can apply for API access and use it in your own application! 175,000 (million parameters).

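Slide 438: GPT-3, the third-generation king of bluffing. Tell the computer what kind of mathematical expression you want, and it writes the expression out.

Slide 439: GPT-3 makes you feel like the big boss! "For the web page layout, I want this and that fixed!"

Slide 440: GPT-3 makes you feel like the big boss! "Here is an Excel sheet. Figure out what's missing yourself!"

Slide 441: GPT-3 makes you feel like the big boss! "Draw me a chart like this and that. I want it now!"

Slide 442: Going lightweight. Then everyone finally realized that if this arms race continues, many scenarios cannot really use these models. A compact BERT has arrived: DistilBERT! Reference: V. Sanh, L. Debut, J. Chaumond, T. Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. NeurIPS 2019.

Slide 443: DistilBERT. DistilBERT is really the idea of a big BERT training a small BERT. ("Come, let me teach you!") Reference: G. Hinton, O. Vinyals, J. Dean. Distilling the knowledge in a neural network. arXiv:1503.02531, 2015.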
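The "big model teaches small model" idea can be sketched as a loss between softened output distributions. This shows only the soft-target part, under the assumption of a temperature of 2 and made-up logits; the full recipes in the cited papers combine it with other loss terms.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature; higher temperature softens the output."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between the softened teacher and student distributions."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))


teacher_logits = [4.0, 1.0, 0.5]  # toy values (hypothetical)
close_student = [3.5, 1.2, 0.4]   # roughly agrees with the teacher
far_student = [0.5, 4.0, 1.0]     # disagrees with the teacher
print(distillation_loss(teacher_logits, close_student))
print(distillation_loss(teacher_logits, far_student))
```

The student that mimics the teacher's whole distribution, not just its top choice, gets the lower loss; that extra signal is what the small model learns from.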

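Slide 444: Comparison of parameter counts. (Bar chart, axis from 0 to 500, for ELMo, GPT, BERT, XLNet, RoBERTa, DistilBERT, and the Transformer version of ELMo; bar labels shown: 94, 110, 340, 465, 340, 355.) The frightening GPT-2/3 and MegatronLM are not included...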

18. Playing with Transformers

Slide 446: Hugging Face's Transformers. Install Hugging Face's transformers package: > pip install transformers

Slide 447: Gradio 2. > pip install gradio. Easily build an AI app.

Slide 448: Demo. http://bit.ly/yenlung Go to my code area and find AI-Demo > Gradio2_快速_NLP.ipynb.
