… based on the sequence model

Summary
• Proposed method: target-side-attentive residual recurrent network for decoding
• Result: the method improved translation without increasing the number of parameters
• Takeaway: improve translation without complicated computation
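To make the idea concrete, below is a minimal sketch of one decoding step with a target-side residual connection. Everything here (class and tensor names, the GRU cell, unbatched shapes) is an illustrative assumption rather than the paper's exact implementation; the parameter-free mean summary shows how the residual path can avoid adding parameters, and the self-attentive variant replaces the mean with the learned attention weights defined on the next slide.

```python
import torch
import torch.nn as nn

class ResidualDecoderStep(nn.Module):
    """One RNN decoding step with a target-side residual connection
    (illustrative sketch, not the paper's exact implementation)."""

    def __init__(self, dim: int):
        super().__init__()
        self.gru = nn.GRUCell(dim, dim)

    def forward(self, y_t, s_prev, Y_past):
        # y_t:    (dim,)   embedding of the current target word
        # s_prev: (dim,)   previous decoder state
        # Y_past: (t, dim) embeddings of the words generated so far
        s_t = self.gru(y_t, s_prev)   # recurrent update
        d_t = Y_past.mean(dim=0)      # parameter-free summary of the past
        return s_t + d_t              # residual combination
```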
!" # = softmax ," # ," # = -. tanh 1 2 3" + 1 5 6# or ," # = -. tanh 1 2 3" the context does not necessarily lead to the best translation the content is effective to extract representations of the whole sentence Results Impact of the Attention Function 13
Qualitative Analysis: Distribution of Attention
• Attention weights of the self-attentive RNN vary for each word
• The self-attentive residual connections model focuses on particular words
Translation examples (R: reference, B: baseline, O: proposed model)

Better than baseline
R: Students and teachers are taking the date lightly.
B: Students and teachers are being taken lightly to the date.
O: Students and teachers are taking the date lightly.

R: Not because he shared their world view, but because for him, human rights are indivisible.
B: Not because I share his ideology, but because he is indivisible by human rights.
O: Not because he shared his ideology, but because for him human rights are indivisible.

Worse than baseline
R: The Government is trying not to build so many small houses.
B: The government is trying not to build so many small houses.
O: The government is trying to ensure that so many small houses are not built.

R: Other people can have children.
B: Other people can have children.
O: Others may have children.
Appendix: Performance on Language Modeling
• … model, but not the 4-gram LSTM
• This shows that even if a model improves language modeling, it does not necessarily improve machine translation