[Paper introduction] Self-Attentive Residual Decoder for Neural Machine Translation

Slides from a NAACL paper-reading session held in our lab. The paper applies attention on the decoder (target) side in machine translation.

onizuka laboratory

July 11, 2018

Transcript

  1. Introduction of “Self-Attentive Residual Decoder for Neural Machine Translation” by Lesly Miculicich Werlen, Nikolaos Pappas, Dhananjay Ram, and Andrei Popescu-Belis (NAACL 2018). Presenter: Nomoto Eriko, 2018/07/04.
  2. Contents: Summary (Improve translation without complicated calculation); Method (Self-Attentive Residual Decoder, Memory RNN, Self-attentive RNN); Experimental Settings (Datasets); Results (Overview, Impact of the Attention Function); Qualitative Analysis (Distribution of Attention, Translation Examples); Appendix
  3. Contents: Summary (Improve translation without complicated calculation); Method (Self-Attentive Residual Decoder, Memory RNN, Self-attentive RNN); Experimental Settings (Datasets); Results (Overview, Impact of the Attention Function); Qualitative Analysis (Distribution of Attention, Translation Examples); Appendix
  4. Summary: Improve translation without complicated calculation
     • Problem with previous research: the target-side context is based solely on the sequence model
     • Proposed method: a target-side-attentive residual recurrent network for decoding
     • Result: the method improves translation without increasing the number of parameters
  5. Contents: Summary (Improve translation without complicated calculation); Method (Self-Attentive Residual Decoder, Memory RNN, Self-attentive RNN); Experimental Settings (Datasets); Results (Overview, Impact of the Attention Function); Qualitative Analysis (Distribution of Attention, Translation Examples); Appendix
  6. Method: Self-Attentive Residual Decoder
     • Define a target-side summary vector d_t and use it instead of the previous word y_{t-1} when predicting the target word y_t
     • Two ways of defining d_t:
       - Mean residual connections: d_t^{mean} = (1/(t-1)) \sum_{i=1}^{t-1} y_i
       - Self-attentive residual connections: d_t^{satt} = \sum_{i=1}^{t-1} \alpha_i^t y_i, where \alpha_i^t = softmax(e_i^t) and e_i^t = v^T \tanh(W y_i + U s_t) or e_i^t = v^T \tanh(W y_i)
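As a rough, non-authoritative illustration of the two summary vectors defined on slide 6, the NumPy sketch below computes d_t^{mean} and d_t^{satt} from the embeddings of the previously generated words. The parameter names W, U, v and all dimensions are assumptions chosen for the example, not the authors' Theano code.

```python
# Minimal sketch of the target-side summary vectors (not the authors' code).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mean_residual(Y_prev):
    """d_t^mean: simple average of the t-1 previous word embeddings."""
    return Y_prev.mean(axis=0)

def self_attentive_residual(Y_prev, s_t, W, U, v, use_context=True):
    """d_t^satt: attention-weighted sum of previous word embeddings.
    Scores: e_i^t = v^T tanh(W y_i + U s_t), or without the U s_t term."""
    scores = Y_prev @ W.T                 # (t-1, a): W y_i for each position i
    if use_context:
        scores = scores + (U @ s_t)       # broadcast the decoder-state term
    e = np.tanh(scores) @ v               # (t-1,) attention energies
    alpha = softmax(e)                    # attention weights over positions
    return alpha @ Y_prev                 # weighted sum of embeddings

# Toy usage: 4 previous words, 500-dim embeddings, 1024-dim decoder state.
rng = np.random.default_rng(0)
Y_prev = rng.normal(size=(4, 500))
s_t = rng.normal(size=1024)
W = rng.normal(size=(128, 500))
U = rng.normal(size=(128, 1024))
v = rng.normal(size=128)
d_mean = mean_residual(Y_prev)
d_satt = self_attentive_residual(Y_prev, s_t, W, U, v)
```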
  7. Method (other self-attentive networks): Memory RNN
     • Based on the proposal by Cheng et al. (2016)
     • The decoder state is defined as h_t = f(\tilde{h}_t, y_{t-1}, c_t), where \tilde{h}_t = \sum_{i=1}^{t-1} \alpha_i^t h_i, \alpha_i^t = softmax(e_i^t), and e_i^t = g(h_i, y_{t-1}, \tilde{h}_{t-1})
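The sketch below shows only the attention step of the Memory RNN described on slide 7. It is not Cheng et al.'s implementation: the score g(h_i, y_{t-1}, \tilde{h}_{t-1}) is realised as an additive MLP with hypothetical parameters Wh, Wy, Ws, v, an illustrative choice rather than the exact formulation.

```python
# Minimal sketch of the Memory RNN attention over previous decoder states.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def memory_summary(H_prev, y_prev, h_tilde_prev, Wh, Wy, Ws, v):
    """h~_t = sum_i alpha_i^t h_i over previous states h_1..h_{t-1},
    with an assumed additive-MLP score over (h_i, y_{t-1}, h~_{t-1})."""
    e = np.tanh(H_prev @ Wh.T + Wy @ y_prev + Ws @ h_tilde_prev) @ v
    return softmax(e) @ H_prev

# The recurrence would then compute the next state from this summary,
# e.g. h_t = rnn_cell(h_tilde_t, y_prev, c_t) for some RNN cell (assumption).
```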
  8. Method (other self-attentive networks): Self-Attention RNN
     • Based on the proposal by Daniluk et al. (2017)
     • The output distribution is defined as p(y_t | y_1, ..., y_{t-1}, c_t) \approx g(h_t, y_{t-1}, c_t, \tilde{h}_t), where \tilde{h}_t = \sum_{i=1}^{t-1} \alpha_i^t h_i, \alpha_i^t = softmax(e_i^t), and e_i^t = f(h_i, h_t)
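Similarly, here is a minimal sketch of the Self-Attention RNN step from slide 8. The MLP score and the concatenation-based readout matrix W_o are assumptions made for illustration, not necessarily the exact layers of Daniluk et al. (2017) or of this paper.

```python
# Minimal sketch of the Self-Attention RNN summary and output step.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def self_attention_summary(H_prev, h_t, W1, W2, v):
    """h~_t = sum_i alpha_i^t h_i with e_i^t = v^T tanh(W1 h_i + W2 h_t)."""
    e = np.tanh(H_prev @ W1.T + W2 @ h_t) @ v
    return softmax(e) @ H_prev

def output_distribution(h_t, y_prev, c_t, h_tilde, W_o):
    """p(y_t | y_<t, c_t) approximated as softmax(W_o [h_t; y_{t-1}; c_t; h~_t])."""
    return softmax(W_o @ np.concatenate([h_t, y_prev, c_t, h_tilde]))
```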
  9. Contents: Summary (Improve translation without complicated calculation); Method (Self-Attentive Residual Decoder, Memory RNN, Self-attentive RNN); Experimental Settings (Datasets); Results (Overview, Impact of the Attention Function); Qualitative Analysis (Distribution of Attention, Translation Examples); Appendix
  10. Experimental Settings: Datasets
      • En2Zh: training = UN parallel corpus (500K), validation = UN parallel corpus (2K), test = UN parallel corpus (2K), vocab size = 25K
      • Es2En: training = WMT2013 Europarl v7 and News Commentary v11 (2.1M), validation = Newstest2012, test = Newstest2013, vocab size = 50K
      • En2De: training = WMT 2016 (4.5M), validation = Newstest2013, test = Newstest2015 and Newstest2016, vocab size = 50K
  11. Experimental Settings: Implementation Settings
      • OOV handling: Byte Pair Encoding
      • Maximum sentence length: 50 words
      • Word embeddings: 500 dimensions
      • Hidden layer: 1024 dimensions
      • Framework: Theano
      • Optimizer: Adadelta
      • Training period: 7-12 days
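For reference, the reported hyperparameters can be collected into a plain configuration sketch; the dictionary and its key names below are my own summary, not the authors' actual Theano configuration.

```python
# Summary of the reported implementation settings as a plain Python dict.
training_config = {
    "oov_handling": "byte pair encoding (BPE)",
    "max_sentence_length": 50,        # words
    "word_embedding_dim": 500,
    "hidden_layer_dim": 1024,
    "framework": "Theano",
    "optimizer": "Adadelta",
    "training_period_days": (7, 12),  # reported range
}
```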
  12. Contents: Summary (Improve translation without complicated calculation); Method (Self-Attentive Residual Decoder, Memory RNN, Self-attentive RNN); Experimental Settings (Datasets); Results (Overview, Impact of the Attention Function); Qualitative Analysis (Distribution of Attention, Translation Examples); Appendix
  13. Results: Overview
      • The NMT model with self-attentive residual connections achieves the best scores
      • Similar results were obtained in the human evaluation
      • The proposed models do not increase the number of parameters
  14. Results: Impact of the Attention Function
      • For the self-attentive residual connections, the attention function is \alpha_i^t = softmax(e_i^t), with either e_i^t = v^T \tanh(W y_i + U s_t) or e_i^t = v^T \tanh(W y_i)
      • Including the decoder context (the U s_t term) does not necessarily lead to the best translation
      • Content-based attention over the words alone is effective for extracting a representation of the whole sentence
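To make the comparison concrete, the two scoring variants from this slide are isolated below; W, U, v are assumed projection parameters, as in the earlier decoder sketch.

```python
# The two attention-scoring variants compared on slide 14 (a sketch).
import numpy as np

def score_with_context(y_i, s_t, W, U, v):
    # e_i^t = v^T tanh(W y_i + U s_t): conditioned on the current decoder state
    return v @ np.tanh(W @ y_i + U @ s_t)

def score_content_only(y_i, W, v):
    # e_i^t = v^T tanh(W y_i): content-based, independent of the decoder state
    return v @ np.tanh(W @ y_i)
```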
  15. Contents: Summary (Improve translation without complicated calculation); Method (Self-Attentive Residual Decoder, Memory RNN, Self-attentive RNN); Experimental Settings (Datasets); Results (Overview, Impact of the Attention Function); Qualitative Analysis (Distribution of Attention, Translation Examples); Appendix
  16. Qualitative Analysis: Distribution of Attention
      • The Memory RNN frequently selects the immediately previous word
      • The attention of the Self-Attention RNN varies from word to word
      • The self-attentive residual connections model focuses on particular words
  17. Qualitative Analysis: Translation Examples (R: reference, B: baseline, O: proposed model output)
      Better than baseline:
      • R: Students and teachers are taking the date lightly.
        B: Students and teachers are being taken lightly to the date.
        O: Students and teachers are taking the date lightly.
      • R: Not because he shared their world view, but because for him, human rights are indivisible.
        B: Not because I share his ideology, but because he is indivisible by human rights.
        O: Not because he shared his ideology, but because for him human rights are indivisible.
      Worse than baseline:
      • R: The Government is trying not to build so many small houses.
        B: The government is trying not to build so many small houses.
        O: The government is trying to ensure that so many small houses are not built.
      • R: Other people can have children.
        B: Other people can have children.
        O: Others may have children.
  18. Contents: Summary (Improve translation without complicated calculation); Method (Self-Attentive Residual Decoder, Memory RNN, Self-attentive RNN); Experimental Settings (Datasets); Results (Overview, Impact of the Attention Function); Qualitative Analysis (Distribution of Attention, Translation Examples); Appendix
  19. Appendix: Performance on Language Modeling
      • The proposed models outperform the LSTM baseline and the self-attention model, but not the 4-gram LSTM
      • This shows that even if a model improves language modeling, it does not necessarily improve machine translation