Paper reading: Deep Neural Machine Translation with Linear Associative Unit

Paper reading at Komachi Lab, Tokyo Metropolitan University

Satoru Katsumata

December 10, 2023

Transcript

  1. Deep Neural Machine Translation with Linear Associative Unit
     Mingxuan Wang, Zhengdong Lu, Jie Zhou, Qun Liu (ACL 2017)
     Presenter: Satoru Katsumata (B4, Tokyo Metropolitan University)
  2. Abstract
     • NMT systems with deep RNN architectures often suffer from severe gradient diffusion
       ◦ due to the non-linear recurrent activations, which often make the optimization much more difficult.
     • The authors propose a novel Linear Associative Unit (LAU)
       ◦ which reduces the gradient path inside the recurrent unit.
     • Experiments
       ◦ NIST task: Chinese-English
       ◦ WMT14: English-German, English-French
     • Analysis
       ◦ LAU vs. GRU
       ◦ depth vs. width
       ◦ sentence length
  3. Background: gate structures
     • LSTM, GRU: capture long-term dependencies
     • Residual Networks (He et al., 2015)
     • Highway Networks (Srivastava et al., 2015)
     • Fast-Forward Networks (F-F connections) (Zhou et al., 2016)
     The slide shows the Highway Network (H, T: non-linear functions); its form is recalled below.
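For reference, the Highway Network layer from Srivastava et al. (2015) gates between a non-linear transform H and the untouched input x using a transform gate T (biases omitted):

    y = H(x, W_H) ⊙ T(x, W_T) + x ⊙ (1 - T(x, W_T))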
  4. Background: Gated Recurrent Unit (GRU)
     • takes a linear sum (interpolation) between the existing state and the newly computed state
     • z_t: update gate, r_t: reset gate
     (the standard equations are recalled below)
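The standard GRU equations behind this slide (usual notation, biases omitted):

    z_t = σ(W_z x_t + U_z h_{t-1})
    r_t = σ(W_r x_t + U_r h_{t-1})
    h̃_t = tanh(W x_t + U (r_t ⊙ h_{t-1}))
    h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t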
  5. Model: LAU (Linear Associative Unit)
     • LAU extends GRU by adding a linear transformation of the input.
     • f_t and r_t express how much of the non-linear abstraction is produced from the input x_t and the previous hidden state h_{t-1}.
     • g_t decides how much of the linear transformation of the input is carried to the hidden state.
     (a sketch of the update is given below)
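A sketch of the LAU update consistent with the description above; the exact gate wiring and parameterization in Wang et al. (2017) may differ in detail. Here H(x_t) = W_H x_t is the additional linear transform of the input, and biases are omitted:

    f_t = σ(W_f x_t + U_f h_{t-1})
    r_t = σ(W_r x_t + U_r h_{t-1})
    g_t = σ(W_g x_t + U_g h_{t-1})
    h̃_t = tanh(W x_t + U (r_t ⊙ h_{t-1}))
    h_t = f_t ⊙ h_{t-1} + (1 - f_t) ⊙ h̃_t + g_t ⊙ H(x_t)

Compared with the GRU above, the last term gives the input a linear path into the hidden state, which is what shortens the gradient path.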
  6. What is good about LAU?
     • LAU offers a direct way for the input x_t to reach later hidden-state layers.
     • This mechanism is very useful for translation, where the input should sometimes be carried directly to the next stage of processing without any substantial composition or non-linear transformation.
     • e.g. imagine we want to translate a rare entity name such as "Bahrain" into Chinese.
       → LAU is able to retain the embedding of this word in its hidden state; otherwise, serious distortion occurs due to the lack of training instances.
  7. Model: encoder-decoder (DeepLAU)
     • vertical stacking
       ◦ only the output of the previous RNN layer is fed to the current layer as input.
     • bidirectional encoder
       ◦ φ is LAU.
       ◦ the directions are marked by a direction term d = -1 or +1;
         when d = -1 the layer processes in the forward direction, otherwise in the backward direction.
     (the stacked recurrence is sketched below)
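Roughly, in the spirit of the recurrence the slides refer to as Equation (9) (the paper's exact notation may differ), layer ℓ at time t combines the same layer's state at the neighbouring time step selected by d with the output of the layer below:

    h_t^(ℓ) = φ( h_{t+d}^(ℓ), h_t^(ℓ-1) ),   d ∈ {-1, +1}

so d = -1 reads the previous time step (forward processing) and d = +1 reads the next one (backward processing).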
  8. Model: encoder side
     • in order to learn more temporal dependencies, they choose an unusual bidirectional approach.
     • encoding
       ◦ an RNN layer processes the input sequence in the forward direction.
       ◦ the output of this layer is taken by an upper RNN layer as input and processed in the reverse direction.
       ◦ formally, following Equation (9), they set d = (-1)^ℓ.
       ◦ the final encoder consists of L_enc layers and produces the output (see the sketch after this slide).
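A minimal sketch of this alternating-direction stacking, assuming hypothetical sequence-to-sequence layer callables that stand in for LAU layers (a reading aid, not the authors' code; re-reversing the output to restore token order is my assumption):

```python
def alternating_encoder(embeddings, rnn_layers):
    """Stack RNN layers so that layer l runs in direction d = (-1)**l.

    embeddings: list of source-token input vectors.
    rnn_layers: L_enc callables, each mapping a sequence to a sequence
                (stand-ins for LAU layers).
    """
    outputs = embeddings
    for l, layer in enumerate(rnn_layers, start=1):
        d = (-1) ** l                     # d = -1: forward, d = +1: backward
        seq = outputs if d == -1 else list(reversed(outputs))
        seq = layer(seq)                  # run the layer in the chosen direction
        outputs = seq if d == -1 else list(reversed(seq))  # restore token order
    return outputs                        # top-layer states h_1 ... h_T
```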
  9. Model: attention
     • α_{t,j} is calculated from the first decoder layer's state at step t-1 (s_{t-1}),
       the top-most encoder layer's state at step j (h_j), and the previous (context) word y_{t-1}.
     • σ(·) is tanh(·).
     (a generic form is sketched below)
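One way to write a score consistent with that description is the usual additive attention (a generic sketch, not necessarily the paper's exact parameterization), with σ = tanh:

    e_{t,j} = v_a^T σ( W_a s_{t-1} + U_a h_j + V_a y_{t-1} )
    α_{t,j} = exp(e_{t,j}) / Σ_k exp(e_{t,k})
    c_t     = Σ_j α_{t,j} h_j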
  10. Model: decoder side
      • the decoder follows Equation (9) with the direction term fixed to d = -1.
      • at the first layer, they use the input shown on the slide, where y_{t-1} is the target word embedding.
      • at inference time, only the top-most hidden state s_t^(L_dec) is used to make the final prediction with a softmax layer (see below).
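The final prediction step, in its usual softmax form (W_o is an assumed output projection; the paper's exact conditioning may include further terms):

    p(y_t | y_<t, x) = softmax( W_o s_t^(L_dec) )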
  11. Experiments: corpora
      • NIST Chinese-English
        ◦ training: LDC corpora, 1.25M sentences, 27.9M Chinese words and 34.5M English words
        ◦ dev: NIST 2002 (MT02)
        ◦ test: NIST 2003 (MT03), NIST 2004 (MT04), NIST 2005 (MT05), NIST 2006 (MT06)
      • WMT14 English-German
        ◦ training: WMT14 training corpus, 4.5M sentences, 91M English words and 87M German words
        ◦ dev: news-test 2012 and 2013
        ◦ test: news-test 2014
      • WMT14 English-French
        ◦ training: subset of the WMT14 training corpus, 12M sentences, 304M English words and 348M French words
        ◦ dev: concatenation of news-test 2012 and news-test 2013
        ◦ test: news-test 2014
  12. Experiments: setup
      • for all experiments
        ◦ dimensions: embeddings, hidden states, and c_t all have size 512
        ◦ optimizer: Adadelta
        ◦ batch size: 128
        ◦ input length limit: 80 words
        ◦ beam size: 10
        ◦ dropout rate: 0.5
        ◦ layers: both encoder and decoder have 4 layers
      • per-experiment settings
        ◦ Chinese-English and English-French: the 30k most frequent words for both source and target vocabularies
        ◦ English-German: the 120k most frequent source words and the 80k most frequent target words
      (these settings are collected in the sketch below)
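Purely as a reading aid, the settings listed on the slide collected in one place (the dictionary names are illustrative, not the authors' configuration files):

```python
# Shared DeepLAU hyperparameters as listed on the slide (names are illustrative).
DEEPLAU_CONFIG = {
    "embedding_dim": 512,        # embeddings, hidden states, and c_t all have size 512
    "hidden_dim": 512,
    "optimizer": "Adadelta",
    "batch_size": 128,
    "max_source_length": 80,     # input length limit in words
    "beam_size": 10,
    "dropout": 0.5,
    "encoder_layers": 4,
    "decoder_layers": 4,
}

# Vocabulary sizes differ per language pair.
VOCAB_SIZES = {
    "zh-en": {"source": 30_000, "target": 30_000},
    "en-fr": {"source": 30_000, "target": 30_000},
    "en-de": {"source": 120_000, "target": 80_000},
}
```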
  13. Results: Chinese-English, English-French
      • LAUs apply an adaptive gate function conditioned on the input, which lets the model decide how much linear information should be transferred to the next step.
  14. Analysis: LAU vs. GRU, depth vs. width
      • all analyses use the NIST Chinese-English task.
      • LAU vs. GRU
        ◦ row 3 vs. row 7 → LAU brings an improvement.
        ◦ rows 3-4 vs. rows 7-8 → going deeper decreases BLEU with GRU, but brings an improvement with LAU.
      • depth vs. width
        ◦ when increasing the model depth even further, they failed to see additional improvements.
        ◦ hidden size (width), row 2 vs. row 3 → the improvement is relatively small.
        ◦ → depth plays a more important role than width in increasing the complexity of neural networks.
  15. Analysis: about length
      • DeepLAU models yield higher BLEU scores than the DeepGRU model.
      • → a very deep RNN model is good at modelling the nested latent structures of relatively complicated sentences.
  16. Conclusion
      • the authors propose the Linear Associative Unit (LAU)
        ◦ which fuses linear and non-linear transformations.
      • LAU enables building deep neural networks for MT.
      • my impressions
        ◦ in the end, is this model deep in the non-recurrent sense or in the recurrent sense? Probably both readings are reasonable.
        ◦ I was also interested in the weights of the model, so I would have liked the authors to discuss them.
  17. References
      • Srivastava et al. Training Very Deep Networks. NIPS 2015.
      • Zhou et al. Deep Recurrent Models with Fast-Forward Connections for Neural Machine Translation. arXiv, 2016.
      • He et al. Deep Residual Learning for Image Recognition. arXiv, 2015.
      • Wu et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv, 2016.