Dependency-Based Self-Attention for Transformer NMT (RANLP2019)

Hiroyuki Deguchi

September 03, 2019

Transcript

  1. Dependency-Based Self-Attention for Transformer NMT
     Ehime University, Japan
     Hiroyuki Deguchi, Akihiro Tamura, Takashi Ninomiya
  2. Machine Translation
     ➢ Rule-based Machine Translation (RBMT)
     ➢ Statistical Machine Translation (SMT)
     ➢ Neural Machine Translation (NMT)
       ➢ RNN-based model [Sutskever+, 14]
       ➢ CNN-based model [Gehring+, 17]
       ➢ Transformer model [Vaswani+, 17]
  3. Transformer Model [Vaswani+, 17]
     ⚫ Self-attention computes the strength of relationships between two words in the same sentence.
     ⚫ Encoder-decoder attention computes the strength of relationships between a source word and a target word.
     ➢ e.g. self-attention on the encoder side (figure: attention links between the words of "I listen to jazz")
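The scaled dot-product self-attention described on this slide can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; all array names and sizes are made up for the example:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over one sentence.

    X: (n, d) token representations, e.g. for "I listen to jazz".
    Returns the (n, d) outputs and the (n, n) attention matrix whose
    entry [i, j] is the strength of the relation from word i to word j.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # scaled dot products
    A = softmax(scores, axis=-1)              # each row sums to 1
    return A @ V, A

rng = np.random.default_rng(0)
n, d = 4, 8  # four tokens: I listen to jazz
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, A = self_attention(X, Wq, Wk, Wv)
```

Row i of `A` is a distribution over all words of the same sentence, which is what the dependency supervision in the later slides constrains.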
  4. Existing Syntax-Based MT
     ⚫ The performance of MT has been improved by incorporating sentence structures:
     ➢ dependency relations in SMT [Lin, 04]
     ➢ dependency relations in RNN-based NMT [Chen+, 17][Eriguchi+, 17]
     ➢ dependency relations in Transformer NMT [Wu+, 18][Zhang+, 19][Currey+, 19]
  5. Existing Syntactic Transformer NMT [Wu+, 18]
     (figure: an action RNN predicts shift-reduce parsing actions (SH/LR/RR) for "I listen to jazz" alongside the Transformer encoder, and syntax-aware encoders (CES/HES) reorder the hidden states h1..h4; the Transformer decoder generates "Ich höre jazz")
     ⚫ The model structure of Transformer NMT itself has not been modified.
  6. Purposes
     ⚫ Improve translation performance by utilizing dependency relations in Transformer NMT
     ⚫ Extend self-attention to incorporate dependency relations on both the source and target sides
     ⚫ Extend the method to work on subword sequences
  7. Proposed Method: Dependency-Based Self-Attention
     ⚫ Incorporates dependency relations into self-attention on both the source and target sides.
     ➢ Inspired by Syntactically-Informed Self-Attention [Strubell+, 18]
     (figure: the source-side dependencies are incorporated into the encoder and the target-side dependencies into the decoder)
  8. Syntactically-Informed Self-Attention [Strubell+, 18]
     ⚫ The self-attention is trained under a constraint based on the dependency relations.
     (figure: the self-attention over "I listen to jazz" is guided toward the dependency heads given in the training data)
  9. Syntactically-Informed Self-Attention [Strubell+, 18]
     ⚫ Syntactically-informed self-attention is calculated by a biaffine transformation:
       score(i, j) = q_i U k_j + k_j · b, and A_parse is the row-wise softmax of the scores
     ➢ q_i, k_j: query and key representations
     ➢ U, b: parameters
     ⚫ A_parse[i, j] = P(word j is the head of word i): the probability of j being the head of i
     ⚫ loss function: L_parse = CE(A_parse, A_gold), the cross entropy against the gold dependency heads
     (figure: the self-attention over "I listen to jazz" and the dependency heads in the training data)
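A biaffine parse attention of this kind can be sketched in NumPy as below. This is a hedged reconstruction of the idea, not the authors' code; the bilinear-plus-linear score and the example gold heads for "I listen to jazz" are my own illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def parse_attention(Q, K, U, b):
    """Biaffine parse attention (sketch in the spirit of [Strubell+, 18]).

    Q, K: (n, d) query/key representations; U: (d, d) and b: (d,) parameters.
    A[i, j] approximates P(token j is the head of token i).
    """
    scores = Q @ U @ K.T + (K @ b)[None, :]  # bilinear term + linear bias term
    return softmax(scores, axis=-1)

def parse_loss(A, heads):
    """Cross entropy between each token's attention row and its gold head index."""
    n = len(heads)
    return -np.log(A[np.arange(n), heads] + 1e-9).mean()

rng = np.random.default_rng(1)
n, d = 4, 8
Q, K = rng.normal(size=(n, d)), rng.normal(size=(n, d))
U, b = rng.normal(size=(d, d)), rng.normal(size=(d,))
A = parse_attention(Q, K, U, b)
# Assumed gold heads for "I listen to jazz" (root attends to itself):
# head(I)=listen, head(listen)=listen, head(to)=listen, head(jazz)=to
heads = np.array([1, 1, 1, 2])
loss = parse_loss(A, heads)
```

Minimizing `loss` pushes row i of the attention matrix toward a one-hot distribution on token i's dependency head, which is exactly the constraint the slide describes.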
  10. Syntactically-Informed Self-Attention vs. Dependency-Based Self-Attention

      | Method | Task | Encoder | Decoder | Subword |
      |---|---|---|---|---|
      | Syntactically-Informed Self-Attention | SRL | ✓ | | |
      | Dependency-Based Self-Attention (our method) | MT | ✓ | ✓ | ✓ |
  11. Dependency-Based Self-Attention
      ⚫ One head of the multi-head self-attention on each of the source and target sides is trained to attend to the dependency head of each token, similarly to syntactically-informed self-attention [Strubell+, 18].
      ⚫ Loss function: L = L_trans + λ1 · L_enc(A_enc, A_enc_gold) + λ2 · L_dec(A_dec, A_dec_gold)
      ⚫ L_trans is the label-smoothed cross entropy (translation error)
      ⚫ L_enc and L_dec are cross entropies (encoder-side and decoder-side parsing error)
      ⚫ λ1, λ2: hyperparameters
      (figure: the self-attention over "I listen to jazz" and the dependency heads in the training data)
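The combined objective on this slide can be sketched as follows. This is a minimal illustration under my own assumptions (a uniform label-smoothing scheme and made-up example numbers), not the paper's implementation:

```python
import numpy as np

def label_smoothed_ce(logp, target, eps=0.1):
    """Label-smoothed cross entropy, used here for the translation loss L_trans.

    logp: (n, V) log-probabilities over the target vocabulary.
    """
    n, V = logp.shape
    nll = -logp[np.arange(n), target].mean()  # standard cross entropy
    uniform = -logp.mean()                    # cross entropy against the uniform distribution
    return (1.0 - eps) * nll + eps * uniform

def cross_entropy_heads(A, heads):
    """Plain cross entropy for the parse attention heads (L_enc, L_dec)."""
    n = len(heads)
    return -np.log(A[np.arange(n), heads] + 1e-9).mean()

def total_loss(l_trans, l_enc, l_dec, lam1=1.0, lam2=1.0):
    """L = L_trans + lambda1 * L_enc + lambda2 * L_dec."""
    return l_trans + lam1 * l_enc + lam2 * l_dec

# Tiny example with made-up numbers:
rng = np.random.default_rng(2)
logits = rng.normal(size=(4, 10))
logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
l_trans = label_smoothed_ce(logp, np.array([3, 1, 4, 1]))
A = np.full((4, 4), 0.25)                      # dummy uniform parse attention
l_enc = l_dec = cross_entropy_heads(A, np.array([1, 1, 1, 2]))
L = total_loss(l_trans, l_enc, l_dec)
```

With λ1 = λ2 = 1.0 (the mixing ratios used in the experiments), the translation error and the two parsing errors contribute with equal weight.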
  12. Future Masking on the Decoder Side
      ⚫ On the decoder side, future information is masked to prevent the self-attention from attending to not-yet-predicted subsequent tokens.
      (figure: the decoder self-attention over "I listen to jazz" with attention to subsequent tokens masked out)
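The masking step can be sketched as follows; a minimal NumPy illustration of the standard causal mask, with made-up scores:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention_weights(scores):
    """Apply a lower-triangular (causal) mask before the softmax:
    position i may attend only to positions j <= i."""
    n = scores.shape[0]
    mask = np.tril(np.ones((n, n), dtype=bool))
    return softmax(np.where(mask, scores, -1e9), axis=-1)

rng = np.random.default_rng(3)
A = masked_self_attention_weights(rng.normal(size=(4, 4)))
```

After masking, every entry above the diagonal is (numerically) zero, so the decoder-side dependency supervision can only target tokens that have already been generated.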
  13. Subword Dependency-Based Self-Attention
      ⚫ When an original head word is divided into multiple subwords, the head is set to the leftmost of those subwords.
      (figure: original dependency vs. subword dependency)
  14. Subword Dependency-Based Self-Attention
      ⚫ The head of the rightmost subword of a word is set to the head of the original word.
      (figure: original dependency vs. subword dependency, highlighting the rightmost subword of "jazz")
  15. Subword Dependency-Based Self-Attention
      ⚫ The head of each subword other than the rightmost one is set to the right-adjacent subword.
      (figure: original dependency vs. subword dependency)
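The three remapping rules above can be sketched as one function. This is my own illustrative reconstruction; the example tokenization ("jazz" split into two subwords) and the index layout are assumptions, not taken from the paper:

```python
def subword_heads(word_heads, word_to_subwords):
    """Map word-level dependency heads to subword-level heads.

    word_heads[i]: index of the head word of word i (the root points to itself).
    word_to_subwords[i]: subword positions of word i, left to right.

    Rules from the slides:
      1. A dependent of a split word attaches to that word's leftmost subword.
      2. The rightmost subword of a word takes the original word's head.
      3. Every other subword takes its right-adjacent subword as head.
    """
    n_sub = sum(len(subs) for subs in word_to_subwords)
    heads = [0] * n_sub
    for w, subs in enumerate(word_to_subwords):
        for a, b in zip(subs, subs[1:]):                       # rule 3: chain left to right
            heads[a] = b
        heads[subs[-1]] = word_to_subwords[word_heads[w]][0]   # rules 1 + 2
    return heads

# "I listen to jazz" where only "jazz" is split into two subwords
# (positions 3 and 4); word-level heads: I->listen, listen->listen (root),
# to->listen, jazz->to.
heads = subword_heads([1, 1, 1, 2], [[0], [1], [2], [3, 4]])
```

In the result, the first subword of "jazz" points at its right neighbor, and only the rightmost subword inherits the original head "to".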
  16. Experiments
      ⚫ WAT 2018 Japanese-English translation task: ASPEC
      ⚫ Training data: 1,198,149 sentence pairs; test data: 1,812 sentence pairs
      ⚫ Vocabulary: 100K subword tokens based on BPE for both languages
      ⚫ Tokenizer ➢ Ja: KyTea, En: Moses tokenizer
      ⚫ Dependency parser ➢ Ja: EDA, En: Stanford Dependencies
      ⚫ The dependency parsers are NOT used in decoding.
  17. Experiments
      ⚫ Baseline: Transformer [Vaswani+, 17]
      ➢ Encoder: 6 layers, Decoder: 6 layers
      ➢ Multi-head attention: 8 heads
      ⚫ Proposed model
      ➢ Mixing ratios: λ1 = 1.0, λ2 = 1.0
      ➢ Subword dependency-based self-attention is applied to the 4th layer of the encoder and of the decoder
  18. Experiment Results
      ⚫ S-DBSA: Subword Dependency-Based Self-Attention
      ⚫ Our model achieved a 1.0-point BLEU gain over the baseline Transformer.

      | Model | BLEU [%] |
      |---|---|
      | Transformer | 27.29 |
      | Transformer + S-DBSA (Enc) | 28.05 |
      | Transformer + S-DBSA (Dec) | 27.86 |
      | Transformer + S-DBSA (Enc&Dec) | 28.29 (+1.0) |
  19. Effectiveness of Subword
      ⚫ Subword dependency-based self-attention improves the performance of the proposed model (+0.43 BLEU).
      ⚫ The proposed model also outperforms the baseline when BPE is not used.

      | Model | BPE | BLEU [%] |
      |---|---|---|
      | Transformer | - | 27.29 |
      | Transformer | ✔ | 28.05 |
      | Transformer + DBSA (Enc&Dec) | - | 27.86 |
      | Transformer + S-DBSA (Enc&Dec) | ✔ | 28.29 |
  20. Conclusion
      ⚫ Contributions
      ➢ We improved machine translation performance by incorporating dependency relations into the self-attention of both the encoder and the decoder of the Transformer.
      ➢ We extended dependency-based self-attention to work well on subword sequences.
      ⚫ Future work
      ➢ Explore the effectiveness of the proposed model on language pairs other than Japanese-to-English
      ➢ Explore the effectiveness of the proposed model on larger corpora
  21. Thank you for your attention.