Dependency-Based Self-Attention for Transformer NMT (RANLP2019)

Hiroyuki Deguchi

September 03, 2019

Transcript

  1. Dependency-Based Self-Attention for Transformer NMT
     Ehime University, Japan
     Hiroyuki Deguchi, Akihiro Tamura, Takashi Ninomiya
  2. Machine Translation
     ➢ Rule-based Machine Translation (RBMT)
     ➢ Statistical Machine Translation (SMT)
     ➢ Neural Machine Translation (NMT)
       ➢ RNN-based model [Sutskever+, 14]
       ➢ CNN-based model [Gehring+, 17]
       ➢ Transformer model [Vaswani+, 17]
  3. Transformer Model [Vaswani+, 17]
     ⚫ Self-attention computes the strength of relationships between two words in the same sentence.
     ⚫ Encoder-decoder attention computes the strength of relationships between a source word and a target word.
     ➢ e.g. self-attention on the encoder side (figure: attention links between the words of "I listen to jazz")
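The scaled dot-product self-attention described on this slide can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; all array names and sizes are made up for the example:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over one sentence.

    X: (n, d) token representations, e.g. for "I listen to jazz".
    Returns the (n, d) outputs and the (n, n) attention matrix whose
    entry [i, j] is the strength of the relation from word i to word j.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # scaled dot products
    A = softmax(scores, axis=-1)              # each row sums to 1
    return A @ V, A

rng = np.random.default_rng(0)
n, d = 4, 8  # four tokens: I listen to jazz
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, A = self_attention(X, Wq, Wk, Wv)
```

Row i of `A` is a distribution over all words of the same sentence, which is what the dependency supervision in the later slides constrains.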
  4. Existing Syntax-Based MT
     ⚫ The performance of MT has been improved by incorporating sentence structures:
     ➢ dependency relations in SMT [Lin, 04]
     ➢ dependency relations in RNN-based NMT [Chen+, 17][Eriguchi+, 17]
     ➢ dependency relations in Transformer NMT [Wu+, 18][Zhang+, 19][Currey+, 19]
  5. Existing Syntactic Transformer NMT [Wu+, 18]
     (figure: an action RNN predicts shift-reduce parsing actions (SH/LR/RR) for "I listen to jazz" alongside the Transformer encoder, and syntax-aware encoders (CES/HES) reorder the hidden states h1..h4; the Transformer decoder generates "Ich höre jazz")
     ⚫ The model structure of Transformer NMT itself has not been modified.
  6. Purposes
     ⚫ Improve translation performance by utilizing dependency relations in Transformer NMT
     ⚫ Extend self-attention to incorporate dependency relations on both the source and target sides
     ⚫ Extend the method to work on subword sequences
  7. Proposed Method: Dependency-Based Self-Attention
     ⚫ Incorporates dependency relations into self-attention on both the source and target sides.
     ➢ Inspired by Syntactically-Informed Self-Attention [Strubell+, 18]
     (figure: the source-side dependencies are incorporated into the encoder and the target-side dependencies into the decoder)
  8. Syntactically-Informed Self-Attention [Strubell+, 18]
     ⚫ The self-attention is trained under a constraint based on the dependency relations.
     (figure: the self-attention over "I listen to jazz" is guided toward the dependency heads given in the training data)
  9. Syntactically-Informed Self-Attention [Strubell+, 18]
     ⚫ Syntactically-informed self-attention is calculated by a biaffine transformation:
       score(i, j) = q_i U k_j + k_j · b, and A_parse is the row-wise softmax of the scores
     ➢ q_i, k_j: query and key representations
     ➢ U, b: parameters
     ⚫ A_parse[i, j] = P(word j is the head of word i): the probability of j being the head of i
     ⚫ loss function: L_parse = CE(A_parse, A_gold), the cross entropy against the gold dependency heads
     (figure: the self-attention over "I listen to jazz" and the dependency heads in the training data)
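A biaffine parse attention of this kind can be sketched in NumPy as below. This is a hedged reconstruction of the idea, not the authors' code; the bilinear-plus-linear score and the example gold heads for "I listen to jazz" are my own illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def parse_attention(Q, K, U, b):
    """Biaffine parse attention (sketch in the spirit of [Strubell+, 18]).

    Q, K: (n, d) query/key representations; U: (d, d) and b: (d,) parameters.
    A[i, j] approximates P(token j is the head of token i).
    """
    scores = Q @ U @ K.T + (K @ b)[None, :]  # bilinear term + linear bias term
    return softmax(scores, axis=-1)

def parse_loss(A, heads):
    """Cross entropy between each token's attention row and its gold head index."""
    n = len(heads)
    return -np.log(A[np.arange(n), heads] + 1e-9).mean()

rng = np.random.default_rng(1)
n, d = 4, 8
Q, K = rng.normal(size=(n, d)), rng.normal(size=(n, d))
U, b = rng.normal(size=(d, d)), rng.normal(size=(d,))
A = parse_attention(Q, K, U, b)
# Assumed gold heads for "I listen to jazz" (root attends to itself):
# head(I)=listen, head(listen)=listen, head(to)=listen, head(jazz)=to
heads = np.array([1, 1, 1, 2])
loss = parse_loss(A, heads)
```

Minimizing `loss` pushes row i of the attention matrix toward a one-hot distribution on token i's dependency head, which is exactly the constraint the slide describes.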
  10. Syntactically-Informed Self-Attention vs. Dependency-Based Self-Attention

      | Method | Task | Encoder | Decoder | Subword |
      |---|---|---|---|---|
      | Syntactically-Informed Self-Attention | SRL | ✓ | | |
      | Dependency-Based Self-Attention (our method) | MT | ✓ | ✓ | ✓ |
  11. Dependency-Based Self-Attention
      ⚫ One head of the multi-head self-attention on each of the source and target sides is trained to attend to the dependency head of each token, similarly to syntactically-informed self-attention [Strubell+, 18].
      ⚫ Loss function: L = L_trans + λ1 · L_enc(A_enc, A_enc_gold) + λ2 · L_dec(A_dec, A_dec_gold)
      ⚫ L_trans is the label-smoothed cross entropy (translation error)
      ⚫ L_enc and L_dec are cross entropies (encoder-side and decoder-side parsing error)
      ⚫ λ1, λ2: hyperparameters
      (figure: the self-attention over "I listen to jazz" and the dependency heads in the training data)
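The combined objective on this slide can be sketched as follows. This is a minimal illustration under my own assumptions (a uniform label-smoothing scheme and made-up example numbers), not the paper's implementation:

```python
import numpy as np

def label_smoothed_ce(logp, target, eps=0.1):
    """Label-smoothed cross entropy, used here for the translation loss L_trans.

    logp: (n, V) log-probabilities over the target vocabulary.
    """
    n, V = logp.shape
    nll = -logp[np.arange(n), target].mean()  # standard cross entropy
    uniform = -logp.mean()                    # cross entropy against the uniform distribution
    return (1.0 - eps) * nll + eps * uniform

def cross_entropy_heads(A, heads):
    """Plain cross entropy for the parse attention heads (L_enc, L_dec)."""
    n = len(heads)
    return -np.log(A[np.arange(n), heads] + 1e-9).mean()

def total_loss(l_trans, l_enc, l_dec, lam1=1.0, lam2=1.0):
    """L = L_trans + lambda1 * L_enc + lambda2 * L_dec."""
    return l_trans + lam1 * l_enc + lam2 * l_dec

# Tiny example with made-up numbers:
rng = np.random.default_rng(2)
logits = rng.normal(size=(4, 10))
logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
l_trans = label_smoothed_ce(logp, np.array([3, 1, 4, 1]))
A = np.full((4, 4), 0.25)                      # dummy uniform parse attention
l_enc = l_dec = cross_entropy_heads(A, np.array([1, 1, 1, 2]))
L = total_loss(l_trans, l_enc, l_dec)
```

With λ1 = λ2 = 1.0 (the mixing ratios used in the experiments), the translation error and the two parsing errors contribute with equal weight.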
  12. Future Masking on the Decoder Side
      ⚫ On the decoder side, future information is masked to prevent the self-attention from attending to not-yet-predicted subsequent tokens.
      (figure: the decoder self-attention over "I listen to jazz" with attention to subsequent tokens masked out)
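The masking step can be sketched as follows; a minimal NumPy illustration of the standard causal mask, with made-up scores:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention_weights(scores):
    """Apply a lower-triangular (causal) mask before the softmax:
    position i may attend only to positions j <= i."""
    n = scores.shape[0]
    mask = np.tril(np.ones((n, n), dtype=bool))
    return softmax(np.where(mask, scores, -1e9), axis=-1)

rng = np.random.default_rng(3)
A = masked_self_attention_weights(rng.normal(size=(4, 4)))
```

After masking, every entry above the diagonal is (numerically) zero, so the decoder-side dependency supervision can only target tokens that have already been generated.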
  13. Subword Dependency-Based Self-Attention
      ⚫ When an original head word is divided into multiple subwords, the head is set to the leftmost of those subwords.
      (figure: original dependency vs. subword dependency)
  14. Subword Dependency-Based Self-Attention
      ⚫ The head of the rightmost subword of a word is set to the head of the original word.
      (figure: original dependency vs. subword dependency, highlighting the rightmost subword of "jazz")
  15. Subword Dependency-Based Self-Attention
      ⚫ The head of each subword other than the rightmost one is set to the right-adjacent subword.
      (figure: original dependency vs. subword dependency)
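The three remapping rules above can be sketched as one function. This is my own illustrative reconstruction; the example tokenization ("jazz" split into two subwords) and the index layout are assumptions, not taken from the paper:

```python
def subword_heads(word_heads, word_to_subwords):
    """Map word-level dependency heads to subword-level heads.

    word_heads[i]: index of the head word of word i (the root points to itself).
    word_to_subwords[i]: subword positions of word i, left to right.

    Rules from the slides:
      1. A dependent of a split word attaches to that word's leftmost subword.
      2. The rightmost subword of a word takes the original word's head.
      3. Every other subword takes its right-adjacent subword as head.
    """
    n_sub = sum(len(subs) for subs in word_to_subwords)
    heads = [0] * n_sub
    for w, subs in enumerate(word_to_subwords):
        for a, b in zip(subs, subs[1:]):                       # rule 3: chain left to right
            heads[a] = b
        heads[subs[-1]] = word_to_subwords[word_heads[w]][0]   # rules 1 + 2
    return heads

# "I listen to jazz" where only "jazz" is split into two subwords
# (positions 3 and 4); word-level heads: I->listen, listen->listen (root),
# to->listen, jazz->to.
heads = subword_heads([1, 1, 1, 2], [[0], [1], [2], [3, 4]])
```

In the result, the first subword of "jazz" points at its right neighbor, and only the rightmost subword inherits the original head "to".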
  16. Experiments
      ⚫ WAT 2018 Japanese-English translation task: ASPEC
      ⚫ Training data: 1,198,149 sentence pairs; test data: 1,812 sentence pairs
      ⚫ Vocabulary: 100K subword tokens based on BPE for both languages
      ⚫ Tokenizer ➢ Ja: KyTea, En: Moses tokenizer
      ⚫ Dependency parser ➢ Ja: EDA, En: Stanford Dependencies
      ⚫ The dependency parsers are NOT used in decoding.
  17. Experiments
      ⚫ Baseline: Transformer [Vaswani+, 17]
      ➢ Encoder: 6 layers, Decoder: 6 layers
      ➢ Multi-head attention: 8 heads
      ⚫ Proposed model
      ➢ Mixing ratios: λ1 = 1.0, λ2 = 1.0
      ➢ Subword dependency-based self-attention is applied to the 4th layer of the encoder and of the decoder
  18. Experiment Results
      ⚫ S-DBSA: Subword Dependency-Based Self-Attention
      ⚫ Our model achieved a 1.0-point BLEU gain over the baseline Transformer.

      | Model | BLEU [%] |
      |---|---|
      | Transformer | 27.29 |
      | Transformer + S-DBSA (Enc) | 28.05 |
      | Transformer + S-DBSA (Dec) | 27.86 |
      | Transformer + S-DBSA (Enc&Dec) | 28.29 (+1.0) |
  19. Effectiveness of Subword
      ⚫ Subword dependency-based self-attention improves the performance of the proposed model (+0.43 BLEU).
      ⚫ The proposed model also outperforms the baseline when BPE is not used.

      | Model | BPE | BLEU [%] |
      |---|---|---|
      | Transformer | - | 27.29 |
      | Transformer | ✔ | 28.05 |
      | Transformer + DBSA (Enc&Dec) | - | 27.86 |
      | Transformer + S-DBSA (Enc&Dec) | ✔ | 28.29 |
  20. Conclusion
      ⚫ Contributions
      ➢ We improved machine translation performance by incorporating dependency relations into the self-attention of both the encoder and the decoder of the Transformer.
      ➢ We extended dependency-based self-attention to work well on subword sequences.
      ⚫ Future work
      ➢ Explore the effectiveness of the proposed model on language pairs other than Japanese-to-English
      ➢ Explore the effectiveness of the proposed model on larger corpora
  21. Thank you for your attention.