When and Why are Pre-trained Word Embeddings Useful for Neural Machine Translation?

Ye Qi, Devendra Singh Sachan, Matthieu Felix, Sarguna Janani Padmanabhan, Graham Neubig
Proceedings of NAACL-HLT 2018, pages 529–535, New Orleans, Louisiana, June 1-6, 2018
Paper introduction by 勝田哲弘, Natural Language Processing Laboratory, Nagaoka University of Technology

Introduction
Ways to use monolingual data in NMT systems: pre-trained word embeddings have been used either
• in standard translation systems (Neishi et al., 2017; Artetxe et al., 2017), or
• as a method for learning translation lexicons in an entirely unsupervised manner (Conneau et al., 2017; Gangi and Federico, 2017)
When properly incorporated into NMT, these improve BLEU, but it is not clear when and why performance improves.

Introduction
• Q1: Is the behavior of pre-training affected by language families and other linguistic features of source and target languages? (§3)
• Q2: Do pre-trained embeddings help more when the size of the training data is small? (§4)
• Q3: How much does the similarity of the source and target languages affect the efficacy of using pre-trained embeddings? (§5)
• Q4: Is it helpful to align the embedding spaces between the source and target languages? (§6)
• Q5: Do pre-trained embeddings help more in multilingual systems as compared to bilingual systems? (§7)

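Experimental Setup
Corpora are built from TED talks. Three pairs of similar languages are prepared, each paired with English (EN); in each pair, one language is comparatively low-resource.
• Galician (GL) and Portuguese (PT)
• Azerbaijani (AZ) and Turkish (TR)
• Belarusian (BE) and Russian (RU)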

Experimental Setup
Model: standard 1-layer encoder-decoder model with attention (Bahdanau et al., 2014) with a beam size of 5, implemented in xnmt (Neubig et al., 2018).
Training uses a batch size of 32 and the Adam optimizer (Kingma and Ba, 2014) with an initial learning rate of 0.0002, decaying the learning rate by 0.5 when development loss decreases (Denkowski and Neubig, 2017).
Embeddings: pre-trained word embeddings (Bojanowski et al., 2016) trained using fastText on Wikipedia for each language.
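As a rough illustration of what "using pre-trained word embeddings" means in such a setup, the sketch below initializes an embedding table from fastText-style text vectors and falls back to small random values for words without a pre-trained vector. The function names, the toy vectors, and the random initialization range are assumptions for illustration; this is not the xnmt implementation used in the paper.

```python
import random

def parse_vec_lines(lines):
    """Parse fastText-style text vectors: a 'count dim' header line,
    then one 'word v1 ... vd' line per word."""
    it = iter(lines)
    _count, dim = (int(x) for x in next(it).split())
    vectors = {}
    for line in it:
        parts = line.split()
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors

def init_embeddings(vocab, pretrained, dim, seed=0, scale=0.1):
    """Build one row per vocabulary word: the pre-trained vector if one
    exists, otherwise a small random vector (hypothetical fallback)."""
    rng = random.Random(seed)
    table, hits = [], 0
    for word in vocab:
        if word in pretrained:
            table.append(list(pretrained[word]))
            hits += 1
        else:
            table.append([rng.uniform(-scale, scale) for _ in range(dim)])
    return table, hits

# Toy data standing in for a real .vec file (assumed values).
sample = ["2 3", "casa 0.1 0.2 0.3", "gato -0.5 0.0 0.5"]
dim, pretrained = parse_vec_lines(sample)
table, hits = init_embeddings(["<unk>", "casa", "gato", "lume"], pretrained, dim)
print(f"{hits} of {len(table)} rows initialized from pre-trained vectors")
```

In a real system the resulting table would be copied into the encoder or decoder embedding layer before training starts.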

Q1: Efficacy of Pre-training
Usefulness of pre-trained word embeddings
• They substantially improve the scores
• This indicates that a better encoding is being learned

Q2: Effect of Training Data Size
The effect is examined in a controlled setting by limiting the training data to 1/2, 1/4, and 1/8 of its original size.
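The downsampling step can be sketched as follows. The slide does not say how the subsets were drawn, so uniform random sampling with a fixed seed is an assumption here, and `subsample_parallel` is a hypothetical helper name.

```python
import random

def subsample_parallel(pairs, fraction, seed=0):
    """Keep a random `fraction` of the sentence pairs, preserving corpus order.
    Random sampling is an assumption; the slide does not state the method."""
    rng = random.Random(seed)
    k = int(len(pairs) * fraction)
    kept = sorted(rng.sample(range(len(pairs)), k))
    return [pairs[i] for i in kept]

# Toy parallel corpus of (source, target) sentence pairs.
corpus = [(f"src {i}", f"tgt {i}") for i in range(80)]
subsets = {frac: subsample_parallel(corpus, frac) for frac in (1/2, 1/4, 1/8)}
for frac, subset in subsets.items():
    print(frac, len(subset))
```

Source and target sentences are sampled together as pairs so that the reduced corpus stays aligned.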

Q3: Effect of Language Similarity
Portuguese is used as the target language; all pairs were trained on 40,000 sentences.
The more similar the languages, the higher the accuracy.

Q4: Effect of Word Embedding Alignment
Hypothesis: having a consistent embedding space across the two languages is beneficial.
"we adopted the approach proposed by Smith et al. (2017)"
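For reference, alignment of this kind is commonly formulated as an orthogonal Procrustes problem, which is the core of the orthogonal-transformation approach of Smith et al. (2017). The notation below is an assumption for illustration: the rows of $X$ and $Y$ are the source and target embeddings of matched dictionary word pairs, and $W$ maps the source space into the target space. Preprocessing details such as vector normalization are omitted.

```latex
\begin{aligned}
W^{*} &= \operatorname*{arg\,min}_{W:\; W^{\top}W = I} \lVert XW - Y \rVert_F^{2}, \\
X^{\top}Y &= U \Sigma V^{\top} \quad \text{(singular value decomposition)}, \\
W^{*} &= U V^{\top}.
\end{aligned}
```

Because $W^{*}$ is orthogonal, it preserves distances and angles within the source space, so monolingual similarity structure is unchanged by the alignment.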

Q5: Effect of Multilinguality
Multilingual translation systems share an encoder or decoder between multiple languages (Johnson et al., 2016; Firat et al., 2016).
Models are trained on a pair of one low-resource and one high-resource language and tested only on the low-resource language.
Of the three pairs, GL/PT has the highest similarity and BE/RU the lowest.

Q5: Effect of Multilinguality
• Accuracy is broadly correlated with language similarity
• Unlike in Q4, alignment yields an improvement
• BE→EN does not benefit from multilingual training

Analysis

Qualitative Analysis
Pre-training not only helps capture rare vocabulary but also helps generate grammatically well-formed sentences.

Analysis of Frequently Generated n-grams

F-measure of Target Words
Improvements are seen especially for low-frequency words.
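One way to compute a per-word F-measure and group it by word frequency is sketched below. The matching rule (clipped per-sentence counts), the bucket edges, and the toy sentences are all assumptions for illustration and may differ from the paper's exact procedure.

```python
from collections import Counter

def word_f_measure(refs, hyps):
    """Per-word F-measure over a test set, using clipped per-sentence matches."""
    ref_total, hyp_total, match = Counter(), Counter(), Counter()
    for ref, hyp in zip(refs, hyps):
        r, h = Counter(ref.split()), Counter(hyp.split())
        ref_total.update(r)
        hyp_total.update(h)
        for w in r:
            match[w] += min(r[w], h[w])
    scores = {}
    for w in ref_total:
        prec = match[w] / hyp_total[w] if hyp_total[w] else 0.0
        rec = match[w] / ref_total[w]
        scores[w] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return scores

def bucket_average(scores, train_freq, edges=(1, 5, 100)):
    """Average F-measure per training-frequency bucket (edges are hypothetical)."""
    sums, counts = Counter(), Counter()
    for w, f1 in scores.items():
        freq = train_freq.get(w, 0)
        label = next((f"<{e}" for e in edges if freq < e), f">={edges[-1]}")
        sums[label] += f1
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in counts}

# Toy references, system outputs, and training-set frequencies (assumed values).
refs = ["the cat sat", "a dog ran"]
hyps = ["the cat sat", "a cat ran"]
train_freq = {"the": 1000, "a": 900, "cat": 3, "sat": 50, "ran": 50}
scores = word_f_measure(refs, hyps)
buckets = bucket_average(scores, train_freq)
print(buckets)
```

Comparing the bucket averages of a baseline system and a pre-trained system would show in which frequency range the gains are concentrated.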
