
Oral: Multimodal Machine Translation with Embedding Prediction

tosho
June 11, 2019


Multimodal machine translation is an attractive application of neural machine translation (NMT). It helps computers to deeply understand visual objects and their relations with natural languages. However, multimodal NMT systems suffer from a shortage of available training data, resulting in poor performance for translating rare words. In NMT, pretrained word embeddings have been shown to improve NMT of low-resource domains, and a search-based approach is proposed to address the rare word problem. In this study, we effectively combine these two approaches in the context of multimodal NMT and explore how we can take full advantage of pretrained word embeddings to better translate rare words. We report overall performance improvements of 1.24 METEOR and 2.49 BLEU and achieve an improvement of 7.67 F-score for rare word translation.


Transcript

  1. Multimodal Machine Translation with Embedding Prediction
     Tosho Hirasawa, Hayahide Yamagishi, Yukio Matsumura, Mamoru Komachi
     hirasawa-tosho@ed.tmu.ac.jp
     Tokyo Metropolitan University
     NAACL SRW 2019
  2. Multimodal Machine Translation
     • A practical application of machine translation
     • Translates a source sentence along with related nonlinguistic information,
       such as visual information
     Example:
       EN: two young girls are sitting on the street eating corn .
       FR: deux jeunes filles sont assises dans la rue , mangeant du maïs .
     6/11/19 NAACL SRW 2019, Minneapolis
  3. Issue of MMT
     • Multi30k [Elliott et al., 2016] has only a small amount of data
     • Statistics of the training data:
                 Sentences   Tokens   Types
       English      29,000  377,534  10,210
       French       29,000  409,845  11,219
     • Rare word translation is hard to train
     • The model tends to output synonyms guided by the language model:
       Source:    deux jeunes filles sont assises dans la rue , mangeant du maïs .
       Reference: two young girls are sitting on the street eating corn .
       NMT:       two young girls are sitting on the street eating food .
  4. Previous Solutions
     • Parallel corpora without images [Elliott and Kádár, 2017; Grönroos et al., 2018]
       • Out-of-domain data
       • Pseudo in-domain data obtained by filtering general-domain data
     • Pseudo-parallel corpora [Sennrich et al., 2016; Helcl et al., 2018]
       • Back-translation of caption/monolingual data
       • Monolingual data
     • Pretrained word embeddings
       • Seldom studied
  5. Motivation
     • Introduce pretrained word embeddings to MMT
     • Improve rare word translation in MMT
     • Pretrained word embeddings with conventional MMT?
       • See our MT Summit 2019 paper (https://arxiv.org/abs/1905.10464)!
     • Pretrained word embeddings in text-only NMT:
       • Initialize embedding layers in encoder/decoder [Qi et al., 2018]
         ✓ Improves overall performance in low-resource domains
       • Search-based decoder with continuous output [Kumar and Tsvetkov, 2019]
         ✓ Improves rare word translation
  6. 1. Multimodal Machine Translation
     2. MMT with Embedding Prediction
     3. Pretrained Word Embedding
     4. Result & Conclusion
  7. Baseline: IMAGINATION [Elliott and Kádár, 2017]
     • MT model: Bahdanau et al., 2015
     • Multitask learning: trains both the MT task and the shared-space learning
       task to improve the shared encoder
     • The shared-space task is used only while training; only the MT path is
       used while validating and testing
  8. MMT with Embedding Prediction
     1. Use embedding prediction in the decoder
     2. Initialize embedding layers in the encoder/decoder with pretrained word embeddings
     3. Shift the visual features so that their mean vector becomes a zero vector
        (applied while training; steps 1 and 2 also apply while validating and testing)
  9. Embedding Prediction (Continuous Output)
     • i.e., continuous output [Kumar and Tsvetkov, 2019]
     • Predict a word embedding and search for the nearest word:
       1. Predict the word embedding of the next word.
       2. Compute cosine similarities with each word in the pretrained word embedding.
       3. Find and output the most similar word as the system output.
     • Keep unchanged: the pretrained word embedding is NOT updated during training.
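The nearest-neighbor search in steps 2–3 can be sketched as follows (a minimal numpy sketch, not the authors' implementation; `pred_emb`, `emb_matrix`, and `vocab` are hypothetical names):

```python
import numpy as np

def predict_word(pred_emb, emb_matrix, vocab):
    """Nearest-neighbor decoding over a fixed pretrained embedding table.

    pred_emb:   (d,) embedding predicted by the decoder for the next word
    emb_matrix: (V, d) pretrained word embeddings, kept frozen during training
    vocab:      list of V words aligned with the rows of emb_matrix
    """
    # Cosine similarity between the predicted vector and every vocabulary row.
    norms = np.linalg.norm(emb_matrix, axis=1) * np.linalg.norm(pred_emb)
    sims = emb_matrix @ pred_emb / np.maximum(norms, 1e-8)
    # Output the most similar word as the system output.
    return vocab[int(np.argmax(sims))]
```

At decoding time this replaces the usual softmax over the vocabulary: the model emits a 300-dimensional vector and the word is recovered purely by similarity search against the frozen FastText table.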
  10. Embedding Layer Initialization [Qi et al., 2018]
     • Initialize the embedding layers with pretrained word embeddings
     • Fine-tune the embedding layer in the encoder
     • DO NOT update the embedding layer in the decoder
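In PyTorch (which nmtpytorch builds on), this asymmetric scheme can be sketched as below; `fasttext_vectors` is a random stand-in for the real 10,000 × 300 FastText matrix:

```python
import torch
import torch.nn as nn

# Stand-in for the real FastText matrix (10,000 words x 300 dims, matching
# the vocabulary size and embedding dimension reported later in the talk).
fasttext_vectors = torch.randn(10000, 300)

# Encoder embedding: initialized from FastText, then fine-tuned (freeze=False).
encoder_embedding = nn.Embedding.from_pretrained(fasttext_vectors, freeze=False)

# Decoder embedding: initialized from FastText and kept fixed (freeze=True),
# so predicted vectors are always compared against the unchanged FastText space.
decoder_embedding = nn.Embedding.from_pretrained(fasttext_vectors, freeze=True)
```

Freezing the decoder side is what keeps the nearest-neighbor search meaningful: if the table drifted during training, the target space of the embedding prediction would drift with it.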
  11. Loss Function
     • Model loss: interpolation of each task loss [Elliott and Kádár, 2017]
     • MT task: max-margin with negative sampling [Lazaridou et al., 2015]
     • Shared-space learning task: max-margin [Elliott and Kádár, 2017]
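Based on the cited works, the three loss terms can be sketched as follows (notation is mine, not the slide's: \(\hat{e}_t\) is the predicted embedding at step \(t\), \(e(w)\) the pretrained embedding of word \(w\), \(w_j^{-}\) the \(k\) negative samples, \(v\) the image vector, \(\hat{v}\) its prediction from the shared encoder, \(v'\) contrastive image vectors, and \(\gamma,\alpha\) margins):

```latex
J(\theta) = \lambda\, J_{\mathrm{MT}}(\theta) + (1 - \lambda)\, J_{\mathrm{IMG}}(\theta)

J_{\mathrm{MT}} = \sum_{t} \sum_{j=1}^{k}
  \max\!\bigl(0,\; \gamma + \cos(\hat{e}_t, e(w_j^{-})) - \cos(\hat{e}_t, e(w_t))\bigr)

J_{\mathrm{IMG}} = \sum_{v'} \max\!\bigl(0,\; \alpha - \cos(\hat{v}, v) + \cos(\hat{v}, v')\bigr)
```

With the λ = 0.99 reported in the hyperparameter slide, the MT term dominates and the shared-space term acts as a light auxiliary signal.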
  12. 1. Multimodal Machine Translation
     2. MMT with Embedding Prediction
     3. Pretrained Word Embedding
     4. Result & Conclusion
  13. Hubness Problem [Lazaridou et al., 2015]
     • Certain words (hubs) appear frequently in the neighbor lists of other words
       • Even for words that have no relationship with the hubs
     • This prevents the embedding prediction model from finding the correct
       output word: it incorrectly outputs the hub word instead
  14. All-but-the-Top [Mu and Viswanath, 2018]
     • Addresses the hubness problem in other NLP tasks
     • Debiases a pretrained word embedding based on its global bias:
       1. Shift all word embeddings so that their mean becomes a zero vector
       2. Subtract the top 5 PCA components from each shifted word embedding
     • Applied to the pretrained word embeddings for both encoder and decoder
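The two-step procedure above can be sketched in numpy (my own minimal sketch, not the authors' code; the top principal directions are taken from an SVD of the centered matrix):

```python
import numpy as np

def all_but_the_top(emb, n_components=5):
    """All-but-the-Top debiasing [Mu and Viswanath, 2018], minimal sketch.

    emb: (V, d) pretrained word embedding matrix.
    """
    # 1. Shift all embeddings so that their mean becomes a zero vector.
    centered = emb - emb.mean(axis=0)
    # 2. Subtract the projection onto the top principal components.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:n_components]                    # (n_components, d) directions
    return centered - centered @ top.T @ top   # remove the dominant directions
```

Removing the shared mean and the few dominant directions spreads the embeddings more isotropically, which is what reduces the hub words described on the previous slide.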
  15. 1. Multimodal Machine Translation
     2. MMT with Embedding Prediction
     3. Pretrained Word Embedding
     4. Result & Conclusion
  16. Implementation & Dataset
     • Implementation: based on nmtpytorch v3.0.0 [Caglayan et al., 2017]
       • Our code: https://github.com/toshohirasawa/nmtpytorch-emb-pred
     • Dataset: Multi30k (French to English)
       • Pretrained ResNet50 for the visual encoder
     • Pretrained word embedding: FastText
       • Trained on Common Crawl and Wikipedia
       • https://fasttext.cc/docs/en/crawl-vectors.html
  17. Hyperparameters
     • Model
       • RNN type: GRU
       • Dimension of hidden state: 256
       • Dimension of word embedding: 300
       • Dimension of shared space: 2048
       • Vocabulary size (French, English): 10,000
     • Training
       • λ = 0.99
       • Optimizer: Adam
       • Learning rate: 0.0004
       • Dropout rate: 0.3
  18. Word-level F-score by word frequency in the training data
     (the low-frequency bins on the left of the chart are the rare words)
     Frequency   Bahdanau et al., 2015   IMAGINATION   Ours
     1                   5.48                5.63      13.59
     2                  12.46               12.86      19.77
     3                  19.97               16.76      28.34
     4                  24.65               22.74      33.64
     5-9                32.44               33.64      38.03
     10-99              49.66               51.12      52.13
     100+               69.66               69.98      71.24
  19. Ablation w.r.t. Embedding Layers
     • Fixing the embedding layer in the decoder is essential
     • Keep the word embeddings in the input/output layers consistent
     Encoder    Decoder    Fixed   BLEU    METEOR
     FastText   FastText   Yes     53.49   43.89
     random     FastText   Yes     53.22   43.83
     FastText   random     No      51.53   43.07
     random     random     No      51.42   42.77
     FastText   FastText   No      51.42   42.88
     random     FastText   No      50.72   42.52
     Encoder/Decoder: the embedding layer is initialized with random values or
     with FastText word embeddings. Fixed (Yes/No): whether the decoder
     embedding layer is fixed or fine-tuned during training.
  20. Overall Performance
     • Our model performs better than the baselines
       • Even those with embedding layer initialization
     Model                   Validation BLEU   Test BLEU     Test METEOR
     Bahdanau et al., 2015        50.83        51.00 ± .37   42.65 ± .12
     + pretrained                 52.05        52.33 ± .66   43.42 ± .13
     IMAGINATION                  51.03        51.18 ± .16   42.80 ± .19
     + pretrained                 52.40        52.75 ± .25   43.56 ± .04
     Ours                         53.14        53.49 ± .20   43.89 ± .14
     Model (+ pretrained): applies embedding layer initialization and
     All-but-the-Top debiasing.
  21. Ablation w.r.t. Visual Features
     • Centering the visual features is required to train our model
     Visual Features   Validation BLEU   Test BLEU   Test METEOR
     Centered               53.14          53.49        43.89
     Raw                    52.65          53.27        43.91
     No                     52.97          53.25        43.91
     Visual Features (Centered/Raw/No): the model is trained with centered or
     raw visual features; "No" shows the result of a text-only NMT model with
     embedding prediction.
  22. Conclusion & Future Work
     • MMT with embedding prediction improves ...
       • Rare word translation
       • Overall performance
     • It is essential for the embedding prediction model to ...
       • Fix the embedding layer in the decoder
       • Debias the pretrained word embeddings
       • Center the visual features for multitask learning
     • Future work
       • Better training corpora for embedding learning in the MMT domain
       • Incorporate visual features into contextualized word embeddings
     Thank you!
  23. (blank slide)

  24. Translation Example
     Source:        un homme en vélo pédale devant une voûte .
     Reference:     a man on a bicycle pedals through an archway .
     Text-only NMT: a man on a bicycle pedal past an arch .
     IMAGINATION:   a man on a bicycle pedals outside a monument .
     Ours:          a man on a bicycle pedals in front of a archway .
  25. Translation Example (long)
     Source:        quatre hommes , dont trois portent des kippas , sont assis sur un tapis à motifs bleu et vert olive .
     Reference:     four men , three of whom are wearing prayer caps , are sitting on a blue and olive green patterned mat .
     Text-only NMT: four men , three of whom are wearing aprons , are sitting on a blue and green speedo carpet .
     IMAGINATION:   four men , three of them are wearing alaska , are sitting on a blue patterned carpet and green green seating .
     Ours:          four men , three are wearing these are wearing these are sitting on a blue and green patterned mat .