
Oral: Multimodal Machine Translation with Embedding Prediction

tosho
June 11, 2019


Multimodal machine translation is an attractive application of neural machine translation (NMT). It helps computers to deeply understand visual objects and their relations with natural languages. However, multimodal NMT systems suffer from a shortage of available training data, resulting in poor performance for translating rare words. In NMT, pretrained word embeddings have been shown to improve NMT of low-resource domains, and a search-based approach is proposed to address the rare word problem. In this study, we effectively combine these two approaches in the context of multimodal NMT and explore how we can take full advantage of pretrained word embeddings to better translate rare words. We report overall performance improvements of 1.24 METEOR and 2.49 BLEU and achieve an improvement of 7.67 F-score for rare word translation.


Transcript

  1. Multimodal Machine Translation with Embedding Prediction
     Tosho Hirasawa, Hayahide Yamagishi, Yukio Matsumura, Mamoru Komachi
     hirasawa-tosho@ed.tmu.ac.jp
     Tokyo Metropolitan University
     NAACL SRW 2019
  2. Multimodal Machine Translation
     • A practical application of machine translation
     • Translates a source sentence along with related nonlinguistic information,
       such as visual information
     Example:
       EN: two young girls are sitting on the street eating corn .
       FR: deux jeunes filles sont assises dans la rue , mangeant du maïs .
     6/11/19 NAACL SRW 2019, Minneapolis
  3. Issue of MMT
     • Multi30k [Elliott et al., 2016] has only a small amount of data
     • Statistics of the training data:
                 Sentences   Tokens   Types
       English      29,000  377,534  10,210
       French       29,000  409,845  11,219
     • Rare word translation is hard to train
     • The model tends to output synonyms guided by the language model:
       Source:    deux jeunes filles sont assises dans la rue , mangeant du maïs .
       Reference: two young girls are sitting on the street eating corn .
       NMT:       two young girls are sitting on the street eating food .
  4. Previous Solutions
     • Parallel corpora without images [Elliott and Kádár, 2017; Grönroos et al., 2018]
       • Out-of-domain data
       • Pseudo in-domain data obtained by filtering general-domain data
     • Pseudo-parallel corpora [Sennrich et al., 2016; Helcl et al., 2018]
       • Back-translation of caption/monolingual data
       • Monolingual data
     • Pretrained word embeddings
       • Seldom studied
  5. Motivation
     • Introduce pretrained word embeddings to MMT
     • Improve rare word translation in MMT
     • Pretrained word embeddings with conventional MMT?
       • See our MT Summit 2019 paper (https://arxiv.org/abs/1905.10464)!
     • Pretrained word embeddings in text-only NMT:
       • Initialize embedding layers in encoder/decoder [Qi et al., 2018]
         ✓ Improves overall performance in low-resource domains
       • Search-based decoder with continuous output [Kumar and Tsvetkov, 2019]
         ✓ Improves rare word translation
  6. 1. Multimodal Machine Translation
     2. MMT with Embedding Prediction
     3. Pretrained Word Embedding
     4. Result & Conclusion
  7. Baseline: IMAGINATION [Elliott and Kádár, 2017]
     • MT model: Bahdanau et al., 2015
     • Multitask learning: trains both the MT task and the shared-space learning
       task to improve the shared encoder
     • The shared-space task is used only while training; only the MT path is
       used while validating and testing
  8. MMT with Embedding Prediction
     1. Use embedding prediction in the decoder
     2. Initialize embedding layers in the encoder/decoder with pretrained word embeddings
     3. Shift the visual features so that their mean vector becomes a zero vector
        (applied while training; steps 1 and 2 also apply while validating and testing)
  9. Embedding Prediction (Continuous Output)
     • i.e., continuous output [Kumar and Tsvetkov, 2019]
     • Predict a word embedding and search for the nearest word:
       1. Predict the word embedding of the next word.
       2. Compute cosine similarities with each word in the pretrained word embedding.
       3. Find and output the most similar word as the system output.
     • Keep unchanged: the pretrained word embedding is NOT updated during training.
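The nearest-neighbor search in steps 2–3 can be sketched as follows (a minimal numpy sketch, not the authors' implementation; `pred_emb`, `emb_matrix`, and `vocab` are hypothetical names):

```python
import numpy as np

def predict_word(pred_emb, emb_matrix, vocab):
    """Nearest-neighbor decoding over a fixed pretrained embedding table.

    pred_emb:   (d,) embedding predicted by the decoder for the next word
    emb_matrix: (V, d) pretrained word embeddings, kept frozen during training
    vocab:      list of V words aligned with the rows of emb_matrix
    """
    # Cosine similarity between the predicted vector and every vocabulary row.
    norms = np.linalg.norm(emb_matrix, axis=1) * np.linalg.norm(pred_emb)
    sims = emb_matrix @ pred_emb / np.maximum(norms, 1e-8)
    # Output the most similar word as the system output.
    return vocab[int(np.argmax(sims))]
```

At decoding time this replaces the usual softmax over the vocabulary: the model emits a 300-dimensional vector and the word is recovered purely by similarity search against the frozen FastText table.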
  10. Embedding Layer Initialization [Qi et al., 2018]
     • Initialize the embedding layers with pretrained word embeddings
     • Fine-tune the embedding layer in the encoder
     • DO NOT update the embedding layer in the decoder
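In PyTorch (which nmtpytorch builds on), this asymmetric scheme can be sketched as below; `fasttext_vectors` is a random stand-in for the real 10,000 × 300 FastText matrix:

```python
import torch
import torch.nn as nn

# Stand-in for the real FastText matrix (10,000 words x 300 dims, matching
# the vocabulary size and embedding dimension reported later in the talk).
fasttext_vectors = torch.randn(10000, 300)

# Encoder embedding: initialized from FastText, then fine-tuned (freeze=False).
encoder_embedding = nn.Embedding.from_pretrained(fasttext_vectors, freeze=False)

# Decoder embedding: initialized from FastText and kept fixed (freeze=True),
# so predicted vectors are always compared against the unchanged FastText space.
decoder_embedding = nn.Embedding.from_pretrained(fasttext_vectors, freeze=True)
```

Freezing the decoder side is what keeps the nearest-neighbor search meaningful: if the table drifted during training, the target space of the embedding prediction would drift with it.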
  11. Loss Function
     • Model loss: interpolation of each task loss [Elliott and Kádár, 2017]
     • MT task: max-margin with negative sampling [Lazaridou et al., 2015]
     • Shared-space learning task: max-margin [Elliott and Kádár, 2017]
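Based on the cited works, the three loss terms can be sketched as follows (notation is mine, not the slide's: \(\hat{e}_t\) is the predicted embedding at step \(t\), \(e(w)\) the pretrained embedding of word \(w\), \(w_j^{-}\) the \(k\) negative samples, \(v\) the image vector, \(\hat{v}\) its prediction from the shared encoder, \(v'\) contrastive image vectors, and \(\gamma,\alpha\) margins):

```latex
J(\theta) = \lambda\, J_{\mathrm{MT}}(\theta) + (1 - \lambda)\, J_{\mathrm{IMG}}(\theta)

J_{\mathrm{MT}} = \sum_{t} \sum_{j=1}^{k}
  \max\!\bigl(0,\; \gamma + \cos(\hat{e}_t, e(w_j^{-})) - \cos(\hat{e}_t, e(w_t))\bigr)

J_{\mathrm{IMG}} = \sum_{v'} \max\!\bigl(0,\; \alpha - \cos(\hat{v}, v) + \cos(\hat{v}, v')\bigr)
```

With the λ = 0.99 reported in the hyperparameter slide, the MT term dominates and the shared-space term acts as a light auxiliary signal.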
  12. 1. Multimodal Machine Translation
     2. MMT with Embedding Prediction
     3. Pretrained Word Embedding
     4. Result & Conclusion
  13. Hubness Problem [Lazaridou et al., 2015]
     • Certain words (hubs) appear frequently in the neighbor lists of other words
       • Even for words that have no relationship with the hubs
     • This prevents the embedding prediction model from finding the correct
       output word: it incorrectly outputs the hub word instead
  14. All-but-the-Top [Mu and Viswanath, 2018]
     • Addresses the hubness problem in other NLP tasks
     • Debiases a pretrained word embedding based on its global bias:
       1. Shift all word embeddings so that their mean becomes a zero vector
       2. Subtract the top 5 PCA components from each shifted word embedding
     • Applied to the pretrained word embeddings for both encoder and decoder
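The two-step procedure above can be sketched in numpy (my own minimal sketch, not the authors' code; the top principal directions are taken from an SVD of the centered matrix):

```python
import numpy as np

def all_but_the_top(emb, n_components=5):
    """All-but-the-Top debiasing [Mu and Viswanath, 2018], minimal sketch.

    emb: (V, d) pretrained word embedding matrix.
    """
    # 1. Shift all embeddings so that their mean becomes a zero vector.
    centered = emb - emb.mean(axis=0)
    # 2. Subtract the projection onto the top principal components.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:n_components]                    # (n_components, d) directions
    return centered - centered @ top.T @ top   # remove the dominant directions
```

Removing the shared mean and the few dominant directions spreads the embeddings more isotropically, which is what reduces the hub words described on the previous slide.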
  15. 1. Multimodal Machine Translation
     2. MMT with Embedding Prediction
     3. Pretrained Word Embedding
     4. Result & Conclusion
  16. Implementation & Dataset
     • Implementation: based on nmtpytorch v3.0.0 [Caglayan et al., 2017]
       • Our code: https://github.com/toshohirasawa/nmtpytorch-emb-pred
     • Dataset: Multi30k (French to English)
       • Pretrained ResNet50 for the visual encoder
     • Pretrained word embedding: FastText
       • Trained on Common Crawl and Wikipedia
       • https://fasttext.cc/docs/en/crawl-vectors.html
  17. Hyperparameters
     • Model
       • RNN type: GRU
       • Dimension of hidden state: 256
       • Dimension of word embedding: 300
       • Dimension of shared space: 2048
       • Vocabulary size (French, English): 10,000
     • Training
       • λ = 0.99
       • Optimizer: Adam
       • Learning rate: 0.0004
       • Dropout rate: 0.3
  18. Word-level F-score by word frequency in the training data
     (the low-frequency bins on the left of the chart are the rare words)
     Frequency   Bahdanau et al., 2015   IMAGINATION   Ours
     1                   5.48                5.63      13.59
     2                  12.46               12.86      19.77
     3                  19.97               16.76      28.34
     4                  24.65               22.74      33.64
     5-9                32.44               33.64      38.03
     10-99              49.66               51.12      52.13
     100+               69.66               69.98      71.24
  19. Ablation w.r.t. Embedding Layers
     • Fixing the embedding layer in the decoder is essential
     • Keep the word embeddings in the input/output layers consistent
     Encoder    Decoder    Fixed   BLEU    METEOR
     FastText   FastText   Yes     53.49   43.89
     random     FastText   Yes     53.22   43.83
     FastText   random     No      51.53   43.07
     random     random     No      51.42   42.77
     FastText   FastText   No      51.42   42.88
     random     FastText   No      50.72   42.52
     Encoder/Decoder: the embedding layer is initialized with random values or
     with FastText word embeddings. Fixed (Yes/No): whether the decoder
     embedding layer is fixed or fine-tuned during training.
  20. Overall Performance
     • Our model performs better than the baselines
       • Even those with embedding layer initialization
     Model                   Validation BLEU   Test BLEU     Test METEOR
     Bahdanau et al., 2015        50.83        51.00 ± .37   42.65 ± .12
     + pretrained                 52.05        52.33 ± .66   43.42 ± .13
     IMAGINATION                  51.03        51.18 ± .16   42.80 ± .19
     + pretrained                 52.40        52.75 ± .25   43.56 ± .04
     Ours                         53.14        53.49 ± .20   43.89 ± .14
     Model (+ pretrained): applies embedding layer initialization and
     All-but-the-Top debiasing.
  21. Ablation w.r.t. Visual Features
     • Centering the visual features is required to train our model
     Visual Features   Validation BLEU   Test BLEU   Test METEOR
     Centered               53.14          53.49        43.89
     Raw                    52.65          53.27        43.91
     No                     52.97          53.25        43.91
     Visual Features (Centered/Raw/No): the model is trained with centered or
     raw visual features; "No" shows the result of a text-only NMT model with
     embedding prediction.
  22. Conclusion & Future Work
     • MMT with embedding prediction improves ...
       • Rare word translation
       • Overall performance
     • It is essential for the embedding prediction model to ...
       • Fix the embedding layer in the decoder
       • Debias the pretrained word embeddings
       • Center the visual features for multitask learning
     • Future work
       • Better training corpora for embedding learning in the MMT domain
       • Incorporate visual features into contextualized word embeddings
     Thank you!
  23. (blank slide)

  24. Translation Example
     Source:        un homme en vélo pédale devant une voûte .
     Reference:     a man on a bicycle pedals through an archway .
     Text-only NMT: a man on a bicycle pedal past an arch .
     IMAGINATION:   a man on a bicycle pedals outside a monument .
     Ours:          a man on a bicycle pedals in front of a archway .
  25. Translation Example (long)
     Source:        quatre hommes , dont trois portent des kippas , sont assis sur un tapis à motifs bleu et vert olive .
     Reference:     four men , three of whom are wearing prayer caps , are sitting on a blue and olive green patterned mat .
     Text-only NMT: four men , three of whom are wearing aprons , are sitting on a blue and green speedo carpet .
     IMAGINATION:   four men , three of them are wearing alaska , are sitting on a blue patterned carpet and green green seating .
     Ours:          four men , three are wearing these are wearing these are sitting on a blue and green patterned mat .