
Huang et al. 2020 Unsupervised Multimodal Neural Machine Translation with Pseudo Visual Pivoting

tosho
June 10, 2020


1. Summary
   • Unsupervised multimodal machine translation
   • Train: L1-Img, L2-Img (usually no image overlap)
   • Test: L1-Img-L2
   • Introduces three new losses:
     • Multilingual Visual-Semantic Embedding (VSE)
     • Pivoted Captioning for Back-Translation (CBT)
     • Pivoted Captioning for Paired-Translation (CPT)
   • SOTA for unsupervised multimodal MT on Multi30K
2. Unsupervised Machine Translation
   • Common principles for unsupervised MT:
     • A pre-training step is essential
       • Masked language model [Conneau and Lample, 2019]
       • Span-based seq-to-seq masking [Song et al., 2019]
     • Back-translation (BT) loss, sketched after this list

   Approach        | Train/Val | Test  | Loss
   Supervised MT   | L1-L2     | L1-L2 | MT loss
   Unsupervised MT | L1, L2    | L1-L2 | BT loss

   • g∗, h∗: sentence predictors
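A minimal sketch of one back-translation step, assuming PyTorch; the names translate_s2t, translate_t2s, and nll are hypothetical placeholders, not the paper's API:

```python
import torch

def back_translation_step(src_batch, translate_s2t, translate_t2s, nll):
    # 1) Translate the monolingual source batch into the target language with
    #    the current model; the output is treated as (pseudo) data, so no
    #    gradients flow through this step.
    with torch.no_grad():
        pseudo_tgt = translate_s2t(src_batch)
    # 2) Train the reverse direction to reconstruct the original sentences,
    #    i.e. minimize -log p(src_batch | pseudo_tgt).
    return nll(translate_t2s, src=pseudo_tgt, tgt=src_batch)
```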
3. Unsupervised Multimodal Machine Translation

   Approach                   | Train/Val      | Test      | Loss
   Supervised MT              | L1-L2          | L1-L2     | MT loss
   Unsupervised MT            | L1, L2         | L1-L2     | BT loss
   Supervised multimodal MT   | L1-Img-L2      | L1-Img-L2 | MT loss (+ auxiliary loss from L1-Img)
   Unsupervised multimodal MT | L1-Img, L2-Img | L1-Img-L2 | BT loss + VSE + CBT + CPT

   • g∗, h∗: sentence predictors
4. UMMT: Multilingual Visual-Semantic Embedding
   • Intuition: align the latent spaces of the source and target languages using images as a pivot
   • VSE objective: max-margin loss with negative sampling, sketched after this list
   • Computes the similarity of two sequences
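A minimal PyTorch sketch of a max-margin VSE loss with in-batch negative sampling. It is shown for one language-image pair, whereas the paper's multilingual VSE aligns both languages with the shared visual space; the margin value is an assumption:

```python
import torch

def vse_max_margin_loss(sent_emb, img_emb, margin=0.1):
    """Max-margin VSE loss with in-batch negatives.

    sent_emb, img_emb: (B, D) L2-normalized embeddings of matched
    sentence-image pairs; every other item in the batch is a negative.
    """
    scores = sent_emb @ img_emb.t()        # (B, B) pairwise similarities
    pos = scores.diag().unsqueeze(1)       # (B, 1) matched-pair scores
    # hinge in both retrieval directions
    cost_img = (margin + scores - pos).clamp(min=0)       # negatives: other images
    cost_sent = (margin + scores - pos.t()).clamp(min=0)  # negatives: other sentences
    mask = torch.eye(scores.size(0), dtype=torch.bool)    # ignore the positives
    return cost_img.masked_fill(mask, 0).sum() + cost_sent.masked_fill(mask, 0).sum()
```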
5. UMMT: Pivoted Captioning for Back-Translation (CBT)
   • Intuition: reconstruct pseudo image captions
   • Pre-train captioning models on a large-scale dataset
   • Train two models (c_x, c_y) on disjoint subsets
   • Objective: (equation on slide; sketched after this list)
   • The gold sentence/translation is NOT involved in CBT
   • g∗, h∗: sentence predictors; c∗: caption predictors
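A rough sketch of one CBT step under the same assumptions as the earlier BT sketch (c_x/c_y are the pretrained captioners; g_xy/g_yx are hypothetical names for the two translation directions; nll is an assumed token-level negative log-likelihood helper):

```python
import torch

def pivoted_caption_bt_step(images, c_x, c_y, g_xy, g_yx, nll):
    """Images pivot into pseudo captions in both languages, which then feed
    the usual back-translation cycle. No gold sentence or gold translation
    is used anywhere in this loss."""
    with torch.no_grad():
        cap_x = c_x(images)   # pseudo captions in language x
        cap_y = c_y(images)   # pseudo captions in language y
        bt_y = g_xy(cap_x)    # translate pseudo x -> y (treated as data)
        bt_x = g_yx(cap_y)    # translate pseudo y -> x (treated as data)
    # reconstruct each pseudo caption from its round-trip translation
    return nll(g_yx, src=bt_y, tgt=cap_x) + nll(g_xy, src=bt_x, tgt=cap_y)
```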
6. UMMT: Pivoted Captioning for Paired-Translation (CPT)
   • Intuition: translate pseudo image captions
   • Same captioning models as CBT
   • Objective: (equation on slide; sketched after this list)
   • The gold sentence/translation is NOT involved in CPT
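A matching sketch for CPT, with the same assumed helpers as above: the two pseudo captions of the *same* image serve as a pseudo-parallel pair for an ordinary supervised MT loss:

```python
import torch

def pivoted_caption_pt_step(images, c_x, c_y, g_xy, g_yx, nll):
    """The image acts as the pivot: its caption in x and its caption in y
    are treated as if they were a parallel sentence pair."""
    with torch.no_grad():
        cap_x, cap_y = c_x(images), c_y(images)   # pseudo-parallel pair
    return nll(g_xy, src=cap_x, tgt=cap_y) + nll(g_yx, src=cap_y, tgt=cap_x)
```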
7. UMMT: Loss overview
   • Minimizes the joint loss
   • No interpolation weight for each loss
   • BUT the weight on the CPT loss decreases according to a scheduler (sketched after this list):
     • decreased from 1.0 to 0.1 at the 10th epoch
     • avoids training on noisy captions in the later stage of training
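The slide only gives the endpoints (1.0 to 0.1 at the 10th epoch), so whether the drop is a step or a gradual decay is not specified; a minimal step schedule is assumed here:

```python
def cpt_weight(epoch, start=1.0, end=0.1, switch_epoch=10):
    """Weight on the CPT loss: high early on, then reduced so that noisy
    pseudo captions matter less late in training (step schedule assumed)."""
    return start if epoch < switch_epoch else end
```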
8. Experiments: Dataset
   • English -> {German, French}
   • Multi30K
     • Train: 29K, Val: 1K, Test: 1K
   • Multi30K-half
     • Train: 14,500 (En-Img) + 14,500 ({De, Fr}-Img), no overlap
     • Validation: 507 (En-Img) + 507 ({De, Fr}-Img), no overlap
     • Test: 1,000 (En-Img-{De, Fr})
   • Preprocessing (BPE sketched after this list)
     • Sentences: tokenization, Byte Pair Encoding
     • Images: Faster R-CNN, 36 objects per image
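The slide does not name the BPE implementation; purely as an illustration, subword segmentation could be set up with the sentencepiece library (file names and vocabulary size are assumptions):

```python
import sentencepiece as spm

# Train a BPE model on the English side (illustrative file name / vocab size).
spm.SentencePieceTrainer.train(
    input="multi30k.train.en", model_prefix="bpe_en",
    vocab_size=10000, model_type="bpe")

sp = spm.SentencePieceProcessor(model_file="bpe_en.model")
print(sp.encode("two dogs play in the snow", out_type=str))  # subword pieces
```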
9. Experiments: Model
   • Transformer
     • 6 layers, 8 heads, 1024 hidden units, 4096 feed-forward filter size
   • Multimodal Transformer
     • Hierarchical multi-head multimodal attention [Libovicky and Helcl, 2017], sketched after this list:
       1. Compute two individual context vectors from the encoder states and the visual features
       2. Map both into a space with the same number of units
       3. Compute attention over the textual and visual context vectors
       4. Take the weighted sum
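A rough PyTorch sketch of the four steps above; the dimensions, head count, and the use of nn.MultiheadAttention are assumptions of this sketch, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class HierarchicalMultimodalAttention(nn.Module):
    """Hierarchical attention combination in the spirit of
    [Libovicky and Helcl, 2017]: attend per modality, project,
    then attend over the modality contexts."""
    def __init__(self, d_model, d_text, d_img, n_heads=8):
        super().__init__()
        # (1) one attention per modality
        self.text_attn = nn.MultiheadAttention(d_model, n_heads,
                                               kdim=d_text, vdim=d_text,
                                               batch_first=True)
        self.img_attn = nn.MultiheadAttention(d_model, n_heads,
                                              kdim=d_img, vdim=d_img,
                                              batch_first=True)
        # (2) projections into a shared space
        self.proj_text = nn.Linear(d_model, d_model)
        self.proj_img = nn.Linear(d_model, d_model)
        # (3) scalar score per modality context
        self.modality_score = nn.Linear(d_model, 1)

    def forward(self, query, text_states, img_feats):
        ctx_t, _ = self.text_attn(query, text_states, text_states)  # (B, T, D)
        ctx_i, _ = self.img_attn(query, img_feats, img_feats)       # (B, T, D)
        ctx = torch.stack([self.proj_text(ctx_t),
                           self.proj_img(ctx_i)], dim=2)            # (B, T, 2, D)
        w = torch.softmax(self.modality_score(ctx), dim=2)          # (B, T, 2, 1)
        return (w * ctx).sum(dim=2)                                 # (4) weighted sum
```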
10. Experiments: Pre-training
    • Pre-train the Transformer model
      • Dataset: WMT News Crawl from 2007 to 2017
      • 10M sentences each for English/German/French
      • Objective: masked seq-to-seq objective (sketched after this list)
    • Pre-train the captioning model
      • Dataset: MS-COCO
      • 56,643 images and 283,215 captions for English
      • German/French translations generated with Google Translate
      • Objective: (equation on slide)
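A minimal sketch of span-based seq-to-seq masking in the spirit of [Song et al., 2019]: a contiguous span of the encoder input is replaced by a mask token and the decoder learns to generate exactly that span. The mask id, 50% ratio, and uniform span placement are assumptions:

```python
import random
import torch

MASK_ID = 4  # assumed id of the [MASK] token

def mass_mask(tokens: torch.Tensor, mask_ratio: float = 0.5):
    """Return (masked encoder input, decoder target span)."""
    n = tokens.size(0)
    span = max(1, int(n * mask_ratio))
    start = random.randint(0, n - span)
    src = tokens.clone()
    src[start:start + span] = MASK_ID
    tgt = tokens[start:start + span]  # decoder target: the masked span
    return src, tgt

src, tgt = mass_mask(torch.arange(10, 20))  # toy usage on a 10-token sentence
```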
11. Experiments: Evaluation
    • BLEU (multi-bleu from Moses)
    • METEOR
    • Model selection: BLEU score of "round-trip" translation [Lample+, 2018], sketched after this list
      • source -> target -> source -> evaluate
      • target -> source -> target -> evaluate
      • empirically shown to correlate well with the test metrics
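A small sketch of the round-trip criterion, scoring with the sacrebleu library rather than Moses' multi-bleu; s2t and t2s are assumed to map a list of strings to a list of strings:

```python
import sacrebleu

def round_trip_bleu(sents, s2t, t2s):
    """Translate source -> target -> source with the current models and
    score the reconstruction against the original monolingual sentences."""
    reconstructed = t2s(s2t(sents))
    return sacrebleu.corpus_bleu(reconstructed, [sents]).score
```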
12. Results: Generalizability
    • Train with images, test without images
    • VSE is the key component for exploiting visual information
    • The full model is more sensitive to the missing images at test time (-0.65 BLEU) than the model w/o VSE (-0.25 BLEU)
    • (Table: results with images)
13. Results: Real-pivoting & Low-resource
    • Performance improves when training with overlapping images
    • The proposed method also works in low-resource settings
14. Results: Supervised MT
    • Supervised training
      • Multi30K (100% overlapped)
      • Supervised MT objective
    • Visual information contributes less to improving performance in supervised MT
15. Conclusion
    • Proposed pseudo visual pivoting for unsupervised multimodal MT
      • Improves the cross-lingual alignments in the shared latent space (VSE)
      • Trains on image-pivoted pseudo sentences (CBT, CPT)