Caglayan et al. - NAACL 2019 - Probing the Need for Visual Context in Multimodal Machine Translation

tosho
May 16, 2019

Transcript

  1. Probing the Need for Visual Context in Multimodal Machine Translation

    Ozan Caglayan, Pranava Madhyastha, Lucia Specia, Loïc Barrault. NAACL 2019 Best Short Paper. Presented by hirasawa-tosho@ed.tmu.ac.jp (M1).
  2. (Slide title and text are Japanese rendered as mojibake in this transcript; content not recoverable.)
  3. (Background slide; Japanese text garbled in this transcript and not recoverable.)
  4. Related Work: Multimodal MT Architectures

    • Multimodal attention using convolutional features [Caglayan et al., 2016; Calixto et al., 2016; Libovický and Helcl, 2017; Helcl et al., 2018]
    • Cross-modal interactions with spatially-unaware global features [Calixto and Liu, 2017; Ma et al., 2017; Caglayan et al., 2017a; Madhyastha et al., 2017]
    • The integration of regional features from object detection networks [Huang et al., 2016; Grönroos et al., 2018]
  5. Does the Image Actually Help?

    • Reported gains from visual context are "modest" [Grönroos et al., 2018]
    • Images do help on the Multimodal WSD (lexical disambiguation) task [Lala et al., 2018]
    • Findings from the WMT multimodal shared task [Barrault et al., 2018]
    • Models appear insensitive to incongruent (mismatched) images [Elliott, 2018]
  6. Why So Little Benefit? The Dataset

    • Hypothesis: Multi30k sentences are short and simple, so the source text alone is often sufficient
    • Multi30k example:
      • EN: Two young girls are sitting on the street eating food.
      • DE: Zwei junge mädchen sitzen auf der straße und essen mais.
      • The German translation specifies "mais" (corn), which is not recoverable from the English source without looking at the image
    • This work probes whether models exploit the image once the source text is degraded
  7. Input Degradation

    • Color Deprivation (train: 3.3%, test: 3.1% of words affected)
      • Replace color words in the source with a special token
    • Entity Masking (train: 26.2%, test: 26.2%)
      • Replace visually depictable entities [Plummer et al., 2015] in the source with the special token
    • Progressive Masking
      • Keep only the first k source tokens and replace the rest with the special token
      • Simulates a low-resource, highly ambiguous source setting
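The three degradation schemes above can be sketched as simple token-replacement functions. This is an illustrative reconstruction, not the paper's code: the token name `[v]`, the color list, and the span format for entity annotations are all assumptions.

```python
# Sketch of the three source-degradation schemes described above.
# The "[v]" token and the word lists are illustrative, not the
# paper's exact implementation.

MASK = "[v]"

COLORS = {"red", "green", "blue", "white", "black", "yellow"}  # illustrative subset

def color_deprivation(tokens):
    """Replace color words with the special token."""
    return [MASK if t in COLORS else t for t in tokens]

def entity_masking(tokens, entity_spans):
    """Replace visually depictable entity tokens (given as index spans,
    e.g. derived from Flickr30k Entities annotations) with the token."""
    masked = {i for start, end in entity_spans for i in range(start, end)}
    return [MASK if i in masked else t for i, t in enumerate(tokens)]

def progressive_masking(tokens, k):
    """Keep the first k tokens, mask the rest."""
    return tokens[:k] + [MASK] * (len(tokens) - k)

sent = "two young girls are sitting on the street eating food".split()
print(color_deprivation("a man in a red shirt".split()))
print(entity_masking(sent, [(2, 3), (9, 10)]))  # mask "girls" and "food"
print(progressive_masking(sent, 4))
```

Each function leaves sentence length unchanged, so alignment with the image and with the target side is preserved.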
  8. Input Degradation: Example

    (Slide content not recoverable from the transcript.)

  9. Visual Sensitivity

    • How can we tell whether a model actually uses the image?
    • Incongruent decoding: at test time, feed visual features taken from a different, mismatched image
    • If the model is grounded in the image, translation quality should drop under incongruent decoding; if it ignores the image, quality stays the same
  10. Dataset

    • Multi30k
    • Preprocessing: lowercase, normalize, tokenize using Moses [Koehn et al., 2007]
    • Language pair: English to French
    • Word-level vocabularies: 9,951 English and 11,216 French words (BPE is not applied)
    • Split usage per degradation task:

      Multi30k split   Color Depr. / Prog. Masking   Entity Masking
      train            train                         train
      val              train                         dev
      test2016         dev                           test
      test2017         test                          -
  11. Visual Features

    • Feature extractor: ResNet-50 CNN [He et al., 2016] trained on ImageNet [Deng et al., 2009]
    • Spatial features: final convolutional layer [Caglayan et al., 2018]
      • L2-normalized; size 2048 x 8 x 8
    • Global feature: pool5 layer
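The L2 normalization of the spatial tensor treats each of the 8x8 positions as a 2048-dim vector and scales it to unit norm. A pure-Python sketch on a tiny stand-in tensor (the list-of-lists layout and `eps` guard are illustrative choices, not the paper's implementation):

```python
import math

# Sketch of the L2 normalization applied to the spatial feature tensor:
# each spatial position holds a channel vector, which is scaled to unit
# L2 norm. Nested lists stand in for the real 2048 x 8 x 8 tensor.

def l2_normalize(vec, eps=1e-12):
    norm = math.sqrt(sum(x * x for x in vec)) + eps
    return [x / norm for x in vec]

def normalize_spatial(features):
    """features: [H][W][C] nested lists; normalize each position's channel vector."""
    return [[l2_normalize(pos) for pos in row] for row in features]

# Tiny stand-in: 2x2 grid with 3 channels instead of 8x8 with 2048.
feats = [[[3.0, 4.0, 0.0], [1.0, 0.0, 0.0]],
         [[0.0, 2.0, 0.0], [1.0, 1.0, 1.0]]]
normed = normalize_spatial(feats)
print(normed[0][0])  # approximately [0.6, 0.8, 0.0]
```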
  12. Models

    • baseline NMT [Bahdanau et al., 2014]
      • 2-layer bidirectional GRU encoder
      • 2-layer conditional GRU decoder
    • DIRECT [Caglayan et al., 2016]
      • basic multimodal attention
    • HIER [Libovický and Helcl, 2017]
      • hierarchical extension of DIRECT
    • INIT [Calixto and Liu, 2017; Caglayan et al., 2017a]
      • encoder-decoder initialization with the visual feature
  13. baseline NMT [Bahdanau et al., 2014]

    • RNN encoder-decoder model
    • Attention mechanism over source annotations [Bahdanau et al., 2014]
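The attention step can be illustrated compactly: score each encoder state against the current decoder state, normalize with a softmax, and take the weighted sum as the context. A toy sketch with scalar "states"; the negative-distance score stands in for the learned MLP score of Bahdanau et al. (2014), so only the shape of the computation is faithful:

```python
import math

# Toy additive-attention sketch: score, softmax, weighted sum.
# Scalars stand in for vectors; real models use learned projections.

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    # Toy score: negative distance replaces the learned
    # v^T tanh(W_a s + U_a h_j) of Bahdanau et al. (2014).
    scores = [-abs(decoder_state - h) for h in encoder_states]
    weights = softmax(scores)
    context = sum(w * h for w, h in zip(weights, encoder_states))
    return context, weights

context, weights = attend(0.9, [0.1, 0.5, 1.0])
print(weights)  # largest weight on the state closest to the query
```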
  14. DIRECT [Caglayan et al., 2016]

    (Architecture figure from [Caglayan et al., 2016].)
  15. HIER [Libovický and Helcl, 2017]

    • Extends DIRECT by making the fusion of modalities hierarchical
    • A separate attention is computed over each encoder (text and image)
    • A second attention then attends over the resulting per-modality context vectors to produce the final context [Libovický and Helcl, 2017]
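The two-level structure can be sketched by reusing the same toy attention at both levels: once within each modality, then once over the two modality contexts. As before, scalars stand in for vectors and the distance-based score is an illustrative substitute for the learned scoring function:

```python
import math

# Sketch of hierarchical fusion: attend within each modality first,
# then attend over the per-modality context vectors.

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, states):
    weights = softmax([-abs(query - s) for s in states])
    return sum(w * s for w, s in zip(weights, states))

def hier_context(decoder_state, text_states, image_states):
    # First level: one context vector per modality.
    text_ctx = attend(decoder_state, text_states)
    img_ctx = attend(decoder_state, image_states)
    # Second level: attention over the two modality contexts.
    return attend(decoder_state, [text_ctx, img_ctx])

print(hier_context(0.7, [0.2, 0.8], [0.4, 0.9]))
```

The second-level weights let the decoder modulate how much the image contributes at each target position, which is the motivation for HIER over the flat DIRECT fusion.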
  16. INIT [Calixto and Liu, 2017; Caglayan et al., 2017a]

    (Architecture figure from [Calixto and Liu, 2017].)
  17. Hyperparameters

    • Hidden units of encoder and decoder GRUs: 400
    • Embedding size: 200, tied embeddings [Press and Wolf, 2016]
    • Dropout: 0.4 (source embeddings), 0.5 (encoder/decoder outputs)
    • Optimizer: ADAM, learning rate 0.0004
    • Batch size: 64
    • Norm clipping: 1
    • Early stopping on METEOR with a patience of 10 epochs
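For reference, the settings above collected as a config dict. The values are transcribed from the slide; the key names and dict layout are illustrative, not taken from any particular toolkit:

```python
# Hyperparameters from the slide, as an illustrative config dict.
config = {
    "enc_hidden": 400,          # encoder GRU hidden units
    "dec_hidden": 400,          # decoder GRU hidden units
    "emb_size": 200,
    "tied_embeddings": True,    # Press and Wolf (2016)
    "dropout_src_emb": 0.4,
    "dropout_enc_dec_out": 0.5,
    "optimizer": "adam",
    "lr": 4e-4,
    "batch_size": 64,
    "grad_norm_clip": 1.0,
    "early_stop_metric": "meteor",
    "patience_epochs": 10,
}
print(config["lr"])
```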
  18. Results: General and Color Deprivation

    • With the full, undegraded source, the multimodal models perform on par with the NMT baseline
    • Under Color Deprivation, the multimodal models recover the missing colors from the image:
      • +1.6 METEOR (HIER vs NMT)
      • +12% color accuracy (HIER vs NMT)
      • +4% color accuracy (DIRECT vs NMT)
  19. Results: Entity Masking

    • The gap between multimodal and text-only models widens substantially
    • The multimodal models clearly beat the baseline in METEOR
    • Incongruent decoding causes a large drop for the multimodal models, showing they are genuinely grounded in the image
  20. Results: Entity Masking (Visual Attention)

    • Inspecting the attention maps over image regions
    • The entity-masked multimodal models attend to the image regions corresponding to the masked entities, unlike the baseline MMT trained on full source sentences
  21. Results: Entity Masking (Czech and German)

    • The same trends hold for English to Czech and English to German
    • Under Entity Masking, the multimodal models again outperform the text-only baseline
    • (Remaining Japanese notes, including a remark on INIT and the decoder RNN, are garbled in the transcript.)
  22. Results: Progressive Masking

    • As more of the source sentence is masked, the multimodal models degrade more gracefully than text-only NMT
    • (Further Japanese notes garbled in the transcript.)
  23. Results: Progressive Masking (Incongruent)

    • Blinding: decoding the trained models with incongruent visual features
    • The more source tokens are masked, the larger the drop caused by incongruent decoding
    • Text-only NMT is unaffected, since it never uses the image
  24. Discussion and Conclusions

    • With the full source sentence available, the visual contribution is marginal: on Multi30k the linguistic signal alone is largely sufficient
    • When the source is degraded, the models do learn to exploit the visual modality
    • Careful probing is needed before concluding that multimodal MT models ignore the image