Caglayan et al. - NAACL 2019 - Probing the Need for Visual Context in Multimodal Machine Translation

Probing the Need for Visual Context in Multimodal Machine Translation
Ozan Caglayan, Pranava Madhyastha, Lucia Specia, Loïc Barrault
NAACL 2019 Best Short Paper
M1, 5/16/19


• Multimodal attention using convolutional features [Caglayan et al., 2016; Calixto et al., 2016; Libovický and Helcl, 2017; Helcl et al., 2018]
• Cross-modal interactions with spatially-unaware global features [Calixto and Liu, 2017; Ma et al., 2017; Caglayan et al., 2017a; Madhyastha et al., 2017]
• The integration of regional features from object detection networks [Huang et al., 2016; Grönroos et al., 2018]

• “modest” [Grönroos et al., 2018]
• Multimodal WSD task [Lala et al., 2018]
• [Barrault et al., 2018]
• [Elliott, 2018]

• Multi30k
• English: Two young girls are sitting on the street eating food.
• German: Zwei junge mädchen sitzen auf der straße und essen mais.

Input Degradation
• Color Deprivation (train: 3.3%, test: 3.1%)
  • color words are replaced with a special token
• Entity Masking (train: 26.2%, test: 26.2%)
  • visually depictable entities [Plummer et al., 2015] are replaced with a special token
• Progressive Masking
  • all words after the first k are replaced with a special token
  • low-resource
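
The three degradation schemes above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the token spelling `[v]`, the small color word list, the function names, and the span format for entity annotations are all hypothetical and are not taken from the slides.

```python
from typing import List, Sequence, Set, Tuple

MASK = "[v]"  # assumed spelling of the special token

# Hypothetical color vocabulary, for illustration only.
COLOR_WORDS: Set[str] = {"red", "green", "blue", "white", "black", "yellow", "brown"}


def color_deprivation(tokens: Sequence[str], colors: Set[str] = COLOR_WORDS) -> List[str]:
    """Replace every token found in the color vocabulary with the special token."""
    return [MASK if tok.lower() in colors else tok for tok in tokens]


def entity_masking(tokens: Sequence[str], entity_spans: Sequence[Tuple[int, int]]) -> List[str]:
    """Replace tokens inside annotated entity spans [start, end) with the special token.

    The span format is an assumption; real entity annotations may differ.
    """
    masked = list(tokens)
    for start, end in entity_spans:
        for i in range(start, end):
            masked[i] = MASK
    return masked


def progressive_masking(tokens: Sequence[str], k: int) -> List[str]:
    """Keep the first k tokens and replace every later token with the special token."""
    if k < 0:
        raise ValueError("k must be non-negative")
    kept = list(tokens[:k])
    return kept + [MASK] * max(0, len(tokens) - k)
```

For example, `progressive_masking("a lady in a blue dress".split(), 2)` keeps `a lady` and masks the remaining four tokens, while `k = 0` masks the whole sentence so that only the image can carry information.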

Input Degradation: Example

Visual Sensitivity
• incongruent decoding

Dataset
• Multi30K
• lowercase, normalize, tokenize using Moses [Koehn et al., 2007]
• data splits:
  • English to French
  • 9,951 English and 11,216 French words (word-level vocabulary, no BPE)
• Degradation train/dev/test

Dataset             Color Deprivation /     Entity Masking
                    Progressive Masking
train (multi30k)    train                   train
val (multi30k)      train                   dev
test2016            dev                     test
test2017            test                    -

Visual Features
• Feature Extractor: ResNet-50 CNN [He et al., 2016]
  • trained on ImageNet [Deng et al., 2009]
• Spatial feature: final convolutional layer [Caglayan et al., 2018]
  • L2 normalization
  • size 2048 x 8 x 8 (WMT)
• Global feature: pool5 layer
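
The L2 normalization step for spatial features might look like the following pure-Python sketch. The slide does not say along which axis the norm is taken; normalizing the channel vector at each spatial position is an assumption made here, and the function name and the `eps` parameter are hypothetical.

```python
import math
from typing import List

# A feature map indexed as [channel][row][col], e.g. 2048 x 8 x 8 in the slides.
Feature = List[List[List[float]]]


def l2_normalize_channels(feat: Feature, eps: float = 1e-8) -> Feature:
    """Scale the channel vector at every spatial position to (approximately) unit L2 norm."""
    channels, rows, cols = len(feat), len(feat[0]), len(feat[0][0])
    out = [[[0.0] * cols for _ in range(rows)] for _ in range(channels)]
    for r in range(rows):
        for c in range(cols):
            # Norm of the channel vector at position (r, c).
            norm = math.sqrt(sum(feat[ch][r][c] ** 2 for ch in range(channels)))
            for ch in range(channels):
                out[ch][r][c] = feat[ch][r][c] / (norm + eps)
    return out
```

In practice this would be done with a tensor library in a single vectorized call; the nested loops are kept only to make the per-position normalization explicit.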

Models
• baseline NMT [Bahdanau et al., 2014]
  • 2-layer bidirectional GRU encoder
  • 2-layer conditional GRU decoder
• DIRECT [Caglayan et al., 2016]
  • basic multimodal attention
• HIER [Libovický and Helcl, 2017]
  • hierarchical extension of DIRECT
• INIT [Calixto and Liu, 2017; Caglayan et al., 2017a]
  • encoder-decoder initialization

baseline NMT [Bahdanau et al., 2014]
• Encoder-Decoder model
  • RNN
• Attention [Bahdanau et al., 2014]
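
The attention mechanism of the baseline can be illustrated with a small additive-attention sketch in plain Python. It is a toy version under assumptions: the parameter names (`w_enc`, `w_dec`, `v`) and helper functions are hypothetical, and a real model would use learned weights and batched tensor operations.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, x: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_attention(
    enc_states: List[Vector], dec_state: Vector,
    w_enc: Matrix, w_dec: Matrix, v: Vector,
) -> Tuple[Vector, Vector]:
    """Return (attention weights, context vector) for one decoder step.

    score_i = v . tanh(w_enc h_i + w_dec s); weights = softmax(scores);
    context = sum_i weights_i * h_i.
    """
    dec_proj = matvec(w_dec, dec_state)
    scores = []
    for h in enc_states:
        enc_proj = matvec(w_enc, h)
        hidden = [math.tanh(a + b) for a, b in zip(enc_proj, dec_proj)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)
    dim = len(enc_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, enc_states)) for d in range(dim)]
    return weights, context
```

The same scoring scheme can in principle be applied to a set of spatial image features instead of encoder states, which is the general idea behind attending to visual input.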

DIRECT [Caglayan et al., 2016]

HIER [Libovický and Helcl, 2017]

INIT [Calixto and Liu, 2017; Caglayan et al., 2017a]

Hyperparameters
• Hidden units of encoder and decoder GRUs: 400
• Embedding size: 200, tied embeddings [Press and Wolf, 2016]
• Dropout: 0.4 (source embeddings), 0.5 (encoder/decoder outputs)
• Optimizer: Adam
• Learning rate: 0.0004
• Batch size: 64
• Norm clipping: 1
• Early stopping: METEOR, 10 epochs
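
For reference, the settings listed above could be collected in a single configuration object. The values come from the slide; the class name and field names are hypothetical and do not correspond to any particular toolkit's options.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters from the slide; field names are illustrative assumptions."""
    hidden_size: int = 400                  # encoder and decoder GRU units
    embedding_size: int = 200
    tied_embeddings: bool = True
    dropout_source_embeddings: float = 0.4
    dropout_encoder_decoder_outputs: float = 0.5
    optimizer: str = "adam"
    learning_rate: float = 0.0004
    batch_size: int = 64
    grad_norm_clip: float = 1.0
    early_stop_metric: str = "meteor"
    early_stop_patience: int = 10           # epochs without improvement


config = TrainConfig()
settings = asdict(config)  # plain dict, convenient for logging or serialization
```

Freezing the dataclass keeps the settings from being changed by accident during a run.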

Results: General and Color Deprivation
• Color Deprivation
  • +1.6 METEOR (HIER vs NMT)
  • +12% color accuracy (HIER vs NMT)
  • +4% color accuracy (DIRECT vs NMT)

Results: Entity Masking
• Incongruent decoding

Results: Entity Masking (Visual Attention)
• baseline MMT
• Masked MMT

Results: Entity Masking (Czech and German)

Results: Progressive Masking

Results: Progressive Masking (Incongruent)
• Blinding
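
Incongruent decoding, as referred to in these results, pairs each source sentence with visual features from a different image at test time. A minimal Python sketch follows, assuming a simple rotation of the feature list. The function name, the `shift` parameter, and the rotation itself are illustrative assumptions and not necessarily the procedure used in the paper.

```python
from typing import List, Sequence, Tuple, TypeVar

S = TypeVar("S")  # sentence type
F = TypeVar("F")  # visual feature type


def incongruent_pairs(
    sentences: Sequence[S], features: Sequence[F], shift: int = 1
) -> List[Tuple[S, F]]:
    """Pair sentence i with the features of sample (i + shift) mod n.

    A non-zero rotation guarantees that no sentence keeps its own image.
    """
    n = len(sentences)
    if n != len(features):
        raise ValueError("sentences and features must have the same length")
    if n < 2:
        raise ValueError("need at least two samples to mismatch them")
    if shift % n == 0:
        raise ValueError("shift must not be a multiple of the dataset size")
    return [(sentences[i], features[(i + shift) % n]) for i in range(n)]
```

If a multimodal model scores noticeably lower under this pairing than under the correct one, that would suggest it is actually using the image; little or no change would suggest the visual input is being ignored.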

Discussion and Conclusions
