Tell-and-Answer: Towards Explainable Visual Question Answering using Attributes and Captions

Q. Li, J. Fu, D. Yu et al., EMNLP 2018
2018-12-18

[Slide text lost in extraction. Surviving terms: VQA, CNN, RNN, end-to-end.]

Visual Q&A
Q: where is the man swinging the racket?
A: tennis court

Visual Q&A
Q: what kind of drink is in the glass?
A: water

Visual Q&A
Q: what is walking next to the bus?
A: cow

Visual Q&A
Q: does the man need a haircut?
A: yes

[Figure: VQA model diagram. Question: "where is the man swinging the racket?"; components: CNN, RNN; candidate answers: yes, no, water, tennis court, ⋮]

[Slide text lost in extraction. Surviving terms: end-to-end.]

Captions

[Slide text lost in extraction. Surviving terms: ver.]

[Slide text lost in extraction. Surviving terms: ResNet152, cos.]

[Slide text lost in extraction. Surviving terms: ResNet152, LSTM, (e.g. BLEU), cos.]

[Slide text lost in extraction. Surviving terms: LSTM, softmax.]

Captions

[Slide text partly lost in extraction. Surviving terms: VQA-real, 10.]
Accuracy = min(#humans who gave that answer / 3, 1)
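The standard VQA accuracy metric counts a predicted answer as fully correct when at least three human annotators gave it, and gives partial credit otherwise. The following is a minimal sketch of that rule, not the official evaluation script: the function name `vqa_accuracy` is hypothetical, matching is assumed to be a simple case-insensitive string comparison (the official script applies further answer normalization, omitted here), and the ten human answers are made up for illustration.

```python
from collections import Counter


def vqa_accuracy(predicted, human_answers):
    """Return min(#humans who gave `predicted` / 3, 1).

    Assumption: answers match if they are equal after trimming
    whitespace and lowercasing.
    """
    counts = Counter(a.strip().lower() for a in human_answers)
    matches = counts[predicted.strip().lower()]
    return min(matches / 3.0, 1.0)


# Made-up annotations: ten human answers for one question.
humans = ["tennis court"] * 6 + ["court"] * 3 + ["outside"]

print(vqa_accuracy("tennis court", humans))  # six annotators agree
print(vqa_accuracy("outside", humans))       # one annotator agrees
print(vqa_accuracy("beer", humans))          # nobody agrees
```

Capping the score at 1 means an answer given by three annotators scores the same as one given by all ten, which tolerates disagreement among annotators on open-ended questions.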

[Two comparison slides. Surviving terms: VQA, vs.; the compared items were lost in extraction.]

[Slide text lost in extraction. Surviving terms: VQA, NULL.]

[Slide text lost in extraction. Surviving terms: VQA.]

Attributes: tennis, ball, man, racket, hit, court, play, player, swing, hold
Caption: a man holding a tennis racket on a tennis court.
Predicted answer: tennis court
Q: where is the man swinging the racket?
A: tennis court

Attributes: bicycle, man, sit, eat, bike, look, outside, food, person, table
Caption: a man sitting at a table with a plate of food.
Predicted answer: beer
Q: what kind of drink is in the glass?
A: water

Attributes: street, bus, cow, city, walk, car, drive, stand, road, white
Caption: a cow that is walking in the street.
Predicted answer: car
Q: what is walking next to the bus?
A: cow

Attributes: woman, bear, teddy, hold, sit, glass, animal, large, lady
Caption: a woman holding a sandwich in her hands.
Predicted answer: yes
Q: does the man need a haircut?
A: yes

[Slide text lost in extraction. Surviving values: 30%, 65%, yes/no, 80%.]

VQA

Captions

[Slide text lost in extraction. Surviving terms: VQA.]


onizuka laboratory
