Wang, Wang, Wang - 2018 - Watch, Listen, and Describe: Globally and Locally Aligned Cross-Modal Attentions for Video Captioning

tosho
August 02, 2018


Transcript

  1. Watch, Listen, and Describe: Globally and Locally Aligned Cross-Modal Attentions

    for Video Captioning Xin Wang, Yuan-Fang Wang, William Yang Wang (NAACL 2018) Tosho Hirasawa
  2. Index 1. Overview 2. Video Captioning 3. Background 4. Proposed

    Model 5. Experiment 6. Results 7. Conclusion
  3. 1. Overview Tackles the video captioning task with multimodal (video + audio) inputs

      • Hierarchical encoder  • Local/Global decoder  • Deep audio feats  • Improves BLEU by about 2 points
  4. 2. Video Captioning Generate a natural-language caption for a given video. Example reference captions for one video: • a car is shown

    • a man drives a vehicle through the countryside • a man drives down the road in an audi • a man driving a car • a man is driving a car • a man is driving down a road • a man is driving in a car as part of a commercial • a man is driving • a man riding the car speedly in a narrow road • a man showing the various features of a car • a man silently narrates his experience driving an audi • a person is driving his car around curves in the road • ...
  5. 2. Video Captioning Dataset • MSR-VTT dataset • 10k videos

    • 20 human-annotated reference captions for each video Competition • MS Multimedia Challenge • http://ms-multimedia-challenge.com/2017/challenge • held since 2016 • Evaluation metrics: • BLEU • METEOR • ROUGE-L • CIDEr-D
  6. CIDEr Source: CIDEr: Consensus-based Image Description Evaluation (CVPR 2015)

    • Scores a candidate caption by its average cosine similarity to the reference captions over TF-IDF-weighted n-gram vectors • h_k(s_ij): count of n-gram ω_k in reference sentence s_ij • g^n(s_ij) ∈ ℝ^{|Ω|}: the TF-IDF vector over the vocabulary Ω of n-grams
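The slide's formula did not survive extraction; below is the definition from the CIDEr paper (CVPR 2015), which the bullets above paraphrase:

```latex
% TF-IDF weight of n-gram \omega_k in sentence s_{ij}
% (h_k counts occurrences; I is the set of all images/videos):
g_k(s_{ij}) = \frac{h_k(s_{ij})}{\sum_{\omega_l \in \Omega} h_l(s_{ij})}
              \log\!\left( \frac{|I|}{\sum_{I_p \in I} \min\bigl(1, \sum_q h_k(s_{pq})\bigr)} \right)

% Average cosine similarity between candidate c_i and its m references:
\mathrm{CIDEr}_n(c_i, S_i) = \frac{1}{m} \sum_{j}
  \frac{g^n(c_i) \cdot g^n(s_{ij})}{\lVert g^n(c_i) \rVert \, \lVert g^n(s_{ij}) \rVert}

% Final score, averaged over n-gram lengths (N = 4, w_n = 1/N in the paper):
\mathrm{CIDEr}(c_i, S_i) = \sum_{n=1}^{N} w_n \, \mathrm{CIDEr}_n(c_i, S_i)
```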
  7. CIDEr-D A modification of CIDEr that is more robust to gaming (and the variant used for official evaluation)

    • Adds a Gaussian penalty on the length difference between candidate and reference • Clips each candidate n-gram count at its count in the reference
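For completeness, the CIDEr-D definition from the same paper (σ = 6 there; l(·) is sentence length):

```latex
\mathrm{CIDEr\text{-}D}_n(c_i, S_i) = \frac{10}{m} \sum_{j}
  e^{-\frac{(l(c_i) - l(s_{ij}))^2}{2\sigma^2}}
  \cdot
  \frac{\min\bigl(g^n(c_i),\, g^n(s_{ij})\bigr) \cdot g^n(s_{ij})}
       {\lVert g^n(c_i) \rVert \, \lVert g^n(s_{ij}) \rVert}
```

The Gaussian factor penalizes length mismatch, and the element-wise min implements the n-gram clipping from the second bullet.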
  8. Video Captioning Most prior work relies on visual information alone

    (audio information is rarely exploited)
  9. Issues in multi-modal approaches • The modalities are not temporally aligned,

     so some alignment mechanism is needed • For audio, prior work uses handcrafted audio features (e.g. MFCC) rather than learned ones
  10. Related works of Multi-modal 1/3 Ramanishka et al., Multimodal Video

    Description MM '16 Proceedings of the 2016 ACM on Multimedia Conference http://cs-people.bu.edu/dasabir/papers/MMVD_ACM_2016.pdf  • Fuses several feature types: • CNN (frame appearance) • C3D (motion) • audio • category label
  11. Related works of Multi-modal 2/3 Xu et al., Learning Multimodal

    Attention LSTM Networks for Video Captioning MM '16 Proceedings of the 2016 ACM on Multimedia Conference https://www.microsoft.com/en-us/research/wp-content/uploads/2017/11/coi110-xuA.pdf  • Fuses the per-modality features with attention • The fusion happens inside the decoder at each step
  12. Related works of Multi-modal 3/3 Attention-Based Multimodal Fusion for Video

    Description 2017 IEEE International Conference on Computer Vision (ICCV) http://openaccess.thecvf.com/content_ICCV_2017/papers/Hori_Attention-Based_Multimodal_Fusion_ICCV_2017_paper.pdf  • Fuses cross-modal features with attention
  13. 4. Proposed Model

  14. 4. Proposed Model Hierarchically Aligned Cross-modal Attentive network (HACA) Encoder

    (Hierarchical Attentive Encoder) • A two-level LSTM encoder per modality • Low-level: LSTM over the raw features • High-level: LSTM with attention over the low-level states Decoder (Globally and Locally Aligned Cross-modal Attentive Decoder) • Two decoders, each aligned to one encoder level • Global Decoder: attends to the high-level feats • Local Decoder: attends to the low-level feats • Cross-modal attention • Attends over each modality's context vector to fuse the modalities • The two decoders also share context with each other
  15. Feature Extractors Pretrained networks are used as feature extractors. Video: ResNet (per-frame features). Audio: VGGish •

    pretrained on AudioSet (a large audio-event dataset)  • a pretrained TensorFlow model is publicly available
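As a concrete illustration of the video side, here is a minimal frame-feature extractor built on a pretrained ResNet. This is my sketch, not the authors' code: the slide does not say which ResNet variant is used, so ResNet-152 (and frames pre-sampled at 3 fps, per slide 21) are assumptions.

```python
# Sketch: frame-level video features from a pretrained ResNet
# (torchvision >= 0.13 weights API). Frames are assumed to be
# already sampled from the video, e.g. at 3 fps.
import torch
import torchvision.models as models
import torchvision.transforms as T

resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
resnet.fc = torch.nn.Identity()   # drop the classifier, keep 2048-d pooled features
resnet.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def frame_features(frames):           # frames: list of PIL images
    batch = torch.stack([preprocess(f) for f in frames])
    return resnet(batch)              # (num_frames, 2048)
```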
  16. Hierarchical Attentive Encoder Encodes each modality's feature-extractor output at two levels

     Low-level: LSTM over the raw feature sequence High-level: attention + LSTM over chunks of the low-level states (see the sketch below)
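A minimal PyTorch sketch of such a two-level encoder, assuming the high-level LSTM consumes one attention-weighted summary per fixed-size chunk. Class and variable names are mine, and the exact attention form is a simplification of the paper's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalAttentiveEncoder(nn.Module):
    """Sketch: low-level LSTM over raw features; high-level LSTM over
    attention-weighted summaries of fixed-size chunks of low-level states."""
    def __init__(self, feat_dim, low_dim, high_dim, chunk_size):
        super().__init__()
        self.chunk_size = chunk_size
        self.low_lstm = nn.LSTM(feat_dim, low_dim, batch_first=True)
        self.high_lstm = nn.LSTM(low_dim, high_dim, batch_first=True)
        self.att = nn.Linear(low_dim + high_dim, 1)  # additive-style scoring

    def forward(self, feats):                  # feats: (B, T, feat_dim)
        low, _ = self.low_lstm(feats)          # (B, T, low_dim)
        chunks = low.split(self.chunk_size, dim=1)
        h = torch.zeros(feats.size(0), self.high_lstm.hidden_size,
                        device=feats.device)
        state, highs = None, []
        for chunk in chunks:                   # attend within each chunk
            q = h.unsqueeze(1).expand(-1, chunk.size(1), -1)
            scores = self.att(torch.cat([chunk, q], dim=-1))          # (B, t, 1)
            ctx = (F.softmax(scores, dim=1) * chunk).sum(1, keepdim=True)
            out, state = self.high_lstm(ctx, state)
            h = out[:, -1]
            highs.append(h)
        return low, torch.stack(highs, dim=1)  # low- and high-level states
```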
  17. Globally and Locally Aligned Cross-modal Attentive Decoder (Global Decoder)

     The Global Decoder • attends to the encoders' high-level feats (video, audio) • at each decoding step: 1. compute a context vector per modality 2. fuse the modality contexts with cross-modal attention into a single context 3. update the LSTM state, which is handed to the Local Decoder (see the sketch after slide 18)
  18. Globally and Locally Aligned Cross-modal Attentive Decoder (Local Decoder)

     The Local Decoder • attends to the encoders' low-level feats (video, audio) • runs in step with the Global Decoder: 1. receive the Global Decoder's hidden state 2. fuse the per-modality contexts with cross-modal attention 3. update the LSTM state used to predict the next word
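A minimal sketch of one such decoder step (my naming, not the paper's code). For brevity it assumes both modalities' encoder states share one dimension, whereas slide 21 lists different dimensions per modality, so real code would add per-modality projections.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def soft_attention(query, keys, score):
    """Additive-style soft attention; returns one context vector over keys."""
    q = query.unsqueeze(1).expand(-1, keys.size(1), -1)
    weights = F.softmax(score(torch.cat([keys, q], dim=-1)), dim=1)  # (B, T, 1)
    return (weights * keys).sum(dim=1)                               # (B, key_dim)

class CrossModalAttentiveDecoder(nn.Module):
    def __init__(self, emb_dim, key_dim, hid_dim, other_dim=0):
        super().__init__()
        self.temporal_score = nn.Linear(key_dim + hid_dim, 1)
        self.modal_score = nn.Linear(key_dim + hid_dim, 1)
        self.cell = nn.LSTMCell(emb_dim + key_dim + other_dim, hid_dim)

    def step(self, word_emb, enc_states, state, other_h=None):
        h, c = state
        # (1) a temporal-attention context per modality ...
        ctxs = torch.stack([soft_attention(h, m, self.temporal_score)
                            for m in enc_states], dim=1)      # (B, M, key_dim)
        # (2) ... fused by attending across the modalities
        ctx = soft_attention(h, ctxs, self.modal_score)       # (B, key_dim)
        # (3) LSTM update; the local decoder also receives the global
        #     decoder's hidden state via `other_h`
        extra = [other_h] if other_h is not None else []
        return self.cell(torch.cat([word_emb, ctx, *extra], dim=-1), (h, c))
```

The same class can play both roles: the global decoder is built with other_dim=0 and fed the high-level encoder states, while the local decoder is built with other_dim equal to the global hidden size, fed the low-level states plus the global decoder's hidden state as other_h.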
  19. Globally and Locally Aligned Cross-modal Attentive Decoder (softmax & loss)

    The Local Decoder's hidden states are projected to vocabulary logits and passed through a softmax; training minimizes the cross-entropy loss against the reference captions
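In code, that output layer and loss amount to something like the following (the hidden size follows slide 21; the vocabulary size and padding id are assumptions):

```python
import torch
import torch.nn as nn

# Sketch: project the local decoder's hidden states to vocabulary logits
# and train with token-level cross-entropy.
vocab_size, hid_dim = 10000, 1024               # illustrative sizes
out_proj = nn.Linear(hid_dim, vocab_size)
loss_fn = nn.CrossEntropyLoss(ignore_index=0)   # assume 0 = padding id

def caption_loss(local_hiddens, targets):
    # local_hiddens: (B, L, hid_dim); targets: (B, L) gold word ids
    logits = out_proj(local_hiddens)            # (B, L, vocab_size)
    return loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
```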
  20. Overview

  21. 5. Experiment Dataset • MSR-VTT dataset Feature Extractor • ResNet

    (video, sampling rate: 3 fps), VGGish (audio) Evaluation Metrics • BLEU, METEOR, ROUGE-L, CIDEr-D (the same metrics as the MS Challenge) Model • Encoder dims (low, high): (512, 256) for video, (128, 64) for audio • Chunk size: 10 for video, 4 for audio • Decoder dim: 256 for global, 1024 for local • Embedding dim: 512
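Collected as a config sketch (the key names are mine, the values are from the slide):

```python
# Hyperparameters reported on slide 21.
config = {
    "video": {"enc_low_dim": 512, "enc_high_dim": 256, "chunk_size": 10,
              "sampling_rate_fps": 3, "features": "ResNet"},
    "audio": {"enc_low_dim": 128, "enc_high_dim": 64, "chunk_size": 4,
              "features": "VGGish"},
    "decoder": {"global_dim": 256, "local_dim": 1024},
    "embedding_dim": 512,
}
```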
  22. 6. Results • BLEU improves by about 2 points • state of the art on the other metrics as well

    • CIDEnt-RL, the previous best, is trained with reinforcement learning to optimize the metric directly, while HACA uses only cross-entropy training
  23. Ablation models ATT(v) • enc: LSTM, dec: ATT CM-ATT(va) • enc: LSTM, dec: CM-ATT

    • video + audio CM-ATT(vad) • enc: LSTM, dec: CM-ATT • video + audio + decoder HACA(w/o align) • enc: H-LSTM, dec: CM-ATT • only one decoder (no global/local alignment)
  24. Deep Audio Feats Why deep audio features instead of MFCC? MFCC • mel-frequency cepstral

    coefficients • classic handcrafted audio features widely used in speech processing Deep Audio Features (VGGish) • pretrained on AudioSet (a large audio-event dataset) • a pretrained TensorFlow model is publicly available
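For contrast, extracting the handcrafted features is essentially a one-liner (librosa here; the filename and 16 kHz rate are illustrative), while VGGish produces one learned 128-d embedding per ~0.96 s of audio via the pretrained model in the tensorflow/models audioset directory:

```python
# Handcrafted audio features: 13 MFCCs per analysis frame.
import librosa

y, sr = librosa.load("example.wav", sr=16000)        # hypothetical input file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, num_frames)
print(mfcc.shape)
```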
  26. 7. Conclusion Contributions • an attentive hierarchical LSTM encoder for each modality • globally and locally aligned cross-modal attentive decoders •

    deep audio features (VGGish) for video captioning • state-of-the-art results Further Works • add further visual and audio modalities, e.g. optical flow and C3D