Wang, Wang, Wang - 2018 - Watch, Listen, and Describe: Globally and Locally Aligned Cross-Modal Attentions for Video Captioning

tosho
August 02, 2018


Transcript

  1. Watch, Listen, and Describe: Globally and Locally Aligned Cross-Modal Attentions

    for Video Captioning Xin Wang, Yuan-Fang Wang, William Yang Wang (NAACL 2018) Tosho Hirasawa
  2. Index 1. Overview 2. Video Captioning 3. Background 4. Proposed

    Model 5. Experiment 6. Results 7. Conclusion
  3. 1. Overview Tackles the video captioning task with multimodal (video + audio) inputs

      • Hierarchical encoder  • Local/Global decoder  • Deep audio feats  • Improves BLEU by about 2 points
  4. 2. Video Captioning Generate a natural-language caption for a given video. Example reference captions for one video: • a car is shown

    • a man drives a vehicle through the countryside • a man drives down the road in an audi • a man driving a car • a man is driving a car • a man is driving down a road • a man is driving in a car as part of a commercial • a man is driving • a man riding the car speedly in a narrow road • a man showing the various features of a car • a man silently narrates his experience driving an audi • a person is driving his car around curves in the road • ...
  5. 2. Video Captioning Dataset • MSR-VTT dataset • 10k videos

    • 20 human-annotated reference captions for each video Competition • MS Multimedia Challenge • http://ms-multimedia-challenge.com/2017/challenge • held since 2016 • Evaluation metrics: • BLEU • METEOR • ROUGE-L • CIDEr-D
  6. CIDEr Source: CIDEr: Consensus-based Image Description Evaluation (CVPR 2015)

    • Scores a candidate caption by its average cosine similarity to the reference captions over TF-IDF-weighted n-gram vectors • h_k(s_ij): count of n-gram ω_k in reference sentence s_ij • g^n(s_ij) ∈ ℝ^{|Ω|}: the TF-IDF vector over the vocabulary Ω of n-grams
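The slide's formula did not survive extraction; below is the definition from the CIDEr paper (CVPR 2015), which the bullets above paraphrase:

```latex
% TF-IDF weight of n-gram \omega_k in sentence s_{ij}
% (h_k counts occurrences; I is the set of all images/videos):
g_k(s_{ij}) = \frac{h_k(s_{ij})}{\sum_{\omega_l \in \Omega} h_l(s_{ij})}
              \log\!\left( \frac{|I|}{\sum_{I_p \in I} \min\bigl(1, \sum_q h_k(s_{pq})\bigr)} \right)

% Average cosine similarity between candidate c_i and its m references:
\mathrm{CIDEr}_n(c_i, S_i) = \frac{1}{m} \sum_{j}
  \frac{g^n(c_i) \cdot g^n(s_{ij})}{\lVert g^n(c_i) \rVert \, \lVert g^n(s_{ij}) \rVert}

% Final score, averaged over n-gram lengths (N = 4, w_n = 1/N in the paper):
\mathrm{CIDEr}(c_i, S_i) = \sum_{n=1}^{N} w_n \, \mathrm{CIDEr}_n(c_i, S_i)
```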
  7. CIDEr-D A modification of CIDEr that is more robust to gaming (and the variant used for official evaluation)

    • Adds a Gaussian penalty on the length difference between candidate and reference • Clips each candidate n-gram count at its count in the reference
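For completeness, the CIDEr-D definition from the same paper (σ = 6 there; l(·) is sentence length):

```latex
\mathrm{CIDEr\text{-}D}_n(c_i, S_i) = \frac{10}{m} \sum_{j}
  e^{-\frac{(l(c_i) - l(s_{ij}))^2}{2\sigma^2}}
  \cdot
  \frac{\min\bigl(g^n(c_i),\, g^n(s_{ij})\bigr) \cdot g^n(s_{ij})}
       {\lVert g^n(c_i) \rVert \, \lVert g^n(s_{ij}) \rVert}
```

The Gaussian factor penalizes length mismatch, and the element-wise min implements the n-gram clipping from the second bullet.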
  8. Video Captioning Most prior work relies on visual information alone

    (audio information is rarely exploited)
  9. Issues in multi-modal approaches • The modalities are not temporally aligned,

     so some alignment mechanism is needed • For audio, prior work uses handcrafted audio features (e.g. MFCC) rather than learned ones
  10. Related works of Multi-modal 1/3 Ramanishka et al., Multimodal Video

    Description MM '16 Proceedings of the 2016 ACM on Multimedia Conference http://cs-people.bu.edu/dasabir/papers/MMVD_ACM_2016.pdf  • Fuses several feature types: • CNN (frame appearance) • C3D (motion) • audio • category label
  11. Related works of Multi-modal 2/3 Xu et al., Learning Multimodal

    Attention LSTM Networks for Video Captioning MM '16 Proceedings of the 2016 ACM on Multimedia Conference https://www.microsoft.com/en-us/research/wp-content/uploads/2017/11/coi110-xuA.pdf  • Fuses the per-modality features with attention • The fusion happens inside the decoder at each step
  12. Related works of Multi-modal 3/3 Attention-Based Multimodal Fusion for Video

    Description 2017 IEEE International Conference on Computer Vision (ICCV) http://openaccess.thecvf.com/content_ICCV_2017/papers/Hori_Attention-Based_Multimodal_Fusion_ICCV_2017_paper.pdf  • Fuses cross-modal features with attention
  13. 4. Proposed Model

  14. 4. Proposed Model Hierarchically Aligned Cross-modal Attentive network (HACA) Encoder

    (Hierarchical Attentive Encoder) • A two-level LSTM encoder per modality • Low-level: LSTM over the raw features • High-level: LSTM with attention over the low-level states Decoder (Globally and Locally Aligned Cross-modal Attentive Decoder) • Two decoders, each aligned to one encoder level • Global Decoder: attends to the high-level feats • Local Decoder: attends to the low-level feats • Cross-modal attention • Attends over each modality's context vector to fuse the modalities • The two decoders also share context with each other
  15. Feature Extractors Pretrained networks are used as feature extractors. Video: ResNet (per-frame features). Audio: VGGish •

    pretrained on AudioSet (a large audio-event dataset)  • a pretrained TensorFlow model is publicly available
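As a concrete illustration of the video side, here is a minimal frame-feature extractor built on a pretrained ResNet. This is my sketch, not the authors' code: the slide does not say which ResNet variant is used, so ResNet-152 (and frames pre-sampled at 3 fps, per slide 21) are assumptions.

```python
# Sketch: frame-level video features from a pretrained ResNet
# (torchvision >= 0.13 weights API). Frames are assumed to be
# already sampled from the video, e.g. at 3 fps.
import torch
import torchvision.models as models
import torchvision.transforms as T

resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
resnet.fc = torch.nn.Identity()   # drop the classifier, keep 2048-d pooled features
resnet.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def frame_features(frames):           # frames: list of PIL images
    batch = torch.stack([preprocess(f) for f in frames])
    return resnet(batch)              # (num_frames, 2048)
```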
  16. Hierarchical Attentive Encoder Encodes each modality's feature-extractor output at two levels

     Low-level: LSTM over the raw feature sequence High-level: attention + LSTM over chunks of the low-level states (see the sketch below)
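A minimal PyTorch sketch of such a two-level encoder, assuming the high-level LSTM consumes one attention-weighted summary per fixed-size chunk. Class and variable names are mine, and the exact attention form is a simplification of the paper's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalAttentiveEncoder(nn.Module):
    """Sketch: low-level LSTM over raw features; high-level LSTM over
    attention-weighted summaries of fixed-size chunks of low-level states."""
    def __init__(self, feat_dim, low_dim, high_dim, chunk_size):
        super().__init__()
        self.chunk_size = chunk_size
        self.low_lstm = nn.LSTM(feat_dim, low_dim, batch_first=True)
        self.high_lstm = nn.LSTM(low_dim, high_dim, batch_first=True)
        self.att = nn.Linear(low_dim + high_dim, 1)  # additive-style scoring

    def forward(self, feats):                  # feats: (B, T, feat_dim)
        low, _ = self.low_lstm(feats)          # (B, T, low_dim)
        chunks = low.split(self.chunk_size, dim=1)
        h = torch.zeros(feats.size(0), self.high_lstm.hidden_size,
                        device=feats.device)
        state, highs = None, []
        for chunk in chunks:                   # attend within each chunk
            q = h.unsqueeze(1).expand(-1, chunk.size(1), -1)
            scores = self.att(torch.cat([chunk, q], dim=-1))          # (B, t, 1)
            ctx = (F.softmax(scores, dim=1) * chunk).sum(1, keepdim=True)
            out, state = self.high_lstm(ctx, state)
            h = out[:, -1]
            highs.append(h)
        return low, torch.stack(highs, dim=1)  # low- and high-level states
```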
  17. Globally and Locally Aligned Cross-modal Attentive Decoder (Global Decoder)

     The Global Decoder • attends to the encoders' high-level feats (video, audio) • at each decoding step: 1. compute a context vector per modality 2. fuse the modality contexts with cross-modal attention into a single context 3. update the LSTM state, which is handed to the Local Decoder (see the sketch after slide 18)
  18. Globally and Locally Aligned Cross-modal Attentive Decoder (Local Decoder)

     The Local Decoder • attends to the encoders' low-level feats (video, audio) • runs in step with the Global Decoder: 1. receive the Global Decoder's hidden state 2. fuse the per-modality contexts with cross-modal attention 3. update the LSTM state used to predict the next word
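A minimal sketch of one such decoder step (my naming, not the paper's code). For brevity it assumes both modalities' encoder states share one dimension, whereas slide 21 lists different dimensions per modality, so real code would add per-modality projections.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def soft_attention(query, keys, score):
    """Additive-style soft attention; returns one context vector over keys."""
    q = query.unsqueeze(1).expand(-1, keys.size(1), -1)
    weights = F.softmax(score(torch.cat([keys, q], dim=-1)), dim=1)  # (B, T, 1)
    return (weights * keys).sum(dim=1)                               # (B, key_dim)

class CrossModalAttentiveDecoder(nn.Module):
    def __init__(self, emb_dim, key_dim, hid_dim, other_dim=0):
        super().__init__()
        self.temporal_score = nn.Linear(key_dim + hid_dim, 1)
        self.modal_score = nn.Linear(key_dim + hid_dim, 1)
        self.cell = nn.LSTMCell(emb_dim + key_dim + other_dim, hid_dim)

    def step(self, word_emb, enc_states, state, other_h=None):
        h, c = state
        # (1) a temporal-attention context per modality ...
        ctxs = torch.stack([soft_attention(h, m, self.temporal_score)
                            for m in enc_states], dim=1)      # (B, M, key_dim)
        # (2) ... fused by attending across the modalities
        ctx = soft_attention(h, ctxs, self.modal_score)       # (B, key_dim)
        # (3) LSTM update; the local decoder also receives the global
        #     decoder's hidden state via `other_h`
        extra = [other_h] if other_h is not None else []
        return self.cell(torch.cat([word_emb, ctx, *extra], dim=-1), (h, c))
```

The same class can play both roles: the global decoder is built with other_dim=0 and fed the high-level encoder states, while the local decoder is built with other_dim equal to the global hidden size, fed the low-level states plus the global decoder's hidden state as other_h.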
  19. Globally and Locally Aligned Cross-modal Attentive Decoder (softmax & loss)

    The Local Decoder's hidden states are projected to vocabulary logits and passed through a softmax; training minimizes the cross-entropy loss against the reference captions
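In code, that output layer and loss amount to something like the following (the hidden size follows slide 21; the vocabulary size and padding id are assumptions):

```python
import torch
import torch.nn as nn

# Sketch: project the local decoder's hidden states to vocabulary logits
# and train with token-level cross-entropy.
vocab_size, hid_dim = 10000, 1024               # illustrative sizes
out_proj = nn.Linear(hid_dim, vocab_size)
loss_fn = nn.CrossEntropyLoss(ignore_index=0)   # assume 0 = padding id

def caption_loss(local_hiddens, targets):
    # local_hiddens: (B, L, hid_dim); targets: (B, L) gold word ids
    logits = out_proj(local_hiddens)            # (B, L, vocab_size)
    return loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
```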
  20. Overview

  21. 5. Experiment Dataset • MSR-VTT dataset Feature Extractor • ResNet

    (video, sampling rate: 3 fps), VGGish (audio) Evaluation Metrics • BLEU, METEOR, ROUGE-L, CIDEr-D (the same metrics as the MS Challenge) Model • Encoder dims (low, high): (512, 256) for video, (128, 64) for audio • Chunk size: 10 for video, 4 for audio • Decoder dim: 256 for global, 1024 for local • Embedding dim: 512
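Collected as a config sketch (the key names are mine, the values are from the slide):

```python
# Hyperparameters reported on slide 21.
config = {
    "video": {"enc_low_dim": 512, "enc_high_dim": 256, "chunk_size": 10,
              "sampling_rate_fps": 3, "features": "ResNet"},
    "audio": {"enc_low_dim": 128, "enc_high_dim": 64, "chunk_size": 4,
              "features": "VGGish"},
    "decoder": {"global_dim": 256, "local_dim": 1024},
    "embedding_dim": 512,
}
```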
  22. 6. Results • BLEU improves by about 2 points • state of the art on the other metrics as well

    • CIDEnt-RL, the previous best, is trained with reinforcement learning to optimize the metric directly, while HACA uses only cross-entropy training
  23. Ablation models ATT(v) • enc: LSTM, dec: ATT CM-ATT(va) • enc: LSTM, dec: CM-ATT

    • video + audio CM-ATT(vad) • enc: LSTM, dec: CM-ATT • video + audio + decoder HACA(w/o align) • enc: H-LSTM, dec: CM-ATT • only one decoder (no global/local alignment)
  24. Deep Audio Feats Why deep audio features instead of MFCC? MFCC • mel-frequency cepstral

    coefficients • classic handcrafted audio features widely used in speech processing Deep Audio Features (VGGish) • pretrained on AudioSet (a large audio-event dataset) • a pretrained TensorFlow model is publicly available
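For contrast, extracting the handcrafted features is essentially a one-liner (librosa here; the filename and 16 kHz rate are illustrative), while VGGish produces one learned 128-d embedding per ~0.96 s of audio via the pretrained model in the tensorflow/models audioset directory:

```python
# Handcrafted audio features: 13 MFCCs per analysis frame.
import librosa

y, sr = librosa.load("example.wav", sr=16000)        # hypothetical input file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, num_frames)
print(mfcc.shape)
```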
  26. 7. Conclusion Contributions • an attentive hierarchical LSTM encoder for each modality • globally and locally aligned cross-modal attentive decoders •

    deep audio features (VGGish) for video captioning • state-of-the-art results Further Works • add further visual and audio modalities, e.g. optical flow and C3D