
Wang, Wang, Wang - 2018 - Watch, Listen, and Describe Globally and Locally Aligned Cross-Modal Attentions for Video Captioning

tosho
August 02, 2018

Transcript

  1. Watch, Listen, and Describe: Globally and Locally Aligned Cross-Modal Attentions
     for Video Captioning
     Xin Wang, Yuan-Fang Wang, William Yang Wang (NAACL 2018)
     Presented by Tosho Hirasawa
  2. Index 1. Overview 2. Video Captioning 3. Background 4. Proposed

    Model 5. Experiment 6. Results 7. Conclusion
  3. 1. Overview
     A multimodal (video + audio) approach to the video captioning task.
     • Hierarchical attentive encoder
     • Local/global aligned decoders
     • Deep audio features
     • Improves BLEU by about 2 points
  4. 2. Video Captioning
     A video and its reference captions:
     • a car is shown
     • a man drives a vehicle through the countryside
     • a man drives down the road in an audi
     • a man driving a car
     • a man is driving a car
     • a man is driving down a road
     • a man is driving in a car as part of a commercial
     • a man is driving
     • a man riding the car speedly in a narrow road
     • a man showing the various features of a car
     • a man silently narrates his experience driving an audi
     • a person is driving his car around curves in the road
     • ...
  5. 2. Video Captioning
     Dataset
     • MSR-VTT dataset
     • 10k videos
     • 20 human-annotated reference captions for each video
     Competition
     • MS Multimedia Challenge (held since 2016)
     • http://ms-multimedia-challenge.com/2017/challenge
     Evaluation metrics
     • BLEU
     • METEOR
     • ROUGE-L
     • CIDEr-D
  6. CIDEr
     CIDEr: Consensus-based Image Description Evaluation (CVPR 2015)
     • Each sentence is represented as a vector of TF-IDF-weighted n-gram
       counts, g^n(s) ∈ ℝ^{|Ω|}, where Ω is the n-gram vocabulary
     • A candidate is scored by its average cosine similarity against the
       reference captions, computed per n-gram order and then averaged
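The scoring above can be sketched as follows. This is an illustrative toy version for a single n-gram order; the function names and the simplified TF-IDF/averaging are mine, not the official coco-caption implementation:

```python
# Toy sketch of the core of CIDEr: each sentence becomes a TF-IDF-weighted
# n-gram vector, and a candidate is scored by its average cosine similarity
# against the reference vectors (shown here for one n-gram order only).
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf_vector(tokens, n, doc_freq, num_docs):
    # term frequency of each n-gram, weighted by a corpus-level IDF
    counts = Counter(ngrams(tokens, n))
    total = sum(counts.values()) or 1
    return {g: (c / total) * math.log(num_docs / (1 + doc_freq.get(g, 0)))
            for g, c in counts.items()}

def cosine(u, v):
    dot = sum(u[g] * v.get(g, 0.0) for g in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cider_n(candidate, references, n, doc_freq, num_docs):
    # average cosine similarity of the candidate against all references
    c_vec = tfidf_vector(candidate, n, doc_freq, num_docs)
    return sum(cosine(c_vec, tfidf_vector(r, n, doc_freq, num_docs))
               for r in references) / len(references)
```

A candidate identical to its only reference scores 1.0; sharing only common words like "a" yields a much lower score.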
  7. CIDEr-D
     A modification of CIDEr that is harder to game (e.g. by repeating
     high-confidence words or emitting overly long sentences):
     • A Gaussian penalty on the length difference between candidate and
       reference
     • Candidate n-gram counts are clipped to the number of occurrences of
       each n-gram in the reference
  8. Issues
     Multi-modal
     • How to align and fuse the modalities is non-trivial
     Audio
     • Audio is usually represented by handcrafted features (e.g. MFCC)
  9. Related works of Multi-modal 1/3
     Ramanishka et al., Multimodal Video Description
     MM '16 Proceedings of the 2016 ACM on Multimedia Conference
     http://cs-people.bu.edu/dasabir/papers/MMVD_ACM_2016.pdf
     • Combines features from multiple sources:
     • CNN, C3D, audio, and category-label features
  10. Related works of Multi-modal 2/3
     Xu et al., Learning Multimodal Attention LSTM Networks for Video Captioning
     MM '16 Proceedings of the 2016 ACM on Multimedia Conference
     https://www.microsoft.com/en-us/research/wp-content/uploads/2017/11/coi110-xuA.pdf
     • Fuses the modality features with attention
     • The fusion happens inside the decoder
  11. Related works of Multi-modal 3/3
     Hori et al., Attention-Based Multimodal Fusion for Video Description
     2017 IEEE International Conference on Computer Vision (ICCV)
     http://openaccess.thecvf.com/content_ICCV_2017/papers/Hori_Attention-Based_Multimodal_Fusion_ICCV_2017_paper.pdf
     • Fuses cross-modal features with attention
  12. 4. Proposed Model
     Hierarchically Aligned Cross-modal Attentive network (HACA)
     Encoder (Hierarchical Attentive Encoder)
     • Two stacked LSTMs per modality, operating at different granularities
     • Low-level encoder over every step; high-level encoder over chunks
     Decoder (Globally and Locally Aligned Cross-modal Attentive Decoder)
     • Two decoders decode from different levels of the encoder
     • Global Decoder: attends to the high-level feats
     • Local Decoder: attends to the low-level feats
     • Cross-modal attention
       • Attention over the per-modality context vectors
       • The decoders exchange context with each other (alignment)
  13. Hierarchical Attentive Encoder
     Applied to the extracted features of each modality:
     • Low-level: LSTM over every time step
     • High-level: attention over each chunk of low-level states + LSTM
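The two-level structure above can be sketched as follows. This is a minimal numpy sketch of the assumed structure, with a toy tanh-RNN cell standing in for the paper's LSTMs; all names and dimensions are mine:

```python
# Sketch of a hierarchical attentive encoder: a low-level recurrent pass over
# every frame feature, then soft attention pools each fixed-size chunk into
# one vector that feeds a high-level recurrent pass.
import numpy as np

rng = np.random.default_rng(0)

def rnn_step(h, x, W, U):
    # toy tanh-RNN cell standing in for an LSTM cell
    return np.tanh(W @ x + U @ h)

def attention_pool(states, query, Wa):
    # bilinear scores, softmax weights, weighted sum of the chunk's states
    scores = states @ (Wa @ query)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return states.T @ weights

def hierarchical_encode(frames, chunk_size, d_low, d_high):
    d_in = frames.shape[1]
    W_l = rng.normal(0, 0.1, (d_low, d_in))
    U_l = rng.normal(0, 0.1, (d_low, d_low))
    W_h = rng.normal(0, 0.1, (d_high, d_low))
    U_h = rng.normal(0, 0.1, (d_high, d_high))
    Wa = rng.normal(0, 0.1, (d_low, d_high))

    # low-level pass over every frame feature
    h, low_states = np.zeros(d_low), []
    for x in frames:
        h = rnn_step(h, x, W_l, U_l)
        low_states.append(h)
    low_states = np.array(low_states)

    # high-level pass: attention-pool each chunk, then recur over chunks
    g, high_states = np.zeros(d_high), []
    for start in range(0, len(frames), chunk_size):
        chunk = low_states[start:start + chunk_size]
        ctx = attention_pool(chunk, g, Wa)
        g = rnn_step(g, ctx, W_h, U_h)
        high_states.append(g)
    return low_states, np.array(high_states)
```

With chunk size 10 (the video setting from the experiment slide), 20 frames yield 20 low-level states and 2 high-level states.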
  14. Globally and Locally Aligned Cross-modal Attentive Decoder (Global Decoder)
     The Global Decoder:
     • Attends to the encoders' high-level feats (video, audio)
     • At each decoding step:
       1. Compute a context vector per modality
       2. Attend over those contexts to get a fused cross-modal context
       3. Update the LSTM state, which is passed to the Local Decoder
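The two attention stages in the steps above can be sketched like this; a minimal numpy sketch of the assumed mechanism, with all names and shapes mine:

```python
# Sketch of one cross-modal attention step: per modality, attend over that
# encoder's states to get a modality context vector, then attend over the
# modality contexts themselves to fuse them into a single context.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(states, query, W):
    # bilinear attention: score each state against the query, weighted-sum
    weights = softmax(states @ (W @ query))
    return states.T @ weights

def cross_modal_context(modal_states, query, Ws, W_cm):
    # step 1: one context vector per modality (video, audio, ...)
    contexts = np.array([attend(s, query, W)
                         for s, W in zip(modal_states, Ws)])
    # step 2: attention over the modality contexts -> fused context
    return attend(contexts, query, W_cm)
```

The fused context then feeds the decoder's LSTM update; the Local Decoder (next slide) applies the same pattern to the low-level states, with the Global Decoder's context added.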
  15. Globally and Locally Aligned Cross-modal Attentive Decoder (Local Decoder)
     The Local Decoder:
     • Attends to the encoders' low-level feats (video, audio)
     • Runs in step with the Global Decoder
     • At each decoding step:
       1. Receive the context from the Global Decoder
       2. Attend over the per-modality contexts to get a fused context
       3. Update the LSTM state and emit the next word
  16. Globally and Locally Aligned Cross-modal Attentive Decoder (softmax & loss)
     The Local Decoder's output is projected through a softmax over the
     vocabulary, and the model is trained with cross-entropy loss.
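The output layer described above amounts to the standard per-token cross entropy; a minimal sketch (the projection matrix and function names are mine):

```python
# Sketch of the caption loss: project each decoder hidden state to vocabulary
# logits with a softmax, then average the negative log-likelihood of the
# reference tokens (per-token cross entropy).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def caption_loss(hidden_states, targets, W_out):
    # hidden_states: (T, d); targets: (T,) token ids; W_out: (V, d)
    loss = 0.0
    for h, t in zip(hidden_states, targets):
        probs = softmax(W_out @ h)
        loss -= np.log(probs[t])
    return loss / len(targets)
```

As a sanity check, an untrained model with uniform output probabilities over a vocabulary of size V gives a loss of log(V).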
  17. 5. Experiment
     Dataset
     • MSR-VTT dataset
     Feature Extractor
     • ResNet (video, sampling rate: 3 fps), VGGish (audio)
     Evaluation Metrics
     • BLEU, METEOR, ROUGE-L, CIDEr-D (the official MS Challenge metrics)
     Model
     • Encoder dim (low, high): (512, 256) for video, (128, 64) for audio
     • Chunk size: 10 for video, 4 for audio
     • Decoder dim: 256 for global, 1024 for local
     • Embedding dim: 512
  18. 6. Results
     • BLEU improves by about 2 points
     • State of the art on most metrics
     • CIDEnt-RL, which directly optimizes CIDEr with reinforcement learning,
       stays ahead on CIDEr, but HACA leads on the other metrics
  19. Ablations
     ATT(v)
     • enc: LSTM, dec: ATT
     CM-ATT(va)
     • enc: LSTM, dec: CM-ATT
     • video + audio
     CM-ATT(vad)
     • enc: LSTM, dec: CM-ATT
     • video + audio + decoder
     HACA (w/o align)
     • enc: H-LSTM, dec: CM-ATT
     • no alignment between the two decoders
  20. Deep Audio Feats
     MFCC
     • mel-frequency cepstral coefficients
     • handcrafted features derived from the mel-scale spectrum
     • widely used for audio tasks
     Deep Audio Features (VGGish)
     • a VGG-like network trained on AudioSet
     • a pretrained TensorFlow implementation is available
  21. 7. Conclusion
     Summary
     • Proposed an attentive hierarchical LSTM encoder-decoder
     • Globally and locally aligned cross-modal attention fuses the modalities
     • Deep audio features are effective
     • Achieved state-of-the-art results
     Further Works
     • Better strategies for aligning the modalities
     • Additional visual/audio features such as optical flow and C3D