LXMERT: Learning Cross-Modality Encoder Representations from Transformers (Tan & Bansal, EMNLP 2019)

tosho
November 27, 2019

EMNLP 2019


Transcript

  1. LXMERT: Learning Cross-Modality Encoder Representations from Transformers
     Hao Tan, Mohit Bansal (UNC Chapel Hill), EMNLP 2019
     Presenter: Tosho Hirasawa (TMU, Komachi Lab, M1), 27 November 2019 @ Komachi Lab
  2. 0. Overview
     • BERT-style pre-trained model for cross-modal representations
       ◦ Transformer for language
       ◦ Transformer for vision
       ◦ Transformer for cross-modality
     • Effectively learns language, vision, and cross-modality relationships
     • Achieves SOTA on 3 tasks
       ◦ Visual Question Answering (VQA)
       ◦ Compositional Question Answering (GQA)
       ◦ Natural Language for Visual Reasoning for Real (NLVR2)
  3. 1. Introduction
     • Influential models for single-modality representations
       ◦ Language: BERT, XLM
       ◦ Vision: ResNet-50, Faster R-CNN
     • Fewer studies on cross-modality representations between vision and language
     • LXMERT: Learning Cross-Modality Encoder Representations from Transformers
       ◦ BERT-style, cross-modal, pre-trained with 5 tasks
  4. 2. Model Architecture
     2.1. Input Embeddings
     2.2. Encoders
     2.3. Output Representations
  5. 2.1. Input Embeddings
     • Word-Level Sentence Embeddings
       ◦ BERT-style
       ◦ 1 + n tokens ([CLS] + sentence)
     • Object-Level Image Embeddings
       ◦ Position feature + Region-of-Interest (RoI) feature
       ◦ Mean of the outputs of 2 fully-connected layers
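The object-level image embedding can be sketched as follows. All dimensions (2048-d RoI features, 4-d position boxes, 768-d hidden size) and the random weights are illustrative assumptions; the actual model also layer-normalizes each projection before averaging, which is omitted here.

```python
import numpy as np

# Project the RoI feature and the position box with two separate
# fully-connected layers, then take the mean of the two outputs.
rng = np.random.default_rng(0)
feat_dim, pos_dim, hidden = 2048, 4, 768          # assumed dimensions
W_feat = rng.standard_normal((feat_dim, hidden)) * 0.01
W_pos = rng.standard_normal((pos_dim, hidden)) * 0.01

def object_embedding(roi_feats, boxes):
    # roi_feats: (m, 2048) Faster R-CNN features; boxes: (m, 4) normalized coords
    return (roi_feats @ W_feat + boxes @ W_pos) / 2  # mean of the two projections

m = 36  # objects per image, as in the pre-training setup
emb = object_embedding(rng.standard_normal((m, feat_dim)), rng.random((m, pos_dim)))
print(emb.shape)  # (36, 768)
```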
  6. 2.2. Encoders
     • Single-Modality Encoders
       ◦ Language encoder (N_L layers)
       ◦ Object-relation encoder (N_R layers)
       ◦ Residual connections and layer normalization
     • Cross-Modality Encoder
       ◦ Uni-directional cross-attention (Cross)
         ▪ Exchanges information between modalities
         ▪ Aligns entities between modalities
         ▪ Learns joint cross-modality representations
       ◦ Self-attention (Self)
       ◦ Feed-forward (FF)
       ◦ Residual connections and layer normalization
     (Diagram: each layer takes the output of the previous layer as input)
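A minimal sketch of one cross-attention step, assuming a single head, no learned projections, and illustrative shapes: each modality uses its own vectors as queries and the other modality's vectors as keys and values.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # queries: (n_q, d) from one modality; keys_values: (n_kv, d) from the other.
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # scaled dot-product scores
    return softmax(scores) @ keys_values            # attended summary per query

rng = np.random.default_rng(0)
lang, vis = rng.standard_normal((20, 64)), rng.standard_normal((36, 64))
lang_out = cross_attention(lang, vis)   # language attends to vision
vis_out = cross_attention(vis, lang)    # vision attends to language
print(lang_out.shape, vis_out.shape)  # (20, 64) (36, 64)
```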
  7. 2.3. Output Representations
     • Language output: output of the cross-modality encoder for the language stream
     • Visual output: output of the cross-modality encoder for the vision stream
     • Cross-modal output: output of the cross-modality encoder for [CLS]
  8. 3. Pre-Training Strategies • Tasks • Dataset • Procedure

  9. 3.1. Pre-Training Tasks
     • Language Task
       ◦ Masked Cross-Modality LM
     • Vision Task
       ◦ Masked Object Prediction
         ▪ RoI-feature regression
         ▪ Detected-label classification
     • Cross-Modality Tasks
       ◦ Cross-Modality Matching
       ◦ Image Question Answering
     • All losses are weighted equally
  10. 3.1.1. Masked Cross-Modality LM (Language)
     • Predict masked words from the non-masked words in the language modality
     • Randomly mask words with probability 0.15
     • Pros: good at resolving ambiguity
     • Cons: cannot load pre-trained BERT parameters
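The 15% masking step can be sketched as below. This simplifies BERT-style masking (which sometimes substitutes random tokens or keeps the original) to a plain [MASK] replacement; the example sentence and seed are my own.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=1):
    # Replace each token with [MASK] with probability mask_prob and record
    # the original token as the prediction target (None = not masked).
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets.append(tok)   # model must predict this from context
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets

masked, targets = mask_tokens("who is eating the carrot".split())
print(masked)  # ['[MASK]', 'is', 'eating', 'the', 'carrot'] with this seed
```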
  11. 3.1.2. Masked Object Prediction (Vision)
     • Predict masked objects from the non-masked objects in the vision modality
     • Randomly mask objects with probability 0.15
     • RoI-Feature Regression => predict the object's RoI feature with an L2 loss
     • Detected-Label Classification => predict the object label detected by Faster R-CNN
     • Pros: learns object relationships
     • Pros: learns cross-modality alignment
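The two masked-object losses can be sketched as follows; the feature dimension (2048) and label-vocabulary size (1600) are assumptions, and the random vectors stand in for model predictions and detector outputs.

```python
import numpy as np

def roi_feature_regression_loss(pred_feat, true_feat):
    # L2 regression toward the masked object's RoI feature.
    return float(np.mean((pred_feat - true_feat) ** 2))

def detected_label_loss(logits, label):
    # Cross-entropy against the label Faster R-CNN detected for the object.
    shifted = logits - logits.max()                      # numerically stable
    log_probs = shifted - np.log(np.exp(shifted).sum())  # log-softmax
    return float(-log_probs[label])

rng = np.random.default_rng(0)
pred, true = rng.standard_normal(2048), rng.standard_normal(2048)
logits = rng.standard_normal(1600)  # assumed detector label-vocabulary size
reg = roi_feature_regression_loss(pred, true)
ce = detected_label_loss(logits, 7)
print(reg, ce)
```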
  12. 3.1.3. Cross-Modality Tasks
     Cross-Modality Matching
     • Replace the sentence with a mismatched one with probability 0.5
     • Predict whether the image and the sentence match
     • Counterpart of "Next Sentence Prediction" in BERT
     Image Question Answering
     • Predict the answer for image-question pairs
     • 9,500 answers cover ~90% of the questions
     • Pros: enlarges the pre-training data (QA pairs are ~1/3 of the sentences in the dataset)
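Building a matching example can be sketched like this; the caption pool and image id are invented for illustration.

```python
import random

def make_matching_example(image_id, caption, caption_pool, rng):
    # With probability 0.5, pair the image with a mismatched caption
    # and label the pair 0 (not matched); otherwise keep the true pair (1).
    if rng.random() < 0.5:
        negative = rng.choice([c for c in caption_pool if c != caption])
        return image_id, negative, 0
    return image_id, caption, 1

rng = random.Random(0)
pool = ["a dog on a couch", "two people skiing", "a plate of pasta"]
example = make_matching_example("img_001", pool[0], pool, rng)
print(example)  # ('img_001', 'a dog on a couch', 1) with this seed
```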
  13. 3.2. Pre-Training Data
     • COCO / VG: image + caption
     • VQA / GQA / VG-QA: image + question
  14. 3.3. Pre-Training Procedure
     • Tokenizer: WordPiece (provided by BERT)
     • Object detector: Faster R-CNN pre-trained on Visual Genome
     • Number of objects: 36 (= m)
     • N_{R, L, X}: {5, 9, 5}
     • Optimizer: Adam (learning rate: 1e-4)
     • Batch size: 256
     • Epochs: 10 (QA), 20 (otherwise)
     • Fine-tuning
       ◦ Learning rate: 1e-5 or 5e-5
       ◦ Batch size: 32
       ◦ Epochs: 4
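The procedure settings above can be collected into plain dicts for reference; the key names are my own, only the values come from the slide.

```python
# Pre-training and fine-tuning settings (values from the slide; keys assumed).
pretrain_config = {
    "tokenizer": "WordPiece (BERT)",
    "object_detector": "Faster R-CNN (pre-trained on Visual Genome)",
    "num_objects": 36,                          # m
    "layers": {"N_R": 5, "N_L": 9, "N_X": 5},   # object-relation / language / cross
    "optimizer": "Adam",
    "learning_rate": 1e-4,
    "batch_size": 256,
    "epochs": {"qa": 10, "other": 20},
}
finetune_config = {"learning_rate": [1e-5, 5e-5], "batch_size": 32, "epochs": 4}
print(pretrain_config["layers"])
```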
  15. 4. Experimental Setup and Results

  16. 5. Analysis • BERT • Image QA Pre-Training Tasks • Vision Pre-Training Tasks
  17. 5.1. BERT versus LXMERT
     • Bottom-Up and Top-Down (BUTD) attention [Anderson et al., 2018]
     • CrossAtt: uses cross-modality attention sublayer(s)
     • The model initialized with BERT (Pre-train + BERT) performs worse than the full model (Pre-train + scratch)
  18. 5.2. Effect of the Image QA Pre-Training Task
     • P20: pre-trained w/o image QA for 20 epochs
     • P10+QA10: pre-trained w/o image QA for 10 epochs, then w/ image QA for 10 epochs
     • DA: use other image QA datasets as augmented data
     • FT: use only the task-specific image QA dataset in training
     • DA degrades model performance
  19. 5.3. Effect of Vision Pre-Training Tasks
     • No Vision Tasks: pre-train LXMERT w/o masked object prediction
     • Feat: w/ only feature regression
     • Label: w/ only label classification
     • Feat+Label: full model
  20. 6. Related Work
     Model Architectures
     • Bottom-Up and Top-Down attention [Anderson et al., 2018]
     • Transformer [Vaswani et al., 2017]
     Pre-training
     • GPT [Radford et al., 2018]
     • BERT [Devlin et al., 2019]
     • XLM [Lample and Conneau, 2019]
     • ViLBERT [Lu et al., 2019]
     • VisualBERT [Li et al., 2019]
  21. Bottom-Up and Top-Down Attention for Image Captioning
     • Encode an image with a Faster R-CNN encoder (Bottom-Up)
     • Attend to object RoI features using the language LSTM state (Top-Down)
     • Take the element-wise product of the attended feature and the language state
     • Predict answers / generate captions
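The BUTD fusion step described above can be sketched as follows; the 512-d vectors are illustrative stand-ins for the attended image feature and the LSTM hidden state.

```python
import numpy as np

# Element-wise (Hadamard) product fuses the attended image feature
# with the language LSTM state before prediction.
rng = np.random.default_rng(0)
attended_image = rng.standard_normal(512)  # top-down attention over RoI features
language_state = rng.standard_normal(512)  # hidden state of the language LSTM
fused = attended_image * language_state    # element-wise product
print(fused.shape)  # (512,)
```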
  22. ViLBERT [Lu et al., 2019] • Co-TRM: Co-attentional TRansforMer layers

  23. VisualBERT [Li et al., 2019]

  24. 7. Conclusion
     • LXMERT, a framework for cross-modality representations
       ◦ BERT-style
       ◦ Code and pre-trained models available online
     • Learns representations for language, vision, and cross-modality effectively
       ◦ Especially on NLVR2
     • SOTA results on the VQA / GQA / NLVR2 tasks