• Three transformer encoders:
  ◦ Transformer for language
  ◦ Transformer for vision
  ◦ Transformer for cross-modality
• Effectively learns language/vision/cross-modality relationships
• Achieves SOTA on 3 tasks:
  ◦ Visual Question Answering (VQA)
  ◦ Compositional Question Answering (GQA)
  ◦ Natural Language for Visual Reasoning for Real (NLVR2)
• Single-modality pre-training:
  ◦ Language: BERT, XLM
  ◦ Vision: ResNet-50, Faster R-CNN
• Fewer studies of the cross-modality representation between vision and language
• LXMERT: Learning Cross-Modality Encoder Representations from Transformers
  ◦ BERT-style, cross-modality, pre-trained with 5 tasks
• Language output
  ◦ Output of the cross-modality encoder for language
• Visual output
  ◦ Output of the cross-modality encoder for vision
• Cross-modal output
  ◦ Output of the cross-modality encoder at the [CLS] token (layer sketched below)
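To make the three outputs concrete, here is a minimal PyTorch sketch of one cross-modality layer. The class name, dimensions, and sublayer ordering are illustrative assumptions, not the paper's exact architecture; layer norms and the separate single-modality encoders are omitted for brevity.

```python
import torch
import torch.nn as nn

class CrossModalityLayer(nn.Module):
    """One cross-modality layer: cross-attention, then per-modality
    self-attention and feed-forward (layer norms omitted for brevity)."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.lang_to_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_to_lang = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lang_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lang_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                      nn.Linear(4 * dim, dim))
        self.vis_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                     nn.Linear(4 * dim, dim))

    def forward(self, lang, vis):
        # Each modality attends to the other, using the *input* states.
        lang2 = lang + self.lang_to_vis(lang, vis, vis)[0]
        vis2 = vis + self.vis_to_lang(vis, lang, lang)[0]
        # Per-modality self-attention and feed-forward sublayers.
        lang2 = lang2 + self.lang_self(lang2, lang2, lang2)[0]
        vis2 = vis2 + self.vis_self(vis2, vis2, vis2)[0]
        return lang2 + self.lang_ffn(lang2), vis2 + self.vis_ffn(vis2)

# Language output: per-token states; visual output: per-object states;
# cross-modal output: the state at the [CLS] position (index 0 here).
lang = torch.randn(2, 20, 768)    # [batch, tokens, dim]; token 0 is [CLS]
vis = torch.randn(2, 36, 768)     # [batch, object RoIs, dim]
lang_out, vis_out = CrossModalityLayer()(lang, vis)
cross_modal_out = lang_out[:, 0]  # [CLS] representation
```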
Masked Cross-Modality LM
• Predict masked words from the non-masked words in the language modality and from the vision modality
• Randomly mask words with a probability of 0.15 (masking sketched below)
• Pros: Good at resolving ambiguity (vision can disambiguate a masked word)
• Cons: Cannot load pre-trained BERT parameters
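A minimal sketch of the 15% word masking; the function name is made up, `MASK_ID` is just an illustrative token id, and BERT's 80/10/10 replace/random/keep refinement is omitted.

```python
import torch

MASK_ID = 103        # [MASK] token id (illustrative value)

def mask_tokens(token_ids: torch.Tensor, mask_prob: float = 0.15):
    """Mask ~15% of words; labels are -100 (ignored) where not masked."""
    labels = token_ids.clone()
    mask = torch.rand(token_ids.shape) < mask_prob
    labels[~mask] = -100              # nn.CrossEntropyLoss ignore_index
    inputs = token_ids.clone()
    inputs[mask] = MASK_ID            # replace masked words with [MASK]
    return inputs, labels

inputs, labels = mask_tokens(torch.randint(1000, 2000, (2, 20)))
```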
Masked Object Prediction
• Predict masked objects from the non-masked objects in the vision modality
• Randomly mask objects with a probability of 0.15
• RoI-Feature Regression => predict the object's RoI feature with an L2 loss
• Detected-Label Classification => predict the object label detected by Faster R-CNN (both losses sketched below)
• Pros: Learns object relationships
• Pros: Learns cross-modality alignment
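A sketch of the two masked-object losses, assuming the predictions and targets at the masked positions are already gathered; the function name, tensor shapes, and the 1600-class detector vocabulary are illustrative.

```python
import torch
import torch.nn.functional as F

def masked_object_losses(pred_feat, target_feat, pred_logits, detected_labels):
    # pred_feat / target_feat: [num_masked, feat_dim] RoI features
    # pred_logits: [num_masked, num_detector_classes]
    # detected_labels: [num_masked] class ids output by Faster R-CNN
    roi_regression = F.mse_loss(pred_feat, target_feat)             # L2 loss
    label_classification = F.cross_entropy(pred_logits, detected_labels)
    return roi_regression + label_classification

loss = masked_object_losses(torch.randn(5, 2048), torch.randn(5, 2048),
                            torch.randn(5, 1600), torch.randint(0, 1600, (5,)))
```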
Cross-Modality Matching
• Replace a sentence with a mismatched one with a probability of 0.5 (sketched below)
• Predict whether an image and a sentence match each other
• Equivalent of “Next Sentence Prediction” in BERT

Image Question Answering
• Predict the answer for image-question data
• 9,500 candidate answers cover ~90% of the questions
• Pros: Enlarges the dataset (image-QA pairs make up ~⅓ of the sentences in the entire dataset)
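A sketch of how the matching targets could be built; the function name and toy sentences are made up. The image-QA task would then be a classifier over the ~9,500-answer set applied to the cross-modal [CLS] output.

```python
import random

def corrupt_sentences(sentences):
    """With probability 0.5, swap each sentence for one from another
    image; label 1 = matched pair, 0 = mismatched pair."""
    out, labels = [], []
    for i, sent in enumerate(sentences):
        if len(sentences) > 1 and random.random() < 0.5:
            j = random.choice([k for k in range(len(sentences)) if k != i])
            out.append(sentences[j])   # mismatched sentence
            labels.append(0)
        else:
            out.append(sent)           # original, matching sentence
            labels.append(1)
    return out, labels

pairs, match_labels = corrupt_sentences(["a dog on grass", "two red cars"])
```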
• BUTD [Anderson et al., 2018]
• CrossAtt: use cross-modality attention sublayer(s)
• The model initialized with BERT (Pre-train + BERT) shows weaker results than the full model (Pre-train + scratch)
• P20: pre-trained w/o image-QA for 20 epochs
• P10+QA10: pre-trained w/o image-QA for 10 epochs, then w/ image-QA for 10 epochs
• DA: use other image-QA datasets as augmented data
• FT: use only the specific image-QA dataset in training
• DA drops model performance
• Encode the image with a Faster R-CNN encoder (Bottom-Up)
• Attend to the object RoI features using the language LSTM state (Top-Down)
• Take the element-wise product of the BUTD output and the language state (sketched below)
• Predict answers / generate captions
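A rough sketch of the BUTD attention step under assumed dimensions (2048-d bottom-up RoI features, 512-d LSTM state); the scoring network here is a simplification of the paper's attention network, and all names are illustrative.

```python
import torch
import torch.nn as nn

class TopDownAttention(nn.Module):
    def __init__(self, obj_dim=2048, lang_dim=512, hidden=512):
        super().__init__()
        self.proj = nn.Linear(obj_dim + lang_dim, hidden)
        self.score = nn.Linear(hidden, 1)
        self.obj_to_lang = nn.Linear(obj_dim, lang_dim)

    def forward(self, obj_feats, lang_state):
        # obj_feats: [batch, num_objects, obj_dim] bottom-up RoI features
        # lang_state: [batch, lang_dim] top-down LSTM question state
        expanded = lang_state.unsqueeze(1).expand(-1, obj_feats.size(1), -1)
        scores = self.score(torch.tanh(
            self.proj(torch.cat([obj_feats, expanded], dim=-1))))
        weights = scores.softmax(dim=1)                 # attention over RoIs
        attended = (weights * obj_feats).sum(dim=1)     # weighted image vector
        return self.obj_to_lang(attended) * lang_state  # element-wise product

fused = TopDownAttention()(torch.randn(2, 36, 2048), torch.randn(2, 512))
# `fused` then feeds the answer classifier / caption generator.
```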
• BERT-style
  ◦ Code/pre-trained model available online
• Learns representations for language/vision/cross-modality effectively
  ◦ Especially on NLVR2
• SOTA results on the VQA/GQA/NLVR2 tasks