Slide 1


BLEURT: Learning Robust Metrics for Text Generation
 Thibault Sellam, Dipanjan Das, Ankur P. Parikh
 ACL 2020
 
 2020/04/22 Lab paper reading
 Presenter: Yoshimura
 


Slide 2


Abstract
 ● We propose BLEURT, a learned metric based on BERT
 ○ that can model human judgements with a few thousand possibly biased training examples
 ● A key aspect of BLEURT is a novel pre-training scheme 
○ that uses millions of synthetic examples to help the model generalize
 ● State-of-the-art results
 ○ on the last three years of the WMT Metrics shared task
○ and the WebNLG Competition dataset
 ● BLEURT is superior to a vanilla BERT-based approach
 ○ even when training data is scarce and out-of-distribution
 


Slide 3


Introduction
 ● Metrics based on surface similarity (BLEU, ROUGE)
○ measure the surface similarity between the reference and the candidate.
 ○ correlate poorly with human judgment
 ● Metrics based on learned components
 ○ High correlation with human judgment
 ○ Fully learned metrics (BEER, RUSE, ESIM)
■ are trained end-to-end, and rely on handcrafted features and/or learned embeddings
 ■ offer great expressivity
 ○ Hybrid metrics (YiSi, BERTscore)
 ■ combine trained elements, e.g., contextual embeddings, with handwritten logic, e.g., token alignment rules
 ■ offer robustness


Slide 4


Introduction
 ● An ideal learned metric would be
 ○ able to take full advantage of available ratings data for training
 ○ and be robust to distribution drifts
 ● Insight is
 ○ possible to combine expressivity and robustness
 ○ by pre-training a metric on large amounts of synthetic data, before fine-tuning
● To this end, we propose BLEURT
 ○ Bilingual Evaluation Understudy with Representations from Transformers
 ● A key ingredient of BLEURT is a novel pre-training scheme
 ○ which uses random perturbations of Wikipedia sentences, augmented with a diverse set of lexical and sentence-level signals


Slide 5


Pre-Training on Synthetic Data
 
 ● “Warm up” BERT before fine-tuning.
 ○ Generate a large number of synthetic ref-cand pairs.
 ○ Train BERT on several lexical- and semantic-level supervision signals with a multi-task loss.
● Optimize the pre-training scheme for generality, with three requirements:
 1. The set of reference sentences should be large and diverse.
 a. to cope with a wide range of NLG domains and tasks.
2. The sentence pairs should contain a wide variety of lexical, syntactic, and semantic dissimilarities.
 a. to anticipate all variations that an NLG system may produce.
 3. The pre-training objectives should effectively capture those phenomena
a. so that BLEURT can learn to identify them.


Slide 6


Generating Sentence Pairs
 ● Generate synthetic sentence pairs (z, z~) 
○ by randomly perturbing 1.8 million segments z from Wikipedia
 ○ Obtain about 6.5 million perturbations z~
 ● Three techniques:
 ○ Mask filling with BERT
■ Insert masks at random positions and fill them with the language model.
 ■ Introduces lexical alterations while maintaining fluency.
 ○ Backtranslation
 ■ to create variants of the reference sentence that preserve semantics
 ○ Dropping words (see the sketch after this list)
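 A minimal sketch of how such perturbations could be generated, assuming the Hugging Face transformers fill-mask pipeline as a stand-in for the paper's BERT mask-filling setup (the checkpoint, mask count, and drop rate are illustrative assumptions, not the paper's exact configuration):

 import random
 from transformers import pipeline

 fill_mask = pipeline("fill-mask", model="bert-base-uncased")  # assumed checkpoint

 def mask_filling(sentence, n_masks=2):
     """Mask a few random positions and let the language model fill them in."""
     tokens = sentence.split()
     for pos in random.sample(range(len(tokens)), min(n_masks, len(tokens))):
         tokens[pos] = fill_mask.tokenizer.mask_token
         tokens[pos] = fill_mask(" ".join(tokens))[0]["token_str"]  # top prediction
     return " ".join(tokens)

 def drop_words(sentence, drop_rate=0.15):
     """Randomly drop words from the reference sentence."""
     kept = [w for w in sentence.split() if random.random() > drop_rate]
     return " ".join(kept) if kept else sentence

 z = "BLEURT is a learned evaluation metric for text generation."
 print(mask_filling(z))  # lexical alteration, fluency preserved by the LM
 print(drop_words(z))    # pathological candidate with missing words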
 
 


Slide 7


Pre-training Signals
 ● Automatic Metrics
 ○ BLEU, ROUGE, BERTscore
● Backtranslation Likelihood
 ○ τ_en-fr,z~|z(z, z~) = log P(z~ | z) / |z~|, where P(z~ | z) = Σ_zfr P_fr→en(z~ | zfr) · P_en→fr(zfr | z)
 ○ Computing the summation over all possible French sentences is intractable, so approximate with zfr* = argmax P_en→fr(zfr | z) and assume P_en→fr(zfr* | z) ≈ 1:
 ○ τ_en-fr,z~|z(z, z~) ≈ log P_fr→en(z~ | zfr*) / |z~|
 ○ This yields 4 pre-training signals (en-fr, en-de), each in both directions (see the sketch below).
 τ: pre-training signal
 (z, z~): ref-cand pair
 |z~|: number of tokens in z~
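 A minimal sketch of this signal, assuming publicly available MarianMT en↔fr checkpoints from Hugging Face transformers as stand-ins for the paper's translation models (the model names and the text_target tokenizer argument are assumptions about the library and its version):

 import torch
 from transformers import MarianMTModel, MarianTokenizer

 tok_en_fr = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-fr")
 nmt_en_fr = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-fr")
 tok_fr_en = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-fr-en")
 nmt_fr_en = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-fr-en")

 def backtranslation_likelihood(z, z_tilde):
     """Approximate tau_en-fr,z~|z = log P_fr->en(z~ | zfr*) / |z~|."""
     # zfr*: beam-search translation of the reference z into French
     batch = tok_en_fr([z], return_tensors="pt")
     zfr_star = tok_en_fr.batch_decode(nmt_en_fr.generate(**batch),
                                       skip_special_tokens=True)[0]
     # Score the candidate z~ under the fr->en model, conditioned on zfr*.
     enc = tok_fr_en([zfr_star], return_tensors="pt")
     labels = tok_fr_en(text_target=[z_tilde], return_tensors="pt")["input_ids"]
     with torch.no_grad():
         out = nmt_fr_en(**enc, labels=labels)
     # out.loss is the mean per-token negative log-likelihood, so its negation
     # is already log P(z~ | zfr*) normalized by the number of tokens |z~|.
     return -out.loss.item()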


Slide 8


Pre-training Signals
 ● Textual Entailment
 ○ whether z entails or contradicts z~ using a classifier.
○ using BERT fine-tuned on an entailment dataset, MNLI (a classifier sketch follows this list)
 ● Backtranslation flag
○ Boolean that indicates whether the perturbation was generated with backtranslation or with mask-filling.
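 A minimal sketch of the entailment signal, assuming roberta-large-mnli from Hugging Face transformers as a stand-in for the paper's BERT-based MNLI classifier (the checkpoint choice is an assumption):

 import torch
 from transformers import AutoModelForSequenceClassification, AutoTokenizer

 tok = AutoTokenizer.from_pretrained("roberta-large-mnli")   # stand-in checkpoint
 nli = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

 def entailment_signal(z, z_tilde):
     """Probabilities of entailment / contradiction / neutral for the pair (z, z~)."""
     enc = tok(z, z_tilde, return_tensors="pt")
     with torch.no_grad():
         probs = torch.softmax(nli(**enc).logits, dim=-1)[0]
     return {nli.config.id2label[i]: p.item() for i, p in enumerate(probs)}

 print(entailment_signal("BLEURT is a learned evaluation metric.",
                         "BLEURT is not a learned metric."))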
Modeling
 ● Aggregate the task-level losses with a weighted sum (see the sketch below):
 ○ ℓ_pre-training = (1/M) Σ_m Σ_k γ_k · ℓ_k(τ_k^m, τ̂_k^m)
 M: number of synthetic examples
 K: number of tasks
 γ_k: weight of task k
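 A minimal PyTorch sketch of this weighted multi-task loss, assuming per-task predictions and targets have already been computed by the task heads (all names are illustrative; which tasks are regression vs. classification follows the signal types above):

 import torch
 import torch.nn.functional as F

 def pretraining_loss(preds, targets, gammas, is_regression):
     """l_pre = sum_k gamma_k * l_k(tau_k, tau_hat_k), averaged over the batch."""
     total = 0.0
     for k, gamma_k in enumerate(gammas):
         if is_regression[k]:   # e.g. BLEU, ROUGE, BERTscore, backtranslation likelihood
             task_loss = F.mse_loss(preds[k], targets[k])
         else:                  # e.g. the entailment and backtranslation-flag signals
             task_loss = F.cross_entropy(preds[k], targets[k])
         total = total + gamma_k * task_loss
     return total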


Slide 9


Experiments
 ● WMT Metrics Shared Task (2017~2019)
 ○ Evaluate robustness to quality drifts (2017)
 ● WebNLG2017 Challenge Dataset
 ○ Test BLEURT’s ability to adapt to different tasks
 ● Ablation 
 ○ Measure the contribution of each pre-training task
 ● Model
○ BLEURT is trained in three steps
 i. regular BERT pre-training
 ii. pre-training on synthetic data
 iii. fine-tuning on task-specific ratings (see the sketch below)
 ○ Batch size: 32, lr: 1e-5
 ○ pre-training steps: 0.8M, fine-tuning steps: 40k
 ○ γ_k: set with grid search
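 A minimal sketch of the fine-tuning step, assuming (as in the paper) a linear layer on BERT's [CLS] vector trained with a regression loss on human ratings; the checkpoint, optimizer choice, and helper names are illustrative, while the batch size and learning rate follow the slide:

 import torch
 from transformers import BertModel, BertTokenizer

 tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")   # BLEURTbase-style
 bert = BertModel.from_pretrained("bert-base-uncased")
 head = torch.nn.Linear(bert.config.hidden_size, 1)               # y_hat = W v_[CLS] + b
 optim = torch.optim.Adam(list(bert.parameters()) + list(head.parameters()), lr=1e-5)

 def training_step(references, candidates, ratings):
     """One gradient step on a batch of (reference, candidate, human rating) triples."""
     enc = tokenizer(references, candidates, return_tensors="pt",
                     padding=True, truncation=True)
     cls = bert(**enc).last_hidden_state[:, 0]          # [CLS] vector of each pair
     pred = head(cls).squeeze(-1)                       # predicted human rating
     loss = torch.nn.functional.mse_loss(pred, torch.tensor(ratings))
     loss.backward()
     optim.step()
     optim.zero_grad()
     return loss.item()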


Slide 10


Experiments - WMT Metrics Shared Task
 
 
● Dataset 
 ○ WMT2017-2019 (to-English)
 ● Metrics
○ Evaluate agreement between the metrics and the human ratings.
 ○ Kendall's Tau τ (all years), Pearson's correlation (2017), DARR (2018, 2019); see the sketch after this list.
 ● Models
 ○ BLEURT, BLEURTbase
■ Based on BERT-large and BERT-base
 ○ BLEURT -pre, BLEURTbase -pre
 ■ Skip the pre-training phase and fine-tune directly on the WMT ratings.
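 A minimal sketch of the agreement computation with SciPy, on illustrative values (DARR, the relative-ranking variant used in 2018/2019, is not shown):

 from scipy.stats import kendalltau, pearsonr

 metric_scores = [0.31, 0.58, 0.42, 0.77]   # illustrative segment-level metric scores
 human_ratings = [0.25, 0.60, 0.35, 0.80]   # illustrative human ratings

 tau, _ = kendalltau(metric_scores, human_ratings)
 r, _ = pearsonr(metric_scores, human_ratings)
 print(f"Kendall's tau = {tau:.3f}, Pearson's r = {r:.3f}")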

Slide 11


Result - WMT Metrics Shared Task
 


Slide 12


Result - WMT Metrics Shared Task
 
 ● Pre-training consistently improves the results of BLEURT and BLEURTbase
○ up to a 7.4-point increase (2017), 2.1 points (2018, 2019)
 ○ the dataset used for 2017 is smaller, so pre-training is likelier to help
 ● BLEURT yields SoTA performance for all years.


Slide 13


Robustness to Quality Drift
 ● Create increasingly challenging datasets
○ by sub-sampling records from the WMT17 dataset, keeping low-rated translations for training and high-rated translations for test.
 ○ At α=3.0, only 11.9% of the original 5,344 training records remain.
 


Slide 14


Result
 ● Pre-training makes BLEURT significantly more robust to quality drift


Slide 15


Experiments - WebNLG
● Assess BLEURT's capacity
 ○ to adapt to new tasks with limited training data.
 ● Dataset 
 ○ The WebNLG challenge benchmarks 
○ 4,677 sentence pairs with 1 to 3 references
 ○ evaluated on 3 aspects: semantics, grammar, and fluency
 
 
 ● Systems 
 ○ BLEURT -pre -wmt: WebNLG
 ○ BLEURT -wmt: synthetic data → WebNLG
 ○ BLEURT: synthetic data → WMT → WebNLG


Slide 16


Result
 ● Thanks to pre-training, BLEURT can quickly adapt to the new tasks.
● BLEURT fine-tuned twice (on synthetic data, then on WMT ratings) provides acceptable results on all tasks even without WebNLG training data.


Slide 17


Ablation Experiments
● Pre-training on BERTscore, entailment, and backtranslation signals yields improvements.
 ○ symmetrically, ablating them degrades BLEURT
 ● BLEU and ROUGE have a negative impact.
 ● Pre-training signals that correlate highly with human scores
 ○ help BLEURT
 ● Pre-training signals that correlate poorly with human scores
 ○ harm the model
 


Slide 18


Conclusions
 ● BLEURT
○ a reference-based text generation metric for English
 ● Novel pre-training scheme
 ○ 1. regular BERT pre-training
 ○ 2. pre-training on synthetic data
○ 3. fine-tuning on task-specific ratings
 ● Experiments
○ SoTA results on the WMT Metrics shared task and the WebNLG Challenge dataset
 ○ robust to both domain and quality drifts