BLEURT: Learning Robust Metrics for Text Generation

BLEURT: Learning Robust Metrics   for Text Generation  Thibault Sellam,
Dipanjan Das, Ankur P. Parikh  ACL 2020    2020/04/22 研究室論文紹介   紹介者: 吉村   

Abstract  • We propose BLEURT, a learned metric based on
BERT  ◦ that can model human judgements with a few thousand possibly biased training examples  • A key aspect of BLEURT is a novel pre-training scheme   ◦ that uses millions of synthetic example to help model generalize  • State-of-the-art results  ◦ on the last three years of the WMT Metrics shared task  ◦ and WebNLG Competition dataset  • BLEURT is superior to vanilla BERT-based approach  ◦ even when training data is scarce and out-of-distribution   

Introduction  • Metrics based on surface similarity (BLEU, ROUGE)  ◦
measure the surface similarity b/w the reference and candidate.  ◦ correlate poorly with human judgment  • Metrics based on learned components  ◦ High correlation with human judgment  ◦ Fully learned metrics (BEER, RUSE, ESIM)  ▪ are trained ent-to-end, and rely on handcrafted features and/or learned embeddings ▪ offers gread expressivity ◦ Hybrid metrics (YiSi, BERTscore)  ▪ combine trained elements, e.g., contextual embeddings, with handwritten logic, e.g., as token alignment rules ▪ offers robustness  

Introduction  • An ideal learned metric would be  ◦ able
to take full advantage of available ratings data for training  ◦ and be robust to distribution drifts  • Insight is  ◦ possible to combine expressivity and robustness  ◦ by pre-training a metric on large amounts of synthetic data, before fine-tuning  • To this end, Wepropose BLEURT  ◦ Bilingual Evaluation Understudy with Representations from Transformers  • A key ingredient of BLEURT is pre-training scheme  ◦ , which uses random perturbations of Wikipedia sentences augmented with a diverse set of lexical and sentence-level signals 

Pre-Training on Synthetic Data    • “Warm up” BERT before
fine-tuning.  ◦ Generate a large number of synthetic ref-cand pairs.  ◦ Train BERT on several lexical- and semantic-level supervision signals with a multi-task loss.  • Optimize pre-training scheme for generality with   1. The set of reference sentences should be large and diverse.  a. to cope with a wide range of NLG domains and tasks.  2. The sentence pairs should contain wide variety of lexical, syntactic, and semantic dissimilarities.  a. to anticipate all variations that an NLG system may produce.  3. The pre-training objectives should effectively capture those phenomena  a. so that BLEURT can lean to identify them. 

Generating Sentence Pairs  • Generate synthetic sentence pairs (z, z~)
  ◦ by randomly perturbing 1.8 million segments z from WIkipedia  ◦ Obtain about 6.5 million perturbations z~  • Three techniques:  ◦ Mask filling with BERT  ▪ Inserting masks at random positions and fill them with the language model. ▪ Introduce lexical alterations while maintaining the fluency ◦ Backtranskation  ▪ to create variants of the reference sentence that preserves semantics ◦ Dropping words     

Pre-training Signals  • Automatic Metrics  ◦ BLEU, ROUGE, BERTscore  •
Backtranslation Likelihood      ◦ computing the summation over all possible French sentences is intractable    ◦ 4 pre-training signals (en-de, en-fr) in both directions.  τ: pre-training signal  (z, z~): ref-cand pairs  |z~|: number of tokens in z ~ 

Pre-training Signals  • Textual Entailment  ◦ whether z entails or
contradicts z~ using a classifier.  ◦ using BERT fine-tuned on an entailment dataset, MNLI  • Backtranslation flag  ◦ Boolean that indicates whether the perturbation was generated  ◦ with backtranslation or with mask-filling.  Modeling  • Task-lebel losses with a weighted sum.      M: number of synthetic examples   K: number of task  γ k : weight 

Experiments  • WMT Metrics Shared Task (2017~2019)  ◦ Evaluate robustness
to quality drifts (2017)  • WebNLG2017 Challenge Dataset  ◦ Test BLEURT’s ability to adapt to different tasks  • Ablation   ◦ Measure the contribution of each pre-training task  • Model  ◦ BLEURT are trained in three steps  i. regular BERT pre-training ii. pretraining on synthetic data iii. fine-tuning on task-specific rating ◦ Batch size: 32, lr: 1e-5,   ◦ pre-training steps: 0.8M, fine-tuning steps: 40k  ◦ γ k : set with grid search 

Experiments - WMT Metrics Shared Task     • Dataset  
◦ WMT2017-2019 (to-English)  • Metrics  ◦ Evaluate agreement b/w the metrics and the human ratings.  ◦ Kndall’s Tau τ (all year), Pearson’s correlation (2017), DARR (2018, 2019)  • Models  ◦ BLEURT, BLEURTbase  ▪ Based on BERT-large and BERT-base ◦ BLEURT-pre, BLUERTbase-pre  ▪ Skip the pre-training phase and fine-tune directly on the WMT ratings.

Result - WMT Metrics Shared Task   

Result - WMT Metrics Shared Task    • Pre-training consistently
improves the results of BLEURT and BLEURTbase  ◦ up to 7.4 points increase (2017), 2.1 points (2018, 2019)  ◦ dataset used for 2017 is smaller, so pre-training is likelier to help.  • BLERT yields SoTA performance for all years. 

Robustness to Quality Drift  • Create increasingly challenging datasets  ◦
sub-sampling records from WMT17 dataset, keeping low-rated translations for training and high-rated translations for test.  ◦ 11.9% of the original 5,344 training records (α=3.0）   

Result  • Pre-training makes BLEURT significantly more robust to quality
drift 

Experiments - WebNLG  • Asses BLEURT’s capacity  ◦ to adapt
to new tasks with limited training data.  • Dataset   ◦ The WebNLG challenge benchmarks   ◦ 4,677 sentence pairs with 1 to 3 reference  ◦ evaluated on 3 aspects: semantics, grammar, and fluency      • Systems   ◦ BLEURT -pre -wmt: WebNLG  ◦ BLEURT -wmt: synthetic data → WebNLG  ◦ BLEURT: synthetic data → WMT → WebNLG  e.g. 

Result  • Thanks to pre-training, BLEURT can quickly adapt to
the new tasks.  • BLEURT fine-tuned twice provides acceptable results on all tasks without training data. 

Ablation Experiments  • Pretraining on BERTscore, entail, and backtrans yield
improvements.  ◦ symmetrically, ablating them degrades BLEURT  • BLEU and ROUGE have negative impact.  • High correlation with human score  ◦ helps BLEURT  • Low correlation with human score  ◦ harm the model   

Conclusions  • BLEURT  ◦ a reference-based text generation metrics for
English  • Novel pre-train scheme  ◦ 1. regular BERT pre-training  ◦ 2. pre-training on synthetic data  ◦ 3. fine-tuning on task-specific rating  • Experiments  ◦ SoTA results on WMT and WebNLG Challenge dataset  ◦ robust both domain and quality drifts     

BLEURT: Learning Robust Metrics for Text Genera...

BLEURT: Learning Robust Metrics for Text Generation

ryoma yoshimura

More Decks by ryoma yoshimura

Featured

Transcript

BLEURT: Learning Robust Metrics   for Text Generation  Thibault Sellam,

Abstract  • We propose BLEURT, a learned metric based on

Introduction  • Metrics based on surface similarity (BLEU, ROUGE)  ◦

Introduction  • An ideal learned metric would be  ◦ able

Pre-Training on Synthetic Data    • “Warm up” BERT before

Generating Sentence Pairs  • Generate synthetic sentence pairs (z, z~)

Pre-training Signals  • Automatic Metrics  ◦ BLEU, ROUGE, BERTscore  •

Pre-training Signals  • Textual Entailment  ◦ whether z entails or

Experiments  • WMT Metrics Shared Task (2017~2019)  ◦ Evaluate robustness

Experiments - WMT Metrics Shared Task     • Dataset

Result - WMT Metrics Shared Task

Result - WMT Metrics Shared Task    • Pre-training consistently

Robustness to Quality Drift  • Create increasingly challenging datasets  ◦

Result  • Pre-training makes BLEURT significantly more robust to quality

Experiments - WebNLG  • Asses BLEURT’s capacity  ◦ to adapt

Result  • Thanks to pre-training, BLEURT can quickly adapt to

Ablation Experiments  • Pretraining on BERTscore, entail, and backtrans yield

Conclusions  • BLEURT  ◦ a reference-based text generation metrics for