BLEURT: Learning Robust Metrics for Text Generation

ryoma yoshimura
April 22, 2020


  1. BLEURT: Learning Robust Metrics 
 for Text Generation
 Thibault Sellam,

    Dipanjan Das, Ankur P. Parikh
 ACL 2020
 
 2020/04/22 Lab paper reading session
 Presenter: Yoshimura
 

  2. Abstract
 • We propose BLEURT, a learned metric based on BERT
 ◦ that can model human judgements with a few thousand possibly biased training examples
 • A key aspect of BLEURT is a novel pre-training scheme
 ◦ that uses millions of synthetic examples to help the model generalize
 • State-of-the-art results
 ◦ on the last three years of the WMT Metrics shared task
 ◦ and WebNLG Competition dataset
 • BLEURT is superior to vanilla BERT-based approach
 ◦ even when training data is scarce and out-of-distribution
 

  3. Introduction
 • Metrics based on surface similarity (BLEU, ROUGE)
 ◦ measure the surface similarity between the reference and the candidate
 ◦ correlate poorly with human judgment
 • Metrics based on learned components
 ◦ High correlation with human judgment
 ◦ Fully learned metrics (BEER, RUSE, ESIM)
 ▪ are trained end-to-end, and rely on handcrafted features and/or learned embeddings
 ▪ offer great expressivity
 ◦ Hybrid metrics (YiSi, BERTscore)
 ▪ combine trained elements, e.g., contextual embeddings, with handwritten logic, e.g., token alignment rules
 ▪ offer robustness

  4. Introduction
 • An ideal learned metric would be
 ◦ able to take full advantage of available ratings data for training
 ◦ and be robust to distribution drifts
 • The insight is that
 ◦ it is possible to combine expressivity and robustness
 ◦ by pre-training a metric on large amounts of synthetic data, before fine-tuning
 • To this end, we propose BLEURT
 ◦ Bilingual Evaluation Understudy with Representations from Transformers
 • A key ingredient of BLEURT is its pre-training scheme
 ◦ which uses random perturbations of Wikipedia sentences augmented with a diverse set of lexical and sentence-level signals

  5. Pre-Training on Synthetic Data
 
 • “Warm up” BERT before fine-tuning.
 ◦ Generate a large number of synthetic ref-cand pairs.
 ◦ Train BERT on several lexical- and semantic-level supervision signals with a multi-task loss.
 • Optimize the pre-training scheme for generality with three requirements:
 1. The set of reference sentences should be large and diverse.
 a. to cope with a wide range of NLG domains and tasks.
 2. The sentence pairs should contain a wide variety of lexical, syntactic, and semantic dissimilarities.
 a. to anticipate all variations that an NLG system may produce.
 3. The pre-training objectives should effectively capture those phenomena
 a. so that BLEURT can learn to identify them.

  6. Generating Sentence Pairs
 • Generate synthetic sentence pairs (z, z~)
 ◦ by randomly perturbing 1.8 million segments z from Wikipedia
 ◦ obtaining about 6.5 million perturbations z~
 • Three techniques:
 ◦ Mask filling with BERT
 ▪ Insert masks at random positions and fill them with the language model.
 ▪ Introduces lexical alterations while maintaining fluency.
 ◦ Backtranslation
 ▪ to create variants of the reference sentence that preserve semantics
 ◦ Dropping words (a sketch of mask filling and word dropping follows below)
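 A minimal sketch of the mask-filling and word-dropping perturbations. It assumes the Hugging Face transformers fill-mask pipeline with bert-base-uncased; the model choice, number of masks, and drop probability are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of two of the three perturbation techniques (mask filling with BERT,
# dropping words). Assumes the Hugging Face `transformers` fill-mask pipeline
# and bert-base-uncased; sampling heuristics here are illustrative only.
import random
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

def mask_filling(sentence: str, num_masks: int = 2) -> str:
    """Mask a few random positions and let the language model fill them in."""
    tokens = sentence.split()
    for _ in range(num_masks):
        pos = random.randrange(len(tokens))
        tokens[pos] = unmasker.tokenizer.mask_token  # "[MASK]"
        # One mask at a time, so the pipeline returns predictions for it only.
        best = unmasker(" ".join(tokens), top_k=1)[0]
        tokens[pos] = best["token_str"]
    return " ".join(tokens)

def drop_words(sentence: str, drop_prob: float = 0.15) -> str:
    """Randomly delete tokens to produce a lexically damaged variant."""
    kept = [t for t in sentence.split() if random.random() > drop_prob]
    return " ".join(kept) if kept else sentence

z = "BLEURT is a learned evaluation metric for text generation."
print(mask_filling(z))  # lexical alteration, fluency roughly preserved
print(drop_words(z))    # fluency deliberately damaged
```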
 
 

  7. Pre-training Signals
 • Automatic Metrics
 ◦ BLEU, ROUGE, BERTscore
 • Backtranslation Likelihood
 ◦ e.g., for English–French: τ_{en-fr, z~|z} = log P(z~ | z) / |z~|,
   where P(z~ | z) = Σ_{z_fr} P_{fr→en}(z~ | z_fr) · P_{en→fr}(z_fr | z)
 ◦ computing the summation over all possible French sentences is intractable
 ▪ so it is approximated with the single best translation z*_fr = argmax P_{en→fr}(z_fr | z)
 ◦ 4 pre-training signals (en-de, en-fr) in both directions
 τ: pre-training signal
 (z, z~): ref-cand pair
 |z~|: number of tokens in z~
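 As a concrete illustration of the automatic-metric signals above, here is a sketch that computes τ_BLEU, τ_ROUGE, and τ_BERTscore for one (z, z~) pair. It assumes the sacrebleu, rouge-score, and bert-score packages, which stand in for whichever implementations the authors actually used.

```python
# Sketch: lexical/semantic pre-training signals tau_BLEU, tau_ROUGE, and
# tau_BERTscore for one reference-candidate pair (z, z~). Assumes the
# `sacrebleu`, `rouge-score`, and `bert-score` packages (not prescribed
# by the paper).
import sacrebleu
from rouge_score import rouge_scorer
from bert_score import score as bertscore

def pretraining_signals(z: str, z_tilde: str) -> dict:
    signals = {}
    # Sentence-level BLEU of the perturbation against the reference.
    signals["tau_bleu"] = sacrebleu.sentence_bleu(z_tilde, [z]).score
    # ROUGE-L F-measure.
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    signals["tau_rouge"] = scorer.score(z, z_tilde)["rougeL"].fmeasure
    # BERTscore F1 (the paper also uses precision and recall as signals).
    P, R, F1 = bertscore([z_tilde], [z], lang="en")
    signals["tau_bertscore"] = F1.item()
    return signals

print(pretraining_signals(
    "The cat sat on the mat.",
    "A cat was sitting on the mat."))
```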

  8. Pre-training Signals
 • Textual Entailment
 ◦ whether z entails or contradicts z~, using a classifier
 ◦ using BERT fine-tuned on an entailment dataset, MNLI
 • Backtranslation flag
 ◦ Boolean that indicates whether the perturbation was generated with backtranslation or with mask-filling
 Modeling
 • Task-level losses are aggregated with a weighted sum:
   ℓ_pre-training = (1/M) Σ_{m=1..M} Σ_{k=1..K} γ_k · ℓ_k(τ_k^m, τ̂_k^m)
 M: number of synthetic examples
 K: number of tasks
 γ_k: weight of task k
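 A PyTorch sketch of this weighted multi-task objective. The head layout, the number of regression tasks, and the γ_k values are hypothetical placeholders; it uses squared error for the regression signals and cross-entropy for the entailment signal, as one plausible choice of ℓ_k.

```python
# Sketch of the weighted multi-task pre-training objective
#   loss = sum_k gamma_k * l_k(tau_k, tau_hat_k), averaged over the batch.
# Head layout and task weights are illustrative placeholders.
import torch
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    def __init__(self, hidden_size: int, num_regression_tasks: int):
        super().__init__()
        # One scalar prediction per regression signal (BLEU, ROUGE, BERTscore, ...).
        self.regression = nn.Linear(hidden_size, num_regression_tasks)
        # 3-way entailment classifier (entail / contradict / neutral).
        self.entailment = nn.Linear(hidden_size, 3)

    def forward(self, cls_vector):
        return self.regression(cls_vector), self.entailment(cls_vector)

def pretraining_loss(reg_pred, reg_target, ent_logits, ent_target, gammas):
    """gammas: one weight per regression task plus one for the entailment task."""
    per_task_mse = ((reg_pred - reg_target) ** 2).mean(dim=0)   # shape [K_reg]
    reg_loss = (gammas[:-1] * per_task_mse).sum()
    ent_loss = gammas[-1] * nn.functional.cross_entropy(ent_logits, ent_target)
    return reg_loss + ent_loss

# Toy usage with random tensors standing in for BERT [CLS] outputs.
heads = MultiTaskHeads(hidden_size=768, num_regression_tasks=6)
cls = torch.randn(32, 768)                      # batch of 32 [CLS] vectors
reg_pred, ent_logits = heads(cls)
reg_target = torch.randn(32, 6)
ent_target = torch.randint(0, 3, (32,))
gammas = torch.ones(7)
print(pretraining_loss(reg_pred, reg_target, ent_logits, ent_target, gammas))
```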

  9. Experiments
 • WMT Metrics Shared Task (2017~2019)
 ◦ Evaluate robustness to quality drifts (2017)
 • WebNLG2017 Challenge Dataset
 ◦ Test BLEURT’s ability to adapt to different tasks
 • Ablation 
 ◦ Measure the contribution of each pre-training task
 • Model
 ◦ BLEURT is trained in three steps
 i. regular BERT pre-training
 ii. pre-training on synthetic data
 iii. fine-tuning on task-specific ratings
 ◦ Batch size: 32, lr: 1e-5
 ◦ pre-training steps: 0.8M, fine-tuning steps: 40k
 ◦ γ_k: set with grid search

  10. Experiments - WMT Metrics Shared Task
 
 
 • Dataset
 ◦ WMT2017-2019 (to-English)
 • Metrics
 ◦ Evaluate agreement between the metrics and the human ratings (see the sketch after this slide)
 ◦ Kendall’s Tau τ (all years), Pearson’s correlation (2017), DARR (2018, 2019)
 • Models
 ◦ BLEURT, BLEURTbase
 ▪ based on BERT-large and BERT-base
 ◦ BLEURT -pre, BLEURTbase -pre
 ▪ skip the pre-training phase and fine-tune directly on the WMT ratings
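 A small sketch of the segment-level agreement computation with Kendall's Tau and Pearson's r, using scipy.stats. DARR, the relative-ranking variant used for 2018/2019, is omitted here, and the scores below are made-up toy values.

```python
# Sketch: agreement between a metric's scores and human ratings using
# Kendall's Tau and Pearson's r. DARR (WMT18/19) is not reproduced here.
from scipy.stats import kendalltau, pearsonr

def agreement(metric_scores, human_ratings):
    tau, _ = kendalltau(metric_scores, human_ratings)
    r, _ = pearsonr(metric_scores, human_ratings)
    return {"kendall_tau": tau, "pearson_r": r}

# Toy example with hypothetical scores and ratings.
print(agreement([0.21, 0.55, 0.37, 0.80], [1.0, 3.5, 2.0, 4.2]))
```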
  11. Result - WMT Metrics Shared Task
 
 • Pre-training consistently improves the results of BLEURT and BLEURTbase
 ◦ up to 7.4 points increase (2017), 2.1 points (2018, 2019)
 ◦ the dataset used for 2017 is smaller, so pre-training is likelier to help
 • BLEURT yields SoTA performance for all years

  12. Robustness to Quality Drift
 • Create increasingly challenging datasets
 ◦ by sub-sampling records from the WMT17 dataset, keeping low-rated translations for training and high-rated translations for test (see the sketch below)
 ◦ 11.9% of the original 5,344 training records remain at α=3.0
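 A simplified sketch of a rating-skewed train/test split in the spirit of this sub-sampling. The sigmoid weighting and the toy records are assumptions for illustration, not the authors' exact sampling scheme.

```python
# Sketch of a rating-skewed split: low-rated pairs tend to land in training,
# high-rated pairs in test. The sigmoid weighting is a simplified stand-in
# for the paper's sub-sampling with skew factor alpha.
import math
import random

def skewed_split(records, alpha: float):
    """records: list of (ref, cand, human_rating), ratings roughly centered at 0.
    Larger alpha keeps fewer, lower-rated records for training."""
    train, test = [], []
    for ref, cand, rating in records:
        # Probability of landing in the training set decays with the rating;
        # alpha controls how aggressively high-rated records are excluded.
        p_train = 1.0 / (1.0 + math.exp(alpha * rating))
        (train if random.random() < p_train else test).append((ref, cand, rating))
    return train, test

# Toy usage with hypothetical records.
toy = [("ref a", "cand a", -0.8), ("ref b", "cand b", 0.1), ("ref c", "cand c", 0.9)]
train, test = skewed_split(toy, alpha=3.0)
print(len(train), len(test))
```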
 

  13. Experiments - WebNLG
 • Assess BLEURT’s capacity
 ◦ to adapt to new tasks with limited training data
 • Dataset 
 ◦ The WebNLG challenge benchmarks 
 ◦ 4,677 sentence pairs with 1 to 3 references
 ◦ evaluated on 3 aspects: semantics, grammar, and fluency
 
 
 • Systems 
 ◦ BLEURT -pre -wmt: fine-tuned on WebNLG only
 ◦ BLEURT -wmt: synthetic data → WebNLG
 ◦ BLEURT: synthetic data → WMT → WebNLG

  14. Result
 • Thanks to pre-training, BLEURT can quickly adapt to the new tasks.
 • BLEURT fine-tuned twice (on synthetic data, then on WMT) provides acceptable results on all tasks even without WebNLG training data.

  15. Ablation Experiments
 • Pre-training on BERTscore, entailment, and backtranslation signals yields improvements.
 ◦ symmetrically, ablating them degrades BLEURT
 • BLEU and ROUGE have a negative impact.
 • Signals that correlate highly with human scores
 ◦ help BLEURT
 • Signals that correlate poorly with human scores
 ◦ harm the model
 

  16. Conclusions
 • BLEURT
 ◦ a reference-based text generation metric for English
 • Novel pre-training scheme
 ◦ 1. regular BERT pre-training
 ◦ 2. pre-training on synthetic data
 ◦ 3. fine-tuning on task-specific ratings
 • Experiments
 ◦ SoTA results on the WMT Metrics shared task and the WebNLG Challenge dataset
 ◦ robust to both domain and quality drifts