Slide 1

BLEURT: Learning Robust Metrics for Text Generation
Thibault Sellam, Dipanjan Das, Ankur Parikh (Google Research, New York)
ACL 2020 (2020.acl-main.704)
Introduced by Katsuhito Sudoh (NAIST) [SNLP 2020]

Slide 2

Quick Summary of the Paper
• Yet another sentence-level NLG evaluation metric
  • Predicts a real-valued score for a given pair of a hypothesis and the corresponding reference
• Fine-tuned model based on BERT
• Pre-training with different metrics & tasks
  • BLEU, ROUGE, BERTScore, back-translation likelihood, NLI, back-translation flag
• Outperforms other metrics on WMT/WebNLG
• Robust to conditions with little or no fine-tuning data

Slide 3

Rough Classification of Evaluation Metrics
• Model-free metrics
  • WER
  • BLEU (Papineni+ 2002)
  • NIST (Doddington 2002)
  • ROUGE (Lin+ 2003)
  • METEOR (Banerjee+ 2005)
  • TER (Snover+ 2006)
  • RIBES (Isozaki+ 2010)
  • chrF (Popović 2015)
• Model-based metrics
  • Fully-learned metrics
    • BEER (Stanojević+ 2014)
    • RUSE (Shimanaka+ 2018)
    • ESIM (Chen+ 2017, Mathur+ 2019)
    • BERT regressor (Shimanaka+ 2019)
  • Hybrid metrics
    • YiSi (Lo, 2019)
    • BERTScore (Zhang+ 2020)
    • MoverScore (Zhao+ 2019)

Slide 4

Fully-learned vs. Hybrid
• Fully-learned metrics
  • Direct score prediction
  • Expressivity
  • Tunable to different properties
  • Require human-labeled training data
  • Sensitive to domain and quality drift
• Hybrid metrics
  • Combine learned elements (e.g., contextual embeddings) with hand-written logic
  • Robustness
  • Usable with little or no training data
  • No i.i.d. assumption

Slide 5

Applying BERT to NLG Evaluation
• Predicts a score for a given pair of a hypothesis and a reference
• Uses the vector at the [CLS] token (Figure 2 of Shimanaka+ (2019): BERT sentence-pair encoding); a minimal code sketch follows below
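To make the [CLS]-based scoring concrete, here is a minimal sketch (not the authors' code) of a BERT sentence-pair regressor using HuggingFace Transformers and PyTorch; the checkpoint name and the untrained linear head are illustrative assumptions.

```python
# Minimal sketch of a BERT-based regressor for (reference, candidate) pairs.
# Assumptions: "bert-base-uncased" stands in for the actual checkpoint, and the
# linear head is untrained, so the printed score is meaningless until fine-tuned.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
head = torch.nn.Linear(encoder.config.hidden_size, 1)  # predicts one quality score

def score_pair(reference: str, candidate: str) -> float:
    # Encode the pair as a single sequence: [CLS] reference [SEP] candidate [SEP]
    inputs = tokenizer(reference, candidate, return_tensors="pt", truncation=True)
    with torch.no_grad():
        cls_vec = encoder(**inputs).last_hidden_state[:, 0, :]  # vector at [CLS]
    return head(cls_vec).item()

print(score_pair("The cat sat on the mat.", "A cat was sitting on the mat."))
```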

Slide 6

BLEURT (Bilingual Evaluation Understudy with Representations from Transformers)
• BERT-based NLG evaluation metric with a novel pre-training scheme
• Training pipeline: pre-trained BERT → multi-task pre-training on synthetic data (warm-up) → fine-tuning on human-labeled data → BLEURT model
(a usage sketch follows below)
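For orientation, scoring with a released BLEURT checkpoint looks roughly like the sketch below, following the usage documented in the google-research/bleurt repository; the checkpoint path is a placeholder, and the exact API may differ between library versions.

```python
# Sketch of scoring with a released BLEURT checkpoint via the google-research/bleurt
# package. The checkpoint path is a placeholder: download a released checkpoint
# (see the repository's README) and point to its directory.
from bleurt import score

checkpoint = "path/to/bleurt_checkpoint"  # placeholder
scorer = score.BleurtScorer(checkpoint)

references = ["The cat sat on the mat."]
candidates = ["A cat was sitting on the mat."]
scores = scorer.score(references=references, candidates=candidates)
print(scores)  # one real-valued score per (reference, candidate) pair
```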

Slide 7

Pre-training Requirements
1) The dataset should be large and diverse enough to cover various NLG domains and tasks
2) The sentence pairs should contain various lexical, syntactic, and semantic errors
3) The objectives should effectively capture these errors

Slide 8

Data Synthesis for Pre-training
• Pseudo-hypotheses (6.5M) generated by perturbing Wikipedia sentences (1.8M)
  i.   BERT mask-filling (single tokens or contiguous spans)
       • Up to 15 tokens are masked per sentence
       • Beam search (width = 8) to avoid token repetitions
  ii.  Back-translation (En → De → En) with a Transformer model
  iii. Word dropping, applied to 30% of the synthetic pairs
       • The number of dropped words is drawn uniformly, up to the sentence length
(a toy sketch of the perturbations follows below)
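The toy sketch below illustrates two of these perturbations (single-token mask-filling with BERT and random word dropping). It is deliberately much simpler than the paper's pipeline, which masks up to 15 tokens with beam search and also uses back-translation; the model name and sampling details are assumptions.

```python
# Toy sketch of synthetic-pair generation: fill one masked token with BERT and drop
# random words. Far simpler than the paper's pipeline (multi-token masking with beam
# search, plus back-translation); "bert-base-uncased" is an assumption.
import random
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def perturb_mask_fill(sentence: str) -> str:
    tokens = sentence.split()
    i = random.randrange(len(tokens))
    tokens[i] = fill_mask.tokenizer.mask_token          # e.g. "[MASK]"
    return fill_mask(" ".join(tokens))[0]["sequence"]   # top-1 completion

def perturb_drop_words(sentence: str) -> str:
    tokens = sentence.split()
    n_drop = random.randint(1, len(tokens))             # number of drops drawn uniformly
    dropped = set(random.sample(range(len(tokens)), n_drop))
    return " ".join(t for i, t in enumerate(tokens) if i not in dropped)

z = "BLEURT is a learned evaluation metric for text generation."
print(perturb_mask_fill(z))
print(perturb_drop_words(z))
```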

Slide 9

Pre-training Signals (Regression Objective)
• Automatic metrics
  • BLEU (Moses' sentenceBLEU)
  • ROUGE = (ROUGE-P, ROUGE-R, ROUGE-F), computed as ROUGE-N
  • BERTScore = (BERTScore-P, BERTScore-R, BERTScore-F), computed with bert_score
• Back-translation likelihood (length-normalized), using seq2seq translation models (Google seq2seq)
  • en → de → en and en → fr → en round trips, in both directions: P(z̃ | z) and P(z | z̃)
• Approximation: P(z̃ | z) ≈ P_fr→en(z̃ | z*_fr), where z*_fr = argmax_z_fr P_en→fr(z_fr | z)
(a sketch of the metric signals follows below)
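Below is a hedged sketch of how the automatic-metric targets for one (z, z̃) pair can be computed with off-the-shelf packages (sacrebleu, rouge-score, bert-score); these stand-ins may differ in detail from the tools named on the slide, and the back-translation likelihood signals are omitted.

```python
# Approximate sketch of the automatic-metric pre-training signals for one (z, z~) pair,
# using sacrebleu, rouge-score, and bert-score as stand-ins for the paper's exact tooling.
# Back-translation likelihoods (which need trained MT models) are omitted here.
import sacrebleu
from bert_score import score as bert_score
from rouge_score import rouge_scorer

reference = "BLEURT is a learned evaluation metric for text generation."
candidate = "BLEURT is learned metric for generating text."

bleu = sacrebleu.sentence_bleu(candidate, [reference]).score       # sentence-level BLEU
rouge = rouge_scorer.RougeScorer(["rouge1"]).score(reference, candidate)["rouge1"]
P, R, F = bert_score([candidate], [reference], lang="en")           # BERTScore P/R/F1

signals = {
    "bleu": bleu,
    "rouge_p": rouge.precision, "rouge_r": rouge.recall, "rouge_f": rouge.fmeasure,
    "bertscore_p": P.item(), "bertscore_r": R.item(), "bertscore_f": F.item(),
}
print(signals)
```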

Slide 10

Pre-training Signals (Multi-class Objective)
• Textual entailment
  • Whether the reference entails or contradicts the corresponding pseudo-hypothesis
  • The teacher signal (probabilities of entail/neutral/contradict) is given by an MNLI-fine-tuned BERT (a sketch follows below)
• Back-translation flag
  • Whether the perturbation came from back-translation or from mask-filling
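A minimal sketch of this teacher signal, assuming the publicly available roberta-large-mnli checkpoint as a stand-in for the MNLI-fine-tuned BERT used in the paper.

```python
# Sketch of the entailment teacher signal: probabilities of entailment/neutral/contradiction
# for (reference, pseudo-hypothesis). "roberta-large-mnli" is a stand-in checkpoint, not
# necessarily the model used in the paper.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

reference = "BLEURT is a learned evaluation metric for text generation."
pseudo_hypothesis = "BLEURT is not a learned metric."

inputs = tokenizer(reference, pseudo_hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)[0]

# Read the label order from the model config instead of hard-coding it.
for idx, label in model.config.id2label.items():
    print(label, round(probs[idx].item(), 3))
```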

Slide 11

Experiments
• Two model versions
  • BLEURT, based on BERT-Large
  • BLEURT-Base, based on BERT-Base
• Two tasks
  • WMT Metrics Shared Task (2017, 2018, 2019)
  • WebNLG 2017 Challenge dataset
• Ablation test

Slide 12

Results: WMT Metrics (2017)
• BLEURT showed the best correlation with human ratings
• BLEURT was generally better than BLEURT-Base
• The proposed pre-training improved performance
(the correlation computation is sketched after the table)

Table 2 (taken from the paper): Agreement with human ratings on the WMT17 Metrics Shared Task; each cell shows Kendall Tau (τ) / Pearson correlation (r).

model                  cs-en        de-en        fi-en        lv-en        ru-en        tr-en        zh-en        avg
sentBLEU               29.6 / 43.2  28.9 / 42.2  38.6 / 56.0  23.9 / 38.2  34.3 / 47.7  34.3 / 54.0  37.4 / 51.3  32.4 / 47.5
MoverScore             47.6 / 67.0  51.2 / 70.8  NA           NA           53.4 / 73.8  56.1 / 76.2  53.1 / 74.4  52.3 / 72.4
BERTscore w/ BERT      48.0 / 66.6  50.3 / 70.1  61.4 / 81.4  51.6 / 72.3  53.7 / 73.0  55.6 / 76.0  52.2 / 73.1  53.3 / 73.2
BERTscore w/ roBERTa   54.2 / 72.6  56.9 / 76.0  64.8 / 83.2  56.2 / 75.7  57.2 / 75.2  57.9 / 76.1  58.8 / 78.9  58.0 / 76.8
chrF++                 35.0 / 52.3  36.5 / 53.4  47.5 / 67.8  33.3 / 52.0  41.5 / 58.8  43.2 / 61.4  40.5 / 59.3  39.6 / 57.9
BEER                   34.0 / 51.1  36.1 / 53.0  48.3 / 68.1  32.8 / 51.5  40.2 / 57.7  42.8 / 60.0  39.5 / 58.2  39.1 / 57.1
BLEURTbase -pre        51.5 / 68.2  52.0 / 70.7  66.6 / 85.1  60.8 / 80.5  57.5 / 77.7  56.9 / 76.0  52.1 / 72.1  56.8 / 75.8
BLEURTbase             55.7 / 73.4  56.3 / 75.7  68.0 / 86.8  64.7 / 83.3  60.1 / 80.1  62.4 / 81.7  59.5 / 80.5  61.0 / 80.2
BLEURT -pre            56.0 / 74.7  57.1 / 75.7  67.2 / 86.1  62.3 / 81.7  58.4 / 78.3  61.6 / 81.4  55.9 / 76.5  59.8 / 79.2
BLEURT                 59.3 / 77.3  59.9 / 79.2  69.5 / 87.8  64.4 / 83.5  61.3 / 81.1  62.9 / 82.4  60.2 / 81.4  62.5 / 81.8
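For orientation, these agreement numbers are correlations between a metric's segment-level scores and human ratings. A minimal sketch of the computation with toy values (not WMT data), using scipy:

```python
# Minimal sketch of the agreement computation behind the table: Kendall Tau and Pearson r
# between metric scores and human ratings. The values below are toy numbers, not WMT data.
from scipy.stats import kendalltau, pearsonr

human_ratings = [0.1, 0.4, 0.35, 0.8, 0.9]   # e.g., direct-assessment scores
metric_scores = [0.2, 0.5, 0.30, 0.7, 0.95]  # e.g., BLEURT outputs for the same segments

tau, _ = kendalltau(metric_scores, human_ratings)
r, _ = pearsonr(metric_scores, human_ratings)
print(f"Kendall tau = {tau:.3f}, Pearson r = {r:.3f}")
```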

Slide 13

Analysis on Quality Drift
• Simulate quality drift by sub-sampling the data with different skew factors
(Figures taken from the paper: Figure 1 shows the distribution of human ratings in the train/validation and test datasets for skew factors 0, 0.5, 1.0, 1.5, and 3.0; a companion figure plots Kendall Tau with human ratings against the test-set skew for BLEURT with and without pre-training, at several training skews, compared with BERTscore and BLEU.)

Slide 14

Results: WebNLG
• Applied the WMT-tuned BLEURT to the WebNLG 2017 data
• Compared with the BLEURT −pre −wmt and BLEURT −wmt variants
(Figure taken from the paper: Kendall Tau with human ratings for fluency, grammar, and semantics, split by system and by input, as a function of the number of systems/inputs (and records) used for training and validation; metrics compared: BLEU, TER, Meteor, BERTscore, BLEURT −pre −wmt, BLEURT −wmt, BLEURT.)

Slide 15

Takeaways
• Pre-training delivers consistent improvements, especially for BLEURT-Base (the smaller model)
• Pre-training makes BLEURT significantly more robust to quality drift
• Thanks to pre-training, BLEURT can quickly adapt to new tasks; BLEURT fine-tuned twice (on synthetic data, then on WMT data) provides acceptable results on all tasks without task-specific training data
(taken from the paper)

Slide 16

Ablation Test
• The BERTScore, entailment, and back-translation score signals yield gains
• BLEU and ROUGE have a negative impact
• Open question: should we remove them from the pre-training signals?
(Figure 4 taken from the paper: relative improvement/degradation (%) in Kendall Tau on WMT17 for BLEURT and BLEURTbase, when pre-training on a single task (baseline: no pre-training) and when removing one task (baseline: all pre-training tasks), for the BERTscore, entail, backtrans, method_flag, BLEU, and ROUGE signals.)

Slide 17

Conclusions
• BLEURT can model human assessment accurately
• Pre-training with several different signals
• Robust to domain drift, quality drift, and data scarcity
• Future work: multilingual NLG evaluation?

Slide 18

My Impressions
• BLEURT penalizes the following types of errors better than BERTScore:
  • Wrong negations
  • Wrong word replacements
  • Word salad
• Model-based metrics are powerful, but difficult to use for low-resourced languages...

Slide 19

References (*some are not listed in the paper)
• Banerjee, S. et al.: METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments, Proc. ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65-72, 2005 [link]
• Chen, Q. et al.: Enhanced LSTM for Natural Language Inference, Proc. ACL 2017, pp. 1657-1668, 2017 [link]
• Doddington, G.: Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics, Proc. HLT'02, pp. 138-145, 2002 [link]

Slide 20

References (*some are not listed in the paper)
• Isozaki, H. et al.: Automatic Evaluation of Translation Quality for Distant Language Pairs, Proc. EMNLP 2010, pp. 944-952, 2010 [link]
• Lin, C.-Y. et al.: Automatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics, Proc. HLT-NAACL 2003, pp. 71-78, 2003 [link]
• Lo, C.-K.: YiSi - a Unified Semantic MT Quality Evaluation and Estimation Metric for Languages with Different Levels of Available Resources, Proc. WMT 2019, pp. 507-513, 2019 [link]
• Mathur, N. et al.: Putting Evaluation in Context: Contextual Embeddings Improve Machine Translation Evaluation, Proc. ACL 2019, pp. 2799-2808, 2019 [link]

Slide 21

References (*some are not listed in the paper)
• Papineni, K. et al.: BLEU: a Method for Automatic Evaluation of Machine Translation, Proc. ACL 2002, pp. 311-318, 2002 [link]
• Popović, M.: chrF: character n-gram F-score for automatic MT evaluation, Proc. WMT 2015, pp. 392-395, 2015 [link]
• Snover, M. et al.: A Study of Translation Edit Rate with Targeted Human Annotation, Proc. AMTA 2006, pp. 223-231, 2006 [link (PDF)]
• Shimanaka, H. et al.: RUSE: Regressor Using Sentence Embeddings for Automatic Machine Translation Evaluation, Proc. WMT 2018, pp. 751-758, 2018 [link]

Slide 22

References (*some are not listed in the paper)
• Shimanaka, H. et al.: Machine Translation Evaluation with BERT Regressor, arXiv preprint arXiv:1907.12679 (cs.CL), 2019 [link]
• Stanojević, M. et al.: BEER: BEtter Evaluation as Ranking, Proc. WMT 2014, pp. 414-419, 2014 [link]
• Zhang, T. et al.: BERTScore: Evaluating Text Generation with BERT, Proc. ICLR 2020, arXiv preprint arXiv:1904.09675 (cs.CL), 2020 [link]
• Zhao, W. et al.: MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance, Proc. EMNLP-IJCNLP 2019, pp. 563-578, 2019 [link]