
[Reading] Bleurt: Learning Robust Metrics for Text Generation (Sellam et al., ACL 2020)

Katsuhito Sudoh
September 22, 2020


These slides are for my presentation at SNLP 2020, an NLP paper reading group meeting.


Transcript

  1. BLEURT: Learning Robust Metrics for Text Generation Thibault Sellam, Dipanjan

    Das, Ankur Parikh Google Research, New York [SNLP 2020] (ACL 2020; 2020.acl-main.704) Introduced by Katsuhito Sudoh (NAIST)
  2. Quick Summary of the Paper •Yet another sentence-level NLG evaluation

    metric • Predicting a real-valued score for a given pair of a hypothesis and the corresponding reference • Fine-tuned model based on BERT • Pre-training with different metrics & tasks • BLEU, ROUGE, BERTScore, BT likelihood, NLI, BT flag • Outperforms other metrics in WMT/WebNLG • Robust to small or no fine-tuning data conditions 2
  3. Rough Classification of Evaluation Metrics 3 •Model-free metrics • WER

    • BLEU (Papineni+ 2002) • NIST (Doddington 2002) • ROUGE (Lin+ 2003) • METEOR (Banerjee+ 2005) • TER (Snover+ 2006) • RIBES (Isozaki+ 2010) • chrF (Popović 2015) •Model-based metrics • Fully-learned metrics • BEER (Stanojević+ 2014) • RUSE (Shimanaka+ 2018) • ESIM (Chen+ 2017, Mathur+ 2019) • BERT regressor (Shimanaka+ 2019) • Hybrid metrics • YiSi (Lo, 2019) • BERTScore (Zhang+ 2020) • MoverScore (Zhao+ 2019)
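The model-free metrics above all score a hypothesis by surface overlap with the reference. As a minimal illustration (a sketch, not any listed metric's full definition), the 1-gram building block of BLEU is clipped unigram precision; the example sentences are made up:

```python
# Sketch of clipped unigram precision, the 1-gram component of BLEU.
# Real BLEU combines higher-order n-gram precisions with a brevity
# penalty; this is only the simplest building block.
from collections import Counter

def clipped_unigram_precision(hypothesis, reference):
    hyp, ref = Counter(hypothesis.split()), Counter(reference.split())
    # Each hypothesis token counts at most as often as it appears
    # in the reference ("clipping").
    matched = sum(min(count, ref[tok]) for tok, count in hyp.items())
    return matched / max(sum(hyp.values()), 1)

p = clipped_unigram_precision("the cat sat on the mat",
                              "the cat is on the mat")  # 5 of 6 tokens match
```

Model-based metrics replace this surface matching with learned representations, which is the contrast the slide draws.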
  4. Fully-learned vs. Hybrid 4 •Fully-learned •Direct score prediction •Expressivity •Tunable

    to different properties •Requires human-labeled training data •Sensitive to domain and quality drift •Hybrid •Element combination •Robustness •Available with little or no training data •No i.i.d. assumption
  5. BERT to NLG evaluation 5 •Predicts a score for a

    given pair of a hypothesis and reference •Uses the vector on the [CLS] token (from Shimanaka+ (2019)) [figure 2: BERT sentence-pair encoding]
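A minimal sketch of this idea, with the BERT encoder itself omitted: `cls_vector` stands in for the [CLS] output, and the weights `w` and `b` are hypothetical stand-ins for the learned regression head.

```python
# Hedged sketch: score prediction from a [CLS] sentence-pair vector,
# as in the BERT regressor of Shimanaka+ (2019). The encoder is not
# modeled; cls_vector plays the role of BERT's [CLS] output, and
# w, b are made-up weights (in practice they are learned).

def predict_score(cls_vector, w, b):
    """Linear regression head: y_hat = w . v_[CLS] + b."""
    return sum(wi * vi for wi, vi in zip(w, cls_vector)) + b

# Toy 4-dimensional example (real BERT-Base uses 768 dimensions).
cls_vector = [0.2, -0.1, 0.5, 0.3]
w = [1.0, 0.5, -0.2, 0.8]
b = 0.1
score = predict_score(cls_vector, w, b)  # a single real-valued score
```

BLEURT keeps exactly this architecture and changes only how the model is trained before fine-tuning.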
  6. BLEURT (BiLingual Evaluation Understudy with Representations from Transformers) •BERT-based NLG

    evaluation metric with a novel pre-training scheme 6 [diagram: pre-trained BERT model → multi-task pre-training on synthetic data (warm-up) → fine-tuning on human-labeled data → BLEURT model]
  7. Pre-training Requirements 1) The dataset should be large and diverse

    enough to cover various NLG domains/tasks 2) Sentence pairs should contain various lexical, syntactic, and semantic errors 3) The objectives should effectively capture them 7
  8. Data Synthesis for Pre-training •Pseudo-hypotheses (6.5M) by perturbing Wikipedia sentences

    (1.8M) i. BERT mask-filling (single or contiguous span) • Up to 15 tokens are masked per sentence • Beam search (width=8) to avoid token repetitions ii. Back-translation (En > De > En) by Transformer iii. Dropping words for 30% of the synthetic pairs • # of dropped words is drawn uniformly up to the sentence length 8
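Step iii above (word dropping) can be sketched as follows; the function name and RNG seed are illustrative, not taken from the paper's code.

```python
# Hedged sketch of the word-dropping perturbation: the number of
# dropped words is drawn uniformly up to the sentence length, as
# stated on the slide. The seed is arbitrary.
import random

def drop_words(tokens, rng):
    n_drop = rng.randint(0, len(tokens))           # uniform drop count
    drop_idx = set(rng.sample(range(len(tokens)), n_drop))
    # Keep the surviving tokens in their original order.
    return [tok for i, tok in enumerate(tokens) if i not in drop_idx]

rng = random.Random(0)
perturbed = drop_words("the quick brown fox jumps".split(), rng)
```

Applied to 30% of the synthetic pairs, this produces pseudo-hypotheses with omission errors that the pre-training signals must learn to penalize.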
  9. Pre-training Signals (Regression Objective) •Automatic metrics •BLEU (Moses'

    sentenceBLEU) •ROUGE = (ROUGE-P, ROUGE-R, ROUGE-F) (ROUGE-N) •BERTScore = (BERTScore-P, BERTScore-R, BERTScore-F) (bert_score) •Back-translation likelihood (length normalized; Google seq2seq models) •z→de→z̃, z̃→de→z, z→fr→z̃, z̃→fr→z •Approximation: P(z̃|z) ≈ P_de→en(z̃|z*_de), where z*_de = argmax_z_de P_en→de(z_de|z) 9
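The length-normalized back-translation likelihood can be illustrated with hypothetical per-token log-probabilities; the translation model producing them is not modeled here.

```python
# Hedged sketch: the back-translation likelihood signal is the
# log-likelihood of the perturbed sentence z~ under a translation
# model, normalized by its length |z~|. The numbers are made up.

def bt_likelihood(token_logprobs):
    """Length-normalized log-likelihood: (1/|z~|) * sum(log P)."""
    return sum(token_logprobs) / len(token_logprobs)

signal = bt_likelihood([-0.5, -1.0, -0.3, -0.2])  # -> -0.5
```

Length normalization keeps the signal comparable across sentences of different lengths, which matters when it is used as a regression target.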
  10. Pre-training Signals (Multi-class Objective) •Textual Entailment •Whether the reference entails

    or contradicts the corresponding pseudo-hypothesis •The teacher signal (prob. of entail/neutral/contradict) is given by an MNLI-tuned BERT •Back-translation flag •Whether the perturbation was from back-translation or mask-filling 10
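Since the entailment teacher signal is a probability distribution rather than a hard label, the student can be trained with cross-entropy against that soft target. A minimal sketch, with made-up probabilities (the MNLI-tuned teacher itself is not modeled):

```python
# Hedged sketch: distillation-style loss against the teacher's
# entail/neutral/contradict distribution. Probabilities are
# hypothetical; in BLEURT the teacher is an MNLI-tuned BERT.
import math

def cross_entropy(teacher_probs, student_probs):
    """H(teacher, student) = -sum_i p_i * log q_i."""
    return -sum(p * math.log(q)
                for p, q in zip(teacher_probs, student_probs))

teacher = [0.7, 0.2, 0.1]   # entail / neutral / contradict
student = [0.6, 0.3, 0.1]
loss = cross_entropy(teacher, student)
```

The loss is minimized when the student matches the teacher distribution exactly, so the student inherits the teacher's graded judgments rather than hard labels.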
  11. Experiments •Two versions •BLEURT based on BERT-Large •BLEURT-Base based on

    BERT-Base •Two tasks •WMT Metrics Shared Task (2017, 2018, 2019) •WebNLG 2017 Challenge Dataset •Ablation test 11
  12. Results: WMT Metrics (2017) •BLEURT showed the best correlation •BLEURT

    was better than BLEURT-Base in general •Proposed pre-training improved the performance 12

    | model | cs-en τ / r | de-en τ / r | fi-en τ / r | lv-en τ / r | ru-en τ / r | tr-en τ / r | zh-en τ / r | avg τ / r |
    |---|---|---|---|---|---|---|---|---|
    | sentBLEU | 29.6 / 43.2 | 28.9 / 42.2 | 38.6 / 56.0 | 23.9 / 38.2 | 34.3 / 47.7 | 34.3 / 54.0 | 37.4 / 51.3 | 32.4 / 47.5 |
    | MoverScore | 47.6 / 67.0 | 51.2 / 70.8 | NA | NA | 53.4 / 73.8 | 56.1 / 76.2 | 53.1 / 74.4 | 52.3 / 72.4 |
    | BERTscore w/ BERT | 48.0 / 66.6 | 50.3 / 70.1 | 61.4 / 81.4 | 51.6 / 72.3 | 53.7 / 73.0 | 55.6 / 76.0 | 52.2 / 73.1 | 53.3 / 73.2 |
    | BERTscore w/ roBERTa | 54.2 / 72.6 | 56.9 / 76.0 | 64.8 / 83.2 | 56.2 / 75.7 | 57.2 / 75.2 | 57.9 / 76.1 | 58.8 / 78.9 | 58.0 / 76.8 |
    | chrF++ | 35.0 / 52.3 | 36.5 / 53.4 | 47.5 / 67.8 | 33.3 / 52.0 | 41.5 / 58.8 | 43.2 / 61.4 | 40.5 / 59.3 | 39.6 / 57.9 |
    | BEER | 34.0 / 51.1 | 36.1 / 53.0 | 48.3 / 68.1 | 32.8 / 51.5 | 40.2 / 57.7 | 42.8 / 60.0 | 39.5 / 58.2 | 39.1 / 57.1 |
    | BLEURTbase -pre | 51.5 / 68.2 | 52.0 / 70.7 | 66.6 / 85.1 | 60.8 / 80.5 | 57.5 / 77.7 | 56.9 / 76.0 | 52.1 / 72.1 | 56.8 / 75.8 |
    | BLEURTbase | 55.7 / 73.4 | 56.3 / 75.7 | 68.0 / 86.8 | 64.7 / 83.3 | 60.1 / 80.1 | 62.4 / 81.7 | 59.5 / 80.5 | 61.0 / 80.2 |
    | BLEURT -pre | 56.0 / 74.7 | 57.1 / 75.7 | 67.2 / 86.1 | 62.3 / 81.7 | 58.4 / 78.3 | 61.6 / 81.4 | 55.9 / 76.5 | 59.8 / 79.2 |
    | BLEURT | 59.3 / 77.3 | 59.9 / 79.2 | 69.5 / 87.8 | 64.4 / 83.5 | 61.3 / 81.1 | 62.9 / 82.4 | 60.2 / 81.4 | 62.5 / 81.8 |

    Table 2: Agreement with human ratings on the WMT17 Metrics Shared Task. The metrics are Kendall Tau (τ) and Pearson correlation (r). (taken from the paper)
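The τ in the table is Kendall's Tau between metric scores and human ratings; a minimal O(n²) version over toy score pairs (not WMT data) can be sketched as:

```python
# Hedged sketch of Kendall's Tau, the agreement statistic reported
# in the WMT results: the normalized difference between concordant
# and discordant pairs. The score lists are hypothetical.

def kendall_tau(xs, ys):
    concordant = discordant = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if sign > 0:
                concordant += 1    # pair ranked the same way
            elif sign < 0:
                discordant += 1    # pair ranked oppositely
    return (concordant - discordant) / (n * (n - 1) / 2)

metric_scores = [0.1, 0.4, 0.3, 0.9]
human_ratings = [0.0, 0.5, 0.6, 1.0]
tau = kendall_tau(metric_scores, human_ratings)  # one discordant pair
```

A τ of 1 means the metric ranks every pair of outputs the same way humans do; this is why τ, rather than raw score agreement, is the headline number.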
  13. Analysis on Quality Drift •Simulate quality drift by data sub-sampling

    13 [table 4: agreement with human ratings on the WMT19 Metrics Shared Task; the metrics are Kendall Tau and WMT's Direct Assessment metrics divided by 100; all values reported for YiSi-1 SRL and ESIM fall within the percentage of the official WMT results] [figure 1: distribution of the human ratings in the train/validation and test datasets for skew factors 0, 0.5, 1.0, 1.5, 3.0] [figure: Kendall Tau with human ratings vs. test-set skew, comparing BLEURT with and without pre-training against BERTscore and BLEU, for train skews 0-3.0] (taken from the paper)
  14. Results: WebNLG •Applied WMT-tuned BLEURT •Compared w/ BLEURT-pre-wmt and BLEURT-pre

    14 [figure: Kendall Tau with human ratings on WebNLG (fluency / grammar / semantics), split by system vs. split by input, varying the data used for training and validation (0 to 5 of 9 systems, up to 2,424 records; 0 to 122 of 224 inputs, up to 2,689 records); metrics: BLEU, TER, Meteor, BERTscore, BLEURT -pre -wmt, BLEURT -wmt, BLEURT] (taken from the paper)
  15. Takeaways •Pre-training delivers consistent improvements, especially for BLEURT-base (smaller model).

    •Pre-training makes BLEURT significantly more robust to quality drift. •Thanks to pre-training, BLEURT can quickly adapt to new tasks. BLEURT fine-tuned twice provides acceptable results on all tasks without training data. 15 (taken from the paper)
  16. Ablation Test 16 •BERTScore, entailment, and back-translation scores yield gains

    •BLEU and ROUGE have a negative impact •Open question: should we remove them from pre-training signals? [figure 4: relative improvement/degradation (%) in Kendall Tau on WMT17 for BLEURT and BLEURTbase, pre-training on a single task (BERTscore, entail, backtrans, method_flag, BLEU, ROUGE; 0% = no pre-training) or on all tasks but one (0% = all pre-training tasks)] (taken from the paper)
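The quantity plotted in the ablation is the relative change in Kendall Tau with and without a given pre-training signal. A sketch with hypothetical τ values:

```python
# Hedged sketch of the ablation readout: relative improvement or
# degradation (%) of Kendall Tau when a pre-training signal is
# removed. The tau values below are made up for illustration.

def relative_change(tau_ablated, tau_full):
    """Relative improvement/degradation in percent."""
    return 100.0 * (tau_ablated - tau_full) / tau_full

# Removing a helpful signal lowers tau, giving a negative change:
delta = relative_change(tau_ablated=0.58, tau_full=0.62)
```

A positive value after removing a signal (as observed for BLEU and ROUGE) means the model does better without it, which is what motivates the slide's open question.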
  17. Conclusions •BLEURT can model human assessment accurately •Pre-training with several

    different signals •Robust to domain and quality drift, and data scarcity •Future work: multilingual NLG evaluation? 17
  18. My Impressions •BLEURT penalizes the following types of errors better than

    BERTScore: •Wrong negations •Wrong word replacements •Word salad •Model-based metrics are powerful but difficult to use in low-resourced languages… 18
  19. References (*some are not included in the paper) • Banerjee,

    S. et al.: METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments, Proc. ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65-72, 2005 [link] • Chen, Q. et al.: Enhanced LSTM for Natural Language Inference, Proc. ACL 2017, pp. 1657-1668, 2017 [link] • Doddington, G.: Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics, Proc. HLT’02, pp. 138-145, 2002 [link] 19
  20. References (*some are not listed in the paper) • Isozaki,

    H. et al.: Automatic Evaluation of Translation Quality for Distant Language Pairs, Proc. EMNLP 2010, pp. 944-952, 2010 [link] • Lin, C.-Y. et al.: Automatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics, Proc. HLT-NAACL 2003, pp. 71-78, 2003 [link] • Lo, C.-K.: YiSi - a Unified Semantic MT Quality Evaluation and Estimation Metric for Languages with Different Levels of Available Resources, Proc. WMT 2019, pp. 507-513, 2019 [link] • Mathur, N. et al.: Putting Evaluation in Context: Contextual Embeddings Improve Machine Translation Evaluation, Proc. ACL 2019, pp. 2799-2808, 2019 [link] 20
  21. References (*some are not listed in the paper) • Papineni,

    K. et al.: BLEU: a Method for Automatic Evaluation of Machine Translation, Proc. ACL 2002, pp. 311-318, 2002 [link] • Popović, M.: chrF: character n-gram F-score for automatic MT evaluation, Proc. WMT 2015, pp. 392-395, 2015 [link] • Snover, M. et al.: A Study of Translation Edit Rate with Targeted Human Annotation, Proc. AMTA 2006, pp. 223-231, 2006 [link (PDF)] • Shimanaka, H. et al.: RUSE: Regressor Using Sentence Embeddings for Automatic Machine Translation Evaluation, Proc. WMT 2018, pp. 751-758, 2018 [link] 21
  22. References (*some are not listed in the paper) • Shimanaka,

    H. et al.: Machine Translation Evaluation with BERT Regressor, arXiv preprint (cs.CL) 1907.12679, 2019 [link] • Stanojević, M. et al.: BEER: BEtter Evaluation as Ranking, Proc. WMT 2014, pp. 414-419, 2014 [link] • Zhang, T. et al.: BERTScore: Evaluating Text Generation with BERT, Proc. ICLR 2020, arXiv preprint (cs.CL) 1904.09675 [link] • Zhao, W. et al.: MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance, Proc. EMNLP-IJCNLP 2019, pp. 563-578, 2019 [link] 22