BLEURT: a learned evaluation metric
• Predicts a real-valued score for a given pair of hypothesis and corresponding reference (see the sketch below)
• A fine-tuned model based on BERT
• Pre-trained with signals from different metrics and tasks: BLEU, ROUGE, BERTScore, back-translation likelihood, NLI, back-translation flag
• Outperforms other metrics on WMT and WebNLG
• Robust to settings with little or no fine-tuning data
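The scoring model can be sketched in a few lines. Below is a minimal, hypothetical PyTorch/Hugging Face sketch (not the authors' released code): a BERT encoder consumes the reference and hypothesis as a single sentence pair, and a linear head on the [CLS] vector emits the real-valued score.

```python
# Minimal sketch of a BLEURT-style scorer; class and variable names are
# illustrative assumptions, not the paper's released implementation.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BleurtStyleScorer(nn.Module):
    def __init__(self, bert_name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        # A linear head on the [CLS] vector yields one real-valued score.
        self.head = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask, token_type_ids):
        out = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask,
                        token_type_ids=token_type_ids)
        cls = out.last_hidden_state[:, 0]  # [CLS] representation
        return self.head(cls).squeeze(-1)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BleurtStyleScorer()

# Reference and hypothesis are fed jointly as a sentence pair.
enc = tokenizer("the cat sat on the mat",      # reference
                "a cat was sitting on a mat",  # hypothesis
                return_tensors="pt")
score = model(**enc)  # untrained here; BLEURT pre-trains and fine-tunes this head
```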
Fully learned metrics
• Can be tuned to different properties
• Require human-labeled training data
• Sensitive to domain and quality drift
Hybrid metrics
• Combine trained elements (e.g., contextual embeddings) with handwritten logic
• Robustness: available with little or no training data; no i.i.d. assumption
BLEURT: an evaluation metric with a novel pre-training scheme
• Pipeline: pre-trained BERT model → multi-task pre-training on synthetic data (warm-up) → fine-tuning on human-labeled data → BLEURT model (a loss sketch follows below)
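An illustrative simplification of the two training objectives, assuming a scorer like the one sketched earlier. The paper mixes regression and classification losses over its pre-training signals and tunes per-task weights; here every signal is treated as squared-error regression and the `gammas` weights are placeholders.

```python
# Simplified sketch of the two-stage objective; all names are assumptions.
import torch

def pretrain_loss(preds: dict, targets: dict, gammas: dict) -> torch.Tensor:
    """Warm-up: weighted sum of per-task losses on synthetic pairs."""
    total = torch.zeros(())
    for task, pred in preds.items():
        # The paper uses cross-entropy for classification signals; squared
        # error everywhere is a simplification for this sketch.
        total = total + gammas[task] * torch.mean((pred - targets[task]) ** 2)
    return total

def finetune_loss(pred_scores: torch.Tensor,
                  human_ratings: torch.Tensor) -> torch.Tensor:
    """Fine-tuning: squared error against human ratings."""
    return torch.mean((pred_scores - human_ratings) ** 2)
```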
Requirements on the pre-training scheme:
1) The set of sentence pairs should be large and diverse enough to cover various NLG domains/tasks
2) Sentence pairs should contain various lexical, syntactic, and semantic errors
3) The pre-training objectives should effectively capture those errors
Synthetic sentence pairs (1.8M) are generated by perturbing Wikipedia sentences in three ways (a word-dropping sketch follows the list):
i. BERT mask-filling (single tokens or contiguous spans)
 • Up to 15 tokens are masked per sentence
 • Beam search (width = 8) to avoid token repetitions
ii. Back-translation (En → De → En) with a Transformer model
iii. Word dropping for 30% of the synthetic pairs
 • The number of dropped words is drawn uniformly, up to the sentence length
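Of the three perturbations, word dropping is the simplest to illustrate. The sketch below assumes whitespace tokenization (the paper's exact tokenization may differ) and applies the drop to 30% of pairs, with the number of dropped words drawn uniformly up to the sentence length.

```python
# Sketch of perturbation (iii), word dropping, under assumed whitespace
# tokenization.
import random

def drop_words(sentence: str, rng: random.Random) -> str:
    tokens = sentence.split()
    # Number of dropped words is drawn uniformly up to the sentence length.
    n_drop = rng.randint(1, len(tokens))
    kept = sorted(rng.sample(range(len(tokens)), len(tokens) - n_drop))
    return " ".join(tokens[i] for i in kept)

rng = random.Random(0)
reference = "the quick brown fox jumps over the lazy dog"
# The perturbation is applied to 30% of the synthetic pairs.
hypothesis = drop_words(reference, rng) if rng.random() < 0.30 else reference
```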
Textual entailment signal
• Whether the reference entails, is neutral to, or contradicts the corresponding pseudo-hypothesis
• The teacher signal (probabilities of entail/neutral/contradict) is given by an MNLI-fine-tuned BERT
Back-translation flag
• Whether the perturbation came from back-translation or from mask-filling
(see the sketch below)
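These two signals are cheap to compute once the synthetic pairs exist. The sketch below uses an illustrative MNLI-fine-tuned checkpoint from the Hugging Face hub for the entailment teacher (not necessarily the one used in the paper) and records the back-translation flag at generation time.

```python
# Sketch of the entailment and back-translation-flag signals. The MNLI
# checkpoint name is an assumed example; label order depends on the
# checkpoint's id2label config.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MNLI_NAME = "textattack/bert-base-uncased-MNLI"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(MNLI_NAME)
nli = AutoModelForSequenceClassification.from_pretrained(MNLI_NAME)

def entailment_signal(reference: str, pseudo_hypothesis: str) -> torch.Tensor:
    """Teacher probabilities over the three NLI classes."""
    enc = tok(reference, pseudo_hypothesis, return_tensors="pt")
    with torch.no_grad():
        logits = nli(**enc).logits
    return torch.softmax(logits, dim=-1).squeeze(0)

def backtrans_flag(perturbation: str) -> float:
    """1.0 if the pair came from back-translation, 0.0 for mask-filling."""
    return 1.0 if perturbation == "back_translation" else 0.0
```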
Results on the WMT19 Metrics Shared Task (taken from the paper):
• Table 4: agreement with human ratings; the metrics are Kendall Tau (τ) and WMT's Direct Assessment metrics divided by 100, and all values reported for YiSi-1 SRL and ESIM fall within the range of the official WMT results. BLEURT (τ / DA) across the seven to-English language pairs: 31.2/16.9, 31.7/36.3, 28.3/31.9, 39.5/44.6, 35.2/40.6, 28.3/22.3, 42.7/42.4.
• Figure 1: distribution of the human ratings in the train/validation and test datasets for different skew factors (0, 0.5, 1.0, 1.5, 3.0).
• Figure: Kendall Tau with human ratings vs. test-set skew for BLEURT with and without pre-training, compared to BERTscore and BLEU, for training skews 0, 0.5, 1.0, 1.5, 3.0.
WebNLG results (taken from the paper):
• Figure: Kendall Tau with human ratings on WebNLG (fluency, grammar, semantics) vs. the amount of data used for training and validation, under two splits: by system (0, 2, 3, 5 of 9 systems; 0, 1,174, 1,317, 2,424 records) and by input (0, 38, 66, 122 of 224 inputs; 0, 836, 1,445, 2,689 records).
• Metrics compared: BLEU, TER, Meteor, BERTscore, BLEURT −pre −wmt, BLEURT −wmt, BLEURT.
• Pre-training makes BLEURT significantly more robust to quality drift.
• Thanks to pre-training, BLEURT can quickly adapt to new tasks: fine-tuned twice (on WMT ratings, then on WebNLG), it provides acceptable results on all tasks even without in-domain training data.
(taken from the paper)
• BLEU and ROUGE have a negative impact on pre-training.
• Open question: should they be removed from the pre-training signals?
Figure 4 (taken from the paper): relative improvement/degradation (%) in Kendall Tau on WMT17, for BLEURT and BLEURTbase, when pre-training on a single task (vs. no pre-training) and when removing one task from the full set, for the signals BERTscore, entail, backtrans, method_flag, BLEU, and ROUGE.
• Banerjee, S. et al.: METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments, Proc. ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65-72, 2005 [link]
• Chen, Q. et al.: Enhanced LSTM for Natural Language Inference, Proc. ACL 2017, pp. 1657-1668, 2017 [link]
• Doddington, G.: Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics, Proc. HLT'02, pp. 138-145, 2002 [link]
• Isozaki, H. et al.: Automatic Evaluation of Translation Quality for Distant Language Pairs, Proc. EMNLP 2010, pp. 944-952, 2010 [link]
• Lin, C.-Y. et al.: Automatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics, Proc. HLT-NAACL 2003, pp. 71-78, 2003 [link]
• Lo, C.-K.: YiSi - a Unified Semantic MT Quality Evaluation and Estimation Metric for Languages with Different Levels of Available Resources, Proc. WMT 2019, pp. 507-513, 2019 [link]
• Mathur, N. et al.: Putting Evaluation in Context: Contextual Embeddings Improve Machine Translation Evaluation, Proc. ACL 2019, pp. 2799-2808, 2019 [link]
• Papineni, K. et al.: BLEU: a Method for Automatic Evaluation of Machine Translation, Proc. ACL 2002, pp. 311-318, 2002 [link]
• Popović, M.: CHRF: character n-gram F-score for automatic MT evaluation, Proc. WMT 2015, pp. 392-395, 2015 [link]
• Snover, M. et al.: A Study of Translation Edit Rate with Targeted Human Annotation, Proc. AMTA 2006, pp. 223-231, 2006 [link (PDF)]
• Shimanaka, H. et al.: RUSE: Regressor Using Sentence Embeddings for Automatic Machine Translation Evaluation, Proc. WMT 2018, pp. 751-758, 2018 [link]
• Shimanaka, H. et al.: Machine Translation Evaluation with BERT Regressor, arXiv preprint (cs.CL) 1907.12679, 2019 [link]
• Stanojević, M. et al.: BEER: BEtter Evaluation as Ranking, Proc. WMT 2014, pp. 414-419, 2014 [link]
• Zhang, T. et al.: BERTScore: Evaluating Text Generation with BERT, Proc. ICLR 2020, arXiv preprint (cs.CL) 1904.09675 [link]
• Zhao, W. et al.: MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance, Proc. EMNLP-IJCNLP 2019, pp. 563-578, 2019 [link]