K. Sinha et al., ACL 2020, https://arxiv.org/abs/2005.00583
[2] Beyond User Self-Reported Likert Scale Ratings: A Comparison Model for Automatic Dialog Evaluation, W. Liang et al., ACL 2020, https://arxiv.org/abs/2005.10716
[3] Evaluating Dialogue Generation Systems via Response Selection, S. Sato et al., ACL 2020, https://arxiv.org/abs/2004.14302
[4] Speaker Sensitive Response Evaluation Model, J. Bak et al., ACL 2020, https://arxiv.org/abs/2006.07015
[5] USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation, S. Mehri et al., ACL 2020, https://arxiv.org/abs/2005.00456
[6] Designing Precise and Robust Dialogue Response Evaluators, T. Zhao et al., ACL 2020, https://arxiv.org/abs/2004.04908
[7] uBLEU: Uncertainty-Aware Automatic Evaluation Method for Open-Domain Dialogue Systems, Y. Tsuta et al., ACL 2020, https://www.aclweb.org/anthology/2020.acl-srw.27/
[8] Towards Holistic and Automatic Evaluation of Open-Domain Dialogue Generation, B. Pang et al., ACL 2020, https://www.aclweb.org/anthology/2020.acl-main.333/
[9] RUBER: An Unsupervised Method for Automatic Evaluation of Open-Domain Dialog Systems, C. Tao et al., AAAI 2018, https://arxiv.org/abs/1701.03079