Slide 1

Towards Unified Dialogue System Evaluation: A Comprehensive Analysis of Current Evaluation Protocols
Sarah E. Finch and Jinho D. Choi

Slide 2

Dialogue Systems
● Task-Oriented Dialogue System
● Chat-Oriented Dialogue System

Slide 3

Dialogue Systems
● Task-Oriented Dialogue System
○ Specific goal
○ E.g. restaurant booking, movie recommendation, ...
● Chat-Oriented Dialogue System

Slide 4

Dialogue Systems
● Task-Oriented Dialogue System, measured by Task Efficiency
○ Specific goal
○ E.g. restaurant booking, movie recommendation, ...
● Chat-Oriented Dialogue System

Slide 5

Dialogue Systems
● Task-Oriented Dialogue System, measured by Task Efficiency
○ Specific goal
○ E.g. restaurant booking, movie recommendation, ...
● Chat-Oriented Dialogue System
○ Social companion
○ Open-domain conversation

Slide 6

Dialogue Systems
● Task-Oriented Dialogue System, measured by Task Efficiency
○ Specific goal
○ E.g. restaurant booking, movie recommendation, ...
● Chat-Oriented Dialogue System, measured by ???
○ Social companion
○ Open-domain conversation

Slide 7

Overview
1. Common evaluation protocols for chat-oriented dialogue systems
2. Analysis of variability in evaluation protocols
3. Case study on human evaluation using Alexa Prize 2019 data

Slide 8

Procedure
● 20 non-task-oriented dialogue system papers (2018 - 2020)
● Variety of dialogue approaches:
○ Knowledge bases
○ Personality
○ Emotional responses
○ No external information source

Slide 9

Chat-oriented Evaluation Protocols
● Automated (AUT): 17 works
● Human
○ Static (STA): 16 works
○ Interactive (INT): 2 works


Slide 11

Automated Metric Usage across Works

Slide 12

Automated Metric Usage across Works
● Response Similarity
● Coherence
● Diversity
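The Diversity category on this slide is commonly instantiated in the literature as a distinct-n score: the ratio of unique n-grams to total n-grams over a system's responses. The slides do not give a formula, so this is a minimal sketch of that common metric, not the exact computation used in any of the surveyed works:

```python
def distinct_n(responses, n=2):
    """Distinct-n diversity: unique n-grams / total n-grams over all responses."""
    ngrams = []
    for response in responses:
        tokens = response.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    # A system that repeats itself has few unique n-grams, so the score drops.
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0
```

For example, two identical responses "i do not know" yield 6 bigram tokens but only 3 unique bigrams, giving a distinct-2 of 0.5.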


Slide 14

Human Evaluation Sub-types
● Static Evaluation
○ Offline rating of system responses given a static dialogue context
● Interactive Evaluation
○ Online interaction with the system, where a rating is provided by the user at the conclusion of the conversation

Slide 15

Human Evaluation Metrics
● Rating
○ Numerical rating of dialogue on specific characteristics
○ Used in 14 works
● Preference Selection
○ Given a set of responses, select the best one for some specific characteristic
○ Used in 4 works

Slide 16

Dimensions of Human Evaluation

Slide 17

Dimensions of Human Evaluation

Slide 18

Dimension Groupings: Grammaticality
● Fluency
○ Whether the response from the listener is understandable (Lin et al., 2019)
○ Whether the response is fluent and natural (Li et al., 2019)
○ Whether each sentence has correct grammar (Luo et al., 2018)
○ Fluency measures if the produced response itself is fluent (Wu et al., 2019)
● Consistency
○ Whether the reply is fluent and grammatical (Li and Sun, 2018)
● Readability
○ Whether the utterance is grammatically formed (Qiu et al., 2019)
● Grammaticality
○ Whether the response is fluent and grammatical (Zhu et al., 2019)


Slide 20

Dimension Groupings: All
● Grammaticality: Fluency [3,10,11,15], Consistency [1], Readability [13], Grammaticality [18]
● Informativeness: Informativeness [7,14,15,18], Specificity [4,19], Diversity [13]
● Relevance: Relevance [4,11,13], Appropriateness [7], Coherence [3,15], Context Coherence [10], Logic [1], Sensibleness [19]
● Emotional Understanding: Emotion [1], Empathy [11]
● Overall Quality: Quality [14,17], Humanness [4]

Key: 1: Li and Sun (2018); 2: Liu et al. (2018); 3: Luo et al. (2018); 4: Moghe et al. (2018); 5: Parthasarathi and Pineau (2018); 6: Xu et al. (2018); 7: Young et al. (2018); 8: Zhang et al. (2018); 9: Du and Black (2019); 10: Li et al. (2019); 11: Lin et al. (2019); 12: Madotto et al. (2019); 13: Qiu et al. (2019); 14: Tian et al. (2019); 15: Wu et al. (2019); 16: Zhang et al. (2019); 17: Zhou et al. (2019); 18: Zhu et al. (2019); 19: Adiwardana et al. (2020); 20: Wang et al. (2020)

Slide 21

Non-grouped Dimensions

Slide 22

Final Dialogue Dimensions

Slide 23

Case Study: Alexa Prize 2019
● Relationship between dialogue dimensions and conversation quality
● 100 rated conversations from the Alexa Prize
● Overall Quality rating from the user on their conversation
● Ratings on the 8 dialogue dimensions by an expert evaluator
● Interannotator agreement on Overall Quality
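The slide reports interannotator agreement on Overall Quality but does not say which coefficient was used; a standard choice for two raters (here, hypothetically, the user and the expert evaluator) is Cohen's kappa, sketched below:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled identically.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independence, from each rater's label marginals.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[l] * counts_b[l]
                   for l in set(rater_a) | set(rater_b)) / (n * n)
    if expected == 1:  # degenerate case: both raters always use the same label
        return 1.0
    return (observed - expected) / (1 - expected)
```

For instance, ratings [1, 2, 3, 1] vs. [1, 2, 3, 2] agree on 3 of 4 items (0.75 observed) with 5/16 expected agreement by chance, giving kappa of 7/11, roughly 0.64.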

Slide 24

Positive Relations: Relevance, Proactivity, Informativeness, Engagingness
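A positive relation between a dimension's ratings and Overall Quality is typically quantified with a rank correlation; the slides do not specify the statistic, so this sketch uses a simple Spearman rho that assumes no tied ratings (real rating data usually has ties, which need average-rank handling):

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation (simplified: assumes no tied values)."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    # Classic formula: 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

Applied per dimension, rho near +1 would indicate that conversations rated higher on that dimension also tend to receive higher Overall Quality ratings.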

Slide 25

Emotional Understanding and Consistency
● Unclear relationship
● Little variation on these dimensions for the analyzed conversational bot

Slide 26

Grammaticality
● Inverse relationship
● Highly rated conversations tend to be longer
○ Greater likelihood of revealing grammatical mistakes

Slide 27

Conclusion
● 3 main types of chat-oriented dialogue evaluation
○ Automated, Static Human, Interactive Human
● For human evaluation, Static Evaluation dominates recent works
● Observed a set of 8 dialogue dimensions for human evaluation common to recent works
○ Relevance, Proactivity, Informativeness, and Engagingness may have a positive correlation with the Overall Quality of dialogues

Slide 28

Thank you! Questions?

Acknowledgment: We gratefully acknowledge the support of the Alexa Prize Socialbot Grand Challenge 3. Any contents in this material are those of the authors and do not necessarily reflect the views of the Alexa Prize.

Slide 29

References
1. Jingyuan Li and Xiao Sun. 2018. A Syntactically Constrained Bidirectional-Asynchronous Approach for Emotional Conversation Generation. In EMNLP.
2. Shuman Liu, Hongshen Chen, Zhaochun Ren, Yang Feng, Qun Liu, and Dawei Yin. 2018. Knowledge Diffusion for Neural Dialogue Generation. In ACL.
3. Liangchen Luo, Jingjing Xu, Junyang Lin, Qi Zeng, and Xu Sun. 2018. An Auto-Encoder Matching Model for Learning Utterance-Level Semantic Dependency in Dialogue Generation. In EMNLP.
4. Nikita Moghe, Siddhartha Arora, Suman Banerjee, and Mitesh M. Khapra. 2018. Towards Exploiting Background Knowledge for Building Conversation Systems. In EMNLP.
5. Prasanna Parthasarathi and Joelle Pineau. 2018. Extending Neural Generative Conversational Model using External Knowledge Sources. In EMNLP.
6. Xinnuo Xu, Ondřej Dušek, Ioannis Konstas, and Verena Rieser. 2018. Better Conversations by Modeling, Filtering, and Optimizing for Coherence and Diversity. In EMNLP.
7. Tom Young, Erik Cambria, Iti Chaturvedi, Hao Zhou, Subham Biswas, and Minlie Huang. 2018. Augmenting End-to-End Dialogue Systems With Commonsense Knowledge. In AAAI.
8. Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing Dialogue Agents: I have a dog, do you have pets too? In ACL.
9. Wenchao Du and Alan W Black. 2019. Boosting Dialog Response Generation. In ACL.
10. Zekang Li, Cheng Niu, Fandong Meng, Yang Feng, Qian Li, and Jie Zhou. 2019. Incremental Transformer with Deliberation Decoder for Document Grounded Conversations. In ACL.
11. Zhaojiang Lin, Andrea Madotto, Jamin Shin, Peng Xu, and Pascale Fung. 2019. MoEL: Mixture of Empathetic Listeners. In EMNLP-IJCNLP.
12. Andrea Madotto, Zhaojiang Lin, Chien-Sheng Wu, and Pascale Fung. 2019. Personalizing Dialogue Agents via Meta-Learning. In ACL.
13. Lisong Qiu, Juntao Li, Wei Bi, Dongyan Zhao, and Rui Yan. 2019. Are Training Samples Correlated? Learning to Generate Dialogue Responses with Multiple References. In ACL.
14. Zhiliang Tian, Wei Bi, Xiaopeng Li, and Nevin L. Zhang. 2019. Learning to Abstract for Memory-augmented Conversational Response Generation. In ACL.
15. Wenquan Wu, Zhen Guo, Xiangyang Zhou, Hua Wu, Xiyuan Zhang, Rongzhong Lian, and Haifeng Wang. 2019. Proactive Human-Machine Conversation with Explicit Conversation Goal. In ACL.
16. Hainan Zhang, Yanyan Lan, Liang Pang, Jiafeng Guo, and Xueqi Cheng. 2019. ReCoSa: Detecting the Relevant Contexts with Self-Attention for Multi-turn Dialogue Generation. In ACL.
17. Kun Zhou, Kai Zhang, Yu Wu, Shujie Liu, and Jingsong Yu. 2019. Unsupervised Context Rewriting for Open Domain Conversation. In EMNLP-IJCNLP.
18. Qingfu Zhu, Lei Cui, Wei-Nan Zhang, Furu Wei, and Ting Liu. 2019. Retrieval-Enhanced Adversarial Training for Neural Response Generation. In ACL.
19. Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, and Quoc V. Le. 2020. Towards a Human-like Open-Domain Chatbot. In arXiv.
20. Jian Wang, Junhao Liu, Wei Bi, Xiaojiang Liu, Kejing He, Ruifeng Xu, and Min Yang. 2020. Improving Knowledge-aware Dialogue Generation via Knowledge Base Question Answering. In arXiv.