Slide 1

Confusion Detection in Conversation, and the Development of a Multimodal Conversational AI Platform
Mao Saeki
Waseda University / Equmenopolis Inc.

Slide 2

2 About Me: Mao Saeki
- PhD candidate, Kobayashi Laboratory, Waseda University
- Research Scientist, Equmenopolis Inc.
- Interests: dialogue systems, especially the understanding and generation of paralinguistic information

Slide 3

Confusion Detection for Adaptive Conversational Strategies of An Oral Proficiency Assessment Interview Agent Mao Saeki[1], Kotoka Miyagi[1], Shinya Fujie[2], Shungo Suzuki[1], Tetsuji Ogawa[1], Tetsunori Kobayashi[1], Yoichi Matsuyama[1] [1]Waseda University, Japan [2]Chiba Institute of Technology, Japan

Slide 4

4 Introduction: Why confusion detection?
- In a conversation, listeners may become confused in situations such as:
  - Failing to hear due to noise
  - Not knowing a word or a concept
- Unless resolved, the listener may become uncomfortable, or the conversation may break down

Slide 5

5 Introduction: Detecting the cause of confusion
- Causes of confusion are various
- Knowing the cause can lead to more precise assistance:
  - Couldn't hear → repeat
  - Don't know a word → rephrase
  - Don't know a concept → give an example

Slide 6

6 Objectives
- Research Objective
  - Automated detection of confusion
- Research Questions
  - What are the signs of confusion?
  - Is it possible to predict the cause of confusion?

Slide 7

7 Dialog Setting: English assessment interview dialog
- Conversation between a Japanese English learner and a virtual agent, for assessing speaking proficiency
- The interview is conducted online

Slide 8

8 Confusion: Data collection design
- Confusion is fatal when it happens, but it is rare
- We artificially elicit confusion following a procedure proposed by Cumbal et al. [1]
- The procedure is extended with three additional manipulations, based on Levelt's model of speech comprehension [2]

[1] R. Cumbal et al., "Detection of Listener Uncertainty in Robot-Led Second Language Conversation Practice," in ICMI 2020.
[2] W. J. M. Levelt, "Speaking: From Intention to Articulation," The MIT Press, 1993.

Slide 9

9 Data collection design
Manipulations (procedures shown in the slide table):
- Mixing non-existing words
- Increasing grammatical complexity

Slide 10

10 Data collection results
- Participants: 47 Japanese English learners
- Average interview duration: 6 minutes
- Confused data samples: 155
- Not-confused data samples: 372

Slide 11

11 Analysis on the cause of confusion
[Figure: a confused data sample on the system/user timeline ("Can you tell me …", with 2 s and 5 s marks), alongside a predicted-vs-true matrix over the four confusion causes ①–④]

Slide 12

12 Signs of confusion

Sign of confusion | Feature extraction method
- Increased blinking | activity of action unit (AU) 45
- Averting gaze from screen | absolute distance between the current gaze direction and the screen
- Rapid head movement | rotation angle of the head relative to the screen
- Rapid eye movement | absolute distance between the current gaze direction and the screen
- Moving the face towards the screen | head rotation, and horizontal distance between the screen and the head
- Silence | absence of user utterance, detected with VAD
- Self-talk | relative loudness (the current user loudness divided by the mean loudness of all previous utterances), and head rotation

Voice activity, relative loudness, AU 45 intensity, gaze distance from the screen, head rotation, and head distance from the screen are extracted every 40 ms.
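The relative-loudness feature and the per-frame feature vector described on this slide can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the ordering of features in the vector are hypothetical, not taken from the authors' implementation.

```python
import numpy as np

FRAME_MS = 40  # features are extracted every 40 ms


def relative_loudness(current_rms: float, previous_rms: list) -> float:
    """Current user loudness divided by the mean loudness of all
    previous utterances (the slide's definition of relative loudness).
    Falls back to 1.0 when there is no history yet (an assumption)."""
    if not previous_rms:
        return 1.0
    return float(current_rms / np.mean(previous_rms))


def frame_features(voice_active: bool, rel_loudness: float, au45: float,
                   gaze_dist: float, head_rot: float, head_dist: float) -> np.ndarray:
    """One 40 ms feature vector: voice activity, relative loudness,
    AU 45 intensity, gaze distance from the screen, head rotation,
    and head distance from the screen."""
    return np.array([float(voice_active), rel_loudness, au45,
                     gaze_dist, head_rot, head_dist], dtype=np.float32)
```

A sequence of such vectors, one per 40 ms frame, would then form the input sequence to the detection model.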

Slide 13

13 Confusion detection results
- Model: LSTM
- Majority baseline accuracy: 0.706
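The majority baseline follows directly from the class counts on the data-collection slide (155 confused vs. 372 not-confused samples): always predicting the majority class gives 372 / 527 ≈ 0.706. A quick check:

```python
# Majority-class baseline from the collected sample counts.
confused = 155      # confused data samples
not_confused = 372  # not-confused data samples

total = confused + not_confused          # 527
baseline = not_confused / total          # always predict "not confused"
print(round(baseline, 3))                # 0.706
```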

Slide 14

14 Adaptive interview scenario

Slide 15

15 Conclusion
- Goal
  - Detection of confusion by identifying multimodal signs
- Contributions
  - Proposed a data collection method to elicit confusion in different steps of speech processing
  - Showed the difficulty of predicting the cause of confusion using only user video
  - Identified 7 multimodal signs of confusion and conducted an ablation study to understand their importance

Slide 16

The InteLLA Dialogue System, and the Development of a Multimodal Conversational AI Platform

Slide 17

No content

Slide 18

18 Challenges of virtual agents
Virtual agents must…
- Know when to speak (or not to speak)
- Understand nonverbal signs
- Produce nonverbal signs
Not just a text chat with a voice interface! While listening, people make noises and gestures, and will interrupt!

[Example exchange: "I recently watched …" / "Oh!" (nod) / (laughter) / "Did you watch…"]

Slide 19

19 Layered model of conversational processing and protocols
(Matsuyama 2015, Multiparty Conversation Facilitation Robots)

Slide 20

21 Comparing human and automated interviews
Participants rated the human interview and the automated interview on a 5-point agreement scale (strongly disagree to strongly agree, shown as 0–5 in the chart):
- I was able to demonstrate my English language ability to the full extent
- The agent was friendly
- The agent was listening carefully to my speech
- The agent was respectful of me
- The agent's accent and intonation were natural
- The agent's speech rate was appropriate
- The agent's gestures were natural
- The conversational flow was natural
- Turn taking was natural
Identified key factors using backward stepwise regression.
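Backward stepwise regression, used here to identify the key questionnaire factors, can be sketched as below. This is a toy NumPy implementation under stated assumptions: the function name and the |t| ≥ 2 stopping rule (roughly p < 0.05) are illustrative choices, not the authors' actual procedure.

```python
import numpy as np


def backward_stepwise(X, y, names, t_min=2.0):
    """Toy backward elimination: fit least squares, drop the predictor
    with the smallest |t|-statistic, and repeat until every remaining
    predictor satisfies |t| >= t_min."""
    keep = list(range(X.shape[1]))
    while keep:
        A = np.column_stack([np.ones(len(y)), X[:, keep]])  # add intercept
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        dof = len(y) - A.shape[1]
        sigma2 = resid @ resid / dof                       # residual variance
        cov = sigma2 * np.linalg.inv(A.T @ A)              # coefficient covariance
        t = beta[1:] / np.sqrt(np.diag(cov)[1:])           # skip the intercept
        worst = int(np.argmin(np.abs(t)))
        if abs(t[worst]) >= t_min:
            break                                          # all predictors significant
        keep.pop(worst)
    return [names[i] for i in keep]
```

Applied to the questionnaire items as predictors of overall interview ratings, this kind of elimination leaves only the items that carry independent explanatory weight.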

Slide 21

Thank you for listening!