Assessment Interview Agent
Mao Saeki, Kotoka Miyagi, Shinya Fujie, Shungo Suzuki, Tetsuji Ogawa, Tetsunori Kobayashi, Yoichi Matsuyama
Waseda University, Japan
Chiba Institute of Technology, Japan
may become confused under situations such as …
  - Failing to hear due to noise
  - Not knowing a word or a concept
- Unless resolved, the listener may become uncomfortable, or the conversation may break down
of Listener Uncertainty in Robot-Led Second Language Conversation Practice,” in ICMI 2020
W. J. M. Levelt, “Speaking: From Intention to Articulation.” The MIT Press, 26-Aug-1993.
- Confusion is fatal when it happens, but it is rare
- We artificially elicit confusion based on a procedure proposed by Cumbal et al.
- The procedure is expanded with 3 additional manipulations, based on Levelt’s comprehension process of speech
Signs of confusion and the features used to capture them:
- Increased blinking: activity of action unit (AU) 45
- Averting gaze from screen: absolute distance between the current gaze direction and the screen
- Rapid head movement: rotation angle of the head relative to the screen
- Rapid eye movement: absolute distance between the current gaze direction and the screen
- Moving the face towards screen: head rotation and horizontal distance between the screen and head
- Silence: absence of user utterance, detected using VAD
- Self-talk: relative loudness (the current user loudness divided by the mean loudness of all previous utterances) and head rotation
Features: voice activity, relative loudness, AU 45 intensity, gaze distance from the screen, head rotation, and head distance from the screen, extracted every 40 ms
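The relative loudness feature used for self-talk can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes loudness is available as one non-negative number per utterance, and the function name `relative_loudness` and the fallback value of 1.0 (used when there is no earlier utterance to compare against) are assumptions introduced here.

```python
from statistics import mean


def relative_loudness(current, previous):
    """Divide the current loudness by the mean loudness of all previous utterances."""
    if not previous:
        # Assumption: with no history, treat the utterance as "normal" loudness.
        return 1.0
    baseline = mean(previous)
    if baseline == 0:
        # Assumption: avoid dividing by zero when all earlier utterances were silent.
        return 1.0
    return current / baseline


# Hypothetical loudness values for three earlier utterances.
history = [60.0, 64.0, 62.0]
print(relative_loudness(31.0, history))  # 0.5, i.e. half as loud as usual
```

A value well below 1 means the user is speaking more quietly than their own baseline, which is the cue the slide associates with self-talk.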
Conclusion
signs
- Contributions
  - Proposed a data collection method to elicit confusion in different steps of speech processing
  - Showed the difficulty of predicting the cause of confusion using only user video
  - Identified 7 multimodal signs of confusion and conducted an ablation study to understand their importance
when to speak (or not to speak)
- Understand nonverbal signs
- Produce nonverbal signs
Not just a text chat with a voice interface!
I recently watched …
Oh! (Nod) (laughter)
Did you watch…
While listening, people make noises and gestures, and will interrupt!
[Figure: questionnaire ratings for the human interview and the automated interview, on a scale from "Strongly disagree" to "Strongly agree"]
Questionnaire items:
- I was able to demonstrate my English language ability to the full extent
- The agent was friendly
- The agent was listening carefully to my speech
- The agent was respectful of me
- The agent’s accent and intonation was natural
- The agent’s speech rate was appropriate
- The agent’s gestures were natural
- The conversational flow was natural
- Turn taking was natural
Identified key factors using backward stepwise regression