December 05, 2024

Is Corpus Suitable for Human Perception?: Quality Assessment of Voice Response Timing in Conversational Corpus through Timing Replacement

APSIPA ASC 2024


sadahry


Transcript

  1. Is Corpus Suitable for Human Perception?: Quality Assessment of Voice

    Response Timing in Conversational Corpus through Timing Replacement Sadahiro Yoshikawa Japan Advanced Institute of Science and Technology Equmenopolis, Inc. APSIPA 2024
  2. Importance of Voice Response Timing Voice response timing changes

    the meaning of an utterance *3. *3 Felicia Roberts and Alexander L. Francis. 2013. Identifying a temporal threshold of tolerance for silent gaps after requests. The Journal of the Acoustical Society of America 133(6):EL471–EL477. e.g.,
  3. Timing Estimation Models Timing estimation models have typically been trained

    on human response timings *1, *2. *1 Roddy, Matthew, and Naomi Harte. ‘Neural Generation of Dialogue Response Timings’. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2442–52. Online: Association for Computational Linguistics, 2020. *2 Sakuma, Jin, Shinya Fujie, and Tetsunori Kobayashi. ‘Response Timing Estimation for Spoken Dialog Systems Based on Syntactic Completeness Prediction’. In 2022 IEEE Spoken Language Technology Workshop (SLT), 369–74, 2023. *1 frame interval: 50ms *2
  4. However, human response timings are not always appropriate… In a

    previous experiment*1, the number of speakers with p > 0.5 was equal between the actual response timing and the response timing replaced with the mode value in the corpus. *1 Roddy, Matthew, and Naomi Harte. ‘Neural Generation of Dialogue Response Timings’. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2442–52. Online: Association for Computational Linguistics, 2020.
  5. However, human response timings are not always appropriate… Some responses

    replaced with a fixed value sounded more realistic than the actual voice. → Does learning from human conversations truly lead to “appropriate timing for humans”? → Shouldn’t we prioritize whether it sounds natural/realistic to humans over corpus data?
  6. To prioritize whether it sounds natural/realistic to humans over corpus

    data: 1. Identify which response timing replacements impact human perception 2. Explore the feasibility of the prediction model for dialogue systems Purpose of Our Research
  7. 1. Identify which response timing replacements impact human perception →

    Conduct a Listening Test using corpus data → Define a human perception score and analyze it 2. Explore the feasibility of the prediction model for dialogue systems → Formulate an evaluation metric for predicting the score → Build a baseline ML model to predict the score Procedure of Our Research
  8. 1. Researchers collected corpus data and gathered 17 annotators per

    speaker 2. Annotators compared Actual (actual responses) and Replaced (mode value in the corpus) audio samples 3. Annotators listened to them and answered a question: “Which response timing sounds like it was produced in a real conversation?”*1 Procedure of Listening Test *1 Roddy, Matthew, and Naomi Harte. ‘Neural Generation of Dialogue Response Timings’. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2442–52. Online: Association for Computational Linguistics, 2020.
  9. Realness Score (RS) is defined for statistical analysis and ML

    model prediction. • If annotators vote for Actual, then +1 • If annotators vote for Replaced, then −1 • Researchers calculate the mean of the +1/−1 votes as the Realness Score (RS) Realness Score (RS)
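The voting scheme above can be sketched in code. This is a minimal illustration, not the authors' implementation; the function name and the string encoding of votes are our assumptions.

```python
# Hypothetical sketch of the Realness Score (RS): each of the 17 annotators
# votes "actual" (+1) or "replaced" (-1), and RS is the mean of those votes,
# so RS ranges from -1 (all prefer Replaced) to +1 (all prefer Actual).

def realness_score(votes):
    """votes: list of strings, each either "actual" or "replaced"."""
    signed = [1 if v == "actual" else -1 for v in votes]
    return sum(signed) / len(signed)

# Example: 12 of 17 annotators prefer the actual timing.
votes = ["actual"] * 12 + ["replaced"] * 5
print(round(realness_score(votes), 3))  # (12 - 5) / 17 ≈ 0.412
```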
  10. We used a corpus*1 which has a wide variety of speech

    samples for each speaker. • Japanese spoken language in face-to-face conversations with two participants • This corpus records 6 hours per speaker ◦ 1 hour each for 6 interlocutors ◦ 3 topics for each interlocutor i. self-introductions ii. introduction of positive and negative experiences iii. introduction of self-shortcomings • Casual & friendly conversation Corpus Data for Listening Test *1 Hayashi, Takato, Candy Olivia Mawalim, Ryo Ishii, Akira Morikawa, Atsushi Fukayama, Takao Nakamura and Shogo Okada. “A Ranking Model for Evaluation of Conversation Partners Based on Rapport Levels.” IEEE Access 11 (2023): 73024–73035.
  11. We defined “Response” as “when the speaker’s voice is shorter

    than that of the interlocutor in a turn-change” without restricting the content. Definition of Response Target Voice Sample
  12. Collection of Responses for Listening Test A total of 1,720

    responses were collected. • Select 8 speakers from this corpus for the listening test • Extract response timings from all dialogues of the 8 speakers*1 ◦ The mode value at that time was 0 ms • Responses were divided into 3 categories for each speaker*2 ◦ Early: below the 30th percentile ◦ Late: above the 70th percentile ◦ Medium: others • 192 responses per speaker were extracted ◦ 64 responses per category • Some responses were excluded ◦ due to operation errors during the experiment *1 *2
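The per-speaker categorization above can be sketched as follows. The nearest-rank percentile rule is our assumption; the slides do not specify the exact percentile method.

```python
# Hypothetical sketch of the Early/Medium/Late split: per speaker, timings
# below the 30th percentile are "Early", above the 70th are "Late", the
# rest "Medium". Negative timings mean the response overlaps the interlocutor.

def categorize(timings_ms):
    s = sorted(timings_ms)
    p30 = s[int(0.30 * (len(s) - 1))]   # 30th-percentile timing (nearest rank)
    p70 = s[int(0.70 * (len(s) - 1))]   # 70th-percentile timing (nearest rank)
    return ["Early" if t < p30 else "Late" if t > p70 else "Medium"
            for t in timings_ms]

# Example timings in ms for one speaker.
print(categorize([-800, -200, 0, 100, 300, 700]))
# → ['Early', 'Medium', 'Medium', 'Medium', 'Late', 'Late']
```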
  13. → For responses from −1 seconds to 0 seconds, it

    is more appropriate to replace the timing with 0 seconds. → For responses between 0 seconds and 400 milliseconds, it is better to leave the timing unchanged than to replace it with 0 seconds. Stats of Realness Score: Trends in Response Timing Distribution * Weighted score: a value obtained by dividing the response timings into 200 ms bins and multiplying the mean RS within each bin by the number of responses in that bin.
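The weighted score defined in the footnote can be sketched as below. This is a minimal illustration; the floor-based bucketing and bin labels are our assumptions about how the 200 ms intervals are laid out.

```python
# Hypothetical sketch of the weighted score: bucket response timings into
# 200 ms bins, then score each bin as (mean RS in bin) * (number of
# responses in bin). Bins are keyed by their lower edge in ms.
import math
from collections import defaultdict

def weighted_scores(timings_ms, rs_values, bin_ms=200):
    bins = defaultdict(list)
    for t, rs in zip(timings_ms, rs_values):
        bins[math.floor(t / bin_ms) * bin_ms].append(rs)
    return {b: (sum(v) / len(v)) * len(v) for b, v in sorted(bins.items())}

# Example: one response in the [-200, 0) ms bin, three in [0, 200) ms.
print(weighted_scores([-100, 50, 120, 180], [-0.4, 0.6, 0.2, 1.0]))
```

Note that mean times count reduces to the sum of RS in the bin, so the weighted score emphasizes bins that both lean one way and contain many responses.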
  14. We conducted a chi-square test for each response independently. We

    tested whether there were statistically significant biases (referred to as SSPs) in how readily annotators chose Actual or Replaced for a response. The test results also showed a trend similar to the above slide: • In Early, from −1 second to 0 seconds, there are many responses with an SSP in Replaced • Among responses from 0 seconds to 400 milliseconds, there are many responses with an SSP in Actual • For responses faster than 0 seconds in Medium, there is no response with an SSP in Actual • In Late, some responses also show an SSP in Replaced Stats of Realness Score: Statistical Test
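The per-response test can be sketched as follows. The slides name a chi-square test; the goodness-of-fit form against a 50/50 expectation and the 0.05 significance threshold are our assumptions.

```python
# Hypothetical sketch of the per-response SSP test: compare the observed
# Actual/Replaced vote counts against a uniform 50/50 expectation (1 degree
# of freedom). For df = 1 the chi-square survival function is exactly
# erfc(sqrt(x / 2)), so no stats library is needed.
import math

def chi_square_ssp(n_actual, n_replaced, alpha=0.05):
    n = n_actual + n_replaced
    expected = n / 2
    chi2 = ((n_actual - expected) ** 2 + (n_replaced - expected) ** 2) / expected
    p = math.erfc(math.sqrt(chi2 / 2))  # chi2(1) survival function
    return chi2, p, p < alpha           # SSP flagged when p < alpha

# Example: 15 of 17 annotators choose Actual -> significant bias.
chi2, p, ssp = chi_square_ssp(15, 2)
print(f"chi2={chi2:.2f}, p={p:.4f}, SSP={ssp}")
```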
  15. 1. Identify which response timing replacements impact human perception →

    For responses from −1 seconds to 0 seconds, it is more appropriate to replace the timing with 0 seconds. → For responses between 0 seconds and 400 milliseconds, it is better to leave the timing unchanged than to replace it with 0 seconds. Result of Listening Test
  16. We used the response audio as the input of the ML

    model, assuming the response audio is adequate in the dialogue. Input of RS Prediction Model Input
  17. We applied the baseline model of the automatic prediction model

    competition for synthetic speech evaluation (VoiceMOS Challenge)*1 and slightly customized it. • Added a fully connected layer to the baseline model with a HuBERT*2 encoder • Loss: L1 loss (mean absolute error, MAE) • Optimizer: Adam Component of RS Prediction Model *1 E. Cooper, W.-C. Huang, T. Toda, and J. Yamagishi, “Generalization ability of MOS prediction networks,” ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8442–8446, 2022. *2 W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021. Customized
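The customized head can be sketched in plain NumPy, standing in for the actual deep-learning framework. Everything here is an assumption for illustration: the 768-dim features (HuBERT Base), mean pooling over frames, and the random weights.

```python
# Hypothetical sketch of the RS prediction head: HuBERT frame features
# (assumed precomputed) are mean-pooled into one utterance embedding, a
# fully connected layer projects to a scalar RS, and training would
# minimize the L1 (MAE) loss shown at the end.
import numpy as np

rng = np.random.default_rng(0)
frames = rng.standard_normal((120, 768))        # 120 frames of HuBERT features
W = rng.standard_normal((768, 1)) * 0.01        # fully connected weights
b = np.zeros(1)                                 # fully connected bias

pooled = frames.mean(axis=0)                    # utterance-level embedding
rs_pred = float(pooled @ W + b)                 # predicted Realness Score

rs_true = 0.4                                   # illustrative target RS
mae = abs(rs_pred - rs_true)                    # L1 loss for one sample
print(rs_pred, mae)
```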
  18. We predicted whether there is a statistically significant bias (SSP)

    in Actual or Replaced (the True/False Actual Rate), and evaluated the prediction using an ROC curve. → The classification accuracy of SSP by the RS model using the Speaker IPU was AUC*1: 0.636. Model Evaluation *1
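The ROC-AUC evaluation above can be sketched without any library: AUC equals the probability that a randomly chosen positive outranks a randomly chosen negative (ties counting one half). The example labels and scores below are invented for illustration.

```python
# Hypothetical sketch of the AUC used for the SSP classification: labels
# mark whether a response has an SSP in Actual, scores are the model's
# predicted RS. AUC is the pairwise win rate of positives over negatives.

def auc(labels, scores):
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [True, True, False, True, False, False]
scores = [0.9, 0.4, 0.35, 0.6, 0.5, 0.1]
print(round(auc(labels, scores), 3))  # → 0.889 (8 of 9 pairs ranked correctly)
```

An AUC of 0.5 corresponds to random selection, which is the baseline the reported 0.636 is compared against.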
  19. Result of Realness Score Prediction 2. Explore the feasibility of

    the prediction model for dialogue systems → RS prediction outperformed random selection when assuming the response audio is adequate in the dialogue.
  20. Discussion This experiment is just one case of analysis. Several

    aspects remain to be analyzed: • Context & situation • Relationship between the interlocutors • Dialogue act of each response • Multimodal communication • Languages other than Japanese Several aspects remain to be implemented: • Demonstration experiment using a dialogue system • Validity of not only response timing but also response audio • Replacement with values other than the mode
  21. Conclusion 1. Identify which response timing replacements impact human perception

    → Clarified trends of general human preference for each response timing in casual conversation in Japanese → More analysis in other situations is needed 2. Explore the feasibility of the prediction model for dialogue systems → Outperformed random selection assuming the response audio is adequate → Demonstration experiment using a dialogue system is needed