December 05, 2024

Is Corpus Suitable for Human Perception?: Quality Assessment of Voice Response Timing in Conversational Corpus through Timing Replacement

APSIPA ASC 2024


sadahry


Transcript

  1. Is Corpus Suitable for Human Perception?: Quality Assessment of Voice

    Response Timing in Conversational Corpus through Timing Replacement Sadahiro Yoshikawa Japan Advanced Institute of Science and Technology Equmenopolis, Inc. APSIPA 2024
  2. Importance of Voice Response Timing Voice response timing changes

    the meaning of an utterance *3. *3 Felicia Roberts and Alexander L. Francis. 2013. Identifying a temporal threshold of tolerance for silent gaps after requests. The Journal of the Acoustical Society of America 133(6):EL471–EL477. e.g.,
  3. Timing Estimation Models Timing estimation models have typically been trained

    on human response timings *1, *2. *1 Roddy, Matthew, and Naomi Harte. ‘Neural Generation of Dialogue Response Timings’. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2442–52. Online: Association for Computational Linguistics, 2020. *2 Sakuma, Jin, Shinya Fujie, and Tetsunori Kobayashi. ‘Response Timing Estimation for Spoken Dialog Systems Based on Syntactic Completeness Prediction’. In 2022 IEEE Spoken Language Technology Workshop (SLT), 369–74, 2023. *1 frame interval: 50ms *2
  4. However, human response timings are not always appropriate… In a

    previous experiment*1, the number of speakers with p > 0.5 was equal between the actual response timing and the response timing replaced with the mode value in the corpus. *1 Roddy, Matthew, and Naomi Harte. ‘Neural Generation of Dialogue Response Timings’. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2442–52. Online: Association for Computational Linguistics, 2020.
  5. However, human response timings are not always appropriate… Some responses

    replaced with a fixed value sounded more realistic than the actual voice. → Does learning from human conversations truly lead to “appropriate timing for humans”? → Shouldn’t we prioritize whether it sounds natural/realistic to humans over corpus data?
  6. To prioritize whether it sounds natural/realistic to humans over corpus

    data: 1. Identify which response timing replacements impact human perception 2. Explore the feasibility of the prediction model for dialogue systems Purpose of Our Research
  7. 1. Identify which response timing replacements impact human perception →

    Conduct a Listening Test using corpus data → Define a human perception score and analyze it 2. Explore the feasibility of the prediction model for dialogue systems → Formulate an evaluation metric for predicting the score → Build a baseline ML model to predict the score Procedure of Our Research
  8. 1. Researchers collected corpus data and gathered 17 annotators per

    speaker 2. Annotators compared Actual (actual responses) and Replaced (mode value in the corpus) audio samples 3. Annotators listened to them and answered a question: “Which response timing sounds like it was produced in a real conversation?”*1 Procedure of Listening Test *1 Roddy, Matthew, and Naomi Harte. ‘Neural Generation of Dialogue Response Timings’. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2442–52. Online: Association for Computational Linguistics, 2020.
  9. Realness Score (RS) is defined for statistical analysis and ML

    model prediction. • If annotators vote for Actual, then +1 • If annotators vote for Replaced, then −1 • Researchers calculate the mean of the +1/−1 votes as the Realness Score (RS) Realness Score (RS)
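The voting scheme above can be sketched in code. This is a minimal illustration, not the authors' implementation; the function name and the string encoding of votes are our assumptions.

```python
# Hypothetical sketch of the Realness Score (RS): each of the 17 annotators
# votes "actual" (+1) or "replaced" (-1), and RS is the mean of those votes,
# so RS ranges from -1 (all prefer Replaced) to +1 (all prefer Actual).

def realness_score(votes):
    """votes: list of strings, each either "actual" or "replaced"."""
    signed = [1 if v == "actual" else -1 for v in votes]
    return sum(signed) / len(signed)

# Example: 12 of 17 annotators prefer the actual timing.
votes = ["actual"] * 12 + ["replaced"] * 5
print(round(realness_score(votes), 3))  # (12 - 5) / 17 ≈ 0.412
```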
  10. We used a corpus*1 which has a wide variety of speech

    samples for each speaker. • Japanese spoken language in face-to-face conversations with two participants • This corpus records 6 hours per speaker ◦ 1 hour each for 6 interlocutors ◦ 3 topics for each interlocutor i. self-introductions ii. introduction of positive and negative experiences iii. introduction of self-shortcomings • Casual & friendly conversation Corpus Data for Listening Test *1 Hayashi, Takato, Candy Olivia Mawalim, Ryo Ishii, Akira Morikawa, Atsushi Fukayama, Takao Nakamura and Shogo Okada. “A Ranking Model for Evaluation of Conversation Partners Based on Rapport Levels.” IEEE Access 11 (2023): 73024–73035.
  11. We defined “Response” as “when the speaker’s voice is shorter

    than that of the interlocutor in a turn-change” without restricting the content. Definition of Response Target Voice Sample
  12. Collection of Responses for Listening Test A total of 1,720

    responses were collected. • Select 8 speakers from this corpus for the listening test • Extract response timings from all dialogues of the 8 speakers*1 ◦ The mode value at that time was 0 ms • Responses were divided into 3 categories for each speaker*2 ◦ Early: below the 30th percentile ◦ Late: above the 70th percentile ◦ Medium: others • 192 responses per speaker were extracted ◦ 64 responses per category • Some responses were excluded ◦ due to operation errors during the experiment *1 *2
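The per-speaker categorization above can be sketched as follows. The nearest-rank percentile rule is our assumption; the slides do not specify the exact percentile method.

```python
# Hypothetical sketch of the Early/Medium/Late split: per speaker, timings
# below the 30th percentile are "Early", above the 70th are "Late", the
# rest "Medium". Negative timings mean the response overlaps the interlocutor.

def categorize(timings_ms):
    s = sorted(timings_ms)
    p30 = s[int(0.30 * (len(s) - 1))]   # 30th-percentile timing (nearest rank)
    p70 = s[int(0.70 * (len(s) - 1))]   # 70th-percentile timing (nearest rank)
    return ["Early" if t < p30 else "Late" if t > p70 else "Medium"
            for t in timings_ms]

# Example timings in ms for one speaker.
print(categorize([-800, -200, 0, 100, 300, 700]))
# → ['Early', 'Medium', 'Medium', 'Medium', 'Late', 'Late']
```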
  13. → For responses from −1 seconds to 0 seconds, it

    is more appropriate to replace the timing with 0 seconds. → For responses between 0 seconds and 400 milliseconds, it is better to leave the timing unchanged than to replace it with 0 seconds. Stats of Realness Score: Trends in Response Timing Distribution * Weighted score: a value obtained by dividing the response timings into 200 ms bins and multiplying the mean RS within each bin by the number of responses in that bin.
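The weighted score defined in the footnote can be sketched as below. This is a minimal illustration; the floor-based bucketing and bin labels are our assumptions about how the 200 ms intervals are laid out.

```python
# Hypothetical sketch of the weighted score: bucket response timings into
# 200 ms bins, then score each bin as (mean RS in bin) * (number of
# responses in bin). Bins are keyed by their lower edge in ms.
import math
from collections import defaultdict

def weighted_scores(timings_ms, rs_values, bin_ms=200):
    bins = defaultdict(list)
    for t, rs in zip(timings_ms, rs_values):
        bins[math.floor(t / bin_ms) * bin_ms].append(rs)
    return {b: (sum(v) / len(v)) * len(v) for b, v in sorted(bins.items())}

# Example: one response in the [-200, 0) ms bin, three in [0, 200) ms.
print(weighted_scores([-100, 50, 120, 180], [-0.4, 0.6, 0.2, 1.0]))
```

Note that mean times count reduces to the sum of RS in the bin, so the weighted score emphasizes bins that both lean one way and contain many responses.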
  14. We conducted a chi-square test for each response independently. We

    tested whether there were statistically significant biases (referred to as SSPs) in how readily annotators chose Actual or Replaced for a response. The test results also showed a trend similar to the above slide: • In Early, from −1 second to 0 seconds, there are many responses with an SSP in Replaced • Among responses from 0 seconds to 400 milliseconds, there are many responses with an SSP in Actual • For responses faster than 0 seconds in Medium, there is no response with an SSP in Actual • In Late, some responses also show an SSP in Replaced Stats of Realness Score: Statistical Test
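The per-response test can be sketched as follows. The slides name a chi-square test; the goodness-of-fit form against a 50/50 expectation and the 0.05 significance threshold are our assumptions.

```python
# Hypothetical sketch of the per-response SSP test: compare the observed
# Actual/Replaced vote counts against a uniform 50/50 expectation (1 degree
# of freedom). For df = 1 the chi-square survival function is exactly
# erfc(sqrt(x / 2)), so no stats library is needed.
import math

def chi_square_ssp(n_actual, n_replaced, alpha=0.05):
    n = n_actual + n_replaced
    expected = n / 2
    chi2 = ((n_actual - expected) ** 2 + (n_replaced - expected) ** 2) / expected
    p = math.erfc(math.sqrt(chi2 / 2))  # chi2(1) survival function
    return chi2, p, p < alpha           # SSP flagged when p < alpha

# Example: 15 of 17 annotators choose Actual -> significant bias.
chi2, p, ssp = chi_square_ssp(15, 2)
print(f"chi2={chi2:.2f}, p={p:.4f}, SSP={ssp}")
```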
  15. 1. Identify which response timing replacements impact human perception →

    For responses from −1 seconds to 0 seconds, it is more appropriate to replace the timing with 0 seconds. → For responses between 0 seconds and 400 milliseconds, it is better to leave the timing unchanged than to replace it with 0 seconds. Result of Listening Test
  16. We used the response audio as the input of the ML

    model, assuming the response audio is adequate in the dialogue. Input of RS Prediction Model Input
  17. We applied the baseline model of the automatic prediction model

    competition for synthetic speech evaluation (VoiceMOS Challenge)*1 and slightly customized it. • Added a fully connected layer to the baseline model with a HuBERT*2 encoder • Loss: L1 loss (mean absolute error, MAE) • Optimizer: Adam Component of RS Prediction Model *1 E. Cooper, W.-C. Huang, T. Toda, and J. Yamagishi, “Generalization ability of MOS prediction networks,” ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8442–8446, 2022. *2 W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021. Customized
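The customized head can be sketched in plain NumPy, standing in for the actual deep-learning framework. Everything here is an assumption for illustration: the 768-dim features (HuBERT Base), mean pooling over frames, and the random weights.

```python
# Hypothetical sketch of the RS prediction head: HuBERT frame features
# (assumed precomputed) are mean-pooled into one utterance embedding, a
# fully connected layer projects to a scalar RS, and training would
# minimize the L1 (MAE) loss shown at the end.
import numpy as np

rng = np.random.default_rng(0)
frames = rng.standard_normal((120, 768))        # 120 frames of HuBERT features
W = rng.standard_normal((768, 1)) * 0.01        # fully connected weights
b = np.zeros(1)                                 # fully connected bias

pooled = frames.mean(axis=0)                    # utterance-level embedding
rs_pred = float(pooled @ W + b)                 # predicted Realness Score

rs_true = 0.4                                   # illustrative target RS
mae = abs(rs_pred - rs_true)                    # L1 loss for one sample
print(rs_pred, mae)
```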
  18. We predicted whether there is a statistically significant bias (SSP)

    in Actual or Replaced (the True/False Actual Rate), and evaluated the prediction using an ROC curve. → The classification accuracy of SSP by the RS model using the Speaker IPU was AUC*1: 0.636. Model Evaluation *1
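The ROC-AUC evaluation above can be sketched without any library: AUC equals the probability that a randomly chosen positive outranks a randomly chosen negative (ties counting one half). The example labels and scores below are invented for illustration.

```python
# Hypothetical sketch of the AUC used for the SSP classification: labels
# mark whether a response has an SSP in Actual, scores are the model's
# predicted RS. AUC is the pairwise win rate of positives over negatives.

def auc(labels, scores):
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [True, True, False, True, False, False]
scores = [0.9, 0.4, 0.35, 0.6, 0.5, 0.1]
print(round(auc(labels, scores), 3))  # → 0.889 (8 of 9 pairs ranked correctly)
```

An AUC of 0.5 corresponds to random selection, which is the baseline the reported 0.636 is compared against.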
  19. Result of Realness Score Prediction 2. Explore the feasibility of

    the prediction model for dialogue systems → RS prediction outperformed random selection when assuming the response audio is adequate in the dialogue.
  20. Discussion This experiment is just one case of analysis. Several

    aspects remain to be analyzed: • Context & situation • Relationship between the interlocutors • Dialogue act of each response • Multimodal communication • Languages other than Japanese Several aspects remain to be implemented: • Demonstration experiment using a dialogue system • Validity of not only response timing but also response audio • Replacement with values other than the mode
  21. Conclusion 1. Identify which response timing replacements impact human perception

    → Clarified trends of general human preference for each response timing in casual conversation in Japanese → More analysis in other situations is needed 2. Explore the feasibility of the prediction model for dialogue systems → Outperformed random selection assuming the response audio is adequate → Demonstration experiment using a dialogue system is needed