
Interspeech 2025 Reading Group

B. Phukon et al., Aligning ASR Evaluation with Human and LLM Judgments: Intelligibility Metrics Using Phonetic, Semantic, and NLI Approaches, Proc. Interspeech 2025, pp. 5708--5712


Katsuhito Sudoh

September 29, 2025

Transcript

  1. Aligning ASR Evaluation with Human and LLM Judgments: Intelligibility Metrics
     Using Phonetic, Semantic, and NLI Approaches
     Bornali Phukon, Xiuwen Zheng, Mark Hasegawa-Johnson
     Presenter: Katsuhito Sudoh (Nara Women's University)
     Proc. Interspeech 2025, pp. 5708--5712, https://www.isca-archive.org/interspeech_2025/phukon25_interspeech.html
  2. Overview of this paper
     • Existing ASR evaluation metrics (WER, CER) cannot capture intelligibility
     • Proposes a metric that integrates phonetic/semantic similarity with natural language inference (NLI)
     • Achieves better correlation with human ratings than conventional metrics (0.890)
     • ASR, like machine translation, should be evaluated by what the reader can actually take away (intelligibility)
     • (Technically, the proposed method itself is nothing to write home about, though...)
  3. Key observations
     • ASR for dysarthric speech is hard, but can GPT correct the output? (see the sketch after the table)
     • wav2vec (with in-domain fine-tuning) and WavLLM: semantic scores improve substantially with GPT correction
     • Whisper: frequent over-correction (are the language constraints in the ASR already too strong? *Sudoh's personal view)
     • Should ASR then be evaluated with LLM correctability factored in?

     Table 3: Evaluation of ASR Systems on Dysarthric Speech, w/ and w/o GPT corrections. "Wav2vec" in this table refers to wav2vec-sap1005. The best entry in each block is in boldface.

     | Setting               | Model   | WER%  | PSim   | BERTScore | BLEURT  | Heval  |
     |-----------------------|---------|-------|--------|-----------|---------|--------|
     | w/o GPT               | Wav2vec | 31.11 | 0.9752 | 0.5954    | -0.3284 | 0.1699 |
     |                       | Wavllm  | 35.85 | 0.9718 | 0.5334    | -0.5769 | 0.1976 |
     |                       | Whisper | 34.72 | 0.9387 | 0.6516    | 0.0183  | 0.1700 |
     | w/ GPT                | Wav2vec | 24.26 | 0.9673 | 0.7023    | 0.1893  | 0.1197 |
     |                       | Wavllm  | 27.30 | 0.9655 | 0.6714    | 0.0839  | 0.1349 |
     |                       | Whisper | 40.99 | 0.9198 | 0.6254    | 0.0265  | 0.1863 |
     | w/ GPT, improved only | Wav2vec | 21.91 | 0.9813 | 0.7234    | 0.2362  | 0.1084 |
     |                       | Wavllm  | 24.74 | 0.9782 | 0.6903    | 0.1216  | 0.1220 |
     |                       | Whisper | 29.31 | 0.9428 | 0.6909    | 0.1591  | 0.1537 |

     From the paper: "... with high phonetic similarity. The phonetic score differences between wav2vec-sap1005 and Whisper grow as the severity increases. At high severity, the difference is 0.0609 (0.9135 for wav2vec-sap1005 vs. 0.8526 for Whisper). This gap narrows to ..."
     * Table taken from the paper
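The slide's question, whether GPT can repair dysarthric-speech ASR output, boils down to an LLM post-correction pass over each hypothesis. Below is a minimal sketch of such a pass, assuming the OpenAI Python SDK; the model name, prompt wording, and settings are illustrative placeholders, not the ones used in the paper.

```python
# Hypothetical sketch of LLM-based ASR post-correction (not the paper's
# actual prompt or model). Assumes: pip install openai, with OPENAI_API_KEY set.
from openai import OpenAI

client = OpenAI()

def gpt_correct(asr_hypothesis: str) -> str:
    """Ask the LLM to fix likely recognition errors, changing as little as possible."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[
            {"role": "system",
             "content": ("The following is an ASR transcript of dysarthric speech. "
                         "Correct likely recognition errors, change as little as "
                         "possible, and output only the corrected text.")},
            {"role": "user", "content": asr_hypothesis},
        ],
    )
    return response.choices[0].message.content.strip()

print(gpt_correct("the whether two day is vary nice"))
```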
  4. Proposed method
     • Combines three scores: Score = αS_NLI + βS_S + γS_P
     • S_NLI: language-inference score, the mutual entailment probability P_e(hypo ↔ ref)
     • S_S: semantic score (BERTScore)
     • S_P: phonetic similarity (Soundex with Jaro-Winkler)
     • (α, β, γ) = (0.40, 0.28, 0.32)
     • What is natural language inference? Predicting the entailment relation between two sentences:
       Premise: The deadline is not flexible.
       Hypothesis: The deadline is flexible. → contradiction
       Hypothesis: The deadline is strict. → entailment
       Hypothesis: The deadline is Friday. → neutral
     A reimplementation sketch of the metric follows.
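To make the three components concrete, here is a minimal, hypothetical reimplementation of the integrated score. The paper names the ingredients (a RoBERTa-large NLI model fine-tuned on SNLI/MNLI/FEVER/ANLI, BERTScore, and Soundex with Jaro-Winkler) but not the exact checkpoint or preprocessing, so the `ynie/...` checkpoint, the product used for bidirectional entailment, the word-level Soundex encoding, and the choice of BERTScore F1 are all assumptions.

```python
# Hypothetical reimplementation sketch of Score = alpha*S_NLI + beta*S_S + gamma*S_P.
# Assumes: pip install jellyfish bert-score transformers torch
import re

import jellyfish
import torch
from bert_score import score as bert_score
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# A RoBERTa-large NLI model fine-tuned on SNLI/MNLI/FEVER/ANLI, matching the
# paper's description; this exact checkpoint name is an assumption.
NLI_MODEL = "ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli"
tokenizer = AutoTokenizer.from_pretrained(NLI_MODEL)
nli_model = AutoModelForSequenceClassification.from_pretrained(NLI_MODEL).eval()

def entailment_prob(premise: str, hypothesis: str) -> float:
    """P(premise entails hypothesis); label 0 is entailment for this checkpoint."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = nli_model(**inputs).logits.softmax(dim=-1)[0]
    return probs[0].item()

def s_nli(hyp: str, ref: str) -> float:
    # Mutual entailment P_e(hypo <-> ref); scoring "<->" as the product of the
    # two directional probabilities is an assumption.
    return entailment_prob(hyp, ref) * entailment_prob(ref, hyp)

def s_p(hyp: str, ref: str) -> float:
    # Phonetic similarity: encode each word with Soundex, then compare the code
    # sequences with Jaro-Winkler (the word-level encoding is an assumption).
    def codes(text: str) -> str:
        return " ".join(jellyfish.soundex(w) for w in re.findall(r"[a-z]+", text.lower()))
    return jellyfish.jaro_winkler_similarity(codes(hyp), codes(ref))

def integrated_score(hyp: str, ref: str,
                     alpha: float = 0.40, beta: float = 0.28,
                     gamma: float = 0.32) -> float:
    _, _, f1 = bert_score([hyp], [ref], lang="en")  # S_S: BERTScore (F1 assumed)
    return alpha * s_nli(hyp, ref) + beta * f1.item() + gamma * s_p(hyp, ref)

print(integrated_score("the deadline is strict", "the deadline is not flexible"))
```

Note how the slide's NLI example motivates the first term: "The deadline is strict" and "The deadline is flexible" can have identical WER against the reference while S_NLI separates them sharply, which is exactly the intelligibility gap the paper targets.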
  5. Correlation with human ratings + impressions
     • Pearson correlation of 0.890 with human ratings
     • The design is, for better or worse, unsophisticated
     • No fine-tuning required (hyperparameter tuning is, though)
     • There is hardly enough data to train a neural metric instead, so this is probably the right call...
     • A natural idea for an NLP researcher; presupposing LLMs is by now unavoidable
     • Ideally the metric would be cast in a form independent of the downstream task
     A sketch of the weight fitting follows the excerpt below.

     From the paper (Section 4, "A new integrated metric for dysarthric speech ASR evaluation"):
     [Figure 2: Correlation of WER, Phonetic, Semantic, NLI, and Proposed Metrics with Human Ratings]
     "...tions were discarded). This suggests that Whisper outputs have fewer correctable errors. However, its highest BERTScore indicates stronger semantic correctness, supporting the idea that improved semantic accuracy of a baseline ASR transcript does not contribute to its LLM correctability. [...] Sections 1, 2, and 3 highlight that: 1) A single metric is insufficient for evaluating dysarthric speech ASR systems; WER captures word accuracy but misses semantic aspects. 2) Metrics [...] score (S_NLI) of ASR hypothesis quality. S_NLI is set equal to the probability P_e(hypo ↔ ref) computed by a RoBERTa-large model [18] that has been fine-tuned on the SNLI [19], MNLI, FEVER [20], and ANLI [21] datasets. To calculate the semantic score S_S, we used BERTScore, and for phonetic similarity S_P, we applied Soundex with Jaro-Winkler similarity, as discussed in Section 2. The integrated metric is defined as: Integrated Score = αS_NLI + βS_S + γS_P, where α, β, and γ represent the weights. Optimal weights were determined via linear regression using human ratings as the target. We used 5-fold cross-validation to avoid overfitting, training the model on 80% of the data and testing it on the remaining 20%. Figure 2 shows that the integrated metric, using optimal weights, achieved the highest correlation (0.890) with human ratings, outperforming individual metrics. This demonstrates that the integrated approach better aligns with human judgment and is effective in evaluating ASR outputs with a focus on correctability. After completion of cross-validation, weights were recomputed using the entire linear regression dataset (100 transcript pairs, each labeled by the average of intelligibility scores assigned by six human annotators). The final normalized weights were: α = 0.40, β = 0.28, γ = 0.32. The p-values for each coefficient were all found to be statistically significant (NLI Score: p = 1.47e-07, BERT Score: p = ..."
     * Figure taken from the paper
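The weight-fitting procedure quoted above (linear regression against human ratings with 5-fold cross-validation, then a refit on all 100 pairs) is easy to sketch. The feature matrix and ratings below are random placeholders, and normalizing the refit coefficients to sum to one is an assumption made only to match the reported (0.40, 0.28, 0.32).

```python
# Sketch of the weight fitting described in the excerpt: 5-fold CV, then a
# final refit on all data. X and y are random placeholders, not real data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.random((100, 3))  # columns: S_NLI, S_S, S_P per transcript pair
y = rng.random(100)       # average intelligibility rating of six annotators

# 5-fold cross-validation (train on 80%, test on 20%, as in the paper).
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (tr, te) in enumerate(kf.split(X)):
    model = LinearRegression().fit(X[tr], y[tr])
    print(f"fold {fold}: R^2 = {model.score(X[te], y[te]):.3f}")

# Refit on the full dataset; normalizing the coefficients to sum to 1 is an
# assumption consistent with the reported weights 0.40 + 0.28 + 0.32 = 1.
final = LinearRegression().fit(X, y)
alpha, beta, gamma = final.coef_ / final.coef_.sum()
print(f"alpha={alpha:.2f}, beta={beta:.2f}, gamma={gamma:.2f}")
```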
  6. Further Reading
     • Kiyotada Mori et al., What Do Humans Hear When Interacting? Experiments on Selective Listening for Evaluating ASR of Spoken Dialogue Systems
     • Proc. Interspeech 2025, pp. 1753--1757
     • https://www.isca-archive.org/interspeech_2025/mori25_interspeech.html
     • Argues that for spoken dialogue systems WER, which weights every word equally, is a poor fit, and that selective/weighted evaluation is needed instead