
Interspeech 2025 Reading Group

B. Phukon et al., Aligning ASR Evaluation with Human and LLM Judgments: Intelligibility Metrics Using Phonetic, Semantic, and NLI Approaches, Proc. Interspeech 2025, pp. 5708--5712


Katsuhito Sudoh

September 29, 2025

Transcript

  1. Aligning ASR Evaluation with Human and LLM Judgments: Intelligibility Metrics
     Using Phonetic, Semantic, and NLI Approaches
     Bornali Phukon, Xiuwen Zheng, Mark Hasegawa-Johnson
     Presenter: Katsuhito Sudoh (Nara Women's University)
     Proc. Interspeech 2025, pp. 5708--5712, https://www.isca-archive.org/interspeech_2025/phukon25_interspeech.html
  2. Overview of this paper
     • Existing ASR evaluation metrics (WER, CER) cannot capture intelligibility
     • Proposes a metric that integrates phonetic/semantic similarity with natural language inference (NLI)
     • Achieves better correlation with human ratings than conventional metrics (0.890)
     • ASR, like machine translation, should be evaluated by what the reader can actually take away (intelligibility)
     • (Technically, the proposed method itself is nothing to write home about, though...)
  3. Key observations
     • ASR for dysarthric speech is hard, but can GPT correct the output? (see the sketch after the table)
     • wav2vec (with in-domain fine-tuning) and WavLLM: semantic scores improve substantially with GPT correction
     • Whisper: frequent over-correction (are the language constraints in the ASR already too strong? *Sudoh's personal view)
     • Should ASR then be evaluated with LLM correctability factored in?

     Table 3: Evaluation of ASR Systems on Dysarthric Speech, w/ and w/o GPT corrections. "Wav2vec" in this table refers to wav2vec-sap1005. The best entry in each block is in boldface.

     | Setting               | Model   | WER%  | PSim   | BERTScore | BLEURT  | Heval  |
     |-----------------------|---------|-------|--------|-----------|---------|--------|
     | w/o GPT               | Wav2vec | 31.11 | 0.9752 | 0.5954    | -0.3284 | 0.1699 |
     |                       | Wavllm  | 35.85 | 0.9718 | 0.5334    | -0.5769 | 0.1976 |
     |                       | Whisper | 34.72 | 0.9387 | 0.6516    | 0.0183  | 0.1700 |
     | w/ GPT                | Wav2vec | 24.26 | 0.9673 | 0.7023    | 0.1893  | 0.1197 |
     |                       | Wavllm  | 27.30 | 0.9655 | 0.6714    | 0.0839  | 0.1349 |
     |                       | Whisper | 40.99 | 0.9198 | 0.6254    | 0.0265  | 0.1863 |
     | w/ GPT, improved only | Wav2vec | 21.91 | 0.9813 | 0.7234    | 0.2362  | 0.1084 |
     |                       | Wavllm  | 24.74 | 0.9782 | 0.6903    | 0.1216  | 0.1220 |
     |                       | Whisper | 29.31 | 0.9428 | 0.6909    | 0.1591  | 0.1537 |

     From the paper: "... with high phonetic similarity. The phonetic score differences between wav2vec-sap1005 and Whisper grow as the severity increases. At high severity, the difference is 0.0609 (0.9135 for wav2vec-sap1005 vs. 0.8526 for Whisper). This gap narrows to ..."
     * Table taken from the paper
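The slide's question, whether GPT can repair dysarthric-speech ASR output, boils down to an LLM post-correction pass over each hypothesis. Below is a minimal sketch of such a pass, assuming the OpenAI Python SDK; the model name, prompt wording, and settings are illustrative placeholders, not the ones used in the paper.

```python
# Hypothetical sketch of LLM-based ASR post-correction (not the paper's
# actual prompt or model). Assumes: pip install openai, with OPENAI_API_KEY set.
from openai import OpenAI

client = OpenAI()

def gpt_correct(asr_hypothesis: str) -> str:
    """Ask the LLM to fix likely recognition errors, changing as little as possible."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[
            {"role": "system",
             "content": ("The following is an ASR transcript of dysarthric speech. "
                         "Correct likely recognition errors, change as little as "
                         "possible, and output only the corrected text.")},
            {"role": "user", "content": asr_hypothesis},
        ],
    )
    return response.choices[0].message.content.strip()

print(gpt_correct("the whether two day is vary nice"))
```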
  4. Proposed method
     • Combines three scores: Score = αS_NLI + βS_S + γS_P
     • S_NLI: language-inference score, the mutual entailment probability P_e(hypo ↔ ref)
     • S_S: semantic score (BERTScore)
     • S_P: phonetic similarity (Soundex with Jaro-Winkler)
     • (α, β, γ) = (0.40, 0.28, 0.32)
     • What is natural language inference? Predicting the entailment relation between two sentences:
       Premise: The deadline is not flexible.
       Hypothesis: The deadline is flexible. → contradiction
       Hypothesis: The deadline is strict. → entailment
       Hypothesis: The deadline is Friday. → neutral
     A reimplementation sketch of the metric follows.
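To make the three components concrete, here is a minimal, hypothetical reimplementation of the integrated score. The paper names the ingredients (a RoBERTa-large NLI model fine-tuned on SNLI/MNLI/FEVER/ANLI, BERTScore, and Soundex with Jaro-Winkler) but not the exact checkpoint or preprocessing, so the `ynie/...` checkpoint, the product used for bidirectional entailment, the word-level Soundex encoding, and the choice of BERTScore F1 are all assumptions.

```python
# Hypothetical reimplementation sketch of Score = alpha*S_NLI + beta*S_S + gamma*S_P.
# Assumes: pip install jellyfish bert-score transformers torch
import re

import jellyfish
import torch
from bert_score import score as bert_score
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# A RoBERTa-large NLI model fine-tuned on SNLI/MNLI/FEVER/ANLI, matching the
# paper's description; this exact checkpoint name is an assumption.
NLI_MODEL = "ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli"
tokenizer = AutoTokenizer.from_pretrained(NLI_MODEL)
nli_model = AutoModelForSequenceClassification.from_pretrained(NLI_MODEL).eval()

def entailment_prob(premise: str, hypothesis: str) -> float:
    """P(premise entails hypothesis); label 0 is entailment for this checkpoint."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = nli_model(**inputs).logits.softmax(dim=-1)[0]
    return probs[0].item()

def s_nli(hyp: str, ref: str) -> float:
    # Mutual entailment P_e(hypo <-> ref); scoring "<->" as the product of the
    # two directional probabilities is an assumption.
    return entailment_prob(hyp, ref) * entailment_prob(ref, hyp)

def s_p(hyp: str, ref: str) -> float:
    # Phonetic similarity: encode each word with Soundex, then compare the code
    # sequences with Jaro-Winkler (the word-level encoding is an assumption).
    def codes(text: str) -> str:
        return " ".join(jellyfish.soundex(w) for w in re.findall(r"[a-z]+", text.lower()))
    return jellyfish.jaro_winkler_similarity(codes(hyp), codes(ref))

def integrated_score(hyp: str, ref: str,
                     alpha: float = 0.40, beta: float = 0.28,
                     gamma: float = 0.32) -> float:
    _, _, f1 = bert_score([hyp], [ref], lang="en")  # S_S: BERTScore (F1 assumed)
    return alpha * s_nli(hyp, ref) + beta * f1.item() + gamma * s_p(hyp, ref)

print(integrated_score("the deadline is strict", "the deadline is not flexible"))
```

Note how the slide's NLI example motivates the first term: "The deadline is strict" and "The deadline is flexible" can have identical WER against the reference while S_NLI separates them sharply, which is exactly the intelligibility gap the paper targets.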
  5. Correlation with human ratings + impressions
     • Pearson correlation of 0.890 with human ratings
     • The design is, for better or worse, unsophisticated
     • No fine-tuning required (hyperparameter tuning is, though)
     • There is hardly enough data to train a neural metric instead, so this is probably the right call...
     • A natural idea for an NLP researcher; presupposing LLMs is by now unavoidable
     • Ideally the metric would be cast in a form independent of the downstream task
     A sketch of the weight fitting follows the excerpt below.

     From the paper (Section 4, "A new integrated metric for dysarthric speech ASR evaluation"):
     [Figure 2: Correlation of WER, Phonetic, Semantic, NLI, and Proposed Metrics with Human Ratings]
     "...tions were discarded). This suggests that Whisper outputs have fewer correctable errors. However, its highest BERTScore indicates stronger semantic correctness, supporting the idea that improved semantic accuracy of a baseline ASR transcript does not contribute to its LLM correctability. [...] Sections 1, 2, and 3 highlight that: 1) A single metric is insufficient for evaluating dysarthric speech ASR systems; WER captures word accuracy but misses semantic aspects. 2) Metrics [...] score (S_NLI) of ASR hypothesis quality. S_NLI is set equal to the probability P_e(hypo ↔ ref) computed by a RoBERTa-large model [18] that has been fine-tuned on the SNLI [19], MNLI, FEVER [20], and ANLI [21] datasets. To calculate the semantic score S_S, we used BERTScore, and for phonetic similarity S_P, we applied Soundex with Jaro-Winkler similarity, as discussed in Section 2. The integrated metric is defined as: Integrated Score = αS_NLI + βS_S + γS_P, where α, β, and γ represent the weights. Optimal weights were determined via linear regression using human ratings as the target. We used 5-fold cross-validation to avoid overfitting, training the model on 80% of the data and testing it on the remaining 20%. Figure 2 shows that the integrated metric, using optimal weights, achieved the highest correlation (0.890) with human ratings, outperforming individual metrics. This demonstrates that the integrated approach better aligns with human judgment and is effective in evaluating ASR outputs with a focus on correctability. After completion of cross-validation, weights were recomputed using the entire linear regression dataset (100 transcript pairs, each labeled by the average of intelligibility scores assigned by six human annotators). The final normalized weights were: α = 0.40, β = 0.28, γ = 0.32. The p-values for each coefficient were all found to be statistically significant (NLI Score: p = 1.47e-07, BERT Score: p = ..."
     * Figure taken from the paper
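The weight-fitting procedure quoted above (linear regression against human ratings with 5-fold cross-validation, then a refit on all 100 pairs) is easy to sketch. The feature matrix and ratings below are random placeholders, and normalizing the refit coefficients to sum to one is an assumption made only to match the reported (0.40, 0.28, 0.32).

```python
# Sketch of the weight fitting described in the excerpt: 5-fold CV, then a
# final refit on all data. X and y are random placeholders, not real data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.random((100, 3))  # columns: S_NLI, S_S, S_P per transcript pair
y = rng.random(100)       # average intelligibility rating of six annotators

# 5-fold cross-validation (train on 80%, test on 20%, as in the paper).
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (tr, te) in enumerate(kf.split(X)):
    model = LinearRegression().fit(X[tr], y[tr])
    print(f"fold {fold}: R^2 = {model.score(X[te], y[te]):.3f}")

# Refit on the full dataset; normalizing the coefficients to sum to 1 is an
# assumption consistent with the reported weights 0.40 + 0.28 + 0.32 = 1.
final = LinearRegression().fit(X, y)
alpha, beta, gamma = final.coef_ / final.coef_.sum()
print(f"alpha={alpha:.2f}, beta={beta:.2f}, gamma={gamma:.2f}")
```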
  6. Further Reading
     • Kiyotada Mori et al., What Do Humans Hear When Interacting? Experiments on Selective Listening for Evaluating ASR of Spoken Dialogue Systems
     • Proc. Interspeech 2025, pp. 1753--1757
     • https://www.isca-archive.org/interspeech_2025/mori25_interspeech.html
     • Argues that for spoken dialogue systems WER, which weights every word equally, is a poor fit, and that selective/weighted evaluation is needed instead