Literature Survey: Aizuchi (Backchannel) Prediction


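Survey goals
• Learn what methods exist for predicting aizuchi (verbal backchannels)
  ◦ Nodding will be covered at a later date
• Learn what accuracy prior work achieves
• Identify a method suited to my own app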

Survey method
Comparison based on "4. End-of-turn detection and prediction" of the review paper:
• Skantze, Gabriel. "Turn-taking in Conversational Systems and Human-Robot Interaction: A Review." Comput. Speech Lang. 67 (2021): 101178.

Papers compared
1. Ward, Nigel G. and Wataru Tsukahara. "Prosodic features which cue back-channel responses in English and Japanese." Journal of Pragmatics 32 (2000): 1177-1207.
2. 山口貴史, 井上昂治, 吉野幸一郎. "傾聴対話システムのための言語情報と韻律情報に基づく多様な形態の相槌の生成 (「フィールド研究とインタラクション」および一般)." [Generating diverse forms of backchannels based on linguistic and prosodic information for an attentive listening dialogue system] (2016).
3. Hara, Kohei, Koji Inoue, Katsuya Takanashi and Tatsuya Kawahara. "Prediction of Turn-taking Using Multitask Learning with Prediction of Backchannels and Fillers." INTERSPEECH (2018).
4. Ruede, Robin, Markus Müller, Sebastian Stüker and Alexander H. Waibel. "Yeah, Right, Uh-Huh: A Deep Learning Backchannel Predictor." IWSDS (2017).
5. Ortega, Daniel, Chia-Yu Li and Ngoc Thang Vu. "OH, JEEZ! or UH-HUH? A Listener-Aware Backchannel Predictor on ASR Transcriptions." ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2020): 8064-8068.

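Comparison items
• Prediction approach
  ◦ End of syllable, real-time, etc.
• Dataset
  ◦ Public data, self-created data, etc.
• Annotation
  ◦ Manual, derived from the dataset, etc.
• Classifier
  ◦ Decision tree, DNN, etc.
• Features
  ◦ Sound pressure, deltas, linguistic, etc.
• Classes
  ◦ Binary (yes/no), multi-class (turn-talk/bc/filler/no), etc.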

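Paper survey template
• Overview:
  ◦ List the comparison items
• Prediction results:
  ◦ Assess accuracy from the reported results
• Findings:
  ◦ List the findings obtained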

Background: prediction approaches
Source: Skantze, Gabriel. "Turn-taking in Conversational Systems and Human-Robot Interaction: A Review." Comput. Speech Lang. 67 (2021): 101178.
• Decide on speaker change after a silence
• Decide on speaker change at each IPU (this deck: BRPs are also in scope)
• Decide in real time (this deck: Projection is out of scope)
TRP = transition-relevant place
IPU = inter-pausal unit
BRP = backchannel-relevant place
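As a rough illustration of what "deciding at each IPU" operates on, the sketch below splits a frame-level speech/non-speech sequence into inter-pausal units. The 10 ms frame size, the 200 ms pause threshold, and the name `split_into_ipus` are assumptions made for this example; the slides do not specify them.

```python
def split_into_ipus(is_speech, frame_ms=10, min_pause_ms=200):
    """Split frame-level voice activity into IPUs.

    is_speech: list of booleans, one per frame (True = speech).
    A pause of at least min_pause_ms closes the current IPU
    (assumed threshold, not taken from the slides).
    Returns a list of (start_ms, end_ms) tuples.
    """
    ipus = []
    start = None        # first frame of the IPU being built
    last_speech = None  # most recent speech frame seen
    for i, speaking in enumerate(is_speech):
        if speaking:
            if start is None:
                start = i
            last_speech = i
        elif start is not None and (i - last_speech) * frame_ms >= min_pause_ms:
            # The silence is long enough: close the IPU at its last speech frame.
            ipus.append((start * frame_ms, (last_speech + 1) * frame_ms))
            start = None
    if start is not None:
        # Flush an IPU that is still open when the input ends.
        ipus.append((start * frame_ms, (last_speech + 1) * frame_ms))
    return ipus


if __name__ == "__main__":
    # 300 ms speech, 50 ms pause, 200 ms speech, 250 ms pause, 100 ms speech
    frames = [True] * 30 + [False] * 5 + [True] * 20 + [False] * 25 + [True] * 10
    print(split_into_ipus(frames))
```

An IPU-based predictor would then run once per returned unit, whereas a continuous model would run on every frame.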

Survey and comparison

1-1. Overview: Ward, Nigel G. and Wataru Tsukahara. "Prosodic features which cue back-channel responses in English and Japanese." Journal of Pragmatics 32 (2000): 1177-1207.
• Prediction approach
  ◦ IPU-based model (close to silence-based)
• Corpus
  ◦ One-on-one conversations with the speakers' view of each other blocked
    ▪ en: 8 dialogues, 68 min total (10 men, 2 women)
    ▪ ja: 18 dialogues, 80 min total (15 men, 9 women)
• Annotation
  ◦ Backchannels identified with a rule-based definition (*1)
    ▪ en: 359
    ▪ ja: 873
• Classifier
  ◦ Rule-based, with separate rules for en and ja (*2)
    ▪ Parameters tuned on the corpus for maximum performance
  ◦ Backchannel estimation within 500 ms using linguistic information
• Features
  ◦ Prosodic features (pitch, volume, preceding backchannel)
  ◦ The speech-duration condition is 800 ms or more, but prediction runs every 10 ms
• Classes
  ◦ Backchannel possible / not possible
*1 Backchannel feedback: (D1) responds directly to the content of an utterance of the other, (D2) is optional, and (D3) does not require acknowledgement by the other.
*2 (P1) a region of pitch less than the {en: 26th, ja: 28th}-percentile pitch level (P2) continuing for at least 110 milliseconds, (P3) coming after at least 700 milliseconds of speech, (P4) providing you have not output back-channel feedback within the preceding {en: 800, ja: 1000} milliseconds, (P5) after {en: 700, ja: 350} milliseconds wait.
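Rules (P1) to (P5) quoted in the footnote can be read as a small frame-by-frame procedure. The following is a minimal sketch under stated assumptions, not the authors' implementation: the 10 ms frame, the nearest-rank percentile, the way "700 ms of speech" is counted (consecutive speech frames before the low-pitch region), and all names (`RuleParams`, `predict_backchannels`) are choices made here for illustration. Only the threshold values come from the footnote.

```python
import math
from dataclasses import dataclass

FRAME_MS = 10  # assumed frame step


@dataclass
class RuleParams:
    low_pitch_percentile: float  # P1
    low_pitch_min_ms: int        # P2
    min_speech_ms: int           # P3
    refractory_ms: int           # P4
    wait_ms: int                 # P5


# Threshold values as quoted on the slide.
EN = RuleParams(26, 110, 700, 800, 700)
JA = RuleParams(28, 110, 700, 1000, 350)


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list (assumed definition)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def predict_backchannels(pitch_hz, is_speech, params):
    """Return output times (ms) at which a backchannel would be produced.

    pitch_hz: per-frame F0 in Hz, or None for unvoiced frames.
    is_speech: per-frame booleans from some voice activity detector.
    """
    voiced = [f0 for f0 in pitch_hz if f0 is not None]
    if not voiced:
        return []
    threshold = percentile(voiced, params.low_pitch_percentile)
    outputs = []
    speech_run = 0  # consecutive speech frames up to the current frame
    low_run = 0     # consecutive low-pitch frames up to the current frame
    for i, (f0, speaking) in enumerate(zip(pitch_hz, is_speech)):
        speech_run = speech_run + 1 if speaking else 0
        low_run = low_run + 1 if (f0 is not None and f0 < threshold) else 0
        low_ms = low_run * FRAME_MS
        prior_speech_ms = (speech_run - low_run) * FRAME_MS
        # P1 + P2: low-pitch region just reached the minimum duration.
        # P3: enough speech came before that region.
        if low_ms == params.low_pitch_min_ms and prior_speech_ms >= params.min_speech_ms:
            t_out = (i + 1) * FRAME_MS + params.wait_ms  # P5: wait before output
            # P4: skip if a backchannel was output too recently.
            if not outputs or t_out - outputs[-1] >= params.refractory_ms:
                outputs.append(t_out)
    return outputs


if __name__ == "__main__":
    # 1000 ms of speech at 200 Hz followed by 150 ms at 100 Hz (synthetic input).
    pitch = [200.0] * 100 + [100.0] * 15
    speech = [True] * 115
    print(predict_backchannels(pitch, speech, EN))
    print(predict_backchannels(pitch, speech, JA))
```

Because the only language-dependent parts are the three parameters, switching between English and Japanese is a matter of passing `EN` or `JA`.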

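1-2. Prediction results: Ward, Nigel G. and Wataru Tsukahara. "Prosodic features which cue back-channel responses in English and Japanese." Journal of Pragmatics 32 (2000): 1177-1207.
accuracy | ja | max ⇒ 0.34
• Introduces the notion of "coverage"
  ◦ Apparently introduced because parameter optimization and comparison against other rules were not feasible at the time (?)
• Low accuracy for both en and ja
  ◦ ja: linguistic information alone also gives some accuracy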

1-3. Findings: Ward, Nigel G. and Wataru Tsukahara. "Prosodic features which cue back-channel responses in English and Japanese." Journal of Pragmatics 32 (2000): 1177-1207.
• Simple rules give low accuracy
  ◦ With robots and characters, however, subjective questionnaire ratings can still be good
    ▪ 西田麻希子, 渡辺富夫, 石井裕, 音声相槌を伴う音声駆動型身体引き込みキャラクタシステム, 日本機械学会論文集, 2019, 85 巻, 880 号, p. 19-00159
    ▪ Ishi, Carlos Toshinori, Chaoran Liu, Hiroshi Ishiguro and Norihiro Hagita. "Head motion during dialogue speech and nod timing control in humanoid robots." 2010 5th ACM/IEEE International Conference on Human-Robot Interaction (HRI) (2010): 293-300.
• The rules are built from concrete observation of prosody, so papers like this may be useful when going back to first principles

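Background: clause ends
Source: 『日本語話し言葉コーパス』における節単位認定 [Clause unit identification in the Corpus of Spontaneous Japanese]
Besides prosodic cues such as IPUs, linguistic information is sometimes used as well.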

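2-1. Overview: 山口貴史, 井上昂治, 吉野幸一郎. "傾聴対話システムのための言語情報と韻律情報に基づく多様な形態の相槌の生成 (「フィールド研究とインタラクション」および一般)." (2016).
• Prediction approach
  ◦ IPU-based model
    ▪ (A) Clause ends with a backchannel
    ▪ (B) Clause ends without a backchannel + IPU ends that are not clause ends
• Corpus
  ◦ One-on-one dialogues (2 school counselors, 8 university students)
    ▪ 上里美樹, 吉野幸一郎, 高梨克也 他. 傾聴対話における相槌の韻律的特徴の同調傾向の分析. 言語・音声理解と対話処理研究会 / 人工知能学会 [編] 70:2014.3.5 p.7-13
  ◦ 8 dialogues x 10-15 min (first half of 20-30 min personal-story talks)
• Annotation
  ◦ Three annotators (1 man, 2 women) listened to audio containing each IPU end and judged (*1)
• Classifier
  ◦ Logistic regression
• Features
  ◦ Linguistic + prosodic features, 150 ms (both are complex...) (*2)
• Classes
  ◦ (A) Each backchannel form possible / not possible
  ◦ (B) Backchannel possible / not possible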

2-2. Prediction results: 山口貴史, 井上昂治, 吉野幸一郎. "傾聴対話システムのための言語情報と韻律情報に基づく多様な形態の相槌の生成 (「フィールド研究とインタラクション」および一般)." (2016).
f_score | avg | max ⇒ 0.656
• Linguistic + prosodic features score highest
  ◦ Results for linguistic-only and prosodic-only features are far apart

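2-3. Findings: 山口貴史, 井上昂治, 吉野幸一郎. "傾聴対話システムのための言語情報と韻律情報に基づく多様な形態の相槌の生成 (「フィールド研究とインタラクション」および一般)." (2016).
• Accurate but complex...
• The annotation method is a useful reference
  ◦ The three annotators' results barely agree (= backchannels are individual-specific)
  ◦ "うん" (un) repeated three or more times is an expression of emotion
  ◦ Among "あー" (aa), "はー" (haa), and "へー" (hee), "はー" is the most flexible to use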

Background: Filler
Source: Hara, Kohei, Koji Inoue, Katsuya Takanashi and Tatsuya Kawahara. "Prediction of Turn-taking Using Multitask Learning with Prediction of Backchannels and Fillers." INTERSPEECH (2018).
A concept distinct from aizuchi (= backchannel): an utterance that signals "it is my turn to speak."
> fillers are used by the prospective speakers to indicate a will to take a turn.

3-1. Overview: Hara, Kohei, Koji Inoue, Katsuya Takanashi and Tatsuya Kawahara. "Prediction of Turn-taking Using Multitask Learning with Prediction of Backchannels and Fillers." INTERSPEECH (2018).
• Prediction approach
  ◦ IPU-based model
• Corpus
  ◦ Dialogues with an android operated behind the scenes (by four women)
    ▪ K. Inoue, P. Milhorat, D. Lala, T. Zhao, and T. Kawahara, "Talking with ERICA, an autonomous android." in SIGdial, 2016, pp. 212–215.
  ◦ 15 sessions x 10 min x 2 types (work talk / personal story)
• Annotation
  ◦ Labeled SWITCH/KEEP, BC/NOT, Filler/NOT (*1)
  ◦ How the training targets were extracted is unclear
• Classifier
  ◦ LSTM
• Features
  ◦ Prosodic features, 1000 ms (*2)
  ◦ 200 ms <= IPU < 400 ms
• Classes
  ◦ Each label predicted separately in [0, 1]
    ▪ Models: two variants, separate (baseline) / shared (integrated)

3-2. Prediction results: Hara, Kohei, Koji Inoue, Katsuya Takanashi and Tatsuya Kawahara. "Prediction of Turn-taking Using Multitask Learning with Prediction of Backchannels and Fillers." INTERSPEECH (2018).
f_score | BC | avg | max ⇒ 0.603
• BC: relatively accurate, but accuracy varies by task
  ◦ Recall is high, around 0.8 in both cases
• (Reference) Filler: max ⇒ 0.460
• (Reference) SWITCH/KEEP: max ⇒ 0.910

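3-3. Findings: Hara, Kohei, Koji Inoue, Katsuya Takanashi and Tatsuya Kawahara. "Prediction of Turn-taking Using Multitask Learning with Prediction of Backchannels and Fillers." INTERSPEECH (2018).
• Simple architecture, which is good
• High accuracy from prosodic information alone (BC, SWITCH/KEEP)
• Filler accuracy was not high
  ◦ Perhaps that is how it is in real speech too
  ◦ Needs consideration, including whether to build it into the prediction engine at all
• How the training targets were extracted is a mystery
  ◦ It is not described, and the corpus is self-built, so this is a concern
  ◦ How was ERICA's speech output?
    ▪ Judging from other papers, preset synthesized speech
    ▪ Would an index have sufficed instead of the audio itself?
• The model hyperparameters are of interest
  ◦ integrated: L = α × L_turn + β × (L_bc + L_filler)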
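The integrated loss L = α × L_turn + β × (L_bc + L_filler) can be written out as a short sketch. Binary cross-entropy as the per-task loss, the default values of `alpha` and `beta`, and the dictionary layout are all assumptions for illustration; the slides note that the actual hyperparameters are not known.

```python
import math


def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)


def multitask_loss(pred, target, alpha=1.0, beta=0.5):
    """L = alpha * L_turn + beta * (L_bc + L_filler).

    pred / target: dicts with keys "turn", "bc", "filler".
    alpha and beta are placeholder values, not taken from the paper.
    """
    l_turn = binary_cross_entropy(pred["turn"], target["turn"])
    l_bc = binary_cross_entropy(pred["bc"], target["bc"])
    l_filler = binary_cross_entropy(pred["filler"], target["filler"])
    return alpha * l_turn + beta * (l_bc + l_filler)


if __name__ == "__main__":
    pred = {"turn": [0.9, 0.2], "bc": [0.7, 0.1], "filler": [0.3, 0.4]}
    target = {"turn": [1, 0], "bc": [1, 0], "filler": [0, 1]}
    print(multitask_loss(pred, target))
```

Setting `beta=0` reduces this to turn-taking prediction alone, which makes it easy to check how much the auxiliary BC and filler tasks contribute.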

4-1. Overview: Ruede, Robin, Markus Müller, Sebastian Stüker and Alexander H. Waibel. "Yeah, Right, Uh-Huh: A Deep Learning Backchannel Predictor." IWSDS (2017).
• Prediction approach
  ◦ Continuous model
• Corpus
  ◦ Public corpus
    ▪ J. J. Godfrey, E. C. Holliman and J. McDaniel, "SWITCHBOARD: telephone speech corpus for research and development," [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing, 1992.
  ◦ 2438 conversations
    ▪ Only data in which one party speaks one-sidedly for 5 seconds or more
• Annotation
  ◦ top 150 most common unique BC
  ◦ Public alignment data
    ▪ Harkins, D., et al.: ISIP switchboard word alignments (2003). URL https://www.isip.piconepress.com/projects/switchboard/
• Classifier
  ◦ LSTM
• Features
  ◦ Prosodic features (power, pitch, f0), tried in various configurations
    ▪ lengths (500ms to 2000ms), strides (1 to 4 frames)
  ◦ word2vec also applied
• Classes
  ◦ Predict No-BC / BC
    ▪ No-BC training data: speech from a few seconds before a BC is given
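The No-BC sampling idea (taking negatives from speech a few seconds before each backchannel) can be sketched as follows. The 1.5 s window (chosen inside the 500 ms to 2000 ms range listed above), the 2.0 s negative offset, and the function name are assumptions for illustration; the slides only say "a few seconds before."

```python
def make_training_windows(bc_onsets_s, window_s=1.5, negative_offset_s=2.0):
    """Build (start_s, end_s, label) windows around backchannel onsets.

    Positive windows (label 1) end at the BC onset.
    Negative windows (label 0) end negative_offset_s earlier (assumed value).
    Windows that would start before the recording begins are dropped.
    """
    samples = []
    for onset in bc_onsets_s:
        pos_start = onset - window_s
        neg_end = onset - negative_offset_s
        neg_start = neg_end - window_s
        if pos_start >= 0:
            samples.append((pos_start, onset, 1))
        if neg_start >= 0:
            samples.append((neg_start, neg_end, 0))
    return samples


if __name__ == "__main__":
    # Hypothetical BC onset times in seconds within one conversation.
    for window in make_training_windows([5.0, 12.5]):
        print(window)
```

This yields roughly balanced BC and No-BC examples, with each negative drawn from the same speaker context as its positive.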

4-2. Prediction results: Ruede, Robin, Markus Müller, Sebastian Stüker and Alexander H. Waibel. "Yeah, Right, Uh-Huh: A Deep Learning Backchannel Predictor." IWSDS (2017).
f_score | avg | max ⇒ 0.390
• The max comes from the prediction that includes word2vec
• 0.375 even with prosodic information alone
• Recall is relatively high, but at 0.5 it still misses one time in two

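4-3. Findings: Ruede, Robin, Markus Müller, Sebastian Stüker and Alexander H. Waibel. "Yeah, Right, Uh-Huh: A Deep Learning Backchannel Predictor." IWSDS (2017).
• Large-scale training on a corpus does not yield a high F1 score
  ◦ Lumping all backchannels together does not yield accuracy either
• Ambiguous sounds need to be excluded from the corpus
  ◦ Laughter, and the word "uh"
• Also describes practical logic for emitting backchannels from the predictions
  ◦ Emit a backchannel at the moment the Gaussian-smoothed value reaches its maximum
  ◦ Demo video https://streamable.com/dycu1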
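The output logic described above (smooth the predictor output with a Gaussian filter, then emit a backchannel at a maximum) might look like the following offline sketch. The kernel width `sigma`, the `threshold`, the edge handling, and the function names are assumptions for illustration. A symmetric kernel also looks at future frames, so a live system would need a truncated or delayed variant.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalized Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(k * k) / (2 * sigma * sigma)) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(values, sigma=2.0):
    """Gaussian-smooth a sequence, clamping indices at the edges."""
    radius = int(3 * sigma)
    kernel = gaussian_kernel(sigma, radius)
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * values[j]
        out.append(acc)
    return out


def trigger_frames(probs, sigma=2.0, threshold=0.5):
    """Frame indices where the smoothed BC probability peaks above threshold."""
    s = smooth(probs, sigma)
    return [
        i
        for i in range(1, len(s) - 1)
        if s[i] > threshold and s[i] > s[i - 1] and s[i] >= s[i + 1]
    ]


if __name__ == "__main__":
    # Synthetic per-frame BC probabilities with one burst of high values.
    probs = [0.0] * 20 + [1.0] * 9 + [0.0] * 20
    print(trigger_frames(probs))
```

Smoothing turns a noisy burst of high probabilities into a single peak, so one backchannel is emitted per burst rather than one per frame.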

5-1. Overview: Ortega, Daniel, Chia-Yu Li and Ngoc Thang Vu. "OH, JEEZ! or UH-HUH? A Listener-Aware Backchannel Predictor on ASR Transcriptions." ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2020): 8064-8068.
• Prediction approach
  ◦ IPU-based model
• Corpus
  ◦ Public corpus
    ▪ J. J. Godfrey, E. C. Holliman and J. McDaniel, "SWITCHBOARD: telephone speech corpus for research and development," [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing, 1992.
  ◦ 2438 conversations, 543 speakers, 10 [conversations/speaker]
• Annotation
  ◦ Backchannels classified manually; each utterance then annotated automatically from that classification
  ◦ No-BC 50%, BC-Continuer 22.5%, BC-Assessment 27.5% (*1)
  ◦ 670 unique BC
  ◦ Alignment with GMM/HMM
  ◦ Manual / automatic (ASR) transcriptions
• Classifier
  ◦ Time-based CNN
• Features
  ◦ 13-dimensional MFCCs: 2000 ms + Listener index (*2)
  ◦ word2vec: 15 words
• Classes
  ◦ Predict No-BC / BC-Continuer / BC-Assessment
    ▪ Predicted separately with prosodic, linguistic, and linguistic + prosodic inputs
*1 BC-Continuer: uh-huh, mm-hm. BC-Assessment: oh, jeez!, yeah
*2 Listener index: an ID assigned to conversations by the same interlocutor

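5-2. Prediction results: Ortega, Daniel, Chia-Yu Li and Ngoc Thang Vu. "OH, JEEZ! or UH-HUH? A Listener-Aware Backchannel Predictor on ASR Transcriptions." ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2020): 8064-8068.
accuracy | max ⇒ 0.589
• Judging from the breakdown of the max (*1), recall is low
• No-BC ≈ SWITCH
  ◦ Middling accuracy compared with the other papers
• Relatively accurate even with prosody alone
• Relatively accurate even with automatic speech recognition transcripts (ATs)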

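5-3. Findings: Ortega, Daniel, Chia-Yu Li and Ngoc Thang Vu. "OH, JEEZ! or UH-HUH? A Listener-Aware Backchannel Predictor on ASR Transcriptions." ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2020): 8064-8068.
• The Listener index improves accuracy
  ◦ Backchannels are individual-specific
• Why accuracy falls short of paper 3 is unclear
  ◦ Are MFCCs alone insufficient?
    ▪ Paper 3 goes as far as including z-scores
  ◦ Because BC-Continuer and BC-Assessment were mixed during training?
    ▪ Paper 3 classifies them separately from the training data
  ◦ A difference between Japanese and English?
    ▪ In paper 1, accuracy was lower for English
  ◦ Was the Listener index embedding insufficient?
    ▪ It would not be surprising if the embedding automatically absorbed the elements common across speakers
    ▪ The training data itself may need to change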

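Summary
• Predicting SWITCH/KEEP and BC from prosody alone looks feasible
  ◦ Since the model is an LSTM, inference speed is a concern; I want to build a prototype to check
• Backchannels appear to be individual-specific
  ◦ A dataset drawn from a small number of people seems better
  ◦ How to build the dataset is an open question
• Real-time prediction is still maturing
  ◦ Since the model is an LSTM, the load on a smartphone is a concern
  ◦ Even rule-based methods do not hurt questionnaire results (according to other studies)
    ▪ Real-time capability may be fine to postpone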


Sadahiro Yoshikawa
