Paper review: Exploring Semi-Supervised Learning for Predicting Listener Backchannels

Vidit Jain, Maitree Leekha, Rajiv Ratn Shah, and Jainendra Shukla. 2021. Exploring Semi-Supervised Learning for Predicting Listener Backchannels. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (CHI '21). Association for Computing Machinery, New York, NY, USA, Article 395, 1–12. https://doi.org/10.1145/3411764.3445449

Overview
Exploring Semi-Supervised Learning for Predicting Listener Backchannels
https://dl.acm.org/doi/abs/10.1145/3411764.3445449
• Semi-automates BC* annotation
• Predicts not only whether a BC occurs but also what kind of BC to output
• Considers multimodal BCs
• Reproduces character traits based on the Big Five personality traits
* BC = backchannel: a listener's reaction to the conversation (nods, short verbal acknowledgements, smiles, etc.)

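What I expected from this paper
1. Hints on how to cut down annotation work
2. Hints on predicting the content of backchannel output (especially with few labels)
3. The value of using multimodal data
4. Hints on how to reproduce character traits
→ It was less useful than I had expected.
→ Items 1 and 2 were somewhat useful. Items 3 and 4, not so much.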

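Summary: Dataset
• A Hindi conversation corpus built by the same lab
• 38 participants (24 male, 14 female)
• Mean age 21.47 years, standard deviation 2.25 years
• Mostly PhD students
• One-on-one conversations
• Participants chose the conversation topic
• Multimodal (video + audio)
• Big Five traits
  ◦ Goldberg's questionnaire
  ◦ Each trait binarized into 2 classes by score
Source: Khan, Shahid Nawaz, Maitree Leekha, Jainendra Shukla and Rajiv Ratn Shah. "Vyaktitv: A Multimodal Peer-to-Peer Hindi Conversations based Dataset for Personality Assessment." 2020 IEEE Sixth International Conference on Multimedia Big Data (BigMM) (2020): 103-111.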

Summary: Annotation method
• Labels were added to the video + audio using ELAN
• 3 annotators; a label was adopted when at least 2 agreed
• Fleiss' κ
  ◦ BC presence (0.86)
  ◦ nod (0.71), head-shake (0.64), mouth (0.63), eyebrow (0.45), short-utterance (0.49)
• BC start/end annotations that differed by no more than 1 second were merged
  ◦ The averaged timestamps were used
• Total: 2781 backchannel signals
  ◦ Eyebrow was excluded because there were too few instances
• Mean BC duration: 1.8 sec
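The agreement figures above are Fleiss' κ values. As a rough illustration of how such a value and the "2 of 3 agree" rule could be computed, here is a minimal pure-Python sketch. The toy ratings, the label names `"bc"` / `"none"`, and the helper `majority_label` are assumptions made for this example and are not taken from the paper.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items that were each labeled by the same number of raters.

    ratings: list of per-item lists, one label per rater.
    categories: all labels that may occur.
    Note: undefined (division by zero) if every rating uses a single category.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(item) for item in ratings]

    # Observed agreement: fraction of agreeing rater pairs, averaged over items.
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall share of each category.
    p_cat = [sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories]
    p_e = sum(p ** 2 for p in p_cat)

    return (p_bar - p_e) / (1 - p_e)


def majority_label(item_labels, min_votes=2):
    """Hypothetical helper: keep a label only if at least min_votes raters chose it."""
    label, votes = Counter(item_labels).most_common(1)[0]
    return label if votes >= min_votes else None


# Toy data: 6 segments, 3 annotators each (invented for illustration).
toy = [
    ["bc", "bc", "bc"],
    ["bc", "bc", "none"],
    ["none", "none", "none"],
    ["none", "none", "bc"],
    ["bc", "bc", "bc"],
    ["none", "none", "none"],
]
kappa = fleiss_kappa(toy, ["bc", "none"])
adopted = [majority_label(item) for item in toy]
```

In practice a library routine such as `fleiss_kappa` in `statsmodels.stats.inter_rater` would likely be preferable; the hand-written version is only meant to make the formula visible.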

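Summary: Negative samples
• Segments that satisfy the following conditions
  ◦ The person is not speaking
  ◦ No BC occurs in the same time frame (probably 3-second intervals)
• Duration: drawn at random from [1.06, 5.43] sec
  ◦ Matched to the BC durations
  ◦ The start time is a mystery: are segments taken from the beginning, or from the middle?
• Total: 2670 negative instances
Note: text shown in light gray on the slides is my personal impression, or my own guesses filling in what the paper leaves unsaid.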
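Since the paper leaves the start time unspecified, one plausible reading of the sampling procedure is sketched below. Everything beyond the duration range [1.06, 5.43] sec is an assumption: the function name `sample_negative`, the `free_intervals` input (spans where the person is silent and no BC was annotated), and in particular the choice of a uniformly random start time.

```python
import random


def sample_negative(free_intervals, min_dur=1.06, max_dur=5.43, rng=random):
    """Draw one negative segment from spans with no speech and no backchannel.

    free_intervals: list of (start, end) times in seconds (hypothetical input).
    Returns (start, end), or None if no span is long enough for the drawn duration.
    """
    # Duration is drawn from the same range as the observed BC durations.
    duration = rng.uniform(min_dur, max_dur)

    # Only spans that can contain the whole segment are eligible.
    candidates = [(s, e) for s, e in free_intervals if e - s >= duration]
    if not candidates:
        return None

    s, e = rng.choice(candidates)
    # ASSUMPTION: the start is uniform within the span; the paper does not say.
    start = rng.uniform(s, e - duration)
    return (start, start + duration)


rng = random.Random(0)
segment = sample_negative([(0.0, 10.0), (12.0, 13.0)], rng=rng)
```

Whether the start is taken from the beginning, the middle, or at random changes how close negatives sit to real BCs, which is why this detail matters when comparing accuracy with other papers.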

Summary: Features
• Extracted from audio and video
  ◦ Facial expressions: OpenFace
  ◦ Audio: pyAudioAnalysis

Summary: Semi-supervised learning
• Build an identification model that predicts the labels
  ◦ Two kinds: "BC presence" and "BC content"
• Model unit: one per participant group
  ◦ The 38 participants were split into 6 groups (also taking personality traits into account)
  ◦ One reason was that some participants produced almost no BCs
• BC content is coarsely classified as visual, verbal, or both
• The model is built from labeled data
  ◦ Choose an initial labeled set (L) and train a model (C)
  ◦ Run inference on the unlabeled data (U) and add a sample to (L) when its prediction score is 0.9 or higher
  ◦ Retrain and re-add until no more data is added
• The fraction of the dataset used to build the model (x%) is varied
  ◦ Tests how small x can be while still being effective
• Tests which learning method works best
  ◦ Random Forests (RF)
  ◦ Support Vector Machine Classifier (SVC)
  ◦ K-Nearest Neighbour Classifier (KNN)
  ◦ AdaBoost (ADA)
  ◦ ResNet (ResNet)
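The L / C / U loop described on this slide is a form of self-training. The sketch below shows only the control flow, under stated assumptions: the nearest-centroid classifier and its softmax-style confidence are toy stand-ins chosen to keep the example dependency-free (the paper compares RF, SVC, KNN, AdaBoost and ResNet), and "prediction score" is interpreted here as the confidence of the predicted class.

```python
import math


def fit_centroids(X, y):
    """Toy stand-in classifier (assumption): one mean vector per class."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict_with_confidence(centroids, x):
    """Return (label, confidence) from a softmax over negative distances."""
    weights = {label: math.exp(-math.dist(x, c)) for label, c in centroids.items()}
    total = sum(weights.values())
    label = max(weights, key=weights.get)
    return label, weights[label] / total


def self_train(X_lab, y_lab, X_unlab, threshold=0.9):
    """Grow the labeled set L from the unlabeled pool U until nothing is added."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    while True:
        model = fit_centroids(X_lab, y_lab)      # train C on the current L
        remaining, added = [], 0
        for x in pool:                           # infer on U
            label, confidence = predict_with_confidence(model, x)
            if confidence >= threshold:          # confident: move into L
                X_lab.append(x)
                y_lab.append(label)
                added += 1
            else:
                remaining.append(x)
        pool = remaining
        if added == 0:                           # stop when no sample was added
            return model, X_lab, y_lab, pool


model, X_final, y_final, leftover = self_train(
    [(0.0, 0.0), (10.0, 10.0)], ["none", "bc"],
    [(0.5, 0.5), (9.5, 9.5), (5.0, 5.0)],
)
```

In this toy run the two points near a labeled example are absorbed in the first round, while the ambiguous midpoint never reaches the 0.9 threshold and stays unlabeled. For real experiments, scikit-learn's `SelfTrainingClassifier` offers a comparable loop around any probabilistic estimator.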

Summary: Evaluating the semi-supervised learning
• Performance evaluation of whether the annotations are correct
  ◦ Data split into 5 sets and evaluated by the average score
  ◦ For some reason, the evaluation is not restricted to the test set
• Highest classification accuracy:
  ◦ BC presence: ResNet
  ◦ BC content: ResNet, Random Forests
• High performance with the least data:
  ◦ Random Forests, x=25%
  ◦ ResNet, x=35%
• Adopted for the semi-supervised labels:
  ◦ Random Forests, x=25%

Summary: Prediction method
• Different models for "BC presence" and "BC content"
  ◦ BC presence: LSTM
    ▪ Is it also trained on frames other than the prediction target?
  ◦ BC content: all of the following were tested
    ▪ Random Forests (RF)
    ▪ Support Vector Machine Classifier (SVC)
    ▪ AdaBoost (ADA)
    ▪ K-Nearest Neighbors (KNN)
    ▪ Multi-Layered Perceptron model (MLP)
• The data for BC content prediction was preprocessed
  ◦ visual: 1593, verbal: 326, both: 835
  ◦ SVM-SMOTE was used to handle the class imbalance
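To give a feel for what oversampling does with these class counts, here is a minimal sketch of plain SMOTE-style interpolation. It is deliberately not SVM-SMOTE: that variant concentrates synthetic samples near an SVM decision boundary, and in practice the `SVMSMOTE` class in `imblearn.over_sampling` would be the natural tool. The function name `smote_like`, the toy points, and the assumption that every class is raised to the majority count are all illustrative.

```python
import math
import random


def smote_like(minority, n_new, k=3, rng=random):
    """Create n_new synthetic points by interpolating between minority neighbours.

    minority: list of feature tuples from the under-represented class (needs >= 2).
    """
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen base point.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + t * (o - b) for b, o in zip(base, other)))
    return synthetic


# Class counts from the slide; balancing to the majority class is an ASSUMPTION.
counts = {"visual": 1593, "verbal": 326, "both": 835}
target = max(counts.values())
needed = {label: target - n for label, n in counts.items()}

toy_minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(toy_minority, 10, rng=random.Random(0))
```

Under that balancing assumption, `verbal` would need 1267 synthetic samples against only 326 real ones, which is worth keeping in mind when reading the per-class accuracy.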

Summary: Prediction accuracy
• Prediction results (x=25%)
  ◦ BC presence: the semi-supervised model is also reasonably accurate
  ◦ BC content: the semi-supervised model is also highly accurate
• verbal has the best accuracy
  ◦ visual and both gave similar results
    ▪ Possibly because verbal was the only class with little training data
  ◦ For both, the semi-supervised model was more accurate
    ▪ Perhaps the annotations varied a lot?

Summary: Impression evaluation method
• Characters and their motions were created with Apple's Memoji
  ◦ Two types: an extroverted character and an introverted character
• BC content prediction was refined
  ◦ Even when the prediction is verbal or visual, personality-dependent rules and inverse transform sampling decide whether the output is unimodal or multimodal
  ◦ The specific gesture is also determined by rules
    ▪ Being rule-based makes it less appealing...
• Separate models were trained for each personality trait
• Evaluators watched the video calls and answered a questionnaire
  ◦ 27 university students who are native Hindi speakers
• The following 3 patterns were scored separately
  ◦ Random output
  ◦ Supervised learning (ADA)
  ◦ Semi-supervised learning (ADA, RF and ResNet, x=25%)
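Inverse transform sampling over a two-outcome distribution is simple enough to sketch. The probabilities below are an assumption: they reuse the corpus-wide multimodal:unimodal ratios quoted in the personality aside of this deck (51:49 for extroverts, 35:65 for introverts), and the paper's actual rules may use different values. The names `MODALITY_PROBS` and `choose_modality` are hypothetical.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def inverse_transform_sample(outcomes, probs, u):
    """Map a uniform draw u in [0, 1) to an outcome through the cumulative distribution."""
    cdf = list(accumulate(probs))
    index = bisect_right(cdf, u)
    # Clamp guards against floating-point rounding in the last CDF entry.
    return outcomes[min(index, len(outcomes) - 1)]


# ASSUMPTION: (P(multimodal), P(unimodal)) per personality, taken from the
# overall ratios reported in the review, not from the paper's rule table.
MODALITY_PROBS = {
    "extrovert": (0.51, 0.49),
    "introvert": (0.35, 0.65),
}


def choose_modality(personality, rng=random):
    """Hypothetical helper: decide whether a predicted BC is rendered uni- or multimodal."""
    u = rng.random()
    return inverse_transform_sample(
        ("multimodal", "unimodal"), MODALITY_PROBS[personality], u
    )


choice = choose_modality("introvert", random.Random(1))
```

With these numbers, a draw of `u = 0.4` yields a multimodal BC for the extroverted character but a unimodal one for the introverted character, which is the kind of personality-dependent difference the avatars are meant to show.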

Summary: Impression evaluation results
• For both BC presence and BC content, the semi-supervised models received better questionnaire scores (!)
  ◦ The authors suggest that semi-supervised learning may have reduced the noise in the labels
  ◦ It is a real mystery that semi-supervised learning gives better results for BC content. Can the prediction accuracy over visual, verbal, and both alone produce that much of a difference?
• The questionnaire results for personality traits also seem to have been good

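Summary: What I learned from this paper
1. Hints on how to cut down annotation work
   a. Use the κ values as a guide
   b. Have a model learn the labels (needs further investigation)
2. Hints on predicting the content of backchannel output (especially with few labels)
   a. If a class has too few samples, exclude it
   b. If the classes are imbalanced, use imbalanced-learn (needs further investigation)
3. The value of using multimodal data
   a. It can be used as an output category (unimodal / multimodal)
   b. Are video-only and audio + video hard to tell apart? Or did the imbalance handling not work well?
4. Hints on how to reproduce character traits
   a. It was rule-based in the end, so not that useful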

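Points I found questionable
• Semi-supervised learning
  ◦ I could not tell whether the "0.9 or higher" criterion is appropriate
    ▪ There is apparently a concept called the model assumption, which seems to need consideration here. Quoting the article below: "A model assumption is an assumption about the data that makes it possible to use unlabeled data for training. It strongly influences the resulting classifier, and adopting an assumption that differs greatly from the true one can lead to training that is far from ideal."
      https://qiita.com/dcm_ishikawa/items/584cd373f49dd917566a
• Prediction results
  ◦ The accuracy is high even compared with other papers, but how the negative samples were obtained is unclear
• Impression evaluation
  ◦ The personality ratings in the questionnaire are probably also influenced by the characters' appearance, so it is hard to draw conclusions
  ◦ BC presence is judged at 3-second intervals, so the system lacks real-time responsiveness
  ◦ (As I think every time) it bothers me that a simple rule-based baseline is not included among the competitors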

Aside: Annotation effort
The annotation work apparently took about 90 hours (30 hours per person).
• 25 conversations (50 individual recordings) with each one lasting 16 minutes and 6 seconds on an average.
• Even in the context of the present study, the annotation process took around 90 hours
  ◦ 30 hours each annotator, taking the average time to annotate one side of a conversation as 35 minutes. The total amount of recorded content being ∼ 13.5 hours long.
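The quoted figures can be cross-checked with a few lines of arithmetic. The variable names are mine; the inputs (50 recordings, 35 minutes of annotation per recording, 3 annotators, 16 min 6 s average length) are the ones quoted above.

```python
recordings = 50                       # 25 conversations, two sides each
annotation_min_per_recording = 35     # average time to annotate one side
annotators = 3
recording_length_min = 16 + 6 / 60    # 16 minutes 6 seconds

# Annotation time per annotator and in total, in hours.
hours_per_annotator = recordings * annotation_min_per_recording / 60
total_annotation_hours = hours_per_annotator * annotators

# Total recorded material, in hours.
recorded_hours = recordings * recording_length_min / 60

# How much slower than real time the annotation is.
slowdown = annotation_min_per_recording / recording_length_min
```

This gives roughly 29.2 hours per annotator and 87.5 hours in total, consistent with the rounded "30 hours" and "around 90 hours", and about 13.4 hours of recordings, consistent with "∼ 13.5 hours". Annotating therefore took a little over twice real time per annotator.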

Aside: Differences by personality
• The nature of the BCs is quantified with the authors' own ratio
  ◦ τ = multimodal BCs / unimodal BCs
• τ was similar across the traits (no significant difference)
• Extroverts and introverts showed different tendencies in τ (though not a significant difference)
  ◦ Overall mean of multimodal BC : unimodal BC
    ▪ Extroverts (51:49), introverts (35:65)
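As a worked example of the ratio, τ can be computed directly from the overall shares quoted above. Note the assumption: this takes the ratio of the averaged shares, which is not necessarily the same as averaging each participant's own τ, and the paper may define the aggregate differently.

```python
def tau(multimodal_share, unimodal_share):
    """tau = multimodal BCs / unimodal BCs."""
    return multimodal_share / unimodal_share


tau_extrovert = tau(51, 49)   # slightly above 1: multimodal BCs are marginally more common
tau_introvert = tau(35, 65)   # well below 1: unimodal BCs dominate
```

So τ is about 1.04 for extroverts and about 0.54 for introverts, which is the gap the avatar rules try to reproduce.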
