Paper introduction: "Multimodal Activation: Awakening Dialog Robots without Wake Words (SIGIR2021)"

Atsushi Keyaki (欅惇志), Denso IT Laboratory [email protected]
Multimodal Activation: Awakening Dialog Robots without Wake Words (SIGIR2021)
IRReading2021秋
Note: figures and tables are quoted from the paper and the Web.

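What this paper solves
• Removes the wake word (the cue that tells the robot an instruction is coming) from dialog robots (AI speakers)
(Figure annotations: "A paper like this at SIGIR...!", "Inconvenient")
2021.10.30 IRReading2021秋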

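Strategy for omitting the wake word
• Input
  o Video (face images + audio) + utterance content (text)
    • In other words, utterances from outside the camera frame are not supported
• Conditions to activate (wake up the dialog robot)
  o The face and the audio match (consistency)
    • The person being recognized is the one speaking
  o There is an intention to talk (intention)
    • The speaker is giving an instruction to the dialog robot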

MAS (Multimodal Activation Scheme)
(Figure labels: consistency, intention)

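Summary
• Contributions
  o Dataset creation
  o Decomposition of the multimodal activation decision task into subproblems
    • consistency, intention
  o Use of facial landmark information for consistency
    • Uses detailed information about the face (movement relative to its resting state)
    • Perhaps it learned well because it uses hand-crafted dimensions (rather than leaving feature extraction to a DNN)?
• Evaluation experiments
  o F-measure: 0.924 (P: 0.903, R: 0.947)
  o Other experiments
    • Component swapping, ablation tests, micro-analysis, etc.
• The authors are well versed in speech processing, image processing, and natural language processing alike, which was impressive (words fail me)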

Audio-Visual Consistency Detection

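Audio-Visual Consistency Detection
• Facial Landmark Feature
  o 68 landmark points (extracted with a library)
  o The distance of each point from the face centroid is used as a feature
    • Separately along the vertical and horizontal axes
• Speech Feature
  o Sampled at a frame rate of 30
  o Converted to features with MFCC (an existing method)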
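The landmark feature on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `landmark_features` is hypothetical, and taking the absolute per-axis offset is an assumption, since the slide only says that the distance from the centroid is computed for each axis. Landmark detection and MFCC extraction are left to existing libraries and are not shown.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) pixel coordinates of one landmark


def landmark_features(points: List[Point]) -> List[float]:
    """Per-axis distance of every landmark from the face centroid.

    Returns the horizontal distances followed by the vertical ones,
    so 68 landmarks give a 136-dimensional vector.
    Assumption: "distance" means the absolute offset along each axis.
    """
    n = len(points)
    cx = sum(x for x, _ in points) / n  # centroid, x coordinate
    cy = sum(y for _, y in points) / n  # centroid, y coordinate
    dx = [abs(x - cx) for x, _ in points]
    dy = [abs(y - cy) for _, y in points]
    return dx + dy


# Toy example: four corners of a square centred on (1, 1).
square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
print(landmark_features(square))
```

Computing this vector for every video frame yields a sequence that can be aligned with speech features sampled at the same frame rate.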

Audio-Visual Consistency Detection

Audio-Visual Consistency Detection
1. Convolve each feature set
2. Concatenate the convolution outputs (distributed representations)
3. Feed them into a multilayer perceptron for training
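The three steps above can be illustrated with a forward pass written in plain Python. This is only a sketch of the data flow (convolve, concatenate, MLP): the kernel sizes, the ReLU and sigmoid activations, the single hidden layer, and every weight value are assumptions made for illustration, and training is omitted entirely. All function names are hypothetical.

```python
import math
from typing import List


def conv1d(seq: List[float], kernel: List[float]) -> List[float]:
    """'Valid' 1-D convolution (cross-correlation, as in most DNN libraries)."""
    k = len(kernel)
    return [sum(seq[i + j] * kernel[j] for j in range(k))
            for i in range(len(seq) - k + 1)]


def relu(xs: List[float]) -> List[float]:
    return [max(0.0, x) for x in xs]


def dense(xs: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Fully connected layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, xs)) + b
            for row, b in zip(weights, bias)]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def consistency_score(face_seq, speech_seq, face_kernel, speech_kernel,
                      w_hidden, b_hidden, w_out, b_out) -> float:
    face_repr = relu(conv1d(face_seq, face_kernel))        # step 1 (face)
    speech_repr = relu(conv1d(speech_seq, speech_kernel))  # step 1 (speech)
    joint = face_repr + speech_repr                        # step 2: concatenate
    hidden = relu(dense(joint, w_hidden, b_hidden))        # step 3: MLP
    return sigmoid(dense(hidden, [w_out], [b_out])[0])


# Hypothetical, untrained toy weights; they only demonstrate the shapes.
score = consistency_score(
    face_seq=[0.0, 1.0, 0.0, 1.0], speech_seq=[0.5, 0.5, 1.0, 1.0],
    face_kernel=[1.0, -1.0], speech_kernel=[0.5, 0.5],
    w_hidden=[[1.0] * 6, [-1.0] * 6], b_hidden=[0.0, 0.0],
    w_out=[1.0, 1.0], b_out=-3.25,
)
print(score)
```

In practice a deep learning framework would replace these helpers. The point here is only that each modality gets its own convolution before the representations are joined.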

Semantic Talking Intention Inference

Semantic Talking Intention Inference
• Textual Feature
  o ASR: automatic speech recognition
    • Converts speech to text
  o The text is fed into XLNet to obtain an embedding (feature vector)

Semantic Talking Intention Inference

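Semantic Talking Intention Inference
1. Extract the data for which intention inference succeeded
  o Stack them into a matrix (Positive Transcript Matrix)
2. Build the Topic Pattern Matrix with MF
  o MF (matrix factorization): a dimensionality reduction technique
  o Looking at the data revealed patterns in the dialogs
    • The aim is to extract topics with MF
3. Compute the cosine similarity with the text features
4. Feed the result into a multilayer perceptron for training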
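Step 3 can be sketched as below, assuming the Topic Pattern Matrix is already available and that each of its rows has the same dimensionality as the text embedding. The factorization itself is not shown because the slide does not say which MF variant is used. The names `cosine` and `topic_similarities`, and the toy values, are hypothetical.

```python
import math
from typing import List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def topic_similarities(text_vec: List[float],
                       topic_patterns: List[List[float]]) -> List[float]:
    """One similarity per topic pattern (row of the Topic Pattern Matrix)."""
    return [cosine(text_vec, row) for row in topic_patterns]


# Toy 2-dimensional example with three topic patterns.
topic_patterns = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sims = topic_similarities([2.0, 0.0], topic_patterns)
print(sims)
```

The resulting similarity vector is what step 4 would pass to the multilayer perceptron.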

Multimodal Activation

Multimodal Activation
1. Convolve consistency and intention separately
2. Concatenate (concat)
3. Fully connected layer

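Evaluation experiments
• Datasets
  o D{c+, t+}
    • Recruited subjects give instructions to the dialog robot while being filmed
  o D{c-, t-}
    • Recruited subjects silently mouth words while being filmed, with TV audio playing
  o D{c-, t+}
    • The video of D{c+, t+} is replaced with video from a celebrity speech corpus
  o D{c-, t+}
    • The video of D{c+, t+} is replaced at random with video from other subsets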
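The subset notation can be read as a pair of boolean flags. The sketch below assumes that `c` stands for audio-visual consistency and `t` for talking intention, and that only samples satisfying both count as positive (activate) examples, following the activation conditions stated earlier in the deck. The `Sample` type and both function names are hypothetical.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    consistent: bool  # c+ (True) or c- (False): face and audio match
    intention: bool   # t+ (True) or t- (False): speaker addresses the robot


def should_activate(sample: Sample) -> bool:
    """Activate only when both conditions hold."""
    return sample.consistent and sample.intention


def subset_name(sample: Sample) -> str:
    """Render the flags in the D{c±, t±} notation used on the slide."""
    c = "+" if sample.consistent else "-"
    t = "+" if sample.intention else "-"
    return f"D{{c{c}, t{t}}}"


for s in (Sample(True, True), Sample(False, False), Sample(False, True)):
    print(subset_name(s), should_activate(s))
```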

Experimental results
• F-measure: 0.924 (P: 0.903, R: 0.947)
  o Practical enough for real use?
• Other experiments
  o Component swapping, ablation tests, micro-analysis, etc.
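As a quick sanity check, the reported F-measure is consistent with the balanced F1 score, the harmonic mean of precision and recall. Treating it as F1 is an assumption, since the slide does not state the beta value. The function name `f_measure` is hypothetical.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta = 1.0 gives the harmonic mean of P and R."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


print(round(f_measure(0.903, 0.947), 3))  # 0.924
```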
