
End-to-end Adaptation with Backpropagation through WFST for On-device Speech Recognition System

Emiru Tsunoo
November 22, 2019

Slides for an Interspeech 2019 reading group, covering "End-to-end Adaptation with Backpropagation through WFST for On-device Speech Recognition System", presented at Interspeech 2019.


Transcript

1. End-to-end Adaptation with Backpropagation through WFST for On-device Speech Recognition System
   Sony Corporation, R&D Center
   Emiru Tsunoo, Yosuke Kashiwagi, Satoshi Asakawa, Toshiyuki Kumakura

2. Current hybrid speech recognition vs. the proposed method
   (Figure: two pipelines from input signals to text. Conventional: Signal Processing → Feature Extraction (clean speech → fbank) → Acoustic Model (trained with noise data, via classical DNN training with speech data) → state posteriors → Language Model (WFST, trained from text or hand-crafted). Proposed: the WFST is converted equivalently into a trainable NN, "ViterbiNet", and the DNN-HMM model is jointly retrained.)
   This enables end-to-end training that uses the existing models as initial values.

3. Existing method: DNN-HMM hybrid speech recognition
   (Figure: example WFST of a two-word {start, stop} recognizer with states 0-8 and arcs such as sil:∅/2.0, s:∅/0.6, t:start/3.0, p:stop/3.0; state 2 expands into states 2-1, 2-2, 2-3, the HMM model for /s/.)
   [At runtime] The optimal path is searched by combining the WFST weights with the likelihood scores (posteriors) from the acoustic model.

4. [ViterbiNet] Representing graph transitions as a matrix
   (Figure: the same example WFST with its arc weights, e.g. 2.0, 0.6, 0.3, 1.0, 3.0, collected into a state-by-state transition matrix V; entries without a corresponding arc are empty, so V is sparse. A sketch of this construction follows below.)

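To make the matrix view concrete, here is a minimal PyTorch sketch of building a trainable transition matrix V from a WFST's arcs. The arc list, state count, and variable names are hypothetical illustrations, not the paper's actual graph; weights are treated here as log-scores to be maximized, which may differ from the paper's semiring conventions.

```python
import torch

# Hypothetical arc list (src_state, dst_state, weight); a real V is read
# off the task's grammar WFST, like the {start, stop} example above.
arcs = [(0, 1, 2.0), (1, 2, 0.6), (1, 7, 0.3), (2, 3, 1.0)]
num_states = 9

# S x S transition matrix: V[i, j] holds the weight of the arc i -> j.
# Missing arcs get -inf so they can never win the max in the forward
# recursion (a real implementation keeps V sparse instead of dense).
V = torch.full((num_states, num_states), float('-inf'))
for src, dst, weight in arcs:
    V[src, dst] = weight

# Wrapping V as a Parameter makes the WFST weights trainable by backprop.
V = torch.nn.Parameter(V)
```
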
5. [ViterbiNet] Recursive forward computation
   (Figure: starting from the initial state, the AM scores for sil, s, t, aa, r, p are mapped to the corresponding WFST states at each frame, and the forward scores are updated recursively by combining them with the transition matrix V at every time step. See the sketch below.)

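A minimal sketch of the recursive forward step, assuming log-domain scores and a Viterbi (max) semiring; the function name, initialization, and sign conventions are assumptions chosen here for illustration, not the paper's exact implementation.

```python
import torch

def viterbi_forward(am_scores: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Recursive forward computation over the WFST (a sketch).

    am_scores: (T, S) log-domain AM scores already mapped to WFST states.
    V:         (S, S) transition weights, V[i, j] for the arc i -> j.
    Returns (T, S) forward scores. Because max is differentiable almost
    everywhere, gradients flow back into both the AM network and V.
    """
    T, S = am_scores.shape
    alpha = am_scores.new_full((S,), float('-inf'))
    alpha[0] = 0.0  # assume state 0 is the single initial state
    outputs = []
    for t in range(T):
        # alpha'[j] = max_i (alpha[i] + V[i, j]) + am_scores[t, j]
        alpha = (alpha.unsqueeze(1) + V).max(dim=0).values + am_scores[t]
        outputs.append(alpha)
    return torch.stack(outputs)
```
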
6. [ViterbiNet] Output-layer computation
   (Figure: forward and backward passes over the example WFST yield per-frame scores for the output words "start" and "stop".)
   Pooled output: y_utt = max_t y_t, i.e. max-pooling of the per-frame output-word scores over the utterance.
   Loss function: L = SoftmaxCrossEntropy(log y_utt, reference).
   This turns the search problem into a classification problem; a sketch of the loss follows below.

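The pooled loss on this slide can be sketched as follows; the per-frame word scores, tensor shapes, and the helper name are assumptions for illustration. Note that `F.cross_entropy` applies log-softmax internally, so the pooled scores are passed directly as logits.

```python
import torch
import torch.nn.functional as F

def utterance_loss(word_scores: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """word_scores: (T, num_words) per-frame scores for the output words
    (e.g. "start", "stop") read off the forward/backward pass.
    target: scalar long tensor holding the reference word index."""
    # Utterance pooling: y_utt = max over frames, per output word.
    pooled = word_scores.max(dim=0).values
    # SoftmaxCrossEntropy(log y_utt, reference); cross_entropy expects a
    # batch dimension, so both tensors are unsqueezed to batch size 1.
    return F.cross_entropy(pooled.unsqueeze(0), target.unsqueeze(0))
```
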
7. [ViterbiNet] Network architecture
   AM DNN → sparse affine mapping of posteriors to WFST states → forward/backward process with transition matrix V → output computation → utterance max-pooling. Together these layers form ViterbiNet; see the sketch after this slide.

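Putting the pieces together, a sketch of the block diagram as one module; it reuses `viterbi_forward` from the slide-5 sketch, and the module layout (dense mapping matrix, simplified output readout) is an assumption based on this diagram, not the paper's exact implementation.

```python
import torch

class ViterbiNet(torch.nn.Module):
    """Sketch of the block diagram above: AM DNN -> mapping to WFST states
    -> forward recursion with V -> utterance max-pooling."""

    def __init__(self, am_dnn: torch.nn.Module, state_map: torch.Tensor,
                 V: torch.Tensor):
        super().__init__()
        self.am_dnn = am_dnn                        # pretrained acoustic model
        # Affine mapping from AM posteriors to WFST states (sparse in practice).
        self.state_map = torch.nn.Parameter(state_map)
        self.V = torch.nn.Parameter(V)              # WFST transition weights

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        posteriors = self.am_dnn(feats)             # (T, num_hmm_states)
        mapped = posteriors @ self.state_map        # (T, num_wfst_states)
        alphas = viterbi_forward(mapped, self.V)    # recursion from slide 5
        # Output-word readout is simplified to pooled state scores here.
        return alphas.max(dim=0).values             # utterance max-pooling
```

In this sketch, freezing `self.V` would correspond to the "ViterbiNet AM" condition on the next slide, freezing `am_dnn` to "ViterbiNet WFST", and training everything to "ViterbiNet E2E".
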
8. Experiments on real use cases
   Model
   • AMs: 5-layer, 640-unit fully connected DNNs, trained on English (28,000 h) / Japanese (12,300 h)
   • LM: a grammar WFST for each task

   Dataset
   | Task | EN-SC | EN-ROBOT | JP-ROBOT |
   |---|---|---|---|
   | Examples | "cat", "three" | "come-home", "move-forward" | "おいで", "前進" |
   Adaptation data
   | Vocabulary | 20 | 157 | 229 |
   | # of utterances | 34,760 (9.6 h) | 3,297 (1.8 h) | 4,580 (2.1 h) |
   | # of speakers | 1,034 | 7 | 20 |
   | SNR | 0–20 dB | -5–15 dB | -5–15 dB |
   Evaluation data
   | Vocabulary | 20 | 157 | 229 |
   | # of utterances | 29,961 (8.3 h) | 1,413 (0.7 h) | 2,290 (1.0 h) |
   | # of speakers | 847 | 3 | 10 |
   | SNR | 0–20 dB | -5–15 dB | -5–15 dB |

9. Adaptation: experimental results (Sentence Error Rate, %)

   | Adaptation method | EN-SC | EN-ROBOT | JP-ROBOT |
   |---|---|---|---|
   | (No adaptation) | 34.70 | 8.99 | 6.24 |
   | Cross Entropy | 10.78 | 6.44 | 4.06 |
   | sMBR [1] | 10.83 | 6.65 | 3.14 |
   | KL-regularization [2] | 10.77 | 6.23 | 3.28 |
   | ViterbiNet AM | 9.65 | 6.09 | 2.93 |
   | ViterbiNet WFST | 13.08 | 3.89 | 3.06 |
   | ViterbiNet E2E | 9.26 | 3.54 | 2.66 |

   Proposed methods:
   • ViterbiNet AM: retrain only the AM parameters, keeping the WFST parameters fixed
   • ViterbiNet WFST: retrain only the WFST weights, keeping the AM parameters fixed
   • ViterbiNet E2E: retrain both the WFST and AM parameters

   [1] M. Gibson and T. Hain, "Hypothesis spaces for minimum Bayes risk training in large vocabulary speech recognition," in 9th International Conference on Spoken Language Processing, 2006.
   [2] D. Yu, K. Yao, H. Su, G. Li, and F. Seide, "KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition," in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 7893-7897.

10. Language adaptation: experimental results (Sentence Error Rate, %)

   | Adaptation method | JP AM → EN-SC | JP AM → EN-ROBOT | EN AM → JP-ROBOT |
   |---|---|---|---|
   | (No adaptation) | 84.12 | 94.13 | 46.03 |
   | KL-regularization [2] | 67.35 | 54.64 | 16.94 |
   | ViterbiNet AM | 54.40 | 31.21 | 10.39 |
   | ViterbiNet WFST | 47.24 | 49.54 | 25.68 |
   | ViterbiNet E2E | 27.64 | 13.09 | 7.64 |

   Language adaptation (English ↔ Japanese):
   • Use an AM from another language as the seed model
   • Redesign the LM by hand so that it matches the HMM, e.g. (EN) "seven" → (JP) /s e b N/
   • A recognition system for an unseen language can be built with only 2-10 h of adaptation data!

11. Summary
   Proposed ViterbiNet, which backpropagates through a WFST:
   • maps the AM's output posteriors to WFST states
   • computes forward/backward scores recursively
   • classifies word sequences (commands) via max-pooling within an utterance

   Retraining with little adaptation data:
   • better results than conventional adaptation methods
   • E2E adaptation, which retrains both the WFST and AM parameters, performs best

   Adaptation to different languages:
   • English ↔ Japanese realized with little adaptation data

   Future work:
   • large-scale WFST retraining for large-vocabulary continuous speech recognition