Multi-channel Speech Source Separation with Deep Learning

Masahito Togami, Ph.D. (Engineering), LINE Corporation

Presentation slides for LINE AI Talk #02.


LINE Developers

July 05, 2019


  1. Multi-channel Speech Source Separation with Deep Learning

    Masahito Togami, Ph.D. (Engineering), LINE Corporation. 2019-07-05.
  2. Self introduction

    • Masahito Togami, Ph.D.
    • 2003/10 ~ 2016/10: Hitachi, Ltd., Central Research Laboratory
    • 2016/10 ~ 2018/5: Hitachi America, Ltd.
    • 2017/1 ~ 2018/5: Stanford University, Visiting Scholar
    • 2018/6 ~: LINE Corporation, Research Labs, Senior Researcher
    • JSAI Board Member; IEEE Senior Member; TC Member of the IEEE SPS Industry DSP Technology Standing Committee; Member of the Digital Signal Processing Editorial Board
    • Research interests: statistical signal processing, speech processing, array signal processing, robot audition
  3. Signal Processing

    Signal processing is analysis built on strong signal modeling, used to extract a desired signal or parameter from a low-S/N observation.
  4. Key Message

    • Some people say that “Deep learning is an innovative tool which does not require any pre-given models”…
    • In the deep neural network era, is signal modeling no longer necessary? My answer is no.
  5. Signal modeling in deep neural network

    • Signal modeling with prior knowledge is highly important in the deep neural network context
    • Representation Learning [Bengio 2013]: learning with generic priors such as smoothness, temporal and spatial coherence, sparsity, simplicity of factor dependencies, explanatory factors…
    • Structured Variational Autoencoder (SVAE) [Johnson 2016]: combination of probabilistic graphical models (GMM, state-space model) with neural networks
  6. In this talk

    • Signal modeling is a strong tool, but an actual signal is sometimes too complicated to be expressed by simple models alone
    • From the perspective of signal modeling × deep neural network, this talk introduces recent progress in multi-channel speech source separation with deep neural networks and some LINE Research Labs activities
  7. Motivation

    Purpose: high speech quality in hands-free communication systems and improved automatic speech recognition.
    Reducing unwanted signals is an important issue:
    • In-car speech recognition: road noise
    • AI speaker: multiple speech sources, reverberation
  8. Why is speech source separation difficult?

    Observation model: $X = AS + N$, where $X$ is the observed microphone input signal, $S$ the clean speech signal, $A$ the mixing matrix, and $N$ the background noise.
    Problem: estimate $S$ when only $X$ is given and $A$, $S$, and $N$ are all unknown.
    Strategy: exploit the characteristics of $A$, $S$, and $N$, that is, signal modeling.
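The observation model of this slide can be sketched numerically for a single frequency bin. This is a minimal illustration, not code from the talk: the function name `mix`, the array sizes, and the random mixing matrix are all assumptions made for the example.

```python
import numpy as np

def mix(A, S, noise_std, rng):
    """Return X = A S + N for one frequency bin.

    A: (mics, sources) complex mixing matrix (hypothetical values)
    S: (sources, frames) complex source spectra
    N is complex white Gaussian noise with standard deviation noise_std.
    """
    n_mics, n_frames = A.shape[0], S.shape[1]
    N = noise_std * (rng.standard_normal((n_mics, n_frames))
                     + 1j * rng.standard_normal((n_mics, n_frames))) / np.sqrt(2)
    return A @ S + N

rng = np.random.default_rng(0)
n_mics, n_sources, n_frames = 2, 2, 100
A = rng.standard_normal((n_mics, n_sources)) + 1j * rng.standard_normal((n_mics, n_sources))
S = rng.standard_normal((n_sources, n_frames)) + 1j * rng.standard_normal((n_sources, n_frames))
X = mix(A, S, noise_std=0.1, rng=rng)

# Only X is observed in practice. If A were known (it is not),
# a pseudo-inverse would recover S up to the noise term.
S_oracle = np.linalg.pinv(A) @ X
print(X.shape, np.abs(S_oracle - S).mean())
```

The last two lines only show why the problem is hard: the pseudo-inverse needs $A$, which a real system never has, so the separation must rely on models of $A$, $S$, and $N$ instead.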
  9. Signal modeling for speech signal

    • Two models are utilized: a source model and a spatial model
    • Source model: sparse, non-stationary (time-varying), harmonic structure…
    [Figure: time-frequency representations (axes: time, frequency) of a speech source and of a white Gaussian signal]
  10. Signal modeling for speech signal

    • Spatial model (available only to multi-channel approaches): the time (or phase) difference between microphones depends on the location of the speech source
    [Figure: a 1st and a 2nd microphone; the microphone nearer to the source receives the signal earlier]
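As a rough illustration of the spatial cue on this slide, the sketch below computes the inter-microphone phase difference for a two-microphone array. It assumes a far-field (plane wave) source and free-field propagation, neither of which is stated in the slides; the function name, the 5 cm spacing, and the speed of sound are hypothetical choices for the example.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value

def phase_difference(doa_deg, mic_distance_m, freq_hz):
    """Far-field inter-microphone phase difference in radians.

    doa_deg is the direction of arrival measured from broadside,
    so 0 degrees means the source is equidistant from both microphones.
    """
    # Time difference of arrival between the two microphones
    tau = mic_distance_m * np.sin(np.deg2rad(doa_deg)) / SPEED_OF_SOUND
    return 2.0 * np.pi * freq_hz * tau

for doa in (-90.0, 0.0, 30.0, 90.0):
    print(doa, phase_difference(doa, 0.05, 1000.0))
```

The phase difference is zero at broadside and changes sign with the side the source is on, which is what lets a multi-channel method tell sources apart by location.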
  11. Probabilistic model-based approaches with source and spatial models

    • Independent Component Analysis (ICA) [Common 1994]: the source model is a super-Gaussian distribution
    • Local Gaussian Modeling (LGM) [Duong 2010]: time-varying multi-channel Gaussian distribution
    • Independent Vector Analysis (IVA) [Kim 2006][Hiroe 2006]: all frequencies share the same time-varying activity
    • Independent Low-Rank Matrix Analysis (ILRMA) [Kitamura 2016]: time-varying activities are modeled with a small number of patterns, e.g., 2 or 3
    Limitation: these source models are too simple to express complicated speech-source characteristics.
  12. DNN based source model

    • A DNN can capture complicated speech-source characteristics
    [Figure: Fig. 2 in Y. Xu, J. Du, L. Dai and C. Lee, "A Regression Approach to Speech Enhancement Based on Deep Neural Networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 7-19, Jan. 2015.]
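  13. How to combine DNN based source model and spatial model

    1) The source model is learned by a DNN, and the spatial model is updated based on conventional probabilistic models
       • LGM+DNN [Nugraha 2016], IDLMA [Mogami 2018], MVAE [Kameoka 2018]
    2) The spatial model is constructed from the DNN based source model (time-frequency mask $M_{i,l,k}$):
       $$R_{i,k} = \frac{1}{\sum_{l} M_{i,l,k}} \sum_{l} M_{i,l,k}\, x_{l,k} x_{l,k}^{H}$$
       where $x_{l,k}$ is the multi-channel microphone input at frame $l$ and frequency $k$, and $R_{i,k}$ is the spatial covariance matrix of source $i$
       • Mask based beamforming [Heymann 2016], Deep Clustering [Hershey 2016], Permutation Invariant Training (PIT) [Kolbæk 2017]
    Issue: in these approaches, the DNN is trained without regard to the speech quality obtained after multi-channel speech source separation.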
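The mask-weighted covariance in approach 2) can be sketched as follows. This is a minimal illustration under assumed array layouts (frames × frequencies × microphones); the function name, the `eps` guard against an all-zero mask, and the random test data are assumptions, not part of the slides.

```python
import numpy as np

def masked_spatial_covariance(X, mask, eps=1e-8):
    """Mask-weighted spatial covariance matrix of one source.

    X:    (frames, freqs, mics) complex STFT of the microphone signals
    mask: (frames, freqs) real values in [0, 1] for that source
    Returns an array of shape (freqs, mics, mics).
    """
    # Sum over frames l of mask[l, k] * x[l, k] x[l, k]^H
    num = np.einsum("lk,lkm,lkn->kmn", mask, X, X.conj())
    # Normalize by the total mask weight per frequency
    den = mask.sum(axis=0)[:, None, None]
    return num / (den + eps)

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 4, 3)) + 1j * rng.standard_normal((20, 4, 3))
mask = rng.uniform(size=(20, 4))   # stand-in for a DNN-inferred mask
R = masked_spatial_covariance(X, mask)
print(R.shape)  # (4, 3, 3)
```

In a real system the mask would come from the DNN rather than from a random generator; everything after that point is plain linear algebra, which is why the mask quality decides the quality of the spatial model.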
  14. Activities in LINE Research Labs

    We published three ICASSP 2019 papers and one Interspeech 2019 paper on how to combine the source model and the spatial model:
    1. Multi-channel loss function with DNN source model and spatial model [Togami ICASSP2019_1] [Masuyama Interspeech2019]
    2. Insertion of a spatial model between two BLSTM layers in multi-channel Deep Clustering [Togami ICASSP2019_2]
    3. Simultaneous optimization of forgetting factors and time-frequency masking in online beamformers [Togami ICASSP2019_3]
    4. Restoration of spatial-model parameters by using a DNN based source model [Nakagome 2019] (Student Poster Award)
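  15. Multi-channel loss function for deep learning based speech source separation [Togami ICASSP2019_1]

    A loss function which evaluates the estimated posterior probability density function is proposed:
    $$\mathcal{L} = -\log p(c_{i,l,k} \mid x, \theta) = (c_{i,l,k} - \bar{c}_{i,l,k})^{H}\, \bar{R}_{i,l,k}^{-1}\, (c_{i,l,k} - \bar{c}_{i,l,k}) + \log\det \bar{R}_{i,l,k}$$
    • $\theta$: DNN parameters
    • $c_{i,l,k}$: oracle multi-channel speech signal of source $i$ at frame $l$ and frequency $k$
    • $\bar{c}_{i,l,k}$: MAP estimate of $c_{i,l,k}$
    • $\bar{R}_{i,l,k}$: MSE (error covariance) of $c_{i,l,k}$
    • $x_{l,k}$: input signal
    LGM:
    $$p(c_{i,l,k}) = \mathcal{N}(0,\, v_{i,l,k} R_{i,k}), \qquad R_{i,k} = \frac{1}{\sum_{l} M_{i,l,k}} \sum_{l} M_{i,l,k}\, x_{l,k} x_{l,k}^{H}$$
    $v_{i,l,k}$ and $M_{i,l,k}$ are inferred via the DNN; $\bar{c}_{i,l,k}$ and $\bar{R}_{i,l,k}$ are estimated by time-varying multi-channel Wiener filtering:
    $$W_{i,l,k} = v_{i,l,k} R_{i,k} \Big( \sum_{j} v_{j,l,k} R_{j,k} \Big)^{-1}, \qquad \bar{c}_{i,l,k} = W_{i,l,k}\, x_{l,k}$$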
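The loss on this slide can be sketched at a single time-frequency point as below. This is a minimal NumPy illustration, not the author's implementation: the function names are hypothetical, the posterior covariance uses the standard Gaussian conditioning formula (the slide only says it comes from multi-channel Wiener filtering), and the variances and spatial covariances are random stand-ins for DNN outputs.

```python
import numpy as np

def mwf_posterior(x, v, R):
    """Posterior mean and covariance of each source under the LGM.

    x: (mics,) complex observation at one time-frequency point
    v: (sources,) non-negative time-varying variances
    R: (sources, mics, mics) Hermitian positive-definite spatial covariances
    """
    prior = v[:, None, None] * R                 # v_i R_i for every source
    Rx_inv = np.linalg.inv(prior.sum(axis=0))    # inverse of sum_j v_j R_j
    W = prior @ Rx_inv                           # multi-channel Wiener filters
    c_bar = W @ x                                # posterior means, (sources, mics)
    R_bar = prior - W @ prior                    # posterior error covariances
    return c_bar, R_bar

def multichannel_loss(c, c_bar, R_bar):
    """Negative log posterior (up to constants), summed over sources."""
    err = c - c_bar
    quad = np.einsum("im,imn,in->i", err.conj(), np.linalg.inv(R_bar), err).real
    _, logdet = np.linalg.slogdet(R_bar)
    return float(np.sum(quad + logdet))

rng = np.random.default_rng(0)
n_src, n_mic = 2, 3
B = rng.standard_normal((n_src, n_mic, n_mic)) + 1j * rng.standard_normal((n_src, n_mic, n_mic))
R = B @ B.conj().transpose(0, 2, 1) + np.eye(n_mic)   # hypothetical spatial covariances
v = np.array([1.0, 0.5])                              # hypothetical variances
x = rng.standard_normal(n_mic) + 1j * rng.standard_normal(n_mic)
c_bar, R_bar = mwf_posterior(x, v, R)
print(multichannel_loss(c_bar, c_bar, R_bar))
```

Because the Wiener filters of all sources sum to the identity, the posterior means add up to the observation; the quadratic term then penalizes separation error in the metric of the estimated uncertainty, while the log-determinant term keeps that uncertainty from growing without bound.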
  16. DNN structure

    [Figure: network diagram. Input features: log power spectrum $\log |x_{l,k}|^{2}$ and inter-microphone phase difference. Stacked BLSTM layers are followed by Linear and ReLU layers, and two Linear + Sigmoid output heads produce $M_{i,l,k}$ and $v_{i,l,k}$.]
    Similarly to conventional deep clustering and PIT, multiple BLSTM layers are utilized.
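  17. Experimental results

    • The proposed loss function is compared with other loss functions which evaluate the multi-channel output

    Approaches   SIR (dB)   SDR (dB)   ΔMFCC
    L2 loss      11.99      9.46       3.94
    MMSE         10.98      8.88       3.66
    Prior        11.34      9.13       4.69
    Proposed     12.16      10.17      4.66

    Interpretation of the proposed loss: 1) discriminative training, 2) regularization term, 3) multi-task training
    $$\mathcal{L} = (c_{i,l,k} - \bar{c}_{i,l,k})^{H}\, \bar{R}_{i,l,k}^{-1}\, (c_{i,l,k} - \bar{c}_{i,l,k}) + \log\det \bar{R}_{i,l,k}$$
    (oracle signal $c_{i,l,k}$, its MAP estimate $\bar{c}_{i,l,k}$, error covariance $\bar{R}_{i,l,k}$)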
  18. Experimental results for time-invariant beamformer [Masuyama Interspeech2019]

    • Only the time-frequency mask $M_{i,l,k}$ is utilized for constructing the time-invariant beamformer
    • The proposed losses were compared to a monaural loss (PSA)

    Approaches   SIR (dB)   SDR (dB)   CD (dB)
    Mixed        0.91       0.20       4.44
    PSA          6.21       5.79       3.93
    Proposed     7.57       7.10       3.40
    Oracle PSM   11.04      10.43      3.09

    The time-frequency mask inferred by the proposed method outperformed the state-of-the-art time-frequency mask estimator (PSA) on all three metrics.
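The slides do not say which time-invariant beamformer is built from the mask. As one common possibility, the sketch below assumes an MVDR beamformer whose steering vector is the principal eigenvector of the speech covariance; the function name and the synthetic covariances are hypothetical, and in practice both covariances would be mask-weighted averages of the microphone signals.

```python
import numpy as np

def mvdr_weights(R_speech, R_noise):
    """Time-invariant MVDR beamformer for one frequency bin (assumed design).

    R_speech, R_noise: (mics, mics) Hermitian covariance matrices,
    e.g. mask-weighted and (1 - mask)-weighted averages of x x^H.
    The output signal is y = w^H x.
    """
    _, eigvecs = np.linalg.eigh(R_speech)
    h = eigvecs[:, -1]                        # eigh sorts eigenvalues in ascending order
    Rn_inv_h = np.linalg.solve(R_noise, h)    # R_noise^{-1} h without an explicit inverse
    return Rn_inv_h / (h.conj() @ Rn_inv_h)

rng = np.random.default_rng(0)
g = rng.standard_normal(4) + 1j * rng.standard_normal(4)   # hypothetical steering vector
R_speech = np.outer(g, g.conj())                           # rank-1 speech covariance
R_noise = np.eye(4)                                        # spatially white noise
w = mvdr_weights(R_speech, R_noise)
print(abs(w.conj() @ g), np.linalg.norm(g))
```

The two printed numbers agree because MVDR passes the target direction with unit gain relative to the normalized steering vector, while minimizing the noise power at the output.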
  19. Insertion of MWF between BLSTM layers [Togami ICASSP2019_2]

    • Roughly speaking, one BLSTM layer in DNN based speech source separation acts as one EM step
    • Conventional structures have no explicit spatial constraint, i.e., that the same source stays in the same spatial location (spatial location constraint)
    • By inserting multi-channel Wiener filtering (MWF) between two BLSTM layers, the spatial location constraint is considered explicitly
  20. Experimental results for multi-channel speech source separation

    The neural network is trained by using the 4-microphone dataset.

    Experimental results for the 4-microphone dataset:
    Approaches          SIR (dB)   SDR (dB)   ΔMFCC
    w/o MWF insertion   15.04      13.87      5.03
    w/ MWF insertion    15.36      14.06      5.18

    Experimental results for the 8-microphone dataset:
    Approaches          SIR (dB)   SDR (dB)   ΔMFCC
    w/o MWF insertion   17.45      15.73      5.47
    w/ MWF insertion    18.14      16.33      5.76
  21. Conclusions

    • Signal modeling remains important in the deep neural network era
    • In speech source separation, how to combine the DNN based source model and the spatial model is a key issue
    • We proposed a DNN loss function which reflects the spatial model and a DNN structure which integrates the spatial model
    • Signal modeling will become more and more important in the future, e.g., for unsupervised learning
  22. Future issues

    • Labeled datasets are very difficult to obtain in speech source separation, because the label is the clean speech itself
    • Utilizing dark data and weak supervision, as in Snorkel from Stanford University (https://github.com/HazyResearch/snorkel), is becoming more and more important
    • In this context, conventional unsupervised multi-channel speech source separation is back in the spotlight, e.g., unsupervised speech source separation with DNN [Lukas 2019]
  23. References

    [Bengio 2013] Y. Bengio, et al., "Representation Learning: A Review and New Perspectives," IEEE TPAMI, vol. 35, no. 8, pp. 1798-1828, Aug. 2013.
    [Johnson 2016] M. Johnson, et al., "Composing graphical models with neural networks for structured representations and fast inference," NeurIPS, 2016.
    [Common 1994] P. Comon, "Independent component analysis, a new concept?," Signal Processing, vol. 36, no. 3, pp. 287-314, Apr. 1994.
    [Duong 2010] N.Q.K. Duong, et al., "Under-determined reverberant audio source separation using a full-rank spatial covariance model," IEEE TASLP, vol. 18, no. 7, pp. 1830-1840, 2010.
    [Kim 2006] T. Kim, et al., "Independent vector analysis: an extension of ICA to multivariate components," in Proceedings ICA, pp. 165-172, Mar. 2006.
    [Hiroe 2006] A. Hiroe, "Solution of permutation problem in frequency domain ICA using multivariate probability density functions," in Proceedings ICA, pp. 601-608, Mar. 2006.
    [Kitamura 2016] D. Kitamura, et al., "Determined blind source separation unifying independent vector analysis and nonnegative matrix factorization," IEEE/ACM TASLP, vol. 24, no. 9, pp. 1626-1641, 2016.
    [Nugraha 2016] A.A. Nugraha, et al., "Multichannel audio source separation with deep neural networks," IEEE/ACM TASLP, vol. 24, no. 9, pp. 1652-1664, 2016.
    [Mogami 2018] S. Mogami, et al., "Independent deeply learned matrix analysis for multichannel audio source separation," in EUSIPCO, pp. 1557-1561, Sep. 2018.
    [Kameoka 2018] H. Kameoka, et al., "Semi-blind source separation with multichannel variational autoencoder," arXiv:1808.00892, Aug. 2018.
    [Heymann 2016] J. Heymann, et al., "Neural network based spectral mask estimation for acoustic beamforming," in ICASSP, 2016, pp. 196-200.
    [Hershey 2016] J.R. Hershey, et al., "Deep clustering: Discriminative embeddings for segmentation and separation," in ICASSP, 2016, pp. 31-35.
    [Kolbæk 2017] M. Kolbæk, et al., "Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks," IEEE/ACM TASLP, vol. 25, pp. 1901-1913, 2017.
    [Togami ICASSP2019_1] M. Togami, "Multi-channel Itakura Saito Distance Minimization with deep neural network," in ICASSP, 2019, pp. 536-540.
    [Togami ICASSP2019_2] M. Togami, "Spatial Constraint on Multi-channel Deep Clustering," in ICASSP, 2019, pp. 531-535.
    [Togami ICASSP2019_3] M. Togami, "Simultaneous Optimization of Forgetting Factor and Time-frequency Mask for Block Online Multi-channel Speech Enhancement," in ICASSP, 2019, pp. 2702-2706.
    [Nakagome 2019] Y. Nakagome, "Adaptive beamformer for desired source extraction with neural network based direction of arrival estimation," in IEICE Technical Report, vol. 118, no. 497, SP2018-85, pp. 143-147 (in Japanese), Mar. 2019.
    [Masuyama Interspeech2019] Y. Masuyama, et al., "Multichannel Loss Function for Supervised Speech Source Separation by Mask-based Beamforming," in Interspeech, Sep. 2019 (Accepted).
    [Togami 2005] M. Togami, et al., "Adaptation methodology for minimum variance beam-former based on frequency segregation," in Proc. of the 2005 Autumn Meeting of the Acoustical Society of Japan (in Japanese), Sep. 2005.
    [Lukas 2019] L. Drude, et al., "Unsupervised Training of a Deep Clustering Model for Multichannel Blind Source Separation," in ICASSP, 2019, pp. 695-699.
  24. Thank you

  25. Speech extraction from target direction with DNN supported DOA estimation [Nakagome 2019]

    • Multi-channel beamforming with time-frequency masking was previously proposed for target speech extraction [Togami 2005]
    • Assuming that the DOA of the target speech is given, the time-frequency mask is constructed by using a DOA estimate
    • However, the DOA estimate is sensitive to microphone alignment error and reverberation
    • We propose multi-channel beamforming with DOA based time-frequency masking in which the DOA estimate is restored and its reliability is estimated by a DNN
  26. Experimental results

    Speech extraction performance after MWF:

    Approaches                                                          SNR (dB)   SDR (dB)   MFCCD
    w/o restoration and reliability                                     7.13       8.12       3.21
    w/ DOA restoration, w/o reliability (loss: DOA estimation error)    10.32      14.90      2.80
    w/ DOA restoration and reliability (loss: output speech quality)    13.53      17.98      2.42