Self introduction
• Central Research Laboratory
• 2016/10 ~ 2018/5: Hitachi America, Ltd.
• 2017/1 ~ 2018/5: Stanford University, Visiting Scholar
• 2018/6 ~: LINE Corporation, Research Labs, Senior Researcher
• JSAI Board Member; IEEE Senior Member; TC Member of IEEE SPS Industry DSP Technology Standing Committee; Member of Digital Signal Processing Editorial Board
• Research interests: statistical signal processing, speech processing, array signal processing, robot audition
Prior knowledge is highly important even in the deep neural network context
• Representation Learning [Bengio 2013]
  – Learning with generic priors: smoothness, temporal and spatial coherence, sparsity, simplicity of factor dependencies, explanatory factors, …
• Structured Variational Auto-Encoder (SVAE) [Johnson 2016]
  – Combination of probabilistic graphical models with neural networks: GMM, state-space models
• However, an actual signal is sometimes too complicated to be expressed by simple models alone.
• From the signal modeling × deep neural network perspective, this talk introduces recent progress in multi-channel speech source separation with deep neural networks and some LINE Research Labs activities.
Signal modeling
• Observation model: X = AS + N
  – X: observed microphone input signal, S: clean speech signal, A: mixing matrix, N: background noise
• Goal: estimate S under the condition that only X is given and A, S, and N are unknown
• Strategy: utilization of the characteristics of A, S, and N
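The observation model above (only X is given; A, S, and N are unknown) can be sketched numerically. The array sizes and random data below are illustrative assumptions, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

n_src, n_mic, n_frames = 2, 3, 100  # illustrative sizes

S = rng.standard_normal((n_src, n_frames))        # S: clean source signals
A = rng.standard_normal((n_mic, n_src))           # A: mixing matrix (one frequency bin)
N = 0.1 * rng.standard_normal((n_mic, n_frames))  # N: background noise

X = A @ S + N  # X: observed microphone input; the separator only ever sees X
```

In the frequency domain the same model holds per bin with complex-valued A and S; the real-valued sketch keeps the shapes visible.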
• Independent Component Analysis (ICA) [Common 1994]
  – Source model: super-Gaussian distribution
• Local Gaussian Modeling (LGM) [Duong 2010]
  – Time-varying multi-channel Gaussian distribution
• Independent Vector Analysis (IVA) [Kim 2006][Hiroe 2006]
  – All frequencies share the same time-varying activity
• Independent Low-Rank Matrix Analysis (ILRMA) [Kitamura 2016]
  – Time-varying activities are modeled with a small number of spectral patterns, e.g., 2 or 3
→ The source model is too simple to express complicated speech source characteristics.
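ILRMA's low-rank source model above can be illustrated with a tiny NumPy sketch: the power spectrogram of each source is the product of a few nonnegative spectral patterns and their time-varying activations. The spectrogram sizes and random factors here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

n_freq, n_frames, K = 64, 50, 2  # K: small number of spectral patterns (ILRMA uses 2 or 3)

W = rng.random((n_freq, K)) + 1e-3    # nonnegative spectral basis patterns
H = rng.random((K, n_frames)) + 1e-3  # nonnegative time-varying activations

V = W @ H  # low-rank model of one source's power spectrogram v_{f,t}
```

The rank of V is at most K, which is exactly the "small number of patterns" restriction that makes the model tractable but too simple for real speech.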
Complicated speech characteristics can be learned by DNN (see Fig. 2 in Y. Xu, J. Du, L. Dai and C. Lee, "A Regression Approach to Speech Enhancement Based on Deep Neural Networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 7-19, Jan. 2015).
Two approaches to combining a DNN-based source model with a spatial model:
1) The source model is learned by a DNN, and the spatial model is updated based on conventional probabilistic models
  • LGM+DNN [Nugraha 2016], IDLMA [Mogami 2018], MVAE [Kameoka 2018]
2) The spatial model is constructed by using a DNN-based source model (time-frequency mask m_{t,f})
  • Mask-based beamforming [Heymann 2016], Deep Clustering [Hershey 2016], Permutation Invariant Training (PIT) [Kolbæk 2017]
  • e.g., mask-weighted spatial covariance: R_f = (1 / Σ_t m_{t,f}) Σ_t m_{t,f} x_{t,f} x_{t,f}^H
→ The DNN does not take the speech quality after multi-channel speech source separation into account
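Mask-based beamforming in approach 2) can be sketched as follows. This is a minimal sketch of one common variant, assuming an MVDR beamformer whose steering vector is taken as the principal eigenvector of the mask-weighted speech covariance; it is not necessarily the exact formulation in the cited papers, and the function name is illustrative.

```python
import numpy as np

def mvdr_from_mask(X, mask, eps=1e-6):
    """X: (n_mic, n_frames) complex STFT at one frequency bin.
    mask: (n_frames,) speech presence probability from a DNN."""
    n_mic = X.shape[0]
    # mask-weighted spatial covariances of speech and noise
    Phi_s = (mask * X) @ X.conj().T / (mask.sum() + eps)
    Phi_n = ((1 - mask) * X) @ X.conj().T / ((1 - mask).sum() + eps)
    # steering vector: principal eigenvector of the speech covariance
    d = np.linalg.eigh(Phi_s)[1][:, -1]
    # MVDR weights: w = Phi_n^{-1} d / (d^H Phi_n^{-1} d)
    num = np.linalg.solve(Phi_n + eps * np.eye(n_mic), d)
    w = num / (d.conj() @ num)
    return w.conj() @ X  # beamformed output, shape (n_frames,)
```

Note the beamformer itself is time-invariant; all the adaptivity comes from the DNN mask, which is why the mask estimator's quality dominates the final result.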
ICASSP 2019 papers and one Interspeech 2019 paper on how to combine the source model and the spatial model:
1. Multi-channel loss function with a DNN source model and spatial model [Togami ICASSP2019_1][Masuyama Interspeech2019]
2. Insertion of a spatial model between two BLSTM layers in multi-channel Deep Clustering [Togami ICASSP2019_2]
3. Simultaneous optimization of forgetting factors and time-frequency masking in online beamformers [Togami ICASSP2019_3]
4. Restoration of spatial model parameters by using a DNN-based source model [Nakagome 2019] (Student Poster Award)
• The estimated time-frequency mask m_{t,f} is utilized for constructing a time-invariant beamformer
• The proposed losses were compared to a monaural loss (PSA)

  Approach      SIR (dB)  SDR (dB)  CD (dB)
  Mixed           0.91      0.20     4.44
  PSA             6.21      5.79     3.93
  Proposed        7.57      7.10     3.40
  Oracle PSM     11.04     10.43     3.09

• The time-frequency mask inferred by the proposed method outperformed a state-of-the-art time-frequency mask estimator.
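The monaural PSA baseline mentioned above can be sketched as follows. The truncation of the target to [0, |X|] is one common convention, and the function name is illustrative, not from the cited work.

```python
import numpy as np

def psa_loss(mask, X, S):
    """Phase-sensitive approximation (PSA) loss.
    mask: estimated real-valued TF mask; X, S: complex STFTs of mixture and clean speech."""
    # phase-sensitive target: |S| cos(theta_S - theta_X)
    target = np.abs(S) * np.cos(np.angle(S) - np.angle(X))
    target = np.clip(target, 0.0, np.abs(X))  # truncate to the valid mask range
    return np.mean((mask * np.abs(X) - target) ** 2)
```

Being purely monaural, this loss ignores the beamforming stage that follows, which is exactly the gap the proposed multi-channel losses address.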
• Roughly speaking, one BLSTM layer in DNN-based speech source separation acts as one EM step
• There is no explicit spatial constraint, i.e., that the same source stays in the same spatial location (spatial location constraint)
• By inserting multi-channel Wiener filtering (MWF) between two BLSTM layers, the spatial location constraint is explicitly considered
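The MWF inserted between the BLSTM layers computes, per time-frequency region, W = Phi_s (Phi_s + Phi_n)^{-1} and applies it to the observation. The helper below is an illustrative sketch of that filtering step only, not the exact network layer.

```python
import numpy as np

def multichannel_wiener(X, Phi_s, Phi_n):
    """Multi-channel Wiener filter at one TF region.
    X: (n_mic, n_frames) observation; Phi_s, Phi_n: speech / noise spatial covariances."""
    W = Phi_s @ np.linalg.inv(Phi_s + Phi_n)  # MWF matrix: W = Phi_s (Phi_s + Phi_n)^{-1}
    return W @ X  # multi-channel estimate of the speech image
```

Because Phi_s ties a source to one spatial covariance, re-estimating the masks after this filtering step enforces the spatial location constraint that plain stacked BLSTMs lack.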
Signal modeling remains important in the deep neural network era
• In the speech source separation context, how to combine a DNN-based source model with a spatial model is a key issue
• We proposed a DNN loss function which reflects the spatial model and a DNN structure which integrates the spatial model
• Signal modeling will become more and more important in the future, e.g., for unsupervised learning
• Labeled data is hard to obtain in speech source separation, because the label is the clean speech itself
• Utilization of dark data and weak supervision is more and more important, e.g., Snorkel from Stanford University (https://github.com/HazyResearch/snorkel)
• In this context, the spotlight has returned to conventional unsupervised multi-channel speech source separation, e.g., unsupervised speech source separation with DNN [Lukas 2019]
References
[Bengio 2013] Y. Bengio, et al., "Representation Learning: A Review and New Perspectives," IEEE TPAMI, vol. 35, no. 8, pp. 1798-1828, Aug. 2013.
[Johnson 2016] M. Johnson, et al., "Composing graphical models with neural networks for structured representations and fast inference," NeurIPS, 2016.
[Common 1994] P. Comon, "Independent component analysis, a new concept?," Signal Processing, vol. 36, no. 3, pp. 287-314, April 1994.
[Duong 2010] N. Q. K. Duong, et al., "Under-determined reverberant audio source separation using a full-rank spatial covariance model," IEEE TASLP, vol. 18, no. 7, pp. 1830-1840, 2010.
[Kim 2006] T. Kim, et al., "Independent vector analysis: an extension of ICA to multivariate components," in Proc. ICA, pp. 165-172, Mar. 2006.
[Hiroe 2006] A. Hiroe, "Solution of permutation problem in frequency domain ICA using multivariate probability density functions," in Proc. ICA, pp. 601-608, Mar. 2006.
[Kitamura 2016] D. Kitamura, et al., "Determined blind source separation unifying independent vector analysis and nonnegative matrix factorization," IEEE/ACM TASLP, vol. 24, no. 9, pp. 1626-1641, 2016.
[Nugraha 2016] A. A. Nugraha, et al., "Multichannel audio source separation with deep neural networks," IEEE/ACM TASLP, vol. 24, no. 9, pp. 1652-1664, 2016.
[Mogami 2018] S. Mogami, et al., "Independent deeply learned matrix analysis for multichannel audio source separation," in EUSIPCO, pp. 1557-1561, Sep. 2018.
[Kameoka 2018] H. Kameoka, et al., "Semi-blind source separation with multichannel variational autoencoder," arXiv:1808.00892, Aug. 2018.
[Heymann 2016] J. Heymann, et al., "Neural network based spectral mask estimation for acoustic beamforming," in ICASSP, pp. 196-200, 2016.
[Hershey 2016] J. R. Hershey, et al., "Deep clustering: Discriminative embeddings for segmentation and separation," in ICASSP, pp. 31-35, 2016.
[Kolbæk 2017] M. Kolbæk, et al., "Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks," IEEE/ACM TASLP, vol. 25, pp. 1901-1913, 2017.
[Togami ICASSP2019_1] M. Togami, "Multi-channel Itakura Saito Distance Minimization with Deep Neural Network," in ICASSP, pp. 536-540, 2019.
[Togami ICASSP2019_2] M. Togami, "Spatial Constraint on Multi-channel Deep Clustering," in ICASSP, pp. 531-535, 2019.
[Togami ICASSP2019_3] M. Togami, "Simultaneous Optimization of Forgetting Factor and Time-frequency Mask for Block Online Multi-channel Speech Enhancement," in ICASSP, pp. 2702-2706, 2019.
[Nakagome 2019] Y. Nakagome, "Adaptive beamformer for desired source extraction with neural network based direction of arrival estimation," IEICE Technical Report, vol. 118, no. 497, SP2018-85, pp. 143-147 (in Japanese), Mar. 2019.
[Masuyama Interspeech2019] Y. Masuyama, et al., "Multichannel Loss Function for Supervised Speech Source Separation by Mask-based Beamforming," in Interspeech, Sep. 2019 (accepted).
[Togami 2005] M. Togami, et al., "Adaptation methodology for minimum variance beamformer based on frequency segregation," in Proc. 2005 Autumn Meeting of the Acoustical Society of Japan (in Japanese), Sep. 2005.
[Lukas 2019] L. Drude, et al., "Unsupervised Training of a Deep Clustering Model for Multichannel Blind Source Separation," in ICASSP, pp. 695-699, 2019.
[Nakagome 2019]
• Previously, multi-channel beamforming with time-frequency masking was proposed for target speech extraction [Togami 2005]
• Under the assumption that the DOA of the target speech is given, the time-frequency mask is constructed by using a DOA estimate
• However, the DOA estimate is sensitive to microphone alignment errors and reverberation
• We propose multi-channel beamforming with DOA-based time-frequency masking in which the DOA estimate is restored and its reliability is estimated by a DNN
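A DOA-based time-frequency mask of the kind described above can be sketched for a two-microphone array: compare the observed inter-channel phase difference with the one a given DOA predicts, and pass the wrapped error through a soft gate. The Gaussian mask shape, array geometry, and all parameter names below are illustrative assumptions, not the cited method.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def doa_mask(X, doa_rad, mic_dist, freqs, width=0.5):
    """Soft TF mask from a DOA estimate for a 2-mic array (illustrative sketch).
    X: (2, n_freq, n_frames) complex STFT; doa_rad: DOA from broadside in radians;
    mic_dist: microphone spacing in meters; freqs: (n_freq,) bin frequencies in Hz."""
    # inter-channel time delay and phase difference predicted by the DOA
    tau = mic_dist * np.sin(doa_rad) / SPEED_OF_SOUND
    expected = 2 * np.pi * freqs * tau                           # (n_freq,)
    observed = np.angle(X[1] * X[0].conj())                      # (n_freq, n_frames)
    err = np.angle(np.exp(1j * (observed - expected[:, None])))  # wrapped phase error
    return np.exp(-(err / width) ** 2)  # near 1 where the phase matches the DOA
```

Any error in `doa_rad` shifts `expected` directly, which illustrates why the method restores the DOA estimate and weights it by a DNN-estimated reliability.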