Yoshiaki Bando (1, 2), Yoshiki Masuyama (1, 3), Aditya Arie Nugraha (2), Kazuyoshi Yoshii (2, 4)
(1) National Institute of Advanced Industrial Science and Technology (AIST)
(2) Center for Advanced Intelligence Project (AIP), RIKEN
(3) Department of Computer Science, Tokyo Metropolitan University
(4) Graduate School of Informatics, Kyoto University
basis of machine listening systems.
• Such systems are often required to work in diverse environments.
• This calls for BSS that can adapt to the target environment.
Applications: distant speech recognition (DSR) [Watanabe+ 2020, Baker+ 2018] and sound event detection (SED) [Turpault+ 2020, Denton+ 2022].
Neural Fast Full-Rank Spatial Covariance Analysis for Blind Source Separation
Rank-1 spatial model: $\mathbf{H}_{nf} = \mathbf{a}_{nf} \mathbf{a}_{nf}^{\mathsf{H}}$
Fast and stable with the IP [Ono+ 2011] or ISS [Scheibler+ 2021] algorithm.
Weak against reverberation and diffuse noise.
Full-rank spatial model: $\mathbf{H}_{nf} \in \mathbb{S}_{+}^{M \times M}$
Robust against reverberation and diffuse noise.
Computationally expensive due to its EM or MU algorithm.
Jointly-diagonalizable (JD) spatial model: $\mathbf{H}_{nf} = \mathbf{Q}_{f}^{-1} \operatorname{diag}(\mathbf{g}_{n}) \mathbf{Q}_{f}^{-\mathsf{H}}$
Still robust against reverberation and diffuse noise.
Moderately fast with the IP or ISS algorithm.
A JD SCM can be considered as a weighted sum of rank-1 SCMs: $\sum_{m} w_{nm} \mathbf{a}_{mf} \mathbf{a}_{mf}^{\mathsf{H}}$
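To make the relation between the three spatial models concrete, the following NumPy sketch builds one SCM of each kind for a single frequency bin and checks that the JD form equals a weighted sum of rank-1 matrices. The variable names (`Q`, `g`, `a`), the microphone count, and the random values are assumptions for illustration only; none of them come from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 3  # number of microphones (assumed for illustration)

def random_complex(*shape):
    """Draw a complex array with standard normal real and imaginary parts."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

# Rank-1 SCM: outer product of a steering vector with itself.
a = random_complex(M)
H_rank1 = np.outer(a, a.conj())

# Full-rank SCM: any Hermitian positive definite M x M matrix.
B = random_complex(M, M)
H_full = B @ B.conj().T + 1e-3 * np.eye(M)

# JD SCM: Q^{-1} diag(g) Q^{-H}, where the diagonalizer Q is shared by all sources
# and g holds the nonnegative weights of one source.
Q = random_complex(M, M)
g = rng.uniform(0.1, 1.0, size=M)
Q_inv = np.linalg.inv(Q)
H_jd = Q_inv @ np.diag(g) @ Q_inv.conj().T

# The same matrix written as a weighted sum of rank-1 terms,
# one per column of Q^{-1}.
H_sum = sum(g[m] * np.outer(Q_inv[:, m], Q_inv[:, m].conj()) for m in range(M))

print(np.allclose(H_jd, H_sum))
print(np.linalg.matrix_rank(H_rank1), np.linalg.matrix_rank(H_jd))
```

Because every weight in `g` is positive here, the JD SCM keeps full rank, which is why it can still represent reverberation and diffuse noise while only needing `M` weights per source once `Q` is fixed.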
Joint training of a deep generative model and its inference model.
• We train the models by regarding them together as a "large VAE" for a multichannel mixture.
The training is performed to make the reconstruction closer to the observation.
Computationally expensive due to the full-rank SCMs, which are estimated by a heavy EM algorithm.
[Figure: block diagram with the labels multichannel mixture, inference model, latent source features, generative model, source PSD, and SCM]
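For reference, the "source PSD × SCM" structure in the figure corresponds to the local Gaussian model commonly used in FCA. The notation below is assumed here, since the slides do not define it: $\mathbf{x}_{ft}$ is the multichannel mixture STFT at frequency $f$ and frame $t$, $\lambda_{nft}$ is the PSD of source $n$, and $\mathbf{H}_{nf}$ is its SCM. Under these assumptions, a sketch of the observation model reads:

```latex
\mathbf{x}_{ft} \sim \mathcal{N}_{\mathbb{C}}\Bigl(\mathbf{0},\ \sum_{n=1}^{N} \lambda_{nft}\, \mathbf{H}_{nf}\Bigr)
```

With full-rank $\mathbf{H}_{nf}$, evaluating this density requires inverting an $M \times M$ matrix for every time-frequency bin, which is one way to see where the cost comes from.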
Speeding up neural FCA with a JD spatial model and the ISS algorithm.
We utilize the ISS algorithm [Scheibler+ 2021] in the inference model to quickly estimate SCMs.
[Figure: block diagram with the labels multichannel mixture, inference model (DNN, ISS), latent source features, JD SCM parameters, generative model, source PSD, SCM, and multichannel reconstruction]
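As a rough illustration of why ISS is cheap, the sketch below implements one generic ISS-style rank-1 update of a diagonalizer for a single frequency bin, in the form usually given for ISS-based BSS. It is not the layer used in the slides: the function name `iss_step`, the weights `w` (which stand in for whatever the DNN would provide), and the shapes are all assumptions.

```python
import numpy as np

def iss_step(Q, V, k):
    """One ISS-style rank-1 update of the diagonalizer Q along its k-th row.

    Q: (M, M) complex matrix whose m-th row is q_m^H.
    V: (M, M, M) weighted covariances; V[m] is Hermitian positive definite.
    """
    qk = Q[k].conj()  # q_k as a plain vector
    # num[m] = q_m^H V_m q_k and den[m] = q_k^H V_m q_k
    num = np.einsum("mi,mij,j->m", Q, V, qk)
    den = np.einsum("i,mij,j->m", qk.conj(), V, qk).real
    v = num / den
    v[k] = 1.0 - 1.0 / np.sqrt(den[k])
    # Rank-1 update: no matrix inversion is needed.
    return Q - np.outer(v, Q[k])

rng = np.random.default_rng(0)
M, T = 3, 50  # microphones and frames (assumed)
X = rng.standard_normal((T, M)) + 1j * rng.standard_normal((T, M))
w = rng.uniform(0.5, 2.0, size=(T, M))  # hypothetical per-frame weights
# V[m] = (1/T) * sum_t w[t, m] * x_t x_t^H
V = np.einsum("tm,ti,tj->mij", w, X, X.conj()) / T

Q = np.eye(M, dtype=complex)
for k in range(M):
    Q = iss_step(Q, V, k)
```

After a step along row `k`, the updated row satisfies `q_k^H V_k q_k = 1`, and every other updated row `m` is orthogonal to the old `q_k` under `V_m`. Each step costs only matrix-vector products, in contrast to the matrix inversions of an EM update for full-rank SCMs.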
of the VAE, the ELBO $\mathcal{L}$ is maximized by using SGD.
After training, the models are used to separate unseen mixture signals.
[Figure: block diagram with the labels multichannel mixture, inference model, latent source features, JD SCM parameters, generative model, and multichannel reconstruction]
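The reconstruction term of such an ELBO is a complex Gaussian log-likelihood of the mixture. The sketch below, for one frequency bin, shows how a JD spatial model lets this term be evaluated channel by channel after applying the diagonalizer, and compares it with the direct full-covariance computation. All names (`loglik_jd`, `loglik_full`, `lam`, `G`) and shapes are assumptions for illustration, and the KL term of the ELBO is omitted.

```python
import numpy as np

def loglik_jd(X, Q, lam, G):
    """Log-likelihood using the JD structure.

    X: (T, M) mixture frames, Q: (M, M) diagonalizer,
    lam: (N, T) source PSDs, G: (N, M) nonnegative JD weights.
    """
    T, M = X.shape
    Y = lam.T @ G   # (T, M) variances in the diagonalized domain
    Z = X @ Q.T     # Z[t] = Q x_t
    _, logabsdet = np.linalg.slogdet(Q)
    return (-T * M * np.log(np.pi) + 2 * T * logabsdet
            - np.sum(np.log(Y) + np.abs(Z) ** 2 / Y))

def loglik_full(X, Q, lam, G):
    """Same quantity computed from the explicit M x M covariance of each frame."""
    T, M = X.shape
    Q_inv = np.linalg.inv(Q)
    total = 0.0
    for t in range(T):
        d = lam[:, t] @ G
        C = Q_inv @ np.diag(d) @ Q_inv.conj().T
        _, logdet = np.linalg.slogdet(C)
        quad = (X[t].conj() @ np.linalg.solve(C, X[t])).real
        total += -M * np.log(np.pi) - logdet - quad
    return total

rng = np.random.default_rng(0)
N, T, M = 2, 20, 3  # sources, frames, microphones (assumed)
X = rng.standard_normal((T, M)) + 1j * rng.standard_normal((T, M))
Q = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
lam = rng.uniform(0.1, 1.0, size=(N, T))
G = rng.uniform(0.1, 1.0, size=(N, M))

print(np.isclose(loglik_jd(X, Q, lam, G), loglik_full(X, Q, lam, G)))
```

The JD version needs only elementwise operations per frame plus one log-determinant of `Q`, which is what makes an SGD-based training loop over many mixtures affordable.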
speech mixtures
• The simulation was almost the same as that of the spatialized WSJ0-mix dataset.
• The main difference is that the number of sources was randomly drawn between 2 and 4.
All the methods were run with a fixed number of sources (5).
• We show that our method works when only the maximum number of sources is specified.

Method | Brief description | # of iterations
MNMF [Sawada+ 2013] | Conventional linear BSS method that can resolve the frequency permutation ambiguity | 200
ILRMA [Kitamura+ 2016] | Conventional linear BSS method that can resolve the frequency permutation ambiguity | 200
FastMNMF [Sekiguchi+ 2020] | Conventional linear BSS method that can resolve the frequency permutation ambiguity | 200
Neural FCA [Bando+ 2021] | The conventional neural BSS method | 200
Neural FastFCA (proposed) | The proposed neural BSS method | Iteration-free
neural FCA to reduce the computational cost.
• JD SCMs and ISS-based layers reduced the cost to 2% of the original.
• Our method was successfully trained from mixtures with unknown numbers of sources.
Future work: joint dereverberation and separation of moving sources.
[Figure: block diagram with the labels multichannel mixture, inference model (DNN, ISS), latent source features, JD SCM parameters, generative model, source PSD, SCM, and multichannel reconstruction]