Yoshiaki Bando (1,2), Yoshiki Masuyama (1,3), Aditya Arie Nugraha (2), Kazuyoshi Yoshii (2,4)
(1) National Institute of Advanced Industrial Science and Technology (AIST); (2) Center for Advanced Intelligence Project (AIP), RIKEN; (3) Department of Computer Science, Tokyo Metropolitan University; (4) Graduate School of Informatics, Kyoto University

basis of machine listening systems.
• Such systems are often required to work in diverse environments.
• This calls for BSS that can adapt to the target environment.
Distant speech recognition (DSR) [Watanabe+ 2020, Baker+ 2018]
Sound event detection (SED) [Turpault+ 2020, Denton+ 2022]
Neural Fast Full-Rank Spatial Covariance Analysis for Blind Source Separation

low-rank approximation [Ozerov+ 2009]
• Source PSD is estimated by non-negative matrix factorization (NMF).
[Figure: source signal $s_{ft}$ ∼ source PSD $\lambda_{ft} = \sum_k u_{fk} v_{kt}$ (bases × activations)]
Source models based on deep generative models [Bando+ 2018]
• Source is precisely generated by a deep neural network (DNN).
[Figure: source signal $s_{ft}$ ∼ source PSD $\lambda_{ft}$, obtained by the DNN $g_{\theta,f}$ from latent features $z_t$]
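The two source models above can be sketched side by side. This is a minimal NumPy illustration, not the authors' implementation: the sizes `F`, `T`, `K`, `D`, the one-hidden-layer decoder standing in for $g_{\theta,f}$, and the circular complex Gaussian used in `sample_source` are all assumptions that the slide does not state.

```python
import numpy as np

rng = np.random.default_rng(0)
F, T, K, D = 8, 10, 3, 4  # freq bins, frames, NMF bases, latent dim (illustrative sizes)

# NMF source model: PSD = bases (F x K) times activations (K x T), all nonnegative.
U = rng.uniform(0.1, 1.0, size=(F, K))
V = rng.uniform(0.1, 1.0, size=(K, T))
psd_nmf = U @ V  # lambda_{ft} = sum_k u_{fk} v_{kt}

# DNN source model: a hypothetical one-hidden-layer decoder standing in for g_theta.
Z = rng.standard_normal((D, T))               # latent features, one column per frame
W1 = 0.1 * rng.standard_normal((16, D))
b1 = np.zeros(16)
W2 = 0.1 * rng.standard_normal((F, 16))
b2 = np.zeros(F)
hidden = np.tanh(W1 @ Z + b1[:, None])
psd_dnn = np.exp(W2 @ hidden + b2[:, None])   # exp keeps the PSD strictly positive

def sample_source(psd, rng):
    """Draw s_{ft} from a zero-mean circular complex Gaussian with variance psd (assumed)."""
    scale = np.sqrt(psd / 2.0)
    noise = rng.standard_normal(psd.shape) + 1j * rng.standard_normal(psd.shape)
    return scale * noise

s_nmf = sample_source(psd_nmf, rng)
s_dnn = sample_source(psd_dnn, rng)
```

In both cases only the way the PSD is parameterized differs; the sampling step is the same.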

$= \mathbf{a}_{nf} \mathbf{a}_{nf}^{\mathsf{H}}$
• Fast and stable by the IP [Ono+ 2011] or ISS [Scheibler+ 2021] algorithm.
• Weak against reverberation and diffuse noise.
Full-rank spatial model: $\mathbf{H}_{nf} \in \mathbb{S}^{M \times M}$
• Robust against reverberation and diffuse noise.
• Computationally expensive due to its EM or MU algorithm.
Jointly-diagonalizable (JD) spatial model: $\mathbf{H}_{nf} \triangleq \mathbf{Q}_f^{-1} \operatorname{diag}(\mathbf{w}_n) \mathbf{Q}_f^{-\mathsf{H}}$
• Still robust against reverberation and diffuse noise.
• Moderately fast by the IP or ISS algorithm.
• Can be considered as $\sum_m w_{nm} \mathbf{a}_{fm} \mathbf{a}_{fm}^{\mathsf{H}}$.
[Figure: illustrations of each spatial model with microphones $m_1$ and $m_2$]
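The claim that a JD SCM can be seen as a weighted sum of rank-1 SCMs can be checked numerically. The sketch below assumes that $\mathbf{a}_{fm}$ is the $m$-th column of $\mathbf{Q}_f^{-1}$ (an interpretation, not stated on the slide) and uses random, hypothetical values for `Q` and `w` at a single source and frequency.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 3  # number of microphones (illustrative)

# Hypothetical diagonalizer Q_f and nonnegative weights w_n for one source and frequency.
Q = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
w = rng.uniform(0.1, 1.0, size=M)

Q_inv = np.linalg.inv(Q)

# Form 1: H = Q^{-1} diag(w) Q^{-H}
H_jd = Q_inv @ np.diag(w) @ Q_inv.conj().T

# Form 2: weighted sum of rank-1 SCMs, with a_m taken as the m-th column of Q^{-1}.
H_sum = sum(w[m] * np.outer(Q_inv[:, m], Q_inv[:, m].conj()) for m in range(M))

jd_forms_match = np.allclose(H_jd, H_sum)
is_hermitian = np.allclose(H_jd, H_jd.conj().T)
min_eig = np.linalg.eigvalsh(H_jd).min()  # positive when Q is invertible and w > 0
```

With strictly positive weights and an invertible `Q`, the resulting SCM is full rank, which is consistent with the robustness claimed for the JD model.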

Joint training of a deep generative model and its inference model.
• We train the models regarding them as a “large VAE” for a multichannel mixture.
• The training is performed to make the reconstruction closer to the observation.
Computationally expensive due to the full-rank SCMs.
[Figure: the inference model maps the multichannel mixture to latent source features; the generative model maps them to source PSDs, which are multiplied by SCMs estimated by a heavy EM algorithm]

Speeding up neural FCA with a JD spatial model and the ISS algorithm.
We utilize the ISS algorithm [Scheibler+ 2021] in the inference model to quickly estimate SCMs.
[Figure: the inference model (DNN + ISS) maps the multichannel mixture to latent source features and JD SCM parameters; the generative model maps the latent source features to source PSDs, which are multiplied by the SCMs to form the multichannel reconstruction]
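To illustrate why a JD spatial model is cheaper than full-rank SCMs, the sketch below evaluates the likelihood of one time-frequency bin in two ways. It assumes a zero-mean complex Gaussian mixture model with covariance $\sum_n \lambda_n \mathbf{H}_n$ (an assumption; the slide does not state the distribution), and all values of `Q`, `W`, `lam`, and `x` are random placeholders. It does not implement the ISS algorithm itself.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 3, 2  # microphones, sources (illustrative)

Q = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))  # hypothetical Q_f
W = rng.uniform(0.1, 1.0, size=(N, M))   # row n holds w_n
lam = rng.uniform(0.5, 2.0, size=N)      # source PSDs at one time-frequency bin
x = rng.standard_normal(M) + 1j * rng.standard_normal(M)            # observed mixture

def loglik_full(x, Q, W, lam):
    """Direct evaluation: build the M x M mixture covariance, then solve and take logdet."""
    Q_inv = np.linalg.inv(Q)
    Y = sum(lam[n] * Q_inv @ np.diag(W[n]) @ Q_inv.conj().T for n in range(len(lam)))
    _, logdet = np.linalg.slogdet(Y)
    quad = np.real(x.conj() @ np.linalg.solve(Y, x))
    return -len(x) * np.log(np.pi) - logdet - quad

def loglik_jd(x, Q, W, lam):
    """JD evaluation: after projecting x by Q, the covariance is diagonal."""
    y = lam @ W                      # diagonal variances, shape (M,)
    qx = Q @ x                       # projected observation
    _, logabsdet_q = np.linalg.slogdet(Q)
    return (-len(x) * np.log(np.pi) + 2.0 * logabsdet_q
            - np.sum(np.log(y)) - np.sum(np.abs(qx) ** 2 / y))

ll_full = loglik_full(x, Q, W, lam)
ll_jd = loglik_jd(x, Q, W, lam)
```

The two values agree, but the JD form needs only elementwise operations per bin once `Q @ x` and the log-determinant of `Q` are available, whereas the direct form inverts an M × M matrix for every bin.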

of the VAE, the ELBO $\mathcal{L}$ is maximized by using SGD.
After training, the models are used to separate unseen mixture signals.
[Figure: the inference model $\phi$ maps the multichannel mixture to latent source features and JD SCM parameters; the generative model $\theta$ produces the multichannel reconstruction]

speech mixtures
• The simulation was almost the same as the spatialized WSJ0-mix dataset.
• The main difference is that the # of sources was randomly drawn between 2 and 4.
All the methods were run with a fixed # (5) of sources specified.
• We show that our method can work with only the max. # of sources specified.

Method | Brief description | # of iters.
MNMF [Sawada+ 2013], ILRMA [Kitamura+ 2016], FastMNMF [Sekiguchi+ 2020] | Conventional linear BSS methods that can solve the frequency permutation ambiguity | 200
Neural FCA [Bando+ 2021] | The conventional neural BSS method | 200
Neural FastFCA (Proposed) | The proposed neural BSS method | Iteration free

neural FCA to reduce the computational cost.
• JD SCMs and ISS-based layers reduced the cost to 2% of the original.
• Our method was successfully trained from mixtures w/ unknown #s of sources.
Future work: Joint dereverberation and separation of moving sources.
[Figure: overview of the inference model (DNN + ISS) and the generative model, from multichannel mixture to multichannel reconstruction]