Slide 1

Slide 1 text

Neural Fast Full-Rank Spatial Covariance Analysis for Blind Source Separation
Yoshiaki Bando (1,2), Yoshiki Masuyama (1,3), Aditya Arie Nugraha (2), Kazuyoshi Yoshii (2,4)
(1) National Institute of Advanced Industrial Science and Technology (AIST); (2) Center for Advanced Intelligence Project (AIP), RIKEN; (3) Department of Computer Science, Tokyo Metropolitan University; (4) Graduate School of Informatics, Kyoto University

Slide 2

Slide 2 text

Motivation: Blind Source Separation (BSS)
Sound source separation forms the basis of machine listening systems.
• Such systems are often required to work in diverse environments.
• This calls for BSS, which can adapt to the target environment.
Applications shown on the slide:
• Distant speech recognition (DSR) [Watanabe+ 2020, Baker+ 2018]
• Sound event detection (SED) [Turpault+ 2020, Denton+ 2022]

Slide 3

Slide 3 text

Foundation of Modern BSS Methods
Probabilistic generative models of multichannel mixture signals.
• The generative model consists of a source model and a spatial model.
Source model: $s_{nft} \sim \mathcal{N}_{\mathbb{C}}(0, \lambda_{nft})$
Spatial model: $\mathbf{x}_{nft} \sim \mathcal{N}_{\mathbb{C}}(\mathbf{0}, \lambda_{nft}\mathbf{H}_{nf})$
Observed mixture: $\mathbf{x}_{ft} \sim \mathcal{N}_{\mathbb{C}}\left(\mathbf{0}, \sum_{n} \lambda_{nft}\mathbf{H}_{nf}\right)$, with $\mathbf{x}_{ft} \in \mathbb{C}^{M}$
[Figure: source signals $s_{1ft}, \dots, s_{Nft}$ are mapped to multichannel source images $\mathbf{x}_{1ft}, \dots, \mathbf{x}_{Nft}$ (channel axis $m$), which sum to the observed mixture $\mathbf{x}_{ft}$.]
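The generative model on this slide can be checked numerically. The sketch below, under toy assumptions (2 channels, 3 sources, one time-frequency bin sampled many times, randomly drawn SCMs; the helper names `sample_complex_gaussian` and `random_scm` are hypothetical), draws source images from N_C(0, λ_n H_n), sums them, and compares the empirical covariance of the mixture with Σ_n λ_n H_n.

```python
import numpy as np

def sample_complex_gaussian(cov, num_samples, rng):
    """Draw samples from a zero-mean circular complex Gaussian N_C(0, cov)."""
    M = cov.shape[0]
    L = np.linalg.cholesky(cov)  # cov = L @ L^H
    real = rng.standard_normal((M, num_samples))
    imag = rng.standard_normal((M, num_samples))
    e = (real + 1j * imag) / np.sqrt(2)  # unit-variance circular noise
    return L @ e  # shape (M, num_samples)

def random_scm(M, rng):
    """Hypothetical helper: a random Hermitian positive definite matrix."""
    A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
    return A @ A.conj().T + 1e-3 * np.eye(M)

rng = np.random.default_rng(0)
M, N, T = 2, 3, 200_000            # channels, sources, number of draws
lam = np.array([1.0, 0.5, 2.0])    # PSDs lambda_n (toy values)
H = [random_scm(M, rng) for _ in range(N)]

# Source images x_n ~ N_C(0, lambda_n H_n); the mixture is their sum.
images = [sample_complex_gaussian(lam[n] * H[n], T, rng) for n in range(N)]
x = sum(images)

empirical = x @ x.conj().T / T
model = sum(lam[n] * H[n] for n in range(N))
rel_err = np.linalg.norm(empirical - model) / np.linalg.norm(model)
print(f"relative error of the empirical covariance: {rel_err:.4f}")
```

The relative error shrinks as the number of draws grows, which is the sense in which the mixture covariance equals the sum of the source-image covariances.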

Slide 4

Slide 4 text

Geometric Interpretation of Multichannel Generative Models
Multivariate Gaussian representation of source images $\mathbf{x}_{nft} \in \mathbb{C}^{M}$:
$\mathbf{x}_{nft} \sim \mathcal{N}_{\mathbb{C}}(\mathbf{0}, \lambda_{nft}\mathbf{H}_{nf})$
• Spatial covariance matrices (SCMs) $\mathbf{H}_{nf} \in \mathbb{S}_{+}^{M \times M}$: "shape" of the ellipse
• Power spectral density (PSD) $\lambda_{nft} \in \mathbb{R}_{+}$: "size" of the ellipse
[Figure: two speakers saying "こんにちは!" (Japanese for "Hello!") and "Hello!", with arrival labels "Early" and "Late". Each source is drawn as an ellipse in the $(m_1, m_2)$ plane: $\lambda_{1ft}\mathbf{H}_{1f}$ for $n = 1$ and $\lambda_{2ft}\mathbf{H}_{2f}$ for $n = 2$.]

Slide 5

Slide 5 text

Source Models for Blind Source Separation
Source models based on low-rank approximation [Ozerov+ 2009]
• The source PSD is estimated by non-negative matrix factorization (NMF).
[Diagram: bases $u_{fk}$ × activations $v_{kt}$ → source PSD $\lambda_{ft}$ ∼ source signal $s_{ft}$]
Source models based on deep generative models [Bando+ 2018]
• The source is precisely generated by a deep neural network (DNN).
[Diagram: latent features $\mathbf{z}_{t}$ → DNN $g_{\theta,f}$ → source PSD $\lambda_{ft}$ ∼ source signal $s_{ft}$]
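A minimal sketch of the low-rank (NMF) source model, assuming the usual factorization in which the PSD is the product of non-negative bases and activations, λ_ft = Σ_k u_fk v_kt. The sizes and random values are toy assumptions; no NMF fitting is performed here, only the forward model.

```python
import numpy as np

rng = np.random.default_rng(0)
F, T, K = 6, 8, 2                  # frequency bins, frames, bases (toy sizes)

U = rng.random((F, K)) + 0.1       # non-negative bases u_fk
V = rng.random((K, T)) + 0.1       # non-negative activations v_kt
psd = U @ V                        # lambda_ft = sum_k u_fk * v_kt

# Draw a source spectrogram s_ft ~ N_C(0, lambda_ft).
noise = (rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))) / np.sqrt(2)
s = np.sqrt(psd) * noise

print("PSD shape:", psd.shape, "rank:", np.linalg.matrix_rank(psd))
```

The deep source model replaces the product `U @ V` with a DNN that maps latent features to the PSD; that part is not sketched here.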

Slide 6

Slide 6 text

Spatial Models for Blind Source Separation
Rank-1 spatial model: $\mathbf{H}_{nf} = \mathbf{a}_{nf}\mathbf{a}_{nf}^{\mathsf{H}}$
• Fast and stable with the IP [Ono+ 2011] or ISS [Scheibler+ 2021] algorithm.
• Weak against reverberation and diffuse noise.
Full-rank spatial model: $\mathbf{H}_{nf} \in \mathbb{S}_{+}^{M \times M}$
• Robust against reverberation and diffuse noise.
• Computationally expensive due to its EM or MU algorithm.
Jointly-diagonalizable (JD) spatial model: $\mathbf{H}_{nf} \triangleq \mathbf{Q}_{f}^{-1} \operatorname{diag}(\mathbf{w}_{n}) \mathbf{Q}_{f}^{-\mathsf{H}}$
• Still robust against reverberation and diffuse noise.
• Moderately fast with the IP or ISS algorithm.
• Can be considered as $\sum_{m} w_{nm} \mathbf{a}_{fm}\mathbf{a}_{fm}^{\mathsf{H}}$.
[Figures: ellipses in the $(m_1, m_2)$ plane illustrating the spatial models.]
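The equivalence between the JD form and the weighted sum of rank-1 terms can be verified directly. The sketch below assumes that the vectors a_fm are the columns of Q_f^{-1}; the matrix size and the random values are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 4                                                   # number of channels
Q = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))  # Q_f
w = rng.random(M) + 0.1                                 # non-negative weights w_n

Q_inv = np.linalg.inv(Q)
H_jd = Q_inv @ np.diag(w) @ Q_inv.conj().T              # Q^{-1} diag(w) Q^{-H}

# Weighted sum of rank-1 matrices built from the columns of Q^{-1}.
H_sum = sum(w[m] * np.outer(Q_inv[:, m], Q_inv[:, m].conj()) for m in range(M))

# Applying Q on both sides recovers the diagonal matrix diag(w).
D = Q @ H_jd @ Q.conj().T

print("JD form equals rank-1 sum:", np.allclose(H_jd, H_sum))
print("Q diagonalizes the SCM:   ", np.allclose(D, np.diag(w)))
```

Because all M weights are positive, the resulting SCM is full rank even though every source shares the same set of rank-1 components.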

Slide 7

Slide 7 text

Neural Full-Rank Spatial Covariance Analysis (Neural FCA)
Joint training of a deep generative model and its inference model.
• We train the models by regarding them together as a "large VAE" for a multichannel mixture.
• The training is performed to make the reconstruction closer to the observation.
• Computationally expensive due to the full-rank SCMs, which are estimated by a heavy EM algorithm.
[Diagram: multichannel mixture → inference model → latent source features → generative model (source PSD × SCM for each source) → multichannel reconstruction]

Slide 8

Slide 8 text

Deep Source Model + JD Spatial Model → Neural FastFCA
Speeding up neural FCA with a JD spatial model and the ISS algorithm [Scheibler+ 2021].
• We use the ISS algorithm in the inference model to quickly estimate the SCMs.
[Diagram: multichannel mixture → inference model (DNN and ISS) → latent source features and JD SCM parameters → generative model (source PSD × SCM for each source) → multichannel reconstruction]

Slide 9

Slide 9 text

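Generative Model of Mixture Signals
The full-rank SCMs $\mathbf{H}_{nf}$ are replaced by the JD SCMs $\mathbf{Q}_{f}^{-1} \operatorname{diag}(\mathbf{w}_{n}) \mathbf{Q}_{f}^{-\mathsf{H}}$:
$\mathbf{x}_{ft} \sim \mathcal{N}_{\mathbb{C}}\left(\mathbf{0}, \mathbf{Q}_{f}^{-1} \left(\sum_{n} g_{\theta,f}(\mathbf{z}_{nt}) \operatorname{diag}(\mathbf{w}_{n})\right) \mathbf{Q}_{f}^{-\mathsf{H}}\right)$
[Diagram: generative model. Each source PSD is multiplied by its JD SCM, $\mathbf{Q}_{f}^{-1} \operatorname{diag}(\mathbf{w}_{1}) \mathbf{Q}_{f}^{-\mathsf{H}}, \dots, \mathbf{Q}_{f}^{-1} \operatorname{diag}(\mathbf{w}_{N}) \mathbf{Q}_{f}^{-\mathsf{H}}$, and the products are summed to form the multichannel reconstruction.]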
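One way to see why the JD form is cheap: after multiplying the mixture by Q_f, the covariance is diagonal, so the Gaussian log-likelihood needs no M×M inversion per time-frequency bin. The sketch below compares a naive evaluation with the diagonalized one for a single bin. The random values stand in for the DNN outputs g_θ,f(z_nt) and are assumptions; the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 4, 3                                             # channels, sources
Q = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))  # Q_f
W = rng.random((N, M)) + 0.1       # rows are the weight vectors w_n
lam = rng.random(N) + 0.1          # stand-in for the PSDs g_theta,f(z_nt)
x = rng.standard_normal(M) + 1j * rng.standard_normal(M)  # one mixture bin

y = lam @ W                        # y_m = sum_n lambda_n * w_nm

# Naive evaluation: build the full covariance and solve with it.
Q_inv = np.linalg.inv(Q)
cov = Q_inv @ np.diag(y) @ Q_inv.conj().T
_, logdet_cov = np.linalg.slogdet(cov)
quad = np.real(x.conj() @ np.linalg.solve(cov, x))
ll_naive = -M * np.log(np.pi) - logdet_cov - quad

# Diagonalized evaluation: only element-wise operations on Q x.
xq = Q @ x
_, logabsdet_Q = np.linalg.slogdet(Q)
ll_fast = (-M * np.log(np.pi) + 2 * logabsdet_Q
           - np.sum(np.log(y) + np.abs(xq) ** 2 / y))

print("log-likelihoods agree:", np.isclose(ll_naive, ll_fast))
```

The term `2 * logabsdet_Q` depends only on the frequency bin, so it is shared across all frames.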

Slide 10

Slide 10 text

Inference Model Integrating DNN and ISS-Based Blocks
The inference model estimates the parameters of the generative model.
• The ISS algorithm is used to quickly estimate $\mathbf{Q}_{f}$ from $\mathbf{x}_{ft}$ and the mask $\mathbf{m}_{\phi,ft}$.
• Each DNN uses an intermediate diagonalization result for its estimate.
[Figure: inference model. DNN(0) is followed by B blocks (1st to B-th), each pairing ISS(b) with DNN(b), and by a final 1×1 convolution. The blocks exchange intermediate masks, diagonalization matrices, and diagonalized mixtures; the final outputs are $\boldsymbol{\omega}_{\phi}$, $\boldsymbol{\mu}_{\phi}$, and $\boldsymbol{\sigma}^{2}_{\phi}$.]
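For intuition about the ISS-based blocks, here is a sketch of one sweep of the standard ISS rank-1 update at a single frequency bin. It is not the authors' exact layer: the variances `R` are random placeholders for whatever the DNN masks provide, and the function names are hypothetical. Each step updates Q ← Q − v q_k^H, which never requires a matrix inversion and does not increase the cost Σ_m mean_t(|y_mt|² / r_mt) − 2 log|det Q| for fixed variances.

```python
import numpy as np

def iss_sweep(Q, X, R):
    """One ISS sweep. Q: (M, M) matrix, X: (M, T) mixture, R: (M, T) variances."""
    M = Q.shape[0]
    Q = Q.copy()
    Y = Q @ X                                          # diagonalized mixture
    for k in range(M):
        yk = Y[k]
        num = np.mean(Y * yk.conj() / R, axis=1)       # q_m^H V_m q_k for all m
        den = np.mean(np.abs(yk) ** 2 / R, axis=1)     # q_k^H V_m q_k for all m
        v = num / den
        v[k] = 1.0 - 1.0 / np.sqrt(den[k])             # special case m == k
        Q = Q - np.outer(v, Q[k])                      # rank-1 update of Q
        Y = Y - np.outer(v, yk)                        # same update applied to Y
    return Q, Y

def cost(Q, X, R):
    """Negative log-likelihood (up to a constant) for fixed variances R."""
    Y = Q @ X
    return np.sum(np.mean(np.abs(Y) ** 2 / R, axis=1)) - 2 * np.linalg.slogdet(Q)[1]

rng = np.random.default_rng(0)
M, T = 3, 500
X = rng.standard_normal((M, T)) + 1j * rng.standard_normal((M, T))
R = rng.random((M, T)) + 0.5          # placeholder variances (assumption)
Q0 = np.eye(M, dtype=complex)

Q1, Y1 = iss_sweep(Q0, X, R)
c0, c1 = cost(Q0, X, R), cost(Q1, X, R)
print("cost did not increase:", c1 <= c0 + 1e-9)
```

Because the update is a fixed sequence of differentiable element-wise operations, it can in principle be unrolled inside a network, which is consistent with the block structure described on the slide.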

Slide 11

Slide 11 text

Training Based on Autoencoding Variational Bayes
As in the training of a VAE, the ELBO $\mathcal{L}$ is maximized by using SGD.
After training, the models are used to separate unseen mixture signals.
[Diagram: multichannel mixture → inference model $\phi$ → latent source features and JD SCM parameters → generative model $\theta$ → multichannel reconstruction]
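The slide names the ELBO without writing it out. For reference, the generic VAE form is given below, with $\mathbf{X}$ denoting the multichannel mixture and $\mathbf{Z}$ the latent source features. This notation and the unspecified prior $p(\mathbf{Z})$ are assumptions; the exact objective used by the authors may contain additional terms.

```latex
\mathcal{L}(\theta, \phi)
  = \mathbb{E}_{q_{\phi}(\mathbf{Z} \mid \mathbf{X})}
      \left[ \log p_{\theta}(\mathbf{X} \mid \mathbf{Z}) \right]
  - \mathrm{KL}\left( q_{\phi}(\mathbf{Z} \mid \mathbf{X}) \,\|\, p(\mathbf{Z}) \right)
```

The first term rewards reconstructions that are close to the observed mixture, and the second keeps the inferred latent features close to the prior.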

Slide 12

Slide 12 text

Experimental Condition: Speech Separation
Evaluation was performed with simulated 8-channel speech mixtures.
• The simulation was almost the same as for the spatialized WSJ0-mix dataset.
• The main difference is that the number of sources was randomly drawn between 2 and 4.
All methods were run with a fixed number of sources (5).
• We show that our method works when only the maximum number of sources is specified.

Method | Brief description | # of iters.
MNMF [Sawada+ 2013] | Conventional linear BSS method that can resolve the frequency permutation ambiguity | 200
ILRMA [Kitamura+ 2016] | Conventional linear BSS method that can resolve the frequency permutation ambiguity | 200
FastMNMF [Sekiguchi+ 2020] | Conventional linear BSS method that can resolve the frequency permutation ambiguity | 200
Neural FCA [Bando+ 2021] | The conventional neural BSS method | 200
Neural FastFCA (Proposed) | The proposed neural BSS method | Iteration free

Slide 13

Slide 13 text

Experimental Results: Average Separation Performance
Neural FastFCA outperformed the conventional BSS methods in all the metrics and was slightly better than neural FCA in SDR and STOI.

Method | SDR | PESQ | STOI
MNMF | 7.5 | 1.49 | 0.76
ILRMA | 7.0 | 1.43 | 0.76
FastMNMF | 9.3 | 1.60 | 0.80
Neural FCA | 11.1 | 1.88 | 0.84
Neural FastFCA | 11.6 | 1.85 | 0.85

Slide 14

Slide 14 text

Experimental Results: Elapsed Time for Inference
The elapsed time was drastically reduced compared with neural FCA thanks to the JD spatial model and the ISS-based inference model (53x faster).

Elapsed time for separating a 5-second mixture using an NVIDIA V100 GPU:
Method | Elapsed time [s]
MNMF | 2.07
ILRMA | 1.36
FastMNMF | 1.81
Neural FCA | 4.77
Neural FastFCA | 0.09
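The "53x faster" figure follows from the two elapsed times reported on this slide. The short calculation below assumes that 4.77 s belongs to neural FCA and 0.09 s to Neural FastFCA, which is the only pairing of the reported values that yields a factor of 53.

```python
t_neural_fca = 4.77        # seconds per 5-second mixture (assumed pairing)
t_neural_fastfca = 0.09    # seconds per 5-second mixture (assumed pairing)

speedup = t_neural_fca / t_neural_fastfca
cost_ratio = t_neural_fastfca / t_neural_fca

print(f"{speedup:.0f}x faster, {100 * cost_ratio:.1f}% of the original time")
# prints "53x faster, 1.9% of the original time"
```

The second number matches the roughly 2% cost quoted in the conclusion.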

Slide 15

Slide 15 text

Experimental Results: Performance at Each # of Sources
Neural FastFCA was successfully trained from mixtures of unknown numbers of sources by specifying their maximum number.

SDR for each number of sources N:
Method | N=2 | N=3 | N=4
MNMF | 13.0 | 8.3 | 3.9
ILRMA | 13.2 | 7.7 | 3.2
FastMNMF | 15.3 | 10.1 | 5.3
Neural FCA | 16.4 | 12.2 | 7.2
Neural FastFCA | 17.4 | 12.7 | 7.5

Slide 16

Slide 16 text

Conclusion: Neural Fast Full-Rank Spatial Covariance Analysis
An extension of neural FCA that reduces the computational cost.
• JD SCMs and ISS-based layers reduced the cost to 2% of the original.
• Our method was successfully trained from mixtures with unknown numbers of sources.
Future work: joint dereverberation and separation of moving sources.
[Diagram: multichannel mixture → inference model (DNN and ISS) → latent source features and JD SCM parameters → generative model (source PSD × SCM) → multichannel reconstruction]