Slide 1

Foundations of Blind Source Separation and Its Advances in Spatial Self-Supervised Learning

Yoshiaki Bando
National Institute of Advanced Industrial Science and Technology (AIST), Japan
Center for Advanced Intelligence Project (AIP), RIKEN, Japan

Slide 2

Blind Source Separation (BSS)

Sound source separation forms the basis of machine listening systems.
• Such systems are often required to work in diverse environments.
• This calls for BSS, which can adapt blindly to the target environment.

Example applications:
• Distant speech recognition (DSR) [Watanabe+ 2020, Baker+ 2018]
• Sound event detection (SED) [Turpault+ 2020, Denton+ 2022]

Slide 3

Foundation of Modern BSS Methods

Probabilistic generative models of multichannel mixture signals.
• The generative model consists of a source model and a spatial model.
• Source model: $s_{nft} \sim \mathcal{N}_{\mathbb{C}}(0, \lambda_{nft})$
• Spatial model: $\mathbf{x}_{nft} \sim \mathcal{N}_{\mathbb{C}}(0, \lambda_{nft}\mathbf{H}_{nf})$
• Observed mixture: $\mathbf{x}_{ft} \sim \mathcal{N}_{\mathbb{C}}(0, \sum_n \lambda_{nft}\mathbf{H}_{nf})$, where $\mathbf{x}_{ft} \in \mathbb{C}^M$ stacks the $M$ microphone channels.
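The two-layer model above can be sketched numerically. Below is a minimal NumPy sketch (toy sizes; all parameter values and the helper name `sample_mixture` are hypothetical) that draws a mixture $\mathbf{x}_{ft}$ by summing source images drawn from $\mathcal{N}_{\mathbb{C}}(0, \lambda_{nft}\mathbf{H}_{nf})$:

```python
import numpy as np

rng = np.random.default_rng(0)
F, T, N, M = 4, 6, 2, 3  # frequency bins, time frames, sources, microphones

# Hypothetical parameters: source PSDs lambda_{nft} > 0 and
# Hermitian positive-definite spatial covariance matrices H_{nf}.
lam = rng.gamma(2.0, 1.0, size=(N, F, T))
A = rng.normal(size=(N, F, M, M)) + 1j * rng.normal(size=(N, F, M, M))
H = A @ A.conj().swapaxes(-1, -2) + 1e-3 * np.eye(M)

def sample_mixture(lam, H, rng):
    """Draw x_{ft} ~ N_C(0, sum_n lambda_{nft} H_{nf}) by summing
    circularly-symmetric complex Gaussian source images x_{nft}."""
    N, F, T = lam.shape
    M = H.shape[-1]
    x = np.zeros((F, T, M), dtype=complex)
    for n in range(N):
        for f in range(F):
            L = np.linalg.cholesky(H[n, f])  # H = L L^H
            for t in range(T):
                w = (rng.normal(size=M) + 1j * rng.normal(size=M)) / np.sqrt(2)
                x[f, t] += np.sqrt(lam[n, f, t]) * (L @ w)  # cov = lam * H
    return x

x = sample_mixture(lam, H, rng)
```

Summing the per-source images is equivalent to drawing directly from the mixture distribution with covariance $\sum_n \lambda_{nft}\mathbf{H}_{nf}$, since independent Gaussians add covariances.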

Slide 4

Geometric Interpretation of Multichannel Generative Models

Multivariate Gaussian representation of source images $\mathbf{x}_{nft} \in \mathbb{C}^M$:
$\mathbf{x}_{nft} \sim \mathcal{N}_{\mathbb{C}}(0, \lambda_{nft}\mathbf{H}_{nf})$
• Spatial covariance matrix (SCM) $\mathbf{H}_{nf} \in \mathbb{S}_+^{M \times M}$: the "shape" of the ellipse
• Power spectral density (PSD) $\lambda_{nft} \in \mathbb{R}_+$: the "size" of the ellipse

(Figure: Gaussian ellipses for two talkers, $n = 1$ ("Hello!") and $n = 2$, plotted in the $m_1$–$m_2$ channel plane with early and late reflections; each ellipse has size $\lambda_{nft}$ and shape $\mathbf{H}_{nf}$.)

Slide 5

Spatial Models for Blind Source Separation

Rank-1 spatial model: $\mathbf{H}_{nf} = \mathbf{a}_{nf}\mathbf{a}_{nf}^{\mathsf{H}}$
• Fast and stable thanks to the IP [Ono+ 2011] or ISS [Scheibler+ 2020] algorithm
• Weak against reverberation and diffuse noise

Full-rank spatial model [Duong+ 2010, Yoshii+ 2013]: $\mathbf{H}_{nf} \in \mathbb{S}_+^{M \times M}$
• Robust against reverberation and diffuse noise
• Computationally expensive due to its EM or MU algorithm

Jointly-diagonalizable (JD) spatial model: $\mathbf{H}_{nf} \triangleq \mathbf{Q}_f^{-1}\,\mathrm{diag}(\mathbf{w}_n)\,\mathbf{Q}_f^{-\mathsf{H}}$
• Can be regarded as $\sum_m w_{nm}\,\mathbf{a}_{fm}\mathbf{a}_{fm}^{\mathsf{H}}$
• Still robust against reverberation and diffuse noise
• Moderately fast thanks to the IP or ISS algorithm
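The defining property of the JD model is that one matrix $\mathbf{Q}_f$ diagonalizes all source SCMs at a frequency simultaneously. A small NumPy check of this property (random hypothetical parameters):

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 3, 2  # microphones, sources

# Hypothetical diagonalizer Q_f (invertible), shared by all sources at one
# frequency, and nonnegative weights w_{nm} for each source n.
Q = rng.normal(size=(M, M)) + 1j * rng.normal(size=(M, M))
W = rng.gamma(2.0, 1.0, size=(N, M))

Qinv = np.linalg.inv(Q)
H = np.stack([Qinv @ np.diag(w) @ Qinv.conj().T for w in W])  # JD SCMs

# The single matrix Q diagonalizes every H_n simultaneously:
D = np.stack([Q @ H[n] @ Q.conj().T for n in range(N)])
```

Because each $\mathbf{Q}_f\mathbf{H}_{nf}\mathbf{Q}_f^{\mathsf{H}}$ is diagonal, covariance inversions in the separation filter reduce to elementwise divisions, which is where the speedup over the full-rank model comes from.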

Slide 6

Source Model Based on Low-Rank Approximation

Source power spectral densities (PSDs) often have low-rank structure.
• The source PSD is estimated by non-negative matrix factorization (NMF) [Ozerov+ 2009]:
  $s_{ft} \sim \mathcal{N}_{\mathbb{C}}(0, \sum_k u_{kf} v_{kt})$, i.e., $\lambda_{ft} = \sum_k u_{kf} v_{kt}$ with bases $u_{kf}$ and activations $v_{kt}$.
• Its inference is fast and does not require supervised pre-training.

Is there a more powerful representation of source spectra?
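As a concrete sketch, the low-rank PSD model $\lambda_{ft} = \sum_k u_{kf} v_{kt}$ can be fitted with the standard multiplicative updates for the Itakura-Saito divergence (an illustrative NumPy implementation under that assumption, not the exact algorithm of [Ozerov+ 2009]; the helper names are hypothetical):

```python
import numpy as np

def is_nmf(P, K=2, n_iter=100, rng=None):
    """Fit lambda_{ft} = sum_k u_{kf} v_{kt} to a power spectrogram P
    by multiplicative updates minimizing the Itakura-Saito divergence,
    a common choice for power spectra."""
    rng = rng or np.random.default_rng(0)
    F, T = P.shape
    U = rng.gamma(2.0, 1.0, size=(F, K))  # bases u_{kf}
    V = rng.gamma(2.0, 1.0, size=(K, T))  # activations v_{kt}
    for _ in range(n_iter):
        lam = U @ V
        U *= ((P / lam**2) @ V.T) / ((1.0 / lam) @ V.T)
        lam = U @ V
        V *= (U.T @ (P / lam**2)) / (U.T @ (1.0 / lam))
    return U, V

def is_div(P, lam):
    """Itakura-Saito divergence between P and the model lam."""
    r = P / lam
    return float(np.sum(r - np.log(r) - 1.0))
```

Multiplicative updates keep $U$ and $V$ non-negative by construction, which is why no supervised pre-training is needed: the factorization is re-estimated on the observed spectrogram itself.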

Slide 7

Source Model Based on Deep Generative Models

Source spectra are represented with low-dimensional latent feature vectors.
• A DNN $g_{\theta,f}$ is used to generate the source power spectral density (PSD) precisely:
  $z_{td} \sim \mathcal{N}(0, 1)$, $\;\; s_{ft} \mid \mathbf{z}_t \sim \mathcal{N}_{\mathbb{C}}(0, g_{\theta,f}(\mathbf{z}_t))$
• Frequency-independent latent features help us solve the frequency permutation ambiguity.

Y. Bando, et al., "Statistical speech enhancement based on probabilistic integration of variational autoencoder and non-negative matrix factorization," IEEE ICASSP, pp. 716–720, 2018.

Slide 8

Various BSS Methods Combining Spatial and Source Models

• NMF source model:
  – Rank-1 spatial model: Independent Low-Rank Matrix Analysis (ILRMA) [Kitamura+ 2016]
  – Full-rank spatial model: Multichannel NMF (MNMF) [Sawada+ 2013]
  – JD spatial model: FastMNMF [Ito+ 2019, Sekiguchi+ 2019]
• VAE source model (supervised):
  – Rank-1 spatial model: Multichannel VAE (MVAE) [Kameoka+ 2018]
  – Full-rank spatial model: Generalized MVAE (GMVAE) [Seki+ 2019]
• VAE source model (unsupervised):
  – Full-rank spatial model: Neural Full-rank Spatial Covariance Analysis (Neural FCA) [Bando+ 2021]
  – JD spatial model: Neural FastFCA [Bando+ 2023]

※ In this talk, we often refer to GMVAE simply as MVAE.

Slide 9

Contents

Two applications of deep source generative models:
1. Semi-supervised speech enhancement
 • We enhance speech signals by training on clean speech signals only.
 • Combination of a deep speech model and a low-rank noise model
2. Self-supervised source separation
 • We train a neural source separation model only from multichannel mixtures.
 • Joint training of the source generative model and its inference model
3. Extensions of self-supervised training for real-world understanding
 • Handling moving sources / speeding up training & inference
 • Application to joint speech separation and diarization

Slide 10

Part 1: Multichannel Speech Enhancement Based on a Supervised Deep Source Model

• K. Sekiguchi, Y. Bando, A. A. Nugraha, K. Yoshii, T. Kawahara, "Semi-supervised Multichannel Speech Enhancement with a Deep Speech Prior," IEEE/ACM TASLP, 2019
• K. Sekiguchi, A. A. Nugraha, Y. Bando, K. Yoshii, "Fast Multichannel Source Separation Based on Jointly Diagonalizable Spatial Covariance Matrices," EUSIPCO, 2019
• Y. Bando, M. Mimura, K. Itoyama, K. Yoshii, T. Kawahara, "Statistical Speech Enhancement Based on Probabilistic Integration of Variational Autoencoder and Non-negative Matrix Factorization," IEEE ICASSP, 2018

Slide 11

Speech Enhancement

A task to extract speech signals from a mixture of speech and noise.
• Various applications such as DSR, search-and-rescue, and hearing aids.

Robustness against various acoustic environments is essential.
• It is often difficult to assume in advance the environment where such systems are used.

(Images: CC0, https://pxhere.com/ja/photo/1234569 and https://pxhere.com/ja/photo/742585)

Slide 12

Semi-Supervised Enhancement with a Deep Speech Prior

A hybrid method combining a deep speech model and a statistical noise model:
• Large clean-speech corpora are available → deep speech prior, pre-trained on a speech corpus
• Noise training data are often scarce → statistical noise prior with a low-rank model, estimated on the fly

Observed noisy speech ≈ deep speech prior + statistical noise prior

Slide 13

Supervised Training of the Deep Speech Prior (DP)

Training is based on a variational autoencoder (VAE) [Kingma+ 2013].
• An encoder $q_\phi(\mathbf{Z} \mid \mathbf{S})$ is introduced to estimate latent features $\mathbf{Z}$ from clean speech $\mathbf{S}$; a decoder $p_\theta(\mathbf{S} \mid \mathbf{Z})$ reconstructs the speech.

The objective function is the evidence lower bound (ELBO) $\mathcal{L}_{\theta,\phi}$:
$\mathcal{L}_{\theta,\phi} = \mathbb{E}_{q_\phi}\left[\log p_\theta(\mathbf{S} \mid \mathbf{Z})\right] - \mathcal{D}_{\mathrm{KL}}\left(q_\phi(\mathbf{Z} \mid \mathbf{S}) \,\|\, p(\mathbf{Z})\right)$
• The first term is the reconstruction term (IS divergence); the second is the regularization term (KL divergence).
• Training is performed by making the reconstruction closer to the observation.
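The two ELBO terms can be sketched in NumPy for a Gaussian encoder and the IS-divergence decoder used here (a sketch of the loss terms only, not the full training loop; the helper names are hypothetical):

```python
import numpy as np

def kl_to_std_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ):
    the ELBO regularization term for a Gaussian encoder."""
    return 0.5 * float(np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar))

def is_reconstruction(s_pow, lam):
    """Negative log-likelihood (up to a constant) of s_{ft} ~ N_C(0, lambda_{ft})
    given the observed power |s_{ft}|^2: the Itakura-Saito reconstruction term."""
    r = s_pow / lam
    return float(np.sum(r - np.log(r) - 1.0))
```

Both terms vanish exactly when the encoder matches the prior and the decoder PSD matches the observed power, so minimizing their sum drives the reconstruction toward the observation while keeping the latent space close to $\mathcal{N}(0, \mathbf{I})$.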

Slide 14

FastMNMF with a Deep Speech Prior (FastMNMF-DP)

A unified generative model combining the VAE-based speech model, the NMF-based noise model, and the jointly-diagonalizable (JD) spatial model.
• VAE-based speech model: latent features $z_{td}$ → speech PSD $\lambda_{1ft}$ via a DNN
• NMF-based noise model: noise PSDs $\lambda_{nft}$ from bases $u_{kf}$ and activations $v_{kt}$
• JD spatial model: SCMs $\mathbf{H}_{nf} = \mathbf{Q}_f^{-1}\,\mathrm{diag}(\mathbf{g}_{nf})\,\mathbf{Q}_f^{-\mathsf{H}}$
• The speech image $\mathbf{x}_{1ft}$ and noise images $\mathbf{x}_{nft}$ sum to the noisy observation $\mathbf{x}_{ft}$.

Slide 15

Monte Carlo Expectation-Maximization (MC-EM) Inference

Speech and noise are separated by estimating the model parameters.
• E-step: sample latent features from the posterior $\mathbf{z}_t \sim p(\mathbf{z}_t \mid \mathbf{X})$.
  – Metropolis-Hastings sampling is used because the posterior is intractable.
• M-step: update the other parameters to maximize $\log p(\mathbf{X} \mid \mathbf{Q}, \tilde{\mathbf{H}}, \mathbf{U}, \mathbf{V})$.
  – $\mathbf{Q}$ is updated by the iterative projection (IP) algorithm [Ono+ 2011].
  – $\tilde{\mathbf{H}}, \mathbf{U}, \mathbf{V}$ are updated by the multiplicative update (MU) algorithm [Nakano+ 2010].

The speech signal is finally obtained by multichannel Wiener filtering:
1) domain transformation, 2) time-frequency masking, 3) projection back.
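The E-step sampler can be illustrated with a generic random-walk Metropolis-Hastings routine, which needs only the unnormalized log posterior (an illustrative sketch; the function name and step size are hypothetical, not the exact sampler of the paper):

```python
import numpy as np

def metropolis_hastings(log_p, z0, n_steps=5000, step=0.5, rng=None):
    """Random-walk Metropolis-Hastings: draw samples from a distribution
    given only its unnormalized log density, as needed for the intractable
    posterior p(z_t | X) in the E-step."""
    rng = rng or np.random.default_rng(0)
    z = np.atleast_1d(np.asarray(z0, dtype=float))
    lp = log_p(z)
    samples = []
    for _ in range(n_steps):
        z_prop = z + step * rng.normal(size=z.shape)
        lp_prop = log_p(z_prop)
        # Accept with probability min(1, p(z_prop) / p(z)).
        if np.log(rng.uniform()) < lp_prop - lp:
            z, lp = z_prop, lp_prop
        samples.append(z.copy())
    return np.array(samples)

# Example: sample a standard Gaussian from its unnormalized log density.
samples = metropolis_hastings(lambda z: -0.5 * float(z @ z), np.zeros(1))
```

In MC-EM, averages over these samples replace the intractable posterior expectations in the M-step objective.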

Slide 16

Experimental Condition

We evaluated with a part of the CHiME-3 noisy speech dataset.
• 100 utterances from the CHiME-3 evaluation set
• Each utterance was recorded by a 6-channel* microphone array on a tablet device.
• The CHiME-3 dataset includes four noise environments: on a bus, in a cafeteria, in a pedestrian area, and at a street junction.
  (http://spandh.dcs.shef.ac.uk/chime_challenge/CHiME4/data.html)

Evaluation metrics:
• Source-to-distortion ratio (SDR) [dB] for enhancement performance
• Computational time [ms] for the efficiency of the method

*We omitted the one microphone on the back of the tablet.

Slide 17

Enhancement Performance in SDRs

DP consistently improved the SDRs of FastMNMF and MNMF.
• The JD full-rank model was better than the full-rank and rank-1 models.

Average SDR [dB] over 100 utterances (method, source model, spatial model):
• FastMNMF-DP [Sekiguchi+ 2019] (DP + NMF, JD full-rank): 18.9
• FastMNMF [Sekiguchi+ 2019] (NMF, JD full-rank): 16.8
• MNMF-DP [Sekiguchi+ 2019] (DP + NMF, full-rank): 18.6
• MNMF [Sawada+ 2013] (NMF, full-rank): 13.2
• ILRMA [Kitamura+ 2016] (NMF, rank-1): 15.1

Slide 18

Computational Times for Speech Enhancement

Although DP slightly increased the computational cost, FastMNMF-DP was much faster than MNMF.

Computational time [ms] for an 8-second signal* (method, source model, spatial model):
• FastMNMF-DP [Sekiguchi+ 2019] (DP + NMF, JD full-rank): 78
• FastMNMF [Sekiguchi+ 2019] (NMF, JD full-rank): 40
• MNMF-DP [Sekiguchi+ 2019] (DP + NMF, full-rank): 710
• MNMF [Sawada+ 2013] (NMF, full-rank): 660
• ILRMA [Kitamura+ 2016] (NMF, rank-1): 10

*Evaluation was performed on an NVIDIA TITAN RTX.

Slide 19

Excerpts of Enhancement Results

(Audio examples: observation, clean speech, ILRMA, FastMNMF-DP)

Slide 20

Part 2: Self-Supervised Learning of a Deep Source Generative Model and Its Inference Model

• Y. Bando, K. Sekiguchi, Y. Masuyama, A. A. Nugraha, M. Fontaine, K. Yoshii, "Neural Full-rank Spatial Covariance Analysis for Blind Source Separation," IEEE SP Letters, 2021

Slide 21

Source Separation Based on Multichannel VAEs (MVAEs)

Deep source generative models have achieved excellent separation performance [Kameoka+ 2018, Seki+ 2019].
• The latent features $\mathbf{z}_{nt}$ and SCMs $\mathbf{H}_{nf}$ are estimated at inference time to maximize the likelihood.
• The generative model produces a multichannel reconstruction from the latent source features, source PSDs, and SCMs.

Can the deep source models be trained only from mixture signals?

Slide 22

Self-Supervised Training of the Deep Source Model

The generative model is trained jointly with its inference model.
• We regard the pair as one "large VAE" for a multichannel mixture: the inference model encodes the mixture into latent source features, and the generative model decodes them (source PSDs × SCMs) into a multichannel reconstruction.

Training is performed by making the reconstruction closer to the observed mixture.

Slide 23

Training Based on Auto-Encoding Variational Bayes

As in VAE training, the ELBO $\mathcal{L}$ is maximized by SGD.
• Inference model: minimize $\mathcal{D}_{\mathrm{KL}}\left(q(\mathbf{Z} \mid \mathbf{X}) \,\|\, p(\mathbf{Z} \mid \mathbf{X}, \mathbf{H})\right)$.
• Generative model: maximize $p(\mathbf{X} \mid \mathbf{H})$ with an EM update rule for the SCMs.
• Our training can be considered as BSS performed over all the training mixtures.

Slide 24

Solving the Frequency Permutation Ambiguity

We solve the ambiguity by making the latent vectors $\mathbf{z}_{1t}, \ldots, \mathbf{z}_{Nt}$ independent.
• Bad: if each "source" shares the same content, the latent vectors have a LARGE correlation.
• Good: if each source has different content, the latent vectors have a SMALL correlation.

The KL term weight $\beta$ is set to a large value for the first several epochs.
• $q(\mathbf{Z} \mid \mathbf{X})$ approaches the standard Gaussian distribution (no correlation between sources).
• This disentangles the latent features, as in the β-VAE.

Slide 25

Relations Between Neural FCA and Existing Methods

Neural FCA is a NEURAL and BLIND source separation method: self-supervised training of the deep source generative model.
• Linear blind source separation: IVA [Ono+ 2011], MNMF [Ozerov+ 2009, Sawada+ 2013], ILRMA [Kitamura+ 2015], FastMNMF [Sekiguchi+ 2019, Ito+ 2019]
• Neural (semi-)supervised source separation:
  – Neural source models: MVAE [Kameoka+ 2018], IDLMA [Mogami+ 2018], DNN-MSS [Nugraha+ 2016], FastMNMF-DP [Sekiguchi+ 2018, Leglaive+ 2019]
  – Neural spatial models: NF-IVA [Nugraha+ 2020], NF-FastMNMF [Nugraha+ 2022]
• Neural blind source separation: Neural FCA (proposed)

Slide 26

Experimental Condition

Evaluation with the spatialized WSJ0-2mix dataset.
• 4-ch mixture signals of two speech sources with RT60 = 200–600 ms
• All mixture signals were dereverberated in advance by using WPE.

Methods (and whether a frequency permutation solver is required):
• cACGMM [Ito+ 2016]: conventional linear BSS (for determined conditions) — required
• FCA [Duong+ 2010]: conventional linear BSS — required
• FastMNMF2 [Sekiguchi+ 2020]: conventional linear BSS — free
• Pseudo supervised [Togami+ 2020]: a DNN imitates the MWF of BSS (FCA) results — required
• Neural cACGMM [Drude+ 2019]: a DNN is trained to maximize the log-marginal likelihood of the cACGMM — required
• MVAE [Seki+ 2019]: the supervised counterpart of our neural FCA
• Neural FCA (proposed): our neural blind source separation method — free

Slide 27

Experimental Results with SDRs

Neural FCA outperformed the conventional BSS methods and the neural unsupervised methods, and was comparable to the supervised MVAE.

SDR [dB] (higher is better):
• cACGMM: 10.8
• FCA: 12.7
• FastMNMF2: 13.0
• Pseudo supervised: 14.7
• Neural cACGMM: 12.4
• Neural FCA: 15.2
• MVAE (random init.): 2.9
• MVAE (FCA init.): 15.2

Slide 28

Excerpts of Separation Results

(Audio examples: mixture input, FastMNMF, MVAE (supervised), Neural FCA)
*More separation examples: https://ybando.jp/projects/spl2021

Slide 29

Part 3: Toward Real-World Understanding via Spatial Self-Supervised Learning

• H. Munakata, Y. Bando, R. Takeda, K. Komatani, M. Onishi, "Joint Separation and Localization of Moving Sound Sources Based on Neural Full-Rank Spatial Covariance Analysis," IEEE SP Letters, 2023
• Y. Bando, Y. Masuyama, A. A. Nugraha, K. Yoshii, "Neural Fast Full-Rank Spatial Covariance Analysis for Blind Source Separation," EUSIPCO, 2023
• Y. Bando, T. Nakamura, S. Watanabe, "Neural Blind Source Separation and Diarization for Distant Speech Recognition," accepted to INTERSPEECH 2024

Slide 30

Extension 1: Separation of Moving Sound Sources

BSS methods usually assume that sources are (almost) stationary.
• Many everyday sound sources move (e.g., walking persons, wild animals, cars, …).
• All sources move relative to the microphone if the microphone itself moves (e.g., on mobile robots).

Slide 31

Time-Varying (TV) Neural FCA

Joint source localization and separation for tracking moving sources.
• The inference model estimates latent spectral features $\mathbf{z}_{nt}$ (separation) and time-varying DoAs $\mathbf{u}_{nt}$ (localization) from the multichannel mixture.
• The localization results are constrained to be smooth by a moving average.
• The time-varying SCMs $\mathbf{H}_{nft}$ of the generative model are then regularized by the smoothed localization results.

Slide 32

Training on Mixtures of Two Moving Speech Sources

TV Neural FCA performed well regardless of the source velocity.
• FastMNMF2 and Neural FCA degraded drastically when sources move fast.
• TV Neural FCA improved the average SDR by 4.2 dB over DoA-HMM [Higuchi+ 2014].

(Chart: SDR [dB] of TV-Neural FCA, Neural FCA, FastMNMF, and DoA-HMM, averaged and per velocity range: 0–15°/s, 15–30°/s, 30–45°/s.)

Slide 33

Separation Results of Moving Sound Sources

Our method can be trained from mixtures of moving sources.
• Robustness against real audio recordings was improved.

(Audio examples: FastMNMF vs. TV Neural FCA under stationary and moving conditions)

Slide 34

Extension 2: Speeding Up Neural FCA

Iterative estimation of the full-rank SCMs is computationally demanding.
• The neural models are jointly trained to maximize the likelihood over the training mixtures, but the SCMs are estimated with a heavy EM algorithm.
• Training on 30 hours of 8-ch data (spatialized WSJ0-2mix) requires 400 GPU hours on an NVIDIA V100.

Slide 35

Deep Source Model + JD Spatial Model → Neural FastFCA

Speeding up neural FCA with a JD spatial model and the ISS algorithm [Scheibler+ 2021].
• We incorporate the ISS algorithm into the inference model to quickly estimate the JD SCM parameters alongside the DNN-based source model.

Slide 36

Generative Model of Mixture Signals Based on a JD Spatial Model

A multichannel mixture is generated by a local Gaussian model with JD SCMs:
$\mathbf{x}_{ft} \sim \mathcal{N}_{\mathbb{C}}\!\left(0,\; \mathbf{Q}_f^{-1} \textstyle\sum_n g_{\theta,f}(\mathbf{z}_{nt})\,\mathrm{diag}(\mathbf{w}_n)\,\mathbf{Q}_f^{-\mathsf{H}}\right)$

This likelihood can be simplified with $\tilde{\mathbf{x}}_{ft} \triangleq \mathbf{Q}_f \mathbf{x}_{ft}$ and $\tilde{y}_{ftm} \triangleq \sum_n w_{nm}\, g_{\theta,f}(\mathbf{z}_{nt})$ as:
$\log p_\theta(\mathbf{X} \mid \mathbf{Q}, \mathbf{W}, \mathbf{Z}) = 2T \sum_f \log |\det \mathbf{Q}_f| - \sum_{f,t,m} \left[ \log \tilde{y}_{ftm} + \dfrac{|\tilde{x}_{ftm}|^2}{\tilde{y}_{ftm}} \right] + \mathrm{const.}$
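The simplified likelihood can be evaluated directly in NumPy. The sketch below (hypothetical helper name) computes $\log p_\theta(\mathbf{X} \mid \mathbf{Q}, \mathbf{W}, \mathbf{Z})$ from the diagonalized PSDs, costing $O(M)$ per time-frequency bin instead of the $O(M^3)$ dense-covariance evaluation:

```python
import numpy as np

def jd_log_likelihood(X, Q, Y):
    """log p(X | Q, W, Z) for the JD local Gaussian model, up to the additive
    constant -F*T*M*log(pi).  X: (F, T, M) mixture STFT, Q: (F, M, M)
    diagonalizers, Y: (F, T, M) diagonalized mixture PSDs
    y~_{ftm} = sum_n w_{nm} g_{theta,f}(z_{nt})."""
    F, T, M = X.shape
    Xt = np.einsum('fij,ftj->fti', Q, X)  # x~_{ft} = Q_f x_{ft}
    ll = 2 * T * np.sum(np.log(np.abs(np.linalg.det(Q))))
    ll -= np.sum(np.log(Y) + np.abs(Xt) ** 2 / Y)
    return float(ll)
```

Because the JD covariance is $\mathbf{Q}_f^{-1}\mathrm{diag}(\tilde{\mathbf{y}}_{ft})\mathbf{Q}_f^{-\mathsf{H}}$, its log-determinant and quadratic form reduce to sums over the $M$ diagonal entries after the change of variables $\tilde{\mathbf{x}}_{ft} = \mathbf{Q}_f\mathbf{x}_{ft}$.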

Slide 37

Inference Model Integrating DNN- and ISS-Based Blocks

The inference model estimates the parameters of the generative model.
• The ISS algorithm is involved to quickly estimate $\mathbf{Q}_f$ from $\mathbf{x}_{ft}$ and masks $\mathbf{m}_{\phi,ft}$.
• The model stacks $B$ blocks, each consisting of a DNN followed by an ISS step; each DNN utilizes the intermediate diagonalization result $\tilde{\mathbf{x}}_{ft}^{(b)}$ for its estimate.
• A final 1×1 convolution outputs the source parameters ($\boldsymbol{\omega}_\phi$, $\boldsymbol{\mu}_\phi$, $\boldsymbol{\sigma}_\phi^2$).
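The ISS step inside each block can be sketched as one sweep of rank-one updates of a demixing matrix, in the style of [Scheibler+ 2020] (an illustrative NumPy re-implementation under AuxIVA-style per-source weights, not the exact network block; all names are hypothetical):

```python
import numpy as np

def iss_sweep(W, X, R):
    """One sweep of iterative source steering (ISS): rank-one updates of the
    demixing matrix W (M, M) given mixture STFT frames X (M, T) and
    per-source nonnegative weights R (M, T), e.g. estimated source PSDs.
    No matrix inversion is needed, unlike the IP update."""
    M, T = X.shape
    for k in range(M):
        v = np.zeros(M, dtype=complex)
        for m in range(M):
            # Weighted covariance V_m = (1/T) sum_t x_t x_t^H / r_{m,t}
            Vm = (X / R[m]) @ X.conj().T / T
            b = (W[k] @ Vm @ W[k].conj()).real  # w_k V_m w_k^H
            if m == k:
                v[m] = 1.0 - b ** -0.5
            else:
                v[m] = (W[m] @ Vm @ W[k].conj()) / b
        W = W - np.outer(v, W[k])  # rank-one update: w_m <- w_m - v_m w_k
    return W

rng = np.random.default_rng(0)
M, T = 2, 64
X = rng.normal(size=(M, T)) + 1j * rng.normal(size=(M, T))
R = rng.gamma(2.0, 1.0, size=(M, T))
W1 = iss_sweep(np.eye(M, dtype=complex), X, R)
```

Each rank-one update leaves the updated row normalized ($\mathbf{w}_k \mathbf{V}_k \mathbf{w}_k^{\mathsf{H}} = 1$) and decorrelates the other rows from it, which is what makes a few stacked ISS blocks an effective drop-in for the iterative EM estimation of the SCMs.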

Slide 38

Experimental Condition: Speech Separation

Evaluation was performed with simulated 8-ch speech mixtures.
• The simulation was almost the same as for the spatialized WSJ0-mix dataset.
• The main difference is that the number of sources was randomly drawn between 2 and 4.
• All mixtures were dereverberated in advance by using the WPE method.

All methods were run with a fixed number (5) of sources.
• We show that our method can work when only the maximum number of sources is specified.

Methods (and number of iterations):
• MNMF [Sawada+ 2013], ILRMA [Kitamura+ 2016], FastMNMF [Sekiguchi+ 2020]: conventional linear BSS methods that can solve the frequency permutation ambiguity — 200 iterations
• Neural FCA [Bando+ 2021]: the conventional neural BSS method — 200 iterations
• Neural FastFCA (proposed): the proposed neural BSS method — iteration-free

Slide 39

Experimental Results: Average Separation Performance

Our method outperformed the conventional BSS methods in all metrics, and was slightly better than neural FCA in SDR and STOI.
• SDR [dB]: MNMF 7.5, ILRMA 7.0, FastMNMF 9.3, Neural FCA (fixed z) 8.9, Neural FCA 11.1, Neural FastFCA 11.6
• PESQ: MNMF 1.49, ILRMA 1.43, FastMNMF 1.60, Neural FCA (fixed z) 1.71, Neural FCA 1.88, Neural FastFCA 1.85
• STOI: MNMF 0.76, ILRMA 0.76, FastMNMF 0.80, Neural FCA (fixed z) 0.79, Neural FCA 0.84, Neural FastFCA 0.85

Slide 40

Experimental Results: Elapsed Time for Inference

On the other hand, the elapsed time was drastically improved over neural FCA (53× faster) thanks to the JD spatial model and the ISS-based inference model.

Elapsed time [s] for separating a 5-second mixture on an NVIDIA V100 GPU:
• MNMF: 2.07
• ILRMA: 1.36
• FastMNMF: 1.81
• Neural FCA (fixed z): 2.67
• Neural FCA: 4.77
• Neural FastFCA: 0.09

Slide 41

Extension 3: Front End of Distant Speech Recognition (DSR)

It is essential for DSR to extract speech sources from noisy mixtures with a dynamically changing number of active speakers.
• Single-speaker DSR (e.g., smart speakers) has achieved excellent performance (e.g., the CHiME-3 and -4 challenges).
• Multi-speaker DSR (e.g., home parties) is still a challenging problem (e.g., the CHiME-5 to -8 challenges).

(https://spandh.dcs.shef.ac.uk//chime_challenge/chime2015/overview.html, https://spandh.dcs.shef.ac.uk//chime_challenge/CHiME5/overview.html)

Slide 42

Proposed Method: Neural FCA with Speaker Activity (FCASA)

Multi-task learning of self-supervised separation and supervised diarization.
• The inference model additionally estimates frame-level source activities $u_{1t}, u_{2t}, \ldots, u_{Nt}$ (diarization).
• The source activities mask the source PSDs in the generative model, which reconstructs the multichannel mixture as in neural FCA (separation).

Slide 43

Demo: Separation and Diarization of an English Conversation

Training only on 80 hours of 8-ch mixtures and diarization annotations (AMI)
→ Inference on a recording of a real chat captured with our microphone array

Slide 44

Demo: Separation and Diarization of a Japanese Conversation

Since neural FCASA involves the ISS algorithm, it is reasonably robust against a language mismatch between the training and inference data.

Slide 45

Conclusion

Two applications of deep source generative models:
1. Semi-supervised speech enhancement → FastMNMF-DP
2. Self-supervised source separation → Neural FCA

Future work:
• Handling an unknown number of sources
• Training neural FCA on diverse real audio recordings

Source model recap: $z_{td} \sim \mathcal{N}(0, 1)$, $\;\; s_{ft} \mid \mathbf{z}_t \sim \mathcal{N}_{\mathbb{C}}(0, g_{\theta,f}(\mathbf{z}_t))$