INTERSPEECH 2023 T5 Part 4: Source Separation Based on Deep Source Generative Models and Its Self-Supervised Learning

The slides used for Part 4 of INTERSPEECH 2023 Tutorial T5: Foundations, Extensions and Applications of Statistical Multichannel Speech Separation Models.

Source Separation Based on Deep Source Generative Models and Its Self-Supervised Learning
Yoshiaki Bando
National Institute of Advanced Industrial Science and Technology (AIST), Japan
Center for Advanced Intelligence Project (AIP), RIKEN, Japan
T5: Foundations, Extensions and Applications of Statistical Multichannel Speech Separation Models, INTERSPEECH 2023, Dublin, Ireland

Sound source separation forms the basis of machine listening systems.
• Such systems are often required to work in diverse environments.
• This calls for blind source separation (BSS), which can adapt to the target environment.
Example applications: distant speech recognition (DSR) [Watanabe+ 2020, Baker+ 2018] and sound event detection (SED) [Turpault+ 2020, Denton+ 2022]
Source Separation Based on Deep Generative Models and Its Self-Supervised Learning (slide 2/33)

Source Model Based on Low-Rank Approximation
Source power spectral density (PSD) often has a low-rank structure.
• The source PSD is estimated by non-negative matrix factorization (NMF) [Ozerov+ 2009].
• Its inference is fast and does not require supervised pre-training.
• Source model: s_{fn} ∼ N_c(0, Σ_k u_{kf} v_{kn}), where s_{fn} is the source signal, u_{kf} are the spectral bases, and v_{kn} are their temporal activations.
Is there a more powerful representation of source spectra?
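The low-rank PSD approximation above can be sketched with plain multiplicative updates. This is a minimal illustration with the Euclidean cost rather than the complex Gaussian (Itakura-Saito) likelihood used in the statistical model; all matrix sizes and the synthetic data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
F, N, K = 64, 100, 4  # frequency bins, time frames, NMF bases

# Synthetic nonnegative "PSD" matrix with an exact low-rank structure
X = rng.random((F, K)) @ rng.random((K, N)) + 1e-6

# Random nonnegative initialization
U = rng.random((F, K)) + 1e-6   # bases u_{kf}
V = rng.random((K, N)) + 1e-6   # activations v_{kn}

def mu_step(X, U, V, eps=1e-12):
    """One multiplicative update for Euclidean NMF (monotone non-increasing cost)."""
    V *= (U.T @ X) / (U.T @ U @ V + eps)
    U *= (X @ V.T) / (U @ (V @ V.T) + eps)
    return U, V

err0 = np.linalg.norm(X - U @ V)
for _ in range(200):
    U, V = mu_step(X, U, V)
err1 = np.linalg.norm(X - U @ V)
```

No supervised pre-training is involved: the bases and activations are fitted directly to the observed spectrogram, which is what makes this model attractive for unseen environments.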

Source Model Based on Deep Generative Model
Source spectra are represented with low-dimensional latent feature vectors.
• A DNN is used to generate the source power spectral density (PSD) precisely.
• Frequency-independent latent features help us solve the frequency permutation ambiguity.
• Generative model: z_n ∼ N(0, I) and s_{fn} | z_n ∼ N_c(0, λ_f(z_n)), where the PSD λ_f(z_n) is produced by the DNN decoder.
Y. Bando, et al., "Statistical speech enhancement based on probabilistic integration of variational autoencoder and non-negative matrix factorization," IEEE ICASSP, pp. 716-720, 2018.
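A minimal numpy sketch of this generative model, assuming a tiny two-layer decoder with random (untrained) weights; in practice the decoder is a DNN trained on clean speech, and all sizes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, F = 16, 32, 64  # latent dim, hidden units, frequency bins

# Hypothetical decoder weights (a trained network would replace these)
W1, b1 = rng.standard_normal((H, D)) * 0.1, np.zeros(H)
W2, b2 = rng.standard_normal((F, H)) * 0.1, np.zeros(F)

def decode_psd(z):
    """Map a latent vector z to a nonnegative per-frequency source PSD lambda(z)."""
    h = np.tanh(W1 @ z + b1)
    return np.log1p(np.exp(W2 @ h + b2))  # softplus keeps the PSD strictly positive

z = rng.standard_normal(D)           # z ~ N(0, I) prior
lam = decode_psd(z)                  # lambda_f(z) for one frame
# Sample a complex source spectrum s_f | z ~ N_c(0, lambda_f(z))
s = (rng.standard_normal(F) + 1j * rng.standard_normal(F)) * np.sqrt(lam / 2)
```

Because z is low-dimensional and shared across all frequencies, each source's spectrum is tied together over frequency, which is what resolves the per-frequency permutation ambiguity.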

Contents
Two applications of deep source generative models:
1. Semi-supervised speech enhancement
• We enhance speech signals by training only on clean speech signals.
• Combination of a deep speech model and low-rank noise models.
2. Self-supervised source separation
• We train a neural source separation model only from multichannel mixtures.
• Joint training of the source generative model and its inference model.

Multichannel Speech Enhancement Based on Supervised Deep Source Model
• K. Sekiguchi, Y. Bando, A. A. Nugraha, K. Yoshii, T. Kawahara, "Semi-supervised Multichannel Speech Enhancement with a Deep Speech Prior," IEEE/ACM TASLP, 2019
• K. Sekiguchi, A. A. Nugraha, Y. Bando, K. Yoshii, "Fast Multichannel Source Separation Based on Jointly Diagonalizable Spatial Covariance Matrices," EUSIPCO, 2019
• Y. Bando, M. Mimura, K. Itoyama, K. Yoshii, T. Kawahara, "Statistical Speech Enhancement Based on Probabilistic Integration of Variational Autoencoder and Nonnegative Matrix Factorization," IEEE ICASSP, 2018

Speech Enhancement
A task to extract speech signals from a mixture of speech and noise.
• Various applications such as DSR, search-and-rescue, and hearing aids.
Robustness against various acoustic environments is essential.
• It is often difficult to assume in advance the environment where such systems will be used.

Semi-Supervised Enhancement with Deep Speech Prior
A hybrid method combining a deep speech model and a statistical noise model.
• Large clean speech corpora are available → a deep speech prior, pre-trained on a speech corpus.
• Noise training data are often scarce → a statistical noise prior with a low-rank model, estimated on the fly.
Observed noisy speech = deep speech prior + statistical noise prior.

Supervised Training of Deep Speech Prior (DP)
The training is based on a variational autoencoder (VAE) [Kingma+ 2013].
• An encoder q_φ(z | s) is introduced to estimate the latent features z from clean speech s.
• A decoder p_θ(s | z) reconstructs the speech spectrum from the latent features.
The objective function is the evidence lower bound (ELBO):
  L(θ, φ) = E_{q_φ(z|s)}[log p_θ(s | z)] − D_KL(q_φ(z | s) || p(z))
The first term is the reconstruction term (Itakura-Saito divergence) and the second is the regularization term (KL divergence). The training is performed by making the reconstructed speech closer to the observation.

Monte-Carlo Expectation-Maximization (MC-EM) Inference
Speech and noise are separated by estimating the model parameters.
• The E-step samples the latent features from their posterior, z_n ∼ p(z_n | X); Metropolis-Hastings sampling is utilized because this posterior is intractable.
• The M-step updates the other parameters to maximize the log-likelihood:
  - The spatial parameters are updated by the iterative-projection (IP) algorithm [Ono+ 2011].
  - The remaining PSD parameters are updated by the multiplicative-update (MU) algorithm [Nakano+ 2010].
The speech signal is finally obtained by multichannel Wiener filtering, i.e., the posterior mean of the speech image given the estimated speech and noise PSDs and spatial covariances.
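The final multichannel Wiener filtering step can be sketched for a single time-frequency bin; the PSD values and spatial covariance matrices below are hypothetical stand-ins for the quantities estimated by MC-EM.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 4  # number of microphones

# Hypothetical estimates for one (f, n) bin
lam_s, lam_n = 2.0, 0.5                   # speech / noise PSDs
G_s = np.eye(M, dtype=complex)            # speech spatial covariance
G_n = 0.1 * np.eye(M, dtype=complex)      # noise spatial covariance

x = rng.standard_normal(M) + 1j * rng.standard_normal(M)  # observed mixture

# Multichannel Wiener filter: posterior mean of the speech image,
# s_hat = lam_s G_s (lam_s G_s + lam_n G_n)^{-1} x
Y = lam_s * G_s + lam_n * G_n             # mixture covariance
s_hat = lam_s * G_s @ np.linalg.inv(Y) @ x
```

By construction the speech and noise estimates sum back to the observation, so the filter distributes the mixture energy according to the estimated per-source covariances.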

Experimental Condition
We evaluated with a part of the CHiME-3 noisy speech dataset:
• 100 utterances from the CHiME-3 evaluation set.
• Each utterance was recorded by a 6-channel* microphone array on a tablet device.
• The CHiME-3 dataset includes four noise environments: on a bus, in a cafeteria, in a pedestrian area, and at a street junction.
Evaluation metrics:
• Source-to-distortion ratio (SDR) [dB] for evaluating enhancement performance.
• Computational time [ms] for evaluating the efficiency of each method.
http://spandh.dcs.shef.ac.uk/chime_challenge/CHiME4/data.html
*We omitted the microphone on the back of the tablet.

Computational Times for Speech Enhancement
Although the DP slightly increased the computational cost, FastMNMF-DP was much faster than MNMF.
  Method                         | Source model | Spatial model
  FastMNMF-DP [Sekiguchi+ 2019]  | DP + NMF     | Jointly diagonalizable (JD) full-rank
  FastMNMF [Sekiguchi+ 2019]     | NMF          | JD full-rank
  MNMF-DP [Sekiguchi+ 2019]      | DP + NMF     | Full-rank
  MNMF [Sawada+ 2013]            | NMF          | Full-rank
  ILRMA [Kitamura+ 2016]         | NMF          | Rank-1
[Bar chart: computational time [ms] for an 8-second signal; reported values are 10, 40, 78, 660, and 710 ms across the five methods.]
*Evaluation was performed with an NVIDIA TITAN RTX.

Excerpts of Enhancement Results
[Audio examples: observation, clean speech, ILRMA, and FastMNMF-DP.]

Self-Supervised Learning of Deep Source Generative Model and Its Inference Model
• Y. Bando, K. Sekiguchi, Y. Masuyama, A. A. Nugraha, M. Fontaine, K. Yoshii, "Neural full-rank spatial covariance analysis for blind source separation," IEEE SP Letters, 2021
• Y. Bando, T. Aizawa, K. Itoyama, K. Nakadai, "Weakly-supervised neural full-rank spatial covariance analysis for a front-end system of distant speech recognition," INTERSPEECH, 2022
• H. Munakata, Y. Bando, R. Takeda, K. Komatani, M. Onishi, "Joint Separation and Localization of Moving Sound Sources Based on Neural Full-Rank Spatial Covariance Analysis," IEEE SP Letters, 2023

Source Separation Based on Multichannel VAEs (MVAEs)
Deep source generative models achieved excellent performance [Kameoka+ 2018, Seki+ 2019].
• The latent source features z_{sn} and the spatial covariance matrices (SCMs) G_{sf} are estimated to maximize the likelihood function at inference time.
• The generative model produces the multichannel reconstruction from the source PSDs and SCMs.
Can the deep source models be trained only from mixture signals?
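The likelihood being maximized is a local complex Gaussian whose covariance is the PSD-weighted sum of the source SCMs. A minimal sketch for one time-frequency bin (names and sizes are illustrative, not the papers' notation):

```python
import numpy as np

def mixture_loglik(x, lams, Gs):
    """log N_c(x; 0, Y) with Y = sum_s lam_s * G_s, for one time-frequency bin.
    x: (M,) complex observation; lams: per-source PSDs; Gs: per-source SCMs."""
    Y = sum(lam * G for lam, G in zip(lams, Gs))
    M = len(x)
    _, logdet = np.linalg.slogdet(Y)          # Y is Hermitian PSD, det is real > 0
    quad = np.conj(x) @ np.linalg.inv(Y) @ x  # Mahalanobis-style quadratic form
    return float(np.real(-M * np.log(np.pi) - logdet - quad))

# Example: single source with unit PSD and identity SCM
x = np.array([1.0 + 0.0j, 0.5j])
ll = mixture_loglik(x, [1.0], [np.eye(2, dtype=complex)])
```

MVAE-style inference repeatedly adjusts z_{sn} (hence the PSDs) and G_{sf} to increase this log-likelihood summed over all bins.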

Self-Supervised Training of Deep Source Model
The generative model is trained jointly with its inference model.
• We train the models by regarding them as one "large VAE" for a multichannel mixture: the inference model extracts latent source features from the mixture, and the generative model produces the multichannel reconstruction from the source PSDs and SCMs.
• The training is performed to make the reconstruction closer to the observation.

Training Based on Autoencoding Variational Bayes
As in the training of a VAE, the ELBO L is maximized by using SGD.
• Maximizing L simultaneously maximizes the log-marginal likelihood log p(X) and minimizes D_KL(q(Z | X) || p(Z | X)), mirroring the EM update rule.
• Our training can therefore be considered as BSS performed jointly over all the training mixtures.

Solving Frequency Permutation Ambiguity
We solve the ambiguity by making the latent vectors z_{1n}, ..., z_{Sn} of the sources independent.
• If the separated sources share the same content, their latent vectors have a LARGE correlation.
• If each source has different content, the latent vectors have a SMALL correlation.
The KL-term weight β is set to a large value for the first several epochs:
• The posterior approaches the standard Gaussian distribution (no correlation between sources).
• This disentangles the latent features, as in the β-VAE.
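The β-weighting schedule can be sketched as a simple step function; the specific values of beta_max and warmup_epochs below are illustrative, not the paper's settings.

```python
def kl_weight(epoch, beta_max=10.0, warmup_epochs=5):
    """KL-term weight: large for the first epochs (pushes the posterior toward
    the standard Gaussian prior, decorrelating the sources), then back to 1."""
    return beta_max if epoch < warmup_epochs else 1.0

def beta_elbo_loss(recon_cost, kl_cost, epoch):
    """beta-VAE-style objective to minimize: reconstruction + beta * KL."""
    return recon_cost + kl_weight(epoch) * kl_cost
```

Early in training the heavy KL penalty keeps the latent vectors near the independent prior, so the model cannot assign the same content to two sources; once the sources are disentangled, the weight is relaxed so the reconstruction term dominates.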

Relations Between Neural FCA and Existing Methods
Neural FCA is a DEEP & BLIND source separation method: self-supervised training of the deep source generative model.
• Linear blind source separation: MNMF [Ozerov+ 2009, Sawada+ 2013], ILRMA [Kitamura+ 2015], FastMNMF [Sekiguchi+ 2019, Ito+ 2019], IVA [Ono+ 2011]
• Deep (semi-)supervised source separation (deep source models): MVAE [Kameoka+ 2018], FastMNMF-DP [Sekiguchi+ 2018, Leglaive+ 2019], IDLMA [Mogami+ 2018], DNN-MSS [Nugraha+ 2016]
• Deep spatial models: NF-IVA [Nugraha+ 2020], NF-FastMNMF [Nugraha+ 2022]
• Deep blind source separation: Neural FCA (proposed)

Experimental Condition
Evaluation with the spatialized WSJ0-2mix dataset:
• 4-channel mixture signals of two speech sources with RT60 = 200-600 ms.
• All mixture signals were dereverberated in advance by using WPE.
  Method                       | Brief description                                                  | Permutation solver
  cACGMM [Ito+ 2016]           | Conventional linear BSS method (for determined conditions)         | Required
  FCA [Duong+ 2010]            | Conventional linear BSS method (for determined conditions)         | Required
  FastMNMF2 [Sekiguchi+ 2020]  | Conventional linear BSS method (for determined conditions)         | Free
  Pseudo-supervised [Togami+ 2020] | A DNN imitates the MWF of BSS (FCA) results                    | Required
  Neural cACGMM [Drude+ 2019]  | A DNN is trained to maximize the log-marginal likelihood of the cACGMM | Required
  MVAE [Seki+ 2019]            | The supervised version of our neural FCA                           | N/A
  Neural FCA (proposed)        | Our neural blind source separation method                          | Free

Excerpts of Separation Results
[Audio examples: mixture input, FastMNMF, MVAE (supervised), and Neural FCA.]
*More separation examples: https://ybando.jp/projects/spl2021

Extension 1: Front-End System of Multi-Speaker DSR
It is essential for DSR to separate target speech sources from mixture recordings distorted by reverberation and overlapping speech.
• Single-speaker DSR (e.g., smart speakers; CHiME-3 and 4 challenges) has achieved excellent performance.
• Multi-speaker DSR (e.g., home parties; CHiME-5 and 6 challenges) is still a challenging problem.
https://spandh.dcs.shef.ac.uk//chime_challenge/chime2015/overview.html
https://spandh.dcs.shef.ac.uk//chime_challenge/CHiME5/overview.html

Weakly-Supervised Neural FCA for DSR
A variable number of speech sources should be handled in real conversations.
• We introduce temporal voice activities u_{sn} ∈ {0, 1} into neural FCA, gating each source's PSD by its activity.
• Speech sources: high degrees of freedom in the latent space, but limited time activity.
• Noise source(s): low degrees of freedom in the latent space, but always active.
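The activity gating amounts to a broadcast multiplication of each source's PSD by its binary activity track. A small sketch with hypothetical shapes and hand-set activities:

```python
import numpy as np

rng = np.random.default_rng(0)
S, F, N = 3, 64, 50  # sources, frequency bins, time frames

lam = rng.random((S, F, N)) + 1e-6   # source PSDs from the decoder (hypothetical)
u = np.zeros((S, N))                 # voice activities u_{sn} in {0, 1}
u[0, 10:30] = 1.0                    # speaker 1 active in frames 10-29
u[1, 25:45] = 1.0                    # speaker 2 active in frames 25-44
u[2, :] = 1.0                        # noise source: always active

# Gate each source's PSD by its temporal activity
lam_gated = lam * u[:, None, :]
```

A source with u_{sn} = 0 contributes nothing to the mixture model in frame n, so inactive speakers are automatically switched off without changing the number of model slots.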

Evaluation on CHiME-6 DSR Benchmark
We evaluated the word error rates (WERs) of our front-end system on dinner-party recordings captured by Kinect v2 (4-ch) microphone arrays.
• The participants converse on arbitrary topics without any scripted scenario.
*WER was measured with the official baseline ASR (Kaldi) model.
https://spandh.dcs.shef.ac.uk//chime_challenge/CHiME5/overview.html

Extension 2: Separation of Moving Sound Sources
BSS methods usually assume that the sources are (almost) stationary.
• Many everyday sound sources move (e.g., walking people, wildlife, cars, ...).
• All sources move relative to the microphone if the microphone itself moves (e.g., on mobile robots).

Time-Varying (TV) Neural FCA
Joint source localization and separation for tracking moving sources.
• The inference model performs both separation (latent spectral features) and localization (time-varying DoAs) from the multichannel mixture.
• The localization results are constrained to be smooth by a moving average.
• The time-varying SCMs G_{sn} are then regularized by the smoothed localization results, and the generative model produces the multichannel reconstruction from the source PSDs and SCMs.
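The moving-average smoothing of a DoA trajectory can be sketched as follows; the window length is illustrative, and angle wrap-around at 360 degrees is ignored in this sketch.

```python
import numpy as np

def smooth_doa(doa, win=5):
    """Moving-average smoothing of a DoA trajectory (degrees over frames).
    Edge-padded so the output length matches the input length."""
    kernel = np.ones(win) / win
    pad = np.pad(doa, (win // 2, win - 1 - win // 2), mode="edge")
    return np.convolve(pad, kernel, mode="valid")
```

Smoothing the frame-wise DoA estimates keeps the implied SCM trajectory physically plausible (a source cannot jump across the room between frames), which is the constraint the time-varying SCMs are regularized toward.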

Training on Mixtures of Two Moving Speech Sources
TV-Neural FCA performed well regardless of the source velocity.
• FastMNMF2 and Neural FCA degraded drastically when the sources moved fast.
• TV-Neural FCA improved the average SDR by 4.2 dB over DoA-HMM [Higuchi+ 2014].
[Bar chart: SDR [dB] of TV-Neural FCA, Neural FCA, FastMNMF, and DoA-HMM, averaged and per angular velocity bin (0-15°/s, 15-30°/s, 30-45°/s).]

Separation Results of Moving Sound Sources
Our method can be trained from mixtures of moving sources.
• Robustness against real audio recordings was improved.
[Audio examples: FastMNMF and TV-Neural FCA under stationary and moving conditions.]

Conclusion
Two applications of deep source generative models:
1. Semi-supervised speech enhancement → FastMNMF-DP
2. Self-supervised source separation → Neural FCA
Future work:
• Speeding up neural FCA and handling an unknown number of sources → EUSIPCO 2023
• Training neural FCA on diverse real audio recordings.