INTERSPEECH 2023 T5 Part4: Source Separation Based on Deep Source Generative Models and Its Self-Supervised Learning

The slides used for Part 4 of INTERSPEECH 2023 Tutorial T5: Foundations, Extensions and Applications of Statistical Multichannel Speech Separation Models.

Self-Supervised Learning
Yoshiaki Bando
National Institute of Advanced Industrial Science and Technology (AIST), Japan
Center for Advanced Intelligence Project (AIP), RIKEN, Japan
T5: Foundations, Extensions and Applications of Statistical Multichannel Speech Separation Models, INTERSPEECH 2023, Dublin, Ireland

• Such systems are often required to work in diverse environments.
• This calls for BSS, which can work adaptively for the target environment.
Blind Source Separation (BSS)
Distant speech recognition (DSR) [Watanabe+ 2020, Baker+ 2018]
Sound event detection (SED) [Turpault+ 2020, Denton+ 2022]
Source Separation Based on Deep Generative Models and Its Self-Supervised Learning /33 2

(PSD) often has low-rank structures.
• Source PSD is estimated by non-negative matrix factorization (NMF) [Ozerov+ 2009].
• Its inference is fast and does not require supervised pre-training.
$s_{ft} \sim \mathcal{N}_{\mathbb{C}}\left(0, \sum_k u_{fk} v_{kt}\right)$
Figure: source signal $s_{ft}$ ∼ source PSD $\lambda_{ft}$ = bases $u_{fk}$ × activations $v_{kt}$.
Is there a more powerful representation of source spectra?
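The low-rank source model above can be sketched in NumPy as follows. This is only an illustration: the sizes, the random bases and activations, and the function name `sample_nmf_source` are assumptions made for the example, not values from the slides.

```python
import numpy as np


def sample_nmf_source(n_freq, n_frames, n_bases, seed=0):
    """Draw a source spectrogram from the low-rank (NMF) Gaussian model."""
    rng = np.random.default_rng(seed)
    # Non-negative bases u_fk and activations v_kt (random placeholders here).
    U = rng.random((n_freq, n_bases)) + 0.1
    V = rng.random((n_bases, n_frames)) + 0.1
    lam = U @ V  # source PSD: lambda_ft = sum_k u_fk v_kt
    # Circularly symmetric complex Gaussian with variance lambda_ft per bin.
    noise = rng.standard_normal((n_freq, n_frames)) \
        + 1j * rng.standard_normal((n_freq, n_frames))
    s = np.sqrt(lam / 2.0) * noise
    return s, lam, U, V


s, lam, U, V = sample_nmf_source(n_freq=5, n_frames=8, n_bases=2)
print(lam.shape, np.linalg.matrix_rank(lam))
```

The PSD matrix has rank at most the number of bases, which is the low-rank structure the slide refers to.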

represented with low-dim. latent feature vectors.
• A DNN is used to generate the source power spectral density (PSD) precisely.
• Freq.-independent latent features help us solve the freq. permutation ambiguity.
$z_{td} \sim \mathcal{N}(0, 1)$
$s_{ft} \mid \mathbf{z}_t \sim \mathcal{N}_{\mathbb{C}}\left(0, g_{\theta,f}(\mathbf{z}_t)\right)$
Figure: latent features $z_{td}$ → DNN $g_{\theta,f}$ → source PSD $\lambda_{ft}$ ∼ source signal $s_{ft}$.
Y. Bando, et al. "Statistical speech enhancement based on probabilistic integration of variational autoencoder and non-negative matrix factorization." IEEE ICASSP, pp. 716-720, 2018.
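A minimal sketch of the deep source model above, assuming a tiny two-layer network with random weights as the decoder $g_\theta$. The layer sizes, the tanh activation, the exponential output, and the name `decode_psd` are all hypothetical choices for illustration; the slides do not specify the architecture.

```python
import numpy as np


def decode_psd(z, W1, b1, W2, b2):
    """Map latent features z (frames x latent dim) to a positive PSD (freq x frames)."""
    hidden = np.tanh(z @ W1 + b1)   # (frames, hidden)
    log_psd = hidden @ W2 + b2      # (frames, freq)
    return np.exp(log_psd).T        # exp keeps the PSD strictly positive


rng = np.random.default_rng(0)
n_frames, n_latent, n_hidden, n_freq = 6, 4, 16, 10

# Prior: z_td ~ N(0, 1), independent over frames and latent dimensions.
z = rng.standard_normal((n_frames, n_latent))

# Untrained placeholder weights; a real model would learn them from data.
W1 = 0.1 * rng.standard_normal((n_latent, n_hidden))
b1 = np.zeros(n_hidden)
W2 = 0.1 * rng.standard_normal((n_hidden, n_freq))
b2 = np.zeros(n_freq)

lam = decode_psd(z, W1, b1, W2, b2)

# s_ft | z_t ~ complex Gaussian with variance g_{theta,f}(z_t).
noise = rng.standard_normal(lam.shape) + 1j * rng.standard_normal(lam.shape)
s = np.sqrt(lam / 2.0) * noise
```

Because one latent vector per frame drives all frequency bins, the bins of a source are tied together, which is the property the slide credits for resolving the frequency permutation ambiguity.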

Based on Deep Generative Models and Its Self-Supervised Learning
1. Semi-supervised speech enhancement
• We enhance speech signals by training on only clean speech signals.
• Combination of a deep speech model and low-rank noise models
2. Self-supervised source separation
• We train a neural source separation model only from multichannel mixtures.
• The joint training of the source generative model and its inference model

• K. Sekiguchi, Y. Bando, A. A. Nugraha, K. Yoshii, T. Kawahara, “Semi-supervised Multichannel Speech Enhancement with a Deep Speech Prior,” IEEE/ACM TASLP, 2019
• K. Sekiguchi, A. A. Nugraha, Y. Bando, K. Yoshii, “Fast Multichannel Source Separation Based on Jointly Diagonalizable Spatial Covariance Matrices,” EUSIPCO, 2019
• Y. Bando, M. Mimura, K. Itoyama, K. Yoshii, T. Kawahara, “Statistical Speech Enhancement Based on Probabilistic Integration of Variational Autoencoder and Nonnegative Matrix Factorization,” IEEE ICASSP, 2018

mixture of speech and noise
• Various applications such as DSR, search-and-rescue, and hearing aids.
Robustness against various acoustic environments is essential.
• It is often difficult to assume the environment where they are used.
Figure: example use cases ("Hey, Siri…").
CC0: https://pxhere.com/ja/photo/1234569
CC0: https://pxhere.com/ja/photo/742585

deep speech model and statistical noise model
• We can use many speech corpora → deep speech prior
• Noise training data are often few → statistical noise prior w/ low-rank model
Figure: observed noisy speech ≈ deep speech prior (pre-trained on a speech corpus) + statistical noise prior (estimated on the fly).

Supervised Training of Deep Speech Prior (DP)
• An encoder $q_\phi(\mathbf{Z} \mid \mathbf{S})$ is introduced to estimate latent features from clean speech.
The objective function is the evidence lower bound (ELBO) $\mathcal{L}_{\theta,\phi}$:
$\mathcal{L}_{\theta,\phi} = \mathbb{E}_{q_\phi}\left[\log p_\theta(\mathbf{S} \mid \mathbf{Z})\right] - \mathcal{D}_{\mathrm{KL}}\left[q_\phi(\mathbf{Z} \mid \mathbf{S}) \,\|\, p(\mathbf{Z})\right]$
The first term is the reconstruction term (IS-div.); the second is the regularization term (KL-div.).
Figure: observed speech → encoder $q_\phi(\mathbf{Z} \mid \mathbf{S})$ → latent features $\mathbf{Z}$ → decoder $p_\theta(\mathbf{S} \mid \mathbf{Z})$ → reconstructed speech.
The training is performed by making the reconstruction closer to the observation.
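The labels "IS-div." and "KL-div." on the two ELBO terms can be made explicit with a short derivation. The first two lines follow from the complex Gaussian likelihood $s_{ft} \mid \mathbf{z}_t \sim \mathcal{N}_{\mathbb{C}}(0, g_{\theta,f}(\mathbf{z}_t))$ used for the deep source model. The last line additionally assumes a diagonal Gaussian encoder with per-dimension mean $\mu$ and variance $\sigma^2$, which is a common VAE choice but is not stated on the slide.

```latex
\begin{align}
\log p_\theta(s_{ft} \mid \mathbf{z}_t)
  &= -\log \pi - \log g_{\theta,f}(\mathbf{z}_t) - \frac{|s_{ft}|^2}{g_{\theta,f}(\mathbf{z}_t)} \\
  &= -\mathcal{D}_{\mathrm{IS}}\left(|s_{ft}|^2 \,\middle\|\, g_{\theta,f}(\mathbf{z}_t)\right)
     - \log |s_{ft}|^2 - 1 - \log \pi, \\
\mathcal{D}_{\mathrm{IS}}(x \,\|\, y) &= \frac{x}{y} - \log \frac{x}{y} - 1, \\
\mathcal{D}_{\mathrm{KL}}\left[\mathcal{N}(\mu, \sigma^2) \,\|\, \mathcal{N}(0, 1)\right]
  &= \tfrac{1}{2}\left(\mu^2 + \sigma^2 - \log \sigma^2 - 1\right).
\end{align}
```

The terms $-\log |s_{ft}|^2 - 1 - \log \pi$ do not depend on $\theta$ or $\phi$, so maximizing the reconstruction term amounts to minimizing the expected Itakura-Saito divergence between the observed power spectrogram and the decoded PSD.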

estimating the model parameters.
The speech signal is finally obtained by multichannel Wiener filtering:
$\hat{s}_{ft} = \mathbb{E}\left[s_{ft} \mid \mathbf{X}, \mathbf{Q}, \tilde{\mathbf{H}}, \mathbf{U}, \mathbf{V}, \mathbf{Z}\right] = \mathbf{Q}_f^{-1} \,\mathrm{diag}\!\left(\frac{\lambda_{0ft} \tilde{\mathbf{h}}_{0f}}{\sum_n \lambda_{nft} \tilde{\mathbf{h}}_{nf}}\right) \mathbf{Q}_f^{-\mathsf{H}} \mathbf{x}_{ft}$
E-step samples latent features from their posterior $\mathbf{z}_t \sim p(\mathbf{z}_t \mid \mathbf{X})$.
• Metropolis-Hastings sampling is used because the posterior is intractable.
M-step updates the other parameters to maximize $\log p(\mathbf{X} \mid \mathbf{Q}, \tilde{\mathbf{H}}, \mathbf{U}, \mathbf{V})$.
• $\mathbf{Q}$ is updated by the iterative-projection (IP) algorithm [Ono+ 2011].
• $\tilde{\mathbf{H}}, \mathbf{U}, \mathbf{V}$ are updated by the multiplicative-update (MU) algorithm [Nakano+ 2010].
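As a rough illustration of multichannel Wiener filtering at a single time-frequency bin, the sketch below builds each spatial covariance matrix (SCM) from a shared matrix and per-source non-negative weights, then applies the generic Wiener formula $\lambda_n \mathbf{H}_n (\sum_{n'} \lambda_{n'} \mathbf{H}_{n'})^{-1} \mathbf{x}$. The parameterization $\mathbf{H}_n = \mathbf{Q}^{-1} \mathrm{diag}(\tilde{\mathbf{h}}_n) \mathbf{Q}^{-\mathsf{H}}$, the random inputs, and the name `mwf_images` are assumptions for this example and may differ from the convention used on the slide.

```python
import numpy as np


def mwf_images(x, Q, h_tilde, lam):
    """Wiener-filtered source images at one time-frequency bin.

    x: (M,) mixture, Q: (M, M) shared matrix,
    h_tilde: (N, M) non-negative weights, lam: (N,) source PSDs.
    Returns an (N, M) array with one multichannel image per source.
    """
    Q_inv = np.linalg.inv(Q)
    # Assumed SCM form: H_n = Q^{-1} diag(h_n) Q^{-H}.
    scms = np.stack([Q_inv @ np.diag(h) @ Q_inv.conj().T for h in h_tilde])
    mix_cov = np.einsum("n,nij->ij", lam, scms)  # sum_n lam_n H_n
    whitened = np.linalg.solve(mix_cov, x)       # (sum_n lam_n H_n)^{-1} x
    return np.stack([l * H @ whitened for l, H in zip(lam, scms)])


rng = np.random.default_rng(0)
M, N = 3, 2  # hypothetical numbers of microphones and sources
Q = np.eye(M) + 0.1 * (rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M)))
h_tilde = rng.random((N, M)) + 0.1
lam = rng.random(N) + 0.1
x = rng.standard_normal(M) + 1j * rng.standard_normal(M)

images = mwf_images(x, Q, h_tilde, lam)
```

A useful sanity check is that the source images add up to the observed mixture, since the Wiener gains of all sources sum to the identity.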

noisy speech dataset
• 100 utterances from the CHiME-3 evaluation set
• Each utterance was recorded by a 6-channel* mic. array on a tablet device.
• The CHiME-3 dataset includes four noise environments: on a bus, in a cafeteria, in a pedestrian area, and on a street junction.
http://spandh.dcs.shef.ac.uk/chime_challenge/CHiME4/data.html
Evaluation metrics:
• Source-to-distortion ratio (SDR) [dB] for evaluating enhancement performance
• Computational time [msec] for evaluating the efficiency of the method.
*We omitted one microphone on the back of the tablet.
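For intuition about the SDR metric, here is a simplified energy-ratio version in decibels. It is not the BSS-Eval SDR that separation papers commonly report, which also permits a short distortion filter on the reference; the function name `simple_sdr_db` and the toy signals are assumptions for this example.

```python
import numpy as np


def simple_sdr_db(reference, estimate):
    """Energy ratio between the reference and the estimation error, in dB."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    error = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(error ** 2))


reference = np.array([1.0, 0.0, -1.0, 0.0])
estimate = 0.9 * reference  # 10% amplitude error
sdr = simple_sdr_db(reference, estimate)  # approximately 20 dB
```

A 10% amplitude error leaves an error signal with 1/100 of the reference energy, which corresponds to about 20 dB.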

cost, FastMNMF-DP was much faster than MNMF.
Method | Source model | Spatial model
FastMNMF-DP | DP + NMF | JD full-rank
FastMNMF | NMF | JD full-rank
MNMF-DP | DP + NMF | Full-rank
MNMF | NMF | Full-rank
ILRMA | NMF | Rank-1
Cited: [Sekiguchi+ 2019], [Sawada+ 2013], [Kitamura+ 2016]
Figure: bar chart of computational time [ms] for an 8-second signal (axis 0 to 800); bar values as extracted: 10, 660, 710, 40, 78.
*Evaluation is performed with NVIDIA TITAN RTX

Model
• Y. Bando, K. Sekiguchi, Y. Masuyama, A. A. Nugraha, M. Fontaine, K. Yoshii, “Neural full-rank spatial covariance analysis for blind source separation,” IEEE SP Letters, 2021
• Y. Bando, T. Aizawa, K. Itoyama, K. Nakadai, “Weakly-supervised neural full-rank spatial covariance analysis for a front-end system of distant speech recognition,” INTERSPEECH, 2022
• H. Munakata, Y. Bando, R. Takeda, K. Komatani, M. Onishi, “Joint Separation and Localization of Moving Sound Sources Based on Neural Full-Rank Spatial Covariance Analysis,” IEEE SP Letters, 2023

models achieved excellent performance.
• $\mathbf{z}_{nt}$ and $\mathbf{H}_{nf}$ are estimated to maximize the likelihood function at inference time.
Can the deep source models be trained only from mixture signals?
Figure: generative model; latent source features → source PSD × SCM → multichannel reconstruction.
[Kameoka+ 2018, Seki+ 2019]

trained jointly with its inference model.
• We train the models regarding them as a “large VAE” for a multichannel mixture.
The training is performed to make the reconstruction closer to the observation.
Figure: multichannel mixture → inference model → latent source features → generative model (source PSD × SCM) → multichannel reconstruction.

of the VAE, the ELBO $\mathcal{L}$ is maximized by using SGD.
• Our training can be considered as BSS for all the training mixtures.
Figure: multichannel mixture → inference model → latent source features → generative model → multichannel reconstruction. Labels: minimize $\mathcal{D}_{\mathrm{KL}}\left[q(\mathbf{Z} \mid \mathbf{X}) \,\|\, p(\mathbf{Z} \mid \mathbf{X}, \mathbf{H})\right]$; maximize $p(\mathbf{X} \mid \mathbf{H})$; EM update rule.

latent vectors $\mathbf{z}_{1t}, \ldots, \mathbf{z}_{Nt}$ independent.
The KL term weight $\beta$ is set to a large value for the first several epochs.
• The latent posterior approaches the std. Gaussian dist. (no correlation between sources).
• Disentanglement of the latent features by β-VAE.
Figure (spectrograms of Source 1 and Source 2, axes $f$ and $t$): when each source shares the same content, the latent vectors have a LARGE correlation; when each source has a different content, the latent vectors have a SMALL correlation.
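The KL weighting described above can be sketched as a simple step schedule. The specific values (β = 10 for the first 5 epochs, then β = 1) and the function names are hypothetical; the slides only say that β is large for the first several epochs.

```python
def kl_weight(epoch, beta_large=10.0, warm_epochs=5):
    """Return the KL weight: large at first, then 1 (the plain ELBO).

    beta_large and warm_epochs are illustrative values, not from the slides.
    """
    return beta_large if epoch < warm_epochs else 1.0


def weighted_loss(reconstruction_loss, kl_divergence, epoch):
    """Negative beta-weighted ELBO for one batch, to be minimized by SGD."""
    return reconstruction_loss + kl_weight(epoch) * kl_divergence


# Example: the same batch statistics are penalized differently over training.
early = weighted_loss(reconstruction_loss=3.0, kl_divergence=0.5, epoch=0)
late = weighted_loss(reconstruction_loss=3.0, kl_divergence=0.5, epoch=20)
```

With a large β in the early epochs, the optimizer is pushed to keep the latent posterior close to the standard Gaussian prior before it focuses on reconstruction.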

a DEEP & BLIND source separation method
• Self-supervised training of the deep source generative model
Linear BLIND Source Separation: MNMF [Ozerov+ 2009, Sawada+ 2013], ILRMA [Kitamura+ 2015], FastMNMF [Sekiguchi+ 2019, Ito+ 2019], IVA [Ono+ 2011]
DEEP (Semi-)supervised Source Separation: MVAE [Kameoka+ 2018], FastMNMF-DP [Sekiguchi+ 2018, Leglaive+ 2019], IDLMA [Mogami+ 2018], DNN-MSS [Nugraha+ 2016]
DEEP BLIND Source Separation:
• Deep source model: Neural FCA (proposed)
• Deep spatial models: NF-IVA [Nugraha+ 2020], NF-FastMNMF [Nugraha+ 2022]

mixture signals of two speech sources with RT60 = 200–600 ms
• All mixture signals were dereverberated in advance by using WPE.
Method | Brief description | Permutation solver
cACGMM [Ito+ 2016] | Conventional linear BSS method (for determined conditions) | Required
FCA [Duong+ 2010] | Conventional linear BSS method (for determined conditions) | Required
FastMNMF2 [Sekiguchi+ 2020] | Conventional linear BSS method (for determined conditions) | Free
Pseudo supervised [Togami+ 2020] | DNN imitates the MWF of BSS (FCA) results | Required
Neural cACGMM [Drude+ 2019] | DNN is trained to maximize the log-marginal likelihood of the cACGMM | Required
MVAE [Seki+ 2019] | The supervised version of our neural FCA | –
Neural FCA (proposed) | Our neural blind source separation method | Free

for DSR to separate target speech sources from mixture recordings distorted by reverberation and overlapped speech.
Single-speaker DSR (e.g., smart speakers) has achieved excellent performance (e.g., CHiME-3, 4 Challenges).
https://spandh.dcs.shef.ac.uk//chime_challenge/chime2015/overview.html
Multi-speaker DSR (e.g., home parties) is still a challenging problem (e.g., CHiME-5, 6 Challenges).
https://spandh.dcs.shef.ac.uk//chime_challenge/CHiME5/overview.html

should be handled in real conversations.
• We introduce temporal voice activities $u_{nt} \in \{0, 1\}$ to neural FCA.
Speech sources
• High degrees of freedom in latent space
• Limited time activity
Noise source(s)
• Low degrees of freedom in latent space
• Always active
Figure: generative model; latent source features → source PSD × voice activity ($u_{1t}, u_{2t}, \ldots, u_{Nt}$) × SCM → multichannel reconstruction. Label: active-source set $\{n \mid u_{nt} = 1\}$.
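One way to read the voice-activity extension is that only active sources contribute to the mixture covariance at each frame. The sketch below assumes the covariance is $\sum_n u_{nt} \lambda_{nft} \mathbf{H}_{nf}$ at one time-frequency bin; this form, the random inputs, and the name `mixture_covariance` are assumptions for illustration rather than the exact model on the slide.

```python
import numpy as np


def mixture_covariance(activity, psd, scms):
    """Sum the covariances of the sources whose voice activity is 1.

    activity: (N,) values in {0, 1}, psd: (N,) source PSDs,
    scms: (N, M, M) spatial covariance matrices. Returns (M, M).
    """
    activity = np.asarray(activity, dtype=float)
    return np.einsum("n,n,nij->ij", activity, psd, scms)


rng = np.random.default_rng(0)
N, M = 3, 2  # hypothetical numbers of sources and microphones
A = rng.standard_normal((N, M, M)) + 1j * rng.standard_normal((N, M, M))
scms = A @ A.conj().transpose(0, 2, 1)  # Hermitian positive semidefinite
psd = rng.random(N) + 0.1

cov_all = mixture_covariance([1, 1, 1], psd, scms)
cov_masked = mixture_covariance([1, 0, 1], psd, scms)  # source 2 is silent
```

Setting an activity to 0 removes that source from the frame entirely, which models the limited time activity of speech, while an always-active noise source keeps its activity at 1.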

front-end system for dinner-party recordings.
• The participants converse on any topic without an artificial scenario.
Figure: recording device, Kinect v2 (4ch).
https://spandh.dcs.shef.ac.uk//chime_challenge/CHiME5/overview.html
*WER was measured with the official baseline ASR (Kaldi) model

assume that sources are (almost) stationary.
• Many daily sound sources move (e.g., walking persons, natural habitats, cars, …)
• All sources move relatively if the microphone moves (e.g., mobile robots).
Figure: moving sound sources ("Woo-hoo!", "Broom!", "Chirp, chip").

tracking moving sources.
• The localization results are constrained to be smooth by moving average.
• SCMs are then constrained by the time-varying smoothed localization results.
Figure: inference model and generative model. Labels: multichannel mixture, latent spectral features, time-varying DoAs, time-varying SCMs (regularized by the DoAs), source PSD, SCM, separation, localization, multichannel reconstruction.
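The moving-average smoothing of localization results can be sketched as follows, assuming the direction of arrival (DoA) of one source is given as a unit vector per frame. The window width, the renormalization step, and the name `smooth_doa` are illustrative assumptions; the slides do not give these details.

```python
import numpy as np


def smooth_doa(directions, width=5):
    """Moving-average smoothing of per-frame DoA unit vectors.

    directions: (T, 3) unit vectors with T >= width. Returns (T, 3) unit vectors.
    """
    kernel = np.ones(width) / width
    smoothed = np.stack(
        [np.convolve(directions[:, d], kernel, mode="same")
         for d in range(directions.shape[1])],
        axis=1,
    )
    # Renormalize so that every frame is a valid direction again.
    return smoothed / np.linalg.norm(smoothed, axis=1, keepdims=True)


# A source moving slowly in azimuth, observed with noisy estimates.
rng = np.random.default_rng(0)
T = 20
azimuth = np.linspace(0.0, 0.5, T) + 0.05 * rng.standard_normal(T)
noisy = np.stack([np.cos(azimuth), np.sin(azimuth), np.zeros(T)], axis=1)
smooth = smooth_doa(noisy, width=5)
```

Averaging unit vectors instead of raw angles avoids wrap-around problems near ±180°, at the cost of needing the renormalization step.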

FCA performed well regardless of source velocity.
• FastMNMF2 and Neural FCA drastically degraded when sources move fast.
• TV-Neural FCA improved the avg. SDR by 4.2 dB over that of DoA-HMM [Higuchi+ 2014].
Figure: bar chart of SDR [dB] (axis 0 to 14) for TV-Neural FCA, Neural FCA, FastMNMF, and DOA-HMM, grouped by Average, 0-15°/s, 15-30°/s, and 30-45°/s.

trained from mixtures of moving sources.
• Robustness against real audio recordings was improved.
Figure: FastMNMF and TV Neural FCA results under the stationary condition and the moving condition.