Yoshiaki Bando
September 05, 2023

EUSIPCO 2023: Neural Fast Full-Rank Spatial Covariance Analysis for Blind Source Separation

Presentation slides used at EUSIPCO 2023.
https://arxiv.org/abs/2306.10240

Transcript

  1. Neural Fast Full-Rank Spatial Covariance Analysis for Blind Source Separation

    Yoshiaki Bando (1,2), Yoshiki Masuyama (1,3), Aditya Arie Nugraha (2), Kazuyoshi Yoshii (2,4)
    (1) National Institute of Advanced Industrial Science and Technology (AIST)
    (2) Center for Advanced Intelligence Project (AIP), RIKEN
    (3) Department of Computer Science, Tokyo Metropolitan University
    (4) Graduate School of Informatics, Kyoto University
  2. Motivation: Blind Source Separation (BSS)

    Sound source separation forms the basis of machine listening systems.
    • Such systems are often required to work in diverse environments.
    • This calls for BSS, which can adapt to the target environment.
    Example applications:
    • Distant speech recognition (DSR) [Watanabe+ 2020, Baker+ 2018]
    • Sound event detection (SED) [Turpault+ 2020, Denton+ 2022]
  3. Foundation of Modern BSS Methods

    Probabilistic generative models of multichannel mixture signals.
    • The generative model consists of a source model and a spatial model.
    Source model (source n, frequency f, time frame t): $s_{nft} \sim \mathcal{N}_{\mathbb{C}}(0, \lambda_{nft})$
    Spatial model (image of source n at the M microphones): $\mathbf{x}_{nft} \sim \mathcal{N}_{\mathbb{C}}(\mathbf{0}, \lambda_{nft} \mathbf{H}_{nf})$
    Observed mixture: $\mathbf{x}_{ft} = \sum_n \mathbf{x}_{nft} \in \mathbb{C}^M$, so that $\mathbf{x}_{ft} \sim \mathcal{N}_{\mathbb{C}}(\mathbf{0}, \sum_n \lambda_{nft} \mathbf{H}_{nf})$
    [Figure: spectrograms of the sources $s_{1ft}, \dots, s_{Nft}$ (axes f, t), their multichannel images $\mathbf{x}_{1ft}, \dots, \mathbf{x}_{Nft}$ (axes f, t, m), and the observed mixture $\mathbf{x}_{ft}$.]
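The relation between the source images and the mixture can be checked numerically. The following is a minimal NumPy sketch for a single time-frequency bin, assuming independent sources; the helper names `random_scm` and `sample_complex_gaussian`, the PSD values, and the random SCMs are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def random_scm(rng, n_mics):
    """Draw a random Hermitian positive-definite matrix (stand-in for H_nf)."""
    a = rng.standard_normal((n_mics, n_mics)) + 1j * rng.standard_normal((n_mics, n_mics))
    return a @ a.conj().T / n_mics + 1e-3 * np.eye(n_mics)

def sample_complex_gaussian(rng, cov, n_samples):
    """Draw samples x ~ N_C(0, cov); returns an array of shape (n_samples, M)."""
    chol = np.linalg.cholesky(cov)  # cov = chol @ chol^H
    m = cov.shape[0]
    white = (rng.standard_normal((n_samples, m))
             + 1j * rng.standard_normal((n_samples, m))) / np.sqrt(2)
    return white @ chol.T  # each row is (chol @ w)^T

rng = np.random.default_rng(0)
n_src, n_mics, n_samples = 2, 3, 200_000
psd = np.array([2.0, 0.5])  # lambda_nft for one (f, t) bin (arbitrary values)
scms = [random_scm(rng, n_mics) for _ in range(n_src)]

# Source images x_nft ~ N_C(0, lambda_nft * H_nf); the mixture is their sum.
images = [sample_complex_gaussian(rng, psd[n] * scms[n], n_samples) for n in range(n_src)]
mixture = sum(images)

# The model covariance of the mixture is sum_n lambda_nft * H_nf.
model_cov = sum(psd[n] * scms[n] for n in range(n_src))
empirical_cov = mixture.T @ mixture.conj() / n_samples
rel_err = np.linalg.norm(empirical_cov - model_cov) / np.linalg.norm(model_cov)
print(f"relative error of the empirical covariance: {rel_err:.4f}")
```

The empirical covariance of the summed images approaches the sum of the scaled SCMs as the number of samples grows, which is the content of the mixture model.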
  4. Geometric Interpretation of Multichannel Generative Models

    Multivariate Gaussian representation of source images $\mathbf{x}_{nft} \in \mathbb{C}^M$:
    $\mathbf{x}_{nft} \sim \mathcal{N}_{\mathbb{C}}(\mathbf{0}, \lambda_{nft} \mathbf{H}_{nf})$
    • Spatial covariance matrices (SCMs) $\mathbf{H}_{nf} \in \mathbb{S}_+^{M \times M}$: "shape" of the ellipse
    • Power spectral density (PSD) $\lambda_{nft} \in \mathbb{R}_+$: "size" of the ellipse
    [Figure: two speakers, one saying "こんにちは！" ("Hello!" in Japanese) and one saying "Hello!", with the labels "Late" and "Early". In the plane of microphones $m_1$ and $m_2$, source n = 1 is drawn as an ellipse with size $\lambda_{1ft}$ and shape $\mathbf{H}_{1f}$, and source n = 2 as an ellipse with size $\lambda_{2ft}$ and shape $\mathbf{H}_{2f}$.]
  5. Source Models for Blind Source Separation

    Source models based on low-rank approximation [Ozerov+ 2009]
    • The source PSD is estimated by non-negative matrix factorization (NMF).
    • [Figure: bases $u_{fk}$ × activations $v_{kt}$ give the source PSD $\lambda_{ft}$, from which the source signal $s_{ft}$ is drawn.]
    Source models based on deep generative models [Bando+ 2018]
    • The source PSD is precisely generated from latent features by a deep neural network (DNN).
    • [Figure: latent features $z$ pass through the DNN $g_{\theta,f}$ to give the source PSD $\lambda_{ft}$, from which the source signal $s_{ft}$ is drawn.]
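As an illustration of the low-rank source model, the sketch below builds a PSD as the product of non-negative bases and activations and draws a source spectrogram from it. The sizes and the random values are arbitrary assumptions; a deep source model would replace the matrix product with the output of a DNN applied to latent features.

```python
import numpy as np

rng = np.random.default_rng(0)
n_freq, n_frames, n_bases = 6, 8, 2  # illustrative sizes

# Non-negative bases u_fk and activations v_kt (random stand-ins).
bases = rng.uniform(0.1, 1.0, size=(n_freq, n_bases))
activations = rng.uniform(0.1, 1.0, size=(n_bases, n_frames))

# Low-rank source PSD: lambda_ft = sum_k u_fk * v_kt.
psd = bases @ activations

# Source signal s_ft ~ N_C(0, lambda_ft), drawn independently per bin.
noise = (rng.standard_normal(psd.shape) + 1j * rng.standard_normal(psd.shape)) / np.sqrt(2)
source = np.sqrt(psd) * noise

print(psd.shape, np.linalg.matrix_rank(psd))
```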
  6. Spatial Models for Blind Source Separation

    Rank-1 spatial model: $\mathbf{H}_{nf} = \mathbf{a}_{nf} \mathbf{a}_{nf}^{\mathsf{H}}$
    • Fast and stable with the iterative projection (IP) [Ono+ 2011] or iterative source steering (ISS) [Scheibler+] algorithm.
    • Weak against reverberation and diffuse noise.
    Full-rank spatial model: $\mathbf{H}_{nf} \in \mathbb{S}_+^{M \times M}$
    • Robust against reverberation and diffuse noise.
    • Computationally expensive due to its EM or MU algorithm.
    Jointly-diagonalizable (JD) spatial model: $\mathbf{H}_{nf} \triangleq \mathbf{Q}_f^{-1} \mathrm{diag}(\mathbf{w}_n) \mathbf{Q}_f^{-\mathsf{H}}$
    • Still robust against reverberation and diffuse noise.
    • Moderately fast with the IP or ISS algorithm.
    • Can be considered as a weighted sum of rank-1 SCMs: $\sum_m w_{nm} \mathbf{a}_{fm} \mathbf{a}_{fm}^{\mathsf{H}}$
    [Figure: ellipses in the plane of microphones $m_1$ and $m_2$ illustrating each spatial model.]
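The "weighted sum of rank-1 SCMs" reading of the JD model follows from plain matrix algebra, and a short NumPy check makes it concrete. The sketch assumes that $\mathbf{a}_{fm}$ is the m-th column of $\mathbf{Q}_f^{-1}$ (the slides do not state this explicitly); the matrix `q` and weights `w` are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
n_mics = 4

# Random diagonalizer Q_f and non-negative weights w_n (illustrative values).
q = rng.standard_normal((n_mics, n_mics)) + 1j * rng.standard_normal((n_mics, n_mics))
w = rng.uniform(0.1, 1.0, size=n_mics)

# JD SCM: H_nf = Q_f^{-1} diag(w_n) Q_f^{-H}.
q_inv = np.linalg.inv(q)
scm_jd = q_inv @ np.diag(w) @ q_inv.conj().T

# Same matrix as a weighted sum of rank-1 terms built from the columns of Q_f^{-1}.
scm_sum = sum(w[m] * np.outer(q_inv[:, m], q_inv[:, m].conj()) for m in range(n_mics))

# Q_f diagonalizes the SCM: Q_f H_nf Q_f^H = diag(w_n).
diagonalized = q @ scm_jd @ q.conj().T

print(np.allclose(scm_jd, scm_sum), np.allclose(diagonalized, np.diag(w)))
```

Because every source shares the same $\mathbf{Q}_f$, one matrix diagonalizes all SCMs at a frequency, which is where the speed-up over the unconstrained full-rank model comes from.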
  7. Neural Full-Rank Spatial Covariance Analysis (Neural FCA)

    Joint training of a deep generative model and its inference model.
    • We train the models by regarding them as a "large VAE" for a multichannel mixture.
    • Computationally expensive due to the full-rank SCMs.
    [Figure: multichannel mixture → inference model → latent source features → generative model (source PSD × SCM for each source) → multichannel reconstruction. Training makes the reconstruction closer to the observation. The SCMs are estimated by a heavy EM algorithm.]
  8. Deep Source Model + JD Spatial Model → Neural FastFCA

    Speeding up neural FCA with a JD spatial model and the ISS algorithm.
    • We use the ISS algorithm in the inference model to quickly estimate the SCMs.
    [Figure: multichannel mixture → inference model (DNN and ISS [Scheibler+ 2021]) → latent source features and JD SCM parameters → generative model (source PSD × SCM for each source) → multichannel reconstruction.]
  9. Generative Model of Mixture Signals

    The full-rank SCMs $\mathbf{H}_{nf}$ are replaced by the JD SCMs $\mathbf{Q}_f^{-1} \mathrm{diag}(\mathbf{w}_n) \mathbf{Q}_f^{-\mathsf{H}}$:
    $\mathbf{x}_{ft} \sim \mathcal{N}_{\mathbb{C}}\big(\mathbf{0}, \mathbf{Q}_f^{-1} \big\{ \sum_n g_{\theta,f}(\mathbf{z}_{nt}) \, \mathrm{diag}(\mathbf{w}_n) \big\} \mathbf{Q}_f^{-\mathsf{H}} \big)$
    [Figure: generative model in which each source PSD is multiplied by its JD SCM, $\mathbf{Q}_f^{-1} \mathrm{diag}(\mathbf{w}_1) \mathbf{Q}_f^{-\mathsf{H}}, \mathbf{Q}_f^{-1} \mathrm{diag}(\mathbf{w}_2) \mathbf{Q}_f^{-\mathsf{H}}, \dots, \mathbf{Q}_f^{-1} \mathrm{diag}(\mathbf{w}_N) \mathbf{Q}_f^{-\mathsf{H}}$, and the products are summed into the multichannel reconstruction.]
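One practical consequence of the JD form is that the Gaussian log-likelihood of a mixture bin no longer needs an M × M inverse: after the projection $\mathbf{y} = \mathbf{Q}_f \mathbf{x}_{ft}$ the channels decouple. The sketch below compares the two evaluations for one bin. The function names, the random $\mathbf{Q}_f$, and the PSD values standing in for $g_{\theta,f}(\mathbf{z}_{nt})$ are assumptions for illustration only.

```python
import numpy as np

def loglik_full(x, cov):
    """log N_C(x; 0, cov) using a generic M x M solve and determinant."""
    m = x.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    quad = np.real(x.conj() @ np.linalg.solve(cov, x))
    return -m * np.log(np.pi) - logdet - quad

def loglik_jd(x, q, psd, w):
    """Same density via y = Q x and per-channel variances sum_n psd[n] * w[n, m]."""
    m = x.shape[0]
    y = q @ x
    var = psd @ w  # shape (M,)
    _, logabsdet_q = np.linalg.slogdet(q)
    return (-m * np.log(np.pi) + 2.0 * logabsdet_q
            - np.sum(np.log(var)) - np.sum(np.abs(y) ** 2 / var))

rng = np.random.default_rng(0)
n_src, n_mics = 3, 4
q = rng.standard_normal((n_mics, n_mics)) + 1j * rng.standard_normal((n_mics, n_mics))
w = rng.uniform(0.1, 1.0, size=(n_src, n_mics))  # rows are w_n
psd = rng.uniform(0.5, 2.0, size=n_src)          # stand-in for g_theta,f(z_nt)

# Mixture covariance: Q^{-1} { sum_n psd_n diag(w_n) } Q^{-H}.
q_inv = np.linalg.inv(q)
cov = q_inv @ np.diag(psd @ w) @ q_inv.conj().T

x = rng.standard_normal(n_mics) + 1j * rng.standard_normal(n_mics)
ll_full = loglik_full(x, cov)
ll_jd = loglik_jd(x, q, psd, w)
print(ll_full, ll_jd)
```

Both functions return the same value up to rounding; the JD version only needs element-wise operations once $\mathbf{Q}_f \mathbf{x}_{ft}$ and $\log|\det \mathbf{Q}_f|$ are available.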
  10. Inference Model Integrating DNN and ISS-Based Blocks

    The inference model estimates the parameters of the generative model.
    • The ISS algorithm is used to quickly estimate $\mathbf{Q}_f$ from $\mathbf{x}_{ft}$ and the mask $\mathbf{m}_{\phi,ft}$.
    • Each DNN uses an intermediate diagonalization result for its estimate.
    [Figure: the mixture $\mathbf{x}$ enters DNN(0), which outputs features $\mathbf{h}_\phi^{(0)}$ and masks $\mathbf{m}_\phi^{(0)}$ alongside $\mathbf{Q}^{(0)}$. The 1st to B-th blocks each pair ISS(b) with DNN(b) and pass on $\mathbf{Q}^{(b)}$, the intermediate diagonalization result (an accented $\mathbf{x}^{(b)}$ in the figure), $\mathbf{h}_\phi^{(b)}$, and $\mathbf{m}_\phi^{(b)}$. A final 1 × 1 Conv outputs $\boldsymbol{\omega}_\phi$, $\boldsymbol{\mu}_\phi$, and $\boldsymbol{\sigma}_\phi^2$. Index subscripts are not recoverable from the extracted text and are omitted.]
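For readers unfamiliar with ISS, the following sketch shows the generic iterative source steering update on fixed weights: each step adds a multiple of one row of the matrix to the others and rescales that row, with no matrix inversion. It is a simplified illustration, not the ISS-based block of the paper: the slides do not spell out how the masks enter the update, so `weights` here are hypothetical positive values standing in for reciprocal variances, and the names `iss_sweep` and `iss_cost` are made up for this example.

```python
import numpy as np

def iss_cost(y, q, weights):
    """Weighted-power cost minus 2 log|det Q| that each ISS step does not increase."""
    _, logabsdet = np.linalg.slogdet(q)
    return np.mean(np.sum(weights * np.abs(y) ** 2, axis=0)) - 2.0 * logabsdet

def iss_sweep(y, q, weights):
    """One ISS sweep over all channels.

    y:       (M, T) current diagonalized signal, y = q @ x
    q:       (M, M) current diagonalization matrix
    weights: (M, T) positive weights, one row per output channel (hypothetical)
    """
    n_ch, n_frames = y.shape
    for k in range(n_ch):
        yk = y[k]
        # Steering coefficients for all channels with respect to channel k.
        num = np.sum(weights * y * yk.conj(), axis=1)
        den = np.sum(weights * np.abs(yk) ** 2, axis=1)
        v = num / den
        # Channel k itself is only rescaled to unit weighted power.
        v[k] = 1.0 - 1.0 / np.sqrt(den[k] / n_frames)
        # Rank-1 updates of the signal and of the matrix (no inversion needed).
        y = y - np.outer(v, yk)
        q = q - np.outer(v, q[k])
    return y, q

rng = np.random.default_rng(0)
n_ch, n_frames = 3, 500
mixing = rng.standard_normal((n_ch, n_ch)) + 1j * rng.standard_normal((n_ch, n_ch))
x = mixing @ (rng.standard_normal((n_ch, n_frames)) + 1j * rng.standard_normal((n_ch, n_frames)))
weights = rng.uniform(0.5, 2.0, size=(n_ch, n_frames))

q = np.eye(n_ch, dtype=complex)
y = q @ x
costs = [iss_cost(y, q, weights)]
for _ in range(5):
    y, q = iss_sweep(y, q, weights)
    costs.append(iss_cost(y, q, weights))
print(np.round(costs, 4))
```

Each sweep costs only element-wise products and sums over time frames, which is why such updates are cheap enough to embed between DNN blocks.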
  11. Training Based on Autoencoding Variational Bayes

    As in VAE training, the ELBO $\mathcal{L}$ is maximized by stochastic gradient descent (SGD).
    After training, the models are used to separate unseen mixture signals.
    [Figure: multichannel mixture → inference model $\phi$ → latent source features and JD SCM parameters → generative model $\theta$ → multichannel reconstruction.]
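As a reminder, the evidence lower bound (ELBO) of a standard VAE has the generic form below, written with $\mathbf{X}$ for the observed multichannel mixture and $\mathbf{Z}$ for the latent source features. This is the textbook expression, given here only for orientation; the exact objective used for neural FastFCA (for example, how the JD SCM parameters enter) is not shown in the slides.

```latex
\mathcal{L}(\theta, \phi)
  = \mathbb{E}_{q_\phi(\mathbf{Z} \mid \mathbf{X})}
      \big[ \log p_\theta(\mathbf{X} \mid \mathbf{Z}) \big]
  - \mathrm{KL}\big( q_\phi(\mathbf{Z} \mid \mathbf{X}) \,\|\, p(\mathbf{Z}) \big)
  \le \log p_\theta(\mathbf{X})
```

The first term rewards reconstructions that explain the observed mixture, and the second keeps the inferred latent features close to their prior.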
  12. Experimental Condition: Speech Separation

    Evaluation was performed with simulated 8-channel speech mixtures.
    • The simulation was almost the same as that of the spatialized WSJ0-mix dataset.
    • The main difference is that the number of sources was randomly drawn between 2 and 4.
    All the methods were run with a fixed number of sources (5).
    • We show that our method can work when only the maximum number of sources is specified.

    | Method | Brief description | # of iterations |
    | --- | --- | --- |
    | MNMF [Sawada+ 2013] | Conventional linear BSS method that can resolve the frequency permutation ambiguity | 200 |
    | ILRMA [Kitamura+ 2016] | Same as above | 200 |
    | FastMNMF [Sekiguchi+ 2020] | Same as above | 200 |
    | Neural FCA [Bando+ 2021] | The conventional neural BSS method | 200 |
    | Neural FastFCA (proposed) | The proposed neural BSS method | Iteration free |
  13. Experimental Results: Average Separation Performance

    Neural FastFCA outperformed the conventional BSS methods in all the metrics and was slightly better than neural FCA in SDR and STOI.

    | Method | SDR [dB] | PESQ | STOI |
    | --- | --- | --- | --- |
    | MNMF | 7.5 | 1.49 | 0.76 |
    | ILRMA | 7.0 | 1.43 | 0.76 |
    | FastMNMF | 9.3 | 1.60 | 0.80 |
    | Neural FCA | 11.1 | 1.88 | 0.84 |
    | Neural FastFCA | 11.6 | 1.85 | 0.85 |
  14. Experimental Results: Elapsed Time for Inference

    The elapsed time was drastically reduced from that of neural FCA (4.77 s to 0.09 s, 53x faster) thanks to the JD spatial model and the ISS-based inference model.
    [Figure: bar chart of the elapsed time in seconds for separating a 5-second mixture on an NVIDIA V100 GPU, for MNMF, ILRMA, FastMNMF, Neural FCA, and Neural FastFCA. Bar values as extracted: 0.09, 4.77, 1.81, 1.36, 2.07. The "53x faster" annotation matches 4.77 / 0.09, so 0.09 s is Neural FastFCA and 4.77 s is Neural FCA; the remaining three values belong to the conventional linear methods.]
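The "53x faster" annotation and the "2%" figure quoted in the conclusion can be reproduced from the two extreme bar values. The short calculation below assumes that 4.77 s belongs to neural FCA and 0.09 s to neural FastFCA, which is the only assignment consistent with the annotation; the real-time factor is an extra derived quantity, not a number from the slides.

```python
# Elapsed times for separating a 5-second mixture (values read from the chart;
# the assignment to methods is an assumption based on the "53x faster" label).
neural_fca_s = 4.77
neural_fastfca_s = 0.09
mixture_length_s = 5.0

speedup = neural_fca_s / neural_fastfca_s             # how many times faster
cost_ratio = neural_fastfca_s / neural_fca_s          # remaining fraction of the cost
real_time_factor = neural_fastfca_s / mixture_length_s

print(f"speedup: {speedup:.0f}x")
print(f"cost relative to neural FCA: {cost_ratio:.1%}")
print(f"real-time factor: {real_time_factor:.3f}")
```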
  15. Experimental Results: Performance at Each # of Sources

    Neural FastFCA was successfully trained from mixtures of unknown numbers of sources by specifying their maximum number.

    | Method | SDR [dB], N=2 | SDR [dB], N=3 | SDR [dB], N=4 |
    | --- | --- | --- | --- |
    | MNMF | 13.0 | 8.3 | 3.9 |
    | ILRMA | 13.2 | 7.7 | 3.2 |
    | FastMNMF | 15.3 | 10.1 | 5.3 |
    | Neural FCA | 16.4 | 12.2 | 7.2 |
    | Neural FastFCA | 17.4 | 12.7 | 7.5 |
  16. Conclusion: Neural Fast Full-Rank Spatial Covariance Analysis

    An extension of neural FCA that reduces the computational cost.
    • JD SCMs and ISS-based layers reduced the cost to 2% of the original.
    • Our method was successfully trained from mixtures with unknown numbers of sources.
    Future work: joint dereverberation and separation of moving sources.
    [Figure: multichannel mixture → inference model (DNN and ISS) → latent source features and JD SCM parameters → generative model (source PSD × SCM) → multichannel reconstruction.]