EUSIPCO 2023: Neural Fast Full-Rank Spatial Covariance Analysis for Blind Source Separation

Yoshiaki Bando
September 05, 2023

Presentation slides used at EUSIPCO 2023
https://arxiv.org/abs/2306.10240

Transcript

  1. Neural Fast Full-Rank Spatial Covariance Analysis for Blind Source Separation
    Yoshiaki Bando (1,2), Yoshiki Masuyama (1,3), Aditya Arie Nugraha (2), Kazuyoshi Yoshii (2,4)
    (1) National Institute of Advanced Industrial Science and Technology (AIST)
    (2) Center for Advanced Intelligence Project (AIP), RIKEN
    (3) Department of Computer Science, Tokyo Metropolitan University
    (4) Graduate School of Informatics, Kyoto University

  2. Motivation: Blind Source Separation (BSS)
    Sound source separation forms the basis of machine listening systems.
    • Such systems are often required to work in diverse environments.
    • This calls for BSS, which can adapt to the target environment.
    Distant speech recognition (DSR)
    [Watanabe+ 2020, Baker+ 2018]
    Sound event detection (SED)
    [Turpault+ 2020, Denton+ 2022]

  3. Foundation of Modern BSS Methods
    Probabilistic generative models of multichannel mixture signals.
    • The generative model consists of a source model and a spatial model
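    Source model: $s_{nft} \sim \mathcal{N}_{\mathbb{C}}(0, \lambda_{nft})$
    Spatial model: $\mathbf{x}_{nft} \sim \mathcal{N}_{\mathbb{C}}(\mathbf{0}, \lambda_{nft} \mathbf{H}_{nf})$
    Observed mixture: $\mathbf{x}_{ft} = \sum_n \mathbf{x}_{nft} \sim \mathcal{N}_{\mathbb{C}}(\mathbf{0}, \sum_n \lambda_{nft} \mathbf{H}_{nf})$, with $\mathbf{x}_{ft} \in \mathbb{C}^M$
    [Figure: spectrograms (frequency $f$ × time $t$) of the sources $s_{1ft}, \dots, s_{Nft}$, their multichannel source images $\mathbf{x}_{1ft}, \dots, \mathbf{x}_{Nft}$ over microphones $m$, and the observed mixture $\mathbf{x}_{ft}$]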
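
The generative model on this slide can be illustrated by sampling from it. The following is a minimal NumPy sketch, not the authors' code: the sizes, the gamma-distributed PSDs, the randomly drawn SCMs, and the helper name `complex_normal` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, F, T, M = 2, 4, 100, 3  # sources, frequency bins, frames, microphones (illustrative sizes)


def complex_normal(shape, rng):
    """Draw standard circularly symmetric complex Gaussian samples (unit variance)."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)


# Source PSDs lambda_{nft} > 0 (drawn at random here; a real source model would produce them)
lam = rng.gamma(shape=2.0, scale=1.0, size=(N, F, T))

# Random full-rank SCMs H_{nf} = A A^H + eps * I (Hermitian positive definite)
A = complex_normal((N, F, M, M), rng)
H = A @ A.conj().swapaxes(-1, -2) + 1e-3 * np.eye(M)

# Source images x_{nft} ~ N_C(0, lambda_{nft} H_{nf}) via the Cholesky factor of H_{nf}
L = np.linalg.cholesky(H)                       # shape (N, F, M, M)
eps = complex_normal((N, F, T, M), rng)
x_img = np.sqrt(lam)[..., None] * np.einsum("nfij,nftj->nfti", L, eps)

# Observed mixture x_{ft} = sum_n x_{nft}
x_mix = x_img.sum(axis=0)                       # shape (F, T, M)

# Model covariance of the mixture: sum_n lambda_{nft} H_{nf}
R = np.einsum("nft,nfij->ftij", lam, H)         # shape (F, T, M, M)
```

Because the source images are independent zero-mean Gaussians, the covariance of their sum is the sum of their covariances, which is what `R` holds.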

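  4. Geometric Interpretation of Multichannel Generative Models
    Multivariate Gaussian representation of source images $\mathbf{x}_{nft} \in \mathbb{C}^M$:
    $\mathbf{x}_{nft} \sim \mathcal{N}_{\mathbb{C}}(\mathbf{0}, \lambda_{nft} \mathbf{H}_{nf})$
    • Spatial covariance matrices (SCMs) $\mathbf{H}_{nf} \in \mathbb{S}_+^{M \times M}$: "shape" of the ellipse
    • Power spectral density (PSD) $\lambda_{nft} \in \mathbb{R}_+$: "size" of the ellipse
    [Figure: two speakers saying "Hello!" (one in Japanese, one in English); sources $n = 1$ and $n = 2$ are drawn as ellipses $\lambda_{1ft} \mathbf{H}_{1f}$ and $\lambda_{2ft} \mathbf{H}_{2f}$ in the $(m_1, m_2)$ microphone plane, with "Early" and "Late" labels]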

  5. Source Models for Blind Source Separation
    Source models based on low-rank approximation [Ozerov+ 2009]
    • Source PSD is estimated by non-negative matrix factorization (NMF)
    Source models based on deep generative models [Bando+ 2018]
    • Source is precisely generated by a deep neural network (DNN).
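    [Figure: NMF source model; the source signal $s_{ft}$ has PSD $\lambda_{ft} = \sum_k u_{fk} v_{kt}$ (bases $u_{fk}$ × activations $v_{kt}$)]
    [Figure: deep source model; the source signal $s_{ft}$ has PSD $\lambda_{ft} = g_{\theta,f}(\mathbf{z}_t)$, generated from latent features by a DNN]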
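
The low-rank (NMF) source model can be sketched in a few lines of NumPy. This is an illustrative sketch only: the sizes and the random bases and activations are assumptions, and no NMF fitting is performed.

```python
import numpy as np

rng = np.random.default_rng(0)
F, T, K = 6, 50, 2  # frequency bins, frames, number of bases (illustrative sizes)

U = rng.gamma(2.0, 1.0, size=(F, K))  # non-negative bases u_{fk}
V = rng.gamma(2.0, 1.0, size=(K, T))  # non-negative activations v_{kt}

# Low-rank source PSD: lambda_{ft} = sum_k u_{fk} v_{kt}
lam = U @ V

# Source signal s_{ft} ~ N_C(0, lambda_{ft})
noise = (rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))) / np.sqrt(2)
s = np.sqrt(lam) * noise
```

The deep source model replaces the product `U @ V` with the output of a DNN decoder applied to latent features, which is not reproduced here.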

  6. Spatial Models for Blind Source Separation
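    Rank-1 spatial model: $\mathbf{H}_{nf} = \mathbf{a}_{nf} \mathbf{a}_{nf}^{\mathsf{H}}$
    • Fast and stable with the iterative projection (IP) [Ono+ 2011] or iterative source steering (ISS) [Scheibler+] algorithm.
    • Weak against reverberation and diffuse noise.
    Full-rank spatial model: $\mathbf{H}_{nf} \in \mathbb{S}_+^{M \times M}$
    • Robust against reverberation and diffuse noise.
    • Computationally expensive due to its expectation-maximization (EM) or multiplicative update (MU) algorithm.
    Jointly-diagonalizable (JD) spatial model: $\mathbf{H}_{nf} \triangleq \mathbf{Q}_f^{-1} \mathrm{diag}(\mathbf{w}_n) \mathbf{Q}_f^{-\mathsf{H}}$
    • Still robust against reverberation and diffuse noise.
    • Moderately fast with the IP or ISS algorithm.
    • Can be considered as $\sum_m w_{nm} \mathbf{a}_{fm} \mathbf{a}_{fm}^{\mathsf{H}}$, where $\mathbf{a}_{fm}$ is the $m$-th column of $\mathbf{Q}_f^{-1}$.
    [Figure: ellipses in the $(m_1, m_2)$ microphone plane illustrating the spatial models]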
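
The JD spatial model can be checked numerically for a single frequency bin. The sketch below uses a random diagonalizer and random weights (both assumptions for illustration) to confirm that the JD SCM equals a weighted sum of rank-1 terms built from the columns of the inverse diagonalizer, and that one matrix diagonalizes the SCMs of all sources.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 3, 4  # sources, microphones (illustrative sizes)

# Diagonalizer Q_f for one frequency bin and non-negative weights w_n (random stand-ins)
Q = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
W = rng.uniform(0.1, 1.0, size=(N, M))

Q_inv = np.linalg.inv(Q)

# JD SCMs: H_n = Q^{-1} diag(w_n) Q^{-H}
H = np.stack([Q_inv @ np.diag(w) @ Q_inv.conj().T for w in W])

# Equivalent weighted sum of rank-1 SCMs: sum_m w_{nm} a_m a_m^H,
# where a_m is the m-th column of Q^{-1}
H_sum = np.einsum("nm,im,jm->nij", W, Q_inv, Q_inv.conj())

# Joint diagonalization: Q H_n Q^H = diag(w_n) for every source n
D = Q @ H @ Q.conj().T
```

The single matrix `Q` diagonalizes every `H[n]` at once, which is what lets algorithms such as IP or ISS work with per-channel scalars instead of full matrices.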

  7. Neural Full-Rank Spatial Covariance Analysis (Neural FCA)
    Joint training of a deep generative model and its inference model.
    • We train the models by regarding them together as a "large VAE" for a multichannel mixture.
    • The training is performed to make the reconstruction closer to the observation.
    Computationally expensive due to the full-rank SCMs.
    [Figure: multichannel mixture → inference model → latent source features → generative model (source PSD × SCM, summed over sources) → multichannel reconstruction; the SCMs are estimated by a heavy EM algorithm]

  8. Deep Source Model + JD Spatial Model → Neural FastFCA
    Speeding up neural FCA with a JD spatial model and the ISS algorithm [Scheibler+ 2021].
    We use the ISS algorithm in the inference model to quickly estimate the SCMs.
    [Figure: multichannel mixture → inference model (DNN + ISS) → latent source features and JD SCM parameters → generative model (source PSD × SCM, summed over sources) → multichannel reconstruction]

  9. Generative Model of Mixture Signals
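    The full-rank SCMs $\mathbf{H}_{nf}$ are replaced by the JD SCMs $\mathbf{Q}_f^{-1} \mathrm{diag}(\mathbf{w}_n) \mathbf{Q}_f^{-\mathsf{H}}$:
    $\mathbf{x}_{ft} \sim \mathcal{N}_{\mathbb{C}}\big(\mathbf{0}, \mathbf{Q}_f^{-1} \{\sum_n g_{\theta,f}(\mathbf{z}_{nt}) \mathrm{diag}(\mathbf{w}_n)\} \mathbf{Q}_f^{-\mathsf{H}}\big)$
    [Figure: generative model; each source PSD is multiplied by its JD SCM $\mathbf{Q}_f^{-1} \mathrm{diag}(\mathbf{w}_n) \mathbf{Q}_f^{-\mathsf{H}}$ ($n = 1, \dots, N$) and the products are summed to form the multichannel reconstruction]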
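
One consequence of the JD form is that the likelihood factorizes over channels once the mixture is multiplied by the diagonalizer. The derivation below is a sketch under notation not in the slides: $\tilde{\lambda}_{ftm}$ and $\mathbf{q}_{fm}^{\mathsf{H}}$ (the $m$-th row of $\mathbf{Q}_f$) are symbols introduced here for illustration.

```latex
\tilde{\lambda}_{ftm} \triangleq \sum_n g_{\theta,f}(\mathbf{z}_{nt})\, w_{nm},
\qquad
\mathbf{Q}_f \mathbf{x}_{ft} \sim \mathcal{N}_{\mathbb{C}}\!\left(\mathbf{0},\ \mathrm{diag}(\tilde{\lambda}_{ft1}, \dots, \tilde{\lambda}_{ftM})\right),

\log p(\mathbf{x}_{ft})
  = \log \left|\det \mathbf{Q}_f\right|^2
  - \sum_{m=1}^{M} \left[ \log\!\left(\pi \tilde{\lambda}_{ftm}\right)
  + \frac{\left|\mathbf{q}_{fm}^{\mathsf{H}} \mathbf{x}_{ft}\right|^2}{\tilde{\lambda}_{ftm}} \right].
```

No $M \times M$ matrix inversion per time-frequency bin is needed to evaluate this expression, which is consistent with the speed advantage the slides attribute to the JD spatial model.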

  10. Inference Model Integrating DNN and ISS-Based Blocks
    The inference model estimates the parameters of the generative model.
    • The ISS algorithm is used to quickly estimate $\mathbf{Q}_f$ from $\mathbf{x}_{ft}$ and the mask $\mathbf{m}_{\phi,ft}$.
    • Each DNN uses an intermediate diagonalization result for its estimate.
    [Figure: block diagram of the inference model; DNN(0) takes the mixture $\mathbf{x}_{ft}$ and produces hidden features $\mathbf{h}_\phi^{(0)}$ and a mask $\mathbf{m}_\phi^{(0)}$; each block $b = 1, \dots, B$ applies ISS($b$) to update the diagonalizer $\mathbf{Q}^{(b)}$ and the diagonalized signal $\mathbf{x}^{(b)}$, followed by DNN($b$); a final 1×1 Conv outputs $\boldsymbol{\omega}_\phi$, $\boldsymbol{\mu}_\phi$, and $\boldsymbol{\sigma}_\phi^2$]

  11. Training Based on Autoencoding Variational Bayes
    As in the training of the VAE, the ELBO ℒ is maximized by using SGD.
    After training, the models are used to separate unseen mixture signals.
    [Figure: multichannel mixture → inference model $\phi$ → latent source features and JD SCM parameters → generative model $\theta$ → multichannel reconstruction]
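
For reference, the generic VAE form of the ELBO is sketched below. The symbols $\mathbf{X}$ (multichannel mixture) and $\mathbf{Z}$ (latent source features) are notation assumed here, and the slides do not spell out how the JD SCM parameters enter the objective, so this is only the textbook expression and not necessarily the exact objective used by the authors.

```latex
\mathcal{L}(\theta, \phi)
  = \mathbb{E}_{q_\phi(\mathbf{Z} \mid \mathbf{X})}\!\left[ \log p_\theta(\mathbf{X} \mid \mathbf{Z}) \right]
  - \mathrm{KL}\!\left[ q_\phi(\mathbf{Z} \mid \mathbf{X}) \,\|\, p(\mathbf{Z}) \right]
```

The first term rewards reconstructions close to the observed mixture, and the second keeps the inferred latent features close to their prior.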

  12. Experimental Condition: Speech Separation
    Evaluation was performed with simulated 8-ch speech mixtures.
    • The simulation was almost the same as that of the spatialized WSJ0-mix dataset.
    • The main difference is that the number of sources was randomly drawn between 2 and 4.
    All methods were run with a fixed number of sources (5).
    • We show that our method works when only the maximum number of sources is specified.

    Method                      Brief description                                       # of iters.
    MNMF [Sawada+ 2013]         Conventional linear BSS method that can solve the       200
                                frequency permutation ambiguity
    ILRMA [Kitamura+ 2016]      (same as above)                                         200
    FastMNMF [Sekiguchi+ 2020]  (same as above)                                         200
    Neural FCA [Bando+ 2021]    The conventional neural BSS method                      200
    Neural FastFCA (Proposed)   The proposed neural BSS method                          Iteration free

  13. Experimental Results: Average Separation Performance
    Neural FastFCA outperformed the conventional BSS methods in all the
    metrics and was slightly better than neural FCA in SDR and STOI.

    Method          SDR [dB]  PESQ  STOI
    MNMF             7.5      1.49  0.76
    ILRMA            7.0      1.43  0.76
    FastMNMF         9.3      1.60  0.80
    Neural FCA      11.1      1.88  0.84
    Neural FastFCA  11.6      1.85  0.85

  14. Experimental Results: Elapsed Time for Inference
    The elapsed time was drastically reduced relative to neural FCA thanks to
    the JD spatial model and the ISS-based inference model.
    Elapsed time for separating a 5-second mixture using an NVIDIA V100 GPU [s]:

    Method          Time [s]
    MNMF            2.07
    ILRMA           1.36
    FastMNMF        1.81
    Neural FCA      4.77
    Neural FastFCA  0.09  (53x faster than neural FCA)
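
The "53x faster" figure can be reproduced from the reported timings. The snippet below assumes that 4.77 s belongs to neural FCA and 0.09 s to Neural FastFCA, which is the reading consistent with the 53x annotation; the variable names are illustrative.

```python
# Elapsed times in seconds for separating a 5-second mixture (values from the slide;
# the assignment of values to methods is an assumption consistent with "53x faster")
neural_fca_s = 4.77
neural_fastfca_s = 0.09
mixture_duration_s = 5.0

speedup = neural_fca_s / neural_fastfca_s             # how many times faster
cost_ratio = neural_fastfca_s / neural_fca_s          # remaining fraction of the original cost
real_time_factor = neural_fastfca_s / mixture_duration_s

print(f"speed-up: {speedup:.0f}x, cost: {cost_ratio:.1%} of neural FCA, "
      f"real-time factor: {real_time_factor:.3f}")
```

The cost ratio of roughly 2% matches the figure quoted in the conclusion of the talk.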

  15. Experimental Results: Performance at Each # of Sources
    Neural FastFCA was successfully trained from mixtures of unknown
    numbers of sources by specifying their maximum number.
    SDR [dB] at each number of sources N:

    Method          N=2   N=3   N=4
    MNMF            13.0   8.3   3.9
    ILRMA           13.2   7.7   3.2
    FastMNMF        15.3  10.1   5.3
    Neural FCA      16.4  12.2   7.2
    Neural FastFCA  17.4  12.7   7.5

  16. Conclusion: Neural Fast Full-Rank Spatial Covariance Analysis
    An extension of neural FCA to reduce the computational cost.
    • JD SCMs and ISS-based layers reduced the cost to 2% of the original.
    • Our method was successfully trained from mixtures w/ unknown #s of sources.
    Future work: Joint dereverberation and separation of moving sources.
    [Figure: overview of Neural FastFCA; multichannel mixture → inference model (DNN + ISS) → latent source features and JD SCM parameters → generative model (source PSD × SCM) → multichannel reconstruction]