
INTERSPEECH 2023 T5 Part4: Source Separation Based on Deep Source Generative Models and Its Self-Supervised Learning


The slides used for Part 4 of INTERSPEECH 2023 Tutorial T5: Foundations, Extensions and Applications of Statistical Multichannel Speech Separation Models.

Yoshiaki Bando

August 22, 2023


Transcript

  1. Source Separation Based on Deep Source
    Generative Models and Its Self-Supervised Learning
    Yoshiaki Bando
    National Institute of Advanced Industrial Science and Technology (AIST), Japan
    Center for Advanced Intelligence Project (AIP), RIKEN, Japan
    T5: Foundations, Extensions and Applications of Statistical Multichannel Speech Separation Models,
    INTERSPEECH 2023, Dublin, Ireland


  2. Sound source separation forms the basis of machine listening systems.
    • Such systems are often required to work in diverse environments.
    • This calls for blind source separation (BSS), which can adapt to the target environment.
    Blind Source Separation (BSS)
    Example applications: distant speech recognition (DSR) [Watanabe+ 2020, Baker+ 2018];
    sound event detection (SED) [Turpault+ 2020, Denton+ 2022]

  3. Foundation of Modern BSS Methods
    Probabilistic generative models of multichannel mixture signals.
    • A precise source model is required for defining the likelihood of a source signal.
    Source model: $s_{nft} \sim \mathcal{N}_{\mathbb{C}}(0, \lambda_{nft})$
    Spatial model: $\mathbf{x}_{nft} \sim \mathcal{N}_{\mathbb{C}}(\mathbf{0}, \lambda_{nft}\mathbf{H}_{nf})$
    Observed mixture: $\mathbf{x}_{ft} = \sum_n \mathbf{x}_{nft} \sim \mathcal{N}_{\mathbb{C}}(\mathbf{0}, \sum_n \lambda_{nft}\mathbf{H}_{nf})$, with $\mathbf{x}_{ft} \in \mathbb{C}^{M}$
    ($n$: source, $f$: frequency bin, $t$: time frame, $M$: number of channels)
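    To make the mixture model concrete, here is a minimal NumPy sketch (variable
    names are ours, not from the tutorial) that evaluates the complex-Gaussian
    log-likelihood of a single multichannel STFT bin under this model:

```python
import numpy as np

def mixture_loglik(x_ft, lam, H):
    """Log-likelihood of one multichannel STFT bin x_ft (shape [M]) under
    x_ft ~ N_C(0, sum_n lam[n] * H[n]).

    lam: [N] source PSDs at this (f, t); H: [N, M, M] spatial SCMs.
    """
    M = x_ft.shape[0]
    Sigma = np.einsum("n,nij->ij", lam, H)  # mixture covariance
    _, logdet = np.linalg.slogdet(Sigma)    # log|Sigma| (Hermitian PSD)
    quad = np.real(x_ft.conj() @ np.linalg.solve(Sigma, x_ft))
    # log N_C(x; 0, Sigma) = -M log(pi) - log|Sigma| - x^H Sigma^{-1} x
    return -M * np.log(np.pi) - logdet - quad
```

    BSS methods in this framework estimate $\lambda_{nft}$ and $\mathbf{H}_{nf}$ so as to maximize this likelihood summed over all time-frequency bins.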

  4. Source Model Based on Low-Rank Approximation
    Source power spectral density (PSD) often has low-rank structures.
    • The source PSD is estimated by non-negative matrix factorization (NMF) [Ozerov+ 2009].
    • Its inference is fast and does not require supervised pre-training.
    ๐‘ ๐‘ ๐‘“๐‘“๐‘“๐‘“
    โˆผ ๐’ฉ๐’ฉโ„‚
    0, โˆ‘๐‘˜๐‘˜
    ๐‘ข๐‘ข๐‘“๐‘“๐‘“๐‘“
    ๐‘ฃ๐‘ฃ๐‘˜๐‘˜๐‘˜๐‘˜
    Is there a more powerful representation of source spectra?
    ร—
    โˆผ
    ๐‘ ๐‘ ๐‘“๐‘“๐‘“๐‘“ ๐œ†๐œ†๐‘“๐‘“๐‘“๐‘“
    ๐‘ข๐‘ข๐‘“๐‘“๐‘“๐‘“
    ๐‘ฃ๐‘ฃ๐‘˜๐‘˜๐‘˜๐‘˜
    Source PSD
    Source signal Bases
    Activations
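    As a concrete illustration of the low-rank model above, here is a toy
    multiplicative-update sketch for Itakura-Saito NMF (our own illustrative
    code, not the tutorial's implementation):

```python
import numpy as np

def is_nmf(P, K=16, n_iter=100, eps=1e-10):
    """Fit an F x T power spectrogram P with lambda = U @ V by multiplicative
    updates that reduce the Itakura-Saito divergence D_IS(P | U @ V)."""
    F, T = P.shape
    rng = np.random.default_rng(0)
    U = rng.random((F, K)) + eps  # nonnegative spectral bases u_fk
    V = rng.random((K, T)) + eps  # nonnegative activations v_kt
    for _ in range(n_iter):
        L = U @ V + eps
        U *= ((P / L**2) @ V.T) / ((1.0 / L) @ V.T + eps)
        L = U @ V + eps
        V *= (U.T @ (P / L**2)) / (U.T @ (1.0 / L) + eps)
    return U, V
```

    Because the updates are simple matrix operations, inference is fast and
    needs no pre-training, as stated above.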

  5. Source Model Based on Deep Generative Model
    Source spectra are represented with low-dim. latent feature vectors.
    • A DNN is used to generate the source power spectral density (PSD) precisely.
    • Frequency-independent latent features help us to solve the frequency permutation ambiguity.
    Latent features: $\mathbf{z}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
    Source PSD: $\lambda_{ft} = g_{\theta,f}(\mathbf{z}_t)$, generated by a DNN $g_\theta$
    Source signal: $s_{ft} \mid \mathbf{z}_t \sim \mathcal{N}_{\mathbb{C}}(0, g_{\theta,f}(\mathbf{z}_t))$
    Y. Bando, et al. "Statistical speech enhancement based on probabilistic integration of variational autoencoder and non-negative matrix factorization." IEEE ICASSP, pp. 716-720, 2018.

  6. Contents
    Two applications of deep source generative models.
    1. Semi-supervised speech enhancement
    • We enhance speech signals by training only on clean speech signals.
    • Combination of a deep speech model and low-rank noise models.
    2. Self-supervised source separation
    • We train a neural source separation model only from multichannel mixtures.
    • The joint training of the source generative model and its inference model.

  7. Multichannel Speech Enhancement
    Based on Supervised Deep Source Model
    • K. Sekiguchi, Y. Bando, A. A. Nugraha, K. Yoshii, T. Kawahara,
    "Semi-supervised Multichannel Speech Enhancement with a Deep Speech Prior," IEEE/ACM TASLP, 2019
    • K. Sekiguchi, A. A. Nugraha, Y. Bando, K. Yoshii,
    "Fast Multichannel Source Separation Based on Jointly Diagonalizable Spatial Covariance Matrices," EUSIPCO, 2019
    • Y. Bando, M. Mimura, K. Itoyama, K. Yoshii, T. Kawahara,
    "Statistical Speech Enhancement Based on Probabilistic Integration of Variational Autoencoder and Nonnegative Matrix Factorization," IEEE ICASSP, 2018

  8. Speech Enhancement
    A task to extract speech signals from a mixture of speech and noise
    • Various applications such as DSR, search-and-rescue, and hearing aids.
    Robustness against various acoustic environments is essential.
    • It is often difficult to assume in advance the environment where the system will be used.
    Hey, Siri…
    CC0: https://pxhere.com/ja/photo/1234569 CC0: https://pxhere.com/ja/photo/742585

  9. Semi-Supervised Enhancement With Deep Speech Prior
    A hybrid method of a deep speech model and a statistical noise model
    • Many speech corpora are available → deep speech prior (pre-trained on a speech corpus)
    • Noise training data are often scarce → statistical noise prior with a low-rank model (estimated on the fly)
    Observed noisy speech ≈ deep speech prior + statistical noise prior

  10. Supervised Training of Deep Speech Prior (DP)
    The training is based on a variational autoencoder (VAE) [Kingma+ 2013].
    • An encoder $q_\phi(\mathbf{Z} \mid \mathbf{S})$ is introduced to estimate latent features from clean speech.
    The objective function is the evidence lower bound (ELBO) $\mathcal{L}_{\theta,\phi}$:
    $\mathcal{L}_{\theta,\phi} = \mathbb{E}_{q_\phi}[\log p_\theta(\mathbf{S} \mid \mathbf{Z})] - \mathcal{D}_{\mathrm{KL}}(q_\phi(\mathbf{Z} \mid \mathbf{S}) \,\|\, p(\mathbf{Z}))$
    The first term is the reconstruction term (IS divergence) and the second is the regularization term (KL divergence).
    Observed speech → encoder $q_\phi(\mathbf{Z} \mid \mathbf{S})$ → latent features $\mathbf{Z}$ → decoder $p_\theta(\mathbf{S} \mid \mathbf{Z})$ → reconstructed speech
    The training is performed by making the reconstruction closer to the observation.
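    A compact sketch of this objective (hypothetical encoder/decoder modules;
    the encoder returns the posterior mean and log-variance, the decoder the PSD):

```python
import torch

def vae_speech_loss(encoder, decoder, S_pow, eps=1e-8):
    """Negative ELBO for a batch of clean speech power spectra S_pow [B, T, F]."""
    mu, log_var = encoder(S_pow)                              # q_phi(Z | S)
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # reparameterization
    lam = decoder(z) + eps                                    # lambda_ft
    # Reconstruction term: IS divergence, i.e., the negative complex-Gaussian
    # log-likelihood of the speech given z, up to an additive constant
    ratio = S_pow / lam
    recon = (ratio - torch.log(ratio) - 1.0).sum()
    # Regularization term: KL from N(mu, diag(exp(log_var))) to N(0, I)
    kl = 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1.0).sum()
    return recon + kl  # minimizing this maximizes the ELBO
```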

  11. A unified generative model combining the VAE-based source model,
    NMF-based noise model, and jointly-diagonalizable (JD) spatial model.
    FastMNMF with a Deep Speech Prior (FastMNMF-DP)
    VAE-based speech model: latent features $\mathbf{z}_t$ → DNN → speech PSD $\lambda_{0ft}$
    NMF-based noise model (× N): bases $u_{kf}$ and activations $v_{kt}$ → noise PSDs $\lambda_{nft}$
    JD spatial model: SCM $\mathbf{H}_{0f}$ for the speech image $\mathbf{x}_{0ft}$, SCMs $\mathbf{H}_{nf}$ for the noise images $\mathbf{x}_{nft}$; the images sum to the noisy observation $\mathbf{x}_{ft}$
    JD SCMs: $\mathbf{H}_{nf} = \mathbf{Q}_f^{-1} \mathrm{diag}(\tilde{\mathbf{g}}_{nf}) \mathbf{Q}_f^{-\mathsf{H}}$
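    Under this parameterization, the JD SCMs can be materialized as follows (a
    small NumPy sketch; shapes and names are our assumptions):

```python
import numpy as np

def jd_scms(Q, g):
    """Build jointly diagonalizable SCMs H_nf = Q_f^{-1} diag(g_nf) Q_f^{-H}.

    Q: [F, M, M] diagonalizers; g: [N, F, M] nonnegative diagonal weights.
    Returns H: [N, F, M, M].
    """
    Qinv = np.linalg.inv(Q)  # [F, M, M]
    # H[n, f] = Qinv[f] @ diag(g[n, f]) @ Qinv[f]^H
    return np.einsum("fim,nfm,fjm->nfij", Qinv, g, Qinv.conj())
```

    Sharing a single $\mathbf{Q}_f$ across all sources is what makes the
    covariance operations in inference cheap compared with unconstrained
    full-rank SCMs.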

  12. Monte-Carlo Expectation-Maximization (MC-EM) Inference
    Speech and noise are separated by estimating the model parameters.
    The speech signal is finally obtained by multichannel Wiener filtering:
    $\hat{\mathbf{s}}_{ft} = \mathbb{E}[\mathbf{s}_{ft} \mid \mathbf{X}, \mathbf{Q}, \tilde{\mathbf{H}}, \mathbf{U}, \mathbf{V}, \mathbf{Z}] = \mathbf{Q}_f^{-1} \mathrm{diag}\!\left(\frac{\lambda_{0ft}\,\tilde{\mathbf{h}}_{0f}}{\sum_n \lambda_{nft}\,\tilde{\mathbf{h}}_{nf}}\right) \mathbf{Q}_f \mathbf{x}_{ft}$
    The E-step samples the latent features from their posterior $\mathbf{z}_t \sim p(\mathbf{z}_t \mid \mathbf{X})$.
    • Metropolis-Hastings sampling is utilized due to the intractability of the posterior.
    The M-step updates the other parameters to maximize $\log p(\mathbf{X} \mid \mathbf{Q}, \tilde{\mathbf{H}}, \mathbf{U}, \mathbf{V})$.
    • $\mathbf{Q}$ is updated by the iterative-projection (IP) algorithm [Ono+ 2011].
    • $\tilde{\mathbf{H}}, \mathbf{U}, \mathbf{V}$ are updated by the multiplicative-update (MU) algorithm [Nakano+ 2010].
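    A sketch of this jointly diagonalized Wiener filter in NumPy (source n = 0
    is the speech; array shapes are our assumptions):

```python
import numpy as np

def jd_wiener_filter(X, Q, g, lam):
    """Extract the speech image with the JD multichannel Wiener filter.

    X: [F, T, M] mixture STFT; Q: [F, M, M] diagonalizers;
    g: [N, F, M] diagonal SCM weights; lam: [N, F, T] source PSDs.
    """
    num = lam[0, :, :, None] * g[0, :, None, :]      # speech power, [F, T, M]
    den = np.einsum("nft,nfm->ftm", lam, g)          # total power, [F, T, M]
    gain = num / den                                 # per-channel Wiener gains
    Y = np.einsum("fmi,fti->ftm", Q, X)              # Q_f x_ft
    Qinv = np.linalg.inv(Q)
    return np.einsum("fmi,fti->ftm", Qinv, gain * Y) # Q_f^{-1} diag(gain) Q_f x_ft
```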

  13. Experimental Condition
    We evaluated with part of the CHiME-3 noisy speech dataset.
    • 100 utterances from the CHiME-3 evaluation set
    • Each utterance was recorded by a 6-channel* mic array on a tablet device.
    • The CHiME-3 dataset includes four noise environments: on a bus, in a cafeteria, in a pedestrian area, and on a street junction.
    Evaluation metrics:
    • Source-to-distortion ratio (SDR) [dB] for evaluating enhancement performance
    • Computational time [ms] for evaluating the efficiency of the method
    http://spandh.dcs.shef.ac.uk/chime_challenge/CHiME4/data.html
    *We omitted the one microphone on the back of the tablet.

  14. Enhancement Performance in SDRs
    DP consistently improved SDRs for both FastMNMF and MNMF.
    • The JD full-rank model was better than the full-rank and rank-1 models.
    Average SDR [dB] over 100 utterances:
    Method                          Source model   Spatial model   SDR [dB]
    FastMNMF-DP [Sekiguchi+ 2019]   DP + NMF       JD full-rank    18.9
    FastMNMF [Sekiguchi+ 2019]      NMF            JD full-rank    16.8
    MNMF-DP [Sekiguchi+ 2019]       DP + NMF       Full-rank       18.6
    MNMF [Sawada+ 2013]             NMF            Full-rank       13.2
    ILRMA [Kitamura+ 2016]          NMF            Rank-1          15.1

  15. Computational Times for Speech Enhancement
    Although DP slightly increased the computational cost, FastMNMF-DP was
    much faster than MNMF.
    Computational time [ms] for an 8-second signal*:
    Method                          Source model   Spatial model   Time [ms]
    FastMNMF-DP [Sekiguchi+ 2019]   DP + NMF       JD full-rank    78
    FastMNMF [Sekiguchi+ 2019]      NMF            JD full-rank    40
    MNMF-DP [Sekiguchi+ 2019]       DP + NMF       Full-rank       710
    MNMF [Sawada+ 2013]             NMF            Full-rank       660
    ILRMA [Kitamura+ 2016]          NMF            Rank-1          10
    *Evaluation was performed with an NVIDIA TITAN RTX.

  16. Excerpts of Enhancement Results
    Audio examples: observation, clean speech, ILRMA, and FastMNMF-DP.

  17. Self-Supervised Learning of Deep Source
    Generative Model and Its Inference Model
    • Y. Bando, K. Sekiguchi, Y. Masuyama, A. A. Nugraha, M. Fontaine, K. Yoshii,
    "Neural full-rank spatial covariance analysis for blind source separation," IEEE SP Letters, 2021
    • Y. Bando, T. Aizawa, K. Itoyama, K. Nakadai,
    "Weakly-supervised neural full-rank spatial covariance analysis for a front-end system of distant speech recognition," INTERSPEECH, 2022
    • H. Munakata, Y. Bando, R. Takeda, K. Komatani, M. Onishi,
    "Joint Separation and Localization of Moving Sound Sources Based on Neural Full-Rank Spatial Covariance Analysis," IEEE SP Letters, 2023

  18. Source Separation Based on Multichannel VAEs (MVAEs)
    Deep source generative models achieved excellent performance.
    โ€ข ๐ณ๐ณ๐‘›๐‘›๐‘›๐‘›
    and ๐‡๐‡๐‘“๐‘“๐‘“๐‘“
    are estimated to maximize the likelihood function at the inference
    Can the deep source models be trained only from mixture signals?
    Generative model: latent source features → DNN → source PSDs, × SCMs → multichannel reconstruction [Kameoka+ 2018, Seki+ 2019]

  19. Self-Supervised Training of Deep Source Model
    The generative model is trained jointly with its inference model.
    • We train the models by regarding them as a "large VAE" for a multichannel mixture.
    The training is performed to make the reconstruction closer to the observation.
    Multichannel mixture → inference model → latent source features → generative model (DNN → source PSDs, × SCMs) → multichannel reconstruction

  20. Training Based on Autoencoding Variational Bayes
    As in the training of the VAE, the ELBO $\mathcal{L}$ is maximized by using SGD.
    • Our training can be considered as BSS for all the training mixtures.
    Multichannel mixture → inference model → latent source features → generative model → multichannel reconstruction
    Inference model: minimize $\mathcal{D}_{\mathrm{KL}}(q(\mathbf{Z} \mid \mathbf{X}) \,\|\, p(\mathbf{Z} \mid \mathbf{X}, \mathbf{H}))$
    Generative model: maximize $p(\mathbf{X} \mid \mathbf{H})$ (EM update rule)
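    A schematic negative-ELBO training step for this "large VAE" (module names,
    shapes, and the Gaussian posterior are illustrative assumptions, not the
    authors' implementation; the beta weight on the KL term anticipates the
    annealing discussed on the next slide):

```python
import torch

def neural_fca_loss(infer_net, decoder, H, X_stft, beta=1.0):
    """infer_net encodes the mixture into per-source latents, decoder generates
    source PSDs, and H holds the SCMs. X_stft: [F, T, M] complex mixture."""
    mu, log_var = infer_net(X_stft.abs().pow(2))              # q(Z | X)
    Z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
    lam = decoder(Z)                                          # [N, F, T] PSDs
    # Mixture covariance per (f, t): sum_n lam_nft * H_nf -> [F, T, M, M]
    Sigma = torch.einsum("nft,nfij->ftij", lam.to(H.dtype), H)
    x = X_stft.unsqueeze(-1)                                  # [F, T, M, 1]
    quad = (x.conj().transpose(-2, -1) @ torch.linalg.solve(Sigma, x)).real
    nll = (quad.squeeze(-1).squeeze(-1) + torch.linalg.slogdet(Sigma)[1]).sum()
    kl = 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1.0).sum()
    return nll + beta * kl  # -log p(X | Z, H) + beta * KL (up to constants)
```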

  21. Solving Frequency Permutation Ambiguity
    We solve the ambiguity by making the latent vectors $\mathbf{z}_{1t}, \ldots, \mathbf{z}_{Nt}$ independent.
    ✗ Each separated source shares the same content → latent vectors have a LARGE correlation
    ✓ Each separated source has different content → latent vectors have a SMALL correlation
    The KL term weight $\beta$ is set to a large value for the first several epochs.
    • The posterior approaches the standard Gaussian distribution (no correlation between sources).
    • This encourages disentanglement of the latent features, as in the β-VAE.
    (Figure: example spectrograms of Source 1 and Source 2 in the correlated and decorrelated cases.)

  22. Relations Between Neural FCA and Existing Methods
    Neural FCA is a DEEP & BLIND source separation method
    • Self-supervised training of the deep source generative model
    Linear BLIND source separation: IVA [Ono+ 2011], MNMF [Ozerov+ 2009, Sawada+ 2013],
    ILRMA [Kitamura+ 2015], FastMNMF [Sekiguchi+ 2019, Ito+ 2019]
    DEEP (semi-)supervised source separation:
    • Deep source models: DNN-MSS [Nugraha+ 2016], IDLMA [Mogami+ 2018], MVAE [Kameoka+ 2018], FastMNMF-DP [Sekiguchi+ 2018, Leglaive+ 2019]
    • Deep spatial models: NF-IVA [Nugraha+ 2020], NF-FastMNMF [Nugraha+ 2022]
    DEEP BLIND source separation: Neural FCA (proposed)

  23. Experimental Condition
    Evaluation with the spatialized WSJ0-2mix dataset
    • 4-ch mixture signals of two speech sources with RT60 = 200–600 ms
    • All mixture signals were dereverberated in advance by using WPE.
    Method                            Brief description                                        Permutation solver
    cACGMM [Ito+ 2016]                Conventional linear BSS method (determined conditions)   Required
    FCA [Duong+ 2010]                 Conventional linear BSS method (determined conditions)   Required
    FastMNMF2 [Sekiguchi+ 2020]       Conventional linear BSS method (determined conditions)   Free
    Pseudo supervised [Togami+ 2020]  DNN imitates the MWF of BSS (FCA) results                Required
    Neural cACGMM [Drude+ 2019]       DNN trained to maximize the cACGMM log-marginal likelihood  Required
    MVAE [Seki+ 2019]                 The supervised version of our neural FCA                 –
    Neural FCA (proposed)             Our neural blind source separation method                Free

  24. Experimental Results With SDRs
    Neural FCA outperformed the conventional BSS methods and the neural
    unsupervised methods, and was comparable to the supervised MVAE.
    SDR (higher is better) [dB]:
    cACGMM                10.8
    FCA                   12.7
    FastMNMF2             13.0
    Pseudo supervised     14.7
    Neural cACGMM         12.4
    Neural FCA            15.2
    MVAE (random init.)    2.9
    MVAE (FCA init.)      15.2

  25. Excerpts of Separation Results
    Audio examples: mixture input, FastMNMF, MVAE (supervised), and Neural FCA.
    *More separation examples: https://ybando.jp/projects/spl2021

  26. Extension 1: Front-End System of Multi-Speaker DSR
    It is essential for DSR to separate target speech sources from mixture
    recordings distorted by reverberation and overlapping speech.
    Single-speaker DSR (e.g., smart speakers) has achieved excellent
    performance (e.g., CHiME-3, 4 Challenges).
    Multi-speaker DSR (e.g., home parties) is still a challenging problem
    (e.g., CHiME-5, 6 Challenges).
    https://spandh.dcs.shef.ac.uk//chime_challenge/chime2015/overview.html https://spandh.dcs.shef.ac.uk//chime_challenge/CHiME5/overview.html

  27. Weakly-Supervised Neural FCA for DSR
    A variable number of speech sources should be handled in real conversations.
    • We introduce temporal voice activities $u_{nt} \in \{0, 1\}$ into neural FCA, so that
    only the sources with $u_{nt} = 1$ are active at frame $t$ (a sketch follows below).
    Generative model: latent source features → source PSDs, gated by the voice
    activities $u_{1t}, u_{2t}, \ldots, u_{Nt}$, × SCMs → multichannel reconstruction
    Speech sources: high degrees of freedom in the latent space, limited time activity
    Noise source(s): low degrees of freedom in the latent space, always active
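    A minimal sketch of how such binary activities could gate the source PSDs
    (the gating form is our reading of the slide, not necessarily the paper's
    exact formulation):

```python
import torch

def gated_source_psds(decoder, Z, u, eps=1e-8):
    """Gate source PSDs with temporal voice activities.

    Z: [N, T, D] latent features; u: [N, T] activities (u_nt = 1 when source n
    is active at frame t). Returns PSDs [N, F, T] that vanish where u_nt = 0."""
    lam = decoder(Z).permute(0, 2, 1)  # hypothetical decoder: [N, T, F] -> [N, F, T]
    return lam * u[:, None, :] + eps   # silence the inactive source-frames
```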

  28. Evaluation on CHiME-6 DSR Benchmark
    We evaluated WERs of our front-end system on dinner-party recordings.
    • The participants converse on arbitrary topics without any scripted scenario.
    *WER was measured with the official baseline ASR (Kaldi) model.
    https://spandh.dcs.shef.ac.uk//chime_challenge/CHiME5/overview.html
    Kinect v2 (4ch)

  29. Extension 2: Separation of Moving Sound Sources
    BSS methods usually assume that sources are (almost) stationary.
    • Many everyday sound sources move (e.g., walking people, wildlife, cars, …).
    • All sources move relative to the microphone if the microphone itself moves (e.g., mobile robots).
    Woo-hoo! Vroom! Chirp, chirp

  30. Time-Varying (TV) Neural FCA
    Joint source localization and separation for tracking moving sources.
    • The localization results are constrained to be smooth by a moving average
    (see the sketch below).
    • The SCMs are then constrained by the time-varying smoothed localization results.
    Inference model: multichannel mixture → latent spectral features $\mathbf{z}_{0t}, \mathbf{z}_{1t}, \ldots, \mathbf{z}_{Nt}$
    and time-varying DoAs $\mathbf{u}_{1t}, \ldots, \mathbf{u}_{Nt}$ (localization)
    Generative model: source PSDs $g_{\theta,n}(\mathbf{z}_{nt})$ with time-varying SCMs
    $\mathbf{H}_{0ft}, \mathbf{H}_{1ft}, \ldots, \mathbf{H}_{Nft}$ regularized by the smoothed DoAs
    → multichannel reconstruction (separation)

  31. Training on Mixtures of Two Moving Speech Sources
    TV Neural FCA performed well regardless of the source velocity.
    • FastMNMF2 and Neural FCA degraded drastically when the sources moved fast.
    • TV-Neural FCA improved the average SDR by 4.2 dB over DoA-HMM [Higuchi+ 2014].
    (Bar chart: SDR [dB] of TV-Neural FCA, Neural FCA, FastMNMF, and DoA-HMM,
    on average and per source velocity: 0–15°/s, 15–30°/s, and 30–45°/s.)

  32. Separation Results of Moving Sound Sources
    Our method can be trained from mixtures of moving sources.
    • Robustness against real audio recordings was improved.
    Audio examples: FastMNMF vs. TV Neural FCA under stationary and moving conditions.

  33. Conclusion
    Two applications of deep source generative models:
    1. Semi-supervised speech enhancement → FastMNMF-DP
    2. Self-supervised source separation → Neural FCA
    Future work:
    • Speeding up neural FCA & handling an unknown number of sources → EUSIPCO 2023
    • Training neural FCA on diverse real audio recordings.
    Recap of the deep source model: $\mathbf{z}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ (latent features) → DNN →
    source PSD $\lambda_{ft} = g_{\theta,f}(\mathbf{z}_t)$ → source signal $s_{ft} \sim \mathcal{N}_{\mathbb{C}}(0, g_{\theta,f}(\mathbf{z}_t))$