Interspeech 2020: Adaptive Neural Speech Enhancement with a Denoising Variational Autoencoder

Presentation slides used for Interspeech 2020: Yoshiaki Bando, Kouhei Sekiguchi, Kazuyoshi Yoshii: Adaptive Neural Speech Enhancement with a Denoising Variational Autoencoder

Yoshiaki Bando

October 25, 2020

  1. Adaptive Neural Speech Enhancement with a Denoising Variational Autoencoder

    Yoshiaki Bando (1,2), Kouhei Sekiguchi (2,3), Kazuyoshi Yoshii (2,3)
    (1) AIST, Japan; (2) RIKEN AIP, Japan; (3) Kyoto University, Japan
    https://ybando.jp/demo/is2020/
  2. Background|Speech Enhancement

    A task to extract a speech signal from a mixture of speech and noise
    • Frontend for an automatic speech recognition system
    • Hearing aids
    It is often difficult to anticipate the environment in which these systems are used
    → Robustness against various acoustic environments is essential
    Image sources: http://spandh.dcs.shef.ac.uk/chime_challenge/CHiME4/data.html, https://ja.wikipedia.org/wiki/ファイル:広島駅新幹線改札口(乗り換え).JPG
  3. Related Work|Supervised Neural Speech Enhancement

    The network is trained in a supervised manner to directly convert noisy speech into clean speech [Pascual+ INTERSPEECH2017, Heymann+ ICASSP2016, Lu+ INTERSPEECH2013, …]
    • Excellent performance thanks to the non-linear mapping
    • Performance sometimes deteriorates in unknown environments
    Diagram: Noisy speech → DNN (e.g., LSTMs, CNNs) → Clean speech
  4. Related Work|Semi-supervised Neural Methods

    VAE-NMF|A hybrid enhancement method that combines a low-rank noise model with a deep speech model [Bando+ ICASSP2018, Leglaive+ MLSP2018, Pariente+ INTERSPEECH2019]
    • Noise training data are often scarce
    • Many speech corpora are available
  5. Deep Speech Prior Based on Variational Autoencoder

    The decoder of a VAE is trained with a clean speech dataset and used as the speech model for statistical speech enhancement
    • The VAE is trained to maximize a lower bound of $\log p(\mathbf{S})$, called the ELBO
    • Powerful DNN-based representation for speech spectra
    Diagram: Clean speech → Encoder DNN $q_\theta(\mathbf{Z} \mid \mathbf{S})$ → Latent representation $\mathbf{Z}$, $\mathbf{Z} \sim \mathcal{N}(\mathbf{0}, \mathbf{1})$ → Decoder DNN $p_\phi(\mathbf{S} \mid \mathbf{Z})$ → Predicted speech
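For reference, the bound mentioned above can be written in the standard VAE form. This uses the slide's symbols (encoder $q_\theta$, decoder $p_\phi$, prior $\mathcal{N}(\mathbf{0}, \mathbf{1})$); the slide itself does not spell the formula out, so treat it as the textbook expression rather than the authors' exact notation.

```latex
\log p(\mathbf{S}) \;\geq\;
\mathbb{E}_{q_\theta(\mathbf{Z} \mid \mathbf{S})}\!\left[ \log p_\phi(\mathbf{S} \mid \mathbf{Z}) \right]
- \mathrm{KL}\!\left[ q_\theta(\mathbf{Z} \mid \mathbf{S}) \,\|\, p(\mathbf{Z}) \right],
\qquad p(\mathbf{Z}) = \mathcal{N}(\mathbf{0}, \mathbf{1})
```

The first term rewards accurate reconstruction of the clean speech; the second keeps the encoder's output close to the prior.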
  6. Problem|Unnatural Speech-Like Noise in VAE-NMF

    The results of VAE-NMF often include unnatural speech-like noise
    • The latent representation of speech in the VAE is assumed to follow $\mathcal{N}(0, 1)$
    • This prior pulls the estimated speech toward an "average" speech signal
    Figure: spectrograms of clean speech and of the VAE-NMF (semi-supervised) output
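The pull toward an "average" signal can be illustrated with the KL divergence between a diagonal Gaussian and the standard normal prior, which is the penalty such a prior typically contributes. The sketch below is illustrative only: the function name and the 16-dimensional latent size are assumptions, not values from the slides.

```python
import numpy as np

def kl_to_standard_normal(mu, var):
    """KL( N(mu, diag(var)) || N(0, I) ), summed over latent dimensions."""
    mu = np.asarray(mu, dtype=float)
    var = np.asarray(var, dtype=float)
    return 0.5 * np.sum(mu**2 + var - np.log(var) - 1.0)

# The penalty is zero only at the "average" latent code (mu = 0, var = 1)
# and grows quadratically as the mean moves away from the origin.
# The latent size of 16 is an arbitrary toy choice.
kl_at_origin = kl_to_standard_normal(np.zeros(16), np.ones(16))
kl_far = kl_to_standard_normal(np.full(16, 2.0), np.ones(16))
```

Any latent code far from the origin is penalized regardless of what the observed mixture suggests, which is one way to read the "average speech" effect described above.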
  7. Key Idea|Denoising Variational Autoencoder

    A denoising VAE is used to estimate the prior distribution of the latent representation $\mathbf{Z}$
    • This framework can be regarded as adaptive neural enhancement
  8. Demo|Enhancement Results for Unseen Data

    Training w/ CHiME-4 → Testing w/ TIMIT + LITIS ROUEN Dataset
    Examples: Input mixture, LSTM-PSA (supervised), VAE-NMF (semi-supervised), DnVAE-NMF (supervised)
  9. Denoising Variational Autoencoder

    The encoder is trained to enhance speech from a noisy mixture
    • Training maximizes the ELBO, as in a standard VAE
    • Multitask learning w/ mask estimation is used for efficient training
    Diagram labels: Noisy mixture, Encoder (BiLSTM layer, FC layer), Speech mask, ∼ (sampling), Decoder, Clean speech
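One plausible form of this training objective is sketched below. It assumes the standard VAE bound carries over with the encoder conditioned on the noisy mixture $\mathbf{X}$ instead of the clean speech $\mathbf{S}$; the slide does not give the formula, and the mask loss $\mathcal{L}_{\mathrm{mask}}$ and its weight $\lambda$ are hypothetical placeholders for the multitask term.

```latex
\mathcal{L}_{\mathrm{ELBO}} =
\mathbb{E}_{q_\theta(\mathbf{Z} \mid \mathbf{X})}\!\left[ \log p_\phi(\mathbf{S} \mid \mathbf{Z}) \right]
- \mathrm{KL}\!\left[ q_\theta(\mathbf{Z} \mid \mathbf{X}) \,\|\, p(\mathbf{Z}) \right]
\;\leq\; \log p(\mathbf{S}),
\qquad
\text{maximize}\;\; \mathcal{L}_{\mathrm{ELBO}} - \lambda\, \mathcal{L}_{\mathrm{mask}}
```

The inequality holds for any choice of distribution over $\mathbf{Z}$, so conditioning the encoder on $\mathbf{X}$ still yields a valid lower bound on $\log p(\mathbf{S})$.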
  10. Generative Model of Noisy Speech Mixture

    $\mathbf{Z}$ follows a prior distribution based on the outputs of the encoder
    • Noisy speech spectrogram $\mathbf{X}$ = speech spectrogram $\mathbf{S}$ + noise spectrogram $\mathbf{N}$
    • VAE-based speech spectrogram $\mathbf{S}$: decoder $p(\mathbf{S} \mid \mathbf{Z})$, with latent variable $\mathbf{Z} \sim$ outputs of the denoising encoder
    • NMF-based noise spectrogram $\mathbf{N}$: basis vectors $\mathbf{W}$ and activations $\mathbf{H}$
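The generative model above can be sketched as a sampling procedure. This is a minimal illustration, assuming zero-mean complex Gaussian likelihoods for speech and noise (a common choice in the VAE-NMF literature, not stated on the slide). The encoder outputs, the toy decoder, and all sizes are hypothetical stand-ins for trained networks.

```python
import numpy as np

rng = np.random.default_rng(0)
F, T, D, K = 64, 50, 8, 4   # freq bins, frames, latent dims, NMF bases (toy sizes)

# Hypothetical stand-ins for the trained networks' outputs.
enc_mean = rng.normal(size=(D, T))        # prior mean of Z from the denoising encoder
enc_var = np.full((D, T), 0.1)            # prior variance of Z from the denoising encoder
dec_weight = rng.normal(scale=0.3, size=(F, D))

def toy_decoder(z):
    """Placeholder decoder: maps latent Z (D x T) to positive speech variances (F x T)."""
    return np.exp(dec_weight @ z)

def sample_complex_gaussian(variance):
    """Draw zero-mean circular complex Gaussian samples with the given variances."""
    scale = np.sqrt(variance / 2.0)
    return scale * (rng.normal(size=variance.shape) + 1j * rng.normal(size=variance.shape))

Z = enc_mean + np.sqrt(enc_var) * rng.normal(size=(D, T))   # Z ~ N(enc_mean, enc_var)
speech_var = toy_decoder(Z)                                  # VAE-based speech model
W = rng.gamma(shape=1.0, scale=1.0, size=(F, K))             # noise basis vectors
H = rng.gamma(shape=1.0, scale=1.0, size=(K, T))             # noise activations
noise_var = W @ H                                            # NMF-based noise model

S = sample_complex_gaussian(speech_var)
N = sample_complex_gaussian(noise_var)
X = S + N                                                    # noisy speech spectrogram
```

The only change relative to a plain VAE-NMF model in this sketch is that Z is drawn around the encoder's outputs rather than around zero.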
  11. Inference Based on Variational EM Algorithm

    We estimate [...] such that the [...] is minimized ($\mathbf{W}$ and $\mathbf{H}$ are updated such that [...] is maximized)
    • M-step [Leglaive+ 2018]
    • E-step: $\mathbf{A}$ and $\mathbf{B}$ are updated with an Adam optimizer to maximize the ELBO
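For the M-step, one common way to update the NMF noise parameters W and H is with multiplicative updates that decrease the Itakura-Saito divergence, which is equivalent to increasing a complex Gaussian likelihood. The sketch below assumes that likelihood and holds the speech variance fixed; the function names and toy data are hypothetical, and this is not necessarily the exact update used in the slides.

```python
import numpy as np

def is_divergence(power, model_var):
    """Itakura-Saito divergence between observed power and model variance."""
    ratio = power / model_var
    return float(np.sum(ratio - np.log(ratio) - 1.0))

def update_noise_nmf(power, speech_var, W, H, n_iter=50, eps=1e-12):
    """Multiplicative updates of W and H with the speech variance held fixed."""
    W, H = W.copy(), H.copy()
    for _ in range(n_iter):
        V = speech_var + W @ H                       # total model variance
        W *= np.sqrt(((power / V**2) @ H.T) / ((1.0 / V) @ H.T + eps))
        V = speech_var + W @ H                       # refresh after updating W
        H *= np.sqrt((W.T @ (power / V**2)) / (W.T @ (1.0 / V) + eps))
    return W, H

# Toy data (random, for illustration only).
rng = np.random.default_rng(0)
F, T, K = 32, 40, 3
power = rng.gamma(shape=1.0, scale=2.0, size=(F, T)) + 1e-3   # stands in for |X|^2
speech_var = np.full((F, T), 0.5)                              # fixed speech variance
W0 = rng.uniform(0.1, 1.0, size=(F, K))
H0 = rng.uniform(0.1, 1.0, size=(K, T))

cost_before = is_divergence(power, speech_var + W0 @ H0)
W1, H1 = update_noise_nmf(power, speech_var, W0, H0)
cost_after = is_divergence(power, speech_var + W1 @ H1)
```

The square root in each update is the majorization-minimization variant of the rule, chosen here because it does not increase the divergence from one iteration to the next.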
  12. Evaluation|Datasets

    We evaluated with seen and unseen datasets
    Seen dataset|CHiME-4 dataset for both training and testing
    • 7138 utterances for training / 1320 utterances for testing
    • Four types of noise environments: on a bus, in a cafeteria, in a pedestrian area, on a street junction
    Unseen dataset|TIMIT + ROUEN dataset for testing only
    • 1320 utterances for testing
    • Noise signals are provided by the LITIS ROUEN Audio scene dataset
    http://spandh.dcs.shef.ac.uk/chime_challenge/CHiME4/data.html
  13. Evaluation|Results for Seen Dataset

    For seen data, BiLSTM-MSA or BiLSTM-PSA achieved the best performance
    • DnVAE-NMF outperformed VAE-NMF in all measures (SDR, PESQ, and STOI)
    Figure: SDR [dB], PESQ, and STOI for BiLSTM-MSA, BiLSTM-PSA, VAE-NMF, and DnVAE-NMF (Proposed)
  14. Evaluation|Results for Unseen Dataset

    For unseen data, DnVAE-NMF outperformed all the other methods
    • DnVAE-NMF had fewer failure cases (e.g., SDR < 5 dB) than BiLSTM-MSA and BiLSTM-PSA
    Figure: SDR [dB], PESQ, and STOI for BiLSTM-MSA, BiLSTM-PSA, VAE-NMF, and DnVAE-NMF (Proposed)
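To make the notion of a failure case concrete, the sketch below computes a simplified signal-to-distortion ratio and the fraction of utterances below a threshold. It is an energy-ratio approximation, not the BSS Eval SDR that evaluations of this kind usually report, and the function names and toy numbers are illustrative assumptions rather than results from the slides.

```python
import numpy as np

def simple_sdr_db(reference, estimate):
    """Energy ratio of the reference to the residual error, in dB."""
    reference = np.asarray(reference, dtype=float)
    error = reference - np.asarray(estimate, dtype=float)
    return 10.0 * np.log10(np.sum(reference**2) / np.sum(error**2))

def failure_rate(sdr_values, threshold_db=5.0):
    """Fraction of utterances whose SDR falls below the threshold."""
    return float(np.mean(np.asarray(sdr_values) < threshold_db))

# Toy signal: the error energy is 1% of the reference energy.
reference = np.array([3.0, 4.0])
estimate = reference + np.array([0.3, 0.4])
sdr = simple_sdr_db(reference, estimate)

# Made-up per-utterance SDR values, for illustration only.
rate = failure_rate([12.0, 3.5, 8.0, 4.9])
```

Reporting the failure rate alongside the mean shows whether a method degrades badly on a subset of utterances even when its average score looks acceptable.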
  15. Conclusion

    Goal|Speech enhancement robust against unknown environments
    • Supervised methods often deteriorate in unknown environments
    Our idea|Introducing a probabilistic feedback mechanism
    • A denoising encoder is used to estimate the prior distribution of the latent representation $\mathbf{Z}$
    Future Work|Using a recurrent VAE & time-domain enhancement