Slide 1

Slide 1 text

Adaptive Neural Speech Enhancement with a Denoising Variational Autoencoder
Yoshiaki Bando¹,², Kouhei Sekiguchi²,³, Kazuyoshi Yoshii²,³
¹AIST, Japan  ²RIKEN AIP, Japan  ³Kyoto University, Japan
Demo: https://ybando.jp/demo/is2020/

Slide 2

Slide 2 text

Background | Speech Enhancement

Speech enhancement is the task of extracting a speech signal from a mixture of speech and noise.
• Front end for an automatic speech recognition system
• Hearing aids

It is often difficult to assume in advance the environments where these systems will be used, so robustness against diverse acoustic environments is essential.

Slide 3

Slide 3 text

Related Work | Supervised Neural Speech Enhancement

A DNN (e.g., LSTMs or CNNs) is trained to directly convert noisy speech into clean speech in a supervised manner.
+ Excellent performance thanks to the non-linear mapping
− Performance sometimes deteriorates in unknown environments
[Pascual+ INTERSPEECH 2017, Heymann+ ICASSP 2016, Lu+ INTERSPEECH 2013, …]
[Diagram: noisy speech → DNN → clean speech]

Slide 4

Slide 4 text

Related Work | Semi-supervised Neural Methods

VAE-NMF | A hybrid enhancement method combining a low-rank noise model and a deep speech model
[Bando+ ICASSP 2018, Leglaive+ MLSP 2018, Pariente+ INTERSPEECH 2019]
• Noise training data are often scarce, whereas large clean-speech corpora are available.

Slide 5

Slide 5 text

Deep Speech Prior Based on a Variational Autoencoder

The decoder of a VAE trained on a clean-speech dataset serves as the speech model for statistical speech enhancement.
• The VAE is trained to maximize a lower bound of log p(S), called the ELBO
• A powerful DNN-based representation for speech spectra
[Diagram: clean speech S → encoder DNN q_θ(Z | S) → latent representation Z ~ N(0, 1) → decoder DNN p_φ(S | Z) → predicted speech]
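To make the training objective concrete, here is a minimal PyTorch sketch of such a speech-prior VAE. The layer sizes, the diagonal-Gaussian encoder parameterization, and the Itakura-Saito-style likelihood on power spectra are illustrative assumptions, not the authors' exact architecture.

```python
# Hedged sketch of a VAE speech prior (sizes and likelihood are assumptions).
import torch
import torch.nn as nn

F_BINS, Z_DIM, HID = 513, 32, 128  # assumed dimensions

class SpeechVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(F_BINS, HID), nn.Tanh())
        self.enc_mu = nn.Linear(HID, Z_DIM)       # mean of q(Z | S)
        self.enc_logvar = nn.Linear(HID, Z_DIM)   # log-variance of q(Z | S)
        self.dec = nn.Sequential(                 # p(S | Z): outputs log-PSD
            nn.Linear(Z_DIM, HID), nn.Tanh(), nn.Linear(HID, F_BINS))

    def forward(self, s_pow):                     # s_pow: (T, F) power spectra
        h = self.enc(torch.log(s_pow + 1e-8))
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        return self.dec(z), mu, logvar

def neg_elbo(s_pow, log_psd, mu, logvar):
    # Itakura-Saito-style negative log-likelihood of the power spectra
    # (assumes a zero-mean complex Gaussian model of the STFT coefficients)
    nll = (s_pow / torch.exp(log_psd) + log_psd).sum()
    # KL divergence from q(Z | S) to the standard-normal prior N(0, 1)
    kl = 0.5 * (mu ** 2 + torch.exp(logvar) - logvar - 1).sum()
    return nll + kl
```

Minimizing `neg_elbo` over a clean-speech corpus is equivalent to maximizing the ELBO mentioned on the slide; only the trained decoder is kept as the speech model.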

Slide 6

Slide 6 text

Problem | Unnatural Speech-Like Noise in VAE-NMF

The results of VAE-NMF often include unnatural speech-like noise.
• The latent representation of speech in the VAE is assumed to follow N(0, 1)
• This prior tries to make the estimated speech into an "average" speech signal
[Spectrograms: clean speech vs. VAE-NMF (semi-supervised)]

Slide 7

Slide 7 text

Key Idea | Denoising Variational Autoencoder

A denoising VAE is used to estimate the prior distribution of the latent variables Z.
• This framework can be considered adaptive neural enhancement

Slide 8

Slide 8 text

Demo | Enhancement Results for Unseen Data

Training w/ CHiME-4 → Testing w/ TIMIT + LITIS ROUEN dataset
[Audio demos: input mixture, LSTM-PSA (supervised), VAE-NMF (semi-supervised), DnVAE-NMF (semi-supervised, proposed)]

Slide 9

Slide 9 text

Denoising Variational Autoencoder

The encoder is trained to enhance speech from a noisy mixture (see the sketch below).
• Training is conducted by maximizing the ELBO, as in a standard VAE
• Multitask learning with mask estimation is conducted for efficient training
[Diagram: noisy mixture → encoder (BiLSTM layers + FC layers) → speech mask and latent variables → decoder → clean speech]
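A minimal sketch of such a denoising encoder, assuming BiLSTM layers followed by fully connected heads for the latent Gaussian parameters and the speech mask; the multitask loss simply adds a mask-estimation term to the negative ELBO. The layer sizes, sigmoid mask head, and loss weighting are assumptions for illustration.

```python
# Hedged sketch of the denoising encoder (architecture details are assumptions).
import torch
import torch.nn as nn

F_BINS, Z_DIM, HID = 513, 32, 128  # assumed dimensions

class DenoisingEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.blstm = nn.LSTM(F_BINS, HID, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.fc_mu = nn.Linear(2 * HID, Z_DIM)      # mean of q(Z | X)
        self.fc_logvar = nn.Linear(2 * HID, Z_DIM)  # log-variance of q(Z | X)
        self.fc_mask = nn.Linear(2 * HID, F_BINS)   # speech-mask head

    def forward(self, x_pow):                       # x_pow: (B, T, F) noisy power
        h, _ = self.blstm(torch.log(x_pow + 1e-8))
        mask = torch.sigmoid(self.fc_mask(h))       # time-frequency speech mask
        return self.fc_mu(h), self.fc_logvar(h), mask

def multitask_loss(neg_elbo_term, mask, ideal_mask, alpha=1.0):
    # Negative ELBO (as in the plain VAE) plus a mask-estimation term;
    # alpha is an assumed weighting between the two tasks.
    return neg_elbo_term + alpha * ((mask - ideal_mask) ** 2).mean()
```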

Slide 10

Slide 10 text

Generative Model of the Noisy Speech Mixture

Z follows a prior distribution based on the outputs of the denoising encoder.
[Diagram: noisy speech spectrogram X = VAE-based speech spectrogram S + NMF-based noise spectrogram N, where S is generated by the decoder p(S | Z) from the latent variable Z (drawn from the denoising-encoder outputs), and N is built from basis vectors W and activations H]
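Spelled out, the model on this slide can be written as follows. This is a sketch in the slides' notation; the zero-mean complex Gaussian local model and the diagonal-Gaussian prior are assumed forms consistent with the VAE-NMF works cited on Slide 4, not a formula taken from this paper.

```latex
% Sketch of the generative model (assumed forms; see lead-in above).
x_{ft} = s_{ft} + n_{ft}, \qquad
x_{ft} \mid \mathbf{Z}, \mathbf{W}, \mathbf{H}
  \sim \mathcal{N}_c\!\left(0,\ \sigma^2_{ft}(\mathbf{Z}) + [\mathbf{W}\mathbf{H}]_{ft}\right),
\qquad
\mathbf{z}_t \sim \mathcal{N}\!\left(\widehat{\boldsymbol{\mu}}_t(\mathbf{X}),\
  \operatorname{diag}\,\widehat{\boldsymbol{\sigma}}^2_t(\mathbf{X})\right),
```

where σ²_{ft}(Z) is the decoder output and μ̂_t(X), σ̂²_t(X) are the outputs of the denoising encoder applied to the observed mixture.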

Slide 11

Slide 11 text

Inference Based on a Variational EM Algorithm

We estimate the variational posterior q(Z) such that the negative ELBO is minimized (W and H are updated such that the ELBO is maximized).
• E-step: A and B, the parameters of q(Z), are updated with an Adam optimizer to maximize the ELBO
• M-step: W and H are updated with multiplicative rules [Leglaive+ 2018]
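A hedged sketch of how this EM loop could look, reusing the modules sketched above. The interpretation of A and B as the mean and log-variance of q(Z), the Itakura-Saito-style ELBO, and the multiplicative-update form of the M-step are assumptions modeled on [Leglaive+ 2018], not the authors' exact implementation.

```python
# Hedged sketch of the variational EM inference (update rules assumed,
# modeled on IS-NMF-style multiplicative updates [Leglaive+ 2018]).
import torch

def enhance(x_pow, encoder, decoder, K=16, n_iter=50, n_estep=5, eps=1e-8):
    """x_pow: (T, F) noisy power spectrogram; returns a speech PSD estimate."""
    T, F = x_pow.shape
    W = torch.rand(F, K) + 0.1                     # NMF noise bases
    H = torch.rand(K, T) + 0.1                     # NMF noise activations
    with torch.no_grad():                          # prior of Z from the encoder
        mu_p, logvar_p, _ = encoder(x_pow[None])
        mu_p, logvar_p = mu_p[0], logvar_p[0]
    A = mu_p.clone().requires_grad_()              # variational mean
    B = logvar_p.clone().requires_grad_()          # variational log-variance
    opt = torch.optim.Adam([A, B], lr=1e-2)

    for _ in range(n_iter):
        # E-step: update A and B by maximizing the ELBO with Adam
        for _ in range(n_estep):
            z = A + torch.exp(0.5 * B) * torch.randn_like(A)
            s_psd = torch.exp(decoder(z))          # speech PSD from the decoder
            lam = s_psd + (W @ H).T                # total PSD of the mixture
            nll = (x_pow / lam + torch.log(lam)).sum()
            kl = 0.5 * (logvar_p - B + (torch.exp(B) + (A - mu_p) ** 2)
                        / torch.exp(logvar_p) - 1).sum()
            loss = nll + kl                        # negative ELBO
            opt.zero_grad()
            loss.backward()
            opt.step()
        # M-step: multiplicative updates of W and H (assumed IS-NMF-style)
        with torch.no_grad():
            z = A + torch.exp(0.5 * B) * torch.randn_like(A)
            s_psd = torch.exp(decoder(z))
            V = x_pow.T                            # (F, T) observed power
            lam = (s_psd + (W @ H).T).T            # (F, T) total PSD
            W *= ((V / lam ** 2) @ H.T) / ((1.0 / lam) @ H.T + eps)
            lam = (s_psd + (W @ H).T).T
            H *= (W.T @ (V / lam ** 2)) / (W.T @ (1.0 / lam) + eps)
    # In practice a Wiener filter s_psd / lam would be applied to the mixture.
    return s_psd.detach()
```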

Slide 12

Slide 12 text

Evaluation | Datasets

We evaluated with seen and unseen datasets.

Seen dataset | CHiME-4 dataset for both training and testing
• 7138 utterances for training / 1320 utterances for testing
• Four types of noise environments: on a bus, in a cafeteria, in a pedestrian area, and at a street junction
(http://spandh.dcs.shef.ac.uk/chime_challenge/CHiME4/data.html)

Unseen dataset | TIMIT + ROUEN dataset for testing only
• 1320 utterances for testing
• Noise signals are provided by the LITIS ROUEN audio scene dataset

Slide 13

Slide 13 text

Evaluation | Results for the Seen Dataset

For seen data, BiLSTM-MSA or -PSA achieved the best performance.
• DnVAE-NMF outperformed VAE-NMF in all measures (SDR, PESQ, and STOI)
[Bar charts: SDR [dB], PESQ, and STOI for BiLSTM-MSA, BiLSTM-PSA, VAE-NMF, and DnVAE-NMF (proposed)]

Slide 14

Slide 14 text

Evaluation | Results for the Unseen Dataset

For unseen data, DnVAE-NMF outperformed all the other methods.
• DnVAE-NMF had fewer failure cases (e.g., SDR < 5 dB) than BiLSTM-MSA and -PSA
[Bar charts: SDR [dB], PESQ, and STOI for BiLSTM-MSA, BiLSTM-PSA, VAE-NMF, and DnVAE-NMF (proposed)]

Slide 15

Slide 15 text

Conclusion

Goal | Speech enhancement robust against unknown environments
• Supervised methods often deteriorate in unknown environments

Our idea | Introducing a probabilistic feedback mechanism
• A denoising encoder is used to estimate the prior distribution of the latent variables Z

Future work | Using a recurrent VAE & time-domain enhancement