Speech enhancement extracts target speech from a mixture of speech and noise
• Frontend for an automatic speech recognition system
• Hearing aids
It is often difficult to anticipate the environments where these systems are used
• Robustness against various acoustic environments is essential
[Photos of noisy environments; credits: http://spandh.dcs.shef.ac.uk/chime_challenge/CHiME4/data.html and https://ja.wikipedia.org/wiki/ファイル:広島駅新幹線改札口(乗り換え).JPG]
DNNs are trained to directly convert noisy speech into clean speech in a supervised manner [Pascual+ INTERSPEECH2017, Heymann+ ICASSP2016, Lu+ INTERSPEECH2013, …]
• Excellent performance thanks to the non-linear mapping
• They sometimes deteriorate in unknown environments
[Diagram: Noisy speech → DNN (e.g., LSTMs, CNNs) → Clean speech]
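To make the supervised mapping concrete, here is a minimal PyTorch sketch of a BiLSTM mask estimator in the spirit of the BiLSTM-MSA baseline evaluated later; the layer sizes, model class, and loss are assumptions for illustration, not the exact models of the cited papers.

```python
# Minimal sketch (assumed architecture and sizes) of a supervised mask-based
# enhancer: a BiLSTM predicts a time-frequency mask, trained so that the
# masked noisy magnitude matches the clean magnitude (MSA-style loss).
import torch
import torch.nn as nn

class BiLSTMMask(nn.Module):
    def __init__(self, n_freq=257, hidden=512):
        super().__init__()
        self.blstm = nn.LSTM(n_freq, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_freq)

    def forward(self, noisy_mag):          # (batch, frames, n_freq)
        h, _ = self.blstm(noisy_mag)
        return torch.sigmoid(self.fc(h))   # mask in [0, 1]

def msa_loss(model, noisy_mag, clean_mag):
    # Magnitude spectrum approximation: the masked noisy magnitude
    # should match the clean magnitude.
    mask = model(noisy_mag)
    return ((mask * noisy_mag - clean_mag) ** 2).mean()
```

At test time the predicted mask is multiplied with the noisy magnitude spectrogram; such a model works well in matched conditions but has no mechanism to adapt when the noise differs from the training data.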
Semi-supervised statistical enhancement combines a low-rank noise model and a deep speech model [Bando+ ICASSP2018, Leglaive+ MLSP2018, Pariente+ INTERSPEECH2019] (sketched below)
• Noise training data are often scarce
• We can use large clean-speech corpora
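A common formulation shared by these methods (notation assumed here) models each complex STFT coefficient of the mixture as zero-mean Gaussian whose variance is the sum of a DNN-decoded speech power and a rank-K NMF noise power:

```latex
x_{ft} \sim \mathcal{N}_{\mathbb{C}}\Big(0,\;
  \underbrace{\sigma^{2}_{\phi,ft}(\mathbf{Z})}_{\text{deep speech model}}
  \;+\; \underbrace{\textstyle\sum_{k=1}^{K} w_{fk} h_{kt}}_{\text{low-rank noise model}}\Big)
```

Only the NMF parameters w_fk and h_kt must be fitted to the noise at run time, which is why little or no noise training data is needed.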
A VAE is trained with a clean speech dataset to serve as the speech model of statistical speech enhancement
• The VAE is trained to maximize a lower bound of log p(S) called the ELBO (written out below)
• It provides a powerful DNN-based representation of speech spectra
[Diagram: Clean speech → Encoder DNN q_θ(Z|S) → latent representation Z ∼ 𝒩(0, I) → Decoder DNN p_φ(S|Z) → Predicted speech]
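Written out, the ELBO trades a reconstruction term against a KL term that ties the encoder output to the standard Gaussian prior:

```latex
\log p(\mathbf{S}) \;\geq\;
\mathbb{E}_{q_\theta(\mathbf{Z}\mid\mathbf{S})}\!\big[\log p_\phi(\mathbf{S}\mid\mathbf{Z})\big]
\;-\;\mathrm{KL}\!\big(q_\theta(\mathbf{Z}\mid\mathbf{S})\,\big\|\,\mathcal{N}(\mathbf{0},\mathbf{I})\big)
\;=\;\mathcal{L}_{\mathrm{ELBO}}
```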
The enhanced speech can include unnatural speech-like noise
• The latent representation of speech in the VAE is assumed to follow 𝒩(0, I)
• This prior pulls the estimated speech toward an "average" speech signal
[Spectrograms: clean speech vs. VAE-NMF (semi-supervised)]
Our idea|A denoising encoder is used to estimate the prior distribution of the latent variables Z from the noisy mixture (sketched below)
• This framework can be considered adaptive neural enhancement
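One way to read this feedback mechanism (the exact parameterization is an assumption here; diagonal Gaussians are the usual choice for VAE encoders) is that the fixed standard prior is replaced by a prior conditioned on the noisy mixture X through the denoising encoder ψ:

```latex
p(\mathbf{Z}) = \mathcal{N}(\mathbf{0},\mathbf{I})
\quad\longrightarrow\quad
p(\mathbf{Z}\mid\mathbf{X}) = \mathcal{N}\big(\boldsymbol{\mu}_\psi(\mathbf{X}),\,
\operatorname{diag}(\boldsymbol{\sigma}^{2}_\psi(\mathbf{X}))\big)
```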
The denoising encoder is trained to infer the latent variables of clean speech from a noisy mixture
• The training is conducted by maximizing the ELBO, as in a VAE
• Multitask learning with mask estimation is conducted for efficient training (a training-step sketch follows below)
[Diagram: Noisy mixture → Encoder (BiLSTM layers + FC layer) → speech mask and latent Z; Z → Decoder → Clean speech]
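The training step below is a minimal sketch of this multitask objective; the encoder interface, the MSE surrogate for the reconstruction likelihood, the ideal-ratio-mask target, and the weight `alpha` are assumptions for illustration, not the paper's exact losses.

```python
# Sketch of the multitask training step (names and weighting are assumptions):
# the denoising encoder sees the noisy mixture, but the ELBO reconstructs the
# *clean* speech through the decoder, with an auxiliary mask-estimation loss.
import torch
import torch.nn.functional as F

def training_step(noisy_mag, clean_mag, encoder, decoder, alpha=1.0):
    mu, log_var, mask = encoder(noisy_mag)                   # BiLSTM + FC heads
    z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()    # reparameterize

    # ELBO terms: clean-speech reconstruction + KL to the standard prior
    recon_loss = F.mse_loss(decoder(z), clean_mag)
    kl = -0.5 * (1 + log_var - mu ** 2 - log_var.exp()).mean()

    # Auxiliary multitask target: an ideal-ratio-style speech mask
    ideal_mask = (clean_mag / noisy_mag.clamp(min=1e-8)).clamp(max=1.0)
    mask_loss = F.mse_loss(mask, ideal_mask)

    return recon_loss + kl + alpha * mask_loss
```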
At run time, the model parameters are updated alternately so that the ELBO is maximized (see the sketch below)
• E-step: A and B are updated with an Adam optimizer for maximizing the ELBO
• M-step: W and H are updated such that the ELBO is maximized [Leglaive+ 2018]
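The run-time loop might look like the following sketch; the shapes, learning rate, and the multiplicative NMF update rules (the standard ones for the Itakura-Saito divergence) are assumptions rather than the paper's exact algorithm.

```python
# Sketch of the run-time EM loop (shapes and hyperparameters are assumptions).
# E-step: Adam refines the variational posterior parameters A (means) and
# B (log-variances) of Z against the denoising-encoder prior.
# M-step: multiplicative NMF updates for the noise parameters W (F x K)
# and H (K x T).
import torch

def infer(noisy_pow, decoder, prior_mu, prior_logvar, W, H,
          n_iters=50, eps=1e-8):
    A = prior_mu.clone().requires_grad_(True)       # posterior means
    B = prior_logvar.clone().requires_grad_(True)   # posterior log-variances
    opt = torch.optim.Adam([A, B], lr=1e-2)

    for _ in range(n_iters):
        # --- E-step: maximize the ELBO w.r.t. A and B ---
        opt.zero_grad()
        z = A + torch.randn_like(A) * (0.5 * B).exp()    # reparameterized draw
        speech_pow = decoder(z)                    # DNN speech power (F x T)
        lam = speech_pow + W @ H + eps             # mixture power spectrum
        nll = (torch.log(lam) + noisy_pow / lam).sum()   # complex-Gaussian NLL
        kl = 0.5 * (prior_logvar - B
                    + (B.exp() + (A - prior_mu) ** 2) / prior_logvar.exp()
                    - 1).sum()
        (nll + kl).backward()
        opt.step()

        # --- M-step: multiplicative updates for W and H ---
        with torch.no_grad():
            lam = speech_pow.detach() + W @ H + eps
            ratio = noisy_pow / lam ** 2
            W *= (ratio @ H.T) / ((1.0 / lam) @ H.T + eps)
            H *= (W.T @ ratio) / (W.T @ (1.0 / lam) + eps)
    return A, W, H
```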
Seen dataset|CHiME-4 dataset for both training and testing (http://spandh.dcs.shef.ac.uk/chime_challenge/CHiME4/data.html)
• 7138 utterances for training / 1320 utterances for testing
• Four types of noise environments: on a bus, in a cafeteria, in a pedestrian area, on a street junction
Unseen dataset|TIMIT + ROUEN dataset for testing only
• 1320 utterances for testing
• Noise signals are provided by the LITIS ROUEN audio scene dataset
For seen data, BiLSTM-MSA or BiLSTM-PSA achieved the best performance
• DnVAE-NMF outperformed VAE-NMF in all measures (SDR, PESQ, and STOI)
[Bar charts: SDR [dB], PESQ, and STOI for BiLSTM-MSA, BiLSTM-PSA, VAE-NMF, and DnVAE-NMF (proposed)]
For unseen data, DnVAE-NMF outperformed the other methods
• DnVAE-NMF had fewer failure cases (e.g., SDR < 5 dB) than BiLSTM-MSA and BiLSTM-PSA
[Bar charts: SDR [dB], PESQ, and STOI for BiLSTM-MSA, BiLSTM-PSA, VAE-NMF, and DnVAE-NMF (proposed)]
DNN-based enhancement methods often deteriorate in unknown environments
Our idea|Introducing a probabilistic feedback mechanism
• A denoising encoder is used to estimate the prior distribution of the latent variables
Future Work|Using a recurrent VAE & time-domain enhancement