Speech enhancement extracts target speech from a mixture of speech and noise
• Frontend for an automatic speech recognition system
• Hearing aids
It is often difficult to anticipate the environments where these systems are used
• Robustness against various acoustic environments is essential
[Photos of noisy environments; credits: http://spandh.dcs.shef.ac.uk/chime_challenge/CHiME4/data.html and https://ja.wikipedia.org/wiki/ファイル:広島駅新幹線改札口(乗り換え).JPG]
DNNs are trained to directly convert noisy speech into clean speech in a supervised manner [Pascual+ INTERSPEECH2017, Heymann+ ICASSP2016, Lu+ INTERSPEECH2013, …]
• Excellent performance thanks to the non-linear mapping
• They sometimes deteriorate in unknown environments
[Diagram: Noisy speech → DNN (e.g., LSTMs, CNNs) → Clean speech]
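To make the supervised mapping concrete, here is a minimal PyTorch sketch of a BiLSTM mask estimator in the spirit of the BiLSTM-MSA baseline evaluated later; the layer sizes, model class, and loss are assumptions for illustration, not the exact models of the cited papers.

```python
# Minimal sketch (assumed architecture and sizes) of a supervised mask-based
# enhancer: a BiLSTM predicts a time-frequency mask, trained so that the
# masked noisy magnitude matches the clean magnitude (MSA-style loss).
import torch
import torch.nn as nn

class BiLSTMMask(nn.Module):
    def __init__(self, n_freq=257, hidden=512):
        super().__init__()
        self.blstm = nn.LSTM(n_freq, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_freq)

    def forward(self, noisy_mag):          # (batch, frames, n_freq)
        h, _ = self.blstm(noisy_mag)
        return torch.sigmoid(self.fc(h))   # mask in [0, 1]

def msa_loss(model, noisy_mag, clean_mag):
    # Magnitude spectrum approximation: the masked noisy magnitude
    # should match the clean magnitude.
    mask = model(noisy_mag)
    return ((mask * noisy_mag - clean_mag) ** 2).mean()
```

At test time the predicted mask is multiplied with the noisy magnitude spectrogram; such a model works well in matched conditions but has no mechanism to adapt when the noise differs from the training data.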
Semi-supervised statistical enhancement combines a low-rank noise model and a deep speech model [Bando+ ICASSP2018, Leglaive+ MLSP2018, Pariente+ INTERSPEECH2019] (sketched below)
• Noise training data are often scarce
• We can use large clean-speech corpora
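A common formulation shared by these methods (notation assumed here) models each complex STFT coefficient of the mixture as zero-mean Gaussian whose variance is the sum of a DNN-decoded speech power and a rank-K NMF noise power:

```latex
x_{ft} \sim \mathcal{N}_{\mathbb{C}}\Big(0,\;
  \underbrace{\sigma^{2}_{\phi,ft}(\mathbf{Z})}_{\text{deep speech model}}
  \;+\; \underbrace{\textstyle\sum_{k=1}^{K} w_{fk} h_{kt}}_{\text{low-rank noise model}}\Big)
```

Only the NMF parameters w_fk and h_kt must be fitted to the noise at run time, which is why little or no noise training data is needed.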
A VAE is trained with a clean speech dataset to serve as the speech model of statistical speech enhancement
• The VAE is trained to maximize a lower bound of log p(S) called the ELBO (written out below)
• It provides a powerful DNN-based representation of speech spectra
[Diagram: Clean speech → Encoder DNN q_θ(Z|S) → latent representation Z ∼ 𝒩(0, I) → Decoder DNN p_φ(S|Z) → Predicted speech]
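Written out, the ELBO trades a reconstruction term against a KL term that ties the encoder output to the standard Gaussian prior:

```latex
\log p(\mathbf{S}) \;\geq\;
\mathbb{E}_{q_\theta(\mathbf{Z}\mid\mathbf{S})}\!\big[\log p_\phi(\mathbf{S}\mid\mathbf{Z})\big]
\;-\;\mathrm{KL}\!\big(q_\theta(\mathbf{Z}\mid\mathbf{S})\,\big\|\,\mathcal{N}(\mathbf{0},\mathbf{I})\big)
\;=\;\mathcal{L}_{\mathrm{ELBO}}
```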
The enhanced speech can include unnatural speech-like noise
• The latent representation of speech in the VAE is assumed to follow 𝒩(0, I)
• This prior pulls the estimated speech toward an "average" speech signal
[Spectrograms: clean speech vs. VAE-NMF (semi-supervised)]
Our idea|A denoising encoder is used to estimate the prior distribution of the latent variables Z from the noisy mixture (sketched below)
• This framework can be considered adaptive neural enhancement
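One way to read this feedback mechanism (the exact parameterization is an assumption here; diagonal Gaussians are the usual choice for VAE encoders) is that the fixed standard prior is replaced by a prior conditioned on the noisy mixture X through the denoising encoder ψ:

```latex
p(\mathbf{Z}) = \mathcal{N}(\mathbf{0},\mathbf{I})
\quad\longrightarrow\quad
p(\mathbf{Z}\mid\mathbf{X}) = \mathcal{N}\big(\boldsymbol{\mu}_\psi(\mathbf{X}),\,
\operatorname{diag}(\boldsymbol{\sigma}^{2}_\psi(\mathbf{X}))\big)
```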
The denoising encoder is trained to infer the latent variables of clean speech from a noisy mixture
• The training is conducted by maximizing the ELBO, as in a VAE
• Multitask learning with mask estimation is conducted for efficient training (a training-step sketch follows below)
[Diagram: Noisy mixture → Encoder (BiLSTM layers + FC layer) → speech mask and latent Z; Z → Decoder → Clean speech]
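The training step below is a minimal sketch of this multitask objective; the encoder interface, the MSE surrogate for the reconstruction likelihood, the ideal-ratio-mask target, and the weight `alpha` are assumptions for illustration, not the paper's exact losses.

```python
# Sketch of the multitask training step (names and weighting are assumptions):
# the denoising encoder sees the noisy mixture, but the ELBO reconstructs the
# *clean* speech through the decoder, with an auxiliary mask-estimation loss.
import torch
import torch.nn.functional as F

def training_step(noisy_mag, clean_mag, encoder, decoder, alpha=1.0):
    mu, log_var, mask = encoder(noisy_mag)                   # BiLSTM + FC heads
    z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()    # reparameterize

    # ELBO terms: clean-speech reconstruction + KL to the standard prior
    recon_loss = F.mse_loss(decoder(z), clean_mag)
    kl = -0.5 * (1 + log_var - mu ** 2 - log_var.exp()).mean()

    # Auxiliary multitask target: an ideal-ratio-style speech mask
    ideal_mask = (clean_mag / noisy_mag.clamp(min=1e-8)).clamp(max=1.0)
    mask_loss = F.mse_loss(mask, ideal_mask)

    return recon_loss + kl + alpha * mask_loss
```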
At run time, the model parameters are updated alternately so that the ELBO is maximized (see the sketch below)
• E-step: A and B are updated with an Adam optimizer for maximizing the ELBO
• M-step: W and H are updated such that the ELBO is maximized [Leglaive+ 2018]
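The run-time loop might look like the following sketch; the shapes, learning rate, and the multiplicative NMF update rules (the standard ones for the Itakura-Saito divergence) are assumptions rather than the paper's exact algorithm.

```python
# Sketch of the run-time EM loop (shapes and hyperparameters are assumptions).
# E-step: Adam refines the variational posterior parameters A (means) and
# B (log-variances) of Z against the denoising-encoder prior.
# M-step: multiplicative NMF updates for the noise parameters W (F x K)
# and H (K x T).
import torch

def infer(noisy_pow, decoder, prior_mu, prior_logvar, W, H,
          n_iters=50, eps=1e-8):
    A = prior_mu.clone().requires_grad_(True)       # posterior means
    B = prior_logvar.clone().requires_grad_(True)   # posterior log-variances
    opt = torch.optim.Adam([A, B], lr=1e-2)

    for _ in range(n_iters):
        # --- E-step: maximize the ELBO w.r.t. A and B ---
        opt.zero_grad()
        z = A + torch.randn_like(A) * (0.5 * B).exp()    # reparameterized draw
        speech_pow = decoder(z)                    # DNN speech power (F x T)
        lam = speech_pow + W @ H + eps             # mixture power spectrum
        nll = (torch.log(lam) + noisy_pow / lam).sum()   # complex-Gaussian NLL
        kl = 0.5 * (prior_logvar - B
                    + (B.exp() + (A - prior_mu) ** 2) / prior_logvar.exp()
                    - 1).sum()
        (nll + kl).backward()
        opt.step()

        # --- M-step: multiplicative updates for W and H ---
        with torch.no_grad():
            lam = speech_pow.detach() + W @ H + eps
            ratio = noisy_pow / lam ** 2
            W *= (ratio @ H.T) / ((1.0 / lam) @ H.T + eps)
            H *= (W.T @ ratio) / (W.T @ (1.0 / lam) + eps)
    return A, W, H
```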
Seen dataset|CHiME-4 dataset for both training and testing (http://spandh.dcs.shef.ac.uk/chime_challenge/CHiME4/data.html)
• 7138 utterances for training / 1320 utterances for testing
• Four types of noise environments: on a bus, in a cafeteria, in a pedestrian area, on a street junction
Unseen dataset|TIMIT + ROUEN dataset for testing only
• 1320 utterances for testing
• Noise signals are provided by the LITIS ROUEN audio scene dataset
For seen data, BiLSTM-MSA or BiLSTM-PSA achieved the best performance
• DnVAE-NMF outperformed VAE-NMF in all measures (SDR, PESQ, and STOI)
[Bar charts: SDR [dB], PESQ, and STOI for BiLSTM-MSA, BiLSTM-PSA, VAE-NMF, and DnVAE-NMF (proposed)]
For unseen data, DnVAE-NMF outperformed the other methods
• DnVAE-NMF had fewer failure cases (e.g., SDR < 5 dB) than BiLSTM-MSA and BiLSTM-PSA
[Bar charts: SDR [dB], PESQ, and STOI for BiLSTM-MSA, BiLSTM-PSA, VAE-NMF, and DnVAE-NMF (proposed)]
DNN-based enhancement methods often deteriorate in unknown environments
Our idea|Introducing a probabilistic feedback mechanism
• A denoising encoder is used to estimate the prior distribution of the latent variables
Future Work|Using a recurrent VAE & time-domain enhancement