Adaptive Neural Speech Enhancement
with a Denoising Variational Autoencoder
Yoshiaki Bando1,2, Kouhei Sekiguchi2,3, Kazuyoshi Yoshii2,3
1AIST, Japan 2RIKEN AIP, Japan 3Kyoto University, Japan
https://ybando.jp/demo/is2020/
Slide 2
Background|Speech Enhancement
The task of extracting a speech signal from a mixture of speech and noise
• Frontend for an automatic speech recognition system
• Hearing aids
It is often difficult to know in advance the environments in which these systems will be used
Robustness against diverse acoustic environments is essential
http://spandh.dcs.shef.ac.uk/chime_challenge/CHiME4/data.html
https://ja.wikipedia.org/wiki/ファイル:広島駅新幹線改札口(乗り換え).JPG
OK, google…
Slide 3
Related Work|Supervised Neural Speech Enhancement
A network is trained to directly convert noisy speech
into clean speech in a supervised manner
Excellent performance thanks to the non-linear mapping
Such methods, however, sometimes deteriorate in unknown environments
[Pascual+ INTERSPEECH2017, Heymann+ ICASSP2016, Lu+ INTERSPEECH2013, …]
[Figure: a DNN (e.g., LSTMs, CNNs) maps noisy speech to clean speech]
Slide 4
Related Work|Semi-supervised Neural Methods
VAE-NMF|A hybrid enhancement method combining
a low-rank noise model with a deep speech model
[Bando+ ICASSP2018, Leglaive+ MLSP2018, Pariente+ INTERSPEECH2019]
Noise training data are often scarce, whereas large clean speech corpora are available
Slide 5
Deep Speech Prior Based on Variational Autoencoder
The decoder of a VAE is trained on a clean speech dataset
to serve as the speech model of statistical speech enhancement
• The VAE is trained to maximize a lower bound of log 𝑝(𝐒) called the ELBO
• It provides a powerful DNN-based representation of speech spectra
[Figure: an encoder DNN 𝑞_𝜃(𝐙|𝐒) maps clean speech 𝐒 to a latent representation 𝐙 ~ 𝒩(𝟎, 𝐈); a decoder DNN 𝑝_𝜙(𝐒|𝐙) produces the predicted speech]
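For reference, the lower bound in question is the standard VAE evidence lower bound (ELBO):

log 𝑝(𝐒) ≥ 𝔼_{𝑞_𝜃(𝐙|𝐒)}[log 𝑝_𝜙(𝐒|𝐙)] − KL(𝑞_𝜃(𝐙|𝐒) ‖ 𝒩(𝟎, 𝐈))

The first term rewards faithful reconstruction of the clean speech, while the KL term keeps the latent representation close to the standard Gaussian prior.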
Slide 6
Problem|Unnatural Speech-Like Noise in VAE-NMF
The results of VAE-NMF often include unnatural speech-like noise
• The latent representation of speech in the VAE is assumed to follow 𝒩(𝟎, 𝐈)
• This prior pulls the estimated speech toward an “average” speech signal
[Figure: spectrograms of clean speech vs. the VAE-NMF (semi-supervised) result]
Slide 7
Key Idea|Denoising Variational Autoencoder
A denoising VAE is used to estimate the prior distribution of 𝐙
• This framework can be regarded as adaptive neural speech enhancement
Slide 8
Demo|Enhancement Results for Unseen Data
Training w/ CHiME-4, testing w/ the TIMIT + LITIS ROUEN dataset
Input mixture
LSTM-PSA
(supervised)
VAE-NMF
(semi-supervised)
DnVAE-NMF
(semi-supervised)
Slide 9
Denoising Variational Autoencoder
The encoder is trained to enhance speech from a noisy mixture
• Training is conducted by maximizing the ELBO, as in a standard VAE
• Multitask learning w/ mask estimation is used for efficient training (see the sketch below)
[Figure: the encoder (BiLSTM and FC layers) maps the noisy mixture to a speech mask and the latent variable 𝐙; the decoder reconstructs the clean speech]
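To make the encoder architecture and the multitask objective concrete, below is a minimal PyTorch-style sketch. The layer sizes, the MSE reconstruction and mask losses, and the weight alpha are illustrative assumptions, not the configuration reported in the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DenoisingEncoder(nn.Module):
    # Maps a noisy spectrogram to a speech mask and the parameters
    # (mean, log-variance) of the latent distribution over Z.
    # Layer sizes are illustrative assumptions.
    def __init__(self, n_freq=257, n_latent=32, n_hidden=128):
        super().__init__()
        self.blstm = nn.LSTM(n_freq, n_hidden, batch_first=True,
                             bidirectional=True)
        self.fc_mask = nn.Linear(2 * n_hidden, n_freq)      # mask head
        self.fc_mean = nn.Linear(2 * n_hidden, n_latent)    # latent mean
        self.fc_logvar = nn.Linear(2 * n_hidden, n_latent)  # latent log-var

    def forward(self, x):  # x: [batch, time, n_freq]
        h, _ = self.blstm(x)
        mask = torch.sigmoid(self.fc_mask(h))  # per-bin speech mask in [0, 1]
        return mask, self.fc_mean(h), self.fc_logvar(h)

def multitask_loss(encoder, decoder, x_noisy, s_clean, mask_target, alpha=1.0):
    # ELBO-style loss (reconstruction + KL) plus a mask-estimation
    # term, reflecting the multitask training on this slide.
    mask, mean, logvar = encoder(x_noisy)
    z = mean + torch.randn_like(mean) * torch.exp(0.5 * logvar)  # reparameterize
    s_hat = decoder(z)                  # predicted clean spectrogram
    recon = F.mse_loss(s_hat, s_clean)  # stand-in for the likelihood term
    kl = -0.5 * torch.mean(1 + logvar - mean ** 2 - logvar.exp())
    return recon + kl + alpha * F.mse_loss(mask, mask_target)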
Slide 10
Generative Model of Noisy Speech Mixture
𝐙 follows a prior distribution based on the outputs of the encoder
[Figure: the noisy speech spectrogram 𝐗 is the sum of the speech spectrogram 𝐒 and the noise spectrogram 𝐍; 𝐒 is generated by the decoder 𝑝(𝐒|𝐙) from the latent variable 𝐙, which is drawn from the prior given by the outputs of the denoising encoder; 𝐍 is an NMF-based spectrogram with basis vectors 𝐖 and activations 𝐇]
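In formulas, a common parameterization of this kind of model (following, e.g., [Leglaive+ 2018]; the exact form in the paper may differ) treats each time-frequency bin of the mixture as zero-mean complex Gaussian whose variance is the sum of the decoder output and the NMF noise power:

𝑥_{𝑓𝑡} ~ 𝒩_ℂ(0, 𝜎²_𝜙(𝐳_𝑡)_𝑓 + Σ_𝑘 𝑤_{𝑓𝑘} ℎ_{𝑘𝑡})

so that 𝐗 decomposes into the VAE-based speech spectrogram 𝐒 and the low-rank noise spectrogram 𝐍 = 𝐖𝐇.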
Slide 11
Inference Based on Variational EM Algorithm
We estimate the variational posterior of 𝐙 such that the KL divergence from the true posterior is minimized
(𝐖 and 𝐇 are updated such that the likelihood is maximized)
M-step|𝐖 and 𝐇 are updated as in [Leglaive+ 2018]
E-step|𝐀 and 𝐁 are updated with an Adam optimizer to maximize the ELBO (see the sketch below)
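As a rough sketch of how such alternating updates can be implemented: the multiplicative-update form for 𝐖 and 𝐇 below follows standard Itakura-Saito NMF practice, the paper's exact updates may differ, and elbo_fn is a hypothetical closure computing the ELBO from the posterior parameters.

import torch

def e_step(elbo_fn, A, B, n_iters=50, lr=1e-2):
    # E-step: update the posterior parameters A and B of q(Z)
    # with Adam so as to maximize the ELBO.
    A = A.clone().requires_grad_(True)
    B = B.clone().requires_grad_(True)
    opt = torch.optim.Adam([A, B], lr=lr)
    for _ in range(n_iters):
        opt.zero_grad()
        loss = -elbo_fn(A, B)  # maximizing the ELBO = minimizing its negative
        loss.backward()
        opt.step()
    return A.detach(), B.detach()

def m_step(X_pow, S_pow, W, H, eps=1e-8):
    # M-step: multiplicative updates for the NMF noise parameters W and H
    # under the Itakura-Saito divergence, with the speech power S_pow held
    # fixed (a common choice in VAE-NMF methods, e.g. [Leglaive+ 2018]).
    V = S_pow + W @ H + eps  # model power spectrogram
    W = W * ((X_pow / V ** 2) @ H.T) / ((1.0 / V) @ H.T + eps)
    V = S_pow + W @ H + eps
    H = H * (W.T @ (X_pow / V ** 2)) / (W.T @ (1.0 / V) + eps)
    return W, H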
Slide 12
Evaluation|Datasets
We evaluated on both seen and unseen datasets
Seen dataset|CHiME-4 dataset for both training and testing
• 7138 utterances for training / 1320 utterances for testing
• Four types of noise environments: on a bus, in a cafeteria, in a pedestrian area, and at a street junction
Unseen dataset|TIMIT + ROUEN dataset for testing only
• 1320 utterances for testing
• Noise signals are taken from the LITIS ROUEN audio scene dataset
http://spandh.dcs.shef.ac.uk/chime_challenge/CHiME4/data.html
Slide 13
Evaluation|Results for Seen Dataset
For seen data, BiLSTM-MSA and -PSA achieved the best performance
• DnVAE-NMF outperformed VAE-NMF in all measures (SDR, PESQ, and STOI)
[Figure: SDR [dB], PESQ, and STOI scores for BiLSTM-MSA, BiLSTM-PSA, VAE-NMF, and DnVAE-NMF (proposed)]
Slide 14
Evaluation|Results for Unseen Dataset
For unseen data, DnVAE-NMF outperformed all the other methods
• DnVAE-NMF had fewer failure cases (e.g., SDR < 5 dB) than BiLSTM-MSA and -PSA
[Figure: SDR [dB], PESQ, and STOI scores for BiLSTM-MSA, BiLSTM-PSA, VAE-NMF, and DnVAE-NMF (proposed)]
Slide 15
Conclusion
Goal|Speech enhancement robust against unknown environments
• Supervised methods often deteriorate in unknown environments
Our idea|Introducing a probabilistic feedback mechanism
• A denoising encoder is used to estimate the prior distribution of 𝐙
Future Work|Using a recurrent VAE & time-domain enhancement