
Universal Score-based Speech Enhancement with High Content Preservation

Slides of the presentation given at Interspeech 2024 about universal score-based speech enhancement.
Abstract: We propose UNIVERSE++, a universal speech enhancement method based on score-based diffusion and adversarial training. Specifically, we improve the existing UNIVERSE model that decouples clean speech feature extraction and diffusion. Our contributions are three-fold. First, we make several modifications to the network architecture, improving training stability and final performance. Second, we introduce an adversarial loss to promote learning high quality speech features. Third, we propose a low-rank adaptation scheme with a phoneme fidelity loss to improve content preservation in the enhanced speech. In the experiments, we train a universal enhancement model on a large scale dataset of speech degraded by noise, reverberation, and various distortions. The results on multiple public benchmark datasets demonstrate that UNIVERSE++ compares favorably to both discriminative and generative baselines for a wide range of qualitative and intelligibility metrics.


Transcript

  1. © LY Corporation Universal Score-based Speech Enhancement with High Content Preservation

     LY Corporation, Music Processing Team
     Robin Scheibler, Yusuke Fujita, Yuma Shirahata, Tatsuya Komatsu
     Interspeech 2024
  2-7. © LY Corporation Universal Speech Enhancement

     [Diagram: speech degraded by noise, reverb, low-pass, clipping, codec distortion, ... is fed to an enhancement DNN]
     Conventional SE:
     • additive model: x = s + n
     • the target is always present in the input
     • a clear task for the network
     Universal SE (USE):
     • any distortion (clipping, reverb, ...)
     • the target may be partly missing and must be generated
     • the task is ambiguous
     Generative models can sample from suitable reconstructions!
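The conventional additive model x = s + n from the slide can be sketched in a few lines. The function name, signature, and SNR-based scaling below are our own illustration, not code from the deck.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Return the additive mixture x = s + n with noise scaled to a target SNR (dB).

    Illustrative sketch of the conventional SE setting; not the paper's code.
    """
    noise = noise[: len(speech)]
    p_s = np.mean(speech**2)
    p_n = np.mean(noise**2) + 1e-12
    # scale noise so that 10*log10(p_s / p_n_scaled) equals snr_db
    gain = np.sqrt(p_s / (p_n * 10 ** (snr_db / 10)))
    return speech + gain * noise
```

A universal-SE pipeline would additionally apply non-additive distortions (clipping, reverberation, codecs) where the clean target can no longer be recovered by subtraction alone.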
  8. © LY Corporation Existing Generative SE Methods

     Adversarial training: a denoising network maps noisy speech toward clean speech while a discriminator judges real/fake.
     • SEGAN [Pascual2017]
     • HiFi++ [Andreev2023]
     Diffusion-based: a forward process corrupts clean speech; the learned reverse process enhances.
     • CDiffuse [Lu2022]
     • SGMSE [Welker2022]
     • StoRM [Lemercier2023]
     • UNIVERSE [Serra2022]
  9. © LY Corporation UNIVERSE Speech Enhancement Model [Serra2022] ‚ Diffusion

    model: Variance Exploding SDE ‚ Architecture: split feature extraction / score prediction ‚ Dataset: various speech, noise + 55 types of distortions conditioning network auxiliary loss auxiliary loss auxiliary loss score network score network score matching loss score network noisy speech di usion clean noise Problems ‚ During re-implementation from paper, found network hard to train ‚ Found auxiliary loss as possible area for improvement 5
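The Variance Exploding SDE named on the slide admits a simple forward-process sketch: the perturbed sample is x_t = x_0 + σ(t)z, and the score of p(x_t | x_0) is −z/σ(t). The geometric schedule and the σ_min/σ_max values below are generic illustrations, not the paper's settings.

```python
import numpy as np

def ve_sigma(t, sigma_min=0.01, sigma_max=10.0):
    """Geometric noise schedule of a Variance Exploding SDE.

    sigma_min/sigma_max are illustrative values, not the paper's.
    """
    return sigma_min * (sigma_max / sigma_min) ** t

def ve_forward(x0, t, rng):
    """Sample x_t = x_0 + sigma(t) * z and return the conditional score."""
    z = rng.standard_normal(x0.shape)
    sigma = ve_sigma(t)
    xt = x0 + sigma * z
    # score of p(x_t | x_0) is -(x_t - x_0) / sigma^2 = -z / sigma
    score = -z / sigma
    return xt, score
```

Training the score network amounts to regressing its output against this conditional score (score matching).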
  10. © LY Corporation UNIVERSE++: Improved UNIVERSE Model

     Contributions:
     1. several network improvements
     2. an adversarial (GAN) loss to produce high-quality features
     3. fine-tuning with a linguistic content loss
     [Diagram: conditioning network and score network with score matching loss; the GAN loss is new]
  11. © LY Corporation Original UNIVERSE Model

     [Diagram: the condition network (condition encoder and decoder) processes the noisy speech and is trained with MDN losses (KL divergence to a mixture-of-Gaussians output) on mel-spectrogram and embedding targets, plus an MSE loss on features; the score network (U-Net with residual score encoder and score decoder) receives the clean speech after the diffusion process and is trained against the true score]
     MDN: mixture density network
  12-15. © LY Corporation Improvement 1.1: Score Network I/O Normalization

     The score network ingests the noisy input x + σₜz and outputs the score −z/σₜ. Over the noise schedule, the input SNR 10 log₁₀(Var(x)/σₜ²) sweeps roughly from −40 to 40 dB as diffusion time t goes from 0 to 1: both input and output have high dynamic range!
     Solution: re-parameterize the score network as in [Karras2022]:
         S(x, c, σₜ) = c_skip x + c_out S′(c_in x, c, σₜ),   (1)
     where the weights c_skip, c_in, c_out are chosen such that Var(c_in(s + σₜn)) = 1 and Var(target) = 1.
     Possibly the most potent improvement!
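The preconditioning weights of [Karras2022] have closed forms; a sketch, assuming the standard EDM expressions with data standard deviation sigma_data (the deck does not spell these out):

```python
import numpy as np

def karras_weights(sigma, sigma_data=1.0):
    """Preconditioning weights c_skip, c_out, c_in from [Karras2022] (EDM).

    With Var(s) = sigma_data**2, c_in normalizes the noisy input
    s + sigma * n to unit variance, and c_out scales the inner network's
    unit-variance output back to the target scale.
    """
    denom = sigma**2 + sigma_data**2
    c_skip = sigma_data**2 / denom
    c_out = sigma * sigma_data / np.sqrt(denom)
    c_in = 1.0 / np.sqrt(denom)
    return c_skip, c_out, c_in
```

The outer network S(x, c, σ) = c_skip·x + c_out·S′(c_in·x, c, σ) then sees unit-variance inputs and targets at every noise level, which is the normalization the slide credits for training stability.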
  16. © LY Corporation Improvement 1.2: Antialiasing

     [Diagram: conv → low-pass → downsampling in the encoder; low-pass → transposed conv → upsampling in the decoder; a residual connection bypasses them, so high frequencies are not processed at the lower stages of the U-Net]
     • add antialiasing filters to all down-/up-sampling layers in the score network
     • detrimental in the condition network due to its lack of residual connections
     • originally proposed to reduce artifacts in GANs for images [Karras2021]
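The "low-pass then downsample" step can be sketched with a generic windowed-sinc filter. The filter design below (31-tap Hamming-windowed sinc) is an illustrative choice, not the paper's exact filter:

```python
import numpy as np

def lowpass_kernel(num_taps=31, cutoff=0.5):
    """Windowed-sinc low-pass FIR; cutoff is a fraction of the Nyquist rate."""
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = np.sinc(cutoff * n) * np.hamming(num_taps)
    return h / h.sum()

def aa_downsample(x, factor=2):
    """Antialiased downsampling: low-pass to the new Nyquist, then decimate.

    Mirrors the slide's conv -> low-pass -> downsample chain; a sketch,
    not the UNIVERSE++ implementation.
    """
    h = lowpass_kernel(cutoff=1.0 / factor)
    y = np.convolve(x, h, mode="same")
    return y[::factor]
```

Naive decimation `x[::2]` would fold any content above the new Nyquist back into the band as aliasing; the low-pass removes it first, at the cost that those high frequencies are only carried by the residual connection.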
  17. © LY Corporation Improvement 2: Adversarial Training with HiFi-GAN Loss

     [Diagram: same architecture as slide 11, but the condition network's mixture density network losses are replaced by the HiFi-GAN loss [Kong2020]: discriminator real/fake (adversarial) loss, feature matching loss, and mel-spectrogram L1 loss; the score network keeps its MSE score matching loss]
  18. © LY Corporation Improvement 3: Fine-tuning with Linguistic Content Loss

     [Diagram: noisy speech → condition → diffusion for N−2 steps, then the final 2 steps → enhanced speech; a frozen (❄) phoneme recognizer decodes both clean and enhanced speech, while the enhancement model is trainable (🔥)]
     • linguistic content loss: CTC loss on the phoneme recognizer outputs
     • multi-resolution STFT (spectrogram) loss
     • fine-tuning by low-rank adaptation [Hu2022]
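Low-rank adaptation [Hu2022] freezes the pre-trained weight W and trains only a low-rank update B·A. The class below is a minimal sketch of that recipe (names, rank, and scaling are the usual LoRA conventions, not the paper's fine-tuning code):

```python
import numpy as np

rng = np.random.default_rng(0)

class LoRALinear:
    """Low-rank adaptation [Hu2022] of a frozen linear layer.

    Effective weight: W + (alpha / r) * B @ A, with W frozen and only
    A, B trained. Illustrative sketch, not the UNIVERSE++ code.
    """

    def __init__(self, W, r=4, alpha=8):
        self.W = W                                         # frozen (out, in)
        out_dim, in_dim = W.shape
        self.A = rng.standard_normal((r, in_dim)) * 0.01   # trainable
        self.B = np.zeros((out_dim, r))                    # trainable, zero-init
        self.scale = alpha / r

    def __call__(self, x):
        return self.W @ x + self.scale * (self.B @ (self.A @ x))
```

Because B is zero-initialized, the adapted layer starts out exactly equal to the pre-trained one, so fine-tuning departs smoothly from the universal model.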
  19. © LY Corporation Ablation Experiment Setup

     Dataset: Voicebank-DEMAND [Valentini2016]
     • speech from VCTK, noise from DEMAND
     • sampling rate: 16 kHz
     Training:
     • optimizer: AdamW
     • learning rate: 1e-6 → 1e-4 → 1e-6 (warmup + cosine schedule)
     • 300,000 steps
     • batch size: 40
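The 1e-6 → 1e-4 → 1e-6 learning-rate shape from the slide can be sketched as linear warmup followed by cosine decay. The warmup length is our assumption; the deck does not state it.

```python
import math

def lr_schedule(step, total_steps, warmup_steps, lr_min=1e-6, lr_max=1e-4):
    """Linear warmup lr_min -> lr_max, then cosine decay back to lr_min.

    Matches the slide's 1e-6 -> 1e-4 -> 1e-6 shape; warmup_steps is an
    assumption, not a value from the deck.
    """
    if step < warmup_steps:
        return lr_min + (lr_max - lr_min) * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```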
  20. © LY Corporation Ablation Study

     Model         PESQ-WB   ESTOI   DNSMOS (OVRL)
     Unprocessed   1.97      0.79    2.70
     Original      2.86      0.85    3.16
     +Network      2.93      0.86    3.18
     +HiFi-GAN     2.98      0.86    3.20
  21-23. © LY Corporation Universal Speech Enhancement Experiment

     Hyperparameters:
     • AdamW / 1,500,000 steps / batch size 40
     • learning rate: 1e-6 → 1e-4 → 1e-6 (warmup + cosine schedule)
     Training set (24 kHz):
     • speech: 537 h
     • noise: 601 h (environmental + BGM)
     • distortions: reverberation, band limitation, equalization distortion, clipping, random attenuation, packet loss, codec distortion (MP3)
     Test sets (24 kHz):
     • VBD-LP: Voicebank-DEMAND low-passed at 4 kHz
     • Signal Improvement Challenge (various real distortions)
  24. © LY Corporation Results for VBD-LP (Denoising + Bandwidth Extension)

     Model         PESQ-WB   WER (%)   DNSMOS
     Unprocessed   1.89      4.1       2.70
     BSRNN         2.58      5.3       2.92
     StoRM         1.62      6.8       2.79
     UNIVERSE      2.36      10.3      3.11
     UNIVERSE++    2.72      5.4       3.19
     +finetune     2.57      3.9       3.19
     BSRNN [Yu2023]: discriminative, MetricGAN (PESQ) loss. StoRM [Lemercier2023]: hybrid, discriminative denoising + diffusion refinement.
  25. © LY Corporation Results for the Signal Improvement Challenge

     Model         WER (%)   DNSMOS
     Unprocessed   21.7      2.43
     BSRNN         22.9      2.59
     StoRM         25.5      2.53
     UNIVERSE      28.5      2.86
     UNIVERSE++    24.1      2.80
     +finetune     21.8      2.93
     BSRNN [Yu2023]: discriminative, MetricGAN (PESQ) loss. StoRM [Lemercier2023]: hybrid, discriminative denoising + diffusion refinement.
  26. © LY Corporation Audio Samples

     Samples (SIG and BWE panels): Unprocessed, StoRM, BSRNN, UNIVERSE, UNIVERSE++
     More samples: https://www.robinscheibler.org/interspeech2024-universepp-samples/
  27-28. © LY Corporation Conclusion

     Summary:
     • UNIVERSE++: an improved UNIVERSE model
     • several network improvements
     • HiFi-GAN loss
     • fine-tuning with a linguistic content loss
     • enhancement for a wide range of conditions
     Future work:
     • improve intelligibility
     • smaller model, shorter training, etc.
     GitHub: line/open-universe
  29. © LY Corporation References 1/2

     [Pascual2017] Pascual et al., "SEGAN: Speech Enhancement Generative Adversarial Network," Interspeech, 2017.
     [Andreev2023] Andreev et al., "HiFi++: A Unified Framework for Bandwidth Extension and Speech Enhancement," ICASSP, 2023.
     [Lu2022] Lu et al., "Conditional Diffusion Probabilistic Model for Speech Enhancement," ICASSP, 2022.
     [Welker2022] Welker et al., "Speech Enhancement with Score-Based Generative Models in the Complex STFT Domain," Interspeech, 2022.
     [Richter2022] Richter et al., "Speech Enhancement and Dereverberation with Diffusion-based Generative Models," TASLP, 2022.
     [Lemercier2023] Lemercier et al., "StoRM: A Diffusion-Based Stochastic Regeneration Model for Speech Enhancement and Dereverberation," TASLP, 2023.
     [Serra2022] Serra et al., "Universal Speech Enhancement with Score-based Diffusion," arXiv, 2022.
  30. © LY Corporation References 2/2

     [Karras2022] Karras et al., "Elucidating the Design Space of Diffusion-Based Generative Models," NeurIPS, 2022.
     [Karras2021] Karras et al., "Alias-Free Generative Adversarial Networks," NeurIPS, 2021.
     [Kong2020] Kong et al., "HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis," NeurIPS, 2020.
     [Hu2022] Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models," ICLR, 2022.
     [Valentini2016] Valentini-Botinhao et al., "Speech Enhancement for a Noise-Robust Text-to-Speech Synthesis System Using Deep Recurrent Neural Networks," Interspeech, 2016.
     [Yu2023] Yu et al., "High Fidelity Speech Enhancement with Band-split RNN," Interspeech, 2023.
  31. © LY Corporation HiFi-GAN Loss [Kong2020]

     • adversarial training scheme originally proposed for TTS
     • multi-resolution / multi-period discriminators
     • feature matching: L1 distance between intermediate layers of the discriminators
     [Diagram: clean and synthesized speech feed the multi-resolution and multi-period discriminators; each contributes an adversarial loss, a feature-matching L1 loss, and a discriminator training loss, alongside a mel-spectrogram L1 loss]
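The feature matching term from the slide is just a sum of L1 distances between the discriminator's intermediate activations on clean vs generated speech. A minimal sketch (the list-of-arrays interface is our own illustration):

```python
import numpy as np

def feature_matching_loss(real_feats, fake_feats):
    """Sum of per-layer L1 distances between discriminator features [Kong2020].

    real_feats / fake_feats: lists of per-layer activations from the same
    discriminator run on clean and on generated speech. Illustrative sketch.
    """
    return sum(
        np.mean(np.abs(r - f)) for r, f in zip(real_feats, fake_feats)
    )
```

Unlike the real/fake adversarial term, this loss gives the generator a dense, stable regression target at every discriminator layer, which is why the slide highlights it for learning high-quality features.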