Universal Score-based Speech Enhancement with High Content Preservation

© LY Corporation
LY Corporation, Music Processing Team
Robin Scheibler, Yusuke Fujita, Yuma Shirahata, Tatsuya Komatsu
Interspeech 2024

Outline
1. Introduction: Universal Speech Enhancement
2. UNIVERSE++
3. Experiments

Universal Speech Enhancement
[Figure: degraded speech (noise, distortion, reverb, low-pass, clipping, codec, ...) is fed to an enhancement DNN; SE recovers a target that is present in the input, while USE must also generate the missing parts]
Conventional SE
- additive: x = s + n
- target is always in the input
- clear task for network
Universal SE (USE)
- any distortion (clip, reverb, ...)
- target may be partly missing
- task is ambiguous
Generative models can sample from suitable reconstructions!
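
The contrast between the two settings can be made concrete with a toy signal. The sketch below is only an illustration: the sine "speech", the noise level, and the clipping threshold are made-up values, and hard clipping stands in for the wider family of distortions named on the slide.

```python
import math
import random

def additive_noise(s, noise_std, rng):
    """Conventional SE model: x = s + n, so the clean target s is still inside x."""
    return [v + rng.gauss(0.0, noise_std) for v in s]

def hard_clip(s, threshold):
    """A non-additive distortion: everything beyond the threshold is flattened."""
    return [max(-threshold, min(threshold, v)) for v in s]

rng = random.Random(0)
clean = [math.sin(2 * math.pi * 5 * n / 200) for n in range(200)]  # toy "speech"
noisy = additive_noise(clean, noise_std=0.1, rng=rng)
clipped = hard_clip(clean, threshold=0.5)

# All clean peaks above the threshold map to the same clipped value, so their
# original amplitude cannot be read back from the observation alone.
lost = sum(1 for c in clean if abs(c) > 0.5)
print(f"{lost} of {len(clean)} samples were flattened by clipping")
```

In the additive case the network only has to remove n, whereas after clipping several clean signals are consistent with the same observation, which is where sampling from a generative model helps.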

Existing Generative SE Methods
Adversarial Training
[Figure: a denoising network processes noisy speech; a discriminator compares its output with clean speech and decides real/fake]
Examples
- SEGAN [Pascual2017]
- HiFi++ [Andreev2023]
Diffusion-based
[Figure: forward process and reverse process]
Examples
- CDiffuse [Lu2022]
- SGMSE [Welker2022]
- StoRM [Lemercier2023]
- UNIVERSE [Serra2022]

UNIVERSE Speech Enhancement Model [Serra2022]
- Diffusion model: Variance Exploding SDE
- Architecture: split feature extraction / score prediction
- Dataset: various speech, noise + 55 types of distortions
[Figure: conditioning network (input: noisy speech; trained with auxiliary losses) and score network (input: diffusion of clean speech and noise; trained with the score matching loss)]
Problems
- During re-implementation from the paper, we found the network hard to train
- We identified the auxiliary loss as a possible area for improvement
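
As a reminder of what a Variance Exploding SDE does to the clean signal, here is a minimal sketch. It assumes the geometric noise schedule that is commonly used with VE SDEs; `SIGMA_MIN` and `SIGMA_MAX` are hypothetical values chosen for illustration and are not taken from the slides.

```python
import math
import random

SIGMA_MIN, SIGMA_MAX = 0.01, 100.0  # hypothetical values, not from the slides

def sigma(t):
    """Geometric noise schedule of a VE SDE, for diffusion time t in [0, 1]."""
    return SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** t

def perturb(x, t, rng):
    """Sample x_t = x + sigma(t) * z and return it with the target -z / sigma(t)."""
    s = sigma(t)
    z = [rng.gauss(0.0, 1.0) for _ in x]
    x_t = [xi + s * zi for xi, zi in zip(x, z)]
    target = [-zi / s for zi in z]
    return x_t, target

def snr_db(var_x, t):
    """SNR of the perturbed signal in dB: 10 log10(Var(x) / sigma(t)^2)."""
    return 10.0 * math.log10(var_x / sigma(t) ** 2)

for t in (0.0, 0.5, 1.0):
    print(f"t={t:.1f}  sigma={sigma(t):.3g}  SNR={snr_db(1.0, t):.1f} dB")
```

With a geometric schedule the SNR in dB falls linearly with t, so both the network input and the score target span several orders of magnitude over the diffusion time.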

UNIVERSE++: Improved UNIVERSE Model
Contributions
1. several network improvements
2. adversarial loss to produce high-quality features
3. fine-tuning with linguistic content loss
[Figure: conditioning network with GAN loss (new); score network with score matching loss]

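Original UNIVERSE Model
[Figure: architecture of the original model. Condition network: condition encoder and condition decoder, fed with noisy speech, with MDN losses (KL, mixture of Gaussians) on mel-spectrogram and feature targets. Score network (U-Net): score encoder and score decoder with residual connections, fed by the diffusion process applied to clean speech, with an MSE loss against the true score. The condition network passes embeddings (emb) to the score network.]
MDN: mixture density network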

Improvement 1.1: Score Network IO Normalization
The score network ingests the noisy input and outputs the score: $x + \sigma_t z \;\to\; S \;\to\; -z / \sigma_t$
[Figure: noise schedule $\sigma_t$ and SNR in dB, $10 \log_{10}(\mathrm{Var}(x) / \sigma_t^2)$, plotted against diffusion time $t \in [0, 1]$]
High dynamic range input and output!
Solution: score network re-parameterized as in [Karras2022]:
$$S(x, c, \sigma_t) = c_{\mathrm{skip}}\, x + c_{\mathrm{out}}\, S'(c_{\mathrm{in}}\, x, c, \sigma_t) \qquad (1)$$
The weights $c_{\mathrm{skip}}$, $c_{\mathrm{in}}$, $c_{\mathrm{out}}$ are such that $\mathrm{Var}(c_{\mathrm{in}}(s + \sigma_t n)) = 1$ and $\mathrm{Var}(\text{target}) = 1$.
Possibly the most potent improvement!
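
The slide does not list the three weights. One set that meets both variance conditions can be derived in the manner of [Karras2022], by choosing c_skip so that c_out is as small as possible. The sketch below uses that derivation under stated assumptions (clean speech s with zero mean and standard deviation `sigma_data`, independent unit-variance noise z, score target -z/sigma_t) and checks the two variances by Monte Carlo. The weights actually used in UNIVERSE++ may differ.

```python
import math
import random

def precond_weights(sigma_t, sigma_data=1.0):
    """Weights for a score-valued output (assumed derivation, see lead-in)."""
    d = sigma_data ** 2 + sigma_t ** 2
    c_in = 1.0 / math.sqrt(d)                      # unit-variance network input
    c_skip = -1.0 / d                              # minimizes c_out
    c_out = sigma_data / (sigma_t * math.sqrt(d))  # unit-variance network target
    return c_skip, c_in, c_out

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

rng = random.Random(0)
sigma_t, sigma_data, num_samples = 5.0, 1.0, 100_000
c_skip, c_in, c_out = precond_weights(sigma_t, sigma_data)

net_inputs, net_targets = [], []
for _ in range(num_samples):
    s = rng.gauss(0.0, sigma_data)
    z = rng.gauss(0.0, 1.0)
    x_t = s + sigma_t * z           # noisy input to the score network
    score = -z / sigma_t            # quantity the full network S should output
    net_inputs.append(c_in * x_t)
    net_targets.append((score - c_skip * x_t) / c_out)  # what S' must predict

print(f"Var(input)  = {variance(net_inputs):.3f}")
print(f"Var(target) = {variance(net_targets):.3f}")
```

Both variances come out close to one regardless of sigma_t, which is the property that removes the high dynamic range from the inner network S'.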

Improvement 1.2: Antialiasing
[Figure: low-pass and conv blocks for downsampling in the encoder, conv^T and low-pass blocks for upsampling in the decoder, joined by a residual connection; high frequencies are not processed at the lower stages of the U-Net]
- Add antialiasing filters to all down/up sampling layers in score network
- Detrimental in condition network due to lack of residual connections
- Originally: reduce artifacts in GANs for images [Karras2021]
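
To show why a low-pass filter in front of a downsampling layer matters, the sketch below decimates a tone that lies above the new Nyquist frequency, once naively and once after filtering. It is a generic signal-processing illustration, not the filter design of the model: the Hamming-windowed sinc filter, its length, and its cutoff are assumptions.

```python
import math

def lowpass_taps(num_taps=63, cutoff=0.2):
    """Hamming-windowed sinc FIR; cutoff in cycles/sample (0.25 = Nyquist after /2)."""
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        k = n - mid
        ideal = 2 * cutoff if k == 0 else math.sin(2 * math.pi * cutoff * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]  # unit gain at DC

def fir_same(x, taps):
    """Zero-padded FIR filtering with the output aligned to the input."""
    half = len(taps) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for j, h in enumerate(taps):
            idx = i + half - j
            if 0 <= idx < len(x):
                acc += h * x[idx]
        out.append(acc)
    return out

def power(x):
    return sum(v * v for v in x) / len(x)

# Tone at 0.4 cycles/sample: representable before, but not after, downsampling by 2.
tone = [math.sin(2 * math.pi * 0.4 * n) for n in range(1024)]
naive = tone[::2]                                  # aliases into the low band
antialiased = fir_same(tone, lowpass_taps())[::2]  # tone removed before decimation
print(f"naive: {power(naive):.3f}  antialiased: {power(antialiased):.5f}")
```

The naive branch keeps almost all of the tone's power as an aliased component, while the filtered branch discards it. Discarding high frequencies at the lower stages is acceptable when a residual connection carries them around the bottleneck, which matches the slide's remark about the condition network.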

Improvement 2: Adversarial Training with HiFi-GAN Loss
[Figure: same architecture as the original model, with the condition network now trained with a mel-spectrogram L1 loss, a feature matching loss, and an adversarial loss from discriminators (disc.) that are themselves trained with a real/fake loss; the score network (U-Net) keeps the MSE loss against the true score]
mixture density network → HiFi-GAN loss [Kong2020]

Improvement 3: Fine-tuning with Linguistic Content Loss
[Figure: fine-tuning pipeline with blocks: clean speech, noisy speech, condition network, diffusion (N-2 steps), diffusion (2 steps), decode, phoneme recognizer, CTC loss, multi-res. STFT loss; 🔥 and ❄ mark trainable and frozen modules]
- linguistic content loss
- multi-resolution spectrogram loss
- fine-tuning by low-rank adaptation [Hu2022]
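
Low-rank adaptation [Hu2022] keeps a pretrained weight matrix frozen and learns only a low-rank update. A minimal sketch of one adapted linear layer follows; the rank, the scaling factor `alpha`, the layer size, and the initialization scale are illustrative assumptions, and the slides do not say which layers of the model are adapted.

```python
import random

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

class LoRALinear:
    """y = W x + (alpha / r) * B (A x); W stays frozen, only A and B are trained."""

    def __init__(self, weight, rank, alpha, rng):
        self.weight = weight  # frozen pretrained weight, shape (out, in)
        n_out, n_in = len(weight), len(weight[0])
        self.a = [[rng.gauss(0.0, 0.01) for _ in range(n_in)] for _ in range(rank)]
        self.b = [[0.0] * rank for _ in range(n_out)]  # zero init: no change at start
        self.scale = alpha / rank

    def __call__(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.b, matvec(self.a, x))
        return [y + self.scale * u for y, u in zip(base, update)]

    def num_trainable(self):
        return sum(len(row) for row in self.a) + sum(len(row) for row in self.b)

rng = random.Random(0)
weight = [[rng.gauss(0.0, 0.1) for _ in range(64)] for _ in range(64)]
layer = LoRALinear(weight, rank=4, alpha=8.0, rng=rng)
x = [rng.gauss(0.0, 1.0) for _ in range(64)]
print(layer.num_trainable(), "trainable parameters vs", 64 * 64, "frozen")
```

Because B starts at zero, the adapted layer initially reproduces the pretrained layer exactly, so fine-tuning starts from the behavior of the already trained model.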

Ablation Experiment Setup
Dataset: Voicebank-DEMAND [Valentini2016]
- Speech from VCTK, noise from DEMAND
- Sampling rate: 16 kHz
Training
- Optimizer: AdamW
- Learning rate: $10^{-6} \to 10^{-4} \to 10^{-6}$ (warmup + cosine schedule)
- 300 000 steps
- Batch size: 40
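
The learning-rate curve can be sketched as a function of the training step. The start, peak, and end values and the total of 300 000 steps come from the slide; the linear shape of the warmup and its length (`warmup_steps`) are not given there and are assumptions.

```python
import math

def learning_rate(step, total_steps=300_000, warmup_steps=10_000,
                  lr_min=1e-6, lr_max=1e-4):
    """Warmup from lr_min to lr_max, then cosine decay back to lr_min."""
    if step < warmup_steps:
        # Assumed linear warmup; the slide only says "warmup".
        return lr_min + (lr_max - lr_min) * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

for step in (0, 10_000, 155_000, 300_000):
    print(f"step {step:>7}: lr = {learning_rate(step):.2e}")
```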

Ablation Study
Model         PESQ-WB   ESTOI   DNSMOS (OVRL)
Unprocessed   1.97      0.79    2.70
+HiFi-GAN     2.98      0.86    3.20
+Network      2.93      0.86    3.18
Original      2.86      0.85    3.16

Universal Speech Enhancement Experiment
Hyperparameters
- AdamW / 1 500 000 steps / batch size 40
- Learning rate: $10^{-6} \to 10^{-4} \to 10^{-6}$ (warmup + cosine schedule)
Training Set (24 kHz)
- Speech: 537 h
- Noise: 601 h (environmental + BGM)
- Distortions: reverberation, band limitation, equalization distortion, clipping, random attenuation, packet loss, codec distortion (MP3)
Test Sets (24 kHz)
- VBD-LP: Voicebank-DEMAND low-passed at 4 kHz
- Signal Improvement Challenge (various real distortions)

Results for VBD-LP (Denoising + Bandwidth Extension)
Model         PESQ-WB   WER (%)   DNSMOS
+finetune     2.57      3.9       3.19
UNIVERSE++    2.72      5.4       3.19
UNIVERSE      2.36      10.3      3.11
StoRM         1.62      6.8       2.79
BSRNN         2.58      5.3       2.92
Unprocessed   1.89      4.1       2.70
BSRNN [Yu2023]: discriminative, MetricGAN (PESQ) loss
StoRM [Lemercier2023]: hybrid, discriminative denoising + diffusion refinement
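
For readers less familiar with the WER column: word error rate is the word-level edit distance between the recognized text and the reference, divided by the number of reference words. The sketch below shows that arithmetic only. The speech recognizer and the text normalization used in the experiments are not described on the slides, and the sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distance from an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return 100.0 * prev[-1] / len(ref)

wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"WER = {wer:.1f} %")  # one deleted word out of six
```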

Results for Signal Improvement Challenge
Model         WER (%)   DNSMOS
+finetune     21.8      2.93
UNIVERSE++    24.1      2.80
UNIVERSE      28.5      2.86
StoRM         25.5      2.53
BSRNN         22.9      2.59
Unprocessed   21.7      2.43
BSRNN [Yu2023]: discriminative, MetricGAN (PESQ) loss
StoRM [Lemercier2023]: hybrid, discriminative denoising + diffusion refinement

Audio Samples
[Audio players: Unprocessed, StoRM, BSRNN, UNIVERSE, UNIVERSE++ for the SIG and BWE test sets]
BSRNN [Yu2023]: discriminative, MetricGAN (PESQ) loss
StoRM [Lemercier2023]: hybrid, discriminative denoising + diffusion refinement
More samples: https://www.robinscheibler.org/interspeech2024-universepp-samples/

Conclusion
Summary
- UNIVERSE++: improved UNIVERSE model
- several network improvements
- HiFi-GAN loss
- fine-tuning with linguistic content loss
- enhancement for wide range of conditions
Future Work
- improve intelligibility
- smaller model, shorter training, etc.
GitHub: line/open-universe

References 1/2
[Pascual2017] Pascual et al., "SEGAN: Speech Enhancement Generative Adversarial Network," Interspeech, 2017.
[Andreev2023] Andreev et al., "HIFI++: A Unified Framework for Bandwidth Extension and Speech Enhancement," ICASSP, 2023.
[Lu2022] Lu et al., "Conditional Diffusion Probabilistic Model for Speech Enhancement," ICASSP, 2022.
[Welker2022] Welker et al., "Speech Enhancement with Score-Based Generative Models in the Complex STFT Domain," Interspeech, 2022.
[Richter2022] Richter et al., "Speech Enhancement and Dereverberation with Diffusion-based Generative Models," TASLP, 2022.
[Lemercier2023] Lemercier et al., "StoRM: A Diffusion-based Stochastic Regeneration Model for Speech Enhancement and Dereverberation," TASLP, 2023.
[Serra2022] Serra et al., "Universal Speech Enhancement with Score-based Diffusion," arXiv, 2022.

References 2/2
[Karras2022] Karras et al., "Elucidating the Design Space of Diffusion-Based Generative Models," NeurIPS, 2022.
[Karras2021] Karras et al., "Alias-Free Generative Adversarial Networks," NeurIPS, 2021.
[Kong2020] Kong et al., "HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis," NeurIPS, 2020.
[Hu2022] Hu et al., "LoRA: Low-rank adaptation of large language models," ICLR, 2022.
[Valentini2016] Valentini-Botinhao et al., "Speech Enhancement for a Noise-Robust Text-to-Speech Synthesis System Using Deep Recurrent Neural Networks," Interspeech, 2016.
[Yu2023] Yu et al., "High fidelity speech enhancement with band-split RNN," Interspeech, 2023.

HiFi-GAN Loss [Kong2020]
- Adversarial training: proposed for TTS
- Multi-resolution/period discriminators
- Feature matching: L1 distance of intermediate layers of discriminator
[Figure: clean speech and synthesized speech are compared with a mel-spectrogram L1 loss and fed to a multi-resolution discriminator and a multi-period discriminator; each discriminator yields a feature matching L1 loss, an adversarial loss, and a discriminator training loss]
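
The loss terms named on this slide can be written down compactly. The sketch below uses the least-squares GAN objectives of [Kong2020] and operates on plain Python lists that stand in for discriminator outputs and intermediate feature maps; the discriminators themselves, the mel-spectrogram L1 term, and the weights used to combine the terms are left out.

```python
def discriminator_loss(d_real, d_fake):
    """Least-squares GAN: push outputs on clean speech to 1, on synthesized speech to 0."""
    real_term = sum((d - 1.0) ** 2 for d in d_real) / len(d_real)
    fake_term = sum(d ** 2 for d in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_adversarial_loss(d_fake):
    """The generator is rewarded when the discriminator outputs 1 on its speech."""
    return sum((d - 1.0) ** 2 for d in d_fake) / len(d_fake)

def feature_matching_loss(feats_real, feats_fake):
    """Mean L1 distance between discriminator feature maps, summed over layers."""
    total = 0.0
    for layer_real, layer_fake in zip(feats_real, feats_fake):
        total += sum(abs(a - b) for a, b in zip(layer_real, layer_fake)) / len(layer_real)
    return total

# Toy values standing in for one discriminator with two intermediate layers.
d_real, d_fake = [0.9, 1.1], [0.2, -0.1]
feats_real = [[1.0, 2.0], [0.5, 0.5, 0.5]]
feats_fake = [[1.0, 4.0], [0.5, 0.0, 0.5]]
print(discriminator_loss(d_real, d_fake),
      generator_adversarial_loss(d_fake),
      feature_matching_loss(feats_real, feats_fake))
```

In a full setup these terms would be evaluated for every sub-discriminator of the multi-resolution and multi-period discriminators and then summed.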
