Slide 1

Slide 1 text

© LY Corporation
Universal Score-based Speech Enhancement with High Content Preservation
Robin Scheibler, Yusuke Fujita, Yuma Shirahata, Tatsuya Komatsu
LY Corporation, Music Processing Team
Interspeech 2024

Slide 2

Slide 2 text

Outline
1. Introduction: Universal Speech Enhancement
2. UNIVERSE++
3. Experiments

Slide 3

Slide 3 text

Introduction: Universal Speech Enhancement

Slide 4

Slide 4 text

Universal Speech Enhancement

[Figure: distorted speech (noise, reverb, low-pass, clipping, codec, ...) → enhancement DNN]

Conventional SE
• additive model: x = s + n
• target is always in the input
• clear task for the network

Universal SE
• any distortion (clipping, reverb, ...)
• target may be partly missing: the model must generate the missing content
• task is ambiguous

Generative models can sample from suitable reconstructions!

Slide 10

Slide 10 text

Existing Generative SE Methods

Adversarial Training
[Figure: denoising network maps noisy speech to enhanced speech; a discriminator judges real/fake against clean speech]
Examples
• SEGAN [Pascual2017]
• HiFi++ [Andreev2023]

Diffusion-based
[Figure: forward process gradually adds noise; reverse process removes it]
Examples
• CDiffuse [Lu2022]
• SGMSE [Welker2022]
• StoRM [Lemercier2023]
• UNIVERSE [Serra2022]

Slide 11

Slide 11 text

UNIVERSE Speech Enhancement Model [Serra2022]
• Diffusion model: variance-exploding SDE
• Architecture: split feature extraction / score prediction
• Dataset: varied speech and noise + 55 types of distortions

[Figure: conditioning network with auxiliary losses extracts features from noisy speech; score network trained with a score matching loss on the diffusion of clean speech and noise]

Problems
• During re-implementation from the paper, the network proved hard to train
• The auxiliary loss was identified as a possible area for improvement
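The variance-exploding forward process can be sketched as follows; the geometric σ-range endpoints are assumptions for illustration, since the slide does not state the schedule's parameters.

```python
import numpy as np

def ve_sigma(t, sigma_min=0.05, sigma_max=50.0):
    # Geometric noise schedule sigma(t) of a variance-exploding SDE;
    # the endpoint values are assumptions, not taken from the slide.
    return sigma_min * (sigma_max / sigma_min) ** t

def forward_diffuse(x0, t, rng):
    # Sample x_t = x_0 + sigma(t) * z with z ~ N(0, I); the score target
    # for this Gaussian perturbation is -z / sigma(t).
    z = rng.standard_normal(x0.shape)
    return x0 + ve_sigma(t) * z, -z / ve_sigma(t)
```

During training, the score network is regressed against the returned target with a score matching loss.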

Slide 12

Slide 12 text

UNIVERSE++

Slide 13

Slide 13 text

UNIVERSE++: Improved UNIVERSE Model

Contributions
1. several network improvements
2. adversarial loss to produce high-quality features
3. fine-tuning with a linguistic content loss

[Figure: conditioning network feeds the score network; score matching loss plus a new GAN loss]

Slide 14

Slide 14 text

Original UNIVERSE Model

[Figure: condition network (condition encoder/decoder with mel-spectrogram and feature targets, MDN losses with mixture-of-Gaussians heads, embedding) feeds a residual score network (score encoder/decoder U-Net) trained with an MSE loss against the true score from the diffusion process on clean/noisy speech]

MDN: mixture density network

Slide 15

Slide 15 text

Improvement 1.1: Score Network IO Normalization

The score network ingests the noisy input x + σ_t z and outputs the score −z / σ_t.
Noise schedule: the SNR, 10 log10(Var(x) / σ_t²), sweeps from about +40 dB down to −40 dB over diffusion time t ∈ [0, 1]: a high dynamic range at both input and output!

Solution: re-parameterize the score network as in [Karras2022]:

    S(x, c, σ_t) = c_skip · x + c_out · S′(c_in · x, c, σ_t),   (1)

where the weights c_skip, c_in, c_out are chosen such that Var(c_in (s + σ_t n)) = 1 and the regression target of S′ has unit variance.

Possibly the most potent improvement!
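A minimal sketch of this re-parameterization, following the weight formulas of [Karras2022]; setting σ_data = 1 assumes unit-variance clean speech, and `inner` stands in for the inner network S′.

```python
import math

def precond_weights(sigma_t, sigma_data=1.0):
    # EDM-style preconditioning weights [Karras2022], chosen so that the
    # inner network's input c_in * (s + sigma_t * n) and its regression
    # target both have unit variance. sigma_data = 1 assumes normalized
    # clean speech (an assumption for this sketch).
    denom = sigma_t ** 2 + sigma_data ** 2
    c_skip = sigma_data ** 2 / denom
    c_out = sigma_t * sigma_data / math.sqrt(denom)
    c_in = 1.0 / math.sqrt(denom)
    return c_skip, c_in, c_out

def score_net(x, c, sigma_t, inner):
    # Eq. (1): S(x, c, sigma_t) = c_skip * x + c_out * S'(c_in * x, c, sigma_t)
    c_skip, c_in, c_out = precond_weights(sigma_t)
    return c_skip * x + c_out * inner(c_in * x, c, sigma_t)
```

With these weights, Var(c_in · (s + σ_t n)) = (σ_data² + σ_t²) / (σ_t² + σ_data²) = 1, matching the condition on the slide.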

Slide 19

Slide 19 text

Improvement 1.2: Antialiasing

[Figure: downsampling in the encoder = conv + low-pass; upsampling in the decoder = low-pass + transposed conv, with residual connection; without antialiasing, high frequencies are not processed at the lower stages of the U-Net]

• Add antialiasing filters to all down/upsampling layers in the score network
• Detrimental in the condition network due to its lack of residual connections
• Originally proposed to reduce artifacts in GANs for images [Karras2021]
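A 1-D sketch of the idea: low-pass before decimating and after zero-stuffing. The filter length, window, and cutoff are assumptions for illustration; the actual model applies learned convolutions around these steps.

```python
import numpy as np

def lowpass_fir(num_taps=15, cutoff=0.5):
    # Windowed-sinc FIR low-pass; cutoff is a fraction of Nyquist.
    # Filter length and window choice are assumptions for illustration.
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = np.sinc(cutoff * n) * np.hamming(num_taps)
    return h / h.sum()

def downsample_aa(x, factor=2):
    # Low-pass before decimating so content above the new Nyquist
    # frequency does not alias into the retained band.
    return np.convolve(x, lowpass_fir(), mode="same")[::factor]

def upsample_aa(x, factor=2):
    # Zero-stuff, then low-pass to interpolate (removes spectral images).
    y = np.zeros(len(x) * factor)
    y[::factor] = x
    return factor * np.convolve(y, lowpass_fir(), mode="same")
```

In the score network these filters sit next to residual connections, which is why the same trick reportedly hurts in the condition network.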

Slide 20

Slide 20 text

Improvement 2: Adversarial Training with HiFi-GAN Loss

[Figure: same architecture as the original model, but the mixture density network losses are replaced by a mel-spectrogram L1 loss, adversarial loss, and feature matching loss computed with discriminators (real/fake); the score network keeps its MSE loss against the true score]

mixture density network → HiFi-GAN loss [Kong2020]

Slide 21

Slide 21 text

Improvement 3: Fine-tuning with Linguistic Content Loss

[Figure: noisy speech is enhanced with N−2 diffusion steps plus 2 conditioned steps (🔥 trainable); clean and enhanced speech are decoded by a frozen phoneme recognizer (❄); CTC loss on the phonemes, multi-resolution STFT loss on the signals]

• linguistic content loss
• multi-resolution spectrogram loss
• fine-tuning by low-rank adaptation [Hu2022]
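The multi-resolution spectrogram term can be sketched in plain NumPy; the (FFT size, hop) pairs are assumptions, and the CTC content term is omitted here because it requires the phoneme recognizer.

```python
import numpy as np

def stft_mag(x, n_fft, hop):
    # Magnitude STFT via Hann-windowed framed real FFT.
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def multires_stft_loss(ref, est,
                       resolutions=((512, 128), (1024, 256), (2048, 512))):
    # Mean L1 distance between magnitude spectrograms at several
    # (n_fft, hop) resolutions; the resolution set is an assumption.
    dists = [np.mean(np.abs(stft_mag(ref, n, h) - stft_mag(est, n, h)))
             for n, h in resolutions]
    return float(np.mean(dists))
```

During fine-tuning, this spectral term is combined with the CTC loss from the frozen recognizer, and only the low-rank adapters are updated.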

Slide 22

Slide 22 text

Experiments

Slide 23

Slide 23 text

Ablation Experiment Setup

Dataset: Voicebank-DEMAND [Valentini2016]
• speech from VCTK, noise from DEMAND
• sampling rate: 16 kHz

Training
• optimizer: AdamW
• learning rate: 10⁻⁶ → 10⁻⁴ → 10⁻⁶ (warmup + cosine schedule)
• 300 000 steps
• batch size: 40
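The learning-rate trajectory can be sketched as follows; the slide gives only the endpoints, so the warmup length is an assumption.

```python
import math

def lr_at_step(step, total_steps=300_000, warmup_steps=10_000,
               lr_min=1e-6, lr_max=1e-4):
    # Linear warmup lr_min -> lr_max, then cosine decay back to lr_min.
    # warmup_steps is an assumption; the slide states only the endpoints.
    if step < warmup_steps:
        return lr_min + (lr_max - lr_min) * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
```

This reproduces the 10⁻⁶ → 10⁻⁴ → 10⁻⁶ shape: the rate peaks at lr_max when warmup ends and returns to lr_min at the final step.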

Slide 24

Slide 24 text

Ablation Study

System        PESQ-WB   ESTOI   DNSMOS (OVRL)
Unprocessed   1.97      0.79    2.70
Original      2.86      0.85    3.16
+Network      2.93      0.86    3.18
+HiFi-GAN     2.98      0.86    3.20

Slide 25

Slide 25 text

Universal Speech Enhancement Experiment

Hyperparameters
• AdamW / 1 500 000 steps / batch size 40
• learning rate: 10⁻⁶ → 10⁻⁴ → 10⁻⁶ (warmup + cosine schedule)

Training Set (24 kHz)
• speech: 537 h
• noise: 601 h (environmental + BGM)
• distortions: reverberation, band limitation, equalization distortion, clipping, random attenuation, packet loss, codec distortion (MP3)

Test Sets (24 kHz)
• VBD-LP: Voicebank-DEMAND low-passed at 4 kHz
• Signal Improvement Challenge (various real distortions)

Slide 28

Slide 28 text

Results for VBD-LP (Denoising + Bandwidth Extension)

System        PESQ-WB   WER (%)   DNSMOS
Unprocessed   1.89      4.1       2.70
BSRNN         2.58      5.3       2.92
StoRM         1.62      6.8       2.79
UNIVERSE      2.36      10.3      3.11
UNIVERSE++    2.72      5.4       3.19
+finetune     2.57      3.9       3.19

BSRNN [Yu2023]: discriminative, MetricGAN (PESQ) loss
StoRM [Lemercier2023]: hybrid, discriminative denoising + diffusion refinement

Slide 29

Slide 29 text

Results for Signal Improvement Challenge

System        WER (%)   DNSMOS
Unprocessed   21.7      2.43
BSRNN         22.9      2.59
StoRM         25.5      2.53
UNIVERSE      28.5      2.86
UNIVERSE++    24.1      2.80
+finetune     21.8      2.93

BSRNN [Yu2023]: discriminative, MetricGAN (PESQ) loss
StoRM [Lemercier2023]: hybrid, discriminative denoising + diffusion refinement

Slide 30

Slide 30 text

Audio Samples

[Table of audio samples: Unprocessed, StoRM, BSRNN, UNIVERSE, UNIVERSE++ on SIG and BWE examples]

BSRNN [Yu2023]: discriminative, MetricGAN (PESQ) loss
StoRM [Lemercier2023]: hybrid, discriminative denoising + diffusion refinement

More samples: https://www.robinscheibler.org/interspeech2024-universepp-samples/

Slide 31

Slide 31 text

Like and Subscribe 👍

Slide 32

Slide 32 text

Conclusion

Summary
• UNIVERSE++: improved UNIVERSE model
  • several network improvements
  • HiFi-GAN loss
  • fine-tuning with a linguistic content loss
• enhancement for a wide range of conditions

Future Work
• improve intelligibility
• smaller model, shorter training, etc.

GitHub: line/open-universe

Slide 35

Slide 35 text

References 1/2

[Pascual2017] Pascual et al., "SEGAN: Speech Enhancement Generative Adversarial Network," Interspeech, 2017.
[Andreev2023] Andreev et al., "HiFi++: A Unified Framework for Bandwidth Extension and Speech Enhancement," ICASSP, 2023.
[Lu2022] Lu et al., "Conditional Diffusion Probabilistic Model for Speech Enhancement," ICASSP, 2022.
[Welker2022] Welker et al., "Speech Enhancement with Score-Based Generative Models in the Complex STFT Domain," Interspeech, 2022.
[Richter2022] Richter et al., "Speech Enhancement and Dereverberation with Diffusion-based Generative Models," TASLP, 2022.
[Lemercier2023] Lemercier et al., "StoRM: A Diffusion-Based Stochastic Regeneration Model for Speech Enhancement and Dereverberation," TASLP, 2023.
[Serra2022] Serra et al., "Universal Speech Enhancement with Score-based Diffusion," arXiv, 2022.

Slide 36

Slide 36 text

References 2/2

[Karras2022] Karras et al., "Elucidating the Design Space of Diffusion-Based Generative Models," NeurIPS, 2022.
[Karras2021] Karras et al., "Alias-Free Generative Adversarial Networks," NeurIPS, 2021.
[Kong2020] Kong et al., "HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis," NeurIPS, 2020.
[Hu2022] Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models," ICLR, 2022.
[Valentini2016] Valentini-Botinhao et al., "Speech Enhancement for a Noise-Robust Text-to-Speech Synthesis System Using Deep Recurrent Neural Networks," Interspeech, 2016.
[Yu2023] Yu et al., "High Fidelity Speech Enhancement with Band-Split RNN," Interspeech, 2023.

Slide 37

Slide 37 text

HiFi-GAN Loss [Kong2020]

• Adversarial training: originally proposed for TTS
• Multi-resolution and multi-period discriminators
• Feature matching: L1 distance between intermediate layers of the discriminators

[Figure: clean and synthesized speech pass through multi-resolution and multi-period discriminators; mel-spectrogram L1 loss, feature matching L1 losses, adversarial losses, and discriminator training losses]
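The feature matching term can be sketched as follows, assuming each discriminator exposes its intermediate activations as a list of arrays (the discriminators themselves are not shown).

```python
import numpy as np

def feature_matching_loss(feats_real, feats_fake):
    # Mean L1 distance between intermediate discriminator activations for
    # real (clean) and generated speech, averaged over layers, as in the
    # feature-matching term of [Kong2020].
    layer_l1 = [np.mean(np.abs(r - f)) for r, f in zip(feats_real, feats_fake)]
    return float(np.mean(layer_l1))
```

The full HiFi-GAN objective sums this term with the mel-spectrogram L1 loss and the adversarial losses from each discriminator.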