
Parallel WaveGAN: Fast and High-Quality GPU Text-to-Speech

LINE DevDay 2020

November 25, 2020


Transcript

  1. Ryuichi Yamamoto › Joined LINE in 2018 › Work location: Kyoto office › NAVER Green Factory › Kyoto office (now) › Speech synthesis R&D @ Voice team › Likes open-source software (GitHub: r9y9)
  2. Agenda › What is "TTS"? › Trade-off: Quality vs. Speed › About Parallel WaveGAN (ICASSP 2020)
  3. Overview of a TTS system: Text → Text processing → Linguistic features → Acoustic model → Vocoder (example utterance: [ a- ri^ ga- to- o- ])
  4. Overview of a TTS system: the Acoustic model converts linguistic features into acoustic features.
  5. Overview of a TTS system: the Vocoder converts acoustic features into a waveform.
  6. Overview of a TTS system: today's focus is the Neural Vocoder (acoustic features → waveform).
  7. WaveNet [1]: autoregressive modeling of raw audio, generating 16,000 samples / sec:
     $p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})$
     [1] A. van den Oord et al., "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.
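A minimal sketch of what the factorization above implies at synthesis time: samples must be drawn one at a time, each conditioned on all previous samples. The toy model below is a hypothetical stand-in, not the actual WaveNet architecture.

```python
# Toy autoregressive sampler: each sample x_t is drawn from p(x_t | x_1..x_{t-1}),
# so one second of 16 kHz audio needs 16,000 sequential steps.
import numpy as np

def toy_conditional(history: np.ndarray) -> tuple[float, float]:
    # Hypothetical p(x_t | x_<t); a real WaveNet models this with dilated
    # causal convolutions over the sample history.
    mean = 0.9 * history[-1] if history.size > 0 else 0.0
    return mean, 0.1  # (mean, std) of a Gaussian, purely for illustration

rng = np.random.default_rng(0)
samples: list[float] = []
for t in range(16_000):  # one second at 16 kHz, one sample per step
    mean, std = toy_conditional(np.asarray(samples))
    samples.append(rng.normal(mean, std))
```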
  8. WaveNet MOS results (https://deepmind.com/blog/article/wavenet-generative-model-raw-audio):
     US English: Concatenative 3.86 / Parametric 3.67 / WaveNet 4.21 / Human Speech 4.55
     Mandarin Chinese: Concatenative 3.47 / Parametric 3.79 / WaveNet 4.08 / Human Speech 4.21
  9. Trade-off: Quality vs. Speed
     Autoregressive Generation: WaveNet [1]. Non-autoregressive Generation: Parallel WaveNet [2], ClariNet [3].
     [Diagram: an autoregressive (AR) model generates samples x_1, …, x_T one at a time, each conditioned on the previous samples; a feed-forward model generates all samples {x_t}_{t=1}^{T} in parallel.]
     [1] A. van den Oord et al., "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016. [2] A. van den Oord et al., "Parallel WaveNet: Fast high-fidelity speech synthesis," in Proc. ICML, 2018. [3] W. Ping et al., "ClariNet: Parallel wave generation in end-to-end text-to-speech," in Proc. ICLR, 2019.
  10. Trade-off: Quality vs. Speed (continued) — Autoregressive Generation: WaveNet [1]; Non-autoregressive Generation: Parallel WaveNet [2], ClariNet [3].
  11. Trade-off: Quality vs. Speed (continued) — Autoregressive Generation: WaveNet [1]; Non-autoregressive Generation: Parallel WaveNet [2], ClariNet [3]. [Chart: example generation-speed figures, 180 vs. 0.03.]
  12. Trade-off: Quality vs. Speed — Autoregressive Generation: Speed ☹ / Quality ☺; Non-autoregressive Generation: Speed ☺ / Quality ☹.
  13. Trade-off: Quality vs. Speed — Autoregressive Generation: Speed ☹ / Quality ☺ / Training ☺; Non-autoregressive Generation: Speed ☺ / Quality ☹ / Training ☹.
  14. Trade-off: Quality vs. Speed — Autoregressive Generation: Speed ☹ / Quality ☺ / Training ☺; Our approach: Speed ☺ / Quality ☺ / Training ☺.
  15. Parallel WaveGAN [4]: Fast (efficient WaveNet), Parallel, and High-quality (GAN training with a multi-resolution STFT loss).
     [4] R. Yamamoto et al., "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram," in Proc. ICASSP, 2020, pp. 6199-6203.
  16. Parallel WaveGAN: WaveNet-based waveform generator.
     [Diagram: WaveNet [1] predicts the current sample x_t from the previous samples x_{1:t-1} (and then x_{t+1} from x_{1:t}), one sample at a time; the Parallel WaveGAN generator takes random noise as input and produces all waveform samples through input convolution, deep residual blocks, and output convolution in a single pass.]
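As a rough illustration of that diagram, here is a minimal, hypothetical PyTorch sketch of a non-causal, WaveNet-like generator that maps noise plus already upsampled acoustic features to a waveform in one forward pass. The class name, layer sizes, conditioning scheme, and residual-block structure are simplified assumptions, not the configuration from [4].

```python
import torch
import torch.nn as nn

class TinyParallelGenerator(nn.Module):
    def __init__(self, aux_channels=80, residual_channels=64, layers=8):
        super().__init__()
        self.input_conv = nn.Conv1d(1, residual_channels, kernel_size=1)
        self.cond_conv = nn.Conv1d(aux_channels, residual_channels, kernel_size=1)
        self.res_blocks = nn.ModuleList([
            # Non-causal dilated convolutions: padding on both sides keeps length.
            nn.Conv1d(residual_channels, residual_channels, kernel_size=3,
                      dilation=2 ** i, padding=2 ** i)
            for i in range(layers)
        ])
        self.output_conv = nn.Conv1d(residual_channels, 1, kernel_size=1)

    def forward(self, noise, conditioning):
        # noise: (B, 1, T); conditioning: (B, aux_channels, T), assumed to be
        # already upsampled to the waveform resolution.
        h = self.input_conv(noise) + self.cond_conv(conditioning)
        for conv in self.res_blocks:
            h = h + torch.tanh(conv(h))  # simplified residual connection
        return torch.tanh(self.output_conv(h))

# All waveform samples are produced in one parallel forward pass:
generator = TinyParallelGenerator()
waveform = generator(torch.randn(1, 1, 16000), torch.randn(1, 80, 16000))
```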
  17. Generative Adversarial Networks (GANs)
     $\min_G \max_D \; \mathbb{E}[\log D(x)] + \mathbb{E}[\log(1 - D(G(z)))]$
     The Discriminator is shown ground truth and generated objects (produced by the Generator from random noise) and learns to discriminate more correctly ("which image is real?"); the Generator learns to generate more realistic outputs.
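A minimal PyTorch sketch of that min-max objective, written as the two per-network losses that are minimized in practice. It assumes the discriminator outputs raw logits and uses the textbook binary cross-entropy form; the specific adversarial loss variant used in Parallel WaveGAN is not shown on this slide.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    # D maximizes log D(x) + log(1 - D(G(z))): push real outputs toward 1
    # and generated outputs toward 0.
    real_loss = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
    fake_loss = F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    return real_loss + fake_loss

def generator_adversarial_loss(d_fake: torch.Tensor) -> torch.Tensor:
    # G tries to make D classify its generated samples as real.
    return F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
```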
  18. Effects of an auxiliary loss for GANs
     [Figure: image-to-image translation examples from [5], comparing Ground Truth, adversarial loss only, and adversarial + auxiliary loss.]
     [5] J. Zhu et al., "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proc. ICCV, 2017, pp. 2223-2232.
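In Parallel WaveGAN the auxiliary loss is the multi-resolution STFT loss defined on the next slides, and the generator minimizes it together with the adversarial loss as a weighted sum. The notation below follows [4]; the weight $\lambda_{\mathrm{adv}}$ is a hyperparameter whose value is not given on these slides:

$L_G(G, D) = L_{\mathrm{aux}}(G) + \lambda_{\mathrm{adv}} \, L_{\mathrm{adv}}(G, D)$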
  19. STFT loss (1/2): spectral convergence (SC) [6]
     $L_{\mathrm{sc}} = \frac{\bigl\| \, |\mathrm{STFT}(x)| - |\mathrm{STFT}(\hat{x})| \, \bigr\|_F}{\bigl\| \, |\mathrm{STFT}(x)| \, \bigr\|_F}$
     [6] S. O. Arik et al., "Fast spectrogram inversion using multi-head convolutional neural networks," IEEE Signal Process. Lett., 2019.
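A minimal PyTorch sketch of the spectral convergence loss, assuming 1-D (or batched) waveform tensors x (natural) and x_hat (generated). The STFT analysis parameters here are illustrative defaults, not the paper's settings.

```python
import torch

def spectral_convergence_loss(x: torch.Tensor, x_hat: torch.Tensor,
                              n_fft: int = 1024, hop: int = 120,
                              win: int = 600) -> torch.Tensor:
    window = torch.hann_window(win)
    mag = torch.stft(x, n_fft, hop_length=hop, win_length=win,
                     window=window, return_complex=True).abs()
    mag_hat = torch.stft(x_hat, n_fft, hop_length=hop, win_length=win,
                         window=window, return_complex=True).abs()
    # Frobenius norm of the magnitude difference, normalized by the reference.
    return torch.norm(mag - mag_hat, p="fro") / torch.norm(mag, p="fro")
```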
  20. STFT loss (2/2): log-scale STFT magnitude loss [6]
     $L_{\mathrm{mag}} = \frac{1}{N} \bigl\| \log|\mathrm{STFT}(x)| - \log|\mathrm{STFT}(\hat{x})| \bigr\|_1$, where $N$ is the number of elements in the STFT magnitude.
     [6] S. O. Arik et al., "Fast spectrogram inversion using multi-head convolutional neural networks," IEEE Signal Process. Lett., 2019.
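A matching sketch of the log-scale STFT magnitude loss; the small eps added before the log is an implementation detail assumed here to keep the loss finite.

```python
import torch
import torch.nn.functional as F

def log_stft_magnitude_loss(x: torch.Tensor, x_hat: torch.Tensor,
                            n_fft: int = 1024, hop: int = 120,
                            win: int = 600, eps: float = 1e-7) -> torch.Tensor:
    window = torch.hann_window(win)
    mag = torch.stft(x, n_fft, hop_length=hop, win_length=win,
                     window=window, return_complex=True).abs()
    mag_hat = torch.stft(x_hat, n_fft, hop_length=hop, win_length=win,
                         window=window, return_complex=True).abs()
    # F.l1_loss averages the absolute error over all N magnitude elements.
    return F.l1_loss(torch.log(mag_hat + eps), torch.log(mag + eps))
```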
  21. Multi-resolution STFT loss: combine M single-resolution STFT losses computed with different analysis settings (FFT size / window size / shift):
     512 / 240 / 50 (higher temporal resolution), 1024 / 600 / 120 (balanced), 2048 / 1200 / 240 (higher frequency resolution).
     $L_s^{(m)}(G) = \mathbb{E}_{z \sim p_z,\, x \sim p_{\mathrm{data}}}\bigl[ L_{\mathrm{sc}}(x, \hat{x}) + L_{\mathrm{mag}}(x, \hat{x}) \bigr]$, $\qquad L_{\mathrm{aux}}(G) = \frac{1}{M} \sum_{m=1}^{M} L_s^{(m)}(G)$, where $M$ is the number of STFT losses.
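A minimal sketch combining the two losses above over the three analysis settings listed on the slide (M = 3). It reuses spectral_convergence_loss and log_stft_magnitude_loss from the previous sketches.

```python
import torch

# (FFT size, window size, frame shift), as listed on the slide.
RESOLUTIONS = [(512, 240, 50), (1024, 600, 120), (2048, 1200, 240)]

def multi_resolution_stft_loss(x: torch.Tensor, x_hat: torch.Tensor) -> torch.Tensor:
    # Average the single-resolution loss (SC + log-magnitude) over M settings.
    total = 0.0
    for n_fft, win, hop in RESOLUTIONS:
        total = total + spectral_convergence_loss(x, x_hat, n_fft=n_fft, hop=hop, win=win) \
                      + log_stft_magnitude_loss(x, x_hat, n_fft=n_fft, hop=hop, win=win)
    return total / len(RESOLUTIONS)
```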
  22. Parallel WaveGAN: training process.
     [Diagram: the Generator maps random noise z and acoustic features to generated speech; the multi-resolution STFT losses (1st, 2nd, …, Mth) compare generated speech against natural speech; the Discriminator judges "is real?", yielding the adversarial loss for the generator's parameter update and the discriminator loss for the discriminator's parameter update.]
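Putting the pieces together, here is a minimal, hypothetical sketch of one training step in this style. The `generator`, `discriminator`, optimizers, data shapes, and the value of `lambda_adv` are assumptions of this sketch; the actual schedule, loss weighting, and discriminator design follow [4], not this toy loop.

```python
import torch

def train_step(generator, discriminator, g_opt, d_opt, noise, features, speech):
    # noise: (B, 1, T); features: (B, aux_channels, T); speech: (B, T).
    lambda_adv = 4.0  # adversarial loss weight; an illustrative hyperparameter

    # --- Generator update: multi-resolution STFT loss + adversarial loss ---
    fake = generator(noise, features)                       # (B, 1, T)
    aux_loss = multi_resolution_stft_loss(speech, fake.squeeze(1))
    adv_loss = generator_adversarial_loss(discriminator(fake))
    g_loss = aux_loss + lambda_adv * adv_loss
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()

    # --- Discriminator update: natural speech vs. detached generated speech ---
    d_real = discriminator(speech.unsqueeze(1))
    d_fake = discriminator(fake.detach())
    d_loss = discriminator_loss(d_real, d_fake)
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()
    return g_loss.item(), d_loss.item()
```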
  23. Performance Evaluations
     [Recap of the TTS pipeline: Text → Text processing → Linguistic features → Acoustic model → Acoustic features → Neural Vocoder → Waveform; example: [ a- ri^ ga- to- o- ].]
  24. Evaluations: baseline systems — WaveNet [1] (autoregressive) and ClariNet [3] (non-autoregressive, feed-forward).
     [1] A. van den Oord et al., "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016. [3] W. Ping et al., "ClariNet: Parallel wave generation in end-to-end text-to-speech," in Proc. ICLR, 2019.
  25. Effects of the multi-resolution STFT loss (MOS): Parallel WaveGAN with a single STFT loss 1.36, with three STFT losses 4.06, Reference 4.46. Using the multi-resolution STFT loss significantly improved perceptual quality.
  26. Performance Evaluations
     [Recap of the TTS pipeline, as on slide 23.]
  27. Naturalness evaluation in TTS (MOS): WaveNet 3.33, ClariNet 4.00, Parallel WaveGAN (ours) 4.16, Reference 4.46. Our model achieved a 4.16 MOS, competitive with ClariNet [3] = High Quality ☺
  28. Related work (timeline): 2019.10.8 MelGAN [7] › 2019.10.25 Parallel WaveGAN [4] › 2020.5.11 Multi-band MelGAN [8] › 2020.5.18 Quasi-Periodic Parallel WaveGAN [11] › 2020.7.8 FastSpeech 2s [12] › 2020.7.30 VocGAN [9] › 2020.9.3 HiFiSinger [13] › 2020.10.12 HiFi-GAN [10]
  29. Conclusion: High Quality ☺ and Fast ☺
     Naturalness (MOS): WaveNet 3.33, ClariNet 4.00, Parallel WaveGAN (ours) 4.16, Reference 4.46.
     [Bar charts: speed/training comparisons — Parallel WaveGAN vs. ClariNet (2.8 vs. 12.7) and Parallel WaveGAN vs. WaveNet (0.03 vs. 321).]
  30. References
     [1] A. van den Oord et al., "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.
     [2] A. van den Oord et al., "Parallel WaveNet: Fast high-fidelity speech synthesis," in Proc. ICML, 2018.
     [3] W. Ping et al., "ClariNet: Parallel wave generation in end-to-end text-to-speech," in Proc. ICLR, 2019.
     [4] R. Yamamoto et al., "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram," in Proc. ICASSP, 2020, pp. 6199-6203.
     [5] J. Zhu et al., "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proc. ICCV, 2017, pp. 2223-2232.
     [6] S. O. Arik et al., "Fast spectrogram inversion using multi-head convolutional neural networks," IEEE Signal Process. Lett., 2019.
     [7] K. Kumar et al., "MelGAN: Generative adversarial networks for conditional waveform synthesis," in Proc. NeurIPS, 2019, pp. 14881-14892.
     [8] G. Yang et al., "Multi-band MelGAN: Faster waveform generation for high-quality text-to-speech," arXiv preprint arXiv:2005.05106, 2020.
     [9] J. Yang et al., "VocGAN: A high-fidelity real-time vocoder with a hierarchically-nested adversarial network," in Proc. INTERSPEECH, 2020.
     [10] J. Kong et al., "HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis," in Proc. NeurIPS, 2020.
     [11] Y.-C. Wu et al., "Quasi-Periodic Parallel WaveGAN vocoder: A non-autoregressive pitch-dependent dilated convolution model for parametric speech generation," in Proc. INTERSPEECH, 2020.
     [12] Y. Ren et al., "FastSpeech 2: Fast and high-quality end-to-end text to speech," arXiv preprint arXiv:2006.04558, 2020.
     [13] J. Chen et al., "HiFiSinger: Towards high-fidelity neural singing voice synthesis," arXiv preprint arXiv:2009.01776, 2020.
     Icon made by Pixel perfect from www.flaticon.com