Slide 1

Slide 1 text

No content

Slide 2

Slide 2 text

Ryuichi Yamamoto › Joined LINE in 2018 › Work location › Kyoto office › NAVER Green Factory › Kyoto office (now) › Speech synthesis R&D @ Voice team › Likes open-source software (GitHub: r9y9)

Slide 3

Slide 3 text

Agenda › What is “TTS”? › Trade-off: Quality vs. Speed › About Parallel WaveGAN (ICASSP 2020)

Slide 4

Slide 4 text

What is “TTS”? Text to Speech: Text (“Hello”) → Speech

Slide 5

Slide 5 text

TTS in LINE

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

LINE AiCall: Order (speech) → ASR → TTS → Answer (speech)

Slide 8

Slide 8 text

Overview of a TTS system: Text processing → Acoustic model → Vocoder

Slide 9

Slide 9 text

Overview of a TTS system: Text → Text processing → Linguistic features [ a- ri^ ga- to- o- ] → Acoustic model → Vocoder

Slide 10

Slide 10 text

Overview of a TTS system: Text processing → Linguistic features [ a- ri^ ga- to- o- ] → Acoustic model → Acoustic features → Vocoder

Slide 11

Slide 11 text

Overview of a TTS system: Text processing → Linguistic features [ a- ri^ ga- to- o- ] → Acoustic model → Acoustic features → Vocoder → Waveform

Slide 12

Slide 12 text

Overview of a TTS system: Text processing → Linguistic features [ a- ri^ ga- to- o- ] → Acoustic model → Acoustic features → Neural Vocoder → Waveform. Today's focus: the Neural Vocoder.

Slide 13

Slide 13 text

WaveNet: autoregressive modeling, p(x) = ∏_{t=1}^{T} p(x_t | x_1, …, x_{t-1}), generating 16,000 samples / sec. [1] A. van den Oord et al., “WaveNet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016.
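The factorization above makes generation strictly sequential: every sample is predicted from all previous ones, so one second of audio needs 16,000 model evaluations in a row. A minimal sketch of that loop, where `predict_next` is a hypothetical stand-in for the WaveNet network:

```python
import numpy as np

def sample_autoregressive(predict_next, num_samples, seed=0):
    """Draw x_1..x_T one at a time: x_t = predict_next(x_1..x_{t-1})."""
    rng = np.random.default_rng(seed)
    history = []
    for _ in range(num_samples):
        history.append(predict_next(history, rng))
    return np.asarray(history)

# Toy stand-in for the conditional p(x_t | x_1..x_{t-1}):
# a noisy first-order decay of the previous sample.
def toy_model(history, rng):
    prev = history[-1] if history else 1.0
    return 0.9 * prev + 0.01 * rng.standard_normal()

waveform = sample_autoregressive(toy_model, 16000)  # one "second" at 16 kHz
```

The point of the sketch is the loop itself: each step must wait for the previous one, which is why autoregressive vocoders are slow at inference.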

Slide 14

Slide 14 text

WaveNet MOS results (https://deepmind.com/blog/article/wavenet-generative-model-raw-audio):
US English — Concatenative 3.86, Parametric 3.67, WaveNet 4.21, Human Speech 4.55
Mandarin Chinese — Concatenative 3.47, Parametric 3.79, WaveNet 4.08, Human Speech 4.21

Slide 15

Slide 15 text

Trade-off: Quality vs. Speed. Autoregressive Generation (WaveNet [1]): each sample x_t is generated sequentially from x_{1:t-1}. Non-autoregressive Generation (Parallel WaveNet [2], ClariNet [3]): a feed-forward model generates all samples {x_t}_{t=1}^{T} in parallel. [1] A. van den Oord et al., “WaveNet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016. [2] A. van den Oord et al., “Parallel WaveNet: Fast high-fidelity speech synthesis,” in Proc. ICML, 2018. [3] W. Ping et al., “ClariNet: Parallel wave generation in end-to-end text-to-speech,” in Proc. ICLR, 2019.

Slide 16

Slide 16 text

Trade-off: Quality vs. Speed. Autoregressive Generation: WaveNet [1]. Non-autoregressive Generation: Parallel WaveNet [2], ClariNet [3]. [1] A. van den Oord et al., “WaveNet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016. [2] A. van den Oord et al., “Parallel WaveNet: Fast high-fidelity speech synthesis,” in Proc. ICML, 2018. [3] W. Ping et al., “ClariNet: Parallel wave generation in end-to-end text-to-speech,” in Proc. ICLR, 2019.

Slide 17

Slide 17 text

Trade-off: Quality vs. Speed. Autoregressive Generation (WaveNet [1]) vs. Non-autoregressive Generation (Parallel WaveNet [2], ClariNet [3]): generation speed 180 vs. 0.03 (real-time factor). [1] A. van den Oord et al., “WaveNet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016. [2] A. van den Oord et al., “Parallel WaveNet: Fast high-fidelity speech synthesis,” in Proc. ICML, 2018. [3] W. Ping et al., “ClariNet: Parallel wave generation in end-to-end text-to-speech,” in Proc. ICLR, 2019.

Slide 18

Slide 18 text

Trade-off: Quality vs. Speed — Autoregressive Generation vs. Non-autoregressive Generation: Speed ☹ / ☺

Slide 19

Slide 19 text

Trade-off: Quality vs. Speed — Autoregressive Generation vs. Non-autoregressive Generation: Speed ☹ / ☺, Quality ☺ / ☹

Slide 20

Slide 20 text

Trade-off: Quality vs. Speed — Autoregressive Generation vs. Non-autoregressive Generation: Speed ☹ / ☺, Quality ☺ / ☹, Training ☺ / ☹

Slide 21

Slide 21 text

Trade-off: Quality vs. Speed — Autoregressive Generation vs. Our approach: Speed ☹ / ☺, Quality ☺ / ☺, Training ☺ / ☺

Slide 22

Slide 22 text

Parallel WaveGAN: fast, parallel, high-quality waveform generation — an efficient WaveNet generator trained with a GAN and a multi-resolution STFT loss. [4] R. Yamamoto et al., “Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” in Proc. ICASSP, 2020, pp. 6199–6203.

Slide 23

Slide 23 text

Agenda › About Parallel WaveGAN (ICASSP 2020) › Technical details

Slide 24

Slide 24 text

Parallel WaveGAN: WaveNet-based waveform generator. WaveNet [1]: previous samples x_{1:t-1} → current sample x_t (autoregressive, sample by sample). Parallel WaveGAN: random noise input → input convolution → deep residual blocks → output convolution → waveform samples (all in parallel).

Slide 25

Slide 25 text

Generative Adversarial Networks (GANs): min_G max_D E[log D(x)] + E[log(1 − D(G(z)))]. Generator: Random Noise → Generated object; learns to generate more realistic samples. Discriminator: compares the Generated object with the Ground Truth (“Which image is real?”); learns to discriminate more accurately.
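In practice the minimax objective above is split into two losses, one per network. A small numeric sketch (the score arrays are illustrative stand-ins for discriminator outputs, not the paper's networks):

```python
import numpy as np

def discriminator_loss(d_real, d_fake, eps=1e-7):
    """D learns to discriminate: maximize log D(x) + log(1 - D(G(z))).
    Returned negated, as a loss to minimize."""
    return -np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps))

def generator_loss(d_fake, eps=1e-7):
    """G learns to fool D; the common non-saturating form maximizes log D(G(z))."""
    return -np.mean(np.log(d_fake + eps))

# A confident, correct discriminator gives a low D-loss and a high G-loss.
d_real = np.array([0.90, 0.95])  # D's scores on ground-truth samples
d_fake = np.array([0.10, 0.05])  # D's scores on generated samples
```

When the generator improves, `d_fake` rises toward 1, driving the generator loss down and the discriminator loss up — the two players pull in opposite directions.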

Slide 26

Slide 26 text

Effects of an auxiliary loss for GANs: Adversarial loss alone vs. Adversarial loss + Auxiliary loss vs. Ground Truth. [5] J. Zhu et al., “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proc. ICCV, 2017, pp. 2223–2232.

Slide 27

Slide 27 text

Short-time Fourier transform (STFT): waveform (time × magnitude) → spectrogram (time × frequency, magnitude in dB)
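A spectrogram like this is computed by framing the waveform, windowing each frame, and taking FFT magnitudes. A minimal NumPy version (parameter defaults are illustrative, not the slide's settings):

```python
import numpy as np

def stft_magnitude(x, fft_size=1024, hop=256, win_size=None):
    """|STFT(x)|: one row per frame, one column per frequency bin."""
    win_size = win_size or fft_size
    window = np.hanning(win_size)
    frames = np.stack([x[i:i + win_size] * window
                       for i in range(0, len(x) - win_size + 1, hop)])
    return np.abs(np.fft.rfft(frames, n=fft_size, axis=-1))

# A 440 Hz tone at 16 kHz: energy concentrates near bin 440/16000*1024 ≈ 28.
sr = 16000
t = np.arange(sr) / sr
mag = stft_magnitude(np.sin(2 * np.pi * 440 * t))
```

With a real input, `rfft` returns fft_size/2 + 1 bins per frame, which is why spectrograms of 1024-point FFTs have 513 frequency rows.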

Slide 28

Slide 28 text

STFT loss (1/2): spectral convergence (SC)
L_sc = ‖ |STFT(x)| − |STFT(x̂)| ‖_F / ‖ |STFT(x)| ‖_F
[6] S. Ö. Arık et al., “Fast spectrogram inversion using multi-head convolutional neural networks,” IEEE Signal Process. Letters, 2019.
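The spectral convergence term is just the Frobenius-norm relative error between the two magnitude spectrograms. A direct transcription (the random spectrogram at the end is only there to exercise the function):

```python
import numpy as np

def spectral_convergence_loss(mag_x, mag_x_hat):
    """L_sc = || |STFT(x)| - |STFT(x_hat)| ||_F / || |STFT(x)| ||_F."""
    return np.linalg.norm(mag_x - mag_x_hat) / np.linalg.norm(mag_x)

# Dummy magnitude spectrogram (frames x frequency bins) for a quick check.
mag_x = np.abs(np.random.default_rng(0).standard_normal((100, 513)))
```

Because the error is normalized by the target's energy, the loss emphasizes the largest spectral components, which the paper notes helps especially in early training.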

Slide 29

Slide 29 text

STFT loss (2/2): log-scale STFT magnitude loss
L_mag = (1/N) ‖ log|STFT(x)| − log|STFT(x̂)| ‖_1, where N is the number of elements in the STFT magnitude.
[6] S. Ö. Arık et al., “Fast spectrogram inversion using multi-head convolutional neural networks,” IEEE Signal Process. Letters, 2019.
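The second term is an L1 distance between log magnitudes, averaged over the N spectrogram elements. A sketch (the eps floor is an assumption to keep log(0) finite, not something the slide specifies):

```python
import numpy as np

def log_stft_magnitude_loss(mag_x, mag_x_hat, eps=1e-7):
    """L_mag = (1/N) * || log|STFT(x)| - log|STFT(x_hat)| ||_1,
    averaged over all N spectrogram elements."""
    return np.mean(np.abs(np.log(mag_x + eps) - np.log(mag_x_hat + eps)))

# Scaling every magnitude by e shifts the log magnitudes by exactly 1,
# so the loss on (mag, e * mag) is ~1.
mag = np.full((4, 5), 2.0)
```

Working in the log domain makes the loss sensitive to low-energy regions of the spectrum, complementing the energy-weighted spectral convergence term.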

Slide 30

Slide 30 text

Multi-resolution STFT loss
L_aux(G) = (1/M) Σ_{m=1}^{M} L_s^{(m)}(G), with L_s(G) = E_{z∼p(z), x∼p_data}[ L_sc(x, x̂) + L_mag(x, x̂) ], where M is the number of STFT losses.
FFT size / window size / shift: 512 / 240 / 50 (higher temporal resolution), 1024 / 600 / 120 (balanced), 2048 / 1200 / 240 (higher frequency resolution).
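Combining the two terms at the three resolutions above gives the full auxiliary loss. A self-contained sketch, assuming the slide's (FFT size, window size, shift) triples; the `_stft_mag` helper is a plain NumPy stand-in for a library STFT:

```python
import numpy as np

# (FFT size, window size, hop) triples from the slide.
RESOLUTIONS = [(512, 240, 50), (1024, 600, 120), (2048, 1200, 240)]

def _stft_mag(x, fft_size, win_size, hop):
    window = np.hanning(win_size)
    frames = np.stack([x[i:i + win_size] * window
                       for i in range(0, len(x) - win_size + 1, hop)])
    return np.abs(np.fft.rfft(frames, n=fft_size, axis=-1))

def multi_resolution_stft_loss(x, x_hat, eps=1e-7):
    """L_aux = (1/M) * sum over M resolutions of (L_sc + L_mag)."""
    total = 0.0
    for fft_size, win_size, hop in RESOLUTIONS:
        m, m_hat = (_stft_mag(s, fft_size, win_size, hop) for s in (x, x_hat))
        l_sc = np.linalg.norm(m - m_hat) / np.linalg.norm(m)            # spectral convergence
        l_mag = np.mean(np.abs(np.log(m + eps) - np.log(m_hat + eps)))  # log magnitude L1
        total += l_sc + l_mag
    return total / len(RESOLUTIONS)

x = np.random.default_rng(0).standard_normal(8000)
```

Using several resolutions keeps the generator from overfitting one fixed time-frequency trade-off: errors invisible at one window size show up at another.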

Slide 31

Slide 31 text

Parallel WaveGAN: training process. The Generator maps random noise z and acoustic features to generated speech x̂; generated speech x̂ and natural speech x are compared with M multi-resolution STFT losses (1st, 2nd, …, Mth), while the Discriminator judges whether x̂ is real, yielding the adversarial loss. The Generator parameters are updated with the STFT + adversarial losses; the Discriminator parameters are updated with the discriminator loss.
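One generator update in this diagram combines the auxiliary and adversarial terms, L_G = L_aux + λ_adv · L_adv. A toy sketch of that combination — the λ_adv value, the mean-squared placeholder for L_aux, and the stand-in generator/discriminator are all assumptions, not the paper's networks:

```python
import numpy as np

LAMBDA_ADV = 4.0  # adversarial weight (assumed; see the paper for the actual value)

# Toy stand-ins: the real generator/discriminator are WaveNet-style convnets.
def generator(z):
    return 0.5 * z  # maps noise (plus conditioning, omitted here) to a "waveform"

def discriminator(x):
    return 1.0 / (1.0 + np.exp(-np.tanh(x).mean()))  # scalar realness in (0, 1)

def generator_step_loss(z, x_natural, eps=1e-7):
    x_hat = generator(z)
    # Placeholder auxiliary loss; the paper uses the multi-resolution STFT loss.
    l_aux = np.mean((x_hat - x_natural) ** 2)
    l_adv = -np.log(discriminator(x_hat) + eps)  # fool D: push D(x_hat) toward 1
    return l_aux + LAMBDA_ADV * l_adv

rng = np.random.default_rng(0)
loss = generator_step_loss(rng.standard_normal(1000), rng.standard_normal(1000))
```

The auxiliary loss anchors the generator to the target spectrum from the first step, which is what lets this GAN train stably without a distilled teacher.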

Slide 32

Slide 32 text

Agenda › About Parallel WaveGAN (ICASSP 2020) › Performance Evaluations

Slide 33

Slide 33 text

Performance Evaluations. TTS pipeline: Text → Text processing → Linguistic features [ a- ri^ ga- to- o- ] → Acoustic model → Acoustic features → Neural Vocoder → Waveform.

Slide 34

Slide 34 text

Evaluations: baseline systems — WaveNet [1] (autoregressive: each sample generated from the previous ones) and ClariNet [3] (non-autoregressive: a feed-forward model generates all samples in parallel). [1] A. van den Oord et al., “WaveNet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016. [3] W. Ping et al., “ClariNet: Parallel wave generation in end-to-end text-to-speech,” in Proc. ICLR, 2019.

Slide 35

Slide 35 text

Evaluations: database — training / validation / test: 11,449 / 250 / 250 utterances (23 / 0.5 / 0.5 hours)

Slide 36

Slide 36 text

Effects of multi-resolution STFT loss (MOS): Parallel WaveGAN with a single STFT loss 1.36, with three STFT losses 4.06, Reference 4.46. Using the multi-resolution STFT loss significantly improved perceptual quality.

Slide 37

Slide 37 text

Speed: inference time (real-time factor) — WaveNet 321 vs. Parallel WaveGAN 0.03: ×10,000 faster ☺

Slide 38

Slide 38 text

Speed: training time (days) — WaveNet 7.4, ClariNet 12.7, Parallel WaveGAN 2.8: ×4 faster than ClariNet (non-autoregressive vs. non-autoregressive) ☺

Slide 39

Slide 39 text

Performance Evaluations. TTS pipeline: Text → Text processing → Linguistic features [ a- ri^ ga- to- o- ] → Acoustic model → Acoustic features → Neural Vocoder → Waveform.

Slide 40

Slide 40 text

Naturalness evaluation in TTS (MOS): WaveNet 3.33, ClariNet 4.00, Parallel WaveGAN (Ours) 4.16, Reference 4.46. Our model achieved a 4.16 MOS, competitive with ClariNet [3] — high quality ☺

Slide 41

Slide 41 text

New TTS sample

Slide 42

Slide 42 text

Related work:
2019.10.8 MelGAN [7]
2019.10.25 Parallel WaveGAN [4]
2020.5.11 Multi-band MelGAN [8]
2020.5.18 Quasi-Periodic Parallel WaveGAN [11]
2020.7.8 FastSpeech 2s [12]
2020.7.30 VocGAN [9]
2020.9.3 HiFiSinger [13]
2020.10.12 HiFi-GAN [10]

Slide 43

Slide 43 text

Conclusion
High quality ☺ — MOS: WaveNet 3.33, ClariNet 4.00, Parallel WaveGAN (Ours) 4.16, Reference 4.46.
Fast ☺ — Inference (real-time factor): WaveNet 321 vs. Parallel WaveGAN 0.03. Training (days): ClariNet 12.7 vs. Parallel WaveGAN 2.8.

Slide 44

Slide 44 text

Thank you

Slide 45

Slide 45 text

References [1] A. van den Oord et al., “WaveNet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016. [2] A. van den Oord et al., “Parallel WaveNet: Fast high-fidelity speech synthesis,” in Proc. ICML, 2018. [3] W. Ping et al., “ClariNet: Parallel wave generation in end-to-end text-to-speech,” in Proc. ICLR, 2019. [4] R. Yamamoto et al., “Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” in Proc. ICASSP, 2020, pp. 6199–6203. [5] J. Zhu et al., “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proc. ICCV, 2017, pp. 2223–2232. [6] S. Ö. Arık et al., “Fast spectrogram inversion using multi-head convolutional neural networks,” IEEE Signal Process. Letters, 2019. [7] K. Kumar et al., “MelGAN: Generative adversarial networks for conditional waveform synthesis,” in Proc. NeurIPS, 2019, pp. 14881–14892. [8] Y. Geng et al., “Multi-band MelGAN: Faster waveform generation for high-quality text-to-speech,” arXiv preprint arXiv:2005.05106, 2020. [9] J. Yang et al., “VocGAN: A high-fidelity real-time vocoder with a hierarchically-nested adversarial network,” in Proc. INTERSPEECH, 2020. [10] J. Kong et al., “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,” in Proc. NeurIPS, 2020. [11] Y.-C. Wu et al., “Quasi-Periodic Parallel WaveGAN vocoder: A non-autoregressive pitch-dependent dilated convolution model for parametric speech generation,” in Proc. INTERSPEECH, 2020. [12] Y. Ren et al., “FastSpeech 2: Fast and high-quality end-to-end text to speech,” arXiv preprint arXiv:2006.04558, 2020. [13] J. Chen et al., “HiFiSinger: Towards high-fidelity neural singing voice synthesis,” arXiv preprint arXiv:2009.01776, 2020. Icon made by Pixel perfect from www.flaticon.com