
Parallel WaveGAN: Fast and High-Quality GPU Text-to-Speech

LINE DevDay 2020

November 25, 2020


Transcript

  1. Ryuichi Yamamoto › Joined LINE in 2018 › Work location: Kyoto office › NAVER Green Factory › Kyoto office (now) › Speech synthesis R&D @ Voice team › Likes open-source software (GitHub: r9y9)
  2. Agenda › What is "TTS"? › Trade-off: Quality vs. Speed › About Parallel WaveGAN (ICASSP 2020)
  3. Overview of a TTS system: Text → Text processing → Linguistic features → Acoustic model → Vocoder (example utterance: [ a- ri^ ga- to- o- ])
  4. Overview of a TTS system: the Acoustic model converts linguistic features into acoustic features.
  5. Overview of a TTS system: the Vocoder converts acoustic features into a waveform.
  6. Overview of a TTS system: today's focus is the Neural Vocoder (acoustic features → waveform).
  7. WaveNet [1]: autoregressive modeling of raw audio, generating 16,000 samples / sec:
     $p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})$
     [1] A. van den Oord et al., "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.
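A minimal sketch of what the factorization above implies at synthesis time: samples must be drawn one at a time, each conditioned on all previous samples. The toy model below is a hypothetical stand-in, not the actual WaveNet architecture.

```python
# Toy autoregressive sampler: each sample x_t is drawn from p(x_t | x_1..x_{t-1}),
# so one second of 16 kHz audio needs 16,000 sequential steps.
import numpy as np

def toy_conditional(history: np.ndarray) -> tuple[float, float]:
    # Hypothetical p(x_t | x_<t); a real WaveNet models this with dilated
    # causal convolutions over the sample history.
    mean = 0.9 * history[-1] if history.size > 0 else 0.0
    return mean, 0.1  # (mean, std) of a Gaussian, purely for illustration

rng = np.random.default_rng(0)
samples: list[float] = []
for t in range(16_000):  # one second at 16 kHz, one sample per step
    mean, std = toy_conditional(np.asarray(samples))
    samples.append(rng.normal(mean, std))
```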
  8. WaveNet MOS results (https://deepmind.com/blog/article/wavenet-generative-model-raw-audio):
     US English: Concatenative 3.86 / Parametric 3.67 / WaveNet 4.21 / Human Speech 4.55
     Mandarin Chinese: Concatenative 3.47 / Parametric 3.79 / WaveNet 4.08 / Human Speech 4.21
  9. Trade-off: Quality vs. Speed
     Autoregressive Generation: WaveNet [1]. Non-autoregressive Generation: Parallel WaveNet [2], ClariNet [3].
     [Diagram: an autoregressive (AR) model generates samples x_1, …, x_T one at a time, each conditioned on the previous samples; a feed-forward model generates all samples {x_t}_{t=1}^{T} in parallel.]
     [1] A. van den Oord et al., "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016. [2] A. van den Oord et al., "Parallel WaveNet: Fast high-fidelity speech synthesis," in Proc. ICML, 2018. [3] W. Ping et al., "ClariNet: Parallel wave generation in end-to-end text-to-speech," in Proc. ICLR, 2019.
  10. Trade-off: Quality vs. Speed (continued) — Autoregressive Generation: WaveNet [1]; Non-autoregressive Generation: Parallel WaveNet [2], ClariNet [3].
  11. Trade-off: Quality vs. Speed (continued) — Autoregressive Generation: WaveNet [1]; Non-autoregressive Generation: Parallel WaveNet [2], ClariNet [3]. [Chart: example generation-speed figures, 180 vs. 0.03.]
  12. Trade-off: Quality vs. Speed — Autoregressive Generation: Speed ☹ / Quality ☺; Non-autoregressive Generation: Speed ☺ / Quality ☹.
  13. Trade-off: Quality vs. Speed — Autoregressive Generation: Speed ☹ / Quality ☺ / Training ☺; Non-autoregressive Generation: Speed ☺ / Quality ☹ / Training ☹.
  14. Trade-off: Quality vs. Speed — Autoregressive Generation: Speed ☹ / Quality ☺ / Training ☺; Our approach: Speed ☺ / Quality ☺ / Training ☺.
  15. Parallel WaveGAN [4]: Fast (efficient WaveNet), Parallel, and High-quality (GAN training with a multi-resolution STFT loss).
     [4] R. Yamamoto et al., "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram," in Proc. ICASSP, 2020, pp. 6199-6203.
  16. Parallel WaveGAN: WaveNet-based waveform generator.
     [Diagram: WaveNet [1] predicts the current sample x_t from the previous samples x_{1:t-1} (and then x_{t+1} from x_{1:t}), one sample at a time; the Parallel WaveGAN generator takes random noise as input and produces all waveform samples through input convolution, deep residual blocks, and output convolution in a single pass.]
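As a rough illustration of that diagram, here is a minimal, hypothetical PyTorch sketch of a non-causal, WaveNet-like generator that maps noise plus already upsampled acoustic features to a waveform in one forward pass. The class name, layer sizes, conditioning scheme, and residual-block structure are simplified assumptions, not the configuration from [4].

```python
import torch
import torch.nn as nn

class TinyParallelGenerator(nn.Module):
    def __init__(self, aux_channels=80, residual_channels=64, layers=8):
        super().__init__()
        self.input_conv = nn.Conv1d(1, residual_channels, kernel_size=1)
        self.cond_conv = nn.Conv1d(aux_channels, residual_channels, kernel_size=1)
        self.res_blocks = nn.ModuleList([
            # Non-causal dilated convolutions: padding on both sides keeps length.
            nn.Conv1d(residual_channels, residual_channels, kernel_size=3,
                      dilation=2 ** i, padding=2 ** i)
            for i in range(layers)
        ])
        self.output_conv = nn.Conv1d(residual_channels, 1, kernel_size=1)

    def forward(self, noise, conditioning):
        # noise: (B, 1, T); conditioning: (B, aux_channels, T), assumed to be
        # already upsampled to the waveform resolution.
        h = self.input_conv(noise) + self.cond_conv(conditioning)
        for conv in self.res_blocks:
            h = h + torch.tanh(conv(h))  # simplified residual connection
        return torch.tanh(self.output_conv(h))

# All waveform samples are produced in one parallel forward pass:
generator = TinyParallelGenerator()
waveform = generator(torch.randn(1, 1, 16000), torch.randn(1, 80, 16000))
```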
  17. Generative Adversarial Networks (GANs)
     $\min_G \max_D \; \mathbb{E}[\log D(x)] + \mathbb{E}[\log(1 - D(G(z)))]$
     The Discriminator is shown ground truth and generated objects (produced by the Generator from random noise) and learns to discriminate more correctly ("which image is real?"); the Generator learns to generate more realistic outputs.
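A minimal PyTorch sketch of that min-max objective, written as the two per-network losses that are minimized in practice. It assumes the discriminator outputs raw logits and uses the textbook binary cross-entropy form; the specific adversarial loss variant used in Parallel WaveGAN is not shown on this slide.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    # D maximizes log D(x) + log(1 - D(G(z))): push real outputs toward 1
    # and generated outputs toward 0.
    real_loss = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
    fake_loss = F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    return real_loss + fake_loss

def generator_adversarial_loss(d_fake: torch.Tensor) -> torch.Tensor:
    # G tries to make D classify its generated samples as real.
    return F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
```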
  18. Effects of an auxiliary loss for GANs
     [Figure: image-to-image translation examples from [5], comparing Ground Truth, adversarial loss only, and adversarial + auxiliary loss.]
     [5] J. Zhu et al., "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proc. ICCV, 2017, pp. 2223-2232.
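In Parallel WaveGAN the auxiliary loss is the multi-resolution STFT loss defined on the next slides, and the generator minimizes it together with the adversarial loss as a weighted sum. The notation below follows [4]; the weight $\lambda_{\mathrm{adv}}$ is a hyperparameter whose value is not given on these slides:

$L_G(G, D) = L_{\mathrm{aux}}(G) + \lambda_{\mathrm{adv}} \, L_{\mathrm{adv}}(G, D)$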
  19. STFT loss (1/2): spectral convergence (SC) [6]
     $L_{\mathrm{sc}} = \frac{\bigl\| \, |\mathrm{STFT}(x)| - |\mathrm{STFT}(\hat{x})| \, \bigr\|_F}{\bigl\| \, |\mathrm{STFT}(x)| \, \bigr\|_F}$
     [6] S. O. Arik et al., "Fast spectrogram inversion using multi-head convolutional neural networks," IEEE Signal Process. Lett., 2019.
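A minimal PyTorch sketch of the spectral convergence loss, assuming 1-D (or batched) waveform tensors x (natural) and x_hat (generated). The STFT analysis parameters here are illustrative defaults, not the paper's settings.

```python
import torch

def spectral_convergence_loss(x: torch.Tensor, x_hat: torch.Tensor,
                              n_fft: int = 1024, hop: int = 120,
                              win: int = 600) -> torch.Tensor:
    window = torch.hann_window(win)
    mag = torch.stft(x, n_fft, hop_length=hop, win_length=win,
                     window=window, return_complex=True).abs()
    mag_hat = torch.stft(x_hat, n_fft, hop_length=hop, win_length=win,
                         window=window, return_complex=True).abs()
    # Frobenius norm of the magnitude difference, normalized by the reference.
    return torch.norm(mag - mag_hat, p="fro") / torch.norm(mag, p="fro")
```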
  20. STFT loss (2/2): log-scale STFT magnitude loss [6]
     $L_{\mathrm{mag}} = \frac{1}{N} \bigl\| \log|\mathrm{STFT}(x)| - \log|\mathrm{STFT}(\hat{x})| \bigr\|_1$, where $N$ is the number of elements in the STFT magnitude.
     [6] S. O. Arik et al., "Fast spectrogram inversion using multi-head convolutional neural networks," IEEE Signal Process. Lett., 2019.
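A matching sketch of the log-scale STFT magnitude loss; the small eps added before the log is an implementation detail assumed here to keep the loss finite.

```python
import torch
import torch.nn.functional as F

def log_stft_magnitude_loss(x: torch.Tensor, x_hat: torch.Tensor,
                            n_fft: int = 1024, hop: int = 120,
                            win: int = 600, eps: float = 1e-7) -> torch.Tensor:
    window = torch.hann_window(win)
    mag = torch.stft(x, n_fft, hop_length=hop, win_length=win,
                     window=window, return_complex=True).abs()
    mag_hat = torch.stft(x_hat, n_fft, hop_length=hop, win_length=win,
                         window=window, return_complex=True).abs()
    # F.l1_loss averages the absolute error over all N magnitude elements.
    return F.l1_loss(torch.log(mag_hat + eps), torch.log(mag + eps))
```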
  21. Multi-resolution STFT loss: combine M single-resolution STFT losses computed with different analysis settings (FFT size / window size / shift):
     512 / 240 / 50 (higher temporal resolution), 1024 / 600 / 120 (balanced), 2048 / 1200 / 240 (higher frequency resolution).
     $L_s^{(m)}(G) = \mathbb{E}_{z \sim p_z,\, x \sim p_{\mathrm{data}}}\bigl[ L_{\mathrm{sc}}(x, \hat{x}) + L_{\mathrm{mag}}(x, \hat{x}) \bigr]$, $\qquad L_{\mathrm{aux}}(G) = \frac{1}{M} \sum_{m=1}^{M} L_s^{(m)}(G)$, where $M$ is the number of STFT losses.
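A minimal sketch combining the two losses above over the three analysis settings listed on the slide (M = 3). It reuses spectral_convergence_loss and log_stft_magnitude_loss from the previous sketches.

```python
import torch

# (FFT size, window size, frame shift), as listed on the slide.
RESOLUTIONS = [(512, 240, 50), (1024, 600, 120), (2048, 1200, 240)]

def multi_resolution_stft_loss(x: torch.Tensor, x_hat: torch.Tensor) -> torch.Tensor:
    # Average the single-resolution loss (SC + log-magnitude) over M settings.
    total = 0.0
    for n_fft, win, hop in RESOLUTIONS:
        total = total + spectral_convergence_loss(x, x_hat, n_fft=n_fft, hop=hop, win=win) \
                      + log_stft_magnitude_loss(x, x_hat, n_fft=n_fft, hop=hop, win=win)
    return total / len(RESOLUTIONS)
```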
  22. Parallel WaveGAN: training process.
     [Diagram: the Generator maps random noise z and acoustic features to generated speech; the multi-resolution STFT losses (1st, 2nd, …, Mth) compare generated speech against natural speech; the Discriminator judges "is real?", yielding the adversarial loss for the generator's parameter update and the discriminator loss for the discriminator's parameter update.]
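Putting the pieces together, here is a minimal, hypothetical sketch of one training step in this style. The `generator`, `discriminator`, optimizers, data shapes, and the value of `lambda_adv` are assumptions of this sketch; the actual schedule, loss weighting, and discriminator design follow [4], not this toy loop.

```python
import torch

def train_step(generator, discriminator, g_opt, d_opt, noise, features, speech):
    # noise: (B, 1, T); features: (B, aux_channels, T); speech: (B, T).
    lambda_adv = 4.0  # adversarial loss weight; an illustrative hyperparameter

    # --- Generator update: multi-resolution STFT loss + adversarial loss ---
    fake = generator(noise, features)                       # (B, 1, T)
    aux_loss = multi_resolution_stft_loss(speech, fake.squeeze(1))
    adv_loss = generator_adversarial_loss(discriminator(fake))
    g_loss = aux_loss + lambda_adv * adv_loss
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()

    # --- Discriminator update: natural speech vs. detached generated speech ---
    d_real = discriminator(speech.unsqueeze(1))
    d_fake = discriminator(fake.detach())
    d_loss = discriminator_loss(d_real, d_fake)
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()
    return g_loss.item(), d_loss.item()
```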
  23. Performance Evaluations
     [Recap of the TTS pipeline: Text → Text processing → Linguistic features → Acoustic model → Acoustic features → Neural Vocoder → Waveform; example: [ a- ri^ ga- to- o- ].]
  24. Evaluations: baseline systems — WaveNet [1] (autoregressive) and ClariNet [3] (non-autoregressive, feed-forward).
     [1] A. van den Oord et al., "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016. [3] W. Ping et al., "ClariNet: Parallel wave generation in end-to-end text-to-speech," in Proc. ICLR, 2019.
  25. Effects of the multi-resolution STFT loss (MOS): Parallel WaveGAN with a single STFT loss 1.36, with three STFT losses 4.06, Reference 4.46. Using the multi-resolution STFT loss significantly improved perceptual quality.
  26. Performance Evaluations
     [Recap of the TTS pipeline, as on slide 23.]
  27. Naturalness evaluation in TTS (MOS): WaveNet 3.33, ClariNet 4.00, Parallel WaveGAN (ours) 4.16, Reference 4.46. Our model achieved a 4.16 MOS, competitive with ClariNet [3] = High Quality ☺
  28. Related work (timeline): 2019.10.8 MelGAN [7] › 2019.10.25 Parallel WaveGAN [4] › 2020.5.11 Multi-band MelGAN [8] › 2020.5.18 Quasi-Periodic Parallel WaveGAN [11] › 2020.7.8 FastSpeech 2s [12] › 2020.7.30 VocGAN [9] › 2020.9.3 HiFiSinger [13] › 2020.10.12 HiFi-GAN [10]
  29. Conclusion: High Quality ☺ and Fast ☺
     Naturalness (MOS): WaveNet 3.33, ClariNet 4.00, Parallel WaveGAN (ours) 4.16, Reference 4.46.
     [Bar charts: speed/training comparisons — Parallel WaveGAN vs. ClariNet (2.8 vs. 12.7) and Parallel WaveGAN vs. WaveNet (0.03 vs. 321).]
  30. References
     [1] A. van den Oord et al., "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.
     [2] A. van den Oord et al., "Parallel WaveNet: Fast high-fidelity speech synthesis," in Proc. ICML, 2018.
     [3] W. Ping et al., "ClariNet: Parallel wave generation in end-to-end text-to-speech," in Proc. ICLR, 2019.
     [4] R. Yamamoto et al., "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram," in Proc. ICASSP, 2020, pp. 6199-6203.
     [5] J. Zhu et al., "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proc. ICCV, 2017, pp. 2223-2232.
     [6] S. O. Arik et al., "Fast spectrogram inversion using multi-head convolutional neural networks," IEEE Signal Process. Lett., 2019.
     [7] K. Kumar et al., "MelGAN: Generative adversarial networks for conditional waveform synthesis," in Proc. NeurIPS, 2019, pp. 14881-14892.
     [8] G. Yang et al., "Multi-band MelGAN: Faster waveform generation for high-quality text-to-speech," arXiv preprint arXiv:2005.05106, 2020.
     [9] J. Yang et al., "VocGAN: A high-fidelity real-time vocoder with a hierarchically-nested adversarial network," in Proc. INTERSPEECH, 2020.
     [10] J. Kong et al., "HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis," in Proc. NeurIPS, 2020.
     [11] Y.-C. Wu et al., "Quasi-Periodic Parallel WaveGAN vocoder: A non-autoregressive pitch-dependent dilated convolution model for parametric speech generation," in Proc. INTERSPEECH, 2020.
     [12] Y. Ren et al., "FastSpeech 2: Fast and high-quality end-to-end text to speech," arXiv preprint arXiv:2006.04558, 2020.
     [13] J. Chen et al., "HiFiSinger: Towards high-fidelity neural singing voice synthesis," arXiv preprint arXiv:2009.01776, 2020.
     Icon made by Pixel perfect from www.flaticon.com