WaveNet: A Generative Model for Raw Audio

Slides for an invited talk given on May 14, 2018 at the National Institute of Information and Communications Technology (NICT).

A review of research relevant to WaveNet, and what I learned from developing an open-source implementation.

Speech samples are available at https://r9y9.github.io/wavenet_vocoder/ and https://r9y9.github.io/blog/2018/05/20/tacotron2/.

Ryuichi Yamamoto

May 14, 2018

Transcript

  1. WaveNet: A Generative Model for Raw Audio
    What I learned from developing an open-source implementation
    2018/05/14, Ryuichi Yamamoto @ LINE corp.
    GitHub: https://github.com/r9y9/wavenet_vocoder
  2. Outline
    • Introduction
      – Why I started the project
    • Background of WaveNet
      – PixelRNN / PixelCNN
      – Gated PixelCNN
      – WaveNet
    • Recent advances
    • Development
      – Details
      – Practical tips
      – Samples
      – Open-source development: pros and cons
    • Summary
  3. Who am I
    • 山本 龍一 / Ryuichi Yamamoto (@r9y9)
    • MSc @ Nagoya Institute of Technology (2013)
    • Software Engineer @ teamLab Inc. (2013-2017)
    • Software Engineer @ LINE corp. (2018-Present)
  4. What does WaveNet bring to TTS?
    • Quality improvement
      – Reduces the gap between SOTA and human-level performance in TTS
      – Outperforms both Google’s SOTA unit selection and parametric TTS systems
    • Waveform-level modeling/generation, no vocoder
    https://deepmind.com/blog/wavenet-generative-model-raw-audio/
  5. Why I started the project (1/2)
    I wondered whether WaveNet works in practice. Does it really work well?
  6. Why I started the project (2/2)
    • Many open-source implementations lack support for local conditioning, which makes them unusable for TTS
    • Do we need 32 GPUs?
      – I believe not!
    • Why open-source?
      – I wanted to get feedback from people around the world quickly
  7. Overview
    [Diagram: PixelRNN, PixelCNN, Gated PixelCNN, WaveNet. Labels: Approx., Gated activation, Conditioning, Dilated convolution]
  8. Overview
    Neural network-based autoregressive generative models [van den Oord; ’16a]
    [Diagram: PixelRNN, PixelCNN, Gated PixelCNN, WaveNet. Labels: Approx., Gated activation, Conditioning, Dilated convolution]
  9. PixelRNN/CNN: Autoregressive models for images [van den Oord; ’16a]
    $p(\mathbf{x}) = \prod_{i=1}^{n^2} p(x_i \mid x_1, \ldots, x_{i-1})$
    • Best reported log-likelihood on MNIST and CIFAR-10 at the time
      – Less blurry images than VAEs, and more stable to train than GANs
  10. PixelRNN/CNN: Autoregressive models for images
    • Pixels as discrete variables
      – With sufficient data, the model can represent multimodal distributions efficiently
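
The chain-rule factorization and the discrete-variable view used by PixelRNN/CNN can be illustrated with a minimal sketch. The function `toy_conditional` is a hypothetical stand-in for the neural network (it is not from the slides or the papers); it simply returns a categorical distribution over 4 classes given the previous values.

```python
import math
import random


def toy_conditional(prefix, num_classes=4):
    """Hypothetical stand-in for the network: returns p(x_i | x_1..x_{i-1}).

    It favors repeating the previous value; a real model would compute
    these probabilities with a softmax over network outputs.
    """
    if not prefix:
        return [1.0 / num_classes] * num_classes
    weights = [1.0] * num_classes
    weights[prefix[-1]] = 7.0
    total = sum(weights)
    return [w / total for w in weights]


def log_likelihood(x, conditional=toy_conditional):
    """log p(x) = sum_i log p(x_i | x_1..x_{i-1}) (chain rule)."""
    return sum(math.log(conditional(x[:i])[x[i]]) for i in range(len(x)))


def sample(length, conditional=toy_conditional, seed=0):
    """Ancestral sampling: draw one value at a time, feeding it back in."""
    rng = random.Random(seed)
    x = []
    for _ in range(length):
        probs = conditional(x)
        x.append(rng.choices(range(len(probs)), weights=probs)[0])
    return x
```

Sampling is inherently sequential because each value depends on all previously drawn values, which is why generation with this family of models is slow.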
  11. PixelRNN vs PixelCNN
    • The difference is the context size: fixed (PixelCNN) or variable (PixelRNN)
      – PixelRNN performs better than PixelCNN
      – PixelCNN is faster to train than PixelRNN
  12. Overview
    Can we get performance comparable to PixelRNN with faster training? [van den Oord; ’16a] [van den Oord; ’16b]
    [Diagram: PixelRNN, PixelCNN, Gated PixelCNN, WaveNet. Labels: Approx., Gated activation, Dilated convolution]
  13. Gated PixelCNN [van den Oord; ’16b]
    • An extension of PixelCNN
      – One key ingredient: the gated activation function
    [Diagram: a Conv followed by ReLU (x → y) compared with two Convs combined through tanh and σ (x → y)]
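
As a minimal sketch of the gated activation unit, the output is y = tanh(f) ⊙ σ(g), where f and g are the outputs of two separate convolutions over the same input. The sketch below assumes those two convolution outputs have already been computed and are passed in as plain lists; the function names are illustrative only.

```python
import math


def sigmoid(a):
    """Numerically stable logistic function."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def gated_activation(filter_out, gate_out):
    """Element-wise tanh(filter) * sigmoid(gate).

    filter_out and gate_out are assumed to be the outputs of two separate
    convolutions applied to the same input (not computed here).
    """
    return [math.tanh(f) * sigmoid(g) for f, g in zip(filter_out, gate_out)]
```

The sigmoid branch acts as a soft switch in [0, 1] that decides how much of the tanh branch passes through.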
  14. Conditional image generation

  15. Overview
    What happens with raw audio? [van den Oord; ’16a] [van den Oord; ’16b] [van den Oord; ’16c]
    [Diagram: PixelRNN, PixelCNN, Gated PixelCNN, WaveNet. Labels: Approx., Gated activation, Conditioning, Dilated convolution]
  16. WaveNet: A Generative Model for Raw Audio [van den Oord; ’16c]
    • The formulation and network architecture are basically the same as Gated PixelCNN
    • Domain-specific problems
      – How to capture long-term dependencies?
      – How to handle a 16-bit (65536-class) discrete signal?
  17. Dilated causal convolution
    Receptive field: linear growth (no dilation) vs. exponential growth (with dilation)
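
The linear vs. exponential receptive-field comparison can be checked with a short calculation. This is a sketch under the assumption of kernel size 2 and ten layers; the layer count and dilation schedule are chosen for illustration, not taken from the slides.

```python
def receptive_field(dilations, kernel_size=2):
    """Receptive field (in samples) of a stack of dilated causal convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def causal_dilated_conv(x, weights, dilation):
    """1-D causal dilated convolution with implicit zero padding on the left.

    y[t] = sum_k weights[k] * x[t - (K - 1 - k) * dilation],
    so y[t] never depends on samples after time t.
    """
    K = len(weights)
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - (K - 1 - k) * dilation
            if idx >= 0:
                acc += w * x[idx]
        y.append(acc)
    return y


linear = receptive_field([1] * 10)                          # 11
exponential = receptive_field([2 ** i for i in range(10)])  # 1024
```

With the same ten layers, doubling the dilation at each layer covers 1024 samples instead of 11.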
  18. Softmax output layer
    • Audio as a sequence of discrete values
      – But 16-bit linear PCM (65536 classes) is hard to model
        → 16-bit to 8-bit mu-law encoding, without losing much information
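
A minimal sketch of mu-law companding to 256 classes is shown below. It assumes the waveform has already been scaled to [-1, 1]; the rounding convention is one reasonable choice and may differ from a particular implementation.

```python
import math


def mulaw_encode(x, mu=255):
    """Map a sample x in [-1, 1] to an integer class in [0, mu]."""
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int(math.floor((y + 1) / 2 * mu + 0.5))


def mulaw_decode(c, mu=255):
    """Map an integer class in [0, mu] back to a sample in [-1, 1]."""
    y = 2 * c / mu - 1
    return math.copysign(((1 + mu) ** abs(y) - 1) / mu, y)
```

The logarithmic mapping spends more classes on small amplitudes, where most speech samples lie, so the round-trip error is smallest for quiet samples.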
  19. WaveNet: Summary
    • An autoregressive generative model
    • Gated activation + conditioning
    • Stack of dilated convolutions to capture long-term dependencies
    • Audio as a sequence of discrete values: mu-law encoding
    [Diagram: PixelRNN, PixelCNN, Gated PixelCNN, WaveNet. Labels: Approx., Gated activation, Conditioning, Dilated convolution]
  20. Recent advances
    [Diagram: PixelRNN, PixelCNN, Gated PixelCNN, WaveNet, PixelCNN++, Fast WaveNet, Parallel WaveNet, Tacotron, Tacotron 2. Labels: Approx., Gated activation, Conditioning, Dilated convolution, Mixture of logistic distributions, 16-bit linear PCM, Tacotron + WaveNet, End-to-end]
  21. Recent advances
    [Diagram: PixelRNN, PixelCNN, Gated PixelCNN, WaveNet, PixelCNN++ [Salimans; ’17], Fast WaveNet, Parallel WaveNet, Tacotron, Tacotron 2. Labels: Approx., Gated activation, Conditioning, Dilated convolution, Mixture of logistic distributions, 16-bit linear PCM, Tacotron + WaveNet, End-to-end]
  22. PixelCNN++ [Salimans; ’17]
    • The problem with the softmax output
      – The model does not know how close 127 and 126 are
        → Mixture of logistic distributions (MoL)
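
The idea behind the mixture of logistics output can be sketched as follows: each integer value gets the probability mass that a continuous mixture of logistic distributions assigns to its bin, so neighboring values such as 126 and 127 receive similar probabilities by construction. This is a simplified illustration on the integer scale [0, 255]; PixelCNN++ itself works on a rescaled range, and the mixture parameters here are made up for the example.

```python
import math


def sigmoid(a):
    """Numerically stable logistic CDF."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def discretized_mol_prob(value, weights, means, scales, num_classes=256):
    """P(value) under a discretized mixture of logistic distributions.

    Each bin covers [value - 0.5, value + 0.5]; the first and last bins
    absorb the tails so the probabilities sum to one.
    """
    total = 0.0
    for w, m, s in zip(weights, means, scales):
        upper = 1.0 if value == num_classes - 1 else sigmoid((value + 0.5 - m) / s)
        lower = 0.0 if value == 0 else sigmoid((value - 0.5 - m) / s)
        total += w * (upper - lower)
    return total


# Illustrative parameters: two components centered at 127 and 40.
probs = [discretized_mol_prob(v, [0.6, 0.4], [127.0, 40.0], [3.0, 5.0])
         for v in range(256)]
```

A softmax needs one output per class, whereas this parameterization needs only a weight, a mean, and a scale per mixture component.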
  23. Recent advances
    [Diagram: PixelRNN, PixelCNN, Gated PixelCNN, WaveNet, PixelCNN++, Fast WaveNet [Le Paine; ’16], Parallel WaveNet [van den Oord; ’17], Tacotron [Wang; ’17], Tacotron 2 [Shen; ’17]. Labels: Approx., Gated activation, Conditioning, Dilated convolution, Mixture of logistic distributions, 16-bit linear PCM, Tacotron + WaveNet]
  24. Development: Conditional/unconditional WaveNet

  25. Development process
    • TODOs tracked in a GitHub issue
    • 98 comments
    • 53 issues and 8 PRs in total
  26. Progress
    • Done
      – WaveNet training and inference
      – Local conditioning and global conditioning
      – Mixture of logistic distributions output (from PixelCNN++)
      – Tacotron 2 (with WaveNet) TTS demo
        • Generation is super slow though :(
    • Not yet
      – Fast WaveNet Generation Algorithm [Le Paine; ’16]
      – Parallel WaveNet [van den Oord; ’17]
  27. Prototyping on New Year's Eve :)

  28. Development details
    • Originally written in Python 3.6 and PyTorch v0.2
      – The latest PyTorch (v0.4) is now supported
    • Development and experiments were all done on a single machine (CPU: i7-7700K) with a single GPU (GTX 1080 Ti, 11 GB)
  29. Datasets I’ve worked on
    • CMU Arctic
      – 7 male/female speakers
      – 6.25 hours
    • LJSpeech
      – Single female speaker
      – 24 hours
      – Public domain
    • VCTK
  30. Samples from unconditional WaveNet trained on CMU Arctic
    https://github.com/r9y9/wavenet_vocoder/issues/1#issuecomment-354586299
    [Figure: two pairs of panels, each labeled Real and Generated]
  31. Samples: Deep Voice 3 (w/o WN) [Ping; ’17] vs Tacotron 2 (trained on LJSpeech)
    1. Scientists at the CERN laboratory say they have discovered a new particle.
    2. There’s a way to measure the acute emotional intelligence that has never gone out of style.
    3. President Trump met with other leaders at the Group of 20 conference.
    4. The Senate’s bill to repeal and replace the Affordable Care Act is now imperiled.
    5. Generative adversarial network or variational auto-encoder.
    Samples for each sentence: Deep Voice 3 (w/o WN) and Tacotron 2 (LJSpeech), respectively.
  32. Practical Tips
    • Slice entire audio clips into short segments to form mini-batches
      – A mini-batch size of 2 with time slices of 8000 samples requires roughly 10 GB of GPU RAM
    • Mixture of logistic distributions output [Salimans; ’17] takes approx. 10x longer to train than softmax output
      – 1 to 2 million steps to get high-quality samples
      – Exponential moving average (EMA) [van den Oord; ’17] works, but takes a long time to converge
    • Time resolution adjustment for conditional features is crucial
      – Trainable upsampling layers work better than fixed feature duplication [Tamamori; ’17]
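
The slicing and time-resolution tips can be sketched as below. The sketch assumes frame-level conditioning features (for example, one mel-spectrogram frame per `hop_size` audio samples) and uses a hop size of 256 for illustration; both function names and the hop size are assumptions, not taken from the slides. It shows the fixed feature duplication baseline, not the trainable upsampling layers that the slides report as working better.

```python
import random


def upsample_by_duplication(frames, hop_size):
    """Fixed upsampling: repeat each conditioning frame hop_size times."""
    return [f for f in frames for _ in range(hop_size)]


def random_aligned_slice(audio, frames, hop_size, max_time_steps, rng=random):
    """Cut a random training segment while keeping audio and frames aligned.

    The segment starts on a frame boundary and spans a whole number of
    frames, so frame i always covers audio[i * hop_size:(i + 1) * hop_size].
    """
    assert len(audio) == len(frames) * hop_size
    max_frames = max_time_steps // hop_size
    if len(frames) <= max_frames:
        return audio, frames
    start = rng.randint(0, len(frames) - max_frames)
    segment = audio[start * hop_size:(start + max_frames) * hop_size]
    return segment, frames[start:start + max_frames]
```

With a time slice of 8000 samples and a hop size of 256, each segment holds 31 frames, that is 7936 audio samples.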
  33. Open-source development: Pros and Cons
    • Pros :)
      – Open discussion
      – Some people fixed bugs for me
      – Some people built work on top of mine (e.g., Tacotron 2, Parallel WaveNet)
      – Job offers, invited talks
    • Cons :(
      – Too many questions
  34. Summary
    • WaveNet: An autoregressive model for raw audio
      – Extends Gated PixelCNN and PixelCNN++
      – Stack of dilated convolutions to capture long-term dependencies
      – Mu-law quantization + softmax, or mixture of logistic distributions, to model raw audio
    • Motivation: does it really work well?
      – Yes!
      – Super slow, but the quality is amazing
      – WaveNet can be trained even with a single GPU
        • 100k to 200k steps (~2 days) w/ softmax (batch size 2)
        • 1000k to 2000k steps (~2 weeks) w/ MoL (batch size 2)
    • Open-source development is fun :)
  35. Pointers
    • Code: https://github.com/r9y9/wavenet_vocoder
    • Samples: https://r9y9.github.io/wavenet_vocoder/
    • Online TTS demo: https://github.com/r9y9/Colaboratory
  36. References (1/2)
    • [DeepMind; ’16] WaveNet: A Generative Model for Raw Audio: https://deepmind.com/blog/wavenet-generative-model-raw-audio/
    • [van den Oord; ’16a] Aaron van den Oord, Nal Kalchbrenner, Koray Kavukcuoglu, “Pixel Recurrent Neural Networks”, ICML 2016.
    • [van den Oord; ’16b] Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, et al, “Conditional Image Generation with PixelCNN Decoders”, NIPS 2016.
    • [van den Oord; ’16c] Aaron van den Oord, Sander Dieleman, Heiga Zen, et al, “WaveNet: A Generative Model for Raw Audio”, arXiv:1609.03499, Sep. 2016.
    • [Salimans; ’17] Tim Salimans, Andrej Karpathy, Xi Chen, et al, “PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications”, Jan. 2017.
    • [van den Oord; ’17] Aaron van den Oord, Yazhe Li, Igor Babuschkin, et al, “Parallel WaveNet: Fast High-Fidelity Speech Synthesis”, arXiv:1711.10433, Nov. 2017.
    • [Wang; ’17] Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, et al, “Tacotron: Towards End-to-End Speech Synthesis”, arXiv:1703.10135, Apr. 2017.
    • [Shen; ’17] Jonathan Shen, Ruoming Pang, Ron J. Weiss, et al, “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions”, arXiv:1712.05884, Dec. 2017.
  37. References (2/2)
    • [Le Paine; ’16] Tom Le Paine, Pooya Khorrami, Shiyu Chang, et al, “Fast Wavenet Generation Algorithm”, arXiv:1611.09482, Nov. 2016.
    • [Ping; ’17] Wei Ping, Kainan Peng, Andrew Gibiansky, et al, “Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning”, arXiv:1710.07654, Oct. 2017.
    • [Tamamori; ’17] Akira Tamamori, Tomoki Hayashi, Kazuhiro Kobayashi, et al, “Speaker-dependent WaveNet vocoder”, Proc. of INTERSPEECH 2017.
    • CMU ARCTIC speech synthesis database: http://festvox.org/cmu_arctic/
    • The LJ Speech Dataset: https://keithito.com/LJ-Speech-Dataset/
  38. Open-source implementations on GitHub
    • https://github.com/ibab/tensorflow-wavenet
      – A TensorFlow implementation. The most starred repository on GitHub (★3571 as of 2018/05/14)
    • https://github.com/tensorflow/magenta/tree/master/magenta/models/nsynth/wavenet
      – An implementation from the TensorFlow community
    • https://github.com/kan-bayashi/PytorchWaveNetVocoder
      – By an author of [Tamamori; ’17] from Nagoya University
    • https://github.com/tomlepaine/fast-wavenet
      – By an author of [Le Paine; ’16]
    • https://github.com/vincentherrmann/pytorch-wavenet
      – A PyTorch implementation
    • https://github.com/dhpollack/fast-wavenet.pytorch
      – A faster WaveNet implementation
    • https://github.com/basveeling/wavenet
      – A Keras implementation
    • https://github.com/musyoku/wavenet
      – A Chainer implementation