WaveNet: A Generative Model for Raw Audio

What I learned from developing an open-source implementation
2018/05/14 Ryuichi Yamamoto @ LINE corp.
GitHub: https://github.com/r9y9/wavenet_vocoder 1

Outline
• Introduction
  – Why I started the project
• Background of WaveNet
  – PixelRNN / PixelCNN
  – Gated PixelCNN
  – WaveNet
• Recent advances
• Development
  – Details
  – Practical Tips
  – Samples
  – Open-source development: pros and cons
• Summary 2

Who am I • Ryuichi Yamamoto (@r9y9)
• MSc @ Nagoya Institute of Technology (2013) • Software Engineer @ teamLab. Inc. (2013-2017) • Software Engineer @ LINE corp. (2018-Present) 3

What does WaveNet bring to TTS? • Quality improvement – Reduces the gap
between SOTA and human-level performance in TTS – Outperforms both Google’s SOTA unit selection and parametric TTS • Waveform-level modeling/generation, no vocoder https://deepmind.com/blog/wavenet-generative-model-raw-audio/ 4

1/2 Why I started the project I wonder if WaveNet
works in practice… Does it really work well? 5

2/2 Why I started the project • Many open-source implementations*
lack support for local conditioning, which makes them unusable for TTS • Do we need 32 GPUs? – I believe not! • Why open-source? – I wanted to get feedback from people around the world quickly 6

Overview [Diagram: PixelRNN → PixelCNN → Gated PixelCNN → WaveNet; annotations: approximation, gated activation, conditioning, dilated convolution] 7

Overview: Neural network-based autoregressive generative models [Diagram: PixelRNN → PixelCNN → Gated PixelCNN → WaveNet; annotations: approximation, gated activation, conditioning, dilated convolution; cited: [van den Oord; ’16a]] 8

[van den Oord; ’16a] PixelRNN/CNN: Autoregressive models for images
$p(\mathbf{x}) = \prod_{i=1}^{n^2} p(x_i \mid x_1, \ldots, x_{i-1})$ • Best log-likelihood on MNIST, CIFAR10 – Less blurry images than VAEs, while more stable to train than GANs 9

PixelRNN/CNN: Autoregressive models for images • Pixels as discrete variables
– With sufficient data, the model can represent multimodal distributions efficiently 10
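
A minimal sketch of the autoregressive factorization with discrete values: the log-likelihood of a sequence is the sum of the per-step conditional log-probabilities. The `toy_conditional` function is a hypothetical stand-in for the network (it is not taken from the papers or from the repository); it only makes the chain rule concrete.

```python
import math


def log_likelihood(values, conditional):
    """Return sum_i log p(x_i | x_1, ..., x_{i-1}) for a sequence of discrete values."""
    total = 0.0
    for i, value in enumerate(values):
        # Distribution over the next value, given everything generated so far.
        probs = conditional(values[:i])
        total += math.log(probs[value])
    return total


def toy_conditional(context):
    """Hypothetical model over binary values: tends to repeat the previous value."""
    if not context:
        return [0.5, 0.5]
    return [0.9, 0.1] if context[-1] == 0 else [0.1, 0.9]


x = [0, 0, 1, 1]
print(log_likelihood(x, toy_conditional))
```

Because each step outputs a full categorical distribution, nothing forces the prediction to be unimodal, which is the point of treating pixels as discrete variables.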

PixelRNN vs PixelCNN • Difference is the context-size: fixed or
variable – PixelRNN performs better than PixelCNN – PixelCNN is faster than PixelRNN in training 11

Overview [Diagram: PixelRNN → PixelCNN → Gated PixelCNN → WaveNet; annotations: approximation, gated activation, dilated convolution] Can we get comparable performance to PixelRNN with faster training? [van den Oord; ’16a] [van den Oord; ’16b] 12

[van den Oord; ’16b] Gated PixelCNN • Extended work of
PixelCNN – One of the keys: gated activation function [Figure: plain unit (x → Conv → ReLU → y) vs gated unit (x → two Convs → tanh and σ branches, multiplied element-wise → y)] 13
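
A minimal sketch of the gated activation unit, y = tanh(W_f ∗ x) ⊙ σ(W_g ∗ x), as described in the Gated PixelCNN paper. The two convolutions are assumed to have been applied already; `filter_out` and `gate_out` are illustrative names for their outputs.

```python
import math


def sigmoid(v):
    """Logistic function, used as the gate."""
    return 1.0 / (1.0 + math.exp(-v))


def gated_activation(filter_out, gate_out):
    """Element-wise tanh(filter) * sigmoid(gate) over two equally sized lists."""
    return [math.tanh(f) * sigmoid(g) for f, g in zip(filter_out, gate_out)]


# A wide-open gate passes tanh(filter) through; a closed gate suppresses it.
print(gated_activation([1.0, 1.0], [10.0, -10.0]))
```

Compared with a plain ReLU, the sigmoid branch lets the network decide per element how much of the tanh branch to pass on.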

Conditional image generation 14

Overview [Diagram: PixelRNN → PixelCNN → Gated PixelCNN → WaveNet; annotations: approximation, gated activation, conditioning, dilated convolution] What happens with raw audio? [van den Oord; ’16a] [van den Oord; ’16b] [van den Oord; ’16c] 15

[van den Oord; ’16c] WaveNet: A Generative Model for Raw
Audio • Formulation and network architecture are basically the same as in Gated PixelCNN • Domain-specific problems – How to capture long-term dependencies? – How to handle a 16-bit (65536-class) discrete signal? 16

Dilated causal convolution [Figure: receptive field grows linearly without dilation vs exponentially with dilation] 17
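
The linear-versus-exponential growth of the receptive field can be checked with a few lines. This is a sketch under stated assumptions: a kernel size of 2 and dilations doubling from 1 to 512 (the pattern described in the WaveNet paper); the function names are illustrative and the convolution is a plain-Python reference, not the repository's implementation.

```python
def receptive_field(dilations, kernel_size=2):
    """Number of input samples that can influence one output of the stack."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def causal_dilated_conv(x, weights, dilation):
    """y[t] = sum_k weights[k] * x[t - k * dilation], treating x[t] as 0 for t < 0."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:  # causal: only current and past samples are used
                acc += w * x[idx]
        y.append(acc)
    return y


linear = receptive_field([1] * 10)                          # 11
exponential = receptive_field([2 ** i for i in range(10)])  # 1024
print(linear, exponential)

# An impulse at t=3 only affects outputs at t=3 and t=3+dilation, never earlier ones.
print(causal_dilated_conv([0, 0, 0, 1, 0, 0, 0, 0], [1.0, 1.0], dilation=2))
```

With the same ten layers, doubling the dilation covers 1024 samples instead of 11, which is why stacks of dilated convolutions are used to reach long contexts.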

Softmax output layer • Audio as a sequence of discrete
values – But 16-bit linear PCM (65536 classes) is hard to model → 16-bit to 8-bit mu-law encoding without losing much information 18
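
A minimal sketch of 8-bit mu-law quantization, using the companding formula from the WaveNet paper, f(x) = sign(x) · ln(1 + μ|x|) / ln(1 + μ) with μ = 255. The input is assumed to be normalized to [-1, 1]; the rounding scheme and function names are illustrative choices, not necessarily those of the repository.

```python
import math

MU = 255  # 8-bit mu-law: 256 classes


def mulaw_encode(x, mu=MU):
    """Map a sample in [-1, 1] to an integer class in [0, mu]."""
    x = max(-1.0, min(1.0, x))
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int((compressed + 1) / 2 * mu + 0.5)


def mulaw_decode(index, mu=MU):
    """Map an integer class in [0, mu] back to a sample in [-1, 1]."""
    compressed = 2 * index / mu - 1
    return math.copysign(math.expm1(abs(compressed) * math.log1p(mu)) / mu, compressed)


for sample in (-1.0, -0.01, 0.0, 0.01, 1.0):
    code = mulaw_encode(sample)
    print(sample, code, mulaw_decode(code))
```

The logarithmic companding spends more of the 256 classes on small amplitudes, so quiet samples survive the round trip with small error.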

WaveNet: Summary [Diagram: PixelRNN → PixelCNN → Gated PixelCNN → WaveNet; annotations: approximation, gated activation, conditioning, dilated convolution]
• An autoregressive generative model • Gated activation + conditioning • Stack of dilated convolutions to capture long-term dependencies • Audio as a sequence of discrete values: mu-law encoding 19

Recent advances [Diagram: PixelRNN → PixelCNN → Gated PixelCNN → WaveNet, extended with PixelCNN++, Parallel WaveNet, Fast WaveNet, Tacotron, Tacotron + WaveNet, and Tacotron 2; annotations: approximation, gated activation, conditioning, dilated convolution, 16-bit linear PCM, mixture of logistic distributions, end-to-end] 20

[Diagram: recent advances (PixelCNN++, Parallel WaveNet, Fast WaveNet, Tacotron, Tacotron + WaveNet, Tacotron 2); annotations: gated activation, conditioning, 16-bit linear PCM, mixture of logistic distributions, end-to-end; cited: [Salimans; ‘17]] 21

[Salimans; ‘17] PixelCNN++ • The problem of softmax output –
The model does not know that 127 and 126 are close to each other → Mixture of logistic distributions (MoL) 22
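
A minimal sketch of a discretized mixture of logistics, to show why neighboring values such as 126 and 127 receive similar probability. It assumes integer values 0..255 with bins of width 1; PixelCNN++ itself works on values rescaled to [-1, 1], and the function and parameter names here are illustrative.

```python
import math


def sigmoid(v):
    """Numerically safe logistic CDF."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def discretized_mol_prob(x, weights, means, scales, num_classes=256):
    """P(x) for an integer x: each component's CDF mass inside [x - 0.5, x + 0.5]."""
    total = 0.0
    for w, mu, s in zip(weights, means, scales):
        # The edge bins absorb the tails so that probabilities sum to one.
        upper = 1.0 if x == num_classes - 1 else sigmoid((x + 0.5 - mu) / s)
        lower = 0.0 if x == 0 else sigmoid((x - 0.5 - mu) / s)
        total += w * (upper - lower)
    return total


weights, means, scales = [0.7, 0.3], [127.0, 40.0], [2.0, 5.0]
for value in (126, 127, 200):
    print(value, discretized_mol_prob(value, weights, means, scales))
```

Unlike a softmax over 256 unrelated classes, the ordering of the values is built into the distribution: mass placed near 127 automatically covers 126 as well.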

[Diagram: recent advances (PixelCNN++, Parallel WaveNet, Fast WaveNet, Tacotron, Tacotron + WaveNet, Tacotron 2); annotations: gated activation, conditioning, 16-bit linear PCM, mixture of logistic distributions; cited: [Shen; ’17] [Wang; ‘17] [van den Oord; ‘17] [Le Paine; ‘16]] 23

Development Conditional/unconditional WaveNet 24

Development process • TODOs tracked in a GitHub issue • 98
comments • 53 issues and 8 PRs in total 25

Progress • Done – WaveNet training and inference – Local
conditioning and global conditioning – Mixture of logistic distributions output (from PixelCNN++) – Tacotron 2 (with WaveNet) TTS demo • Generation is super slow though ☹ • Not yet – Fast WaveNet Generation Algorithm [Le Paine; ‘16] – Parallel WaveNet [van den Oord; ‘17] 26

Prototyping on New Year's Eve ☺ 27

Development details • Originally written in Python 3.6 and PyTorch
v0.2 – The latest PyTorch v0.4 is currently supported • Development and experiments were all done on a single machine (CPU: i7-7700K) with a single GPU (GTX 1080 Ti, 11GB) 28

Datasets I’ve worked on • CMU Arctic – 7 male/female
speakers – 6.25 hours • LJSpeech – Single female speaker – 24 hours – Public domain • VCTK 29

Samples from unconditional WaveNet trained on CMU Arctic 30 https://github.com/r9y9/wavenet_vocoder/issues/1#issuecomment-354586299
[Panels: real vs generated, two pairs]

Samples: Deep Voice 3 (w/o WN) [Ping; ‘17] vs Tacotron 2 (trained on LJSpeech) 31
1. Scientists at the CERN laboratory say they have discovered a new particle.
2. There’s a way to measure the acute emotional intelligence that has never gone out of style.
3. President Trump met with other leaders at the Group of 20 conference.
4. The Senate’s bill to repeal and replace the Affordable Care Act is now imperiled.
5. Generative adversarial network or variational auto-encoder.
Audio for each sentence: Deep Voice 3 (w/o WN), Tacotron 2 (LJSpeech), respectively.

Practical Tips
• Slice entire audio clips into short segments to form mini-batches
  – Mini-batch size=2 with a time-slice length of 8000 requires roughly 10GB of GPU RAM
• Mixture of logistic distributions output ([Salimans; ‘17]) takes approx. 10x longer to train than softmax output
  – 1 ~ 2 million steps to get high-quality samples
  – Exponential moving average (EMA, [van den Oord; ‘17]) works, but it takes a long time to converge
• Time resolution adjustment for conditional features is crucial
  – Trainable upsampling layers work better than fixed feature duplication ([Tamamori; ‘17]) 32
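
A minimal sketch of two of the tips above: cutting a fixed-length training segment that stays aligned with the conditioning frames, and the fixed feature duplication baseline for time resolution adjustment. The names `hop_size` and `max_time_steps` and the alignment strategy are assumptions for illustration, not necessarily what the repository does; a trainable upsampling layer would replace `upsample_by_duplication`.

```python
import random


def upsample_by_duplication(frames, hop_size):
    """Repeat each conditioning frame hop_size times: one entry per audio sample."""
    return [frame for frame in frames for _ in range(hop_size)]


def random_aligned_slice(audio, frames, hop_size, max_time_steps, rng=random):
    """Cut a random segment whose start and end fall on frame boundaries."""
    assert len(audio) == len(frames) * hop_size
    max_frames = max_time_steps // hop_size
    if len(frames) <= max_frames:
        return audio, frames
    start = rng.randint(0, len(frames) - max_frames)  # inclusive bounds
    end = start + max_frames
    return audio[start * hop_size:end * hop_size], frames[start:end]


# Hypothetical utterance: 100 conditioning frames, 256 audio samples per frame.
hop_size = 256
frames = list(range(100))
audio = [0.0] * (len(frames) * hop_size)

segment, segment_frames = random_aligned_slice(audio, frames, hop_size, max_time_steps=8000)
conditioning = upsample_by_duplication(segment_frames, hop_size)
print(len(segment), len(segment_frames), len(conditioning))
```

Keeping the slice on frame boundaries means the upsampled conditioning sequence has exactly one entry per audio sample in the segment.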

Open-source development: Pros. and Cons. • Pros. ☺ – Open
discussion – Some people fixed a bug for me – Some people built work on top of mine (e.g., Tacotron 2, Parallel WaveNet, etc.) – Job offers, invited talks • Cons. ☹ – Too many questions 33

Summary
• WaveNet: An autoregressive model for raw audio
  – Extended work from Gated PixelCNN and PixelCNN++
  – Stack of dilated convolutions to capture long-term dependencies
  – Mu-law quantization + softmax, or mixture of logistic distributions, to model raw audio
• Motivation: does it really work well?
  – Yes!
  – Super slow, but quality is amazing
  – WaveNet can be trained even with a single GPU
    • 100k ~ 200k steps (~2 days) w/ softmax (batch size 2)
    • 1000k ~ 2000k steps (~2 weeks) w/ MoL (batch size 2)
• Open-source development is fun ☺ 34

Pointers • Code: https://github.com/r9y9/wavenet_vocoder • Samples: https://r9y9.github.io/wavenet_vocoder/ • Online TTS
demo: https://github.com/r9y9/Colaboratory 35

1/2 References
• [DeepMind; ‘16] WaveNet: A Generative Model for Raw Audio: https://deepmind.com/blog/wavenet-generative-model-raw-audio/
• [van den Oord; ’16a] Aaron van den Oord, Nal Kalchbrenner, Koray Kavukcuoglu, “Pixel Recurrent Neural Networks”, ICML 2016.
• [van den Oord; ’16b] Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, et al, “Conditional Image Generation with PixelCNN Decoders”, NIPS 2016.
• [van den Oord; ’16c] Aaron van den Oord, Sander Dieleman, Heiga Zen, et al, “WaveNet: A Generative Model for Raw Audio”, arXiv:1609.03499, Sep 2016.
• [Salimans; ‘17] Tim Salimans, Andrej Karpathy, Xi Chen, et al, “PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications”, Jan 2017.
• [van den Oord; ‘17] Aaron van den Oord, Yazhe Li, Igor Babuschkin, et al, “Parallel WaveNet: Fast High-Fidelity Speech Synthesis”, arXiv:1711.10433, Nov 2017.
• [Wang; ‘17] Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, et al, “Tacotron: Towards End-to-End Speech Synthesis”, arXiv:1703.10135, Apr 2017.
• [Shen; ’17] Jonathan Shen, Ruoming Pang, Ron J. Weiss, et al, “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions”, arXiv:1712.05884, Dec 2017. 36

2/2 References
• [Le Paine; ‘16] Tom Le Paine, Pooya Khorrami, Shiyu Chang, et al, “Fast Wavenet Generation Algorithm”, arXiv:1611.09482, Nov 2016.
• [Ping; ‘17] Wei Ping, Kainan Peng, Andrew Gibiansky, et al, “Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning”, arXiv:1710.07654, Oct 2017.
• [Tamamori; ‘17] Akira Tamamori, Tomoki Hayashi, Kazuhiro Kobayashi, et al, “Speaker-dependent WaveNet vocoder”, Proc. of INTERSPEECH 2017.
• CMU ARCTIC speech synthesis database: http://festvox.org/cmu_arctic/
• The LJ Speech Dataset: https://keithito.com/LJ-Speech-Dataset/ 37

Open-source implementations on GitHub
• https://github.com/ibab/tensorflow-wavenet – A TensorFlow implementation. Most starred repository on GitHub (★3571 as of 2018/05/14)
• https://github.com/tensorflow/magenta/tree/master/magenta/models/nsynth/wavenet – An implementation from the TensorFlow community
• https://github.com/kan-bayashi/PytorchWaveNetVocoder – By an author of [Tamamori; ‘17], from Nagoya University
• https://github.com/tomlepaine/fast-wavenet – By the author of [Le Paine; ‘16]
• https://github.com/vincentherrmann/pytorch-wavenet – A PyTorch implementation
• https://github.com/dhpollack/fast-wavenet.pytorch – A faster WaveNet implementation
• https://github.com/basveeling/wavenet – Keras implementation
• https://github.com/musyoku/wavenet – Chainer implementation 38
