
WaveNet: A Generative Model for Raw Audio

Slides for the invited talk on May 14 at National Institute of Information and Communications Technology (NICT).

Review of relevant research for WaveNet and what I learned from developing an open-source implementation.

Speech samples are available at https://r9y9.github.io/wavenet_vocoder/ and https://r9y9.github.io/blog/2018/05/20/tacotron2/.

Ryuichi Yamamoto

May 14, 2018

Transcript

  1. WaveNet: A Generative Model for Raw Audio
    What I learned from developing an open-source implementation
    2018/05/14 Ryuichi Yamamoto @ LINE corp.
    Github: https://github.com/r9y9/wavenet_vocoder

  2. Outline
    • Introduction
    – Why I started the project
    • Background of WaveNet
    – PixelRNN / PixelCNN
    – Gated PixelCNN
    – WaveNet
    • Recent advances
    • Development
    – Details
    – Practical Tips
    – Samples
    – Open-source development: pros and cons
    • Summary

  3. Who am I?
    • 山本 龍一 / Ryuichi Yamamoto (@r9y9)
    • MSc @ Nagoya Institute of Technology (2013)
    • Software Engineer @ teamLab. Inc. (2013-2017)
    • Software Engineer @ LINE corp. (2018-Present)

  4. What does WaveNet bring to TTS?
    • Quality improvement
    – Reduces the gap between the previous SOTA and human-level performance in TTS
    – Outperforms both Google’s SOTA unit-selection and parametric TTS systems
    • Waveform-level modeling/generation, no vocoder
    https://deepmind.com/blog/wavenet-generative-model-raw-audio/

  5. 1/2 Why I started the project
    I wonder if WaveNet works in practice…
    Does it really work well?

  6. 2/2 Why I started the project
    • Many open-source implementations* lack support for local
    conditioning, which makes them unusable for TTS
    • Do we need 32 GPUs?
    – I believe not!
    • Why open-source?
    – I wanted to get feedback from people around the world quickly

  7. Overview
    PixelRNN
    PixelCNN
    Gated
    PixelCNN
    WaveNet
    Approx.
    Gated activation
    Conditioning Dilated convolution

  8. Overview
    PixelRNN
    PixelCNN
    Gated
    PixelCNN
    WaveNet
    Approx.
    Dilated convolution
    Neural network-based autoregressive generative models
    [van den Oord; ’16a]
    Gated activation
    Conditioning

  9. [van den Oord; ’16a] PixelRNN/CNN:
    Autoregressive models for images
    p(x) = ∏_{i=1}^{n²} p(x_i | x_1, …, x_{i−1})
    • Best log likelihood on MNIST, CIFAR10
    – Less blurry images than VAEs, while more stable to train than GANs
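The chain-rule factorization above can be sketched in a few lines of plain Python: an autoregressive model scores a sequence by summing conditional log-probabilities. The `conditional` callable here is a hypothetical stand-in for the network that PixelRNN/CNN would use to predict each pixel from the ones before it.

```python
import math

def sequence_log_likelihood(x, conditional):
    """log p(x) = sum_i log p(x_i | x_1, ..., x_{i-1})."""
    total = 0.0
    for i in range(len(x)):
        # conditional(context, value) returns p(x_i = value | context)
        total += math.log(conditional(x[:i], x[i]))
    return total

# Toy conditional: a fair coin over {0, 1}, independent of context.
uniform = lambda context, value: 0.5

x = [0, 1, 1, 0]
ll = sequence_log_likelihood(x, uniform)  # 4 * log(0.5)
```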

  10. PixelRNN/CNN: Autoregressive models for
    images
    • Pixels as discrete variables
    – With sufficient data, it can model multimodal distributions efficiently

  11. PixelRNN vs PixelCNN
    • The key difference is the context size: variable (RNN) vs. fixed (CNN)
    – PixelRNN performs better than PixelCNN
    – PixelCNN is faster to train than PixelRNN

  12. Overview
    PixelRNN
    PixelCNN
    Gated
    PixelCNN
    WaveNet
    Approx.
    Gated activation Dilated convolution
    Can we get comparable performance to PixelRNN with faster training?
    [van den Oord; ’16b]
    [van den Oord; ’16a]

  13. [van den Oord; ’16b] Gated PixelCNN
    • An extension of PixelCNN
    – One key idea: the gated activation function
    Figure: plain activation y = ReLU(Conv(x)) vs. gated activation
    y = tanh(Conv_f(x)) ⊙ σ(Conv_g(x))
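In scalar form, the gated unit is the elementwise product of a tanh "filter" path and a sigmoid "gate" path. A minimal sketch, where `filter_out` and `gate_out` stand in for the outputs of the two convolution branches (the learned convolutions themselves are omitted here):

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gated_activation(filter_out, gate_out):
    """z = tanh(filter) * sigmoid(gate), applied per element.

    The sigmoid gate in [0, 1] controls how much of each tanh
    feature passes through, unlike a plain ReLU activation.
    """
    return [math.tanh(f) * sigmoid(g) for f, g in zip(filter_out, gate_out)]

z = gated_activation([0.5, -1.0], [2.0, 0.0])
```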

  14. Conditional image generation

  15. Overview
    PixelRNN
    PixelCNN
    Gated
    PixelCNN
    WaveNet
    Approx.
    Dilated convolution
    What happens with raw audio?
    [van den Oord; ’16b]
    [van den Oord; ’16a]
    [van den Oord; ’16c]
    Gated activation
    Conditioning

  16. [van den Oord; ’16c] WaveNet: A Generative
    Model for Raw Audio
    • The formulation and network architecture are basically the same as Gated
    PixelCNN
    • Domain-specific problems
    – How to capture long-term dependencies?
    – How to handle a 16-bit discrete signal (65,536 classes)?

  17. Dilated causal convolution
    Figure: the receptive field grows linearly with depth for ordinary
    convolutions, but exponentially once dilation is added.
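The exponential growth can be checked with a little arithmetic: for a stack of causal convolutions with kernel size k, the receptive field is (k − 1) · Σ dilations + 1. With the dilation schedule from the WaveNet paper (1, 2, …, 512, repeated 3 times, k = 2) this gives 3,070 samples, while the same 30 layers without dilation would cover only 31:

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of causal convolutions."""
    return (kernel_size - 1) * sum(dilations) + 1

# WaveNet-style schedule: dilations 1, 2, ..., 512, repeated 3 times.
dilations = [2 ** i for i in range(10)] * 3

with_dilation = receptive_field(2, dilations)         # 3070 samples
without_dilation = receptive_field(2, [1] * 30)       # only 31 samples
```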

  18. Softmax output layer
    • Audio as a sequence of discrete values
    – But 16-bit linear PCM (65,536 classes) is hard to model
    → 16-bit to 8-bit mu-law encoding, without losing much information
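The mu-law companding transform is f(x) = sign(x) · ln(1 + μ|x|) / ln(1 + μ) with μ = 255, followed by uniform quantization of f(x) into 256 classes. A minimal stdlib-only sketch (the helper names are mine, not from the wavenet_vocoder code):

```python
import math

MU = 255

def mulaw_encode(x):
    """Map x in [-1, 1] to a class index in {0, ..., 255}."""
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return min(int((y + 1) / 2 * MU + 0.5), MU)

def mulaw_decode(q):
    """Approximate inverse: class index back to a value in [-1, 1]."""
    y = 2 * q / MU - 1
    return math.copysign((math.pow(1 + MU, abs(y)) - 1) / MU, y)

x_hat = mulaw_decode(mulaw_encode(0.5))  # close to 0.5, small error
```

The logarithm allocates more quantization levels near zero, where speech energy concentrates, which is why 8 bits suffice where linear quantization would not.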

  19. WaveNet: Summary
    PixelRNN
    PixelCNN
    Gated
    PixelCNN
    WaveNet
    Approx.
    Dilated convolution
    • An autoregressive generative model
    • Gated activation + conditioning
    • Stacks of dilated convolutions to capture long-term dependencies
    • Audio as a sequence of discrete values: mu-law encoding
    Gated activation
    Conditioning

  20. Recent advances
    PixelRNN
    PixelCNN
    Gated
    PixelCNN
    WaveNet
    Approx.
    Dilated convolution
    PixelCNN++
    Parallel
    WaveNet
    16-bit linear PCM
    Mixture of logistic distributions
    Fast
    WaveNet
    Tacotron 2
    Tacotron + WaveNet
    Tacotron
    Gated activation
    Conditioning
    End-to-end

  21. Recent advances
    PixelRNN
    PixelCNN
    Gated
    PixelCNN
    WaveNet
    Approx.
    Dilated convolution
    PixelCNN++
    Parallel
    WaveNet
    16-bit linear PCM
    Mixture of logistic distributions
    Fast
    WaveNet
    Tacotron 2
    Tacotron + WaveNet
    Tacotron
    Gated activation
    Conditioning
    End-to-end
    [Salimans; ‘17]

  22. [Salimans; ‘17] PixelCNN++
    • The problem with a softmax output
    – The model does not know that 127 and 126 are close
    → Mixture of logistic distributions (MoL)
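A discretized logistic mixture assigns each integer class the probability mass of a logistic CDF over its bin, so nearby classes automatically receive similar probability. A sketch over 256 classes, with the edge bins absorbing the tails so the distribution sums to one (the mixture parameters below are purely illustrative):

```python
import math

def logistic_cdf(v, mean, scale):
    return 1.0 / (1.0 + math.exp(-(v - mean) / scale))

def discretized_mol_probs(weights, means, scales, num_classes=256):
    """P(x = k) = sum_j w_j * (CDF_j(k + 0.5) - CDF_j(k - 0.5)),
    with bins 0 and num_classes-1 absorbing the left/right tails."""
    probs = []
    for k in range(num_classes):
        p = 0.0
        for w, m, s in zip(weights, means, scales):
            upper = 1.0 if k == num_classes - 1 else logistic_cdf(k + 0.5, m, s)
            lower = 0.0 if k == 0 else logistic_cdf(k - 0.5, m, s)
            p += w * (upper - lower)
        probs.append(p)
    return probs

probs = discretized_mol_probs([0.6, 0.4], [100.0, 180.0], [5.0, 20.0])
```

Because each class probability comes from a smooth CDF, moving a mean slightly shifts probability mass to neighboring classes, unlike an unordered softmax.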

  23. Recent advances
    PixelRNN
    PixelCNN
    Gated
    PixelCNN
    WaveNet
    Approx.
    Dilated convolution
    PixelCNN++
    Parallel
    WaveNet
    16-bit linear PCM
    Mixture of logistic distributions
    Fast
    WaveNet
    Tacotron 2
    Tacotron + WaveNet
    Tacotron
    Gated activation
    Conditioning
    [Shen; ’17]
    [Wang; ‘17]
    [van den Oord; ‘17]
    [Le Paine; ‘16]

  24. Development
    Conditional/unconditional WaveNet

  25. Development process
    • TODOs tracked as a GitHub issue
    • 98 comments
    • 53 issues and 8 PRs in total

  26. Progress
    • Done
    – WaveNet training and inference
    – Local conditioning and global conditioning
    – Mixture of logistic distribution output (from PixelCNN++)
    – Tacotron2 (with WaveNet) TTS demo
    • Generation is super slow though :(
    • Not yet
    – Fast WaveNet Generation Algorithm [Le Paine; ‘16]
    – Parallel WaveNet [van den Oord; ‘17]

  27. Prototyping on New Year's Eve :)

  28. Development details
    • Originally written in Python 3.6 and PyTorch v0.2
    – Latest PyTorch v0.4 is currently supported
    – Development and experiments were all done on a single machine
    (CPU: i7-7700K) with a single GPU (GTX 1080 Ti, 11 GB)

  29. Datasets I’ve worked on
    • CMU Arctic
    – 7 male/female speakers
    – 6.25 hours
    • LJSpeech
    – Single female speaker
    – 24 hours
    – Public domain
    • VCTK

  30. Samples from unconditional WaveNet
    trained on CMU Arctic
    https://github.com/r9y9/wavenet_vocoder/issues/1#issuecomment-354586299
    (Paired real and generated samples)

  31. 1. Scientists at the CERN laboratory say they have discovered a
    new particle.
    2. There’s a way to measure the acute emotional intelligence
    that has never gone out of style.
    3. President Trump met with other leaders at the Group of 20
    conference.
    4. The Senate’s bill to repeal and replace the Affordable Care Act
    is now imperiled.
    5. Generative adversarial network or variational auto-encoder.
    Samples: [Ping; ‘17] Deep Voice 3 (w/o WN) vs Tacotron 2
    (trained on LJSpeech)
    Deep Voice 3 (w/o WN), Tacotron 2 (LJSpeech), respectively.

  32. Practical Tips
    • Slicing entire audio clips into short mini-batch segments
    – A batch size of 2 with a time-slice length of 8,000 samples requires
    roughly 10 GB of GPU RAM
    • Mixture of logistic distributions output ([Salimans; ‘17]) takes roughly
    10x longer to train than a softmax output
    – 1–2 million steps to get high-quality samples
    – Exponential moving average (EMA, [van den Oord; ‘17]) works, but takes
    a long time to converge
    • Adjusting the time resolution of conditional features is crucial
    – Trainable upsampling layers work better than fixed feature duplication
    ([Tamamori; ‘17])
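The EMA trick keeps a shadow copy of every parameter, updated as shadow ← decay · shadow + (1 − decay) · param after each training step, and uses the shadow weights at inference time. A framework-agnostic sketch (the class and attribute names are mine, not from any particular library):

```python
class ExponentialMovingAverage:
    """Maintain shadow copies of parameters for inference."""

    def __init__(self, params, decay=0.9999):
        self.decay = decay
        self.shadow = dict(params)  # name -> value

    def update(self, params):
        # shadow <- decay * shadow + (1 - decay) * param
        d = self.decay
        for name, value in params.items():
            self.shadow[name] = d * self.shadow[name] + (1.0 - d) * value

# Toy example with a single scalar "parameter" and a small decay.
ema = ExponentialMovingAverage({"w": 0.0}, decay=0.9)
ema.update({"w": 1.0})  # shadow: 0.1
ema.update({"w": 1.0})  # shadow: 0.19
```

With a decay close to 1 (e.g. 0.9999) the shadow weights average over many recent steps, which is also why they need a long time to track large parameter changes.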

  33. Open-source development: Pros. and Cons.
    • Pros :)
    – Open discussion
    – Some people fixed bugs for me
    – Some built work on top of mine (e.g., Tacotron 2, Parallel
    WaveNet, etc.)
    – Job offers and invited talks
    • Cons :(
    – Too many questions

  34. Summary
    • WaveNet: An autoregressive model for raw audio
    – Extended work from Gated PixelCNN and PixelCNN++
    – Stacks of dilated convolutions to capture long-term dependencies
    – Mu-law quantize + softmax or mixture of logistic distributions to model
    raw audio
    • Motivation: does it really work well?
    – Yes!
    – Super slow, but quality is amazing
    – WaveNet can be trained even with a single GPU
    • 100k ~ 200k steps (~2 days) w/ softmax (batch size 2)
    • 1000k ~2000k steps (~2 weeks) w/ MoL (batch size 2)
    • Open-source development is fun :)

  35. Pointers
    • Code: https://github.com/r9y9/wavenet_vocoder
    • Samples: https://r9y9.github.io/wavenet_vocoder/
    • Online TTS demo: https://github.com/r9y9/Colaboratory

  36. 1/2 References
    • [DeepMind; ‘16] WaveNet: A Generative Model for Raw Audio: https://deepmind.com/blog/wavenet-generative-
    model-raw-audio/
    • [van den Oord; ’16a] Aaron van den Oord, Nal Kalchbrenner, Koray Kavukcuoglu, “Pixel Recurrent Neural
    Networks ”. ICML 2016.
    • [van den Oord; ’16b] Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, et al, “Conditional Image Generation
    with PixelCNN Decoders”, NIPS 2016.
    • [van den Oord; ’16c] Aaron van den Oord, Sander Dieleman, Heiga Zen, et al, "WaveNet: A Generative Model for
    Raw Audio", arXiv:1609.03499, Sep 2016.
    • [Salimans; ‘17] Tim Salimans, Andrej Karpathy, Xi Chen, et al, “PixelCNN++: Improving the PixelCNN with
    Discretized Logistic Mixture Likelihood and Other Modifications”, Jan 2017.
    • [van den Oord; ‘17] Aaron van den Oord, Yazhe Li, Igor Babuschkin, et al, "Parallel WaveNet: Fast High-Fidelity
    Speech Synthesis", arXiv:1711.10433, Nov 2017.
    • [Wang; ‘17] Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, et al, “Tacotron: Towards End-to-End Speech
    Synthesis”, arXiv:1703.10135, Apr. 2017.
    • [Shen; ’17] Jonathan Shen, Ruoming Pang, Ron J. Weiss, et al, "Natural TTS Synthesis by Conditioning WaveNet
    on Mel Spectrogram Predictions", arXiv:1712.05884, Dec 2017.

  37. 2/2 References
    • [Le Paine; ‘16] Tom Le Paine, Pooya Khorrami, Shiyu Chang, et al, “Fast Wavenet Generation Algorithm”,
    arXiv:1611.09482, Nov. 2016.
    • [Ping; ‘17] Wei Ping, Kainan Peng, Andrew Gibiansky, et al, “Deep Voice 3: Scaling Text-to-Speech with
    Convolutional Sequence Learning”, arXiv:1710.07654, Oct. 2017.
    • [Tamamori; ‘17] Akira Tamamori, Tomoki Hayashi, Kazuhiro Kobayashi, Kazuya Takeda, et al, “Speaker-
    dependent WaveNet vocoder”, Proc. of INTERSPEECH 2017.
    • CMU ARCTIC speech synthesis database: http://festvox.org/cmu_arctic/
    • The LJ Speech Dataset : https://keithito.com/LJ-Speech-Dataset/

  38. Open-source implementations on Github
    • https://github.com/ibab/tensorflow-wavenet
    – A TensorFlow implementation. Most starred WaveNet repository on GitHub (★3,571 as of 2018/05/14)
    • https://github.com/tensorflow/magenta/tree/master/magenta/models/nsynth/wavenet
    – An implementation from the TensorFlow community
    • https://github.com/kan-bayashi/PytorchWaveNetVocoder
    – By the author of [Tamamori; ‘17] from Nagoya University
    • https://github.com/tomlepaine/fast-wavenet
    – By the author of [Le Paine; ‘16].
    • https://github.com/vincentherrmann/pytorch-wavenet
    – A PyTorch implementation
    • https://github.com/dhpollack/fast-wavenet.pytorch
    – A faster WaveNet implementation
    • https://github.com/basveeling/wavenet
    – Keras implementation
    • https://github.com/musyoku/wavenet
    – Chainer implementation