Slide 1

Reformer Text-To-Speech

Mateusz Olko / mo382777
Tomasz Miśków / tm385898
Krzysztof Kowalczyk / kk385830

Slide 2

Our project: TTS for Trump’s voice

Slide 3

Modern Text-To-Speech (TTS) systems

Text ("Hello World!") → spectrogram generation (Reformer) → sound wave synthesis (SqueezeWave)

https://user-images.githubusercontent.com/22403532/72682430-e2207380-3ac4-11ea-9a20-88569af7a47a.png

Slide 4

Transformer architecture: a quick introduction

● Well known
● Widely used, mostly for NLP tasks
● Can be applied to any seq2seq task
● A few successful attempts to use it in TTS have already been made

https://www.researchgate.net/publication/323904682/figure/fig1/AS:606458626465792@1521602412057/The-Transformer-model-architecture.png

Slide 5

Attention Is All You Need

More extensive explanation: http://jalammar.github.io/illustrated-transformer/

Slide 6

Reformer: a new solution for memory issues

The Transformer requires a lot of memory, especially for long sequences: the attention matrix size is the sequence length squared (see the sketch below).

To address this problem, the authors of the Reformer architecture use, among other tricks, two main components:
● Locality-Sensitive Hashing (LSH) attention
● Reversible layers
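
A minimal sketch (not from the slides) of that quadratic cost: the attention score matrix alone has seq_len × seq_len entries per head.

```python
import torch

seq_len, d_model = 4096, 512
q = torch.randn(seq_len, d_model)
k = torch.randn(seq_len, d_model)

# Scaled dot-product attention scores: one [seq_len, seq_len] matrix per head.
scores = q @ k.T / d_model ** 0.5
print(scores.shape)                         # torch.Size([4096, 4096])
print(scores.numel() * 4 / 2 ** 20, "MiB")  # 64.0 MiB in float32, for a single head
```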

Slide 7

Locality-Sensitive Hashing (LSH) attention

More extensive explanation: https://towardsdatascience.com/illustrating-the-reformer-393575ac6ba0
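
A hedged sketch of the angular LSH hashing idea behind LSH attention (an illustration of the concept, not the project's implementation): vectors pointing in similar directions tend to land in the same bucket, and attention is then computed only within chunks of each bucket.

```python
import torch

def lsh_buckets(x: torch.Tensor, n_buckets: int, seed: int = 0) -> torch.Tensor:
    """Assign each row of x (shape [seq_len, dim]) to one of n_buckets (n_buckets even)."""
    torch.manual_seed(seed)
    # Random projection; argmax over [xR, -xR] gives an angular hash bucket.
    r = torch.randn(x.shape[-1], n_buckets // 2)
    rotated = x @ r
    return torch.cat([rotated, -rotated], dim=-1).argmax(dim=-1)

x = torch.randn(1024, 64)
buckets = lsh_buckets(x, n_buckets=32)
# Sorting by bucket id groups similar vectors together, so attention can be
# computed within fixed-size chunks instead of over the full sequence.
order = buckets.argsort()
```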

Slide 8

Reversible layers

No need to store activations for backpropagation: they can easily be re-computed during the backward pass.

More extensive explanation: https://towardsdatascience.com/illustrating-the-reformer-393575ac6ba0
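
A minimal sketch of a RevNet-style reversible residual block of the kind the Reformer uses (illustrative, not the project's code): the inputs can be reconstructed exactly from the outputs, so intermediate activations don't have to be kept in memory.

```python
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """y1 = x1 + F(x2); y2 = x2 + G(y1). Both directions are exact."""

    def __init__(self, f: nn.Module, g: nn.Module):
        super().__init__()
        self.f, self.g = f, g

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        # Recompute the inputs from the outputs during the backward pass
        # instead of keeping the activations in memory.
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2
```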

Slide 9

Flow-based generative models

https://lilianweng.github.io/lil-log/2018/10/13/flow-based-deep-generative-models.html
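
A hedged sketch of an affine coupling layer, the invertible building block that flow-based models such as WaveGlow and SqueezeWave stack (simplified: real vocoders use convolutional nets and invertible 1x1 convolutions): the forward pass is easy to invert, and its Jacobian log-determinant is just the sum of the predicted log-scales.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Splits the input in half, transforms one half conditioned on the other. dim must be even."""

    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim // 2, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),  # predicts log-scale and shift for the other half
        )

    def forward(self, x):
        xa, xb = x.chunk(2, dim=-1)
        log_s, t = self.net(xa).chunk(2, dim=-1)
        yb = xb * torch.exp(log_s) + t
        log_det = log_s.sum(dim=-1)  # contribution to the flow's log-likelihood
        return torch.cat([xa, yb], dim=-1), log_det

    def inverse(self, y):
        ya, yb = y.chunk(2, dim=-1)
        log_s, t = self.net(ya).chunk(2, dim=-1)
        xb = (yb - t) * torch.exp(-log_s)
        return torch.cat([ya, xb], dim=-1)
```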

Slide 10

WaveGlow

A flow-based generative network for speech synthesis
https://arxiv.org/abs/1811.00002

Slide 11

SqueezeWave

A lighter and faster alternative:
● Reshape: upsample channels, reduce temporal length
● Use depthwise separable convolution layers (sketched below)
● 60x-70x faster than WaveGlow (with a 10% reduction in audio quality)

https://arxiv.org/abs/1811.00002
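
A minimal sketch of a depthwise separable 1D convolution, the layer type SqueezeWave substitutes for ordinary convolutions to cut parameters and compute (illustrative, not the repository's exact module): a per-channel depthwise convolution followed by a 1x1 pointwise convolution.

```python
import torch.nn as nn

class DepthwiseSeparableConv1d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        # groups=in_ch makes the first convolution operate on each channel independently.
        self.depthwise = nn.Conv1d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        # The 1x1 pointwise convolution then mixes information across channels.
        self.pointwise = nn.Conv1d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):  # x: [batch, channels, time]
        return self.pointwise(self.depthwise(x))
```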

Slide 12

Implementation & experiments

Slide 13

Preprocessing data for experiments

Trump Speech Dataset
● Audio and transcripts scraped from rev.com
● >12 hours of audio
● Sample length from 1 s to 45 s (clipped)
● 1 speaker (others were filtered out)

LJ Speech Dataset
● The most common open dataset for TTS
● >25 hours of audio extracted from books
● Sample length from 1 s to 10 s
● 1 speaker

Common preprocessing and format-conversion logic.
Output per sample: id, text, phonemes, audio file, spectrogram file (see the sketch below).
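
A hypothetical sketch of the per-sample record produced by the shared preprocessing step (field names are illustrative; only the listed output columns come from the slide):

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class PreprocessedSample:
    id: str
    text: str
    phonemes: str
    audio_path: Path        # clipped audio: 1-45 s (Trump), 1-10 s (LJ Speech)
    spectrogram_path: Path  # precomputed spectrogram consumed by the model
```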

Slide 14

Implementing Reformer-TTS

A synergy of two papers:
● Reformer: The Efficient Transformer (https://arxiv.org/abs/2001.04451)
● Neural Speech Synthesis with Transformer Network (https://arxiv.org/abs/1809.08895)

Written in PyTorch based on:
● The original implementation by Google (in trax)
● Modules from the reformer_pytorch repository (LSH attention from reformer_pytorch)
● Experimented with the HuggingFace Transformers implementation

Architecture (from the slide diagram): text (phonemes) passes through an EncoderPreNet and scaled positional encoding into N reversible blocks of LSH self-attention and feed-forward layers; the mel spectrogram passes through a DecoderPreNet and scaled positional encoding into N reversible blocks of LSH self-attention, multi-head attention over the encoder output, and feed-forward layers; MelLinear and StopLinear heads predict the mel spectrogram and the stop token, and a PostNet refines the mel output. Losses: MSE on the raw and PostNet mel outputs, BCE on the stop token (a sketch of the combined loss follows).
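
The slide's loss annotations (two MSE terms and a BCE term) can be summarized in a short, hedged sketch; the weights and function names below are illustrative, not the project's exact code:

```python
import torch.nn.functional as F

def reformer_tts_loss(mel_raw, mel_post, stop_logits, mel_target, stop_target,
                      w_raw=1.0, w_post=1.0, w_stop=1.0):
    """Weighted sum of the three losses from the diagram (weights are illustrative)."""
    loss_raw = F.mse_loss(mel_raw, mel_target)    # MSE on the pre-PostNet mel output
    loss_post = F.mse_loss(mel_post, mel_target)  # MSE on the PostNet-refined output
    # stop_target: 0.0 before the stop frame, 1.0 at and after it
    loss_stop = F.binary_cross_entropy_with_logits(stop_logits, stop_target)
    return w_raw * loss_raw + w_post * loss_post + w_stop * loss_stop
```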

Slide 15

Getting the Reformer to work correctly

● Problem: attention focused on zero-padded regions at the end of phoneme and spectrogram sequences
  Fix: added input masks after the ground-truth stop-token locations
● Problem: train loss plots followed similar patterns across all experiments
  Fix: discovered that the PyTorch DataLoader doesn't shuffle training data by default and overrode it
● Problem: training tended to focus on the raw prediction and stop-token losses, marginalizing the PostNet loss
  Fix: added loss weights, and loss masks after the ground-truth stop-token locations, during loss calculation

A sketch of the shuffling and masking fixes follows.
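
A hedged sketch of two of the fixes listed above: turning on DataLoader shuffling explicitly (it is off by default in PyTorch) and masking loss contributions past each sample's true length (names such as masked_mse are illustrative):

```python
import torch

# shuffle=False is the PyTorch DataLoader default; the fix was to enable it explicitly
# (train_dataset and the batch size are assumed / illustrative):
# train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=20, shuffle=True)

def masked_mse(pred, target, lengths):
    """MSE over [batch, time, n_mels] that ignores frames past each sample's true length."""
    time_idx = torch.arange(pred.shape[1], device=pred.device)
    mask = (time_idx[None, :] < lengths[:, None]).unsqueeze(-1)  # [batch, time, 1]
    diff = (pred - target) ** 2 * mask
    return diff.sum() / (mask.sum() * pred.shape[-1])
```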

Slide 16

Summary: experiments we performed

● dropouts: postnet, attention
● batch size: 20, 60, 80, 120
● depth: 2, 3, 6
● bucket size: 64, 128, 256
● LSH attention implementation: reformer_pytorch, HuggingFace
● loss weights: stop, raw, post
● loss types: MSE, L1
● learning rate: 10⁻⁶ to 10⁻³
● learning rate scheduling
● weight decay: 10⁻⁹ to 10⁻⁴
● gradient clipping
● learning rate warmup
● augmentation: Gaussian noise
● inference strategy: concat, replace

Slide 17

Adding dropout to improve attention learning

Slide 18

Adding regularization to prevent overfitting

(Plots: first experiment, May 15th vs. last baseline, June 15th)

Slide 19

Conclusion: we probably need to train longer

The longest we could run training without overfitting was around 3 days, but it wasn't enough. We should apply more regularization to reduce overfitting and push that time to around a week.

Slide 20

What we learned: useful tips and tricks

● Don't trust research papers that don't come with an original implementation
● Most cloud providers were disappointing: too slow in comparison to entropy (the cluster we used)
● Watch out for bad defaults, e.g. DataLoader shuffling
● Use batch size aggregation (gradient accumulation) to train models on smaller devices (see the sketch below)
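
A minimal sketch of the batch size aggregation (gradient accumulation) trick from the last bullet; the loss function is illustrative, and model, optimizer and train_loader are passed in as assumptions:

```python
import torch

def train_with_accumulation(model, optimizer, train_loader, accumulation_steps=4):
    """Emulate a batch accumulation_steps times larger by delaying the optimizer step."""
    model.train()
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(train_loader):
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        (loss / accumulation_steps).backward()  # gradients accumulate across calls
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()                    # one optimizer step per aggregated batch
            optimizer.zero_grad()
```

PyTorch Lightning exposes the same behaviour through the Trainer's accumulate_grad_batches argument.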

Slide 21

Re-implementing the SqueezeWave vocoder

(Diagram: code lineage across Tacotron, Tacotron2 (improved version by NVIDIA), NV-WaveNet, WaveGlow, SqueezeWave (original), and SqueezeWave (our implementation), with edges labelled "Git submodule for audio processing", "Copied code for audio processing", and "Copied (almost) all code".)

SqueezeWave (original): unreadable code targeting legacy PyTorch, no CLI, no package, broken config values, does not run on CPU out of the box.

SqueezeWave (our implementation): written in modern PyTorch (1.5.0), uses torchaudio for audio processing with outputs matching tacotron2, works on CPU and GPU, published as an importable package.

Slide 22

Matching results with the original implementation

Total training time: 3d 16h 13min (305 epochs initially + 987 epochs resumed later)
Samples: http://students.mimuw.edu.pl/~kk385830/gsn-presentation/

Slide 23

What we learned: tips and tricks

● Test your data format
  ○ Early on, we discovered that the 'mel spectrograms' used in tacotron2, waveglow and squeezewave are not actually in the mel scale (a square root is missing in the implementation)
● Flow-based models are nearly impossible to overfit
● Watch out for numerical issues
  ○ We fixed errors caused by subtracting large values in the loss calculation by switching to float64 values (illustrated below)
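
A generic illustration of that float64 fix (not the project's loss code): subtracting nearly equal large values loses precision in float32 but not in float64.

```python
import torch

a = torch.tensor([16_777_216.0], dtype=torch.float32)  # 2**24
print(a + 1 - a)                     # tensor([0.]) - the +1 is lost in float32
print(a.double() + 1 - a.double())   # tensor([1.], dtype=torch.float64)
```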

Slide 24

Summary of the whole project

Slide 25

KEY STATS

● 3612 lines of Python code
● 103 git commits, incl. 39 PRs
● 55 h on Google Meet
● 106 experiments
● 162 GPU-days, equivalent to 4500 USD on AWS

Slide 26

Don't start projects that are way too ambitious

What we wanted to do:
● Improve a state-of-the-art model for text-to-speech
● Train it on a new, custom dataset

What we could've done better:
● Focus on reproduction instead of customization or optimization

What we actually did:
● Implemented a nicer version of two state-of-the-art models
● Trained them on a popular dataset
● Learned many techniques for optimizing model size and the learning process
● Proved that using the Reformer for speech is harder than one may think
● Released everything as a readable and easily usable Python package

Slide 27

PyTorch Lightning is awesome
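
A generic, minimal LightningModule sketch (not the project's actual module) showing why Lightning was pleasant to work with: the training loop, device placement, checkpointing and logger integration are handled by the Trainer.

```python
import pytorch_lightning as pl
import torch
import torch.nn.functional as F

class LitTTS(pl.LightningModule):
    """Wraps an arbitrary model; the Trainer runs the loop, logging and checkpointing."""

    def __init__(self, model):
        super().__init__()
        self.model = model

    def training_step(self, batch, batch_idx):
        inputs, targets = batch
        return F.mse_loss(self.model(inputs), targets)  # returning the loss is all Lightning needs

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-4)

# my_model and train_loader are assumed to exist:
# trainer = pl.Trainer(max_epochs=10, accumulate_grad_batches=4)
# trainer.fit(LitTTS(my_model), train_loader)
```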

Slide 28

Research log and experiment tracking

Our workflow:
● All code is reviewed before being merged to master
● Checkpoint meeting every 1-3 days
  ○ Go through all experiments
  ○ Plan and divide new tasks
  ○ Summarize everything in a research log
● All experiments logged to Neptune
  ○ Name, description, tags
  ○ All metrics from PyTorch Lightning
  ○ Visualizations and audio samples

Slide 29

How we organized code, config and data

Code
● Git + GitHub
● Python 3.8
● Installable package
● Single entrypoint to run anything (cli.py)

Config
● Defined as a dataclass in the same module where it is used
● Defaults are defined for all possible keys
● Values can be changed using YAML files (see the sketch below)

Data
● DVC (Data Version Control)
● Remote on GCP
● Pipeline for preprocessing every dataset
● Checkpoints added manually
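
A hedged sketch of the config pattern described above (class and field names are illustrative, not the project's actual config): defaults live in a dataclass next to the code that uses them, and a YAML file can override selected values.

```python
from dataclasses import dataclass, replace
import yaml

@dataclass
class TrainingConfig:
    batch_size: int = 20
    learning_rate: float = 1e-4
    gradient_clipping: float = 1.0

def load_config(path: str) -> TrainingConfig:
    """Start from the defaults and override only the keys present in the YAML file."""
    with open(path) as f:
        overrides = yaml.safe_load(f) or {}
    return replace(TrainingConfig(), **overrides)
```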

Slide 30

GitHub: https://github.com/kowaalczyk/reformer-tts

Mateusz Olko: https://www.linkedin.com/in/mateusz-olko/
Tomasz Miśków: https://www.linkedin.com/in/tmiskow/
Krzysztof Kowalczyk: https://www.linkedin.com/in/kowaalczyk/