
Paper Reading: Accelerating Training of Transformer Based Language Models with Progressive Layer Dropping

Thang
September 11, 2021

In these slides, I give an overview of the efficient NLP research direction and explain why it is important and worth considering.
After that, there is a reading of the paper "Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping", published at NeurIPS 2020.

Transcript

  1. Research field: Accelerating the training process of NLP models.

    Why is this research field important? Training a GPT-3 model with 175 billion parameters would take 36 years on eight V100 GPUs, or seven months with 512 V100 GPUs. https://arxiv.org/pdf/2104.04473.pdf
  2. Some directions to solve this problem

    1. Efficient Pre-Training and Fine-Tuning:
       a. Multimodal pre-trained models (e.g., text-speech) -> avoiding task-specific fine-tuning of pre-trained models (T5: https://arxiv.org/abs/1910.10683)
       b. New efficient architectures for pre-trained models (ELECTRA: https://arxiv.org/abs/2003.10555)
    2. Model Compression:
       a. Quantization (survey: https://arxiv.org/pdf/2103.13630.pdf)
       b. Pruning (https://jacobgil.github.io/deeplearning/pruning-deep-learning)
       c. Layer decomposition and knowledge distillation (https://www.ijcai.org/proceedings/2018/0384.pdf)
    3. Efficient Training:
       a. Improving the optimizer for faster training (LAMB: https://openreview.net/forum?id=Syx4wnEtvH)
       b. Distributed training (https://arxiv.org/abs/2104.05588)
    4. Data pre-processing: data augmentation
    5. Efficient Transformers (survey: https://arxiv.org/pdf/2009.06732.pdf)
  3. Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping

    -> Speed up pre-training of Transformer networks by exploring architectural changes and training techniques, not at the cost of excessive hardware resources. Why? There is redundant computation in training -> try to remove or reduce it.
    • The paper proposes a new architectural unit, called the Switchable-Transformer (ST) block.
    • The paper proposes a progressive schedule that adds extra stability when pre-training Transformer networks with layer dropping (see the config sketch below).
    https://www.deepspeed.ai/tutorials/progressive_layer_dropping/
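    A rough sketch of how this is exposed in practice, based on the DeepSpeed tutorial linked above: progressive layer dropping is switched on through the engine config. The key names follow that tutorial; the theta/gamma values below are illustrative assumptions, not the paper's tuned settings.

    # Hedged sketch: enabling progressive layer dropping via a DeepSpeed config dict.
    ds_config = {
        "train_batch_size": 4096,          # matches the 4K batch size reported later
        "progressive_layer_drop": {
            "enabled": True,
            "theta": 0.5,                  # floor of the keep-probability schedule (assumed)
            "gamma": 0.001,                # how fast the drop rate ramps up (assumed)
        },
    }
    # The dict would then be passed to deepspeed.initialize(..., config=ds_config)
    # together with the BERT model and optimizer (not shown here).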
  4. Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping

    Layer Normalization: https://arxiv.org/pdf/1607.06450.pdf
    Self-attention: https://theaisummer.com/self-attention/
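    For reference, the two building blocks linked above can be written down in a few lines of PyTorch. This is a minimal single-head sketch for illustration, not the implementation used in the paper.

    import math
    import torch
    import torch.nn as nn

    class SelfAttention(nn.Module):
        """Minimal single-head scaled dot-product self-attention."""
        def __init__(self, d_model):
            super().__init__()
            self.q = nn.Linear(d_model, d_model)
            self.k = nn.Linear(d_model, d_model)
            self.v = nn.Linear(d_model, d_model)

        def forward(self, x):                       # x: (batch, seq_len, d_model)
            q, k, v = self.q(x), self.k(x), self.v(x)
            scores = q @ k.transpose(-2, -1) / math.sqrt(x.size(-1))
            return torch.softmax(scores, dim=-1) @ v

    x = torch.randn(2, 16, 64)
    attn_out = SelfAttention(64)(x)
    out = nn.LayerNorm(64)(x + attn_out)            # residual connection + layer normalization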
  5. Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping

    Identity mapping reordering: placing the layer normalization only on the input stream of the sublayers.
    PostLN: BERT model with layer normalization applied after the addition in Transformer blocks.
    PreLN: BERT model with identity mapping reordering.
    Figure 1 shows that PostLN suffers from unbalanced gradients (vanishing gradients as the layer ID decreases), while PreLN eliminates the unbalanced gradients. A code sketch of the two orderings follows below.
    Identity mapping was first defined in: He K, Zhang X, Ren S, Sun J. Identity Mappings in Deep Residual Networks. European Conference on Computer Vision, 2016, pp. 630-645.
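    The difference between the two orderings can be made concrete with a small sketch, where sublayer stands in for either self-attention or the feed-forward network. This is a simplification, not the paper's exact code.

    import torch
    import torch.nn as nn

    def post_ln_block(x, sublayer, norm):
        # PostLN: normalization applied after the residual addition (original BERT).
        return norm(x + sublayer(x))

    def pre_ln_block(x, sublayer, norm):
        # PreLN: identity mapping reordering, normalization only on the sublayer input,
        # so the residual path stays a clean identity mapping.
        return x + sublayer(norm(x))

    x = torch.randn(2, 16, 64)
    ffn, norm = nn.Linear(64, 64), nn.LayerNorm(64)  # toy stand-ins for the real sublayers
    y_post = post_ln_block(x, ffn, norm)
    y_pre = pre_ln_block(x, ffn, norm)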
  6. Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping

    Switchable gates: each sub-layer includes a gate G ∈ {0, 1} that controls whether the sub-layer is disabled or not during training (see the sketch below).
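    A rough sketch of how such a gate can act on one sub-layer. The 1/keep_prob scaling mirrors stochastic-depth style training and is an assumption for illustration, not the authors' exact code.

    import torch
    import torch.nn as nn

    def switchable_sublayer(x, sublayer, norm, keep_prob, training=True):
        # G = 1 keeps the sub-layer, G = 0 drops it and leaves only the identity path.
        if training:
            gate = torch.bernoulli(torch.tensor(keep_prob))
            if gate == 0:
                return x
            # divide by keep_prob so the expected output matches inference behaviour
            return x + sublayer(norm(x)) / keep_prob
        return x + sublayer(norm(x))

    x = torch.randn(2, 16, 64)
    out = switchable_sublayer(x, nn.Linear(64, 64), nn.LayerNorm(64), keep_prob=0.9)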
  7. Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping

    They propose a progressive schedule θ(t), a temporal schedule for the expected fraction of Switchable-Transformer blocks that are kept active. The schedule smoothly increases the layer dropping rate for each mini-batch as training evolves by adapting over time the parameter of the Bernoulli distribution used for sampling (a sketch follows below).
    Curriculum Learning: https://dl.acm.org/doi/pdf/10.1145/1553374.1553380
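    A sketch of what such a schedule can look like: the expected keep probability starts at 1 (all ST blocks active) and decays smoothly toward a floor, so the drop rate 1 - θ(t) grows over training. The functional form and the constants below (theta_bar, gamma) are illustrative assumptions; the exact schedule is given in the paper.

    import math

    def progressive_theta(step, theta_bar=0.5, gamma=0.01):
        # theta_bar: final expected keep ratio; gamma: decay speed (both assumed values).
        return (1.0 - theta_bar) * math.exp(-gamma * step) + theta_bar

    # theta(t) parameterizes the Bernoulli gates of the ST blocks, e.g. it could be
    # fed as keep_prob into the switchable_sublayer sketch from the previous slide.
    for t in (0, 100, 1000):
        print(t, round(progressive_theta(t), 3))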
  8. Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping

    Evaluation:
    Model: BERT
    Datasets: English Wikipedia corpus and BookCorpus, with 2.8B word tokens
    Framework: Hugging Face PyTorch
    Hardware: 4x DGX-2 boxes with 64x V100 GPUs
    Learning rate: lr = 1e-4
    Batch size: 4K