
Paper Reading: Accelerating Training of Transformer Based Language Models with Progressive Layer Dropping

Thang
September 11, 2021


In these slides, I give an overview of the efficient NLP research direction and explain why it is important and worth considering.
After that, there is a paper reading section on "Accelerating Training of Transformer Based Language Models with Progressive Layer Dropping", published at NeurIPS 2020.


Transcript

  1. Paper Reading: Accelerating Training of Transformer-Based Language Models with Progressive

    Layer Dropping, NeurIPS 2020. Dang Thang Email: thangddnt@gmail.com Date: 11-09-2021
  2. Research field: accelerating the training process of NLP models. Why is

    this research field important? Training a GPT-3 model with 175 billion parameters would take 36 years on eight V100 GPUs, or seven months with 512 V100 GPUs. 2 https://arxiv.org/pdf/2104.04473.pdf
  3. Some directions to solve this problem:

    1. Efficient Pre-Training and Fine-Tuning:
       a. Multimodal pre-trained (e.g., text-speech) models -> avoiding task-specific fine-tuning of pre-trained models (T5: https://arxiv.org/abs/1910.10683)
       b. New efficient architectures for pre-trained models (ELECTRA: https://arxiv.org/abs/2003.10555)
    2. Model Compression:
       a. Quantization (a survey: https://arxiv.org/pdf/2103.13630.pdf)
       b. Pruning (https://jacobgil.github.io/deeplearning/pruning-deep-learning)
       c. Layer decomposition and knowledge distillation (https://www.ijcai.org/proceedings/2018/0384.pdf)
    3. Efficient Training:
       a. Improving the optimizer for faster training (LAMB: https://openreview.net/forum?id=Syx4wnEtvH)
       b. Distributed training (https://arxiv.org/abs/2104.05588)
    4. Data pre-processing: data augmentation
    5. Efficient Transformers (survey: https://arxiv.org/pdf/2009.06732.pdf) 3
  4. Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping

    -> Speed up pre-training of Transformer networks by exploring architectural changes and training techniques, not at the cost of excessive hardware resources. Why? There is redundant computation in training -> try to remove or reduce it. • The paper proposes a new architectural unit, called the Switchable-Transformer (ST) block. • The paper proposes a progressive schedule that adds extra stability when pre-training Transformer networks with layer dropping. https://www.deepspeed.ai/tutorials/progressive_layer_dropping/ 4
  5. Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping

    5 Layer Normalization: https://arxiv.org/pdf/1607.06450.pdf Self-attention: https://theaisummer.com/self-attention/
  6. Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping

    Identity mapping reordering: placing the layer normalization only on the input stream of the sub-layers. PostLN: BERT model with layer normalization applied after the residual addition in Transformer blocks. PreLN: BERT model with identity mapping reordering. Figure 1 shows that PostLN suffers from unbalanced gradients (vanishing gradients as the layer ID decreases), while PreLN eliminates the unbalanced gradients; a minimal sketch of the two orderings follows below. 6 Identity mapping was first introduced in: He K, Zhang X, Ren S, Sun J. Identity mappings in deep residual networks. In European Conference on Computer Vision 2016 (pp. 630-645).
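    The two orderings can be written compactly. Below is a minimal PyTorch sketch (illustrative, not the paper's code): a PostLN sub-layer applies LayerNorm after the residual addition, while a PreLN sub-layer applies LayerNorm only on the input stream; "sublayer" stands for either self-attention or the feed-forward network.

    import torch.nn as nn

    class PostLNSublayer(nn.Module):
        """PostLN: LayerNorm(x + Sublayer(x)) -- the original BERT ordering."""
        def __init__(self, d_model, sublayer):
            super().__init__()
            self.sublayer = sublayer          # self-attention or feed-forward
            self.norm = nn.LayerNorm(d_model)

        def forward(self, x):
            return self.norm(x + self.sublayer(x))

    class PreLNSublayer(nn.Module):
        """PreLN (identity mapping reordering): x + Sublayer(LayerNorm(x))."""
        def __init__(self, d_model, sublayer):
            super().__init__()
            self.sublayer = sublayer
            self.norm = nn.LayerNorm(d_model)

        def forward(self, x):
            # The residual path stays a pure identity, which keeps gradients
            # balanced across layers (cf. Figure 1 in the paper).
            return x + self.sublayer(self.norm(x))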
  7. Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping

    Switchable gates: each ST block includes a gate G ∈ {0, 1} for each of its sub-layers, which controls whether that sub-layer is disabled or not during training (a sketch follows below). 7
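    As a concrete illustration, here is a minimal PyTorch sketch of one gated sub-layer of an ST block (class and argument names are my own, not the paper's). It assumes the PreLN ordering from the previous slide and the stochastic-depth convention of rescaling the surviving sub-layer output by 1/p at training time.

    import torch
    import torch.nn as nn

    class SwitchableSublayer(nn.Module):
        """One gated (PreLN) sub-layer of a Switchable-Transformer block."""
        def __init__(self, d_model, sublayer):
            super().__init__()
            self.sublayer = sublayer          # self-attention or feed-forward
            self.norm = nn.LayerNorm(d_model)

        def forward(self, x, keep_prob=1.0):
            if self.training:
                # Sample the gate G in {0, 1} from a Bernoulli distribution.
                gate = torch.bernoulli(torch.tensor(keep_prob, device=x.device))
                if gate == 0:
                    return x                  # sub-layer switched off: identity path only
                # Rescale so the expected output matches inference (assumption:
                # stochastic-depth style scaling, not necessarily the paper's exact form).
                return x + self.sublayer(self.norm(x)) / keep_prob
            return x + self.sublayer(self.norm(x))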
  8. Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping

    They propose a progressive schedule θ(t): a temporal schedule for the expected fraction of Switchable-Transformer blocks that are kept: 8 Curriculum Learning: https://dl.acm.org/doi/pdf/10.1145/1553374.1553380 The schedule smoothly increases the layer-dropping rate for each mini-batch as training evolves, by adapting over time the parameter of the Bernoulli distribution used for sampling the gates; a sketch of such a schedule follows below.
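    The following is a minimal sketch of such a schedule. The exponential decay toward a final keep ratio theta_bar, the hyperparameter names (theta_bar, gamma), and the stochastic-depth-style per-layer scaling are assumptions used for illustration, not the paper's exact constants.

    import math

    def theta(t, theta_bar=0.5, gamma=0.001):
        """Expected global keep ratio at step t: decays smoothly from 1.0 toward theta_bar."""
        return (1.0 - theta_bar) * math.exp(-gamma * t) + theta_bar

    def keep_prob(layer_id, num_layers, t, theta_bar=0.5, gamma=0.001):
        """Per-layer keep probability: lower layers are dropped less often (as in
        stochastic depth), while the global ratio follows theta(t)."""
        return 1.0 - (layer_id / num_layers) * (1.0 - theta(t, theta_bar, gamma))

    # Example: feed the per-step, per-layer probability into each gated sub-layer,
    # e.g. block(x, keep_prob=keep_prob(layer_id=11, num_layers=12, t=10_000))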
  9. Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping

    Evaluation: Model: BERT. Datasets: English Wikipedia corpus and BookCorpus with 2.8B word tokens. Framework: Hugging Face PyTorch. Hardware: 4x DGX-2 boxes with 64 V100 GPUs. Learning rate: lr=1e-4. Batch size: 4K. 9
  10. Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping

    Evaluation 10
  11. END THANK YOU! 11