
Paper Reading: Accelerating Training of Transformer Based Language Models with Progressive Layer Dropping

Thang
September 11, 2021

In these slides, I give an overview of the efficient NLP research direction and explain why it is important and worth considering.
After that, there is a reading of the paper "Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping", published at NeurIPS 2020.

Transcript

  1. Research field: Accelerating the training process of NLP models.

    Why is this research field important? Training a GPT-3 model with 175 billion parameters would take 36 years on eight V100 GPUs, or seven months with 512 V100 GPUs. https://arxiv.org/pdf/2104.04473.pdf
  2. Some directions to solve this problem

    1. Efficient Pre-Training and Fine-Tuning:
       a. Multimodal pre-trained models (e.g., text-speech) -> avoiding task-specific fine-tuning of pre-trained models (T5: https://arxiv.org/abs/1910.10683)
       b. New efficient architectures for pre-trained models (ELECTRA: https://arxiv.org/abs/2003.10555)
    2. Model Compression:
       a. Quantization (survey: https://arxiv.org/pdf/2103.13630.pdf)
       b. Pruning (https://jacobgil.github.io/deeplearning/pruning-deep-learning)
       c. Layer decomposition and knowledge distillation (https://www.ijcai.org/proceedings/2018/0384.pdf)
    3. Efficient Training:
       a. Improving the optimizer for faster training (LAMB: https://openreview.net/forum?id=Syx4wnEtvH)
       b. Distributed training (https://arxiv.org/abs/2104.05588)
    4. Data pre-processing: data augmentation
    5. Efficient Transformers (survey: https://arxiv.org/pdf/2009.06732.pdf)
  3. Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping

    -> Speed up pre-training of Transformer networks by exploring architectural changes and training techniques, not at the cost of excessive hardware resources. Why? There is redundant computation in training -> try to remove or reduce it.
    • The paper proposes a new architectural unit, called the Switchable-Transformer (ST) block.
    • The paper proposes a progressive schedule that adds extra stability when pre-training Transformer networks with layer dropping (see the config sketch below).
    https://www.deepspeed.ai/tutorials/progressive_layer_dropping/
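    A rough sketch of how this is exposed in practice, based on the DeepSpeed tutorial linked above: progressive layer dropping is switched on through the engine config. The key names follow that tutorial; the theta/gamma values below are illustrative assumptions, not the paper's tuned settings.

    # Hedged sketch: enabling progressive layer dropping via a DeepSpeed config dict.
    ds_config = {
        "train_batch_size": 4096,          # matches the 4K batch size reported later
        "progressive_layer_drop": {
            "enabled": True,
            "theta": 0.5,                  # floor of the keep-probability schedule (assumed)
            "gamma": 0.001,                # how fast the drop rate ramps up (assumed)
        },
    }
    # The dict would then be passed to deepspeed.initialize(..., config=ds_config)
    # together with the BERT model and optimizer (not shown here).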
  4. Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping

    Layer Normalization: https://arxiv.org/pdf/1607.06450.pdf
    Self-attention: https://theaisummer.com/self-attention/
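    For reference, the two building blocks linked above can be written down in a few lines of PyTorch. This is a minimal single-head sketch for illustration, not the implementation used in the paper.

    import math
    import torch
    import torch.nn as nn

    class SelfAttention(nn.Module):
        """Minimal single-head scaled dot-product self-attention."""
        def __init__(self, d_model):
            super().__init__()
            self.q = nn.Linear(d_model, d_model)
            self.k = nn.Linear(d_model, d_model)
            self.v = nn.Linear(d_model, d_model)

        def forward(self, x):                       # x: (batch, seq_len, d_model)
            q, k, v = self.q(x), self.k(x), self.v(x)
            scores = q @ k.transpose(-2, -1) / math.sqrt(x.size(-1))
            return torch.softmax(scores, dim=-1) @ v

    x = torch.randn(2, 16, 64)
    attn_out = SelfAttention(64)(x)
    out = nn.LayerNorm(64)(x + attn_out)            # residual connection + layer normalization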
  5. Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping

    Identity mapping reordering: placing the layer normalization only on the input stream of the sublayers.
    PostLN: BERT model with layer normalization applied after the addition in Transformer blocks.
    PreLN: BERT model with identity mapping reordering.
    Figure 1 shows that PostLN suffers from unbalanced gradients (vanishing gradients as the layer ID decreases), while PreLN eliminates the unbalanced gradients. A code sketch of the two orderings follows below.
    Identity mapping was first defined in: He K, Zhang X, Ren S, Sun J. Identity Mappings in Deep Residual Networks. European Conference on Computer Vision, 2016, pp. 630-645.
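    The difference between the two orderings can be made concrete with a small sketch, where sublayer stands in for either self-attention or the feed-forward network. This is a simplification, not the paper's exact code.

    import torch
    import torch.nn as nn

    def post_ln_block(x, sublayer, norm):
        # PostLN: normalization applied after the residual addition (original BERT).
        return norm(x + sublayer(x))

    def pre_ln_block(x, sublayer, norm):
        # PreLN: identity mapping reordering, normalization only on the sublayer input,
        # so the residual path stays a clean identity mapping.
        return x + sublayer(norm(x))

    x = torch.randn(2, 16, 64)
    ffn, norm = nn.Linear(64, 64), nn.LayerNorm(64)  # toy stand-ins for the real sublayers
    y_post = post_ln_block(x, ffn, norm)
    y_pre = pre_ln_block(x, ffn, norm)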
  6. Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping

    Switchable gates: each sub-layer includes a gate G ∈ {0, 1} that controls whether the sub-layer is disabled or not during training (see the sketch below).
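    A rough sketch of how such a gate can act on one sub-layer. The 1/keep_prob scaling mirrors stochastic-depth style training and is an assumption for illustration, not the authors' exact code.

    import torch
    import torch.nn as nn

    def switchable_sublayer(x, sublayer, norm, keep_prob, training=True):
        # G = 1 keeps the sub-layer, G = 0 drops it and leaves only the identity path.
        if training:
            gate = torch.bernoulli(torch.tensor(keep_prob))
            if gate == 0:
                return x
            # divide by keep_prob so the expected output matches inference behaviour
            return x + sublayer(norm(x)) / keep_prob
        return x + sublayer(norm(x))

    x = torch.randn(2, 16, 64)
    out = switchable_sublayer(x, nn.Linear(64, 64), nn.LayerNorm(64), keep_prob=0.9)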
  7. Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping

    They propose a progressive schedule θ(t), a temporal schedule for the expected fraction of Switchable-Transformer blocks that are kept active. The schedule smoothly increases the layer dropping rate for each mini-batch as training evolves by adapting over time the parameter of the Bernoulli distribution used for sampling (a sketch follows below).
    Curriculum Learning: https://dl.acm.org/doi/pdf/10.1145/1553374.1553380
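    A sketch of what such a schedule can look like: the expected keep probability starts at 1 (all ST blocks active) and decays smoothly toward a floor, so the drop rate 1 - θ(t) grows over training. The functional form and the constants below (theta_bar, gamma) are illustrative assumptions; the exact schedule is given in the paper.

    import math

    def progressive_theta(step, theta_bar=0.5, gamma=0.01):
        # theta_bar: final expected keep ratio; gamma: decay speed (both assumed values).
        return (1.0 - theta_bar) * math.exp(-gamma * step) + theta_bar

    # theta(t) parameterizes the Bernoulli gates of the ST blocks, e.g. it could be
    # fed as keep_prob into the switchable_sublayer sketch from the previous slide.
    for t in (0, 100, 1000):
        print(t, round(progressive_theta(t), 3))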
  8. Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping

    Evaluation:
    Model: BERT
    Datasets: English Wikipedia corpus and BookCorpus, with 2.8B word tokens
    Framework: Hugging Face PyTorch
    Hardware: 4x DGX-2 boxes with 64x V100 GPUs
    Learning rate: lr = 1e-4
    Batch size: 4K