
OpenTalks.AI - Denis Timonin, Megatron-LM: Training multi-billion-parameter LMs with the Model Parallelism technique

February 04, 2021

Transcript

  1. MEGATRON-LM: Training Trillion Parameter Language Models with GPU Model Parallelism. Denis Timonin, DL/ML Solutions Architect, NVIDIA. dtimonin@nvidia.com, February 2021
  2. AGENDA — Megatron-LM: training large-scale language models using model parallelism on GPUs; Model Parallelism: types of model parallelism and their implementation; Code Example: parallelizing a linear layer with the tensor-parallel technique
  3. MEGATRON-LM

  4. WHAT IS MEGATRON? NVIDIA's framework for efficiently training the world's largest language models. Paper: https://arxiv.org/abs/1909.08053 Repo: https://github.com/NVIDIA/Megatron-LM
  5. MODEL SIZE TREND IN NLP
     • Training the largest transformer-based language model has recently been one of the best ways to advance the state of the art in NLP applications
     • NLP model size increases by almost an order of magnitude every year
     • Unsupervised pretraining on large text corpora has eliminated training-dataset-size issues
     • Many downstream NLP applications have benefited from these recent advances
     • Training larger models with more data yields better accuracy in almost all cases
  6. MOTIVATION — Why Megatron? Training the largest transformer-based language model has recently been the best way to advance the state of the art in NLP applications. Unsupervised language models such as Megatron, GPT-3, and T5 demonstrate the power of large language models trained on huge corpora. The NVIDIA DGX SuperPOD, optimized for deep learning and HPC, provides a unique opportunity for training very large models.
  7. GOALS & CHALLENGES — What would we like to do with Megatron?
     • Train transformer-based language models with billions and trillions of parameters, which requires model parallelism to fit in GPU memory
     • Achieve high utilization and scale up to thousands of GPUs
     • Devise simple methods that require minimal changes to our existing code base (reducing the barrier to entry)
     • Use the developed methodology to scale out Transformer language models such as GPT-3 and BERT and to explore their representational capabilities
  8. MODEL PARALLELISM

  9. MODEL PARALLELISM — Complementary types of model parallelism:
     • Inter-layer (pipeline) parallelism: split sets of layers across multiple devices, e.g. layers 0-2 and layers 3-5 live on different devices
     • Intra-layer (tensor) parallelism: split individual layers across multiple devices, so both devices compute different parts of layers 0-5
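The two splits are easy to check numerically. Below is a single-process numpy sketch of the idea, with numpy arrays standing in for tensors on separate devices; the shapes and names are illustrative, not Megatron-LM code:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))          # batch of 4, hidden size 8
W0 = rng.standard_normal((8, 8))         # layer 0 weight
W1 = rng.standard_normal((8, 8))         # layer 1 weight

# Baseline: both layers on one device.
Y_ref = (X @ W0) @ W1

# Inter-layer (pipeline) parallelism: layer 0 on "device 0", layer 1 on
# "device 1"; only the activation tensor crosses the device boundary.
h = X @ W0          # device 0
Y_pipe = h @ W1     # device 1

# Intra-layer (tensor) parallelism: each layer's weight is split across both
# devices, so both devices work on every layer (shown for layer 0 only,
# split column-wise, then reassembled).
Y0_dev0 = X @ W0[:, :4]   # device 0 computes half the output features
Y0_dev1 = X @ W0[:, 4:]   # device 1 computes the other half
Y_tensor = np.concatenate([Y0_dev0, Y0_dev1], axis=1) @ W1
```

Both partitions reproduce the single-device result exactly; the difference is what has to be communicated (activations between stages vs. shards within a layer).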
  10. MODEL PARALLELISM — Complementary types of model parallelism: inter- and intra-layer parallelism combined
  11. MODEL PARALLELISM — Parallel GEMMs, row parallel (built up over several slides): the weight matrix is split along its rows and the input along its columns; each device computes a partial product Y1 or Y2 of the full output shape, and their sum gives the final output Y
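The row-parallel build-up can be condensed into a single-process numpy sketch, with `+` standing in for the all-reduce between the two simulated GPUs (shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 6))   # input activations
A = rng.standard_normal((6, 5))   # weight matrix
Y_ref = X @ A                     # single-device reference

# Row-parallel split: A is partitioned along its rows, X along its columns.
X1, X2 = X[:, :3], X[:, 3:]
A1, A2 = A[:3, :], A[3:, :]

# Each "GPU" computes a partial product of the full output shape...
Y1 = X1 @ A1   # GPU 1
Y2 = X2 @ A2   # GPU 2

# ...and an all-reduce (a sum, simulated here) produces the final Y.
Y = Y1 + Y2
```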

  17. MODEL PARALLELISM — Parallel GEMMs, column parallel: the weight matrix is split along its columns; each device computes an output shard Y1 or Y2, and concatenating them gives the full output Y
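The column-parallel variant in the same single-process numpy sketch, with concatenation standing in for the all-gather (shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((4, 6))
A = rng.standard_normal((6, 8))
Y_ref = X @ A                     # single-device reference

# Column-parallel split: A is partitioned along its columns; every "GPU"
# sees the full input X.
A1, A2 = A[:, :4], A[:, 4:]
Y1 = X @ A1    # GPU 1 computes the first half of the output features
Y2 = X @ A2    # GPU 2 computes the second half

# An all-gather (a concatenation, simulated here) assembles the full output.
Y = np.concatenate([Y1, Y2], axis=1)
```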
  19. MODEL PARALLELISM — Column-parallel linear layer; row-parallel linear layer
  20. APPROACH — Goals for the Transformer:
     • Group math-heavy operations (such as GEMMs) to minimize parallel sync points
     • Develop an approach that can be fully implemented with the insertion of a few simple collectives
     • Rely on pre-existing NCCL/PyTorch operations for a native PyTorch implementation
     • Use Ampere's Tensor Cores for mixed-precision training
  21. APPROACH — Fused MLP

  22. APPROACH — Fused self-attention. A nonlinearity follows, so no row-parallel split is used there. Figure courtesy of Vaswani et al., 2017.
  23. APPROACH — Fused self-attention

  24. APPROACH — Putting it all together: the parallel transformer layer
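The parallel transformer MLP combines the two splits: the first GEMM is column-parallel so the nonlinearity can be applied locally on each shard, the second is row-parallel, and a single all-reduce at the end restores the full activation. A single-process numpy sketch under those assumptions (illustrative shapes, not Megatron-LM's implementation):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(3)
X = rng.standard_normal((4, 8))        # [batch, hidden]
W_in = rng.standard_normal((8, 32))    # hidden -> 4*hidden
W_out = rng.standard_normal((32, 8))   # 4*hidden -> hidden
Y_ref = gelu(X @ W_in) @ W_out         # single-device reference

# First GEMM column-parallel: each GPU holds half of W_in's columns, so GeLU
# (elementwise) is applied to the local shard with no communication.
h1 = gelu(X @ W_in[:, :16])            # GPU 1
h2 = gelu(X @ W_in[:, 16:])            # GPU 2

# Second GEMM row-parallel: each GPU holds the matching rows of W_out, and
# the partial outputs are summed by one all-reduce (simulated as +).
Y = h1 @ W_out[:16, :] + h2 @ W_out[16:, :]
```

Note that the whole MLP costs exactly one all-reduce in the forward pass; placing the splits the other way around would force a synchronization before the GeLU.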

  25. PIPELINE PARALLELISM
     • Split the per-instance batch into smaller micro-batches and flush the pipeline at the end of a batch
     • Minimize the number of active micro-batches to reduce the memory footprint
     • Activation recomputation can also be used to reduce the memory footprint
     [Figure: pipeline schedule showing forward and backward passes of micro-batches 1-9 interleaved across GPUs 0-3]
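Micro-batching changes only the schedule, not the math: running the micro-batches through the pipeline stages one after another yields the same activations as the full batch. A minimal single-process numpy sketch (two simulated stages, illustrative shapes):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((8, 6))    # per-instance batch of 8
W0 = rng.standard_normal((6, 6))   # stage 0 ("GPU 0")
W1 = rng.standard_normal((6, 4))   # stage 1 ("GPU 1")
Y_ref = (X @ W0) @ W1              # no pipelining

# Split the batch into 4 micro-batches; in a real pipeline, stage 1 works on
# micro-batch i while stage 0 already processes micro-batch i+1.
outputs = []
for mb in np.split(X, 4):
    h = mb @ W0             # stage 0 forward
    outputs.append(h @ W1)  # stage 1 forward
Y = np.concatenate(outputs)
```

Smaller micro-batches reduce pipeline bubbles but increase per-step overhead; the schedule in the figure balances the two.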
  26. Weak Scaling

  27. CODE EXAMPLE

  28. GEMM PARALLELIZATION — Regular DDP in PyTorch with NCCL
     • Default data parallelism in PyTorch uses DDP (DistributedDataParallel): we have multiple GPUs, and each GPU holds a full copy of the same model
     • All distributed code for gradient reduction is encapsulated in DDP
     • We can run this code with: python -m torch.distributed.launch --nproc_per_node=2 train.py
     • `rank` is the variable that tells each process which GPU it has to use
     • With the NCCL backend, all-reduce operations happen directly between GPUs (peer-to-peer)
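What DDP's gradient all-reduce buys can be checked in a single process: averaging the per-replica gradients equals the gradient over the combined batch, so every replica applies the identical update and the copies stay in sync. A numpy sketch (a simulation of two data-parallel "GPUs", not DDP's actual code path):

```python
import numpy as np

rng = np.random.default_rng(5)
W = rng.standard_normal((6, 1))    # identical model copy on every "GPU"
data = [rng.standard_normal((4, 6)) for _ in range(2)]     # per-GPU shards
target = [rng.standard_normal((4, 1)) for _ in range(2)]

# Each "GPU" runs forward/backward on its own shard
# (MSE loss, so the local gradient is 2/N * X^T (XW - t)).
grads = []
for X, t in zip(data, target):
    err = X @ W - t
    grads.append(2.0 / len(err) * X.T @ err)

# DDP's all-reduce averages the gradients across replicas.
g = sum(grads) / len(grads)

# Same result as one big-batch gradient on a single device.
X_all, t_all = np.concatenate(data), np.concatenate(target)
err_all = X_all @ W - t_all
g_full = 2.0 / len(err_all) * X_all.T @ err_all
```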
  29. GEMM PARALLELIZATION — Regular DDP in PyTorch with NCCL. [Figure: GPU 1 and GPU 2 each hold a full model copy, run forward and backward on Data 1 and Data 2 respectively, and all-reduce gradients between GPUs]
  30. GEMM PARALLELIZATION — PyTorch Distributed
     • We can't use DDP, because it copies the full model onto each device, and we don't want that
     • But PyTorch has lower-level tools to support distributed training with a custom design
     • torch.distributed is what we need: it lets us operate on tensors across GPUs transparently
     • Available collectives: broadcast, reduce, all-reduce, gather, all-gather, scatter
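The semantics of the collectives on the slide are easy to pin down by modeling each rank's tensor as a list entry (the real calls are `torch.distributed.all_reduce`, `broadcast`, etc.; this numpy sketch only illustrates what each one computes):

```python
import numpy as np

# One tensor per simulated rank.
ranks = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]

def all_reduce(ts):
    # Every rank ends up with the elementwise sum of all ranks' tensors.
    total = sum(ts)
    return [total.copy() for _ in ts]

def broadcast(ts, src=0):
    # Every rank ends up with rank `src`'s tensor.
    return [ts[src].copy() for _ in ts]

def all_gather(ts):
    # Every rank ends up with the list of all ranks' tensors.
    return [list(ts) for _ in ts]

reduced = all_reduce(ranks)      # both ranks now hold [4.0, 6.0]
```

For tensor parallelism, all-reduce (row-parallel) and all-gather (column-parallel) are the two that matter.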
  31. GEMM PARALLELIZATION — Column-parallel linear layer. [Figure: the layer split across GPU 1 and GPU 2]
  32. GEMM PARALLELIZATION — Column-parallel linear layer (multi-GPU): a linear model for one GPU vs. the column-parallel linear layer across multiple GPUs
  33. GEMM PARALLELIZATION — Column-parallel linear layer (multi-GPU): the functions f and g are subclasses of torch.autograd.Function that define their forward and backward behavior
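In the column-parallel layer, f is an identity in the forward pass and an all-reduce in the backward pass, while g all-gathers the output shards forward and splits the gradient backward. In a single process, autograd already sums the gradients flowing back from both shards into the shared input, which is exactly f's backward all-reduce, so the math can be checked without torch.distributed (a sketch, not Megatron-LM's actual f/g classes):

```python
import torch

torch.manual_seed(0)
X = torch.randn(4, 6, requires_grad=True)
A = torch.randn(6, 8)

# Column-parallel forward: each "rank" holds half of A's columns. `f` is the
# identity here; `g`'s forward all-gather is the concatenation. `f`'s backward
# all-reduce is the sum of the two branches' input gradients, which
# single-process autograd performs automatically.
A1, A2 = A[:, :4], A[:, 4:]
Y = torch.cat([X @ A1, X @ A2], dim=1)
Y.sum().backward()
grad_parallel = X.grad.clone()

# Reference: the unsplit layer yields the same input gradient.
X.grad = None
(X @ A).sum().backward()
grad_ref = X.grad
```

In the real multi-GPU implementation, the forward of f and the backward of g are each a `torch.distributed` collective inside the `forward`/`backward` static methods.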
  34. THE CONFERENCE FOR AI INNOVATORS, TECHNOLOGISTS, AND CREATIVES — Join us at GTC 2021 on April 12-16 for the latest in AI, HPC, healthcare, game development, networking, and more. NVIDIA's GTC brings together a global community of developers, researchers, engineers, and innovators to experience global innovation and collaboration. Don't miss the exclusive GTC keynote by Jensen Huang on April 12, available to everyone. Visit www.nvidia.com/gtc to learn more and be notified when registration opens.