
OpenTalks.AI - Denis Timonin, Megatron-LM: Training multi-billion parameter LMs with the Model Parallelism technique

OpenTalks.AI
February 04, 2021

Transcript

  1. MEGATRON-LM: TRAINING TRILLION PARAMETER LANGUAGE MODELS WITH GPU MODEL PARALLELISM. Denis Timonin, DL/ML Solutions Architect, NVIDIA ([email protected]), February 2021
  2. AGENDA
     • Megatron-LM: training large-scale language models using model parallelism on GPUs
     • Model parallelism: types of model parallelism and their implementation
     • Code example: parallelizing a linear layer with the tensor-parallel technique
  3. MODEL SIZE TREND IN NLP
     • Training the largest transformer-based language model has recently been one of the best ways to advance the state of the art in NLP applications
     • NLP model size increases by almost an order of magnitude every year
     • Unsupervised pretraining on large text corpora has eliminated training dataset size issues
     • Lots of downstream NLP applications have benefited from recent advancements
     • Training larger models with more data results in better accuracy in almost all cases
  4. MOTIVATION: Why Megatron?
     • Training the largest transformer-based language model has recently been the best way to advance the state of the art in NLP applications
     • Unsupervised language models such as Megatron, GPT-3, and T5 demonstrate the power of large language models trained on a huge corpus
     • The NVIDIA DGX SuperPOD, optimized for deep learning and HPC, provides a unique opportunity for training very large models
  5. GOALS & CHALLENGES: What would we like to do with Megatron?
     • Train transformer-based language models with billions and trillions of parameters, which requires model parallelism to fit in GPU memory
     • Achieve high utilization and scale up to thousands of GPUs
     • Devise simple methods that require minimal changes to our existing code base (reducing the barrier to entry)
     • Use the developed methodology to scale out Transformer language models such as GPT-3 and BERT and to explore their representation capabilities
  6. MODEL PARALLELISM: Complementary types of model parallelism
     • Inter-layer (pipeline) parallelism: split sets of layers across multiple devices; e.g. layers 0,1,2 and layers 3,4,5 are on different devices
     • Intra-layer (tensor) parallelism: split individual layers across multiple devices; both devices compute different parts of layers 0,1,2,3,4,5
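     A minimal sketch contrasting the two splits on a toy six-layer stack (not from the slides; layer sizes and device names are assumptions chosen for illustration):

```python
import torch
import torch.nn as nn

# Toy six-layer stack; sizes are illustrative only.
layers = [nn.Linear(1024, 1024) for _ in range(6)]

# Inter-layer (pipeline) parallelism: layers 0-2 live on GPU 0, layers 3-5 on GPU 1.
stage0 = nn.Sequential(*layers[:3]).to("cuda:0")
stage1 = nn.Sequential(*layers[3:]).to("cuda:1")

def pipeline_forward(x):
    # Activations cross devices only at the stage boundary.
    h = stage0(x.to("cuda:0"))
    return stage1(h.to("cuda:1"))

# Intra-layer (tensor) parallelism: every layer's weight matrix is split,
# so each GPU holds a slice of every layer and computes part of its output.
w = layers[0].weight.data              # shape [out_features, in_features]
w_half0 = w[:512, :].to("cuda:0")      # first half of the output features
w_half1 = w[512:, :].to("cuda:1")      # second half of the output features
```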
  7. APPROACH: Goals for parallelizing the Transformer
     • Group math-heavy operations (such as GEMMs) to minimize parallel sync points
     • Develop an approach that can be fully implemented with the insertion of a few simple collectives
     • Rely on pre-existing NCCL/PyTorch operations for a native PyTorch implementation
     • Use Ampere's tensor cores for mixed-precision training
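     The slides do not show the mixed-precision code itself; below is a minimal, generic sketch of mixed-precision training with PyTorch's torch.cuda.amp, which is one way to engage the tensor cores mentioned above. The model, optimizer, and data are placeholders.

```python
import torch
from torch.cuda.amp import autocast, GradScaler

model = torch.nn.Linear(1024, 1024).cuda()          # placeholder model
optimizer = torch.optim.AdamW(model.parameters())
scaler = GradScaler()                                # scales the loss to avoid FP16 underflow

x = torch.randn(8, 1024, device="cuda")              # placeholder batch
target = torch.randn(8, 1024, device="cuda")

for _ in range(10):
    optimizer.zero_grad()
    with autocast():                                 # matmuls run in reduced precision on tensor cores
        loss = torch.nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```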
  8. PIPELINE PARALLELISM
     • Split the per-instance batch into smaller micro-batches; flush the pipeline at the end of a batch
     • Minimize the number of active micro-batches to reduce the memory footprint
     • Can also use activation recomputation to reduce the memory footprint
     [Figure: pipeline schedule over time steps 1-48 for GPU 0-3, showing interleaved forward and backward passes of micro-batches 1-9 on each GPU]
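     A minimal sketch (an assumption, not the Megatron schedule itself) of the basic mechanism behind the figure: a batch is split into micro-batches, gradients are accumulated across them, and the optimizer step acts as the flush at the end of the batch.

```python
import torch

model = torch.nn.Linear(1024, 10).cuda()            # placeholder single-stage "pipeline"
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

batch = torch.randn(32, 1024, device="cuda")        # per-instance batch
labels = torch.randint(0, 10, (32,), device="cuda")
num_microbatches = 4

optimizer.zero_grad()
for mb_x, mb_y in zip(batch.chunk(num_microbatches), labels.chunk(num_microbatches)):
    loss = torch.nn.functional.cross_entropy(model(mb_x), mb_y)
    (loss / num_microbatches).backward()             # accumulate scaled gradients per micro-batch
optimizer.step()                                     # "flush" at the end of the batch
```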
  9. GEMM PARALLELIZATION: Regular DDP in PyTorch with NCCL
     • Default data parallelism in PyTorch is DDP (DistributedDataParallel)
     • We have multiple GPUs, and each GPU has a full copy of the same model
     • All the distributed code for gradient reduction is encapsulated in DDP
     • We can run this code with: python -m torch.distributed.launch --nproc_per_node=2 train.py
     • `rank` is the variable that tells each process which GPU it has to use
     • With the NCCL backend, all-reduce operations happen directly between GPUs (peer-to-peer)
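     A minimal sketch of the train.py described above; the model and data are placeholders, and the script is assumed to be launched with torch.distributed.launch as shown on the slide.

```python
# train.py - minimal DDP sketch; launch with:
#   python -m torch.distributed.launch --nproc_per_node=2 train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Recent launchers set LOCAL_RANK in the environment (older ones pass --local_rank instead).
local_rank = int(os.environ.get("LOCAL_RANK", 0))
device = torch.device(f"cuda:{local_rank}")
dist.init_process_group(backend="nccl")              # NCCL: collectives run GPU-to-GPU
torch.cuda.set_device(device)

model = torch.nn.Linear(1024, 10).to(device)         # each process holds a full model copy
model = DDP(model, device_ids=[local_rank])          # DDP encapsulates the gradient all-reduce
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(16, 1024, device=device)             # placeholder data shard for this rank
y = torch.randint(0, 10, (16,), device=device)

loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()                                       # gradients are all-reduced here
optimizer.step()
```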
  10. GEMM PARALLELIZATION: Regular DDP in PyTorch with NCCL
     [Figure: GPU 1 and GPU 2 each hold a full copy of the model; Data 1 and Data 2 run forward and backward on their own GPU, and the gradients are all-reduced between the GPUs]
  11. GEMM PARALLELIZATION: PyTorch Distributed
     • We can't use DDP here because it copies the full model onto each device, and we don't want that
     • But PyTorch has multiple lower-level tools to support distributed training with a custom design
     • torch.distributed is what we need: it lets us operate on tensors across GPUs transparently
     [Figure: the torch.distributed collectives: Broadcast, Scatter, Gather, Reduce, All-Reduce, All-Gather]
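     A minimal sketch (assumed, not from the slides) of calling a torch.distributed collective directly, with no DDP wrapper; launch it the same way as the DDP example.

```python
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ.get("LOCAL_RANK", 0))
device = torch.device(f"cuda:{local_rank}")
dist.init_process_group(backend="nccl")
torch.cuda.set_device(device)

# Each rank owns a different tensor; after all_reduce every rank holds the sum.
t = torch.full((4,), float(dist.get_rank()), device=device)
dist.all_reduce(t, op=dist.ReduceOp.SUM)
print(f"rank {dist.get_rank()}: {t.tolist()}")
```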
  12. GEMM PARALLELIZATION: Column-Parallel Linear Layer (Multi-GPU)
     [Figure: a linear model for 1 GPU compared with a column-parallel linear layer split across multiple GPUs]
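     A minimal single-process sketch of the column-parallel idea: the weight matrix A of Y = XA is split along its columns, each GPU computes its slice of the output with a local GEMM, and the slices are concatenated. Shapes are assumptions; the real multi-GPU layer also splits the bias and uses collectives instead of an explicit concatenation on one device.

```python
import torch

in_features, out_features = 1024, 4096
X = torch.randn(8, in_features)
A = torch.randn(in_features, out_features)

# Split A along its columns (the output dimension) across two "GPUs".
A1, A2 = A.chunk(2, dim=1)            # each is [in_features, out_features // 2]

# Each device computes its own slice of the output with a local GEMM.
Y1 = X @ A1                           # would run on GPU 0 in the multi-GPU version
Y2 = X @ A2                           # would run on GPU 1 in the multi-GPU version

# Gathering the slices along the column dimension reproduces the full output.
Y = torch.cat([Y1, Y2], dim=1)
assert torch.allclose(Y, X @ A, atol=1e-4)
```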
  13. GEMM PARALLELIZATION: Column-Parallel Linear Layer (Multi-GPU)
     Functions g and f subclass torch.autograd.Function to define their forward and backward behavior
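     A sketch of the two conjugate operators as torch.autograd.Function subclasses, following the description in the Megatron-LM paper: f is an identity in the forward pass and an all-reduce in the backward pass, while g is an all-reduce in the forward pass and an identity in the backward pass. The class names loosely follow the Megatron-LM codebase but are written here as an illustration, and a tensor-parallel process group is assumed to be initialized.

```python
import torch
import torch.distributed as dist

class _CopyToModelParallelRegion(torch.autograd.Function):
    """The 'f' operator: identity forward, all-reduce of the gradient backward."""

    @staticmethod
    def forward(ctx, x):
        return x                          # identity: every rank sees the same input

    @staticmethod
    def backward(ctx, grad_output):
        grad = grad_output.clone()
        dist.all_reduce(grad)             # sum gradients from all tensor-parallel ranks
        return grad

class _ReduceFromModelParallelRegion(torch.autograd.Function):
    """The conjugate 'g' operator: all-reduce forward, identity backward."""

    @staticmethod
    def forward(ctx, x):
        x = x.clone()
        dist.all_reduce(x)                # combine partial results from all ranks
        return x

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output                # identity for the gradient

# Usage inside a parallel linear layer would look like:
#   x = _CopyToModelParallelRegion.apply(x)
#   y = _ReduceFromModelParallelRegion.apply(partial_y)
```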
  14. JOIN US AT GTC 2021: THE CONFERENCE FOR AI INNOVATORS, TECHNOLOGISTS, AND CREATIVES
     NVIDIA's GTC brings together a global community of developers, researchers, engineers, and innovators to experience global innovation and collaboration. Join us at GTC 2021 on April 12-16 for the latest in AI, HPC, healthcare, game development, networking, and more. Don't miss the exclusive GTC keynote by Jensen Huang on April 12, available to everyone. Visit www.nvidia.com/gtc to learn more and be notified when registration opens.