OpenTalks.AI - Denis Timonin, Megatron-LM: Training Multi-Billion-Parameter LMs with Model Parallelism

MEGATRON-LM: TRAINING TRILLION PARAMETER LANGUAGE MODELS WITH GPU MODEL PARALLELISM
Denis Timonin, DL/ML Solutions Architect, NVIDIA
February 2021

2 AGENDA
• Megatron-LM: Training Large Scale Language Models Using Model Parallelism on GPUs
• Model Parallelism: Types of Model Parallelism and their implementation
• Code Example: Parallelizing Linear layer with Tensor Parallel technique

3 MEGATRON-LM

4 WHAT IS MEGATRON?
NVIDIA’s framework for efficiently training the world’s largest language models
Paper: https://arxiv.org/abs/1909.08053
Repo: https://github.com/NVIDIA/Megatron-LM

9 MODEL SIZE TREND IN NLP
• Training the largest transformer-based language model has recently been one of the best ways to advance the state of the art in NLP applications
• NLP model size increases by almost an order of magnitude every year
• Unsupervised pretraining on large text corpora has eliminated training dataset size issues
• Lots of downstream NLP applications have benefited from recent advancements
• Training larger models with more data results in better accuracy in almost all cases

10 MOTIVATION Why Megatron?
• Training the largest transformer-based language model has recently been the best way to advance the state of the art in NLP applications
• Unsupervised language models such as Megatron, GPT-3 and T5 demonstrate the power of large language models trained on a huge corpus
• NVIDIA DGX SuperPOD, optimized for Deep Learning and HPC, provides a unique opportunity for training very large models

11 GOALS & CHALLENGES What would we like to do with Megatron?
• Train transformer-based language models with billions and trillions of parameters
• This requires model parallelism to fit in GPU memory
• Achieve high utilization and scale up to thousands of GPUs
• Devise simple methods that require minimal changes to our existing code base (reducing the barrier to entry)
• Use the developed methodology to scale out Transformer language models such as GPT-3 and BERT and to explore their representation capabilities

25 MODEL PARALLELISM

26 MODEL PARALLELISM Complementary Types of Model Parallelism
• Inter-Layer (Pipeline) Parallelism: split sets of layers across multiple devices. Layers 0, 1, 2 and layers 3, 4, 5 are on different devices
• Intra-Layer (Tensor) Parallelism: split individual layers across multiple devices. Both devices compute different parts of layers 0, 1, 2, 3, 4, 5

29 MODEL PARALLELISM Complementary Types of Model Parallelism: Inter + Intra Parallelism

30 MODEL PARALLELISM Parallel GEMMs. Row parallel
[Figure: partial outputs Y1 and Y2 combine into Y]

36 MODEL PARALLELISM Parallel GEMMs. Column parallel
[Figure: outputs Y1 and Y2 form Y]
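The two splits can be checked numerically. The sketch below uses plain Python nested lists as stand-ins for tensors on two devices; the helpers `matmul` and `add` and the toy matrices are illustrative assumptions, not code from the talk. For Y = XA, the row-parallel form splits A by rows (and X by columns) and sums the partial products, while the column-parallel form splits A by columns and concatenates the results.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


X = [[1, 2, 3, 4],
     [5, 6, 7, 8]]                      # 2 x 4 input
A = [[1, 0], [0, 1], [2, 1], [1, 3]]    # 4 x 2 weight
Y = matmul(X, A)                        # single-device reference

# Row parallel: A is split by rows, so X must be split by columns.
# Each "device" produces a full-size partial result; Y = Y1 + Y2.
X1, X2 = [row[:2] for row in X], [row[2:] for row in X]
A1, A2 = A[:2], A[2:]
Y_row = add(matmul(X1, A1), matmul(X2, A2))

# Column parallel: A is split by columns and every device sees all of X.
# Each "device" produces a slice of the output; Y = [Y1, Y2].
A1c, A2c = [row[:1] for row in A], [row[1:] for row in A]
Y1, Y2 = matmul(X, A1c), matmul(X, A2c)
Y_col = [r1 + r2 for r1, r2 in zip(Y1, Y2)]

print(Y, Y_row, Y_col)
```

On real hardware the sum in the row-parallel case would be an all-reduce and the concatenation in the column-parallel case an all-gather.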

38 MODEL PARALLELISM Column Parallel Linear Layer and Row Parallel Linear Layer

40 APPROACH Transformer Goals
• Group math-heavy operations (such as GEMMs) to minimize parallel sync points
• Develop an approach that can be fully implemented with the insertion of a few simple collectives
• Rely on pre-existing NCCL/PyTorch operations for a native PyTorch implementation
• Use Ampere’s Tensor Cores for mixed precision training

42 APPROACH Fused MLP
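A minimal sketch of the idea, assuming the MLP has the form Z = GeLU(XA)B as in the Megatron-LM paper (the slide's own figure is not in this transcript). Splitting A by columns lets each device apply the nonlinearity locally, and splitting B by rows lets the two GEMMs chain with a single sum at the end. Plain Python lists stand in for per-device tensors; all helper names and numbers are illustrative assumptions.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def gelu(m):
    """Exact GeLU applied element-wise."""
    return [[0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in row] for row in m]


def close(a, b, tol=1e-9):
    return all(abs(x - y) <= tol for ra, rb in zip(a, b) for x, y in zip(ra, rb))


X = [[1.0, -2.0], [0.5, 3.0]]                               # 2 x 2 input
A = [[0.2, -0.4, 0.1, 0.7], [0.5, 0.3, -0.6, 0.2]]          # 2 x 4 first weight
B = [[1.0, 0.0], [0.5, -1.0], [-0.3, 0.8], [0.2, 0.4]]      # 4 x 2 second weight

Z_serial = matmul(gelu(matmul(X, A)), B)

# Column-split A, row-split B: no communication until the final sum.
A_cols = [[row[:2] for row in A], [row[2:] for row in A]]
B_rows = [B[:2], B[2:]]
partials = [matmul(gelu(matmul(X, A_i)), B_i) for A_i, B_i in zip(A_cols, B_rows)]
Z_parallel = add(partials[0], partials[1])    # stands in for the one all-reduce

# Row-splitting A instead would need a sync before GeLU, because
# GeLU(Y1 + Y2) is not GeLU(Y1) + GeLU(Y2).
X_cols = [[row[:1] for row in X], [row[1:] for row in X]]
A_rows = [A[:1], A[1:]]
unsynced = add(gelu(matmul(X_cols[0], A_rows[0])), gelu(matmul(X_cols[1], A_rows[1])))
row_split_matches = close(unsynced, gelu(matmul(X, A)))

print(close(Z_parallel, Z_serial), row_split_matches)
```

The last check is the reason the first GEMM is column-parallel: a nonlinearity does not distribute over the partial sums that a row-parallel split produces.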

43 APPROACH Fused Self-Attention
Nonlinearity, so no Row-Parallel
Figure courtesy of Vaswani et al. 2017

45 APPROACH Fused Self-Attention

46 APPROACH Putting It All Together: Parallel Transformer Layer

47 PIPELINE PARALLELISM
• Split per-instance batch into smaller microbatches, flush pipeline at the end of a batch
• Minimize number of active micro-batches to reduce memory footprint
• Can also use activation recomputation to reduce memory footprint
[Figure: pipeline schedule over time steps 1-48, one row per GPU; each row lists microbatch numbers in processing order. Legend: Forward prop, Backward prop]
GPU 0: 1 2 3 4 1 5 2 6 3 7 4 8 5 9 6 7 8 9
GPU 1: 1 2 3 1 4 2 5 3 6 4 7 5 8 6 9 7 8 9
GPU 2: 1 2 1 3 2 4 3 5 4 6 5 7 6 8 7 9 8 9
GPU 3: 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9
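The per-GPU microbatch order in the schedule figure is consistent with a pattern often called one-forward-one-backward (1F1B): a few warm-up forward passes, then alternating forward and backward, then the remaining backward passes. The sketch below generates that order; the function names and the reading of the figure are assumptions of this sketch, not code from the talk.

```python
def one_f_one_b_order(stage, num_stages, num_microbatches):
    """Return the list of ("F" | "B", microbatch) steps for one pipeline stage."""
    warmup = min(num_stages - stage - 1, num_microbatches)
    steady = num_microbatches - warmup
    ops, fwd, bwd = [], 0, 0
    for _ in range(warmup):            # fill the pipeline
        fwd += 1
        ops.append(("F", fwd))
    for _ in range(steady):            # one forward, then one backward
        fwd += 1
        ops.append(("F", fwd))
        bwd += 1
        ops.append(("B", bwd))
    for _ in range(warmup):            # drain: the flush at the end of the batch
        bwd += 1
        ops.append(("B", bwd))
    return ops


def peak_in_flight(ops):
    """Largest number of microbatches whose activations are held at once."""
    active = peak = 0
    for kind, _ in ops:
        active += 1 if kind == "F" else -1
        peak = max(peak, active)
    return peak


for stage in range(4):
    ops = one_f_one_b_order(stage, num_stages=4, num_microbatches=9)
    print(f"GPU {stage}:", " ".join(str(n) for _, n in ops),
          "| peak active:", peak_in_flight(ops))
```

Under this schedule the number of active microbatches on a stage is bounded by the number of pipeline stages, not by the number of microbatches, which is what keeps the memory footprint small.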

48 Weak Scaling

49 CODE EXAMPLE

50 GEMM PARALLELIZATION Regular DDP in PyTorch with NCCL
• Default data parallelism in PyTorch uses DDP (Distributed Data Parallel)
• We have multiple GPUs, and each GPU has a full copy of the same model
• All distributed code for gradient reduction is encapsulated in DDP
• We can run this code with: python -m torch.distributed.launch --nproc_per_node=2 train.py
• `rank` is the variable that tells each process which GPU to use
• With the NCCL backend, all-reduce operations happen directly between GPUs (p2p)

51 GEMM PARALLELIZATION Regular DDP in PyTorch with NCCL
[Figure: GPU 1 runs Data 1 through MODEL to LOSS 1 and GPU 2 runs Data 2 through MODEL to LOSS 2 (forward, then backward); gradients are all-reduced between GPUs]
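What DDP does during the backward pass can be modeled without any GPU. The sketch below is a plain-Python stand-in, not the PyTorch API: two "ranks" each hold a replica of a one-parameter model and their own data shard, compute local gradients, and average them. The helper names, data, and learning rate are illustrative assumptions.

```python
def grad_mse(w, xs, ys):
    """d/dw of mean((w * x - y) ** 2) over one data shard."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def all_reduce_mean(values):
    """Stand-in for the gradient all-reduce: every rank receives the average."""
    mean = sum(values) / len(values)
    return [mean for _ in values]


shards = [([1.0, 2.0], [2.0, 4.0]),     # Data 1, held by rank 0
          ([3.0, 4.0], [6.0, 8.0])]     # Data 2, held by rank 1
weights = [0.5, 0.5]                    # each rank's full copy of the model
lr = 0.01

# Forward and backward run independently on each rank.
local_grads = [grad_mse(w, xs, ys) for w, (xs, ys) in zip(weights, shards)]

# Gradients are synchronized, then every rank applies the same update.
synced = all_reduce_mean(local_grads)
weights = [w - lr * g for w, g in zip(weights, synced)]

# With equal shard sizes, the averaged gradient equals the full-batch gradient.
full_grad = grad_mse(0.5, [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
print(local_grads, synced, full_grad, weights)
```

Because every rank starts from the same weights and applies the same averaged gradient, the replicas stay identical after each step.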

53 GEMM PARALLELIZATION PyTorch Distributed
• We can’t use DDP because it copies the full model onto each device, which is what we want to avoid
• But PyTorch has multiple lower-level tools to support distributed training with a custom design
• torch.distributed is what we need: it lets us exchange and combine tensors between GPUs transparently
[Figure: collective operations: All-Gather, Broadcast, All-Reduce, Reduce, Gather, Scatter]
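The collectives named on this slide can be summarized by what each rank ends up holding. The functions below are a plain-Python model of those semantics only: each takes a list with one buffer per rank and returns the per-rank result. They are not the `torch.distributed` functions of the same names, which operate in place on tensors inside separate processes.

```python
def _elementwise_sum(bufs):
    return [sum(vals) for vals in zip(*bufs)]


def broadcast(bufs, src):
    """Every rank receives a copy of the source rank's buffer."""
    return [list(bufs[src]) for _ in bufs]


def reduce(bufs, dst):
    """Only rank dst receives the sum; other ranks are shown unchanged here."""
    total = _elementwise_sum(bufs)
    return [total if rank == dst else list(buf) for rank, buf in enumerate(bufs)]


def all_reduce(bufs):
    """Every rank receives the sum of all buffers."""
    total = _elementwise_sum(bufs)
    return [list(total) for _ in bufs]


def gather(bufs, dst):
    """Only rank dst receives the list of all buffers."""
    return [[list(b) for b in bufs] if rank == dst else None
            for rank in range(len(bufs))]


def all_gather(bufs):
    """Every rank receives the list of all buffers."""
    return [[list(b) for b in bufs] for _ in bufs]


def scatter(chunks):
    """The source rank holds one chunk per rank; rank r receives chunks[r]."""
    return [list(chunk) for chunk in chunks]


ranks = [[1, 2], [3, 4]]          # buffers on rank 0 and rank 1
print(all_reduce(ranks))          # sum everywhere
print(all_gather(ranks))          # all buffers everywhere
print(reduce(ranks, dst=0))       # sum on rank 0 only
```

For tensor parallelism the two that matter most are all-reduce (summing partial results or gradients) and all-gather (assembling output slices).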

54 GEMM PARALLELIZATION Column Parallel Linear Layer
[Figure: the layer split across GPU 1 and GPU 2]

55 GEMM PARALLELIZATION Column Parallel Linear Layer (Multi-GPU)
[Code panels: Linear model for 1 GPU; Column Parallel Linear Layer (Multi-GPU)]

56 GEMM PARALLELIZATION Column Parallel Linear Layer (Multi-GPU)
Functions g and f subclass torch.autograd.Function so that each can define its own forward and backward behavior
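The slide's code is not in this transcript, so the following is a sketch under stated assumptions. It reads f as "identity in forward, all-reduce in backward" and g as "all-gather in forward, split in backward", following the Megatron-LM paper's description of the column-parallel layer. Ranks are simulated with plain Python lists; in real code each pair of functions would be the `forward` and `backward` of a `torch.autograd.Function` subclass calling `torch.distributed` collectives. All helper names are hypothetical.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def split_cols(m, parts):
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def concat_cols(blocks):
    return [sum(rows, []) for rows in zip(*blocks)]


def f_forward(x, world):          # f forward: identity, every rank sees X
    return [x for _ in range(world)]


def f_backward(grads):            # f backward: all-reduce (sum) of input grads
    total = grads[0]
    for g in grads[1:]:
        total = add(total, g)
    return total


def g_forward(ys):                # g forward: all-gather the output slices
    return concat_cols(ys)


def g_backward(dy, world):        # g backward: each rank keeps its slice of dY
    return split_cols(dy, world)


world = 2
X = [[1, 2], [3, 4]]                        # 2 x 2 input
A = [[1, 0, 2, 1], [0, 1, 1, 3]]            # 2 x 4 weight, split by columns
dY = [[1, 0, 1, 0], [0, 1, 0, 1]]           # upstream gradient, 2 x 4
A_shards = split_cols(A, world)

# Forward: Y = g([X A_1, X A_2])
Y = g_forward([matmul(x, a) for x, a in zip(f_forward(X, world), A_shards)])

# Backward: local weight grads, then f sums the per-rank input grads.
dY_shards = g_backward(dY, world)
dA_shards = [matmul(transpose(X), d) for d in dY_shards]
dX = f_backward([matmul(d, transpose(a)) for d, a in zip(dY_shards, A_shards)])

print(Y, dX, concat_cols(dA_shards))
```

The results agree with the single-device formulas Y = XA, dX = dY Aᵀ and dA = Xᵀ dY, which is the property the two autograd functions exist to preserve.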

77 THE CONFERENCE FOR AI INNOVATORS, TECHNOLOGISTS, AND CREATIVES
Join us at GTC 2021 on April 12 - 16 for the latest in AI, HPC, healthcare, game development, networking, and more. NVIDIA’s GTC brings together a global community of developers, researchers, engineers, and innovators to experience global innovation and collaboration. Don’t miss out on the exclusive GTC keynote by Jensen Huang on April 12, available to everyone. Visit www.nvidia.com/gtc to learn more and be notified when registration opens.
