
Survey: Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning

Survey of Zheng, Lianmin, et al. "Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning." arXiv preprint arXiv:2201.12023 (2022).

Nariaki Tateiwa

April 22, 2023

Transcript

  1. Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
    2022-12-05. Zheng, Lianmin, et al. "Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning." arXiv preprint arXiv:2201.12023 (2022).
  2. billion = 10^9
    [Slide shows Figure 1 of "Efficient large-scale language model training on GPU clusters using Megatron-LM."]
  3. [30] Woo-Yeon Lee, et al. Automating system configuration of distributed machine learning. In 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS), pages 2057–2067. IEEE, 2019.
    [31] Dmitry Lepikhin, et al. GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020.
    [17] Shiqing Fan, et al. DAPPLE: A pipelined data parallel approach for training large models. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 431–445, 2021.
    [38] Deepak Narayanan, et al. PipeDream: generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 1–15, 2019.
    [55] Minjie Wang, et al. Supporting very large models using automatic dataflow graph partitioning. In Proceedings of the Fourteenth EuroSys Conference 2019, pages 1–17, 2019.
  4. How does Alpa search for a parallelization plan? It works at two levels: (1) intra-operator parallelism, chosen by solving an integer linear program (ILP), and (2) inter-operator parallelism, chosen by dynamic programming (DP). The DP partitions the model into stages and the cluster into device meshes, and assigns each stage to a mesh (a simplified sketch of this two-level search follows the transcript).
    1: Stage partitioning in prior work: "PipeDream: generalized pipeline parallelism for DNN training.", "Automatic graph partitioning for very large-scale deep learning."
    2: Example in the slide: 4 XPUs, stages 1..M, 2 device meshes.
  5. Implementation: roughly 16K LoC of Python and 6K LoC of C++, built on Jax and XLA (footnote 1). Each device mesh is driven by worker processes launched via Ray, compilation goes through XLA, and cross-mesh communication uses NCCL (a hedged usage sketch follows the transcript).
    1: XLA (Accelerated Linear Algebra): https://github.com/tensorflow/tensorflow/tree/master/tensorflow/compiler/xla, Jax: https://github.com/google/jax, Ray: https://github.com/ray-project/ray, NCCL: https://github.com/NVIDIA/nccl
  6. Evaluation: training throughput (FLOPS) on models with tens of billions of parameters. "Intra" (intra-operator parallelism only) and "Inter" (inter-operator parallelism only) denote ablated variants of Alpa. Baselines: Megatron-LM v2 for GPT-3-style models, DeepSpeed for GShard MoE, and PP-DP (pipeline parallelism + data parallelism) for Wide-ResNet, which has no strong hand-tuned baseline (a hedged throughput calculation follows the transcript).
  7. Related and follow-up work: "On Optimizing the Communication of Model Parallelism." arXiv e-prints (2022): arXiv-2211.
    [17] DAPPLE: A pipelined data parallel approach for training large models, 2021.
    [38] PipeDream: generalized pipeline parallelism for DNN training, 2019.
    [55] Supporting very large models using automatic dataflow graph partitioning, 2019.
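
To make the two-level search on slide 4 concrete, here is a minimal Python sketch, not Alpa's actual code: an outer dynamic program slices the chain of layers into pipeline stages and assigns each stage a number of devices, while `intra_op_cost` stands in for the per-stage ILP with a toy cost model. The names (`intra_op_cost`, `best_plan`), the layer costs, and the simplified recursion are all my own; the real algorithm in the paper additionally enumerates the slowest-stage time to keep the DP optimal under the pipeline latency objective sum(t_i) + (B - 1) * max(t_i).

```python
# Minimal sketch (not Alpa's code) of the two-level plan search from slide 4.
from functools import lru_cache

NUM_LAYERS = 6          # the model is a chain of layers 0..NUM_LAYERS-1
NUM_DEVICES = 4         # devices to split into submeshes
NUM_MICROBATCHES = 8    # B in the pipeline latency formula

# toy per-layer compute costs (arbitrary units)
LAYER_COST = [2.0, 3.0, 1.0, 4.0, 2.0, 2.0]

def intra_op_cost(first, last, num_devices):
    """Stand-in for the intra-op ILP: estimated time of running layers
    [first, last] on a mesh of num_devices devices, assuming near-linear
    speedup plus a small per-device sharding overhead."""
    work = sum(LAYER_COST[first:last + 1])
    return work / num_devices + 0.1 * (num_devices - 1)

def latency(total, t_max):
    """Pipeline latency: sum of stage times + (B - 1) * slowest stage."""
    return total + (NUM_MICROBATCHES - 1) * t_max

@lru_cache(maxsize=None)
def best_plan(first, devices_left):
    """Inter-op DP (simplified): split layers[first:] over devices_left
    devices into stages; returns (sum_of_stage_times, max_stage_time, plan)."""
    if first == NUM_LAYERS:
        return (0.0, 0.0, ())
    best = None
    for last in range(first, NUM_LAYERS):        # this stage = layers[first..last]
        rest_layers = NUM_LAYERS - (last + 1)
        for n in range(1, devices_left + 1):     # devices given to this stage
            if rest_layers > 0 and devices_left - n < 1:
                continue                         # must keep devices for later stages
            t_stage = intra_op_cost(first, last, n)
            tail_sum, tail_max, tail_plan = best_plan(last + 1, devices_left - n)
            cand = (t_stage + tail_sum, max(t_stage, tail_max),
                    ((first, last, n),) + tail_plan)
            if best is None or latency(cand[0], cand[1]) < latency(best[0], best[1]):
                best = cand
    return best

total, t_max, plan = best_plan(0, NUM_DEVICES)
print("stages as (first_layer, last_layer, num_devices):", plan)
print("estimated pipeline latency:", latency(total, t_max))
```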
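
Slide 5 lists the software stack (Jax, XLA, Ray, NCCL) but not how it looks to a user. Below is a hedged sketch of typical usage, assuming Alpa's documented entry points `alpa.init` and `alpa.parallelize` roughly as I recall them; treat those two names, the `cluster="ray"` argument, and the toy model as assumptions rather than a verified example. The JAX pieces (`jax.grad`, `jax.tree_util.tree_map`) are standard.

```python
# Hedged sketch of slide 5's stack from the user's point of view.
import jax
import jax.numpy as jnp
import alpa

alpa.init(cluster="ray")  # assumption: Ray-backed device-mesh workers

def loss_fn(params, batch):
    x, y = batch
    pred = jnp.dot(x, params["w"]) + params["b"]
    return jnp.mean((pred - y) ** 2)

# Instead of jax.jit, the alpa.parallelize decorator is where the compiler
# applies the inter-/intra-op parallelization found by the DP + ILP search.
@alpa.parallelize
def train_step(params, batch):
    grads = jax.grad(loss_fn)(params, batch)
    return jax.tree_util.tree_map(lambda p, g: p - 1e-3 * g, params, grads)

params = {"w": jnp.zeros((8, 1)), "b": jnp.zeros((1,))}
batch = (jnp.ones((32, 8)), jnp.ones((32, 1)))
params = train_step(params, batch)
```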
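
For slide 6, "throughput (FLOPS)" is the headline metric. The snippet below shows one common way such a number is derived, using the rule-of-thumb 6 x parameters x tokens estimate of transformer training FLOPs; this is an approximation supplied for illustration, not necessarily the exact accounting used in the paper's plots, and the example numbers are made up.

```python
# Hedged sketch of how a "throughput (FLOPS)" figure like the one on slide 6
# can be derived. The 6 * parameters * tokens rule of thumb approximates the
# forward + backward FLOPs of one training iteration of a transformer.
def training_pflops(num_params: float, tokens_per_batch: float,
                    seconds_per_iteration: float) -> float:
    """Aggregate training throughput in PFLOPS."""
    flops_per_iteration = 6.0 * num_params * tokens_per_batch
    return flops_per_iteration / seconds_per_iteration / 1e15

# Illustrative numbers only (not measurements from the paper):
# a 39B-parameter GPT-style model, batch of 1024 sequences x 1024 tokens.
print(training_pflops(num_params=39e9,
                      tokens_per_batch=1024 * 1024,
                      seconds_per_iteration=100.0))
```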