Slide 1

Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
2022-12-05
Zheng, Lianmin, et al. "Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning." arXiv preprint arXiv:2201.12023 (2022).

Slide 2

Background / motivation: recent deep learning models such as GPT-3 have on the order of 100 billion parameters (1).

Slide 3

Growth of model sizes over time (billion = 10^9). (Figure 1 of "Efficient large-scale language model training on GPU clusters using Megatron-LM.")

Slide 4

Alpa's approach: automatically combine inter-operator (pipeline (1)) parallelism with intra-operator parallelism. The cluster is modeled as a 2D device mesh of N x M XPUs (1 XPU = 1 accelerator).
1: GPipe-style pipeline parallelism

Slide 5

(1/3) Example model: a ResNet-like network with weight matrices P*, Q*, R; its forward pass (F) begins by computing P1 X.

Slide 6

(2/3) Splitting the computation of the output Y and the weights W across devices (Stage 1).

Slide 7

(3/3) Splitting the model itself into stages placed on different devices.

Slide 8

[30] Woo-Yeon Lee, et al. Automating system configuration of distributed machine learning. In 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS), pages 2057–2067. IEEE, 2019.
[31] Dmitry Lepikhin, et al. GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020.
[17] Shiqing Fan, et al. DAPPLE: A pipelined data parallel approach for training large models. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 431–445, 2021.
[38] Deepak Narayanan, et al. PipeDream: Generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 1–15, 2019.
[55] Minjie Wang, et al. Supporting very large models using automatic dataflow graph partitioning. In Proceedings of the Fourteenth EuroSys Conference 2019, pages 1–17, 2019.

Slide 9

Problem: how should the model be sliced into stages, and how many XPUs should each stage be given (1)?
1: In Alpa, each stage is assigned its own group of XPUs and runs on those XPUs.

Slide 10

Alpa's solution: decompose the problem into two levels: (1) intra-operator parallelism, solved as an integer linear program (ILP), and (2) inter-operator parallelism, solved with dynamic programming (DP) (1). The DP slices the model into stages and assigns each stage a 2D device mesh (2).
1: DP-based pipeline partitioning follows prior work ("PipeDream: generalized pipeline parallelism for DNN training.", "Automatic graph partitioning for very large-scale deep learning.")
2: e.g., 4 XPUs assigned to a stage form an N x M 2D mesh.

Slide 11

Implementation: about 16K lines of Python and 6K lines of C++, built on Jax and XLA (1). Device mesh workers are implemented with Ray, each stage is compiled with XLA, and communication between XPUs uses NCCL.
1: XLA (Accelerated Linear Algebra): https://github.com/tensorflow/tensorflow/tree/master/tensorflow/compiler/xla, Jax: https://github.com/google/jax, Ray: https://github.com/ray-project/ray, NCCL: https://github.com/NVIDIA/nccl
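
The slides show no code, but as a rough sketch of how this stack is driven from user code: the open-source Alpa project wraps an ordinary JAX train step with a parallelize decorator and attaches to the Ray cluster at startup. The names alpa.init and alpa.parallelize below follow the project's public quick-start and should be read as assumptions about that API, not as content from these slides.

```python
import alpa          # assumed API, per the open-source project's quick-start
import jax
import jax.numpy as jnp

alpa.init(cluster="ray")   # attach to the Ray cluster hosting the device meshes

@alpa.parallelize          # Alpa chooses inter-/intra-operator parallelism here
def train_step(params, batch):
    def loss_fn(p):
        pred = batch["x"] @ p["w"]                      # toy linear model
        return jnp.mean((pred - batch["y"]) ** 2)
    grads = jax.grad(loss_fn)(params)
    return jax.tree_util.tree_map(lambda p, g: p - 0.1 * g, params, grads)

params = {"w": jnp.zeros((128, 1))}
batch = {"x": jnp.ones((64, 128)), "y": jnp.zeros((64, 1))}
params = train_step(params, batch)   # compiled, sharded, and pipelined by Alpa
```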

Slide 12

Evaluation: training throughput (FLOPS) on models with up to 70 billion parameters, against baselines that combine intra- and inter-operator parallelism:
- GPT-3 vs. Megatron-LM v2
- GShard MoE vs. DeepSpeed
- Wide-ResNet vs. PP-DP (a baseline combining pipeline parallelism and data parallelism)

Slide 13

Cluster: 8 nodes, 64 GPUs (8 GPUs/node). Each node is an Amazon EC2 p3.16xlarge instance with 8 NVIDIA V100 16GB GPUs, 64 vCPUs, and 488 GB memory.

Slide 14

Case study: the parallelization Alpa finds for Wide-ResNet on 4 (a) and 8 (b) GPUs; the dimension used for partitioning shifts away from the batch dimension as layers get deeper.

Slide 15

Throughput results: for GPT (Transformer-based), Alpa matches the specialized Megatron-LM. For GShard MoE, Alpa outperforms DeepSpeed by 3.5x on 2 nodes and 9.7x on 4 nodes. For Wide-ResNet, which has no hand-tuned system, Alpa still scales well.

Slide 16

= (GPT-3, 32) (1)

Slide 17

"On Optimizing the Communication of Model Parallelism." arXiv e-prints (2022): arXiv-2211.
[17] DAPPLE: A pipelined data parallel approach for training large models, 2021.
[38] PipeDream: generalized pipeline parallelism for DNN training, 2019.
[55] Supporting very large models using automatic dataflow graph partitioning, 2019.

Slide 18

No content

Slide 19

Details of Alpa's method: 1. how the pipeline executes, 2. how the model is sliced into stages and stages are assigned to devices, 3. how each stage is parallelized internally. The notion of a "stage" is central to step 2 (1).

Slide 20

Pipeline (inter-operator) parallelism: the model is split into stages placed on different devices, and each minibatch is split into 8 microbatches (a-h) that flow through the pipeline.

Slide 21

Each stage processes the microbatches one after another; while one stage works on a microbatch, the other stages can process different microbatches in parallel (1).
1: B = 8 microbatches in this example.
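
The footnote's B matters because the pipeline's end-to-end latency grows with the number of microbatches only through the slowest stage: one microbatch pays the sum of all stage latencies, and the remaining B - 1 microbatches each pay the latency of the slowest stage. A small helper to make that formula concrete; the example stage latencies are arbitrary.

```python
def pipeline_latency(stage_latencies, num_microbatches):
    """GPipe-style pipeline latency: sum(t_i) + (B - 1) * max(t_i).

    One microbatch must pass through every stage, and the slowest stage
    is re-entered once per remaining microbatch."""
    t = list(stage_latencies)
    return sum(t) + (num_microbatches - 1) * max(t)

# Example: 4 stages, B = 8 microbatches.
print(pipeline_latency([1.0, 1.2, 0.9, 1.1], 8))   # 4.2 + 7 * 1.2 = 12.6
```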

Slide 22

2. Slicing into stages: the operator graph is cut into contiguous groups of operators; consider a candidate cut that produces Stage A and the remaining operators.

Slide 23

2. For a candidate Stage A, how many XPUs (and which submesh shape) should Stage A be given?

Slide 24

2. The latencies of Stage A and Stage B on their assigned submeshes determine the overall pipeline latency (1).
1: The latency of a stage on a given submesh is obtained from the intra-operator pass.

Slide 25

2. Dynamic programming (DP): enumerate the possible stage boundaries and submesh assignments, discard infeasible ones (e.g., a stage that does not fit in its submesh's memory), and keep the partition that minimizes the pipeline latency.
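
To make this search concrete, here is a memoized brute-force sketch over a toy 6-block model: it slices the chain of blocks into stages, assigns each stage a candidate submesh size, and minimizes sum(t_i) + (B - 1) * max(t_i). The function intra_op_latency and all cost numbers are made-up stand-ins for the intra-operator pass; the actual Alpa DP stays polynomial by enumerating an upper bound t_max on the slowest stage and restricting submesh shapes, rather than carrying all candidates as done here.

```python
from functools import lru_cache

# Toy stand-ins: per-block cost of a 6-block model, a 4-XPU cluster,
# and B = 8 microbatches. All numbers are illustrative, not from the paper.
BLOCK_COST = [4.0, 4.0, 2.0, 2.0, 1.0, 1.0]
NUM_DEVICES = 4
NUM_MICROBATCHES = 8
SUBMESH_SIZES = (1, 2, 4)   # candidate submesh sizes for a stage

def intra_op_latency(lo, hi, n_devices):
    """Hypothetical stand-in for the intra-operator pass (the ILP):
    latency of blocks [lo, hi) on a submesh of n_devices XPUs.
    Here we simply assume perfect scaling."""
    return sum(BLOCK_COST[lo:hi]) / n_devices

@lru_cache(maxsize=None)
def split_candidates(lo, devices_left):
    """All (sum_of_stage_latencies, slowest_stage_latency) pairs reachable
    by slicing blocks [lo, end) into stages over exactly devices_left XPUs."""
    if lo == len(BLOCK_COST):
        return [(0.0, 0.0)] if devices_left == 0 else []
    out = []
    for hi in range(lo + 1, len(BLOCK_COST) + 1):       # first stage = [lo, hi)
        for n in SUBMESH_SIZES:
            if n > devices_left:
                continue
            t = intra_op_latency(lo, hi, n)
            for rest_sum, rest_max in split_candidates(hi, devices_left - n):
                out.append((t + rest_sum, max(t, rest_max)))
    return out

# Pipeline latency of the best partition: sum(t_i) + (B - 1) * max(t_i).
best = min(s + (NUM_MICROBATCHES - 1) * m
           for s, m in split_candidates(0, NUM_DEVICES))
print(best)
```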

Slide 26

Parallelism within a stage: suppose Stage A computes Y = PX, ...

Slide 27

Example: a stage computes Y = PX followed by Z = QY on 4 XPUs; each matrix multiplication can be sharded across the XPUs in several ways.

Slide 28

If the sharding chosen for Y = PX differs from the layout that Z = QY expects, Y must be re-distributed across the XPUs (communication).
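
A minimal JAX sketch of these two slides, assuming 4 devices arranged as a 2x2 mesh (on a CPU-only machine this can be emulated with XLA_FLAGS=--xla_force_host_platform_device_count=4). The axis names, matrix sizes, and the particular shardings are illustrative choices, not the strategies Alpa would pick.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Arrange 4 devices as a 2x2 logical mesh (axis names are illustrative).
mesh = Mesh(np.array(jax.devices()[:4]).reshape(2, 2), axis_names=("x", "y"))

# One possible strategy for Y = P @ X: shard P's rows over "x" and X's
# columns over "y"; each device can then compute its tile of Y locally.
Pmat = jax.device_put(jnp.ones((512, 256)), NamedSharding(mesh, P("x", None)))
X = jax.device_put(jnp.ones((256, 128)), NamedSharding(mesh, P(None, "y")))
Y = jax.jit(lambda p, x: p @ x)(Pmat, X)
print(Y.sharding)                      # Y ends up sharded over both mesh axes

# Z = Q @ Y: with Q's columns sharded over "x", the contraction dimension of
# Y is split across devices, so XLA/GSPMD must insert collective
# communication (e.g., an all-reduce of partial results) to form Z. This
# layout-dependent communication is what the resharding cost models.
Q = jax.device_put(jnp.ones((64, 512)), NamedSharding(mesh, P(None, "x")))
Z = jax.jit(lambda q, y: q @ y)(Q, Y)
print(Z.shape)                         # (64, 128)
```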

Slide 29

The strategy chosen for each operator is encoded with 0-1 (binary) decision variables.

Slide 30

Formulation as an ILP over the stage's computation graph G = (V, E):
v in V: an operator and its candidate parallelization strategies
s_v: 0-1 one-hot vector selecting v's strategy
c_v, d_v: compute and communication costs of v's strategies
R_ij for edge (i, j): resharding cost between the strategies chosen for i and j
Objective: minimize sum_v s_v^T (c_v + d_v) + sum_(i,j) s_i^T R_ij s_j
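
A tiny sketch that evaluates this objective by brute force for a two-operator stage (Y = PX and Z = QY). The three strategies per operator and all cost numbers are made up for illustration; the real system solves the same objective with an off-the-shelf ILP solver over the whole stage graph.

```python
import itertools
import numpy as np

# Toy stage graph: two operators, each with 3 candidate strategies
# (e.g., split rows / split cols / replicate). Costs are illustrative.
compute_cost = {                       # c_v + d_v per strategy
    "matmul_PX": np.array([2.0, 2.0, 5.0]),
    "matmul_QY": np.array([3.0, 2.5, 6.0]),
}
# reshard_cost[i, j]: cost of re-laying-out Y when PX picks strategy i
# and QY picks strategy j (the edge term R_ij).
reshard_cost = np.array([
    [0.0, 4.0, 1.0],
    [4.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
])

def total_cost(choice_px, choice_qy):
    """ILP objective: node costs plus the edge resharding cost."""
    return (compute_cost["matmul_PX"][choice_px]
            + compute_cost["matmul_QY"][choice_qy]
            + reshard_cost[choice_px, choice_qy])

best = min(itertools.product(range(3), range(3)),
           key=lambda c: total_cost(*c))
print(best, total_cost(*best))   # strategy pair with the minimum total cost
```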

Slide 31

Cross-mesh resharding: communicating tensors between the submesh of one stage and the submesh of the next.