Distributed and Parallel Training for PyTorch

Slide 1

Slide 1 text

Distributed and Parallel Training for PyTorch
2024.8.22 @tattaka_sun, GO株式会社

Slide 2

Slide 2 text

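Introduction
▪ With the rise of foundation models and similar trends, demand is growing for methods that train large models efficiently
▪ What you will learn from these slides
  ▪ The basic mechanics of distributed training in PyTorch
  ▪ Which distributed training methods exist
  ▪ When to use each distributed training method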

Slide 3

Slide 3 text

PyTorch Distributed Overview (layers listed from low level to high level)
▪ Communication backend: Gloo, Open MPI, NCCL
▪ Communication APIs (C10D): send, recv, broadcast, all_reduce, reduce, all_gather, gather, scatter, reduce_scatter, all_to_all, barrier
▪ Sharding primitives: DTensor, DeviceMesh
▪ Parallelism APIs: Data-Parallel, Distributed Data-Parallel, Fully Sharded Data-Parallel, ZeRO series (DeepSpeed etc.), Tensor Parallel, Pipeline Parallel
▪ Launcher: torchrun (Elastic Launch), torch.distributed.launch

Slide 5

Slide 5 text

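How Distributed Communication Works in PyTorch
▪ Processes exchange information with one another
▪ Each process is assigned a number (rank), and rank=0 is treated as the master
Figure: Machine 1 runs Process 1 (Rank 0) and Process 2 (Rank 1); Machine 2 runs Process 1 (Rank 2) and Process 2 (Rank 3); all processes are connected through the Distributed Communications Backend.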
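
The rank layout in the figure follows from simple arithmetic. The sketch below assumes every machine runs the same number of processes; `global_rank`, `is_master`, `node_rank`, `local_rank` and `nproc_per_node` are illustrative names chosen here, not taken from the slides.

```python
def global_rank(node_rank: int, local_rank: int, nproc_per_node: int) -> int:
    """Global rank of a process, assuming each machine runs nproc_per_node processes."""
    if not 0 <= local_rank < nproc_per_node:
        raise ValueError("local_rank must be in [0, nproc_per_node)")
    return node_rank * nproc_per_node + local_rank


def is_master(rank: int) -> bool:
    """Rank 0 is treated as the master process."""
    return rank == 0


def describe_layout(num_machines: int, nproc_per_node: int) -> list:
    """One line per process, numbered from 1 as in the figure."""
    lines = []
    for node in range(num_machines):
        for local in range(nproc_per_node):
            rank = global_rank(node, local, nproc_per_node)
            lines.append(f"Machine {node + 1}, Process {local + 1} -> Rank {rank}")
    return lines
```

Calling `describe_layout(2, 2)` lists the four processes of the figure in rank order.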

Slide 6

Slide 6 text

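Available Distributed Communications Backends
▪ In PyTorch, the backend used for distributed communication can be chosen from the following three
  ▪ Gloo
    ▪ Implements communication on CPU and a subset of communication on GPU
  ▪ NCCL
    ▪ Implements optimized communication on GPU
    ▪ Faster than Gloo on GPU
  ▪ Open MPI
    ▪ Not included in the prebuilt packages, so PyTorch has to be built from source
    ▪ The two backends above are usually sufficient, so it is not used without a specific reason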

Slide 7

Slide 7 text

Available Distributed Communications Backends
▪ Each backend supports a different set of operations
▪ The main operations are described later
https://pytorch.org/docs/stable/distributed.html

Slide 8

Slide 8 text

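Setting Up the Distributed Environment
▪ Initialize with torch.distributed.init_process_group
▪ Arguments:
  ▪ rank: rank of the current process
  ▪ world_size: total number of processes
  ▪ backend: which library to use for distributed communication; by default gloo (CPU) and nccl (GPU) are used together
▪ In addition, the following environment variables must be set
  ▪ MASTER_PORT
  ▪ MASTER_ADDR
  ▪ (RANK and WORLD_SIZE can also be set this way, in which case they do not need to be passed to init_process_group)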

Slide 9

Slide 9 text

Setting Up the Distributed Environment
▪ Implementation example
https://pytorch.org/docs/stable/distributed.html

Slide 11

Slide 11 text

Communication APIs (C10D)
▪ Point-to-Point Communication
  ● send
  ● recv (receive)
https://pytorch.org/tutorials/intermediate/dist_tuto.html

Slide 12

Slide 12 text

Communication APIs (C10D)
▪ Collective Communication
  ▪ Communication that involves all ranks
https://pytorch.org/tutorials/intermediate/dist_tuto.html
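
To make the semantics of a few collectives concrete, the sketch below simulates them in plain Python, without calling PyTorch. Each inner list stands for the tensor held by one rank, and the reduction is assumed to be a sum. The function names mirror the torch.distributed operations, but the signatures are simplified for illustration and are not the library's API.

```python
from typing import List

# data[r] is the tensor (as a list of numbers) held by rank r.
RankData = List[List[int]]


def broadcast(data: RankData, src: int) -> RankData:
    """Every rank ends up with a copy of the src rank's tensor."""
    return [list(data[src]) for _ in data]


def all_reduce(data: RankData) -> RankData:
    """Every rank ends up with the element-wise sum over all ranks."""
    total = [sum(column) for column in zip(*data)]
    return [list(total) for _ in data]


def all_gather(data: RankData) -> List[RankData]:
    """Every rank ends up with the list of all ranks' tensors."""
    return [[list(chunk) for chunk in data] for _ in data]


def reduce_scatter(data: RankData) -> RankData:
    """Element-wise sum, after which rank r keeps only the r-th equal chunk."""
    world_size = len(data)
    total = [sum(column) for column in zip(*data)]
    if len(total) % world_size != 0:
        raise ValueError("tensor length must be divisible by the number of ranks")
    chunk = len(total) // world_size
    return [total[r * chunk:(r + 1) * chunk] for r in range(world_size)]
```

For two ranks holding `[1, 2]` and `[3, 4]`, `all_reduce` leaves `[4, 6]` on both ranks, while `reduce_scatter` leaves `[4]` on rank 0 and `[6]` on rank 1.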

Slide 14

Slide 14 text

DeviceMesh
▪ With multi-host, multi-GPU setups, the configuration tends to get complicated 🤯
Figure: Host 1 with GPU 0 to GPU 3 and Host 2 with GPU 0 to GPU 3.
https://pytorch.org/tutorials/recipes/distributed_device_mesh.html

Slide 15

Slide 15 text

DeviceMesh
▪ DeviceMesh lets you handle two-dimensional ProcessGroups (subsets of processes) in an abstract way
https://pytorch.org/tutorials/recipes/distributed_device_mesh.html
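
As an illustration of what a two-dimensional mesh abstracts, the sketch below computes the rank subsets for the 2-host, 4-GPU layout of the previous slide in plain Python. It does not call PyTorch (the linked recipe builds a real mesh with `init_device_mesh`); `build_mesh` and `groups_along_dim` are hypothetical helpers, and ranks are assumed to be numbered host by host.

```python
from typing import List


def build_mesh(num_hosts: int, gpus_per_host: int) -> List[List[int]]:
    """Arrange global ranks in a (num_hosts x gpus_per_host) grid, one row per host."""
    return [
        [host * gpus_per_host + gpu for gpu in range(gpus_per_host)]
        for host in range(num_hosts)
    ]


def groups_along_dim(mesh: List[List[int]], dim: int) -> List[List[int]]:
    """Rank subsets obtained by varying the mesh coordinate along dim.

    dim=1 varies the GPU index: one group per host (intra-host).
    dim=0 varies the host index: one group per GPU slot (inter-host).
    """
    if dim == 1:
        return [list(row) for row in mesh]
    if dim == 0:
        return [list(col) for col in zip(*mesh)]
    raise ValueError("a 2-D mesh only has dims 0 and 1")
```

Without such an abstraction, each of these subsets would have to be created by hand as its own process group.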

Slide 16

Slide 16 text

DTensor (Prototype Release)
▪ Distributed tensor
▪ Tensors and Modules can be placed across processes according to the DeviceMesh described above
https://github.com/pytorch/pytorch/tree/main/torch/distributed/_tensor

Slide 18

Slide 18 text

Parallelism APIs
▪ APIs that abstract the low-level operations described so far and parallelize an nn.Module
  ▪ Data-Parallel (DP)
  ▪ Distributed Data-Parallel (DDP)
  ▪ Tensor Parallel (TP)
  ▪ Pipeline Parallel (PP)
  ▪ Fully Sharded Data-Parallel (FSDP)
  ▪ ZeRO (implemented in third-party libraries such as DeepSpeed and FairScale)

Slide 19

Slide 19 text

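Data-Parallel (DP)
▪ Copies the model to each GPU, splits the batch, runs training and backpropagation, and synchronizes the parameters across GPUs after the update
▪ A single process manages all GPUs, so the implementation is simple
▪ Because of its large overhead, it is no longer recommended
Figure: one Dataloader produces a batch that is split across Model0, Model1 and Model2 on GPU:0, GPU:1 and GPU:2; the outputs are gathered for the loss calculation, gradients are computed and gathered, and the model parameters are distributed again after the update.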

Slide 20

Slide 20 text

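Distributed Data-Parallel (DDP)
▪ A separate process owns the pipeline on each GPU
▪ Unlike DP, the only communication between GPUs is the gathering and distribution of gradients
Figure: GPU:0, GPU:1 and GPU:2 each have their own Dataloader, batch, model (Model0, Model1, Model2) and loss calculation; gradients are computed, gathered and distributed across GPUs, and each GPU updates its own model parameters.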

Slide 21

Slide 21 text

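Distributed Data-Parallel (DDP)
▪ Set up the distributed environment and wrap the model with DDP()
▪ Save checkpoints from a single process only (process 1, i.e. rank 0); in the linked tutorial, every process then loads that file with a suitable map_location
https://pytorch.org/tutorials/intermediate/ddp_tutorial.html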
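
The two points above can be sketched as a single training step. This is a minimal example under stated assumptions: the gloo backend on CPU, a toy `nn.Linear` model, random data, and placeholder MASTER_ADDR and MASTER_PORT values. `train_one_step` and `ckpt_path` are illustrative names, and the linked tutorial remains the authoritative version.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def train_one_step(rank: int, world_size: int, ckpt_path: str) -> float:
    """Run one DDP training step on CPU and save a checkpoint from rank 0."""
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # placeholder
    os.environ.setdefault("MASTER_PORT", "29500")      # placeholder
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    try:
        # CPU model; on GPU, move it to the device and pass device_ids=[local_rank].
        model = DDP(nn.Linear(10, 1))
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

        # Toy data; a real script would give each rank its own shard of the dataset.
        inputs, targets = torch.randn(8, 10), torch.randn(8, 1)
        loss = nn.functional.mse_loss(model(inputs), targets)

        optimizer.zero_grad()
        loss.backward()   # DDP averages the gradients across ranks during backward
        optimizer.step()  # every rank applies the same update locally

        if rank == 0:
            # Only one process writes the checkpoint.
            torch.save(model.module.state_dict(), ckpt_path)
        dist.barrier()    # other ranks wait until the file has been written
        return loss.item()
    finally:
        dist.destroy_process_group()
```

Each process runs this function with its own rank. With `world_size=1` it degenerates to ordinary single-process training, which is convenient for a quick check.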

Slide 22

Slide 22 text

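Tensor Parallel (TP)
▪ Splits individual tensors across GPUs
▪ Proposed in Megatron-LM
▪ A matrix multiplication can be split along either rows or columns, and the partial results are combined afterwards
https://arxiv.org/abs/1909.08053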
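
The claim that a matrix product can be split in either direction can be checked on small matrices. The sketch below uses plain Python lists and simulates the devices with a loop; the helper names are chosen here for illustration, and the dimension being split is assumed to be divisible by the number of parts.

```python
from typing import List

Matrix = List[List[int]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Reference (unsplit) matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(m: Matrix, parts: int) -> List[Matrix]:
    width = len(m[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in m] for p in range(parts)]


def split_rows(m: Matrix, parts: int) -> List[Matrix]:
    height = len(m) // parts
    return [m[p * height:(p + 1) * height] for p in range(parts)]


def column_parallel(x: Matrix, w: Matrix, parts: int) -> Matrix:
    """Each device holds a column block of W; outputs are concatenated (an all_gather)."""
    partials = [matmul(x, w_p) for w_p in split_columns(w, parts)]
    return [sum((p[i] for p in partials), []) for i in range(len(x))]


def row_parallel(x: Matrix, w: Matrix, parts: int) -> Matrix:
    """Each device holds a row block of W and the matching columns of X; outputs are summed (an all_reduce)."""
    partials = [
        matmul(x_p, w_p)
        for x_p, w_p in zip(split_columns(x, parts), split_rows(w, parts))
    ]
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]
```

Both variants return the same result as `matmul(x, w)`. They differ in how the partial results are combined: by concatenation in the column split and by summation in the row split.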

Slide 23

Slide 23 text

Pipeline Parallel (PP)
▪ Splits the model into several units and places them on separate processes
▪ To reduce the time the other GPUs sit idle while one GPU is working, the batch is split into chunks
Figure: the model's modules 1 to 6 are distributed across process 1/GPU:0, process 2/GPU:1 and process 3/GPU:2.
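
Why chunking the batch reduces idle time can be seen with a small simulation. The sketch below models only the forward pass of a simple fill-and-drain schedule and assumes every stage takes one time step per chunk; it is a simplification made for illustration and not the schedule of any particular library. `forward_schedule` and `idle_fraction` are hypothetical helpers.

```python
from fractions import Fraction
from typing import List, Optional


def forward_schedule(num_stages: int, num_chunks: int) -> List[List[Optional[int]]]:
    """timeline[t][s] is the chunk that stage s processes at step t, or None if idle."""
    total_steps = num_stages + num_chunks - 1
    timeline = []
    for t in range(total_steps):
        row = []
        for s in range(num_stages):
            chunk = t - s  # chunk c reaches stage s at step c + s
            row.append(chunk if 0 <= chunk < num_chunks else None)
        timeline.append(row)
    return timeline


def idle_fraction(num_stages: int, num_chunks: int) -> Fraction:
    """Share of (step, stage) slots in which a stage has nothing to do."""
    timeline = forward_schedule(num_stages, num_chunks)
    slots = [cell for row in timeline for cell in row]
    return Fraction(slots.count(None), len(slots))
```

With 3 stages, sending the batch as a single chunk leaves 2/3 of the slots idle, while 4 chunks bring this down to 1/3 and 8 chunks to 1/5.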

Slide 24

Slide 24 text

Fully Sharded Data-Parallel Training (FSDP)
▪ Model parameters and gradients are sharded across GPUs
▪ During forward and backward, computation proceeds one predetermined unit at a time, which saves memory
https://arxiv.org/abs/2304.11277
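
The sharding idea can be sketched without PyTorch. In the simulation below, a unit's parameters are flattened, padded and split evenly across ranks, then gathered in full only when the unit is computed. Gradients are reduced and immediately re-sharded. The function names are hypothetical, and details of real FSDP such as gradient averaging, communication overlap and optimizer state are left out.

```python
from typing import List


def shard(flat_params: List[float], world_size: int) -> List[List[float]]:
    """Pad the flattened parameters and give each rank one equal-sized shard."""
    shard_len = -(-len(flat_params) // world_size)  # ceiling division
    padded = flat_params + [0.0] * (shard_len * world_size - len(flat_params))
    return [padded[r * shard_len:(r + 1) * shard_len] for r in range(world_size)]


def all_gather_unit(shards: List[List[float]], numel: int) -> List[float]:
    """Rebuild the full parameters of one unit just before it is computed."""
    full = [value for s in shards for value in s]
    return full[:numel]  # drop the padding


def reduce_scatter_grads(per_rank_grads: List[List[float]], world_size: int) -> List[List[float]]:
    """Sum the full gradients over ranks, then keep only each rank's own shard."""
    summed = [sum(values) for values in zip(*per_rank_grads)]
    return shard(summed, world_size)
```

Between units, each rank stores only its own shard, which is roughly 1/world_size of the unit's parameters and gradients.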

Slide 25

Slide 25 text

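Choosing Between Them
▪ The model fits on one GPU
  ▪ DP (not recommended)
  ▪ DDP
▪ The model does not fit on one GPU
  ▪ You want to split it finely, operation by operation
    ▪ TP
  ▪ You want to split it finely, stage by stage
    ▪ PP
  ▪ You want to leave the splitting to PyTorch
    ▪ FSDP
      ▪ size_based_auto_wrap_policy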
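
The decision tree on this slide is small enough to write down as a function. This is only a restatement of the slide in code; `choose_strategy` and the `split_by` values are hypothetical names introduced here.

```python
def choose_strategy(fits_on_one_gpu: bool, split_by: str = "auto") -> str:
    """Map the slide's decision tree to a strategy name.

    split_by is only consulted when the model does not fit on one GPU:
    "operation" -> TP, "stage" -> PP, "auto" (let PyTorch decide) -> FSDP.
    """
    if fits_on_one_gpu:
        return "DDP"  # DP also applies here but is not recommended
    choices = {"operation": "TP", "stage": "PP", "auto": "FSDP"}
    if split_by not in choices:
        raise ValueError(f"split_by must be one of {sorted(choices)}")
    return choices[split_by]
```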

Slide 27

Slide 27 text

torchrun / torch.distributed.launch
▪ A feature for launching a script as multiple processes
▪ On launch, environment variables such as RANK are set and can be read from the script
Figure panels: shell, training script.
https://pytorch.org/docs/stable/elastic/run.html
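
On the training-script side, reading the launcher's variables might look like the sketch below. RANK, LOCAL_RANK and WORLD_SIZE are variables that torchrun sets. The `DistEnv` container, the `read_dist_env` helper and the single-process fallback values are assumptions made here for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DistEnv:
    rank: int        # global rank of this process
    local_rank: int  # rank within the current machine
    world_size: int  # total number of processes


def read_dist_env(env: Mapping[str, str] = os.environ) -> DistEnv:
    """Read the launcher-provided variables, falling back to a single-process setup."""
    return DistEnv(
        rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
    )
```

A script using this helper could be started with, for example, `torchrun --nproc_per_node=2 train.py`, where `train.py` is a placeholder file name. Run directly with `python`, the same script falls back to rank 0 of 1.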

Slide 28

Slide 28 text

When Using PyTorch Lightning
▪ Simply passing the strategy argument to Trainer enables distributed training without modifying single-GPU training code
  ● Strategies other than DDP can also be specified (FSDP, DeepSpeed, etc.)
https://lightning.ai/docs/pytorch/stable/accelerators/gpu_intermediate.html

Slide 29

Slide 29 text

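Summary
▪ This deck gave an overview of distributed training in PyTorch
▪ If the model fits on one GPU, use DDP; if it has to be split across GPUs, FSDP is a fine choice
▪ Third-party libraries such as DeepSpeed, which could not be covered here, are left for another occasion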

Slide 30

Slide 30 text

Aside: Other DL Frameworks
▪ TensorFlow
  ▪ https://www.tensorflow.org/guide/distributed_training?hl=ja
  ▪ Implements equivalents of what PyTorch calls DP, DDP, TP, and so on
▪ JAX
  ▪ https://jax.readthedocs.io/en/latest/multi_process.html
  ▪ Low-level APIs are provided, and it seems you need to implement things yourself as needed?