Amazon SageMaker Distributed Training Libraries: New Data-Parallel and Model-Parallel Distributed Training

Masaki Samejima, Solutions Architect, Amazon Web Services Japan K.K.
© 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved.

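Background
• As datasets grow and machine learning models get larger, training takes more and more time.
  Example: training times on an ml.p3dn.24xlarge instance with eight NVIDIA V100 GPUs
  • Mask R-CNN (object detection model) on the COCO dataset: more than 6 hours
  • BERT (natural language processing model) pre-training: more than 100 hours
• Models so large that they are difficult to load into the memory of a single GPU are also being developed.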

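SageMaker Distributed Training Library
• Announced in December 2020 as SageMaker's own high-efficiency distributed training library
• Supports GPU training with PyTorch and TensorFlow
• Data parallelism (SageMaker Data Parallelism; SDP): the dataset is split, gradients are computed in parallel on multiple GPUs, and the gradients are finally combined into one to update the model.
• Model parallelism (SageMaker Model Parallelism; SMP): the model is split and placed on multiple GPUs, and the computation of each part is parallelized as much as possible, which distributes the required memory and computation.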
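The data-parallel idea above can be illustrated without any GPU. The following is a minimal sketch in plain Python, not the SDP implementation: the 1-D linear model, the function names `grad_mse` and `data_parallel_grad`, and the toy data are all assumptions made for illustration. It shows that averaging gradients computed on equal-sized shards gives the same result as computing the gradient on the whole batch.

```python
# Minimal sketch of synchronous data parallelism on a 1-D linear model y_hat = w * x.
# The "workers" are plain Python loops; no GPUs or SageMaker APIs are involved.

def grad_mse(w, batch):
    """Gradient of the mean squared error of y_hat = w * x over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def data_parallel_grad(w, batch, num_workers):
    """Split the batch into equal shards, compute per-shard gradients, average them."""
    shard_size = len(batch) // num_workers
    shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(num_workers)]
    local_grads = [grad_mse(w, shard) for shard in shards]  # would run in parallel
    return sum(local_grads) / num_workers                   # aggregation step

data = [(float(x), 3.0 * x) for x in range(1, 9)]  # 8 toy samples of y = 3x
w = 0.0
full = grad_mse(w, data)                            # single-device gradient
parallel = data_parallel_grad(w, data, num_workers=4)
print(full, parallel)
```

The equality holds here because every shard has the same size; with uneven shards the per-shard gradients would need to be weighted by shard size.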

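Distributed training with SageMaker data parallelism
I. Thangakrishnan, et al., "Herring: Rethinking the Parameter Server at Scale for the Cloud," SC20
https://assets.amazon.science/ba/69/0a396bd3459294ad940a705ad7f5/herring-rethinking-the-parameter-server-at-scale-for-the-cloud.pdf
• Adopts Herring, the distributed training algorithm Amazon presented at SC20
• Uses a parameter server design rather than the All-Reduce design that had become the trend
• Parameter Server
  • Data is distributed across multiple workers, each worker computes gradients, and the gradients are aggregated on the parameter server.
• All-Reduce
  • Data is distributed across multiple workers, each worker computes gradients, and all workers share the gradients with one another.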

Why parameter servers?
https://d2l.ai/chapter_computational-performance/parameterserver.html
• Gradients can be aggregated in two hops no matter how many workers are added.
• With a suitable gradient aggregation scheme, training can be performed asynchronously.
• Each worker needs less bandwidth.
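The two-hop claim can be put next to an all-reduce variant with a small counting sketch. The slide does not say which all-reduce algorithm it has in mind; ring all-reduce is assumed here only as a common point of comparison (tree-based variants scale differently), and both function names are hypothetical.

```python
def parameter_server_steps(num_workers):
    """Push gradients to the servers, then pull updated parameters: 2 hops for any N."""
    return 2

def ring_allreduce_steps(num_workers):
    """Ring all-reduce: (N - 1) reduce-scatter steps plus (N - 1) all-gather steps."""
    return 2 * (num_workers - 1)

for n in (2, 8, 64, 256):
    print(f"workers={n:4d}  parameter server={parameter_server_steps(n)}  "
          f"ring all-reduce={ring_allreduce_steps(n)}")
```

This counts sequential communication steps only; it says nothing about the amount of data moved per step or about the bandwidth the servers themselves need.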

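Challenges with parameter servers
• When there are multiple parameter servers, each one holds a different set of parameters (weights).
• Because parameters differ in size, the amount of data each parameter server receives is unbalanced.
• A parameter server is active only while the parameters it manages are being computed.
[Figure: parameter servers and the parameters assigned to them. Each parameter server is active only during certain time windows, and parameter sizes vary widely.]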

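Balanced Fusion Buffers (BFB)
• A parameter buffer (BFB) spanning all parameter servers is allocated, and the parameters held in it are split into equal parts, one per parameter server.
• Each worker stores parameters at predetermined positions in the BFB (the paper also proposes an efficient way to do this).
• Placing a proxy parameter server in front of the parameter servers improves parallel connectivity.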
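The difference between assigning whole parameters to servers and slicing one fused buffer equally can be sketched as follows. This is an illustration of the load-balancing idea only, not Herring's implementation; the parameter names and sizes are made up, and the round-robin baseline is an assumed stand-in for "each server holds different parameters".

```python
# Hypothetical parameter sizes (number of elements) for a small model.
param_sizes = {"embedding": 9000, "layer1": 600, "layer2": 300, "bias": 100}

def shard_by_parameter(sizes, num_servers):
    """Baseline: each whole parameter goes to one server (round-robin)."""
    load = [0] * num_servers
    for i, size in enumerate(sizes.values()):
        load[i % num_servers] += size
    return load

def shard_by_fused_buffer(sizes, num_servers):
    """BFB-style: treat all parameters as one flat buffer and cut it into equal slices."""
    total = sum(sizes.values())
    base, extra = divmod(total, num_servers)
    return [base + (1 if s < extra else 0) for s in range(num_servers)]

def buffer_offsets(sizes):
    """Fixed start offset of every parameter inside the flat buffer."""
    offsets, cursor = {}, 0
    for name, size in sizes.items():
        offsets[name] = cursor
        cursor += size
    return offsets

print(shard_by_parameter(param_sizes, 2))     # unbalanced per-server load
print(shard_by_fused_buffer(param_sizes, 2))  # equal per-server load
print(buffer_offsets(param_sizes))            # where each worker writes each parameter
```

Because every parameter has a fixed offset in the buffer, a worker knows where to write a gradient as soon as it is computed, and each server receives the same number of elements regardless of how individual parameter sizes differ.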

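Distributed training with SageMaker model parallelism
https://www.amazon.science/latest-news/the-science-of-amazon-sagemakers-distributed-training-engines
• Model partitioning is a known way to train models that do not fit on a single GPU, but deciding where to split is a hard problem.
• The split is decided from two perspectives:
  • The partitions should have equal amounts of computation.
  • Data exchange between partitions should be kept to a minimum.
• If the model is split into, for example, A and B and the computation runs sequentially as A→B, execution takes a long time. The partitions should be computed in parallel as much as possible.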
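The two criteria above can be combined into a single score for a toy case. The sketch below handles only a linear chain of layers split into two contiguous parts, and scores each cut as compute imbalance plus a weighted transfer cost. The scoring formula, the function name `best_cut`, and all numbers are assumptions for illustration; this is not the algorithm the SageMaker library uses.

```python
def best_cut(compute_cost, boundary_cost, comm_weight=1.0):
    """Return (k, score): layers [0:k] go to GPU A and layers [k:] to GPU B.

    compute_cost[i]  : cost of layer i (arbitrary units)
    boundary_cost[i] : cost of sending activations from layer i to layer i + 1
    """
    total = sum(compute_cost)
    best_k, best_score = None, float("inf")
    left = 0.0
    for k in range(1, len(compute_cost)):
        left += compute_cost[k - 1]
        imbalance = abs(left - (total - left))                  # criterion 1
        score = imbalance + comm_weight * boundary_cost[k - 1]  # plus criterion 2
        if score < best_score:
            best_k, best_score = k, score
    return best_k, best_score

compute_cost = [4.0, 4.0, 2.0, 6.0]   # four hypothetical layers
boundary_cost = [8.0, 1.0, 3.0]       # transfer cost between consecutive layers
print(best_cut(compute_cost, boundary_cost))
```

In this example the cut after the second layer wins because it balances the compute exactly and also crosses the cheapest boundary; raising `comm_weight` shifts the choice toward cheap boundaries at the expense of balance.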

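Automatic model partitioning
https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-core-features.html
• The model is loaded on the CPU, which has more memory, and is analyzed there and then partitioned. This adds some overhead at the start of training.
• TensorFlow builds a DAG, so partitioning is based on that DAG. PyTorch has no equivalent, so the library builds a tree and partitions based on it.
• Once the model has been partitioned on the CPU, the partitions are loaded onto the GPUs and training runs.
[Figure: a TensorFlow DAG]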

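Pipelining
https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-core-features.html
• If the model is simply split (X on GPU0 and Y on GPU1) and computed as X→Y:
  • X and Y still take the same computation time as before the split.
  • Exchanging data between X and Y takes additional time.
  → As a result, total execution time increases.
• Each mini-batch is split further into micro-batches so that X and Y are computed as efficiently as possible → Pipelining
[Figure: interleaved pipeline]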
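The effect of micro-batches can be seen in a simplified timing model. The sketch below is a forward-only pipeline with equal stage times and zero communication cost, which are assumptions made to keep the arithmetic visible; it does not model the interleaved schedule shown in the figure, and the function names are hypothetical.

```python
def pipeline_schedule(num_stages, num_microbatches):
    """For each time slot, list the micro-batch each stage works on (None = idle)."""
    slots = []
    for t in range(num_microbatches + num_stages - 1):
        row = []
        for s in range(num_stages):
            mb = t - s  # stage s starts micro-batch mb one slot after stage s - 1
            row.append(mb if 0 <= mb < num_microbatches else None)
        slots.append(row)
    return slots

def forward_time(stage_time, num_stages, num_microbatches):
    """Total forward time when one full mini-batch takes stage_time on each stage."""
    slot_time = stage_time / num_microbatches
    return (num_microbatches + num_stages - 1) * slot_time

print(forward_time(1.0, num_stages=2, num_microbatches=1))  # no pipelining
print(forward_time(1.0, num_stages=2, num_microbatches=4))  # 4 micro-batches
for row in pipeline_schedule(2, 4):
    print(row)  # [micro-batch on GPU0 (X), micro-batch on GPU1 (Y)]
```

Under these assumptions two stages without micro-batches take 2.0 time units, while four micro-batches bring this down to 1.25, because GPU1 starts on the first micro-batch while GPU0 is already processing the second.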

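Parameters for SageMaker model-parallel training
https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_general.html
partitions (int): number of partitions to split the model into
microbatches (int): number of micro-batches to split each mini-batch into (1 means no splitting)
pipeline (interleaved or simple): interleaved mixes forward and backward passes for speed; simple runs them in separate blocks
optimize (memory or speed): memory partitions based on the number of variables; speed partitions based on the amount of computation
placement_strategy (cluster or spread): when the partitioned model is trained on multiple GPUs, place the partitions on nearby GPUs (cluster) or on distant GPUs (spread)
auto_partition: whether to use automatic partitioning
default_partition: when automatic partitioning is off, operations that are not explicitly assigned to a partition are assigned to default_partition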
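A configuration using these parameters might look like the sketch below. The parameter names come from the list above; the values are illustrative only. The nesting under `smdistributed` / `modelparallel` and the `mpi` block follow the form that, to the best of my knowledge, the SageMaker Python SDK accepts as the `distribution` argument of a framework estimator, but treat that structure as an assumption and check it against the linked documentation. The helper `check_parameters` is hypothetical and only validates the value sets named on the slide.

```python
# Illustrative values; only the parameter names are taken from the slide.
smp_parameters = {
    "partitions": 2,
    "microbatches": 4,
    "pipeline": "interleaved",
    "optimize": "speed",
    "placement_strategy": "cluster",
    "auto_partition": True,
}

# Assumed shape of the estimator's `distribution` argument (verify against the docs).
distribution = {
    "smdistributed": {"modelparallel": {"enabled": True, "parameters": smp_parameters}},
    "mpi": {"enabled": True, "processes_per_host": 8},  # 8 GPUs, as on ml.p3dn.24xlarge
}

ALLOWED = {
    "pipeline": {"interleaved", "simple"},
    "optimize": {"memory", "speed"},
    "placement_strategy": {"cluster", "spread"},
}

def check_parameters(params):
    """Hypothetical helper: return a list of problems (an empty list means none found)."""
    problems = []
    for key, allowed in ALLOWED.items():
        if key in params and params[key] not in allowed:
            problems.append(f"{key} must be one of {sorted(allowed)}")
    if params.get("microbatches", 1) < 1:
        problems.append("microbatches must be >= 1")
    return problems

print(check_parameters(smp_parameters))
```

Keeping the parameters in a separate dictionary makes it easy to validate or sweep them before a training job is launched.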

Part of the implementation for model parallelism (PyTorch)
https://github.com/aws/amazon-sagemaker-examples/tree/master/training/distributed_training/pytorch/model_parallel/bert

    # Import the library
    import smdistributed.modelparallel.torch as smp

    # Convert the model into a model for distributed (model-parallel) training
    model = smp.DistributedModel(model)

    # Set up the optimizer for distributed training
    optimizer = smp.DistributedOptimizer(optimizer)

    # Define the loss computation
    @smp.step
    def smp_step(args, device, input_ids, ... criterion, step):
        # compute the loss
        return loss

    # Run distributed training in the training loop
    for step, batch in enumerate(train_iter):
        loss_mbs = smp_step(args, device, input_ids,..., model, optimizer, criterion, step)
        loss = loss_mbs.reduce_mean()

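Data parallelism vs. model parallelism
https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training.html
• Choose data parallelism first. If data parallelism is feasible, it also performs better.
• Consider model parallelism when the model does not fit in memory. Before actually adopting model parallelism, however, also consider the following:
  • Changing hyperparameters
    • Reduce the batch size. If memory runs out even with a batch size of 1, consider model parallelism.
  • Other options
    • On NVIDIA Tensor Core GPUs, mixed-precision training can save memory.
    • For natural language processing, shorten the sentence length used.
    • For image recognition, lower the image resolution.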
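The decision order above can be written down as a small helper. This is only a restatement of the slide's guidance in code; the function name, its boolean inputs, and the returned strings are hypothetical, and in practice each input would come from actually trying a training run.

```python
def choose_strategy(fits_at_current_batch, fits_at_batch_size_1,
                    fits_after_other_savings=False):
    """Pick a distribution strategy following the order given on the slide.

    fits_after_other_savings: True if mixed precision, shorter sequences, or
    lower image resolution make the model fit in GPU memory.
    """
    if fits_at_current_batch:
        return "data parallel"
    if fits_at_batch_size_1 or fits_after_other_savings:
        return "data parallel (after reducing memory use)"
    return "model parallel"

print(choose_strategy(True, True))
print(choose_strategy(False, True))
print(choose_strategy(False, False))
```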

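Summary
• When training on large datasets or large models takes a long time, or the model cannot be handled within GPU memory, distributed training can make it more efficient.
• Distributed training comes in two forms, data parallelism and model parallelism. Choose data parallelism by default, and try model parallelism when the model does not fit in memory.
• The SageMaker libraries automate the complex processing involved in distributed training, so distributed training can be implemented with little code.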
