Slide 3
Why Distributed Training?
Computationally expensive ML such as deep learning: datasets grow → ML model training times balloon
Examples
1. Uber (2017): Horovod, a distributed deep learning framework
    a. "but as datasets grew, so did the training times, which sometimes took a week—or longer!—to complete. We found ourselves in need of a way to train using a lot of data while maintaining short training times. To achieve this, our team turned to distributed training."
2. Google (2016): TensorFlow adds support for distributed training
    a. "Google uses machine learning across a wide range of its products. In order to continually improve our models, it's crucial that the training process be as fast as possible."
3. Yahoo! (2017): open-sources TensorFlowOnSpark
4. Baidu (2017): brings the HPC ring-allreduce technique to deep learning (see the sketch below)
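Not from the slides: below is a minimal NumPy sketch of the ring-allreduce idea, simulating n workers in one process (the function name and setup are illustrative, not Baidu's or Horovod's API). Each worker splits its gradient into n chunks; partial sums travel once around the ring (reduce-scatter), then the completed chunks travel around again (allgather), so each worker sends 2*(n-1) chunks, roughly 2x the gradient size in total, independent of the worker count.

```python
import numpy as np

def ring_allreduce(grads):
    """Simulate ring-allreduce: every worker ends up with the sum of all
    workers' gradients while only ever exchanging one chunk per step."""
    n = len(grads)
    # Each worker splits its local gradient into n chunks.
    chunks = [np.array_split(g.astype(float), n) for g in grads]

    # Reduce-scatter: in step t, worker i sends chunk (i - t) % n to its
    # right neighbour, which adds it into its own copy. After n - 1 steps,
    # worker i holds the fully reduced chunk (i + 1) % n.
    for t in range(n - 1):
        for i in range(n):
            c = (i - t) % n
            chunks[(i + 1) % n][c] = chunks[(i + 1) % n][c] + chunks[i][c]

    # Allgather: the finished chunks travel once more around the ring,
    # overwriting stale copies, until every worker holds every chunk.
    for t in range(n - 1):
        for i in range(n):
            c = (i + 1 - t) % n
            chunks[(i + 1) % n][c] = chunks[i][c].copy()

    return [np.concatenate(ch) for ch in chunks]

# Four simulated workers, each with its own local gradient.
rng = np.random.default_rng(0)
local_grads = [rng.standard_normal(8) for _ in range(4)]
reduced = ring_allreduce(local_grads)
assert all(np.allclose(r, np.sum(local_grads, axis=0)) for r in reduced)
```

Horovod and Baidu's tensorflow-allreduce implement this same pattern over MPI/NCCL between real processes rather than over in-process arrays.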