2022-03-26 Introduction to TensorFlow Parameter Server Training @ 機械学習の社会実装勉強会 (study group on putting machine learning into production)

TensorFlow Parameter Server Training 2022/03/26 Naka Masato

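About me
Name: 那珂将人 (Naka Masato)
Background:
• Developed recommendation engines as an algorithm engineer
• Built and maintained infrastructure platforms
GitHub: https://github.com/nakamasato
Twitter: https://twitter.com/gymnstcs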

Why Distributed Training?
Compute-intensive ML such as Deep Learning, combined with growing datasets → model training time balloons.
Examples:
1. Uber: 2017, Horovod: a distributed deep learning framework
   a. "but as datasets grew, so did the training times, which sometimes took a week—or longer!—to complete. We found ourselves in need of a way to train using a lot of data while maintaining short training times. To achieve this, our team turned to distributed training."
2. Google: 2016, TensorFlow adds support for Distributed Training
   a. "Google uses machine learning across a wide range of its products. In order to continually improve our models, it's crucial that the training process be as fast as possible."
3. Yahoo!: 2017, open-sourced TensorFlowOnSpark
4. Baidu: 2017, ring-allreduce

Distributed Training comes in many forms
1. TensorFlow Distributed Training
2. Mesh TensorFlow
3. TensorFlowOnSpark
4. DeepSpeed
5. PyTorch Distributed
6. Horovod
7. Ray
8. BytePS

This talk covers TensorFlow Distributed Training

Parallelism in Machine Learning
• Data Parallelism
  • Synchronous: Mirrored Strategy
  • Asynchronous: Parameter Server Strategy
• Model Parallelism

What is TensorFlow Parameter Server Training
1. One of TensorFlow's distributed training strategies
2. Data-parallel: scales model training across multiple machines
3. Asynchronous Training
4. Components:
   a. Workers: run the computation and update the variables held on the Parameter Servers
   b. Parameter Servers: store the variables
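The Worker / Parameter Server split described above can be illustrated without TensorFlow. The following is a minimal pure-Python sketch, not how TensorFlow implements it: the ParameterServer class, the worker function, the toy loss (w - x)^2, and the learning rate are all assumptions made for illustration. Threads stand in for Workers, and each one pulls the variable, computes a gradient on its own data shard, and pushes the update without waiting for the others (asynchronous, data-parallel).

```python
import threading


class ParameterServer:
    """Hypothetical stand-in for a PS: stores one variable and applies updates."""

    def __init__(self, initial_value=0.0):
        self.value = initial_value
        self.num_updates = 0
        self._lock = threading.Lock()

    def pull(self):
        # Workers read the current value of the variable.
        with self._lock:
            return self.value

    def push(self, gradient, learning_rate):
        # Workers send a gradient; the PS applies it to the stored variable.
        with self._lock:
            self.value -= learning_rate * gradient
            self.num_updates += 1


def worker(ps, data_shard, learning_rate=0.1):
    """Hypothetical Worker: minimises (w - x)^2 over its own shard of the data."""
    for x in data_shard:
        w = ps.pull()                # may already be stale when the update lands
        gradient = 2.0 * (w - x)     # d/dw of (w - x)^2
        ps.push(gradient, learning_rate)


ps = ParameterServer(initial_value=0.0)
# Data parallelism: every worker gets its own shard (here all samples equal 1.0).
shards = [[1.0] * 100 for _ in range(3)]
threads = [threading.Thread(target=worker, args=(ps, shard)) for shard in shards]
for t in threads:
    t.start()
for t in threads:
    t.join()

final_value = ps.pull()
print(ps.num_updates, final_value)
```

No worker waits for another before pushing, which is the key difference from a synchronous strategy where all replicas would aggregate their gradients before a single update.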

Usage - Overview
1. Create the cluster
2. Create the Strategy
   a. ParameterServerStrategy
   b. MirroredStrategy
   c. …
3. Prepare the Dataset
   a. distributed dataset
   b. dataset creator
   c. …
4. Define and train the model (strategy.scope())
   a. keras.models
   b. Custom Training Loop
(Diagram: cluster roles: three Workers, two PS, a Chief (coordinator), and an Evaluator.)
    strategy = tf.distribute.experimental.ParameterServerStrategy(...)
    dc = …
    with strategy.scope():
        model = …
    model.fit(dc)

Usage - ① Creating the cluster
1. Create a ClusterSpec
   a. tf.train.ClusterSpec
2. Create the Workers and PS
   a. tf.distribute.Server
(Diagram: three Workers and two PS.)
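As a sketch of step 1 above, a ClusterSpec for the three Workers and two PS in the diagram could look like this. The localhost addresses and port numbers are placeholders chosen for illustration, not values from the talk. Starting the servers (step 2) is shown only as a comment so that the sketch does not open network ports.

```python
import tensorflow as tf

# Hypothetical addresses: three workers and two parameter servers on one host.
cluster_spec = tf.train.ClusterSpec({
    "worker": ["localhost:2222", "localhost:2223", "localhost:2224"],
    "ps": ["localhost:2225", "localhost:2226"],
})

# The spec can be queried for the jobs and tasks it describes.
num_workers = cluster_spec.num_tasks("worker")
num_ps = cluster_spec.num_tasks("ps")
first_ps_address = cluster_spec.task_address("ps", 0)
print(cluster_spec.as_dict())

# Each process in the cluster would then start a server for its own task, e.g.:
#   server = tf.distribute.Server(cluster_spec, job_name="worker", task_index=0)
#   server.join()
```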

Usage - ② Creating the Strategy
Create a Strategy:
1. tf.distribute.experimental.ParameterServerStrategy
2. tf.distribute.MirroredStrategy
3. …
A Strategy needs a ClusterResolver:
1. tf.distribute.cluster_resolver.SimpleClusterResolver
   a. Reads the cluster information from a ClusterSpec
2. tf.distribute.cluster_resolver.TFConfigClusterResolver
   a. Reads the cluster information from the TF_CONFIG environment variable
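For the TF_CONFIG route, the environment variable holds a JSON document with a "cluster" entry (job name to list of addresses) and a "task" entry (the role of the current process). A minimal sketch follows; the addresses are placeholders, and the process is assumed to be worker 0.

```python
import json
import os

# Hypothetical cluster layout; every process receives the same "cluster" entry
# and its own "task" entry.
tf_config = {
    "cluster": {
        "chief": ["localhost:2221"],
        "worker": ["localhost:2222", "localhost:2223", "localhost:2224"],
        "ps": ["localhost:2225", "localhost:2226"],
    },
    "task": {"type": "worker", "index": 0},
}

# TFConfigClusterResolver reads this variable when constructed without arguments.
os.environ["TF_CONFIG"] = json.dumps(tf_config)

parsed = json.loads(os.environ["TF_CONFIG"])
print(parsed["task"], len(parsed["cluster"]["worker"]))
```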

Usage - ③ Preparing the data
1. When using a Keras model:
   a. tf.keras.utils.experimental.DatasetCreator(dataset_fn)
2. When using a Custom Training Loop:
   a. distributed_dataset = coordinator.create_per_worker_dataset(dataset_fn)
   b. distributed_iter = iter(distributed_dataset)

Usage - ④ Defining and training the model
1. Implement worker_fn, the processing to run on the Workers, using the @tf.function decorator
2. coordinator = tf.distribute.experimental.coordinator.ClusterCoordinator
3. coordinator.schedule(worker_fn, args=(per_worker_iter,))
   a. schedule makes a Worker execute the function
keras.Model is also supported. Under the hood:
1. keras.Model stores the distribute_strategy internally
2. A Coordinator is created based on the strategy
3. train_function is compiled into a tf.Graph by tf.function
4. coordinator.schedule wraps train_function

Coordinator
A Coordinator is an object that schedules and coordinates remote function execution.
1. The coordinator is a tf.distribute.experimental.coordinator.ClusterCoordinator
2. It is initialized with a strategy, which carries the information needed to decide how to coordinate
3. schedule(worker_fn, args) runs the function remotely with args
   a. worker_fn is a tf.function that is executed on a Worker
(Diagram: the Coordinator, holding the strategy, calls schedule() on a cluster of three Workers and two PS.)

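tf.function
Compiles a function into a callable TensorFlow graph.
Graph vs Eager Execution:
1. Graph Execution: the computation is executed as a tf.Graph
   a. A tf.Graph consists of two kinds of objects
      i. tf.Operation: a unit of computation
      ii. tf.Tensor: a unit of data that flows between operations
   b. Graphs → fast, parallelizable, and efficient to run across multiple devices
2. Eager Execution
   a. Operations are executed immediately (↔ executed later inside a Session)
   b. Returns concrete values (↔ returns references to nodes in the computation graph)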
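The difference between the two modes can be seen with a small function. This is a minimal sketch; add_and_square is a made-up example function, and the exact operation types that appear in the traced graph may vary between TensorFlow versions, so they are printed rather than assumed.

```python
import tensorflow as tf


def add_and_square(a, b):
    # Plain Python function built from TensorFlow operations.
    return (a + b) ** 2


# Eager execution: runs immediately and returns a concrete value.
eager_result = add_and_square(tf.constant(1.0), tf.constant(2.0))

# Graph execution: tf.function traces the Python function into a tf.Graph
# and returns a callable that runs that graph.
graph_fn = tf.function(add_and_square)
graph_result = graph_fn(tf.constant(1.0), tf.constant(2.0))

# The traced graph consists of tf.Operation nodes connected by tf.Tensor edges.
concrete_fn = graph_fn.get_concrete_function(tf.constant(1.0), tf.constant(2.0))
op_types = [op.type for op in concrete_fn.graph.get_operations()]

print(eager_result.numpy(), graph_result.numpy())
print(op_types)
```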

Distributed Dataset
In Distributed Training, the data can also be distributed across the Workers.
tf.data.Dataset: used when the data is not distributed
tf.distribute.DistributedDataset:
1. tf.distribute.Strategy.experimental_distribute_dataset(dataset)
2. tf.distribute.Strategy.distribute_datasets_from_function(dataset_fn)
   a. dataset_fn takes a tf.distribute.InputContext as its argument and returns a tf.data.Dataset
Note: with keras.models + ParameterServerStrategy, tf.keras.utils.experimental.DatasetCreator must be used.

strategy.scope()
On entering the scope:
1. The Strategy is installed in the global context
   a. tf.distribute.get_strategy() returns the current Strategy
   b. keras.models must be created inside the scope because Keras uses this function to obtain the Strategy
2. How variables are created is determined by the Strategy
   a. Sync strategies (MirroredStrategy, TPUStrategy, MultiWorkerMirroredStrategy) create them on each Replica
   b. ParameterServerStrategy creates them on the PS
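The effect of entering the scope can be checked locally. Because ParameterServerStrategy needs a running cluster, this sketch swaps in tf.distribute.OneDeviceStrategy on the CPU as a stand-in (an assumption made so the example runs on a single machine); the scope mechanics it demonstrates are shared by all strategies.

```python
import tensorflow as tf

# Stand-in strategy that needs no cluster (not the one used in the talk).
strategy = tf.distribute.OneDeviceStrategy(device="/cpu:0")

# Outside the scope only the default strategy is active.
has_strategy_outside = tf.distribute.has_strategy()

with strategy.scope():
    # Inside the scope the strategy is installed in the global context,
    # so library code such as Keras can look it up.
    has_strategy_inside = tf.distribute.has_strategy()
    current_strategy = tf.distribute.get_strategy()
    # Variables created here are placed according to the strategy.
    v = tf.Variable(1.0)

print(has_strategy_outside, has_strategy_inside, current_strategy is strategy)
```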

Sample
Non Distributed
1. Define a tf.Variable
2. Define a tf.function, worker_fn
   a. Adds an element of the iterator to the variable
3. Create an iterator over 1 to 5 with dataset_fn
4. Pass the iterator to worker_fn and call it 5 times
5. The tf.Variable ends up as 15
Distributed
1. Create the Cluster
2. Create a ParameterServerStrategy
3. Create a Coordinator
4. Define a tf.Variable inside the scope
5. Define a tf.function, worker_fn
   a. Adds an element of the iterator to the variable
6. Create an iterator over 1 to 5 with dataset_fn
7. Call worker_fn 5 times through the coordinator
8. The tf.Variable ends up as 15
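The non-distributed variant of the sample can be sketched as follows. This is a reconstruction from the steps listed on the slide, not the speaker's actual code; the names v, worker_fn, and dataset_fn follow the slide, everything else is an assumption.

```python
import tensorflow as tf

# 1. Define a tf.Variable.
v = tf.Variable(0)


# 2. Define worker_fn as a tf.function that adds the next element to the variable.
@tf.function
def worker_fn(iterator):
    v.assign_add(next(iterator))


# 3. dataset_fn produces the elements 1 to 5.
def dataset_fn():
    return tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5])


iterator = iter(dataset_fn())

# 4. Call worker_fn 5 times, consuming one element per call.
for _ in range(5):
    worker_fn(iterator)

# 5. The variable now holds 1 + 2 + 3 + 4 + 5.
result = int(v.numpy())
print(result)
```

In the distributed variant, the variable would be created inside strategy.scope() and the calls would go through coordinator.schedule instead of being made directly.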

Summary
1. Background of Distributed Training
   a. Training time balloons as datasets grow
2. Categories of Distributed Training
   a. Data Parallelism vs Model Parallelism
   b. Synchronous vs Asynchronous
3. Overview of TensorFlow Distributed Training
4. Walkthrough of an example
Todo:
1. Learn more about other Distributed Training frameworks and their history
