2022-06-18 Introduction to Ray Train @ Machine Learning Social Implementation Study Group, 12th Session

Slide 1

Slide 1 text

Introduction to Ray Train
2022-06-18
Naka Masato

Slide 2

Slide 2 text

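Self-introduction
Name: 那珂将人 (Naka Masato)
Background
● Developed recommendation engines as an algorithm engineer
● Infrastructure platform development
GitHub: https://github.com/nakamasato
Twitter: https://twitter.com/gymnstcs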

Slide 3

Slide 3 text

Introducing Ray
An open-source project developed at the UC Berkeley RISE Lab
"As a general-purpose and universal distributed compute framework, you can flexibly run any compute-intensive Python workload, from (1) distributed training or (2) hyperparameter tuning to (3) deep reinforcement learning and (4) production model serving."
Lets developers scale easily, from deep learning to model serving
https://www.ray.io/

Slide 4

Slide 4 text

Recap from last time - Ray Components
Ray consists of several packages:
1. Core: the core library ← covered two sessions ago
2. Tune: Scalable hyperparameter tuning
3. RLlib: Reinforcement learning
4. Train: Distributed deep learning (PyTorch, TensorFlow, Horovod) ← this session
5. Datasets: Distributed data loading and compute
6. Serve: Scalable and programmable serving ← covered last session
7. Workflows: Fast, durable application flows

Slide 5

Slide 5 text

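Ray Train
Ray: a library that makes it easy to scale Python without depending on a specific framework
● Main use: Distributed Training
● Makes TensorFlow and PyTorch Distributed Training easy to use
It can also run code other than distributed training, but in that case there is little benefit to using Ray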

Slide 6

Slide 6 text

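Distributed Training
● PyTorch
○ DistributedDataParallel
● TensorFlow
○ MultiWorkerMirroredStrategy
A distributed training method in which multiple processes each hold a copy of the same model, each process trains on different data, and the model replicas are synchronized
With Ray Train, you only specify tensorflow or torch as the Trainer backend, and the required setup is done automatically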
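The idea of replicated models that train on different data and are then synchronized can be illustrated with a small pure-Python simulation. This is only a sketch: it involves no Ray, TensorFlow, or PyTorch, and the names `local_gradient` and `data_parallel_sgd`, the one-weight model y = w * x, and the learning rate are all assumptions made for illustration.

```python
# Pure-Python simulation of synchronous data-parallel training.
# Illustrative only: the function names and the toy model are hypothetical.

def local_gradient(w, shard):
    """Gradient of the mean squared error of y = w * x on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def data_parallel_sgd(data, num_workers=2, lr=0.05, steps=200):
    # Each worker receives a different slice of the data.
    shards = [data[i::num_workers] for i in range(num_workers)]
    # Each worker holds its own replica of the same model (a single weight).
    replicas = [0.0] * num_workers
    for _ in range(steps):
        # 1. Every worker computes a gradient on its own shard.
        grads = [local_gradient(w, shard) for w, shard in zip(replicas, shards)]
        # 2. Synchronize: average the gradients across workers.
        avg = sum(grads) / num_workers
        # 3. Every replica applies the same update, so the replicas stay identical.
        replicas = [w - lr * avg for w in replicas]
    return replicas


data = [(float(x), 3.0 * x) for x in range(1, 5)]  # samples of y = 3x
replicas = data_parallel_sgd(data)
print(replicas)
```

Real frameworks do the averaging step with collective communication between processes; here it is a plain Python average so the flow is easy to follow.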

Slide 7

Slide 7 text

Distributed Training in TensorFlow
Covered in earlier sessions:
1. 2022-02-26 Introduction to TensorFlow Training (TFJob)
2. 2022-03-26 TensorFlow Parameter Server Training

Slide 8

Slide 8 text

Basic usage of Train
1. Initialize the Trainer
a. from ray.train import Trainer
b. trainer = Trainer(backend="tensorflow", num_workers=2)
2. Write the main logic in train_func (a function)
3. Run the Trainer
a. trainer.start() # set up resources
b. trainer.run(train_func)
c. trainer.shutdown() # clean up resources

Slide 9

Slide 9 text

Demo: simple example
● train_func
○ Appends item i to the list results once for each of the given num_epochs
● Call trainer.run with different num_epochs values to execute train_func
● Because num_workers is 2, the same function is called twice
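A minimal sketch of what this demo could look like is shown below. It assumes a Ray 1.x release in which `ray.train.Trainer` is available (the API in use at the time of this talk) and that the framework for the chosen backend is installed. The wrapper name `run_demo` is hypothetical; Ray is imported inside it so that `train_func` can be tried on its own without Ray.

```python
def train_func(config):
    """Append each epoch index to a list and return the list."""
    results = []
    for i in range(config["num_epochs"]):
        results.append(i)
    return results


def run_demo():
    # Hypothetical wrapper. Imported lazily so train_func works without Ray.
    from ray.train import Trainer

    trainer = Trainer(backend="tensorflow", num_workers=2)
    trainer.start()  # set up resources
    try:
        # Each call to run() returns one result per worker.
        short_run = trainer.run(train_func, config={"num_epochs": 2})
        long_run = trainer.run(train_func, config={"num_epochs": 5})
    finally:
        trainer.shutdown()  # clean up resources, even if a run fails
    return short_run, long_run


# Without Ray, the function itself can be checked locally:
print(train_func({"num_epochs": 2}))
```

Calling `run_demo()` in an environment with Ray 1.x should give, for each run, a list with one entry per worker, which is why the same result appears twice when `num_workers=2`.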

Slide 10

Slide 10 text

ML example: main
main:
1. Create the trainer (specify the backend, number of workers, whether to use GPUs, etc.)
2. trainer.start()
3. trainer.run(train_func, config)
4. trainer.shutdown()

Slide 11

Slide 11 text

ML example: train_func
1. Training settings
a. per_worker_batch_size: 64 (default)
i. Number of examples each worker uses in one SGD step
b. epochs: 3 (default)
c. steps_per_epoch: 70 (default)
i. Number of steps (batches of samples) in each epoch
d. num_workers: 2
2. strategy = tf.distribute.MultiWorkerMirroredStrategy()
3. strategy.scope(): defining the model inside this scope enables distributed training
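The main and train_func steps from this ML example can be sketched roughly as follows. This is not the code shown in the talk: the synthetic regression data, the tiny Dense model, and the helpers `global_batch_size` and `num_workers_from_tf_config` are assumptions for illustration. It also assumes Ray 1.x with `ray.train.Trainer`, TensorFlow 2.x, and that the worker count can be read from the `TF_CONFIG` environment variable inside each worker.

```python
import json
import os


def global_batch_size(per_worker_batch_size, num_workers):
    """Batch size across all workers for one synchronized step (hypothetical helper)."""
    return per_worker_batch_size * num_workers


def num_workers_from_tf_config(default=1):
    """Read the worker count from TF_CONFIG; fall back to `default` if it is unset."""
    tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
    workers = tf_config.get("cluster", {}).get("worker", [])
    return len(workers) or default


def train_func(config):
    # Heavy imports stay inside the function, which runs on each worker.
    import numpy as np
    import tensorflow as tf

    per_worker_batch_size = config.get("batch_size", 64)
    epochs = config.get("epochs", 3)
    steps_per_epoch = config.get("steps_per_epoch", 70)
    num_workers = num_workers_from_tf_config()

    strategy = tf.distribute.MultiWorkerMirroredStrategy()

    # Synthetic data (an assumption; a real example would load a dataset).
    x = np.random.rand(1024, 8).astype("float32")
    y = x.sum(axis=1, keepdims=True)
    batch = global_batch_size(per_worker_batch_size, num_workers)
    dataset = tf.data.Dataset.from_tensor_slices((x, y)).repeat().batch(batch)

    # The model must be built and compiled inside the strategy scope.
    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
            tf.keras.layers.Dense(1),
        ])
        model.compile(optimizer="sgd", loss="mse")

    history = model.fit(dataset, epochs=epochs,
                        steps_per_epoch=steps_per_epoch, verbose=0)
    return history.history


def main(num_workers=2, use_gpu=False):
    from ray.train import Trainer  # assumes Ray 1.x

    trainer = Trainer(backend="tensorflow", num_workers=num_workers, use_gpu=use_gpu)
    trainer.start()
    try:
        config = {"batch_size": 64, "epochs": 3, "steps_per_epoch": 70}
        return trainer.run(train_func, config=config)
    finally:
        trainer.shutdown()
```

With the defaults listed above, two workers at 64 examples each give a global batch of 128 examples per step.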

Slide 12

Slide 12 text

Inside fit
The distributed training logic is part of the implementation of keras.Model and similar classes
[Diagram labels: iterator (data); step tf.function (remote function), shown three times; sync model]

Slide 13

Slide 13 text

Summary
● With Ray, you can use Distributed Training regardless of the framework
● Usage is simple
○ trainer = Trainer(backend="tensorflow", num_workers=2)
○ trainer.start()
○ trainer.run(train_func)
○ trainer.shutdown()
● The actual Distributed Training implementation lives inside each framework
● Ray wraps the configuration in the Trainer