2022-06-18 Ray Trainの紹介@機械学習の社会実装勉強会第12回

Ray Trainの紹介 2022-06-18 Naka Masato

自己紹介名前那珂将人経歴 • アルゴリズムエンジニアとしてレコメンドエンジン開発 • インフラ基盤整備 GitHub: https://github.com/nakamasato
Twitter: https://twitter.com/gymnstcs

Rayを紹介 UC Berkeley RISE Lab で開発されたオープンソースのプロジェクト As a general-purpose
and universal distributed compute framework, you can ﬂexibly run any compute-intensive Python workload — 1. from distributed training or 2. hyperparameter tuning to 3. deep reinforcement learning and 4. production model serving. Deep learning から Model Serving まで開発者が簡単にスケールできる https://www.ray.io/

前回 - Ray Components さまざまな Package がある 1. Core: コア
← 前前回 2. Tune: Scalable hyperparameter tuning 3. RLlib: Reinforcement learning 4. Train: Distributed deep learning (PyTorch, TensorFlow, Horovod) ← 今回 5. Datasets: Distributed data loading and compute 6. Serve: Scalable and programmable serving ← 前回 7. Workﬂows: Fast, durable application ﬂows

Ray Train Ray: Python を Framework 依存なしに簡単にスケーラブルにするライブラリ • メイン :
Distributed Training • Tensorﬂow 、 Pytorch の Distributed Training を簡単に使えるもちろん Distributed Training 以外も使えるが、その場合は Ray を使う意味があまりない

Distributed Training • Pytorch ◦ DistributedDataParallel • Tensorﬂow ◦ MultiWorkerMirroredStrategy
同じモデルを複数のプロセスにもたせて、プロセスごとに異なるデータを与えて学習し、モデルレプリカを同期することで分散学習する方法 Ray Train では、 tensorﬂow or torch を Trainer で指定するだけで必要な設定を自動的にしてくれる

TensorﬂowのDistributed Training 以前も紹介済み 1. 2022-02-26 TensorFlow Training (TFJob) 紹介
2. 2022-03-26 Tensorﬂow Parameter Server Training

Trainの基本的な使い方 1. Trainer を初期化 a. from ray.train import Trainer b.
trainer = Trainer(backend="tensorﬂow", num_workers=2) 2. メインロジックを train_func ( 関数 ) に記述 3. Trainer を実行 a. trainer.start() # set up resources b. trainer.run(train_func) c. trainer.shutdown() # clean up resources

Demo: simple example • train_func ◦ 与えられた num_epochs 分、配列 results
に item i を追加 • trainer.run で異なる num_epochs を入れて train_func を実行 • num_workers が 2 なので同じものが 2 回呼ばれている

ML例: main main: 1. trainer の作成 (backend 、 worker 数の指定、
gpu 使用有無など ) 2. train.start() 3. trainer.run(train_func, conﬁg) 4. trainer.shutdown()

ML例: train_func 1. 学習条件 a. per_worker_batch_size: 64 (default) i. SGD
の 1 ステップにいくつ Example を使うか b. epochs: 3 (default) c. steps_per_epoch: 70 (default) i. 各 epoch でのステップの数 (batches of samples) d. num_workers: 2 2. strategy = tf.distribute.MultiWorkerMirroredStr ategy() 3. strategy.scope(): モデルをこの scope の中で定義することで distribute training を使える

ﬁtの中身 keras.Model などの Distributed Training の実装に含まれている iterator (data) step tf.function
(remote function) sync model step tf.function (remote function) step tf.function (remote function)

まとめ • Ray を使うと、 framework に関係なく Distributed Training を活用できる •
使い方もシンプル ◦ trainer = Trainer(backend="tensorﬂow", num_workers=2) ◦ trainer.start() ◦ trainer.run(train_func) ◦ trainer.shutdown() • 実際の Distributed Training の実装自体は各 Framework 内にある • Ray は Conﬁguration を Trainer でラップ

2022-06-18 Ray Trainの紹介@機械学習の社会実装勉強会第12回

2022-06-18 Ray Trainの紹介@機械学習の社会実装勉強会第12回

Naka Masato

More Decks by Naka Masato

Other Decks in Technology

Featured

Transcript

Ray Trainの紹介 2022-06-18 Naka Masato

自己紹介名前那珂将人経歴 • アルゴリズムエンジニアとしてレコメンドエンジン開発 • インフラ基盤整備 GitHub: https://github.com/nakamasato

Rayを紹介 UC Berkeley RISE Lab で開発されたオープンソースのプロジェクト As a general-purpose

前回 - Ray Components さまざまな Package がある 1. Core: コア

Ray Train Ray: Python を Framework 依存なしに簡単にスケーラブルにするライブラリ • メイン :

Distributed Training • Pytorch ◦ DistributedDataParallel • Tensorﬂow ◦ MultiWorkerMirroredStrategy

TensorﬂowのDistributed Training 以前も紹介済み 1. 2022-02-26 TensorFlow Training (TFJob) 紹介

Trainの基本的な使い方 1. Trainer を初期化 a. from ray.train import Trainer b.

Demo: simple example • train_func ◦ 与えられた num_epochs 分、配列 results

ML例: main main: 1. trainer の作成 (backend 、 worker 数の指定、

ML例: train_func 1. 学習条件 a. per_worker_batch_size: 64 (default) i. SGD

ﬁtの中身 keras.Model などの Distributed Training の実装に含まれている iterator (data) step tf.function

まとめ • Ray を使うと、 framework に関係なく Distributed Training を活用できる •