2022-04-29 Introduction to Ray @ Machine Learning Social Implementation Study Group

Slide 1

Slide 1 text

Introduction to Ray: making Python simple and scalable
Naka Masato

Slide 2

Slide 2 text

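Self-introduction
Name: Masato Naka (那珂将人)
Background:
● Developed recommendation engines as an algorithm engineer
● Infrastructure platform development
GitHub: https://github.com/nakamasato
Twitter: https://twitter.com/gymnstcs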

Slide 3

Slide 3 text

What is Ray?
An open-source project developed at UC Berkeley RISE Lab.
As a general-purpose and universal distributed compute framework, you can flexibly run any compute-intensive Python workload:
1. from distributed training or
2. hyperparameter tuning to
3. deep reinforcement learning and
4. production model serving.
Developers can easily scale everything from deep learning to model serving.
https://www.ray.io/

Slide 4

Slide 4 text

Components
Ray consists of several packages:
1. Core: the core library
2. Tune: Scalable hyperparameter tuning
3. RLlib: Reinforcement learning
4. Train: Distributed deep learning (PyTorch, TensorFlow, Horovod)
5. Datasets: Distributed data loading and compute
6. Serve: Scalable and programmable serving
7. Workflows: Fast, durable application flows

Slide 5

Slide 5 text

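Concept
1. Tasks: functions executed asynchronously on separate Python workers
2. Actors: an extension of the Task API from functions to classes; an Actor is essentially a stateful worker
3. Objects: the values that Tasks and Actors create and compute on (stored in the object store on each node of the cluster)
4. Placement Groups: reserve groups of resources across multiple nodes (e.g. Gang Scheduling)
5. Environment Dependencies: Tasks run on multiple machines, so the execution environment must be set up so that dependent packages, environment variables, and so on are available (① cluster configuration, ② runtime environment configuration)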

Slide 6

Slide 6 text

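Ray Example
1. ray.init(): initializes the Ray cluster
2. @ray.remote: decorator that turns a function into a task (remote function)
3. func.remote(): invokes the task → returns a future
4. ray.get(future): retrieves the result
Once ray.init has connected to a cluster, tasks can be executed across multiple machines.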

Slide 7

Slide 7 text

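ray.init - connecting to a cluster
ray.init(): connects to an existing cluster, or creates a cluster and then connects to it
1. init(): local case
a. Starts Redis, raylet, plasma store, plasma manager, and some workers, then connects
2. init(address="auto") or init(address="ray://123.45.67.89:10001") → connects to an existing cluster
Task processing can be distributed across the cluster (remote function).
Diagram labels: Ray cluster; task (@ray.remote); ray.get(futures)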

Slide 8

Slide 8 text

Ray Cluster
Cluster:
1. Head node
2. Worker node
Launch a cluster:
1. The cluster launcher: ray up config.yml
2. The kubernetes operator: helm -n ray install example-cluster --create-namespace ./ray
Supported Cloud:
1. AWS
2. Azure
3. GCP
4. Aliyun

Slide 9

Slide 9 text

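Creating a Ray Cluster - AWS
Prerequisite:
1. aws configure (only the default profile seems to be supported?)
2. IAM permissions: permissions to create IAM and EC2 resources are required (not stated explicitly in the docs?)
3. A VPC and Subnet must exist beforehand
Step:
1. Create config.yaml
a. The yaml shown on the right of the slide did not work in ap-northeast-1
2. ray up -y config.yaml
a. Just under 3 minutes with the minimal config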
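As a rough illustration of step 1, the sketch below renders the skeleton of a cluster launcher config for AWS. It assumes the commonly documented top-level keys `cluster_name`, `max_workers`, and `provider` (with `type` and `region`). The helper `render_cluster_config` and the values passed to it are hypothetical, and the output is not the yaml from the slide. As the slide notes, a region such as ap-northeast-1 may need additional subnet and AMI settings that this skeleton does not include.

```python
def render_cluster_config(cluster_name: str, region: str, max_workers: int = 1) -> str:
    """Return a minimal cluster launcher config as YAML text (hypothetical helper)."""
    lines = [
        f"cluster_name: {cluster_name}",
        f"max_workers: {max_workers}",
        "provider:",
        "    type: aws",
        f"    region: {region}",
    ]
    return "\n".join(lines) + "\n"


# Example values only; save the text as config.yaml before running `ray up`.
config_text = render_cluster_config("minimal", "us-west-2")
print(config_text)
```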

Slide 10

Slide 10 text

Creating a Ray Cluster - AWS
There are more pitfalls than expected:
1. Failing on the prerequisites (AWS profile, IAM permissions, VPC)
2. The example config.yml does not work out of the box (ap-northeast-1)
a. Error because no Subnet exists
b. Choosing the AMI image
c. Ray cluster creation fails with `pip not found`, `docker not found`
3. AWS resources remain after the Ray cluster is deleted (key pair, IAM, security group…)

Slide 11

Slide 11 text

Ray Cluster - AWS - How to submit a Job
Rather than connecting directly from the local machine to the head of the Ray cluster, submit the Job through the SDK or CLI.
Diagram labels:
● Ray cluster
● ray.init("ray://10.0.0.1:10001"), python example.py (local)
● ray submit config.yml example.py (CLI)
Details: https://docs.ray.io/en/master/cluster/job-submission.html#job-submission-architecture
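A minimal sketch of the SDK route, assuming a Ray release that exposes `JobSubmissionClient` under `ray.job_submission` (the import path may differ in older releases) and a dashboard reachable over HTTP. The wrapper `submit_job` and its parameter names are hypothetical.

```python
def submit_job(dashboard_url: str, entrypoint: str, working_dir: str = "./") -> str:
    """Submit `entrypoint` through the cluster dashboard and return the job id."""
    # Imported inside the function so this module loads even where Ray is absent.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(dashboard_url)
    # working_dir is uploaded to the cluster so the entrypoint can find its files.
    return client.submit_job(
        entrypoint=entrypoint,
        runtime_env={"working_dir": working_dir},
    )
```

Once the dashboard is reachable, a call might look like `submit_job("http://127.0.0.1:8265", "python example.py")`, where the address is an example value.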

Slide 12

Slide 12 text

Ray Cluster on Kubernetes
1. Can be installed with Helm
a. helm -n ray install example-cluster --create-namespace ./ray
2. What gets installed
a. ray-operator: the component that manages rayclusters
b. raycluster (custom resource) -> 3 pods (1 head + 2 worker)
c. service: the endpoint for accessing the head
3. Submitting a Ray Job
a. Port-forward the Dashboard service to the local machine and submit with the CLI
i. kubectl -n ray port-forward service/example-cluster-ray-head 8265:8265
ii. ray job submit --runtime-env-json=... -- python script.py
b. Port-forward port 10001 of the Ray head and run locally with ray.init("local") (questionable from a security standpoint)
c. Run ray.init("head-service") from a Pod inside the Kubernetes cluster (e.g. from a Job), passing the head's information through environment variables
https://github.com/ray-project/ray/tree/master/doc/kubernetes/example_scripts
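Option 3c can be sketched as below. Kubernetes injects `<SERVICE>_SERVICE_HOST` and `<SERVICE>_SERVICE_PORT_<PORTNAME>` variables into Pods for services that already exist, so the head address can be built from them. The helper names and the port name `client` are assumptions for illustration, not taken from the slides.

```python
import os


def head_client_address(service_name, port_name="client", env=None):
    """Build a ray:// address from Kubernetes service environment variables."""
    env = os.environ if env is None else env
    # Kubernetes upper-cases the service name and replaces dashes with underscores.
    prefix = service_name.upper().replace("-", "_")
    port_key = port_name.upper().replace("-", "_")
    host = env[f"{prefix}_SERVICE_HOST"]
    port = env[f"{prefix}_SERVICE_PORT_{port_key}"]
    return f"ray://{host}:{port}"


def connect_to_head(service_name):
    """Connect to the head from inside a Pod (requires Ray in the image)."""
    import ray  # imported lazily; only needed when actually connecting

    return ray.init(head_client_address(service_name))


# Example with made-up values standing in for what Kubernetes would inject.
example_env = {
    "EXAMPLE_CLUSTER_RAY_HEAD_SERVICE_HOST": "10.0.0.1",
    "EXAMPLE_CLUSTER_RAY_HEAD_SERVICE_PORT_CLIENT": "10001",
}
address = head_client_address("example-cluster-ray-head", env=example_env)
print(address)
```

Keeping the address lookup separate from `ray.init` makes it possible to check the lookup without a running cluster.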

Slide 13

Slide 13 text

Summary
Today:
1. Basic usage of Ray
a. Concepts + cluster creation
ToDo:
1. Ray's various features (Data, Train, Tune, Serve, RLlib, Workflows)
2. Pros and cons of using Ray for Distributed Training with PyTorch and TensorFlow
3. Comparison with Kubeflow Training Operator