2022-02-26 Kubeflow Training Operator: Introduction to TFJob @ Machine Learning Social Implementation Study Group (機械学習の社会実装勉強会)

Kubeflow Training Operator: Introduction to TFJob
2022/02/26 Naka Masato

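About me
Name: Naka Masato (那珂将人)
Background:
• Developed recommendation engines as an algorithm engineer
• Built and maintained infrastructure platforms
GitHub: https://github.com/nakamasato
Twitter: https://twitter.com/gymnstcs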

What is Kubeflow Training Operator?
1. One of the components of Kubeflow
2. Runs training Jobs for TensorFlow and other frameworks as Kubernetes Custom Resources
3. Implemented in kubeflow/training-operator
[Diagram: training-operator; TFJob, Pod, Service; PyTorchJob, MXJob, XGBoostJob]

What is TFJob?
1. A CRD (Custom Resource Definition) for managing TensorFlow distributed training on Kubernetes
2. Controller
   a. Updates the TF_CONFIG environment variable, which holds the multi-worker training configuration
   b. Manages Pods and Services
[Diagram: three Workers, two PS, Chief (coordinator), Evaluator; TF_CONFIG environment variable]

TensorFlow's Distributed Training
TensorFlow's distributed training:
1. Makes it possible to train across multiple GPUs, machines, or TPUs
2. Comes in several types of tf.distribute.Strategy:
   a. MirroredStrategy
   b. TPUStrategy
   c. MultiWorkerMirroredStrategy
   d. ParameterServerStrategy
   e. CentralStorageStrategy
https://www.tensorflow.org/guide/distributed_training
TensorFlow's distributed training will be covered again in the next session.

TensorFlow's Distributed Training - tf.distribute.Strategy
tf.distribute.Strategy has been designed with these key goals:
1. Easy to use
2. Provide good performance out of the box
3. Easy switching between strategies
tf.distribute.Strategy integrates with tf.keras:
1. Choose a tf.distribute.Strategy
2. Create the Keras model, optimizer, and metrics inside strategy.scope

TensorFlow's Distributed Training with Kubeflow Training Operator
1. TFJob: the definition of a TensorFlow Job
2. training-operator: creates Kubernetes objects (Pods and Services) from the Job definition
3. Pod & Service: the containers defined in each Pod are run, and the Pods reach one another through Services
4. Container Image: the actual machine learning logic is packaged as a container image
[Diagram: a TFJob (kind: TFJob; Worker: replicas: 3, template: <pod template>; PS: replicas: 2, template: <pod template>), the training-operator, and the resulting 3 Worker Pods, 2 PS Pods, and 5 Services]
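The expansion from a Job definition into Pods and Services can be sketched in plain Python. This is an illustration only, not the operator's actual code: the function `expand_replicas` is hypothetical, and the `<job>-<type>-<index>` naming is an assumption inferred from the Pod name `tfjob-simple-worker-0` that appears in the demo.

```python
from typing import Dict, List


def expand_replicas(job_name: str, replicas: Dict[str, int]) -> List[str]:
    """Return one object name per replica.

    Each name stands for one Pod and one Service (assumed naming scheme).
    """
    names = []
    for replica_type, count in replicas.items():
        for index in range(count):
            names.append(f"{job_name}-{replica_type.lower()}-{index}")
    return names


# Same shape as the diagram: 3 Workers and 2 PS.
for name in expand_replicas("tfjob-simple", {"Worker": 3, "PS": 2}):
    print(f"Pod/{name}  Service/{name}")
```

With 3 Workers and 2 PS, the sketch yields five names, matching the five Pod and Service pairs in the diagram.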

TFJob YAML
1. Set kind to TFJob
2. spec.tfReplicaSpecs is a map
   a. The key is the replicaType: one of PS, Worker, Chief, or Evaluator
   b. Number of replicas
      i. The example on the slide uses Parameter Server Strategy, with 1 PS replica and 3 Worker replicas
   c. Pod Template
      i. Specifies the container, the command to run at container startup, and so on
   d. restartPolicy
      i. Specifies whether a Pod is restarted when it exits
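Since the slide's example manifest is not reproduced in this transcript, here is a minimal sketch of the same structure, built as a Python dict and printed as JSON (which `kubectl apply -f` also accepts). The job name, image, and command are hypothetical placeholders, and `build_tfjob` is a helper made up for this illustration. The `kubeflow.org/v1` API version and the container name `tensorflow` are assumptions to check against the training-operator version in use.

```python
import json


def build_tfjob(name: str, image: str, ps: int, workers: int) -> dict:
    """Build a TFJob manifest with PS and Worker replica specs."""

    def replica(count: int) -> dict:
        return {
            "replicas": count,
            # Restart the Pod only if the container exits with a failure.
            "restartPolicy": "OnFailure",
            # Ordinary Pod template: container, image, startup command.
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": "tensorflow",  # assumed container name
                            "image": image,
                            "command": ["python", "/app/train.py"],
                        }
                    ]
                }
            },
        }

    return {
        "apiVersion": "kubeflow.org/v1",  # assumed API version
        "kind": "TFJob",
        "metadata": {"name": name},
        "spec": {
            # Map keyed by replica type.
            "tfReplicaSpecs": {"PS": replica(ps), "Worker": replica(workers)}
        },
    }


# Parameter Server Strategy shape from the slide: 1 PS and 3 Workers.
manifest = build_tfjob("example-tfjob", "example.com/my-training:latest", ps=1, workers=3)
print(json.dumps(manifest, indent=2))
```

Redirecting the output to a file such as tfjob.json gives something that could be applied in place of a YAML file, assuming the placeholders are replaced with real values.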

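Applying the TFJob YAML
1. The operator creates Pods and Services for each replica type
2. Cluster information is passed to each container through the TF_CONFIG environment variable
* Depending on the Strategy, cluster information may not be needed
* Using GPUs requires extra setup such as installing the GPU driver
[Diagram: training-operator watches the TFJob and creates Worker and PS Pods with their Services; each Pod receives TF_CONFIG]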
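To make the cluster information concrete, the sketch below builds a TF_CONFIG value for one replica. The `cluster` and `task` keys follow the TF_CONFIG format documented by TensorFlow. The host naming (`<job>-<type>-<index>.<namespace>.svc`), the port 2222, and the helper `build_tf_config` are assumptions for illustration, not the operator's implementation.

```python
import json
from typing import Dict


def build_tf_config(job: str, namespace: str, replicas: Dict[str, int],
                    task_type: str, task_index: int, port: int = 2222) -> str:
    """Return a TF_CONFIG JSON string for a single replica."""
    # One address per replica, grouped by replica type (assumed host naming).
    cluster = {
        rtype: [f"{job}-{rtype}-{i}.{namespace}.svc:{port}" for i in range(count)]
        for rtype, count in replicas.items()
    }
    # "task" tells this particular container which cluster entry it is.
    return json.dumps({"cluster": cluster, "task": {"type": task_type, "index": task_index}})


# TF_CONFIG as worker 0 might see it in a job with 1 PS and 3 Workers.
print(build_tf_config("tfjob-simple", "kubeflow", {"ps": 1, "worker": 3}, "worker", 0))
```

Every replica gets the same `cluster` section and a different `task` section, which is how a single image can run in every role.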

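Script & Docker image for TFJob
1. With multiple Workers, every Worker needs the cluster information as a ClusterSpec
   a. In a TFJob it is passed as JSON in the TF_CONFIG environment variable
2. A single script branches on the replica type
   a. if job_name == "ps"
3. The details of how to write the script are skipped this time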
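The branching pattern can be sketched with the standard library alone. The code reads TF_CONFIG, extracts the replica type and index, and branches on `job_name`. The actual TensorFlow training logic is left out, as in the talk; the function names and the fallback to a single worker when TF_CONFIG is unset are assumptions made for this illustration.

```python
import json
import os
from typing import Mapping


def read_task(environ: Mapping[str, str] = os.environ):
    """Return (cluster, job_name, task_index) parsed from TF_CONFIG."""
    tf_config = json.loads(environ.get("TF_CONFIG") or "{}")
    cluster = tf_config.get("cluster", {})
    task = tf_config.get("task", {})
    # Assumption: without TF_CONFIG, behave like a single local worker.
    return cluster, task.get("type", "worker"), int(task.get("index", 0))


def describe_role(environ: Mapping[str, str] = os.environ) -> str:
    """Branch on the replica type; real code would start TensorFlow here."""
    # `cluster` is what a real script would turn into a ClusterSpec.
    cluster, job_name, task_index = read_task(environ)
    if job_name == "ps":
        return f"ps {task_index}: would start a parameter server and wait"
    if job_name in ("chief", "worker"):
        return f"{job_name} {task_index}: would run the training loop"
    if job_name == "evaluator":
        return f"evaluator {task_index}: would run evaluation"
    raise ValueError(f"unknown replica type: {job_name!r}")


if __name__ == "__main__":
    print(describe_role())
```

Because the role comes entirely from the environment, the same container image can be used for every replica type in the TFJob.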

How to use TFJob
1. Install training-operator
    kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.3.0"
2. Create the TFJob
   a. Python script → build a Docker image
   b. Write the Kubernetes YAML file
3. Apply the TFJob
    kubectl apply -f tfjob.yaml

Demo:
1. Run a local Kubernetes cluster (https://kind.sigs.k8s.io/)
2. Apply the sample TFJob
   a. kubectl create -f https://raw.githubusercontent.com/kubeflow/training-operator/master/examples/tensorflow/simple.yaml
3. Check the Pods
   a. kubectl logs tfjob-simple-worker-0 -f -n kubeflow
https://www.kubeflow.org/docs/components/training/tftraining/#running-the-mnist-example

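Summary
1. Introduced TFJob from the Kubeflow Training Operator
2. TFJob supports TensorFlow's distributed training
3. Covered the TFJob architecture and a simple demo
4. A good example of a script to run with TFJob could not be shown this time, so it is left as homework for a later session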

References
1. https://www.tensorflow.org/guide/distributed_training
2. https://www.tensorflow.org/tutorials/distribute/parameter_server_training
3. https://www.kubeflow.org/docs/components/training/tftraining/
4. https://github.com/kubeflow/training-operator/tree/master/examples/tensorflow
5. https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-li_mu.pdf
