2022-02-26 Kubeflow Training Operator - Introduction to TFJob @ Study Group on the Social Implementation of Machine Learning (機械学習の社会実装勉強会)

by Naka Masato

Slide 1

Slide 1 text

Kubeflow Training Operator - Introduction to TFJob 2022/02/26 Naka Masato

Slide 2

Slide 2 text

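About me
Name: Naka Masato (那珂将人)
Background:
● Developed recommendation engines as an algorithm engineer
● Built and maintained infrastructure platforms
GitHub: https://github.com/nakamasato
Twitter: https://twitter.com/gymnstcs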

Slide 3

Slide 3 text

What is Kubeflow Training Operator?
1. One of the Kubeflow components
2. Runs training Jobs for TensorFlow and other frameworks as Kubernetes Custom Resources
3. Implemented in the kubeflow/training-operator repository
[Diagram: training-operator, TFJob (Pod, Service), PyTorchJob, MXJob, XGBoostJob]

Slide 4

Slide 4 text

What is TFJob?
1. A CRD (Custom Resource Definition) for managing TensorFlow distributed training on Kubernetes
2. Controller
   a. Updates the TF_CONFIG environment variable, which holds the multi-worker training configuration
   b. Manages Pods and Services
[Diagram: Worker, Worker, Worker, PS, PS, Chief (coordinator), Evaluator, environment variable TF_CONFIG]

Slide 5

Slide 5 text

TensorFlow's Distributed Training
TensorFlow distributed training:
1. Enables training across multiple GPUs, machines, or TPUs
2. Comes in several types of tf.distribute.Strategy:
   a. MirroredStrategy
   b. TPUStrategy
   c. MultiWorkerMirroredStrategy
   d. ParameterServerStrategy
   e. CentralStorageStrategy
https://www.tensorflow.org/guide/distributed_training
TensorFlow distributed training will be covered again in the next session.

Slide 6

Slide 6 text

TensorFlow's Distributed Training - tf.distribute.Strategy
tf.distribute.Strategy has been designed with these key goals:
1. Easy to use
2. Provide good performance out of the box
3. Easy switching between strategies
tf.distribute.Strategy integrates with tf.keras:
1. Choose a tf.distribute.Strategy
2. Create the Keras model, optimizer, and metrics inside strategy.scope
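
The two Keras steps above can be sketched roughly as follows. This is a minimal illustration, not code from the talk: the model layers, the 784-feature input, and the helper names `global_batch_size` and `build_compiled_model` are assumptions chosen for the example. TensorFlow is imported inside the function so that the batch-size helper can be used without it.

```python
def global_batch_size(per_replica_batch_size: int, num_replicas: int) -> int:
    """In synchronous training, each step consumes one batch per replica."""
    return per_replica_batch_size * num_replicas


def build_compiled_model():
    """Create and compile a small Keras model inside strategy.scope()."""
    # Imported lazily so the helper above works without TensorFlow installed.
    import tensorflow as tf

    # Step 1: choose a strategy (MirroredStrategy uses the GPUs of one machine).
    strategy = tf.distribute.MirroredStrategy()

    # Step 2: build the model, optimizer, and metrics inside strategy.scope().
    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.Input(shape=(784,)),  # hypothetical input size
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(10),
        ])
        model.compile(
            optimizer=tf.keras.optimizers.Adam(),
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
        )
    return strategy, model
```

Calling `strategy, model = build_compiled_model()` and then `global_batch_size(64, strategy.num_replicas_in_sync)` gives the batch size to use when batching the dataset. Switching strategies should mostly mean changing the line that constructs `strategy`.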

Slide 7

Slide 7 text

TensorFlow's Distributed Training with Kubeflow Training Operator
1. TFJob: the definition of a TensorFlow Job
2. training-operator: creates Kubernetes objects (Pods and Services) from the Job definition
3. Pod & Service: each Pod runs the container defined in the Job, and Pods communicate with each other through Services
4. Container Image: the actual machine learning logic is packaged as a container image
[Diagram: training-operator reads a TFJob (kind: TFJob; Worker: replicas: 3, template; PS: replicas: 2, template) and creates 3 Worker Pods and 2 PS Pods, each with a Service]
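
As a rough sketch of step 2, the function below expands replica counts into the object names one might expect the operator to create. The naming pattern `<job>-<replica type>-<index>` is an assumption inferred from the demo Pod name `tfjob-simple-worker-0`; the job name `my-tfjob` and the helper `expected_objects` are hypothetical.

```python
from typing import Dict, List


def expected_objects(job_name: str, replicas: Dict[str, int]) -> List[str]:
    """List one name per replica; a Pod and a Service are assumed to share it."""
    names = []
    for replica_type, count in replicas.items():
        for index in range(count):
            # Assumed naming scheme: <job>-<lowercased replica type>-<index>
            names.append(f"{job_name}-{replica_type.lower()}-{index}")
    return names


# Same shape as the diagram: 3 Workers and 2 parameter servers.
names = expected_objects("my-tfjob", {"Worker": 3, "PS": 2})
for name in names:
    print(name)
```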

Slide 8

Slide 8 text

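TFJob YAML
1. kind is set to TFJob
2. spec.tfReplicaSpecs is a map
   a. Each key is a replicaType: one of PS, Worker, Chief, or Evaluator
   b. Number of replicas
      i. The example on the right of the original slide uses Parameter Server Strategy, with 1 PS replica and 3 Worker replicas
   c. Pod template
      i. Specifies the container, the command run at container startup, and so on
   d. restartPolicy
      i. Specifies whether a Pod is restarted when it exits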
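
The YAML example itself did not survive in this transcript, so here is a hedged sketch of the same structure built as a Python dictionary and serialized to JSON, which `kubectl apply -f` also accepts. The image name, command, job name, and `OnFailure` restart policy are placeholders for illustration. The `kubeflow.org/v1` API version and the container name `tensorflow` reflect my understanding of the operator's conventions and should be checked against the official examples.

```python
import json
from typing import Any, Dict, List


def replica_spec(replicas: int, image: str, command: List[str],
                 restart_policy: str = "OnFailure") -> Dict[str, Any]:
    """One entry of spec.tfReplicaSpecs: replica count, restart policy, Pod template."""
    return {
        "replicas": replicas,
        "restartPolicy": restart_policy,
        "template": {
            "spec": {
                "containers": [
                    # Assumption: the operator expects a container named "tensorflow".
                    {"name": "tensorflow", "image": image, "command": command}
                ]
            }
        },
    }


def build_tfjob(name: str, image: str, command: List[str]) -> Dict[str, Any]:
    """Parameter Server Strategy layout from the slide: 1 PS and 3 Workers."""
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"name": name, "namespace": "kubeflow"},
        "spec": {
            "tfReplicaSpecs": {
                "PS": replica_spec(1, image, command),
                "Worker": replica_spec(3, image, command),
            }
        },
    }


# Hypothetical image and command, for illustration only.
tfjob = build_tfjob("my-tfjob", "example.com/my-training:latest", ["python", "train.py"])
manifest_json = json.dumps(tfjob, indent=2)
print(manifest_json)
```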

Slide 9

Slide 9 text

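Applying the TFJob YAML
1. The operator creates Pods and Services for each replica type
2. Cluster information is passed to each container as the TF_CONFIG environment variable
Note: depending on the Strategy, cluster information may not be needed
Note: using GPUs requires additional setup such as installing GPU drivers
[Diagram: training-operator watches the TFJob and creates Worker and PS Pods with Services; each Pod receives TF_CONFIG]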
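
To make item 2 concrete, the sketch below builds the kind of TF_CONFIG value a replica might receive. The `cluster` and `task` keys are the structure TensorFlow defines for TF_CONFIG. The address format `<job>-<type>-<index>.<namespace>.svc:<port>`, the port 2222, and the helper `build_tf_config` are assumptions for illustration, not the operator's actual code.

```python
import json
from typing import Dict


def build_tf_config(job_name: str, namespace: str, replicas: Dict[str, int],
                    task_type: str, task_index: int, port: int = 2222) -> str:
    """Return a TF_CONFIG JSON string for one replica of the job."""
    cluster = {
        replica_type.lower(): [
            # Assumed address format: one Service per replica.
            f"{job_name}-{replica_type.lower()}-{i}.{namespace}.svc:{port}"
            for i in range(count)
        ]
        for replica_type, count in replicas.items()
    }
    return json.dumps({
        "cluster": cluster,                                 # same for every replica
        "task": {"type": task_type, "index": task_index},   # differs per replica
    })


tf_config = build_tf_config("my-tfjob", "kubeflow", {"PS": 1, "Worker": 3}, "worker", 0)
print(tf_config)
```

Every replica sees the same `cluster` section, while `task` identifies which member of the cluster the current container is.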

Slide 10

Slide 10 text

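Script & Docker image for TFJob
1. With multiple Workers, every Worker needs the cluster information as a ClusterSpec
   a. In TFJob it is passed as JSON in the TF_CONFIG environment variable
2. A single script branches on the replica type
   a. if job_name == "ps"
3. The details of writing the script are omitted this time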
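
Since the script details are omitted, here is only a minimal sketch of the branching idea in items 1 and 2: read TF_CONFIG, extract the replica type, and pick a code path. The helper names `read_task` and `choose_role`, the sample addresses, and the fallback to a single worker when TF_CONFIG is absent are assumptions; a real script would start a TensorFlow server or training loop in each branch.

```python
import json
import os
from typing import Dict, List, Mapping, Tuple


def read_task(environ: Mapping[str, str] = os.environ) -> Tuple[Dict[str, List[str]], str, int]:
    """Parse TF_CONFIG into (cluster, job_name, task_index)."""
    tf_config = json.loads(environ.get("TF_CONFIG", "{}"))
    cluster = tf_config.get("cluster", {})
    task = tf_config.get("task", {})
    # Assumption: without TF_CONFIG, behave as a single local worker.
    return cluster, task.get("type", "worker"), int(task.get("index", 0))


def choose_role(job_name: str) -> str:
    """One script, several roles: branch on the replica type."""
    if job_name == "ps":
        return "parameter-server"  # would host variables and wait for workers
    if job_name == "evaluator":
        return "evaluation"        # would evaluate checkpoints
    return "training"              # worker or chief runs the training loop


# Hypothetical TF_CONFIG for the second parameter server of a job.
sample_env = {
    "TF_CONFIG": json.dumps({
        "cluster": {"ps": ["ps-0:2222", "ps-1:2222"], "worker": ["worker-0:2222"]},
        "task": {"type": "ps", "index": 1},
    })
}
cluster, job_name, task_index = read_task(sample_env)
role = choose_role(job_name)
print(job_name, task_index, role)
```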

Slide 11

Slide 11 text

How to use TFJob
1. Install training-operator
2. Create a TFJob
   a. Python script → build a Docker image
   b. Write the Kubernetes YAML file
3. Apply the TFJob
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.3.0"
kubectl apply -f tfjob.yaml
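
If these steps need to be scripted, one possible approach is to wrap the two kubectl commands in Python. The commands are the ones from the slide; the helper names `kubectl_commands` and `run_all` and the dry-run default are illustrative assumptions. Running with `dry_run=False` requires kubectl and access to a cluster.

```python
import subprocess
from typing import List

OPERATOR_KUSTOMIZE_URL = (
    "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.3.0"
)


def kubectl_commands(tfjob_manifest: str = "tfjob.yaml") -> List[List[str]]:
    """Install the operator, then apply the TFJob manifest."""
    return [
        ["kubectl", "apply", "-k", OPERATOR_KUSTOMIZE_URL],
        ["kubectl", "apply", "-f", tfjob_manifest],
    ]


def run_all(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command, or execute it when dry_run is False."""
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            # check=True raises CalledProcessError if kubectl fails.
            subprocess.run(argv, check=True)


commands = kubectl_commands()
run_all(commands)  # dry run: only prints the commands
```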

Slide 12

Slide 12 text

Demo:
1. Run a local Kubernetes cluster (https://kind.sigs.k8s.io/)
2. Apply the sample TFJob
   a. kubectl create -f https://raw.githubusercontent.com/kubeflow/training-operator/master/examples/tensorflow/simple.yaml
3. Check the Pods
   a. kubectl logs tfjob-simple-worker-0 -f -n kubeflow
https://www.kubeflow.org/docs/components/training/tftraining/#running-the-mnist-example

Slide 13

Slide 13 text

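Summary
1. Introduced TFJob of the Kubeflow Training Operator
2. TFJob supports TensorFlow distributed training
3. Covered the TFJob architecture and a simple demo
4. A good example of a script to run with TFJob was not covered, so it is left as homework for future sessions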

Slide 14

Slide 14 text

References
1. https://www.tensorflow.org/guide/distributed_training
2. https://www.tensorflow.org/tutorials/distribute/parameter_server_training
3. https://www.kubeflow.org/docs/components/training/tftraining/
4. https://github.com/kubeflow/training-operator/tree/master/examples/tensorflow
5. https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-li_mu.pdf