Getting Started with Serverless Machine Learning Pipelines on Vertex Pipelines

Slide 1

Slide 1 text

Getting Started with Serverless Machine Learning Pipelines on Vertex Pipelines
Asei Sugiyama

Slide 2

Slide 2 text

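tl;dr
- A machine learning pipeline defines the series of steps carried out in machine learning so that they can be run as a pipeline.
- Vertex Pipelines is a GCP service that runs machine learning pipelines serverlessly; pipelines are written in Python with the Kubeflow Pipelines SDK.
- reproio/lab_sample_pipelines is available as a sample, so please try it out.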

Slide 3

Slide 3 text

About me
- Asei Sugiyama (@K_Ryuichirou)
- Software Engineer @ Repro: design and construction of the machine learning platform
- Advisor @ moneyforward
- TensorFlow User Group, TFX & Kubeflow Pipeline
- Co-author of 機械学習図鑑

Slide 4

Slide 4 text

TOC
- Machine learning pipelines <-
- Building pipelines with the Kubeflow Pipeline SDK
- Introducing the Lab Sample Pipeline

Slide 5

Slide 5 text

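What is a machine learning pipeline?
- The series of steps carried out in machine learning, arranged as a pipeline
- Typically includes data acquisition, preprocessing, training, and deployment
O'Reilly Japan - 入門機械学習パイプライン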
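As a rough illustration of the definition above, the four typical stages can be written as plain Python functions that feed one another in a fixed order. Everything in this sketch (the function names, the toy data, the "model" that is just a mean) is hypothetical and only stands in for real work; it does not reflect how any particular framework implements a pipeline.

```python
from typing import List


def fetch_data() -> List[float]:
    # Hypothetical data acquisition step: returns a tiny in-memory dataset.
    return [1.0, 2.0, 3.0, 4.0]


def preprocess(data: List[float]) -> List[float]:
    # Scale every value by the maximum so the largest value becomes 1.0.
    top = max(data)
    return [x / top for x in data]


def train(data: List[float]) -> float:
    # Toy "model": the mean of the training data.
    return sum(data) / len(data)


def deploy(model: float) -> str:
    # Stand-in for deployment: report which model would be served.
    return f"deployed model with mean={model:.3f}"


def run_pipeline() -> str:
    # The pipeline is the fixed sequence of steps, each feeding the next.
    return deploy(train(preprocess(fetch_data())))


if __name__ == "__main__":
    print(run_pipeline())
```

Writing the steps as separate functions with explicit inputs and outputs is what later makes it possible to run each one as its own component.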

Slide 6

Slide 6 text

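Requirements for machine learning pipelines
- Orchestration: manage the tasks that make up the pipeline and the resources used at run time
- Code and data management: manage together the things that determine the model's behavior
- Visualization: manage the inputs and outputs along with a visual representation of them (typically a graph)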

Slide 7

Slide 7 text

Machine learning is, above all, complex
Machine Learning: The High Interest Credit Card of Technical Debt – Google Research

Slide 8

Slide 8 text

TFX (TensorFlow Extended)
TFX: A TensorFlow-Based Production-Scale Machine Learning Platform – Google Research

Slide 9

Slide 9 text

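Kubeflow
- A project for running machine learning workflows on Kubernetes
- Started with the aim of being an OSS implementation of TFX, Google's internal machine learning platform (it now follows its own path)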

Slide 10

Slide 10 text

Kubeflow Pipelines
- One of the components of Kubeflow
- Runs machine learning pipelines

Slide 11

Slide 11 text

Vertex Pipelines
- A GCP service that runs machine learning pipelines serverlessly
- Runs the same pipelines as Kubeflow Pipelines

Slide 12

Slide 12 text

TOC
- Machine learning pipelines
- Building pipelines with the Kubeflow Pipeline SDK <-
- Introducing the Lab Sample Pipeline

Slide 13

Slide 13 text

Building pipelines with the Kubeflow Pipeline SDK
- Hello, world
- Defining the execution order of components
- Using GPUs

Slide 14

Slide 14 text

Hello, world

    def hello_world(text: str):
        print(text)

    hello_world("Hello, world")  # Hello, world

Next, we build a machine learning pipeline that runs this function.

Slide 15

Slide 15 text

Creating a component from a Python function

    from kfp.v2 import components

    hw_op = components.component_factory.create_component_from_func(
        hello_world,
        output_component_file="hw.yaml"
    )

- Prepares a component that runs the function hello_world
- hw_op is a factory function that creates the component
- hw.yaml is covered later

Slide 16

Slide 16 text

The @component decorator

    from kfp.v2.dsl import component

    @component(output_component_file='hw.yaml')
    def hw_op(text: str):
        print(text)

The same thing can be written more concisely with the @component decorator.

Slide 17

Slide 17 text

Defining the pipeline

    from kfp.v2 import dsl

    GCS_PIPELINE_ROOT = "gs://your-gcs-bucket/"

    @dsl.pipeline(
        name="hello-world",
        description="A simple intro pipeline",
        pipeline_root=f"{GCS_PIPELINE_ROOT}helloworld/"
    )
    def hello_world_pipeline(text: str = "hi there"):
        hello_world_task = hw_op(text)

pipeline_root is the bucket location where the pipeline's artifacts are stored.

Slide 18

Slide 18 text

Compiling

    from kfp.v2 import compiler

    compiler.Compiler().compile(
        pipeline_func=hello_world_pipeline,
        package_path='hello-world-pipeline.json',
    )

- Converts the pipeline from a Python function into JSON
- The schema of the generated JSON is called PipelineSpec

Slide 19

Slide 19 text

Execution result
- A pipeline consisting only of the hello_world component
- If it has been run before, the cached value is returned

Slide 20

Slide 20 text

hw.yaml
- Its contents are shown on the right
- Written in a schema called ComponentSpec

Slide 21

Slide 21 text

ComponentSpec (excerpt)

    inputs:
    - {name: text, type: String}   # definition of the input "text"
    # outputs:                     # outputs are defined under outputs
    implementation:
      container:
        image: python:3.7          # the container to run
        command:                   # the command to run in the container
        - sh
        - -ec
        ...
        args:                      # the arguments passed to the container
        - --text
        - {inputValue: text}       # receives the value given in the pipeline
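To make the `{inputValue: text}` placeholder concrete, the following sketch shows roughly how an `args` list from a ComponentSpec could be turned into the argument list handed to the container. `resolve_args` is a hypothetical helper written for this explanation, not part of the kfp SDK, and it handles only the `inputValue` placeholder.

```python
from typing import Dict, List, Union

# An argument is either a literal string or a single-key placeholder mapping.
Arg = Union[str, Dict[str, str]]


def resolve_args(args: List[Arg], inputs: Dict[str, object]) -> List[str]:
    # Hypothetical resolver: replaces {inputValue: name} with the pipeline value.
    resolved: List[str] = []
    for arg in args:
        if isinstance(arg, str):
            # Literal arguments such as "--text" pass through unchanged.
            resolved.append(arg)
        elif "inputValue" in arg:
            # Look up the value that the pipeline supplied for this input.
            resolved.append(str(inputs[arg["inputValue"]]))
        else:
            raise ValueError(f"unsupported placeholder: {arg}")
    return resolved


spec_args: List[Arg] = ["--text", {"inputValue": "text"}]
print(resolve_args(spec_args, {"text": "Hello, world"}))
# ['--text', 'Hello, world']
```

Seen this way, a ComponentSpec is a template for a container invocation, and the pipeline fills in the placeholders at run time.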

Slide 22

Slide 22 text

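Defining the execution order of components
Either of the following two approaches works:
- Define the order from input/output dependencies <- recommended
- Define the order explicitly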

Slide 23

Slide 23 text

Producer / Consumer Pipeline (1/2)

    @dsl.component
    def echo(text: str) -> str:
        return text

    @dsl.pipeline(
        name="producer-consumer-pipeline",
        pipeline_root=f"{GCS_PIPELINE_ROOT}producer-consumer/"
    )
    def producer_consumer_pipeline(text: str = "hi there"):
        producer = echo(text)
        consumer = echo(producer.output)  # uses the producer's output

Slide 24

Slide 24 text

Producer / Consumer Pipeline (2/2)
- The execution result is shown on the right
- The execution order is determined from the input/output dependencies
- Caching is available

Slide 25

Slide 25 text

Defining the execution order explicitly

    @dsl.pipeline(
        name="producer-consumer",
        pipeline_root=f"{GCS_PIPELINE_ROOT}producer-consumer/"
    )
    def use_after_pipeline(text: str = "hi there"):
        producer = echo(text)
        consumer = echo(text).after(producer)  # runs after the producer

- after lets you define the execution order explicitly
- Note that caching is not necessarily guaranteed
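Both ways of ordering components come down to edges in a dependency graph: an edge appears when a task consumes another task's output, or when `after` is called. The sketch below derives a run order from such edges with a topological sort. The task names and the `run_order` helper are illustrative assumptions; this is not the scheduler that Kubeflow Pipelines or Vertex Pipelines actually uses.

```python
from collections import deque
from typing import Dict, List, Set


def run_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    # dependencies maps every task to the set of tasks it must wait for.
    # Every task must appear as a key, even if it waits for nothing.
    remaining = {task: set(deps) for task, deps in dependencies.items()}
    ready = deque(sorted(t for t, deps in remaining.items() if not deps))
    order: List[str] = []
    while ready:
        task = ready.popleft()
        order.append(task)
        # Release every task that was waiting only on the one just finished.
        for other in sorted(remaining):
            deps = remaining[other]
            if task in deps:
                deps.remove(task)
                if not deps:
                    ready.append(other)
    if len(order) != len(remaining):
        raise ValueError("cycle detected in task dependencies")
    return order


# consumer depends on producer, whether through producer.output or .after(producer).
print(run_order({"producer": set(), "consumer": {"producer"}}))
# ['producer', 'consumer']
```

The resulting order is the same for both styles; what differs is whether the dependency carries data from one task to the next.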

Slide 26

Slide 26 text

Using GPUs

    @dsl.pipeline(
        name="hello-world-with-gpu",
        pipeline_root=f"{GCS_PIPELINE_ROOT}helloworld/"
    )
    def hello_world_pipeline_with_gpu(text: str = "hi accelerator"):
        hello_world_task = (hw_op(text)
            .add_node_selector_constraint(           # declare that a GPU is added
                'cloud.google.com/gke-accelerator',  # the available GPU types are
                'nvidia-tesla-k80')                  # the same as for Custom Training
            .set_gpu_limit(1))

- CPU, memory, and GPU can be specified per component
- Note that allocating the resources can fail

Slide 27

Slide 27 text

TOC
- Machine learning pipelines
- Building pipelines with the Kubeflow Pipeline SDK
- Introducing the Lab Sample Pipeline <-

Slide 28

Slide 28 text

Introducing the Lab Sample Pipeline
- Pipeline overview
- Directory structure
- Component implementation and unit tests
- Example ComponentSpec
- Pipeline implementation
- Metadata

Slide 29

Slide 29 text

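Pipeline overview
reproio/lab_sample_pipelines
- A machine learning pipeline that uses a subset of the penguin dataset
- Uses scikit-learn as the machine learning library
- A simple pipeline covering preprocessing, training, and evaluation
- Runs on Vertex Pipelines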

Slide 30

Slide 30 text

Directory structure

    % tree .
    .
    ├── README.md
    ├── components
    │   ├── data_generator           # one directory per component
    │   │   ├── Dockerfile           # the component's Dockerfile
    │   │   ├── README.md            # the component's specification
    │   │   ├── data_generator.yaml  # ComponentSpec
    │   │   ├── src                  # source code
    │   │   └── tests                # test code
    │   ...
    └── pipeline.py                  # pipeline definition

The layout is the same as the one in the Kubeflow Pipelines documentation.

Slide 31

Slide 31 text

Component implementation

    if __name__ == "__main__":
        # Parse the arguments
        artifacts = Artifacts.from_args()
        # Do the work in main
        train_data, eval_data = main(artifacts.component_arguments)
        # Save CSV files to the given output destinations
        write_csv(artifacts.output_destinations.train_data_path, train_data)
        write_csv(artifacts.output_destinations.eval_data_path, eval_data)

A component is an ordinary CLI application.
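Because a component is an ordinary CLI application, its argument handling can be sketched with `argparse`. The slide does not show how `Artifacts.from_args()` is implemented, so the `OutputDestinations` class, the two positional arguments, and this version of `write_csv` are assumptions made for illustration rather than the code in reproio/lab_sample_pipelines.

```python
import argparse
import csv
import os
import tempfile
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class OutputDestinations:
    # Hypothetical container for the output paths handed to the component.
    train_data_path: str
    eval_data_path: str

    @classmethod
    def from_args(cls, argv: Optional[Sequence[str]] = None) -> "OutputDestinations":
        # Assumes the two output paths arrive as positional arguments.
        parser = argparse.ArgumentParser(description="Sketch of a data generator CLI")
        parser.add_argument("train_data_path")
        parser.add_argument("eval_data_path")
        ns = parser.parse_args(argv)
        return cls(ns.train_data_path, ns.eval_data_path)


def write_csv(path: str, rows: List[List[object]]) -> None:
    # Write rows to the destination chosen by the caller.
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)


if __name__ == "__main__":
    # Demo with temporary files standing in for the paths a pipeline would supply.
    with tempfile.TemporaryDirectory() as tmp:
        argv = [os.path.join(tmp, "train.csv"), os.path.join(tmp, "eval.csv")]
        out = OutputDestinations.from_args(argv)
        write_csv(out.train_data_path, [["a", "b"], [1, 2]])
        write_csv(out.eval_data_path, [["a", "b"], [3, 4]])
        print(sorted(os.listdir(tmp)))
```

Keeping the argument parsing separate from the processing logic means the logic can be unit tested without any pipeline infrastructure.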

Slide 32

Slide 32 text

Unit tests

    import numpy as np
    import pytest

    class TestTrainModel:
        def test_train_model_from_small_dataset(self):
            source = np.array(
                [(0, 1, 2, 3, 4), (1, 2, 3, 4, 5), (0, 3, 4, 5, 6)],
                ...
            model = trainer.train(source, "species_xf")
            assert model is not None
            assert model.predict([[1, 2, 3, 4]]) == [0]

- Unit tests are implemented with pytest
- From experience, you should write unit tests wherever possible

Slide 33

Slide 33 text

ComponentSpec (data-generator)

    name: data_generator
    outputs:
    - { name: train_data_path, type: { GCPPath: { data_type: CSV } } }
    - { name: eval_data_path, type: { GCPPath: { data_type: CSV } } }
    implementation:
      container:
        image: us.gcr.io/your-project-id/kfp-sample-data-generator:v0.0.0
        args:
        - { outputPath: train_data_path }
        - { outputPath: eval_data_path }

- With kfp 1.8.4 this compiles even without command
- command will become required in the future

Slide 34

Slide 34 text

Pipeline implementation (excerpt)

    @kfp.dsl.pipeline(
        name=PIPELINE_NAME,
        pipeline_root=f"gs://{GCP_GCS_PIPELINE_ROOT}/",
    )
    def kfp_sample_pipeline(suffix: str = "_xf"):
        data_generator = _data_generator_op()
        transform = _transform_op(
            train_data_path=data_generator.outputs[GeneratedData.TrainData.value],
            eval_data_path=data_generator.outputs[GeneratedData.EvalData.value],
            suffix=suffix,
        )

When a component has multiple outputs, index outputs with the name of the output variable.

Slide 35

Slide 35 text

Execution result
- The components are laid out vertically
- Each component's inputs and outputs are displayed alongside them

Slide 36

Slide 36 text

Metadata

Slide 37

Slide 37 text

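Summary
- A machine learning pipeline defines the series of steps carried out in machine learning so that they can be run as a pipeline.
- Vertex Pipelines is a GCP service that runs machine learning pipelines serverlessly; pipelines are written in Python with the Kubeflow Pipelines SDK.
- reproio/lab_sample_pipelines is available as a sample, so please try it out.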