Getting Started with Serverless Machine Learning Pipelines on Vertex Pipelines

Asei Sugiyama

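tl;dr
- A machine learning pipeline is the series of processing steps carried out in machine learning, defined so that they can be run as a pipeline.
- Vertex Pipelines is a GCP service that runs machine learning pipelines serverlessly; pipelines are written in Python with the Kubeflow Pipelines SDK.
- A sample is available at reproio/lab_sample_pipelines, so please give it a try.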

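About me
- Asei Sugiyama (@K_Ryuichirou)
- Software Engineer @ Repro, designing and building the machine learning platform
- Advisor @ moneyforward
- TensorFlow User Group, TFX & Kubeflow Pipeline
- Co-author of 機械学習図鑑 (an illustrated guide to machine learning)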

TOC
- Machine learning pipelines <-
- Building pipelines with the Kubeflow Pipelines SDK
- Introducing Lab Sample Pipeline

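What is a machine learning pipeline?
- The series of processing steps carried out in machine learning, arranged as a pipeline
- Typically includes data acquisition, preprocessing, training, and deployment
Source: O'Reilly Japan - 入門機械学習パイプライン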
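
As a rough illustration of the definition above, the sketch below chains the four typical stages as plain Python functions. All function names, the toy dataset, and the "model" (a simple mean) are hypothetical stand-ins and are not taken from the deck.

```python
from statistics import mean
from typing import Any, Callable, List, Optional


def fetch_data(_: Any = None) -> List[Optional[float]]:
    """Data acquisition: a hard-coded toy dataset with one missing value."""
    return [1.0, 2.0, None, 3.0]


def preprocess(rows: List[Optional[float]]) -> List[float]:
    """Preprocessing: drop missing values."""
    return [r for r in rows if r is not None]


def train(rows: List[float]) -> float:
    """Training: the toy 'model' is just the mean of the data."""
    return mean(rows)


def deploy(model: float) -> str:
    """Deployment: pretend to publish the model and report what was published."""
    return f"deployed model={model}"


def run_pipeline(steps: List[Callable[[Any], Any]]) -> Any:
    """Run the steps in order, feeding each output into the next step."""
    value = None
    for step in steps:
        value = step(value)
    return value


result = run_pipeline([fetch_data, preprocess, train, deploy])
```

A real pipeline adds what this sketch leaves out: resource management per step, tracking of code and data, and a visualization of the run.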

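Requirements for a machine learning pipeline
- Orchestration: manage the tasks that make up the pipeline and the resources they use at run time
- Code and data management: manage together everything that determines the model's behavior
- Visualization: manage the inputs and outputs along with a visualization of them (typically a graph)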

Machine learning is, above all, complex
Source: Machine Learning: The High Interest Credit Card of Technical Debt – Google Research

TFX (TensorFlow Extended)
Source: TFX: A TensorFlow-Based Production-Scale Machine Learning Platform – Google Research

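Kubeflow
- A project for running machine learning workflows on Kubernetes
- Started with the aim of being an OSS implementation of TFX, Google's internal machine learning platform (it now follows its own path)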

Kubeflow Pipelines
- One of the components of Kubeflow
- Runs machine learning pipelines

Vertex Pipelines
- A GCP service that runs machine learning pipelines serverlessly
- Runs the same pipelines as Kubeflow Pipelines

TOC
- Machine learning pipelines
- Building pipelines with the Kubeflow Pipelines SDK <-
- Introducing Lab Sample Pipeline

Building pipelines with the Kubeflow Pipelines SDK
- Hello, world
- Defining the execution order of components
- Using GPUs

Hello, world

    def hello_world(text: str):
        print(text)

    hello_world("Hello, world")  # Hello, world

We will now build a machine learning pipeline that runs this function.

Creating a component from a Python function

    from kfp.v2 import components

    hw_op = components.component_factory.create_component_from_func(
        hello_world,
        output_component_file="hw.yaml"
    )

- Prepares a component that runs the function hello_world
- hw_op is a factory function that creates the component
- hw.yaml is covered later

The @component decorator

    from kfp.v2.dsl import component

    @component(output_component_file='hw.yaml')
    def hw_op(text: str):
        print(text)

The same thing can be written more concisely with the @component decorator.

Defining the pipeline

    from kfp.v2 import dsl

    GCS_PIPELINE_ROOT = "gs://your-gcs-bucket/"

    @dsl.pipeline(
        name="hello-world",
        description="A simple intro pipeline",
        pipeline_root=f"{GCS_PIPELINE_ROOT}helloworld/"
    )
    def hello_world_pipeline(text: str = "hi there"):
        hello_world_task = hw_op(text)

pipeline_root is the bucket location where the pipeline's artifacts are stored.

Compiling

    from kfp.v2 import compiler

    compiler.Compiler().compile(
        pipeline_func=hello_world_pipeline,
        package_path='hello-world-pipeline.json',
    )

- Converts the pipeline from a Python function into JSON
- The schema of the generated JSON is called PipelineSpec
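
The deck does not show how the compiled JSON is submitted. One possible way, sketched here under assumptions, is the PipelineJob class of the google-cloud-aiplatform package. The helper names, the project ID, and the region are hypothetical placeholders, and the submit call is left commented out because it needs GCP credentials and a real bucket.

```python
from typing import Any, Dict


def build_pipeline_job_kwargs(
    template_path: str, pipeline_root: str, text: str
) -> Dict[str, Any]:
    """Collect the arguments for aiplatform.PipelineJob in one place."""
    return {
        "display_name": "hello-world",
        "template_path": template_path,      # JSON written by the compiler
        "pipeline_root": pipeline_root,      # where artifacts are stored
        "parameter_values": {"text": text},  # arguments of the pipeline function
    }


def submit(job_kwargs: Dict[str, Any], project: str, location: str) -> None:
    """Submit the compiled pipeline and wait for the run to finish."""
    # Imported lazily so the helper above works without the SDK installed.
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=location)
    aiplatform.PipelineJob(**job_kwargs).run()


job_kwargs = build_pipeline_job_kwargs(
    template_path="hello-world-pipeline.json",
    pipeline_root="gs://your-gcs-bucket/helloworld/",
    text="hi there",
)
# Placeholder project and region; uncomment once credentials are configured.
# submit(job_kwargs, project="your-project-id", location="us-central1")
```

Keeping the argument dictionary separate from the submit call makes the configuration easy to inspect and test without touching GCP.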

Execution result
- A pipeline containing only the hello_world component
- If it has been run before, the cached value is returned
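
To make the caching behavior concrete, here is a conceptual model only: results are stored under a key derived from the component name and its inputs, so a repeated call with the same inputs skips execution. How Vertex Pipelines actually computes its cache key is not described in the deck; cache_key and run_cached are hypothetical names.

```python
import hashlib
import json
from typing import Any, Callable, Dict, List

_cache: Dict[str, Any] = {}


def cache_key(component_name: str, inputs: Dict[str, Any]) -> str:
    """Derive a stable key from the component name and its input values."""
    payload = json.dumps(
        {"component": component_name, "inputs": inputs}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_cached(
    component_name: str, func: Callable[..., Any], inputs: Dict[str, Any]
) -> Any:
    """Run func only if no result is stored for this name and these inputs."""
    key = cache_key(component_name, inputs)
    if key not in _cache:
        _cache[key] = func(**inputs)
    return _cache[key]


executions: List[str] = []


def hello_world(text: str) -> str:
    executions.append(text)  # record every real execution
    return text


run_cached("hello_world", hello_world, {"text": "hi there"})
run_cached("hello_world", hello_world, {"text": "hi there"})  # served from cache
run_cached("hello_world", hello_world, {"text": "other"})     # new inputs, runs again
```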

hw.yaml
- Its contents are shown on the right (on the slide)
- Written in a schema called ComponentSpec

ComponentSpec (excerpt)

    inputs:
    - {name: text, type: String}  # definition of the input text
    # outputs:                    # outputs are defined under outputs
    implementation:
      container:
        image: python:3.7         # the container to run
        command:                  # the command to run in the container
        - sh
        - -ec
        ...
        args:                     # the arguments passed to the container
        - --text
        - {inputValue: text}      # receives the value given in the pipeline
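
The following sketch illustrates the idea behind the {inputValue: text} placeholder: before the container starts, each placeholder in args is replaced by the value supplied for that input. It is a simplified model, not the Kubeflow Pipelines implementation; resolve_args is a hypothetical helper and only the inputValue placeholder is handled.

```python
from typing import Any, Dict, List, Union

Arg = Union[str, Dict[str, str]]


def resolve_args(args: List[Arg], input_values: Dict[str, Any]) -> List[str]:
    """Replace {inputValue: name} placeholders with the supplied values."""
    resolved: List[str] = []
    for arg in args:
        if isinstance(arg, dict) and "inputValue" in arg:
            # Look up the pipeline-supplied value by input name.
            resolved.append(str(input_values[arg["inputValue"]]))
        elif isinstance(arg, str):
            # Literal arguments are passed through unchanged.
            resolved.append(arg)
        else:
            raise ValueError(f"unsupported placeholder: {arg!r}")
    return resolved


# The args section of the excerpt above, written as Python data.
args_from_spec: List[Arg] = ["--text", {"inputValue": "text"}]
command_line_args = resolve_args(args_from_spec, {"text": "hi there"})
```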

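Defining the execution order of components
Either of the following two approaches works:
- Defining the execution order from input/output dependencies <- recommended
- Defining the execution order explicitly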

Producer / Consumer Pipeline (1/2)

    @dsl.component
    def echo(text: str) -> str:
        return text

    @dsl.pipeline(
        name="producer-consumer-pipeline",
        pipeline_root=f"{GCS_PIPELINE_ROOT}producer-consumer/"
    )
    def producer_consumer_pipeline(text: str = "hi there"):
        producer = echo(text)
        consumer = echo(producer.output)  # use the producer's output

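Producer / Consumer Pipeline (2/2)
- The execution result is shown on the right (on the slide)
- The execution order is determined from the input/output dependencies
- Caching is available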
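
How an execution order can follow from input/output dependencies is sketched here with the standard library's graphlib (Python 3.9 or later). The task dictionary and the ("output_of", name) convention are hypothetical and only model the producer/consumer example; they are not how the SDK represents tasks.

```python
from graphlib import TopologicalSorter  # Python 3.9+
from typing import Dict, Set, Tuple, Union

# A task input is either a literal value or a reference to another task's output.
OutputOf = Tuple[str, str]  # ("output_of", <task name>)
TaskInputs = Dict[str, Union[str, OutputOf]]


def dependencies(tasks: Dict[str, TaskInputs]) -> Dict[str, Set[str]]:
    """Map each task to the set of tasks whose outputs it consumes."""
    deps: Dict[str, Set[str]] = {}
    for name, inputs in tasks.items():
        deps[name] = {
            value[1]
            for value in inputs.values()
            if isinstance(value, tuple) and value[0] == "output_of"
        }
    return deps


# Declared in "wrong" order on purpose: the order comes from the data flow.
tasks: Dict[str, TaskInputs] = {
    "consumer": {"text": ("output_of", "producer")},
    "producer": {"text": "hi there"},
}
execution_order = list(TopologicalSorter(dependencies(tasks)).static_order())
```

Because the consumer reads the producer's output, the producer is scheduled first regardless of the order in which the tasks were declared.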

Defining the execution order explicitly

    @dsl.pipeline(
        name="producer-consumer",
        pipeline_root=f"{GCS_PIPELINE_ROOT}producer-consumer/"
    )
    def use_after_pipeline(text: str = "hi there"):
        producer = echo(text)
        consumer = echo(text).after(producer)  # run after the producer

- Using after lets you define the execution order explicitly
- Note that caching is not necessarily guaranteed

Using GPUs

    @dsl.pipeline(
        name="hello-world-with-gpu",
        pipeline_root=f"{GCS_PIPELINE_ROOT}helloworld/"
    )
    def hello_world_pipeline_with_gpu(text: str = "hi accelerator"):
        hello_world_task = (hw_op(text)
            .add_node_selector_constraint(  # declare that a GPU is added
                'cloud.google.com/gke-accelerator',  # the available GPU types are
                'nvidia-tesla-k80')  # the same as for Custom Training
            .set_gpu_limit(1))

- CPU, memory, and GPU can be specified per component
- Note that allocating the resources can fail

TOC
- Machine learning pipelines
- Building pipelines with the Kubeflow Pipelines SDK
- Introducing Lab Sample Pipeline <-

Introducing Lab Sample Pipeline
- Pipeline overview
- Directory structure
- Component implementation and unit tests
- ComponentSpec example
- Pipeline implementation
- Metadata

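Pipeline overview
- reproio/lab_sample_pipelines
- A machine learning pipeline that uses a subset of the penguin dataset
- Uses scikit-learn as the machine learning library
- A simple pipeline consisting of preprocessing, training, and evaluation
- Runs on Vertex Pipelines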

Directory structure

    % tree .
    .
    ├── README.md
    ├── components
    │   ├── data_generator           # one directory per component
    │   │   ├── Dockerfile           # the component's Dockerfile
    │   │   ├── README.md            # the component's specification
    │   │   ├── data_generator.yaml  # ComponentSpec
    │   │   ├── src                  # source code
    │   │   └── tests                # test code
    │   ...
    └── pipeline.py                  # pipeline definition

This is the same layout as in the Kubeflow Pipelines documentation.

Component implementation

    if __name__ == "__main__":
        # parse the arguments
        artifacts = Artifacts.from_args()
        # do the processing in the main function
        train_data, eval_data = main(artifacts.component_arguments)
        # save the CSV files to the given output destinations
        write_csv(
            artifacts.output_destinations.train_data_path,
            train_data)
        write_csv(
            artifacts.output_destinations.eval_data_path,
            eval_data)

A component is an ordinary CLI application.
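
To show what "an ordinary CLI application" can look like end to end, here is a minimal sketch using argparse in place of the repository's Artifacts.from_args(), whose implementation is not shown in the deck. The two positional arguments, the toy rows, and the run helper are assumptions made for illustration.

```python
import argparse
import csv
from pathlib import Path
from typing import List, Optional, Sequence, Tuple

Row = Tuple[str, float]


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse the two output destinations passed on the command line."""
    parser = argparse.ArgumentParser(description="Toy data generator component")
    parser.add_argument("train_data_path", type=Path)
    parser.add_argument("eval_data_path", type=Path)
    return parser.parse_args(argv)


def main() -> Tuple[List[Row], List[Row]]:
    """Produce toy data and split it into training and evaluation rows."""
    rows: List[Row] = [("a", 1.0), ("b", 2.0), ("c", 3.0), ("d", 4.0)]
    return rows[:3], rows[3:]


def write_csv(path: Path, rows: List[Row]) -> None:
    """Write rows as CSV, creating the parent directory if it is missing."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as f:
        csv.writer(f).writerows(rows)


def run(argv: Optional[Sequence[str]] = None) -> None:
    """Entry point: parse arguments, generate data, save both CSV files."""
    args = parse_args(argv)
    train_data, eval_data = main()
    write_csv(args.train_data_path, train_data)
    write_csv(args.eval_data_path, eval_data)


# A real component would call run() under `if __name__ == "__main__":`.
```

Keeping main free of argument parsing and file I/O is what makes it straightforward to cover with unit tests.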

Unit tests

    import pytest

    class TestTrainModel:
        def test_train_model_from_small_dataset(self):
            source = np.array(
                [(0, 1, 2, 3, 4), (1, 2, 3, 4, 5), (0, 3, 4, 5, 6)],
                ...
            model = trainer.train(source, "species_xf")
            assert model is not None
            assert model.predict([[1, 2, 3, 4]]) == [0]

- Unit tests are implemented with pytest
- From experience, you should write unit tests wherever possible

ComponentSpec (data-generator)

    name: data_generator
    outputs:
      - { name: train_data_path, type: { GCPPath: { data_type: CSV } } }
      - { name: eval_data_path, type: { GCPPath: { data_type: CSV } } }
    implementation:
      container:
        image: us.gcr.io/your-project-id/kfp-sample-data-generator:v0.0.0
        args:
          - { outputPath: train_data_path }
          - { outputPath: eval_data_path }

- With kfp 1.8.4, the spec compiles even if command is omitted
- command will become mandatory in the future

Pipeline implementation (excerpt)

    @kfp.dsl.pipeline(
        name=PIPELINE_NAME,
        pipeline_root=f"gs://{GCP_GCS_PIPELINE_ROOT}/",
    )
    def kfp_sample_pipeline(suffix: str = "_xf"):
        data_generator = _data_generator_op()
        transform = _transform_op(
            train_data_path=data_generator.outputs[GeneratedData.TrainData.value],
            eval_data_path=data_generator.outputs[GeneratedData.EvalData.value],
            suffix=suffix,
        )

When a component has multiple outputs, pass the output name to outputs to select one.

Execution result
- The components are lined up vertically
- The inputs and outputs of each component are displayed at the same time
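
The pipeline implementation above indexes data_generator.outputs with GeneratedData.TrainData.value, but the deck does not show how GeneratedData is defined. A plausible definition, assumed here, is an Enum whose values match the output names in the ComponentSpec. The outputs dictionary is a stand-in for the task's real outputs mapping.

```python
from enum import Enum


class GeneratedData(Enum):
    # Assumption: values match the output names declared in the ComponentSpec.
    TrainData = "train_data_path"
    EvalData = "eval_data_path"


# Stand-in for `data_generator.outputs`: a mapping keyed by output name.
outputs = {
    "train_data_path": "<train data output>",
    "eval_data_path": "<eval data output>",
}

# `.value` yields the plain string key, as in the pipeline definition.
train_output = outputs[GeneratedData.TrainData.value]
eval_output = outputs[GeneratedData.EvalData.value]
```

Naming the outputs in one Enum keeps the pipeline definition and the ComponentSpec from drifting apart through typos in string literals.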

Metadata

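Summary
- A machine learning pipeline is the series of processing steps carried out in machine learning, defined so that they can be run as a pipeline.
- Vertex Pipelines is a GCP service that runs machine learning pipelines serverlessly; pipelines are written in Python with the Kubeflow Pipelines SDK.
- A sample is available at reproio/lab_sample_pipelines, so please give it a try.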
