• Data quality matters as much as the model itself: "garbage in, garbage out". • TFDV (TensorFlow Data Validation) is an open-source library that helps developers automatically analyze, validate, and monitor their ML data at scale. • TFDV is part of the TFX platform, and this technology is used to analyze and validate petabytes of data at Google every day. • Data validation has three steps: 1. computing and visualizing descriptive statistics, 2. inferring a schema, 3. validating new data.
• Formats: under the hood, TFDV uses Apache Beam to define and process its data pipelines, so it can use Beam I/O connectors and PTransforms to handle different data formats. TFDV provides two helper functions, one for CSV and one for TFRecord files of serialized tf.Examples (see the sketch below). • Scale: TFDV creates an Apache Beam pipeline that can run on the DirectRunner in a notebook environment, on the DataflowRunner on Google Cloud Platform, or inside a Kubeflow Pipelines / Airflow workflow.
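A minimal sketch of those two helper functions in a notebook (file paths are placeholders):

```python
import tensorflow_data_validation as tfdv

# Compute descriptive statistics from a CSV file (path is a placeholder).
train_stats = tfdv.generate_statistics_from_csv(data_location='data/train.csv')

# Or from a TFRecord file of serialized tf.Examples.
serving_stats = tfdv.generate_statistics_from_tfrecord(data_location='data/serving.tfrecord')

# Visualize the statistics interactively in the notebook.
tfdv.visualize_statistics(train_stats)
```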
• TFDV infers a schema (a schema.proto) from the statistics to reflect the stable characteristics of the data. • The schema is used to validate new data (see the sketch below), and the same schema format also serves as the interface to other components in the TFX ecosystem, e.g. data transformation with TensorFlow Transform.
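A minimal sketch of schema inference and validation (the statistics variables come from the sketch above):

```python
import tensorflow_data_validation as tfdv

# Infer a Schema proto that captures the stable characteristics of the training data.
schema = tfdv.infer_schema(statistics=train_stats)
tfdv.display_schema(schema)

# Validate statistics from new data (e.g. an eval split) against that schema.
eval_stats = tfdv.generate_statistics_from_csv(data_location='data/eval.csv')
anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)
tfdv.display_anomalies(anomalies)
```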
• TFDV can also compute statistics over serving logs and validate them against the same schema, taking into account any expected differences between training and serving data (e.g. training data has a label while serving logs do not); see the sketch below.
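One way to express that expected difference is with schema environments; this is a sketch assuming a hypothetical label feature named 'tips':

```python
# Declare TRAINING and SERVING environments on the schema, and state that the
# label feature ('tips', hypothetical) is absent in SERVING.
schema.default_environment.append('TRAINING')
schema.default_environment.append('SERVING')
tfdv.get_feature(schema, 'tips').not_in_environment.append('SERVING')

# Validate the serving-log statistics against the schema in the SERVING environment.
serving_anomalies = tfdv.validate_statistics(serving_stats, schema, environment='SERVING')
tfdv.display_anomalies(serving_anomalies)
```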
• TFT runs its preprocessing using large-scale data processing frameworks: locally (e.g. via Airflow), on Apache Beam, on Dataflow (a Google Cloud service), or in a Kubeflow Pipeline. • It also exports the preprocessing pipeline so it can run as part of a TensorFlow graph, so the same transformations can be used in training and serving. • TFT requires the specification of a schema to parse the data into tensors. Users can 1. manually specify a schema (declaring the type of each feature), or 2. reuse the schema inferred by TFDV, which radically simplifies the use of TFT (see the sketch below).
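A rough sketch of the two options, using hypothetical feature names (`schema` is the proto inferred by TFDV above):

```python
import tensorflow as tf
from tensorflow_transform.tf_metadata import dataset_metadata, schema_utils

# Option 1: manually specify the type of each feature.
raw_metadata = dataset_metadata.DatasetMetadata(
    schema_utils.schema_from_feature_spec({
        'trip_miles': tf.io.FixedLenFeature([], tf.float32),
        'payment_type': tf.io.FixedLenFeature([], tf.string),
    }))

# Option 2: reuse the schema TFDV inferred, which is much simpler.
raw_metadata = dataset_metadata.DatasetMetadata(schema)
```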
• Preprocessing and feature construction • Feature crosses • TensorFlow Transform • Feature engineering is usually the most time-consuming part of the ML workflow. Let's talk about feature engineering first.
Feature Engineering - turn raw data into features • Vocabulary mapping • … • E.g. Street_Name: {‘Charleston Road’, ‘North Shoreline Boulevard’, ‘Shorebird Way’, ‘Rengstorff Avenue’}; only a handful of streets appear in the data. • Method 1 (X): map Charleston Road to 0, North Shoreline Boulevard to 1, Shorebird Way to 2, Rengstorff Avenue to 3, and everything else (OOV) to 4. Problems: the integer ids imply an ordering that does not exist, and a single example might contain multiple streets at the same time. • Method 2 (O): one-hot (or multi-hot) encoding, as sketched below.
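A tiny sketch of the difference between the two methods (street names as above; the encoding itself is illustrative):

```python
import tensorflow as tf

streets = ['Charleston Road', 'North Shoreline Boulevard',
           'Shorebird Way', 'Rengstorff Avenue']

# Method 1: integer ids (plus 4 for OOV) imply an ordering that does not exist.
street_to_id = {name: i for i, name in enumerate(streets)}

# Method 2: one-hot / multi-hot encoding handles multiple streets per example
# without any spurious ordering.
ids = tf.constant([street_to_id['Shorebird Way'], street_to_id['Charleston Road']])
multi_hot = tf.reduce_max(tf.one_hot(ids, depth=len(streets) + 1), axis=0)
# multi_hot -> [1., 0., 1., 0., 0.]
```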
• tf.Transform is a preprocessing library for TensorFlow for creating features that require a full pass over the training dataset, e.g.: • normalizing an input value using the dataset-wide mean and standard deviation (normalization), • converting strings to integers by generating a vocabulary over all of the input values (vocabulary generation), • converting floats to integers by assigning them to buckets based on the observed data distribution (bucketizing). • Plain TensorFlow operates on a single example or a batch of examples; tf.Transform extends these capabilities to support full passes over the entire training dataset (this is where Apache Beam helps). A minimal preprocessing_fn is sketched below.
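A minimal preprocessing_fn sketch using those three full-pass analyzers (feature names are hypothetical):

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    """Full-pass preprocessing: each analyzer looks at the whole dataset."""
    outputs = {}
    # Normalize using the dataset-wide mean and standard deviation.
    outputs['trip_miles_scaled'] = tft.scale_to_z_score(inputs['trip_miles'])
    # Map strings to integer ids via a vocabulary over all input values.
    outputs['payment_type_id'] = tft.compute_and_apply_vocabulary(inputs['payment_type'])
    # Bucketize floats using boundaries computed from the observed distribution.
    outputs['fare_bucket'] = tft.bucketize(inputs['fare'], num_buckets=10)
    return outputs
```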
- Transform graph: the graph that performs the preprocessing operations (this graph is included in the serving and evaluation models).
- Transformed data: the preprocessed training and evaluation data.
These steps run in an Apache Beam pipeline (see the sketch below): 1. Read in the data using the CSV reader. 2. Clean it using our new MapAndFilterErrors transform. 3. Transform it using a preprocessing pipeline that scales numeric data and converts categorical data from strings to int64 indices by creating a vocabulary for each category. 4. Write out the result as TFRecords for training a model.
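A hedged sketch of those four steps. Paths and the column layout are illustrative; `preprocessing_fn` and `raw_metadata` come from the sketches above, and the simple parse step stands in for the MapAndFilterErrors transform mentioned in the text:

```python
import apache_beam as beam
import tensorflow_transform as tft
import tensorflow_transform.beam as tft_beam

def parse_csv_line(line):
    # Illustrative stand-in for reading/cleaning each CSV line.
    trip_miles, payment_type, fare = line.split(',')
    return {'trip_miles': float(trip_miles),
            'payment_type': payment_type,
            'fare': float(fare)}

with beam.Pipeline() as pipeline:
    with tft_beam.Context(temp_dir='/tmp/tft'):
        # 1. Read the CSV and 2. clean/parse each line.
        raw_data = (
            pipeline
            | 'ReadCSV' >> beam.io.ReadFromText('data/train.csv', skip_header_lines=1)
            | 'Parse' >> beam.Map(parse_csv_line))
        # 3. Analyze (full pass) and transform with the preprocessing_fn.
        (transformed_data, transformed_metadata), transform_fn = (
            (raw_data, raw_metadata)
            | 'AnalyzeAndTransform' >> tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
        # 4. Encode and write out TFRecords for training.
        _ = (
            transformed_data
            | 'Encode' >> beam.Map(
                tft.coders.ExampleProtoCoder(transformed_metadata.schema).encode)
            | 'Write' >> beam.io.WriteToTFRecord('data/train_transformed'))
```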
• TFMA (TensorFlow Model Analysis) is a library for evaluating TensorFlow models. • It evaluates models on large amounts of data in a distributed manner, using the same metrics defined in the trainer. • It combines the power of TensorFlow and Apache Beam to compute and visualize evaluation metrics. • Goal: before deploying a model, ML developers need to evaluate it to ensure that it meets specific quality thresholds and behaves as expected for all relevant slices of data.
Sliced and full-pass metrics
- Analyze at a more granular level: a model may have an acceptable AUC over the entire eval dataset but underperform on specific slices.
- Identify slices where examples may be mislabeled, or where the model over- or under-predicts.
- E.g. does a model that predicts the generosity of a taxi tip work equally well for riders who take the taxi during day hours vs. night hours?
- Streaming metrics (as in TensorBoard) are commonly computed on a mini-batch basis during training.
- TFMA instead uses Apache Beam to do a full pass over the specified evaluation dataset, which is more accurate and scales up to massive evaluation datasets (Beam pipelines can run on distributed processing back-ends). A slicing configuration is sketched below.
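A sketch of how slices might be declared with TFMA (feature, label, and path names are hypothetical, and config details vary by TFMA version):

```python
import tensorflow_model_analysis as tfma

eval_config = tfma.EvalConfig(
    model_specs=[tfma.ModelSpec(label_key='tips')],
    metrics_specs=[tfma.MetricsSpec(
        metrics=[tfma.MetricConfig(class_name='AUC')])],
    slicing_specs=[
        tfma.SlicingSpec(),                                   # overall metrics
        tfma.SlicingSpec(feature_keys=['trip_start_hour']),   # per-hour slices (day vs. night)
    ])

eval_result = tfma.run_model_analysis(
    eval_shared_model=tfma.default_eval_shared_model(
        eval_saved_model_path='serving_model_dir', eval_config=eval_config),
    eval_config=eval_config,
    data_location='data/eval.tfrecord')

# Visualize metrics broken down by slice in a notebook.
tfma.view.render_slicing_metrics(eval_result, slicing_column='trip_start_hour')
```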
Orchestration • Production: TFX uses an orchestrator such as Apache Airflow, Kubeflow Pipelines, or Apache Beam to run a pre-defined pipeline graph of TFX components. • Interactive notebook (local): the notebook itself is the orchestrator, running each TFX component as you execute the notebook cells (see the sketch below). Metadata • Production: MLMD stores metadata properties in a database such as MySQL or SQLite, and stores the metadata payloads in a persistent store such as your filesystem. • Interactive notebook (local): both properties and payloads are stored in an ephemeral SQLite database in the /tmp directory on the Jupyter notebook or Colab server.
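A small sketch of the notebook case (component arguments vary by TFX version; the data path is hypothetical):

```python
from tfx.components import CsvExampleGen, StatisticsGen
from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext

# The notebook is the orchestrator; MLMD lives in an ephemeral SQLite database.
context = InteractiveContext()

example_gen = CsvExampleGen(input_base='data/')
context.run(example_gen)

statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])
context.run(statistics_gen)
```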
Each TFX component has three parts: 1. the driver, which decides whether to perform work or possibly reference a cached artifact; 2. the executor, which does the heavy lifting for the component; 3. the publisher, which writes the outcome to ML Metadata (the metadata store).
Just as you shouldn't write code without version control, you shouldn't train models without lineage tracking. Lineage tracking shows the history and versions of your models, data, and more. Source: https://docs.google.com/presentation/d/1aMmJUz7r7Toky4nGZ1bItWKfXW4kSMbs4cofFeyKE-M/edit?usp=sharing
Kubeflow Pipelines
A pipeline is a description of an ML workflow, including all of the components in the workflow and how they combine in the form of a graph. Kubeflow Pipelines consists of:
1. An engine for scheduling multi-step ML workflows.
2. An SDK for defining and manipulating pipelines and components (see the sketch below).
3. Notebooks for interacting with the system using the SDK.
4. Metadata and artifacts to track the inputs/outputs and the files created by each component.
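A hedged sketch of the SDK side (KFP v1): defining a two-step pipeline graph from toy Python functions and submitting a run (function bodies and the endpoint are hypothetical):

```python
import kfp
from kfp import dsl
from kfp.components import create_component_from_func

def train(epochs: int) -> str:
    """Toy training step: returns a (fake) model path."""
    return f'/tmp/model_after_{epochs}_epochs'

def evaluate(model_path: str) -> None:
    """Toy evaluation step."""
    print(f'Evaluating model at {model_path}')

train_op = create_component_from_func(train, base_image='python:3.8')
evaluate_op = create_component_from_func(evaluate, base_image='python:3.8')

@dsl.pipeline(name='demo-pipeline', description='Train a toy model, then evaluate it.')
def demo_pipeline(epochs: int = 5):
    train_task = train_op(epochs=epochs)
    evaluate_op(model_path=train_task.output)

# Submit a run to a (hypothetical) Kubeflow Pipelines endpoint.
client = kfp.Client(host='http://localhost:8080')
client.create_run_from_pipeline_func(demo_pipeline, arguments={'epochs': 10})
```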
2. Argo:
- builds, schedules, and runs the workflow.
3. Artifact storage:
- Metadata (MySQL): information about executions (runs), models, datasets, and other artifacts.
- Artifacts (MinIO): the files and objects that form the inputs and outputs of the components in your ML workflow, e.g. metrics (time series).
4. Persistence agent (an informer that watches resources):
- records the set of containers that executed, as well as their inputs and outputs.
- each input/output consists of either container parameters or data artifact URIs.
A component is a self-contained set of user code, packaged as a Docker image, that performs one step in the pipeline. Its definition describes the artifacts of the component, the container spec (v1; k8s resource/daemon specs might be added in the future), and the inputs/outputs with their types.
The user code is in the container image. The typical workflow is: 1. define the code, 2. build the image, 3. turn it into KFP components. Ways to create components:
• Turn a Python function into a component directly (see the sketch below).
• Build a Python module and a Dockerfile to build the image, then define the component.
• Use reusable components.
Demo refs: https://github.com/kubeflow/pipelines/tree/master/samples/tutorials/mnist , https://github.com/ChanYiLin/kfp_notebook_example
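A minimal sketch of the first approach, turning a Python function into a component directly (the function itself is just a toy):

```python
from kfp.components import create_component_from_func

def add(a: float, b: float) -> float:
    """A trivial pipeline step: add two numbers."""
    return a + b

# KFP wraps the function in a container op that runs on the given base image.
add_op = create_component_from_func(add, base_image='python:3.8')
```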
Output viewers: a component writes a JSON file specifying metadata for the output viewer(s) you want to use for visualizing the results. The file name must be /mlpipeline-ui-metadata.json. Viewer types include: • Confusion matrix • Markdown • … (see https://www.kubeflow.org/docs/pipelines/sdk/output-viewer/ for the full list). A minimal example is sketched below.
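For example, a component could emit an inline Markdown viewer like this (a sketch; the content is illustrative):

```python
import json

# Metadata telling the Kubeflow Pipelines UI to render an inline Markdown viewer.
metadata = {
    'outputs': [{
        'type': 'markdown',
        'storage': 'inline',
        'source': '# Results\nAccuracy: 0.97',
    }]
}

with open('/mlpipeline-ui-metadata.json', 'w') as f:
    json.dump(metadata, f)
```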
• Code: TensorFlow model training code. • Given a new implementation, a successful CI/CD pipeline deploys a new ML continuous training (CT) pipeline. • Given new data, a successful CT pipeline serves a new model prediction service.