Slide 1

Slide 1 text

When TFX Meets K8s - Kubeflow Pipeline

Slide 2

Slide 2 text

Hello! I am Jack Lin
• InfuseAI Software Engineer
• Kubeflow / tf-operator maintainer
• Kubeflow 2020 GSoC mentor
• ChanYiLin (Jack Lin)

Slide 3

Slide 3 text

Why this talk - March 2020

Slide 4

Slide 4 text

Why this talk: https://cloud.google.com/blog/products/ai-machine-learning/introducing-cloud-ai-platform-pipelines

Slide 5

Slide 5 text

Why this talk - Pipeline - MLOps - Kubernetes - … - XXX

Slide 6

Slide 6 text

Outline ◎Tensorflow Extended (TFX) ◎Tensorflow Extended Pipeline (Platform) ◎Kubeflow Pipeline ◎MLOps using TFX, Kubeflow Pipelines, and Cloud Build ◎GSoC - Kubeflow

Slide 7

Slide 7 text

1. Machine Learning Workflow with Tensorflow Extended Core Lib (low-level) - Tensorflow Data Validation - Tensorflow Transform - Tensorflow Model Analysis - Tensorflow Serving

Slide 8

Slide 8 text

Tensorflow Data Validation (TFDV)
• Data is as important as the model itself: "garbage in, garbage out"
• An open-source library that helps developers automatically analyze, validate, and monitor their ML data at scale
• TFDV is part of the TFX platform; this technology is used to analyze and validate petabytes of data at Google every day
• Data validation:
1. Computing and visualizing descriptive statistics
2. Inferring a schema
3. Validating new data

Slide 9

Slide 9 text

Computing and Visualizing Descriptive Statistics
• Connectors to different data formats
• Under the hood it uses Apache Beam to define and process its data pipelines, so it can use Beam IO connectors and PTransforms to process different formats
• TFDV provides two helper functions, for CSV and for TFRecord of serialized tf.Examples
• Scale: creates an Apache Beam pipeline which can run in
• DirectRunner in the notebook environment
• DataflowRunner on Google Cloud Platform
• Kubeflow Pipelines / Airflow
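As a rough illustration of the kind of per-feature statistics TFDV computes (the real library does this at scale with Beam via tfdv.generate_statistics_from_csv), here is a pure-Python sketch; the sample data and the function name are made up for the example:

```python
import csv
import io
import statistics

def generate_statistics(csv_text):
    """Toy per-feature statistics in the spirit of TFDV's
    generate_statistics_from_csv: counts, missing values, and
    numeric summaries or categorical cardinality."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    stats = {}
    for feature in rows[0]:
        values = [r[feature] for r in rows]
        present = [v for v in values if v != ""]
        entry = {"count": len(values), "missing": len(values) - len(present)}
        try:
            nums = [float(v) for v in present]
            entry.update(mean=statistics.mean(nums), min=min(nums), max=max(nums))
        except ValueError:
            entry["unique"] = len(set(present))  # categorical feature
        stats[feature] = entry
    return stats

data = "age,city\n30,Taipei\n25,\n40,Tokyo\n"
stats = generate_statistics(data)
```

The real TFDV output is far richer (histograms, quantiles, top values) and is what the Facets visualization on the next slide renders.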

Slide 10

Slide 10 text

Computing and Visualizing Descriptive Statistics (visualization powered by Facets)

Slide 11

Slide 11 text

Inferring a Schema
• Based on the statistics, TFDV infers a schema (schema.proto)
• To reflect the stable characteristics of the data
• The schema is used to:
• Do data validation
• Serve as the interface to other components in the TFX ecosystem, e.g. data transformation with TensorFlow Transform

Slide 12

Slide 12 text

Inferring a Schema
Note: schema inference helps developers first author a schema that they can then refine and update manually.
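To make the inference step concrete, here is a toy sketch in the spirit of tfdv.infer_schema (the real function consumes a statistics proto and emits a schema.proto; the dict shapes here are invented for illustration):

```python
def infer_schema(stats):
    """Toy schema inference from per-feature statistics: numeric
    features become FLOAT, others become BYTES, and features with
    missing values are marked optional."""
    schema = {}
    for feature, s in stats.items():
        if "mean" in s:  # numeric statistics were computed
            schema[feature] = {"type": "FLOAT"}
        else:            # categorical feature
            schema[feature] = {
                "type": "BYTES",
                "presence": "optional" if s["missing"] else "required",
            }
    return schema

stats = {
    "age":  {"count": 3, "missing": 0, "mean": 31.7, "min": 25.0, "max": 40.0},
    "city": {"count": 3, "missing": 1, "unique": 2},
}
schema = infer_schema(stats)
```

As the slide notes, the inferred schema is a starting point that developers then refine by hand (e.g. tightening domains or presence constraints).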

Slide 13

Slide 13 text

Validating New Data - validation report

Slide 14

Slide 14 text

Case 1. Validation of Continuously Arriving Data

Slide 15

Slide 15 text


Slide 16

Slide 16 text

Case 2. Training/Serving Skew Detection
• TFDV can compute statistics of serving logs and perform validation against the schema
• taking into account any expected differences between training and serving data (training data has labels while serving logs do not)
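The core of skew detection is comparing the training and serving distributions of a feature; for categorical features TFDV uses the L-infinity distance between the normalized value distributions and flags an anomaly when it exceeds a configured threshold. A minimal sketch of that comparison (the data and threshold are made up):

```python
from collections import Counter

def linf_skew(train_values, serving_values):
    """L-infinity distance between the normalized value distributions
    of a categorical feature in training vs serving data."""
    t, s = Counter(train_values), Counter(serving_values)
    keys = set(t) | set(s)
    return max(abs(t[k] / len(train_values) - s[k] / len(serving_values))
               for k in keys)

train = ["a"] * 8 + ["b"] * 2      # 80% / 20%
serving = ["a"] * 5 + ["b"] * 5    # 50% / 50%
skew = linf_skew(train, serving)   # max(|0.8-0.5|, |0.2-0.5|) = 0.3
THRESHOLD = 0.1                    # hypothetical threshold from the schema
anomaly = skew > THRESHOLD
```

In real TFDV the threshold lives in the schema (a skew_comparator per feature), and validation produces the anomaly report shown in these slides.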

Slide 17

Slide 17 text

1. Machine Learning Workflow with Tensorflow Extended - Tensorflow Data Validation - Tensorflow Transform - Tensorflow Model Analysis - Tensorflow Serving

Slide 18

Slide 18 text

Tensorflow Transform
• Allows users to define preprocessing pipelines
• Run them using large-scale data processing frameworks:
• Local (Airflow)
• Apache Beam
• Dataflow (Google Cloud service)
• Kubeflow Pipelines
• Exports the pipeline in a way that can be run as part of a TensorFlow graph (so it can be used in both training and serving)
• TFT requires the specification of a schema to parse the data into tensors. Users can
1. manually specify a schema (by specifying the type of each feature), or
2. use the TFDV-inferred schema, which radically simplifies the use of TFT

Slide 19

Slide 19 text

Let's talk about Feature Engineering first
• Feature engineering
• Turn raw data into features
• Data preprocessing and feature construction
• Feature crosses
• Tensorflow Transform
• Usually the most time-consuming part of the ML workflow

Slide 20

Slide 20 text

Feature Engineering - Turn raw data into features
• Numerical data => use it as a feature directly
• Remove inappropriate data, e.g. IDs
• Categorical features?

Slide 21

Slide 21 text

Feature Engineering - Turn raw data into features
• Categorical features
• one-hot encoding
• hash encoding
• vocabulary mapping
• …
• E.g. Street_Name: {'Charleston Road', 'North Shoreline Boulevard', 'Shorebird Way', 'Rengstorff Avenue'} - only several streets appear in the data
• Method 1 (X):
• map Charleston Road to 0
• map North Shoreline Boulevard to 1
• map Shorebird Way to 2
• map Rengstorff Avenue to 3
• map everything else (OOV) to 4
• implies an ordering that does not exist
• a data point might contain multiple streets at the same time
• Method 2 (O): one-hot encoding
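A small sketch of Method 2 using the street vocabulary from the slide: one-hot encoding with a final out-of-vocabulary bucket, plus a multi-hot variant for the "multiple streets in one data point" case (the helper names are made up for the example):

```python
def one_hot(street, vocab):
    """Method 2: one-hot encode with a trailing OOV bucket, avoiding
    the spurious ordering that Method 1's plain integer mapping implies."""
    vec = [0] * (len(vocab) + 1)  # +1 slot for out-of-vocabulary
    vec[vocab.index(street) if street in vocab else len(vocab)] = 1
    return vec

def multi_hot(streets, vocab):
    """When a data point contains multiple streets at once,
    an element-wise OR of the one-hot vectors gives a multi-hot encoding."""
    return [max(bits) for bits in zip(*(one_hot(s, vocab) for s in streets))]

vocab = ["Charleston Road", "North Shoreline Boulevard",
         "Shorebird Way", "Rengstorff Avenue"]
```

For example, one_hot("Shorebird Way", vocab) sets only the third slot, and any unseen street falls into the fifth (OOV) slot.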

Slide 22

Slide 22 text

Feature Engineering - Turn raw data into features

Slide 23

Slide 23 text

Feature Engineering - Feature Crosses (feature combinations)
• Linear
• Nonlinear
With the new feature: x3 = x1x2
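The classic illustration of why the crossed feature x3 = x1x2 helps: XOR-like data in the four quadrants is not linearly separable on x1, x2 alone, but a threshold on x3 separates it perfectly (the toy points and labels below are invented for the example):

```python
# Four points, one per quadrant; the class depends on whether
# x1 and x2 have the same sign - not linearly separable in (x1, x2).
points = [(1, 1), (-1, -1), (1, -1), (-1, 1)]
labels = [1, 1, 0, 0]

def predict(x1, x2):
    x3 = x1 * x2           # the feature cross from the slide
    return 1 if x3 > 0 else 0  # a single linear threshold on x3 suffices

preds = [predict(x1, x2) for x1, x2 in points]
```

With the cross added, a plain linear model over (x1, x2, x3) can learn this boundary.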

Slide 24

Slide 24 text

When to do Feature Engineering
• Data preprocessing
• Dataflow
• tf.transform
• Feature generation
• Dataflow
• Model training
• input_fn

Slide 25

Slide 25 text

tf.transform
• A library for preprocessing input data for TensorFlow
• Creates features that require a full pass over the training dataset, e.g.:
• Normalize an input value by using the mean and standard deviation (normalization)
• Convert strings to integers by generating a vocabulary over all of the input values (vocabulary mapping)
• Convert floats to integers by assigning them to buckets, based on the observed data distribution (bucketing)
• TensorFlow => operates on a single example or a batch of examples
• tf.Transform => extends these capabilities to support full passes over the entire training dataset (where we need Apache Beam to help)
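The "full pass" idea can be sketched without the library: an analyze phase scans the whole training set once to compute constants, and a per-example transform phase then uses them, much like tft.scale_to_z_score. This is a pure-Python sketch of that split, not the tf.Transform API itself:

```python
import statistics

def analyze(dataset):
    """Full pass over the training data to compute constants
    (the analyze phase, which Beam runs at scale)."""
    return {"mean": statistics.mean(dataset),
            "std": statistics.pstdev(dataset)}

def transform(value, constants):
    """Per-example transform using the precomputed constants; this is
    the part that can be embedded in the serving graph so training and
    serving apply identical preprocessing."""
    return (value - constants["mean"]) / constants["std"]

train = [1.0, 2.0, 3.0, 4.0, 5.0]
constants = analyze(train)
z = [transform(v, constants) for v in train]
```

In real tf.Transform, analyze produces the transform graph shown on the next slide, so serving never recomputes the statistics.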

Slide 26

Slide 26 text

tf.transform
• Analyze and Transform
• Outputs:
- Transform graph: the graph that can perform the preprocessing operations (this graph will be included in the serving and evaluation models)
- Transformed data: the preprocessed training and evaluation data

Slide 27

Slide 27 text

tf.transform
• Define our features and schema
• Define the preprocess function
• Categorical features
• Normalize, scaling

Slide 28

Slide 28 text

tf.transform
Now we're ready to start transforming our data in an Apache Beam pipeline:
1. Read in the data using the CSV reader
2. Clean it using our new MapAndFilterErrors transform
3. Transform it using a preprocessing pipeline that scales numeric data and converts categorical data from strings to int64 indices, by creating a vocabulary for each category
4. Write out the result as a TFRecord for training a model

Slide 29

Slide 29 text

tf.transform
Used in the training input_fn(), which returns
transformed_features: feature data (training data)
transformed_labels: targets (predictions)

Slide 30

Slide 30 text

Transform to Training Ref: https://www.tensorflow.org/tfx/tutorials/transform/census

Slide 31

Slide 31 text

1. Machine Learning Workflow with Tensorflow Extended - Tensorflow Data Validation - Tensorflow Transform - Tensorflow Model Analysis - Tensorflow Serving

Slide 32

Slide 32 text

Tensorflow Model Analysis
• A library for evaluating TensorFlow models
• Evaluates models on large amounts of data in a distributed manner, using the same metrics defined in their trainer
• Combines the power of TensorFlow and Apache Beam to compute and visualize evaluation metrics
• Goal: before deploying, ML developers need to evaluate the model to ensure that it meets specific quality thresholds and behaves as expected for all relevant slices of data

Slide 33

Slide 33 text

Tensorflow Model Analysis
During training vs after training
One model vs multiple models over time

Slide 34

Slide 34 text

Tensorflow Model Analysis
Aggregate vs sliced metrics
- Analyze at a more granular level
- A model may have an acceptable AUC over the entire eval dataset, but underperform on specific slices
- Identify slices where examples may be mislabeled, or where the model over- or under-predicts
- E.g. whether a model that predicts the generosity of a taxi tip works equally well for riders that take the taxi during day hours vs night hours
Streaming vs full-pass metrics
- Streaming metrics: TensorBoard metrics are commonly computed on a mini-batch basis during training
- TFMA uses Apache Beam to do a full pass over the specified evaluation dataset
• more accurate
• scales up to massive evaluation datasets (Beam pipelines can be run using distributed processing back-ends)
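The taxi-tip example above boils down to grouping evaluation examples by a slice key and computing the metric per group. A minimal sketch of sliced accuracy (the ride records are made up; real TFMA does this as a full-pass Beam job over the eval dataset):

```python
from collections import defaultdict

def sliced_accuracy(examples, slice_key):
    """Accuracy per slice: the granular view TFMA provides on top of
    the single aggregate number."""
    totals = defaultdict(lambda: [0, 0])  # slice value -> [correct, count]
    for ex in examples:
        bucket = totals[ex[slice_key]]
        bucket[0] += ex["label"] == ex["prediction"]
        bucket[1] += 1
    return {k: correct / count for k, (correct, count) in totals.items()}

rides = [
    {"hour": "day",   "label": 1, "prediction": 1},
    {"hour": "day",   "label": 0, "prediction": 0},
    {"hour": "night", "label": 1, "prediction": 0},
    {"hour": "night", "label": 0, "prediction": 0},
]
acc = sliced_accuracy(rides, "hour")
```

Here the aggregate accuracy is 0.75, which hides that the day slice scores 1.0 while the night slice scores only 0.5 - exactly the underperformance sliced metrics are meant to surface.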

Slide 35

Slide 35 text

Tensorflow Model Analysis

Slide 36

Slide 36 text

Tensorflow Model Analysis

Slide 37

Slide 37 text

1. Machine Learning Workflow with Tensorflow Extended - Tensorflow Data Validation - Tensorflow Transform - Tensorflow Model Analysis - Tensorflow Serving

Slide 38

Slide 38 text

Tensorflow Serving
tensorflow_model_server --port=8500 --rest_api_port=8501 --model_name=${MODEL_NAME} --model_base_path=${MODEL_BASE_PATH}/${MODEL_NAME}
https://chanyilin.github.io/kubeflow-e2e-tutorial.html
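The command above opens gRPC on port 8500 and the REST API on port 8501. A sketch of building a REST predict request against that server; the model name is a placeholder, and the actual HTTP call is left commented out since it only works with a running server:

```python
import json

MODEL_NAME = "my_model"  # placeholder for ${MODEL_NAME} above
url = f"http://localhost:8501/v1/models/{MODEL_NAME}:predict"

# TF Serving's REST predict API expects a JSON body with "instances"
# (one entry per example to score).
payload = json.dumps({"instances": [[1.0], [2.0], [5.0]]})

# To actually send it against a running tensorflow_model_server:
# import urllib.request
# req = urllib.request.Request(
#     url, payload.encode(), {"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read())  # {"predictions": [...]}
```

The response is a JSON object whose "predictions" list is parallel to the submitted instances.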

Slide 39

Slide 39 text

2. Tensorflow Extended Pipeline

Slide 40

Slide 40 text

source: https://docs.google.com/presentation/d/1aMmJUz7r7Toky4nGZ1bItWKfXW4kSMbs4cofFeyKE-M/edit?usp=sharing

Slide 41

Slide 41 text

source: https://docs.google.com/presentation/d/1aMmJUz7r7Toky4nGZ1bItWKfXW4kSMbs4cofFeyKE-M/edit?usp=sharing

Slide 42

Slide 42 text

source: https://docs.google.com/presentation/d/1aMmJUz7r7Toky4nGZ1bItWKfXW4kSMbs4cofFeyKE-M/edit?usp=sharing

Slide 43

Slide 43 text

Tensorflow Extended Pipeline
Orchestration
• Production: you will use an orchestrator such as Apache Airflow, Kubeflow Pipelines, or Apache Beam to orchestrate a pre-defined pipeline graph of TFX components.
• Interactive notebook (local): the notebook itself is the orchestrator, running each TFX component as you execute the notebook cells.
Metadata
• Production: MLMD stores metadata properties in a database such as MySQL or SQLite, and stores the metadata payloads in a persistent store such as your filesystem.
• Interactive notebook (local): both properties and payloads are stored in an ephemeral SQLite database in the /tmp directory on the Jupyter notebook or Colab server.

Slide 44

Slide 44 text

TFDV → TF Transform → TF Model Analysis → TF Model Serving
source: https://docs.google.com/presentation/d/1aMmJUz7r7Toky4nGZ1bItWKfXW4kSMbs4cofFeyKE-M/edit?usp=sharing

Slide 45

Slide 45 text


Slide 46

Slide 46 text

more details: https://www.tensorflow.org/tfx/tutorials/tfx/components#evaluator
[Tensorflow Data Validation]
stats = tfdv.generate_statistics_from_csv(input_data)
schema = tfdv.infer_schema(stats)
[Tensorflow Transform] …
[Tensorflow Model Analysis] …
source: https://docs.google.com/presentation/d/1aMmJUz7r7Toky4nGZ1bItWKfXW4kSMbs4cofFeyKE-M/edit?usp=sharing

Slide 47

Slide 47 text

Config Components
1. A driver, which decides whether to perform work or possibly reference a cached artifact
2. An executor, which does the heavy lifting for the component
3. A publisher, which writes the outcome to ML Metadata (the metadata store)

Slide 48

Slide 48 text

Lineage tracking: Just like you wouldn't code without version control, you shouldn't train models without lineage tracking. Lineage tracking shows the history and versions of your models, data, and more.
source: https://docs.google.com/presentation/d/1aMmJUz7r7Toky4nGZ1bItWKfXW4kSMbs4cofFeyKE-M/edit?usp=sharing

Slide 49

Slide 49 text


Slide 50

Slide 50 text

ML Metadata (part of TFX) https://www.kubeflow.org/docs/components/metadata/

Slide 51

Slide 51 text

Metadata
• Data versioning
• Model versioning
• Training metric versioning

Slide 52

Slide 52 text

Artifacts
The real outputs or logs of every step in the pipeline

Slide 53

Slide 53 text

TFX on Kubeflow Pipeline https://github.com/kubeflow/pipelines/blob/master/samples/core/parameterized_tfx_oss/taxi_pipeline_notebook.ipynb

Slide 54

Slide 54 text

TFX on Airflow

Slide 55

Slide 55 text

3. Kubeflow Pipelines

Slide 56

Slide 56 text

Kubeflow Pipeline
Def: A pipeline is a description of an ML workflow, including all of the components in the workflow and how they combine in the form of a graph.
1. An engine for scheduling multi-step ML workflows
2. An SDK for defining and manipulating pipelines and components
3. Notebooks for interacting with the system using the SDK
4. Metadata and artifacts to track the inputs/outputs and the files created by each component

Slide 57

Slide 57 text

Kubeflow Pipeline architecture
1. Pipeline frontend/backend
2. ScheduledWorkflow / Argo:
- builds, schedules, and runs the workflow
3. Artifact storage:
- Metadata (MySQL): information about executions (runs), models, datasets, and other artifacts
- Artifacts (MinIO): the files and objects that form the inputs and outputs of the components in your ML workflow, e.g. metrics (time series)
4. Persistence agent (an informer that watches resources):
- records the set of containers that executed as well as their inputs and outputs
- input/output consists of either container parameters or data artifact URIs

Slide 58

Slide 58 text

Components and Pipeline
• Components: A pipeline component is a self-contained set of user code, packaged as a Docker image, that performs one step in the pipeline.
• A component spec declares: the artifacts of the component, its inputs/outputs and their types, and a container v1 spec (might add k8sresource/daemon… in the future)

Slide 59

Slide 59 text

Define Components and Run a Pipeline
Key point: the execution logic is in the container image.
1. Define the code
2. Build the image
3. Turn it into KFP components
• Turn a Python function into a component directly
• Build a Python module and a Dockerfile to build the image, then define the component
• Use reusable components
Demo
Ref: https://github.com/kubeflow/pipelines/tree/master/samples/tutorials/mnist https://github.com/ChanYiLin/kfp_notebook_example

Slide 60

Slide 60 text

Data passing
• Small data
• Bigger data (files)
Demo
https://github.com/kubeflow/pipelines/blob/master/samples/tutorials/Data passing in python components.ipynb https://github.com/ChanYiLin/kfp_notebook_example
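The two passing styles on this slide can be sketched without the KFP SDK: small data is passed by value (serialized into the component's arguments), while bigger data is written to a file whose path the orchestrator wires between components, which is how KFP's InputPath/OutputPath behave. A pure-Python sketch of the idea (paths and values invented for illustration):

```python
import json
import os
import tempfile

# Small data: passed by value, serialized as a string/JSON argument
# in the component spec.
small = json.dumps({"learning_rate": 0.01})

# Bigger data: an upstream component writes a file to a path the
# orchestrator provides...
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "dataset.txt")
with open(out_path, "w") as f:
    f.write("row1\nrow2\n")

# ...and the downstream component receives the path, not the content.
with open(out_path) as f:
    rows = f.read().splitlines()
```

In a real pipeline the artifact store (MinIO, in the architecture slide earlier) sits behind those paths so data survives across containers.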

Slide 61

Slide 61 text

Metadata and Artifacts
• The pipeline component must write a JSON file specifying metadata for the output viewer(s) that you want to use for visualizing the results. The file name must be /mlpipeline-ui-metadata.json
• Type: Confusion matrix
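A sketch of what a component might write for the confusion-matrix viewer; the metadata shape follows KFP's documented output-viewer format, but the source path and labels are placeholders, and the file is written to a temp directory here (inside a real component it must be /mlpipeline-ui-metadata.json):

```python
import json
import os
import tempfile

# Viewer metadata: the KFP UI reads this to render a confusion matrix
# from a CSV the component wrote elsewhere (the source path below is
# a placeholder).
metadata = {
    "outputs": [{
        "type": "confusion_matrix",
        "format": "csv",
        "schema": [
            {"name": "target", "type": "CATEGORY"},
            {"name": "predicted", "type": "CATEGORY"},
            {"name": "count", "type": "NUMBER"},
        ],
        "source": "minio://mlpipeline/artifacts/confusion_matrix.csv",
        "labels": ["cat", "dog"],
    }]
}

# Written to a temp dir so this sketch runs anywhere; in the container
# the path must be exactly /mlpipeline-ui-metadata.json.
out_path = os.path.join(tempfile.mkdtemp(), "mlpipeline-ui-metadata.json")
with open(out_path, "w") as f:
    json.dump(metadata, f)

with open(out_path) as f:
    loaded = json.load(f)
```

The same "outputs" list carries the other viewer types on the following slides (markdown, tables, ROC curves, and so on).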

Slide 62

Slide 62 text

Metadata and Artifacts
• The pipeline component must write a JSON file specifying metadata for the output viewer(s) that you want to use for visualizing the results. The file name must be /mlpipeline-ui-metadata.json
• Type: Markdown

Slide 63

Slide 63 text

Metadata and Artifacts
• The pipeline component must write a JSON file specifying metadata for the output viewer(s) that you want to use for visualizing the results. The file name must be /mlpipeline-ui-metadata.json
• Type:
https://www.kubeflow.org/docs/pipelines/sdk/output-viewer/

Slide 64

Slide 64 text

4. MLOps using TFX, Kubeflow Pipelines, and Cloud Build https://cloud.google.com/solutions/machine-learning/architecture-for-mlops-using-tfx-kubeflow-pipelines-and-cloud-b

Slide 65

Slide 65 text

CI/CD pipeline compared to CT pipeline
- TFX Transform code
- Tensorflow model training code
• Given a new implementation, a successful CI/CD pipeline deploys a new ML CT pipeline.
• Given new data, a successful CT pipeline serves a new model prediction service.

Slide 66

Slide 66 text

Model Validation - Connect to Slack
Do the training and validation periodically, like CI/CD. You can see the validation results on Slack to decide whether to update the model or not.

Slide 67

Slide 67 text


Slide 68

Slide 68 text


Slide 69

Slide 69 text

Use case at Spotify

Slide 70

Slide 70 text

5. Kubeflow GSoC
https://docs.google.com/document/d/1AQDD9s8VpCf3y8OLKTBSMgDzHSjdsV_DOyL5dc-XCOQ/edit# https://www.kubeflow.org/docs/about/gsoc/

Slide 71

Slide 71 text

No content