
TFX and Kubeflow Pipeline Tutorial

Jack
March 21, 2020


This is a talk given at the Cloud Native Taiwan User Group.


Transcript

  1. Hello! I am Jack Lin • InfuseAI Software Engineer • Kubeflow / tf-operator maintainer •
     Kubeflow 2020 GSoC mentor • GitHub: ChanYiLin
  2. Outline ◎ TensorFlow Extended (TFX) ◎ TensorFlow Extended Pipeline (Platform) ◎ Kubeflow
     Pipelines ◎ MLOps using TFX, Kubeflow Pipelines, and Cloud Build ◎ GSoC - Kubeflow
  3. 1. Machine Learning Workflow with TensorFlow Extended • Core libraries (low-level):
     - TensorFlow Data Validation - TensorFlow Transform - TensorFlow Model Analysis - TensorFlow Serving
  4. TensorFlow Data Validation (TFDV) • Data is as important as the model itself • "garbage in,
     garbage out" • An open-source library that helps developers automatically analyze, validate, and monitor their ML data at scale • TFDV is part of the TFX platform, and this technology is used to analyze and validate petabytes of data at Google every day • Data validation: 1. computing and visualizing descriptive statistics 2. inferring a schema 3. validating new data (see the sketch below)
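The three validation steps above map onto a handful of TFDV calls. A minimal sketch (file paths are placeholders):

```python
import tensorflow_data_validation as tfdv

# 1. Compute and visualize descriptive statistics over the training data.
train_stats = tfdv.generate_statistics_from_csv(data_location='train.csv')
tfdv.visualize_statistics(train_stats)

# 2. Infer a schema from the statistics.
schema = tfdv.infer_schema(statistics=train_stats)
tfdv.display_schema(schema)

# 3. Validate new data (e.g. an eval split) against the schema.
eval_stats = tfdv.generate_statistics_from_csv(data_location='eval.csv')
anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)
tfdv.display_anomalies(anomalies)
```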
  5. Computing and Visualizing Descriptive Statistics • Connectors to different data formats •
     Under the hood it uses Apache Beam to define and process its data pipelines, so you can use Beam IO connectors and PTransforms to handle other formats • TFDV provides two helper functions, for CSV and for TFRecord files of serialized tf.Examples • Scale: TFDV creates an Apache Beam pipeline which can run on • DirectRunner in a notebook environment • DataflowRunner on Google Cloud Platform • Kubeflow Pipelines / Airflow
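To run the same statistics computation at scale, you can hand TFDV a set of Beam pipeline options. A sketch assuming a hypothetical GCP project and bucket:

```python
import tensorflow_data_validation as tfdv
from apache_beam.options.pipeline_options import (
    GoogleCloudOptions, PipelineOptions, StandardOptions)

options = PipelineOptions()
options.view_as(StandardOptions).runner = 'DataflowRunner'
gcp = options.view_as(GoogleCloudOptions)
gcp.project = 'my-gcp-project'            # placeholder project id
gcp.region = 'us-central1'
gcp.temp_location = 'gs://my-bucket/tmp'  # placeholder bucket

# Same API as the local (DirectRunner) case; only the runner changes.
stats = tfdv.generate_statistics_from_csv(
    data_location='gs://my-bucket/data/*.csv',
    output_path='gs://my-bucket/stats/train_stats.tfrecord',
    pipeline_options=options)
```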
  6. Inferring a Schema • Based on the statistics, TFDV infers a schema (schema.proto) • The
     schema reflects the stable characteristics of the data • Uses of the schema: • data validation • the schema format is also the interface to other components in the TFX ecosystem, e.g. data transformation with TensorFlow Transform
  7. Inferring a Schema • Note: schema inference helps developers author an initial schema that
     they can then refine and update manually.
  8. (image-only slide)

  9. Case 2: Training/Serving Skew Detection • TFDV can compute statistics over serving logs and
     validate them against the schema, taking into account expected differences between training and serving data (training data has a label while serving logs do not); see the sketch below
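A minimal sketch of expressing this in TFDV with schema environments; the 'tips' label feature follows the Chicago taxi example and is an assumption here:

```python
import tensorflow_data_validation as tfdv

# Declare the two environments and mark the label as absent from serving.
schema.default_environment.append('TRAINING')
schema.default_environment.append('SERVING')
tfdv.get_feature(schema, 'tips').not_in_environment.append('SERVING')

# Validate serving logs against the schema in the SERVING environment.
serving_stats = tfdv.generate_statistics_from_csv(data_location='serving_logs.csv')
anomalies = tfdv.validate_statistics(serving_stats, schema, environment='SERVING')
tfdv.display_anomalies(anomalies)
```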
  10. 1. Machine Learning Workflow with TensorFlow Extended - TensorFlow Data Validation -
     TensorFlow Transform - TensorFlow Model Analysis - TensorFlow Serving
  11. TensorFlow Transform • Allows users to define preprocessing pipelines • Runs them using
     large-scale data-processing frameworks: • local (Airflow) • Apache Beam • Dataflow (Google Cloud service) • Kubeflow Pipelines • Exports the pipeline in a way that can be run as part of a TensorFlow graph (so the same transforms are used in training and serving) • TFT requires a schema to parse the data into tensors • Users can 1. manually specify a schema (by giving the type of each feature), or 2. use the TFDV-inferred schema, which radically simplifies the use of TFT
  12. Let's talk about feature engineering first • Feature engineering: turning raw data into
     features • Data preprocessing and feature construction • Feature crosses • TensorFlow Transform • Usually the most time-consuming part of an ML workflow
  13. Feature Engineering - Turning raw data into features • Numerical data → use it as a feature
     directly • Remove inappropriate data, e.g. IDs • Categorical features → ?
  14. Feature Engineering - Turning raw data into features • Categorical features: • one-hot
     encoding • hash encoding • vocabulary mapping • … • E.g. Street_Name: {'Charleston Road', 'North Shoreline Boulevard', 'Shorebird Way', 'Rengstorff Avenue'}, where only a handful of streets appear in the data • Method 1 (bad): map Charleston Road to 0, North Shoreline Boulevard to 1, Shorebird Way to 2, Rengstorff Avenue to 3, and everything else (OOV) to 4 • imposes an ordering that has no meaning • a record might contain multiple streets at the same time • Method 2 (good): one-hot encoding (see the sketch below)
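A minimal TensorFlow sketch of method 2: map street names through a vocabulary table with an out-of-vocabulary (OOV) bucket, then one-hot encode the resulting indices:

```python
import tensorflow as tf

streets = ['Charleston Road', 'North Shoreline Boulevard',
           'Shorebird Way', 'Rengstorff Avenue']

# Vocabulary table: known streets -> 0..3, everything else -> OOV bucket (4).
table = tf.lookup.StaticVocabularyTable(
    tf.lookup.KeyValueTensorInitializer(
        keys=streets, values=tf.range(len(streets), dtype=tf.int64)),
    num_oov_buckets=1)

ids = table.lookup(tf.constant(['Shorebird Way', 'Main Street']))  # -> [2, 4]
one_hot = tf.one_hot(ids, depth=len(streets) + 1)  # no spurious ordering
```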
  15. Feature Engineering - Feature Crosses (feature combinations) • Linear vs. nonlinear • A
     linear model cannot separate nonlinear data on its own, but it can with the new crossed feature x3 = x1 * x2
  16. When to do feature engineering • Data preprocessing • Dataflow • tf.Transform • Feature
     generation • Dataflow • Model training • input_fn
  17. tf.Transform • A library for preprocessing input data for TensorFlow • Creates features that
     require a full pass over the training dataset, e.g.: • normalize an input value using the mean and standard deviation (normalization) • convert strings to integers by generating a vocabulary over all of the input values (vocabulary mapping) • convert floats to integers by assigning them to buckets, based on the observed data distribution (bucketization) • TensorFlow operates on a single example or a batch of examples • tf.Transform extends these capabilities to support full passes over the entire training dataset, which is where Apache Beam helps (see the preprocessing_fn sketch below)
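A minimal preprocessing_fn sketch using tf.Transform's full-pass analyzers; the feature names (fare, payment_type, trip_seconds) are illustrative assumptions:

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    """Map raw feature columns to transformed features."""
    return {
        # Full pass: scale with the dataset-wide mean and stddev.
        'fare_scaled': tft.scale_to_z_score(inputs['fare']),
        # Full pass: build a vocabulary and map strings to integer indices.
        'payment_type_id': tft.compute_and_apply_vocabulary(inputs['payment_type']),
        # Full pass: bucketize by the observed data distribution.
        'trip_seconds_bucket': tft.bucketize(inputs['trip_seconds'], num_buckets=10),
    }
```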
  18. tf.Transform • Analyze and Transform • Outputs: - the transform graph, which can perform
     the preprocessing operations (this graph will be included in the serving and evaluation models) - the transformed data, i.e. the preprocessed training and evaluation data
  19. tf.Transform • Now we're ready to start transforming our data in an Apache Beam pipeline:
     1. read in the data using the CSV reader 2. clean it using our new MapAndFilterErrors transform 3. transform it using a preprocessing pipeline that scales numeric data and converts categorical data from strings to int64 value indices, by creating a vocabulary for each category 4. write out the result as TFRecords for training a model (sketched below)
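A condensed sketch of that Beam pipeline; MapAndFilterErrors, parse_csv_line, raw_metadata, and encode_example stand in for helpers defined elsewhere in the tutorial (assumptions here):

```python
import apache_beam as beam
import tensorflow_transform.beam as tft_beam

with beam.Pipeline() as pipeline:
    with tft_beam.Context(temp_dir='/tmp/tft'):
        # 1. Read and 2. clean the raw CSV rows.
        raw_data = (
            pipeline
            | 'ReadCSV' >> beam.io.ReadFromText('train.csv', skip_header_lines=1)
            | 'Clean' >> MapAndFilterErrors(parse_csv_line))

        # 3. Analyze (full pass) and transform in one step.
        (transformed_data, _), transform_fn = (
            (raw_data, raw_metadata)  # raw_metadata is built from the schema
            | 'AnalyzeAndTransform'
            >> tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))

        # 4. Write the transformed examples out for training.
        _ = (transformed_data
             | 'Encode' >> beam.Map(encode_example)
             | 'WriteTFRecord' >> beam.io.WriteToTFRecord('train_transformed'))
```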
  20. 1. Machine Learning Workflow with TensorFlow Extended - TensorFlow Data Validation -
     TensorFlow Transform - TensorFlow Model Analysis - TensorFlow Serving
  21. TensorFlow Model Analysis • A library for evaluating TensorFlow models • Evaluates models
     on large amounts of data in a distributed manner, using the same metrics defined in the trainer • Combines the power of TensorFlow and Apache Beam to compute and visualize evaluation metrics • Goal: before deploying a model, ML developers need to evaluate it to ensure that it meets specific quality thresholds and behaves as expected for all relevant slices of data
  22. TensorFlow Model Analysis • During training vs. after training • One model vs. multiple
     models over time
  23. TensorFlow Model Analysis • Aggregate vs. sliced metrics - analyze at a more granular
     level - a model may have an acceptable AUC over the entire eval dataset, but underperform on specific slices - identify slices where examples may be mislabeled, or where the model over- or under-predicts - e.g. whether a model that predicts the generosity of a taxi tip works equally well for riders who take the taxi during day hours vs. night hours • Streaming vs. full-pass metrics - streaming metrics (what TensorBoard shows) are commonly computed on a mini-batch basis during training - TFMA uses Apache Beam to do a full pass over the specified evaluation dataset, which is • more accurate • able to scale to massive evaluation datasets (Beam pipelines can run on distributed processing back-ends) • See the slicing sketch below
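A minimal TFMA sketch computing aggregate and sliced metrics; the label and slicing feature names ('tips', 'trip_start_hour') follow the taxi example and, like the model path, are assumptions:

```python
import tensorflow_model_analysis as tfma

eval_config = tfma.EvalConfig(
    model_specs=[tfma.ModelSpec(label_key='tips')],
    metrics_specs=[tfma.MetricsSpec(metrics=[
        tfma.MetricConfig(class_name='AUC'),
        tfma.MetricConfig(class_name='ExampleCount')])],
    slicing_specs=[
        tfma.SlicingSpec(),                                  # aggregate metrics
        tfma.SlicingSpec(feature_keys=['trip_start_hour']),  # per-hour slices
    ])

eval_result = tfma.run_model_analysis(
    eval_shared_model=tfma.default_eval_shared_model(
        eval_saved_model_path='serving_model_dir', eval_config=eval_config),
    eval_config=eval_config,
    data_location='eval_data.tfrecord')

# Render per-slice metrics in a notebook environment.
tfma.view.render_slicing_metrics(eval_result, slicing_column='trip_start_hour')
```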
  24. 1. Machine Learning Workflow with TensorFlow Extended - TensorFlow Data Validation -
     TensorFlow Transform - TensorFlow Model Analysis - TensorFlow Serving
  25. TensorFlow Extended Pipeline • Orchestration • Production: you will use an orchestrator
     such as Apache Airflow, Kubeflow Pipelines, or Apache Beam to orchestrate a pre-defined pipeline graph of TFX components • Interactive notebook (local): the notebook itself is the orchestrator, running each TFX component as you execute the notebook cells • Metadata • Production: MLMD stores metadata properties in a database such as MySQL or SQLite, and stores the metadata payloads in a persistent store such as your filesystem • Interactive notebook (local): both properties and payloads are stored in an ephemeral SQLite database in the /tmp directory on the Jupyter notebook or Colab server
  26. TFDV → TF Transform → TF Model Analysis → TF Model Serving (pipeline diagram)
     source: https://docs.google.com/presentation/d/1aMmJUz7r7Toky4nGZ1bItWKfXW4kSMbs4cofFeyKE-M/edit?usp=sharing
  27. (image-only slide)

  28. More details: https://www.tensorflow.org/tfx/tutorials/tfx/components#evaluator
     [TensorFlow Data Validation] stats = tfdv.generate_statistics_from_csv(input_data) schema = tfdv.infer_schema(stats) [TensorFlow Transform] … [TensorFlow Model Analysis] … source: https://docs.google.com/presentation/d/1aMmJUz7r7Toky4nGZ1bItWKfXW4kSMbs4cofFeyKE-M/edit?usp=sharing
  29. Config + Components • Each component consists of: 1. a driver, which decides whether to
     perform work or reference a cached artifact 2. an executor, which does the heavy lifting for the component 3. a publisher, which writes the outcome to ML Metadata (the metadata store)
  30. Lineage tracking: just like you wouldn't code without version control, you shouldn't train
     models without lineage tracking. Lineage tracking shows the history and versions of your models, data, and more. source: https://docs.google.com/presentation/d/1aMmJUz7r7Toky4nGZ1bItWKfXW4kSMbs4cofFeyKE-M/edit?usp=sharing
  31. (image-only slide)

  32. Kubeflow Pipelines • Definition: a pipeline is a description of an ML workflow, including
     all of the components in the workflow and how they combine in the form of a graph • Kubeflow Pipelines provides: 1. an engine for scheduling multi-step ML workflows 2. an SDK for defining and manipulating pipelines and components 3. notebooks for interacting with the system using the SDK 4. metadata and artifacts to track the inputs/outputs and files created by each component
  33. Kubeflow Pipelines architecture 1. Pipeline frontend/backend 2. Scheduled workflows /
     Argo: build, schedule, and run the workflow 3. Artifact storage: - Metadata (MySQL): information about executions (runs), models, datasets, and other artifacts - Artifacts (MinIO): the files and objects that form the inputs and outputs of the components in your ML workflow, e.g. metrics (time series) 4. Persistence agent (an informer that watches resources): records the set of containers that executed as well as their inputs and outputs; each input/output consists of either container parameters or data artifact URIs
  34. Components and Pipelines • Components: a pipeline component is a self-contained set of
     user code, packaged as a Docker image, that performs one step in the pipeline • A component spec declares: the artifacts of the component, its container (v1 spec; k8s resource/daemon types might be added in the future), and its inputs/outputs and their types
  35. Define Components and Run a Pipeline • Key point: the execution logic is in the container
     image • 1. write the code 2. build the image 3. turn it into KFP components • Turn a Python function into a component directly (see the sketch below) • Build a Python module and a Dockerfile to build the image, then define the component • Use reusable components • Demo • Ref: https://github.com/kubeflow/pipelines/tree/master/samples/tutorials/mnist https://github.com/ChanYiLin/kfp_notebook_example
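A minimal sketch of the first option with the KFP v1 SDK: turn a Python function into a lightweight component and chain two steps. The kfp.Client() endpoint is assumed to be reachable from where this runs:

```python
import kfp
from kfp import dsl
from kfp.components import func_to_container_op

@func_to_container_op
def add(a: float, b: float) -> float:
    """The function body is what executes inside the step's container."""
    return a + b

@dsl.pipeline(name='add-pipeline', description='A toy two-step pipeline.')
def add_pipeline(x: float = 1.0, y: float = 2.0):
    first = add(x, y)
    add(first.output, 3.0)  # consumes the first step's output

client = kfp.Client()  # assumes an accessible Kubeflow Pipelines endpoint
client.create_run_from_pipeline_func(add_pipeline, arguments={'x': 5.0})
```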
  36. Data passing • Small data (passed by value) • Bigger data (passed as files; see the sketch
     below) • Demo • https://github.com/kubeflow/pipelines/blob/master/samples/tutorials/Data passing in python components.ipynb https://github.com/ChanYiLin/kfp_notebook_example
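A sketch of the bigger-data case with the KFP v1 SDK: InputPath/OutputPath parameters make KFP pass data between steps as files rather than by value:

```python
from kfp.components import InputPath, OutputPath, func_to_container_op

@func_to_container_op
def write_numbers(count: int, numbers_path: OutputPath(str)):
    # KFP supplies a local path; whatever is written there is stored as an
    # artifact and shipped to downstream steps.
    with open(numbers_path, 'w') as f:
        for i in range(count):
            f.write(f'{i}\n')

@func_to_container_op
def sum_numbers(numbers_path: InputPath(str)) -> int:
    # KFP downloads the upstream artifact and hands this step its local path.
    with open(numbers_path) as f:
        return sum(int(line) for line in f)
```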
  37. Metadata and Artifacts • A pipeline component must write a JSON file specifying metadata
     for the output viewer(s) you want to use for visualizing the results; the file name must be /mlpipeline-ui-metadata.json • Type: confusion matrix
  38. Metadata and Artifacts • Same /mlpipeline-ui-metadata.json mechanism • Type: Markdown
  39. Metadata and Artifacts • Same /mlpipeline-ui-metadata.json mechanism • Other viewer types:
     https://www.kubeflow.org/docs/pipelines/sdk/output-viewer/ (see the sketch below)
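A sketch of a component writing /mlpipeline-ui-metadata.json so the KFP UI renders a Markdown viewer for the step; the content is a placeholder:

```python
import json

# Output-viewer metadata: the KFP UI reads this file from the step's
# container and renders each entry in `outputs` with the matching viewer.
metadata = {
    'outputs': [{
        'type': 'markdown',
        'storage': 'inline',
        'source': '# Validation results\n* accuracy: 0.95 (placeholder)',
    }]
}
with open('/mlpipeline-ui-metadata.json', 'w') as f:
    json.dump(metadata, f)
```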
  40. 4. MLOps using TFX, Kubeflow Pipelines, and Cloud Build
     https://cloud.google.com/solutions/machine-learning/architecture-for-mlops-using-tfx-kubeflow-pipelines-and-cloud-b
  41. CI/CD pipeline compared to CT (continuous training) pipeline - TFX Transform code -
     TensorFlow model training code • Given a new implementation, a successful CI/CD pipeline deploys a new ML CT pipeline • Given new data, a successful CT pipeline serves a new model prediction service
  42. Model Validation • Connect to Slack • Run the training and validation periodically, like
     CI/CD • You can then review the validation results on Slack to decide whether or not to update the model
  43. (image-only slide)
  44. (image-only slide)