TFX: A TensorFlow-Based Production-Scale Machine Learning Platform

TensorFlow Extended (TFX), a TensorFlow-based
general-purpose machine learning platform implemented
at Google.
We present the TFX paper (KDD '17) at an internal seminar at Mercari, Inc.

Shunya Ueta

March 28, 2018
Transcript

  1. TFX: A TensorFlow-Based Production-Scale
    Machine Learning Platform
    KDD 2017
    Shunya Ueta (@hurutoriya) 2018-03-28 @mercari

  2. Abstract
    Issue: “This becomes particularly challenging when data changes over time and
    fresh models need to be produced continuously. Unfortunately, such orchestration is
    often done ad hoc using glue code and custom scripts developed by individual
    teams for specific use cases, leading to duplicated effort and fragile systems with
    high technical debt.”
    1. Reduce the time to production from the order of months to weeks.
    2. Deploying TFX in the Google Play app store reduced custom code, sped up
    experiment cycles, and increased app installs by 2%.
    KDD Link here, KDD Video here, Paper Link here, Author Demo Video here.

  3. Machine Learning has High Technical Debt
    The conceptual workflow of applying machine learning is simple, but the actual
    workflow becomes much more complex.
    1. Building one machine learning platform for many different
    learning tasks
    2. Continuous training and serving
    3. Human-in-the-loop
    4. Production-level reliability and scalability

  4. Hidden Technical Debt in Machine Learning Systems
    Ref: Hidden Technical Debt in Machine Learning Systems, NIPS '15

  5. What’s the Contribution?
    ● We present a case study of deploying the platform in Google Play, a
    commercial mobile app store with over one billion active users and over one
    million apps.
    ● We show best practices for machine learning platforms that apply in a diverse
    set of contexts and are thus of general interest to researchers and practitioners
    in the field.

  6. Machine Learning Platform Design
    1. One machine learning platform for many learning tasks.
    a. Linear, Deep, Linear and Deep combined, Tree-based, Sequential, Multi-tower, Multi-head, etc.
    2. Continuous training.
    3. Easy-to-use configuration and tools.
    4. Production-level reliability and scalability.

  7. High-level component overview of a machine learning platform.

  8. High-level component overview of a machine learning platform.
    Focus

  9. DATA ANALYSIS, TRANSFORMATION, AND VALIDATION
    ● Small bugs in the data can significantly degrade model quality over a period of
    time in a way that is hard to detect and diagnose.
    ● The component needs to support a wide range of data-analysis and validation
    cases that correspond to machine learning applications, e.g. a NaN trap
    (several teams report hitting this too); see the sketch below.
    ● Data Analysis → Data Transformation → Data Validation
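    A minimal sketch of a NaN-trap-style check (an illustrative example, not TFX's
    actual data-validation code; the feature names and helper are assumptions):

    import math

    # Illustrative NaN trap (assumed example): flag training examples whose
    # numeric features contain NaN, so small data bugs are caught before
    # they silently degrade model quality.
    def nan_trap(examples):
        """examples: list of dicts mapping feature name -> value."""
        bad = []
        for i, example in enumerate(examples):
            for name, value in example.items():
                if isinstance(value, float) and math.isnan(value):
                    bad.append((i, name))
        return bad

    # One corrupted example is flagged instead of being trained on.
    print(nan_trap([{"price": 9.99}, {"price": float("nan")}]))  # [(1, 'price')]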

  10. Sample validation of an example | DATA ANALYSIS
    Figure: training data is checked against the data schema; data validation
    flags anomalies such as the appearance of a new value, which needs to be fixed.
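    A hedged sketch of the check the figure illustrates (the feature name, schema
    domain, and helper are assumptions for illustration):

    # Assumed schema: the expected value domain per categorical feature.
    SCHEMA = {"country": {"US", "JP", "DE"}}

    def find_new_values(examples, schema=SCHEMA):
        """Flag values that fall outside the schema's known domain."""
        anomalies = {}
        for example in examples:
            for feature, value in example.items():
                if feature in schema and value not in schema[feature]:
                    anomalies.setdefault(feature, set()).add(value)
        return anomalies

    # "FR" appears for the first time -> reported so the schema or data gets fixed.
    print(find_new_values([{"country": "US"}, {"country": "FR"}]))  # {'country': {'FR'}}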

  11. Model Training
    One of the core design philosophies of TFX is to streamline (and automate as
    much as possible) the process of training production-quality models that can
    support all training use cases.
    Example Code [22]
    TensorFlow Estimators: Managing Simplicity vs.
    Flexibility in High-Level Machine Learning Frameworks
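    Since the slide points to the Estimators paper [22], here is a minimal sketch of
    the Estimator-style training flow TFX builds on (TF 1.x-era API; the model shape,
    feature names, and paths are assumptions, not the paper's actual trainer):

    import tensorflow as tf

    # Feature columns declare how raw features map to model inputs.
    feature_columns = [tf.feature_column.numeric_column("x", shape=[1])]

    # A canned Estimator provides training, evaluation, and export without
    # hand-written training loops; model_dir checkpoints enable warm-starting.
    estimator = tf.estimator.DNNClassifier(
        feature_columns=feature_columns,
        hidden_units=[64, 32],
        n_classes=2,
        model_dir="/tmp/tfx_demo_model",  # hypothetical path
    )

    def input_fn():
        # Toy in-memory dataset; in TFX this would read transformed examples.
        dataset = tf.data.Dataset.from_tensor_slices(
            ({"x": [[0.0], [1.0], [2.0], [3.0]]}, [0, 0, 1, 1])
        )
        return dataset.shuffle(4).repeat().batch(2)

    estimator.train(input_fn=input_fn, steps=100)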

  12. MODEL EVALUATION AND VALIDATION
    ● Defining a “good” model
    ○ The model is safe to serve and has the desired prediction quality.
    ● Evaluation: human-facing metrics of model quality
    ○ A/B experiments on live traffic on relevant business metrics.
    ● Validation: machine-facing judgment of model goodness
    ○ We evaluate prediction quality by comparing the model against a fixed threshold
    as well as against a baseline model (see the sketch after this list).
    ● Slicing: subset of the data containing certain features
    ● User Attitudes towards Validation
    ○ No product team actively requested the validation function when the component
    was first built.
    ○ However, encountering a real issue in production that could have been prevented
    by validation made its value apparent to the teams.
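    A minimal sketch of such a validation gate (the metric, threshold, and function
    name are assumptions; the paper does not spell out the exact logic):

    # Hypothetical validation gate: a candidate model must beat both a fixed
    # quality threshold and the currently serving baseline before it is pushed.
    def is_safe_to_push(candidate_auc, baseline_auc, threshold=0.75):
        return candidate_auc >= threshold and candidate_auc >= baseline_auc

    # A candidate at 0.81 AUC vs. a 0.79 baseline clears both checks.
    print(is_safe_to_push(candidate_auc=0.81, baseline_auc=0.79))  # True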

  13. MODEL SERVING
    ● TensorFlow Serving (2016/02~) Link here
    ○ Serving a model requires that it is safe to serve and has the desired
    prediction quality.
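    A hedged sketch of querying a model hosted on TensorFlow Serving over gRPC
    (the address, model name, and input signature are assumptions for illustration):

    import grpc
    import tensorflow as tf
    from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

    # Connect to a (hypothetical) local TensorFlow Serving instance.
    channel = grpc.insecure_channel("localhost:8500")
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

    # Build a Predict request for an assumed model named "tfx_demo_model".
    request = predict_pb2.PredictRequest()
    request.model_spec.name = "tfx_demo_model"
    request.inputs["x"].CopyFrom(
        tf.make_tensor_proto([[1.0]], dtype=tf.float32)
    )

    # Send the request and read the output tensors.
    response = stub.Predict(request, timeout=5.0)
    print(response.outputs)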

  14. Multitenancy with Isolation
    ● TensorFlow Serving (2017/02~)
    ○ Latest Innovations in TensorFlow Serving : here
    ○ Multi-model serving
    ○ “We recently launched a 1TB+ model in production with good results, and
    hope to open-source this capability soon.”
    ● Inference request latency: from ∼500–1500 msec down to ∼75–150 msec.

  15. CASE STUDY : GOOGLE PLAY
    ● The recommender system for Google Play
    ○ The corpus contains over a million apps
    ■ The first step in this system is retrieval, which returns a short list of apps
    based on various signals.
    ■ It serves thousands of queries per second with a strict latency
    requirement of tens of milliseconds.

  17. CASE STUDY : GOOGLE PLAY
    ● The data validation and analysis component helped in discovering a harmful
    training-serving feature skew (a simple detection sketch follows this list).
    ○ The results of an online A/B experiment showed that removing this skew improved the app
    install rate on the main landing page of the app store by 2%.
    ● Warm-starting helped improve model quality and freshness while reducing the
    time and resources spent on training over hundreds of billions of examples.
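    A hedged illustration of training/serving skew detection (the statistics
    compared, feature name, and tolerance are assumptions; the paper does not
    describe the exact check):

    # Hypothetical skew check: compare per-feature means computed over training
    # data vs. serving logs and flag large relative divergences.
    def detect_skew(train_means, serving_means, tolerance=0.10):
        skewed = []
        for feature, train_mean in train_means.items():
            serving_mean = serving_means.get(feature, 0.0)
            denom = max(abs(train_mean), 1e-9)
            if abs(train_mean - serving_mean) / denom > tolerance:
                skewed.append(feature)
        return skewed

    # A feature whose distribution differs at serving time is flagged.
    print(detect_skew({"install_rate": 0.50}, {"install_rate": 0.20}))  # ['install_rate']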
