
Machine Learning Workflow

Pacmann AI
October 14, 2019


Transcript

  1. Table of Contents: 1. Background 2. Business Understanding 3. Objective Metrics 4. Data Selection and Cleansing 5. Data Understanding 6. Data Splitting and Leakage 7. Data Preprocessing 8. Feature Engineering
  2. Table of Contents (continued): 9. Fitting Machine Learning Models 10. Cross Validation 11. Offline Metrics Analysis 12. Ablation and Error Analysis 13. Output Postprocessing 14. Model Explanation 15. A/B Testing 16. Reproducible Research and Models 17. Model Deployment and Retraining
  3. 1. Background. New Data Scientist in a Company. Characteristics: - Doesn't know where to start. - Has so much energy, wants to solve all business problems with SOTA ML. - Only has Math/Stats/CS/Business skills.
  4. 1. Background. Company X. Characteristics: - Wants to hire a data scientist to solve business problems with ML.
  5. 1. Background. New DS and Company X. Problems: - Company X doesn't have any data. - New DS needs to build an ML system.
  7. 1. Background. New DS and Company X. Problems: - New DS needs to build an ML system. - New DS doesn't understand the business process and doesn't know what kind of ML services need to be built for the company.
  11. 1. Background. New DS and Company X. Problems: - Company X has messy data. - New DS can only work if the data is clean. - New DS doesn't understand how to clean the data.
  13. 1. Background. New DS and Company X. Problems: - New DS splits train-test data. - Accuracy is 98%. - There is data leakage.
  15. 1. Background. New DS and Company X. Problems: - New DS builds an ML model. - Fresh grad. - Background in Stats. - Only fits, never predicts.
  17. 1. Background. New DS and Company X. Problems: - New DS builds an ML model. - Builds a Decision Tree. - Accuracy is 100% in training.
  19. 1. Background. New DS and Company X. Problems: - New DS builds ML models. - Uses Kaggle-style modeling. - Ensembles 100 models.
  21. 1. Background. New DS and Company X. Problems: - New DS builds ML models. - 80% accuracy offline.
  23. 1. Background. New DS and Company X. Problems: - New DS builds ML models. - Company X asks for prediction explanations. - New DS builds the model with Deep Learning.
  25. 1. Background. New DS and Company X. Problems: - New DS builds ML models. - His code is not reproducible.
  27. 2. Business Understanding - It is important to understand the data-generating process, user journey, and business process (BP) in a company. - You can create ML services by automating any process in the BP: - Verification process - Identification process - Scoring/estimation process - etc.
  28. 2. Business Understanding - Your goal is to make ML services with high ROI. - You show how “smart” you are by making usable ML services.
  29. 3. Objective Metrics - You need to define an objective metric for every ML service that you make. - Be SKEPTICAL about your objective metrics. - Every objective metric needs to be correlated with a business metric.
  30. 3. Objective Metrics Netflix Recommendation System case: - Objective: - We want to recommend movies to users. - User journey: - Every user gives ratings for some movies.
  31. 3. Objective Metrics Netflix Recommendation System case: - Objective: - We want to predict the rating of every movie for every user. - Objective metric: - RMSE, as a regression problem.
  32. 3. Objective Metrics Netflix Recommendation System case: - We don't really care about RMSE. - Case 1: Movie 1 has rating 4, predicted rating 5. - Case 2: Movie 2 has rating 2, predicted rating 4. - RMSE penalizes both, but only Case 2 leads to a bad recommendation: an item the user dislikes looks like one worth recommending.
  33. 3. Objective Metrics Netflix Recommendation System case: - We care about the top N recommendations. - Change the objective metric to precision-recall, treating it as an information retrieval problem. - Change the objective metric to top@K precision-recall. - Decision-support metrics: ROC AUC, Breese score, later precision/recall. - Where error meets decision support/user experience: “reversals”. - User-centered metrics: coverage, user retention, recommendation uptake, satisfaction.
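
To make top@K concrete, here is a minimal, hypothetical sketch of computing precision and recall at K for one user, assuming we already have a ranked list of recommended item ids and the set of items the user actually liked (both invented for illustration):

```python
def precision_recall_at_k(ranked_items, relevant_items, k):
    """ranked_items: item ids sorted by predicted score (best first).
    relevant_items: set of item ids the user actually liked."""
    top_k = ranked_items[:k]
    hits = len(set(top_k) & set(relevant_items))
    precision = hits / k
    recall = hits / len(relevant_items) if relevant_items else 0.0
    return precision, recall

# Hypothetical example: the user liked items {1, 5, 9}; we recommend [5, 2, 9, 7, 1].
print(precision_recall_at_k([5, 2, 9, 7, 1], {1, 5, 9}, k=3))  # (0.67, 0.67)
```
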
  34. 4. Data Selection and Cleansing Data in your daily job: - Needs to be cleaned - Needs to be joined - Needs to be selected for a given service
  35. 4. Data Selection and Cleansing Fintech case: - People are afraid to invest in fintech because they have never experienced a crisis. - Bank Perkreditan Rakyat has data from before and after the financial crisis. - We need to build a credit scoring model with 1996-1999 data to simulate a financial crisis.
  36. 5. Data Understanding Why we need to understand our data: - We need to find patterns which might affect our model/objective metrics. - We need to find anomalies which will decrease our model accuracy. - In the next section, we can generate new variables from this understanding to make it easier for our model to learn and fit.
  37. 5. Data Understanding -- Case Analysis • Self-reported candidates are more likely to post when they are accepted. • They are likely to post results for multiple applications. • More successful candidates are likely to post their personal numbers. • Ostensibly, applicants who engage heavily with an online forum are far more serious about their application than the entire test-taking pool, leading to better results/numbers. Source: https://debarghyadas.com/writes/the-grad-school-statistics-we-never-had/
  38. 6. Data Splitting and Leakage - We need to simulate future data from our current dataset. - This simulation can help us infer our model's predictive power on future data. - Strictly speaking, it assumes a stable data distribution between the current data and future data.
  39. 6. Data Splitting and Leakage Randomly split your dataset into: • the set we use to fit our model, and • the set we use only to validate our model performance, that's it.
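
A minimal sketch of this random split with scikit-learn's train_test_split, on synthetic data (array names and sizes are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows, 5 features, binary target.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# Hold out 20% of the rows; that set is used only to validate model performance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)  # (800, 5) (200, 5)
```
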
  40. 6. Data Splitting and Leakage [Diagram: the train set is used to fit a model from the set of hypotheses; the fitted model predicts on the test set, and the predicted output is compared with the test results to measure model performance.]
  41. 6. Data Splitting and Leakage Data leakage is the introduction of information about the target into the input data.
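
A small synthetic demonstration of target leakage: one input column is (almost) the target itself, e.g. a value that is only recorded after the outcome is known, and the offline accuracy looks suspiciously perfect. Everything here is invented for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + rng.normal(scale=2.0, size=1000) > 0).astype(int)  # noisy target

# Leaky feature: essentially the target in disguise (known only after the outcome).
X_leaky = np.column_stack([X, y + rng.normal(scale=0.01, size=1000)])

for name, data in [("clean", X), ("leaky", X_leaky)]:
    X_tr, X_te, y_tr, y_te = train_test_split(data, y, random_state=0)
    acc = RandomForestClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
    print(name, round(acc, 3))  # the leaky version looks almost perfect
```
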
  42. 7. Data Preprocessing - ML models can only accept numerical input. Example: a "Jenis Kelamin" (gender) column with values Pria, Pria, Wanita is encoded with the mapping Pria: 0, Wanita: 1, giving 0, 0, 1.
  43. 7. Data Preprocessing - ML models can only accept numerical input. Example: a "Negara" (country) column with values Indonesia, India, Malaysia is one-hot encoded into three 0/1 columns (Indonesia, India, Malaysia), giving the rows 1 0 0, 0 1 0, and 0 0 1.
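
A minimal sketch of both encodings from the two slides above, using pandas (column values taken from the slides):

```python
import pandas as pd

df = pd.DataFrame({
    "Jenis Kelamin": ["Pria", "Pria", "Wanita"],
    "Negara": ["Indonesia", "India", "Malaysia"],
})

# Binary column -> simple 0/1 mapping (Pria: 0, Wanita: 1), as on the slide.
df["Jenis Kelamin"] = df["Jenis Kelamin"].map({"Pria": 0, "Wanita": 1})

# Multi-category column -> one-hot encoding, one 0/1 column per country.
df = pd.get_dummies(df, columns=["Negara"], dtype=int)
print(df)
```
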
  44. 7. Data Preprocessing - Some ML models can only accept inputs with similar mean and variance.
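
A minimal sketch of standardization with scikit-learn, using made-up numbers on very different scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales (e.g. age in years, income in IDR).
X = np.array([[25, 4_000_000],
              [40, 12_000_000],
              [31, 7_500_000]], dtype=float)

X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0))  # ~0 for every column
print(X_scaled.std(axis=0))   # ~1 for every column
```
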
  46. 8. Feature Engineering We need to convert our data into more representative inputs to increase our model accuracy.
  47. 8. Feature Engineering The effect of every feature engineering experiment needs to be validated on future data, or we can do cross validation.
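
A hedged sketch of validating one feature engineering experiment with cross validation, on synthetic data where the target depends on an interaction that a linear model cannot learn from the raw columns:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)        # target depends on an interaction

baseline = cross_val_score(LogisticRegression(), X, y, cv=5).mean()

# Engineered feature: the interaction term x0 * x1.
X_fe = np.column_stack([X, X[:, 0] * X[:, 1]])
with_feature = cross_val_score(LogisticRegression(), X_fe, y, cv=5).mean()

print(f"baseline CV accuracy:       {baseline:.3f}")
print(f"with engineered feature:    {with_feature:.3f}")
```
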
  48. 9. Fitting ML Model • If the model is too simple, the solution is biased and does not fit the data. • If the model is too complex, then it is very sensitive to small changes in the data.
  49. 9. Fitting ML Model - Underfitting (an overly simple model): high bias, low variance, low complexity. - Overfitting (your model fits signal and noise): low bias, high variance, high complexity.
  50. 9. Fitting ML Model - Every model has its own hyperparameters to tune. - We tune these hyperparameters using cross validation.
  51. 10. Cross Validation - We need to verify our experiments with a “future data simulation”. - Cross validation will simulate the unavailable future data.
  52. 10. Cross Validation -- K-fold Randomly split the data into k folds, then: 1. Pick one fold as a validation set. 2. Fit the model to the other k-1 folds. 3. Repeat those steps until you get k fitted models. 4. Then compute the CV score as the average of the k validation scores.
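
In other words, CV score = (1/k) * sum of the k validation scores. A minimal sketch with scikit-learn, using a built-in dataset purely for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# k = 5 folds; each fold is used exactly once as the validation set.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=cv)

print(scores)          # one validation score per fold
print(scores.mean())   # the CV score: average of the k fold scores
```
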
  53. 10. Cross Validation The CV score is your model performance estimate. 1. Select the best model, i.e. the combination of hyperparameters with the best CV score. 2. Then build your final model by fitting the selected model on the whole training data.
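
A sketch of both steps with scikit-learn's GridSearchCV (the model and grid are arbitrary examples): the search picks the hyperparameter combination with the best CV score, and refit=True then retrains that model on the whole training data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

pipeline = make_pipeline(StandardScaler(), SVC())
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01]}

search = GridSearchCV(pipeline, param_grid, cv=5, refit=True)
search.fit(X, y)                       # best hyperparams by CV, then refit on all data
print(search.best_params_, search.best_score_)
final_model = search.best_estimator_   # ready to predict on new data
```
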
  54. 10. Cross Validation If we repeat the process of randomly splitting the sample set into two parts, we will get a somewhat different estimate for the test MSE.
  55. 10. Cross Validation - Every experiment should be validated with cross validation: - Model hyperparameters - Preprocessing - Feature engineering - Every variable
  56. 11. Offline Metrics Analysis - Remember our objective metrics; now we need to validate them. - There might be some tradeoffs between metrics.
  57. 11. Offline Metrics Analysis Let's revisit our Netflix Recommendation System case: - We will focus on diversity metrics vs. accuracy metrics. - More diverse recommendations will increase Netflix CTR. - More accurate recommendations will increase Netflix CTR. - Diversity and accuracy are negatively correlated.
  58. 12. Error Analysis - We want to know the source of error in our predictions. - Thus we can improve our model accuracy by making a feature which represents that error.
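
One common way to hunt for the source of error is to slice prediction error by a candidate variable; a synthetic sketch (the "segment" column and its effect are invented) where one segment clearly drives the error:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": rng.normal(size=2000),
    "segment": rng.choice(["A", "B", "C"], size=2000),
})
# Segment C behaves differently, and the model's feature set does not capture it.
df["y"] = df["x"] + np.where(df["segment"] == "C", 3.0, 0.0) + rng.normal(scale=0.1, size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(df[["x"]], df["y"], random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)

# Slice the test error by segment to locate where the model is failing.
errors = pd.DataFrame({"segment": df.loc[X_te.index, "segment"],
                       "abs_error": np.abs(y_te - model.predict(X_te))})
print(errors.groupby("segment")["abs_error"].mean())  # segment C stands out
```
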
  59. 12. Ablation Analysis - We want to measure the contribution of every additional piece of complexity to model accuracy. - If the additional complexity does not affect model accuracy, then we drop that experiment.
  60. 12. Ablation Analysis Recommendation System (PACMANN AI), RMSE: - SVD-like: 0.88 - SVD-like + user-bias + item-bias: 0.85 - SVD-like + user-bias + item-bias + time-bias: 0.83 - Neural Collaborative Filtering (NCF): 0.87 - NCF - WARP: 0.87 - NCF - BPR: 0.87
  61. 13. Output Postprocessing - The probability from a classifier is only an approximation of the real probability. For example, a 90% "credit default" probability from a classifier does not imply that 9 out of 10 such cases will default.
  62. 13. Output Postprocessing - This bias in classifier probabilities is very often the result of ML classifiers that do not approximate a probability in their model. For example, Random Forest averages the predicted class across trees, and SVM computes the distance from the separating boundary.
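
A hedged sketch of one standard fix, probability calibration, using scikit-learn's CalibratedClassifierCV on synthetic data; the model choice and numbers are illustrative only:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

raw = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Remap the classifier's scores to better-calibrated probabilities via isotonic regression.
calibrated = CalibratedClassifierCV(RandomForestClassifier(random_state=0),
                                    method="isotonic", cv=5).fit(X_tr, y_tr)

# Lower Brier score = predicted probabilities closer to observed frequencies.
print("raw       ", brier_score_loss(y_te, raw.predict_proba(X_te)[:, 1]))
print("calibrated", brier_score_loss(y_te, calibrated.predict_proba(X_te)[:, 1]))
```
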
  63. 14. Model Explanation - Most of the time, if you are working with business stakeholders, you will be asked by business people to explain the reasoning behind the model output. - This problem is related to ML interpretability.
  64. 14. Model Explanation - Most interpretable model for regression: Linear Regression. - Most interpretable model for classification: Logistic Regression with Weight of Evidence.
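
The deck names Logistic Regression with Weight of Evidence; as a simpler hedged sketch of why a logistic model is easy to explain, here is plain logistic regression on a built-in dataset, reading its coefficients as odds ratios (no WoE encoding shown):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
model.fit(data.data, data.target)

# Each coefficient is the change in log-odds per one standard deviation of a feature,
# so exp(coef) is an odds ratio that can be explained to business stakeholders.
coefs = model.named_steps["logisticregression"].coef_[0]
top = sorted(zip(data.feature_names, coefs), key=lambda t: -abs(t[1]))[:5]
for name, coef in top:
    print(f"{name:25s} coef={coef:+.2f}  odds ratio={np.exp(coef):.2f}")
```
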
  65. 15. A/B Testing - Now we have a working model and we want to deploy the system. - We want to measure the effect of the model on our objective metrics in an online fashion.
  66. 15. A/B Testing [Diagram: from the population we draw a sample and separate it into two sets, Sample A and Sample B.]
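
A minimal sketch of comparing the two samples on a click-through metric with a two-proportion z-test from statsmodels; the counts below are made up:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical online results: clicks / impressions for the two variants.
clicks      = [310, 355]          # Sample A (control), Sample B (new model)
impressions = [10_000, 10_000]

stat, p_value = proportions_ztest(count=clicks, nobs=impressions)
print(f"z = {stat:.2f}, p = {p_value:.4f}")
# If p is below the chosen significance level, the CTR difference between the
# two samples is unlikely to be due to chance alone.
```
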
  67. 16. Reproducible Research - We want to have a reproducible model: - It gives the same output every time it is trained. - We need to find the sources of variation in the model.
  68. 16. Reproducible Research - Sources of variation: - Data: - Data definition - Data preprocessing - Feature engineering - Model: - Different hyperparameters - Source code - Environment
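
A minimal sketch of pinning the random seeds that usually drive model-side variation (library-specific seeds such as torch.manual_seed follow the same pattern); data-, code-, and environment-side variation still needs versioning of data, source code, and dependencies:

```python
import os
import random

import numpy as np

SEED = 42

# Pin every source of randomness we control so retraining gives the same result
# (data shuffling, subsampling, weight initialization, ...).
os.environ["PYTHONHASHSEED"] = str(SEED)
random.seed(SEED)
np.random.seed(SEED)

# Also pass the same seed to every library object that accepts one, e.g.
# train_test_split(X, y, random_state=SEED) or RandomForestClassifier(random_state=SEED).
```
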
  69. 17. Deployment and Retraining - We have already validated the model; now we want to deploy it into our system. - There are two kinds of ML serving: online and offline.
  70. 17. Deployment and Retraining - Online: - Train your machine learning model. - Build preprocessing and feature engineering that can be hit by new incoming data. - Deploy to your server as an API. - Offline: - Train your model. - Predict on your data and write the results into a DB.
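
A hedged sketch of the online flavor: a previously fitted scikit-learn pipeline (preprocessing + feature engineering + model, saved here to a hypothetical model.pkl) served behind a small Flask API. The offline flavor runs the same pipeline on a schedule and writes the model.predict results into the database instead:

```python
import pickle

import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical artifact: a fitted sklearn Pipeline saved earlier with
# pickle.dump(pipeline, open("model.pkl", "wb")).
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Online serving: new data hits the same preprocessing + model through an API.
    rows = pd.DataFrame(request.get_json())
    return jsonify(predictions=model.predict(rows).tolist())

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```
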