Slide 1

Slide 1 text

Machine Learning Workflow

Slide 2

Slide 2 text

Table of Contents

Slide 3

Slide 3 text

1. Background 2. Business Understanding 3. Objective metrics 4. Data Selection and Cleansing 5. Data Understanding 6. Data Splitting and Leakage 7. Data Preprocessing 8. Feature Engineering Table of Contents

Slide 4

Slide 4 text

9. Fitting Machine Learning models 10. Cross Validation 11. Offline Metrics Analysis 12. Ablation and Error Analysis 13. Output Postprocessing 14. Model Explanation 15. A/B testing 16. Reproducible Research and Models 17. Model Deployment and Retraining Table of Contents

Slide 5

Slide 5 text

Background

Slide 6

Slide 6 text

1. Background New Data Scientist in a Company Characteristics - Doesn't know where to start. - Has so much energy, wants to solve all business problems with SOTA ML. - Only has Math/Stats/CS/Business skills.

Slide 7

Slide 7 text

1. Background Company X Characteristics - Wants to hire a data scientist to solve business problems with ML.

Slide 8

Slide 8 text

New DS 1. Background Company X Problems - Company X doesn't have any data. - New DS needs to build an ML system.

Slide 9

Slide 9 text

New DS 1. Background Company X Problems - Company X doesn't have any data. - New DS needs to build an ML system.

Slide 10

Slide 10 text

New DS 1. Background Company X Problems - New DS needs to build an ML system. - New DS doesn't understand the business process and doesn't know what kind of ML services need to be built for the company.

Slide 11

Slide 11 text

New DS 1. Background Company X Problems - Company X doesn't have any data. - New DS needs to build an ML system.

Slide 12

Slide 12 text

New DS 1. Background Company X Problems - New DS needs to build an ML system. - New DS doesn't understand the business process and doesn't know what kind of ML services need to be built for the company.

Slide 13

Slide 13 text

New DS 1. Background Company X Problems - Company X doesn't have any data. - New DS needs to build an ML system.

Slide 14

Slide 14 text

New DS 1. Background Company X Problems - Company X has messy data. - New DS can only work if the data is clean. - New DS doesn’t understand how to clean the data.

Slide 15

Slide 15 text

New DS 1. Background Company X Problems - Company X doesn't have any data. - New DS needs to build an ML system.

Slide 16

Slide 16 text

New DS 1. Background Company X Problems - New DS splits train-test data. - Accuracy: 98%. - There is data leakage.

Slide 17

Slide 17 text

New DS 1. Background Company X Problems - Company X doesn't have any data. - New DS needs to build an ML system.

Slide 18

Slide 18 text

New DS 1. Background Company X Problems - New DS builds an ML model. - Fresh grad. - Background in Stats. - Only fits, never predicts.

Slide 19

Slide 19 text

New DS 1. Background Company X Problems - Company X doesn't have any data. - New DS needs to build an ML system.

Slide 20

Slide 20 text

New DS 1. Background Company X Problems - New DS builds an ML model. - Builds a Decision Tree. - Accuracy: 100% in training.

Slide 21

Slide 21 text

New DS 1. Background Company X Problems - Company X doesn't have any data. - New DS needs to build an ML system.

Slide 22

Slide 22 text

New DS 1. Background Company X Problems - New DS builds ML models. - Uses Kaggle-style modeling. - Ensembles 100 models.

Slide 23

Slide 23 text

New DS 1. Background Company X Problems - Company X doesn't have any data. - New DS needs to build an ML system.

Slide 24

Slide 24 text

New DS 1. Background Company X Problems - New DS builds ML models. - 80% accuracy offline.

Slide 25

Slide 25 text

New DS 1. Background Company X Problems - Company X doesn't have any data. - New DS needs to build an ML system.

Slide 26

Slide 26 text

New DS 1. Background Company X Problems - New DS builds ML models. - Company X asks for prediction explanations. - New DS builds the model with Deep Learning.

Slide 27

Slide 27 text

New DS 1. Background Company X Problems - Company X doesn't have any data. - New DS needs to build an ML system.

Slide 28

Slide 28 text

New DS 1. Background Company X Problems - New DS builds ML models. - The code is not reproducible.

Slide 29

Slide 29 text

New DS 1. Background Company X Problems - Company X doesn't have any data. - New DS needs to build an ML system.

Slide 30

Slide 30 text

Business Understanding

Slide 31

Slide 31 text

2. Business Understanding - It is important to understand the data generating process, user journey, and business process (BP) in a company. - You can create your ML services by automating any process in the BP: - Verification process - Identification process - Scoring/estimation process - etc.

Slide 32

Slide 32 text

2. Business Understanding

Slide 33

Slide 33 text

2. Business Understanding - KTP automatic information extraction (CV) - Entity Matching modeling - Data spoofing modeling - Credit Scoring

Slide 34

Slide 34 text

2. Business Understanding - Your goal is to make ML services with high ROI - You show how “smart” you are by making usable ML services

Slide 35

Slide 35 text

2. Business Understanding NOT your goal: - Applying state of the art model

Slide 36

Slide 36 text

Objective Metrics

Slide 37

Slide 37 text

3. Objective Metrics - You need to define an objective metric for every ML service that you make - Be SKEPTICAL of your objective metrics - Every objective metric needs to be correlated with business metrics

Slide 38

Slide 38 text

3. Objective Metrics Netflix Recommendation System case: - Objective: - We want to recommend movies to users - User journey: - Every user gives ratings for some movies

Slide 39

Slide 39 text

3. Objective Metrics

Slide 40

Slide 40 text

3. Objective Metrics Netflix Recommendation System case: - Objective: - We want to predict the rating of every movie for every user - Objective metric: - RMSE, a regression problem
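
Since the objective metric here is RMSE, a minimal sketch of how it would be computed (with made-up ratings, not Netflix data):

import numpy as np

# Hypothetical true ratings and model predictions, for illustration only
y_true = np.array([4, 2, 5, 3])
y_pred = np.array([5, 4, 5, 2])

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # root mean squared error over all predictions
print(rmse)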

Slide 41

Slide 41 text

3. Objective Metrics

Slide 42

Slide 42 text

3. Objective Metrics Netflix Recommendation System case: - We don't really care about RMSE. - Case 1: - Movie 1 : Rating 4 - Movie 1 : Rating Prediction 5 - Case 2: - Movie 2 : Rating 2 - Movie 2 : Rating Prediction 4 - RMSE treats both as plain prediction errors, but only Case 2 risks pushing a disliked movie into the user's top recommendations.

Slide 43

Slide 43 text

3. Objective Metrics Netflix Recommendation System case: - We care about top N recommendation - Change objective metrics to precision-recall as an information retrieval problem - Change objective metrics to top@K precision-recall - Decision-support metrics: - ROC AUC, Breese score, later precision/recall - Error meets decision-support/user experience: - “Reversals” - User-centered metrics: - Coverage, user retention, recommendation uptake, satisfaction
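
As a rough sketch of the top@K precision idea above (function name and data are illustrative, not from the slides):

def precision_at_k(recommended, relevant, k):
    # Fraction of the top-k recommended items that the user actually liked
    top_k = recommended[:k]
    return sum(1 for item in top_k if item in relevant) / k

# Illustrative usage: ranked movie ids vs. the set of movies the user liked
print(precision_at_k(recommended=[10, 4, 7, 1, 9], relevant={4, 9, 22}, k=5))  # 0.4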

Slide 44

Slide 44 text

Data Selection and Cleansing

Slide 45

Slide 45 text

4. Data Selection and Cleansing ML tutorial:

Slide 46

Slide 46 text

4. Data Selection and Cleansing Data in your daily job:

Slide 47

Slide 47 text

4. Data Selection and Cleansing Data in your daily job: - Need to be cleaned - Need to be joined - Need to be selected for a given service

Slide 48

Slide 48 text

4. Data Selection and Cleansing

Slide 49

Slide 49 text

4. Data Selection and Cleansing

Slide 50

Slide 50 text

4. Data Selection and Cleansing Fintech case: - People are afraid to invest in fintech because they have never experienced a crisis. - Bank Perkreditan Rakyat has data from before and after the financial crisis. - We need to build a Credit Scoring model with 96-99 data to simulate a financial crisis.

Slide 51

Slide 51 text

4. Data Selection and Cleansing

Slide 52

Slide 52 text

4. Data Selection and Cleansing

Slide 53

Slide 53 text

Data Understanding

Slide 54

Slide 54 text

5. Data Understanding Why we need to understand our data: - We need to find patterns which might affect our model/objective metrics. - We need to find anomalies which will decrease our model accuracy. - In the next section, we can generate new variables from this understanding to make it easier for our model to learn and fit.

Slide 55

Slide 55 text

5. Data Understanding - We can understand our data by doing Exploratory Data Analysis

Slide 56

Slide 56 text

5. Data Understanding -- Type of Distribution A multimodal distribution has 2 or more modes.

Slide 57

Slide 57 text

5. Data Understanding -- Case Analysis

Slide 58

Slide 58 text

5. Data Understanding -- Case Analysis ● Self-reported candidates are more likely to post when they are Accepted. ● They are likely to post results for multiple applications. ● More successful candidates are likely to post their personal numbers. ● Ostensibly, applicants who engage heavily with an online forum are far more serious about their applications than the test-taking pool as a whole, leading to better results/numbers. Source: https://debarghyadas.com/writes/the-grad-school-statistics-we-never-had/

Slide 59

Slide 59 text

5. Data Understanding -- Type of Distribution

Slide 60

Slide 60 text

5. Data Understanding -- Type of Distribution After outlier detection

Slide 61

Slide 61 text

5. Data Understanding -- TS, Normal Scale Astra International Stock Price 2002-2018

Slide 62

Slide 62 text

5. Data Understanding -- TS, Log10 Scale Astra International Stock Price 2002-2018

Slide 63

Slide 63 text

Data Splitting and Leakage

Slide 64

Slide 64 text

6. Data Splitting and Leakage Training data accuracy: 100%. Future data accuracy: 5%.

Slide 65

Slide 65 text

6. Data Splitting and Leakage - We need to simulate future data from our current dataset. - This simulation helps us infer our model's predictive power on future data. - Strictly speaking, it assumes a stable data distribution over time.

Slide 66

Slide 66 text

Randomly split your training dataset into: ● the set we use to fit our model, and ● the set we use only to validate our model performance. 6. Data Splitting and Leakage
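
A minimal sketch of this random split with scikit-learn (the data here is a random placeholder):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 5)              # placeholder features
y = np.random.randint(0, 2, size=100)   # placeholder binary target

# Hold out 20% of the rows; fit on the first set, evaluate only on the second
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)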

Slide 67

Slide 67 text

6. Data Splitting and Leakage [Diagram: a set of hypotheses is fitted on the Train split; the fitted model then predicts on the Test split, and the predicted output gives the model performance]

Slide 68

Slide 68 text

6. Data Splitting and Leakage Training data accuracy: 100%. Test data accuracy: 5%.

Slide 69

Slide 69 text

6. Data Splitting and Leakage Data leakage is the introduction of information about the target into the input data.
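
One common way this happens (an illustrative sketch, not from the slides) is fitting preprocessing on the full dataset before splitting, reusing X, X_train, X_test from the split sketch above:

from sklearn.preprocessing import StandardScaler

# Leaky: the scaler's mean/variance are computed on rows that will end up in the test set
X_scaled_leaky = StandardScaler().fit_transform(X)

# Safer: fit preprocessing on the training split only, then apply it to the test split
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)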

Slide 70

Slide 70 text

6. Data Splitting and Leakage

Slide 71

Slide 71 text

6. Data Splitting and Leakage

Slide 72

Slide 72 text

6. Data Splitting and Leakage [Example diagram: Credit Score, Credit Group (derived), Prediction]

Slide 73

Slide 73 text

Data Preprocessing

Slide 74

Slide 74 text

7. Data Preprocessing We need to convert our data into a format consumable by an ML model.

Slide 75

Slide 75 text

7. Data Preprocessing - ML models can only accept numerical input. Example: a Gender column [Male, Male, Female] is encoded with Male: 0, Female: 1, giving [0, 0, 1].
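
A sketch of this binary encoding with pandas (column and value names translated from the slide):

import pandas as pd

df = pd.DataFrame({"gender": ["Male", "Male", "Female"]})
df["gender_encoded"] = df["gender"].map({"Male": 0, "Female": 1})
print(df["gender_encoded"].tolist())  # [0, 0, 1]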

Slide 76

Slide 76 text

7. Data Preprocessing - ML models can only accept numerical input. Example: a Country column [Indonesia, India, Malaysia] is one-hot encoded into three columns, giving rows [1, 0, 0], [0, 1, 0], [0, 0, 1].
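
A sketch of the one-hot encoding above with pandas (column name translated from the slide):

import pandas as pd

df = pd.DataFrame({"country": ["Indonesia", "India", "Malaysia"]})
one_hot = pd.get_dummies(df["country"])  # one 0/1 column per country value
print(one_hot)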

Slide 77

Slide 77 text

7. Data Preprocessing - Some ML models can only accept inputs with similar mean and variance.

Slide 78

Slide 78 text

7. Data Preprocessing - Some ML models can only accept inputs with similar mean and variance.
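
A minimal sketch of bringing columns to a similar mean and variance with scikit-learn (the numbers are placeholders):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])  # columns on very different scales

X_scaled = StandardScaler().fit_transform(X)  # each column now has mean 0 and unit variance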

Slide 79

Slide 79 text

Feature Engineering

Slide 80

Slide 80 text

8. Feature Engineering We need to convert our data into more representative input to increase our model accuracy.

Slide 81

Slide 81 text

8. Feature Engineering Linear vs Quadratic Input for Regression
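
A sketch of the linear vs quadratic input idea, assuming synthetic data with a quadratic relationship:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = X.ravel() ** 2 + np.random.normal(scale=0.5, size=50)  # quadratic signal plus noise

# Adding the squared feature lets a plain linear model fit the quadratic pattern
X_quad = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
model = LinearRegression().fit(X_quad, y)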

Slide 82

Slide 82 text

8. Feature Engineering Embedding with TSNE

Slide 83

Slide 83 text

8. Feature Engineering The effect of every feature engineering experiment needs to be validated on future data, or we can use cross validation.

Slide 84

Slide 84 text

Fitting ML Model

Slide 85

Slide 85 text

● If the model is too simple, the solution is biased and does not fit the data. ● If the model is too complex then it is very sensitive to small changes in the data. 9. Fitting ML Model

Slide 86

Slide 86 text

9. Fitting ML Model Underfitting (an overly simple model): high bias, low variance, low complexity. Overfitting (the model fits signal and noise): low bias, high variance, high complexity.

Slide 87

Slide 87 text

9. Fitting ML Model - Every model has its own hyperparameters to tune. - We tune these hyperparameters using cross validation.

Slide 88

Slide 88 text

9. Fitting ML Model

Slide 89

Slide 89 text

Cross Validation

Slide 90

Slide 90 text

10. Cross Validation - We need to verify our experiments with “future data simulation”. - Cross validation will simulate unavailable future data.

Slide 91

Slide 91 text

Randomly split the data into k folds, then 1. Pick one fold as a validation set 2. Fit the model to the other k-1 folds 3. Repeat those steps until you get k fitted models 4. Then, compute the CV Score as the average of the k validation scores. 10. Cross Validation -- K-fold
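
A minimal sketch of k-fold cross validation with scikit-learn (synthetic data; logistic regression chosen only as an example):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.mean())  # the CV Score: average performance over the k held-out folds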

Slide 92

Slide 92 text

The CV Score is your model performance. 1. Select the best model: the combination of hyperparameters with the best CV Score. 2. Then, build your final model by fitting the selected model on the whole training data. 10. Cross Validation
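
A sketch of this selection step with a grid search (model and grid are illustrative; X, y as in the previous sketch):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [100, 300], "max_depth": [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)  # with refit=True (the default), the best model is retrained on the whole training data

print(search.best_params_, search.best_score_)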

Slide 93

Slide 93 text

If we repeat the process of randomly splitting the sample set into two parts, we will get a somewhat different estimate for the test MSE 10. Cross Validation

Slide 94

Slide 94 text

10. Cross Validation

Slide 95

Slide 95 text

10. Cross Validation

Slide 96

Slide 96 text

10. Cross Validation - Every experiment should be validated with cross validation: - Model hyperparameters - Preprocessing - Feature Engineering - Every variable

Slide 97

Slide 97 text

Offline Metrics Analysis

Slide 98

Slide 98 text

11. Offline Metrics Analysis - Remember our Objective Metrics; now we need to validate them. - There might be some tradeoffs between metrics.

Slide 99

Slide 99 text

11. Offline Metrics Analysis Let’s revisit our Netflix Recommendation System case: - We will focus on Diversity Metrics vs Accuracy Metrics. - More diverse recommendations will increase Netflix CTR. - More accurate recommendations will increase Netflix CTR. - Diversity and Accuracy are negatively correlated.

Slide 100

Slide 100 text

11. Offline Metrics Analysis

Slide 101

Slide 101 text

11. Offline Metrics Analysis

Slide 102

Slide 102 text

Ablation and Error Analysis

Slide 103

Slide 103 text

12. Error Analysis - We want to know the source of error in our predictions. - Thus we can improve our model's accuracy by making a feature which represents the error.

Slide 104

Slide 104 text

12. Error Analysis

Slide 105

Slide 105 text

12. Error Analysis

Slide 106

Slide 106 text

12. Ablation Analysis - We want to measure the contribution of every additional piece of complexity to model accuracy. - If the additional complexity does not improve model accuracy, then we drop that experiment.

Slide 107

Slide 107 text

12. Ablation Analysis Recommendation System PACMANN AI (RMSE): - SVD-like: 0.88 - SVD-like, user-bias, item-bias: 0.85 - SVD-like, user-bias, item-bias, time-bias: 0.83 - Neural Collaborative Filtering (NBF): 0.87 - NBF - WARP: 0.87 - NBF - BPR: 0.87

Slide 108

Slide 108 text

12. Ablation Analysis

Slide 109

Slide 109 text

Output Postprocessing

Slide 110

Slide 110 text

13. Output Postprocessing - The probability from a classifier is only an approximation of the real probability. For example, a 90% "credit default" probability from a classifier does not imply that 9 out of 10 such cases will default.

Slide 111

Slide 111 text

13. Output Postprocessing - This bias in classifier probabilities is very often the result of ML classifiers that do not approximate probabilities in their models. For example, Random Forest averages class votes, and SVM calculates the distance from the separating hyperplane.

Slide 112

Slide 112 text

13. Output Postprocessing

Slide 113

Slide 113 text

13. Output Postprocessing We need to calibrate the probabilities with Platt Scaling.
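
A minimal sketch with scikit-learn, where method="sigmoid" corresponds to Platt Scaling (data and base model are illustrative):

from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, random_state=0)

# Fit a logistic curve to the classifier's scores to turn them into calibrated probabilities
calibrated = CalibratedClassifierCV(LinearSVC(max_iter=5000), method="sigmoid", cv=5)
calibrated.fit(X, y)
probs = calibrated.predict_proba(X)[:, 1]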

Slide 114

Slide 114 text

Model Explanation

Slide 115

Slide 115 text

14. Model Explanation - Most of the time, if you are working with business stakeholders, you will be asked to explain the reasoning behind the model's output. - This problem is related to ML interpretability.

Slide 116

Slide 116 text

14. Model Explanation

Slide 117

Slide 117 text

14. Model Explanation - Most interpretable model for regression: - Linear Regression - Most interpretable model for classification: - Logistic Regression with Weight of Evidence

Slide 118

Slide 118 text

14. Model Explanation

Slide 119

Slide 119 text

A/B Testing

Slide 120

Slide 120 text

15. A/B Testing - Now we have a working model and we want to deploy the system. - We want to measure the effect of the model on our objective metrics in an online fashion.

Slide 121

Slide 121 text

15. A/B Testing [Diagram: Population → Sample → split into Sample A and Sample B] We separate the sample into two sets.

Slide 122

Slide 122 text

15. A/B Testing [Diagram: Sample A vs Sample B]

Slide 123

Slide 123 text

15. A/B Testing [Diagram: Sample A vs Sample B, with the new ML model applied to Sample B as the treatment]

Slide 124

Slide 124 text

15. A/B Testing

Slide 125

Slide 125 text

Reproducible Research

Slide 126

Slide 126 text

16. Reproducible Research - We want to have a reproducible model - It gives the same output every time it is trained. - We need to find the sources of variation in the model.

Slide 127

Slide 127 text

16. Reproducible Research - Source of variation: - Data: - Data definition - Data preprocessing - Feature Engineering - Model: - Different hyperparameter - Source code - Environment

Slide 128

Slide 128 text

16. Reproducible Research - Partial solution: Build one YAML config for every source of variation.
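
A sketch of how such a config might be consumed, assuming a hypothetical experiment.yaml that holds a seed and the other settings:

import random

import numpy as np
import yaml  # PyYAML

with open("experiment.yaml") as f:   # hypothetical config capturing data, feature, and model settings
    cfg = yaml.safe_load(f)

random.seed(cfg["seed"])             # pin the sources of randomness declared in the config
np.random.seed(cfg["seed"])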

Slide 129

Slide 129 text

16. Reproducible Research

Slide 130

Slide 130 text

Deployment and Retraining

Slide 131

Slide 131 text

17. Deployment and Retraining - We have already validated the model; now we want to deploy it into our system. - There are two kinds of ML serving: Online and Offline.

Slide 132

Slide 132 text

17. Deployment and Retraining - Online: - Train your machine learning model - Build preprocessing and feature engineering that can be applied to incoming new data - Deploy to your server as an API - Offline: - Train your model - Predict on your data, write the results into a DB
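
A minimal sketch of the online option with Flask (the model path and request format are assumptions, not from the slides):

import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)
model = pickle.load(open("model.pkl", "rb"))   # hypothetical trained model artifact

@app.route("/predict", methods=["POST"])
def predict():
    features = request.json["features"]        # expects a JSON list of preprocessed feature values
    prediction = model.predict([features])[0]
    return jsonify({"prediction": float(prediction)})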

Slide 133

Slide 133 text

We believe everyone can be a Data Scientist