Machine Learning Workflow

Pacmann AI
October 14, 2019


Transcript

  1. Machine Learning Workflow

  2. Table of Contents

  3. 1. Background 2. Business Understanding 3. Objective Metrics 4. Data

    Selection and Cleansing 5. Data Understanding 6. Data Splitting and Leakage 7. Data Preprocessing 8. Feature Engineering
  4. 9. Fitting Machine Learning Models 10. Cross Validation 11. Offline

    Metrics Analysis 12. Ablation and Error Analysis 13. Output Postprocessing 14. Model Explanation 15. A/B Testing 16. Reproducible Research and Models 17. Model Deployment and Retraining
  5. Background

  6. 1. Background New Data Scientist in a Company Characteristics -

    Doesn’t know where to start - Has so much energy; wants to solve all business problems with SOTA ML. - Only has Math/Stats/CS/Business skills.
  7. 1. Background Company X Characteristics - Wants to hire a data

    scientist to solve business problems with ML.
  8. New DS 1. Background Company X Problems - Company X

    doesn’t have any data - New DS needs to build an ML system
  10. New DS 1. Background Company X Problems - New DS

    needs to build an ML system - New DS doesn’t understand the business process and doesn’t know what kind of ML services need to be built for the company.
  14. New DS 1. Background Company X Problems - Company X

    has messy data. - New DS can only work if the data is clean. - New DS doesn’t understand how to clean the data.
  16. New DS 1. Background Company X Problems - New DS

    splits train-test data - Accuracy 98% - There is data leakage.
  18. New DS 1. Background Company X Problems - New DS

    builds an ML model. - Fresh grad. - Background in Stats. - Has only ever fit models, never predicted.
  20. New DS 1. Background Company X Problems - New DS

    builds an ML model. - Builds a Decision Tree - 100% accuracy in training.
  22. New DS 1. Background Company X Problems - New DS

    builds ML models. - Uses Kaggle-style modeling - Ensembles 100 models
  24. New DS 1. Background Company X Problems - New DS

    builds ML models. - 80% accuracy offline
  26. New DS 1. Background Company X Problems - New DS

    builds ML models. - Company X asks for prediction explanations. - New DS builds the model with Deep Learning.
  28. New DS 1. Background Company X Problems - New DS

    builds ML models. - The code is not reproducible.
  30. Business Understanding

  31. 2. Business Understanding - It is important to understand the

    data-generating process, user journey, and business process (BP) in a company. - You can create ML services by automating any process in the BP: - Verification process - Identification process - Scoring/estimation process - etc.
  32. 2. Business Understanding

  33. 2. Business Understanding Examples: KTP automatic information extraction (CV), Entity

    Matching modeling, Data spoofing modeling, Credit Scoring
  34. 2. Business Understanding - Your goal is to make ML

    services with high ROI - You show how “smart” you are by making usable ML services
  35. 2. Business Understanding NOT your goal: - Applying state of

    the art model
  36. Objective Metrics

  37. 3. Objective Metrics - You need to define an objective

    metric for every ML service that you make - Be SKEPTICAL of your objective metrics - Every objective metric needs to be correlated with business metrics
  38. 3. Objective Metrics Netflix Recommendation System case: - Objective: -

    We want to recommend movies to users - User journey: - Every user gives ratings for some movies
  39. 3. Objective Metrics

  40. 3. Objective Metrics Netflix Recommendation System case: - Objective: -

    We want to predict the rating of every movie for every user - Objective metric: - RMSE, a regression problem
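As a sketch, the RMSE objective can be computed directly; the ratings below are made up for illustration:

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted ratings."""
    assert len(y_true) == len(y_pred)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical ratings on a 1-5 scale
actual    = [4, 2, 5, 3]
predicted = [5, 4, 4, 3]
print(round(rmse(actual, predicted), 3))  # -> 1.225
```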
  41. 3. Objective Metrics

  42. 3. Objective Metrics Netflix Recommendation System case: - We don’t

    really care about RMSE. - Case 1: - Movie 1 : Rating 4 - Movie 1 : Rating Prediction 5 - Case 2: - Movie 2 : Rating 2 - Movie 2 : Rating Prediction 4
  43. 3. Objective Metrics Netflix Recommendation System case: - We care

    about top-N recommendations - Change the objective metric to precision-recall, as in an information retrieval problem - Change the objective metric to top-K precision-recall - Decision-support metrics: - ROC AUC, Breese score, later precision/recall - Error meets decision support/user experience: - “Reversals” - User-centered metrics: - Coverage, user retention, recommendation uptake, satisfaction
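A minimal precision@K sketch for the top-N framing; the item IDs and the relevant set are made up:

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are actually relevant."""
    top_k = recommended[:k]
    relevant = set(relevant)
    return sum(1 for item in top_k if item in relevant) / k

# Hypothetical ranked recommendations and the user's relevant (liked) set
recs = ["m1", "m7", "m3", "m9", "m2"]
liked = {"m1", "m3", "m5"}
print(precision_at_k(recs, liked, k=3))  # 2 of the top 3 are relevant -> 2/3
```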
  44. Data Selection and Cleansing

  45. 4. Data Selection and Cleansing ML tutorial:

  46. 4. Data Selection and Cleansing Data in your daily job:

  47. 4. Data Selection and Cleansing Data in your daily job:

    - Needs to be cleaned - Needs to be joined - Needs to be selected for a given service
  48. 4. Data Selection and Cleansing

  49. 4. Data Selection and Cleansing

  50. 4. Data Selection and Cleansing Case Fintech: - People are

    afraid to invest in fintech because they have never experienced a crisis. - Bank Perkreditan Rakyat has data from before and after the financial crisis. - We need to build a credit scoring model with 96-99 data to simulate a financial crisis.
  51. 4. Data Selection and Cleansing

  52. 4. Data Selection and Cleansing

  53. Data Understanding

  54. 5. Data Understanding Why we need to understand our data:

    - We need to find patterns which might affect our model/objective metrics. - We need to find anomalies which will decrease our model accuracy. - In the next section, we can generate new variables from this understanding to make it easier for our model to learn and fit.
  55. 5. Data Understanding - We can understand our data by

    doing Exploratory Data Analysis
  56. 5. Data Understanding -- Type of Distribution Multimodal distribution has

    two or more modes.
  57. 5. Data Understanding -- Case Analysis

  58. 5. Data Understanding -- Case Analysis • Self-reported candidates are

    more likely to post when they are Accepted. • They are likely to post results for multiple applications. • More successful candidates are likely to post their personal numbers. • Ostensibly, applicants who engage heavily with an online forum are far more serious about their application than the entirety of the test-taking pool, leading to better results/numbers. Sources: https://debarghyadas.com/writes/the-grad-school-statistics-we-never-had/
  59. 5. Data Understanding -- Type of Distribution

  60. 5. Data Understanding -- Type of Distribution After outlier detection

  61. 5. Data Understanding -- TS, Normal Scale Astra International Stock

    Price 2002-2018
  62. 5. Data Understanding -- TS, Log10 Scale Astra International Stock

    Price 2002-2018
  63. Data Splitting and Leakage

  64. Accuracy: 100% Training data Future data Accuracy: 5% 6. Data

    Splitting and Leakage
  65. 6. Data Splitting and Leakage - We need to simulate

    future data from our current dataset. - This simulation helps us infer our model’s predictive power on future data. - Strictly speaking, it assumes a stable data distribution over time.
  66. Randomly split your training dataset into: • the set we

    use to fit our model, and • the set we use only to validate our model performance. 6. Data Splitting and Leakage
  67. Set of Hypothesis Fitted Model Model Performance Fitting Predict Results

    Predicted Output Train Test 6. Data Splitting and Leakage
  68. Accuracy: 100% Training data Test data Accuracy: 5% 6. Data

    Splitting and Leakage
  69. 6. Data Splitting and Leakage Data Leakage is the introduction

    of information about the target into the input data.
  70. 6. Data Splitting and Leakage

  71. 6. Data Splitting and Leakage

  72. 6. Data Splitting and Leakage Credit Score Credit Group Derived

    Prediction
  73. Data Preprocessing

  74. 7. Data Preprocessing We need to convert our data into

    a format consumable by the ML model.
  75. 7. Data Preprocessing - ML models can only accept numerical

    input. Example (translated): a Gender column with values Male, Male, Female is label-encoded (Male: 0, Female: 1) as 0, 0, 1.
  76. 7. Data Preprocessing - ML models can only accept numerical

    input. Example (translated): a Country column with values Indonesia, India, Malaysia is one-hot encoded as [1, 0, 0], [0, 1, 0], [0, 0, 1].
  77. 7. Data Preprocessing - Some ML models can only accept

    inputs with similar mean and variance.
  78. 7. Data Preprocessing - Some ML models can only accept

    inputs with similar mean and variance.
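A minimal z-score standardization sketch; in practice scikit-learn's `StandardScaler` does this, and the income values below are made up:

```python
import math

def standardize(values):
    """Z-score scaling: shift to mean 0, scale to unit variance."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

incomes = [30_000, 45_000, 60_000, 75_000, 90_000]
scaled = standardize(incomes)
print(round(scaled[-1], 4))  # largest income sits ~1.41 std above the mean
```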
  79. Feature Engineering

  80. 8. Feature Engineering We need to convert our data into

    more representative input to increase our model accuracy.
  81. 8. Feature Engineering Linear vs Quadratic Input for Regression

  82. 8. Feature Engineering Embedding with TSNE

  83. 8. Feature Engineering The effect of every feature engineering experiments

    needs to be validated on future data, or we can use cross validation.
  84. Fitting ML Model

  85. • If the model is too simple, the solution is

    biased and does not fit the data. • If the model is too complex then it is very sensitive to small changes in the data. 9. Fitting ML Model
  86. Underfitting (an overly simple model): High Bias, Low Variance,

    Low Complexity. Overfitting (your model fits both signal and noise): Low Bias, High Variance, High Complexity. 9. Fitting ML Model
  87. 9. Fitting ML Model - Every model has its own

    hyperparameters to tune. - We tune these hyperparameters using cross validation.
  88. 9. Fitting ML Model

  89. Cross Validation

  90. 10. Cross Validation - We need to verify our experiments

    with “future data simulation”. - Cross validation will simulate unavailable future data.
  91. Randomly split the data into k-folds, then 1. Pick one

    fold as a validation set 2. Fit the model to the other k-1 folds 3. Repeat those steps until you get k fitted models 4. Then, compute the CV score as the average of the k validation scores. 10. Cross Validation -- K-fold
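The four steps above can be sketched with scikit-learn; the Logistic Regression model and the synthetic data are stand-ins:

```python
from statistics import mean
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=150, random_state=0)

scores = []
# Steps 1-3: pick a validation fold, fit on the other k-1 folds, repeat k times
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[val_idx], y[val_idx]))

# Step 4: the CV score is the average over the k validation folds
print(round(mean(scores), 3))
```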
  92. CV Score is your model performance. 1. Select the best

    model, i.e. the combination of hyperparameters with the best CV score. 2. Then, build your final model by fitting the selected model to the whole training data. 10. Cross Validation
  93. If we repeat the process of randomly splitting the sample

    set into two parts, we will get a somewhat different estimate for the test MSE 10. Cross Validation
  94. 10. Cross Validation

  95. 10. Cross Validation

  96. 10. Cross Validation - Every experiment should be validated with

    cross validation: - Model hyperparameters - Preprocessing - Feature Engineering - Every variable
  97. Offline Metrics Analysis

  98. 11. Offline Metrics Analysis - Remember our Objective Metrics, now

    we need to validate them. - There might be some tradeoffs between metrics.
  99. 11. Offline Metrics Analysis Let’s revisit our Netflix Recommendation System

    case: - We will focus on Diversity Metrics vs Accuracy Metrics. - More diverse recommendations will increase Netflix CTR - More accurate recommendations will increase Netflix CTR - Diversity and Accuracy are negatively correlated.
  100. 11. Offline Metrics Analysis

  101. 11. Offline Metrics Analysis

  102. Ablation and Error Analysis

  103. 12. Error Analysis - We want to know the source

    of error in our prediction. - Thus we can improve our model accuracy by making a feature which represents the error.
  104. 12. Error Analysis

  105. 12. Error Analysis

  106. 12. Ablation Analysis - We want to measure the contribution

    of every additional complexity to model accuracy. - If the additional complexity does not improve model accuracy, then we drop that experiment.
  107. 12. Ablation Analysis Recommendation System PACMANN AI (RMSE): -

    SVD-like: 0.88 - SVD-like, user-bias, item-bias: 0.85 - SVD-like, user-bias, item-bias, time-bias: 0.83 - Neural Collaborative Filtering (NBF): 0.87 - NBF - WARP: 0.87 - NBF - BPR: 0.87
  108. 12. Ablation Analysis

  109. Output Postprocessing

  110. 13. Output Postprocessing - Probability from a classifier is only

    an approximation of the real probability. For example, a 90% "credit default" probability from a classifier does not imply that 9 out of 10 such cases will default.
  111. 13. Output Postprocessing - This probability bias very often arises

    because the ML classifier does not approximate probability in its model. For example, Random Forest averages class votes, and SVM computes the distance from the separating line.
  112. 13. Output Postprocessing

  113. 13. Output Postprocessing We need to calibrate the probability with

    Platt Scaling
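A sketch of Platt scaling via scikit-learn's `CalibratedClassifierCV` (`method="sigmoid"` is Platt scaling); the Random Forest base model and the synthetic data are illustrative:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Wrap the uncalibrated model; a sigmoid is fit on held-out folds
# to map raw scores to calibrated probabilities
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    method="sigmoid",
    cv=3,
)
calibrated.fit(X_train, y_train)
probs = calibrated.predict_proba(X_test)
print(round(float(probs[0].sum()), 6))  # each row is a valid distribution
```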
  114. Model Explanation

  115. 14. Model Explanation - Most of the time, if you

    are working with business stakeholders, they will ask you for the reasoning behind the model’s output. - This problem is related to ML Interpretability
  116. 14. Model Explanation

  117. 14. Model Explanation - Most interpretable model for regression: -

    Linear Regression - Most interpretable model for classification: - Logistic Regression with Weight of Evidence
  118. 14. Model Explanation

  119. A/B Testing

  120. 15. A/B Testing - Now we have a working model

    and we want to deploy the system. - We want to measure the effect of the model on our objective metrics in an online fashion.
  121. Population Sample Sample A Sample B We separate these

    samples into two sets 15. A/B Testing
  122. Sample A Sample B 15. A/B Testing

  123. Sample A Sample B New ML Model Treatment 15.

    A/B Testing
  124. 15. A/B Testing
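One common way to read an experiment like this (not prescribed by the deck) is a two-proportion z-test on conversion rates; the counts below are made up:

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z-statistic for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical experiment: control (A) vs the new ML model (B)
z = two_proportion_z(conv_a=200, n_a=2000, conv_b=260, n_b=2000)
print(round(z, 2))  # -> 2.97; |z| > 1.96 means significant at the 5% level
```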

  125. Reproducible Research

  126. 16. Reproducible Research - We want to have a reproducible

    model - Gives the same output every time it is trained. - We need to find the sources of variation in the model.
  127. 16. Reproducible Research - Source of variation: - Data: -

    Data definition - Data preprocessing - Feature Engineering - Model: - Different hyperparameter - Source code - Environment
  128. 16. Reproducible Research - Partial solution: Build 1 yaml for

    every source of variation.
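A sketch of that idea. The slide suggests YAML (e.g. PyYAML's `safe_dump`/`safe_load` would work the same way), but stdlib `json` is used here so the sketch runs without extra packages; every key and value is a hypothetical example:

```python
import json

# One config file captures every source of variation: data, features, model
config = {
    "data": {"definition": "loans_2019.csv", "preprocessing": "standardize"},
    "features": ["income_log", "age_bucket"],
    "model": {"type": "logistic_regression", "C": 1.0, "random_state": 42},
}

with open("experiment.json", "w") as f:
    json.dump(config, f, indent=2)

# Reload the exact same settings to rerun the experiment reproducibly
with open("experiment.json") as f:
    loaded = json.load(f)
print(loaded["model"]["random_state"])  # -> 42
```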
  129. 16. Reproducible Research

  130. Deployment and Retraining

  131. 17. Deployment and Retraining - We have already validated the model;

    now we want to deploy it into our system - There are two kinds of ML serving: Online and Offline.
  132. 17. Deployment and Retraining - Online: - Train your machine

    learning model - Build preprocessing and feature engineering pipelines that can be hit with new data. - Deploy to your server as an API - Offline: - Train your model - Predict your data, write the predictions into a DB
  133. We believe everyone can be a Data Scientist