
Machine Learning Workflow

Pacmann AI
October 14, 2019


Transcript

  1. Table of Contents: 1. Background 2. Business Understanding 3. Objective Metrics 4. Data Selection and Cleansing 5. Data Understanding 6. Data Splitting and Leakage 7. Data Preprocessing 8. Feature Engineering
  2. Table of Contents (continued): 9. Fitting Machine Learning Models 10. Cross Validation 11. Offline Metrics Analysis 12. Ablation and Error Analysis 13. Output Postprocessing 14. Model Explanation 15. A/B Testing 16. Reproducible Research and Models 17. Model Deployment and Retraining
  3. 1. Background. New Data Scientist in a Company. Characteristics: - Doesn't know where to start. - Has so much energy, wants to solve all business problems with SOTA ML. - Only has Math/Stats/CS/Business skills.
  4. 1. Background. Company X. Characteristics: - Wants to hire a data scientist to solve business problems with ML.
  5. 1. Background. New DS and Company X. Problems: - Company X doesn't have any data. - New DS needs to build an ML system.
  7. 1. Background. New DS and Company X. Problems: - New DS needs to build an ML system. - New DS doesn't understand the business process and doesn't know what kind of ML services need to be built for the company.
  11. 1. Background. New DS and Company X. Problems: - Company X has messy data. - New DS can only work if the data is clean. - New DS doesn't understand how to clean the data.
  13. 1. Background. New DS and Company X. Problems: - New DS splits train-test data. - Accuracy is 98%. - There is data leakage.
  15. 1. Background. New DS and Company X. Problems: - New DS builds an ML model. - Fresh grad. - Background in Stats. - Only fits, never predicts.
  17. 1. Background. New DS and Company X. Problems: - New DS builds an ML model. - Builds a Decision Tree. - Accuracy is 100% in training.
  19. 1. Background. New DS and Company X. Problems: - New DS builds ML models. - Uses Kaggle-style modeling. - Ensembles 100 models.
  21. 1. Background. New DS and Company X. Problems: - New DS builds ML models. - 80% accuracy offline.
  23. 1. Background. New DS and Company X. Problems: - New DS builds ML models. - Company X asks for prediction explanations. - New DS builds the model with Deep Learning.
  25. 1. Background. New DS and Company X. Problems: - New DS builds ML models. - His code is not reproducible.
  27. 2. Business Understanding - It is important to understand the data-generating process, user journey, and business process (BP) in a company. - You can create ML services by automating any process in the BP: - Verification process - Identification process - Scoring/estimation process - etc.
  28. 2. Business Understanding - Your goal is to make ML services with high ROI. - You show how “smart” you are by making usable ML services.
  29. 3. Objective Metrics - You need to define an objective metric for every ML service that you make. - Be SKEPTICAL about your objective metrics. - Every objective metric needs to be correlated with a business metric.
  30. 3. Objective Metrics Netflix Recommendation System case: - Objective: - We want to recommend movies to users. - User journey: - Every user gives ratings for some movies.
  31. 3. Objective Metrics Netflix Recommendation System case: - Objective: - We want to predict the rating of every movie for every user. - Objective metric: - RMSE, as a regression problem.
  32. 3. Objective Metrics Netflix Recommendation System case: - We don't really care about RMSE. - Case 1: Movie 1 has rating 4, predicted rating 5. - Case 2: Movie 2 has rating 2, predicted rating 4. - RMSE penalizes both, but only Case 2 leads to a bad recommendation: an item the user dislikes looks like one worth recommending.
  33. 3. Objective Metrics Netflix Recommendation System case: - We care about the top N recommendations. - Change the objective metric to precision-recall, treating it as an information retrieval problem. - Change the objective metric to top@K precision-recall. - Decision-support metrics: ROC AUC, Breese score, later precision/recall. - Where error meets decision support/user experience: “reversals”. - User-centered metrics: coverage, user retention, recommendation uptake, satisfaction.
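
To make top@K concrete, here is a minimal, hypothetical sketch of computing precision and recall at K for one user, assuming we already have a ranked list of recommended item ids and the set of items the user actually liked (both invented for illustration):

```python
def precision_recall_at_k(ranked_items, relevant_items, k):
    """ranked_items: item ids sorted by predicted score (best first).
    relevant_items: set of item ids the user actually liked."""
    top_k = ranked_items[:k]
    hits = len(set(top_k) & set(relevant_items))
    precision = hits / k
    recall = hits / len(relevant_items) if relevant_items else 0.0
    return precision, recall

# Hypothetical example: the user liked items {1, 5, 9}; we recommend [5, 2, 9, 7, 1].
print(precision_recall_at_k([5, 2, 9, 7, 1], {1, 5, 9}, k=3))  # (0.67, 0.67)
```
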
  34. 4. Data Selection and Cleansing Data in your daily job: - Needs to be cleaned - Needs to be joined - Needs to be selected for a given service
  35. 4. Data Selection and Cleansing Fintech case: - People are afraid to invest in fintech because they have never experienced a crisis. - Bank Perkreditan Rakyat has data from before and after the financial crisis. - We need to build a credit scoring model with 1996-1999 data to simulate a financial crisis.
  36. 5. Data Understanding Why we need to understand our data: - We need to find patterns which might affect our model/objective metrics. - We need to find anomalies which will decrease our model accuracy. - In the next section, we can generate new variables from this understanding to make it easier for our model to learn and fit.
  37. 5. Data Understanding -- Case Analysis • Self-reported candidates are more likely to post when they are accepted. • They are likely to post results for multiple applications. • More successful candidates are likely to post their personal numbers. • Ostensibly, applicants who engage heavily with an online forum are far more serious about their application than the entire test-taking pool, leading to better results/numbers. Source: https://debarghyadas.com/writes/the-grad-school-statistics-we-never-had/
  38. 6. Data Splitting and Leakage - We need to simulate future data from our current dataset. - This simulation can help us infer our model's predictive power on future data. - Strictly speaking, it assumes a stable data distribution between the current data and future data.
  39. 6. Data Splitting and Leakage Randomly split your dataset into: • the set we use to fit our model, and • the set we use only to validate our model performance, that's it.
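
A minimal sketch of this random split with scikit-learn's train_test_split, on synthetic data (array names and sizes are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows, 5 features, binary target.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# Hold out 20% of the rows; that set is used only to validate model performance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)  # (800, 5) (200, 5)
```
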
  40. 6. Data Splitting and Leakage [Diagram: the train set is used to fit a model from the set of hypotheses; the fitted model predicts on the test set, and the predicted output is compared with the test results to measure model performance.]
  41. 6. Data Splitting and Leakage Data leakage is the introduction of information about the target into the input data.
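
A small synthetic demonstration of target leakage: one input column is (almost) the target itself, e.g. a value that is only recorded after the outcome is known, and the offline accuracy looks suspiciously perfect. Everything here is invented for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + rng.normal(scale=2.0, size=1000) > 0).astype(int)  # noisy target

# Leaky feature: essentially the target in disguise (known only after the outcome).
X_leaky = np.column_stack([X, y + rng.normal(scale=0.01, size=1000)])

for name, data in [("clean", X), ("leaky", X_leaky)]:
    X_tr, X_te, y_tr, y_te = train_test_split(data, y, random_state=0)
    acc = RandomForestClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
    print(name, round(acc, 3))  # the leaky version looks almost perfect
```
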
  42. 7. Data Preprocessing - ML models can only accept numerical input. Example: a "Jenis Kelamin" (gender) column with values Pria, Pria, Wanita is encoded with the mapping Pria: 0, Wanita: 1, giving 0, 0, 1.
  43. 7. Data Preprocessing - ML models can only accept numerical input. Example: a "Negara" (country) column with values Indonesia, India, Malaysia is one-hot encoded into three 0/1 columns (Indonesia, India, Malaysia), giving the rows 1 0 0, 0 1 0, and 0 0 1.
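
A minimal sketch of both encodings from the two slides above, using pandas (column values taken from the slides):

```python
import pandas as pd

df = pd.DataFrame({
    "Jenis Kelamin": ["Pria", "Pria", "Wanita"],
    "Negara": ["Indonesia", "India", "Malaysia"],
})

# Binary column -> simple 0/1 mapping (Pria: 0, Wanita: 1), as on the slide.
df["Jenis Kelamin"] = df["Jenis Kelamin"].map({"Pria": 0, "Wanita": 1})

# Multi-category column -> one-hot encoding, one 0/1 column per country.
df = pd.get_dummies(df, columns=["Negara"], dtype=int)
print(df)
```
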
  44. 7. Data Preprocessing - Some ML models can only accept inputs with similar mean and variance.
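
A minimal sketch of standardization with scikit-learn, using made-up numbers on very different scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales (e.g. age in years, income in IDR).
X = np.array([[25, 4_000_000],
              [40, 12_000_000],
              [31, 7_500_000]], dtype=float)

X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0))  # ~0 for every column
print(X_scaled.std(axis=0))   # ~1 for every column
```
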
  46. 8. Feature Engineering We need to convert our data into more representative inputs to increase our model accuracy.
  47. 8. Feature Engineering The effect of every feature engineering experiment needs to be validated on future data, or we can do cross validation.
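
A hedged sketch of validating one feature engineering experiment with cross validation, on synthetic data where the target depends on an interaction that a linear model cannot learn from the raw columns:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)        # target depends on an interaction

baseline = cross_val_score(LogisticRegression(), X, y, cv=5).mean()

# Engineered feature: the interaction term x0 * x1.
X_fe = np.column_stack([X, X[:, 0] * X[:, 1]])
with_feature = cross_val_score(LogisticRegression(), X_fe, y, cv=5).mean()

print(f"baseline CV accuracy:       {baseline:.3f}")
print(f"with engineered feature:    {with_feature:.3f}")
```
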
  48. 9. Fitting ML Model • If the model is too simple, the solution is biased and does not fit the data. • If the model is too complex, then it is very sensitive to small changes in the data.
  49. 9. Fitting ML Model - Underfitting (an overly simple model): high bias, low variance, low complexity. - Overfitting (your model fits signal and noise): low bias, high variance, high complexity.
  50. 9. Fitting ML Model - Every model has its own hyperparameters to tune. - We tune these hyperparameters using cross validation.
  51. 10. Cross Validation - We need to verify our experiments with a “future data simulation”. - Cross validation will simulate the unavailable future data.
  52. 10. Cross Validation -- K-fold Randomly split the data into k folds, then: 1. Pick one fold as a validation set. 2. Fit the model to the other k-1 folds. 3. Repeat those steps until you get k fitted models. 4. Then compute the CV score as the average of the k validation scores.
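
In other words, CV score = (1/k) * sum of the k validation scores. A minimal sketch with scikit-learn, using a built-in dataset purely for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# k = 5 folds; each fold is used exactly once as the validation set.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=cv)

print(scores)          # one validation score per fold
print(scores.mean())   # the CV score: average of the k fold scores
```
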
  53. 10. Cross Validation The CV score is your model performance estimate. 1. Select the best model, i.e. the combination of hyperparameters with the best CV score. 2. Then build your final model by fitting the selected model on the whole training data.
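
A sketch of both steps with scikit-learn's GridSearchCV (the model and grid are arbitrary examples): the search picks the hyperparameter combination with the best CV score, and refit=True then retrains that model on the whole training data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

pipeline = make_pipeline(StandardScaler(), SVC())
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01]}

search = GridSearchCV(pipeline, param_grid, cv=5, refit=True)
search.fit(X, y)                       # best hyperparams by CV, then refit on all data
print(search.best_params_, search.best_score_)
final_model = search.best_estimator_   # ready to predict on new data
```
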
  54. 10. Cross Validation If we repeat the process of randomly splitting the sample set into two parts, we will get a somewhat different estimate for the test MSE.
  55. 10. Cross Validation - Every experiment should be validated with cross validation: - Model hyperparameters - Preprocessing - Feature engineering - Every variable
  56. 11. Offline Metrics Analysis - Remember our objective metrics; now we need to validate them. - There might be some tradeoffs between metrics.
  57. 11. Offline Metrics Analysis Let's revisit our Netflix Recommendation System case: - We will focus on diversity metrics vs. accuracy metrics. - More diverse recommendations will increase Netflix CTR. - More accurate recommendations will increase Netflix CTR. - Diversity and accuracy are negatively correlated.
  58. 12. Error Analysis - We want to know the source of error in our predictions. - Thus we can improve our model accuracy by making a feature which represents that error.
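
One common way to hunt for the source of error is to slice prediction error by a candidate variable; a synthetic sketch (the "segment" column and its effect are invented) where one segment clearly drives the error:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": rng.normal(size=2000),
    "segment": rng.choice(["A", "B", "C"], size=2000),
})
# Segment C behaves differently, and the model's feature set does not capture it.
df["y"] = df["x"] + np.where(df["segment"] == "C", 3.0, 0.0) + rng.normal(scale=0.1, size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(df[["x"]], df["y"], random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)

# Slice the test error by segment to locate where the model is failing.
errors = pd.DataFrame({"segment": df.loc[X_te.index, "segment"],
                       "abs_error": np.abs(y_te - model.predict(X_te))})
print(errors.groupby("segment")["abs_error"].mean())  # segment C stands out
```
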
  59. 12. Ablation Analysis - We want to measure the contribution of every additional piece of complexity to model accuracy. - If the additional complexity does not affect model accuracy, then we drop that experiment.
  60. 12. Ablation Analysis Recommendation System (PACMANN AI), RMSE: - SVD-like: 0.88 - SVD-like + user-bias + item-bias: 0.85 - SVD-like + user-bias + item-bias + time-bias: 0.83 - Neural Collaborative Filtering (NCF): 0.87 - NCF - WARP: 0.87 - NCF - BPR: 0.87
  61. 13. Output Postprocessing - The probability from a classifier is only an approximation of the real probability. For example, a 90% "credit default" probability from a classifier does not imply that 9 out of 10 such cases will default.
  62. 13. Output Postprocessing - This bias in classifier probabilities is very often the result of ML classifiers that do not approximate a probability in their model. For example, Random Forest averages the predicted class across trees, and SVM computes the distance from the separating boundary.
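
A hedged sketch of one standard fix, probability calibration, using scikit-learn's CalibratedClassifierCV on synthetic data; the model choice and numbers are illustrative only:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

raw = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Remap the classifier's scores to better-calibrated probabilities via isotonic regression.
calibrated = CalibratedClassifierCV(RandomForestClassifier(random_state=0),
                                    method="isotonic", cv=5).fit(X_tr, y_tr)

# Lower Brier score = predicted probabilities closer to observed frequencies.
print("raw       ", brier_score_loss(y_te, raw.predict_proba(X_te)[:, 1]))
print("calibrated", brier_score_loss(y_te, calibrated.predict_proba(X_te)[:, 1]))
```
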
  63. 14. Model Explanation - Most of the time, if you are working with business stakeholders, you will be asked by business people to explain the reasoning behind the model output. - This problem is related to ML interpretability.
  64. 14. Model Explanation - Most interpretable model for regression: Linear Regression. - Most interpretable model for classification: Logistic Regression with Weight of Evidence.
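
The deck names Logistic Regression with Weight of Evidence; as a simpler hedged sketch of why a logistic model is easy to explain, here is plain logistic regression on a built-in dataset, reading its coefficients as odds ratios (no WoE encoding shown):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
model.fit(data.data, data.target)

# Each coefficient is the change in log-odds per one standard deviation of a feature,
# so exp(coef) is an odds ratio that can be explained to business stakeholders.
coefs = model.named_steps["logisticregression"].coef_[0]
top = sorted(zip(data.feature_names, coefs), key=lambda t: -abs(t[1]))[:5]
for name, coef in top:
    print(f"{name:25s} coef={coef:+.2f}  odds ratio={np.exp(coef):.2f}")
```
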
  65. 15. A/B Testing - Now we have a working model and we want to deploy the system. - We want to measure the effect of the model on our objective metrics in an online fashion.
  66. 15. A/B Testing [Diagram: from the population we draw a sample and separate it into two sets, Sample A and Sample B.]
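
A minimal sketch of comparing the two samples on a click-through metric with a two-proportion z-test from statsmodels; the counts below are made up:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical online results: clicks / impressions for the two variants.
clicks      = [310, 355]          # Sample A (control), Sample B (new model)
impressions = [10_000, 10_000]

stat, p_value = proportions_ztest(count=clicks, nobs=impressions)
print(f"z = {stat:.2f}, p = {p_value:.4f}")
# If p is below the chosen significance level, the CTR difference between the
# two samples is unlikely to be due to chance alone.
```
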
  67. 16. Reproducible Research - We want to have a reproducible model: - It gives the same output every time it is trained. - We need to find the sources of variation in the model.
  68. 16. Reproducible Research - Sources of variation: - Data: - Data definition - Data preprocessing - Feature engineering - Model: - Different hyperparameters - Source code - Environment
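
A minimal sketch of pinning the random seeds that usually drive model-side variation (library-specific seeds such as torch.manual_seed follow the same pattern); data-, code-, and environment-side variation still needs versioning of data, source code, and dependencies:

```python
import os
import random

import numpy as np

SEED = 42

# Pin every source of randomness we control so retraining gives the same result
# (data shuffling, subsampling, weight initialization, ...).
os.environ["PYTHONHASHSEED"] = str(SEED)
random.seed(SEED)
np.random.seed(SEED)

# Also pass the same seed to every library object that accepts one, e.g.
# train_test_split(X, y, random_state=SEED) or RandomForestClassifier(random_state=SEED).
```
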
  69. 17. Deployment and Retraining - We have already validated the model; now we want to deploy it into our system. - There are two kinds of ML serving: online and offline.
  70. 17. Deployment and Retraining - Online: - Train your machine learning model. - Build preprocessing and feature engineering that can be hit by new incoming data. - Deploy to your server as an API. - Offline: - Train your model. - Predict on your data and write the results into a DB.
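
A hedged sketch of the online flavor: a previously fitted scikit-learn pipeline (preprocessing + feature engineering + model, saved here to a hypothetical model.pkl) served behind a small Flask API. The offline flavor runs the same pipeline on a schedule and writes the model.predict results into the database instead:

```python
import pickle

import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical artifact: a fitted sklearn Pipeline saved earlier with
# pickle.dump(pipeline, open("model.pkl", "wb")).
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Online serving: new data hits the same preprocessing + model through an API.
    rows = pd.DataFrame(request.get_json())
    return jsonify(predictions=model.predict(rows).tolist())

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```
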