Seminar #51 - Machine Learning - How Variable Importance Works

In Exploratory, when you build machine learning or statistical learning models, you will see a tab called 'Importance' that shows which variables are more important for predicting the values of a given target variable.

In this seminar, Kan will explain how variable importance is calculated and how to interpret the result. He will also introduce a method called 'Boruta', which is used to address the challenges brought by the randomness of Random Forest models.

Kan Nishida

July 07, 2021

Transcript

  1. Speaker: Kan Nishida (@KanAugust), CEO/co-founder, Exploratory

    Summary: In Spring 2016, Kan launched Exploratory, Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of product development at Oracle, leading teams to build various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, Big Data, etc. While at Oracle, Kan also provided training and consulting services to help organizations transform with data.
  2. Exploratory: Modern & Simple UI

    4 Questions • Data Access • Data Wrangling • Visualization • Analytics (Statistics / Machine Learning) • Communication (Dashboard, Note, Slides)
  3. Agenda • Prediction Model • Analytics Grammar • Variable Importance

    • Statistical Learning - Variable Importance vs. Coefficient • Statistical Learning - Significance of Relationship • Random Forest - Boruta for Variable Importance
  4. Agenda • Prediction Model • Analytics Grammar • Variable Importance

    • Statistical Learning - Variable Importance vs. Coefficient • Statistical Learning - Significance of Relationship • Random Forest - Boruta for Variable Importance
  5. Prediction Model • We have data that have answers (numerical values or labels) for what we are interested in (e.g. Sales, Conversion, Attrition, etc.).

    • Use algorithms to detect relationships and patterns that can be used to identify the answers, and model them as formulas or rules. • Use the models to predict the answers for the data with no answers.
  6. Use Case • We know who converted to paid customers and who didn’t convert in the past.

    • We have customer attribute data for those who converted and those who didn’t convert. • Based on these data, we want to predict which of the current leads will convert.
  7. Build a Prediction Model (Algorithm → Model). The Conversion column is the Target Variable and holds the answers.

    Columns: Conversion, Age, Time, Country, Industry. Rows: TRUE 60 120 Japan Ad; FALSE 45 55 US Medical; FALSE 52 20 US Media; TRUE 48 140 Japan Ad; TRUE 53 80 UK Bank; FALSE 35 30 Japan Media.
  8. A model is a definition of a pattern the algorithm has captured in the data (Algorithm → Model).

    Columns: Conversion, Age, Time, Country, Industry. Rows: TRUE 60 120 Japan Ad; FALSE 45 55 US Medical; FALSE 52 20 US Media; TRUE 48 140 Japan Ad; TRUE 53 80 UK Bank; FALSE 35 30 Japan Media.
  9. Predict (Algorithm → Model). Columns in every table: Conversion, Age, Time, Country, Industry.

    Training data: TRUE 60 120 Japan Ad; FALSE 45 55 US Medical; FALSE 52 20 US Media; TRUE 48 140 Japan Ad; TRUE 53 80 UK Bank; FALSE 35 30 Japan Media. Data with No Answers: ? 25 120 Japan Ad; ? 23 55 US Media; ? 40 150 US Ad. Predicted: TRUE 25 120 Japan Ad; FALSE 23 55 US Media; FALSE 40 150 US Ad.
  10. All models are approximations of the real world.

    “All models are wrong, but some are useful.” George Box, British Statistician
  11. Insight: Not just for prediction, we can also use a model (Algorithm → Model) to learn a lot about the patterns in data.

    • Which variables have a stronger relationship with the target variable? • How are they related? • Are the relationships significant? • What is the quality if we use this model to predict?
  12. Agenda • Prediction Model with Machine Learning / Statistical Learning

    • Analytics Grammar • Variable Importance • Statistical Learning - Variable Importance vs. Coefficient • Statistical Learning - Significance of Relationship • Random Forest - Boruta for Variable Importance
  13. Model types by the data type of the Target Variable:

    • Numeric (Regression Model): Linear Regression (Statistical Learning), Random Forest / XGBoost (Machine Learning) • TRUE/FALSE (Classification Model): Logistic Regression (Statistical Learning), Random Forest / XGBoost (Machine Learning) • TRUE/FALSE + Time (Survival Model): Cox Regression (Statistical Learning), Survival Forest (Machine Learning)
  14. Each algorithm produces its own output, with various ways of interpretation:

    • Numeric (Regression Model): Linear Regression, Random Forest / XGBoost • TRUE/FALSE (Classification Model): Logistic Regression, Random Forest / XGBoost • TRUE/FALSE + Time (Survival Model): Cox Regression, Survival Forest
  15. • The main difference among the various prediction models is what kinds of patterns they can capture in the data.

    • We want to find out the patterns or relationships in the data that the algorithms have found. • Regardless of which algorithm we use, can’t we have a standard framework to understand such patterns?
  16. Analytics Grammar: A common framework for understanding the patterns and relationships in data.

    • Numeric (Regression Model): Linear Regression, Random Forest / XGBoost • TRUE/FALSE (Classification Model): Logistic Regression, Random Forest / XGBoost • TRUE/FALSE + Time (Survival Model): Cox Regression, Survival Forest
  17. Variable Importance • Which variables are more important for predicting the target variable?

    • Which variables have a stronger relationship with the target variable?
  18. Prediction by Variable: How does the target variable change when a given predictor variable changes?

    • Target Variable is Numerical: the Y-axis shows the predicted numerical value of the target variable. • Target Variable is Logical: the Y-axis shows the predicted probability of the target variable being TRUE.
  19. Coefficient (only for Statistical Learning models, e.g. Linear Regression, Logistic Regression, etc.) • How much does the target value change when a given predictor variable changes by one point?

    • Is the change statistically significant? (Hypothesis Test)
  20. Evaluation • Statistical Significance (only for Statistical Learning models): Check whether the relationship the prediction model captured is significant.

    • Prediction Quality: Check how well a given prediction model fits reality.
  21. Overview table of the framework:

    • Data Type: Numeric, TRUE/FALSE, TRUE/FALSE + Time • Model Type: Statistical Learning, Machine Learning • Algorithm: Linear Regression, Logistic Regression, Cox Regression (Statistical Learning); Random Forest / XGBoost, Survival Forest (Machine Learning) • Relationship and Evaluation outputs: Variable Importance, Prediction by Variable, Survival Curve, Coefficient (Slope, Odds Ratio, Hazard Ratio), Significance, R Squared, RMSE, AUC, F Score
  22. Overview table of the framework:

    • Data Type: Numeric, TRUE/FALSE, TRUE/FALSE + Time • Model Type: Statistical Learning, Machine Learning • Algorithm: Linear Regression, Logistic Regression, Cox Regression (Statistical Learning); Random Forest / XGBoost, Survival Forest (Machine Learning) • Relationship and Evaluation outputs: Variable Importance, Prediction by Variable, Survival Curve, Coefficient (Slope, Odds Ratio, Hazard Ratio), Significance, R Squared, RMSE, AUC, F Score
  23. Agenda • Prediction Model • Analytics Grammar • Variable Importance

    • Statistical Learning - Variable Importance vs. Coefficient • Statistical Learning - Significance of Relationship • Random Forest - Boruta for Variable Importance
  24. How is ‘variable importance’ calculated? • Build a model after removing one of the predictor variables and evaluate how much the quality of the model degrades compared to the model with all the predictor variables.

    • Repeat for every predictor variable. • Compare the degree of degradation among all the predictor variables.
  25. If it’s a linear regression model, we can evaluate the quality of the model by calculating the difference between the actual values and the predicted values.
  26. Build a Prediction Model (Algorithm → Model).

    Training data (Conversion, Age, Time, Industry): TRUE 60 120 Ad; FALSE 45 55 Medical; FALSE 52 20 Media; TRUE 48 140 Ad; TRUE 53 80 Bank; FALSE 35 30 Media.
  27. Predict for the training data.

    Predicted (Conversion, Age, Time, Industry): FALSE 60 120 Ad; FALSE 45 55 Medical; FALSE 52 20 Media; TRUE 48 140 Ad; TRUE 53 80 Bank; FALSE 35 30 Media. Actual: TRUE 60 120 Ad; FALSE 45 55 Medical; FALSE 52 20 Media; TRUE 48 140 Ad; TRUE 53 80 Bank; FALSE 35 30 Media.
  28. Evaluate the quality by matching with the existing answers. Prediction Quality: 90

    Predicted (Conversion, Age, Time, Industry): FALSE 60 120 Ad; FALSE 45 55 Medical; FALSE 52 20 Media; TRUE 48 140 Ad; TRUE 53 80 Bank; FALSE 35 30 Media. Actual: TRUE 60 120 Ad; FALSE 45 55 Medical; FALSE 52 20 Media; TRUE 48 140 Ad; TRUE 53 80 Bank; FALSE 35 30 Media.
  29. Baseline Prediction Quality: 90

    Predicted (Conversion, Age, Time, Industry): FALSE 60 120 Ad; FALSE 45 55 Medical; FALSE 52 20 Media; TRUE 48 140 Ad; TRUE 53 80 Bank; FALSE 35 30 Media. Actual: TRUE 60 120 Ad; FALSE 45 55 Medical; FALSE 52 20 Media; TRUE 48 140 Ad; TRUE 53 80 Bank; FALSE 35 30 Media.
  30. Build a Model without the ‘Time’ variable.

    Training data (Conversion, Age, Time, Industry): TRUE 60 120 Ad; FALSE 45 55 Medical; FALSE 52 20 Media; TRUE 48 140 Ad; TRUE 53 80 Bank; FALSE 35 30 Media.
  31. Predict for the training data.

    Predicted (Conversion, Age, Time, Industry): TRUE 60 120 Ad; TRUE 45 55 Medical; FALSE 52 20 Media; TRUE 48 140 Ad; TRUE 53 80 Bank; TRUE 35 30 Media. Actual: TRUE 60 120 Ad; FALSE 45 55 Medical; FALSE 52 20 Media; TRUE 48 140 Ad; TRUE 53 80 Bank; FALSE 35 30 Media.
  32. Evaluate the quality by matching with the existing answers. Prediction Quality: 80

    Predicted (Conversion, Age, Time, Industry): TRUE 60 120 Ad; TRUE 45 55 Medical; FALSE 52 20 Media; TRUE 48 140 Ad; TRUE 53 80 Bank; TRUE 35 30 Media. Actual: TRUE 60 120 Ad; FALSE 45 55 Medical; FALSE 52 20 Media; TRUE 48 140 Ad; TRUE 53 80 Bank; FALSE 35 30 Media.
  33. Variable Importance: Time = 10. Without the ‘Time’ variable, the quality of the model degrades by 10 points from the baseline.
  34. Build a Model without the ‘Age’ variable.

    Training data (Conversion, Age, Time, Industry): TRUE 60 120 Ad; FALSE 45 55 Medical; FALSE 52 20 Media; TRUE 48 140 Ad; TRUE 53 80 Bank; FALSE 35 30 Media.
  35. Predict for the training data.

    Predicted (Conversion, Age, Time, Industry): TRUE 60 120 Ad; TRUE 45 55 Medical; FALSE 52 20 Media; TRUE 48 140 Ad; TRUE 53 80 Bank; TRUE 35 30 Media. Actual: TRUE 60 120 Ad; FALSE 45 55 Medical; FALSE 52 20 Media; TRUE 48 140 Ad; TRUE 53 80 Bank; FALSE 35 30 Media.
  36. Evaluate the quality by matching with the existing answers. Prediction Quality: 70

    Predicted (Conversion, Age, Time, Industry): TRUE 60 120 Ad; TRUE 45 55 Medical; FALSE 52 20 Media; TRUE 48 140 Ad; TRUE 53 80 Bank; TRUE 35 30 Media. Actual: TRUE 60 120 Ad; FALSE 45 55 Medical; FALSE 52 20 Media; TRUE 48 140 Ad; TRUE 53 80 Bank; FALSE 35 30 Media.
  37. Variable Importance: Time = 10, Age = 20. Without the ‘Age’ variable, the quality of the model degrades by 20 points from the baseline.
  38. Build a Model without the ‘Industry’ variable.

    Training data (Conversion, Age, Time, Industry): TRUE 60 120 Ad; FALSE 45 55 Medical; FALSE 52 20 Media; TRUE 48 140 Ad; TRUE 53 80 Bank; FALSE 35 30 Media.
  39. Predict for the training data.

    Predicted (Conversion, Age, Time, Industry): TRUE 60 120 Ad; TRUE 45 55 Medical; FALSE 52 20 Media; TRUE 48 140 Ad; TRUE 53 80 Bank; TRUE 35 30 Media. Actual: TRUE 60 120 Ad; FALSE 45 55 Medical; FALSE 52 20 Media; TRUE 48 140 Ad; TRUE 53 80 Bank; FALSE 35 30 Media.
  40. Evaluate the quality by matching with the existing answers. Prediction Quality: 20

    Predicted (Conversion, Age, Time, Industry): TRUE 60 120 Ad; TRUE 45 55 Medical; FALSE 52 20 Media; TRUE 48 140 Ad; TRUE 53 80 Bank; TRUE 35 30 Media. Actual: TRUE 60 120 Ad; FALSE 45 55 Medical; FALSE 52 20 Media; TRUE 48 140 Ad; TRUE 53 80 Bank; FALSE 35 30 Media.
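
The remove-one-variable procedure walked through in slides 24 to 40 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Exploratory's implementation: the nearest-centroid classifier (`fit_centroids`, `predict`) is a toy stand-in chosen only to keep the example self-contained, only the numeric Age and Time columns of the slide data are used, and quality is measured as accuracy on the training data, as in the slides. The scores it produces will not match the illustrative 90/80/70 figures above.

```python
from statistics import mean

# Training data from the slides: (Age, Time) and the Conversion answers.
ROWS = [(60, 120), (45, 55), (52, 20), (48, 140), (53, 80), (35, 30)]
LABELS = [True, False, False, True, True, False]
NAMES = ["Age", "Time"]


def fit_centroids(rows, labels):
    """Toy model (an assumption): the mean vector of each class."""
    centroids = {}
    for label in sorted(set(labels)):
        members = [r for r, l in zip(rows, labels) if l == label]
        centroids[label] = [mean(col) for col in zip(*members)]
    return centroids


def predict(centroids, row):
    """Return the class whose centroid is closest to the row."""
    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(row, centroids[label]))
    return min(centroids, key=squared_distance)


def quality(rows, labels):
    """Prediction quality: percentage of training rows predicted correctly."""
    model = fit_centroids(rows, labels)
    correct = sum(predict(model, r) == l for r, l in zip(rows, labels))
    return 100.0 * correct / len(rows)


def drop_column_importance(rows, labels, names):
    """Importance of a variable = baseline quality minus quality without it."""
    baseline = quality(rows, labels)
    importance = {}
    for i, name in enumerate(names):
        reduced = [r[:i] + r[i + 1:] for r in rows]  # remove column i
        importance[name] = baseline - quality(reduced, labels)
    return baseline, importance


baseline, importance = drop_column_importance(ROWS, LABELS, NAMES)
for name, score in sorted(importance.items(), key=lambda kv: -kv[1]):
    print(f"{name}: quality drops by {score:.1f} points from {baseline:.1f}")
```

Any model and any quality metric could be substituted for `fit_centroids` and `quality`; the comparison of each reduced model against the baseline is the part that corresponds to the slides.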
  41. Agenda • Prediction Model • Analytics Grammar • Variable Importance

    • Statistical Learning - Variable Importance vs. Coefficient • Statistical Learning - Significance of Relationship • Random Forest - Boruta for Variable Importance
  42. Overview table of the framework:

    • Data Type: Numeric, TRUE/FALSE, TRUE/FALSE + Time • Model Type: Statistical Learning, Machine Learning • Algorithm: Linear Regression, Logistic Regression, Cox Regression (Statistical Learning); Random Forest / XGBoost, Survival Forest (Machine Learning) • Relationship and Evaluation outputs: Variable Importance, Prediction by Variable, Survival Curve, Coefficient (Slope, Odds Ratio, Hazard Ratio), Significance, R Squared, RMSE, AUC, F Score
  43. The coefficient shows how much the target value would change when a given predictor variable changes by one point.
  44. Given that the units of the predictor variables are different, bigger coefficient values don’t necessarily mean a stronger relationship with the target variable.
  45. What ‘1 point’ means differs by variable: 1 point in Age is 1 year, 1 point in Job Level is 1 level, and 1 point in Job Role is a change of category (Executive → Sales Rep.).
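
To see why a larger coefficient does not imply a stronger relationship, the sketch below fits the same one-variable linear regression twice, once with age in years and once with age in months. The data values and the `slope` helper are made up for illustration and do not come from the seminar.

```python
def slope(xs, ys):
    """Least-squares slope of y on x for a single predictor."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical data: monthly income rises with age.
age_years = [25, 30, 35, 40, 45]
income = [3000, 3500, 4000, 4500, 5000]
age_months = [12 * a for a in age_years]  # same information, different unit

slope_per_year = slope(age_years, income)
slope_per_month = slope(age_months, income)
print(slope_per_year, slope_per_month)
```

The relationship is identical in both fits, yet the coefficient shrinks by a factor of 12 when the unit changes from years to months, which is why coefficients are not directly comparable across variables.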
  46. Agenda • Prediction Model • Analytics Grammar • Variable Importance

    • Statistical Learning - Variable Importance vs. Coefficient • Statistical Learning - Significance of Relationship • Random Forest - Boruta for Variable Importance
  47. “Shallow men believe in luck or in circumstance. Strong men believe in cause and effect.” Ralph Waldo Emerson
  48. Both Stock Option and Marital Status are associated with Attrition.
  49. Association: Stock Option and Marital Status are also associated with each other.
  50. By holding ‘Marital Status’ constant (e.g. ‘Married’), we can see if Attrition changes when Stock Option changes (1 → 2). Is there an effect?
  51. Or, by holding Stock Option constant, we can see if Attrition changes when Marital Status changes (Single → Married). Is there an effect?
  52. It turned out that, according to this model, the relationship between Stock Option and Attrition is not a direct relationship; rather, it runs through the relationship between Marital Status and Attrition.
  53. If we remove Marital Status, Stock Option now comes out as a significant and more important variable for predicting Attrition.
  54. Agenda • Prediction Model • Analytics Grammar • Variable Importance

    • Statistical Learning - Variable Importance vs. Coefficient • Statistical Learning - Significance of Relationship • Random Forest - Boruta for Variable Importance
  55. Decision Tree: It creates a series of questions to separate data into multiple groups so that each group will have similar values.

    Example questions: Income > 5500 (TRUE / FALSE), Overtime > 15 (TRUE / FALSE). Each resulting group is shown with its Ratio of TRUE (0% to 100%).
  56. 84

  57. Data

    Columns: Age, Working Years, Monthly Income, Job Level, Job Role, Attrition. Rows: 40 22 10424 1 Sales Executive TRUE; 33 13 4726 2 Laboratory Technician FALSE; 32 12 6694 3 Research Scientist FALSE; 28 8 8342 2 Manager TRUE; 24 6 7123 3 Sales Executive FALSE; 28 10 8722 3 Laboratory Technician FALSE; 43 20 4233 2 Research Scientist FALSE; 38 17 5512 2 Manager TRUE; 34 14 2500 1 Sales Executive FALSE; 29 9 3394 2 Research Scientist TRUE.
  58. Attrition is the Target Variable.

    Columns: Age, Working Years, Monthly Income, Job Level, Job Role, Attrition. Rows: 40 22 10424 1 Sales Executive TRUE; 33 13 4726 2 Laboratory Technician FALSE; 32 12 6694 3 Research Scientist FALSE; 28 8 8342 2 Manager TRUE; 24 6 7123 3 Sales Executive FALSE; 28 10 8722 3 Laboratory Technician FALSE; 43 20 4233 2 Research Scientist FALSE; 38 17 5512 2 Manager TRUE; 34 14 2500 1 Sales Executive FALSE; 29 9 3394 2 Research Scientist TRUE.
  59. Sample rows and columns randomly from the Data. The Attrition variable is always included.

    Sample 1 (Age, Monthly Income, Attrition): 40 8157 TRUE; 33 4456 FALSE; 32 3600 FALSE; 28 2500 TRUE. Sample 2 (Gender, Job Level, Working Years, Attrition): 28 1 8 FALSE; 43 3 22 FALSE; 38 2 13 TRUE. Sample 3 (Age, Job Role, Attrition): 33 Sales Executive TRUE; 32 Manager FALSE; 33 Sales Executive FALSE.
  60. Build a tree model for each sampled data set. Each tree is slightly different.

    Sample 1 (Age, Monthly Income, Attrition): 40 8157 TRUE; 33 4456 FALSE; 32 3600 FALSE; 28 2500 TRUE. Sample 2 (Gender, Job Level, Working Years, Attrition): 28 1 8 FALSE; 43 3 22 FALSE; 38 2 13 TRUE. Sample 3 (Age, Job Role, Attrition): 33 Sales Executive TRUE; 32 Manager FALSE; 33 Sales Executive FALSE.
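
The sampling step described in slides 59 and 60 can be sketched as below. It follows the simplified picture in the slides (one random set of rows and columns per tree, with Attrition always kept). The function name `sample_for_tree` and its parameters are hypothetical, and many Random Forest implementations instead pick candidate columns at each split rather than once per tree.

```python
import random

# A few rows from the slide data.
DATA = [
    {"Age": 40, "Working Years": 22, "Monthly Income": 10424,
     "Job Level": 1, "Job Role": "Sales Executive", "Attrition": True},
    {"Age": 33, "Working Years": 13, "Monthly Income": 4726,
     "Job Level": 2, "Job Role": "Laboratory Technician", "Attrition": False},
    {"Age": 32, "Working Years": 12, "Monthly Income": 6694,
     "Job Level": 3, "Job Role": "Research Scientist", "Attrition": False},
    {"Age": 28, "Working Years": 8, "Monthly Income": 8342,
     "Job Level": 2, "Job Role": "Manager", "Attrition": True},
]
PREDICTORS = ["Age", "Working Years", "Monthly Income", "Job Level", "Job Role"]
TARGET = "Attrition"


def sample_for_tree(rows, predictors, target, n_columns, rng):
    """Draw rows with replacement and a random subset of predictor columns."""
    columns = rng.sample(predictors, n_columns)
    drawn = [rng.choice(rows) for _ in rows]  # bootstrap sample of rows
    # The target column is always kept alongside the sampled predictors.
    return [{**{c: row[c] for c in columns}, target: row[target]} for row in drawn]


rng = random.Random(0)  # fixed seed only so the example is repeatable
samples = [sample_for_tree(DATA, PREDICTORS, TARGET, 2, rng) for _ in range(3)]
for sample in samples:
    print(list(sample[0].keys()))
```

Because each call draws different rows and columns, a tree trained on each sample sees a slightly different view of the data, which is the source of the run-to-run variation discussed next.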
  61. 1. Do all the predictor variables have meaningful effects on the target variable? Maybe some of the variables have nothing to do with it.

    2. Because of Random Forest’s randomness, the result is not guaranteed to be the same every time we build the same model.
  62. Random Sampling: Data → Sampling → Decision Tree → Vote → Result (repeated for each tree).

    Random sampling doesn’t guarantee that the result will be the same every time.
  63. Boruta • If there is randomness in the result, the suggested important variables might look important just by chance.

    • We can test whether they are consistently important with a statistical test.
  64. 99

  65. 100

  66. Building Random Forest models 20 times generates 20 values of the importance metric.

    The distribution of these 20 values is then visualized with a boxplot.
  67. Whether each variable is useful for predicting the target variable is decided based on the statistical test.
  68. • For each variable, count the number of times its variable importance score is better than that of the shadow variables.

    • Perform a hypothesis test on the counts and evaluate the statistical significance for each variable.
  69. Data

    Columns: Job Level, Working Years, Age. Rows: 1 25 27; 3 35 40; 2 40 41; 1 22 22; 1 33 35.
  70. Create Shadow variables. • Copy the original variables and shuffle the values randomly. • These randomly shuffled variables shouldn’t have any association with the target variable.

    Original (Job Level, Working Years, Age): 1 25 27; 3 35 40; 2 40 41; 1 22 22; 1 33 35. Shadow (Shadow Job Level, Shadow Working Years, Shadow Age): 1 22 41; 1 44 22; 1 35 35; 3 25 40; 2 40 27.
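
Creating the shadow variables can be sketched as follows. The helper name `add_shadow_columns` is hypothetical and the table is the small example from slide 69. A real Boruta run would then train a Random Forest on the original and shadow columns together, which is not shown here.

```python
import random

# Example data from the slides, stored as column name -> values.
DATA = {
    "Job Level": [1, 3, 2, 1, 1],
    "Working Years": [25, 35, 40, 22, 33],
    "Age": [27, 40, 41, 22, 35],
}


def add_shadow_columns(table, rng):
    """Return a copy of the table with a shuffled 'Shadow' copy of each column."""
    result = {name: list(values) for name, values in table.items()}
    for name, values in table.items():
        shadow = list(values)
        rng.shuffle(shadow)  # breaks any link between this column and the target
        result["Shadow " + name] = shadow
    return result


with_shadows = add_shadow_columns(DATA, random.Random(0))
for name, values in with_shadows.items():
    print(name, values)
```

Each shadow column holds exactly the same values as its original, only in a different row order, so any importance it earns can be attributed to chance.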
  71. Count how many times each variable scores better than the best shadow variable (Hit) and how many times it does not (Not Hit).
  72. Hypothesis Test • Null Hypothesis: There is no difference between a given variable and the best shadow variable in terms of variable importance.

    • Alternative Hypothesis 1: A given variable is better than the best shadow variable. (Useful) • Alternative Hypothesis 2: A given variable is worse than the best shadow variable. (Not Useful)
  73. Under the assumption of the Null Hypothesis, the number of Hits in 20 experiments follows a distribution over 0 to 20.
  74. If we take 5% (0.05) as the threshold of significance (P Value: 5%)…
  75. If a given variable was better 15 times out of 20…
  76. this can happen with less than a 5% chance.
  77. We can reject the null hypothesis and conclude that this variable is Useful for prediction.
  78. If a given variable was better only 5 times out of 20…
  79. this can happen with less than a 5% chance.
  80. We can reject the null hypothesis and conclude that this variable is Not Useful for prediction.
  81. If a given variable was better 12 times out of 20…
  82. this can happen with greater than a 5% chance.
  83. We can NOT reject the null hypothesis, so we can NOT conclude whether this variable is Useful or Not Useful for prediction.
  84. • If the number of hits falls in the upper range, we can conclude ‘Useful’.

    • If the number of hits falls in the middle range, we can’t conclude either way. • If the number of hits falls in the lower range, we can conclude ‘Not Useful’.
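
The decision rule in slides 72 to 84 can be sketched with a binomial tail probability. Two assumptions here are not stated in the slides: under the null hypothesis each of the 20 runs is treated as an independent 50/50 chance of a hit, and each tail is compared against the 5% threshold separately. The function names are hypothetical, and the actual Boruta implementation may differ, for example by adjusting for multiple comparisons.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k hits in n runs when each run hits with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def decide(hits, runs=20, alpha=0.05, p=0.5):
    """Classify a variable from its hit count, assuming Binomial(runs, p) under the null."""
    p_at_least = sum(binomial_pmf(k, runs, p) for k in range(hits, runs + 1))
    p_at_most = sum(binomial_pmf(k, runs, p) for k in range(0, hits + 1))
    if p_at_least < alpha:
        return "Useful"  # this many hits or more is unlikely by chance
    if p_at_most < alpha:
        return "Not Useful"  # this few hits or fewer is unlikely by chance
    return "Undecided"


for hits in (15, 12, 5):
    print(hits, decide(hits))
```

Under these assumptions, the three hit counts used in the slides (15, 12, and 5 out of 20) fall into the ‘Useful’, undecided, and ‘Not Useful’ regions respectively.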
  85. 127

  86. A model is a definition of a pattern the algorithm has captured in the data (Algorithm → Model).

    Columns: Conversion, Age, Time, Country, Industry. Rows: TRUE 60 120 Japan Ad; FALSE 45 55 US Medical; FALSE 52 20 US Media; TRUE 48 140 Japan Ad; TRUE 53 80 UK Bank; FALSE 35 30 Japan Media.
  87. Insight: Not just for prediction, we can also use a model (Algorithm → Model) to learn a lot about the patterns in data.

    • Which variables have a stronger relationship with the target variable? • How are they related? • Are the relationships significant? • What is the quality if we use this model to predict?
  88. Analytics Grammar: A common framework for understanding the patterns and relationships in data.

    • Numeric (Regression Model): Linear Regression, Random Forest / XGBoost • TRUE/FALSE (Classification Model): Logistic Regression, Random Forest / XGBoost • TRUE/FALSE + Time (Survival Model): Cox Regression, Survival Forest
  89. Overview table of the framework:

    • Data Type: Numeric, TRUE/FALSE, TRUE/FALSE + Time • Model Type: Statistical Learning, Machine Learning • Algorithm: Linear Regression, Logistic Regression, Cox Regression (Statistical Learning); Random Forest / XGBoost, Survival Forest (Machine Learning) • Relationship and Evaluation outputs: Variable Importance, Prediction by Variable, Survival Curve, Coefficient (Slope, Odds Ratio, Hazard Ratio), Significance, R Squared, RMSE, AUC, F Score
  90. Variable Importance • Which variables are more important for predicting the target variable?

    • Which variables have a stronger relationship with the target variable?