Seminar #51 - Machine Learning - How Variable Importance Works

In Exploratory, when you build machine learning or statistical learning models, you will see a tab called 'Importance' that shows which variables are most important for predicting the values of a given target variable.

In this seminar, Kan will explain how variable importance is calculated and how to interpret the result. He will also introduce a method called 'Boruta', which is used to address the challenges brought by the randomness of Random Forest models.

Kan Nishida

July 07, 2021

Transcript

  1. EXPLORATORY Online Seminar #51 Machine Learning - Variable Importance

  2. Speaker: Kan Nishida (@KanAugust), CEO/co-founder, Exploratory

    Summary: In Spring 2016, Kan launched Exploratory, Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of product development at Oracle, leading teams that built various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, and Big Data. While at Oracle, Kan also provided training and consulting services to help organizations transform with data.
  3. Data Science Workflow

    Questions, Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning), Communication
  4. Exploratory: Modern & Simple UI

    Questions, Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning), Communication (Dashboard, Note, Slides)
  5. EXPLORATORY Online Seminar #51 Machine Learning - Variable Importance

  6. Agenda • Prediction Model • Analytics Grammar • Variable Importance

    • Statistical Learning - Variable Importance vs. Coefficient • Statistical Learning - Significance of Relationship • Random Forest - Boruta for Variable Importance 6
  7. Agenda • Prediction Model • Analytics Grammar • Variable Importance

    • Statistical Learning - Variable Importance vs. Coefficient • Statistical Learning - Significance of Relationship • Random Forest - Boruta for Variable Importance 7
  8. Prediction Model

    • We have data that have answers (numerical values or labels) for what we are interested in (e.g. Sales, Conversion, Attrition).
    • Use algorithms to detect relationships and patterns that can be used to identify the answers, and model them as formulas or rules.
    • Use the models to predict the answers for data that have no answers.
  9. Use Case

    • We know who converted to paid customers and who didn’t in the past.
    • We have customer attribute data for both those who converted and those who didn’t.
    • Based on these data, we want to predict which of the current leads will convert.
  10. Build a Prediction Model (Algorithm, Model)

    Conversion  Age  Time  Country  Industry
    TRUE        60   120   Japan    Ad
    FALSE       45   55    US       Medical
    FALSE       52   20    US       Media
    TRUE        48   140   Japan    Ad
    TRUE        53   80    UK       Bank
    FALSE       35   30    Japan    Media

    The Conversion column holds the answers (the target variable).
  11. 11 A model is a definition of a pattern the

    algorithm has captured in the data. Algorithm Model Conversion Age Time Country Industry TRUE 60 120 Japan Ad FALSE 45 55 US Medical FALSE 52 20 US Media TRUE 48 140 Japan Ad TRUE 53 80 UK Bank FALSE 35 30 Japan Media
  12. Predict (Algorithm, Model)

    Training data (with answers):
    Conversion  Age  Time  Country  Industry
    TRUE        60   120   Japan    Ad
    FALSE       45   55    US       Medical
    FALSE       52   20    US       Media
    TRUE        48   140   Japan    Ad
    TRUE        53   80    UK       Bank
    FALSE       35   30    Japan    Media

    Data with no answers:
    Conversion  Age  Time  Country  Industry
    ?           25   120   Japan    Ad
    ?           23   55    US       Media
    ?           40   150   US       Ad

    Predicted:
    Conversion  Age  Time  Country  Industry
    TRUE        25   120   Japan    Ad
    FALSE       23   55    US       Media
    FALSE       40   150   US       Ad
  13. All models are approximations of the real world.

    "All models are wrong, but some are useful." George Box, British statistician
  14. Insight: beyond prediction, we can also use the model to learn a lot about the patterns in the data.

    • Which variables have a stronger relationship with the target variable?
    • How are they related?
    • Are they significant?
    • What is the quality if we use this model to predict?
  15. Agenda • Prediction Model with Machine Learning / Statistical Learning

    • Analytics Grammar • Variable Importance • Statistical Learning - Variable Importance vs. Coefficient • Statistical Learning - Significance of Relationship • Random Forest - Boruta for Variable Importance 15
  16. We want to build a prediction model, but which algorithm should we use?
  17. Algorithms by Target Variable Data Type

    Target Variable    Model Type            Statistical Learning  Machine Learning
    Numeric            Regression Model      Linear Regression     Random Forest / XGBoost
    TRUE/FALSE         Classification Model  Logistic Regression   Random Forest / XGBoost
    TRUE/FALSE + Time  Survival Model        Cox Regression        Survival Forest
  18. Numeric TRUE/FALSE TRUE/FALSE + Time Linear Regression Random Forest /

    XGBoost Logistic Regression Cox Regression Survival Forest 18 Regression Model Classification Model Survival Model Random Forest / XGBoost Output Output Output Output Output Output Various ways of interpretations
  19. • The main difference among the various prediction models is what kinds of patterns they can capture in the data.

    • We want to find out what patterns or relationships the algorithms have found in the data.
    • Regardless of which algorithm we use, can’t we have a standard framework to understand such patterns?
  20. Numeric TRUE/FALSE TRUE/FALSE + Time Linear Regression Random Forest /

    XGBoost Logistic Regression Cox Regression Survival Forest 20 Random Forest / XGBoost Analytics Grammar A common framework for understanding the patterns and relationships in data Regression Model Classification Model Survival Model
  21. Variable Importance Prediction by Variable Coefficient 21 Analytics Grammar Evaluation

  22. • Which variables are more important in order to predict

    the target variable? • Which variables have stronger relationship with a target variable? 22 Variable Importance
  23. Prediction by Variable

    How does the target variable change when a given predictor variable changes?
    • When the target variable is numerical, the Y-axis shows the predicted numerical value of the target variable.
    • When the target variable is logical, the Y-axis shows the predicted probability of the target variable being TRUE.
  24. Coefficient

    • How much does the target value change when a given predictor variable changes by one point?
    • Is the change statistically significant? (Hypothesis Test)
    Only for Statistical Learning models (e.g. Linear Regression, Logistic Regression).
  25. Evaluation

    • Statistical Significance: check whether the relationship the prediction model captured is significant. (Only for Statistical Learning models.)
    • Prediction Quality: check how well a given prediction model fits reality.
  26. Statistical Learning Machine Learning Data Type Model Type Algorithm Evaluation

    Relationship R Squared RMSE AUC F Score Hazard Ratio R Squared Variable Importance Prediction by Variable Survival Curve Slope Significance Odds Ratio 26 Coefficient Linear Regression Random Forest / XGBoost Logistic Regression Cox Regression Survival Forest Random Forest / XGBoost Numeric TRUE/FALSE TRUE/FALSE + Time Statistical Learning Machine Learning Statistical Learning Machine Learning Variable Importance Prediction by Variable Variable Importance Prediction by Variable Variable Importance Prediction by Variable Variable Importance Prediction by Variable Variable Importance Prediction by Variable Significance Significance R Squared RMSE R Squared AUC F Score Significance Significance Significance Survival Curve
  27. Statistical Learning Machine Learning Data Type Model Type Algorithm Evaluation

    Relationship R Squared RMSE AUC F Score Hazard Ratio R Squared Variable Importance Prediction by Variable Slope Significance Odds Ratio 27 Coefficient Linear Regression Random Forest / XGBoost Logistic Regression Cox Regression Survival Forest Random Forest / XGBoost Numeric TRUE/FALSE TRUE/FALSE + Time Statistical Learning Machine Learning Statistical Learning Machine Learning Variable Importance Prediction by Variable Variable Importance Prediction by Variable Variable Importance Prediction by Variable Variable Importance Prediction by Variable Variable Importance Prediction by Variable Significance Significance R Squared RMSE R Squared AUC F Score Significance Significance Significance Survival Curve Survival Curve
  28. Agenda • Prediction Model • Analytics Grammar • Variable Importance

    • Statistical Learning - Variable Importance vs. Coefficient • Statistical Learning - Significance of Relationship • Random Forest - Boruta for Variable Importance 28
  29. Variable Importance Which variables have stronger importance with a target

    variable? 29
  30. How is ‘variable importance’ calculated?

    • Build a model after removing one of the predictor variables, and evaluate how much the quality of the model degrades compared to the model with all the predictor variables.
    • Repeat for every predictor variable.
    • Compare the degree of degradation among all the predictor variables.
  31. 31 If it’s a linear regression model, we can evaluate

    the quality of the model by calculating the difference between the actual values and the predicted values.
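
One common way to summarize the difference between actual and predicted values is the root mean squared error (RMSE), which is listed among the evaluation metrics earlier in this deck. The following is a minimal sketch only: the function name and the sample numbers are illustrative assumptions and do not come from the slides.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical values, for illustration only.
actual = [10.0, 12.0, 15.0, 11.0]
predicted = [11.0, 12.0, 13.0, 11.0]

error = rmse(actual, predicted)
print(round(error, 3))
```

A smaller RMSE means the predictions sit closer to the actual values, so a rise in RMSE after removing a variable would count as a degradation in quality.
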
  32. 32 Algorithm Model Build a Prediction Model. Conversion Age Time

    Industry TRUE 60 120 Ad FALSE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank FALSE 35 30 Media
  33. 33 Algorithm Model Predict for the training data. Conversion Age

    Time Industry FALSE 60 120 Ad FALSE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank FALSE 35 30 Media Conversion Age Time Industry TRUE 60 120 Ad FALSE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank FALSE 35 30 Media
  34. Conversion Age Time Industry FALSE 60 120 Ad FALSE 45

    55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank FALSE 35 30 Media Conversion Age Time Industry TRUE 60 120 Ad FALSE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank FALSE 35 30 Media 34 Algorithm Model Evaluate the quality by matching with the existing answers. Prediction Quality: 90
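
"Matching with the existing answers" can be sketched as a simple accuracy calculation. The example below uses the Conversion columns of the two tables on this slide, assuming the first table holds the predictions and the second holds the known answers. The function name is an assumption. Note that these six rows give 5 matches out of 6, so the quality score of 90 on the slide is best read as a round number for the walkthrough rather than the result of this calculation.

```python
def accuracy(actual, predicted):
    """Fraction of predictions that match the known answers."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    matches = sum(1 for a, p in zip(actual, predicted) if a == p)
    return matches / len(actual)

# Conversion column of the two tables on this slide.
actual = [True, False, False, True, True, False]      # existing answers
predicted = [False, False, False, True, True, False]  # model predictions

score = accuracy(actual, predicted)
print(f"{score:.1%}")
```
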
  35. 35 Algorithm Model Baseline Prediction Quality: 90 Conversion Age Time

    Industry FALSE 60 120 Ad FALSE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank FALSE 35 30 Media Conversion Age Time Industry TRUE 60 120 Ad FALSE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank FALSE 35 30 Media
  36. 36 Algorithm Model Conversion Age Time Industry TRUE 60 120

    Ad FALSE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank FALSE 35 30 Media Build a Model without ‘Time’ variable.
  37. 37 Algorithm Model Predict for the training data. Conversion Age

    Time Industry TRUE 60 120 Ad TRUE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank TRUE 35 30 Media Conversion Age Time Industry TRUE 60 120 Ad FALSE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank FALSE 35 30 Media
  38. 38 Algorithm Model Conversion Age Time Industry TRUE 60 120

    Ad TRUE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank TRUE 35 30 Media Conversion Age Time Industry TRUE 60 120 Ad FALSE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank FALSE 35 30 Media Evaluate the quality by matching with the existing answers. Prediction Quality: 80
  39. Variable Importance: Time = 10

    Without the ‘Time’ variable, the quality of the model degrades by 10 points from the baseline.
  40. 40 Algorithm Model Build a Model without ‘Age’ variable. Conversion

    Age Time Industry TRUE 60 120 Ad FALSE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank FALSE 35 30 Media
  41. 41 Algorithm Model Predict for the training data. Conversion Age

    Time Industry TRUE 60 120 Ad TRUE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank TRUE 35 30 Media Conversion Age Time Industry TRUE 60 120 Ad FALSE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank FALSE 35 30 Media
  42. 42 Algorithm Model Conversion Age Time Industry TRUE 60 120

    Ad TRUE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank TRUE 35 30 Media Conversion Age Time Industry TRUE 60 120 Ad FALSE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank FALSE 35 30 Media Evaluate the quality by matching with the existing answers. Prediction Quality: 70
  43. Variable Importance: Time = 10, Age = 20

    Without the ‘Age’ variable, the quality of the model degrades by 20 points from the baseline.
  44. 44 Algorithm Model Build a Model without ‘Industry’ variable. Conversion

    Age Time Industry TRUE 60 120 Ad FALSE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank FALSE 35 30 Media
  45. 45 Algorithm Model Predict for the training data. Conversion Age

    Time Industry TRUE 60 120 Ad TRUE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank TRUE 35 30 Media Conversion Age Time Industry TRUE 60 120 Ad FALSE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank FALSE 35 30 Media
  46. 46 Algorithm Model Conversion Age Time Industry TRUE 60 120

    Ad TRUE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank TRUE 35 30 Media Conversion Age Time Industry TRUE 60 120 Ad FALSE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank FALSE 35 30 Media Evaluate the quality by matching with the existing answers. Prediction Quality: 20
  47. Variable Importance: Time = 10, Age = 20, Industry = 70

    Without the ‘Industry’ variable, the quality of the model decreases by 70 points.
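
The walkthrough in the preceding slides (build a baseline, drop one variable at a time, compare the quality) can be sketched as follows. This is a minimal illustration, not Exploratory's implementation: `drop_column_importance` and `quality_from_slides` are hypothetical names, and the "build a model and evaluate it" step is replaced by a lookup of the quality scores quoted on the slides (90 for the baseline; 80, 70, and 20 after dropping Time, Age, and Industry).

```python
def drop_column_importance(variables, quality_of):
    """Importance of each variable = baseline quality minus the quality
    of a model built without that variable."""
    baseline = quality_of(tuple(variables))
    importance = {}
    for dropped in variables:
        remaining = tuple(v for v in variables if v != dropped)
        importance[dropped] = baseline - quality_of(remaining)
    return importance

# Stand-in for model building and evaluation: the scores from the slides.
SLIDE_QUALITY = {
    ("Age", "Time", "Industry"): 90,  # baseline with all variables
    ("Age", "Industry"): 80,          # without Time
    ("Time", "Industry"): 70,         # without Age
    ("Age", "Time"): 20,              # without Industry
}

def quality_from_slides(variables):
    return SLIDE_QUALITY[tuple(variables)]

importance = drop_column_importance(["Age", "Time", "Industry"], quality_from_slides)
ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

In a real setting, `quality_of` would train a model on the given variables and return a metric such as accuracy or R squared.
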
  48. We realize their importance after we’ve lost them…

  49. Who is more important?

  50. How does the performance degrade when John Lennon is not here?

  51. How about without Ringo Starr?

  52. How about without George?

  53. How about without Paul?

  54. In Exploratory 54

  55. None
  56. Agenda • Prediction Model • Analytics Grammar • Variable Importance

    • Statistical Learning - Variable Importance vs. Coefficient • Statistical Learning - Significance of Relationship • Random Forest - Boruta for Variable Importance 56
  57. Can we use the coefficient view to compare the importance

    of the variables? 57
  58. Statistical Learning Machine Learning Data Type Model Type Algorithm Evaluation

    Relationship R Squared RMSE AUC F Score Hazard Ratio R Squared Variable Importance Prediction by Variable Slope Significance Odds Ratio 58 Coefficient Linear Regression Random Forest / XGBoost Logistic Regression Cox Regression Survival Forest Random Forest / XGBoost Numeric TRUE/FALSE TRUE/FALSE + Time Statistical Learning Machine Learning Statistical Learning Machine Learning Variable Importance Prediction by Variable Variable Importance Prediction by Variable Variable Importance Prediction by Variable Variable Importance Prediction by Variable Variable Importance Prediction by Variable Significance Significance R Squared RMSE R Squared AUC F Score Significance Significance Significance Survival Curve Survival Curve
  59. A coefficient shows how much the target value would change when a given predictor variable changes by one point.
  60. Because the predictor variables have different units, a bigger coefficient value does not mean a stronger relationship with the target variable.
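
A small sketch can make the unit problem concrete. It fits a one-variable least-squares slope twice on the same relationship, once with age in years and once with age in months. The data and the `slope` helper are hypothetical and only serve the illustration.

```python
def slope(x, y):
    """Least-squares slope of y on a single predictor x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    variance = sum((xi - mean_x) ** 2 for xi in x)
    return covariance / variance

# Hypothetical data: monthly income against age.
age_years = [25, 30, 35, 40, 45]
income = [3000, 3600, 4100, 4700, 5200]

# Same ages, expressed in months instead of years.
age_months = [a * 12 for a in age_years]

slope_per_year = slope(age_years, income)
slope_per_month = slope(age_months, income)
print(slope_per_year, slope_per_month)
```

The relationship between age and income is identical in both fits, yet the coefficient shrinks by a factor of 12 when the unit changes. This is why coefficient size alone cannot rank variables by importance.
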
  61. 1 point in Age = 1 year; 1 point in Job Level = 1 level; 1 point in Job Role = Executive → Sales Rep.
  62. Agenda • Prediction Model • Analytics Grammar • Variable Importance

    • Statistical Learning - Variable Importance vs. Coefficient • Statistical Learning - Significance of Relationship • Random Forest - Boruta for Variable Importance 62
  63. Logistic Regression

  64. Some of the variables are considered ‘not significant’, why?

  65. “Shallow men believe in luck or in circumstance. Strong men

    believe in cause and effect.” Ralph Waldo Emerson 65
  66. None
  67. 67 Shark Attack Ice Cream Sales

  68. 68 Shark Attack Ice Cream Sales

  69. 69 Hot Confounding Shark Attack Ice Cream Sales

  70. 70 Attrition Marital Status Stock Option Both Stock Option and

    Marital Status are associated with Attrition.
  71. There is a relationship between the Marital Status and the

    Stock Option. 71
  72. Association (Attrition, Marital Status, Stock Option)

    Stock Option and Marital Status are associated with each other.
  73. 73 Attrition Marital Status Stock Option Which one is causing

    the Attrition?
  74. By holding ‘Marital Status’ constant (e.g. keeping it at ‘Married’), we can see whether the Attrition changes when the Stock Option changes (e.g. 1 → 2).
  75. Or, by holding the Stock Option constant, we can see whether the Attrition changes when the Marital Status changes (e.g. Single → Married).
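
The idea of "holding a variable constant" can be illustrated with a simple stratified comparison. This is only an analogy to what a regression model does, not the model itself. The records below are entirely hypothetical, constructed so that Stock Option looks related to Attrition overall but not within each Marital Status group. The field names and helper functions are assumptions.

```python
from collections import defaultdict

def attrition_rate(records, keys):
    """Attrition rate for each combination of the given key fields."""
    totals = defaultdict(lambda: [0, 0])  # group -> [attrition count, row count]
    for record in records:
        group = tuple(record[k] for k in keys)
        totals[group][0] += record["attrition"]
        totals[group][1] += 1
    return {group: left / n for group, (left, n) in totals.items()}

def make_rows(marital, stock, n, n_left):
    """Build n hypothetical employees, of whom n_left have left."""
    return [{"marital": marital, "stock": stock, "attrition": i < n_left}
            for i in range(n)]

# Hypothetical data, for illustration only.
records = (make_rows("Single", 0, 8, 4) + make_rows("Single", 1, 2, 1)
           + make_rows("Married", 0, 4, 1) + make_rows("Married", 1, 8, 2))

overall = attrition_rate(records, ["stock"])            # ignores marital status
within = attrition_rate(records, ["marital", "stock"])  # marital status held constant
print(overall)
print(within)
```

In this toy data the overall attrition rate differs by stock option level, but within each marital status group the rate is the same for both levels, which is the pattern of an indirect relationship.
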
  76. According to this model, the relationship between the Stock Option and the Attrition turned out not to be a direct one. Rather, it arises through the relationship between the Marital Status and the Attrition.
  77. If we remove the Marital Status, the Stock Option now shows up as a significant and more important variable for predicting the Attrition.
  78. Agenda • Prediction Model • Analytics Grammar • Variable Importance

    • Statistical Learning - Variable Importance vs. Coefficient • Statistical Learning - Significance of Relationship • Random Forest - Boruta for Variable Importance 78
  79. 79 Random Forest

  80. 80 Forest

  81. 81 Tree

  82. 82 Data Result Decision Tree

  83. Decision Tree

    It creates a series of questions (e.g. Income > 5500, Overtime > 15) to separate the data into multiple groups so that each group has similar values.
    (Diagram: each question branches on TRUE/FALSE; the groups show the ratio of TRUE as 100%, 50%, or 0%.)
  84. 84

  85. 85 Forest

  86. Decision Tree Random Forest Data Sampling Sampling Sampling Vote Vote

    Vote Result … Random Sampling
  87. Data

    Age  Working Years  Monthly Income  Job Level  Job Role               Attrition
    40   22             10424           1          Sales Executive        TRUE
    33   13             4726            2          Laboratory Technician  FALSE
    32   12             6694            3          Research Scientist     FALSE
    28   8              8342            2          Manager                TRUE
    24   6              7123            3          Sales Executive        FALSE
    28   10             8722            3          Laboratory Technician  FALSE
    43   20             4233            2          Research Scientist     FALSE
    38   17             5512            2          Manager                TRUE
    34   14             2500            1          Sales Executive        FALSE
    29   9              3394            2          Research Scientist     TRUE
  88. Target Variable: Attrition

    Age  Working Years  Monthly Income  Job Level  Job Role               Attrition
    40   22             10424           1          Sales Executive        TRUE
    33   13             4726            2          Laboratory Technician  FALSE
    32   12             6694            3          Research Scientist     FALSE
    28   8              8342            2          Manager                TRUE
    24   6              7123            3          Sales Executive        FALSE
    28   10             8722            3          Laboratory Technician  FALSE
    43   20             4233            2          Research Scientist     FALSE
    38   17             5512            2          Manager                TRUE
    34   14             2500            1          Sales Executive        FALSE
    29   9              3394            2          Research Scientist     TRUE
  89. Sample rows and columns randomly from the data. The Attrition variable is always included.

    Sample 1:
    Age  Monthly Income  Attrition
    40   8157            TRUE
    33   4456            FALSE
    32   3600            FALSE
    28   2500            TRUE

    Sample 2:
    Gender  Job Level  Working Years  Attrition
    28      1          8              FALSE
    43      3          22             FALSE
    38      2          13             TRUE

    Sample 3:
    Age  Job Role         Attrition
    33   Sales Executive  TRUE
    32   Manager          FALSE
    33   Sales Executive  FALSE
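
The sampling step described on this slide can be sketched as below. The function name and parameters are hypothetical, and drawing rows with replacement is an assumption (the slide only says the rows are sampled randomly). Common Random Forest implementations also tend to pick candidate columns at each split rather than once per tree, so treat this as a simplified picture that follows the slide.

```python
import random

def sample_rows_and_columns(rows, predictors, target, n_columns, rng):
    """Draw one random sample: a subset of predictor columns plus a
    random sample of rows. The target column is always kept."""
    chosen = rng.sample(predictors, n_columns)  # columns, without replacement
    sampled = [rng.choice(rows) for _ in rows]  # rows, with replacement (assumption)
    keep = chosen + [target]
    return [{name: row[name] for name in keep} for row in sampled]

# First four rows of the data table shown two slides earlier.
rows = [
    {"Age": 40, "Working Years": 22, "Monthly Income": 10424, "Job Level": 1,
     "Job Role": "Sales Executive", "Attrition": True},
    {"Age": 33, "Working Years": 13, "Monthly Income": 4726, "Job Level": 2,
     "Job Role": "Laboratory Technician", "Attrition": False},
    {"Age": 32, "Working Years": 12, "Monthly Income": 6694, "Job Level": 3,
     "Job Role": "Research Scientist", "Attrition": False},
    {"Age": 28, "Working Years": 8, "Monthly Income": 8342, "Job Level": 2,
     "Job Role": "Manager", "Attrition": True},
]
predictors = ["Age", "Working Years", "Monthly Income", "Job Level", "Job Role"]

rng = random.Random(1)  # fixed seed so the sketch is repeatable
samples = [sample_rows_and_columns(rows, predictors, "Attrition", 2, rng)
           for _ in range(3)]
for sample in samples:
    print(list(sample[0].keys()))
```
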
  90. 90 Data Age Monthly Income Attrition 40 8157 TRUE 33

    4456 FALSE 32 3600 FALSE 28 2500 TRUE Gender Job Level Working Years Attrition 28 1 8 FALSE 43 3 22 FALSE 38 2 13 TRUE Age Job Role Attrition 33 Sales Executive TRUE 32 Manager FALSE 33 Sales Executive FALSE Build a tree model for each sampled data. Each tree is slightly different.
  91. 91 Data Sampling Sampling Sampling Vote Vote Vote Result …

    Take the mean or the majority value.
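
The final "vote" step can be sketched in a few lines: a majority vote for TRUE/FALSE targets and a mean for numeric targets. The function names and the per-tree outputs below are hypothetical.

```python
from collections import Counter
from statistics import mean

def majority_vote(predictions):
    """Most frequent prediction; ties go to the value seen first."""
    return Counter(predictions).most_common(1)[0][0]

def average(predictions):
    """Mean of the per-tree numeric predictions."""
    return mean(predictions)

# Hypothetical outputs of three trees for a single row.
class_votes = [True, False, True]
numeric_votes = [10.0, 12.0, 14.0]

print(majority_vote(class_votes))
print(average(numeric_votes))
```
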
  92. None
  93. 93 Challenges with Variable Importance

  94. 1. Do all the predictor variables have meaningful effects on the target variable? Maybe some of the variables have nothing to do with it.

    2. Because of Random Forest’s randomness, the result is not guaranteed to be the same every time we build the same model.
  95. Random Sampling (Data, Sampling, Decision Tree, Vote, Result)

    Random sampling doesn’t guarantee that the result will be the same every time.
  96. 96 A Result with a Random Seed 1

  97. 97 Different seed causes different ranks. A Result with a

    Random Seed 2
  98. Boruta

    • If there is randomness in the result, the variables suggested as important might appear so just by chance.
    • We can test whether they are consistently important with a statistical test.
  99. 99

  100. 100

  101. Building Random Forest models 20 times generates 20 values of the importance metric. The distribution of these 20 values is then visualized with a boxplot.
  102. 102 Whether each variable is useful for predicting the target

    variable based on the statistical test.
  103. Variables that are confirmed ‘Useful’.

  104. 104 Variables that are not confirmed either ‘Useful’ or ‘Not

    Useful’.
  105. 105 Variables that are confirmed ‘Not Useful’.

  106. 106 How does Boruta do the test?

  107. • For each variable, count the number of times its variable importance score is better than that of the shadow variables.

    • Perform a hypothesis test on the counts and evaluate the statistical significance for each variable.
  108. Data

    Job Level  Working Years  Age
    1          25             27
    3          35             40
    2          40             41
    1          22             22
    1          33             35
  109. Create Shadow

    Original variables:
    Job Level  Working Years  Age
    1          25             27
    3          35             40
    2          40             41
    1          22             22
    1          33             35

    Shadow variables:
    Shadow Job Level  Shadow Working Years  Shadow Age
    1                 22                    41
    1                 44                    22
    1                 35                    35
    3                 25                    40
    2                 40                    27

    • Copy the original variables and shuffle the values randomly.
    • These randomly shuffled variables shouldn’t have any association with the target variable.
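
Creating shadow variables as described above can be sketched as follows, using the three columns from the data table on the previous slide. The function name and the " (shadow)" naming are assumptions made for this illustration, as is shuffling each column independently.

```python
import random

def add_shadow_columns(columns, rng):
    """Return a copy of `columns` plus one shuffled 'shadow' per column."""
    result = {name: list(values) for name, values in columns.items()}
    for name, values in columns.items():
        shuffled = list(values)  # copy, so the original stays untouched
        rng.shuffle(shuffled)    # break any link with the target variable
        result[f"{name} (shadow)"] = shuffled
    return result

data = {
    "Job Level": [1, 3, 2, 1, 1],
    "Working Years": [25, 35, 40, 22, 33],
    "Age": [27, 40, 41, 22, 35],
}

with_shadows = add_shadow_columns(data, random.Random(0))
for name, values in with_shadows.items():
    print(name, values)
```

Each shadow column keeps the same set of values as its original, so its distribution is unchanged; only the row alignment, and therefore any association with the target, is destroyed.
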
  110. 110 Run Variable Importance Shadow Shadow Shadow

  111. 111 Shadow Shadow Shadow The best scoring shadow variable.

  112. 112 Hit Hit Not Hit Shadow Shadow Shadow Count how

    many times each variable scores better than the best shadow variable.
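
Counting hits can be sketched like this. Each run is represented as a dictionary of importance scores that includes the shadow variables. The scores, the variable names, and the " (shadow)" suffix are hypothetical and exist only for this illustration.

```python
def count_hits(runs, shadow_suffix=" (shadow)"):
    """For each real variable, count the runs in which its importance
    beats the best-scoring shadow variable of that run."""
    hits = {}
    for scores in runs:
        best_shadow = max(score for name, score in scores.items()
                          if name.endswith(shadow_suffix))
        for name, score in scores.items():
            if name.endswith(shadow_suffix):
                continue
            hits[name] = hits.get(name, 0) + (1 if score > best_shadow else 0)
    return hits

# Hypothetical importance scores from three runs.
runs = [
    {"Age": 0.30, "Job Level": 0.05, "Age (shadow)": 0.10, "Job Level (shadow)": 0.08},
    {"Age": 0.25, "Job Level": 0.12, "Age (shadow)": 0.07, "Job Level (shadow)": 0.09},
    {"Age": 0.28, "Job Level": 0.04, "Age (shadow)": 0.11, "Job Level (shadow)": 0.06},
]

hits = count_hits(runs)
print(hits)
```
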
  113. 113 Repeat and Count Shadow Shadow Shadow

  114. Hypothesis Test

    • Null Hypothesis: There is no difference between a given variable and the best shadow variable in terms of the variable importance.
    • Alternative Hypothesis 1: A given variable is better than the best shadow variable. (Useful)
    • Alternative Hypothesis 2: A given variable is worse than the best shadow variable. (Not Useful)
  115. Under the null hypothesis, the distribution of the number of hits across 20 experiments. (Chart: the x-axis is the number of hits, from 0 to 20.)

  116. If we take 5% (0.05) as the threshold of significance (P value: 5%)…

  117. If a given variable was better 15 times out of 20…

  118. this can happen with less than a 5% chance.

  119. We can reject the null hypothesis and conclude that this variable is useful for prediction.

  120. If a given variable was better only 5 times out of 20…

  121. this can happen with less than a 5% chance.

  122. We can reject the null hypothesis and conclude that this variable is Not Useful for prediction.

  123. If a given variable was better 12 times out of 20…

  124. this can happen with a greater than 5% chance.

  125. We can NOT reject the null hypothesis, so we can NOT conclude whether this variable is Useful or Not Useful for prediction.

  126. • If the number of hits falls in the upper tail, we can conclude ‘Useful’.
    • If the number of hits falls in the middle, we can’t conclude either way.
    • If the number of hits falls in the lower tail, we can conclude ‘Not Useful’.
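
The three regions can be reproduced with a small binomial calculation. This sketch assumes that, under the null hypothesis, the number of hits in 20 runs follows a Binomial(20, 0.5) distribution, and that each tail is tested separately at the 5% threshold. Both are assumptions made for illustration; the actual Boruta implementation may apply further adjustments, such as a correction for multiple comparisons, that are omitted here. The function names are hypothetical.

```python
from math import comb

def upper_tail(hits, runs, p=0.5):
    """P(X >= hits) for X ~ Binomial(runs, p)."""
    return sum(comb(runs, k) * p**k * (1 - p)**(runs - k)
               for k in range(hits, runs + 1))

def lower_tail(hits, runs, p=0.5):
    """P(X <= hits) for X ~ Binomial(runs, p)."""
    return sum(comb(runs, k) * p**k * (1 - p)**(runs - k)
               for k in range(0, hits + 1))

def decide(hits, runs=20, alpha=0.05):
    """Classify a variable from its hit count."""
    if upper_tail(hits, runs) < alpha:
        return "Useful"
    if lower_tail(hits, runs) < alpha:
        return "Not Useful"
    return "Undecided"

for hits in (15, 5, 12):
    print(hits, decide(hits))
```

Under these assumptions, 15 hits and 5 hits each fall in a tail with probability of about 2%, while 12 or more hits occurs with probability of about 25%, which matches the three cases walked through on the slides.
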
  127. 127

  128. Summary

  129. 129 A model is a definition of a pattern the

    algorithm has captured in the data. Algorithm Model Conversion Age Time Country Industry TRUE 60 120 Japan Ad FALSE 45 55 US Medical FALSE 52 20 US Media TRUE 48 140 Japan Ad TRUE 53 80 UK Bank FALSE 35 30 Japan Media
  130. 130 Not just for the prediction, we can also use

    it to learn a lot about the patterns in data. Insight • Which variables have stronger relationship with the target variable. • How are they related? • Are they significant? • What is the quality if we used this model to predict? Algorithm Model
  131. Numeric TRUE/FALSE TRUE/FALSE + Time Linear Regression Random Forest /

    XGBoost Logistic Regression Cox Regression Survival Forest 131 Random Forest / XGBoost Analytics Grammar A common framework for understanding the patterns and relationships in data Regression Model Classification Model Survival Model
  132. Variable Importance Prediction by Variable Coefficient 132 Analytics Grammar Evaluation

  133. Statistical Learning Machine Learning Data Type Model Type Algorithm Evaluation

    Relationship R Squared RMSE AUC F Score Hazard Ratio R Squared Variable Importance Prediction by Variable Slope Significance Odds Ratio 133 Coefficient Linear Regression Random Forest / XGBoost Logistic Regression Cox Regression Survival Forest Random Forest / XGBoost Numeric TRUE/FALSE TRUE/FALSE + Time Statistical Learning Machine Learning Statistical Learning Machine Learning Variable Importance Prediction by Variable Variable Importance Prediction by Variable Variable Importance Prediction by Variable Variable Importance Prediction by Variable Variable Importance Prediction by Variable Significance Significance R Squared RMSE R Squared AUC F Score Significance Significance Significance Survival Curve Survival Curve
  134. Variable Importance Which variables have stronger importance with a target

    variable? 134
  135. • Which variables are more important in order to predict

    the target variable? • Which variables have stronger relationship with a target variable? 135 Variable Importance
  136. That’s it for today!

  137. Next Seminar

  138. EXPLORATORY Online Seminar #52 7/21/2021 (Wed) 11AM PT Exploratory Server

  139. None
  140. Information: Email kan@exploratory.io, Website https://exploratory.io, Twitter @ExploratoryData, Seminar https://exploratory.io/online-seminar

  141. Q & A 141

  142. EXPLORATORY 142