Seminar #51 - Machine Learning - How Variable Importance Works

EXPLORATORY Online Seminar #51 Machine Learning - Variable Importance

Speaker: Kan Nishida, CEO/co-founder, Exploratory (@KanAugust)

Summary: In Spring 2016, Kan launched Exploratory, Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of product development at Oracle, leading teams that built various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, and Big Data. While at Oracle, Kan also provided training and consulting services to help organizations transform with data.

Data Science Workflow: Questions, Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning), Communication

Exploratory: Modern & Simple UI for Questions, Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning), and Communication (Dashboard, Note, Slides)


Agenda
• Prediction Model
• Analytics Grammar
• Variable Importance
• Statistical Learning - Variable Importance vs. Coefficient
• Statistical Learning - Significance of Relationship
• Random Forest - Boruta for Variable Importance

Prediction Model
• We have data that have answers (numerical values or labels) for what we are interested in (e.g. Sales, Conversion, Attrition).
• Use algorithms to detect relationships and patterns that can be used to identify the answers, and model them as formulas or rules.
• Use the models to predict the answers for data that has no answers.

Use Case:
• We know who converted to paid customers and who didn’t convert in the past.
• We have customer attribute data for those who converted and those who didn’t convert.
• Based on these data, we want to predict which of the current lead customers will convert.

Build a Prediction Model (Algorithm → Model). The Conversion column holds the answers (the target variable).

Conversion  Age  Time  Country  Industry
TRUE        60   120   Japan    Ad
FALSE       45   55    US       Medical
FALSE       52   20    US       Media
TRUE        48   140   Japan    Ad
TRUE        53   80    UK       Bank
FALSE       35   30    Japan    Media

A model is a definition of a pattern the algorithm has captured in the data.

Predict: apply the model to data with no answers.

Data with no answers:
Conversion  Age  Time  Country  Industry
?           25   120   Japan    Ad
?           23   55    US       Media
?           40   150   US       Ad

Predicted:
Conversion  Age  Time  Country  Industry
TRUE        25   120   Japan    Ad
FALSE       23   55    US       Media
FALSE       40   150   US       Ad

All models are approximations of the real world.
“All models are wrong, but some are useful.” George Box, British Statistician

Insight: Beyond prediction, we can also use the model to learn a lot about the patterns in the data.
• Which variables have a stronger relationship with the target variable?
• How are they related?
• Are the relationships significant?
• What is the quality if we use this model to predict?

Agenda
• Prediction Model with Machine Learning / Statistical Learning
• Analytics Grammar
• Variable Importance
• Statistical Learning - Variable Importance vs. Coefficient
• Statistical Learning - Significance of Relationship
• Random Forest - Boruta for Variable Importance

We want to build a prediction model, but which algorithm should we use?

Target Variable Data Type  Model Type            Statistical Learning  Machine Learning
Numeric                    Regression Model      Linear Regression     Random Forest / XGBoost
TRUE/FALSE                 Classification Model  Logistic Regression   Random Forest / XGBoost
TRUE/FALSE + Time          Survival Model        Cox Regression        Survival Forest

Each algorithm (Linear Regression, Logistic Regression, Cox Regression, Random Forest / XGBoost, Survival Forest) produces its own output for its model type (Regression, Classification, Survival), and each output comes with its own way of interpretation.

• The main difference among the various prediction models is what kinds of patterns they can capture in the data.
• We want to find out the patterns or relationships in the data that the algorithms have found.
• Regardless of which algorithm we use, can’t we have a standard framework to understand such patterns?

Analytics Grammar: A common framework for understanding the patterns and relationships in data, across Regression, Classification, and Survival models (Logistic Regression, Cox Regression, Survival Forest, Random Forest / XGBoost).

Analytics Grammar: Variable Importance, Prediction by Variable, Coefficient, Evaluation

Variable Importance
• Which variables are more important for predicting the target variable?
• Which variables have a stronger relationship with the target variable?

Prediction by Variable
How does the target variable change when a given predictor variable changes?
• Target variable is numerical: the Y-axis shows the predicted numerical value of the target variable.
• Target variable is logical: the Y-axis shows the predicted probability of the target variable being TRUE.

Coefficient
• How much does the target value change when a given predictor variable changes by one point?
• Is the change statistically significant? (Hypothesis Test)
Only for Statistical Learning models (e.g. Linear Regression, Logistic Regression).

Evaluation
• Statistical Significance: Check whether the relationship the prediction model captured is significant. Only for Statistical Learning models.
• Prediction Quality: Check how well a given prediction model fits reality.

Analytics Grammar by algorithm

Data Type          Model Type      Algorithm                Type                  Relationship
Numeric            Regression      Linear Regression        Statistical Learning  Variable Importance, Prediction by Variable, Coefficient (Slope)
Numeric            Regression      Random Forest / XGBoost  Machine Learning      Variable Importance, Prediction by Variable
TRUE/FALSE         Classification  Logistic Regression      Statistical Learning  Variable Importance, Prediction by Variable, Coefficient (Odds Ratio)
TRUE/FALSE         Classification  Random Forest / XGBoost  Machine Learning      Variable Importance, Prediction by Variable
TRUE/FALSE + Time  Survival        Cox Regression           Statistical Learning  Variable Importance, Prediction by Variable (Survival Curve), Coefficient (Hazard Ratio)
TRUE/FALSE + Time  Survival        Survival Forest          Machine Learning      Variable Importance, Prediction by Variable (Survival Curve)

Evaluation items shown on the slide: Significance, R Squared, RMSE, AUC, F Score.

Variable Importance: Which variables have a stronger relationship with the target variable?

How is ‘variable importance’ calculated?
• Build a model after removing one of the predictor variables, and evaluate how much the quality of the model degrades compared to the model with all the predictor variables.
• Repeat for every predictor variable.
• Compare the degree of degradation among all the predictor variables.
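The three steps above can be sketched in a few lines of Python. This is a minimal sketch, not Exploratory's implementation: `fit_and_score` is a hypothetical stand-in for "build a model on these columns and return its quality", and the quality numbers are the illustrative ones used in this seminar's walkthrough (90 with all variables; 80, 70, and 20 without Time, Age, and Industry).

```python
from typing import Callable, Dict, FrozenSet, List


def drop_column_importance(
    predictors: List[str],
    fit_and_score: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Importance of a predictor = baseline quality - quality without it."""
    baseline = fit_and_score(frozenset(predictors))
    importance = {}
    for name in predictors:
        remaining = frozenset(p for p in predictors if p != name)
        importance[name] = baseline - fit_and_score(remaining)
    return importance


# Illustrative quality scores (assumption: taken from the slide walkthrough,
# not produced by a real model).
ILLUSTRATIVE_QUALITY = {
    frozenset({"Age", "Time", "Industry"}): 90,  # baseline
    frozenset({"Age", "Industry"}): 80,          # without Time
    frozenset({"Time", "Industry"}): 70,         # without Age
    frozenset({"Age", "Time"}): 20,              # without Industry
}


def fit_and_score(columns: FrozenSet[str]) -> float:
    # Hypothetical stand-in for training and evaluating a real model.
    return ILLUSTRATIVE_QUALITY[columns]


importance = drop_column_importance(["Age", "Time", "Industry"], fit_and_score)
ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

In practice `fit_and_score` would train a model and evaluate it, which is why this approach gets expensive as the number of predictor variables grows.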

If it’s a linear regression model, we can evaluate the quality of the model by calculating the difference between the actual values and the predicted values.
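One common way to summarize that difference in a single number is RMSE, which this deck lists among the evaluation metrics. The sketch below is illustrative only; the actual and predicted values are made-up numbers.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical numbers, for illustration only.
actual = [100.0, 150.0, 200.0]
predicted = [110.0, 140.0, 200.0]
print(rmse(actual, predicted))
```

A smaller RMSE means the predictions sit closer to the actual values, so a model that loses an important variable should show a larger RMSE than the baseline.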

Build a Prediction Model.

Conversion  Age  Time  Industry
TRUE        60   120   Ad
FALSE       45   55    Medical
FALSE       52   20    Media
TRUE        48   140   Ad
TRUE        53   80    Bank
FALSE       35   30    Media

Predict for the training data.

Actual  Predicted  Age  Time  Industry
TRUE    FALSE      60   120   Ad
FALSE   FALSE      45   55    Medical
FALSE   FALSE      52   20    Media
TRUE    TRUE       48   140   Ad
TRUE    TRUE       53   80    Bank
FALSE   FALSE      35   30    Media

Evaluate the quality by matching the predictions with the existing answers. Prediction Quality: 90

Baseline Prediction Quality: 90

Build a model without the ‘Time’ variable.

Predictions of the model built without ‘Time’, next to the existing answers:

Actual  Predicted  Age  Time  Industry
TRUE    TRUE       60   120   Ad
FALSE   TRUE       45   55    Medical
FALSE   FALSE      52   20    Media
TRUE    TRUE       48   140   Ad
TRUE    TRUE       53   80    Bank
FALSE   TRUE       35   30    Media

Evaluate the quality by matching with the existing answers. Prediction Quality: 80

Variable Importance: Time = 10
Without the ‘Time’ variable, the quality of the model degrades by 10 points from the baseline.

Build a model without the ‘Age’ variable.

Variable Importance: Time = 10, Age = 20
Without the ‘Age’ variable, the quality of the model degrades by 20 points from the baseline.

Build a model without the ‘Industry’ variable.

Variable Importance: Time = 10, Age = 20, Industry = 70
Without the ‘Industry’ variable, the quality of the model decreases by 70 points.

We realize their importance after we’ve lost them…

Who is more important?

How does the performance degrade when John Lennon is not here?

How about without Ringo Starr?

How about without George?

How about without Paul?

In Exploratory

• Statistical Learning - Variable Importance vs. Coefficient
• Statistical Learning - Significance of Relationship
• Random Forest - Boruta for Variable Importance

Can we use the coefficient view to compare the importance of the variables?

The coefficient shows how much the target value would change when a given predictor variable changes by one point.

Because the units of the predictor variables are different, a bigger coefficient value doesn’t mean a stronger relationship with the target variable.

What one point means depends on the variable:
• 1 point in Age: 1 year
• 1 point in Job Level: 1 level
• 1 point in Job Role: a change of category (e.g. Executive → Sales Rep.)
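To make the unit problem concrete, here is a small sketch that fits the same relationship twice, once with the predictor in years and once in months. The data and variable names are hypothetical; `slope` is a plain least-squares slope for a single predictor.

```python
def slope(x, y):
    """Least-squares slope of y on a single predictor x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var = sum((xi - mean_x) ** 2 for xi in x)
    return cov / var


# Hypothetical data: income rises by 1000 for each year of experience.
years = [1, 2, 3, 4, 5]
income = [3000, 4000, 5000, 6000, 7000]
months = [12 * y for y in years]  # same information, different unit

slope_per_year = slope(years, income)
slope_per_month = slope(months, income)
print(slope_per_year, slope_per_month)
```

The relationship is identical in both fits, yet the coefficient per month is one twelfth of the coefficient per year. The size of a coefficient reflects the unit of the variable, not only the strength of the relationship.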

Logistic Regression

Some of the variables are considered ‘not significant’. Why?

“Shallow men believe in luck or in circumstance. Strong men believe in cause and effect.” Ralph Waldo Emerson

Shark Attack ↔ Ice Cream Sales

Confounding: ‘Hot’ weather is related to both Shark Attack and Ice Cream Sales.

Both Stock Option and Marital Status are associated with Attrition.

There is a relationship between Marital Status and Stock Option.

Association: Stock Option and Marital Status are associated with each other.

Which one is causing the Attrition?

By holding ‘Marital Status’ constant (e.g. ‘Married’), we can see whether Attrition changes when Stock Option changes (e.g. 1 → 2).

Or, by holding Stock Option constant, we can see whether Attrition changes when Marital Status changes (e.g. Single → Married).
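The idea of holding one variable constant can be illustrated by comparing attrition rates inside each Marital Status group. The counts below are entirely hypothetical and were chosen so that Stock Option looks related to Attrition overall but not within either group. A regression model does this kind of adjustment through its coefficients rather than by explicit grouping, so this is an analogy, not the model's actual computation.

```python
# (marital_status, stock_option, employees, attritions); hypothetical counts
groups = [
    ("Single", 0, 80, 32),
    ("Single", 1, 20, 8),
    ("Married", 0, 20, 2),
    ("Married", 1, 80, 8),
]


def attrition_rate(rows):
    """Attritions divided by employees over the given rows."""
    employees = sum(r[2] for r in rows)
    attritions = sum(r[3] for r in rows)
    return attritions / employees


# Ignoring Marital Status: Stock Option appears to matter.
overall = {
    so: attrition_rate([g for g in groups if g[1] == so]) for so in (0, 1)
}

# Holding Marital Status constant: compare Stock Option levels within a group.
within = {
    (ms, so): attrition_rate([g for g in groups if g[0] == ms and g[1] == so])
    for ms in ("Single", "Married")
    for so in (0, 1)
}

print(overall)
print(within)
```

With these made-up counts the overall rates differ between Stock Option levels, while within ‘Single’ and within ‘Married’ they are the same, which is the pattern a confounded variable produces.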

It turned out that, according to this model, the relationship between Stock Option and Attrition is not a direct one; rather, it arises through the relationship between Marital Status and Attrition.

If we remove Marital Status, Stock Option now comes out as a significant and more important variable for predicting Attrition.

Random Forest

Forest

Tree

Decision Tree: Data → Result

Decision Tree: It creates a series of questions to separate the data into multiple groups so that each group has similar values. In the example, the questions are ‘Income > 5500’ and ‘Overtime > 15’; each TRUE / FALSE answer leads to a group with its own ratio of TRUE (100%, 50%, 0%).
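A fitted decision tree is, in effect, a nested set of if-statements. The sketch below reuses the two questions from the slide; which branch leads to which ratio is an assumption made for illustration, since the transcript does not preserve the tree's shape.

```python
def predicted_ratio_of_true(income, overtime):
    """Walk a tiny hand-written tree and return the group's ratio of TRUE.

    The thresholds come from the slide; the branch layout and the leaf
    values assigned to each branch are hypothetical.
    """
    if income > 5500:
        return 0.0
    if overtime > 15:
        return 1.0
    return 0.5


print(predicted_ratio_of_true(income=6000, overtime=20))
print(predicted_ratio_of_true(income=4000, overtime=20))
print(predicted_ratio_of_true(income=4000, overtime=5))
```

A real algorithm chooses the questions and thresholds from the data so that the rows in each resulting group are as similar as possible.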

Forest

Random Forest: Data → Random Sampling → a Decision Tree for each sample → Vote → Result

Data (Target Variable: Attrition)

Age  Working Years  Monthly Income  Job Level  Job Role               Attrition
40   22             10424           1          Sales Executive        TRUE
33   13             4726            2          Laboratory Technician  FALSE
32   12             6694            3          Research Scientist     FALSE
28   8              8342            2          Manager                TRUE
24   6              7123            3          Sales Executive        FALSE
28   10             8722            3          Laboratory Technician  FALSE
43   20             4233            2          Research Scientist     FALSE
38   17             5512            2          Manager                TRUE
34   14             2500            1          Sales Executive        FALSE
29   9              3394            2          Research Scientist     TRUE

Sample rows and columns randomly. The Attrition variable is always included.

Sample 1:
Age  Monthly Income  Attrition
40   8157            TRUE
33   4456            FALSE
32   3600            FALSE
28   2500            TRUE

Sample 2:
Gender  Job Level  Working Years  Attrition
28      1          8              FALSE
43      3          22             FALSE
38      2          13             TRUE

Sample 3:
Age  Job Role         Attrition
33   Sales Executive  TRUE
32   Manager          FALSE
33   Sales Executive  FALSE

Build a tree model for each sampled data set. Each tree is slightly different.

Vote: combine the results of all the trees by taking the mean (numeric target) or the majority value (TRUE/FALSE target).
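The voting step can be sketched as follows. The function names and the per-tree predictions are hypothetical; each list stands for what the individual trees returned for a single row.

```python
from collections import Counter
from statistics import mean


def vote_classification(tree_predictions):
    """Majority value among the trees' TRUE/FALSE predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]


def vote_regression(tree_predictions):
    """Mean of the trees' numeric predictions."""
    return mean(tree_predictions)


# Hypothetical predictions from three trees for one row.
print(vote_classification([True, False, True]))
print(vote_regression([10.0, 12.0, 14.0]))
```

An odd number of trees is used in the classification example to avoid a tie; real implementations define their own tie-breaking rule.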

Challenges with Variable Importance

1. Do all the predictor variables have meaningful effects on the target variable? Maybe some of the variables have nothing to do with it.
2. Random Forest’s randomness doesn’t guarantee that the result is the same every time we build the same model.

Random sampling doesn’t guarantee that the result is the same every time.

A Result with Random Seed 1

A Result with Random Seed 2: a different seed causes different ranks.

Boruta
• If there is randomness in the result, the suggested important variables might be important just by chance.
• We can test whether they are consistently important with a statistical test.

Building Random Forest models 20 times generates 20 values of the importance metric for each variable. Boruta then visualizes the distribution of those 20 values with a boxplot.

Based on the statistical test, each variable is judged on whether it is useful for predicting the target variable:
• Variables that are confirmed ‘Useful’.
• Variables that are not confirmed as either ‘Useful’ or ‘Not Useful’.
• Variables that are confirmed ‘Not Useful’.

How does Boruta do the test?

• For each variable, count the number of times its variable importance score is better than those of the shadow variables.
• Perform a hypothesis test on the counts and evaluate the statistical significance for each variable.

Data

Job Level  Working Years  Age
1          25             27
3          35             40
2          40             41
1          22             22
1          33             35

Create Shadow

Job Level Shadow  Working Years Shadow  Age Shadow
1                 22                    41
1                 44                    22
1                 35                    35
3                 25                    40
2                 40                    27

• Copy the original variables and shuffle the values randomly.
• These randomly shuffled variables shouldn’t have any association with the target variable.
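Creating shadow variables amounts to copying each column and shuffling it independently. The sketch below uses the three columns from the slide; `add_shadow_columns` and the fixed seed are illustrative choices, not the Boruta package's API.

```python
import random


def add_shadow_columns(data, seed=0):
    """Return a copy of `data` plus a shuffled '<name> Shadow' column per variable."""
    rng = random.Random(seed)
    result = {name: list(values) for name, values in data.items()}
    for name, values in data.items():
        shuffled = list(values)
        rng.shuffle(shuffled)  # breaks any link between this column and the target
        result[name + " Shadow"] = shuffled
    return result


data = {
    "Job Level": [1, 3, 2, 1, 1],
    "Working Years": [25, 35, 40, 22, 33],
    "Age": [27, 40, 41, 22, 35],
}
with_shadows = add_shadow_columns(data)
for name, values in with_shadows.items():
    print(name, values)
```

Each shadow column keeps the same set of values as its original, so its distribution is unchanged; only the row order, and therefore any association with the target variable, is destroyed.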

Run Variable Importance on the original variables together with the shadow variables.

Find the best scoring shadow variable.

Count how many times each variable scores better than the best shadow variable (Hit / Not Hit).

Repeat and count.
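The counting step can be sketched as below. The importance scores are hypothetical numbers for three runs; in Boruta they would come from repeated Random Forest runs, each with freshly shuffled shadow variables.

```python
def count_hits(runs, shadow_suffix=" Shadow"):
    """For each real variable, count the runs where it beats the best shadow."""
    hits = {}
    for scores in runs:
        best_shadow = max(
            score for name, score in scores.items() if name.endswith(shadow_suffix)
        )
        for name, score in scores.items():
            if name.endswith(shadow_suffix):
                continue
            hits[name] = hits.get(name, 0) + (1 if score > best_shadow else 0)
    return hits


# Hypothetical importance scores from three runs.
runs = [
    {"Age": 0.30, "Job Level": 0.05, "Age Shadow": 0.10, "Job Level Shadow": 0.08},
    {"Age": 0.25, "Job Level": 0.12, "Age Shadow": 0.07, "Job Level Shadow": 0.11},
    {"Age": 0.28, "Job Level": 0.04, "Age Shadow": 0.09, "Job Level Shadow": 0.06},
]
hits = count_hits(runs)
print(hits)
```

The resulting hit counts are the input to the hypothesis test.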

Hypothesis Test
• Null Hypothesis: There is no difference between a given variable and the best shadow variable in terms of variable importance.
• Alternative Hypothesis 1: A given variable is better than the best shadow variable. (Useful)
• Alternative Hypothesis 2: A given variable is worse than the best shadow variable. (Not Useful)

Under the null hypothesis, the number of Hits in 20 experiments follows a distribution over 0 to 20 (figure: distribution of the number of Hits).

If we take 5% (0.05) as the threshold of significance (P Value: 5%)…

If a given variable was better 15 times out of 20, this can happen with less than a 5% chance under the null hypothesis. We can reject the null hypothesis and conclude that this variable is Useful for prediction.

If a given variable was better only 5 times out of 20, this can also happen with less than a 5% chance under the null hypothesis. We can reject the null hypothesis and conclude that this variable is Not Useful for prediction.

If a given variable was better 12 times out of 20, this can happen with a greater than 5% chance. We can NOT reject the null hypothesis, so we can NOT conclude whether this variable is Useful or Not Useful for prediction.

• A number of hits in the upper tail of the distribution: we can conclude ‘Useful’.
• A number of hits in the middle of the distribution: we can’t conclude either way.
• A number of hits in the lower tail of the distribution: we can conclude ‘Not Useful’.
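The three regions can be reproduced with a binomial calculation. This is a simplified sketch under stated assumptions: under the null hypothesis each run is treated as a fair coin flip (hit probability 0.5), each tail is compared against the 5% threshold separately, and `decide` is a hypothetical helper. The actual Boruta package may differ in details such as adjusting for multiple comparisons.

```python
from math import comb


def tail_probabilities(hits, runs=20, p=0.5):
    """Return (P(X >= hits), P(X <= hits)) for X ~ Binomial(runs, p)."""
    pmf = [comb(runs, k) * p**k * (1 - p) ** (runs - k) for k in range(runs + 1)]
    upper = sum(pmf[hits:])
    lower = sum(pmf[: hits + 1])
    return upper, lower


def decide(hits, runs=20, alpha=0.05):
    """Classify a hit count as 'Useful', 'Not Useful', or 'Undecided'."""
    upper, lower = tail_probabilities(hits, runs)
    if upper < alpha:
        return "Useful"
    if lower < alpha:
        return "Not Useful"
    return "Undecided"


for hits in (15, 5, 12):
    upper, lower = tail_probabilities(hits)
    print(hits, round(upper, 4), round(lower, 4), decide(hits))
```

Under these assumptions, 15 hits and 5 hits each have a tail probability of about 2%, so they fall in the ‘Useful’ and ‘Not Useful’ regions, while 12 hits has an upper-tail probability of about 25% and stays undecided.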

Summary

A model is a definition of a pattern the algorithm has captured in the data.

Insight: Beyond prediction, we can also use the model to learn a lot about the patterns in the data.
• Which variables have a stronger relationship with the target variable?
• How are they related?
• Are the relationships significant?
• What is the quality if we use this model to predict?

Analytics Grammar: A common framework for understanding the patterns and relationships in data, across Regression, Classification, and Survival models.

Analytics Grammar: Variable Importance, Prediction by Variable, Coefficient, Evaluation

Variable Importance
• Which variables are more important for predicting the target variable?
• Which variables have a stronger relationship with the target variable?

That’s it for today!

Next Seminar

EXPLORATORY Online Seminar #52: Exploratory Server
7/21/2021 (Wed) 11AM PT

Information
Email: [email protected]
Website: https://exploratory.io
Twitter: @ExploratoryData
Seminar: https://exploratory.io/online-seminar

Q & A
