Slide 1

Slide 1 text

EXPLORATORY Online Seminar #51 Machine Learning - Variable Importance

Slide 2

Slide 2 text

Speaker: Kan Nishida (@KanAugust), CEO/co-founder, Exploratory. Summary: In Spring 2016, Kan launched Exploratory, Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of product development at Oracle, leading teams that built various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, and Big Data. While at Oracle, Kan also provided training and consulting services to help organizations transform with data.

Slide 3

Slide 3 text

3 Questions Communication Data Access Data Wrangling Visualization Analytics (Statistics / Machine Learning) Data Science Workflow

Slide 4

Slide 4 text

Exploratory: Modern & Simple UI. Questions, Communication (Dashboard, Note, Slides), Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning)

Slide 5

Slide 5 text

EXPLORATORY Online Seminar #51 Machine Learning - Variable Importance

Slide 6

Slide 6 text

Agenda • Prediction Model • Analytics Grammar • Variable Importance • Statistical Learning - Variable Importance vs. Coefficient • Statistical Learning - Significance of Relationship • Random Forest - Boruta for Variable Importance 6

Slide 7

Slide 7 text

Agenda • Prediction Model • Analytics Grammar • Variable Importance • Statistical Learning - Variable Importance vs. Coefficient • Statistical Learning - Significance of Relationship • Random Forest - Boruta for Variable Importance 7

Slide 8

Slide 8 text

8 • We have data that have answers (numerical values or labels) for what we are interested in. (e.g. Sales, Conversion, Attrition, etc.) • Use algorithms to detect relationships and patterns that can be used to identify the answers and model them as formulas or rules. • Use the models to predict the answers for the data with no answers. Prediction Model

Slide 9

Slide 9 text

9 • We know who converted as paid customers and who didn’t convert in the past. • We have customer attribute data for those who converted and those who didn’t convert. • Based on these data, we want to predict which of the current lead customers will convert or not. Use Case:

Slide 10

Slide 10 text

10 Algorithm Model Build a Prediction Model. Conversion Age Time Country Industry TRUE 60 120 Japan Ad FALSE 45 55 US Medical FALSE 52 20 US Media TRUE 48 140 Japan Ad TRUE 53 80 UK Bank FALSE 35 30 Japan Media Answers Target Variable

Slide 11

Slide 11 text

11 A model is a definition of a pattern the algorithm has captured in the data. Algorithm Model Conversion Age Time Country Industry TRUE 60 120 Japan Ad FALSE 45 55 US Medical FALSE 52 20 US Media TRUE 48 140 Japan Ad TRUE 53 80 UK Bank FALSE 35 30 Japan Media

Slide 12

Slide 12 text

12 Predict Conversion Age Time Country Industry TRUE 25 120 Japan Ad FALSE 23 55 US Media FALSE 40 150 US Ad Conversion Age Time Country Industry ? 25 120 Japan Ad ? 23 55 US Media ? 40 150 US Ad Algorithm Model Conversion Age Time Country Industry TRUE 60 120 Japan Ad FALSE 45 55 US Medical FALSE 52 20 US Media TRUE 48 140 Japan Ad TRUE 53 80 UK Bank FALSE 35 30 Japan Media No Answers

Slide 13

Slide 13 text

All models are approximations of the real world. “All models are wrong, but some are useful.” George Box, British Statistician

Slide 14

Slide 14 text

Insight: A model is not just for prediction; we can also use it to learn a lot about the patterns in the data. • Which variables have a stronger relationship with the target variable? • How are they related? • Are they significant? • What is the quality if we used this model to predict? (Diagram: Algorithm → Model)

Slide 15

Slide 15 text

Agenda • Prediction Model with Machine Learning / Statistical Learning • Analytics Grammar • Variable Importance • Statistical Learning - Variable Importance vs. Coefficient • Statistical Learning - Significance of Relationship • Random Forest - Boruta for Variable Importance 15

Slide 16

Slide 16 text

We want to build a prediction model, but which algorithm should we use?

Slide 17

Slide 17 text

Model and algorithm by the data type of the Target Variable:
Numeric: Regression Model. Statistical Learning: Linear Regression. Machine Learning: Random Forest / XGBoost.
TRUE/FALSE: Classification Model. Statistical Learning: Logistic Regression. Machine Learning: Random Forest / XGBoost.
TRUE/FALSE + Time: Survival Model. Statistical Learning: Cox Regression. Machine Learning: Survival Forest.

Slide 18

Slide 18 text

Numeric: Regression Model (Linear Regression, Random Forest / XGBoost).
TRUE/FALSE: Classification Model (Logistic Regression, Random Forest / XGBoost).
TRUE/FALSE + Time: Survival Model (Cox Regression, Survival Forest).
Each algorithm produces its own output, with various ways of interpretation.

Slide 19

Slide 19 text

• The main difference among the various prediction models is what kinds of patterns they could capture in the data. • We want to find out the pattern or the relationship in the data the algorithms have found. • Regardless of which algorithms we use, can’t we have a standard framework to understand such patterns?

Slide 20

Slide 20 text

Numeric TRUE/FALSE TRUE/FALSE + Time Linear Regression Random Forest / XGBoost Logistic Regression Cox Regression Survival Forest 20 Random Forest / XGBoost Analytics Grammar A common framework for understanding the patterns and relationships in data Regression Model Classification Model Survival Model

Slide 21

Slide 21 text

Variable Importance Prediction by Variable Coefficient 21 Analytics Grammar Evaluation

Slide 22

Slide 22 text

Variable Importance • Which variables are more important for predicting the target variable? • Which variables have a stronger relationship with the target variable?

Slide 23

Slide 23 text

Prediction by Variable: How does the target variable change when a given predictor variable changes? When the target variable is numerical, the Y-axis shows the predicted numerical value of the target variable. When the target variable is logical, the Y-axis shows the predicted probability of the target variable being TRUE.

Slide 24

Slide 24 text

Coefficient (only for Statistical Learning models, e.g. Linear Regression, Logistic Regression, etc.) • How much does the target value change when a given predictor variable changes by one point? • Is the change statistically significant? (Hypothesis Test)

Slide 25

Slide 25 text

Evaluation • Statistical Significance (only for Statistical Learning models): Check if the relationship the prediction model captured is significant or not. • Prediction Quality: Check how well a given prediction model fits reality.

Slide 26

Slide 26 text

Statistical Learning Machine Learning Data Type Model Type Algorithm Evaluation Relationship R Squared RMSE AUC F Score Hazard Ratio R Squared Variable Importance Prediction by Variable Survival Curve Slope Significance Odds Ratio 26 Coefficient Linear Regression Random Forest / XGBoost Logistic Regression Cox Regression Survival Forest Random Forest / XGBoost Numeric TRUE/FALSE TRUE/FALSE + Time Statistical Learning Machine Learning Statistical Learning Machine Learning Variable Importance Prediction by Variable Variable Importance Prediction by Variable Variable Importance Prediction by Variable Variable Importance Prediction by Variable Variable Importance Prediction by Variable Significance Significance R Squared RMSE R Squared AUC F Score Significance Significance Significance Survival Curve

Slide 27

Slide 27 text

Statistical Learning Machine Learning Data Type Model Type Algorithm Evaluation Relationship R Squared RMSE AUC F Score Hazard Ratio R Squared Variable Importance Prediction by Variable Slope Significance Odds Ratio 27 Coefficient Linear Regression Random Forest / XGBoost Logistic Regression Cox Regression Survival Forest Random Forest / XGBoost Numeric TRUE/FALSE TRUE/FALSE + Time Statistical Learning Machine Learning Statistical Learning Machine Learning Variable Importance Prediction by Variable Variable Importance Prediction by Variable Variable Importance Prediction by Variable Variable Importance Prediction by Variable Variable Importance Prediction by Variable Significance Significance R Squared RMSE R Squared AUC F Score Significance Significance Significance Survival Curve Survival Curve

Slide 28

Slide 28 text

Agenda • Prediction Model • Analytics Grammar • Variable Importance • Statistical Learning - Variable Importance vs. Coefficient • Statistical Learning - Significance of Relationship • Random Forest - Boruta for Variable Importance 28

Slide 29

Slide 29 text

Variable Importance Which variables have stronger importance with a target variable? 29

Slide 30

Slide 30 text

How is ‘variable importance’ calculated? • Build a model after removing one of the predictor variables and evaluate how much the quality of the model degrades compared to the model with all the predictor variables. • Repeat for every single predictor variable. • Compare the degree of degradation among all the predictor variables.

Slide 31

Slide 31 text

31 If it’s a linear regression model, we can evaluate the quality of the model by calculating the difference between the actual values and the predicted values.
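One common way to turn "the difference between the actual values and the predicted values" into a single quality number is the root mean squared error (RMSE), which also appears in the evaluation table earlier in the deck. The sketch below is only an illustration: the `actual` and `predicted` values are invented, and the slides do not say which exact metric is used at this step.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: the typical size of the prediction error."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical values, invented for illustration (not from the slides).
actual = [10, 20, 30]
predicted = [12, 18, 33]
error = rmse(actual, predicted)  # smaller means the model fits the data better
```

A lower RMSE means the predictions sit closer to the actual values, so a model whose RMSE gets much worse after a variable is removed was relying on that variable.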

Slide 32

Slide 32 text

32 Algorithm Model Build a Prediction Model. Conversion Age Time Industry TRUE 60 120 Ad FALSE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank FALSE 35 30 Media

Slide 33

Slide 33 text

33 Algorithm Model Predict for the training data. Conversion Age Time Industry FALSE 60 120 Ad FALSE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank FALSE 35 30 Media Conversion Age Time Industry TRUE 60 120 Ad FALSE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank FALSE 35 30 Media

Slide 34

Slide 34 text

Conversion Age Time Industry FALSE 60 120 Ad FALSE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank FALSE 35 30 Media Conversion Age Time Industry TRUE 60 120 Ad FALSE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank FALSE 35 30 Media 34 Algorithm Model Evaluate the quality by matching with the existing answers. Prediction Quality: 90

Slide 35

Slide 35 text

35 Algorithm Model Baseline Prediction Quality: 90 Conversion Age Time Industry FALSE 60 120 Ad FALSE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank FALSE 35 30 Media Conversion Age Time Industry TRUE 60 120 Ad FALSE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank FALSE 35 30 Media

Slide 36

Slide 36 text

36 Algorithm Model Conversion Age Time Industry TRUE 60 120 Ad FALSE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank FALSE 35 30 Media Build a Model without ‘Time’ variable.

Slide 37

Slide 37 text

37 Algorithm Model Predict for the training data. Conversion Age Time Industry TRUE 60 120 Ad TRUE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank TRUE 35 30 Media Conversion Age Time Industry TRUE 60 120 Ad FALSE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank FALSE 35 30 Media

Slide 38

Slide 38 text

38 Algorithm Model Conversion Age Time Industry TRUE 60 120 Ad TRUE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank TRUE 35 30 Media Conversion Age Time Industry TRUE 60 120 Ad FALSE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank FALSE 35 30 Media Evaluate the quality by matching with the existing answers. Prediction Quality: 80

Slide 39

Slide 39 text

Variable Importance: Time 10. Without the ‘Time’ variable, the quality of the model degrades by 10 points from the baseline.

Slide 40

Slide 40 text

40 Algorithm Model Build a Model without ‘Age’ variable. Conversion Age Time Industry TRUE 60 120 Ad FALSE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank FALSE 35 30 Media

Slide 41

Slide 41 text

41 Algorithm Model Predict for the training data. Conversion Age Time Industry TRUE 60 120 Ad TRUE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank TRUE 35 30 Media Conversion Age Time Industry TRUE 60 120 Ad FALSE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank FALSE 35 30 Media

Slide 42

Slide 42 text

42 Algorithm Model Conversion Age Time Industry TRUE 60 120 Ad TRUE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank TRUE 35 30 Media Conversion Age Time Industry TRUE 60 120 Ad FALSE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank FALSE 35 30 Media Evaluate the quality by matching with the existing answers. Prediction Quality: 70

Slide 43

Slide 43 text

Variable Importance: Time 10, Age 20. Without the ‘Age’ variable, the quality of the model degrades by 20 points from the baseline.

Slide 44

Slide 44 text

44 Algorithm Model Build a Model without ‘Industry’ variable. Conversion Age Time Industry TRUE 60 120 Ad FALSE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank FALSE 35 30 Media

Slide 45

Slide 45 text

45 Algorithm Model Predict for the training data. Conversion Age Time Industry TRUE 60 120 Ad TRUE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank TRUE 35 30 Media Conversion Age Time Industry TRUE 60 120 Ad FALSE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank FALSE 35 30 Media

Slide 46

Slide 46 text

46 Algorithm Model Conversion Age Time Industry TRUE 60 120 Ad TRUE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank TRUE 35 30 Media Conversion Age Time Industry TRUE 60 120 Ad FALSE 45 55 Medical FALSE 52 20 Media TRUE 48 140 Ad TRUE 53 80 Bank FALSE 35 30 Media Evaluate the quality by matching with the existing answers. Prediction Quality: 20

Slide 47

Slide 47 text

Variable Importance: Time 10, Age 20, Industry 70. Without the ‘Industry’ variable, the quality of the model decreases by 70 points from the baseline.
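The walkthrough above (baseline quality 90, then 80, 70, and 20 after removing Time, Age, and Industry) can be sketched as a small loop. This is a minimal sketch, not Exploratory's implementation: `drop_one_importance` and `SLIDE_SCORES` are hypothetical names, and the lookup table simply stands in for "build a model and evaluate its quality" using the numbers quoted on the slides.

```python
def drop_one_importance(variables, score_model):
    """Score the full model, then re-score it with each variable left out.

    The importance of a variable is how far the quality drops without it.
    """
    baseline = score_model(tuple(variables))
    importance = {}
    for dropped in variables:
        kept = tuple(v for v in variables if v != dropped)
        importance[dropped] = baseline - score_model(kept)
    return importance

# Stand-in for training and evaluating a model: the quality scores from the slides.
SLIDE_SCORES = {
    ("Age", "Time", "Industry"): 90,  # baseline with all predictors
    ("Age", "Industry"): 80,          # without Time
    ("Time", "Industry"): 70,         # without Age
    ("Age", "Time"): 20,              # without Industry
}

importance = drop_one_importance(["Age", "Time", "Industry"], SLIDE_SCORES.__getitem__)
ranking = sorted(importance, key=importance.get, reverse=True)
```

In practice `score_model` would retrain the model on the remaining columns each time, which is why this approach gets expensive as the number of predictors grows.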

Slide 48

Slide 48 text

We realize their importance after we’ve lost them…

Slide 49

Slide 49 text

Who is more important?

Slide 50

Slide 50 text

How does the performance degrade when John Lennon is not here?

Slide 51

Slide 51 text

How about without Ringo Starr?

Slide 52

Slide 52 text

How about without George?

Slide 53

Slide 53 text

How about without Paul?

Slide 54

Slide 54 text

In Exploratory 54

Slide 55

Slide 55 text

No content

Slide 56

Slide 56 text

Agenda • Prediction Model • Analytics Grammar • Variable Importance • Statistical Learning - Variable Importance vs. Coefficient • Statistical Learning - Significance of Relationship • Random Forest - Boruta for Variable Importance 56

Slide 57

Slide 57 text

Can we use the coefficient view to compare the importance of the variables? 57

Slide 58

Slide 58 text

Statistical Learning Machine Learning Data Type Model Type Algorithm Evaluation Relationship R Squared RMSE AUC F Score Hazard Ratio R Squared Variable Importance Prediction by Variable Slope Significance Odds Ratio 58 Coefficient Linear Regression Random Forest / XGBoost Logistic Regression Cox Regression Survival Forest Random Forest / XGBoost Numeric TRUE/FALSE TRUE/FALSE + Time Statistical Learning Machine Learning Statistical Learning Machine Learning Variable Importance Prediction by Variable Variable Importance Prediction by Variable Variable Importance Prediction by Variable Variable Importance Prediction by Variable Variable Importance Prediction by Variable Significance Significance R Squared RMSE R Squared AUC F Score Significance Significance Significance Survival Curve Survival Curve

Slide 59

Slide 59 text

The coefficient shows how much the target value would change when a given predictor variable changes by one point.

Slide 60

Slide 60 text

Because the units of the predictor variables are different, a bigger coefficient value doesn’t necessarily mean a stronger relationship with the target variable.
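A quick way to see why coefficients are not comparable across units is to fit the same relationship twice, once with the predictor in years and once in months. The sketch below uses a one-variable least-squares slope and invented numbers (`age_years` and `income` are hypothetical); the relationship is identical in both fits, yet the coefficient differs by a factor of 12.

```python
def slope(x, y):
    """Least-squares slope of y on x for a single predictor."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance = sum((a - mean_x) ** 2 for a in x)
    return covariance / variance

# Hypothetical data, invented for illustration (not from the slides).
age_years = [1, 2, 3, 4, 5]
income = [30, 35, 41, 44, 50]

age_months = [a * 12 for a in age_years]   # same information, different unit

slope_per_year = slope(age_years, income)   # change in income per 1 year
slope_per_month = slope(age_months, income) # change in income per 1 month
```

The per-year coefficient is 12 times the per-month one even though the data and the strength of the relationship are unchanged, which is why a separate measure such as Variable Importance is needed for ranking variables.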

Slide 61

Slide 61 text

What one point means for each variable: 1 point in Age = 1 year; 1 point in Job Level = 1 level; 1 point in Job Role = Executive → Sales Rep.

Slide 62

Slide 62 text

Agenda • Prediction Model • Analytics Grammar • Variable Importance • Statistical Learning - Variable Importance vs. Coefficient • Statistical Learning - Significance of Relationship • Random Forest - Boruta for Variable Importance 62

Slide 63

Slide 63 text

Logistic Regression

Slide 64

Slide 64 text

Some of the variables are considered ‘not significant’, why?

Slide 65

Slide 65 text

“Shallow men believe in luck or in circumstance. Strong men believe in cause and effect.” Ralph Waldo Emerson 65

Slide 66

Slide 66 text

No content

Slide 67

Slide 67 text

67 Shark Attack Ice Cream Sales

Slide 68

Slide 68 text

68 Shark Attack Ice Cream Sales

Slide 69

Slide 69 text

69 Hot Confounding Shark Attack Ice Cream Sales

Slide 70

Slide 70 text

70 Attrition Marital Status Stock Option Both Stock Option and Marital Status are associated with Attrition.

Slide 71

Slide 71 text

There is a relationship between the Marital Status and the Stock Option. 71

Slide 72

Slide 72 text

Association: Stock Option and Marital Status are associated with each other. (Diagram: Attrition, Marital Status, Stock Option)

Slide 73

Slide 73 text

73 Attrition Marital Status Stock Option Which one is causing the Attrition?

Slide 74

Slide 74 text

By holding the Marital Status constant (e.g. ‘Married’), we can see if the Attrition changes when the Stock Option changes (e.g. 1 → 2). Effect? (Diagram: Attrition, Marital Status, Stock Option)

Slide 75

Slide 75 text

Or, by holding the Stock Option constant, we can see if the Attrition changes when the Marital Status changes (e.g. Single → Married). Effect? (Diagram: Attrition, Marital Status, Stock Option)

Slide 76

Slide 76 text

According to this model, the relationship between the Stock Option and the Attrition turned out not to be a direct one; rather, it arises through the relationship between the Marital Status and the Attrition.
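The idea of "holding one variable constant" can be illustrated without a regression model by comparing attrition rates within each Marital Status group. The counts below are invented so that the effect is easy to see; they are not the seminar's data, and `attrition_rate_by` is a hypothetical helper. A logistic regression does something comparable when it estimates each coefficient with the other predictors held fixed.

```python
from collections import defaultdict

# Hypothetical counts, invented for illustration:
# (marital_status, stock_option_level, employees, employees_who_left)
GROUPS = [
    ("Single", 0, 80, 32),
    ("Single", 1, 20, 8),
    ("Married", 0, 20, 2),
    ("Married", 1, 80, 8),
]

def attrition_rate_by(key_fn):
    """Attrition rate for each group defined by key_fn(marital, stock)."""
    totals = defaultdict(lambda: [0, 0])  # key -> [employees, leavers]
    for marital, stock, employees, leavers in GROUPS:
        key = key_fn(marital, stock)
        totals[key][0] += employees
        totals[key][1] += leavers
    return {key: leavers / employees for key, (employees, leavers) in totals.items()}

# Ignoring Marital Status, Stock Option looks strongly related to Attrition.
overall = attrition_rate_by(lambda marital, stock: stock)

# Holding Marital Status constant, the Stock Option makes no difference.
stratified = attrition_rate_by(lambda marital, stock: (marital, stock))
```

With these counts the overall attrition rate is 34% without the stock option and 16% with it, but within Single employees it is 40% either way and within Married employees 10% either way, so the apparent Stock Option effect comes entirely from its association with Marital Status.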

Slide 77

Slide 77 text

If we remove the Marital Status, the Stock Option now shows up as a significant and more important variable for predicting the Attrition.

Slide 78

Slide 78 text

Agenda • Prediction Model • Analytics Grammar • Variable Importance • Statistical Learning - Variable Importance vs. Coefficient • Statistical Learning - Significance of Relationship • Random Forest - Boruta for Variable Importance 78

Slide 79

Slide 79 text

79 Random Forest

Slide 80

Slide 80 text

80 Forest

Slide 81

Slide 81 text

81 Tree

Slide 82

Slide 82 text

82 Data Result Decision Tree

Slide 83

Slide 83 text

Decision Tree: It creates a series of questions (e.g. Income > 5500, Overtime > 15) to separate data into multiple groups so that each group will have similar values. (Diagram: each question branches on TRUE / FALSE; the groups are shaded by the ratio of TRUE, from 0% to 100%.)

Slide 84

Slide 84 text

84

Slide 85

Slide 85 text

85 Forest

Slide 86

Slide 86 text

Random Forest (Diagram: Data → Random Sampling → multiple Decision Trees → Vote → Result)

Slide 87

Slide 87 text

87 Data Age Working Years Monthly Income Job Level Job Role Attrition 40 22 10424 1 Sales Executive TRUE 33 13 4726 2 Laboratory Technician FALSE 32 12 6694 3 Research Scientist FALSE 28 8 8342 2 Manager TRUE 24 6 7123 3 Sales Executive FALSE 28 10 8722 3 Laboratory Technician FALSE 43 20 4233 2 Research Scientist FALSE 38 17 5512 2 Manager TRUE 34 14 2500 1 Sales Executive FALSE 29 9 3394 2 Research Scientist TRUE

Slide 88

Slide 88 text

Age Working Years Monthly Income Job Level Job Role Attrition 40 22 10424 1 Sales Executive TRUE 33 13 4726 2 Laboratory Technician FALSE 32 12 6694 3 Research Scientist FALSE 28 8 8342 2 Manager TRUE 24 6 7123 3 Sales Executive FALSE 28 10 8722 3 Laboratory Technician FALSE 43 20 4233 2 Research Scientist FALSE 38 17 5512 2 Manager TRUE 34 14 2500 1 Sales Executive FALSE 29 9 3394 2 Research Scientist TRUE 88 Target Variable

Slide 89

Slide 89 text

89 Age Monthly Income Attrition 40 8157 TRUE 33 4456 FALSE 32 3600 FALSE 28 2500 TRUE Data Sample rows and columns randomly. Attrition variable is always in. Gender Job Level Working Years Attrition 28 1 8 FALSE 43 3 22 FALSE 38 2 13 TRUE Age Job Role Attrition 33 Sales Executive TRUE 32 Manager FALSE 33 Sales Executive FALSE

Slide 90

Slide 90 text

90 Data Age Monthly Income Attrition 40 8157 TRUE 33 4456 FALSE 32 3600 FALSE 28 2500 TRUE Gender Job Level Working Years Attrition 28 1 8 FALSE 43 3 22 FALSE 38 2 13 TRUE Age Job Role Attrition 33 Sales Executive TRUE 32 Manager FALSE 33 Sales Executive FALSE Build a tree model for each sampled data. Each tree is slightly different.

Slide 91

Slide 91 text

Take the mean or the majority value. (Diagram: Data → Sampling → Decision Trees → Vote → Result)
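The sampling, per-sample tree, and voting steps can be sketched in a few dozen lines. This is a deliberately simplified illustration, not how Exploratory or any production library builds a forest: each "tree" here is a one-question stump, columns are sampled once per tree, and only the four numeric columns of the example table (Age, Working Years, Monthly Income, Job Level) are used. The function names and defaults are assumptions.

```python
import random
from collections import Counter

# Numeric columns of the slide's example table:
# (Age, Working Years, Monthly Income, Job Level) and the Attrition answers.
X = [
    (40, 22, 10424, 1), (33, 13, 4726, 2), (32, 12, 6694, 3), (28, 8, 8342, 2),
    (24, 6, 7123, 3), (28, 10, 8722, 3), (43, 20, 4233, 2), (38, 17, 5512, 2),
    (34, 14, 2500, 1), (29, 9, 3394, 2),
]
y = [True, False, False, True, False, False, False, True, False, True]

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, labels, features):
    """Pick the single 'feature > threshold' question with the fewest training errors."""
    best = None
    for f in features:
        for t in sorted({r[f] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[f] <= t]
            right = [l for r, l in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            left_label, right_label = majority(left), majority(right)
            errors = sum(l != left_label for l in left) + sum(l != right_label for l in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, left_label, right_label)
    if best is None:  # no usable split: always predict the majority label
        m = majority(labels)
        return (features[0], float("inf"), m, m)
    return best[1:]

def fit_forest(rows, labels, n_trees=25, n_features=2, seed=1):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]        # sample rows (with replacement)
        feats = rng.sample(range(len(rows[0])), n_features)   # sample columns
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return forest

def predict(forest, row):
    """Each stump votes; the majority value is the forest's prediction."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return majority(votes)

forest = fit_forest(X, y, seed=1)
predictions = [predict(forest, row) for row in X]
```

For a numeric target the vote would be replaced by the mean of the trees' predictions. Passing a different `seed` draws different samples, so the individual stumps, and possibly the final votes, can change.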

Slide 92

Slide 92 text

No content

Slide 93

Slide 93 text

93 Challenges with Variable Importance

Slide 94

Slide 94 text

1. Do all the predictor variables have meaningful effects on the target variable? Maybe some of the variables have nothing to do with it. 2. Random Forest’s randomness means the result is not guaranteed to be the same every time we build the same model.

Slide 95

Slide 95 text

Random sampling doesn’t guarantee that the result is the same every time. (Diagram: Data → Random Sampling → Decision Trees → Vote → Result)

Slide 96

Slide 96 text

96 A Result with a Random Seed 1

Slide 97

Slide 97 text

A Result with a Random Seed 2. A different seed produces different ranks.

Slide 98

Slide 98 text

Boruta • If there is randomness in the result, the variables suggested as important might look important just by chance. • We can test whether they are consistently important or not with a statistical test.

Slide 99

Slide 99 text

99

Slide 100

Slide 100 text

100

Slide 101

Slide 101 text

Building Random Forest models 20 times generates 20 values of the importance metric for each variable. The distribution of these 20 values is then visualized with a boxplot.

Slide 102

Slide 102 text

The result shows whether each variable is useful for predicting the target variable, based on the statistical test.

Slide 103

Slide 103 text

Variables that are confirmed ‘Useful’.

Slide 104

Slide 104 text

Variables that are not confirmed as either ‘Useful’ or ‘Not Useful’.

Slide 105

Slide 105 text

105 Variables that are confirmed ‘Not Useful’.

Slide 106

Slide 106 text

106 How does Boruta do the test?

Slide 107

Slide 107 text

• For each variable, count the number of times its variable importance score is better than that of the best shadow variable. • Perform a hypothesis test on the counts and evaluate the statistical significance for each variable.

Slide 108

Slide 108 text

108 Data Job Level Working Years Age 1 25 27 3 35 40 2 40 41 1 22 22 1 33 35

Slide 109

Slide 109 text

Create Shadow Variables
Original columns (Job Level, Working Years, Age), one row each: 1 25 27; 3 35 40; 2 40 41; 1 22 22; 1 33 35
Shadow columns (Job Level Shadow, Working Years Shadow, Age Shadow), one row each: 1 22 41; 1 33 22; 1 35 35; 3 25 40; 2 40 27
• Copy the original variables and shuffle the values randomly.
• These randomly shuffled variables shouldn’t have any association with the target variable.
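Creating shadow variables amounts to copying each column and shuffling the copy on its own. The sketch below uses the three original columns from the example; `add_shadow_variables` and the `" Shadow"` suffix are names chosen for illustration, and the shuffled order depends on the seed.

```python
import random

def add_shadow_variables(table, seed=0):
    """Return the table plus a shuffled copy of every column.

    Shuffling each column independently keeps its values but breaks any
    link between that column and the target variable.
    """
    rng = random.Random(seed)
    shadows = {}
    for name, values in table.items():
        shuffled = list(values)  # copy, so the original column is untouched
        rng.shuffle(shuffled)
        shadows[name + " Shadow"] = shuffled
    return {**table, **shadows}

data = {
    "Job Level": [1, 3, 2, 1, 1],
    "Working Years": [25, 35, 40, 22, 33],
    "Age": [27, 40, 41, 22, 35],
}
with_shadows = add_shadow_variables(data)
```

Because a shadow column contains exactly the same values as its original, any importance it earns in a model can only come from chance, which makes it a useful yardstick.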

Slide 110

Slide 110 text

110 Run Variable Importance Shadow Shadow Shadow

Slide 111

Slide 111 text

111 Shadow Shadow Shadow The best scoring shadow variable.

Slide 112

Slide 112 text

Count how many times each variable scores better than the best shadow variable. (Chart: in this run, two variables are a Hit and one is Not Hit.)

Slide 113

Slide 113 text

113 Repeat and Count Shadow Shadow Shadow
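The "repeat and count" step can be sketched as follows. The importance scores are invented for three repetitions only to show the bookkeeping; in a real run each repetition would come from a freshly built Random Forest with freshly shuffled shadow variables. `count_hits` is a hypothetical helper name.

```python
def count_hits(runs):
    """Count, per variable, how often it beats the best shadow variable.

    runs: list of (real_scores, shadow_scores) pairs, one per repetition,
    where each element is a dict of variable name -> importance score.
    """
    hits = {}
    for real_scores, shadow_scores in runs:
        best_shadow = max(shadow_scores.values())
        for name, score in real_scores.items():
            hits[name] = hits.get(name, 0) + int(score > best_shadow)
    return hits

# Hypothetical importance scores, invented for illustration.
runs = [
    ({"Job Level": 0.30, "Working Years": 0.22, "Age": 0.05},
     {"Job Level Shadow": 0.08, "Working Years Shadow": 0.10, "Age Shadow": 0.04}),
    ({"Job Level": 0.28, "Working Years": 0.09, "Age": 0.12},
     {"Job Level Shadow": 0.11, "Working Years Shadow": 0.06, "Age Shadow": 0.07}),
    ({"Job Level": 0.25, "Working Years": 0.20, "Age": 0.03},
     {"Job Level Shadow": 0.05, "Working Years Shadow": 0.09, "Age Shadow": 0.02}),
]

hits = count_hits(runs)
```

The resulting hit counts are the input to the hypothesis test described next.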

Slide 114

Slide 114 text

Hypothesis Test
Null Hypothesis: There is no difference between a given variable and the best shadow variable in terms of the variable importance.
Alternative Hypothesis 1: A given variable is better than the best shadow variable. (Useful)
Alternative Hypothesis 2: A given variable is worse than the best shadow variable. (Not Useful)

Slide 115

Slide 115 text

Under the assumption of the Null Hypothesis, this is the distribution of the number of Hits across 20 experiments. (Chart: distribution of the number of Hits, x-axis from 0 to 20.)

Slide 116

Slide 116 text

If we take 5% (0.05) as the threshold of significance… (Chart: distribution of the number of Hits, with the P Value threshold of 5% marked.)

Slide 117

Slide 117 text

If a given variable was better 15 times out of 20…

Slide 118

Slide 118 text

this can happen with less than a 5% chance.

Slide 119

Slide 119 text

We can reject the null hypothesis and conclude that this variable is Useful for prediction.

Slide 120

Slide 120 text

If a given variable was better only 5 times out of 20…

Slide 121

Slide 121 text

this can happen with less than a 5% chance.

Slide 122

Slide 122 text

We can reject the null hypothesis and conclude that this variable is Not Useful for prediction.

Slide 123

Slide 123 text

If a given variable was better 12 times out of 20…

Slide 124

Slide 124 text

this can happen with greater than a 5% chance.

Slide 125

Slide 125 text

We can NOT reject the null hypothesis, so we can NOT conclude whether this variable is Useful or Not Useful for prediction.

Slide 126

Slide 126 text

A number of hits in the upper tail lets us conclude ‘Useful’. A number of hits in the middle doesn’t let us conclude either way. A number of hits in the lower tail lets us conclude ‘Not Useful’.
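The three cases above (15, 5, and 12 hits out of 20) can be checked with a binomial tail probability. The sketch assumes that, under the null hypothesis, a variable beats the best shadow variable with probability 0.5 in each run and that each tail is compared against 5% on its own; these are assumptions made to mirror the slides, and the actual Boruta procedure may differ in details such as the number of runs or corrections for multiple testing.

```python
from math import comb

def prob_at_least(hits, runs, p=0.5):
    """P(number of hits >= hits) when each run is a hit with probability p."""
    return sum(comb(runs, k) * p**k * (1 - p) ** (runs - k) for k in range(hits, runs + 1))

def prob_at_most(hits, runs, p=0.5):
    """P(number of hits <= hits) when each run is a hit with probability p."""
    return sum(comb(runs, k) * p**k * (1 - p) ** (runs - k) for k in range(0, hits + 1))

def decide(hits, runs=20, alpha=0.05):
    """Classify a variable from its hit count, as in the slides' three regions."""
    if prob_at_least(hits, runs) < alpha:
        return "Useful"
    if prob_at_most(hits, runs) < alpha:
        return "Not Useful"
    return "Undecided"

decisions = {hits: decide(hits) for hits in (15, 5, 12)}
```

Under these assumptions, 15 or more hits out of 20 has a probability of about 2%, 5 or fewer also about 2%, and 12 or more about 25%, which matches the three conclusions on the slides.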

Slide 127

Slide 127 text

127

Slide 128

Slide 128 text

Summary

Slide 129

Slide 129 text

129 A model is a definition of a pattern the algorithm has captured in the data. Algorithm Model Conversion Age Time Country Industry TRUE 60 120 Japan Ad FALSE 45 55 US Medical FALSE 52 20 US Media TRUE 48 140 Japan Ad TRUE 53 80 UK Bank FALSE 35 30 Japan Media

Slide 130

Slide 130 text

Insight: A model is not just for prediction; we can also use it to learn a lot about the patterns in the data. • Which variables have a stronger relationship with the target variable? • How are they related? • Are they significant? • What is the quality if we used this model to predict? (Diagram: Algorithm → Model)

Slide 131

Slide 131 text

Numeric TRUE/FALSE TRUE/FALSE + Time Linear Regression Random Forest / XGBoost Logistic Regression Cox Regression Survival Forest 131 Random Forest / XGBoost Analytics Grammar A common framework for understanding the patterns and relationships in data Regression Model Classification Model Survival Model

Slide 132

Slide 132 text

Variable Importance Prediction by Variable Coefficient 132 Analytics Grammar Evaluation

Slide 133

Slide 133 text

Statistical Learning Machine Learning Data Type Model Type Algorithm Evaluation Relationship R Squared RMSE AUC F Score Hazard Ratio R Squared Variable Importance Prediction by Variable Slope Significance Odds Ratio 133 Coefficient Linear Regression Random Forest / XGBoost Logistic Regression Cox Regression Survival Forest Random Forest / XGBoost Numeric TRUE/FALSE TRUE/FALSE + Time Statistical Learning Machine Learning Statistical Learning Machine Learning Variable Importance Prediction by Variable Variable Importance Prediction by Variable Variable Importance Prediction by Variable Variable Importance Prediction by Variable Variable Importance Prediction by Variable Significance Significance R Squared RMSE R Squared AUC F Score Significance Significance Significance Survival Curve Survival Curve

Slide 134

Slide 134 text

Variable Importance Which variables have stronger importance with a target variable? 134

Slide 135

Slide 135 text

Variable Importance • Which variables are more important for predicting the target variable? • Which variables have a stronger relationship with the target variable?

Slide 136

Slide 136 text

That’s it for today!

Slide 137

Slide 137 text

Next Seminar

Slide 138

Slide 138 text

EXPLORATORY Online Seminar #52 7/21/2021 (Wed) 11AM PT Exploratory Server

Slide 139

Slide 139 text

No content

Slide 140

Slide 140 text

Information Email kan@exploratory.io Website https://exploratory.io Twitter @ExploratoryData Seminar https://exploratory.io/online-seminar

Slide 141

Slide 141 text

Q & A 141

Slide 142

Slide 142 text

EXPLORATORY 142