Seminar #51 - Machine Learning - How Variable Importance Works

In Exploratory, when you build machine learning or statistical learning models, you will see a tab called 'Importance' that shows which variables are more important for predicting the values of a given target variable.

In this seminar, Kan will explain how variable importance is calculated as well as how to interpret the result. He will also introduce a method called 'Boruta', which is used to address challenges brought by the randomness of Random Forest models.

Kan Nishida

July 07, 2021

Transcript

  1. EXPLORATORY
    Online Seminar #51
    Machine Learning -
    Variable Importance

    View Slide

  2. Kan Nishida
    CEO/co-founder
    Exploratory
    Summary
    In Spring 2016, launched Exploratory, Inc. to democratize
    Data Science.
    Prior to Exploratory, Kan was a director of product
    development at Oracle leading teams to build various Data
    Science products in areas including Machine Learning, BI,
    Data Visualization, Mobile Analytics, Big Data, etc.
    While at Oracle, Kan also provided training and consulting
    services to help organizations transform with data.
    @KanAugust
    Speaker

    View Slide

  3. 3
    Questions Communication
    Data Access
    Data Wrangling
    Visualization
    Analytics
    (Statistics / Machine
    Learning)
    Data Science Workflow

    View Slide

  4. 4
    Questions Communication
    (Dashboard, Note, Slides)
    Data Access
    Data Wrangling
    Visualization
    Analytics
    (Statistics / Machine
    Learning)
    Exploratory: Modern & Simple UI

    View Slide

  5. EXPLORATORY
    Online Seminar #51
    Machine Learning -
    Variable Importance

    View Slide

  6. Agenda
    • Prediction Model
    • Analytics Grammar
    • Variable Importance
    • Statistical Learning - Variable Importance vs. Coefficient
    • Statistical Learning - Significance of Relationship
    • Random Forest - Boruta for Variable Importance
    6

    View Slide

  7. Agenda
    • Prediction Model
    • Analytics Grammar
    • Variable Importance
    • Statistical Learning - Variable Importance vs. Coefficient
    • Statistical Learning - Significance of Relationship
    • Random Forest - Boruta for Variable Importance
    7

    View Slide

  8. 8
    • We have data that have answers (numerical values or labels)
    for what we are interested in. (e.g. Sales, Conversion,
    Attrition, etc.)
    • Use algorithms to detect relationships and patterns that can
    be used to identify the answers and model them as formulas
    or rules.
    • Use the models to predict the answers for the data with no
    answers.
    Prediction Model

    View Slide

  9. Use Case:
    • We know who converted to paid customers and who didn’t
    convert in the past.
    • We have customer attribute data for those who converted and
    those who didn’t convert.
    • Based on these data, we want to predict which of the current
    leads will convert.

    View Slide

  10. 10
    Algorithm Model
    Build a Prediction Model.
    Conversion Age Time Country Industry
    TRUE 60 120 Japan Ad
    FALSE 45 55 US Medical
    FALSE 52 20 US Media
    TRUE 48 140 Japan Ad
    TRUE 53 80 UK Bank
    FALSE 35 30 Japan Media
    Answers
    Target Variable

    View Slide

  11. 11
    A model is a definition of a pattern the algorithm
    has captured in the data.
    Algorithm Model
    Conversion Age Time Country Industry
    TRUE 60 120 Japan Ad
    FALSE 45 55 US Medical
    FALSE 52 20 US Media
    TRUE 48 140 Japan Ad
    TRUE 53 80 UK Bank
    FALSE 35 30 Japan Media

    View Slide

  12. 12
    Predict
    Conversion Age Time Country Industry
    TRUE 25 120 Japan Ad
    FALSE 23 55 US Media
    FALSE 40 150 US Ad
    Conversion Age Time Country Industry
    ? 25 120 Japan Ad
    ? 23 55 US Media
    ? 40 150 US Ad
    Algorithm Model
    Conversion Age Time Country Industry
    TRUE 60 120 Japan Ad
    FALSE 45 55 US Medical
    FALSE 52 20 US Media
    TRUE 48 140 Japan Ad
    TRUE 53 80 UK Bank
    FALSE 35 30 Japan Media
    No Answers

    View Slide

  13. All models are approximations of the real world.
    “All models are wrong, but some are useful.”
    George Box, British Statistician

    View Slide

  14. 14
    Not just for the prediction, we can also use it to learn a lot
    about the patterns in data.
    Insight
    • Which variables have a stronger relationship with the target variable?
    • How are they related?
    • Are they significant?
    • What would the quality be if we used this model to predict?
    Algorithm Model

    View Slide

  15. Agenda
    • Prediction Model with Machine Learning / Statistical Learning
    • Analytics Grammar
    • Variable Importance
    • Statistical Learning - Variable Importance vs. Coefficient
    • Statistical Learning - Significance of Relationship
    • Random Forest - Boruta for Variable Importance
    15

    View Slide

  16. We want to build a prediction model,
    but which algorithm should we use?

    View Slide

  17. Algorithms by Data Type of the Target Variable
    Data Type            Model Type              Statistical Learning    Machine Learning
    Numeric              Regression Model        Linear Regression       Random Forest / XGBoost
    TRUE/FALSE           Classification Model    Logistic Regression     Random Forest / XGBoost
    TRUE/FALSE + Time    Survival Model          Cox Regression          Survival Forest

    View Slide

  18. Each algorithm produces its own Output, with various ways of interpretation.
    Data Type            Model Type              Statistical Learning    Machine Learning
    Numeric              Regression Model        Linear Regression       Random Forest / XGBoost
    TRUE/FALSE           Classification Model    Logistic Regression     Random Forest / XGBoost
    TRUE/FALSE + Time    Survival Model          Cox Regression          Survival Forest

    View Slide

  19. • The main difference among the various prediction models is
    what kinds of patterns they could capture in the data.
    • We want to find out the pattern or the relationship in the data
    the algorithms have found.
    • Regardless of which algorithms we use, can’t we have a
    standard framework to understand such patterns?

    View Slide

  20. Analytics Grammar
    A common framework for understanding
    the patterns and relationships in data
    Data Type            Model Type              Statistical Learning    Machine Learning
    Numeric              Regression Model        Linear Regression       Random Forest / XGBoost
    TRUE/FALSE           Classification Model    Logistic Regression     Random Forest / XGBoost
    TRUE/FALSE + Time    Survival Model          Cox Regression          Survival Forest

    View Slide

  21. Analytics Grammar
    • Variable Importance
    • Prediction by Variable
    • Coefficient
    • Evaluation

    View Slide

  22. Variable Importance
    • Which variables are more important for predicting the target variable?
    • Which variables have a stronger relationship with the target variable?

    View Slide

  23. Prediction by Variable
    How does the target variable change when a given predictor variable changes?
    • Target Variable is Numerical: Y-Axis shows the predicted numerical value of the
    target variable.
    • Target Variable is Logical: Y-Axis shows the predicted probability of the
    target variable being TRUE.

    View Slide

  24. Coefficient
    • How much does the target value change when a given predictor variable
    changes by one point?
    • Is the change statistically significant? (Hypothesis Test)
    Only for Statistical Learning models
    (e.g. Linear Regression, Logistic
    Regression, etc.)

    View Slide

  25. Evaluation
    • Statistical Significance (only for Statistical Learning models): Check whether
    the relationship the prediction model captured is significant or not.
    • Prediction Quality: Check how well a given prediction model fits reality.

    View Slide

  26. [Table: Analytics Grammar by algorithm]
    Columns: Linear Regression and Random Forest / XGBoost (Numeric); Logistic Regression and Random Forest / XGBoost (TRUE/FALSE); Cox Regression and Survival Forest (TRUE/FALSE + Time). In each pair, the first is Statistical Learning and the second is Machine Learning.
    Rows: Data Type, Model Type, Algorithm, Relationship, Evaluation.
    Relationship: Variable Importance and Prediction by Variable for every algorithm (Survival Curve for the survival models); Coefficient (Slope, Odds Ratio, Hazard Ratio, Significance) for the Statistical Learning models.
    Evaluation: R Squared, RMSE, AUC, F Score, Significance.

    View Slide

  27. [Table: Analytics Grammar by algorithm]
    Columns: Linear Regression and Random Forest / XGBoost (Numeric); Logistic Regression and Random Forest / XGBoost (TRUE/FALSE); Cox Regression and Survival Forest (TRUE/FALSE + Time). In each pair, the first is Statistical Learning and the second is Machine Learning.
    Rows: Data Type, Model Type, Algorithm, Relationship, Evaluation.
    Relationship: Variable Importance and Prediction by Variable for every algorithm (Survival Curve for the survival models); Coefficient (Slope, Odds Ratio, Hazard Ratio, Significance) for the Statistical Learning models.
    Evaluation: R Squared, RMSE, AUC, F Score, Significance.

    View Slide

  28. Agenda
    • Prediction Model
    • Analytics Grammar
    • Variable Importance
    • Statistical Learning - Variable Importance vs. Coefficient
    • Statistical Learning - Significance of Relationship
    • Random Forest - Boruta for Variable Importance
    28

    View Slide

  29. Variable Importance
    Which variables have a stronger relationship with the target variable?

    View Slide

  30. How is ‘variable importance’ calculated?
    • Build a model after removing one of the predictor variables and evaluate
    how much the quality of the model degrades compared to the model
    with all the predictor variables.
    • Repeat for every single predictor variable.
    • Compare the degree of degradation among all the predictor variables.
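
The three steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Exploratory implements it: the `fit` and `quality` helpers are a deliberately simple stand-in model (it remembers the majority answer for each combination of predictor values), and the rows are hypothetical.

```python
from collections import Counter, defaultdict

def fit(rows, predictors, target):
    """Toy model: remember the majority answer for each combination of predictor values."""
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[p] for p in predictors)
        groups[key][row[target]] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in groups.items()}

def quality(model, rows, predictors, target):
    """Percentage of training rows whose prediction matches the known answer."""
    hits = sum(model[tuple(row[p] for p in predictors)] == row[target] for row in rows)
    return 100.0 * hits / len(rows)

def variable_importance(rows, predictors, target):
    """Drop each predictor in turn and measure how far the quality falls from the baseline."""
    baseline = quality(fit(rows, predictors, target), rows, predictors, target)
    importance = {}
    for dropped in predictors:
        kept = [p for p in predictors if p != dropped]
        degraded = quality(fit(rows, kept, target), rows, kept, target)
        importance[dropped] = baseline - degraded
    return baseline, importance

# Hypothetical rows, for illustration only.
rows = [
    {"industry": "Ad", "country": "Japan", "converted": True},
    {"industry": "Ad", "country": "US", "converted": True},
    {"industry": "Media", "country": "Japan", "converted": False},
    {"industry": "Media", "country": "US", "converted": False},
    {"industry": "Bank", "country": "Japan", "converted": True},
    {"industry": "Bank", "country": "US", "converted": False},
    {"industry": "Ad", "country": "Japan", "converted": True},
    {"industry": "Media", "country": "US", "converted": False},
]

baseline, importance = variable_importance(rows, ["industry", "country"], "converted")
print(baseline)    # quality with all predictors
print(importance)  # quality lost when each predictor is removed
```

In this toy data, removing `industry` costs more quality than removing `country`, so `industry` ranks as the more important variable.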

    View Slide

  31. 31
    If it’s a linear regression model, we can evaluate the quality of the
    model by calculating the difference between the actual values and
    the predicted values.
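
As a minimal sketch of this idea, the difference between actual and predicted values can be summarized with RMSE (root mean squared error), one of the metrics listed in the Evaluation row of the Analytics Grammar table. The numbers below are hypothetical and only illustrate the arithmetic.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical values, for illustration only.
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]
print(round(rmse(actual, predicted), 3))  # smaller means a better fit
```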

    View Slide

  32. 32
    Algorithm Model
    Build a Prediction Model.
    Conversion Age Time Industry
    TRUE 60 120 Ad
    FALSE 45 55 Medical
    FALSE 52 20 Media
    TRUE 48 140 Ad
    TRUE 53 80 Bank
    FALSE 35 30 Media

    View Slide

  33. 33
    Algorithm Model
    Predict for the
    training data.
    Conversion Age Time Industry
    FALSE 60 120 Ad
    FALSE 45 55 Medical
    FALSE 52 20 Media
    TRUE 48 140 Ad
    TRUE 53 80 Bank
    FALSE 35 30 Media
    Conversion Age Time Industry
    TRUE 60 120 Ad
    FALSE 45 55 Medical
    FALSE 52 20 Media
    TRUE 48 140 Ad
    TRUE 53 80 Bank
    FALSE 35 30 Media

    View Slide

  34. Conversion Age Time Industry
    FALSE 60 120 Ad
    FALSE 45 55 Medical
    FALSE 52 20 Media
    TRUE 48 140 Ad
    TRUE 53 80 Bank
    FALSE 35 30 Media
    Conversion Age Time Industry
    TRUE 60 120 Ad
    FALSE 45 55 Medical
    FALSE 52 20 Media
    TRUE 48 140 Ad
    TRUE 53 80 Bank
    FALSE 35 30 Media
    34
    Algorithm Model
    Evaluate the quality by
    matching with the existing
    answers.
    Prediction Quality: 90
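
"Matching with the existing answers" can be sketched as a simple accuracy calculation. The sketch below uses the Conversion columns of the two tables on this slide (predicted first, known answers second). Accuracy is an assumption here; the slides do not say which quality metric is used, and the scores shown on the slides (90, 80, and so on) appear to be illustrative round numbers, not values computed from these six rows.

```python
def accuracy(predicted, actual):
    """Percentage of predictions that match the known answers."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * matches / len(actual)

# Conversion columns from the two tables on this slide.
predicted = [False, False, False, True, True, False]
actual = [True, False, False, True, True, False]

score = accuracy(predicted, actual)
print(f"{score:.1f}")  # 5 of the 6 rows match
```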

    View Slide

  35. 35
    Algorithm Model
    Baseline
    Prediction Quality: 90 Conversion Age Time Industry
    FALSE 60 120 Ad
    FALSE 45 55 Medical
    FALSE 52 20 Media
    TRUE 48 140 Ad
    TRUE 53 80 Bank
    FALSE 35 30 Media
    Conversion Age Time Industry
    TRUE 60 120 Ad
    FALSE 45 55 Medical
    FALSE 52 20 Media
    TRUE 48 140 Ad
    TRUE 53 80 Bank
    FALSE 35 30 Media

    View Slide

  36. 36
    Algorithm Model
    Conversion Age Time Industry
    TRUE 60 120 Ad
    FALSE 45 55 Medical
    FALSE 52 20 Media
    TRUE 48 140 Ad
    TRUE 53 80 Bank
    FALSE 35 30 Media
    Build a Model without ‘Time’ variable.

    View Slide

  37. 37
    Algorithm Model
    Predict for the
    training data.
    Conversion Age Time Industry
    TRUE 60 120 Ad
    TRUE 45 55 Medical
    FALSE 52 20 Media
    TRUE 48 140 Ad
    TRUE 53 80 Bank
    TRUE 35 30 Media
    Conversion Age Time Industry
    TRUE 60 120 Ad
    FALSE 45 55 Medical
    FALSE 52 20 Media
    TRUE 48 140 Ad
    TRUE 53 80 Bank
    FALSE 35 30 Media

    View Slide

  38. 38
    Algorithm Model
    Conversion Age Time Industry
    TRUE 60 120 Ad
    TRUE 45 55 Medical
    FALSE 52 20 Media
    TRUE 48 140 Ad
    TRUE 53 80 Bank
    TRUE 35 30 Media
    Conversion Age Time Industry
    TRUE 60 120 Ad
    FALSE 45 55 Medical
    FALSE 52 20 Media
    TRUE 48 140 Ad
    TRUE 53 80 Bank
    FALSE 35 30 Media
    Evaluate the quality by
    matching with the existing
    answers.
    Prediction Quality: 80

    View Slide

  39. Variable Importance
    Time: 10
    Without the ‘Time’ variable, the quality of the model degrades by
    10 points from the baseline.

    View Slide

  40. 40
    Algorithm Model
    Build a Model without ‘Age’ variable.
    Conversion Age Time Industry
    TRUE 60 120 Ad
    FALSE 45 55 Medical
    FALSE 52 20 Media
    TRUE 48 140 Ad
    TRUE 53 80 Bank
    FALSE 35 30 Media

    View Slide

  41. 41
    Algorithm Model
    Predict for the
    training data.
    Conversion Age Time Industry
    TRUE 60 120 Ad
    TRUE 45 55 Medical
    FALSE 52 20 Media
    TRUE 48 140 Ad
    TRUE 53 80 Bank
    TRUE 35 30 Media
    Conversion Age Time Industry
    TRUE 60 120 Ad
    FALSE 45 55 Medical
    FALSE 52 20 Media
    TRUE 48 140 Ad
    TRUE 53 80 Bank
    FALSE 35 30 Media

    View Slide

  42. 42
    Algorithm Model
    Conversion Age Time Industry
    TRUE 60 120 Ad
    TRUE 45 55 Medical
    FALSE 52 20 Media
    TRUE 48 140 Ad
    TRUE 53 80 Bank
    TRUE 35 30 Media
    Conversion Age Time Industry
    TRUE 60 120 Ad
    FALSE 45 55 Medical
    FALSE 52 20 Media
    TRUE 48 140 Ad
    TRUE 53 80 Bank
    FALSE 35 30 Media
    Evaluate the quality by
    matching with the existing
    answers.
    Prediction Quality: 70

    View Slide

  43. Variable Importance
    Time: 10
    Age: 20
    Without the ‘Age’ variable, the quality of the model degrades by 20
    points from the baseline.

    View Slide

  44. 44
    Algorithm Model
    Build a Model without ‘Industry’ variable.
    Conversion Age Time Industry
    TRUE 60 120 Ad
    FALSE 45 55 Medical
    FALSE 52 20 Media
    TRUE 48 140 Ad
    TRUE 53 80 Bank
    FALSE 35 30 Media

    View Slide

  45. 45
    Algorithm Model
    Predict for the
    training data.
    Conversion Age Time Industry
    TRUE 60 120 Ad
    TRUE 45 55 Medical
    FALSE 52 20 Media
    TRUE 48 140 Ad
    TRUE 53 80 Bank
    TRUE 35 30 Media
    Conversion Age Time Industry
    TRUE 60 120 Ad
    FALSE 45 55 Medical
    FALSE 52 20 Media
    TRUE 48 140 Ad
    TRUE 53 80 Bank
    FALSE 35 30 Media

    View Slide

  46. 46
    Algorithm Model
    Conversion Age Time Industry
    TRUE 60 120 Ad
    TRUE 45 55 Medical
    FALSE 52 20 Media
    TRUE 48 140 Ad
    TRUE 53 80 Bank
    TRUE 35 30 Media
    Conversion Age Time Industry
    TRUE 60 120 Ad
    FALSE 45 55 Medical
    FALSE 52 20 Media
    TRUE 48 140 Ad
    TRUE 53 80 Bank
    FALSE 35 30 Media
    Evaluate the quality by
    matching with the existing
    answers.
    Prediction Quality: 20

    View Slide

  47. Variable Importance
    Time: 10
    Age: 20
    Industry: 70 (70 points decrease)
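
The arithmetic behind these three bars, using the quality scores from the preceding slides (baseline 90; 80 without Time, 70 without Age, 20 without Industry), can be written out as a short sketch. The variable names are made up for this example.

```python
# Quality scores taken from the slides.
baseline_quality = 90
quality_without = {"Time": 80, "Age": 70, "Industry": 20}

# Importance = how many points the quality drops when the variable is removed.
importance = {name: baseline_quality - q for name, q in quality_without.items()}

# Rank the variables from most to least important.
ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
for name, drop in ranked:
    print(name, drop)
```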

    View Slide

  48. We realize their importance
    after we’ve lost them…

    View Slide

  49. Who is more important?

    View Slide

  50. How does the performance degrade
    when John Lennon is not here?

    View Slide

  51. How about without Ringo Starr?

    View Slide

  52. How about without George?

    View Slide

  53. How about without Paul?

    View Slide

  54. In Exploratory
    54

    View Slide

  55. View Slide

  56. Agenda
    • Prediction Model
    • Analytics Grammar
    • Variable Importance
    • Statistical Learning - Variable Importance vs. Coefficient
    • Statistical Learning - Significance of Relationship
    • Random Forest - Boruta for Variable Importance
    56

    View Slide

  57. Can we use the coefficient view to
    compare the importance of the variables?
    57

    View Slide

  58. [Table: Analytics Grammar by algorithm]
    Columns: Linear Regression and Random Forest / XGBoost (Numeric); Logistic Regression and Random Forest / XGBoost (TRUE/FALSE); Cox Regression and Survival Forest (TRUE/FALSE + Time). In each pair, the first is Statistical Learning and the second is Machine Learning.
    Rows: Data Type, Model Type, Algorithm, Relationship, Evaluation.
    Relationship: Variable Importance and Prediction by Variable for every algorithm (Survival Curve for the survival models); Coefficient (Slope, Odds Ratio, Hazard Ratio, Significance) for the Statistical Learning models.
    Evaluation: R Squared, RMSE, AUC, F Score, Significance.

    View Slide

  59. The coefficient shows how much the target value would change when a given
    predictor variable changes by one point.

    View Slide

  60. Given that the units of the predictor variables are different, bigger coefficient
    values don’t necessarily mean a stronger relationship with the target variable.
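
A small sketch can make the unit problem concrete. Assuming a single-predictor least-squares fit and hypothetical data, expressing the same predictor in months instead of years shrinks the coefficient by a factor of 12, even though the relationship with the target is unchanged.

```python
def slope(x, y):
    """Least-squares slope of y on a single predictor x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    variance = sum((xi - mean_x) ** 2 for xi in x)
    return covariance / variance

# Hypothetical data, for illustration only.
working_years = [1, 2, 3, 4, 5]
monthly_income = [3000, 3500, 4000, 4500, 5000]
working_months = [12 * years for years in working_years]  # same information, different unit

per_year = slope(working_years, monthly_income)
per_month = slope(working_months, monthly_income)
print(per_year, per_month)  # the per-month coefficient is 12 times smaller
```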

    View Slide

  61. • 1 Point in Age: 1 Year
    • 1 Point in Job Level: 1 Level
    • 1 Point in Job Role: Executive → Sales Rep.

    View Slide

  62. Agenda
    • Prediction Model
    • Analytics Grammar
    • Variable Importance
    • Statistical Learning - Variable Importance vs. Coefficient
    • Statistical Learning - Significance of Relationship
    • Random Forest - Boruta for Variable Importance
    62

    View Slide

  63. Logistic Regression

    View Slide

  64. Why are some of the variables considered ‘not significant’?

    View Slide

  65. “Shallow men believe in luck or in circumstance.
    Strong men believe in cause and effect.”
    Ralph Waldo Emerson
    65

    View Slide

  66. View Slide

  67. 67
    Shark
    Attack
    Ice Cream
    Sales

    View Slide

  68. 68
    Shark
    Attack
    Ice Cream
    Sales

    View Slide

  69. 69
    Hot
    Confounding
    Shark
    Attack
    Ice Cream
    Sales

    View Slide

  70. 70
    Attrition
    Marital
    Status
    Stock
    Option
    Both Stock Option and Marital Status are associated with Attrition.

    View Slide

  71. There is a relationship between the Marital Status and the Stock Option.
    71

    View Slide

  72. 72
    Association
    Attrition
    Marital
    Status
    Stock
    Option
    Stock Option and Marital Status are associated with each other.

    View Slide

  73. 73
    Attrition
    Marital
    Status
    Stock
    Option
    Which one is causing the Attrition?

    View Slide

  74. By holding the ‘Marital Status’ constant, we can see if the Attrition changes when
    the Stock Option changes.
    Stock Option: 1 → 2
    Marital Status: ‘Married’ (Keep Constant)
    Attrition: Effect?

    View Slide

  75. Or, by holding the Stock Option constant, we can see if the Attrition changes
    when the Marital Status changes.
    Marital Status: Single → Married
    Stock Option: 1 (Keep Constant)
    Attrition: Effect?

    View Slide

  76. It turned out that, according to this model, the relationship between the Stock Option
    and the Attrition is not a direct one; rather, it goes through the relationship between
    the Marital Status and the Attrition.
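
The "hold one variable constant" idea can be illustrated with a simplified sketch. This is not the logistic regression model itself, just a comparison of attrition rates, and the counts are hypothetical, chosen so that Stock Option looks related to Attrition overall but not once Marital Status is held fixed.

```python
# (marital_status, stock_option, people, attritions) -- hypothetical counts.
groups = [
    ("Single", 0, 40, 20),
    ("Single", 1, 10, 5),
    ("Married", 0, 10, 1),
    ("Married", 1, 40, 4),
]

def attrition_rate(selected):
    """Share of people who left among the selected groups."""
    people = sum(g[2] for g in selected)
    left = sum(g[3] for g in selected)
    return left / people

# Ignoring Marital Status: Stock Option appears to matter.
overall = {s: attrition_rate([g for g in groups if g[1] == s]) for s in (0, 1)}

# Holding Marital Status constant: the rate no longer depends on Stock Option.
within = {
    (m, s): attrition_rate([g for g in groups if g[0] == m and g[1] == s])
    for m in ("Single", "Married")
    for s in (0, 1)
}
print(overall)
print(within)
```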

    View Slide

  77. If we remove the Marital Status, the Stock Option now comes out as a significant
    and more important variable for predicting the Attrition.

    View Slide

  78. Agenda
    • Prediction Model
    • Analytics Grammar
    • Variable Importance
    • Statistical Learning - Variable Importance vs. Coefficient
    • Statistical Learning - Significance of Relationship
    • Random Forest - Boruta for Variable Importance
    78

    View Slide

  79. 79
    Random Forest

    View Slide

  80. 80
    Forest

    View Slide

  81. 81
    Tree

    View Slide

  82. 82
    Data
    Result
    Decision Tree

    View Slide

  83. Decision Tree
    It creates a series of
    questions to separate data
    into multiple groups so that
    each group will have similar
    values.
    [Figure: a decision tree that splits on "Income > 5500" and "Overtime > 15" (TRUE/FALSE branches); the leaves show the Ratio of TRUE: 100%, 50%, 0%.]

    View Slide

  84. 84

    View Slide

  85. 85
    Forest

    View Slide

  86. Decision Tree
    Random Forest
    Data
    Sampling Sampling Sampling
    Vote Vote Vote
    Result

    Random Sampling

    View Slide

  87. 87
    Data
    Age Working Years Monthly Income Job Level Job Role Attrition
    40 22 10424 1 Sales Executive TRUE
    33 13 4726 2 Laboratory Technician FALSE
    32 12 6694 3 Research Scientist FALSE
    28 8 8342 2 Manager TRUE
    24 6 7123 3 Sales Executive FALSE
    28 10 8722 3 Laboratory Technician FALSE
    43 20 4233 2 Research Scientist FALSE
    38 17 5512 2 Manager TRUE
    34 14 2500 1 Sales Executive FALSE
    29 9 3394 2 Research Scientist TRUE

    View Slide

  88. Age Working Years Monthly Income Job Level Job Role Attrition
    40 22 10424 1 Sales Executive TRUE
    33 13 4726 2 Laboratory Technician FALSE
    32 12 6694 3 Research Scientist FALSE
    28 8 8342 2 Manager TRUE
    24 6 7123 3 Sales Executive FALSE
    28 10 8722 3 Laboratory Technician FALSE
    43 20 4233 2 Research Scientist FALSE
    38 17 5512 2 Manager TRUE
    34 14 2500 1 Sales Executive FALSE
    29 9 3394 2 Research Scientist TRUE
    88
    Target Variable

    View Slide

  89. 89
    Age Monthly Income Attrition
    40 8157 TRUE
    33 4456 FALSE
    32 3600 FALSE
    28 2500 TRUE
    Data
    Sample rows and columns randomly.
    Attrition variable is always in.
    Gender Job Level Working Years Attrition
    28 1 8 FALSE
    43 3 22 FALSE
    38 2 13 TRUE
    Age Job Role Attrition
    33 Sales Executive TRUE
    32 Manager FALSE
    33 Sales Executive FALSE

    View Slide

  90. 90
    Data
    Age Monthly Income Attrition
    40 8157 TRUE
    33 4456 FALSE
    32 3600 FALSE
    28 2500 TRUE
    Gender Job Level Working Years Attrition
    28 1 8 FALSE
    43 3 22 FALSE
    38 2 13 TRUE
    Age Job Role Attrition
    33 Sales Executive TRUE
    32 Manager FALSE
    33 Sales Executive FALSE
    Build a tree model for each sampled data.
    Each tree is slightly different.

    View Slide

  91. 91
    Data
    Sampling Sampling Sampling
    Vote Vote Vote
    Result

    Take the mean or the majority value.
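
The sampling and voting steps can be sketched as follows. This follows the simplified picture on the slides (rows and columns are sampled once per tree, and the target column is always kept). Sampling rows with replacement is an assumption, the function names are made up for this example, and the tree-building step itself is left out.

```python
import random
from collections import Counter
from statistics import mean

TARGET = "Attrition"
predictors = ["Age", "Working Years", "Monthly Income", "Job Level"]

# A few rows from the data slide.
rows = [
    {"Age": 40, "Working Years": 22, "Monthly Income": 10424, "Job Level": 1, "Attrition": True},
    {"Age": 33, "Working Years": 13, "Monthly Income": 4726, "Job Level": 2, "Attrition": False},
    {"Age": 32, "Working Years": 12, "Monthly Income": 6694, "Job Level": 3, "Attrition": False},
    {"Age": 28, "Working Years": 8, "Monthly Income": 8342, "Job Level": 2, "Attrition": True},
]

def sample_for_tree(rows, predictors, n_columns, rng):
    """Pick random predictor columns (target always included) and random rows."""
    columns = rng.sample(predictors, n_columns) + [TARGET]
    sampled_rows = [rng.choice(rows) for _ in rows]  # with replacement (assumption)
    return [{c: row[c] for c in columns} for row in sampled_rows]

def combine(predictions):
    """Majority value for TRUE/FALSE predictions, mean for numeric predictions."""
    if all(isinstance(p, bool) for p in predictions):
        return Counter(predictions).most_common(1)[0][0]
    return mean(predictions)

rng = random.Random(0)  # a fixed seed makes the sampling repeatable
samples = [sample_for_tree(rows, predictors, 2, rng) for _ in range(3)]

print(combine([True, False, True]))   # votes from three hypothetical trees
print(combine([10.0, 20.0, 30.0]))    # numeric predictions from three hypothetical trees
```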

    View Slide

  92. View Slide

  93. 93
    Challenges with
    Variable Importance

    View Slide

  94. 1. Do all the predictor variables have meaningful effects on the target
    variable? Maybe some of the variables have nothing to do with it.
    2. Random Forest’s randomness means the result is not guaranteed to be the
    same every time we build the same model.

    View Slide

  95. Decision Tree
    Data
    Sampling Sampling Sampling
    Vote Vote Vote
    Result

    Random Sampling
    Random sampling doesn’t guarantee that the result will be
    the same every time.

    View Slide

  96. 96
    A Result with a Random Seed 1

    View Slide

  97. 97
    A different seed causes different ranks.
    A Result with a Random Seed 2

    View Slide

  98. Boruta
    • If there is randomness in the result, the suggested important
    variables might look important just by chance.
    • We can test whether they are consistently important or not with a
    statistical test method.

    View Slide

  99. 99

    View Slide

  100. 100

    View Slide

  101. Building Random Forest models 20 times generates 20
    values of the importance metric. It then visualizes the
    distribution of those 20 values with a boxplot.

    View Slide

  102. 102
    Whether each variable is useful for
    predicting the target variable based on the
    statistical test.

    View Slide

  103. 103
    Variables that are confirmed ‘Useful’.

    View Slide

  104. 104
    Variables that are confirmed
    neither ‘Useful’ nor ‘Not Useful’.

    View Slide

  105. 105
    Variables that are confirmed
    ‘Not Useful’.

    View Slide

  106. 106
    How does Boruta do the test?

    View Slide

  107. • For each variable, count the number of times its variable
    importance score is better than that of the shadow variables.
    • Perform a hypothesis test on the counts and evaluate the
    statistical significance for each variable.

    View Slide

  108. 108
    Data
    Job Level Working Years Age
    1 25 27
    3 35 40
    2 40 41
    1 22 22
    1 33 35

    View Slide

  109. 109
    Job Level Working Years Age
    1 25 27
    3 35 40
    2 40 41
    1 22 22
    1 33 35
    Create Shadow
    Job Level Shadow Working Years Shadow Age Shadow
    1 22 41
    1 33 22
    1 35 35
    3 25 40
    2 40 27
    • Copy the original variables and shuffle the values randomly.
    • These randomly shuffled variables shouldn’t have any association
    with the target variable.
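
Creating the shadow variables can be sketched as a copy-and-shuffle over each column. The function name and the " Shadow" suffix are made up for this example; the column values are the ones from this slide.

```python
import random

def add_shadow_variables(columns, rng):
    """Return the original columns plus a shuffled 'shadow' copy of each one."""
    result = dict(columns)
    for name, values in columns.items():
        shadow = list(values)  # copy, so the original column stays untouched
        rng.shuffle(shadow)    # shuffling breaks any link to the target variable
        result[name + " Shadow"] = shadow
    return result

columns = {
    "Job Level": [1, 3, 2, 1, 1],
    "Working Years": [25, 35, 40, 22, 33],
    "Age": [27, 40, 41, 22, 35],
}
with_shadows = add_shadow_variables(columns, random.Random(1))
for name, values in with_shadows.items():
    print(name, values)
```

Each shadow column holds exactly the same values as its original, only in a different row order.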

    View Slide

  110. 110
    Run Variable Importance
    Shadow Shadow Shadow

    View Slide

  111. 111
    Shadow Shadow Shadow
    The best scoring shadow variable.

    View Slide

  112. 112
    Hit
    Hit
    Not Hit
    Shadow Shadow Shadow
    Count how many times each variable scores
    better than the best shadow variable.

    View Slide

  113. 113
    Repeat and Count
    Shadow Shadow Shadow
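
The "repeat and count" step can be sketched like this. The importance scores are hypothetical and the `count_hits` helper is made up for this example; in practice each run would come from a freshly built Random Forest model.

```python
def count_hits(runs):
    """Count, per variable, how often it beats the best shadow variable.

    runs: list of (real_scores, shadow_scores) dict pairs, one pair per repetition.
    """
    hits = {}
    for real_scores, shadow_scores in runs:
        best_shadow = max(shadow_scores.values())
        for name, score in real_scores.items():
            hits.setdefault(name, 0)
            if score > best_shadow:
                hits[name] += 1  # a "Hit"
    return hits

# Hypothetical importance scores from three repetitions.
runs = [
    ({"Job Level": 0.30, "Working Years": 0.22, "Age": 0.05},
     {"Job Level Shadow": 0.08, "Working Years Shadow": 0.10, "Age Shadow": 0.04}),
    ({"Job Level": 0.28, "Working Years": 0.09, "Age": 0.06},
     {"Job Level Shadow": 0.07, "Working Years Shadow": 0.11, "Age Shadow": 0.05}),
    ({"Job Level": 0.31, "Working Years": 0.20, "Age": 0.12},
     {"Job Level Shadow": 0.06, "Working Years Shadow": 0.09, "Age Shadow": 0.10}),
]

hits = count_hits(runs)
print(hits)
```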

    View Slide

  114. Hypothesis Test
    Null Hypothesis: There is no difference between a given variable and the best
    shadow variable in terms of the variable importance.
    Alternative Hypothesis 1: A given variable is better than the best shadow variable.
    (Useful)
    Alternative Hypothesis 2: A given variable is worse than the best shadow
    variable. (Not Useful)

    View Slide

  115. The distribution of the number of Hits in 20 experiments,
    under the assumption of the Null Hypothesis.
    [Chart: distribution of the number of Hits, with the axis marked at 0, 5, 10, 15, and 20]

    View Slide

  116. If we take 5% (0.05) as the threshold of significance…
    P Value: 5%

    View Slide

  117. If a given variable was better 15 times out of 20…

    View Slide

  118. …this can happen with less than a 5% chance.

    View Slide

  119. We can reject the null hypothesis and conclude that this variable is
    Useful for prediction.

    View Slide

  120. If a given variable was better only 5 times out of 20…

    View Slide

  121. …this can happen with less than a 5% chance.

    View Slide

  122. We can reject the null hypothesis and conclude that this variable is
    Not Useful for prediction.

    View Slide

  123. If a given variable was better 12 times out of 20…

    View Slide

  124. …this can happen with greater than a 5% chance.

    View Slide

  125. We can NOT reject the null hypothesis, so we can NOT conclude whether this
    variable is Useful or Not Useful for prediction.

    View Slide

  126. • A number of hits in the upper tail: we can conclude ‘Useful’.
    • A number of hits in the middle: we can’t conclude either way.
    • A number of hits in the lower tail: we can conclude ‘Not Useful’.
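
The three regions can be reproduced with a small sketch of the test. It assumes that, under the null hypothesis, each run is a Hit with probability 0.5 (so the number of Hits follows a binomial distribution), uses one-sided tail probabilities, and applies no multiple-comparison adjustment. Those are simplifying assumptions for illustration, and the function names are made up here.

```python
from math import comb

def tail_probabilities(hits, runs=20):
    """P(X >= hits) and P(X <= hits) when X ~ Binomial(runs, 0.5)."""
    total = 2 ** runs
    upper = sum(comb(runs, k) for k in range(hits, runs + 1)) / total
    lower = sum(comb(runs, k) for k in range(0, hits + 1)) / total
    return upper, lower

def decide(hits, runs=20, threshold=0.05):
    """Classify a variable from its number of Hits."""
    upper, lower = tail_probabilities(hits, runs)
    if upper < threshold:
        return "Useful"       # this many Hits is unlikely under the null hypothesis
    if lower < threshold:
        return "Not Useful"   # this few Hits is unlikely under the null hypothesis
    return "Undecided"        # cannot reject the null hypothesis

for hits in (15, 5, 12):
    print(hits, decide(hits))
```

With 20 runs, 15 Hits lands in the ‘Useful’ region, 5 Hits in the ‘Not Useful’ region, and 12 Hits in between, matching the three cases walked through on the preceding slides.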

    View Slide

  127. 127

    View Slide

  128. Summary

    View Slide

  129. 129
    A model is a definition of a pattern the algorithm
    has captured in the data.
    Algorithm Model
    Conversion Age Time Country Industry
    TRUE 60 120 Japan Ad
    FALSE 45 55 US Medical
    FALSE 52 20 US Media
    TRUE 48 140 Japan Ad
    TRUE 53 80 UK Bank
    FALSE 35 30 Japan Media

    View Slide

  130. 130
    Not just for the prediction, we can also use it to learn a lot
    about the patterns in data.
    Insight
    • Which variables have a stronger relationship with the target variable?
    • How are they related?
    • Are they significant?
    • What would the quality be if we used this model to predict?
    Algorithm Model

    View Slide

  131. Analytics Grammar
    A common framework for understanding
    the patterns and relationships in data
    Data Type            Model Type              Statistical Learning    Machine Learning
    Numeric              Regression Model        Linear Regression       Random Forest / XGBoost
    TRUE/FALSE           Classification Model    Logistic Regression     Random Forest / XGBoost
    TRUE/FALSE + Time    Survival Model          Cox Regression          Survival Forest

    View Slide

  132. Analytics Grammar
    • Variable Importance
    • Prediction by Variable
    • Coefficient
    • Evaluation

    View Slide

  133. [Table: Analytics Grammar by algorithm]
    Columns: Linear Regression and Random Forest / XGBoost (Numeric); Logistic Regression and Random Forest / XGBoost (TRUE/FALSE); Cox Regression and Survival Forest (TRUE/FALSE + Time). In each pair, the first is Statistical Learning and the second is Machine Learning.
    Rows: Data Type, Model Type, Algorithm, Relationship, Evaluation.
    Relationship: Variable Importance and Prediction by Variable for every algorithm (Survival Curve for the survival models); Coefficient (Slope, Odds Ratio, Hazard Ratio, Significance) for the Statistical Learning models.
    Evaluation: R Squared, RMSE, AUC, F Score, Significance.

    View Slide

  134. Variable Importance
    Which variables have a stronger relationship with the target variable?

    View Slide

  135. Variable Importance
    • Which variables are more important for predicting the target variable?
    • Which variables have a stronger relationship with the target variable?

    View Slide

  136. That’s it for today!

    View Slide

  137. Next Seminar

    View Slide

  138. EXPLORATORY
    Online Seminar #52
    7/21/2021 (Wed) 11AM PT
    Exploratory Server

    View Slide

  139. View Slide

  140. Information
    Email
    [email protected]
    Website
    https://exploratory.io
    Twitter
    @ExploratoryData
    Seminar
    https://exploratory.io/online-seminar

    View Slide

  141. Q & A
    141

    View Slide

  142. EXPLORATORY
    142

    View Slide