
Exploratory: An Introduction to Random Forest & Boruta

Random Forest is an ensemble machine learning algorithm that builds decision-tree-based models to predict either categorical or numerical outputs based on the patterns in the data.

It is also often used to compute ‘Variable Importance’, which shows which variables are more important for predicting the target output.

Kan will be showing how to use it with Exploratory’s Analytics view along with various methods like Boruta, EDARF, and SMOTE (adjusting imbalanced data).

Kan Nishida

July 03, 2019

Transcript

  1. Instructor: Kan Nishida, co-founder/CEO, Exploratory (@KanAugust)

    Summary: At the beginning of 2016, Kan launched Exploratory, Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of development at Oracle, leading development teams building various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, Big Data, etc. While at Oracle, Kan also provided training and consulting services to help organizations transform with data.
  2. The Third Wave

    Data Science is not just for Engineers and Statisticians. Exploratory makes it possible for Everyone to do Data Science.
  3. Democratization of Data Science

    First Wave (1976): Proprietary. Theme: Monetization. Users: Statisticians.
    Second Wave (2000): Open Source, Programming. Theme: Commoditization. Users: Data Scientists.
    Third Wave (2016): UI & Programming. Theme: Democratization. Users: Business Users.
    Other labels on the slide: Algorithms, Experience, Tools, Open Source, UI & Automation.
  4. Exploratory Data Analysis

    Diagram labels: Questions, Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning), Communication (Dashboard, Note, Slides).
  5. Agenda

    • Random Forest 101
    • Metrics of Prediction Model
    • Type 1 Error vs. Type 2 Error
    • Adjust Imbalanced Data with SMOTE
    • Variable Importance with Boruta
  6. Ensemble Learning

    • Independently train multiple models.
    • Combine the predictions from the multiple models to come up with a unified prediction.
    • Examples: Random Forest, XGBoost
  7. Data

    Mother Age  Father Age  Weight  Plurality  State  Is Premature
    40          42          5.5     1          CA     TRUE
    33          33          6.7     1          NY     FALSE
    32          36          7.0     1          WA     FALSE
    28          28          4.5     2          NC     TRUE
    24          26          6.0     1          MI     FALSE
    28          26          6.7     1          AZ     FALSE
    43          40          7.6     1          TX     FALSE
    38          33          4.2     2          FL     TRUE
    34          32          5.7     1          CA     FALSE
    29          33          5.2     1          NY     TRUE
  8. Target Variable: is_premature

    Mother Age  Father Age  Weight  Plurality  State  is_premature
    40          42          5.5     1          CA     TRUE
    33          33          6.7     1          NY     FALSE
    32          36          7.0     1          WA     FALSE
    28          28          4.5     2          NC     TRUE
    24          26          6.0     1          MI     FALSE
    28          26          6.7     1          AZ     FALSE
    43          40          7.6     1          TX     FALSE
    38          33          4.2     2          FL     TRUE
    34          32          5.7     1          CA     FALSE
    29          33          5.2     1          NY     TRUE
  9. Data: Sample rows and columns randomly. The is_premature variable is always included.

    Sample 1 (Mother Age, Weight, is_premature): (40, 5.5, TRUE), (33, 6.7, FALSE), (32, 7.0, FALSE), (28, 4.5, TRUE)
    Sample 2 (Mother Age, Plurality, State, is_premature): (28, 1, AZ, FALSE), (43, 1, TX, FALSE), (38, 2, FL, TRUE)
    Sample 3 (Father Age, State, is_premature): (33, FL, TRUE), (32, CA, FALSE), (33, NY, TRUE)
  10. Data: Build a tree model for each sampled data set.

    Sample 1 (Mother Age, Weight, is_premature): (40, 5.5, TRUE), (33, 6.7, FALSE), (32, 7.0, FALSE), (28, 4.5, TRUE)
    Sample 2 (Mother Age, Plurality, State, is_premature): (28, 1, AZ, FALSE), (43, 1, TX, FALSE), (38, 2, FL, TRUE)
    Sample 3 (Father Age, State, is_premature): (33, FL, TRUE), (32, CA, FALSE), (33, NY, TRUE)
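The sampling step described on the two slides above can be sketched in plain Python. This is only an illustration under assumptions: the function name `sample_rows_and_columns`, the sample sizes, and the fixed seed are made up for the example, and it follows the slide's simplified picture of sampling columns once per tree (Random Forest implementations commonly sample candidate columns at each split instead).

```python
import random

FEATURES = ["mother_age", "father_age", "weight", "plurality", "state"]
TARGET = "is_premature"

# Rows copied from the Data slide.
ROWS = [
    (40, 42, 5.5, 1, "CA", True), (33, 33, 6.7, 1, "NY", False),
    (32, 36, 7.0, 1, "WA", False), (28, 28, 4.5, 2, "NC", True),
    (24, 26, 6.0, 1, "MI", False), (28, 26, 6.7, 1, "AZ", False),
    (43, 40, 7.6, 1, "TX", False), (38, 33, 4.2, 2, "FL", True),
    (34, 32, 5.7, 1, "CA", False), (29, 33, 5.2, 1, "NY", True),
]
DATA = [dict(zip(FEATURES + [TARGET], row)) for row in ROWS]


def sample_rows_and_columns(data, features, target, n_rows, n_features, rng):
    """Return one random sub-table; the target column is always kept."""
    chosen_features = rng.sample(features, n_features)       # columns, without replacement
    chosen_rows = [rng.choice(data) for _ in range(n_rows)]  # rows, with replacement
    keep = chosen_features + [target]
    return [{name: row[name] for name in keep} for row in chosen_rows]


rng = random.Random(0)  # hypothetical seed, only to make the sketch repeatable
samples = [sample_rows_and_columns(DATA, FEATURES, TARGET, 4, 2, rng) for _ in range(3)]
for i, sample in enumerate(samples, start=1):
    print(f"tree {i} would be trained on columns: {sorted(sample[0])}")
```

Each sub-table would then be handed to a decision tree learner, and the trees' predictions combined by voting.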
  11. Variable Importance

    • A model built by Random Forest can show which variables have more influence on the target variable, based on Gini impurity.
    • This is very useful in the Exploratory Data Analysis phase for understanding the relationships among the variables.
    • Random Forest is often used to extract this information, not just for prediction.
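To make the Gini impurity idea concrete, here is a small sketch that computes the impurity of the is_premature column from the Data slide and how much a single split on Plurality reduces it. The function names are illustrative, not taken from any library; a Random Forest importance score aggregates such decreases over many splits and trees.

```python
from collections import Counter


def gini_impurity(labels):
    """1 minus the sum of squared class proportions; 0 means a pure group."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)
    return gini_impurity(parent) - weighted


# (plurality, is_premature) pairs from the ten rows of the Data slide.
rows = [(1, True), (1, False), (1, False), (2, True), (1, False),
        (1, False), (1, False), (2, True), (1, False), (1, True)]
parent = [label for _, label in rows]
left = [label for plurality, label in rows if plurality >= 2]   # twins or more
right = [label for plurality, label in rows if plurality < 2]   # single births

parent_gini = gini_impurity(parent)
decrease = impurity_decrease(parent, left, right)
print(f"Gini before split: {parent_gini:.2f}, decrease from split: {decrease:.2f}")
```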
  12. 26

  13. Evaluate Model Quality

    • Prediction Matrix, a.k.a. Confusion Matrix: what percentage of the predicted outputs are right or wrong?
    • Model Summary with Metrics: Accuracy Rate, AUC, F Score, Precision, Recall, etc.
  14. Accuracy Rate

                    Prediction TRUE   Prediction FALSE
    Actual TRUE     True Positive     False Negative
    Actual FALSE    False Positive    True Negative

    Accuracy Rate = (True Positive + True Negative) / Total
  15. Accuracy Rate

                    Prediction TRUE   Prediction FALSE
    Actual TRUE     5                 15
    Actual FALSE    15                195

    Accuracy Rate = (5 + 195) / (5 + 15 + 15 + 195) = 200 / 230 ≈ 0.870
  16. Type 1 Error vs. Type 2 Error

                    Prediction TRUE                  Prediction FALSE
    Actual TRUE     True Positive                    False Negative (Type 2 Error)
    Actual FALSE    False Positive (Type 1 Error)    True Negative
  17. Precision

                    Prediction TRUE   Prediction FALSE
    Actual TRUE     5                 15
    Actual FALSE    15                200

    You said TRUE, but the chance of being right is only 25%.
    Precision = 5 / (5 + 15) = 25%
  18. Type 1 Error: When you predict TRUE, it had better be TRUE.

    Example of a message marked as Spam:

    Hi
    Grandma <[email protected]>
    Hello Kan, Where are you? Are you doing ok? I’ve been worried about you since I haven’t heard back from you for a long time. Hope nothing happened to you. Please call me back. Love, Your grandma
  19. Recall

                    Prediction TRUE   Prediction FALSE
    Actual TRUE     5                 15
    Actual FALSE    15                200

    The answer is TRUE, but the chance that you predict it is only 25%.
    Recall = 5 / (5 + 15) = 25%
  20. Type 2 Error: When it is TRUE, you had better predict TRUE.

    When it’s TRUE but you didn’t predict it as TRUE, it can cause serious damage.
  21. F Score

    • The F Score is high only when both Recall and Precision are satisfied in a balanced way.
    • It is between 0 and 1; the closer to 1, the better.
    • The F Score is the harmonic mean of Recall and Precision.
    • F Score here: 0.25
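The metrics on the preceding slides can be reproduced from the four cells of the prediction matrix. The sketch below uses the counts shown on the Precision and Recall slides (5, 15, 15, 200); the function name `classification_metrics` is just an illustrative choice.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute accuracy, precision, recall and F score from confusion-matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # of predicted TRUE, how many were right
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # of actual TRUE, how many were found
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)  # harmonic mean
    else:
        f_score = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f_score": f_score}


metrics = classification_metrics(tp=5, fn=15, fp=15, tn=200)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note how accuracy stays high for this matrix while precision, recall and F score are all 0.25, which is the gap the following slides on imbalanced data address.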
  22. Type 2 Error: When it is TRUE, you had better predict TRUE.

    When it’s TRUE but you didn’t predict it as TRUE, it can cause serious damage.
  23. Recall

                    Prediction TRUE   Prediction FALSE
    Actual TRUE     5                 15
    Actual FALSE    15                200

    Of the cases where the answer is TRUE, Recall is the ratio that you predict correctly.
  24. Problem of Imbalanced Data

    • In the Prediction Matrix, we can see that the model is almost always predicting FALSE.
    • Predicting the majority case (not premature) makes the overall Log Likelihood higher.
  25. • When the actual values are TRUE, more cases are predicted as FALSE than TRUE.
    • The ratio of Type 2 Errors is high.
  26. • When the actual values are TRUE, more cases are predicted as FALSE than TRUE.
    • The ratio of Type 2 Errors is high.
    • When there is a risk of premature birth, failing to predict it can be a serious problem.
  27. Balancing Data by SMOTE

    • An algorithm called SMOTE (Synthetic Minority Oversampling Technique) balances imbalanced data by synthesizing additional rows for the minority class.
    • SMOTE can be added as a Step, or can be done inside Logistic Regression Analytics View.
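The core idea of SMOTE, creating new minority rows by interpolating between an existing minority row and one of its nearest minority neighbors, can be sketched as follows. This is a minimal illustration for numeric columns only, not the implementation Exploratory uses; the function name `smote_sketch`, the choice of k = 2, and the seed are assumptions.

```python
import math
import random


def smote_sketch(minority, n_new, k, rng):
    """Synthesize n_new points between minority points and their k nearest minority neighbors."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# (Mother Age, Weight) of the four premature (TRUE) rows on the Data slide.
minority = [(40, 5.5), (28, 4.5), (38, 4.2), (29, 5.2)]
rng = random.Random(1)  # hypothetical seed for repeatability
new_points = smote_sketch(minority, n_new=2, k=2, rng=rng)  # 4 + 2 = 6, matching the 6 FALSE rows
print(new_points)
```

Because each synthetic row lies on a line segment between two real minority rows, it stays inside the range of the observed minority data.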
  28. 63

  29. Adjust Imbalanced Data

    We already know the data is skewed. Let’s use “Adjust Imbalanced Data”.
  30. 1. Do all the predictor variables have an effect on the ‘Is_Premature’ variable? Maybe some variables have nothing to do with it?
    2. Random Forest’s randomness means the result is not guaranteed to be the same every time.
  31. Random Sampling

    Diagram labels: Data, Sampling (×3), Decision Tree, Vote (×3), Result.
    Random sampling doesn’t guarantee that the result is the same every time.
  32. Boruta

    • If there is randomness in the result, the suggested important variables might look important just by chance.
    • We can test whether they are consistently important by using a statistical test method.
  33. 78

  34. Building Random Forest models 20 times would generate 20 values of the importance metric.

    Visualize the distribution of the 20 different values with a Boxplot.
  35. • For each variable, count the number of times its variable importance score is better than that of the shadow variables.
    • Perform a hypothesis test on the counts and evaluate the statistical significance for each variable.
  36. Data

    Plurality  Mother Age  Father Age
    1          25          27
    3          35          40
    2          40          41
    1          22          22
    1          33          35
  37. Create Shadow

    Plurality  Mother Age  Father Age
    1          25          27
    3          35          40
    2          40          41
    1          22          22
    1          33          35

    Plurality Shadow  Mother Age Shadow  Father Age Shadow
    1                 22                 41
    1                 44                 22
    1                 35                 35
    3                 25                 40
    2                 40                 27

    • Copy the original variables and shuffle the values randomly.
    • These randomly shuffled variables shouldn’t have any association with the target variable.
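Creating shadow variables amounts to copying each column and shuffling the copy independently. A minimal sketch, using the three columns from the Data slide; the function name `add_shadow_columns` and the seed are assumptions for illustration:

```python
import random


def add_shadow_columns(table, rng):
    """Return a new table with a shuffled ' Shadow' copy of every column."""
    result = {name: list(values) for name, values in table.items()}
    for name, values in table.items():
        shadow = list(values)  # copy the original values...
        rng.shuffle(shadow)    # ...and break any link to the target by shuffling
        result[f"{name} Shadow"] = shadow
    return result


table = {
    "Plurality": [1, 3, 2, 1, 1],
    "Mother Age": [25, 35, 40, 22, 33],
    "Father Age": [27, 40, 41, 22, 35],
}
with_shadows = add_shadow_columns(table, random.Random(42))  # hypothetical seed
for name, values in with_shadows.items():
    print(f"{name}: {values}")
```

Each shadow column keeps the same set of values as its original, so it has the same distribution but no real relationship with the target.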
  38. Count how many times each variable scores better than the best shadow variable.

    Figure labels: Hit, Hit, Not Hit; Shadow, Shadow, Shadow.
  39. Hypothesis Test

    Null Hypothesis: There is no difference between a given variable and the best shadow variable in terms of the variable importance.
    Alternative Hypothesis 1: A given variable is better than the best shadow variable. (Useful)
    Alternative Hypothesis 2: A given variable is worse than the best shadow variable. (Not Useful)
  40. Under the assumption of the Null Hypothesis, the distribution of the number of Hits for 20 experiments (axis: 0 to 20 Hits).
  41. If we take 5% (0.05) as the threshold of significance… (P Value: 5%)
  42. If a given variable was better 15 times out of 20…
  43. …this can happen with less than a 5% chance.
  44. We can reject the null hypothesis and conclude that this variable is Useful for prediction.
  45. If a given variable was better only 5 times out of 20…
  46. …this can happen with less than a 5% chance.
  47. We can reject the null hypothesis and conclude that this variable is Not Useful for prediction.
  48. If a given variable was better 12 times out of 20…
  49. …this can happen with greater than a 5% chance.
  50. We can NOT reject the null hypothesis, so we can NOT conclude whether this variable is Useful or Not Useful for prediction.
  51. • A number of hits in the upper tail leads to the conclusion ‘Useful’.
    • A number of hits in the middle can’t be concluded either way.
    • A number of hits in the lower tail leads to the conclusion ‘Not Useful’.
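The three cases walked through above (15, 5, and 12 hits out of 20) can be checked with a binomial tail probability. The sketch assumes that under the null hypothesis each run is a hit with probability 0.5 and uses a plain 5% threshold on each tail, as in the slides; the function names are illustrative, and actual Boruta implementations may add refinements such as multiple-testing corrections.

```python
from math import comb


def upper_tail(hits, runs, p=0.5):
    """P(number of hits >= hits) under the null hypothesis."""
    return sum(comb(runs, k) * p**k * (1 - p) ** (runs - k) for k in range(hits, runs + 1))


def lower_tail(hits, runs, p=0.5):
    """P(number of hits <= hits) under the null hypothesis."""
    return sum(comb(runs, k) * p**k * (1 - p) ** (runs - k) for k in range(0, hits + 1))


def decide(hits, runs=20, alpha=0.05):
    """Classify a variable from its hit count, following the slides' three outcomes."""
    if upper_tail(hits, runs) < alpha:
        return "Useful"
    if lower_tail(hits, runs) < alpha:
        return "Not Useful"
    return "Undecided"


for hits in (15, 5, 12):
    print(hits, decide(hits), round(upper_tail(hits, 20), 4), round(lower_tail(hits, 20), 4))
```

With these assumptions, 15 hits has an upper-tail probability of about 2%, 5 hits has a lower-tail probability of about 2%, and 12 hits falls in the undecided middle.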
  52. 105