Exploratory: An Introduction to Decision Tree

Kan introduces Decision Tree, a machine learning algorithm that builds prediction models based on patterns in the data, and demonstrates it with Exploratory’s Analytics view.

Kan Nishida

June 26, 2019

Transcript

  1. Instructor: Kan Nishida, co-founder/CEO, Exploratory (@KanAugust). Summary: At the beginning of 2016, launched Exploratory, Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of development at Oracle, leading development teams that built various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, and Big Data. While at Oracle, Kan also provided training and consulting services to help organizations transform with data.
  2. The Third Wave: Data Science is not just for Engineers and Statisticians. Exploratory makes it possible for Everyone to do Data Science.
  3. Democratization of Data Science. First Wave (1976): Algorithms: Proprietary; Experience: UI & Programming; Theme: Monetization; Users: Statisticians. Second Wave (2000): Algorithms: Open Source; Experience: Programming; Theme: Commoditization; Users: Data Scientists. Third Wave (2016): Algorithms: Open Source; Experience: UI & Automation; Theme: Democratization; Users: Business Users.
  4. Exploratory Data Analysis: Questions, Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning), Communication (Dashboard, Note, Slides).
  5. Will this baby be prematurely born?

       Baby  Weight  Plurality  is_Premature
       A     5.2     1          TRUE
       B     4.7     2          TRUE
       C     6.8     1          FALSE
       D     7.2     1          FALSE
       E     5.1     2          TRUE
       Z     5.8     1          ?
  6. Build a decision tree model to predict whether a given baby is prematurely born or not.
  7. [Scatter plot of Plurality and Weight Pound. Red is Premature, Blue is Not Premature.]
  8. Divide the points into groups of the same color by drawing as few straight lines as possible. [Scatter plot of Plurality and Weight Pound.]
  9. Divide into 2 groups by whether Weight Pound is 5.5 or greater (Weight Pound >= 5.5). [Scatter plot of Plurality and Weight Pound.]
  10. Divide by whether Plurality is greater than 1.5 (Plurality > 1.5). [Scatter plot of Plurality and Weight Pound, divided by Weight Pound >= 5.5 and Plurality > 1.5.]
  11. [Scatter plot of Plurality and Weight Pound, divided by Weight Pound >= 5.5 and Plurality > 1.5.]
  12. Premature: No. Ratio of Premature: 0%. Ratio of All Babies: 40%. [Scatter plot of Plurality and Weight Pound, divided by Weight Pound >= 5.5 and Plurality > 1.5.]
  13. Premature: No. Ratio of Premature: 40%. Ratio of All Babies: 20%. [Scatter plot of Plurality and Weight Pound, divided by Weight Pound >= 5.5 and Plurality > 1.5.]
  14. Premature: Yes. Ratio of Premature: 100%. Ratio of All Babies: 40%. [Scatter plot of Plurality and Weight Pound, divided by Weight Pound >= 5.5 and Plurality > 1.5.]
  15. [Decision tree diagram: the first branch is Weight Pound >= 5.5 (TRUE / FALSE), the second is Plurality > 1.5 (TRUE / FALSE). Probability of Premature at the leaves: 0%, 40%, 100%.]
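The tree on this slide can be read as a pair of nested if/else rules. The sketch below is a minimal Python rendering under one assumption: based on the three groups highlighted on the preceding slides and the sample table, the 0% leaf is taken to be the Weight Pound >= 5.5 branch, the 100% leaf the Plurality > 1.5 branch, and the 40% leaf the remainder. The function name `premature_probability` is hypothetical.

```python
def premature_probability(weight_pound: float, plurality: int) -> float:
    """Return the leaf probability of 'Premature' for one baby.

    Leaf-to-branch assignment is an assumption read off the slides,
    not something the transcript states explicitly.
    """
    if weight_pound >= 5.5:
        return 0.0   # heavier babies: no premature cases in this group
    if plurality > 1.5:
        return 1.0   # lighter babies from plural births: all premature
    return 0.4       # lighter single births: mixed group


# Baby Z from the sample table: Weight 5.8, Plurality 1.
print(premature_probability(5.8, 1))  # prints 0.0
```

Under these assumed rules, baby Z falls into the first branch, so the model would predict "not premature".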
  16. Gini Impurity • Ranges between 0 and 1. • A metric that measures how mixed the different values are within each group. • p: the percentage of each classification result in each branch of the decision tree.
  17. Gini Impurity • Which variable to use to branch the tree is determined by comparing the weighted average of Gini impurity across candidate splits. • The lower the Gini impurity, the more cleanly the data points are classified, with fewer branches holding mixed classification results. • The variable with the smallest weighted average of Gini impurity is picked for the next branch.
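Written out, the standard Gini impurity of a branch and the weighted average used to compare candidate splits look like this. The symbols $G$, $p_i$, $n_k$, and $n$ are notation chosen here for illustration, not taken from the slides: $p_i$ is the share of class $i$ in a branch, $n_k$ is the number of rows in branch $k$, and $n$ is the number of rows before the split.

```latex
G = 1 - \sum_{i} p_i^{2}
\qquad
G_{\text{split}} = \sum_{k} \frac{n_k}{n}\, G_k
```

With two classes, $G$ is 0 for a branch that contains a single class and reaches 0.5 when the two classes are evenly mixed.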
  18. Impurity = 0. Six babies, all Not Premature: $1 - (0/6)^2 - (6/6)^2 = 0$
  19. Impurity = 0. Six babies, all Premature: $1 - (6/6)^2 - (0/6)^2 = 0$
  20. Impurity = 0.44. Four Not Premature, two Premature: $1 - (2/6)^2 - (4/6)^2 = 0.44$
  21. Impurity = 0.44. Two Not Premature, four Premature: $1 - (4/6)^2 - (2/6)^2 = 0.44$
  22. Impurity = 0.5. Three Not Premature, three Premature: $1 - (3/6)^2 - (3/6)^2 = 0.5$
  23. Starting Point: five Premature, five Not Premature. Impurity: 0.5
  24. Split on Weight Pound >= 5.5. TRUE branch impurity: 0. FALSE branch impurity: $1 - (2/7)^2 - (5/7)^2 = 0.41$. Weighted impurity: $3/10 \times 0 + 7/10 \times 0.41 = 0.29$
  25. Split on Weight Pound >= 5.5. Impurity before the split: 0.5. Weighted impurity after the split: $3/10 \times 0 + 7/10 \times 0.41 = 0.29$. Decrease: 0.21
  26. Split on Plurality > 1.5. TRUE branch impurity: $1 - (2/5)^2 - (3/5)^2 = 0.48$. FALSE branch impurity: $1 - (3/5)^2 - (2/5)^2 = 0.48$
  27. Split on Plurality > 1.5. TRUE branch impurity: $1 - (2/5)^2 - (3/5)^2 = 0.48$. FALSE branch impurity: $1 - (3/5)^2 - (2/5)^2 = 0.48$. Weighted impurity: $5/10 \times 0.48 + 5/10 \times 0.48 = 0.48$
  28. Compare Impurity Scores: Weight Pound decreases the impurity score more than Plurality does. Plurality: a decrease of 0.02 (weighted impurity 0.48).
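The split comparison walked through on the preceding slides can be sketched in Python as follows. The class counts per branch are taken from the numerators in the slides' formulas; the `(3, 0)` ordering for the pure branch is an assumption (its impurity is 0 either way), and the helper names `gini` and `weighted_gini` are hypothetical.

```python
def gini(counts):
    """Gini impurity of one branch, given the row count of each class."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def weighted_gini(branches):
    """Average of branch impurities, weighted by each branch's row count."""
    n = sum(sum(branch) for branch in branches)
    return sum(sum(branch) / n * gini(branch) for branch in branches)


start = gini((5, 5))  # impurity before any split: 0.5

# Class counts per branch (TRUE branch first, then FALSE branch).
splits = {
    "Weight Pound >= 5.5": [(3, 0), (2, 5)],
    "Plurality > 1.5": [(2, 3), (3, 2)],
}

scores = {name: weighted_gini(branches) for name, branches in splits.items()}
for name, score in scores.items():
    print(f"{name}: weighted impurity {score:.2f}, decrease {start - score:.2f}")

# The split with the smallest weighted impurity is chosen for the branch.
best = min(scores, key=scores.get)
print("Best split:", best)
```

This prints a weighted impurity of 0.29 (decrease 0.21) for Weight Pound and 0.48 (decrease 0.02) for Plurality, so Weight Pound is selected first.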
  29. [Decision tree diagram with TRUE / FALSE branches on Over_35, Is_Plural, Weight Pound >= 5.5, and Plurality > 1.5. Leaf values shown: 50%, 100%, 100%, 100%.]
  32. Starting point. The majority of the data is FALSE (Not Premature). Ratio of TRUE (Premature) is 12%. Ratio of babies is 100%.
  33. The majority of the ‘Yes’ group is FALSE. Ratio of TRUE (Premature) is 8%. Ratio of babies is 94%.
  34. The majority of the ‘No’ group is TRUE. Ratio of TRUE (Premature) is 72%. Ratio of babies is 6%.
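As a quick consistency check on the last three slides, the premature ratio at the starting point should equal the average of the two group ratios, weighted by each group's share of babies. The sketch below uses the rounded percentages from the slides, so the result is only approximate; the dictionary keys are hypothetical names.

```python
# Rounded figures as shown on the slides.
groups = [
    {"group": "Yes", "share_of_babies": 0.94, "premature_ratio": 0.08},
    {"group": "No", "share_of_babies": 0.06, "premature_ratio": 0.72},
]

# Weighted average of the group ratios recovers the overall ratio.
overall = sum(g["share_of_babies"] * g["premature_ratio"] for g in groups)
print(f"Overall premature ratio: {overall:.1%}")
```

The result is about 11.8%, which rounds to the 12% shown at the starting point.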