Exploratory: An Introduction to Decision Tree

EXPLORATORY

Instructor
Kan Nishida, co-founder/CEO, Exploratory (@KanAugust)
Summary: At the beginning of 2016, Kan launched Exploratory, Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of development at Oracle, leading development teams that built various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, and Big Data. While at Oracle, Kan also provided training and consulting services to help organizations transform with data.

Mission: Make Data Science Available for Everyone

Data Science is not just for Engineers and Statisticians. Exploratory makes it possible for Everyone to do Data Science.

The Third Wave

Democratization of Data Science
First Wave (1976): Theme: Monetization. Users: Statisticians.
Second Wave (2000): Theme: Commoditization. Users: Data Scientists.
Third Wave (2016): Theme: Democratization. Users: Business Users.
Remaining rows (Algorithms, Experience, Tools): Proprietary, Open Source, Programming, UI & Automation.

Analytics Decision Tree

Exploratory Data Analysis: Questions, Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning), Communication (Dashboard, Note, Slides)

Decision Tree
A decision tree is a series of forks with conditional questions.

Baby  Weight  Plurality  is_Premature
A     5.2     1          TRUE
B     4.7     2          TRUE
C     6.8     1          FALSE
D     7.2     1          FALSE
E     5.1     2          TRUE
Z     5.8     1          ?

Will this baby be born prematurely?

Build a decision tree model to predict whether a given baby is born prematurely or not.

Visualize Weight Pound and Plurality (scatter plot)

Red is Premature, Blue is Not Premature.

Divide into multiple groups with the same colors by drawing as few straight lines as possible.

Divide into 2 groups by whether Weight Pound is at least 5.5 (Weight Pound >= 5.5).

Divide by whether Plurality is greater than 1.5 (Plurality > 1.5).

The plot is now divided by two lines: Weight Pound >= 5.5 and Plurality > 1.5.

Premature: No. Ratio of Premature: 0%. Ratio of All Babies: 40%.

Premature: No. Ratio of Premature: 40%. Ratio of All Babies: 20%.

Premature: Yes. Ratio of Premature: 100%. Ratio of All Babies: 40%.

Weight Pound >= 5.5 (TRUE / FALSE), followed by Plurality > 1.5 (TRUE / FALSE)
Probability of Premature at the leaves: 0%, 40%, 100%
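
The two questions in this tree can be read as nested conditionals. The sketch below is only an illustration in plain Python: the function name is hypothetical, and the assignment of 40% and 100% to the two Plurality branches is an assumption inferred from the group slides (lighter single babies at 40%, lighter plural babies at 100%).

```python
def premature_probability(weight_pound: float, plurality: int) -> float:
    """Walk the two-question tree and return the leaf's ratio of premature babies."""
    if weight_pound >= 5.5:
        return 0.0   # heavier babies: 0% premature
    if plurality > 1.5:
        return 1.0   # lighter and plural: 100% (assumed leaf assignment)
    return 0.4       # lighter and single: 40% (assumed leaf assignment)


# Baby Z from the table: 5.8 pounds, plurality 1
baby_z = premature_probability(5.8, 1)
print(baby_z)
```

Baby Z is answered by the first question alone, so the result for Z does not depend on the assumed leaf assignment.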

How does it create the tree?

Which conditions should come first?

Gini Impurity
• Ranges between 0 and 1.
• A metric to measure how much different values are mixed per group.
• Impurity = 1 - (sum of p^2), where p is the percentage of each classification result in each branch of the decision tree.

Gini Impurity
• Which variable to use to branch the tree is determined by comparing the weighted average of Gini impurity.
• The lower the Gini impurity, the more cleanly the data points are classified, without branches with mixed classification results.
• The variable with the smallest weighted average of Gini impurity is picked for making the next branch.
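
Written out, the two quantities in these bullets could look as follows. The symbols are notation assumed here, not taken from the slides: $p_i$ is the share of class $i$ within one branch, $G_k$ is the impurity of branch $k$, $n_k$ is the number of data points in branch $k$, and $n$ is the number of data points before the split.

```latex
G_k = 1 - \sum_{i} p_i^{2}
\qquad
G_{\text{split}} = \sum_{k} \frac{n_k}{n}\, G_k
```

The candidate variable with the smallest $G_{\text{split}}$ is the one picked for the next branch.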

6 Not Premature, 0 Premature
1 - (0/6)^2 - (6/6)^2 = 0
Impurity = 0

6 Premature, 0 Not Premature
1 - (6/6)^2 - (0/6)^2 = 0
Impurity = 0

4 Not Premature, 2 Premature
1 - (2/6)^2 - (4/6)^2 = 0.44
Impurity = 0.44

2 Not Premature, 4 Premature
1 - (4/6)^2 - (2/6)^2 = 0.44
Impurity = 0.44

3 Not Premature, 3 Premature
1 - (3/6)^2 - (3/6)^2 = 0.5
Impurity = 0.5
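
The worked examples above can be reproduced with a few lines of Python. This is a minimal sketch, not the implementation used by Exploratory; the function and variable names are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 minus the sum of squared class shares for one group."""
    total = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


# The three kinds of groups from the examples: pure, 4 vs. 2, and 3 vs. 3
pure = ["Not Premature"] * 6
mostly_not = ["Not Premature"] * 4 + ["Premature"] * 2
even = ["Not Premature"] * 3 + ["Premature"] * 3

for name, group in [("pure", pure), ("mostly_not", mostly_not), ("even", even)]:
    print(name, round(gini_impurity(group), 2))
```

Swapping the two classes in a group leaves the result unchanged, which is why the 4 vs. 2 and 2 vs. 4 groups both score 0.44.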

Starting Point
5 Premature, 5 Not Premature
Impurity: 0.5

Use Weight Pound to separate first

Weight Pound >= 5.5
TRUE: Impurity: 0
FALSE: Impurity: 1 - (2/7)^2 - (5/7)^2 = 0.41
Weighted impurity after the split: 3/10 * 0 + 7/10 * 0.41 = 0.29
Impurity before the split: 0.5
Decrease: 0.5 - 0.29 = 0.21
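
The arithmetic for this split can be checked with a short script. It is a sketch under one assumption: the TRUE branch (3 babies, impurity 0) is taken to hold only Not Premature babies, which leaves 5 Premature and 2 Not Premature in the FALSE branch. The function names are hypothetical.

```python
def gini_from_counts(premature, not_premature):
    """Gini impurity of one branch, given its two class counts."""
    total = premature + not_premature
    return 1.0 - (premature / total) ** 2 - (not_premature / total) ** 2


def weighted_gini(branches):
    """branches: list of (premature, not_premature) counts, one pair per branch."""
    n = sum(p + q for p, q in branches)
    return sum((p + q) / n * gini_from_counts(p, q) for p, q in branches)


before = gini_from_counts(5, 5)           # starting point: 5 vs. 5
after = weighted_gini([(0, 3), (5, 2)])   # TRUE branch, FALSE branch (assumed counts)
decrease = before - after
print(round(before, 2), round(after, 2), round(decrease, 2))
```

Each branch is weighted by its share of the babies (3/10 and 7/10), so a large mixed branch pulls the average up more than a small one.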

Use Plurality first

Plurality > 1.5
TRUE: Impurity: 1 - (2/5)^2 - (3/5)^2 = 0.48
FALSE: Impurity: 1 - (3/5)^2 - (2/5)^2 = 0.48
Weighted impurity after the split: 5/10 * 0.48 + 5/10 * 0.48 = 0.48
Impurity before the split: 0.5
Decrease: 0.5 - 0.48 = 0.02

Compare Impurity Scores
Weight Pound decreases the impurity score more than Plurality does.
Decrease: Plurality 0.02 < Weight Pound 0.21
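
Choosing the first condition amounts to computing the impurity decrease for each candidate and keeping the largest. The sketch below does this for the two candidates in the slides; the per-branch class counts are read off the fractions in the impurity formulas, and all names are hypothetical.

```python
def gini(counts):
    """Gini impurity from a tuple of class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def impurity_decrease(parent, branches):
    """Parent impurity minus the size-weighted average impurity of the branches."""
    n = sum(parent)
    weighted = sum(sum(b) / n * gini(b) for b in branches)
    return gini(parent) - weighted


parent = (5, 5)  # class counts at the starting point
candidates = {
    "Weight Pound >= 5.5": [(0, 3), (5, 2)],  # TRUE branch, FALSE branch
    "Plurality > 1.5": [(2, 3), (3, 2)],
}

decreases = {name: impurity_decrease(parent, b) for name, b in candidates.items()}
best = max(decreases, key=decreases.get)

for name, value in decreases.items():
    print(name, round(value, 2))
print("first split:", best)
```

Picking the largest decrease is the same as picking the smallest weighted impurity, since the starting impurity is identical for every candidate.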

Use Weight Pound first, then use Plurality

Weight Pound >= 5.5 (TRUE / FALSE), followed by Plurality > 1.5 (TRUE / FALSE)
Leaf values: 100%, 50%, 0%

Use Plurality first, then use Weight Pound

Plurality > 1.5 (TRUE / FALSE), followed by Weight Pound >= 5.5 (TRUE / FALSE) on each side
Leaf values: 50%, 100%, 100%, 100%

Analytics: Let’s Build a Decision Tree!

Create an ‘is_premature’ Column
gestation_weeks < 37

Select ‘Mutate (Create Calculation)’ from the column header menu of
‘gestation_weeks’.
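
Outside the Exploratory UI, the same calculated column can be sketched in plain Python. The three rows below are made-up illustration data, not rows from the dataset used in the deck; only the column names and the `gestation_weeks < 37` condition come from the slides.

```python
# Hypothetical sample rows; the real data has many more columns and rows
records = [
    {"gestation_weeks": 35, "weight_pounds": 5.1},
    {"gestation_weeks": 39, "weight_pounds": 7.2},
    {"gestation_weeks": 37, "weight_pounds": 6.8},
]

# is_premature is TRUE when gestation_weeks is below 37
for row in records:
    row["is_premature"] = row["gestation_weeks"] < 37

flags = [row["is_premature"] for row in records]
print(flags)
```

A baby born at exactly 37 weeks is not flagged, because the condition is a strict less-than.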

Select ‘Decision Tree’

Select ‘is_premature’ column

Select Predictor Variables

Select all columns except ‘gestation_weeks’ (‘is_premature’ is calculated from it)

Starting point. The majority of the data is FALSE (Not Premature).
Ratio of TRUE (Premature) is 12%. Ratio of babies is 100%.

Condition: Is weight_pounds greater than 5.3?

The majority of the ‘Yes’ group is FALSE. Ratio of TRUE
(Premature) is 8%. Ratio of babies is 94%.

The majority of the ‘No’ group is TRUE. Ratio of TRUE
(Premature) is 72%. Ratio of babies is 6%.
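
One way to read these node summaries is to check that the two groups recombine to the starting point: each group's ratio of TRUE, weighted by its ratio of babies, should add up to roughly the overall 12%. The sketch below does that check with the rounded percentages shown above; the dictionary keys are hypothetical names.

```python
children = [
    {"answer": "Yes", "share_of_babies": 0.94, "ratio_premature": 0.08},
    {"answer": "No", "share_of_babies": 0.06, "ratio_premature": 0.72},
]

# Overall ratio of TRUE = sum over groups of (share of babies * ratio of TRUE)
overall = sum(c["share_of_babies"] * c["ratio_premature"] for c in children)

# A group's predicted label is TRUE when more than half of it is TRUE
majority = {c["answer"]: c["ratio_premature"] > 0.5 for c in children}

print(round(overall, 2))
print(majority)
```

The small ‘No’ group carries a large part of the premature cases even though it holds only 6% of the babies.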
