Slide 1

Slide 1 text

DECISION TREES, DATA SCIENCE & MACHINE LEARNING: USING ENTROPY TO DISCOVER PATH TO PURCHASE
ADDAM HARDY
NWA Tech Summit, 10 Nov 2015

Slide 2

Slide 2 text

WHY IS DATA IMPORTANT?

Slide 3

Slide 3 text

BIG DATA IS POINTLESS.

Slide 4

Slide 4 text

ANY DATA IS POINTLESS BY ITSELF.

Slide 5

Slide 5 text

TEXT

Slide 6

Slide 6 text

TEXT

Slide 7

Slide 7 text

WHAT IS IMPORTANT THEN?

Slide 8

Slide 8 text

INSIGHTS FROM DATA ARE WHAT MATTER.

Slide 9

Slide 9 text

INSIGHTS FROM DATA ARE WHAT MATTER. EVEN BETTER: ACTIONABLE INSIGHTS.

Slide 10

Slide 10 text

WHY IS RAW DATA USELESS?

Slide 11

Slide 11 text

RGB(84,109,172) RGB(61,148,4) RGB(204,52,126) RGB(48,25,245) RGB(78,114,93) RGB(104,178,75) RGB(110,92,8) RGB(49,65,114) RGB(23,55,24) RGB(211,65,23) RGB(145,214,222) RGB(73,210,62) RGB(47,123,206) RGB(196,51,120) RGB(96,66,92) RGB(60,134,127) RGB(199,112,182) RGB(110,29,202) RGB(28,215,129) RGB(123,108,150) RGB(121,66,112) RGB(217,159,104) RGB(22,111,250) RGB(33,205,104) RGB(4,62,227) RGB(177,246,42) RGB(160,157,124) RGB(147,180,20) RGB(141,46,211) RGB(189,218,73) RGB(177,154,61) RGB(187,66,117) RGB(200,188,39) RGB(221,41,196) RGB(246,109,30) RGB(13,24,116) RGB(23,24,201) RGB(114,43,52) RGB(6,177,253) RGB(221,98,240) RGB(226,21,242) RGB(238,236,86) RGB(224,9,29) RGB(193,82,149) RGB(8,225,89) RGB(37,102,174) RGB(94,192,111) RGB(106,241,207) RGB(145,221,34) RGB(150,139,147) RGB(234,137,16) RGB(143,208,237) RGB(244,195,105) RGB(74,137,229) RGB(34,194,57) RGB(213,79,231) RGB(15,165,133) RGB(126,110,159) RGB(31,241,243) RGB(231,164,167) RGB(129,166,143) RGB(23,29,145) RGB(72,254,92) RGB(25,106,28) RGB(94,49,177) RGB(93,104,159) RGB(144,97,4) RGB(252,180,13) RGB(115,56,55) RGB(237,18,254) RGB(41,61,11) RGB(15,88,141) RGB(78,17,171) RGB(217,14,177) RGB(35,238,166) RGB(125,214,251) RGB(71,130,184) RGB(158,215,157) RGB(187,26,186) RGB(139,33,250) RGB(133,20,79) RGB(210,141,50) RGB(14,216,90) RGB(168,127,104) RGB(48,239,168) RGB(187,145,139) RGB(243,56,32) RGB(79,77,114) RGB(48,110,46) RGB(46,75,8) RGB(197,132,39) RGB(216,27,62) RGB(138,254,137) RGB(121,76,229) RGB(137,227,190) RGB(190,53,99) RGB(151,13,150) RGB(154,230,60) RGB(171,13,32) RGB(175,126,241) RGB(207,1,47) RGB(161,86,61) RGB(217,222,183) RGB(146,96,23) RGB(155,203,206) RGB(168,189,23) RGB(128,51,186) RGB(230,54,198) RGB(237,237,107) RGB(108,191,228) RGB(49,91,61) RGB(19,43,177) RGB(77,140,115) RGB(87,107,228) RGB(222,1,231) RGB(39,7,4) RGB(236,22,163) RGB(126,186,228) RGB(150,160,5) RGB(45,123,70) RGB(28,206,71) RGB(244,248,65) RGB(130,90,155) RGB(42,254,37) RGB(139,241,164) RGB(125,36,35) RGB(224,187,84) 
RGB(34,36,156) RGB(172,106,219) RGB(22,7,249) RGB(217,182,237) RGB(251,124,12) RGB(162,189,168) RGB(72,149,79) RGB(38,97,211) RGB(163,100,137) RGB(226,56,28) RGB(9,200,52) RGB(130,12,237) RGB(109,132,69) RGB(39,152,215) RGB(136,216,221) RGB(90,154,59) RGB(24,99,204) RGB(80,121,143) RGB(132,110,250) RGB(12,238,13) RGB(236,134,86) RGB(158,47,208) RGB(100,138,207) RGB(203,240,204) RGB(153,209,18) RGB(181,75,22) RGB(3,156,254) RGB(233,208,39) RGB(122,117,211) RGB(16,8,158) RGB(244,69,201) RGB(101,197,36) RGB(112,235,205) RGB(28,53,11) RGB(178,126,148) RGB(5,101,191) RGB(60,195,71) RGB(40,222,6) RGB(1,97,232) RGB(1,34,34) RGB(57,59,250) RGB(93,219,123)

Slide 12

Slide 12 text

WHAT IS THIS?

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

ALRIGHT… I STILL DON’T GET IT

Slide 15

Slide 15 text

WHAT IF WE SORT THE DATA DIFFERENTLY?

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text

HOW DO YOU GET VALUE FROM DATA?

Slide 18

Slide 18 text

BIG DATA

Slide 19

Slide 19 text

BIG DATA

Slide 20

Slide 20 text

BIG DATA SCIENCE

Slide 21

Slide 21 text

THERE IS NO LACK OF TOOLS: ID3 DECISION TREES, LOGISTIC REGRESSION, RANDOM FORESTS, SUPPORT VECTOR MACHINES, NEURAL NETWORKS, NAIVE BAYES, K-MEANS, DEEP BOLTZMANN MACHINE, PRINCIPAL COMPONENT ANALYSIS, AND ON... AND ON...

Slide 22

Slide 22 text

THERE IS NO LACK OF TOOLS: ID3 DECISION TREES, LOGISTIC REGRESSION, RANDOM FORESTS, SUPPORT VECTOR MACHINES, NEURAL NETWORKS, NAIVE BAYES, K-MEANS, DEEP BOLTZMANN MACHINE, PRINCIPAL COMPONENT ANALYSIS, AND ON... AND ON...

Slide 23

Slide 23 text

Iterative Dichotomiser 3 (ID3)

Slide 24

Slide 24 text

DECISION TREES: ITERATIVE DICHOTOMISER 3 (ID3)
Outlook   Temp  Humidity  Windy   Run?
Sunny     Hot   High      Weak    No
Sunny     Hot   High      Strong  No
Overcast  Hot   High      Weak    Yes
Rain      Mild  High      Weak    Yes
Rain      Cool  Normal    Weak    Yes
Rain      Cool  Normal    Strong  No
Overcast  Cool  Normal    Strong  Yes
Sunny     Mild  High      Weak    No
Sunny     Cool  Normal    Weak    Yes
Rain      Mild  Normal    Weak    Yes
Sunny     Mild  Normal    Strong  Yes
Overcast  Mild  High      Strong  Yes
Overcast  Hot   Normal    Weak    Yes
Rain      Mild  High      Strong  No

Slide 25

Slide 25 text

DECISION TREES: ITERATIVE DICHOTOMISER 3 (ID3)
PREDICTORS: OUTLOOK, TEMP, HUMIDITY, WINDY. TARGET: RUN?
Outlook   Temp  Humidity  Windy   Run?
Sunny     Hot   High      Weak    No
Sunny     Hot   High      Strong  No
Overcast  Hot   High      Weak    Yes
Rain      Mild  High      Weak    Yes
Rain      Cool  Normal    Weak    Yes
Rain      Cool  Normal    Strong  No
Overcast  Cool  Normal    Strong  Yes
Sunny     Mild  High      Weak    No
Sunny     Cool  Normal    Weak    Yes
Rain      Mild  Normal    Weak    Yes
Sunny     Mild  Normal    Strong  Yes
Overcast  Mild  High      Strong  Yes
Overcast  Hot   Normal    Weak    Yes
Rain      Mild  High      Strong  No

Slide 26

Slide 26 text

DECISION TREES: ITERATIVE DICHOTOMISER 3 (ID3) DECISION TREE

Slide 27

Slide 27 text

DECISION TREES: ITERATIVE DICHOTOMISER 3 (ID3)
ITERATIVE DICHOTOMISER 3 (ID3) IS A TOP-DOWN, GREEDY SEARCH THROUGH THE SPACE OF POSSIBLE BRANCHES WITH NO BACKTRACKING. USING THIS METHOD, WE CAN PARTITION A DATA SET AND MEASURE ENTROPY AND INFORMATION GAIN AS THE DATA IS SPLIT TO CHOOSE THE BEST ATTRIBUTE AT EACH STEP WHEN CONSTRUCTING A DECISION TREE.

Slide 28

Slide 28 text

DECISION TREES: ITERATIVE DICHOTOMISER 3 (ID3)
ITERATIVE DICHOTOMISER 3 (ID3) IS A TOP-DOWN, GREEDY SEARCH THROUGH THE SPACE OF POSSIBLE BRANCHES WITH NO BACKTRACKING. USING THIS METHOD, WE CAN PARTITION A DATA SET AND MEASURE ENTROPY AND INFORMATION GAIN AS THE DATA IS SPLIT TO CHOOSE THE BEST ATTRIBUTE AT EACH STEP WHEN CONSTRUCTING A DECISION TREE.
ALRIGHT, ENOUGH DEFINITION

Slide 29

Slide 29 text

HOW DO YOU MEASURE ENTROPY?

Slide 30

Slide 30 text

DECISION TREES: ITERATIVE DICHOTOMISER 3 (ID3) WITH MATH
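For reference, a hedged statement of the math this slide points to: the standard Shannon entropy of a set of rows, which agrees with the step-by-step numbers on the slides that follow. The symbols are assumptions introduced here, not taken from the slide: $S$ is the set of rows, $c$ the number of target classes, and $p_i$ the fraction of rows in class $i$.

```latex
E(S) = -\sum_{i=1}^{c} p_i \log_2 p_i
```

With two equally likely classes this gives 1 bit; with a single class it gives 0.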

Slide 31

Slide 31 text

DECISION TREES: ITERATIVE DICHOTOMISER 3 (ID3) SINGLE ATTRIBUTE CALCULATION

Slide 32

Slide 32 text

DECISION TREES: ITERATIVE DICHOTOMISER 3 (ID3)
SINGLE ATTRIBUTE CALCULATION
Run   Yes  No
      9    5
$E(\text{Run}) = E(5, 9)$

Slide 33

Slide 33 text

DECISION TREES: ITERATIVE DICHOTOMISER 3 (ID3)
SINGLE ATTRIBUTE CALCULATION
Run   Yes  No
      9    5
$E(\text{Run}) = E(5, 9)$
$= E(5/14,\ 9/14)$

Slide 34

Slide 34 text

DECISION TREES: ITERATIVE DICHOTOMISER 3 (ID3)
SINGLE ATTRIBUTE CALCULATION
Run   Yes  No
      9    5
$E(\text{Run}) = E(5, 9)$
$= E(5/14,\ 9/14)$
$= E(0.36,\ 0.64)$

Slide 35

Slide 35 text

DECISION TREES: ITERATIVE DICHOTOMISER 3 (ID3)
SINGLE ATTRIBUTE CALCULATION
Run   Yes  No
      9    5
$E(\text{Run}) = E(5, 9)$
$= E(5/14,\ 9/14)$
$= E(0.36,\ 0.64)$
$= -(0.36 \log_2 0.36) - (0.64 \log_2 0.64)$

Slide 36

Slide 36 text

DECISION TREES: ITERATIVE DICHOTOMISER 3 (ID3)
SINGLE ATTRIBUTE CALCULATION
$E(\text{Run}) = E(5, 9)$
$= E(5/14,\ 9/14)$
$= E(0.36,\ 0.64)$
$= -(0.36 \log_2 0.36) - (0.64 \log_2 0.64)$
$= 0.94$
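The single-attribute calculation can be checked with a few lines of Python. This is a minimal sketch, not the speaker's code: the function name `entropy` and the choice to pass raw class counts are assumptions made here. It uses the 9 Yes / 5 No split of the Run? column.

```python
import math


def entropy(counts):
    """Shannon entropy, in bits, of a class distribution given as raw counts."""
    total = sum(counts)
    # Empty classes are skipped: p * log2(p) tends to 0 as p tends to 0.
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Target column "Run?": 9 Yes and 5 No out of 14 rows.
run_entropy = entropy([9, 5])
print(f"{run_entropy:.2f}")  # 0.94
```

A perfectly mixed column such as `entropy([7, 7])` gives 1 bit, and a pure column such as `entropy([4, 0])` gives 0.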

Slide 37

Slide 37 text

DECISION TREES: ITERATIVE DICHOTOMISER 3 (ID3)
MULTIPLE ATTRIBUTE CALCULATION
Run by Outlook  Yes  No  Total
Sunny           3    2   5
Overcast        4    0   4
Rainy           2    3   5
Total                    14

Slide 38

Slide 38 text

DECISION TREES: ITERATIVE DICHOTOMISER 3 (ID3)
MULTIPLE ATTRIBUTE CALCULATION
Run by Outlook  Yes  No  Total
Sunny           3    2   5
Overcast        4    0   4
Rainy           2    3   5
Total                    14
$E(\text{Run}, \text{Outlook}) = P(\text{Sunny})\,E(3, 2) + P(\text{Overcast})\,E(4, 0) + P(\text{Rainy})\,E(2, 3)$

Slide 39

Slide 39 text

DECISION TREES: ITERATIVE DICHOTOMISER 3 (ID3)
MULTIPLE ATTRIBUTE CALCULATION
Run by Outlook  Yes  No  Total
Sunny           3    2   5
Overcast        4    0   4
Rainy           2    3   5
Total                    14
$E(\text{Run}, \text{Outlook}) = P(\text{Sunny})\,E(3, 2) + P(\text{Overcast})\,E(4, 0) + P(\text{Rainy})\,E(2, 3)$
$= (5/14)(0.971) + (4/14)(0.0) + (5/14)(0.971)$

Slide 40

Slide 40 text

DECISION TREES: ITERATIVE DICHOTOMISER 3 (ID3)
MULTIPLE ATTRIBUTE CALCULATION
Run by Outlook  Yes  No  Total
Sunny           3    2   5
Overcast        4    0   4
Rainy           2    3   5
Total                    14
$E(\text{Run}, \text{Outlook}) = P(\text{Sunny})\,E(3, 2) + P(\text{Overcast})\,E(4, 0) + P(\text{Rainy})\,E(2, 3)$
$= (5/14)(0.971) + (4/14)(0.0) + (5/14)(0.971)$
$= 0.693$
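The multiple-attribute calculation is a weighted average: each Outlook value contributes the entropy of its own Yes/No counts, weighted by the share of rows it covers. A minimal sketch, assuming the Yes/No counts from the slide's frequency table; the names `entropy` and `split_entropy` are hypothetical helpers introduced here.

```python
import math


def entropy(counts):
    """Shannon entropy, in bits, of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def split_entropy(branches):
    """Weighted average entropy of the target after splitting on one attribute.

    `branches` maps each attribute value to its (yes, no) counts.
    """
    total = sum(sum(counts) for counts in branches.values())
    return sum(sum(counts) / total * entropy(counts) for counts in branches.values())


# (Yes, No) counts of Run for each Outlook value, as in the frequency table.
outlook = {"Sunny": (3, 2), "Overcast": (4, 0), "Rainy": (2, 3)}
outlook_entropy = split_entropy(outlook)
print(f"{outlook_entropy:.2f}")  # 0.69
```

The Overcast branch contributes nothing because all four of its rows are Yes, so its entropy is 0.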

Slide 41

Slide 41 text

OK, SO I KNOW HOW TO MEASURE ENTROPY. WHAT IS INFORMATION GAIN?

Slide 42

Slide 42 text

DECISION TREES: ITERATIVE DICHOTOMISER 3 (ID3)
INFORMATION GAIN IS THE DIFFERENCE IN ENTROPY BEFORE AND AFTER THE DATA IS PARTITIONED.

Slide 43

Slide 43 text

DECISION TREES: ITERATIVE DICHOTOMISER 3 (ID3)
INFORMATION GAIN CALCULATION
$\text{Gain}(T, X) = E(T) - E(T, X)$
Run by Outlook  Yes  No
Sunny           3    2
Overcast        4    0
Rainy           2    3
Run by Temp     Yes  No
Hot             2    2
Mild            4    2
Cool            3    1
Temp: Gain = 0.029

Slide 44

Slide 44 text

DECISION TREES: ITERATIVE DICHOTOMISER 3 (ID3)
INFORMATION GAIN CALCULATION
$\text{Gain}(T, X) = E(T) - E(T, X)$
Run by Outlook  Yes  No
Sunny           3    2
Overcast        4    0
Rainy           2    3
$\text{Gain}(\text{Run}, \text{Outlook}) = E(\text{Run}) - E(\text{Run}, \text{Outlook})$
Run by Temp     Yes  No
Hot             2    2
Mild            4    2
Cool            3    1
Temp: Gain = 0.029

Slide 45

Slide 45 text

DECISION TREES: ITERATIVE DICHOTOMISER 3 (ID3)
INFORMATION GAIN CALCULATION
$\text{Gain}(T, X) = E(T) - E(T, X)$
Run by Outlook  Yes  No
Sunny           3    2
Overcast        4    0
Rainy           2    3
$\text{Gain}(\text{Run}, \text{Outlook}) = E(\text{Run}) - E(\text{Run}, \text{Outlook})$
$= 0.940 - 0.693$
$= 0.247$
Run by Temp     Yes  No
Hot             2    2
Mild            4    2
Cool            3    1
Temp: Gain = 0.029

Slide 46

Slide 46 text

DECISION TREES: ITERATIVE DICHOTOMISER 3 (ID3)
INFORMATION GAIN CALCULATION
$\text{Gain}(T, X) = E(T) - E(T, X)$
Run by Outlook  Yes  No
Sunny           3    2
Overcast        4    0
Rainy           2    3
$\text{Gain}(\text{Run}, \text{Outlook}) = E(\text{Run}) - E(\text{Run}, \text{Outlook})$
$= 0.940 - 0.693$
$= 0.247$
Outlook: Gain = 0.247 (HIGHEST INFORMATION GAIN)
Run by Temp     Yes  No
Hot             2    2
Mild            4    2
Cool            3    1
Temp: Gain = 0.029
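The comparison between Outlook and Temp can be reproduced from the two frequency tables alone. A minimal sketch, assuming the Yes/No counts shown on the slide; `entropy`, `information_gain`, and `FREQUENCIES` are names introduced here for illustration.

```python
import math


def entropy(counts):
    """Shannon entropy, in bits, of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(branches):
    """Entropy of the target before a split minus the weighted entropy after it.

    `branches` maps each attribute value to its (yes, no) counts.
    """
    # Column sums recover the overall class counts, here (9 Yes, 5 No).
    parent = [sum(column) for column in zip(*branches.values())]
    total = sum(parent)
    after = sum(sum(c) / total * entropy(c) for c in branches.values())
    return entropy(parent) - after


FREQUENCIES = {
    "Outlook": {"Sunny": (3, 2), "Overcast": (4, 0), "Rainy": (2, 3)},
    "Temp": {"Hot": (2, 2), "Mild": (4, 2), "Cool": (3, 1)},
}

gains = {name: information_gain(table) for name, table in FREQUENCIES.items()}
for name, value in gains.items():
    print(f"{name}: {value:.3f}")  # Outlook: 0.247, then Temp: 0.029

best = max(gains, key=gains.get)
print("split on:", best)  # split on: Outlook
```

ID3 repeats this comparison over every remaining attribute and splits on the one with the largest gain.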

Slide 47

Slide 47 text

DECISION TREES: ITERATIVE DICHOTOMISER 3 (ID3) A BRANCH WITH 0 ENTROPY IS CONSIDERED A LEAF NODE.

Slide 48

Slide 48 text

DECISION TREES: ITERATIVE DICHOTOMISER 3 (ID3) A BRANCH WITH ENTROPY MORE THAN 0 REQUIRES FURTHER SPLITTING.

Slide 49

Slide 49 text

DECISION TREES: ITERATIVE DICHOTOMISER 3 (ID3) THE ALGORITHM IS RUN ON EACH BRANCH RECURSIVELY UNTIL IT TERMINATES ON A LEAF NODE.
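The recursion described on these slides can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the speaker's implementation: the tree is stored as nested dictionaries, ties in gain go to the attribute listed first, and a branch that runs out of attributes falls back to its majority label. All function and variable names are introduced here.

```python
import math
from collections import Counter

COLUMNS = ("Outlook", "Temp", "Humidity", "Windy", "Run")
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
DATA = [dict(zip(COLUMNS, row)) for row in ROWS]


def entropy(rows, target):
    """Entropy, in bits, of the target column over the given rows."""
    counts = Counter(row[target] for row in rows)
    return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())


def gain(rows, attr, target):
    """Information gain from partitioning the rows on one attribute."""
    after = 0.0
    for value in {row[attr] for row in rows}:
        subset = [row for row in rows if row[attr] == value]
        after += len(subset) / len(rows) * entropy(subset, target)
    return entropy(rows, target) - after


def id3(rows, attrs, target):
    """Return a leaf label, or {attribute: {value: subtree}} for a split."""
    labels = [row[target] for row in rows]
    # Leaf node: zero entropy (one label left) or nothing left to split on.
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a, target))
    rest = [a for a in attrs if a != best]
    return {
        best: {
            value: id3([r for r in rows if r[best] == value], rest, target)
            for value in sorted({r[best] for r in rows})
        }
    }


tree = id3(DATA, list(COLUMNS[:-1]), "Run")
print(tree)
```

On the 14 weather rows this splits on Outlook first; Overcast is already a leaf (all Yes), while Sunny and Rain each need one further split.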

Slide 50

Slide 50 text

DECISION TREES: ITERATIVE DICHOTOMISER 3 (ID3) FINALLY:

Slide 51

Slide 51 text

DECISION TREES: ITERATIVE DICHOTOMISER 3 (ID3)
FROM THIS:
Outlook   Temp  Humidity  Windy   Run?
Sunny     Hot   High      Weak    No
Sunny     Hot   High      Strong  No
Overcast  Hot   High      Weak    Yes
Rain      Mild  High      Weak    Yes
Rain      Cool  Normal    Weak    Yes
Rain      Cool  Normal    Strong  No
Overcast  Cool  Normal    Strong  Yes
Sunny     Mild  High      Weak    No
Sunny     Cool  Normal    Weak    Yes
Rain      Mild  Normal    Weak    Yes
Sunny     Mild  Normal    Strong  Yes
Overcast  Mild  High      Strong  Yes
Overcast  Hot   Normal    Weak    Yes
Rain      Mild  High      Strong  No

Slide 52

Slide 52 text

DECISION TREES: ITERATIVE DICHOTOMISER 3 (ID3) WE AUTOMATICALLY GENERATED THIS:

Slide 53

Slide 53 text

DECISION TREES: ITERATIVE DICHOTOMISER 3 (ID3) WE AUTOMATICALLY GENERATED THIS: WITH MATH

Slide 54

Slide 54 text

SO WHAT?

Slide 55

Slide 55 text

USING DECISION TREES TO DETERMINE PATH TO PURCHASE

Slide 56

Slide 56 text

DECISION TREES: ITERATIVE DICHOTOMISER 3 (ID3)
DEMOGRAPHIC DATA + PURCHASE
Gender  Ethnicity         Income    Age    Purchase
Female  African American  50k-100k  45-54  No
Male    African American  50k-100k  18-24  Yes
Female  Hispanic          <50k      25-34  Yes
Male    African American  <50k      45-54  Yes
Female  Asian             >100k     35-44  Yes
Female  Hispanic          <50k      18-24  No
Female  Asian             50k-100k  25-34  No
Female  African American  <50k      25-34  No
Male    Hispanic          >100k     25-34  No
Male    Caucasian         50k-100k  25-34  No
Female  Hispanic          >100k     18-24  No
Female  Asian             >100k     18-24  Yes
Male    African American  <50k      55+    Yes

Slide 57

Slide 57 text

DECISION TREES: ITERATIVE DICHOTOMISER 3 (ID3) DEMOGRAPHIC DATA + PURCHASE FEATURES: AGE, INCOME, ETHNICITY

Slide 58

Slide 58 text

TRAIN THE MODEL WITH KNOWN DATA. THEN PREDICT THE OUTCOME OF A SAMPLE SET.
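The train-then-predict loop can be sketched end to end in plain Python. Everything here is an assumption made for illustration: it reuses the 14-row weather table from the earlier slides rather than the demographic data, trains on the first 10 rows and holds out the last 4 (an arbitrary split), and sends attribute values the tree has never seen to the majority training label. It does not reproduce the accuracy figures in this talk.

```python
import math
from collections import Counter

COLUMNS = ("Outlook", "Temp", "Humidity", "Windy", "Run")
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
DATA = [dict(zip(COLUMNS, row)) for row in ROWS]


def entropy(rows, target):
    counts = Counter(row[target] for row in rows)
    return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())


def gain(rows, attr, target):
    after = 0.0
    for value in {row[attr] for row in rows}:
        subset = [row for row in rows if row[attr] == value]
        after += len(subset) / len(rows) * entropy(subset, target)
    return entropy(rows, target) - after


def id3(rows, attrs, target):
    """Greedy ID3: a leaf label, or {attribute: {value: subtree}}."""
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a, target))
    rest = [a for a in attrs if a != best]
    return {best: {v: id3([r for r in rows if r[best] == v], rest, target)
                   for v in sorted({r[best] for r in rows})}}


def predict(tree, row, default):
    """Walk the tree until a leaf label (a plain string) is reached."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        # Values never seen during training fall back to the default label.
        tree = tree[attr].get(row[attr], default)
    return tree


def accuracy(tree, rows, target, default):
    """Fraction of rows whose predicted label matches the known label."""
    hits = sum(predict(tree, row, default) == row[target] for row in rows)
    return hits / len(rows)


train, test = DATA[:10], DATA[10:]
tree = id3(train, list(COLUMNS[:-1]), "Run")
majority = Counter(row["Run"] for row in train).most_common(1)[0][0]
score = accuracy(tree, test, "Run", majority)
print(f"{score:.0%}")  # 75%
```

With only 10 training rows, the Sunny branch ends up split on Temp, and one of the four held-out rows is misclassified as a result.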

Slide 59

Slide 59 text

DECISION TREES: ITERATIVE DICHOTOMISER 3 (ID3) DEMOGRAPHIC DATA + PURCHASE FEATURES: AGE, INCOME, ETHNICITY

Slide 60

Slide 60 text

DECISION TREES: ITERATIVE DICHOTOMISER 3 (ID3) DEMOGRAPHIC DATA + PURCHASE 66% ACCURACY FEATURES: AGE, INCOME, ETHNICITY

Slide 61

Slide 61 text

DECISION TREES: ITERATIVE DICHOTOMISER 3 (ID3) DEMOGRAPHIC DATA + PURCHASE 66% ACCURACY FEATURES: AGE, INCOME, ETHNICITY

Slide 62

Slide 62 text

DECISION TREES: ITERATIVE DICHOTOMISER 3 (ID3) DEMOGRAPHIC DATA + PURCHASE FEATURES: INCOME, ETHNICITY

Slide 63

Slide 63 text

DECISION TREES: ITERATIVE DICHOTOMISER 3 (ID3) DEMOGRAPHIC DATA + PURCHASE FEATURES: INCOME, ETHNICITY 50% ACCURACY

Slide 64

Slide 64 text

DECISION TREES: ITERATIVE DICHOTOMISER 3 (ID3)
DEMOGRAPHIC DATA + PURCHASE
FEATURES: INCOME, ETHNICITY
50% ACCURACY

Slide 65

Slide 65 text

IT’S NOT REALLY THAT ACCURATE. WHAT SHOULD I DO?

Slide 66

Slide 66 text

CHANGE THE FEATURES ANALYZED. GET MORE TRAINING DATA. KEEP DIGGING.

Slide 67

Slide 67 text

DECISION TREES: ITERATIVE DICHOTOMISER 3 (ID3) DEMOGRAPHIC DATA + PURCHASE FEATURES: GENDER, ETHNICITY, INCOME, AGE

Slide 68

Slide 68 text

DECISION TREES: ITERATIVE DICHOTOMISER 3 (ID3) DEMOGRAPHIC DATA + PURCHASE FEATURES: GENDER, ETHNICITY, INCOME, AGE 83.3% ACCURACY

Slide 69

Slide 69 text

DECISION TREES: ITERATIVE DICHOTOMISER 3 (ID3)
DEMOGRAPHIC DATA + PURCHASE
FEATURES: GENDER, ETHNICITY, INCOME, AGE
83.3% ACCURACY

Slide 70

Slide 70 text

DECISION TREES: ITERATIVE DICHOTOMISER 3 (ID3)
DEMOGRAPHIC DATA + PURCHASE
FEATURES: GENDER, ETHNICITY, INCOME, AGE
WE SHOULD TRY FOCUSING OUR MARKETING ON HISPANICS AGED 25-34 WHO MAKE LESS THAN $50K A YEAR.

Slide 71

Slide 71 text

KEEP IN MIND: THIS IS FUZZY LOGIC. AN ESTIMATE. ALWAYS TEST AND VERIFY YOUR HYPOTHESES.

Slide 72

Slide 72 text

DATA SCIENCE.

Slide 73

Slide 73 text

DATA SCIENCE. IT WORKS.

Slide 74

Slide 74 text

THANKS