Scientist. • Worked in Bing data mining, Bing Ads (2006-2013) • ETL, bot detection, online experimentation and A/B testing, relevance of online ads, click prediction, etc. • Ph.D. in CS with a focus on computer vision, machine learning and data mining. Copyright (c) 2018. Data Science Dojo
• Worked in game engine technology, writing technical content on new features. • Graduate diploma in mathematics and statistics, with a bachelor's degree in information and media.
Worked for a small research consultancy in London on projects for governments, international organizations and companies such as the UN and Siemens. • BA in Human Sciences from Oxford University and an MSc in Decision Science.
scientist at McKinsey & Company, Sydney • Bachelor’s in Computer Science and Mathematics from Télécom Paristech (“Grande Ecole”), and a Master’s in Statistics from Imperial College, London.
science for improved health systems and healthcare. • Explore and visualize a health-related dataset. • Build and evaluate predictive models for classification and regression (for instance, predicting whether a tumour is malignant or not), as an example of the use of machine learning in health.
clustering, and its potential applications in health systems • Learn fundamentals of text analytics and perform text analytics on a health-related dataset • Get an introduction to big data and data engineering
problems at all times: • Problem and business impact • Data you have (and do not have) • Measurement metrics • Business metrics
material and resources: • Handbooks • Learning portal • Request: • Make sure your computers are ready • Keep the session interactive • Social media, email, etc. *We will end at 4:00 pm on Friday
science landscape Session II: Data exploration and visualization Session III: Introduction to predictive modeling Session IV: Decision tree learning and building your first predictive model Session V: Evaluating classification models
what other organizations are (or may be) doing • Common data mining tasks • Identify some data science problems in health
treatments including immunotherapies. • MIT Clinical Machine Learning Group • Focused on disease processes and the design of effective treatments for diseases such as Type 2 diabetes. • Knight Cancer Institute • Current focus on developing an approach to personalize drug combinations for Acute Myeloid Leukemia (AML).
of lung tissue damage by color to help physicians make more accurate diagnoses. • IBM Watson Genomics • Provides precision medicine to cancer patients.
patients to get their lab results explained to them by an app, saving both patient and doctor time and money. • Somatix • Recognizes hand-to-mouth gestures to help people better understand their behavior and make life-affirming changes.
macular degeneration in aging eyes. • Desktop Genetics • AI-designed tech for more effective and affordable guides. Recognized as leader in genome editing technology. • iCarbonX • Monitors and models human biological data to enable people to find the proper lifestyle and treatments that can improve their health, life quality and joy.
[Diagram: Big Data – Technology, Platforms & Products — ETL/Log, SQL, NoSQL, MapReduce, Real Time Analytics; Explore, Predict, Recommend]
just as important as data scientists and data engineers! • Identify applications and use cases • Convey the needs of the business to data scientists and engineers and vice versa • Generate user buy-in • Educate the business on high level concepts of analytics
that describe the data • Techniques: Clustering, Association Analysis, X-point summaries • Predictive Methods: • Use available data to build models that can predict the outcome of future data • Techniques: Classification, Regression, Anomaly and Deviation Detection • Prescriptive Methods: • Predict future outcomes and suggest actions that may prevent or mitigate the impact of the predicted outcomes • Techniques: Various optimization techniques
so a traffic jam does not happen OR A traffic jam is about to happen in the next 30 minutes and you could possibly take the following courses of action: • Route traffic to the service road near I-5 • Block more traffic from entering the WA-520 bridge
Training Set:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set:
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Learn a classifier (model) from the Training Set, then apply it to the Test Set.
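The learn-then-apply workflow above can be sketched in a few lines. This uses the slide's tax-evasion table but a deliberately trivial "model" (predict the majority class seen in training) just to make the two phases concrete; a real classifier such as a decision tree would learn rules from the attributes.

```python
# Minimal sketch of the learn -> apply workflow, with a trivially
# simple majority-class "model" standing in for a real classifier.
from collections import Counter

training_set = [
    # (Refund, Marital Status, Taxable Income in K, Cheat)
    ("Yes", "Single",   125, "No"),
    ("No",  "Married",  100, "No"),
    ("No",  "Single",    70, "No"),
    ("Yes", "Married",  120, "No"),
    ("No",  "Divorced",  95, "Yes"),
    ("No",  "Married",   60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No",  "Single",    85, "Yes"),
    ("No",  "Married",   75, "No"),
    ("No",  "Single",    90, "Yes"),
]

test_set = [
    ("No", "Single", 75), ("Yes", "Married", 50), ("No", "Married", 150),
    ("Yes", "Divorced", 90), ("No", "Single", 40), ("No", "Married", 80),
]

# "Learn": find the majority class label in the training set.
majority_class = Counter(row[-1] for row in training_set).most_common(1)[0][0]

# "Apply": predict that label for every test record.
predictions = [majority_class for _ in test_set]
print(predictions)  # "No" for all six records (7 of 10 training labels are No)
```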
patient will develop diabetes? • What is the likelihood that a COPD patient will be readmitted within 90 days of discharge? • What is the likelihood that a person will not show up to their appointment?
a set of attributes, and a similarity measure among them, find clusters such that: • Data points within a cluster are more similar to one another • Data points in different clusters are less similar to one another
attributes are continuous • Other problem-specific measures • Example: whether a particular word occurs in both documents or not
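Two measures of the kind described above can be sketched as follows: Euclidean distance for continuous attributes, and Jaccard similarity for word-occurrence comparisons between documents. The example inputs are illustrative only.

```python
# Two common measures for clustering: Euclidean distance for
# continuous attributes, Jaccard similarity for word overlap.
import math

def euclidean(a, b):
    """Distance between two points with continuous attributes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def jaccard(doc_a, doc_b):
    """Word-overlap similarity: |A intersect B| / |A union B|."""
    words_a, words_b = set(doc_a.split()), set(doc_b.split())
    return len(words_a & words_b) / len(words_a | words_b)

print(euclidean((0, 0), (3, 4)))                         # 5.0
print(jaccard("tumor is benign", "tumor is malignant"))  # 2 shared / 4 total = 0.5
```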
of which contain some number of items from a given collection: • Produce dependency rules which will predict the occurrence of an item based on the occurrences of other items

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
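The strength of the discovered rules above can be checked by simple counting: the confidence of a rule X --> Y is the fraction of transactions containing X that also contain Y. A minimal sketch over the slide's basket table:

```python
# confidence(X -> Y) = count(transactions with X and Y) / count(transactions with X)
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def confidence(antecedent, consequent):
    has_antecedent = [t for t in transactions if antecedent <= t]
    has_both = [t for t in has_antecedent if consequent <= t]
    return len(has_both) / len(has_antecedent)

print(confidence({"Milk"}, {"Coke"}))            # 3 of 4 Milk baskets have Coke -> 0.75
print(confidence({"Diaper", "Milk"}, {"Beer"}))  # 2 of 3 -> 0.667
```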
• Features obtained from a digital image of a fine needle aspirate (FNA) of a breast mass. • Describes characteristics of the cell nuclei present in the image. • Attribute information: • ID number • Diagnosis (M = malignant, B = benign) • 10 real-valued features • Total of 569 records
id: ID number
diagnosis: The diagnosis of breast tissues (M = malignant, B = benign)
radius_mean: mean of distances from center to points on the perimeter
texture_mean: standard deviation of gray-scale values
perimeter_mean: mean size of the core tumor
smoothness_mean: mean of local variation in radius lengths
compactness_mean: mean of perimeter^2 / area - 1.0
concavity_mean: mean of severity of concave portions of the contour
concave points_mean: mean for number of concave portions of the contour
fractal_dimension_mean: mean for "coastline approximation" - 1
radius_se: standard error for the mean of distances from center to points on the perimeter
texture_se: standard error for standard deviation of gray-scale values
compactness_se: standard error for perimeter^2 / area - 1.0
Source: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data
Function: Description
read.csv(), read.table(): Load data/file into a dataframe
data(): Loads or resets a dataset
names(): List names of variables in a dataframe
head(): First 6 rows of data
tail(): Last 6 rows of data
str(): Display internal structure of an R object
View(): View dataset in spreadsheet format in RStudio
dim(): Dimensions (rows and columns) of a dataframe
summary(): Display 5-number summary and mean
colnames(): Provide column names
Surgery Rooms • Gauss Surgical • Develops real-time blood monitoring solutions to provide an accurate and objective estimate of blood loss. • MedaSense • Assesses patients’ physiological response to pain.
& Risk Assessment ▪ Watson for oncology • Analyzes patients' medical records and identifies treatment options for doctors and patients. ▪ SkinVision • Assesses skin cancer risk using image recognition and user-provided information. ▪ Berg • Includes dosage trials for intravenous tumor treatment, detection and management of prostate cancer.
▪ MedyMatch • Helps treat stroke and head trauma more effectively by detecting intracranial brain bleeds. ▪ P1vital • Predicting Response to Depression Treatment (PReDicT test) uses Machine Learning to guide antidepressant treatment.
to Test Data: start from the root of the tree.

[Decision tree figure: root split on Perimeter (<114.6 / ≥114.6), further splits on Concavity (<0.1358 / ≥0.1358) and Texture (<26.29 / ≥26.29); leaf nodes Benign, Malignant, Malignant, Benign]

Test record:
ID  Perimeter  Texture  Concavity  Diagnosis
2   125.6      31.51    0.1578     ?
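Walking the test record through the tree can be sketched as nested comparisons. The thresholds (114.6, 26.29, 0.1358) come from the slide, but the exact branch layout is an assumption for illustration, not the definitive tree learned in the course.

```python
# Sketch of applying the decision tree to a test record.
# Branch layout is assumed from the slide's thresholds.
def classify(perimeter, texture, concavity):
    if perimeter < 114.6:
        # left subtree: assumed split on concavity
        return "Benign" if concavity < 0.1358 else "Malignant"
    else:
        # right subtree: assumed split on texture
        return "Malignant" if texture >= 26.29 else "Benign"

# Test record ID 2: Perimeter 125.6, Texture 31.51, Concavity 0.1578
print(classify(125.6, 31.51, 0.1578))  # Malignant
```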
We Get A Tree? • Exponentially many decision trees are possible • Finding the optimal tree is infeasible • Greedy methods that find near-optimal solutions do exist
• Greedy strategy • Split based on an attribute test that optimizes a criterion • Issues • How to split the records? • What attribute test criterion? • How to determine the best split? • When do we stop?
What is The Best Split? • Greedy approach: a homogeneous class distribution is preferred • Need a measure of node impurity

Example (C1 = Benign, C2 = Malignant):
C1: 9, C2: 1 (homogeneous, low degree of impurity)
C1: 5, C2: 5 (non-homogeneous, high degree of impurity)
GINI

GINI(t) = 1 - Σ_j [p(j|t)]^2

• p(j|t) is the relative frequency of class j at node t
• n_c = number of classes
• Maximum (1 - 1/n_c) when records are equally distributed among all classes, implying least interesting information
• Minimum (0.0) when all records belong to one class, implying most interesting information

Examples (C1 = Benign, C2 = Malignant):
C1: 0, C2: 6   Gini = 0.000
C1: 1, C2: 5   Gini = 0.278
C1: 2, C2: 4   Gini = 0.444
C1: 3, C2: 3   Gini = 0.500
GINI • When a node p is split into k partitions (children), the quality of the split is computed as:

GINI_split = Σ_{i=1}^{k} (n_i / n) GINI(i)

where n_i = number of records at child i and n = number of records at node p
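The GINI formulas above can be checked directly against the slide's example class counts (C1 = Benign, C2 = Malignant). The split example at the end uses assumed child counts for illustration.

```python
# GINI(t) = 1 - sum_j p(j|t)^2, and the weighted GINI of a split.
def gini(counts):
    """Impurity of a node given its per-class record counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """GINI_split = sum_i (n_i / n) * GINI(child i)."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

print(round(gini([0, 6]), 3))  # 0.0   (pure node: minimum)
print(round(gini([1, 5]), 3))  # 0.278
print(round(gini([2, 4]), 3))  # 0.444
print(round(gini([3, 3]), 3))  # 0.5   (equal split: maximum for 2 classes)

# Splitting a 10-record node into a pure [3, 0] child and a mixed [4, 3] child:
print(round(gini_split([[3, 0], [4, 3]]), 3))  # 0.3 * 0 + 0.7 * GINI([4, 3]) = 0.343
```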
Impurity Measure: Entropy

Entropy(t) = - Σ_j p(j|t) log2 p(j|t)

• p(j|t) is the relative frequency of class j at node t
• Maximum: records equally distributed
• Minimum: all records belong to one class
Information • Node p is split into k partitions • n_i is the number of records in partition i • Measures reduction in entropy • Choose the split that maximizes GAIN • Tends to prefer splits with a large number of partitions

GAIN_split = Entropy(p) - Σ_{i=1}^{k} (n_i / n) Entropy(i)
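The entropy and information-gain formulas above can be sketched with illustrative class counts; a perfect split of a 50/50 node into two pure children recovers all the entropy.

```python
# Entropy(t) = -sum_j p(j|t) log2 p(j|t), and information gain of a split.
import math

def entropy(counts):
    """Entropy of a node given its per-class record counts."""
    n = sum(counts)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c > 0)

def information_gain(parent, children):
    """GAIN = Entropy(parent) - sum_i (n_i / n) * Entropy(child i)."""
    n = sum(parent)
    weighted = sum(sum(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

print(entropy([5, 5]))  # 1.0 (maximum for 2 classes)
print(entropy([6, 0]))  # 0.0 (pure node)

# Splitting [5, 5] into pure children [5, 0] and [0, 5] gains all the entropy:
print(information_gain([5, 5], [[5, 0], [0, 5]]))  # 1.0
```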
Classification Error

Error(t) = 1 - max_i P(i|t)

• Maximum: records are equally distributed • Minimum: all records belong to one class • Similar to information gain • Less sensitive for > 2 or 3 splits • Less prone to overfitting
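A quick side-by-side of the three impurity measures on illustrative two-class nodes shows they agree on which node is purer, even though their values differ:

```python
# Compare GINI, entropy, and classification error on the same nodes.
import math

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    n = sum(counts)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c > 0)

def classification_error(counts):
    """Error(t) = 1 - max_i P(i|t)."""
    n = sum(counts)
    return 1.0 - max(counts) / n

# All three measures are maximal at [5, 5] and zero at the pure node [10, 0].
for node in ([5, 5], [9, 1], [10, 0]):
    print(node, round(gini(node), 3), round(entropy(node), 3),
          round(classification_error(node), 3))
```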
• Greedy strategy • Split based on an attribute test that optimizes a criterion • Issues • How to split the records? • What attribute test criterion? • How to determine the best split? • When do we stop?
Criteria • All the records belong to the same class • All the records have similar attribute values • Fixed termination or pruning • Number of Levels • Number in Leaf Node • Minimum samples per leaf node
- PROS • Intuitive • Easy interpretation for small trees • Non-parametric • Incorporates both numeric and categorical attributes • Fast • Once rules are developed, prediction is rapid • Robust to outliers [Decision tree figure: splits on Perimeter (114.6), Concavity (0.1358), Texture (26.29); leaf nodes Benign and Malignant]
- CONS • Overfitting • Must be trained with great care • Rectangular Classification • Recursive partitioning of data may not capture complex relationships
of Accuracy • Consider a 2-class problem: • Number of Class 0 examples = 9990 • Number of Class 1 examples = 10 • If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9 % • Accuracy is misleading!
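The imbalanced-class example above is worth computing directly: a model that always predicts class 0 looks excellent on accuracy while never detecting a single class-1 example.

```python
# Accuracy of the "always predict class 0" model on a 9990/10 dataset.
n_class0, n_class1 = 9990, 10
correct = n_class0                 # every class-0 example is predicted correctly
total = n_class0 + n_class1
accuracy = correct / total
print(accuracy)  # 0.999, yet all 10 class-1 examples are missed
```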
                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL   Class=Yes   a           b
CLASS    Class=No    c           d

a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)
                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL   Class=Yes   a (TP)      b (FN)
CLASS    Class=No    c (FP)      d (TN)

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
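The accuracy formula above applied to the confusion-matrix cells, with illustrative counts:

```python
# Accuracy from confusion-matrix cells: (TP + TN) / (TP + TN + FP + FN).
tp, fn = 50, 10    # actual Yes: a, b
fp, tn = 5, 935    # actual No:  c, d
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # (50 + 935) / 1000 = 0.985
```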
A machine learning model should be able to handle any data set coming from the same distribution as the training set. • Generalization refers to a model's ability to perform well on unseen data from that distribution, not just on the training data it has already memorized
of generalization) • The gravest and most common sin of machine learning • Overfitting: learning so much from your data that you memorize it. • You do well on training data • But don’t do well (or even fail miserably) on test data
Dataset • The person building the model has no access to the blind holdout dataset • Why do we need to lock it away? • Even with a 70/30 split, you may end up with a model that does not generalize
• Use when class distribution is skewed • Ensures that all partitions have a fixed ratio of classes • Same ratio as the training set • If the training set is 5% class 1 and 95% class 2, so is each partition
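Stratified partitioning can be sketched by splitting each class separately so every partition keeps the overall class ratio. The label counts below are illustrative only.

```python
# Stratified split: partition each class separately so train and test
# preserve the overall class ratio.
def stratified_split(labels, train_frac=0.7):
    """Return (train_idx, test_idx) preserving per-class ratios."""
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train, test = [], []
    for indices in by_class.values():
        cut = round(len(indices) * train_frac)   # per-class 70/30 cut
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return train, test

labels = [1] * 10 + [2] * 190        # 5% class 1, 95% class 2
train, test = stratified_split(labels)
train_ratio = sum(labels[i] == 1 for i in train) / len(train)
test_ratio = sum(labels[i] == 1 for i in test) / len(test)
print(train_ratio, test_ratio)  # both 0.05: the 5%/95% mix is preserved
```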
for Model Comparison • No model consistently outperforms the other • Purple is better at low thresholds • Red is better at high thresholds • Area Under ROC Curve (AUC) • Compares models directly (AUC = 0.865 vs AUC = 0.859)
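AUC can be computed without plotting a curve: it equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one (ties counting half). A sketch with illustrative scores:

```python
# AUC as the fraction of positive/negative pairs ranked correctly.
def auc(scores, labels):
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0       # positive ranked above negative
            elif p == n:
                wins += 0.5       # tie counts half
    return wins / (len(positives) * len(negatives))

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1,   1,   0,   1,   0,   0]
print(auc(scores, labels))  # 8 of 9 pairs ranked correctly -> 8/9 ~ 0.889
```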