
WB Day 3 Consolidated

DataScienceDojo
October 03, 2018

Consolidation of all slides from Day 3.

Transcript

  1. Instructor – Raja Iqbal • Founder, CEO & Chief Data

    Scientist. • Worked in Bing data mining, Bing Ads (2006-2013) • ETL, bot detection, online experimentation and A/B testing, relevance of online ads, click prediction, etc. • Ph.D. in CS with a focus on computer vision, machine learning and data mining. Copyright (c) 2018. Data Science Dojo 2
  2. Instructor – Rebecca Merrett • Technical Writer and Content Developer.

    • Worked in game engine technology, writing technical content on new features. • Graduate diploma in mathematics and statistics, with a bachelor degree in information and media. Copyright (c) 2018. Data Science Dojo 3
  3. Instructor – Victoria Louise Clayton • Instructor and Mentor. •

    Worked for a small research consultancy in London and worked on projects for governments, international organizations and companies such as the UN and Siemens. • BA in Human Sciences from Oxford University and an MSc in Decision Science. Copyright (c) 2018. Data Science Dojo 4
  4. Instructor – Margaux Penwarden • Instructor and Mentor. • Data

    scientist at McKinsey & Company, Sydney • Bachelor’s in Computer Science and Mathematics from Télécom Paristech (“Grande Ecole”), and a Master’s in Statistics from Imperial College, London. Copyright (c) 2018. Data Science Dojo 5
  5. About Data Science Dojo •Started in August 2014 •100+ bootcamps,

    workshops and corporate trainings •~3500 attendees •600+ companies •10 countries Copyright (c) 2018. Data Science Dojo 6
  6. Learning Objectives • Learn the theory and practice of data

    science for improved health systems and healthcare. • Explore and visualize a health-related dataset. • Build and evaluate predictive models for classification and regression (for instance, predicting whether a tumour is malignant or not), as an example of the use of machine learning in health. Copyright (c) 2018. Data Science Dojo 7
  7. Learning Objectives • Understand the fundamentals of unsupervised learning and

    clustering, and its potential applications in health systems • Learn fundamentals of text analytics and perform text analytics on a health-related dataset • Get an introduction to big data and data engineering Copyright (c) 2018. Data Science Dojo 8
  8. Maximizing ROI From This Week •Map the techniques to real

    problems at all times: •Problem and business impact •Data you have (and do not have). •Measurement metrics •Business metrics 10 Copyright (c) 2018. Data Science Dojo
  9. Logistics • 8:30 am – 5:30 pm daily* • Course

    material and resources: • Handbooks • Learning portal • Request: • Make sure your computers are ready • Keep the session interactive • Social media, email, etc. Copyright (c) 2018. Data Science Dojo 11 *We will end at 4:00 pm on Friday
  10. Agenda for Today Session I: Understanding the AI and data

    science landscape Session II: Data exploration and visualization Session III: Introduction to predictive modeling Session IV: Decision tree learning and building your first predictive model Session V: Evaluating classification models Copyright (c) 2018. Data Science Dojo 12
  11. Objectives • Review the current data science landscape • Discuss

    what other organizations are (or may be) doing • Common data mining tasks • Identify some data science problems in health Copyright (c) 2018. Data Science Dojo 14
  12. Drug Discoveries • Insilico Medicine • Finding new drugs and

    treatments including immunotherapies. • MIT Clinical Machine Learning Group • Focussed on disease processes and design for effective treatment of diseases such as Type 2 diabetes. • Knight Cancer Institute • With a current focus on developing an approach to personalize drug combinations for Acute Myeloid Leukemia (AML). Copyright (c) 2018. Data Science Dojo 15
  13. Medical Imaging & Diagnostics ▪ VunoMed • Identifies different types

    of lung tissue damage by color to help physicians make more accurate diagnoses. • IBM Watson Genomics • Provides precision medicine to cancer patients. Copyright (c) 2018. Data Science Dojo 16
  14. Virtual Assistants • Scanadu’s doc.ai • NLP program that allows

    patients to get their lab results explained to them by an app, saving both patient and doctor time and money. • Somatix • Recognizes hand-to-mouth gestures in order to help people better understand their behavior and make life-affirming changes. Copyright (c) 2018. Data Science Dojo 17
  15. Research • Google DeepMind • Develops technology to address

    macular degeneration in aging eyes. • Desktop Genetics • AI-designed tech for more effective and affordable guides. Recognized as leader in genome editing technology. • iCarbonX • Monitors and models human biological data to enable people to find the proper lifestyle and treatments that can improve their health, life quality and joy. Copyright (c) 2018. Data Science Dojo 18
  16. Connecting the Dots •The underlying magic behind what we saw

    is ‘big data’ and ‘predictive analytics’ Copyright (c) 2018. Data Science Dojo 20
  17. Big Data Pipeline Stage: Data influx • Output: Data stream

    Stage: Collection • Output: Target data Stage: Preprocessing • Output: Preprocessed data Stage: Transformation • Output: Transformed data Stage: Data Mining • Output: Patterns Stage: Interpretation and Evaluation • Output: Knowledge discovery and actionable insights Copyright (c) 2018. Data Science Dojo 21
  18. Big Data – Technology, Platforms & Products (diagram) • Data Management:

    Collect, Store, Transform • Data Science: Reason, Model, Visualize, Recommend, Predict, Explore • Technologies shown: ETL/Log, SQL, NoSQL, MapReduce, Real Time Analytics Copyright (c) 2018. Data Science Dojo 22
  19. From Analytics Translator to Data Science Engineer Analytics translators are

    just as important as data scientists and data engineers! • Identify applications and use cases • Convey the needs of the business to data scientists and engineers and vice versa • Generate user buy-in • Educate the business on high level concepts of analytics 23 Copyright (c) 2018. Data Science Dojo
  20. Data Mining Tasks • Descriptive Methods: • Find human-interpretable patterns

    that describe the data • Techniques: Clustering, Association Analysis, X-point summaries • Predictive Methods: • Use available data to build models that can predict the outcome of future data • Techniques: Classification, Regression, Anomaly and Deviation Detection • Prescriptive Methods: • Predict future outcomes and suggest actions that may prevent or mitigate the impact of the predicted outcomes • Techniques: Various optimization techniques Copyright (c) 2018. Data Science Dojo 24
  21. Traffic Management Descriptive [Informing Role]: • Traffic jam has happened

    already • [Implicit: Do something about it] Copyright (c) 2018. Data Science Dojo 25
  22. Traffic Management Predictive [Informing and Warning Role]: • Traffic jam

    is about to happen in the next 30 minutes • [Implicit: Do something before it happens] Copyright (c) 2018. Data Science Dojo 26
  23. Traffic Management Prescriptive [Informing, Warning, and Advisory Role]: Take action

    so traffic jam does not happen OR Traffic jam is about to happen in the next 30 minutes and you could possibly take the following courses of action: • Route traffic to service road near I-5 • Block more traffic from entering the WA-520 bridge Copyright (c) 2018. Data Science Dojo 27
  24. Data Mining and Predictive Analytics In the next few slides,

    we will take a look at some of the most common data mining tasks. Copyright (c) 2018. Data Science Dojo 29
  25. Classification: A Simple Example Copyright (c) 2018. Data Science Dojo 30

    Training Set (learn a classifier from it, then predict Cheat on the test set):
    Tid  Refund  Marital Status  Taxable Income  Cheat
    1    Yes     Single          125K            No
    2    No      Married         100K            No
    3    No      Single          70K             No
    4    Yes     Married         120K            No
    5    No      Divorced        95K             Yes
    6    No      Married         60K             No
    7    Yes     Divorced        220K            No
    8    No      Single          85K             Yes
    9    No      Married         75K             No
    10   No      Single          90K             Yes

    Test Set:
    Refund  Marital Status  Taxable Income  Cheat
    No      Single          75K             ?
    Yes     Married         50K             ?
    No      Married         150K            ?
    Yes     Divorced        90K             ?
    No      Single          40K             ?
    No      Married         80K             ?
  26. Classification: More Examples • What is the likelihood that a

    patient will develop diabetes? • What is the likelihood that a COPD patient will be readmitted within 90 days of discharge? • What is the likelihood that a person will not show up to their appointment? Copyright (c) 2018. Data Science Dojo 31
  27. Clustering: An Illustration • Clustering in 3-D space using Euclidean

    distance • Intra-cluster distances are minimized • Inter-cluster distances are maximized Copyright (c) 2018. Data Science Dojo 32
  28. Clustering • Given a set of data points, each having

    a set of attributes, and a similarity measure among them, find clusters such that: • Data points within a cluster are more similar to one another • Data points in different clusters are less similar to one another 33 Copyright (c) 2018. Data Science Dojo
  29. Clustering: Similarity Measures • Similarity Measures: • Euclidean Distance if

    attributes are continuous • Other problem-specific measures • Example: whether a particular word occurs in both documents Copyright (c) 2018. Data Science Dojo 34
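The two kinds of measures on this slide can be made concrete in a few lines. A minimal Python sketch (the course labs use R; the function names and Jaccard word-overlap choice here are illustrative, not prescribed by the slides):

```python
import math

def euclidean(p, q):
    # Euclidean distance for continuous attributes
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def word_overlap(doc1, doc2):
    # Problem-specific measure for documents: fraction of the combined
    # vocabulary that occurs in both (Jaccard similarity over word sets)
    w1, w2 = set(doc1.lower().split()), set(doc2.lower().split())
    return len(w1 & w2) / len(w1 | w2)

print(euclidean((0, 0, 0), (3, 4, 0)))                         # 5.0
print(word_overlap("patient has fever", "patient has cough"))  # 0.5
```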
  30. Clustering: Examples To find groups of documents that are similar

    to each other based on the most important terms that appear in them (e.g. medical records) Copyright (c) 2018. Data Science Dojo 35
  31. Association Analysis Your behavior is being predicted, not by studying

    you, but by studying others. Copyright (c) 2018. Data Science Dojo 36
  32. Association Rule Discovery • Given a set of records each

    of which contains some number of items from a given collection: • Produce dependency rules that predict the occurrence of an item based on the occurrences of other items

    TID  Items
    1    Bread, Coke, Milk
    2    Beer, Bread
    3    Beer, Coke, Diaper, Milk
    4    Beer, Bread, Diaper, Milk
    5    Coke, Diaper, Milk

    Rules Discovered:
    {Milk} --> {Coke}
    {Diaper, Milk} --> {Beer}
    Copyright (c) 2018. Data Science Dojo 37
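The discovered rules can be checked against the five transactions using the standard support and confidence measures (these measures are conventional in association-rule mining, though the slide does not define them). An illustrative Python sketch:

```python
# The five transactions from the slide, as item sets
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    # Fraction of transactions containing every item in the set
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    # How often the rule's right side holds when the left side does
    return support(lhs | rhs) / support(lhs)

print(confidence({"Milk"}, {"Coke"}))            # ~0.75
print(confidence({"Diaper", "Milk"}, {"Beer"}))  # ~0.667
```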
  33. Regression • Predicts the value of a given continuous-valued variable based on

    the values of other variables, assuming a linear or nonlinear model of dependency Copyright (c) 2018. Data Science Dojo 39
  34. Anomaly Detection • Detect significant deviations from normal behavior •

    Applications: • Unusual patient behavior • Insurance fraud detection • Treatment outlier detection Copyright (c) 2018. Data Science Dojo 41
  35. Challenges in Data Mining Scalability Dimensionality Complex and heterogeneous data

    Data quality Data ownership and distribution Privacy Reaction time Many other domain specific issues Copyright (c) 2018. Data Science Dojo 42
  36. Wisconsin Breast Cancer Data 46 Copyright (c) 2018. Data Science

    Dojo • Features obtained from a digital image of a fine needle aspirate (FNA) of a breast mass. • Describes characteristics of the cell nuclei present in the image. • Attribute information: • ID number • Diagnosis (M = malignant, B = benign) • 10 real-valued features • Total of 569 records
  37. Wisconsin Breast Cancer Data 47 Copyright (c) 2018. Data Science

    Dojo Source: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data
  38. Features: Wisconsin Breast Cancer Data 48 Copyright (c) 2018. Data

    Science Dojo
    id: ID number
    diagnosis: the diagnosis of breast tissues (M = malignant, B = benign)
    radius_mean: mean of distances from center to points on the perimeter
    texture_mean: standard deviation of gray-scale values
    perimeter_mean: mean size of the core tumor
    smoothness_mean: mean of local variation in radius lengths
    compactness_mean: mean of perimeter^2 / area - 1.0
    concavity_mean: mean of severity of concave portions of the contour
    concave points_mean: mean number of concave portions of the contour
    fractal_dimension_mean: mean of "coastline approximation" - 1
    radius_se: standard error for the mean of distances from center to points on the perimeter
    texture_se: standard error for standard deviation of gray-scale values
    compactness_se: standard error for perimeter^2 / area - 1.0
    Source: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data
  39. Agenda •Why data exploration and visualization? •Exploration and visualization of

    data: •Core R functionality •lattice package •ggplot2 package Copyright © 2018. Data Science Dojo 50
  40. Data Beats Algorithm But… • More data usually yields good

    generalization performance, even with a simple algorithm • But there are caveats: • Amount of data may have diminishing returns • Data quality and variety matter • A decent performing learning algorithm is still needed • Most importantly, extracting useful features out of the data matters Copyright © 2018. Data Science Dojo 52
  41. Is Date-Time Stamp a Good Feature? Copyright © 2018. Data Science Dojo 53

    Example timestamp: 23:05:33 –5 UTC, April 3, 2014 • Derived features: Hour of day • Day of week • AM/PM
  42. Dispelling a Common Myth •There is NO single ML algorithm

    that will take raw data and give you the best model •You do NOT need to know a lot of machine learning algorithms to build robust predictive models Copyright © 2018. Data Science Dojo 54
  43. Janitorial Work is Important •Not spending time on understanding your

    data is a source of many problems! •Remember the 80/20 rule: • 80% : Data cleaning, data exploration, feature engineering, pre-processing, etc • 20% : Model building Copyright © 2018. Data Science Dojo 55
  44. Objectives •Develop an understanding of the high-level thinking process of

    data exploration •Make sense of data using visualization techniques •Learn to perform feature engineering •Become a good storyteller Copyright © 2018. Data Science Dojo 57
  45. Anscombe’s Quartet Plot Copyright © 2018. Data Science Dojo 58

         I            II           III          IV
    x     y      x     y      x     y      x     y
    10.0  8.04   10.0  9.14   10.0  7.46   8.0   6.58
    8.0   6.95   8.0   8.14   8.0   6.77   8.0   5.76
    13.0  7.58   13.0  8.74   13.0  12.74  8.0   7.71
    9.0   8.81   9.0   8.77   9.0   7.11   8.0   8.84
    11.0  8.33   11.0  9.26   11.0  7.81   8.0   8.47
    14.0  9.96   14.0  8.10   14.0  8.84   8.0   7.04
    6.0   7.24   6.0   6.13   6.0   6.08   8.0   5.25
    4.0   4.26   4.0   3.10   4.0   5.39   19.0  12.50
    12.0  10.84  12.0  9.13   12.0  8.15   8.0   5.56
    7.0   4.82   7.0   7.26   7.0   6.42   8.0   7.91
    5.0   5.68   5.0   4.74   5.0   5.73   8.0   6.89
  46. Anscombe’s Quartet • Consider the four datasets in the table on the

    previous slide. They share nearly identical summary statistics: Mean of X = 9, Variance of X = 11, Mean of Y = 7.5, Variance of Y = 4.125, Correlation between X & Y = 0.816 Copyright © 2018. Data Science Dojo 59
  47. Common Graphical Parameters • Title of graph using the main

    argument, main = "title" • Label the x axis using the xlab argument, xlab = "label x axis" • Label the y axis using the ylab argument, ylab = "label y axis" • Colors controlled by col • Get legends of layered plots with auto.key=TRUE Copyright © 2018. Data Science Dojo 61
  48. Exploring Data Commands Copyright © 2018. Data Science Dojo 62

    Command: Description
    read.csv(), read.table(): Load data/file into a dataframe
    data(): Loads or resets a built-in dataset
    names(): List names of variables in a dataframe
    head(): First 6 rows of data
    tail(): Last 6 rows of data
    str(): Display internal structure of an R object
    View(): View dataset in spreadsheet format in RStudio
    dim(): Dimensions (rows and columns) of a dataframe
    summary(): Display 5-number summary and mean
    colnames(): Provide column names
  49. Copyright © 2018. Data Science Dojo 64 Breast Cancer Dataset

    breast_cancer <- read.csv("data.csv")  # load the CSV into a dataframe
    head(breast_cancer)                    # inspect the first 6 rows
    # Note: data() loads built-in datasets only, so it is not needed here
  50. Boxplots • Summarizes quantitative/numeric data

    # Core Graphics
    boxplot(radius_mean ~ diagnosis,
            data = breast_cancer,
            main = "Radius Mean for various diagnoses",
            xlab = "Diagnosis",
            ylab = "Radius Mean")
    B: Benign, M: Malignant Copyright © 2018. Data Science Dojo 65
  51. Scatter Plot ▪ Visual depiction of correlation between numeric variables

    # Core Graphics
    plot(breast_cancer$concave.points_worst,
         breast_cancer$perimeter_worst,
         xlab = "Concave Points Worst",
         ylab = "Perimeter Worst")
    Copyright © 2018. Data Science Dojo 67
  52. Scatter Plot # Core Graphics plot(perimeter_worst~area_worst, data=breast_cancer) ▪ Plot of

    perimeter_worst against area_worst Copyright © 2018. Data Science Dojo 68
  53. Scatter Plot

    plot(concave.points_worst ~ perimeter_worst,
         data = breast_cancer,
         main = "Concave Points Worst vs Perimeter Worst",
         xlab = "Perimeter Worst",
         ylab = "Concave Points Worst")
    abline(lm(concave.points_worst ~ perimeter_worst, data = breast_cancer), col = "red", lwd = 2)
    cor(breast_cancer$concave.points_worst, breast_cancer$perimeter_worst)
    > 0.816322101687544
    • Plots Concave Points Worst against Perimeter Worst, then adds a regression line • Find the correlation between the variables (values close to 1 or -1 indicate a strong linear relationship) Copyright © 2018. Data Science Dojo 69
  54. ggplot Fundamentals •ggplot() provides a blank canvas for plotting •geom_*()

    creates actual graphical layers • geom_point() • geom_boxplot() •aes() defines an "aesthetic" either globally or by layer Copyright © 2018. Data Science Dojo 71
  55. Histogram A histogram of counts of Concave Points Worst

    ggplot(breast_cancer, aes(x = concave.points_worst)) +
      geom_histogram()
    Copyright © 2018. Data Science Dojo 73
  56. Density Smooths over the counts of concave points worst ▪

    Note the location of aes()
    ggplot(breast_cancer) +
      geom_density(aes(x = concave.points_worst), fill = "gray50") +
      labs(x = "Concave Points Worst")
    Copyright © 2018. Data Science Dojo 74
  57. Saving a ggplot Object

    # Store the plot for future modifications
    g <- ggplot(breast_cancer, aes(x = concave.points_worst, y = perimeter_worst))
    # A second aes() adds settings specific to the geom_point layer
    g + geom_point(aes(color = diagnosis)) +
      labs(x = "Concave Points Worst", y = "Perimeter Worst")
    Copyright © 2018. Data Science Dojo 76
  58. Segmenting a Plot

    # Segment by factor
    g + geom_point(aes(color = diagnosis)) +
      facet_wrap(~diagnosis) +
      labs(x = "Concave Points Worst", y = "Perimeter Worst")
    Copyright © 2018. Data Science Dojo 77
  59. Summary ✓Basics of R ✓Graphing in R – core and

    ggplot2 ✓Look at multiple types of graphs ✓Visualize and segment data to gain more insights ✓Identify key features ✓Summarize findings Copyright © 2018. Data Science Dojo 78
  60. 81 Copyright (c) 2018. Data Science Dojo 81 Agenda •Introduction

    to predictive analytics •Introduction to classification •Decision Tree Classifier •Hands-on Lab: Building a decision tree classifier using R
  61. 82 Copyright (c) 2018. Data Science Dojo 82 INTRODUCTION TO PREDICTIVE

    ANALYTICS
  62. 83 Copyright (c) 2018. Data Science Dojo 83 Emergency &

    Surgery Rooms • Gauss Surgical • Develops real-time blood monitoring solutions to provide an accurate and objective estimate of blood loss. • MedaSense • Assesses patients’ physiological response to pain.
  63. 84 Copyright (c) 2018. Data Science Dojo 84 Patient Data

    & Risk Assessment ▪ Watson for Oncology • Analyzes patients’ medical records and identifies treatment options for doctors and patients. ▪ SkinVision • Assesses skin cancer risk using image recognition and user-provided information. ▪ Berg • Includes dosage trials for intravenous tumor treatment, detection and management of prostate cancer.
  64. 85 Copyright (c) 2018. Data Science Dojo 85 Mental Health

    ▪ MedyMatch • Helps treat stroke and head trauma more effectively by detecting intracranial brain bleeds. ▪ P1vital • Predicting Response to Depression Treatment (PReDicT test) uses machine learning to guide anti-depressant treatment.
  65. 87 Copyright (c) 2018. Data Science Dojo 87 Supervised Learning

    Training Set → Learning Algorithm (Learn Model) → Model → Apply Model → Prediction on Test Set

    Training Set:
    ID  Perimeter  Texture  Concavity  Diagnosis
    1   84.8       25.41    0.1266     Malignant
    3   96.2       26.13    0.1302     Malignant
    5   123.5      29.54    0.1469     Benign
    7   120.9      26.92    0.1355     Benign
    10  121.2      27.02    0.1478     Benign
    11  153.4      33.83    0.1202     Benign

    Test Set:
    ID  Perimeter  Texture  Concavity  Diagnosis
    2   125.6      31.51    0.1578     ?
    4   123.8      30.28    0.1466     ?
    6   151.3      32.92    0.1395     ?
    8   86.3       26.03    0.1258     ?

    Predictions:
    ID  Diagnosis
    2   Benign
    4   Benign
    6   Benign
    8   Malignant
  66. 88 Copyright (c) 2018. Data Science Dojo 88 Decision Tree

    Learning (tree diagram: splitting attributes Perimeter <114.6 / ≥114.6, Concavity <0.1358 / ≥0.1358, Texture <26.29 / ≥26.29; leaves labeled Benign / Malignant)

    Training Set:
    ID  Perimeter  Texture  Concavity  Diagnosis
    1   84.8       25.41    0.1266     Malignant
    3   96.2       26.13    0.1302     Malignant
    5   123.5      29.54    0.1469     Benign
    7   120.9      26.92    0.1355     Benign
    10  121.2      27.02    0.1478     Benign
    11  153.4      33.83    0.1202     Benign
  67. 89 Copyright (c) 2018. Data Science Dojo 89 A Different

    Decision Tree (tree diagram: splits on Texture <26.29 / ≥26.29, Perimeter <114.6 / ≥114.6, Concavity <0.1358 / ≥0.1358; leaves labeled Malignant / Benign) • There could be more than one tree that fits the same data! (Training set as on the previous slide.)
  68. 90 Copyright (c) 2018. Data Science Dojo 90 Decision Tree

    Application • Training Set → Learning Algorithm (Induction) → Model → Apply Model (Deduction) → Test Set (Training and test tables as on the Supervised Learning slide; predicted diagnoses: ID 2 Benign, ID 4 Benign, ID 6 Benign, ID 8 Malignant)
  69. 91 Copyright (c) 2018. Data Science Dojo 91 Apply Model

    to Test Data • Start from the root of the tree. (Tree: Perimeter <114.6 / ≥114.6, Concavity <0.1358 / ≥0.1358, Texture <26.29 / ≥26.29; leaves Benign / Malignant) Test record: ID 2, Perimeter 125.6, Texture 31.51, Concavity 0.1578, Diagnosis ?
  70. 92 Copyright (c) 2018. Data Science Dojo 92 Apply Model

    to Test Data (same tree and test record as the previous slide; the traversal advances one node)
  71. 93 Copyright (c) 2018. Data Science Dojo 93 Apply Model

    to Test Data (same tree and test record; the traversal advances one node)
  72. 94 Copyright (c) 2018. Data Science Dojo 94 Apply Model

    to Test Data (same tree and test record; the traversal advances one node)
  73. 95 Copyright (c) 2018. Data Science Dojo 95 Apply Model

    to Test Data (same tree and test record; the traversal advances one node)
  74. 96 Copyright (c) 2018. Data Science Dojo 96 Apply Model

    to Test Data • Reaching a leaf assigns Diagnosis = “Benign” to test record ID 2
  75. 97 Copyright (c) 2018. Data Science Dojo 97 How Do

    We Get A Tree? • Exponentially many decision trees are possible • Finding the optimal tree is infeasible • Greedy methods that find near-optimal solutions do exist
  76. 98 Copyright (c) 2018. Data Science Dojo 98 Tree Induction

    • Greedy strategy • Split based on an attribute test that optimizes a criterion • Issues: • How to split the records? • What attribute test condition? • How to determine the best split? • When do we stop?
  77. 99 Copyright (c) 2018. Data Science Dojo 99 Tree Induction

    • Greedy strategy • Split based on an attribute test that optimizes a criterion • Issues: • How to split the records? • What attribute test criterion? • How to determine the best split? • When do we stop?
  78. 100 Copyright (c) 2018. Data Science Dojo 100 Splitting a

    Node • Binary split: Texture > 26.29? (Yes / No) • Multi-way split: Texture partitioned into ranges (e.g. <16.5, [16.5, 22.2), [22.2, 32.5), …)
  79. 101 Copyright (c) 2018. Data Science Dojo 101 Tree Induction

    • Greedy strategy • Split based on an attribute test that optimizes a criterion • Issues: • How to split the records? • What attribute test criterion? • How to determine the best split? • When do we stop?
  80. 102 Copyright (c) 2018. Data Science Dojo 102 What is

    The Best Split? Before splitting: 10 records of class C1 (Benign), 10 records of class C2 (Malignant). Which test condition is the best?
    • Texture? (<26.29 / ≥26.29): C1:6 C2:4 | C1:4 C2:6
    • Concavity? (<0.14 / ≥0.14 & ≤0.15 / >0.15): C1:1 C2:3 | C1:8 C2:0 | C1:1 C2:7
    • ID? (s1, s2, s3, …, s20): each partition holds a single record
  81. 103 Copyright (c) 2018. Data Science Dojo 103 What is

    The Best Split? • Greedy approach • Homogeneous class distribution preferred • Need a measure of node impurity • Example: C1:9, C2:1 is homogeneous (low degree of impurity); C1:5, C2:5 is non-homogeneous (high degree of impurity) (C1: Benign, C2: Malignant)
  82. 104 Copyright (c) 2018. Data Science Dojo 104 Measures of

    Node Impurity •Gini Index •Entropy •Misclassification error
  83. 105 Copyright (c) 2018. Data Science Dojo 105 Impurity Measure:

    GINI • GINI(t) = 1 − Σj [p(j | t)]² • p(j | t) is the relative frequency of class j at node t • Maximum (1 − 1/nc, where nc = number of classes) when records are equally distributed among all classes, implying least interesting information • Minimum (0.0) when all records belong to one class, implying most interesting information • Examples: C1:0 C2:6 → Gini = 0.000; C1:1 C2:5 → Gini = 0.278; C1:2 C2:4 → Gini = 0.444; C1:3 C2:3 → Gini = 0.500 (C1: Benign, C2: Malignant)
  84. 106 Copyright (c) 2018. Data Science Dojo 106 Impurity Measure:

    GINI • GINI(t) = 1 − Σj [p(j | t)]²
    • C1:0 C2:6 → P(C1) = 0/6 = 0, P(C2) = 6/6 = 1, Gini = 1 − 0² − 1² = 0
    • C1:1 C2:5 → P(C1) = 1/6, P(C2) = 5/6, Gini = 1 − (1/6)² − (5/6)² = 0.278
    • C1:2 C2:4 → P(C1) = 2/6, P(C2) = 4/6, Gini = 1 − (2/6)² − (4/6)² = 0.444
    (C1: Benign, C2: Malignant)
  85. 107 Copyright (c) 2018. Data Science Dojo 107 Impurity Measure:

    GINI • When a node p is split into k partitions (children), the quality of the split is computed as: GINI_split(p) = Σ(i=1..k) (ni / n) · GINI(i), where ni = number of records at child i and n = number of records at node p
  86. 108 Copyright (c) 2018. Data Science Dojo 108 Impurity Measure:

    GINI • Split data into two partitions • Partition measurements are weighted • Larger and purer partitions are sought after • Example: split B? of a parent with C1:6 C2:6 (Gini = 0.500) into N1 (C1:5, C2:2) and N2 (C1:1, C2:4): Gini(N1) = 1 − (5/7)² − (2/7)² = 0.408; Gini(N2) = 1 − (1/5)² − (4/5)² = 0.320; Gini(B?, Parent) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371 (C1: Benign, C2: Malignant)
  87. 109 Copyright (c) 2018. Data Science Dojo 109 Impurity Measure:

    Entropy • Entropy(t) = − Σj p(j | t) log2 p(j | t) • p(j | t) is the relative frequency of class j at node t • Maximum: records equally distributed • Minimum: all records belong to one class
  88. 110 Copyright (c) 2018. Data Science Dojo 110 Impurity Measure:

    Entropy • Entropy(t) = − Σj p(j | t) log2 p(j | t)
    • C1:0 C2:6 → Entropy = −0 log2 0 − 1 log2 1 = 0
    • C1:1 C2:5 → Entropy = −(1/6) log2 (1/6) − (5/6) log2 (5/6) = 0.65
    • C1:2 C2:4 → Entropy = −(2/6) log2 (2/6) − (4/6) log2 (4/6) = 0.92
    (C1: Benign, C2: Malignant)
  89. 111 Copyright (c) 2018. Data Science Dojo 111 Impurity Measure:

    Information Gain • GAIN_split = Entropy(p) − Σ(i=1..k) (ni / n) · Entropy(i) • Node p is split into k partitions; ni is the number of records in partition i • Measures reduction in entropy • Choose the split that maximizes GAIN • Tends to prefer splits with a large number of partitions
  90. 112 Copyright (c) 2018. Data Science Dojo 112 Impurity Measure:

    Classification Error • Error(t) = 1 − maxi P(i | t) • Maximum: records are equally distributed • Minimum: all records belong to one class • Similar to information gain • Less sensitive for more than 2 or 3 splits • Less prone to overfitting
  91. 113 Copyright (c) 2018. Data Science Dojo 113 Impurity Measure:

    Classification Error • Error(t) = 1 − maxi P(i | t)
    • C1:0 C2:6 → Error = 1 − max(0, 1) = 1 − 1 = 0
    • C1:1 C2:5 → Error = 1 − max(1/6, 5/6) = 1 − 5/6 = 1/6
    • C1:2 C2:4 → Error = 1 − max(2/6, 4/6) = 1 − 4/6 = 1/3
    (C1: Benign, C2: Malignant)
  92. 114 Copyright (c) 2018. Data Science Dojo 114 Tree Induction

    • Greedy strategy • Split based on an attribute test that optimizes a criterion • Issues: • How to split the records? • What attribute test criterion? • How to determine the best split? • When do we stop?
  93. 115 Copyright (c) 2018. Data Science Dojo 115 Sample Stopping

    Criteria • All the records belong to the same class • All the records have similar attribute values • Fixed termination or pruning: maximum number of levels, minimum number of samples per leaf node
  94. 116 Copyright (c) 2018. Data Science Dojo 116 Decision Trees

    - PROS • Intuitive • Easy interpretation for small trees • Non-parametric • Incorporates both numeric and categorical attributes • Fast • Once rules are developed, prediction is rapid • Robust to outliers (Tree diagram: splits on Perimeter <114.6 / ≥114.6, Concavity <0.1358 / ≥0.1358, Texture <26.29 / ≥26.29; leaves Benign / Malignant)
  95. 117 Copyright (c) 2018. Data Science Dojo 117 Decision Trees

    - CONS • Overfitting • Must be trained with great care • Rectangular Classification • Recursive partitioning of data may not capture complex relationships
  96. 120 Copyright (c) 2018. Data Science Dojo 120 Agenda •

    Evaluation of classification models: • Confusion Matrix • Accuracy, Precision, Recall, F1 measure • Building robust machine learning models: • Bias/variance tradeoff • Methods of evaluation: • Cross validation • ROC curve
  97. 121 Copyright (c) 2018. Data Science Dojo 121 The Limitations

    of Accuracy • Consider a 2-class problem: • Number of Class 0 examples = 9990 • Number of Class 1 examples = 10 • If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9 % • Accuracy is misleading!
  98. 123 Copyright (c) 2018. Data Science Dojo 123 Confusion Matrix

                          PREDICTED CLASS
                          Class=Yes   Class=No
    ACTUAL   Class=Yes    a (TP)      b (FN)
    CLASS    Class=No     c (FP)      d (TN)
    a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)
  99. 124 Copyright (c) 2018. Data Science Dojo 124 Confusion Matrix

    • Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN), using the confusion-matrix cells a (TP), b (FN), c (FP), d (TN)
  100. 125 Copyright (c) 2018. Data Science Dojo 125 Precision =

    a / (a + c) = TP / (TP + FP), using the confusion-matrix cells a (TP), b (FN), c (FP), d (TN)
  101. 126 Copyright (c) 2018. Data Science Dojo 126 Recall/Sensitivity =

    a / (a + b) = TP / (TP + FN), using the confusion-matrix cells a (TP), b (FN), c (FP), d (TN)
  102. 127 Copyright (c) 2018. Data Science Dojo 127 F1-Score

    • F1 = 2 · Precision · Recall / (Precision + Recall) = 2a / (2a + b + c) = 2TP / (2TP + FN + FP) • Harmonic mean of precision and recall
  103. 129 Copyright (c) 2018. Data Science Dojo 129 Is My

    Model Really Good? • My model shows an accuracy of 90% in the training environment • Would the model be 90% accurate in production environment?
  104. 130 Copyright (c) 2018. Data Science Dojo 130 Generalization •

    A machine learning model should be able to handle any data set coming from the same distribution as the training set. • Generalization refers to a model’s ability to perform well on unseen data drawn from that distribution, not just on the examples it was trained on
  105. 131 Copyright (c) 2018. Data Science Dojo 131 Overfitting (lack

    of generalization) • The gravest and most common sin of machine learning • Overfitting: learning so much from your data that you memorize it. • You do well on training data • But don’t do well (or even fail miserably) on test data
  106. 132 Copyright (c) 2018. Data Science Dojo 132 Train/Test Partition

    is Not Enough • Labelled data is split into Training Data (70%) and Blind Holdout Data (30%)
  107. 133 Copyright (c) 2018. Data Science Dojo 133 Blind Holdout

    Dataset • The person building the model has no access to the blind holdout dataset • Why do we need to lock it away? • Even in presence of a 70/30 split, you may end up with a model that is not generalized
  108. 136 Copyright (c) 2018. Data Science Dojo 136 “The generation

    of random numbers is too important to be left to chance.” (Robert R. Coveyou)
  109. 137 Copyright (c) 2018. Data Science Dojo 137 Bias/Variance Trade-off

    Bullseye is the theoretical best performance (accuracy, precision, recall or something else) Each dartboard represents a model
  110. 138 Copyright (c) 2018. Data Science Dojo 138 Bias/Variance Trade-off

    • Test your model on several variations of the dataset • Each dot represents a random variation of the test dataset
  111. 141 Copyright (c) 2018. Data Science Dojo 141 Cross Validation

    •Split data into k disjoint partitions •Train on k-1 partitions and test on 1 •Repeat k times
  112. 142 Copyright (c) 2018. Data Science Dojo 142 Cross Validation

    (k=10) (Diagram: the data is divided into folds 1–10; in each of ten rounds a different fold serves as the test set while the remaining nine folds form the training set)
  113. 143 Copyright (c) 2018. Data Science Dojo 143 Adjusting Learning

    Parameters (accuracy per cross-validation attempt A1–A5)
             max depth = 10   max depth = 7   max depth = 2
    A1       100%             80%             55%
    A2       60%              78%             55%
    A3       90%              79%             55%
    A4       70%              77%             55%
    A5       80%              81%             55%
    average  80%              79%             55%
  114. 144 Copyright (c) 2018. Data Science Dojo 144 Holdout Set

    •70% for training, 30% for testing •60/40 or 50/50 also possible •Repeated holdout: Apply 70/30, 60/40 or 50/50 many times.
  115. 145 Copyright (c) 2018. Data Science Dojo 145 Stratified Sampling

    •Use when class distribution is skewed •Ensures that all partitions have fixed ratio of classes •Same ratio as training set • If training set is 5% class 1 and 95% class 2, so is each partition
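The slide's 5%/95% example can be sketched directly: split each class's indices separately so both partitions keep the overall ratio. An illustrative Python sketch (the course labs use R; the function name and toy labels are ours):

```python
import random
from collections import Counter

def stratified_split(labels, test_frac, seed=0):
    # Split indices so train and test keep the overall class ratio
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train, test = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        cut = int(len(idx) * test_frac)
        test += idx[:cut]
        train += idx[cut:]
    return train, test

# Skewed labels: 5% class 1, 95% class 2; each partition keeps the 5/95 ratio
labels = [1] * 5 + [2] * 95
train, test = stratified_split(labels, test_frac=0.2)
print(Counter(labels[i] for i in test))  # Counter({2: 19, 1: 1})
```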
  116. 146 Copyright (c) 2018. Data Science Dojo 146 Using ROC

    for Model Comparison • No model consistently outperforms the other • Purple is better at low thresholds • Red is better at high thresholds • Area Under ROC Curve (AUC) • Compares models directly AUC=0.865 AUC=0.859