Machine Learning 102: Classification

With Marcos Arancibia, Product Manager, Data Science and Big Data (@MarcosArancibia)
Mark Hornick, Senior Director, Product Management, Data Science and Machine Learning (@MarkHornick)
oracle.com/machine-learning
Oracle Machine Learning Office Hours
Machine Learning 102 - Classification
Copyright © 2020, Oracle and/or its affiliates. All rights reserved

Today’s Agenda
• Questions
• Upcoming session
• Speaker: Marcos Arancibia – Machine Learning 102
• Q&A

Web Questions
• The previous session was awesome and I really loved it. I hope that similar sessions will be organized for the other ML functions.
• In this session, I would like you to provide details about the Cost Matrix and how to calculate and assign costs. I have gone through all the documentation, but it does not clearly explain how costs should be assigned. Please also provide details on model evaluation with respect to the cost matrix.

Next Session
June 25, 2020: Oracle Machine Learning Office Hours, 9AM US Pacific
Machine Learning 101 – Regression
Have you always been curious about what machine learning can do for your business problem, but could never find the time to learn the necessary practical skills? Do you wish to learn what Classification, Regression, Clustering and Feature Extraction techniques do, and how to apply them using the Oracle Machine Learning family of products? Join us for this special series “Oracle Machine Learning Office Hours – Machine Learning 101”, where we will go through the main steps of solving a Business Problem from beginning to end, using the different components available in Oracle Machine Learning: programming languages and interfaces, including Notebooks with SQL, UI, and languages like R and Python. This second session in the series will cover Regression, where we will learn how to set up a data set for regression modeling, build machine learning models that predict numeric values such as home prices, and evaluate model quality.
Marcos Arancibia, OML Product Management

For product info… https://www.oracle.com/machine-learning

Today’s Session: Machine Learning 102 - Classification
In this "ML Classification 102" session we will pick up where we left off in our 101 Session and go deeper into ML algorithms and the importance of Feature Selection. We will also explore further the correct way to evaluate models using the Confusion Matrix and the many statistics that can be computed from it.

Agenda
• What is machine learning?
• What is classification?
• Business problems addressed by classification
• Types of data needed for Classification
• Terminology
• Data preparation
• Model evaluation
• AutoML
• Q&A
• Further details in Model evaluation

Review: Model Evaluation: Confusion Matrix
How can we determine if a Model is any good? After scoring new (Test or Validation) data, we compare what the Model predicted would happen against the Actual Target.

                     Actual 1   Actual 0
Model predicted 1       20         12
Model predicted 0       10         50

Rows show what the Model predicted; columns show the Actual Responses found in the test data.
Precision looks only at the cases the Model predicted as positive and measures the share of them that are True Positives:
Precision = 20 / (20 + 12) = 62.5%
Accuracy takes into account the correctly predicted Positives and also the correctly predicted Negatives, which is key in many use cases:
Accuracy = (20 + 50) / (20 + 12 + 10 + 50) = 76.1%
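The arithmetic on this slide can be reproduced in a few lines of Python. This is a minimal sketch: the variable names are illustrative, and the counts are read from the slide's Precision and Accuracy formulas (20 true positives, 12 false positives, 10 false negatives, 50 true negatives).

```python
# Counts taken from the confusion matrix on the slide.
tp = 20  # predicted 1, actual 1 (true positives)
fp = 12  # predicted 1, actual 0 (false positives)
fn = 10  # predicted 0, actual 1 (false negatives)
tn = 50  # predicted 0, actual 0 (true negatives)

# Precision: share of predicted positives that are actually positive.
precision = tp / (tp + fp)

# Accuracy: share of all scored records that were predicted correctly.
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"Precision = {precision:.1%}")
print(f"Accuracy  = {accuracy:.1%}")
```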

Model Evaluation: Confusion Matrix
There are many more measures of Model quality available. Several can be easily computed, and several are available in Oracle Machine Learning.
From Wikipedia on Confusion Matrix (figure not reproduced here).
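As an illustration of how easily such measures can be computed, the sketch below derives a few standard ones from the same four counts used in the review slide (20 true positives, 12 false positives, 10 false negatives, 50 true negatives). The function name `confusion_measures` is hypothetical, and the selection of measures is an assumption; the slide does not list which ones it shows.

```python
def confusion_measures(tp, fp, fn, tn):
    """Return a few standard measures derived from confusion-matrix counts."""
    recall = tp / (tp + fn)            # sensitivity, true positive rate
    specificity = tn / (tn + fp)       # true negative rate
    fall_out = fp / (fp + tn)          # false positive rate = 1 - specificity
    f1 = 2 * tp / (2 * tp + fp + fn)   # harmonic mean of precision and recall
    return {
        "recall": recall,
        "specificity": specificity,
        "fall_out": fall_out,
        "f1": f1,
    }


measures = confusion_measures(tp=20, fp=12, fn=10, tn=50)
for name, value in measures.items():
    print(f"{name:12s} {value:.3f}")
```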

Lift/Gains Chart: Percentile Cumulative positive cases by Model
In Oracle Data Miner (SQL Developer desktop App):
• Interactive Comparison of multiple Cutoff Points
• Comparison of all Models
• In practice, the interpretation is: "How much of the total actual positive cases would I have captured if I had chosen to contact only the top X% of the customers, sorted in descending order by Probability?"

Lift/Gains Chart: Percentile Cumulative lift by Model
In Oracle Data Miner (SQL Developer desktop App):
• Interactive Comparison of multiple Cutoff Points
• Comparison of all Models
• In practice, the interpretation is: "How much better than a Random Choice would my model be if I had chosen to contact only the top X% of the customers, sorted in descending order by Probability?"
• Notice that at the far right all models converge to a lift of '1', because if you contact everyone you will have spoken to all positive responders, exactly as a random choice would

Lift/Gains Chart: Percentile Cumulative lift by Model
In Oracle Data Miner (SQL Developer desktop App):
• Interactive Comparison of multiple Cutoff Points
• Comparison of all Models
• In practice, the interpretation is: "How much money would I actually win or lose depending on the cutoff point, given that I have a base Cost for contacting people (cost of a lead), incremental revenue when a customer actually accepts an offer, and an incremental cost when the customer accepts an offer (cost of processing, welcome kits, etc.)?"
• We can also add limits on budget and on the number of people we can contact (Call Center limitations, for example)
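The profit question above can be sketched as a cumulative calculation over customers sorted by Probability. Everything specific here is an assumption made for illustration: the function name `cumulative_profit`, the cost and revenue figures, and the toy data are hypothetical and are not taken from Oracle Data Miner.

```python
import numpy as np


def cumulative_profit(y_true, prob, cost_per_contact, revenue_per_accept,
                      cost_per_accept, max_contacts=None):
    """Profit after contacting the top-k customers, for every k.

    All monetary parameters are hypothetical inputs supplied by the analyst.
    """
    y_true = np.asarray(y_true)
    prob = np.asarray(prob)
    order = np.argsort(-prob, kind="stable")           # highest probability first
    accepted = np.cumsum(y_true[order])                # positives reached so far
    contacted = np.arange(1, len(y_true) + 1)          # customers contacted so far
    profit = (accepted * (revenue_per_accept - cost_per_accept)
              - contacted * cost_per_contact)
    if max_contacts is not None:                       # e.g. Call Center capacity
        profit = profit[:max_contacts]
    return profit


# Toy data, invented for illustration only.
prob = [0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.50]
y_true = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]

profit = cumulative_profit(y_true, prob, cost_per_contact=2.0,
                           revenue_per_accept=50.0, cost_per_accept=10.0)
best_k = int(np.argmax(profit)) + 1
print(f"Best cutoff: contact top {best_k} customers, profit {profit[best_k - 1]:.0f}")

limited = cumulative_profit(y_true, prob, 2.0, 50.0, 10.0, max_contacts=5)
```

Passing `max_contacts` truncates the curve, which is one simple way to represent a contact limit; a budget limit could be handled the same way by capping the cumulative contact cost.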

Lift/Gains Chart: Decile statistics of proportion of Detected Target by Model
1. Sort data descending on PROBABILITY of being the TARGET as decided by the model
2. Divide the data into percentiles or deciles (slices of data with the same number of observations). For deciles this means 10 slices, while for percentiles it means 100
3. Evaluate on each slice how many correct TARGETS the Model is able to identify
4. To simulate the IDEAL model, we separately sort data descending on TARGET, and compute the proportion of TARGET on each slice
5. The IDEAL model would have done 100% on all initial slices until it ran out of TARGET records
6. The overall Average target proportion can be used as the Random Guess comparison
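The six steps above can be sketched with NumPy as follows. This is an illustrative sketch, not the Oracle Data Miner implementation: the function name `decile_stats` and the toy data are assumptions, and cumulative gain and cumulative lift are included only to connect the decile table to the two chart types described earlier.

```python
import numpy as np


def decile_stats(y_true, prob, n_slices=10):
    """Per-slice target proportion for the model, the IDEAL model, and random guess."""
    y_true = np.asarray(y_true)
    prob = np.asarray(prob)

    # Step 1: sort descending on the model's probability of the target.
    order = np.argsort(-prob, kind="stable")
    # Step 2: split into slices with (nearly) equal numbers of observations.
    slices = np.array_split(y_true[order], n_slices)
    # Step 3: proportion of actual targets found in each slice.
    model_rate = np.array([s.mean() for s in slices])

    # Steps 4 and 5: the IDEAL model sorts on the actual target itself.
    ideal_slices = np.array_split(np.sort(y_true)[::-1], n_slices)
    ideal_rate = np.array([s.mean() for s in ideal_slices])

    # Step 6: the overall target proportion is the random-guess baseline.
    baseline = y_true.mean()

    hits = np.cumsum([s.sum() for s in slices])
    seen = np.cumsum([len(s) for s in slices])
    return {
        "model_rate": model_rate,
        "ideal_rate": ideal_rate,
        "baseline": baseline,
        "cum_gain": hits / y_true.sum(),       # share of all positives captured
        "cum_lift": (hits / seen) / baseline,  # improvement over random guess
    }


# Toy data, invented for illustration: 20 records, 5 positives.
prob = np.linspace(0.95, 0.01, 20)
y_true = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0] + [0] * 10

stats = decile_stats(y_true, prob)
for i, (m, ideal) in enumerate(zip(stats["model_rate"], stats["ideal_rate"]), 1):
    print(f"decile {i:2d}: model {m:.2f}  ideal {ideal:.2f}  random {stats['baseline']:.2f}")
```

In the last decile the cumulative gain reaches 1 and the cumulative lift falls to 1, which matches the behavior at the far right of the charts.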

Density Chart of Predictions: Histogram of TARGET='1' vs TARGET='0'
1. The distribution of "PROBABILITY_OF_1" is compared across the TARGET levels for the Model
2. Models usually use a cutoff of 0.5 (or 50%) PROBABILITY when assigning a PREDICTION of "0" or "1". If PROBABILITY_OF_1 > 0.5 then the decision by the model is PREDICTION = '1', else PREDICTION = '0'
3. When looking at the Density Chart, one should expect that the better the model, the larger the separation it will show between the '1' and '0' distributions
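A small sketch of the cutoff rule and of the per-class histograms behind such a chart is shown below. The data values are invented for illustration, and the variable names simply mirror the column names used on the slide.

```python
import numpy as np

# Toy scored data, invented for illustration.
prob_of_1 = np.array([0.92, 0.81, 0.67, 0.45, 0.58, 0.30, 0.22, 0.15, 0.08, 0.51])
target = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])

# Cutoff rule from the slide: PREDICTION = 1 when PROBABILITY_OF_1 > 0.5.
cutoff = 0.5
prediction = (prob_of_1 > cutoff).astype(int)

# Histogram of PROBABILITY_OF_1 for each actual TARGET level (5 equal-width bins).
bins = np.linspace(0.0, 1.0, 6)
hist_1, _ = np.histogram(prob_of_1[target == 1], bins=bins)
hist_0, _ = np.histogram(prob_of_1[target == 0], bins=bins)

print("prediction:", prediction.tolist())
print("TARGET=1 counts per bin:", hist_1.tolist())
print("TARGET=0 counts per bin:", hist_0.tolist())
```

A well-separated model concentrates the TARGET='1' counts in the high-probability bins and the TARGET='0' counts in the low-probability bins; overlap around the cutoff is where misclassifications come from.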

ROC (Receiver Operating Characteristic) curve: True Positive Rate (Sensitivity) vs. False Positive Rate (1 – Specificity)
1. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings
2. The ROC curve is thus the sensitivity or recall as a function of fall-out
3. ROC analysis provides tools to select possibly optimal models and to discard suboptimal ones independently from (and prior to specifying) the cost context or the class distribution
4. The Area Under the Curve (AUC) is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming 'positive' ranks higher than 'negative')
5. If prediction models are well calibrated and unbiased, the "Gini inequality coefficient" can be derived from AUC: Gini = AUC * 2 - 1
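The points above can be illustrated with a short NumPy sketch. The function names `roc_points` and `auc_pairwise` and the toy data are assumptions made for this example. The AUC is computed two ways, as the area under the plotted curve and as the ranking probability from point 4 (ties counted as one half), so the two definitions can be compared.

```python
import numpy as np


def roc_points(y_true, score):
    """FPR and TPR at every distinct score used as a threshold."""
    y_true = np.asarray(y_true)
    score = np.asarray(score)
    n_pos = (y_true == 1).sum()
    n_neg = (y_true == 0).sum()
    fpr, tpr = [0.0], [0.0]
    for t in np.unique(score)[::-1]:          # thresholds from high to low
        pred = score >= t
        tpr.append((pred & (y_true == 1)).sum() / n_pos)
        fpr.append((pred & (y_true == 0)).sum() / n_neg)
    return np.array(fpr), np.array(tpr)


def auc_pairwise(y_true, score):
    """Probability that a random positive is scored above a random negative."""
    y_true = np.asarray(y_true)
    score = np.asarray(score)
    pos = score[y_true == 1]
    neg = score[y_true == 0]
    diff = pos[:, None] - neg[None, :]        # every positive/negative pair
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))


# Toy scored data, invented for illustration.
score = np.array([0.92, 0.81, 0.67, 0.45, 0.58, 0.30, 0.22, 0.15, 0.08, 0.51])
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])

fpr, tpr = roc_points(y_true, score)
auc_area = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))  # trapezoid rule
auc_rank = float(auc_pairwise(y_true, score))
gini = 2 * auc_rank - 1

print(f"AUC (area) = {auc_area:.4f}, AUC (ranking) = {auc_rank:.4f}, Gini = {gini:.4f}")
```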

ROC (Receiver Operating Characteristic) curve: True Positive Rate (Sensitivity) vs. False Positive Rate (1 – Specificity)
1. Added a few Statistics inside the ROC Curve Chart to make the Models easier to view

Thank You Marcos Arancibia | [email protected] Mark Hornick | [email protected]
Oracle Machine Learning Product Management


Copyright © 2020 Oracle and/or its affiliates. https://www.oracle.com/cloud/free/


Copyright © 2020, Oracle and/or its affiliates 18 Relationship between

Copyright © 2020, Oracle and/or its affiliates 19 Demo on


Copyright © 2020, Oracle and/or its affiliates 21 Q &
