Machine Learning 102: Classification

Have you always been curious about what machine learning can do for your business problem, but could never find the time to learn the necessary practical skills? Do you wish to learn what Classification, Regression, Clustering, and Feature Extraction techniques do, and how to apply them using the Oracle Machine Learning family of products? Join us for this second chapter of the series “Oracle Machine Learning Office Hours – Machine Learning 101”.

In this "ML Classification 102" session we picked up where we left off from our 101 Session, and went deeper in our discussions on ML algorithms, the importance of Feature Selection, and explored even more the correct way to evaluate models using the Confusion Matrix and the many statistics that can be computed from it.

We continued to make use of Oracle Machine Learning Notebooks, with Python and SQL as the underlying languages and OML4Py with AutoML for our demo environment.

Marcos Arancibia

June 09, 2020
Transcript

  1. Oracle Machine Learning Office Hours: Machine Learning 102 - Classification
    With Marcos Arancibia, Product Manager, Data Science and Big Data (@MarcosArancibia) and Mark Hornick, Senior Director, Product Management, Data Science and Machine Learning (@MarkHornick)
    oracle.com/machine-learning
    Copyright © 2020, Oracle and/or its affiliates. All rights reserved
  2. Today’s Agenda
    • Questions
    • Upcoming session
    • Speaker: Marcos Arancibia – Machine Learning 102
    • Q&A
  3. Web Questions
    • "The previous session was awesome and I really loved it. I hope that similar sessions will be organized based on different ML functions."
    • "In this current session, I would like you to provide details about the Cost Matrix and how to calculate and assign costs. I have gone through all the documentation, but it does not give a clear explanation of how cost should be assigned. Also, I request details on model evaluation with respect to the cost matrix."
  4. Next Session
    June 25, 2020, 9AM US Pacific: Oracle Machine Learning Office Hours – Machine Learning 101: Regression
    Have you always been curious about what machine learning can do for your business problem, but could never find the time to learn the necessary practical skills? Do you wish to learn what Classification, Regression, Clustering, and Feature Extraction techniques do, and how to apply them using the Oracle Machine Learning family of products? Join us for this special series “Oracle Machine Learning Office Hours – Machine Learning 101”, where we go through the main steps of solving a business problem from beginning to end, using the different components available in Oracle Machine Learning: programming languages and interfaces, including Notebooks with SQL, UI, and languages like R and Python. This session in the series covers Regression: we will learn how to set up a data set for regression modeling, build machine learning models that predict numeric values such as home prices, and evaluate model quality.
    Marcos Arancibia, OML Product Management
  5. Today’s Session: Machine Learning 102 - Classification
    In this "ML Classification 102" session we pick up where we left off in our 101 session, go deeper into our discussion of ML algorithms and the importance of Feature Selection, and further explore the correct way to evaluate models using the Confusion Matrix and the many statistics that can be computed from it.
  6. Agenda
    • What is machine learning?
    • What is classification?
    • Business problems addressed by classification
    • Types of data needed for classification
    • Terminology
    • Data preparation
    • Model evaluation
    • AutoML
    • Q&A
    • Further details on model evaluation
  7. Review: Model Evaluation: Confusion Matrix
    How can we determine if a Model is any good? After scoring new (Test or Validation) data, we compare what the Model predicted was going to happen vs. the Actual Target.

                             Actual 1   Actual 0
    Model predicted 1           20         12
    Model predicted 0           10         50

    Precision only takes into account the True Positives among the cases the model predicted as positive: Precision = 20 / (20 + 12) = 62.5%
    Accuracy takes into account the Positives but also the Negatives, which is key in many use cases: Accuracy = (20 + 50) / (20 + 12 + 10 + 50) = 76.1%
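    To make the two formulas concrete, here is a minimal sketch that reproduces the slide's numbers. It uses scikit-learn rather than the OML4Py calls demonstrated in the session, and the label vectors are fabricated to match the matrix above.

    ```python
    # Illustrative sketch (not the session's OML4Py code): the slide's
    # confusion-matrix metrics computed with scikit-learn.
    from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

    # Fabricated labels reproducing the slide's matrix: 20 TP, 12 FP, 10 FN, 50 TN.
    y_true = [1] * 20 + [0] * 12 + [1] * 10 + [0] * 50
    y_pred = [1] * 20 + [1] * 12 + [0] * 10 + [0] * 50

    # Note: scikit-learn puts actuals in rows and predictions in columns,
    # the transpose of the slide's layout.
    print(confusion_matrix(y_true, y_pred))
    print(precision_score(y_true, y_pred))  # 20 / (20 + 12) = 0.625
    print(accuracy_score(y_true, y_pred))   # (20 + 50) / 92 ≈ 0.761
    ```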
  8. Model Evaluation: Confusion Matrix
    There are many more measures of Model quality available; several can be computed easily, and several are available directly in Oracle Machine Learning. (See the Wikipedia article on the Confusion Matrix.)
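    As a quick illustration, several of those additional statistics fall out of the same four counts. A small sketch with the slide's numbers, using the formulas from the Wikipedia article:

    ```python
    # Additional confusion-matrix statistics derived from TP, FP, FN, TN.
    TP, FP, FN, TN = 20, 12, 10, 50

    sensitivity = TP / (TP + FN)   # recall / true positive rate ≈ 0.667
    specificity = TN / (TN + FP)   # true negative rate ≈ 0.806
    precision   = TP / (TP + FP)   # = 0.625
    f1 = 2 * precision * sensitivity / (precision + sensitivity)  # ≈ 0.645
    print(sensitivity, specificity, precision, f1)
    ```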
  9. Lift/Gains Chart: cumulative positive cases by Model, by Percentile
    In Oracle Data Miner (SQL Developer desktop App):
    • Interactive comparison of multiple cutoff points
    • Comparison of all Models
    • In practice, the interpretation is: "How much of the total actual positive cases would I have captured if I had chosen to contact only the top X% of the customers, sorted in descending order by Probability?"
  10. Lift/Gains Chart: cumulative lift by Model, by Percentile
    In Oracle Data Miner (SQL Developer desktop App):
    • Interactive comparison of multiple cutoff points
    • Comparison of all Models
    • In practice, the interpretation is: "How much better than a random choice would my model be if I had chosen to contact only the top X% of the customers, sorted in descending order by Probability?"
    • Notice that at the rightmost point all models converge to 1, because if you contact everyone you would have reached all positive responders
  11. Lift/Gains Chart: cumulative lift by Model, by Percentile
    In Oracle Data Miner (SQL Developer desktop App):
    • Interactive comparison of multiple cutoff points
    • Comparison of all Models
    • In practice, the interpretation is: "How much money would I actually win or lose depending on the cutoff point, given that I have a base cost for contacting people (the cost of a lead), I may gain incremental revenue when a customer accepts an offer, and I incur an incremental cost when the customer accepts an offer (processing, welcome kits, etc.)?" A sketch of this computation follows below.
    • We can also add limits on budget and on the number of people we can contact (Call Center limitations, for example)
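    The profit-by-cutoff idea on this slide (and the cost-matrix question from slide 3) can be sketched as below. The contact-cost, revenue, and fulfillment-cost figures, and the simulated scores, are assumptions for illustration, not values from the session.

    ```python
    # Hypothetical profit-by-cutoff computation with assumed economics.
    import numpy as np

    rng = np.random.default_rng(0)
    prob = rng.uniform(size=1000)                         # model P(accept)
    actual = (rng.uniform(size=1000) < prob).astype(int)  # simulated outcomes

    COST_PER_CONTACT = 2.0     # cost of a lead (assumed)
    REVENUE_PER_ACCEPT = 50.0  # incremental revenue on acceptance (assumed)
    COST_PER_ACCEPT = 8.0      # processing, welcome kit, etc. (assumed)

    order = np.argsort(-prob)               # sort descending by probability
    accepts = np.cumsum(actual[order])      # cumulative accepted offers
    contacts = np.arange(1, len(prob) + 1)  # cumulative people contacted
    profit = (accepts * (REVENUE_PER_ACCEPT - COST_PER_ACCEPT)
              - contacts * COST_PER_CONTACT)

    best = profit.argmax()
    print(f"Contact the top {100 * (best + 1) / len(prob):.0f}% "
          f"for maximum profit of {profit[best]:.2f}")
    ```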
  12. Lift/Gains Chart: decile statistics of the proportion of detected Target, by Model
    1. Sort the data descending on the PROBABILITY of being the TARGET as decided by the model
    2. Divide the data into percentiles or deciles (slices of data with the same number of observations). For deciles this means 10 slices; for percentiles it means 100
    3. Evaluate on each slice how many correct TARGETs the Model is able to identify
    4. To simulate the IDEAL model, we separately sort the data descending on TARGET and compute the proportion of TARGETs in each slice
    5. The IDEAL model would score 100% on all initial slices until it ran out of TARGET records
    6. The overall average TARGET proportion can be used as the Random Guess comparison (see the sketch below)
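    A minimal pandas sketch of steps 1–3 and 6 (the column names and the simulated data are illustrative, not the session's):

    ```python
    import numpy as np
    import pandas as pd

    def decile_lift(df, prob_col="PROBABILITY_OF_1", target_col="TARGET"):
        """Target rate per decile divided by the overall (random-guess) rate."""
        df = df.sort_values(prob_col, ascending=False).reset_index(drop=True)  # step 1
        df["decile"] = pd.qcut(df.index, 10, labels=False) + 1                 # step 2
        rate = df.groupby("decile")[target_col].mean()                         # step 3
        return rate / df[target_col].mean()                                    # step 6

    # Demo on simulated scores:
    rng = np.random.default_rng(0)
    prob = rng.uniform(size=1000)
    demo = pd.DataFrame({"PROBABILITY_OF_1": prob,
                         "TARGET": (rng.uniform(size=1000) < prob).astype(int)})
    print(decile_lift(demo))
    ```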
  13. Density Chart of Predictions: histogram of TARGET='1' vs. TARGET='0'
    1. The distribution of "PROBABILITY_OF_1" for each TARGET level is compared for the Model
    2. Models usually have a cutoff at 0.5 (or 50%) PROBABILITY for assigning a PREDICTION of "0" or "1": if PROBABILITY_OF_1 > 0.5 then the decision by the model is PREDICTION = '1', else PREDICTION = '0'
    3. When looking at the Density Chart, one should expect that the better the model, the larger the separation it is able to produce between the '1' and '0' distributions
  14. ROC (Receiver Operating Characteristic) curve: True Positive Rate (Sensitivity) vs. False Positive Rate (1 – Specificity)
    1. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings
    2. The ROC curve is thus the sensitivity or recall as a function of fall-out
    3. ROC analysis provides tools to select possibly optimal models and to discard suboptimal ones independently from (and prior to specifying) the cost context or the class distribution
    4. The Area Under the Curve (AUC) is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming 'positive' ranks higher than 'negative')
    5. If prediction models are well calibrated and unbiased, the Gini inequality coefficient can be derived from the AUC: Gini = AUC * 2 - 1
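    Points 1, 4, and 5 can be sketched with scikit-learn on simulated scores (illustrative, not the session's data):

    ```python
    import numpy as np
    from sklearn.metrics import roc_auc_score, roc_curve

    rng = np.random.default_rng(0)
    prob = rng.uniform(size=1000)
    actual = (rng.uniform(size=1000) < prob).astype(int)

    fpr, tpr, thresholds = roc_curve(actual, prob)  # point 1: TPR vs. FPR per threshold
    auc = roc_auc_score(actual, prob)               # point 4: Area Under the Curve
    gini = 2 * auc - 1                              # point 5: Gini = AUC * 2 - 1
    print(f"AUC = {auc:.3f}, Gini = {gini:.3f}")
    ```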
  15. ROC (Receiver Operating Characteristic) curve: True Positive Rate (Sensitivity) vs. False Positive Rate (1 – Specificity)
    A few statistics were added inside the ROC curve chart to make it easier to compare Models
  16. Relationship between the Density Chart, the ROC curve, and the Confusion Matrix
    ROC on Wikipedia: https://en.wikipedia.org/wiki/Receiver_operating_characteristic
  17. Demo on OML4Py: Advanced Model Evaluation and Comparison, Predictions and Statistics