Oracle Machine Learning 101: Classification

Slide 1

Slide 1 text

Oracle Machine Learning Office Hours
Machine Learning 101 - Classification

With
Mark Hornick, Senior Director, Product Management, Data Science and Machine Learning (@MarkHornick)
Marcos Arancibia, Product Manager, Data Science and Big Data (@MarcosArancibia)

oracle.com/machine-learning

Copyright © 2020, Oracle and/or its affiliates. All rights reserved

Slide 2

Slide 2 text

Slide 3

Slide 3 text

Next Session

June 9, 2020: Oracle Machine Learning Office Hours, 9AM US Pacific
Machine Learning 101 – Regression

Have you always been curious about what machine learning can do for your business problem, but could never find the time to learn the necessary practical skills? Do you wish to learn what Classification, Regression, Clustering and Feature Extraction techniques do, and how to apply them using the Oracle Machine Learning family of products?

Join us for this special series “Oracle Machine Learning Office Hours – Machine Learning 101”, where we will go through the main steps of solving a business problem from beginning to end, using the different components available in Oracle Machine Learning: programming languages and interfaces, including Notebooks with SQL, UI, and languages like R and Python.

This second session in the series will cover Regression, where we will learn how to set up a data set for regression modeling, build machine learning models that predict numeric values such as home prices, and evaluate model quality.

Marcos Arancibia, OML Product Management

Slide 4

Slide 4 text

Web Questions

• How can biased data be identified and reduced? Is there any standard technique available in Oracle ML notebooks using OML4SQL? Lots of people discuss algorithms, but no one really focuses on the preparation of test and training data. How can data be split without introducing bias? I think this topic, including proven sampling and bias techniques, deserves a dedicated office hours session.

• I have a customer interested in auto-tagging for images, and I was wondering how we can help them. Their idea is to manually tag many images to classify them into different types, for example to classify faces based on their shape (rectangular, circle, square, heart-shaped, diamond, triangle, oval) and then, based on this input, let ML do the auto-tagging on new images. They understand this is a process, so they will have to check the results of auto-tagging and educate the algorithm to get more and more accurate results. I've seen in this presentation: https://www.oracle.com/a/tech/docs/oracle-machine-learning-overview-and-roadmap.pdf that there are capabilities associated with automatic tagging and classification of images to work with Content And Experience, and I'd like to know more about those OML cognitive Services for images. Can anyone guide me to achieve the customer's target? Thanks in advance for your help on this matter.

Slide 5

Slide 5 text

Slide 6

Slide 6 text

Slide 7

Slide 7 text

Today’s Session: Machine Learning 101 - Classification

This first session in the series will cover Classification, where we will learn how to set up a data set for classification modeling, build machine learning models that can, e.g., discern between good or bad customers for a marketing offer, and evaluate the quality of that model.

Slide 8

Slide 8 text

Agenda

• What is machine learning?
• What is classification?
• Business problems addressed by classification
• Types of data needed for Classification
• Terminology
• Data preparation part I – Data Quality
• Data preparation part II – Derived attributes
• Data preparation part III – Algorithm-specific
• Algorithms supporting classification
• Model building
• Model evaluation
• Scoring

Slide 9

Slide 9 text

What is machine learning?

“Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data”

“Statistics is the discipline that concerns the collection, organization, displaying, analysis, interpretation and presentation of data”

“…Artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans”

“Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead”

“Deep learning (also known as deep structured learning or hierarchical learning) is part of a broader family of machine learning methods based on artificial neural networks”

– Wikipedia

STATISTICS: Collection, organization, displaying, analysis, interpretation, and presentation of data
ARTIFICIAL INTELLIGENCE: A program that can sense, reason, act, and adapt
MACHINE LEARNING: Algorithms whose performance improves as they are exposed to more data over time
DEEP LEARNING: Subset of ML in which multi-layered neural networks learn from vast amounts of data

Slide 10

Slide 10 text

Machine Learning Process

CRISP-DM (Cross-industry standard process for data mining) – often-cited methodology

Phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment

Automation Opportunities
• Data Preparation*
• Modeling
• Evaluation
• Deployment

Automation Challenges
• Data Preparation*
• Business Understanding
• Data Understanding

*Depending on the type of data. Time-based trends are a challenge; feature creation is hard without interpretation.

Slide 11

Slide 11 text

What is Classification?

Machine learning can be applied to a wide range of business problems

Classification is a subcategory of Supervised Learning (Machine Learning with a known past outcome) where the goal is to predict which of two or more categorical class labels applies to each new record, based on past observations.

Examples of Classification:
• Classify a tumor as “cancerous or benign”, using any available medical device data and patient information, including demographics, X-rays, MRI and other measurements.
• Identify a transaction as "fraud or not fraud", based on customer past behavior, location of the transaction, time of day, day of week, distance to home, distance to last transaction and time to last transaction (to check for feasibility of physically being there), credit limits, etc.
• Predict whether a customer is going to “buy or not buy a product”, based on past customer purchase behavior, previous marketing campaign information, website visits, likes and dislikes, etc.
• Understand whether a machine is going to "fail soon or not", based on IoT sensor data, weather predictions, locations, past failure data, etc.

Slide 12

Slide 12 text

Slide 13

Slide 13 text

What type of data is needed for a Classification problem?

“Garbage in, garbage out” is especially true for ML, but having the right data matters too

To predict if a customer is going to “buy or not buy a product”:

Historical data with known outcomes

ID  AGE  INCOME  ZIP    STATUS  …  INSURED  PURCHASED
1   25   25000   33131  M       …  Yes      Yes
2   46   76000   60923  D       …  No       No
3   33   96500   01439  S       …  Yes      Yes
…   …    …       …      …       …  …        …

New data, unknown outcomes – predict if customer will buy product

ID    AGE  INCOME  ZIP    STATUS  …  INSURED  PURCH   PROB_PURCH
1001  25   27500   01463  M       …  Yes      Yes/No  Prob(Yes)=?
1002  46   63000   33923  D       …  No       Yes/No  Prob(Yes)=?
1003  33   56500   92439  S       …  Yes      Yes/No  Prob(Yes)=?
…     …    …       …      …       …  …        …       …

Slide 14

Slide 14 text

What type of data is needed for a Classification problem?

“Garbage in, garbage out” is especially true for ML, but having the right data matters too

A Machine Learning classification algorithm will usually generate 2 new columns:
• PURCH: the decision of the likely outcome. Will the customer purchase this offer?
• PROB_PURCH: the probability or likelihood that the customer will purchase the offer

New data, unknown outcomes – predict if customer will buy product

ID    AGE  INCOME  ZIP    STATUS  …  INSURED  PURCH   PROB_PURCH
1001  25   27500   01463  M       …  Yes      Yes/No  Prob(Yes)=?
1002  46   63000   33923  D       …  No       Yes/No  Prob(Yes)=?
1003  33   56500   92439  S       …  Yes      Yes/No  Prob(Yes)=?
…     …    …       …      …       …  …        …       …
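The relationship between the two generated columns can be sketched in a few lines of Python. This is only an illustration: the probability values are invented, the `decide` helper is hypothetical, and the 0.5 cutoff is an assumption rather than a documented Oracle Machine Learning default.

```python
# Hypothetical scored rows: (customer ID, probability of purchase).
# The probabilities are made up for illustration; a real model would supply them.
scored = [(1001, 0.82), (1002, 0.35), (1003, 0.50)]


def decide(prob_yes, cutoff=0.5):
    """Turn a probability into a Yes/No decision using an assumed cutoff."""
    return "Yes" if prob_yes >= cutoff else "No"


# Each result row carries both outputs: the decision and the probability.
predictions = [(cust_id, decide(p), p) for cust_id, p in scored]

for cust_id, purch, prob_purch in predictions:
    print(cust_id, purch, prob_purch)
```

Lowering the cutoff flags more customers as likely buyers at the cost of more false alarms, which is why the probability column is often kept alongside the decision.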

Slide 15

Slide 15 text

Machine Learning terminology

Several names are used for the same components, depending on the field of study

Historical data with known outcomes

ID  AGE  INCOME  ZIP    STATUS  …  INSURED  PURCHASED
1   25   25000   33131  M       …  Yes      Yes
2   46   76000   60923  D       …  No       No
3   33   96500   01439  S       …  Yes      Yes
…   …    …       …      …       …  …        …

Data
• Database Table or View
• Data set (or dataset)
• Training data – to build a model
• Test data – to evaluate a model

Table Row
• Record
• Case
• Instance
• Example

Table Columns (AGE through INSURED)
• Variable
• Attribute
• Field
• Predictor

Table Column (PURCHASED)
• Target – what to predict
• Response

Table Column (ID)
• Case ID
• Unique ID

Slide 16

Slide 16 text

Machine Learning terminology

Several names are used for the same components, depending on the field of study

New data, unknown outcomes – predict if customer will buy product

ID    AGE  INCOME  ZIP    STATUS  …  INSURED  PURCH   PROB_PURCH
1001  25   27500   01463  M       …  Yes      Yes/No  Prob(Yes)=?
1002  46   63000   33923  D       …  No       Yes/No  Prob(Yes)=?
1003  33   56500   92439  S       …  Yes      Yes/No  Prob(Yes)=?
…     …    …       …      …       …  …        …       …

Data
• Database Table or View
• Scoring data – for predictions

Table Columns (PURCH, PROB_PURCH)
• Prediction
• Prediction Probability

Slide 17

Slide 17 text

Data preparation

Split the Data into Train and Test/Validation sets
• You build (train) the model on one set of data, and the model must generalize to new data arriving in the future. A separate sample, called the Test or Validation set, is used to check the expected model behavior.

Intuition:
• Build the Model on the training data and keep the Test Data aside
• Score the Test Data: pass the data for scoring without the actual response
• Compare the Model Predictions with the actual known Responses (Prediction vs. Target)
• Create the Confusion Matrix
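The split itself can be sketched with the Python standard library. This is a minimal illustration, not the Oracle Machine Learning procedure: the `train_test_split` helper, the 70/30 ratio, the fixed seed, and the generated `history` rows are all assumptions made for the example.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows, then split them into train and test lists."""
    shuffled = list(rows)
    # A seeded generator makes the split reproducible from run to run.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical historical data: (case ID, known outcome) pairs.
history = [(i, "Yes" if i % 3 == 0 else "No") for i in range(1, 101)]

train, test = train_test_split(history)
print(len(train), len(test))
```

Shuffling before cutting avoids a split that merely follows the storage order of the table, for example oldest customers in training and newest in test.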

Slide 18

Slide 18 text

Data preparation

Depending on the Algorithm and Data
• Data Transformation
• Standardization/Normalization of values
• Missing value Imputation

For example, what can be derived from a single date?

05/19/2020

Basic Information
• 139 days since 1st Jan 2020
• Tuesday
• Third day of the week
• Second day of the workweek
• Sunrise was at 6:32AM in Miami
• Sun will set at 8:02PM in Miami
• It's an overcast day in Miami
• There were Flood Warnings in Miami

Domain Knowledge
• Has been a customer for 3.5 years
• Machine has been operating for 564 days
• Customer increased spending in the last 3 months
• Revenue last month declined vs. the average of the previous 3 months
• Customer usage has declined 30% since last offer
• 6 months since last Contact
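Some of the basic calendar attributes can be derived mechanically. The sketch below uses only Python's `datetime` module; the `date_features` helper and its key names are hypothetical, and attributes such as sunrise, weather, or customer tenure would need external data that is not modeled here.

```python
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def date_features(d):
    """Derive a few basic attributes from a single calendar date."""
    jan1 = date(d.year, 1, 1)
    iso_day = d.isoweekday()  # Monday=1 ... Sunday=7
    return {
        "days_since_jan1": (d - jan1).days,
        "day_name": DAY_NAMES[d.weekday()],
        # With a Monday start, the ISO weekday is the day of the workweek.
        "day_of_workweek": iso_day,
        # Sunday-first numbering: Sunday=1 ... Saturday=7.
        "day_of_week_sunday_first": iso_day % 7 + 1,
        "is_weekend": iso_day >= 6,
    }


features = date_features(date(2020, 5, 19))
for name, value in features.items():
    print(name, value)
```

Each derived value becomes an additional column that an algorithm can use directly, whereas the raw date on its own carries little usable signal.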

Slide 19

Slide 19 text

Data preparation

OML Includes Automatic Data Preparation

Most algorithms require some form of data transformation. During the model build process, Oracle Machine Learning can automatically perform the transformations required by the algorithm. You can choose to supplement the automatic transformations with additional transformations of your own, or you can choose to manage all the transformations yourself.

In calculating automatic transformations, Oracle Machine Learning uses heuristics that address the common requirements of a given algorithm. This process results in reasonable model quality in most cases.

Binning, normalization, and outlier treatment are transformations that are commonly needed by data mining algorithms.

Slide 20

Slide 20 text

Data preparation

Binning
• Binning, also called discretization, is a technique for reducing the cardinality of continuous and discrete data. Binning groups related values together in bins to reduce the number of distinct values.
• Binning can improve resource utilization and model build response time dramatically without significant loss in model quality. Binning can improve model quality by strengthening the relationship between attributes.
• Supervised binning is a form of intelligent binning in which important characteristics of the data are used to determine the bin boundaries. In supervised binning, the bin boundaries are identified by a single-predictor decision tree that takes into account the joint distribution with the target. Supervised binning can be used for both numerical and categorical attributes.

Normalization
• Normalization is the most common technique for reducing the range of numerical data. Most normalization methods map the range of a single variable to another range (often [0, 1]).

Outlier Treatment
• A value is considered an outlier if it deviates significantly from most other values in the column. The presence of outliers can have a skewing effect on the data and can interfere with the effectiveness of transformations such as normalization or binning.
• Outlier treatment methods such as trimming or clipping can be implemented to minimize the effect of outliers.
• Outliers may represent problematic data, for example, a bad reading due to the abnormal condition of an instrument. However, in some cases, especially in the business arena, outliers are perfectly valid. For example, in census data, the earnings of some of the richest individuals can vary significantly from the general population. Do not treat this information as an outlier, since it is an important part of the data. You need domain knowledge to determine outlier handling.
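The three transformations can be illustrated with a small plain-Python sketch. It is not the Oracle Machine Learning implementation: the helper names are hypothetical, the binning shown is simple equal-width binning (not the supervised binning described above), and the bin range, clipping limits, and the extreme value of 5,000,000 are assumptions chosen for the example. The INCOME values are those from the sample rows in the slides.

```python
def equal_width_bin(value, lo, hi, n_bins):
    """Return the 0-based index of the equal-width bin over [lo, hi] holding value."""
    if hi == lo:
        return 0
    idx = int((value - lo) / (hi - lo) * n_bins)
    # Values at or beyond the edges fall into the first or last bin.
    return min(max(idx, 0), n_bins - 1)


def min_max_normalize(values):
    """Map values linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def clip(values, lower, upper):
    """Replace values outside [lower, upper] with the nearest limit."""
    return [min(max(v, lower), upper) for v in values]


incomes = [25000, 76000, 96500, 27500, 63000, 56500]

binned = [equal_width_bin(v, 0, 100000, 4) for v in incomes]
normalized = min_max_normalize(incomes)
# One hypothetical extreme value is appended to show the effect of clipping.
clipped = clip(incomes + [5_000_000], 0, 100000)

print(binned)
print(normalized)
print(clipped)
```

Without clipping, the single extreme value would stretch the min-max range so far that all other incomes would be squeezed close to 0, which is the skewing effect the outlier bullets describe.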

Slide 21

Slide 21 text

Model Evaluation: Confusion Matrix

How can we determine if a Model is any good?

After Scoring new (Test or Validation) data, we compare what the Model predicted was going to happen vs. the Actual Target.

                    Actual 1   Actual 0
Model predicted 1   20         12
Model predicted 0   10         50

Rows show what the Model predicted; columns show the Actual Responses found in the test data.

Precision is the share of the Model's predicted Positives that are True Positives:
Precision = 20 / (20 + 12) = 62.5%

Accuracy takes into account the Positives but also the Negatives, which is key in many use cases:
Accuracy = (20 + 50) / (20 + 12 + 10 + 50) = 76.1%
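The arithmetic can be checked with a few lines of Python. The counts are those from the slide, read with rows as model predictions and columns as actual responses, which is the orientation implied by the precision formula. Recall is not on the slide; it is added here as one more commonly used measure.

```python
# Counts from the slide's confusion matrix.
tp = 20  # predicted 1, actual 1 (true positives)
fp = 12  # predicted 1, actual 0 (false positives)
fn = 10  # predicted 0, actual 1 (false negatives)
tn = 50  # predicted 0, actual 0 (true negatives)

precision = tp / (tp + fp)
accuracy = (tp + tn) / (tp + fp + fn + tn)
recall = tp / (tp + fn)  # not on the slide; share of actual positives found

print(f"precision = {precision:.1%}")  # 62.5%
print(f"accuracy  = {accuracy:.1%}")   # 76.1%
print(f"recall    = {recall:.1%}")     # 66.7%
```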

Slide 22

Slide 22 text

Model Evaluation: Confusion Matrix

There are many more measures of Model quality available; several can be easily computed and several are available in Oracle Machine Learning.

From Wikipedia on Confusion Matrix: [figure not reproduced in this text export]

Slide 23

Slide 23 text

AutoML – new with OML4Py

Increase data scientist productivity – reduce overall compute time
Enables non-expert users to leverage Machine Learning

Diagram labels:
• Data Table
• Auto Model Selection: much faster than exhaustive search
• Auto Feature Selection: >50% reduction in features
• AutoTune: significant score improvement
• ML Model

Auto Feature Selection
– Reduce # of features by identifying most predictive
– Improve performance and accuracy

Auto Model Selection
– Identify in-database algorithm that achieves highest model quality
– Find best model faster than with exhaustive search

Auto Tune Hyperparameters
– Significantly improve model accuracy
– Avoid manual or exhaustive search techniques

Slide 24

Slide 24 text

Slide 25

Slide 25 text

Slide 26

Slide 26 text

Slide 27

Slide 27 text

Slide 28

Slide 28 text

Slide 29

Slide 29 text

Slide 30

Slide 30 text

Thank You

Mark Hornick | [email protected]
Marcos Arancibia | [email protected]
Oracle Machine Learning Product Management