
Oracle Machine Learning 101: Classification

Have you always been curious about what machine learning can do for your business problem, but could never find the time to learn the necessary practical skills? Do you wish to learn what Classification, Regression, Clustering and Feature Extraction techniques do, and how to apply them using the Oracle Machine Learning family of products?

In this special series “Oracle Machine Learning Office Hours – Machine Learning 101” we went through the main steps of solving a Business Problem from beginning to end, using the different components available in Oracle Machine Learning: programming languages and interfaces, including Notebooks with SQL, UI, and languages like R and Python.

This first session in the series covered Classification, where we learned how to set up a data set for classification modeling, build machine learning models that can, e.g., discern between good or bad customers for a marketing offer, and evaluate the quality of that model.

Marcos Arancibia

May 19, 2020

Transcript

  1. Oracle Machine Learning Office Hours

    Machine Learning 101 - Classification
    With Mark Hornick, Senior Director, Product Management, Data Science and Machine Learning (@MarkHornick), and Marcos Arancibia, Product Manager, Data Science and Big Data (@MarcosArancibia)
    oracle.com/machine-learning
    Copyright © 2020, Oracle and/or its affiliates. All rights reserved
  2. Today’s Agenda

    • Questions
    • Upcoming session
    • Speaker: Marcos Arancibia – Machine Learning 101
    • Q&A
  3. Next Session

    June 9, 2020: Oracle Machine Learning Office Hours, 9AM US Pacific
    Machine Learning 101 – Regression
    Have you always been curious about what machine learning can do for your business problem, but could never find the time to learn the necessary practical skills? Do you wish to learn what Classification, Regression, Clustering and Feature Extraction techniques do, and how to apply them using the Oracle Machine Learning family of products?
    Join us for this special series “Oracle Machine Learning Office Hours – Machine Learning 101”, where we will go through the main steps of solving a Business Problem from beginning to end, using the different components available in Oracle Machine Learning: programming languages and interfaces, including Notebooks with SQL, UI, and languages like R and Python.
    This second session in the series will cover Regression, where we will learn how to set up a data set for regression modeling, build machine learning models that predict numeric values such as home prices, and evaluate model quality.
    Marcos Arancibia, OML Product Management
  4. Web Questions

    • How can we identify and reduce bias in data? Is there any standard technique available in Oracle ML Notebooks using OML4SQL? Lots of people discuss algorithms, but no one really focuses on the preparation of test and training data. How do we split data without any bias? I think this topic deserves a dedicated office hours session, covering, for example, proven sampling and bias techniques.
    • I have a customer interested in auto-tagging for images, and I was wondering how we can help them. Their idea is to manually tag many images to classify them into different types, for example to classify faces based on their shapes (rectangular, circle, square, heart-shaped, diamond, triangle, oval) and then, based on this input, let ML do the auto-tagging on new images. They understand this is a process, so they will have to check the results of auto-tagging and educate the algorithm to get more and more accurate results. I've seen in this presentation (https://www.oracle.com/a/tech/docs/oracle-machine-learning-overview-and-roadmap.pdf) that there are capabilities for automatic tagging and classification of images that work with Content and Experience, and I'd like to know more about those OML cognitive services for images. Can anyone guide me to achieve the customer's target? Thanks in advance for your help on this matter.
  5. Today’s Session: Machine Learning 101 - Classification

    This first session in the series will cover Classification, where we will learn how to set up a data set for classification modeling, build machine learning models that can, e.g., discern between good or bad customers for a marketing offer, and evaluate the quality of that model.
  6. Agenda

    • What is machine learning?
    • What is classification?
    • Business problems addressed by classification
    • Types of data needed for Classification
    • Terminology
    • Data preparation part I – Data Quality
    • Data preparation part II – Derived attributes
    • Data preparation part III – Algorithm-specific
    • Algorithms supporting classification
    • Model building
    • Model evaluation
    • Scoring
  7. What is machine learning?

    “Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data”
    “Statistics is the discipline that concerns the collection, organization, displaying, analysis, interpretation and presentation of data”
    “…Artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans”
    “Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead”
    “Deep learning (also known as deep structured learning or hierarchical learning) is part of a broader family of machine learning methods based on artificial neural networks”
    – Wikipedia

    • STATISTICS: Collection, organization, displaying, analysis, interpretation, and presentation of data
    • ARTIFICIAL INTELLIGENCE: A program that can sense, reason, act, and adapt
    • MACHINE LEARNING: Algorithms whose performance improves as they are exposed to more data over time
    • DEEP LEARNING: Subset of ML in which multi-layered neural networks learn from vast amounts of data
  8. Machine Learning Process

    CRISP-DM (Cross-industry standard process for data mining) – often-cited methodology
    Phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment
    • Automation Opportunities: Data Preparation*, Modeling, Evaluation, Deployment
    • Automation Challenges: Data Preparation*, Business Understanding, Data Understanding
    *Depending on the type of data. Time-based trends are a challenge, and feature creation is hard without interpretation.
  9. What is Classification?

    Machine learning can be applied to a wide range of business problems
    Classification is a subcategory of Supervised Learning (Machine Learning with a known past outcome) where the goal is to predict which of two or more categorical class labels applies to new records, based on past observations.
    Examples of Classification:
    • Classify cells as “cancerous or benign tumor”, using the data from any medical device and patient information possible, including demographics and X-rays, MRI and other measurements.
    • Identify a transaction as "fraud or not fraud", based on customer past behavior, location of the transaction, time of day, day of week, distance to home, distance to last transaction and time to last transaction (to check for feasibility of physically being there), credit limits, etc.
    • Predict whether a customer is going to “buy or not buy a product”, based on past customer purchase behavior, previous marketing campaign information, website visits, likes and dislikes, etc.
    • Understand whether a machine is going to "fail soon or not", based on IoT sensor data, weather predictions, locations, past failure data, etc.
  10. What type of data is needed for a Classification problem?

    “Garbage in, garbage out” is especially true for ML, but so is the need for the right data
    To predict if a customer is going to “buy or not buy a product”:

    Historical data with known outcomes
    ID   AGE   INCOME   ZIP     STATUS   …   INSURED   PURCHASED
    1    25    25000    33131   M        …   Yes       Yes
    2    46    76000    60923   D        …   No        No
    3    33    96500    01439   S        …   Yes       Yes
    …    …     …        …       …        …   …         …
  11. What type of data is needed for a Classification problem?

    “Garbage in, garbage out” is especially true for ML, but so is the need for the right data
    To predict if a customer is going to “buy or not buy a product”:

    Historical data with known outcomes
    ID   AGE   INCOME   ZIP     STATUS   …   INSURED   PURCHASED
    1    25    25000    33131   M        …   Yes       Yes
    2    46    76000    60923   D        …   No        No
    3    33    96500    01439   S        …   Yes       Yes
    …    …     …        …       …        …   …         …

    New data with unknown outcomes – predict if customer will buy product
    ID     AGE   INCOME   ZIP     STATUS   …   INSURED   PURCH    PROB_PURCH
    1001   25    27500    01463   M        …   Yes       Yes/No   Prob(Yes)=?
    1002   46    63000    33923   D        …   No        Yes/No   Prob(Yes)=?
    1003   33    56500    92439   S        …   Yes       Yes/No   Prob(Yes)=?
    …      …     …        …       …        …   …         …        …
  12. What type of data is needed for a Classification problem?

    “Garbage in, garbage out” is especially true for ML, but so is the need for the right data
    A Machine Learning classification algorithm will usually generate 2 new columns:
    • PURCH: the decision of the likely outcome. Will the customer purchase this offer?
    • PROB_PURCH: the probability or likelihood that the customer will purchase the offer

    New data with unknown outcomes – predict if customer will buy product
    ID     AGE   INCOME   ZIP     STATUS   …   INSURED   PURCH    PROB_PURCH
    1001   25    27500    01463   M        …   Yes       Yes/No   Prob(Yes)=?
    1002   46    63000    33923   D        …   No        Yes/No   Prob(Yes)=?
    1003   33    56500    92439   S        …   Yes       Yes/No   Prob(Yes)=?
    …      …     …        …       …        …   …         …        …
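    To make the two output columns concrete, here is a minimal Python sketch of how a Yes/No decision can be derived from a probability. The probabilities are invented for illustration (the slides leave them as "?"), and the 0.5 cut-off is an assumption; a real model supplies the probabilities and may use a different decision rule.

```python
# Hypothetical scored rows: (customer ID, model-estimated probability of "Yes").
# These probabilities are invented for illustration only.
scored = [(1001, 0.82), (1002, 0.35), (1003, 0.50)]

THRESHOLD = 0.5  # assumed cut-off; not specified in the slides


def add_prediction(rows, threshold=THRESHOLD):
    """Return (ID, PURCH, PROB_PURCH) rows, deriving PURCH from the probability."""
    return [
        (cust_id, "Yes" if prob >= threshold else "No", prob)
        for cust_id, prob in rows
    ]


predictions = add_prediction(scored)
for cust_id, purch, prob_purch in predictions:
    print(f"{cust_id}  PURCH={purch:<3}  PROB_PURCH={prob_purch:.2f}")
```

    Keeping the probability next to the decision lets the business rank customers or choose a stricter cut-off later without rescoring.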
  13. Machine Learning terminology

    Several names are used for the same components, depending on the field of study
    • Data: Database Table or View; Data set (or dataset); Training data (to build a model); Test data (to evaluate a model)
    • Table Row: Record, Case, Instance, Example
    • Table Columns: Variable, Attribute, Field, Predictor
    • Table Column (here, PURCHASED): Target (what to predict), Response
    • Table Column (here, ID): Case ID, Unique ID

    Historical data with known outcomes
    ID   AGE   INCOME   ZIP     STATUS   …   INSURED   PURCHASED
    1    25    25000    33131   M        …   Yes       Yes
    2    46    76000    60923   D        …   No        No
    3    33    96500    01439   S        …   Yes       Yes
    …    …     …        …       …        …   …         …
  14. Machine Learning terminology

    Several names are used for the same components, depending on the field of study
    • Data: Database Table or View; Scoring data (for predictions)
    • Table Column: Prediction, Prediction Probability

    New data with unknown outcomes – predict if customer will buy product
    ID     AGE   INCOME   ZIP     STATUS   …   INSURED   PURCH    PROB_PURCH
    1001   25    27500    01463   M        …   Yes       Yes/No   Prob(Yes)=?
    1002   46    63000    33923   D        …   No        Yes/No   Prob(Yes)=?
    1003   33    56500    92439   S        …   Yes       Yes/No   Prob(Yes)=?
    …      …     …        …       …        …   …         …        …
  15. Data preparation

    Split the Data into Train and Test/Validation sets
    • You need to be able to build (train) the model on one set of data, and the model needs to generalize to new data coming in the future. We use a separate sample, called the Testing or Validation set, to test the expected model behavior.
    Intuition:
    • Build the Model on the training data and keep the Test Data aside
    • Score the Test Data: pass the data for Scoring without the Actual Response
    • Compare the Model Predictions with the Actual known Responses (Prediction vs. Target)
    • Create the Confusion Matrix
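    The split described on this slide can be sketched in plain Python as follows. This is an illustrative sketch only: the 70/30 ratio, the fixed seed, and the made-up `history` rows are assumptions, and it is not the OML API.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical historical data: (case ID, known outcome).
history = [(i, "Yes" if i % 3 == 0 else "No") for i in range(1, 101)]

train, test = train_test_split(history)

# The model is built on `train` only. The test rows are scored WITHOUT the
# actual response, which is kept aside for the later comparison.
test_inputs = [case_id for case_id, _ in test]
test_actuals = [outcome for _, outcome in test]

print(len(train), len(test))
```

    Shuffling before the cut is one simple way to avoid a split that follows the table's storage order; for rare target classes, a stratified split is a common alternative.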
  16. Data preparation

    Depending on the Algorithm and Data
    • Data Transformation
    • Standardization/Normalization of values
    • Missing value Imputation
    For example, what can be derived from a single date? 05/19/2020
    Basic Information
    • 139 days since 1st Jan 2020
    • Tuesday
    • Third day of the week
    • Second day of the workweek
    • Sunrise was at 6:32AM in Miami
    • Sun will set at 8:02PM in Miami
    • It's an overcast day in Miami
    • There were Flood Warnings in Miami
    Domain Knowledge
    • Has been a customer for 3.5 years
    • Machine has been operating for 564 days
    • Customer increased spending in the last 3 months
    • Revenue last month declined vs. Avg previous 3 months
    • Customer has declined usage 30% since last offer
    • 6 months since last Contact
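    As a rough illustration of deriving attributes from a single date, the sketch below computes a few calendar features for 05/19/2020 with Python's standard library. The feature names and the `customer_since` date are hypothetical, chosen only to show the idea of a domain-knowledge attribute such as customer tenure.

```python
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

d = date(2020, 5, 19)

features = {
    # Whole days elapsed since January 1 of the same year.
    "days_since_jan_1": (d - date(d.year, 1, 1)).days,
    # weekday(): Monday=0 ... Sunday=6.
    "day_name": DAY_NAMES[d.weekday()],
    # isoweekday(): Monday=1 ... Sunday=7 (position in a Monday-first workweek).
    "day_of_workweek": d.isoweekday(),
    # Sunday-first numbering: Sunday=1 ... Saturday=7.
    "day_of_week_sunday_first": d.isoweekday() % 7 + 1,
    "is_weekend": d.isoweekday() >= 6,
}

# Domain-knowledge attribute; the signup date is a hypothetical value.
customer_since = date(2016, 11, 19)
features["customer_tenure_years"] = round((d - customer_since).days / 365.25, 1)

for name, value in features.items():
    print(f"{name}: {value}")
```

    Attributes such as sunrise time or weather need an external data source, so they are left out of this sketch.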
  17. Data preparation

    OML includes Automatic Data Preparation
    Most algorithms require some form of data transformation. During the model build process, Oracle Machine Learning can automatically perform the transformations required by the algorithm. You can choose to supplement the automatic transformations with additional transformations of your own, or you can choose to manage all the transformations yourself.
    In calculating automatic transformations, Oracle Machine Learning uses heuristics that address the common requirements of a given algorithm. This process results in reasonable model quality in most cases.
    Binning, normalization, and outlier treatment are transformations that are commonly needed by data mining algorithms.
  18. Data preparation

    Binning
    • Binning, also called discretization, is a technique for reducing the cardinality of continuous and discrete data. Binning groups related values together in bins to reduce the number of distinct values.
    • Binning can improve resource utilization and model build response time dramatically without significant loss in model quality. Binning can improve model quality by strengthening the relationship between attributes.
    • Supervised binning is a form of intelligent binning in which important characteristics of the data are used to determine the bin boundaries. In supervised binning, the bin boundaries are identified by a single-predictor decision tree that takes into account the joint distribution with the target. Supervised binning can be used for both numerical and categorical attributes.
    Normalization
    • Normalization is the most common technique for reducing the range of numerical data. Most normalization methods map the range of a single variable to another range (often 0,1).
    Outlier Treatment
    • A value is considered an outlier if it deviates significantly from most other values in the column. The presence of outliers can have a skewing effect on the data and can interfere with the effectiveness of transformations such as normalization or binning.
    • Outlier treatment methods such as trimming or clipping can be implemented to minimize the effect of outliers.
    • Outliers may represent problematic data, for example, a bad reading due to the abnormal condition of an instrument. However, in some cases, especially in the business arena, outliers are perfectly valid. For example, in census data, the earnings for some of the richest individuals can vary significantly from the general population. Do not treat this information as an outlier, since it is an important part of the data. You need domain knowledge to determine outlier handling.
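    The three transformations can be illustrated with a small plain-Python sketch. It is not how Oracle Machine Learning implements them: the binning shown is simple equal-width (unsupervised) binning rather than the supervised binning described above, the clipping bounds are assumed, and the last income value is a made-up outlier added to the INCOME values from the earlier tables.

```python
def clip(values, lower, upper):
    """Outlier treatment by clipping: pull values outside [lower, upper] to the bounds."""
    return [min(max(v, lower), upper) for v in values]


def min_max_normalize(values):
    """Map values linearly onto [0, 1]; assumes the values are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bin(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 using equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # min(...) puts the maximum value into the last bin instead of a bin of its own.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# INCOME values from the example tables plus one hypothetical outlier.
incomes = [25000, 76000, 96500, 27500, 63000, 56500, 1_000_000]

clipped = clip(incomes, lower=25_000, upper=100_000)  # assumed bounds
normalized = min_max_normalize(clipped)
bins = equal_width_bin(clipped, n_bins=3)

print(clipped)
print([round(x, 2) for x in normalized])
print(bins)
```

    Without the clipping step, the single outlier would stretch the range so far that the other six incomes would all fall into the first bin and normalize to values close to 0, which is the skewing effect the slide warns about.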
  19. Model Evaluation: Confusion Matrix

    How can we determine if a Model is any good?
    After Scoring new (Test or Validation) data, we compare what the Model predicted was going to happen vs. the Actual Target.

                          Actual 1   Actual 0
    Model predicted 1        20         12
    Model predicted 0        10         50

    Rows are what the Model predicted; columns are the Actual Responses found in the test data.
    Precision considers only the True Positives as a share of all Predicted Positives:
    Precision = 20 / (20+12) = 62.5%
    Accuracy takes into account the Positives but also the Negatives, which is key in many use cases:
    Accuracy = (20 + 50) / (20 + 12 + 10 + 50) = 76.1%
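    The arithmetic on this slide can be reproduced with a short Python sketch. The `predicted` and `actual` lists are synthetic, constructed only so that their counts match the confusion matrix above (20, 12, 10, 50); the function name is illustrative.

```python
def confusion_counts(predicted, actual, positive=1):
    """Count true/false positives and negatives from paired labels."""
    tp = fp = fn = tn = 0
    for p, a in zip(predicted, actual):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Synthetic labels that reproduce the slide's matrix.
predicted = [1] * 20 + [1] * 12 + [0] * 10 + [0] * 50
actual    = [1] * 20 + [0] * 12 + [1] * 10 + [0] * 50

tp, fp, fn, tn = confusion_counts(predicted, actual)

precision = tp / (tp + fp)                  # share of predicted positives that are right
accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that are right

print(f"Precision = {precision:.1%}")  # Precision = 62.5%
print(f"Accuracy  = {accuracy:.1%}")   # Accuracy  = 76.1%
```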
  20. Model Evaluation: Confusion Matrix

    There are many more measures of model quality available. Several can be easily computed, and several are available in Oracle Machine Learning.
    [Figure: table of confusion-matrix measures, from the Wikipedia article on the Confusion Matrix]
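    As an example of such additional measures, the sketch below computes recall, specificity, and the F1 score from the counts in the confusion matrix of the previous slide (20 true positives, 12 false positives, 10 false negatives, 50 true negatives). These three measures are standard definitions chosen here for illustration; the slide itself does not single them out.

```python
# Counts taken from the confusion matrix on the previous slide.
tp, fp, fn, tn = 20, 12, 10, 50

precision = tp / (tp + fp)    # predicted positives that are truly positive
recall = tp / (tp + fn)       # actual positives the model found (sensitivity)
specificity = tn / (tn + fp)  # actual negatives the model correctly rejected
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

metrics = {
    "precision": precision,
    "recall": recall,
    "specificity": specificity,
    "f1": f1,
}

for name, value in metrics.items():
    print(f"{name:<12}{value:.3f}")
```

    Which measure matters most depends on the use case: recall is usually the priority when a missed positive is costly (for example, fraud), while precision matters more when acting on a false positive is expensive.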
  21. AutoML – new with OML4Py

    Increase data scientist productivity – reduce overall compute time
    Enables non-expert users to leverage Machine Learning
    • Auto Feature Selection
      – Reduce # of features by identifying most predictive
      – Improve performance and accuracy
    • Auto Model Selection
      – Identify in-database algorithm that achieves highest model quality
      – Find best model faster than with exhaustive search
    • Auto Tune Hyperparameters
      – Significantly improve model accuracy
      – Avoid manual or exhaustive search techniques
    [Figure: Data Table; Auto Model Selection (much faster than exhaustive search); Auto Feature Selection (>50% reduction in features); AutoTune (significant score improvement); ML Model]
    Copyright © 2019 Oracle and/or its affiliates.
  22. OML Notebooks with AutoML UI

    Complementary features for Oracle Autonomous Database
  23. AutoML code-free User Interface

    Model building and monitoring
  24. AutoML code-free User Interface

    Model building and monitoring
  25. AutoML code-free User Interface

    Model building and monitoring