Machine Learning 101: Feature Extraction

With Marcos Arancibia, Product Manager, Data Science and Big Data (@MarcosArancibia)
Mark Hornick, Senior Director, Product Management, Data Science and Machine Learning (@MarkHornick)
oracle.com/machine-learning
Oracle Machine Learning Office Hours: Machine Learning 101 – Feature Extraction
Copyright © 2020, Oracle and/or its affiliates. All rights reserved.

Today’s Agenda
• Upcoming session
• Speaker: Marcos Arancibia – Machine Learning 101: Feature Extraction
• Q&A

Next Session
January 2021: Oracle Machine Learning Office Hours, Machine Learning 102 – Feature Extraction

This eighth session in the series will cover Feature Extraction 102, where we continue to learn about methods to extract meaningful attributes from a large number of columns in datasets, explore dimensionality reduction in wider datasets and documents using Explicit Semantic Analysis, and compare the benefits of Feature Extraction as data pre-processing for Machine Learning models.

For product info… https://www.oracle.com/machine-learning

Today’s Session: Machine Learning 101 – Feature Extraction

This seventh session in the series will cover Feature Extraction 101, where we learn about the methods to extract meaningful attributes from a large number of columns in datasets, and explore dimensionality reduction and how it can be beneficial as a pre-processing step for machine learning models.

Feature Extraction - Introduction
What is Feature Extraction?

"Feature extraction involves reducing the number of resources required to describe a large set of data"
Wikipedia: https://en.wikipedia.org/wiki/Feature_extraction

The term "Feature Extraction" denotes several different methods that try to extract the most "information" possible from a set of data by using a combination of the original variables/columns.

In general, we can consider Feature Extraction in machine learning to be part of the pre-processing/data preparation cycle, which goes back and forth with the modeling stage.

[Figure: Cross-industry standard process for data mining, with the phases Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment]

Feature Selection
Part of the capabilities of Feature Extraction tools

Feature Selection, a form of dimensionality reduction also known as variable selection or attribute selection, is the process of selecting a subset of relevant features (variables, predictors, columns) for use in machine learning model construction.

Basic benefits of reducing the number of features are:
• Simplification of models to the core relevant features
• Faster training and scoring
• Potentially reduced variance and less overfitting (and relief from the curse of dimensionality)

Several supervised machine learning algorithms can do a "natural" selection of the best attributes via a "weight" given to the features.

Other methods can do an unsupervised selection of features by looking at the natural dispersion of the data and trying to select features that carry most of the information (variability) of the entire dataset with as few features as possible.
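As a rough illustration of the unsupervised idea above, the following sketch ranks columns by their variance and keeps the most dispersed ones. The dataset, the column names and the helper `select_by_variance` are hypothetical, and raw variance depends on the units of each column, so this is a toy under those assumptions rather than a recommended selection method.

```python
from statistics import pvariance

def select_by_variance(columns, keep):
    """Return the names of the `keep` columns with the largest variance."""
    ranked = sorted(columns, key=lambda name: pvariance(columns[name]), reverse=True)
    return ranked[:keep]

# Hypothetical toy dataset: column name -> values
data = {
    "age": [23, 45, 31, 52, 38],
    "constant_flag": [1, 1, 1, 1, 1],   # no dispersion, carries no information
    "income_k": [40, 85, 52, 120, 67],
}

selected = select_by_variance(data, keep=2)
print(selected)
```

A column that never changes has zero variance and is dropped first, which is the simplest case of a feature that adds nothing to the variability of the dataset.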

Feature Extraction
Algorithms

Some of the methods for Feature Extraction include:
- Attribute importance using Minimum Description Length
- Feature Extraction methods that use a transformation/translation/rotation of the original attribute axes, or a decomposition of the original variables into a set of matrices, like:
  - Principal Component Analysis
  - Singular Value Decomposition
  - Non-Negative Matrix Factorization
  - CUR Matrix Decomposition
  - Explicit Semantic Analysis for NLP and information retrieval

Using transformations, or simply excluding variables/columns with a weaker relationship with the target, is helpful when building predictive models with machine learning. Because good data preparation is often said to be 90% of the work, Feature Extraction can be a key element in building a better model.

Attribute Importance
• Compute the relative importance of predictor variables for predicting a response (target) variable
• Gain insight into the relevance of variables to guide manual variable selection or reduction, with the goal of reducing predictive model build time and/or improving model accuracy
• Attribute Importance uses a Minimum Description Length (MDL) based algorithm that ranks the relative importance of predictor variables in predicting a specified response (target) variable
• Pairwise only: each predictor is evaluated against the target
• Supports categorical targets (classification) and numeric targets (regression)
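The pairwise ranking idea can be sketched in a few lines. Note that this example swaps in mutual information as the score, not the MDL-based metric described on the slide, so the numbers are not comparable with what Oracle Machine Learning reports (mutual information is never negative, for instance). The predictor names and data are hypothetical.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Mutual information (in bits) between two categorical sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

def rank_attributes(predictors, target):
    """Score each predictor against the target, one pair at a time."""
    scores = {name: mutual_information(vals, target) for name, vals in predictors.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical data: six rows, one categorical target
target = ["yes", "yes", "no", "no", "yes", "no"]
predictors = {
    "owns_home": ["y", "y", "n", "n", "y", "n"],  # mirrors the target exactly
    "region":    ["a", "b", "a", "b", "a", "b"],  # weakly related
    "shuffled":  ["a", "a", "a", "a", "b", "b"],  # independent of the target
}

ranking = rank_attributes(predictors, target)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because each predictor is scored on its own against the target, interactions between predictors are invisible to this kind of ranking, which is the limitation the "pairwise only" bullet points at.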

OML Attribute Importance
• Includes Auto Data Preparation (normalization, binning)
• Can allow or ignore missing values
• Supports partitioned model builds
• Does NOT have a "Scoring" action; it only presents the attribute importance so the user can explore the results
• Returns a relative metric indicating how much the variable contributes to predicting the target
  • Values > 0 contribute to prediction
  • Values <= 0 do not contribute or add noise

Singular Value Decomposition - SVD
• Feature extraction algorithm
• Orthogonal linear transformations capture the underlying variance of the data by decomposing a rectangular matrix into three matrices: U, D and V
• Matrix D is a diagonal matrix, and its singular values reflect the amount of data variance captured by the singular vectors
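The decomposition described above can be tried with NumPy's `np.linalg.svd`, which is a general-purpose routine and not the Oracle Machine Learning implementation. The sketch below builds hypothetical data with five columns driven by only two underlying directions, then checks how much variance the first two singular values account for. The data, the seed and the choice `k = 2` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 rows, 5 columns, driven by 2 latent directions plus small noise
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 5))

# Center the columns so squared singular values are proportional to variance
Xc = X - X.mean(axis=0)

# Xc = U @ diag(d) @ Vt, with d sorted from largest to smallest
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)

# Share of total variance captured by each singular vector
explained = d**2 / np.sum(d**2)
print(np.round(explained, 4))

# Keep the first k directions as the extracted features
k = 2
features = U[:, :k] * d[:k]   # same as projecting: Xc @ Vt[:k].T
```

Here the extracted `features` matrix has two columns instead of five while retaining nearly all of the variance, which is the sense in which the singular values in D guide how many components to keep.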

Oracle Machine Learning SVD implementation
• Supports narrow data via Tall and Skinny solvers
• Supports wide data via stochastic solvers
• Provides Eigen Solvers for faster analysis with sparse data
• Provides traditional SVD for more stable results

Non-negative Matrix Factorization - NMF
• State-of-the-art algorithm for Feature Extraction
• Dimensionality reduction technique
  • Creates new features from existing attributes
  • Compare to Attribute Importance, which reduces attributes by taking a subset
• NMF derives fewer new "features", taking into account interactions among the original attributes
• Supports text mining, life sciences and marketing applications

Intuition on NMF
• Useful where there are many attributes
  • Each has weak predictability, or is even ambiguous
  • But taken in combination, they produce meaningful patterns, topics, or themes
• Example: Text
  • The same word can predict different documents, e.g., "hike" can be applied to the outdoors or to interest rates
  • NMF introduces context, which is essential for predictive power, e.g.:
    "hike" + "mountain" -> "outdoors sports"
    "hike" + "interest" -> "interest rates"
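To make the "hike" intuition concrete, here is a minimal sketch of NMF using the classic multiplicative update rules for the squared reconstruction error. It is not the Oracle Machine Learning algorithm; the word counts, the helper `nmf`, the number of features `k=2` and the iteration count are all assumptions chosen for illustration.

```python
import numpy as np

def nmf(V, k, iters=500, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (docs x words) into W (docs x k) and H (k x words)."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], k)) + eps
    H = rng.random((k, V.shape[1])) + eps
    for _ in range(iters):
        # Multiplicative updates keep every entry non-negative
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

words = ["hike", "mountain", "trail", "interest", "rates"]
# Hypothetical word counts: two outdoors documents, then two finance documents
V = np.array([
    [2, 3, 1, 0, 0],
    [1, 2, 2, 0, 0],
    [2, 0, 0, 3, 2],
    [1, 0, 0, 2, 3],
], dtype=float)

W0, H0 = nmf(V, k=2, iters=0)   # random starting point, for comparison
W, H = nmf(V, k=2)

print("error before:", np.linalg.norm(V - W0 @ H0))
print("error after: ", np.linalg.norm(V - W @ H))
print("feature-by-word weights:")
print(np.round(H, 2))
```

Each row of `H` is one extracted feature expressed as weights over the words, and each row of `W` says how strongly a document uses each feature. With data like this, one would expect "hike" to receive weight in both features while "mountain" and "interest" separate them, though the exact values depend on the random start.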

[Figure: Intuition on NMF. Left: attribute values (a, b, c, d, e, f, g, h, x, y, z, …) mapped directly to target values (1, 2). Right: the same attribute values mapped to extracted features (Feat 1, Feat 2, Feat 3, Feat 4), which are then mapped to target values (1, 2).]

Explicit Semantic Analysis (ESA)
A more interpretable model than LDA (Latent Dirichlet Allocation)

In NLP and information retrieval, ESA is a vectorial representation of text (individual words or entire documents) that uses a document corpus as a knowledge base
• A word is represented as a column vector in the TF-IDF matrix of the text corpus
• A document (string of words) is represented as the centroid of the vectors representing its words

The text corpus is often English Wikipedia, though other corpora can be used

Designed to improve text categorization
• Computes "semantic relatedness" using cosine similarity between the aforementioned vectors, collectively interpreted as a space of "concepts explicitly defined and described by humans"
• Wikipedia articles are equated with concepts

Usual Objectives:
• Calculate semantic similarity between text documents or between mixed data
• Explicit topic modeling for text
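The ESA representation described above can be sketched at toy scale. The three "concepts" below are hypothetical stand-ins for Wikipedia articles, and the particular TF-IDF weighting (term frequency times a smoothed inverse document frequency) is one common variant chosen for illustration, not necessarily what any production ESA implementation uses.

```python
from collections import Counter
from math import log, sqrt

# Hypothetical mini knowledge base: each entry stands in for one article (concept)
concepts = {
    "Hiking": "hike mountain trail outdoors boots trail",
    "Interest rate": "interest rate bank loan hike rate",
    "Banking": "bank loan deposit interest account",
}

names = list(concepts)
tokens = {name: text.split() for name, text in concepts.items()}
doc_freq = Counter(w for words in tokens.values() for w in set(words))

def word_vector(word):
    """TF-IDF weight of `word` in each concept: one column of the TF-IDF matrix."""
    if word not in doc_freq:
        return [0.0] * len(names)
    idf = log(len(names) / doc_freq[word]) + 1.0  # smoothed so shared words keep some weight
    return [tokens[name].count(word) / len(tokens[name]) * idf for name in names]

def text_vector(text):
    """Centroid of the word vectors of a non-empty text."""
    vectors = [word_vector(w) for w in text.lower().split()]
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

outdoor = text_vector("mountain trail hike")
finance = text_vector("bank interest hike")
query = text_vector("loan rate")

print("top concept for outdoor text:", names[outdoor.index(max(outdoor))])
print("finance vs query:", round(cosine(finance, query), 3))
print("outdoor vs query:", round(cosine(outdoor, query), 3))
```

Both sample texts contain "hike", yet they land near different concepts because the other words pull the centroid in different directions. Since each dimension of the vector is a named concept, the representation can be read directly, which is what makes the topic model "explicit".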

Thank You
Marcos Arancibia | Mark Hornick
Oracle Machine Learning Product Management

https://www.oracle.com/cloud/free/

Vector Quantization Methods for Face Representation

Principal Component Analysis Methods for Face Representation

Non-negative Matrix Factorization Methods for Face Representation

Demo

For more information… oracle.com/machine-learning

Q&A