
Machine Learning 102: Feature Extraction


Have you always been curious about what machine learning can do for your business problem, but could never find the time to learn the necessary practical skills? Do you wish to learn what Classification, Regression, Clustering and Feature Extraction techniques do, and how to apply them using the Oracle Machine Learning family of products?

Join us for this special series “Oracle Machine Learning Office Hours – Machine Learning 101”, where we will go through the main steps of solving a Business Problem from beginning to end, using the different components available in Oracle Machine Learning: programming languages and interfaces, including Notebooks with SQL, UI, and languages like R and Python.

This eighth session in the series will cover Feature Extraction 102, where we will focus on Explicit Semantic Analysis, its benefits over LDA, and its capabilities when working with a large text corpus like Wikipedia. We will see demos on a built-in custom text corpus as well as Wikipedia, using SQL and Python with OML4Py.

Marcos Arancibia

February 23, 2021



Transcript

  1. Oracle Machine Learning Office Hours Machine Learning 102 – Feature

    Extraction with Marcos Arancibia and Mark Hornick Product Management, Oracle Machine Learning February 2021
  2. Our mission is to help people see data in new

    ways, discover insights, unlock endless possibilities.
  3. Today’s Agenda 1. Upcoming session 2. Machine Learning

    102: Feature Extraction – Marcos Arancibia 3. Q&A Copyright © 2021, Oracle and/or its affiliates
  4. • Join us February 23, 2021: Oracle Machine Learning Office

    Hours • Hands-On Lab using Oracle Machine Learning for Python on Autonomous Database • In this Hands-on Lab, join us to experience Oracle Machine Learning for Python (OML4Py) on Oracle Autonomous Database. • OML4Py supports scalable, in-database data exploration and preparation using native Python syntax, invocation of in-database algorithms for model building and scoring, and embedded execution of user-defined Python functions from Python or REST APIs. • OML4Py also includes the AutoML interface for automated algorithm selection, feature selection, and hyperparameter tuning. • Sign up for this tour of OML4Py, and we will distribute credentials for you to do the live exercises using the environment during the session. Next Session
  5. Today’s Session: Machine Learning 102 – Feature Extraction This eighth

    session in the series will cover Feature Extraction 102, where we will focus on Explicit Semantic Analysis, its benefits over LDA, and its capabilities when working with a large text corpus like Wikipedia. We will see demos on a built-in custom text corpus as well as Wikipedia, using SQL and Python with OML4Py. We will continue to make use of the Oracle Machine Learning Notebooks and the OML4SQL and OML4Py interfaces for the Autonomous Database.
  6. What is Feature Extraction? "Feature extraction involves reducing the number

    of resources required to describe a large set of data" Wikipedia: https://en.wikipedia.org/wiki/Feature_extraction The term "Feature Extraction" is used to denote several different methods that try to extract as much "information" as possible from a set of data by using a combination of the original variables/columns. In general, we can consider Feature Extraction in machine learning to be part of the pre-processing/data preparation cycle, which goes back and forth with the modeling stage. Feature Extraction - Introduction [Diagram: CRISP-DM, the cross-industry standard process for data mining — Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment] From the ML 101 Session
  7. Part of the capabilities of Feature Extraction tools Feature Selection,

    also known as variable selection or attribute selection (a form of dimensionality reduction), is the process of selecting a subset of relevant features (variables, predictors, columns) for use in machine learning model construction. Basic benefits of reducing the number of features are: • Simplification of models to the core relevant features • Faster training and scoring • Potentially reduced variance, helping avoid overfitting (and the curse of dimensionality) Several supervised machine learning algorithms can do a "natural" selection of the best attributes via a "weight" given to the features. Other methods can do an unsupervised selection of features by looking at the natural dispersion and trying to select features that capture most of the information (variability) of the entire dataset with as few features as possible. Feature Selection From the ML 101 Session
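The unsupervised idea on this slide — keep the features that carry the most dispersion — can be sketched in a few lines of plain Python (an illustrative toy, not the OML4Py API):

```python
# Unsupervised feature selection sketch: keep the k columns with the
# highest variance, since low-variance columns carry little information.
def variance_select(rows, k):
    """Return indices of the k highest-variance columns of a row-major dataset."""
    n = len(rows)
    n_cols = len(rows[0])
    variances = []
    for j in range(n_cols):
        col = [r[j] for r in rows]
        mean = sum(col) / n
        variances.append(sum((x - mean) ** 2 for x in col) / n)
    # Rank columns by variance, descending, and keep the top k
    return sorted(range(n_cols), key=lambda j: -variances[j])[:k]

data = [
    [1.0, 5.0, 0.1],
    [2.0, 5.1, 0.1],
    [3.0, 4.9, 0.1],
    [4.0, 5.0, 0.1],
]
print(variance_select(data, 1))  # column 0 varies most; column 2 is constant
```

Real pipelines would standardize columns first so that scale differences do not masquerade as information.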
  8. Algorithms Some of the methods for Feature Extraction include: -

    Attribute Importance using Minimum Description Length - Feature Extraction methods that use a transformation/translation/rotation of the original attribute axes, or a decomposition of the original variables into a set of matrices, like: - Principal Component Analysis - Singular Value Decomposition - Non-Negative Matrix Factorization - CUR Matrix Decomposition - Explicit Semantic Analysis, for NLP and information retrieval Using transformations, or simply excluding variables/columns with a weaker relationship with the target, is helpful when building predictive models with machine learning, and because good data preparation is usually 90% of the work, Feature Extraction can be a key element in building a better model. Feature Extraction From the ML 101 Session
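To make the "rotation of the original attribute axes" idea concrete, here is a minimal sketch of the mechanism behind Principal Component Analysis — finding the direction of greatest variance via power iteration on the covariance matrix. This is plain illustrative Python, not how the in-database algorithms are implemented:

```python
# Toy PCA mechanism: power iteration on the covariance matrix converges to
# the eigenvector with the largest eigenvalue -- the first principal component.
def first_principal_component(rows, iters=200):
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Covariance matrix C = X^T X / n on the centered data
    cov = [[sum(centered[i][a] * centered[i][b] for i in range(n)) / n
            for b in range(d)] for a in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v  # unit vector along the direction of greatest variance

# Points lying roughly along the line y = 2x
data = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]
pc = first_principal_component(data)
```

Projecting each row onto `pc` would compress the two original columns into a single extracted feature while preserving most of the variability.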
  9. A more interpretable model than LDA (Latent Dirichlet Allocation) In

    NLP and information retrieval, ESA is a vectorial representation of text (individual words or entire documents) that uses a document corpus as a knowledge base • A word is represented as a column vector in the TF-IDF (Term Frequency–Inverse Document Frequency) matrix of the text corpus • A document (string of words) is represented as the centroid of the vectors representing its words The text corpus is often English Wikipedia, though other corpora can be used Designed to improve text categorization • Computes "semantic relatedness" using cosine similarity between the aforementioned vectors, collectively interpreted as a space of "concepts explicitly defined and described by humans" • Wikipedia articles are equated with concepts Usual objectives: • Calculate semantic similarity between text documents or between mixed data • Explicit topic modeling for text Explicit Semantic Analysis (ESA)
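The three bullets above — TF-IDF word vectors over concept documents, documents as centroids of word vectors, and cosine similarity for relatedness — can be shown end to end on a deliberately tiny, hypothetical two-concept corpus (this sketches the idea, not the OML ESA implementation):

```python
import math

# Hypothetical mini knowledge base: one document per concept.
concepts = {
    "astronomy": "star planet telescope orbit star",
    "cooking":   "recipe oven flour oven dish",
}
docs = {c: text.split() for c, text in concepts.items()}
vocab = sorted({w for words in docs.values() for w in words})
n_docs = len(docs)

def idf(word):
    df = sum(word in words for words in docs.values())
    return math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF

# Each word becomes a vector of TF-IDF weights, one component per concept
word_vec = {w: [docs[c].count(w) / len(docs[c]) * idf(w) for c in concepts]
            for w in vocab}

def text_vector(text):
    """A text is the centroid of the vectors of its (known) words."""
    words = [w for w in text.split() if w in word_vec]
    return [sum(word_vec[w][k] for w in words) / len(words)
            for k in range(n_docs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

v1 = text_vector("telescope orbit")
v2 = text_vector("star planet")
v3 = text_vector("oven recipe")
# cosine(v1, v2) is high (both map to "astronomy"); cosine(v1, v3) is near 0
```

With a real corpus the components of these vectors are the "explicitly defined" concepts — here the two components literally are "astronomy" and "cooking", which is what makes ESA readable.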
  10. ESA is more interpretable than LDA Topics discovered by LDA

    are latent, meaning difficult to interpret • Topics are defined by their keywords, i.e., they have no names and no abstract descriptions • To give meaning to topics, keywords can be extracted by LDA • Definitions based solely on keywords are fuzzy, and keywords for different topics usually overlap • Extracted keywords can be just generic words • The set of automatically extracted keywords for a topic does not map to a convenient English topic name Biggest problem with LDA: the set of topics is fluid • The topic set changes with any change to the training data • Any modification of the training data changes topic boundaries • → topics cannot be mapped to an existing knowledge base or to topics understood by humans if the training data is not static • Training data is almost never static ESA discovers topics from a given set of topics in a knowledge base • Topics are defined by humans → topics are well understood • The topic set of interest can be selected and augmented if necessary → full control over the selection of topics • The set of topics can be geared toward a specific task, e.g., a knowledge base for topic modeling of online messages possibly related to terrorist activities, which is different from one for topic modeling of technical reports from academia • Can combine multiple knowledge bases, each with its own topic set, which may or may not overlap • Topic overlap does not affect ESA's capability to detect relevant topics ESA vs. LDA (Latent Dirichlet Allocation)
  11. A deeper view The ESA model is basically an inverted

    index that maps words to relevant concepts of the knowledge base. This inverted index also incorporates weights reflecting the strength of association between words and concepts. ESA does not project the original feature space and does not reduce its dimensionality, except for filtering out features with uninformative text. There exist vast amounts of knowledge represented as text. Textual knowledge bases are normally collections of common or domain-specific articles, where every article defines one concept. Such textual knowledge bases usually serve as sources for ESA models, and Wikipedia is particularly good as the source for a general-purpose ESA model because it is a comprehensive knowledge base. Users can also develop and use their own custom, domain-specific ESA models, e.g., for medical, homeland security, or research & development applications. Explicit Semantic Analysis (ESA)
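The "inverted index with association weights" described above is just a word → {concept: weight} mapping. A toy sketch (hypothetical words, concepts, and weights — not the actual OML ESA model contents):

```python
# Toy inverted index: each word maps to the knowledge-base concepts it is
# associated with, together with an association strength.
inverted_index = {
    "telescope": {"Astronomy": 0.92, "Optics": 0.71},
    "orbit":     {"Astronomy": 0.88, "Spaceflight": 0.64},
    "flour":     {"Baking": 0.95},
}

def relevant_concepts(text):
    """Accumulate concept weights over all known words in the text."""
    scores = {}
    for word in text.lower().split():
        for concept, weight in inverted_index.get(word, {}).items():
            scores[concept] = scores.get(concept, 0.0) + weight
    # Highest-scoring concepts first
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(relevant_concepts("telescope orbit"))
```

Scoring a text is then a series of index lookups and additions, which is why ESA stays cheap even with hundreds of thousands of concepts.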
  12. (shown only for general interest – many people/companies use their

    own custom processing) Load Wikipedia dumps Wikipedia dumps are compressed XML files. Individual pages are tagged as <page>. The contents of the pages are tagged as <text>. Contents inside <text> contain plenty of Wikipedia-specific information that is not visible, and various brackets are present. Page Filtering To collect the pages that describe concepts and more general knowledge about various subjects, there is a lot of parsing and stripping of HTML tags from pages, partial tokenization, special-character removal, dropping of words with special characters or numbers, and more. The outcome of Wikipedia page processing is tab-separated files. Category & Article DocStore from Oracle Labs is used to remove non-usable information and to split the Wikipedia XML dumps into individual entities, including article and category pages (ignoring other types of pages). The outcome of DocStore processing is text with HTML tags. ESA Model Build We calculate the number of incoming links for every page using cross-page links. The ESA model is reduced to retain the pages that are more general and describe concepts, filtering out sections such as References, References and links, Sources, Further reading, etc. The final ESA model is built with a limit of 200,000 Features and 1,000 Top Features retained, resulting in some 27 million records and 800 MB in size (current version). Steps used by the Oracle Team (internally) to Process the Wikipedia data [Pipeline diagram: XML → Article pages / Category pages → TSV pages → TSV pages + cross-links → TSV pages by category → OML in-DB ESA Wiki Model]
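The very first step — pulling `<page>` titles and `<text>` bodies out of the dump XML — can be sketched with the standard library. The XML below is a toy stand-in; real dumps are huge, compressed, and namespaced, which is why streaming with `iterparse` (rather than loading the whole tree) matters:

```python
import xml.etree.ElementTree as ET
from io import StringIO

# Toy Wikipedia-style dump fragment (real dumps use a namespaced schema).
dump = StringIO("""<mediawiki>
  <page><title>Star</title><revision><text>A star is ...</text></revision></page>
  <page><title>Oven</title><revision><text>An oven is ...</text></revision></page>
</mediawiki>""")

pages = []
# Stream the document; act each time a <page> element is fully parsed.
for event, elem in ET.iterparse(dump, events=("end",)):
    if elem.tag == "page":
        title = elem.find("title").text
        text = elem.find("./revision/text").text
        pages.append((title, text))
        elem.clear()  # release the subtree so memory stays bounded
```

The filtering, tokenization, and TSV-writing steps on the slide would then operate on each `(title, text)` pair as it streams past.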
  13. Where to download the OML ESA Wikipedia Model Today:

    oss.oracle.com/machine-learning In the future, a section at: oracle.com/machine-learning
  14. OML Services extends OML functionality to support model deployment and

    model lifecycle management for both in-database OML models and third-party Open Neural Network Exchange (ONNX) machine learning models via REST APIs. The REST API for Oracle Machine Learning Services on Oracle Autonomous Database provides: • Endpoints that enable storing machine learning models along with their metadata • Creation of scoring endpoints for the registered models • Support for classification and regression with third-party ONNX models, including those from packages like Scikit-learn and TensorFlow, among several others • Proprietary cognitive text capabilities in English, French, and Spanish for topic discovery, keywords, summary, sentiment, and feature extraction, based on a Wikipedia knowledge base using embeddings • Cognitive image functionality, supported through the ONNX-format third-party model deployment feature, with the ability to score using images or tensors Oracle Machine Learning Services overview
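The interaction pattern is the usual token-then-call REST flow: exchange database credentials for a token, then send the token as a Bearer header with each request. The sketch below only builds the requests (it never sends them), and the host and endpoint paths are placeholders — consult the OML Services REST API documentation for the real URLs and payload schemas:

```python
import json
import urllib.request

BASE = "https://example.adb.oraclecloud.com"  # placeholder host

def token_request(username, password):
    """Build (not send) the request that exchanges credentials for a token."""
    body = json.dumps({"grant_type": "password",
                       "username": username, "password": password}).encode()
    return urllib.request.Request(
        BASE + "/omlusers/api/oauth2/v1/token",  # placeholder path
        data=body, headers={"Content-Type": "application/json"},
        method="POST")

def scoring_request(token, model_uri, payload):
    """Build (not send) an authenticated call against a model endpoint."""
    return urllib.request.Request(
        BASE + "/omlmod/v1/deployment/" + model_uri + "/score",  # placeholder
        data=json.dumps(payload).encode(),
        headers={"Authorization": "Bearer " + token,
                 "Content-Type": "application/json"},
        method="POST")

# Hypothetical model URI and payload, for illustration only
req = scoring_request("abc123", "esa_wiki", {"textList": ["some text"]})
```

In practice the token expires and must be refreshed, and responses come back as JSON to be decoded with `json.loads`.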
  15. Connectivity and use from Client Oracle Machine Learning Services

    architecture [Architecture diagram: a REST client sends user/pass to GET a token, then sends the token with actions and text/objects via GET/POST/DELETE to Oracle Autonomous Database — /omlusers (PDB) and /omlmod (OML Services)]
  16. Oracle Machine Learning Services - Methods Components with built-in

    Oracle Machine Learning Admin • Token using ADB user and password Generic • Metadata for all Versions: Version 1 Metadata • Open API Specification Deployment • Create Model Endpoint • Score Model using Endpoint • Endpoints • Endpoint Details • Open API Specification for Endpoint • Endpoint Repository • Store Model • Update Model Namespace • Models list • Model Info • Model Metadata • Model Content • Model Cognitive Text • Get Most Relevant Topics • Get Most Relevant Keywords • Get Summaries • Get Sentiments • Get Semantic Similarities • Numeric Features • Get Endpoints
  17. Demos: - Feature Extraction 102 on OML Notebooks Using

    ESA on OML4SQL and OML4Python - OML Services Cognitive Text via REST Using Postman
  18. On our GitHub, you can find: github.com/oracle/oracle-db-examples/tree/master/machine-learning •

    Example notebooks in OML4SQL and OML4Python, including Feature Extraction • SQL code examples for ESA for DB 18c, 19c, and 21c • Labs folder with OML4Py HOL labs • OML Services demos, including Cognitive Text demos, in Postman collections
  19. Thank You Marcos Arancibia | [email protected] Mark Hornick | [email protected]

    Oracle Machine Learning Product Management