
OML feature highlight: New OML Notebook templates for ML Feature Extraction with Text (ESA)

In this Office Hours session for Oracle Machine Learning on Autonomous Database, we introduced the latest Notebook templates for machine learning Feature Extraction on text, using Explicit Semantic Analysis (ESA). This was a follow-along session: the OML Notebook templates are available in any Autonomous Database tenancy, so you could run them while we demonstrated.

The Oracle Machine Learning product family helps data scientists, analysts, developers, and IT achieve data science project goals faster while taking full advantage of the Oracle platform.

Oracle Machine Learning Notebooks offer an easy-to-use, interactive, multi-user, collaborative interface based on Apache Zeppelin notebook technology, and support the SQL, PL/SQL, Python, and Markdown interpreters. They are available on all Autonomous Database versions and tiers, including the Always Free editions.

OML includes AutoML, which automates algorithm selection, feature selection, and model tuning, in addition to a specialized AutoML UI exclusive to Autonomous Database.

OML Services, which is also included with Autonomous Database, provides a REST interface for model deployment and management. OML Services supports in-database models as well as ONNX-format models (for classification, regression, and clustering) built using third-party engines. OML Services also supports cognitive text analytics in English, French, Spanish, and Italian.

Marcos Arancibia

March 15, 2022

Transcript

  1. OML feature highlight: New OML Notebook templates for ML Feature

    Extraction for text OML Office Hours Marcos Arancibia, Senior Principal Product Manager, Machine Learning Supported by Mark Hornick and Sherry LaMonica Move the Algorithms; Not the Data! Copyright © 2022, Oracle and/or its affiliates. This Session will be Recorded
  2. • Upcoming Sessions • Follow-along OML Notebooks ML Feature Extraction

    demos • Q&A Topics for today
  3. We will begin a Series of Follow-along reviews of the

    Example Template notebooks, with one subject per week. These will be hands-on if you have access to any Autonomous Database, even the Always Free one. • Classification - done • Regression - done • Clustering • Feature Extraction I – Dimensionality Reduction • Feature Extraction II - Explicit Semantic Analysis • Time Series 29 March 2022 - Connecting to Autonomous Database using the new OML4Py Universal Client Upcoming Sessions
  4. Algorithms Some of the methods for Feature Extraction include: -

    Attribute importance using Minimum Description Length - Feature Extraction methods that use a transformation/translation/rotation of the original attribute axes, or a decomposition of the original variables into a set of matrices, like: - (PCA) Principal Component Analysis, - (SVD) Singular Value Decomposition, - (NMF) Non-Negative Matrix Factorization, - (EM) Expectation-Maximization, - CUR Matrix Decomposition, - Explicit Semantic Analysis for NLP and information retrieval. Using transformations, or simply excluding variables/columns with a weaker relationship to the target, is helpful when building predictive models with machine learning, and because good data preparation is usually 90% of the work, Feature Extraction can be a key element in building a better model. Feature Extraction
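The matrix-decomposition idea behind several of the methods listed above (PCA, SVD) can be sketched in a few lines of NumPy. This is a toy illustration on random data, not the in-database OML implementation: PCA is computed via the SVD of the centered data matrix, and the data is projected onto the top two principal components.

```python
import numpy as np

# Toy data: 100 rows, 5 columns, with one deliberately redundant column.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 3] = X[:, 0] * 2.0            # column 3 carries no new information

Xc = X - X.mean(axis=0)            # center each column
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2
X_reduced = Xc @ Vt[:k].T          # project onto the top-k principal components
explained = (s**2)[:k].sum() / (s**2).sum()

print(X_reduced.shape)             # (100, 2)
print(f"variance explained by 2 components: {explained:.0%}")
```

Because one column duplicates another, two components already capture most of the variance, which is exactly the kind of redundancy feature extraction removes before modeling.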
  5. A more interpretable model than LDA (Latent Dirichlet Allocation) In

    NLP and information retrieval, ESA is a vectorial representation of text (individual words or entire documents) that uses a document corpus as a knowledge base • A word is represented as a column vector in the TF-IDF (Term Frequency–Inverse Document Frequency) matrix of the text corpus • A document (string of words) is represented as the centroid of the vectors representing its words The text corpus is often English Wikipedia, though other corpora can be used Designed to improve text categorization • Computes "semantic relatedness" using cosine similarity between the aforementioned vectors, collectively interpreted as a space of "concepts explicitly defined and described by humans" • Wikipedia articles are equated with concepts Usual objectives: • Calculate semantic similarity between text documents or between mixed data • Explicit topic modeling for text Explicit Semantic Analysis (ESA)
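The word and document representation described on this slide can be sketched in plain Python. The snippet below is a toy illustration, not the OML in-database ESA implementation: a three-article "knowledge base" plays the role of Wikipedia, a word is a TF-IDF vector over the concepts, a document is the centroid of its words' vectors, and relatedness is cosine similarity.

```python
import math
from collections import Counter

# Toy knowledge base: each "article" defines one concept.
corpus = {
    "astronomy": "star planet orbit telescope star galaxy",
    "cooking":   "recipe oven flour bake oven dish",
    "sports":    "ball team score match team player",
}
concepts = list(corpus)
docs = {c: Counter(text.split()) for c, text in corpus.items()}
N = len(concepts)
df = Counter(w for d in docs.values() for w in d)   # document frequency

def tfidf(word, concept):
    return docs[concept][word] * math.log((1 + N) / (1 + df[word]))

def word_vec(word):
    # A word is a vector over concepts (a column of the TF-IDF matrix).
    return [tfidf(word, c) for c in concepts]

def doc_vec(text):
    # A document is the centroid of its known words' concept vectors.
    vs = [word_vec(w) for w in text.split() if w in df]
    return [sum(col) / len(vs) for col in zip(*vs)]

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

v1 = doc_vec("telescope orbit planet")
v2 = doc_vec("bake flour recipe")
print(cosine(v1, doc_vec("star galaxy")))   # high: both map to astronomy
print(cosine(v2, doc_vec("star galaxy")))   # low: cooking vs. astronomy
```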
  6. ESA is more interpretable than LDA Topics discovered by LDA

    are latent, meaning difficult to interpret • Topics are defined by their keywords, i.e., they have no names, no abstract descriptions • To give meaning to topics, keywords can be extracted by LDA • Definitions solely based on keywords are fuzzy, and keywords for different topics usually overlap • Extracted keywords can be just generic words • Set of automatically extracted keywords for a topic does not map to a convenient English topic name Biggest problem with LDA: set of topics is fluid • Topic set changes with any changes to the training data • Any modification of training data changes topic boundaries • → topics cannot be mapped to an existing knowledge base or topics understood by humans if training data is not static • Training data is almost never static ESA discovers topics from a given set of topics in a knowledge base • Topics are defined by humans → topics are well understood • Topic set of interest can be selected and augmented if necessary → full control of the selection of topics • Set of topics can be geared toward a specific task, e.g., a knowledge base for topic modeling of online messages possibly related to terrorist activities, which differs from one for topic modeling of technical reports from academia • Can combine multiple knowledge bases, each with its own topic set, which may or may not overlap • Topic overlapping does not affect ESA's capability to detect relevant topics ESA vs. LDA (Latent Dirichlet Allocation)
  7. A deeper view The ESA model is basically an inverted

    index that maps words to relevant concepts of the knowledge base. This inverted index also incorporates weights reflecting the strength of association between words and concepts. ESA does not project the original feature space and does not reduce its dimensionality except for filtering out features with uninformative text. There exist vast amounts of knowledge represented as text. Textual knowledge bases are normally collections of common or domain-specific articles, and every article defines one concept. Textual knowledge bases such as Wikipedia usually serve as sources for ESA models. Wikipedia is particularly good as a source for a general-purpose ESA model because it is a comprehensive knowledge base. Users can develop and use their own custom, domain-specific ESA models, e.g., medical, homeland security, research & development, etc. Explicit Semantic Analysis (ESA)
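A minimal sketch of such a weighted inverted index, assuming simple TF-IDF weights and a toy three-article knowledge base (illustrative only, not Oracle's implementation): each word maps to the concepts it is associated with, and words whose weight is zero everywhere (i.e., uninformative words appearing in every article) are filtered out.

```python
import math
from collections import Counter

# Toy knowledge base: each article defines one concept.
articles = {
    "Volcano": "the lava eruption magma ash mountain",
    "Banking": "the loan interest deposit account branch",
    "Geology": "the rock magma mineral mountain strata",
}
N = len(articles)
tf = {c: Counter(t.split()) for c, t in articles.items()}
df = Counter(w for counts in tf.values() for w in counts)

# Build the inverted index: word -> {concept: weight}.
inverted = {}
for concept, counts in tf.items():
    for w, n in counts.items():
        weight = n * math.log(N / df[w])   # 0 when the word is in every article
        if weight > 0:
            inverted.setdefault(w, {})[concept] = round(weight, 3)

print(inverted["magma"])      # associated with both Volcano and Geology
print("the" in inverted)      # False: uninformative word filtered out
```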
  8. (shown only for general interest – many people/companies use their

    own custom processing) Load Wikipedia dumps Wikipedia dumps are compressed XML files. Individual pages are tagged as <page>. The content of each page is tagged as <text>. Content inside <text> contains plenty of Wikipedia-specific information that is not visible, and various brackets are present. Page Filtering To collect the pages that describe concepts and more general knowledge about various subjects, there is a lot of parsing and stripping of HTML tags from pages, partial tokenization, special-character removal, dropping of words with special characters or numbers, and more. The outcome of Wikipedia page processing is tab-separated files. Category & Article DocStore from Oracle Labs is used to remove non-usable information and to split the Wikipedia XML dumps into individual entities, including article and category pages (ignoring other types of pages). The outcome of DocStore processing is text with HTML tags. ESA Model Build We calculate the number of incoming links for every page using cross-page links. The ESA model is reduced to retain the pages that are more general and describe concepts, filtering out References, References and links, Sources, Further reading, etc. The final ESA model is built with a limit of around 161,000 Topics and 1,000 Top Features retained, resulting in some 27 million records and 800 MB in size (current version). Steps used by the Oracle Team (internally) to Process the Wikipedia data [Pipeline diagram: XML article and category pages → TSV pages → TSV pages + cross-links → TSV pages by category → OML in-DB ESA Wiki Model]
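The first step of this pipeline — pulling page titles and `<text>` bodies out of the dump and dropping non-concept pages — can be illustrated with Python's standard `xml.etree` module on a tiny inline sample. This is a sketch only: real dumps are multi-gigabyte, namespaced XML processed with streaming parsers and the DocStore tooling described above.

```python
import xml.etree.ElementTree as ET

# Tiny inline stand-in for a Wikipedia XML dump (real dumps are huge,
# compressed, and use a namespaced schema).
dump = """<mediawiki>
  <page><title>Neural network</title>
    <revision><text>A neural network is a model ...</text></revision>
  </page>
  <page><title>Redirect stub</title>
    <revision><text>#REDIRECT [[Neural network]]</text></revision>
  </page>
</mediawiki>"""

pages = {}
for page in ET.fromstring(dump).iter("page"):
    title = page.findtext("title")
    text = page.findtext("revision/text") or ""
    if text.startswith("#REDIRECT"):   # drop pages that define no concept
        continue
    pages[title] = text

print(list(pages))   # only the concept-defining page survives
```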
  9. Where to download the OML ESA Wikipedia Model

    Today: oss.oracle.com/machine-learning. In the future, a section at: oracle.com/machine-learning
  10. Two options to access the Template Examples
  11. There are three Feature Extraction demos specific to ESA, including "Feature Extraction using ESA" and "Feature Extraction using ESA Wiki Model". Type "esa" in the search box to find them.
  12. Click Create Notebook to create a copy for yourself

    Click on "Create Notebook", give it a name (or accept the default), and click OK.
  13. The new notebook will show up in the notebooks listing.