
OML feature highlight: New OML Notebook templates for ML Feature Extraction with Text (ESA)


In this Office Hours session for Oracle Machine Learning on Autonomous Database, we introduced the latest notebook templates for machine learning feature extraction on text, using Explicit Semantic Analysis (ESA). This was a follow-along session: the OML Notebook templates are available in any Autonomous Database tenancy, so you could run them while we demonstrated.

The Oracle Machine Learning product family supports data scientists, analysts, developers, and IT in achieving data science project goals faster while taking full advantage of the Oracle platform.

Oracle Machine Learning Notebooks offer an easy-to-use, interactive, multi-user, collaborative interface based on Apache Zeppelin notebook technology, and support SQL, PL/SQL, Python, and Markdown interpreters. They are available on all Autonomous Database versions and tiers, including the Always Free editions.

OML includes AutoML, which provides automated machine learning features for algorithm selection, feature selection, and model tuning, in addition to a specialized AutoML UI exclusive to Autonomous Database.

OML Services, which is also included with Autonomous Database, provides a REST interface for model deployment and management. OML Services supports in-database models as well as ONNX-format models (for classification, regression, and clustering) built using third-party engines. OML Services also supports cognitive text analytics in English, French, Spanish, and Italian.

Marcos Arancibia

March 15, 2022
Transcript

  1. OML feature highlight: New OML Notebook
    templates for ML Feature Extraction for text
    OML Office Hours
    Marcos Arancibia, Senior Principal Product Manager, Machine Learning
    Supported by Mark Hornick and Sherry LaMonica
    Move the Algorithms; Not the Data!
    Copyright © 2022, Oracle and/or its affiliates.
    This session will be recorded.


  2. • Upcoming Sessions
    • Follow-along OML Notebooks ML Feature Extraction demos
    • Q&A
    Topics for today


  3. We will begin a series of follow-along reviews of the example template notebooks, one subject per week. These will be hands-on if you have access to any Autonomous Database, even the Always Free one.
    • Classification - done
    • Regression - done
    • Clustering
    • Feature Extraction I – Dimensionality Reduction
    • Feature Extraction II - Explicit Semantic Analysis
    • Time Series
    29 March 2022 - Connecting to Autonomous Database using the new OML4Py Universal Client
    Upcoming Sessions


  4. Algorithms
    Some of the methods for feature extraction include:
    - Attribute importance using Minimum Description Length (MDL)
    - Feature extraction methods that apply a transformation/translation/rotation of the original attribute axes, or decompose the original variables into a set of matrices, such as:
      - Principal Component Analysis (PCA)
      - Singular Value Decomposition (SVD)
      - Non-Negative Matrix Factorization (NMF)
      - Expectation-Maximization (EM)
      - CUR matrix decomposition
      - Explicit Semantic Analysis (ESA) for NLP and information retrieval
    Using such transformations, or simply excluding variables/columns with a weaker relationship to the target, is helpful when building predictive models with machine learning. Because good data preparation is usually 90% of the work, feature extraction can be a key element in building a better model.
    Feature Extraction
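As a rough illustration of the decomposition-based methods listed above, here is a minimal PCA-via-SVD sketch in plain NumPy. This is a generic example with made-up data, not OML code; in OML the equivalent work is done in-database by the PCA/SVD/NMF algorithms, so the data never leaves Autonomous Database.

```python
import numpy as np

def extract_features(X, k):
    """Project the rows of X onto the top-k principal components via SVD."""
    Xc = X - X.mean(axis=0)                       # center each column
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                          # scores in the reduced space

# Four samples with two correlated attributes, reduced to one feature each.
X = np.array([[2.0, 0.1], [4.0, 0.2], [6.0, 0.3], [8.0, 0.4]])
Z = extract_features(X, 1)                        # shape (4, 1)
```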


  5. A more interpretable model than LDA (Latent Dirichlet Allocation)
    In NLP and information retrieval, ESA is a vectorial representation of text (individual words or entire documents) that uses a document corpus as a knowledge base.
    • A word is represented as a column vector in the TF-IDF (Term Frequency–Inverse Document Frequency) matrix of the text corpus
    • A document (string of words) is represented as the centroid of the vectors representing its words
    The text corpus is often the English Wikipedia, though other corpora can be used.
    Designed to improve text categorization:
    • Computes "semantic relatedness" using cosine similarity between the vectors above, collectively interpreted as a space of "concepts explicitly defined and described by humans"
    • Wikipedia articles are equated with concepts
    Usual objectives:
    • Calculate semantic similarity between text documents or between mixed data
    • Explicit topic modeling for text
    Explicit Semantic Analysis (ESA)
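The bullet points above can be made concrete with a toy version of the ESA idea: a TF-IDF-style word-by-concept matrix built from a tiny hand-made "knowledge base" (two hypothetical concept articles), documents represented as the centroids of their word vectors, and cosine similarity as semantic relatedness. This is a sketch of the concept, not the OML implementation; the corpus and weighting scheme are made up:

```python
import numpy as np

# Toy knowledge base: each article defines one concept (made-up text).
corpus = {
    "Astronomy": "star planet orbit telescope star",
    "Cooking":   "recipe oven pan recipe flavor",
}
concepts = list(corpus)
vocab = sorted({w for doc in corpus.values() for w in doc.split()})

# Word-by-concept matrix with TF-IDF-style weights:
# each word is a vector over the concepts of the knowledge base.
tf = np.array([[corpus[c].split().count(w) for c in concepts] for w in vocab],
              dtype=float)
df = (tf > 0).sum(axis=1)                              # document frequency
W = tf * (np.log(len(concepts) / df) + 1.0)[:, None]   # smoothed IDF

def esa_vector(text):
    """A document is the centroid of the concept vectors of its words."""
    rows = [W[vocab.index(w)] for w in text.split() if w in vocab]
    return np.mean(rows, axis=0)

def relatedness(a, b):
    """Semantic relatedness = cosine similarity of the ESA vectors."""
    va, vb = esa_vector(a), esa_vector(b)
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
```

Texts about the same concept score higher than unrelated texts, e.g. `relatedness("star telescope", "planet orbit")` exceeds `relatedness("star telescope", "recipe oven")`.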


  6. ESA is more interpretable than LDA
    Topics discovered by LDA are latent, meaning difficult to interpret:
    • Topics are defined by their keywords, i.e., they have no names and no abstract descriptions
    • To give meaning to topics, keywords can be extracted by LDA
    • Definitions based solely on keywords are fuzzy, and keywords for different topics usually overlap
    • Extracted keywords can be just generic words
    • The set of automatically extracted keywords for a topic does not map to a convenient English topic name
    Biggest problem with LDA: the set of topics is fluid:
    • The topic set changes with any change to the training data
    • Any modification of the training data changes topic boundaries
    • → topics cannot be mapped to an existing knowledge base or to topics understood by humans if the training data is not static
    • Training data is almost never static
    ESA discovers topics from a given set of topics in a knowledge base:
    • Topics are defined by humans → topics are well understood
    • The topic set of interest can be selected and augmented if necessary → full control over the selection of topics
    • The set of topics can be geared toward a specific task, e.g., a knowledge base for topic modeling of online messages possibly related to terrorist activities is different from one for topic modeling of technical reports from academia
    • Multiple knowledge bases can be combined, each with its own topic set, which may or may not overlap
    • Topic overlap does not affect ESA's ability to detect relevant topics
    ESA vs. LDA (Latent Dirichlet Allocation)


  7. A deeper view
    The ESA model is basically an inverted index that maps words to relevant concepts of the knowledge base. This inverted index also incorporates weights reflecting the strength of association between words and concepts. ESA does not project the original feature space and does not reduce its dimensionality, except for filtering out features with uninformative text.
    Vast amounts of knowledge are represented as text. Textual knowledge bases are normally collections of common or domain-specific articles, where every article defines one concept. Textual knowledge bases such as Wikipedia usually serve as sources for ESA models.
    Wikipedia is particularly good as a source for a general-purpose ESA model because it is a comprehensive knowledge base. Users can also develop and use their own custom, domain-specific ESA models, e.g., for medicine, homeland security, or research & development.
    Explicit Semantic Analysis (ESA)
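A minimal sketch of such an inverted index, with plain term counts standing in for the association weights and a made-up two-article knowledge base (a real ESA model stores TF-IDF-style weights over a far larger corpus):

```python
from collections import defaultdict

# Toy knowledge base: each article is one concept (hypothetical content).
articles = {
    "Jazz": ["saxophone", "improvisation", "swing"],
    "Golf": ["club", "swing", "putt"],
}

# Inverted index: word -> {concept: weight}. Simple counts are used
# here for illustration in place of tf-idf association weights.
index = defaultdict(dict)
for concept, words in articles.items():
    for w in words:
        index[w][concept] = index[w].get(concept, 0) + 1

def concepts_for(text):
    """Sum the weights per concept over the words of a query text."""
    scores = defaultdict(float)
    for w in text.split():
        for concept, weight in index.get(w, {}).items():
            scores[concept] += weight
    return dict(scores)
```

Note that an ambiguous word like "swing" maps to both concepts; as the slide above points out, such topic overlap does not prevent the relevant concepts from being detected.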


  8. (shown only for general interest – many people/companies use their own custom processing)
    Load Wikipedia dumps
    Wikipedia dumps are compressed XML files. Individual pages are tagged as <page>, and the contents of each page are tagged as <text>. The contents inside <text> contain plenty of Wikipedia-specific markup that is not visible in the rendered page, and various brackets are present.
    Page Filtering
    To collect the pages that describe concepts and more general knowledge about various subjects, there is a lot of parsing and stripping of HTML tags from pages, partial tokenization, removal of special characters, dropping of words with special characters or numbers, and more. The outcome of Wikipedia page processing is a set of tab-separated files.
    Category & Article
    DocStore from Oracle Labs is used to remove non-usable information and to split the Wikipedia XML dumps into individual entities, including article and category pages (ignoring other types of pages). The outcome of DocStore processing is text with HTML tags.
    ESA Model Build
    We calculate the number of incoming links for every page using cross-page links. The ESA model is reduced to retain the pages that are more general and describe concepts, filtering out sections such as "References", "References and links", "Sources", "Further reading", etc. The final ESA model is built with a limit of around 161,000 topics and 1,000 top features retained, resulting in some 27 million records and about 800 MB in size (current version).
    Steps used by the Oracle team (internally) to process the Wikipedia data
    Pipeline diagram: XML dump → Article pages / Category pages → TSV pages → TSV pages x-links → TSV pages by category → OML in-DB ESA Wiki Model
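The first steps above (streaming the XML dump and filtering page types) can be sketched with Python's incremental XML parser. The snippet below runs on a tiny made-up stand-in for a dump; real dumps are namespaced, compressed, and far larger, so this is illustrative only and is not the Oracle team's actual tooling:

```python
import xml.etree.ElementTree as ET
from io import StringIO

# Tiny stand-in for a Wikipedia dump: pages are tagged <page> and the
# wikitext is tagged <text> (real dumps add an XML namespace and many fields).
dump = StringIO("""<mediawiki>
  <page><title>Astronomy</title><text>Astronomy is the study of stars.</text></page>
  <page><title>Category:Science</title><text>category page</text></page>
</mediawiki>""")

def iter_article_pages(source):
    """Stream <page> elements, keeping articles and skipping category pages."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "page":
            title = elem.findtext("title")
            if not title.startswith("Category:"):
                yield title, elem.findtext("text")
            elem.clear()  # free memory; real dumps are tens of GB uncompressed

pages = list(iter_article_pages(dump))
# → [("Astronomy", "Astronomy is the study of stars.")]
```

Streaming with `iterparse` plus `elem.clear()` keeps memory flat regardless of dump size, which is why it is the usual pattern for files this large.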


  9.
    Where to download the OML ESA Wikipedia Model
    Today:
    oss.oracle.com/machine-learning
    In the future, a section at:
    oracle.com/machine-learning


  10. Two options to access the Template Examples


  11. There are three Feature Extraction demos specific to ESA:
    • Feature Extraction using ESA
    • Feature Extraction using ESA Wiki Model
    Type "esa" in the search box to filter the templates.


  12. Click Create Notebook to create a copy for yourself
    • Click on "Create Notebook"
    • Give it a name (or accept the default)
    • Click OK


  13. The new notebook will show up in the notebooks listing.


  14. Live Demo


  15. Q & A


  16.
    Thank you
