JGS594 Lecture 17

Software Engineering for Machine Learning
Text Mining
(202204)

Javier Gonzalez-Sanchez
April 05, 2022
Transcript

  1. jgs
    SER 594
    Software Engineering for
    Machine Learning
    Lecture 17: Text Mining
    Dr. Javier Gonzalez-Sanchez
    [email protected]
    javiergs.engineering.asu.edu | javiergs.com
    PERALTA 230U
    Office Hours: By appointment

  2. jgs
    Previously …
    Unsupervised Learning

  3. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 3
    jgs
    Machine Learning

  4. jgs
    Text Mining
    Unsupervised Learning

  5. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 5
    jgs
    Text Mining
    Extracting information from text documents (natural language). Typical tasks:
    § generating a search index;
    § text categorization into domains;
    § text clustering to organize a set of documents;
    § sentiment analysis to identify subjective information;
    § concept or entity extraction (people, places, organizations);
    § document summarization (identifying important points);
    § learning relations between named entities; and
    § spam detection in email messages, user comments, web pages, and so on.

  6. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 6
    jgs
    Topic Modeling
    § An unsupervised technique that looks for patterns in a corpus of text: it
    identifies words (topics) that appear in a statistically meaningful way.
    § It analyzes a large archive of text documents (blog posts, emails, tweets,
    book chapters, etc.) to understand what the archive contains.
    § The most well-known algorithm is Latent Dirichlet Allocation (LDA),
    Blei, Ng, & Jordan, 2003 (available on Canvas).

  7. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 7
    jgs
    § Latent Dirichlet Allocation (LDA)
    Do not confuse with
    § Linear Discriminant Analysis (LDA)

  8. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 8
    jgs
    Latent Dirichlet Allocation (LDA)
    § Latent: existing but not yet developed or manifest.
    § Dirichlet: after the mathematician Peter Gustav Lejeune Dirichlet.
    § Allocation: the process of distributing something.
    § It assumes that the author composed a piece of text by selecting words
    from possible baskets of words, where each basket corresponds to a topic.
    § It represents a document as a mixture of topics.
    § It represents a topic as a mixture of words.
    § If a word w has a high probability of being in a topic t, all the documents
    containing w will be more strongly associated with t as well.
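The basket picture can be sketched in a few lines of Java. This is a toy illustration of the generative story only, not LDA inference: the two topics, their word lists, and the 80/20 topic mixture are made-up values, and words inside a basket are drawn uniformly for simplicity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Toy generative story: a document is a mixture of topics,
// and a topic is a basket of words.
// All names and numbers here are hypothetical illustration values.
class BasketDemo {
    static final String[][] BASKETS = {
        {"goal", "match", "team", "coach"},      // topic 0: "sport"
        {"market", "shares", "profit", "bank"}   // topic 1: "business"
    };

    // Generates `length` words; topicMix[t] is the probability of picking topic t.
    static List<String> generate(double[] topicMix, int length, Random rng) {
        List<String> words = new ArrayList<>();
        for (int i = 0; i < length; i++) {
            // 1. Pick a topic according to the document's topic mixture.
            double u = rng.nextDouble();
            int topic = 0;
            double cumulative = topicMix[0];
            while (u >= cumulative && topic < topicMix.length - 1) {
                topic++;
                cumulative += topicMix[topic];
            }
            // 2. Pick a word from that topic's basket (uniformly, for simplicity).
            String[] basket = BASKETS[topic];
            words.add(basket[rng.nextInt(basket.length)]);
        }
        return words;
    }

    public static void main(String[] args) {
        // A document that is 80% "sport" and 20% "business".
        System.out.println(generate(new double[] {0.8, 0.2}, 10, new Random(42)));
    }
}
```

LDA works in the opposite direction: it starts from the observed words and tries to recover the baskets and the per-document mixtures.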

  9. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 9
    jgs
    Latent Dirichlet Allocation (LDA)
    https://towardsdatascience.com/dimensionality-reduction-with-latent-dirichlet-allocation

  10. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 10
    jgs
    Latent Dirichlet Allocation (LDA)
    https://www.analyticsvidhya.com/blog/2021/06/part-2-topic-modeling-and-latent-dirichlet-allocation-lda-

  11. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 11
    jgs
    Latent Dirichlet Allocation (LDA)

  12. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 12
    jgs
    Algorithm | assumptions
    § Words like am/is/are/of/a/the/but/… that don’t carry any information about the
    “topics” are eliminated from the documents as a preprocessing step.
    § Words that occur in >80% of the documents can be eliminated without
    losing any information.
    § The number of topics, k, is defined in advance.
    § The order of the words and their grammatical role (subject, object,
    verb, …) are not considered in the model.
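The first two assumptions amount to a preprocessing pass, which could look roughly like the sketch below. The stop-word list is a tiny hypothetical sample, the tokenizer is a naive split on non-letters, and the class and method names are invented for this example; none of it is Mallet's API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class Preprocess {
    // A tiny hypothetical stop-word list; real lists are much longer.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("am", "is", "are", "of", "a", "the", "but"));

    // Lowercases, splits on non-letters, drops stop words, then drops words
    // that occur in more than maxDocFraction of the documents.
    static List<List<String>> clean(List<String> documents, double maxDocFraction) {
        List<List<String>> tokenized = new ArrayList<>();
        Map<String, Integer> docFrequency = new HashMap<>();
        for (String doc : documents) {
            List<String> tokens = new ArrayList<>();
            for (String token : doc.toLowerCase().split("[^a-z]+")) {
                if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                    tokens.add(token);
                }
            }
            tokenized.add(tokens);
            // Count each word once per document.
            for (String word : new HashSet<>(tokens)) {
                docFrequency.merge(word, 1, Integer::sum);
            }
        }
        double limit = maxDocFraction * documents.size();
        for (List<String> tokens : tokenized) {
            tokens.removeIf(word -> docFrequency.get(word) > limit);
        }
        return tokenized;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "News: the team won the match",
            "News: shares of the bank fell",
            "News: the coach is a winner");
        System.out.println(clean(docs, 0.8));
    }
}
```

Each document comes back as a plain list of tokens; a bag-of-words model would then keep only the counts, which is consistent with the last assumption that word order is ignored.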

  13. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 13
    jgs
    Algorithm
    Randomly assign each word w in each document d to one of k topics.
    For each document d:
      For each word w in d, compute:
        p(topic t | document d)
          The proportion of words in document d that are assigned to topic t,
          excluding the current word. If a lot of words from d belong to t,
          it is more probable that word w belongs to t.
        p(word w | topic t)
          The proportion of assignments to topic t, over all documents, that
          come from this word w (how many documents are in topic t because of w).
        Update the probability of word w belonging to topic t:
          p(word w with topic t) = p(topic t | document d) * p(word w | topic t)
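The loop above can be sketched as a simplified, collapsed-Gibbs-style sampler. This is an illustration under assumptions, not Mallet's implementation: words are given as integer vocabulary ids, the smoothing constants ALPHA and BETA are added here so that no probability is exactly zero (the slide does not mention them), and the class name is invented.

```java
import java.util.Arrays;
import java.util.Random;

// Simplified sketch of the resampling loop (collapsed-Gibbs style).
// ALPHA and BETA are smoothing constants assumed for this example.
class TinyLda {
    static final double ALPHA = 0.1, BETA = 0.01;

    // docs[d][i] is the vocabulary id of the i-th word of document d.
    // Returns z, where z[d][i] is the topic assigned to that word.
    static int[][] fit(int[][] docs, int vocabSize, int k, int sweeps, Random rng) {
        int[][] z = new int[docs.length][];
        int[][] docTopic = new int[docs.length][k];   // words in d assigned to t
        int[][] topicWord = new int[k][vocabSize];    // times word w is assigned to t
        int[] topicTotal = new int[k];                // all assignments to t

        // Random initial assignment of every word to one of k topics.
        for (int d = 0; d < docs.length; d++) {
            z[d] = new int[docs[d].length];
            for (int i = 0; i < docs[d].length; i++) {
                int t = rng.nextInt(k);
                z[d][i] = t;
                docTopic[d][t]++; topicWord[t][docs[d][i]]++; topicTotal[t]++;
            }
        }
        // Revisit every word and resample its topic.
        double[] weight = new double[k];
        for (int s = 0; s < sweeps; s++) {
            for (int d = 0; d < docs.length; d++) {
                for (int i = 0; i < docs[d].length; i++) {
                    int w = docs[d][i], old = z[d][i];
                    // Exclude the current word from the counts.
                    docTopic[d][old]--; topicWord[old][w]--; topicTotal[old]--;
                    double sum = 0;
                    for (int t = 0; t < k; t++) {
                        double pTopicGivenDoc = (docTopic[d][t] + ALPHA)
                            / (docs[d].length - 1 + k * ALPHA);
                        double pWordGivenTopic = (topicWord[t][w] + BETA)
                            / (topicTotal[t] + vocabSize * BETA);
                        weight[t] = pTopicGivenDoc * pWordGivenTopic;
                        sum += weight[t];
                    }
                    // Draw the new topic in proportion to the weights.
                    double u = rng.nextDouble() * sum;
                    int t = 0;
                    while (t < k - 1 && u >= weight[t]) { u -= weight[t]; t++; }
                    z[d][i] = t;
                    docTopic[d][t]++; topicWord[t][w]++; topicTotal[t]++;
                }
            }
        }
        return z;
    }

    public static void main(String[] args) {
        // Hypothetical corpus: ids 0-2 are "sport" words, 3-5 are "business" words.
        int[][] docs = { {0, 1, 2, 0, 1}, {3, 4, 5, 3, 4}, {0, 2, 1, 2}, {5, 4, 3, 5} };
        int[][] z = fit(docs, 6, 2, 200, new Random(7));
        for (int[] row : z) System.out.println(Arrays.toString(row));
    }
}
```

The sampler is random, so the topic numbers themselves carry no meaning; what matters is which words end up sharing a topic after enough sweeps.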

  14. jgs
    Tools
    Text Mining

  15. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 15
    jgs
    Tools
    https://www.linuxlinks.com/excellent-java-natural-language-processing-tools/
    (September 2019)

  16. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 16
    jgs
    Library
    § Mallet (MAchine Learning for LanguagE Toolkit), a Java-based package for
    statistical natural-language processing, document classification, clustering,
    topic modeling, information extraction, and other machine-learning
    applications to text.
    § Written by Andrew McCallum (University of Massachusetts Amherst, 2002).

  17. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 17
    jgs
    Mallet
    § Download
    http://mallet.cs.umass.edu/download.php

  18. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 18
    jgs
    Mallet API
    Add mallet.jar and mallet-deps.jar to the project classpath.

  19. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 19
    jgs
    Data | BBC News Dataset
    § Raw text files
    § Consists of 2225 documents from the BBC news website corresponding to
    stories in five topical areas from 2004-2005.
    § Class Labels: 5 (business, entertainment, politics, sport, tech)
    § http://mlg.ucd.ie/datasets/bbc.html

  20. jgs
    Mallet API
    Text Mining

  21. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 21
    jgs
    Import Data for Training
    File iterator
    cc.mallet.pipe.iterator.FileIterator
    FileIterator iterator = new FileIterator(
    § an array (File[]) of directories that contain text files,
    § a file filter that specifies which files to select within a directory,
    § a pattern that is applied to a filename to produce a class label
    )

  22. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 22
    jgs
    Import Data
    // Main.java
    import cc.mallet.pipe.iterator.FileIterator;
    import java.io.File;

    public class Main {
        public static void main(String[] args) {
            // Directories that hold the raw text files.
            File[] directories = new File[1];
            directories[0] = new File("/src/data/");
            // Iterate over the .txt files; the last directory in each
            // file's path becomes its class label.
            FileIterator iterator = new FileIterator(
                directories,
                new TextFileFilter(),
                FileIterator.LAST_DIRECTORY
            );
        }
    }

    // TextFileFilter.java
    import java.io.File;
    import java.io.FileFilter;

    public class TextFileFilter implements FileFilter {
        // Accept only files whose name ends with ".txt".
        @Override
        public boolean accept(File file) {
            return file.toString().endsWith(".txt");
        }
    }
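To see how a pattern can turn a file path into a class label, here is a small sketch that uses only java.util.regex. The regular expression and the example paths are assumptions made for illustration; the expression imitates the intent of a "last directory" pattern and is not copied from Mallet's source.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LabelFromPath {
    // Hypothetical pattern: captures the name of the directory that holds the file.
    // It is not Mallet's own constant, only an imitation of the idea.
    static final Pattern LAST_DIR = Pattern.compile(".*/([^/]+)/[^/]+");

    // Returns the captured label, or null when the path does not match.
    static String labelOf(String path) {
        Matcher m = LAST_DIR.matcher(path);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(labelOf("/src/data/sport/001.txt"));    // prints "sport"
        System.out.println(labelOf("/src/data/business/510.txt")); // prints "business"
    }
}
```

With a dataset organized as one folder per class, as in the BBC News Dataset, this kind of pattern yields the class label without a separate label file.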

  23. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 23
    jgs
    Code

  24. jgs
    To be continued …

  25. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 25
    jgs
    Questions

  26. jgs
    SER 594 Software Engineering for Machine Learning
    Javier Gonzalez-Sanchez, Ph.D.
    [email protected]
    Spring 2022
    Copyright. These slides can only be used as study material for the class CSE205 at Arizona State University.
    They cannot be distributed or used for another purpose.
