JGS594 Lecture 17

Software Engineering for Machine Learning
Text Mining
(202204)

Javier Gonzalez-Sanchez

April 05, 2022

Transcript

  1. SER 594 Software Engineering for Machine Learning Lecture 17:

    Text Mining | Dr. Javier Gonzalez-Sanchez | [email protected] | javiergs.engineering.asu.edu | javiergs.com | PERALTA 230U | Office Hours: By appointment
  2. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 5

    Text Mining: extracting information from text documents (natural language). Typical tasks:
    § generating a search index;
    § text categorization into domains;
    § text clustering to organize a set of documents;
    § sentiment analysis to identify subjective information;
    § concept or entity extraction (people, places, organizations);
    § document summarization (identifying important points);
    § learning relations between named entities; and
    § spam detection in email messages, user comments, web pages, and so on.
  3. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 6

    Topic Modeling
    § An unsupervised technique that looks for patterns in a corpus of text: it identifies words (topics) that appear in a statistically meaningful way.
    § It analyzes a large archive of text documents (blog posts, emails, tweets, book chapters, etc.) to understand what the archive contains.
    § The most well-known algorithm is Latent Dirichlet Allocation (LDA), introduced by Blei, Ng, & Jordan in 2003 (paper available on Canvas).
  4. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 7

    Do not confuse Latent Dirichlet Allocation (LDA) with Linear Discriminant Analysis (also abbreviated LDA).
  5. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 8

    Latent Dirichlet Allocation (LDA)
    § Latent: existing but not yet developed or manifest.
    § Dirichlet: after the mathematician Peter Gustav Lejeune Dirichlet.
    § Allocation: the process of distributing something.
    § It assumes that the author composed a piece of text by selecting words from possible baskets of words, where each basket corresponds to a topic.
    § It represents a document as a mixture of topics.
    § It represents a topic as a mixture of words.
    § If a word w has a high probability of being in a topic t, all the documents containing w will be more strongly associated with t as well.
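
The "baskets of words" story can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the vocabulary, the two topics, and every probability are invented for the example, and the sketch skips the Dirichlet priors that full LDA places on both mixtures. The class and method names (`BasketSketch`, `sample`, `generateDocument`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration of the "baskets of words" story behind LDA.
// All topics, words, and probabilities below are made up for the example.
class BasketSketch {

    // Each topic is a mixture of words: one probability per vocabulary entry.
    static final String[] VOCABULARY = {"goal", "match", "team", "market", "shares", "profit"};
    static final double[][] WORD_GIVEN_TOPIC = {
        {0.40, 0.30, 0.25, 0.02, 0.02, 0.01},  // topic 0, "sport"-like
        {0.01, 0.02, 0.07, 0.30, 0.30, 0.30}   // topic 1, "business"-like
    };

    // Draws an index from a discrete distribution whose entries sum to 1.
    static int sample(double[] probabilities, Random random) {
        double r = random.nextDouble();
        double cumulative = 0.0;
        for (int i = 0; i < probabilities.length; i++) {
            cumulative += probabilities[i];
            if (r < cumulative) {
                return i;
            }
        }
        return probabilities.length - 1; // guard against rounding error
    }

    // Composes a document: for every word, pick a basket (topic) according to
    // the document's topic mixture, then pick a word from that basket.
    static List<String> generateDocument(double[] topicMixture, int length, Random random) {
        List<String> words = new ArrayList<>();
        for (int i = 0; i < length; i++) {
            int topic = sample(topicMixture, random);
            int word = sample(WORD_GIVEN_TOPIC[topic], random);
            words.add(VOCABULARY[word]);
        }
        return words;
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        double[] mostlySport = {0.9, 0.1}; // a document that is 90% topic 0
        System.out.println(generateDocument(mostlySport, 10, random));
    }
}
```

LDA works in the opposite direction: it only sees the generated words and tries to recover the two mixtures that could have produced them.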
  6. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 9

    Latent Dirichlet Allocation (LDA) https://towardsdatascience.com/dimensionality-reduction-with-latent-dirichlet-allocation
  7. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 10

    Latent Dirichlet Allocation (LDA) https://www.analyticsvidhya.com/blog/2021/06/part-2-topic-modeling-and-latent-dirichlet-allocation-lda-
  8. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 11

    Latent Dirichlet Allocation (LDA)
  9. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 12

    Algorithm | assumptions
    § Words like am/is/are/of/a/the/but/… that don’t carry any information about the “topics” are eliminated from the documents as a preprocessing step.
    § Words that occur in >80% of the documents can be eliminated without losing any information.
    § The number of topics (that we have or want to find) is predefined.
    § The order of the words and the grammatical role of the words (subject, object, verb, …) are not considered in the model.
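
The first two assumptions describe a preprocessing pass, which might look like the sketch below. The stop-word list is a tiny illustrative subset, the tokenizer only handles lowercase ASCII letters, and the names `PreprocessSketch`, `tokenize`, and `dropFrequentWords` are hypothetical. The 0.8 threshold is the >80% rule from the slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class PreprocessSketch {

    // A tiny, illustrative stop-word list; real lists are much longer.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("am", "is", "are", "of", "a", "the", "but"));

    // Lowercases the text, splits on anything that is not a-z, and drops stop words.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Removes every word that occurs in more than maxFraction of the documents.
    static List<List<String>> dropFrequentWords(List<List<String>> documents, double maxFraction) {
        Map<String, Integer> documentFrequency = new HashMap<>();
        for (List<String> document : documents) {
            for (String word : new HashSet<>(document)) { // count each word once per document
                documentFrequency.merge(word, 1, Integer::sum);
            }
        }
        List<List<String>> filtered = new ArrayList<>();
        for (List<String> document : documents) {
            List<String> kept = new ArrayList<>();
            for (String word : document) {
                if (documentFrequency.get(word) <= maxFraction * documents.size()) {
                    kept.add(word);
                }
            }
            filtered.add(kept);
        }
        return filtered;
    }

    public static void main(String[] args) {
        List<List<String>> documents = new ArrayList<>();
        documents.add(tokenize("The team said the match is won"));
        documents.add(tokenize("Analysts said shares are a bargain"));
        documents.add(tokenize("The minister said taxes are fair"));
        // "said" occurs in all three documents (100% > 80%), so it is removed.
        System.out.println(dropFrequentWords(documents, 0.8));
    }
}
```

After this pass each document is just a bag of remaining words, which matches the last assumption: word order and grammatical role play no part.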
  10. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 13

    Algorithm
    § Randomly assign each word w in each document d to one of the k topics.
    § For each document d and each word w in d, compute:
      – p(topic t | document d): the proportion of words in document d that are assigned to topic t, excluding the current word. This measures how many words of d belong to t; if a lot of words from d belong to t, it is more probable that word w belongs to t.
      – p(word w | topic t): the proportion of assignments to topic t, over all documents, that come from this word w. This measures how many documents are in topic t because of word w.
    § Update the probability of the word w belonging to topic t: p(word w with topic t) = p(topic t | document d) * p(word w | topic t)
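
One possible reading of these steps as a sampling loop is sketched below. It is not the lecture's or Mallet's implementation. The smoothing constants `ALPHA` and `BETA` are assumptions added so that no probability is ever zero, words are represented by integer ids, and the names `LdaSketch` and `run` are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

class LdaSketch {
    static final double ALPHA = 0.1;  // assumed smoothing for p(topic | document)
    static final double BETA = 0.01;  // assumed smoothing for p(word | topic)

    // docs[d][i] is the vocabulary id of the i-th word of document d.
    // Returns assignment[d][i], the topic of every word after the sweeps.
    static int[][] run(int[][] docs, int vocabularySize, int k, int sweeps, Random random) {
        int[][] assignment = new int[docs.length][];
        int[][] docTopic = new int[docs.length][k];      // words of d assigned to t
        int[][] topicWord = new int[k][vocabularySize];  // times w is assigned to t
        int[] topicTotal = new int[k];                   // all words assigned to t

        // Step 1: randomly assign each word to one of the k topics.
        for (int d = 0; d < docs.length; d++) {
            assignment[d] = new int[docs[d].length];
            for (int i = 0; i < docs[d].length; i++) {
                int t = random.nextInt(k);
                assignment[d][i] = t;
                docTopic[d][t]++;
                topicWord[t][docs[d][i]]++;
                topicTotal[t]++;
            }
        }

        double[] weight = new double[k];
        for (int sweep = 0; sweep < sweeps; sweep++) {
            for (int d = 0; d < docs.length; d++) {
                for (int i = 0; i < docs[d].length; i++) {
                    int w = docs[d][i];
                    int old = assignment[d][i];
                    // Exclude the current word from all counts.
                    docTopic[d][old]--;
                    topicWord[old][w]--;
                    topicTotal[old]--;

                    // Step 2: p(topic t | document d) * p(word w | topic t) for every t.
                    double sum = 0.0;
                    for (int t = 0; t < k; t++) {
                        double pTopicGivenDoc =
                            (docTopic[d][t] + ALPHA) / (docs[d].length - 1 + k * ALPHA);
                        double pWordGivenTopic =
                            (topicWord[t][w] + BETA) / (topicTotal[t] + vocabularySize * BETA);
                        weight[t] = pTopicGivenDoc * pWordGivenTopic;
                        sum += weight[t];
                    }

                    // Step 3: draw the new topic in proportion to the weights.
                    double r = random.nextDouble() * sum;
                    int chosen = k - 1;
                    for (int t = 0; t < k; t++) {
                        r -= weight[t];
                        if (r < 0) {
                            chosen = t;
                            break;
                        }
                    }
                    assignment[d][i] = chosen;
                    docTopic[d][chosen]++;
                    topicWord[chosen][w]++;
                    topicTotal[chosen]++;
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Word ids 0-2 and 3-5 never share a document in this toy corpus.
        int[][] docs = {{0, 1, 2, 0, 1}, {3, 4, 5, 3, 4}, {0, 2, 1, 2}, {5, 3, 4, 5}};
        int[][] result = run(docs, 6, 2, 200, new Random(7));
        for (int[] row : result) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Drawing the new topic at random in proportion to the product, rather than always taking the largest value, is a design choice of this sketch; it keeps the loop from locking into the initial random assignment too early.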
  11. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 15

    Tools https://www.linuxlinks.com/excellent-java-natural-language-processing-tools/ (September 2019)
  12. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 16

    Library
    § Mallet (MAchine Learning for Language Toolkit), a Java-based package for statistical natural-language processing, document classification, clustering, topic modeling, information extraction, and other machine-learning applications to text.
    § Developed by Andrew McCallum (University of Massachusetts Amherst, 2002).
  13. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 17

    Mallet
    § Download: http://mallet.cs.umass.edu/download.php
  14. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 18

    Mallet API: mallet.jar, mallet-deps.jar
  15. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 19

    Data | BBC News Dataset
    § Raw text files.
    § Consists of 2225 documents from the BBC news website corresponding to stories in five topical areas from 2004-2005.
    § Class labels: 5 (business, entertainment, politics, sport, tech).
    § http://mlg.ucd.ie/datasets/bbc.html
  16. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 21

    Import Data for Training
    File iterator: cc.mallet.pipe.iterator.FileIterator
    FileIterator iterator = new FileIterator(
      § a list of File[] directories with text files,
      § a file filter that specifies which files to select within a directory,
      § a pattern that is applied to a filename to produce a class label
    )
  17. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 22

    Import Data

    // Main.java
    import cc.mallet.pipe.iterator.FileIterator;
    import java.io.File;

    public class Main {
      public static void main(String[] args) {
        // Directories that contain the text files
        File[] directories = new File[1];
        directories[0] = new File("/src/data/");
        // Iterate over the .txt files; the last directory name becomes the class label
        FileIterator iterator = new FileIterator(
          directories,
          new TextFileFilter(),
          FileIterator.LAST_DIRECTORY
        );
      }
    }

    // TextFileFilter.java
    import java.io.File;
    import java.io.FileFilter;

    public class TextFileFilter implements FileFilter {
      // Accept only files whose name ends with .txt
      @Override
      public boolean accept(File file) {
        return file.toString().endsWith(".txt");
      }
    }
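
To make the third constructor argument concrete, the sketch below shows how a pattern applied to a file path can yield a class label. It uses only `java.util.regex` and does not call Mallet. The regular expression is an assumption written for this example to behave like a "last directory" rule, not a copy of Mallet's constant, and the example paths assume one subdirectory per class. `LabelSketch` and `labelFor` are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LabelSketch {
    // Assumed pattern: capture the name of the directory that contains the file.
    static final Pattern LAST_DIRECTORY_LIKE = Pattern.compile(".*/([^/]+)/[^/]+");

    // Returns the captured directory name, or null when the path does not match.
    static String labelFor(String path) {
        Matcher matcher = LAST_DIRECTORY_LIKE.matcher(path);
        return matcher.matches() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(labelFor("/src/data/sport/001.txt"));    // prints "sport"
        System.out.println(labelFor("/src/data/business/001.txt")); // prints "business"
    }
}
```

With a layout like this, the folder name doubles as the class label, so no separate label file is needed.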
  18. SER 594 Software Engineering for Machine Learning | Javier Gonzalez-Sanchez,

    Ph.D. | [email protected] | Spring 2022 | Copyright. These slides can only be used as study material for the class CSE205 at Arizona State University. They cannot be distributed or used for another purpose.