JGS594 Lecture 17

jgs SER 594 Software Engineering for Machine Learning
Lecture 17: Text Mining
Dr. Javier Gonzalez-Sanchez
[email protected]
javiergs.engineering.asu.edu | javiergs.com
PERALTA 230U
Office Hours: By appointment

jgs Previously … Unsupervised Learning

Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 3
jgs Machine Learning

jgs Text Mining Unsupervised Learning

jgs Text Mining
Extracting information from text documents (natural language). Typical tasks:
§ generating a search index;
§ text categorization into domains;
§ text clustering to organize a set of documents;
§ sentiment analysis to identify subjective information;
§ concept or entity extraction (people, places, organizations);
§ document summarization (identifying important points);
§ learning relations between named entities; and
§ spam detection in email messages, user comments, web pages, and so on.

jgs Topic Modeling
§ An unsupervised technique that looks for patterns in a corpus of text: it identifies clusters of words (topics) that co-occur in a statistically meaningful way.
§ It is used to analyze a large archive of text documents (blog posts, emails, tweets, book chapters, etc.) and understand what the archive contains.
§ The most well-known algorithm is Latent Dirichlet Allocation (LDA), introduced by Blei, Ng, & Jordan in 2003 (paper available on Canvas).

jgs Do not confuse
§ Latent Dirichlet Allocation (LDA)
with
§ Linear Discriminant Analysis (LDA)

jgs Latent Dirichlet Allocation (LDA)
§ Latent: existing but not yet developed or manifest.
§ Dirichlet: Peter Gustav Lejeune Dirichlet, the mathematician after whom the Dirichlet distribution is named.
§ Allocation: the process of distributing something.
§ It assumes that the author composed a piece of text by selecting words from possible baskets of words, where each basket corresponds to a topic.
§ It represents a document as a mixture of topics.
§ It represents a topic as a mixture of words.
§ If a word w has a high probability of being in a topic t, all the documents containing w will be more strongly associated with t as well.
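The "baskets of words" assumption above can be sketched as a small generator. This is only an illustration: the class name, the two baskets, and the fixed probabilities are made up for the example, and full LDA additionally draws the topic mixture and the word probabilities from Dirichlet distributions, which this sketch skips.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class GenerativeSketch {

    // Picks index i with probability weights[i] (weights are assumed to sum to 1).
    static int pick(double[] weights, Random rnd) {
        double u = rnd.nextDouble();
        for (int i = 0; i < weights.length; i++) {
            u -= weights[i];
            if (u < 0) return i;
        }
        return weights.length - 1; // guard against rounding error
    }

    // Composes a document: for every word, first choose a topic (basket)
    // from the document's topic mixture, then choose a word from that basket.
    static List<String> compose(double[] topicMix, String[][] baskets,
                                double[][] wordProbs, int length, Random rnd) {
        List<String> words = new ArrayList<>();
        for (int i = 0; i < length; i++) {
            int t = pick(topicMix, rnd);      // document = mixture of topics
            int w = pick(wordProbs[t], rnd);  // topic = mixture of words
            words.add(baskets[t][w]);
        }
        return words;
    }

    public static void main(String[] args) {
        // Hypothetical baskets: topic 0 ~ sport, topic 1 ~ business.
        String[][] baskets = { {"goal", "match", "team"}, {"stock", "bank", "market"} };
        double[][] wordProbs = { {0.5, 0.3, 0.2}, {0.4, 0.4, 0.2} };
        double[] topicMix = {0.7, 0.3}; // this document is mostly about sport
        System.out.println(compose(topicMix, baskets, wordProbs, 10, new Random(594)));
    }
}
```

LDA works in the opposite direction: given only the generated words, it tries to recover plausible topic mixtures and baskets.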

jgs Latent Dirichlet Allocation (LDA) https://towardsdatascience.com/dimensionality-reduction-with-latent-dirichlet-allocation

jgs Latent Dirichlet Allocation (LDA) https://www.analyticsvidhya.com/blog/2021/06/part-2-topic-modeling-and-latent-dirichlet-allocation-lda-

jgs Latent Dirichlet Allocation (LDA)

jgs Algorithm | assumptions
§ Words like am/is/are/of/a/the/but/… don’t carry any information about the “topics” and are eliminated from the documents as a preprocessing step.
§ Words that occur in more than 80% of the documents can be eliminated with little loss of information.
§ The number of topics (k) is predefined.
§ The order of the words and their grammatical role (subject, object, verb, …) are not considered in the model.
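The first two assumptions amount to a preprocessing filter, which might look like the sketch below. The class name, the tiny stop-word list, and the sample documents are assumptions made for the example; real stop-word lists are much longer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class Preprocess {

    // Tiny illustrative stop-word list (an assumption for this example).
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
        "am", "is", "are", "of", "a", "the", "but"));

    // Removes stop words and words that occur in more than
    // maxDocFraction of the documents (0.8 corresponds to the 80% rule).
    static List<List<String>> filter(List<List<String>> docs, double maxDocFraction) {
        // Document frequency: in how many documents each word appears.
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : docs) {
            for (String w : new HashSet<>(doc)) {
                docFreq.merge(w, 1, Integer::sum);
            }
        }
        List<List<String>> result = new ArrayList<>();
        for (List<String> doc : docs) {
            List<String> kept = new ArrayList<>();
            for (String w : doc) {
                boolean tooCommon = docFreq.get(w) > maxDocFraction * docs.size();
                if (!STOP_WORDS.contains(w) && !tooCommon) {
                    kept.add(w);
                }
            }
            result.add(kept);
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("the", "market", "news", "is", "up"),
            Arrays.asList("the", "team", "news", "is", "good"),
            Arrays.asList("a", "film", "news", "review"),
            Arrays.asList("the", "phone", "news", "is", "out"),
            Arrays.asList("news", "of", "the", "election"));
        // "news" occurs in every document, so it is dropped along with the stop words.
        System.out.println(filter(docs, 0.8));
    }
}
```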

jgs Algorithm
§ Randomly assign each word w in each document d to one of k topics.
§ For each document d, and for each word w in d, compute:
  – p(topic t | document d): the proportion of words in document d that are assigned to topic t, excluding the current word. This measures how many words in d belong to t; if a lot of words from d belong to t, it is more probable that word w belongs to t.
  – p(word w | topic t): the proportion of assignments to topic t, over all documents, that come from word w. This measures how many documents are in topic t because of word w.
§ Update the probability that word w belongs to topic t:
  p(word w with topic t) = p(topic t | document d) * p(word w | topic t)
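A minimal sketch of these steps follows, with words encoded as integer ids. Several details are assumptions that the slide does not state: the loop is repeated for a fixed number of iterations, each word is reassigned by sampling a topic in proportion to the product of the two probabilities (as in collapsed Gibbs sampling), and small smoothing constants (`alpha`, `beta`) keep the product from being zero for every topic. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

public class SimpleLda {

    // docs[d][i] is the vocabulary id of the i-th word of document d.
    // Returns the topic assigned to every word position after the last iteration.
    static int[][] run(int[][] docs, int vocabSize, int k, int iterations, long seed) {
        final double alpha = 0.1, beta = 0.01; // smoothing constants (assumed, not from the slides)
        Random rnd = new Random(seed);
        int[][] topicOf = new int[docs.length][];
        int[][] docTopic = new int[docs.length][k]; // words of document d assigned to topic t
        int[][] topicWord = new int[k][vocabSize];  // assignments of word w to topic t
        int[] topicTotal = new int[k];              // all assignments to topic t

        // Step 1: random initial assignment.
        for (int d = 0; d < docs.length; d++) {
            topicOf[d] = new int[docs[d].length];
            for (int i = 0; i < docs[d].length; i++) {
                int t = rnd.nextInt(k);
                topicOf[d][i] = t;
                docTopic[d][t]++; topicWord[t][docs[d][i]]++; topicTotal[t]++;
            }
        }

        double[] p = new double[k];
        for (int it = 0; it < iterations; it++) {
            for (int d = 0; d < docs.length; d++) {
                for (int i = 0; i < docs[d].length; i++) {
                    int w = docs[d][i];
                    int old = topicOf[d][i];
                    // Exclude the current word from all counts.
                    docTopic[d][old]--; topicWord[old][w]--; topicTotal[old]--;

                    // Step 2: p(topic t | document d) * p(word w | topic t) for every topic.
                    double sum = 0;
                    for (int t = 0; t < k; t++) {
                        double pTopicGivenDoc =
                            (docTopic[d][t] + alpha) / (docs[d].length - 1 + k * alpha);
                        double pWordGivenTopic =
                            (topicWord[t][w] + beta) / (topicTotal[t] + vocabSize * beta);
                        p[t] = pTopicGivenDoc * pWordGivenTopic;
                        sum += p[t];
                    }

                    // Step 3: reassign the word by sampling a topic in proportion to p[t].
                    double u = rnd.nextDouble() * sum;
                    int chosen = k - 1; // fallback in case of rounding error
                    for (int t = 0; t < k; t++) {
                        u -= p[t];
                        if (u <= 0) { chosen = t; break; }
                    }
                    topicOf[d][i] = chosen;
                    docTopic[d][chosen]++; topicWord[chosen][w]++; topicTotal[chosen]++;
                }
            }
        }
        return topicOf;
    }

    public static void main(String[] args) {
        // Two toy documents over a vocabulary of four word ids.
        int[][] docs = { {0, 1, 0, 1, 0}, {2, 3, 2, 3} };
        System.out.println(Arrays.deepToString(run(docs, 4, 2, 50, 594L)));
    }
}
```

The returned assignments can then be counted per document to estimate its topic mixture, and per topic to list its most frequent words.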

jgs Tools Text Mining

jgs Tools https://www.linuxlinks.com/excellent-java-natural-language-processing-tools/ (September 2019)

jgs Library
§ Mallet (MAchine Learning for Language Toolkit) is a Java-based package for statistical natural-language processing, document classification, clustering, topic modeling, information extraction, and other machine-learning applications to text.
§ Created by Andrew McCallum, University of Massachusetts Amherst (2002).

jgs Mallet § Download http://mallet.cs.umass.edu/download.php

jgs Mallet API mallet.jar mallet-deps.jar

jgs Data | BBC News Dataset § Raw text files § Consists of 2225 documents from the BBC news website corresponding to stories in five topical areas from 2004-2005. § Class Labels: 5 (business, entertainment, politics, sport, tech) § http://mlg.ucd.ie/datasets/bbc.html

jgs Mallet API Text Mining

jgs Import Data for Training
File iterator: cc.mallet.pipe.iterator.FileIterator

FileIterator iterator = new FileIterator(
  § an array of directories (File[]) that contain the text files,
  § a file filter that specifies which files to select within a directory,
  § a pattern that is applied to a filename to produce a class label
)

jgs Import Data

// Main.java
import cc.mallet.pipe.iterator.FileIterator;
import java.io.File;

public class Main {
    public static void main(String[] args) {
        // Directories that contain the text files
        File[] directories = new File[1];
        directories[0] = new File("/src/data/");

        FileIterator iterator = new FileIterator(
            directories,
            new TextFileFilter(),        // select only .txt files
            FileIterator.LAST_DIRECTORY  // class label taken from the last directory of the path
        );
    }
}

// TextFileFilter.java
import java.io.File;
import java.io.FileFilter;

public class TextFileFilter implements FileFilter {
    @Override
    public boolean accept(File file) {
        // Accept a file only if its path ends with ".txt"
        return file.toString().endsWith(".txt");
    }
}
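To make the roles of the filter and the label pattern concrete without the Mallet jars, here is a standard-library-only sketch. The regular expression is written for this example and is not copied from Mallet, and the folder layout (one directory per class, such as `sport`) is an assumption about how the data is stored.

```java
import java.io.File;
import java.io.FileFilter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LabelSketch {

    // Hypothetical pattern: captures the name of the directory that directly
    // contains the file. It only imitates the idea of a "last directory" label.
    static final Pattern PARENT_DIR = Pattern.compile("(?:.*/)?([^/]+)/[^/]+");

    // Same selection rule as TextFileFilter: keep .txt files only.
    static final FileFilter TXT_ONLY = file -> file.getName().endsWith(".txt");

    // Returns the class label for a file, or null if the path has no parent directory.
    static String labelOf(File file) {
        String path = file.getPath().replace(File.separatorChar, '/');
        Matcher m = PARENT_DIR.matcher(path);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical paths; the files do not need to exist for this demo.
        File[] candidates = {
            new File("data/bbc/sport/001.txt"),
            new File("data/bbc/tech/017.txt"),
            new File("data/bbc/README")
        };
        for (File f : candidates) {
            if (TXT_ONLY.accept(f)) {
                System.out.println(f.getPath() + " -> " + labelOf(f));
            }
        }
    }
}
```

With labels derived from folder names, the raw text files need no separate label file.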

jgs Code

jgs To be continued …

jgs Questions

jgs SER 594 Software Engineering for Machine Learning
Javier Gonzalez-Sanchez, Ph.D.
[email protected]
Spring 2022
Copyright. These slides can only be used as study material for the class CSE205 at Arizona State University. They cannot be distributed or used for another purpose.
