
JGS594 Lecture 17


Software Engineering for Machine Learning
Text Mining
(202204)


Javier Gonzalez

April 05, 2022


Transcript

  1. SER 594 Software Engineering for Machine Learning. Lecture 17: Text Mining

    Dr. Javier Gonzalez-Sanchez, javiergs@asu.edu, javiergs.engineering.asu.edu | javiergs.com. PERALTA 230U. Office Hours: By appointment
  2. Previously … Unsupervised Learning

  3. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 3

    Machine Learning
  4. Text Mining Unsupervised Learning

  5. Text Mining

    Extracting information from text documents (natural language). Typical tasks:
    § generating a search index;
    § text categorization into domains;
    § text clustering to organize a set of documents;
    § sentiment analysis to identify subjective information;
    § concept or entity extraction (people, places, organizations);
    § document summarization (identifying important points);
    § learning relations between named entities; and
    § spam detection in email messages, user comments, web pages, and so on.
  6. Topic Modeling

    § An unsupervised technique that looks for patterns in a corpus of text: it identifies groups of words (topics) that occur together in a statistically meaningful way.
    § It is used to analyze a large archive of text documents (blog posts, emails, tweets, book chapters, etc.) and understand what the archive contains.
    § The most well-known algorithm is Latent Dirichlet Allocation (LDA), introduced by Blei, Ng, & Jordan in 2003 (available on Canvas).
  7. Latent Dirichlet Allocation (LDA)

    Do not confuse it with Linear Discriminant Analysis (also abbreviated LDA).
  8. Latent Dirichlet Allocation (LDA)

    § Latent: existing but not yet developed or manifest.
    § Dirichlet: named after Peter Gustav Lejeune Dirichlet.
    § Allocation: the process of distributing something.
    § It assumes that the author composed a piece of text by selecting words from possible baskets of words, where each basket corresponds to a topic.
    § It represents a document as a mixture of topics.
    § It represents a topic as a mixture of words.
    § If a word w has a high probability of being in a topic t, all the documents containing w will be more strongly associated with t as well.
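
The "baskets of words" assumption can be made concrete with a toy generator. The sketch below is only an illustration of the generative story, not an LDA fit: the baskets, the word weights, and the document's topic mixture are hypothetical values chosen by hand (in LDA the mixtures are drawn from a Dirichlet distribution and then inferred from data).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Toy illustration of the generative story that LDA assumes (not an LDA fit). */
class GenerativeStorySketch {

    // Draws an index i with probability weights[i]; weights are assumed to sum to 1.
    static int sample(double[] weights, Random random) {
        double u = random.nextDouble();
        double cumulative = 0.0;
        for (int i = 0; i < weights.length; i++) {
            cumulative += weights[i];
            if (u < cumulative) {
                return i;
            }
        }
        return weights.length - 1; // guard against floating-point round-off
    }

    // Writes a document: for every position, pick a topic from the document's
    // topic mixture, then pick a word from that topic's basket.
    static List<String> writeDocument(double[] topicMixture, String[][] baskets,
                                      double[][] wordWeights, int length, Random random) {
        List<String> words = new ArrayList<>();
        for (int n = 0; n < length; n++) {
            int topic = sample(topicMixture, random);
            int word = sample(wordWeights[topic], random);
            words.add(baskets[topic][word]);
        }
        return words;
    }

    public static void main(String[] args) {
        String[][] baskets = {
            {"goal", "match", "team"},      // hypothetical "sport" basket
            {"market", "shares", "profit"}  // hypothetical "business" basket
        };
        double[][] wordWeights = {
            {0.5, 0.3, 0.2},
            {0.4, 0.4, 0.2}
        };
        double[] topicMixture = {0.8, 0.2}; // hypothetical: a document mostly about sport
        System.out.println(writeDocument(topicMixture, baskets, wordWeights, 10, new Random(42)));
    }
}
```

Topic modeling runs this story in reverse: given only the words of each document, it estimates the baskets and the mixtures.
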
  9. Latent Dirichlet Allocation (LDA)

    https://towardsdatascience.com/dimensionality-reduction-with-latent-dirichlet-allocation
  10. Latent Dirichlet Allocation (LDA)

    https://www.analyticsvidhya.com/blog/2021/06/part-2-topic-modeling-and-latent-dirichlet-allocation-lda-
  11. Latent Dirichlet Allocation (LDA)
  12. Algorithm | assumptions

    § Words such as am/is/are/of/a/the/but/… carry no information about the “topics”, so they are eliminated from the documents as a preprocessing step.
    § Words that occur in more than 80% of the documents can also be eliminated with little loss of information, since they do not help to tell one topic from another.
    § The number of topics that we have, or want to have, is pre-defined.
    § The order of the words and the grammatical role of the words (subject, object, verb, …) are not considered in the model.
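
The two elimination steps above can be sketched in plain Java. This is a minimal illustration, not what Mallet does internally: the class name, the tokenizer (lowercase, ASCII letters only), and the stop-word set passed in by the caller are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch of the preprocessing assumptions: stop words and very frequent words are dropped. */
class PreprocessingSketch {

    // Lowercases the text and splits it on anything that is not an ASCII letter.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Removes stop words, then removes every word that occurs in more than
    // maxDocumentShare of the documents (0.8 for the 80% rule).
    // Word order is kept here only for readability; the model ignores it.
    static List<List<String>> filter(List<String> documents, Set<String> stopWords,
                                     double maxDocumentShare) {
        List<List<String>> tokenized = new ArrayList<>();
        Map<String, Integer> documentFrequency = new HashMap<>();
        for (String document : documents) {
            List<String> tokens = tokenize(document);
            tokens.removeAll(stopWords);
            tokenized.add(tokens);
            // Count each word once per document, however often it repeats.
            for (String word : new HashSet<>(tokens)) {
                documentFrequency.merge(word, 1, Integer::sum);
            }
        }
        double limit = maxDocumentShare * documents.size();
        for (List<String> tokens : tokenized) {
            tokens.removeIf(word -> documentFrequency.get(word) > limit);
        }
        return tokenized;
    }
}
```

For example, with five documents and a share of 0.8, a word is dropped only if it appears in all five documents; a word that appears in four of them is kept.
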
  13. Algorithm

    § Randomly assign each word w in each document d to one of k topics.
    § For each document d, and for each word w in d, compute:
      – p(topic t | document d): how many words belong to topic t in document d, measured as the proportion of words in d that are assigned to t, excluding the current word. If a lot of words from d belong to t, it is more probable that word w belongs to t.
      – p(word w | topic t): how many documents are in topic t because of word w, measured as the proportion of assignments to topic t, over all documents, that come from this word w.
    § Update the probability of word w belonging to topic t:
      p(word w with topic t) = p(topic t | document d) * p(word w | topic t)
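
The steps above can be sketched as a small count-based loop. This is a simplified illustration and not Mallet's implementation: words are assumed to be already mapped to integer ids, and three details go beyond the slide and are assumptions of the sketch (a small smoothing constant so that no probability is exactly zero, drawing the new topic at random in proportion to the product, and repeating the pass several times).

```java
import java.util.Random;

/** Simplified sketch of the reassignment loop described above (not Mallet's implementation). */
class LdaPassSketch {
    static final double SMOOTHING = 0.01; // assumption: keeps every probability above zero

    final int[][] documents;  // documents[d][n]: id of the n-th word in document d
    final int[][] topicOf;    // topicOf[d][n]: topic currently assigned to that word
    final int[][] docTopic;   // docTopic[d][t]: words in document d assigned to topic t
    final int[][] wordTopic;  // wordTopic[w][t]: assignments of word w to topic t
    final int[] topicTotal;   // topicTotal[t]: assignments to topic t over all documents
    final int topics;
    final Random random;

    LdaPassSketch(int[][] documents, int vocabularySize, int topics, long seed) {
        this.documents = documents;
        this.topics = topics;
        this.random = new Random(seed);
        topicOf = new int[documents.length][];
        docTopic = new int[documents.length][topics];
        wordTopic = new int[vocabularySize][topics];
        topicTotal = new int[topics];
        // Randomly assign each word in each document to one of k topics.
        for (int d = 0; d < documents.length; d++) {
            topicOf[d] = new int[documents[d].length];
            for (int n = 0; n < documents[d].length; n++) {
                topicOf[d][n] = random.nextInt(topics);
                count(d, n, +1);
            }
        }
    }

    // Adds (+1) or removes (-1) the word at position n of document d from the counts.
    private void count(int d, int n, int delta) {
        int w = documents[d][n];
        int t = topicOf[d][n];
        docTopic[d][t] += delta;
        wordTopic[w][t] += delta;
        topicTotal[t] += delta;
    }

    // p(topic t | document d) * p(word w | topic t) for every t, normalized to sum to 1.
    // Intended to be called while the current word is removed from the counts.
    double[] scores(int d, int w) {
        double[] score = new double[topics];
        double sum = 0.0;
        int remainingWords = documents[d].length - 1;
        int vocabularySize = wordTopic.length;
        for (int t = 0; t < topics; t++) {
            double topicGivenDoc = (docTopic[d][t] + SMOOTHING) / (remainingWords + topics * SMOOTHING);
            double wordGivenTopic = (wordTopic[w][t] + SMOOTHING) / (topicTotal[t] + vocabularySize * SMOOTHING);
            score[t] = topicGivenDoc * wordGivenTopic;
            sum += score[t];
        }
        for (int t = 0; t < topics; t++) {
            score[t] /= sum;
        }
        return score;
    }

    // One pass: every word is removed, scored against every topic, and reassigned.
    void pass() {
        for (int d = 0; d < documents.length; d++) {
            for (int n = 0; n < documents[d].length; n++) {
                count(d, n, -1); // exclude the current word
                double[] score = scores(d, documents[d][n]);
                double u = random.nextDouble();
                int t = 0;
                double cumulative = score[0];
                while (u >= cumulative && t < topics - 1) {
                    t++;
                    cumulative += score[t];
                }
                topicOf[d][n] = t;
                count(d, n, +1);
            }
        }
    }
}
```

A caller would build the object once, for instance `new LdaPassSketch(documents, vocabularySize, 5, 1L)`, call `pass()` repeatedly, and then read `docTopic` as the topic mixture of each document and `wordTopic` as the words of each topic.
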
  14. Tools Text Mining

  15. Tools

    https://www.linuxlinks.com/excellent-java-natural-language-processing-tools/ (September 2019)
  16. Library

    § Mallet (MAchine Learning for Language Toolkit): a Java-based package for statistical natural-language processing, document classification, clustering, topic modeling, information extraction, and other machine-learning applications to text.
    § Andrew McCallum (University of Massachusetts Amherst, 2002).
  17. Mallet

    § Download: http://mallet.cs.umass.edu/download.php
  18. Mallet API

    § mallet.jar
    § mallet-deps.jar
  19. Data | BBC News Dataset

    § Raw text files.
    § Consists of 2225 documents from the BBC news website, corresponding to stories in five topical areas from 2004-2005.
    § Class labels: 5 (business, entertainment, politics, sport, tech).
    § http://mlg.ucd.ie/datasets/bbc.html
  20. Mallet API Text Mining

  21. Import Data for Training

    File iterator: cc.mallet.pipe.iterator.FileIterator

    FileIterator iterator = new FileIterator(...) takes three arguments:
    § a list (File[]) of directories with text files;
    § a file filter that specifies which files to select within a directory; and
    § a pattern that is applied to a filename to produce a class label.
  22. Import Data

    Main.java:

        import cc.mallet.pipe.iterator.FileIterator;
        import java.io.File;

        public class Main {
            public static void main(String[] args) {
                // Directories that contain the text files to import
                File[] directories = new File[1];
                directories[0] = new File("/src/data/");
                // Select the .txt files and label each one with its last directory name
                FileIterator iterator = new FileIterator(
                    directories,
                    new TextFileFilter(),
                    FileIterator.LAST_DIRECTORY
                );
            }
        }

    TextFileFilter.java:

        import java.io.File;
        import java.io.FileFilter;

        public class TextFileFilter implements FileFilter {
            // Accept only files whose name ends with ".txt"
            @Override
            public boolean accept(File file) {
                return file.toString().endsWith(".txt");
            }
        }
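
The third constructor argument turns a file path into a class label. The sketch below illustrates that idea with the standard library only. The regular expression is a stand-in written for this example (the expression behind FileIterator.LAST_DIRECTORY is not shown in the slides), it assumes "/" as the path separator, and the file names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrates how a pattern applied to a file path can yield a class label. */
class LabelPatternSketch {

    // Hypothetical pattern: captures the name of the directory that directly
    // contains the file. It is not copied from Mallet.
    static final Pattern PARENT_DIRECTORY = Pattern.compile(".*/([^/]+)/[^/]+");

    // Returns the label for a path, or null when the path has no parent directory.
    static String labelOf(String path) {
        Matcher matcher = PARENT_DIRECTORY.matcher(path);
        return matcher.matches() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical file inside a folder named after its class
        System.out.println(labelOf("/src/data/sport/001.txt"));
    }
}
```

With one folder per class, as in a dataset with five class labels, every file picks up its label from the folder that contains it, so no separate label file is needed.
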
  23. Code
  24. To be continued …

  25. Questions
  26. SER 594 Software Engineering for Machine Learning. Javier Gonzalez-Sanchez, Ph.D., javiergs@asu.edu, Spring 2022

    Copyright. These slides can only be used as study material for the class CSE205 at Arizona State University. They cannot be distributed or used for another purpose.