JGS594 Lecture 19

jgs SER 594 Software Engineering for Machine Learning
Lecture 19: Text Mining III
Dr. Javier Gonzalez-Sanchez
javiergs.engineering.asu.edu | javiergs.com
PERALTA 230U
Office Hours: By appointment

jgs Previously … MALLET

Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 3
jgs Data and Pipeline

jgs Model

    import cc.mallet.topics.ParallelTopicModel;
    import cc.mallet.types.InstanceList;

    InstanceList instances = new InstanceList(pipeline);
    instances.addThruPipe(iterator);

    // Arguments: number of topics, alpha, beta.
    // MALLET takes alpha as the sum over all topics (alphaSum).
    ParallelTopicModel model = new ParallelTopicModel(5, 0.01, 0.01);
    model.addInstances(instances);
    model.setNumThreads(4);
    model.setNumIterations(1000); // typical range: 50 to 2000
    model.estimate();

- A high alpha value means that each document is likely to contain a mixture of most of the topics rather than any single topic.
- A low alpha value means that a document is more likely to contain a mixture of just a few of the topics, or even only one.
- A high beta value means that a topic is likely to contain a mixture of most of the words.
- A low beta value means that a topic is likely to contain a mixture of just a few of the words.
- Griffiths and Steyvers (2004) suggest a value of 50/#topics for alpha and 0.1 for beta.
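The Griffiths and Steyvers (2004) suggestion can be worked out for a few topic counts. What follows is a minimal sketch in plain Java; the class and method names (`PriorHeuristic`, `alphaPerTopic`, `alphaSum`) are hypothetical and not part of MALLET. It also prints the sum of alpha over all topics, because MALLET's `ParallelTopicModel` constructor takes its second argument as that sum.

```java
/** Hypothetical helper (not part of MALLET): starting values after Griffiths and Steyvers (2004). */
final class PriorHeuristic {

    /** Suggested beta; it does not depend on the number of topics. */
    static final double BETA = 0.1;

    /** Suggested per-topic alpha: 50 / #topics. */
    static double alphaPerTopic(int numTopics) {
        if (numTopics <= 0) {
            throw new IllegalArgumentException("numTopics must be positive");
        }
        return 50.0 / numTopics;
    }

    /** Sum of the per-topic alpha over all topics. */
    static double alphaSum(int numTopics) {
        return alphaPerTopic(numTopics) * numTopics;
    }

    public static void main(String[] args) {
        // Print the suggested priors for a few candidate topic counts.
        for (int k : new int[] {5, 10, 50, 100}) {
            System.out.println("k=" + k
                    + " alpha=" + alphaPerTopic(k)
                    + " alphaSum=" + alphaSum(k)
                    + " beta=" + BETA);
        }
    }
}
```

Under this heuristic, a 5-topic model would be created with something like `new ParallelTopicModel(5, 50.0, 0.1)`. Treat these values as a starting point to tune, not as a rule.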

jgs Results
LL/token: the model's log-likelihood divided by the total number of tokens.
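The LL/token figure is a simple ratio, which the sketch below makes explicit. The class name `LogLikelihoodPerToken` and the numbers in `main` are hypothetical and chosen only for illustration; they are not results from the lecture.

```java
/** Hypothetical helper: normalizes a log-likelihood by the number of tokens. */
final class LogLikelihoodPerToken {

    /** Returns logLikelihood / totalTokens. */
    static double perToken(double logLikelihood, long totalTokens) {
        if (totalTokens <= 0) {
            throw new IllegalArgumentException("totalTokens must be positive");
        }
        return logLikelihood / totalTokens;
    }

    public static void main(String[] args) {
        // Made-up values for illustration only.
        double logLikelihood = -850000.0;
        long totalTokens = 100000L;
        System.out.println("LL/token: " + perToken(logLikelihood, totalTokens));
    }
}
```

Log-likelihoods are negative, so a value closer to zero indicates a better fit. Dividing by the token count keeps the number on a comparable scale when the corpus size changes.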

jgs Evaluation Text Mining

jgs Evaluation
- Topic modeling is unsupervised, so there are no labels to score a model against.
- Estimate the model's ability to generalize its topics to unseen documents.
- The likelihood of unseen documents can be used to compare models; higher likelihood implies a better model (Wallach et al., 2009).
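Comparing models by held-out likelihood amounts to picking the largest value. The sketch below assumes each candidate model has already been scored on the same unseen documents; the class name `ModelComparison`, the model labels, and the scores are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: selects the model with the highest held-out log-likelihood. */
final class ModelComparison {

    /** Returns the name of the entry with the largest log-likelihood. */
    static String best(Map<String, Double> heldOutLogLikelihood) {
        if (heldOutLogLikelihood.isEmpty()) {
            throw new IllegalArgumentException("no models to compare");
        }
        String bestName = null;
        double bestValue = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> entry : heldOutLogLikelihood.entrySet()) {
            // Log-likelihoods are negative; "higher" means closer to zero.
            if (entry.getValue() > bestValue) {
                bestValue = entry.getValue();
                bestName = entry.getKey();
            }
        }
        return bestName;
    }

    public static void main(String[] args) {
        // Made-up scores for three candidate topic counts.
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("k=5", -91250.0);
        scores.put("k=10", -88400.0);
        scores.put("k=20", -89975.0);
        System.out.println("Best model: " + best(scores));
    }
}
```

The comparison is only meaningful when every model is scored on the same held-out documents.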

jgs Evaluation

    import cc.mallet.topics.MarginalProbEstimator;
    import cc.mallet.types.InstanceList;

    // Split the dataset: 90% training, 10% testing, 0% validation
    InstanceList[] instanceSplit = instances.split(new double[] {0.9, 0.1, 0.0});

    // Use the first 90% for training
    model.addInstances(instanceSplit[0]);
    model.setNumThreads(4);
    model.setNumIterations(50);
    model.estimate();

    // Get the estimator and score the held-out 10%
    MarginalProbEstimator estimator = model.getProbEstimator();
    double loglike = estimator.evaluateLeftToRight(
        instanceSplit[1], // test instances
        10,               // number of particles
        false,            // whether to use resampling
        null);            // PrintStream for per-document output (none)
    System.out.println("Total log likelihood: " + loglike);
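To make the 90/10/0 split concrete, here is a minimal plain-Java sketch of a proportional split over a shuffled list. It is not MALLET's implementation of `InstanceList.split`; the class name `ProportionalSplit`, the `seed` parameter, and the rounding rule are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical helper: shuffles items and cuts them into parts by proportion. */
final class ProportionalSplit {

    static <T> List<List<T>> split(List<T> items, double[] proportions, long seed) {
        double total = 0.0;
        for (double p : proportions) {
            if (p < 0.0) {
                throw new IllegalArgumentException("proportions must be non-negative");
            }
            total += p;
        }
        if (total <= 0.0) {
            throw new IllegalArgumentException("proportions must sum to a positive value");
        }

        // Shuffle a copy so the caller's list is left untouched.
        List<T> shuffled = new ArrayList<>(items);
        Collections.shuffle(shuffled, new Random(seed));

        List<List<T>> parts = new ArrayList<>();
        double cumulative = 0.0;
        int start = 0;
        for (double p : proportions) {
            // Cut at the rounded cumulative boundary so no item is lost or duplicated.
            cumulative += p / total;
            int end = (int) Math.round(cumulative * shuffled.size());
            end = Math.max(start, Math.min(end, shuffled.size()));
            parts.add(new ArrayList<>(shuffled.subList(start, end)));
            start = end;
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> documents = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            documents.add("doc" + i);
        }
        List<List<String>> parts = split(documents, new double[] {0.9, 0.1, 0.0}, 42L);
        System.out.println("training:   " + parts.get(0).size());
        System.out.println("testing:    " + parts.get(1).size());
        System.out.println("validation: " + parts.get(2).size());
    }
}
```

Shuffling before cutting keeps the test portion from being biased by the order in which documents were loaded.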

jgs Evaluation (50)

jgs Evaluation (1000)

jgs Test Yourselves Text Mining

jgs Summary

jgs Homework
1. Make the source code for the BBC dataset work.
2. Create a dataset (your choice) with the content of:
   A) Papers you read
   B) Web pages or blogs you are interested in
   C) Course notes
   D) Etc.
3. Create a model. Justify your selection of the number of topics (k).
4. Try diverse configurations.
5. Describe your results: good, bad, expected, unexpected.
6. Submit a PDF file and the dataset (if possible); otherwise, submit at least a sample.

jgs Questions

jgs SER 594 Software Engineering for Machine Learning
Javier Gonzalez-Sanchez, Ph.D.
Spring 2022
Copyright. These slides can only be used as study material for the class CSE205 at Arizona State University. They cannot be distributed or used for another purpose.
