JGS594 Lecture 19

Software Engineering for Machine Learning
Text Mining III
(202203)

Javier Gonzalez-Sanchez

April 12, 2022

Transcript

  1. SER 594 Software Engineering for Machine Learning

    Lecture 19: Text Mining III
    Dr. Javier Gonzalez-Sanchez
    [email protected]
    javiergs.engineering.asu.edu | javiergs.com
    PERALTA 230U
    Office Hours: By appointment
  2. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 4

    Model

    import cc.mallet.topics.ParallelTopicModel;
    import cc.mallet.types.InstanceList;

    // Run every document through the preprocessing pipeline
    InstanceList instances = new InstanceList(pipeline);
    instances.addThruPipe(iterator);

    // topics, alpha, beta (MALLET takes alpha as alphaSum, the sum over all topics)
    ParallelTopicModel model = new ParallelTopicModel(5, 0.01, 0.01);
    model.addInstances(instances);
    model.setNumThreads(4);
    model.setNumIterations(1000); // 50 to 2000
    model.estimate();

    § A high alpha value means that each document is likely to contain a mixture of most of the topics, and not any single topic specifically.
    § A low alpha value means that a document is more likely to contain a mixture of just a few, or even only one, of the topics.
    § A high beta value means that a topic is likely to contain a mixture of most of the words.
    § A low beta value means that a topic may contain a mixture of just a few of the words.
    § Griffiths and Steyvers (2004) suggest a value of 50/#topics for alpha and 0.1 for beta.
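The Griffiths and Steyvers heuristic above can be worked out for a given number of topics. The following is a minimal sketch in plain Java with no MALLET dependency; the class and method names are hypothetical and exist only for illustration. It also shows the corresponding sum of alpha over all topics, which is the form MALLET's `ParallelTopicModel` constructor expects as its second argument.

```java
// Hypothetical helper (not part of MALLET): starting values for the LDA
// hyperparameters following the Griffiths and Steyvers (2004) heuristic.
final class LdaPriors {
    // Suggested beta from the heuristic
    static final double SUGGESTED_BETA = 0.1;

    // Per-topic alpha suggested by the heuristic: 50 / #topics
    static double perTopicAlpha(int numTopics) {
        if (numTopics <= 0) {
            throw new IllegalArgumentException("numTopics must be positive");
        }
        return 50.0 / numTopics;
    }

    // Sum of the per-topic alpha over all topics
    static double alphaSum(int numTopics) {
        return perTopicAlpha(numTopics) * numTopics;
    }

    public static void main(String[] args) {
        for (int k : new int[] {5, 10, 50}) {
            System.out.println("k=" + k
                    + " alpha=" + perTopicAlpha(k)
                    + " alphaSum=" + alphaSum(k)
                    + " beta=" + SUGGESTED_BETA);
        }
    }
}
```

Under this heuristic the per-topic alpha shrinks as the number of topics grows, while the sum stays at 50. These are only starting points; the values used in the slide's code (0.01 and 0.01) are a different configuration.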
  3. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 5

    Results

    LL/token: the model's log-likelihood divided by the total number of tokens.
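As a worked illustration of the LL/token figure, the sketch below divides a total log-likelihood by a token count. It is plain Java; the class name, method name, and the numbers in `main` are made up for the example and are not output from any real model.

```java
// Hypothetical helper: normalizes a log-likelihood by corpus size so that
// runs on corpora of different lengths can be read on the same scale.
final class TokenLikelihood {
    static double logLikelihoodPerToken(double logLikelihood, long totalTokens) {
        if (totalTokens <= 0) {
            throw new IllegalArgumentException("totalTokens must be positive");
        }
        return logLikelihood / totalTokens;
    }

    public static void main(String[] args) {
        // Made-up values for illustration only
        double logLikelihood = -1_250_000.0;
        long totalTokens = 150_000L;
        System.out.println("LL/token: "
                + logLikelihoodPerToken(logLikelihood, totalTokens));
    }
}
```

Because the log-likelihood of a corpus is negative, a value closer to zero indicates a better fit to the training data.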
  4. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 7

    Evaluation

    § Topic modeling is unsupervised, so there are no labels to score the model against.
    § Instead, estimate the model's ability to generalize its topics to new documents.
    § The likelihood of unseen documents can be used to compare models: higher likelihood implies a better model (Wallach et al., 2009).
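The comparison rule above (higher held-out likelihood implies a better model) can be sketched as a simple selection over candidate models. This is plain Java with hypothetical names; the scores in `main` are made up, and the sketch assumes every candidate was evaluated on the same held-out documents.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: picks the number of topics whose model gave the
// highest held-out log-likelihood.
final class ModelComparison {
    static int bestNumTopics(Map<Integer, Double> heldOutLogLikelihood) {
        if (heldOutLogLikelihood.isEmpty()) {
            throw new IllegalArgumentException("no candidate models");
        }
        Integer best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<Integer, Double> entry : heldOutLogLikelihood.entrySet()) {
            // Keep the first candidate, then any candidate with a higher score
            if (best == null || entry.getValue() > bestScore) {
                best = entry.getKey();
                bestScore = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up scores: number of topics -> held-out log-likelihood
        Map<Integer, Double> scores = new LinkedHashMap<>();
        scores.put(5, -84_000.0);
        scores.put(10, -81_500.0);
        scores.put(20, -82_300.0);
        System.out.println("Best k: " + bestNumTopics(scores));
    }
}
```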
  5. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 8

    Evaluation

    import cc.mallet.topics.MarginalProbEstimator;

    // Split dataset
    InstanceList[] instanceSplit = instances.split(
        new double[] {0.9, 0.1, 0.0} // 90% training, 10% testing, 0% validation
    );

    // Use the 90% split for training
    model.addInstances(instanceSplit[0]);
    model.setNumThreads(4);
    model.setNumIterations(50);
    model.estimate();

    // Get estimator
    MarginalProbEstimator estimator = model.getProbEstimator();
    double loglike = estimator.evaluateLeftToRight(
        instanceSplit[1], // test instances
        10,               // number of particles
        false,            // use resampling?
        null);            // PrintStream for per-document output (none)
    System.out.println("Total log likelihood: " + loglike);
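To make the 90%/10% split concrete without MALLET, here is a minimal sketch of a hold-out split over any list of documents. It is plain Java with hypothetical names, and it assumes the documents are shuffled with a fixed seed before cutting so that the test part is a random sample; it is not a description of how `InstanceList.split` is implemented.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper: shuffles a copy of the items and cuts it into a
// training part and a testing part.
final class HoldOutSplit {
    static <T> List<List<T>> split(List<T> items, double trainFraction, long seed) {
        if (trainFraction < 0.0 || trainFraction > 1.0) {
            throw new IllegalArgumentException("trainFraction must be in [0, 1]");
        }
        List<T> shuffled = new ArrayList<>(items);
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) Math.round(shuffled.size() * trainFraction);
        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, cut)));               // training
        parts.add(new ArrayList<>(shuffled.subList(cut, shuffled.size()))); // testing
        return parts;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            docs.add("doc-" + i);
        }
        List<List<String>> parts = split(docs, 0.9, 42L);
        System.out.println("train size: " + parts.get(0).size());
        System.out.println("test size: " + parts.get(1).size());
    }
}
```

Fixing the seed keeps the split reproducible, which matters when several models are compared on the same held-out documents.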
  6. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 13

    Homework

    1. Make the source code for the BBC dataset work.
    2. Create a dataset (your choice) with the content of:
       A) papers you read
       B) web pages or blogs you are interested in
       C) course notes
       D) etc.
    3. Create a model. Justify your choice of the number of topics (k).
    4. Try diverse configurations.
    5. Describe your results: good, bad, expected, unexpected.
    6. Submit a PDF file and the dataset (if possible); otherwise, submit at least a sample of it.
  7. SER 594 Software Engineering for Machine Learning

    Javier Gonzalez-Sanchez, Ph.D.
    [email protected]
    Spring 2022

    Copyright. These slides can only be used as study material for the class CSE205 at Arizona State University. They cannot be distributed or used for another purpose.