JGS594 Lecture 19

Software Engineering for Machine Learning
Text Mining III
(202203)

Javier Gonzalez

April 12, 2022

Transcript

  1. SER 594 Software Engineering for Machine Learning
     Lecture 19: Text Mining III
     Dr. Javier Gonzalez-Sanchez
     javiergs@asu.edu
     javiergs.engineering.asu.edu | javiergs.com
     PERALTA 230U
     Office Hours: By appointment
  2. Previously … MALLET

  3. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 3

    Data and Pipeline
  4. Model

    InstanceList instances = new InstanceList(pipeline);
    instances.addThruPipe(iterator);

    - A high alpha value means that each document is likely to contain a mixture of most of the topics, rather than any single topic.
    - A low alpha value means that a document is more likely to contain a mixture of just a few of the topics, or even only one.
    - A high beta value means that a topic is likely to contain a mixture of most of the words.
    - A low beta value means that a topic is likely to contain a mixture of just a few of the words.
    - Griffiths and Steyvers (2004) suggest a value of 50/#topics for alpha and 0.1 for beta.

    // Arguments: number of topics, alphaSum (the sum of alpha over all topics), beta
    ParallelTopicModel model = new ParallelTopicModel(5, 0.01, 0.01);
    model.addInstances(instances);
    model.setNumThreads(4);
    model.setNumIterations(1000); // 50 to 2000
    model.estimate();
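The Griffiths and Steyvers (2004) suggestion can be sketched in plain Java as below. This is only an illustration: the class and method names (`LdaHyperparameters`, `alphaPerTopic`, `alphaSum`) are hypothetical and not part of MALLET. One detail worth keeping in mind is that the second argument of MALLET's `ParallelTopicModel` constructor is the sum of alpha over all topics, so a per-topic alpha of 50/k corresponds to an alpha sum of 50.

```java
public class LdaHyperparameters {

    // Beta value suggested by Griffiths and Steyvers (2004).
    public static final double SUGGESTED_BETA = 0.1;

    /** Per-topic alpha under the 50/#topics heuristic. */
    public static double alphaPerTopic(int numTopics) {
        if (numTopics <= 0) {
            throw new IllegalArgumentException("numTopics must be positive");
        }
        return 50.0 / numTopics;
    }

    /** Sum of alpha over all topics (the quantity a constructor taking alphaSum expects). */
    public static double alphaSum(int numTopics) {
        return alphaPerTopic(numTopics) * numTopics;
    }

    public static void main(String[] args) {
        int k = 5; // number of topics, matching the slide
        System.out.println("topics = " + k);
        System.out.println("alpha per topic = " + alphaPerTopic(k));
        System.out.println("alpha sum = " + alphaSum(k));
        System.out.println("beta = " + SUGGESTED_BETA);
    }
}
```

Under this heuristic the per-topic alpha shrinks as the number of topics grows, which pushes each document toward fewer topics when k is large.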
  5. Results

    LL/token: the model's log-likelihood divided by the total number of tokens.
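As a rough illustration of the LL/token figure, the sketch below divides a total log-likelihood by a corpus token count. The class name `LogLikelihoodPerToken`, the document lengths, and the log-likelihood value are all made up for the example; they are not taken from the lecture's results.

```java
public class LogLikelihoodPerToken {

    /** Sums the document lengths to get the total number of tokens in the corpus. */
    public static long countTokens(int[] documentLengths) {
        long total = 0;
        for (int length : documentLengths) {
            total += length;
        }
        return total;
    }

    /** Divides a total log-likelihood by the number of tokens it was computed over. */
    public static double perToken(double totalLogLikelihood, long totalTokens) {
        if (totalTokens <= 0) {
            throw new IllegalArgumentException("totalTokens must be positive");
        }
        return totalLogLikelihood / totalTokens;
    }

    public static void main(String[] args) {
        int[] documentLengths = {120, 80, 200}; // hypothetical token counts per document
        double totalLogLikelihood = -3200.0;    // hypothetical model log-likelihood
        long tokens = countTokens(documentLengths);
        System.out.println("LL/token: " + perToken(totalLogLikelihood, tokens)); // prints "LL/token: -8.0"
    }
}
```

Normalizing by token count makes values comparable across corpora of different sizes; values closer to zero indicate a better fit.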
  6. Evaluation Text Mining

  7. Evaluation

    - Topic models are unsupervised, so there are no ground-truth labels to score against.
    - Instead, estimate the model's ability to generalize its topics to new documents.
    - The likelihood of unseen documents can be used to compare models: higher likelihood implies a better model (Wallach et al., 2009).
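The comparison rule above can be sketched as picking the candidate with the highest held-out log-likelihood. This is a minimal sketch, assuming every candidate was scored on the same unseen documents; the class name `HeldOutComparison`, the labels, and the scores are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HeldOutComparison {

    /** Returns the label of the candidate with the highest held-out log-likelihood. */
    public static String best(Map<String, Double> heldOutLogLikelihood) {
        String bestLabel = null;
        double bestValue = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> entry : heldOutLogLikelihood.entrySet()) {
            // Log-likelihoods are negative; the one closest to zero is the highest.
            if (bestLabel == null || entry.getValue() > bestValue) {
                bestLabel = entry.getKey();
                bestValue = entry.getValue();
            }
        }
        if (bestLabel == null) {
            throw new IllegalArgumentException("no candidates to compare");
        }
        return bestLabel;
    }

    public static void main(String[] args) {
        // Hypothetical scores for three topic counts, all on the same test documents.
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("k=5", -5200.0);
        scores.put("k=10", -4900.0);
        scores.put("k=20", -5050.0);
        System.out.println("Best model: " + best(scores)); // prints "Best model: k=10"
    }
}
```

Scores computed on different test sets are not directly comparable, which is why the sketch assumes a shared held-out set.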
  8. Evaluation

    // Split the dataset
    InstanceList[] instanceSplit = instances.split(
        new double[] {0.9, 0.1, 0.0} // 90% training, 10% testing, 0% validation
    );

    // Use the first 90% for training
    model.addInstances(instanceSplit[0]);
    model.setNumThreads(4);
    model.setNumIterations(50);
    model.estimate();

    // Get the estimator and score the held-out 10%
    MarginalProbEstimator estimator = model.getProbEstimator();
    double loglike = estimator.evaluateLeftToRight(
        instanceSplit[1], // test instances
        10,               // number of particles
        false,            // whether to use resampling
        null);            // PrintStream for per-document output (null for none)
    System.out.println("Total log likelihood: " + loglike);
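To make the proportional split concrete, here is a plain-Java sketch of shuffling a list and cutting it by proportions. It is not MALLET's implementation of `InstanceList.split`; the class `ProportionalSplit`, its `split` method, and the fixed seed are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class ProportionalSplit {

    /** Shuffles a copy of the items and cuts it into parts sized by the given proportions. */
    public static <T> List<List<T>> split(List<T> items, double[] proportions, long seed) {
        double total = 0.0;
        for (double p : proportions) {
            if (p < 0) {
                throw new IllegalArgumentException("proportions must be non-negative");
            }
            total += p;
        }
        if (total <= 0) {
            throw new IllegalArgumentException("proportions must sum to a positive value");
        }

        List<T> shuffled = new ArrayList<>(items);
        Collections.shuffle(shuffled, new Random(seed)); // fixed seed keeps the split reproducible

        List<List<T>> parts = new ArrayList<>();
        int start = 0;
        double cumulative = 0.0;
        for (int i = 0; i < proportions.length; i++) {
            cumulative += proportions[i];
            // The last part always ends at the list size so that no item is dropped.
            int end = (i == proportions.length - 1)
                    ? shuffled.size()
                    : (int) Math.round(shuffled.size() * cumulative / total);
            parts.add(new ArrayList<>(shuffled.subList(start, end)));
            start = end;
        }
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> documents = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            documents.add(i); // stand-ins for 100 documents
        }
        List<List<Integer>> parts = split(documents, new double[] {0.9, 0.1, 0.0}, 42L);
        System.out.println("training: " + parts.get(0).size());
        System.out.println("testing: " + parts.get(1).size());
        System.out.println("validation: " + parts.get(2).size());
    }
}
```

Shuffling before cutting matters when the corpus is ordered (for example by category), since an unshuffled 90/10 cut could leave whole topics out of the training portion.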
  9. Evaluation (50)
  10. Evaluation (1000)
  11. Test Yourselves Text Mining
  12. Summary
  13. Homework

    1. Make the source code for the BBC dataset work.
    2. Create a dataset (your choice) with the content of:
       A) Papers you read
       B) Web pages or blogs you are interested in
       C) Course notes
       D) Etc.
    3. Create a model. Justify your choice of the number of topics (k).
    4. Try diverse configurations.
    5. Describe your results: good, bad, expected, unexpected.
    6. Submit a PDF file and the dataset if possible; otherwise, submit at least a sample of it.
  14. Questions
  15. SER 594 Software Engineering for Machine Learning
      Javier Gonzalez-Sanchez, Ph.D.
      javiergs@asu.edu
      Spring 2022
      Copyright. These slides can only be used as study material for the class CSE205 at Arizona State University. They cannot be distributed or used for another purpose.