JGS594 Lecture 18

Software Engineering for Machine Learning
Text Mining II
(202203)

Javier Gonzalez-Sanchez

April 06, 2022

Transcript

  1. jgs SER 594 Software Engineering for Machine Learning

    Lecture 18: Text Mining II
    Dr. Javier Gonzalez-Sanchez
    javiergs@asu.edu
    javiergs.engineering.asu.edu | javiergs.com
    PERALTA 230U
    Office Hours: By appointment
  2. Previously … MALLET

  3. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 3

    Import Data

        // Main.java
        import cc.mallet.pipe.iterator.FileIterator;
        import java.io.File;

        public class Main {
            public static void main(String[] args) {
                File[] directories = new File[1];
                directories[0] = new File("/src/data/");
                // label each file with the name of the directory that contains it
                FileIterator iterator = new FileIterator(
                    directories,
                    new TextFileFilter(),
                    FileIterator.LAST_DIRECTORY
                );
            }
        }

        // TextFileFilter.java
        import java.io.File;
        import java.io.FileFilter;

        // accepts only files whose path ends in ".txt"
        public class TextFileFilter implements FileFilter {
            @Override
            public boolean accept(File file) {
                return file.toString().endsWith(".txt");
            }
        }
  4. Code
  5. Pre-Processing

    § Pipeline: a modification of the Chain of Responsibility pattern

        import cc.mallet.pipe.*;
        import java.util.ArrayList;
        import java.util.regex.Pattern;

        // example
        ArrayList<Pipe> pipeList = new ArrayList<Pipe>();

        // read each input as UTF-8 text, then lowercase it
        pipeList.add(new Input2CharSequence("UTF-8"));
        pipeList.add(new CharSequenceLowercase());

        // tokenize
        Pattern tokenPattern = Pattern.compile("[\\p{L}\\p{N}_]+");
        pipeList.add(new CharSequence2TokenSequence(tokenPattern));

        // remove stop words using the standard English stop list
        // (false = not case sensitive); this pipe works on tokens,
        // so it has to come after the tokenizer
        pipeList.add(new TokenSequenceRemoveStopwords(false));

        // convert tokens to the feature sequence that the topic model expects
        pipeList.add(new TokenSequence2FeatureSequence());

        SerialPipes pipeline = new SerialPipes(pipeList);
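The pipeline idea above can be sketched without MALLET, using only the Java standard library. This is a minimal illustration, not MALLET's implementation: the class `PipelineSketch`, its method names, and the four-word stop list are all hypothetical. Each stage consumes the output of the previous one, and, unlike a classic Chain of Responsibility, every stage always runs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical illustration only: none of these names belong to MALLET.
class PipelineSketch {

    // Same token regex as the slides: runs of letters, digits, or underscores.
    private static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}_]+");

    // Stage 1: lowercase the raw text.
    static String lowercase(String text) {
        return text.toLowerCase(Locale.ROOT);
    }

    // Stage 2: split the text into tokens.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    // Stage 3: drop tokens that appear in a tiny, made-up stop list.
    static List<String> removeStopwords(List<String> tokens) {
        Set<String> stop = Set.of("the", "a", "of", "and");
        return tokens.stream()
                .filter(token -> !stop.contains(token))
                .collect(Collectors.toList());
    }

    // Compose the stages in order: lowercase, tokenize, remove stop words.
    static List<String> run(String text) {
        Function<String, String> lower = PipelineSketch::lowercase;
        Function<String, List<String>> tokens = lower.andThen(PipelineSketch::tokenize);
        Function<String, List<String>> pipeline = tokens.andThen(PipelineSketch::removeStopwords);
        return pipeline.apply(text);
    }

    public static void main(String[] args) {
        System.out.println(run("The Mining of Text and Topics"));
    }
}
```

The order matters: stop-word removal operates on tokens, so it can only run after tokenization, and lowercasing first lets a lowercase stop list match capitalized words.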
  6. java.util.regex.Pattern

        Pattern pattern = Pattern.compile("[\\p{L}\\p{N}_]+");
        Matcher matcher = pattern.matcher(inputString);
        boolean isOK = matcher.matches(); // true only if the whole input matches
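A small sketch of how this pattern behaves, using only `java.util.regex`. The class name `TokenPatternDemo` and the sample strings are made up for illustration. It contrasts `matches()`, which tests the whole input, with repeated `find()`, which extracts successive tokens, the behavior a tokenizer needs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only java.util.regex is used.
class TokenPatternDemo {

    // One or more Unicode letters, Unicode digits, or underscores.
    static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}_]+");

    // matches() succeeds only when the entire input is a single token.
    static boolean isSingleToken(String input) {
        return TOKEN.matcher(input).matches();
    }

    // find() scans forward and returns each token in turn.
    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(input);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(isSingleToken("topic_42"));    // true
        System.out.println(isSingleToken("topic model")); // false: the space is not in the class
        System.out.println(tokens("topic model, v2!"));   // [topic, model, v2]
    }
}
```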
  7. Mallet API Text Mining

  8. Code
  9. Model

        InstanceList instances = new InstanceList(pipeline);
        instances.addThruPipe(iterator);

    Parameters: topics, alpha, beta

    § A high alpha value means that each document is likely to contain a mixture of most of the topics, not any single topic specifically.
    § A low alpha value means that a document is more likely to contain a mixture of just a few, or even only one, of the topics.
    § A high beta value means that a topic is likely to contain a mixture of most of the words.
    § A low beta value means that a topic may contain a mixture of just a few of the words.
    § Griffiths and Steyvers (2004) suggest a value of 50/#topics for alpha and 0.1 for beta.

        // arguments: number of topics, alphaSum (alpha summed over all topics), beta
        ParallelTopicModel model = new ParallelTopicModel(5, 0.01, 0.01);
        model.addInstances(instances);
        model.setNumThreads(4);
        model.setNumIterations(1000); // 50 to 2000
        model.estimate();
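The Griffiths and Steyvers suggestion is simple arithmetic, sketched below in plain Java. The class and method names are hypothetical. One assumption worth checking against the MALLET Javadoc: the second constructor argument of `ParallelTopicModel` is the sum of alpha over all topics, so a per-topic alpha of 50/#topics would correspond to an alpha sum of 50.

```java
// Hypothetical helper; it computes the suggested values and does not call MALLET.
class LdaHyperparameters {

    // Suggested beta from Griffiths and Steyvers (2004).
    static final double SUGGESTED_BETA = 0.1;

    // Suggested per-topic alpha: 50 divided by the number of topics.
    static double suggestedAlpha(int numTopics) {
        if (numTopics <= 0) {
            throw new IllegalArgumentException("numTopics must be positive");
        }
        return 50.0 / numTopics;
    }

    // Alpha summed over all topics (assumed to be what MALLET calls alphaSum).
    static double alphaSum(int numTopics) {
        return suggestedAlpha(numTopics) * numTopics;
    }

    public static void main(String[] args) {
        int numTopics = 5; // same topic count as the slide
        System.out.println("alpha per topic = " + suggestedAlpha(numTopics));
        System.out.println("alpha sum       = " + alphaSum(numTopics));
        System.out.println("beta            = " + SUGGESTED_BETA);
    }
}
```

With 5 topics this gives a per-topic alpha of 10, which is far from the 0.01 used on the slide; the slide's values favor documents dominated by very few topics.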
  10. Results

    LL/token: the model's log-likelihood divided by the total number of tokens.
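As a worked example of the LL/token figure, the sketch below divides a total log-likelihood by a token count and also derives perplexity, a common companion metric defined as exp(-LL/token). The class name and the input numbers are placeholders invented for illustration, not results from the lecture, and the natural logarithm is assumed.

```java
// Hypothetical helper; the numbers in main are placeholders, not real results.
class LogLikelihoodPerToken {

    // Normalizes a total log-likelihood by the number of tokens.
    static double perToken(double totalLogLikelihood, long totalTokens) {
        if (totalTokens <= 0) {
            throw new IllegalArgumentException("totalTokens must be positive");
        }
        return totalLogLikelihood / totalTokens;
    }

    // Perplexity from LL/token, assuming natural-log likelihoods.
    static double perplexity(double logLikelihoodPerToken) {
        return Math.exp(-logLikelihoodPerToken);
    }

    public static void main(String[] args) {
        double totalLogLikelihood = -8000.0; // placeholder value
        long totalTokens = 1000;             // placeholder value
        double llPerToken = perToken(totalLogLikelihood, totalTokens);
        System.out.println("LL/token   = " + llPerToken);
        System.out.println("perplexity = " + perplexity(llPerToken));
    }
}
```

Normalizing by token count makes runs comparable across corpora of different sizes; values closer to zero indicate a better fit.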
  11. Remove Words

    § Pipeline: a modification of the Chain of Responsibility pattern

        // example
        ArrayList<Pipe> pipeList = new ArrayList<Pipe>();

        // tokenize
        Pattern tokenPattern = Pattern.compile("[\\p{L}\\p{N}_]+");
        pipeList.add(new CharSequence2TokenSequence(tokenPattern));

        // remove the stop words listed in a custom file;
        // this pipe works on tokens, so it has to come after the tokenizer
        pipeList.add(
            new TokenSequenceRemoveStopwords(
                new File(stopListFilePath), // stop list file
                "utf-8",                    // encoding
                false,                      // do not include the default Mallet list
                false,                      // not case sensitive
                false                       // do not mark deletions
            )
        );
        // pipeList.add(new TokenSequenceRemoveStopwords(false));

        SerialPipes pipeline = new SerialPipes(pipeList);
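What a custom stop list does can be sketched in plain Java. This is an illustration under stated assumptions, not MALLET's code: the class `StopListSketch` is hypothetical, and the stop list is assumed to be whitespace-separated words. Matching is case-insensitive, mirroring the `false` passed for case sensitivity above.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration; none of these names belong to MALLET.
class StopListSketch {

    // Parses whitespace-separated stop words and stores them in lowercase.
    static Set<String> parseStopList(String contents) {
        Set<String> stopWords = new HashSet<>();
        for (String word : contents.split("\\s+")) {
            if (!word.isEmpty()) {
                stopWords.add(word.toLowerCase(Locale.ROOT));
            }
        }
        return stopWords;
    }

    // Keeps only the tokens whose lowercase form is not a stop word.
    static List<String> filter(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopWords = parseStopList("the\nof  and\n");
        System.out.println(filter(List.of("The", "topic", "of", "Models"), stopWords));
    }
}
```

A domain-specific list is useful when frequent but uninformative words (for example, a course name that appears in every document) dominate the topics.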
  12. Evaluation Text Mining

  13. Evaluation

    § Topic modeling is unsupervised, so there are no labels to score against.
    § Estimate the model's ability to generalize its topics to unseen documents.
    § The likelihood of unseen documents can be used to compare models: higher likelihood implies a better model (Wallach et al., 2009).
  14. Evaluation

        // Split dataset
        InstanceList[] instanceSplit = instances.split(
            new double[] {0.9, 0.1, 0.0} // 90% training, 10% testing, 0% for a third (unused) split
        );

        // Use the 90% split for training
        model.addInstances(instanceSplit[0]);
        model.setNumThreads(4);
        model.setNumIterations(50);
        model.estimate();

        // Get estimator
        MarginalProbEstimator estimator = model.getProbEstimator();
        double loglike = estimator.evaluateLeftToRight(
            instanceSplit[1], // test instances
            10,               // number of particles
            false,            // do not use resampling
            null);            // PrintStream for per-document output (null = none)
        System.out.println("Total log likelihood: " + loglike);
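The hold-out idea behind this evaluation can be sketched with the standard library alone. The class `HoldOutSplit` and its parameters are hypothetical, and this is not how MALLET implements `split`; it only shows a shuffled 90/10 partition with a fixed seed so that the split is reproducible.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration of a train/test split; not MALLET code.
class HoldOutSplit {

    // Returns two lists: index 0 is the training part, index 1 the testing part.
    static <T> List<List<T>> split(List<T> items, double trainFraction, long seed) {
        if (trainFraction < 0.0 || trainFraction > 1.0) {
            throw new IllegalArgumentException("trainFraction must be in [0, 1]");
        }
        // Shuffle a copy so the caller's list is left untouched.
        List<T> shuffled = new ArrayList<>(items);
        Collections.shuffle(shuffled, new Random(seed));

        int cut = (int) Math.round(shuffled.size() * trainFraction);
        List<T> training = new ArrayList<>(shuffled.subList(0, cut));
        List<T> testing = new ArrayList<>(shuffled.subList(cut, shuffled.size()));
        return List.of(training, testing);
    }

    public static void main(String[] args) {
        List<Integer> documents = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            documents.add(i); // stand-ins for document ids
        }
        List<List<Integer>> parts = split(documents, 0.9, 42L);
        System.out.println("training size = " + parts.get(0).size());
        System.out.println("testing size  = " + parts.get(1).size());
    }
}
```

The model is then trained only on the training part and scored on the testing part, so the reported likelihood reflects documents the model has never seen.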
  15. Evaluation (50)
  16. Evaluation (1000)
  17. Summary Text Mining

  18. Summary
  19. Questions
  20. SER 594 Software Engineering for Machine Learning

    Javier Gonzalez-Sanchez, Ph.D.
    javiergs@asu.edu
    Spring 2022
    Copyright. These slides can only be used as study material for the class CSE205 at Arizona State University. They cannot be distributed or used for another purpose.