JGS594 Lecture 18

Software Engineering for Machine Learning
Text Mining II
(202203)

Javier Gonzalez-Sanchez

April 06, 2022

Transcript

  1. SER 594 Software Engineering for Machine Learning, Lecture 18: Text Mining II

    Dr. Javier Gonzalez-Sanchez | [email protected] | javiergs.engineering.asu.edu | javiergs.com | PERALTA 230U | Office Hours: By appointment
  2. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 3

    Import Data

        import cc.mallet.pipe.iterator.FileIterator;
        import java.io.File;

        public class Main {
            public static void main(String[] args) {
                // directories to scan for documents
                File[] directories = new File[1];
                directories[0] = new File("/src/data/");
                // iterate over the .txt files; the last directory name is used as the label
                FileIterator iterator = new FileIterator(
                    directories,
                    new TextFileFilter(),
                    FileIterator.LAST_DIRECTORY
                );
            }
        }

        import java.io.File;
        import java.io.FileFilter;

        public class TextFileFilter implements FileFilter {
            @Override
            public boolean accept(File file) {
                return file.toString().endsWith(".txt");
            }
        }
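The file filter on this slide can be tried without MALLET. The following is a minimal sketch using only the JDK; the class name `TextFileFilterDemo`, the helper `listTextFiles`, and the temporary corpus directory are assumptions made for illustration and are not part of the lecture code.

```java
import java.io.File;
import java.io.FileFilter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

class TextFileFilterDemo {

    // Same rule as the slide's TextFileFilter, written as a lambda.
    static final FileFilter TXT_ONLY = file -> file.getName().endsWith(".txt");

    // Names of the entries directly inside dir that pass the filter, sorted.
    static String[] listTextFiles(File dir) {
        File[] matches = dir.listFiles(TXT_ONLY);
        if (matches == null) {
            return new String[0]; // dir is missing or is not a directory
        }
        String[] names = new String[matches.length];
        for (int i = 0; i < matches.length; i++) {
            names[i] = matches[i].getName();
        }
        Arrays.sort(names);
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical throwaway corpus, created only for this demonstration.
        Path dir = Files.createTempDirectory("corpus");
        Files.write(dir.resolve("a.txt"), "first document".getBytes(StandardCharsets.UTF_8));
        Files.write(dir.resolve("b.txt"), "second document".getBytes(StandardCharsets.UTF_8));
        Files.write(dir.resolve("notes.md"), "ignored".getBytes(StandardCharsets.UTF_8));

        System.out.println(Arrays.toString(listTextFiles(dir.toFile())));
        // prints [a.txt, b.txt]
    }
}
```

Because `FileFilter` has a single method, a lambda is enough; a named class such as the slide's `TextFileFilter` behaves the same way.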
  3. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 5

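    Pre-Processing

    § Pipeline: a modification of the Chain of Responsibility pattern

        // example
        import cc.mallet.pipe.*;
        import java.util.ArrayList;
        import java.util.regex.Pattern;

        ArrayList<Pipe> pipeList = new ArrayList<Pipe>();

        // read the input as UTF-8 text, then lowercase it
        pipeList.add(new Input2CharSequence("UTF-8"));
        pipeList.add(new CharSequenceLowercase());

        // tokenize: runs of letters, digits, or underscores
        Pattern tokenPattern = Pattern.compile("[\\p{L}\\p{N}_]+");
        pipeList.add(new CharSequence2TokenSequence(tokenPattern));

        // remove stop words using the standard English stop list
        // (false = not case sensitive); this pipe works on tokens,
        // so it has to come after tokenization
        pipeList.add(new TokenSequenceRemoveStopwords(false));

        // map tokens to feature indices, the input the topic model expects
        pipeList.add(new TokenSequence2FeatureSequence());

        SerialPipes pipeline = new SerialPipes(pipeList);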
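The pipeline idea (each stage consumes the previous stage's output) can be sketched in plain Java, without MALLET. This is only an illustration under stated assumptions: the class name `SerialPipelineSketch`, the three stage names, and the four-word stop list are hypothetical. The sketch tokenizes before filtering, because stop-word removal compares whole tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SerialPipelineSketch {

    // Hypothetical, tiny stop list used only for this illustration.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "of", "is"));

    // Same token pattern as on the slide.
    static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}_]+");

    // Stage 1: lowercase the raw text.
    static final Function<String, String> LOWERCASE =
            text -> text.toLowerCase(Locale.ROOT);

    // Stage 2: split the text into tokens.
    static final Function<String, List<String>> TOKENIZE = text -> {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    };

    // Stage 3: drop tokens that appear in the stop list.
    static final Function<List<String>, List<String>> REMOVE_STOP_WORDS = tokens -> {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    };

    // Stages composed in order: each output is the next stage's input.
    static final Function<String, List<String>> PIPELINE =
            LOWERCASE.andThen(TOKENIZE).andThen(REMOVE_STOP_WORDS);

    public static void main(String[] args) {
        System.out.println(PIPELINE.apply("The Model of a Topic is Latent"));
        // prints [model, topic, latent]
    }
}
```

Unlike a classic Chain of Responsibility, where a request stops at the first handler that accepts it, every stage here always runs and transforms the data, which is presumably why the slide calls the pipeline a modification of that pattern.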
  4. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 6

    java.util.regex.Pattern

        Pattern pattern = Pattern.compile("[\\p{L}\\p{N}_]+");
        Matcher matcher = pattern.matcher(inputString);
        boolean isOK = matcher.matches(); // true only if the whole input matches
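To see what this pattern accepts, it helps to contrast `matches()`, which tests the whole input, with `find()`, which scans for successive tokens. A minimal sketch follows; the class and method names (`TokenPatternDemo`, `isSingleToken`, `tokens`) are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenPatternDemo {

    // \p{L} = any Unicode letter, \p{N} = any Unicode number, plus underscore.
    static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}_]+");

    // matches(): true only when the entire input is one token.
    static boolean isSingleToken(String input) {
        return TOKEN.matcher(input).matches();
    }

    // find(): collects every token found while scanning the input.
    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(input);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(isSingleToken("topic_42"));       // prints true
        System.out.println(isSingleToken("topic model"));    // prints false (space is not allowed)
        System.out.println(tokens("LDA, topic-model 2022!")); // prints [LDA, topic, model, 2022]
    }
}
```

Punctuation and whitespace are not in the character class, so they act as token separators when scanning with `find()`.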
  5. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 9

    Model

        InstanceList instances = new InstanceList(pipeline);
        instances.addThruPipe(iterator);

    Parameters: topics, alpha, beta

    § A high alpha value means that each document is likely to contain a mixture of most of the topics, and not any single topic specifically.
    § A low alpha value means that a document is more likely to contain a mixture of just a few, or even only one, of the topics.
    § A high beta value means that a topic is likely to contain a mixture of most of the words.
    § A low beta value means that a topic may contain a mixture of just a few of the words.
    § Griffiths and Steyvers (2004) suggest a value of 50/#topics for alpha and 0.1 for beta.

        // arguments: number of topics, alphaSum (alpha summed over all topics), beta
        ParallelTopicModel model = new ParallelTopicModel(5, 0.01, 0.01);
        model.addInstances(instances);
        model.setNumThreads(4);
        model.setNumIterations(1000); // 50 to 2000
        model.estimate();
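The Griffiths and Steyvers suggestion is simple arithmetic, sketched below in plain Java. One detail worth checking against the MALLET documentation: to my understanding, the second constructor argument of `ParallelTopicModel` is the sum of alpha over all topics rather than the per-topic value, so the sketch converts between the two. The class and method names (`LdaPriors`, `suggestedAlpha`, `alphaSum`, `alphaPerTopic`) are hypothetical.

```java
class LdaPriors {

    // Beta suggested by Griffiths and Steyvers (2004).
    static final double SUGGESTED_BETA = 0.1;

    // Per-topic alpha suggested by Griffiths and Steyvers (2004): 50 / #topics.
    static double suggestedAlpha(int numTopics) {
        requirePositive(numTopics);
        return 50.0 / numTopics;
    }

    // Sum of a symmetric alpha over all topics.
    static double alphaSum(double alphaPerTopic, int numTopics) {
        requirePositive(numTopics);
        return alphaPerTopic * numTopics;
    }

    // Per-topic alpha implied by a given sum over all topics.
    static double alphaPerTopic(double alphaSum, int numTopics) {
        requirePositive(numTopics);
        return alphaSum / numTopics;
    }

    private static void requirePositive(int numTopics) {
        if (numTopics <= 0) {
            throw new IllegalArgumentException("numTopics must be positive");
        }
    }

    public static void main(String[] args) {
        int numTopics = 5; // same number of topics as on the slide
        double alpha = suggestedAlpha(numTopics);
        System.out.println("alpha per topic = " + alpha);                      // 10.0
        System.out.println("alpha sum       = " + alphaSum(alpha, numTopics)); // 50.0
        System.out.println("beta            = " + SUGGESTED_BETA);             // 0.1
        // Per-topic alpha implied by the slide's value 0.01, if read as a sum.
        System.out.println("slide alpha per topic = " + alphaPerTopic(0.01, numTopics));
    }
}
```

Under that reading, the suggested setting for five topics would be a sum of 50 and a beta of 0.1, whereas the slide's 0.01 corresponds to a much sparser prior (few topics per document).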
  6. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 10

    Results

    LL/token: the model's log-likelihood divided by the total number of tokens.
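The LL/token figure is a plain ratio. A minimal sketch of the computation, with made-up numbers (the log-likelihood and token count below are hypothetical, not results from the lecture):

```java
class LogLikelihoodPerToken {

    // LL/token = total log-likelihood of the corpus / total number of tokens.
    static double perToken(double totalLogLikelihood, long totalTokens) {
        if (totalTokens <= 0) {
            throw new IllegalArgumentException("totalTokens must be positive");
        }
        return totalLogLikelihood / totalTokens;
    }

    public static void main(String[] args) {
        double totalLogLikelihood = -84000.0; // hypothetical value
        long totalTokens = 10000;             // hypothetical value
        System.out.println("LL/token = " + perToken(totalLogLikelihood, totalTokens));
        // prints LL/token = -8.4
    }
}
```

Log-likelihoods of text are negative, so a value closer to zero indicates a better fit; dividing by the token count makes runs on corpora of different sizes easier to compare.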
  7. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 11

    Remove Words

    § Pipeline: a modification of the Chain of Responsibility pattern

        // example
        ArrayList<Pipe> pipeList = new ArrayList<Pipe>();

        // tokenize first: the stop-word pipe works on tokens
        Pattern tokenPattern = Pattern.compile("[\\p{L}\\p{N}_]+");
        pipeList.add(new CharSequence2TokenSequence(tokenPattern));

        // remove the stop words listed in a custom stop list file
        pipeList.add(
            new TokenSequenceRemoveStopwords(
                new File(stopListFilePath), // file
                "utf-8",                    // encoding
                false,                      // include the default Mallet list
                false,                      // case sensitive
                false                       // mark deletions
            )
        );
        // alternative: the standard English stop list, not case sensitive
        // pipeList.add(new TokenSequenceRemoveStopwords(false));

        SerialPipes pipeline = new SerialPipes(pipeList);
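What a custom stop list does can be sketched without MALLET. The example below assumes a stop list with one word per line and matches case-insensitively, mirroring the `false` passed for "case sensitive" on the slide; the class name, method names, and the sample words are hypothetical.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

class CustomStopListSketch {

    // Reads one stop word per line, skipping blank lines; words are
    // lowercased so that matching can ignore case.
    static Set<String> loadStopWords(Reader source) {
        return new BufferedReader(source).lines()
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .map(line -> line.toLowerCase(Locale.ROOT))
                .collect(Collectors.toSet());
    }

    // Keeps only the tokens whose lowercase form is not in the stop list.
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        return tokens.stream()
                .filter(token -> !stopWords.contains(token.toLowerCase(Locale.ROOT)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical domain-specific stop list; a FileReader would be used for a real file.
        Set<String> stopWords = loadStopWords(new StringReader("lecture\nslide\n\nASU\n"));
        List<String> tokens = Arrays.asList("Lecture", "topic", "slide", "asu", "model");
        System.out.println(removeStopWords(tokens, stopWords));
        // prints [topic, model]
    }
}
```

A custom list like this is useful when words that are frequent in one corpus (course names, boilerplate terms) would otherwise dominate every topic.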
  8. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 13

    Evaluation

    § Topic models are unsupervised, so there are no labels to score against.
    § Estimate the model's ability to generalize topics.
    § The likelihood of unseen documents can be used to compare models; a higher likelihood implies a better model (Wallach et al., 2009).
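The comparison rule in the last bullet can be sketched as a small selection step: score each candidate model on the same unseen documents and keep the one with the highest log-likelihood. The scores below are hypothetical placeholders, not measurements, and the class and method names are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class HeldOutModelSelection {

    // Returns the topic count whose held-out log-likelihood is highest.
    static int bestTopicCount(Map<Integer, Double> heldOutLogLikelihoodByTopicCount) {
        if (heldOutLogLikelihoodByTopicCount.isEmpty()) {
            throw new IllegalArgumentException("no candidate models given");
        }
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<Integer, Double> entry : heldOutLogLikelihoodByTopicCount.entrySet()) {
            if (entry.getValue() > bestScore) {
                bestScore = entry.getValue();
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical held-out log-likelihoods for models with 5, 10, and 20 topics.
        Map<Integer, Double> scores = new LinkedHashMap<>();
        scores.put(5, -9100.0);
        scores.put(10, -8700.0);
        scores.put(20, -8900.0);
        System.out.println("Best number of topics: " + bestTopicCount(scores));
        // prints Best number of topics: 10
    }
}
```

The comparison is only meaningful when every candidate is scored on the same held-out documents.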
  9. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 14

    Evaluation

        // Split dataset
        InstanceList[] instanceSplit = instances.split(
            new double[] {0.9, 0.1, 0.0} // 90% training, 10% testing, 0% validation
        );

        // Use the 90% portion for training
        model.addInstances(instanceSplit[0]);
        model.setNumThreads(4);
        model.setNumIterations(50);
        model.estimate();

        // Get estimator
        MarginalProbEstimator estimator = model.getProbEstimator();
        double loglike = estimator.evaluateLeftToRight(
            instanceSplit[1], // test instances
            10,               // number of particles
            false,            // use resampling
            null              // PrintStream for per-document output (none)
        );
        System.out.println("Total log likelihood: " + loglike);
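The hold-out split itself is independent of MALLET. Below is a minimal plain-Java sketch of a shuffled 90/10 split; the class name, the `split` helper, the seed, and the document names are assumptions made for illustration, and the rounding rule for the cut point is a choice of this sketch rather than a description of MALLET's behavior.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class HoldOutSplitSketch {

    // Shuffles a copy of items, then returns [training, testing], where
    // training holds roughly trainFraction of the items.
    static <T> List<List<T>> split(List<T> items, double trainFraction, long seed) {
        if (trainFraction < 0.0 || trainFraction > 1.0) {
            throw new IllegalArgumentException("trainFraction must be in [0, 1]");
        }
        List<T> shuffled = new ArrayList<>(items);
        Collections.shuffle(shuffled, new Random(seed)); // fixed seed keeps the split repeatable
        int cut = (int) Math.round(shuffled.size() * trainFraction);

        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, cut)));
        parts.add(new ArrayList<>(shuffled.subList(cut, shuffled.size())));
        return parts;
    }

    public static void main(String[] args) {
        // Hypothetical corpus of 20 document names.
        List<String> documents = new ArrayList<>();
        for (int i = 1; i <= 20; i++) {
            documents.add("doc" + i);
        }
        List<List<String>> parts = split(documents, 0.9, 42L);
        System.out.println("training: " + parts.get(0).size()); // prints training: 18
        System.out.println("testing:  " + parts.get(1).size()); // prints testing:  2
    }
}
```

Shuffling before cutting matters when the documents are stored in a meaningful order (by date or by source), since an unshuffled cut would test on a different kind of document than it trained on.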
  10. SER 594 Software Engineering for Machine Learning

    Javier Gonzalez-Sanchez, Ph.D. | [email protected] | Spring 2022

    Copyright. These slides can only be used as study material for the class CSE205 at Arizona State University. They cannot be distributed or used for another purpose.