JGS594 Lecture 18

Software Engineering for Machine Learning
Text Mining II
(202203)

Javier Gonzalez-Sanchez

April 06, 2022

Transcript

  1. SER 594 Software Engineering for Machine Learning Lecture 18: Text Mining II

    Dr. Javier Gonzalez-Sanchez
    [email protected]
    javiergs.engineering.asu.edu | javiergs.com
    PERALTA 230U
    Office Hours: By appointment
  2. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 3

    Import Data

        // Main.java
        import cc.mallet.pipe.iterator.FileIterator;
        import java.io.File;

        public class Main {
            public static void main(String[] args) {
                // Directories to scan for input documents
                File[] directories = new File[1];
                directories[0] = new File("/src/data/");
                // LAST_DIRECTORY: use the name of each file's parent directory as its label
                FileIterator iterator = new FileIterator(
                    directories,
                    new TextFileFilter(),
                    FileIterator.LAST_DIRECTORY
                );
            }
        }

        // TextFileFilter.java
        import java.io.File;
        import java.io.FileFilter;

        public class TextFileFilter implements FileFilter {
            @Override
            public boolean accept(File file) {
                // Keep only paths that end in ".txt"
                return file.toString().endsWith(".txt");
            }
        }
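
The filter above can be exercised on its own before it is handed to Mallet. The following is a minimal JDK-only sketch; `FilterDemo`, `countAccepted`, and the file names are hypothetical and chosen only for illustration.

```java
import java.io.File;
import java.io.FileFilter;

class FilterDemo {
    // Counts how many of the given paths the filter accepts.
    static int countAccepted(FileFilter filter, String... paths) {
        int accepted = 0;
        for (String path : paths) {
            if (filter.accept(new File(path))) {
                accepted++;
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        FileFilter filter = new TextFileFilter();
        // Hypothetical paths; the files do not need to exist,
        // because accept() only inspects the path string.
        System.out.println(filter.accept(new File("sports/game.txt"))); // true
        System.out.println(filter.accept(new File("sports/game.TXT"))); // false: endsWith is case sensitive
        System.out.println(countAccepted(filter, "a.txt", "b.csv", "c.txt")); // 2
    }
}

// Same filter as on the slide: accept only paths ending in ".txt".
class TextFileFilter implements FileFilter {
    @Override
    public boolean accept(File file) {
        return file.toString().endsWith(".txt");
    }
}
```

Note that the suffix check is case sensitive, so a file named `game.TXT` would be skipped unless the filter lowercases the name first.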
  3. Pre-Processing

    § Pipeline: a modification of the Chain of Responsibility pattern

        // example
        ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
        // read each file as UTF-8 text, then lowercase it
        pipeList.add(new Input2CharSequence("UTF-8"));
        pipeList.add(new CharSequenceLowercase());
        // tokenize
        Pattern tokenPattern = Pattern.compile("[\\p{L}\\p{N}_]+");
        pipeList.add(new CharSequence2TokenSequence(tokenPattern));
        // remove stop words using a standard English stop list (false = not case sensitive);
        // this pipe operates on tokens, so it must come after tokenization
        pipeList.add(new TokenSequenceRemoveStopwords(false));
        // convert tokens to the feature sequence that the topic model expects
        pipeList.add(new TokenSequence2FeatureSequence());
        SerialPipes pipeline = new SerialPipes(pipeList);
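
To illustrate why the order of stages matters, here is a minimal JDK-only sketch of a serial pipeline in which each stage consumes the output of the previous one. It is not Mallet's implementation; `MiniPipeline`, its methods, and the three-word stop list are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MiniPipeline {
    static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}_]+");
    // Hypothetical, tiny stop list used only for this sketch.
    static final Set<String> STOP_WORDS = Set.of("the", "a", "of");

    // Stage 1: character sequence -> lowercased character sequence.
    static String lowercase(String text) {
        return text.toLowerCase(Locale.ROOT);
    }

    // Stage 2: character sequence -> token sequence.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    // Stage 3: token sequence -> token sequence without stop words.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    // Chains the stages; stage 3 needs tokens, so it cannot run before stage 2.
    static List<String> run(String text) {
        Function<String, String> lower = MiniPipeline::lowercase;
        Function<String, List<String>> tokens = lower.andThen(MiniPipeline::tokenize);
        Function<String, List<String>> pipeline = tokens.andThen(MiniPipeline::removeStopWords);
        return pipeline.apply(text);
    }

    public static void main(String[] args) {
        System.out.println(run("The Rise of the Machines")); // [rise, machines]
    }
}
```

The types make the ordering constraint visible: the stop word stage takes a list of tokens, so placing it before tokenization would not type-check here.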
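  4. java.util.regex.Pattern

        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        // One or more Unicode letters, numbers, or underscores
        Pattern pattern = Pattern.compile("[\\p{L}\\p{N}_]+");
        Matcher matcher = pattern.matcher(inputString);
        // matches() is true only when the entire input is a single token
        boolean isOK = matcher.matches();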
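
The difference between `matches()` and `find()` is easy to miss: `matches()` tests the whole input, while repeated calls to `find()` walk through it token by token. A small sketch follows; the class name `TokenPatternDemo` and the sample strings are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenPatternDemo {
    static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}_]+");

    // True only if the whole input is one run of letters, numbers, or underscores.
    static boolean isSingleToken(String input) {
        return TOKEN.matcher(input).matches();
    }

    // Collects every non-overlapping match, scanning left to right.
    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(input);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(isSingleToken("topic_42"));       // true
        System.out.println(isSingleToken("topic modeling")); // false: the space is not in the class
        System.out.println(tokens("Text-mining, part 2!"));  // [Text, mining, part, 2]
    }
}
```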
  5. Model

        InstanceList instances = new InstanceList(pipeline);
        instances.addThruPipe(iterator);

    § A high alpha value means that each document is likely to contain a mixture of most of the topics, rather than any single topic specifically.
    § A low alpha value means that a document is more likely to contain a mixture of just a few, or even only one, of the topics.
    § A high beta value means that a topic is likely to contain a mixture of most of the words.
    § A low beta value means that a topic is likely to contain a mixture of just a few of the words.
    § Griffiths and Steyvers (2004) suggest a value of 50/#topics for alpha and 0.1 for beta.

        // arguments: number of topics, alphaSum (sum of alpha over all topics), beta
        ParallelTopicModel model = new ParallelTopicModel(5, 0.01, 0.01);
        model.addInstances(instances);
        model.setNumThreads(4);
        model.setNumIterations(1000); // 50 to 2000
        model.estimate(); // declared to throw IOException
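
As a worked example of the arithmetic, the sketch below converts between a per-topic alpha and the summed value that the three-argument constructor takes as its second argument. It assumes a symmetric prior (the same alpha for every topic); `DirichletPriors` and its method names are hypothetical helpers, not part of Mallet.

```java
class DirichletPriors {
    // Griffiths and Steyvers (2004) heuristic: alpha per topic = 50 / #topics.
    static double suggestedAlphaPerTopic(int numTopics) {
        return 50.0 / numTopics;
    }

    // Sum of a symmetric alpha over all topics.
    static double alphaSum(double alphaPerTopic, int numTopics) {
        return alphaPerTopic * numTopics;
    }

    // Per-topic alpha implied by a given sum, assuming a symmetric prior.
    static double alphaPerTopic(double alphaSum, int numTopics) {
        return alphaSum / numTopics;
    }

    public static void main(String[] args) {
        int numTopics = 5;
        double heuristicAlpha = suggestedAlphaPerTopic(numTopics);
        System.out.println("heuristic alpha per topic: " + heuristicAlpha);
        System.out.println("matching alphaSum: " + alphaSum(heuristicAlpha, numTopics));
        // The slide passes 0.01 as the summed value for 5 topics.
        System.out.println("alpha per topic on the slide: " + alphaPerTopic(0.01, numTopics));
    }
}
```

With 5 topics the heuristic gives an alpha of 10 per topic, far larger than the per-topic value implied by the slide's 0.01, so the two settings express very different expectations about how many topics a document mixes.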
  6. Results

    LL/token: the model's log-likelihood divided by the total number of tokens
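
Written as a formula, with symbols that are assumptions introduced here rather than taken from the slide: let \mathcal{L} be the log-likelihood reported by the trainer, D the number of documents, and N_d the number of tokens in document d.

```latex
\[
  \text{LL/token} \;=\; \frac{\mathcal{L}}{N},
  \qquad
  N \;=\; \sum_{d=1}^{D} N_d
\]
```

Because each probability is at most 1, the value is negative; values closer to zero indicate a better fit, and dividing by N makes runs on corpora of different sizes easier to compare.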
  7. Remove Words

    § Pipeline: a modification of the Chain of Responsibility pattern

        // example
        ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
        // tokenize first; the stop word pipe operates on tokens
        Pattern tokenPattern = Pattern.compile("[\\p{L}\\p{N}_]+");
        pipeList.add(new CharSequence2TokenSequence(tokenPattern));
        // remove stop words using a custom stop list file
        pipeList.add(
            new TokenSequenceRemoveStopwords(
                new File(stopListFilePath), // stop list file
                "utf-8",                    // encoding
                false,                      // do not include the default Mallet list
                false,                      // not case sensitive
                false                       // do not mark deletions
            )
        );
        // pipeList.add(new TokenSequenceRemoveStopwords(false));
        SerialPipes pipeline = new SerialPipes(pipeList);
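
The effect of the case-sensitivity flag can be seen in a JDK-only sketch that loads a stop list from a file and filters tokens with it. This is not Mallet's code; `StopListDemo`, its methods, and the file format (whitespace-separated words) are assumptions made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopListDemo {
    // Reads whitespace-separated stop words; lowercases them when matching ignores case.
    static Set<String> loadStopList(Path file, boolean caseSensitive) throws IOException {
        Set<String> stopWords = new HashSet<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    stopWords.add(caseSensitive ? word : word.toLowerCase(Locale.ROOT));
                }
            }
        }
        return stopWords;
    }

    // Keeps the tokens that are not in the stop list, preserving their original form.
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords, boolean caseSensitive) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            String key = caseSensitive ? token : token.toLowerCase(Locale.ROOT);
            if (!stopWords.contains(key)) {
                kept.add(token);
            }
        }
        return kept;
    }

    // Convenience for the demo: writes the stop words to a temporary file, then filters.
    static List<String> filterWithTempStopList(List<String> tokens, List<String> stopWords, boolean caseSensitive) {
        try {
            Path file = Files.createTempFile("stoplist", ".txt");
            try {
                Files.write(file, stopWords, StandardCharsets.UTF_8);
                return removeStopWords(tokens, loadStopList(file, caseSensitive), caseSensitive);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("The", "rise", "of", "THE", "machines");
        List<String> stopWords = List.of("the", "of");
        System.out.println(filterWithTempStopList(tokens, stopWords, false)); // [rise, machines]
        System.out.println(filterWithTempStopList(tokens, stopWords, true));  // [The, rise, THE, machines]
    }
}
```

When the text has not been lowercased earlier in the pipeline, a case-sensitive stop list misses capitalized variants such as "The".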
  8. Evaluation

    § Topic models are unsupervised, so there are no labels to score predictions against.
    § Instead, estimate the model's ability to generalize its topics to new documents.
    § The likelihood of unseen documents can be used to compare models: higher likelihood implies a better model (Wallach et al., 2009).
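
The comparison idea can be shown on a toy problem. The sketch below scores two simple word-probability models on held-out tokens and prefers the one with the higher log-likelihood. These are add-one-smoothed unigram models, not LDA; `HeldOutComparison`, the vocabulary, and the token lists are all made up for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HeldOutComparison {
    // Unigram model with add-one smoothing over a fixed vocabulary.
    static Map<String, Double> fitUnigram(List<String> training, List<String> vocabulary) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : vocabulary) {
            counts.put(word, 1); // the smoothing pseudo-count
        }
        int total = vocabulary.size();
        for (String word : training) {
            if (counts.containsKey(word)) {
                counts.put(word, counts.get(word) + 1);
                total++;
            }
        }
        Map<String, Double> model = new HashMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            model.put(entry.getKey(), entry.getValue() / (double) total);
        }
        return model;
    }

    // Baseline model: every vocabulary word is equally likely.
    static Map<String, Double> uniform(List<String> vocabulary) {
        Map<String, Double> model = new HashMap<>();
        for (String word : vocabulary) {
            model.put(word, 1.0 / vocabulary.size());
        }
        return model;
    }

    // Sum of log-probabilities of the held-out tokens under the model.
    static double logLikelihood(Map<String, Double> model, List<String> heldOut) {
        double total = 0.0;
        for (String word : heldOut) {
            Double probability = model.get(word);
            if (probability == null) {
                throw new IllegalArgumentException("word outside vocabulary: " + word);
            }
            total += Math.log(probability);
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> vocabulary = List.of("goal", "match", "team", "vote");
        List<String> training = List.of("goal", "match", "team", "goal", "team");
        List<String> heldOut = List.of("goal", "team", "match");

        double fitted = logLikelihood(fitUnigram(training, vocabulary), heldOut);
        double baseline = logLikelihood(uniform(vocabulary), heldOut);
        System.out.println("fitted model:   " + fitted);
        System.out.println("uniform model:  " + baseline);
        System.out.println("fitted is better: " + (fitted > baseline));
    }
}
```

The fitted model assigns more probability to the words that actually appear in the held-out text, so its log-likelihood is higher (less negative) than the uniform baseline's.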
  9. Evaluation

        // Split dataset
        InstanceList[] instanceSplit = instances.split(
            new double[] {0.9, 0.1, 0.0} // 90% training, 10% testing, 0% for a third fold
        );
        // Use the 90% portion for training
        model.addInstances(instanceSplit[0]);
        model.setNumThreads(4);
        model.setNumIterations(50);
        model.estimate();
        // Get estimator
        MarginalProbEstimator estimator = model.getProbEstimator();
        double loglike = estimator.evaluateLeftToRight(
            instanceSplit[1], // test instances
            10,               // number of particles
            false,            // do not use resampling
            null              // no PrintStream for per-document output
        );
        System.out.println("Total log likelihood: " + loglike);
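
The proportional split itself can be sketched with the JDK alone. How Mallet orders or shuffles instances internally is not shown on the slide, so the seeded shuffle below is an assumption, and `SplitDemo` is a hypothetical helper rather than a Mallet class.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class SplitDemo {
    // Shuffles a copy of the items, then cuts it at the cumulative proportions.
    static <T> List<List<T>> split(List<T> items, double[] proportions, long seed) {
        List<T> shuffled = new ArrayList<>(items);
        Collections.shuffle(shuffled, new Random(seed));

        List<List<T>> parts = new ArrayList<>();
        int start = 0;
        double cumulative = 0.0;
        for (double proportion : proportions) {
            cumulative += proportion;
            int end = (int) Math.round(cumulative * shuffled.size());
            // Guard against rounding past the end or before the previous cut.
            end = Math.max(start, Math.min(end, shuffled.size()));
            parts.add(new ArrayList<>(shuffled.subList(start, end)));
            start = end;
        }
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> documents = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            documents.add(i);
        }
        List<List<Integer>> parts = split(documents, new double[] {0.9, 0.1, 0.0}, 42L);
        System.out.println("training: " + parts.get(0).size()); // 18
        System.out.println("testing:  " + parts.get(1).size()); // 2
        System.out.println("third:    " + parts.get(2).size()); // 0
    }
}
```

Keeping the test documents out of training is what makes the reported log-likelihood a measure of generalization rather than of fit.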
  10. SER 594 Software Engineering for Machine Learning

    Javier Gonzalez-Sanchez, Ph.D.
    [email protected]
    Spring 2022
    Copyright. These slides can only be used as study material for the class CSE205 at Arizona State University. They cannot be distributed or used for another purpose.