SER 594 Software Engineering for Machine Learning
Lecture 18: Text Mining II
Dr. Javier Gonzalez-Sanchez
[email protected]
javiergs.engineering.asu.edu | javiergs.com
PERALTA 230U
Office Hours: By appointment
Javier Gonzalez-Sanchez | SER 594 | Spring 2022

Import Data

// Main.java
import cc.mallet.pipe.iterator.FileIterator;
import java.io.File;

public class Main {
    public static void main(String[] args) {
        File[] directories = new File[1];
        directories[0] = new File("/src/data/");
        FileIterator iterator = new FileIterator(
            directories,
            new TextFileFilter(),
            FileIterator.LAST_DIRECTORY
        );
    }
}

// TextFileFilter.java
import java.io.File;
import java.io.FileFilter;

public class TextFileFilter implements FileFilter {
    @Override
    public boolean accept(File file) {
        return file.toString().endsWith(".txt");
    }
}
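To see what the filter and the label pattern contribute, the sketch below reproduces both with the standard library only. It applies the same `.txt` test and derives a label from the name of the file's parent directory, which is roughly the role `FileIterator.LAST_DIRECTORY` plays. The class `ImportSketch`, its members, and the sample path are hypothetical; this is an approximation for study, not MALLET's implementation.

```java
import java.io.File;
import java.io.FileFilter;

class ImportSketch {
    // Same acceptance rule as TextFileFilter on the slide.
    static final FileFilter TXT_ONLY = file -> file.toString().endsWith(".txt");

    // Approximates LAST_DIRECTORY: the label is the name of the parent directory.
    static String labelOf(File file) {
        File parent = file.getParentFile();
        return parent == null ? "" : parent.getName();
    }

    public static void main(String[] args) {
        File sample = new File("data/sports/game01.txt"); // hypothetical path
        System.out.println(TXT_ONLY.accept(sample) + " -> " + labelOf(sample));
        // prints true -> sports
    }
}
```

Organizing the corpus as one subdirectory per category therefore gives every document a label without any extra metadata file.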
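Pre-Processing

§ Pipeline: a modification of the Chain of Responsibility pattern

// example
import cc.mallet.pipe.*;
import java.util.ArrayList;
import java.util.regex.Pattern;

ArrayList<Pipe> pipeList = new ArrayList<>();
// read each file as UTF-8 text
pipeList.add(new Input2CharSequence("UTF-8"));
// lowercase
pipeList.add(new CharSequenceLowercase());
// tokenize (must come before stop-word removal, which operates on tokens)
Pattern tokenPattern = Pattern.compile("[\\p{L}\\p{N}_]+");
pipeList.add(new CharSequence2TokenSequence(tokenPattern));
// remove stop words using the standard English stop list
// (the argument is caseSensitive; false means case-insensitive matching)
pipeList.add(new TokenSequenceRemoveStopwords(false));
// map tokens to feature indices; the topic model expects a FeatureSequence
pipeList.add(new TokenSequence2FeatureSequence());
SerialPipes pipeline = new SerialPipes(pipeList);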
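The stages of such a pipeline can be sketched in plain Java to make the data flow visible: text is lowercased, split into tokens with the same regular expression as the slide, and only then filtered against a stop list. The class `PipelineSketch` and its five-word stop list are hypothetical stand-ins; MALLET's own pipes and its default English list are not used here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PipelineSketch {
    // Same token pattern as the slide: runs of letters, digits, or underscores.
    static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}_]+");
    // Tiny hypothetical stop list; a real English list is much longer.
    static final Set<String> STOP_WORDS = Set.of("the", "a", "of", "and", "is");

    static List<String> process(String text) {
        String lower = text.toLowerCase(Locale.ROOT);   // stage 1: lowercase
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(lower);               // stage 2: tokenize
        while (m.find()) {
            tokens.add(m.group());
        }
        tokens.removeIf(STOP_WORDS::contains);          // stage 3: drop stop words
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(process("The quality of the model is high."));
        // prints [quality, model, high]
    }
}
```

Stop-word removal works on a list of tokens, so it can only run after tokenization. Each stage consumes the output type of the one before it, which is why the order of pipes matters.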
Model

import cc.mallet.topics.ParallelTopicModel;
import cc.mallet.types.InstanceList;

InstanceList instances = new InstanceList(pipeline);
instances.addThruPipe(iterator);

// topics, alpha, beta
§ A high alpha value means that each document is likely to contain a mixture of most of the topics rather than any single topic.
§ A low alpha value means that a document is more likely to contain a mixture of just a few topics, or even only one.
§ A high beta value means that each topic is likely to contain a mixture of most of the words.
§ A low beta value means that each topic is likely to contain a mixture of just a few of the words.
§ Griffiths and Steyvers (2004) suggest a value of 50/#topics for alpha and 0.1 for beta.

// arguments: number of topics, alphaSum (the sum of alpha over all topics), beta
ParallelTopicModel model = new ParallelTopicModel(5, 0.01, 0.01);
model.addInstances(instances);
model.setNumThreads(4);
model.setNumIterations(1000); // 50 to 2000
model.estimate(); // declares IOException; handle or propagate it
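A short worked calculation relates the Griffiths and Steyvers suggestion to the constructor call. It assumes the second constructor argument is the sum of alpha over all topics, which MALLET names `alphaSum` and spreads evenly across topics. The class `PriorSketch` and its method names are hypothetical.

```java
class PriorSketch {
    // Per-topic alpha when alphaSum is spread evenly over the topics.
    static double alphaPerTopic(double alphaSum, int numTopics) {
        return alphaSum / numTopics;
    }

    // Griffiths and Steyvers (2004): alpha = 50 / #topics for each topic.
    static double suggestedAlpha(int numTopics) {
        return 50.0 / numTopics;
    }

    public static void main(String[] args) {
        int topics = 5;
        // Per-topic alpha implied by the values used on the slide (alphaSum = 0.01).
        System.out.println(alphaPerTopic(0.01, topics));
        // Suggested per-topic alpha for 5 topics.
        System.out.println(suggestedAlpha(topics));          // prints 10.0
        // The same suggestion expressed as a sum over topics.
        System.out.println(suggestedAlpha(topics) * topics); // prints 50.0
    }
}
```

Under that assumption, the call on the slide uses a very small alpha (about 0.002 per topic), which favors documents dominated by one or two topics. Following the suggestion for five topics would mean passing 50 as the second argument.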
Remove Words

§ Pipeline: a modification of the Chain of Responsibility pattern

// example
import cc.mallet.pipe.*;
import java.io.File;
import java.util.ArrayList;
import java.util.regex.Pattern;

ArrayList<Pipe> pipeList = new ArrayList<>();
// tokenize (must come before stop-word removal, which operates on tokens)
Pattern tokenPattern = Pattern.compile("[\\p{L}\\p{N}_]+");
pipeList.add(new CharSequence2TokenSequence(tokenPattern));
// remove stop words using a custom stop list file
// (stopListFilePath is the path to that file, defined elsewhere)
pipeList.add(
    new TokenSequenceRemoveStopwords(
        new File(stopListFilePath), // file
        "utf-8",                    // encoding
        false,                      // include the default Mallet list
        false,                      // case sensitive
        false                       // mark deletions
    )
);
// replaces: pipeList.add(new TokenSequenceRemoveStopwords(false));
SerialPipes pipeline = new SerialPipes(pipeList);
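The effect of a custom stop list and of the case-sensitivity flag can be sketched without MALLET. The example assumes a stop list with one word per line and uses a `StringReader` in place of the file; the class `StopListSketch`, its methods, and the sample words are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

class StopListSketch {
    // Reads one stop word per line (assumed format); blank lines are skipped.
    static Set<String> load(BufferedReader reader, boolean caseSensitive) throws IOException {
        Set<String> words = new HashSet<>();
        String line;
        while ((line = reader.readLine()) != null) {
            String word = line.trim();
            if (!word.isEmpty()) {
                words.add(caseSensitive ? word : word.toLowerCase(Locale.ROOT));
            }
        }
        return words;
    }

    // Keeps only the tokens that are not in the stop list.
    static List<String> filter(List<String> tokens, Set<String> stop, boolean caseSensitive) {
        return tokens.stream()
            .filter(t -> !stop.contains(caseSensitive ? t : t.toLowerCase(Locale.ROOT)))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for the stop list file.
        BufferedReader reader = new BufferedReader(new StringReader("Figure\ntable\n"));
        Set<String> stop = load(reader, false);
        System.out.println(filter(List.of("Table", "topic", "figure", "model"), stop, false));
        // prints [topic, model]
    }
}
```

A domain-specific list like this is useful when frequent but uninformative words (for example "figure" and "table" in scientific text) would otherwise dominate the topics.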
Evaluation

§ Unsupervised nature: there are no ground-truth labels to score the topics against.
§ Estimate the model's ability to generalize topics to documents it has not seen.
§ The likelihood of unseen documents can be used to compare models: a higher likelihood implies a better model (Wallach et al., 2009).
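The comparison by held-out likelihood can be sketched as follows. The example assumes the per-token probabilities that each model assigns to the unseen text are already available. Estimating them for a topic model is the hard part and is not shown. It also reports perplexity, a common per-token rescaling of the same quantity. The class `HeldOutSketch`, its methods, and the probabilities are hypothetical.

```java
class HeldOutSketch {
    // Sum of the log-probabilities a model assigns to each held-out token.
    static double logLikelihood(double[] tokenProbabilities) {
        double sum = 0.0;
        for (double p : tokenProbabilities) {
            sum += Math.log(p);
        }
        return sum;
    }

    // Perplexity = exp(-logLikelihood / tokenCount); lower is better.
    static double perplexity(double logLikelihood, int tokenCount) {
        return Math.exp(-logLikelihood / tokenCount);
    }

    public static void main(String[] args) {
        double[] modelA = {0.25, 0.25, 0.25, 0.25}; // hypothetical token probabilities
        double[] modelB = {0.10, 0.10, 0.10, 0.10};
        double llA = logLikelihood(modelA);
        double llB = logLikelihood(modelB);
        System.out.println(llA > llB); // prints true: model A fits the unseen text better
        System.out.println(perplexity(llA, modelA.length) + " vs " + perplexity(llB, modelB.length));
    }
}
```

Both numbers rank the models the same way: the model with the higher held-out likelihood has the lower perplexity.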
SER 594 Software Engineering for Machine Learning
Javier Gonzalez-Sanchez, Ph.D.
[email protected]
Spring 2022

Copyright. These slides can only be used as study material for the class CSE205 at Arizona State University. They cannot be distributed or used for another purpose.