JGS594 Lecture 18

jgs SER 594 Software Engineering for Machine Learning
Lecture 18: Text Mining II
Dr. Javier Gonzalez-Sanchez
javiergs.engineering.asu.edu | javiergs.com
PERALTA 230U
Office Hours: By appointment

jgs Previously … MALLET

Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 3
jgs Import Data

    // Main.java: iterate over the .txt files under the data directory
    import cc.mallet.pipe.iterator.FileIterator;
    import java.io.File;

    public class Main {
        public static void main(String[] args) {
            File[] directories = new File[1];
            directories[0] = new File("/src/data/");
            FileIterator iterator = new FileIterator(
                directories,
                new TextFileFilter(),
                FileIterator.LAST_DIRECTORY
            );
        }
    }

    // TextFileFilter.java: accept only files whose name ends in .txt
    import java.io.File;
    import java.io.FileFilter;

    public class TextFileFilter implements FileFilter {
        @Override
        public boolean accept(File file) {
            return file.toString().endsWith(".txt");
        }
    }

jgs Code

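jgs Pre-Processing
§ Pipeline: a modification of the Chain of Responsibility pattern

    // example
    ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
    // read each input as a character sequence
    pipeList.add(new Input2CharSequence("UTF-8"));
    // lowercase
    pipeList.add(new CharSequenceLowercase());
    // tokenize: runs of Unicode letters, digits, and underscores
    Pattern tokenPattern = Pattern.compile("[\\p{L}\\p{N}_]+");
    pipeList.add(new CharSequence2TokenSequence(tokenPattern));
    // remove stop words using the default English stop list (false = not case sensitive);
    // this pipe operates on tokens, so it must come after the tokenizer
    pipeList.add(new TokenSequenceRemoveStopwords(false));
    // convert tokens to a feature sequence, the input the topic model expects
    pipeList.add(new TokenSequence2FeatureSequence());
    SerialPipes pipeline = new SerialPipes(pipeList);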
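The pipeline idea can be sketched with the standard library alone. This is a minimal illustration, not MALLET code: the class `PipelineSketch`, its methods, and the tiny stop list are hypothetical names chosen for the example. Each stage takes the previous stage's output, in the order lowercase, tokenize, remove stop words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a chain of pipes; not part of MALLET.
final class PipelineSketch {

    // Stage 1: lowercase the raw text.
    static String lowercase(String text) {
        return text.toLowerCase();
    }

    // Stage 2: tokenize with the same pattern used on the slide.
    static List<String> tokenize(String text) {
        Matcher m = Pattern.compile("[\\p{L}\\p{N}_]+").matcher(text);
        List<String> tokens = new ArrayList<>();
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Stage 3: drop tokens that appear in the stop list.
    static List<String> removeStopwords(List<String> tokens, Set<String> stop) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!stop.contains(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    // Chain the stages so each one consumes the previous stage's output.
    static List<String> run(String text, Set<String> stop) {
        Function<String, String> lower = PipelineSketch::lowercase;
        Function<String, List<String>> tokenizer = PipelineSketch::tokenize;
        Function<List<String>, List<String>> filter = tokens -> removeStopwords(tokens, stop);
        return lower.andThen(tokenizer).andThen(filter).apply(text);
    }

    public static void main(String[] args) {
        // prints [cat, hat]
        System.out.println(run("The cat and the hat", Set.of("the", "and")));
    }
}
```

The stop-word stage needs tokens as input, which is why it sits after the tokenizer in this sketch.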

jgs java.util.regex.Pattern

    Pattern pattern = Pattern.compile("[\\p{L}\\p{N}_]+");
    Matcher matcher = pattern.matcher(inputString);
    // matches() is true only when the whole input is a single token
    boolean isOK = matcher.matches();
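A short sketch of how this pattern behaves, using only `java.util.regex`. The class name `TokenPatternDemo` and the sample strings are made up for illustration. `matches()` tests the entire input, while repeated `find()` calls pull out each token in turn.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TokenPatternDemo {
    // One or more Unicode letters, digits, or underscores.
    static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}_]+");

    // True only when the entire input is one token.
    static boolean isSingleToken(String input) {
        return TOKEN.matcher(input).matches();
    }

    // Every token in the input, in order of appearance.
    static List<String> tokens(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(isSingleToken("topic_42"));     // true
        System.out.println(isSingleToken("topic model"));  // false: the space is not in the class
        System.out.println(tokens("Text mining, part 2!")); // [Text, mining, part, 2]
    }
}
```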

jgs Mallet API Text Mining

jgs Code

jgs Model

    InstanceList instances = new InstanceList(pipeline);
    instances.addThruPipe(iterator);

Topics, alpha, beta:
§ A high alpha value means that each document is likely to contain a mixture of most of the topics, and not any single topic specifically.
§ A low alpha value means that a document is more likely to contain a mixture of just a few, or even only one, of the topics.
§ A high beta value means that a topic is likely to contain a mixture of most of the words.
§ A low beta value means that a topic is likely to contain a mixture of just a few of the words.
§ Griffiths and Steyvers (2004) suggest a value of 50/#topics for alpha and 0.1 for beta.

    // arguments: number of topics, alphaSum (alpha summed over all topics), beta
    ParallelTopicModel model = new ParallelTopicModel(5, 0.01, 0.01);
    model.addInstances(instances);
    model.setNumThreads(4);
    model.setNumIterations(1000); // typically 50 to 2000
    model.estimate();
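The Griffiths and Steyvers suggestion is simple arithmetic, sketched below. The class `DirichletPriors` and its method names are hypothetical. One assumption to check against the MALLET Javadoc: the second `ParallelTopicModel` constructor argument is documented as `alphaSum`, the sum of alpha over all topics, so a per-topic alpha of 50/#topics corresponds to an alphaSum of 50.

```java
// Hypothetical helper; plain arithmetic, no MALLET dependency.
final class DirichletPriors {

    // Beta value suggested by Griffiths and Steyvers (2004).
    static final double SUGGESTED_BETA = 0.1;

    // Per-topic alpha suggested by Griffiths and Steyvers (2004): 50 / #topics.
    static double suggestedAlpha(int numTopics) {
        if (numTopics <= 0) {
            throw new IllegalArgumentException("numTopics must be positive");
        }
        return 50.0 / numTopics;
    }

    // Alpha summed over all topics (assumed to be what MALLET calls alphaSum).
    static double alphaSum(int numTopics) {
        return suggestedAlpha(numTopics) * numTopics;
    }

    public static void main(String[] args) {
        for (int k : new int[] {5, 10, 50}) {
            System.out.println(k + " topics: alpha = " + suggestedAlpha(k)
                + ", alphaSum = " + alphaSum(k)
                + ", beta = " + SUGGESTED_BETA);
        }
    }
}
```

With 5 topics this gives a per-topic alpha of 10, far larger than a value of 0.01, so the two settings encode very different assumptions about how many topics each document mixes.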

jgs Results
LL/token: the model's log-likelihood divided by the total number of tokens.
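As a worked example of the normalization, with made-up numbers (they are not results from the lecture): a total log-likelihood of -84000 over 10000 tokens gives -8.4 per token. The class and method names are hypothetical.

```java
// Hypothetical helper showing the LL/token normalization.
final class LogLikelihoodPerToken {

    // Divide the total log-likelihood by the number of tokens it was computed over.
    static double perToken(double totalLogLikelihood, long totalTokens) {
        if (totalTokens <= 0) {
            throw new IllegalArgumentException("totalTokens must be positive");
        }
        return totalLogLikelihood / totalTokens;
    }

    public static void main(String[] args) {
        // Illustrative values only.
        System.out.println(perToken(-84000.0, 10000)); // -8.4
    }
}
```

Dividing by the token count makes values comparable across corpora of different sizes; values closer to zero indicate a better fit.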

jgs Remove Words
§ Pipeline: a modification of the Chain of Responsibility pattern

    // example
    ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
    // tokenize
    Pattern tokenPattern = Pattern.compile("[\\p{L}\\p{N}_]+");
    pipeList.add(new CharSequence2TokenSequence(tokenPattern));
    // remove stop words using a custom stop list file;
    // this pipe operates on tokens, so it must come after the tokenizer
    pipeList.add(
        new TokenSequenceRemoveStopwords(
            new File(stopListFilePath), // stop list file
            "utf-8",                    // encoding
            false,                      // false = do not include the default MALLET list
            false,                      // false = not case sensitive
            false                       // false = do not mark deletions
        )
    );
    // alternative: default English stop list only
    // pipeList.add(new TokenSequenceRemoveStopwords(false));
    SerialPipes pipeline = new SerialPipes(pipeList);
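What the case-sensitivity flag changes can be shown with a small standalone sketch. It illustrates the flag's intent under my own assumptions and is not MALLET's implementation; `StopListSketch` and `filter` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of case-sensitive vs. case-insensitive stop-word removal.
final class StopListSketch {

    static List<String> filter(List<String> tokens, Collection<String> stopWords, boolean caseSensitive) {
        // Normalize the stop list once when case is ignored.
        Set<String> stop = new HashSet<>();
        for (String w : stopWords) {
            stop.add(caseSensitive ? w : w.toLowerCase());
        }
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            String key = caseSensitive ? t : t.toLowerCase();
            if (!stop.contains(key)) {
                kept.add(t); // keep the token in its original form
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("The", "model", "the", "topics");
        System.out.println(filter(tokens, List.of("the"), false)); // [model, topics]
        System.out.println(filter(tokens, List.of("the"), true));  // [The, model, topics]
    }
}
```

If the text is already lowercased earlier in the pipeline, both settings give the same result.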

jgs Evaluation Text Mining

jgs Evaluation
§ Unsupervised nature: there are no labels to score the topics against.
§ Estimate the model's ability to generalize its topics to unseen documents.
§ The likelihood of unseen documents can be used to compare models: a higher likelihood implies a better model (Wallach et al., 2009).
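The comparison rule amounts to picking the maximum, sketched below. The model names and log-likelihood values are placeholders invented for the example, and `HeldOutComparison` is a hypothetical class. The comparison is only meaningful when every model is scored on the same held-out documents.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: choose the model with the highest held-out log-likelihood.
final class HeldOutComparison {

    // Returns the name with the largest value, or null for an empty map.
    static String best(Map<String, Double> heldOutLogLikelihood) {
        String bestName = null;
        double bestValue = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : heldOutLogLikelihood.entrySet()) {
            if (e.getValue() > bestValue) {
                bestValue = e.getValue();
                bestName = e.getKey();
            }
        }
        return bestName;
    }

    public static void main(String[] args) {
        // Placeholder numbers, not measured results.
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("model A", -1250.0);
        scores.put("model B", -1100.0);
        System.out.println(best(scores)); // model B
    }
}
```

Log-likelihoods are negative, so "higher" means closer to zero.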

jgs Evaluation

    // Split dataset
    InstanceList[] instanceSplit = instances.split(
        new double[] {0.9, 0.1, 0.0} // 90% training, 10% testing, 0% for an unused third part
    );
    // Use the 90% partition for training
    model.addInstances(instanceSplit[0]);
    model.setNumThreads(4);
    model.setNumIterations(50);
    model.estimate();
    // Get estimator
    MarginalProbEstimator estimator = model.getProbEstimator();
    double loglike = estimator.evaluateLeftToRight(
        instanceSplit[1], // test instances
        10,               // number of particles
        false,            // whether to use resampling
        null);            // PrintStream for per-document output (none)
    System.out.println("Total log likelihood: " + loglike);
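A proportional split like the 0.9 / 0.1 / 0.0 one above can be sketched without MALLET. `ProportionalSplit` is a hypothetical class. Shuffling first, and using a seeded `Random` so runs are repeatable, are choices made for this example; the slides do not say how MALLET orders the instances.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper: shuffle a list, then cut it into consecutive parts by proportion.
final class ProportionalSplit {

    static <T> List<List<T>> split(List<T> items, double[] proportions, Random random) {
        double total = 0.0;
        for (double p : proportions) {
            total += p;
        }
        if (total <= 0.0) {
            throw new IllegalArgumentException("proportions must sum to a positive value");
        }

        // Shuffle a copy so the caller's list is left untouched.
        List<T> shuffled = new ArrayList<>(items);
        Collections.shuffle(shuffled, random);

        List<List<T>> parts = new ArrayList<>();
        double cumulative = 0.0;
        int start = 0;
        for (double p : proportions) {
            cumulative += p;
            // Cut points come from the cumulative proportion to avoid drift from rounding.
            int end = (int) Math.round(shuffled.size() * cumulative / total);
            end = Math.max(start, Math.min(end, shuffled.size()));
            parts.add(new ArrayList<>(shuffled.subList(start, end)));
            start = end;
        }
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> docs = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            docs.add(i);
        }
        List<List<Integer>> parts = split(docs, new double[] {0.9, 0.1, 0.0}, new Random(42));
        // Part sizes: 9 training, 1 testing, 0 in the third part.
        System.out.println(parts.get(0).size() + " / " + parts.get(1).size() + " / " + parts.get(2).size());
    }
}
```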

jgs Evaluation (50)

jgs Evaluation (1000)

jgs Summary Text Mining

jgs Summary

jgs Questions

jgs SER 594 Software Engineering for Machine Learning
Javier Gonzalez-Sanchez, Ph.D.
Spring 2022
Copyright. These slides can only be used as study material for the class CSE205 at Arizona State University. They cannot be distributed or used for another purpose.
