JGS594 Lecture 18

Software Engineering for Machine Learning
Text Mining II
(202203)

Javier Gonzalez-Sanchez

April 06, 2022

Transcript

  1. jgs
    SER 594
    Software Engineering for
    Machine Learning
    Lecture 18: Text Mining II
    Dr. Javier Gonzalez-Sanchez
    [email protected]
    javiergs.engineering.asu.edu | javiergs.com
    PERALTA 230U
    Office Hours: By appointment

  2. jgs
    Previously …
    MALLET

  3. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 3
    jgs
    Import Data
    // Main.java
    import cc.mallet.pipe.iterator.FileIterator;
    import java.io.File;
    public class Main {
        public static void main(String[] args) {
            // directories to scan for input documents
            File[] directories = new File[1];
            directories[0] = new File("/src/data/");
            // iterate over the .txt files; the name of the last directory
            // in each file's path becomes the instance label
            FileIterator iterator = new FileIterator(
                directories,
                new TextFileFilter(),
                FileIterator.LAST_DIRECTORY
            );
        }
    }
    // TextFileFilter.java
    import java.io.File;
    import java.io.FileFilter;
    public class TextFileFilter implements FileFilter {
        // accept only files whose path ends in .txt
        @Override
        public boolean accept(File file) {
            return file.toString().endsWith(".txt");
        }
    }
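
The filter and labeling logic above can be tried without MALLET. The following is a minimal sketch, assuming plain `java.io` and `java.util.regex` only. The class name `TextFileFilterDemo` and the pattern `LAST_DIR` are hypothetical: `LAST_DIR` is a stand-in for what `FileIterator.LAST_DIRECTORY` is understood to do (use the last directory name as the label), not MALLET's own definition.

```java
import java.io.File;
import java.io.FileFilter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TextFileFilterDemo {
    // Same rule as the slide's TextFileFilter, written as a lambda
    // (FileFilter has a single abstract method, accept(File)).
    static final FileFilter TXT_ONLY = file -> file.toString().endsWith(".txt");

    // Hypothetical stand-in for a "last directory" label pattern:
    // captures the directory name that directly contains the file.
    static final Pattern LAST_DIR = Pattern.compile(".*/([^/]+)/[^/]+");

    static boolean accepts(String path) {
        return TXT_ONLY.accept(new File(path));
    }

    // Returns the containing directory name, or null if the path has none.
    static String label(String path) {
        Matcher m = LAST_DIR.matcher(path);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(accepts("data/sports/game.txt")); // .txt file
        System.out.println(accepts("data/sports/notes.md")); // not a .txt file
        System.out.println(label("data/sports/game.txt"));   // containing directory
    }
}
```

With a layout such as `data/sports/*.txt` and `data/politics/*.txt`, each directory name would serve as the label of the documents inside it.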

  4. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 4
    jgs
    Code

  5. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 5
    jgs
    Pre-Processing
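    § Pipeline
    a modification of the Chain of Responsibility pattern
    // example (imports: cc.mallet.pipe.*, java.util.ArrayList, java.util.regex.Pattern)
    ArrayList<Pipe> pipeList = new ArrayList<>();
    // read the input as UTF-8 text, then lowercase it
    pipeList.add(new Input2CharSequence("UTF-8"));
    pipeList.add(new CharSequenceLowercase());
    // tokenize: a token is a run of letters, digits, or underscores
    Pattern tokenPattern = Pattern.compile("[\\p{L}\\p{N}_]+");
    pipeList.add(new CharSequence2TokenSequence(tokenPattern));
    // remove stop words using the standard English stop list
    // (argument: caseSensitive = false); this pipe operates on tokens,
    // so it must come after the tokenizer
    pipeList.add(new TokenSequenceRemoveStopwords(false));
    // convert tokens to feature indices, the input the topic model expects
    pipeList.add(new TokenSequence2FeatureSequence());
    SerialPipes pipeline = new SerialPipes(pipeList);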

  6. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 6
    jgs
    java.util.regex.Pattern
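    // one or more letters, digits, or underscores
    Pattern pattern = Pattern.compile("[\\p{L}\\p{N}_]+");
    Matcher matcher = pattern.matcher(inputString);
    // matches() is true only if the whole input is a single token
    boolean isOK = matcher.matches();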
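
To see what this token pattern accepts, here is a small sketch using only `java.util.regex`. The class and method names (`TokenPatternDemo`, `isSingleToken`, `tokens`) are made up for illustration. It contrasts `matches()`, which tests the entire string, with repeated `find()`, which is roughly how a tokenizer would pull tokens out of running text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenPatternDemo {
    // Same pattern as on the slide: letters, digits, or underscores.
    static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}_]+");

    // True only when the whole string is one token.
    static boolean isSingleToken(String s) {
        return TOKEN.matcher(s).matches();
    }

    // Collects every non-overlapping match, left to right.
    static List<String> tokens(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(s);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(isSingleToken("mining"));      // true
        System.out.println(isSingleToken("text mining")); // false: contains a space
        System.out.println(tokens("Text mining, part 2!")); // [Text, mining, part, 2]
    }
}
```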

  7. jgs
    Mallet API
    Text Mining

  8. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 8
    jgs
    Code

  9. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 9
    jgs
    Model
    InstanceList instances = new InstanceList(pipeline);
    instances.addThruPipe(iterator);
    // parameters: number of topics, alpha, beta
    § A high alpha value means that each document is likely to contain a mixture of most of the
    topics, and not any single topic specifically.
    § A low alpha value means that a document is more likely to contain a mixture of just a
    few, or even only one, of the topics.
    § A high beta value means that a topic is likely to contain a mixture of most of the words.
    § A low beta value means that a topic is likely to contain a mixture of just a few of the words.
    § Griffiths and Steyvers (2004) suggest a value of 50/#topics for alpha and 0.1 for beta.
    // 5 topics; the second argument is the sum of alpha over all topics, the third is beta
    ParallelTopicModel model = new ParallelTopicModel(5, 0.01, 0.01);
    model.addInstances(instances);
    model.setNumThreads(4);
    model.setNumIterations(1000); // typically 50 to 2000
    model.estimate();
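
As a worked example of the Griffiths and Steyvers heuristic, the sketch below computes the suggested values for a given number of topics. `LdaPriors` and its methods are hypothetical helpers, not part of MALLET. It assumes the second constructor argument of `ParallelTopicModel` is the sum of alpha over all topics, so a per-topic alpha of 50/#topics corresponds to a sum of 50.

```java
class LdaPriors {
    // Suggested beta from Griffiths and Steyvers (2004), as quoted on the slide.
    static final double BETA = 0.1;

    // Per-topic alpha: 50 / number of topics.
    static double alphaPerTopic(int numTopics) {
        if (numTopics <= 0) {
            throw new IllegalArgumentException("numTopics must be positive");
        }
        return 50.0 / numTopics;
    }

    // Sum of alpha over all topics (assumed to be what the constructor takes).
    static double alphaSum(int numTopics) {
        return alphaPerTopic(numTopics) * numTopics;
    }

    public static void main(String[] args) {
        int numTopics = 5;
        System.out.println("alpha per topic = " + alphaPerTopic(numTopics)); // 10.0
        System.out.println("alpha sum       = " + alphaSum(numTopics));      // 50.0
        System.out.println("beta            = " + BETA);                     // 0.1
    }
}
```

Under that assumption, the heuristic for 5 topics would read `new ParallelTopicModel(5, 50.0, 0.1)`, which is a much smoother prior than the 0.01 values used in the slide's code.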

  10. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 10
    jgs
    Results
    LL/token
    the model's log-likelihood divided by the total
    number of tokens (closer to zero is better)
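
The normalization is a single division. A minimal sketch follows, with made-up numbers that do not come from any run in this lecture; `LogLikelihoodPerToken` is a hypothetical helper class.

```java
class LogLikelihoodPerToken {
    // Divides a total log-likelihood by the number of tokens it covers.
    static double perToken(double totalLogLikelihood, long totalTokens) {
        if (totalTokens <= 0) {
            throw new IllegalArgumentException("totalTokens must be positive");
        }
        return totalLogLikelihood / totalTokens;
    }

    public static void main(String[] args) {
        // Hypothetical values, for illustration only.
        double totalLogLikelihood = -85000.0;
        long totalTokens = 10000;
        System.out.println("LL/token = " + perToken(totalLogLikelihood, totalTokens)); // -8.5
    }
}
```

Dividing by the token count makes runs on corpora of different sizes easier to compare, since the raw log-likelihood grows in magnitude with the amount of text.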

  11. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 11
    jgs
    Remove Words
    § Pipeline
    a modification of the Chain of Responsibility pattern
    // example
    ArrayList<Pipe> pipeList = new ArrayList<>();
    // tokenize first: the stop-word pipe operates on tokens
    Pattern tokenPattern = Pattern.compile("[\\p{L}\\p{N}_]+");
    pipeList.add(new CharSequence2TokenSequence(tokenPattern));
    // remove stop words listed in a custom stop-list file
    pipeList.add(
        new TokenSequenceRemoveStopwords(
            new File(stopListFilePath), // stop-list file
            "utf-8", // encoding
            false, // do not also include the default MALLET list
            false, // not case sensitive
            false // do not mark deletions
        )
    );
    // alternative: use only the default English stop list
    // pipeList.add(new TokenSequenceRemoveStopwords(false));
    SerialPipes pipeline = new SerialPipes(pipeList);

  12. jgs
    Evaluation
    Text Mining

  13. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 13
    jgs
    Evaluation
    § Topic modeling is unsupervised: there are no labels to score against
    § Estimate the model's ability to generalize its topics to new documents
    § The likelihood of unseen documents can be used to compare models:
    higher likelihood implies a better model.
    (Wallach et al., 2009)
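
The comparison rule above can be sketched as picking the candidate with the highest held-out log-likelihood. The class `HeldOutComparison` and the numbers in `main` are hypothetical, and the comparison is only meaningful when every candidate is scored on the same held-out documents.

```java
class HeldOutComparison {
    // Returns the index of the model with the highest held-out log-likelihood.
    static int bestModel(double[] heldOutLogLikelihoods) {
        if (heldOutLogLikelihoods.length == 0) {
            throw new IllegalArgumentException("need at least one model");
        }
        int best = 0;
        for (int i = 1; i < heldOutLogLikelihoods.length; i++) {
            if (heldOutLogLikelihoods[i] > heldOutLogLikelihoods[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical scores for three candidate models on one test set.
        double[] scores = {-52310.4, -51872.9, -52045.1};
        System.out.println("best model index = " + bestModel(scores)); // 1
    }
}
```

Log-likelihoods are negative, so "highest" means closest to zero.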

  14. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 14
    jgs
    Evaluation
    // Split dataset
    InstanceList[] instanceSplit = instances.split(
        new double[] {0.9, 0.1, 0.0} // 90% training, 10% testing, 0% validation
    );
    // Use the first 90% for training
    model.addInstances(instanceSplit[0]);
    model.setNumThreads(4);
    model.setNumIterations(50);
    model.estimate();
    // Get estimator
    MarginalProbEstimator estimator = model.getProbEstimator();
    double loglike = estimator.evaluateLeftToRight(
        instanceSplit[1], // test instances
        10, // number of particles
        false, // do not use resampling
        null); // PrintStream for per-document output (none)
    System.out.println("Total log likelihood: " + loglike);
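
To make the 90/10/0 split concrete, here is a plain-Java sketch of a proportional split over a shuffled list. It illustrates the idea only and is not MALLET's implementation of `InstanceList.split`; `SplitDemo`, its `split` method, and the `seed` parameter are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class SplitDemo {
    // Shuffles a copy of the items, then cuts it into consecutive parts
    // whose sizes follow the given proportions.
    static <T> List<List<T>> split(List<T> items, double[] proportions, long seed) {
        List<T> shuffled = new ArrayList<>(items);
        Collections.shuffle(shuffled, new Random(seed));

        double total = 0;
        for (double p : proportions) {
            total += p;
        }

        List<List<T>> parts = new ArrayList<>();
        double cumulative = 0;
        int start = 0;
        for (double p : proportions) {
            cumulative += p;
            // Cut point is the rounded cumulative share of the list size.
            int end = (int) Math.round(cumulative / total * shuffled.size());
            parts.add(new ArrayList<>(shuffled.subList(start, end)));
            start = end;
        }
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> docs = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            docs.add(i);
        }
        List<List<Integer>> parts = split(docs, new double[] {0.9, 0.1, 0.0}, 42L);
        System.out.println("training:   " + parts.get(0).size()); // 9
        System.out.println("testing:    " + parts.get(1).size()); // 1
        System.out.println("validation: " + parts.get(2).size()); // 0
    }
}
```

Shuffling before cutting keeps the test portion from being whatever happened to come last in the input order.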

  15. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 15
    jgs
    Evaluation (50)

  16. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 16
    jgs
    Evaluation (1000)

  17. jgs
    Summary
    Text Mining

  18. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 18
    jgs
    Summary

  19. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 19
    jgs
    Questions

  20. jgs
    SER 594 Software Engineering for Machine Learning
    Javier Gonzalez-Sanchez, Ph.D.
    [email protected]
    Spring 2022
    Copyright. These slides can only be used as study material for the class CSE205 at Arizona State University.
    They cannot be distributed or used for another purpose.
