JGS594 Lecture 19

Software Engineering for Machine Learning
Text Mining III
(202203)

Javier Gonzalez-Sanchez

April 12, 2022

Transcript

  1. jgs
    SER 594
    Software Engineering for
    Machine Learning
    Lecture 19: Text Mining III
    Dr. Javier Gonzalez-Sanchez
    [email protected]
    javiergs.engineering.asu.edu | javiergs.com
    PERALTA 230U
    Office Hours: By appointment

  2. Previously …
    MALLET


  3. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 3
    Data and Pipeline

  4. Model
    InstanceList instances = new InstanceList(pipeline);
    instances.addThruPipe(iterator);
    // Constructor arguments: number of topics, alpha, beta
    // (MALLET treats the alpha argument as the sum of alpha over all topics)
    • A high alpha value means that each document is likely to contain a mixture of most of the topics, rather than any single topic.
    • A low alpha value means that a document is more likely to contain a mixture of just a few, or even only one, of the topics.
    • A high beta value means that a topic is likely to contain a mixture of most of the words.
    • A low beta value means that a topic is likely to contain a mixture of just a few of the words.
    • Griffiths and Steyvers (2004) suggest a value of 50/#topics for alpha and 0.1 for beta.
    ParallelTopicModel model = new ParallelTopicModel(5, 0.01, 0.01);
    model.addInstances(instances);
    model.setNumThreads(4);
    model.setNumIterations(1000); // typical range: 50 to 2000
    model.estimate();
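
The Griffiths and Steyvers heuristic can be turned into concrete numbers. The following is a minimal sketch, not part of the lecture code: the class `TopicPriors` and its methods are hypothetical names introduced only for illustration. It assumes that the second argument of MALLET's `ParallelTopicModel` constructor is the sum of alpha over all topics, so a per-topic alpha of 50/k would correspond to an alpha sum of 50.

```java
// Hypothetical helper, not part of MALLET: turns the 50/#topics heuristic into numbers.
final class TopicPriors {
    // Beta value suggested by Griffiths and Steyvers (2004).
    static final double SUGGESTED_BETA = 0.1;

    // Per-topic alpha under the heuristic: 50 divided by the number of topics.
    static double alphaPerTopic(int numTopics) {
        if (numTopics <= 0) {
            throw new IllegalArgumentException("numTopics must be positive");
        }
        return 50.0 / numTopics;
    }

    // Alpha summed over all topics (assumed to be what MALLET's constructor expects).
    static double alphaSum(int numTopics) {
        return alphaPerTopic(numTopics) * numTopics;
    }

    public static void main(String[] args) {
        // Print the suggested priors for a few candidate topic counts.
        for (int k : new int[] {5, 10, 50}) {
            System.out.printf("k=%d  alpha per topic=%.2f  alpha sum=%.1f  beta=%.1f%n",
                    k, alphaPerTopic(k), alphaSum(k), SUGGESTED_BETA);
        }
    }
}
```

Under that assumption, the heuristic for five topics would read `new ParallelTopicModel(5, TopicPriors.alphaSum(5), TopicPriors.SUGGESTED_BETA)`, which is a much smoother prior than the 0.01 values used on the slide.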

  5. Results
    LL/token
    the model's log-likelihood divided by the total number of tokens
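
As a small worked example of the LL/token figure, the sketch below divides a total log-likelihood by a token count. The class name `TokenLikelihood` and the numbers in `main` are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from the lecture.

```java
// Hypothetical helper illustrating the LL/token statistic.
final class TokenLikelihood {
    // Log-likelihood per token: total log-likelihood divided by the number of tokens.
    static double perToken(double totalLogLikelihood, long totalTokens) {
        if (totalTokens <= 0) {
            throw new IllegalArgumentException("totalTokens must be positive");
        }
        return totalLogLikelihood / totalTokens;
    }

    public static void main(String[] args) {
        // Made-up values: a corpus of 100,000 tokens with a total log-likelihood of -800,000.
        double llPerToken = perToken(-800_000.0, 100_000L);
        System.out.println("LL/token: " + llPerToken);
    }
}
```

Because the value is normalized by corpus size, it can be compared across runs on the same data; values closer to zero indicate a better fit.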

  6. Evaluation
    Text Mining

  7. Evaluation
    • Topic modeling is unsupervised, so there are no labels to score a model against.
    • Instead, estimate the model's ability to generalize topics.
    • The likelihood of unseen documents can be used to compare models: a higher likelihood implies a better model (Wallach et al., 2009).
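
The comparison rule can be sketched in a few lines. This is an illustration only: the class `ModelComparison`, the model labels, and the log-likelihood values are all hypothetical. Log-likelihoods are negative, so "higher" means closer to zero.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: picks the model with the highest held-out log-likelihood.
final class ModelComparison {
    // Returns the label of the best model, or null if the map is empty.
    static String best(Map<String, Double> heldOutLogLikelihood) {
        String bestName = null;
        double bestValue = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> entry : heldOutLogLikelihood.entrySet()) {
            if (entry.getValue() > bestValue) {
                bestValue = entry.getValue();
                bestName = entry.getKey();
            }
        }
        return bestName;
    }

    public static void main(String[] args) {
        // Made-up held-out scores for three candidate topic counts.
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("k=5", -12500.0);
        scores.put("k=10", -11800.0);
        scores.put("k=20", -12100.0);
        System.out.println("Best model: " + best(scores));
    }
}
```

The scores must come from the same held-out documents for every candidate; otherwise the totals are not comparable.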

  8. Evaluation
    // Requires cc.mallet.types.InstanceList, cc.mallet.topics.ParallelTopicModel,
    // and cc.mallet.topics.MarginalProbEstimator
    // Split the dataset
    InstanceList[] instanceSplit = instances.split(
        new double[] {0.9, 0.1, 0.0} // 90% training, 10% testing, 0% validation
    );
    // Use the first 90% for training
    model.addInstances(instanceSplit[0]);
    model.setNumThreads(4);
    model.setNumIterations(50);
    model.estimate();
    // Get the estimator and score the held-out 10%
    MarginalProbEstimator estimator = model.getProbEstimator();
    double loglike = estimator.evaluateLeftToRight(
        instanceSplit[1], // test instances
        10,               // number of particles
        false,            // whether to use resampling
        null);            // PrintStream for per-document output (null: none)
    System.out.println("Total log likelihood: " + loglike);
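
To make the 90/10/0 split concrete without depending on MALLET, the sketch below splits a plain list by proportions. It is not MALLET's implementation: the class `ProportionSplit` is hypothetical, and the shuffle-then-cut strategy with a fixed seed is an assumption made so that the example is reproducible.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper: splits a list into consecutive parts after shuffling.
final class ProportionSplit {
    // Shuffles a copy of the items, then cuts it into parts sized by the given proportions.
    static <T> List<List<T>> split(List<T> items, double[] proportions, long seed) {
        double total = 0.0;
        for (double p : proportions) {
            total += p;
        }
        if (total <= 0.0) {
            throw new IllegalArgumentException("proportions must sum to a positive value");
        }
        List<T> shuffled = new ArrayList<>(items);
        Collections.shuffle(shuffled, new Random(seed));

        List<List<T>> parts = new ArrayList<>();
        int start = 0;
        double cumulative = 0.0;
        for (int i = 0; i < proportions.length; i++) {
            cumulative += proportions[i];
            // The last part takes whatever remains, so no item is lost to rounding.
            int end = (i == proportions.length - 1)
                    ? shuffled.size()
                    : (int) Math.round(cumulative / total * shuffled.size());
            end = Math.max(start, Math.min(end, shuffled.size()));
            parts.add(new ArrayList<>(shuffled.subList(start, end)));
            start = end;
        }
        return parts;
    }

    public static void main(String[] args) {
        // 100 placeholder document ids split 90% / 10% / 0%.
        List<Integer> docs = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            docs.add(i);
        }
        List<List<Integer>> parts = split(docs, new double[] {0.9, 0.1, 0.0}, 42L);
        System.out.println("train=" + parts.get(0).size()
                + " test=" + parts.get(1).size()
                + " validation=" + parts.get(2).size());
    }
}
```

Shuffling before cutting matters when the source files are ordered by category, because a plain cut would otherwise leave whole categories out of the training portion.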

  9. Evaluation (50)

  10. Evaluation (1000)

  11. Test Yourselves
    Text Mining

  12. Summary

  13. Homework
    1. Make the source code for the BBC dataset work.
    2. Create a dataset (your choice) with the content of:
    A) Papers you have read
    B) Web pages or blogs you are interested in
    C) Course notes
    D) Etc.
    3. Create a model. Justify your selection of the number of topics (k).
    4. Try diverse configurations.
    5. Describe your results: good, bad, expected, unexpected.
    6. Submit a PDF file and the dataset if possible; otherwise, submit at least a sample of it.

  14. Questions

  15. SER 594 Software Engineering for Machine Learning
    Javier Gonzalez-Sanchez, Ph.D.
    [email protected]
    Spring 2022
    Copyright. These slides can only be used as study material for the class CSE205 at Arizona State University.
    They cannot be distributed or used for another purpose.
