
Optimizing Topic Models for Classification Tasks

Tommy
November 09, 2019


Topic models are hard to evaluate because the variables they measure are latent, i.e. unobserved. Most topic model evaluation has focused on coherence metrics that mimic human judgment or preference. Yet topic models are often used for more than the pleasure of the human eye: they frequently support classification tasks, where ground truth exists. In this research, I compare LDA and LSA on how well they support a simple classification task. I use a Bayesian optimization service, SigOpt, to help choose the hyperparameters for each model, allowing each to be at its best. I optimize both for coherence of the topic model and for classification accuracy on held-out data. All code is written in R, primarily using the textmineR, randomForest, and SigOptR packages, and is available on GitHub.


Transcript

  1. Agenda • Review: Topic Model Basics • The Hypothesis • The Experiment • The Results • So What?
  2. Agenda • Review: Topic Model Basics • The Hypothesis • The Experiment • The Results • So What?
  3. Latent Dirichlet Allocation (LDA) [diagram: DTM ≈ (Documents × Topics) × (Topics × Words)] • Three hyperparameters: k - number of topics; α - tunes the distribution of topics over documents; β - tunes the distribution of words over topics
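The three knobs on slide 3 map directly onto arguments of textmineR's Gibbs-sampling LDA fitter. A minimal sketch, assuming `dtm` is an existing document-term matrix; the specific values for `k`, `alpha`, and `beta` are placeholders, not the deck's tuned settings:

```r
library(textmineR)

# Fit LDA; the three hyperparameters from the slide appear as arguments.
lda_model <- FitLdaModel(
  dtm        = dtm,
  k          = 50,    # k: number of topics
  alpha      = 0.1,   # alpha: shapes the distribution of topics over documents
  beta       = 0.05,  # beta: shapes the distribution of words over topics
  iterations = 500    # Gibbs sampling iterations
)

# Probabilistic coherence, one of the two metrics optimized later in the deck.
coherence <- mean(CalcProbCoherence(phi = lda_model$phi, dtm = dtm))
```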
  4. Latent Semantic Analysis (LSA) • Only one hyperparameter: k - number of topics [diagram: DTM (w/ TF-IDF) = (Documents × Topics) × Singular Values × (Topics × Words)]
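The decomposition on slide 4 is a truncated SVD of a TF-IDF-weighted DTM. A base-R illustration (not the deck's exact code, which uses textmineR), again assuming an existing `dtm`:

```r
k   <- 50
m   <- as.matrix(dtm)                          # dense copy for base-R svd
idf <- log(nrow(m) / pmax(colSums(m > 0), 1))  # inverse document frequency
tfidf <- t(t(m) * idf)                         # weight each term column by its idf

s <- svd(tfidf, nu = k, nv = k)                # truncated SVD: k is the only knob
doc_topics <- s$u %*% diag(s$d[1:k])           # documents embedded in topic space
```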
  5. Evaluating topic models is hard • Latent (i.e. unseen) variables • People don’t write that way anyway • Coherence? • Goodness of fit? • Task-specific accuracy?
  6. Agenda • Review: Topic Model Basics • The Hypothesis • The Experiment • The Results • So What?
  7. Agenda • Review: Topic Model Basics • The Hypothesis • The Experiment • The Results • So What?
  8. The Experiment • Grab a well-known text classification data set (20 Newsgroups) • Use the SigOpt hyperparameter optimization service to optimize your topic model for classification (and coherence as a control) • Compare LDA and LSA • Learn something about the relationships between the hyperparameters and classification accuracy vs. topic coherence
  9. The Stack • textmineR - text vectorization and topic modeling • randomForest - good ol’ workhorse of classification • SigOptR - R API to the SigOpt service • cloudml - R API to the Google Cloud ML service • magrittr, stringr, parallel - do other stuff good too
  10. Data prep process • Remove numbers and punctuation • Tokenize unigrams • Remove stop words • Remove infrequent words • Split into training sets and a test set
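Most of the prep steps above are arguments to a single textmineR call. A sketch, assuming `docs` is a character vector of raw newsgroup posts (the minimum-frequency threshold of 5 is illustrative, not the deck's value):

```r
library(textmineR)

dtm <- CreateDtm(
  doc_vec            = docs,
  ngram_window       = c(1, 1),                     # tokenize unigrams only
  stopword_vec       = stopwords::stopwords("en"),  # remove stop words
  lower              = TRUE,
  remove_punctuation = TRUE,                        # remove punctuation
  remove_numbers     = TRUE                         # remove numbers
)

# Remove infrequent words (threshold chosen for illustration).
dtm <- dtm[, colSums(dtm) >= 5]
```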
  11. Wonky Train/Test Splits • 1,000 obs. to train a topic model • 5,665 obs. to train a classifier • 6,665 obs. to calculate out-of-sample metrics & report out to SigOpt for optimization • 6,667 obs. as a hold out for the final evaluation (19,997 obs. total)
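The four-way split sizes sum to the full 19,997 observations. A sketch of one way to carve them out (the sampling scheme here is an assumption; the slide gives only the sizes):

```r
set.seed(42)
idx <- sample(19997)                # shuffle all observation indices

topic_train <- idx[1:1000]          # 1,000  - train the topic model
clf_train   <- idx[1001:6665]       # 5,665  - train the classifier
tune_set    <- idx[6666:13330]      # 6,665  - out-of-sample metrics for SigOpt
holdout     <- idx[13331:19997]     # 6,667  - final evaluation
```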
  12. Optimization with SigOptR 1. Set your API token 2. Declare an experiment 3. Declare a function to build your model(s) and report your metric(s) 4. Loop over the following: A. Get hyperparameter suggestions from SigOpt B. Use those to train your model(s) C. Report your metric(s) to SigOpt 5. Get the optimal results back
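The five steps above can be sketched with the SigOptR package (function names as documented by that package). Here `build_and_score()` is a hypothetical stand-in for step 3, and the parameter bounds and loop length are placeholders:

```r
library(SigOptR)
Sys.setenv(SIGOPT_API_TOKEN = "...")                  # 1. set your API token

experiment <- create_experiment(list(                 # 2. declare an experiment
  name = "LDA for classification",
  parameters = list(
    list(name = "k",     type = "int",    bounds = list(min = 10,   max = 500)),
    list(name = "alpha", type = "double", bounds = list(min = 0.01, max = 1)),
    list(name = "beta",  type = "double", bounds = list(min = 0.01, max = 1))
  )
))

for (i in 1:60) {                                     # 4. the optimization loop
  suggestion <- create_suggestion(experiment$id)      #    A. get suggestions
  acc <- build_and_score(suggestion$assignments)      #    B. train your model(s)
  create_observation(experiment$id, list(             #    C. report your metric(s)
    suggestion = suggestion$id,
    value      = acc
  ))
}
```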
  13. [Code slide for step 3]

  14. In practice… One week of runtime later, realize that: (a) you were trying to build LDA on the whole data set, and (b) you have a pretty bad bottleneck in the Gibbs sampler you’re so proud of

  15. In practice… Fix the training set and get results in a few hours. (Also make plans to overhaul your Gibbs sampler.)
  16. Agenda • Review: Topic Model Basics • The Hypothesis • The Experiment • The Results • So What?
  17. Agenda • Review: Topic Model Basics • The Hypothesis • The Experiment • The Results • So What?
  18. • LDA pretty much kills LSA in terms of accuracy • LSA pretty much kills LDA in terms of speed • With LDA there isn’t much of a tradeoff between accuracy and coherence • That said… k is by far the most important hyperparameter for accuracy • All three hyperparameters are equally important for coherence
  19. • I think SigOpt is pretty nifty. • I think LDA is nifty too. • I have some controversial thoughts on LDA, embeddings, and transfer learning. (Ask me during the break.)
  20. Come work with me! • Artificial Intelligence Product Architect • Technology Architect - Cloud & Software Operations • Software Engineer (Testing) • UX Designer & Strategist, IQT Labs • Data Science Interns