Optimizing Topic Models for Classification Tasks

Tommy
November 09, 2019

Topic models are hard to evaluate because the variables they measure are latent, i.e. unobserved. Most topic model evaluation has focused on coherence metrics that mimic human judgment or preference. Yet topic models are rarely used merely for the pleasure of the human eye; more often they support classification tasks, where ground truth exists. In this research, I compare LDA and LSA on how well they support a simple classification task. I use a Bayesian optimization service, SigOpt, to help choose the hyperparameters for each model, allowing each to be at its best. I optimize both for coherence of the topic model and for classification accuracy on held-out data. All code is written in R, using primarily the textmineR, randomForest, and SigOptR packages, and is available on GitHub.


Transcript

  1. Optimizing Topic Models for Classification Tasks. Tommy Jones. Delivered at the DC R Conference, November 8, 2019.
  2. Agenda • Review: Topic Model Basics • The Hypothesis • The Experiment • The Results • So What?
  3. Agenda • Review: Topic Model Basics • The Hypothesis • The Experiment • The Results • So What?
  4. Latent Dirichlet Allocation (LDA) • (Diagram: DTM ≈ documents-over-topics matrix × topics-over-words matrix) • Three hyperparameters: k - number of topics; α - tunes distribution of topics over documents; β - tunes distribution of words over topics
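The three hyperparameters on this slide map directly onto arguments of textmineR's `FitLdaModel()`. A minimal sketch on a toy corpus (the documents and the values `k = 2`, `alpha = 0.1`, `beta = 0.05` are illustrative, not the settings used in the talk):

```r
library(textmineR)

# toy corpus; any named character vector of documents works
docs <- c(a = "the cat sat on the mat",
          b = "dogs and cats are pets",
          c = "the stock market fell today")

# document-term matrix of unigrams
dtm <- CreateDtm(doc_vec = docs, doc_names = names(docs),
                 ngram_window = c(1, 1))

# the three hyperparameters from the slide
lda_model <- FitLdaModel(dtm = dtm,
                         k = 2,          # number of topics
                         iterations = 200,
                         alpha = 0.1,    # topics over documents
                         beta = 0.05)    # words over topics

dim(lda_model$theta)  # documents x topics
dim(lda_model$phi)    # topics x words
```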
  5. Latent Semantic Analysis (LSA) • Only one hyperparameter: k - number of topics • (Diagram: SVD of the TF-IDF-weighted DTM into documents-over-topics, singular values, and topics-over-words matrices)
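LSA's single hyperparameter k is the number of singular vectors kept when decomposing the TF-IDF-weighted DTM. A sketch with textmineR's `FitLsaModel()` (toy corpus and `k = 2` are illustrative; the TF-IDF reweighting follows the pattern from the textmineR documentation):

```r
library(textmineR)

docs <- c(a = "the cat sat on the mat",
          b = "dogs and cats are pets",
          c = "the stock market fell today")

dtm <- CreateDtm(doc_vec = docs, doc_names = names(docs))

# reweight the DTM with TF-IDF before the decomposition
tf <- TermDocFreq(dtm)
tfidf <- t(t(dtm[, tf$term]) * tf$idf)

# k is the only hyperparameter: number of topics / singular vectors kept
lsa_model <- FitLsaModel(dtm = tfidf, k = 2)

dim(lsa_model$theta)  # documents x topics
dim(lsa_model$phi)    # topics x words
```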
  6. Evaluating topic models is hard • Latent (i.e. unseen) variables • People don’t write that way anyway • Coherence? • Goodness of fit? • Task-specific accuracy?
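One concrete version of the "coherence" option is textmineR's probabilistic coherence, which scores each topic by how often its top words co-occur across documents. A sketch (toy corpus; `M = 5` top words per topic is the package default):

```r
library(textmineR)

docs <- c(a = "cats and dogs are pets",
          b = "dogs chase cats",
          c = "stocks and bonds are investments",
          d = "the market traded stocks")

dtm <- CreateDtm(doc_vec = docs, doc_names = names(docs))
model <- FitLdaModel(dtm = dtm, k = 2, iterations = 200)

# one coherence score per topic; average them for a model-level metric
coh <- CalcProbCoherence(phi = model$phi, dtm = dtm, M = 5)
mean(coh)
```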
  7. It’s 2019. Why are you still talking about LDA?

  8. It’s 2019. Why are you still talking about LDA?

  9. https://xkcd.com/2173/

  10. Agenda • Review: Topic Model Basics • The Hypothesis • The Experiment • The Results • So What?
  11. What do we use these models for anyway?

  12. Agenda • Review: Topic Model Basics • The Hypothesis • The Experiment • The Results • So What?
  13. The Experiment • Grab a well-known data set for text classification (20 Newsgroups) • Use the SigOpt hyperparameter optimization service to optimize your topic model for classification (and coherence as a control) • Compare LDA and LSA • Learn something about the relationships between the hyperparameters and classification accuracy vs. topic coherence
  14. The Stack • textmineR - text vectorization and topic modeling • randomForest - good ol’ workhorse of classification • SigOptR - R API to the SigOpt service • cloudml - R API to the Google Cloud ML service • magrittr, stringr, parallel - do other stuff good too
  15. On Optimization & the Choice of SigOpt

  16. Process

  17. Process: Data prep • Remove numbers and punctuation • Tokenize unigrams • Remove stop words • Remove infrequent words • Split into training sets and test set
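The prep steps on this slide map onto `CreateDtm()` arguments plus a term-frequency filter; a sketch (toy documents, and the "appears in at least 2 documents" cutoff, are illustrative):

```r
library(textmineR)

docs <- c(d1 = "Document 1 has some text, numbers like 42, and punctuation!",
          d2 = "Document 2 repeats some text and some words",
          d3 = "A third document with rare terms")

# tokenize unigrams; lowercase; strip numbers, punctuation, and stop words
dtm <- CreateDtm(doc_vec = docs,
                 doc_names = names(docs),
                 ngram_window = c(1, 1),
                 lower = TRUE,
                 remove_punctuation = TRUE,
                 remove_numbers = TRUE,
                 stopword_vec = stopwords::stopwords("en"))

# drop infrequent words, e.g. those appearing in fewer than 2 documents
dtm <- dtm[, colSums(dtm > 0) >= 2]
```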
  18. Wonky Train/Test Splits • 1,000 obs. to train a topic model • 5,665 obs. to train a classifier • 6,665 obs. to calculate out-of-sample metrics & report out to SigOpt for optimization • 6,667 obs. as a hold-out for the final evaluation • 19,997 obs. total
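The four split sizes on this slide sum to the 19,997 documents in the data set. A base-R sketch of an analogous four-way split (the sizes come from the slide; the random seed and index vector are illustrative):

```r
set.seed(42)
n <- 19997
shuffled <- sample(n)  # random permutation of document indices

sizes <- c(topic_model = 1000,  # train the topic model
           classifier  = 5665,  # train the classifier
           tuning      = 6665,  # out-of-sample metrics reported to SigOpt
           holdout     = 6667)  # final evaluation
stopifnot(sum(sizes) == n)

# carve the shuffled indices into the four chunks
splits <- split(shuffled, rep(names(sizes), times = sizes))
lengths(splits)
```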
  19. Process Optimization loop

  20. Optimization with SigOptR 1. Set your API token 2. Declare an experiment 3. Declare a function to build your model(s) and report your metric(s) 4. Loop over the following: A. Get hyperparameter suggestions from SigOpt B. Use those to train your model(s) C. Report your metric(s) to SigOpt 5. Get the optimal results back
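The steps on this slide sketch roughly as follows. This is an untested outline, not the talk's actual code: it assumes a valid SigOpt API token, the parameter bounds are made up, and `train_and_score()` is a hypothetical function standing in for step 3 (fit the topic model plus classifier, return held-out accuracy). The function names follow the SigOptR package (`create_experiment`, `create_suggestion`, `create_observation`).

```r
library(SigOptR)

# 1. set your API token
Sys.setenv(SIGOPT_API_TOKEN = "YOUR_TOKEN_HERE")

# 2. declare an experiment over the LDA hyperparameters
experiment <- create_experiment(list(
  name = "LDA classification tuning",
  parameters = list(
    list(name = "k",     type = "int",    bounds = list(min = 2,    max = 400)),
    list(name = "alpha", type = "double", bounds = list(min = 0.01, max = 1)),
    list(name = "beta",  type = "double", bounds = list(min = 0.01, max = 1))
  )
))

# 4. the optimization loop
for (i in 1:60) {
  suggestion <- create_suggestion(experiment$id)  # A. get suggested hyperparameters
  params <- suggestion$assignments
  accuracy <- train_and_score(params$k, params$alpha, params$beta)  # B. hypothetical
  create_observation(experiment$id, list(         # C. report the metric back
    suggestion = suggestion$id,
    value = accuracy
  ))
}
# 5. the best assignments are then available from the SigOpt dashboard/API
```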
  21. 1. 2.

  22. 3.

  23. 4. 5.

  24. In practice… Submit your job to Google Cloud ML and wait for results
  25. In practice… One week of runtime later, realize that: (a) You were trying to build LDA on the whole data set (b) You have a pretty bad bottleneck in the Gibbs sampler you’re so proud of
  26. In practice… Fix the training set and get results in a few hours. (Also make plans to overhaul your Gibbs sampler.)
  27. Agenda • Review: Topic Model Basics • The Hypothesis • The Experiment • The Results • So What?
  28. Pareto Frontiers

  29. Parameter Importance with LDA

  30. Accuracy Coherence

  31. Agenda • Review: Topic Model Basics • The Hypothesis • The Experiment • The Results • So What?
  32. • LDA pretty much kills LSA in terms of accuracy • LSA pretty much kills LDA in terms of speed • With LDA there isn’t much of a tradeoff between accuracy and coherence • That said… k is by far the most important hyperparameter for accuracy • All 3 hyperparameters are equally important for coherence
  33. • I think SigOpt is pretty nifty. • I think LDA is nifty too. • I have some controversial thoughts on LDA, embeddings, and transfer learning. (Ask me during the break.)
  34. Come work with me! • Artificial Intelligence Product Architect • Technology Architect - Cloud & Software Operations • Software Engineer (Testing) • UX Designer & Strategist, IQT Labs • Data Science Interns
  35. Thank You • jones.thos.w@gmail.com • twitter: @thos_jones • http://www.biasedestimates.com • https://GitHub.com/TommyJones