Optimizing Topic Models for Classification Tasks

Tommy
November 09, 2019

Topic models are hard to evaluate because the variables they measure are latent, i.e. unobserved. Most topic model evaluation has focused on coherence metrics that mimic human judgment or preference. Yet topic models are rarely used merely for the pleasure of the human eye; more often they support classification tasks, where ground truth exists. In this research, I compare LDA and LSA on how well they support a simple classification task. I use a Bayesian optimization service, SigOpt, to help choose the hyperparameters for each model, allowing each to be at its best. I optimize both for coherence of the topic model and for classification accuracy on held-out data. All code is written in R, using primarily the textmineR, randomForest, and SigOptR packages, and is available on GitHub.


Transcript

  1. Optimizing Topic Models for Classification Tasks. Tommy Jones. Delivered at the DC R Conference, November 8, 2019.
  2. Agenda • Review: Topic Model Basics • The Hypothesis • The Experiment • The Results • So What?
  3. Agenda • Review: Topic Model Basics • The Hypothesis • The Experiment • The Results • So What?
  4. Latent Dirichlet Allocation (LDA) • (Diagram: DTM ≈ documents-over-topics matrix × topics-over-words matrix) • Three hyperparameters: k - number of topics; α - tunes distribution of topics over documents; β - tunes distribution of words over topics
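The three hyperparameters on this slide map directly onto arguments of textmineR's `FitLdaModel()`. A minimal sketch on a toy corpus (the documents and the values `k = 2`, `alpha = 0.1`, `beta = 0.05` are illustrative, not the settings used in the talk):

```r
library(textmineR)

# toy corpus; any named character vector of documents works
docs <- c(a = "the cat sat on the mat",
          b = "dogs and cats are pets",
          c = "the stock market fell today")

# document-term matrix of unigrams
dtm <- CreateDtm(doc_vec = docs, doc_names = names(docs),
                 ngram_window = c(1, 1))

# the three hyperparameters from the slide
lda_model <- FitLdaModel(dtm = dtm,
                         k = 2,          # number of topics
                         iterations = 200,
                         alpha = 0.1,    # topics over documents
                         beta = 0.05)    # words over topics

dim(lda_model$theta)  # documents x topics
dim(lda_model$phi)    # topics x words
```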
  5. Latent Semantic Analysis (LSA) • Only one hyperparameter: k - number of topics • (Diagram: SVD of the TF-IDF-weighted DTM into documents-over-topics, singular values, and topics-over-words matrices)
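LSA's single hyperparameter k is the number of singular vectors kept when decomposing the TF-IDF-weighted DTM. A sketch with textmineR's `FitLsaModel()` (toy corpus and `k = 2` are illustrative; the TF-IDF reweighting follows the pattern from the textmineR documentation):

```r
library(textmineR)

docs <- c(a = "the cat sat on the mat",
          b = "dogs and cats are pets",
          c = "the stock market fell today")

dtm <- CreateDtm(doc_vec = docs, doc_names = names(docs))

# reweight the DTM with TF-IDF before the decomposition
tf <- TermDocFreq(dtm)
tfidf <- t(t(dtm[, tf$term]) * tf$idf)

# k is the only hyperparameter: number of topics / singular vectors kept
lsa_model <- FitLsaModel(dtm = tfidf, k = 2)

dim(lsa_model$theta)  # documents x topics
dim(lsa_model$phi)    # topics x words
```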
  6. Evaluating topic models is hard • Latent (i.e. unseen) variables • People don’t write that way anyway • Coherence? • Goodness of fit? • Task-specific accuracy?
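One concrete version of the "coherence" option is textmineR's probabilistic coherence, which scores each topic by how often its top words co-occur across documents. A sketch (toy corpus; `M = 5` top words per topic is the package default):

```r
library(textmineR)

docs <- c(a = "cats and dogs are pets",
          b = "dogs chase cats",
          c = "stocks and bonds are investments",
          d = "the market traded stocks")

dtm <- CreateDtm(doc_vec = docs, doc_names = names(docs))
model <- FitLdaModel(dtm = dtm, k = 2, iterations = 200)

# one coherence score per topic; average them for a model-level metric
coh <- CalcProbCoherence(phi = model$phi, dtm = dtm, M = 5)
mean(coh)
```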
  7. It’s 2019. Why are you still talking about LDA?

  8. It’s 2019. Why are you still talking about LDA?

  9. https://xkcd.com/2173/

  10. Agenda • Review: Topic Model Basics • The Hypothesis • The Experiment • The Results • So What?
  11. What do we use these models for anyway?

  12. Agenda • Review: Topic Model Basics • The Hypothesis • The Experiment • The Results • So What?
  13. The Experiment • Grab a well-known data set for text classification (20 Newsgroups) • Use the SigOpt hyperparameter optimization service to optimize your topic model for classification (and coherence as a control) • Compare LDA and LSA • Learn something about the relationships between the hyperparameters and classification accuracy vs. topic coherence
  14. The Stack • textmineR - text vectorization and topic modeling • randomForest - good ol’ workhorse of classification • SigOptR - R API to the SigOpt service • cloudml - R API to the Google Cloud ML service • magrittr, stringr, parallel - do other stuff good too
  15. On Optimization & the Choice of SigOpt

  16. Process

  17. Process: Data prep • Remove numbers and punctuation • Tokenize unigrams • Remove stop words • Remove infrequent words • Split into training sets and test set
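The prep steps on this slide map onto `CreateDtm()` arguments plus a term-frequency filter; a sketch (toy documents, and the "appears in at least 2 documents" cutoff, are illustrative):

```r
library(textmineR)

docs <- c(d1 = "Document 1 has some text, numbers like 42, and punctuation!",
          d2 = "Document 2 repeats some text and some words",
          d3 = "A third document with rare terms")

# tokenize unigrams; lowercase; strip numbers, punctuation, and stop words
dtm <- CreateDtm(doc_vec = docs,
                 doc_names = names(docs),
                 ngram_window = c(1, 1),
                 lower = TRUE,
                 remove_punctuation = TRUE,
                 remove_numbers = TRUE,
                 stopword_vec = stopwords::stopwords("en"))

# drop infrequent words, e.g. those appearing in fewer than 2 documents
dtm <- dtm[, colSums(dtm > 0) >= 2]
```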
  18. Wonky Train/Test Splits • 1,000 obs. to train a topic model • 5,665 obs. to train a classifier • 6,665 obs. to calculate out-of-sample metrics & report out to SigOpt for optimization • 6,667 obs. as a hold-out for the final evaluation • 19,997 obs. total
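The four split sizes on this slide sum to the 19,997 documents in the data set. A base-R sketch of an analogous four-way split (the sizes come from the slide; the random seed and index vector are illustrative):

```r
set.seed(42)
n <- 19997
shuffled <- sample(n)  # random permutation of document indices

sizes <- c(topic_model = 1000,  # train the topic model
           classifier  = 5665,  # train the classifier
           tuning      = 6665,  # out-of-sample metrics reported to SigOpt
           holdout     = 6667)  # final evaluation
stopifnot(sum(sizes) == n)

# carve the shuffled indices into the four chunks
splits <- split(shuffled, rep(names(sizes), times = sizes))
lengths(splits)
```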
  19. Process Optimization loop

  20. Optimization with SigOptR 1. Set your API token 2. Declare an experiment 3. Declare a function to build your model(s) and report your metric(s) 4. Loop over the following: A. Get hyperparameter suggestions from SigOpt B. Use those to train your model(s) C. Report your metric(s) to SigOpt 5. Get the optimal results back
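The steps on this slide sketch roughly as follows. This is an untested outline, not the talk's actual code: it assumes a valid SigOpt API token, the parameter bounds are made up, and `train_and_score()` is a hypothetical function standing in for step 3 (fit the topic model plus classifier, return held-out accuracy). The function names follow the SigOptR package (`create_experiment`, `create_suggestion`, `create_observation`).

```r
library(SigOptR)

# 1. set your API token
Sys.setenv(SIGOPT_API_TOKEN = "YOUR_TOKEN_HERE")

# 2. declare an experiment over the LDA hyperparameters
experiment <- create_experiment(list(
  name = "LDA classification tuning",
  parameters = list(
    list(name = "k",     type = "int",    bounds = list(min = 2,    max = 400)),
    list(name = "alpha", type = "double", bounds = list(min = 0.01, max = 1)),
    list(name = "beta",  type = "double", bounds = list(min = 0.01, max = 1))
  )
))

# 4. the optimization loop
for (i in 1:60) {
  suggestion <- create_suggestion(experiment$id)  # A. get suggested hyperparameters
  params <- suggestion$assignments
  accuracy <- train_and_score(params$k, params$alpha, params$beta)  # B. hypothetical
  create_observation(experiment$id, list(         # C. report the metric back
    suggestion = suggestion$id,
    value = accuracy
  ))
}
# 5. the best assignments are then available from the SigOpt dashboard/API
```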
  21. 1. 2.

  22. 3.

  23. 4. 5.

  24. In practice… Submit your job to Google Cloud ML and wait for results
  25. In practice… One week of runtime later, realize that: (a) You were trying to build LDA on the whole data set (b) You have a pretty bad bottleneck in the Gibbs sampler you’re so proud of
  26. In practice… Fix the training set and get results in a few hours. (Also make plans to overhaul your Gibbs sampler.)
  27. Agenda • Review: Topic Model Basics • The Hypothesis • The Experiment • The Results • So What?
  28. Pareto Frontiers

  29. Parameter Importance with LDA

  30. Accuracy Coherence

  31. Agenda • Review: Topic Model Basics • The Hypothesis • The Experiment • The Results • So What?
  32. • LDA pretty much kills LSA in terms of accuracy • LSA pretty much kills LDA in terms of speed • With LDA there isn’t much of a tradeoff between accuracy and coherence • That said… k is by far the most important hyperparameter for accuracy • All 3 hyperparameters are equally important for coherence
  33. • I think SigOpt is pretty nifty. • I think LDA is nifty too. • I have some controversial thoughts on LDA, embeddings, and transfer learning. (Ask me during the break.)
  34. Come work with me! • Artificial Intelligence Product Architect • Technology Architect - Cloud & Software Operations • Software Engineer (Testing) • UX Designer & Strategist, IQT Labs • Data Science Interns
  35. Thank You • jones.thos.w@gmail.com • twitter: @thos_jones • http://www.biasedestimates.com • https://GitHub.com/TommyJones