
Optimizing Topic Models for Classification Tasks

Tommy
November 09, 2019

Topic models are hard to evaluate because the variables they measure are latent, i.e. unobserved. Most topic model evaluation has focused on coherence metrics that mimic human judgment or preference. Yet topic models are rarely used merely for the pleasure of the human eye. Instead, they are often used to support classification tasks, where ground truth exists. In this research, I compare LDA and LSA on how well they support a simple classification task. I use a Bayesian optimization service, SigOpt, to help choose the hyperparameters for each model, allowing each to be at its best. I optimize for both coherence of the topic model and classification accuracy on held-out data. All code is written in R, primarily using the textmineR, randomForest, and SigOptR packages, and is available on GitHub.


Transcript

  1. Optimizing Topic Models
    for Classification Tasks
    Tommy Jones
    Delivered at the DC R Conference
    November 8, 2019


  2. Agenda
    • Review: Topic Model Basics

    • The Hypothesis

    • The Experiment

    • The Results

    • So What?


  3. Agenda
    • Review: Topic Model Basics
    • The Hypothesis

    • The Experiment

    • The Results

    • So What?


  4. Latent Dirichlet Allocation
    (LDA)

    [Diagram: the DTM (documents × words) factored into a documents × topics matrix times a topics × words matrix]

    • Three hyperparameters:
    k - Number of topics
    α - Tunes distribution of topics over documents
    β - Tunes distribution of words over topics
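With textmineR, fitting LDA with these three hyperparameters looks roughly like the following (a minimal sketch; the `dtm` object is assumed to come from an earlier preprocessing step, and the specific values of k, alpha, and beta are placeholders, not the tuned ones from the talk):

```r
library(textmineR)

# dtm: a document-term matrix, e.g. from textmineR::CreateDtm()
# The hyperparameter values below are illustrative placeholders to be tuned
lda_model <- FitLdaModel(
  dtm        = dtm,
  k          = 50,    # k: number of topics
  iterations = 500,   # Gibbs sampling iterations
  alpha      = 0.1,   # alpha: distribution of topics over documents
  beta       = 0.05   # beta: distribution of words over topics
)

# lda_model$theta: documents x topics; lda_model$phi: topics x words
```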


  5. Latent Semantic Analysis (LSA)
    • Only one hyperparameter:
    k - Number of topics
    [Diagram: the TF-IDF-weighted DTM (documents × words) decomposed into a documents × topics matrix, a diagonal matrix of singular values, and a topics × words matrix]
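In textmineR terms, that decomposition is roughly the following (a sketch; the TF-IDF weighting follows the pattern in textmineR's documentation, and k = 100 is an arbitrary placeholder):

```r
library(textmineR)

# TF-IDF weight the DTM, then fit a truncated SVD
tf_mat <- TermDocFreq(dtm)
tfidf  <- t(t(dtm[, tf_mat$term]) * tf_mat$idf)  # scale each term column by its idf

lsa_model <- FitLsaModel(dtm = tfidf, k = 100)   # k: number of topics

# lsa_model$theta: documents x topics; lsa_model$sv: singular values
```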



  6. Evaluating topic models is
    hard
    • Latent (i.e. unseen) variables

    • People don’t write that way anyway

    • Coherence?

    • Goodness of fit?

    • Task-specific accuracy?


  7. It’s 2019. Why are you
    still talking about LDA?


  8. It’s 2019. Why are you
    still talking about LDA?


  9. https://xkcd.com/2173/


  10. Agenda
    • Review: Topic Model Basics

    • The Hypothesis
    • The Experiment

    • The Results

    • So What?


  11. What do we use these
    models for anyway?


  12. Agenda
    • Review: Topic Model Basics

    • The Hypothesis

    • The Experiment
    • The Results

    • So What?


  13. The Experiment
    • Grab a well-known data set for text classification

    (20 Newsgroups)

    • Use the SigOpt hyperparameter optimization service to
    optimize your topic model for classification

    (and coherence as a control)

    • Compare LDA and LSA

    • Learn something about the relationships between the
    hyperparameters and classification accuracy vs. topic
    coherence
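The classification leg of the experiment can be sketched with randomForest on the document-topic distributions (the variable names here are hypothetical stand-ins for the train/test objects, not names from the talk's code):

```r
library(randomForest)

# theta_train / theta_test: documents x topics matrices from a fitted topic model
# y_train / y_test: factor of 20 Newsgroups labels
rf <- randomForest(x = theta_train, y = y_train, ntree = 500)

preds    <- predict(rf, newdata = theta_test)
accuracy <- mean(preds == y_test)  # out-of-sample metric reported to SigOpt
```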


  14. The Stack
    • textmineR 

    - text vectorization and topic modeling

    • randomForest 

    - good ol’ workhorse of classification

    • SigOptR 

    - R API to the SigOpt service

    • cloudml 

    - R API to Google Cloud ML service

    • magrittr, stringr, parallel 

    - do other stuff good too


  15. On Optimization
    & the Choice of SigOpt


  16. Process


  17. Process
    • Remove numbers and punctuation
    • Tokenize unigrams
    • Remove stop words
    • Remove infrequent words
    • Split into training sets and test set
    Data prep.
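Those steps map almost one-to-one onto textmineR's CreateDtm (a sketch; `texts` and the cutoff of 5 documents for infrequent words are illustrative assumptions):

```r
library(textmineR)

# texts: a named character vector, one element per document
dtm <- CreateDtm(
  doc_vec            = texts,
  doc_names          = names(texts),
  ngram_window       = c(1, 1),  # tokenize unigrams only
  remove_punctuation = TRUE,
  remove_numbers     = TRUE      # English stop words are removed by default
)

# remove infrequent words, e.g. those appearing in fewer than 5 documents
dtm <- dtm[, colSums(dtm > 0) >= 5]
```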


  18. Wonky Train/Test Splits
    1,000 obs. to train a topic model
    5,665 obs. to train a classifier
    6,665 obs. to calculate out-of-sample metrics

    & report out to SigOpt for optimization
    6,667 obs. as a hold out for the final evaluation

    19,997 obs. total
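In base R, that split might be drawn like this (the seed and index order are illustrative; only the group sizes come from the slide):

```r
set.seed(42)
n   <- 19997        # total 20 Newsgroups documents
idx <- sample(n)    # shuffle once, then carve into four disjoint groups

topic_idx <- idx[1:1000]         # 1,000 obs.: train the topic model
class_idx <- idx[1001:6665]      # 5,665 obs.: train the classifier
tune_idx  <- idx[6666:13330]     # 6,665 obs.: out-of-sample metrics for SigOpt
final_idx <- idx[13331:19997]    # 6,667 obs.: final hold-out evaluation

stopifnot(length(topic_idx) + length(class_idx) +
          length(tune_idx)  + length(final_idx) == n)
```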


  19. Process
    Optimization loop


  20. Optimization with SigOptR
    1. Set your API token

    2. Declare an experiment

    3. Declare a function to build your model(s) and report your metric(s)

    4. Loop over the following:

    A. Get hyperparameter suggestions from SigOpt

    B. Use those to train your model(s)

    C. Report your metric(s) to SigOpt

    5. Get the optimal results back
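Condensed into code, the loop looks roughly like this (a sketch against the SigOptR package as I understand its API; `eval_model()` is a hypothetical placeholder for the function from step 3, and the parameter bounds and budget of 60 runs are illustrative):

```r
library(SigOptR)

# 1. Set your API token
Sys.setenv(SIGOPT_API_TOKEN = "YOUR_TOKEN")

# 2. Declare an experiment over LDA's three hyperparameters
experiment <- create_experiment(list(
  name = "LDA for classification",
  parameters = list(
    list(name = "k",     type = "int",    bounds = list(min = 10,   max = 500)),
    list(name = "alpha", type = "double", bounds = list(min = 0.01, max = 1)),
    list(name = "beta",  type = "double", bounds = list(min = 0.01, max = 1))
  )
))

# 3. eval_model(): your own function that fits the topic model + classifier
#    and returns the out-of-sample metric (defined elsewhere)

# 4. The optimization loop
for (i in 1:60) {
  suggestion <- create_suggestion(experiment$id)       # A. get suggestions
  metric     <- eval_model(suggestion$assignments)     # B. train your model(s)
  create_observation(list(suggestion = suggestion$id,  # C. report the metric
                          value      = metric))
}

# 5. Get the optimal results back
best <- fetch_experiment(experiment$id)$progress$best_observation
```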


  21. [Code screenshot: steps 1 and 2]


  22. [Code screenshot: step 3]


  23. [Code screenshot: steps 4 and 5]


  24. In practice…
    Submit your job to Google Cloud ML 

    and wait for results


  25. In practice…
    One week of runtime later
    realize that:


    (a) You were trying to build
    LDA on the whole data set


    (b) You have a pretty bad
    bottleneck in the Gibbs
    sampler you’re so proud
    of



  26. In practice…
    Fix the training set and get results in a few hours. 

    (Also make plans to overhaul your Gibbs 

    sampler.)


  27. Agenda
    • Review: Topic Model Basics

    • The Hypothesis

    • The Experiment

    • The Results
    • So What?


  28. Pareto Frontiers


  29. Parameter Importance
    with LDA


  30. [Charts: parameter importance for Accuracy and for Coherence]


  31. Agenda
    • Review: Topic Model Basics

    • The Hypothesis

    • The Experiment

    • The Results

    • So What?


  32. • LDA pretty much kills LSA in
    terms of accuracy

    • LSA pretty much kills LDA in
    terms of speed

    • With LDA there isn’t much of a
    tradeoff between accuracy and
    coherence

    • That said… k is by far the most
    important hyperparameter for accuracy

    • All 3 hyperparameters are
    equally important for coherence


  33. • I think SigOpt is pretty nifty.

    • I think LDA is nifty too.

    • I have some controversial
    thoughts on LDA,
    embeddings, and transfer
    learning. (Ask me during the
    break.)


  34. Come work with me!
    • Artificial Intelligence Product
    Architect

    • Technology Architect - Cloud
    & Software Operations

    • Software Engineer (Testing)

    • UX Designer & Strategist, IQT
    Labs

    • Data Science Interns


  35. Thank You
    [email protected]
    • twitter: @thos_jones
    • http://www.biasedestimates.com
    • https://GitHub.com/TommyJones
