
Lecture: NLP, Topic Modeling and Applications

Code: https://github.com/polymorpher/bittiger
Course (My Lectures + Tutorials): https://www.bittiger.io/livecourses/YQCMuXwL7fhHuQT5K

This is an introductory-level course on the theory, implementation, and applications of topic modeling (and NLP). It also includes pointers to advanced topics and state-of-the-art research papers.

Aaron Li

May 02, 2017


Transcript

  1. Topic Modelling for news recommendation, user behaviour modelling, and many more. Aaron Li ([email protected]). Copyright 2017 Aaron Li ([email protected])
  2. About me • Working on a stealth startup • Former lead inference engineer at Scaled Inference • Did AI / Machine Learning at Google Research, NICTA, CMU, ANU, etc. • https://www.linkedin.com/in/aaronqli/ Copyright 2017 Aaron Li ([email protected])
  3. Overview • Theory (2 classes, 2h each) • work out the problem & solutions & why • discuss the math & models & NLP fundamentals • industry use cases & systems & applications • Practice (2 classes, 2h each) • live demo + coding + debugging • data sets, open source tools, Q & A Copyright 2017 Aaron Li ([email protected])
  4. Overview • Background Knowledge • Linear Algebra • Probability Theory • Calculus • Scala / Go / Node / C++ (please vote) Copyright 2017 Aaron Li ([email protected])
  5. Schedule • Theory 1: What is news recommendation? What is topic modeling? Why? Basic architecture. NLP fundamentals. Basic model: LDA • Practice 1: LDA live demo. NLP tools introduction. Preprocessed datasets. Code LDA + experiments. Open source tools for industry • Theory 2: LDA inference. Gibbs sampling. SparseLDA, AliasLDA, LightLDA. Applications & industrial use cases • Practice 2: Set up NLP pipeline. SparseLDA, AliasLDA, LightLDA. Train & use the model. News recommendation demo. Copyright 2017 Aaron Li ([email protected])
  6. News Recommendation • A lot of people read news every day • Flipboard, CNN, Facebook, WeChat … • How do we make people more engaged? • Personalisation & recommendation • learn preferences and show relevant content • recommend articles based on the current one Copyright 2017 Aaron Li ([email protected])
  7. News Recommendation • Top websites / apps are already doing this Copyright 2017 Aaron Li ([email protected])
  8. News Recommendation • Many websites don't do it (e.g. CNN) • Why not? It's not an easy problem • Challenges: • News article vocabulary is large (100k ~ 1M) • Documents are represented by high-dimensional vectors based on vocabulary counts • Traditional similarity measures don't work Copyright 2017 Aaron Li ([email protected])
  9. Example In 1996 Linus Torvalds, the Finnish creator of the Open Source operating system Linux, visited the National Zoo and Aquarium with members of the Canberra Linux Users Group, and was captivated by one of the Zoo's little Penguins. Legend has it that Linus was infected with a mythical disease called Penguinitis. Penguinitis makes you stay awake at night thinking about Penguins and feeling great love towards them. Not long after this event the Open Source Software community decided they needed a logo for Linux. They were looking for something fun and after Linus mentioned his fondness of penguins, a slightly overweighted penguin sitting down after having a great meal seemed to fit the bill perfectly. Hence, Tux the penguin was created and now when people think of Linux they think of Tux. Copyright 2017 Aaron Li ([email protected])
  10. Example • Word count = 132, unique words = 91 • Very hard to measure its distance to other articles in our database talking about Linux, Linus Torvalds, and the creation of Tux • Distance measures designed for low-dimensional spaces aren't effective here • e.g. cosine similarity won't make sense • Need to represent things as low-dimensional vectors • Capture semantics / topics efficiently Copyright 2017 Aaron Li ([email protected])
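To make the dimensionality problem concrete, here is a minimal sketch of the raw count-vector similarity the slide refers to, assuming scikit-learn and three invented article snippets:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Raw bag-of-words similarity: each document becomes a sparse count vector over
# the whole vocabulary, and cosine scores hinge on exact word overlap rather
# than shared topics.
docs = ["Linus Torvalds created the Linux kernel",
        "Tux the penguin is the mascot of Linux",
        "The zoo in Canberra keeps little penguins"]

X = CountVectorizer().fit_transform(docs)   # documents x vocabulary counts
print(cosine_similarity(X))
# The only overlap between docs 2 and 3 is the word "the", even though both
# are about penguins; with a 100k-1M vocabulary this effect dominates.
```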
  11. Solutions • Step 1. Get text data: news articles, emails, legal docs, resumes, … (i.e. documents) • Step 2. ??? (machines can't read raw text) • Step 3. Model & Train • Step 4. Deploy & Predict Copyright 2017 Aaron Li ([email protected])
  12. Solutions • Step 2: NLP preprocessing, the common pipeline: Sentence splitting → Tokenisation → Stop words removal → Stemming (optional) → POS Tagging → Lemmatisation → Form bag of words • There are a lot more steps used in advanced NLP tasks: Chunking, Named Entity Recognition, Sentiment Analysis, Syntactic Analysis, Dependency Parsing, Coreference Resolution, Entity Relationship Extraction, Semantic Analysis, … Copyright 2017 Aaron Li ([email protected])
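A minimal sketch of this pipeline, assuming NLTK as the toolkit (the course may use other open source tools); the example sentence is invented:

```python
# Requires: pip install nltk, plus nltk.download('punkt'), nltk.download('stopwords'),
# nltk.download('averaged_perceptron_tagger'), nltk.download('wordnet').
import nltk
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

STOP = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    bag = Counter()
    for sentence in nltk.sent_tokenize(text):           # sentence splitting
        tokens = nltk.word_tokenize(sentence)            # tokenisation
        tokens = [t.lower() for t in tokens if t.isalpha()]
        tokens = [t for t in tokens if t not in STOP]     # stop word removal
        for word, tag in nltk.pos_tag(tokens):            # POS tagging
            pos = {"V": "v", "J": "a", "R": "r"}.get(tag[0], "n")
            bag[(lemmatizer.lemmatize(word, pos=pos), tag)] += 1  # lemmatise, count
    return bag                                             # bag of (word, POS) counts

print(preprocess("Linus Torvalds visited the National Zoo and was captivated by a penguin."))
```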
  13. NLP Preprocessing: Sentence splitting • Mostly rules (by regex or FST) • Look for sentence-splitting punctuation • For English: . ! ? etc. • Check out the Wikipedia article • Open source code is good • Also check out this article Copyright 2017 Aaron Li ([email protected])
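A toy rule-based splitter along these lines; a sketch only, since production splitters also handle abbreviations, quotes, and decimals:

```python
import re

# Split after . ! ? when followed by whitespace and a capital letter.
SENT_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def split_sentences(text):
    return [s.strip() for s in SENT_BOUNDARY.split(text) if s.strip()]

print(split_sentences("Tux is a penguin. He is the Linux mascot! Who created him?"))
# ['Tux is a penguin.', 'He is the Linux mascot!', 'Who created him?']
```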
  14. NLP Preprocessing: Tokenisation • Find boundaries between words • Easy for English (look for spaces) • Hard for Chinese etc. • Solutions: FST, CRF, etc. • Difficulties: see the Wikipedia article • Try making one yourself using an FST! (CMU 11-711 homework) Copyright 2017 Aaron Li ([email protected])
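A regex tokeniser sketch for English; this is exactly the space-and-punctuation approach that breaks down for languages without spaces, where FST/CRF segmenters are used instead:

```python
import re

# Words (with optional apostrophes), numbers, or single punctuation marks.
TOKEN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?|\d+|[^\w\s]")

def tokenize(sentence):
    return TOKEN.findall(sentence)

print(tokenize("Linus doesn't dislike penguins, does he?"))
# ['Linus', "doesn't", 'dislike', 'penguins', ',', 'does', 'he', '?']
```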
  15. NLP Preprocessing: Stop words removal • Stop words: • occur frequently • are not semantically meaningful • e.g. am, is, who, what, etc. • Small set of words • Easy to implement • e.g. an in-memory hashset Copyright 2017 Aaron Li ([email protected])
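A minimal in-memory hashset filter; the stop word list here is an illustrative subset, not a curated one:

```python
# Tiny illustrative stop word list; real systems use a curated list (e.g. NLTK's).
STOP_WORDS = {"am", "is", "are", "the", "a", "an", "who", "what", "of", "and"}

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["Tux", "is", "the", "mascot", "of", "Linux"]))
# ['Tux', 'mascot', 'Linux']
```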
  16. NLP Preprocessing: Stemming (optional) • Reduce a word to its root (stem) • Usually used in IR systems • The root can be a non-word, e.g. • fishing, fished, fisher => fish • cats, catty => cat • argument, arguing => argu • Rule-based implementation • e.g. Porter's Snowball stemmer • Also see the Wikipedia article Copyright 2017 Aaron Li ([email protected])
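A sketch using NLTK's Snowball (Porter-style) stemmer, one common rule-based implementation of the step above:

```python
from nltk.stem import SnowballStemmer

# Rule-based suffix stripping; outputs are stems, not necessarily valid words.
stemmer = SnowballStemmer("english")
for word in ["fishing", "fished", "cats", "arguing", "walked"]:
    print(word, "=>", stemmer.stem(word))   # e.g. fishing => fish, cats => cat
```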
  17. NLP Preprocessing: POS Tagging • POS = Part of Speech • Find the grammatical role of each word, e.g. "I ate a fish" => PRP VBD DT NN • Disambiguate the same word used in different contexts, e.g.: • "train" as in "train a model" (verb) • "train" as in "catch a train" (noun) • Techniques: HMM, CRF, etc. • See this article for more details Copyright 2017 Aaron Li ([email protected])
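A sketch using NLTK's default (averaged perceptron) tagger, one off-the-shelf alternative to the HMM/CRF taggers mentioned above:

```python
import nltk
# Requires: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

print(nltk.pos_tag(nltk.word_tokenize("I ate a fish")))
# expected: [('I', 'PRP'), ('ate', 'VBD'), ('a', 'DT'), ('fish', 'NN')]

print(nltk.pos_tag(nltk.word_tokenize("I train a model to catch a train")))
# the first "train" should come out as a verb tag, the second as a noun tag
```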
  18. NLP Preprocessing: Lemmatisation • Find the base form of a word • More complex than stemming • Uses POS tag information • Different rules for different POS • The base form is a valid word, e.g. • walks, walking, walked => walk • am, are, is => be • argument (NN) => argument • arguing (VBG) => argue • See the Wikipedia article for details Copyright 2017 Aaron Li ([email protected])
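A sketch using NLTK's WordNet lemmatiser; the POS argument is what distinguishes this from plain stemming:

```python
from nltk.stem import WordNetLemmatizer
# Requires: nltk.download('wordnet')

lem = WordNetLemmatizer()                    # pos: 'v' verb, 'n' noun, 'a' adjective
print(lem.lemmatize("walking", pos="v"))     # walk
print(lem.lemmatize("is", pos="v"))          # be
print(lem.lemmatize("arguing", pos="v"))     # argue
print(lem.lemmatize("argument", pos="n"))    # argument
```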
  19. NLP Preprocessing: Form bag of words • Index pre-processed documents and words with an id and a frequency • e.g.: • id:1 word:(train, VBG) freq:5 • id:2 word:(model, NN) freq:2 • id:3 word:(train, NN) freq:3 • … • See the UCI Bag of Words dataset Copyright 2017 Aaron Li ([email protected])
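A sketch of this indexing step using gensim's Dictionary (one of several open source options); POS-tagged tokens can be encoded as strings such as "train_VBG":

```python
from gensim.corpora import Dictionary

# Two toy pre-processed documents as token lists.
docs = [["train_VBG", "model_NN", "train_VBG", "data_NN"],
        ["train_NN", "ticket_NN", "station_NN"]]

dictionary = Dictionary(docs)                  # assigns an integer id to each token
bows = [dictionary.doc2bow(d) for d in docs]   # [(token_id, frequency), ...] per doc
print(dictionary.token2id)
print(bows)
```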
  20. Solutions • Modelling & Training • Naive Bayes • Latent Semantic Analysis • word2vec, doc2vec, … • Topic Modelling Copyright 2017 Aaron Li ([email protected])
  21. Solutions • Naive Bayes (a very old technique) • Uses only key words to get a probability for each of K labels • Good for spam detection • Poor performance for news recommendation • Does not capture semantics / topics • https://web.stanford.edu/class/cs124/lec/naivebayes.pdf Copyright 2017 Aaron Li ([email protected])
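A toy Naive Bayes classifier over bag-of-words counts, assuming scikit-learn; the training texts and labels are invented:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Spam detection is the classic use case; K here is just 2 labels.
texts = ["win a free prize now", "cheap pills free offer",
         "meeting agenda for tomorrow", "please review the attached report"]
labels = ["spam", "spam", "ham", "ham"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)
print(clf.predict(["free prize offer", "see the meeting report"]))
```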
  22. Solutions • Latent Semantic Analysis (~1990 - 2000) • SVD on a TF-IDF weighted frequency matrix with documents as columns and words as rows • Gives a low-rank approximation of the matrix, representing documents as low-dimensional vectors • Problem: vectors / documents are hard to interpret, and the implied probability distribution is wrong (Gaussian) • https://nlp.stanford.edu/IR-book/html/htmledition/latent-semantic-indexing-1.html • Thomas Hofmann. Probabilistic Latent Semantic Analysis. In Kathryn B. Laskey and Henri Prade, editors, UAI, 1999 Copyright 2017 Aaron Li ([email protected])
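A minimal LSA sketch, assuming scikit-learn; note it builds the matrix with documents as rows rather than columns, the transpose of the slide's convention, which gives an equivalent embedding:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["linux penguin tux mascot", "penguin zoo canberra",
        "train a topic model", "topic model for news"]

X = TfidfVectorizer().fit_transform(docs)   # documents x words, TF-IDF weighted
svd = TruncatedSVD(n_components=2)          # truncated SVD = low-rank approximation
doc_vecs = svd.fit_transform(X)             # each document as a 2-d dense vector
print(doc_vecs)
```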
  23. Solutions • word2vec, doc2vec (2013~) • Convert words to dense, low-dimensional, compositional vectors (e.g. king - man + woman = queen) • Good for classification problems • Slow to train, hard to interpret (because of the neural network), yet to be tested in industrial use cases • Mikolov, Tomas; et al. "Efficient Estimation of Word Representations in Vector Space". ICLR 2013. • Getting started with word2vec Copyright 2017 Aaron Li ([email protected])
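A tiny word2vec sketch, assuming gensim 4.x (older versions use size= instead of vector_size=); a real model needs far more text than these toy sentences:

```python
from gensim.models import Word2Vec

sentences = [["linus", "created", "linux"],
             ["tux", "is", "the", "linux", "penguin"],
             ["people", "read", "news", "every", "day"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)
print(model.wv["linux"][:5])                  # first 5 dimensions of a dense vector
print(model.wv.most_similar("linux", topn=3)) # nearest neighbours in vector space
```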
  24. Solutions • Topic Models (LDA etc., 2003~) • Define a generative structure involving latent variables (e.g. topics) using well-structured distributions, and infer the parameters • Represent documents / words using low-dimensional, highly interpretable distributions • Extensively used in industry; many open source tools • Extensive research on speeding up / scaling up • D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, 2003 • Tutorial: Parameter Estimation for Text Analysis, Gregor Heinrich, 2008 Copyright 2017 Aaron Li ([email protected])
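A minimal LDA sketch, assuming gensim and documents that have already been through the preprocessing pipeline above; the corpus and topic count are illustrative:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["linux", "penguin", "tux", "mascot", "linux"],
        ["zoo", "penguin", "canberra", "visit"],
        ["topic", "model", "train", "inference"],
        ["news", "recommendation", "topic", "model"]]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=20, random_state=0)

print(lda.print_topics())                   # each topic as a distribution over words
print(lda.get_document_topics(corpus[0]))   # the first document's topic vector
```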
  25. Topic Models • Latent Dirichlet Allocation (LDA) [Image from Li 2012, Multi-GPU Distributed Parallel Bayesian Differential Topic Modelling] Copyright 2017 Aaron Li ([email protected])
  26. Topic Models • LDA (Latent Dirichlet Allocation) • Arguably the most popular topic model since 2003 • Created by David Blei, Andrew Ng, and Michael Jordan • To be practical, we use this topic model in class Copyright 2017 Aaron Li ([email protected])
  27. LDA [Extracted from Li 2012, Multi-GPU Distributed Parallel Bayesian Differential Topic Modelling] Copyright 2017 Aaron Li ([email protected])
  28. LDA [Extracted from Li 2012, Multi-GPU Distributed Parallel Bayesian Differential Topic Modelling] Copyright 2017 Aaron Li ([email protected])
  29. LDA [Extracted from Li 2012, Multi-GPU Distributed Parallel Bayesian Differential Topic Modelling] Copyright 2017 Aaron Li ([email protected])
  30. Example [Extracted from Li 2012, Multi-GPU Distributed Parallel Bayesian Differential Topic Modelling] Copyright 2017 Aaron Li ([email protected])
  31. Example [Extracted from BleiNgJordan2003, Latent Dirichlet Allocation] Copyright 2017 Aaron Li ([email protected])
  32. LDA • Task: infer the parameters • each document's representation as a topic vector • with this we can compute document similarity! • each topic's representation as word counts • with this we can look at each topic manually and interpret its meaning! Copyright 2017 Aaron Li ([email protected])
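For example, once each document is a K-dimensional topic distribution, similarity is cheap to compute; a sketch with invented 5-topic vectors (Hellinger or Jensen-Shannon distance are common alternatives for probability vectors):

```python
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

doc_linux   = [0.70, 0.05, 0.20, 0.03, 0.02]   # mostly the "open source" topic
doc_tux     = [0.60, 0.10, 0.25, 0.03, 0.02]
doc_finance = [0.02, 0.80, 0.05, 0.10, 0.03]

print(cosine(doc_linux, doc_tux))       # high: recommend one given the other
print(cosine(doc_linux, doc_finance))   # low: different topics
```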
  33. Industrial Applications & Use Cases • Yi Wang et al. Peacock: Learning Long-Tail Topic Features for Industrial Applications (TIST 2014) • Advertising system in production • Aaron Li et al. High Performance Latent Variable Models (arXiv, 2014) • User preference learning from search data • Arnab Bhadury, Clustering Similar Stories Using LDA • News recommendation • And many more… search "AliasLDA" or "LightLDA" on Google Copyright 2017 Aaron Li ([email protected])
  34. LDA Inference • In LDA, the topic assignment for each word is latent • Computing the posterior exactly is intractable: the denominator sums over K^L topic-assignment combinations (K topics, L latent assignments) Copyright 2017 Aaron Li ([email protected])
  35. LDA Inference • What can we do to address the intractability? • Gibbs sampling • Variational inference (not discussed in class) Copyright 2017 Aaron Li ([email protected])
  36. LDA Inference • We can compute it using Bayes' rule • The resulting equation is called the "predictive probability" • It can be applied to the latent variable that assigns a topic to each word, i.e. it gives the probability that a word is assigned a particular topic, given the other topic assignments and the data (docs, words) Copyright 2017 Aaron Li ([email protected])
  37. LDA Inference • The terms on the right-hand side are known at all times! • We can compute the predictive probability (the left-hand term) by normalising over all k Copyright 2017 Aaron Li ([email protected])
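For reference, the standard collapsed-Gibbs form of this predictive probability (the equation itself lives in the slide image; the notation below follows Heinrich's tutorial cited earlier and is an assumption about what the slide shows):

$$
p(z_i = k \mid \mathbf{z}_{\neg i}, \mathbf{w}) \;\propto\; \left(n_{d,k}^{\neg i} + \alpha\right)\,\frac{n_{k,w_i}^{\neg i} + \beta}{n_{k}^{\neg i} + V\beta}
$$

where $n_{d,k}$ counts tokens in document $d$ assigned to topic $k$, $n_{k,w}$ counts assignments of word $w$ to topic $k$ across the corpus, $n_k = \sum_w n_{k,w}$, $V$ is the vocabulary size, $\alpha$ and $\beta$ are the Dirichlet hyperparameters, and $\neg i$ means the current token's own assignment is excluded from the counts.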
  38. LDA Inference • Algorithm (Gibbs sampling): • Randomly assign a topic to each word in each doc • For T iterations (a large number, to ensure convergence): • For each doc • For each word • For each topic, compute the predictive probability • Sample a topic by normalising over all predictive probabilities • Repeat for T' more iterations (a small number) and accumulate topic counts per word and per doc; use them to estimate the document-topic (θ) and topic-word (φ) distributions Copyright 2017 Aaron Li ([email protected])
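A deliberately unoptimised sketch of this sampler, assuming NumPy; hyperparameters and the toy corpus are illustrative, and a faithful version would average counts over the final T' sweeps rather than use only the last state:

```python
import numpy as np

def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """docs: list of lists of word ids in [0, V). Returns (theta, phi)."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndk = np.zeros((D, K))            # doc-topic counts
    nkw = np.zeros((K, V))            # topic-word counts
    nk = np.zeros(K)                  # topic totals
    z = []                            # topic assignment per token
    for d, doc in enumerate(docs):    # random initialisation
        zd = rng.integers(K, size=len(doc))
        z.append(zd)
        for w, k in zip(doc, zd):
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):            # T sweeps over the corpus
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]           # remove the current assignment
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # predictive probability for every topic, then sample
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    # point estimates from the final counts (a fuller version averages over T' sweeps)
    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)   # doc-topic
    phi = (nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True)       # topic-word
    return theta, phi

docs = [[0, 1, 2, 0], [0, 2, 1], [3, 4, 5, 4], [4, 5, 3]]   # toy corpus, V = 6
theta, phi = gibbs_lda(docs, V=6, K=2)
print(np.round(theta, 2))
```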
  39. Speed up LDA (switch to my KDD 2014 slides) https://www.slideshare.net/AaronLi11/kdd-2014-presentation-best-research-paper-award-alias-topic-modelling-reducing-the-sampling-complexity-of-topic-models Copyright 2017 Aaron Li ([email protected])