An introduction to textmineR


Delivered at the DC R Conference on November 9, 2018


Tommy

November 09, 2018

Transcript

  1. An Introduction to textmineR. Thomas W. Jones. Delivered at the DC R Conference, November 9, 2018.
  2. None
  3. None
  4. Why might you use textmineR?
     • Three principles of the framework:
       1. Maximal interoperability with other R packages
       2. Scalable for object storage and computation time
       3. Syntax that is idiomatic to R
     • A pretty darn good topic modeling workbench
     • Some special stuff based on my research
  5. Agenda • NLP Workflow & Computational Considerations • textmineR’s philosophy • Topic models & embeddings • textmineR’s special stuff • What’s next?
  6. Agenda • NLP Workflow & Computational Considerations • textmineR’s philosophy • Topic models & embeddings • textmineR’s special stuff • What’s next?
  7. Document Term Matrix (DTM)
     [Slide shows an example matrix: rows Document 1 through Document 5; columns for terms such as reduce, health, policy, food, choice, study, sodium, social; sparse integer counts in the cells]
     • Rows are documents
     • Columns are linguistic features
     • Much wider than tall
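     A minimal sketch of building a DTM with textmineR’s CreateDtm(); the toy corpus below is my own illustration, not data from the talk:

       library(textmineR)

       # toy corpus (illustrative only)
       docs <- c(doc1 = "reduce sodium to improve health",
                 doc2 = "food choice is a social policy question",
                 doc3 = "a study of sodium and health policy")

       # sparse document-term matrix: rows are documents, columns are terms
       dtm <- CreateDtm(doc_vec = docs,
                        doc_names = names(docs),
                        ngram_window = c(1, 1))  # unigrams only

       dim(dtm)  # for real corpora, much wider than tall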
  8. Term Co-occurrence Matrix (TCM)
     [Slide shows an example matrix whose rows and columns are the same terms (reduce, health, policy, food, choice, study, sodium, social, …), with sparse integer counts]
     • Rows and columns are linguistic features
     • Square, not necessarily symmetric
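     The companion sketch for a TCM uses CreateTcm(); the skip-gram window size here is an arbitrary illustrative choice:

       # term co-occurrence matrix over skip-grams
       tcm <- CreateTcm(doc_vec = docs,
                        skipgram_window = 5)  # co-occurrences within 5 tokens

       dim(tcm)  # square: terms by terms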
  9. An NLP Pipeline
     1. Construct a DTM or TCM
     2. Fit a model or run an algorithm on the DTM or TCM
     3. Analyze the model/algorithm results
     4. Apply the model/algorithm to new documents
  10. Decisions, Decisions…
      • What is a term? Unigrams vs. n-grams; stems, lemmas, parts of speech, named entities, etc.
      • What is a document? Books, chapters, pages, paragraphs, sentences, etc.
      • What is the measure relating my rows to columns? Raw counts, TF-IDF, some other index; skip-grams, counts of document co-occurrence, etc.
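      Several of these decisions map directly onto arguments of CreateDtm(); a hedged sketch, with illustrative argument values:

        # unigrams and bigrams, Porter-stemmed (uses the SnowballC package)
        dtm_2 <- CreateDtm(doc_vec = docs,
                           doc_names = names(docs),
                           ngram_window = c(1, 2),  # "what is a term?"
                           stem_lemma_function = function(x)
                             SnowballC::wordStem(x, language = "porter"))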
  11. RAM Math: (n rows) × (k columns) × (8 bytes) / 1,000,000,000 = size of the matrix in GB
  12. (100 thousand) × (6 million) × (8 bytes) / (1 billion) = 4,800 GB = 4.8 TB
  13. More RAM Math
      • A 10,000 × 20,000 matrix (a moderately sized corpus with many terms removed) is 1.6 GB
      • A 10,000 × 60,000 matrix (the same ratio of rows to columns as on the last slide) is 4.8 GB
      • A 20,000 × 100,000 matrix (a fairly standard corpus size) is 16 GB
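      The slides’ arithmetic as a one-line R helper, assuming dense storage of 8-byte doubles:

        # size in GB of a dense n x k matrix of doubles
        ram_gb <- function(n, k) n * k * 8 / 1e9

        ram_gb(1e5, 6e6)  # 4800 GB = 4.8 TB
        ram_gb(1e4, 2e4)  # 1.6 GB
        ram_gb(2e4, 1e5)  # 16 GB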
  14. Agenda • NLP Workflow & Computational Considerations • textmineR’s philosophy • Topic models & embeddings • textmineR’s special stuff • What’s next?
  15. Three Principles
      • Maximum interoperability within R’s ecosystem
      • Syntax that is idiomatic in R
      • Scalable in terms of object storage and computation time
  16. None
  17. None
  18. None
  19. None
  20. None
  21. None
  22. Agenda • NLP Workflow & Computational Considerations • textmineR’s philosophy • Topic models & embeddings • textmineR’s special stuff • What’s next?
  23. Topic models & embeddings are functions mapping a DTM (or TCM) to two or more matrices:
      • DTM (Documents × Words) → (Documents × Topics) × (Topics × Words)
      • TCM (Words × Words) → (Words × Topics) × (Topics × Words)
  24. None
  25. None
  26. Supported Models
      • Latent Dirichlet Allocation (LDA): native
      • Latent Semantic Analysis (LSA/LSI): native, with help from RSpectra
      • Correlated Topic Models (CTM): from the topicmodels package
      • Help wanted for more!
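      A minimal sketch of fitting the native LDA implementation; it assumes dtm is a document-term matrix from CreateDtm() over a real corpus, and k and the iteration counts are illustrative, not recommendations from the talk:

        set.seed(42)

        model <- FitLdaModel(dtm = dtm,
                             k = 20,            # number of topics
                             iterations = 200,  # Gibbs sampling iterations
                             burnin = 180)

        dim(model$phi)    # topics by terms
        dim(model$theta)  # documents by topics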
  27. SummarizeTopics() • LabelTopics() • GetTopTerms() • Cluster2TopicModel() • CalcLikelihood() • CalcProbCoherence() • CalcTopicModelR2()
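      Two of these helpers applied to the fit above, as a sketch (M = 5 is an arbitrary choice):

        # top 5 terms per topic, one column per topic
        top_terms <- GetTopTerms(phi = model$phi, M = 5)

        # one-row-per-topic summary (labels, coherence, prevalence, top terms)
        topic_summary <- SummarizeTopics(model)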
  28. Agenda • NLP Workflow & Computational Considerations • textmineR’s philosophy • Topic models & embeddings • textmineR’s special stuff • What’s next?
  29. CalcTopicModelR2()
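      Roughly, usage looks like this sketch, continuing with the model fit above:

        # R-squared: how well phi and theta reconstruct the DTM
        r2 <- CalcTopicModelR2(dtm = dtm,
                               phi = model$phi,
                               theta = model$theta)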

  30. Probabilistic Coherence
      • An average measure of statistical independence between M neighboring words in each topic
      • Ranges between 1 and -1
      • Values close to 0 indicate statistical independence (not a great topic)
      • Negative values indicate negative correlation (likely a terrible topic)
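      A sketch of computing it for the fit above (M = 5 neighboring words):

        # per-topic probabilistic coherence; values near 0 suggest weak topics
        coherence <- CalcProbCoherence(phi = model$phi, dtm = dtm, M = 5)
        summary(coherence)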
  31. LDA asymmetric priors + more!
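      As I understand the package, one way to use an asymmetric prior is to pass alpha as a length-k vector to FitLdaModel(); the values below are purely illustrative:

        # asymmetric document-topic prior: one alpha value per topic
        model_asym <- FitLdaModel(dtm = dtm, k = 20,
                                  iterations = 200, burnin = 180,
                                  alpha = seq(0.05, 0.5, length.out = 20))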

  32. Agenda • NLP Workflow & Computational Considerations • textmineR’s philosophy • Topic models & embeddings • textmineR’s special stuff • What’s next?
  33. • update() method for LDA
      • posterior() method for LDA
      • Parallel fitting for LDA
      • Native DTM and TCM functions
  34. textmineR is on CRAN
      • Latest CRAN version is 3.0.1 (updated last week)
      • Development version is on GitHub
      • Be a contributor! https://github.com/tommyjones/textmineR
  35. Start with the vignettes!

  36. Thank You
      • jones.thos.w@gmail.com
      • twitter: @thos_jones
      • http://www.biasedestimates.com
      • https://GitHub.com/TommyJones