
An introduction to textmineR

November 09, 2018

Delivered at the DC R Conference on November 9, 2018

Tommy
Transcript

  1. Why might you use textmineR? • Three principles of the

    framework: 1. Maximal interoperability with other R packages 2. Scalable for object storage and computation time 3. Syntax that is idiomatic to R • A pretty darn good topic modeling workbench • Some special stuff based on my research
  2. Agenda • NLP Workflow & Computational Considerations • textmineR’s philosophy

    • Topic models & embeddings • textmineR’s special stuff • What’s next?
  3. Agenda • NLP Workflow & Computational Considerations • textmineR’s philosophy

    • Topic models & embeddings • textmineR’s special stuff • What’s next?
  4. Document Term Matrix (DTM) [Example matrix: rows = Documents 1–5, …; columns = terms (reduce, health, policy, food, choice, study, sodium, social, …); cells = counts]
    • Rows are documents • Columns are linguistic features • Much wider than tall
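In textmineR, a DTM like the sketch above can be built with `CreateDtm()`. A minimal sketch using the `nih_sample` example data that ships with the package:

```r
library(textmineR)

# build a sparse document-term matrix from a character vector of documents;
# nih_sample is example NIH grant-abstract data included with textmineR
dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT,
                 doc_names = nih_sample$APPLICATION_ID)

dim(dtm)  # rows are documents, columns are terms: much wider than tall
```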
  5. Term Co-occurrence Matrix (TCM) [Example matrix: rows and columns are the same terms (reduce, health, policy, food, choice, study, sodium, social, …); cells = co-occurrence counts]
    • Rows and columns are linguistic features • Square, not necessarily symmetric
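A TCM can be built the same way with textmineR's `CreateTcm()`; a sketch using a skip-gram window (the window size of 5 is an illustrative choice, not a default):

```r
library(textmineR)

# term co-occurrence counts within a 5-term skip-gram window
tcm <- CreateTcm(doc_vec = nih_sample$ABSTRACT_TEXT,
                 skipgram_window = 5)

dim(tcm)  # square: terms by terms
```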
  6. An NLP Pipeline 1. Construct a DTM or TCM 2.

    Fit a model or run an algorithm on the DTM or TCM 3. Analyze the model/algorithm results 4. Apply the model/algorithm to new documents
  7. Decisions Decisions… • What is a term? • Unigrams vs.

    n-grams • stems, lemmas, parts of speech, named entities, etc… • What is a document? • Books, chapters, pages, paragraphs, sentences, etc… • What is the measure relating my rows to columns? • Raw counts, TF-IDF, some other index, etc… • Skip-grams, count of document co-occurrence, etc…
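Many of these decisions map onto arguments of textmineR's `CreateDtm()`. A sketch, where the stopword list and Porter stemmer are illustrative choices rather than requirements:

```r
library(textmineR)

# unigrams + bigrams, English stopwords removed, Porter stemming
dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT,
                 doc_names = nih_sample$APPLICATION_ID,
                 ngram_window = c(1, 2),  # unigrams and bigrams
                 stopword_vec = stopwords::stopwords("en"),
                 stem_lemma_function = function(x)
                   SnowballC::wordStem(x, language = "porter"))
```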
  8. RAM Math (n rows) * (k columns) * (8 bytes)

    / 1,000,000,000 = Size of matrix in GB
  9. (100 thousand) * (6 million) * (8 bytes) / (1

    billion) = 4,800 GB = 4.8 TB
  10. More RAM Math • 10,000 X 20,000 matrix (a moderately-sized

    corpus with many terms removed) is 1.6 GB • 10,000 X 60,000 matrix (the same ratio of rows to columns as on the last slide) is 4.8 GB • 20,000 X 100,000 (a fairly standard corpus size) is 16 GB.
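The arithmetic on the last three slides is a one-liner in R; it is also why textmineR stores DTMs and TCMs as sparse `Matrix::dgCMatrix` objects, which keep only the non-zero entries:

```r
# size in GB of a dense matrix of doubles: rows * cols * 8 bytes / 1e9
ram_gb <- function(n, k) n * k * 8 / 1e9

ram_gb(1e5, 6e6)  # 4800 GB (4.8 TB)
ram_gb(1e4, 2e4)  # 1.6 GB
ram_gb(2e4, 1e5)  # 16 GB
```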
  11. Agenda • NLP Workflow & Computational Considerations • textmineR’s philosophy

    • Topic models & embeddings • textmineR’s special stuff • What’s next?
  12. Three Principles • Maximum interoperability within R’s ecosystem • Syntax

    that is idiomatic in R • Scalable in terms of object storage and computation time
  13. Agenda • NLP Workflow & Computational Considerations • textmineR’s philosophy

    • Topic models & embeddings • textmineR’s special stuff • What’s next?
  14. Topic models & embeddings are functions mapping a DTM (TCM)

    to 2 or more matrices: DTM (Documents × Words) → (Documents × Topics) × (Topics × Words); TCM (Words × Words) → (Words × Topics) × (Topics × Words)
  15. Supported Models • Latent Dirichlet Allocation (LDA) - Native •

    Latent Semantic Analysis (LSA/LSI) - Native with help from RSpectra • Correlated Topic Models (CTM) - from the topicmodels package • Help wanted for more!
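Fitting looks similar across the supported models; a minimal LDA sketch, where k = 20 and the iteration counts are illustrative choices:

```r
library(textmineR)

dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT,
                 doc_names = nih_sample$APPLICATION_ID)

# LDA factors the DTM into theta (documents x topics) and phi (topics x words)
lda <- FitLdaModel(dtm = dtm, k = 20, iterations = 200, burnin = 175)

dim(lda$theta)  # documents x topics
dim(lda$phi)    # topics x words
```

`FitLsaModel()` and `FitCtmModel()` take the same DTM and return comparably shaped factor matrices.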
  16. Agenda • NLP Workflow & Computational Considerations • textmineR’s philosophy

    • Topic models & embeddings • textmineR’s special stuff • What’s next?
  17. Probabilistic Coherence • Average measure of statistical independence between M

    neighboring words in each topic • Ranges from -1 to 1 • Values close to 0 indicate statistical independence (not a great topic) • Negative values indicate negative correlation (likely a terrible topic)
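textmineR computes this with `CalcProbCoherence()`; a sketch, assuming an LDA model fit as on the earlier slides:

```r
library(textmineR)

dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT,
                 doc_names = nih_sample$APPLICATION_ID)
lda <- FitLdaModel(dtm = dtm, k = 20, iterations = 200, burnin = 175)

# coherence of each topic over its top M = 5 terms; closer to 1 is better
coherence <- CalcProbCoherence(phi = lda$phi, dtm = dtm, M = 5)
summary(coherence)
```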
  18. Agenda • NLP Workflow & Computational Considerations • textmineR’s philosophy

    • Topic models & embeddings • textmineR’s special stuff • What’s next?
  19. • update() method for LDA • posterior() method for LDA

    • Parallel fitting for LDA • Native DTM and TCM functions
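A sketch of applying and updating a fitted LDA model; the document subset used for "new" data is arbitrary, for illustration only:

```r
library(textmineR)

dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT,
                 doc_names = nih_sample$APPLICATION_ID)
lda <- FitLdaModel(dtm = dtm, k = 20, iterations = 200, burnin = 175)

# predict topic distributions (theta) for new documents
new_dtm <- dtm[1:10, ]
new_theta <- predict(lda, new_dtm, method = "gibbs",
                     iterations = 200, burnin = 175)

# or continue sampling with additional data via the update() method
lda2 <- update(object = lda, dtm = new_dtm, iterations = 100)
```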
  20. textmineR is on CRAN • Latest CRAN version is 3.0.1

    (updated last week) • Development version is on GitHub • Be a contributor! https://github.com/tommyjones/textmineR