An introduction to textmineR

Tommy
November 09, 2018

Delivered at the DC R Conference on November 9, 2018

Transcript

  1. Why might you use textmineR?

    • Three principles of the framework
      1. Maximal interoperability with other R packages
      2. Scalable for object storage and computation time
      3. Syntax that is idiomatic to R
    • A pretty darn good topic modeling workbench
    • Some special stuff based on my research
  2. Agenda • NLP Workflow & Computational Considerations • textmineR’s philosophy

    • Topic models & embeddings • textmineR’s special stuff • What’s next?
  3. Agenda • NLP Workflow & Computational Considerations • textmineR’s philosophy

    • Topic models & embeddings • textmineR’s special stuff • What’s next?
  4. Document Term Matrix (DTM)

    [Figure: an example matrix with rows Document 1 through Document 5, …, columns reduce, health, policy, food, choice, study, sodium, social, …, and a few cells filled with small counts (1, 2, 3)]
    • Rows are documents
    • Columns are linguistic features
    • Much wider than tall
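
A minimal sketch of the DTM idea, in plain Python for illustration only (this is not textmineR's API). The three toy documents, the whitespace tokenizer, and the names `docs`, `sparse_dtm`, and `dense_dtm` are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical toy corpus; the words are borrowed from the slide's column labels.
docs = {
    "doc_1": "reduce sodium food choice",
    "doc_2": "health policy study",
    "doc_3": "food policy food choice",
}

# Count unigram terms per document (whitespace tokenization is an assumption).
counts = {name: Counter(text.lower().split()) for name, text in docs.items()}

# Columns are the sorted vocabulary; rows are documents.
vocab = sorted({term for c in counts.values() for term in c})

# Sparse storage: keep only the nonzero cells, keyed by (row, column).
sparse_dtm = {
    (name, term): n for name, c in counts.items() for term, n in c.items()
}

# Dense view for printing; a Counter returns 0 for terms a document lacks.
dense_dtm = [[counts[name][term] for term in vocab] for name in docs]

print(vocab)
for name, row in zip(docs, dense_dtm):
    print(name, row)
```

Even in this tiny case most dense cells are zero, which is why the sparse form stores only the nonzero counts.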
  5. Term Co-occurrence Matrix (TCM)

    [Figure: an example matrix whose rows and columns are both reduce, health, policy, food, choice, study, sodium, social, …, with a few cells filled with small counts (1, 2, 3)]
    • Rows and columns are linguistic features
    • Square, not necessarily symmetric
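
One way a TCM can end up asymmetric is a forward-looking skip-gram window, sketched below in plain Python for illustration only (not textmineR's API). The function name `skipgram_tcm`, the window size, and the toy documents are assumptions made for the example.

```python
from collections import defaultdict


def skipgram_tcm(token_lists, window=2):
    """Count how often term `b` appears within `window` positions AFTER term `a`."""
    tcm = defaultdict(int)
    for tokens in token_lists:
        for i, a in enumerate(tokens):
            # Look only forward, so the (a, b) and (b, a) counts can differ.
            for b in tokens[i + 1 : i + 1 + window]:
                tcm[(a, b)] += 1
    return dict(tcm)


# Hypothetical toy corpus, already tokenized.
docs = [
    "reduce sodium food choice".split(),
    "food policy food choice".split(),
]
tcm = skipgram_tcm(docs, window=2)

for pair, n in sorted(tcm.items()):
    print(pair, n)
```

Counting co-occurrence within whole documents instead of within a directional window would give a symmetric matrix.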
  6. An NLP Pipeline

    1. Construct a DTM or TCM
    2. Fit a model or run an algorithm on the DTM or TCM
    3. Analyze the model/algorithm results
    4. Apply the model/algorithm to new documents
  7. Decisions Decisions…

    • What is a term?
      • Unigrams vs. n-grams
      • Stems, lemmas, parts of speech, named entities, etc.
    • What is a document?
      • Books, chapters, pages, paragraphs, sentences, etc.
    • What is the measure relating my rows to columns?
      • Raw counts, TF-IDF, some other index, etc.
      • Skip-grams, count of document co-occurrence, etc.
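
Two of these decisions, sketched in plain Python for illustration only (not textmineR's API): what counts as a term (unigrams plus bigrams) and how a cell is weighted (raw counts vs. TF-IDF). The helper `ngrams`, the toy documents, and the particular TF-IDF variant (count × log(N / document frequency)) are assumptions; several other variants exist.

```python
import math
from collections import Counter


def ngrams(tokens, n_min=1, n_max=2):
    """Return every n-gram from n_min to n_max words long, joined with '_'."""
    return [
        "_".join(tokens[i : i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


# Hypothetical toy corpus.
docs = ["reduce sodium", "reduce food choice", "food policy"]

# Raw counts per document, with unigrams and bigrams as terms.
term_counts = [Counter(ngrams(d.split())) for d in docs]

# Document frequency: the number of documents that contain each term.
doc_freq = Counter(term for c in term_counts for term in c)

# TF-IDF re-weighting: terms found in fewer documents get larger weights.
n_docs = len(docs)
tfidf = [
    {term: n * math.log(n_docs / doc_freq[term]) for term, n in c.items()}
    for c in term_counts
]

print(ngrams("reduce food choice".split()))
print(tfidf[0])
```

Adding bigrams grows the vocabulary quickly, which feeds directly into the size of the resulting matrix.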
  8. RAM Math

    (n rows) * (k columns) * (8 bytes) / 1,000,000,000 = size of matrix in GB
  9. (100 thousand) * (6 million) * (8 bytes) / (1 billion) = 4,800 GB = 4.8 TB
  10. More RAM Math

    • A 10,000 × 20,000 matrix (a moderately sized corpus with many terms removed) is 1.6 GB
    • A 10,000 × 60,000 matrix (the same ratio of rows to columns as on the last slide) is 4.8 GB
    • A 20,000 × 100,000 matrix (a fairly standard corpus size) is 16 GB
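
The RAM arithmetic from these slides can be checked with a few lines of plain Python. The function name `dense_matrix_gb` is made up for the example; it assumes a dense matrix with 8 bytes per cell and 1 GB = 1,000,000,000 bytes, as in the formula on the slides.

```python
def dense_matrix_gb(n_rows, n_cols, bytes_per_cell=8):
    """Size in GB of a dense matrix that stores every cell."""
    return n_rows * n_cols * bytes_per_cell / 1_000_000_000


# The matrix shapes quoted on the slides.
shapes = [
    (100_000, 6_000_000),
    (10_000, 20_000),
    (10_000, 60_000),
    (20_000, 100_000),
]

for n_rows, n_cols in shapes:
    print(f"{n_rows:,} x {n_cols:,}: {dense_matrix_gb(n_rows, n_cols):,.1f} GB")
```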
  11. Agenda • NLP Workflow & Computational Considerations • textmineR’s philosophy

    • Topic models & embeddings • textmineR’s special stuff • What’s next?
  12. Three Principles

    • Maximum interoperability within R’s ecosystem
    • Syntax that is idiomatic in R
    • Scalable in terms of object storage and computation time
  13. Agenda • NLP Workflow & Computational Considerations • textmineR’s philosophy

    • Topic models & embeddings • textmineR’s special stuff • What’s next?
  14. Topic models & embeddings are functions

    Mapping a DTM (TCM) to 2 or more matrices
    [Figure: a DTM (documents × words) maps to a documents × topics matrix and a topics × words matrix; a TCM (words × words) maps to a words × topics matrix and a topics × words matrix]
  15. Supported Models

    • Latent Dirichlet Allocation (LDA) - Native
    • Latent Semantic Analysis (LSA/LSI) - Native with help from RSpectra
    • Correlated Topic Models (CTM) - from the topicmodels package
    • Help wanted for more!
  16. Agenda • NLP Workflow & Computational Considerations • textmineR’s philosophy

    • Topic models & embeddings • textmineR’s special stuff • What’s next?
  17. Probabilistic Coherence

    • Average measure of statistical independence between M neighboring words in each topic
    • Ranges between -1 and 1
    • Values close to 0 indicate statistical independence (not a great topic)
    • Negative values indicate negative correlation (likely a terrible topic)
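
A rough sketch of how such a measure could be computed, in plain Python for illustration only. It assumes one reading of the definition: for each pair of a topic's top M terms, with `a` ranked above `b`, take P(b | a) − P(b) over documents and average the differences. The function name `prob_coherence` and the toy data are made up, and textmineR's own implementation (`CalcProbCoherence`) may differ in its details.

```python
from itertools import combinations


def prob_coherence(top_terms, doc_term_sets):
    """Average of P(b | a) - P(b) over ordered pairs of a topic's top terms.

    top_terms: the topic's top M terms, most probable first.
    doc_term_sets: one set of terms per document.
    """
    n_docs = len(doc_term_sets)
    diffs = []
    for a, b in combinations(top_terms, 2):  # a is ranked above b
        docs_with_a = [d for d in doc_term_sets if a in d]
        if not docs_with_a:
            continue  # P(b | a) is undefined when no document contains a
        p_b = sum(b in d for d in doc_term_sets) / n_docs
        p_b_given_a = sum(b in d for d in docs_with_a) / len(docs_with_a)
        diffs.append(p_b_given_a - p_b)
    return sum(diffs) / len(diffs) if diffs else 0.0


# Hypothetical toy corpus, reduced to the set of terms in each document.
docs = [{"food", "choice"}, {"food", "choice"}, {"policy"}, {"health"}]

print(prob_coherence(["food", "choice"], docs))  # terms that appear together
print(prob_coherence(["food", "policy"], docs))  # terms that never co-occur
```

Under this reading, terms that tend to appear together score above 0, independent terms score near 0, and terms that avoid each other score below 0, which matches the interpretation on the slide.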
  18. Agenda • NLP Workflow & Computational Considerations • textmineR’s philosophy

    • Topic models & embeddings • textmineR’s special stuff • What’s next?
  19. • update() method for LDA • posterior() method for LDA

    • Parallel fitting for LDA • Native DTM and TCM functions
  20. textmineR is on CRAN

    • Latest CRAN version is 3.0.1 (updated last week)
    • Development version is on GitHub
    • Be a contributor! https://github.com/tommyjones/textmineR