Document Term Matrix (DTM)
[Figure: example DTM — rows are Document 1 through Document 5, columns are terms such as "reduce", "health", "policy", "food", "choice", "study", "sodium", "social"; cells hold counts]
• Rows are documents
• Columns are linguistic features
• Much wider than tall
Term Co-occurrence Matrix (TCM)
[Figure: example TCM — rows and columns are the same terms ("reduce", "health", "policy", "food", "choice", "study", "sodium", "social"); cells hold co-occurrence counts]
• Rows and columns are linguistic features
• Square, but not necessarily symmetric
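A minimal sketch of both structures using textmineR's CreateDtm() and CreateTcm(); the three-document corpus below is invented for illustration, and the argument names are as I understand the package's interface:

```r
library(textmineR)

# a made-up three-document corpus
docs <- c(
  doc1 = "a policy to reduce sodium could improve health",
  doc2 = "this study looks at food choice and health",
  doc3 = "social factors shape food choice and sodium intake"
)

# DTM: rows are documents, columns are terms -- much wider than tall
dtm <- CreateDtm(doc_vec = docs, doc_names = names(docs))
dim(dtm)

# TCM: rows AND columns are terms -- square, but not necessarily symmetric
tcm <- CreateTcm(doc_vec = docs, skipgram_window = 5)
dim(tcm)
```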
An NLP Pipeline
1. Construct a DTM or TCM
2. Fit a model or run an algorithm on the DTM or TCM
3. Analyze the model/algorithm results
4. Apply the model/algorithm to new documents
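The four steps might look like the following textmineR sketch, assuming `docs` and `new_docs` are character vectors of raw text and using LDA as the model; k, iterations, and burnin are arbitrary example values:

```r
library(textmineR)

# 1. Construct a DTM
dtm <- CreateDtm(doc_vec = docs, doc_names = names(docs))

# 2. Fit a model on the DTM -- here, LDA with k topics
model <- FitLdaModel(dtm = dtm, k = 10, iterations = 200, burnin = 180)

# 3. Analyze the results, e.g. the top 5 terms of each topic
top_terms <- GetTopTerms(phi = model$phi, M = 5)

# 4. Apply the fitted model to new documents
new_dtm   <- CreateDtm(doc_vec = new_docs, doc_names = names(new_docs))
new_theta <- predict(model, new_dtm, method = "dot")
```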
Decisions, Decisions…
• What is a term?
  • Unigrams vs. n-grams
  • Stems, lemmas, parts of speech, named entities, etc.
• What is a document?
  • Books, chapters, pages, paragraphs, sentences, etc.
• What is the measure relating my rows to columns?
  • Raw counts, TF-IDF, some other index, etc.
  • Skip-grams, count of document co-occurrence, etc.
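Most of these decisions show up as arguments when the matrices are built. A hedged sketch of a few options (n-grams, stemming, TF-IDF reweighting, skip-grams vs. document co-occurrence), again assuming a character vector `docs` and textmineR's interface as I recall it:

```r
library(textmineR)

# "What is a term?" -- unigrams AND bigrams, reduced to stems
dtm <- CreateDtm(
  doc_vec = docs,
  doc_names = names(docs),
  ngram_window = c(1, 2),   # unigrams and bigrams
  stem_lemma_function = function(x) SnowballC::wordStem(x, "porter")
)

# "What is the measure?" -- reweight raw counts as TF-IDF
tf <- TermDocFreq(dtm)
dtm_tfidf <- t(t(dtm[, tf$term]) * tf$idf)   # scale each term column by its IDF

# For a TCM: skip-grams within a window vs. whole-document co-occurrence
tcm_skip <- CreateTcm(doc_vec = docs, skipgram_window = 5)
tcm_doc  <- CreateTcm(doc_vec = docs, skipgram_window = Inf)
```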
More RAM Math
Stored dense, at 8 bytes per double-precision cell:
• A 10,000 × 20,000 matrix (a moderately-sized corpus with many terms removed) is 1.6 GB
• A 10,000 × 60,000 matrix (the same ratio of rows to columns as on the last slide) is 4.8 GB
• A 20,000 × 100,000 matrix (a fairly standard corpus size) is 16 GB
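The arithmetic, plus a quick illustration of why sparse storage (the Matrix package's format, which textmineR uses for its DTMs) sidesteps the problem; the 1% density below is an arbitrary example:

```r
# dense storage: 8 bytes per double-precision cell
10000 * 20000  * 8 / 1e9   # 1.6 GB
10000 * 60000  * 8 / 1e9   # 4.8 GB
20000 * 100000 * 8 / 1e9   # 16  GB

# a sparse matrix stores only the non-zero cells
library(Matrix)
m <- rsparsematrix(nrow = 10000, ncol = 20000, density = 0.01)
format(object.size(m), units = "MB")   # tens of MB instead of 1.6 GB
```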
Three Principles
• Maximum interoperability within R’s ecosystem
• Syntax that is idiomatic in R
• Scalable in terms of object storage and computation time
Topic models & embeddings are functions Mapping a DTM (TCM) to 2 or more matrices → Words Documents Words Documents Topics Topics DTM → Words Words Words Words Topics Topics TCM
Probabilistic Coherence
• Averages a measure of statistical dependence among the top M words in each topic
• Ranges between -1 and 1
• Values close to 0 indicate statistical independence (not a great topic)
• Negative values indicate negative correlation (likely a terrible topic)
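A sketch of computing it with textmineR's CalcProbCoherence(), assuming the `model` and `dtm` objects from the earlier sketches:

```r
library(textmineR)

# per-topic probabilistic coherence over each topic's top M = 5 words
coh <- CalcProbCoherence(phi = model$phi, dtm = dtm, M = 5)

summary(coh)   # values near 0 suggest weak topics; higher is better
# FitLdaModel can also return this directly as model$coherence
# when called with calc_coherence = TRUE
```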
textmineR is on CRAN
• Latest CRAN version is 3.0.1 (updated last week)
• Development version is on GitHub
• Be a contributor! https://github.com/tommyjones/textmineR