Slide 1

An Introduction to textmineR
Thomas W. Jones
Delivered at the DC R Conference, November 9, 2018

Slide 2

No content

Slide 3

No content

Slide 4

Why might you use textmineR?

• Three principles of the framework:
  1. Maximal interoperability with other R packages
  2. Scalable for object storage and computation time
  3. Syntax that is idiomatic to R
• A pretty darn good topic modeling workbench
• Some special stuff based on my research

Slide 5

Agenda

• NLP Workflow & Computational Considerations
• textmineR’s philosophy
• Topic models & embeddings
• textmineR’s special stuff
• What’s next?

Slide 6

Agenda

• NLP Workflow & Computational Considerations
• textmineR’s philosophy
• Topic models & embeddings
• textmineR’s special stuff
• What’s next?

Slide 7

Document Term Matrix (DTM)

[Example matrix: rows are Document 1–5, …; columns are the terms reduce, health, policy, food, choice, study, sodium, social, …; cells hold term counts]

• Rows are documents
• Columns are linguistic features
• Much wider than tall
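
A minimal sketch of building such a DTM with textmineR, using the nih_sample data frame of NIH grant abstracts that ships with the package (column names assume the version bundled with textmineR 3.x):

```r
library(textmineR)

# NIH grant abstracts bundled with the package
data(nih_sample)

# one row per document, one column per term, stored as a sparse matrix
dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT,
                 doc_names = nih_sample$APPLICATION_ID)

dim(dtm)  # far more columns (terms) than rows (documents)
```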

Slide 8

Term Co-occurrence Matrix (TCM)

[Example matrix: rows and columns are the terms reduce, health, policy, food, choice, study, sodium, social, …; cells hold co-occurrence counts]

• Rows and columns are linguistic features
• Square, but not necessarily symmetric
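
A comparable sketch for a TCM; skipgram_window = 5 is an illustrative choice, not a recommendation:

```r
library(textmineR)

data(nih_sample)

# count co-occurrences of terms within a 5-token skip-gram window
tcm <- CreateTcm(doc_vec = nih_sample$ABSTRACT_TEXT,
                 skipgram_window = 5)

dim(tcm)  # square: terms by terms
```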

Slide 9

An NLP Pipeline

1. Construct a DTM or TCM
2. Fit a model or run an algorithm on the DTM or TCM
3. Analyze the model/algorithm results
4. Apply the model/algorithm to new documents
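
A sketch of all four steps in textmineR, assuming the 3.x API (the k, iteration, and burn-in values are placeholders, not tuned settings):

```r
library(textmineR)

data(nih_sample)

# 1. construct a DTM
dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT,
                 doc_names = nih_sample$APPLICATION_ID)

# 2. fit a model: LDA via Gibbs sampling
model <- FitLdaModel(dtm = dtm, k = 20, iterations = 200, burnin = 150)

# 3. analyze the results, e.g. the top terms in each topic
GetTopTerms(phi = model$phi, M = 5)

# 4. apply the model to documents (here, a re-used slice of the DTM)
new_theta <- predict(model, newdata = dtm[1:5, ],
                     method = "gibbs", iterations = 200, burnin = 150)
```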

Slide 10

Decisions, Decisions…

• What is a term?
  • Unigrams vs. n-grams
  • Stems, lemmas, parts of speech, named entities, etc.
• What is a document?
  • Books, chapters, pages, paragraphs, sentences, etc.
• What is the measure relating my rows to columns?
  • Raw counts, TF-IDF, some other index, etc.
  • Skip-grams, counts of document co-occurrence, etc.
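
Two of these decisions as they look in textmineR (a sketch; the TF-IDF reweighting follows the pattern used in the package vignettes):

```r
library(textmineR)

data(nih_sample)

# "what is a term?" -- count bigrams as well as unigrams
dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT,
                 doc_names = nih_sample$APPLICATION_ID,
                 ngram_window = c(1, 2))

# "what is the measure?" -- reweight raw counts as TF-IDF
tf <- TermDocFreq(dtm)                  # term freq, doc freq, and IDF per term
tfidf <- t(t(dtm[, tf$term]) * tf$idf)  # scale each term column by its IDF
```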

Slide 11

RAM Math

(n rows) × (k columns) × (8 bytes) / 1,000,000,000 = size of the matrix in GB
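
The same arithmetic as a one-line R helper (ram_gb is just an illustrative name):

```r
# size in gigabytes of a dense matrix of 8-byte doubles
ram_gb <- function(n_rows, k_cols) n_rows * k_cols * 8 / 1e9
```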

Slide 12

(100 thousand) × (6 million) × (8 bytes) / (1 billion) = 4,800 GB = 4.8 TB

Slide 13

More RAM Math

• A 10,000 × 20,000 matrix (a moderately sized corpus with many terms removed) is 1.6 GB
• A 10,000 × 60,000 matrix (the same corpus with a larger vocabulary) is 4.8 GB
• A 20,000 × 100,000 matrix (a fairly standard corpus size) is 16 GB
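
Checking all of these figures with the helper sketched above:

```r
ram_gb <- function(n_rows, k_cols) n_rows * k_cols * 8 / 1e9

ram_gb(1e5, 6e6)       # 4800 GB = 4.8 TB (the previous slide)
ram_gb(10000, 20000)   # 1.6 GB
ram_gb(10000, 60000)   # 4.8 GB
ram_gb(20000, 100000)  # 16 GB
```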

Slide 14

Agenda

• NLP Workflow & Computational Considerations
• textmineR’s philosophy
• Topic models & embeddings
• textmineR’s special stuff
• What’s next?

Slide 15

Three Principles

• Maximum interoperability within R’s ecosystem
• Syntax that is idiomatic in R
• Scalable in terms of object storage and computation time
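
One concrete consequence of the interoperability principle: textmineR’s DTMs are dgCMatrix sparse matrices from the Matrix package, so they plug straight into the rest of R’s ecosystem. A quick check:

```r
library(textmineR)

data(nih_sample)

dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT,
                 doc_names = nih_sample$APPLICATION_ID)

class(dtm)            # "dgCMatrix", the Matrix package's sparse format
Matrix::rowSums(dtm)  # ordinary Matrix operations work directly
```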

Slide 16

No content

Slide 17

No content

Slide 18

No content

Slide 19

No content

Slide 20

No content

Slide 21

No content

Slide 22

Agenda

• NLP Workflow & Computational Considerations
• textmineR’s philosophy
• Topic models & embeddings
• textmineR’s special stuff
• What’s next?

Slide 23

Topic models & embeddings are functions mapping a DTM (or TCM) to two or more matrices

[Diagram: a documents × words DTM maps to a documents × topics matrix and a topics × words matrix; a words × words TCM maps to a words × topics matrix and a topics × words matrix]
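
A sketch verifying those shapes on a fitted textmineR LDA model, where theta is the documents × topics matrix and phi the topics × words matrix (names per textmineR’s conventions):

```r
library(textmineR)

data(nih_sample)

dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT,
                 doc_names = nih_sample$APPLICATION_ID)

model <- FitLdaModel(dtm = dtm, k = 10, iterations = 200, burnin = 150)

dim(model$theta)  # documents x topics
dim(model$phi)    # topics x words
# so theta %*% phi has the same shape as the DTM itself
```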

Slide 24

No content

Slide 25

No content

Slide 26

Supported Models

• Latent Dirichlet Allocation (LDA) - Native
• Latent Semantic Analysis (LSA/LSI) - Native, with help from RSpectra
• Correlated Topic Models (CTM) - from the topicmodels package
• Help wanted for more!
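
A sketch of fitting each supported family on the same DTM (assuming the 3.x function names; FitCtmModel needs the topicmodels package installed):

```r
library(textmineR)

data(nih_sample)

dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT,
                 doc_names = nih_sample$APPLICATION_ID)

lda <- FitLdaModel(dtm = dtm, k = 10, iterations = 200)  # native Gibbs LDA
lsa <- FitLsaModel(dtm = dtm, k = 10)                    # native LSA, via RSpectra
ctm <- FitCtmModel(dtm = dtm, k = 10)                    # wraps topicmodels::CTM
```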

Slide 27

• SummarizeTopics()
• LabelTopics()
• GetTopTerms()
• Cluster2TopicModel()
• CalcLikelihood()
• CalcProbCoherence()
• CalcTopicModelR2()
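
A few of these in action on a fitted model (a sketch; signatures assume textmineR 3.x):

```r
library(textmineR)

data(nih_sample)

dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT,
                 doc_names = nih_sample$APPLICATION_ID)

model <- FitLdaModel(dtm = dtm, k = 10, iterations = 200, burnin = 150)

GetTopTerms(phi = model$phi, M = 5)                   # top 5 terms per topic
CalcProbCoherence(phi = model$phi, dtm = dtm, M = 5)  # coherence, one per topic
CalcLikelihood(dtm = dtm, phi = model$phi, theta = model$theta)
CalcTopicModelR2(dtm = dtm, phi = model$phi, theta = model$theta)
```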

Slide 28

Agenda

• NLP Workflow & Computational Considerations
• textmineR’s philosophy
• Topic models & embeddings
• textmineR’s special stuff
• What’s next?

Slide 29

CalcTopicModelR2()
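
Usage is a single call on a fitted model and its training DTM (continuing from the dtm and model built in the sketch above):

```r
# proportion of variability in the DTM explained by the model,
# analogous to R-squared in regression
r2 <- CalcTopicModelR2(dtm = dtm, phi = model$phi, theta = model$theta)
r2
```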

Slide 30

Probabilistic Coherence

• An average measure of statistical independence between the top M words in each topic
• Ranges between -1 and 1
• Values close to 0 indicate statistical independence (not a great topic)
• Negative values indicate negative correlation (likely a terrible topic)
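
Computed per topic with CalcProbCoherence (continuing from the fitted model above; M = 5 follows the function’s conventional default):

```r
# one coherence value per topic, on the [-1, 1] scale described above
coherence <- CalcProbCoherence(phi = model$phi, dtm = dtm, M = 5)

summary(coherence)  # values near 0 suggest statistically independent terms
```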

Slide 31

LDA asymmetric priors + more!
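
A sketch of an asymmetric document-topic prior, assuming (as in textmineR 3.x) that FitLdaModel’s alpha argument accepts a length-k numeric vector rather than only a scalar; dtm is a DTM built as in the earlier sketches:

```r
k <- 10

# asymmetric prior: give topic 1 more prior weight than the rest
alpha <- c(0.5, rep(0.05, k - 1))

model_asym <- FitLdaModel(dtm = dtm, k = k, alpha = alpha,
                          iterations = 200, burnin = 150)
```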

Slide 32

Agenda

• NLP Workflow & Computational Considerations
• textmineR’s philosophy
• Topic models & embeddings
• textmineR’s special stuff
• What’s next?

Slide 33

• update() method for LDA
• posterior() method for LDA
• Parallel fitting for LDA
• Native DTM and TCM functions

Slide 34

textmineR is on CRAN

• Latest CRAN version is 3.0.1 (updated last week)
• Development version is on GitHub: https://github.com/tommyjones/textmineR
• Be a contributor!

Slide 35

Start with the vignettes!

Slide 36

Thank You

• [email protected]
• Twitter: @thos_jones
• http://www.biasedestimates.com
• https://GitHub.com/TommyJones