Document Term Matrix (DTM)
[Figure: example DTM — rows are Document 1 through Document 5, columns are terms such as "reduce", "health", "policy", "food", "choice", "study", "sodium", "social"; cells hold counts]
• Rows are documents
• Columns are linguistic features
• Much wider than tall
Term Co-occurrence Matrix (TCM)
[Figure: example TCM — rows and columns are the same terms ("reduce", "health", "policy", "food", "choice", "study", "sodium", "social"); cells hold co-occurrence counts]
• Rows and columns are linguistic features
• Square, but not necessarily symmetric
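A minimal sketch of both structures using textmineR's CreateDtm() and CreateTcm(); the three-document corpus below is invented for illustration, and the argument names are as I understand the package's interface:

```r
library(textmineR)

# a made-up three-document corpus
docs <- c(
  doc1 = "a policy to reduce sodium could improve health",
  doc2 = "this study looks at food choice and health",
  doc3 = "social factors shape food choice and sodium intake"
)

# DTM: rows are documents, columns are terms -- much wider than tall
dtm <- CreateDtm(doc_vec = docs, doc_names = names(docs))
dim(dtm)

# TCM: rows AND columns are terms -- square, but not necessarily symmetric
tcm <- CreateTcm(doc_vec = docs, skipgram_window = 5)
dim(tcm)
```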
An NLP Pipeline
1. Construct a DTM or TCM
2. Fit a model or run an algorithm on the DTM or TCM
3. Analyze the model/algorithm results
4. Apply the model/algorithm to new documents
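The four steps might look like the following textmineR sketch, assuming `docs` and `new_docs` are character vectors of raw text and using LDA as the model; k, iterations, and burnin are arbitrary example values:

```r
library(textmineR)

# 1. Construct a DTM
dtm <- CreateDtm(doc_vec = docs, doc_names = names(docs))

# 2. Fit a model on the DTM -- here, LDA with k topics
model <- FitLdaModel(dtm = dtm, k = 10, iterations = 200, burnin = 180)

# 3. Analyze the results, e.g. the top 5 terms of each topic
top_terms <- GetTopTerms(phi = model$phi, M = 5)

# 4. Apply the fitted model to new documents
new_dtm   <- CreateDtm(doc_vec = new_docs, doc_names = names(new_docs))
new_theta <- predict(model, new_dtm, method = "dot")
```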
Decisions, Decisions…
• What is a term?
  • Unigrams vs. n-grams
  • Stems, lemmas, parts of speech, named entities, etc.
• What is a document?
  • Books, chapters, pages, paragraphs, sentences, etc.
• What is the measure relating my rows to columns?
  • Raw counts, TF-IDF, some other index, etc.
  • Skip-grams, count of document co-occurrence, etc.
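Most of these decisions show up as arguments when the matrices are built. A hedged sketch of a few options (n-grams, stemming, TF-IDF reweighting, skip-grams vs. document co-occurrence), again assuming a character vector `docs` and textmineR's interface as I recall it:

```r
library(textmineR)

# "What is a term?" -- unigrams AND bigrams, reduced to stems
dtm <- CreateDtm(
  doc_vec = docs,
  doc_names = names(docs),
  ngram_window = c(1, 2),   # unigrams and bigrams
  stem_lemma_function = function(x) SnowballC::wordStem(x, "porter")
)

# "What is the measure?" -- reweight raw counts as TF-IDF
tf <- TermDocFreq(dtm)
dtm_tfidf <- t(t(dtm[, tf$term]) * tf$idf)   # scale each term column by its IDF

# For a TCM: skip-grams within a window vs. whole-document co-occurrence
tcm_skip <- CreateTcm(doc_vec = docs, skipgram_window = 5)
tcm_doc  <- CreateTcm(doc_vec = docs, skipgram_window = Inf)
```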
More RAM Math
Stored dense, at 8 bytes per double-precision cell:
• A 10,000 × 20,000 matrix (a moderately-sized corpus with many terms removed) is 1.6 GB
• A 10,000 × 60,000 matrix (the same ratio of rows to columns as on the last slide) is 4.8 GB
• A 20,000 × 100,000 matrix (a fairly standard corpus size) is 16 GB
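The arithmetic, plus a quick illustration of why sparse storage (the Matrix package's format, which textmineR uses for its DTMs) sidesteps the problem; the 1% density below is an arbitrary example:

```r
# dense storage: 8 bytes per double-precision cell
10000 * 20000  * 8 / 1e9   # 1.6 GB
10000 * 60000  * 8 / 1e9   # 4.8 GB
20000 * 100000 * 8 / 1e9   # 16  GB

# a sparse matrix stores only the non-zero cells
library(Matrix)
m <- rsparsematrix(nrow = 10000, ncol = 20000, density = 0.01)
format(object.size(m), units = "MB")   # tens of MB instead of 1.6 GB
```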
Three Principles
• Maximum interoperability within R’s ecosystem
• Syntax that is idiomatic in R
• Scalable in terms of object storage and computation time
Topic models & embeddings are functions Mapping a DTM (TCM) to 2 or more matrices → Words Documents Words Documents Topics Topics DTM → Words Words Words Words Topics Topics TCM
Probabilistic Coherence
• Averages a measure of statistical dependence among the top M words in each topic
• Ranges between -1 and 1
• Values close to 0 indicate statistical independence (not a great topic)
• Negative values indicate negative correlation (likely a terrible topic)
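A sketch of computing it with textmineR's CalcProbCoherence(), assuming the `model` and `dtm` objects from the earlier sketches:

```r
library(textmineR)

# per-topic probabilistic coherence over each topic's top M = 5 words
coh <- CalcProbCoherence(phi = model$phi, dtm = dtm, M = 5)

summary(coh)   # values near 0 suggest weak topics; higher is better
# FitLdaModel can also return this directly as model$coherence
# when called with calc_coherence = TRUE
```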
textmineR is on CRAN
• Latest CRAN version is 3.0.1 (updated last week)
• Development version is on GitHub
• Be a contributor! https://github.com/tommyjones/textmineR