
An introduction to textmineR

Tommy
November 09, 2018

Delivered at the DC R Conference on November 9, 2018

Transcript

  1. An Introduction to
    textmineR
    Thomas W. Jones
    Delivered at DC R Conference
    November 9, 2018

  2.–3. [image-only slides]

  4. Why might you use textmineR?
    • Three principles of the framework
    1. Maximal interoperability with other R packages
    2. Scalable for object storage and computation time
    3. Syntax that is idiomatic to R
    • A pretty darn good topic modeling workbench
    • Some special stuff based on my research


  5. Agenda
    • NLP Workflow & Computational Considerations
    • textmineR’s philosophy
    • Topic models & embeddings
    • textmineR’s special stuff
    • What’s next?



  7. Document Term Matrix (DTM)

    [Example DTM: rows are Documents 1–5; columns are the terms reduce,
    health, policy, food, choice, study, sodium, social; cells hold counts]

    • Rows are documents
    • Columns are linguistic features
    • Much wider than tall
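
    For instance, a minimal sketch of building a DTM with textmineR's
    CreateDtm() (the two toy documents are invented for illustration):

    library(textmineR)

    # any named character vector of documents works
    docs <- c(
      doc1 = "a policy to reduce sodium could improve health",
      doc2 = "the study looked at food choice and social factors"
    )

    # CreateDtm() returns a sparse matrix: documents in rows, terms in columns
    dtm <- CreateDtm(doc_vec = docs,
                     doc_names = names(docs),
                     ngram_window = c(1, 1))  # unigrams only

    dim(dtm)  # for real corpora, much wider than tall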


  8. Term Co-occurrence Matrix (TCM)

    [Example TCM: rows and columns are the terms reduce, health, policy,
    food, choice, study, sodium, social; cells hold co-occurrence counts]

    • Rows and columns are linguistic features
    • Square, not necessarily symmetric
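
    A corresponding sketch with CreateTcm(), assuming the same toy documents
    as above:

    # skipgram_window = Inf counts co-occurrence within whole documents;
    # a finite window (e.g. 5) counts skip-gram co-occurrences instead
    tcm <- CreateTcm(doc_vec = docs, skipgram_window = Inf)

    dim(tcm)  # square: terms by terms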


  9. An NLP Pipeline
    1. Construct a DTM or TCM
    2. Fit a model or run an algorithm on the DTM or TCM
    3. Analyze the model/algorithm results
    4. Apply the model/algorithm to new documents
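
    A sketch of the four steps end to end, assuming the nih_sample data
    bundled with textmineR:

    library(textmineR)

    # 1. construct a DTM
    dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT,
                     doc_names = nih_sample$APPLICATION_ID)

    # 2. fit a model (LDA with 10 topics)
    model <- FitLdaModel(dtm = dtm, k = 10, iterations = 200, burnin = 180)

    # 3. analyze the results
    GetTopTerms(phi = model$phi, M = 5)

    # 4. apply the model to new documents (here, reusing two rows of the DTM)
    new_theta <- predict(model, dtm[1:2, ],
                         method = "gibbs", iterations = 200, burnin = 180)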


  10. Decisions Decisions…
    • What is a term?
    • Unigrams vs. n-grams
    • stems, lemmas, parts of speech, named entities, etc…
    • What is a document?
    • Books, chapters, pages, paragraphs, sentences, etc…
    • What is the measure relating my rows to columns?
    • Raw counts, TF-IDF, some other index, etc…
    • Skip-grams, count of document co-occurrence, etc…


  11. RAM Math
    (n rows) * (k columns) * (8 bytes) / 1,000,000,000 = size of matrix in Gb


  12. (100 thousand) * (6 million) * (8 bytes) / (1 billion) =
    4,800 Gb = 4.8 Tb


  13. More RAM Math
    • A 10,000 x 20,000 matrix (a moderately-sized corpus with many terms removed) is 1.6 Gb
    • A 10,000 x 60,000 matrix (the same ratio of rows to columns as on the last slide) is 4.8 Gb
    • A 20,000 x 100,000 matrix (a fairly standard corpus size) is 16 Gb
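
    Checking the arithmetic from the last three slides in R:

    # Gb needed by a dense double-precision matrix: rows * cols * 8 bytes / 1e9
    ram_gb <- function(rows, cols) rows * cols * 8 / 1e9

    ram_gb(1e5, 6e6)  # 4800 Gb = 4.8 Tb
    ram_gb(1e4, 2e4)  # 1.6 Gb
    ram_gb(1e4, 6e4)  # 4.8 Gb
    ram_gb(2e4, 1e5)  # 16 Gb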


  14. Agenda
    • NLP Workflow & Computational Considerations
    • textmineR’s philosophy
    • Topic models & embeddings
    • textmineR’s special stuff
    • What’s next?


  15. Three Principles
    • Maximum interoperability within R’s ecosystem
    • Syntax that is idiomatic in R
    • Scalable in terms of object storage and computation time

  16.–21. [image-only slides]

  22. Agenda
    • NLP Workflow & Computational Considerations
    • textmineR’s philosophy
    • Topic models & embeddings
    • textmineR’s special stuff
    • What’s next?


  23. Topic models & embeddings are functions mapping a DTM (TCM) to 2 or more matrices
    • A DTM (Documents x Words) factors into a Documents x Topics matrix and a Topics x Words matrix
    • A TCM (Words x Words) factors into a Words x Topics matrix and a Topics x Words matrix
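
    Continuing the earlier sketch, both factor matrices live on the fitted
    LDA object (names as in textmineR's conventions):

    dim(model$theta)  # Documents x Topics
    dim(model$phi)    # Topics x Words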




  24.–25. [image-only slides]

  26. Supported Models
    • Latent Dirichlet Allocation (LDA) - Native
    • Latent Semantic Analysis (LSA/LSI) - Native with help from RSpectra
    • Correlated Topic Models (CTM) - from the topicmodels package
    • Help wanted for more!


  27. • SummarizeTopics()
    • LabelTopics()
    • GetTopTerms()
    • Cluster2TopicModel()
    • CalcLikelihood()
    • CalcProbCoherence()
    • CalcTopicModelR2()
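
    A sketch of a few of these helpers applied to the model fitted earlier
    (assuming SummarizeTopics() accepts the fitted model object):

    SummarizeTopics(model)                  # one summary row per topic
    GetTopTerms(phi = model$phi, M = 10)    # top 10 terms per topic
    CalcLikelihood(dtm = dtm, phi = model$phi,
                   theta = model$theta)     # log likelihood of the DTM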


  28. Agenda
    • NLP Workflow & Computational Considerations
    • textmineR’s philosophy
    • Topic models & embeddings
    • textmineR’s special stuff
    • What’s next?


  29. CalcTopicModelR2()
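
    An example call, with dtm and model as in the earlier sketch:

    # R-squared: how well theta %*% phi reconstructs the DTM
    CalcTopicModelR2(dtm = dtm, phi = model$phi, theta = model$theta)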


  30. Probabilistic Coherence
    • Average measure of statistical independence between M neighboring words in each topic
    • Ranges from -1 to 1
    • Values close to 0 indicate statistical independence (not a great topic)
    • Negative values indicate negative correlation (likely a terrible topic)
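
    For example, per-topic coherence over each topic's top 5 terms, using the
    dtm and model from the earlier sketch:

    # values near 0 or below flag topics worth inspecting
    coh <- CalcProbCoherence(phi = model$phi, dtm = dtm, M = 5)
    summary(coh)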


  31. LDA asymmetric priors + more!
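
    A sketch, assuming FitLdaModel()'s alpha argument accepts a length-k
    vector for an asymmetric document-topic prior:

    model <- FitLdaModel(dtm = dtm, k = 10,
                         iterations = 200, burnin = 180,
                         alpha = seq(0.5, 0.05, length.out = 10),  # asymmetric
                         beta = 0.05,
                         optimize_alpha = TRUE)  # tune alpha during sampling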


  32. Agenda
    • NLP Workflow & Computational Considerations
    • textmineR’s philosophy
    • Topic models & embeddings
    • textmineR’s special stuff
    • What’s next?


  33. • update() method for LDA

    • posterior() method for LDA

    • Parallel fitting for LDA

    • Native DTM and TCM functions


  34. textmineR is on CRAN
    • Latest CRAN version is 3.0.1 (updated last week)
    • Development version is on GitHub
    • Be a contributor!
    https://github.com/tommyjones/textmineR


  35. Start with the vignettes!


  36. Thank You
    • [email protected]
    • twitter: @thos_jones
    • http://www.biasedestimates.com
    • https://GitHub.com/TommyJones
