Text Mining: Exploratory Data Analysis to Machine Learning

March 2019 talk at WiDS Salt Lake City regional event

Julia Silge

March 04, 2019

Transcript

  1. TEXT MINING

    EXPLORATORY DATA ANALYSIS TO MACHINE LEARNING
  2. HELLO

    TIDYTEXT
    I’m Julia Silge
    Data Scientist at Stack Overflow
    @juliasilge https://juliasilge.com/
  3. TEXT DATA IS INCREASINGLY IMPORTANT
  4. TEXT DATA IS INCREASINGLY IMPORTANT

    NLP TRAINING IS SCARCE ON THE GROUND
  5. TIDY DATA PRINCIPLES + COUNT-BASED METHODS = TIDYTEXT
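A minimal sketch of what "tidy data principles + count-based methods" can look like in practice, assuming the dplyr and tidytext packages are installed. The toy corpus `text_df` and its column names are hypothetical and not taken from the talk.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy corpus: one row per line of text
text_df <- tibble(
  line = 1:3,
  text = c("Text data is increasingly important",
           "Tidy data principles make text easier to handle",
           "Count words to explore text")
)

# Tidy format: one token (word) per row; unnest_tokens() lowercases by default
tidy_words <- text_df %>%
  unnest_tokens(word, text)

# Count-based method: word frequencies across the whole corpus
word_counts <- tidy_words %>%
  count(word, sort = TRUE)

word_counts
```

Because the result is an ordinary data frame, the usual dplyr and ggplot2 verbs apply to it directly.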
  6. https://github.com/juliasilge/tidytext

  7. https://github.com/juliasilge/tidytext

  8. http://tidytextmining.com/

  9. EXPLORATORY DATA ANALYSIS

    N-GRAMS AND MORE WORDS
    MACHINE LEARNING
  10. EXPLORATORY DATA ANALYSIS
  11. from the Washington Post’s Wonkblog

  12. from the Washington Post’s Wonkblog

  13. D3 visualization on Glitch

  14. WHAT IS A DOCUMENT ABOUT?

    TERM FREQUENCY INVERSE DOCUMENT FREQUENCY
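As a rough illustration of term frequency and inverse document frequency, the sketch below computes tf-idf with `bind_tf_idf()` from tidytext, assuming dplyr and tidytext are installed. The two-document corpus is hypothetical. A word that appears in every document (here "the") gets an idf of ln(2/2) = 0, while a word unique to one document gets idf = ln(2).

```r
library(dplyr)
library(tidytext)

# Hypothetical two-document corpus
docs <- tibble(
  document = c("a", "b"),
  text = c("the rover landed on mars",
           "the telescope imaged the galaxy")
)

# Word counts per document
doc_words <- docs %>%
  unnest_tokens(word, text) %>%
  count(document, word)

# tf = n / (total words in the document); idf = ln(n_docs / n_docs containing word)
doc_tf_idf <- doc_words %>%
  bind_tf_idf(word, document, n) %>%
  arrange(desc(tf_idf))

doc_tf_idf
```

Words with high tf-idf are frequent in one document but rare across the collection, which is why the measure is useful for asking what a document is about.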
  15. None
  16. None
  17. • As part of the NASA Datanauts program, I worked on a project to understand NASA datasets

    • Metadata includes title, description, keywords, etc.
  18. None
  19. TAKING TIDYTEXT TO THE NEXT LEVEL

    N-GRAMS, NETWORKS, & NEGATION
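One possible way to work with n-grams and negation in a tidy workflow, sketched under the assumption that dplyr, tidyr, and tidytext are installed. The `reviews` data is hypothetical. The idea is to tokenize into bigrams, split each bigram into two columns, and look at which words follow "not".

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical toy data
reviews <- tibble(
  id = 1:2,
  text = c("I am not happy with this",
           "I am happy and not sad")
)

# Tokenize into bigrams (pairs of consecutive words), then split into two columns
bigrams <- reviews %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, into = c("word1", "word2"), sep = " ")

# Negation: which words are preceded by "not"?
negated <- bigrams %>%
  filter(word1 == "not") %>%
  count(word2, sort = TRUE)

negated
```

The same `word1`/`word2` counts can also serve as an edge list when drawing a word network.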
  20. None
  21. None
  22. None
  23. None
  24. None
  25. TAKING TIDYTEXT TO THE NEXT LEVEL

    TOPIC MODELING
  26. TOPIC MODELING

    • Each DOCUMENT = mixture of topics
    • Each TOPIC = mixture of words
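The two statements above can be illustrated with plain matrices in base R. This is only a sketch with made-up numbers, not a fitted model: `beta` holds hypothetical per-topic word probabilities and `gamma` holds hypothetical per-document topic proportions (the names follow the convention tidytext uses when tidying topic models).

```r
# Hypothetical numbers, for illustration only (not fitted from data).
# beta: each TOPIC is a mixture of words (each row sums to 1)
beta <- matrix(
  c(0.5, 0.4, 0.1, 0.0,
    0.0, 0.1, 0.4, 0.5),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("topic1", "topic2"),
                  c("marriage", "ball", "whale", "ship"))
)

# gamma: each DOCUMENT is a mixture of topics (each row sums to 1)
gamma <- matrix(
  c(0.9, 0.1,
    0.2, 0.8),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("doc1", "doc2"), c("topic1", "topic2"))
)

# Implied word probabilities per document: a topic-weighted average of beta rows
doc_word <- gamma %*% beta
doc_word
```

Fitting a topic model runs this logic in reverse: it estimates `beta` and `gamma` from observed word counts.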
  27. None
  28. None
  29. None
  30. None
  31. TAKING TIDYTEXT TO THE NEXT LEVEL

    TEXT CLASSIFICATION
  32. TRAIN A GLMNET MODEL
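  33. TEXT CLASSIFICATION

    library(glmnet)
    library(doMC)
    registerDoMC(cores = 8)  # parallel backend for cross-validation

    # TRUE for rows from "Pride and Prejudice", FALSE otherwise
    is_jane <- books_joined$title == "Pride and Prejudice"

    # cross-validated, regularized logistic regression on the sparse word matrix
    model <- cv.glmnet(sparse_words, is_jane, family = "binomial",
                       parallel = TRUE, keep = TRUE)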
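The slide does not show how `sparse_words` was built. One plausible route, sketched here as an assumption and not as the talk's actual code, is to cast tidy word counts into a sparse document-by-word matrix with `cast_sparse()` from tidytext. The data and the name `sparse_sketch` are hypothetical.

```r
library(dplyr)
library(tidytext)

# Hypothetical tidy counts: one row per document-word pair
word_counts <- tibble(
  document = c("doc1", "doc1", "doc2", "doc2", "doc3"),
  word     = c("elizabeth", "darcy", "captain", "harbor", "darcy"),
  n        = c(3, 2, 4, 1, 5)
)

# Rows are documents, columns are words, values are counts.
# Document-word pairs that never occur are stored implicitly as zeros.
sparse_sketch <- word_counts %>%
  cast_sparse(document, word, n)

dim(sparse_sketch)
```

A matrix of this shape, with one row per document and a response vector of the same length, is the kind of input `cv.glmnet()` accepts.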
  34. None
  35. None
  36. THANK YOU

    JULIA SILGE @juliasilge https://juliasilge.com
  37. THANK YOU

    JULIA SILGE @juliasilge https://juliasilge.com
    Author portraits from Wikimedia. Photos by Glen Noble and Kimberly Farmer on Unsplash.