Text Mining: Exploratory Data Analysis to Machine Learning

March 2019 talk at WiDS Salt Lake City regional event

Julia Silge

March 04, 2019

Transcript

  1. TEXT MINING: EXPLORATORY DATA ANALYSIS TO MACHINE LEARNING
  2. HELLO. I’m Julia Silge, Data Scientist at Stack Overflow. @juliasilge https://juliasilge.com/ [TIDYTEXT]
  3. TEXT DATA IS INCREASINGLY IMPORTANT
  4. TEXT DATA IS INCREASINGLY IMPORTANT. NLP TRAINING IS SCARCE ON THE GROUND.
  5. EXPLORATORY DATA ANALYSIS, N-GRAMS AND MORE WORDS, MACHINE LEARNING
  6. WHAT IS A DOCUMENT ABOUT? TERM FREQUENCY, INVERSE DOCUMENT FREQUENCY
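
As an illustration of slide 6, here is a minimal base R sketch of term frequency and inverse document frequency. The three toy documents and all variable names are made up for this example and are not from the talk. It assumes the natural-log definition idf = ln(number of documents / number of documents containing the word), which is also what tidytext's `bind_tf_idf()` computes.

```r
# Toy corpus: three tiny "documents" (hypothetical example text)
docs <- c(
  a = "the moon orbits the earth",
  b = "the rover landed on mars",
  c = "the earth and mars orbit the sun"
)

# Tokenize on whitespace and count each word within each document
tokens <- strsplit(tolower(docs), "\\s+")
counts <- do.call(rbind, lapply(names(tokens), function(d) {
  tab <- table(tokens[[d]])
  data.frame(document = d, word = names(tab), n = as.integer(tab),
             stringsAsFactors = FALSE)
}))

# Term frequency: share of a document's words taken up by each word
counts$tf <- counts$n / ave(counts$n, counts$document, FUN = sum)

# Inverse document frequency: ln(total documents / documents containing the word)
doc_freq <- table(counts$word)
counts$idf <- log(length(docs) / as.numeric(doc_freq[counts$word]))

# tf-idf is the product; words found in every document score zero
counts$tf_idf <- counts$tf * counts$idf
counts[order(-counts$tf_idf), ]
```

A word such as "the" appears in every toy document, so its idf (and tf-idf) is zero, while a word found in only one document, such as "moon", ranks near the top for that document.
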
  7. • As part of the NASA Datanauts program, I worked on a project to understand NASA datasets
     • Metadata includes title, description, keywords, etc.
  8. TAKING TIDYTEXT TO THE NEXT LEVEL: N-GRAMS, NETWORKS, & NEGATION
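
To make slide 8 concrete, the following base R sketch builds bigrams from a single made-up sentence, counts them, and picks out the words that follow "not". The sentence and variable names are assumptions for illustration only. With tidytext itself, bigram tokenization would typically go through `unnest_tokens()` with `token = "ngrams"` and `n = 2`.

```r
# Hypothetical example sentence, split into lowercase words
text <- "I am not happy with this and I do not like it but I am happy to help"
words <- strsplit(tolower(text), "\\s+")[[1]]

# A bigram pairs each word with the word that follows it
bigrams <- data.frame(
  word1 = head(words, -1),
  word2 = tail(words, -1),
  stringsAsFactors = FALSE
)

# Count each distinct bigram; the counted pairs can also be read as
# weighted edges (word1 -> word2) of a word network
bigram_counts <- aggregate(n ~ word1 + word2,
                           data = cbind(bigrams, n = 1), FUN = sum)
bigram_counts <- bigram_counts[order(-bigram_counts$n), ]

# Negation: words preceded by "not", which a word-by-word sentiment
# lexicon would score with the wrong sign
negated <- bigrams$word2[bigrams$word1 == "not"]
negated
```

Here `negated` contains "happy" and "like", the two words whose apparent sentiment is reversed by the preceding "not".
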
  9. TAKING TIDYTEXT TO THE NEXT LEVEL: TOPIC MODELING
  10. TOPIC MODELING
      • Each DOCUMENT = mixture of topics
      • Each TOPIC = mixture of words
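
The two bullets on slide 10 can be written as a single mixture formula. The notation here is an assumption, not taken from the slides: $K$ topics, $\gamma_{d,k}$ for the proportion of document $d$ that comes from topic $k$, and $\beta_{k,w}$ for the probability of word $w$ under topic $k$.

```latex
P(w \mid d) = \sum_{k=1}^{K} \gamma_{d,k} \, \beta_{k,w},
\qquad \sum_{k=1}^{K} \gamma_{d,k} = 1,
\qquad \sum_{w} \beta_{k,w} = 1
```

For example, if a document is 70% topic 1 and 30% topic 2, and a word has probability 0.10 under topic 1 and 0.02 under topic 2, the word's probability in that document is $0.7 \times 0.10 + 0.3 \times 0.02 = 0.076$.
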
  11. TAKING TIDYTEXT TO THE NEXT LEVEL: TEXT CLASSIFICATION
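  12. TEXT CLASSIFICATION

    library(glmnet)
    library(doMC)
    registerDoMC(cores = 8)  # parallel backend used by cv.glmnet

    # TRUE for rows that come from "Pride and Prejudice"
    is_jane <- books_joined$title == "Pride and Prejudice"

    # cross-validated regularized logistic regression on the sparse word matrix
    model <- cv.glmnet(sparse_words, is_jane, family = "binomial",
                       parallel = TRUE, keep = TRUE)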
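
For reference, the model fit on slide 12 can be written as an objective function. Because the call leaves `alpha` at glmnet's default of 1, this is lasso-penalized logistic regression. The symbols are assumptions for this note: $x_i$ is row $i$ of `sparse_words`, $y_i$ is 1 when `is_jane` is TRUE for that row and 0 otherwise, and $N$ is the number of rows.

```latex
\min_{\beta_0, \beta} \;
-\frac{1}{N} \sum_{i=1}^{N}
\left[ y_i \left( \beta_0 + x_i^{\top} \beta \right)
- \log\left( 1 + e^{\beta_0 + x_i^{\top} \beta} \right) \right]
+ \lambda \lVert \beta \rVert_1
```

`cv.glmnet` fits this over a sequence of $\lambda$ values and uses cross-validation to estimate the error at each one. The $\ell_1$ penalty drives most word coefficients to exactly zero, so the words left with nonzero coefficients are the ones the model uses to separate the two classes.
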
  13. THANK YOU. JULIA SILGE @juliasilge https://juliasilge.com
  14. THANK YOU. JULIA SILGE @juliasilge https://juliasilge.com
      Author portraits from Wikimedia. Photos by Glen Noble and Kimberly Farmer on Unsplash.