
The Life-Changing Magic of Tidying Text

Julia Silge
November 18, 2016


November 2016 talk for the Utah R Users group and Intermountain Data Conference

Links:
tidytext on GitHub: https://github.com/juliasilge/tidytext
States & Song Lyrics on the WashPo's Wonkblog: https://www.washingtonpost.com/news/wonk/wp/2016/10/01/the-states-that-americans-sing-about-most/
Trump's Android & iPhone Tweets: http://varianceexplained.org/r/trump-tweets/
NASA Datanauts: https://open.nasa.gov/explore/datanauts/
Tidy Text Mining with R: http://tidytextmining.com/



Transcript

  1. Does your text data spark joy?
     • Analysts and data scientists are typically trained to handle rectangular, numeric data
     • Much of the data proliferating today is unstructured and text-heavy
     • Estimates vary, but perhaps 70% to 90% of potentially usable information starts in unstructured form
  2. Tidy data
     • Tidy data principles are a powerful framework for manipulating, modeling, and visualizing data
     • Using tidy data principles and tidy tools can also make text mining easier and more effective
  3. Tidy data
     Tidy data has a specific structure:
     • each variable is a column
     • each observation is a row
     • each type of observational unit is a table
  4. What do we mean by tidy text?

     > text <- c("Because I could not stop for Death -",
                 "He kindly stopped for me -",
                 "The Carriage held but just Ourselves -",
                 "and Immortality")
     >
     > text
     ## [1] "Because I could not stop for Death -"   "He kindly stopped for me -"
     ## [3] "The Carriage held but just Ourselves -" "and Immortality"
  5. What do we mean by tidy text?

     > library(dplyr)
     > text_df <- data_frame(line = 1:4, text = text)
     >
     > text_df
     ## # A tibble: 4 × 2
     ##    line text
     ##   <int> <chr>
     ## 1     1 Because I could not stop for Death -
     ## 2     2 He kindly stopped for me -
     ## 3     3 The Carriage held but just Ourselves -
     ## 4     4 and Immortality
  6. What do we mean by tidy text?

     > library(tidytext)
     > text_df %>%
         unnest_tokens(word, text)
     ## # A tibble: 20 × 2
     ##     line word
     ##    <int> <chr>
     ## 1       1 because
     ## 2       1 i
     ## 3       1 could
     ## 4       1 not
     ## 5       1 stop
     ## 6       1 for
     ## 7       1 death
     ## 8       2 he
     ## 9       2 kindly
     ## 10      2 stopped
     ## # ... with 10 more rows
  7. What do we mean by tidy text?

     > library(tidytext)
     > text_df %>%
         unnest_tokens(word, text)
     ## # A tibble: 20 × 2
     ##     line word
     ##    <int> <chr>
     ## 1       1 because
     ## 2       1 i
     ## 3       1 could
     ## 4       1 not
     ## 5       1 stop
     ## 6       1 for
     ## 7       1 death
     ## 8       2 he
     ## 9       2 kindly
     ## 10      2 stopped
     ## # ... with 10 more rows

     • Other columns have been retained.
     • Punctuation has been stripped.
     • Words have been converted to lowercase.
  8. Tidying the works of Jane Austen

     > library(janeaustenr)
     > library(dplyr)
     > library(stringr)
     >
     > original_books <- austen_books() %>%
         group_by(book) %>%
         mutate(linenumber = row_number(),
                chapter = cumsum(str_detect(text,
                                            regex("^chapter [\\divxlc]",
                                                  ignore_case = TRUE)))) %>%
         ungroup()
  9. Tidying the works of Jane Austen

     > original_books
     # A tibble: 73,422 × 4
                         text                book linenumber chapter
                        <chr>              <fctr>      <int>   <int>
     1  SENSE AND SENSIBILITY Sense & Sensibility          1       0
     2                        Sense & Sensibility          2       0
     3         by Jane Austen Sense & Sensibility          3       0
     4                        Sense & Sensibility          4       0
     5                 (1811) Sense & Sensibility          5       0
     6                        Sense & Sensibility          6       0
     7                        Sense & Sensibility          7       0
     8                        Sense & Sensibility          8       0
     9                        Sense & Sensibility          9       0
     10             CHAPTER 1 Sense & Sensibility         10       1
     # ... with 73,412 more rows
  10. Tidying the works of Jane Austen

      > tidy_books <- original_books %>%
          unnest_tokens(word, text)
      >
      > tidy_books
      # A tibble: 725,054 × 4
                        book linenumber chapter        word
                      <fctr>      <int>   <int>       <chr>
      1  Sense & Sensibility          1       0       sense
      2  Sense & Sensibility          1       0         and
      3  Sense & Sensibility          1       0 sensibility
      4  Sense & Sensibility          3       0          by
      5  Sense & Sensibility          3       0        jane
      6  Sense & Sensibility          3       0      austen
      7  Sense & Sensibility          5       0        1811
      8  Sense & Sensibility         10       1     chapter
      9  Sense & Sensibility         10       1           1
      10 Sense & Sensibility         13       1         the
      # ... with 725,044 more rows
  11. Removing stop words

      > data(stop_words)
      >
      > tidy_books <- tidy_books %>%
          anti_join(stop_words)
      >
      > tidy_books %>%
          count(word, sort = TRUE)
  12. Sentiment analysis

      > get_sentiments("afinn")
      # A tibble: 2,476 × 2
               word score
              <chr> <int>
      1     abandon    -2
      2   abandoned    -2
      3    abandons    -2
      4    abducted    -2
      5   abduction    -2
      6  abductions    -2
      7       abhor    -3
      8    abhorred    -3
      9   abhorrent    -3
      10     abhors    -3
      # ... with 2,466 more rows

      > get_sentiments("bing")
      # A tibble: 6,788 × 2
                word sentiment
               <chr>     <chr>
      1      2-faced  negative
      2      2-faces  negative
      3           a+  positive
      4     abnormal  negative
      5      abolish  negative
      6   abominable  negative
      7   abominably  negative
      8    abominate  negative
      9  abomination  negative
      10       abort  negative
      # ... with 6,778 more rows

      > get_sentiments("nrc")
      # A tibble: 13,901 × 2
                word sentiment
               <chr>     <chr>
      1       abacus     trust
      2      abandon      fear
      3      abandon  negative
      4      abandon   sadness
      5    abandoned     anger
      6    abandoned      fear
      7    abandoned  negative
      8    abandoned   sadness
      9  abandonment     anger
      10 abandonment      fear
      # ... with 13,891 more rows
  13. Sentiment analysis

      What are the most common joy words in Emma?

      > nrcjoy <- get_sentiments("nrc") %>%
          filter(sentiment == "joy")
      >
      > tidy_books %>%
          filter(book == "Emma") %>%
          semi_join(nrcjoy) %>%
          count(word, sort = TRUE)
  14. Sentiment analysis

      What are the most common joy words in Emma?

      # A tibble: 303 × 2
            word     n
           <chr> <int>
      1     good   359
      2    young   192
      3   friend   166
      4     hope   143
      5    happy   125
      6     love   117
      7     deal    92
      8    found    92
      9  present    89
      10    kind    82
      # ... with 293 more rows
  15. Sentiment analysis

      > library(tidyr)
      >
      > janeaustensentiment <- tidy_books %>%
          inner_join(get_sentiments("bing")) %>%
          count(book, index = linenumber %/% 80, sentiment) %>%
          spread(sentiment, n, fill = 0) %>%
          mutate(sentiment = positive - negative)

      (The integer division linenumber %/% 80 assigns each line to an 80-line
      section of its novel, so net sentiment can be tracked across the arc of
      each narrative.)
  16. Sentiment analysis

      Which words contribute to each sentiment?

      > bing_word_counts <- tidy_books %>%
          inner_join(get_sentiments("bing")) %>%
          count(word, sentiment, sort = TRUE) %>%
          ungroup()
  17. Sentiment analysis

      Which words contribute to each sentiment?

      > bing_word_counts
      # A tibble: 2,585 × 3
             word sentiment     n
            <chr>     <chr> <int>
      1      miss  negative  1855
      2      well  positive  1523
      3      good  positive  1380
      4     great  positive   981
      5      like  positive   725
      6    better  positive   639
      7    enough  positive   613
      8     happy  positive   534
      9      love  positive   495
      10 pleasure  positive   462
      # ... with 2,575 more rows
  18. Sentiment analysis

      > library(wordcloud)
      >
      > tidy_books %>%
          anti_join(stop_words) %>%
          count(word) %>%
          with(wordcloud(word, n, max.words = 100))
  19. Sentiment analysis

      > library(reshape2)
      >
      > tidy_books %>%
          inner_join(get_sentiments("bing")) %>%
          count(word, sentiment, sort = TRUE) %>%
          acast(word ~ sentiment, value.var = "n", fill = 0) %>%
          comparison.cloud(colors = c("#F8766D", "#00BFC4"),
                           max.words = 100)
  20. Sentiment analysis beyond single words

      > austen_chapters <- austen_books() %>%
          group_by(book) %>%
          unnest_tokens(chapter, text, token = "regex",
                        pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
          ungroup()
      >
      > austen_chapters %>%
          group_by(book) %>%
          summarise(chapters = n())
      # A tibble: 6 × 2
                       book chapters
                     <fctr>    <int>
      1 Sense & Sensibility       51
      2   Pride & Prejudice       62
      3      Mansfield Park       49
      4                Emma       56
      5    Northanger Abbey       32
      6          Persuasion       25
  21. Sentiment analysis beyond single words

      > wordcounts <- tidy_books %>%
          group_by(book, chapter) %>%
          summarize(words = n())
      >
      > wordcounts
      Source: local data frame [275 x 3]
      Groups: book [?]

                        book chapter words
                      <fctr>   <int> <int>
      1  Sense & Sensibility       0     7
      2  Sense & Sensibility       1  1571
      3  Sense & Sensibility       2  1970
      4  Sense & Sensibility       3  1538
      5  Sense & Sensibility       4  1952
      6  Sense & Sensibility       5  1030
      7  Sense & Sensibility       6  1353
      8  Sense & Sensibility       7  1288
      9  Sense & Sensibility       8  1256
  22. Sentiment analysis beyond single words

      > # bingnegative is assumed from an earlier step (not shown on this slide):
      > bingnegative <- get_sentiments("bing") %>%
          filter(sentiment == "negative")
      >
      > tidy_books %>%
          semi_join(bingnegative) %>%
          group_by(book, chapter) %>%
          summarize(negativewords = n()) %>%
          left_join(wordcounts, by = c("book", "chapter")) %>%
          mutate(ratio = negativewords/words) %>%
          filter(chapter != 0) %>%
          top_n(1) %>%
          ungroup()
  23. Sentiment analysis beyond single words

      Joining, by = "word"
      Selecting by ratio
      # A tibble: 6 × 5
                       book chapter negativewords words      ratio
                     <fctr>   <int>         <int> <int>      <dbl>
      1 Sense & Sensibility      43           161  3405 0.04728341
      2   Pride & Prejudice      34           111  2104 0.05275665
      3      Mansfield Park      46           173  3685 0.04694708
      4                Emma      15           151  3340 0.04520958
      5    Northanger Abbey      21           149  2982 0.04996647
      6          Persuasion       4            62  1807 0.03431101
  24. What else can you do with sentiment analysis?
      (from David Robinson at http://varianceexplained.org/)
  25. What is a document about?
     • Term frequency measures how frequently a word occurs in a document
     • Inverse document frequency decreases the weight of commonly used words and increases the weight of words that are not used much
  26. What is a document about?
     • tf-idf is a statistic for identifying words that are important (i.e. common) in a text, but not TOO common (see the formulas just below)
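
      For reference (these formulas are not on the original slides, but they
      match what tidytext's bind_tf_idf() computes, using the natural log):

          tf(term, doc)  =  (times term appears in doc) / (total terms in doc)
          idf(term)      =  ln( N / n_term ),  where N is the number of documents
                            and n_term is the number of documents containing the term
          tf-idf         =  tf × idf

      A term that appears in every document gets idf = ln(1) = 0, which is why
      ubiquitous words like "the" and "to" have a tf-idf of 0 a few slides ahead.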
  27. tf-idf

      > book_words <- austen_books() %>%
          unnest_tokens(word, text) %>%
          count(book, word, sort = TRUE) %>%
          ungroup()
      >
      > total_words <- book_words %>%
          group_by(book) %>%
          summarize(total = sum(n))
      >
      > book_words <- left_join(book_words, total_words)
  28. tf-idf

      > book_words
      # A tibble: 40,379 × 4
                      book  word     n  total
                    <fctr> <chr> <int>  <int>
      1     Mansfield Park   the  6206 160460
      2     Mansfield Park    to  5475 160460
      3     Mansfield Park   and  5438 160460
      4               Emma    to  5239 160996
      5               Emma   the  5201 160996
      6               Emma   and  4896 160996
      7     Mansfield Park    of  4778 160460
      8  Pride & Prejudice   the  4331 122204
      9               Emma    of  4291 160996
      10 Pride & Prejudice    to  4162 122204
      # ... with 40,369 more rows
  29. tf-idf

      > book_words <- book_words %>%
          bind_tf_idf(word, book, n)
      > book_words
      # A tibble: 40,379 × 7
                      book  word     n  total         tf   idf tf_idf
                    <fctr> <chr> <int>  <int>      <dbl> <dbl>  <dbl>
      1     Mansfield Park   the  6206 160460 0.03867631     0      0
      2     Mansfield Park    to  5475 160460 0.03412065     0      0
      3     Mansfield Park   and  5438 160460 0.03389007     0      0
      4               Emma    to  5239 160996 0.03254118     0      0
      5               Emma   the  5201 160996 0.03230515     0      0
      6               Emma   and  4896 160996 0.03041069     0      0
      7     Mansfield Park    of  4778 160460 0.02977689     0      0
      8  Pride & Prejudice   the  4331 122204 0.03544074     0      0
      9               Emma    of  4291 160996 0.02665284     0      0
      10 Pride & Prejudice    to  4162 122204 0.03405780     0      0
      # ... with 40,369 more rows
  30. tf-idf

      > book_words %>%
      +   select(-total) %>%
      +   arrange(desc(tf_idf))
      # A tibble: 40,379 × 6
                        book      word     n          tf      idf      tf_idf
                      <fctr>     <chr> <int>       <dbl>    <dbl>       <dbl>
      1  Sense & Sensibility    elinor   623 0.005193528 1.791759 0.009305552
      2  Sense & Sensibility  marianne   492 0.004101470 1.791759 0.007348847
      3       Mansfield Park  crawford   493 0.003072417 1.791759 0.005505032
      4    Pride & Prejudice     darcy   373 0.003052273 1.791759 0.005468939
      5           Persuasion    elliot   254 0.003036207 1.791759 0.005440153
      6                 Emma      emma   786 0.004882109 1.098612 0.005363545
      7     Northanger Abbey    tilney   196 0.002519928 1.791759 0.004515105
      8                 Emma    weston   389 0.002416209 1.791759 0.004329266
      9    Pride & Prejudice    bennet   294 0.002405813 1.791759 0.004310639
      10          Persuasion wentworth   191 0.002283132 1.791759 0.004090824
      # ... with 40,369 more rows
  31. • As part of the NASA Datanauts program, I am working on a project to understand NASA datasets
      • Metadata includes title, description, keywords, etc. (the same tidy tools apply, as sketched below)
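
      A hypothetical sketch (not from the deck): nasa_meta is a stand-in name
      for a data frame with one row per NASA dataset and id and description
      columns; the same tidy tools from the Austen slides apply directly.

      > # `nasa_meta` is hypothetical: one row per dataset, with an `id`
      > # column and a free-text `description` column
      > nasa_desc_words <- nasa_meta %>%
          unnest_tokens(word, description) %>%
          anti_join(stop_words)
      >
      > nasa_desc_words %>%
          count(word, sort = TRUE)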
  32. Taking tidy data to the next level
     • We can identify and handle n-grams using tidy principles (see the sketch after this list)
     • Analysis of n-grams lends itself to looking at networks of words and correlations of words
     • We could also, for example, find negation words that contribute to misidentifying sentiment
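
      A minimal sketch of tidy bigram handling (not from the original deck; it
      reuses the Austen text and the stop_words data from earlier):

      > library(tidyr)
      >
      > # Tokenize into pairs of adjacent words instead of single words
      > austen_bigrams <- austen_books() %>%
          unnest_tokens(bigram, text, token = "ngrams", n = 2)
      >
      > # Split each bigram so stop words can be dropped from either position,
      > # then count the remaining word pairs
      > austen_bigrams %>%
          separate(bigram, c("word1", "word2"), sep = " ") %>%
          filter(!word1 %in% stop_words$word,
                 !word2 %in% stop_words$word) %>%
          count(word1, word2, sort = TRUE)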
  33. Taking tidy data to the next level
     • Text won't be tidy at all stages of an analysis
     • We can convert back and forth from a tidy format to a format like a document-term matrix (sketched below)
     • We can also tidy (i.e. summarize into a tidy data frame) a model's statistical findings, for example from topic modeling
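
      A sketch of that round trip (not from the original deck): cast the tidy
      book_words counts from the tf-idf slides into a document-term matrix, fit
      a topic model on it, then tidy the model's findings back into a data
      frame. The topicmodels package and k = 2 are illustrative choices only.

      > library(topicmodels)
      >
      > # Tidy data frame -> DocumentTermMatrix, one document per book
      > austen_dtm <- book_words %>%
          cast_dtm(book, word, n)
      >
      > # Fit a two-topic LDA model on the non-tidy matrix
      > austen_lda <- LDA(austen_dtm, k = 2, control = list(seed = 1234))
      >
      > # Model -> tidy data frame of per-topic, per-word probabilities
      > tidy(austen_lda, matrix = "beta")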