Text Mining / NLP in R

1 Survey of Text Mining / NLP in R
Ike Okonkwo, Austin R User Group, October 18, 2012

2 Prerequisites / Packages
•  tm, twitteR, wordcloud, RColorBrewer
•  e1071, class

3 Outline
Text / Character Manipulation
•  Text Manipulation
•  Mining Twitter
Mini Project
•  Pre-processing
•  Classification

4 Text / Character Manipulation
•  Text Manipulation
•  Mining Twitter

5 Text Manipulation
> args(grep)
function (pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE,
    fixed = FALSE, useBytes = FALSE, invert = FALSE)
> grep('N.t', c('Mark', 'Nathan', 'Jo', 'Natasha', 'Dave'))
[1] 2 4
grep() : searches for a specified pattern (a regular expression by default) in a vector x of strings and returns the indices of the matching elements
> args(nchar)
function (x, type = "chars", allowNA = FALSE)
> nchar('package')
[1] 7
nchar() : number of characters in string x
> args(gsub)
function (pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
    fixed = FALSE, useBytes = FALSE)
> gsub("rt ", "", 'please rt asap')
[1] "please asap"
sub() and gsub() : perform replacement of the first and all matches, respectively
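The difference between index and value matching, and between sub() and gsub(), is easiest to see on a small vector. The following is a minimal sketch in base R; the `tweets` vector and the variable names are made up for illustration.

```r
# Toy input; not data from the talk.
tweets <- c("please rt asap", "rt rt this", "no retweet here")

# grep() returns indices by default; value = TRUE returns the matching strings.
idx  <- grep("rt ", tweets)
hits <- grep("rt ", tweets, value = TRUE)

# sub() replaces only the first match in each string; gsub() replaces every match.
first_only <- sub("rt ", "", tweets)
all_gone   <- gsub("rt ", "", tweets)

# nchar() is vectorised: one length per element.
lengths <- nchar(tweets)

print(idx)
print(first_only)
print(all_gone)
print(lengths)
```

The third string contains "retweet" but never the literal sequence "rt " followed by a space, so it is left untouched by all three calls.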

6 Text Manipulation
> args(paste)
function (..., sep = " ", collapse = NULL)
> paste('Info', 'Chimps', sep='')
[1] "InfoChimps"
paste() : concatenates several strings together
> args(substr)
function (x, start, stop)
> substr('package', 5, 7)
[1] "age"
substr() : returns the substring of string x in the character range start : stop
> args(strsplit)
function (x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE)
> strsplit('2012-10-18', split='-')
[[1]]
[1] "2012" "10" "18"
strsplit() : splits each string in x into substrings at the matches of split and returns a list with one element per input string
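These three functions are often chained: split a string, pick out pieces, and glue them back together. A small sketch in base R, reusing the date string from the slide; the variable names are illustrative only.

```r
stamp <- "2012-10-18"

# strsplit() returns a list, so take the first element to get a character vector.
# fixed = TRUE treats "-" as a literal string rather than a regular expression.
parts <- strsplit(stamp, split = "-", fixed = TRUE)[[1]]

year      <- as.integer(parts[1])
month_day <- paste(parts[2], parts[3], sep = "/")

# collapse joins the elements of one vector; sep joins separate arguments.
rebuilt <- paste(parts, collapse = "-")

# substr() pulls out a fixed character range without splitting.
prefix <- substr(stamp, 1, 4)

print(month_day)
print(rebuilt)
print(prefix)
```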

7 Definitions
•  Document : an individual unit in a document retrieval system, e.g. a resume would be considered a document in a resume classification system
•  Corpus / Document Collection : the group of documents over which retrieval is performed
•  Bag-of-Words : an unordered collection of words, disregarding grammar and word order
•  n-grams : a contiguous sequence of n items from a given sequence of text, e.g. characters [ for the sequence AGCTTCGA: uni-grams - A, G, C, T, T, C, G, A; bi-grams - AG, GC, CT, TT, TC, CG, GA ] or words [ uni-grams - 'Library', 'Engineering', 'SQL'; bi-grams - 'New York', 'Product Manager', 'Data Analyst' ]
•  Stopwords : words that appear too often and impart little meaning to text, e.g. most prepositions, 'the', 'and', 'or', 'I', 'to'
•  Tokens : any combination of characters (words)
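Tokens, stopwords, n-grams and the bag-of-words view can all be produced with a few lines of base R. The sketch below is illustrative: the sentence, the tiny stopword list and the `ngrams()` helper are assumptions made for the example, not part of the talk or of any package.

```r
text   <- "the data analyst moved to new york"
tokens <- strsplit(tolower(text), " ", fixed = TRUE)[[1]]

# Hypothetical, deliberately tiny stopword list.
stop_words     <- c("the", "to", "and", "or", "i")
content_tokens <- tokens[!tokens %in% stop_words]

# Hypothetical helper: all contiguous n-grams of a vector of items.
ngrams <- function(items, n, sep = " ") {
  if (length(items) < n) return(character(0))
  starts <- seq_len(length(items) - n + 1)
  vapply(starts,
         function(i) paste(items[i:(i + n - 1)], collapse = sep),
         character(1))
}

word_bigrams <- ngrams(tokens, 2)
char_bigrams <- ngrams(strsplit("AGCTTCGA", "")[[1]], 2, sep = "")

# Bag-of-words: counts per token, with word order discarded.
bag <- table(tokens)

print(content_tokens)
print(word_bigrams)
print(char_bigrams)
```

A sequence of m items yields m - n + 1 contiguous n-grams, so the seven-word sentence gives six word bi-grams and the eight-character sequence gives seven character bi-grams.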

8 Definitions continued
•  Stemming : a heuristic process that removes derived word affixes, e.g. 'organize', 'organizer', 'organizes', 'organized' reduced to 'organize'
•  Lemmatization : uses morphological analysis of words to return them to their base or dictionary form, e.g. 'am', 'are', 'is' reduced to 'be'
•  TF-IDF : term frequency-inverse document frequency is a statistic that tells us how important a word is to a document in a given corpus. Detects high-information words
•  tf-idf = tf x log(N/df), where tf is the term's frequency in the document, N is the number of documents in the corpus and df is the number of documents that contain the term
•  Term Document Matrix / Vector Space Matrix : representation of a document collection as vectors
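The tf-idf formula can be applied directly to a term-document matrix. Below is a minimal base R sketch on a made-up three-term, three-document matrix; the terms and counts are invented for illustration, and N and df are taken as the number of documents and the number of documents containing each term.

```r
# Toy term-document matrix: rows are terms, columns are documents.
tdm <- matrix(c(2, 0, 1,
                1, 1, 1,
                0, 3, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("nurse", "the", "sql"),
                              c("d1", "d2", "d3")))

N   <- ncol(tdm)          # number of documents
df  <- rowSums(tdm > 0)   # documents containing each term
idf <- log(N / df)

# idf has one entry per row, so it is recycled down each column
# and every row of tdm is scaled by its own idf.
tfidf <- tdm * idf

print(round(tfidf, 3))
```

A term that occurs in every document ("the" here) gets idf = log(1) = 0 and drops out, while a term concentrated in one document ("sql") gets the largest weight.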

9 Mining Twitter
> library(twitteR)
> library(tm)
> library(wordcloud)
> library(RColorBrewer)
> rstats_tweets = searchTwitter("rstats", n=1500, lang="en")
> rstats_text = sapply(rstats_tweets, function(x) x$getText())
> # drop non-ASCII characters such as emoticons (without sub, such strings become NA)
> rstats_text = iconv(rstats_text, 'UTF-8', 'ASCII', sub = "")
> rstats_corpus = Corpus(VectorSource(rstats_text)) # create a corpus
> # create a term-document matrix, applying some transformations
> term_doc_matrix <- TermDocumentMatrix(rstats_corpus,
+     control = list(removePunctuation = TRUE,
+         stopwords = c("rstats", "http", stopwords("english")),
+         removeNumbers = TRUE,
+         tolower = TRUE))
> head(term_doc_matrix)
Non-/sparse entries: 21/3675
Sparsity: 99%
Maximal term length: 54
Weighting: term frequency (tf)
> term_doc_matrix <- as.matrix(term_doc_matrix)
> # get word counts in decreasing order
> word_freqs = sort(rowSums(term_doc_matrix), decreasing=TRUE)
> # create a data frame with words and their frequencies
> dm = data.frame(word=names(word_freqs), freq=word_freqs)
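The pipeline above needs Twitter access and the tm package. To see what the control options do, the same steps can be imitated offline in base R on a few made-up strings. This is only a sketch of the idea, not how tm implements its transformations; the documents and the stopword list are assumptions for illustration.

```r
# Made-up stand-ins for tweet text.
docs <- c("Loving #rstats 2012!",
          "rstats: the tm package, the basics",
          "Plotting the tm corpus in R")

clean <- tolower(docs)                    # tolower = TRUE
clean <- gsub("[[:punct:]]", "", clean)   # removePunctuation = TRUE
clean <- gsub("[[:digit:]]", "", clean)   # removeNumbers = TRUE

tokens <- strsplit(trimws(clean), "[[:space:]]+")

# Hypothetical stopword list, standing in for c("rstats", "http", stopwords("english")).
drop   <- c("rstats", "http", "the", "in")
tokens <- lapply(tokens, function(tk) tk[!tk %in% drop])

# Term-document matrix: one row per term, one column per document.
terms <- sort(unique(unlist(tokens)))
tdm   <- sapply(tokens, function(tk) as.vector(table(factor(tk, levels = terms))))
rownames(tdm) <- terms
colnames(tdm) <- paste0("doc", seq_along(docs))

# Word counts in decreasing order, as in the slide.
word_freqs <- sort(rowSums(tdm), decreasing = TRUE)
dm <- data.frame(word = names(word_freqs), freq = word_freqs)

print(tdm)
print(head(dm, 3))
```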

10 Mining Twitter
> wordcloud(dm$word, dm$freq, random.order=FALSE, colors=brewer.pal(8, "Dark2"))

11 Mini Project - Classifying Text
•  Pre-processing : vector space matrix
•  Classification : Naive Bayes (NB), k-Nearest Neighbour (k-NN)

12 Mini Project - Classifying Text
•  Craigslist job descriptions across three categories : health/medicine (hea), nonprofit sector (npo) and software (sof) … circa 2011

13 Naïve Bayes (NB)
Pr(A|B) = Pr(B|A) x Pr(A) / Pr(B)
Pr(Category|Document) = Pr(Document|Category) x Pr(Category) / Pr(Document)
                      ∝ Pr(Document|Category) x Pr(Category)
Pr(Document) is the same for every category, so it can be dropped when comparing categories.
•  Classify new data by calculating the probability of an observation belonging to a particular class
•  Choose the class with the highest probability
•  Consider each feature (word) to be equally important and conditionally independent of the others given the class
> library(e1071) # provides naiveBayes()
> model <- naiveBayes(response_var~., data=train.data)
> prediction <- predict(model, test.data[,-8757]) # column 8757 holds the response
> result <- table(prediction, test.data[,8757])
> result
prediction hea npo sof
       hea  73  26  23
       npo   0  12   3
       sof   2   4  32
> misclass <- (1 - sum(diag(result)) / nrow(test.data)) * 100
> cat(misclass, '%')
33.1428%
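To make the proportionality concrete, here is a from-scratch sketch of a multinomial Naive Bayes classifier with add-one smoothing in base R. It is a swapped-in textbook variant on made-up word counts, meant only to illustrate the "prior times likelihood, pick the largest" rule; it is not what naiveBayes() in e1071 computes internally, and all names and data are assumptions for the example.

```r
# Toy training data: rows are documents, columns are word counts.
train <- matrix(c(3, 0, 1,
                  2, 1, 0,
                  0, 3, 1,
                  0, 2, 2),
                nrow = 4, byrow = TRUE,
                dimnames = list(NULL, c("patient", "code", "team")))
labels <- c("hea", "hea", "sof", "sof")

classes   <- sort(unique(labels))
log_prior <- log(as.vector(table(labels)[classes]) / length(labels))

# Total count of each word per class, then smoothed word probabilities.
word_counts <- rowsum(train, group = labels)[classes, , drop = FALSE]
log_lik     <- log((word_counts + 1) / (rowSums(word_counts) + ncol(train)))

# Score = log Pr(Category) + sum over words of count * log Pr(word | Category).
# Pr(Document) is omitted because it is identical for every category.
classify <- function(doc) {
  scores <- log_prior + as.vector(log_lik %*% doc)
  classes[which.max(scores)]
}

print(classify(c(patient = 2, code = 0, team = 1)))
print(classify(c(patient = 0, code = 3, team = 1)))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together, which matters for vocabularies of several thousand terms.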

14 K-Nearest Neighbor (k-NN)
•  Classify new data by comparing each observation with the known data and picking the k nearest neighbors
•  The response variable is then chosen from the nearest neighbors by majority vote
•  Works with both numeric and nominal values
•  Distance measures : Euclidean, City Block, Hamming Distance
> library(class) # provides knn()
> train_input.knn <- as.matrix(train.data[,-8757])
> train_output.knn <- as.vector(train.data[,8757])
> test_input.knn <- as.matrix(test.data[,-8757])
> prediction <- knn(train_input.knn, test_input.knn, train_output.knn, k=5)
> result <- table(prediction, test.data$response_var)
> result
prediction hea npo sof
       hea  70  27  28
       npo   1  13   0
       sof   2   7  27
> misclass <- (1 - sum(diag(result)) / nrow(test.data)) * 100
> cat(misclass, '%')
37.14286%
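The two steps described above, finding the k nearest training points and taking a majority vote, fit in a few lines of base R. The sketch below uses Euclidean distance on made-up two-dimensional points; `knn_one()` is a hypothetical helper written for this example and is not the knn() function from the class package.

```r
# Toy training data: two well-separated groups.
train_x <- matrix(c(1, 1,
                    1, 2,
                    2, 1,
                    6, 6,
                    7, 6,
                    6, 7),
                  ncol = 2, byrow = TRUE)
train_y <- c("hea", "hea", "hea", "sof", "sof", "sof")

# Hypothetical helper: classify a single point x.
knn_one <- function(train_x, train_y, x, k = 3) {
  # Euclidean distance from x to every training row.
  d <- sqrt(rowSums(sweep(train_x, 2, x)^2))
  nearest <- order(d)[seq_len(k)]
  # Majority vote; on a tie, which.max() takes the first label alphabetically.
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

print(knn_one(train_x, train_y, c(2, 2)))
print(knn_one(train_x, train_y, c(6, 5)))
```

Swapping the distance line for `rowSums(abs(sweep(train_x, 2, x)))` would give the City Block measure instead of the Euclidean one.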

15 References
•  Introduction to Information Retrieval by Manning, Raghavan, Schütze
•  The Art of R Programming by Matloff
•  http://www.horicky.blogspot.com
•  http://www.sites.google.com/site/miningtwitter/home
•  http://www.craigslist.com [data]
•  http://www.wikipedia.org
•  http://cran.r-project.org/web/views/NaturalLanguageProcessing.html
