Text Mining / NLP in R

Ike Okonkwo
November 01, 2012

October Austin R Meetup Slides

Transcript

  1. Survey of Text Mining / NLP in R

     Ike Okonkwo
     Austin R User Group
     October 18, 2012
  2. Prerequisites / Packages
     • tm, twitteR, wordcloud, RColorBrewer
     • e1071, class
  3. Outline

     Text / Character Manipulation
     • Text Manipulation
     • Mining Twitter

     Mini Project
     • Pre-processing
     • Classification
  4. Text Manipulation

         > args(grep)
         function (pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE,
             fixed = FALSE, useBytes = FALSE, invert = FALSE)
         > grep('N.t', c('Mark', 'Nathan', 'Jo', 'Natasha', 'Dave'))
         [1] 2 4

     grep(): searches a vector x of strings for a pattern and returns the indices of the elements that match

         > args(nchar)
         function (x, type = "chars", allowNA = FALSE)
         > nchar('package')
         [1] 7

     nchar(): length of the string x

         > args(gsub)
         function (pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
             fixed = FALSE, useBytes = FALSE)
         > gsub("rt ", "", 'please rt asap')
         [1] "please asap"

     gsub(): replaces all matches of pattern in x (sub() replaces only the first match)
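
A few variations on the calls above may help when reading them. This is a minimal sketch in base R that reuses the names from the slide; the "banana" string is made up for illustration.

```r
# Names reused from the slide; no packages needed.
people <- c("Mark", "Nathan", "Jo", "Natasha", "Dave")

# grep() returns indices by default; value = TRUE returns the matching strings.
idx     <- grep("N.t", people)
matched <- grep("N.t", people, value = TRUE)

# fixed = TRUE treats the pattern as a literal string, so "." is no longer a wildcard.
literal <- grep("N.t", people, fixed = TRUE)

# nchar() is vectorised: it returns one length per element.
name_lengths <- nchar(people)

# sub() replaces only the first match; gsub() replaces every match.
first_only <- sub("a", "A", "banana")
all_hits   <- gsub("a", "A", "banana")

print(matched)
print(name_lengths)
print(c(first_only, all_hits))
```

Setting fixed = TRUE is worth considering whenever the pattern comes from user input or contains punctuation that would otherwise be read as regular-expression syntax.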
  5. Text Manipulation

         > args(paste)
         function (..., sep = " ", collapse = NULL)
         > paste('Info', 'Chimps', sep='')
         [1] "InfoChimps"

     paste(): concatenates several strings together

         > args(substr)
         function (x, start, stop)
         > substr('package', 5, 7)
         [1] "age"

     substr(): returns the substring of x in the character range start:stop

         > args(strsplit)
         function (x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE)
         > strsplit('2012-10-18', split='-')
         [[1]]
         [1] "2012" "10" "18"

     strsplit(): splits each string in x into substrings at the matches of split and returns them as a list
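
The three functions above are often combined. The sketch below, in base R only, takes the date string from the slide apart and reassembles it; the variable names are illustrative and the "a.b.c" string is made up.

```r
stamp <- "2012-10-18"

# strsplit() returns a list with one element per input string; take the first.
parts <- strsplit(stamp, split = "-")[[1]]

# substr() extracts by character position.
year <- substr(stamp, 1, 4)

# sep goes between the arguments of paste(); collapse joins the elements of one vector.
us_style <- paste(parts[2], parts[3], parts[1], sep = "/")
rejoined <- paste(parts, collapse = "-")

# split is a regular expression by default, so use fixed = TRUE to split on a literal ".".
dotted <- strsplit("a.b.c", ".", fixed = TRUE)[[1]]

print(us_style)
print(rejoined)
print(dotted)
```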
  6. Definitions
     • Document: an individual unit in a document retrieval system, e.g. a resume would be considered a document in a resume classification system
     • Corpus / Document Collection: the group of documents over which retrieval is performed
     • Bag-of-Words: an unordered collection of words, disregarding grammar and word order
     • n-grams: a contiguous sequence of n items from a given sequence of text, e.g. characters [uni-grams: A, G, C, T, T, C, G, A; bi-grams: AG, GC, CT, TT, TC, CG, GA] or words [uni-grams: 'Library', 'Engineering', 'SQL'; bi-grams: 'New York', 'Product Manager', 'Data Analyst']
     • Stopwords: words that appear too often and impart little meaning to text, e.g. most prepositions, 'the', 'and', 'or', 'I', 'to'
     • Tokens: a sequence of characters treated as a single unit, typically a word
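
To make tokens, stopwords, n-grams and bag-of-words concrete, here is a small base R sketch. The `ngrams` helper, the sample sentence and the five-word stopword list are all hypothetical and exist only for illustration; real stopword lists, such as the one shipped with tm, are much longer.

```r
# Hypothetical helper: contiguous n-grams over a vector of items.
ngrams <- function(items, n) {
  if (length(items) < n) return(character(0))
  starts <- seq_len(length(items) - n + 1)
  vapply(starts,
         function(i) paste(items[i:(i + n - 1)], collapse = " "),
         character(1))
}

text   <- "The data analyst moved to New York"
tokens <- strsplit(tolower(text), " +")[[1]]

# Tiny illustrative stopword list (an assumption, not a standard list).
stopwords_small <- c("the", "and", "or", "i", "to")
content_tokens  <- tokens[!tokens %in% stopwords_small]

# Word bi-grams over the remaining tokens.
word_bigrams <- ngrams(content_tokens, 2)

# Character bi-grams: a sliding window over AGCTTCGA gives seven overlapping pairs.
char_bigrams <- gsub(" ", "", ngrams(strsplit("AGCTTCGA", "")[[1]], 2))

# Bag-of-words: counts per token, with word order discarded.
bag_of_words <- table(tokens)

print(word_bigrams)
print(char_bigrams)
print(bag_of_words)
```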
  7. Definitions continued
     • Stemming: a heuristic process that removes derived word affixes, e.g. 'organize', 'organizer', 'organizes', 'organized' reduced to 'organize'
     • Lemmatization: uses morphological analysis of words to return them to their base or dictionary form, e.g. 'am', 'are', 'is' reduced to 'be'
     • TF-IDF: term frequency-inverse document frequency is a statistic that tells us how important a word is to a document in a given corpus. Detects high-information words
     • tf-idf = tf × log(N / df), where tf is the number of times the term occurs in the document, N is the number of documents in the corpus and df is the number of documents that contain the term
     • Term Document Matrix / Vector Space Matrix: representation of a document collection as vectors of term counts or weights, one vector per document
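
The tf-idf formula can be worked through by hand on a tiny corpus. The sketch below uses base R only; the three documents are made up, and the natural logarithm is an assumption, since the slide does not state a base.

```r
# Three made-up documents.
docs <- c(d1 = "sql data analyst",
          d2 = "data engineer data",
          d3 = "nurse health data")

tokens <- strsplit(docs, " ", fixed = TRUE)
names(tokens) <- names(docs)
vocab <- sort(unique(unlist(tokens)))

# Term-document matrix of raw counts: rows are terms, columns are documents.
tf <- vapply(tokens,
             function(tk) as.numeric(table(factor(tk, levels = vocab))),
             numeric(length(vocab)))
rownames(tf) <- vocab

N   <- length(docs)      # number of documents
df  <- rowSums(tf > 0)   # number of documents containing each term
idf <- log(N / df)       # natural log (an assumption)

# Each row (term) of tf is scaled by that term's idf.
tfidf <- tf * idf

print(round(tfidf, 3))
```

In this toy corpus "data" appears in every document, so its idf is log(3/3) = 0 and its weight vanishes, while terms confined to one document keep a weight of log(3) per occurrence.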
  8. Mining Twitter

         > library(twitteR)
         > library(tm)
         > library(wordcloud)
         > library(RColorBrewer)
         > rstats_tweets = searchTwitter("rstats", n=1500, lang="en")
         > rstats_text = sapply(rstats_tweets, function(x) x$getText())
         > rstats_text = iconv(rstats_text, 'UTF-8', 'ASCII') # remove emoticons
         > rstats_corpus = Corpus(VectorSource(rstats_text)) # create a corpus
         > # create term-document matrix applying some transformations
         > term_doc_matrix <- TermDocumentMatrix(rstats_corpus,
         +     control = list(removePunctuation = TRUE,
         +                    stopwords = c("rstats", "http", stopwords("english")),
         +                    removeNumbers = TRUE,
         +                    tolower = TRUE))
         > head(term_doc_matrix)
         Non-/sparse entries: 21/3675
         Sparsity: 99%
         Maximal term length: 54
         Weighting: term frequency (tf)
         > term_doc_matrix <- as.matrix(term_doc_matrix)
         > # get word counts in decreasing order
         > word_freqs = sort(rowSums(term_doc_matrix), decreasing=TRUE)
         > # create a data frame with words and their frequencies
         > dm = data.frame(word=names(word_freqs), freq=word_freqs)
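
The counting steps at the end of the slide can be tried without Twitter access. The sketch below is a base R stand-in, not tm's implementation: the three "tweets" and the short list of dropped words are made up, and the matrix is built by hand. The final plotting call runs only if wordcloud and RColorBrewer happen to be installed.

```r
# Made-up tweets standing in for the searchTwitter() results.
texts <- c("Loving #rstats for text mining",
           "rstats meetup tonight: text mining in R!",
           "New tm release, great for text")

# Rough equivalents of removePunctuation and tolower.
clean  <- tolower(gsub("[[:punct:]]", "", texts))
tokens <- strsplit(clean, " +")

# Illustrative words to drop (an assumption, much shorter than stopwords("english")).
drop  <- c("rstats", "for", "in")
vocab <- setdiff(sort(unique(unlist(tokens))), drop)

# Hand-built term-document matrix: rows are terms, columns are documents.
tdm <- vapply(tokens,
              function(tk) as.numeric(table(factor(tk, levels = vocab))),
              numeric(length(vocab)))
rownames(tdm) <- vocab

# Same two steps as on the slide: word counts in decreasing order, then a data frame.
word_freqs <- sort(rowSums(tdm), decreasing = TRUE)
dm <- data.frame(word = names(word_freqs), freq = word_freqs,
                 stringsAsFactors = FALSE)
print(head(dm, 3))

# Optional: draw a word cloud when both packages are available.
if (requireNamespace("wordcloud", quietly = TRUE) &&
    requireNamespace("RColorBrewer", quietly = TRUE)) {
  wordcloud::wordcloud(dm$word, dm$freq, min.freq = 1, random.order = FALSE,
                       colors = RColorBrewer::brewer.pal(8, "Dark2"))
}
```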
  9. Mini Project - Classifying Text
     • Pre-processing: vector space matrix
     • Classification: Naive Bayes (NB), k-Nearest Neighbour (k-NN)
  10. Mini Project - Classifying Text
      • Craigslist job descriptions across three categories: health/medicine (hea), nonprofit sector (npo) and software (sof), circa 2011
  11. Naïve Bayes (NB)

      Pr(A|B) = Pr(B|A) × Pr(A) / Pr(B)
      Pr(Category|Document) = Pr(Document|Category) × Pr(Category) / Pr(Document)
                            ∝ Pr(Document|Category) × Pr(Category)

      Pr(Document) is the same for every category, so it can be dropped when comparing categories.

      • Classify new data by calculating the probability of an observation belonging to a particular class
      • Choose the class with the highest probability
      • Consider each feature (word) to be equally important and conditionally independent of the others given the class

          > # column 8757 of the data holds the response variable
          > model <- naiveBayes(response_var~., data=train.data)
          > prediction <- predict(model, test.data[,-8757])
          > result <- table(prediction, test.data[,8757])
          > result
          prediction hea npo sof
                 hea  73  26  23
                 npo   0  12   3
                 sof   2   4  32
          > misclass <- (1 - sum(diag(result)) / nrow(test.data)) * 100
          > cat(misclass, '%')
          33.1428%
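
To show how Pr(Document|Category) × Pr(Category) turns into a decision, here is a from-scratch sketch in base R. It is a multinomial variant with add-one smoothing over word counts, chosen for readability, and it may differ from what naiveBayes() in e1071 computes internally. The training counts, labels and the `classify` helper are made up for illustration.

```r
# Toy training data (made up): word counts per document, one row per document.
train_counts <- rbind(
  c(nurse = 2, patient = 1, code = 0, server = 0),
  c(nurse = 1, patient = 2, code = 0, server = 0),
  c(nurse = 0, patient = 0, code = 2, server = 1),
  c(nurse = 0, patient = 0, code = 1, server = 2)
)
train_labels <- c("hea", "hea", "sof", "sof")
classes <- sort(unique(train_labels))

# log Pr(Category): share of training documents in each class.
log_prior <- sapply(classes, function(cl) log(mean(train_labels == cl)))

# log Pr(word | Category) with add-one smoothing, so unseen words never give log(0).
log_lik <- sapply(classes, function(cl) {
  totals <- colSums(train_counts[train_labels == cl, , drop = FALSE])
  log((totals + 1) / (sum(totals) + length(totals)))
})

# Hypothetical helper: score each class in log space and pick the largest.
classify <- function(counts) {
  scores <- log_prior + as.vector(counts %*% log_lik)
  names(which.max(scores))
}

print(classify(c(nurse = 1, patient = 0, code = 0, server = 0)))
print(classify(c(nurse = 0, patient = 0, code = 3, server = 1)))
```

Working with sums of logarithms instead of products of probabilities avoids numeric underflow, which matters once a document has thousands of word features.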
  12. K-Nearest Neighbor (k-NN)
      • Classify new data by comparing each observation with the known data and then picking the k nearest neighbors
      • The response variable is picked from the nearest neighbors by majority vote
      • Works with both numeric and nominal values
      • Distance measures: Euclidean, City Block, Hamming Distance

          > train_input.knn <- as.matrix(train.data[,-8757])
          > train_output.knn <- as.vector(train.data[,8757])
          > test_input.knn <- as.matrix(test.data[,-8757])
          > prediction <- knn(train_input.knn, test_input.knn, train_output.knn, k=5)
          > result <- table(prediction, test.data$response_var)
          > result
          prediction hea npo sof
                 hea  70  27  28
                 npo   1  13   0
                 sof   2   7  27
          > misclass <- (1 - sum(diag(result)) / nrow(test.data)) * 100
          > cat(misclass, '%')
          37.14286%
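
The majority-vote rule and the misclassification formula can both be checked in a few lines of base R. In this sketch `knn_predict` is a hypothetical helper using Euclidean distance (it is not the knn() function from the class package), the training counts are made up, and the confusion matrix is copied from the k-NN slide.

```r
# Toy training data (made up): word counts per document, one row per document.
train_x <- rbind(
  c(nurse = 2, patient = 1, code = 0, server = 0),
  c(nurse = 1, patient = 2, code = 0, server = 0),
  c(nurse = 0, patient = 0, code = 2, server = 1),
  c(nurse = 0, patient = 0, code = 1, server = 2)
)
train_y <- c("hea", "hea", "sof", "sof")

# Hypothetical helper: classify one observation by majority vote of its k nearest rows.
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  # Euclidean distance from new_x to every training row.
  dists   <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  nearest <- order(dists)[seq_len(k)]
  votes   <- table(train_y[nearest])
  names(votes)[which.max(votes)]   # ties go to the first class in sorted order
}

print(knn_predict(train_x, train_y, c(1, 0, 0, 0), k = 3))

# Confusion matrix from the k-NN slide: rows are predictions, columns are actual classes.
knn_result <- matrix(c(70, 27, 28,
                        1, 13,  0,
                        2,  7, 27),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(prediction = c("hea", "npo", "sof"),
                                     actual     = c("hea", "npo", "sof")))

# Correct predictions sit on the diagonal; everything else is a misclassification.
misclass <- (1 - sum(diag(knn_result)) / sum(knn_result)) * 100
print(misclass)
```

The matrix entries sum to 175, so sum(knn_result) plays the role of nrow(test.data) in the slide's formula.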
  13. References
      • Introduction to Information Retrieval by Manning, Raghavan, Schütze
      • The Art of R Programming by Matloff
      • http://www.horicky.blogspot.com
      • http://www.sites.google.com/site/miningtwitter/home
      • http://www.craigslist.com [data]
      • http://www.wikipedia.org
      • http://cran.r-project.org/web/views/NaturalLanguageProcessing.html