sep = " ", collapse = NULL) ! > paste('Info', 'Chimps', sep='')! > [1] "InfoChimps" ! paste() : concatenate several strings together > args(substr)! > function (x, start, stop)! > substr('package',5,7)! > [1] "age"! substr() : returns the substring in the given character range start : stop for the given string x > args(strsplit)! > function (x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE)! > strsplit('2012-10-18', split='-')! > [[1]]! > [1] "2012" "10" "18"! strsplit() : splits a string into a list of substrings based on another string split in x
• Document : the unit of retrieval in a document retrieval system, e.g. a resume would be considered a document in a resume classification system
• Corpus / Document Collection : the group of documents over which retrieval is performed
• Bag-of-Words : unordered collection of words, disregarding grammar and word order
• n-grams : contiguous sequence of n items from a given sequence of text, e.g. character [ uni-grams - A,G,C,T,T,C,G,A ; bi-grams - AG,CT,TC,GA ] or word [ uni-grams - 'Library', 'Engineering', 'SQL' ; bi-grams - 'New York', 'Product Manager', 'Data Analyst' ] (see the sketch after this list)
• Stopwords : words that appear too often and contribute little meaning to text, e.g. most prepositions, 'the', 'and', 'or', 'I', 'to'
• Tokens : any combination of characters (words)
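A minimal sketch of word n-gram generation in base R (the ngrams() helper below is hypothetical, written for illustration; it is not from the slides):

# Build word n-grams from a text string
ngrams <- function(text, n = 2) {
  words <- unlist(strsplit(tolower(text), "\\s+"))   # tokenize on whitespace
  if (length(words) < n) return(character(0))
  sapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "))
}
ngrams("New York Product Manager", 2)
# [1] "new york"         "york product"     "product manager"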
• Stemming : heuristic process that removes derived word affixes, e.g. 'organize', 'organizer', 'organizes', 'organized' reduced to 'organize'
• Lemmatization : uses morphological analysis of words to return them to their base or dictionary form, e.g. 'am', 'are', 'is' reduced to 'be'
• TF-IDF : term frequency-inverse document frequency is a statistic that tells us how important a word is in a given corpus. Detects high-information words (a worked sketch follows below)
• tf-idf = tf x log(N/df)
• Term Document Matrix / Vector Space Matrix : representation of a document collection as vectors
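A worked sketch of the tf x log(N/df) weighting in base R (the toy term-document matrix below is invented for illustration):

# Rows are terms, columns are documents; counts are term frequencies
tdm <- matrix(c(3, 0, 1,
                0, 2, 2,
                5, 5, 5),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("sql", "york", "the"), paste0("doc", 1:3)))
N  <- ncol(tdm)          # number of documents in the corpus
df <- rowSums(tdm > 0)   # document frequency: in how many documents each term appears
tfidf <- tdm * log(N / df)
round(tfidf, 2)
# "the" appears in every document, so log(3/3) = 0 zeroes out its weight,
# while "sql" and "york" keep high scores: tf-idf detects high-information words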
• Bayes' theorem : Pr(A|B) = Pr(B|A) x Pr(A) / Pr(B)
• Pr(Category|Document) = Pr(Document|Category) x Pr(Category) / Pr(Document) ∝ Pr(Document|Category) x Pr(Category), since Pr(Document) is the same for every category and can be dropped when comparing classes
• Classify new data by calculating the probability of an observation belonging to each class
• Choose the class with the highest probability
• Consider each feature (word) to be equally important

> model <- naiveBayes(response_var ~ ., data = train.data)
> prediction <- predict(model, test.data[, -8757])
> result <- table(prediction, test.data[, 8757])
> result
prediction hea npo sof
       hea  73  26  23
       npo   0  12   3
       sof   2   4  32

> misclass <- (1 - sum(diag(result)) / nrow(test.data)) * 100
> cat(misclass, '%')
33.1428 %
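The slides' train.data/test.data objects (with the class label in column 8757) are not reproduced here. A self-contained sketch of the same workflow, assuming the e1071 package (which provides naiveBayes()) and using the built-in iris data as a stand-in:

# Naive Bayes on iris as a stand-in corpus (illustrative only)
library(e1071)

set.seed(42)
idx   <- sample(nrow(iris), 100)   # random train/test split
train <- iris[idx, ]
test  <- iris[-idx, ]

model      <- naiveBayes(Species ~ ., data = train)
prediction <- predict(model, test[, -5])          # drop the response column
result     <- table(prediction, test$Species)     # confusion matrix
misclass   <- (1 - sum(diag(result)) / nrow(test)) * 100
cat(misclass, '%')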
• Classify new data by comparing each observation with the known data and then picking the k nearest neighbors
• Eventual response variable picked from the nearest neighbors by majority vote
• Works with both numeric and nominal values
• Distance measures : Euclidean, City Block, Hamming Distance

> train_input.knn <- as.matrix(train.data[, -8757])
> train_output.knn <- as.vector(train.data[, 8757])
> test_input.knn <- as.matrix(test.data[, -8757])
> prediction <- knn(train_input.knn, test_input.knn, train_output.knn, k=5)
> result <- table(prediction, test.data$response_var)
> result
prediction hea npo sof
       hea  70  27  28
       npo   1  13   0
       sof   2   7  27

> misclass <- (1 - sum(diag(result)) / nrow(test.data)) * 100
> cat(misclass, '%')
37.14286 %
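As with Naive Bayes, the slides' data is not reproduced here. A self-contained sketch of the same kNN workflow, assuming the class package (which provides knn()) and the built-in iris data as a stand-in:

# k-nearest neighbors on iris as a stand-in corpus (illustrative only)
library(class)

set.seed(42)
idx <- sample(nrow(iris), 100)            # random train/test split
train_input  <- as.matrix(iris[idx, -5])  # numeric features only
train_output <- iris$Species[idx]         # class labels for training rows
test_input   <- as.matrix(iris[-idx, -5])

prediction <- knn(train_input, test_input, train_output, k = 5)
result     <- table(prediction, iris$Species[-idx])   # confusion matrix
misclass   <- (1 - sum(diag(result)) / nrow(test_input)) * 100
cat(misclass, '%')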
• Introduction to Information Retrieval by Manning, Raghavan, Schütze
• The Art of R Programming by Matloff
• http://www.horicky.blogspot.com
• http://www.sites.google.com/site/miningtwitter/home
• http://www.craigslist.com [data]
• http://www.wikipedia.org
• http://cran.r-project.org/web/views/NaturalLanguageProcessing.html