Logistic regression is a classification model.
•It models the relationship between a dependent variable and one or more independent variables, and allows us to look at the fit of the model as well as at the significance of the relationships.
•Logistic regression estimates the probability of an event occurring.
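A minimal sketch of fitting such a model in R with glm; the data frame spam_train, the test set spam_test and the response column type are placeholder names assumed here, not given in the slides:

    # Logistic regression: estimate P(spam) from the word-frequency predictors
    fit <- glm(type ~ ., data = spam_train, family = binomial(link = "logit"))
    summary(fit)     # coefficients and significance of the relationships
    # estimated probabilities for new e-mails
    p_hat <- predict(fit, newdata = spam_test, type = "response")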
Random forests build on Breiman's bagging of trees.
•In 1999 Leo Breiman published a complete paper on random forests.
•Adele Cutler and Breiman developed algorithms for random forests: http://www.stat.berkeley.edu/~breiman/RandomForests/
Random forests are an ensemble method for classification (and regression) that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes output by the individual trees.
•Each tree is grown at least partially at random.
•Randomness is injected by growing each tree on a different random subsample of the training data.
•Randomness is also injected into the split selection process, so that the splitter at any node is determined partly at random.
•Resistant to overtraining (overfitting) – generalizes well to new data.
•Trains rapidly, even with thousands of potential predictors.
•No need for prior feature (variable) selection.
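A minimal sketch using the randomForest package, which implements Breiman and Cutler's algorithm (the package itself is not named in the slides; the data objects are the same placeholders as above):

    library(randomForest)
    # each tree is grown on a random subsample (bootstrap sample) of the training data;
    # at each node only 'mtry' randomly chosen predictors compete for the split
    rf <- randomForest(type ~ ., data = spam_train,
                       ntree = 500, importance = TRUE)
    print(rf)            # out-of-bag error estimate
    varImpPlot(rf)       # variable importance, no prior feature selection needed
    rf_pred <- predict(rf, newdata = spam_test)   # majority vote across the trees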
•Response: discrete/categorical
•Predictors: frequency of words (continuous)
Solutions:
•Logistic regression
•Trees (CART, random forests)
•Support vector machines
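One concrete data set with exactly this structure is the spam data shipped with the kernlab package (an assumption – the slides do not say which spam data were used); the split into spam_train and spam_test below is likewise only illustrative:

    library(kernlab)
    data(spam)            # 4601 e-mails: 57 word/character frequencies + factor 'type'
    str(spam$type)        # categorical response: "nonspam" / "spam"
    # hold out a test set for the comparisons that follow
    set.seed(1)
    idx        <- sample(nrow(spam), round(0.7 * nrow(spam)))
    spam_train <- spam[idx, ]
    spam_test  <- spam[-idx, ]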
Assessing fit: the model does not directly predict the binary outcome.
•The model gives estimated probabilities.
•R² tries to fit a straight line.
Solution: compare observed and expected numbers.
Other tests:
•Hosmer & Lemeshow: hoslem.test {ResourceSelection}
•ROC / AUC / C-statistic
http://cran.r-project.org/web/packages/ResourceSelection/ResourceSelection.pdf
http://cran.r-project.org/web/packages/hmeasure/vignettes/hmeasure.pdf
http://personalpages.manchester.ac.uk/staff/mark.lunt/stats/7_Binary/text.pdf
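A hedged sketch of these checks in R, using hoslem.test from ResourceSelection and HMeasure from the hmeasure package referenced above; fit, spam_train and spam_test are the placeholder objects from the earlier sketches, and the 0/1 coding of the response is an assumption:

    library(ResourceSelection)
    library(hmeasure)
    obs  <- as.numeric(spam_train$type == "spam")   # observed 0/1 outcome
    phat <- fitted(fit)                             # estimated probabilities
    # Hosmer & Lemeshow: compare observed and expected counts in g groups
    hoslem.test(obs, phat, g = 10)
    # ROC-based summaries (AUC, H-measure) on the held-out test data
    scores <- predict(fit, newdata = spam_test, type = "response")
    HMeasure(as.numeric(spam_test$type == "spam"), scores)$metrics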
Confusion matrices for the three classifiers on the test data (rows = prediction, columns = actual class), with the per-class misclassification rates below each matrix:

Prediction   Not Spam   Spam
Not Spam        298       66
Spam             13      123
Misclassification rate:  6% (Not Spam), 34% (Spam)

Prediction   Not Spam   Spam
Not Spam        294       52
Spam             17      137
Misclassification rate:  5% (Not Spam), 27% (Spam)

Prediction   Not Spam   Spam
Not Spam        285       42
Spam             26      147
Misclassification rate:  8% (Not Spam), 22% (Spam)
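Tables of this form can be produced in R along the following lines; the logistic model fit, the 0.5 cut-off and the object names are assumptions made for illustration:

    # class predictions from estimated probabilities, cut at 0.5
    p_test     <- predict(fit, newdata = spam_test, type = "response")
    pred_class <- factor(ifelse(p_test > 0.5, "spam", "nonspam"),
                         levels = levels(spam_test$type))
    tab <- table(Prediction = pred_class, Actual = spam_test$type)
    tab
    1 - diag(tab) / colSums(tab)    # per-class (column-wise) misclassification rate
    1 - sum(diag(tab)) / sum(tab)   # overall misclassification rate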
Tree-based methods combine ideas from supervised and unsupervised learning:
•Fitting a local regression model to each segment of the data.
•Using recursive partitioning to identify the segments of the data.
•Trees make a decision at each node.
•They can be used for point estimation and for distribution estimation (maximum likelihood estimate of the class probability distribution).
R packages:
•tree: based on the S-PLUS routine by Clark and Pregibon
•rpart: implementation of Breiman, Friedman, Olshen and Stone (CART)
http://www.stat.cmu.edu/~cshalizi/350/lectures/22/lecture-22.pdf
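A minimal rpart sketch, using the same placeholder data objects as above:

    library(rpart)
    # recursive partitioning: at each node the split that best separates the classes is chosen
    tree_fit <- rpart(type ~ ., data = spam_train, method = "class")
    printcp(tree_fit)                      # complexity-parameter table, useful for pruning
    plot(tree_fit); text(tree_fit, use.n = TRUE)
    # prediction = majority class of the terminal node an observation falls into
    tree_pred <- predict(tree_fit, newdata = spam_test, type = "class")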
•Standard random forest split and importance measures are biased towards continuous variables and variables with many categories.
•cforest {party}: unbiased conditional inference trees.
http://www.statistik.uni-dortmund.de/useR-2008/slides/Strobl+Zeileis.pdf
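A sketch of a conditional inference forest with the party package (parameter values are arbitrary choices, data objects the placeholders from above):

    library(party)
    # splits are chosen via permutation tests, which avoids the bias towards
    # continuous predictors and predictors with many categories
    cf <- cforest(type ~ ., data = spam_train,
                  controls = cforest_unbiased(ntree = 500, mtry = 5))
    cf_pred <- predict(cf, newdata = spam_test)
    vi <- varimp(cf)        # varimp(cf, conditional = TRUE) gives the conditional version (slower)
    head(sort(vi, decreasing = TRUE))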
•An SVM finds the hyperplane that separates the classes (the decision boundary).
•Support vectors: the closest points of the classes; the distance (margin) to them is maximized.
•There is a lot of mathematical theory behind SVMs.
•Kernels: transform the objects into a feature space where the classes become easier to separate.
•R library: e1071
http://cran.r-project.org/web/packages/e1071/vignettes/svmdoc.pdf
https://www.statsoft.com/textbook/support-vector-machines
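A minimal e1071 sketch; the radial kernel and the cost value are arbitrary choices, and the data objects are the placeholders used above:

    library(e1071)
    # the kernel maps the objects into a feature space; the fitted boundary
    # maximizes the margin to the closest points of each class (the support vectors)
    svm_fit <- svm(type ~ ., data = spam_train, kernel = "radial", cost = 1)
    summary(svm_fit)                          # number of support vectors, etc.
    svm_pred <- predict(svm_fit, newdata = spam_test)
    table(Prediction = svm_pred, Actual = spam_test$type)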