Classification Methods
•Classification Trees
•Logistic Regression
•Model-Based Trees
•Random Forests
Logistic Regression
•Logistic regression is a probabilistic statistical classification model.
•It models the relationship between a dependent variable and one or more
independent variables, and lets us assess both the fit of the model and
the significance of the individual relationships.
•Logistic regression estimates the probability of an event occurring.
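A minimal sketch in R of fitting such a model; the spam data set from the kernlab package (word-frequency predictors, binary response type) is an assumption, chosen to match the spam example used later in these slides:

  # Minimal sketch: logistic regression on the spam data.
  library(kernlab)
  data(spam)

  # P(spam) modeled as a function of a few word frequencies.
  fit <- glm(type ~ free + money + remove + edu,
             data = spam, family = binomial)

  summary(fit)                           # fit and significance of each predictor
  head(predict(fit, type = "response"))  # estimated probabilities of spam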
History
•In 1995 Tin Kam Ho of Bell Labs introduced the first random decision
forests, built from trees grown on random subsets of the features.
•Leo Breiman's complete paper on random forests appeared as a technical
report in 1999 and was published in 2001.
•Breiman and Adele Cutler went on to develop the Random Forests algorithms.
http://www.stat.berkeley.edu/~breiman/RandomForests/
Random Forests
•Random forests are an ensemble learning method for classification (and
regression) that operates by constructing a multitude of decision trees at
training time and outputting the class that is the mode of the classes
output by the individual trees.
•Each tree is grown at least partially at random; randomness is injected in
two ways, sketched below:
•Each tree is grown on a different random subsample of the training data
•The split selection process is randomized, so that the splitter at any
node is determined partly at random
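A minimal sketch in R of both injection points, using the randomForest package (the kernlab spam data is the same assumption as above):

  library(randomForest)
  library(kernlab)
  data(spam)

  set.seed(1)
  rf <- randomForest(type ~ ., data = spam,
                     ntree   = 500,   # number of trees in the ensemble
                     replace = TRUE,  # each tree sees a bootstrap sample of the rows
                     mtry    = 7)     # random subset of predictors tried at each split
  rf   # prints the out-of-bag error estimate and confusion matrix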
Random Forests
Benefits:
•High levels of predictive accuracy
•Resistant to overtraining (overfitting) – generalizes well to new data
•Trains rapidly even with thousands of potential predictors
•No need for prior feature (variable) selection
Spam Classification
Classify an email as Spam or Not-Spam (Yes/No)
●Response: discrete/categorical
●Predictors: frequency of words (continuous)
Solutions
●Logistic regression
●Trees (CART, random forests)
●Support vector machines
Wrangling
●Visual inspection of the data set
●Clean up messy data
●Find missing data (replacements)
●Transformations (e.g. Newyork, New York, NY)
●Facets and filters
●Google OpenRefine and Stanford Data Wrangler
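A tiny base-R illustration of the kind of transformation meant here (the vector and its variants are made up for the example):

  # Hypothetical messy column with inconsistent spellings of one city.
  city <- c("Newyork", "New York", "NY", " new york ")

  # Normalize case and whitespace, then map known variants to one value.
  city <- tolower(trimws(city))
  city[city %in% c("newyork", "ny", "new york")] <- "New York"
  city   # "New York" "New York" "New York" "New York"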
Data Partitioning
Why?
●Need for in-sample (train) and out-of-sample (test) data
●Over-fitting
How?
●Random Sub-sampling
●K-fold cross validation (test-train cycle)
●Leave one out cross validation
●Bootstrap
Spam data Split:
●¾ – Train and ¼ – Test
●Python Scikit-learn
●sklearn.cross_validation.train_test_split(*arrays, **options)
●In R, createDataPartition{caret}
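A sketch of the ¾ train / ¼ test split in R with caret (spam data from kernlab assumed, as before):

  library(caret)
  library(kernlab)
  data(spam)

  set.seed(123)
  in_train <- createDataPartition(spam$type, p = 0.75, list = FALSE)
  train <- spam[in_train, ]   # 3/4 used to fit the models
  test  <- spam[-in_train, ]  # 1/4 held out for evaluation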
Variable Selection
Objective
●Find important predictors
●Find small number of predictors
Why?
●We don't want to use all variables (Occam's razor)
●Remove redundant and irrelevant features
How?
●Reasoning
●Cross validation (test-train cycle)
●Stepwise regression (Forward/Backward)
●Automatic searches
●Subset selection
●Regularization: ridge regression, the lasso, and the elastic net
http://machinelearning.wustl.edu/mlpapers/paper_files/GuyonE03.pd
Some Ideas
Spam: credit, money, free, remove
Not Spam: meeting, re:, edu, project
Corrgrams
http://www.sci.utah.edu/~kpotter/Library/Papers/friendly:2002:EDCM/
●No strong correlations appear, because the relationships are non-linear
Stepwise Regression
●Model with 23 parameters
  AIC: 965.49
  Accuracy: 0.8
●A complex model, but less information loss
train{caret} and stepAIC{MASS}
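A sketch of how such a model could be reached with stepAIC, starting from the full logistic model on the train split defined earlier:

  library(MASS)

  full <- glm(type ~ ., data = train, family = binomial)

  # Backward elimination: drop the predictor that most improves AIC
  # at each step, until no removal helps.
  step_fit <- stepAIC(full, direction = "backward", trace = FALSE)

  AIC(step_fit)            # information criterion of the selected model
  length(coef(step_fit))   # number of parameters retained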
Best Subsets
- Branch-and-bound algorithm
- regsubsets{leaps}
http://cran.r-project.org/web/packages/leaps/leaps.pdf
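A sketch with regsubsets; since it fits linear models, the 0/1 outcome is used here only as a rough screen, and the word list is an illustrative assumption:

  library(leaps)

  words <- c("free", "money", "remove", "internet", "credit",
             "business", "you", "edu", "meeting", "project", "re")

  # Branch-and-bound search for the best subset of each size up to 8.
  subsets <- regsubsets(x = as.matrix(train[, words]),
                        y = as.numeric(train$type == "spam"),
                        nvmax = 8)
  summary(subsets)$outmat   # which variables enter at each subset size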
Lasso Regularization
•lassoglm in MATLAB
•glmnet{glmnet} in R
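A sketch with glmnet, using cross-validation to pick the penalty (train split as before):

  library(glmnet)

  x <- as.matrix(train[, setdiff(names(train), "type")])
  y <- train$type

  # alpha = 1 gives the lasso; cv.glmnet picks lambda by cross-validation.
  cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)
  coef(cv_fit, s = "lambda.1se")   # sparse: many coefficients shrunk to exactly zero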
Univariate Feature Selection
●Python scikit-learn cross-validation classes
http://scikit-learn.org/stable/auto_examples/plot_feature_selection.html#example-plot-feature-selection-py
Random Forests
●Variables which split the data into large groups
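A sketch of extracting that importance ranking with the randomForest package (train split as before):

  library(randomForest)

  set.seed(1)
  rf <- randomForest(type ~ ., data = train, importance = TRUE)

  varImpPlot(rf)        # mean decrease in accuracy / in Gini impurity
  head(importance(rf))  # the same scores as a matrix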
Models
Model  Count(x)  x                                                     AIC
1      5         free + money + remove + re: + edu                     1261
2      8         remove + internet + free + you + credit + money +
                 business + edu                                        1051.5
3      23        address + all + remove + internet + order + people +
                 free + business + email + you + credit + money +
                 lab + data + parts + pm + meeting + original +
                 project + re: + edu + table + conference              995.49
Goodness of Fit
Why won't R² work?
•Goal is to predict a binary outcome
•Model gives estimated probabilities
•R² tries to fit a straight line
Solution: compare observed and expected counts
Other tests (sketched below):
•Hosmer & Lemeshow
  hoslem.test{ResourceSelection}
•ROC / AUC / C-statistic
http://cran.r-project.org/web/packages/ResourceSelection/ResourceSelection.pdf
http://cran.r-project.org/web/packages/hmeasure/vignettes/hmeasure.pdf
http://personalpages.manchester.ac.uk/staff/mark.lunt/stats/7_Binary/text.pdf
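A sketch of both checks in R, using hoslem.test from ResourceSelection and the hmeasure package for the AUC (model formula and train/test splits as assumed earlier):

  library(ResourceSelection)

  fit <- glm(type ~ free + money + remove + edu,
             data = train, family = binomial)

  # Hosmer-Lemeshow: compare observed and expected counts in g risk groups.
  hoslem.test(fit$y, fitted(fit), g = 10)

  # AUC / C-statistic on the held-out test set.
  library(hmeasure)
  scores <- predict(fit, newdata = test, type = "response")
  HMeasure(as.numeric(test$type == "spam"), scores)$metrics["AUC"]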
Analysis

Model 1:
            Actuals
Prediction  Not Spam  Spam
Not Spam    298       66
Spam        13        123
Misclassification rate: 6% / 34%

Model 2:
            Actuals
Prediction  Not Spam  Spam
Not Spam    294       52
Spam        17        137
Misclassification rate: 5% / 27%

Model 3:
            Actuals
Prediction  Not Spam  Spam
Not Spam    285       42
Spam        26        147
Misclassification rate: 8% / 22%
Voters Targeting
•Campaigning, mobilization, and turnout.
•Identifying likelihood of support and turnout.
•Targeting for turnout based on the quadrants.
http://epub.wu.ac.at/3458/1/Report117.pdf
LORET
•Regression models with predictors, combined with hierarchical partitioning.
•A combination of supervised and unsupervised learning.
•Fits a local regression model to each segment of the data.
•Uses recursive partitioning methods to identify the segments of the data.
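One way to fit such local logistic models in R is model-based recursive partitioning, e.g. glmtree from the partykit package; this is a sketch under that assumption, not necessarily the exact method of the cited report:

  library(partykit)

  # Logistic regression in each leaf; the variables after '|' are the
  # candidates the tree may partition on (names from the spam data).
  loret <- glmtree(type ~ free + money | edu + meeting + re,
                   data = train, family = binomial)
  plot(loret)   # tree of segments, each with its own logistic fit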
Classification Trees
•Classification trees (discrete response) vs regression trees (numerical response)
•Construct a binary tree which makes a decision at each node
•Point estimation and distribution estimation
•Maximum likelihood estimation of the probability distribution
R packages:
•tree: the S-PLUS routines of Clark and Pregibon
•rpart: implementation of the CART method of Breiman, Friedman, Olshen and Stone
http://www.stat.cmu.edu/~cshalizi/350/lectures/22/lecture-22.pdf
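A sketch with rpart on the train split from earlier:

  library(rpart)

  # CART-style binary classification tree.
  cart <- rpart(type ~ ., data = train, method = "class")

  printcp(cart)             # complexity table, used below for pruning
  plot(cart); text(cart)    # draw the tree with split labels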
Original Tree vs Pruned Tree (figure)
●Pruning has removed 2 nodes from the original tree
Q: Is the mail spam?
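A sketch of cost-complexity pruning with rpart, continuing the tree fitted above:

  # Pick the complexity parameter with the smallest cross-validated error,
  # then prune the tree back to that size.
  best_cp <- cart$cptable[which.min(cart$cptable[, "xerror"]), "CP"]
  pruned  <- prune(cart, cp = best_cp)

  plot(pruned); text(pruned)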
Confusion Matrix
Original Tree:
            Actuals
Prediction  Not Spam  Spam
Not Spam    277       35
Spam        34        154

Pruned Tree:
            Actuals
Prediction  Not Spam  Spam
Not Spam    283       39
Spam        28        150
Random Forests- Packages
randomForest [pkg: randomForest]
•CART implementation
•Biased towards continuous variables and variables with many categories
cforest [pkg: party]
•Unbiased conditional inference trees (sketched below)
http://www.statistik.uni-dortmund.de/useR-2008/slides/Strobl+Zeileis.pdf
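A sketch of the unbiased alternative with party's cforest (train split as before):

  library(party)

  # Conditional inference forest: split selection is not biased towards
  # continuous variables or variables with many categories.
  cf <- cforest(type ~ ., data = train,
                controls = cforest_unbiased(ntree = 500, mtry = 7))
  varimp(cf)   # permutation importance from conditional inference trees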
Support Vector Machines
•Hyperplane: the line (decision boundary) separating the classes
•Support vectors: the points of each class closest to the boundary; the
classifier maximizes the distance (margin) to them
•Backed by a large body of mathematical theory
•R library: e1071
http://cran.r-project.org/web/packages/e1071/vignettes/svmdoc.pdf
https://www.statsoft.com/textbook/support-vector-machines
•Kernels: transform the objects into a feature space where the classes
become separable
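A sketch with e1071 using a radial (RBF) kernel, on the same train/test split assumed throughout:

  library(e1071)

  # The RBF kernel implicitly maps the word frequencies into a
  # higher-dimensional space where a separating hyperplane is sought.
  sv <- svm(type ~ ., data = train, kernel = "radial", cost = 1)

  pred <- predict(sv, newdata = test)
  mean(pred == test$type)   # test-set accuracy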
Wow!!!
•98% accuracy on the spam dataset
http://www.d.umn.edu/math/Technical%20Reports/Technical%20Reports%202007-/TR%202007-2008/TR_2008_3.pdf