
Email Spam Classification

Classification models for detecting email spam.

jeyaramashok

March 20, 2014

Transcript

  1. Logistic Regression
     • Logistic regression is a type of probabilistic statistical classification model.
     • It models the relationship between a dependent variable and one or more independent variables, and lets us examine both the fit of the model and the significance of the relationships.
     • Logistic regression estimates the probability of an event occurring.
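A minimal sketch of that last point in Python, assuming scikit-learn and using a toy word-frequency matrix in place of real email data:

```python
# Minimal sketch: logistic regression as a probabilistic classifier.
# X (word-frequency features) and y (0 = not spam, 1 = spam) are
# synthetic placeholders, not the deck's actual data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((100, 5))                 # 100 emails, 5 word frequencies
y = (X[:, 0] + X[:, 1] > 1).astype(int)  # toy labels

model = LogisticRegression()
model.fit(X, y)

# predict_proba returns the estimated probability of each class;
# column 1 is the estimated P(spam) for each email.
print(model.predict_proba(X[:3])[:, 1])
```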
  2. History
     • In 1995, Tin Kam Ho of Bell Labs coined random forests, building on Breiman's bagged trees.
     • In 1999, Leo Breiman published a complete paper on random forests.
     • Adele Cutler and Breiman developed algorithms for random forests.
     http://www.stat.berkeley.edu/~breiman/RandomForests/
  3. Random Forests
     • Random forests are an ensemble learning method for classification (and regression) that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes output by the individual trees.
     • Each tree is grown at least partially at random.
     • Randomness is injected by growing each tree on a different random subsample of the training data.
     • Randomness is also injected into the split selection process, so that the splitter at any node is determined partly at random.
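A hedged sketch of both injection points using scikit-learn's RandomForestClassifier (the deck itself works in R; the toy data here is only illustrative):

```python
# Sketch: random forest with both kinds of injected randomness.
# bootstrap=True grows each tree on a random subsample of the data;
# max_features="sqrt" randomizes the candidate splitters at each node.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 10))
y = (X[:, 0] > 0.5).astype(int)  # toy labels

forest = RandomForestClassifier(
    n_estimators=100,    # number of trees in the ensemble
    bootstrap=True,      # each tree sees a different random subsample
    max_features="sqrt"  # random subset of features tried at each split
)
forest.fit(X, y)

# The predicted class is the mode (majority vote) of the trees' outputs.
print(forest.predict(X[:5]))
```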
  4. Random Forests
     Benefits:
     • High levels of predictive accuracy
     • Resistant to overtraining (overfitting); generalizes well to new data
     • Trains rapidly, even with thousands of potential predictors
     • No need for prior feature (variable) selection
  5. Spam Classification
     Classify an email as Spam or Not Spam (Yes/No).
     • Response: discrete/categorical
     • Predictors: frequency of words (continuous)
     Solutions:
     • Logistic regression
     • Trees (CART, random forests)
     • Support vector machines
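The deck does not name its data source; assuming a layout like the UCI Spambase set (57 word/character-frequency predictors plus a 0/1 spam label), loading it could look like this sketch (the file path is a placeholder):

```python
# Sketch: loading a spam dataset of word frequencies + binary label.
# Assumes the UCI Spambase CSV layout (57 numeric predictors, last
# column is the 0/1 spam indicator); "spambase.data" is hypothetical.
import pandas as pd

df = pd.read_csv("spambase.data", header=None)
X = df.iloc[:, :-1]   # continuous predictors: word/char frequencies
y = df.iloc[:, -1]    # categorical response: 1 = spam, 0 = not spam
print(X.shape, y.value_counts())
```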
  6. Wrangling
     • Visual inspection of the data set
     • Clean up messy data
     • Find missing data (replacements)
     • Transformations (e.g. Newyork, New York, NY)
     • Facets and filters
     • Tools: OpenRefine (formerly Google Refine) and Stanford Data Wrangler
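The tools above do this interactively; as a small illustration of the transformation step, here is a hedged pandas equivalent of the New York example, with a hypothetical city column:

```python
# Sketch: normalizing inconsistent spellings, as in the
# "Newyork, New York, NY" example. Column name is hypothetical.
import pandas as pd

df = pd.DataFrame({"city": ["Newyork", "New York", "NY", "Boston"]})
df["city"] = df["city"].replace({"Newyork": "New York", "NY": "New York"})
print(df["city"].value_counts())
```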
  7. Data Partitioning
     Why?
     • Need for in-sample (train) and out-of-sample (test) data
     • Guard against over-fitting
     How?
     • Random sub-sampling
     • K-fold cross validation (test-train cycle)
     • Leave-one-out cross validation
     • Bootstrap
     Spam data split:
     • ¾ train and ¼ test
     • Python scikit-learn: sklearn.cross_validation.train_test_split(*arrays, **options)
     • In R: createDataPartition{caret}
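A sketch of the ¾/¼ split in Python; note that in current scikit-learn versions train_test_split lives in sklearn.model_selection rather than the older sklearn.cross_validation:

```python
# Sketch: random sub-sampling into 3/4 train and 1/4 test.
# X and y are placeholders for the spam predictors and labels.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((100, 5))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42  # 1/4 held out for testing
)
print(len(X_train), len(X_test))  # 75 25
```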
  8. Variable Selection
     Objective:
     • Find important predictors
     • Find a small number of predictors
     Why?
     • We don't want to use all variables (Occam's razor)
     • Remove redundant and irrelevant features
     How?
     • Reasoning
     • Cross validation (test-train cycle)
     • Stepwise regression (forward/backward)
     • Automatic searches
     • Subset selection
     • Regularization: ridge regression, elastic net and lasso
     http://machinelearning.wustl.edu/mlpapers/paper_files/GuyonE03.pdf
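One hedged example of an automatic search (not necessarily what the deck used): scikit-learn's recursive feature elimination with cross-validation, which runs the test-train cycle to keep a small subset of predictors:

```python
# Sketch: cross-validated recursive feature elimination.
# Drops redundant/irrelevant features; X and y are placeholders.
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((200, 20))
y = (X[:, 0] + X[:, 3] > 1).astype(int)  # only two informative features

selector = RFECV(LogisticRegression(), cv=5)
selector.fit(X, y)
print(selector.n_features_)            # size of the selected subset
print(np.where(selector.support_)[0])  # indices of the kept predictors
```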
  9. Stepwise Regression
     • Model with 23 parameters; AIC: 965.49; Accuracy: 0.8
     • More complex model, less information loss
     • R: train{caret} and stepAIC{MASS}
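stepAIC is an R routine; as a rough Python analogue, a forward stepwise pass by AIC can be sketched with statsmodels (the column names below are illustrative, borrowed from the next slide):

```python
# Sketch: forward stepwise selection by AIC with logistic regression.
# At each step, add whichever candidate predictor lowers AIC the most;
# stop when AIC no longer improves. Data are synthetic placeholders.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((300, 6)),
                 columns=["free", "money", "remove", "edu", "you", "credit"])
y = (X["free"] + X["money"] + rng.normal(0, 0.5, 300) > 1).astype(int)

selected = []  # predictors chosen so far
for _ in range(X.shape[1]):
    candidates = [c for c in X.columns if c not in selected]
    aics = {}
    for c in candidates:
        design = sm.add_constant(X[selected + [c]])
        aics[c] = sm.Logit(y, design).fit(disp=0).aic
    best = min(aics, key=aics.get)
    current_aic = (sm.Logit(y, sm.add_constant(X[selected])).fit(disp=0).aic
                   if selected else float("inf"))
    if aics[best] >= current_aic:  # no improvement: stop
        break
    selected.append(best)

print(selected)
```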
  10. Models
      Model 1: 5 predictors (free + money + remove + re: + edu); AIC: 1261
      Model 2: 8 predictors (remove + internet + free + you + credit + money + business + edu); AIC: 1051.5
      Model 3: 23 predictors (address + all + remove + internet + order + people + free + business + email + you + credit + money + lab + data + parts + pm + meeting + original + project + re. + edu + table + conference); AIC: 995.49
  11. Goodness of Fit
      Why won't R² work?
      • The goal is to predict a binary outcome
      • The model gives estimated probabilities
      • R² tries to fit a straight line
      Solution: compare observed and expected counts.
      Other tests:
      • Hosmer & Lemeshow: hoslem.test{ResourceSelection}
      • ROC / AUC / C-statistic
      http://cran.r-project.org/web/packages/ResourceSelection/ResourceSelection.pdf
      http://cran.r-project.org/web/packages/hmeasure/vignettes/hmeasure.pdf
      http://personalpages.manchester.ac.uk/staff/mark.lunt/stats/7_Binary/text.pdf
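The deck's references use R's ResourceSelection and hmeasure packages; a hedged sketch of the ROC/AUC check in Python:

```python
# Sketch: AUC / C-statistic computed from the model's estimated
# probabilities, comparing observed outcomes with predictions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
X = rng.random((300, 5))
y = (X[:, 0] + rng.normal(0, 0.3, 300) > 0.5).astype(int)

probs = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]
print("AUC:", roc_auc_score(y, probs))  # 0.5 = chance, 1.0 = perfect
fpr, tpr, _ = roc_curve(y, probs)       # points tracing the ROC curve
```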
  12. Analysis (confusion matrices; columns are actuals, rows are predictions)
      Model 1:                 Not Spam  Spam
        Predicted Not Spam        298     66
        Predicted Spam             13    123
        Misclassification rate:    6%    34%
      Model 2:                 Not Spam  Spam
        Predicted Not Spam        294     52
        Predicted Spam             17    137
        Misclassification rate:    5%    27%
      Model 3:                 Not Spam  Spam
        Predicted Not Spam        285     42
        Predicted Spam             26    147
        Misclassification rate:    8%    22%
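A small sketch of how the per-class misclassification rates above fall out of a confusion matrix, using Model 2's counts from this slide:

```python
# Sketch: per-class misclassification rates from a confusion matrix.
# Model 2's counts from the slide; columns are the actual classes.
import numpy as np

#                 actual: not spam, spam
cm = np.array([[294,  52],    # predicted not spam
               [ 17, 137]])   # predicted spam

col_totals = cm.sum(axis=0)
not_spam_err = cm[1, 0] / col_totals[0]  # not-spam flagged as spam
spam_err = cm[0, 1] / col_totals[1]      # spam that slipped through
print(f"{not_spam_err:.1%} {spam_err:.1%}")  # roughly the 5% / 27% shown
```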
  13. Voters Targeting
      • Campaigning, mobilization and turnout
      • Identifying likelihood of support and turnout
      • Targeting for turnout based on the quadrants
      http://epub.wu.ac.at/3458/1/Report117.pdf
  14. LORET
      • Regression model with predictors and hierarchical partitioning
      • Combination of supervised and unsupervised learning
      • Fits a local regression model to a segment of the data
      • Uses recursive partitioning methods to identify the segments of the data
  15. Classification Trees
      • Classification (discrete) vs. regression trees (numerical)
      • Construct a binary tree which makes a decision at each node
      • Point estimation and distribution estimation
      • Maximum likelihood estimate of the probability distribution
      R packages:
      • tree: S-PLUS routine by Clark and Pregibon
      • rpart: implementation of Breiman, Friedman, Olshen and Stone
      http://www.stat.cmu.edu/~cshalizi/350/lectures/22/lecture-22.pdf
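tree and rpart are R packages; a hedged scikit-learn counterpart that grows (and, by limiting depth, effectively prunes) a binary classification tree:

```python
# Sketch: a binary classification tree (CART-style, as in rpart).
# Each internal node tests one feature against a threshold.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = (X[:, 2] > 0.6).astype(int)

tree = DecisionTreeClassifier(max_depth=3)  # cap depth to avoid overfit
tree.fit(X, y)
print(export_text(tree))           # the decision made at each node
print(tree.predict_proba(X[:2]))   # class-probability estimates per leaf
```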
  16. Confusion Matrix
      Original tree:           Not Spam  Spam
        Predicted Not Spam        277     35
        Predicted Spam             34    154
      Pruned tree:             Not Spam  Spam
        Predicted Not Spam        283     39
        Predicted Spam             28    150
  17. Random Forests: Packages
      randomForest [pkg: randomForest]
      • CART implementation
      • Biased towards continuous variables and variables with many categories
      cforest [pkg: party]
      • Unbiased conditional inference trees
      http://www.statistik.uni-dortmund.de/useR-2008/slides/Strobl+Zeileis.pdf
  18. Comparison
                                     Logistic Regression  Classification Trees  Random Forests
      Interpretation                 Easy                 Easy                  Moderate
      Average prediction accuracy*   86%                  87%                   88%
      * For the spam dataset
  19. Other Classifiers
      • Naive Bayes
      • Perceptron
      • K-nearest neighbors
      • Support vector machines
      • Neural networks
      • Bayesian networks
      • Hidden Markov models
      • Learning vector quantization
      http://en.wikipedia.org/wiki/Classification_in_machine_learning
  20. Support Vector Machines
      • Hyperplane: the line separating the classes (decision boundary)
      • Support vectors: maximize the distance between the closest points of the classes
      • Kernels: transform the objects
      • A lot of mathematical theory behind it
      • R library: e1071
      http://cran.r-project.org/web/packages/e1071/vignettes/svmdoc.pdf
      https://www.statsoft.com/textbook/support-vector-machines
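e1071 wraps libsvm in R; a hedged Python sketch of the same idea (scikit-learn's SVC also wraps libsvm), including a kernel that transforms the inputs so a separating hyperplane exists:

```python
# Sketch: SVM classifier with an RBF kernel. The kernel implicitly
# transforms inputs so the classes become separable by a hyperplane;
# the margin to the closest points (support vectors) is maximized.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 0.5).astype(int)  # non-linear boundary

clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, y)
print(clf.support_vectors_.shape)          # points defining the margin
print(clf.predict([[0.1, 0.1], [0.9, 0.9]]))
```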