Classification Methods
•Classification Trees
•Logistic Regression
•Model-Based Trees
•Random Forests
Logistic Regression
•Logistic regression is a probabilistic statistical classification model.
•It models the relationship between a dependent variable and one or more
independent variables, and lets us assess both the fit of the model and
the significance of the individual relationships.
•Logistic regression estimates the probability of an event occurring.
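A minimal sketch in R of fitting such a model; the spam data set from the kernlab package (word-frequency predictors, binary response type) is an assumption, chosen to match the spam example used later in these slides:

  # Minimal sketch: logistic regression on the spam data.
  library(kernlab)
  data(spam)

  # P(spam) modeled as a function of a few word frequencies.
  fit <- glm(type ~ free + money + remove + edu,
             data = spam, family = binomial)

  summary(fit)                           # fit and significance of each predictor
  head(predict(fit, type = "response"))  # estimated probabilities of spam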
History
•In 1995 Tin Kam Ho of Bell Labs introduced the first random decision
forests, built from trees grown on random subsets of the features.
•Leo Breiman's complete paper on random forests appeared as a technical
report in 1999 and was published in 2001.
•Breiman and Adele Cutler went on to develop the Random Forests algorithms.
http://www.stat.berkeley.edu/~breiman/RandomForests/
Random Forests
•Random forests are an ensemble learning method for classification (and
regression) that operates by constructing a multitude of decision trees at
training time and outputting the class that is the mode of the classes
output by the individual trees.
•Each tree is grown at least partially at random; randomness is injected in
two ways, sketched below:
•Each tree is grown on a different random subsample of the training data
•The split selection process is randomized, so that the splitter at any
node is determined partly at random
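A minimal sketch in R of both injection points, using the randomForest package (the kernlab spam data is the same assumption as above):

  library(randomForest)
  library(kernlab)
  data(spam)

  set.seed(1)
  rf <- randomForest(type ~ ., data = spam,
                     ntree   = 500,   # number of trees in the ensemble
                     replace = TRUE,  # each tree sees a bootstrap sample of the rows
                     mtry    = 7)     # random subset of predictors tried at each split
  rf   # prints the out-of-bag error estimate and confusion matrix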
Random Forests
Benefits:
•High levels of predictive accuracy
•Resistant to overtraining (overfitting) – generalizes well to new data
•Trains rapidly even with thousands of potential predictors
•No need for prior feature (variable) selection
Spam Classification
Classify an email as Spam or Not-Spam (Yes/No)
●Response: discrete/categorical
●Predictors: frequency of words (continuous)
Solutions
●Logistic regression
●Trees (CART, random forests)
●Support vector machines
Wrangling
●Visual inspection of the data set
●Clean up messy data
●Find missing data (replacements)
●Transformations (e.g. Newyork, New York, NY)
●Facets and filters
●Google OpenRefine and Stanford Data Wrangler
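A tiny base-R illustration of the kind of transformation meant here (the vector and its variants are made up for the example):

  # Hypothetical messy column with inconsistent spellings of one city.
  city <- c("Newyork", "New York", "NY", " new york ")

  # Normalize case and whitespace, then map known variants to one value.
  city <- tolower(trimws(city))
  city[city %in% c("newyork", "ny", "new york")] <- "New York"
  city   # "New York" "New York" "New York" "New York"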
Data Partitioning
Why?
●Need for in-sample (train) and out-of-sample (test) data
●Over-fitting
How?
●Random Sub-sampling
●K-fold cross validation (test-train cycle)
●Leave one out cross validation
●Bootstrap
Spam data Split:
●¾ – Train and ¼ – Test
●Python Scikit-learn
●sklearn.cross_validation.train_test_split(*arrays, **options)
●In R, createDataPartition{caret}
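A sketch of the ¾ train / ¼ test split in R with caret (spam data from kernlab assumed, as before):

  library(caret)
  library(kernlab)
  data(spam)

  set.seed(123)
  in_train <- createDataPartition(spam$type, p = 0.75, list = FALSE)
  train <- spam[in_train, ]   # 3/4 used to fit the models
  test  <- spam[-in_train, ]  # 1/4 held out for evaluation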
Variable Selection
Objective
●Find important predictors
●Find small number of predictors
Why?
●We don't want to use all variables (Occam's razor)
●Remove redundant and irrelevant features
How?
●Reasoning
●Cross validation (test-train cycle)
●Stepwise regression (Forward/Backward)
●Automatic searches
●Subset selection
●Regularization: ridge regression, the lasso, and the elastic net
http://machinelearning.wustl.edu/mlpapers/paper_files/GuyonE03.pd
Some Ideas
Spam: credit, money, free, remove
Not Spam: meeting, re:, edu, project
Corrgrams
http://www.sci.utah.edu/~kpotter/Library/Papers/friendly:2002:EDCM/
●No strong correlations appear, because the relationships are non-linear
Stepwise Regression
●Model with 23 parameters
  AIC: 965.49
  Accuracy: 0.8
●A complex model, but less information loss
train{caret} and stepAIC{MASS}
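A sketch of how such a model could be reached with stepAIC, starting from the full logistic model on the train split defined earlier:

  library(MASS)

  full <- glm(type ~ ., data = train, family = binomial)

  # Backward elimination: drop the predictor that most improves AIC
  # at each step, until no removal helps.
  step_fit <- stepAIC(full, direction = "backward", trace = FALSE)

  AIC(step_fit)            # information criterion of the selected model
  length(coef(step_fit))   # number of parameters retained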
Best Subsets
- Branch-and-bound algorithm
- regsubsets{leaps}
http://cran.r-project.org/web/packages/leaps/leaps.pdf
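A sketch with regsubsets; since it fits linear models, the 0/1 outcome is used here only as a rough screen, and the word list is an illustrative assumption:

  library(leaps)

  words <- c("free", "money", "remove", "internet", "credit",
             "business", "you", "edu", "meeting", "project", "re")

  # Branch-and-bound search for the best subset of each size up to 8.
  subsets <- regsubsets(x = as.matrix(train[, words]),
                        y = as.numeric(train$type == "spam"),
                        nvmax = 8)
  summary(subsets)$outmat   # which variables enter at each subset size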
Lasso Regularization
•lassoglm in MATLAB
•glmnet{glmnet} in R
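A sketch with glmnet, using cross-validation to pick the penalty (train split as before):

  library(glmnet)

  x <- as.matrix(train[, setdiff(names(train), "type")])
  y <- train$type

  # alpha = 1 gives the lasso; cv.glmnet picks lambda by cross-validation.
  cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)
  coef(cv_fit, s = "lambda.1se")   # sparse: many coefficients shrunk to exactly zero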
Univariate Feature Selection
●Python scikit-learn cross-validation classes
http://scikit-learn.org/stable/auto_examples/plot_feature_selection.html#example-plot-feature-selection-py
Random Forests
●Variables which split the data into large groups
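A sketch of extracting that importance ranking with the randomForest package (train split as before):

  library(randomForest)

  set.seed(1)
  rf <- randomForest(type ~ ., data = train, importance = TRUE)

  varImpPlot(rf)        # mean decrease in accuracy / in Gini impurity
  head(importance(rf))  # the same scores as a matrix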
Models
Model  Count(x)  x                                                     AIC
1      5         free + money + remove + re: + edu                     1261
2      8         remove + internet + free + you + credit + money +
                 business + edu                                        1051.5
3      23        address + all + remove + internet + order + people +
                 free + business + email + you + credit + money +
                 lab + data + parts + pm + meeting + original +
                 project + re: + edu + table + conference              995.49
Goodness of Fit
Why won't R² work?
•Goal is to predict a binary outcome
•Model gives estimated probabilities
•R² tries to fit a straight line
Solution: compare observed and expected counts
Other tests (sketched below):
•Hosmer & Lemeshow
  hoslem.test{ResourceSelection}
•ROC / AUC / C-statistic
http://cran.r-project.org/web/packages/ResourceSelection/ResourceSelection.pdf
http://cran.r-project.org/web/packages/hmeasure/vignettes/hmeasure.pdf
http://personalpages.manchester.ac.uk/staff/mark.lunt/stats/7_Binary/text.pdf
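A sketch of both checks in R, using hoslem.test from ResourceSelection and the hmeasure package for the AUC (model formula and train/test splits as assumed earlier):

  library(ResourceSelection)

  fit <- glm(type ~ free + money + remove + edu,
             data = train, family = binomial)

  # Hosmer-Lemeshow: compare observed and expected counts in g risk groups.
  hoslem.test(fit$y, fitted(fit), g = 10)

  # AUC / C-statistic on the held-out test set.
  library(hmeasure)
  scores <- predict(fit, newdata = test, type = "response")
  HMeasure(as.numeric(test$type == "spam"), scores)$metrics["AUC"]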
Analysis

Model 1:
            Actuals
Prediction  Not Spam  Spam
Not Spam    298       66
Spam        13        123
Misclassification rate: 6% / 34%

Model 2:
            Actuals
Prediction  Not Spam  Spam
Not Spam    294       52
Spam        17        137
Misclassification rate: 5% / 27%

Model 3:
            Actuals
Prediction  Not Spam  Spam
Not Spam    285       42
Spam        26        147
Misclassification rate: 8% / 22%
Voters Targeting
•Campaigning, mobilization, and turnout.
•Identifying likelihood of support and turnout.
•Targeting for turnout based on the quadrants.
http://epub.wu.ac.at/3458/1/Report117.pdf
LORET
•Regression models with predictors, combined with hierarchical partitioning.
•A combination of supervised and unsupervised learning.
•Fits a local regression model to each segment of the data.
•Uses recursive partitioning methods to identify the segments of the data.
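One way to fit such local logistic models in R is model-based recursive partitioning, e.g. glmtree from the partykit package; this is a sketch under that assumption, not necessarily the exact method of the cited report:

  library(partykit)

  # Logistic regression in each leaf; the variables after '|' are the
  # candidates the tree may partition on (names from the spam data).
  loret <- glmtree(type ~ free + money | edu + meeting + re,
                   data = train, family = binomial)
  plot(loret)   # tree of segments, each with its own logistic fit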
Classification Trees
•Classification trees (discrete response) vs regression trees (numerical response)
•Construct a binary tree which makes a decision at each node
•Point estimation and distribution estimation
•Maximum likelihood estimation of the probability distribution
R packages:
•tree: the S-PLUS routines of Clark and Pregibon
•rpart: implementation of the CART method of Breiman, Friedman, Olshen and Stone
http://www.stat.cmu.edu/~cshalizi/350/lectures/22/lecture-22.pdf
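A sketch with rpart on the train split from earlier:

  library(rpart)

  # CART-style binary classification tree.
  cart <- rpart(type ~ ., data = train, method = "class")

  printcp(cart)             # complexity table, used below for pruning
  plot(cart); text(cart)    # draw the tree with split labels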
Original Tree vs Pruned Tree (figure)
●Pruning has removed 2 nodes from the original tree
Q: Is the mail spam?
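A sketch of cost-complexity pruning with rpart, continuing the tree fitted above:

  # Pick the complexity parameter with the smallest cross-validated error,
  # then prune the tree back to that size.
  best_cp <- cart$cptable[which.min(cart$cptable[, "xerror"]), "CP"]
  pruned  <- prune(cart, cp = best_cp)

  plot(pruned); text(pruned)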
Confusion Matrix
Original Tree:
            Actuals
Prediction  Not Spam  Spam
Not Spam    277       35
Spam        34        154

Pruned Tree:
            Actuals
Prediction  Not Spam  Spam
Not Spam    283       39
Spam        28        150
Random Forests- Packages
randomForest [pkg: randomForest]
•CART implementation
•Biased towards continuous variables and variables with many categories
cforest [pkg: party]
•Unbiased conditional inference trees (sketched below)
http://www.statistik.uni-dortmund.de/useR-2008/slides/Strobl+Zeileis.pdf
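A sketch of the unbiased alternative with party's cforest (train split as before):

  library(party)

  # Conditional inference forest: split selection is not biased towards
  # continuous variables or variables with many categories.
  cf <- cforest(type ~ ., data = train,
                controls = cforest_unbiased(ntree = 500, mtry = 7))
  varimp(cf)   # permutation importance from conditional inference trees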
Support Vector Machines
•Hyperplane: the line (decision boundary) separating the classes
•Support vectors: the points of each class closest to the boundary; the
classifier maximizes the distance (margin) to them
•Backed by a large body of mathematical theory
•R library: e1071
http://cran.r-project.org/web/packages/e1071/vignettes/svmdoc.pdf
https://www.statsoft.com/textbook/support-vector-machines
•Kernels: transform the objects into a feature space where the classes
become separable
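A sketch with e1071 using a radial (RBF) kernel, on the same train/test split assumed throughout:

  library(e1071)

  # The RBF kernel implicitly maps the word frequencies into a
  # higher-dimensional space where a separating hyperplane is sought.
  sv <- svm(type ~ ., data = train, kernel = "radial", cost = 1)

  pred <- predict(sv, newdata = test)
  mean(pred == test$type)   # test-set accuracy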
Wow!!!
•98% accuracy on the spam dataset
http://www.d.umn.edu/math/Technical%20Reports/Technical%20Reports%202007-/TR%202007-2008/TR_2008_3.pdf