Logistic regression is a classification model.
•It models the relationship between a dependent variable and one or more independent variables, and allows us to look at the fit of the model as well as at the significance of the relationships.
•Logistic regression estimates the probability of an event occurring.
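A minimal sketch of fitting such a model in R with glm; the data frame spam_train, the test set spam_test and the response column type are placeholder names assumed here, not given in the slides:

    # Logistic regression: estimate P(spam) from the word-frequency predictors
    fit <- glm(type ~ ., data = spam_train, family = binomial(link = "logit"))
    summary(fit)     # coefficients and significance of the relationships
    # estimated probabilities for new e-mails
    p_hat <- predict(fit, newdata = spam_test, type = "response")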
Random forests build on Breiman's bagging of trees.
•In 1999 Leo Breiman published a complete paper on random forests.
•Adele Cutler and Breiman developed algorithms for random forests: http://www.stat.berkeley.edu/~breiman/RandomForests/
Random forests are an ensemble method for classification (and regression) that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes output by the individual trees.
•Each tree is grown at least partially at random.
•Randomness is injected by growing each tree on a different random subsample of the training data.
•Randomness is also injected into the split selection process, so that the splitter at any node is determined partly at random.
•Resistant to overtraining (overfitting) – generalizes well to new data.
•Trains rapidly, even with thousands of potential predictors.
•No need for prior feature (variable) selection.
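A minimal sketch using the randomForest package, which implements Breiman and Cutler's algorithm (the package itself is not named in the slides; the data objects are the same placeholders as above):

    library(randomForest)
    # each tree is grown on a random subsample (bootstrap sample) of the training data;
    # at each node only 'mtry' randomly chosen predictors compete for the split
    rf <- randomForest(type ~ ., data = spam_train,
                       ntree = 500, importance = TRUE)
    print(rf)            # out-of-bag error estimate
    varImpPlot(rf)       # variable importance, no prior feature selection needed
    rf_pred <- predict(rf, newdata = spam_test)   # majority vote across the trees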
•Response: discrete/categorical
•Predictors: frequency of words (continuous)
Solutions:
•Logistic regression
•Trees (CART, random forests)
•Support vector machines
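One concrete data set with exactly this structure is the spam data shipped with the kernlab package (an assumption – the slides do not say which spam data were used); the split into spam_train and spam_test below is likewise only illustrative:

    library(kernlab)
    data(spam)            # 4601 e-mails: 57 word/character frequencies + factor 'type'
    str(spam$type)        # categorical response: "nonspam" / "spam"
    # hold out a test set for the comparisons that follow
    set.seed(1)
    idx        <- sample(nrow(spam), round(0.7 * nrow(spam)))
    spam_train <- spam[idx, ]
    spam_test  <- spam[-idx, ]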
Assessing fit: the model does not directly predict the binary outcome.
•The model gives estimated probabilities.
•R² tries to fit a straight line.
Solution: compare observed and expected numbers.
Other tests:
•Hosmer & Lemeshow: hoslem.test {ResourceSelection}
•ROC / AUC / C-statistic
http://cran.r-project.org/web/packages/ResourceSelection/ResourceSelection.pdf
http://cran.r-project.org/web/packages/hmeasure/vignettes/hmeasure.pdf
http://personalpages.manchester.ac.uk/staff/mark.lunt/stats/7_Binary/text.pdf
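A hedged sketch of these checks in R, using hoslem.test from ResourceSelection and HMeasure from the hmeasure package referenced above; fit, spam_train and spam_test are the placeholder objects from the earlier sketches, and the 0/1 coding of the response is an assumption:

    library(ResourceSelection)
    library(hmeasure)
    obs  <- as.numeric(spam_train$type == "spam")   # observed 0/1 outcome
    phat <- fitted(fit)                             # estimated probabilities
    # Hosmer & Lemeshow: compare observed and expected counts in g groups
    hoslem.test(obs, phat, g = 10)
    # ROC-based summaries (AUC, H-measure) on the held-out test data
    scores <- predict(fit, newdata = spam_test, type = "response")
    HMeasure(as.numeric(spam_test$type == "spam"), scores)$metrics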
Confusion matrices for the three classifiers on the test data (rows = prediction, columns = actual class), with the per-class misclassification rates below each matrix:

Prediction   Not Spam   Spam
Not Spam        298       66
Spam             13      123
Misclassification rate:  6% (Not Spam), 34% (Spam)

Prediction   Not Spam   Spam
Not Spam        294       52
Spam             17      137
Misclassification rate:  5% (Not Spam), 27% (Spam)

Prediction   Not Spam   Spam
Not Spam        285       42
Spam             26      147
Misclassification rate:  8% (Not Spam), 22% (Spam)
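Tables of this form can be produced in R along the following lines; the logistic model fit, the 0.5 cut-off and the object names are assumptions made for illustration:

    # class predictions from estimated probabilities, cut at 0.5
    p_test     <- predict(fit, newdata = spam_test, type = "response")
    pred_class <- factor(ifelse(p_test > 0.5, "spam", "nonspam"),
                         levels = levels(spam_test$type))
    tab <- table(Prediction = pred_class, Actual = spam_test$type)
    tab
    1 - diag(tab) / colSums(tab)    # per-class (column-wise) misclassification rate
    1 - sum(diag(tab)) / sum(tab)   # overall misclassification rate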
Tree-based methods combine ideas from supervised and unsupervised learning:
•Fitting a local regression model to each segment of the data.
•Using recursive partitioning to identify the segments of the data.
•Trees make a decision at each node.
•They can be used for point estimation and for distribution estimation (maximum likelihood estimate of the class probability distribution).
R packages:
•tree: based on the S-PLUS routine by Clark and Pregibon
•rpart: implementation of Breiman, Friedman, Olshen and Stone (CART)
http://www.stat.cmu.edu/~cshalizi/350/lectures/22/lecture-22.pdf
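A minimal rpart sketch, using the same placeholder data objects as above:

    library(rpart)
    # recursive partitioning: at each node the split that best separates the classes is chosen
    tree_fit <- rpart(type ~ ., data = spam_train, method = "class")
    printcp(tree_fit)                      # complexity-parameter table, useful for pruning
    plot(tree_fit); text(tree_fit, use.n = TRUE)
    # prediction = majority class of the terminal node an observation falls into
    tree_pred <- predict(tree_fit, newdata = spam_test, type = "class")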
•Standard random forest split and importance measures are biased towards continuous variables and variables with many categories.
•cforest {party}: unbiased conditional inference trees.
http://www.statistik.uni-dortmund.de/useR-2008/slides/Strobl+Zeileis.pdf
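A sketch of a conditional inference forest with the party package (parameter values are arbitrary choices, data objects the placeholders from above):

    library(party)
    # splits are chosen via permutation tests, which avoids the bias towards
    # continuous predictors and predictors with many categories
    cf <- cforest(type ~ ., data = spam_train,
                  controls = cforest_unbiased(ntree = 500, mtry = 5))
    cf_pred <- predict(cf, newdata = spam_test)
    vi <- varimp(cf)        # varimp(cf, conditional = TRUE) gives the conditional version (slower)
    head(sort(vi, decreasing = TRUE))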
•An SVM finds the hyperplane that separates the classes (the decision boundary).
•Support vectors: the closest points of the classes; the distance (margin) to them is maximized.
•There is a lot of mathematical theory behind SVMs.
•Kernels: transform the objects into a feature space where the classes become easier to separate.
•R library: e1071
http://cran.r-project.org/web/packages/e1071/vignettes/svmdoc.pdf
https://www.statsoft.com/textbook/support-vector-machines
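A minimal e1071 sketch; the radial kernel and the cost value are arbitrary choices, and the data objects are the placeholders used above:

    library(e1071)
    # the kernel maps the objects into a feature space; the fitted boundary
    # maximizes the margin to the closest points of each class (the support vectors)
    svm_fit <- svm(type ~ ., data = spam_train, kernel = "radial", cost = 1)
    summary(svm_fit)                          # number of support vectors, etc.
    svm_pred <- predict(svm_fit, newdata = spam_test)
    table(Prediction = svm_pred, Actual = spam_test$type)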