has shot through the roof in recent years, totaling over 13000 packages. Many of these implement novel statistical / machine learning models. It is easy to port / link to other applications. R doesn't try to do everything itself. If you prefer models implemented in C, C++, tensorflow , keras , python , stan , or Weka , you can access these applications without leaving R.1 R packages are built by people who do data analysis. This is in contrast to computer science. R provides the tooling to allow user experience to be a big focus of a package. The S language is very mature. S is the precursor to R, and R inherits some of the statistical routines found in S. Why use R for modeling? [1] In fact, this is actually one of the core principles for R. It is an interface to other languages. 4 / 42
If a high performance deployment is required, R can be treated like a prototyping language. Memory R is mostly memory-bound. There are plenty of exceptions to this though. Consistency among interfaces For example: There are two methods for specifying what terms are in a model1. Not all models have both. 99% of model functions automatically generate dummy variables. But damn the 1% that don't! Why not to use R for modeling? [1] There are now three but the last one is brand new and won't be discussed. But see recipes for info. 5 / 42
glm stats predict(obj, type = "response") gbm gbm predict(obj, type = "response", n.trees) mda mda predict(obj, type = "posterior") rpart rpart predict(obj, type = "prob") Weka RWeka predict(obj, type = "probability") logitboost LogitBoost predict(obj, type = "raw", nIter) Sigh. How has this been tackled in the past? 6 / 42
syntax and is perhaps the most untidy code that you will encounter. caret is the primary package for untidy predictive modeling. It wraps all of those models on the previous page. 1. More traditional R coding style. 2. High-level "I'll do that for you" syntax. 3. More comprehensive (for now) and less modular. caret can do everything from feature engineering to hyperparameter optimization. 4. Contains many optimizations and is easily parallelized. A tale of two philosophies 7 / 42
syntax and is perhaps the most untidy code that you will encounter. caret is the primary package for untidy predictive modeling. It wraps all of those models on the previous page. 1. More traditional R coding style. 2. High-level "I'll do that for you" syntax. 3. More comprehensive (for now) and less modular. caret can do everything from feature engineering to hyperparameter optimization. 4. Contains many optimizations and is easily parallelized. The modular approach The tidy modeling approach espouses the tenets of the tidyverse: 1. Reuse existing data structures. 2. Compose simple functions with the pipe. 3. Embrace functional programming. 4. Design for humans. This approach is exemplified by a new set of tidy modeling packages... A tale of two philosophies 7 / 42
2,930 properties in the data. The Sale Price was recorded along with 81 predictors, including: Location (e.g. neighborhood) and lot information. House components (garage, fireplace, pool, porch, etc.). General assessments such as overall quality and condition. Number of bedrooms, baths And so on... More details can be found in De Cock (2011, Journal of Statistics Education). The raw data are at http: bit.ly/2whgsQM but we will use a processed version found in the AmesHousing package. Data set - House prices The goal will be to explore new tools that allow us to model sale price as a function of these features. 9 / 42
performance on the data used to fit the model. Find the best model and carry that into production. Why not? Overestimate model performance You create a model highly specific to X and then test on X and get good performance. Is this realistic? No checks against overfitting Linear regression with p variables, and p data points. . Doesn't generalize! Baby steps R2 = 1 13 / 42
data sets: Training Set: these data are used to estimate model parameters and to pick the values of the complexity parameter(s) for the model. Test Set: these data can be used to get an independent assessment of model efficacy. They should not be used during model training. 14 / 42
data sets: Training Set: these data are used to estimate model parameters and to pick the values of the complexity parameter(s) for the model. Test Set: these data can be used to get an independent assessment of model efficacy. They should not be used during model training. The more data we spend, the better estimates we'll get (provided the data is accurate). Given a fixed amount of data: too much spent in training won't allow us to get a good assessment of predictive performance. We may find a model that fits the training data very well, but is not generalizable (overfitting) too much spent in testing won't allow us to get a good assessment of model parameters 14 / 42
do the split: simple random sampling, stratified sampling based on the outcome, by date, or methods that focus on the distribution of the predictors. For stratification: Classification: this would mean sampling within the classes to preserve the distribution of the outcome in the training and test sets Regression: determine the quartiles of the data set and samples within those artificial groups 15 / 42
to the housing data, the model terms must be specified. Historically, there are two main interfaces for doing this. The formula interface using R formula rules to specify a symbolic representation of the terms and variables. For example: Specifying variables and interactions: model_function(Sale_Price ~ Neighborhood + Year_Sold + Neighborhood:Year_Sold, data = ames_train) Specifying all variables on the right hand side: model_function(Sale_Price ~ ., data = ames_train) Using special functions directly on your data: model_function(log10(Sale_Price) ~ ns(Longitude, df = 3) + ns(Latitude, df = 3), data = ames_train) This is very convenient but it has some disadvantages. 20 / 42
~ pca(scale(x1), scale(x2), scale(x3)), data = dat) . For very wide data sets, the formula method can be extremely inefficient. Specifying multivariate outcomes is awkward. Not all model functions have a formula method (that consistency issue again). 21 / 42
interface. This usually has arguments for the predictors and the outcome(s): predictors c("Year_Sold", "Longitude", "Latitude") modeling_function( x = ames_train[, predictors], y = ames_train$Sale_Price ) This is inconvenient if you have transformations, factor variables, interactions, or any other operations to apply to the data prior to modeling. 22 / 42
a package has one or both of these interfaces. For example, lm() only has formulas. There is a third interface, using recipes that we will quickly discuss, but know that it attempts to solve most of these issues and more. 23 / 42
Add the data and formula / xy terms and fit. spec_lin_reg linear_reg() # or glmnet, or Spark, or Stan, or Keras spec_lm set_engine(spec_lin_reg, "lm") spec_lm Linear Regression Model Specification (regression) Computational engine: lm parsnip model fitting 27 / 42
Add the data and formula / xy terms and fit. spec_lin_reg linear_reg() # or glmnet, or Spark, or Stan, or Keras spec_lm set_engine(spec_lin_reg, "lm") spec_lm Linear Regression Model Specification (regression) Computational engine: lm ames_train_log ames_train %>% mutate(Sale_Price_Log = log10(Sale_Price)) fit_lm fit( object = spec_lm, # No "special" terms in the formula. Use recipes! formula = Sale_Price_Log ~ Latitude + Longitude, data = ames_train_log ) fit_lm parsnip model object Call: stats lm(formula = formula, data = data) Coefficients: (Intercept) Latitude Longitude -316.153 3.014 -2.079 parsnip model fitting 27 / 42
use. Remember, lm() only allows the formula interface, but parsnip 's interface let's you use either one. fit_xy( spec_lm, # same model specification y = ames_train_log$Sale_Price_Log, x = ames_train_log[, c("Latitude", "Longitude")] ) parsnip model object Call: stats lm(formula = formula, data = data) Coefficients: (Intercept) Latitude Longitude -316.153 3.014 -2.079 28 / 42
fit_lm$fit . If you use the summary() method on the underlying lm object, the bottom shows some statistics: summary(fit_lm$fit) < > Residual standard error: 0.1614 on 2196 degrees of freedom Multiple R-squared: 0.1808, Adjusted R-squared: 0.1801 F-statistic: 242.3 on 2 and 2196 DF, p value: < 2.2e-16 These statistics are generated from predicting on the training data used to fit the model. This is problematic because it can lead to optimistic results, especially for models that are extremely flexible (overfitting). 31 / 42
fit_lm$fit . If you use the summary() method on the underlying lm object, the bottom shows some statistics: summary(fit_lm$fit) < > Residual standard error: 0.1614 on 2196 degrees of freedom Multiple R-squared: 0.1808, Adjusted R-squared: 0.1801 F-statistic: 242.3 on 2 and 2196 DF, p value: < 2.2e-16 These statistics are generated from predicting on the training data used to fit the model. This is problematic because it can lead to optimistic results, especially for models that are extremely flexible (overfitting). Idea! The test set is used for assessing performance. Should we predict on the test set and use those results to estimate these statistics? 31 / 42
when you have one or two models that are your favorite. So... 1) For model A, fit on training set, predict on training set 2) For model B, fit on training set, predict on training set 3) Compare performance For some models, it is possible to get very "good" performance by predicting the training set (it was so flexible you overfit it). That's an issue since we will need to make "honest" comparisons between models before we finalize them and run our final choices on the test set. 33 / 42
when you have one or two models that are your favorite. So... 1) For model A, fit on training set, predict on training set 2) For model B, fit on training set, predict on training set 3) Compare performance For some models, it is possible to get very "good" performance by predicting the training set (it was so flexible you overfit it). That's an issue since we will need to make "honest" comparisons between models before we finalize them and run our final choices on the test set. If only... If only we had a method for getting honest performance estimates from the training set... 33 / 42
the training set. They attempt to simulate slightly different versions of the training set. These versions of the original are split into two model subsets: The analysis set is used to fit the model (analogous to the training set). The assessment set is used for performance evaluation. This process is repeated many times. There are different flavors of resampling but we will focus on just one. All Data (2000) Testing (200) Training (1800) Resample 1 (600) Resample 2 (600) Resample 3 (600) Analysis (500) Assessment (100) Analysis (500) Assessment (100) Analysis (500) Assessment (100) Back to rsample! 34 / 42
blocks of roughly equal size. We leave out the first block of analysis data and fit a model. This model is used to predict the held-out block of assessment data. We continue this process until we've predicted all V assessment blocks The final performance is based on the hold-out predictions by averaging the statistics from the V blocks. V is usually taken to be 5 or 10 and leave one out cross- validation has each sample as a block. V-Fold Cross-Validation [1] Image from: https://www.kaggle.com/dansbecker/cross- validation 35 / 42
fit( object = spec_lm, formula = Sale_Price_Log ~ Latitude + Longitude, data = analysis(split) # pull out training set ) } # For each resample, call fit_model() cv_splits cv_splits %>% mutate(models = map(splits, fit_model)) cv_splits # 10-fold cross validation using stratification # A tibble: 10 x 3 splits id models * <list> <chr> <list> 1 <split [2K/222]> Fold01 <fit[+]> 2 <split [2K/222]> Fold02 <fit[+]> 3 <split [2K/222]> Fold03 <fit[+]> 4 <split [2K/222]> Fold04 <fit[+]> 5 <split [2K/222]> Fold05 <fit[+]> 6 <split [2K/219]> Fold06 <fit[+]> 7 <split [2K/219]> Fold07 <fit[+]> 8 <split [2K/217]> Fold08 <fit[+]> 9 <split [2K/217]> Fold09 <fit[+]> 10 <split [2K/217]> Fold10 <fit[+]> Resampling the linear model Working with resample tibbles generally involves two things: 1) Small functions that perform an action on a single split. 2) The purrr package for map() ping over splits. Result: A nice self-contained modeling object! 37 / 42
cv_splits %>% mutate( assess = map(splits, assessment), preds = map2(assess, models, predict_split) ) select(cv_splits, id, assess, preds) # 10-fold cross validation using stratification # A tibble: 10 x 3 id assess preds * <chr> <list> <list> 1 Fold01 <tibble [222 × 82]> <tibble [222 × 1]> 2 Fold02 <tibble [222 × 82]> <tibble [222 × 1]> 3 Fold03 <tibble [222 × 82]> <tibble [222 × 1]> 4 Fold04 <tibble [222 × 82]> <tibble [222 × 1]> 5 Fold05 <tibble [222 × 82]> <tibble [222 × 1]> 6 Fold06 <tibble [219 × 82]> <tibble [219 × 1]> 7 Fold07 <tibble [219 × 82]> <tibble [219 × 1]> 8 Fold08 <tibble [217 × 82]> <tibble [217 × 1]> 9 Fold09 <tibble [217 × 82]> <tibble [217 × 1]> 10 Fold10 <tibble [217 × 82]> <tibble [217 × 1]> Resampled predictions At this point, we have fit models to each analysis set, and we can now make predictions for each resample using the corresponding assessment set. The same map() approach applies. We also use map2() to map over 2 things at once: the assessment set and the model. Both of these are used together to create the predictions. 38 / 42
Sale_Price_Log) print(cv_preds, n = 5) # A tibble: 2,199 x 3 # Groups: id [10] id .pred Sale_Price_Log <chr> <dbl> <dbl> 1 Fold01 5.28 5.37 2 Fold01 5.26 5.33 3 Fold01 5.24 4.98 4 Fold01 5.30 5.30 5 Fold01 5.30 5.55 # with 2,194 more rows cv_preds %>% rmse(Sale_Price_Log, .pred) # A tibble: 10 x 4 id .metric .estimator .estimate <chr> <chr> <chr> <dbl> 1 Fold01 rmse standard 0.149 2 Fold02 rmse standard 0.163 3 Fold03 rmse standard 0.160 4 Fold04 rmse standard 0.158 5 Fold05 rmse standard 0.161 6 Fold06 rmse standard 0.156 7 Fold07 rmse standard 0.179 8 Fold08 rmse standard 0.159 9 Fold09 rmse standard 0.168 10 Fold10 rmse standard 0.158 Resampled performance yardstick is a package that contains functions for calculating a large number of performance metrics. All functions are of the form metric_name(data, truth, estimate) . 39 / 42
we answer this question? Compare to other models. Maybe knn could be used? "Neighbors" makes sense when considering house prices! Compare resample 1 of lm to resample 1 of knn , then compare the overall rmse . 41 / 42
we answer this question? Compare to other models. Maybe knn could be used? "Neighbors" makes sense when considering house prices! Compare resample 1 of lm to resample 1 of knn , then compare the overall rmse . Testing set We still haven't used this! Once you've chosen your final model: Refit the model on the entire training set (no resamples). Predict on the testing set. Evaluate, this should simply confirm your existing beliefs by now. 41 / 42