
Modeling in the tidyverse

Davis Vaughan
November 27, 2018

Transcript

  1. Davis Vaughan Software engineer @ RStudio Modeling with Max Kuhn

    tidymodels Quantitative finance Master's @ UNC Charlotte Obsessed with creating tools that make your life easier Who am I? 2 / 42
  2. Part 1 - Modeling parsnip - Building model specifications rsample

    - Resampling techniques yardstick - Performance evaluation Agenda: 3 / 42
  3. Part 1 - Modeling parsnip - Building model specifications rsample

    - Resampling techniques yardstick - Performance evaluation Part 2 - Something new Arrays and matrices in R Agenda: 3 / 42
  4. R has cutting edge models. The number of R packages

    has shot through the roof in recent years, totaling over 13000 packages. Many of these implement novel statistical / machine learning models. It is easy to port / link to other applications. R doesn't try to do everything itself. If you prefer models implemented in C, C++, tensorflow , keras , python , stan , or Weka , you can access these applications without leaving R.1 R packages are built by people who do data analysis. This is in contrast to computer science. R provides the tooling to allow user experience to be a big focus of a package. The S language is very mature. S is the precursor to R, and R inherits some of the statistical routines found in S. Why use R for modeling? [1] In fact, this is actually one of the core principles for R. It is an interface to other languages. 4 / 42
  5. Performance: R is a data analysis language; it's not C++.

    If a high-performance deployment is required, R can be treated like a prototyping language. Memory: R is mostly memory-bound. There are plenty of exceptions to this, though. Consistency among interfaces: For example, there are two methods for specifying what terms are in a model1. Not all models have both. 99% of model functions automatically generate dummy variables. But damn the 1% that don't! Why not to use R for modeling? [1] There are now three but the last one is brand new and won't be discussed. But see recipes for info. 5 / 42
  6. The problem of consistency

    Function     Package      Code
    lda          MASS         predict(obj)
    glm          stats        predict(obj, type = "response")
    gbm          gbm          predict(obj, type = "response", n.trees)
    mda          mda          predict(obj, type = "posterior")
    rpart        rpart        predict(obj, type = "prob")
    Weka         RWeka        predict(obj, type = "probability")
    logitboost   LogitBoost   predict(obj, type = "raw", nIter)
    6 / 42
  7. The problem of consistency

    Function     Package      Code
    lda          MASS         predict(obj)
    glm          stats        predict(obj, type = "response")
    gbm          gbm          predict(obj, type = "response", n.trees)
    mda          mda          predict(obj, type = "posterior")
    rpart        rpart        predict(obj, type = "prob")
    Weka         RWeka        predict(obj, type = "probability")
    logitboost   LogitBoost   predict(obj, type = "raw", nIter)
    Sigh. 6 / 42
  8. The problem of consistency

    Function     Package      Code
    lda          MASS         predict(obj)
    glm          stats        predict(obj, type = "response")
    gbm          gbm          predict(obj, type = "response", n.trees)
    mda          mda          predict(obj, type = "posterior")
    rpart        rpart        predict(obj, type = "prob")
    Weka         RWeka        predict(obj, type = "probability")
    logitboost   LogitBoost   predict(obj, type = "raw", nIter)
    Sigh. How has this been tackled in the past? 6 / 42
  9. The one stop shop The more traditional approach uses high-level

    syntax and is perhaps the most untidy code that you will encounter. caret is the primary package for untidy predictive modeling. It wraps all of those models on the previous page. 1. More traditional R coding style. 2. High-level "I'll do that for you" syntax. 3. More comprehensive (for now) and less modular. caret can do everything from feature engineering to hyperparameter optimization. 4. Contains many optimizations and is easily parallelized. A tale of two philosophies 7 / 42
  10. The one stop shop The more traditional approach uses high-level

    syntax and is perhaps the most untidy code that you will encounter. caret is the primary package for untidy predictive modeling. It wraps all of those models on the previous page. 1. More traditional R coding style. 2. High-level "I'll do that for you" syntax. 3. More comprehensive (for now) and less modular. caret can do everything from feature engineering to hyperparameter optimization. 4. Contains many optimizations and is easily parallelized. The modular approach The tidy modeling approach espouses the tenets of the tidyverse: 1. Reuse existing data structures. 2. Compose simple functions with the pipe. 3. Embrace functional programming. 4. Design for humans. This approach is exemplified by a new set of tidy modeling packages... A tale of two philosophies 7 / 42
  11. tidymodels ecosystem library(tidymodels) ✔ broom 0.5.0.9001 ✔ purrr 0.2.5.9000 ✔

    dials 0.0.1.9000 ✔ recipes 0.1.4.9000 ✔ dplyr 0.7.99.9000 ✔ rsample 0.0.3 ✔ ggplot2 3.1.0 ✔ tibble 1.4.2 ✔ infer 0.4.0 ✔ yardstick 0.0.2 ✔ parsnip 0.0.1 Plus tidypredict, tidyposterior, tidytext, and more in development. We will explore 3 of these packages today, but use many others along the way. ── Attaching packages ───────────────────────── tidymo ── Conflicts ──────────────────────── tidymodels_c ✖ purrr::discard() masks scales::discard() ✖ rsample::fill() masks tidyr::fill() ✖ dplyr::filter() masks stats::filter() ✖ dplyr::lag() masks stats::lag() ✖ recipes::step() masks stats::step() 8 / 42
  12. We will use the Ames IA housing data. There are

    2,930 properties in the data. The Sale Price was recorded along with 81 predictors, including: Location (e.g. neighborhood) and lot information. House components (garage, fireplace, pool, porch, etc.). General assessments such as overall quality and condition. Number of bedrooms, baths, and so on... More details can be found in De Cock (2011, Journal of Statistics Education). The raw data are at http://bit.ly/2whgsQM but we will use a processed version found in the AmesHousing package. Data set - House prices The goal will be to explore new tools that allow us to model sale price as a function of these features. 9 / 42
  13. Visualize - Colored by neighborhood [Interactive Leaflet map; map tiles by Stamen Design, CC BY 3.0 — map data © OpenStreetMap] 10 / 42
  14. Hmm.. Repeatedly fit models on the whole data set. Estimate

    performance on the data used to fit the model. Find the best model and carry that into production. Baby steps 13 / 42
  15. Hmm.. Repeatedly fit models on the whole data set. Estimate

    performance on the data used to fit the model. Find the best model and carry that into production. Baby steps 13 / 42
  16. Hmm.. Repeatedly fit models on the whole data set. Estimate

    performance on the data used to fit the model. Find the best model and carry that into production. Why not? Overestimate model performance: You create a model highly specific to X and then test on X and get good performance. Is this realistic? No checks against overfitting: Linear regression with p variables and p data points gives R² = 1. Doesn't generalize! Baby steps 13 / 42
  17. Data splitting We typically split data into training and test

    data sets: Training Set: these data are used to estimate model parameters and to pick the values of the complexity parameter(s) for the model. Test Set: these data can be used to get an independent assessment of model efficacy. They should not be used during model training. 14 / 42
  18. Data splitting We typically split data into training and test

    data sets: Training Set: these data are used to estimate model parameters and to pick the values of the complexity parameter(s) for the model. Test Set: these data can be used to get an independent assessment of model efficacy. They should not be used during model training. The more data we spend, the better estimates we'll get (provided the data is accurate). Given a fixed amount of data: too much spent in training won't allow us to get a good assessment of predictive performance. We may find a model that fits the training data very well, but is not generalizable (overfitting) too much spent in testing won't allow us to get a good assessment of model parameters 14 / 42
  19. Mechanics of splitting There are a few different ways to

    do the split: simple random sampling, stratified sampling based on the outcome, by date, or methods that focus on the distribution of the predictors. For stratification: Classification: this would mean sampling within the classes to preserve the distribution of the outcome in the training and test sets. Regression: determine the quartiles of the data set and sample within those artificial groups. 15 / 42
  20. Ames Housing Data Let's load the example data set and

    split it. We'll put 75% into training and 25% into testing. library(AmesHousing) library(rsample) ames <- make_ames() nrow(ames) [1] 2930 set.seed(4595) data_split <- initial_split(ames, strata = "Sale_Price", prop = .75) ames_train <- training(data_split) ames_test <- testing(data_split) nrow(ames_train)/nrow(ames) [1] 0.7505119 16 / 42
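As a quick numeric check on the stratification (a sketch not in the deck; the slides do this graphically with density curves a few slides later), the outcome quantiles of the two sets should land close together:

    # Stratified sampling should keep the Sale_Price distribution similar
    # in the training and test sets
    quantile(ames_train$Sale_Price, probs = c(0.25, 0.50, 0.75))
    quantile(ames_test$Sale_Price,  probs = c(0.25, 0.50, 0.75))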
  21. Detour What are these split objects? # result of initial_split()

    # <training / testing / total> data_split <2199/731/2930> training(data_split) # A tibble: 2,199 x 81 MS_SubClass MS_Zoning Lot_Frontage Lot_Area Street Alley Lot_Shape <fct> <fct> <dbl> <int> <fct> <fct> <fct> 1 One_Story_… Resident… 141 31770 Pave No_A… Slightly… 2 Two_Story_… Resident… 74 13830 Pave No_A… Slightly… 3 Two_Story_… Resident… 78 9978 Pave No_A… Slightly… 4 One_Story_… Resident… 43 5005 Pave No_A… Slightly… 5 One_Story_… Resident… 39 5389 Pave No_A… Slightly… 6 Two_Story_… Resident… 60 7500 Pave No_A… Regular 7 Two_Story_… Resident… 75 10000 Pave No_A… Slightly… 8 One_Story_… Resident… 85 10176 Pave No_A… Regular 9 One_Story_… Resident… 0 6820 Pave No_A… Slightly… 17 / 42
  22. ggplot(ames_train, aes(x = Sale_Price)) + geom_line(stat = "density", trim =

    TRUE) + geom_line(data = ames_test, stat = "density", trim = TRUE, col = "red") Distribution check 18 / 42
  23. Specifying models in R using formulas To fit a model

    to the housing data, the model terms must be specified. Historically, there are two main interfaces for doing this. The formula interface using R formula rules to specify a symbolic representation of the terms and variables. For example: Specifying variables and interactions: model_function(Sale_Price ~ Neighborhood + Year_Sold + Neighborhood:Year_Sold, data = ames_train) Specifying all variables on the right hand side: model_function(Sale_Price ~ ., data = ames_train) Using special functions directly on your data: model_function(log10(Sale_Price) ~ ns(Longitude, df = 3) + ns(Latitude, df = 3), data = ames_train) This is very convenient but it has some disadvantages. 20 / 42
  24. Downsides to formulas You can't nest functions such as foo(y

    ~ pca(scale(x1), scale(x2), scale(x3)), data = dat) . For very wide data sets, the formula method can be extremely inefficient. Specifying multivariate outcomes is awkward. Not all model functions have a formula method (that consistency issue again). 21 / 42
  25. Specifying models without formulas Some modeling functions have an "xy"

    interface. This usually has arguments for the predictors and the outcome(s): predictors <- c("Year_Sold", "Longitude", "Latitude") modeling_function( x = ames_train[, predictors], y = ames_train$Sale_Price ) This is inconvenient if you have transformations, factor variables, interactions, or any other operations to apply to the data prior to modeling. 22 / 42
  26. Model specification woes Overall, it is difficult to predict if

    a package has one or both of these interfaces. For example, lm() only has formulas. There is a third interface, using recipes that we will quickly discuss, but know that it attempts to solve most of these issues and more. 23 / 42
  27. # A tibble: 3 x 4 Sale_Price Alley Central_Air Year_Sold

    <int> <fct> <fct> <int> 1 215000 No_Alley_Access Y 2010 2 105000 No_Alley_Access Y 2010 3 172000 No_Alley_Access Y 2010 unique(ames$Alley) [1] No_Alley_Access Paved Gravel Levels: Gravel No_Alley_Access Paved rec <- recipe( Sale_Price ~ Alley + Central_Air + Year_Sold, data = ames_train ) %>% step_log(Sale_Price) %>% step_dummy(Alley, Central_Air) %>% step_interact(~ starts_with("Alley_"):Year_Sold) Data Recipe Inputs: role #variables outcome 1 predictor 3 Operations: Log transformation on Sale_Price Dummy variables from Alley, Central_Air Interactions with starts_with("Alley_"):Year_Sold recipes ames %>% select(Sale_Price, Alley, Central_Air, Year_Sold) %>% head(3) 24 / 42
  28. Prepare your recipe! prepped <- prep(rec, training = ames_train, retain =

    TRUE) juice(prepped) # A tibble: 2,199 x 7 Year_Sold Sale_Price Alley_No_Alley_Access Alley_Paved Central_Air_Y <int> <dbl> <dbl> <dbl> <dbl> 1 2010 12.3 1 0 1 2 2010 12.2 1 0 1 Alley_No_Alley_Access_x_Year_Sold Alley_Paved_x_Year_Sold <dbl> <dbl> 1 2010 0 2 2010 0 # with 2,197 more rows 25 / 42
  29. Bake it on new data! ames_test_baked <- bake(prepped, new_data = ames_test)

    ames_test_baked # A tibble: 731 x 7 Year_Sold Sale_Price Alley_No_Alley_Access Alley_Paved Central_Air_Y <int> <dbl> <dbl> <dbl> <dbl> 1 2010 11.6 1 0 1 2 2010 12.1 1 0 1 Alley_No_Alley_Access_x_Year_Sold Alley_Paved_x_Year_Sold <dbl> <dbl> 1 2010 0 2 2010 0 # with 729 more rows 26 / 42
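The deck hands preprocessing like this to parsnip next, but as a minimal sketch (not from the slides), the engineered data could also be passed straight to any modeling function, since every step has already been applied:

    # The dummy variables, log transform, and interactions are already in the
    # juiced data, so a plain formula with no special terms is enough
    baked_fit <- lm(Sale_Price ~ ., data = juice(prepped))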
  30. Three steps: Create a lightweight model specification. Set an "engine".

    Add the data and formula / xy terms and fit. spec_lin_reg <- linear_reg() # or glmnet, or Spark, or Stan, or Keras spec_lm <- set_engine(spec_lin_reg, "lm") spec_lm Linear Regression Model Specification (regression) Computational engine: lm parsnip model fitting 27 / 42
  31. Three steps: Create a lightweight model specification. Set an "engine".

    Add the data and formula / xy terms and fit. spec_lin_reg <- linear_reg() # or glmnet, or Spark, or Stan, or Keras spec_lm <- set_engine(spec_lin_reg, "lm") spec_lm Linear Regression Model Specification (regression) Computational engine: lm ames_train_log <- ames_train %>% mutate(Sale_Price_Log = log10(Sale_Price)) fit_lm <- fit( object = spec_lm, # No "special" terms in the formula. Use recipes! formula = Sale_Price_Log ~ Latitude + Longitude, data = ames_train_log ) fit_lm parsnip model object Call: stats::lm(formula = formula, data = data) Coefficients: (Intercept) Latitude Longitude -316.153 3.014 -2.079 parsnip model fitting 27 / 42
  32. XY interface parsnip isn't picky about which interface you should

    use. Remember, lm() only allows the formula interface, but parsnip's interface lets you use either one. fit_xy( spec_lm, # same model specification y = ames_train_log$Sale_Price_Log, x = ames_train_log[, c("Latitude", "Longitude")] ) parsnip model object Call: stats::lm(formula = formula, data = data) Coefficients: (Intercept) Latitude Longitude -316.153 3.014 -2.079 28 / 42
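Prediction is standardized the same way. A small sketch (not on this slide, though the interface is used later with the resamples): whichever interface or engine produced the fit, predict() on a parsnip regression model returns a tibble with a .pred column.

    # ames_test contains the Latitude / Longitude predictors the model needs;
    # the result is a one-column tibble named .pred
    predict(fit_lm, new_data = ames_test)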
  33. Alternative engine spec_stan <- spec_lin_reg %>% # Engine-specific arguments are

    passed through here set_engine("stan", chains = 4, iter = 1000) fit_stan <- fit( object = spec_stan, formula = Sale_Price_Log ~ Latitude + Longitude, data = ames_train_log ) coef(fit_stan$fit) (Intercept) Latitude Longitude -316.311134 3.008426 -2.089167 coef(fit_lm$fit) (Intercept) Latitude Longitude -316.152846 3.013530 -2.079171 29 / 42
  34. Overall model statistics parsnip holds the actual fit object in

    fit_lm$fit. If you use the summary() method on the underlying lm object, the bottom shows some statistics: summary(fit_lm$fit) < > Residual standard error: 0.1614 on 2196 degrees of freedom Multiple R-squared: 0.1808, Adjusted R-squared: 0.1801 F-statistic: 242.3 on 2 and 2196 DF, p-value: < 2.2e-16 These statistics are generated from predicting on the training data used to fit the model. This is problematic because it can lead to optimistic results, especially for models that are extremely flexible (overfitting). 31 / 42
  35. Overall model statistics parsnip holds the actual fit object in

    fit_lm$fit. If you use the summary() method on the underlying lm object, the bottom shows some statistics: summary(fit_lm$fit) < > Residual standard error: 0.1614 on 2196 degrees of freedom Multiple R-squared: 0.1808, Adjusted R-squared: 0.1801 F-statistic: 242.3 on 2 and 2196 DF, p-value: < 2.2e-16 These statistics are generated from predicting on the training data used to fit the model. This is problematic because it can lead to optimistic results, especially for models that are extremely flexible (overfitting). Idea! The test set is used for assessing performance. Should we predict on the test set and use those results to estimate these statistics? 31 / 42
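Since broom is attached with tidymodels, a tidier way to pull the same training-set statistics is available (an aside, not shown in the deck):

    # One-row tibble of model-level statistics: R-squared, sigma, F statistic, p-value, ...
    glance(fit_lm$fit)

    # Per-coefficient estimates, standard errors, and t statistics
    tidy(fit_lm$fit)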
  36. Assessing Models Save the test set until the very end

    when you have one or two models that are your favorite. 33 / 42
  37. Assessing Models Save the test set until the very end

    when you have one or two models that are your favorite. So... 1) For model A, fit on training set, predict on training set 2) For model B, fit on training set, predict on training set 3) Compare performance For some models, it is possible to get very "good" performance by predicting the training set (it was so flexible you overfit it). That's an issue since we will need to make "honest" comparisons between models before we finalize them and run our final choices on the test set. 33 / 42
  38. Assessing Models Save the test set until the very end

    when you have one or two models that are your favorite. So... 1) For model A, fit on training set, predict on training set 2) For model B, fit on training set, predict on training set 3) Compare performance For some models, it is possible to get very "good" performance by predicting the training set (it was so flexible you overfit it). That's an issue since we will need to make "honest" comparisons between models before we finalize them and run our final choices on the test set. If only... If only we had a method for getting honest performance estimates from the training set... 33 / 42
  39. These are additional data splitting schemes that are applied to

    the training set. They attempt to simulate slightly different versions of the training set. These versions of the original are split into two model subsets: The analysis set is used to fit the model (analogous to the training set). The assessment set is used for performance evaluation. This process is repeated many times. There are different flavors of resampling but we will focus on just one. [Diagram: All Data (2000) splits into Training (1800) and Testing (200); the training set is resampled into Resample 1, 2, 3 (600 each), and each resample splits into Analysis (500) and Assessment (100).] Back to rsample! 34 / 42
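For reference, other flavors live in rsample behind functions that return the same kind of split objects; a minimal sketch (only V-fold cross-validation is used in the deck):

    # Bootstrap resampling: analysis sets drawn with replacement from the training set
    boots <- bootstraps(ames_train, times = 25)

    # Monte Carlo cross-validation: repeated random analysis / assessment splits
    mc <- mc_cv(ames_train, prop = 0.75, times = 25)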
  40. Here, we randomly split the training data into V distinct

    blocks of roughly equal size. We leave out the first block of analysis data and fit a model. This model is used to predict the held-out block of assessment data. We continue this process until we've predicted all V assessment blocks. The final performance is based on the hold-out predictions by averaging the statistics from the V blocks. V is usually taken to be 5 or 10, and leave-one-out cross-validation has each sample as a block. V-Fold Cross-Validation [1] Image from: https://www.kaggle.com/dansbecker/cross-validation 35 / 42
  41. set.seed(2453) cv_splits <- vfold_cv( data = ames_train_log, v = 10, strata

    = "Sale_Price" ) cv_splits # 10-fold cross validation using stratification # A tibble: 10 x 2 splits id <list> <chr> 1 <split [2K/222]> Fold01 2 <split [2K/222]> Fold02 3 <split [2K/222]> Fold03 4 <split [2K/222]> Fold04 5 <split [2K/222]> Fold05 6 <split [2K/219]> Fold06 7 <split [2K/219]> Fold07 8 <split [2K/217]> Fold08 9 <split [2K/217]> Fold09 10 <split [2K/217]> Fold10 Each individual split object is similar to the initial_split() example. cv_splits$splits[[1]] <1977/222/2199> cv_splits$splits[[1]] %>% analysis() %>% dim() [1] 1977 82 cv_splits$splits[[1]] %>% assessment() %>% dim() [1] 222 82 Cross-validation using rsample 36 / 42
  42. # Fit on a single analysis resample fit_model <- function(split) {

    fit( object = spec_lm, formula = Sale_Price_Log ~ Latitude + Longitude, data = analysis(split) # pull out training set ) } # For each resample, call fit_model() cv_splits <- cv_splits %>% mutate(models = map(splits, fit_model)) cv_splits # 10-fold cross validation using stratification # A tibble: 10 x 3 splits id models * <list> <chr> <list> 1 <split [2K/222]> Fold01 <fit[+]> 2 <split [2K/222]> Fold02 <fit[+]> 3 <split [2K/222]> Fold03 <fit[+]> 4 <split [2K/222]> Fold04 <fit[+]> 5 <split [2K/222]> Fold05 <fit[+]> 6 <split [2K/219]> Fold06 <fit[+]> 7 <split [2K/219]> Fold07 <fit[+]> 8 <split [2K/217]> Fold08 <fit[+]> 9 <split [2K/217]> Fold09 <fit[+]> 10 <split [2K/217]> Fold10 <fit[+]> Resampling the linear model Working with resample tibbles generally involves two things: 1) Small functions that perform an action on a single split. 2) The purrr package for map()-ing over splits. Result: A nice self-contained modeling object! 37 / 42
  43. predict_split <- function(assess_data, model) { predict(model, new_data = assess_data) } cv_splits <-

    cv_splits %>% mutate( assess = map(splits, assessment), preds = map2(assess, models, predict_split) ) select(cv_splits, id, assess, preds) # 10-fold cross validation using stratification # A tibble: 10 x 3 id assess preds * <chr> <list> <list> 1 Fold01 <tibble [222 × 82]> <tibble [222 × 1]> 2 Fold02 <tibble [222 × 82]> <tibble [222 × 1]> 3 Fold03 <tibble [222 × 82]> <tibble [222 × 1]> 4 Fold04 <tibble [222 × 82]> <tibble [222 × 1]> 5 Fold05 <tibble [222 × 82]> <tibble [222 × 1]> 6 Fold06 <tibble [219 × 82]> <tibble [219 × 1]> 7 Fold07 <tibble [219 × 82]> <tibble [219 × 1]> 8 Fold08 <tibble [217 × 82]> <tibble [217 × 1]> 9 Fold09 <tibble [217 × 82]> <tibble [217 × 1]> 10 Fold10 <tibble [217 × 82]> <tibble [217 × 1]> Resampled predictions At this point, we have fit models to each analysis set, and we can now make predictions for each resample using the corresponding assessment set. The same map() approach applies. We also use map2() to map over 2 things at once: the assessment set and the model. Both of these are used together to create the predictions. 38 / 42
  44. cv_preds <- cv_splits %>% unnest(preds, assess) %>% group_by(id) %>% select(id, .pred,

    Sale_Price_Log) print(cv_preds, n = 5) # A tibble: 2,199 x 3 # Groups: id [10] id .pred Sale_Price_Log <chr> <dbl> <dbl> 1 Fold01 5.28 5.37 2 Fold01 5.26 5.33 3 Fold01 5.24 4.98 4 Fold01 5.30 5.30 5 Fold01 5.30 5.55 # with 2,194 more rows cv_preds %>% rmse(Sale_Price_Log, .pred) # A tibble: 10 x 4 id .metric .estimator .estimate <chr> <chr> <chr> <dbl> 1 Fold01 rmse standard 0.149 2 Fold02 rmse standard 0.163 3 Fold03 rmse standard 0.160 4 Fold04 rmse standard 0.158 5 Fold05 rmse standard 0.161 6 Fold06 rmse standard 0.156 7 Fold07 rmse standard 0.179 8 Fold08 rmse standard 0.159 9 Fold09 rmse standard 0.168 10 Fold10 rmse standard 0.158 Resampled performance yardstick is a package that contains functions for calculating a large number of performance metrics. All functions are of the form metric_name(data, truth, estimate) . 39 / 42
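Any other yardstick metric follows the same metric_name(data, truth, estimate) pattern; a quick sketch using the same grouped cv_preds tibble (not shown in the deck):

    # Because cv_preds is grouped by id, each metric is computed per fold
    cv_preds %>% rsq(Sale_Price_Log, .pred)   # R-squared per resample
    cv_preds %>% mae(Sale_Price_Log, .pred)   # mean absolute error per resample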
  45. Resampled performance cv_preds %>% rmse(Sale_Price_Log, .pred) %>% summarise( .metric =

    .metric[1], resampled_estimate = mean(.estimate), estimate_sd = sd(.estimate) ) # A tibble: 1 x 3 .metric resampled_estimate estimate_sd <chr> <dbl> <dbl> 1 rmse 0.161 0.00795 40 / 42
  46. Now what? Model comparison Is that rmse good? How do

    we answer this question? Compare to other models. Maybe knn could be used? "Neighbors" makes sense when considering house prices! Compare resample 1 of lm to resample 1 of knn, then compare the overall rmse. 41 / 42
  47. Now what? Model comparison Is that rmse good? How do

    we answer this question? Compare to other models. Maybe knn could be used? "Neighbors" makes sense when considering house prices! Compare resample 1 of lm to resample 1 of knn, then compare the overall rmse. Testing set We still haven't used this! Once you've chosen your final model: Refit the model on the entire training set (no resamples). Predict on the testing set. Evaluate; this should simply confirm your existing beliefs by now. 41 / 42
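As a closing sketch (not part of the deck), the knn candidate mentioned above could be specified with parsnip in the same three-step style, assuming the kknn package is installed for the "kknn" engine; the spec_knn and fit_knn names are hypothetical:

    # A knn specification that can be resampled the same way as spec_lm
    spec_knn <- nearest_neighbor(mode = "regression", neighbors = 5) %>%
      set_engine("kknn")

    fit_knn <- fit(
      spec_knn,
      formula = Sale_Price_Log ~ Latitude + Longitude,
      data = ames_train_log
    )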