
Modeling in the tidyverse

Davis Vaughan
November 27, 2018


Transcript

  1. Modeling in the tidyverse
    Davis Vaughan
    https://speakerdeck.com/davisvaughan

  2. Davis Vaughan
    Software engineer @ RStudio
    Modeling with Max Kuhn
    tidymodels
    Quantitative finance
    Master's @ UNC Charlotte
    Obsessed with creating tools that make your life easier
    Who am I?

  3. Part 1 - Modeling
    parsnip - Building model specifications
    rsample - Resampling techniques
    yardstick - Performance evaluation
    Agenda:

  4. Part 1 - Modeling
    parsnip - Building model specifications
    rsample - Resampling techniques
    yardstick - Performance evaluation
    Part 2 - Something new
    Arrays and matrices in R
    Agenda:

  5. R has cutting edge models.
    The number of R packages has shot through the roof in
    recent years, totaling over 13000 packages. Many of
    these implement novel statistical / machine learning
    models.
    It is easy to port / link to other applications.
    R doesn't try to do everything itself. If you prefer models
    implemented in C, C++, tensorflow , keras , python ,
    stan , or Weka , you can access these applications
    without leaving R.1
    R packages are built by people who do data analysis.
    This is in contrast to computer science. R provides the
    tooling to allow user experience to be a big focus of a
    package.
    The S language is very mature.
    S is the precursor to R, and R inherits some of the
    statistical routines found in S.
    Why use R for modeling?
    [1] In fact, this is one of the core principles of R: it is an interface to other languages.

  6. Performance
    R is a data analysis language, not C++. If a high-performance
    deployment is required, R can be treated as a prototyping
    language.
    Memory
    R is mostly memory-bound. There are plenty of
    exceptions to this though.
    Consistency among interfaces
    For example:
    There are two methods for specifying what terms are
    in a model1. Not all models have both.
    99% of model functions automatically generate
    dummy variables. But damn the 1% that don't!
    Why not to use R for modeling?
    [1] There are now three but the last one is brand new and won't be discussed. But see recipes for info.

  7. The problem of consistency
    Function Package Code
    lda MASS predict(obj)
    glm stats predict(obj, type = "response")
    gbm gbm predict(obj, type = "response", n.trees)
    mda mda predict(obj, type = "posterior")
    rpart rpart predict(obj, type = "prob")
    Weka RWeka predict(obj, type = "probability")
    logitboost LogitBoost predict(obj, type = "raw", nIter)

  8. The problem of consistency
    Function Package Code
    lda MASS predict(obj)
    glm stats predict(obj, type = "response")
    gbm gbm predict(obj, type = "response", n.trees)
    mda mda predict(obj, type = "posterior")
    rpart rpart predict(obj, type = "prob")
    Weka RWeka predict(obj, type = "probability")
    logitboost LogitBoost predict(obj, type = "raw", nIter)
    Sigh.

  9. The problem of consistency
    Function Package Code
    lda MASS predict(obj)
    glm stats predict(obj, type = "response")
    gbm gbm predict(obj, type = "response", n.trees)
    mda mda predict(obj, type = "posterior")
    rpart rpart predict(obj, type = "prob")
    Weka RWeka predict(obj, type = "probability")
    logitboost LogitBoost predict(obj, type = "raw", nIter)
    Sigh.
    How has this been tackled in the past?

  10. The one stop shop
    The more traditional approach uses high-level syntax and is
    perhaps the most untidy code that you will encounter.
    caret is the primary package for untidy predictive modeling.
    It wraps all of those models on the previous page.
    1. More traditional R coding style.
    2. High-level "I'll do that for you" syntax.
    3. More comprehensive (for now) and less modular.
    caret can do everything from feature engineering to
    hyperparameter optimization.
    4. Contains many optimizations and is easily parallelized.
    A tale of two philosophies

  11. The one stop shop
    The more traditional approach uses high-level syntax and is
    perhaps the most untidy code that you will encounter.
    caret is the primary package for untidy predictive modeling.
    It wraps all of those models on the previous page.
    1. More traditional R coding style.
    2. High-level "I'll do that for you" syntax.
    3. More comprehensive (for now) and less modular.
    caret can do everything from feature engineering to
    hyperparameter optimization.
    4. Contains many optimizations and is easily parallelized.
    The modular approach
    The tidy modeling approach espouses the tenets of the
    tidyverse:
    1. Reuse existing data structures.
    2. Compose simple functions with the pipe.
    3. Embrace functional programming.
    4. Design for humans.
    This approach is exemplified by a new set of tidy modeling
    packages...
    A tale of two philosophies
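    To make the contrast concrete (this example is not on the slide, and it uses the ames_train split created later in the deck), here is roughly how the same linear model is requested in each style, assuming caret and parsnip are installed:

    # caret: one high-level call that handles resampling, tuning, and fitting for you
    library(caret)
    caret_fit <- train(Sale_Price ~ Longitude + Latitude, data = ames_train, method = "lm")

    # tidymodels: small, composable pieces joined with the pipe
    library(parsnip)
    tidy_fit <- linear_reg() %>%
      set_engine("lm") %>%
      fit(Sale_Price ~ Longitude + Latitude, data = ames_train)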

  12. tidymodels ecosystem
    library(tidymodels)

    ── Attaching packages ──────────────────────────────────── tidymodels ──
    ✔ broom     0.5.0.9001     ✔ purrr     0.2.5.9000
    ✔ dials     0.0.1.9000     ✔ recipes   0.1.4.9000
    ✔ dplyr     0.7.99.9000    ✔ rsample   0.0.3
    ✔ ggplot2   3.1.0          ✔ tibble    1.4.2
    ✔ infer     0.4.0          ✔ yardstick 0.0.2
    ✔ parsnip   0.0.1

    ── Conflicts ──────────────────────────────── tidymodels_conflicts() ──
    ✖ purrr::discard()  masks scales::discard()
    ✖ rsample::fill()   masks tidyr::fill()
    ✖ dplyr::filter()   masks stats::filter()
    ✖ dplyr::lag()      masks stats::lag()
    ✖ recipes::step()   masks stats::step()

    Plus tidypredict, tidyposterior, tidytext, and more in development.
    We will explore 3 of these packages today, but use many others along the way.

  13. We will use the Ames IA housing data. There are 2,930
    properties in the data.
    The Sale Price was recorded along with 81 predictors,
    including:
    Location (e.g. neighborhood) and lot information.
    House components (garage, fireplace, pool, porch, etc.).
    General assessments such as overall quality and
    condition.
    Number of bedrooms, baths
    And so on...
    More details can be found in De Cock (2011, Journal of
    Statistics Education).
    The raw data are at http://bit.ly/2whgsQM but we will use
    a processed version found in the AmesHousing package.
    Data set - House prices
    The goal will be to explore new tools that allow us to model sale price as a function of these features.

  14. Visualize - Colored by neighborhood
    [Interactive map: Leaflet | Map tiles by Stamen Design, CC BY 3.0 — Map data © OpenStreetMap]

  15. Managing data

  16. How do we "spend"
    the data to find an optimal model?

  17. Hmm..
    Repeatedly fit models on the whole data set.
    Estimate performance on the data used to fit the model.
    Find the best model and carry that into production.
    Baby steps


  19. Hmm..
    Repeatedly fit models on the whole data set.
    Estimate performance on the data used to fit the model.
    Find the best model and carry that into production.
    Why not?
    Overestimate model performance
    You create a model highly specific to X and then test on
    X and get good performance. Is this realistic?
    No checks against overfitting
    Linear regression with p variables and p data points: R² = 1. Doesn't generalize!
    Baby steps
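    A quick illustration of that last point (not from the slides): with as many coefficients as data points, lm() reproduces the training data exactly.

    set.seed(123)
    n <- 5
    d <- data.frame(y = rnorm(n), x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))

    # 4 predictors + intercept = 5 coefficients for 5 observations
    fit <- lm(y ~ ., data = d)
    summary(fit)$r.squared
    # ~1: a "perfect" fit on the training data that says nothing about new data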

  20. Data splitting
    We typically split data into training and test data sets:
    Training Set: these data are used to estimate model parameters and to pick the values of the complexity parameter(s) for the
    model.
    Test Set: these data can be used to get an independent assessment of model efficacy. They should not be used during model
    training.

  21. Data splitting
    We typically split data into training and test data sets:
    Training Set: these data are used to estimate model parameters and to pick the values of the complexity parameter(s) for the
    model.
    Test Set: these data can be used to get an independent assessment of model efficacy. They should not be used during model
    training.
    The more data we spend, the better estimates we'll get (provided the data is accurate).
    Given a fixed amount of data:
    too much spent in training won't allow us to get a good assessment of predictive performance. We may find a model that fits
    the training data very well, but is not generalizable (overfitting)
    too much spent in testing won't allow us to get a good assessment of model parameters

  22. Mechanics of splitting
    There are a few different ways to do the split: simple random sampling, stratified sampling based on the outcome, by date, or
    methods that focus on the distribution of the predictors.
    For stratification:
    Classification: this would mean sampling within the classes to preserve the distribution of the outcome in the training and test
    sets
    Regression: determine the quartiles of the data set and sample within those artificial groups (a short sketch follows)
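    A minimal sketch of the regression case (not on the slide; it assumes the ames data created via AmesHousing::make_ames() on the next slide). rsample does this for you when you pass strata, but spelled out by hand it is just grouped sampling:

    library(dplyr)

    ames %>%
      mutate(price_bin = ntile(Sale_Price, 4)) %>%   # artificial quartile groups
      group_by(price_bin) %>%
      sample_frac(0.75) %>%                          # sample within each group
      ungroup()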

  23. Ames Housing Data
    Let's load the example data set and split it. We'll put 75% into training and 25% into testing.
    library(AmesHousing)
    library(rsample)
    ames <- make_ames()
    nrow(ames)
    [1] 2930

    set.seed(4595)
    data_split <- initial_split(ames, strata = "Sale_Price", prop = .75)
    ames_train <- training(data_split)
    ames_test  <- testing(data_split)
    nrow(ames_train)/nrow(ames)
    [1] 0.7505119

  24. Detour
    What are these split objects?
    # result of initial_split()
    #
    data_split
    <2199/731/2930>
    training(data_split)
    # A tibble: 2,199 x 81
    MS_SubClass MS_Zoning Lot_Frontage Lot_Area Street Alley Lot_Shape

    1 One_Story_… Resident… 141 31770 Pave No_A… Slightly…
    2 Two_Story_… Resident… 74 13830 Pave No_A… Slightly…
    3 Two_Story_… Resident… 78 9978 Pave No_A… Slightly…
    4 One_Story_… Resident… 43 5005 Pave No_A… Slightly…
    5 One_Story_… Resident… 39 5389 Pave No_A… Slightly…
    6 Two_Story_… Resident… 60 7500 Pave No_A… Regular
    7 Two_Story_… Resident… 75 10000 Pave No_A… Slightly…
    8 One_Story_… Resident… 85 10176 Pave No_A… Regular
    9 One_Story_… Resident… 0 6820 Pave No_A… Slightly…

  25. ggplot(ames_train, aes(x = Sale_Price)) +
    geom_line(stat = "density",
    trim = TRUE) +
    geom_line(data = ames_test,
    stat = "density",
    trim = TRUE, col = "red")
    Distribution check

  26. Creating models with parsnip

  27. Specifying models in R using formulas
    To fit a model to the housing data, the model terms must be specified. Historically, there are two main interfaces for doing this.
    The formula interface uses R formula rules to specify a symbolic representation of the terms and variables. For example:
    Specifying variables and interactions:
    model_function(Sale_Price ~ Neighborhood + Year_Sold + Neighborhood:Year_Sold, data = ames_train)
    Specifying all variables on the right hand side:
    model_function(Sale_Price ~ ., data = ames_train)
    Using special functions directly on your data:
    model_function(log10(Sale_Price) ~ ns(Longitude, df = 3) + ns(Latitude, df = 3), data = ames_train)
    This is very convenient but it has some disadvantages.

  28. Downsides to formulas
    You can't nest functions such as foo(y ~ pca(scale(x1), scale(x2), scale(x3)), data = dat) (see the sketch after this list).
    For very wide data sets, the formula method can be extremely inefficient.
    Specifying multivariate outcomes is awkward.
    Not all model functions have a formula method (that consistency issue again).
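    For the first bullet, the third interface mentioned later (recipes) handles this kind of nesting declaratively. A hedged sketch, assuming the recipes package; these specific steps and columns are illustrative and not from the slides:

    library(recipes)

    recipe(Sale_Price ~ Longitude + Latitude + Lot_Area, data = ames_train) %>%
      step_center(all_predictors()) %>%               # scale(x) ...
      step_scale(all_predictors()) %>%
      step_pca(all_predictors(), num_comp = 2)        # ... then pca() on the result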

  29. Specifying models without formulas
    Some modeling functions have an "xy" interface. This usually has arguments for the predictors and the outcome(s):
    predictors <- c("Year_Sold", "Longitude", "Latitude")
    modeling_function(
    x = ames_train[, predictors],
    y = ames_train$Sale_Price
    )
    This is inconvenient if you have transformations, factor variables, interactions, or any other operations to apply to the data prior to
    modeling.

  30. Model specification woes
    Overall, it is difficult to predict if a package has one or both of these interfaces. For example, lm() only has formulas.
    There is a third interface, using recipes, which we will quickly discuss; it attempts to solve most of these issues and more.

  31. recipes
    ames %>%
      select(Sale_Price, Alley, Central_Air, Year_Sold) %>%
      head(3)

    # A tibble: 3 x 4
      Sale_Price Alley           Central_Air Year_Sold
    1     215000 No_Alley_Access Y                2010
    2     105000 No_Alley_Access Y                2010
    3     172000 No_Alley_Access Y                2010

    unique(ames$Alley)

    [1] No_Alley_Access Paved Gravel
    Levels: Gravel No_Alley_Access Paved

    rec <- recipe(
      Sale_Price ~ Alley + Central_Air + Year_Sold,
      data = ames_train
    ) %>%
      step_log(Sale_Price) %>%
      step_dummy(Alley, Central_Air) %>%
      step_interact(~ starts_with("Alley_"):Year_Sold)

    Data Recipe

    Inputs:
          role #variables
       outcome          1
     predictor          3

    Operations:
    Log transformation on Sale_Price
    Dummy variables from Alley, Central_Air
    Interactions with starts_with("Alley_"):Year_Sold

  32. Prepare your recipe!
    prepped <- prep(rec, training = ames_train, retain = TRUE)
    juice(prepped)
    # A tibble: 2,199 x 7
    Year_Sold Sale_Price Alley_No_Alley_Access Alley_Paved Central_Air_Y

    1 2010 12.3 1 0 1
    2 2010 12.2 1 0 1
    Alley_No_Alley_Access_x_Year_Sold Alley_Paved_x_Year_Sold

    1 2010 0
    2 2010 0
    # with 2,197 more rows

  33. Bake it on new data!
    ames_test_baked <- bake(prepped, new_data = ames_test)
    ames_test_baked
    # A tibble: 731 x 7
    Year_Sold Sale_Price Alley_No_Alley_Access Alley_Paved Central_Air_Y

    1 2010 11.6 1 0 1
    2 2010 12.1 1 0 1
    Alley_No_Alley_Access_x_Year_Sold Alley_Paved_x_Year_Sold

    1 2010 0
    2 2010 0
    # with 729 more rows

  34. Three steps:
    Create a lightweight model specification.
    Set an "engine".
    Add the data and formula / xy terms and fit.
    spec_lin_reg <- linear_reg()
    # or glmnet, or Spark, or Stan, or Keras
    spec_lm <- set_engine(spec_lin_reg, "lm")
    spec_lm
    Linear Regression Model Specification (regression)
    Computational engine: lm
    parsnip model fitting

  35. Three steps:
    Create a lightweight model specification.
    Set an "engine".
    Add the data and formula / xy terms and fit.
    spec_lin_reg <- linear_reg()
    # or glmnet, or Spark, or Stan, or Keras
    spec_lm <- set_engine(spec_lin_reg, "lm")
    spec_lm

    Linear Regression Model Specification (regression)
    Computational engine: lm

    ames_train_log <- ames_train %>%
      mutate(Sale_Price_Log = log10(Sale_Price))

    fit_lm <- fit(
      object = spec_lm,
      # No "special" terms in the formula. Use recipes!
      formula = Sale_Price_Log ~ Latitude + Longitude,
      data = ames_train_log
    )
    fit_lm

    parsnip model object

    Call:
    stats::lm(formula = formula, data = data)
    Coefficients:
    (Intercept) Latitude Longitude
    -316.153 3.014 -2.079
    parsnip model fitting

  36. XY interface
    parsnip isn't picky about which interface you should use.
    Remember, lm() only allows the formula interface, but parsnip's interface lets you use either one.
    fit_xy(
    spec_lm, # same model specification
    y = ames_train_log$Sale_Price_Log,
    x = ames_train_log[, c("Latitude", "Longitude")]
    )
    parsnip model object
    Call:
    stats::lm(formula = formula, data = data)
    Coefficients:
    (Intercept) Latitude Longitude
    -316.153 3.014 -2.079

  37. Alternative engine
    spec_stan <- spec_lin_reg %>%
      # Engine specific arguments are passed through here
      set_engine("stan", chains = 4, iter = 1000)

    fit_stan <- fit(
    object = spec_stan,
    formula = Sale_Price_Log ~ Latitude + Longitude,
    data = ames_train_log
    )
    coef(fit_stan$fit)
    (Intercept) Latitude Longitude
    -316.311134 3.008426 -2.089167
    coef(fit_lm$fit)
    (Intercept) Latitude Longitude
    -316.152846 3.013530 -2.079171

  38. Model Evaluation

  39. Overall model statistics
    parsnip holds the actual fit object in fit_lm$fit .
    If you use the summary() method on the underlying lm object, the bottom shows some statistics:
    summary(fit_lm$fit)
    < >
    Residual standard error: 0.1614 on 2196 degrees of freedom
    Multiple R-squared: 0.1808, Adjusted R-squared: 0.1801
    F-statistic: 242.3 on 2 and 2196 DF, p-value: < 2.2e-16
    These statistics are generated from predicting on the training data used to fit the model.
    This is problematic because it can lead to optimistic results, especially for models that are extremely flexible (overfitting).

  40. Overall model statistics
    parsnip holds the actual fit object in fit_lm$fit .
    If you use the summary() method on the underlying lm object, the bottom shows some statistics:
    summary(fit_lm$fit)
    < >
    Residual standard error: 0.1614 on 2196 degrees of freedom
    Multiple R-squared: 0.1808, Adjusted R-squared: 0.1801
    F-statistic: 242.3 on 2 and 2196 DF, p-value: < 2.2e-16
    These statistics are generated from predicting on the training data used to fit the model.
    This is problematic because it can lead to optimistic results, especially for models that are extremely flexible (overfitting).
    Idea!
    The test set is used for assessing performance. Should we predict on the test set and use those results to estimate these statistics?

  41. (Matthew Inman/Exploding Kittens)

  42. Assessing Models
    Save the test set until the very end when you have one or two models that are your favorite.

  43. Assessing Models
    Save the test set until the very end when you have one or two models that are your favorite.
    So...
    1) For model A, fit on training set, predict on training set
    2) For model B, fit on training set, predict on training set
    3) Compare performance
    For some models, it is possible to get very "good" performance by predicting the training set (it was so flexible you overfit it). That's
    an issue since we will need to make "honest" comparisons between models before we finalize them and run our final choices on
    the test set.

  44. Assessing Models
    Save the test set until the very end when you have one or two models that are your favorite.
    So...
    1) For model A, fit on training set, predict on training set
    2) For model B, fit on training set, predict on training set
    3) Compare performance
    For some models, it is possible to get very "good" performance by predicting the training set (it was so flexible you overfit it). That's
    an issue since we will need to make "honest" comparisons between models before we finalize them and run our final choices on
    the test set.
    If only...
    If only we had a method for getting honest performance estimates from the training set...

  45. These are additional data splitting schemes that are applied to
    the training set.
    They attempt to simulate slightly different versions of the
    training set. These versions of the original are split into two
    model subsets:
    The analysis set is used to fit the model (analogous to the
    training set).
    The assessment set is used for performance evaluation.
    This process is repeated many times.
    There are different flavors of resampling but we will focus on
    just one.
    [Diagram: All Data (2000) splits into Testing (200) and Training (1800); the training set produces Resample 1–3 (600 each), each divided into an Analysis set (500) and an Assessment set (100)]
    Back to rsample!

  46. Here, we randomly split the training data into V distinct
    blocks of roughly equal size.
    We leave out the first block of analysis data and fit a
    model.
    This model is used to predict the held-out block of
    assessment data.
    We continue this process until we've predicted all V
    assessment blocks.
    The final performance is based on the hold-out predictions by
    averaging the statistics from the V blocks.
    V is usually taken to be 5 or 10, and leave-one-out cross-
    validation has each sample as a block.
    V-Fold Cross-Validation
    [1] Image from: https://www.kaggle.com/dansbecker/cross-validation
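    A conceptual sketch of that loop in base R (not from the slides; rsample automates this on the next slide, and ames_train_log comes from the parsnip section):

    v <- 10
    folds <- sample(rep(1:v, length.out = nrow(ames_train_log)))

    rmse_by_fold <- vapply(1:v, function(k) {
      analysis_set   <- ames_train_log[folds != k, ]
      assessment_set <- ames_train_log[folds == k, ]
      fit <- lm(Sale_Price_Log ~ Latitude + Longitude, data = analysis_set)
      pred <- predict(fit, newdata = assessment_set)
      sqrt(mean((pred - assessment_set$Sale_Price_Log)^2))   # RMSE on the held-out block
    }, numeric(1))

    mean(rmse_by_fold)   # average the V statistics for the final estimate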

  47. set.seed(2453)
    cv_splits <- vfold_cv(
    data = ames_train_log,
    v = 10,
    strata = "Sale_Price"
    )
    cv_splits
    # 10-fold cross validation using stratification
    # A tibble: 10 x 2
    splits id

    1 Fold01
    2 Fold02
    3 Fold03
    4 Fold04
    5 Fold05
    6 Fold06
    7 Fold07
    8 Fold08
    9 Fold09
    10 Fold10
    Each individual split object is similar to the initial_split()
    example.
    cv_splits$splits[[1]]
    <1977/222/2199>
    cv_splits$splits[[1]] %>% analysis() %>% dim()
    [1] 1977 82
    cv_splits$splits[[1]] %>% assessment() %>% dim()
    [1] 222 82
    Cross-validation using rsample

  48. # Fit on a single analysis resample
    fit_model <- function(split) {
    fit(
    object = spec_lm,
    formula = Sale_Price_Log ~ Latitude + Longitude,
    data = analysis(split) # pull out training set
    )
    }
    # For each resample, call fit_model()
    cv_splits <- cv_splits %>%
    mutate(models = map(splits, fit_model))
    cv_splits
    # 10-fold cross validation using stratification
    # A tibble: 10 x 3
    splits id models
    *
    1 Fold01
    2 Fold02
    3 Fold03
    4 Fold04
    5 Fold05
    6 Fold06
    7 Fold07
    8 Fold08
    9 Fold09
    10 Fold10
    Resampling the linear model
    Working with resample tibbles generally involves two things:
    1) Small functions that perform an action on a single split.
    2) The purrr package for map() ping over splits.
    Result: A nice self-contained modeling object!

    predict_split <- function(assess_data, model) {
      predict(model, new_data = assess_data)
    }

    cv_splits <- cv_splits %>%
    mutate(
    assess = map(splits, assessment),
    preds = map2(assess, models, predict_split)
    )
    select(cv_splits, id, assess, preds)
    # 10-fold cross validation using stratification
    # A tibble: 10 x 3
    id assess preds
    *
    1 Fold01
    2 Fold02
    3 Fold03
    4 Fold04
    5 Fold05
    6 Fold06
    7 Fold07
    8 Fold08
    9 Fold09
    10 Fold10
    Resampled predictions
    At this point, we have fit models to each analysis set, and we can now make predictions for each resample using the corresponding
    assessment set.
    The same map() approach applies. We also use map2() to map over 2 things at once: the assessment set and the model. Both of
    these are used together to create the predictions.

    cv_preds <- cv_splits %>%
    unnest(preds, assess) %>%
    group_by(id) %>%
    select(id, .pred, Sale_Price_Log)
    print(cv_preds, n = 5)
    # A tibble: 2,199 x 3
    # Groups: id [10]
    id .pred Sale_Price_Log

    1 Fold01 5.28 5.37
    2 Fold01 5.26 5.33
    3 Fold01 5.24 4.98
    4 Fold01 5.30 5.30
    5 Fold01 5.30 5.55
    # with 2,194 more rows
    cv_preds %>%
    rmse(Sale_Price_Log, .pred)
    # A tibble: 10 x 4
    id .metric .estimator .estimate

    1 Fold01 rmse standard 0.149
    2 Fold02 rmse standard 0.163
    3 Fold03 rmse standard 0.160
    4 Fold04 rmse standard 0.158
    5 Fold05 rmse standard 0.161
    6 Fold06 rmse standard 0.156
    7 Fold07 rmse standard 0.179
    8 Fold08 rmse standard 0.159
    9 Fold09 rmse standard 0.168
    10 Fold10 rmse standard 0.158
    Resampled performance
    yardstick is a package that contains functions for calculating a large number of performance metrics. All functions are of the form
    metric_name(data, truth, estimate) .
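    An aside not shown on the slide, assuming a yardstick version with metric_set(): several metrics can be bundled into one function that respects the same grouping:

    reg_metrics <- metric_set(rmse, rsq, mae)

    cv_preds %>%
      reg_metrics(truth = Sale_Price_Log, estimate = .pred)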

  51. Resampled performance
    cv_preds %>%
    rmse(Sale_Price_Log, .pred) %>%
    summarise(
    .metric = .metric[1],
    resampled_estimate = mean(.estimate),
    estimate_sd = sd(.estimate)
    )
    # A tibble: 1 x 3
    .metric resampled_estimate estimate_sd

    1 rmse 0.161 0.00795

  52. Now what?
    Model comparison
    Is that rmse good? How do we answer this question?
    Compare to other models. Maybe knn could be used? "Neighbors" makes sense when considering house prices!
    Compare resample 1 of lm to resample 1 of knn , then compare the overall rmse .
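    A hedged sketch of what that comparison might look like (not shown in the deck), assuming a parsnip version with nearest_neighbor() and the kknn engine installed:

    spec_knn <- nearest_neighbor(neighbors = 5) %>%
      set_engine("kknn") %>%
      set_mode("regression")

    fit_knn <- function(split) {
      fit(spec_knn, Sale_Price_Log ~ Latitude + Longitude, data = analysis(split))
    }

    # reuse the same folds so resample 1 of lm lines up with resample 1 of knn
    cv_splits %>%
      mutate(models_knn = map(splits, fit_knn))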

  53. Now what?
    Model comparison
    Is that rmse good? How do we answer this question?
    Compare to other models. Maybe knn could be used? "Neighbors" makes sense when considering house prices!
    Compare resample 1 of lm to resample 1 of knn , then compare the overall rmse .
    Testing set
    We still haven't used this!
    Once you've chosen your final model:
    Refit the model on the entire training set (no resamples).
    Predict on the testing set.
    Evaluate; this should simply confirm your existing beliefs by now.
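    A minimal sketch of those final steps, reusing objects from earlier slides (the logged test set, ames_test_log, is an assumption created here, not from the deck):

    ames_test_log <- ames_test %>%
      mutate(Sale_Price_Log = log10(Sale_Price))

    # refit on the full training set, then assess once on the test set
    final_fit <- fit(
      object = spec_lm,
      formula = Sale_Price_Log ~ Latitude + Longitude,
      data = ames_train_log
    )

    predict(final_fit, new_data = ames_test_log) %>%
      bind_cols(ames_test_log) %>%
      rmse(Sale_Price_Log, .pred)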

  54. Switch gears