rsample-sampler

R-Ladies San Francisco 2018-08-15 Fanny Chow @frannystats Rsample Sampler Resampling
Methods with R

Setup! github.com/fbchow/rsample-sampler
Download R (https://cran.r-project.org) and RStudio (https://www.rstudio.com/products/rstudio/download/#download).
Within RStudio: click Tools > Install Packages… and install tidyverse and tidymodels.

File > New File > R Script. Then load the packages by typing into your R script:
library(tidyverse)
library(tidymodels)
Let’s run this code! With your cursor on the text, hit Command+Enter on Mac or Ctrl+Enter on Windows. If you need help, please raise your hand!

>Whoami Introduced to STEM “This is cool but I’m terrified.”
“You can do it!” “You should major in statistics.” startups differential privacy web security build & deploy models Intern

Introducing Consistent model interfaces High and low-level APIs with practical
defaults Suite of modular tools functional programming (purrr package) Chaining %>% Tidy data (2nd, 3rd Normal Form) Max Kuhn @topepo

https://github.com/tidymodels

✤ Functions & data structures specifically for resampling data
✤ Infrastructure to assess and validate model performance
✤ Easy to keep track of how your data is split
https://tidymodels.github.io/rsample

Recall: what is sampling? We can’t measure every person, place,
or thing. So we randomly collect a sample that’s reasonably representative of the population we want to make inferences or predictions about. ✤ U.S. Census (American Community Survey) ✤ A/B test measuring click-through-rate on a sample of site visitors ✤ Risk of heart disease by race & ethnicity

What’s resampling? Resampling is taking repeated samples from our data
set.   “You randomly sample your random sample!”

Sampling focus: estimate a statistic of interest (mean, median, regression coefficients).
Resampling focus: gauge the variability or uncertainty in our statistic of interest (standard error, standard deviation, confidence intervals).
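
A minimal base R sketch of the contrast above. The population is simulated and hypothetical, and the resamples here are drawn with replacement, which is one common choice rather than the only one.

```r
set.seed(2018)  # arbitrary seed so the sketch is reproducible

# Hypothetical population: 10,000 simulated daily click counts
population <- rpois(10000, lambda = 20)

# Sampling: draw one random sample and estimate the statistic of interest
my_sample <- sample(population, size = 100)
sample_mean <- mean(my_sample)

# Resampling: repeatedly resample the sample (with replacement) and
# recompute the statistic to see how much it varies
resampled_means <- replicate(1000, mean(sample(my_sample, replace = TRUE)))

sample_mean          # point estimate from the single sample
sd(resampled_means)  # resampling-based standard error of the mean
```

The spread of resampled_means is what lets us talk about uncertainty without measuring the population again.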

What does resampling look like?

www.appliedpredictivemodeling.com 3 Iterations of Resampling

You can use resampling to gauge the performance & variability
in your model. Iteration 1

www.appliedpredictivemodeling.com Iteration 2

www.appliedpredictivemodeling.com Iteration 3

Initial Split into Test & Train
We only have so much data, so budget wisely! We take a random portion of our data to train (fit) our model on, and we set aside the rest to test (or validate) our models on. Luckily, the initial_split function in rsample makes it easy!

# Make a dataset of meetup attendees' favorite numbers
names_df <- tribble(
  ~name,    ~fav_num,
  "Dana",   4,
  "Irene",  8,
  "Mara",   2,
  "Ming",   5,
  "Chaita", 10,
  "Jen",    7,
  "Becky",  4,
  "Jenny",  9,
  "Mine",   2,
  "Emily",  6
)
# Explore data
boxplot(names_df$fav_num)
View(names_df)
nrow(names_df)
str(names_df)
summary(names_df)
# Initial split
first_split <- initial_split(names_df)
train_names <- training(first_split)
test_names <- testing(first_split)

> initial_split(names_df)
<8/2/10>
Training: 8 data points
Test: 2 data points
Original: 10 observations

> training(first_split)
# A tibble: 8 x 2
  name   fav_num
  <chr>    <dbl>
1 Dana         4
2 Irene        8
3 Mara         2
4 Chaita      10
5 Jen          7
6 Jenny        9
7 Mine         2
8 Emily        6

> testing(first_split)
# A tibble: 2 x 2
  name  fav_num
  <chr>   <dbl>
1 Ming        5
2 Becky       4

Start a new script and try initial_split on a different
data set. This time let’s use the attrition data set within the rsample package. Let’s load the data by typing: data("attrition")
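
One possible shape for this exercise, as a rough sketch. It uses the built-in mtcars data as a stand-in so it runs without assuming which package ships attrition in your installed version; the prop value and object names are illustrative choices, not part of the talk.

```r
library(rsample)

set.seed(123)  # arbitrary seed for a reproducible split

# mtcars is a stand-in here; swap in attrition once it is loaded
car_split <- initial_split(mtcars, prop = 0.75)

car_train <- training(car_split)  # rows used to fit models
car_test  <- testing(car_split)   # rows held back for evaluation

nrow(car_train)
nrow(car_test)
```

Every row lands in exactly one of the two sets, so the two row counts add up to nrow(mtcars).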

Today’s Resampling Sampler 1. Cross-validation 2. Bootstrapping

1. Cross Validation

V-Fold Cross Validation https://github.com/topepo/rstudio-conf-2018 Used to test how the results
of a statistical analysis will generalize to a new situation

Let’s try vfold_cv()
vfold_cv(data, v = 10, repeats = 1, strata = NULL, ...)
✤ data: data frame or tibble
✤ v: number of folds
✤ repeats: optional; repeat the whole V-fold partitioning several times
✤ strata: optional; stratify on a specific variable (if you want to make sure the proportion of the Species variable in the analysis data set is equal to the original iris data set)
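
A small sketch of the strata argument using the iris example mentioned above. The number of folds and the seed are arbitrary choices for illustration.

```r
library(rsample)

set.seed(42)  # arbitrary seed

# Five folds, stratified so each fold keeps the Species balance
iris_folds <- vfold_cv(iris, v = 5, strata = "Species")
iris_folds

# Species proportions in the analysis part of the first fold
first_analysis <- analysis(iris_folds$splits[[1]])
species_prop <- prop.table(table(first_analysis$Species))
species_prop
```

Because iris has equal numbers of each species, the proportions in the analysis set should stay close to one third each.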

Let’s try vfold_cv()
✤ We will get an rset object
✤ Which contains many resamples
✤ Each resample is an rsplit object
It’s easier to understand when we see it in action. Let’s try.

my_folds <- vfold_cv(train_names, v = 3)
my_folds
# The `split` objects contain the information about the sample sizes
my_folds$splits[[1]]
# Use the `analysis` and `assessment` functions to get the data
analysis(my_folds$splits[[1]]) %>% dim()
analysis(my_folds$splits[[1]])
assessment(my_folds$splits[[1]]) %>% dim()
assessment(my_folds$splits[[1]])

✤ Great! We have many tidy resamples!
✤ We can use functional programming functions from the purrr package to fit a model on each resample.
✤ To avoid information overload, let’s save that for next time.
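
For the curious, a rough preview of what fitting a model on each resample could look like. The data set (mtcars), the linear model, and the helper name fit_and_score are all hypothetical choices for illustration, not something from the talk.

```r
library(rsample)
library(purrr)

set.seed(1)  # arbitrary seed
folds <- vfold_cv(mtcars, v = 4)

# Hypothetical helper: fit on the analysis set, score on the assessment set
fit_and_score <- function(split) {
  fit <- lm(mpg ~ wt, data = analysis(split))
  held_out <- assessment(split)
  preds <- predict(fit, newdata = held_out)
  sqrt(mean((held_out$mpg - preds)^2))  # RMSE on the held-out rows
}

# One RMSE per fold
rmse_by_fold <- map_dbl(folds$splits, fit_and_score)
rmse_by_fold
mean(rmse_by_fold)  # average performance across folds
```

The spread of rmse_by_fold gives a feel for how much model performance varies from fold to fold.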

2. Bootstrapping

Bootstrapping
✤ Resampling with replacement
✤ (All values in the sample have an equal probability of being included, including multiple times, so a value could have a duplicate)
✤ Can help you calculate statistics with less strict mathematical assumptions
Ex: throw 10 paper slips in a hat, pick a name from the hat, write down the name, throw the paper back in, repeat 10x
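
The hat example can be sketched in base R with sample(). The names are the ones from the attendee data set earlier in the deck; the seed is arbitrary.

```r
set.seed(10)  # arbitrary seed

# Ten paper slips in the hat
hat <- c("Dana", "Irene", "Mara", "Ming", "Chaita",
         "Jen", "Becky", "Jenny", "Mine", "Emily")

# Draw a name, write it down, put it back: repeat 10 times
draws <- sample(hat, size = 10, replace = TRUE)
draws

# Names drawn more than once, and names never drawn
draw_counts <- table(draws)
draw_counts[draw_counts > 1]
setdiff(hat, draws)
```

Setting replace = TRUE is what makes this a bootstrap sample: a slip can come out of the hat more than once, and some slips may never be picked.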

✤ Bootstrapping is usually used to help us make sense of the distribution of a statistic of interest.
✤ But we can also use bootstrapping to perform inference.

Bootstraps function
bootstraps(data, times = 25, strata = NULL, apparent = FALSE, ...)
✤ data: data frame or tibble
✤ times: number of bootstraps (recommend starting with 1000)
✤ strata: optional; stratify on a specific variable (if you want to make sure the proportion of the Species variable in the bootstrap sample is equal to the original iris data set)
✤ apparent: optional; keep a copy of the original data (some estimators require it)
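
A short sketch of these arguments in use. mtcars is a stand-in data set and the seed is arbitrary; only times and apparent are varied here.

```r
library(rsample)

set.seed(2018)  # arbitrary seed

# 1000 bootstrap resamples of a stand-in data set
boots <- bootstraps(mtcars, times = 1000)
nrow(boots)

first_boot <- boots$splits[[1]]
nrow(analysis(first_boot))    # same number of rows as the original data
nrow(assessment(first_boot))  # rows that were never drawn in this resample

# apparent = TRUE adds one extra resample that uses the full original data
boots_plus <- bootstraps(mtcars, times = 1000, apparent = TRUE)
nrow(boots_plus)
```

Each bootstrap analysis set is as large as the original data, with some rows repeated; the rows left out end up in the assessment set.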

Take a look at one of the bootstraps
many_boots <- bootstraps(attrition)
one_boot <- many_boots$splits[[1]]
analysis(one_boot)
assessment(one_boot)

Let’s say we wanted to see if there’s a marked
difference between median income for men and women. An example from Max’s rsample vignette

95% Confidence Interval (Percentile Method)
quantile(bt_resamples$wage_diff, probs = c(0.025, 0.500, 0.975))
#>  2.5%   50% 97.5%
#>  -207   190   618
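
To show how a column like wage_diff might be built, here is a sketch of the same percentile-method pattern on a stand-in data set. It compares median mpg between manual and automatic cars in mtcars instead of income between groups; the helper name median_diff and the column name mpg_diff are hypothetical.

```r
library(rsample)
library(purrr)

set.seed(353)  # arbitrary seed

# Hypothetical helper: difference in median mpg, manual minus automatic,
# computed on the analysis part of one bootstrap resample
median_diff <- function(split) {
  x <- analysis(split)
  median(x$mpg[x$am == 1]) - median(x$mpg[x$am == 0])
}

bt_resamples <- bootstraps(mtcars, times = 1000)
bt_resamples$mpg_diff <- map_dbl(bt_resamples$splits, median_diff)

# Percentile method: take the 2.5% and 97.5% quantiles of the statistic
quantile(bt_resamples$mpg_diff, probs = c(0.025, 0.500, 0.975))
```

If the resulting interval excludes zero, that suggests a difference between the groups; if it includes zero, as the interval on this slide does, the data do not rule out no difference.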

Is there a difference in median monthly income between groups?

Summer Project
✤ Bootstrap confidence intervals
✤ Percentile
✤ Student-t
✤ Bias-Corrected and Accelerated (BCa)

Live demo (still in dev mode, sit back and relax) www.github.com/fbchow/rsample

Thanks! Questions?
