rsample-sampler - Speaker Deck

rsample-sampler

by FC

Slide 1

Slide 1 text

Rsample Sampler: Resampling Methods with R
Fanny Chow @frannystats
R-Ladies San Francisco, 2018-08-15

Slide 2

Slide 2 text

Setup!
github.com/fbchow/rsample-sampler
Download:
✤ R (https://cran.r-project.org)
✤ RStudio (https://www.rstudio.com/products/rstudio/download/#download)
Within RStudio: click Tools > Install Packages… and install:
✤ tidyverse
✤ tidymodels
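The same installation can also be done from the R console instead of the menu. A minimal sketch, assuming a CRAN mirror is already configured; `install_if_missing` is a hypothetical helper name used only for this illustration, not part of any package:

```r
# Hypothetical helper: install only the packages that are not yet available.
install_if_missing <- function(pkgs) {
  # installed.packages() returns a matrix whose row names are package names
  missing <- setdiff(pkgs, rownames(installed.packages()))
  if (length(missing) > 0) {
    install.packages(missing)
  }
  invisible(missing)  # return the names that had to be installed
}

# Uncomment to install the two packages used in this workshop:
# install_if_missing(c("tidyverse", "tidymodels"))
```

The call is left commented out so that nothing is downloaded until you choose to run it.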

Slide 3

Slide 3 text

File > New File > R Script
Then load the packages by typing into your R script:
library(tidyverse)
library(tidymodels)
Let’s run this code! With your cursor on the text, hit ‘command enter’ on Mac or ‘ctrl enter’ on Windows.
If you need help, please raise your hand!

Slide 4

Slide 4 text

>Whoami
Introduced to STEM
“This is cool but I’m terrified.”
“You can do it!”
“You should major in statistics.”
startups
differential privacy
web security
build & deploy models
Intern

Slide 5

Slide 5 text

Introducing
✤ Consistent model interfaces
✤ High and low-level APIs with practical defaults
✤ Suite of modular tools
✤ Functional programming (purrr package)
✤ Chaining %>%
✤ Tidy data (2nd, 3rd Normal Form)
Max Kuhn @topepo

Slide 6

Slide 6 text

https://github.com/tidymodels

Slide 7

Slide 7 text

✤ Functions & data structures specifically for resampling data
✤ Infrastructure to assess and validate model performance
✤ Easy to keep track of how your data is split
https://tidymodels.github.io/rsample

Slide 8

Slide 8 text

Recall: what is sampling?
We can’t measure every person, place, or thing. So we randomly collect a sample that’s reasonably representative of the population we want to make inferences or predictions about.
✤ U.S. Census (American Community Survey)
✤ A/B test measuring click-through rate on a sample of site visitors
✤ Risk of heart disease by race & ethnicity

Slide 9

Slide 9 text

What’s resampling?
Resampling is taking repeated samples from our data set.
“You randomly sample your random sample!”
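To make "sampling your sample" concrete, here is a minimal base R sketch. The ten values are illustrative toy data, and the choices of 3 resamples and 8 values per resample are assumptions made for the example:

```r
set.seed(2018)  # arbitrary seed so the draws are reproducible

# A toy "random sample" of 10 measurements (illustrative values only)
my_sample <- c(4, 8, 2, 5, 10, 7, 4, 9, 2, 6)

# Resample it three times: each time draw 8 of the 10 values at random
resamples <- replicate(3, sample(my_sample, size = 8), simplify = FALSE)

# The mean differs a little from resample to resample
resample_means <- sapply(resamples, mean)
resample_means
```

The spread of `resample_means` is a first hint of how much the statistic would move if we had collected slightly different data.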

Slide 10

Slide 10 text

Focus
✤ Sampling: estimate a statistic of interest (mean, median, regression coefficients)
✤ Resampling: gauge the variability or uncertainty in our statistic of interest (standard error, standard deviation, confidence intervals)

Slide 11

Slide 11 text

What does resampling look like?

Slide 12

Slide 12 text

www.appliedpredictivemodeling.com 3 Iterations of Resampling

Slide 13

Slide 13 text

You can use resampling to gauge the performance & variability in your model. Iteration 1

Slide 14

Slide 14 text

www.appliedpredictivemodeling.com Iteration 2

Slide 15

Slide 15 text

www.appliedpredictivemodeling.com Iteration 3

Slide 16

Slide 16 text

Initial Split into Test & Train
We only have so much data, so budget wisely!
We take a random portion of our data to train (fit our model) on. And we set aside the rest to test (or validate) our models on.
Luckily the initial_split function in rsample makes it easy!
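Conceptually, an initial split just picks a random set of row numbers for training and keeps the rest for testing. A minimal base R sketch of that idea (not rsample's actual implementation), assuming toy data and an 80% training proportion chosen for illustration:

```r
set.seed(123)  # arbitrary seed for reproducibility

# Toy data: 10 rows (illustrative only)
toy_df <- data.frame(id = 1:10, value = c(4, 8, 2, 5, 10, 7, 4, 9, 2, 6))

# Pick 80% of the row numbers at random for training (assumed proportion)
n <- nrow(toy_df)
train_rows <- sample(n, size = round(0.8 * n))

train_df <- toy_df[train_rows, ]   # rows used to fit the model
test_df  <- toy_df[-train_rows, ]  # held-out rows used to evaluate it

nrow(train_df)  # 8
nrow(test_df)   # 2
```

No row appears in both pieces, which is what keeps the test set an honest check.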

Slide 17

Slide 17 text

# Assumes library(tidyverse) and library(tidymodels) are already loaded

# Make a dataset of meetup attendees' favorite numbers
names_df <- tribble(
  ~name,    ~fav_num,
  "Dana",   4,
  "Irene",  8,
  "Mara",   2,
  "Ming",   5,
  "Chaita", 10,
  "Jen",    7,
  "Becky",  4,
  "Jenny",  9,
  "Mine",   2,
  "Emily",  6
)

# Explore data
boxplot(names_df$fav_num)
View(names_df)
nrow(names_df)
str(names_df)
summary(names_df)

# Initial Split
first_split <- initial_split(names_df)
train_names <- training(first_split)
test_names <- testing(first_split)

Slide 18

Slide 18 text

> initial_split(names_df)
<8/2/10>
Training: 8 data points
Test: 2 data points
Original: 10 observations

Slide 19

Slide 19 text

> training(first_split)
# A tibble: 8 x 2
  name   fav_num
1 Dana         4
2 Irene        8
3 Mara         2
4 Chaita      10
5 Jen          7
6 Jenny        9
7 Mine         2
8 Emily        6

Slide 20

Slide 20 text

> testing(first_split)
# A tibble: 2 x 2
  name  fav_num
1 Ming        5
2 Becky       4

Slide 21

Slide 21 text

Start a new script and try initial_split on a different data set. This time let’s use the attrition data set within the rsample package. Let’s load the data by typing:
data("attrition")

Slide 22

Slide 22 text

Today’s Resampling Sampler
1. Cross-validation
2. Bootstrapping

Slide 23

Slide 23 text

1. Cross Validation

Slide 24

Slide 24 text

V-Fold Cross Validation
https://github.com/topepo/rstudio-conf-2018
Used to test how the results of a statistical analysis will generalize to a new situation

Slide 25

Slide 25 text

Let’s try vfold_cv()
vfold_cv(data, v = 10, repeats = 1, strata = NULL, ...)
✤ data: data frame or tibble
✤ v: # of folds
✤ repeats: Optional: repeat the whole V-fold partitioning this many times.
✤ strata: Optional: stratify on a specific variable. (For example, if you want to make sure the proportion of each Species in the analysis data set is equal to the original iris data set.)
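What V-fold partitioning does can be sketched in a few lines of base R. This illustrates the idea only, not rsample's own code; the sizes `n = 9` and `v = 3` are assumptions chosen so the folds come out even:

```r
set.seed(42)  # arbitrary seed for reproducibility

n <- 9  # number of rows in a toy data set (assumed)
v <- 3  # number of folds (assumed)

# Give every row a fold label 1..v, then shuffle the labels
fold_id <- sample(rep(seq_len(v), length.out = n))

# For fold k: rows labelled k are held out (assessment),
# the remaining rows are used for fitting (analysis)
folds <- lapply(seq_len(v), function(k) {
  list(analysis = which(fold_id != k), assessment = which(fold_id == k))
})

# Each fold holds out 3 rows and keeps 6 for fitting
sapply(folds, function(f) length(f$assessment))
```

Every row is held out exactly once across the folds, so each observation gets one turn as "new" data.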

Slide 26

Slide 26 text

Let’s try vfold_cv()
✤ We will get an rset object
✤ Which contains many resamples
✤ Each resample is an rsplit object
It’s easier to understand when we see it in action. Let’s try.

Slide 27

Slide 27 text

my_folds <- vfold_cv(train_names, v = 3)
my_folds

# The `split` objects contain the information about the sample sizes
my_folds$splits[[1]]

# Use the `analysis` and `assessment` functions to get the data
analysis(my_folds$splits[[1]]) %>% dim()
analysis(my_folds$splits[[1]])
assessment(my_folds$splits[[1]]) %>% dim()
assessment(my_folds$splits[[1]])

Slide 28

Slide 28 text

✤ Great! We have many tidy resamples!
✤ We can use functional programming functions from the purrr package to fit a model on each resample.
✤ To avoid information overload, let’s save that for next time.
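As a small preview of mapping over resamples, here is a hedged sketch that assumes the rsample and purrr packages are installed. The "model" is deliberately the simplest one possible (the mean of the analysis set), and the data frame name `fav_df` is made up for this example:

```r
library(rsample)  # vfold_cv(), analysis(), assessment()
library(purrr)    # map_dbl(), map_int()

set.seed(2018)  # arbitrary seed

# Favorite numbers from the meetup example
fav_df <- data.frame(
  name = c("Dana", "Irene", "Mara", "Ming", "Chaita",
           "Jen", "Becky", "Jenny", "Mine", "Emily"),
  fav_num = c(4, 8, 2, 5, 10, 7, 4, 9, 2, 6)
)

folds <- vfold_cv(fav_df, v = 3)

# "Fit" on each analysis set: here, just the mean favorite number
fold_means <- map_dbl(folds$splits, function(split) {
  mean(analysis(split)$fav_num)
})

# Number of held-out rows in each fold
holdout_sizes <- map_int(folds$splits, function(split) {
  nrow(assessment(split))
})

fold_means
holdout_sizes
```

Swapping the mean for a call such as `lm()` follows the same pattern: one function applied to every element of `folds$splits`.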

Slide 29

Slide 29 text

2. Bootstrapping

Slide 30

Slide 30 text

Bootstrapping
✤ Resampling with replacement
✤ (Every value in the sample has an equal probability of being drawn each time, so a value can be included more than once)
✤ Can help you calculate statistics with less strict mathematical assumptions
Ex: put 10 paper slips with names in a hat, pick a name from the hat, write the name down, put the slip back in, repeat 10x
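The hat example translates almost word for word into base R. A minimal sketch; the seed is arbitrary and the names are the meetup attendees from the earlier example:

```r
set.seed(10)  # arbitrary seed

# The 10 names on the paper slips
hat <- c("Dana", "Irene", "Mara", "Ming", "Chaita",
         "Jen", "Becky", "Jenny", "Mine", "Emily")

# Draw 10 times, putting the slip back after each draw
boot_sample <- sample(hat, size = length(hat), replace = TRUE)

# How often each drawn name appeared
draw_counts <- table(boot_sample)
draw_counts

# Names that were never drawn in this bootstrap sample
never_drawn <- setdiff(hat, boot_sample)
never_drawn
```

In rsample, the rows that were never drawn are what `assessment()` returns for a bootstrap split.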

Slide 31

Slide 31 text

✤ Bootstrapping is usually used to help us make sense of the distribution of a statistic of interest.
✤ But we can also use bootstrapping to perform inference.

Slide 32

Slide 32 text

Bootstraps function
bootstraps(data, times = 25, strata = NULL, apparent = FALSE, ...)
✤ data: data frame or tibble
✤ times: # of bootstraps (Recommend starting with 1000.)
✤ strata: Optional: stratify on a specific variable. (For example, if you want to make sure the proportion of each Species in the bootstrap sample is equal to the original iris data set.)
✤ apparent: Optional: keep a copy of the original data. (Some estimators require it.)

Slide 33

Slide 33 text

Take a look at one of the bootstraps
many_boots <- bootstraps(attrition)
# Each resample lives in the `splits` list-column
one_boot <- many_boots$splits[[1]]
analysis(one_boot)
assessment(one_boot)

Slide 34

Slide 34 text

Let’s say we wanted to see if there’s a marked difference in median income between men and women.
An example from Max’s rsample vignette.
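A hedged sketch of how such a comparison can be set up, following the general approach of the rsample vignette. Assumptions: the rsample, purrr, and modeldata packages are installed; the attrition data is loaded from modeldata (where recent releases keep it); and 500 resamples, the seed, and the helper name `median_diff` are illustrative choices:

```r
library(rsample)  # bootstraps(), analysis()
library(purrr)    # map_dbl()

# Assumption: recent releases ship the attrition data in the modeldata
# package; with the 2018 version of rsample it was data("attrition").
data("attrition", package = "modeldata")

# Difference in median monthly income (female minus male) for one resample
median_diff <- function(split) {
  x <- analysis(split)
  median(x$MonthlyIncome[x$Gender == "Female"]) -
    median(x$MonthlyIncome[x$Gender == "Male"])
}

set.seed(353)  # arbitrary seed
bt_resamples <- bootstraps(attrition, times = 500)

# One difference per bootstrap resample
bt_resamples$wage_diff <- map_dbl(bt_resamples$splits, median_diff)
```

The `wage_diff` column then holds the bootstrap distribution of the statistic, ready to be summarized.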

Slide 35

Slide 35 text

95% Confidence Interval (Percentile Method)
quantile(bt_resamples$wage_diff, probs = c(0.025, 0.500, 0.975))
#>  2.5%   50% 97.5%
#>  -207   190   618
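The percentile method itself needs nothing beyond base R. A minimal sketch on the favorite-number toy data, with 1000 resamples and the median as the statistic (both are assumptions for illustration, and the numbers it produces are unrelated to the wage example):

```r
set.seed(815)  # arbitrary seed

# Favorite numbers from the meetup example
fav_num <- c(4, 8, 2, 5, 10, 7, 4, 9, 2, 6)

# Median of each of 1000 bootstrap resamples (drawn with replacement)
boot_medians <- replicate(1000, median(sample(fav_num, replace = TRUE)))

# Percentile method: the middle 95% of the bootstrap medians
ci <- quantile(boot_medians, probs = c(0.025, 0.500, 0.975))
ci
```

The 2.5% and 97.5% quantiles are the interval endpoints; the 50% quantile is the center of the bootstrap distribution.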

Slide 36

Slide 36 text

Is there a difference in median monthly income between groups?

Slide 37

Slide 37 text

Summer Project
✤ Bootstrap confidence intervals
✤ Percentile
✤ Student-t
✤ Bias-Corrected and Accelerated (BCa)

Slide 38

Slide 38 text

Live demo (still in dev mode, sit back and relax)
www.github.com/fbchow/rsample

Slide 39

Slide 39 text

Thanks! Questions?