GOOD PRACTICES FOR APPLIED MACHINE LEARNING
Model Development to Model Deployment
Julia Silge (@juliasilge) and Max Kuhn (@topepos)

Polls at (event code = RKEYNOTE)

What is tidymodels?

Yet another modeling framework

Spotify data

popularity  artist   date        duration  genres
20          5BcZ22X  1969-11-21     33520  ['album rock', 'art rock', 'british invasion', 'classi…
46          4KXp3xt  1999-12-06    273949  ['classic swedish pop', 'swedish alternative rock', …
17          5MnhtFX  1979-07-26    202720  ['cante flamenco', 'flamenco', 'rumba']
19          6zpuH5T  1989-01-01    228444  ['classic indo pop', 'dangdut', 'lagu sunda']
44          4q2SZId  2014-02-04    226531  ['turkish rock']
18          6EdXBTj  1984-03-14    263160  ['greek pop', 'laiko']
21          3LOaK3K  1977-11-12    188240  ['chanson', 'french pop', 'ye ye']
42          1t17z3v  2017-12-07    138136  ['j-pop', 'j-rock', 'japanese alternative rock', …
23          3f626JS  1974-01-01    216200  []
18          2bjzKs2  1987-08-25    257893  ['glam metal', 'hard rock', 'j-metal', 'japanese…

What you think your script looks like
• Modeling function call
• Call to predict

The reality of your script
• Use effect encoding to convert the artist into features
• Transformations, etc. etc.
• Convert date to year, month, etc.
• Parse multiple choice data and convert to indicators
• Too many features! Do some analysis to filter some out
• Actually fit your model!

The reality of your script
• Almost all of these involve estimation and are part of the modeling process.
• Most of these steps also need to be applied during prediction.
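To make "estimation" concrete, here is a minimal base R sketch of an effect encoding. It is a simplified stand-in, not what the slides use: it encodes each artist as the plain training-set mean of popularity, and the toy data, the `encode_artist()` helper, and the fallback to the overall mean for unseen artists are all assumptions made for illustration. The point is that the encoding is estimated once from the training data and then the same stored estimates are reused at prediction time.

```r
# Toy training data and new samples (hypothetical values)
train <- data.frame(
  artist     = c("a1", "a1", "a2", "a2", "a3"),
  popularity = c(20, 30, 40, 50, 10)
)
new_samples <- data.frame(artist = c("a2", "a9"))  # "a9" never appears in training

# Estimation step: uses the training set only
artist_means <- vapply(split(train$popularity, train$artist), mean, numeric(1))
overall_mean <- mean(train$popularity)

# Application step: reuses the stored estimates, never re-estimates them
encode_artist <- function(artist) {
  enc <- unname(artist_means[as.character(artist)])
  ifelse(is.na(enc), overall_mean, enc)  # assumed fallback for unseen artists
}

train$artist_enc       <- encode_artist(train$artist)
new_samples$artist_enc <- encode_artist(new_samples$artist)
new_samples
```

Because the lookup table is part of what was learned from the data, it has to travel with the model to wherever predictions are made.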

Cognitive load and machine learning

Do you have experience with these “roles”?
Post at (event code = RKEYNOTE)
A. Outcome
B. Predictor
C. Case weight
D. Stratification variable
E. Censoring indicator
F. Offsets

Reduce your cognitive load

library(tidymodels)
music_split <- initial_split(spotify)
recipe(popularity ~ ., data = training(music_split))
#> Recipe
#>
#> Inputs:
#>
#>       role #variables
#>    outcome          1
#>  predictor          7

Reduce your cognitive load

library(tidymodels)
music_split <- initial_split(spotify)
recipe(popularity ~ ., data = training(music_split)) %>%
  step_normalize(all_numeric_predictors())
#> Recipe
#>
#> Inputs:
#>
#>       role #variables
#>    outcome          1
#>  predictor          7
#>
#> Operations:
#>
#> Centering and scaling for all_numeric_predictors()
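The recipe above is only a specification until it is trained. As a hedged sketch of what happens next, the following uses the built-in `mtcars` data as a stand-in for the Spotify data (an assumption, since that data set is not distributed with these slides). `prep()` estimates the means and standard deviations from the training set, and `bake()` applies those stored values to other data.

```r
library(tidymodels)

set.seed(1)
car_split <- initial_split(mtcars)  # mtcars stands in for the Spotify data

car_rec <- recipe(mpg ~ ., data = training(car_split)) %>%
  step_normalize(all_numeric_predictors())

# Estimate centering and scaling values from the training set only
car_prepped <- prep(car_rec)
tidy(car_prepped, number = 1)  # the stored means and standard deviations

# Apply the training-set statistics to the held-out rows
baked_test <- bake(car_prepped, new_data = testing(car_split))
head(baked_test)
```

In practice a workflow calls `prep()` and `bake()` for you during `fit()` and `predict()`, so these two functions rarely need to be called by hand.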

Recall your script
• Use effect encoding to convert the artist into features
• Transformations, etc. etc.
• Convert date to year, month, etc.
• Parse multiple choice data and convert to indicators
• Too many features! Do some analysis to filter some out
• Actually fit your model!

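The reality of your tidymodels script

Recipes and workflows encapsulate all of your preprocessing and modeling operations into a single interface.

library(tidymodels)
library(embed)          # step_lencode_bayes()
library(bestNormalize)  # step_best_normalize()
library(rules)          # cubist_rules()

spotify_rec <-
  recipe(popularity ~ ., data = training(music_split)) %>%
  # Effect encoding: convert the artist into a numeric feature
  step_lencode_bayes(
    artist,
    outcome = vars(popularity)
  ) %>%
  # Transformation of duration
  step_best_normalize(duration) %>%
  # Convert date to year, month, etc.
  step_date(date, keep_original_cols = FALSE) %>%
  # Parse the multiple choice genres and convert to indicators
  step_dummy_extract(
    genres,
    pattern = "(?<=')[^',]+(?=')"
  ) %>%
  # Filter out highly correlated features
  step_corr(all_numeric_predictors())

# spotify_train is the training set, e.g. training(music_split)
cubist_fit <-
  workflow(spotify_rec, cubist_rules()) %>%
  fit(data = spotify_train)

# One call applies every preprocessing step and then the model
predict(cubist_fit, new_samples)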
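To see what the `pattern` argument above extracts, here is a small base R sketch that applies the same regular expression to a few genre strings taken from the Spotify data slide. The indicator matrix built at the end is a hand-rolled illustration under simple assumptions; the columns that `step_dummy_extract()` actually produces are named and filtered by the step itself.

```r
genres <- c(
  "['greek pop', 'laiko']",
  "['turkish rock']",
  "[]",
  "['chanson', 'french pop', 'ye ye']"
)

# Same pattern as in the recipe: text between single quotes, excluding commas
pattern <- "(?<=')[^',]+(?=')"
parsed  <- regmatches(genres, gregexpr(pattern, genres, perl = TRUE))
parsed[[1]]
parsed[[3]]  # an empty list of genres yields no matches

# One 0/1 indicator column per distinct genre (illustrative only)
genre_levels <- sort(unique(unlist(parsed)))
indicators <- t(vapply(
  parsed,
  function(g) as.integer(genre_levels %in% g),
  integer(length(genre_levels))
))
colnames(indicators) <- genre_levels
indicators
```

The lookbehind and lookahead keep the quotes out of the match, and excluding commas stops the `', '` separator between two entries from being matched as a genre.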

Get your model off your machine

Model deployment and YOU
Post at (event code = RKEYNOTE)
A. I have deployed a model to production.
B. I have never deployed a model to production.
C. What does “production” mean?

Collect data → Understand and clean data → Train and evaluate model → Version model → Deploy model → Monitor model
(diagram label: caret)

Get your model off your machine

library(vetiver)
v <- vetiver_model(cubist_fit, "spotify_rules")
v
#>
#> ── spotify_rules ─ model for deployment
#> A Cubist regression modeling workflow using 4 features
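The "version model" stage can be sketched with vetiver and pins. This example is an assumption-laden illustration rather than the talk's own code: it uses a small `lm()` fit on `mtcars` in place of the Cubist workflow, the name "cars_lm" is made up, and it writes to a temporary board that disappears when the R session ends.

```r
library(vetiver)
library(pins)

# Stand-in model: a linear regression on built-in data
cars_fit <- lm(mpg ~ wt + hp, data = mtcars)
v_cars   <- vetiver_model(cars_fit, "cars_lm")

# A temporary, versioned board; a real deployment would use a shared board
board <- board_temp(versioned = TRUE)
vetiver_pin_write(board, v_cars)

# Inspect stored versions and read the model back
pin_versions(board, "cars_lm")
v_restored <- vetiver_pin_read(board, "cars_lm")
v_restored$model_name
```

Writing the model again after retraining adds a new version to the same pin, which is what lets a deployed API be pointed at a specific version later.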

Get your model off your machine

library(plumber)
pr() %>%
  vetiver_api(v)
#> # Plumber router with 2 endpoints, 4 filters, and 1 sub-router.
#> # Use `pr_run()` on this object to start the API.
#> ├──[queryString]
#> ├──[body]
#> ├──[cookieParser]
#> ├──[sharedSecret]
#> ├──/logo
#> ├──/ping (GET)
#> └──/predict (POST)

Have you ever tuned a model?
Post at (event code = RKEYNOTE)
A. No
B. Yes, but it was painful
C. Yes, but I’m not sure if it was effective
D. Yes, and it was easy!
E. Yes, and I used an advanced method like racing

Some advanced techniques in tidymodels

MODEL TUNING
• Racing methods
• Bayesian optimization

FEATURE EMBEDDING METHODS
• UMAP
• isoMap
• Effect encodings

RESAMPLE → ANALYZE → FILTER
Racing is a method that eliminates model configurations as they are resampled.

Racing methods for effective model selection
• Fit only 7.2% of total models
• Speed-ups
  • With parallel processing, 2.8-fold
  • Without, 9-fold
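The resample, analyze, filter loop can be illustrated with a toy simulation in base R. Everything here is an assumption made for illustration: the per-fold RMSE values are simulated rather than produced by real models, and the filter is a one-sided paired t-test against the current best candidate, which is a simplified stand-in and not the analysis that the tidymodels racing functions perform.

```r
set.seed(2022)
n_folds <- 10

# Simulated per-fold RMSE for five hypothetical candidate configurations
true_rmse <- c(a = 1.0, b = 1.05, c = 1.5, d = 2.0, e = 2.5)
fold_rmse <- sapply(true_rmse, function(m) rnorm(n_folds, mean = m, sd = 0.1))

burn_in <- 3                      # folds evaluated before any filtering
alive   <- colnames(fold_rmse)    # candidates still in the race
fits    <- 0                      # number of model fits actually needed

for (i in seq_len(n_folds)) {
  # RESAMPLE: only surviving candidates are fit on this fold
  fits <- fits + length(alive)
  if (i < burn_in || length(alive) == 1) next

  # ANALYZE: compare each candidate with the current best on the folds so far
  seen <- fold_rmse[seq_len(i), alive, drop = FALSE]
  best <- alive[which.min(colMeans(seen))]
  keep <- vapply(alive, function(cand) {
    if (cand == best) return(TRUE)
    p <- t.test(seen[, cand], seen[, best],
                paired = TRUE, alternative = "greater")$p.value
    p > 0.05
  }, logical(1))

  # FILTER: drop candidates that are clearly worse than the best
  alive <- alive[keep]
}

alive
fits / (n_folds * ncol(fold_rmse))  # fraction of the full grid that was fit
```

The savings come from the FILTER step: once a configuration is dropped, none of the remaining resamples are spent on it.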

What else is tidymodels? YOU

The reality of your script
• Almost all of these involve estimation and are part of the modeling process.
• Most also require matching code in the script that handles the prediction steps.

If you use feature selection with 10-fold cross-validation, you will select features:
Post at (event code = RKEYNOTE)
A. 1 time
B. 10 times
C. 11 times
D. 3.14159 times

Model Workflow


Leakage and the Reproducibility Crisis in ML-based Science
Kapoor and Narayanan (2022)

We show that data leakage is indeed a widespread problem and has led to severe reproducibility failures. Specifically, through a survey of literature in research communities that adopted ML methods, we find 17 fields where errors have been found, collectively affecting 329 papers and in some cases leading to wildly overoptimistic conclusions.

Proper data usage and validation

library(recipeselectors)

selection_rec <-
  spotify_rec %>%
  step_select_vip(
    artist,
    outcome = vars(popularity),
    top_p = tune()
  )

lm_res <-
  workflow(selection_rec, linear_reg()) %>%
  tune_grid(resamples = folds, grid = 25)

The workflow makes sure that the appropriate computations are used with the right data at the right time.
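What "the right data at the right time" means for feature selection can be sketched in base R. This is a hand-rolled illustration on simulated data, not the workflow machinery: the correlation filter, the `select_top()` and `fit_selected()` helpers, and the choice of five features are all assumptions. The selection is redone inside every fold using only that fold's analysis set, and once more when the final model is fit on all of the training data.

```r
set.seed(123)
n <- 200
p <- 50
x <- matrix(rnorm(n * p), n, p,
            dimnames = list(NULL, paste0("x", seq_len(p))))
y <- 2 * x[, "x1"] - x[, "x2"] + rnorm(n)   # only x1 and x2 matter
dat <- data.frame(y = y, x)

selections <- 0  # counts how often the selection step runs

# Keep the k predictors most correlated with the outcome in the data given
select_top <- function(d, k = 5) {
  selections <<- selections + 1
  scores <- abs(cor(d[, -1], d$y))[, 1]
  names(sort(scores, decreasing = TRUE))[seq_len(k)]
}

# Selection and model fitting are one unit, like a workflow
fit_selected <- function(d) {
  vars <- select_top(d)
  lm(reformulate(vars, response = "y"), data = d)
}

folds <- sample(rep(seq_len(10), length.out = n))
rmse <- vapply(seq_len(10), function(f) {
  analysis   <- dat[folds != f, ]
  assessment <- dat[folds == f, ]   # never seen by the selection step
  fit <- fit_selected(analysis)
  sqrt(mean((assessment$y - predict(fit, assessment))^2))
}, numeric(1))

final_fit <- fit_selected(dat)      # one more selection for the final model
selections  # 11
```

Running the selection once on all of the data before resampling would let the assessment rows influence which features are kept, which is the kind of leakage that makes performance estimates overoptimistic.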

When was the last time you locked your keys in the car?

I last locked my keys in my car:
Post at (event code = RKEYNOTE)
A. ~1 week ago
B. ~1 month ago
C. ~1 year ago
D. ~10 years ago
E. I have never locked my keys in my car.

It’s hard to lock my keys in my current car
Protect against common failure modes

Immediately painful problem

Problem that is painful later

David Robinson: Some of the resistance I’ve seen to tidymodels comes from a place of “This makes it too easy—you’re not thinking carefully about what the code is doing!” But I think this is getting it backwards. By removing the burden of writing procedural logic, I get to focus on scientific and statistical questions about my data and model.

tidymodels is ready for your real ML tasks

Thank You

Also thanks to
TIDYMODELS TEAM: Simon Couch, Hannah Frick, Emil Hvitfeldt, Davis Vaughan
VETIVER TEAM: Michael Chow, Isabel Zimmerman
AND: Jenny Bryan, Mine Cetinkaya-Rundel, Matt Dancho, Alison Hill, Allison Horst, Edgar Ruiz, Hadley Wickham

Special thanks to tidymodels contributors…

