Slide 1

Slide 1 text

Julia Silge (@juliasilge) and Max Kuhn (@topepos)
GOOD PRACTICES FOR APPLIED MACHINE LEARNING
Model Development to Model Deployment

Slide 2

Slide 2 text

Polls at slido.com (event code = RKEYNOTE)
GOOD PRACTICES FOR APPLIED MACHINE LEARNING
Model Development to Model Deployment

Slide 3

Slide 3 text

What is tidymodels?

Slide 4

Slide 4 text

Yet another modeling framework

Slide 5

Slide 5 text

Spotify data

popularity  artist   date        duration  genres
20          5BcZ22X  1969-11-21  33520     ['album rock', 'art rock', 'british invasion', 'classi…
46          4KXp3xt  1999-12-06  273949    ['classic swedish pop', 'swedish alternative rock', …
17          5MnhtFX  1979-07-26  202720    ['cante flamenco', 'flamenco', 'rumba']
19          6zpuH5T  1989-01-01  228444    ['classic indo pop', 'dangdut', 'lagu sunda']
44          4q2SZId  2014-02-04  226531    ['turkish rock']
18          6EdXBTj  1984-03-14  263160    ['greek pop', 'laiko']
21          3LOaK3K  1977-11-12  188240    ['chanson', 'french pop', 'ye ye']
42          1t17z3v  2017-12-07  138136    ['j-pop', 'j-rock', 'japanese alternative rock', …
23          3f626JS  1974-01-01  216200    []
18          2bjzKs2  1987-08-25  257893    ['glam metal', 'hard rock', 'j-metal', 'japanese…

Slide 6

Slide 6 text

Spotify data (same ten-row table as on Slide 5)
Callout (popularity column): Outcome for model

Slide 7

Slide 7 text

Spotify data (same ten-row table as on Slide 5)
Callout (artist column): Qualitative predictor with a very large number of values

Slide 8

Slide 8 text

Spotify data (same ten-row table as on Slide 5)
Callout (date column): Column that we want to convert to multiple features

Slide 9

Slide 9 text

Spotify data (same ten-row table as on Slide 5)
Callout (genres column): Multiple choice field to parse and convert to many indicator variables

Slide 10

Slide 10 text

What you think your script looks like
• Modeling function call
• Call to predict

Slide 11

Slide 11 text

The reality of your script
• Use effect encoding to convert the artist into features
• Transformations, etc. etc.
• Convert date to year, month, etc.
• Parse multiple choice data and convert to indicators
• Too many features! Do some analysis to filter some out
• Actually fit your model!

Slide 12

Slide 12 text

The reality of your script
• Almost all of these involve estimation and are part of the modeling process
• Most of these steps also need to be applied during prediction
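A minimal base R sketch of why these steps count as estimation: the statistics are learned from the training data only and then reused, unchanged, at prediction time. The toy data, the helper names `estimate_preproc()` and `apply_preproc()`, and the plain per-artist mean encoding are all assumptions made for illustration. The mean encoding is a simplified stand-in for the Bayesian effect encoding shown later in the deck.

```r
# Toy training data (hypothetical values, not the real Spotify data)
train <- data.frame(
  artist     = c("a", "a", "b", "c"),
  duration   = c(200, 220, 180, 240),
  popularity = c(20, 30, 40, 10)
)

# Estimate every preprocessing quantity from the training set only
estimate_preproc <- function(d) {
  list(
    artist_means = vapply(split(d$popularity, d$artist), mean, numeric(1)),
    overall_mean = mean(d$popularity),
    dur_mean     = mean(d$duration),
    dur_sd       = sd(d$duration)
  )
}

# Apply the stored estimates to any data set (training or new samples)
apply_preproc <- function(d, est) {
  enc <- unname(est$artist_means[as.character(d$artist)])
  enc[is.na(enc)] <- est$overall_mean  # unseen artists fall back to the overall mean
  data.frame(
    artist_enc = enc,
    duration   = (d$duration - est$dur_mean) / est$dur_sd
  )
}

est <- estimate_preproc(train)

# New samples at prediction time, including an artist never seen in training
new_samples <- data.frame(artist = c("b", "z"), duration = c(210, 210))
new_processed <- apply_preproc(new_samples, est)
```

Nothing is re-estimated from `new_samples`; that separation is the bookkeeping a recipe is meant to handle.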

Slide 13

Slide 13 text

• ERGONOMIC
• EFFECTIVE
• SAFE

Slide 14

Slide 14 text

ERGONOMIC

Slide 15

Slide 15 text

Cognitive load and machine learning

Slide 16

Slide 16 text

Do you have experience with these “roles”?
Post at slido.com (event code = RKEYNOTE)
A. Outcome
B. Predictor
C. Case weight
D. Stratification variable
E. Censoring indicator
F. Offsets

Slide 17

Slide 17 text

Reduce your cognitive load

library(tidymodels)

music_split <- initial_split(spotify)

recipe(popularity ~ ., data = training(music_split))
#> Recipe
#>
#> Inputs:
#>
#>       role #variables
#>    outcome          1
#>  predictor          7

Slide 18

Slide 18 text

Reduce your cognitive load

library(tidymodels)

music_split <- initial_split(spotify)

recipe(popularity ~ ., data = training(music_split)) %>%
  step_normalize(all_numeric_predictors())
#> Recipe
#>
#> Inputs:
#>
#>       role #variables
#>    outcome          1
#>  predictor          7
#>
#> Operations:
#>
#> Centering and scaling for all_numeric_predictors()

Slide 19

Slide 19 text

Recall your script
• Use effect encoding to convert the artist into features
• Transformations, etc. etc.
• Convert date to year, month, etc.
• Parse multiple choice data and convert to indicators
• Too many features! Do some analysis to filter some out
• Actually fit your model!

Slide 20

Slide 20 text

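The reality of your tidymodels script

Recipes and workflows encapsulate all of your preprocessing and modeling operations into a single interface

library(tidymodels)
library(embed)          # provides step_lencode_bayes()
library(bestNormalize)  # provides step_best_normalize()
library(rules)          # provides the Cubist engine for cubist_rules()

spotify_train <- training(music_split)

spotify_rec <-
  recipe(popularity ~ ., data = training(music_split)) %>%
  # effect encoding for the high-cardinality artist column
  step_lencode_bayes(
    artist,
    outcome = vars(popularity)
  ) %>%
  # pick a normalizing transformation for duration
  step_best_normalize(duration) %>%
  # convert the date into multiple date-based features
  step_date(date, keep_original_cols = FALSE) %>%
  # parse the multiple choice genres field into indicator columns
  step_dummy_extract(
    genres,
    pattern = "(?<=')[^',]+(?=')"
  ) %>%
  # filter out highly correlated predictors
  step_corr(all_numeric_predictors())

cubist_fit <-
  workflow(spotify_rec, cubist_rules()) %>%
  fit(data = spotify_train)

predict(cubist_fit, new_samples)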
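To make the `pattern` argument above less opaque, here is a small base R sketch of what that regular expression pulls out of a `genres` string and how the pieces could become indicator columns. It uses only `gregexpr()` and `regmatches()`, not recipes itself. The three example strings are taken from the Spotify table shown earlier, and the indicator construction is an illustrative assumption, not the internals of `step_dummy_extract()`.

```r
genres <- c(
  "['cante flamenco', 'flamenco', 'rumba']",
  "['turkish rock']",
  "[]"
)

# Same pattern as on the slide: text between single quotes that
# contains neither a quote nor a comma
pattern <- "(?<=')[^',]+(?=')"

# One character vector of genre labels per row (empty for "[]")
parsed <- regmatches(genres, gregexpr(pattern, genres, perl = TRUE))

# Build a 0/1 indicator matrix with one column per distinct genre
all_genres <- sort(unique(unlist(parsed)))
indicators <- t(vapply(
  parsed,
  function(g) as.integer(all_genres %in% g),
  integer(length(all_genres))
))
colnames(indicators) <- all_genres
```

Excluding the comma in `[^',]+` keeps the `, ` separators between quoted entries from being picked up as labels.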

Slide 21

Slide 21 text

Get your model off your machine

Slide 22

Slide 22 text

Model deployment and YOU
Post at slido.com (event code = RKEYNOTE)
A. I have deployed a model to production.
B. I have never deployed a model to production.
C. What does “production” mean?

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

• Collect data
• Understand and clean data
• Train and evaluate model
• Version model
• Deploy model
• Monitor model

Slide 25

Slide 25 text

Get your model off your machine
vetiver.rstudio.com

library(vetiver)

v <- vetiver_model(cubist_fit, "spotify_rules")
v
#>
#> ── spotify_rules ─ model for deployment
#> A Cubist regression modeling workflow using 4 features

Slide 26

Slide 26 text

Get your model off your machine
vetiver.rstudio.com

library(plumber)

pr() %>%
  vetiver_api(v)
#> # Plumber router with 2 endpoints, 4 filters, and 1 sub-router.
#> # Use `pr_run()` on this object to start the API.
#> ├──[queryString]
#> ├──[body]
#> ├──[cookieParser]
#> ├──[sharedSecret]
#> ├──/logo
#> ├──/ping (GET)
#> └──/predict (POST)

Slide 27

Slide 27 text

EFFECTIVE

Slide 28

Slide 28 text

Have you ever tuned a model?
Post at slido.com (event code = RKEYNOTE)
A. No
B. Yes, but it was painful
C. Yes, but I’m not sure if it was effective
D. Yes, and it was easy!
E. Yes, and I used an advanced method like racing

Slide 29

Slide 29 text

Some advanced techniques in tidymodels

MODEL TUNING
• Racing methods
• Bayesian optimization

FEATURE EMBEDDING METHODS
• UMAP
• Isomap
• Effect encodings

RESAMPLE, ANALYZE, FILTER
Racing is a method that eliminates model configurations as they are resampled

Slide 30

Slide 30 text

Racing methods for effective model selection
• Fit only 7.2% of total models
• Speed-ups
  • With parallel processing, 2.8-fold
  • Without, 9-fold
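The resample, analyze, filter loop behind racing can be sketched in base R. This is a toy illustration under stated assumptions: the per-configuration RMSE values are simulated rather than coming from real model fits, and a one-sided paired t-test against the current best configuration stands in for whatever statistical filter a real racing implementation uses (the finetune package, for instance, offers `tune_race_anova()`). The grid size, burn-in length, and 0.05 cutoff are likewise hypothetical.

```r
set.seed(1)

n_configs   <- 20   # candidate tuning parameter combinations
n_resamples <- 10   # e.g. 10-fold cross-validation
burn_in     <- 3    # resamples evaluated before any filtering starts

# Hypothetical "true" RMSE per configuration, plus resampling noise
true_rmse <- seq(1, 3, length.out = n_configs)
simulate_rmse <- function(i) true_rmse[i] + rnorm(1, sd = 0.1)

active  <- seq_len(n_configs)
results <- matrix(NA_real_, nrow = n_resamples, ncol = n_configs)
n_fits  <- 0

for (r in seq_len(n_resamples)) {
  # RESAMPLE: evaluate only the configurations still in the race
  for (i in active) {
    results[r, i] <- simulate_rmse(i)
    n_fits <- n_fits + 1
  }

  if (r >= burn_in && length(active) > 1) {
    # ANALYZE: compare each configuration with the current best
    means <- colMeans(results[seq_len(r), active, drop = FALSE])
    best  <- active[which.min(means)]
    keep <- vapply(active, function(i) {
      if (i == best) return(TRUE)
      d <- results[seq_len(r), i] - results[seq_len(r), best]
      t.test(d, alternative = "greater")$p.value > 0.05
    }, logical(1))

    # FILTER: drop configurations that are significantly worse
    active <- active[keep]
  }
}

fraction_fit <- n_fits / (n_configs * n_resamples)
```

Because clearly poor configurations leave the race after the burn-in, `fraction_fit` ends up below 1; how far below depends entirely on the simulated numbers here and says nothing about the figures quoted on the slide.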

Slide 31

Slide 31 text

Extensible

Slide 32

Slide 32 text

What else is tidymodels? YOU

Slide 33

Slide 33 text

Modular

Slide 34

Slide 34 text

SAFE

Slide 35

Slide 35 text

The reality of your script
• Almost all of these involve estimation and are part of the modeling process
• Most also require code for the script with the prediction steps

Slide 36

Slide 36 text

If you use feature selection with 10-fold cross-validation, you will select features:
Post at slido.com (event code = RKEYNOTE)
A. 1 time
B. 10 times
C. 11 times
D. 3.14159 times

Slide 37

Slide 37 text

Model Workflow

Slide 38

Slide 38 text

Model Workflow

Slide 39

Slide 39 text

Leakage and the Reproducibility Crisis in ML-based Science
Kapoor and Narayanan (2022)
reproducible.cs.princeton.edu

“We show that data leakage is indeed a widespread problem and has led to severe reproducibility failures. Specifically, through a survey of literature in research communities that adopted ML methods, we find 17 fields where errors have been found, collectively affecting 329 papers and in some cases leading to wildly overoptimistic conclusions.”

Slide 40

Slide 40 text

Proper data usage and validation

library(recipeselectors)

selection_rec <-
  spotify_rec %>%
  step_select_vip(
    artist,
    outcome = vars(popularity),
    top_p = tune()
  )

lm_res <-
  workflow(selection_rec, linear_reg()) %>%
  tune_grid(resamples = folds, grid = 25)

The workflow makes sure that the appropriate computations are used with the right data at the right time
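What "the right data at the right time" means for feature selection can be sketched by hand in base R. Everything here is an assumption for illustration: the data are simulated, `select_top()` and `fit_with_selection()` are hypothetical helpers, and a simple absolute-correlation filter stands in for the importance-based selection in the slide's code. The point is only that selection is repeated inside every resample, using that resample's analysis set, and then once more for the final fit.

```r
set.seed(42)

# Simulated data: 20 candidate predictors, only x1 and x2 matter
n <- 200
p <- 20
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", seq_len(p))))
dat <- data.frame(y = 2 * x[, 1] - x[, 2] + rnorm(n), x)

n_selections <- 0  # counts how often feature selection is run

# Pick the top_p predictors by absolute correlation with the outcome
select_top <- function(d, top_p = 5) {
  n_selections <<- n_selections + 1
  preds <- setdiff(names(d), "y")
  score <- abs(cor(d[preds], d$y))[, 1]
  names(sort(score, decreasing = TRUE))[seq_len(top_p)]
}

# Selection is part of fitting, so it only ever sees the data passed in
fit_with_selection <- function(d) {
  keep <- select_top(d)
  lm(reformulate(keep, response = "y"), data = d)
}

# 10-fold cross-validation: select and fit on the analysis set,
# measure RMSE on the held-out assessment set
folds <- sample(rep(seq_len(10), length.out = n))
rmse <- vapply(seq_len(10), function(k) {
  analysis   <- dat[folds != k, ]
  assessment <- dat[folds == k, ]
  fit <- fit_with_selection(analysis)
  sqrt(mean((assessment$y - predict(fit, assessment))^2))
}, numeric(1))

# Final model: selection runs once more on the full training set
final_fit <- fit_with_selection(dat)
```

Under this accounting `n_selections` ends at 11: once per fold plus once for the final fit. Selecting features a single time on all of the data before resampling would let the assessment rows influence the selection, which is the kind of leakage the preceding slide describes.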

Slide 41

Slide 41 text

When was the last time you locked your keys in the car?

Slide 42

Slide 42 text

I last locked my keys in my car:
Post at slido.com (event code = RKEYNOTE)
A. ~1 week ago
B. ~1 month ago
C. ~1 year ago
D. ~10 years ago
E. I have never locked my keys in my car.

Slide 43

Slide 43 text

It’s hard to lock my keys in my current car
Protect against common failure modes

Slide 44

Slide 44 text

Immediately painful problem

Slide 45

Slide 45 text

Problem that is painful later

Slide 46

Slide 46 text

David Robinson:
“Some of the resistance I’ve seen to tidymodels comes from a place of ‘This makes it too easy: you’re not thinking carefully about what the code is doing!’ But I think this is getting it backwards. By removing the burden of writing procedural logic, I get to focus on scientific and statistical questions about my data and model.”
varianceexplained.org/r/sliced-ml/

Slide 47

Slide 47 text

tidymodels is ready for your real ML tasks

Slide 48

Slide 48 text

No content

Slide 49

Slide 49 text

tmwr.org

Slide 50

Slide 50 text

Thank You

Slide 51

Slide 51 text

TIDYMODELS TEAM: Simon Couch, Hannah Frick, Emil Hvitfeldt, Davis Vaughan
VETIVER TEAM: Michael Chow, Isabel Zimmerman
Also thanks to: Jenny Bryan, Mine Cetinkaya-Rundel, Matt Dancho, Alison Hill, Allison Horst, Edgar Ruiz, Hadley Wickham

Slide 52

Slide 52 text

Special thanks to tidymodels contributors…

Slide 53

Slide 53 text

No content

Slide 54

Slide 54 text

Image sources
• Traffic guards: https://commons.wikimedia.org/wiki/File:FitchBarrels2008.jpg
• Ergonomic chair: https://pixabay.com/illustrations/chair-office-chair-work-chair-5020284/
• B&W Legos: https://unsplash.com/photos/2Ip_wpgLoyw
• Rainbow Legos: https://unsplash.com/photos/gNMVpAPe3PE
• Old car keys: https://unsplash.com/photos/7mgR-BZ5Dm4
• Antique gas pump: https://unsplash.com/photos/QBUVgT32mOo

Slide 55

Slide 55 text

Image sources continued
• Classic car: https://unsplash.com/photos/RjMgc6GXpHg
• Driver in car: https://unsplash.com/photos/IAc1x02D9K0
• Seatbelt light: https://www.flickr.com/photos/piratejohnny/2507761266
• Touch ID: https://www.flickr.com/photos/hawaii/9844738564
• Rumble strip: https://www.flickr.com/photos/wsdot/3972234532