Good practices for applied machine learning

rstudio::conf() 2022 keynote with Max Kuhn

Julia Silge

July 27, 2022

Transcript

  1. GOOD PRACTICES FOR APPLIED MACHINE LEARNING: Model Development to Model Deployment

    Julia Silge (@juliasilge) and Max Kuhn (@topepos)
  2. GOOD PRACTICES FOR APPLIED MACHINE LEARNING: Model Development to Model Deployment

    Polls at slido.com (event code = RKEYNOTE)
  3. Spotify data

    popularity artist  date       duration genres
         <dbl> <chr>   <date>        <dbl> <chr>
            20 5BcZ22X 1969-11-21    33520 ['album rock', 'art rock', 'british invasion', 'classi…
            46 4KXp3xt 1999-12-06   273949 ['classic swedish pop', 'swedish alternative rock', …
            17 5MnhtFX 1979-07-26   202720 ['cante flamenco', 'flamenco', 'rumba']
            19 6zpuH5T 1989-01-01   228444 ['classic indo pop', 'dangdut', 'lagu sunda']
            44 4q2SZId 2014-02-04   226531 ['turkish rock']
            18 6EdXBTj 1984-03-14   263160 ['greek pop', 'laiko']
            21 3LOaK3K 1977-11-12   188240 ['chanson', 'french pop', 'ye ye']
            42 1t17z3v 2017-12-07   138136 ['j-pop', 'j-rock', 'japanese alternative rock', …
            23 3f626JS 1974-01-01   216200 []
            18 2bjzKs2 1987-08-25   257893 ['glam metal', 'hard rock', 'j-metal', 'japanese…
  4. Spotify data (same table as slide 3), popularity column: Outcome for model
  5. Spotify data (same table as slide 3), artist column: Qualitative predictor with a very large number of values
  6. Spotify data (same table as slide 3), date column: Column that we want to convert to multiple features
  7. Spotify data (same table as slide 3), genres column: Multiple choice field to parse and convert to many indicator variables
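A minimal base-R sketch of the parsing step named on slide 7, assuming the genres column is a character vector shaped like the rows in the table. It reuses the regular expression from the recipe on slide 14; the small `genres` vector and the object names are illustrative rather than the speakers' code. In the talk's tidymodels workflow this job belongs to `step_dummy_extract()`.

```r
# Illustrative values copied from the genres column of the table.
genres <- c(
  "['greek pop', 'laiko']",
  "['turkish rock']",
  "[]",
  "['chanson', 'french pop', 'ye ye']"
)

# Same pattern as in the slide 14 recipe: text between single quotes.
pattern <- "(?<=')[^',]+(?=')"

# One character vector of genres per row (empty for "[]").
genre_list <- regmatches(genres, gregexpr(pattern, genres, perl = TRUE))

# Every genre seen in the data becomes one indicator column.
levels_found <- sort(unique(unlist(genre_list)))

indicators <- t(vapply(
  genre_list,
  function(g) as.integer(levels_found %in% g),
  integer(length(levels_found))
))
colnames(indicators) <- levels_found

indicators
```

Note that the set of indicator columns is learned from the data at hand, so it is itself an estimated quantity that has to be reused unchanged at prediction time.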
  8. The reality of your script

    • Use effect encoding to convert the artist into features
    • Transformations, etc. etc.
    • Convert date to year, month, etc.
    • Parse multiple choice data and convert to indicators
    • Too many features! Do some analysis to filter some out
    • Actually fit your model!
  9. The reality of your script

    Almost all of these involve estimation and are part of the modeling process. Most of these steps also need to be applied during prediction.
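To make the point about estimation concrete, here is a small base-R sketch of an effect encoding that is estimated on training data and then applied, unchanged, at prediction time. It is an assumption-laden stand-in: the shrinkage toward the overall mean is a simple approximation, not the Bayesian model behind `step_lencode_bayes()`, and `fit_effect_encoding()`, `apply_effect_encoding()`, `prior_weight`, and the toy data are all hypothetical names and values.

```r
# Toy training data (made up for illustration).
train <- data.frame(
  artist = c("a", "a", "a", "b", "b", "c"),
  popularity = c(20, 30, 40, 60, 80, 10)
)
new_data <- data.frame(artist = c("a", "c", "z"))  # "z" was never seen in training

# Estimation step: uses the training outcome, so it is part of the model.
fit_effect_encoding <- function(data, prior_weight = 2) {
  global_mean <- mean(data$popularity)
  sums <- tapply(data$popularity, data$artist, sum)
  counts <- tapply(data$popularity, data$artist, length)
  # Per-artist mean pulled toward the overall mean; artists with few rows move more.
  shrunk <- (sums + prior_weight * global_mean) / (counts + prior_weight)
  list(
    global_mean = global_mean,
    values = setNames(as.numeric(shrunk), names(shrunk))
  )
}

# Prediction step: only looks up stored values, never re-estimates.
apply_effect_encoding <- function(encoding, artist) {
  out <- unname(encoding$values[as.character(artist)])
  out[is.na(out)] <- encoding$global_mean  # unseen artists fall back to the overall mean
  out
}

enc <- fit_effect_encoding(train)
encoded <- apply_effect_encoding(enc, new_data$artist)
encoded
```

The split into a "fit" function and an "apply" function mirrors why such steps have to travel with the model: the values estimated from the training set are what get applied to new samples.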
  10. Do you have experience with these “roles”? Post at slido.com (event code = RKEYNOTE)

    A. Outcome
    B. Predictor
    C. Case weight
    D. Stratification variable
    E. Censoring indicator
    F. Offsets
  11. Reduce your cognitive load

    library(tidymodels)
    music_split <- initial_split(spotify)
    recipe(popularity ~ ., data = training(music_split))
    #> Recipe
    #>
    #> Inputs:
    #>
    #>       role #variables
    #>    outcome          1
    #>  predictor          7
  12. Reduce your cognitive load

    library(tidymodels)
    music_split <- initial_split(spotify)
    recipe(popularity ~ ., data = training(music_split)) %>%
      step_normalize(all_numeric_predictors())
    #> Recipe
    #>
    #> Inputs:
    #>
    #>       role #variables
    #>    outcome          1
    #>  predictor          7
    #>
    #> Operations:
    #>
    #> Centering and scaling for all_numeric_predictors()
  13. Recall your script

    • Use effect encoding to convert the artist into features
    • Transformations, etc. etc.
    • Convert date to year, month, etc.
    • Parse multiple choice data and convert to indicators
    • Too many features! Do some analysis to filter some out
    • Actually fit your model!
  14. The reality of your tidymodels script

    Recipes and workflows encapsulate all of your preprocessing and modeling operations into a single interface.

    # Assumes embed, bestNormalize, and rules are loaded alongside tidymodels,
    # and that spotify_train is the training set, training(music_split).
    spotify_rec <-
      recipe(popularity ~ ., data = training(music_split)) %>%
      step_lencode_bayes(artist, outcome = vars(popularity)) %>%
      step_best_normalize(duration) %>%
      step_date(date, keep_original_cols = FALSE) %>%
      step_dummy_extract(genres, pattern = "(?<=')[^',]+(?=')") %>%
      step_corr(all_numeric_predictors())

    cubist_fit <-
      workflow(spotify_rec, cubist_rules()) %>%
      fit(data = spotify_train)

    predict(cubist_fit, new_samples)
  15. Model deployment and YOU. Post at slido.com (event code = RKEYNOTE)

    A. I have deployed a model to production.
    B. I have never deployed a model to production.
    C. What does “production” mean?
  16. Collect data • caret • Understand and clean data • Train and evaluate model • Version model • Deploy model • Monitor model
  17. Get your model off your machine (vetiver.rstudio.com)

    library(vetiver)
    v <- vetiver_model(cubist_fit, "spotify_rules")
    v
    #>
    #> ── spotify_rules ─ <butchered_workflow> model for deployment
    #> A Cubist regression modeling workflow using 4 features
  18. Get your model off your machine (vetiver.rstudio.com)

    library(plumber)
    pr() %>%
      vetiver_api(v)
    #> # Plumber router with 2 endpoints, 4 filters, and 1 sub-router.
    #> # Use `pr_run()` on this object to start the API.
    #> ├──[queryString]
    #> ├──[body]
    #> ├──[cookieParser]
    #> ├──[sharedSecret]
    #> ├──/logo
    #> ├──/ping (GET)
    #> └──/predict (POST)
  19. Have you ever tuned a model? Post at slido.com (event code = RKEYNOTE)

    A. No
    B. Yes, but it was painful
    C. Yes, but I’m not sure if it was effective
    D. Yes, and it was easy!
    E. Yes, and I used an advanced method like racing
  20. Some advanced techniques in tidymodels

    MODEL TUNING
    • Racing methods
    • Bayesian optimization

    FEATURE EMBEDDING METHODS
    • UMAP
    • isoMap
    • Effect encodings

    Racing is a method that eliminates model configurations as they are resampled: RESAMPLE, ANALYZE, FILTER.
  21. Racing methods for effective model selection

    • Fit only 7.2% of total models
    • Speed-ups
      • With parallel processing, 2.8-fold
      • Without, 9-fold
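The resample, analyze, filter loop behind racing can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the tidymodels implementation (which lives in the finetune package, for example `tune_race_anova()`): the per-resample RMSE values are simulated, the five candidates and their `true_rmse` values are made up, and a one-sided paired t-test stands in for the statistical analysis used to filter candidates.

```r
set.seed(2022)

n_resamples <- 10
true_rmse <- c(1.00, 1.05, 1.50, 2.00, 3.00)  # hypothetical candidate configurations
n_candidates <- length(true_rmse)

# Stand-in for "fit candidate j on resample i and compute its RMSE".
fit_and_score <- function(candidate, resample) {
  true_rmse[candidate] + rnorm(1, sd = 0.05)
}

scores <- matrix(NA_real_, nrow = n_resamples, ncol = n_candidates)
active <- rep(TRUE, n_candidates)  # candidates still in the race
burn_in <- 3                       # resamples evaluated before any filtering
fits_used <- 0

for (i in seq_len(n_resamples)) {
  # RESAMPLE: only candidates that are still active get fitted.
  for (j in which(active)) {
    scores[i, j] <- fit_and_score(j, i)
    fits_used <- fits_used + 1
  }

  if (i >= burn_in && sum(active) > 1) {
    # ANALYZE: find the current best among the active candidates.
    means <- colMeans(scores[seq_len(i), , drop = FALSE])
    means[!active] <- Inf
    best <- which.min(means)

    # FILTER: drop candidates that are significantly worse than the best.
    for (j in setdiff(which(active), best)) {
      p <- t.test(
        scores[seq_len(i), j], scores[seq_len(i), best],
        paired = TRUE, alternative = "greater"
      )$p.value
      if (p < 0.05) active[j] <- FALSE
    }
  }
}

which(active)                            # candidates that survived the race
fits_used / (n_candidates * n_resamples) # fraction of the full grid actually fitted
```

The saving comes from never fitting eliminated candidates on the remaining resamples, which is the kind of reduction in total model fits the slide reports.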
  22. The reality of your script

    Almost all of these involve estimation and are part of the modeling process. Most also require code for the script with the prediction steps.
  23. If you use feature selection with 10-fold cross-validation, you will select features: Post at slido.com (event code = RKEYNOTE)

    A. 1 time
    B. 10 times
    C. 11 times
    D. 3.14159 times
  24. Leakage and the Reproducibility Crisis in ML-based Science. Kapoor and Narayanan (2022), reproducible.cs.princeton.edu

    “We show that data leakage is indeed a widespread problem and has led to severe reproducibility failures. Specifically, through a survey of literature in research communities that adopted ML methods, we find 17 fields where errors have been found, collectively affecting 329 papers and in some cases leading to wildly overoptimistic conclusions.”
  25. Proper data usage and validation

    library(recipeselectors)

    selection_rec <-
      spotify_rec %>%
      step_select_vip(
        artist,
        outcome = vars(popularity),
        top_p = tune()
      )

    lm_res <-
      workflow(selection_rec, linear_reg()) %>%
      tune_grid(resamples = folds, grid = 25)

    The workflow makes sure that the appropriate computations are used with the right data at the right time.
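A base-R sketch of what "the right data at the right time" can look like for feature selection, assuming a simple correlation filter and simulated data (`select_top()`, `top_k`, and the data are hypothetical, not from the talk). Selection is re-estimated inside every fold using only that fold's analysis rows, and once more on the full training set for the final fit. Under that assumption the selection routine runs 10 + 1 = 11 times.

```r
set.seed(123)

# Simulated data: only x1 and x2 are related to the outcome.
n <- 100
p <- 20
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", seq_len(p))))
y <- 2 * x[, 1] - x[, 2] + rnorm(n)
dat <- data.frame(y = y, x)

selection_calls <- 0

# Hypothetical filter: keep the top_k predictors by absolute correlation with y.
select_top <- function(data, top_k = 3) {
  selection_calls <<- selection_calls + 1
  predictors <- setdiff(names(data), "y")
  scores <- vapply(
    predictors,
    function(v) abs(cor(data[[v]], data$y)),
    numeric(1)
  )
  names(sort(scores, decreasing = TRUE))[seq_len(top_k)]
}

folds <- sample(rep(seq_len(10), length.out = n))
rmse <- numeric(10)

for (k in seq_len(10)) {
  analysis <- dat[folds != k, ]
  assessment <- dat[folds == k, ]

  # Selection sees only the analysis rows, never the held-out rows.
  keep <- select_top(analysis)
  fit <- lm(reformulate(keep, response = "y"), data = analysis)

  pred <- predict(fit, newdata = assessment)
  rmse[k] <- sqrt(mean((assessment$y - pred)^2))
}

# Final model: selection is estimated once more, on the full training set.
final_keep <- select_top(dat)
final_fit <- lm(reformulate(final_keep, response = "y"), data = dat)

selection_calls
mean(rmse)
```

Running the filter once on all rows before resampling would let the held-out rows influence which features are kept, which is the kind of leakage described on slide 24.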
  26. I last locked my keys in my car: Post at slido.com (event code = RKEYNOTE)

    A. ~1 week ago
    B. ~1 month ago
    C. ~1 year ago
    D. ~10 years ago
    E. I have never locked my keys in my car.
  27. It’s hard to lock my keys in my current car

    Protect against common failure modes
  28. David Robinson:

    “Some of the resistance I’ve seen to tidymodels comes from a place of ‘This makes it too easy - you’re not thinking carefully about what the code is doing!’ But I think this is getting it backwards. By removing the burden of writing procedural logic, I get to focus on scientific and statistical questions about my data and model.”

    varianceexplained.org/r/sliced-ml/
  29. Also thanks to

    TIDYMODELS TEAM: Simon Couch, Hannah Frick, Emil Hvitfeldt, Davis Vaughan
    VETIVER TEAM: Michael Chow, Isabel Zimmerman
    AND: Jenny Bryan, Mine Cetinkaya-Rundel, Matt Dancho, Alison Hill, Allison Horst, Edgar Ruiz, Hadley Wickham
  30. Image sources

    • Traffic guards: https://commons.wikimedia.org/wiki/File:FitchBarrels2008.jpg
    • Ergonomic chair: https://pixabay.com/illustrations/chair-office-chair-work-chair-5020284/
    • B&W Legos: https://unsplash.com/photos/2Ip_wpgLoyw
    • Rainbow Legos: https://unsplash.com/photos/gNMVpAPe3PE
    • Old car keys: https://unsplash.com/photos/7mgR-BZ5Dm4
    • Antique gas pump: https://unsplash.com/photos/QBUVgT32mOo
  31. Image sources continued

    • Classic car: https://unsplash.com/photos/RjMgc6GXpHg
    • Driver in car: https://unsplash.com/photos/IAc1x02D9K0
    • Seatbelt light: https://www.flickr.com/photos/piratejohnny/2507761266
    • Touch ID: https://www.flickr.com/photos/hawaii/9844738564
    • Rumble strip: https://www.flickr.com/photos/wsdot/3972234532