Applied machine learning with tidymodels

useR! 2022 keynote

Julia Silge

June 22, 2022

Transcript

  1. Applied Machine Learning with tidymodels

    Julia Silge @juliasilge
  2. What's the hardest part about machine learning in practice?
  3. library(tidymodels)

    #> ── Attaching packages ────────────────────────────────────────────── tidymodels 0.2.0 ──
    #> ✔ broom        0.8.0     ✔ rsample      0.1.1
    #> ✔ dials        1.0.0     ✔ tibble       3.1.7
    #> ✔ dplyr        1.0.9     ✔ tidyr        1.2.0
    #> ✔ infer        1.0.2     ✔ tune         0.2.0
    #> ✔ modeldata    0.1.1     ✔ workflows    0.2.6
    #> ✔ parsnip      1.0.0     ✔ workflowsets 0.2.1
    #> ✔ purrr        0.3.4     ✔ yardstick    1.0.0
    #> ✔ recipes      0.2.0
    #> ── Conflicts ───────────────────────────────────────────────── tidymodels_conflicts() ──
    #> ✖ purrr::discard() masks scales::discard()
    #> ✖ dplyr::filter()  masks stats::filter()
    #> ✖ dplyr::lag()     masks stats::lag()
    #> ✖ recipes::step()  masks stats::step()
    #> • Dig deeper into tidy modeling with R at https://www.tmwr.org
  4. Three topics for today

    - Spend your data budget
    - Where does your model start and end?
    - Get your model off your laptop
  5. initial_split() S t r y t a t n t g s

    penguins_split <- initial_split(penguins, prop = 0.75)
    penguins_split
    #> <Training/Testing/Total>
    #> <249/84/333>
  6. training() and testing() C t g n t t o rsplit

    penguins_train <- training(penguins_split)
    penguins_train
    #> # A tibble: 249 × 8
    #>   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
    #>   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
    #> 1 Chinst… Dream            47.6          18.3              195        3850 fema…
    #> 2 Adelie  Torge…           35.7          17                189        3350 fema…
    #> 3 Gentoo  Biscoe           45.5          15                220        5000 male
    #> 4 Gentoo  Biscoe           48.7          15.7              208        5350 male
    #> 5 Gentoo  Biscoe           46.5          13.5              210        4550 fema…
    #> # … with 244 more rows, and 1 more variable: year <int>
  7. training() and testing() C t g n t t o rsplit

    penguins_test <- testing(penguins_split)
    penguins_test
    #> # A tibble: 84 × 8
    #>   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
    #>   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
    #> 1 Adelie  Torge…           40.3          18                195        3250 fema…
    #> 2 Adelie  Torge…           36.7          19.3              193        3450 fema…
    #> 3 Adelie  Torge…           36.6          17.8              185        3700 fema…
    #> 4 Adelie  Torge…           34.4          18.4              184        3325 fema…
    #> 5 Adelie  Torge…           46            21.5              194        4200 male
    #> # … with 79 more rows, and 1 more variable: year <int>
  8. How can we use the training set to compare, evaluate, and tune models?
  9. Cross-validation

    [Diagram: 30 data points, numbered 1 to 30, shown in shuffled order]
  10. Cross-validation

    [Diagram: the 30 points in three panels (Fold 1 Iteration, Fold 2 Iteration, Fold 3 Iteration), each divided into "Model Fit Using" and "Estimate Performance Using" groups]
  11. Cross-validation

    set.seed(123)
    vfold_cv(penguins_train, strata = species)
    #> # 10-fold cross-validation using stratification
    #> # A tibble: 10 × 2
    #>    splits           id
    #>    <list>           <chr>
    #>  1 <split [223/26]> Fold01
    #>  2 <split [223/26]> Fold02
    #>  3 <split [223/26]> Fold03
    #>  4 <split [224/25]> Fold04
    #>  5 <split [224/25]> Fold05
    #>  6 <split [224/25]> Fold06
    #>  7 <split [225/24]> Fold07
    #>  8 <split [225/24]> Fold08
    #>  9 <split [225/24]> Fold09
    #> 10 <split [225/24]> Fold10
  12. Bootstrapping

    [Diagram: the 30 points resampled with replacement in three panels (Bootstrap Iteration 1, Bootstrap Iteration 2, Bootstrap Iteration 3), each divided into "Model Fit Using" and "Estimate Performance Using" groups]
  13. Bootstrapping

    set.seed(123)
    bootstraps(penguins_train, strata = species)
    #> # Bootstrap sampling using stratification
    #> # A tibble: 25 × 2
    #>    splits           id
    #>    <list>           <chr>
    #>  1 <split [249/91]> Bootstrap01
    #>  2 <split [249/93]> Bootstrap02
    #>  3 <split [249/96]> Bootstrap03
    #>  4 <split [249/88]> Bootstrap04
    #>  5 <split [249/89]> Bootstrap05
    #>  6 <split [249/82]> Bootstrap06
    #>  7 <split [249/87]> Bootstrap07
    #>  8 <split [249/87]> Bootstrap08
    #>  9 <split [249/85]> Bootstrap09
    #> 10 <split [249/95]> Bootstrap10
    #> # … with 15 more rows
  14. Resampling methods

    S u t w t create simulated validation set(s)

    - vfold_cv()
    - loo_cv()
    - mc_cv()
    - bootstraps()
    - validation_split()
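The functions listed on this slide share one interface: each takes a data frame and returns a tibble of `rsplit` objects. A minimal sketch of three of them, not taken from the slides. It assumes `penguins` is the complete-case version of `palmerpenguins::penguins` (the deck does not show where the data comes from), and the `prop`, `times`, `v`, and `repeats` values are chosen only for illustration.

```r
library(rsample)

# Assumption: the slides' 333-row `penguins` is palmerpenguins with
# incomplete rows dropped; the deck does not show this step.
penguins <- na.omit(palmerpenguins::penguins)

set.seed(123)
penguins_split <- initial_split(penguins, prop = 0.75)
penguins_train <- training(penguins_split)

# Monte Carlo cross-validation: 20 random 90/10 splits (illustrative values)
penguins_mc <- mc_cv(penguins_train, prop = 0.9, times = 20, strata = species)

# Leave-one-out cross-validation: one split per training row
penguins_loo <- loo_cv(penguins_train)

# Repeated V-fold cross-validation: 5 folds, repeated twice
penguins_rep <- vfold_cv(penguins_train, v = 5, repeats = 2, strata = species)

# Every resample can be pulled apart the same way
first_split <- penguins_mc$splits[[1]]
dim(analysis(first_split))    # rows used to fit the model
dim(assessment(first_split))  # rows held out to estimate performance
```

Because the return type is the same in each case, downstream code does not need to change when the resampling scheme does.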
  15. workflows

    https://workflows.tidymodels.org/
  16. Where does your model start and end?

    rf_spec <- rand_forest(mode = "classification")
    penguin_formula <- species ~ bill_length_mm + bill_depth_mm + sex
  17. Where does your model start and end?

    workflow(penguin_formula, rf_spec)
    #> ══ Workflow ════════════════════════════════════════════════════════════════════════════
    #> Preprocessor: Formula
    #> Model: rand_forest()
    #>
    #> ── Preprocessor ────────────────────────────────────────────────────────────────────────
    #> species ~ bill_length_mm + bill_depth_mm + sex
    #>
    #> ── Model ───────────────────────────────────────────────────────────────────────────────
    #> Random Forest Model Specification (classification)
    #>
    #> Computational engine: ranger
  18. Where does your model start and end?

    workflow(penguin_formula, rf_spec) %>%
      fit(data = penguins_train)
    #> ══ Workflow [trained] ══════════════════════════════════════════════════════════════════
    #> Preprocessor: Formula
    #> Model: rand_forest()
    #>
    #> ── Preprocessor ────────────────────────────────────────────────────────────────────────
    #> species ~ bill_length_mm + bill_depth_mm + sex
    #>
    #> ── Model ───────────────────────────────────────────────────────────────────────────────
    #> Ranger result
    #>
    #> Call:
    #>  ranger::ranger(x = maybe_data_frame(x), y = y, num.threads = 1,
    #>    verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
    #>
    #> Type:                             Probability estimation
    #> Number of trees:                  500
    #> Sample size:                      249
    #> Number of independent variables:  3
    #> Mtry:                             1
    #> Target node size:                 10
    #> Variable importance mode:         none
    #> Splitrule:                        gini
    #> OOB prediction error (Brier s.):  0.05585744
  19. Where does your model start and end?

    penguin_rec <- recipe(species ~ bill_length_mm + bill_depth_mm + sex, data = penguins_train) %>%
      step_dummy(sex) %>%
      step_normalize(all_numeric_predictors())
    penguin_rec
    #> Recipe
    #>
    #> Inputs:
    #>
    #>       role #variables
    #>    outcome          1
    #>  predictor          3
    #>
    #> Operations:
    #>
    #> Dummy variables from sex
    #> Centering and scaling for all_numeric_predictors()
  20. Where does your model start and end?

    svm_spec <- svm_linear(mode = "classification")
    workflow(penguin_rec, svm_spec)
    #> ══ Workflow ════════════════════════════════════════════════════════════════════════════
    #> Preprocessor: Recipe
    #> Model: svm_linear()
    #>
    #> ── Preprocessor ────────────────────────────────────────────────────────────────────────
    #> 2 Recipe Steps
    #>
    #> • step_dummy()
    #> • step_normalize()
    #>
    #> ── Model ───────────────────────────────────────────────────────────────────────────────
    #> Linear Support Vector Machine Specification (classification)
    #>
    #> Computational engine: LiblineaR
  21. Where does your model start and end?

    penguin_fit <- workflow(penguin_rec, svm_spec) %>%
      fit(data = penguins_train)
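One way to connect the resampling slides to this workflow is to estimate its performance on cross-validation folds before touching the test set. The following is a minimal sketch, not from the slides. `fit_resamples()`, `collect_metrics()`, and `metric_set()` are tune and yardstick functions, but restricting the metrics to accuracy is an assumption made here because the LiblineaR engine returns class predictions rather than probabilities. The source of `penguins` is also an assumption, and the LiblineaR package must be installed.

```r
library(tidymodels)

# Assumption: complete-case palmerpenguins data (not shown in the deck)
penguins <- na.omit(palmerpenguins::penguins)

set.seed(123)
penguins_split <- initial_split(penguins, prop = 0.75)
penguins_train <- training(penguins_split)
penguins_test  <- testing(penguins_split)
penguin_folds  <- vfold_cv(penguins_train, strata = species)

# Same preprocessing and model as on the slides
penguin_rec <- recipe(species ~ bill_length_mm + bill_depth_mm + sex,
                      data = penguins_train) %>%
  step_dummy(sex) %>%
  step_normalize(all_numeric_predictors())
svm_spec   <- svm_linear(mode = "classification")
penguin_wf <- workflow(penguin_rec, svm_spec)

# Fit the whole workflow (recipe + model) once per fold
penguin_res <- fit_resamples(
  penguin_wf,
  resamples = penguin_folds,
  metrics = metric_set(accuracy)  # class metric only (assumption, see above)
)
collect_metrics(penguin_res)

# Final fit on the full training set, then a one-time check on the test set
penguin_fit <- fit(penguin_wf, data = penguins_train)
test_preds <- predict(penguin_fit, new_data = penguins_test) %>%
  bind_cols(penguins_test %>% select(species))
accuracy(test_preds, truth = species, estimate = .pred_class)
```

Because the recipe is part of the workflow, its dummy-variable and normalization steps are re-estimated inside every fold, so the held-out rows never influence the preprocessing.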
  22. Get your model off your laptop

    library(vetiver)
    v <- vetiver_model(penguin_fit, "svm_penguins")
    v
    #>
    #> ── svm_penguins ─ <butchered_workflow> model for deployment
    #> A LiblineaR classification modeling workflow using 3 features
  23. Get your model off your laptop

    library(plumber)
    pr() %>%
      vetiver_api(v)
    #> # Plumber router with 2 endpoints, 4 filters, and 1 sub-router.
    #> # Use `pr_run()` on this object to start the API.
    #> ├──[queryString]
    #> ├──[body]
    #> ├──[cookieParser]
    #> ├──[sharedSecret]
    #> ├──/logo
    #> ├──/ping (GET)
    #> └──/predict (POST)
  24. Get your model off your laptop

    - P -b d R C
    - G e D l o o d e t
  25. Get your model off your laptop

    # Generated by the vetiver package; edit with care

    FROM rocker/r-ver:4.2.0
    ENV RENV_CONFIG_REPOS_OVERRIDE https://packagemanager.rstudio.com/cran/latest

    RUN apt-get update -qq && apt-get install -y --no-install-recommends \
      libcurl4-openssl-dev \
      libicu-dev \
      libsodium-dev \
      libssl-dev \
      make

    COPY vetiver_renv.lock renv.lock
    RUN Rscript -e "install.packages('renv')"
    RUN Rscript -e "renv::restore()"
    COPY plumber.R /opt/ml/plumber.R

    EXPOSE 8000
    ENTRYPOINT ["R", "-e", "pr <- plumber::plumb('/opt/ml/plumber.R'); pr$run(host = '0.0.0.0', port = 8000)"]
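A Dockerfile like this one is written by vetiver rather than by hand. Below is a hedged sketch of the calls involved, not taken from the slides. It assumes a temporary pins board stands in for wherever the model would really be stored (the deck does not show the board), and it uses a plain `lm()` on `mtcars` as a hypothetical stand-in model to keep the example small. The renv package must be installed for the lockfile step.

```r
library(vetiver)
library(pins)

# Hypothetical stand-in model; the talk deploys `penguin_fit` instead
cars_lm <- lm(mpg ~ ., data = mtcars)
v <- vetiver_model(cars_lm, "cars_linear")

# Assumption: a temporary board; a real deployment would use a shared board
board <- board_temp(versioned = TRUE)
vetiver_pin_write(board, v)

# Scratch directory to hold the generated files
out_dir <- tempfile("vetiver-docker-")
dir.create(out_dir)
plumber_file <- file.path(out_dir, "plumber.R")

# Write the plumber.R file that the Dockerfile copies into the image
vetiver_write_plumber(board, "cars_linear", file = plumber_file)

# Write the Dockerfile and the renv lockfile alongside it
vetiver_write_docker(v, plumber_file = plumber_file, path = out_dir)

list.files(out_dir)
```

The generated Dockerfile only describes the image. Building and running it still happens with Docker itself, and the container needs access to the board that holds the pinned model.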
  26. Thank you! h ://y .c /j l /

    h ://j l .c / h ://t e .o / h ://t .o / P a M U h