Applied machine learning with tidymodels

Slide 1

Slide 1 text

Applied Machine Learning with tidymodels
Julia Silge
@juliasilge

Slide 2

Slide 2 text

Hello

Slide 3

Slide 3 text

h ://x .c /1 /

Slide 4

Slide 4 text

I a c : h ://v .c /b /m _l g/

Slide 5

Slide 5 text

I a c : h ://v .c /b /m _l g/

Slide 6

Slide 6 text

What's the hardest part about machine learning in practice?

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

library(tidymodels)
#> ── Attaching packages ────────────────────────────────────────────── tidymodels 0.2.0 ──
#> ✔ broom        0.8.0     ✔ rsample      0.1.1
#> ✔ dials        1.0.0     ✔ tibble       3.1.7
#> ✔ dplyr        1.0.9     ✔ tidyr        1.2.0
#> ✔ infer        1.0.2     ✔ tune         0.2.0
#> ✔ modeldata    0.1.1     ✔ workflows    0.2.6
#> ✔ parsnip      1.0.0     ✔ workflowsets 0.2.1
#> ✔ purrr        0.3.4     ✔ yardstick    1.0.0
#> ✔ recipes      0.2.0
#> ── Conflicts ───────────────────────────────────────────────── tidymodels_conflicts() ──
#> ✖ purrr::discard() masks scales::discard()
#> ✖ dplyr::filter()  masks stats::filter()
#> ✖ dplyr::lag()     masks stats::lag()
#> ✖ recipes::step()  masks stats::step()
#> • Dig deeper into tidy modeling with R at https://www.tmwr.org

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

tmwr.org

Slide 11

Slide 11 text

Three topics for today
- Spending your data budget
- Where does your model start and end?
- Get your model off your laptop

Slide 12

Slide 12 text

Spending your data budget

Slide 13

Slide 13 text

rsample
https://rsample.tidymodels.org

Slide 14

Slide 14 text

Data splitting

Slide 15

Slide 15 text

initial_split()
S t r y t a t n t g s
penguins_split <- initial_split(penguins, prop = 0.75)
penguins_split
#>
#> <249/84/333>
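The split on this slide can be made reproducible and balanced across species. The following is a minimal sketch, not the code from the talk: it assumes the data come from the palmerpenguins package with incomplete rows dropped (which would give the 333 rows shown above), and it adds `set.seed()` and `strata = species`, neither of which appears on the slide.

```r
library(rsample)

# Assumption: `penguins` is palmerpenguins::penguins with incomplete rows removed
penguins <- na.omit(palmerpenguins::penguins)

set.seed(123)  # fix the random number stream so the split is reproducible

# strata = species keeps the species proportions similar in both pieces
penguins_split <- initial_split(penguins, prop = 0.75, strata = species)

penguins_train <- training(penguins_split)
penguins_test  <- testing(penguins_split)

# The two pieces partition the original rows
nrow(penguins_train) + nrow(penguins_test) == nrow(penguins)

# Compare species proportions in the training and testing sets
prop.table(table(penguins_train$species))
prop.table(table(penguins_test$species))
```

Stratifying is optional here; it matters most when one class is rare enough that a purely random split could leave it underrepresented in the testing set.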

Slide 16

Slide 16 text

training() and testing()
C t g n t t o rsplit
penguins_train <- training(penguins_split)
penguins_train
#> # A tibble: 249 × 8
#>   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
#>
#> 1 Chinst… Dream            47.6          18.3              195        3850 fema…
#> 2 Adelie  Torge…           35.7          17                189        3350 fema…
#> 3 Gentoo  Biscoe           45.5          15                220        5000 male
#> 4 Gentoo  Biscoe           48.7          15.7              208        5350 male
#> 5 Gentoo  Biscoe           46.5          13.5              210        4550 fema…
#> # … with 244 more rows, and 1 more variable: year

Slide 17

Slide 17 text

training() and testing()
C t g n t t o rsplit
penguins_test <- testing(penguins_split)
penguins_test
#> # A tibble: 84 × 8
#>   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
#>
#> 1 Adelie  Torge…           40.3          18                195        3250 fema…
#> 2 Adelie  Torge…           36.7          19.3              193        3450 fema…
#> 3 Adelie  Torge…           36.6          17.8              185        3700 fema…
#> 4 Adelie  Torge…           34.4          18.4              184        3325 fema…
#> 5 Adelie  Torge…           46            21.5              194        4200 male
#> # … with 79 more rows, and 1 more variable: year

Slide 18

Slide 18 text

The testing data is precious!

Slide 19

Slide 19 text

How can we use the training set to compare, evaluate, and tune models?

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

Cross-validation
[Figure: 30 numbered training rows shown in shuffled order]

Slide 22

Slide 22 text

Cross-validation
[Figure: the 30 numbered rows split three ways (Fold 1 Iteration, Fold 2 Iteration, Fold 3 Iteration); in each iteration some rows are marked "Model Fit Using" and the rest "Estimate Performance Using"]
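The idea in the diagram can be sketched by hand in base R. This is an illustration of the concept only, not how rsample implements it; the object names (`fold_id`, `folds`) are made up for the example, and 30 rows with 3 folds are chosen to mirror the figure.

```r
set.seed(123)
n_rows  <- 30  # mirrors the 30 numbered rows in the diagram
n_folds <- 3

# Give every row exactly one fold label, in random order (10 rows per fold)
fold_id <- sample(rep(seq_len(n_folds), length.out = n_rows))

# For fold k: fit on every row NOT in fold k, assess on the rows in fold k
folds <- lapply(seq_len(n_folds), function(k) {
  list(
    analysis   = which(fold_id != k),  # "Model Fit Using"
    assessment = which(fold_id == k)   # "Estimate Performance Using"
  )
})

# Each iteration fits on 20 rows and assesses on the remaining 10
sapply(folds, function(f) length(f$analysis))
sapply(folds, function(f) length(f$assessment))

# Every row is held out exactly once across the three iterations
sort(unlist(lapply(folds, function(f) f$assessment)))
```

Because each row is assessed exactly once, averaging the per-fold performance uses all of the training data for evaluation without touching the testing set.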

Slide 23

Slide 23 text

Cross-validation
set.seed(123)
vfold_cv(penguins_train, strata = species)
#> # 10-fold cross-validation using stratification
#> # A tibble: 10 × 2
#>    splits id
#>
#>  1 Fold01
#>  2 Fold02
#>  3 Fold03
#>  4 Fold04
#>  5 Fold05
#>  6 Fold06
#>  7 Fold07
#>  8 Fold08
#>  9 Fold09
#> 10 Fold10
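Each element of the `splits` column is itself a split object that can be opened up with `analysis()` and `assessment()` from rsample. A short sketch, assuming the palmerpenguins data with incomplete rows dropped; the name `penguin_folds` and the seed used for the initial split are illustrative, not from the slides.

```r
library(rsample)

# Assumption: same data preparation as the earlier slides
penguins <- na.omit(palmerpenguins::penguins)

set.seed(123)
penguins_train <- training(initial_split(penguins, prop = 0.75))

set.seed(123)
penguin_folds <- vfold_cv(penguins_train, strata = species)

# Pull out the first fold and look at its two pieces
first_split <- penguin_folds$splits[[1]]
fit_rows    <- analysis(first_split)    # about 9/10 of the training set
assess_rows <- assessment(first_split)  # the held-out 1/10

nrow(fit_rows)
nrow(assess_rows)
```

The analysis and assessment sets of a fold play the same roles as training and testing do for the initial split, but both are carved out of the training set.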

Slide 24

Slide 24 text

Bootstrapping
[Figure: three resamples of the 30 numbered rows (Bootstrap Iteration 1, Bootstrap Iteration 2, Bootstrap Iteration 3); in each iteration rows drawn with repeats are marked "Model Fit Using" and the rows not drawn are marked "Estimate Performance Using"]
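A bootstrap resample can also be sketched by hand in base R. This is a conceptual illustration, not rsample's implementation; `bootstrap_once()` is a hypothetical helper written for this example.

```r
set.seed(123)
n_rows <- 30  # mirrors the 30 numbered rows in the diagram

# Hypothetical helper: draw one bootstrap resample of row indices
bootstrap_once <- function(n) {
  # Sample n rows WITH replacement, so some rows repeat and some are left out
  analysis <- sample(seq_len(n), size = n, replace = TRUE)
  # Rows never drawn ("out-of-bag") are used to estimate performance
  assessment <- setdiff(seq_len(n), analysis)
  list(analysis = analysis, assessment = assessment)
}

boots <- replicate(3, bootstrap_once(n_rows), simplify = FALSE)

# The analysis set always has as many rows as the original data
sapply(boots, function(b) length(b$analysis))

# The assessment set size varies from one resample to the next
sapply(boots, function(b) length(b$assessment))
```

Unlike cross-validation folds, the assessment sets of different bootstrap resamples can overlap, and their sizes are not fixed.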

Slide 25

Slide 25 text

Bootstrapping
set.seed(123)
bootstraps(penguins_train, strata = species)
#> # Bootstrap sampling using stratification
#> # A tibble: 25 × 2
#>    splits id
#>
#>  1 Bootstrap01
#>  2 Bootstrap02
#>  3 Bootstrap03
#>  4 Bootstrap04
#>  5 Bootstrap05
#>  6 Bootstrap06
#>  7 Bootstrap07
#>  8 Bootstrap08
#>  9 Bootstrap09
#> 10 Bootstrap10
#> # … with 15 more rows

Slide 26

Slide 26 text

Resampling methods
S u t w t create simulated validation set(s)
vfold_cv()
loo_cv()
mc_cv()
bootstraps()
validation_split()

Slide 27

Slide 27 text

Where does your model start and end?

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

workflows
https://workflows.tidymodels.org/

Slide 31

Slide 31 text

No content

Slide 32

Slide 32 text

Where does your model start and end?
rf_spec <- rand_forest(mode = "classification")
penguin_formula <- species ~ bill_length_mm + bill_depth_mm + sex

Slide 33

Slide 33 text

Where does your model start and end?
workflow(penguin_formula, rf_spec)
#> ══ Workflow ════════════════════════════════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: rand_forest()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────────────
#> species ~ bill_length_mm + bill_depth_mm + sex
#>
#> ── Model ───────────────────────────────────────────────────────────────────────────────
#> Random Forest Model Specification (classification)
#>
#> Computational engine: ranger

Slide 34

Slide 34 text

Where does your model start and end?
workflow(penguin_formula, rf_spec) %>%
  fit(data = penguins_train)
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: rand_forest()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────────────
#> species ~ bill_length_mm + bill_depth_mm + sex
#>
#> ── Model ───────────────────────────────────────────────────────────────────────────────
#> Ranger result
#>
#> Call:
#>  ranger::ranger(x = maybe_data_frame(x), y = y, num.threads = 1,
#>    verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
#>
#> Type:                             Probability estimation
#> Number of trees:                  500
#> Sample size:                      249
#> Number of independent variables:  3
#> Mtry:                             1
#> Target node size:                 10
#> Variable importance mode:         none
#> Splitrule:                        gini
#> OOB prediction error (Brier s.):  0.05585744
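A workflow like this one can be evaluated on resamples of the training set rather than fit once. The sketch below is one possible way to do that with `fit_resamples()` and `collect_metrics()` from tune; it is not shown in the slides. It assumes the palmerpenguins data with incomplete rows dropped and that the ranger package is installed; the names `penguin_folds`, `rf_wf`, and `rf_res` and the seeds are illustrative.

```r
library(tidymodels)

# Assumption: same data preparation as the earlier slides
penguins <- na.omit(palmerpenguins::penguins)

set.seed(123)
penguins_train <- training(initial_split(penguins, prop = 0.75))

set.seed(234)
penguin_folds <- vfold_cv(penguins_train, strata = species)

# Model specification and formula as on the slides
rf_spec <- rand_forest(mode = "classification")
penguin_formula <- species ~ bill_length_mm + bill_depth_mm + sex
rf_wf <- workflow(penguin_formula, rf_spec)

# Fit the workflow once per fold and assess on each held-out fold
rf_res <- fit_resamples(rf_wf, resamples = penguin_folds)

# Average the default classification metrics across the folds
rf_metrics <- collect_metrics(rf_res)
rf_metrics
```

Because the whole workflow is refit inside each fold, any preprocessing it contains is estimated only from that fold's analysis rows, which is the point of treating preprocessing as part of the model.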

Slide 35

Slide 35 text

I a A H

Slide 36

Slide 36 text

Where does your model start and end?
penguin_rec <- recipe(species ~ bill_length_mm + bill_depth_mm + sex,
                      data = penguins_train) %>%
  step_dummy(sex) %>%
  step_normalize(all_numeric_predictors())
penguin_rec
#> Recipe
#>
#> Inputs:
#>
#>       role #variables
#>    outcome          1
#>  predictor          3
#>
#> Operations:
#>
#> Dummy variables from sex
#> Centering and scaling for all_numeric_predictors()

Slide 37

Slide 37 text

Where does your model start and end?
svm_spec <- svm_linear(mode = "classification")
workflow(penguin_rec, svm_spec)
#> ══ Workflow ════════════════════════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: svm_linear()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────────────
#> 2 Recipe Steps
#>
#> • step_dummy()
#> • step_normalize()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────────────
#> Linear Support Vector Machine Specification (classification)
#>
#> Computational engine: LiblineaR

Slide 38

Slide 38 text

Where does your model start and end?
penguin_fit <- workflow(penguin_rec, svm_spec) %>%
  fit(data = penguins_train)
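Once the workflow is fit, predicting on new rows applies the recipe steps and the model together. A minimal sketch of a single, final check on the testing set follows; it is not from the slides. It assumes the palmerpenguins data with incomplete rows dropped and that the LiblineaR package is installed; `test_preds`, `test_acc`, and the seed are illustrative.

```r
library(tidymodels)

# Assumption: same data preparation as the earlier slides
penguins <- na.omit(palmerpenguins::penguins)

set.seed(123)
penguins_split <- initial_split(penguins, prop = 0.75)
penguins_train <- training(penguins_split)
penguins_test  <- testing(penguins_split)

# Recipe, model specification, and fit as on the slides
penguin_rec <- recipe(species ~ bill_length_mm + bill_depth_mm + sex,
                      data = penguins_train) %>%
  step_dummy(sex) %>%
  step_normalize(all_numeric_predictors())

svm_spec <- svm_linear(mode = "classification")

penguin_fit <- workflow(penguin_rec, svm_spec) %>%
  fit(data = penguins_train)

# predict() runs the trained recipe on the new rows before the model sees them
test_preds <- predict(penguin_fit, new_data = penguins_test) %>%
  bind_cols(penguins_test %>% select(species))

# Share of testing rows whose predicted species matches the truth
test_acc <- accuracy(test_preds, truth = species, estimate = .pred_class)
test_acc
```

The normalization here uses the means and standard deviations learned from the training set, so the testing rows never influence the preprocessing.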

Slide 39

Slide 39 text

Get your model off your laptop

Slide 40

Slide 40 text

vetiver
https://vetiver.rstudio.com

Slide 41

Slide 41 text

No content

Slide 42

Slide 42 text

No content

Slide 43

Slide 43 text

Get your model off your laptop
library(vetiver)
v <- vetiver_model(penguin_fit, "svm_penguins")
v
#>
#> ── svm_penguins ─ model for deployment
#> A LiblineaR classification modeling workflow using 3 features

Slide 44

Slide 44 text

Get your model off your laptop
library(plumber)
pr() %>%
  vetiver_api(v)
#> # Plumber router with 2 endpoints, 4 filters, and 1 sub-router.
#> # Use `pr_run()` on this object to start the API.
#> ├──[queryString]
#> ├──[body]
#> ├──[cookieParser]
#> ├──[sharedSecret]
#> ├──/logo
#> ├──/ping (GET)
#> └──/predict (POST)

Slide 45

Slide 45 text

Get your model off your laptop
- P -b d R C
- G e D l o o d e t

Slide 46

Slide 46 text

Get your model off your laptop
# Generated by the vetiver package; edit with care
FROM rocker/r-ver:4.2.0
ENV RENV_CONFIG_REPOS_OVERRIDE https://packagemanager.rstudio.com/cran/latest
RUN apt-get update -qq && apt-get install -y --no-install-recommends \
  libcurl4-openssl-dev \
  libicu-dev \
  libsodium-dev \
  libssl-dev \
  make
COPY vetiver_renv.lock renv.lock
RUN Rscript -e "install.packages('renv')"
RUN Rscript -e "renv::restore()"
COPY plumber.R /opt/ml/plumber.R
EXPOSE 8000
ENTRYPOINT ["R", "-e", "pr <- plumber::plumb('/opt/ml/plumber.R'); pr$run(host = '0.0.0.0', port = 8000)"]
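The Dockerfile above copies a `plumber.R` file into the image. One way such a file might be produced is to pin the vetiver model to a board and let vetiver write the plumber file. The sketch below is an assumption about that step, not code from the slides: it uses a temporary pins board (`board_temp()`) as a stand-in for a real, shared board, writes to a temporary path, and refits the model so the example is self-contained. It assumes the palmerpenguins, LiblineaR, vetiver, and pins packages are installed.

```r
library(tidymodels)
library(vetiver)
library(pins)

# Assumption: same data preparation and model as the earlier slides
penguins <- na.omit(palmerpenguins::penguins)

set.seed(123)
penguins_train <- training(initial_split(penguins, prop = 0.75))

penguin_rec <- recipe(species ~ bill_length_mm + bill_depth_mm + sex,
                      data = penguins_train) %>%
  step_dummy(sex) %>%
  step_normalize(all_numeric_predictors())

penguin_fit <- workflow(penguin_rec, svm_linear(mode = "classification")) %>%
  fit(data = penguins_train)

v <- vetiver_model(penguin_fit, "svm_penguins")

# Assumption: a temporary board for illustration; a real deployment would
# use a board that the deployed API can also reach
board <- board_temp(versioned = TRUE)
vetiver_pin_write(board, v)

# Write a plumber file that reads the pinned model and serves predictions
plumber_file <- file.path(tempdir(), "plumber.R")
vetiver_write_plumber(board, "svm_penguins", file = plumber_file)

# Inspect the generated file
cat(readLines(plumber_file), sep = "\n")
```

A file written against a temporary board only works on the machine that created it, so this is useful for seeing what gets generated rather than for deployment.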