Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Rの機械学習フレームワークの紹介〜tidymodelsを中心に〜 / machine_learning_with_r2024

Rの機械学習フレームワークの紹介〜tidymodelsを中心に〜 / machine_learning_with_r2024

2024年3月9日に開催された[日本統計学会第18回春季集会](https://jss2024spring.ywstat.jp/)(成城大学)における企画セッション AM-B「Rユーザのための機械学習チュートリアル」で用いたスライド資料です。

Rのソースコードを含んだGitHubリポジトリ: https://github.com/uribo/240309_jss18tutorial

第一部「Rの機械学習フレームワークの紹介〜tidymodelsを中心に〜
第二部「地理空間データの機械学習への適用

Uryu Shinya

March 11, 2024
Tweet

More Decks by Uryu Shinya

Other Decks in Research

Transcript

  1. 3Ͱͷ.-ϑϨʔϜϫʔΫᶃUJEZNPEFMT \SFDJQFT^ \STBNQMF^ σʔλલॲཧɺ ಛ௃ྔΤϯδχΞϦϯά σʔλ෼ׂɺ ϦαϯϓϦϯά \QBSTOJQ^ Ϟσϧߏஙɾద༻ \ZBSETUJDL^

    \XPSL fl PXT^ ϞσϧͷੑೳධՁ ֤޻ఔͷϫʔΫϑϩʔԽɺ ޮ཰తͳॲཧ \UVOF^ ύϥϝʔλ୳ࡧ ओͳύοέʔδͱͦͷ༻్
  2. 3Ͱͷ.-ϑϨʔϜϫʔΫᶄNMS https://mlr-org.com/ NMSύοέʔδΛத৺ͱͯ͠ɺػցֶशͷ޻ఔʹඞཁͳύοέʔδ܈Λ NMSWFSTFύοέʔδͱͯ͠·ͱΊͯఏڙ UBTL MFBSOFS SFTBNQMFS UVOFS QFSGPSNBODF υΠπΛத৺ʹ׆ಈ͢ΔNMSPSH͕։ൃ

    3ΫϥεͰ࣮૷ ಛ௃ σʔλΛEBUBUBCMFͱͯ͠ѻ͏ NMSͷ֦ுɻ஍ཧۭؒσʔλΛѻ͏ͨΊͷύοέʔδ΋ؚ·ΕΔ ஍ཧۭؒσʔλؔ܎ͷ ॲཧʹରͯ͠ɺUJEZNPEFMTΑΓ΋ ஫ྗ͍ͯ͠Δʁʢӝੜࡶײʣ
  3. Ψεڙڅঢ়گͷೋ஋෼ྨϞσϧͷߏங ஍ཧۭؒଐੑΛআ͍͍ͯ·͢ɻ͜ΕΒ͸ୈೋ෦Ͱѻ͍·͢ dplyr::glimpse(lp_supply) #> Rows: 1,812 #> Columns: 8 #>

    $ price <int> 115000, 61400, 57200, 51600, 177000, 208000, 57600, 68600… #> $ water <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU… #> $ gas <fct> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TR… #> $ sewer <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU… #> $ above_floor <int> 8, 0, 3, 3, 5, 12, 3, 3, 3, 2, 6, 2, 7, 2, 2, 3, 2, 3, 3,… #> $ under_floor <int> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, … #> $ dist_from_st <int> 1300, 900, 270, 2100, 150, 0, 150, 850, 1200, 400, 640, 1… #> $ fire <chr> "๷Ր", "४๷", "४๷", "४๷", "๷Ր", "๷Ր", "४๷", "…
  4. Ψεڙڅঢ়گͷೋ஋෼ྨϞσϧͷߏங dplyr::glimpse(lp_supply) #> Rows: 1,812 #> Columns: 8 #> $

    price <int> #> $ water <lgl> #> $ gas <fct> #> $ sewer <lgl> #> $ above_floor <int> #> $ under_floor <int> #> $ dist_from_st <int> #> $ fire <chr> 水水 火
  5. library(tidymodels) #> ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ── #>

    ✔ broom 1.0.5 ✔ recipes 1.0.10 #> ✔ dials 1.2.1 ✔ rsample 1.2.0 #> ✔ dplyr 1.1.4 ✔ tibble 3.2.1 #> ✔ ggplot2 3.5.0 ✔ tidyr 1.3.1 #> ✔ infer 1.0.6 ✔ tune 1.1.2 #> ✔ modeldata 1.3.0 ✔ workflows 1.1.4 #> ✔ parsnip 1.2.0 ✔ workflowsets 1.0.1 #> ✔ purrr 1.0.2 ✔ yardstick 1.3.0 #> #> ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ── #> ✖ purrr::discard() masks scales::discard() #> ✖ dplyr::filter() masks stats::filter() #> ✖ dplyr::lag() masks stats::lag() #> ✖ recipes::step() masks stats::step() #> • Dig deeper into tidy modeling with R at https://www.tmwr.org UJEZNPEFMTύοέʔδͷಡΈࠐΈ
  6. library(tidymodels) #> ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ── #>

    ✔ broom 1.0.5 ✔ recipes 1.0.10 #> ✔ dials 1.2.1 ✔ rsample 1.2.0 #> ✔ dplyr 1.1.4 ✔ tibble 3.2.1 #> ✔ ggplot2 3.5.0 ✔ tidyr 1.3.1 #> ✔ infer 1.0.6 ✔ tune 1.1.2 #> ✔ modeldata 1.3.0 ✔ workflows 1.1.4 #> ✔ parsnip 1.2.0 ✔ workflowsets 1.0.1 #> ✔ purrr 1.0.2 ✔ yardstick 1.3.0 #> #> ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ── #> ✖ purrr::discard() masks scales::discard() #> ✖ dplyr::filter() masks stats::filter() #> ✖ dplyr::lag() masks stats::lag() #> ✖ recipes::step() masks stats::step() #> • Dig deeper into tidy modeling with R at https://www.tmwr.org tidymodels package::function UJEZNPEFMTύοέʔδͷಡΈࠐΈ
  7. ؔ਺໊ͷিಥ ύοέʔδؒͰಉؔ͡਺໊͕࢖ΘΕ͍ͯΔͱɺʮিಥʯ͕ൃੜ͢Δ ͋ͱ͔ΒಡΈࠐ·Εͨؔ਺໊͕༏ઌ͞ΕΔ ྫʣEQMZSMBH ͱTUBUTMBH ಉؔ͡਺໊Ͱ͋ͬͯ΋ػೳ͸ύοέʔδؒͰҟͳΔͨΊɺ ҙਤ͠ͳ͍݁ՌΛ΋ͨΒ͢ɺૢ࡞͕ߦΘΕͳ͍͓ͦΕ͕͋Δ lag(ldeaths, 1) #>

    Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec #> 1973 3035 #> 1974 2552 2704 2554 2014 1655 1721 1524 1596 2074 2199 2512 2933 #> 1975 2889 2938 2497 1870 1726 1607 1545 1396 1787 2076 2837 2787 #> 1976 3891 3179 2011 1636 1580 1489 1300 1356 1653 2013 2823 3102 #> 1977 2294 2385 2444 1748 1554 1498 1361 1346 1564 1640 2293 2815 #> 1978 3137 2679 1969 1870 1633 1529 1366 1357 1570 1535 2491 3084 #> 1979 2605 2573 2143 1693 1504 1461 1354 1333 1492 1781 1915 library(dplyr) lag(ldeaths, 1) #> Error in `lag()`: #> ! `x` must be a vector, not a <ts>, do you want `stats::lag()`?  
  8.   library(tidymodels) tidymodels_prefer() library(conflicted) # ؔ਺ͷিಥ͕ൃੜ͍ͯ͠Δͱɺ ɹɹɹɹɹɹɹɹɹɹɹɹ# ༏ઌॱҐΛࢦఆ͢ΔΑ͏ʹଅ͢ϝοηʔδΛग़ྗͯ͘͠ΕΔ conflict_scout()

    # ݱࡏͷিಥঢ়گΛ֬ೝʢtidymodels಺Ͱͷڝ߹ʣ #> 4 conflicts #> • `discard()`: purrr and scales #> • `filter()`: dplyr #> • `lag()`: dplyr and stats #> • `step()`: recipes UJEZNPEFMT@QSFGFS Λ࣮ߦ͢ΔͱɺUJEZNPEFMT͕ґଘ͢Δ ύοέʔδͷؔ਺͕༏ઌతʹར༻͞ΕΔΑ͏ʹ͢Δ ؔ਺໊ͷিಥΛճආ͢Δʹ͸ ໊લۭؒΛࢦఆͯؔ͠਺Λར༻͢ΔˠQBDLBHFGVODJUPO ͷܗࣜ DPO fl JDUFEύοέʔδͰ໌ࣔతʹ༏ઌ͢Δؔ਺Λࢦఆ͢Δ 行
  9. ֶशσʔλ ςετσʔλ set.seed(123) lp_split <- initial_split(lp_supply, prop = 0.8, strata

    = gas) lp_split #> <Training/Testing/Total> #> <1449/363/1812> lp_train <- training(lp_split) lp_test <- testing(lp_split) \STBNQMF^σʔλ෼ׂ
  10. ֶशσʔλ ςετσʔλ set.seed(123) lp_split <- initial_split(lp_supply, prop = 0.8, strata

    = gas) lp_split #> <Training/Testing/Total> #> <1449/363/1812> lp_train <- training(lp_split) lp_test <- testing(lp_split) 比 \STBNQMF^σʔλ෼ׂ
  11. \QBSTOJQ^ֶशثͷߏங  Ϟσϧͷબ୒ ˠܾఆ໦  Τϯδϯͷࢦఆ  Ϟʔυͷࢦఆ logistic_reg() #>

    Logistic Regression Model Specification (classification) #> #> Computational engine: glm 自 classi fi cation linear_reg() #> Linear Regression Model Specification (regression) decision_tree() #> Decision Tree Model Specification (unknown mode) #> #> Computational engine: rpart 示
  12. \QBSTOJQ^ֶशثͷߏங  Ϟσϧͷબ୒ ˠܾఆ໦  Τϯδϯͷࢦఆ ˠSQBSUύοέʔδ  Ϟʔυͷࢦఆ 1

    文 用 show_engines("decision_tree") #> # A tibble: 5 × 2 #> engine mode #> <chr> <chr> #> 1 rpart classification #> 2 rpart regression #> 3 C5.0 classification #> 4 spark classification #> 5 spark regression set_engine("rpart") #> Decision Tree Model Specification (unknown mode) #> #> Computational engine: rpart decision_tree() |>
  13. tree_spec <- set_mode("classification") tree_spec #> Decision Tree Model Specification (classification)

    #> #> Main Arguments: #> cost_complexity = 0.002 #> #> Computational engine: rpart \QBSTOJQ^ֶशثͷߏங  Ϟσϧͷબ୒ ˠܾఆ໦  Τϯδϯͷࢦఆ ˠSQBSUύοέʔδ  Ϟʔυͷࢦఆ ˠ෼ྨ໰୊ decision_tree(cost_complexity = 0.002) |> set_engine("rpart") |> classi fi cation regression
  14. \QBSTOJQ^ֶशثͷߏங ranger::ranger(x, y, mtry = min_cols(~10, x), num.trees = 2000,

    ...) ྫʣϥϯμϜϑΥϨετͰͷܾఆ໦࡞੒࣌ͷಛ௃ྔ਺ NUSZ ɺܾఆ໦ͷ਺ OUSFF Λ ɹɹἧ͑ɺҟͳΔύοέʔδͰֶशثΛ࣮૷͢Δ randomForest::randomForest(x, y, mtry = min_cols(~10, x), ntree = 2000) sparklyr::ml_random_forest(x, formula, type = "regression", feature_subset_strategy = "10", num_trees = 2000, ...) ύοέʔδؒͰύϥϝʔλͷࢦఆํ๏͕ҟͳΔʜ
  15. rand_forest(mtry = 10, trees = 2000) |> set_engine("ranger") |> set_mode("regression")

    |> translate() #> Random Forest Model Specification (regression) #> #> Main Arguments: #> mtry = 10 #> trees = 2000 #> #> Computational engine: ranger #> #> Model fit template: #> ranger::ranger(x = missing_arg(), y = missing_arg(), weights = missing_arg(), #> mtry = min_cols(~10, x), num.trees = 2000, num.threads = 1, #> verbose = FALSE, seed = sample.int(10^5, 1)) \QBSTOJQ^ֶशثͷߏங ΤϯδϯʹԠͯ͡ҟͳΔύϥϝʔλͷࢦఆํ๏ΛϞσϧؔ਺಺Ͱ౷Ұ ΤϯδϯΛม͑Δ͚ͩͰύϥϝʔλͷࢦఆํ๏ʹؾΛݣΘͳ͍Ͱ͢Ή
  16. \XPSL fl PXT^ϫʔΫϑϩʔԽ ͜͜·Ͱͷ಺༰Λ·ͱΊΔͱʜ tree_wflow <- workflow(gas ~ ., spec

    = tree_spec) tree_wflow <- workflow() |> add_formula(gas ~ .) |> add_model(tree_spec) ͋Δ͍͸
  17. tree_wflow <- workflow(gas ~ ., spec = tree_spec) tree_wflow <-

    workflow() |> add_formula(gas ~ .) |> add_model(tree_spec) ͋Δ͍͸ \XPSL fl PXT^ϫʔΫϑϩʔԽ 目 tree_fit <- tree_wflow |> fit(data = lp_train) predict(tree_fit, new_data = lp_test) ςετσʔλ Λ༻ֶ͍ͨश ֶशσʔλ ʹର͢Δ༧ଌ ϫʔΫϑϩʔԽʹΑΓֶशثͷมߋ͕༰қͱͳΔ lp_tree_augment <- augment(tree_fit, new_data = lp_train) ΁ͷ༧ଌ݁Ռͷద༻ ֶशσʔλ
  18. tree_fit #> ══ Workflow [trained] ══════════════════════════════════════════════════════════ #> Preprocessor: Formula #>

    Model: decision_tree() #> #> ── Preprocessor ──────────────────────────────────────────────────────────────── #> gas ~ . #> #> ── Model ─────────────────────────────────────────────────────────────────────── #> n= 1449 #> #> node), split, n, loss, yval, (yprob) #> * denotes terminal node #> #> 1) root 1449 122 TRUE (0.08419600 0.91580400) #> 2) price< 124500 181 78 TRUE (0.43093923 0.56906077) #> 4) dist_from_st>=5550 10 0 FALSE (1.00000000 0.00000000) * #> 5) dist_from_st< 5550 171 68 TRUE (0.39766082 0.60233918) #> … ʢলུʣ #> 3) price>=124500 1268 44 TRUE (0.03470032 0.96529968) * \XPSL fl PXT^ϫʔΫϑϩʔԽ Λ༻ֶ͍ͨश ֶशσʔλ
  19. \XPSL fl PXT^ϫʔΫϑϩʔԽ lp_tree_augment |> # ֶशσʔλʹϞσϧ༧ଌ݁Ռ͕ྻͱͯ͠༩͑ΒΕ͍ͯΔ # ֬ೝͷͨΊʹҰ෦ͷྻ͚ͩΛදࣔ select(starts_with(".pred_"),

    gas) #> # A tibble: 1,449 × 4 #> .pred_class .pred_FALSE .pred_TRUE gas #> <fct> <dbl> <dbl> <fct> #> 1 TRUE 0 1 TRUE #> 2 TRUE 0.0833 0.917 TRUE #> 3 TRUE 0.25 0.75 TRUE #> 4 TRUE 0.0347 0.965 TRUE #> 5 TRUE 0.0347 0.965 TRUE #> 6 TRUE 0.0833 0.917 TRUE #> 7 TRUE 0.0833 0.917 TRUE #> 8 FALSE 0.889 0.111 TRUE #> 9 TRUE 0 1 TRUE #> 10 TRUE 0.0833 0.917 TRUE #> # ℹ 1,439 more rows ΁ͷ༧ଌ݁Ռͷద༻ ֶशσʔλ
  20. #> # A tibble: 1,449 × 4 #> .pred_class .pred_FALSE

    .pred_TRUE gas #> <fct> <dbl> <dbl> <fct> #> 1 TRUE 0 1 TRUE #> 2 TRUE 0.0833 0.917 TRUE #> 3 TRUE 0.25 0.75 TRUE #> 4 TRUE 0.0347 0.965 TRUE #> 5 TRUE 0.0347 0.965 TRUE #> 6 TRUE 0.0833 0.917 TRUE #> 7 TRUE 0.0833 0.917 TRUE #> 8 FALSE 0.889 0.111 TRUE #> 9 TRUE 0 1 TRUE #> 10 TRUE 0.0833 0.917 TRUE #> # ℹ 1,439 more rows \XPSL fl PXT^ϫʔΫϑϩʔԽ lp_tree_augment |> # ֶशσʔλʹϞσϧ༧ଌ݁Ռ͕ྻͱͯ͠༩͑ΒΕ͍ͯΔ # ֬ೝͷͨΊʹҰ෦ͷྻ͚ͩΛදࣔ select(starts_with(".pred_"), gas) ΁ͷ༧ଌ݁Ռͷద༻ ֶशσʔλ
  21. \WJQ^ \%"-&9^ߏஙͨ͠Ϟσϧͷղऍ tree_exp <- DALEXtra::explain_tidymodels(tree_fit, data = lp_test, y =

    as.numeric(lp_test$gas)-1, label = "lp_tree") ·ͨ͸ DALEX::explain(tree_fit, data = lp_test, y = as.numeric(lp_test$gas)-1, label = "lp_tree") \%"-&9^Ͱ͸UJEZNPEFMTͰ࡞੒ͨ͠ϞσϧΛѻ͑Δ
  22. ߏஙͨ͠Ϟσϧͷղऍʙ(MPCBMʙ tree_fit |> extract_fit_engine() |> vip::vip() tree_effect <- model_parts(tree_exp, type

    = "variable_importance") plot(tree_effect, show_boxplots = TRUE) ม਺ॏཁ౓
  23. ߏஙͨ͠Ϟσϧͷղऍʙ-PDBMʙ ୯Ұͷ؍ଌʹରͯ͠Ϟσϧ͕ͲͷΑ͏ʹಇ͔͘Λ֬ೝ predict(tree_fit, new_data = lp_test[1, ]) #> # A

    tibble: 1 × 1 #> .pred_class #> <fct> #> 1 FALSE predict(tree_exp, lp_test[1, ]) #> TRUE #>0.4736842 predict_parts( tree_exp, new_observation = lp_test[1, ]) |> plot()
  24. \ZBSETUJDL^ϞσϧͷੑೳධՁ autoplot("heatmap") lp_tree_augment |> conf_mat(truth = gas, ɹɹɹɹɹɹ estimate =

    .pred_class) |> lp_tree_augment |> conf_mat(truth = gas, estimate = .pred_class) #> Truth #> Prediction FALSE TRUE #> FALSE 69 32 #> TRUE 53 1295 ֶशσʔλ ͱϞσϧͷ༧ଌ஋ʹର͢Δࠞಉߦྻ
  25. \ZBSETUJDL^ϞσϧͷੑೳධՁ lp_tree_augment |> accuracy(truth = gas, estimate = .pred_class) #>

    # A tibble: 1 × 3 #> .metric .estimator .estimate #> <chr> <chr> <dbl> #> 1 accuracy binary 0.941 lp_tree_augment |> sensitivity(truth = gas, estimate = .pred_class) #> # A tibble: 1 × 3 #> .metric .estimator .estimate #> <chr> <chr> <dbl> #> 1 sensitivity binary 0.566 lp_tree_augment |> specificity(truth = gas, estimate = .pred_class) #> # A tibble: 1 × 3 #> .metric .estimator .estimate #> <chr> <chr> <dbl> #> 1 specificity binary 0.976 ֶशσʔλ ͱϞσϧͷ༧ଌ஋ʹର͢Δ͞·͟·ͳධՁࢦඪͷࢉग़
  26. lp_tree_augment |> accuracy(truth = gas, estimate = .pred_class) #> #

    A tibble: 1 × 3 #> .metric .estimator .estimate #> <chr> <chr> <dbl> #> 1 accuracy binary 0.941 lp_tree_augment |> sensitivity(truth = gas, estimate = .pred_class) #> # A tibble: 1 × 3 #> .metric .estimator .estimate #> <chr> <chr> <dbl> #> 1 sensitivity binary 0.566 lp_tree_augment |> specificity(truth = gas, estimate = .pred_class) #> # A tibble: 1 × 3 #> .metric .estimator .estimate #> <chr> <chr> <dbl> #> 1 specificity binary 0.976 \ZBSETUJDL^ϞσϧͷੑೳධՁ 用 accuracy sensitivity specificity
  27. lp_tree_augment |> accuracy(truth = gas, estimate = .pred_class) lp_tree_augment |>

    sensitivity(truth = gas, estimate = .pred_class) lp_tree_augment |> specificity(truth = gas, estimate = .pred_class) \ZBSETUJDL^ϞσϧͷੑೳධՁ 用 lp_metrics <- metric_set(accuracy, sensitivity, specificity) lp_tree_augment |> lp_metrics(truth = gas, estimate = .pred_class) #> # A tibble: 3 × 3 #> .metric .estimator .estimate #> <chr> <chr> <dbl> #> 1 accuracy binary 0.941 #> 2 sensitivity binary 0.566 #> 3 specificity binary 0.976
  28. \ZBSETUJDL^ϞσϧͷੑೳධՁ ͱϞσϧͷ༧ଌ஋ʹର͢Δ͞·͟·ͳධՁࢦඪͷࢉग़ ςετσʔλ ͓͞Β͍ʜϞσϧͷ༧ଌ݁Ռʢϥϕϧɺϥϕϧʹର͢Δ֬཰ʣΛ σʔλϑϨʔϜͷσʔλͷྻͱͯ͠௥Ճ͢Δؔ਺͸ʁ tree_fit |> augment(new_data = lp_test)

    |> lp_metrics(truth = gas, estimate = .pred_class) #> # A tibble: 3 × 3 #> .metric .estimator .estimate #> <chr> <chr> <dbl> #> 1 accuracy binary 0.923 #> 2 sensitivity binary 0.417 #> 3 specificity binary 0.959
  29. \ZBSETUJDL^ϞσϧͷੑೳධՁ ςετσʔλ tree_fit |> augment(new_data = lp_test) |> lp_metrics(truth =

    gas, estimate = .pred_class) #> # A tibble: 3 × 3 #> .metric .estimator .estimate #> <chr> <chr> <dbl> #> 1 accuracy binary 0.923 #> 2 sensitivity binary 0.417 #> 3 specificity binary 0.959 lp_tree_augment |> lp_metrics(truth = gas, estimate = .pred_class) #> # A tibble: 3 × 3 #> .metric .estimator .estimate #> <chr> <chr> <dbl> #> 1 accuracy binary 0.941 #> 2 sensitivity binary 0.566 #> 3 specificity binary 0.976 ֶशσʔλ
  30. \%"-&9^ʹΑΔϞσϧͷੑೳධՁ ςετσʔλ ʹର͢ΔධՁࢦඪͷࢉग़ DALEX::model_performance(tree_exp) #> Measures for: classification #> recall

    : 0.9587021 #> precision : 0.9587021 #> f1 : 0.9587021 #> accuracy : 0.922865 #> auc : 0.8038963 #> #> Residuals: #> 0% 10% 20% 30% 40% 50% #> -0.96529968 0.03470032 0.03470032 0.03470032 0.03470032 0.03470032 #> 60% 70% 80% 90% 100% #> 0.03470032 0.03470032 0.03470032 0.03470032 0.88888889
  31. \STBNQMF^ަࠩݕূ๏ͷಋೖ # 2ճͷ܁Γฦ͠ɺ10෼ׂަࠩݕূ lp_folds <- vfold_cv(lp_train, v = 10, repeats

    = 2, strata = gas) lp_folds #> # 10-fold cross-validation repeated 2 times using stratification #> # A tibble: 20 × 3 #> splits id id2 #> <list> <chr> <chr> #> 1 <split [1304/145]> Repeat1 Fold01 #> 2 <split [1304/145]> Repeat1 Fold02 #> 3 <split [1304/145]> Repeat1 Fold03 #> 4 <split [1304/145]> Repeat1 Fold04 #> 5 <split [1304/145]> Repeat1 Fold05 #> 6 <split [1304/145]> Repeat1 Fold06 #> 7 <split [1304/145]> Repeat1 Fold07 #> 8 <split [1304/145]> Repeat1 Fold08 #> 9 <split [1304/145]> Repeat1 Fold09 #> 10 <split [1305/144]> Repeat1 Fold10 #> 11 <split [1304/145]> Repeat2 Fold01 #> 12 <split [1304/145]> Repeat2 Fold02 #> 13 <split [1304/145]> Repeat2 Fold03 #> 14 <split [1304/145]> Repeat2 Fold04 #> 15 <split [1304/145]> Repeat2 Fold05 #> 16 <split [1304/145]> Repeat2 Fold06 #> 17 <split [1304/145]> Repeat2 Fold07 #> 18 <split [1304/145]> Repeat2 Fold08 #> 19 <split [1304/145]> Repeat2 Fold09 #> 20 <split [1305/144]> Repeat2 Fold10
  32. \STBNQMF^ަࠩݕূ๏ͷಋೖ # 2ճͷ܁Γฦ͠ɺ10෼ׂަࠩݕূ lp_folds <- vfold_cv(lp_train, v = 10, repeats

    = 2, strata = gas) lp_folds #> # 10-fold cross-validation repeated 2 times using stratification #> # A tibble: 20 × 3 #> splits id id2 #> <list> <chr> <chr> #> 1 <split [1304/145]> Repeat1 Fold01 #> 2 <split [1304/145]> Repeat1 Fold02 #> 3 <split [1304/145]> Repeat1 Fold03 #> 4 <split [1304/145]> Repeat1 Fold04 #> 5 <split [1304/145]> Repeat1 Fold05 #> 6 <split [1304/145]> Repeat1 Fold06 #> 7 <split [1304/145]> Repeat1 Fold07 #> 8 <split [1304/145]> Repeat1 Fold08 #> 9 <split [1304/145]> Repeat1 Fold09 #> 10 <split [1305/144]> Repeat1 Fold10 splits <list> <split [1304/145]> <split [1304/145]> <split [1304/145]> <split [1304/145]> <split [1304/145]> <split [1304/145]> <split [1304/145]> <split [1304/145]> <split [1304/145]> <split [1305/144]>
  33. # 2ճͷ܁Γฦ͠ɺ10෼ׂަࠩݕূ lp_folds <- vfold_cv(lp_train, v = 10, repeats =

    2, strata = gas) lp_folds #> # 10-fold cross-validation repeated 2 times using stratification #> # A tibble: 20 × 3 #> splits id id2 #> <list> <chr> <chr> #> 1 <split [1304/145]> Repeat1 Fold01 #> 2 <split [1304/145]> Repeat1 Fold02 #> 3 <split [1304/145]> Repeat1 Fold03 #> 4 <split [1304/145]> Repeat1 Fold04 #> 5 <split [1304/145]> Repeat1 Fold05 #> 6 <split [1304/145]> Repeat1 Fold06 #> 7 <split [1304/145]> Repeat1 Fold07 #> 8 <split [1304/145]> Repeat1 Fold08 #> 9 <split [1304/145]> Repeat1 Fold09 #> 10 <split [1305/144]> Repeat1 Fold10 \STBNQMF^ަࠩݕূ๏ͷಋೖ <split [1304/145]> <split [1304/145]> <split [1304/145]> lp_folds$splits[1:3] #> [[1]] #> <Analysis/Assess/Total> #> <1304/145/1449> #> #> [[2]] #> <Analysis/Assess/Total> #> <1304/145/1449> #> #> [[3]] #> <Analysis/Assess/Total> #> <1304/145/1449>
  34. \UVOF^Ϧαϯϓϧσʔλ΁ͷॲཧ lp_fit_rs <- fit_resamples(tree_wflow, lp_folds) lp_fit_rs #> # Resampling results

    #> # 10-fold cross-validation repeated 2 times using stratification #> # A tibble: 20 × 5 #> splits id id2 .metrics .notes #> <list> <chr> <chr> <list> <list> #> 1 <split [1304/145]> Repeat1 Fold01 <tibble [2 × 4]> <tibble [0 × 3]> #> 2 <split [1304/145]> Repeat1 Fold02 <tibble [2 × 4]> <tibble [0 × 3]> #> 3 <split [1304/145]> Repeat1 Fold03 <tibble [2 × 4]> <tibble [0 × 3]> #> … (লུ) #> 18 <split [1304/145]> Repeat2 Fold08 <tibble [2 × 4]> <tibble [0 × 3]> #> 19 <split [1304/145]> Repeat2 Fold09 <tibble [2 × 4]> <tibble [0 × 3]> #> 20 <split [1305/144]> Repeat2 Fold10 <tibble [2 × 4]> <tibble [0 × 3]> Ϧαϯϓϧσʔλ͝ͱʹֶशΛߦ͏
  35. \UVOF^Ϧαϯϓϧσʔλ΁ͷॲཧ lp_fit_rs <- fit_resamples(tree_wflow, lp_folds) lp_fit_rs #> # Resampling results

    #> # 10-fold cross-validation repeated 2 times using stratification #> # A tibble: 20 × 5 #> splits id id2 .metrics .notes #> <list> <chr> <chr> <list> <list> #> 1 <split [1304/145]> Repeat1 Fold01 <tibble [2 × 4]> <tibble [0 × 3]> #> 2 <split [1304/145]> Repeat1 Fold02 <tibble [2 × 4]> <tibble [0 × 3]> #> 3 <split [1304/145]> Repeat1 Fold03 <tibble [2 × 4]> <tibble [0 × 3]> #> … (লུ) #> 18 <split [1304/145]> Repeat2 Fold08 <tibble [2 × 4]> <tibble [0 × 3]> #> 19 <split [1304/145]> Repeat2 Fold09 <tibble [2 × 4]> <tibble [0 × 3]> #> 20 <split [1305/144]> Repeat2 Fold10 <tibble [2 × 4]> <tibble [0 × 3]> .metrics <list> <tibble [2 × 4]> <tibble [2 × 4]> <tibble [2 × 4]> <tibble [2 × 4]> <tibble [2 × 4]> <tibble [2 × 4]> 行
  36. \UVOF^Ϧαϯϓϧσʔλ΁ͷॲཧ lp_fit_rs$.metrics[[1]] #> # A tibble: 2 × 4 #>

    .metric .estimator .estimate .config #> <chr> <chr> <dbl> <chr> #> 1 accuracy binary 0.917 Preprocessor1_Model1 #> 2 roc_auc binary 0.800 Preprocessor1_Model1 ͢΂ͯͷϦαϯϓϧσʔλ͔Βੑೳͷฏۉ஋ΛٻΊΔ lp_fit_rs |> collect_metrics() #> # A tibble: 2 × 6 #> .metric .estimator mean n std_err .config #> <chr> <chr> <dbl> <int> <dbl> <chr> #> 1 accuracy binary 0.909 20 0.00566 Preprocessor1_Model1 #> 2 roc_auc binary 0.776 20 0.0195 Preprocessor1_Model1
  37. \UVOF^Ϧαϯϓϧσʔλ΁ͷॲཧ lp_ctrl <- # Ϧαϯϓϧσʔλʹର͢Δ༧ଌ஋Λྻʹอଘ͢ΔͨΊ control_resamples(save_pred = TRUE) lp_preds <-

    fit_resamples(tree_wflow, lp_folds, control = lp_ctrl) |> # ϦαϯϓϧσʔλΛల։ͯ͠ҰͭͷσʔλϑϨʔϜʹ collect_predictions() lp_preds #> # A tibble: 2,898 × 8 #> id id2 .pred_FALSE .pred_TRUE .row .pred_class gas .config #> <chr> <chr> <dbl> <dbl> <int> <fct> <fct> <chr> #> 1 Repeat1 Fold01 0.0359 0.964 4 TRUE TRUE Preprocessor1_… #> 2 Repeat1 Fold01 0 1 9 TRUE TRUE Preprocessor1_… #> 3 Repeat1 Fold01 0.889 0.111 22 FALSE FALSE Preprocessor1_… #> …(লུ)
  38. \UVOF^Ϧαϯϓϧσʔλ΁ͷॲཧ lp_preds |> group_by(id) |> lp_metrics(truth = gas, estimate =

    .pred_class) #> # A tibble: 6 × 4 #> id .metric .estimator .estimate #> <chr> <chr> <chr> <dbl> #> 1 Repeat1 accuracy binary 0.906 #> 2 Repeat2 accuracy binary 0.912 #> 3 Repeat1 sensitivity binary 0.270 #> 4 Repeat2 sensitivity binary 0.270 #> 5 Repeat1 specificity binary 0.965 #> 6 Repeat2 specificity binary 0.971
  39. \NMSWFSTF^Ͱ͜͜·Ͱͷ޻ఔΛ࣮ߦ library(mlr3verse) lp_supply_chr2fct <- lp_supply |> mutate(across(where(is.character), as.factor)) # λεΫͷఆٛ

    lp_task <- as_task_classif(lp_supply_chr2fct, target = "gas") # σʔλ෼ׂʢ૚Խநग़ɺ80%Λֶशσʔλʹʣ set.seed(123) lp_split_mlr <- partition(lp_task, stratify = TRUE, ratio = 0.8) # ֶशثͷߏங tree_learner <- lrn("classif.rpart", cp = 0.002)
  40. \NMSWFSTF^Ͱ͜͜·Ͱͷ޻ఔΛ࣮ߦ # ϦαϯϓϦϯάσʔλͷ༻ҙʢ2ճͷ܁Γฦ͠5෼ׂަࠩݕূ๏ʣ lp_folds_mlr <- rsmp("repeated_cv", repeats = 2, folds

    = 5) rr <- resample(lp_task, tree_learner, lp_folds_mlr) rr$aggregate(msr("classif.acc")) #> classif.acc #> 0.9213476 rr$aggregate(msr("classif.sensitivity")) #> classif.sensitivity #> 0.3157344 rr$aggregate(msr("classif.specificity")) #> classif.specificity #> 0.9747444
  41. \SFDJQFT^ಛ௃ྔΤϯδχΞϦϯά recipe(gas ~ ., data = lp_train) |> step_zv(all_predictors()) lp_supply$water

    |> unique() #> [1] TRUE step_*() 行 step_*() 目 TUFQ@[W ؔ਺͸ɺ୯Ұͷ஋͔ΒͳΔม਺ΛϞσϧ͔ࣜΒআ֎͢Δʢ;FSPWBSJBODF fi MUFSʣ ಛ௃ྔ΁ͷૢ࡞ΛʮϨγϐʯͱͯ͠ఆٛ͢Δ
  42. \SFDJQFT^ಛ௃ྔΤϯδχΞϦϯά lp_rec <- recipe(gas ~ ., data = lp_train) |>

    step_zv(all_predictors()) |> step_log(price, dist_from_st, offset = 0.01) |> step_normalize(all_numeric_predictors()) |> step_dummy(all_nominal_predictors()) lp_rec #> #> ── Recipe ────────────────────────────────────────────────────────────────────── #> #> ── Inputs #> Number of variables by role #> outcome: 1 #> predictor: 7 #> #> ── Operations #> • Zero variance filter on: all_predictors() #> • Log transformation on: price and dist_from_st #> • Centering and scaling for: all_numeric_predictors() #> • Dummy variables from: all_nominal_predictors() TUFQ@ ΛඞཁʹԠͯ͡௥Ճ͢Δ 用
  43. \SFDJQFT^ಛ௃ྔΤϯδχΞϦϯά prep(lp_rec) |> bake(new_data = lp_train) |> glimpse() #> Rows:

    1,449 #> Columns: 7 #> $ price <dbl> -1.0975606, -1.6677181, -1.8257103, -0.7057626, -0.559127… #> $ sewer <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU… #> $ above_floor <dbl> 1.4569564, -1.3153743, -0.2757503, 0.4173324, 2.8431217, … #> $ under_floor <dbl> 1.9139251, -0.3518245, -0.3518245, 1.9139251, -0.3518245,… #> $ dist_from_st <dbl> 0.63568930, 0.46296422, 0.86095169, -0.37863009, -4.89537… #> $ gas <fct> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU… #> $ fire_๷Ր <dbl> 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, … 用 用
  44. \XPSL fl PXT^ϫʔΫϑϩʔͷमਖ਼ tree_wflow |> remove_formula() |> add_recipe(lp_rec) #> ══

    Workflow ════════════════════════════════════════════════════════════════════ #> Preprocessor: Recipe #> Model: decision_tree() #> #> ── Preprocessor ──────────────────────────────────────────────────────────────── #> 4 Recipe Steps #> #> • step_zv() #> • step_log() #> • step_normalize() #> • step_dummy() #> #> ── Model ─────────────────────────────────────────────────────────────────────── #> Decision Tree Model Specification (classification) #> #> Main Arguments: #> cost_complexity = 0.002 #> #> Computational engine: rpart
  45. ϥϯμϜϑΥϨετͷ࣮૷ͱϞσϧͷൺֱ rf_spec <- rand_forest(trees = 1000, mode = "classification") |>

    set_engine("randomForest") rf_spec #> Random Forest Model Specification (classification) #> #> Main Arguments: #> trees = 1000 #> #> Computational engine: randomForest
  46. \XPSL fl PXTFUT^ෳ਺ϞσϧɺϨγϐΛѻ͏ wf_set_fit <- workflow_set(preproc = list(none = gas

    ~ ., prep = lp_rec), models = list(tree_spec, rf_spec), cross = TRUE) |> workflow_map("fit_resamples", resamples = lp_folds) wf_set_fit |> rank_results() #> # A tibble: 8 × 9 #> wflow_id .config .metric mean std_err n preprocessor model rank #> <chr> <chr> <chr> <dbl> <dbl> <int> <chr> <chr> <int> #> 1 prep_rand_forest Prepro… accura… 0.924 0.00475 20 recipe rand… 1 #> 2 prep_rand_forest Prepro… roc_auc 0.873 0.0155 20 recipe rand… 1 #> 3 none_rand_forest Prepro… accura… 0.927 0.00460 20 formula rand… 2 #> 4 none_rand_forest Prepro… roc_auc 0.870 0.0165 20 formula rand… 2 #> 5 none_decision_tr… Prepro… accura… 0.909 0.00566 20 formula deci… 3 #> 6 none_decision_tr… Prepro… roc_auc 0.776 0.0195 20 formula deci… 3 #> 7 prep_decision_tr… Prepro… accura… 0.909 0.00566 20 recipe deci… 4 #> 8 prep_decision_tr… Prepro… roc_auc 0.776 0.0195 20 recipe deci… 4
  47. \XPSL fl PXTFUT^ෳ਺ϞσϧɺϨγϐΛѻ͏ wf_set_fit <- workflow_set(preproc = list(none = gas

    ~ ., prep = lp_rec), models = list(tree_spec, rf_spec), cross = TRUE) |> workflow_map("fit_resamples", resamples = lp_folds) wf_set_fit |> rank_results() #> # A tibble: 8 × 9 #> wflow_id .config .metric mean std_err n preprocessor model rank #> <chr> <chr> <chr> <dbl> <dbl> <int> <chr> <chr> <int> #> 1 prep_rand_forest Prepro… accura… 0.924 0.00475 20 recipe rand… 1 #> 2 prep_rand_forest Prepro… roc_auc 0.873 0.0155 20 recipe rand… 1 #> 3 none_rand_forest Prepro… accura… 0.927 0.00460 20 formula rand… 2 #> 4 none_rand_forest Prepro… roc_auc 0.870 0.0165 20 formula rand… 2 #> 5 none_decision_tr… Prepro… accura… 0.909 0.00566 20 formula deci… 3 #> 6 none_decision_tr… Prepro… roc_auc 0.776 0.0195 20 formula deci… 3 #> 7 prep_decision_tr… Prepro… accura… 0.909 0.00566 20 recipe deci… 4 #> 8 prep_decision_tr… Prepro… roc_auc 0.776 0.0195 20 recipe deci… 4
  48. \UVOF^ϋΠύʔύϥϝʔλͷ୳ࡧ rf_tune_spec <- rand_forest(min_n = , trees = , mode

    = "classification") |> set_engine("randomForest") rf_tune_spec #> Random Forest Model Specification (classification) #> #> Main Arguments: #> trees = tune() #> min_n = tune() #> #> Computational engine: randomForest tune() tune() 行 tune()
  49. \UVOF^ϋΠύʔύϥϝʔλͷ୳ࡧ rf_tune_wflow <- workflow() |> add_model(rf_tune_spec) |> add_recipe(lp_rec) rf_fit_tune_res #>

    # A tibble: 20 × 5 #> splits id id2 .metrics .notes #> <list> <chr> <chr> <list> <list> #> 1 <vfld_spl> Repeat1 Fold01 <tibble [10 × 6]> <tibble [0 × 3]> #> 2 <vfld_spl> Repeat1 Fold02 <tibble [10 × 6]> <tibble [0 × 3]> #> 3 <vfld_spl> Repeat1 Fold03 <tibble [10 × 6]> <tibble [0 × 3]> #> …ʢলུʣ rf_fit_tune_res <- tune_grid(rf_tune_wflow, lp_folds, grid = 5)
  50. \UVOF^ϋΠύʔύϥϝʔλͷ୳ࡧ finalize_workflow(rf_tune_wflow, lp_tune_best_parameter) |> last_fit(lp_split) |> collect_metrics() #> # A

    tibble: 2 × 4 #> .metric .estimator .estimate .config #> <chr> <chr> <dbl> <chr> #> 1 accuracy binary 0.950 Preprocessor1_Model1 #> 2 roc_auc binary 0.909 Preprocessor1_Model1 lp_tune_best_parameter <- select_best(rf_fit_tune_res, metric = "roc_auc") lp_tune_best_parameter #> # A tibble: 1 × 3 #> trees min_n .config #> <int> <int> <chr> #> 1 1951 15 Preprocessor1_Model5
  51. ࢀߟࢿྉ .BY,VIO BOE,KFMM+PIOTPO  l'FBUVSF&OHJOFFSJOHBOE4FMFDUJPOz IUUQTCPPLEPXOPSHNBY'&4 .BY,VIOBOE+VMJB4JMHF  5JEZ.PEFMJOHXJUI3IUUQTXXXUNXSPSH দଜ༏࠸

    ӝੜਅ໵ ٢ଜ޿ࢤ  ʮ3ϢʔβͷͨΊͷUJEZNPEFMT<࣮ફ>ೖ໳ʙ Ϟμϯͳ౷ܭɾػցֶशϞσϦϯάͷੈքʯٕज़ධ࿦ࣾ 1S[FNZT“BX#JFDFLBOE5PNBT[#VS[ZLPXTLJ  &YQMBOBUPSZ.PEFM "OBMZTJTIUUQTFNBESXIZBJ .BY,VIOBOE,KFMM+PIOTPO  "QQMJFE.BDIJOF-FBSOJOHGPS5BCVMBS %BUBIUUQTBNMUEPSH #FSOE#JTDIM 3BQIBFM4POBCFOE -BST,PUUIP ff .JDIFM-BOH  "QQMJFE .BDIJOF-FBSOJOH6TJOHNMSJO3IUUQTNMSCPPLNMSPSHDPN