
Learning Model Building and Operation in R with tidymodels / tidymodels2020

Uryu Shinya
October 21, 2020

Transcript

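  1. Modeling iterates over multiple steps

    Work items: task definition; exploratory data analysis; feature engineering; choosing, fitting, and comparing models; parameter search; evaluating and improving generalization performance.
    Decisions to make:
    Task: regression, classification.
    Handling missing values: deletion, imputation.
    Evaluation metric: RMSE, AUC, precision.
    Priority: generalization performance, interpretability, lightness.
    Validation and search: validation, stacking, grid search, Bayesian optimization.
    Model: random forest, GBDT, neural network.
    Preprocessing: scaling, encoding, PCA.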
  2. tidymodels

        library(tidymodels)
        ✓ broom     0.7.1      ✓ recipes   0.1.13
        ✓ dials     0.0.9      ✓ rsample   0.0.8
        ✓ dplyr     1.0.2      ✓ tibble    3.0.4
        ✓ infer     0.5.3      ✓ tidyr     1.1.2
        ✓ modeldata 0.0.2      ✓ tune      0.1.1
        ✓ parsnip   0.1.3      ✓ workflows 0.2.1
        ✓ purrr     0.3.4      ✓ yardstick 0.0.7

    Developed under the same philosophy as the tidyverse packages.
    Loading one package makes several packages available.
    Provides a unified interface: pipe-operator friendly, with clear function and argument names.
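  3. Packages included in {tidymodels}

    {rsample}: splitting, resampling
    {recipes}: data preprocessing, feature generation
    {parsnip}: model building and application
    {yardstick}: model performance evaluation
    {dials}, {tune}: parameter search and tuning
    {workflows}: turning the processing up to model application into a workflow
    These are the packages covered in these slides.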
  4. Exercise: National Land Numerical Information, Published Land Price data

    Build a model that predicts land prices (a regression problem).

        dplyr::glimpse(df_lp)
        #> Rows: 8,476
        #> Columns: 8
        #> $ log_lp                <dbl> 3.618048, 4.591065, 4.754348…
        #> $ distance_from_station <int> 8700, 13000, 13000, 5500, 80…
        #> $ acreage               <int> 317, 166, 226, 274, 357, 173, 661…
        #> $ current_use           <fct> "住宅,その他", "住宅", "店舗"…
        #> $ building_coverage     <dbl> 0, 60, 80, 70, 70, 60, 70, 70, 70…
        #> $ building_structure    <fct> W, W, W, W, W, W, W, W, W, W, W…
        #> $ .longitude            <dbl> 138.5383, 138.5921, 138.5933…
        #> $ .latitude             <dbl> 36.46920, 36.61913, 36.62025…

    Source: Ministry of Land, Infrastructure, Transport and Tourism, National Land Numerical Information, Published Land Price data, version 2.4, L01, fiscal year Heisei 30 (2018). https://nlftp.mlit.go.jp/ksj/jpgis/datalist/KsjTmplt-L01-v1_1.html
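  5. Exercise: National Land Numerical Information, Published Land Price data

    Variable name (type): description
    log_lp (real): common logarithm of the land price.
    distance_from_station (integer): distance from the station (m).
    acreage (integer): land area (m2).
    current_use (factor): current use. A category indicating how the standard site is currently used. A site can fall into several categories.
    building_coverage (real): building coverage ratio. Ratio of the building's total floor area to the site area.
    building_structure (factor): building structure. Classification by the structure of the building on the standard site. SRC: steel-framed reinforced concrete, RC: reinforced concrete, S: steel frame, B: block, W: wood. UNKNOWN when not recorded.
    .longitude (real): longitude. Indicates the position of the published-land-price standard site.
    .latitude (real): latitude. Indicates the position of the published-land-price standard site.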
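  6. Data splitting

    Split the whole dataset into a training set (train) and an evaluation set (test).
    Training set: used to train the model.
    Evaluation set: given to the model as unseen information in order to measure its performance.
    Reference: "The trendy way to split data in R: cross-validation with the rsample package", HOXO-M Inc. blog, https://blog.hoxo-m.com/entry/…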
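  7. Data splitting

    The split is random. The share of the dataset given by prop becomes the analysis set, and strata requests stratified sampling according to the distribution of land prices.

        lp_split <- initial_split(df_lp, prop = 0.8, strata = log_lp)
        lp_split
        #> <Analysis/Assess/Total>
        #> <6358/2118/8476>
        lp_train <- training(lp_split) # training set
        lp_test <- testing(lp_split)   # evaluation set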
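The same kind of split can be tried without the land price data. The following is a minimal sketch on synthetic data; the data frame `toy` and its columns `y` and `x` are assumptions made only for illustration and are not part of the slides.

```r
library(tidymodels)

set.seed(123)
# Hypothetical stand-in for df_lp: a skewed numeric outcome and one predictor
toy <- data.frame(y = rlnorm(1000), x = runif(1000))

# Stratify on the outcome so that both sets cover its distribution
toy_split <- initial_split(toy, prop = 0.75, strata = y)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Every row ends up in exactly one of the two sets
nrow(toy_train) + nrow(toy_test) == nrow(toy)
```

With `strata`, the realized proportion is close to `prop` but need not match it exactly, because sampling happens within bins of the outcome.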
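  8. Preprocessing and feature engineering

    Turn the data-processing procedure used for the model into a "recipe".
    1. recipe(): define the relationships between the variables to use (specify the ingredients).
    2. step_*(): specify the data-processing procedure (write down the cooking method).
    3. prep(): consolidate the step_*() processing (check the recipe).
    4. bake(): apply to a dataset (do the cooking).
    Reference: "Preprocessing the data used by a model with recipes", HOXO-M Inc. blog, https://blog.hoxo-m.com/entry/…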
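  9. Adding processing for the model with the pipe operator

        init_lp_recipe <- lp_train %>%
          # Model with log_lp as the outcome and all other variables as predictors
          recipe(formula = log_lp ~ .) %>%
          # Step 1: transform acreage to its common logarithm
          step_log(acreage, base = 10)

    Naturally, writing the functions nested is also OK:

        step_log(recipe(lp_train, log_lp ~ .), acreage, base = 10)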
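  10. What kinds of processing can be specified?

    Processing is provided as step_*() functions: scaling, encoding, dates and times, filtering, dimensionality reduction, and more.

        ls("package:recipes", pattern = "^step_")
        #> # 77 step_* functions (version 0.1.14)

    There are also specialized packages: {textrecipes} for strings, {embed} for categorical variables, {themis} for imbalanced data.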
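  11. How to specify variables in step_*()

    1. By string: "acreage", "building_structure"
    2. tidyselect functions: starts_with() (starts with d), contains() (contains d), etc.
    3. Role in the model: all_predictors() (predictors), all_outcomes() (outcome)
    4. Data type of the variable: all_nominal() (categorical), all_numeric() (numeric)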
  12. Adding step_*()

        init_lp_recipe <- init_lp_recipe %>%
          step_mutate(distance_from_station = if_else(distance_from_station == 0,
                                                      0.1,
                                                      as.double(distance_from_station))) %>%
          step_log(distance_from_station, base = 10) %>%
          step_other(current_use, threshold = 0.01) %>%
          step_dummy(all_nominal()) %>%
          step_normalize(all_predictors())
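  13. Preprocessing and feature engineering

    Turn the data-processing procedure used for the model into a "recipe".
    1. recipe(): define the relationships between the variables to use (specify the ingredients).
    2. step_*(): specify the data-processing procedure (write down the cooking method).
    3. prep(): consolidate the step_*() processing (check the recipe).
    4. bake(): apply to a dataset (do the cooking).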
  14. The completed recipe

        lp_rec_prepped <- prep(init_lp_recipe)
        lp_rec_prepped
        #> Data Recipe
        #>
        #> Inputs:
        #>
        #>       role #variables
        #>    outcome          1
        #>  predictor          7
        #>
        #> Training data contained 6358 data points and no missing data.
        #>
        #> Operations:
        #>
        #> Log transformation on acreage [trained]
        #> Variable mutation for distance_from_station [trained]
        #> Log transformation on distance_from_station [trained]
        #> Collapsing factor levels for current_use [trained]
        #> Dummy variables from current_use, building_structure [trained]
        #> Centering and scaling for distance_from_station, acreage, ... [trained]
  15. Applying the recipe to the datasets

        # analysis set
        lp_train_prepped <- lp_rec_prepped %>%
          bake(new_data = NULL)
        # evaluation set
        lp_test_prepped <- lp_rec_prepped %>%
          bake(new_data = lp_test)

        glimpse(lp_train_prepped)
        #> Observations: 6,358
        #> Variables: 22
        #> $ distance_from_station <dbl> 1.48883723, 1.74347636, …
        #> $ acreage               <dbl> 0.348700377, -0.368317326, -0.026333976, …
        #> …
        #> $ current_use_住宅.店舗  <dbl> -0.2244556, -0.2244556, -0.2244556, …
        #> …
        #> $ current_use_other     <dbl> -0.2876676, -0.2876676, -0.2876676, …
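The recipe(), step_*(), prep(), bake() sequence can be sketched end to end on synthetic data. The data frame `toy` and the columns `y`, `area`, and `kind` are hypothetical stand-ins, not the land price variables.

```r
library(tidymodels)

set.seed(1)
# Hypothetical data: numeric outcome, a positive numeric predictor, a factor
toy <- data.frame(
  y    = rnorm(200),
  area = runif(200, min = 50, max = 500),
  kind = factor(sample(c("a", "b", "c"), 200, replace = TRUE))
)

toy_rec <- recipe(y ~ ., data = toy) %>%
  step_log(area, base = 10) %>%      # common logarithm of area
  step_dummy(all_nominal()) %>%      # factor -> indicator columns
  step_normalize(all_predictors())   # center and scale every predictor

toy_prepped <- prep(toy_rec)                        # estimate means, sds, levels
toy_baked   <- bake(toy_prepped, new_data = NULL)   # processed training data

names(toy_baked)
```

Because the statistics are estimated once in `prep()`, baking a different data frame through `toy_prepped` reuses the training means and standard deviations rather than recomputing them.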
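  16. Model building

    1. Define the specification: choose a model suited to the problem, e.g. linear_reg(), rand_forest(), logistic_reg().
    2. Specify the engine (package): set_engine().
    3. Fit the model: fit() applies the training set, predict() predicts on the evaluation set.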
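As a rough illustration of the three steps, here is a sketch that uses the built-in `mtcars` data instead of the land price data; the row split and the formula `mpg ~ wt + hp` are arbitrary choices made for the example.

```r
library(tidymodels)

# Steps 1 and 2: model specification and engine
lm_spec <- linear_reg() %>% set_engine("lm")

# Step 3: fit on some rows, predict on the held-out rows
train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]
lm_fit <- fit(lm_spec, mpg ~ wt + hp, data = train)
preds  <- predict(lm_fit, new_data = test)

# The underlying object is an ordinary lm fit with the same coefficients
all.equal(coef(lm_fit$fit), coef(lm(mpg ~ wt + hp, data = train)))
```

`predict()` on a parsnip fit returns a tibble with a `.pred` column and one row per row of `new_data`, which is what the yardstick functions later expect.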
  17. Linear regression model

        lm_model <- linear_reg() %>%
          set_engine("lm")
        class(lm_model)
        #> [1] "linear_reg" "model_spec"

        lm_formula_fit <- lm_model %>%
          fit(log_lp ~ ., data = lp_train_prepped)

    This gives the same result as:

        lm(log_lp ~ ., data = lp_train_prepped)
  18. Changing the model and engine to apply

    Model-specific options can be specified.

        rf_model <- rand_forest(trees = 1000, mtry = 3) %>%
          set_mode("regression")
        gb_model <- boost_tree(trees = 1000, mtry = 3, tree_depth = 4) %>%
          set_mode("regression")

    Random forest:

        rf_model %>% set_engine("ranger")
        rf_model %>% set_engine("randomForest")

    Gradient boosting:

        gb_model %>% set_engine("xgboost")

    Step 3, fit() and predict(), follows as before.
  19. Evaluating model performance

    Use evaluation metrics that fit the task.
    Regression: coefficient of determination (R2, RSQ), root mean square error (RMSE), mean absolute error (MAE).
    Classification: confusion matrix, accuracy, precision and recall, ROC curve and AUC.
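  20. Evaluating model performance

    1. Define the specification: choose a model suited to the problem, e.g. linear_reg(), rand_forest(), logistic_reg().
    2. Specify the engine (package): set_engine().
    3. Fit the model: fit() applies the training set, predict() predicts on the evaluation set.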
  21. Evaluating model performance

    Specify a single performance metric or a combination of metrics.

        rmse(df_lm_model_predict, truth = log_lp, estimate = .pred)
        #>   .metric .estimator .estimate
        #> 1 rmse    standard       0.357
        rsq(df_lm_model_predict, truth = log_lp, estimate = .pred)
        #>   .metric .estimator .estimate
        #> 1 rsq     standard       0.595

    The RMSE of the linear regression model is the rmse row above (0.357).

        lp_metrics <- metric_set(rmse, rsq, mae)
        lp_metrics(df_lm_model_predict, truth = log_lp, estimate = .pred)
        #> # A tibble: 3 x 3
        #>   .metric .estimator .estimate
        #>   <chr>   <chr>          <dbl>
        #> 1 rmse    standard       0.357
        #> 2 rsq     standard       0.595
        #> 3 mae     standard       0.271
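To make the metric definitions concrete, a small hand-checkable sketch follows. The data frame `toy_pred` is invented for the example; only the yardstick calls mirror the slide.

```r
library(tidymodels)

# Hypothetical predictions: errors are 0.5, 0, -0.5, 0
toy_pred <- data.frame(
  truth = c(1, 2, 3, 4),
  .pred = c(1.5, 2, 2.5, 4)
)

toy_metrics <- metric_set(rmse, mae)
res <- toy_metrics(toy_pred, truth = truth, estimate = .pred)
res

# Hand check: MAE = (0.5 + 0 + 0.5 + 0) / 4 = 0.25
#             RMSE = sqrt((0.25 + 0 + 0.25 + 0) / 4) = sqrt(0.125)
```

RMSE is never smaller than MAE on the same predictions, and the gap widens when a few errors are much larger than the rest.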
  22. Workflow for the land price regression problem

        lp_wflow <- workflow() %>%
          add_recipe(init_lp_recipe) %>%
          add_model(lm_model) # linear regression model

        fit(lp_wflow, data = lp_train)

    Changing the recipe or the model, and specifying the data to apply, is easy.

        # random forest built with {parsnip}
        lp_wflow %>%
          update_model(rf_model) %>%
          fit(data = lp_train)

        # gradient boosting built with {xgboost}, applied to the evaluation set
        lp_wflow %>%
          update_model(gb_model) %>%
          fit(data = lp_test)
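A compact sketch of the same workflow pattern, using `mtcars` and a linear model so that it runs without extra engine packages. The variable names and the interaction term are choices made for this example only.

```r
library(tidymodels)

train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]

rec <- recipe(mpg ~ wt + hp, data = train) %>%
  step_normalize(all_predictors())

wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(linear_reg() %>% set_engine("lm"))

# Fitting the workflow preps the recipe and fits the model in one call
wf_fit   <- fit(wf, data = train)
wf_preds <- predict(wf_fit, new_data = test)

# Swap in an extended recipe without touching the model
rec2    <- rec %>% step_interact(~ wt:hp)
wf2_fit <- wf %>% update_recipe(rec2) %>% fit(data = train)
```

The raw `test` rows are passed to `predict()`; the workflow applies the trained recipe to them before calling the model.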
  23. Updating the recipe and the workflow

    Add processing to the first recipe.

        second_lp_recipe <- init_lp_recipe %>%
          step_interact( ~ acreage:starts_with("building_structure")) %>% # << interaction terms
          step_ns(.latitude, .longitude, deg_free = 20) # << the number of knots is arbitrary

        lp_wflow <- lp_wflow %>%
          update_recipe(second_lp_recipe)
        lp_fit <- fit(lp_wflow, lp_train)

    RMSE…
  24. Parameter search for the random forest

    1. Prepare 10 folds (v = 10) with k-fold cross-validation.

        set.seed(55)
        val_set <- vfold_cv(lp_train, v = 10)
        #> # 10-fold cross-validation
        #> # A tibble: 10 x 2
        #>    splits             id
        #>    <list>             <chr>
        #>  1 <split [5.7K/636]> Fold01
        #>  2 <split [5.7K/636]> Fold02
        #>  3 <split [5.7K/636]> Fold03
        #>  4 <split [5.7K/636]> Fold04
        #>  5 <split [5.7K/636]> Fold05
        #>  6 <split [5.7K/636]> Fold06
        #>  7 <split [5.7K/636]> Fold07
        #>  8 <split [5.7K/636]> Fold08
        #>  9 <split [5.7K/635]> Fold09
        #> 10 <split [5.7K/635]> Fold10

    2. Mark the parameters to be tuned with tune().

        cores <- parallel::detectCores()
        rf_wflow <- workflow() %>%
          add_model(
            rand_forest(mtry = tune(), trees = tune()) %>%
              set_engine("ranger", num.threads = cores) %>%
              set_mode("regression")) %>%
          add_recipe(second_lp_recipe)
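What a fold contains can be inspected directly. This sketch uses `mtcars` (32 rows) with 4 folds so the sizes are easy to verify; both choices are assumptions for illustration.

```r
library(tidymodels)

set.seed(55)
folds <- vfold_cv(mtcars, v = 4)

# Each split holds out one fold for assessment and keeps the rest for analysis
first <- folds$splits[[1]]
n_analysis   <- nrow(analysis(first))
n_assessment <- nrow(assessment(first))

c(analysis = n_analysis, assessment = n_assessment)
```

Every row appears in the assessment part of exactly one fold, so each observation is used for validation once across the resamples.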
  25. Searching with grid search

    3. Pass the validation set and specify the evaluation metric.

        set.seed(345)
        rf_res <- rf_wflow %>%
          tune_grid(val_set, # validation set
                    grid = 25,
                    control = control_grid(save_pred = TRUE),
                    metrics = metric_set(rmse)) # evaluation metric

        autoplot(rf_res)
  26. Parameters of the best model

    4. List the best-performing parameter sets.

        rf_best <- rf_res %>%
          show_best(metric = "rmse")
        #> # A tibble: 5 x 8
        #>    mtry trees .metric .estimator  mean     n std_err .config
        #>   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>
        #> 1    26  1283 rmse    standard   0.135     1      NA Model17
        #> 2    20   524 rmse    standard   0.135     1      NA Model19
        #> 3    23  1190 rmse    standard   0.135     1      NA Model08
        #> 4    29   239 rmse    standard   0.135     1      NA Model01
        #> 5    34  1118 rmse    standard   0.136     1      NA Model25
  27. Searching arbitrary parameters and ranges

        tune_lp_recipe <- init_lp_recipe %>%
          step_interact( ~ acreage:starts_with("building_structure")) %>%
          step_ns(.latitude, .longitude, deg_free = tune("coords")) # <<

        spline_res <- tune_grid(rf_model,
                                tune_lp_recipe,
                                resamples = lp_folds,
                                grid = expand.grid(coords = c(2, 5, 20, 200)))

        spline_res %>%
          show_best(metric = "rmse")
        #> # A tibble: 4 x 7
        #>   coords .metric .estimator  mean     n std_err .config
        #>    <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>
        #> 1      5 rmse    standard   0.160    10 0.00288 Recipe2
        #> …
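Tuning a recipe parameter over a hand-picked grid can be sketched on synthetic data as follows. The data frame `toy`, the id `"spline_df"`, the grid values, and the use of a linear model inside a workflow are all assumptions for this example; the slide itself tunes the spline degrees of freedom for a random forest.

```r
library(tidymodels)

set.seed(345)
# Hypothetical nonlinear data
toy <- data.frame(x = runif(200, 0, 10))
toy$y <- sin(toy$x) + rnorm(200, sd = 0.2)
toy_folds <- vfold_cv(toy, v = 5)

# Mark the spline degrees of freedom for tuning under a custom id
toy_rec <- recipe(y ~ x, data = toy) %>%
  step_ns(x, deg_free = tune("spline_df"))

toy_wf <- workflow() %>%
  add_recipe(toy_rec) %>%
  add_model(linear_reg() %>% set_engine("lm"))

# The grid column name must match the id given to tune()
toy_res <- tune_grid(
  toy_wf,
  resamples = toy_folds,
  grid = data.frame(spline_df = c(2L, 4L, 8L)),
  metrics = metric_set(rmse)
)

show_best(toy_res, metric = "rmse")
```

`collect_metrics(toy_res)` returns one row per grid value and metric, averaged over the folds, which is the table `show_best()` sorts.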
  28. The final model

    Apply the search results to the random forest parameters (the parameters of the best model).

        rf_model_tuned <- rand_forest(mtry = rf_best$mtry[1],
                                      trees = rf_best$trees[1]) %>%
          set_engine("ranger", num.threads = cores, importance = "impurity") %>%
          set_mode("regression")

        last_lp_recipe <- init_lp_recipe %>%
          step_interact( ~ acreage:starts_with("building_structure")) %>%
          step_ns(.latitude, .longitude, deg_free = 5)
  29. Supplying the training and evaluation sets to get the final result

    Specify the tuned recipe and model.

        rf_wflow_tuned <- rf_wflow %>%
          update_recipe(last_lp_recipe) %>%
          update_model(rf_model_tuned)

    Pass the dataset that was split at the start.

        last_rf_fit <- rf_wflow_tuned %>%
          last_fit(lp_split)

        last_rf_fit %>%
          collect_metrics()
        #> # A tibble: 2 x 3
        #>   .metric .estimator .estimate
        #>   <chr>   <chr>          <dbl>
        #> 1 rmse    standard       0.134
        #> 2 rsq     standard       0.943

    RMSE… Initial model…
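The last_fit() step can be tried in isolation with a small sketch. It uses `mtcars`, a formula preprocessor, and a linear model, all chosen here for convenience rather than taken from the slides.

```r
library(tidymodels)

set.seed(1)
car_split <- initial_split(mtcars, prop = 0.75)

car_wf <- workflow() %>%
  add_formula(mpg ~ wt + hp) %>%
  add_model(linear_reg() %>% set_engine("lm"))

# Fit on the training part of the split, evaluate once on the testing part
car_final   <- last_fit(car_wf, car_split)
car_metrics <- collect_metrics(car_final)
car_metrics
```

Because `last_fit()` takes the split object rather than separate data frames, the evaluation set is touched only in this single final step.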
  30. Summary

    {rsample}: splitting, resampling. initial_split(), vfold_cv(), initial_time_split(), nested_cv()
    {recipes}: data preprocessing, feature generation. recipe(), step_*(), prep(), bake()
    {parsnip}: model building and application. set_engine(), rand_forest(), boost_tree()
    {yardstick}: model performance evaluation. metrics(), rmse(), roc_auc()
    {dials}, {tune}: parameter search and tuning. tune(), grid_random()
    {workflows}: building workflows. workflow(), add_*(), update_*()