Rの機械学習フレームワークの紹介〜tidymodelsを中心に〜 / machine_learning_with_r2024

ӝੜਅ໵ʢಙౡେֶσβΠϯܕ"*ڭҭݚڀηϯλʔʣ ɹϢʔβͷͨΊͷ ػցֶशνϡʔτϦΞϧ ୈҰ෦ɿ3ͷػցֶशϑϨʔϜϫʔΫͷ঺հʙUJEZNPEFMTΛத৺ʹʙ ೔ຊ౷ܭֶձୈճय़قूձاըηογϣϯ".#  https://github.com/uribo/240309_jss18tutorial  𝕏 u_ribo

ࣗݾ঺հ ಙౡେֶσβΠϯܕ"*ڭҭݚڀηϯλʔɾॿڭ ӝੜਅ໵ ͏ΓΎ͏͠Μ΍ ؔ৺ɾݚڀ σʔλՄࢹԽɺۭؒӸֶɺਓྲྀɺ໺ੜੜ෺ ʮ"*૯߹ԋशʯɺʮ৘ใՊֶೖ໳ʯɺ ʮσʔλαΠΤϯε΁ͷ༠͍ʯ ୲౰ߨٛ KQNFTI
[JQBOHV KNBTUBUT GHESͳͲ ύοέʔδ ։ൃ ࣥචɾ຋༁ ѪݘʮϥοΩʔʯͱࢲ

νϡʔτϦΞϧͷ಺༰ ୈҰ෦ʢ෼ʣ ୈೋ෦ʢ෼ʣ ػցֶशͷϫʔΫϑϩʔʹ͍ͭͯ ػցֶशϑϨʔϜϫʔΫͷಋೖ 3ʹΑΔػցֶशͷ࣮ԋ ٳܜʢ෼ʣ ۭؒґଘੑ΁ͷ഑ྀ Ϟσϧͷۭؒతͳద༻ൣғ ஍ཧۭؒσʔλͷಛ௃
஍ཧۭؒσʔλʷػցֶशϞσϧ ஍ཧۭؒσʔλΛػցֶशʹ ద༻͢Δࡍͷ՝୊

໨ඪ ɽ3ݴޠͰ͞·͟·ͳػցֶशख๏ͷద༻ɾϞσϧߏஙલޙͷૢ࡞Λߦ͏ ύοέʔδ͕ଘࡏ͢Δ͜ͱΛ஌Δ ͜ΕΒͷύοέʔδΛ౷ҰతͳΠϯλʔϑΣΠεͰ·ͱΊͨϑϨʔϜϫʔΫͱɺ ɹػցֶशϞσϧͷ޻ఔͷͭͳ͕ΓΛֶͿ ஍ཧۭؒσʔλΛػցֶशϞσϧͰѻ͏ࡍͷ஫ҙ఺Λཧղ͠ɺ ɹͲͷΑ͏ͳํ๏Ͱରࡦ͕Մೳ͔Λݕ౼Ͱ͖ΔΑ͏ʹͳΔ νϡʔτϦΞϧऴྃޙɺUJEZNPEFMTΛ༻͍ͯɺ ɹࣗ਎ͷσʔλ΍ద༻͢ΔϞσϧʹԠͨ͡ػցֶशϞσϧΛߏஙํ๏Λ਎ʹ͚ͭΔ

ѻΘͳ͍಺༰ ౷ܭతख๏ɺϞσϧͷ਺ཧతഎܠʹ͍ͭͯ͸লུˠࢀՃऀͷํʑͷ΄͏͕ৄ͍͠ 3ݴޠͷૢ࡞ˠύΠϓԋࢉࢠʢcʣΞϨϧΪʔ͕ͳ͍͜ͱ ඇߏ଄ԽσʔλΛ༻͍ͨػցֶशɺਂ૚ֶशˠදܗࣜͷσʔλΛ૝ఆ νϡʔτϦΞϧͰͷର৅ൣғ ਓ޻஌ೳɺػցֶशɺਂ૚ֶशͷҐஔ͚ͮ ਓ޻஌ೳ ػցֶश ΤΩεύʔτ γεςϜͳͲ
ਂ૚ֶश ݹ͖ྑ͖"*   

ࢿྉஔ͖৔ ຊνϡʔτϦΞϧͰ༻͍Διʔείʔυ΍εϥΠυΛԼه63-Ͱެ։͍ͯ͠·͢   νϡʔτϦΞϧʺϫʔΫγϣοϓ 3ίʔυͷ࣮ߦ͸֤ࣗͰߦ͍ͬͯͩ͘͞ ࠶ݱՄೳͳ؀ڥΛఏڙ͠·͕͢ɺ׬શͰ͸ͳ͍͔΋͠Ε·ͤΜ ޙ೔*TTVFT౳Ͱ࿈བྷ͍͚ͨͩͨΒରԠ͠·͢ https://github.com/uribo/240309_jss18tutorial

ࢿྉஔ͖৔  ᶅ@KTTUVUPSJBM3QSPK ɹɹɹɹɹɹɹΛ34UVEJPͰ։͘

ػցֶशͷ ϫʔΫϑϩʔ

σʔλ෼ੳશൠͷϫʔΫϑϩʔ (BSSFUUBOE)BEMFZ Λ΋ͱʹ࡞੒

σʔλ෼ੳͷϫʔΫϑϩʔ (BSSFUUBOE)BEMFZ Λ΋ͱʹ࡞੒ ͜ͷνϡʔτϦΞϧͰ͸࿮ઢ಺ʹয఺Λ౰ͯ·͢

ػցֶशϞσϦϯάͷϫʔΫϑϩʔ .BY,VIO BOE,KFMM+PIOTPO Λ΋ͱʹ࡞੒ ෳ਺ͷ޻ఔΛ൓෮తʹߦ͏ͷ͕ಛ௃ ֤޻ఔͰҙࢥܾఆ͕ٻΊΒΕΔ ྫʣͲͷΑ͏ʹಛ௃ྔΤϯδχΞϦϯάΛߦ͏ʁϞσϧ͸ԿΛར༻͢Δʁ

ػցֶश ϑϨʔϜϫʔΫ

ϑϨʔϜϫʔΫΛར༻͢Δ΍൱΍ ౷Ұ͞Ε͍ͯͳ͍ΠϯλʔϑΣΠεͷֶशίετ͕ߴ͍ʜ ར༻͠ͳ͍ɾͰ͖ͳ͍ ֤޻ఔͷ݁ՌΛγʔϜϨεʹҠߦͰ͖Δ͔͸Θ͔Βͳ͍ ར༻͢Δ ΠϯλʔϑΣΠεͷֶशίετ͕Լ͕Δ   ݪଇతʹɺҰͭͷ޻ఔͰͭͷύοέʔδ ಉҰͷ޻ఔ಺Ͱ͋ͬͯ΋ҟͳΔύοέʔδΛ༻͍Δ͜ͱʹͳΔ
  ॊೈͳॲཧ͕Մೳ  ޻ఔؒͰੜͨ͡ΦϒδΣΫτʢ݁Ռʣͷ΍ΓऔΓɺ࿈ܞ͕༰қ ύοέʔδʹΑΓఏڙ͞ΕΔϞσϧ͕ରԠ͍ͯ͠Δͱ͸ݶΒͳ͍    ʢࢀߟʣ1ZUIPOʹ͓͚Δ4DJLJU-FBSO

3Ͱͷ.-ϑϨʔϜϫʔΫᶃUJEZNPEFMT ಛ௃ 1PTJUϝϯόʔΛؚΊͨUJEZNPEFMTνʔϜ͕։ൃʢʙʣ https://www.tidymodels.org/ UJEZWFSTFͷ఩ֶɺઃܭࢥ૝Λ౿ऻˠύΠϓԋࢉࢠϑϨϯυϦʔ DBSFUύοέʔδͷ։ൃऀ.BY,VIO͕Ϧʔυ UJEZNPEFMTύοέʔδͱͯ͠·ͱΊͯఏڙ ύοέʔδΠϯετʔϧ࣌ʹґଘύοέʔδ΋Πϯετʔϧ ϦαϯϓϦϯάɺֶशثɺಛ௃ྔΤϯδχΞϦϯάͳͲͷػೳΛ֦ு͠΍͍͢

3Ͱͷ.-ϑϨʔϜϫʔΫᶃUJEZNPEFMT ಛ௃ 1PTJUϝϯόʔΛؚΊͨUJEZNPEFMTνʔϜ͕։ൃʢʙʣ https://www.tidymodels.org/ UJEZWFSTFͷ఩ֶɺઃܭࢥ૝Λ౿ऻˠύΠϓԋࢉࢠϑϨϯυϦʔ DBSFUύοέʔδͷ։ൃऀ.BY,VIO͕Ϧʔυ UJEZNPEFMTύοέʔδͱͯ͠·ͱΊͯఏڙ ύοέʔδΠϯετʔϧ࣌ʹґଘύοέʔδ΋Πϯετʔϧ ϦαϯϓϦϯάɺֶशثɺಛ௃ྔΤϯδχΞϦϯάͷͨΊͷػೳΛ֦ு͠΍͍͢ QBSTOJQͷ֦ு
\DFOTPSFE^ʜੜଘ࣌ؒղੳ \QPJTTPOSFH^ʜϙΞιϯճؼ \UJEZDMVTU^ΫϥελϦϯά \CPOTBJ^ʜ໦Ϟσϧ SFDJQFTͷ֦ு \FNCFE^ʜຒΊࠐΈදݱ \UFYUSFDJQFT^ʜจࣈྻૢ࡞ \UIFNJT^ʜෆۉߧσʔλ \VSJCPXBTIPLV^ʜ೔ຊޠจষ ϞσϧͷσϓϩΠ ˠ\CFUJWFS^

3Ͱͷ.-ϑϨʔϜϫʔΫᶃUJEZNPEFMT \SFDJQFT^ \STBNQMF^ σʔλલॲཧɺ ಛ௃ྔΤϯδχΞϦϯά σʔλ෼ׂɺ ϦαϯϓϦϯά \QBSTOJQ^ Ϟσϧߏஙɾద༻ \ZBSETUJDL^
\XPSL fl PXT^ ϞσϧͷੑೳධՁ ֤޻ఔͷϫʔΫϑϩʔԽɺ ޮ཰తͳॲཧ \UVOF^ ύϥϝʔλ୳ࡧ ओͳύοέʔδͱͦͷ༻్

3Ͱͷ.-ϑϨʔϜϫʔΫᶄNMS https://mlr-org.com/ NMSύοέʔδΛத৺ͱͯ͠ɺػցֶशͷ޻ఔʹඞཁͳύοέʔδ܈Λ NMSWFSTFύοέʔδͱͯ͠·ͱΊͯఏڙ UBTL MFBSOFS SFTBNQMFS UVOFS QFSGPSNBODF υΠπΛத৺ʹ׆ಈ͢ΔNMSPSH͕։ൃ
3ΫϥεͰ࣮૷ ಛ௃ σʔλΛEBUBUBCMFͱͯ͠ѻ͏ NMSͷ֦ுɻ஍ཧۭؒσʔλΛѻ͏ͨΊͷύοέʔδ΋ؚ·ΕΔ ஍ཧۭؒσʔλؔ܎ͷ ॲཧʹରͯ͠ɺUJEZNPEFMTΑΓ΋ ஫ྗ͍ͯ͠Δʁʢӝੜࡶײʣ

3Ͱͷ.-ϑϨʔϜϫʔΫᶅ%"-&9 આ໌ՄೳͳػցֶशϞσϧͷͨΊʹઃܭ͞Εͨ ύοέʔδʢ୯ମʣ͓Αͼύοέʔδ܈Λࢦ͢ ϫϧγϟϫ޻Պେֶʢϙʔϥϯυʣ಺.*"*ʹΑΔ։ൃ https://dalex.drwhy.ai/ UJEZNPEFMTɺNMSͰߏங͞Εͨ ϞσϧΛѻ͏֦ுͱͯ͠ %"-&9FYUSBύοέʔδ͕ఏڙ͞ΕΔ 1ZUIPO൛΋੔උ͞Ε͍ͯΔ ಛ௃


͜͜·Ͱͷཁ఺ σʔλ෼ੳɺػցֶशͷ޻ఔ͸ଟذʹΘͨΓɺͦͷଟ͘͸൓෮తʹߦΘΕΔ ˠҰ࿈ͷखଓ͖Λԁ׈ʹ͢ΔͨΊͷϫʔΫϑϩʔͷར༻͕๬·͍͠ 3ʹ͓͚ΔػցֶशϑϨʔϜϫʔΫͱͯ͠ɺUJEZWFSTF΍NMS͕͋Δɻ ·ͨϞσϧͷղऍΛॿ͚ΔͨΊͷ%"-&9΋ར༻Ͱ͖Δ

3ʹΑΔ ػցֶशͷ࣮ԋ

Ψεڙڅঢ়گͷೋ஋෼ྨϞσϧͷߏங σʔλʲࠃ౔਺஋৘ใʳ౎ಓ෎ݝ஍Ձௐࠪσʔλʢ೥ʣ https://nlftp.mlit.go.jp/ksj/gml/datalist/KsjTmplt-L02-2020.html ࠃ౔ར༻ܭը๏ࢪߦྩʹج͖ͮɺ֤౎ಓ෎ݝ ஌ࣄʹΑͬͯߦΘΕ֤ͨ೥݄೔࣌఺ͷ஍Ձ ௐࠪ݁Ռʹ͍ͭͯɺͦͷج४஍ͷҐஔʢ఺ʣɺ ௐࠪՁ֨ɺར༻ݱگɺ༻్஍Ҭɺ஍ੵ౳Λ੔උ ͨ͠΋ͷ ˠؔ౦஍ํʢ౎ݝʣͷ σʔλΛՃ޻ͯ͠࢖͍·͢
σʔλߏ଄ MQ MBOEQSJDF TVQQMZ

Ψεڙڅঢ়گͷೋ஋෼ྨϞσϧͷߏங ஍ཧۭؒଐੑΛআ͍͍ͯ·͢ɻ͜ΕΒ͸ୈೋ෦Ͱѻ͍·͢ dplyr::glimpse(lp_supply) #> Rows: 1,812 #> Columns: 8 #>
$ price <int> 115000, 61400, 57200, 51600, 177000, 208000, 57600, 68600… #> $ water <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU… #> $ gas <fct> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TR… #> $ sewer <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU… #> $ above_floor <int> 8, 0, 3, 3, 5, 12, 3, 3, 3, 2, 6, 2, 7, 2, 2, 3, 2, 3, 3,… #> $ under_floor <int> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, … #> $ dist_from_st <int> 1300, 900, 270, 2100, 150, 0, 150, 850, 1200, 400, 640, 1… #> $ fire <chr> "๷Ր", "४๷", "४๷", "४๷", "๷Ր", "๷Ր", "४๷", "…

Ψεڙڅঢ়گͷೋ஋෼ྨϞσϧͷߏங dplyr::glimpse(lp_supply) #> Rows: 1,812 #> Columns: 8 #> $
price <int> #> $ water <lgl> #> $ gas <fct> #> $ sewer <lgl> #> $ above_floor <int> #> $ under_floor <int> #> $ dist_from_st <int> #> $ fire <chr> 水水火

୳ࡧతσʔλ෼ੳ σʔλͷಛ௃ͷ೺ѲɺཧղΛਂΊΔ Ϟσϧͷߏ૝Λ࿅Δ

library(tidymodels) #> ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ── #>
✔ broom 1.0.5 ✔ recipes 1.0.10 #> ✔ dials 1.2.1 ✔ rsample 1.2.0 #> ✔ dplyr 1.1.4 ✔ tibble 3.2.1 #> ✔ ggplot2 3.5.0 ✔ tidyr 1.3.1 #> ✔ infer 1.0.6 ✔ tune 1.1.2 #> ✔ modeldata 1.3.0 ✔ workflows 1.1.4 #> ✔ parsnip 1.2.0 ✔ workflowsets 1.0.1 #> ✔ purrr 1.0.2 ✔ yardstick 1.3.0 #> #> ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ── #> ✖ purrr::discard() masks scales::discard() #> ✖ dplyr::filter() masks stats::filter() #> ✖ dplyr::lag() masks stats::lag() #> ✖ recipes::step() masks stats::step() #> • Dig deeper into tidy modeling with R at https://www.tmwr.org UJEZNPEFMTύοέʔδͷಡΈࠐΈ

library(tidymodels) #> ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ── #>
✔ broom 1.0.5 ✔ recipes 1.0.10 #> ✔ dials 1.2.1 ✔ rsample 1.2.0 #> ✔ dplyr 1.1.4 ✔ tibble 3.2.1 #> ✔ ggplot2 3.5.0 ✔ tidyr 1.3.1 #> ✔ infer 1.0.6 ✔ tune 1.1.2 #> ✔ modeldata 1.3.0 ✔ workflows 1.1.4 #> ✔ parsnip 1.2.0 ✔ workflowsets 1.0.1 #> ✔ purrr 1.0.2 ✔ yardstick 1.3.0 #> #> ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ── #> ✖ purrr::discard() masks scales::discard() #> ✖ dplyr::filter() masks stats::filter() #> ✖ dplyr::lag() masks stats::lag() #> ✖ recipes::step() masks stats::step() #> • Dig deeper into tidy modeling with R at https://www.tmwr.org tidymodels package::function UJEZNPEFMTύοέʔδͷಡΈࠐΈ

ؔ਺໊ͷিಥ ύοέʔδؒͰಉؔ͡਺໊͕࢖ΘΕ͍ͯΔͱɺʮিಥʯ͕ൃੜ͢Δ ͋ͱ͔ΒಡΈࠐ·Εͨؔ਺໊͕༏ઌ͞ΕΔ ྫʣEQMZSMBH ͱTUBUTMBH ಉؔ͡਺໊Ͱ͋ͬͯ΋ػೳ͸ύοέʔδؒͰҟͳΔͨΊɺ ҙਤ͠ͳ͍݁ՌΛ΋ͨΒ͢ɺૢ࡞͕ߦΘΕͳ͍͓ͦΕ͕͋Δ lag(ldeaths, 1) #>
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec #> 1973 3035 #> 1974 2552 2704 2554 2014 1655 1721 1524 1596 2074 2199 2512 2933 #> 1975 2889 2938 2497 1870 1726 1607 1545 1396 1787 2076 2837 2787 #> 1976 3891 3179 2011 1636 1580 1489 1300 1356 1653 2013 2823 3102 #> 1977 2294 2385 2444 1748 1554 1498 1361 1346 1564 1640 2293 2815 #> 1978 3137 2679 1969 1870 1633 1529 1366 1357 1570 1535 2491 3084 #> 1979 2605 2573 2143 1693 1504 1461 1354 1333 1492 1781 1915 library(dplyr) lag(ldeaths, 1) #> Error in `lag()`: #> ! `x` must be a vector, not a <ts>, do you want `stats::lag()`?  

  library(tidymodels) tidymodels_prefer() library(conflicted) # ؔ਺ͷিಥ͕ൃੜ͍ͯ͠Δͱɺ ɹɹɹɹɹɹɹɹɹɹɹɹ# ༏ઌॱҐΛࢦఆ͢ΔΑ͏ʹଅ͢ϝοηʔδΛग़ྗͯ͘͠ΕΔ conflict_scout()
# ݱࡏͷিಥঢ়گΛ֬ೝʢtidymodels಺Ͱͷڝ߹ʣ #> 4 conflicts #> • `discard()`: purrr and scales #> • `filter()`: dplyr #> • `lag()`: dplyr and stats #> • `step()`: recipes UJEZNPEFMT@QSFGFS Λ࣮ߦ͢ΔͱɺUJEZNPEFMT͕ґଘ͢Δ ύοέʔδͷؔ਺͕༏ઌతʹར༻͞ΕΔΑ͏ʹ͢Δ ؔ਺໊ͷিಥΛճආ͢Δʹ͸ ໊લۭؒΛࢦఆͯؔ͠਺Λར༻͢ΔˠQBDLBHFGVODJUPO ͷܗࣜ DPO fl JDUFEύοέʔδͰ໌ࣔతʹ༏ઌ͢Δؔ਺Λࢦఆ͢Δ 行

ֶशσʔλ ςετσʔλ set.seed(123) lp_split <- initial_split(lp_supply, prop = 0.8, strata
= gas) lp_split #> <Training/Testing/Total> #> <1449/363/1812> lp_train <- training(lp_split) lp_test <- testing(lp_split) \STBNQMF^σʔλ෼ׂ

ֶशσʔλ ςετσʔλ set.seed(123) lp_split <- initial_split(lp_supply, prop = 0.8, strata
= gas) lp_split #> <Training/Testing/Total> #> <1449/363/1812> lp_train <- training(lp_split) lp_test <- testing(lp_split) 比 \STBNQMF^σʔλ෼ׂ

\QBSTOJQ^ֶशثͷߏங Ϟσϧͷબ୒ Τϯδϯʢ˺ύοέʔδʣͷࢦఆ ϞʔυʢλεΫʣͷࢦఆ

\QBSTOJQ^ֶशثͷߏங Ϟσϧͷબ୒ ˠܾఆ໦ Τϯδϯͷࢦఆ Ϟʔυͷࢦఆ logistic_reg() #>
Logistic Regression Model Specification (classification) #> #> Computational engine: glm 自 classi fi cation linear_reg() #> Linear Regression Model Specification (regression) decision_tree() #> Decision Tree Model Specification (unknown mode) #> #> Computational engine: rpart 示

\QBSTOJQ^ֶशثͷߏங Ϟσϧͷબ୒ ˠܾఆ໦ Τϯδϯͷࢦఆ ˠSQBSUύοέʔδ Ϟʔυͷࢦఆ 1
文用 show_engines("decision_tree") #> # A tibble: 5 × 2 #> engine mode #> <chr> <chr> #> 1 rpart classification #> 2 rpart regression #> 3 C5.0 classification #> 4 spark classification #> 5 spark regression set_engine("rpart") #> Decision Tree Model Specification (unknown mode) #> #> Computational engine: rpart decision_tree() |>

tree_spec <- set_mode("classification") tree_spec #> Decision Tree Model Specification (classification)
#> #> Main Arguments: #> cost_complexity = 0.002 #> #> Computational engine: rpart \QBSTOJQ^ֶशثͷߏங Ϟσϧͷબ୒ ˠܾఆ໦ Τϯδϯͷࢦఆ ˠSQBSUύοέʔδ Ϟʔυͷࢦఆ ˠ෼ྨ໰୊ decision_tree(cost_complexity = 0.002) |> set_engine("rpart") |> classi fi cation regression

\QBSTOJQ^ֶशثͷߏங ranger::ranger(x, y, mtry = min_cols(~10, x), num.trees = 2000,
...) ྫʣϥϯμϜϑΥϨετͰͷܾఆ໦࡞੒࣌ͷಛ௃ྔ਺ NUSZ ɺܾఆ໦ͷ਺ OUSFF Λ ɹɹἧ͑ɺҟͳΔύοέʔδͰֶशثΛ࣮૷͢Δ randomForest::randomForest(x, y, mtry = min_cols(~10, x), ntree = 2000) sparklyr::ml_random_forest(x, formula, type = "regression", feature_subset_strategy = "10", num_trees = 2000, ...) ύοέʔδؒͰύϥϝʔλͷࢦఆํ๏͕ҟͳΔʜ

rand_forest(mtry = 10, trees = 2000) |> set_engine("ranger") |> set_mode("regression")
|> translate() #> Random Forest Model Specification (regression) #> #> Main Arguments: #> mtry = 10 #> trees = 2000 #> #> Computational engine: ranger #> #> Model fit template: #> ranger::ranger(x = missing_arg(), y = missing_arg(), weights = missing_arg(), #> mtry = min_cols(~10, x), num.trees = 2000, num.threads = 1, #> verbose = FALSE, seed = sample.int(10^5, 1)) \QBSTOJQ^ֶशثͷߏங ΤϯδϯʹԠͯ͡ҟͳΔύϥϝʔλͷࢦఆํ๏ΛϞσϧؔ਺಺Ͱ౷Ұ ΤϯδϯΛม͑Δ͚ͩͰύϥϝʔλͷࢦఆํ๏ʹؾΛݣΘͳ͍Ͱ͢Ή

͜͜·Ͱͷཁ఺ σʔλ෼ੳɺػցֶशͷ޻ఔ͸ଟذʹΘͨΓɺͦͷଟ͘͸൓෮తʹߦΘΕΔ ˠҰ࿈ͷखଓ͖Λԁ׈ʹ͢ΔͨΊͷϫʔΫϑϩʔͷར༻͕๬·͍͠ 3ʹ͓͚ΔػցֶशϑϨʔϜϫʔΫͱͯ͠ɺUJEZWFSTF΍NMS͕͋Δɻ ·ͨϞσϧͷղऍΛॿ͚ΔͨΊͷ%"-&9΋ར༻Ͱ͖Δ UJEZNPEFMTύοέʔδΛར༻͢Δ͜ͱͰɺػցֶशϞσϦϯάͰߦ͏޻ఔΛ ౷ҰతΠϯλʔϑΣΠεͷ΋ͱͰਐΊΒΕΔ σʔλ෼ׂˠ\STBNQMF^ ֶशثͷߏஙˠ\QBSTOJQ^

\XPSL fl PXT^ϫʔΫϑϩʔԽ ͜͜·Ͱͷ಺༰Λ·ͱΊΔͱʜ tree_wflow <- workflow(gas ~ ., spec
= tree_spec) tree_wflow <- workflow() |> add_formula(gas ~ .) |> add_model(tree_spec) ͋Δ͍͸

tree_wflow <- workflow(gas ~ ., spec = tree_spec) tree_wflow <-
workflow() |> add_formula(gas ~ .) |> add_model(tree_spec) ͋Δ͍͸ \XPSL fl PXT^ϫʔΫϑϩʔԽ 目 tree_fit <- tree_wflow |> fit(data = lp_train) predict(tree_fit, new_data = lp_test) ςετσʔλ Λ༻ֶ͍ͨश ֶशσʔλ ʹର͢Δ༧ଌ ϫʔΫϑϩʔԽʹΑΓֶशثͷมߋ͕༰қͱͳΔ lp_tree_augment <- augment(tree_fit, new_data = lp_train) ΁ͷ༧ଌ݁Ռͷద༻ ֶशσʔλ

tree_fit #> ══ Workflow [trained] ══════════════════════════════════════════════════════════ #> Preprocessor: Formula #>
Model: decision_tree() #> #> ── Preprocessor ──────────────────────────────────────────────────────────────── #> gas ~ . #> #> ── Model ─────────────────────────────────────────────────────────────────────── #> n= 1449 #> #> node), split, n, loss, yval, (yprob) #> * denotes terminal node #> #> 1) root 1449 122 TRUE (0.08419600 0.91580400) #> 2) price< 124500 181 78 TRUE (0.43093923 0.56906077) #> 4) dist_from_st>=5550 10 0 FALSE (1.00000000 0.00000000) * #> 5) dist_from_st< 5550 171 68 TRUE (0.39766082 0.60233918) #> … ʢলུʣ #> 3) price>=124500 1268 44 TRUE (0.03470032 0.96529968) * \XPSL fl PXT^ϫʔΫϑϩʔԽ Λ༻ֶ͍ͨश ֶशσʔλ

\XPSL fl PXT^ϫʔΫϑϩʔԽ lp_tree_augment |> # ֶशσʔλʹϞσϧ༧ଌ݁Ռ͕ྻͱͯ͠༩͑ΒΕ͍ͯΔ # ֬ೝͷͨΊʹҰ෦ͷྻ͚ͩΛදࣔ select(starts_with(".pred_"),
gas) #> # A tibble: 1,449 × 4 #> .pred_class .pred_FALSE .pred_TRUE gas #> <fct> <dbl> <dbl> <fct> #> 1 TRUE 0 1 TRUE #> 2 TRUE 0.0833 0.917 TRUE #> 3 TRUE 0.25 0.75 TRUE #> 4 TRUE 0.0347 0.965 TRUE #> 5 TRUE 0.0347 0.965 TRUE #> 6 TRUE 0.0833 0.917 TRUE #> 7 TRUE 0.0833 0.917 TRUE #> 8 FALSE 0.889 0.111 TRUE #> 9 TRUE 0 1 TRUE #> 10 TRUE 0.0833 0.917 TRUE #> # ℹ 1,439 more rows ΁ͷ༧ଌ݁Ռͷద༻ ֶशσʔλ

#> # A tibble: 1,449 × 4 #> .pred_class .pred_FALSE
.pred_TRUE gas #> <fct> <dbl> <dbl> <fct> #> 1 TRUE 0 1 TRUE #> 2 TRUE 0.0833 0.917 TRUE #> 3 TRUE 0.25 0.75 TRUE #> 4 TRUE 0.0347 0.965 TRUE #> 5 TRUE 0.0347 0.965 TRUE #> 6 TRUE 0.0833 0.917 TRUE #> 7 TRUE 0.0833 0.917 TRUE #> 8 FALSE 0.889 0.111 TRUE #> 9 TRUE 0 1 TRUE #> 10 TRUE 0.0833 0.917 TRUE #> # ℹ 1,439 more rows \XPSL fl PXT^ϫʔΫϑϩʔԽ lp_tree_augment |> # ֶशσʔλʹϞσϧ༧ଌ݁Ռ͕ྻͱͯ͠༩͑ΒΕ͍ͯΔ # ֬ೝͷͨΊʹҰ෦ͷྻ͚ͩΛදࣔ select(starts_with(".pred_"), gas) ΁ͷ༧ଌ݁Ռͷద༻ ֶशσʔλ

ߏஙͨ͠Ϟσϧͷղऍʙܾఆ໦ͷදࣔʙ tree_fit |> extract_fit_engine() |> rpart.plot::rpart.plot(roundint = FALSE)

\WJQ^ \%"-&9^ߏஙͨ͠Ϟσϧͷղऍ tree_exp <- DALEXtra::explain_tidymodels(tree_fit, data = lp_test, y =
as.numeric(lp_test$gas)-1, label = "lp_tree") ·ͨ͸ DALEX::explain(tree_fit, data = lp_test, y = as.numeric(lp_test$gas)-1, label = "lp_tree") \%"-&9^Ͱ͸UJEZNPEFMTͰ࡞੒ͨ͠ϞσϧΛѻ͑Δ

ߏஙͨ͠Ϟσϧͷղऍʙ(MPCBMʙ tree_fit |> extract_fit_engine() |> vip::vip() tree_effect <- model_parts(tree_exp, type
= "variable_importance") plot(tree_effect, show_boxplots = TRUE) ม਺ॏཁ౓

ߏஙͨ͠Ϟσϧͷղऍʙ-PDBMʙ ୯Ұͷ؍ଌʹରͯ͠Ϟσϧ͕ͲͷΑ͏ʹಇ͔͘Λ֬ೝ predict(tree_fit, new_data = lp_test[1, ]) #> # A
tibble: 1 × 1 #> .pred_class #> <fct> #> 1 FALSE predict(tree_exp, lp_test[1, ]) #> TRUE #>0.4736842 predict_parts( tree_exp, new_observation = lp_test[1, ]) |> plot()

\ZBSETUJDL^ϞσϧͷੑೳධՁ autoplot("heatmap") lp_tree_augment |> conf_mat(truth = gas, ɹɹɹɹɹɹ estimate =
.pred_class) |> lp_tree_augment |> conf_mat(truth = gas, estimate = .pred_class) #> Truth #> Prediction FALSE TRUE #> FALSE 69 32 #> TRUE 53 1295 ֶशσʔλ ͱϞσϧͷ༧ଌ஋ʹର͢Δࠞಉߦྻ

\ZBSETUJDL^ϞσϧͷੑೳධՁ lp_tree_augment |> accuracy(truth = gas, estimate = .pred_class) #>
# A tibble: 1 × 3 #> .metric .estimator .estimate #> <chr> <chr> <dbl> #> 1 accuracy binary 0.941 lp_tree_augment |> sensitivity(truth = gas, estimate = .pred_class) #> # A tibble: 1 × 3 #> .metric .estimator .estimate #> <chr> <chr> <dbl> #> 1 sensitivity binary 0.566 lp_tree_augment |> specificity(truth = gas, estimate = .pred_class) #> # A tibble: 1 × 3 #> .metric .estimator .estimate #> <chr> <chr> <dbl> #> 1 specificity binary 0.976 ֶशσʔλ ͱϞσϧͷ༧ଌ஋ʹର͢Δ͞·͟·ͳධՁࢦඪͷࢉग़

lp_tree_augment |> accuracy(truth = gas, estimate = .pred_class) #> #
A tibble: 1 × 3 #> .metric .estimator .estimate #> <chr> <chr> <dbl> #> 1 accuracy binary 0.941 lp_tree_augment |> sensitivity(truth = gas, estimate = .pred_class) #> # A tibble: 1 × 3 #> .metric .estimator .estimate #> <chr> <chr> <dbl> #> 1 sensitivity binary 0.566 lp_tree_augment |> specificity(truth = gas, estimate = .pred_class) #> # A tibble: 1 × 3 #> .metric .estimator .estimate #> <chr> <chr> <dbl> #> 1 specificity binary 0.976 \ZBSETUJDL^ϞσϧͷੑೳධՁ 用 accuracy sensitivity specificity

lp_tree_augment |> accuracy(truth = gas, estimate = .pred_class) lp_tree_augment |>
sensitivity(truth = gas, estimate = .pred_class) lp_tree_augment |> specificity(truth = gas, estimate = .pred_class) \ZBSETUJDL^ϞσϧͷੑೳධՁ 用 lp_metrics <- metric_set(accuracy, sensitivity, specificity) lp_tree_augment |> lp_metrics(truth = gas, estimate = .pred_class) #> # A tibble: 3 × 3 #> .metric .estimator .estimate #> <chr> <chr> <dbl> #> 1 accuracy binary 0.941 #> 2 sensitivity binary 0.566 #> 3 specificity binary 0.976

\ZBSETUJDL^ϞσϧͷੑೳධՁ ͱϞσϧͷ༧ଌ஋ʹର͢Δ͞·͟·ͳධՁࢦඪͷࢉग़ ςετσʔλ ͓͞Β͍ʜϞσϧͷ༧ଌ݁Ռʢϥϕϧɺϥϕϧʹର͢Δ֬཰ʣΛ σʔλϑϨʔϜͷσʔλͷྻͱͯ͠௥Ճ͢Δؔ਺͸ʁ

\ZBSETUJDL^ϞσϧͷੑೳධՁ ͱϞσϧͷ༧ଌ஋ʹର͢Δ͞·͟·ͳධՁࢦඪͷࢉग़ ςετσʔλ ͓͞Β͍ʜϞσϧͷ༧ଌ݁Ռʢϥϕϧɺϥϕϧʹର͢Δ֬཰ʣΛ σʔλϑϨʔϜͷσʔλͷྻͱͯ͠௥Ճ͢Δؔ਺͸ʁ tree_fit |> augment(new_data = lp_test)
|> lp_metrics(truth = gas, estimate = .pred_class) #> # A tibble: 3 × 3 #> .metric .estimator .estimate #> <chr> <chr> <dbl> #> 1 accuracy binary 0.923 #> 2 sensitivity binary 0.417 #> 3 specificity binary 0.959

\ZBSETUJDL^ϞσϧͷੑೳධՁ ςετσʔλ tree_fit |> augment(new_data = lp_test) |> lp_metrics(truth =
gas, estimate = .pred_class) #> # A tibble: 3 × 3 #> .metric .estimator .estimate #> <chr> <chr> <dbl> #> 1 accuracy binary 0.923 #> 2 sensitivity binary 0.417 #> 3 specificity binary 0.959 lp_tree_augment |> lp_metrics(truth = gas, estimate = .pred_class) #> # A tibble: 3 × 3 #> .metric .estimator .estimate #> <chr> <chr> <dbl> #> 1 accuracy binary 0.941 #> 2 sensitivity binary 0.566 #> 3 specificity binary 0.976 ֶशσʔλ

\%"-&9^ʹΑΔϞσϧͷੑೳධՁ ςετσʔλ ʹର͢ΔධՁࢦඪͷࢉग़ DALEX::model_performance(tree_exp) #> Measures for: classification #> recall
: 0.9587021 #> precision : 0.9587021 #> f1 : 0.9587021 #> accuracy : 0.922865 #> auc : 0.8038963 #> #> Residuals: #> 0% 10% 20% 30% 40% 50% #> -0.96529968 0.03470032 0.03470032 0.03470032 0.03470032 0.03470032 #> 60% 70% 80% 90% 100% #> 0.03470032 0.03470032 0.03470032 0.03470032 0.88888889

͜͜·Ͱͷཁ఺ σʔλ෼ੳɺػցֶशͷ޻ఔ͸ଟذʹΘͨΓɺͦͷଟ͘͸൓෮తʹߦΘΕΔ ˠҰ࿈ͷखଓ͖Λԁ׈ʹ͢ΔͨΊͷϫʔΫϑϩʔͷར༻͕๬·͍͠ 3ʹ͓͚ΔػցֶशϑϨʔϜϫʔΫͱͯ͠ɺUJEZWFSTF΍NMS͕͋Δɻ ·ͨϞσϧͷղऍΛॿ͚ΔͨΊͷ%"-&9΋ར༻Ͱ͖Δ UJEZNPEFMTύοέʔδΛར༻͢Δ͜ͱͰɺػցֶशϞσϦϯάͰߦ͏޻ఔΛ ౷ҰతΠϯλʔϑΣΠεͷ΋ͱͰਐΊΒΕΔ σʔλ෼ׂˠ\STBNQMF^ ֶशثͷߏஙˠ\QBSTOJQ^ ϫʔΫϑϩʔԽˠ\XPSL
fl PXT^ ੑೳධՁˠ\ZBSETUJDL^

ֶशσʔλ ςετσʔλ \STBNQMF^ަࠩݕূ๏ͷಋೖ ෼ੳσʔλ ධՁσʔλ ֶशσʔλΛ΋ͱʹɺ෼ੳσʔλͱධՁσʔλ͔ΒͳΔϦαϯϓϧσʔλΛಘΔ Ϧαϯϓϧ ʜ Ϧαϯϓϧ Ϧαϯϓϧ/
\STBNQMF^Ͱ͸͞·͟·ͳϦαϯϓϦϯάख๏Λఏڙ͢Δ

\STBNQMF^ަࠩݕূ๏ͷಋೖ # 2ճͷ܁Γฦ͠ɺ10෼ׂަࠩݕূ lp_folds <- vfold_cv(lp_train, v = 10, repeats
= 2, strata = gas) lp_folds #> # 10-fold cross-validation repeated 2 times using stratification #> # A tibble: 20 × 3 #> splits id id2 #> <list> <chr> <chr> #> 1 <split [1304/145]> Repeat1 Fold01 #> 2 <split [1304/145]> Repeat1 Fold02 #> 3 <split [1304/145]> Repeat1 Fold03 #> 4 <split [1304/145]> Repeat1 Fold04 #> 5 <split [1304/145]> Repeat1 Fold05 #> 6 <split [1304/145]> Repeat1 Fold06 #> 7 <split [1304/145]> Repeat1 Fold07 #> 8 <split [1304/145]> Repeat1 Fold08 #> 9 <split [1304/145]> Repeat1 Fold09 #> 10 <split [1305/144]> Repeat1 Fold10 #> 11 <split [1304/145]> Repeat2 Fold01 #> 12 <split [1304/145]> Repeat2 Fold02 #> 13 <split [1304/145]> Repeat2 Fold03 #> 14 <split [1304/145]> Repeat2 Fold04 #> 15 <split [1304/145]> Repeat2 Fold05 #> 16 <split [1304/145]> Repeat2 Fold06 #> 17 <split [1304/145]> Repeat2 Fold07 #> 18 <split [1304/145]> Repeat2 Fold08 #> 19 <split [1304/145]> Repeat2 Fold09 #> 20 <split [1305/144]> Repeat2 Fold10

\STBNQMF^ަࠩݕূ๏ͷಋೖ # 2ճͷ܁Γฦ͠ɺ10෼ׂަࠩݕূ lp_folds <- vfold_cv(lp_train, v = 10, repeats
= 2, strata = gas) lp_folds #> # 10-fold cross-validation repeated 2 times using stratification #> # A tibble: 20 × 3 #> splits id id2 #> <list> <chr> <chr> #> 1 <split [1304/145]> Repeat1 Fold01 #> 2 <split [1304/145]> Repeat1 Fold02 #> 3 <split [1304/145]> Repeat1 Fold03 #> 4 <split [1304/145]> Repeat1 Fold04 #> 5 <split [1304/145]> Repeat1 Fold05 #> 6 <split [1304/145]> Repeat1 Fold06 #> 7 <split [1304/145]> Repeat1 Fold07 #> 8 <split [1304/145]> Repeat1 Fold08 #> 9 <split [1304/145]> Repeat1 Fold09 #> 10 <split [1305/144]> Repeat1 Fold10 splits <list> <split [1304/145]> <split [1304/145]> <split [1304/145]> <split [1304/145]> <split [1304/145]> <split [1304/145]> <split [1304/145]> <split [1304/145]> <split [1304/145]> <split [1305/144]>

# 2ճͷ܁Γฦ͠ɺ10෼ׂަࠩݕূ lp_folds <- vfold_cv(lp_train, v = 10, repeats =
2, strata = gas) lp_folds #> # 10-fold cross-validation repeated 2 times using stratification #> # A tibble: 20 × 3 #> splits id id2 #> <list> <chr> <chr> #> 1 <split [1304/145]> Repeat1 Fold01 #> 2 <split [1304/145]> Repeat1 Fold02 #> 3 <split [1304/145]> Repeat1 Fold03 #> 4 <split [1304/145]> Repeat1 Fold04 #> 5 <split [1304/145]> Repeat1 Fold05 #> 6 <split [1304/145]> Repeat1 Fold06 #> 7 <split [1304/145]> Repeat1 Fold07 #> 8 <split [1304/145]> Repeat1 Fold08 #> 9 <split [1304/145]> Repeat1 Fold09 #> 10 <split [1305/144]> Repeat1 Fold10 \STBNQMF^ަࠩݕূ๏ͷಋೖ <split [1304/145]> <split [1304/145]> <split [1304/145]> lp_folds$splits[1:3] #> [[1]] #> <Analysis/Assess/Total> #> <1304/145/1449> #> #> [[2]] #> <Analysis/Assess/Total> #> <1304/145/1449> #> #> [[3]] #> <Analysis/Assess/Total> #> <1304/145/1449>

\UVOF^Ϧαϯϓϧσʔλ΁ͷॲཧ lp_fit_rs <- fit_resamples(tree_wflow, lp_folds) lp_fit_rs #> # Resampling results
#> # 10-fold cross-validation repeated 2 times using stratification #> # A tibble: 20 × 5 #> splits id id2 .metrics .notes #> <list> <chr> <chr> <list> <list> #> 1 <split [1304/145]> Repeat1 Fold01 <tibble [2 × 4]> <tibble [0 × 3]> #> 2 <split [1304/145]> Repeat1 Fold02 <tibble [2 × 4]> <tibble [0 × 3]> #> 3 <split [1304/145]> Repeat1 Fold03 <tibble [2 × 4]> <tibble [0 × 3]> #> … (লུ) #> 18 <split [1304/145]> Repeat2 Fold08 <tibble [2 × 4]> <tibble [0 × 3]> #> 19 <split [1304/145]> Repeat2 Fold09 <tibble [2 × 4]> <tibble [0 × 3]> #> 20 <split [1305/144]> Repeat2 Fold10 <tibble [2 × 4]> <tibble [0 × 3]> Ϧαϯϓϧσʔλ͝ͱʹֶशΛߦ͏

\UVOF^Ϧαϯϓϧσʔλ΁ͷॲཧ lp_fit_rs <- fit_resamples(tree_wflow, lp_folds) lp_fit_rs #> # Resampling results
#> # 10-fold cross-validation repeated 2 times using stratification #> # A tibble: 20 × 5 #> splits id id2 .metrics .notes #> <list> <chr> <chr> <list> <list> #> 1 <split [1304/145]> Repeat1 Fold01 <tibble [2 × 4]> <tibble [0 × 3]> #> 2 <split [1304/145]> Repeat1 Fold02 <tibble [2 × 4]> <tibble [0 × 3]> #> 3 <split [1304/145]> Repeat1 Fold03 <tibble [2 × 4]> <tibble [0 × 3]> #> … (লུ) #> 18 <split [1304/145]> Repeat2 Fold08 <tibble [2 × 4]> <tibble [0 × 3]> #> 19 <split [1304/145]> Repeat2 Fold09 <tibble [2 × 4]> <tibble [0 × 3]> #> 20 <split [1305/144]> Repeat2 Fold10 <tibble [2 × 4]> <tibble [0 × 3]> .metrics <list> <tibble [2 × 4]> <tibble [2 × 4]> <tibble [2 × 4]> <tibble [2 × 4]> <tibble [2 × 4]> <tibble [2 × 4]> 行

\UVOF^Ϧαϯϓϧσʔλ΁ͷॲཧ lp_fit_rs$.metrics[[1]] #> # A tibble: 2 × 4 #>
.metric .estimator .estimate .config #> <chr> <chr> <dbl> <chr> #> 1 accuracy binary 0.917 Preprocessor1_Model1 #> 2 roc_auc binary 0.800 Preprocessor1_Model1 ͢΂ͯͷϦαϯϓϧσʔλ͔Βੑೳͷฏۉ஋ΛٻΊΔ lp_fit_rs |> collect_metrics() #> # A tibble: 2 × 6 #> .metric .estimator mean n std_err .config #> <chr> <chr> <dbl> <int> <dbl> <chr> #> 1 accuracy binary 0.909 20 0.00566 Preprocessor1_Model1 #> 2 roc_auc binary 0.776 20 0.0195 Preprocessor1_Model1

\UVOF^Ϧαϯϓϧσʔλ΁ͷॲཧ lp_ctrl <- # Ϧαϯϓϧσʔλʹର͢Δ༧ଌ஋Λྻʹอଘ͢ΔͨΊ control_resamples(save_pred = TRUE) lp_preds <-
fit_resamples(tree_wflow, lp_folds, control = lp_ctrl) |> # ϦαϯϓϧσʔλΛల։ͯ͠ҰͭͷσʔλϑϨʔϜʹ collect_predictions() lp_preds #> # A tibble: 2,898 × 8 #> id id2 .pred_FALSE .pred_TRUE .row .pred_class gas .config #> <chr> <chr> <dbl> <dbl> <int> <fct> <fct> <chr> #> 1 Repeat1 Fold01 0.0359 0.964 4 TRUE TRUE Preprocessor1_… #> 2 Repeat1 Fold01 0 1 9 TRUE TRUE Preprocessor1_… #> 3 Repeat1 Fold01 0.889 0.111 22 FALSE FALSE Preprocessor1_… #> …(লུ)

\UVOF^Ϧαϯϓϧσʔλ΁ͷॲཧ lp_preds |> group_by(id) |> lp_metrics(truth = gas, estimate =
.pred_class) #> # A tibble: 6 × 4 #> id .metric .estimator .estimate #> <chr> <chr> <chr> <dbl> #> 1 Repeat1 accuracy binary 0.906 #> 2 Repeat2 accuracy binary 0.912 #> 3 Repeat1 sensitivity binary 0.270 #> 4 Repeat2 sensitivity binary 0.270 #> 5 Repeat1 specificity binary 0.965 #> 6 Repeat2 specificity binary 0.971

͜͜·Ͱͷཁ఺ σʔλ෼ੳɺػցֶशͷ޻ఔ͸ଟذʹΘͨΓɺͦͷଟ͘͸൓෮తʹߦΘΕΔ ˠҰ࿈ͷखଓ͖Λԁ׈ʹ͢ΔͨΊͷϫʔΫϑϩʔͷར༻͕๬·͍͠ 3ʹ͓͚ΔػցֶशϑϨʔϜϫʔΫͱͯ͠ɺUJEZWFSTF΍NMS͕͋Δɻ ·ͨϞσϧͷղऍΛॿ͚ΔͨΊͷ%"-&9΋ར༻Ͱ͖Δ UJEZNPEFMTύοέʔδΛར༻͢Δ͜ͱͰɺػցֶशϞσϦϯάͰߦ͏޻ఔΛ ౷ҰతΠϯλʔϑΣΠεͷ΋ͱͰਐΊΒΕΔ σʔλ෼ׂɺϦαϯϓϦϯάˠ\STBNQMF^ ֶशثͷߏஙˠ\QBSTOJQ^ ϫʔΫϑϩʔԽˠ\XPSL
fl PXT^ ੑೳධՁˠ\ZBSETUJDL^ ϦαϯϓϦϯάσʔλ΁ͷϞσϧͷ౰ͯ͸Ίˠ\UVOF^

\NMSWFSTF^Ͱ͜͜·Ͱͷ޻ఔΛ࣮ߦ library(mlr3verse) lp_supply_chr2fct <- lp_supply |> mutate(across(where(is.character), as.factor)) # λεΫͷఆٛ
lp_task <- as_task_classif(lp_supply_chr2fct, target = "gas") # σʔλ෼ׂʢ૚Խநग़ɺ80%Λֶशσʔλʹʣ set.seed(123) lp_split_mlr <- partition(lp_task, stratify = TRUE, ratio = 0.8) # ֶशثͷߏங tree_learner <- lrn("classif.rpart", cp = 0.002)

\NMSWFSTF^Ͱ͜͜·Ͱͷ޻ఔΛ࣮ߦ # ϦαϯϓϦϯάσʔλͷ༻ҙʢ2ճͷ܁Γฦ͠5෼ׂަࠩݕূ๏ʣ lp_folds_mlr <- rsmp("repeated_cv", repeats = 2, folds
= 5) rr <- resample(lp_task, tree_learner, lp_folds_mlr) rr$aggregate(msr("classif.acc")) #> classif.acc #> 0.9213476 rr$aggregate(msr("classif.sensitivity")) #> classif.sensitivity #> 0.3157344 rr$aggregate(msr("classif.specificity")) #> classif.specificity #> 0.9747444

\SFDJQFT^ಛ௃ྔΤϯδχΞϦϯά ϞσϧΛݟ௚͢ ˠϞσϧʹ࢖ΘΕΔม਺ʢಛ௃ྔʣ ˠϞσϧͰ࢖ΘΕΔΞϧΰϦζϜ ࠷ॳʹ࡞ͬͨϞσϧͰಘΒΕͨಎ࡯͔Βʜ lp_supply$water |> unique() #> [1]
TRUE

\SFDJQFT^ಛ௃ྔΤϯδχΞϦϯά recipe(gas ~ ., data = lp_train) |> step_zv(all_predictors()) lp_supply$water
|> unique() #> [1] TRUE step_*() 行 step_*() 目 TUFQ@[W ؔ਺͸ɺ୯Ұͷ஋͔ΒͳΔม਺ΛϞσϧ͔ࣜΒআ֎͢Δʢ;FSPWBSJBODF fi MUFSʣ ಛ௃ྔ΁ͷૢ࡞ΛʮϨγϐʯͱͯ͠ఆٛ͢Δ

\SFDJQFT^ಛ௃ྔΤϯδχΞϦϯά lp_rec <- recipe(gas ~ ., data = lp_train) |>
step_zv(all_predictors()) |> step_log(price, dist_from_st, offset = 0.01) |> step_normalize(all_numeric_predictors()) |> step_dummy(all_nominal_predictors()) lp_rec #> #> ── Recipe ────────────────────────────────────────────────────────────────────── #> #> ── Inputs #> Number of variables by role #> outcome: 1 #> predictor: 7 #> #> ── Operations #> • Zero variance filter on: all_predictors() #> • Log transformation on: price and dist_from_st #> • Centering and scaling for: all_numeric_predictors() #> • Dummy variables from: all_nominal_predictors() TUFQ@ ΛඞཁʹԠͯ͡௥Ճ͢Δ 用

\SFDJQFT^ಛ௃ྔΤϯδχΞϦϯά prep(lp_rec) |> # ӨڹΛ͏͚Δม਺ͱͦͷޮՌʹ͍ͭͯ֬ೝ # stepͷॱ൪Λࢦఆ tidy(number = 1)
#> # A tibble: 1 × 2 #> terms id #> <chr> <chr> #> 1 water zv_KsRlw

\SFDJQFT^ಛ௃ྔΤϯδχΞϦϯά prep(lp_rec) |> bake(new_data = lp_train) |> glimpse() #> Rows:
1,449 #> Columns: 7 #> $ price <dbl> -1.0975606, -1.6677181, -1.8257103, -0.7057626, -0.559127… #> $ sewer <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU… #> $ above_floor <dbl> 1.4569564, -1.3153743, -0.2757503, 0.4173324, 2.8431217, … #> $ under_floor <dbl> 1.9139251, -0.3518245, -0.3518245, 1.9139251, -0.3518245,… #> $ dist_from_st <dbl> 0.63568930, 0.46296422, 0.86095169, -0.37863009, -4.89537… #> $ gas <fct> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU… #> $ fire_๷Ր <dbl> 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, … 用用

\XPSL fl PXT^ϫʔΫϑϩʔͷमਖ਼ tree_wflow |> remove_formula() |> add_recipe(lp_rec) #> ══
Workflow ════════════════════════════════════════════════════════════════════ #> Preprocessor: Recipe #> Model: decision_tree() #> #> ── Preprocessor ──────────────────────────────────────────────────────────────── #> 4 Recipe Steps #> #> • step_zv() #> • step_log() #> • step_normalize() #> • step_dummy() #> #> ── Model ─────────────────────────────────────────────────────────────────────── #> Decision Tree Model Specification (classification) #> #> Main Arguments: #> cost_complexity = 0.002 #> #> Computational engine: rpart

fl PXT^ ੑೳධՁˠ\ZBSETUJDL^ ϦαϯϓϦϯάσʔλ΁ͷϞσϧͷ౰ͯ͸Ίˠ\UVOF^ ಛ௃ྔΤϯδχΞϦϯάˠ\SFDJQFT^

ϥϯμϜϑΥϨετͷ࣮૷ͱϞσϧͷൺֱ rf_spec <- rand_forest(trees = 1000, mode = "classification") |>
set_engine("randomForest") rf_spec #> Random Forest Model Specification (classification) #> #> Main Arguments: #> trees = 1000 #> #> Computational engine: randomForest

\XPSL fl PXTFUT^ෳ਺ϞσϧɺϨγϐΛѻ͏ wf_set_fit <- workflow_set(preproc = list(none = gas
~ ., prep = lp_rec), models = list(tree_spec, rf_spec), cross = TRUE) |> workflow_map("fit_resamples", resamples = lp_folds) wf_set_fit |> rank_results() #> # A tibble: 8 × 9 #> wflow_id .config .metric mean std_err n preprocessor model rank #> <chr> <chr> <chr> <dbl> <dbl> <int> <chr> <chr> <int> #> 1 prep_rand_forest Prepro… accura… 0.924 0.00475 20 recipe rand… 1 #> 2 prep_rand_forest Prepro… roc_auc 0.873 0.0155 20 recipe rand… 1 #> 3 none_rand_forest Prepro… accura… 0.927 0.00460 20 formula rand… 2 #> 4 none_rand_forest Prepro… roc_auc 0.870 0.0165 20 formula rand… 2 #> 5 none_decision_tr… Prepro… accura… 0.909 0.00566 20 formula deci… 3 #> 6 none_decision_tr… Prepro… roc_auc 0.776 0.0195 20 formula deci… 3 #> 7 prep_decision_tr… Prepro… accura… 0.909 0.00566 20 recipe deci… 4 #> 8 prep_decision_tr… Prepro… roc_auc 0.776 0.0195 20 recipe deci… 4

\UVOF^ϋΠύʔύϥϝʔλͷ୳ࡧ rf_tune_spec <- rand_forest(min_n = , trees = , mode
= "classification") |> set_engine("randomForest") rf_tune_spec #> Random Forest Model Specification (classification) #> #> Main Arguments: #> trees = tune() #> min_n = tune() #> #> Computational engine: randomForest tune() tune() 行 tune()

\UVOF^ϋΠύʔύϥϝʔλͷ୳ࡧ rf_tune_wflow <- workflow() |> add_model(rf_tune_spec) |> add_recipe(lp_rec) rf_fit_tune_res #>
# A tibble: 20 × 5 #> splits id id2 .metrics .notes #> <list> <chr> <chr> <list> <list> #> 1 <vfld_spl> Repeat1 Fold01 <tibble [10 × 6]> <tibble [0 × 3]> #> 2 <vfld_spl> Repeat1 Fold02 <tibble [10 × 6]> <tibble [0 × 3]> #> 3 <vfld_spl> Repeat1 Fold03 <tibble [10 × 6]> <tibble [0 × 3]> #> …ʢলུʣ rf_fit_tune_res <- tune_grid(rf_tune_wflow, lp_folds, grid = 5)

\UVOF^ϋΠύʔύϥϝʔλͷ୳ࡧ finalize_workflow(rf_tune_wflow, lp_tune_best_parameter) |> last_fit(lp_split) |> collect_metrics() #> # A
tibble: 2 × 4 #> .metric .estimator .estimate .config #> <chr> <chr> <dbl> <chr> #> 1 accuracy binary 0.950 Preprocessor1_Model1 #> 2 roc_auc binary 0.909 Preprocessor1_Model1 lp_tune_best_parameter <- select_best(rf_fit_tune_res, metric = "roc_auc") lp_tune_best_parameter #> # A tibble: 1 × 3 #> trees min_n .config #> <int> <int> <chr> #> 1 1951 15 Preprocessor1_Model5

fl PXT^ ੑೳධՁˠ\ZBSETUJDL^ ϦαϯϓϦϯάσʔλ΁ͷϞσϧͷ౰ͯ͸ΊɺϋΠύʔύϥϝʔλ୳ࡧˠ\UVOF^ ಛ௃ྔΤϯδχΞϦϯάˠ\SFDJQFT^ ϫʔΫϑϩʔͷ؅ཧͱൺֱˠ\XPSL fl PXTFUT^

ୈҰ෦ͷ·ͱΊ 3ݴޠʹ͓͚ΔػցֶशϑϨʔϜϫʔΫͱͯ͠UJEZNPEFMTΛ঺հ NMSWFSTFͰ΋UJEZNPEFMTͰߦͬͨॲཧΛ࣮ݱͰ͖Δ ϞσϧͷղऍΛॿ͚ΔͨΊͷ%"-&9͸UJEZNPEFMT΍NMSʹରԠ ൓෮తૢ࡞ɺෳ਺ϞσϧɾϨγϐͷݕ౼͕ඞཁͳػցֶशϞσϦϯάʹ͓͍ͯ͸ ΦϒδΣΫτͷ؅ཧ͕ॏཁʜʢϫʔΫϑϩʔԽΛਪ঑ʣ

ࢀߟࢿྉ .BY,VIO BOE,KFMM+PIOTPO l'FBUVSF&OHJOFFSJOHBOE4FMFDUJPOz IUUQTCPPLEPXOPSHNBY'&4 .BY,VIOBOE+VMJB4JMHF 5JEZ.PEFMJOHXJUI3IUUQTXXXUNXSPSH দଜ༏࠸
ӝੜਅ໵ ٢ଜ޿ࢤ ʮ3ϢʔβͷͨΊͷUJEZNPEFMT<࣮ફ>ೖ໳ʙ Ϟμϯͳ౷ܭɾػցֶशϞσϦϯάͷੈքʯٕज़ධ࿦ࣾ 1S[FNZTBX#JFDFLBOE5PNBT[#VS[ZLPXTLJ &YQMBOBUPSZ.PEFM "OBMZTJTIUUQTFNBESXIZBJ .BY,VIOBOE,KFMM+PIOTPO "QQMJFE.BDIJOF-FBSOJOHGPS5BCVMBS %BUBIUUQTBNMUEPSH #FSOE#JTDIM 3BQIBFM4POBCFOE -BST,PUUIP ff .JDIFM-BOH "QQMJFE .BDIJOF-FBSOJOH6TJOHNMSJO3IUUQTNMSCPPLNMSPSHDPN

Rの機械学習フレームワークの紹介〜tidymodelsを中心に〜 / machine_lear...

Rの機械学習フレームワークの紹介〜tidymodelsを中心に〜 / machine_learning_with_r2024

More Decks by Uryu Shinya

Other Decks in Research

Featured

Transcript