Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Japan.R 2015: {purrr} による非テーブルデータの処理

Sinhrks
December 05, 2015

Japan.R 2015: {purrr} による非テーブルデータの処理

Sinhrks

December 05, 2015
Tweet

More Decks by Sinhrks

Other Decks in Technology

Transcript

  1. ࣗݾ঺հ • @sinhrks • ۀ຿: σʔλ෼ੳ (ϝʔΧʔ) • ར༻ݴޠ: Python,

    Rͱ͔ • झຯ: OSS ׆ಈ • pandas ※ ίϛολ (Python ύοέʔδ) ※ R ͷ data.frame + dplyr + tidyr + readr + haven + lubridate + stringr + ggplot2… (ҎԼུ ͷΑ͏ͳ΋ͷ
  2. ͳͥ Japan.R ʹ? • Rͷύοέʔδ࡞ͬͨ (JU)VC"XBSET ݱࡏ • ͜ͷR։ൃऀ͕͍͢͝ (2015)

    by @u_ribo બग़ • http://uribo.hatenablog.com/entry/2015/12/02/180004
  3. ૝ఆ͢ΔϨϕϧͱΰʔϧ • લఏ: dplyr, ggplot2 ͸গ͠࢖ͬͨ͜ͱ͕͋Δ • ΰʔϧ: {purrr} ͷجຊతͳ࢖͍ํΛ஌Δ

    • ͜ΜͳίʔυΛॻ͔ͳͯ͘΋Α͘ͳΔ glm.fit1 <- glm(y ~ x, …) glm.fit2 <- glm(y ~ x + z, …) … predict(glm.fit1, newdata = newdata) predict(glm.fit2, newdata = newdata) …
  4. RʹΑΔσʔλॲཧ EBUBGSBNF ϦετϕΫτϧ Ϟσϧ HHQMPU UJEZS EQMZS SFBES SWFTU SMJTU

    QVSSS ֤छ౷ܭػցֶश ύοέʔδ CSPPN DBSFU HHGPSUJGZ
  5. plyr • ૊ΈࠐΈ: library(plyr) plyr::ddply(iris, .(Species), plyr::colwise(mean)) ## Species Sepal.Length

    Sepal.Width Petal.Length Petal.Width ## 1 setosa 5.006 3.428 1.462 0.246 ## 2 versicolor 5.936 2.770 4.260 1.326 ## 3 virginica 6.588 2.974 5.552 2.026 iris_group <- split(iris, iris$Species) res <- sapply(iris_group, function(x) { sapply(x[, 1:4], mean) }) as.data.frame(t(res)) ## Sepal.Length Sepal.Width Petal.Length Petal.Width ## setosa 5.006 3.428 1.462 0.246 ## versicolor 5.936 2.770 4.260 1.326 ## virginica 6.588 2.974 5.552 2.026 • plyr: ෼ׂ - ద༻ - ݁߹ΛҰͭͷؔ਺Ͱ
  6. dplyr • plyr: ෼ׂ - ద༻ - ݁߹ΛҰͭͷؔ਺Ͱ library(plyr) plyr::ddply(iris,

    .(Species), plyr::colwise(mean)) ## Species Sepal.Length Sepal.Width Petal.Length Petal.Width ## 1 setosa 5.006 3.428 1.462 0.246 ## 2 versicolor 5.936 2.770 4.260 1.326 ## 3 virginica 6.588 2.974 5.552 2.026 library(dplyr) dplyr::summarise_each(dplyr::group_by(iris, Species), dplyr::funs(mean)) ## Source: local data frame [3 x 5] ## ## Species Sepal.Length Sepal.Width Petal.Length Petal.Width ## (fctr) (dbl) (dbl) (dbl) (dbl) ## 1 setosa 5.006 3.428 1.462 0.246 ## 2 versicolor 5.936 2.770 4.260 1.326 ## 3 virginica 6.588 2.974 5.552 2.026 • dplyr: ॲཧ͝ͱʹؔ਺Λద༻
  7. ύΠϓԋࢉࢠ %>% • magrittr iris %>% dplyr::group_by(Species) %>% dplyr::summarise_each(dplyr::funs(mean)) ##

    Source: local data frame [3 x 5] ## ## Species Sepal.Length Sepal.Width Petal.Length Petal.Width ## (fctr) (dbl) (dbl) (dbl) (dbl) ## 1 setosa 5.006 3.428 1.462 0.246 ## 2 versicolor 5.936 2.770 4.260 1.326 ## 3 virginica 6.588 2.974 5.552 2.026 x %>% f(…) == f(x, …)
  8. ૊ΈࠐΈؔ਺Ͱ΍Δͱ… • Apply ؔ਺܈ • ߴ֊ؔ਺ ※ ܈ BQQMZ lapply(list(1,

    2, 3), function (x) { x + 1 }) Map(function (x) { x + 1 }, list(1, 2, 3)) MBQQMZ TBQQMZ UBQQMZ NBQQMZ 3FEVDF 'JMUFS 'JOE .BQ /FHBUF 1PTJUJPO ແ໊ؔ਺ͷఆ͕ٛΊΜͲ͏ ؔ਺ͷద༻Ҏ֎ͷॲཧ͸Ͱ͖ͳ͍ ؔ਺͕ୈҰҾ਺ͷͨΊύΠϓԋࢉࢠ͕࢖͍ʹ͍͘ ֮͑ΒΕͳ͍ ※ ؔ਺ΛҾ਺΍ฦΓ஋ʹ͢Δؔ਺
  9. {purrr} ͱ͸ • RStudio BlogΑΓ • ؔ਺ܕϓϩάϥϛϯάͷͨΊͷπʔϧ • ॲཧΛγϯϓϧͳؔ਺ͷ૊Έ߹ΘͤͰهड़ •

    ؔ਺͸ύΠϓԋࢉࢠͰ઀ଓ Purrr is a new package that fills in the missing pieces in R’s functional programming tools: it’s designed to make your pure functions purrr. Like many of my recent packages, it works with magrittr to allow you to express complex operations by combining simple pieces in a standard way.
  10. R ͸ؔ਺ܕݴޠ͔? • Yes (by Hadley Wickham in “Advanced R”)

    . • ৄ͘͠͸ “Rݴޠపఈղઆ” (๜༁͕12/23ൃച)
  11. {purrr} ͷಛ௃ • ϥϜμࣜ • ύΠϓԋࢉࢠͰ઀ଓͰ͖Δؔ਺܈ • ஫ҙ఺ • όʔδϣϯ͸0.1ɻࠓޙ

    ഁյతͳมߋ͕͋ΔՄೳੑ͕͋Δɻ • ຊ೔͸ओཁͳؔ਺ͷΈ͝঺հ (͜ΕͰ΄ͱΜͲͷ͜ͱ͸Ͱ͖Δ)ɻ
  12. ϥϜμࣜ • ໊લΛ΋ͨͳ͍ؔ਺ (ແ໊ؔ਺) Λγϯϓϧʹهड़ • R ඪ४ • {purrr}

    ͷϥϜμࣜ • {purrr} ͷؔ਺ͷதͰ࢖͏ (ߴ֊ؔ਺) function(x) { x + 1 } ~ . + 1 map(x, ~ . + 1)  υοτ ͕Ҿ਺ʹରԠ
  13. map • ؔ਺Λ֤ཁૉʹద༻ map(c(1, 2, 3), ~ . + 1)

    ## x = list 3 (216 bytes) ## . [[1]] = double 1= 2 ## . [[2]] = double 1= 3 ## . [[3]] = double 1= 4 map(list(a = 1, b = 2, c = 3), ~ . + 1) ## x = list 3 (544 bytes) ## . a = double 1= 2 ## . b = double 1= 3 ## . c = double 1= 4 map(c(1, 2, 3), ~ . + 1) == list(1 + 1, 2 + 1, 3 + 1)
  14. map2 • 2ͭͷҾ਺ʹରͯؔ͠਺ద༻ map2(c(1, 2, 3), c(4, 5, 6), ~

    .x * .y) ## x = list 3 (216 bytes) ## . [[1]] = double 1= 4 ## . [[2]] = double 1= 10 ## . [[3]] = double 1= 18 map2(c(1, 2, 3), c(4, 5, 6), ~ .x * .y) == list(1 * 4, 2 * 5, 3 * 6) ͻͱͭΊ͔ΒͷཁૉΛY ;ͨͭΊ͔ΒͷཁૉΛZͰࢀর
  15. map_xxx • ݁ՌΛϕΫτϧͰऔಘɺܕͷࢦఆ͕ඞཁ map_int(list(a = 1L, b = 2L, c

    = 3L), ~ . + 1L) ## a b c ## 2 3 4 NBQ@YYY ϕΫτϧͷܕ NBQ@MHM MPHJDBM NBQ@DIS DIBSBDUFS NBQ@JOU JOUFHFS NBQ@ECM OVNSJD
  16. keep • ؔ਺ͷ৚݅ʹ͋ͯ͸·ΔཁૉΛநग़ keep(c(1, 2, 3), ~ . >= 2)

    ## [1] 2 3 keep(list(a = 1, b = 2, c = 3), ~ . >= 2) ## x = list 2 (416 bytes) ## . b = double 1= 2 ## . c = double 1= 3
  17. reduce • ͨͨΈࠐΈ reduce(c(1, 2, 3), `+`) ## [1] 6

    reduce(list(a = 1, b = 2, c = 3), `+`) ## [1] 6 reduce(c(1, 2, 3), `+`) == ((1 + 2) + 3)
  18. split_by • base::split + ϥϜμࣜ split_by(c(1, 2, 3), ~ .

    %% 2) ## x = list 2 (424 bytes) ## . 0 = double 1= 2 ## . 1 = double 2= 1 3 split_by(list(a = 1, b = 2, c = 3), ~ . %% 2) ## x = list 2 (1040 bytes) ## . 0 = list 1 ## . . b = double 1= 2 ## . 1 = list 2 ## . . a = double 1= 1 ## . . c = double 1= 3
  19. sort_by • ιʔτ sort_by(c(2, -3, 1), ~ abs(.)) ## [1]

    1 2 -3 sort_by(list(a = 2, b = - 3, c = 1), ~ abs(.)) ## x = list 3 (544 bytes) ## . c = double 1= 1 ## . a = double 1= 2 ## . b = double 1= -3
  20. 1. μϛʔσʔλͷ࡞੒ ndata <- c('hoxo-m', 'hoxo-eros', 'hoxo-um', 'hoxo-uri') dummies <-

    map(1:5, ~ list(name = sample(ndata, size = 1), age = sample(25:35, size = 1), likes = sample(ndata, size = 2))) dummies ## x = list 5 (3632 bytes) ## . [[1]] = list 3 ## . . name = character 1= hoxo-eros ## . . age = integer 1= 29 ## . . likes = character 2= hoxo-um hoxo-uri ## . [[2]] = list 3 ## . . name = character 1= hoxo-m ## . . age = integer 1= 34 ## . . likes = character 2= hoxo-uri hoxo-eros ## . [[3]] = list 3 ## . . name = character 1= hoxo-um ## . . age = integer 1= 25 ## . . likes = character 2= hoxo-m hoxo-uri ## . [[4]] = list 3 ## . . name = character 1= hoxo-um ## . . age = integer 1= 29 ## . . likes = character 2= hoxo-uri hoxo-eros ## . [[5]] = list 3 ## . . name = character 1= hoxo-um ## . . age = integer 1= 35 ## . . likes = character 2= hoxo-eros hoxo-um ϦετΛฦ͢ϥϜμࣜ
  21. 1. μϛʔσʔλͷ࡞੒ dummies <- map(ndata, ~ list(name = ., age

    = sample(30:35, size = 1), likes = sample(ndata[ndata != .], size = sample(1:(length(ndata) - 1), size = 1)))) dummies ## x = list 4 (2896 bytes) ## . [[1]] = list 3 ## . . name = character 1= hoxo-m ## . . age = integer 1= 31 ## . . likes = character 2= hoxo-um hoxo-uri ## . [[2]] = list 3 ## . . name = character 1= hoxo-eros ## . . age = integer 1= 31 ## . . likes = character 3= hoxo-uri hoxo-um ... ## . [[3]] = list 3 ## . . name = character 1= hoxo-um ## . . age = integer 1= 30 ## . . likes = character 1= hoxo-m ## . [[4]] = list 3 ## . . name = character 1= hoxo-uri ## . . age = integer 1= 34 ## . . likes = character 2= hoxo-um hoxo-m
  22. 1. μϛʔσʔλͷ࡞੒ gen <- function(name) { age = sample(30:35, size

    = 1) size = sample(1:(length(ndata) - 1), size = 1) likes = sample(ndata[ndata != name], size = size) return (list(name = name, age = age, likes = likes)) } dummies <- map(ndata, ~ gen(.)) dummies ## x = list 4 (2896 bytes) ## . [[1]] = list 3 ## . . name = character 1= hoxo-m ## . . age = integer 1= 31 ## . . likes = character 2= hoxo-um hoxo-uri ## . [[2]] = list 3 ## . . name = character 1= hoxo-eros ## . . age = integer 1= 31 ## . . likes = character 3= hoxo-uri hoxo-um ... ## . [[3]] = list 3 ## . . name = character 1= hoxo-um ## . . age = integer 1= 30 ## . . likes = character 1= hoxo-m ## . [[4]] = list 3 ## . . name = character 1= hoxo-uri ## . . age = integer 1= 34 ## . . likes = character 2= hoxo-um hoxo-m ී௨ͷؔ਺Λ౉ͯ͠΋Α͍
  23. 1. μϛʔσʔλͷ࡞੒ ndata %>% map(~ list(name = ., age =

    sample(30:35, size = 1))) %>% dplyr::bind_rows() EBUBGSBNFʹ͚ͨ͠Ε͹ EQMZSCJOE@SPXT ## Source: local data frame [4 x 2] ## ## name age ## (chr) (int) ## 1 hoxo-m 34 ## 2 hoxo-eros 35 ## 3 hoxo-um 32 ## 4 hoxo-uri 34
  24. 2. Ϧετ͔Βͷσʔλબ୒ • ؆୯ͳૢ࡞͸ {purrr} Ͱ΋Ͱ͖Δ • Ϩίʔυͷબ୒ • ଐੑͷબ୒

    • ࢀߟ: {rlist} ͱͷൺֱ • {purrr} ͰϦετσʔλΛૢ࡞͢Δ <1> • {purrr} ͰϦετσʔλΛૢ࡞͢Δ <2>
  25. 2. Ϧετ͔Βͷσʔλબ୒ keep(dummies, ~ .$name == 'hoxo-m') ## x =

    list 1 (752 bytes) ## . [[1]] = list 3 ## . . name = character 1= hoxo-m ## . . age = integer 1= 31 ## . . likes = character 2= hoxo-um hoxo-uri OBNFଐੑ͕bIPYPN`ͷ ϨίʔυΛબ୒
  26. 2. Ϧετ͔Βͷσʔλબ୒ keep(dummies, ~ .$age > 30) ## x =

    list 3 (2256 bytes) ## . [[1]] = list 3 ## . . name = character 1= hoxo-m ## . . age = integer 1= 31 ## . . likes = character 2= hoxo-um hoxo-uri ## . [[2]] = list 3 ## . . name = character 1= hoxo-eros ## . . age = integer 1= 31 ## . . likes = character 3= hoxo-uri hoxo-um ... ## . [[3]] = list 3 ## . . name = character 1= hoxo-uri ## . . age = integer 1= 34 ## . . likes = character 2= hoxo-um hoxo-m BHFଐੑ͕ΑΓେ͖͍ ͓͡͞ΜΛબ୒ ˞࣮ࡍͷ೥ྸͱ͸ҟͳΔՄೳੑ͕͋Γ·͢
  27. 2. Ϧετ͔Βͷσʔλબ୒ keep(dummies, ~ 'hoxo-m' %in% .$likes) ## x =

    list 3 (2192 bytes) ## . [[1]] = list 3 ## . . name = character 1= hoxo-eros ## . . age = integer 1= 31 ## . . likes = character 3= hoxo-uri hoxo-um ... ## . [[2]] = list 3 ## . . name = character 1= hoxo-um ## . . age = integer 1= 30 ## . . likes = character 1= hoxo-m ## . [[3]] = list 3 ## . . name = character 1= hoxo-uri ## . . age = integer 1= 34 ## . . likes = character 2= hoxo-um hoxo-m bIPYPN`͞Μͷ͜ͱ͕ ޷͖ͳਓΛநग़ ˞ࣄ࣮ͱ͸ҟͳΔՄೳੑ͕͋Γ·͢
  28. 2. Ϧετ͔Βͷσʔλબ୒ keep(dummies, ~ 'hoxo-m' %in% .$likes) %>% map(~ .$name)

    ## x = list 3 (376 bytes) ## . [[1]] = character 1= hoxo-eros ## . [[2]] = character 1= hoxo-um ## . [[3]] = character 1= hoxo-uri ϑΟϧλ݁Ռ͔Β OBNFଐੑͷΈΛநग़ keep(dummies, ~ 'hoxo-m' %in% .$likes) %>% map('name') ## x = list 3 (376 bytes) ## . [[1]] = character 1= hoxo-eros ## . [[2]] = character 1= hoxo-um ## . [[3]] = character 1= hoxo-uri NBQʹจࣈྻΛ ౉ͯ͠΋Α͍
  29. • σʔλղੳͷͨΊͷ౷ܭϞσϦϯάೖ໳ • ୈ3ষΑΓ • y: छࢠ਺ • x: ২෺ͷαΠζ

    • f: ࢪං͋Γ(T) / ͳ͠ 3. ෳ਺Ϟσϧͷ࡞੒ͱൺֱ df <- read.csv('data3a.csv') head(df) ## y x f ## 1 6 8.31 C ## 2 6 9.44 C ## 3 6 9.50 C ## 4 12 9.07 C ## 5 10 10.16 C ## 6 4 8.32 C
  30. 3. ෳ਺Ϟσϧͷ࡞੒ͱൺֱ formulas <- list(mod1 = y ~ x, mod2

    = y ~ f, mod3 = y ~ x + f) results <- purrr::map(formulas, ~ glm(formula = ., family = poisson, data = df)) results[[1]] Call: glm(formula = .x, family = .y, data = df) Coefficients: (Intercept) x 1.29172 0.07566 Degrees of Freedom: 99 Total (i.e. Null); 98 Residual Null Deviance: 89.51 Residual Deviance: 84.99 AIC: 474.8 GPSNVMBͷϦετΛ࡞੒ ֤GPSNVMBͰ ϞσϧΛ࡞੒
  31. 3. ෳ਺Ϟσϧͷ࡞੒ͱൺֱ map(results, logLik) ## x = list 3 (2128

    bytes) ## . mod1 = double 1( logLik )= -235.39 ## . A nobs = integer 1= 100 ## . A df = integer 1= 2 ## . mod2 = double 1( logLik )= -237.63 ## . A nobs = integer 1= 100 ## . A df = integer 1= 2 ## . mod3 = double 1( logLik )= -235.29 ## . A nobs = integer 1= 100 ## . A df = integer 1= 3 ݁Ռ͢΂ͯʹ MPH-JLؔ਺Λద༻
  32. 3. ෳ਺Ϟσϧͷ࡞੒ͱൺֱ map(results, AIC) ## x = list 3 (544

    bytes) ## . mod1 = double 1= 474.77 ## . mod2 = double 1= 479.25 ## . mod3 = double 1= 476.59 sort_by(results, AIC) %>% names() ## [1] "mod1" "mod3" "mod2" ݁Ռ͢΂ͯʹ "*$ؔ਺Λద༻ "*$ͷॱʹฒ΂ସ͑ɺ Ϟσϧ໊Λදࣔ
  33. 3. ෳ਺Ϟσϧͷ࡞੒ͱൺֱ formulas <- c(y ~ x, y ~ x,

    y ~ f) families <- c(poisson, gaussian, poisson) results2 <- purrr::map2(formulas, families, ~ glm(formula = .x, family = .y, data = df)) results2[[1]] Call: glm(formula = .x, family = .y, data = df) Coefficients: (Intercept) x 1.29172 0.07566 Degrees of Freedom: 99 Total (i.e. Null); 98 Residual Null Deviance: 89.51 Residual Deviance: 84.99 AIC: 474.8 ෳ਺ύϥϝʔλͷ৔߹͸ NBQ
  34. 4. ϓϩοτ plot(results[[1]], which = 2) map(results, ~ plot(., which

    = 2)) ݁Ռͷ͏ͪͻͱͭΛ 22ϓϩοτ ͢΂ͯΛ22ϓϩοτ ॱ൪ʹදࣔ͞ΕΔ αϒϓϩοτ͚ͨ͠Ε͹QBS
  35. {caret} • ػցֶशશ෦ೖΓ library(mlbench) library(caret) data(PimaIndiansDiabetes) control <- trainControl(method =

    "repeatedcv", number = 10, repeats = 3) methods <- c('gbm', 'rpart', 'svmRadial') trained <- methods %>% purrr::map(~ train(diabetes ~ ., data = PimaIndiansDiabetes, method = ., trControl = control)) resampled <- resamples(trained) summary(resampled) ## Call: ## summary.resamples(object = resampled) ## ## Models: Model1, Model2, Model3 ## Number of resamples: 30 ## ## Accuracy ## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's ## Model1 0.6883 0.7435 0.7662 0.7682 0.7915 0.8442 0 ## Model2 0.6711 0.7273 0.7451 0.7499 0.7785 0.8312 0 ## Model3 0.6883 0.7403 0.7662 0.7678 0.7922 0.8312 0 ## ## Kappa ## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's ## Model1 0.3037 0.4297 0.4763 0.4711 0.5268 0.6578 0 ## Model2 0.2339 0.3657 0.4078 0.4245 0.4733 0.6128 0 ## Model3 0.2518 0.4043 0.4610 0.4607 0.5138 0.6165 0 ൺֱ͍ͨ͠ϞσϧΛϦετʹ͠ɺ ֤ϞσϧΛUSBJO
  36. {broom} • ϞσϧΛ data.frame ʹม׵͢Δ library(broom) glance(results[[1]]) ## null.deviance df.null

    logLik AIC BIC deviance df.residual ## 1 89.50694 99 -235.3863 474.7725 479.9828 84.993 98 ؔ਺ ֓ཁ UJEZ Ϟσϧͷཁ໿ BVHNFOU Ϟσϧͷ৘ใΛݩͷEBUBGSBNFʹ෇༩͢Δ HMBODF Ұߦͷཁ໿ • 3. ෳ਺Ϟσϧͷ࡞੒ͱൺֱͷ݁ՌΛྲྀ༻
  37. {broom} • ϞσϧΛ data.frame ʹม׵͢Δ map(results, broom::glance) %>% dplyr::bind_rows() ##

    Source: local data frame [3 x 7] ## ## null.deviance df.null logLik AIC BIC deviance df.residual ## (dbl) (int) (dbl) (dbl) (dbl) (dbl) (int) ## 1 89.50694 99 -235.3863 474.7725 479.9828 84.99300 98 ## 2 89.50694 99 -237.6273 479.2545 484.4649 89.47501 98 ## 3 89.50694 99 -235.2937 476.5874 484.4029 84.80793 97 ෳ਺ͷϞσϧͷཁ໿Λ EBUBGSBNFʹ