Slide 1

Slide 1 text

Data Analysis Languages (R, Python, Julia): A Year in Review
Masaaki Horikoshi @ ARISE analytics

Slide 2

Slide 2 text

Self-introduction
• R
• Package development, etc.
• Git Awards: ranked 1st in Japan
• Python
• http://git-awards.com/users/search?login=sinhrks

Slide 3

Slide 3 text

What kind of year was 2017 for you?

Slide 4

Slide 4 text

2017 in review
• This year's activity (based on pull requests to my own packages)

Slide 5

Slide 5 text

2017 in review
• R 3.4.x released
• RStudio 1.1 released
• Ranked 6th in IEEE's The 2017 Top Programming Languages
• CRAN passed 10,000 packages
• Documentation packages matured (blogdown, xaringan)
• FFI (reticulate)
• prophet, tensorflow
• Various books

Slide 6

Slide 6 text

Agenda
• Refinement of data-processing packages
• Richer utilities

Slide 7

Slide 7 text

Agenda
• Refinement of data-processing packages
• Richer utilities

Slide 8

Slide 8 text

What is tidy data?

Slide 9

Slide 9 text

What is tidy data?
1. Put each dataset in a tibble.
2. Put each variable in a column.

    #> # A tibble: 6 × 4
    #>       country  year  cases population
    #> 1 Afghanistan  1999    745   19987071
    #> 2 Afghanistan  2000   2666   20595360
    #> 3      Brazil  1999  37737  172006362
    #> 4      Brazil  2000  80488  174504898
    #> 5       China  1999 212258 1272915272
    #> 6       China  2000 213766 1280428583

• R for Data Science http://r4ds.had.co.nz/tidy-data.html
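As a rough illustration of the second rule, the sketch below reshapes a wide table into the tidy layout shown above. It assumes the `table4a` example dataset that ships with tidyr (the `cases` counts spread over one column per year) and uses `tidyr::gather()`; both are choices made for this example and do not appear in the slide.

```r
library(tidyr)

# table4a ships with tidyr: one row per country, one column per year.
wide <- tidyr::table4a

# Gather the year columns into key/value pairs so that "year" and "cases"
# each become a single column.
tidy <- tidyr::gather(wide, key = "year", value = "cases", `1999`, `2000`)

print(tidy)
```

After the reshape, every variable (country, year, cases) sits in its own column, which is the form the rest of the tidyverse expects.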

Slide 10

Slide 10 text

What is tidy data? (my view)
• Think in terms of columns
[Diagram: operations on table columns, labelled filter, map, indexing, summarize_all, and aggregation]

Slide 11

Slide 11 text

What is the tidyverse?

Slide 12

Slide 12 text

What is the tidyverse?
• Opinionated collection of R packages designed for data science.
• All packages share an underlying philosophy and common APIs.

Slide 13

Slide 13 text

What is the tidyverse? (my view)
• Think in terms of pipelines
[Diagram: Import, Transform, Explore, Modeling, Share, chained with %>%]

Slide 14

Slide 14 text

tidyverse
• dplyr, tidyr 0.7.0
• purrr 0.2.3
• forcats, stringr
• reprex
• glue (*)
* Not included in tidyverse 1.2.1

Slide 15

Slide 15 text

dplyr, tidyr 0.7.0
• dplyr 0.7.0
  • Colwise functions
  • tidyeval
  • Databases (dbplyr)
  • UTF-8
• tidyr 0.7.0
  • tidyeval
  • tidyselect

Slide 16

Slide 16 text

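Colwise functions
• Generalization of mutate_xxx and summarise_xxx → indexing over columns

    Function name   Target
    xxx_all         All columns
    xxx_at          Columns specified by name or by index
    xxx_if          Columns that satisfy a condition

• Re-introduction to dplyr (Colwise edition) https://speakerdeck.com/yutannihilation/dplyrzai-ru-men-colwisebian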

Slide 17

Slide 17 text

Colwise functions

    df <- data_frame(a1 = 1:5, a2 = 5:1, b1 = 11:15, b2 = 15:11)

    df %>% rename_all(toupper)                               # columns: A1 A2 B1 B2
    df %>% rename_at(c(1, 2), toupper)                       # columns: A1 A2 b1 b2
    df %>% rename_if(summarise_all(., mean) > 10, toupper)   # columns: a1 a2 B1 B2

• Re-introduction to dplyr (Colwise edition) https://speakerdeck.com/yutannihilation/dplyrzai-ru-men-colwisebian
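The same three suffixes also apply to `mutate_` and `summarise_`. The following is a minimal sketch on a small made-up tibble (the column `label` and the variable names are hypothetical, chosen only for illustration), using plain functions so that it does not depend on `funs()`.

```r
library(dplyr)

df <- tibble::tibble(a1 = 1:5, a2 = 5:1, label = c("p", "q", "p", "q", "p"))

# _all: every column (the character column is dropped first so sum() is valid)
totals <- df %>% select(-label) %>% summarise_all(sum)

# _at: columns picked explicitly by name via vars()
rooted <- df %>% mutate_at(vars(a1, a2), sqrt)

# _if: only the columns for which the predicate returns TRUE
means <- df %>% summarise_if(is.numeric, mean)

print(totals)
print(rooted)
print(means)
```

In each case the second argument is the function to apply; only the way the target columns are chosen differs.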

Slide 18

Slide 18 text

tidyeval
• When you want to repeat some piece of processing…
[Diagram: data X, Y, Z with the functions my_func and my_summary applied to X and to Y in %>% pipelines]
• Re-introduction to dplyr (Tidyeval edition) https://speakerdeck.com/yutannihilation/dplyrzai-ru-men-tidyevalbian

Slide 19

Slide 19 text

tidyeval

    df <- data_frame(g1 = c(1, 2, 1, 2, 1), g2 = c(1, 1, 1, 2, 2), aa = 1:5, bb = 5:1)

    # Goal: group by the column specified via `by`
    group_sum_ng <- function(df, by) {
      df %>% group_by(by) %>% summarise_all(sum)
    }
    group_sum_ng(df, g1)
    Error in grouped_df_impl(data, unname(vars), drop) : Column `by` is unknown

• Re-introduction to dplyr (Tidyeval edition) https://speakerdeck.com/yutannihilation/dplyrzai-ru-men-tidyevalbian

Slide 20

Slide 20 text

tidyeval
• NSE

    group_sum_enquo <- function(df, by) {
      qby <- enquo(by)   # capture the unevaluated argument as a quosure
      df %>% group_by(!! qby) %>% summarise_all(sum)
    }
    group_sum_enquo(df, g1)   # OK

• Re-introduction to dplyr (Tidyeval edition) https://speakerdeck.com/yutannihilation/dplyrzai-ru-men-tidyevalbian

Slide 21

Slide 21 text

tidyeval
• SE

    group_sum_sym <- function(df, by) {
      qby <- rlang::sym(by)   # turn the string into a symbol
      df %>% group_by(!! qby) %>% summarise_all(sum)
    }
    group_sum_sym(df, "g1")   # OK

• Re-introduction to dplyr (Tidyeval edition) https://speakerdeck.com/yutannihilation/dplyrzai-ru-men-tidyevalbian

Slide 22

Slide 22 text

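tidyeval
• The backend is provided by the rlang package
• If all you need is to build pipelines:
  • a variable name passed in from outside via NSE: enquo -> !!
  • a variable name passed in from outside via SE: sym -> !!
• Re-introduction to dplyr (Tidyeval edition) https://speakerdeck.com/yutannihilation/dplyrzai-ru-men-tidyevalbian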
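The two rules above cover a single variable. As a hedged extension, the sketch below handles several grouping columns at once by capturing `...` with `rlang::quos()` and splicing with `!!!`. The helper name `group_sum_dots` is hypothetical, and `quos()` and `!!!` are not shown in the slides; the data frame is the one from the earlier tidyeval slide.

```r
library(dplyr)

df <- tibble::tibble(g1 = c(1, 2, 1, 2, 1), g2 = c(1, 1, 1, 2, 2),
                     aa = 1:5, bb = 5:1)

# Hypothetical helper: group by any number of bare column names.
group_sum_dots <- function(df, ...) {
  by <- rlang::quos(...)          # capture each argument as a quosure
  df %>%
    group_by(!!!by) %>%           # splice the quosures into group_by()
    summarise_all(sum) %>%
    ungroup()
}

res <- group_sum_dots(df, g1, g2)
print(res)
```

The pattern mirrors `enquo -> !!`: capture first, then unquote inside the dplyr verb.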

Slide 23

Slide 23 text

Databases (dbplyr)

    library(dplyr)
    con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
    DBI::dbWriteTable(con, "iris", iris)   # write sample data
    df <- dplyr::tbl(con, "iris")          # read the specified table
    head(df)
      Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    1          5.1         3.5          1.4         0.2  setosa
    2          4.9         3.0          1.4         0.2  setosa
    3          4.7         3.2          1.3         0.2  setosa
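A table reference created this way is lazy: dplyr verbs are translated to SQL and run in the database only when results are requested. The sketch below is one way to see that, assuming the DBI, RSQLite, and dbplyr packages are installed; the threshold `6` and the variable names `remote` and `result` are arbitrary choices for illustration.

```r
library(dplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "iris", iris)

# Build a query lazily; nothing is fetched yet.
remote <- dplyr::tbl(con, "iris") %>%
  dplyr::filter(Sepal.Length > 6) %>%
  dplyr::select(Sepal.Length, Species)

dplyr::show_query(remote)          # print the SQL that dbplyr generated

result <- dplyr::collect(remote)   # run the query and pull rows into a tibble
DBI::dbDisconnect(con)

print(result)
```

Calling `collect()` is the point at which data leaves the database, so filters placed before it are evaluated by SQLite, not by R.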

Slide 24

Slide 24 text

purrr 0.2.3 • pluck • map helpers • map functions • modify functions

Slide 25

Slide 25 text

purrr 0.2.3
• pluck
• map functions

    a <- list(a = 1, b = list(x = 1, y = 2), c = 3)

    pluck(a, "b", "x")      # walks a -> b -> x
    [1] 1

    imap(a, ~toupper(.y))   # the key is available as .y
    $a
    [1] "A"

    $b
    [1] "B"

    $c
    [1] "C"
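Slide 24 also lists the modify functions, which the example above does not exercise. The following sketch contrasts `map_dbl()` with `modify()` and shows the `.default` argument of `pluck()`; the list `nums` and the other variable names are made up for this example.

```r
library(purrr)

a <- list(a = 1, b = list(x = 1, y = 2), c = 3)

# pluck() returns .default when the path does not exist.
fallback <- pluck(a, "b", "z", .default = NA)

nums <- list(a = 1, b = 2, c = 3)

# map_dbl() simplifies the result to a named double vector.
as_vector <- map_dbl(nums, ~ .x * 10)

# modify() keeps the type of its input, so a list stays a list.
as_list <- modify(nums, ~ .x * 10)

print(fallback)
print(as_vector)
str(as_list)
```

Choosing between the two is mostly a question of what the next step in the pipeline expects: an atomic vector or the original container.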

Slide 26

Slide 26 text

forcats, stringr

    library(forcats)
    x <- factor(c("a", "b", "a", "c", "d"))
    x %>% forcats::fct_other(keep = c("a", "b"))
    [1] a     b     a     Other Other
    Levels: a b Other

    library(stringr)
    vals <- c("a1", "a2", "b1", "b2")
    stringr::str_which(vals, "b")
    [1] 3 4

Slide 27

Slide 27 text

reprex

    library(reprex)

    reprex(1 + 3)                 # the result is saved to the clipboard
    ``` r
    1 + 3
    #> [1] 4
    ```

    reprex(1 + 3, venue = "so")   # the result is saved to the clipboard
    1 + 3
    #> [1] 4

Slide 28

Slide 28 text

glue
• Formatted string literals

    library(glue)
    num <- 100

    glue("x = {num}")
    x = 100

    stringr::str_interp("x = ${num}")
    x = 100
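Interpolation in glue is vectorised and evaluates arbitrary R expressions inside the braces. The sketch below shows both, using `glue_data()` to look names up in a data frame; the data frame `pkgs` is made up for illustration, with versions taken from the earlier tidyverse slide.

```r
library(glue)

# Hypothetical input table for the example.
pkgs <- data.frame(name = c("dplyr", "purrr"),
                   version = c("0.7.0", "0.2.3"),
                   stringsAsFactors = FALSE)

# glue_data() resolves {name} and {version} against the columns of pkgs,
# producing one string per row.
described <- glue_data(pkgs, "{name} is at version {version}")

# Any R expression can appear inside the braces.
msg <- glue("1 + 3 = {1 + 3}")

print(described)
print(msg)
```

Because the result is one string per row, it can be passed directly to functions such as `message()` or `writeLines()`.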

Slide 29

Slide 29 text

glue
• glue_sql

    con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
    DBI::dbWriteTable(con, "iris", iris)

    var <- "Sepal.Length"
    tbl <- "iris"
    num <- 5

    # Backticks inside the braces quote the value as an identifier;
    # without them, var would be inserted as a string literal.
    q <- glue_sql("SELECT * FROM {`tbl`} WHERE {`tbl`}.{`var`} > {num} ", .con = con)
    q
    SELECT * FROM `iris` WHERE `iris`.`Sepal.Length` > 5

Slide 30

Slide 30 text

glue
• glue_sql

    df <- as_data_frame(DBI::dbGetQuery(con, q))
    df
    # A tibble: 61 x 5
      Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
    1          7.0         3.2          4.7         1.4 versicolor
    2          6.4         3.2          4.5         1.5 versicolor
    3          6.9         3.1          4.9         1.5 versicolor
    4          6.5         2.8          4.6         1.5 versicolor
    5          6.3         3.3          4.7         1.6 versicolor
    # ... with 56 more rows

Slide 31

Slide 31 text

Use case 1
• Turn results fetched over REST into a tibble
• https://api.github.com/orgs/tidyverse/repos
… …

Slide 32

Slide 32 text

Use case 1

    library(httr)
    library(glue)
    library(purrr)
    library(dplyr)
    library(lubridate)

    org <- "tidyverse"
    url <- glue::glue('https://api.github.com/orgs/{org}/repos')   # formatted string literal
    p <- httr::GET(url, query = list(per_page = 100)) %>%
      httr::content("parsed")

    p[[2]]
    $id
    [1] 148017

    $name
    [1] "lubridate"

    $full_name
    [1] "tidyverse/lubridate"
    …

Slide 33

Slide 33 text

Use case 1

    cols <- c("name", "stargazers_count", "created_at", "updated_at")
    dt <- dplyr::vars(dplyr::ends_with("_at"))

    pkgs <- p %>%
      purrr::map(~ .[cols]) %>%                                       # select specific values from each list element
      dplyr::bind_rows() %>%                                          # convert to a tibble
      dplyr::mutate_at(dt, lubridate::ymd_hms) %>%                    # parse the datetime strings (Colwise function)
      dplyr::rename_at(dt, dplyr::funs(sub("_at", "_time", .)))       # rename the columns (Colwise function)

    # A tibble: 28 x 4
           name stargazers_count        created_time        updated_time
    1   ggplot2             2780 2008-05-25 01:21:32 2017-12-01 14:48:13
    2 lubridate              333 2009-03-11 01:18:52 2017-11-26 21:49:53
    3   stringr              226 2009-11-08 22:20:08 2017-11-30 09:00:39
    4     dplyr             2087 2012-10-28 13:39:17 2017-12-01 14:02:30
    # ... with 24 more rows

Slide 34

Slide 34 text

Use case 2
• Build and evaluate models in an exploratory way

Slide 35

Slide 35 text

Supplement: modelr
• Provides sampling, cross-validation, model manipulation, and similar features

    head(trees, n = 3)
      Girth Height Volume
    1   8.3     70   10.3
    2   8.6     65   10.3
    3   8.8     63   10.2

    # The formula comes first; the data frame is passed via `data`.
    m <- glm(Volume ~ Girth + Height, data = trees)

    trees %>%
      modelr::add_predictions(m) %>%   # add predictions as a column
      modelr::add_residuals(m)         # add residuals as a column
      Girth Height Volume     pred      resid
    1   8.3     70   10.3 4.837660 5.46234035
    2   8.6     65   10.3 4.553852 5.74614837
    3   8.8     63   10.2 4.816981 5.38301873

Slide 36

Slide 36 text

Use case 2

    # Build the formula via NSE
    my_model <- function(df, tgt, var) {
      qtgt <- rlang::enexpr(tgt)
      qvar <- rlang::enexpr(var)
      glm(rlang::new_formula(qtgt, qvar), data = df)
    }
    m <- my_model(trees, Volume, Girth + Height)
    m

    Call:  glm(formula = rlang::new_formula(qtgt, qvar), data = df)

    Coefficients:
    (Intercept)        Girth       Height
       -57.9877       4.7082       0.3393

    Degrees of Freedom: 30 Total (i.e. Null);  28 Residual
    Null Deviance:     8106
    Residual Deviance: 421.9    AIC: 176.9
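To see what `rlang::new_formula()` produces on its own, the sketch below builds the same formula from two quoted expressions outside any function and fits it with `lm()`. Using `quote()` and `lm()` here is an assumption made to keep the example short; the slide captures the expressions with `enexpr()` and fits with `glm()`.

```r
# Build a formula object from a left-hand and a right-hand expression.
f <- rlang::new_formula(quote(Volume), quote(Girth + Height))
print(f)

# The result behaves like a formula written by hand.
fit <- lm(f, data = trees)
print(coef(fit))
```

Because the formula is an ordinary object, it can be assembled programmatically and reused across several modelling functions.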

Slide 37

Slide 37 text

Use case 2

    get_besides <- function(df, model, tgt) {
      qtgt <- enquo(tgt)
      df %>%
        modelr::add_predictions(model) %>%              # add predictions as a column
        modelr::add_residuals(model) %>%                # add residuals as a column
        dplyr::filter(abs(resid) > (!! qtgt) * 0.5)     # keep rows that satisfy the condition (tidyeval)
    }
    get_besides(trees, m, Volume)
      Girth Height Volume     pred    resid
    1   8.3     70   10.3 4.837660 5.462340
    2   8.6     65   10.3 4.553852 5.746148
    3   8.8     63   10.2 4.816981 5.383019

Slide 38

Slide 38 text

Refinement of the tidyverse
• Pipeline processing has become easier end to end, from input through model building to sharing results
  • Database (dbplyr), glue
  • Colwise function, forcats, stringr
  • tidyeval
  • reprex

Slide 39

Slide 39 text

Agenda
• Refinement of data-processing packages
• Richer utilities

Slide 40

Slide 40 text

r-lib
• R infrastructure organization
• https://github.com/r-lib
• (Probably) since around mid-2017 (formerly r-pkgs)
• Quite a few of its packages were also started in 2017

Slide 41

Slide 41 text

r-lib
• Utilities (httr, xml2…)
• Package development (testthat, pkgdown, covr, usethis…)
• Internals (R6, memoise…)
• Console (cli, progress, crayon…)

Slide 42

Slide 42 text

Console

    Package    Summary
    crayon     Styling of console output
    cli        String formatting for CLIs
    progress   Progress bar display
    pillar     Column formatting

Slide 43

Slide 43 text

Example

    library(cli)
    library(crayon)
    library(progress)

    rule(center = "Processing started", line_col = "red")    # cli

    cat(red(symbol$tick, "check1 \n"))                       # crayon
    cat(blue(symbol$tick, "check2 \n"))
    cat(green(symbol$tick, "check3 \n"))

    pb <- progress_bar$new(total = 100)                      # progress
    for (i in 1:100) {
      pb$tick()
      Sys.sleep(1 / 50)
    }

    rule(center = "Processing finished", line_col = "red")   # cli

Slide 44

Slide 44 text

Summary
• Refinement of data-processing packages
• Richer utilities

Slide 45

Slide 45 text

Enjoy 2018, too!