Data Analysis Language R: A Year in Review

Sinhrks
December 02, 2017
@ Japan.R 2017

Transcript

  1. Data Analysis Languages (R, Python, Julia): A Year in Review
    Masaaki Horikoshi @ ARISE analytics

  2. Self-introduction
    • R
    • Package development, etc.
    • Git Awards: ranked 1st in Japan
    • Python
    • http://git-awards.com/users/search?login=sinhrks

  3. What kind of year was 2017?

  4. 2017 in review
    • This year's activity (based on pull requests to my own packages)

  5. 2017 in review
    • R 3.4.x released
    • RStudio 1.1 released
    • Ranked 6th in IEEE "The 2017 Top Programming Languages"
    • CRAN passed 10,000 packages
    • Richer documentation packages (blogdown, xaringan)
    • FFI (reticulate)
    • prophet, tensorflow
    • Various books

  6. Contents
    • Refinement of data-processing packages
    • Richer utilities

  7. Contents
    • Refinement of data-processing packages
    • Richer utilities

  8. What is tidy data?

  9. What is tidy data?
    1. Put each dataset in a tibble.
    2. Put each variable in a column.
    #> # A tibble: 6 × 4
    #>       country  year  cases population
    #>         <chr> <int>  <int>      <int>
    #> 1 Afghanistan  1999    745   19987071
    #> 2 Afghanistan  2000   2666   20595360
    #> 3      Brazil  1999  37737  172006362
    #> 4      Brazil  2000  80488  174504898
    #> 5       China  1999 212258 1272915272
    #> 6       China  2000 213766 1280428583
    • R for Data Science http://r4ds.had.co.nz/tidy-data.html
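
As a rough illustration of the "one variable per column" rule, the sketch below reshapes a wide table into the tidy layout shown on the slide. The `wide` data frame is a made-up starting point (it reuses the case counts from the slide), and `gather()` is the 2017-era tidyr API; newer tidyr versions also offer `pivot_longer()` for the same job.

```r
library(tidyr)

# Hypothetical wide table: one column per year. This is not tidy,
# because the variable "year" is stored in the column names.
wide <- data.frame(
  country = c("Afghanistan", "Brazil", "China"),
  `1999` = c(745, 37737, 212258),
  `2000` = c(2666, 80488, 213766),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Gather the two year columns into a (year, cases) pair of columns.
tidy <- gather(wide, key = "year", value = "cases", `1999`, `2000`)
tidy$year <- as.integer(tidy$year)
tidy
```

After the reshape, each row is one country-year observation and each column holds exactly one variable.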

  10. What is tidy data? (personal view)
    • Think column-centric
    [Figure: diagram of column-wise operations; labels: filter, map, indexing, summarize_all, aggregation]

  11. What is the tidyverse?

  12. What is the tidyverse?
    • Opinionated collection of R packages designed for data science.
    • All packages share an underlying philosophy and common APIs.

  13. What is the tidyverse? (personal view)
    • Think in pipelines
    Import %>% Transform %>% Explore %>% Modeling %>% Share
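
To make the pipeline idea concrete, here is a small sketch that chains a few dplyr verbs with `%>%`. It uses the built-in `mtcars` data set, and the threshold and column names are chosen purely for illustration; none of this comes from the slides.

```r
library(dplyr)

# Each step takes the previous result as its first argument.
result <- mtcars %>%
  filter(hp > 100) %>%                      # transform: keep some rows
  group_by(cyl) %>%                         # explore: split by a key
  summarise(mean_mpg = mean(mpg), n = n())  # summarise each group

result
```

Reading top to bottom matches the order in which the steps run, which is the main appeal of thinking in pipelines.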

  14. tidyverse
    • dplyr, tidyr 0.7.0
    • purrr 0.2.3
    • forcats, stringr
    • reprex
    • glue (*)
    * Not included in tidyverse 1.2.1

  15. dplyr, tidyr 0.7.0
    • dplyr 0.7.0
      • Colwise functions
      • tidyeval
      • Databases (dbplyr)
      • UTF-8
    • tidyr 0.7.0
      • tidyeval
      • tidyselect

  16. Colwise functions
    • Generalization of mutate_xxx, summarise_xxx
      → indexing over columns
    Function name   Target columns
    xxx_all         All columns
    xxx_at          Columns specified by name or index
    xxx_if          Columns that satisfy a condition
    • Re-introduction to dplyr (Colwise edition) https://speakerdeck.com/yutannihilation/dplyrzai-ru-men-colwisebian
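
The three suffixes in the table can be tried side by side. The sketch below is only an illustration: the data frame `df` and its columns are invented, and the `~` lambda form assumes a dplyr version that accepts formulas in the scoped verbs.

```r
library(dplyr)

df <- tibble(a1 = 1:3, a2 = 4:6, label = c("x", "y", "z"))

# *_all: apply to every column (here, convert everything to character)
all_chr <- df %>% mutate_all(as.character)

# *_at: apply to columns chosen by name (or position)
at_dbl <- df %>% mutate_at(vars(a1, a2), ~ . * 10)

# *_if: apply to columns that satisfy a predicate
if_mean <- df %>% summarise_if(is.numeric, mean)
```

The same suffixes exist for other verbs such as `rename` and `select`, so the selection rule only has to be learned once.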

  17. Colwise functions
    library(dplyr)
    df <- data_frame(a1 = 1:5,
                     a2 = 5:1,
                     b1 = 11:15,
                     b2 = 15:11)
    # rename_all: every column              -> A1 A2 B1 B2
    df %>% rename_all(toupper)
    # rename_at: columns 1 and 2 (by index) -> A1 A2 b1 b2
    df %>% rename_at(c(1, 2), toupper)
    # rename_if: columns whose mean is > 10 -> a1 a2 B1 B2
    df %>% rename_if(summarise_all(., mean) > 10, toupper)
    • Re-introduction to dplyr (Colwise edition) https://speakerdeck.com/yutannihilation/dplyrzai-ru-men-colwisebian

  18. tidyeval
    • When you want to repeat some piece of processing…
    [Figure: a data frame with columns X, Y, Z piped (%>%) through my_func and my_summary, each applied to X and then to Y]
    • Re-introduction to dplyr (Tidyeval edition) https://speakerdeck.com/yutannihilation/dplyrzai-ru-men-tidyevalbian

  19. tidyeval
    df <- data_frame(g1 = c(1, 2, 1, 2, 1),
                     g2 = c(1, 1, 1, 2, 2),
                     aa = 1:5,
                     bb = 5:1)
    # Goal: group by the column given via `by`
    # NG: group_by(by) looks for a column literally named `by`
    group_sum_ng <- function(df, by) {
      df %>%
        group_by(by) %>%
        summarise_all(sum)
    }
    group_sum_ng(df, g1)
    Error in grouped_df_impl(data, unname(vars), drop) :
      Column `by` is unknown
    • Re-introduction to dplyr (Tidyeval edition) https://speakerdeck.com/yutannihilation/dplyrzai-ru-men-tidyevalbian

  20. tidyeval
    • NSE
    group_sum_enquo <- function(df, by) {
      qby <- enquo(by)         # capture the unquoted column name
      df %>%
        group_by(!! qby) %>%   # unquote it inside group_by
        summarise_all(sum)
    }
    group_sum_enquo(df, g1)
    # OK
    • Re-introduction to dplyr (Tidyeval edition) https://speakerdeck.com/yutannihilation/dplyrzai-ru-men-tidyevalbian

  21. tidyeval
    • SE
    group_sum_sym <- function(df, by) {
      qby <- rlang::sym(by)    # turn the string into a symbol
      df %>%
        group_by(!! qby) %>%   # unquote it inside group_by
        summarise_all(sum)
    }
    group_sum_sym(df, "g1")
    # OK
    • Re-introduction to dplyr (Tidyeval edition) https://speakerdeck.com/yutannihilation/dplyrzai-ru-men-tidyevalbian

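  22. tidyeval
    • The backend is provided by the rlang package
    • If all you want is to build pipelines:
    • Variable names passed in from outside via NSE: enquo -> !!
    • Variable names passed in from outside via SE: sym -> !!
    • Re-introduction to dplyr (Tidyeval edition) https://speakerdeck.com/yutannihilation/dplyrzai-ru-men-tidyevalbian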
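
The `enquo -> !!` recipe extends naturally to naming the output column. The sketch below is a hypothetical helper, `group_mean()`, that is not in the slides; it assumes an rlang version that provides `as_label()` and uses `:=` so that an unquoted name can appear on the left-hand side.

```r
library(dplyr)

# Hypothetical helper: mean of one column per group, with the
# output column named after the input column.
group_mean <- function(df, by, col) {
  qby  <- rlang::enquo(by)
  qcol <- rlang::enquo(col)
  out_name <- paste0("mean_", rlang::as_label(qcol))
  df %>%
    group_by(!!qby) %>%
    summarise(!!out_name := mean(!!qcol))
}

df <- tibble(g1 = c(1, 2, 1, 2, 1), aa = 1:5)
res <- group_mean(df, g1, aa)
res
```

Here `res` has the columns `g1` and `mean_aa`, so the caller never has to spell the output name by hand.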

  23. Databases (dbplyr)
    library(dplyr)
    con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
    # Write sample data
    DBI::dbWriteTable(con, "iris", iris)
    # Read the specified table
    df <- dplyr::tbl(con, "iris")
    head(df)
    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    1 7.0 3.2 4.7 1.4 versicolor
    2 6.4 3.2 4.5 1.5 versicolor
    3 6.9 3.1 4.9 1.5 versicolor
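
A remote table built with `tbl()` is lazy: dplyr verbs are translated to SQL and only run when the result is requested. The sketch below shows that flow under the assumption that the DBI, RSQLite and dbplyr packages are installed; the filter condition is picked only for illustration.

```r
library(dplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "iris", iris)

# Build a lazy query; nothing is executed in the database yet.
lazy <- tbl(con, "iris") %>%
  filter(Species == "versicolor") %>%
  summarise(n = n())

show_query(lazy)       # print the SQL that dbplyr generates
res <- collect(lazy)   # run the query and fetch the result as a tibble

DBI::dbDisconnect(con)
```

Calling `collect()` is the point where the data actually leaves the database, so heavy filtering and aggregation can stay on the database side.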

  24. purrr 0.2.3
    • pluck
    • map helpers
    • map functions
    • modify functions

  25. purrr 0.2.3
    • pluck
    • map functions
    a <- list(a = 1, b = list(x = 1, y = 2), c = 3)
    pluck(a, "b", "x")      # same as a$b$x
    [1] 1
    imap(a, ~toupper(.y))   # the key is available as .y
    $a
    [1] "A"
    $b
    [1] "B"
    $c
    [1] "C"
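
The typed map variants and the modify family listed on the previous slide can be sketched in a few lines. The lists `scores` and `mixed` below are invented for the example.

```r
library(purrr)

scores <- list(alice = c(90, 80), bob = c(70, 60, 50))

# map_dbl: apply a function to each element, return a double vector
means <- map_dbl(scores, mean)

# imap_chr: .x is the value, .y is the name (key)
labels <- imap_chr(scores, ~ paste0(.y, ": ", length(.x), " scores"))

# modify_if: transform only elements matching a predicate,
# keeping the input type (a list stays a list)
mixed <- list(1, "a", 3)
doubled <- modify_if(mixed, is.numeric, ~ .x * 2)
```

The typed variants fail loudly when a result does not fit the declared type, which tends to surface mistakes earlier than a plain `map()` followed by `unlist()`.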

  26. forcats, stringr
    library(forcats)
    x <- factor(c("a", "b", "a", "c", "d"))
    x %>%
    forcats::fct_other(keep = c("a", "b"))
    [1] a b a Other Other
    Levels: a b Other
    library(stringr)
    vals <- c("a1", "a2", "b1", "b2")
    stringr::str_which(vals, "b")
    [1] 3 4

  27. reprex
    library(reprex)
    reprex(1 + 3)
    # Saved to the clipboard:
    ``` r
    1 + 3
    #> [1] 4
    ```
    reprex(1 + 3, venue = "so")
    # Saved to the clipboard:
        1 + 3
        #> [1] 4

  28. glue
    • Formatted string literals
    library(glue)
    num <- 100
    glue("x = {num}")
    x = 100
    stringr::str_interp("x = ${num}")
    x = 100

  29. glue
    • glue_sql
    con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
    DBI::dbWriteTable(con, "iris", iris)
    var <- "Sepal.Length"
    tbl <- "iris"
    num <- 5
    q <- glue_sql("SELECT *
    FROM {`tbl`}
    WHERE {`tbl`}.{var} > {num}
    ", .con = con)
    q
    SELECT *
    FROM `iris`
    WHERE `iris`.'Sepal.Length' > 5

  30. glue
    • glue_sql
    df <- as_data_frame(DBI::dbGetQuery(con, q))
    df
    # A tibble: 61 x 5
    Sepal.Length Sepal.Width Petal.Length Petal.Width Species

    1 7.0 3.2 4.7 1.4 versicolor
    2 6.4 3.2 4.5 1.5 versicolor
    3 6.9 3.1 4.9 1.5 versicolor
    4 6.5 2.8 4.6 1.5 versicolor
    5 6.3 3.3 4.7 1.6 versicolor
    # ... with 56 more rows
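
One detail of `glue_sql()` worth spelling out is the difference between quoting an identifier and quoting a value. The sketch below is a variation on the slide's query, not the author's code: it wraps the column name in backticks as well, so that it is quoted as an identifier rather than as a string. It assumes DBI, RSQLite and glue are installed.

```r
library(glue)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "iris", iris)

tbl <- "iris"
var <- "Sepal.Length"
num <- 5

# Backticks inside {} quote the value as an SQL identifier;
# plain {} interpolates it as an escaped SQL value.
q <- glue_sql("SELECT * FROM {`tbl`} WHERE {`var`} > {num}", .con = con)
res <- DBI::dbGetQuery(con, q)

DBI::dbDisconnect(con)
```

Because the values go through the connection's quoting rules, this is generally safer than pasting strings into SQL by hand.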

  31. Usage example 1
    • Want to turn results fetched via REST into a tibble
    • https://api.github.com/orgs/tidyverse/repos
    … …

  32. Usage example 1
    library(httr)
    library(glue)
    library(purrr)
    library(dplyr)
    library(lubridate)
    org <- "tidyverse"
    # Formatted string literal
    url <- glue::glue('https://api.github.com/orgs/{org}/repos')
    p <- httr::GET(url, query = list(per_page = 100)) %>%
      httr::content("parsed")
    p[[2]]
    $id
    [1] 148017
    $name
    [1] "lubridate"
    $full_name
    [1] "tidyverse/lubridate"

  33. Usage example 1
    cols <- c("name", "stargazers_count", "created_at", "updated_at")
    dt <- dplyr::vars(dplyr::ends_with("_at"))
    # mutate_at and rename_at are Colwise functions
    pkgs <- p %>%
      purrr::map(~ .[cols]) %>%                                   # select specific values from each list
      dplyr::bind_rows() %>%                                      # convert to a tibble
      dplyr::mutate_at(dt, lubridate::ymd_hms) %>%                # parse the datetime strings
      dplyr::rename_at(dt, dplyr::funs(sub("_at", "_time", .)))   # rename the columns
    # A tibble: 28 x 4
    name stargazers_count created_time updated_time

    1 ggplot2 2780 2008-05-25 01:21:32 2017-12-01 14:48:13
    2 lubridate 333 2009-03-11 01:18:52 2017-11-26 21:49:53
    3 stringr 226 2009-11-08 22:20:08 2017-11-30 09:00:39
    4 dplyr 2087 2012-10-28 13:39:17 2017-12-01 14:02:30
    # ... with 24 more rows
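
The list-to-tibble step can be tried without network access. In the sketch below, `parsed` is a hypothetical stand-in for a parsed JSON response; the records and their values are made up and only mimic the shape of such a response. It assumes purrr, dplyr and lubridate are installed.

```r
library(purrr)
library(dplyr)

# Hypothetical stand-in for a parsed JSON response: a list of
# records, each with more fields than we need (values are made up).
parsed <- list(
  list(id = 1L, name = "pkgA", stargazers_count = 10L,
       created_at = "2016-01-02T03:04:05Z", private = FALSE),
  list(id = 2L, name = "pkgB", stargazers_count = 25L,
       created_at = "2017-06-07T08:09:10Z", private = FALSE)
)

cols <- c("name", "stargazers_count", "created_at")

repos <- parsed %>%
  map(~ .x[cols]) %>%      # keep only the fields of interest
  bind_rows() %>%          # one record per row
  mutate_at(vars(ends_with("_at")), lubridate::ymd_hms)
```

Selecting the fields before `bind_rows()` keeps nested or irregular fields from leaking into the result.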

  34. Usage example 2
    • Want to build and evaluate models in an exploratory way

  35. Supplement: modelr
    • Provides sampling, CV, model manipulation, and similar features
    head(trees, n = 3)
    Girth Height Volume
    1 8.3 70 10.3
    2 8.6 65 10.3
    3 8.8 63 10.2
    m <- glm(Volume ~ Girth + Height, data = trees)
    trees %>%
      modelr::add_predictions(m) %>%   # add predictions as a column
      modelr::add_residuals(m)         # add residuals as a column
    Girth Height Volume pred resid
    1 8.3 70 10.3 4.837660 5.46234035
    2 8.6 65 10.3 4.553852 5.74614837
    3 8.8 63 10.2 4.816981 5.38301873
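
The CV feature mentioned on the slide can be sketched with `crossv_kfold()`. The number of folds, the seed and the use of `lm()` below are choices made for this illustration, not something taken from the slides.

```r
library(modelr)
library(purrr)

set.seed(1)  # the fold assignment is random

# Split `trees` into 5 folds; each row holds a train/test pair.
cv <- crossv_kfold(trees, k = 5)

# Fit one model per training fold, then score it on the held-out fold.
models <- map(cv$train, ~ lm(Volume ~ Girth + Height, data = .x))
errors <- map2_dbl(models, cv$test, rmse)

mean(errors)  # average out-of-fold RMSE
```

Because the folds live in list columns, the fit and score steps are ordinary `map()` calls.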

  36. Usage example 2
    # Build the formula via NSE
    my_model <- function(df, tgt, var) {
      qtgt <- rlang::enexpr(tgt)
      qvar <- rlang::enexpr(var)
      glm(rlang::new_formula(qtgt, qvar), data = df)
    }
    m <- my_model(trees, Volume, Girth + Height)
    m
    Call: glm(formula = rlang::new_formula(qtgt, qvar), data = df)
    Coefficients:
    (Intercept) Girth Height
    -57.9877 4.7082 0.3393
    Degrees of Freedom: 30 Total (i.e. Null); 28 Residual
    Null Deviance: 8106
    Residual Deviance: 421.9 AIC: 176.9
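
For comparison, a standard-evaluation counterpart that takes column names as strings can be written with base R's `reformulate()`. The helper name `my_model_se()` is hypothetical and is not part of the slides.

```r
# Hypothetical SE counterpart: strings in, fitted model out.
my_model_se <- function(df, tgt, vars) {
  f <- stats::reformulate(vars, response = tgt)  # e.g. Volume ~ Girth + Height
  stats::glm(f, data = df)
}

m_se <- my_model_se(trees, "Volume", c("Girth", "Height"))
coef(m_se)
```

This form is convenient when the variable names come from configuration or a loop rather than being typed by hand.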

  37. Usage example 2
    get_besides <- function(df, model, tgt) {
      qtgt <- enquo(tgt)                               # tidyeval
      df %>%
        modelr::add_predictions(model) %>%             # add predictions as a column
        modelr::add_residuals(model) %>%               # add residuals as a column
        dplyr::filter(abs(resid) > (!! qtgt) * 0.5)    # keep rows that satisfy the condition
    }
    get_besides(trees, m, Volume)
    Girth Height Volume pred resid
    1 8.3 70 10.3 4.837660 5.462340
    2 8.6 65 10.3 4.553852 5.746148
    3 8.8 63 10.2 4.816981 5.383019

  38. Refinement of the tidyverse
    • From input through model building to sharing results, pipeline processing has become easier
    • Database (dbplyr), glue
    • Colwise function, forcats, stringr
    • tidyeval
    • reprex

  39. Contents
    • Refinement of data-processing packages
    • Richer utilities

  40. r-lib
    • R infrastructure organization
    • https://github.com/r-lib
    • (Probably) since around mid-2017 (fka r-pkgs)
    • Quite a few packages were also started in 2017

  41. r-lib
    • Utilities (httr, xml2…)
    • Package development (testthat, pkgdown, covr, usethis…)
    • Internals (R6, memoise…)
    • Console (cli, progress, crayon…)

  42. Console
    Package    Overview
    crayon     Styling for console output
    cli        String formatting for CLIs
    progress   Progress bar display
    pillar     Column formatting

  43. Usage example
    library(cli)
    library(crayon)
    library(progress)
    rule(center = "Processing started", line_col = "red")    # cli
    cat(red(symbol$tick, "check1 \n"))                       # crayon
    cat(blue(symbol$tick, "check2 \n"))
    cat(green(symbol$tick, "check3 \n"))
    pb <- progress_bar$new(total = 100)                      # progress
    for (i in 1:100) {
      pb$tick()
      Sys.sleep(1 / 50)
    }
    rule(center = "Processing finished", line_col = "red")   # cli
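
Two details that the example above leaves implicit are that crayon styles compose like ordinary functions and that a progress bar accepts a format string. The sketch below illustrates both; the message text, the format string and the loop length are arbitrary choices, and the bar may not render in a non-interactive session.

```r
library(crayon)
library(progress)

# Styles are functions that return strings, so they can be nested.
msg <- bold(green("done"))
plain <- strip_style(msg)   # remove the ANSI styling again

# :bar, :percent and :eta are tokens filled in by the progress package.
pb <- progress_bar$new(
  format = "  processing [:bar] :percent eta: :eta",
  total = 20
)
for (i in 1:20) {
  pb$tick()
  Sys.sleep(0.01)
}
cat(msg, "\n")
```

Keeping styled and plain versions apart is handy when the same message also has to go to a log file.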

  44. Summary
    • Refinement of data-processing packages
    • Richer utilities

  45. Enjoy 2018, too!