Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Tokyo.R#75 BeginneRSession-data pipeline

kilometer
January 19, 2019

Tokyo.R#75 BeginneRSession-data pipeline

第75回 Tokyo.Rの初心者セッション用のスライドです。

kilometer

January 19, 2019
Tweet

More Decks by kilometer

Other Decks in Science

Transcript

  1. BeginneR Advanced Hoxo_m If I have seen further it is

    by standing on the shoulders of Giants. -- Sir Isaac Newton, 1676
  2. The tidyverse style guide https://style.tidyverse.org/ "Good coding style is like

    correct punctuation: you can manage without it, butitsuremakesthingseasiertoread." Google's R Style Guide https://style.tidyverse.org/ "The goal of the R Programming Style Guide is to make our R code easier to read, share, and verify." R coding style guides
  3. > data function(..., list = character(), package = NULL, lib.loc

    = NULL, verbose = getOption("verbose"), envir = .GlobalEnv) { fileExt <- function(x) { db <- grepl("\\.[^.]+\\.(gz|bz2|xz)$", x) ans <- sub(".*\\.", "", x) ... "Where possible, avoid re-using names of common functions and variables. This will cause confusion for the readers of your code." # Good df <- read.csv("hoge.csv") dat <- read.csv("hoge.csv") # Bad data <- read.csv("hoge.csv")
  4. # Bad for(i in 1:10){ print(i) } # Good for(i

    in 1:10){ print(i) } copy (cut) & paste Auto-indentation (in RStudio) Details: RStudio > Preference > Code > Editing
  5. The tidyverse style guide https://style.tidyverse.org/ "Good coding style is like

    correct punctuation: you can manage without it, butitsuremakesthingseasiertoread." Google's R Style Guide https://style.tidyverse.org/ "The goal of the R Programming Style Guide is to make our R code easier to read, share, and verify." R coding style guides
  6. ブール演算⼦ Boolean Algebra A == B A != B George

    Boole 1815 - 1864 A | B A & B A %in% B # equal to # not equal to # or # and # is A in B? wikipedia
  7. "a" != "b" # is A in B? ブール演算⼦ Boolean

    Algebra [1] TRUE 1 %in% 10:100 # is A in B? [1] FALSE
  8. George Boole 1815 - 1864 A Class-Room Introduction to Logic

    https://niyamaklogic.wordpress.co m/category/laws-of-thoughts/ Mathematician Philosopher &
  9. ブール演算⼦ Boolean Algebra A == B A != B George

    Boole 1815 - 1864 A | B A & B A %in% B # equal to # not equal to # or # and # is A in B? wikipedia
  10. vector in R in Excel pre <- c(1, 2, 3,

    4, 5) post <- pre * 5 > pre [1] 1 2 3 4 5 > post [1] 5 10 15 20 25
  11. vector vec1 <- c(1, 2, 3, 4, 5) vec2 <-

    1:5 vec3 <- seq(from = 1, to = 5, by = 1) > vec1 [1] 1 2 3 4 5 > vec2 [1] 1 2 3 4 5 > vec3 [1] 1 2 3 4 5
  12. vector vec1 <- seq(from = 1, to = 5, by

    = 1) vec2 <- seq(1, 5, 1) > vec1 [1] 1 2 3 4 5 > vec2 [1] 1 2 3 4 5
  13. > ?seq vector seq{base} Sequence Generation Description Generate regular sequences.

    seq is a standard generic with a default method. … Usage seq(...) ## Default S3 method: seq(from = 1, to = 1, by = ((to - from)/(length.out - 1)), length.out = NULL, along.with = NULL, ...)
  14. vector vec1 <- rep(1:3, times = 2) vec2 <- rep(1:3,

    each = 2) vec3 <- rep(1:3, times = 2, each = 2) > vec1 [1] 1 2 3 1 2 3 > vec2 [1] 1 1 2 2 3 3 > vec3 [1] 1 1 2 2 3 3 1 1 2 2 3 3
  15. vector vec1 <- 11:15 > vec1 [1] 11 12 13

    14 15 > vec1[1] [1] 11 > vec1[3:5] [1] 13 14 15 > vec1[c(1:2, 5)] [1] 11 12 15
  16. list list1 <- list(1:6, 11:15, c("a", "b", "c")) > list1

    [[1]] [1] 1 2 3 4 5 6 [[2]] [1] 11 12 13 14 15 [[3]] [1] "a" "b" "c"
  17. list list1 <- list(1:6, 11:15, c("a", "b", "c")) > list1[[1]]

    [1] 1 2 3 4 5 6 > list1[[3]][2:3] [1] "b" "c" > list1[[2]] * 3 [1] 33 36 39 42 45
  18. named list list2 <- list(A = 1:6, B = 11:15,

    C = c("a", "b", "c")) > list2 $A [1] 1 2 3 4 5 6 $B [1] 11 12 13 14 15 $C [1] "a" "b" "c"
  19. > list2$A [1] 1 2 3 4 5 6 >

    list2$C[2:3] [1] "b" "c" > list2$B * 3 [1] 33 36 39 42 45 list2 <- list(A = 1:6, B = 11:15, C = c("a", "b", "c")) named list
  20. list1 <- list(1:6, 11:15, c("a", "b", "c")) > class(list1) [1]

    "list" > names(list1) NULL list2 <- list(A = 1:6, B = 11:15, C = c("a", "b", "c")) > class(list2) [1] "list" > names(list2) [1] "A" "B" "C" named list list
  21. list3 <- list(A = 1:3, B = 11:13) > class(list3)

    [1] "list" > names(list3) [1] "A" "B" df1 <- data.frame(A = 1:3, B = 11:13) > class(df1) [1] "data.frame" > names(df1) [1] "A" "B" named list & data.frame
  22. > str(list3) List of 2 $ A: int [1:3] 1

    2 3 $ B: int [1:3] 11 12 13 > str(df1) 'data.frame': 3 obs. of 2 variables: $ A: int 1 2 3 $ B: int 11 12 13 list3 <- list(A = 1:3, B = 11:13) df1 <- data.frame(A = 1:3, B = 11:13) named list & data.frame
  23. > list3 $A [1] 1 2 3 $B [1] 11

    12 13 > df1 A B 1 1 11 2 2 12 3 3 13 named list & data.frame
  24. > list3 $A [1] 1 2 3 $B [1] 11

    12 13 > df1 A B 1 1 11 2 2 12 3 3 13 named list & data.frame observation variable
  25. data.frame v.s. matrix A B 1 1 11 2 2

    12 3 3 13 [,1] [,2] [1,] 1 11 [2,] 2 12 [3,] 3 13 df1 <- data.frame(A = 1:3, B = 11:13) > str(mat1) int [1:3, 1:2] 1 2 3 11 12 13 > str(df1) 'data.frame': 3 obs. of 2 vars.: $ A: int 1 2 3 $ B: int 11 12 13 mat1 <- matrix(c(1:3, 11:13), 3, 2)
  26. In tidy data: 1. Each variable forms a column. 2.

    Each observation forms a row. 3. Each value must have its own cell.
  27. In tidy data: 1. Each variable forms a column. 2.

    Each observation forms a row. 3. Each value must have its own cell. > df1 A B 1 1 11 2 2 12 3 3 13 observation variable df1 <- data.frame(A = 1:3, B = 11:13)
  28. In tidy data: 1. Each variable forms a column. 2.

    Each observation forms a row. 3. Each value must have its own cell.
  29. In tidy data: 1. Each variable forms a column. 2.

    Each observation forms a row. 3. Each value must have its own cell.
  30. In tidy data: 1. Each variable forms a column. 2.

    Each observation forms a row. 3. Each value must have its own cell. Different observation data Value Label
  31. In tidy data: 1. Each variable forms a column. 2.

    Each observation forms a row. 3. Each value must have its own cell. data tidying
  32. "Horizontal" data "Nested" data "Vertical" data nest unnest gather spread

    input output visualization Data style manipulation in {tidyr} Loops, Summarization, Feature extractions et al., ...
  33. 1JQF X %>% f X %>% f(y) X %>% f

    %>% g X %>% f(y, .) f(X) f(X, y) g(f(X)) f(y, X) %>% {magrittr} 「dplyr再⼊⾨(基本編)」yutanihilation https://speakerdeck.com/yutannihilation/dplyrzai-ru-men-ji-ben-bian
  34. ① lift Bring milk from the kitchen! lift(Robot, glass, table)

    -> Robot' take ② take(Robot', fridge, milk) -> Robot''
  35. Bring milk from the kitchen! Robot' <- lift(Robot, glass, table)

    Robot'' <- take(Robot', fridge, milk) Robot''' <- pour(Robot'', milk, glass) result <- put(Robot''', glass, table) result <- Robot %>% lift(glass, table) %>% take(fridge, milk) %>% pour(milk, glass) %>% put(glass, table) by using pipe, # ① # ② # ③ # ④ # ① # ② # ③ # ④
  36. Bring milk from the kitchen! Robot' <- lift(Robot, glass, table)

    Robot'' <- take(Robot', fridge, milk) Robot''' <- pour(Robot'', milk, glass) result <- put(Robot''', glass, table) result <- Robot %>% lift(glass, table) %>% take(fridge, milk) %>% pour(milk, glass) %>% put(glass, table) by using pipe, # ① # ② # ③ # ④ # ① # ② # ③ # ④
  37. Robot' <- lift(Robot, glass, table) Robot'' <- take(Robot', fridge, milk)

    Robot''' <- pour(Robot'', milk, glass) result <- put(Robot''', glass, table) result <- Robot %>% lift(glass, table) %>% take(fridge, milk) %>% pour(milk, glass) %>% put(glass, table) by using pipe, # ① # ② # ③ # ④ # ① # ② # ③ # ④ Thinking Reading Bring milk from the kitchen!
  38. ① ② ③ ④ lift take pour put Bring milk

    from the kitchen! result <- Robot %>% lift(glass, table) %>% take(fridge, milk) %>% pour(milk, glass) %>% put(glass, table)
  39. please_bring <- function(someone, milk, glass, table = dining_table, fridge= kitchen_fridge){

    someone %>% lift(glass, table) %>% take(fridge, milk) %>% pour(milk, glass) %>% put(glass, table) } RobotA %>% please_bring(milk, my_glass) Define an original function Usage RobotB %>% please_bring(cold_tea, her_glass)
  40. • nouns for variables • verbs for functions General naming

    guidance to naming things https://www.grinchcentral.com/function-names-to-verb-or-not-to-verb
  41. • nouns for variables • verbs for functions General naming

    guidance to naming things • variables are nouns • functions are verbs Conversely,
  42. filter(df, x == "a", y == 1) df[df$x == "a"

    & df$y == 1, ] # verb (動詞的) # noun (名詞的) {dplyr} df %>% filter(x == "a", y == 1) # verb with pipe
  43. mutate select filter arrange summaries # add column # select

    column # select row # arrange row # summary of vars {dplyr} WFSCT WFSCGVODUJPOT
  44. It (dplyr) provides simple “verbs” to help you translate your

    thoughts into code. functions that correspond to the most common data manipulation tasks Introduction to dplyr https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html WFSCT {dplyr}
  45. library(dplyr) iris %>% mutate(a = 1:nrow(.)) %>% str 'data.frame': 150

    obs. of 6 variables: $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 ... $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 ... $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7... $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 ... $ Species : Factor w/ 3 levels "setosa", ... $ a : int 1 2 3 4 5 6 7 8 9 10 ... WFSCT {dplyr}
  46. library(dplyr) iris %>% mutate(a = 1:nrow(.), a = a *

    5/3 %>% round) 'data.frame': 150 obs. of 6 variables: $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 ... $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 ... $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7... $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 ... $ Species : Factor w/ 3 levels “setosa”, ... $ a : num 1.67 3.33 5 6.67 8.33 ... ... WFSCT {dplyr} over write
  47. library(dplyr) iris %>% select(Sepal.Length, Sepal.Width) 'data.frame': 150 obs. of 6

    variables: $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 ... $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 ... WFSCT {dplyr}
  48. library(dplyr) iris %>% select(contains(“Width”)) 'data.frame': 150 obs. of 6 variables:

    $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 ... $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 ... WFSCT {dplyr} Select help functions
  49. WFSCT {dplyr} # Select help functions starts_with("s") ends_with("s") contains("se") matches("^.e")

    one_of(c("Sepal.Length", "Species")) everything() https://kazutan.github.io/blog/2017/04/dplyr-select-memo/ 「dplyr::selectの活⽤例メモ」kazutan
  50. mutate select filter arrange summaries # カラムの追加 # カラムの選択 #

    ⾏の絞り込み # ⾏の並び替え # 値の集約 {dplyr} WFSCT WFSCؔ਺܈
  51. library(dplyr) iris %>% filter(Species == "versicolor") WFSCT {dplyr} 'data.frame': 50

    obs. of 5 variables: $ Sepal.Length: num 7 6.4 6.9 5.5 6.5 5.7 6.3 ... $ Sepal.Width : num 3.2 3.2 3.1 2.3 2.8 2.8 ... $ Petal.Length: num 4.7 4.5 4.9 4 4.6 4.5 4.7 ... $ Petal.Width : num 1.4 1.5 1.5 1.3 1.5 1.3 ... $ Species : Factor w/ 3 levels "setosa","versicolor",..: 2 2 2 2 2 2 2 2 2 2 ...
  52. library(dplyr) iris %>% filter(Species == "versicolor") WFSCT {dplyr} NSE (Non-Standard

    Evaluation) 'data.frame': 50 obs. of 5 variables: $ Sepal.Length: num 7 6.4 6.9 5.5 6.5 5.7 6.3 ... $ Sepal.Width : num 3.2 3.2 3.1 2.3 2.8 2.8 ... $ Petal.Length: num 4.7 4.5 4.9 4 4.6 4.5 4.7 ... $ Petal.Width : num 1.4 1.5 1.5 1.3 1.5 1.3 ... $ Species : Factor w/ 3 levels "setosa","versicolor",..: 2 2 2 2 2 2 2 2 2 2 ...
  53. filter(df, x == "a", y == 1) /4&ͷ࿩ NSE (Non-Standard

    Evaluation) df[df$x == "a" & df$y == 1, ] SE (Standard Evaluation) http://dplyr.tidyverse.org/articles/programming.html Programming with dplyr
  54. filter(df, x == "a", y == 1) /4&ͷ࿩ NSEを使うと、 ‧dfの名前を何回も書かなくていいよ

    ‧SQLっぽく書けるよ http://dplyr.tidyverse.org/articles/programming.html Programming with dplyr df[df$x == "a" & df$y == 1, ]
  55. filter(df, x == "a", y == 1) /4&ͷ࿩ NSEを使うと、 df[df$x

    == "a" & df$y == 1, ] http://dplyr.tidyverse.org/articles/programming.html Programming with dplyr ⾊々あるけどスッキリしているのは正義 (私⾒) 書きやすく、読みやすく。 思考と実装の距離を近く。 # verb (動詞的) # noun (名詞的)
  56. df <- data.frame(x = 1:3, y = 1:3) filter(df, x

    == 1) /4&ͷ࿩ Because of NSE.. http://dplyr.tidyverse.org/articles/programming.html Programming with dplyr my_var <- "x" filter(df, my_var == 1) This do NOT work There is No “my_var” column in df
  57. /4&ͷ࿩ my_var <- quo(x) filter(df, (!! my_var) == 1) ど〜〜〜してもやりたければ、

    何故こうなるかは、 「dplyr再⼊⾨(Tidyval編)」を参照。 https://speakerdeck.com/yutannihilation/dplyrzai-ru-men-tidyevalbian 「dplyr再⼊⾨(Tidyval編)」yutanihilation
  58. /4&ͷ࿩ my_var <- quo(x) filter(df, (!! my_var) == 1) ど〜〜〜してもやりたければ、

    何故こうなるかは、 「dplyr再⼊⾨(Tidyval編)」を参照。 https://speakerdeck.com/yutannihilation/dplyrzai-ru-men-tidyevalbian 可読性が上がる?下がる? それは、あなたと読み⼿次第。 「dplyr再⼊⾨(Tidyval編)」yutanihilation
  59. mutate select filter arrange summaries # カラムの追加 # カラムの選択 #

    ⾏の絞り込み # ⾏の並び替え # 値の集約 {dplyr} WFSCT WFSCؔ਺܈
  60. (SBNNBSPGEBUBNBOJQVMBUJPO By constraining your options, it helps you think about

    your data manipulation challenges. Introduction to dplyr https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html
  61. より多くの制約を課す事で、 魂の⾜枷から、より⾃由になる。 Igor Stravinsky И ́ горь Ф Страви́нский The

    more constraints one imposes, the more one frees one's self of the chains that shackle the spirit. 1882 - 1971 ※ 割と意訳
  62. > list3 $A [1] 1 2 3 $B [1] 11

    12 13 > df1 A B 1 1 11 2 2 12 3 3 13 observation variable
  63. In tidy data: 1. Each variable forms a column. 2.

    Each observation forms a row. 3. Each value must have its own cell. data tidying
  64. "Horizontal" data "Nested" data "Vertical" data nest unnest gather spread

    input output visualization Data style manipulation in {tidyr} Loops, Summarization, Feature extractions et al., ...
  65. 1JQF X %>% f X %>% f(y) X %>% f

    %>% g X %>% f(y, .) f(X) f(X, y) g(f(X)) f(y, X) %>% {magrittr} 「dplyr再⼊⾨(基本編)」yutanihilation https://speakerdeck.com/yutannihilation/dplyrzai-ru-men-ji-ben-bian
  66. ① ② ③ ④ lift take pour put Functions are

    verbs result <- Robot %>% lift(glass, table) %>% take(fridge, milk) %>% pour(milk, glass) %>% put(glass, table)
  67. filter(df, x == "a", y == 1) df[df$x == "a"

    & df$y == 1, ] # verb (動詞的) # noun (名詞的) {dplyr} df %>% filter(x == "a", y == 1)
  68. mutate select filter arrange summaries # add column # select

    column # select row # arrange row # summary of vars {dplyr} WFSCT WFSCGVODUJPOT