Tokyo.R#76 BeginneRSession-data pipeline

Slides presented at the beginner session (BeginneR Session) of Tokyo.R, the 76th R study meetup in Tokyo.

kilometer

March 02, 2019

Transcript

  1. BeginneR Session - Data Pipeline - #76 Tokyo.R 2019.03.02 @kilometer00

  2. Who!?

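  3. Who!? Name: Mimura (@kilometer). Occupation: postdoc (Doctor of Engineering). Fields: behavioral neuroscience (primates), brain imaging,

    medical systems engineering. R experience: about 10 years. Current craze: Tsingtao beer.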
  4. BeginneR Session

  5. 2018.03.03 Tokyo.R #68 BeginneR Session – Data import / Export

    2018.04.21 Tokyo.R #69 BeginneR Session – Data import / Export
    2018.06.09 Tokyo.R #70 BeginneR Session – Bayesian Modeling
    2018.07.15 Tokyo.R #71 Landscape with R – the Japanese R community
    2018.10.20 Tokyo.R #73 BeginneR Session – Visualization & Plot
    2019.01.19 Tokyo.R #75 BeginneR Session – Data pipeline
    2019.03.02 Tokyo.R #76 BeginneR Session – Data pipeline (today)
  6. BeginneR

  7. BeginneR Advanced Hoxo_m If I have seen further it is

    by standing on the shoulders of Giants. -- Sir Isaac Newton, 1676
  8. Before After BeginneR Session BeginneR BeginneR

  9. BeginneR Session - Data Pipeline -

  10. Input Output Data Pipeline

  11. packages you

  12. Input Output packages Data Pipeline

  13. None
  14. None
  15. Output Input Input Data Pipeline

  16. Output Input Input Data Pipeline

  17. Output Input Input Data Pipeline

  18. Data Pipeline

  19. Data Pipeline readable coding

  20. Programming Write Run Read Think

  21. Run!!! https://www.amazon.co.jp/dp/B00Y0UI990/

  22. Programming Write Run Read Think

  23. Programming Write Run Read Think coding style

  24. The tidyverse style guide https://style.tidyverse.org/ "Good coding style is like

    correct punctuation: you can manage without it, butitsuremakesthingseasiertoread." Google's R Style Guide https://style.tidyverse.org/ "The goal of the R Programming Style Guide is to make our R code easier to read, share, and verify." R coding style guides
  25. The tidyverse style guides https://style.tidyverse.org/syntax.html#object-names

  26. The tidyverse style guides https://style.tidyverse.org/syntax.html#object-names

  27. > data

    function(..., list = character(), package = NULL, lib.loc = NULL,
             verbose = getOption("verbose"), envir = .GlobalEnv)
    {
        fileExt <- function(x) {
            db <- grepl("\\.[^.]+\\.(gz|bz2|xz)$", x)
            ans <- sub(".*\\.", "", x)
    ...

    "Where possible, avoid re-using names of common functions and variables. This will cause confusion for the readers of your code."

    # Good
    dat <- read.csv("hoge.csv")
    # Bad
    data <- read.csv("hoge.csv")
  28. # Bad

    for(i in 1:10){
    print(i)
    }

    # Good
    for(i in 1:10){
      print(i)
    }

    copy (cut) & paste: auto-indentation (in RStudio). Details: RStudio > Preferences > Code > Editing
  29. The tidyverse style guide https://style.tidyverse.org/ "Good coding style is like

    correct punctuation: you can manage without it, butitsuremakesthingseasiertoread." Google's R Style Guide https://style.tidyverse.org/ "The goal of the R Programming Style Guide is to make our R code easier to read, share, and verify." R coding style guides
  30. Programming Write Run Read Think Write Run Read Think Share

  31. Text Figure Information Intention Data decode encode feedback Programming

  32. Boolean operators (Boolean algebra). George Boole, 1815 - 1864 (image: Wikipedia)

    A == B    # equal to
    A != B    # not equal to
    A | B     # or
    A & B     # and
    A %in% B  # is A in B?
  33. Boolean operators (Boolean algebra)

    "a" != "b"      # is A not equal to B?
    [1] TRUE
    1 %in% 10:100   # is A in B?
    [1] FALSE
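
As a minimal sketch of the operators listed on these slides, the following base R snippet applies each one to small vectors. The object names (`a`, `b`, and the result names) are placeholders chosen for illustration and do not come from the slides.

```r
# Two small vectors; the names a and b are illustrative only
a <- c(1, 2, 3)
b <- c(3, 2, 1)

eq     <- a == b               # element-wise "equal to"
neq    <- a != b               # element-wise "not equal to"
either <- (a > 2) | (b > 2)    # element-wise "or"
both   <- (a > 1) & (b > 1)    # element-wise "and"
member <- a %in% c(2, 10)      # is each element of a found in the set?

print(eq)
print(neq)
print(either)
print(both)
print(member)
```

All five operators are vectorized, so each result is a logical vector with one element per element of `a`.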
  34. George Boole, 1815 - 1864. Mathematician & Philosopher. A Class-Room Introduction to Logic

    https://niyamaklogic.wordpress.com/category/laws-of-thoughts/
  35. Boolean operators (Boolean algebra). George Boole, 1815 - 1864 (image: Wikipedia)

    A == B    # equal to
    A != B    # not equal to
    A | B     # or
    A & B     # and
    A %in% B  # is A in B?
  36. Programming

  37. Programming

  38. Programming Write Run Read Think Write Run Read Think Communicate

    Share
  39. Agenda: 0. Introduction 1. data.frame 2. Tidy data 3. Pipe 4. Verbs (one item marked done)

  40. vector in Excel

  41. vector in Excel / in R

    pre <- c(1, 2, 3, 4, 5)
    post <- pre * 5
    > pre
    [1] 1 2 3 4 5
    > post
    [1]  5 10 15 20 25
  42. None
  43. None
  44. None
  45. None
  46. None
  47. data.frame (labels on the slide: observation, variable)

    df1 <- data.frame(A = 1:3, B = 11:13)
    > df1
      A  B
    1 1 11
    2 2 12
    3 3 13
    > str(df1)
    'data.frame': 3 obs. of 2 variables:
     $ A: int 1 2 3
     $ B: int 11 12 13
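
The vector and data.frame slides can be reproduced with a few lines of base R. This is a small sketch that reuses the object names from the slides (`pre`, `post`, `df1`); `n_obs`, `n_var` and `second_row` are extra names introduced here for illustration.

```r
# A vector: arithmetic is applied to every element at once
pre  <- c(1, 2, 3, 4, 5)
post <- pre * 5

# A data.frame: each column is a variable, each row is an observation
df1 <- data.frame(A = 1:3, B = 11:13)

n_obs <- nrow(df1)      # number of observations (rows)
n_var <- ncol(df1)      # number of variables (columns)
second_row <- df1[2, ]  # one observation, all variables

print(post)
str(df1)
print(second_row)
```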
  48. Agenda: 0. Introduction 1. data.frame 2. Tidy data 3. Pipe 4. Verbs (two items marked done)

  49. http://vita.had.co.nz/papers/tidy-data.html

  50. https://r4ds.had.co.nz/

  51. In tidy data: 1. Each variable forms a column. 2. Each observation forms a row. 3. Each value must have its own cell.

    df1 <- data.frame(A = 1:3, B = 11:13)
    > df1
      A  B
    1 1 11
    2 2 12
    3 3 13
    (labels on the slide: observation, variable)
  52. In tidy data: 1. Each variable forms a column. 2.

    Each observation forms a row. 3. Each value must have its own cell.
  53. In tidy data: 1. Each variable forms a column. 2.

    Each observation forms a row. 3. Each value must have its own cell.
  54. In tidy data: 1. Each variable forms a column. 2.

    Each observation forms a row. 3. Each value must have its own cell. Different observation data Value Label
  55. In tidy data: 1. Each variable forms a column. 2.

    Each observation forms a row. 3. Each value must have its own cell. data tidying
  56. "Wide" data "Nested" data "Long" data nest unnest gather spread

    input output visualization Data layout Loops, Summarization, Feature extractions et al., ...
  57. Data layout: wide layout <-> long layout (gather / spread)

    long_dat <- gather(wide_dat, key, val, -tag)
    wide_dat <- spread(long_dat, key, val)
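
A small round trip may make the wide/long conversion more concrete. This sketch assumes {tidyr} is installed; the toy table (`tag`, `A`, `B`) and the name `back` are made up for illustration, while `wide_dat`, `long_dat`, `key` and `val` follow the slide.

```r
library(tidyr)

# Hypothetical wide table: one row per tag, one column per measured variable
wide_dat <- data.frame(tag = c("s1", "s2"), A = c(1, 2), B = c(11, 12))

# Wide -> long: every column except tag is stacked into key/val pairs
long_dat <- gather(wide_dat, key, val, -tag)

# Long -> wide: the key column becomes column names again
back <- spread(long_dat, key, val)

print(long_dat)
print(back)
```

In tidyr 1.0.0 and later, `pivot_longer()` and `pivot_wider()` are the recommended replacements; `gather()` and `spread()` remain available.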
  58. "Wide" data "Long" data gather(df, key, value, -c(obsid, group)) {tidyr}

    variables variables
  59. Data layout: data <-> nested data (list: [[1]] [[2]] [[3]]) via nest / unnest

    n_dat <- nest(group_by(dat, tag))
    dat <- unnest(n_dat)
  60. Data layout: data <-> nested data (list: [[1]] [[2]] [[3]]) via group_nest / unnest

    n_dat <- group_nest(dat, tag) # dplyr 0.8 or later
    dat <- unnest(n_dat)
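
The nesting step can be tried on a toy table. This is a sketch under assumptions: dplyr 0.8 or later and tidyr 1.0 or later are installed (where `unnest()` expects the `cols` argument), and the table contents and the name `flat` are invented for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical long table with a grouping column
dat <- data.frame(tag = c("a", "a", "b"), val = c(1, 2, 3))

# One row per tag; the remaining columns are stored in a list column named "data"
n_dat <- group_nest(dat, tag)

# Back to one row per observation
flat <- unnest(n_dat, cols = c(data))

print(n_dat)
print(flat)
```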
  61. "Wide" data "Nested" data group_nest(df, group)

  62. "Nested" data df2 <- group_nest(df, group)

  63. "Wide" data "Nested" data "Long" data nest unnest gather spread

    input output visualization Data layout in {tidyr} Loops, Summarization, Feature extractions et al., ...
  64. Agenda: 0. Introduction 1. data.frame 2. Tidy data 3. Pipe 4. Verbs (two items marked done)

  65. Pipe: %>% {magrittr}

    X %>% f          is  f(X)
    X %>% f(y)       is  f(X, y)
    X %>% f %>% g    is  g(f(X))
    X %>% f(y, .)    is  f(y, X)

    "Re-introduction to dplyr (Basics)" by yutannihilation https://speakerdeck.com/yutannihilation/dplyrzai-ru-men-ji-ben-bian
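
The four pipe patterns on this slide can be checked with ordinary base functions. A minimal sketch, assuming {magrittr} is installed; `sqrt`, `head`, `sum` and `seq` stand in for the slide's generic `f` and `g`, and the result names are illustrative.

```r
library(magrittr)

x <- c(4, 9, 16)

r1 <- x %>% sqrt             # X %>% f        is f(X)
r2 <- x %>% head(2)          # X %>% f(y)     is f(X, y)
r3 <- x %>% sqrt %>% sum     # X %>% f %>% g  is g(f(X))
r4 <- 3 %>% seq(1, .)        # X %>% f(y, .)  is f(y, X)

print(r1)
print(r2)
print(r3)
print(r4)
```

When the dot appears as a direct argument, as in the last line, the left-hand side is placed there instead of in the first position.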
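  66. Pipe %>% {magrittr}: voices of devoted (addicted) users

    "Lately I type nothing but pipes."
    "The pipe is a great thing; people who use other languages all think so too."
    "I shifted over to this (the pipe) slowly, over about a year."
    From "R Community Small Talk" https://rlangradio.org/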
  67. ① ② ③ ④ lift take pour put Bring milk

    from the kitchen!
  68. Bring milk from the kitchen!

    ① lift: lift(Robot, glass, table) -> Robot'
    ② take: take(Robot', fridge, milk) -> Robot''
  69. Bring milk from the kitchen!

    Robot' <- lift(Robot, glass, table)        # ①
    Robot'' <- take(Robot', fridge, milk)      # ②
    Robot''' <- pour(Robot'', milk, glass)     # ③
    result <- put(Robot''', glass, table)      # ④

    by using pipe,

    result <- Robot %>%
      lift(glass, table) %>%    # ①
      take(fridge, milk) %>%    # ②
      pour(milk, glass) %>%     # ③
      put(glass, table)         # ④
  70. The tidyverse style guides https://style.tidyverse.org/syntax.html#object-names "There are only two hard

    things in Computer Science: cache invalidation and naming things"
  71. Bring milk from the kitchen!

    Robot' <- lift(Robot, glass, table)        # ①
    Robot'' <- take(Robot', fridge, milk)      # ②
    Robot''' <- pour(Robot'', milk, glass)     # ③
    result <- put(Robot''', glass, table)      # ④

    by using pipe,

    result <- Robot %>%
      lift(glass, table) %>%    # ①
      take(fridge, milk) %>%    # ②
      pour(milk, glass) %>%     # ③
      put(glass, table)         # ④
  72. Bring milk from the kitchen! (labels on the slide: Thinking, Reading)

    Robot' <- lift(Robot, glass, table)        # ①
    Robot'' <- take(Robot', fridge, milk)      # ②
    Robot''' <- pour(Robot'', milk, glass)     # ③
    result <- put(Robot''', glass, table)      # ④

    by using pipe,

    result <- Robot %>%
      lift(glass, table) %>%    # ①
      take(fridge, milk) %>%    # ②
      pour(milk, glass) %>%     # ③
      put(glass, table)         # ④
  73. Programming Write Run Read Think Write Run Read Think Communicate

    Share
  74. Agenda: 0. Introduction 1. data.frame 2. Tidy data 3. Pipe 4. Verbs (two items marked done)

  75. ① ② ③ ④ lift take pour put Bring milk

    from the kitchen!
  76. ① lift ② take ③ pour ④ put: Bring milk from the kitchen!

    result <- Robot %>%
      lift(glass, table) %>%
      take(fridge, milk) %>%
      pour(milk, glass) %>%
      put(glass, table)
  77. Define an original function

    please_bring <- function(someone, milk, glass,
                             table = dining_table,
                             fridge = kitchen_fridge){
      someone %>%
        lift(glass, table) %>%
        take(fridge, milk) %>%
        pour(milk, glass) %>%
        put(glass, table)
    }

    Usage

    RobotA %>% please_bring(milk, my_glass)
    RobotB %>% please_bring(cold_tea, her_glass)
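
The robot code on the slides is pseudocode, so here is one way it could be made runnable. Everything in this sketch is an assumption made for illustration: the robot is modelled as a character vector that logs its actions, the four verbs simply append a description, and the arguments are passed as strings. It assumes {magrittr} is installed.

```r
library(magrittr)

# Hypothetical verbs: each one appends a description of the action to the log
lift <- function(robot, item, from)  c(robot, paste("lift", item, "from", from))
take <- function(robot, from, item)  c(robot, paste("take", item, "from", from))
pour <- function(robot, drink, into) c(robot, paste("pour", drink, "into", into))
put  <- function(robot, item, on)    c(robot, paste("put", item, "on", on))

# The four steps wrapped into a single verb, as on the slide
please_bring <- function(someone, drink, glass,
                         table = "dining_table",
                         fridge = "kitchen_fridge") {
  someone %>%
    lift(glass, table) %>%
    take(fridge, drink) %>%
    pour(drink, glass) %>%
    put(glass, table)
}

# An empty log stands in for a robot that has not done anything yet
robot_log <- character(0) %>% please_bring("milk", "my_glass")
print(robot_log)
```

Because each verb takes the robot as its first argument and returns the updated robot, the steps chain with the pipe in the order they are read.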
  78. General guidance on naming things: • nouns for variables • verbs for functions

  79. General guidance on naming things: • nouns for variables • verbs for functions https://www.grinchcentral.com/function-names-to-verb-or-not-to-verb

  80. General guidance on naming things: • nouns for variables • verbs for functions. Conversely, • variables are nouns • functions are verbs
  81. Functions are verbs.

  82. {dplyr}

    df[df$x == "a" & df$y == 1, ]      # noun (noun-like)
    filter(df, x == "a", y == 1)       # verb (verb-like)
    df %>% filter(x == "a", y == 1)    # verb with pipe
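
To see that the three spellings select the same rows, they can be run side by side. A minimal sketch, assuming {dplyr} is installed; the contents of `df` and the result names are invented for illustration.

```r
library(dplyr)

# Hypothetical data with the column names used on the slide
df <- data.frame(x = c("a", "b", "a"), y = c(1, 1, 2), stringsAsFactors = FALSE)

by_index <- df[df$x == "a" & df$y == 1, ]     # base R indexing
by_verb  <- filter(df, x == "a", y == 1)      # dplyr verb
by_pipe  <- df %>% filter(x == "a", y == 1)   # dplyr verb with pipe

print(by_index)
print(by_verb)
print(by_pipe)
```

Conditions separated by commas inside `filter()` are combined with "and", which matches the `&` in the indexing version.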
  83. verbs: verb functions {dplyr}

    mutate     # add column
    select     # select column
    filter     # select row
    arrange    # arrange row
    summarise  # summary of vars
  84. verbs {dplyr}: It (dplyr) provides simple “verbs”, functions that correspond to the most common data manipulation tasks, to help you translate your thoughts into code.

    Introduction to dplyr https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html
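  85. verbs {dplyr}: dplyr provides "verbs" for translating your thoughts into code: a set of functions that carry out the basics of data manipulation in a simple way. (※ a fairly loose translation)

    Introduction to dplyr https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html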
  86. verbs {dplyr}: mutate # add a column

    mutate(dat, C = fun(A, B))
  87. verbs {dplyr}: mutate # add a column

    dat %>% mutate(C = fun(A, B))
  88. verbs {dplyr}: mutate # add a column

    df1 <- data.frame(A = 1:3, B = 11:13)
    > df1
      A  B
    1 1 11
    2 2 12
    3 3 13

    df2 <- df1 %>% mutate(C = A * B)
    > df2
      A  B  C
    1 1 11 11
    2 2 12 24
    3 3 13 39
  89. verbs {dplyr}

    library(dplyr)
    iris %>% mutate(a = 1:nrow(.)) %>% str
    'data.frame': 150 obs. of 6 variables:
     $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 ...
     $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 ...
     $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 ...
     $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 ...
     $ Species : Factor w/ 3 levels "setosa", ...
     $ a : int 1 2 3 4 5 6 7 8 9 10 ...
  90. verbs {dplyr}: overwrite

    library(dplyr)
    iris %>% mutate(a = 1:nrow(.), a = a * 5/3 %>% round)
    'data.frame': 150 obs. of 6 variables:
     $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 ...
     $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 ...
     $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 ...
     $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 ...
     $ Species : Factor w/ 3 levels "setosa", ...
     $ a : num 1.67 3.33 5 6.67 8.33 ...
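
The unrounded values in the output above (1.67, 3.33, ...) follow from operator precedence: `%>%` binds more tightly than `*` and `/`, so in `a * 5/3 %>% round` only the `3` is passed to `round`. A short sketch of the difference, assuming {magrittr} is installed; the names `a`, `unrounded` and `rounded` are illustrative.

```r
library(magrittr)

a <- 1:3

# Parsed as a * 5 / (3 %>% round): only the 3 is rounded
unrounded <- a * 5/3 %>% round

# Parentheses make the pipe apply to the whole product
rounded <- (a * 5/3) %>% round

print(unrounded)
print(rounded)
```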
  91. verbs {dplyr}: select # select columns

    dat %>% select(tag, B)

  92. verbs {dplyr}

    library(dplyr)
    iris %>% select(Sepal.Length, Sepal.Width)
    'data.frame': 150 obs. of 2 variables:
     $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 ...
     $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 ...
  93. verbs {dplyr}: select helper functions

    library(dplyr)
    iris %>% select(contains("Width"))
    'data.frame': 150 obs. of 2 variables:
     $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 ...
     $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 ...
  94. verbs {dplyr}: # select helper functions

    starts_with("s")
    ends_with("s")
    contains("se")
    matches("^.e")
    one_of(c("Sepal.Length", "Species"))
    everything()

    "Notes on using dplyr::select" by kazutan https://kazutan.github.io/blog/2017/04/dplyr-select-memo/
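
A few of these helpers can be tried directly on `iris`. A minimal sketch, assuming {dplyr} is installed; the result names are made up for illustration, and only the column names are kept so the output stays short.

```r
library(dplyr)

# Columns whose names contain "Width"
width_cols <- iris %>% select(contains("Width")) %>% names

# Columns whose names end with "Length"
length_cols <- iris %>% select(ends_with("Length")) %>% names

# Move Species to the front, then keep everything else in its original order
reordered <- iris %>% select(Species, everything()) %>% names

print(width_cols)
print(length_cols)
print(reordered)
```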
  95. verbs: verb functions {dplyr}

    mutate     # add columns
    select     # select columns
    filter     # filter rows
    arrange    # reorder rows
    summarise  # aggregate values
  96. verbs {dplyr}: filter # filter rows

    dat %>% filter(tag %in% c(1, 3, 5))
  97. verbs {dplyr}

    library(dplyr)
    iris %>% filter(Species == "versicolor")
    'data.frame': 50 obs. of 5 variables:
     $ Sepal.Length: num 7 6.4 6.9 5.5 6.5 5.7 6.3 ...
     $ Sepal.Width : num 3.2 3.2 3.1 2.3 2.8 2.8 ...
     $ Petal.Length: num 4.7 4.5 4.9 4 4.6 4.5 4.7 ...
     $ Petal.Width : num 1.4 1.5 1.5 1.3 1.5 1.3 ...
     $ Species : Factor w/ 3 levels "setosa","versicolor",..: 2 2 2 2 2 2 2 2 2 2 ...
  98. verbs {dplyr}: NSE (Non-Standard Evaluation)

    library(dplyr)
    iris %>% filter(Species == "versicolor")
    'data.frame': 50 obs. of 5 variables:
     $ Sepal.Length: num 7 6.4 6.9 5.5 6.5 5.7 6.3 ...
     $ Sepal.Width : num 3.2 3.2 3.1 2.3 2.8 2.8 ...
     $ Petal.Length: num 4.7 4.5 4.9 4 4.6 4.5 4.7 ...
     $ Petal.Width : num 1.4 1.5 1.5 1.3 1.5 1.3 ...
     $ Species : Factor w/ 3 levels "setosa","versicolor",..: 2 2 2 2 2 2 2 2 2 2 ...
  99. About NSE

    filter(df, x == "a", y == 1)     # NSE (Non-Standard Evaluation)
    df[df$x == "a" & df$y == 1, ]    # SE (Standard Evaluation)

    Programming with dplyr http://dplyr.tidyverse.org/articles/programming.html
  100. About NSE

    filter(df, x == "a", y == 1)
    df[df$x == "a" & df$y == 1, ]

    With NSE:
    • you do not have to write the name df over and over
    • you can write in an SQL-like way

    Programming with dplyr http://dplyr.tidyverse.org/articles/programming.html
  101. About NSE

    filter(df, x == "a", y == 1)     # verb (verb-like)
    df[df$x == "a" & df$y == 1, ]    # noun (noun-like)

    With NSE: there are trade-offs, but clean code wins (my personal view). Easy to write, easy to read. Keep thought and implementation close together.

    Programming with dplyr http://dplyr.tidyverse.org/articles/programming.html
  102. About NSE: because of NSE...

    df <- data.frame(x = 1:3, y = 1:3)
    filter(df, x == 1)

    my_var <- "x"
    filter(df, my_var == 1)

    This does NOT work as intended: there is no "my_var" column in df, so my_var is looked up outside the data and the condition becomes "x" == 1.

    Programming with dplyr http://dplyr.tidyverse.org/articles/programming.html
  103. About NSE: if you really, really must do it,

    my_var <- quo(x)
    filter(df, (!! my_var) == 1)

    For why it works this way, see "Re-introduction to dplyr (tidyeval edition)" by yutannihilation https://speakerdeck.com/yutannihilation/dplyrzai-ru-men-tidyevalbian
  104. About NSE: if you really, really must do it,

    my_var <- quo(x)
    filter(df, (!! my_var) == 1)

    For why it works this way, see "Re-introduction to dplyr (tidyeval edition)" by yutannihilation https://speakerdeck.com/yutannihilation/dplyrzai-ru-men-tidyevalbian
    Does readability go up or down? That depends on you and your readers.
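
The behaviour described on the NSE slides can be reproduced on the three-row `df` from the talk. This is a sketch assuming a reasonably recent {dplyr} (0.7 or later) is installed. The `.data[[...]]` form in the last step is not on the slides; it is included as one alternative for column names held in strings. The result names are illustrative.

```r
library(dplyr)

df <- data.frame(x = 1:3, y = 1:3)

# A string is not a column reference: this compares "x" with 1, which is FALSE
my_var <- "x"
res_string <- filter(df, my_var == 1)

# Quote the column with quo() and unquote it with !!, as on the slide
my_quo <- quo(x)
res_quo <- filter(df, (!!my_quo) == 1)

# Alternative (not from the slides): look the column up by name via the .data pronoun
res_data <- filter(df, .data[[my_var]] == 1)

print(nrow(res_string))
print(res_quo)
print(res_data)
```

The first call runs without an error but returns no rows, which is why this kind of mistake is easy to miss.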
  105. verbs: verb functions {dplyr}

    mutate     # add columns
    select     # select columns
    filter     # filter rows
    arrange    # reorder rows
    summarise  # aggregate values
  106. Grammar of data manipulation: By constraining your options, it helps you think about your data manipulation challenges.

    Introduction to dplyr https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html
  107. Grammar of data manipulation: By limiting your options, you can think about the steps of data analysis more simply. (※ a very loose translation)

    Introduction to dplyr https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html
  108. Igor Stravinsky (Игорь Ф. Стравинский), 1882 - 1971

    "The more constraints one imposes, the more one frees one's self of the chains that shackle the spirit." (The slide pairs this with a fairly loose Japanese translation.)
  109. Agenda: 0. Introduction 1. data.frame 2. Tidy data 3. Pipe 4. Verbs (two items marked done)

  110. Summary

  111. > list3

    $A
    [1] 1 2 3

    $B
    [1] 11 12 13

    > df1
      A  B
    1 1 11
    2 2 12
    3 3 13
    (labels on the slide: observation, variable)
  112. In tidy data: 1. Each variable forms a column. 2.

    Each observation forms a row. 3. Each value must have its own cell. data tidying
  113. "Wide" data "Nested" data "Long" data nest unnest gather spread

    input output visualization Data style manipulation in {tidyr} Loops, Summarization, Feature extractions et al., ...
  114. Pipe: %>% {magrittr}

    X %>% f          is  f(X)
    X %>% f(y)       is  f(X, y)
    X %>% f %>% g    is  g(f(X))
    X %>% f(y, .)    is  f(y, X)

    "Re-introduction to dplyr (Basics)" by yutannihilation https://speakerdeck.com/yutannihilation/dplyrzai-ru-men-ji-ben-bian
  115. Functions are verbs: ① lift ② take ③ pour ④ put

    result <- Robot %>%
      lift(glass, table) %>%
      take(fridge, milk) %>%
      pour(milk, glass) %>%
      put(glass, table)
  116. {dplyr}

    df[df$x == "a" & df$y == 1, ]      # noun (noun-like)
    filter(df, x == "a", y == 1)       # verb (verb-like)
    df %>% filter(x == "a", y == 1)
  117. verbs: verb functions {dplyr}

    mutate     # add column
    select     # select column
    filter     # select row
    arrange    # arrange row
    summarise  # summary of vars
  118. Data Pipeline readable coding

  119. https://www.tidyverse.org/

  120. report share visualize tools manipulate import export https://www.tidyverse.org/

  121. Programming languages are languages. Write Run Read Think Write Run

    Read Think Communicate Share
  122. “Life shrinks or expands to one’s courage.” -- Anaïs Nin,

    2000 http://theamericanreader.com
  123. Before After BeginneR Session BeginneR BeginneR ?

  124. None
  125. bar dradra