Slide 1

Slide 1 text

BeginneR Session - Data Pipeline - #76 Tokyo.R 2019.03.02 @kilometer00

Slide 2

Slide 2 text

Who!?

Slide 3

Slide 3 text

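Who!? Name: Mimura @kilometer Occupation: Postdoc (Doctor of Engineering) Specialties: behavioral neuroscience (primates), brain imaging, medical systems engineering R experience: about 10 years Current craze: Tsingtao beer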

Slide 4

Slide 4 text

BeginneR Session

Slide 5

Slide 5 text

2018.03.03 Tokyo.R #68 BeginneR Session – Data import / Export
2018.04.21 Tokyo.R #69 BeginneR Session – Data import / Export
2018.06.09 Tokyo.R #70 BeginneR Session – Bayesian Modeling
2018.07.15 Tokyo.R #71 Landscape with R – the Japanese R community
2018.10.20 Tokyo.R #73 BeginneR Session – Visualization & Plot
2019.01.19 Tokyo.R #75 BeginneR Session – Data pipeline
2019.03.02 Tokyo.R #76 BeginneR Session – Data pipeline (today)

Slide 6

Slide 6 text

BeginneR

Slide 7

Slide 7 text

BeginneR Advanced Hoxo_m If I have seen further it is by standing on the shoulders of Giants. -- Sir Isaac Newton, 1676

Slide 8

Slide 8 text

Before After BeginneR Session BeginneR BeginneR

Slide 9

Slide 9 text

BeginneR Session - Data Pipeline -

Slide 10

Slide 10 text

Input Output Data Pipeline

Slide 11

Slide 11 text

packages you

Slide 12

Slide 12 text

Input Output packages Data Pipeline

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

Output Input Input Data Pipeline

Slide 16

Slide 16 text

Output Input Input Data Pipeline

Slide 17

Slide 17 text

Output Input Input Data Pipeline

Slide 18

Slide 18 text

Data Pipeline

Slide 19

Slide 19 text

Data Pipeline readable coding

Slide 20

Slide 20 text

Programming Write Run Read Think

Slide 21

Slide 21 text

Run!!! https://www.amazon.co.jp/dp/B00Y0UI990/

Slide 22

Slide 22 text

Programming Write Run Read Think

Slide 23

Slide 23 text

Programming Write Run Read Think coding style

Slide 24

Slide 24 text

R coding style guides
The tidyverse style guide https://style.tidyverse.org/ "Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread."
Google's R Style Guide https://google.github.io/styleguide/Rguide.html "The goal of the R Programming Style Guide is to make our R code easier to read, share, and verify."

Slide 25

Slide 25 text

The tidyverse style guide https://style.tidyverse.org/syntax.html#object-names

Slide 26

Slide 26 text

The tidyverse style guide https://style.tidyverse.org/syntax.html#object-names

Slide 27

Slide 27 text

"Where possible, avoid re-using names of common functions and variables. This will cause confusion for the readers of your code."
> data
function(..., list = character(), package = NULL, lib.loc = NULL,
         verbose = getOption("verbose"), envir = .GlobalEnv)
{
  fileExt <- function(x) {
    db <- grepl("\\.[^.]+\\.(gz|bz2|xz)$", x)
    ans <- sub(".*\\.", "", x)
    ...
# Good
dat <- read.csv("hoge.csv")
# Bad
data <- read.csv("hoge.csv")

Slide 28

Slide 28 text

# Bad
for(i in 1:10){
print(i)
}
# Good
for(i in 1:10){
  print(i)
}
copy (cut) & paste: Auto-indentation (in RStudio)
Details: RStudio > Preferences > Code > Editing

Slide 29

Slide 29 text

R coding style guides
The tidyverse style guide https://style.tidyverse.org/ "Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread."
Google's R Style Guide https://google.github.io/styleguide/Rguide.html "The goal of the R Programming Style Guide is to make our R code easier to read, share, and verify."

Slide 30

Slide 30 text

Programming Write Run Read Think Write Run Read Think Share

Slide 31

Slide 31 text

Text Figure Information Intention Data decode encode feedback Programming

Slide 32

Slide 32 text

Boolean operators (Boolean algebra)
A == B    # equal to
A != B    # not equal to
A | B     # or
A & B     # and
A %in% B  # is A in B?
George Boole, 1815 - 1864 (image: Wikipedia)

Slide 33

Slide 33 text

Boolean operators (Boolean algebra)
"a" != "b"     # not equal to
[1] TRUE
1 %in% 10:100  # is A in B?
[1] FALSE

Slide 34

Slide 34 text

George Boole 1815 - 1864 Mathematician & Philosopher A Class-Room Introduction to Logic https://niyamaklogic.wordpress.com/category/laws-of-thoughts/

Slide 35

Slide 35 text

Boolean operators (Boolean algebra)
A == B    # equal to
A != B    # not equal to
A | B     # or
A & B     # and
A %in% B  # is A in B?
George Boole, 1815 - 1864 (image: Wikipedia)
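
The five operators can be tried directly at the console. The following is a minimal sketch in base R; the vector `x` and the variable names holding the results are made up for illustration and do not come from the slides.

```r
# Comparison operators return logical values.
is_equal     <- "a" == "b"   # equal to      -> FALSE
is_not_equal <- "a" != "b"   # not equal to  -> TRUE

# | (or) and & (and) combine logical vectors element by element.
x <- c(1, 5, 10)             # hypothetical example vector
either <- x < 3 | x > 8      # TRUE FALSE TRUE
both   <- x > 3 & x < 8      # FALSE TRUE FALSE

# %in% asks, for each element on the left, whether it appears on the right.
found <- c(1, 50) %in% 10:100  # FALSE TRUE
```

Because these operators are vectorized, the same expressions can later be used as row conditions on whole columns.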

Slide 36

Slide 36 text

Programming

Slide 37

Slide 37 text

Programming

Slide 38

Slide 38 text

Programming Write Run Read Think Write Run Read Think Communicate Share

Slide 39

Slide 39 text

Agenda 0. Introduction (done) 1. data.frame 2. Tidy data 3. Pipe 4. Verbs

Slide 40

Slide 40 text

vector in Excel

Slide 41

Slide 41 text

vector (in Excel / in R)
pre <- c(1, 2, 3, 4, 5)
post <- pre * 5
> pre
[1] 1 2 3 4 5
> post
[1]  5 10 15 20 25

Slide 42

Slide 42 text

No content

Slide 43

Slide 43 text

No content

Slide 44

Slide 44 text

No content

Slide 45

Slide 45 text

No content

Slide 46

Slide 46 text

No content

Slide 47

Slide 47 text

data.frame
df1 <- data.frame(A = 1:3, B = 11:13)
> df1
  A  B
1 1 11
2 2 12
3 3 13
(rows: observation, columns: variable)
> str(df1)
'data.frame': 3 obs. of 2 variables:
 $ A: int 1 2 3
 $ B: int 11 12 13

Slide 48

Slide 48 text

Agenda 0. Introduction 1. data.frame 2. Tidy data 3. Pipe 4. Verbs (done) (done)

Slide 49

Slide 49 text

http://vita.had.co.nz/papers/tidy-data.html

Slide 50

Slide 50 text

https://r4ds.had.co.nz/

Slide 51

Slide 51 text

In tidy data:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each value must have its own cell.
df1 <- data.frame(A = 1:3, B = 11:13)
> df1
  A  B
1 1 11
2 2 12
3 3 13
(rows: observation, columns: variable)
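
The three rules can be checked against `df1` with base R alone. This is a small sketch rather than anything taken from the slides; the variable names on the left-hand side are made up for illustration.

```r
# A data.frame is a list of equal-length vectors, one per column.
df1 <- data.frame(A = 1:3, B = 11:13)

variable_a  <- df1$A       # rule 1: one variable is one column (a vector)
second_obs  <- df1[2, ]    # rule 2: one observation is one row
single_cell <- df1[2, "B"] # rule 3: one value sits in one cell

n_obs  <- nrow(df1)        # number of observations
n_vars <- ncol(df1)        # number of variables
```

Here `second_obs` is itself a one-row data.frame, while `variable_a` and `single_cell` are plain vectors.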

Slide 52

Slide 52 text

In tidy data: 1. Each variable forms a column. 2. Each observation forms a row. 3. Each value must have its own cell.

Slide 53

Slide 53 text

In tidy data: 1. Each variable forms a column. 2. Each observation forms a row. 3. Each value must have its own cell.

Slide 54

Slide 54 text

In tidy data: 1. Each variable forms a column. 2. Each observation forms a row. 3. Each value must have its own cell. Different observation data Value Label

Slide 55

Slide 55 text

In tidy data: 1. Each variable forms a column. 2. Each observation forms a row. 3. Each value must have its own cell. data tidying

Slide 56

Slide 56 text

"Wide" data "Nested" data "Long" data nest unnest gather spread input output visualization Data layout Loops, Summarization, Feature extractions et al., ...

Slide 57

Slide 57 text

Data layout: Wide layout / Long layout (gather, spread)
long_dat <- gather(wide_dat, key, val, -tag)
wide_dat <- spread(long_dat, key, val)
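
To make the round trip concrete, here is a minimal sketch with a made-up two-row table. The contents of `wide_dat` and the name `wide_again` are assumptions for illustration; `gather()` and `spread()` are the {tidyr} functions named on the slide.

```r
library(tidyr)

# Hypothetical wide table: one row per tag, one column per measurement.
wide_dat <- data.frame(tag = c("a", "b"), x = c(1, 2), y = c(10, 20))

# Wide -> long: every column except `tag` is stacked into key/val pairs.
long_dat <- gather(wide_dat, key, val, -tag)

# Long -> wide: the inverse operation restores one column per key.
wide_again <- spread(long_dat, key, val)
```

In tidyr 1.0.0 and later, `pivot_longer()` and `pivot_wider()` are the recommended successors; `gather()` and `spread()` remain available.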

Slide 58

Slide 58 text

"Wide" data "Long" data gather(df, key, value, -c(obsid, group)) {tidyr} variables variables

Slide 59

Slide 59 text

Data layout: data / Nested data (a list column: [[1]] [[2]] [[3]]) via nest, unnest
n_dat <- nest(group_by(dat, tag))
dat <- unnest(n_dat)

Slide 60

Slide 60 text

Data layout: data / Nested data (a list column: [[1]] [[2]] [[3]]) via group_nest, unnest
n_dat <- group_nest(dat, tag) # dplyr 0.8 or later
dat <- unnest(n_dat)
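
A small sketch of nesting and unnesting follows. The table `dat` and the name `dat_again` are made up for illustration. It also assumes tidyr 1.0.0 or later, where `unnest()` expects the list column to be named through `cols`; the slide's `unnest(n_dat)` reflects the older interface.

```r
library(dplyr)
library(tidyr)

# Hypothetical long table with a grouping column `tag`.
dat <- data.frame(tag = c("a", "a", "b"), val = c(1, 2, 3))

# One row per tag; the `data` column is a list of per-group data frames.
n_dat <- group_nest(dat, tag)

# Expand the list column back into ordinary rows.
dat_again <- unnest(n_dat, cols = data)
```

Each element of `n_dat$data` can be passed to a function on its own, which is what makes the nested layout convenient for per-group loops.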

Slide 61

Slide 61 text

"Wide" data "Nested" data group_nest(df, group)

Slide 62

Slide 62 text

"Nested" data df2 <- group_nest(df, group)

Slide 63

Slide 63 text

"Wide" data "Nested" data "Long" data nest unnest gather spread input output visualization Data layout in {tidyr} Loops, Summarization, Feature extractions et al., ...

Slide 64

Slide 64 text

Agenda 0. Introduction 1. data.frame 2. Tidy data 3. Pipe 4. Verbs (done) (done)

Slide 65

Slide 65 text

Pipe %>% {magrittr}
X %>% f        is  f(X)
X %>% f(y)     is  f(X, y)
X %>% f %>% g  is  g(f(X))
X %>% f(y, .)  is  f(y, X)
"Re-introduction to dplyr (Basics)" by yutannihilation https://speakerdeck.com/yutannihilation/dplyrzai-ru-men-ji-ben-bian
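
Each of the four equivalences can be verified with ordinary functions. In this sketch `sum`, `round`, `sqrt` and `paste` stand in for the abstract `f` and `g`, and the vector `X` is made up for illustration.

```r
library(magrittr)

X <- c(1.234, 5.678)  # hypothetical input

# X %>% f        is  f(X)
same_1 <- identical(X %>% sum, sum(X))

# X %>% f(y)     is  f(X, y)
same_2 <- identical(X %>% round(1), round(X, 1))

# X %>% f %>% g  is  g(f(X))
same_3 <- identical(X %>% sum %>% sqrt, sqrt(sum(X)))

# X %>% f(y, .)  is  f(y, X): the dot marks where X is placed
same_4 <- identical(X %>% paste("value:", .), paste("value:", X))
```

All four comparisons evaluate to `TRUE`.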

Slide 66

Slide 66 text

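Pipe %>% {magrittr}
Voices of (addicted) devoted users:
"Lately I type nothing but pipes."
"The pipe, now that's a good thing; people from other languages all think so too."
"I shifted over to this (the pipe) slowly, over about a year."
From "R Community Chit-Chat" https://rlangradio.org/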

Slide 67

Slide 67 text

① ② ③ ④ lift take pour put Bring milk from the kitchen!

Slide 68

Slide 68 text

Bring milk from the kitchen!
① lift: lift(Robot, glass, table) -> Robot'
② take: take(Robot', fridge, milk) -> Robot''

Slide 69

Slide 69 text

Bring milk from the kitchen!
Robot' <- lift(Robot, glass, table)     # ①
Robot'' <- take(Robot', fridge, milk)   # ②
Robot''' <- pour(Robot'', milk, glass)  # ③
result <- put(Robot''', glass, table)   # ④
by using pipe,
result <- Robot %>%
  lift(glass, table) %>%   # ①
  take(fridge, milk) %>%   # ②
  pour(milk, glass) %>%    # ③
  put(glass, table)        # ④

Slide 70

Slide 70 text

The tidyverse style guide https://style.tidyverse.org/syntax.html#object-names "There are only two hard things in Computer Science: cache invalidation and naming things"

Slide 71

Slide 71 text

Bring milk from the kitchen!
Robot' <- lift(Robot, glass, table)     # ①
Robot'' <- take(Robot', fridge, milk)   # ②
Robot''' <- pour(Robot'', milk, glass)  # ③
result <- put(Robot''', glass, table)   # ④
by using pipe,
result <- Robot %>%
  lift(glass, table) %>%   # ①
  take(fridge, milk) %>%   # ②
  pour(milk, glass) %>%    # ③
  put(glass, table)        # ④

Slide 72

Slide 72 text

Bring milk from the kitchen!
Robot' <- lift(Robot, glass, table)     # ①
Robot'' <- take(Robot', fridge, milk)   # ②
Robot''' <- pour(Robot'', milk, glass)  # ③
result <- put(Robot''', glass, table)   # ④
by using pipe,
result <- Robot %>%
  lift(glass, table) %>%   # ①
  take(fridge, milk) %>%   # ②
  pour(milk, glass) %>%    # ③
  put(glass, table)        # ④
Thinking / Reading

Slide 73

Slide 73 text

Programming Write Run Read Think Write Run Read Think Communicate Share

Slide 74

Slide 74 text

Agenda 0. Introduction 1. data.frame 2. Tidy data 3. Pipe 4. Verbs (done) (done)

Slide 75

Slide 75 text

① ② ③ ④ lift take pour put Bring milk from the kitchen!

Slide 76

Slide 76 text

Bring milk from the kitchen! ① lift ② take ③ pour ④ put
result <- Robot %>%
  lift(glass, table) %>%
  take(fridge, milk) %>%
  pour(milk, glass) %>%
  put(glass, table)

Slide 77

Slide 77 text

Define an original function
please_bring <- function(someone, milk, glass,
                         table = dining_table,
                         fridge = kitchen_fridge) {
  someone %>%
    lift(glass, table) %>%
    take(fridge, milk) %>%
    pour(milk, glass) %>%
    put(glass, table)
}
Usage
RobotA %>% please_bring(milk, my_glass)
RobotB %>% please_bring(cold_tea, her_glass)
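
The robot verbs are conceptual, so the slide code cannot be run as it stands. The sketch below is one hypothetical way to make it runnable: the robot's state is simply a character vector of completed actions, and `lift`, `take`, `pour` and `put` are toy functions invented here that append one entry each. The string defaults for `table` and `fridge` and the second argument name `drink` are also assumptions made for illustration.

```r
library(magrittr)

# Hypothetical toy verbs: each takes the state (a log of actions),
# appends one action, and returns the new state.
lift <- function(robot, item, from) c(robot, paste("lift", item, "from", from))
take <- function(robot, from, item) c(robot, paste("take", item, "from", from))
pour <- function(robot, what, into) c(robot, paste("pour", what, "into", into))
put  <- function(robot, item, on)   c(robot, paste("put", item, "on", on))

# The composed verb: one named function wrapping the four-step pipeline.
please_bring <- function(someone, drink, glass,
                         table = "dining table",
                         fridge = "kitchen fridge") {
  someone %>%
    lift(glass, table) %>%
    take(fridge, drink) %>%
    pour(drink, glass) %>%
    put(glass, table)
}

robot_a <- character(0)  # a robot that has not done anything yet
result <- robot_a %>% please_bring("milk", "my glass")
```

Because every verb takes the state as its first argument and returns a new state, the verbs chain with `%>%` and the composed function can itself be used as one more verb.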

Slide 78

Slide 78 text

General guidance for naming things: • nouns for variables • verbs for functions

Slide 79

Slide 79 text

General guidance for naming things: • nouns for variables • verbs for functions https://www.grinchcentral.com/function-names-to-verb-or-not-to-verb

Slide 80

Slide 80 text

General guidance for naming things: • nouns for variables • verbs for functions
Conversely, • variables are nouns • functions are verbs

Slide 81

Slide 81 text

Functions are verbs.

Slide 82

Slide 82 text

{dplyr}
filter(df, x == "a", y == 1)     # verb (verb-like)
df[df$x == "a" & df$y == 1, ]    # noun (noun-like)
df %>% filter(x == "a", y == 1)  # verb with pipe
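
The three spellings select the same rows, which a small made-up data frame can confirm. The contents of `df` and the result names are assumptions for illustration.

```r
library(dplyr)

# Hypothetical data: only the first row has x == "a" and y == 1.
df <- data.frame(x = c("a", "a", "b"), y = c(1, 2, 1))

by_verb    <- filter(df, x == "a", y == 1)      # verb
by_bracket <- df[df$x == "a" & df$y == 1, ]     # noun
by_pipe    <- df %>% filter(x == "a", y == 1)   # verb with pipe
```

One behavioral difference is worth knowing: when a condition evaluates to `NA`, `filter()` drops that row, whereas bracket indexing returns a row filled with `NA`.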

Slide 83

Slide 83 text

verbs {dplyr}: verb functions
mutate     # add column
select     # select column
filter     # select row
arrange    # arrange row
summarise  # summary of vars

Slide 84

Slide 84 text

verbs {dplyr}
It (dplyr) provides simple "verbs" (functions that correspond to the most common data manipulation tasks) to help you translate your thoughts into code.
Introduction to dplyr https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html

Slide 85

Slide 85 text

verbs {dplyr}
dplyr provides "verbs" for translating your thoughts into code: a set of functions that carry out the ABCs of data manipulation simply. (Note: a fairly loose translation.)
Introduction to dplyr https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html

Slide 86

Slide 86 text

verbs {dplyr} mutate # add a column
mutate(dat, C = fun(A, B))

Slide 87

Slide 87 text

verbs {dplyr} mutate # add a column
dat %>% mutate(C = fun(A, B))

Slide 88

Slide 88 text

verbs {dplyr} mutate # add a column
df1 <- data.frame(A = 1:3, B = 11:13)
> df1
  A  B
1 1 11
2 2 12
3 3 13
df2 <- df1 %>% mutate(C = A * B)
> df2
  A  B  C
1 1 11 11
2 2 12 24
3 3 13 39

Slide 89

Slide 89 text

verbs {dplyr}
library(dplyr)
iris %>% mutate(a = 1:nrow(.)) %>% str
'data.frame': 150 obs. of 6 variables:
 $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 ...
 $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 ...
 $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 ...
 $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 ...
 $ Species     : Factor w/ 3 levels "setosa", ...
 $ a           : int 1 2 3 4 5 6 7 8 9 10 ...

Slide 90

Slide 90 text

verbs {dplyr}: overwrite
library(dplyr)
iris %>% mutate(a = 1:nrow(.), a = a * 5/3 %>% round) %>% str
'data.frame': 150 obs. of 6 variables:
 $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 ...
 $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 ...
 $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 ...
 $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 ...
 $ Species     : Factor w/ 3 levels "setosa", ...
 $ a           : num 1.67 3.33 5 6.67 8.33 ...
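
The values of `a` in this output are not rounded, and operator precedence explains why: `%>%` binds more tightly than `*` and `/`, so `a * 5/3 %>% round` is read as `a * 5 / round(3)`. The sketch below isolates that effect on a short made-up vector; the names `unrounded` and `rounded` are for illustration only, and whether rounding was intended on the slide is not stated in the source.

```r
library(dplyr)

a <- 1:3  # hypothetical stand-in for the column

# Parsed as a * 5 / (3 %>% round), i.e. a * 5 / 3 with no rounding of the result.
unrounded <- a * 5/3 %>% round

# Parentheses make the whole product the left-hand side of the pipe.
rounded <- (a * 5/3) %>% round
```

With the parentheses, `rounded` is `2 3 5`; without them, `unrounded` stays at `1.67 3.33 5.00` (to two decimals).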

Slide 91

Slide 91 text

verbs {dplyr} select # select columns
dat %>% select(tag, B)

Slide 92

Slide 92 text

verbs {dplyr}
library(dplyr)
iris %>% select(Sepal.Length, Sepal.Width) %>% str
'data.frame': 150 obs. of 2 variables:
 $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 ...
 $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 ...

Slide 93

Slide 93 text

verbs {dplyr}: select helper functions
library(dplyr)
iris %>% select(contains("Width")) %>% str
'data.frame': 150 obs. of 2 variables:
 $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 ...
 $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 ...

Slide 94

Slide 94 text

verbs {dplyr}
# Select helper functions
starts_with("s")
ends_with("s")
contains("se")
matches("^.e")
one_of(c("Sepal.Length", "Species"))
everything()
"Notes on using dplyr::select" by kazutan https://kazutan.github.io/blog/2017/04/dplyr-select-memo/
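
A few of these helpers applied to `iris` show what each one matches. The patterns below and the result names are chosen here for illustration and differ slightly from the slide's arguments.

```r
library(dplyr)

# Columns whose names start with "Sepal".
sepal_cols <- iris %>% select(starts_with("Sepal")) %>% names()

# Columns whose names end with "Width".
width_cols <- iris %>% select(ends_with("Width")) %>% names()

# Columns whose names contain "etal" anywhere.
petal_cols <- iris %>% select(contains("etal")) %>% names()

# Columns matching a regular expression: any first character, then "e".
regex_cols <- iris %>% select(matches("^.e")) %>% names()

# everything() picks up the remaining columns, here to move Species first.
reordered <- iris %>% select(Species, everything()) %>% names()
```

The regular expression `^.e` matches the four measurement columns but not `Species`, whose second letter is "p".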

Slide 95

Slide 95 text

verbs {dplyr}: verb functions
mutate     # add columns
select     # select columns
filter     # filter rows
arrange    # reorder rows
summarise  # aggregate values

Slide 96

Slide 96 text

verbs {dplyr} filter # filter rows
dat %>% filter(tag %in% c(1, 3, 5))

Slide 97

Slide 97 text

verbs {dplyr}
library(dplyr)
iris %>% filter(Species == "versicolor") %>% str
'data.frame': 50 obs. of 5 variables:
 $ Sepal.Length: num 7 6.4 6.9 5.5 6.5 5.7 6.3 ...
 $ Sepal.Width : num 3.2 3.2 3.1 2.3 2.8 2.8 ...
 $ Petal.Length: num 4.7 4.5 4.9 4 4.6 4.5 4.7 ...
 $ Petal.Width : num 1.4 1.5 1.5 1.3 1.5 1.3 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 2 2 2 2 2 2 2 2 2 2 ...

Slide 98

Slide 98 text

verbs {dplyr}: NSE (Non-Standard Evaluation)
library(dplyr)
iris %>% filter(Species == "versicolor") %>% str
'data.frame': 50 obs. of 5 variables:
 $ Sepal.Length: num 7 6.4 6.9 5.5 6.5 5.7 6.3 ...
 $ Sepal.Width : num 3.2 3.2 3.1 2.3 2.8 2.8 ...
 $ Petal.Length: num 4.7 4.5 4.9 4 4.6 4.5 4.7 ...
 $ Petal.Width : num 1.4 1.5 1.5 1.3 1.5 1.3 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 2 2 2 2 2 2 2 2 2 2 ...

Slide 99

Slide 99 text

About NSE
filter(df, x == "a", y == 1)   # NSE (Non-Standard Evaluation)
df[df$x == "a" & df$y == 1, ]  # SE (Standard Evaluation)
Programming with dplyr http://dplyr.tidyverse.org/articles/programming.html

Slide 100

Slide 100 text

About NSE
filter(df, x == "a", y == 1)
df[df$x == "a" & df$y == 1, ]
With NSE:
- you do not have to write the name df over and over
- you can write it in an SQL-like way
Programming with dplyr http://dplyr.tidyverse.org/articles/programming.html

Slide 101

Slide 101 text

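About NSE
filter(df, x == "a", y == 1)   # verb (verb-like)
df[df$x == "a" & df$y == 1, ]  # noun (noun-like)
With NSE: there are various trade-offs, but uncluttered code wins (personal opinion). Easy to write, easy to read. Keep thinking and implementation close together.
Programming with dplyr http://dplyr.tidyverse.org/articles/programming.html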

Slide 102

Slide 102 text

About NSE: because of NSE...
df <- data.frame(x = 1:3, y = 1:3)
filter(df, x == 1)
my_var <- "x"
filter(df, my_var == 1)
This does NOT work: there is no "my_var" column in df.
Programming with dplyr http://dplyr.tidyverse.org/articles/programming.html

Slide 103

Slide 103 text

About NSE
If you really, really want to do it:
my_var <- quo(x)
filter(df, (!! my_var) == 1)
For why this works, see "Re-introduction to dplyr (tidyeval)" by yutannihilation https://speakerdeck.com/yutannihilation/dplyrzai-ru-men-tidyevalbian

Slide 104

Slide 104 text

About NSE
If you really, really want to do it:
my_var <- quo(x)
filter(df, (!! my_var) == 1)
For why this works, see "Re-introduction to dplyr (tidyeval)" by yutannihilation https://speakerdeck.com/yutannihilation/dplyrzai-ru-men-tidyevalbian
Does readability go up or down? That depends on you and your readers.
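
The failing and working variants can be compared side by side. This sketch reuses the three-row `df` from the slides; the result names are made up, and the last variant, which uses the `.data` pronoun to look up a column from a string, is an addition that does not appear in the deck.

```r
library(dplyr)

df <- data.frame(x = 1:3, y = 1:3)

# my_var is not a column, so the string "x" itself is compared with 1.
# The comparison is FALSE for every row, and no rows come back.
my_var <- "x"
no_rows <- filter(df, my_var == 1)

# quo() captures the column name as an expression; !! unquotes it in place.
my_quo <- quo(x)
one_row <- filter(df, (!!my_quo) == 1)

# Not on the slides: when the column name arrives as a string,
# the .data pronoun can look the column up by name.
also_one_row <- filter(df, .data[[my_var]] == 1)
```

Note that the first call raises no error; it silently returns zero rows, which makes the mistake easy to miss.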

Slide 105

Slide 105 text

verbs {dplyr}: verb functions
mutate     # add columns
select     # select columns
filter     # filter rows
arrange    # reorder rows
summarise  # aggregate values

Slide 106

Slide 106 text

Grammar of data manipulation
By constraining your options, it helps you think about your data manipulation challenges.
Introduction to dplyr https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html

Slide 107

Slide 107 text

Grammar of data manipulation
By limiting your options, you can think about the steps of your data analysis simply. (Note: a very loose translation.)
Introduction to dplyr https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html

Slide 108

Slide 108 text

The more constraints one imposes, the more one frees one's self of the chains that shackle the spirit.
Igor Stravinsky (Игорь Фёдорович Стравинский), 1882 - 1971

Slide 109

Slide 109 text

Agenda 0. Introduction 1. data.frame 2. Tidy data 3. Pipe 4. Verbs (done) (done)

Slide 110

Slide 110 text

Summary

Slide 111

Slide 111 text

> list3
$A
[1] 1 2 3
$B
[1] 11 12 13
> df1
  A  B
1 1 11
2 2 12
3 3 13
(rows: observation, columns: variable)

Slide 112

Slide 112 text

In tidy data: 1. Each variable forms a column. 2. Each observation forms a row. 3. Each value must have its own cell. data tidying

Slide 113

Slide 113 text

"Wide" data "Nested" data "Long" data nest unnest gather spread input output visualization Data style manipulation in {tidyr} Loops, Summarization, Feature extractions et al., ...

Slide 114

Slide 114 text

Pipe %>% {magrittr}
X %>% f        is  f(X)
X %>% f(y)     is  f(X, y)
X %>% f %>% g  is  g(f(X))
X %>% f(y, .)  is  f(y, X)
"Re-introduction to dplyr (Basics)" by yutannihilation https://speakerdeck.com/yutannihilation/dplyrzai-ru-men-ji-ben-bian

Slide 115

Slide 115 text

Functions are verbs: ① lift ② take ③ pour ④ put
result <- Robot %>%
  lift(glass, table) %>%
  take(fridge, milk) %>%
  pour(milk, glass) %>%
  put(glass, table)

Slide 116

Slide 116 text

{dplyr}
filter(df, x == "a", y == 1)     # verb (verb-like)
df[df$x == "a" & df$y == 1, ]    # noun (noun-like)
df %>% filter(x == "a", y == 1)

Slide 117

Slide 117 text

verbs {dplyr}: verb functions
mutate     # add column
select     # select column
filter     # select row
arrange    # arrange row
summarise  # summary of vars

Slide 118

Slide 118 text

Data Pipeline readable coding

Slide 119

Slide 119 text

https://www.tidyverse.org/

Slide 120

Slide 120 text

report share visualize tools manipulate import export https://www.tidyverse.org/

Slide 121

Slide 121 text

Programming languages are languages Write Run Read Think Write Run Read Think Communicate Share

Slide 122

Slide 122 text

“Life shrinks or expands to one’s courage.” -- Anaïs Nin, 2000 http://theamericanreader.com

Slide 123

Slide 123 text

Before After BeginneR Session BeginneR BeginneR ?

Slide 124

Slide 124 text

No content

Slide 125

Slide 125 text

bar dradra