BeginneR Session - Data Pipeline - #75 Tokyo.R 2019.01.19 @kilometer00

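Who!? Name: Mimura @kilometer Occupation: postdoc (Ph.D. in Engineering) Specialty: behavioral neuroscience (primates), brain imaging, medical systems engineering R experience: about 10 years Current craze: a stove with a built-in grill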

BeginneR Session

BeginneR Advanced Hoxo_m If I have seen further it is by standing on the shoulders of Giants. -- Sir Isaac Newton, 1676

Before After BeginneR Session BeginneR BeginneR

BeginneR Session - Data Pipeline -

Input Output Data Pipeline

packages you

Input Output packages Data Pipeline

Output Input Input Data Pipeline

Data Pipeline

Data Pipeline readable coding

Programming Write Run Read Think

Programming Write Run Read Think coding style

R coding style guides
The tidyverse style guide: "Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread."
Google's R Style Guide: "The goal of the R Programming Style Guide is to make our R code easier to read, share, and verify."

The tidyverse style guides

The tidyverse style guide: "Where possible, avoid re-using names of common functions and variables. This will cause confusion for the readers of your code."
For example, data is already the name of an R function:
> data
function(..., list = character(), package = NULL, lib.loc = NULL, verbose = getOption("verbose"), envir = .GlobalEnv) { fileExt <- function(x) { db <- grepl("\\.[^.]+\\.(gz|bz2|xz)$", x) ans <- sub(".*\\.", "", x) ...
# Good
df <- read.csv("hoge.csv")
dat <- read.csv("hoge.csv")
# Bad
data <- read.csv("hoge.csv")

# Bad
for(i in 1:10){
print(i)
}
# Good
for(i in 1:10){
  print(i)
}
copy (cut) & paste: auto-indentation (in RStudio)
Details: RStudio > Preferences > Code > Editing

Programming Write Run Read Think Write Run Read Think Share

Text Figure Information Intention Data decode encode feedback Programming

Boolean operators (Boolean Algebra)
A == B    # equal to
A != B    # not equal to
A | B     # or
A & B     # and
A %in% B  # is A in B?
George Boole 1815 - 1864 (wikipedia)

Boolean operators (Boolean Algebra)
"a" != "b"     # not equal to
[1] TRUE
1 %in% 10:100  # is A in B?
[1] FALSE

George Boole 1815 - 1864 Mathematician & Philosopher A Class-Room Introduction to Logic m/category/laws-of-thoughts/
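
The operators above also work element by element on whole vectors, which is how they are mostly used in R. A minimal sketch, assuming two small made-up vectors a and b (not from the slides):

```r
# Hypothetical example vectors (not from the original slides)
a <- c(1, 2, 3)
b <- c(1, 5, 3)

eq   <- a == b             # element-wise "equal to"
neq  <- a != b             # element-wise "not equal to"
or   <- (a > 2) | (b > 2)  # TRUE where at least one side is TRUE
and  <- (a > 2) & (b > 2)  # TRUE only where both sides are TRUE
isin <- a %in% c(2, 3)     # is each element of a found in c(2, 3)?

print(data.frame(a, b, eq, neq, or, and, isin))
```

Each result is a logical vector of the same length as a, so it can be used directly to pick elements, for example a[isin].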

Programming Write Run Read Think Write Run Read Think Communicate Share

Input Output packages Data Pipeline

Integrated Development Environment RStudio

Projects RStudio

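RStudio > Project
"What is RStudio...? What is a Project...????" @wakuteka
"Among the advantages of RStudio, of which some say there are 2147483647, few would dispute that the following have greatly improved the QOL of the 2147483647 R users nationwide:
• source files such as R scripts can be displayed side by side in tabs
• the order of those tabs is preserved
• even if you quit RStudio without saving a file, the in-progress content of its tab is retained"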

File > New Project… > New Directory > New Project hogehoge

hogehoge ~/Documents/R hogehoge.Rproj .Rproj.user Project Root Directory Double click!! .RData .Rhistory Auto saved project information Open project New!!

~/Documents/R project1 project2 project3

vector in Excel

vector: in Excel, in R
pre <- c(1, 2, 3, 4, 5)
post <- pre * 5
> pre
[1] 1 2 3 4 5
> post
[1] 5 10 15 20 25

vector
vec1 <- c(1, 2, 3, 4, 5)
vec2 <- 1:5
vec3 <- seq(from = 1, to = 5, by = 1)
> vec1
[1] 1 2 3 4 5
> vec2
[1] 1 2 3 4 5
> vec3
[1] 1 2 3 4 5

vector
vec1 <- seq(from = 1, to = 5, by = 1)
vec2 <- seq(1, 5, 1)
> vec1
[1] 1 2 3 4 5
> vec2
[1] 1 2 3 4 5

vector
> ?seq
seq {base}
Sequence Generation
Description
Generate regular sequences. seq is a standard generic with a default method. …
Usage
seq(...)
## Default S3 method:
seq(from = 1, to = 1, by = ((to - from)/(length.out - 1)), length.out = NULL, along.with = NULL, ...)

vector
vec1 <- rep(1:3, times = 2)
vec2 <- rep(1:3, each = 2)
vec3 <- rep(1:3, times = 2, each = 2)
> vec1
[1] 1 2 3 1 2 3
> vec2
[1] 1 1 2 2 3 3
> vec3
[1] 1 1 2 2 3 3 1 1 2 2 3 3

vector
vec1 <- 11:15
> vec1
[1] 11 12 13 14 15
> vec1[1]
[1] 11
> vec1[3:5]
[1] 13 14 15
> vec1[c(1:2, 5)]
[1] 11 12 15

list
list1 <- list(1:6, 11:15, c("a", "b", "c"))
> list1
[[1]]
[1] 1 2 3 4 5 6
[[2]]
[1] 11 12 13 14 15
[[3]]
[1] "a" "b" "c"

list
list1 <- list(1:6, 11:15, c("a", "b", "c"))
> list1[[1]]
[1] 1 2 3 4 5 6
> list1[[3]][2:3]
[1] "b" "c"
> list1[[2]] * 3
[1] 33 36 39 42 45

named list
list2 <- list(A = 1:6, B = 11:15, C = c("a", "b", "c"))
> list2
$A
[1] 1 2 3 4 5 6
$B
[1] 11 12 13 14 15
$C
[1] "a" "b" "c"

named list
list2 <- list(A = 1:6, B = 11:15, C = c("a", "b", "c"))
> list2$A
[1] 1 2 3 4 5 6
> list2$C[2:3]
[1] "b" "c"
> list2$B * 3
[1] 33 36 39 42 45

list
list1 <- list(1:6, 11:15, c("a", "b", "c"))
> class(list1)
[1] "list"
> names(list1)
NULL
named list
list2 <- list(A = 1:6, B = 11:15, C = c("a", "b", "c"))
> class(list2)
[1] "list"
> names(list2)
[1] "A" "B" "C"

named list & data.frame
list3 <- list(A = 1:3, B = 11:13)
> class(list3)
[1] "list"
> names(list3)
[1] "A" "B"
df1 <- data.frame(A = 1:3, B = 11:13)
> class(df1)
[1] "data.frame"
> names(df1)
[1] "A" "B"

named list & data.frame
list3 <- list(A = 1:3, B = 11:13)
df1 <- data.frame(A = 1:3, B = 11:13)
> str(list3)
List of 2
 $ A: int [1:3] 1 2 3
 $ B: int [1:3] 11 12 13
> str(df1)
'data.frame': 3 obs. of 2 variables:
 $ A: int 1 2 3
 $ B: int 11 12 13

named list & data.frame (rows are observations, columns are variables)
> list3
$A
[1] 1 2 3
$B
[1] 11 12 13
> df1
  A  B
1 1 11
2 2 12
3 3 13
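
The comparison above suggests that a data.frame behaves like a named list whose elements all have the same length. A minimal sketch of that reading, reusing list3 and df1 from the slides; the uneven list and the tryCatch wrapper are illustrative additions, not part of the original deck:

```r
list3 <- list(A = 1:3, B = 11:13)
df1 <- data.frame(A = 1:3, B = 11:13)

# A data.frame is stored as a list, so list-style access works on it
df_is_list <- is.list(df1)
same_by_dollar <- identical(df1$A, list3$A)
same_by_bracket <- identical(df1[["B"]], list3[["B"]])

# A named list with equal-length elements converts to a data.frame
df_from_list <- as.data.frame(list3)

# Elements of different lengths cannot form the rows-by-columns shape
uneven <- list(A = 1:3, B = 11:12)  # hypothetical uneven list
uneven_result <- tryCatch(as.data.frame(uneven), error = function(e) "error")

print(c(df_is_list, same_by_dollar, same_by_bracket))
print(df_from_list)
print(uneven_result)
```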

data.frame v.s. matrix
df1 <- data.frame(A = 1:3, B = 11:13)
  A  B
1 1 11
2 2 12
3 3 13
> str(df1)
'data.frame': 3 obs. of 2 variables:
 $ A: int 1 2 3
 $ B: int 11 12 13
mat1 <- matrix(c(1:3, 11:13), 3, 2)
     [,1] [,2]
[1,]    1   11
[2,]    2   12
[3,]    3   13
> str(mat1)
 int [1:3, 1:2] 1 2 3 11 12 13
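
One practical difference visible in the str() output is that a matrix has a single type for all cells, while each data.frame column keeps its own type. A small sketch of what happens when a character column is added to each; the column of letters is a made-up addition for illustration:

```r
mat1 <- matrix(c(1:3, 11:13), 3, 2)
df1 <- data.frame(A = 1:3, B = 11:13)
letters3 <- c("a", "b", "c")  # hypothetical extra column

# Matrix: every cell is coerced to one common type
mat2 <- cbind(mat1, letters3)
type_before <- typeof(mat1)
type_after <- typeof(mat2)

# data.frame: existing columns keep their type
df2 <- df1
df2$C <- letters3
col_classes <- sapply(df2, class)

print(c(before = type_before, after = type_after))
print(col_classes)
```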

(preceding text omitted) ... So, "data.frame" exists only in our minds. That square-ish thing everyone calls a "data.frame" is the "data.frame".

In tidy data:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each value must have its own cell.
[Figure labels: Different observation, data, Value, Label]

data tidying

"Horizontal" tidy data variables

"Horizontal" tidy data "Vertical" tidy data gather(df, key, value, -c(obsid, group)) {tidyr} variables

"Horizontal" style "Vertical" style gather(df, key, value, -c(obsid, group)) {tidyr} variables variables
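
The gather() call on this slide can be tried on a small table. A minimal sketch, assuming the tidyr package is installed and using a made-up df with the obsid and group columns named in the slide plus two hypothetical measurement columns x and y:

```r
library(tidyr)

# Hypothetical "horizontal" data: one row per observation
df <- data.frame(
  obsid = 1:4,
  group = c("a", "a", "b", "b"),
  x = c(1.1, 2.2, 3.3, 4.4),
  y = c(10, 20, 30, 40)
)

# "Horizontal" -> "Vertical": x and y are stacked into key/value pairs
df_long <- gather(df, key, value, -c(obsid, group))

# "Vertical" -> "Horizontal": spread the key/value pairs back into columns
df_wide <- spread(df_long, key, value)

print(df_long)
print(df_wide)
```

The tidyr documentation marks gather() and spread() as superseded by pivot_longer() and pivot_wider(); the older functions are kept here to match the slides.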

"Horizontal" style "Nested" style nest(group_by(df, group)) {tidyr}

"Nested" style df2 <- nest(group_by(df, group)) {tidyr}
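
A minimal sketch of the nested style and the way back with unnest(), assuming dplyr and tidyr (version 1.0 or later, where unnest() takes a cols argument) are installed. The small df is made up for illustration:

```r
library(dplyr)
library(tidyr)

# Hypothetical data with a grouping column
df <- data.frame(
  obsid = 1:4,
  group = c("a", "a", "b", "b"),
  x = c(1.1, 2.2, 3.3, 4.4)
)

# "Horizontal" -> "Nested": one row per group, other columns in a list-column
df2 <- nest(group_by(df, group))

# Each element of the list-column is itself a small data frame
rows_per_group <- sapply(df2$data, nrow)

# "Nested" -> "Horizontal"
df3 <- unnest(df2, cols = data)

print(df2)
print(rows_per_group)
print(df3)
```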

"Horizontal" data "Nested" data "Vertical" data nest unnest gather spread input output visualization Data style manipulation in {tidyr} Loops, Summarization, Feature extractions et al., ...

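Pipe %>% {magrittr}
X %>% f          is equivalent to  f(X)
X %>% f(y)       is equivalent to  f(X, y)
X %>% f %>% g    is equivalent to  g(f(X))
X %>% f(y, .)    is equivalent to  f(y, X)
Source: "dplyr再入門(基本編)" (dplyr Re-introduction: Basics) by yutanihilation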
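
The four pipe forms on this slide can be checked against their plain function-call counterparts. A minimal sketch, assuming magrittr is installed and using base functions (sqrt, round, sum, paste) in place of the placeholder f and g:

```r
library(magrittr)

X <- c(4, 9, 16)
Y <- c(1.234, 5.678)

# X %>% f         versus  f(X)
same1 <- identical(X %>% sqrt, sqrt(X))

# X %>% f(y)      versus  f(X, y)
same2 <- identical(Y %>% round(1), round(Y, 1))

# X %>% f %>% g   versus  g(f(X))
same3 <- identical(X %>% sqrt %>% sum, sum(sqrt(X)))

# X %>% f(y, .)   versus  f(y, X): the dot marks where X goes
same4 <- identical(X %>% paste("value:", .), paste("value:", X))

print(c(same1, same2, same3, same4))
```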

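Pipe %>% {magrittr}
[Voices of the addicts (devoted users)]
"Lately I type nothing but pipes."
"People who use other languages all think the pipe is a nice thing, too."
"I shifted over to this (the pipe) slowly, over about a year."
Source: "Rコミュニティ四方山話" (R community chit-chat)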

① ② ③ ④ lift take pour put Bring milk from the kitchen!

Bring milk from the kitchen!
① lift: lift(Robot, glass, table) -> Robot'
② take: take(Robot', fridge, milk) -> Robot''

Bring milk from the kitchen!
Robot' <- lift(Robot, glass, table)      # ①
Robot'' <- take(Robot', fridge, milk)    # ②
Robot''' <- pour(Robot'', milk, glass)   # ③
result <- put(Robot''', glass, table)    # ④
by using pipe,
result <- Robot %>%
  lift(glass, table) %>%   # ①
  take(fridge, milk) %>%   # ②
  pour(milk, glass) %>%    # ③
  put(glass, table)        # ④

The tidyverse style guides "There are only two hard things in Computer Science: cache invalidation and naming things"

Bring milk from the kitchen!
Robot' <- lift(Robot, glass, table)      # ①
Robot'' <- take(Robot', fridge, milk)    # ②
Robot''' <- pour(Robot'', milk, glass)   # ③
result <- put(Robot''', glass, table)    # ④
by using pipe,
result <- Robot %>%
  lift(glass, table) %>%   # ①
  take(fridge, milk) %>%   # ②
  pour(milk, glass) %>%    # ③
  put(glass, table)        # ④
Thinking / Reading

Programming Write Run Read Think Write Run Read Think Communicate Share

Bring milk from the kitchen!
① lift ② take ③ pour ④ put
result <- Robot %>%
  lift(glass, table) %>%
  take(fridge, milk) %>%
  pour(milk, glass) %>%
  put(glass, table)

Define an original function
please_bring <- function(someone, milk, glass, table = dining_table, fridge = kitchen_fridge){
  someone %>%
    lift(glass, table) %>%
    take(fridge, milk) %>%
    pour(milk, glass) %>%
    put(glass, table)
}
Usage
RobotA %>% please_bring(milk, my_glass)
RobotB %>% please_bring(cold_tea, her_glass)
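
The robot functions in the slides are pseudocode. To actually run the pipeline, one possible sketch is shown below, assuming magrittr is installed. The act() helper, the list-based robot and the string arguments are all hypothetical stand-ins invented for this example; each verb simply appends a line to the robot's log and returns the robot so the next verb can be piped onto it.

```r
library(magrittr)

# Hypothetical helper: record one action in the actor's log, return the actor
act <- function(someone, ...) {
  someone$log <- c(someone$log, paste(...))
  someone
}

# Hypothetical stand-ins for the four verbs in the slides
lift <- function(someone, glass, table) act(someone, "lift", glass, "from", table)
take <- function(someone, fridge, item) act(someone, "take", item, "from", fridge)
pour <- function(someone, item, glass) act(someone, "pour", item, "into", glass)
put <- function(someone, glass, table) act(someone, "put", glass, "on", table)

please_bring <- function(someone, milk, glass,
                         table = "dining table", fridge = "kitchen fridge") {
  someone %>%
    lift(glass, table) %>%
    take(fridge, milk) %>%
    pour(milk, glass) %>%
    put(glass, table)
}

robot_a <- list(name = "RobotA", log = character())
result <- robot_a %>% please_bring("milk", "my glass")
print(result$log)
```

Because every verb takes the actor as its first argument and returns it, the same function can be reused with different actors and items, as in the Usage lines of the slide.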

General guidance for naming things
• nouns for variables
• verbs for functions

General guidance for naming things
• nouns for variables
• verbs for functions
Conversely,
• variables are nouns
• functions are verbs

Functions are verbs.

{dplyr}
filter(df, x == "a", y == 1)      # verb-like
df[df$x == "a" & df$y == 1, ]     # noun-like
df %>% filter(x == "a", y == 1)   # verb with pipe
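
The three spellings above select the same rows. A minimal sketch that checks this on a made-up df with columns x and y, assuming dplyr is installed:

```r
library(dplyr)

# Hypothetical data with the x and y columns used in the slide
df <- data.frame(x = c("a", "a", "b"), y = c(1, 2, 1), stringsAsFactors = FALSE)

by_verb <- filter(df, x == "a", y == 1)     # verb-like
by_index <- df[df$x == "a" & df$y == 1, ]   # noun-like
by_pipe <- df %>% filter(x == "a", y == 1)  # verb with pipe

print(by_verb)
print(by_index)
print(by_pipe)
```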

verbs {dplyr}: verb functions
mutate     # add column
select     # select column
filter     # select row
arrange    # arrange row
summarise  # summary of vars

verbs {dplyr}
"It (dplyr) provides simple "verbs", functions that correspond to the most common data manipulation tasks, to help you translate your thoughts into code."
Introduction to dplyr

verbs {dplyr}
dplyr provides "verbs" for translating your thoughts into code: a set of functions that carry out the basics of data manipulation simply. (a fairly loose translation)
Introduction to dplyr

verbs {dplyr} mutate # add columns

verbs {dplyr}
library(dplyr)
iris %>% mutate(a = 1:nrow(.)) %>% str
'data.frame': 150 obs. of 6 variables:
 $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 ...
 $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 ...
 $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7...
 $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 ...
 $ Species : Factor w/ 3 levels "setosa", ...
 $ a : int 1 2 3 4 5 6 7 8 9 10 ...

verbs {dplyr}: overwriting a column
library(dplyr)
# Note: %>% binds more tightly than * and /, so round is applied only to 3 and a is not rounded; use round(a * 5/3) to round the result
iris %>% mutate(a = 1:nrow(.), a = a * 5/3 %>% round) %>% str
'data.frame': 150 obs. of 6 variables:
 $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 ...
 $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 ...
 $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7...
 $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 ...
 $ Species : Factor w/ 3 levels "setosa", ...
 $ a : num 1.67 3.33 5 6.67 8.33 ...
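
The unrounded values of a in the output follow from operator precedence: in R, %>% binds more tightly than * and /, so the expression in the slide is read as a * 5 / (3 %>% round). A minimal sketch with a short vector, assuming magrittr is installed:

```r
library(magrittr)

a <- 1:3

# Parsed as a * 5 / (3 %>% round): only the 3 is rounded
unrounded <- a * 5/3 %>% round

# Parentheses (or a plain function call) round the whole result
rounded <- (a * 5/3) %>% round
also_rounded <- round(a * 5/3)

print(unrounded)
print(rounded)
print(also_rounded)
```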

verbs {dplyr} select # select columns

verbs {dplyr}
library(dplyr)
iris %>% select(Sepal.Length, Sepal.Width) %>% str
'data.frame': 150 obs. of 2 variables:
 $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 ...
 $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 ...

verbs {dplyr}: select helper functions
library(dplyr)
iris %>% select(contains("Width")) %>% str
'data.frame': 150 obs. of 2 variables:
 $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 ...
 $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 ...

verbs {dplyr}
# Select helper functions
starts_with("s")
ends_with("s")
contains("se")
matches("^.e")
one_of(c("Sepal.Length", "Species"))
everything()
Source: "dplyr::selectの活用例メモ" (Notes on usage examples of dplyr::select) by kazutan
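
To see what each helper picks, the calls listed above can be applied to the built-in iris data. A minimal sketch, assuming dplyr is installed; the variable names on the left are made up for this example:

```r
library(dplyr)

# Column names chosen by each helper (matching ignores case by default)
cols_starts <- names(select(iris, starts_with("s")))
cols_ends <- names(select(iris, ends_with("s")))
cols_contains <- names(select(iris, contains("se")))
cols_matches <- names(select(iris, matches("^.e")))  # regular expression
cols_one_of <- names(select(iris, one_of(c("Sepal.Length", "Species"))))
cols_all <- names(select(iris, everything()))

print(list(
  starts_with = cols_starts,
  ends_with = cols_ends,
  contains = cols_contains,
  matches = cols_matches,
  one_of = cols_one_of,
  everything = cols_all
))
```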

verbs {dplyr}: verb functions
mutate     # add columns
select     # select columns
filter     # select rows
arrange    # sort rows
summarise  # aggregate values

verbs {dplyr} filter # select rows

verbs {dplyr}
library(dplyr)
iris %>% filter(Species == "versicolor") %>% str
'data.frame': 50 obs. of 5 variables:
 $ Sepal.Length: num 7 6.4 6.9 5.5 6.5 5.7 6.3 ...
 $ Sepal.Width : num 3.2 3.2 3.1 2.3 2.8 2.8 ...
 $ Petal.Length: num 4.7 4.5 4.9 4 4.6 4.5 4.7 ...
 $ Petal.Width : num 1.4 1.5 1.5 1.3 1.5 1.3 ...
 $ Species : Factor w/ 3 levels "setosa","versicolor",..: 2 2 2 2 2 2 2 2 2 2 ...

verbs {dplyr}: NSE (Non-Standard Evaluation)
library(dplyr)
iris %>% filter(Species == "versicolor") %>% str
'data.frame': 50 obs. of 5 variables:
 $ Sepal.Length: num 7 6.4 6.9 5.5 6.5 5.7 6.3 ...
 $ Sepal.Width : num 3.2 3.2 3.1 2.3 2.8 2.8 ...
 $ Petal.Length: num 4.7 4.5 4.9 4 4.6 4.5 4.7 ...
 $ Petal.Width : num 1.4 1.5 1.5 1.3 1.5 1.3 ...
 $ Species : Factor w/ 3 levels "setosa","versicolor",..: 2 2 2 2 2 2 2 2 2 2 ...

About NSE
filter(df, x == "a", y == 1)     NSE (Non-Standard Evaluation)
df[df$x == "a" & df$y == 1, ]    SE (Standard Evaluation)
Programming with dplyr

About NSE
filter(df, x == "a", y == 1)
df[df$x == "a" & df$y == 1, ]
With NSE,
• you do not have to write the name of df over and over
• you can write it in an SQL-like way
Programming with dplyr

About NSE
filter(df, x == "a", y == 1)     # verb-like
df[df$x == "a" & df$y == 1, ]    # noun-like
With NSE, there are various trade-offs, but clean code is justice (personal opinion).
Easy to write, easy to read. Keep thinking and implementation close together.
Programming with dplyr

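About NSE
Because of NSE...
df <- data.frame(x = 1:3, y = 1:3)
filter(df, x == 1)
my_var <- "x"
filter(df, my_var == 1)
This does NOT work: there is no "my_var" column in df, so my_var is evaluated as the string "x", the comparison "x" == 1 is FALSE, and no rows are returned.
Programming with dplyr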

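About NSE
If you reeeally want to do it,
my_var <- quo(x)
filter(df, (!! my_var) == 1)
For why it works this way, see "dplyr再入門(Tidyeval編)" (dplyr Re-introduction: Tidyeval) by yutanihilation.

About NSE
If you reeeally want to do it,
my_var <- quo(x)
filter(df, (!! my_var) == 1)
For why it works this way, see "dplyr再入門(Tidyeval編)" (dplyr Re-introduction: Tidyeval) by yutanihilation.
Does readability go up or down? That depends on you and your readers.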
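
The NSE behaviour discussed in these slides can be reproduced with row counts. A minimal sketch, assuming dplyr is installed. The .data[[...]] form in the last step is an addition not shown in the slides; it is dplyr's pronoun for looking up a column by a name stored as a string.

```r
library(dplyr)

df <- data.frame(x = 1:3, y = 1:3)

# Plain NSE: x is looked up as a column of df
n_column <- nrow(filter(df, x == 1))

# A string variable is not a column: this compares "x" with 1, which is FALSE
my_var <- "x"
n_string <- nrow(filter(df, my_var == 1))

# Quote the column and unquote it with !! inside filter()
my_quo <- quo(x)
n_quo <- nrow(filter(df, (!!my_quo) == 1))

# Alternative (not in the slides): look the column up by its name
n_pronoun <- nrow(filter(df, .data[[my_var]] == 1))

print(c(column = n_column, string = n_string, quo = n_quo, pronoun = n_pronoun))
```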

verbs {dplyr}: verb functions
mutate     # add columns
select     # select columns
filter     # select rows
arrange    # sort rows
summarise  # aggregate values
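
The slides walk through mutate, select and filter; the remaining two verbs in the list can be sketched in the same way on iris. This example is an addition, assuming dplyr is installed; group_by() and desc() are further dplyr functions used here to make the example meaningful.

```r
library(dplyr)

# arrange: sort rows, here by descending Sepal.Length
top3 <- iris %>%
  arrange(desc(Sepal.Length)) %>%
  head(3)

# summarise: collapse each group to a single row of summary values
petal_means <- iris %>%
  group_by(Species) %>%
  summarise(mean_petal = mean(Petal.Length))

print(top3)
print(petal_means)
```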

Grammar of data manipulation
"By constraining your options, it helps you think about your data manipulation challenges."
Introduction to dplyr

Grammar of data manipulation
By limiting your choices, you can think about the steps of your data analysis more simply. (a very loose translation)
Introduction to dplyr

"The more constraints one imposes, the more one frees one's self of the chains that shackle the spirit."
Igor Stravinsky (Игорь Ф. Стравинский), 1882 - 1971

Rows are observations, columns are variables
> list3
$A
[1] 1 2 3
$B
[1] 11 12 13
> df1
  A  B
1 1 11
2 2 12
3 3 13

"Horizontal" data "Nested" data "Vertical" data nest unnest gather spread input output visualization Data style manipulation in {tidyr} Loops, Summarization, Feature extractions et al., ...

Functions are verbs
① lift ② take ③ pour ④ put
result <- Robot %>%
  lift(glass, table) %>%
  take(fridge, milk) %>%
  pour(milk, glass) %>%
  put(glass, table)

Data Pipeline readable coding

Programming languages are languages Write Run Read Think Write Run Read Think Communicate Share

“Life shrinks or expands to one’s courage.” -- Anaïs Nin, 2000

Before After BeginneR Session BeginneR BeginneR ?

