Tokyo.R#75 BeginneRSession-data pipeline

BeginneR Session - Data Pipeline - #75 Tokyo.R 2019.01.19 @kilometer00

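Who!?

Name: Mimura (@kilometer)
Occupation: postdoc (PhD in engineering)
Specialties: behavioral neuroscience (primates), brain imaging, medical systems engineering
R experience: about 10 years
Current craze: a stove with a built-in grill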

BeginneR Session

BeginneR

BeginneR, Advanced, Hoxo_m
"If I have seen further it is by standing on the shoulders of Giants." -- Sir Isaac Newton, 1676

Before After BeginneR Session BeginneR BeginneR

BeginneR Session - Data Pipeline -

Input Output Data Pipeline

packages you

Input Output packages Data Pipeline

Output Input Input Data Pipeline

Data Pipeline

Data Pipeline readable coding

Programming: Write, Run, Read, Think

Run!!! https://www.amazon.co.jp/dp/B00Y0UI990/

Programming: Write, Run, Read, Think (coding style)

R coding style guides

The tidyverse style guide https://style.tidyverse.org/
"Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread."

Google's R Style Guide https://google.github.io/styleguide/Rguide.html
"The goal of the R Programming Style Guide is to make our R code easier to read, share, and verify."

The tidyverse style guides https://style.tidyverse.org/syntax.html#object-names

> data
function(..., list = character(), package = NULL, lib.loc = NULL,
         verbose = getOption("verbose"), envir = .GlobalEnv)
{
    fileExt <- function(x) {
        db <- grepl("\\.[^.]+\\.(gz|bz2|xz)$", x)
        ans <- sub(".*\\.", "", x)
        ...

"Where possible, avoid re-using names of common functions and variables. This will cause confusion for the readers of your code."

# Good
df <- read.csv("hoge.csv")
dat <- read.csv("hoge.csv")

# Bad
data <- read.csv("hoge.csv")

# Bad
for(i in 1:10){
print(i)
}

# Good
for(i in 1:10){
  print(i)
}

copy (cut) & paste: auto-indentation (in RStudio)
Details: RStudio > Preference > Code > Editing

Programming: Write, Run, Read, Think; Write, Run, Read, Think; Share

Programming: Text, Figure, Information, Intention, Data; encode, decode, feedback

Boolean operators (Boolean algebra)
George Boole, 1815 - 1864 (image: Wikipedia)

A == B    # equal to
A != B    # not equal to
A | B     # or
A & B     # and
A %in% B  # is A in B?

> "a" != "b"     # not equal to
[1] TRUE
> 1 %in% 10:100  # is A in B?
[1] FALSE
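The operators above also work element by element on vectors, which is how they are typically used later with filter(). A minimal sketch follows; the vector x and the result names are made up for illustration and do not appear on the slides.

```r
# Hypothetical example vector (not from the slides)
x <- c(1, 5, 10, 50)

is_five    <- x == 5           # equal to
not_five   <- x != 5           # not equal to
outside    <- x < 5 | x > 10   # or
inside     <- x > 1 & x < 50   # and
in_range   <- x %in% 10:100    # is each element of x in 10:100?

# A logical vector can be used to pick the matching elements
picked <- x[in_range]
print(in_range)
print(picked)
```

Each comparison returns one TRUE or FALSE per element, so the result has the same length as x.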

George Boole, 1815 - 1864, Mathematician & Philosopher
A Class-Room Introduction to Logic https://niyamaklogic.wordpress.com/category/laws-of-thoughts/

Programming

Programming: Write, Run, Read, Think; Write, Run, Read, Think; Communicate, Share

Input Output packages Data Pipeline

Integrated Development Environment RStudio https://www.rstudio.com/

RStudio

Projects RStudio

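RStudio > Project
"Among the advantages of RStudio, said by some to number 2147483647, the fact that it shows source files such as R scripts side by side in tabs, that it remembers the order of those tabs, and that it keeps the in-progress contents of a tab even if you quit RStudio without saving the file has, surely without dispute, greatly improved the QOL of all 2147483647 R users nationwide."
"What is RStudio...? What is a Project...????" @wakuteka https://qiita.com/wakuteka/items/9599bb0a8985d98928d7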

File > New Project… > New Directory > New Project
hogehoge

Project root directory: ~/Documents/R/hogehoge (New!!)
hogehoge.Rproj: double click to open the project
.Rproj.user, .RData, .Rhistory: auto-saved project information

~/Documents/R project1 project2 project3

Agenda
0. Introduction (done)
1. data.frame
2. Tidy data
3. Pipe
4. Verbs

vector in Excel

vector: in Excel, in R

pre <- c(1, 2, 3, 4, 5)
post <- pre * 5

> pre
[1] 1 2 3 4 5
> post
[1]  5 10 15 20 25

vector

vec1 <- c(1, 2, 3, 4, 5)
vec2 <- 1:5
vec3 <- seq(from = 1, to = 5, by = 1)

> vec1
[1] 1 2 3 4 5
> vec2
[1] 1 2 3 4 5
> vec3
[1] 1 2 3 4 5

vector

vec1 <- seq(from = 1, to = 5, by = 1)
vec2 <- seq(1, 5, 1)

> vec1
[1] 1 2 3 4 5
> vec2
[1] 1 2 3 4 5

vector

> ?seq
seq {base}
Sequence Generation

Description
Generate regular sequences. seq is a standard generic with a default method. …

Usage
seq(...)
## Default S3 method:
seq(from = 1, to = 1, by = ((to - from)/(length.out - 1)),
    length.out = NULL, along.with = NULL, ...)
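The usage line from the help page explains why seq(1, 5, 1) and seq(from = 1, to = 5, by = 1) give the same result: unnamed arguments are matched by position to from, to, by. A small sketch of the different ways to call it (the variable names here are chosen for illustration only):

```r
# Arguments matched by name
by_name <- seq(from = 1, to = 5, by = 1)

# Arguments matched by position: from, to, by
by_position <- seq(1, 5, 1)

# Named arguments can be given in any order
reordered <- seq(by = 1, to = 5, from = 1)

# Ask for a number of points instead of a step size
by_length <- seq(from = 1, to = 5, length.out = 5)

print(by_name)
print(by_length)
```

Naming the arguments is longer to type but easier to read, which fits the readable-coding theme of this session.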

vector

vec1 <- rep(1:3, times = 2)
vec2 <- rep(1:3, each = 2)
vec3 <- rep(1:3, times = 2, each = 2)

> vec1
[1] 1 2 3 1 2 3
> vec2
[1] 1 1 2 2 3 3
> vec3
 [1] 1 1 2 2 3 3 1 1 2 2 3 3

vector

vec1 <- 11:15

> vec1
[1] 11 12 13 14 15
> vec1[1]
[1] 11
> vec1[3:5]
[1] 13 14 15
> vec1[c(1:2, 5)]
[1] 11 12 15

list

list1 <- list(1:6, 11:15, c("a", "b", "c"))

> list1
[[1]]
[1] 1 2 3 4 5 6

[[2]]
[1] 11 12 13 14 15

[[3]]
[1] "a" "b" "c"

list

list1 <- list(1:6, 11:15, c("a", "b", "c"))

> list1[[1]]
[1] 1 2 3 4 5 6
> list1[[3]][2:3]
[1] "b" "c"
> list1[[2]] * 3
[1] 33 36 39 42 45

named list

list2 <- list(A = 1:6, B = 11:15, C = c("a", "b", "c"))

> list2
$A
[1] 1 2 3 4 5 6

$B
[1] 11 12 13 14 15

$C
[1] "a" "b" "c"

named list

list2 <- list(A = 1:6, B = 11:15, C = c("a", "b", "c"))

> list2$A
[1] 1 2 3 4 5 6
> list2$C[2:3]
[1] "b" "c"
> list2$B * 3
[1] 33 36 39 42 45

list and named list

list1 <- list(1:6, 11:15, c("a", "b", "c"))
> class(list1)
[1] "list"
> names(list1)
NULL

list2 <- list(A = 1:6, B = 11:15, C = c("a", "b", "c"))
> class(list2)
[1] "list"
> names(list2)
[1] "A" "B" "C"

named list & data.frame

list3 <- list(A = 1:3, B = 11:13)
> class(list3)
[1] "list"
> names(list3)
[1] "A" "B"

df1 <- data.frame(A = 1:3, B = 11:13)
> class(df1)
[1] "data.frame"
> names(df1)
[1] "A" "B"

named list & data.frame

list3 <- list(A = 1:3, B = 11:13)
df1 <- data.frame(A = 1:3, B = 11:13)

> str(list3)
List of 2
 $ A: int [1:3] 1 2 3
 $ B: int [1:3] 11 12 13
> str(df1)
'data.frame':	3 obs. of  2 variables:
 $ A: int  1 2 3
 $ B: int  11 12 13

named list & data.frame

> list3
$A
[1] 1 2 3

$B
[1] 11 12 13

> df1
  A  B
1 1 11
2 2 12
3 3 13

In df1, each row is an observation and each column is a variable.

data.frame v.s. matrix

df1 <- data.frame(A = 1:3, B = 11:13)
mat1 <- matrix(c(1:3, 11:13), 3, 2)

> df1
  A  B
1 1 11
2 2 12
3 3 13
> mat1
     [,1] [,2]
[1,]    1   11
[2,]    2   12
[3,]    3   13

> str(df1)
'data.frame':	3 obs. of  2 variables:
 $ A: int  1 2 3
 $ B: int  11 12 13
> str(mat1)
 int [1:3, 1:2] 1 2 3 11 12 13
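One practical consequence of the str() output above: a data.frame is a list of columns, so each column can have its own type, while a matrix is a single vector with dimensions and holds one type only. A minimal sketch, with the objects df2 and mat2 made up for illustration:

```r
# Hypothetical objects (not from the slides): one integer and one character column
df2  <- data.frame(id = 1:3, label = c("a", "b", "c"), stringsAsFactors = FALSE)
mat2 <- cbind(id = 1:3, label = c("a", "b", "c"))

# data.frame: a list of columns, each keeping its own type
df_is_list   <- is.list(df2)
df_col_types <- sapply(df2, class)

# matrix: everything is coerced to one common type (here character)
mat_type <- typeof(mat2)

print(df_col_types)
print(mat_type)
```

This is why the numbers in mat2 are printed with quotes, while df2 keeps id as integers.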

data.frame v.s. matrix

(preceding text omitted) ... So, "data.frame" exists only in our minds. That squarish thing that everyone calls "data.frame" is the "data.frame".

Agenda
0. Introduction (done)
1. data.frame (done)
2. Tidy data
3. Pipe
4. Verbs

http://vita.had.co.nz/papers/tidy-data.html

https://r4ds.had.co.nz/

In tidy data:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each value must have its own cell.

df1 <- data.frame(A = 1:3, B = 11:13)

> df1
  A  B
1 1 11
2 2 12
3 3 13

Each row is an observation; each column is a variable.

Figure labels: different observation, data, value, label

data tidying

"Horizontal" tidy data: variables

"Horizontal" style -> "Vertical" style
gather(df, key, value, -c(obsid, group))  # {tidyr}

"Horizontal" style -> "Nested" style
df2 <- nest(group_by(df, group))  # {tidyr}

Data style manipulation in {tidyr}
"Horizontal" data -> "Vertical" data: gather
"Vertical" data -> "Horizontal" data: spread
"Horizontal" data -> "Nested" data: nest
"Nested" data -> "Horizontal" data: unnest
Uses: input, output, visualization; loops, summarization, feature extraction, et al.
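The four conversions can be tried on a small table. The sketch below assumes {tidyr} and {dplyr} are installed; the data frame df and its columns obsid, group, x, y are invented here to match the argument names used in the gather() call on the slide. In tidyr 1.0.0 and later, gather() and spread() still work but are superseded by pivot_longer() and pivot_wider().

```r
library(tidyr)
library(dplyr)

# Hypothetical "horizontal" data: one row per observation, one column per variable
df <- data.frame(
  obsid = 1:4,
  group = c("a", "a", "b", "b"),
  x     = c(1, 2, 3, 4),
  y     = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# horizontal -> vertical: x and y are stacked into key/value pairs
df_vertical <- gather(df, key, value, -c(obsid, group))

# vertical -> horizontal: the key column becomes column names again
df_horizontal <- spread(df_vertical, key, value)

# horizontal -> nested: one row per group, the rest packed into a list column
df_nested <- nest(group_by(df, group))

# nested -> horizontal: unpack the list column (named "data" by default)
df_unnested <- unnest(ungroup(df_nested), data)

print(df_vertical)
print(df_nested)
```

The vertical form has one row per (observation, variable) pair, so 4 observations times 2 variables give 8 rows, while the nested form has one row per group.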

Agenda: Pipe (done, done)

Pipe: %>% {magrittr}

X %>% f          # f(X)
X %>% f(y)       # f(X, y)
X %>% f %>% g    # g(f(X))
X %>% f(y, .)    # f(y, X)

"Re-introduction to dplyr (Basics)" yutannihilation https://speakerdeck.com/yutannihilation/dplyrzai-ru-men-ji-ben-bian
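These rewriting rules can be checked directly. The sketch below assumes {magrittr} is installed; f, g, X and y are toy definitions chosen only so that the argument order matters (subtraction is not symmetric), and they are not from the slides.

```r
library(magrittr)

# Toy definitions for illustration only
f <- function(a, b = 1) a - b   # argument order matters
g <- function(a) a * 2
X <- 10
y <- 3

checks <- c(
  identical(X %>% f,        f(X)),      # X becomes the first argument
  identical(X %>% f(y),     f(X, y)),   # other arguments follow it
  identical(X %>% f %>% g,  g(f(X))),   # pipes chain left to right
  identical(X %>% f(y, .),  f(y, X))    # the dot places X explicitly
)
print(checks)
```

All four comparisons should come out TRUE, which is just the table on the slide written as code.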

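Pipe: %>% {magrittr}
Voices of addicted devotees:
"Lately I type nothing but pipes."
"People from other languages all think the pipe is a nice thing, too."
"I shifted over to this (the pipe) slowly, over about a year."
"R Community Chit-Chat" https://rlangradio.org/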

Bring milk from the kitchen!
① lift  ② take  ③ pour  ④ put

Bring milk from the kitchen!
① lift: lift(Robot, glass, table) -> Robot'
② take: take(Robot', fridge, milk) -> Robot''

Bring milk from the kitchen!

Robot' <- lift(Robot, glass, table)       # ①
Robot'' <- take(Robot', fridge, milk)     # ②
Robot''' <- pour(Robot'', milk, glass)    # ③
result <- put(Robot''', glass, table)     # ④

by using pipe,

result <- Robot %>%
  lift(glass, table) %>%    # ①
  take(fridge, milk) %>%    # ②
  pour(milk, glass) %>%     # ③
  put(glass, table)         # ④

The tidyverse style guides https://style.tidyverse.org/syntax.html#object-names "There are only two hard
things in Computer Science: cache invalidation and naming things"

Bring milk from the kitchen! (Thinking, Reading)

Robot' <- lift(Robot, glass, table)       # ①
Robot'' <- take(Robot', fridge, milk)     # ②
Robot''' <- pour(Robot'', milk, glass)    # ③
result <- put(Robot''', glass, table)     # ④

by using pipe,

result <- Robot %>%
  lift(glass, table) %>%    # ①
  take(fridge, milk) %>%    # ②
  pour(milk, glass) %>%     # ③
  put(glass, table)         # ④

Agenda: Pipe (done, done)

Bring milk from the kitchen!

Bring milk from the kitchen!

result <- Robot %>%
  lift(glass, table) %>%
  take(fridge, milk) %>%
  pour(milk, glass) %>%
  put(glass, table)

Define an original function

please_bring <- function(someone, milk, glass,
                         table = dining_table,
                         fridge = kitchen_fridge){
  someone %>%
    lift(glass, table) %>%
    take(fridge, milk) %>%
    pour(milk, glass) %>%
    put(glass, table)
}

Usage

RobotA %>% please_bring(milk, my_glass)
RobotB %>% please_bring(cold_tea, her_glass)
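The robot code on the slides is pseudocode: lift, take, pour and put are not real functions. To see the idea run, here is a toy version under loud assumptions: the robot is just a list that records what it was asked to do, the four verbs and the helper act() are invented for this sketch, objects are passed as strings, and the second parameter is named drink here instead of milk.

```r
library(magrittr)

# Hypothetical helper: append one action to the robot's log and return the robot
act <- function(robot, ...) {
  robot$log <- c(robot$log, paste(...))
  robot
}

# Toy verbs: each takes the robot first and returns the updated robot,
# which is what makes them chainable with %>%
lift <- function(robot, what, from) act(robot, "lift", what, "from", from)
take <- function(robot, from, what) act(robot, "take", what, "from", from)
pour <- function(robot, what, into) act(robot, "pour", what, "into", into)
put  <- function(robot, what, on)   act(robot, "put", what, "on", on)

# The whole sequence wrapped in one verb-named function with defaults
please_bring <- function(someone, drink, glass,
                         table = "dining_table",
                         fridge = "kitchen_fridge") {
  someone %>%
    lift(glass, table) %>%
    take(fridge, drink) %>%
    pour(drink, glass) %>%
    put(glass, table)
}

RobotA <- list(name = "RobotA", log = character())
result <- RobotA %>% please_bring("milk", "my_glass")
print(result$log)
```

Because every verb takes the robot as its first argument and returns it, the pipe can pass the state along without any intermediate names.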

General naming guidance (guidance to naming things)
• nouns for variables
• verbs for functions

guidance to naming things https://www.grinchcentral.com/function-names-to-verb-or-not-to-verb

guidance to naming things
Conversely,
• variables are nouns
• functions are verbs

Functions are verbs.

{dplyr}
filter(df, x == "a", y == 1)       # verb
df[df$x == "a" & df$y == 1, ]      # noun
df %>% filter(x == "a", y == 1)    # verb with pipe

verbs {dplyr}: verb functions
mutate     # add column
select     # select column
filter     # select row
arrange    # arrange row
summarise  # summary of vars

verbs {dplyr}
"It (dplyr) provides simple "verbs", functions that correspond to the most common data manipulation tasks, to help you translate your thoughts into code."
Introduction to dplyr https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html

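verbs {dplyr}
dplyr provides "verbs" for translating your thoughts into code: a set of functions that carry out the very basics of data manipulation in a simple way. (Note: a fairly loose translation.)
Introduction to dplyr https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html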

verbs {dplyr}: mutate  # add a column

verbs {dplyr}

library(dplyr)
iris %>% mutate(a = 1:nrow(.)) %>% str

'data.frame': 150 obs. of 6 variables:
 $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 ...
 $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 ...
 $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 ...
 $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 ...
 $ Species     : Factor w/ 3 levels "setosa", ...
 $ a           : int 1 2 3 4 5 6 7 8 9 10 ...

verbs {dplyr}: overwrite

library(dplyr)
iris %>%
  mutate(a = 1:nrow(.),
         a = a * 5/3 %>% round) %>%   # the second a overwrites the first
  str

'data.frame': 150 obs. of 6 variables:
 $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 ...
 $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 ...
 $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 ...
 $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 ...
 $ Species     : Factor w/ 3 levels "setosa", ...
 $ a           : num 1.67 3.33 5 6.67 8.33 ...
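A side note on the output above: the values of a (1.67, 3.33, ...) are not rounded. In R, %>% binds more tightly than * and /, so a * 5/3 %>% round is read as a * 5 / round(3). The sketch below shows the difference on a made-up six-row data frame; whether rounding was intended on the slide is not stated, so this is only an illustration of operator precedence.

```r
library(dplyr)

# Hypothetical small data frame (not from the slides)
df <- data.frame(a = 1:6)

# As written on the slide: only the 3 is passed to round()
as_written <- df %>% mutate(b = a * 5/3 %>% round)

# Parentheses make the whole expression go through round()
rounded <- df %>% mutate(b = (a * 5/3) %>% round)

# Plain function call, no inner pipe needed
rounded_call <- df %>% mutate(b = round(a * 5/3))

print(as_written$b)
print(rounded$b)
```

When an arithmetic expression is piped, wrapping it in parentheses (or calling the function directly) avoids this surprise.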

verbs {dplyr}: select  # select columns

verbs {dplyr}

library(dplyr)
iris %>% select(Sepal.Length, Sepal.Width) %>% str

'data.frame': 150 obs. of 2 variables:
 $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 ...
 $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 ...

verbs {dplyr}: select helper functions

library(dplyr)
iris %>% select(contains("Width")) %>% str

'data.frame': 150 obs. of 2 variables:
 $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 ...
 $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 ...

verbs {dplyr}

# Select helper functions
starts_with("s")
ends_with("s")
contains("se")
matches("^.e")
one_of(c("Sepal.Length", "Species"))
everything()

"Notes on usage examples of dplyr::select" kazutan https://kazutan.github.io/blog/2017/04/dplyr-select-memo/
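To see what each helper picks, they can be applied to iris with the same arguments as in the list above. This is a sketch assuming {dplyr} is installed; the result names are made up. Note that these helpers ignore case by default, which is why starts_with("s") matches the capitalised column names.

```r
library(dplyr)

# Column names selected by each helper from the list above
cols_starts  <- names(select(iris, starts_with("s")))
cols_ends    <- names(select(iris, ends_with("s")))
cols_has     <- names(select(iris, contains("se")))
cols_regex   <- names(select(iris, matches("^.e")))   # second character is "e"
cols_listed  <- names(select(iris, one_of(c("Sepal.Length", "Species"))))

# everything() is handy for moving a column to the front
cols_reorder <- names(select(iris, Species, everything()))

print(cols_starts)
print(cols_regex)
print(cols_reorder)
```

In recent dplyr versions one_of() still works, with any_of() and all_of() offered as newer alternatives.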

verbs {dplyr}: verb functions
mutate     # add columns
select     # select columns
filter     # filter rows
arrange    # reorder rows
summarise  # aggregate values

verbs {dplyr}: filter  # filter rows

verbs {dplyr}

library(dplyr)
iris %>% filter(Species == "versicolor") %>% str

'data.frame': 50 obs. of 5 variables:
 $ Sepal.Length: num 7 6.4 6.9 5.5 6.5 5.7 6.3 ...
 $ Sepal.Width : num 3.2 3.2 3.1 2.3 2.8 2.8 ...
 $ Petal.Length: num 4.7 4.5 4.9 4 4.6 4.5 4.7 ...
 $ Petal.Width : num 1.4 1.5 1.5 1.3 1.5 1.3 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 2 2 2 2 2 2 2 2 2 2 ...

verbs {dplyr}

library(dplyr)
iris %>% filter(Species == "versicolor")   # Species is written without quotes or iris$: NSE (Non-Standard Evaluation)

About NSE

filter(df, x == "a", y == 1)     # NSE (Non-Standard Evaluation)
df[df$x == "a" & df$y == 1, ]    # SE (Standard Evaluation)

Programming with dplyr http://dplyr.tidyverse.org/articles/programming.html

About NSE

filter(df, x == "a", y == 1)
df[df$x == "a" & df$y == 1, ]

With NSE,
‧ you do not have to write the name of df over and over
‧ you can write it in an SQL-like way

Programming with dplyr http://dplyr.tidyverse.org/articles/programming.html

About NSE

filter(df, x == "a", y == 1)     # verb
df[df$x == "a" & df$y == 1, ]    # noun

With NSE: there is a lot to consider, but clean code is justice (personal opinion). Easy to write, easy to read. Keep thinking and implementation close together.

Programming with dplyr http://dplyr.tidyverse.org/articles/programming.html

About NSE

df <- data.frame(x = 1:3, y = 1:3)
filter(df, x == 1)

Because of NSE, this does NOT work:

my_var <- "x"
filter(df, my_var == 1)

There is no "my_var" column in df, so my_var is looked up as an ordinary variable: the string "x" is compared with 1, and no rows are returned.

Programming with dplyr http://dplyr.tidyverse.org/articles/programming.html

If you really, really want to do this:

my_var <- quo(x)
filter(df, (!! my_var) == 1)

For why it works this way, see "Re-introduction to dplyr (tidyeval edition)" by yutannihilation: https://speakerdeck.com/yutannihilation/dplyrzai-ru-men-tidyevalbian
Does readability go up? Or down? That depends on you and your readers.
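The three cases can be compared side by side. This is a sketch assuming {dplyr} is installed; the result names are made up, and the .data[[ ]] form is an additional option from dplyr's programming documentation that does not appear on the slides.

```r
library(dplyr)

df <- data.frame(x = 1:3, y = 1:3)

# 1. A string stored in a variable is NOT treated as a column name:
#    "x" == 1 is FALSE, so no rows come back (and no error is raised)
my_var <- "x"
no_match <- filter(df, my_var == 1)

# 2. The slide's approach: capture the column as a quosure, then unquote with !!
my_quo <- quo(x)
with_quo <- filter(df, (!! my_quo) == 1)

# 3. Not on the slides: look the column up by its name through the .data pronoun
with_data <- filter(df, .data[[my_var]] == 1)

print(nrow(no_match))
print(with_quo)
print(with_data)
```

Case 1 is the dangerous one because it fails silently; cases 2 and 3 both return the single row where x is 1.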

Grammar of data manipulation
"By constraining your options, it helps you think about your data manipulation challenges."
Introduction to dplyr https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html
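Of the five verbs listed in this deck, arrange and summarise are the two that the slides do not demonstrate. A minimal sketch on iris, assuming {dplyr} is installed; group_by() is used here so that the summary is computed per species, and the result and column names are made up for illustration.

```r
library(dplyr)

# arrange: reorder rows, here from the longest sepal to the shortest
longest <- iris %>%
  arrange(desc(Sepal.Length)) %>%
  head(3)

# summarise: collapse many rows into summary values, one row per group
by_species <- iris %>%
  group_by(Species) %>%
  summarise(mean_length = mean(Sepal.Length),
            n = n())

print(longest)
print(by_species)
```

Together with mutate, select and filter, this small vocabulary covers a large share of everyday data manipulation, which is the point of the quote about constraining your options.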

Grammar of data manipulation
"By limiting your options, you can think about the steps of your data analysis more simply." (a very loose translation)
Introduction to dplyr https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html

"The more constraints one imposes, the more one frees one's self of the chains that shackle the spirit."
Igor Stravinsky (Игорь Ф. Стравинский), 1882 - 1971
(The Japanese rendering on the slide is marked as a rather loose translation.)

Agenda: Pipe (done, done)

Summary

> list3
$A
[1] 1 2 3

$B
[1] 11 12 13

> df1
  A  B
1 1 11
2 2 12
3 3 13

In df1, each row is an observation and each column is a variable.

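In tidy data:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each value must have its own cell.
data tidying

Data style manipulation in {tidyr}
"Horizontal" data -> "Vertical" data: gather
"Vertical" data -> "Horizontal" data: spread
"Horizontal" data -> "Nested" data: nest
"Nested" data -> "Horizontal" data: unnest
Uses: input, output, visualization; loops, summarization, feature extraction, et al.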

Pipe: %>% {magrittr}

X %>% f          # f(X)
X %>% f(y)       # f(X, y)
X %>% f %>% g    # g(f(X))
X %>% f(y, .)    # f(y, X)

"Re-introduction to dplyr (Basics)" yutannihilation https://speakerdeck.com/yutannihilation/dplyrzai-ru-men-ji-ben-bian

Functions are verbs
① lift  ② take  ③ pour  ④ put

result <- Robot %>%
  lift(glass, table) %>%
  take(fridge, milk) %>%
  pour(milk, glass) %>%
  put(glass, table)

{dplyr}
filter(df, x == "a", y == 1)       # verb
df[df$x == "a" & df$y == 1, ]      # noun
df %>% filter(x == "a", y == 1)    # verb with pipe

verbs {dplyr}: verb functions
mutate     # add column
select     # select column
filter     # select row
arrange    # arrange row
summarise  # summary of vars

Data Pipeline readable coding

https://www.tidyverse.org/

Programming languages are languages
Write, Run, Read, Think; Write, Run, Read, Think; Communicate, Share

“Life shrinks or expands to one’s courage.” -- Anaïs Nin,
2000 http://theamericanreader.com

Before After BeginneR Session BeginneR BeginneR ？
