
# Tokyo.R#77 BeginneRSession-Data analysis

April 13, 2019

## Transcript

1. ### 77th Tokyo.R @kilometer BeginneR Session 5 -- Data analysis --

2019.04.13 at SONY Co.

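3. ### Who!?

Name: Mimura (@kilometer). Occupation: postdoc (Doctor of Engineering). Specialty: behavioral neuroscience (primates), brain imaging, medical systems engineering. R experience: about 10 years. Current craze: bento.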

8. ### BeginneR Advanced Hoxo_m If I have seen further it is

by standing on the sholders of Giants. -- Sir Isaac Newton, 1676

Modeling
13. ### What is Data?

Hypothesis Driven: “Strong” Hypothesis, Data. Data Driven: “Weaken” Hypothesis, Data.
14. ### What is Data?

Diagram labels: f, X; Hypothesis Driven, Data Driven.
15. ### What is Data?

Data is observed (partial) information about the phenotype of the world. With data, we can hypothesize part of the underlying principle via statistical modeling.

18. ### Linear Regression

$y = ax + b + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma)$, where $x$ is the input, $y$ the output, $a$ the coefficient, $b$ the intercept, and $\varepsilon$ the residual, drawn from a Normal (Gaussian) distribution with mean $0$ and standard deviation $\sigma$.
19. ### Linear Regression

Same model as slide 18, with the label "parameters" added.

22. ### Tool①: pipe %>%

    X %>% f          # f(X)
    X %>% f(y)       # f(X, y)
    X %>% f %>% g    # g(f(X))
    X %>% f(y, .)    # f(y, X)
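The four pipe forms on this slide can be checked directly. The sketch below is an illustration only: the vector `X` and the functions `sqrt`, `round`, `sum`, and `paste` are stand-ins chosen here for `f`, `g`, and `y`, not names from the slides.

```r
library(magrittr)  # provides %>%; library(tidyverse) attaches it as well

X <- c(1, 4, 9)    # hypothetical input

# X %>% f is f(X)
same_1 <- identical(X %>% sqrt, sqrt(X))

# X %>% f(y) is f(X, y)
same_2 <- identical(X %>% round(1), round(X, 1))

# X %>% f %>% g is g(f(X))
same_3 <- identical(X %>% sqrt %>% sum, sum(sqrt(X)))

# X %>% f(y, .) is f(y, X): the dot marks where X goes
same_4 <- identical(X %>% paste("value", .), paste("value", X))

c(same_1, same_2, same_3, same_4)
```

Each comparison should come out `TRUE`, since the pipe only changes how the call is written.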

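24. ### Linear Regression

$y = ax + b + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma)$: input $x$, output $y$, coefficient $a$, intercept $b$, and residual $\varepsilon$ from a Normal (Gaussian) distribution with mean $0$ and standard deviation $\sigma$ (parameters).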

26. ### Set environment, set parameters, make data sample

    # set environment
    library(tidyverse)   # attach {package}
    set.seed(123)        # set random seed

    # set parameters
    N <- 7     # data No.
    a <- 4     # coefficient
    b <- 3     # intercept
    s <- 15    # standard deviation

    # make data sample
    dat <- data.frame(x = runif(N, 0, 10),
                      e = rnorm(N, 0, s)) %>%   # random number generator
      mutate(y = a * x + b + e)
27. ### Make data sample

    # make data sample
    dat <- data.frame(x = runif(N, 0, 10),
                      e = rnorm(N, 0, s)) %>%
      mutate(y = a * x + b + e)

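31. ### Figure: observed points $(x_i, y_i)$ and the fitted line $\hat{y} = ax + b$

Labels: observed, predicted.
32. ### Figure: observed points $(x_i, y_i)$ and the fitted line $\hat{y} = ax + b$

Labels: observed, predicted, unobserved.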
33. ### Linear model fitting

    # linear model fitting
    fit_lm <- lm(formula = y ~ x, data = dat)

    # abbreviated form (same meaning)
    fit_lm <- lm(y ~ x, dat)
34. ### Linear model fitting

    # linear model fitting
    fit_lm <- lm(y ~ x, dat)

    # view help
    ?lm
35. ### Linear model fitting

    # linear model fitting
    fit_lm <- lm(y ~ x, dat)

    # view help
    ?lm

Usage:

    lm(formula, data, subset, weights, na.action,
       method = "qr", model = TRUE, ...)
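Once `lm()` has run, the fitted object can be inspected with standard extractor functions. The sketch below rebuilds the sample data in base R so it runs on its own; `fit_long` is a hypothetical name used only to compare the two call forms.

```r
set.seed(123)
N <- 7; a <- 4; b <- 3; s <- 15

# sample data, built without the pipe so only base R is needed
dat <- data.frame(x = runif(N, 0, 10))
dat$y <- a * dat$x + b + rnorm(N, 0, s)

fit_long <- lm(formula = y ~ x, data = dat)  # named arguments
fit_lm   <- lm(y ~ x, dat)                   # abbreviated form

coef(fit_lm)       # named vector: "(Intercept)" and "x"
residuals(fit_lm)  # one residual per observation
summary(fit_lm)    # coefficient table, R-squared, etc.

# both call forms give the same estimates
all.equal(coef(fit_long), coef(fit_lm))
```

The intercept estimate corresponds to `b` and the `x` estimate to `a` in the data-generating model; with only 7 noisy points they need not be close to 3 and 4.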

38. ### Figure: observed points $(x_i, y_i)$ and the fitted line $\hat{y} = ax + b$

Labels: observed, predicted.
40. ### Figure: observed points $(x_i, y_i)$ and the fitted line $\hat{y} = ax + b$

Labels: observed, predicted, unobserved.
41. ### Make data sample and add predicted values

    dat <- data.frame(x = runif(N, 0, 10),
                      e = rnorm(N, 0, s)) %>%
      mutate(y = a * x + b + e)

    dat_lm <- dat %>%
      mutate(predict = lm(y ~ x) %>% predict)
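The same fitted model can also predict at inputs that were never observed, through the `newdata` argument of `predict()`. This is a minimal sketch; the grid `new_x` and its values are hypothetical and not taken from the slides.

```r
set.seed(123)
N <- 7; a <- 4; b <- 3; s <- 15

dat <- data.frame(x = runif(N, 0, 10))
dat$y <- a * dat$x + b + rnorm(N, 0, s)
fit_lm <- lm(y ~ x, dat)

# predictions at the observed x (identical to the fitted values)
pred_obs <- predict(fit_lm)

# predictions at unobserved x; the column name must match the formula
new_x <- data.frame(x = c(2.5, 5, 7.5))
pred_new <- predict(fit_lm, newdata = new_x)

# the same numbers by hand: intercept + slope * x
by_hand <- coef(fit_lm)[["(Intercept)"]] + coef(fit_lm)[["x"]] * new_x$x
all.equal(unname(pred_new), by_hand)
```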

43. ### Visualization

    # visualization
    ggplot(dat_lm)+
      geom_line(aes(x, predict), linetype = 2)+
      geom_point(aes(x, y))+
      geom_point(aes(x, predict), color = "blue")
44. ### Visualization

    ggplot(dat_lm)+
      geom_path(aes(x, predict), size = 3, color = "darkgrey")+
      geom_point(aes(x, y), size = 3)+
      geom_segment(aes(x, y, xend = x, yend = predict),
                   color = "blue", linetype = 1, size = 0.7)+
      geom_point(aes(x, predict), size = 3, color = "blue")+
      theme_classic()+
      theme(text = element_text(size = 21))+
      ylab("y")

    ggsave("plot.png", width = 5, height = 5)

48. ### $R^2$ of three fits

$\hat{y} = ax + b$ fitted to three data sets: $R^2 \cong 0.84$, $R^2 \cong 0.62$, $R^2 \cong 0.37$.

    summary(fit_lm)$r.squared
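49. ### Coefficient of determination

For the fitted line $\hat{y}_i = a x_i + b$, with residuals $\varepsilon_i = y_i - \hat{y}_i$ and $\bar{y}$ the mean of the observed $y_i$:

$$R^2 = 1 - \frac{\sum_i \varepsilon_i^2}{\sum_i (y_i - \bar{y})^2}$$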
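The coefficient of determination can be computed by hand from the residuals and compared with the value `summary()` reports. The sketch below assumes the single-predictor model with an intercept used throughout; `res` and `r_sq` are names introduced here for illustration.

```r
set.seed(123)
N <- 7; a <- 4; b <- 3; s <- 15

dat <- data.frame(x = runif(N, 0, 10))
dat$y <- a * dat$x + b + rnorm(N, 0, s)
fit_lm <- lm(y ~ x, dat)

# residuals: observed minus predicted
res <- dat$y - predict(fit_lm)

# 1 - (residual sum of squares) / (total sum of squares)
r_sq <- 1 - sum(res^2) / sum((dat$y - mean(dat$y))^2)

all.equal(r_sq, summary(fit_lm)$r.squared)
```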
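50. ### $R^2$ of three fits

$\hat{y} = ax + b$ fitted to three data sets: $R^2 \cong 0.84$, $R^2 \cong 0.62$, $R^2 \cong 0.37$.
51. ### $R^2$ of three fits and residual distribution

$R^2 \cong 0.84$ with $\varepsilon \sim \mathcal{N}(0, \sigma = 5)$; $R^2 \cong 0.62$ with $\varepsilon \sim \mathcal{N}(0, \sigma = 15)$; $R^2 \cong 0.37$ with $\varepsilon \sim \mathcal{N}(0, \sigma = 25)$.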
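52. ### Linear Regression and Multiple Linear Regression

Linear Regression: $y = a_0 + a x + \varepsilon$ (input $x$, output $y$, coefficient $a$, intercept $a_0$, residual $\varepsilon$).

Multiple Linear Regression: $y = a_0 + a_1 x_1 + a_2 x_2 + \varepsilon$ (inputs $x_1$, $x_2$; output $y$; intercept $a_0$; coefficients $a_1$, $a_2$; residual $\varepsilon$).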
53. ### Multiple Linear Regression

    N <- 15; s <- 10; a0 <- 3; a1 <- 3; a2 <- 2

    dat <- data.frame(x1 = runif(N, 0, 10),
                      x2 = runif(N, 0, 10)) %>%
      mutate(y = a0 + a1 * x1 + a2 * x2 + rnorm(N, 0, s))
56. ### Multiple Linear Regression

    N <- 15; s <- 10; a0 <- 3; a1 <- 3; a2 <- 2

    dat <- data.frame(x1 = runif(N, 0, 10),
                      x2 = runif(N, 0, 10)) %>%
      mutate(y = a0 + a1 * x1 + a2 * x2 + rnorm(N, 0, s))

    fit_lm <- lm(y ~ x1 + x2, dat)
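57. ### Multiple Linear Regression

    library(rgl)   # plot3d() and planes3d() come from {rgl}

    coeff <- fit_lm$coefficients   # order: (Intercept), x1, x2

    plot3d(dat)
    # plane a1 * x1 + a2 * x2 - y + a0 = 0
    planes3d(coeff[2], coeff[3], -1, coeff[1], alpha = 0.5)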
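`planes3d()` in {rgl} draws the plane $Ax + By + Cz + D = 0$, so the regression surface $y = a_0 + a_1 x_1 + a_2 x_2$ corresponds to $A = a_1$, $B = a_2$, $C = -1$, $D = a_0$. The sketch below checks that mapping numerically without opening a 3D device; the vector `plane` and its element names are hypothetical.

```r
set.seed(123)
N <- 15; s <- 10; a0 <- 3; a1 <- 3; a2 <- 2

dat <- data.frame(x1 = runif(N, 0, 10), x2 = runif(N, 0, 10))
dat$y <- a0 + a1 * dat$x1 + a2 * dat$x2 + rnorm(N, 0, s)

fit_lm <- lm(y ~ x1 + x2, dat)
coeff <- coef(fit_lm)   # order: (Intercept), x1, x2

# plane parameters in the A*x1 + B*x2 + C*y + D = 0 form
plane <- c(A = coeff[["x1"]], B = coeff[["x2"]],
           C = -1, D = coeff[["(Intercept)"]])

# every predicted point lies on that plane (up to rounding error)
y_hat <- predict(fit_lm)
on_plane <- plane[["A"]] * dat$x1 + plane[["B"]] * dat$x2 +
  plane[["C"]] * y_hat + plane[["D"]]
max(abs(on_plane))
```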


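61. ### Principal Component Analysis

$$\begin{aligned}
y_1 &= a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n \\
y_2 &= a_{21} x_1 + a_{22} x_2 + \cdots + a_{2n} x_n \\
&\;\;\vdots \\
y_n &= a_{n1} x_1 + a_{n2} x_2 + \cdots + a_{nn} x_n
\end{aligned}$$

$$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} =
\begin{pmatrix} a_{11} & \dots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \dots & a_{nn} \end{pmatrix}
\begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix}$$

$(y_1, \dots, y_n)$: principal components. $(a_{ij})$: eigenvector matrix.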
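The matrix form can be checked against `prcomp()`: the principal component scores are the centered and scaled data multiplied by the eigenvector matrix. This is a sketch on simulated data of the same shape as the multiple regression example. Note that `prcomp()` stores eigenvectors as the columns of `rotation`, so its `rotation` is the transpose of the row-wise matrix in the equation.

```r
set.seed(123)
N <- 15

dat <- data.frame(x1 = runif(N, 0, 10), x2 = runif(N, 0, 10))
dat$y <- 3 + 3 * dat$x1 + 2 * dat$x2 + rnorm(N, 0, 10)

# scale. = TRUE standardizes each column before the decomposition
fit_pca <- prcomp(dat, scale. = TRUE)

fit_pca$rotation   # eigenvector matrix, one column per component

# scores by hand: standardized data times the eigenvector matrix
scores <- scale(dat) %*% fit_pca$rotation
all.equal(unname(scores), unname(fit_pca$x))

# the eigenvectors are orthonormal
ortho <- crossprod(fit_pca$rotation)
all.equal(unname(ortho), diag(3))
```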


69. ### Nested data modeling

    dat1 <- data.frame(...)
    dat_lm1 <- ...

    dat2 <- data.frame(...)
    dat_lm2 <- ...

    dat3 <- data.frame(...)
    dat_lm3 <- ...

74. ### Nested data modeling

    dat <- data.frame(x = runif(N, 0, 10)) %>%
      nest %>%
      list %>%
      rep(3) %>%
      bind_rows %>%
      rowid_to_column("id") %>%
      mutate(s = c(5, 10, 25))
75. ### Nested data modeling

    dat <- data.frame(x = runif(N, 0, 10)) %>%
      nest %>%
      list %>%
      rep(3) %>%
      bind_rows %>%
      rowid_to_column("id") %>%
      mutate(s = c(5, 10, 25)) %>%
      mutate(data = map2(data, s,
                         ~mutate(.x, y = a * x + b + rnorm(N, 0, .y))))
76. ### Nested data modeling

    dat <- data.frame(x = runif(N, 0, 10)) %>%
      nest %>%
      list %>%
      rep(3) %>%
      bind_rows %>%
      rowid_to_column("id") %>%
      mutate(s = c(5, 10, 25)) %>%
      mutate(data = map2(data, s,
                         ~mutate(.x, y = a * x + b + rnorm(N, 0, .y)))) %>%
      mutate(lm = map(data, ~lm(.$y ~ .$x)),
             predict = map(lm, predict))
77. ### Nested data modeling

    dat <- data.frame(x = runif(N, 0, 10)) %>%
      nest %>%
      list %>%
      rep(3) %>%
      bind_rows %>%
      rowid_to_column("id") %>%
      mutate(s = c(5, 10, 25)) %>%
      mutate(data = map2(data, s,
                         ~mutate(.x, y = a * x + b + rnorm(N, 0, .y)))) %>%
      mutate(lm = map(data, ~lm(.$y ~ .$x)),
             predict = map(lm, predict)) %>%
      mutate(r_sq = map_dbl(lm, ~summary(.)$r.squared),
             a = map_dbl(lm, ~.$coeff[2]),   # coefficient (slope)
             b = map_dbl(lm, ~.$coeff[1]))   # intercept
78. ### Nested data modeling

    dat %>%
      select(id, data, predict) %>%
      unnest %>%
      ggplot(...)+
      geom_...+
      facet_wrap(~id)
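For comparison, the same "one model per noise level" idea can be sketched in base R with `lapply()` instead of nested data frames. This is a swapped-in alternative, not the approach on the slides; `s_values`, `fits`, and `result` are hypothetical names.

```r
set.seed(123)
N <- 7; a <- 4; b <- 3

x <- runif(N, 0, 10)        # shared input for all three data sets
s_values <- c(5, 10, 25)    # standard deviation of the residual

# fit one linear model per noise level
fits <- lapply(s_values, function(s) {
  d <- data.frame(x = x, y = a * x + b + rnorm(N, 0, s))
  lm(y ~ x, d)
})

# collect one row of results per model
result <- data.frame(
  id   = seq_along(s_values),
  s    = s_values,
  r_sq = sapply(fits, function(f) summary(f)$r.squared),
  a    = sapply(fits, function(f) coef(f)[["x"]]),
  b    = sapply(fits, function(f) coef(f)[["(Intercept)"]])
)
result
```

The nested version keeps the data, the models, and the summaries together in one table, which is what makes the later `unnest` and `facet_wrap(~id)` step possible.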

Modeling
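83. ### Linear Regression

$y = ax + b + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma)$: input $x$, output $y$, coefficient $a$, intercept $b$, and residual $\varepsilon$ from a Normal (Gaussian) distribution with mean $0$ and standard deviation $\sigma$ (parameters).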

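85. ### Multiple Linear Regression

    library(rgl)   # plot3d() and planes3d() come from {rgl}

    fit_lm <- lm(y ~ x1 + x2, dat)
    coeff <- fit_lm$coefficients   # order: (Intercept), x1, x2

    plot3d(dat)
    # plane a1 * x1 + a2 * x2 - y + a0 = 0
    planes3d(coeff[2], coeff[3], -1, coeff[1], alpha = 0.5)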
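86. ### Principal Component Analysis

$$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} =
\begin{pmatrix} a_{11} & \dots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \dots & a_{nn} \end{pmatrix}
\begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix}$$

$(y_1, \dots, y_n)$: principal components. $(a_{ij})$: eigenvector matrix.

    fit_pca <- prcomp(dat, scale = T)
    biplot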