Tokyo.R#77 BeginneRSession-Data analysis

Slides from my talk in the beginner session of the 77th Tokyo.R.

kilometer

April 13, 2019

Transcript

  1. 77th Tokyo.R @kilometer BeginneR Session 5 -- Data analysis --

    2019.04.13 at SONY Co.
  2. Who!?

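  3. Who!? Name: Mimura (@kilometer); Occupation: postdoc (Ph.D. in engineering); Specialty: behavioral neuroscience (primates), brain imaging,

    medical systems engineering; R experience: about 10 years; Current craze: Bento (ベントー)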
  4. BeginneR Session

  5. BeginneR

  6. BeginneR

  7. Before After BeginneR Session BeginneR BeginneR

  8. BeginneR Advanced Hoxo_m If I have seen further it is

    by standing on the sholders of Giants. -- Sir Isaac Newton, 1676
  9. BeginneR Session 5 -- Data analysis --

  10. What is Data?

  11. What is Data? [Diagram: Truth and Knowledge; symbols lost in extraction]

  12. What is Data? [Diagram: Truth and Knowledge, connected by Modeling; symbols lost in extraction]

  13. What is Data? Hypothesis Driven: “Strong” Hypothesis and Data. Data Driven: “Weak” Hypothesis and Data.

  14. What is Data?

    [Diagram: Hypothesis Driven vs. Data Driven; symbols lost in extraction]
  15. What is Data? Data is observed (partial) information about the phenotype

    of the world. With data, we can hypothesize part of its underlying principle via statistical modeling.
  16. What is the most frequently used, standard, popular, and simple model?

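  17. Linear Regression: $y = ax + b + e$ ($x$: input, $y$: output, $a$: coefficient, $b$: intercept, $e$: residual)

  18. Linear Regression: $y = ax + b + e$, $e \sim \mathcal{N}(0, s)$ ($x$: input, $y$: output, $a$: coefficient,

    $b$: intercept, $e$: residual; $\mathcal{N}$: normal (Gaussian) distribution with mean $0$ and standard deviation $s$)
  19. Linear Regression: $y = ax + b + e$, $e \sim \mathcal{N}(0, s)$ ($x$: input, $y$: output, $a$: coefficient,

    $b$: intercept, $e$: residual; $\mathcal{N}$: normal (Gaussian) distribution with mean $0$ and standard deviation $s$; parameters: $a$, $b$, $s$)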
  20. Before we begin...

  21. Tools ①: pipe %>% ②: mutate

  22. Tool①: pipe %>%

    X %>% f is f(X); X %>% f(y) is f(X, y); X %>% f %>% g is g(f(X)); X %>% f(y, .) is f(y, X)
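The four pipe rewrites on this slide can be checked directly. This is a minimal sketch, assuming the magrittr package (which provides `%>%` and is also attached by `library(tidyverse)`); the functions `f` and `g` and the values `X` and `y` are hypothetical stand-ins, not objects from the talk.

```r
library(magrittr)  # provides %>%

# hypothetical toy functions and values, for illustration only
f <- function(a, b = 1) a - b
g <- function(a) a * 10
X <- 5
y <- 2

r1 <- X %>% f          # same as f(X)
r2 <- X %>% f(y)       # same as f(X, y): X becomes the first argument
r3 <- X %>% f %>% g    # same as g(f(X)): read left to right
r4 <- X %>% f(y, .)    # same as f(y, X): the dot marks where X goes

c(r1, r2, r3, r4)
```

The dot form is what makes it possible to pipe into a function whose data argument is not the first one.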
  23. Tool②: mutate
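Since the slide only names the tool, here is a small sketch of what `mutate` does, assuming the dplyr package (part of the tidyverse). The data frame `df` and its columns are hypothetical examples.

```r
library(dplyr)  # provides mutate() and re-exports %>%

# hypothetical toy data frame
df <- data.frame(x = c(1, 2, 3))

df2 <- df %>%
  mutate(y = 2 * x,   # add a column computed from an existing column
         z = y + 1)   # later columns may use columns created just before

df2
```

`mutate` returns a new data frame and leaves `df` unchanged, which is why the result is assigned to `df2`.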

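  24. Linear Regression: $y = ax + b + e$, $e \sim \mathcal{N}(0, s)$ ($x$: input, $y$: output, $a$: coefficient,

    $b$: intercept, $e$: residual; $\mathcal{N}$: normal (Gaussian) distribution with mean $0$ and standard deviation $s$; parameters: $a$, $b$, $s$)
  25. $[x_i, y_i]$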

  26. # set environment

    library(tidyverse)  # attach {package}
    set.seed(123)       # set random seed
    # set parameters
    N <- 7              # data No.
    a <- 4              # coefficient
    b <- 3              # intercept
    s <- 15             # standard deviation
    # make data sample (runif and rnorm are random number generators)
    dat <- data.frame(x = runif(N, 0, 10), e = rnorm(N, 0, s)) %>%
      mutate(y = a * x + b + e)
  27. # make data sample

    dat <- data.frame(x = runif(N, 0, 10), e = rnorm(N, 0, s)) %>%
      mutate(y = a * x + b + e)
  28. # visualization

    ggplot(dat, aes(x, y)) + geom_point()

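  29. $[x_i, y_i]$, $y = ax + b$

  30. $[x_i, y_i]$, $\hat{y} = \hat{a}x + \hat{b}$

    observed, predicted
  31. $[x_i, y_i]$, $\hat{y} = \hat{a}x + \hat{b}$

    observed $y_i$, predicted $\hat{y}_i$
  32. $[x_i, y_i]$, $\hat{y} = \hat{a}x + \hat{b}$

    observed $y_i$, predicted $\hat{y}_i$; predicted $\hat{y}_j$ for unobserved $x_j$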
  33. # linear model fitting

    fit_lm <- lm(formula = y ~ x, data = dat)
    fit_lm <- lm(y ~ x, dat)  # abbreviated form (same meaning)
  34. # linear model fitting

    fit_lm <- lm(y ~ x, dat)
    # view help
    ?lm
  35. # linear model fitting

    fit_lm <- lm(y ~ x, dat)
    # view help
    ?lm
    # Usage: lm(formula, data, subset, weights, na.action, method = "qr", model = TRUE, ...)
  36. # linear model fitting

    fit_lm <- lm(y ~ x, dat)
  37. # linear model fitting

    fit_lm <- lm(y ~ x, dat)
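Once `fit_lm` exists, its contents can be inspected with standard extractor functions. The following is a sketch in base R only; it regenerates data with the same parameters as the earlier slides, but the exact numbers are not claimed to match the talk.

```r
set.seed(123)
N <- 7; a <- 4; b <- 3; s <- 15

# simulated data: y = a * x + b + noise
x <- runif(N, 0, 10)
dat <- data.frame(x = x, y = a * x + b + rnorm(N, 0, s))

fit_lm <- lm(y ~ x, dat)

coef(fit_lm)       # estimated intercept and slope, named "(Intercept)" and "x"
fitted(fit_lm)     # fitted value of y for each observed x
residuals(fit_lm)  # observed y minus fitted y, one value per row
```

With only seven noisy points, the estimates can differ noticeably from the true `a` and `b`.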

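  38. $[x_i, y_i]$, $\hat{y} = \hat{a}x + \hat{b}$

    observed $y_i$, predicted $\hat{y}_i$
  39. $[x_i, y_i]$, $\hat{y} = \hat{a}x + \hat{b}$

    observed $y_i$, predicted $\hat{y}_i$
  40. $[x_i, y_i]$, $\hat{y} = \hat{a}x + \hat{b}$

    observed $y_i$, predicted $\hat{y}_i$; unobserved $x_j$, predicted $\hat{y}_j$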
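Prediction at unobserved inputs can be sketched with `predict()` and its `newdata` argument. The data frame `new_x` and its values are hypothetical, and the data are regenerated here so the block runs on its own.

```r
set.seed(123)
N <- 7; a <- 4; b <- 3; s <- 15
x <- runif(N, 0, 10)
dat <- data.frame(x = x, y = a * x + b + rnorm(N, 0, s))
fit_lm <- lm(y ~ x, dat)

# hypothetical inputs that were not observed
new_x <- data.frame(x = c(2.5, 12))

# prediction from the fitted model
pred_new <- predict(fit_lm, newdata = new_x)

# the same values computed by hand from the estimated coefficients
manual <- coef(fit_lm)[1] + coef(fit_lm)[2] * new_x$x

pred_new
```

Note that `x = 12` lies outside the simulated range of 0 to 10, so that prediction is an extrapolation and should be trusted less than the one at `x = 2.5`.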
  41. dat <- data.frame(x = runif(N, 0, 10), e = rnorm(N, 0, s)) %>% mutate(y = a * x + b + e)

    dat_lm <- dat %>% mutate(predict = lm(y ~ x) %>% predict)
  42. # visualization

    ggplot(dat_lm) +
      geom_point(aes(x, y)) +
      geom_point(aes(x, predict), color = "blue")
  43. # visualization

    ggplot(dat_lm) +
      geom_line(aes(x, predict), linetype = 2) +
      geom_point(aes(x, y)) +
      geom_point(aes(x, predict), color = "blue")
  44. ggplot(dat_lm) +

      geom_path(aes(x, predict), size = 3, color = "darkgrey") +
      geom_point(aes(x, y), size = 3) +
      geom_segment(aes(x, y, xend = x, yend = predict), color = "blue", linetype = 1, size = 0.7) +
      geom_point(aes(x, predict), size = 3, color = "blue") +
      theme_classic() +
      theme(text = element_text(size = 21)) +
      ylab("y")
    ggsave("plot.png", width = 5, height = 5)
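  46. $[x_i, y_i]$, $y = ax + b$, $R^2 \approx 0.65$

  47. $[x_i, y_i]$, $y = ax + b$, $R^2 \approx 0.65$

  48. $y = ax + b$, $R^2 \approx 0.84$; $y = ax + b$, $R^2 \approx 0.62$; $y = ax + b$, $R^2 \approx 0.37$

    summary(fit_lm)$r.squared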
  49. Coefficient of determination: $\hat{y} = \hat{a}x + \hat{b}$, $\bar{y}$ = mean of $y_i$

    $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$
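The coefficient of determination can be computed by hand and compared with the value reported by `summary()`. This is a base R sketch on regenerated data; the variable names `r2_manual` and `r2_summary` are illustrative.

```r
set.seed(123)
N <- 7; a <- 4; b <- 3; s <- 15
x <- runif(N, 0, 10)
dat <- data.frame(x = x, y = a * x + b + rnorm(N, 0, s))
fit_lm <- lm(y ~ x, dat)

# residual sum of squares: sum over i of (y_i - y_hat_i)^2
ss_res <- sum((dat$y - fitted(fit_lm))^2)

# total sum of squares: sum over i of (y_i - y_bar)^2
ss_tot <- sum((dat$y - mean(dat$y))^2)

r2_manual  <- 1 - ss_res / ss_tot
r2_summary <- summary(fit_lm)$r.squared

all.equal(r2_manual, r2_summary)
```

The two agree for a model with an intercept, which is what `lm(y ~ x, dat)` fits by default.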
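  50. $y = ax + b$, $R^2 \approx 0.84$; $y = ax + b$, $R^2 \approx 0.62$;

    $y = ax + b$, $R^2 \approx 0.37$
  51. $y = ax + b$, $R^2 \approx 0.84$, $e \sim \mathcal{N}(0, s = 5)$; $y = ax + b$, $R^2 \approx 0.62$, $e \sim \mathcal{N}(0, s = 15)$;

    $y = ax + b$, $R^2 \approx 0.37$, $e \sim \mathcal{N}(0, s = 25)$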
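The link between noise level and $R^2$ can be explored with a small simulation. This is only a sketch: the helper `r_squared_for` is hypothetical, the sample size and seed are assumptions, and the resulting values are not expected to reproduce the numbers on the slide.

```r
set.seed(123)
N <- 7; a <- 4; b <- 3

# one shared set of inputs, so only the noise level differs between fits
x <- runif(N, 0, 10)

# hypothetical helper: simulate y with noise sd = s, fit, return R^2
r_squared_for <- function(s) {
  y <- a * x + b + rnorm(N, 0, s)
  summary(lm(y ~ x))$r.squared
}

r_sq <- sapply(c(5, 15, 25), r_squared_for)
r_sq
```

With so few points a single run can break the expected ordering, so repeating the simulation with different seeds gives a better picture than one draw.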
  52. Linear Regression: $y = ax + b + e$ ($x$: input, $y$: output, $a$: coefficient, $b$: intercept, $e$: residual)

    Multiple Linear Regression: $y = a_0 + a_1 x_1 + a_2 x_2 + e$ ($x_1$, $x_2$: inputs, $y$: output, $a_0$: intercept, $a_1$, $a_2$: coefficients, $e$: residual)
  53. Multiple Linear Regression

    N <- 15; s <- 10; a0 <- 3; a1 <- 3; a2 <- 2
    dat <- data.frame(x1 = runif(N, 0, 10), x2 = runif(N, 0, 10)) %>%
      mutate(y = a0 + a1 * x1 + a2 * x2 + rnorm(N, 0, s))
  54. Multiple Linear Regression plot(dat)

  55. Multiple Linear Regression

    library(rgl)
    plot3d(dat)
  56. Multiple Linear Regression

    N <- 15; s <- 10; a0 <- 3; a1 <- 3; a2 <- 2
    dat <- data.frame(x1 = runif(N, 0, 10), x2 = runif(N, 0, 10)) %>%
      mutate(y = a0 + a1 * x1 + a2 * x2 + rnorm(N, 0, s))
    fit_lm <- lm(y ~ x1 + x2, dat)
  57. Multiple Linear Regression

    coeff <- fit_lm$coefficients
    plot3d(dat)
    planes3d(coeff[2], coeff[3], -1, coeff[1], alpha = 0.5)
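The argument order in `planes3d(coeff[2], coeff[3], -1, coeff[1])` follows from rewriting the fitted equation as a plane. rgl's `planes3d(a, b, c, d)` draws the plane `a*x + b*y + c*z + d = 0`, and the fitted model `y = c0 + c1*x1 + c2*x2` rearranges to `c1*x1 + c2*x2 - y + c0 = 0`. The sketch below checks that rearrangement numerically in base R, without needing rgl; the name `plane_eq` is illustrative.

```r
set.seed(123)
N <- 15; s <- 10; a0 <- 3; a1 <- 3; a2 <- 2
x1 <- runif(N, 0, 10)
x2 <- runif(N, 0, 10)
dat <- data.frame(x1 = x1, x2 = x2,
                  y = a0 + a1 * x1 + a2 * x2 + rnorm(N, 0, s))

fit_lm <- lm(y ~ x1 + x2, dat)
coeff <- fit_lm$coefficients  # (Intercept), x1, x2

# fitted points lie exactly on the regression plane
y_hat <- fitted(fit_lm)

# left-hand side of the plane equation with the same four numbers
# that the slide passes to planes3d()
plane_eq <- coeff[2] * dat$x1 + coeff[3] * dat$x2 + (-1) * y_hat + coeff[1]

max(abs(plane_eq))  # numerically zero
```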
  58. Multiple Linear Regression [figure; labels with subscripts 0, 1, 2 lost in extraction]

  59. Agenda: Introduction (done), What is data? (done), Linear Regression (done),

    Multivariate Analysis
  60. plot(dat)

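  61. Principal Component Analysis

    $y_1 = a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n$, $y_2 = a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n$, $\ldots$, $y_n = a_{n1}x_1 + a_{n2}x_2 + \cdots + a_{nn}x_n$; in matrix form, $\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nn} \end{pmatrix} \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix}$, where $(y_1, \ldots, y_n)$ are the principal components and $(a_{ij})$ is the eigenvector matrix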
  62. Principal Component Analysis fit_pca <- prcomp(dat, scale = T)

  63. Principal Component Analysis fit_pca <- prcomp(dat, scale = T)
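To connect `prcomp` with the "eigenvector matrix" wording, the result can be compared against an explicit eigendecomposition. This is a base R sketch on regenerated three-column data; it spells the argument as `scale. = TRUE`, which is the full name of the argument that `scale = T` matches partially. Eigenvectors are only defined up to sign, so the comparison uses absolute values.

```r
set.seed(123)
N <- 15; s <- 10; a0 <- 3; a1 <- 3; a2 <- 2
x1 <- runif(N, 0, 10)
x2 <- runif(N, 0, 10)
dat <- data.frame(x1 = x1, x2 = x2,
                  y = a0 + a1 * x1 + a2 * x2 + rnorm(N, 0, s))

fit_pca <- prcomp(dat, scale. = TRUE)

# with scaling, PCA amounts to an eigendecomposition of the correlation matrix
eig <- eigen(cor(dat))

# variances of the principal components equal the eigenvalues
same_values <- all.equal(fit_pca$sdev^2, eig$values)

# columns of the rotation matrix equal the eigenvectors, up to sign
same_vectors <- all.equal(abs(unname(fit_pca$rotation)), abs(eig$vectors))

# principal component scores are the scaled data times the rotation matrix
scores <- scale(dat) %*% fit_pca$rotation
same_scores <- all.equal(scores, fit_pca$x, check.attributes = FALSE)

c(same_values, same_vectors, same_scores)
```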

  64. Principal Component Analysis compression
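One way to read "compression" is to keep only the first few principal components and reconstruct the data from them. The sketch below assumes that reading; `k`, `var_explained`, and `approx_z` are illustrative names, and the data are regenerated so the block runs on its own.

```r
set.seed(123)
N <- 15; s <- 10; a0 <- 3; a1 <- 3; a2 <- 2
x1 <- runif(N, 0, 10)
x2 <- runif(N, 0, 10)
dat <- data.frame(x1 = x1, x2 = x2,
                  y = a0 + a1 * x1 + a2 * x2 + rnorm(N, 0, s))

fit_pca <- prcomp(dat, scale. = TRUE)

# share of total variance carried by each principal component
var_explained <- fit_pca$sdev^2 / sum(fit_pca$sdev^2)
cumsum(var_explained)

# keep the first k components and map them back to the scaled variables
k <- 2
z <- scale(dat)
approx_z <- fit_pca$x[, 1:k] %*% t(fit_pca$rotation[, 1:k])

# squared reconstruction error equals the variance that was dropped
err <- sum((z - approx_z)^2)
dropped <- (N - 1) * sum(fit_pca$sdev[-(1:k)]^2)
all.equal(err, dropped)
```

Keeping all three components would reconstruct the scaled data exactly, so the error comes only from the discarded component.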

  65. References

  66. Agenda: Introduction (done), What is data? (done), Linear Regression (done),

    Multivariate Analysis (done)
  67. Appendix…

  68. Nested data modeling

  69. Nested data modeling

    dat1 <- data.frame(...); dat_lm1 <- ...
    dat2 <- data.frame(...); dat_lm2 <- ...
    dat3 <- data.frame(...); dat_lm3 <- ...
  70. Nested data modeling

  71. Nested data modeling

  72. Nested data modeling

  73. Nested data modeling

  74. Nested data modeling

    dat <- data.frame(x = runif(N, 0, 10)) %>%
      nest %>% list %>% rep(3) %>% bind_rows %>%
      rowid_to_column("id") %>%
      mutate(s = c(5, 10, 25))
  75. Nested data modeling

    dat <- data.frame(x = runif(N, 0, 10)) %>%
      nest %>% list %>% rep(3) %>% bind_rows %>%
      rowid_to_column("id") %>%
      mutate(s = c(5, 10, 25)) %>%
      mutate(data = map2(data, s, ~mutate(.x, y = a * x + b + rnorm(N, 0, .y))))
  76. Nested data modeling

    dat <- data.frame(x = runif(N, 0, 10)) %>%
      nest %>% list %>% rep(3) %>% bind_rows %>%
      rowid_to_column("id") %>%
      mutate(s = c(5, 10, 25)) %>%
      mutate(data = map2(data, s, ~mutate(.x, y = a * x + b + rnorm(N, 0, .y)))) %>%
      mutate(lm = map(data, ~lm(.$y ~ .$x)),
             predict = map(lm, predict))
  77. Nested data modeling

    dat <- data.frame(x = runif(N, 0, 10)) %>%
      nest %>% list %>% rep(3) %>% bind_rows %>%
      rowid_to_column("id") %>%
      mutate(s = c(5, 10, 25)) %>%
      mutate(data = map2(data, s, ~mutate(.x, y = a * x + b + rnorm(N, 0, .y)))) %>%
      mutate(lm = map(data, ~lm(.$y ~ .$x)),
             predict = map(lm, predict)) %>%
      mutate(r_sq = map_dbl(lm, ~summary(.)$r.squared),
             a = map_dbl(lm, ~.$coeff[2]),
             b = map_dbl(lm, ~.$coeff[1]))
  78. Nested data modeling

    dat %>%
      select(id, data, predict) %>%
      unnest %>%
      ggplot(...) + geom_... + facet_wrap(~id)
  79. Nested data modeling

  80. Nested data modeling

  81. Summary

  82. What is Data?

    [Diagram: Truth and Knowledge, connected by Modeling; symbols lost in extraction]
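  83. Linear Regression: $y = ax + b + e$, $e \sim \mathcal{N}(0, s)$ ($x$: input, $y$: output, $a$: coefficient,

    $b$: intercept, $e$: residual; $\mathcal{N}$: normal (Gaussian) distribution with mean $0$ and standard deviation $s$; parameters: $a$, $b$, $s$)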
  84. # linear model fitting

    fit_lm <- lm(y ~ x, dat)
  85. Multiple Linear Regression

    fit_lm <- lm(y ~ x1 + x2, dat)
    coeff <- fit_lm$coefficients
    plot3d(dat)
    planes3d(coeff[2], coeff[3], -1, coeff[1], alpha = 0.5)
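  86. Principal Component Analysis

    $\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nn} \end{pmatrix} \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix}$, where $(y_1, \ldots, y_n)$ are the principal components and $(a_{ij})$ is the eigenvector matrix
    fit_pca <- prcomp(dat, scale = T)
    biplot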
  87. Nested data modeling

  88. Enjoy!!