Tokyo.R#77 BeginneRSession-Data analysis

These are the slides from my talk in the beginner session of the 77th Tokyo.R.

kilometer

April 13, 2019

Transcript

  1. BeginneR Advanced Hoxo_m

    "If I have seen further it is by standing on the sholders of Giants." -- Sir Isaac Newton, 1676
  2. What is Data?

    [Diagram: Hypothesis Driven vs. Data Driven]
  3. What is Data?

    Data is observed (partial) information about the phenotype of the world. We can hypothesize part of the underlying principle via statistical modeling with data.
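  4. Linear Regression

    $y = ax + b + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma)$
    $x$: input, $y$: output, $a$: coefficient, $b$: intercept, $\varepsilon$: residual
    $\mathcal{N}(0, \sigma)$: Normal (Gaussian) distribution with mean $0$ and standard deviation $\sigma$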
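  5. Linear Regression

    $y = ax + b + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma)$
    $x$: input, $y$: output, $a$: coefficient, $b$: intercept, $\varepsilon$: residual
    $\mathcal{N}(0, \sigma)$: Normal (Gaussian) distribution with mean $0$ and standard deviation $\sigma$
    Parameters: $a$, $b$, $\sigma$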
  6. Tool①: pipe %>%

    X %>% f          is equivalent to  f(X)
    X %>% f(y)       is equivalent to  f(X, y)
    X %>% f %>% g    is equivalent to  g(f(X))
    X %>% f(y, .)    is equivalent to  f(y, X)
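The four pipe rewrites on slide 6 can be checked directly in R. The following is a minimal sketch, assuming the {magrittr} package (which supplies `%>%` and is also attached by `library(tidyverse)`); `sqrt`, `round`, `sum`, and `paste` are stand-ins chosen here for the slide's generic `f` and `g`, and `X` and `y` are arbitrary illustrative values.

```r
library(magrittr)  # provides %>%

X <- c(1, 4, 9)  # illustrative input
y <- 2           # illustrative extra argument

checks <- c(
  # X %>% f is f(X)
  "X %>% f"       = identical(X %>% sqrt, sqrt(X)),
  # X %>% f(y) is f(X, y): the left-hand side becomes the first argument
  "X %>% f(y)"    = identical(X %>% round(y), round(X, y)),
  # X %>% f %>% g is g(f(X))
  "X %>% f %>% g" = identical(X %>% sqrt %>% sum, sum(sqrt(X))),
  # X %>% f(y, .) is f(y, X): the dot marks where the left-hand side goes
  "X %>% f(y, .)" = identical(X %>% paste(y, .), paste(y, X))
)

print(checks)
```

Every entry of `checks` should be `TRUE`, since each piped expression is just another way of writing the nested call next to it.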
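  7. Linear Regression

    $y = ax + b + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma)$
    $x$: input, $y$: output, $a$: coefficient, $b$: intercept, $\varepsilon$: residual
    $\mathcal{N}(0, \sigma)$: Normal (Gaussian) distribution with mean $0$ and standard deviation $\sigma$
    Parameters: $a$, $b$, $\sigma$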
  8. # set environment

    library(tidyverse)  # attach {package}
    set.seed(123)       # set random seed

    # set parameters
    N <- 7   # data No.
    a <- 4   # coefficient
    b <- 3   # intercept
    s <- 15  # standard deviation

    # make data sample (runif() and rnorm() are random number generators)
    dat <- data.frame(x = runif(N, 0, 10),
                      e = rnorm(N, 0, s)) %>%
      mutate(y = a * x + b + e)
  9. # make data sample

    dat <- data.frame(x = runif(N, 0, 10),
                      e = rnorm(N, 0, s)) %>%
      mutate(y = a * x + b + e)
  10. [Figure: observed points $(x_i, y_i)$ and their predicted values on the line $y = ax + b$, with residuals $\varepsilon_i$]
  11. [Figure: observed points $(x_i, y_i)$ and their predicted values on the line $y = ax + b$, with residuals $\varepsilon_i$; predicted value for an unobserved input]
  12. # linear model fitting

    fit_lm <- lm(formula = y ~ x, data = dat)

    # abbreviated form (same meaning)
    fit_lm <- lm(y ~ x, dat)
  13. # linear model fitting

    fit_lm <- lm(y ~ x, dat)

    # view help
    ?lm

    Usage
    lm(formula, data, subset, weights, na.action, method = "qr", model = TRUE, ...)
  14. [Figure: observed points $(x_i, y_i)$ and their predicted values on the line $y = ax + b$, with residuals $\varepsilon_i$]
  15. [Figure: observed points $(x_i, y_i)$ and their predicted values on the line $y = ax + b$, with residuals $\varepsilon_i$]
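  16. [Figure: observed points $(x_i, y_i)$ and their predicted values on the line $y = ax + b$, with residuals $\varepsilon_i$; predicted value and unobserved residual for an unobserved input]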
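The distinction between predictions at observed and unobserved inputs can be made concrete with `predict()`. This is a minimal sketch in base R that reuses the parameter values from the earlier slides; the two values in `new_x` are hypothetical inputs chosen here only for illustration.

```r
set.seed(123)
N <- 7; a <- 4; b <- 3; s <- 15

# simulate the data as on the earlier slides (base R instead of mutate())
dat <- data.frame(x = runif(N, 0, 10), e = rnorm(N, 0, s))
dat$y <- a * dat$x + b + dat$e

fit_lm <- lm(y ~ x, dat)

# predicted values at the observed inputs
pred_observed <- predict(fit_lm)

# predicted values at inputs that were never observed (hypothetical values)
new_x <- data.frame(x = c(2.5, 7.5))
pred_unobserved <- predict(fit_lm, newdata = new_x)

# the same predictions computed by hand from the fitted line
coeff <- coef(fit_lm)  # coeff[1]: intercept, coeff[2]: coefficient of x
by_hand <- coeff[1] + coeff[2] * new_x$x

print(pred_unobserved)
print(by_hand)
```

For an unobserved input there is no observed `y`, so the residual is unknown; the model can only supply the point on the fitted line.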
  17. dat <- data.frame(x = runif(N, 0, 10),
                      e = rnorm(N, 0, s)) %>%
      mutate(y = a * x + b + e)

    dat_lm <- dat %>%
      mutate(predict = lm(y ~ x) %>% predict)
  18. ggplot(dat_lm)+
      geom_path(aes(x, predict), size = 3, color = "darkgrey")+
      geom_point(aes(x, y), size = 3)+
      geom_segment(aes(x, y, xend = x, yend = predict),
                   color = "blue", linetype = 1, size = 0.7)+
      geom_point(aes(x, predict), size = 3, color = "blue")+
      theme_classic()+
      theme(text = element_text(size = 21))+
      ylab("y")

    ggsave("plot.png", width = 5, height = 5)
  19. ggplot(dat_lm)+
      geom_path(aes(x, predict), size = 3, color = "darkgrey")+
      geom_point(aes(x, y), size = 3)+
      geom_segment(aes(x, y, xend = x, yend = predict),
                   color = "blue", linetype = 1, size = 0.7)+
      geom_point(aes(x, predict), size = 3, color = "blue")+
      theme_classic()+
      theme(text = element_text(size = 21))+
      ylab("y")

    ggsave("plot.png", width = 5, height = 5)
  20. [Three panels, each showing a fitted line $y = ax + b$]

    $R^2 \approx 0.84$, $R^2 \approx 0.62$, $R^2 \approx 0.37$

    summary(fit_lm)$r.squared
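  21. Coefficient of determination

    $R^2 = 1 - \frac{\sum_i \varepsilon_i^2}{\sum_i (y_i - \bar{y})^2}$

    [Figure: residuals $\varepsilon_i$ from the fitted line $y = ax + b$, and deviations $y_i - \bar{y}$ from the mean line $y = \bar{y}$]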
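The coefficient of determination can be computed by hand and compared with the value that `summary()` reports. This is a minimal sketch in base R, assuming the same simulated data as the earlier slides; `r_sq_by_hand` and `r_sq_summary` are names introduced here for illustration.

```r
set.seed(123)
N <- 7; a <- 4; b <- 3; s <- 15

x <- runif(N, 0, 10)
y <- a * x + b + rnorm(N, 0, s)

fit_lm <- lm(y ~ x)

# residuals: observed y minus predicted y
res <- residuals(fit_lm)

# R^2 = 1 - (residual sum of squares) / (total sum of squares around the mean)
r_sq_by_hand <- 1 - sum(res^2) / sum((y - mean(y))^2)

# value reported by summary()
r_sq_summary <- summary(fit_lm)$r.squared

print(c(by_hand = r_sq_by_hand, summary = r_sq_summary))
```

The two numbers agree up to floating-point error for a model that includes an intercept, which is the default for `lm(y ~ x)`.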
  22. [Three panels, each showing a fitted line $y = ax + b$]

    $R^2 \approx 0.84$, $R^2 \approx 0.62$, $R^2 \approx 0.37$
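  23. [Three panels, each showing a fitted line $y = ax + b$]

    $R^2 \approx 0.84$ with $\varepsilon \sim \mathcal{N}(0, \sigma = 5)$
    $R^2 \approx 0.62$ with $\varepsilon \sim \mathcal{N}(0, \sigma = 15)$
    $R^2 \approx 0.37$ with $\varepsilon \sim \mathcal{N}(0, \sigma = 25)$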
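  24. Linear Regression

    $y = ax + b + \varepsilon$
    $x$: input, $y$: output, $a$: coefficient, $b$: intercept, $\varepsilon$: residual

    Multiple Linear Regression

    $y = a_0 + a_1 x_1 + a_2 x_2 + \varepsilon$
    $x_1$, $x_2$: inputs, $y$: output, $a_0$: intercept, $a_1$, $a_2$: coefficients, $\varepsilon$: residual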
  25. Multiple Linear Regression

    N <- 15; s <- 10; a0 <- 3; a1 <- 3; a2 <- 2

    dat <- data.frame(x1 = runif(N, 0, 10),
                      x2 = runif(N, 0, 10)) %>%
      mutate(y = a0 + a1 * x1 + a2 * x2 + rnorm(N, 0, s))
  26. Multiple Linear Regression

    N <- 15; s <- 10; a0 <- 3; a1 <- 3; a2 <- 2

    dat <- data.frame(x1 = runif(N, 0, 10),
                      x2 = runif(N, 0, 10)) %>%
      mutate(y = a0 + a1 * x1 + a2 * x2 + rnorm(N, 0, s))

    fit_lm <- lm(y ~ x1 + x2, dat)
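  27. Principal Component Analysis

    $y_1 = a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n$
    $y_2 = a_{21} x_1 + a_{22} x_2 + \cdots + a_{2n} x_n$
    $y_3 = a_{31} x_1 + a_{32} x_2 + \cdots + a_{3n} x_n$
    $\vdots$

    $\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nn} \end{pmatrix} \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix}$

    $(a_{ij})$: eigenvector matrix, $y_1, \ldots, y_n$: principal components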
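The relation between the eigenvector matrix and the principal components can be checked numerically. This is a minimal sketch in base R, assuming data simulated as on the multiple regression slides; it compares `prcomp()` with an explicit eigendecomposition of the correlation matrix. Because the data are scaled, the correlation matrix is the relevant one. Eigenvectors are only defined up to sign, so the comparison uses absolute values.

```r
set.seed(123)
N <- 15
dat <- data.frame(x1 = runif(N, 0, 10), x2 = runif(N, 0, 10))
dat$y <- 3 + 3 * dat$x1 + 2 * dat$x2 + rnorm(N, 0, 10)

# PCA on centered and scaled data (the full argument name is `scale.`)
fit_pca <- prcomp(dat, scale. = TRUE)

# eigendecomposition of the correlation matrix
eig <- eigen(cor(dat))

# the rotation matrix holds the eigenvectors (up to sign)
same_vectors <- isTRUE(all.equal(abs(unname(fit_pca$rotation)),
                                 abs(eig$vectors)))

# the variances of the components are the eigenvalues
same_values <- isTRUE(all.equal(fit_pca$sdev^2, eig$values))

# principal components = (centered, scaled data) %*% eigenvector matrix
pcs_by_hand <- scale(dat) %*% fit_pca$rotation
same_pcs <- isTRUE(all.equal(unname(pcs_by_hand), unname(fit_pca$x)))

print(c(vectors = same_vectors, values = same_values, pcs = same_pcs))
```

In R each observation is a row, so the matrix product is written as data times rotation matrix; this is the same linear map as the column-vector form on the slide, applied to every observation at once.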
  28. Nested data modeling

    dat1 <- data.frame(...)
    dat_lm1 <- ...
    dat2 <- data.frame(...)
    dat_lm2 <- ...
    dat3 <- data.frame(...)
    dat_lm3 <- ...
  29. Nested data modeling

    dat <- data.frame(x = runif(N, 0, 10)) %>%
      nest %>%
      list %>%
      rep(3) %>%
      bind_rows %>%
      rowid_to_column("id") %>%
      mutate(s = c(5, 10, 25))
  30. Nested data modeling

    dat <- data.frame(x = runif(N, 0, 10)) %>%
      nest %>%
      list %>%
      rep(3) %>%
      bind_rows %>%
      rowid_to_column("id") %>%
      mutate(s = c(5, 10, 25)) %>%
      mutate(data = map2(data, s,
                         ~mutate(.x, y = a * x + b + rnorm(N, 0, .y))))
  31. Nested data modeling

    dat <- data.frame(x = runif(N, 0, 10)) %>%
      nest %>%
      list %>%
      rep(3) %>%
      bind_rows %>%
      rowid_to_column("id") %>%
      mutate(s = c(5, 10, 25)) %>%
      mutate(data = map2(data, s,
                         ~mutate(.x, y = a * x + b + rnorm(N, 0, .y)))) %>%
      mutate(lm = map(data, ~lm(.$y ~ .$x)),
             predict = map(lm, predict))
  32. Nested data modeling

    dat <- data.frame(x = runif(N, 0, 10)) %>%
      nest %>%
      list %>%
      rep(3) %>%
      bind_rows %>%
      rowid_to_column("id") %>%
      mutate(s = c(5, 10, 25)) %>%
      mutate(data = map2(data, s,
                         ~mutate(.x, y = a * x + b + rnorm(N, 0, .y)))) %>%
      mutate(lm = map(data, ~lm(.$y ~ .$x)),
             predict = map(lm, predict)) %>%
      mutate(r_sq = map_dbl(lm, ~summary(.)$r.squared),
             a = map_dbl(lm, ~.$coeff[2]),
             b = map_dbl(lm, ~.$coeff[1]))
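The nested workflow, one data set and one model per noise level with the summaries collected into a single table, can also be sketched compactly. The variant below is an assumption-laden rewrite, not the slide's code: it builds the list-column with `map()` instead of `nest`, uses the hypothetical names `x_shared`, `fit`, `a_hat`, and `b_hat`, and assumes a current {tidyverse} installation.

```r
library(tidyverse)
set.seed(123)

N <- 7; a <- 4; b <- 3
x_shared <- runif(N, 0, 10)  # hypothetical name: one x sample reused for every s

dat <- tibble(s = c(5, 10, 25)) %>%
  rowid_to_column("id") %>%
  # one simulated data set per noise level s (passed to the lambda as .x)
  mutate(data = map(s, ~ tibble(x = x_shared,
                                y = a * x + b + rnorm(N, 0, .x)))) %>%
  # one model per data set, then one summary number per model
  mutate(fit = map(data, ~ lm(y ~ x, data = .x)),
         r_sq = map_dbl(fit, ~ summary(.x)$r.squared),
         a_hat = map_dbl(fit, ~ coef(.x)[["x"]]),
         b_hat = map_dbl(fit, ~ coef(.x)[["(Intercept)"]]))

# keep only the scalar columns for a readable overview
result <- dat %>% select(id, s, r_sq, a_hat, b_hat)
print(result)
```

Naming the estimates `a_hat` and `b_hat` keeps them distinct from the true parameters `a` and `b` used to simulate the data, so the two can be compared row by row.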
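  33. Linear Regression

    $y = ax + b + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma)$
    $x$: input, $y$: output, $a$: coefficient, $b$: intercept, $\varepsilon$: residual
    $\mathcal{N}(0, \sigma)$: Normal (Gaussian) distribution with mean $0$ and standard deviation $\sigma$
    Parameters: $a$, $b$, $\sigma$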
  34. Multiple Linear Regression

    library(rgl)  # plot3d() and planes3d() come from the {rgl} package

    fit_lm <- lm(y ~ x1 + x2, dat)
    coeff <- fit_lm$coefficients

    plot3d(dat)
    planes3d(coeff[2], coeff[3], -1, coeff[1], alpha = 0.5)
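The four numbers passed to `planes3d()` describe the plane $a x + b y + c z + d = 0$, here with $(a, b, c, d)$ set to (coefficient of `x1`, coefficient of `x2`, $-1$, intercept). A minimal sketch in base R, with no {rgl} window needed, checks that every fitted value lies on that plane; the data are simulated as on the multiple regression slides, and `plane_lhs` is a name introduced here for illustration.

```r
set.seed(123)
N <- 15; s <- 10; a0 <- 3; a1 <- 3; a2 <- 2

dat <- data.frame(x1 = runif(N, 0, 10), x2 = runif(N, 0, 10))
dat$y <- a0 + a1 * dat$x1 + a2 * dat$x2 + rnorm(N, 0, s)

fit_lm <- lm(y ~ x1 + x2, dat)
coeff <- coef(fit_lm)  # same values as fit_lm$coefficients
print(coeff)           # (Intercept), x1, x2

# left-hand side of a*x + b*y + c*z + d = 0, evaluated at the fitted points
# (x, y, z) = (x1, x2, fitted value)
plane_lhs <- coeff[2] * dat$x1 + coeff[3] * dat$x2 -
  1 * fitted(fit_lm) + coeff[1]

# should be zero up to floating-point error
print(max(abs(plane_lhs)))
```

The observed `y` values scatter above and below this plane; their vertical distances from it are the residuals.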
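  35. Principal Component Analysis

    $\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nn} \end{pmatrix} \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix}$

    $(a_{ij})$: eigenvector matrix, $y_1, \ldots, y_n$: principal components

    fit_pca <- prcomp(dat, scale = T)
    biplot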