SOC 4930 & SOC 5050 - Week 11

Lecture slides for Week 11 of the Saint Louis University course Quantitative Analysis: Applied Inferential Statistics. These slides cover topics related to correlation analyses.

Christopher Prener

November 05, 2018

Transcript

  1. AGENDA / QUANTITATIVE ANALYSIS / WEEK 11 / LECTURE 11
     1. Front Matter 2. More with knitr 3. Scatterplots 4. Matrix Arrays 5. Correlation in R 6. Power Analyses for Correlation 7. Back Matter
  2. 1. FRONT MATTER / ANNOUNCEMENTS
     ITS or DPS issues?
     Lab 10 and Problem Set 05 due next Monday, as is peer review of your partner’s materials!
     Draft papers due next Monday!
     Reminder - no additional lecture preps!
  3. 2. MORE WITH KNITR / IN-LINE CODE

     ```{r load-data}
     library(ggplot2)
     auto <- mpg
     ```

     The average highway fuel efficiency in the data set is `r mean(auto$hwy)`.
  4. 2. MORE WITH KNITR / ROUNDING IN R
     round(x, digits = val)
     Parameters:
     ▸ x is the value you wish to round
     ▸ val is the number of decimal places to keep
     Available in base (installed with base R)
  6. 2. MORE WITH KNITR / ROUNDING IN R
     round(x, digits = val)
     Using the hwy variable from ggplot2’s mpg data:
     > mean(mpg$hwy)
     [1] 23.44017
     > round(mean(mpg$hwy), digits = 3)
     [1] 23.44
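The digits argument of round() counts decimal places, which is easy to confuse with significant digits. The following is a minimal sketch contrasting round() with base R's signif(), using the mean printed above as a hard-coded value. The object names hwy_mean, rounded_2, and signif_3 are illustrative only.

```r
# Mean highway mileage as printed on the slide (hard-coded so the example is self-contained)
hwy_mean <- 23.44017

# round() keeps a fixed number of decimal places
rounded_2 <- round(hwy_mean, digits = 2)

# signif() keeps a fixed number of significant digits
signif_3 <- signif(hwy_mean, digits = 3)

print(rounded_2)
print(signif_3)
```

For in-line knitr output, round() is usually the one wanted, since it gives a predictable number of digits after the decimal point.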
  7. 2. MORE WITH KNITR / ROUNDED IN-LINE CODE

     ```{r load-data}
     library(ggplot2)
     auto <- mpg
     ```

     The average highway fuel efficiency in the data set is `r round(mean(auto$hwy), digits = 3)`.
  8. 2. MORE WITH KNITR / PEARSON’S R
     r = \frac{\sum_{i=1}^{n}{(x-\bar{x})(y-\bar{y})}}{(n-1){s}_{x}{s}_{y}}
     Command pattern: \x{}
     Summation: \sum_{i=1}^{n}
  9. 2. MORE WITH KNITR / PEARSON’S R
     Command pattern: \x{}
     Mean of x: \bar{x}
  10. 2. MORE WITH KNITR / PEARSON’S R
     Command pattern: \x{}
     Numerator: \sum_{i=1}^{n}{(x-\bar{x})(y-\bar{y})}
  11. 2. MORE WITH KNITR / PEARSON’S R
     Command pattern: \x{}
     Standard deviation of x: {s}_{x}
  12. 2. MORE WITH KNITR / PEARSON’S R
     Command pattern: \x{}
     Denominator: (n-1){s}_{x}{s}_{y}
  13. 2. MORE WITH KNITR / PEARSON’S R
     Command pattern: \x{}
     Fraction: \frac{}{}
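To connect the formula to R, here is a small sketch that computes Pearson's r term by term and compares it with stats::cor(). The vectors x and y are made-up values for illustration, not course data.

```r
# Two small illustrative vectors (made-up values)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
n <- length(x)

# Numerator: sum of cross-products of deviations from the means
numerator <- sum((x - mean(x)) * (y - mean(y)))

# Denominator: (n - 1) times the two sample standard deviations
denominator <- (n - 1) * sd(x) * sd(y)

r_manual <- numerator / denominator
r_builtin <- cor(x, y, method = "pearson")

print(c(manual = r_manual, builtin = r_builtin))
```

The two values should agree, because sd() uses the same n - 1 denominator that appears in the formula.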
  14. 3. SCATTERPLOTS / WITH LINEAR MODEL
     geom_smooth(method = "lm", color = "#hex")
     Parameters:
     ▸ method specifies the type of model to use; we’ll focus on linear models ("lm") this semester
     ▸ color takes a six-digit hexadecimal code that sets the color of the line - you can look up colors on colorhexa.com
     ▸ You can also specify the aesthetic mapping for x and y, but this is not necessary if it was already done in the original ggplot() call.
     Available in ggplot2 (download via CRAN)
  16. 3. SCATTERPLOTS / WITH LINEAR MODEL
     geom_smooth(method = "lm", color = "#hex")
     Using the hwy and displ variables from ggplot2’s mpg data:
     ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
       geom_point(position = "jitter") +
       geom_smooth(method = "lm", color = "#ff0000")
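The line drawn by geom_smooth(method = "lm") is an ordinary least-squares fit of y on x. As a rough sketch of what is being estimated, the same kind of model can be fit directly with stats::lm(). This example uses the built-in mtcars data (mpg and disp) as a stand-in for ggplot2's mpg data so that it runs without extra packages; that substitution is an assumption made for illustration.

```r
# Fit fuel efficiency on engine displacement with ordinary least squares
# (mtcars is a stand-in for the mpg data used on the slides)
fit <- lm(mpg ~ disp, data = mtcars)

# Intercept and slope of the fitted line
line_coefs <- coef(fit)
print(line_coefs)
```

A negative slope here corresponds to a downward-sloping line on the scatterplot: larger engines, lower fuel efficiency.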
  17. 3. SCATTERPLOTS / WITH GROUPS
     geom_point(mapping = aes(x = xvar, y = yvar, color = groupVar))
     Parameters:
     ▸ color is the parameter where the grouping variable is assigned
       • this should be specified within the aesthetic
  18. 3. SCATTERPLOTS / WITH GROUPS
     geom_point(mapping = aes(x = xvar, y = yvar, color = groupVar))
     Using the hwy and displ variables from ggplot2’s mpg data with points colored by type of drive (drv):
     ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
       geom_point(position = "jitter")
  19. 3. SCATTERPLOTS / WITH LINEAR MODELS BY GROUP
     Using the hwy and displ variables from ggplot2’s mpg data with points colored by type of drive (drv):
     ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
       geom_point(position = "jitter") +
       geom_smooth(method = "lm")
  20. 3. SCATTERPLOTS / WITH LINEAR MODELS BY GROUP
     Using the hwy and displ variables from ggplot2’s mpg data with points colored by type of drive (drv):
     ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
       geom_point(position = "jitter") +
       geom_smooth(method = "lm", mapping = aes(linetype = drv))
  21. 3. SCATTERPLOTS / WITH FACETS
     facet_grid(. ~ facetVar)
     Parameters:
     ▸ facetVar is the parameter where the faceting variable is assigned
  22. 3. SCATTERPLOTS / WITH FACETS
     facet_grid(. ~ facetVar)
     Using the hwy and displ variables from ggplot2’s mpg data with points colored by type of drive (drv):
     ggplot(data = mpg) +
       geom_point(aes(x = displ, y = hwy, color = drv), position = "jitter") +
       facet_grid(. ~ drv)
  23. 3. SCATTERPLOTS / WITH FACETS AND LINEAR MODELS
     facet_grid(. ~ facetVar)
     Using the hwy and displ variables from ggplot2’s mpg data with points colored by type of drive (drv):
     ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
       geom_point(position = "jitter") +
       geom_smooth(method = "lm") +
       facet_grid(. ~ drv)
  24. 3. SCATTERPLOTS / STATISTICAL PLOT
     ggscatterstats(data = data, x = xvar, y = yvar)
     Parameters:
     ▸ data is the data frame being used
     ▸ xvar is the x variable
     ▸ yvar is the y variable
     Available in ggstatsplot (download via CRAN)
  26. 3. SCATTERPLOTS / STATISTICAL PLOT
     ggscatterstats(data = data, x = xvar, y = yvar)
     Using the hwy and displ variables from ggplot2’s mpg data:
     ggscatterstats(data = mpg, x = hwy, y = displ)
     This will not create a ggplot object (and will return an error confirming this), so the saving process is a bit different.
  27. 3. SCATTERPLOTS / SAVING STATISTICAL PLOTS
     > library(ggplot2)      # ggsave()
     > library(ggstatsplot)  # ggscatterstats()
     > library(here)         # here()
     >
     > # option 1 (without marginal plots)
     > ggscatterstats(data = mpg, x = hwy, y = displ, marginal = FALSE)
     > ggsave(filename = here("results", "statplot.png"), dpi = 300)
     >
     > # option 2 (with marginal plots)
     > grDevices::png(here("results", "statplot.png"), width = 534,
     +     height = 400)
     > ggscatterstats(data = mpg, x = hwy, y = displ)
     > grDevices::dev.off()
  28. 4. MATRIX ARRAYS / MATRIX
     A collection of values in rows and columns. All values must be of the same data type.
     \mathbf{M} = \begin{bmatrix} 1 & 2 \\ 2 & 4 \\ 3 & 6 \end{bmatrix}
     ▸ Matrix name in bold, upper case lettering
     ▸ Brackets, parentheses, or braces used to enclose values
     ▸ Each value is an element
  29. 4. MATRIX ARRAYS / SCALAR
     A matrix with one element.
     m = \begin{bmatrix} 1 \end{bmatrix}
     ▸ Lower case italicized matrix name
  30. 4. MATRIX ARRAYS / SCALAR
     All single values saved to an object in R are scalars.
     > m <- 1
     > m1 <- TRUE
     > m2 <- "ham"
  31. 4. MATRIX ARRAYS / VECTOR
     A matrix with one row or column.
     \mathbf{m} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}
     ▸ Lower case bold matrix name
  32. 4. MATRIX ARRAYS / CREATING A VECTOR
     base::c(element, element, element)
     Create an atomic vector of numbers (stored as doubles; write 1L, 2L, 3L for true integers):
     > m <- c(1, 2, 3)
     c is for “concatenate” (and “cookie”)
  33. 4. MATRIX ARRAYS / LIST (GENERIC VECTOR)
     A vector that is a collection of multiple atomic vectors. Lists may contain vectors of different dimensions and types of data.
     \mathbf{M} = \left( a = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix},\; b = \begin{bmatrix} 2 \\ 4 \\ 6 \end{bmatrix} \right)
  34. 4. MATRIX ARRAYS / SQUARE MATRIX
     A matrix with equal numbers of rows and columns.
     \mathbf{M} = \begin{bmatrix} 1 & 2 & 4 \\ 2 & 4 & 8 \\ 3 & 6 & 12 \end{bmatrix}
  35. 4. MATRIX ARRAYS / DIAGONAL
     Values in a square matrix running from upper left to lower right.
     \mathbf{M} = \begin{bmatrix} 1 & 2 & 4 \\ 2 & 4 & 8 \\ 3 & 6 & 12 \end{bmatrix}
     Diagonal: 1, 4, 12
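The square matrix and its diagonal can be reproduced in base R. This is a minimal sketch using matrix() and diag(); the object names is_square and main_diagonal are illustrative.

```r
# Build the 3 x 3 square matrix from the slide, filling row by row
M <- matrix(c(1, 2, 4,
              2, 4, 8,
              3, 6, 12),
            nrow = 3, byrow = TRUE)

# A square matrix has as many rows as columns
is_square <- nrow(M) == ncol(M)

# diag() extracts the values running from upper left to lower right
main_diagonal <- diag(M)

print(M)
print(main_diagonal)
```

Note that matrix() fills column by column unless byrow = TRUE is set, which is why the argument is spelled out here.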
  36. 4. MATRIX ARRAYS / CREATING A MATRIX
     base::as.matrix(objectName)
     Converting the data frame object ham into a matrix named eggs:
     > eggs <- as.matrix(ham)
     In practice, this should only be applied to numeric or logical data. Logical vectors stored alongside numeric columns will be converted to 0 (FALSE) and 1 (TRUE). If character vectors are in ham, the entire matrix will be character.
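The coercion rules for as.matrix() are easiest to see side by side. The sketch below uses two hypothetical data frames (numeric_logical and with_character) that are not part of the lecture materials.

```r
# Hypothetical data frames for illustration
numeric_logical <- data.frame(spam = c(2, 1, 3),
                              ham = c(TRUE, FALSE, TRUE))
with_character <- data.frame(spam = c(2, 1, 3),
                             eggs = c("Sunny", "Poached", "Scrambled"),
                             stringsAsFactors = FALSE)

m1 <- as.matrix(numeric_logical)
m2 <- as.matrix(with_character)

# Numeric storage: TRUE/FALSE in ham become 1/0
print(typeof(m1))
print(m1)

# Character storage: every value, including spam, becomes a string
print(typeof(m2))
print(m2)
```

Because a matrix holds a single data type, one character column is enough to turn the numbers into text, which is why unwanted variables should be dropped before converting.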
  37. 4. MATRIX ARRAYS / WHAT IS A DATA FRAME?
     A collection of vectors that have the same length (like a matrix) but can be of different types (like a list).
     index  ham    eggs       spam
     1      TRUE   Sunny      2
     2      FALSE  Poached    1
     3      TRUE   Scrambled  3
     4      FALSE  Sunny      4
  38. 4. MATRIX ARRAYS / WHAT IS A DATA FRAME?
     A collection of vectors that have the same length (like a matrix) but can be of different types (like a list).
     ham: TRUE, FALSE, TRUE, FALSE
  39. 4. MATRIX ARRAYS / CREATING A DATA FRAME
     base::data.frame(vector, vector, stringsAsFactors = FALSE)
     Create a data frame with one numeric and one character variable:
     > M <- data.frame(
     +     x = c(1, 2, 3),
     +     y = c("a", "b", "a"),
     +     stringsAsFactors = FALSE)
  40. 4. MATRIX ARRAYS / CREATE A DATA FRAME
     In R, build a data frame named breakfast that has the variables ham, eggs, and spam.
     index  ham    eggs       spam
     1      TRUE   Sunny      2
     2      FALSE  Poached    1
     3      TRUE   Scrambled  3
     4      FALSE  Sunny      4
  41. 4. MATRIX ARRAYS / CREATE A DATA FRAME
     breakfast <- data.frame(
       ham = c(TRUE, FALSE, TRUE, FALSE),
       eggs = c("Sunny", "Poached", "Scrambled", "Sunny"),
       spam = c(2, 1, 3, 4),
       stringsAsFactors = FALSE)
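One way to check that the data frame was built as intended is to inspect the class of each column and the overall dimensions. This is a small sketch; column_classes is an illustrative name.

```r
breakfast <- data.frame(
  ham = c(TRUE, FALSE, TRUE, FALSE),
  eggs = c("Sunny", "Poached", "Scrambled", "Sunny"),
  spam = c(2, 1, 3, 4),
  stringsAsFactors = FALSE
)

# One class per column: a data frame can mix types across columns
column_classes <- sapply(breakfast, class)
print(column_classes)

# Every column has the same length, which is the number of rows
print(dim(breakfast))
```

With stringsAsFactors = FALSE, eggs stays a character vector rather than being converted to a factor.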
  42. 4. MATRIX ARRAYS / SPEAKING OF VECTORS…
     R’s functions are often vectorized. But what the $%&# does that mean?
     Let:
     f <- function(x){ x*2 }
     m <- c(2, 4, 6)
     Output:
     > f(m)
     [1] 4 8 12
     Under the hood:
     f(m[1])  f(m[2])  f(m[3])
     f(2)     f(4)     f(6)
     4        8        12
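The "under the hood" picture can be checked directly by comparing the vectorized call with an explicit element-by-element version. This is a sketch for intuition only; R does not literally call sapply() internally, and the names vectorized and elementwise are illustrative.

```r
f <- function(x) {
  x * 2
}
m <- c(2, 4, 6)

# Vectorized call: the whole vector is passed at once
vectorized <- f(m)

# Element-by-element version of the same computation
elementwise <- sapply(m, f)

print(vectorized)
print(identical(vectorized, elementwise))
```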
  43. 5. CORRELATION IN R / MISSING DATA
     Missing data are represented by NA values in R.
     index  ham    eggs       spam
     1      TRUE   Sunny      2
     2      FALSE  Poached    1
     3      TRUE   Scrambled  NA
     4      FALSE  NA         4
  44. 5. CORRELATION IN R / MISSING DATA
     Sometimes missing data are assigned special values, like -9. If that is the case (as in the final project), they need to be recoded.
     index  ham    eggs       spam
     1      TRUE   Sunny      2
     2      FALSE  Poached    1
     3      TRUE   Scrambled  -9
     4      FALSE  -9         4
  45. 5. CORRELATION IN R / RECODING MISSING DATA
     > library(dplyr)
     > foo <- data.frame(ham = c(1, 2, 3, 4, -9))
     > foo <- mutate(foo, ham = ifelse(ham == -9, NA, ham))
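The same recode can be written without dplyr. The sketch below uses base R logical indexing and assumes that -9 is the only missing-data code in the variable; n_missing is an illustrative name.

```r
foo <- data.frame(ham = c(1, 2, 3, 4, -9))

# Replace the special value -9 with NA using logical indexing
foo$ham[foo$ham == -9] <- NA

# Confirm how many values are now missing
n_missing <- sum(is.na(foo$ham))

print(foo)
print(n_missing)
```

Counting the NA values after recoding is a quick check that the recode caught what it was supposed to.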
  46. 5. CORRELATION IN R / MISSING DATA ANALYSIS
     miss_var_summary(data)
     Parameters:
     ▸ data is the data frame or tibble being used
     Available in naniar (download via CRAN)
  48. 5. CORRELATION IN R / MISSING DATA ANALYSIS
     miss_var_summary(data)
     Using dplyr’s starwars data:
     miss_var_summary(starwars)
     Can be followed with %>% knitr::kable() to create a nicely formatted table of missing data.
  49. 5. CORRELATION IN R / MISSING DATA
     Pairwise deletion removes missing data on a case-by-case basis: for each pair of variables, only the observations missing a value on one of those two variables are dropped.
     index  ham    eggs       spam
     1      TRUE   Sunny      2
     2      FALSE  Poached    1
     3      TRUE   Scrambled  NA
     4      FALSE  NA         4

     ham    eggs             ham    spam
     TRUE   Sunny            TRUE   2
     FALSE  Poached          FALSE  1
     TRUE   Scrambled        FALSE  4
  50. 5. CORRELATION IN R / MISSING DATA
     This leads to unequal comparisons because the mix of observations for ham and eggs has a different composition than for ham and spam.
     index  ham    eggs       spam
     1      TRUE   Sunny      2
     2      FALSE  Poached    1
     3      TRUE   Scrambled  NA
     4      FALSE  NA         4

     ham    eggs             ham    spam
     TRUE   Sunny            TRUE   2
     FALSE  Poached          FALSE  1
     TRUE   Scrambled        FALSE  4
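To see which observations each pair of variables keeps under pairwise deletion, the breakfast example can be rebuilt with its missing values and checked with stats::complete.cases(). This is an illustrative sketch; rows_ham_eggs and rows_ham_spam are made-up names.

```r
breakfast <- data.frame(
  ham = c(TRUE, FALSE, TRUE, FALSE),
  eggs = c("Sunny", "Poached", "Scrambled", NA),
  spam = c(2, 1, NA, 4),
  stringsAsFactors = FALSE
)

# Rows usable for the ham-eggs pair: both values present
rows_ham_eggs <- which(complete.cases(breakfast[, c("ham", "eggs")]))

# Rows usable for the ham-spam pair
rows_ham_spam <- which(complete.cases(breakfast[, c("ham", "spam")]))

print(rows_ham_eggs)
print(rows_ham_spam)
```

Each pair keeps three observations, but not the same three, which is the source of the unequal comparisons.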
  51. 5. CORRELATION IN R / MISSING DATA
     Listwise deletion removes every observation that is missing a value on any of the given variables.
     index  ham    eggs       spam
     1      TRUE   Sunny      2
     2      FALSE  Poached    1
     3      TRUE   Scrambled  NA
     4      FALSE  NA         4
  52. 5. CORRELATION IN R / MISSING DATA
     Listwise deletion removes every observation that is missing a value on any of the given variables. This can significantly impact n. If listwise deletion removes more than 5% of the observations, this is problematic for generalization.
     index  ham    eggs       spam
     1      TRUE   Sunny      2
     2      FALSE  Poached    1
  53. 5. CORRELATION IN R / LISTWISE DELETION
     stats::na.omit(data)
     Removing all missing data from dplyr’s starwars data:
     > sw_listwise <- na.omit(starwars)
     Document how this impacts your sample size by using the base::nrow() function both before and after you use na.omit().
  54. 5. CORRELATION IN R / LISTWISE DELETION
     stats::na.omit(data)
     Removing all missing data from dplyr’s starwars data:
     > sw_listwise <- na.omit(starwars)
     Make sure to remove all unneeded variables (with dplyr::select()) before performing listwise deletion to avoid inadvertently removing too many observations.
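Both pieces of advice (count rows before and after, and drop unneeded variables first) can be sketched on the small breakfast example. Base R column indexing is used here in place of dplyr::select() so the example has no package dependencies; the object names are illustrative.

```r
breakfast <- data.frame(
  ham = c(TRUE, FALSE, TRUE, FALSE),
  eggs = c("Sunny", "Poached", "Scrambled", NA),
  spam = c(2, 1, NA, 4),
  stringsAsFactors = FALSE
)

n_before <- nrow(breakfast)

# Listwise deletion across every variable in the data frame
all_vars <- na.omit(breakfast)

# Keep only the variables needed, then delete listwise
ham_spam <- na.omit(breakfast[, c("ham", "spam")])

print(c(before = n_before,
        all_vars = nrow(all_vars),
        ham_spam = nrow(ham_spam)))
```

Dropping eggs first preserves one more observation, because the row that is missing only eggs no longer counts as incomplete.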
  55. 5. CORRELATION IN R / REVIEW
     What are the assumptions for Pearson’s r?
     ▸ Both x and y should be continuous, normally distributed variables
     ▸ There should be a linear relationship between x and y
     ▸ Sufficiently large sample size (n >= 30)
     ▸ There should be no extreme outliers
  56. 5. CORRELATION IN R / PEARSON’S R IN R
     cor(data, use = useVal, method = "pearson")
     Parameters:
     ▸ data is the data frame or tibble being used
     ▸ use is set equal to either "complete.obs" (listwise deletion) or "pairwise.complete.obs" (pairwise deletion)
     Available in stats (installed with base R)
  58. 5. CORRELATION IN R / PEARSON’S R IN R
     cor(data, use = useVal, method = "pearson")
     Problems:
     ▸ Does not provide statistical significance values for relationships.
     ▸ These can be obtained with a second function, cor.test(), but this only works on a single pair of variables at a time.
     ▸ Unwanted variables must be removed from the data frame.
     ▸ Rounds to 7 decimal places.
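The division of labor between the two stats functions can be sketched as follows. The built-in mtcars data are used as a stand-in for the course data (an assumption made so the example runs with base R only), and the object names are illustrative.

```r
# Keep only the variables of interest (mtcars is a stand-in data set)
vars <- mtcars[, c("cyl", "mpg", "disp")]

# Correlation matrix: coefficients only, no p-values
r_matrix <- cor(vars, use = "complete.obs", method = "pearson")
print(round(r_matrix, digits = 3))

# Significance test for one pair of variables at a time
test_cyl_mpg <- cor.test(vars$cyl, vars$mpg, method = "pearson")
print(test_cyl_mpg$estimate)
print(test_cyl_mpg$p.value)
```

The estimate from cor.test() matches the corresponding cell of the matrix; the test simply adds the p-value and confidence interval for that one pair.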
  59. 5. CORRELATION IN R / PEARSON’S R IN R
     rcorr(matrix, type = "pearson")
     Parameters:
     ▸ matrix is a matrix version of the data being used
     Available in Hmisc (download via CRAN)
  61. 5. CORRELATION IN R / PEARSON’S R IN R
     rcorr(matrix, type = "pearson")
     Problems:
     ▸ The data must be converted to a matrix, which means unwanted variables must be removed from the data frame ahead of time.
       • The error produced when you forget about the matrix requirement is utterly unhelpful.
     ▸ No option for listwise deletion.
     ▸ P-values returned in a separate part of list output.
     ▸ Rounds to two decimal places.
  62. 5. CORRELATION IN R / SETUP FOR PEARSON’S R
     > library(dplyr)
     > library(ggplot2)
     > library(Hmisc)    # provides rcorr()
     >
     > autoData <- mpg
     > autoSubset <- select(autoData, cyl, cty, hwy)
     > autoSubset <- as.matrix(autoSubset)
  63. 5. CORRELATION IN R / PEARSON’S R IN R
     > rcorr(autoSubset, type = "pearson")
           cyl   cty   hwy
     cyl  1.00 -0.81 -0.76
     cty -0.81  1.00  0.96
     hwy -0.76  0.96  1.00

     n= 234

     P
         cyl cty hwy
     cyl      0   0
     cty  0       0
     hwy  0   0
  64. 5. CORRELATION IN R / PEARSON’S R IN R
     corrTable(data, coef = "pearson", listwise = TRUE, round = 3, pStar = TRUE, ...)
     Parameters:
     ▸ data is the data frame or tibble being used
     Available as script in lecture-11 (download via GitHub)
  65. 5. CORRELATION IN R / PEARSON’S R IN R
     corrTable(data, coef = "pearson", listwise = TRUE, round = 3, pStar = TRUE, ...)
     Parameters:
     ▸ data is the data frame or tibble being used
     ▸ listwise is set equal to either TRUE (listwise deletion) or FALSE (pairwise deletion)
     ▸ round is set equal to the number of decimal places to display
     ▸ pStar is set equal to either TRUE (show stars) or FALSE (no statistical significance indicators)
     ▸ ... optionally provides a space for unquoted variable names, separated by commas, to limit output to specific variables.
  66. 5. CORRELATION IN R / PEARSON’S R IN R
     corrTable(data, coef = "pearson", listwise = TRUE, round = 3, pStar = TRUE, ...)
     Using the cyl, hwy, and cty variables from ggplot2’s mpg data:
     corrTable(mpg, coef = "pearson", listwise = TRUE, round = 3, pStar = TRUE, cyl, hwy, cty)
     Can be followed with %>% knitr::kable() to create a nicely formatted table of correlation coefficients.
  67. 5. CORRELATION IN R / PEARSON’S R IN R
     corrTable(mpg, coef = "pearson", listwise = TRUE, round = 3, pStar = TRUE, cyl, hwy, cty)
     Can be saved directly to .csv without using broom::tidy().
  68. 5. CORRELATION IN R / PEARSON’S R IN R
     corrTable(mpg, coef = "pearson", listwise = TRUE, round = 3, pStar = TRUE, cyl, hwy, cty)
     You will need to save the .R script from GitHub to source/ and then source the function call before using corrTable()!
  69. 5. CORRELATION IN R / PEARSON’S R IN R
     > corrTable(mpg, coef = "pearson", listwise = TRUE, round = 3,
     +     pStar = TRUE, cyl, hwy, cty)
          cyl        hwy       cty
     cyl  1.000
     hwy  -0.762***  1.000
     cty  -0.806***  0.956***  1.000
  70. 6. POWER ANALYSES FOR CORRELATION / SAMPLE SIZE ESTIMATES
     pwr.r.test(r = rVal, sig.level = .05, power = powerVal, alternative = "two.sided")
     Parameters:
     ▸ r should be set equal to the expected correlation coefficient
     ▸ sig.level should be set to the needed alpha value, which is typically .05
     ▸ power should be set equal to the needed power value (1-β, where β is the probability of Type II error); values of 80% to 90% are typically desired.
     ▸ alternative is used to specify whether significance testing will be done using one- or two-sided tests
     Available in pwr (download via CRAN)
  72. 6. POWER ANALYSES FOR CORRELATION / SAMPLE SIZE ESTIMATES
     pwr.r.test(r = rVal, sig.level = .05, power = powerVal, alternative = "two.sided")
     An estimate to detect a moderate effect size (r = .55) with high statistical power (.9):
     pwr.r.test(r = .55, sig.level = .05, power = .9, alternative = "two.sided")
  73. 6. POWER ANALYSES FOR CORRELATION / SAMPLE SIZE ESTIMATES
     > pwr.r.test(r = .55, sig.level = .05, power = .99, alternative = "two.sided")

          approximate correlation power calculation (arctangh transformation)

                   n = 50.24877
                   r = 0.55
           sig.level = 0.05
               power = 0.99
         alternative = two.sided
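The output above notes that the calculation relies on an arctanh (Fisher z) transformation. As a rough sketch of the idea, the textbook Fisher z approximation for the required sample size can be written in base R. This is not the exact algorithm used by the pwr package, so its results will differ slightly from pwr.r.test(), and approx_n_for_r is a hypothetical helper written for illustration.

```r
# Textbook Fisher z approximation for a two-sided test of a correlation
# (hypothetical helper; not the pwr package's algorithm)
approx_n_for_r <- function(r, sig.level = 0.05, power = 0.9) {
  z_alpha <- qnorm(1 - sig.level / 2)  # two-sided critical value
  z_power <- qnorm(power)              # quantile for the desired power
  ((z_alpha + z_power) / atanh(r))^2 + 3
}

n_90 <- approx_n_for_r(r = 0.55, power = 0.90)
n_99 <- approx_n_for_r(r = 0.55, power = 0.99)

# Round up, since a fraction of an observation cannot be sampled
print(ceiling(c(power_90 = n_90, power_99 = n_99)))
```

Raising the desired power from .90 to .99 increases the required sample size substantially, which is why the power argument deserves as much attention as the expected effect size.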
  74. 7. BACK MATTER / AGENDA REVIEW
     2. More with knitr 3. Scatterplots 4. Matrix Arrays 5. Correlation in R 6. Power Analyses for Correlation
  75. 7. BACK MATTER / REMINDERS
     Lab 10 and Problem Set 05 due next Monday, as is peer review of your partner’s materials!
     Draft papers due next Monday!
     Reminder - no additional lecture preps!