SOC 4015 & SOC 5050 - Lecture 12

Slides for Lecture 12 of the Saint Louis University course Quantitative Analysis: Applied Inferential Statistics. These slides cover producing dissemination-ready plots with ggplot2 and the basics of linear regression.

Christopher Prener

November 12, 2018

Transcript

  1. AGENDA: QUANTITATIVE ANALYSIS / WEEK 12 / LECTURE 12

    1. Front Matter
    2. Plots for Dissemination
    3. The Failings of Simple Models
    4. Bivariate Regression Theory
    5. Bivariate Regression in R
    6. Back Matter
  2. 1. FRONT MATTER: ANNOUNCEMENTS

    All peer reviews are due next Monday - this is a change from the syllabus!
    Lab 11 is due next Monday - there will be no problem set. Please focus on the final project!
    To align our syllabus with GIS and the plan for next year, the final two problem sets will be waived and given full credit.
  3. 2. PLOTS FOR DISSEMINATION: BASE PLOT

    library(ggplot2)  # provides ggplot() and the mpg data

    ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
      geom_point(mapping = aes(color = as.factor(cyl)), position = "jitter") +
      geom_smooth(method = "lm")
  7. 2. PLOTS FOR DISSEMINATION: INCREASE POINT SIZE

    ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
      geom_point(mapping = aes(color = as.factor(cyl)), size = 4, position = "jitter") +
      geom_smooth(method = "lm", size = 2)
  8. 2. PLOTS FOR DISSEMINATION: ADD STROKE AROUND POINT

    ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
      geom_point(mapping = aes(fill = as.factor(cyl)), pch = 21, size = 4, position = "jitter") +
      geom_smooth(method = "lm", size = 2)
  9. 2. PLOTS FOR DISSEMINATION: INCREASE FONT SIZE

    ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
      geom_point(mapping = aes(fill = as.factor(cyl)), pch = 21, size = 4, position = "jitter") +
      geom_smooth(method = "lm", size = 2) +
      theme_grey(base_size = 28)
  10. 2. PLOTS FOR DISSEMINATION: ADJUST THE THEME

    library(ggthemes)  # provides theme_hc()

    ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
      geom_point(mapping = aes(fill = as.factor(cyl)), pch = 21, size = 4, position = "jitter") +
      geom_smooth(method = "lm", size = 2) +
      theme_hc(base_size = 28)
  11. 2. PLOTS FOR DISSEMINATION: ADJUST LABELS

    ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
      geom_point(mapping = aes(fill = as.factor(cyl)), pch = 21, size = 4, position = "jitter") +
      geom_smooth(method = "lm", size = 2) +
      theme_hc(base_size = 28) +
      labs(
        x = "Engine Displacement (litres)",
        y = "Highway Fuel Efficiency (mpg)"
      )
  12. 2. PLOTS FOR DISSEMINATION: ADD TITLE

    ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
      geom_point(mapping = aes(fill = as.factor(cyl)), pch = 21, size = 4, position = "jitter") +
      geom_smooth(method = "lm", size = 2) +
      theme_hc(base_size = 28) +
      labs(
        title = "Fuel Efficiency and Engine Size",
        x = "Engine Displacement (litres)",
        y = "Highway Fuel Efficiency (mpg)"
      )
  13. 2. PLOTS FOR DISSEMINATION: ADD SUBTITLE

    ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
      geom_point(mapping = aes(fill = as.factor(cyl)), pch = 21, size = 4, position = "jitter") +
      geom_smooth(method = "lm", size = 2) +
      theme_hc(base_size = 28) +
      labs(
        title = "Fuel Efficiency and Engine Size",
        subtitle = "Select Vehicles Sold in the United States",
        x = "Engine Displacement (litres)",
        y = "Highway Fuel Efficiency (mpg)"
      )
  14. 2. PLOTS FOR DISSEMINATION: ADD CAPTION

    ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
      geom_point(mapping = aes(fill = as.factor(cyl)), pch = 21, size = 4, position = "jitter") +
      geom_smooth(method = "lm", size = 2) +
      theme_hc(base_size = 28) +
      labs(
        title = "Fuel Efficiency and Engine Size",
        subtitle = "Select Vehicles Sold in the United States",
        caption = "Data via ggplot2 package for R",
        x = "Engine Displacement (litres)",
        y = "Highway Fuel Efficiency (mpg)"
      )
  15. 2. PLOTS FOR DISSEMINATION: EDIT LEGEND SIZE

    ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
      geom_point(mapping = aes(fill = as.factor(cyl)), pch = 21, size = 4, position = "jitter") +
      geom_smooth(method = "lm", size = 2) +
      theme_hc(base_size = 28) +
      labs(
        title = "Fuel Efficiency and Engine Size",
        subtitle = "Select Vehicles Sold in the United States",
        caption = "Data via ggplot2 package for R",
        x = "Engine Displacement (litres)",
        y = "Highway Fuel Efficiency (mpg)",
        fill = "Cylinders"
      ) +
      theme(legend.key.size = unit(1, units = "cm"))
  16. 2. PLOTS FOR DISSEMINATION: EDIT LEGEND LABELS

    ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
      geom_point(mapping = aes(fill = as.factor(cyl)), pch = 21, size = 4, position = "jitter") +
      geom_smooth(method = "lm", size = 2) +
      theme_hc(base_size = 28) +
      labs(
        title = "Fuel Efficiency and Engine Size",
        subtitle = "Select Vehicles Sold in the United States",
        caption = "Data via ggplot2 package for R",
        x = "Engine Displacement (litres)",
        y = "Highway Fuel Efficiency (mpg)"
      ) +
      theme(legend.key.size = unit(1, units = "cm")) +
      scale_fill_discrete(labels = c("Four", "Five", "Six", "Eight"), name = "Cylinders")

    Change fill to color if that is the aesthetic mapping used!
  17. 2. PLOTS FOR DISSEMINATION: EDIT COLOR PALETTE

    ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
      geom_point(mapping = aes(fill = as.factor(cyl)), pch = 21, size = 4, position = "jitter") +
      geom_smooth(method = "lm", size = 2) +
      theme_hc(base_size = 28) +
      labs(
        title = "Fuel Efficiency and Engine Size",
        subtitle = "Select Vehicles Sold in the United States",
        caption = "Data via ggplot2 package for R",
        x = "Engine Displacement (litres)",
        y = "Highway Fuel Efficiency (mpg)"
      ) +
      theme(legend.key.size = unit(1, units = "cm")) +
      scale_fill_brewer(palette = "Set1", labels = c("Four", "Five", "Six", "Eight"), name = "Cylinders")

    Change fill to color if that is the aesthetic mapping used!
  18. 2. PLOTS FOR DISSEMINATION: COLOR BREWER PALETTES

    > palette <- RColorBrewer::brewer.pal(9, "Set1")
    >
    > palette
    [1] "#E41A1C" "#377EB8" "#4DAF4A" "#984EA3" "#FF7F00" "#FFFF33" "#A65628" "#F781BF" "#999999"
    >
    > palette[1]
    [1] "#E41A1C"
    >
    > palette[5]
    [1] "#FF7F00"
  19. 2. PLOTS FOR DISSEMINATION: EDIT COLOR PALETTE

    ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
      geom_point(mapping = aes(fill = as.factor(cyl)), pch = 21, size = 4, position = "jitter") +
      geom_smooth(method = "lm", color = palette[5], size = 2) +
      theme_hc(base_size = 28) +
      labs(
        title = "Fuel Efficiency and Engine Size",
        subtitle = "Select Vehicles Sold in the United States",
        caption = "Data via ggplot2 package for R",
        x = "Engine Displacement (litres)",
        y = "Highway Fuel Efficiency (mpg)"
      ) +
      theme(legend.key.size = unit(1, units = "cm")) +
      scale_fill_brewer(palette = "Set1", labels = c("Four", "Five", "Six", "Eight"), name = "Cylinders")
  20. 2. PLOTS FOR DISSEMINATION: REMOVE LEGEND

    ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
      geom_point(mapping = aes(fill = as.factor(cyl)), pch = 21, size = 4, position = "jitter") +
      geom_smooth(method = "lm", color = palette[5], size = 2) +
      theme_hc(base_size = 28) +
      labs(
        title = "Fuel Efficiency and Engine Size",
        subtitle = "Select Vehicles Sold in the United States",
        caption = "Data via ggplot2 package for R",
        x = "Engine Displacement (litres)",
        y = "Highway Fuel Efficiency (mpg)"
      ) +
      theme(legend.position = "none") +
      scale_fill_brewer(palette = "Set1")
  21. 3. THE FAILINGS OF SIMPLE MODELS: DIFFERENCE OF MEANS

    [Diagram: independent variable (groups xa and xb) and dependent variable y]
  22. 3. THE FAILINGS OF SIMPLE MODELS: CORRELATION

    [Diagram: x (implied independent variable) and y (implied dependent variable)]
  23. 3. THE FAILINGS OF SIMPLE MODELS: THE “REAL” WORLD

    [Diagram: independent variables xa and xb; dependent variable y]
  24. 3. THE FAILINGS OF SIMPLE MODELS: THE “REAL” WORLD

    [Diagram: independent variables xa, xb, xc, xd, and xe; dependent variable y]
  25. 3. THE FAILINGS OF SIMPLE MODELS: THE “REAL” WORLD

    [Diagram: independent variables xa, xb, xc, xd, xe, and xf; dependent variable y]
  26. THE GOAL OF OLS REGRESSION

    [Plot: X Axis and Y Axis, each running from 0 to 10; annotation: residual sum of squares (SSR)]
  27. OLS REGRESSION BASICS

    [Plot: X Axis and Y Axis, each running from 0 to 10]
    $ y = \alpha + \beta x $ or $ y = b + mx $
    $\alpha$ is the Greek letter “alpha”
    $\beta$ is the Greek letter “beta”
  28. OLS REGRESSION BASICS

    [Plot: X Axis and Y Axis, each running from 0 to 10]
    $ y = \alpha + \beta x $
    where $\alpha$ = y intercept (when x = 0)
  29. OLS REGRESSION BASICS

    [Plot: X Axis and Y Axis, each running from 0 to 10]
    $ y = \alpha + \beta x $
    where $\beta$ = slope of line
  30. OLS REGRESSION BASICS

    [Plot: X Axis and Y Axis, each running from 0 to 10; $\alpha$ and $\beta x$ marked on the line]
    $ y = \alpha + \beta x $
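The intercept and slope described on these slides can be computed directly for a single predictor. The sketch below uses the standard closed-form OLS estimates (slope = covariance of x and y divided by the variance of x; intercept = mean of y minus slope times mean of x), which the slides themselves do not spell out. The data are hypothetical values invented for illustration.

```r
# Hypothetical toy data, invented for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Closed-form OLS estimates for one predictor
beta  <- cov(x, y) / var(x)        # slope of the line
alpha <- mean(y) - beta * mean(x)  # y intercept (value of y when x = 0)

# lm() fits the same line, so its coefficients should match
fit <- lm(y ~ x)
coef(fit)
```

With these values the slope works out to 0.6 and the intercept to 2.2, so the fitted line is y = 2.2 + 0.6x.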
  31. 4. BIVARIATE REGRESSION THEORY: OLS REGRESSION BASICS

    $ y = \alpha + \beta x $
    $ y = \alpha + {\beta}_{i}{x}_{i} + \epsilon $
    y is the dependent variable (DV)
    In regression analysis, $\alpha$ is called the ‘constant’ rather than the ‘intercept’
  32. 4. BIVARIATE REGRESSION THEORY: OLS REGRESSION BASICS

    $ y = \alpha + \beta x $
    $ y = {b}_{0} + {b}_{i}{x}_{i} + \epsilon $
    The subscript is used because the slope of y is dependent on multiple factors
    $\epsilon$ is the Greek letter “epsilon”
  33. 4. BIVARIATE REGRESSION THEORY: OLS REGRESSION BASICS

    $ y = a + bx $
    $ y = \alpha + {\beta}_{i}{x}_{i} + \epsilon $
    The subscript is used because the slope of y is dependent on multiple factors
    $\epsilon$ is included because we are estimating the line; there may be unexplained variation in y
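  34. 4. BIVARIATE REGRESSION THEORY: MODEL BUILDING

    $ y = \alpha + {\beta}_{i}{x}_{i} + \epsilon $
    y = dependent variable
    $\alpha$ = constant
    ${x}_{i}$ = independent variable i
    ${\beta}_{i}$ = beta value of IV i

    DV = height
    IV = gender (where FALSE = male & TRUE = female)
    $ {y}_{height} = \alpha + {\beta}_{1}{x}_{female} + \epsilon $
  35. 4. BIVARIATE REGRESSION THEORY: MODEL BUILDING

    $ {y}_{height} = \alpha + {\beta}_{1}{x}_{female} + \epsilon $
    Labels: ${y}_{height}$ is the DV, $\alpha$ is the constant, ${x}_{female}$ is the IV, $\epsilon$ is the error
    The reference category is “built in”: the constant is “male”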
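A small sketch may help show what a “built in” reference category means in practice. It assumes a hypothetical data frame named heights with invented values; only the variable coding (FALSE = male, TRUE = female) comes from the slide. With a logical predictor, lm() treats FALSE as the reference category, so the constant is the mean height for males and the coefficient femaleTRUE is the female-minus-male difference.

```r
# Hypothetical heights in cm, invented for illustration only
heights <- data.frame(
  height = c(178, 182, 175, 165, 160, 168),
  female = c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE)
)

# Fit height on the logical IV; FALSE (male) becomes the reference category
fit <- lm(height ~ female, data = heights)
coef(fit)

# Group means, for comparison with the coefficients
male_mean   <- mean(heights$height[!heights$female])
female_mean <- mean(heights$height[heights$female])

# The constant equals the male mean; the beta equals the difference in means
c(constant = male_mean, beta_female = female_mean - male_mean)
```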
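  36. 4. BIVARIATE REGRESSION THEORY: MODEL BUILDING

    $ y = \alpha + {\beta}_{i}{x}_{i} + \epsilon $
    y = dependent variable
    $\alpha$ = constant
    ${x}_{i}$ = independent variable i
    ${\beta}_{i}$ = beta value of IV i

    DV = test score
    IV = grade (where 0 = pre-K, 1 = elementary, & 2 = middle)
    $ {y}_{score} = \alpha + {\beta}_{1}{x}_{elementary} + {\beta}_{2}{x}_{middle} + \epsilon $
  37. 4. BIVARIATE REGRESSION THEORY: MODEL BUILDING

    $ {y}_{score} = \alpha + {\beta}_{1}{x}_{elementary} + {\beta}_{2}{x}_{middle} + \epsilon $
    Labels: ${y}_{score}$ is the DV, $\alpha$ is the constant, ${x}_{elementary}$ and ${x}_{middle}$ are the IVs, $\epsilon$ is the error
    The reference category is “built in”: the constant is “pre-K”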
  38. 4. BIVARIATE REGRESSION THEORY: REGRESSION EQUATION

    LaTeX source: $ y = \alpha + {\beta}_{1}{x}_{1} + \epsilon $
    Rendered: y = α + β₁x₁ + ε
  39. 4. BIVARIATE REGRESSION THEORY: ESTIMATED VALUE OF Y

    $ {y}_{score} = \alpha + {\beta}_{1}{x}_{elementary} + {\beta}_{2}{x}_{middle} + \epsilon $
    $ {\hat{y}}_{score} = 40 + (5){x}_{elementary} + (10){x}_{middle} $
    Circumflex or “hat” - e.g. “y-hat”
  40. 4. BIVARIATE REGRESSION THEORY: ESTIMATED VALUE OF Y

    $ {y}_{score} = \alpha + {\beta}_{1}{x}_{elementary} + {\beta}_{2}{x}_{middle} + \epsilon $
    $ {\hat{y}}_{score} = 40 + (5){x}_{elementary} + (10){x}_{middle} $
    Circumflex or “hat” - e.g. “y-hat”
    Values are the product of the regression equation
  41. 4. BIVARIATE REGRESSION THEORY: ESTIMATED VALUE OF Y

    What is the estimated mean test score for elementary students?
    $ {y}_{score} = \alpha + {\beta}_{1}{x}_{elementary} + {\beta}_{2}{x}_{middle} + \epsilon $
    $ {\hat{y}}_{score} = 40 + (5){x}_{elementary} + (10){x}_{middle} $
  42. 4. BIVARIATE REGRESSION THEORY: ESTIMATED VALUE OF Y

    What is the estimated mean test score for elementary students?
    $ {y}_{score} = \alpha + {\beta}_{1}{x}_{elementary} + {\beta}_{2}{x}_{middle} + \epsilon $
    $ {\hat{y}}_{score} = 40 + (5){x}_{elementary} + (10){x}_{middle} $
    $ {\hat{y}}_{score} = 40 + (5)(1) + (10)(0) $
  43. 4. BIVARIATE REGRESSION THEORY: ESTIMATED VALUE OF Y

    What is the estimated mean test score for elementary students?
    $ {y}_{score} = \alpha + {\beta}_{1}{x}_{elementary} + {\beta}_{2}{x}_{middle} + \epsilon $
    $ {\hat{y}}_{score} = 40 + (5){x}_{elementary} + (10){x}_{middle} $
    $ {\hat{y}}_{score} = 40 + (5)(1) + (10)(0) $
    $ {\hat{y}}_{score} = 40 + 5 + 0 $
    $ {\hat{y}}_{score} = 45 $
  44. 4. BIVARIATE REGRESSION THEORY: ESTIMATED VALUE OF Y

    What is the estimated mean test score for middle school students?
    $ {y}_{score} = \alpha + {\beta}_{1}{x}_{elementary} + {\beta}_{2}{x}_{middle} + \epsilon $
    $ {\hat{y}}_{score} = 40 + (5){x}_{elementary} + (10){x}_{middle} $
  45. 4. BIVARIATE REGRESSION THEORY: ESTIMATED VALUE OF Y

    What is the estimated mean test score for middle school students?
    $ {y}_{score} = \alpha + {\beta}_{1}{x}_{elementary} + {\beta}_{2}{x}_{middle} + \epsilon $
    $ {\hat{y}}_{score} = 40 + (5){x}_{elementary} + (10){x}_{middle} $
    $ {\hat{y}}_{score} = 40 + (5)(0) + (10)(1) $
    $ {\hat{y}}_{score} = 40 + 0 + 10 $
    $ {\hat{y}}_{score} = 50 $
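The same arithmetic can be written as a short R function. This is only a sketch: the constant (40) and betas (5 and 10) are the values from the slides, while the function name estimate_score and its argument names are hypothetical. Setting both dummy variables to 0 gives the estimate for the reference category, pre-K.

```r
# Values taken from the slides' estimated regression equation
alpha           <- 40  # constant (pre-K, the reference category)
beta_elementary <- 5   # beta for the elementary dummy variable
beta_middle     <- 10  # beta for the middle school dummy variable

# Hypothetical helper: plug 0/1 dummy values into the equation
estimate_score <- function(elementary, middle) {
  alpha + beta_elementary * elementary + beta_middle * middle
}

estimate_score(elementary = 0, middle = 0)  # pre-K: 40
estimate_score(elementary = 1, middle = 0)  # elementary: 45
estimate_score(elementary = 0, middle = 1)  # middle school: 50
```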
  46. MECHANICS OF OLS REGRESSION

    [Plot: X Axis and Y Axis, each running from 0 to 10; annotation: total sum of squares (SST)]
  47. MECHANICS OF OLS REGRESSION

    [Plot: X Axis and Y Axis, each running from 0 to 10; annotation: model sum of squares (SSM)]
  48. MECHANICS OF OLS REGRESSION

    [Plot: X Axis and Y Axis, each running from 0 to 10; annotation: residual sum of squares (SSR)]
  49. 4. BIVARIATE REGRESSION THEORY: EXPLAINED VARIATION

    Estimate of the proportion of the variance of y that the independent variable x “explains”.
    Let:
    ▸ $r^2$ = estimate of variation explained
    ▸ SSM = model sum of squares
    ▸ SST = total sum of squares
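The formula that ties these three terms together did not survive in the transcript. The sketch below assumes the standard definition, $ r^2 = SSM / SST $, along with the identity SST = SSM + SSR for an OLS model with a constant. The data are hypothetical values invented for illustration.

```r
# Hypothetical toy data, invented for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

fit   <- lm(y ~ x)
y_hat <- fitted(fit)  # estimated values of y from the regression line

sst <- sum((y - mean(y))^2)      # total sum of squares
ssm <- sum((y_hat - mean(y))^2)  # model sum of squares
ssr <- sum((y - y_hat)^2)        # residual sum of squares

# Proportion of the variation in y that x "explains"
r2 <- ssm / sst
r2

# Should agree with the value reported by summary()
summary(fit)$r.squared
```

For these values SST is 6, SSM is 3.6, and SSR is 2.4, so $r^2$ is 0.6.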
  50. 4. BIVARIATE REGRESSION THEORY: ASSUMPTIONS

    Basic Assumptions:
    1. y must be continuous*
    2. x can be binary, ordinal*, or continuous
    3. x must have a variance > 0
    4. Relationship between x and y is linear
    5. y should be normally distributed
    6. There should be no significant outliers in x and y
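Some of these assumptions can be screened with base R before fitting a model. The sketch below shows one possible set of checks, not a procedure taken from the lecture. It uses the built-in mtcars data as a stand-in (an assumption made so the example runs without extra packages), and the Shapiro-Wilk test and the boxplot rule are just one choice each for assessing normality and outliers.

```r
# mtcars ships with base R; used here only as stand-in data
x <- mtcars$disp  # engine displacement
y <- mtcars$mpg   # fuel efficiency

# Assumption 3: x must have a variance > 0
var(x) > 0

# Assumption 4: a scatterplot is one way to eyeball linearity
plot(x, y)

# Assumption 5: one possible normality check for y (Shapiro-Wilk test)
normality <- shapiro.test(y)
normality$p.value

# Assumption 6: values flagged as outliers by the boxplot rule
outliers_x <- boxplot.stats(x)$out
outliers_y <- boxplot.stats(y)$out
```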
  51. 5. BIVARIATE REGRESSION IN R: BASIC OLS MODEL

    lm(y ~ x, data = dataFrame)

    Available in stats; included in base distributions of R.
    Parameters:
    ▸ tilde (~) used in the construction of the formula where:
      • y is the dependent variable
      • x is the independent variable
    ▸ dataFrame is the data source (can be a tibble)
  53. 5. BIVARIATE REGRESSION IN R: BASIC OLS MODEL

    lm(y ~ x, data = dataFrame)

    Using the hwy and displ variables from ggplot2’s mpg data:

    > lm(hwy ~ displ, data = mpg)

    Save model output into an object for reference later. Output is stored as a list, and contains far more data than what is printed.
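To see what “far more data than what is printed” means, the saved model object can be inspected with a few base R functions. This sketch swaps in the built-in mtcars data (mpg regressed on disp) as a stand-in for ggplot2’s mpg data so that it runs without loading any packages; the same calls work on any lm() object.

```r
# Stand-in model using built-in data (an assumption; the slides use ggplot2's mpg)
model <- lm(mpg ~ disp, data = mtcars)

is.list(model)  # the model object is stored as a list
names(model)    # components available beyond the printed output

coef(model)             # the constant and the beta
head(residuals(model))  # first few residuals
head(fitted(model))     # first few estimated values of y
```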
  54. 5. BIVARIATE REGRESSION IN R: BASIC OLS MODEL

    > library(ggplot2)
    > model <- lm(hwy ~ displ, data = mpg)
    > summary(model)

    <<<<< OUTPUT ON NEXT SLIDE >>>>>
  55. 5. BIVARIATE REGRESSION IN R: BASIC OLS MODEL

    > summary(model)

    Call:
    lm(formula = hwy ~ displ, data = x)

    Residuals:
        Min      1Q  Median      3Q     Max
    -7.1039 -2.1646 -0.2242  2.0589 15.0105

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  35.6977     0.7204   49.55   <2e-16 ***
    displ        -3.5306     0.1945  -18.15   <2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 3.836 on 232 degrees of freedom
    Multiple R-squared:  0.5868, Adjusted R-squared:  0.585
    F-statistic: 329.5 on 1 and 232 DF,  p-value: < 2.2e-16
  57. 5. BIVARIATE REGRESSION IN R: SIMPLIFIED OUTPUT (BETAS)

    > summary(model)

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  35.6977     0.7204   49.55   <2e-16 ***
    displ        -3.5306     0.1945  -18.15   <2e-16 ***

    A one-liter increase in the size of the engine is associated with a 3.531 mpg decrease in highway fuel efficiency (β = -3.531, p < .001). The larger the engine, the lower the estimated fuel efficiency of the vehicle.
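The two estimates in this output define the fitted line, so they can be plugged into the regression equation to estimate highway fuel efficiency for a given engine size. The sketch below uses the coefficients exactly as printed on the slide; the function name estimated_hwy is hypothetical.

```r
# Coefficients copied from the summary() output on the slide
constant   <- 35.6977  # (Intercept)
beta_displ <- -3.5306  # displ

# Hypothetical helper: estimated highway mpg for a given displacement in liters
estimated_hwy <- function(displ) {
  constant + beta_displ * displ
}

estimated_hwy(2)  # 35.6977 - 3.5306 * 2 = 28.6365

# Moving from a 2 liter to a 3 liter engine changes the estimate by the beta
estimated_hwy(3) - estimated_hwy(2)
```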
  58. 5. BIVARIATE REGRESSION IN R: SIMPLIFIED OUTPUT (R2)

    > summary(model)

    Multiple R-squared:  0.5868, Adjusted R-squared:  0.585

    The size of the engine accounts for an estimated 58.5% of the variation in highway fuel efficiency (adjusted R-squared = 0.585).
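The two R-squared values in this output are related by the standard adjustment for sample size and number of predictors, $ 1 - (1 - R^2)(n - 1)/(n - k - 1) $, which the slides do not show. The sketch below reproduces the adjusted value from the printed multiple R-squared. It assumes n = 234 observations, inferred from the 232 residual degrees of freedom plus the two estimated coefficients, and k = 1 predictor.

```r
r2 <- 0.5868  # Multiple R-squared from the output
n  <- 234     # observations (232 residual df + 2 estimated coefficients)
k  <- 1       # number of independent variables

# Standard adjusted R-squared formula
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
round(adj_r2, 3)  # 0.585
```

With only one predictor and a few hundred observations, the adjustment is small, which is why the two values differ only in the third decimal place.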
  59. 6. BACK MATTER: AGENDA REVIEW

    2. Plots for Dissemination
    3. The Failings of Simple Models
    4. Bivariate Regression Theory
    5. Bivariate Regression in R
  60. 6. BACK MATTER: REMINDERS

    All peer reviews are due next Monday - this is a change from the syllabus!
    Lab 11 is due next Monday - there will be no problem set. Please focus on the final project!