
SOC 4930 & SOC 5050 - Week 13

Lecture slides for Lecture 13 of the Saint Louis University course Quantitative Analysis: Applied Inferential Statistics. These slides cover the basics of multiple regression and the creation of formatted regression tables.

Christopher Prener

November 19, 2018


Transcript

  1. AGENDA / QUANTITATIVE ANALYSIS / WEEK 13 / LECTURE 13
     1. Front Matter
     2. Multiple Regression Theory
     3. Multiple Regression in R
     4. Regression Tables
     5. Back Matter
  2. 1. FRONT MATTER / ANNOUNCEMENTS
     All peer reviews are due today!
     Lab 12 is due next Monday - there will be no problem set. Please focus on the final project!
  3. 2. MULTIPLE REGRESSION THEORY / THE “REAL” WORLD
     [Diagram: the dependent variable y linked to multiple independent variables xa, xb, xc, xd, and xe.]
  4. THE GOAL OF OLS REGRESSION
     [Scatterplot with a fitted regression line, illustrating the residual sum of squares (SSR).]
  5. 2. MULTIPLE REGRESSION THEORY / IN OTHER WORDS…
     We want to explain as much of the variation in y as we can while also minimizing the residual error around the regression line.
  6. 2. MULTIPLE REGRESSION THEORY / ACCOUNTING FOR VARIATION
     [Diagram: the dependent variable (LHS) explained by the study variable A along with additional variables B, C, D, and E (RHS).]
  7. 2. MULTIPLE REGRESSION THEORY / VARIABLES FOR THE RHS
     What other measures are included with your data (or are accessible) that might help account for variation in y?
  8. 2. MULTIPLE REGRESSION THEORY / LIMITS ON THE RHS
     For every RHS variable, we need 10 to 15 observations. If we include more RHS variables than that rule of thumb allows, we consider the model “overfit”.
  9. 2. MULTIPLE REGRESSION THEORY / LIMITS ON THE RHS
     > library(ggplot2)
     > autoData <- mpg
     > nrow(mpg)
     [1] 234
     ? How many RHS predictors could we include in our model?
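     As a quick sketch of that arithmetic (the calculation is implied, not shown, on the slide), the 10-to-15 rule turns 234 observations into a rough predictor budget:
     > n <- nrow(ggplot2::mpg)   # 234 observations
     > floor(n / 15)             # conservative limit on RHS variables
     [1] 15
     > floor(n / 10)             # generous limit on RHS variables
     [1] 23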
  10. 2. MULTIPLE REGRESSION THEORY / AN EXAMPLE
     What is the effect of grade point average on test scores, controlling for the effects of time spent studying and socioeconomic status?
  11. 2. MULTIPLE REGRESSION THEORY / AN EXAMPLE
     H1: higher grade point averages are associated with higher test scores, holding constant both effort and SES.
  12. 2. MULTIPLE REGRESSION THEORY / AN EXAMPLE
     y = dependent variable; α = constant; xᵢ = independent variable i; βᵢ = beta value of IV i; ε = error term
     DV = test score; ME (main effect) = gpa; IV = hours studying; IV = free lunch eligible
     General form: y = α + βᵢxᵢ + ε
     Example: y_score = α + β₁x_gpa + β₂x_studyHrs + β₃x_freeLunch + ε
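     A minimal sketch of fitting this model in R (not from the slides; the data frame students and its columns score, gpa, studyHrs, and freeLunch are assumed names for illustration):
     > # hypothetical data: assumes a data frame `students` with columns
     > # score, gpa, studyHrs, and freeLunch (a 0/1 indicator)
     > fit <- lm(score ~ gpa + studyHrs + freeLunch, data = students)
     > summary(fit)   # prints the coefficients, R-squared, and F-statistic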
  13. 2. MULTIPLE REGRESSION THEORY / AN EXAMPLE
     y_score = α + β₁x_gpa + β₂x_studyHrs + β₃x_freeLunch + ε
     ? What do you think the constant (or intercept) represents?
  14. 2. MULTIPLE REGRESSION THEORY / AN EXAMPLE
     y_score = α + β₁x_gpa + β₂x_studyHrs + β₃x_freeLunch + ε
     The average test score for a student with a GPA of “0” who studied for “0” hours and does not get a free lunch.
  15. 2. MULTIPLE REGRESSION THEORY / INTERPRETING BETAS
     Coefficients:
                  Estimate  Std. Error  t value  Pr(>|t|)
     (Intercept)  68.7166    5.7262     12.000   1.81e-15 ***
     gpa           6.1242    0.0811      1.531   0.0009   ***
     studyHrs     -1.17074   0.28129    -4.162   0.000148 ***
     freeLunch    -7.8843    3.7484     -2.103   0.0412   *
     ---
     Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
     Residual standard error: 9.728 on 43 degrees of freedom
     Multiple R-squared: 0.433, Adjusted R-squared: 0.3935
     F-statistic: 10.95 on 3 and 43 DF, p-value: 1.811e-05
     ? How would you interpret the effect of GPA on test scores?
  16. 2. MULTIPLE REGRESSION THEORY / INTERPRETING BETAS
     [Model output repeated from slide 15.]
     A unit change in GPA is associated with a 6.124 point (p = .0009) increase in test scores. Higher GPAs are associated with better test scores, controlling for hours spent studying and free lunch eligibility.
  17. 2. MULTIPLE REGRESSION THEORY / INTERPRETING BETAS
     [Model output repeated from slide 15.]
     ? How would you interpret the effects of hours spent studying and free lunch eligibility on test scores?
  18. 2. MULTIPLE REGRESSION THEORY / STANDARD ERROR OF BETA
     [Model output repeated from slide 15.]
     The standard error is an indicator of the amount of uncertainty in the estimate, representing the amount of variation present across observations. It is also used to find t.
  19. 2. MULTIPLE REGRESSION THEORY / MEASURES OF MODEL FIT
     [Model output repeated from slide 15.]
     These are based on calculations of the total sum of squares and the residual sum of squared error.
  20. 2. MULTIPLE REGRESSION THEORY / INTERPRETING R-SQUARED
     [Model output repeated from slide 15.]
     We use adjusted R² with multiple regression to account for artificial increases in R² due to added RHS parameters.
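     For reference (this formula is not on the slides), the adjustment applies a degrees-of-freedom penalty for the k RHS variables fit on n observations:
     adjusted R² = 1 − (1 − R²)(n − 1) / (n − k − 1)
     Plugging in the output above (R² = 0.433, n = 47, k = 3) gives 1 − 0.567 × 46/43 ≈ 0.3935, matching the reported value.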
  21. 2. MULTIPLE REGRESSION THEORY / INTERPRETING R-SQUARED
     [Model output repeated from slide 15.]
     ? How would you interpret the adjusted R² value?
  22. 2. MULTIPLE REGRESSION THEORY / INTERPRETING R-SQUARED
     [Model output repeated from slide 15.]
     The adjusted R² value indicates that these factors together account for 39.35% of the variation in test scores.
  23. 2. MULTIPLE REGRESSION THEORY / RESIDUAL STANDARD ERROR
     [Model output repeated from slide 15.]
     Also known as the root mean squared error, this is the average error per observation. We want to minimize this value.
  24. 2. MULTIPLE REGRESSION THEORY / THE F-STATISTIC
     [Model output repeated from slide 15.]
     This evaluates the null hypothesis that all the betas are equal to zero. It is a measure of the reliability of the model.
  25. 2. MULTIPLE REGRESSION THEORY / CONFIDENCE INTERVALS
                        2.5 %       97.5 %
     (Intercept)   57.17611384   80.2570179
     gpa            8.03926404    4.2876193
     studyHours     3.06346803    0.5734902
     freeLunch    -15.43865689   -1.3300016
     These are measures of the accuracy of each beta estimate. If an interval includes zero, the estimate will not be statistically significant.
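     In R, these intervals come from confint(); a quick sketch using the hypothetical fit object from the sketch after slide 12:
     > confint(fit)                 # 95% intervals for each beta by default
     > confint(fit, level = 0.99)   # widen to a 99% interval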
  26. 2. MULTIPLE REGRESSION THEORY / USING MULTIPLE MODELS
     y = dependent variable; α = constant; xᵢ = independent variable i; βᵢ = beta value of IV i; ε = error term
     DV = test score; ME (main effect) = gpa; IV = hours studying; IV = free lunch eligible
     General form: y = α + βᵢxᵢ + ε
     Example: y_score = α + β₁x_gpa + β₂x_studyHrs + β₃x_freeLunch + ε
  27. 2. MULTIPLE REGRESSION THEORY / MULTIPLE MODELS
     Model 1, Main Effects: y_score = α + β₁x_gpa + ε
     Model 2, Full Model: y_score = α + β₁x_gpa + β₂x_studyHrs + β₃x_freeLunch + ε
  28. 2. MULTIPLE REGRESSION THEORY / MODEL BUILDING
     y = dependent variable; α = constant; xᵢ = independent variable i; βᵢ = beta value of IV i; ε = error term
     DV = test score; ME (main effect) = gpa; IV = hours studying; IV = free lunch eligible; IV = gender; IV = race (white, black, other)
     General form: y = α + βᵢxᵢ + ε
     Example: y_score = α + β₁x_gpa + β₂x_studyHrs + β₃x_freeLunch + β₄x_female + β₅x_white + β₆x_black + ε
  29. 2. MULTIPLE REGRESSION THEORY / MULTIPLE MODELS
     Model 1, Main Effects: y_score = α + β₁x_gpa + ε
     Model 2, Main effect + other educational measures: y_score = α + β₁x_gpa + β₂x_studyHrs + β₃x_freeLunch + ε
     Model 3, Full Model: y_score = α + β₁x_gpa + β₂x_studyHrs + β₃x_freeLunch + β₄x_female + β₅x_white + β₆x_black + ε
  30. 2. MULTIPLE REGRESSION THEORY / COMPARING MODEL FIT
     Model 1, Main Effects: y_score = α + β₁x_gpa + ε
     Model 2, Main effect + other educational measures: y_score = α + β₁x_gpa + β₂x_studyHrs + β₃x_freeLunch + ε
     Model 3, Full Model: y_score = α + β₁x_gpa + β₂x_studyHrs + β₃x_freeLunch + β₄x_female + β₅x_white + β₆x_black + ε
     As models grow, adjusted R² should increase, indicating increasing explanatory power.
  31. 2. MULTIPLE REGRESSION THEORY / COMPARING MODEL FIT
     Model 1, Main Effects: y_score = α + β₁x_gpa + ε
     Model 2, Main effect + other educational measures: y_score = α + β₁x_gpa + β₂x_studyHrs + β₃x_freeLunch + ε
     Model 3, Full Model: y_score = α + β₁x_gpa + β₂x_studyHrs + β₃x_freeLunch + β₄x_female + β₅x_white + β₆x_black + ε
     We can also use the AIC and BIC “information criteria” values, which should decrease as fit improves. A sketch of this comparison in R appears below.
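     A sketch of this model-building sequence in R, again assuming the hypothetical students data frame (the column names are assumptions, not from the slides):
     > model1 <- lm(score ~ gpa, data = students)
     > model2 <- lm(score ~ gpa + studyHrs + freeLunch, data = students)
     > model3 <- lm(score ~ gpa + studyHrs + freeLunch + female + white + black,
     +              data = students)
     > # adjusted R-squared should rise, and AIC/BIC should fall, across models
     > sapply(list(model1, model2, model3), AIC)
     > sapply(list(model1, model2, model3), BIC)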
  32. 3. MULTIPLE REGRESSION IN R / OLS MODEL
     lm(y ~ x1+x2+x3, data = dataFrame)
     Parameters:
     ▸ the tilde (~) is used in the construction of the formula, where:
       • y is the dependent variable
       • x1, x2, and x3 are the independent variables
     ▸ dataFrame is the data source (can be a tibble)
     All functions in this section are available in stats, which is included in base distributions of R.
  34. 3. MULTIPLE REGRESSION IN R / OLS MODEL
     lm(y ~ x1+x2+x3, data = dataFrame)
     Using the hwy, cyl, and displ variables from ggplot2’s mpg data:
     > model <- lm(hwy ~ displ+cyl, data = autoData)
     Save model output into an object for reference later. Output is stored as a list and contains far more data than what is printed.
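     For example (not shown on the slide), the stored list can be inspected directly:
     > names(model)         # components lm() stores, e.g. coefficients, residuals
     > model$coefficients   # named vector of the intercept and slopes
     > summary(model)       # the familiar printed summary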
  35. 3. MULTIPLE REGRESSION IN R / CONFIDENCE INTERVALS & MODEL FIT
     confint(model)
     AIC(model)
     BIC(model)
     Parameters:
     ▸ model is a regression model object’s name
  36. 3. MULTIPLE REGRESSION IN R / CONFIDENCE INTERVALS
     confint(model)
     Using the hwy, cyl, and displ variables from ggplot2’s mpg data:
     > model <- lm(hwy ~ displ+cyl, data = autoData)
     > confint(model)
  37. 3. MULTIPLE REGRESSION IN R / CONFIDENCE INTERVALS
     > confint(model)
                     2.5 %      97.5 %
     (Intercept) 36.151002  40.2813041
     displ       -2.983299  -0.9364445
     cyl         -2.174163  -0.5332101
  38. 3. MULTIPLE REGRESSION IN R / AKAIKE’S INFORMATION CRITERION
     AIC(model)
     Using the hwy, cyl, and displ variables from ggplot2’s mpg data:
     > model <- lm(hwy ~ displ+cyl, data = autoData)
     > AIC(model)
     [1] 1288.779
  39. 3. MULTIPLE REGRESSION IN R / BAYESIAN INFORMATION CRITERION
     BIC(model)
     Using the hwy, cyl, and displ variables from ggplot2’s mpg data:
     > model <- lm(hwy ~ displ+cyl, data = autoData)
     > BIC(model)
     [1] 1302.601
  40. 3. MULTIPLE REGRESSION IN R / MULTIPLE OLS MODELS
     lm(y ~ x1+x2+x3, data = dataFrame)
     Using the hwy, cyl, and displ variables from ggplot2’s mpg data:
     > model <- lm(hwy ~ displ, data = autoData)
     > model2 <- lm(hwy ~ displ+cyl, data = autoData)
     Name each model object clearly!
  41. 4. REGRESSION TABLES / BASIC REGRESSION TABLE
     stargazer(models, title = "table title")
     Parameters:
     ▸ models is a comma-separated list of all regression models
     ▸ "table title" is a title for your regression table
     All functions in this section are from stargazer; download via CRAN.
  43. 4. REGRESSION TABLES / BASIC REGRESSION TABLE
     stargazer(models, title = "table title")
     Using the hwy, cyl, and displ variables from ggplot2’s mpg data:
     > library(stargazer)
     > model <- lm(hwy ~ displ+cyl, data = mpg)
     > stargazer(model, title = "basic regression table")
     <<<<< OUTPUT OMITTED >>>>>
     This will return LaTeX output by default. We’ll convert this to a Word document once the table is fully prepared.
  44. 4. REGRESSION TABLES / ADDING ADDITIONAL STATISTICS
     stargazer(models, title = "table title", add.lines = list(c("text", value, value)))
     Using the hwy, cyl, and displ variables from ggplot2’s mpg data:
     > model <- lm(hwy ~ displ+cyl, data = mpg)
     > aic <- round(AIC(model), digits = 3)
     > bic <- round(BIC(model), digits = 3)
     > stargazer(model, title = "basic regression table",
     +           add.lines = list(c("AIC", aic), c("BIC", bic)))
     <<<<< OUTPUT OMITTED >>>>>
  45. 4. REGRESSION TABLES / REMOVING UNNEEDED STATISTICS
     stargazer(models, title = "table title", add.lines = list(c("text", value, value)), omit.stat = "rsq", df = FALSE)
     Using the hwy, cyl, and displ variables from ggplot2’s mpg data:
     > model <- lm(hwy ~ displ+cyl, data = mpg)
     > aic <- round(AIC(model), digits = 3)
     > bic <- round(BIC(model), digits = 3)
     > stargazer(model, title = "basic regression table",
     +           add.lines = list(c("AIC", aic), c("BIC", bic)),
     +           omit.stat = "rsq", df = FALSE)
     <<<<< OUTPUT OMITTED >>>>>
  46. 4. REGRESSION TABLES / CREATING OUTPUT
     stargazer(models, title = "table title", add.lines = list(c("text", value, value)), omit.stat = "rsq", df = FALSE, type = "html", out = filepath)
     Using the hwy, cyl, and displ variables from ggplot2’s mpg data:
     > model <- lm(hwy ~ displ+cyl, data = autoData)
     > aic <- round(AIC(model), digits = 3)
     > bic <- round(BIC(model), digits = 3)
     > stargazer(model, title = "basic regression table",
     +           add.lines = list(c("AIC", aic), c("BIC", bic)),
     +           omit.stat = "rsq", df = FALSE, type = "html",
     +           out = here("results", "models.html"))
  47. 4. REGRESSION TABLES / COMBINING MULTIPLE MODELS
     > model1 <- lm(hwy ~ displ, data = mpg)
     > aic1 <- AIC(model1)
     > bic1 <- BIC(model1)
     >
     > model2 <- lm(hwy ~ displ+cyl, data = mpg)
     > aic2 <- AIC(model2)
     > bic2 <- BIC(model2)
     >
     > stargazer(model1, model2, title = "Estimating Fuel Efficiency",
     +           add.lines = list(c("AIC", aic1, aic2), c("BIC", bic1, bic2)),
     +           omit.stat = "rsq", df = FALSE, type = "html",
     +           out = here("results", "models.html"))
  48. 4. REGRESSION TABLES / SUMMARY STATISTICS
     stargazer(data.frame(data), title = "table title", summary = TRUE, omit.summary.stat = c("p25", "p75"), type = "html", out = filepath)
     Using the hwy, cyl, and displ variables from ggplot2’s mpg data:
     > autoSub <- dplyr::select(mpg, hwy, cyl, displ)
     > stargazer(data.frame(autoSub), title = "Descriptive Statistics",
     +           summary = TRUE, omit.summary.stat = c("p25", "p75"),
     +           type = "html", out = here("results", "descriptives.html"))
  49. 5. BACK MATTER / AGENDA REVIEW
     2. Multiple Regression Theory
     3. Multiple Regression in R
     4. Regression Tables
  50. 5. BACK MATTER / REMINDERS
     All peer reviews are due today!
     Lab 12 is due next Monday - there will be no problem set. Please focus on the final project!