
SOC 4930 & SOC 5050 - Week 13

Lecture slides for Lecture 13 of the Saint Louis University course Quantitative Analysis: Applied Inferential Statistics. These slides cover the basics of multiple regression and the creation of formatted regression tables.

Christopher Prener

November 19, 2018


Transcript

  1. AGENDA / QUANTITATIVE ANALYSIS / WEEK 13 / LECTURE 13
     1. Front Matter
     2. Multiple Regression Theory
     3. Multiple Regression in R
     4. Regression Tables
     5. Back Matter
  2. 1. FRONT MATTER / ANNOUNCEMENTS
     All peer reviews are due today!
     Lab 12 is due next Monday - there will be no problem set. Please focus on the final project!
  3. 2. MULTIPLE REGRESSION THEORY / THE “REAL” WORLD
     [Diagram: the dependent variable y linked to multiple independent variables xa, xb, xc, xd, and xe.]
  4. THE GOAL OF OLS REGRESSION
     [Scatterplot with a fitted regression line, illustrating the residual sum of squares (SSR).]
  5. 2. MULTIPLE REGRESSION THEORY / IN OTHER WORDS…
     We want to explain as much of the variation in y as we can while also minimizing the residual error around the regression line.
  6. 2. MULTIPLE REGRESSION THEORY / ACCOUNTING FOR VARIATION
     [Diagram: the dependent variable (LHS) explained by the study variable A along with additional variables B, C, D, and E (RHS).]
  7. 2. MULTIPLE REGRESSION THEORY / VARIABLES FOR THE RHS
     What other measures are included with your data (or are accessible) that might help account for variation in y?
  8. 2. MULTIPLE REGRESSION THEORY / LIMITS ON THE RHS
     For every RHS variable, we need 10 to 15 observations. If we include more RHS variables than that rule of thumb allows, we consider the model “overfit”.
  9. 2. MULTIPLE REGRESSION THEORY / LIMITS ON THE RHS
     > library(ggplot2)
     > autoData <- mpg
     > nrow(mpg)
     [1] 234
     ? How many RHS predictors could we include in our model?
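     As a quick sketch of that arithmetic (the calculation is implied, not shown, on the slide), the 10-to-15 rule turns 234 observations into a rough predictor budget:
     > n <- nrow(ggplot2::mpg)   # 234 observations
     > floor(n / 15)             # conservative limit on RHS variables
     [1] 15
     > floor(n / 10)             # generous limit on RHS variables
     [1] 23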
  10. 2. MULTIPLE REGRESSION THEORY / AN EXAMPLE
     What is the effect of grade point average on test scores, controlling for the effects of time spent studying and socioeconomic status?
  11. 2. MULTIPLE REGRESSION THEORY / AN EXAMPLE
     H1: higher grade point averages are associated with higher test scores, holding constant both effort and SES.
  12. 2. MULTIPLE REGRESSION THEORY / AN EXAMPLE
     y = dependent variable; α = constant; xᵢ = independent variable i; βᵢ = beta value of IV i; ε = error term
     DV = test score; ME (main effect) = gpa; IV = hours studying; IV = free lunch eligible
     General form: y = α + βᵢxᵢ + ε
     Example: y_score = α + β₁x_gpa + β₂x_studyHrs + β₃x_freeLunch + ε
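     A minimal sketch of fitting this model in R (not from the slides; the data frame students and its columns score, gpa, studyHrs, and freeLunch are assumed names for illustration):
     > # hypothetical data: assumes a data frame `students` with columns
     > # score, gpa, studyHrs, and freeLunch (a 0/1 indicator)
     > fit <- lm(score ~ gpa + studyHrs + freeLunch, data = students)
     > summary(fit)   # prints the coefficients, R-squared, and F-statistic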
  13. 2. MULTIPLE REGRESSION THEORY / AN EXAMPLE
     y_score = α + β₁x_gpa + β₂x_studyHrs + β₃x_freeLunch + ε
     ? What do you think the constant (or intercept) represents?
  14. 2. MULTIPLE REGRESSION THEORY / AN EXAMPLE
     y_score = α + β₁x_gpa + β₂x_studyHrs + β₃x_freeLunch + ε
     The average test score for a student with a GPA of “0” who studied for “0” hours and does not get a free lunch.
  15. 2. MULTIPLE REGRESSION THEORY / INTERPRETING BETAS
     Coefficients:
                  Estimate  Std. Error  t value  Pr(>|t|)
     (Intercept)  68.7166    5.7262     12.000   1.81e-15 ***
     gpa           6.1242    0.0811      1.531   0.0009   ***
     studyHrs     -1.17074   0.28129    -4.162   0.000148 ***
     freeLunch    -7.8843    3.7484     -2.103   0.0412   *
     ---
     Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
     Residual standard error: 9.728 on 43 degrees of freedom
     Multiple R-squared: 0.433, Adjusted R-squared: 0.3935
     F-statistic: 10.95 on 3 and 43 DF, p-value: 1.811e-05
     ? How would you interpret the effect of GPA on test scores?
  16. 2. MULTIPLE REGRESSION THEORY / INTERPRETING BETAS
     [Model output repeated from slide 15.]
     A unit change in GPA is associated with a 6.124 point (p = .0009) increase in test scores. Higher GPAs are associated with better test scores, controlling for hours spent studying and free lunch eligibility.
  17. 2. MULTIPLE REGRESSION THEORY / INTERPRETING BETAS
     [Model output repeated from slide 15.]
     ? How would you interpret the effects of hours spent studying and free lunch eligibility on test scores?
  18. 2. MULTIPLE REGRESSION THEORY / STANDARD ERROR OF BETA
     [Model output repeated from slide 15.]
     The standard error is an indicator of the amount of uncertainty in the estimate, representing the amount of variation present across observations. It is also used to find t.
  19. 2. MULTIPLE REGRESSION THEORY / MEASURES OF MODEL FIT
     [Model output repeated from slide 15.]
     These are based on calculations of the total sum of squares and the residual sum of squared error.
  20. 2. MULTIPLE REGRESSION THEORY / INTERPRETING R-SQUARED
     [Model output repeated from slide 15.]
     We use adjusted R² with multiple regression to account for artificial increases in R² due to added RHS parameters.
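     For reference (this formula is not on the slides), the adjustment applies a degrees-of-freedom penalty for the k RHS variables fit on n observations:
     adjusted R² = 1 − (1 − R²)(n − 1) / (n − k − 1)
     Plugging in the output above (R² = 0.433, n = 47, k = 3) gives 1 − 0.567 × 46/43 ≈ 0.3935, matching the reported value.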
  21. 2. MULTIPLE REGRESSION THEORY / INTERPRETING R-SQUARED
     [Model output repeated from slide 15.]
     ? How would you interpret the adjusted R² value?
  22. 2. MULTIPLE REGRESSION THEORY / INTERPRETING R-SQUARED
     [Model output repeated from slide 15.]
     The adjusted R² value indicates that these factors together account for 39.35% of the variation in test scores.
  23. 2. MULTIPLE REGRESSION THEORY / RESIDUAL STANDARD ERROR
     [Model output repeated from slide 15.]
     Also known as the root mean squared error, this is the average error per observation. We want to minimize this value.
  24. 2. MULTIPLE REGRESSION THEORY / THE F-STATISTIC
     [Model output repeated from slide 15.]
     This evaluates the null hypothesis that all the betas are equal to zero. It is a measure of the reliability of the model.
  25. 2. MULTIPLE REGRESSION THEORY / CONFIDENCE INTERVALS
                        2.5 %       97.5 %
     (Intercept)   57.17611384   80.2570179
     gpa            8.03926404    4.2876193
     studyHours     3.06346803    0.5734902
     freeLunch    -15.43865689   -1.3300016
     These are measures of the accuracy of each beta estimate. If an interval includes zero, the estimate will not be statistically significant.
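     In R, these intervals come from confint(); a quick sketch using the hypothetical fit object from the sketch after slide 12:
     > confint(fit)                 # 95% intervals for each beta by default
     > confint(fit, level = 0.99)   # widen to a 99% interval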
  26. 2. MULTIPLE REGRESSION THEORY / USING MULTIPLE MODELS
     y = dependent variable; α = constant; xᵢ = independent variable i; βᵢ = beta value of IV i; ε = error term
     DV = test score; ME (main effect) = gpa; IV = hours studying; IV = free lunch eligible
     General form: y = α + βᵢxᵢ + ε
     Example: y_score = α + β₁x_gpa + β₂x_studyHrs + β₃x_freeLunch + ε
  27. 2. MULTIPLE REGRESSION THEORY / MULTIPLE MODELS
     Model 1, Main Effects: y_score = α + β₁x_gpa + ε
     Model 2, Full Model: y_score = α + β₁x_gpa + β₂x_studyHrs + β₃x_freeLunch + ε
  28. 2. MULTIPLE REGRESSION THEORY / MODEL BUILDING
     y = dependent variable; α = constant; xᵢ = independent variable i; βᵢ = beta value of IV i; ε = error term
     DV = test score; ME (main effect) = gpa; IV = hours studying; IV = free lunch eligible; IV = gender; IV = race (white, black, other)
     General form: y = α + βᵢxᵢ + ε
     Example: y_score = α + β₁x_gpa + β₂x_studyHrs + β₃x_freeLunch + β₄x_female + β₅x_white + β₆x_black + ε
  29. 2. MULTIPLE REGRESSION THEORY / MULTIPLE MODELS
     Model 1, Main Effects: y_score = α + β₁x_gpa + ε
     Model 2, Main effect + other educational measures: y_score = α + β₁x_gpa + β₂x_studyHrs + β₃x_freeLunch + ε
     Model 3, Full Model: y_score = α + β₁x_gpa + β₂x_studyHrs + β₃x_freeLunch + β₄x_female + β₅x_white + β₆x_black + ε
  30. 2. MULTIPLE REGRESSION THEORY / COMPARING MODEL FIT
     Model 1, Main Effects: y_score = α + β₁x_gpa + ε
     Model 2, Main effect + other educational measures: y_score = α + β₁x_gpa + β₂x_studyHrs + β₃x_freeLunch + ε
     Model 3, Full Model: y_score = α + β₁x_gpa + β₂x_studyHrs + β₃x_freeLunch + β₄x_female + β₅x_white + β₆x_black + ε
     As models grow, adjusted R² should increase, indicating increasing explanatory power.
  31. 2. MULTIPLE REGRESSION THEORY / COMPARING MODEL FIT
     Model 1, Main Effects: y_score = α + β₁x_gpa + ε
     Model 2, Main effect + other educational measures: y_score = α + β₁x_gpa + β₂x_studyHrs + β₃x_freeLunch + ε
     Model 3, Full Model: y_score = α + β₁x_gpa + β₂x_studyHrs + β₃x_freeLunch + β₄x_female + β₅x_white + β₆x_black + ε
     We can also use the AIC and BIC “information criteria” values, which should decrease as fit improves. A sketch of this comparison in R appears below.
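     A sketch of this model-building sequence in R, again assuming the hypothetical students data frame (the column names are assumptions, not from the slides):
     > model1 <- lm(score ~ gpa, data = students)
     > model2 <- lm(score ~ gpa + studyHrs + freeLunch, data = students)
     > model3 <- lm(score ~ gpa + studyHrs + freeLunch + female + white + black,
     +              data = students)
     > # adjusted R-squared should rise, and AIC/BIC should fall, across models
     > sapply(list(model1, model2, model3), AIC)
     > sapply(list(model1, model2, model3), BIC)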
  32. 3. MULTIPLE REGRESSION IN R / OLS MODEL
     lm(y ~ x1+x2+x3, data = dataFrame)
     Parameters:
     ▸ the tilde (~) is used in the construction of the formula, where:
       • y is the dependent variable
       • x1, x2, and x3 are the independent variables
     ▸ dataFrame is the data source (can be a tibble)
     All functions in this section are available in stats, which is included in base distributions of R.
  34. 3. MULTIPLE REGRESSION IN R / OLS MODEL
     lm(y ~ x1+x2+x3, data = dataFrame)
     Using the hwy, cyl, and displ variables from ggplot2’s mpg data:
     > model <- lm(hwy ~ displ+cyl, data = autoData)
     Save model output into an object for reference later. Output is stored as a list and contains far more data than what is printed.
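     For example (not shown on the slide), the stored list can be inspected directly:
     > names(model)         # components lm() stores, e.g. coefficients, residuals
     > model$coefficients   # named vector of the intercept and slopes
     > summary(model)       # the familiar printed summary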
  35. 3. MULTIPLE REGRESSION IN R / CONFIDENCE INTERVALS & MODEL FIT
     confint(model)
     AIC(model)
     BIC(model)
     Parameters:
     ▸ model is a regression model object’s name
  36. 3. MULTIPLE REGRESSION IN R / CONFIDENCE INTERVALS
     confint(model)
     Using the hwy, cyl, and displ variables from ggplot2’s mpg data:
     > model <- lm(hwy ~ displ+cyl, data = autoData)
     > confint(model)
  37. 3. MULTIPLE REGRESSION IN R / CONFIDENCE INTERVALS
     > confint(model)
                     2.5 %      97.5 %
     (Intercept) 36.151002  40.2813041
     displ       -2.983299  -0.9364445
     cyl         -2.174163  -0.5332101
  38. 3. MULTIPLE REGRESSION IN R / AKAIKE’S INFORMATION CRITERION
     AIC(model)
     Using the hwy, cyl, and displ variables from ggplot2’s mpg data:
     > model <- lm(hwy ~ displ+cyl, data = autoData)
     > AIC(model)
     [1] 1288.779
  39. 3. MULTIPLE REGRESSION IN R / BAYESIAN INFORMATION CRITERION
     BIC(model)
     Using the hwy, cyl, and displ variables from ggplot2’s mpg data:
     > model <- lm(hwy ~ displ+cyl, data = autoData)
     > BIC(model)
     [1] 1302.601
  40. 3. MULTIPLE REGRESSION IN R / MULTIPLE OLS MODELS
     lm(y ~ x1+x2+x3, data = dataFrame)
     Using the hwy, cyl, and displ variables from ggplot2’s mpg data:
     > model <- lm(hwy ~ displ, data = autoData)
     > model2 <- lm(hwy ~ displ+cyl, data = autoData)
     Name each model object clearly!
  41. 4. REGRESSION TABLES / BASIC REGRESSION TABLE
     stargazer(models, title = "table title")
     Parameters:
     ▸ models is a comma-separated list of all regression models
     ▸ "table title" is a title for your regression table
     All functions in this section are from stargazer; download via CRAN.
  43. 4. REGRESSION TABLES / BASIC REGRESSION TABLE
     stargazer(models, title = "table title")
     Using the hwy, cyl, and displ variables from ggplot2’s mpg data:
     > library(stargazer)
     > model <- lm(hwy ~ displ+cyl, data = mpg)
     > stargazer(model, title = "basic regression table")
     <<<<< OUTPUT OMITTED >>>>>
     This will return LaTeX output by default. We’ll convert this to a Word document once the table is fully prepared.
  44. 4. REGRESSION TABLES / ADDING ADDITIONAL STATISTICS
     stargazer(models, title = "table title", add.lines = list(c("text", value, value)))
     Using the hwy, cyl, and displ variables from ggplot2’s mpg data:
     > model <- lm(hwy ~ displ+cyl, data = mpg)
     > aic <- round(AIC(model), digits = 3)
     > bic <- round(BIC(model), digits = 3)
     > stargazer(model, title = "basic regression table",
     +           add.lines = list(c("AIC", aic), c("BIC", bic)))
     <<<<< OUTPUT OMITTED >>>>>
  45. 4. REGRESSION TABLES / REMOVING UNNEEDED STATISTICS
     stargazer(models, title = "table title", add.lines = list(c("text", value, value)), omit.stat = "rsq", df = FALSE)
     Using the hwy, cyl, and displ variables from ggplot2’s mpg data:
     > model <- lm(hwy ~ displ+cyl, data = mpg)
     > aic <- round(AIC(model), digits = 3)
     > bic <- round(BIC(model), digits = 3)
     > stargazer(model, title = "basic regression table",
     +           add.lines = list(c("AIC", aic), c("BIC", bic)),
     +           omit.stat = "rsq", df = FALSE)
     <<<<< OUTPUT OMITTED >>>>>
  46. 4. REGRESSION TABLES / CREATING OUTPUT
     stargazer(models, title = "table title", add.lines = list(c("text", value, value)), omit.stat = "rsq", df = FALSE, type = "html", out = filepath)
     Using the hwy, cyl, and displ variables from ggplot2’s mpg data:
     > model <- lm(hwy ~ displ+cyl, data = autoData)
     > aic <- round(AIC(model), digits = 3)
     > bic <- round(BIC(model), digits = 3)
     > stargazer(model, title = "basic regression table",
     +           add.lines = list(c("AIC", aic), c("BIC", bic)),
     +           omit.stat = "rsq", df = FALSE, type = "html",
     +           out = here("results", "models.html"))
  47. 4. REGRESSION TABLES / COMBINING MULTIPLE MODELS
     > model1 <- lm(hwy ~ displ, data = mpg)
     > aic1 <- AIC(model1)
     > bic1 <- BIC(model1)
     >
     > model2 <- lm(hwy ~ displ+cyl, data = mpg)
     > aic2 <- AIC(model2)
     > bic2 <- BIC(model2)
     >
     > stargazer(model1, model2, title = "Estimating Fuel Efficiency",
     +           add.lines = list(c("AIC", aic1, aic2), c("BIC", bic1, bic2)),
     +           omit.stat = "rsq", df = FALSE, type = "html",
     +           out = here("results", "models.html"))
  48. 4. REGRESSION TABLES / SUMMARY STATISTICS
     stargazer(data.frame(data), title = "table title", summary = TRUE, omit.summary.stat = c("p25", "p75"), type = "html", out = filepath)
     Using the hwy, cyl, and displ variables from ggplot2’s mpg data:
     > autoSub <- dplyr::select(mpg, hwy, cyl, displ)
     > stargazer(data.frame(autoSub), title = "Descriptive Statistics",
     +           summary = TRUE, omit.summary.stat = c("p25", "p75"),
     +           type = "html", out = here("results", "descriptives.html"))
  49. 5. BACK MATTER / AGENDA REVIEW
     2. Multiple Regression Theory
     3. Multiple Regression in R
     4. Regression Tables
  50. 5. BACK MATTER / REMINDERS
     All peer reviews are due today!
     Lab 12 is due next Monday - there will be no problem set. Please focus on the final project!