Used R and RStudio for data management and visualization with Tidyverse packages, including `ggplot2` and `dplyr`. Used R projects and GitHub for collaborative project building and R Markdown for communicating results. Statistical modeling emphasized linear and logistic regression. Instructor: Jamie D. Bedics, PhD, ABPP
Linear regression predicts one dependent variable (criterion) from one or more independent variables (predictors): Y = α + βX1 + e. The criterion (Y) and the predictor (X) are variables in your dataset. Y must be continuous (double, integer); X can be continuous or a factor.
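For concreteness, a minimal sketch of fitting this model with `lm()`, using a small hypothetical dataset (the `hours` and `score` variables are invented for illustration):

```r
library(tibble)

# Hypothetical data: hours studied (predictor) and exam score (criterion)
study <- tibble(
  hours = c(2, 4, 5, 7, 8, 10),
  score = c(61, 70, 68, 80, 85, 91)
)

# Fit Y = a + b*X1 + e with lm(); the criterion is continuous, as required
model <- lm(score ~ hours, data = study)
model
```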
The coefficients describe the relationship between X and Y in your data. For a one-predictor model: 1. The intercept is the score of Y when X is 0. 2. The slope is the amount of change in Y for every 1-unit increase in X. 3. The error represents what is unexplained by the model. We hope to use these coefficients from our sample to predict future scores.
What to interpret:
1. The intercept only makes sense if the value of Y when X is 0 is meaningful (otherwise, center the predictor; more on this later).
2. Slope (unstandardized)
3. Slope (standardized)
4. Significance of slopes
5. Confidence intervals for intercept and slopes
6. R²
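A sketch of pulling these quantities out of the fitted model (reusing the hypothetical `model` from above; the standardized slope comes later via `lm.beta`):

```r
coef(model)                  # 1-2: intercept and unstandardized slope
summary(model)$coefficients  # 4: slopes with SEs, t-values, p-values
confint(model)               # 5: 95% CIs for intercept and slope
summary(model)$r.squared     # 6: R-squared
```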
Rules for interpreting slope and intercept with a second predictor (and second slope coefficient): each slope is now the change in Y for a 1-unit increase in X while the other predictors (Xi) are held constant, and the intercept is the value of Y when all predictors are 0.
Interpreting R output using `display`: Y = α + βX1 + e. The intercept α is the value of Y when X is zero. The slope β is the amount of increase in Y for every 1-unit increase in X. The slope is the unstandardized (raw) coefficient, so its scale depends on how the variables are measured.
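`display()` lives in the arm package; a quick sketch of running it on the hypothetical model from above:

```r
# install.packages("arm")  # if needed
library(arm)

display(model)  # compact output: coef.est, coef.se, residual sd, R-Squared
```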
A rule of thumb: when the raw (unstandardized) slope coefficient is within 2 standard errors (coef.se) of zero, it is consistent with the null. Here the raw regression coefficient (3.94, the "signal") is greater than 2 × SE (0.74, the "noise"), so it is likely statistically significant.
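A tiny sketch of that signal-to-noise check, assuming the coefficient and standard-error values quoted above:

```r
b  <- 3.94       # raw slope coefficient ("signal"), value quoted above
se <- 0.37       # coef.se ("noise"); 2 * se = 0.74
abs(b) > 2 * se  # TRUE: coefficient is more than 2 SEs from zero
```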
The fit of the model can be judged by the "Residual SD" and "R-Squared." You can think of the residual standard deviation (sigma hat, σ̂) as a measure of the average distance that each observation falls from its prediction under the model. The residual SD and R² are related: R² = 1 − σ̂² / s_y², where s_y is the SD of Y.
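To make the relationship concrete, a sketch checking it on the hypothetical model (`sigma()` returns the residual SD for an `lm` fit):

```r
sigma_hat <- sigma(model)     # residual SD (sigma hat)
s_y       <- sd(study$score)  # SD of Y

1 - sigma_hat^2 / s_y^2       # close to R-squared (df adjustments aside)
summary(model)$r.squared
```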
STANDARDIZED REGRESSION COEFFICIENTS: Y = α + βX1 + e. Standardization allows comparisons of magnitude across different variables by putting each variable's unique scale of measurement on a common footing. For every 1-SD increase in our predictor (X), there is a β1-SD increase in our criterion (Y). Tip: according to APA Style, standardized regression coefficients are denoted by the Greek "β" and unstandardized regression coefficients by "B". Use the `lm.beta` function from QuantPsyc. You'll need the SD of the predictor and the SD of the criterion to properly interpret the result.
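A sketch using `lm.beta()` from QuantPsyc on the hypothetical model, along with the two SDs needed for interpretation:

```r
# install.packages("QuantPsyc")  # if needed
library(QuantPsyc)

lm.beta(model)   # beta: SDs of change in Y per 1-SD change in X
sd(study$hours)  # SD of the predictor
sd(study$score)  # SD of the criterion
```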
Multiple Regression: Y = α + β1X1 + β2X2 + e. The intercept α is the value of Y when X1 and X2 are both 0. Slopes: B1 is the amount of increase in Y for every 1-unit increase in X1 when X2 is held constant; B2 is the amount of increase in Y for every 1-unit increase in X2 when X1 is held constant.
Multiple regression with no interaction: Y = α + β1X1 + β2X2 + e. β1 compares participants with the same score on X2 (the other variable) who differ by 1 unit on X1 (the variable you're trying to interpret); β2 compares participants with the same score on X1 who differ by 1 unit on X2.
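A sketch of a two-predictor model with no interaction, adding a hypothetical `sleep` variable to the toy data:

```r
# Hypothetical second predictor: hours slept
study$sleep <- c(6, 7, 5, 8, 7, 9)

model2 <- lm(score ~ hours + sleep, data = study)
coef(model2)  # b1: change in score per hour studied, sleep held constant;
              # b2: change in score per hour slept, hours held constant
```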
Multiple Regression w/ Interaction: the slopes change meaning. β1 is the amount of increase in Y for every 1-unit increase in X1 when X2 is 0; β2 is the amount of increase in Y for every 1-unit increase in X2 when X1 is 0. The interaction coefficient tells you the exact amount β1 or β2 changes (+/−) based on the value of the other variable.
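And the interaction version, continuing the same hypothetical data (`x1 * x2` in an R formula expands to both main effects plus their product):

```r
model3 <- lm(score ~ hours * sleep, data = study)
coef(model3)  # hours:sleep is the interaction coefficient: the amount
              # the hours slope shifts per 1-unit increase in sleep
```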
Beyond the numbers, think about the impact and interpretability of your findings. Assumptions and Diagnostics (in decreasing order of importance). Additivity and linearity (the most serious violation):
1. The expected value of the dependent variable is a straight-line function of each independent variable, holding the others fixed.
2. The slope of that line does not depend on the values of the other variables.
3. The effects of different independent variables on the expected value of the dependent variable are additive.
Independence of errors: the errors from the model are independent. Equal variance of errors: constant variance of the errors; if violated, consider weighted least squares, though this often does not affect the most important parts of the regression model. Normality of errors: the least important assumption is that the errors are normally distributed.
Talk with people who know the variables about expected outcomes and what you might find, using unstandardized regression coefficients. Testing Assumptions in R. Additivity and linearity: diagnose with the residuals-versus-predicted plot (top-left panel of plot(model)). Independence: diagnose with the Durbin-Watson test, durbinWatsonTest(model) or dwt(model); values vary between 0 and 4, with 2 meaning no correlation; <1 or >3 is a concern (>2 indicates negative correlation; <2 indicates positive correlation).
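A sketch of both diagnostics on the hypothetical model; `durbinWatsonTest()` (alias `dwt()`) comes from the car package:

```r
# install.packages("car")  # if needed
library(car)

plot(model, which = 1)   # residuals vs fitted: look for no curve or pattern
durbinWatsonTest(model)  # ~2 = no autocorrelation; <1 or >3 is a concern
```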
Equal variance of errors: diagnose with Levene's test, leveneTest(data$variable1, data$variable2), or with residuals versus predicted scores; if the test is significant, the variances are different. Normality of errors (the least important assumption): diagnose with a Q-Q plot via the plot(model) function; points should fall closely along the line.
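A sketch of both checks, with a hypothetical grouping factor (`section`) invented for Levene's test; `leveneTest()` is also in car:

```r
library(car)

# Hypothetical grouping factor for the variance check
study$section <- factor(c("A", "A", "A", "B", "B", "B"))
leveneTest(study$score, study$section)  # significant -> unequal variances

plot(model, which = 2)  # Q-Q plot: points should fall closely along the line
```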
General Regression Principles: include all inputs that might be expected to be important in predicting the outcome. Inputs do not always have to be entered separately; you might average or sum several inputs into a single predictor. For inputs that have large effects, consider including interactions as well.
General Regression Principles for whether to keep a variable:
1. If a predictor is not statistically significant and has the expected sign, it is generally fine to keep it in the model.
2. If a predictor is not statistically significant and does not have the expected sign, consider removing it from the model.
3. If a predictor is statistically significant but does not have the expected sign, think hard about whether it makes sense.
4. If a predictor is significant and in the expected direction, keep it in!