Linear regression is one of the most common research methods for examining the linear relationship between a dependent variable Y and one or more independent variables X.
Getting sample data.
#install.packages("car") ##if this is a new package we need to install first.
library(car) #We have to library the package every time we use
We are using the dataset "Prestige"* from the car package for this guide. We can use the help command to access the codebook:
help("Prestige") View(Prestige)
*Fox, J. and Weisberg, S. (2019) An R Companion to Applied Regression, Third Edition, Sage.
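Before fitting any models, it can also help to take a quick look at the variables themselves; a minimal sketch:
str(Prestige)      # variable names and types
head(Prestige)     # first few occupations
summary(Prestige)  # basic descriptive statistics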
In the simple linear regression example we have only one dependent variable (prestige) and one independent variable (education).
reg1 <- lm(prestige ~ education, data = Prestige)  # We put the dependent variable to the left of the '~'
# and the independent variable(s) to the right, and we tell R which dataset we are referring to.
summary(reg1)

Call:
lm(formula = prestige ~ education, data = Prestige)

Residuals:
     Min       1Q   Median       3Q      Max 
-26.0397  -6.5228   0.6611   6.7430  18.1636 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -10.732      3.677  -2.919  0.00434 ** 
education      5.361      0.332  16.148  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.103 on 100 degrees of freedom
Multiple R-squared:  0.7228,    Adjusted R-squared:  0.72 
F-statistic: 260.8 on 1 and 100 DF,  p-value: < 2.2e-16
From the result of this univariate analysis, we can see that average education (in years) has a significant (small p-value; the general rule of thumb is p < 0.05) and positive relationship (positive coefficient) with the prestige score. To be specific, one way of interpreting the result is: a one-unit increase in average education years is associated with an increase of about 5.361 points in the prestige score. R-squared shows the proportion of the variance of Y explained by X; in this case average education level explains 72.28% of the variance in prestige scores. Adjusted R-squared (72%) shows the same as R-squared but adjusted for the number of cases and the number of variables. When the number of variables is small and the number of cases is very large, adjusted R-squared is close to R-squared. It provides a more honest measure of the association between X and Y.
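If we want to pull these quantities out of the fitted model programmatically, for example to build our own tables, a short sketch:
coef(reg1)                   # intercept and slope
confint(reg1)                # 95% confidence intervals for the coefficients
summary(reg1)$r.squared      # R-squared
summary(reg1)$adj.r.squared  # adjusted R-squared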
It is a useful practice to transform a variable to its log form when doing the regression. You can either transform the variable beforehand or do so inside the formula.
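For example, both of the following produce the same fitted model (a small illustrative sketch; reg2 below uses the in-formula version):
# transform beforehand
Prestige$log_income <- log(Prestige$income)
lm(prestige ~ education + log_income + women, data = Prestige)
# or transform inside the formula
lm(prestige ~ education + log(income) + women, data = Prestige)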
reg2 <- lm(prestige ~ education + log(income) + women, data = Prestige)
summary(reg2)

Call:
lm(formula = prestige ~ education + log(income) + women, data = Prestige)

Residuals:
    Min      1Q  Median      3Q     Max 
-17.364  -4.429  -0.101   4.316  19.179 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -110.9658    14.8429  -7.476 3.27e-11 ***
education      3.7305     0.3544  10.527  < 2e-16 ***
log(income)   13.4382     1.9138   7.022 2.90e-10 ***
women          0.0469     0.0299   1.568     0.12    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.093 on 98 degrees of freedom
Multiple R-squared:  0.8351,    Adjusted R-squared:  0.83 
F-statistic: 165.4 on 3 and 98 DF,  p-value: < 2.2e-16
The interpretation of the other coefficients is similar to the previous model. The interpretation of the log-transformed variable is a bit different: to interpret its slope coefficient we divide it by 100 (13.4382/100 = 0.134382). This tells us that a 1% increase in income is associated with an increase of about 0.13 points in the prestige score.
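As a rough check of this interpretation, we can compare the model's predictions at a given income and at an income 1% higher, holding the other predictors fixed (the use of median values below is purely illustrative):
newdat  <- data.frame(education = median(Prestige$education),
                      income    = median(Prestige$income),
                      women     = median(Prestige$women))
newdat2 <- transform(newdat, income = income * 1.01)
predict(reg2, newdata = newdat2) - predict(reg2, newdata = newdat)
# close to 13.4382 * log(1.01), i.e. roughly 0.13 points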
Note that in some cases a variable contains the value 0, so we cannot directly take its log. We can add 1 to every observation and then do the transformation. For other approaches, see: https://aosmith.rbind.io/2018/09/19/the-log-0-problem/.
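As a small illustration, log1p() computes log(x + 1) directly and handles zeros; the 'women' variable (which takes values at or near 0 for some occupations) is used here only as an example:
# log(0) is -Inf, so add 1 before taking the log
Prestige$log_women <- log1p(Prestige$women)  # same as log(Prestige$women + 1)
summary(Prestige$log_women)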
After running the regression model, we can access the model's predicted (fitted) values and the residuals, i.e., the differences between the observed and predicted values.
# first we get the model's predicted (fitted) values
prestige_pred <- fitted(reg2)
prestige_pred <- as.data.frame(prestige_pred)
# now we add the residual values too
prestige_resid <- residuals(reg2)
prestige_pred$resid <- prestige_resid
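A quick sketch of how we might set the predictions next to the observed prestige scores; since reg2 drops no observations, the rows line up with the original data:
prestige_pred$observed <- Prestige$prestige
head(prestige_pred)
# observed vs. predicted; points near the 45-degree line are predicted well
plot(prestige_pred$observed, prestige_pred$prestige_pred,
     xlab = "Observed prestige", ylab = "Predicted prestige")
abline(0, 1)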
We can also run the regression with heteroskedasticity-robust standard errors, which account for unequal error variances.
library(lmtest)
library(sandwich)
reg2$robse <- vcovHC(reg2, type = "HC1")
coeftest(reg2, reg2$robse)

t test of coefficients:

                Estimate  Std. Error t value  Pr(>|t|)    
(Intercept) -110.965824   15.275221 -7.2644 9.074e-11 ***
education      3.730508    0.388808  9.5947 9.176e-16 ***
log(income)   13.438223    1.994275  6.7384 1.107e-09 ***
women          0.046895    0.031484  1.4895    0.1396    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Similarly, we can obtain cluster-robust standard error results. Note that reg2$clse is not created automatically: it must first be assigned a cluster-robust covariance matrix (one way to construct it is sketched after the output below).
# cluster-robust standard errors
coeftest(reg2, reg2$clse)
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -110.965824 14.842928 -7.4760 3.270e-11 ***
education 3.730508 0.354383 10.5268 < 2.2e-16 ***
log(income) 13.438223 1.913757 7.0219 2.896e-10 ***
women 0.046895 0.029899 1.5685 0.12
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
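A minimal sketch of how reg2$clse could be constructed, using vcovCL() from the sandwich package; the choice of 'type' as the clustering variable is purely illustrative, and we refit the model on complete cases so that the model and the clustering variable line up:
library(lmtest)
library(sandwich)
# drop the rows where 'type' is missing so the cluster variable matches the estimation sample
Prestige_cc <- na.omit(Prestige)
reg2_cc <- lm(prestige ~ education + log(income) + women, data = Prestige_cc)
reg2_cc$clse <- vcovCL(reg2_cc, cluster = Prestige_cc$type)  # cluster-robust covariance matrix
coeftest(reg2_cc, reg2_cc$clse)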
Categorical variables play a big part in linear regression too. Note that in this guide we only discuss the case where the independent variables are categorical. When the dependent variable is categorical, you may consider alternatives to linear regression such as logit regression and multinomial regression.
reg3 <- lm(prestige ~ education + log(income) + type, data = Prestige)
summary(reg3)

Call:
lm(formula = prestige ~ education + log(income) + type, data = Prestige)

Residuals:
    Min      1Q  Median      3Q     Max 
-13.511  -3.746   1.011   4.356  18.438 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -82.6413    13.7875  -5.994 3.86e-08 ***
education     3.2845     0.6081   5.401 5.06e-07 ***
log(income)  10.4875     1.7167   6.109 2.31e-08 ***
typebc        1.4394     2.3780   0.605   0.5465    
typeprof      8.1903     2.5882   3.165   0.0021 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.637 on 93 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared:  0.8555,    Adjusted R-squared:  0.8493 
F-statistic: 137.6 on 4 and 93 DF,  p-value: < 2.2e-16
Here “type” is a categorical (factor) variable with three levels: bc (blue collar), prof (professional, managerial, and technical), and wc (white collar). R automatically recognizes it as a factor and treats it accordingly. The level missing from the coefficient summary ('wc' here) is treated as the baseline, so its coefficient is 0. Remember to add the intercept value when calculating the estimate for the baseline group.
Here R has automatically made 'wc' the reference group. We can change that by manually reordering the factor levels.
# change the reference group to 'bc'
Prestige$type <- with(Prestige, factor(type, levels = c("bc", "wc", "prof")))
reg3_reorder <- lm(prestige ~ education + log(income) + type, data = Prestige)
summary(reg3_reorder)

Call:
lm(formula = prestige ~ education + log(income) + type, data = Prestige)

Residuals:
    Min      1Q  Median      3Q     Max 
-13.511  -3.746   1.011   4.356  18.438 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -81.2019    13.7431  -5.909 5.63e-08 ***
education     3.2845     0.6081   5.401 5.06e-07 ***
log(income)  10.4875     1.7167   6.109 2.31e-08 ***
typewc       -1.4394     2.3780  -0.605   0.5465    
typeprof      6.7509     3.6185   1.866   0.0652 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.637 on 93 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared:  0.8555,    Adjusted R-squared:  0.8493 
F-statistic: 137.6 on 4 and 93 DF,  p-value: < 2.2e-16
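An equivalent and slightly shorter way to change only the reference level is relevel(); refitting the model afterwards gives the same results as above (a small illustrative sketch):
Prestige$type <- relevel(Prestige$type, ref = "bc")
reg3_relevel <- lm(prestige ~ education + log(income) + type, data = Prestige)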
We can see that the reference group has changed. We can also make all three categories show their own estimates by removing the intercept from the model.
reg3_showall<-lm(prestige~0+education+log(income)+type,data=Prestige)
summary(reg3_showall)
Call:
lm(formula = prestige ~ 0 + education + log(income) + type, data = Prestige)
Residuals:
Min 1Q Median 3Q Max
-13.511 -3.746 1.011 4.356 18.438
Coefficients:
Estimate Std. Error t value Pr(>|t|)
education 3.2845 0.6081 5.401 5.06e-07 ***
log(income) 10.4875 1.7167 6.109 2.31e-08 ***
typebc -81.2019 13.7431 -5.909 5.63e-08 ***
typewc -82.6413 13.7875 -5.994 3.86e-08 ***
typeprof -74.4510 15.1175 -4.925 3.65e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.637 on 93 degrees of freedom
(4 observations deleted due to missingness)
Multiple R-squared: 0.9835, Adjusted R-squared: 0.9826
F-statistic: 1107 on 5 and 93 DF, p-value: < 2.2e-16
We can see that in this way all factor levels show their own estimates; with the intercept removed, each 'type' coefficient is the group-specific intercept.
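As a quick sanity check, these group-specific intercepts agree with the usual parameterization: each equals the baseline intercept plus the corresponding type coefficient. For example:
coef(reg3_showall)["typebc"]        # about -81.20
coef(reg3_reorder)["(Intercept)"]   # also about -81.20, since 'bc' is the reference group there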
Sometimes we are interested in the interaction between two variables. Here is an example of an interaction term involving a categorical variable.
reg4 <- lm(prestige ~ type*(education + log(income)), data = Prestige)
# These are two other ways of running the same model
reg4a <- lm(prestige ~ education + log(income) + type + log(income):type + education:type, data = Prestige)
reg4b <- lm(prestige ~ education*type + log(income)*type, data = Prestige)
summary(reg4)

Call:
lm(formula = prestige ~ type * (education + log(income)), data = Prestige)

Residuals:
    Min      1Q  Median      3Q     Max 
-13.970  -4.124   1.206   3.829  18.059 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)          -120.0459    20.1576  -5.955 5.07e-08 ***
typewc                 30.2412    37.9788   0.796  0.42800    
typeprof               85.1601    31.1810   2.731  0.00761 ** 
education               2.3357     0.9277   2.518  0.01360 *  
log(income)            15.9825     2.6059   6.133 2.32e-08 ***
typewc:education        3.6400     1.7589   2.069  0.04140 *  
typeprof:education      0.6974     1.2895   0.541  0.58998    
typewc:log(income)     -8.1556     4.4029  -1.852  0.06730 .  
typeprof:log(income)   -9.4288     3.7751  -2.498  0.01434 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.409 on 89 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared:  0.871,     Adjusted R-squared:  0.8595 
F-statistic: 75.15 on 8 and 89 DF,  p-value: < 2.2e-16
We can use a table to show the combined coefficient of each variable for each type.
            | bc      | wc                       | prof
intercept   | -120.05 | -120.05 + 30.24 = -89.81 | -120.05 + 85.16 = -34.89
log(income) | 15.98   | 15.98 - 8.16 = 7.82      | 15.98 - 9.43 = 6.55
education   | 2.34    | 2.34 + 3.64 = 5.98       | 2.34 + 0.697 = 3.037
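These combined values can also be recovered directly from the fitted model's coefficient vector; for example, the education slope for the wc group (a small illustrative check):
b <- coef(reg4)
b["education"] + b["typewc:education"]   # 2.34 + 3.64 = about 5.98, matching the table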
For further information concerning categorical variables and interaction terms, you can also look at: https://www.andrew.cmu.edu/user/achoulde/94842/lectures/lecture10/lecture10-94842.html
It is a good habit to check the assumptions of linear regression to see whether the model is appropriate. We will use the raw (untransformed) variables to construct a model and then test the assumptions. Note that, unless otherwise stated, these commands come from the 'car' package.
It is important to check whether the normality assumption (normally distributed residuals) holds for the model.
model1 <- lm(prestige ~ education + income + type, data = Prestige)
qqPlot(model1)
We can see from the plot that the residuals are a bit skewed at the tails.
We can also test the normality of individual variables. For example, we can use the built-in Shapiro-Wilk test:
shapiro.test(Prestige$income)

	Shapiro-Wilk normality test

data:  Prestige$income
W = 0.81505, p-value = 5.634e-10
The p-value is very small, indicating that the income variable likely does not satisfy the normality assumption. It may be better to transform it.
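If we wanted to check whether the log transformation helps, we could simply re-run the test on the transformed variable (output not shown here):
shapiro.test(log(Prestige$income))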
We can use the ncvTest() function to test for equal variances. We can also consult the residual plots, discussed later, to judge whether the model meets the homoscedasticity assumption.
model1 <- lm(prestige ~ education + income + type, data = Prestige)
ncvTest(model1)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 0.09830307, Df = 1, p = 0.75388
The p-value here is large, so we fail to reject the null hypothesis of constant variance; this model does not appear to have a problem with unequal variances.
We also want to check for multicollinearity, i.e., whether some of the predictors are highly correlated with each other.
“When there are strong linear relationships among the predictors in a regression analysis, the precision of the estimated regression coefficients in linear models declines compared to what it would have been were the predictors uncorrelated with each other” (Fox:359)
model1 <- lm(prestige ~ education + income + type, data = Prestige)
vif(model1)
                GVIF Df GVIF^(1/(2*Df))
education   5.973932  1        2.444163
income      1.681325  1        1.296659
type        6.102131  2        1.571703
A common rule of thumb is that a GVIF greater than 4 suggests problematic collinearity.
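As a complementary and simpler check, we can also look at the pairwise correlation between the numeric predictors in model1 (a quick sketch):
cor(Prestige[, c("education", "income")])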
We can also examine the residual plots.
# residual plots
model1 <- lm(prestige ~ education + income + type, data = Prestige)
residualPlots(model1)
model1a <- lm(prestige ~ education + log(income) + type, data = Prestige)
residualPlots(model1a)
Comparing the residual plots of the two models suggests it is better to log-transform income first.
Identifying outliers is also important.
We can first look at the QQ plot.
model1 <- lm(prestige ~ education + income + type, data = Prestige)
qqPlot(model1, id.n = 2)
We can also run statistical tests on the data.
model1 <- lm(prestige ~ education + income + type, data = Prestige)
outlierTest(model1)
No Studentized residuals with Bonferroni p < 0.05
Largest |rstudent|:
                    rstudent unadjusted p-value Bonferroni p
medical.technicians 2.821091          0.0058632      0.57459
The null hypothesis of the Bonferroni-adjusted outlier test is that the observation is not an outlier. Here no observation is flagged at the 0.05 level: 'medical.technicians' has the largest studentized residual, but its Bonferroni p-value (0.57) is far from significant, so it is not a statistically significant outlier.
We can also look at influential / high-leverage points.
# influence plots
influencePlot(model1)
influencePlot() produces a "bubble" plot of studentized residuals against hat-values, with the size of each circle proportional to the observation's Cook's distance.
For high-leverage points we can also use an index plot:
influenceIndexPlot(model1, id.n=3)
If an observation is both an outlier and influential (high leverage), it can change the fit of the linear model, so it may be advisable to remove it. We can remove such observations in this way:
model1a <- update(model1, subset = rownames(Prestige) != "general.managers")
model1b <- update(model1, subset = !(rownames(Prestige) %in% c("general.managers", "medical.technicians")))
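After dropping influential observations, it is worth checking how much the coefficients actually change; car's compareCoefs() prints the models side by side (output not shown):
compareCoefs(model1, model1a, model1b)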
If you have questions or comments about this guide or method, please email data@Princeton.edu.