
Linear Regression in R: A Hands-on Tutorial


1. Linear Regression Models

1.1 Concept heads up

Linear regression is one of the most common research methods for examining the linear relationship between a dependent variable Y and one or more independent variables X. In its simplest form the model is Y = β0 + β1·X + ε, where β0 is the intercept, β1 the slope, and ε the error term.

 

1.2 Sample data

Getting sample data.

#install.packages("car") # if the car package is not yet installed, install it first
library(car) # we have to load the package in every new R session

We are using the dataset "Prestige"* from the car package for this guide. We can use the help command to access the codebook:

help("Prestige")
View(Prestige)

*Fox, J. and Weisberg, S. (2019) An R Companion to Applied Regression, Third Edition, Sage.

1.3 Simple Linear regression models

In the simple linear regression example we have only one dependent variable (prestige) and one independent variable (education).

reg1<-lm(prestige ~ education, data = Prestige) #We put the dependent variable to the left of the '~' 
#and the independent variable(s) to the right and we tell R which dataset we are referring to.
summary(reg1)
Call:
lm(formula = prestige ~ education, data = Prestige)

Residuals:
     Min       1Q   Median       3Q      Max 
-26.0397  -6.5228   0.6611   6.7430  18.1636 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -10.732      3.677  -2.919  0.00434 ** 
education      5.361      0.332  16.148  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.103 on 100 degrees of freedom
Multiple R-squared:  0.7228,    Adjusted R-squared:   0.72 
F-statistic: 260.8 on 1 and 100 DF,  p-value: < 2.2e-16

From the result of this simple (one-predictor) regression, we can see that average education years has a significant (small p-value; the general rule of thumb is < 0.05) and positive relationship (positive coefficient) with the prestige score. Specifically, one way of interpreting the result is: a one-unit increase in average education years is associated with an increase of about 5.361 points in the prestige score. R-squared shows the share of the variance of Y explained by X; in this case average education level explains 72.28% of the variance in prestige scores. Adjusted R-squared (72%) shows the same thing but is adjusted for the number of cases and the number of variables. When the number of variables is small and the number of cases is very large, adjusted R-squared is close to R-squared, and it provides a more honest measure of the association between X and Y.
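
If you want to pull these quantities out of the fitted model programmatically, the standard base-R accessors below are a convenient optional check:

coef(reg1)["education"]      # slope: about 5.36 prestige points per year of education
summary(reg1)$r.squared      # about 0.72, the share of variance in prestige explained
confint(reg1, "education")   # 95% confidence interval for the education slope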

1.4 Log transformation

It is often useful to transform a variable to its log form when running a regression. You can either transform the variable beforehand or do so directly in the formula.

reg2<-lm(prestige~education+log(income)+women, data=Prestige)
summary(reg2)
Call:
lm(formula = prestige ~ education + log(income) + women, data = Prestige)

Residuals:
    Min      1Q  Median      3Q     Max 
-17.364  -4.429  -0.101   4.316  19.179 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -110.9658    14.8429  -7.476 3.27e-11 ***
education      3.7305     0.3544  10.527  < 2e-16 ***
log(income)   13.4382     1.9138   7.022 2.90e-10 ***
women          0.0469     0.0299   1.568     0.12    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.093 on 98 degrees of freedom
Multiple R-squared:  0.8351,    Adjusted R-squared:   0.83 
F-statistic: 165.4 on 3 and 98 DF,  p-value: < 2.2e-16

The interpretation of the untransformed variables is similar to the previous model. The interpretation of the log-transformed variable is a bit different: to interpret the slope coefficient we divide it by 100 (13.4382/100 = 0.134382). This tells us that a 1% increase in income is associated with an increase of about 0.13 points in the prestige score.
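
As a quick worked check of this rule of thumb (using the coefficient from the output above), we can compare the exact effect of a 10% increase in income with the approximation:

13.4382 * log(1.10)   # exact effect of a 10% income increase: about 1.28 points
13.4382 / 100 * 10    # rule-of-thumb approximation: about 1.34 points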

Note that in some cases a variable contains the value 0, so we cannot log-transform it directly. One common fix is to add 1 to every observation and then transform (see the sketch below). For other approaches, see: https://aosmith.rbind.io/2018/09/19/the-log-0-problem/.
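
A minimal sketch of the add-one approach (income has no zeros in Prestige, so this is purely for illustration):

# illustration only: shift by 1 inside the formula before taking the log
reg2_shift <- lm(prestige ~ education + log(income + 1) + women, data = Prestige)
# log1p(x) computes log(1 + x) and gives the same fit
reg2_log1p <- lm(prestige ~ education + log1p(income) + women, data = Prestige)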

1.5 Predicted values and Residuals

After running the regression model, we can extract the model's predicted (fitted) values and the residuals, i.e., the differences between the observed values and the predictions.

# first we extract the fitted (predicted) values
prestige_pred <- fitted(reg2)
prestige_pred <- as.data.frame(prestige_pred)

# now we add the residuals too
prestige_resid <- residuals(reg2)
prestige_pred$resid <- prestige_resid
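
To compare the predictions with the observed prestige scores, we can add the observed values to the same data frame and plot them against each other (the column name "observed" is our own choice; reg2 uses all rows of Prestige, so the fitted values line up with the original data):

# add the observed prestige scores
prestige_pred$observed <- Prestige$prestige
head(prestige_pred)

# quick visual check: observed vs. fitted values, with a 45-degree reference line
plot(prestige_pred$prestige_pred, prestige_pred$observed,
     xlab = "Fitted prestige", ylab = "Observed prestige")
abline(0, 1)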

1.6 Robust regression

We can also run the regression with heteroskedasticity-robust standard errors (controlling for heteroskedasticity, i.e., unequal error variances).

library(lmtest)
library(sandwich)
reg2$robse <- vcovHC(reg2, type="HC1")
coeftest(reg2,reg2$robse) 

t test of coefficients:

               Estimate  Std. Error t value  Pr(>|t|)    
(Intercept) -110.965824   15.275221 -7.2644 9.074e-11 ***
education      3.730508    0.388808  9.5947 9.176e-16 ***
log(income)   13.438223    1.994275  6.7384 1.107e-09 ***
women          0.046895    0.031484  1.4895    0.1396    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Similarly, we can obtain cluster-robust standard errors. Note that this requires first computing and storing a cluster-robust covariance matrix (reg2$clse below); since that step is not shown here, coeftest() falls back to the conventional covariance matrix in the following output. A sketch of how to build such a matrix follows the output.

#cluster-robust standard errors
coeftest(reg2, reg2$clse)

t test of coefficients:

               Estimate  Std. Error t value  Pr(>|t|)    
(Intercept) -110.965824   14.842928 -7.4760 3.270e-11 ***
education      3.730508    0.354383 10.5268 < 2.2e-16 ***
log(income)   13.438223    1.913757  7.0219 2.896e-10 ***
women          0.046895    0.029899  1.5685      0.12    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
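
One way to build a cluster-robust covariance matrix is with vcovCL() from the sandwich package. Clustering by occupation type is purely an illustrative choice here, and we refit the model on the rows where type is observed so the cluster variable lines up with the fitted model:

# illustrative cluster-robust standard errors, clustering on occupation type
Prestige_cc <- Prestige[!is.na(Prestige$type), ]
reg2_cc <- lm(prestige ~ education + log(income) + women, data = Prestige_cc)
reg2_cc$clse <- vcovCL(reg2_cc, cluster = Prestige_cc$type)
coeftest(reg2_cc, reg2_cc$clse)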

1.7 Regression with Categorical variables

Categorical variables play a big part in linear regression too. Note that this guide only covers the case where the independent variables are categorical. When the dependent variable is categorical, you may consider alternatives to linear regression such as logit regression or multinomial regression.

reg3<-lm(prestige~education+log(income)+type,data=Prestige)
summary(reg3)

Call:
lm(formula = prestige ~ education + log(income) + type, data = Prestige)

Residuals:
    Min      1Q  Median      3Q     Max 
-13.511  -3.746   1.011   4.356  18.438 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -82.6413    13.7875  -5.994 3.86e-08 ***
education     3.2845     0.6081   5.401 5.06e-07 ***
log(income)  10.4875     1.7167   6.109 2.31e-08 ***
typebc        1.4394     2.3780   0.605   0.5465    
typeprof      8.1903     2.5882   3.165   0.0021 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.637 on 93 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared:  0.8555,    Adjusted R-squared:  0.8493 
F-statistic: 137.6 on 4 and 93 DF,  p-value: < 2.2e-16

Here “type” is a categorical (factor) variable with three levels: bc (blue collar), prof (professional, managerial, and technical) and wc (white collar). R automatically recognizes it as a factor and treats it accordingly. The level missing from the coefficient table (wc here) is treated as the baseline, so its dummy coefficient is implicitly 0. Remember to add the intercept when calculating the estimate for the baseline group.
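
For example, using the estimates above, the implied intercept for each occupation type is the model intercept plus that type's dummy coefficient:

# implied group-specific intercepts (values copied from the summary above)
-82.6413 + 0        # wc (reference group)
-82.6413 + 1.4394   # bc   -> about -81.20
-82.6413 + 8.1903   # prof -> about -74.45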

Here 'wc' is the reference group. We can change that by manually reordering the factor levels.

#change the reference group to 'bc'
Prestige$type <- with(Prestige, factor(type, levels = c("bc","wc","prof")))

reg3_reorder<-lm(prestige~education+log(income)+type,data=Prestige)
summary(reg3_reorder)
Call:
lm(formula = prestige ~ education + log(income) + type, data = Prestige)

Residuals:
    Min      1Q  Median      3Q     Max 
-13.511  -3.746   1.011   4.356  18.438 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -81.2019    13.7431  -5.909 5.63e-08 ***
education     3.2845     0.6081   5.401 5.06e-07 ***
log(income)  10.4875     1.7167   6.109 2.31e-08 ***
typewc       -1.4394     2.3780  -0.605   0.5465    
typeprof      6.7509     3.6185   1.866   0.0652 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.637 on 93 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared:  0.8555,	Adjusted R-squared:  0.8493 
F-statistic: 137.6 on 4 and 93 DF,  p-value: < 2.2e-16

We can see that the reference group has changed. We can also make all three categories show their own estimates.

reg3_showall<-lm(prestige~0+education+log(income)+type,data=Prestige)
summary(reg3_showall)
Call:
lm(formula = prestige ~ 0 + education + log(income) + type, data = Prestige)

Residuals:
    Min      1Q  Median      3Q     Max 
-13.511  -3.746   1.011   4.356  18.438 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
education     3.2845     0.6081   5.401 5.06e-07 ***
log(income)  10.4875     1.7167   6.109 2.31e-08 ***
typebc      -81.2019    13.7431  -5.909 5.63e-08 ***
typewc      -82.6413    13.7875  -5.994 3.86e-08 ***
typeprof    -74.4510    15.1175  -4.925 3.65e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.637 on 93 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared:  0.9835,    Adjusted R-squared:  0.9826 
F-statistic:  1107 on 5 and 93 DF,  p-value: < 2.2e-16

We can see that in this specification the intercept is suppressed (the '0 +' in the formula), so every level of type gets its own estimate, which acts as that group's intercept.

1.8 Categorical variables with interaction terms

Sometimes we are interested in the interaction of two variables. Here is an example of interaction terms involving a categorical variable.

reg4 <- lm(prestige ~ type*(education +log(income)), data = Prestige)

#These are two other ways of running the same model
reg4a <- lm(prestige ~ education + log(income) +
              type + log(income):type + education:type,
            data = Prestige)
reg4b <- lm(prestige ~ education*type +
              log(income)*type, data = Prestige)
summary(reg4)

Call:
lm(formula = prestige ~ type * (education + log(income)), data = Prestige)

Residuals:
    Min      1Q  Median      3Q     Max 
-13.970  -4.124   1.206   3.829  18.059 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)          -120.0459    20.1576  -5.955 5.07e-08 ***
typewc                 30.2412    37.9788   0.796  0.42800    
typeprof               85.1601    31.1810   2.731  0.00761 ** 
education               2.3357     0.9277   2.518  0.01360 *  
log(income)            15.9825     2.6059   6.133 2.32e-08 ***
typewc:education        3.6400     1.7589   2.069  0.04140 *  
typeprof:education      0.6974     1.2895   0.541  0.58998    
typewc:log(income)     -8.1556     4.4029  -1.852  0.06730 .  
typeprof:log(income)   -9.4288     3.7751  -2.498  0.01434 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.409 on 89 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared:  0.871,    Adjusted R-squared:  0.8595 
F-statistic: 75.15 on 8 and 89 DF,  p-value: < 2.2e-16

We can use a table to illustrate the coefficient of each variable and each type.

Interpretation of the categorical variable and interaction terms:

                 bc          wc                          prof
intercept     -120.05     -120.05 + 30.24 = -89.81    -120.05 + 85.16 = -34.89
log(income)     15.98       15.98 - 8.16  =   7.82      15.98 - 9.43  =   6.55
education        2.34        2.34 + 3.64  =   5.98       2.34 + 0.697 =   3.037
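
The same numbers can also be recovered from the fitted model rather than typed by hand; this short sketch pulls out the education slope for each type from the coefficient vector (coefficient names taken from the summary above):

b <- coef(reg4)
c(bc   = unname(b["education"]),
  wc   = unname(b["education"] + b["typewc:education"]),
  prof = unname(b["education"] + b["typeprof:education"]))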

For further information concerning categorical variables and interaction terms, you can also look at: https://www.andrew.cmu.edu/user/achoulde/94842/lectures/lecture10/lecture10-94842.html

2. Assumption Diagnostics and Regression Troubleshooting

It's a good habit to check the assumptions of linear regression to see whether the model is sound. We will use the raw (untransformed) variables to construct a model and then test the assumptions. Note that these commands, unless stated otherwise, are from the 'car' package.

 

2.1 Normality Check

It's important to check whether the normality assumption holds for the model.

model1 <- lm(prestige ~ education + income + type, data = Prestige)
qqPlot(model1)

We can see from the plot that the residuals are somewhat skewed in the tails.

We can also test the normality of individual variables. For example, we can use the built-in Shapiro-Wilk test:

shapiro.test(Prestige$income)

    Shapiro-Wilk normality test

data:  Prestige$income
W = 0.81505, p-value = 5.634e-10

The p-value is very small, indicating that the income variable likely does not satisfy the normality assumption. It may be better to transform it.
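
We can re-run the test on the log-transformed variable to see whether the transformation helps (output not shown here):

shapiro.test(log(Prestige$income))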

2.2 Heteroskedasticity Check

We can use the ncvTest() function to test for equal variances. We can also consult the residual plots, discussed later, to check whether the model meets the homoscedasticity assumption.

model1 <- lm(prestige ~ education + income + type, data = Prestige)
ncvTest(model1)

Non-constant Variance Score Test 
Variance formula: ~ fitted.values 
Chisquare = 0.09830307, Df = 1, p = 0.75388

The p-value here is large, indicating that this model does not show evidence of unequal variances.
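
An alternative check is the Breusch-Pagan test from the lmtest package (loaded earlier for the robust standard errors):

bptest(model1)   # small p-values would suggest heteroskedasticity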

2.3 Multicollinearity Check

We want to know whether we have too many variables that have high correlation with each other. 

“When there are strong linear relationships among the predictors in a regression analysis, the precision of the estimated regression coefficients in linear models declines compared to what it would have been were the predictors uncorrelated with each other” (Fox:359)

model1 <- lm(prestige ~ education + income + type, data = Prestige)
vif(model1)

              GVIF Df GVIF^(1/(2*Df))
education 5.973932  1        2.444163
income    1.681325  1        1.296659
type      6.102131  2        1.571703

As a rule of thumb, a GVIF > 4 suggests collinearity.

2.4 Residual Plots

We can also examine residual plots.

#residual plot
model1<-lm(prestige ~ education + income + type,
           data = Prestige)
residualPlots(model1)

model1a<-lm(prestige ~ education + log(income) + type,
           data = Prestige)
residualPlots(model1a)

Comparing the two sets of residual plots, the model with log-transformed income fits better, so it is preferable to log-transform income first.

2.5 Outliers

Outlier identification is also important.

We can first look at the QQ plot.

model1 <- lm(prestige ~ education + income + type, data = Prestige)
qqPlot(model1,id.n=2)

We can also run statistical tests on the data.

model1 <- lm(prestige ~ education + income + type, data = Prestige)
outlierTest(model1)

No Studentized residuals with Bonferroni p < 0.05
Largest |rstudent|:
                    rstudent unadjusted p-value Bonferroni p
medical.technicians 2.821091          0.0058632      0.57459

The null hypothesis of the Bonferroni-adjusted outlier test is that the observation is not an outlier. Here the observation with the largest studentized residual, ‘medical.technicians’, has a Bonferroni p-value well above 0.05, so the test does not flag any observation as an outlier.

2.6 Influential/High leverage points

We can look at the influential / high-leverage points.

#Influence Plots
influencePlot(model1)

influencePlot() produces a bubble plot of studentized residuals against hat-values, with the size of each circle proportional to that observation's Cook's distance.

For High leverage points:

influenceIndexPlot(model1, id.n=3)

If an observation is both an outlier and influential (high leverage), it can change the fit of the linear model, so it may be advisable to refit the model without it. We can remove such observations in this way:

model1a <- update(model1, subset = rownames(Prestige) != "general.managers")
model1b <- update(model1, subset = !(rownames(Prestige) %in% c("general.managers","medical.technicians")))
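
A quick way to see how much the fit changes is to compare the coefficients side by side (an optional, purely illustrative check):

round(cbind(full = coef(model1), dropped = coef(model1b)), 3)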

Reference list / Useful links

  • An R Companion to Applied Regression, Third Edition / John Fox , Sanford Weisberg, Sage Publications, 2019
  • Applied Econometrics with R / Christian Kleiber, Achim Zeileis, Springer, 2008
  • Applied Regression Analysis and Generalized Linear Models / John Fox, Sage, 2008
  • Categorical variables and interaction terms in linear regression, stratified regressions: https://www.andrew.cmu.edu/user/achoulde/94842/lectures/lecture10/lecture10-94842.html
  • Complex Surveys. A guide to Analysis Using R / Thomas Lumley, Wiley, 2010
  • Data analysis using regression and multilevel/hierarchical models / Andrew Gelman, Jennifer Hill. Cambridge ; New York : Cambridge University Press, 2007.
  • Data Manipulation with R / Phil Spector, Springer, 2008
  • Designing Social Inquiry: Scientific Inference in Qualitative Research / Gary King, Robert O. Keohane, Sidney Verba, Princeton University Press, 1994.
  • Econometric analysis / William H. Greene. 8th ed., Upper Saddle River, N.J. : Prentice Hall, 2018.
  • Introduction to econometrics / James H. Stock, Mark W. Watson. 3rd ed., Boston: Pearson Addison Wesley, 2011.
  • Introductory Statistics with R / Peter Dalgaard, Springer, 2008
  • Linear regression in R tutorial: https://www.datacamp.com/tutorial/linear-regression-R
  • R for Stata Users / Robert A. Muenchen, Joseph Hilbe, Springer, 2010
  • Statistical Analysis: an interdisciplinary introduction to univariate & multivariate methods / Sam Kachigan, New York : Radius Press, c1986
  • The log-0 problem: analysis strategies and options for choosing c in log(y + c): https://data.library.virginia.edu/interpreting-log-transformations-in-a-linear-model/
  • Unifying Political Methodology: The Likelihood Theory of Statistical Inference / Gary King, Cambridge University Press, 1989

Comments or Questions?

If you have questions or comments about this guide or method, please email data@Princeton.edu.
