
Linear Regression in Stata: A Hands-on Tutorial

Basic introduction to linear regression analysis, diagnostics and presentation (using Stata)


1. Running Linear Regression Models

1.1. Linear regression: an overview

We use regression to estimate the effect of a change in one variable on another (Stock and Watson, 2019, ch. 4).

When running a regression, we make two assumptions:

(1) there is a linear relationship between the variables (i.e., X and Y), and

(2) this relationship is additive (i.e., Y = β0 + β1X1 + β2X2 + … + βNXN + e).

Technically, linear regression estimates how much Y changes when X changes by one unit.

In Stata, use the regress command:

regress [dependent variable] [independent variable(s)]
regress y x

In a multiple regression setting, we type:

regress y x1 x2 x3 ...

Before running a regression, it is recommended to have a clear idea of what you are trying to estimate (i.e., your outcome and predictor variables).

A regression makes sense only if there is a sound theory behind it.

Example: Are SAT scores higher in states that spend more money on education, controlling for other factors?

– Outcome (Y) variable – SAT scores, variable csat in the dataset

– Predictor (X) variables

  • Per pupil expenditures primary & secondary (expense)

  • % HS graduates taking SAT (percent)

  • Median household income (income)

  • % adults with HS diploma (high)

  • % adults with a college degree (college)

  • Region (region)

1.2. Examining the variables first

It is recommended to first examine the variables in the model to understand the characteristics of the data. We use data from Hamilton (2006). To get the data, type:

use https://dss.princeton.edu/training/linreg1.dta

To get basic information/description about data and variables, type:

describe csat expense percent income high college region

Stata will provide the following table

. describe csat expense percent income high college region
              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------------
csat            int     %9.0g                 Mean composite SAT score
expense         int     %9.0g                 Per pupil expenditures prim&sec
percent         byte    %9.0g                 % HS graduates taking SAT
income          double  %10.0g                Median household income, $1,000
high            float   %9.0g                 % adults HS diploma
college         float   %9.0g                 % adults college degree
region          byte    %9.0g      region     Geographical region

 

To get the summary statistics of the variables, type:

summarize csat expense percent income high college region

Stata will provide the following table

. summarize csat expense percent income high college region
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        csat |         51     944.098    66.93497        832       1093
     expense |         51    5235.961    1401.155       2960       9259
     percent |         51    35.76471    26.19281          4         81
      income |         51    33.95657    6.423134     23.465     48.618
        high |         51    76.26078    5.588741       64.3       86.6
-------------+---------------------------------------------------------
     college |         51    20.02157     4.16578       12.3       33.3
      region |         50        2.54    1.128662          1          4

 

To check the correlation matrix of the variables we are interested in, type:

pwcorr csat expense percent income high college, star(0.05) sig

Stata will provide the following table

. pwcorr csat expense percent income high college, star(0.05) sig
             |     csat  expense  percent   income     high  college
-------------+------------------------------------------------------
        csat |   1.0000
             |
             |
     expense |  -0.4663*  1.0000
             |   0.0006
             |
     percent |  -0.8758*  0.6509*  1.0000
             |   0.0000   0.0000
             |
      income |  -0.4713*  0.6784*  0.6733*  1.0000
             |   0.0005   0.0000   0.0000
             |
        high |   0.0858   0.3133*  0.1413   0.5099*  1.0000
             |   0.5495   0.0252   0.3226   0.0001
             |
     college |  -0.3729*  0.6400*  0.6091*  0.7234*  0.5319*  1.0000
             |   0.0070   0.0000   0.0000   0.0000   0.0001

In the table, the top number in each cell is the Pearson correlation coefficient, which ranges from -1 to 1; values closer to ±1 indicate stronger correlation, and a negative sign indicates an inverse relationship (roughly, when one variable goes up the other goes down). The number below each coefficient is its p-value, and a star marks significance at the 0.05 level.

 

The graph matrix command produces a graphical counterpart of the correlation matrix: a series of scatter plots for all pairs of variables. Type:

graph matrix csat expense percent income high college, half

1.3. Running simple linear regression models

To run a simple linear regression model, with one dependent variable and one independent variable, type:

regress csat expense

Here, csat is the outcome variable and expense is the predictor variable.

Stata will give us the following output table.

. regress csat expense
      Source |       SS           df       MS      Number of obs   =        51
-------------+----------------------------------   F(1, 49)        =     13.61
       Model |  48708.3001         1  48708.3001   Prob > F        =    0.0006
    Residual |   175306.21        49  3577.67775   R-squared       =    0.2174
-------------+----------------------------------   Adj R-squared   =    0.2015
       Total |   224014.51        50   4480.2902   Root MSE        =    59.814
------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     expense |  -.0222756   .0060371    -3.69   0.001    -.0344077   -.0101436
       _cons |   1060.732    32.7009    32.44   0.000     995.0175    1126.447
------------------------------------------------------------------------------

Interpretation of the outputs:

  •  Prob > F = 0.0006 : This is the p-value of the model. It tests the null hypothesis that the R-square is equal to 0. To reject the null hypothesis, usually we need a p-value lower than 0.05. Here, the p-value of 0.0006 indicates a statistically significant relationship between X and Y.
  •  R-squared = 0.2174   : R-square shows the amount of variance of Y explained by X. In this case expense explains 22% of the variance in SAT scores.
  •  Adj R-squared = 0.2015 : Adjusted R-square shows the same as R-square but adjusts for the number of cases and the number of variables. When the number of variables is small and the number of cases is very large, Adj R-square is closer to R-square. This provides a more honest measure of the association between X and Y.
  • Root MSE = 59.814 : the root mean squared error is the standard deviation of the residuals. The closer to zero, the better the fit.
  • The estimated coefficient for expense is -.0222756. This means that for each one-unit increase in expense, SAT scores decrease by 0.022 points.
  • The t-values test the null hypothesis that each coefficient is 0. To reject this at 95% confidence, you need a t-value whose absolute value is greater than about 1.96. You can get the t-value by dividing the coefficient by its standard error (see the snippet after this list). The t-values also indicate the relative importance of a variable in the model.
  •  P>|t| = 0.001 : The two-tailed p-value tests the null hypothesis that the coefficient is equal to 0 (i.e., no significant effect). To reject this, the p-value has to be lower than 0.05 (you could also choose an alpha of 0.10). In this case, expense is statistically significant in explaining SAT.
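For example, right after the simple regression you can reproduce the t-value from Stata's stored results:

regress csat expense
display _b[expense]/_se[expense]   // about -3.69, matching the output above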

 

To run a multiple linear regression model, with one dependent variable and two or more independent variables, type:

regress csat expense percent income high college

Here, csat is the outcome variable and expense, percent, income, high, and college are the predictor variables.

Stata will give us the following output table.

. regress csat expense percent income high college
      Source |       SS           df       MS      Number of obs   =        51
-------------+----------------------------------   F(5, 45)        =     42.23
       Model |  184663.309         5  36932.6617   Prob > F        =    0.0000
    Residual |  39351.2012        45  874.471137   R-squared       =    0.8243
-------------+----------------------------------   Adj R-squared   =    0.8048
       Total |   224014.51        50   4480.2902   Root MSE        =    29.571
------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     expense |   .0033528   .0044709     0.75   0.457     -.005652    .0123576
     percent |  -2.618177   .2538491   -10.31   0.000    -3.129455   -2.106898
      income |   .1055853   1.166094     0.09   0.928    -2.243048    2.454218
        high |   1.630841    .992247     1.64   0.107     -.367647    3.629329
     college |   2.030894   1.660118     1.22   0.228    -1.312756    5.374544
       _cons |   851.5649   59.29228    14.36   0.000     732.1441    970.9857
------------------------------------------------------------------------------

Interpretation of the outputs:

  • Prob > F = 0.0000 : This is the p-value of the model. It indicates the reliability of X to predict Y. Usually we need a p-value lower than 0.05 to show a statistically significant relationship between X and Y.
  • R-squared = 0.8243 : R-square shows the amount of variance of Y explained by X. In this case the model explains 82.43% of the variance in SAT scores.
  • Adj R-squared = 0.8048 : Adjusted R-square shows the same as R-square but adjusts for the number of cases and the number of variables; when the number of variables is small and the number of cases is very large, Adj R-square is closer to R-square. This provides a more honest measure of the association between X and Y.
  • Root MSE = 29.571 : the root mean squared error is the standard deviation of the residuals. The closer to zero, the better the fit.
  • The estimated coefficient for expense is .0033528. This means that for each one-unit increase in expense, SAT scores increase by 0.003 points, holding all other variables constant. However, this increase is not statistically significant, as the p-value is not < 0.05. The estimated coefficient for percent is -2.618177: for each one-point increase in percent, SAT scores decrease significantly by 2.62 points, holding all other variables constant.

            Interpretations of the other estimated coefficients follow the same logic.

  • The t-values test the null hypothesis that each coefficient is 0. To reject this at 95% confidence, you need a t-value whose absolute value is greater than about 1.96. You can get the t-values by dividing each coefficient by its standard error. The t-values also indicate the relative importance of a variable in the model; in this case, percent is the most important.
  • Two-tailed p-values test the null hypothesis that each coefficient is 0. To reject this, the p-value has to be lower than 0.05 (you could also choose an alpha of 0.10). In this case, expense, income, high, and college are not statistically significant in explaining SAT; percent is the only variable that has a significant impact on SAT (i.e., its coefficient is significantly different from 0).

 

Plotting the predicted values against observed values

One way to assess the model is to check how well it predicts Y.

We can generate the predicted values of Y (usually called Yhat) given the model by using predict immediately after running the regression. Type:

regress csat expense percent income high college
predict csat_predict
label variable csat_predict "csat predicted"

Running the code above creates a new variable in your dataset named csat_predict.

For a quick assessment of the model, run a scatter plot of csat against csat_predict:

scatter csat csat_predict

We should expect the points to line up along a 45-degree line: the y-axis shows the observed data and the x-axis the predicted values (Yhat).
In this case, the model seems to do a good job of predicting csat.
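To make the visual check easier, you can overlay a 45-degree reference line; a minimal sketch, with the range 800 to 1100 chosen to roughly match the observed span of csat in this dataset:

twoway (scatter csat csat_predict) (function y = x, range(800 1100)), legend(off)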

1.4. Regression output table for more than one model

To show the outputs for more than one model in a single table side-by-side, you can use the commands eststo and esttab:

regress csat expense
eststo model1
regress csat expense percent income high college
eststo model2
xi: regress csat expense percent income high college i.region
eststo model3
esttab, r2 ar2 se scalar(rmse)

Stata will give us the following output table.

. esttab, r2 ar2 se scalar(rmse)
------------------------------------------------------------
                      (1)             (2)             (3)   
                     csat            csat            csat   
------------------------------------------------------------
expense           -0.0223***      0.00335        -0.00202   
                (0.00604)       (0.00447)       (0.00424)   
percent                            -2.618***       -3.008***
                                  (0.254)         (0.233)   
income                              0.106          -0.167   
                                  (1.166)         (1.036)   
high                                1.631           1.815   
                                  (0.992)         (1.185)   
college                             2.031           4.671** 
                                  (1.660)         (1.708)   
_Iregion_2                                          69.45***
                                                  (14.95)   
_Iregion_3                                          25.40   
                                                  (13.32)   
_Iregion_4                                          34.58***
                                                  (9.537)   
_cons              1060.7***        851.6***        808.0***
                  (32.70)         (59.29)         (79.79)   
------------------------------------------------------------
N                      51              51              50   
R-sq                0.217           0.824           0.911   
adj. R-sq           0.201           0.805           0.894   
rmse                59.81           29.57           21.49   
------------------------------------------------------------
Standard errors in parentheses
* p<0.05, ** p<0.01, *** p<0.001
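Note: eststo and esttab are part of the user-written estout package. If Stata does not recognize these commands, install the package first:

ssc install estout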

 

1.5. Transferring regression outputs to word or excel file

The command outreg2 lets us export regression output to a Word or Excel file. To use it, we first have to install the user-written outreg2 package. To install it, type:

ssc install outreg2

To export the regression output of one model to a Word file, type:

regress csat expense
outreg2 using myreg.doc, replace ctitle(Model 1)

Stata will give us the following outputs 

. outreg2 using myreg.doc, replace ctitle(Model 1)
myreg.doc
dir : seeout

Windows users: click on myreg.doc to open the file in Word (you can replace this name with your own). Otherwise, follow the Mac instructions.

Mac users: click on dir to go to the directory where myreg.doc is saved, and open it with Word (you can replace this name with your own)

The output in the Word document looks as follows.

                    (1)
VARIABLES       Model 1
expense      -0.0223***
              (0.00604)
Constant       1,061***
                (32.70)
Observations         51
R-squared         0.217

Standard errors in parentheses
*** p<0.01, ** p<0.05, * p<0.1

 

We can add more models (e.g., Model 2, Model 3) to the Word document by using the option append (NOTE: make sure myreg.doc is closed before running these commands):

regress csat expense percent
outreg2 using myreg.doc, append ctitle(Model 2)
regress csat expense percent income high college
outreg2 using myreg.doc, append ctitle(Model 3)

Stata will give us the following outputs 

. outreg2 using myreg.doc, append ctitle(Model 3)
myreg.doc
dir : seeout

As before, click on myreg.doc (Windows) or on dir (Mac) to open the file in Word.

The outputs in the Word document look as follows.

 

                    (1)          (2)          (3)
VARIABLES       Model 1      Model 2      Model 3
expense      -0.0223***    0.00860**      0.00335
              (0.00604)    (0.00420)    (0.00420)
percent                    -2.538***    -2.618***
                             (0.225)      (0.254)
income                                      0.106
                                          (1.166)
high                                        1.631
                                          (0.992)
college                                     2.031
                                          (1.660)
Constant       1,061***     989.8***     851.6***
                (32.70)      (18.40)      (59.29)
Observations         51           51           51
R-squared         0.217        0.786        0.824

Standard errors in parentheses
*** p<0.01, ** p<0.05, * p<0.1

You also have the option to export the outputs to Excel. Use the extension *.xls.
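For example, a minimal sketch (the file name is arbitrary):

regress csat expense
outreg2 using myreg.xls, replace ctitle(Model 1)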

1.6. Robust regression

We run a regression with robust standard errors to account for heteroskedasticity. By default, Stata assumes homoskedastic standard errors, so if the error variance is heteroskedastic, we need to adjust for it by adding the robust option to the regress command. Type:

regress csat expense percent income high college, robust

Stata will give us the following output. Notice that we now have Robust Std. Err. instead of Std. Err.

. regress csat expense percent income high college, robust
Linear regression                               Number of obs     =         51
                                                F(5, 45)          =      50.90
                                                Prob > F          =     0.0000
                                                R-squared         =     0.8243
                                                Root MSE          =     29.571
------------------------------------------------------------------------------
             |               Robust
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     expense |   .0033528    .004781     0.70   0.487    -.0062766    .0129823
     percent |  -2.618177   .2288594   -11.44   0.000    -3.079123    -2.15723
      income |   .1055853   1.207246     0.09   0.931    -2.325933    2.537104
        high |   1.630841    .943318     1.73   0.091    -.2690989    3.530781
     college |   2.030894   2.113792     0.96   0.342    -2.226502     6.28829
       _cons |   851.5649   57.28743    14.86   0.000     736.1821    966.9477
------------------------------------------------------------------------------

1.7. Regression with dummy/categorical variables

When we add a categorical variable to a regression, we need to add n-1 dummy variables, where n is the number of categories in the variable.
In the example below, the variable industry has twelve categories (type tab industry, or tab industry, nolabel, to see them).

The easiest way to include a set of dummies in a regression is to use the prefix "i.". By default, the first category (or lowest value) is used as the reference. For example:

sysuse nlsw88.dta  // a built-in Stata dataset we will use for this example
tab industry
tab industry, nolabel
reg wage hours i.industry

Stata will give us the following regression output

. reg wage hours i.industry
      Source |       SS           df       MS      Number of obs   =     2,228
-------------+----------------------------------   F(12, 2215)     =     16.05
       Model |  5922.54753        12  493.545628   Prob > F        =    0.0000
    Residual |  68114.0215     2,215  30.7512512   R-squared       =    0.0800
-------------+----------------------------------   Adj R-squared   =    0.0750
       Total |   74036.569     2,227  33.2449794   Root MSE        =    5.5454
------------------------------------------------------------------------------------------
                    wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------------------+----------------------------------------------------------------
                   hours |   .0723658   .0114787     6.30   0.000     .0498557    .0948759
                         |
                industry |
                 Mining  |   9.328331   3.082327     3.03   0.003     3.283779    15.37288
           Construction  |   1.858089   1.693951     1.10   0.273    -1.463809    5.179987
          Manufacturing  |   1.415641   1.377724     1.03   0.304    -1.286126    4.117407
 Transport/Comm/Utility  |   5.432544   1.467787     3.70   0.000     2.554162    8.310926
 Wholesale/Retail Trade  |   .4583809   1.378985     0.33   0.740    -2.245859     3.16262
Finance/Ins/Real Estate  |    3.92933      1.404     2.80   0.005     1.176036    6.682624
    Business/Repair Svc  |   1.990151   1.471971     1.35   0.177    -.8964373    4.876738
      Personal Services  |  -1.018771   1.459441    -0.70   0.485    -3.880786    1.843244
  Entertainment/Rec Svc  |   1.111801    1.90205     0.58   0.559    -2.618187     4.84179
  Professional Services  |   2.094988   1.359033     1.54   0.123    -.5701247    4.760101
  Public Administration  |   3.232405   1.409187     2.29   0.022      .468939    5.995871
                         |
                   _cons |   3.126629   1.401948     2.23   0.026     .3773593    5.875898
------------------------------------------------------------------------------------------

- To include all categories, suppressing the constant, type:

reg wage hours bn.industry, robust hascons

- To change the reference category to “Professional services” (category number 11) instead of “Ag/Forestry/Fisheries” (category number 1), use the prefix “ib#.”, where “#” is the number of the reference category you want to use; in this case, 11.

clear
sysuse nlsw88.dta
reg wage hours ib11.industry

Stata will give us the following regression output

. reg wage hours ib11.industry
      Source |       SS           df       MS      Number of obs   =     2,228
-------------+----------------------------------   F(12, 2215)     =     16.05
       Model |  5922.54753        12  493.545628   Prob > F        =    0.0000
    Residual |  68114.0215     2,215  30.7512512   R-squared       =    0.0800
-------------+----------------------------------   Adj R-squared   =    0.0750
       Total |   74036.569     2,227  33.2449794   Root MSE        =    5.5454
------------------------------------------------------------------------------------------
                    wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------------------+----------------------------------------------------------------
                   hours |   .0723658   .0114787     6.30   0.000     .0498557    .0948759
                         |
                industry |
  Ag/Forestry/Fisheries  |  -2.094988   1.359033    -1.54   0.123    -4.760101    .5701247
                 Mining  |   7.233343   2.779684     2.60   0.009     1.782284     12.6844
           Construction  |  -.2368991    1.04783    -0.23   0.821     -2.29173    1.817932
          Manufacturing  |  -.6793477    .351426    -1.93   0.053    -1.368507    .0098112
 Transport/Comm/Utility  |   3.337556   .6167569     5.41   0.000     2.128073    4.547038
 Wholesale/Retail Trade  |  -1.636607   .3609973    -4.53   0.000    -2.344536   -.9286788
Finance/Ins/Real Estate  |   1.834342   .4449714     4.12   0.000     .9617372    2.706947
    Business/Repair Svc  |  -.1048377   .6298078    -0.17   0.868    -1.339913    1.130238
      Personal Services  |  -3.113759   .6004595    -5.19   0.000    -4.291282   -1.936237
  Entertainment/Rec Svc  |   -.983187    1.35906    -0.72   0.469    -3.648352    1.681978
  Public Administration  |   1.137416   .4610575     2.47   0.014     .2332663    2.041567
                         |
                   _cons |   5.221617   .4637205    11.26   0.000     4.312244    6.130989
------------------------------------------------------------------------------------------

1.8. Regression: interaction between dummies

Interaction terms are needed whenever there is reason to believe that the effect of one independent variable depends on the value of another independent variable. We will explore here the interaction between two dummy (binary) variables. In the example below, the effect of the student-teacher ratio on test scores may depend on the percentage of English learners in the district*.

To upload the data in Stata, type:

use https://dss.princeton.edu/training/linreg2.dta

– Dependent variable (Y): Average test score, variable testscr in the dataset.

– Independent variables (X)

  • Binary hi_str, where ‘0’ if the student-teacher ratio (str) is lower than 20, ‘1’ if it is 20 or higher.

- In Stata, first generate hi_str = 0 if str<20. Then replace hi_str=1 if str>=20

  • Binary hi_el, where ‘0’ if English learners (el_pct) is lower than 10%, ‘1’ equal to 10% or higher

- In Stata, first generate hi_el = 0 if el_pct<10. Then replace hi_el=1 if el_pct>=10

  • Interaction term str_el = hi_str * hi_el. In Stata: generate str_el = hi_str*hi_el
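Putting the steps above together in Stata:

generate hi_str = 0 if str < 20
replace hi_str = 1 if str >= 20
generate hi_el = 0 if el_pct < 10
replace hi_el = 1 if el_pct >= 10
generate str_el = hi_str*hi_el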

We run the regression
regress testscr hi_el hi_str str_el, robust

Stata will give us the following outputs

. regress testscr hi_el hi_str str_el, robust
Linear regression                               Number of obs     =        420
                                                F(3, 416)         =      60.20
                                                Prob > F          =     0.0000
                                                R-squared         =     0.2956
                                                Root MSE          =     16.049
------------------------------------------------------------------------------
             |               Robust
     testscr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       hi_el |  -18.16295   2.345952    -7.74   0.000    -22.77435   -13.55155
      hi_str |  -1.907842   1.932215    -0.99   0.324    -5.705964    1.890279
      str_el |  -3.494335   3.121226    -1.12   0.264    -9.629677    2.641006
       _cons |   664.1433   1.388089   478.46   0.000     661.4147    666.8718
------------------------------------------------------------------------------

Interpretation:

From the above outputs we can write the following equation:

testscr_hat = 664.1 - 18.1*hi_el - 1.9*hi_str - 3.5*str_el

- The effect of hi_str on test scores is -1.9, but given the interaction term (and assuming all coefficients are significant), the net effect is -1.9 - 3.5*hi_el. If hi_el is 0, the effect is -1.9 (the hi_str coefficient), but if hi_el is 1, the effect is -1.9 - 3.5 = -5.4. In this case, the effect of the student-teacher ratio is more negative in districts where the percentage of English learners is higher.

- The average test score in districts where the student-teacher ratio is >= 20 and the share of English learners is >= 10% is 640.6. To calculate this number, plug 1 in for hi_el, hi_str, and str_el in the equation above (i.e., 664.1 - 18.1 - 1.9 - 3.5 = 640.6).

* We use the "California Test Score" data set (caschool.dta) used by Stock and Watson (2003).

1.9. Regression: interaction between a dummy and a continuous variable

First, upload the data:

clear
use https://dss.princeton.edu/training/linreg2.dta

– Dependent variable (Y): Average test score, variable testscr in the dataset.

– Independent variables (X)

  • Continuous str, student-teacher ratio.

  • Binary hi_el, where ‘0’ if English learners (el_pct) is lower than 10%, ‘1’ equal to 10% or higher.

- In Stata, first generate hi_el = 0 if el_pct<10. Then replace hi_el=1 if el_pct>=10

  • Interaction term str_el_dc = str * hi_el. In Stata: generate str_el_dc = str*hi_el
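As before, the variables can be created directly in Stata:

generate hi_el = 0 if el_pct < 10
replace hi_el = 1 if el_pct >= 10
generate str_el_dc = str*hi_el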

We run the regression
regress testscr str hi_el str_el_dc, robust

Stata will give us the following outputs

. regress testscr str hi_el str_el_dc, robust
Linear regression                               Number of obs     =        420
                                                F(3, 416)         =      63.67
                                                Prob > F          =     0.0000
                                                R-squared         =     0.3103
                                                Root MSE          =      15.88
------------------------------------------------------------------------------
             |               Robust
     testscr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         str |  -.9684601   .5891016    -1.64   0.101    -2.126447    .1895268
       hi_el |   5.639141   19.51456     0.29   0.773    -32.72029    43.99857
   str_el_dc |  -1.276613   .9669194    -1.32   0.187     -3.17727    .6240436
       _cons |   682.2458   11.86781    57.49   0.000     658.9175    705.5742
------------------------------------------------------------------------------

Interpretation:

From the above outputs we can write the following equation:

testscr_hat = 682.2 - 0.97*str + 5.6*hi_el - 1.28*str_el_dc

The effect of str on testscr depends on hi_el.

  • If hi_el is 0 (low), the regression line is testscr_hat = 682.2 - 0.97*str.
  • If hi_el is 1 (high), the regression line is testscr_hat = 682.2 - 0.97*str + 5.6 - 1.28*str = 687.8 - 2.25*str.

Notice how hi_el changes both the intercept and the slope of str. Reducing str by one unit in low-EL districts increases test scores by 0.97 points, but it has a larger impact (2.25 points) in high-EL districts. The difference between these two effects is 1.28, which is the coefficient of the interaction (Stock and Watson, 2003, p. 223).
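Both slopes can be recovered from the stored coefficients right after the regression:

regress testscr str hi_el str_el_dc, robust
display _b[str]                  // slope of str when hi_el = 0 (about -0.97)
display _b[str] + _b[str_el_dc]  // slope of str when hi_el = 1 (about -2.25)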

1.10. Regression: interaction between two continuous variables

First, upload the data:

clear
use https://dss.princeton.edu/training/linreg2.dta

– Dependent variable (Y): Average test score, variable testscr in the dataset.

– Independent variables (X)

  • Continuous str, student-teacher ratio.

  • Continuous el_pct, percent of English learners.

  • Interaction term str_el_cc = str * el_pct. In Stata: generate str_el_cc = str*el_pct

We run the regression
regress testscr str el_pct str_el_cc, robust

Stata will give us the following outputs

. regress testscr str el_pct str_el_cc, robust
Linear regression                               Number of obs     =        420
                                                F(3, 416)         =     155.05
                                                Prob > F          =     0.0000
                                                R-squared         =     0.4264
                                                Root MSE          =     14.482
------------------------------------------------------------------------------
             |               Robust
     testscr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         str |  -1.117018   .5875135    -1.90   0.058    -2.271884    .0378468
      el_pct |  -.6729116   .3741231    -1.80   0.073    -1.408319    .0624958
   str_el_cc |   .0011618   .0185357     0.06   0.950    -.0352736    .0375971
       _cons |   686.3385   11.75935    58.37   0.000     663.2234    709.4537
------------------------------------------------------------------------------

Interpretation:

From the above outputs we can write the following equation:

testscr_hat = 686.3 - 1.12*str - 0.67*el_pct + 0.0012*str_el_cc

The effect of the interaction term is very small. Following Stock & Watson (2003, p. 229), algebraically the slope of str is -1.12 + 0.0012*el_pct (remember that str_el_cc is equal to str*el_pct). So:

  • If el_pct = 10, the slope of str is -1.108

  • If el_pct = 20, the slope of str is -1.096. A difference in effect of 0.012 points.

In the continuous case there is an effect, but it is very small (and not statistically significant). See Stock and Watson (2003) for further details.
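As a check, the slope of str at any value of el_pct can be computed from the stored coefficients:

regress testscr str el_pct str_el_cc, robust
display _b[str] + _b[str_el_cc]*10   // slope of str when el_pct = 10 (about -1.11)
display _b[str] + _b[str_el_cc]*20   // slope of str when el_pct = 20 (about -1.09)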

2. Assumption Diagnostics and Regression Troubleshooting

We use data from Hamilton (2006) for all the analyses in section 2. To get the data, type:

use https://dss.princeton.edu/training/linreg1.dta

2.1. Exploring relationships between the dependent and independent variables

Let us first check the relationship between csat and percent

scatter csat percent

To check the relationship between csat and high, type:

scatter csat high

There seems to be a curvilinear relationship between csat and percent, and a roughly linear one between csat and high. To deal with U-shaped curves, we need to add a squared version of the variable, in this case percent squared:
generate percent2 = percent^2

2.2. Checking functional form/linearity

The command acprplot (augmented component-plus-residual plot) provides another graphical way to examine the relationship between variables and is a good check of linearity. Run it after running a regression.

regress csat percent high /* Notice we do not include percent2 */
acprplot percent, lowess
acprplot high, lowess


The option lowess (locally weighted scatterplot smoothing) draws the observed pattern in the data to help identify nonlinearities. Percent shows a quadratic relationship, so it makes sense to add a squared version of it. High also shows a polynomial pattern, but it stays close to the regression line (except on the right), so we can keep it as is for now.

The linearity-corrected model is:
regress csat percent percent2 high, robust

Stata will give us the following outputs.

. regress csat percent percent2 high, robust
Linear regression                               Number of obs     =         51
                                                F(3, 47)          =     160.90
                                                Prob > F          =     0.0000
                                                R-squared         =     0.9251
                                                Root MSE          =       18.9
------------------------------------------------------------------------------
             |               Robust
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     percent |  -6.520312   .4934097   -13.21   0.000    -7.512924   -5.527699
    percent2 |   .0536555   .0056491     9.50   0.000      .042291    .0650201
        high |   2.986509     .54564     5.47   0.000     1.888823    4.084195
       _cons |   844.8207    38.8214    21.76   0.000     766.7221    922.9192
------------------------------------------------------------------------------

2.3. Testing for homoskedasticity

An important assumption of the classical linear regression model is that the variance of the residuals is homoskedastic, that is, constant.

Graphical way to check homoskedasticity

When plotting residuals vs. predicted values (Yhat), we should not observe any pattern at all. In Stata, we use rvfplot right after running the regression; it automatically draws a scatter plot of residuals against predicted values. Type:

regress csat expense percent income high college i.region
rvfplot, yline(0)

Residuals seem to expand slightly at higher levels of Yhat.

A non-graphical way to test for heteroskedasticity is the Breusch-Pagan test. The null hypothesis is that the residuals are homoskedastic. In the example below, the p-value (0.2375) is above 0.05, so at the 95% level we fail to reject the null of constant variance.
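The test is run with estat hettest immediately after the regression:

regress csat expense percent income high college i.region
estat hettest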

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
         Ho: Constant variance
         Variables: fitted values of csat
         chi2(1)      =     1.40
         Prob > chi2  =   0.2375

The residual plot suggests the possible presence of heteroskedasticity in our model, even though the Breusch-Pagan test does not reject constant variance. The problem with heteroskedasticity is that we may have the wrong estimates of the standard errors for the coefficients and, therefore, of their t-values.

There are two ways to deal with this problem: one is using heteroskedasticity-robust standard errors, and the other is using weighted least squares (see Stock and Watson, 2003, chapter 15). WLS requires knowledge of the conditional variance on which the weights are based; if this is known (rarely the case), use WLS. In practice, it is recommended to use heteroskedasticity-robust standard errors.

By default, Stata assumes homoskedastic standard errors, so we need to adjust our model to account for heteroskedasticity. To do this, we use the robust option in the regress command. For example:

regress csat expense percent income high college i.region, robust

Note: Stock and Watson (2019, chapter 5) suggest, as a rule of thumb, we should always assume heteroskedasticity in our model and therefore run robust regression.

2.4. Testing for multicollinearity

An important assumption of the multiple regression model is that the independent variables are not perfectly multicollinear: one regressor should not be a linear function of another.
When multicollinearity is present in a model, standard errors may be inflated; under perfect multicollinearity, Stata will drop one of the variables to avoid a division by zero in the OLS procedure (see Stock and Watson, 2019, chapter 6).
The Stata command to check for multicollinearity is vif (variance inflation factor). Type:

regress csat expense percent income high college i.region
vif

. vif
    Variable |       VIF       1/VIF  
-------------+----------------------
     expense |      3.18    0.314656
     percent |      3.88    0.257790
      income |      4.78    0.209068
        high |      4.71    0.212167
     college |      4.34    0.230156
      region |
          2  |      3.57    0.279850
          3  |      4.18    0.239156
          4  |      1.80    0.556855
-------------+----------------------
    Mean VIF |      3.81

Rule of thumb: A VIF > 10 or a 1/VIF < 0.10 indicates multicollinearity in the model.

Based on this rule, we can say there is no multicollinearity in our model.

2.5. Testing for omitted-variable bias

How do we know we have included all necessary variables to explain Y?

Testing for omitted-variable bias is important since it relates to the assumption that the error term and the independent variables in the model are not correlated (E(e|X) = 0).

If a variable omitted from the model

  • “is correlated with the included regressor” and
  • “is a determinant of the dependent variable” (Stock and Watson, 2019, p. 170),

…then our regression coefficients are inconsistent.

In Stata, we test for omitted-variable bias using the ovtest command:

regress csat expense percent income high college i.region, robust
ovtest
. ovtest
Ramsey RESET test using powers of the fitted values of csat
       Ho:  model has no omitted variables
                  F(3, 38) =      2.15
                  Prob > F =      0.1096

The null hypothesis is that the model does not have omitted-variable bias. The p-value (0.1096) is higher than the usual threshold of 0.05, so we fail to reject the null and conclude that we do not need more variables.

 

Another command for testing the model specification is linktest. It checks whether we need more variables in our model by running a new regression of the observed Y (csat) against Yhat (csat_predicted, or Xβ) and Yhat-squared as independent variables.

The thing to look for here is the significance of _hatsq. The null hypothesis is that there is no specification error. If the p-value of _hatsq is not significant, then we fail to reject the null and conclude that our model is correctly specified. Type:

regress csat expense percent income high college i.region, robust
linktest

. linktest
      Source |       SS           df       MS      Number of obs   =        50
-------------+----------------------------------   F(2, 47)        =    247.76
       Model |  194512.244         2  97256.1222   Prob > F        =    0.0000
    Residual |  18449.1356        47    392.5348   R-squared       =    0.9134
-------------+----------------------------------   Adj R-squared   =    0.9097
       Total |   212961.38        49  4346.15061   Root MSE        =    19.812
------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |  -1.081591   1.866453    -0.58   0.565    -4.836411    2.673228
      _hatsq |   .0010976   .0009839     1.12   0.270    -.0008817     .003077
       _cons |   982.6278   881.8465     1.11   0.271    -791.4185    2756.674
------------------------------------------------------------------------------

The null hypothesis is that there is no specification error. The p-value of _hatsq (0.270) is higher than the usual threshold of 0.05, so we fail to reject the null and conclude that the model is correctly specified.

2.6. Checking for outliers

We use added-variable plots (the avplot and avplots commands) to check for outliers: data points with extreme values that could unduly influence our estimates. In Stata, type:

regress csat expense percent income high college i.region
avplot percent
avplot expense


Each added-variable plot shows the relationship between csat and one predictor after accounting for all the other predictors; notice the coefficient reported in each plot. All data points seem to be in range, and no outliers are observed.

To check outliers for all variables in the model, type avplots after the regression:

avplots

2.7. Testing for normality

Another assumption of the regression model (OLS) that affects the validity of all tests (p, t, and F) is that the residuals are normally distributed. Residuals (here indicated by the letter “e”) are the difference between the observed values (Y) and the predicted values (Yhat): e = Y - Yhat.

In Stata, after running the regression, type predict e, resid. This generates a variable called “e” (the residuals).
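For example, continuing with the model used throughout this section:

regress csat expense percent income high college i.region
predict e, resid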

Three graphs will help us check for normality in the residuals: kdensity, pnorm, and qnorm.

kdensity e, normal

A kernel density plot produces a kind of histogram of the residuals; the option normal overlays a normal distribution for comparison. Here the residuals seem to follow a normal distribution. Below is an example using histogram:

histogram e, kdensity normal

If the residuals do not follow a ‘normal’ pattern, you should check for omitted variables, model specification, linearity, and functional form. In short, you may need to reassess your model/theory. In practice, normality is not much of a problem with really large samples.

- The standardized normal probability plot (pnorm) checks for non-normality in the middle range of the residuals.

pnorm e

The plot is slightly off the line but looks ok.

- Quantile-normal plots (qnorm) check for non-normality in the extremes of the data (the tails) by plotting quantiles of the residuals against quantiles of a normal distribution.

qnorm e

The tails deviate a bit from normal.

- A non-graphical alternative is the Shapiro-Wilk test for normality. Here the null hypothesis is that the distribution of the residuals is normal. Type:

swilk e

. swilk e
                   Shapiro-Wilk W test for normal data
    Variable |        Obs       W           V         z       Prob>z
-------------+------------------------------------------------------
           e |         50    0.96693      1.555     0.942    0.17316

The null hypothesis is that the distribution of the residuals is normal; here the p-value is 0.17, so we fail to reject the null and conclude that the residuals are normally distributed.

2.8. Joint test (F-test)

To test whether two coefficients are jointly different from 0, use the command test (see Hamilton, 2006, p.175).

To test the null hypothesis that neither coefficient has an effect on csat (βhigh = 0 and βcollege = 0), type:

regress csat expense percent income high college i.region
test high college

We will get

. test high college
 ( 1)  high = 0
 ( 2)  college = 0
       F(  2,    41) =   10.00
            Prob > F =    0.0003

The p-value is 0.0003; we reject the null and conclude that the two variables jointly have a significant effect on SAT.

3. Regression: General Guidelines

The following are general guidelines for building a regression model suggested by Gelman and Hill (2007):

  1. Make sure all relevant predictors are included. These are based on your research question, theory and knowledge on the topic.
  2. Combine those predictors that tend to measure the same thing (i.e. as an index).
  3. Consider the possibility of adding interactions (mainly for those variables with large effects)
  4. Strategy to keep or drop variables:
    1. Predictor not significant and has the expected sign -> Keep it
    2. Predictor not significant and does not have the expected sign -> Drop it
    3. Predictor is significant and has the expected sign -> Keep it
    4. Predictor is significant but does not have the expected sign -> Review, you may need more variables, it may be interacting with another variable in the model or there may be an error in the data.

References / Useful Resources

Fox, John. Applied Regression Analysis and Generalized Linear Models. Sage, 2008.

Gelman, Andrew, and Jennifer Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. New York: Cambridge University Press, 2007.

Greene, William H. Econometric Analysis. 8th ed. Upper Saddle River, NJ: Prentice Hall, 2018.

Hamilton, Lawrence C. Statistics with Stata, Updated for Version 9. United Kingdom: Thomson/Brooks-Cole, 2006.

Stock, James H., and Mark W. Watson. Introduction to Econometrics. 4th ed. Boston: Pearson Addison Wesley, 2019.


Comments or Questions?

If you have questions or comments about this guide or method, please email data@Princeton.edu.