# Linear Regression in Stata: A Hands-on Tutorial

Basic introduction to linear regression analysis, diagnostics and presentation (using Stata)

## 1.1. Linear regression: an overview

We use regression to estimate the unknown effect of one variable on another (Stock and Watson, 2019, ch. 4).

When running a regression, we are making two assumptions:

(1) there is a linear relationship between the two variables (i.e., X and Y), and

(2) this relationship is additive (i.e., Y = b0 + b1*x1 + b2*x2 + ... + bN*xN).

Technically, linear regression estimates how much Y changes when X changes by one unit.

In Stata, use the command regress, type:

regress y x

In a multivariate setting, we type:

regress y x1 x2 x3 ...

Before running a regression, it is recommended to have a clear idea of what you are trying to estimate (i.e., your outcome and predictor variables).

A regression makes sense only if there is a sound theory behind it.

Example: Are SAT scores higher in states that spend more money on education, controlling for other factors?

– Outcome (Y) variable – SAT scores, variable csat in the dataset

– Predictor (X) variables

• Per pupil expenditures primary & secondary (expense)

• % HS graduates taking SAT (percent)

• Median household income (income)

• % adults with HS diploma (high)

• % adults with a college degree (college)

• Region (region)

## 1.2. Examining the variables first

It is recommended to first examine the variables in the model to understand the characteristics of the data. We use data from Hamilton (2006). To get the data, type:

use https://dss.princeton.edu/training/linreg1.dta

To get basic information/description about data and variables, type:

describe csat expense percent income high college region

Stata will provide the following table

```
              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------------
csat            int     %9.0g                 Mean composite SAT score
expense         int     %9.0g                 Per pupil expenditures prim&sec
percent         byte    %9.0g                 % HS graduates taking SAT
income          double  %10.0g                Median household income, $1,000
high            float   %9.0g                 % adults HS diploma
college         float   %9.0g                 % adults college degree
region          byte    %9.0g      region     Geographical region
```

To get the summary statistics of the variables, type:

summarize csat expense percent income high college region

Stata will provide the following table

```
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        csat |         51     944.098    66.93497        832       1093
     expense |         51    5235.961    1401.155       2960       9259
     percent |         51    35.76471    26.19281          4         81
      income |         51    33.95657    6.423134     23.465     48.618
        high |         51    76.26078    5.588741       64.3       86.6
-------------+---------------------------------------------------------
     college |         51    20.02157     4.16578       12.3       33.3
      region |         50        2.54    1.128662          1          4
```

To check correlation matrix of the variables we are interested in, type:

pwcorr csat expense percent income high college, star(0.05) sig

Stata will provide the following table

```
             |     csat  expense  percent   income     high  college
-------------+------------------------------------------------------
        csat |   1.0000
             |
     expense |  -0.4663*  1.0000
             |   0.0006
     percent |  -0.8758*  0.6509*  1.0000
             |   0.0000   0.0000
      income |  -0.4713*  0.6784*  0.6733*  1.0000
             |   0.0005   0.0000   0.0000
        high |   0.0858   0.3133*  0.1413   0.5099*  1.0000
             |   0.5495   0.0252   0.3226   0.0001
     college |  -0.3729*  0.6400*  0.6091*  0.7234*  0.5319*  1.0000
             |   0.0070   0.0000   0.0000   0.0000   0.0001
```

In the table, the top numbers are Pearson correlation coefficients, which range from -1 to 1; the closer to 1 in absolute value, the stronger the correlation. A negative value indicates an inverse relationship (roughly, when one variable goes up the other goes down). The numbers below each coefficient are p-values, and starred coefficients are significant at the 0.05 level.

The command graph matrix produces a graphical counterpart of the correlation matrix: a series of scatter plots for every pair of variables. Type:

graph matrix csat expense percent income high college, half

## 1.3. Running simple linear regression models

To run a simple linear regression model (one dependent variable and one independent variable), type:

regress csat expense

Here, csat is the outcome variable and  expense is the predictor variable.

Stata will give us the following output table.

```
------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     expense |  -.0222756   .0060371    -3.69   0.001    -.0344077   -.0101436
       _cons |   1060.732    32.7009    32.44   0.000     995.0175    1126.447
------------------------------------------------------------------------------
```

Interpretation of the outputs:

•  Prob > F = 0.0006 : This is the p-value of the model. It tests the null hypothesis that all slope coefficients are jointly equal to 0 (equivalently, that R-squared is 0). To reject the null hypothesis, usually we need a p-value lower than 0.05. Here, the p-value of 0.0006 indicates a statistically significant relationship between X and Y.
•  R-squared = 0.2174   : R-square shows the amount of variance of Y explained by X. In this case expense explains 22% of the variance in SAT scores.
•  Adj R-squared = 0.2015  : Adjusted R-square shows the same as R-square but adjusted by the # of cases and # of variables. When the # of variables is small and the # of cases is very large then Adj R-square is closer to R-square. This provides a more honest association between X and Y.
• Root MSE = 59.814 : The root mean squared error is the standard deviation of the residuals. The closer to zero, the better the fit.
• The estimated coefficient for expense is -.0222756. This means that for each additional dollar of per-pupil expenditure, SAT scores decrease by about 0.022 points.
• The t-values test the null hypothesis that each coefficient is 0. To reject this at the 95% level, you need a t-value whose absolute value is greater than about 1.96. You can get the t-value by dividing the coefficient by its standard error. The t-values also give a rough sense of each variable's importance in the model.
•  P>|t| = 0.001 : The two-tailed p-value tests the null hypothesis that the coefficient is equal to 0 (i.e. no significant effect). To reject this, the p-value has to be lower than 0.05 (you could choose also an alpha of 0.10). In this case, expense is statistically significant in explaining SAT.
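As a quick check, you can reproduce the reported t-value from Stata's stored results: after regress, the coefficient and standard error of each regressor are available in _b[] and _se[]. A minimal sketch:

```stata
* Run (or re-run) the simple regression first
regress csat expense

* Recompute the t-value as coefficient / standard error
display _b[expense] / _se[expense]   // about -3.69, matching the output table
```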

To run a multiple linear regression model (one dependent variable and two or more independent variables), type:

regress csat expense percent income high college

Here, csat is the outcome variable and  expense, percent, income, high, and college are the predictor variables.

Stata will give us the following output table.

```
------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     expense |   .0033528   .0044709     0.75   0.457     -.005652    .0123576
     percent |  -2.618177   .2538491   -10.31   0.000    -3.129455   -2.106898
      income |   .1055853   1.166094     0.09   0.928    -2.243048    2.454218
        high |   1.630841    .992247     1.64   0.107     -.367647    3.629329
     college |   2.030894   1.660118     1.22   0.228    -1.312756    5.374544
       _cons |   851.5649   59.29228    14.36   0.000     732.1441    970.9857
------------------------------------------------------------------------------
```

Interpretation of the outputs:

• Prob > F = 0.0000 : This is the p-value of the model. It tests the null hypothesis that all slope coefficients are jointly equal to 0. Usually we need a p-value lower than 0.05 to conclude that the regressors are jointly significant in predicting Y.
• R-squared = 0.8243 : R-square shows the amount of variance of Y explained by X. In this case the model explains 82.43% of the variance in SAT scores.
• Adj R-squared = 0.8048 : Adjusted R-square shows the same as R-square but adjusted by the # of cases and # of variables. When the # of variables is small and the # of cases is very large, Adj R-square is closer to R-square. This provides a more honest measure of the association between X and Y.
• Root MSE = 29.571 : The root mean squared error is the standard deviation of the residuals. The closer to zero, the better the fit.
• The estimated coefficient for expense is .0033528. This means that for each additional dollar of per-pupil expenditure, predicted SAT scores increase by about 0.003 points, holding all other variables constant. However, this increase is not statistically significant, as the p-value is not < 0.05. The estimated coefficient for percent is -2.618177: for each one-point increase in the percentage of HS graduates taking the SAT, scores decrease significantly by about 2.62 points, holding all other variables constant.

Interpretations of the other estimated coefficients follow the same pattern as the two variables explained above.

• The t-values test the null hypothesis that each coefficient is 0. To reject this at the 95% level, you need a t-value whose absolute value is greater than about 1.96. You can get the t-value by dividing the coefficient by its standard error. The t-values also give a rough sense of each variable's importance in the model. In this case, percent has by far the largest absolute t-value.
• Two-tailed p-values test the null hypothesis that each coefficient is 0. To reject this, the p-value has to be lower than 0.05 (you could also choose an alpha of 0.10). In this case, expense, income, high, and college are not statistically significant in explaining SAT. percent is the only variable that has a significant effect on SAT (i.e., its coefficient is significantly different from 0).

Plotting the predicted values against observed values

One way to assess how good the model is, is to check how well it predicts Y.

We can generate the predicted values of Y (usually called Yhat) given the model by using predict immediately after running the regression. Type:

```
predict csat_predict
label variable csat_predict "csat predicted"
```

Running the above codes will create a new column in your dataset named csat_predict

For a quick assessment of the model run a scatter plot between csat and csat_predict

scatter csat csat_predict

We should expect a 45-degree pattern in the data. The y-axis is the observed data and the x-axis the predicted data (Yhat).
In this case, the model seems to be doing a good job in predicting csat
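To make the 45-degree benchmark explicit, you can overlay the y = x line on the scatter plot. A sketch (assuming csat_predict has already been generated as above):

```stata
* Observed vs. predicted with a 45-degree reference line
twoway (scatter csat csat_predict) ///
       (function y = x, range(csat_predict)), ///
       legend(off) ytitle("Observed csat") xtitle("Predicted csat")
```

Points hugging the reference line indicate good predictions; systematic departures from it suggest the model over- or under-predicts in some range.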

## 1.4. Regression output table for more than one model

To show the outputs for more than one model in a single table side-by-side, you can use the commands eststo and esttab:

```
regress csat expense
eststo model1
regress csat expense percent income high college
eststo model2
xi: regress csat expense percent income high college i.region
eststo model3
esttab, r2 ar2 se scalar(rmse)
```

Stata will display the three models side by side in a single table.

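esttab can also write the same side-by-side table straight to a file that Word can open; for example (assuming the three models are still stored with eststo, and using an arbitrary file name):

```stata
* Export the stored models to an RTF file
esttab model1 model2 model3 using models.rtf, r2 ar2 se scalar(rmse) replace
```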
## 1.5. Transferring regression outputs to a Word or Excel file

The command outreg2 gives us the option to export regression output to a Word or Excel file. To do this, we first have to install the user-written outreg2 package. To install it, type:

ssc install outreg2

For transferring the output of one model to a Word file, run the regression and then outreg2:

```
regress csat expense
outreg2 using myreg.doc, replace ctitle(Model 1)
```

Stata will give us the following outputs

```
. outreg2 using myreg.doc, replace ctitle(Model 1)
myreg.doc
dir : seeout
```

Windows users: click on myreg.doc to open the file in Word (you can replace this name with your own). Otherwise, follow the Mac instructions.

Mac users: click on dir to go to the directory where myreg.doc is saved, and open it with Word (you can replace this name with your own)

The outputs in the word document look as follows.

The table reports the coefficient on expense and the constant, with standard errors in parentheses (32.70 for the constant) and significance stars: *** p<0.01, ** p<0.05, * p<0.1.

We can add more models (e.g., Model 2, Model 3) to the Word document by using the option append (note: make sure myreg.doc is closed before running these commands):

```
regress csat expense percent
outreg2 using myreg.doc, append ctitle(Model 2)
regress csat expense percent income high college
outreg2 using myreg.doc, append ctitle(Model 3)
```

Stata will give us the following outputs

```
. outreg2 using myreg.doc, append ctitle(Model 3)
myreg.doc
dir : seeout
```


The outputs in the Word document look as follows.

The table now has three columns, (1) Model 1, (2) Model 2, and (3) Model 3, listing the coefficients for each model with standard errors in parentheses (32.70, 18.40, and 59.29 for the respective constants) and significance stars: *** p<0.01, ** p<0.05, * p<0.1.

You also have the option to export the outputs to Excel. Use the extension *.xls.
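For example, the Model 1 export above could be sent to Excel instead of Word simply by changing the extension (the file name is arbitrary):

```stata
regress csat expense
outreg2 using myreg.xls, replace ctitle(Model 1)
```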

## 1.6. Robust regression

We run robust regression to obtain standard errors that are valid under heteroskedasticity. By default, Stata assumes homoskedastic standard errors, so if the error variance is heteroskedastic, we adjust for it by adding the robust option to the regress command. Note that robust changes only the standard errors; the coefficient estimates stay the same. Type:

regress csat expense percent income high college, robust

Stata will give us the same coefficient estimates, but notice that we now have Robust Std. Err. instead of Std. Err.

## 1.7. Regression with dummy/categorical variables

When we add a categorical variable to a regression, we need to add n-1 dummy variables, where n is the number of categories in the variable.
In the example below, which uses Stata's nlsw88.dta sample dataset, the variable industry has twelve categories (type tab industry, or tab industry, nolabel).

The easiest way to include a set of dummies in a regression is by using the prefix “i.”  By default, the first category (or lowest value) is
used as a reference. For example:

```
sysuse nlsw88.dta, clear
reg wage hours i.industry
```

Stata will give us the following regression output

```
------------------------------------------------------------------------------------------
                    wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------------------+----------------------------------------------------------------
                   hours |   .0723658   .0114787     6.30   0.000     .0498557    .0948759
                         |
                industry |
                 Mining  |   9.328331   3.082327     3.03   0.003     3.283779    15.37288
           Construction  |   1.858089   1.693951     1.10   0.273    -1.463809    5.179987
          Manufacturing  |   1.415641   1.377724     1.03   0.304    -1.286126    4.117407
 Transport/Comm/Utility  |   5.432544   1.467787     3.70   0.000     2.554162    8.310926
 Wholesale/Retail Trade  |   .4583809   1.378985     0.33   0.740    -2.245859     3.16262
Finance/Ins/Real Estate  |    3.92933      1.404     2.80   0.005     1.176036    6.682624
    Business/Repair Svc  |   1.990151   1.471971     1.35   0.177    -.8964373    4.876738
      Personal Services  |  -1.018771   1.459441    -0.70   0.485    -3.880786    1.843244
  Entertainment/Rec Svc  |   1.111801    1.90205     0.58   0.559    -2.618187     4.84179
  Professional Services  |   2.094988   1.359033     1.54   0.123    -.5701247    4.760101
  Public Administration  |   3.232405   1.409187     2.29   0.022      .468939    5.995871
                         |
                   _cons |   3.126629   1.401948     2.23   0.026     .3773593    5.875898
------------------------------------------------------------------------------------------
```

- To include all categories by suppressing the constant, type:

```
reg wage hours bn.industry, robust hascons
```

- To change the reference category to “Professional services” (category number 11) instead of “Ag/Forestry/Fisheries” (category number 1), use the prefix “ib#.”, where “#” is the number of the reference category you want to use; in this case, 11.

```
clear
sysuse nlsw88.dta
reg wage hours ib11.industry
```

Stata will give us the regression output with Professional Services as the reference category.

## 1.8. Regression: interaction between dummies

Interaction terms are needed whenever there is reason to believe that the effect of one independent variable depends on the value of another independent variable. We will explore here the interaction between two dummy (binary) variables. In the example below, the effect of the student-teacher ratio on test scores may depend on the percentage of English learners in the district*.

To upload the data in Stata, type:

use https://dss.princeton.edu/training/linreg2.dta

– Dependent variable (Y): Average test score, variable testscr in the dataset.

– Independent variables (X)

• Binary hi_str, where ‘0’ if the student-teacher ratio (str) is lower than 20, ‘1’ if it is 20 or higher.

- In Stata, first generate hi_str = 0 if str<20. Then replace hi_str=1 if str>=20

• Binary hi_el, where ‘0’ if English learners (el_pct) is lower than 10%, ‘1’ equal to 10% or higher

- In Stata, first generate hi_el = 0 if el_pct<10. Then replace hi_el=1 if el_pct>=10

• Interaction term str_el = hi_str * hi_el. In Stata: generate str_el = hi_str*hi_el
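Collected in one place, the variable construction described above can be sketched as follows (the !missing() guards are an addition here, so that observations with missing str or el_pct are not accidentally coded as 1):

```stata
* Dummy for high student-teacher ratio
generate hi_str = 0 if str < 20
replace  hi_str = 1 if str >= 20 & !missing(str)

* Dummy for high share of English learners
generate hi_el = 0 if el_pct < 10
replace  hi_el = 1 if el_pct >= 10 & !missing(el_pct)

* Interaction of the two dummies
generate str_el = hi_str * hi_el
```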

We run the regression
regress testscr hi_el hi_str str_el, robust

Stata will give us the following outputs

```
------------------------------------------------------------------------------
             |               Robust
     testscr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       hi_el |  -18.16295   2.345952    -7.74   0.000    -22.77435   -13.55155
      hi_str |  -1.907842   1.932215    -0.99   0.324    -5.705964    1.890279
      str_el |  -3.494335   3.121226    -1.12   0.264    -9.629677    2.641006
       _cons |   664.1433   1.388089   478.46   0.000     661.4147    666.8718
------------------------------------------------------------------------------
```

Interpretation:

From the above outputs we can write the following equation:

testscr_hat = 664.1 - 18.1*hi_el - 1.9*hi_str - 3.5*str_el

- The effect of hi_str on test scores is -1.9, but given the interaction term (and assuming all coefficients are significant), the net effect is -1.9 - 3.5*hi_el. If hi_el is 0, the effect is -1.9 (the hi_str coefficient); if hi_el is 1, the effect is -1.9 - 3.5 = -5.4. In this case, the effect of the student-teacher ratio is more negative in districts where the percentage of English learners is higher.

- The average test score in districts where the student-teacher ratio is >= 20 and English learners are >= 10% is 640.6. To calculate this number, plug 1 in for hi_el, hi_str, and str_el in the above equation (i.e., 664.1 - 18.1 - 1.9 - 3.5 = 640.6).
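The net effect and its standard error can also be obtained directly with lincom, which tests linear combinations of coefficients; run it immediately after the regression above:

```stata
* Net effect of hi_str in high-EL districts: b[hi_str] + b[str_el]
lincom hi_str + str_el   // about -5.4, with its own SE and confidence interval
```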

* We use the "California Test Score" data set (caschool.dta), which is used by Stock and Watson (2003) and is available online.

## 1.9. Regression: interaction between a dummy and a continuous variable

use https://dss.princeton.edu/training/linreg2.dta

– Dependent variable (Y): Average test score, variable testscr in the dataset.

– Independent variables (X)

• Continuous str, student-teacher ratio.

• Binary hi_el, where ‘0’ if English learners (el_pct) is lower than 10%, ‘1’ equal to 10% or higher.

- In Stata, first generate hi_el = 0 if el_pct<10. Then replace hi_el=1 if el_pct>=10

• Interaction term str_el_dc = str * hi_el. In Stata: generate str_el_dc = str*hi_el

We run the regression
regress testscr str hi_el str_el_dc, robust

Stata will give us the following outputs

```
------------------------------------------------------------------------------
             |               Robust
     testscr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         str |  -.9684601   .5891016    -1.64   0.101    -2.126447    .1895268
       hi_el |   5.639141   19.51456     0.29   0.773    -32.72029    43.99857
   str_el_dc |  -1.276613   .9669194    -1.32   0.187     -3.17727    .6240436
       _cons |   682.2458   11.86781    57.49   0.000     658.9175    705.5742
------------------------------------------------------------------------------
```

Interpretation:

From the above outputs we can write the following equation:

testscr_hat = 682.2 – 0.97*str + 5.6*hi_el – 1.28*str_el_dc

The effect of str on testscr is moderated by hi_el.

• If hi_el is 0 (low), the fitted line is testscr_hat = 682.2 - 0.97*str.
• If hi_el is 1 (high), the fitted line is testscr_hat = (682.2 + 5.6) - (0.97 + 1.28)*str = 687.8 - 2.25*str.

Notice how hi_el changes both the intercept and the slope of str. Reducing str by one unit in low-EL districts will increase test scores by 0.97 points, but it has a larger impact (2.25 points) in high-EL districts. The difference between these two effects is 1.28, which is the coefficient of the interaction (Stock and Watson, 2003, p.223).
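An equivalent way to fit this model is with Stata's factor-variable notation, which lets margins compute the slope of str at each level of hi_el without creating the interaction variable by hand. A sketch:

```stata
* c.str##i.hi_el expands to str, hi_el, and their interaction
regress testscr c.str##i.hi_el, robust

* Slope of str in low-EL (hi_el = 0) and high-EL (hi_el = 1) districts
margins, dydx(str) at(hi_el = (0 1))
```

The two reported slopes should match the hand calculation above (about -0.97 and -2.25).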

## 1.10. Regression: interaction between two continuous variables

use https://dss.princeton.edu/training/linreg2.dta

– Dependent variable (Y): Average test score, variable testscr in the dataset.

– Independent variables (X)

• Continuous str, student-teacher ratio.

• Continuous el_pct, percent of English learners.

• Interaction term str_el_cc = str * el_pct. In Stata: generate str_el_cc = str*el_pct

We run the regression
regress testscr str el_pct str_el_cc, robust

Stata will give us the following outputs

```
------------------------------------------------------------------------------
             |               Robust
     testscr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         str |  -1.117018   .5875135    -1.90   0.058    -2.271884    .0378468
      el_pct |  -.6729116   .3741231    -1.80   0.073    -1.408319    .0624958
   str_el_cc |   .0011618   .0185357     0.06   0.950    -.0352736    .0375971
       _cons |   686.3385   11.75935    58.37   0.000     663.2234    709.4537
------------------------------------------------------------------------------
```

Interpretation:

From the above outputs we can write the following equation:

testscr_hat = 686.3 – 1.12*str - 0.67*el_pct + 0.0012*str_el_cc

The effect of the interaction term is very small. Following Stock & Watson (2003, p.229), algebraically the slope of str is –1.12 + 0.0012*el_pct (remember that str_el_cc is equal to str*el_pct). So:

• If el_pct = 10, the slope of str is -1.108

• If el_pct = 20, the slope of str is -1.096. A difference in effect of 0.012 points.

In the continuous case there is an interaction effect, but it is very small (and not statistically significant). See Stock and Watson (2003) for further details.
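The same factor-variable approach works for two continuous variables; margins can then evaluate the slope of str at chosen values of el_pct (e.g., 10 and 20, as above). A sketch:

```stata
* c.str##c.el_pct includes both variables and their product
regress testscr c.str##c.el_pct, robust

* Slope of str when el_pct is 10 and when it is 20
margins, dydx(str) at(el_pct = (10 20))
```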

## 2. Assumption Diagnostics and Regression Trouble Shooting

We use data from Hamilton (2006) for all the analyses in section  2. To get the data, type:

use https://dss.princeton.edu/training/linreg1.dta

## 2.1. Exploring relationships between the dependent and independent variables

Let us first check the relationship between csat and percent

scatter csat percent

For checking the relationship between csat and high, type:

scatter csat high

There seems to be a curvilinear relationship between csat and percent, and a roughly linear one between csat and high. To deal with U-shaped curves we need to add a squared version of the variable, in this case percent squared:

generate percent2 = percent^2

## 2.2. Checking functional form/linearity

The command acprplot (augmented component-plus-residual plot) provides another graphical way to examine the relationship between variables and is a good check for linearity. Run it after running a regression.

```
regress csat percent high   /* Notice we do not include percent2 */
acprplot percent, lowess
acprplot high, lowess
```

The option lowess (locally weighted scatterplot smoothing) draws the observed pattern in the data to help identify nonlinearities. percent shows a quadratic relationship, so it makes sense to add a squared version of it. high also departs from a straight line but stays close to the regression line (except on the right); we can keep it as is for now.

The linearity corrected model is:
regress csat percent percent2 high, robust

Stata will give us the regression output.

## 2.3. Testing for homoskedasticity

An important assumption of the classical linear regression model is that the variance in the residuals has to be homoskedastic or constant.

Graphical way to check homoskedasticity

When plotting residuals vs. predicted values (Yhat), we should not observe any pattern at all. In Stata, we do this using rvfplot right after running the regression. It will automatically draw a scatter plot between residuals and predicted values. Type:

rvfplot, yline(0)

Residuals seem to expand slightly at higher levels of Yhat.

A non-graphical way to detect heteroskedasticity is the Breusch-Pagan test (command estat hettest, run after regress). The null hypothesis is that the residuals are homoskedastic. In the example below, the p-value is above 0.05, so we fail to reject the null at the 95% level.

```
estat hettest

chi2(1)      =     1.40
Prob > chi2  =   0.2375
```

The residual plot hints at mild heteroskedasticity even though the Breusch-Pagan test does not reject homoskedasticity. The problem with heteroskedasticity is that we may have the wrong estimates of the standard errors for the coefficients and, therefore, their t-values.

There are two ways to deal with this problem; one is using heteroskedasticity-robust standard errors, and the other one is using weighted least squares (see Stock and Watson, 2003, chapter 15). WLS requires knowledge of the conditional variance on which the weights are based; if this is known (rarely the case), then use WLS. In practice, it is recommended to use heteroskedasticity-robust standard errors to deal with heteroskedasticity.

By default Stata assumes homoskedastic standard errors, so we need to adjust our model to account for heteroskedasticity. To do this, we use the option robust in the regress command. For example,

regress csat expense percent income high college i.region, robust

Note: Stock and Watson (2019, chapter 5) suggest, as a rule of thumb, we should always assume heteroskedasticity in our model and therefore run robust regression.

## 2.4. Testing for multicollinearity

An important assumption of the multiple regression model is that the independent variables are not perfectly multicollinear: one regressor should not be an exact linear function of the others.
When regressors are highly correlated, standard errors may be inflated; under perfect multicollinearity, Stata drops one of the variables to avoid a division by zero in the OLS procedure (see Stock and Watson, 2019, chapter 6).
The Stata command to check for multicollinearity is vif (variance inflation factor). Type:

regress csat expense percent income high college i.region
vif

```
    Variable |       VIF       1/VIF
-------------+----------------------
     expense |      3.18    0.314656
     percent |      3.88    0.257790
      income |      4.78    0.209068
        high |      4.71    0.212167
     college |      4.34    0.230156
      region |
          2  |      3.57    0.279850
          3  |      4.18    0.239156
          4  |      1.80    0.556855
-------------+----------------------
    Mean VIF |      3.81
```

Rule of thumb: A VIF > 10 or a 1/VIF < 0.10 indicates the presence of multicollinearity in the model.

Based on the above rule, we can say there is no multicollinearity in our model.

## 2.5. Testing for omitted-variable bias

How do we know we have included all necessary variables to explain Y?

Testing for omitted-variable bias is important for our model since it is related to the assumption that the error term and the independent variables in the model are not correlated (E(e|X) = 0).

If a variable omitted from our model

• “is correlated with the included regressor” and
• “the omitted variable is a determinant of the dependent variable” (Stock and Watson, 2019, p.170),

…then our regression coefficients are inconsistent.

In Stata, we test for omitted-variable bias using the ovtest command:

```
ovtest

Ramsey RESET test using powers of the fitted values of csat
       Ho:  model has no omitted variables
                  F(3, 38) =      2.15
                  Prob > F =      0.1096
```

The null hypothesis is that the model does not have omitted-variable bias; the p-value is higher than the usual threshold of 0.05 (95% confidence), so we fail to reject the null and conclude that we do not need more variables.

Another command to test the model specification is linktest. It checks whether we need more variables in our model by running a new regression of the observed Y (csat) on Yhat (the fitted values, Xb) and Yhat-squared as independent variables.

The thing to look for here is the significance of _hatsq. The null hypothesis is that there is no specification error. If the p-value of _hatsq is not significant, then we fail to reject the null and conclude that our model is correctly specified. Type:

```
regress csat expense percent income high college i.region, robust
linktest
```

```
------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |  -1.081591   1.866453    -0.58   0.565    -4.836411    2.673228
      _hatsq |   .0010976   .0009839     1.12   0.270    -.0008817     .003077
       _cons |   982.6278   881.8465     1.11   0.271    -791.4185    2756.674
------------------------------------------------------------------------------
```

The null hypothesis is that there is no specification error; the p-value of _hatsq is higher than the usual threshold of 0.05, so we fail to reject the null and conclude that the model is correctly specified.

## 2.6. Checking for outliers

We use the avplot command (added-variable plots) to check for outliers. Outliers are data points with extreme values that could have a disproportionate effect on our estimates. In Stata, type:

```
regress csat expense percent income high college i.region
avplot percent
avplot expense
```

Each added-variable plot shows the relationship between the outcome and one regressor after partialling out all the other regressors; notice the coefficient reported on each plot. All data points seem to be in range, and no outliers are observed.

For checking the outliers for all variables of the model, type avplots after regression.

avplots

## 2.7. Testing for normality

Another assumption of the regression model (OLS) that impacts the validity of all tests (p, t, and F) is that the residuals are normally distributed. Residuals (here denoted by the letter “e”) are the difference between the observed values (Y) and the predicted values (Yhat): e = Y - Yhat.

In Stata, after running regression type: predict e, resid. It will generate a variable called “e” (residuals).

Three graphs will help us check for normality in the residuals: kdensity, pnorm, and qnorm.

kdensity e, normal

A kernel density plot produces a kind of histogram for the residuals; the option normal overlays a normal distribution for comparison. Here residuals seem to follow a normal distribution. Below is an example using histogram.

histogram e, kdensity normal

If residuals do not follow a ‘normal’ pattern, then you should check for omitted variables, model specification, linearity, and functional forms. In sum, you may need to reassess your model/theory. In practice, normality does not represent much of a problem when dealing with very large samples.

- Standardized normal probability plots (pnorm) check for non-normality in the middle range of the residuals.

pnorm e

The plot is slightly off the line but looks ok.

- Quantile-normal plots (qnorm) check for non-normality in the extremes of the data (the tails). They plot the quantiles of the residuals against the quantiles of a normal distribution.

qnorm e

Tails are a bit off the normal.

- A non-graphical test is the Shapiro-Wilk test for normality. It tests the hypothesis that the distribution is normal; in this case, the null hypothesis is that the distribution of the residuals is normal. Type

swilk e

```
    Variable |        Obs       W           V         z       Prob>z
-------------+------------------------------------------------------
           e |         50    0.96693      1.555     0.942    0.17316
```

The null hypothesis is that the distribution of the residuals is normal; here the p-value is 0.17, so we fail to reject the null. We conclude that the residuals are normally distributed.

## 2.8. Joint test (F-test)

To test whether two coefficients are jointly different from 0, use the command test (see Hamilton, 2006, p.175).

To test the null hypothesis that neither coefficient has an effect on csat (βhigh = 0 and βcollege = 0), run the test command after the regression (here, the full model):

```
regress csat expense percent income high college i.region
test high college
```

We will get

```
 F(  2,    41) =   10.00
      Prob > F =    0.0003
```

The p-value is 0.0003; we reject the null and conclude that both variables have indeed a significant effect on SAT.

## 3. Regression: General Guidelines

The following are general guidelines for building a regression model suggested by Gelman and Hill (2007):

1. Make sure all relevant predictors are included. These are based on your research question, theory, and knowledge of the topic.
2. Combine predictors that tend to measure the same thing (e.g., as an index).
3. Consider the possibility of adding interactions (mainly for those variables with large effects).
4. Strategy to keep or drop variables:
   1. Predictor is not significant and has the expected sign -> Keep it
   2. Predictor is not significant and does not have the expected sign -> Drop it
   3. Predictor is significant and has the expected sign -> Keep it
   4. Predictor is significant but does not have the expected sign -> Review; you may need more variables, it may be interacting with another variable in the model, or there may be an error in the data.

## References / Useful Resources

Applied Regression Analysis and Generalized Linear Models / John Fox, Sage, 2008.

Data analysis using regression and multilevel/hierarchical models / Andrew Gelman, Jennifer Hill. Cambridge ; New York : Cambridge University Press, 2007.

Econometric analysis / William H. Greene. 8th ed., Upper Saddle River, N.J. : Prentice Hall, 2018.

Introduction to econometrics / James H. Stock, Mark W. Watson. 4th ed., Boston: Pearson Addison Wesley, 2019.

Statistics with Stata, Updated for Version 9/ Lawrence C. Hamilton. United Kingdom: Thomson/BrooksCole, 2006.