We use regression to estimate the unknown effect of changing one variable on another (Stock and Watson, 2019, ch. 4).
When running a regression, we make two assumptions:
(1) there is a linear relationship between the two variables (i.e., X and Y), and
(2) this relationship is additive (i.e., Y = β0 + β1x1 + β2x2 + … + βNxN).
Technically, linear regression estimates how much Y changes when X changes one unit.
In Stata, we use the command regress. For a simple regression with one predictor, type:
regress y x
In a multivariate setting, we type:
regress y x1 x2 x3 ...
Before running a regression, it is recommended to have a clear idea of what you are trying to estimate (i.e., your outcome and predictor variables).
A regression makes sense only if there is a sound theory behind it.
Example: Are SAT scores higher in states that spend more money on education, controlling for other factors?
– Outcome (Y) variable – SAT scores, variable csat in the dataset
– Predictor (X) variables
Per pupil expenditures primary & secondary (expense)
% HS graduates taking SAT (percent)
Median household income (income)
% adults with HS diploma (high)
% adults with a college degree (college)
Region (region)
It is recommended to first examine the variables in the model to understand the characteristics of the data. We use data from Hamilton (2006). To get the data, type:
use https://dss.princeton.edu/training/linreg1.dta
To get basic information/description about data and variables, type:
describe csat expense percent income high college region
Stata will provide the following table
To get the summary statistics of the variables, type:
summarize csat expense percent income high college region
Stata will provide the following table
To check the correlation matrix of the variables we are interested in, type:
pwcorr csat expense percent income high college, star(0.05) sig
Stata will provide the following table
In the table, the numbers are Pearson correlation coefficients, which range from -1 to 1. Values closer to ±1 indicate a stronger correlation; a negative value indicates an inverse relationship (roughly, when one goes up, the other goes down).
The command graph matrix produces a graphical representation of the correlation matrix as a series of scatter plots for all variables. Type:
graph matrix csat expense percent income high college, half
To run a simple linear regression model, which has one dependent variable and one independent variable, type:
regress csat expense
Here, csat is the outcome variable and expense is the predictor variable.
Stata will give us the following output table.
Interpretation of the outputs:
To run a multiple/multivariable linear regression model, which has one dependent variable and two or more independent variables, type:
regress csat expense percent income high college
Here, csat is the outcome variable and expense, percent, income, high, and college are the predictor variables.
Stata will give us the following output table.
Interpretation of the outputs:
Interpretations for the other estimated coefficients follow the same logic as the two explained above.
Plotting the predicted values against observed values
One way to assess how good the model is, is to see how well it predicts Y.
We can generate the predicted values of Y (usually called Yhat) given the model by using predict immediately after running the regression. Type:
predict csat_predict
Running the above code will create a new variable in your dataset named csat_predict.
For a quick assessment of the model, run a scatter plot between csat and csat_predict:
scatter csat csat_predict
We should expect a 45-degree pattern in the data: the y-axis shows the observed data and the x-axis the predicted data (Yhat).
In this case, the model seems to be doing a good job of predicting csat.
To show the outputs for more than one model in a single table side-by-side, you can use the commands eststo and esttab:
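For example, a minimal sketch (the three models here are an assumption, chosen to match the outreg2 example below; eststo and esttab come from the user-written estout package, installed with ssc install estout):
eststo clear
eststo: regress csat expense
eststo: regress csat expense percent
eststo: regress csat expense percent income high college
esttab, se r2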
Stata will give us the following output table.
The command outreg2 gives us the option to export regression output to a Word or Excel file. To do this, we first have to install the user-written outreg2 package. To install it, type:
ssc install outreg2
To transfer the regression output of one model to a Word file, type:
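A sketch, assuming Model 1 is the simple regression of csat on expense (consistent with the table below; replace myreg.doc with your own file name):
regress csat expense
outreg2 using myreg.doc, replace ctitle(Model 1)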
Stata will give us the following outputs
Windows users: click on myreg.doc to open the file in Word (you can replace this name with your own). Otherwise, follow the Mac instructions.
Mac users: click on dir to go to the directory where myreg.doc is saved, and open it with Word (you can replace this name with your own)
The outputs in the Word document look as follows.
| VARIABLES    | (1) Model 1          |
|--------------|----------------------|
| expense      | -0.0223*** (0.00604) |
| Constant     | 1,061*** (32.70)     |
| Observations | 51                   |
| R-squared    | 0.217                |

Standard errors in parentheses; *** p<0.01, ** p<0.05, * p<0.1
We can add more models (e.g., Model 2 and Model 3) to the Word document by using the option append (NOTE: make sure to close myreg.doc first):
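A sketch; Model 2 and Model 3 are assumed to add percent and then the remaining covariates, consistent with the table below:
regress csat expense percent
outreg2 using myreg.doc, append ctitle(Model 2)
regress csat expense percent income high college
outreg2 using myreg.doc, append ctitle(Model 3)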
Stata will give us the following outputs
Windows users: click on myreg.doc to open the file in Word (you can replace this name with your own). Otherwise, follow the Mac instructions.
Mac users: click on dir to go to the directory where myreg.doc is saved, and open it with Word (you can replace this name with your own)
The outputs in the Word document look as follows.
| VARIABLES    | (1) Model 1          | (2) Model 2         | (3) Model 3       |
|--------------|----------------------|---------------------|-------------------|
| expense      | -0.0223*** (0.00604) | 0.00860** (0.00420) | 0.00335 (0.00420) |
| percent      |                      | -2.538*** (0.225)   | -2.618*** (0.254) |
| income       |                      |                     | 0.106 (1.166)     |
| high         |                      |                     | 1.631 (0.992)     |
| college      |                      |                     | 2.031 (1.660)     |
| Constant     | 1,061*** (32.70)     | 989.8*** (18.40)    | 851.6*** (59.29)  |
| Observations | 51                   | 51                  | 51                |
| R-squared    | 0.217                | 0.786               | 0.824             |

Standard errors in parentheses; *** p<0.01, ** p<0.05, * p<0.1
You also have the option to export the outputs to Excel; use the extension *.xls.
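For example, a sketch reusing the model from above:
regress csat expense
outreg2 using myreg.xls, replace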
We run robust regression to control for heteroskedasticity. By default, Stata assumes homoskedastic standard errors, so if we have heteroskedastic variance, we need to adjust for it by adding the robust option to the regress command. Type:
regress csat expense percent income high college, robust
Stata will give us the following outputs. Notice that we now have Robust Std. Err. instead of Std. Err.
When we add a categorical variable to a regression, we need to add n-1 dummy variables, where 'n' is the number of categories in the variable.
In the example below, the variable 'industry' has twelve categories (type tab industry or tab industry, nolabel to see them).
The easiest way to include a set of dummies in a regression is by using the prefix "i.". By default, the first category (or lowest value) is used as the reference. For example:
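A sketch of such a regression (wage and industry are illustrative variable names from a hypothetical wage dataset, not the SAT data used above):
regress wage i.industry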
Stata will give us the following regression output
- To include all categories by suppressing the constant, type:
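A sketch with the same hypothetical variables as above; the prefix ibn. tells Stata not to set a base category, and noconstant suppresses the constant:
regress wage ibn.industry, noconstant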
- To change the reference category to "Professional services" (category number 11) instead of "Ag/Forestry/Fisheries" (category number 1), use the prefix "ib#.", where "#" is the number of the reference category you want to use; in this case, it is 11.
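Again with the hypothetical wage data:
regress wage ib11.industry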
Stata will give us the following regression output
Interaction terms are needed whenever there is reason to believe that the effect of one independent variable depends on the value of another independent variable. We will explore here the interaction between two dummy (binary) variables. In the example below, the effect of the student-teacher ratio on test scores may depend on the percentage of English learners in the district*.
To upload the data in Stata, type:
use https://dss.princeton.edu/training/linreg2.dta
– Dependent variable (Y): Average test score, variable testscr in the dataset.
– Independent variables (X)
Binary hi_str, where ‘0’ if the student-teacher ratio (str) is lower than 20, ‘1’ if it is 20 or higher.
- In Stata, first generate hi_str = 0 if str<20. Then replace hi_str=1 if str>=20
Binary hi_el, where ‘0’ if English learners (el_pct) is lower than 10%, ‘1’ equal to 10% or higher
- In Stata, first generate hi_el = 0 if el_pct<10. Then replace hi_el=1 if el_pct>=10
Interaction term str_el = hi_str * hi_el. In Stata: generate str_el = hi_str*hi_el
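Putting those steps together in Stata (note: Stata treats missing values as larger than any number, so the replace lines would also code missing str or el_pct as 1; it is worth checking for missings first, e.g., with misstable summarize):
generate hi_str = 0 if str < 20
replace hi_str = 1 if str >= 20
generate hi_el = 0 if el_pct < 10
replace hi_el = 1 if el_pct >= 10
generate str_el = hi_str * hi_el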
We run the regression
regress testscr hi_el hi_str str_el, robust
Stata will give us the following outputs
Interpretation:
From the above outputs we can write the following equation:
testscr_hat = 664.1 - 18.1*hi_el - 1.9*hi_str - 3.5*str_el
- The effect of hi_str on test scores is -1.9, but given the interaction term (and assuming all coefficients are significant), the net effect is -1.9 - 3.5*hi_el. If hi_el is 0, the effect is -1.9 (the hi_str coefficient), but if hi_el is 1, the effect is -1.9 - 3.5 = -5.4. In this case, the effect of the student-teacher ratio is more negative in districts where the percentage of English learners is higher (see the lincom sketch after this list).
- The average test score in districts where the student-teacher ratio is >= 20 and English learners are >= 10% is 640.6. To calculate this number, we plug 1 in for hi_el, hi_str, and str_el in the above equation (i.e., 664.1 - 18.1 - 1.9 - 3.5 = 640.6).
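Rather than computing the net effect by hand, the built-in lincom command can be used right after the regression to get the same linear combination with a standard error (a minimal sketch):
lincom hi_str + str_el
This reproduces the -1.9 - 3.5 = -5.4 calculation above and also reports its standard error and confidence interval.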
* We use the "California Test Score" data set (caschool.dta) used by Stock and Watson (2003).
First, upload the data (the same file as in the previous example):
use https://dss.princeton.edu/training/linreg2.dta
– Dependent variable (Y): Average test score, variable testscr in the dataset.
– Independent variables (X)
Continuous str, student-teacher ratio.
Binary hi_el, where ‘0’ if English learners (el_pct) is lower than 10%, ‘1’ equal to 10% or higher.
- In Stata, first generate hi_el = 0 if el_pct<10. Then replace hi_el=1 if el_pct>=10
Interaction term str_el_dc = str * hi_el. In Stata: generate str_el_dc = str*hi_el
We run the regression
regress testscr str hi_el str_el_dc, robust
Stata will give us the following outputs
Interpretation:
From the above outputs we can write the following equation:
testscr_hat = 682.2 - 0.97*str + 5.6*hi_el - 1.28*str_el_dc
The effect of str on testscr is mediated by hi_el.
Notice how hi_el changes both the intercept and the slope of str. Reducing str by one in low-EL districts will increase test scores by 0.97 points, but it will have a higher impact (2.25 points) in high-EL districts. The difference between these two effects is 1.28, which is the coefficient of the interaction (Stock and Watson, 2003, p. 223).
First, upload the data (the same file as in the previous example):
use https://dss.princeton.edu/training/linreg2.dta
– Dependent variable (Y): Average test score, variable testscr in the dataset.
– Independent variables (X)
Continuous str, student-teacher ratio.
Continuous el_pct, percent of English learners.
Interaction term str_el_cc = str * el_pct. In Stata: generate str_el_cc = str*el_pct
We run the regression
regress testscr str el_pct str_el_cc, robust
Stata will give us the following outputs
Interpretation:
From the above outputs we can write the following equation:
testscr_hat = 686.3 - 1.12*str - 0.67*el_pct + 0.0012*str_el_cc
The effect of the interaction term is very small. Following Stock and Watson (2003, p. 229), algebraically the slope of str is -1.12 + 0.0012*el_pct (remember that str_el_cc equals str*el_pct). So:
If el_pct = 10, the slope of str is -1.108
If el_pct = 20, the slope of str is -1.096. A difference in effect of 0.012 points.
In the continuous case there is an effect, but it is very small (and not significant). See Stock and Watson (2003) for further details.
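To get these slopes with standard errors instead of computing them by hand, lincom can again be used right after the regression (a sketch):
lincom str + 10*str_el_cc
lincom str + 20*str_el_cc
The first line gives the slope of str when el_pct = 10 (about -1.108), the second when el_pct = 20 (about -1.096).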
We use data from Hamilton (2006) for all the analyses in section 2. To get the data, type:
use https://dss.princeton.edu/training/linreg1.dta
Let us first check the relationship between csat and percent. Type:
scatter csat percent
To check the relationship between csat and high, type:
scatter csat high
There seems to be a curvilinear relationship between csat and percent, and a slightly linear one between csat and high. To deal with U-shaped curves, we need to add a squared version of the variable, in this case percent squared:
generate percent2 = percent^2
The command acprplot (augmented component-plus-residual plot) provides another graphical way to examine the relationship between variables and offers a good test for linearity. Run it after running a regression.
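Since acprplot requires estimation results in memory, first fit a model that includes these predictors (a sketch; the exact specification here is an assumption):
regress csat percent high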
acprplot percent, lowess
acprplot high, lowess
The option lowess (locally weighted scatterplot smoothing) draws the observed pattern in the data to help identify nonlinearities. Percent shows a quadratic relation, so it makes sense to add a squared version of it. High shows a polynomial pattern as well, but it stays close to the regression line (except on the right), so we can keep it as is for now.
The linearity-corrected model is:
regress csat percent percent2 high, robust
Stata will give us the following outputs.
An important assumption of the classical linear regression model is that the variance in the residuals has to be homoskedastic or constant.
Graphical way to check homoskedasticity
When plotting residuals vs. predicted values (Yhat), we should not observe any pattern at all. In Stata, we do this using rvfplot right after running the regression; it automatically draws a scatter plot of residuals against fitted (predicted) values. Type:
rvfplot
Residuals seem to expand slightly at higher levels of Yhat.
A non-graphical way to detect heteroskedasticity is the Breusch-Pagan test, whose null hypothesis is that the residuals are homoskedastic. In the example below, we reject the null at the 95% level and conclude that the residuals are heteroskedastic.
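In Stata, the Breusch-Pagan test is run with estat hettest right after a regression. For example (a sketch, reusing the linearity-corrected model from above without the robust option):
regress csat percent percent2 high
estat hettest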
The graphical check and the Breusch-Pagan test suggest the possible presence of heteroskedasticity in our model. The problem is that we may get wrong estimates of the standard errors of the coefficients and, therefore, of their t-values.
There are two ways to deal with this problem; one is using heteroskedasticity-robust standard errors, and the other one is using weighted least squares (see Stock and Watson, 2003, chapter 15). WLS requires knowledge of the conditional variance on which the weights are based; if this is known (rarely the case), then use WLS. In practice, it is recommended to use heteroskedasticity-robust standard errors to deal with heteroskedasticity.
By default Stata assumes homoskedastic standard errors, so we need to adjust our model to account for heteroskedasticity. To do this, we use the option robust in the regress command. For example,
regress csat expense percent income high college i.region, robust
Note: Stock and Watson (2019, chapter 5) suggest, as a rule of thumb, we should always assume heteroskedasticity in our model and therefore run robust regression.
An important assumption of the multiple regression model is that the independent variables are not perfectly multicollinear: one regressor should not be a linear function of another.
When multicollinearity is present in a model, standard errors may be inflated. Stata will drop one of the variables to avoid a division by zero in the OLS procedure (see Stock and Watson, 2019, chapter 6).
The Stata command to check for multicollinearity is vif (variance inflation factor). Type:
regress csat expense percent income high college i.region
vif
Rule of thumb: A VIF > 10 or a 1/VIF < 0.10 indicates multicollinearity in the model.
Based on the above rule, we can say there is no multicollinearity in our model.
How do we know we have included all necessary variables to explain Y?
Testing for omitted-variable bias is important for our model since it is related to the assumption that the error term and the independent variables in the model are not correlated (E(e|X) = 0).
If we are missing variables in our model, and those omitted variables are correlated with the included regressors and help determine Y, then our regression coefficients are inconsistent.
In Stata, we test for omitted-variable bias using the ovtest command:
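For example (a sketch, assuming the full model from above; ovtest runs right after the regression):
regress csat expense percent income high college i.region
ovtest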
The null hypothesis is that the model does not have omitted-variable bias. Here the p-value is higher than the usual threshold of 0.05 (5% significance level), so we fail to reject the null and conclude that we do not need more variables.
Another command to test the model specification is linktest. It checks whether we need more variables in our model by running a new regression of the observed Y (csat) on Yhat (csat_predict, or Xβ) and Yhat-squared as independent variables.
The thing to look for here is the significance of _hatsq. The null hypothesis is that there is no specification error. If the p-value of _hatsq is not significant, then we fail to reject the null and conclude that our model is correctly specified. Type:
regress csat expense percent income high college i.region, robust
linktest
The null hypothesis is that there is no specification error. The p-value of _hatsq is higher than the usual threshold of 0.05 (5% significance level), so we fail to reject the null and conclude that the model is correctly specified.
We use the avplot command (added-variable plot) to check for outliers: data points with extreme values that could have a negative effect on our estimators. In Stata, type:
avplot percent
avplot expense
In each added-variable plot, the outcome and the listed predictor are both adjusted for all the other predictors before being plotted against each other; the slope shown equals that variable's regression coefficient. All data points seem to be in range, and no outliers are observed.
To check for outliers across all variables of the model, type avplots after the regression:
avplots
Another assumption of the regression model (OLS) that affects the validity of all tests (p, t, and F) is that residuals are normally distributed. Residuals (here indicated by the letter "e") are the difference between the observed values (Y) and the predicted values (Yhat): e = Y - Yhat.
In Stata, after running the regression, type: predict e, resid. This generates a variable called "e" (the residuals).
Three graphs will help us check for normality in the residuals: kdensity, pnorm, and qnorm.
kdensity e, normal
A kernel density plot produces a kind of histogram of the residuals; the option normal overlays a normal distribution for comparison. Here the residuals seem to follow a normal distribution. Below is an example using histogram:
histogram e, kdensity normal
If residuals do not follow a 'normal' pattern, you should check for omitted variables, model specification, linearity, and functional form. In sum, you may need to reassess your model/theory. In practice, non-normality is not much of a problem with large samples.
- The standardized normal probability plot (pnorm) checks for non-normality in the middle range of the residuals.
pnorm e
The plot is slightly off the line but looks ok.
- Quantile-normal plots (qnorm) check for non-normality in the extremes of the data (the tails). They plot the quantiles of the residuals against the quantiles of a normal distribution.
qnorm e
Tails are a bit off the normal.
- A non-graphical test is the Shapiro-Wilk test for normality. The null hypothesis is that the distribution of the residuals is normal. Type:
swilk e
Here the p-value is 0.17, so we fail to reject the null and conclude that the residuals are normally distributed.
To test whether two coefficients are jointly different from 0, use the command test (see Hamilton, 2006, p.175).
To test the null hypothesis that neither coefficient has an effect on csat (βhigh = 0 and βcollege = 0), type:
test high college
We will get:
The p-value is 0.0003; we reject the null and conclude that the two variables jointly have a significant effect on SAT scores.
The following are general guidelines for building a regression model suggested by Gelman and Hill (2007):
If you have questions or comments about this guide or method, please email data@Princeton.edu.