Panel data (also known as longitudinal or cross-sectional time-series data) is a dataset in which the behavior of each individual or entity (e.g., country, state, company, industry) is observed at multiple points in time.
Example:
In this example panel dataset, we have data for the variables y, x1, x2, and x3 for each entity (here the countries Angola, Brazil, and China) at multiple points in time (here the years 2018, 2019, and 2020).
When all entities are observed across all times, we call it a balanced panel.
When some entities are not observed in some years, we call it an unbalanced panel.
The increasing availability of data observed on cross-sections of units (like households, firms, countries etc.) and over time has given rise to a number of estimation approaches exploiting this double dimensionality to cope with some of the typical problems associated with economic data.
Panel data enables us to control for individual heterogeneity. That is, panel data allows us to control for variables we cannot observe or measure, such as cultural factors or differences in business practices across companies, as well as variables that change over time but not across entities (e.g., national policies, federal regulations, international agreements).
With panel data, you can include variables at different levels of analysis (e.g., students, schools, districts, states), which makes the data suitable for multilevel or hierarchical modeling.
Some drawbacks are data collection issues (e.g., sampling design, coverage), non-response in the case of micro panels, and cross-country dependency in the case of macro panels (i.e., correlation between countries).
Note: For a comprehensive list of advantages and disadvantages of panel data and examples explaining this, see Baltagi, Econometric Analysis of Panel Data (chapter 1).
When we work with panel data in Stata, we need to declare that we have a panel dataset.
To load the data, type the following command in the Stata Command window:
use https://dss.princeton.edu/training/Panel101_new.dta
To declare the data as a panel, type:
xtset country year
Stata will give us the following message:
The term “(strongly balanced)” means that all countries have data for all years. If, for example, a country is missing data for one or more years, then the panel is unbalanced. Ideally, you would want a balanced dataset, but this is not always possible. You can still estimate the models in this guide on an unbalanced panel.
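One way to inspect the panel structure beyond the xtset message is sketched below. It assumes the data are already loaded and declared with xtset country year; xtdescribe and xtsum are standard Stata commands, but the guide itself does not use them, so treat this as an optional check rather than a required step.

```stata
* Assumes the panel has been declared with: xtset country year
* Show which entities are observed in which years (participation pattern)
xtdescribe
* Split the variation in each variable into between-entity and within-entity parts
xtsum y x1 x2
```

In a balanced panel, xtdescribe should report a single participation pattern shared by every entity.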
NOTE: If you get the following error after using xtset:
You need to convert ‘country’ to numeric. To do this, type:
encode country, gen(country1)
Now you have to use ‘country1’ instead of ‘country’ for xtset declaration. That means you have to type:
xtset country1 year
This guide discusses two basic methods we commonly use to analyze panel data: fixed effects (FE) and random effects (RE).
Fixed Effects (FE) Model
Concept
When using FE, we assume that time-invariant characteristics of an individual or entity may affect or bias the predictor or outcome variables, and we need to control for this. This is the rationale behind allowing the entity’s error term to be correlated with the predictor variables. The FE model removes the effect of those time-invariant characteristics, so we can assess the net effect of the predictors on the outcome variable.
The FE regression model has n different intercepts, one for each entity. These intercepts can be represented by a set of binary variables, and these binary variables absorb the influences of all omitted variables that differ from one entity to the next but are constant over time.
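In equation form, the model described above can be sketched as follows. The notation is an assumption introduced here for illustration and does not come from the guide: i indexes entities, t indexes time periods, \alpha_i is the entity-specific intercept, and D_{j,i} is a binary variable equal to 1 when i = j and 0 otherwise.

```latex
% Fixed effects model with one intercept per entity
y_{it} = \alpha_i + \beta_1 x_{1,it} + \beta_2 x_{2,it} + u_{it},
\qquad i = 1, \dots, n, \quad t = 1, \dots, T

% Equivalent dummy variable form: a common intercept plus n - 1 entity dummies
y_{it} = \beta_0 + \beta_1 x_{1,it} + \beta_2 x_{2,it}
       + \sum_{j=2}^{n} \gamma_j D_{j,i} + u_{it},
\qquad \alpha_1 = \beta_0, \quad \alpha_j = \beta_0 + \gamma_j \ \text{for } j \ge 2
```

Any variable that is constant over time within an entity is perfectly collinear with these dummies, which is why its effect cannot be separated from \alpha_i.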
Estimation
This guide discusses two different ways to estimate fixed effects models: (i) the within estimator and (ii) the dummy variable estimator.
(i) Within Estimator
This is the more commonly used estimator for fixed effects models. This estimator is called the "within estimator", as it uses time variation within each cross-section.
- Use the following dataset (ignore this step if you have already opened the dataset in the previous section)
use https://dss.princeton.edu/training/Panel101_new.dta, clear
- Declare the dataset as a panel using xtset (ignore this step if you have already declared the dataset as a panel)
xtset country year
- Use the following command to estimate your fixed effects model
xtreg y x1 x2, fe
Note: using the fe option indicates we estimate a fixed effects model.
Stata will give us the following results:
The coefficient of x1 indicates how much y changes over time, on average per country, when x1 increases by one unit, holding all other variables constant.
The first highlighted p-value indicates whether x1 has a statistically significant effect on the dependent variable (y). As the p-value is < 0.10, the coefficient for x1 is significant at the 10% level.
The second highlighted p-value (for the F test) indicates whether the coefficients in the model are jointly different from zero. As the p-value is < 0.01, the model is statistically significant at the 1% level.
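To see what "time variation within each cross-section" means in practice, the within transformation can be reproduced by hand. This is a minimal sketch, not the way xtreg is implemented: the variable names mean_* and dm_* are hypothetical helpers introduced here, and country is assumed to be the entity identifier.

```stata
use https://dss.princeton.edu/training/Panel101_new.dta, clear
* Subtract each country's own mean from y, x1, and x2
foreach v of varlist y x1 x2 {
    bysort country: egen double mean_`v' = mean(`v')
    generate double dm_`v' = `v' - mean_`v'
}
* OLS on the demeaned data gives the same slope coefficients as xtreg, fe.
* The standard errors differ because this regression does not account for
* the degrees of freedom used up by the estimated country means.
regress dm_y dm_x1 dm_x2, noconstant
```

Because only deviations from each country's mean remain, anything constant within a country drops out of the regression.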
(ii) Dummy Variable Regression
When there are only a small number of fixed effects to estimate, it is convenient to run a dummy variable regression for the FE model.
- Use the following dataset (ignore this step if you have already opened the dataset for the previous section)
use https://dss.princeton.edu/training/Panel101_new.dta, clear
- Declare the dataset as a panel using xtset (ignore this step if you have already declared the dataset as a panel)
xtset country year
- Use the following command to estimate your fixed effects model
reg y x1 x2 i.country
Stata will give us the following results:
Notice that the estimated coefficients for x1 and x2 are the same for both the "Within Estimator" method and the "Dummy Variable Regression" method.
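Two related commands may be useful alongside the dummy variable regression. The sketch below is an addition to the guide's workflow, under the assumption that country is a numeric identifier: testparm tests whether the country dummies are jointly zero, and areg fits the same model while absorbing the dummies instead of listing them.

```stata
use https://dss.princeton.edu/training/Panel101_new.dta, clear
* Dummy variable regression, as in the guide
regress y x1 x2 i.country
* Joint F test that all country dummies are equal to zero
testparm i.country
* Same slope coefficients, with the country dummies absorbed rather than reported
areg y x1 x2, absorb(country)
```

areg is mainly a convenience when the number of entities is large enough that a long table of dummy coefficients becomes unwieldy.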
Notes:
- Including a lagged dependent variable as a regressor in a fixed effects model can introduce bias, a problem often referred to as the "Nickell bias" or "dynamic panel bias." This bias arises because, once the entity-specific effects are removed by the within transformation, the transformed lagged dependent variable is correlated with the transformed error term, violating the assumption of strict exogeneity required for consistent estimation of fixed effects models. The problem is most serious when the number of time periods is small. In this case, dynamic panel data estimators such as the Arellano-Bond generalized method of moments (GMM) estimator can provide consistent estimates.
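The contrast described in the note can be sketched with Stata's built-in xtabond command. This is purely illustrative and has not been checked against the example dataset: whether that dataset has enough entities and time periods for a GMM estimator to behave well is a separate question, and the choice of one lag and robust standard errors is an assumption made here.

```stata
use https://dss.princeton.edu/training/Panel101_new.dta, clear
xtset country year
* Fixed effects with a lagged dependent variable: subject to Nickell bias
xtreg y L.y x1 x2, fe
* Arellano-Bond difference GMM with one lag of y included as a regressor
xtabond y x1 x2, lags(1) vce(robust)
```

xtabond adds the lag of y itself, so L.y is not listed among the regressors in the second command.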
Random Effects (RE) Model
Concept
If individual or entity-specific effects are strictly uncorrelated with the regressors, it may be appropriate to model the individual or entity-specific constant terms as randomly distributed across cross-sectional units. This view would be appropriate if we believe that sampled cross-sectional units were drawn from a large population.
An advantage of using the random effects method is that you can include time-invariant variables (e.g., geographical contiguity, distance between states) in your model. In the fixed effects model, these variables are absorbed by the entity-specific intercepts, so their coefficients cannot be estimated.
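The random effects model described above can be written out as follows. The notation is assumed for illustration and is not taken from the guide: u_i is the random entity-specific effect, \varepsilon_{it} is the idiosyncratic error, and z_i is a hypothetical time-invariant regressor.

```latex
% Random effects model: the entity effect u_i is part of the composite error
y_{it} = \beta_0 + \beta_1 x_{1,it} + \beta_2 x_{2,it} + \gamma z_i + u_i + \varepsilon_{it}

% Key assumption: the entity effect is uncorrelated with every regressor
\operatorname{Cov}(u_i, x_{k,it}) = 0 \ \text{for } k = 1, 2,
\qquad \operatorname{Cov}(u_i, z_i) = 0
```

Because u_i is treated as part of the error rather than as a separate intercept for each entity, the coefficient \gamma on z_i remains estimable.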
Estimation
- Use the following dataset
use https://dss.princeton.edu/training/Panel101_new.dta, clear
- Declare the dataset as a panel using xtset
xtset country year
- Use the following command to estimate your random effects model
xtreg y x1 x2, re
Note: the use of re option indicates that we are estimating a random effects model.
Stata will give us the following results:
The coefficient of x1 indicates the average effect of x1 on y when x1 changes by one unit, over time and between countries, holding all other variables constant. Unlike the FE coefficient, it reflects both within-country and between-country variation.
The highlighted p-value indicates whether x1 has a statistically significant effect on the dependent variable (y). As the p-value is not < 0.10, the coefficient for x1 is not significant at the 10% level.
Hausman Test
- Use the Hausman test to decide whether to use a fixed effects or random effects model.
- Procedures:
- Run a fixed effects model and save the estimates
- Run a random effects model and save the estimates
- Perform the Hausman test
- Use the following Stata commands
xtreg y x1 x2, fe
estimates store fixed
xtreg y x1 x2, re
estimates store random
hausman fixed random
Stata will give us the following results:
...
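Decision rule: the null hypothesis is that the entity-specific effects are uncorrelated with the regressors, in which case the random effects model is preferred. If the Prob > chi2 value (p-value) is < 0.05, reject the null hypothesis and use a fixed effects model; otherwise, use a random effects model. In this case, the p-value is not below 0.05, so we should use a random effects model.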
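In a do-file, the decision rule can be applied programmatically by reading the results that hausman leaves in r(). The sketch below assumes the 0.05 threshold from the rule above; the messages are illustrative wording, not Stata output.

```stata
use https://dss.princeton.edu/training/Panel101_new.dta, clear
xtset country year
quietly xtreg y x1 x2, fe
estimates store fixed
quietly xtreg y x1 x2, re
estimates store random
hausman fixed random
* hausman stores the test statistic in r(chi2) and the p-value in r(p)
display "chi2 = " r(chi2) "   p-value = " r(p)
if missing(r(p)) {
    * Guard: a missing p-value would otherwise compare as larger than 0.05
    display "p-value is missing; the test result is not usable as is"
}
else if r(p) < 0.05 {
    display "Reject H0: use the fixed effects model"
}
else {
    display "Fail to reject H0: use the random effects model"
}
```

The missing-value guard matters because Stata treats a missing value as larger than any number, so a failed test would otherwise fall through to the random effects branch.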
DSS Data Analysis Guides https://library.princeton.edu/dss/training
Princeton DSS Libguides https://libguides.princeton.edu/dss
Stata Resources https://www.stata.com/features/overview/linear-fixed-and-random-effects-models/
Angrist, J. D., & Pischke, J. S. (2009). Mostly harmless econometrics: An empiricist's companion. Princeton University Press.
Baltagi, B. (2021). Econometric analysis of panel data (6th ed.). Springer.
Bartels, B. (2008). Beyond "fixed versus random effects": a framework for improving substantive and statistical analysis of panel, time-series cross-sectional, and multilevel data. The Society for Political Methodology, 9, 1-43. Available at: https://home.gwu.edu/~bartels/cluster.pdf
Baum, C. F. (2006). An introduction to modern econometrics using Stata. Stata Press.
Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.
Greene, W. H. (2018). Econometric analysis (8th ed.). Pearson.
Hamilton, L. C. (2012). Statistics with Stata: version 12. Cengage Learning.
Hoechle, D. (2007). Robust standard errors for panel regressions with cross-sectional dependence. The Stata Journal, 7(3), 281-312. Available at: https://journals.sagepub.com/doi/pdf/10.1177/1536867X0700700301
Kohler, U., & Kreuter, F. (2012). Data analysis using Stata (3rd ed.). Stata Press.
Stock, J. H., & Watson, M. W. (2019). Introduction to econometrics (4th ed.). Pearson.
Wooldridge, J. M. (2010). Econometric analysis of cross section and panel data. MIT Press.
Wooldridge, J. M. (2020). Introductory econometrics: a modern approach (7th ed.). Cengage Learning.
If you have questions or comments about this guide or method, please email data@Princeton.edu.