Panel data (also known as longitudinal or cross-sectional time-series data) refers to data for n different entities at different time periods. These entities could be states, companies, individuals, countries, etc.
The increasing availability of data observed on cross-sections of units (like households, firms, countries etc.) and over time has given rise to a number of estimation approaches exploiting this double dimensionality to cope with some of the typical problems associated with economic data.
Panel data enables us to control for individual heterogeneity. That is, panel data allows us to control for variables you cannot observe or measure, like cultural factors or differences in business practices across companies, or for variables that change over time but not across entities (e.g. national policies, federal regulations, international agreements, etc.).
With panel data you can include variables at different levels of analysis (e.g. students, schools, districts, states), making it suitable for multilevel or hierarchical modeling.
Some drawbacks are data collection issues (e.g. sampling design, coverage), non-response in the case of micro panels, and cross-country dependency in the case of macro panels (i.e. correlation between countries).
Note: For a comprehensive list of advantages and disadvantages of panel data and examples explaining this, see Baltagi, Econometric Analysis of Panel Data (chapter 1).
In this guide we focus on two common techniques used to analyze panel data:
The fixed effects (FE) model assumes that the omitted effects can be arbitrarily correlated with the included variables. This is useful whenever you are only interested in analyzing the impact of variables that vary over time.
FE models explore the relationship between predictor and outcome variables within an entity (country, person, company, etc.). Each entity has its own individual characteristics that may or may not influence the predictor variables (for example, being male or female could influence the opinion toward a certain issue; the political system of a particular country could have some effect on trade or GDP; or the business practices of a company may influence its stock price).
When using FE we assume that something within the individual may impact or bias the predictor or outcome variables, and we need to control for this. This is the rationale behind the assumption of correlation between the entity's error term and the predictor variables. FE removes the effect of those time-invariant characteristics so we can assess the net effect of the predictors on the outcome variable.
The FE regression model has n different intercepts, one for each entity. These intercepts can be represented by a set of binary variables, and these binary variables absorb the influences of all omitted variables that differ from one entity to the next but are constant over time.
Another important assumption of the FE model is that those time-invariant characteristics are unique to the individual and should not be correlated with other individual characteristics. Each entity is different, therefore the entity's error term and the constant (which captures individual characteristics) should not be correlated with those of the others. If the error terms are correlated, then FE is not suitable, since inferences may not be correct, and you need to model that relationship (probably using random effects).
If the individual effects are strictly uncorrelated with the regressors it may be appropriate to model the individual specific constant terms as randomly distributed across cross-sectional units. This view would be appropriate if we believed that sampled cross-sectional units were drawn from a large population.
If you have reason to believe that differences across entities have some influence on your dependent variable then you should use random effects. In random-effects you need to specify those individual characteristics that may or may not influence the predictor variables. The problem with this is that some variables may not be available therefore leading to omitted variable bias in the model.
An advantage of random effects is that you can include time-invariant variables (e.g. gender). In the fixed effects model these variables are absorbed by the intercept. The cost is the possibility of inconsistent estimators if the assumption is inappropriate.
How to choose between fixed-effects model and random-effects model?
To decide between fixed and random effects you can run a Hausman test, where the null hypothesis is that the preferred model is random effects and the alternative is fixed effects. It essentially tests whether the unique errors are correlated with the regressors; the null hypothesis is that they are not.
phtest computes the Hausman test, which is based on the comparison of two sets of estimates: run a fixed effects model and save the estimates, run a random effects model and save the estimates, then perform the test.
The code example:

# We pull the data first
library(foreign)
Panel <- read.dta("http://dss.princeton.edu/training/Panel101.dta")

library(plm)
fixed <- plm(y ~ x1, data=Panel, index=c("country", "year"), model="within")    # fixed effects model
random <- plm(y ~ x1, data=Panel, index=c("country", "year"), model="random")   # random effects model
phtest(fixed, random)   # Hausman test
The result is below.
	Hausman Test

data:  y ~ x1
chisq = 3.674, df = 1, p-value = 0.05527
alternative hypothesis: one model is inconsistent
If the p-value is significant (for example < 0.05), use fixed effects; if not, use random effects. In this case the p-value is only slightly larger than 0.05, so it may still be better to use the fixed effects model.
Let's see how to run the fixed-effects model using plm.
library(plm)
fixed <- plm(y ~ x1, data=Panel, index=c("country", "year"), model="within")
summary(fixed)
We use index to specify which variables identify the entity and the time period, i.e. the panel setting. This step is not necessary every time: if no index is given, plm assumes the first two columns of the data contain the individual and time indexes.
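One way to avoid repeating the index argument is to declare the panel structure once with pdata.frame. A sketch, assuming the Panel data loaded above:

```r
library(plm)

# Declare the panel structure once; 'country' and 'year' are the index
# columns of the Panel data loaded earlier.
Panel.p <- pdata.frame(Panel, index = c("country", "year"))

# Subsequent plm() calls no longer need the index argument.
fixed <- plm(y ~ x1, data = Panel.p, model = "within")
```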
We use model="within" to specify that we are estimating a fixed effects model.
In fact, several models can be estimated with plm by setting the model argument.
For more details concerning other models you can check “Panel Data Econometrics in R: the plm package”:https://www.jstatsoft.org/article/view/v027i02
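For instance, besides "within" and "random", the model argument also accepts "pooling" (pooled OLS), "between" (between estimator), and "fd" (first differences). A sketch, assuming the same Panel data and variables as above:

```r
library(plm)

# Different estimators from the same formula, selected via the model argument
pooling <- plm(y ~ x1, data = Panel, index = c("country", "year"), model = "pooling")  # pooled OLS
between <- plm(y ~ x1, data = Panel, index = c("country", "year"), model = "between")  # between estimator
fd      <- plm(y ~ x1, data = Panel, index = c("country", "year"), model = "fd")       # first differences
```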
Now let's look at the output.
Oneway (individual) effect Within Model

Call:
plm(formula = y ~ x1, data = Panel, model = "within", index = c("country", "year"))

Balanced Panel: n = 7, T = 10, N = 70
(n = # of groups/panels, T = # years, N = total # of observations)

Residuals:
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
 -8.63e+09  -9.70e+08   5.40e+08   0.00e+00   1.39e+09   5.61e+09

Coefficients:
     Estimate Std. Error t-value Pr(>|t|)
x1 2475617827 1106675594   2.237  0.02889 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Total Sum of Squares:    5.2364e+20
Residual Sum of Squares: 4.8454e+20
R-Squared:      0.074684
Adj. R-Squared: -0.029788
F-statistic: 5.00411 on 1 and 62 DF, p-value: 0.028892
The coefficient of x1 indicates how much Y changes over time, on average per country, when X increases by one unit.
The first p-value indicates whether the variable has a significant influence on your dependent variable (y): if the p-value is smaller than 0.05, it does. In this case it is 0.02889, meaning x1 is a significant variable. The second p-value (for the F-statistic) indicates whether the model as a whole is significant. Since this model has only one variable, the two p-values are exactly the same.
We can display the fixed effects (the constants for each country) with fixef:

fixef(fixed)

          A           B           C           D           E           F 
  880542404 -1057858363 -1722810755  3162826897  -602622000  2010731793 
          G 
 -984717493 
Let's see how to run the random-effects model using plm.
random <- plm(y ~ x1, data=Panel, index=c("country", "year"), model="random")
summary(random)
And the results.
Oneway (individual) effect Random Effect Model
   (Swamy-Arora's transformation)

Call:
plm(formula = y ~ x1, data = Panel, model = "random", index = c("country", "year"))

Balanced Panel: n = 7, T = 10, N = 70
(n = # of groups/panels, T = # years, N = total # of observations)

Effects:
                    var   std.dev share
idiosyncratic 7.815e+18 2.796e+09 0.873
individual    1.133e+18 1.065e+09 0.127
theta: 0.3611

Residuals:
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
 -8.94e+09  -1.51e+09   2.82e+08   0.00e+00   1.56e+09   6.63e+09

Coefficients:
              Estimate Std. Error z-value Pr(>|z|)
(Intercept) 1037014284  790626206  1.3116   0.1896
x1          1247001782  902145601  1.3823   0.1669

Total Sum of Squares:    5.6595e+20
Residual Sum of Squares: 5.5048e+20
R-Squared:      0.02733
Adj. R-Squared: 0.013026
Chisq: 1.91065 on 1 DF, p-value: 0.16689
Interpretation of the coefficients is tricky, since they include both the within-entity and between-entity effects. In the case of TSCS data, the coefficient represents the average effect of X on Y when X changes across time and between countries by one unit.
Pr(>|z|): two-tail p-values test the null hypothesis that each coefficient is equal to 0. To reject this, the p-value has to be lower than 0.05 (95% confidence; you could also choose an alpha of 0.10). If this is the case, you can say that the variable has a significant influence on your dependent variable (y).
The p-value of the overall Chisq statistic indicates whether the model as a whole is significant. Since this model has only one variable, it is exactly the same as the p-value of x1.
Simple OLS regression
First, instead of using a fixed effects model, let's see what happens if we ignore the panel structure and run a simple OLS regression.
ols <- lm(y ~ x1, data=Panel)
summary(ols)
The summary gives
Call:
lm(formula = y ~ x1, data = Panel)

Residuals:
       Min         1Q     Median         3Q        Max
-9.546e+09 -1.578e+09  1.554e+08  1.422e+09  7.183e+09

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.524e+09  6.211e+08   2.454   0.0167 *
x1          4.950e+08  7.789e+08   0.636   0.5272
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.028e+09 on 68 degrees of freedom
Multiple R-squared: 0.005905,	Adjusted R-squared: -0.008714
F-statistic: 0.4039 on 1 and 68 DF, p-value: 0.5272
We can see that x1 has a really high p-value (0.5272), which means it is not significant in this model.
After constructing the OLS model, we can also run a pFtest to see which model is the better fit. The null hypothesis of the pFtest is that OLS is better than fixed effects.
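Using the fixed and ols objects estimated above, the test is run as:

```r
# H0: pooled OLS is adequate; H1: the fixed (individual) effects are significant
pFtest(fixed, ols)
```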
	F test for individual effects

data:  y ~ x1
F = 2.9655, df1 = 6, df2 = 62, p-value = 0.01307
alternative hypothesis: significant effects
The p-value is really small, so we reject the null hypothesis, which means a fixed effects model would be a better fit.
Let's introduce another way of estimating fixed effects without using plm: the least squares dummy variable (LSDV) model.
Fixed effects using Least squares dummy variable model
fixed.dum <- lm(y ~ x1 + factor(country) - 1, data=Panel)
summary(fixed.dum)
And the summary shows
Call:
lm(formula = y ~ x1 + factor(country) - 1, data = Panel)

Residuals:
       Min         1Q     Median         3Q        Max
-8.634e+09 -9.697e+08  5.405e+08  1.386e+09  5.612e+09

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
x1                2.476e+09  1.107e+09   2.237  0.02889 *
factor(country)A  8.805e+08  9.618e+08   0.916  0.36347
factor(country)B -1.058e+09  1.051e+09  -1.006  0.31811
factor(country)C -1.723e+09  1.632e+09  -1.056  0.29508
factor(country)D  3.163e+09  9.095e+08   3.478  0.00093 ***
factor(country)E -6.026e+08  1.064e+09  -0.566  0.57329
factor(country)F  2.011e+09  1.123e+09   1.791  0.07821 .
factor(country)G -9.847e+08  1.493e+09  -0.660  0.51190
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.796e+09 on 62 degrees of freedom
Multiple R-squared: 0.4402,	Adjusted R-squared: 0.368
F-statistic: 6.095 on 8 and 62 DF, p-value: 8.892e-06
We can see that the estimate for x1 is the same as in the fixed effects model estimated with plm. The constants for each country are shown in the result as well.
Test for time-fixed effects.
We don't always want time effects in a model; sometimes we are interested in other factors rather than the time effect. We can use tests to see whether time-fixed effects are needed.
First we construct a time-fixed effects model.
fixed.time <- plm(y ~ x1 + factor(year), data=Panel, index=c("country","year"), model="within")
summary(fixed.time)
Oneway (individual) effect Within Model

Call:
plm(formula = y ~ x1 + factor(year), data = Panel, model = "within", index = c("country", "year"))

Balanced Panel: n = 7, T = 10, N = 70

Residuals:
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
 -7.92e+09  -1.05e+09  -1.40e+08   0.00e+00   1.63e+09   5.49e+09

Coefficients:
                   Estimate Std. Error t-value Pr(>|t|)
x1               1389050354 1319849567  1.0524  0.29738
factor(year)1991  296381559 1503368528  0.1971  0.84447
factor(year)1992  145369666 1547226548  0.0940  0.92550
factor(year)1993 2874386795 1503862554  1.9113  0.06138 .
factor(year)1994 2848156288 1661498927  1.7142  0.09233 .
factor(year)1995  973941306 1567245748  0.6214  0.53698
factor(year)1996 1672812557 1631539254  1.0253  0.30988
factor(year)1997 2991770063 1627062032  1.8388  0.07156 .
factor(year)1998  367463593 1587924445  0.2314  0.81789
factor(year)1999 1258751933 1512397632  0.8323  0.40898
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Total Sum of Squares:    5.2364e+20
Residual Sum of Squares: 4.0201e+20
R-Squared:      0.23229
Adj. R-Squared: 0.00052851
F-statistic: 1.60365 on 10 and 53 DF, p-value: 0.13113
After constructing a fixed-time effects model, we can run some tests to see whether this is needed.
We can run a pFtest, whose null hypothesis is that no time-fixed effects are needed.
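Using the models estimated above, the test compares the model with year dummies against the one without:

```r
# H0: no time-fixed effects are needed
pFtest(fixed.time, fixed)
```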
The output gives:
	F test for individual effects

data:  y ~ x1 + factor(year)
F = 1.209, df1 = 9, df2 = 53, p-value = 0.3094
alternative hypothesis: significant effects
If the p-value is small, we can reject the null hypothesis and should use time-fixed effects. In this case the p-value is 0.3094, so there is no need for time-fixed effects.
We can also run a Lagrange Multiplier test for time effects (Breusch-Pagan). The function is plmtest, and we specify "bp" as the type. The null hypothesis is the same: no time-fixed effects are needed.
plmtest(fixed, effect="time", type="bp")
The result gives:
	Lagrange Multiplier Test - time effects (Breusch-Pagan)

data:  y ~ x1
chisq = 0.16532, df = 1, p-value = 0.6843
alternative hypothesis: significant effects
Again, the p-value is large, so we do not need time-fixed effects in this model.
DSS Online Training Section https://dss.princeton.edu/training/
Princeton DSS Libguides https://libguides.princeton.edu/dss
John Fox's site https://socialsciences.mcmaster.ca/jfox/
UCLA Resources https://stats.oarc.ucla.edu/r/
Plm package: https://cran.r-project.org/web/packages/plm/plm.pdf
Angrist, J. D., & Pischke, J. S. (2009). Mostly harmless econometrics: An empiricist's companion. Princeton University Press.
Baltagi, B. (2021). Econometric analysis of panel data (6th ed.). Springer.
Bartels, B. (2008). "Beyond fixed versus random effects": a framework for improving substantive and statistical analysis of panel, time-series cross-sectional, and multilevel data. The Society for Political Methodology, 9, 1-43. Available at: https://home.gwu.edu/~bartels/cluster.pdf
Croissant, Y., & Millo, G. (2008). Panel data econometrics in R: The plm package. Journal of Statistical Software, 27(2). https://doi.org/10.18637/jss.v027.i02
Dalgaard, P. (2011). Introductory statistics with R. Lightning Source UK Ltd.
Fox, J., & Weisberg, S. (2011). An R companion to applied regression. SAGE.
Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.
Greene, W. H. (2018). Econometric analysis (8th ed.). Pearson.
Hoechle, D. (2007). Robust standard errors for panel regressions with cross-sectional dependence. The Stata Journal, 7(3), 281-312. Available at: https://journals.sagepub.com/doi/pdf/10.1177/1536867X0700700301
Kleiber, C., & Zeileis, A. (2008). Applied econometrics with R. Springer.
Lumley, T. (2010). Complex surveys: A guide to analysis using R. Wiley.
Spector, P. (2008). Data manipulation with R. Springer.
Stock, J. H., & Watson, M. W. (2019). Introduction to econometrics (4th ed.). Pearson.
Wooldridge, J. M. (2010). Econometric analysis of cross section and panel data. MIT Press.
Wooldridge, J. M. (2020). Introductory econometrics: A modern approach (7th ed.). Cengage Learning.