Panel Data Using R: Fixed-effects and Random-effects


Panel Data

Panel data (also known as longitudinal or cross-sectional time-series data) refers to data on n different entities observed at two or more time periods. These entities could be states, companies, individuals, countries, etc.

Example of Panel data:

```
Country  year    Y    X1   X2   X3
Cat      2020   9.6   3.5  1.9  4.9
Cat      2021   8.8   3.6  4.3  9.1
Cat      2022   5.8   2.5  1.4  1.8
Dog      2020   1.0   2.5  9.0  1.4
Dog      2021   7.0   2.3  1.2  2.5
Dog      2022   2.5   3.1  3.0  3.2
Rabbit   2020   2.5   5.0  3.2  2.3
Rabbit   2021   0.3   8.1  2.5  7.4
Rabbit   2022   2.5   9.9  5.0  5.2
```
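For experimenting with the commands in this guide, a small panel like the one above can be built directly in base R (the object name Panel_ex is illustrative, not part of the guide's dataset):

```r
# Recreate the example panel: 3 entities observed over 3 years
Panel_ex <- data.frame(
  country = rep(c("Cat", "Dog", "Rabbit"), each = 3),
  year    = rep(2020:2022, times = 3),
  y  = c(9.6, 8.8, 5.8, 1.0, 7.0, 2.5, 2.5, 0.3, 2.5),
  x1 = c(3.5, 3.6, 2.5, 2.5, 2.3, 3.1, 5.0, 8.1, 9.9),
  x2 = c(1.9, 4.3, 1.4, 9.0, 1.2, 3.0, 3.2, 2.5, 5.0),
  x3 = c(4.9, 9.1, 1.8, 1.4, 2.5, 3.2, 2.3, 7.4, 5.2)
)
head(Panel_ex)
```

Each row is one entity-year observation; the country and year columns together identify the panel structure.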

Panel Properties

The increasing availability of data observed on cross-sections of units (like households, firms, countries etc.) and over time has given rise to a number of estimation approaches exploiting this double dimensionality to cope with some of the typical problems associated with economic data.

Panel data enables us to control for individual heterogeneity. That is, panel data allows us to control for variables we cannot observe or measure, like cultural factors or differences in business practices across companies, as well as variables that change over time but not across entities (e.g. national policies, federal regulations, international agreements).

With panel data you can include variables at different levels of analysis (e.g. students, schools, districts, states), making it suitable for multilevel or hierarchical modeling.

Some drawbacks are data collection issues (e.g. sampling design, coverage), non-response in the case of micro panels, and cross-country dependency in the case of macro panels (i.e. correlation between countries).

Note: For a comprehensive list of advantages and disadvantages of panel data and examples explaining this, see Baltagi, Econometric Analysis of Panel Data (chapter 1).

Fixed-effects or Random-effects

In this guide we focus on two common techniques used to analyze panel data:

• Fixed effects
• Random effects

Fixed effects

The fixed effects model assumes that the omitted entity-specific effects may be arbitrarily correlated with the included regressors. This is useful whenever you are interested only in the impact of variables that vary over time.

FE explores the relationship between predictor and outcome variables within an entity (country, person, company, etc.). Each entity has its own individual characteristics that may or may not influence the predictor variables (for example, being male or female could influence the opinion toward a certain issue; the political system of a particular country could have some effect on trade or GDP; or the business practices of a company may influence its stock price).

When using FE we assume that something within the individual may impact or bias the predictor or outcome variables and we need to control for this. This is the rationale behind the assumption of the correlation between entity’s error term and predictor variables. FE remove the effect of those time-invariant characteristics so we can assess the net effect of the predictors on the outcome variable.

The FE regression model has n different intercepts, one for each entity. These intercepts can be represented by a set of binary variables, and these binary variables absorb the influences of all omitted variables that differ from one entity to the next but are constant over time.

Another important assumption of the FE model is that those time-invariant characteristics are unique to the individual and should not be correlated with other individual characteristics. Each entity is different, therefore the entity's error term and the constant (which captures individual characteristics) should not be correlated with those of the others. If the error terms are correlated, then FE is not suitable since inferences may not be correct, and you need to model that relationship (probably using random effects).
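The "removal of time-invariant characteristics" described above is the within (demeaning) transformation, which can be sketched in base R with simulated data (all names here are illustrative, not from the guide's dataset):

```r
set.seed(1)
id    <- rep(1:4, each = 5)                # 4 entities, 5 periods each
alpha <- rep(c(2, -1, 0.5, 3), each = 5)   # time-invariant entity effects
x1    <- rnorm(20)
y     <- alpha + 1.5 * x1 + rnorm(20, sd = 0.1)

# Within transformation: subtract each entity's own mean from y and x1
y_dm  <- y  - ave(y,  id)
x1_dm <- x1 - ave(x1, id)

# alpha is swept out by the demeaning; the regression recovers the slope
b_within <- coef(lm(y_dm ~ x1_dm - 1))
b_within   # close to the true value of 1.5
```

The entity effects alpha never need to be estimated: demeaning removes them, which is exactly why FE can only identify the effect of regressors that vary over time.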

Random effects

If the individual effects are strictly uncorrelated with the regressors it may be appropriate to model the individual specific constant terms as randomly distributed across cross-sectional units. This view would be appropriate if we believed that sampled cross-sectional units were drawn from a large population.

If you have reason to believe that differences across entities have some influence on your dependent variable then you should use random effects. In random-effects you need to specify those individual characteristics that may or may not influence the predictor variables. The problem with this is that some variables may not be available therefore leading to omitted variable bias in the model.

An advantage of random effects is that you can include time-invariant variables (e.g. gender). In the fixed effects model these variables are absorbed by the intercept. The cost is the possibility of inconsistent estimators if the assumption of no correlation is inappropriate.
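As a sketch of this point outside plm, a random-intercept model can also be fit with the nlme package, which ships with R. The data below are simulated and all names are illustrative; note how the time-invariant gender variable enters as a regressor, something the fixed effects model would absorb:

```r
library(nlme)   # shipped with R; one alternative for random-intercept models

set.seed(5)
d <- data.frame(
  id     = factor(rep(1:8, each = 5)),    # 8 entities, 5 periods each
  x1     = rnorm(40),
  gender = rep(c("m", "f"), each = 20)    # constant within each entity
)
d$y <- 1 + 0.5 * d$x1 + rep(rnorm(8), each = 5) + rnorm(40)

# Random intercept per entity; gender is estimable despite never varying
# within an entity
re <- lme(y ~ x1 + gender, random = ~ 1 | id, data = d)
fixef(re)
```

The random intercepts are treated as draws from a distribution rather than as parameters to estimate, which is the "drawn from a large population" view described above.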

How to choose between fixed-effects model and random-effects model?

To decide between fixed or random effects you can run a Hausman test, where the null hypothesis is that the preferred model is random effects, versus the alternative that it is fixed effects. It basically tests whether the unique errors are correlated with the regressors; the null hypothesis is that they are not.

phtest computes the Hausman test, which is based on the comparison of two sets of estimates. Run a fixed effects model and save the estimates, then run a random effects model and save its estimates, then perform the test.

The code example

```
# We pull the data first (a Stata file, read with foreign::read.dta)
library(foreign)
# Panel <- read.dta("path/to/panel_data.dta")  # placeholder: load the panel dataset used below

library(plm)
fixed  <- plm(y ~ x1, data=Panel, index=c("country", "year"), model="within")  # fixed effects model
random <- plm(y ~ x1, data=Panel, index=c("country", "year"), model="random")  # random effects model
phtest(fixed, random)  # Hausman test
```

The result is below.

```         Hausman Test

data:  y ~ x1
chisq = 3.674, df = 1, p-value = 0.05527
alternative hypothesis: one model is inconsistent
```

If the p-value is significant (for example, below 0.05) then use fixed effects; if not, use random effects. In this case the p-value (0.05527) is only slightly larger than 0.05, so it may still be safer to use the fixed effects model.

Fixed-effects

Let's see how to run the fixed-effects model using plm.

```library(plm)
fixed <- plm(y ~ x1, data=Panel, index=c("country", "year"), model="within")
summary(fixed)
```

We use index to specify the panel structure: the entity and time identifiers. This argument can be omitted when the data are already a pdata.frame, or when the first two columns of the data set are the entity and time indexes.

We use "within" to specify that we are estimating the fixed effects model.

In fact, several models can be estimated with plm by setting the model argument:

• the fixed effects model (within)
• the error components (random effects) model (random)
• the pooling model (pooling)
• the first-difference model (fd)
• the between model (between)

For more details concerning the other models, see "Panel Data Econometrics in R: The plm Package": https://www.jstatsoft.org/article/view/v027i02
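To give some intuition for what two of these options compute: the pooling model is simply OLS on the stacked observations, and the between model is OLS on the entity means. A base-R sketch with simulated data (all names illustrative):

```r
set.seed(2)
id <- rep(1:5, each = 4)   # 5 entities, 4 periods each
x1 <- rnorm(20)
y  <- 2 + 0.8 * x1 + rep(rnorm(5), each = 4) + rnorm(20, sd = 0.25)

# "pooling": ordinary OLS on the stacked observations, ignoring the panel
pool <- lm(y ~ x1)

# "between": OLS on the entity means of y and x1 (one row per entity)
y_bar  <- tapply(y,  id, mean)
x1_bar <- tapply(x1, id, mean)
betw   <- lm(y_bar ~ x1_bar)

coef(pool); coef(betw)
```

The between regression has only as many observations as entities, which is why it discards all within-entity variation.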

Now let's look at the output.

```Oneway (individual) effect Within Model

Call:
plm(formula = y ~ x1, data = Panel, model = "within", index = c("country",
"year"))

Balanced Panel: n = 7, T = 10, N = 70 (n = # of groups/panels, T = # years, N = total # of observations)

Residuals:
Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
-8.63e+09 -9.70e+08  5.40e+08  0.00e+00  1.39e+09  5.61e+09

Coefficients:
Estimate Std. Error t-value Pr(>|t|)
x1 2475617827 1106675594   2.237  0.02889 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Total Sum of Squares:    5.2364e+20
Residual Sum of Squares: 4.8454e+20
R-Squared:      0.074684
F-statistic: 5.00411 on 1 and 62 DF, p-value: 0.028892
```

The coefficient of x1 indicates how much y changes over time, on average per country, when x1 increases by one unit.

The p-value on the coefficient indicates whether the variable has a significant influence on your dependent variable (y): a value below 0.05 indicates significance. Here it is 0.02889, so x1 is significant. The p-value at the bottom is for the overall F-test of the model; since this model has only one regressor, the two p-values are identical.

We can display the fixed effects (the constants for each country) with fixef:

```
fixef(fixed)
          A           B           C           D           E           F
  880542404 -1057858363 -1722810755  3162826897  -602622000  2010731793
          G
 -984717493
```

Random-effects

Let's see how to run the random-effects model using plm.

```random <- plm(y ~ x1, data=Panel, index=c("country", "year"), model="random")
summary(random)
```

And the results.

```Oneway (individual) effect Random Effect Model
(Swamy-Arora's transformation)

Call:
plm(formula = y ~ x1, data = Panel, model = "random", index = c("country",
"year"))

Balanced Panel: n = 7, T = 10, N = 70 (n = # of groups/panels, T = # years, N = total # of observations)

Effects:
var   std.dev share
idiosyncratic 7.815e+18 2.796e+09 0.873
individual    1.133e+18 1.065e+09 0.127
theta: 0.3611

Residuals:
Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
-8.94e+09 -1.51e+09  2.82e+08  0.00e+00  1.56e+09  6.63e+09

Coefficients:
Estimate Std. Error z-value Pr(>|z|)
(Intercept) 1037014284  790626206  1.3116   0.1896
x1          1247001782  902145601  1.3823   0.1669

Total Sum of Squares:    5.6595e+20
Residual Sum of Squares: 5.5048e+20
R-Squared:      0.02733
Chisq: 1.91065 on 1 DF, p-value: 0.16689
```

Interpretation of the coefficients is tricky, since they include both the within-entity and between-entity effects. In the case of time-series cross-sectional (TSCS) data, the coefficient represents the average effect of x1 on y when x1 changes across time and between countries by one unit.

Pr(>|t|): the two-tail p-values test the hypothesis that each coefficient is different from 0. To reject this, the p-value has to be lower than 0.05 (95% confidence; you could also choose an alpha of 0.10). If this is the case, then you can say that the variable has a significant influence on your dependent variable (y).

The p-value at the bottom (the chi-squared test) indicates whether the model as a whole is significant. Since this model has only one regressor, it matches the coefficient's p-value.

Comparison with simple OLS and another method for fixed-effects

Simple OLS regression

First, let's see what happens if, instead of a fixed effects model, we use a simple OLS regression.

```
ols <- lm(y ~ x1, data=Panel)
summary(ols)
```

The summary gives

```Call:
lm(formula = y ~ x1, data = Panel)

Residuals:
Min         1Q     Median         3Q        Max
-9.546e+09 -1.578e+09  1.554e+08  1.422e+09  7.183e+09

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.524e+09  6.211e+08   2.454   0.0167 *
x1          4.950e+08  7.789e+08   0.636   0.5272
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.028e+09 on 68 degrees of freedom
Multiple R-squared:  0.005905,	Adjusted R-squared:  -0.008714
F-statistic: 0.4039 on 1 and 68 DF,  p-value: 0.5272
```

We can see that the coefficient on x1 has a very high p-value (0.5272), so x1 is not significant in the pooled OLS model.

After constructing an OLS model, we can also run a pFtest to see which is the better-fitting model.

```pFtest(fixed, ols)
```

The null hypothesis of the pFtest is that OLS is adequate, i.e. that there are no significant individual effects.

```F test for individual effects

data:  y ~ x1
F = 2.9655, df1 = 6, df2 = 62, p-value = 0.01307
alternative hypothesis: significant effects
```

The p-value is small (0.01307), so we reject the null hypothesis: a fixed effects model is the better fit.
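The same comparison can be sketched in base R with simulated data: fitting the entity dummies explicitly and testing them jointly with anova mirrors what pFtest does (all names here are illustrative):

```r
set.seed(3)
id <- factor(rep(1:6, each = 8))   # 6 entities, 8 periods each
x1 <- rnorm(48)
y  <- rep(rnorm(6, sd = 2), each = 8) + 1.2 * x1 + rnorm(48)

pooled <- lm(y ~ x1)        # pooled OLS: no entity effects
lsdv   <- lm(y ~ x1 + id)   # entity dummies (fixed effects)

# Joint F test of the 5 entity dummies: do individual effects matter?
tab <- anova(pooled, lsdv)
tab
```

A small p-value in this F test says the entity dummies jointly matter, i.e. pooled OLS is not adequate.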

Let's introduce another way of estimating fixed effects without using plm: the least squares dummy variable (LSDV) model.

Fixed effects using Least squares dummy variable model

```fixed.dum <-lm(y ~ x1 + factor(country) - 1, data=Panel)
summary(fixed.dum)
```

And the summary shows

```Call:
lm(formula = y ~ x1 + factor(country) - 1, data = Panel)

Residuals:
Min         1Q     Median         3Q        Max
-8.634e+09 -9.697e+08  5.405e+08  1.386e+09  5.612e+09

Coefficients:
Estimate Std. Error t value Pr(>|t|)
x1                2.476e+09  1.107e+09   2.237  0.02889 *
factor(country)A  8.805e+08  9.618e+08   0.916  0.36347
factor(country)B -1.058e+09  1.051e+09  -1.006  0.31811
factor(country)C -1.723e+09  1.632e+09  -1.056  0.29508
factor(country)D  3.163e+09  9.095e+08   3.478  0.00093 ***
factor(country)E -6.026e+08  1.064e+09  -0.566  0.57329
factor(country)F  2.011e+09  1.123e+09   1.791  0.07821 .
factor(country)G -9.847e+08  1.493e+09  -0.660  0.51190
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.796e+09 on 62 degrees of freedom
Multiple R-squared:  0.4402,	Adjusted R-squared:  0.368
F-statistic: 6.095 on 8 and 62 DF,  p-value: 8.892e-06
```

We can see that the estimate for x1 is the same as that of the fixed effects model estimated with plm. The constants for each country are shown in the results as well.
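That equivalence between LSDV and the within ("demeaning") estimator can be verified directly in base R on simulated data (all names illustrative): demeaning by entity and dropping the dummies yields the identical slope.

```r
set.seed(4)
id <- rep(1:5, each = 6)   # 5 entities, 6 periods each
x1 <- rnorm(30)
y  <- rep(runif(5, -3, 3), each = 6) + 2 * x1 + rnorm(30)

# LSDV: entity dummies, no common intercept
b_lsdv <- coef(lm(y ~ x1 + factor(id) - 1))["x1"]

# Within estimator: demean y and x1 by entity, regress without intercept
b_within <- coef(lm(I(y - ave(y, id)) ~ I(x1 - ave(x1, id)) - 1))[1]

all.equal(unname(b_lsdv), unname(b_within))  # identical slopes
```

This is a consequence of the Frisch-Waugh-Lovell theorem: partialling out the dummies is the same as demeaning within each entity.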

Other tests/Diagnostics

Test for time-fixed effects.

We don't always need time effects in a model; sometimes we are interested in other factors. We can use tests to see whether time-fixed effects are needed.

First we construct a time-fixed effects model.

```fixed.time <- plm(y ~ x1 + factor(year), data=Panel, index=c("country","year"), model="within")
summary(fixed.time)
```

The summary:

```Oneway (individual) effect Within Model

Call:
plm(formula = y ~ x1 + factor(year), data = Panel, model = "within",
index = c("country", "year"))

Balanced Panel: n = 7, T = 10, N = 70

Residuals:
Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
-7.92e+09 -1.05e+09 -1.40e+08  0.00e+00  1.63e+09  5.49e+09

Coefficients:
Estimate Std. Error t-value Pr(>|t|)
x1               1389050354 1319849567  1.0524  0.29738
factor(year)1991  296381559 1503368528  0.1971  0.84447
factor(year)1992  145369666 1547226548  0.0940  0.92550
factor(year)1993 2874386795 1503862554  1.9113  0.06138 .
factor(year)1994 2848156288 1661498927  1.7142  0.09233 .
factor(year)1995  973941306 1567245748  0.6214  0.53698
factor(year)1996 1672812557 1631539254  1.0253  0.30988
factor(year)1997 2991770063 1627062032  1.8388  0.07156 .
factor(year)1998  367463593 1587924445  0.2314  0.81789
factor(year)1999 1258751933 1512397632  0.8323  0.40898
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Total Sum of Squares:    5.2364e+20
Residual Sum of Squares: 4.0201e+20
R-Squared:      0.23229
F-statistic: 1.60365 on 10 and 53 DF, p-value: 0.13113
```

After constructing a fixed-time effects model, we can run some tests to see whether this is needed.

We can run a pFtest.

The null hypothesis is that no time-fixed effects is needed.

```pFtest(fixed.time, fixed)
```

The output gives:

```F test for individual effects

data:  y ~ x1 + factor(year)
F = 1.209, df1 = 9, df2 = 53, p-value = 0.3094
alternative hypothesis: significant effects
```

If the p-value were small, we would reject the null hypothesis and use time-fixed effects. In this case the p-value is 0.3094, so there is no need for time-fixed effects.

We can also run a Lagrange multiplier test for time effects (Breusch-Pagan). The function is plmtest and we specify "bp" as the type. The null hypothesis is the same: no time-fixed effects are needed.

```
plmtest(fixed, effect = "time", type = "bp")
```

The result gives:

```Lagrange Multiplier Test - time effects (Breusch-Pagan)

data:  y ~ x1
chisq = 0.16532, df = 1, p-value = 0.6843
alternative hypothesis: significant effects
```

Again, the p-value is large, so we do not need time-fixed effects in this model.
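The logic of these time-effects tests can be sketched in base R with simulated data: compare a model with and without year dummies using anova (all names illustrative):

```r
set.seed(6)
id   <- rep(1:5, each = 4)          # 5 entities
year <- rep(2001:2004, times = 5)   # 4 years; no true time effects here
x1   <- rnorm(20)
y    <- rep(rnorm(5), each = 4) + 0.7 * x1 + rnorm(20)

fe      <- lm(y ~ x1 + factor(id))                  # entity effects only
fe_time <- lm(y ~ x1 + factor(id) + factor(year))   # add year dummies

# Joint F test of the 3 year dummies
tab <- anova(fe, fe_time)
tab
```

A large p-value here, as in the plm tests above, says the year dummies add nothing and the simpler model suffices.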

DSS Online Training Section https://dss.princeton.edu/training/

Princeton DSS Libguides https://libguides.princeton.edu/dss

John Fox's site https://socialsciences.mcmaster.ca/jfox/

Quick-R https://www.statmethods.net/

UCLA Resources https://stats.oarc.ucla.edu/r/

Angrist, J. D., & Pischke, J. S. (2009). Mostly harmless econometrics: An empiricist's companion. Princeton University Press.

Baltagi, B. (2021). Econometric analysis of panel data (6th ed.). Springer.

Bartels, B. (2008). "Beyond fixed versus random effects": A framework for improving substantive and statistical analysis of panel, time-series cross-sectional, and multilevel data. The Society for Political Methodology, 9, 1-43. Available at: https://home.gwu.edu/~bartels/cluster.pdf

Croissant, Y., & Millo, G. (2008). Panel data econometrics in R: The plm package. Journal of Statistical Software, 27(2). https://doi.org/10.18637/jss.v027.i02

Dalgaard, P. (2011). Introductory statistics with R. Lightning Source UK Ltd.

Fox, J., & Weisberg, S. (2011). An R companion to applied regression. SAGE.

Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.

Greene, W. H. (2018). Econometric analysis (8th ed.). Pearson.

Hoechle, D. (2007). Robust standard errors for panel regressions with cross-sectional dependence. The Stata Journal, 7(3), 281-312. Available at: https://journals.sagepub.com/doi/pdf/10.1177/1536867X0700700301

Kleiber, C., & Zeileis, A. (2008). Applied econometrics with R. Springer.

Lumley, T. (2010). Complex surveys: A guide to analysis using R. Wiley.

Spector, P. (2008). Data manipulation with R. Springer.

Stock, J. H., & Watson, M. W. (2019). Introduction to econometrics (4th ed.). Pearson.

Wooldridge, J. M. (2010). Econometric analysis of cross section and panel data. MIT Press.

Wooldridge, J. M. (2020). Introductory econometrics: a modern approach (7th ed). Cengage Learning.

Data Consultant

Yufei Qin
Contact:
Firestone Library, A.12F.2
609-258-2519