When a dataset contains a large number of variables, many of them may overlap substantively with each other. In this situation, we may need to reduce the number of variables in our dataset, and factor analysis is a useful tool for doing so. For example, in psychology research, we can reduce responses to a long personality test to a small number of personality traits by conducting a factor analysis.
Key objectives of factor analysis are:
(i) Getting a small set of variables (preferably uncorrelated) from a large set of variables (most of which are correlated with each other).
(ii) Creating indexes with variables that conceptually measure similar things.
There are two types of factor analysis: exploratory factor analysis (EFA) and confirmatory factor analysis (CFA).
Note: Factor analysis and principal component analysis (PCA) are sometimes used interchangeably because they are similar in many ways. Keep in mind, however, that there is a fundamental difference between them: PCA builds each component as a linear combination of the observed variables, whereas factor analysis is a measurement model in which the observed variables are treated as indicators of a latent variable.
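The distinction in the note can be written out in equations. This is a sketch in standard textbook notation; the symbols (x_j for observed variables, C_k for components, F_k for latent factors, u_j for unique terms) are our own and do not appear in the guide.

```latex
% PCA: each component C_k is a weighted sum of the p observed variables
\begin{align}
C_k &= w_{k1} x_1 + w_{k2} x_2 + \dots + w_{kp} x_p
\end{align}

% Factor analysis: each observed variable x_j is modeled as a function of
% m latent factors plus a unique term u_j
\begin{align}
x_j &= \lambda_{j1} F_1 + \lambda_{j2} F_2 + \dots + \lambda_{jm} F_m + u_j,
\qquad j = 1, \dots, p
\end{align}
```

In the first equation the observed variables are on the right-hand side; in the second they are on the left-hand side, which is what makes factor analysis a measurement model.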
In this tutorial, we will show how to conduct different kinds of exploratory factor analysis using data from Meijers and Zaslove (2021).
Let's first see how to conduct a very basic factor analysis.
Load the following dataset:
use https://dss.princeton.edu/training/factor.dta
The dataset (constructed by Meijers and Zaslove, 2021) contains information for 250 political parties in 28 European countries. The authors measure populism in political parties using expert surveys. As populism is a multi-dimensional concept, the authors measure it with the help of five variables: manichean, indivisible, generalwill, peoplecentrism, and antielitism. By using simple factor analysis, we will identify the number of latent factor(s) among these five variables. To do that, use the following Stata command:
factor manichean indivisble generalwill peoplecentrism antielitism
Stata provides us with the following outputs:
Interpretation:
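To help decide how many factors to retain, a few postestimation commands can be run right after factor. The sketch below is not part of the original guide. It assumes the dataset loads from the URL given above and that the variable names are spelled exactly as in the factor command shown earlier (including indivisble).

```stata
* Load the data and fit the default principal-factor model, as in the text
use https://dss.princeton.edu/training/factor.dta, clear
factor manichean indivisble generalwill peoplecentrism antielitism

* Scree plot of the eigenvalues; yline(1) adds a reference line at 1
screeplot, yline(1)

* Kaiser-Meyer-Olkin measure of sampling adequacy
estat kmo
```

A sharp drop after the first eigenvalue in the scree plot would be consistent with a single latent factor, but the cutoff is a judgment call.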
In the previous example, we showed the principal-factor solution, where the communalities (defined as 1 - Uniqueness) were estimated using the squared multiple correlation coefficients. However, if we assume that there are no unique factors, we should use the "Principal-component factors" option (keep in mind that principal-component factors analysis and principal component analysis are not the same thing!). To do that, load the same dataset as before (with the use command shown above).
- Use the following Stata code:
Note: after the factor command, we list the variables from which we want to extract one or more latent factors.
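The command itself is not reproduced in this section. A minimal sketch follows, assuming the same five variables as in the first example and Stata's pcf option, which requests principal-component factors. Whether the original command also set mineigen() or factors() is not known.

```stata
* Reload the data so the example runs on its own
use https://dss.princeton.edu/training/factor.dta, clear

* Principal-component factors: communalities are assumed to be 1
factor manichean indivisble generalwill peoplecentrism antielitism, pcf
```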
Stata provides us with the following outputs:
Interpretation:
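The relationship between communality and uniqueness used in this section can be stated more explicitly. This is a sketch in our own notation, assuming standardized variables and uncorrelated factors: h_j^2 is the communality of variable j, \psi_j its uniqueness, \lambda_{jk} its loading on factor k, and R_j^2 the squared multiple correlation of variable j with the other variables.

```latex
\begin{align}
h_j^2 &= \sum_{k=1}^{m} \lambda_{jk}^2, &
\psi_j &= 1 - h_j^2
\end{align}

% Starting values for the communalities under the two methods
\begin{align}
\text{principal factor:} \quad & h_j^2 = R_j^2 \\
\text{principal-component factors:} \quad & h_j^2 = 1
\end{align}
```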
- After running factor, we can use the rotate command to get a clearer pattern of our factor model. To do this, type:
rotate
Stata provides us with the following outputs:
Factor analysis/correlation Number of obs = 236
Method: principal-component factors Retained factors = 1
Rotation: orthogonal varimax (Kaiser off) Number of params = 5
Notice that from the first table we can see more clearly that Factor1 explains 82.59% of the total observed variance in our factor model.
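The rotate command accepts options for choosing the rotation method. The sketch below is illustrative and not from the original guide. It assumes the pcf model from this section. Because only one factor is retained here, rotation cannot change the loadings, so the alternatives matter only when two or more factors are kept.

```stata
use https://dss.princeton.edu/training/factor.dta, clear
factor manichean indivisble generalwill peoplecentrism antielitism, pcf

* Orthogonal varimax rotation (Stata's default for rotate)
rotate, varimax

* Oblique promax rotation, which allows the factors to correlate
rotate, promax

* Compare rotated and unrotated loadings side by side
estat rotatecompare
```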
- We now know that Factor1 is the latent factor in our model, and it summarizes the information in the variables used to measure populism. We will now save it as a new variable in our dataset. To do that, type:
predict factor1
Stata provides us with the following outputs:
Notes:
predict populism_factor
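By default, predict computes regression-scored factors; Bartlett scoring is available as an option. The sketch below is an illustration under the assumption that the pcf model from this section has been fitted. The variable names pop_reg and pop_bart are hypothetical.

```stata
use https://dss.princeton.edu/training/factor.dta, clear
factor manichean indivisble generalwill peoplecentrism antielitism, pcf
rotate

* Regression scoring (the default)
predict pop_reg

* Bartlett scoring as an alternative
predict pop_bart, bartlett

* Inspect the two sets of scores and how closely they agree
summarize pop_reg pop_bart
correlate pop_reg pop_bart
```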
Iterated principal-factor analysis builds on the principal-factor method shown in the first example. The key difference is that the communalities are not fixed at their initial estimates; they are re-estimated iteratively to obtain better estimates.
- To conduct iterated factor analysis, load the same dataset as before (with the use command shown above).
- Use the following Stata code:
factor manichean indivisble generalwill peoplecentrism antielitism, ipf mineigen(1)
Notes:
Stata provides us with the following outputs:
Interpretation:
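After fitting the iterated model, two postestimation commands can help judge how well a single factor reproduces the observed correlations. This sketch is not from the original guide and assumes the same dataset and variable names as above.

```stata
use https://dss.princeton.edu/training/factor.dta, clear
factor manichean indivisble generalwill peoplecentrism antielitism, ipf mineigen(1)

* Squared multiple correlation of each variable with all the others
estat smc

* Differences between observed and model-implied correlations
estat residuals
```

Residuals close to zero would suggest that one factor accounts for most of the association among the five variables.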
- After running factor, we can use the rotate command to get a clearer pattern of our factor model. To do this, type:
rotate
Stata provides us with the following outputs:
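The outputs are similar to those for the principal-component factors analysis. However, in this case, the proportion reported for Factor1 is 100%. For iterated principal factors, this proportion is computed relative to the common (shared) variance of the variables rather than their total observed variance, so it is not directly comparable with the 82.59% reported for the principal-component factors solution.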
- We now know that Factor1 is the latent factor in our model, and it summarizes the information in the variables used to measure populism. We will now save it as a new variable in our dataset. To do that, type:
predict factor1
Stata provides us with the following outputs:
Notes:
predict populism_factor
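One way to sanity-check the predicted factor score is to compare it with a simple index built from the same items, in line with the second objective listed at the start of this guide. The sketch below is illustrative only; the variable name populism_mean is hypothetical, and the comparison is our suggestion rather than a step in the original tutorial.

```stata
use https://dss.princeton.edu/training/factor.dta, clear
factor manichean indivisble generalwill peoplecentrism antielitism, ipf mineigen(1)
predict populism_factor

* Simple index: the row mean of the five items
egen populism_mean = rowmean(manichean indivisble generalwill peoplecentrism antielitism)

* How closely does the factor score track the simple index?
correlate populism_factor populism_mean

* Cronbach's alpha for the five items, with item-level statistics
alpha manichean indivisble generalwill peoplecentrism antielitism, item
```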
DSS Data Analysis Guides https://libguides.princeton.edu/c.php?g=1415215
Princeton DSS Libguides https://libguides.princeton.edu/dss
Stata Manual for Factor Analysis https://www.stata.com/manuals13/mvfactor.pdf
UCLA Resources https://stats.oarc.ucla.edu/stata/output/factor-analysis/
Watkins, M. W. (2021). A Step-by-step Guide to Exploratory Factor Analysis with Stata. Routledge.
Wu, H. S. (2018). Introduction to Factor Analysis. CFDR Workshop Series. Available at https://www.bgsu.edu/content/dam/BGSU/college-of-arts-and-sciences/center-for-family-and-demographic-research/documents/Workshops/2018-Factor-Analysis.pdf
If you have questions or comments about this guide or method, please email data@Princeton.edu.