# Factor Analysis in Stata: Getting Started with Factor Analysis

This tutorial provides a step-by-step guide to conducting basic factor analysis in Stata.

## What is Factor Analysis?

When a dataset contains a large number of variables, many of them may overlap substantively with each other. In this situation, we may need to reduce the number of variables in our dataset, and factor analysis is a useful tool for doing so. For example, in psychology research, we can reduce long personality test responses to a small number of personality traits by conducting a factor analysis.

Key objectives of factor analysis are:

(i) Getting a small set of variables (preferably uncorrelated) from a large set of variables (most of which are correlated with each other).

(ii) Creating indexes with variables that conceptually measure similar things.

There are two types of factor analysis.

1. Exploratory Factor Analysis: We use exploratory factor analysis when we do not have a predefined idea of the structure or how many dimensions there are in a set of variables.
2. Confirmatory Factor Analysis: We use confirmatory factor analysis when we want to test a specific hypothesis about the structure or the number of dimensions underlying a set of variables. For instance, we use confirmatory factor analysis if we think our data have two dimensions and we want to verify that.

Note: Factor analysis and principal component analysis (PCA) are sometimes used interchangeably because they are similar in many ways. Keep in mind, however, that there is a fundamental difference between them: PCA builds each component as a linear combination of the observed variables, whereas factor analysis is a measurement model in which the observed variables are treated as indicators of a latent variable.
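The distinction in the note above can be written out as a small sketch. The notation is an assumption introduced here for illustration (it does not come from the tutorial's sources): $x_i$ are the standardized observed variables, $c_j$ a principal component with weights $w_{ji}$, $f_k$ the latent factors with loadings $\lambda_{ik}$, and $\varepsilon_i$ the unique part of $x_i$ with variance $\psi_i$. The last line assumes uncorrelated factors.

```latex
\begin{align*}
% PCA: a component is a weighted sum of the observed variables
c_j &= w_{j1} x_1 + w_{j2} x_2 + \dots + w_{jp} x_p \\
% Factor analysis: each observed variable is modeled from latent factors plus a unique term
x_i &= \lambda_{i1} f_1 + \dots + \lambda_{im} f_m + \varepsilon_i \\
% Standardized variables, uncorrelated factors: communality + uniqueness = 1
\operatorname{Var}(x_i) &= \underbrace{\sum_{k=1}^{m} \lambda_{ik}^{2}}_{\text{communality}}
  + \underbrace{\psi_i}_{\text{uniqueness}} = 1
\end{align*}
```

Read this way, PCA goes from the variables to the components, while factor analysis goes from the latent factors to the variables, which is why the factor output reports a Uniqueness value for every variable.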

In this tutorial, we will show how to conduct different kinds of exploratory factor analysis using data from Meijers and Zaslove (2021).

## A Simple Factor Analysis

Let's first see how to conduct a very basic factor analysis.

use https://dss.princeton.edu/training/factor.dta

The dataset (constructed by Meijers and Zaslove, 2021) contains information on 250 political parties in 28 European countries. The authors measure populism in political parties using expert surveys. As populism is a multi-dimensional concept, the authors measure it with five variables: manichean, indivisble (spelled this way in the dataset, as the output below shows), generalwill, peoplecentrism, and antielitism. Using a simple factor analysis, we will identify the number of latent factors among these five variables. To do that, use the following Stata command:

factor manichean indivisble generalwill peoplecentrism antielitism

Stata provides us with the following outputs:

| Variable | Factor1 | Factor2 | Factor3 | Uniqueness |
|---|---|---|---|---|
| manichean | 0.8610 | 0.0035 | 0.1112 | 0.2464 |
| indivisble | 0.8691 | 0.3273 | -0.0166 | 0.1373 |
| generalwill | 0.9222 | 0.2402 | -0.0334 | 0.0907 |
| peoplecent~m | 0.9005 | -0.2553 | -0.0784 | 0.1177 |
| antielitism | 0.8987 | -0.3105 | 0.0223 | 0.0955 |

Interpretation:

1. The output shows that the simple factor analysis command retains only three factors (Factor1, Factor2, and Factor3). It drops the other two factors (Factor4 and Factor5) because their eigenvalues are negative: under the mineigen(0) criterion, we retain only those factors that have positive eigenvalues.
2. If the Uniqueness is high, then the corresponding variable is not well explained by the factors. Values > 0.6 are usually considered high. As all of our Uniqueness values (see the Uniqueness column in the table above) are much lower than 0.6, we can say that the five variables are sufficiently explained by the three retained factors.
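The retention rule described above can be checked directly in Stata. The following is a minimal sketch, assuming the tutorial dataset is reachable at the URL used earlier; it lists the eigenvalues that factor stores in e(Ev), draws a scree plot, and shows the factors() option as an alternative way to limit how many factors are kept.

```stata
* Load the tutorial data and run the default (principal-factor) analysis
use https://dss.princeton.edu/training/factor.dta, clear
factor manichean indivisble generalwill peoplecentrism antielitism

* List the eigenvalues stored by factor (one column per factor)
matrix list e(Ev)

* Plot the eigenvalues; a sharp drop after the first one points to a single dominant factor
screeplot

* Alternative to an eigenvalue threshold: cap the number of retained factors
factor manichean indivisble generalwill peoplecentrism antielitism, factors(1)
```

Choosing between an eigenvalue threshold and a fixed number of factors is a judgment call; the scree plot is often used to inform it.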

## Principal-component Factors Analysis

In the previous example, we obtained the principal-factor solution, in which the communalities (defined as 1 - Uniqueness) are estimated using the squared multiple correlation coefficients. However, if we assume that there are no unique factors, we should use the "principal-component factors" option, pcf (keep in mind that principal-component factors analysis and principal component analysis are not the same thing!). To do that, let's load the dataset again:

use https://dss.princeton.edu/training/factor.dta, clear

- Use the following Stata command:

factor manichean indivisble generalwill peoplecentrism antielitism, pcf

Note: After the factor command, we list the variables from which we want to derive the latent factor(s).

Stata provides us with the following outputs:

| Variable | Factor1 | Uniqueness |
|---|---|---|
| manichean | 0.9008 | 0.1886 |
| indivisble | 0.8876 | 0.2122 |
| generalwill | 0.9322 | 0.1310 |
| peoplecent~m | 0.9130 | 0.1664 |
| antielitism | 0.9096 | 0.1726 |

Interpretation:

1. For principal-component factors, the Kaiser criterion suggests retaining the factors with eigenvalues greater than or equal to 1. In the eigenvalue table of the Stata output, only Factor1 meets this criterion, so we retain Factor1 only.
2. The Proportion column in the eigenvalue table shows the share of variance explained by each factor. Here, Factor1 explains 82.59% (0.8259 × 100) of the total variation.
3. Factor loadings are the weights of, and the correlations between, each variable and the factor. The higher the loading, the more relevant the variable is in defining the factor’s dimensionality. A negative value indicates an inverse relationship with the factor. Here, only Factor1 appears in the loadings table because it is the only factor with an eigenvalue greater than 1.
4. Uniqueness is the variance that is 'unique' to the variable and not shared with other variables. The smaller the Uniqueness, the higher the relevance of the variable in the factor model. For example, only 18.86% of the variance in "manichean" is not shared with other variables (i.e., 81.14% of the variance in "manichean" is shared with other variables). Therefore, manichean is highly relevant in the factor model. The other variables are similarly relevant, indicating that all of them contribute strongly to defining Factor1.
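The numbers in this interpretation can be reproduced by hand. The sketch below is an illustration rather than part of the original analysis: it lists the loadings, uniquenesses, and eigenvalues that factor stores in e(L), e(Psi), and e(Ev), and then redoes two calculations from the rounded loadings in the table above. With a single retained factor, a variable's Uniqueness is 1 minus its squared loading, and the factor's proportion is the sum of squared loadings divided by the number of variables (five).

```stata
* Re-run the principal-component factors analysis
use https://dss.princeton.edu/training/factor.dta, clear
factor manichean indivisble generalwill peoplecentrism antielitism, pcf

* Stored results: factor loadings, uniquenesses, and eigenvalues
matrix list e(L)
matrix list e(Psi)
matrix list e(Ev)

* Uniqueness of manichean from its rounded loading: 1 - 0.9008^2 (about 0.1886)
display 1 - 0.9008^2

* Proportion for Factor1: sum of squared loadings / 5 variables (about 0.826;
* differs slightly from the reported 0.8259 because the loadings are rounded)
display (0.9008^2 + 0.8876^2 + 0.9322^2 + 0.9130^2 + 0.9096^2) / 5
```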

- After running factor, we can use the rotate command to get a clearer pattern of our factor model. To do this, type: rotate

Stata provides us with the following outputs:

Factor analysis/correlation                      Number of obs    =        236
Method: principal-component factors          Retained factors =          1
Rotation: orthogonal varimax (Kaiser off)    Number of params =          5

|  | Factor1 |
|---|---|
| Factor1 | 1.0000 |

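The first table of the rotated output again shows that Factor1 explains 82.59% of the total observed variance in our factor model. Because only one factor is retained, the rotation matrix is simply 1 and the rotated solution is the same as the unrotated one.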
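Rotation only changes the picture when two or more factors are retained. As a hedged illustration (the two-factor solution is requested here purely to demonstrate the syntax, not because the data call for it), the sketch below keeps two principal factors and applies an orthogonal and then an oblique rotation.

```stata
* Keep two factors so that rotation has something to rotate
use https://dss.princeton.edu/training/factor.dta, clear
factor manichean indivisble generalwill peoplecentrism antielitism, factors(2)

* Orthogonal rotation: factors stay uncorrelated (varimax is the default of rotate)
rotate, varimax

* Oblique rotation: factors are allowed to correlate
rotate, promax
```

Comparing the rotated loadings with the unrotated ones shows whether the variables separate into clearer clusters after rotation.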

- We now know that Factor1 is the latent factor in our model; it summarizes the information in the variables used to measure populism. We will now store its scores as a new variable in our dataset. To do that, type:

predict factor1

Stata provides us with the following outputs:

| Variable | Factor1 |
|---|---|
| manichean | 0.21814 |
| indivisble | 0.21495 |
| generalwill | 0.22576 |
| peoplecent~m | 0.22111 |
| antielitism | 0.22028 |

Notes:

1. Notice in the variable list in the Stata window that a new variable, factor1, has been generated; we can use it in a regression of our choice.
2. The numbers in the Factor1 column of the table above are the regression coefficients used to estimate the individual scores (per case/row) of the factor1 variable in the dataset.
3. If we had obtained two or more factors from our analysis, we would type a name for each of them to create a set of new variables.
4. If we had obtained two or more factors from our analysis, we could also create indexes out of each cluster of variables.
5. We can name the variable as we wish. For example, instead of factor1, we can name the new variable populism_factor. In that case, type the following command:

predict populism_factor
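The notes on predict can be sketched as follows. The variable names populism_bartlett, f1, and f2 are hypothetical names chosen for this example, and the two-factor model at the end is fitted only to show how several score variables are named at once.

```stata
* One-factor solution: store the factor scores under a name of our choice
use https://dss.princeton.edu/training/factor.dta, clear
factor manichean indivisble generalwill peoplecentrism antielitism, pcf
predict populism_factor
summarize populism_factor

* Optional: Bartlett scoring instead of the default regression scoring
predict populism_bartlett, bartlett
correlate populism_factor populism_bartlett

* Two-factor solution: give one new variable name per retained factor
factor manichean indivisble generalwill peoplecentrism antielitism, factors(2)
predict f1 f2
```

The stored scores are ordinary variables, so they can be used like any other variable in later commands.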

## Iterated Principal-factor Analysis

Iterated principal-factor analysis is similar to the analyses above. The key difference is that the solution is iterated: the communalities are re-estimated repeatedly to obtain better estimates.

- To conduct iterated factor analysis, let's load the following dataset:

use https://dss.princeton.edu/training/factor.dta, clear

- Use the following Stata command:

factor manichean indivisble generalwill peoplecentrism antielitism, ipf mineigen(1)

Notes:

1. After the factor command, we list the variables from which we want to derive the latent factor(s).
2. After ipf, mineigen(1) specifies that we want to retain only the factors whose eigenvalues meet the Kaiser criterion threshold of 1.

Stata provides us with the following outputs:

| Variable | Factor1 | Uniqueness |
|---|---|---|
| manichean | 0.8712 | 0.2411 |
| indivisble | 0.8515 | 0.2750 |
| generalwill | 0.9227 | 0.1486 |
| peoplecent~m | 0.8913 | 0.2057 |
| antielitism | 0.8856 | 0.2157 |

Interpretation:

1. For the iterated principal-factor method, the Kaiser criterion suggests retaining the factors with eigenvalues greater than or equal to 1. In the eigenvalue table of the Stata output, only Factor1 meets this criterion, so we retain Factor1 only.
2. The Proportion column in the eigenvalue table shows the share of variance explained by each factor. Here, the proportion reported for Factor1 is 100%. Since the Uniqueness values in the table above are all well above zero, this figure is best read as a share of the variance the variables have in common, not of their total variance.
3. Factor loadings are the weights of, and the correlations between, each variable and the factor. The higher the loading, the more relevant the variable is in defining the factor’s dimensionality. A negative value indicates an inverse relationship with the factor.
4. Uniqueness is the variance that is 'unique' to the variable and not shared with other variables. The smaller the Uniqueness, the higher the relevance of the variable in the factor model. For example, only 24.11% of the variance in "manichean" is not shared with other variables (i.e., 75.89% of the variance in "manichean" is shared with other variables). Therefore, manichean is highly relevant in the factor model. The other items are similarly relevant, indicating that all of them contribute strongly to defining Factor1.
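To see how the choice of method affects the Uniqueness values, the two one-factor solutions can be compared side by side. This is a minimal sketch; the matrix names Psi_pcf and Psi_ipf are hypothetical names used only here.

```stata
use https://dss.princeton.edu/training/factor.dta, clear

* Principal-component factors: save the uniquenesses
factor manichean indivisble generalwill peoplecentrism antielitism, pcf
matrix Psi_pcf = e(Psi)

* Iterated principal factors: save the uniquenesses
factor manichean indivisble generalwill peoplecentrism antielitism, ipf mineigen(1)
matrix Psi_ipf = e(Psi)

* Compare the two sets of uniquenesses
matrix list Psi_pcf
matrix list Psi_ipf
```

In the tables shown in this tutorial, the iterated principal-factor uniquenesses are somewhat higher than the principal-component factors ones (for example, 0.2411 versus 0.1886 for manichean).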

- After running factor, we can use the rotate command to get a clearer pattern of our factor model. To do this, type: rotate

Stata provides us with the following outputs:

|  | Factor1 |
|---|---|
| Factor1 | 1.0000 |

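The outputs are similar to those for the principal-component factors analysis. However, in this case, the proportion reported for Factor1 is 100%; given the nonzero Uniqueness values, this is best read as a share of the common (shared) variance rather than of the total observed variance.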

- We now know that Factor1 is the latent factor in our model; it summarizes the information in the variables used to measure populism. We will now store its scores as a new variable in our dataset. To do that, type:

predict factor1

Stata provides us with the following outputs:

| Variable | Factor1 |
|---|---|
| manichean | 0.18634 |
| indivisble | 0.05959 |
| generalwill | 0.40780 |
| peoplecent~m | 0.20162 |
| antielitism | 0.20724 |

Notes:

1. Notice in the variable list in the Stata window that a new variable, factor1, has been generated; we can use it in a regression of our choice.
2. The numbers in the Factor1 column of the table above are the regression coefficients used to estimate the individual scores (per case/row) of the factor1 variable in the dataset.
3. If we had obtained two or more factors from our analysis, we would type a name for each of them to create a set of new variables.
4. If we had obtained two or more factors from our analysis, we could also create indexes out of each cluster of variables.
5. We can name the variable as we wish. For example, instead of factor1, we can name the new variable populism_factor. In that case, type the following command:

predict populism_factor
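The idea of building an index from a cluster of variables (note 4) can be sketched with a simple row mean. This is an illustration under the assumption that the five items are measured on comparable scales; populism_index is a hypothetical variable name, and the sketch uses the single cluster found in this tutorial.

```stata
* Factor scores from the iterated principal-factor solution
use https://dss.princeton.edu/training/factor.dta, clear
factor manichean indivisble generalwill peoplecentrism antielitism, ipf mineigen(1)
predict populism_factor

* Simple additive index: mean of the five items (rowmean uses the nonmissing items in each row)
egen populism_index = rowmean(manichean indivisble generalwill peoplecentrism antielitism)

* Check how closely the index tracks the factor scores
correlate populism_factor populism_index
```

An unweighted index is easier to explain than factor scores, while factor scores weight the items by the scoring coefficients; which one to use depends on the application.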

## References

DSS Online Training Section https://dss.princeton.edu/training/

Meijers, M. J., & Zaslove, A. (2021). Measuring populism in political parties: Appraisal of a new approach. Comparative Political Studies, 54(2), 372-407.

Princeton DSS Libguides https://libguides.princeton.edu/dss

Stata Manual for Factor Analysis https://www.stata.com/manuals13/mvfactor.pdf

Watkins, M. W. (2021). A Step-by-step Guide to Exploratory Factor Analysis with Stata. Routledge.

Wu, H. S. (2018). Introduction to Factor Analysis. CFDR Workshop Series. Available at https://www.bgsu.edu/content/dam/BGSU/college-of-arts-and-sciences/center-for-family-and-demographic-research/documents/Workshops/2018-Factor-Analysis.pdf