Factor Analysis in Stata: Getting Started with Factor Analysis

This tutorial provides a step-by-step guide to conducting basic factor analysis in Stata.

Getting Started with Factor Analysis

1. What is Factor Analysis?

When a dataset contains a large number of variables, many of them may substantively overlap with one another. In this situation, we may want to reduce the number of variables in our dataset, and factor analysis can be a useful tool for doing so. For example, in psychology research, we can reduce long personality test responses to a small number of personality traits by conducting a factor analysis.

Key objectives of factor analysis are:

(i) Getting a small set of variables (preferably uncorrelated) from a large set of variables (most of which are correlated with each other).

(ii) Creating indexes with variables that conceptually measure similar things.

There are two types of factor analysis. 

  1. Exploratory Factor Analysis: We use exploratory factor analysis when we do not have a predefined idea of the structure or how many dimensions there are in a set of variables.
  2. Confirmatory Factor Analysis: We use confirmatory factor analysis when we want to test a specific hypothesis about the structure or the number of dimensions underlying a set of variables. For instance,  we use confirmatory factor analysis if we think our data have two dimensions and we want to verify that.

Note: Factor analysis and principal component analysis (PCA) are sometimes used interchangeably because they are similar in many ways. Keep in mind, however, that there is a fundamental difference between them: PCA forms linear combinations of the observed variables, whereas factor analysis is a measurement model of a latent variable.
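
As a quick illustration of the distinction, Stata has separate commands for the two techniques. A minimal sketch is below; the variable names v1-v5 are placeholders, not variables from the tutorial dataset.

  * factor analysis: models the variance that the variables share via latent factor(s)
  factor v1 v2 v3 v4 v5

  * principal component analysis: forms linear combinations of the observed variables
  pca v1 v2 v3 v4 v5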

In this tutorial, we will show how to conduct different kinds of exploratory factor analysis using data from Meijers and Zaslove (2021). 

2. A Simple Factor Analysis

Let's first see how to conduct a very basic factor analysis.

Load the following dataset: 

  use https://dss.princeton.edu/training/factor.dta

The dataset (constructed by Meijers and Zaslove, 2021) contains information on 250 political parties in 28 European countries. The authors measure populism in political parties using expert surveys. As populism is a multi-dimensional concept, the authors measure it with five variables: manichean, indivisble, generalwill, peoplecentrism, and antielitism. By using simple factor analysis, we will identify the number of latent factor(s) underlying these five variables. To do that, use the following Stata command:

 factor manichean indivisble generalwill peoplecentrism antielitism

Stata provides us with the following outputs:

Factor analysis/correlation                      Number of obs    =        236
    Method: principal factors                    Retained factors =          3
    Rotation: (unrotated)                        Number of params =         10
    --------------------------------------------------------------------------
         Factor  |   Eigenvalue   Difference        Proportion   Cumulative
    -------------+------------------------------------------------------------
        Factor1  |      3.96566      3.63925            0.9500       0.9500
        Factor2  |      0.32641      0.30602            0.0782       1.0282
        Factor3  |      0.02039      0.08281            0.0049       1.0331
        Factor4  |     -0.06241      0.01332           -0.0150       1.0181
        Factor5  |     -0.07573            .           -0.0181       1.0000
    --------------------------------------------------------------------------
    LR test: independent vs. saturated:  chi2(10) = 1370.74 Prob>chi2 = 0.0000
Factor loadings (pattern matrix) and unique variances
    -----------------------------------------------------------
        Variable |  Factor1   Factor2   Factor3 |   Uniqueness 
    -------------+------------------------------+--------------
       manichean |   0.8610    0.0035    0.1112 |      0.2464  
      indivisble |   0.8691    0.3273   -0.0166 |      0.1373  
     generalwill |   0.9222    0.2402   -0.0334 |      0.0907  
    peoplecent~m |   0.9005   -0.2553   -0.0784 |      0.1177  
     antielitism |   0.8987   -0.3105    0.0223 |      0.0955  
    -----------------------------------------------------------

Interpretation:

  1. From the output, we see that the simple factor analysis command retains only three factors (Factor1, Factor2, and Factor3). It drops the other two factors (Factor4 and Factor5) because their eigenvalues are negative: by default, the factor command retains only factors with positive eigenvalues (the mineigen(0) criterion).
  2. If the Uniqueness is high, the corresponding variable is not well explained by the factors. Values > 0.6 are usually considered high. As all of our Uniqueness values (see the second table) are much lower than 0.6, we can say that the five variables are sufficiently explained by the three retained factors.
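
As an optional visual check on this retention decision (not part of the output shown above), Stata's screeplot postestimation command plots the eigenvalues from the most recent factor run:

  * plot the eigenvalues of all five factors; yline(0) marks the positive-eigenvalue cutoff
  screeplot, yline(0)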

3. Principal-component Factor Analysis

In the previous example, we showed the principal-factor solution, where the communalities (defined as 1 - Uniqueness) were estimated using the squared multiple correlation coefficients. However, if we are willing to assume that there is no unique variance, we can use the principal-component factor method instead (keep in mind that principal-component factor analysis and principal component analysis are not the same thing!). To do that, let's load the following dataset:

   use https://dss.princeton.edu/training/factor.dta, clear

- Use the following Stata command:

   factor manichean indivisble generalwill peoplecentrism antielitism, pcf

Note: after the factor command, we list the variables from which we want to extract the latent factor(s); the pcf option requests the principal-component factor method.

Stata provides us with the following outputs:

Factor analysis/correlation                      Number of obs    =        236
    Method: principal-component factors          Retained factors =          1
    Rotation: (unrotated)                        Number of params =          5
    --------------------------------------------------------------------------
         Factor  |   Eigenvalue   Difference        Proportion   Cumulative
    -------------+------------------------------------------------------------
        Factor1  |      4.12933      3.65862            0.8259       0.8259
        Factor2  |      0.47071      0.22482            0.0941       0.9200
        Factor3  |      0.24589      0.16335            0.0492       0.9692
        Factor4  |      0.08254      0.01102            0.0165       0.9857
        Factor5  |      0.07152            .            0.0143       1.0000
    --------------------------------------------------------------------------
    LR test: independent vs. saturated:  chi2(10) = 1370.74 Prob>chi2 = 0.0000
Factor loadings (pattern matrix) and unique variances
    ---------------------------------------
        Variable |  Factor1 |   Uniqueness 
    -------------+----------+--------------
       manichean |   0.9008 |      0.1886  
      indivisble |   0.8876 |      0.2122  
     generalwill |   0.9322 |      0.1310  
    peoplecent~m |   0.9130 |      0.1664  
     antielitism |   0.9096 |      0.1726  
    ---------------------------------------

Interpretation:

  1. For principal-component factors, the Kaiser criterion suggests retaining factors with eigenvalues greater than or equal to 1. In the first table, only Factor1 meets this criterion, so we retain Factor1 only.
  2. Proportion in the first table shows the share of the total variance explained by each factor. Here, (0.8259*100 =) 82.59% of the total variance is explained by Factor1.
  3. Factor loadings are the weights and correlations between each variable and the factor. The higher the loading, the more relevant the variable is in defining the factor's dimensionality. A negative loading indicates that the variable is inversely related to the factor. Here, Factor1 is retained because its eigenvalue is greater than 1.
  4. Uniqueness is the variance that is 'unique' to the variable and not shared with the other variables. The smaller the Uniqueness, the higher the relevance of the variable in the factor model. For example, only 18.86% of the variance in "manichean" is not shared with the other variables (i.e., 81.14% of the variance in "manichean" is shared). Therefore, manichean is highly relevant in the factor model. The same holds for the other variables, indicating that all of them contribute substantially to defining Factor1.
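
One optional diagnostic, not shown in this tutorial's output, is the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy, which is available as a postestimation command after factor:

  * KMO measure of sampling adequacy; values closer to 1 suggest the variables share
  * enough common variance for factor analysis to be appropriate
  estat kmo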

- After running factor, we can use the rotate command to get a clearer pattern of our factor model. To do this, type: rotate

Stata provides us with the following outputs:

 Factor analysis/correlation                      Number of obs    =        236
    Method: principal-component factors          Retained factors =          1
    Rotation: orthogonal varimax (Kaiser off)    Number of params =          5

    --------------------------------------------------------------------------
         Factor  |     Variance   Difference        Proportion   Cumulative
    -------------+------------------------------------------------------------
        Factor1  |      4.12933            .            0.8259       0.8259
    --------------------------------------------------------------------------
    LR test: independent vs. saturated:  chi2(10) = 1370.74 Prob>chi2 = 0.0000
Rotated factor loadings (pattern matrix) and unique variances
    ---------------------------------------
        Variable |  Factor1 |   Uniqueness 
    -------------+----------+--------------
       manichean |   0.9008 |      0.1886  
      indivisble |   0.8876 |      0.2122  
     generalwill |   0.9322 |      0.1310  
    peoplecent~m |   0.9130 |      0.1664  
     antielitism |   0.9096 |      0.1726  
    ---------------------------------------
Factor rotation matrix
    -----------------------
                 | Factor1 
    -------------+---------
         Factor1 |  1.0000 
    -----------------------

 

Notice from the first table that Factor1 explains 82.59% of the total observed variance in our factor model. Because only one factor was retained, the varimax rotation leaves the loadings unchanged, which is why the factor rotation matrix is simply 1.

- We now know that Factor1 is the latent factor in our model, capturing the shared information from the variables used to measure populism. We can now save the factor scores as a new variable in our dataset. To do that, type:

  predict factor1

Stata provides us with the following outputs:

. predict factor1
(option regression assumed; regression scoring)
Scoring coefficients (method = regression; based on varimax rotated factors)
    ------------------------
        Variable |  Factor1 
    -------------+----------
       manichean |  0.21814 
      indivisble |  0.21495 
     generalwill |  0.22576 
    peoplecent~m |  0.22111 
     antielitism |  0.22028 
    ------------------------

 

Notes:

  1. Notice in the Variables window in Stata that a new variable, factor1, has been generated, which we can use in a regression of our choice.
  2. The numbers in the Factor1 column of the table above are the scoring (regression) coefficients used to compute the individual scores (one per case/row) for the factor1 variable in the dataset.
  3. If we had retained two or more factors, we would list one new variable name for each of them after predict to create a set of new variables.
  4. If we had retained two or more factors, we could also create an index from each cluster of variables that loads on the same factor (a minimal sketch of notes 3 and 4 follows this list).
  5. We can name the new variable as we wish. For example, instead of factor1, we can call it populism_factor. In that case, type the following command:

    predict populism_factor  
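
A minimal sketch of notes 3 and 4, assuming a hypothetical analysis that retained two factors (the variable names inside rowmean() are placeholders, not variables from the tutorial dataset):

  * save one score variable per retained factor
  predict factor1 factor2

  * or build a simple additive index from the variables that load on the same factor
  egen index1 = rowmean(var_a var_b var_c)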

4. Iterated Principal-factor Analysis

Iterated principal-factor analysis is closely related to the simple principal-factor method shown in Section 2. The key difference is that, in iterated principal-factor analysis, the communality estimates are re-estimated iteratively until they converge, which generally yields better estimates.

- To conduct iterated factor analysis, let's load the following dataset:

 use https://dss.princeton.edu/training/factor.dta, clear

- Use the following Stata command:

factor manichean indivisble generalwill peoplecentrism antielitism, ipf mineigen(1)

Notes:

  1. After the factor command, we list the variable names from which we want to derive latent factor(s).
  2. After ipf, the mineigen(1) option specifies that we want to keep only the factors with eigenvalues of at least 1, following the Kaiser criterion of retaining factors with eigenvalues greater than or equal to 1.

Stata provides us with the following outputs:

Factor analysis/correlation                      Number of obs    =        236
    Method: iterated principal factors           Retained factors =          1
    Rotation: (unrotated)                        Number of params =          5
    --------------------------------------------------------------------------
         Factor  |   Eigenvalue   Difference        Proportion   Cumulative
    -------------+------------------------------------------------------------
        Factor1  |      3.91391      3.66278            1.0000       1.0000
        Factor2  |      0.25113      0.23507            0.0642       1.0642
        Factor3  |      0.01606      0.14168            0.0041       1.0683
        Factor4  |     -0.12562      0.01596           -0.0321       1.0362
        Factor5  |     -0.14158            .           -0.0362       1.0000
    --------------------------------------------------------------------------
    LR test: independent vs. saturated:  chi2(10) = 1370.74 Prob>chi2 = 0.0000
Factor loadings (pattern matrix) and unique variances
    ---------------------------------------
        Variable |  Factor1 |   Uniqueness
    -------------+----------+--------------
       manichean |   0.8712 |      0.2411  
      indivisble |   0.8515 |      0.2750  
     generalwill |   0.9227 |      0.1486  
    peoplecent~m |   0.8913 |      0.2057  
     antielitism |   0.8856 |      0.2157  
    ---------------------------------------

Interpretation:

  1. In the case of iterated principal factors, the Kaiser criterion again suggests retaining factors with eigenvalues greater than or equal to 1. In the first table, only Factor1 meets this criterion, so we retain Factor1 only.
  2. Proportion in the first table shows the share of the common variance explained by each factor. Here, 100% of the common variance is explained by Factor1.
  3. Factor loadings are the weights and correlations between each variable and the factor. The higher the loading, the more relevant the variable is in defining the factor's dimensionality. A negative loading indicates that the variable is inversely related to the factor.
  4. Uniqueness is the variance that is 'unique' to the variable and not shared with the other variables. The smaller the Uniqueness, the higher the relevance of the variable in the factor model. For example, only 24.11% of the variance in "manichean" is not shared with the other variables (i.e., 75.89% of the variance in "manichean" is shared). Therefore, manichean is highly relevant in the factor model. The same holds for the other items, indicating that all of them contribute substantially to defining Factor1.

- After running factor, we can again use the rotate command to get a clearer pattern of our factor model. To do this, type: rotate

Stata provides us with the following outputs:

Factor analysis/correlation                      Number of obs    =        236
    Method: iterated principal factors           Retained factors =          1
    Rotation: orthogonal varimax (Kaiser off)    Number of params =          5
    --------------------------------------------------------------------------
         Factor  |     Variance   Difference        Proportion   Cumulative
    -------------+------------------------------------------------------------
        Factor1  |      3.91391            .            1.0000       1.0000
    --------------------------------------------------------------------------
    LR test: independent vs. saturated:  chi2(10) = 1370.74 Prob>chi2 = 0.0000
Rotated factor loadings (pattern matrix) and unique variances
    ---------------------------------------
        Variable |  Factor1 |   Uniqueness
    -------------+----------+--------------
       manichean |   0.8712 |      0.2411  
      indivisble |   0.8515 |      0.2750  
     generalwill |   0.9227 |      0.1486  
    peoplecent~m |   0.8913 |      0.2057  
     antielitism |   0.8856 |      0.2157  
    ---------------------------------------
Factor rotation matrix
    -----------------------
                 | Factor1
    -------------+---------
         Factor1 |  1.0000
    -----------------------

The output is similar to that of the principal-component factor analysis. However, in this case, Factor1 explains 100% of the common (shared) variance in our factor model.
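
By default, rotate applies an orthogonal varimax rotation. If we believed the underlying factors were correlated, we could request an oblique rotation instead; with a single retained factor, as here, either choice leaves the loadings unchanged.

  * oblique promax rotation (only informative when two or more factors are retained)
  rotate, promax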

- We now know that Factor1 is the latent factor in our model, capturing the shared information from the variables used to measure populism. We can now save the factor scores as a new variable in our dataset. To do that, type:

  predict factor1

Stata provides us with the following outputs:

.  predict factor1
(regression scoring assumed)
Scoring coefficients (method = regression; based on varimax rotated factors)
    ------------------------
        Variable |  Factor1
    -------------+----------
       manichean |  0.18634
      indivisble |  0.05959
     generalwill |  0.40780
    peoplecent~m |  0.20162
     antielitism |  0.20724
    ------------------------

Notes:

  1. Notice in the Variables window in Stata that a new variable, factor1, has been generated, which we can use in a regression of our choice (a minimal sketch follows this list).
  2. The numbers in the Factor1 column of the table above are the scoring (regression) coefficients used to compute the individual scores (one per case/row) for the factor1 variable in the dataset.
  3. If we had retained two or more factors, we would list one new variable name for each of them after predict to create a set of new variables.
  4. If we had retained two or more factors, we could also create an index from each cluster of variables that loads on the same factor.
  5. We can name the new variable as we wish. For example, instead of factor1, we can call it populism_factor. In that case, type the following command:

    predict populism_factor   
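
Once the scores are saved, they can be inspected and used like any other variable. A minimal sketch is below; party_vote_share is a hypothetical outcome variable, not part of the tutorial dataset.

  * inspect the distribution of the factor scores
  summarize populism_factor
  histogram populism_factor, normal

  * use the score as a predictor in a regression (party_vote_share is hypothetical)
  regress party_vote_share populism_factor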

5. References

DSS Data Analysis Guides https://libguides.princeton.edu/c.php?g=1415215

Meijers, M. J., & Zaslove, A. (2021). Measuring populism in political parties: Appraisal of a new approach. Comparative Political Studies, 54(2), 372-407.
 

Princeton DSS Libguides https://libguides.princeton.edu/dss

Stata Manual for Factor Analysis https://www.stata.com/manuals13/mvfactor.pdf

UCLA Resources https://stats.oarc.ucla.edu/stata/output/factor-analysis/

Watkins, M. W. (2021). A Step-by-step Guide to Exploratory Factor Analysis with Stata. Routledge.

Wu, H. S. (2018).  Introduction to Factor Analysis. CFDR Workshop Series. Available at https://www.bgsu.edu/content/dam/BGSU/college-of-arts-and-sciences/center-for-family-and-demographic-research/documents/Workshops/2018-Factor-Analysis.pdf

Comments or Questions?

If you have questions or comments about this guide or method, please email data@Princeton.edu.