Missing Data: Multiple Imputation in Stata

This guide discusses multiple imputation techniques for missing data using Stata.

Multiple Imputation in Stata

1. Introduction

Stata offers several methods for imputation, including single imputation, multiple imputation, and interpolation. The choice of method depends on the nature of your data and the type of analysis you intend to perform. This guide provides step-by-step instructions for conducting multiple imputation of missing data using Stata.

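Multiple imputation is one of the most robust and widely used statistical techniques for dealing with missing data. In multiple imputation, the distribution of the observed data is used to estimate a set of plausible values for each missing value. The missing values are replaced by these plausible values to create a “complete” dataset, and the process is repeated to produce several such datasets. Each completed dataset is analyzed separately and the results are combined into a single set of estimates.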

2. Multiple Imputation in Stata

STEP 1: Preparing Your Data

1.1. Load your dataset. For this tutorial, we will use "mheart5.dta", an example data file available from StataCorp. Type:

webuse "mheart5.dta"

1.2. Explore missing data. To examine the missing data pattern, type:

misstable sum, gen(miss_)

The “misstable” command with the “gen()” option generates indicators for missingness. These new variables are added to the data file and start with the prefix miss_.

Stata will give us the following table.

                                                               Obs<.
                                                +------------------------------
               |                                | Unique
      Variable |     Obs=.     Obs>.     Obs<.  | values        Min         Max
  -------------+--------------------------------+------------------------------
           age |        12                 142  |    142   20.73613    83.78423
           bmi |        28                 126  |    126   17.22643    38.24214
  -----------------------------------------------------------------------------

 

The Obs=. column represents the number of missing values for each variable. Variables that have no missing values are not listed in the table.

The Obs<. column represents the number of observed values for each variable.

1.3. You may tabulate the new indicator variables as an additional check. Type:

tab1 miss_age miss_bmi

Stata will give us the following table.

-> tabulation of miss_age  
   (age>=.) |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        142       92.21       92.21
          1 |         12        7.79      100.00
------------+-----------------------------------
      Total |        154      100.00
-> tabulation of miss_bmi  
   (bmi>=.) |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        126       81.82       81.82
          1 |         28       18.18      100.00
------------+-----------------------------------
      Total |        154      100.00

Indicator variables miss_age and miss_bmi were added to the data file in step 1.2. A value of 1 on these variables indicates that the observation is missing information on the specific variable; a value of 0 indicates that it is not missing. Twelve observations are missing information on age, and 28 observations are missing information on BMI.
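The way missing values combine across variables can also be inspected directly. The following is a minimal sketch, assuming the same mheart5 data; the variable list (age bmi) is chosen here only for illustration.

```stata
* Load the example data used in this guide
webuse mheart5, clear

* List every variable, including those with no missing values
misstable summarize, all

* Show which combinations of age and bmi are missing together
misstable patterns age bmi

* Report whether the missing-value patterns are nested (monotone)
misstable nested age bmi
```

A nested pattern here would be consistent with the note "missing-value pattern is monotone" that mi impute chained reports in Step 5.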

STEP 2: Consider the Mechanism of Missingness

Decide which variables need imputation. Not all missing data need to be imputed; sometimes, missingness is informative. Multiple imputation is appropriate when data are missing completely at random (MCAR) or missing at random (MAR). If data are missing not at random (MNAR), it is difficult to perform a legitimate analysis with standard multiple imputation.

2.1. Logistic regression models can be used to examine whether any of the variables in the data file predict missingness. If they do, the data are not MCAR and are more plausibly MAR. To run the logit model, type:

logit miss_bmi attack smokes age female hsgrad

Stata will give us the following output table.

Iteration 0:  Log likelihood = -49.994502  
Iteration 1:  Log likelihood =  -47.73123  
Iteration 2:  Log likelihood = -47.614822  
Iteration 3:  Log likelihood = -47.614515  
Iteration 4:  Log likelihood = -47.614515  
Logistic regression                                     Number of obs =    142
                                                        LR chi2(5)    =   4.76
                                                        Prob > chi2   = 0.4459
Log likelihood = -47.614515                             Pseudo R2     = 0.0476
------------------------------------------------------------------------------
    miss_bmi | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
      attack |   .0101071   .5775173     0.02   0.986    -1.121806     1.14202
      smokes |   .1965135   .5739319     0.34   0.732    -.9283723    1.321399
         age |  -.0485561   .0244407    -1.99   0.047     -.096459   -.0006532
      female |   .0892789   .6256756     0.14   0.887    -1.137023    1.315581
      hsgrad |   .3940007   .6888223     0.57   0.567    -.9560662    1.744068
       _cons |   .1414761   1.423355     0.10   0.921    -2.648249    2.931201
------------------------------------------------------------------------------

 

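The variable age is statistically significantly associated with miss_bmi (p = 0.047), which suggests that the data are MAR rather than MCAR. Note that the model uses 142 observations because the 12 observations with missing age are dropped.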

2.2. Run another logistic model to check whether any of the fully observed variables (attack, smokes, female, hsgrad) are statistically significantly associated with missingness of age. Type:

logit miss_age attack smokes female hsgrad

Stata will give us the following results.

Iteration 0:  Log likelihood = -42.144379  
Iteration 1:  Log likelihood = -40.780233  
Iteration 2:  Log likelihood = -40.713422  
Iteration 3:  Log likelihood = -40.713172  
Iteration 4:  Log likelihood = -40.713172  
Logistic regression                                     Number of obs =    154
                                                        LR chi2(4)    =   2.86
                                                        Prob > chi2   = 0.5811
Log likelihood = -40.713172                             Pseudo R2     = 0.0340
------------------------------------------------------------------------------
    miss_age | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
      attack |  -1.035628   .7108815    -1.46   0.145     -2.42893    .3576738
      smokes |   .2788896   .6369393     0.44   0.661    -.9694886    1.527268
      female |  -.0059384   .7025713    -0.01   0.993    -1.382953    1.371076
      hsgrad |   .5426292   .8029777     0.68   0.499    -1.031178    2.116437
       _cons |  -2.649692   .7993453    -3.31   0.001     -4.21638   -1.083004
------------------------------------------------------------------------------

None of the p-values for the predictors is less than 0.05, indicating that none of these variables is statistically significantly associated with missingness of age.

2.3. t tests may also be informative in evaluating whether the values of other variables differ between the missing and the non-missing groups. Type:

foreach var of varlist attack smokes age female hsgrad {
    ttest `var', by(miss_bmi)
}

Results of the t test of age by miss_bmi are presented below. Results of the t tests of the other variables by miss_bmi are not presented for brevity.

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
       0 |     126    57.14571    1.022929    11.48234    55.12121    59.17021
       1 |      16    50.82253    2.810969    11.24388    44.83109    56.81397
---------+--------------------------------------------------------------------
Combined |     142    56.43324    .9727211    11.59131    54.51024    58.35624
---------+--------------------------------------------------------------------
    diff |            6.323186    3.040682                .3115936    12.33478
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t =   2.0795
H0: diff = 0                                     Degrees of freedom =      140
    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9803         Pr(|T| > |t|) = 0.0394          Pr(T > t) = 0.0197
 

The t test suggests a statistically significant relationship between missingness of BMI and age. The t tests between missingness of BMI and the other variables (i.e., attack, smokes, female and hsgrad) were not statistically significant.
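Because attack, smokes, female and hsgrad are binary, a cross-tabulation with a chi-squared test is a common alternative to the t test for these variables. The following is a minimal sketch, not part of the original analysis; it assumes the mheart5 data and recreates the miss_ indicators from step 1.2.

```stata
* Load the data and create the missingness indicators (miss_age, miss_bmi)
webuse mheart5, clear
misstable summarize, generate(miss_)

* Cross-tabulate each binary variable with missingness of BMI,
* showing column percentages and a Pearson chi-squared test
foreach var of varlist attack smokes female hsgrad {
    tabulate `var' miss_bmi, chi2 column
}
```

With only 28 observations missing BMI, some cells may be small, so these tests are best read as a rough check rather than a formal conclusion.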

STEP 3: Setting-up Multiple Imputation

3.1. Declare the data to be mi data. Prior to imputation, the data must be declared as mi data using the “mi set” command; here we use the wide style. The style tells Stata how the additional imputations should be stored. Type:

mi set wide

3.2. Register variables for imputation. Variables to be imputed have to be registered using the “mi register” command. “mi register imputed” specifies the variables to be imputed in the procedure. “mi register regular” specifies the variables that should not be imputed (either because they have no missing values or because there is no need); registering them is optional but recommended. Type:

mi register regular female attack smokes hsgrad
mi register imputed bmi age
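Before imputing, it can help to confirm that the setup was recorded as intended. The following is a minimal sketch that repeats steps 3.1 and 3.2 on the mheart5 data and then asks Stata to report the mi settings.

```stata
* Reproduce the setup from steps 3.1 and 3.2
webuse mheart5, clear
mi set wide
mi register regular female attack smokes hsgrad
mi register imputed bmi age

* Describe the mi settings: style, registered variables, number of imputations
mi describe

* Short report of the style and the number of imputations (M = 0 before imputing)
mi query
```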

STEP 4: Choosing an Imputation Method

- Choose an appropriate method for each type of variable you want to impute. For instance, choose regress for continuous variables (e.g., age, bmi), logit for binary variables (e.g., gender), and mlogit for nominal variables (e.g., political party).

- Stata offers mi impute chained (multiple imputation by chained equations, MICE), a flexible approach that can handle different types of variables (e.g., continuous, binary, ordinal, nominal, truncated, count) in one command.

- mi impute chained is an iterative process. By default, the variable with the fewest missing values is imputed first, followed by the variable with the next fewest missing values, and so on for the rest of the variables.
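As an illustration of assigning a different method to each variable, the sketch below imputes age by linear regression and bmi by predictive mean matching (pmm). The choice of pmm for bmi and the value knn(5) are assumptions made for illustration only and are not part of this tutorial's analysis; the seed is the one used in Step 5.

```stata
* Set up the data as in Step 3
webuse mheart5, clear
mi set wide
mi register regular female attack smokes hsgrad
mi register imputed bmi age

* A different method for each imputed variable:
* regress for age, predictive mean matching with 5 nearest neighbors for bmi
mi impute chained (regress) age (pmm, knn(5)) bmi = attack smokes hsgrad female, add(5) rseed(9478)
```

With pmm, imputed values are drawn from observed values of the variable, so the imputed BMI values stay within the observed range.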

STEP 5: Imputing the Missing Data

5.1. Perform the imputation. To impute missing values using chained equations for 5 imputations, type:

mi impute chained (regress) age bmi = attack smokes hsgrad female, add(5) rseed(9478) replace

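Interpretation of the command:

mi impute chained: the command for imputation by chained equations, which can handle a mix of variable types.

(regress) age bmi: Both BMI and age, the variables to be imputed, are continuous variables. Therefore, regress is specified as the method.

attack smokes hsgrad female: These are the regular (fully observed) variables. They are not imputed; they serve as predictors in the imputation models.

add(5): The number of imputations to add (5).

rseed(9478): The “rseed()” option sets the random-number seed so that the results are reproducible.

replace: Replaces the imputed values in any existing imputations. No imputations exist yet in this example, so the option has no effect here (the output reports updated = 0).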

Stata will give us the following table.

note: missing-value pattern is monotone; no iteration performed.
Conditional models (monotone):
               age: regress age attack smokes hsgrad female
               bmi: regress bmi age attack smokes hsgrad female
Performing chained iterations ...
Multivariate imputation                     Imputations =        5
Chained equations                                 added =        5
Imputed: m=1 through m=5                        updated =        0
Initialization: monotone                     Iterations =        0
                                                burn-in =        0
               age: linear regression
               bmi: linear regression
------------------------------------------------------------------
                   |               Observations per m             
                   |----------------------------------------------
          Variable |   Complete   Incomplete   Imputed |     Total
-------------------+-----------------------------------+----------
               age |        142           12        12 |       154
               bmi |        126           28        28 |       154
------------------------------------------------------------------
(Complete + Incomplete = Total; Imputed is the minimum across m
 of the number of filled-in observations.)

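Imputations = 5 and added = 5 indicate that five imputations were added for the variables that have missing values in the observed data. Because the data are in wide style, the imputed values are stored in new variables with the prefix _x_, where x represents the imputation number (i.e., 1, 2, …, 5); for example, _1_bmi holds the first imputation of bmi.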
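The imputed variables can be inspected with built-in commands. The following is a minimal sketch that repeats the setup and the imputation from step 5.1 so that it runs on its own, then lists the new variables and summarizes each imputation.

```stata
* Set up and impute as in Steps 3 and 5.1
webuse mheart5, clear
mi set wide
mi register regular female attack smokes hsgrad
mi register imputed bmi age
mi impute chained (regress) age bmi = attack smokes hsgrad female, add(5) rseed(9478)

* List the imputed copies of bmi created in wide style (_1_bmi, ..., _5_bmi)
describe _?_bmi

* Summarize age and bmi in the original data (m = 0) and in each imputation
mi xeq 0/5: summarize age bmi
```

In the m = 0 summary the number of observations reflects the observed data only (142 for age, 126 for bmi), while each imputation should show 154 observations for both variables.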

5.2. Check the imputation. To compare the distribution of the imputed values with that of the observed and the completed (i.e., observed plus imputed) data, use the user-written midiagplots command, which can be located and installed by typing findit midiagplots in the command line. In sum, type the following commands:

findit midiagplots
midiagplots bmi, m(1/5) combine

Stata will give us the following plots.

The above plots represent the distribution of BMI (this could be done for any of the imputed variables) and suggest a good overlap between observed and completed data.

STEP 6: Analyzing the Imputed Data

6.1. Perform your analysis. Use mi estimate to perform your desired statistical analysis on the imputed dataset. For example, to run a logit model, you would use: 

mi estimate, or: logit attack bmi age female smokes hsgrad

Interpretation:

In the above command, we use logistic regression to model the odds of a heart attack, fitting the model in each of the 5 imputed datasets and pooling the results.

The “mi estimate” prefix first runs the estimation command on each of the imputations separately. It then combines the results and displays the combined output.

Using the “or” option will present odds ratios following a logistic regression.

Stata gives us the following output table:

Multiple-imputation estimates                   Imputations       =          5
Logistic regression                             Number of obs     =        154
                                                Average RVI       =     0.0454
                                                Largest FMI       =     0.1607
DF adjustment:   Large sample                   DF:     min       =     175.09
                                                        avg       = 2243117.63
                                                        max       =   1.30e+07
Model F test:       Equal FMI                   F(   5, 5615.2)   =       3.13
Within VCE type:          OIM                   Prob > F          =     0.0080
------------------------------------------------------------------------------
      attack | Odds ratio   Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         bmi |   1.094327   .0525146     1.88   0.062     .9954409    1.203037
         age |   1.028546    .016244     1.78   0.075      .997173    1.060906
      female |   .8829223   .3608274    -0.30   0.761     .3963305    1.966924
      smokes |   3.138241   1.100922     3.26   0.001     1.577892    6.241591
      hsgrad |   1.160312   .4648238     0.37   0.711     .5291435    2.544345
       _cons |   .0097271   .0156492    -2.88   0.004     .0004141    .2284712
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.
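The combining step performed by mi estimate can be examined with its vartable option, which reports the within-imputation and between-imputation variance of each coefficient together with the relative variance increase (RVI) and the fraction of missing information (FMI). The following is a minimal sketch, assuming the same setup and imputation as in Steps 3 and 5.1; nocitable suppresses the usual coefficient table so that only the variance table is shown.

```stata
* Set up and impute as in Steps 3 and 5.1
webuse mheart5, clear
mi set wide
mi register regular female attack smokes hsgrad
mi register imputed bmi age
mi impute chained (regress) age bmi = attack smokes hsgrad female, add(5) rseed(9478)

* Show how the variance of each estimate splits into
* within-imputation and between-imputation components
mi estimate, vartable nocitable: logit attack bmi age female smokes hsgrad
```

Coefficients for the imputed variables (bmi and age) would be expected to show the largest between-imputation variance, since the other predictors are fully observed.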

3. References / Useful Resources

Azur, Melissa J., Stuart, Elizabeth A., Frangakis, Constantine & Leaf, Philip J. (2011), Multiple imputation by chained equations: What is it and how does it work? International Journal of Methods in Psychiatric Research, 20(1), 40-49.
 
Cain, M. (2020). Analyzing data with missing values using multiple imputation. Stata. Available at: https://www.stata.com/training/webinar_series/multiple-imputation/mi_slides.pdf
 
DSS Data Analysis Guides. Available at: https://library.princeton.edu/dss/training
 
Medeiros, R. (2016). Handling missing data in Stata: Imputation and likelihood-based approaches. Swiss Stata Users Group meeting. Available at: https://www.stata.com/meeting/switzerland16/slides/medeiros-switzerland16.pdf
 
Stata manual. mi impute — Impute missing values. Available at: https://www.stata.com/manuals13/mimiimpute.pdf
 
Stata. Multiple Imputation. Available at: https://www.stata.com/features/multiple-imputation/
 
UCLA Statistical Methods and Data Analytics. Multiple Imputation in Stata. Available at: https://stats.oarc.ucla.edu/stata/seminars/mi_in_stata_pt1_new/
 
White, Ian R., Royston, Patrick & Wood, Angela M. (2011), Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine, 30, 377-399.

Data Consultant

Muhammad Al Amin
He/Him/His
Contact:
Firestone Library, A-12-F.1
609-258-6051

Data Consultant

Yufei Qin
Contact:
Firestone Library, A.12F.2
6092582519

Comments or Questions?

If you have questions or comments about this guide or method, please email data@Princeton.edu.