# Missing Data: Multiple Imputation in Stata

This guide discusses multiple imputation techniques for missing data using Stata.

## 1. Introduction

Stata offers several methods for imputation, including single imputation, multiple imputation, and interpolation. The choice of method depends on the nature of your data and the type of analysis you intend to perform. This guide provides step-by-step instructions for conducting multiple imputation of missing data using Stata.

Multiple imputation is one of the most robust and widely used statistical techniques for dealing with missing data. In multiple imputation, the distribution of the observed data is used to estimate a set of plausible values for each missing value. The missing values are replaced by these plausible values to create several “complete” datasets, each of which is analyzed separately before the results are combined.

## 2. Multiple Imputation in Stata

STEP 1: Preparing Your Data

1.1. Load your dataset. For this tutorial, we will use "mheart5.dta", a data file available from StataCorp. Type:

webuse "mheart5.dta"

1.2. Explore missing data. To examine the missing data pattern, type:

misstable sum, gen(miss_)

The “misstable” command with the “gen()” option generates indicators for missingness. These new variables are added to the data file and start with the prefix miss_.

Stata will give us the following table.

| Variable | Obs=. | Obs>. | Obs<. | Unique values (Obs<.) | Min (Obs<.) | Max (Obs<.) |
|---|---|---|---|---|---|---|
| age | 12 | | 142 | 142 | 20.73613 | 83.78423 |
| bmi | 28 | | 126 | 126 | 17.22643 | 38.24214 |

The Obs=. column represents the number of missing values for each variable. Variables that are not listed in the table have no missing values. The Obs>. column, empty here, counts extended missing values (.a, .b, …, .z).

The Obs<. column represents the number of observed values for each variable.

1.3. You may tabulate the new indicator variables as an additional check. Type:

tab1 miss_age miss_bmi

Stata will give us the following table.

| (bmi>=.) | Freq. | Percent | Cum. |
|---|---|---|---|
| 0 | 126 | 81.82 | 81.82 |
| 1 | 28 | 18.18 | 100.00 |
| Total | 154 | 100.00 | |

Indicator variables miss_age and miss_bmi were added to the data file in step 1.2. A value of 1 indicates that the observation is missing a value for that variable; a value of 0 indicates that the value is observed. Twelve observations are missing information on age, and 28 are missing information on BMI.
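As a complement to the checks in steps 1.2 and 1.3, the sketch below counts the missing values directly and shows how missingness in age and bmi co-occurs. It assumes the mheart5 data used in this guide; the variable name nmiss is introduced here for illustration only and is not part of the original workflow.

```stata
* Load the example data used in this guide
webuse mheart5, clear

* Count missing values in each variable
count if missing(age)
count if missing(bmi)

* Show which combinations of age and bmi are missing together
misstable patterns age bmi, frequency

* Number of missing values per observation (nmiss is a hypothetical name)
egen nmiss = rowmiss(age bmi)
tabulate nmiss
```

If many observations are missing both variables at once, that is worth knowing before choosing predictors for the imputation model.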

STEP 2: Consider the Mechanism of Missingness

Decide which variables need imputation. Not all missing data needs to be imputed; sometimes, missingness is informative. Multiple imputation is appropriate when data are missing completely at random (MCAR) or missing at random (MAR). It would be difficult to perform a legitimate analysis if data are missing not at random (MNAR).
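The three mechanisms can be stated more formally. The notation below is an assumption introduced for illustration (it does not appear elsewhere in this guide): R is the missingness indicator, and the data Y are split into an observed part and a missing part.

$$
\begin{aligned}
\text{MCAR:}\quad & \Pr(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = \Pr(R) \\
\text{MAR:}\quad & \Pr(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = \Pr(R \mid Y_{\text{obs}}) \\
\text{MNAR:}\quad & \Pr(R \mid Y_{\text{obs}}, Y_{\text{mis}}) \text{ depends on } Y_{\text{mis}} \text{ even after conditioning on } Y_{\text{obs}}
\end{aligned}
$$

Because the MNAR condition involves the unobserved values themselves, the observed data can provide evidence against MCAR but cannot distinguish MAR from MNAR.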

2.1. Logistic regression models can be used to examine whether any of the variables in the data file predict missingness. If they do, the data are not MCAR and are more plausibly treated as MAR. To run the logit model, type:

logit miss_bmi attack smokes age female hsgrad

Stata will give us the following output table.

| miss_bmi | Coefficient | Std. err. | z | P>\|z\| | 95% CI lower | 95% CI upper |
|---|---|---|---|---|---|---|
| attack | .0101071 | .5775173 | 0.02 | 0.986 | -1.121806 | 1.14202 |
| smokes | .1965135 | .5739319 | 0.34 | 0.732 | -.9283723 | 1.321399 |
| age | -.0485561 | .0244407 | -1.99 | 0.047 | -.096459 | -.0006532 |
| female | .0892789 | .6256756 | 0.14 | 0.887 | -1.137023 | 1.315581 |
| hsgrad | .3940007 | .6888223 | 0.57 | 0.567 | -.9560662 | 1.744068 |
| _cons | .1414761 | 1.423355 | 0.10 | 0.921 | -2.648249 | 2.931201 |

The variable age is statistically significantly associated with miss_bmi (p = 0.047), suggesting that the data are MAR rather than MCAR.

2.2. Run a second logistic model to check whether any of the fully observed variables is statistically significantly associated with missingness of age. Type:

logit miss_age attack smokes female hsgrad

Stata will give us the following results.

| miss_age | Coefficient | Std. err. | z | P>\|z\| | 95% CI lower | 95% CI upper |
|---|---|---|---|---|---|---|
| attack | -1.035628 | .7108815 | -1.46 | 0.145 | -2.42893 | .3576738 |
| smokes | .2788896 | .6369393 | 0.44 | 0.661 | -.9694886 | 1.527268 |
| female | -.0059384 | .7025713 | -0.01 | 0.993 | -1.382953 | 1.371076 |
| hsgrad | .5426292 | .8029777 | 0.68 | 0.499 | -1.031178 | 2.116437 |
| _cons | -2.649692 | .7993453 | -3.31 | 0.001 | -4.21638 | -1.083004 |

None of the p-values for the predictors is less than 0.05 (only the constant is significant), indicating that none of these variables is statistically significantly associated with missingness of age.
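The individual z-tests can be supplemented with a joint Wald test of all predictors at once. This is a minimal sketch, not part of the original workflow, using Stata's testparm command after logit; the missingness indicators are recreated here so that the block runs on its own.

```stata
* Recreate the data and the missingness indicators from Step 1
webuse mheart5, clear
misstable summarize, generate(miss_)

* Does any fully observed variable predict missingness of age?
logit miss_age attack smokes female hsgrad

* Joint Wald test that all four coefficients are zero
testparm attack smokes female hsgrad
```

A large p-value from the joint test is consistent with the conclusion drawn from the individual coefficients, although it does not prove that age is MCAR.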

2.3. T-tests may also be informative in evaluating whether the values of other variables differ between the missing and the non-missing groups. Type:

foreach var of varlist attack smokes age female hsgrad {
    ttest `var', by(miss_bmi)
}

Results of the t-test of age by miss_bmi are presented below. Results for the other variables are omitted for brevity.

- Ha: diff < 0: Pr(T < t) = 0.9803
- Ha: diff != 0: Pr(|T| > |t|) = 0.0394
- Ha: diff > 0: Pr(T > t) = 0.0197

The t-test suggests a statistically significant relationship between missingness of BMI and age. T-tests between missingness of BMI and the other variables (i.e., attack, smokes, female and hsgrad) were not statistically significant.
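When there are many variables, the full t-test output can be long. The following sketch runs the same tests quietly and prints only the two-sided p-value that ttest stores in r(p). It assumes the mheart5 data and recreates the miss_bmi indicator so that the block is self-contained.

```stata
* Recreate the data and the missingness indicators from Step 1
webuse mheart5, clear
misstable summarize, generate(miss_)

* Run each t-test quietly and report only the two-sided p-value
foreach var of varlist attack smokes age female hsgrad {
    quietly ttest `var', by(miss_bmi)
    display "`var'" _col(12) "two-sided p = " %6.4f r(p)
}
```

The compact listing makes it easier to spot which variables differ between observations with and without a recorded BMI.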

STEP 3: Setting-up Multiple Imputation

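3.1. Declare the data as mi data. Prior to imputation, the data must be declared as mi data using the “mi set” command. This step tells Stata how the additional imputations should be stored; here we use the wide style, in which each imputation is stored as a set of additional variables. Type: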

mi set wide

3.2. Register variables for imputation. Variables in the data set have to be registered using the “mi register” command. “mi register regular” specifies the variables that should not be imputed (either because they have no missing values or because there is no need). “mi register imputed” specifies the variables to be imputed in the procedure. Type:

mi register regular female attack smokes hsgrad
mi register imputed bmi age
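Before imputing, it can help to confirm that the declaration and registration took effect. The sketch below repeats the setup on a fresh copy of mheart5 and then uses mi describe and mi misstable summarize to review it; these checks are an optional addition, not part of the original steps.

```stata
* Repeat the setup from steps 3.1 and 3.2 on a fresh copy of the data
webuse mheart5, clear
mi set wide
mi register regular female attack smokes hsgrad
mi register imputed bmi age

* Review the mi style, the registered variables, and the number of imputations (0 so far)
mi describe

* Missing-value summary of the original (m = 0) data
mi misstable summarize
```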

STEP 4: Choosing an Imputation Method

- Choose an appropriate method for each type of variable you want to impute. For instance, choose regress for continuous variables (e.g., age, bmi), logit for binary variables (e.g., gender), and mlogit for nominal variables with more than two categories (e.g., political parties).

- Stata offers mi impute chained (MICE) for a flexible approach that can handle different types of variables (e.g., continuous, binary, ordinal, nominal, truncated, count) in one command.

- mi impute chained is an iterative process. The variable with the fewest missing values is imputed first followed by the variable with the next fewest missing values and so on for the rest of the variables.
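To illustrate how different methods can be combined in one mi impute chained call, here is a minimal sketch on simulated data. Everything in it is hypothetical: the variables income, employed, and region, the data-generating formulas, the seeds, and the 10% missingness rate are made up for illustration and are unrelated to mheart5. Run it from a do-file, since it uses /// line continuation.

```stata
* Simulate a small dataset (all names and values are hypothetical)
clear
set seed 2024
set obs 500
generate female   = runiform() < 0.5
generate age      = 20 + 50*runiform()
generate income   = 30 + 0.5*age + rnormal(0, 10)
generate employed = runiform() < invlogit(-1 + 0.03*age)
generate region   = 1 + floor(3*runiform())

* Set roughly 10% of each target variable to missing
replace income   = . if runiform() < 0.10
replace employed = . if runiform() < 0.10
replace region   = . if runiform() < 0.10

* Declare, register, and impute with one method per variable type
mi set wide
mi register regular age female
mi register imputed income employed region
mi impute chained (regress) income (logit) employed (mlogit) region ///
    = age female, add(5) rseed(12345)
```

Each method in parentheses applies to the variables that follow it, so continuous, binary, and nominal variables are imputed with regress, logit, and mlogit respectively within the same chained procedure.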

STEP 5: Imputing the Missing Data

5.1. Perform the imputation. To impute missing values using chained equations for 5 imputations, type:

mi impute chained (regress) age bmi = attack smokes hsgrad female, add(5) rseed(9478) replace

Interpretation of the code:

mi impute chained: The command for imputation by chained equations, which can handle a mix of variable types.

(regress) age bmi: Both BMI and age, the variables to be imputed, are continuous variables. Therefore, regress is specified as the method.

attack smokes hsgrad female: These are the regular (fully observed) variables, which are not imputed but serve as predictors in the imputation models.

add(5): The number of imputations to add (5).

rseed(9478): The “rseed()” option sets the random-number seed so that the results are reproducible.

replace: Replaces the imputed values in any existing imputations with new ones.

Stata will give us the following table.

Observations per m:

| Variable | Complete | Incomplete | Imputed | Total |
|---|---|---|---|---|
| age | 142 | 12 | 12 | 154 |
| bmi | 126 | 28 | 28 | 154 |

(Complete + Incomplete = Total; Imputed is the minimum across m of the number of filled-in observations.)

In the output header, Imputations = 5 and added = 5 indicate that five imputations were added for the variables with missing values. Because the data are stored in the wide style, the imputed versions of each variable are saved as new variables with the prefix _x_, where x represents the imputation number (1, 2, …, 5); for example, _1_bmi.
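One way to inspect the result is to summarize the imputed variables in the original data and in selected imputations. The sketch below repeats the setup and imputation from Steps 3 and 5 (without the replace option, since no imputations exist yet) and then uses mi xeq to run summarize on m = 0 (the original data), m = 1, and m = 5; the choice of which imputations to display is arbitrary.

```stata
* Repeat the setup and imputation on a fresh copy of the data
webuse mheart5, clear
mi set wide
mi register regular female attack smokes hsgrad
mi register imputed bmi age
mi impute chained (regress) age bmi = attack smokes hsgrad female, add(5) rseed(9478)

* Confirm that five imputations are now stored
mi describe

* Compare the original data (m = 0) with imputations 1 and 5
mi xeq 0 1 5: summarize age bmi
```

In m = 0 the counts should match the observed values (142 for age, 126 for bmi), while each imputation should report all 154 observations.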

5.2. Check the imputation. To compare the distribution of the imputed values with that of the observed and the completed (i.e., observed plus imputed) data, use the user-written midiagplots command, which can be located and installed by typing findit midiagplots in the command line. In sum, type the following commands:

findit midiagplots
midiagplots bmi, m(1/5) combine

Stata will give us the following plots.

The above plots represent the distribution of BMI (this could be done for any of the imputed variables) and suggest a good overlap between observed and completed data.
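If midiagplots is not available, a rough comparison can be drawn with built-in commands. The sketch below is an alternative, not the method used above: it relies on the wide style, where bmi holds the observed values and _1_bmi and _2_bmi hold the completed values for imputations 1 and 2, and overlays their kernel densities. Run it from a do-file, since it uses /// line continuation.

```stata
* Repeat the setup and imputation on a fresh copy of the data
webuse mheart5, clear
mi set wide
mi register regular female attack smokes hsgrad
mi register imputed bmi age
mi impute chained (regress) age bmi = attack smokes hsgrad female, add(5) rseed(9478)

* Overlay the observed BMI density with two completed-data densities
twoway (kdensity bmi) (kdensity _1_bmi) (kdensity _2_bmi), ///
    legend(order(1 "Observed" 2 "Completed, m = 1" 3 "Completed, m = 2")) ///
    xtitle("BMI") ytitle("Density")
```

Curves that lie close together suggest that the imputed values are plausible given the observed distribution; large departures would call for a closer look at the imputation model.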

STEP 6: Analyzing the Imputed Data

6.1. Perform your analysis. Use mi estimate to perform your desired statistical analysis on the imputed dataset. For example, to run a logit model, you would use:

mi estimate, or: logit attack bmi age female smokes hsgrad

Interpretation:

In the code above, we use a logistic regression to model the probability of a heart attack, pooling the results across the 5 imputed datasets.

The “mi estimate” prefix first runs the estimation command on each of the imputations separately. It then combines the results and displays the combined output.

The “or” option reports odds ratios instead of coefficients for the logistic regression.
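The combining step follows Rubin's rules. As a sketch, with notation introduced here for illustration (M imputations, with estimate and variance estimate from imputation m written as below):

$$
\bar{Q} = \frac{1}{M}\sum_{m=1}^{M}\hat{Q}_m, \qquad
\bar{U} = \frac{1}{M}\sum_{m=1}^{M} U_m, \qquad
B = \frac{1}{M-1}\sum_{m=1}^{M}\left(\hat{Q}_m - \bar{Q}\right)^2, \qquad
T = \bar{U} + \left(1 + \frac{1}{M}\right)B
$$

The pooled estimate is the average of the per-imputation estimates, and its total variance T adds the between-imputation variance B (inflated by 1 + 1/M) to the average within-imputation variance. With M = 5 as in this guide, the inflation factor is 1.2.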

Stata gives us the following output table:

## 3. References / Useful Resources

Azur, Melissa J., Stuart, Elizabeth A., Frangakis, Constantine & Leaf, Philip J. (2011), Multiple Imputation by Chained Equations: What is it and how does it work? International Journal of Methods in Psychiatric Research, 20(1), 40-49.

Cain, M. (2020). Analyzing data with missing values using multiple imputation. Stata. Available at: https://www.stata.com/training/webinar_series/multiple-imputation/mi_slides.pdf

DSS Data Analysis Guides. Available at: https://library.princeton.edu/dss/training

Medeiros, R. (2016). Handling missing data in Stata: Imputation and likelihood-based approaches. Swiss Stata Users Group meeting. Available at: https://www.stata.com/meeting/switzerland16/slides/medeiros-switzerland16.pdf

Stata manual. mi impute — Impute missing values. Available at: https://www.stata.com/manuals13/mimiimpute.pdf

Stata. Multiple Imputation. Available at: https://www.stata.com/features/multiple-imputation/

UCLA Statistical Methods and Data Analytics. Multiple Imputation in Stata. Available at: https://stats.oarc.ucla.edu/stata/seminars/mi_in_stata_pt1_new/

White, Ian R., Royston, Patrick & Wood, Angela M. (2011), Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine, 30, 377-399.

## Data Consultant

Muhammad Al Amin
He/Him/His
Contact:
Firestone Library, A-12-F.1
609-258-6051

## Data Consultant

Yufei Qin
Contact:
Firestone Library, A.12F.2
6092582519

## Comments or Questions?

If you have questions or comments about this guide or method, please email data@Princeton.edu.