Research Guides: Missing Data Imputation in Stata: Multiple Imputation Techniques

1. Introduction

Stata offers several methods for imputation, including single imputation, multiple imputation, and interpolation. The choice of method depends on the nature of your data and the type of analysis you intend to perform. This guide provides step-by-step instructions for conducting multiple imputation of missing data using Stata.

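Multiple imputation is one of the most robust and widely used statistical techniques for dealing with missing data. In multiple imputation, the distribution of the observed data is used to draw a set of plausible values for each missing value. Replacing the missing values with one set of draws gives a “complete” dataset. The process is repeated several times, the analysis is run on each completed dataset, and the results are combined so that the uncertainty about the imputed values is reflected in the final estimates.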

2. Multiple Imputation in Stata

STEP 1: Preparing Your Data

1.1. Load your dataset. For this tutorial, we will use "mheart5.dta", a data file available from StataCorp. Type:

webuse "mheart5.dta"

1.2. Explore missing data. To examine the missing data pattern, type:

misstable sum, gen(miss_)

The “misstable” command with the “gen()” option generates indicators for missingness. These new variables are added to the data file and start with the prefix miss_.

Stata will give us the following table.

                                                               Obs<.
                                                +------------------------------
               |                                | Unique
      Variable |     Obs=.     Obs>.     Obs<.  | values        Min         Max
  -------------+--------------------------------+------------------------------
           age |        12                 142  |    142   20.73613    83.78423
           bmi |        28                 126  |    126   17.22643    38.24214
  -----------------------------------------------------------------------------

The Obs=. column represents the number of missing values for each variable. If there is no entry for a variable, it has no missing values.

The Obs<. column represents the number of observed values for each variable.
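The way missingness combines across variables can be inspected as well. The following is a minimal sketch, assuming the same dataset is still in memory. The frequency option is used here to report counts rather than percentages:

* list each combination of missing and observed values for age and bmi
misstable patterns age bmi, frequency
* report whether missingness in one variable is nested within the other
misstable nested age bmi

A nested result is what a monotone missing-value pattern looks like, which matters later because chained imputation does not need to iterate when the pattern is monotone.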

1.3. You may tabulate the new indicator variables as an additional check. Type:

tab1 miss_age miss_bmi

Stata will give us the following table.

-> tabulation of miss_age

   (age>=.) |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        142       92.21       92.21
          1 |         12        7.79      100.00
------------+-----------------------------------
      Total |        154      100.00

-> tabulation of miss_bmi

   (bmi>=.) |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        126       81.82       81.82
          1 |         28       18.18      100.00
------------+-----------------------------------
      Total |        154      100.00

The indicator variables miss_age and miss_bmi were added to the data file in step 1.2. A value of 1 on these variables indicates that the observation is missing information on the specific variable; a value of 0 indicates that it is not missing. Here, 12 observations are missing information on age and 28 observations are missing information on BMI.
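To see how often the two variables are missing together, the indicators can be cross-tabulated. This is a small sketch that assumes miss_age and miss_bmi were created as in step 1.2:

* rows: missingness of age; columns: missingness of bmi
tabulate miss_age miss_bmi

Each cell counts the observations with that combination of missing and observed values, so an empty cell means that combination does not occur in the data.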

STEP 2: Consider the Mechanism of Missingness

Decide which variables need imputation. Not all missing data need to be imputed; sometimes, missingness is informative. Multiple imputation is appropriate when data are missing completely at random (MCAR) or missing at random (MAR). If data are missing not at random (MNAR), standard multiple imputation can give biased results, and it is difficult to perform a legitimate analysis.

2.1. Logistic regression models can be used to examine whether any of the variables in the data file predict missingness. If they do, this is evidence against MCAR and is consistent with MAR. To run the logit model, type:

logit miss_bmi attack smokes age female hsgrad

Stata will give us the following output table.

Iteration 0: Log likelihood = -49.994502
Iteration 1: Log likelihood = -47.73123
Iteration 2: Log likelihood = -47.614822
Iteration 3: Log likelihood = -47.614515
Iteration 4: Log likelihood = -47.614515

Logistic regression Number of obs = 142
LR chi2(5) = 4.76
Prob > chi2 = 0.4459
Log likelihood = -47.614515 Pseudo R2 = 0.0476

------------------------------------------------------------------------------
miss_bmi | Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
attack | .0101071 .5775173 0.02 0.986 -1.121806 1.14202
smokes | .1965135 .5739319 0.34 0.732 -.9283723 1.321399
age | -.0485561 .0244407 -1.99 0.047 -.096459 -.0006532
female | .0892789 .6256756 0.14 0.887 -1.137023 1.315581
hsgrad | .3940007 .6888223 0.57 0.567 -.9560662 1.744068
_cons | .1414761 1.423355 0.10 0.921 -2.648249 2.931201
------------------------------------------------------------------------------

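age is statistically significantly associated with miss_bmi (p = 0.047), suggesting that the data are MAR rather than MCAR. Note that the model as a whole is not significant (Prob > chi2 = 0.4459), so this evidence is modest.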

2.2. Run another logistic model to check whether any of the fully observed variables are statistically significantly associated with missingness of age. Type:

logit miss_age attack smokes female hsgrad

Stata will give us the following results.

Iteration 0: Log likelihood = -42.144379
Iteration 1: Log likelihood = -40.780233
Iteration 2: Log likelihood = -40.713422
Iteration 3: Log likelihood = -40.713172
Iteration 4: Log likelihood = -40.713172

Logistic regression Number of obs = 154
LR chi2(4) = 2.86
Prob > chi2 = 0.5811
Log likelihood = -40.713172 Pseudo R2 = 0.0340

------------------------------------------------------------------------------
miss_age | Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
attack | -1.035628 .7108815 -1.46 0.145 -2.42893 .3576738
smokes | .2788896 .6369393 0.44 0.661 -.9694886 1.527268
female | -.0059384 .7025713 -0.01 0.993 -1.382953 1.371076
hsgrad | .5426292 .8029777 0.68 0.499 -1.031178 2.116437
_cons | -2.649692 .7993453 -3.31 0.001 -4.21638 -1.083004
------------------------------------------------------------------------------

None of the predictors has a p-value below 0.05 (only the constant does), indicating that none of these variables is statistically significantly associated with missingness of age.

2.3. A t test may also be informative in evaluating whether the values of other variables differ between the missing and the non-missing groups. Type:

foreach var of varlist attack smokes age female hsgrad {
    ttest `var', by(miss_bmi)
}

The t test results for age by miss_bmi are presented below. Results for the other variables are not presented for brevity.

Two-sample t test with equal variances
------------------------------------------------------------------------------
Group | Obs Mean Std. err. Std. dev. [95% conf. interval]
---------+--------------------------------------------------------------------
0 | 126 57.14571 1.022929 11.48234 55.12121 59.17021
1 | 16 50.82253 2.810969 11.24388 44.83109 56.81397
---------+--------------------------------------------------------------------
Combined | 142 56.43324 .9727211 11.59131 54.51024 58.35624
---------+--------------------------------------------------------------------
diff | 6.323186 3.040682 .3115936 12.33478
------------------------------------------------------------------------------
diff = mean(0) - mean(1) t = 2.0795
H0: diff = 0 Degrees of freedom = 140

Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
Pr(T < t) = 0.9803 Pr(|T| > |t|) = 0.0394 Pr(T > t) = 0.0197

The t test suggests a statistically significant relationship between missingness of BMI and age (p = 0.0394): observations with missing BMI are younger on average. T tests between missingness of BMI and the other variables (i.e., attack, smokes, female and hsgrad) were not statistically significant.
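Two variations on these checks may be worth considering. The sketch below is not part of the original analysis. It assumes that attack, smokes, female and hsgrad are binary, so a chi-squared test is used for them, and it relaxes the equal-variance assumption for age:

* t test for age without assuming equal variances in the two groups
ttest age, by(miss_bmi) unequal

* chi-squared tests for the binary variables
foreach var of varlist attack smokes female hsgrad {
    tabulate `var' miss_bmi, chi2
}

With only 16 observations in the missing group for the age comparison, any of these tests has limited power, so a non-significant result is weak evidence of no relationship.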

STEP 3: Setting-up Multiple Imputation

3.1. Declare the MI style. Prior to imputation, the data must be declared as mi data using the “mi set” command. The style (here, wide) tells Stata how the additional imputations should be stored. Type:

mi set wide

3.2. Register variables for imputation. Variables in the data set have to be registered using the “mi register” command. “mi register regular” specifies the variables that should not be imputed (either because they have no missing values or because there is no need). “mi register imputed” specifies the variables to be imputed in the procedure. Type:

mi register regular female attack smokes hsgrad
mi register imputed bmi age
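Before imputing, it can help to confirm that the setup was recorded as intended. A minimal sketch, assuming the two mi register commands above have been run:

* show the mi style, the number of imputations, and the registered variables
mi describe
* summarize missing values in the original data (m = 0)
mi misstable summarize

At this point mi describe should report zero imputations, with bmi and age listed as imputed variables and the other four as regular variables.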

STEP 4: Choosing an Imputation Method

- Choose an appropriate method for each type of variable you want to impute. For instance, choose regress for continuous variables (e.g., age, bmi), logit for binary variables (e.g., gender), and mlogit for nominal variables with more than two categories (e.g., political party).

- Stata offers mi impute chained (MICE) for a flexible approach that can handle different types of variables (e.g., continuous, binary, ordinal, nominal, truncated, count) in one command.

- mi impute chained is an iterative process. By default, the variable with the fewest missing values is imputed first, followed by the variable with the next fewest missing values, and so on for the rest of the variables.
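The conditional models that mi impute chained would fit can be previewed before any values are imputed. The following is a sketch under the assumption that the data have been mi set and registered as in Step 3. The dryrun option asks Stata to display the specifications without imputing:

* preview the conditional models and the order of imputation; nothing is imputed
mi impute chained (regress) age bmi = attack smokes hsgrad female, dryrun

This is a convenient way to check which predictors enter each equation before committing to a full run.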

STEP 5: Imputing the Missing Data

5.1. Perform the imputation. To impute missing values using chained equations for 5 imputations, type:

mi impute chained (regress) age bmi = attack smokes hsgrad female, add(5) rseed(9478) replace

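Interpretation of the code:

mi impute chained: the command for imputation using chained equations, which can handle a mix of variable types.

(regress) age bmi: Both BMI and age, the variables to be imputed, are continuous variables. Therefore, regress is specified as the method.

attack smokes hsgrad female: These are regular (fully observed) variables. They are not imputed but are used as predictors in the imputation models.

add(5): The number of imputations to add (5).

rseed(9478): The “rseed()” option sets the random-number seed so that the results are reproducible.

replace: Replaces the imputed values in any existing imputations with new ones. It has no effect here because no imputations existed before this command.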

Stata will give us the following table.

note: missing-value pattern is monotone; no iteration performed.

Conditional models (monotone):
age: regress age attack smokes hsgrad female
bmi: regress bmi age attack smokes hsgrad female

Performing chained iterations ...

Multivariate imputation Imputations = 5
Chained equations added = 5
Imputed: m=1 through m=5 updated = 0

Initialization: monotone Iterations = 0
burn-in = 0

age: linear regression
bmi: linear regression

------------------------------------------------------------------
| Observations per m
|----------------------------------------------
Variable | Complete Incomplete Imputed | Total
-------------------+-----------------------------------+----------
age | 142 12 12 | 154
bmi | 126 28 28 | 154
------------------------------------------------------------------
(Complete + Incomplete = Total; Imputed is the minimum across m
of the number of filled-in observations.)

Imputations = 5 and added = 5 indicate that five imputations were added for the variables that have missing values in the observed dataset. Because the data are stored in the wide style, the new variables are named with a prefix _x_, where x represents the imputation number (i.e., 1, 2, …, 5); for example, _1_bmi holds BMI for the first imputation.
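The imputed values can be inspected directly. A minimal sketch, assuming the imputation command above has completed, uses mi xeq to run the same command on selected imputations, where 0 refers to the original data:

* summarize age and bmi in the original data and in imputations 1 and 5
mi xeq 0 1 5: summarize age bmi

In the original data (m = 0) the observation counts reflect the missing values, while in each imputation both variables should have 154 observations. Implausible minimum or maximum values in an imputation are a signal to revisit the imputation model.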

5.2. Check the imputation. To compare the distribution of the imputed values to that of the observed and the completed (i.e., observed and imputed) data, use the midiagplots command. It is a user-written command that can be located and installed by typing findit midiagplots in the command line. In sum, type the following code:

findit midiagplots
midiagplots bmi, m(1/5) combine

Stata will give us the following plots.

The plots show the distribution of BMI (the same can be done for any of the imputed variables) and suggest a good overlap between the observed and completed data.
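If midiagplots is not available, a rough version of the same comparison can be drawn with built-in commands. This is a sketch that assumes the wide style set in step 3.1, where bmi holds the observed values and _1_bmi holds the completed values from the first imputation:

* overlay the density of observed BMI and completed BMI from imputation 1
twoway (kdensity bmi) (kdensity _1_bmi), legend(order(1 "Observed" 2 "Completed (m=1)"))

Replacing _1_bmi with _2_bmi through _5_bmi repeats the check for the other imputations.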

STEP 6: Analyzing the Imputed Data

6.1. Perform your analysis. Use mi estimate to perform your desired statistical analysis on the imputed dataset. For example, to run a logit model, you would use:

mi estimate, or: logit attack bmi age female smokes hsgrad

Interpretation:

In the above code, we use a logistic regression to model the probability of a heart attack, pooling the results across the 5 imputed datasets.

The “mi estimate” prefix first runs the estimation command on each of the imputations separately. It then combines the results and displays the combined output.

The “or” option presents the results of the logistic regression as odds ratios.

Stata gives us the following output table:

Multiple-imputation estimates Imputations = 5
Logistic regression Number of obs = 154
Average RVI = 0.0454
Largest FMI = 0.1607
DF adjustment: Large sample DF: min = 175.09
avg = 2243117.63
max = 1.30e+07
Model F test: Equal FMI F( 5, 5615.2) = 3.13
Within VCE type: OIM Prob > F = 0.0080

------------------------------------------------------------------------------
attack | Odds ratio Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
bmi | 1.094327 .0525146 1.88 0.062 .9954409 1.203037
age | 1.028546 .016244 1.78 0.075 .997173 1.060906
female | .8829223 .3608274 -0.30 0.761 .3963305 1.966924
smokes | 3.138241 1.100922 3.26 0.001 1.577892 6.241591
hsgrad | 1.160312 .4648238 0.37 0.711 .5291435 2.544345
_cons | .0097271 .0156492 -2.88 0.004 .0004141 .2284712
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.
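To see how much the missing data contribute to the uncertainty of each coefficient, mi estimate can report additional tables. A minimal sketch, using the same model as above with the vartable and dftable options added:

* vartable: within- and between-imputation variance, RVI and FMI for each coefficient
* dftable: degrees of freedom and percentage increase in standard errors
mi estimate, or vartable dftable: logit attack bmi age female smokes hsgrad

Coefficients with a large fraction of missing information (FMI) are the ones most affected by imputation. If these values are high, adding more imputations is one possible response.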

3. References / Useful Resources

Azur, Melissa J., Stuart, Elizabeth A., Frangakis, Constantine & Leaf, Philip J. (2011), Multiple imputation by chained equations: What is it and how does it work? International Journal of Methods in Psychiatric Research, 20(1), 40-49.

Cain, M. (2020). Analyzing data with missing values using multiple imputation. Stata. Available at: https://www.stata.com/training/webinar_series/multiple-imputation/mi_slides.pdf

DSS Data Analysis Guides. Available at: https://libguides.princeton.edu/c.php?g=1415215

Medeiros, R. (2016). Handling missing data in Stata: Imputation and likelihood-based approaches. Swiss Stata Users Group meeting. Available at: https://www.stata.com/meeting/switzerland16/slides/medeiros-switzerland16.pdf

Stata manual. mi impute — Impute missing values. Available at: https://www.stata.com/manuals13/mimiimpute.pdf

Stata. Multiple Imputation. Available at: https://www.stata.com/features/multiple-imputation/

UCLA Statistical Methods and Data Analytics. Multiple Imputation in Stata. Available at: https://stats.oarc.ucla.edu/stata/seminars/mi_in_stata_pt1_new/

White, Ian R., Royston, Patrick & Wood, Angela M. (2011), Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine, 30, 377-399.

Data Consultant

Muhammad Al Amin (He/Him/His)

Contact: Firestone Library, A-12-F.1, 609-258-6051

Data Consultant

Yufei Qin

Contact: Firestone Library, A.12F.2, 609-258-2519
