Stata offers several methods for imputation, including single imputation, multiple imputation, and interpolation. The choice of method depends on the nature of your data and the type of analysis you intend to perform. This guide provides step-by-step instructions for conducting multiple imputation of missing data using Stata.
Multiple imputation is one of the most robust and widely used statistical techniques for dealing with missing data. It uses the distribution of the observed data to draw several sets of plausible values for the missing data. Each set replaces the missing values to create a “complete” dataset; the analysis is then run on each completed dataset and the results are combined.
STEP 1: Preparing Your Data
1.1. Load your dataset. For this tutorial, we will use "mheart5.dta", a data file available from StataCorp. Type:
webuse "mheart5.dta"
1.2. Explore missing data. To examine the missing-data pattern, type:
misstable sum, gen(miss_)
The “misstable” command with the “gen()” option generates indicators for missingness. These new variables are added to the data file and start with the prefix miss_.
Stata will give us the following table.
The Obs=. column reports the number of missing values for each variable. A variable that does not appear in the table has no missing values.
The Obs<. column reports the number of observed (nonmissing) values for each variable.
1.3. You may tabulate the new indicator variables as an additional check. Type:
tab1 miss_age miss_bmi
Stata will give us the following table.
Indicator variables miss_age and miss_bmi were added to the data file in step 1.2. A value of 1 on these variables indicates that the observation is missing information on the corresponding variable; a value of 0 indicates that it is not missing. 12 observations are missing information on age, and 28 observations are missing information on BMI.
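The checks in steps 1.2 and 1.3 can also be combined with a look at which variables are missing together. The following is a minimal sketch, assuming a fresh copy of mheart5; it recreates the miss_ indicators, lists the joint missingness patterns of age and bmi with misstable patterns, and counts the flagged observations.

```stata
* Reload the example data and recreate the missingness indicators
webuse mheart5, clear
misstable summarize, generate(miss_)

* Show which combinations of age and bmi are missing, as frequencies
misstable patterns age bmi, frequency

* Count the observations flagged as missing on each variable
count if miss_age == 1
count if miss_bmi == 1
```

The two counts should match the tabulations from step 1.3.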
STEP 2: Consider the Mechanism of Missingness
Decide which variables need imputation. Not all missing data need to be imputed; sometimes, missingness is informative. Multiple imputation is appropriate when data are missing completely at random (MCAR) or missing at random (MAR). If data are missing not at random (MNAR), missingness depends on the unobserved values themselves, and it is difficult to perform a legitimate analysis with standard multiple imputation.
2.1. Logistic regression models can be used to examine whether any of the variables in the data file predict missingness. If they do, this is evidence against MCAR and is consistent with MAR. (The observed data cannot distinguish MAR from MNAR.) To run the logit model, type:
logit miss_bmi attack smokes age female hsgrad
Stata will give us the following output table.
age is statistically significantly associated with miss_bmi, suggesting that the missingness of BMI is MAR rather than MCAR.
2.2. Run another logistic model to check whether the remaining variables (all except BMI) are statistically significantly associated with missingness of age. Type:
logit miss_age attack smokes female hsgrad
Stata will give us the following results.
None of the p-values is less than 0.05, indicating that none of these variables is statistically significantly associated with missingness of age.
2.3. T-tests may also be informative in evaluating whether the values of other variables differ between the missing and the non-missing groups. Type:
foreach var of varlist attack smokes age female hsgrad {
    ttest `var', by(miss_bmi)
}
Results for the t-test of age by miss_bmi are presented below. Results for the t-tests of the other variables by miss_bmi are not presented for brevity.
The t-test suggests a statistically significant relationship between missingness of BMI and age. T-tests between missingness of BMI and the other variables (i.e., attack, smokes, female, and hsgrad) were not statistically significant.
STEP 3: Setting Up Multiple Imputation
3.1. Declare MI set. Prior to imputation, the data must be declared as mi data using the “mi set” command, which tells Stata how the additional imputations should be stored. This guide uses the wide style, in which each imputed variable gets one additional column per imputation. Type:
mi set wide
3.2. Register variables for imputation. Variables in the data set have to be registered using the “mi register” command. “mi register regular” specifies the variables that should not be imputed (either because they have no missing values or because there is no need). “mi register imputed” specifies the variables to be imputed in the procedure. Type:
mi register regular female attack smokes hsgrad
mi register imputed bmi age
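Before imputing, it can help to confirm that the declaration and registration took effect. The following is a minimal sketch, assuming a fresh copy of mheart5; it repeats the Step 3 commands and then uses mi describe to report the style, the number of imputations (zero at this point), and the registered variables.

```stata
* Reload the example data so the sketch is self-contained
webuse mheart5, clear

* Declare the mi style and register the variables as in Step 3
mi set wide
mi register regular female attack smokes hsgrad
mi register imputed bmi age

* Report the style, the number of imputations, and the registered variables
mi describe
```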
STEP 4: Choosing an Imputation Method
- Choose an appropriate method for each type of variable you want to impute. For instance, choose regress for continuous variables (e.g., age, bmi), logit for binary variables (e.g., gender), and mlogit for nominal variables with more than two categories (e.g., political party).
- Stata offers mi impute chained (MICE) for a flexible approach that can handle different types of variables (e.g., continuous, binary, ordinal, nominal, truncated, count) in one command.
- mi impute chained is an iterative process. By default, the variable with the fewest missing values is imputed first, followed by the variable with the next fewest missing values, and so on for the rest of the variables.
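To illustrate how different methods can be combined in one mi impute chained command, the following is a minimal sketch, assuming a fresh copy of mheart5. The choice of predictive mean matching (pmm) with five nearest neighbors for bmi is an assumption made for illustration only; it is not the specification used in Step 5. Run it on a separate copy of the data rather than in sequence with the other commands in this guide.

```stata
* Reload and set up the example data so the sketch is self-contained
webuse mheart5, clear
mi set wide
mi register imputed age bmi
mi register regular attack smokes female hsgrad

* Illustrative choice: bmi by predictive mean matching, age by linear regression
mi impute chained (pmm, knn(5)) bmi (regress) age = attack smokes hsgrad female, add(5) rseed(9478)
```

Each method appears in parentheses before the variables it applies to, so additional variable types (for example, a binary variable imputed with logit) would be added in the same way.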
STEP 5: Imputing the Missing Data
5.1. Perform the imputation. To impute missing values using chained equations for 5 imputations, type:
mi impute chained (regress) age bmi = attack smokes hsgrad female, add(5) rseed(9478) replace
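Interpretation of the command:
mi impute chained: command for imputation using chained equations, which can handle a mix of variable types.
(regress) age bmi: Both BMI and age, the variables to be imputed, are continuous variables. Therefore, regress is specified as the method.
attack smokes hsgrad female: These are regular variables with no missing values; they are used as predictors in the imputation models and are not imputed themselves.
add(5): The number of imputations to add (5).
rseed(9478): The “rseed()” option sets the random-number seed so that the results can be reproduced.
replace: Replaces the imputed values in any imputations that already exist in the data; it makes no difference when the data contain no imputations yet.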
Stata will give us the following table.
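Imputations = 5 and added = 5 indicate that five imputations were added for the variables that have missing values in the observed dataset. Because the data are in wide style, the imputed versions are stored as new variables with the prefix _x_, where x represents the imputation number (i.e., 1, 2, …, 5); for example, _1_bmi holds BMI for the first imputation.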
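To see these new variables directly, the following is a minimal sketch, assuming a fresh copy of mheart5 and the same imputation command as in step 5.1 (without replace, since no imputations exist yet). It lists the imputed copies of bmi and compares the original variable with two of the completed versions.

```stata
* Reload, set up, and impute so the sketch is self-contained
webuse mheart5, clear
mi set wide
mi register imputed age bmi
mi register regular attack smokes female hsgrad
mi impute chained (regress) age bmi = attack smokes hsgrad female, add(5) rseed(9478)

* List the imputed copies of bmi created in wide style
describe _*_bmi

* The original bmi keeps its missing values; the imputed copies are complete
summarize bmi _1_bmi _5_bmi

* The same comparison through mi xeq (m = 0 is the observed data)
mi xeq 0 1 5: summarize bmi age
```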
5.2. Check the imputation. To compare the distribution of the imputed values with that of the observed and the completed (i.e., observed plus imputed) data, use the midiagplots command, which can be located for download by typing findit midiagplots in the command line. In sum, type the following commands:
findit midiagplots
midiagplots bmi, m(1/5) combine
Stata will give us the following plots.
The above plots represent the distribution of BMI (this could be done for any of the imputed variables) and suggest a good overlap between observed and completed data.
STEP 6: Analyzing the Imputed Data
6.1. Perform your analysis. Use mi estimate to perform your desired statistical analysis on the imputed dataset. For example, to run a logit model, you would use:
mi estimate, or: logit attack bmi age female smokes hsgrad
Interpretation:
In the above command, we use a logistic regression to model the probability of a heart attack, pooling the results across the 5 imputed datasets.
The “mi estimate” prefix first runs the estimation command on each of the imputations separately. It then combines the results using Rubin's rules and displays the combined output.
Using the “or” option will present odds ratios instead of coefficients for the logistic regression.
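The combining step can be checked by hand for the point estimates, because the pooled coefficient is the average of the coefficients from the individual imputations. The following is a minimal sketch, assuming a fresh copy of mheart5 and the imputation command from step 5.1; the scalar sum_b and the matrix b_mi are names introduced here for illustration.

```stata
* Reload, set up, and impute so the sketch is self-contained
webuse mheart5, clear
mi set wide
mi register imputed age bmi
mi register regular attack smokes female hsgrad
mi impute chained (regress) age bmi = attack smokes hsgrad female, add(5) rseed(9478)

* Fit the model on each completed dataset and accumulate the bmi coefficient
scalar sum_b = 0
forvalues m = 1/5 {
    preserve
    quietly mi extract `m', clear
    quietly logit attack bmi age female smokes hsgrad
    scalar sum_b = sum_b + _b[bmi]
    restore
}
display "Mean of per-imputation bmi coefficients: " sum_b/5

* The pooled coefficient stored by mi estimate should equal this mean
mi estimate, or: logit attack bmi age female smokes hsgrad
matrix b_mi = e(b_mi)
display "Pooled bmi coefficient from mi estimate: " b_mi[1,1]
```

The two displayed values should agree. The pooled standard errors are not a simple average, because they also include the variation between imputations.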
Stata gives us the following output table:
If you have questions or comments about this guide or method, please email data@Princeton.edu.