# Missing Data Imputation in R: Missing data R tutorial

A brief guide to handling missing data in R

## Missing Data Introduction

Missing data is a common issue encountered in research across various fields, including social sciences. It occurs when no data value is stored for the variable in an observation. It can happen for a wide range of reasons: from participants not answering certain questions in a survey to errors in data collection or transfer processes.

Missing data can significantly impact the analysis, potentially leading to biased estimates, reduced statistical power, and invalid conclusions. It's crucial for researchers and analysts to recognize the types of missing data, understand the mechanisms behind them, and apply appropriate methods for handling them.

## Identify Missing Data

We first need to identify where and how data is missing in our dataset. This initial step is crucial as it informs us about the nature of our missing data, helps us hypothesize about potential reasons for its absence, and guides our strategy for dealing with it effectively.

Understanding the Landscape of Missingness:

1. Quantifying Missing Data: Determine the extent of missing data, both at a variable level and observation level. Understanding the volume of missing data is critical as it impacts the choice of handling techniques.

2. Visualizing Missing Data Patterns: Plotting where values are missing can reveal patterns that suggest whether the missingness looks random or systematic.

3. Analyzing the Impact of Missing Data: Assess how missing values in one variable might relate to another, aiding in understanding if data is Missing Completely At Random (MCAR), Missing At Random (MAR), or Missing Not At Random (MNAR).
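As a rough sketch of step 3, the snippet below checks whether missingness in one variable is related to other, fully observed variables. It uses R's built-in 'airquality' data (the sample dataset for this guide); the helper names `ozone_missing`, `missing_by_month`, and `temp_by_missing` are illustrative only. Clear differences between groups would be evidence against MCAR, although observed data alone cannot separate MAR from MNAR.

```r
# Flag the rows where Ozone is missing
ozone_missing <- is.na(airquality$Ozone)

# Number of missing Ozone readings in each month
missing_by_month <- tapply(ozone_missing, airquality$Month, sum)
print(missing_by_month)

# Mean temperature on days with (FALSE) and without (TRUE) an Ozone reading
temp_by_missing <- tapply(airquality$Temp, ozone_missing, mean)
print(temp_by_missing)
```

If the counts are spread very unevenly across months, the missingness is probably not completely random.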

Sample Data

We are using the built-in dataset 'airquality' in R as a sample dataset. We can use the help command to access the codebook:

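```
# Load the airquality dataset
data("airquality")

# Open the help page, which serves as the codebook
help("airquality")

# View the first few rows of the dataset
head(airquality)

# Summary of the dataset to check for missing values
summary(airquality)

# View the data in spreadsheet format (needs an interactive session such as RStudio)
View(airquality)
```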

It is a data frame with 153 observations on 6 variables.

| Column | Variable | Type | Description |
|---|---|---|---|
| [,1] | Ozone | numeric | Ozone (ppb) |
| [,2] | Solar.R | numeric | Solar R (lang) |
| [,3] | Wind | numeric | Wind (mph) |
| [,4] | Temp | numeric | Temperature (degrees F) |
| [,5] | Month | numeric | Month (1--12) |
| [,6] | Day | numeric | Day of month (1--31) |

Even in the preview, some values appear as NA, so the dataset clearly contains missing data.

Next, we count how many values are missing in each variable.

```
# Count missing values in each column
sapply(airquality, function(x) sum(is.na(x)))
#>   Ozone Solar.R    Wind    Temp   Month     Day
#>      37       7       0       0       0       0

# Overall missing value count
sum(is.na(airquality))
#> [1] 44
```

In total, 44 values are missing from the dataset. The Ozone column has 37 missing values and the Solar.R column has 7, while the Wind, Temp, Month, and Day columns are complete.

We can also calculate the percentage of missing values in each column. This is especially useful for large, messy datasets.

```
# Calculate the percentage of missing values in each column
sapply(airquality, function(x) mean(is.na(x)) * 100)
#>     Ozone   Solar.R      Wind      Temp     Month       Day
#> 24.183007  4.575163  0.000000  0.000000  0.000000  0.000000
```

## Deleting Missing Rows

One of the simplest approaches to address missing data in a dataset is to delete observations (rows) that contain any missing values. This method, often referred to as "listwise deletion" or "complete case analysis," removes an entire record from the analysis if it is missing a value in one or more variables.

When to Consider Deleting Missing Rows:

• Minimal Missing Data: If the missing data is slight and seemingly random, eliminating those incomplete entries is unlikely to significantly affect the dataset's overall quality.
• MCAR Data: Deletion is most appropriate when the missing data is Missing Completely At Random (MCAR), meaning there's no systematic difference between the missing and observed values.

```
# Remove rows with any missing value
airquality_complete_cases <- na.omit(airquality)

# Count the remaining missing values
sum(is.na(airquality_complete_cases))
#> [1] 0

# Get the dimensions of the dataset (rows and columns)
dim(airquality_complete_cases)
#> [1] 111   6
```

The new dataset contains no missing values, but it keeps only 111 of the original 153 rows, so 42 observations were dropped.

Note that many statistical software packages and functions designed for linear regression and similar models handle missing data automatically. Typically, they exclude rows with missing values in any variable included in the model (listwise deletion). Unless you have specific needs, it may therefore be unnecessary to remove missing values manually.
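As a small illustration of this point, `lm()` applies its default `na.action` and drops incomplete rows itself, but only for the variables that appear in the formula. The object names `fit_default` and `fit_omit` are illustrative.

```r
# lm() silently drops rows with missing values in Ozone, Wind, or Temp
fit_default <- lm(Ozone ~ Wind + Temp, data = airquality)
nobs(fit_default)  # number of rows actually used

# na.omit() on the whole data frame also drops rows where only Solar.R is missing
fit_omit <- lm(Ozone ~ Wind + Temp, data = na.omit(airquality))
nobs(fit_omit)
```

Because Solar.R is not in the formula, the automatic approach keeps 116 rows while the manual approach keeps 111, so removing rows up front can discard more data than the model requires.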

## Imputation with Mean

When dealing with missing data, a common and straightforward approach is to fill in the missing values with the mean of the available values in the same variable. This method, known as "mean imputation," involves calculating the average of the non-missing values for each variable and substituting that average for the missing entries.

When to Consider Mean Imputation

• Suitable for MCAR: Mean imputation is most effective when the data is Missing Completely At Random (MCAR). It assumes that the missing values, on average, are similar to the observed ones.
• Preliminary Analysis: It can serve as a quick fix for preliminary analyses, allowing for a fuller utilization of the dataset, especially when dealing with minimal missingness.

Partially

We can choose to impute only certain columns if other columns just have trivial missingness.

```
# For a single column
# Create a copy of the original data frame
airquality_imputed_one <- airquality

# Replace missing Ozone values with the mean of the observed Ozone values
ozone_mean <- mean(airquality_imputed_one$Ozone, na.rm = TRUE)
airquality_imputed_one$Ozone[is.na(airquality_imputed_one$Ozone)] <- ozone_mean

# Count missing values in each column
sapply(airquality_imputed_one, function(x) sum(is.na(x)))
#>   Ozone Solar.R    Wind    Temp   Month     Day
#>       0       7       0       0       0       0
```

The dataset now has only the 7 missing values in the Solar.R column.

Whole Dataset

We can also impute every column in a dataset. Note that the code below only imputes numeric columns. Any non-numeric column is skipped, so missing values in such columns would remain.

```
# For all columns in a dataset
airquality_imputed <- airquality
for (i in seq_along(airquality_imputed)) {
  if (is.numeric(airquality_imputed[[i]])) {
    column_mean <- mean(airquality_imputed[[i]], na.rm = TRUE)
    airquality_imputed[[i]][is.na(airquality_imputed[[i]])] <- column_mean
  }
}

# Count missing values in each column
sapply(airquality_imputed, function(x) sum(is.na(x)))
#>   Ozone Solar.R    Wind    Temp   Month     Day
#>       0       0       0       0       0       0
```

Now this imputation is finished.

Let's now compare the original and imputed versions by overlaying their density plots.

```
# Density plots (ggplot2 must be installed)
library(ggplot2)

ggplot(airquality, aes(x = Ozone, fill = "Original")) +
  geom_density(alpha = 0.5, na.rm = TRUE) +
  geom_density(data = airquality_imputed, aes(x = Ozone, fill = "Imputed"), alpha = 0.5) +
  labs(title = "Density Plot of Ozone: Original vs. Imputed")
```

Compared with the original data, the imputed data shows a sharp density peak near the mean of Ozone (about 42 ppb). This is expected, because all 37 missing Ozone values were replaced by the same mean value.
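The effect seen in the density plot can also be checked numerically. The sketch below, with illustrative variable names, shows that mean imputation leaves the mean unchanged but reduces the standard deviation, because every imputed value sits exactly at the center of the distribution.

```r
ozone_original <- airquality$Ozone

# Mean-impute a copy of the Ozone column
ozone_imputed <- ozone_original
ozone_imputed[is.na(ozone_imputed)] <- mean(ozone_original, na.rm = TRUE)

# The mean is unchanged by construction
c(mean_original = mean(ozone_original, na.rm = TRUE),
  mean_imputed  = mean(ozone_imputed))

# The standard deviation shrinks after imputation
c(sd_original = sd(ozone_original, na.rm = TRUE),
  sd_imputed  = sd(ozone_imputed))
```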

Now we fit the same linear model to the complete cases and to the imputed data and compare the results.

```
# Linear model before imputation
lm_original <- lm(Ozone ~ Wind + Temp, data = na.omit(airquality))

# Linear model after imputation
lm_imputed <- lm(Ozone ~ Wind + Temp, data = airquality_imputed)
```
```
# Compare summaries
summary(lm_original)
#> Call:
#> lm(formula = Ozone ~ Wind + Temp, data = na.omit(airquality))
#>
#> Residuals:
#>     Min      1Q  Median      3Q     Max
#> -42.156 -13.216  -3.123  10.598  98.492
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -67.3220    23.6210  -2.850  0.00524 **
#> Wind         -3.2948     0.6711  -4.909 3.26e-06 ***
#> Temp          1.8276     0.2506   7.294 5.29e-11 ***
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> Residual standard error: 21.73 on 108 degrees of freedom
#> Multiple R-squared:  0.5814, Adjusted R-squared:  0.5736
#> F-statistic: 74.99 on 2 and 108 DF,  p-value: < 2.2e-16

summary(lm_imputed)
#> Call:
#> lm(formula = Ozone ~ Wind + Temp, data = airquality_imputed)
#>
#> Residuals:
#>     Min      1Q  Median      3Q     Max
#> -38.550 -13.998  -4.306  10.530 104.458
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -41.2159    19.3090  -2.135   0.0344 *
#> Wind         -2.5986     0.5543  -4.688 6.14e-06 ***
#> Temp          1.4024     0.2063   6.798 2.35e-10 ***
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> Residual standard error: 21.4 on 150 degrees of freedom
#> Multiple R-squared:  0.451, Adjusted R-squared:  0.4437
#> F-statistic: 61.62 on 2 and 150 DF,  p-value: < 2.2e-16
```

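Imputation did not change the sign or the significance of the coefficients. However, both slopes shrank toward zero (Wind from -3.29 to -2.60, Temp from 1.83 to 1.40), and the adjusted R-squared dropped from 0.574 to 0.444. This is plausible: the mean-imputed Ozone values are the same regardless of Wind and Temp, which weakens the estimated relationship.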
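To read the two fits side by side, the coefficients and adjusted R-squared values can be collected into small tables. This is a self-contained sketch that rebuilds the mean-imputed data; the names `comparison` and `adj_r2` are illustrative.

```r
# Rebuild the mean-imputed data so the example runs on its own
airquality_imputed <- airquality
for (col in c("Ozone", "Solar.R")) {
  col_mean <- mean(airquality_imputed[[col]], na.rm = TRUE)
  airquality_imputed[[col]][is.na(airquality_imputed[[col]])] <- col_mean
}

lm_original <- lm(Ozone ~ Wind + Temp, data = na.omit(airquality))
lm_imputed  <- lm(Ozone ~ Wind + Temp, data = airquality_imputed)

# Coefficients side by side
comparison <- cbind(original = coef(lm_original), imputed = coef(lm_imputed))
round(comparison, 3)

# Adjusted R-squared side by side
adj_r2 <- c(original = summary(lm_original)$adj.r.squared,
            imputed  = summary(lm_imputed)$adj.r.squared)
round(adj_r2, 3)
```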

We can also check the model assumptions of both models with diagnostic plots.

```
# Diagnostic plots before imputation
par(mfrow = c(2, 2))
plot(lm_original)
```

```
# Diagnostic plots after imputation
par(mfrow = c(2, 2))
plot(lm_imputed)
```

Both models appear to fit the normality assumption reasonably well. After imputation, there is a slight improvement in homoscedasticity (equal variances).

## Imputation with MICE (Multivariate Imputation by Chained Equations)

In this section we will be using R packages 'mice' and 'naniar' to do the imputation.

First we need to install and load the packages.

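```
# Install packages if they are not already installed
packages <- c("mice", "ggplot2", "naniar")
missing_packages <- packages[!packages %in% rownames(installed.packages())]
if (length(missing_packages) > 0) install.packages(missing_packages)

# Load the packages
library(mice)
library(ggplot2)
library(naniar)
```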

Then we can visualize the missing data in each column using the vis_miss() function.

```
# Visualize missing data
vis_miss(airquality)
```

From the graph, similarly, we can see that the other four columns do not have any missing values. 24% of Ozone is missing and 5% of Solar.R is missing.
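For a numeric counterpart to the plot, 'naniar' also provides `miss_var_summary()`, which returns the count and percentage of missing values per variable as a table. A minimal sketch, assuming 'naniar' is installed (the name `missing_summary` is illustrative):

```r
library(naniar)

# One row per variable, sorted by the amount of missing data
missing_summary <- miss_var_summary(airquality)
print(missing_summary)
```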

We will now use the mice package to impute the missing values. MICE stands for Multivariate Imputation by Chained Equations.

```
# Set the seed for reproducibility
set.seed(12345)

# Perform multiple imputation (printFlag = FALSE suppresses the iteration log)
imputed_data <- mice(airquality, m = 5, method = "pmm", printFlag = FALSE)
```

Explanation of the mice function:

• Number of Imputations (m=5): The 'm' argument specifies how many complete datasets you wish to generate, each with missing values filled in. By setting m to 5, the function will create five versions of your dataset, each with missing values imputed differently. This multiplicity captures the uncertainty inherent in the imputation process.

• Imputation Method (method = 'pmm'): The 'method' argument dictates the statistical technique mice will use to predict missing values. PMM means Predictive Mean Matching and it's a non-parametric approach particularly suited for continuous data. PMM operates by finding observed cases whose predicted values are similar to those of the missing entries. The missing values are then imputed using actual observed values borrowed from these similar cases, thus preserving the distribution and variance of the original data more effectively than simpler methods, such as mean imputation.
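To make the idea behind PMM more concrete, here is a deliberately simplified base R sketch for a single variable. It is not the algorithm as implemented in 'mice', which among other things also draws the regression parameters at random to reflect their uncertainty. The donor count `k = 5` and all variable names are assumptions chosen for illustration.

```r
set.seed(1)

# Rows where Ozone is observed; Wind and Temp have no missing values
obs <- !is.na(airquality$Ozone)

# Step 1: regress Ozone on the predictors using the observed cases
fit <- lm(Ozone ~ Wind + Temp, data = airquality[obs, ])

# Step 2: compute the predicted mean for every row
pred <- predict(fit, newdata = airquality)

# Step 3: for each missing case, find the k observed cases with the closest
# predicted mean and copy the observed Ozone value of one random donor
k <- 5
donor_pool <- which(obs)
ozone_pmm <- airquality$Ozone
for (i in which(!obs)) {
  distances <- abs(pred[donor_pool] - pred[i])
  nearest <- donor_pool[order(distances)[seq_len(k)]]
  ozone_pmm[i] <- airquality$Ozone[sample(nearest, 1)]
}

# Every imputed value is a value that was actually observed
all(ozone_pmm[!obs] %in% airquality$Ozone[obs])
```

Because imputed values are copied from real observations, they stay within the observed range and can never be impossible values such as a negative ozone concentration.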

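Note that imputed_data is not a simple data frame. It is an object of class 'mids' (multiply imputed data set) that stores the original data together with the five sets of imputed values.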
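A short sketch of how such an object can be inspected, assuming 'mice' is installed. The imputation is rerun here so the example is self-contained, and `completed_first` is an illustrative name.

```r
library(mice)

set.seed(12345)
imputed_data <- mice(airquality, m = 5, method = "pmm", printFlag = FALSE)

# The result is a 'mids' object, not a data frame
class(imputed_data)

# Imputed Ozone values: one row per missing entry, one column per imputation
head(imputed_data$imp$Ozone)

# Extract the first completed dataset as an ordinary data frame
completed_first <- complete(imputed_data, 1)
sum(is.na(completed_first))
```

`complete()` accepts any imputation number from 1 to m, so each of the five completed datasets can be examined separately.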

Now, let's compare different models.

```
# Keep only the rows without missing values
complete_cases <- na.omit(airquality)

# Fit a linear model
model_original <- lm(Ozone ~ Solar.R, data = complete_cases)

# Summarize the model
summary(model_original)
#> Call:
#> lm(formula = Ozone ~ Solar.R, data = complete_cases)
#>
#> Residuals:
#>     Min      1Q  Median      3Q     Max
#> -48.292 -21.361  -8.864  16.373 119.136
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 18.59873    6.74790   2.756 0.006856 **
#> Solar.R      0.12717    0.03278   3.880 0.000179 ***
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> Residual standard error: 31.33 on 109 degrees of freedom
#> Multiple R-squared:  0.1213, Adjusted R-squared:  0.1133
#> F-statistic: 15.05 on 1 and 109 DF,  p-value: 0.0001793
```
```
# Fit the same model to each of the imputed datasets
model_imputed <- with(data = imputed_data, expr = lm(Ozone ~ Solar.R))

# Pool the results
pooled_results <- pool(model_imputed)

# Summarize the pooled model
summary(pooled_results)
#>          term   estimate std.error statistic      df      p.value
#> 1 (Intercept) 22.1672798 6.2750160  3.532625 55.9334 0.0008329364
#> 2     Solar.R  0.1062197 0.0305815  3.473330 51.5235 0.0010503463
```

The pooled estimate for Solar.R is also statistically significant, although it is slightly smaller than in the complete-case model (0.106 versus 0.127).
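For readers wondering how the five sets of results are combined, pooling of this kind is conventionally described by Rubin's rules. The notation below is introduced here for illustration: $\hat{Q}_i$ is a coefficient estimate from the $i$-th imputed dataset, $U_i$ is its squared standard error, and $m$ is the number of imputations.

```latex
\begin{align}
\bar{Q} &= \frac{1}{m} \sum_{i=1}^{m} \hat{Q}_i
  && \text{pooled estimate} \\
\bar{U} &= \frac{1}{m} \sum_{i=1}^{m} U_i
  && \text{within-imputation variance} \\
B &= \frac{1}{m-1} \sum_{i=1}^{m} \left( \hat{Q}_i - \bar{Q} \right)^2
  && \text{between-imputation variance} \\
T &= \bar{U} + \left( 1 + \frac{1}{m} \right) B
  && \text{total variance}
\end{align}
```

The pooled standard error is $\sqrt{T}$. The between-imputation term $B$ is how the uncertainty caused by the missing data enters the final result.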

We can also check whether the imputation distorted the distribution of a variable too much.

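```
# Take the first of the five imputed datasets for plotting
completed_data <- complete(imputed_data, 1)

par(mfrow = c(1, 2))
hist(airquality_complete_cases$Ozone, main = "Original Ozone", xlab = "Ozone concentration", col = "blue")
hist(completed_data$Ozone, main = "Imputed Ozone", xlab = "Ozone concentration", col = "red")
```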

There is a slight difference in the middle of the distribution, but overall the shape is not heavily distorted.

## Reference list / Useful links

• Allison, Paul D. (2001), Missing Data (Series: Quantitative Applications in the Social Sciences). A SAGE University paper.
• Azur, Melissa J., Stuart, Elizabeth A., Frangakis, Constantine & Leaf, Philip J. (2011), Multiple Imputation by Chained Equations: What is it and how does it work? International Journal of Methods in Psychiatric Research, 20(1), 40-49.
• Getting Started with naniar: https://cran.r-project.org/web/packages/naniar/vignettes/getting-started-w-naniar.html
• mice: Multivariate Imputation by Chained Equations: https://cran.r-project.org/web/packages/mice/mice.pdf
• Schafer, Joseph L. (1999), Multiple Imputation: a primer. Statistical Methods in Medical Research, 8, 3-15.
• van Buuren, Stef. (2012), Flexible Imputation of Missing Data. Chapman & Hall/CRC, Boca Raton, FL.
• Stef van Buuren, Flexible Imputation of Missing Data: https://stefvanbuuren.name/fimd/
• White, Ian R., Royston, Patrick & Wood, Angela M. (2011), Multiple imputation using chained equations: Issues and guidance for practice.

## Data Consultant

Yufei Qin
Contact:
Firestone Library, A.12F.2
6092582519