Missing data is a common issue encountered in research across various fields, including social sciences. It occurs when no data value is stored for the variable in an observation. It can happen for a wide range of reasons: from participants not answering certain questions in a survey to errors in data collection or transfer processes.
Missing data can significantly impact the analysis, potentially leading to biased estimates, reduced statistical power, and invalid conclusions. It's crucial for researchers and analysts to recognize the types of missing data, understand the mechanisms behind them, and apply appropriate methods for handling them.
We first need to identify where and how data is missing in our dataset. This initial step is crucial as it informs us about the nature of our missing data, helps us hypothesize about potential reasons for its absence, and guides our strategy for dealing with it effectively.
Understanding the Landscape of Missingness:
Quantifying Missing Data: Determine the extent of missing data, both at a variable level and observation level. Understanding the volume of missing data is critical as it impacts the choice of handling techniques.
Visualizing Missing Data Patterns: Visualizing the patterns of missing data can reveal insights, indicating whether the missingness is random or systematic.
Analyzing the Impact of Missing Data: Assess how missing values in one variable might relate to another, aiding in understanding if data is Missing Completely At Random (MCAR), Missing At Random (MAR), or Missing Not At Random (MNAR).
Sample Data
We are using the built-in dataset 'airquality' in R as a sample dataset. We can use the help command to access the codebook:
# Load the airquality dataset
data("airquality")
help("airquality")

# View the first few rows of the dataset
head(airquality)

# Summary of the dataset to check for missing values
summary(airquality)

# View the data in spreadsheet format
View(airquality)
It's a data frame with 153 observations on 6 variables.
[,1] Ozone numeric Ozone (ppb)
[,2] Solar.R numeric Solar R (lang)
[,3] Wind numeric Wind (mph)
[,4] Temp numeric Temperature (degrees F)
[,5] Month numeric Month (1--12)
[,6] Day numeric Day of month (1--31)
From the preview alone we can see that some data are missing.
Now let's count exactly how many missing values there are.
# Count missing values in each column
sapply(airquality, function(x) sum(is.na(x)))

  Ozone Solar.R    Wind    Temp   Month     Day 
     37       7       0       0       0       0 

# Overall missing value count
sum(is.na(airquality))
[1] 44
We can see that in total there are 44 missing values in the dataset: 37 in the Ozone column, 7 in the Solar.R column, and none in Wind, Temp, Month, or Day.
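Missingness can also be quantified at the observation (row) level. A minimal sketch using complete.cases():

```r
# Number of rows with no missing value in any column
sum(complete.cases(airquality))    # 111 complete rows
# Number of rows with at least one missing value
sum(!complete.cases(airquality))   # 42 incomplete rows
```

Note that 37 + 7 = 44 missing values but only 42 incomplete rows, because two rows are missing both Ozone and Solar.R.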
We can also calculate the percentage of missing values in each column. This is especially useful for large, messy datasets.
# Calculate the percentage of missing values in each column
sapply(airquality, function(x) mean(is.na(x)) * 100)
Ozone Solar.R Wind Temp Month Day
24.183007 4.575163 0.000000 0.000000 0.000000 0.000000
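To probe the missingness mechanism, one informal check (a sketch, not a formal test) is to compare a fully observed variable across rows where Ozone is missing versus observed; a clear difference would argue against MCAR. A formal alternative is Little's MCAR test, available for example as naniar::mcar_test().

```r
# Compare Temp between rows with and without a missing Ozone value;
# a large, significant difference suggests the data are not MCAR
missing_ozone <- is.na(airquality$Ozone)
t.test(airquality$Temp[missing_ozone], airquality$Temp[!missing_ozone])
```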
One of the simplest approaches to address missing data in a dataset is to delete observations (rows) that contain any missing values. This method, often referred to as "listwise deletion" or "complete case analysis," involves removing an entire record from the analysis if it is missing a value in any of the variables.
Deleting Rows with Missing Values:
# Remove rows with any missing value
airquality_complete_cases <- na.omit(airquality)

# Count the missing values now
sum(is.na(airquality_complete_cases))
[1] 0

# Get the dimensions of the dataset (rows and columns)
dim(airquality_complete_cases)
[1] 111   6
We can see that the new dataset contains no missing values, but it keeps only 111 of the original 153 rows.
Note that many statistical software packages and functions designed for linear regression and similar models have built-in mechanisms to address missing data. Typically, these mechanisms automatically exclude rows with missing values in any variable included in the model (listwise deletion). Therefore, unless you have a specific need, it may not be necessary to remove missing values manually.
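As a quick illustration of this automatic behavior in R (lm() defaults to na.action = na.omit):

```r
# lm() silently drops rows that are incomplete for the variables in the formula
fit <- lm(Ozone ~ Wind + Temp, data = airquality)
nobs(fit)   # 116 rows used: only Ozone's 37 NAs matter for this formula
```

summary(fit) will report "(37 observations deleted due to missingness)"; Wind and Temp have no missing values, so Solar.R's NAs are irrelevant here.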
When dealing with missing data, a common and straightforward approach is to fill in the missing values with the mean of the available values in the same variable. This method, known as "mean imputation," involves calculating the average of the non-missing values for each variable and substituting that average for the missing entries.
When to Consider Mean Imputation
Single Column
We can choose to impute only certain columns, for example when the remaining columns have only trivial missingness.
# For a single column
# Create a copy of the original dataframe
airquality_imputed_one <- airquality
airquality_imputed_one$Ozone[is.na(airquality$Ozone)] <- mean(airquality$Ozone, na.rm = TRUE)

# Count missing values in each column
sapply(airquality_imputed_one, function(x) sum(is.na(x)))

  Ozone Solar.R    Wind    Temp   Month     Day 
      0       7       0       0       0       0 
We can see that the dataset now has only 7 missing values, all in the Solar.R column.
Whole Dataset
We can also impute every column in a dataset. Note that the code below only imputes numeric columns; missing values in non-numeric columns would be left untouched.
# For all columns in a dataset
airquality_imputed <- airquality
for (i in 1:ncol(airquality_imputed)) {
  if (is.numeric(airquality_imputed[[i]])) {
    airquality_imputed[[i]][is.na(airquality_imputed[[i]])] <- mean(airquality_imputed[[i]], na.rm = TRUE)
  }
}

# Count missing values in each column
sapply(airquality_imputed, function(x) sum(is.na(x)))

  Ozone Solar.R    Wind    Temp   Month     Day 
      0       0       0       0       0       0 
The imputation is now complete: no missing values remain.
Let's now compare the imputed and original versions by overlaying their density plots.
# Density plots (requires ggplot2)
library(ggplot2)
ggplot(airquality, aes(x = Ozone, fill = "Original")) +
  geom_density(alpha = 0.5) +
  geom_density(data = airquality_imputed, aes(x = Ozone, fill = "Imputed"), alpha = 0.5) +
  labs(title = "Density Plot of Ozone: Original vs. Imputed", fill = "Dataset")
Compared with the original data, the imputed dataset shows a pronounced density spike near the mean Ozone value (about 42 ppb). This is expected, because every missing entry was replaced with that same mean.
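This spike illustrates the main drawback of mean imputation: it leaves the mean unchanged but artificially shrinks the spread of the variable. A quick check:

```r
# Standard deviation before and after mean imputation
sd(airquality$Ozone, na.rm = TRUE)   # original spread
sd(airquality_imputed$Ozone)         # smaller: every imputed value sits exactly at the mean
```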
Now we compare two different linear models.
# Linear model before imputation
lm_original <- lm(Ozone ~ Wind + Temp, data = na.omit(airquality))

# Linear model after imputation
lm_imputed <- lm(Ozone ~ Wind + Temp, data = airquality_imputed)
# Compare summaries
summary(lm_original)

Call:
lm(formula = Ozone ~ Wind + Temp, data = na.omit(airquality))

Residuals:
    Min      1Q  Median      3Q     Max 
-42.156 -13.216  -3.123  10.598  98.492 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -67.3220    23.6210  -2.850  0.00524 ** 
Wind         -3.2948     0.6711  -4.909 3.26e-06 ***
Temp          1.8276     0.2506   7.294 5.29e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 21.73 on 108 degrees of freedom
Multiple R-squared:  0.5814,    Adjusted R-squared:  0.5736 
F-statistic: 74.99 on 2 and 108 DF,  p-value: < 2.2e-16

summary(lm_imputed)

Call:
lm(formula = Ozone ~ Wind + Temp, data = airquality_imputed)

Residuals:
    Min      1Q  Median      3Q     Max 
-38.550 -13.998  -4.306  10.530 104.458 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -41.2159    19.3090  -2.135   0.0344 *  
Wind         -2.5986     0.5543  -4.688 6.14e-06 ***
Temp          1.4024     0.2063   6.798 2.35e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 21.4 on 150 degrees of freedom
Multiple R-squared:  0.451,    Adjusted R-squared:  0.4437 
F-statistic: 61.62 on 2 and 150 DF,  p-value: < 2.2e-16
We can see that imputation changed neither the significance of the predictors nor the sign of the coefficients. However, the adjusted R-squared of the imputed model (0.44) is noticeably lower than that of the original (0.57): mean imputation adds observations whose Ozone value carries no information about Wind or Temp, which attenuates the fit.
We can also check the model assumptions for both fits.
# Diagnostic plots before imputation
par(mfrow=c(2,2))
plot(lm_original)
# Diagnostic plots after imputation
par(mfrow=c(2,2))
plot(lm_imputed)
Both models show a reasonably good fit to normality. After imputation, there is a slight improvement in homoscedasticity (equal variances).
In this section we will be using R packages 'mice' and 'naniar' to do the imputation.
First we need to install and load the packages.
# Install packages if they are not already installed
install.packages(c("mice", "ggplot2", "naniar"))

# Load the packages
library(mice)
library(ggplot2)
library(naniar)
Then we can visualize the missing data in each column using the vis_miss() function from naniar.
# Visualize missing data
vis_miss(airquality)
The graph confirms what we found earlier: the other four columns have no missing values, while about 24% of Ozone and 5% of Solar.R are missing.
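The mice package offers a complementary tabular view, md.pattern(), which counts the joint missingness patterns, e.g. how many rows are missing Ozone only, Solar.R only, or both (in recent versions of mice, plot = FALSE suppresses the accompanying plot):

```r
library(mice)
# Each row of the output is a missingness pattern; the leftmost number counts
# the rows matching that pattern, and the last column counts the variables
# missing in that pattern
md.pattern(airquality, plot = FALSE)
```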
We will now use the mice package to impute the missing values. MICE stands for Multivariate Imputation by Chained Equations.
# Set the seed for reproducibility
set.seed(12345)

# Perform Multiple Imputation
imputed_data <- mice(airquality, m = 5, method = 'pmm', print = FALSE)
Explanation of the mice function:
Number of Imputations (m=5): The 'm' argument specifies how many complete datasets you wish to generate, each with missing values filled in. By setting m to 5, the function will create five versions of your dataset, each with missing values imputed differently. This multiplicity captures the uncertainty inherent in the imputation process.
Imputation Method (method = 'pmm'): The 'method' argument dictates the statistical technique mice will use to predict missing values. PMM stands for Predictive Mean Matching, a semi-parametric approach particularly suited for continuous data. PMM finds observed cases whose predicted values are close to the predicted value for the missing entry, then imputes the missing value by drawing from those observed "donors." This preserves the distribution and variance of the original data more effectively than simpler methods, such as mean imputation.
Note that imputed_data is not a plain data frame; mice() returns a 'mids' (multiply imputed data set) object.
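The individual completed datasets can be extracted from the mice output with complete():

```r
class(imputed_data)              # "mids"
# Extract the first of the five completed datasets as a regular data frame
head(complete(imputed_data, 1))
```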
Now, let's compare a complete-case model with the pooled model from the imputed datasets.
# Complete-case dataset for the unimputed comparison
complete_cases <- na.omit(airquality)

# Fit a linear model
model_original <- lm(Ozone ~ Solar.R, data = complete_cases)

# Summarize the model
summary(model_original)

Call:
lm(formula = Ozone ~ Solar.R, data = complete_cases)

Residuals:
    Min      1Q  Median      3Q     Max 
-48.292 -21.361  -8.864  16.373 119.136 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 18.59873    6.74790   2.756 0.006856 ** 
Solar.R      0.12717    0.03278   3.880 0.000179 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 31.33 on 109 degrees of freedom
Multiple R-squared:  0.1213,    Adjusted R-squared:  0.1133 
F-statistic: 15.05 on 1 and 109 DF,  p-value: 0.0001793
# Fit the same model to each of the imputed datasets
model_imputed <- with(data = imputed_data, exp = lm(Ozone ~ Solar.R))

# Pool the results
pooled_results <- pool(model_imputed)

# Summarize the pooled model
summary(pooled_results)

         term   estimate  std.error statistic      df      p.value
1 (Intercept) 22.1672798  6.2750160  3.532625 55.9334 0.0008329364
2     Solar.R  0.1062197  0.0305815  3.473330 51.5235 0.0010503463
The pooled estimate also gives a significant p-value. However, we see a slight decrease in the estimate for Solar.R.
We can also check whether the imputation distorted the distribution of a variable too much.
# Extract one completed dataset from the mice output
completed_data <- complete(imputed_data, 1)

# Compare histograms side by side
par(mfrow = c(1, 2))
hist(complete_cases$Ozone, main = "Original Ozone",
     xlab = "Ozone concentration", col = "blue")
hist(completed_data$Ozone, main = "Imputed Ozone",
     xlab = "Ozone concentration", col = "red")
We can notice a slight difference in the middle of the graph, but overall the distribution is not heavily distorted.
If you have questions or comments about this guide or method, please email data@Princeton.edu.