Introduction

What is this course?

By popular demand, this is a lesson about missing data: what it is, what problems it can cause, how to recognize and describe it, and what to do about it.

What will you learn from this course?

Conceptual learning objectives

At the end of this course, you will understand …

  • What the different kinds of missing data are
  • The consequences of ignoring missing data
  • How missing data increases uncertainty in your conclusions
  • How to decide the best way to address missing data

Practical skills

At the end of this course, you will be able to …

  • Make graphs and tables to visualize missing data
  • Impute missing data with a variety of methods
  • Use multiple imputation to incorporate uncertainty due to missing data in your statistical estimates

What is missing data and why is it a problem?

Where there is data, there is missing data.

Getting to know missing data

Missing data may happen for many different reasons. The most common causes of missingness are different in different scientific fields. What are some reasons you can think of for why a data point might be missing?

  • Instruments fail
  • Datum was simply not measured (for example, daily time series data has missing values on weekends and holidays)
  • Attrition (a patient drops out of a study or a plant dies before its final biomass can be recorded)
  • Loss of data records
  • Datum is possible to record for some subjects but not others (for example, we are measuring length of longest petal, but some plants do not have any flowers)
  • Data value can be above or below the limit of detection (concentration of an element is too low to be measured; or a thermometer may have an upper bound of temperature that it can measure)
  • Data value doesn’t fit the statistical distribution we are assuming in our model (for example, zero values that become undefined if we log-transform the data before fitting a statistical model)

Not all missingness is created equal. There are three categories of missing data (Rubin 1976), based on the process that determines how probable it is for a data point to be missing.

Missing completely at random (MCAR): The probability of being missing is equal for all data points in the dataset. If this assumption is valid, we are in luck because we don’t have to account for any biases resulting from missing data.

Missing at random (MAR): The probability of being missing varies, but it varies according to a group or variable that we have in our observed data. For example, our instrument is more likely to fail on a rainy day, but we have data on rainfall by day. This is not as easy to handle as MCAR, but it is still feasible because we can use the other information we have to model the process that is causing the missing data pattern. Most of the standard methods for dealing with missing data assume MAR.

Missing not at random (MNAR): The probability of being missing varies, for unknown reasons, i.e., reasons that we cannot account for with some other source of data. This is the most difficult situation to deal with. For example, if you are doing a survey with a follow-up, people who do not have a strong opinion are less likely to respond or to follow up. Or plant genotypes that are less resistant to disease are more likely to die before the end of the study, so their final biomass value is more likely to be missing.

Especially if you work with survival data, you may have heard the term censored data. Censoring is a special kind of missing not at random: a data point is censored if it is missing because it is above or below a threshold (for example, a concentration below the limit of detection of the instrument or assay). Censoring also occurs if the subject was removed from the study, or if it died. A common type of censoring occurs in survival studies where subjects are only tracked for a finite period of time. Any subjects still alive at the end of the study period have censored time-of-death values: we know the subject survived at least as long as the study period, but we don’t know exactly how long. All of these types of censoring will bias the results if they are not properly accounted for.
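To make the three mechanisms concrete, here is a small simulation sketch (all variable names and probabilities are invented for illustration):

```r
# Hypothetical sensor reading y, with observed daily rainfall rain.
set.seed(1)
n <- 1000
rain <- rgamma(n, shape = 2)
y <- rnorm(n, mean = 20)

# MCAR: every value has the same 10% chance of being missing.
y_mcar <- ifelse(runif(n) < 0.1, NA, y)

# MAR: the instrument fails more often on rainy days, and rainfall
# is part of our observed data.
y_mar <- ifelse(runif(n) < plogis(rain - 3), NA, y)

# MNAR: missingness depends on the unobserved value of y itself,
# e.g. readings above the instrument's upper limit are lost.
y_mnar <- ifelse(y > 21, NA, y)

# In this simulation y is independent of rain, so the observed mean is
# roughly unbiased under MCAR and MAR; under MNAR it is biased downward
# because the upper tail of y is systematically missing.
mean(y_mcar, na.rm = TRUE)
mean(y_mnar, na.rm = TRUE)
```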

Side note: Missing replicates in designed experiments

Note that I have not discussed the issue of entire replicates missing from treatments in a designed experiment. For example, a plot in a randomized complete block design may be destroyed or not measured for whatever reason. In that case, the entire observation is missing, including the outcome. Obviously, it is never good to lose observations because it reduces the overall sample size. However, it is not a major concern from the perspective of analysis. In olden times, unbalanced datasets caused issues with analysis of variance, but modern procedures are very robust to unbalanced data.

One caveat: if you are doing ANOVA and using sums of squares to calculate F-statistics associated with different model terms, there is more than one way to calculate the sums of squares. You may have heard of Type I, Type II, and Type III sums of squares. These differ in the way the variability is partitioned, specifically the order in which the main effect and interaction terms are considered. The different types are equivalent if the dataset is perfectly balanced (and there’s no correlation between predictors), but they diverge if you have unbalanced data. Therefore, if your dataset is severely unbalanced, you should check whether your bottom-line conclusions are sensitive to the type of sums of squares you choose. If they are, it is worth thinking about it more deeply and making an informed decision. But I personally don’t usually sweat it that much, because I usually find that multiple comparisons between treatment means are more biologically and practically relevant than sums-of-squares F-tests anyway!

Now, with that small digression out of the way, let’s get back to talking about missing data.
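As a sketch of that sensitivity check, assuming the car package is installed (it is not loaded in this lesson) and a hypothetical unbalanced two-way dataset dat with response y and factors treatment and block:

```r
fit <- lm(y ~ treatment * block, data = dat)

anova(fit)                 # Type I (sequential) sums of squares
car::Anova(fit, type = 2)  # Type II
car::Anova(fit, type = 3)  # Type III; for interpretable Type III tests,
                           # refit with sum-to-zero contrasts:
# fit <- lm(y ~ treatment * block, data = dat,
#           contrasts = list(treatment = contr.sum, block = contr.sum))
```

If the F-tests for the terms you care about agree across the three tables, your conclusions are not sensitive to the choice.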

How does R deal with missing data?

Load any packages that we’ll need for the lesson. These include the agridat package which provides example agricultural datasets, some of the tidyverse data manipulation packages (dplyr, tidyr, ggplot2), some standard statistical packages (glmmTMB, brms, emmeans, broom.mixed), the zoo package which is specialized for working with time series data, the VIM and naniar packages which help with visualizing missing data, and the missForest and mice packages for imputation.

library(agridat)
library(glmmTMB)
library(dplyr)
library(ggplot2)
library(tidyr)
library(emmeans)
library(broom.mixed)
library(brms)
library(zoo)
library(VIM)
library(naniar)
library(missForest)
library(mice)

Let’s explore how R handles missing data. Variables in R can belong to different data types, like integer, double-precision numeric, character, and factor. Regardless of the data type, missing values are represented by a special code, NA for “not available.”

x <- c(1, 2, NA, 4, 5)

For many operations, the default behavior is to return NA if any of the input values are NA, no matter how many non-missing values there are. This is the case for many functions in base R that take a vector of values as input and return a single value:

mean(x)
## [1] NA
sum(x)
## [1] NA
median(x)
## [1] NA

Many such functions have an argument na.rm which is FALSE by default. We can change that default behavior by explicitly setting na.rm = TRUE to remove missing values and compute the summary function on the non-missing values:

mean(x, na.rm = TRUE)
## [1] 3
sum(x, na.rm = TRUE)
## [1] 12
median(x, na.rm = TRUE)
## [1] 3

Many new R users are frustrated at this choice of defaults. Why doesn’t R remove the missing values and calculate the mean or sum, without me having to waste keystrokes explicitly telling it to get rid of those values? I believe that the default behavior is actually sensible. Interpretation changes when missing values are deleted, so this should not be done silently without our express consent.

However, many R model fitting functions do indeed delete missing values without explicit consent. The deletion of missing values can cause unexpected problems. Let’s say we want to plot the observed versus predicted values for a linear mixed model, to see whether the model is a reasonably good fit for the data. This works fine for an example dataset with no missing values:

data(cochran.beets)

fit <- glmmTMB(yield ~ fert + (1|block), data = cochran.beets)

plot(x = cochran.beets$yield, y = predict(fit))

But what happens if we introduce a single missing value?

cochran.beets$yield[3] <- NA

fit2 <- glmmTMB(yield ~ fert + (1|block), data = cochran.beets)

plot(x = cochran.beets$yield, y = predict(fit2))
## Error in xy.coords(x, y, xlabel, ylabel, log): 'x' and 'y' lengths differ

We get an error because the length of x is 42, including the NA, but the length of y is only 41: the missing value was already stripped out before the model fitting, so the predicted values don’t include it! You would need to do some gymnastics to line them up.
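One way to do those gymnastics is to subset the observed values with complete.cases() restricted to the variables in the model formula. This sketch relies on predict() returning predictions in row order, with the dropped row omitted:

```r
# Identify the rows that were actually used to fit the model (no missing
# values in any variable appearing in the formula).
used <- complete.cases(cochran.beets[, c('yield', 'fert', 'block')])

plot(x = cochran.beets$yield[used], y = predict(fit2))
```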

By the way, you can verify how many observations were used to fit a model and compare it to the number of rows in the input data, to see if any were removed.

nrow(cochran.beets)
## [1] 42
nobs(fit2)
## [1] 41

Visualizing missing data

Before we learn what to do about missing data, let’s learn a few tools for exploring a dataset and getting a better idea of the patterns of missing data in it. Two useful R packages for this are VIM and naniar (originally known as ggmissing). VIM uses syntax and plotting style closer to base R, while naniar depends on ggplot2 and the other “tidyverse” packages.

Image (c) University of Georgia Cooperative Extension

The data

We will use a small dataset for some of the simpler examples here: a few dozen observations of jassid (leafhopper, an insect pest) abundance, with seven weather covariates. The data were collected in the Maharashtra region of India and published on Kaggle by Madhura Bhagat. I have artificially deleted some values from the dataset, completely at random and independently for each covariate, with a different probability of missingness for each variable. The response variable, jassid abundance, has no missing values.

Import the data from a CSV file. You may import from a web URL where I have hosted the file on GitHub (first option shown below) or, if you are running this on the SCINet server, from the data folder (second option shown below).

jassids <- read.csv('https://usda-ree-ars.github.io/SEAStats/dealing_with_missing_data/jassids.csv')
jassids <- read.csv('data/jassids.csv')

Look at the first few rows of the data.

head(jassids)
##   Jassids    RF temp_max temp_min   RH1   RH2  BSS wind_velocity
## 1    0.20 18.87    34.60    24.50 68.82 45.54 6.55          6.48
## 2    0.23 34.96    33.92    24.19 71.23 48.12 5.25          6.17
## 3    0.87 29.51       NA    23.66 73.88 50.89 4.81            NA
## 4    1.63    NA    32.10    23.69 77.48 57.47 4.10          5.73
## 5    2.38 17.78    31.44       NA 77.87 59.18 3.58          5.41
## 6    5.12 29.38    31.06    23.55 77.08 61.15 2.91          5.68

Base R functions for summarizing missing data

One thing to notice is that the plain old base R function summary(), which has a method for data frames, includes information on missing values. If a column contains any NA values, summary() will tell you how many:

summary(jassids)
##     Jassids            RF            temp_max        temp_min    
##  Min.   :0.000   Min.   : 0.000   Min.   :28.00   Min.   :10.18  
##  1st Qu.:2.237   1st Qu.: 2.663   1st Qu.:29.25   1st Qu.:18.77  
##  Median :3.995   Median :21.435   Median :30.00   Median :22.30  
##  Mean   :3.816   Mean   :23.418   Mean   :30.25   Mean   :20.91  
##  3rd Qu.:5.287   3rd Qu.:32.800   3rd Qu.:30.89   3rd Qu.:23.67  
##  Max.   :8.000   Max.   :90.300   Max.   :36.60   Max.   :26.20  
##                  NA's   :4        NA's   :12      NA's   :2      
##       RH1             RH2             BSS         wind_velocity  
##  Min.   :63.65   Min.   :31.25   Min.   : 1.000   Min.   :0.550  
##  1st Qu.:74.23   1st Qu.:41.71   1st Qu.: 3.800   1st Qu.:1.070  
##  Median :79.38   Median :54.77   Median : 5.970   Median :2.400  
##  Mean   :78.45   Mean   :52.58   Mean   : 6.027   Mean   :2.741  
##  3rd Qu.:84.71   3rd Qu.:63.35   3rd Qu.: 7.522   3rd Qu.:4.300  
##  Max.   :92.80   Max.   :71.05   Max.   :20.700   Max.   :6.550  
##                  NA's   :5                        NA's   :9

At least in my RStudio environment, glimpse(jassids) from the dplyr package will highlight NA values in red (not shown here); note that this only shows the first few rows of each column, so it won’t necessarily flag the presence of missing values in a column.

Another handy function is complete.cases(). It returns a logical vector with TRUEs for each row that is “complete,” and FALSEs for each row that contains at least one missing value.

complete.cases(jassids)
##  [1]  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE
## [13]  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
## [25]  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE
## [37]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
## [49]  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

Pass the output of complete.cases() to table() to get the number of rows with and without missing values. Here, 25 of the 58 rows have at least one missing value.

table(complete.cases(jassids))
## 
## FALSE  TRUE 
##    25    33

The function is.na() tests whether each value of a vector is missing. We can negate it by writing !is.na() to return TRUE if a value isn’t missing. We can use this to make tables of missingness for individual data frame columns. For example, 12 of the 58 rows are missing a value for temp_max.

table(!is.na(jassids$temp_max))
## 
## FALSE  TRUE 
##    12    46

We can make a two-way table to see how many rows are missing values from either or both of two columns. This will also work for three or more columns but the table becomes hard to interpret. Here, we see that no rows are missing both temp_max and temp_min, but some are missing one of the two.

with(jassids, table(!is.na(temp_max), !is.na(temp_min)))
##        
##         FALSE TRUE
##   FALSE     0   12
##   TRUE      2   44

Packages for missing-data visualizations

Here’s a visualization of the missing data from the VIM package. First we use aggr() to compute the number of missing values in each variable and in each combination of two variables. Then the resulting object has a plot() method. The argument numbers = TRUE is optional and shows the proportion of missing values in each combination.

plot(aggr(jassids), numbers = TRUE)

The naniar package has the function vis_miss() which is originally from the visdat package. It shows missing data row-by-row. Currently (December 2025) there is a bug in this package that causes the axis labels to display incorrectly unless an adjustment is made with theme_sub_axis_x(). This should be fixed in the next release of the package.

vis_miss(jassids) + theme_sub_axis_x(text = element_text(vjust = 0))

Try adding cluster = TRUE and/or sort_miss = TRUE to vis_miss() to arrange missing values row-wise and column-wise (plots not shown here).

vis_miss(jassids, cluster = TRUE)
vis_miss(jassids, sort_miss = TRUE)

Note on the figures

From now on, whenever I show a figure, we will use the same style as Van Buuren 2019, which is to plot the observed data with solid blue circles and the imputed values with hollow red circles. I’ve created my own custom functions for visualizing the missing data patterns and will not go into detail about the code. This is a workshop about learning what to do with missing data and we don’t have time to cover the ins and outs of plotting code.

What to do about missing data

It’s inevitable that you will have missing data at some point. But you still need to analyze the data and answer the questions you care about! How to proceed? Here is a rundown of some different solutions, starting from the quick-and-dirty and proceeding to the most robust (but also the most labor-intensive to implement).

The choice is yours

The strategies for dealing with missing data can be grouped into three main categories. We will go through some examples of each.

  • Delete the observations that contain missing data and analyze only the complete observations
  • Impute missing values with a single imputation, treating the imputed value the same as the other data values
  • Impute missing values with multiple imputation, accounting for the uncertainty in the imputation in your final results

How do we choose a strategy?

But before we describe how each strategy works in detail and do hands-on examples, I want to go through potential rationale for choosing a strategy.

The biggest factor determining your missing data strategy is the proportion of data that are missing. If it is relatively small, differences in imputed values will have a negligible effect on your bottom-line conclusions. I would consider something on the order of 5% missing values to be a small proportion. In that case, it may not be worth the effort to develop a complex missing-data model; simpler strategies will work just as well. But be aware that there is no globally valid “rule of thumb” of what rate of missingness you can safely ignore. You should evaluate the consequences of missing data on a case-by-case basis.

The way in which the data are missing is another factor. If the data are missing completely at random, then if you delete observations that contain missing data, you’re just randomly selecting observations from the dataset and removing them. The worst-case scenario there is that you end up with a smaller dataset with reduced statistical power to detect effects. As mentioned above, if the proportion of observations with missing data is small, the reduction in power will be minimal. On the other hand, what if there is a relationship between the probability of an observation having missing data and one of the predictors or outcomes that you care about? In that case you could bias your inference if you ignore the missing data pattern, even if the number of missing values is relatively small.

Missing data strategy 1: Delete it

When confronted with missing data, it can be tempting to simply ignore it. By “ignore,” I mean pretend the missing data don’t exist by deleting observations that contain missing data until you end up with a complete dataset with no missing values. There are a few different ways to do this.

Strategies for deleting observations with missing data

Listwise deletion, a.k.a. complete-case analysis: This is by far the commonest way of dealing with missing data. Just delete any observations, or “cases,” with any missing values. In the typical data format needed for statistical analysis, each row of the data frame is a case.

I usually use the dplyr function filter() together with complete.cases() to do listwise deletion on a dataset. Note we can use . to mean “the dataset that has been passed to this function from a previous step.” In our jassids example this leaves us with 33 complete rows.

jassids_complete_only <- jassids %>%
  filter(complete.cases(.))
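The tidyr package, which we loaded earlier, provides drop_na() as a one-step equivalent:

```r
jassids_complete_only <- jassids %>%
  drop_na()
```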

Note that listwise deletion is the method used by default in many R and SAS statistical modeling packages. So even if you don’t explicitly delete incomplete cases before passing the data frame to the model, it will be done internally (taking into account only the columns that are needed for that model). As we’ve shown above, do this at your own risk!

This code shows that the number of observations is greater for a linear model that models jassid abundance as a linear function only of maximum temperature, compared to one that uses both maximum temperature and wind velocity. When we add another predictor, the number of complete cases goes down.

model1 <- lm(Jassids ~ temp_max, data = jassids)
model2 <- lm(Jassids ~ temp_max + wind_velocity, data = jassids)

nobs(model1)
## [1] 46
nobs(model2)
## [1] 41

Fortunately, as we’ve already mentioned, many R modeling functions return an error if missing values are present in the data (such as clmm()), or at least a warning message if deletion was done (such as brm()). Be wary of the ones like lm(), lmer(), and glmmTMB() that do the deletion by default and bury the warning somewhere in the summary output!

In multivariate analyses, you may need to remove both columns (variables) and rows (cases) from the original dataset to get a complete dataset of acceptable size. For example, let’s say you are doing a multivariate analysis where you are trying to use the relative abundance of 100 microbiome taxa to predict a trait value. A few taxa have missing data for 90% of the rows. If you just try to do complete-case analysis on the full dataset, you will have few or no rows left to work with. In this case, a valid strategy could be to delete columns that are above some unacceptable threshold of proportion missing values, iterating until you get a reasonable number of complete cases.
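A minimal sketch of that iterative idea, using our jassids data and an arbitrary 50% threshold:

```r
# Drop columns with more than 50% missing values (the threshold is
# arbitrary here and should be chosen for your own dataset), then keep
# the rows that are complete for the remaining columns.
prop_missing <- colMeans(is.na(jassids))
jassids_trimmed <- jassids[, prop_missing <= 0.5, drop = FALSE]
jassids_trimmed <- jassids_trimmed[complete.cases(jassids_trimmed), ]
```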

Pairwise deletion, a.k.a. available-case analysis: This is a method where instead of using only the complete cases for every calculation, you do separate calculations for each variable or subset of variables, using the cases that are complete for those variables. This is often the method chosen for generating correlation matrices. If you want the correlation between two particular variables, you don’t necessarily need to worry about values being missing for other variables in that dataset.

The function cor() calculates a correlation matrix containing the correlations between each pair of columns in a matrix. It has an argument use that lets you choose between complete-case analysis/listwise deletion (use = 'complete.obs') and pairwise deletion (use = 'pairwise.complete.obs'). The default is use = 'everything', which will return NA for any pair of variables that contains missing values.

Below you can see how the output of cor() differs depending on which deletion strategy we choose.

cor(jassids)
##               Jassids RF temp_max temp_min    RH1 RH2     BSS wind_velocity
## Jassids        1.0000 NA       NA       NA  0.193  NA  0.0262            NA
## RF                 NA  1       NA       NA     NA  NA      NA            NA
## temp_max           NA NA        1       NA     NA  NA      NA            NA
## temp_min           NA NA       NA        1     NA  NA      NA            NA
## RH1            0.1934 NA       NA       NA  1.000  NA -0.2764            NA
## RH2                NA NA       NA       NA     NA   1      NA            NA
## BSS            0.0262 NA       NA       NA -0.276  NA  1.0000            NA
## wind_velocity      NA NA       NA       NA     NA  NA      NA             1
cor(jassids, use = 'complete.obs')
##               Jassids      RF temp_max temp_min    RH1    RH2    BSS wind_velocity
## Jassids        1.0000  0.0326   -0.318   0.0274  0.360  0.230 -0.105        -0.116
## RF             0.0326  1.0000    0.150   0.5810  0.431  0.658 -0.306         0.130
## temp_max      -0.3179  0.1495    1.000   0.5034 -0.302 -0.109  0.160         0.423
## temp_min       0.0274  0.5810    0.503   1.0000  0.309  0.726 -0.345         0.679
## RH1            0.3599  0.4313   -0.302   0.3092  1.000  0.499 -0.161        -0.183
## RH2            0.2300  0.6577   -0.109   0.7257  0.499  1.000 -0.536         0.459
## BSS           -0.1050 -0.3063    0.160  -0.3454 -0.161 -0.536  1.000        -0.460
## wind_velocity -0.1157  0.1296    0.423   0.6791 -0.183  0.459 -0.460         1.000
cor(jassids, use = 'pairwise.complete.obs')
##               Jassids      RF temp_max temp_min     RH1     RH2     BSS wind_velocity
## Jassids        1.0000 -0.0409  -0.1544   0.0217  0.1934  0.0617  0.0262       -0.2515
## RF            -0.0409  1.0000   0.1636   0.6278  0.4782  0.6936 -0.4050        0.2437
## temp_max      -0.1544  0.1636   1.0000   0.4832 -0.1822 -0.0964  0.1326        0.4200
## temp_min       0.0217  0.6278   0.4832   1.0000  0.4795  0.7447 -0.4427        0.6969
## RH1            0.1934  0.4782  -0.1822   0.4795  1.0000  0.5488 -0.2764        0.0219
## RH2            0.0617  0.6936  -0.0964   0.7447  0.5488  1.0000 -0.6564        0.5505
## BSS            0.0262 -0.4050   0.1326  -0.4427 -0.2764 -0.6564  1.0000       -0.5337
## wind_velocity -0.2515  0.2437   0.4200   0.6969  0.0219  0.5505 -0.5337        1.0000

Of course the pairwise deletion method isn’t helpful for many statistical modeling approaches that require all variables to be non-missing, such as multiple regression.

Missing data strategy 2: Single imputation

We’ve shown that under certain circumstances, deletion methods like listwise deletion can waste a surprisingly high proportion of your data. Especially if your sample size was small to begin with, it’s often unacceptable to throw away so much data that you invested so much time, money, and resources to generate. More importantly, if your data aren’t missing completely at random, listwise deletion will bias your results!

Let’s start with the simplest class of strategies for imputing missing values. By “impute,” I mean replace a missing data point with some kind of substituted value. When we impute, we harness some other source of available information to make a more or less educated guess about what that missing value truly is (or what it would be, if we could measure it).

Strategies for imputing missing data

  • Impute using mean or mode
  • Impute using adjacent observations in space and/or time
  • Impute using a randomly selected value from another observation
  • Impute using a regression model

Imputing using the mean or mode

A simple form of imputation is to replace all missing values with the mean or median of that variable, or, if it is a categorical variable, the mode (most common category). There are a lot of ways to do this in R. The base R function replace() is a pretty straightforward way to replace missing values with imputed values. It takes three arguments: a vector in which we want to replace values, a vector of indexes where we want to do the replacing, and the value or values we want to replace with. Here, I’ve defined a function that takes the vector x and, at all the locations where is.na(x) is TRUE, replaces the values with the mean of the non-missing values of x.

impute_mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

We can define an almost identical function to impute with the median:

impute_median <- function(x) replace(x, is.na(x), median(x, na.rm = TRUE))
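For a categorical variable there is no built-in mode function in base R, but a sketch in the same style might look like this:

```r
# Impute missing values with the most common non-missing category.
# table() ignores NA values by default, so the mode is computed from
# the observed values only.
impute_mode <- function(x) {
  mode_value <- names(which.max(table(x)))
  replace(x, is.na(x), mode_value)
}
```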

Here, we apply the impute_mean() function to all numeric columns of the jassids dataframe. Note that I am assigning the result to a new dataframe, jassids_mean_imputed, rather than assigning the result back to the original jassids object. This is good practice because the imputed dataframe does not have any indication of which values are the actual observations and which are the imputed values. If we keep the original dataframe in our working environment, we can refer to it to get that information.

jassids_mean_imputed <- jassids %>%
  mutate(across(where(is.numeric), impute_mean))

Also note that impute_mean() is applied to all numeric columns even if they don’t have any missing values. If a column with no missing values is passed to the function, it will return that column unchanged.

How does this look for the temp_max variable? Here are plots for the distribution of observed (blue) and imputed (red) values and a scatter plot with temp_max plotted on the y-axis with temp_min on the x-axis. We see, unsurprisingly, that all the missing values have been imputed with the same value: the mean. You’ll see more figures like this later.

Imputing using adjacent observations in space and/or time

A decent first assumption if you have data with space and/or time structure is that two observations that are close together in space or time are more likely to be similar than two observations that are far apart. We can put that to use if we have missing data.

The simplest way to do this is just to copy adjacent observations to fill in missing values. The tidyverse has a function called tidyr::fill(), but it is intended more for filling in missing categorical values than for dealing with time series data. The zoo package, on the other hand, is specialized for time series data. zoo has the function na.locf(), where LOCF stands for “last observation carried forward.”

Image (c) Te Ara, the Encyclopedia of New Zealand

For the next couple of examples we will be using a version of the diggle.cow dataset that you may be familiar with from other lessons I’ve taught. It’s a time series dataset of body weight measurements taken on 23 different days at semi-regular intervals from a total of 26 cows in a two-way factorial experiment. The two treatments were tuberculosis infection and iron supplementation. I’ve created a version of the dataset with some values artificially missing completely at random.

Import the data. As before you may import from the web URL where it’s hosted or from the data folder if you are running this code on the SCINet server.

cows <- read.csv('https://usda-ree-ars.github.io/SEAStats/dealing_with_missing_data/diggle.csv')
cows <- read.csv('data/diggle.csv')

Convert animal ID and treatment columns to factor variables.

cows <- cows %>%
  mutate(across(c(animal, iron, infect), factor))

Here, if an observation of weight is missing, we take the observation from the previous time point for that same animal and fill it in. To make sure that only observations from the same animal are taken, we first group_by(animal) then arrange() to sort the data first by animal and then in increasing order by day. The argument na.rm = FALSE within na.locf() means that if there are any remaining missing values at the beginning of any animal’s time series, they should be retained so that the imputed column has the same length as the original column.

cows_imputed <- cows %>%
  group_by(animal) %>%
  arrange(animal, day) %>%
  mutate(weight_LOCF_imputed = na.locf(weight, na.rm = FALSE))

I’ve created a new dataframe called cows_imputed which is a copy of cows with a new column called weight_LOCF_imputed appended to it so that the missing and imputed versions of weight can be compared side-by-side.

There are still missing values because the first day’s value was missing for some animals. We can go back and apply na.locf() again to the already-imputed column, but this time with the argument fromLast = TRUE. This will replace that missing value with the next day’s value.

cows_imputed <- cows_imputed %>%
  mutate(weight_LOCF_imputed = na.locf(weight_LOCF_imputed, na.rm = FALSE, fromLast = TRUE))
Linear interpolation, image (c) Wikipedia

Simply copying the previous observation might not be what you are looking for. Another way to impute would be to use a simple linear interpolation, drawing an imaginary line between the non-missing points on either side of the missing one. For example, cow A04 weighed 290 kg on day 380, and 310 kg on day 478. Its weight on day 445 is missing. It might be reasonable to assume that it gained weight at a constant rate between the two dates. We can draw a line between the two points \((x_0, y_0)\) and \((x_1, y_1)\) and find the y-coordinate for a given value of x using this formula:

\[y = \frac{y_0(x_1-x) + y_1(x-x_0)}{x_1-x_0}\]

Here’s a demonstration that this works. The imputed value of ~303 kg is about two-thirds of the way from 290 to 310 because day 445 is about two-thirds of the way from day 380 to day 478.

interp <- function(x, x0, y0, x1, y1) (y0 * (x1 - x) + y1 * (x - x0)) / (x1 - x0)

interp(x = 445, x0 = 380, y0 = 290, x1 = 478, y1 = 310)
## [1] 303.2653

Luckily, we don’t have to define a linear interpolation function from scratch every time we want to use it. Linear interpolation is implemented in base R using the approx() function. We can fill in the missing values in the cows dataset using this method. Although the dataframe is already grouped by animal and sorted by animal and day, I’ve repeated those lines in the code because you would need to include them if you are starting with a dataframe that is not grouped and sorted.

The approx() function needs x and y columns, and the argument xout tells the function which x values need to get interpolated y values. In this case xout is the same as x because we want interpolations for any missing y values corresponding to x values. The rule = 2 argument means that if there is an NA value at the beginning or end, it will replace it with the closest non-missing value.

cows_imputed <- cows_imputed %>%
  group_by(animal) %>%
  arrange(animal, day) %>%
  mutate(weight_interpolation_imputed = approx(x = day, y = weight, xout = day, rule = 2)$y)

More sophisticated interpolation methods can be used for time series data; we will not get into that here.

Imputing using randomly selected values from other observations

Another old-school type of imputation is to impute using a randomly selected value from another observation. Back in the day this was called “hot deck” imputation because data used to be stored on decks of punch cards. If a replacement value was taken from the dataset that was currently being read in the card reader, it was taken from the “hot deck.” In contrast, “cold deck” imputation means taking a value from a different dataset. The value that is randomly chosen could be taken from any other observation in the dataset, or it could be stratified. For example, if a dataset contains individuals that are either juvenile or adult and either male or female, a missing value for body mass of a juvenile female could be chosen at random from the body mass values of the other juvenile females in the dataset.

Image (c) IBM

Here’s an example of how to use R tidyverse functions to do hot-deck imputation. First stratify a dataset using group_by(). Then within a call to mutate() use some logic with the if_else() function to replace missing values with randomly chosen values from within the same group.

Group the data by treatment combination and time point using the columns iron, infect, and day. Within each of these groups, use the mutate() function to create an imputed version of the column weight. We use if_else() for this, which takes three arguments. The first argument is the condition we are testing; in this case it is is.na(weight), which will return TRUE if weight is missing and FALSE if not. The second argument is the value to assign if the condition is TRUE; here, it will be a randomly sampled value from the non-missing values of weight. The third argument is the value to assign if the condition is FALSE, which in this case is the unchanged value of weight.

set.seed(1)

cows_imputed <- cows_imputed %>%
  group_by(iron, infect, day) %>%
  mutate(weight_hotdeck_imputed = if_else(is.na(weight), sample(na.omit(weight), size = length(weight), replace = TRUE), weight))

IMPORTANT: You might have noticed that hot-deck imputation differs from the other forms of imputation we’ve covered so far because it involves randomness. Implementing hot-deck imputation more than once on the same dataset will give you a different result each time. Keep this in mind because it will be important when we start to talk about multiple imputation in a few minutes!
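
To see this randomness in action, here is a toy demonstration on a made-up vector (not the cows data), using the same if_else()-style logic in base R. Two runs with different seeds will generally produce different imputed values:

```r
# Toy hot-deck imputation on a made-up vector: each run draws replacement
# values at random from the non-missing entries, so different seeds give
# (generally) different results.
impute_hotdeck <- function(v) {
  ifelse(is.na(v), sample(na.omit(v), size = length(v), replace = TRUE), v)
}

x <- c(1.2, NA, 3.4, NA, 5.6)

set.seed(1)
run1 <- impute_hotdeck(x)
set.seed(2)
run2 <- impute_hotdeck(x)

run1
run2
```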

Here’s what the imputations look like. As usual, each time we show the distributions of the observed and imputed values on the left, and a scatterplot on the right. Does anything stand out to you about these plots?

Last observation carried forward

Linear interpolation

Hot-deck imputation

Imputing using a regression model

The next stop on our journey from simple quick-and-dirty imputation to sophisticated cutting-edge imputation is imputing using regression. Why not take advantage of the relationships that exist between variables in our dataset to impute missing values? To do regression imputation, we just fit a regression model with the variable we want to impute as the dependent variable and one or more of the other variables in the dataset as predictors. The input data points are those where we have a complete set of values for every variable. Then we simply use the regression model to predict the value of the dependent variable where it’s missing.

We could do this with a simple bivariate regression. Going back to the jassids dataset, let’s impute the temp_max variable using the temp_min variable. Our linear model tells us we can predict temp_max using the equation 25.95 + 0.21 * temp_min. (How could you check how good a prediction that is?)

temp_linear_fit <- lm(temp_max ~ temp_min, data = jassids)

coefficients(temp_linear_fit)
## (Intercept)    temp_min 
##  25.9537226   0.2058685

Use the ifelse() function2 to create a vector of imputed temperature values, using the predicted temp_max values for the rows in the dataframe where they were missing originally, and the observed values otherwise.

pred_temp_max <- predict(temp_linear_fit, newdata = jassids)

temp_linear_imputed <- ifelse(is.na(jassids$temp_max), pred_temp_max, jassids$temp_max)

How does this look? Below, the right-hand plot shows the temp_max variable plotted against the temp_min variable.

You can see an issue from the right-hand plot above: it looks like the relationship between maximum and minimum temperature is not really linear, especially as minimum temperature gets above 20 degrees. But the imputed values all fall right on that possibly inappropriate linear trendline. Well, linear models are so 20th century, anyway. Get with the times and do some machine learning instead! Instead of predicting maximum temperature with a linear model using one other predictor, let’s fit a random forest model that predicts maximum temperature — and any other variables that have missing values — using all the other predictors in the dataset. Random forest can capture nonlinear relationships quite well.

Because this isn’t a random forest lesson, I will not go into much detail on exactly how random forest works; see my lesson on machine learning basics if you want those details. What you need to know for now is that the R package missForest has a function called missForest(). If you pass missForest() a dataframe with missing values, you get a complete dataframe back where the missing values were imputed using a random forest technique. There are a lot of additional arguments to tweak the tuning parameters of the random forest model, but I won’t get into any of those here. The only additional argument I have included is variablewise = TRUE which will give us a report of the mean squared error (MSE) of the random forest model column-by-column. These values are the errors for the non-missing values in the column. According to the package developer, if these errors are low, we can be confident that the errors in the imputations of the missing values (the imputations we actually care about) are also low.

set.seed(2)
jassids_RF_imputed <- missForest(jassids, variablewise = TRUE)

This function returns a list of two objects, ximp and OOBerror. The first is the complete dataset and the second is the MSE for each column.

head(jassids_RF_imputed$ximp)
##   Jassids      RF temp_max temp_min   RH1   RH2  BSS wind_velocity
## 1    0.20 18.8700  34.6000  24.5000 68.82 45.54 6.55         6.480
## 2    0.23 34.9600  33.9200  24.1900 71.23 48.12 5.25         6.170
## 3    0.87 29.5100  31.1573  23.6600 73.88 50.89 4.81         4.357
## 4    1.63 24.1195  32.1000  23.6900 77.48 57.47 4.10         5.730
## 5    2.38 17.7800  31.4400  23.2172 77.87 59.18 3.58         5.410
## 6    5.12 29.3800  31.0600  23.5500 77.08 61.15 2.91         5.680

Take the square root of the MSE values and rename them after the columns of jassids so we can get an idea of the quality of the imputation. We see there are zeros for the columns that didn’t have any missing values. The random forest model predicted the non-missing values of temperature with an “average” error of 1.16 degrees.

Remember, this is the error for the predictions for the points where we do know the values. For the points where we don’t know the values (the missing values), we will never know for sure if the imputation was correct. That’s always the issue with imputation of missing data!

sqrt(jassids_RF_imputed$OOBerror) |> setNames(names(jassids))
##       Jassids            RF      temp_max      temp_min           RH1 
##      0.000000     15.186841      1.155743      1.684493      0.000000 
##           RH2           BSS wind_velocity 
##      4.458125      0.000000      1.071593

By comparison, here is the root mean squared error for the simple linear model using only minimum temperature to predict maximum temperature. At 1.44, it’s a bit higher than the random forest, but still reasonable. The machine learning model only modestly outperforms the naive linear model approach in terms of predicting values we already know.

sqrt(mean(residuals(temp_linear_fit)^2))
## [1] 1.438877

Here are plots similar to those we’ve made for other imputation strategies. Compared to what we’ve seen before, the imputed temperature values are starting to look more reasonable in terms of having plausible-looking noise and being distributed similar to the non-missing values.

Consequences of single imputation

Many datasets are “cleaned” by using ad hoc imputation methods like substituting the mean or doing linear interpolation on a time series. Some people even get fancy and do regression-based imputation, including machine learning models like the random forest imputation we just saw. You end up with a nice tidy-looking dataset with no unsightly empty cells or NAs, and you proceed to fit a statistical model, test hypotheses, and make inferences. So far, so good, right? Right?

Typical approach to imputation

Well, it’s important to remember that any value that you did not directly observe and measure is not data. It is the output of a model. Most ad hoc imputation methods generate a single imputed dataset. That dataset contains modeled values masquerading as data, so it hides any uncertainty associated with the imputation process. This can have serious consequences for your inference later on: it might lead to inappropriate rejections of the null hypothesis (if using Frequentist statistics) or inappropriately precise estimates of effect size (in both Bayesian and Frequentist frameworks). This is true even for the most sophisticated machine learning prediction model used to impute missing data: pretending that modeled values are actual observations inevitably leads to bias.

I will just quote the words of the great Stef van Buuren here, talking about regression imputation.

“[Regression imputation] artificially strengthens the relations in the data. Correlations are biased upwards. Variability is underestimated. Imputations are too good to be true. Regression imputation is a recipe for false positive and spurious relations.”

Of course, as I’ve said a few times already in this lesson, it is a matter of degree. If the proportion of missing values is small, a quick fix approach is okay. Ignoring uncertainty in the modeled values generated by single imputations, if there are only a few, will not have serious consequences. But if you have lots of missing values and you really need to impute them, the only statistically and scientifically defensible thing to do is to impute using an approach that explicitly accounts for the uncertainty. This is an ideal segue into our next section . . .

Missing data strategy 3: Multiple imputation

For the rest of our lesson, we’ll be talking about the most technically demanding, but also the most interesting, rewarding, and useful, flavor of imputation.

All the imputation methods we’ve discussed so far, with the possible exception of hot-deck imputation, turn a missing value into a single imputed data point without any measure of uncertainty associated with the imputing process for each of the individual estimates. This is a problem no matter how fancy a statistical model is doing the imputing. How do we make sure our final estimates of uncertainty of the effect sizes we care about include the uncertainty resulting from some of the underlying data being missing and imputed?

Multiple imputation is the best way of imputing missing values and correctly accounting for the associated uncertainty. Below, we will discuss and give some examples of different multiple imputation techniques, showing how you would present the results of an analysis based on multiply imputed data.

Donald Rubin

Multiple imputation using chained equations

Multiple imputation traces back to a brilliant insight by statistician Donald Rubin in 1977. Rubin observed that you could capture the uncertainty in the missing data by creating multiple different imputed versions of a dataset, drawing the imputed values from an estimated distribution each time. Then, you fit the desired statistical model to each one of these imputed datasets and somehow combine the estimates from the multiple models, in such a way that the uncertainty in the missing values is reflected in the uncertainty estimates (e.g., confidence intervals) of the final estimates. Adjustments are done to the degrees of freedom because when you impute values, you are “using up” degrees of freedom, with consequences for your test statistics and p-values.
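
The "somehow combine" step is known as Rubin's rules, and the core arithmetic is short enough to sketch in base R. The pooled point estimate is the mean of the per-imputation estimates, and the pooled variance adds a between-imputation term on top of the average within-imputation variance. The estimates and standard errors below are made up for illustration; in practice pool() does this for you (with some refinements):

```r
# A minimal sketch of Rubin's pooling rules, applied to hypothetical
# per-imputation estimates and standard errors (values are made up).
pool_rubin <- function(estimates, std_errors) {
  m     <- length(estimates)
  q_bar <- mean(estimates)            # pooled point estimate
  u_bar <- mean(std_errors^2)         # average within-imputation variance
  b     <- var(estimates)             # between-imputation variance
  t_var <- u_bar + (1 + 1/m) * b      # total variance of the pooled estimate
  df    <- (m - 1) * (1 + u_bar / ((1 + 1/m) * b))^2  # adjusted degrees of freedom
  c(estimate = q_bar, std.error = sqrt(t_var), df = df)
}

# Three hypothetical imputations that agree fairly closely:
pool_rubin(estimates = c(0.31, 0.29, 0.33), std_errors = c(0.05, 0.06, 0.05))
```

Notice that the between-imputation term inflates the pooled standard error above the average per-imputation standard error: the more the imputations disagree, the more uncertainty is added to the final estimate.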

A mathematical derivation of how this all works is beyond me, and anyway it’s out of the scope of this short workshop. If you want to learn more, I highly recommend Stef van Buuren’s book (see Further Reading section below).

The mice package does all of this to a tee. The developers have written tons of methods to allow mice to be incorporated into your R stats workflow. Multiple imputations generated with mice can be used not only with base R model fitting functions like lm() and glm() but also with mixed model functions like lmer() and glmmTMB(), and even Bayesian GLMMs with brm(). You can also use emmeans() on a set of imputed datasets generated with mice, and the increased uncertainty and adjustments to the degrees of freedom associated with the imputation will be correctly reflected in the estimates you get there. If you have more customized analyses, you may have to write some of the code to perform analyses on each imputed dataset and combine the results “manually.” But the out-of-the-box code should work for most common analyses you might do.

The MICE (multiple imputation using chained equations) workflow goes like this:

  • First, we create a set of m imputed datasets using the function mice(). This produces an object of class mids (multiply imputed dataset).
  • Next, we wrap our statistical modeling function inside the function with(), which fits the statistical model to each of the m datasets within the mids object. This produces an object of class mira (multiply imputed regression analyses).
  • Finally, we use the function pool() on the mira object to pool (combine) the estimates from each model into a combined estimate, with adjusted degrees of freedom. This creates an object of class mipo (multiple imputation pooled object). Alternatively, we can pass the mira object to functions like emmeans().

Let’s go through each step with some examples, and talk about some of the different arguments we can use to modify the imputation process to suit our needs.

mice() for multiple imputation

Let’s do multiple imputation on the jassids dataset. For a first try, let’s not supply any arguments other than the defaults, except to specify m = 10 to create ten imputed datasets, and seed to ensure the result is reproducible (remember that multiple imputation involves random draws from probability distributions). Simulation studies have shown that m = 10 is adequate for most practical applications, but feel free to try out different values to see when your final estimates converge on a stable value.

jassids_mice_imputed <- mice(jassids, m = 10, seed = 27606)

What do we get? Take a look at some components of the mids object. nmis shows us how many missing values there were in each column. imp shows us the imputed values for each of the ten datasets, column by column. Columns that had no missing values are associated with empty dataframes within imp, but columns with missing values are associated with \(n \times m\) dataframes, where \(n\) is the number of missing values. Lastly, method shows us the imputation method that was used to impute each column.

jassids_mice_imputed$nmis
##       Jassids            RF      temp_max      temp_min           RH1 
##             0             4            12             2             0 
##           RH2           BSS wind_velocity 
##             5             0             9
jassids_mice_imputed$imp
## $Jassids
##  [1] 1  2  3  4  5  6  7  8  9  10
## <0 rows> (or 0-length row.names)
## 
## $RF
##        1     2     3    4     5     6     7     8     9    10
## 4  22.06 21.87 23.10 32.5  2.00 42.11 17.78 19.69 21.00 21.00
## 7  21.00 10.60 24.44 47.2 23.70  2.66 10.60 23.70  0.00 22.06
## 18  5.65  5.65  4.80  0.0  0.00 10.60  2.00 24.44 19.69 28.90
## 21  2.67  2.67  2.60  0.0  8.54  0.00 28.90  2.60  2.60 24.44
## 
## $temp_max
##        1     2     3     4     5     6     7     8     9    10
## 3  32.10 32.10 31.44 30.99 34.60 32.10 30.99 34.60 30.99 33.92
## 12 29.95 29.99 29.45 30.60 30.27 28.95 30.03 29.75 31.34 31.34
## 14 30.03 29.15 29.35 31.20 28.95 29.22 29.15 29.22 31.34 29.99
## 18 30.70 29.35 28.85 31.20 29.34 30.99 30.54 30.01 31.20 31.34
## 19 30.60 30.95 31.06 30.65 31.06 29.99 31.20 30.01 31.06 30.60
## 22 29.78 31.44 30.03 29.78 29.45 29.98 29.35 30.99 31.82 30.01
## 31 30.99 31.20 32.10 30.05 31.20 31.20 30.99 30.65 32.10 34.60
## 36 29.10 28.25 28.00 28.00 28.00 28.07 29.34 29.39 29.22 29.65
## 41 32.10 31.06 30.95 30.95 32.10 30.03 31.34 31.20 30.05 30.54
## 46 28.85 28.85 30.60 29.95 31.82 30.01 28.85 29.99 31.44 28.85
## 51 30.10 30.10 29.65 30.10 30.60 29.10 28.85 30.99 30.50 29.49
## 53 28.85 31.06 31.20 31.44 28.85 31.34 30.95 30.60 30.60 28.85
## 
## $temp_min
##        1     2     3     4    5     6     7    8     9   10
## 5  22.73 23.31 23.69 23.31 25.0 23.31 23.55 23.5 23.55 25.0
## 23 15.89 13.88 13.88 15.50 18.9 16.55 19.60 19.6 15.50 15.2
## 
## $RH1
##  [1] 1  2  3  4  5  6  7  8  9  10
## <0 rows> (or 0-length row.names)
## 
## $RH2
##        1     2     3     4     5     6     7     8     9    10
## 11 51.50 62.00 59.08 66.65 65.85 60.66 59.86 62.00 60.05 60.05
## 18 45.54 53.95 51.95 43.96 36.43 51.85 53.95 51.85 45.54 36.43
## 20 34.80 33.35 34.80 42.00 36.88 34.80 36.45 36.43 37.92 33.35
## 28 36.88 38.81 38.77 36.88 38.81 36.43 33.65 41.71 36.88 36.88
## 51 34.80 38.81 45.54 34.80 43.96 48.12 53.95 33.35 36.45 35.00
## 
## $BSS
##  [1] 1  2  3  4  5  6  7  8  9  10
## <0 rows> (or 0-length row.names)
## 
## $wind_velocity
##       1    2    3    4    5    6    7    8    9   10
## 3  5.19 3.65 5.00 6.48 3.20 3.65 5.19 6.13 3.65 4.35
## 10 3.65 4.85 5.68 5.46 5.19 3.65 5.68 3.65 3.65 5.68
## 16 2.70 2.99 2.40 1.36 2.99 5.46 4.30 2.90 2.70 2.40
## 29 0.95 1.20 0.95 1.20 1.20 1.00 1.07 1.20 1.20 1.00
## 34 5.41 5.00 3.95 6.48 6.55 6.17 6.55 3.20 6.55 5.00
## 36 5.46 3.65 5.46 3.65 5.68 5.19 6.13 5.46 3.65 4.85
## 41 2.01 2.01 2.90 2.50 1.80 2.01 2.99 2.99 2.01 2.90
## 50 0.64 1.15 1.20 1.20 1.00 1.20 1.15 1.20 0.95 1.15
## 51 1.07 1.00 0.85 0.55 1.07 1.35 1.36 1.07 2.54 1.00
jassids_mice_imputed$method
##       Jassids            RF      temp_max      temp_min           RH1 
##            ""         "pmm"         "pmm"         "pmm"            "" 
##           RH2           BSS wind_velocity 
##         "pmm"            ""         "pmm"

You can see that "pmm", which stands for predictive mean matching, was chosen as the default method for each of the columns with missing data. Other defaults are used for categorical variables. The PMM algorithm works by first fitting a regression model with a ridge regularization parameter that uses the other columns of the dataset to predict values for all the data points for a given column, including both the missing and non-missing ones. Then, for each missing entry, we find a small set (default 5) of other observations in the dataset that have similar predicted values to the predicted value of the missing entry. We take the actual observed values for those five non-missing entries (not their predicted values) and randomly draw a value from them. That’s our imputed value!
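
To make the donor-matching step concrete, here is a hand-rolled sketch for a single missing value, using made-up data. (This is not the mice implementation itself: among other refinements, mice also draws the regression coefficients from their approximate posterior before predicting, which this sketch omits.)

```r
set.seed(1)

# Made-up complete cases: predictor x and response y
x_obs <- c(1, 2, 3, 4, 5, 6, 7, 8)
y_obs <- c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 14.1, 15.9)

# Predictor value for the row whose y is missing
x_mis <- 4.4

fit <- lm(y_obs ~ x_obs)
pred_obs <- fitted(fit)                                    # predictions for observed rows
pred_mis <- predict(fit, newdata = data.frame(x_obs = x_mis))

# Find the 5 observed rows whose predicted values are closest to the
# prediction for the missing entry ...
donors <- order(abs(pred_obs - pred_mis))[1:5]

# ... and impute by drawing one of their *observed* y values at random
y_imputed <- sample(y_obs[donors], size = 1)
y_imputed
```

Because the imputed value is always an actually observed value, PMM never imputes impossible values (negative counts, weights of zero, and so on), which is one reason it makes a good default.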

If you were able to follow all that, you will have noticed that PMM combines aspects of regression imputation (in particular it’s a bit machine-learning in flavor because of the regularization parameter) and hot-deck imputation (because it draws values from the dataset currently being analyzed). You will also have noticed that there are various parameters that could be tweaked to get slightly different results. Of course, just like with any analysis you do, you should be aware of any sensitivity your final results have to design decisions you make along the way, such as the choice of methods and parameters for imputation. The defaults of mice() were selected because they are sensible and have been shown to work well in many cases, but you definitely should not accept them as gospel without exploring sensitivity.

Here’s a plot similar to the ones we made earlier showing the relationship between the temp_max and temp_min values, both observed and imputed. Now the imputed values have error bars on them to show the distance between the highest and lowest values from the ten imputations (showing only the temp_max imputations in this case). The hollow red points are the means of the ten imputed values. They are not “the” imputed value because there is no single imputed value; they are only plotted to show a rough idea of what MICE is doing.

If you look at the help documentation for ?mice you will see that it supports many other methods, including random forest imputation, and all kinds of other regression-based methods. But the key is that most of them incorporate randomness to some extent, so that you will get multiple realizations of the imputation that can be passed along to the next steps in the process, propagating the uncertainty from the imputation forward to the final result.

Actually, all the simpler methods we demonstrated earlier, like single imputation using the mean, or pure hot-deck imputation, are implemented within mice(). Thus, you could use mice() as a one-stop shop for all your imputation use cases, setting m = 1 for the methods that don’t incorporate randomness. I demonstrated how to do those forms of imputation with base R and tidyverse functions so that you could understand better how they work, before introducing you to the increased power and flexibility of mice().

Let’s try a different imputation method. If we provide method = 'rf' we will get random forest imputation; however this isn’t just a single imputation like you get from missForest(). It uses a special tweak on the random forest algorithm to generate multiple imputed datasets.

jassids_mice_RF_imputed <- mice(jassids, m = 10, seed = 27606, method = 'rf')

Again, you should explore the arguments of mice() to see what default tuning parameters are used for this method and how you might modify them.

Here’s a plot of the imputation results for the random forest algorithm. It looks pretty similar to the PMM results.

Now we can fit a model, for example let’s say we wanted to predict jassid abundance from all other variables in the dataset (ignoring that it’s not a great model). First we’ll use the datasets imputed with the PMM method.

jassids_multiple_lms <- with(jassids_mice_imputed, lm(Jassids ~ RF + temp_max + temp_min + RH1 + RH2 + BSS + wind_velocity))
jassids_multiple_lm_coeffs <- pool(jassids_multiple_lms)

summary(jassids_multiple_lm_coeffs)
##            term    estimate   std.error  statistic       df     p.value
## 1   (Intercept)  3.22001541 13.94901920  0.2308417 28.91597 0.819063322
## 2            RF -0.04343613  0.02306303 -1.8833660 35.22209 0.067929015
## 3      temp_max -0.15770185  0.42503637 -0.3710314 28.13832 0.713393053
## 4      temp_min  0.39032166  0.24511860  1.5923788 37.11455 0.119782024
## 5           RH1 -0.01700536  0.06538380 -0.2600852 44.47061 0.795998535
## 6           RH2  0.05202114  0.07733337  0.6726868 31.14328 0.506107501
## 7           BSS -0.09162332  0.13436862 -0.6818803 44.60785 0.498839966
## 8 wind_velocity -0.92724852  0.30788293 -3.0116918 41.23133 0.004423532

Let’s look at whether we get different results if we use the datasets imputed with the random forest algorithm instead of the predictive mean matching algorithm.

jassids_multiple_lms_RF <- with(jassids_mice_RF_imputed, lm(Jassids ~ RF + temp_max + temp_min + RH1 + RH2 + BSS + wind_velocity))
jassids_multiple_lm_RF_coeffs <- pool(jassids_multiple_lms_RF)

summary(jassids_multiple_lm_RF_coeffs)
##            term    estimate   std.error   statistic       df    p.value
## 1   (Intercept) -0.52099982 13.19998453 -0.03946973 24.60334 0.96883443
## 2            RF -0.03898685  0.02300277 -1.69487618 34.83599 0.09902162
## 3      temp_max -0.05665819  0.40569700 -0.13965641 23.33124 0.89013006
## 4      temp_min  0.25100063  0.25320340  0.99130041 24.57854 0.33119833
## 5           RH1  0.01044568  0.06824466  0.15306217 34.72891 0.87923514
## 6           RH2  0.05725752  0.07813806  0.73277373 22.96654 0.47110981
## 7           BSS -0.03662229  0.14068826 -0.26030804 39.34083 0.79598382
## 8 wind_velocity -0.68459260  0.34112942 -2.00684129 21.35661 0.05759101

Here’s a plot of the coefficients and their approximate 95% confidence intervals for the two methods. (Most of the code here is just for putting together the estimates from the different methods and drawing the plot.) For comparison, I’ve included the results from complete-case analysis, where we are forced to use only the rows in the jassids dataframe that have no missing values. Looks like our estimates are not too sensitive to the method of imputation used. However, we might draw somewhat different conclusions about the effects of some of the variables on jassid abundance from our imputed analysis compared to the complete-case analysis. Why do you think that might be?

See the exercises below for more comparing of different multiple imputation methods, or feel free to try different ones out for yourself.

Example multiple imputation workflow: glmmTMB and emmeans

Let’s work through another example of the MICE workflow using the cow weight dataset we saw earlier today. Note that I am using a slightly inappropriate statistical model just for demonstration purposes.

For this imputation, we will be using the "2l.pan" method implemented in the mice package. This method is suitable for imputing data with a multilevel structure. This would be anything you’d analyze with a mixed model, such as data that come from an experiment with a randomized block design, or data with repeated measures on the same individuals over time. The cow dataset has repeated measures of weight taken on the same cows over multiple days.

This method takes some bookkeeping code to set up. We have to specify that we only want to impute the weight column by creating a vector of imputation method names that has an empty string '' for all dataframe columns other than weight, and the value '2l.pan' for weight. (weight is the fourth of five columns in the input dataframe.) Next we have to create a predictor matrix to specify which effects are fixed and which are random (grouping variables) when predicting weight. The weight row of this matrix needs to have -2 for the grouping variable and 1 for the fixed effects, with 0 in all other rows because we are not imputing the other variables in the dataframe.

imp_methods <- c('', '', '', '2l.pan', '')

pred_mat <- rbind(animal = 0, iron = 0, infect = 0, weight = c(animal = -2, iron = 1, infect = 1, weight = 0, day = 1), day = 0)

Next, do the multiple imputation on the cows dataframe. Note that we need to convert the grouping variable, animal, to an integer. This is required by the imputation method.

cows_multilevel_imputed <- cows %>%
  mutate(animal = as.integer(animal)) %>%
  mice(m = 10, method = imp_methods, predictorMatrix = pred_mat, seed = 142)

Here’s a plot showing how the imputation did, showing a few selected cows’ data on separate panels. Again there is an error bar on each imputed value showing the distance between the highest and lowest imputed values, and the hollow red points being the means of the imputations of each missing value.

Use with() to fit a mixed model to each of the imputed datasets, with a random intercept for animal and fixed effects for the two treatments, time in days, and their interaction.

cows_fit <- with(cows_multilevel_imputed, glmmTMB(weight ~ iron * infect * day + (1 | animal)))

Use pool() to get pooled estimates of the model coefficients.

pool(cows_fit) |> summary()
##                               term    estimate    std.error  statistic       df
## 1                      (Intercept) 95.32099880  6.727945471 14.1679208 571.3392
## 2                       ironNoIron -2.28425203  9.771052539 -0.2337775 572.7028
## 3                infectNonInfected  3.83621449 13.998305763  0.2740485 572.8479
## 4                              day  0.31094655  0.005315728 58.4955782 251.0726
## 5     ironNoIron:infectNonInfected  8.62215014 18.878210860  0.4567250 582.4862
## 6                   ironNoIron:day  0.04946753  0.007564675  6.5392801 338.5060
## 7            infectNonInfected:day  0.06299665  0.010817034  5.8238378 348.1347
## 8 ironNoIron:infectNonInfected:day -0.01053632  0.014238521 -0.7399870 506.6398
##         p.value
## 1  2.876813e-39
## 2  8.152413e-01
## 3  7.841462e-01
## 4 2.767789e-148
## 5  6.480390e-01
## 6  2.280503e-10
## 7  1.308385e-08
## 8  4.596507e-01

Usually in a designed experiment like this one, the model coefficients by themselves are hard to interpret. We are often more interested in estimated marginal means or trends. In this case we can use emmeans() to get the pooled estimates for each treatment’s mean weight on a specific day (day 750 in this example), or emtrends() to get the pooled estimates for each treatment’s marginal time trend (weight gain rate in kg/day). You can proceed to do hypothesis tests such as pairwise mean comparisons as you normally would; the uncertainty associated with the imputations is incorporated.

emmeans(cows_fit, ~ iron + infect, at = list(day = 750))
##  iron   infect      emmean    SE     df lower.CL upper.CL
##  Iron   Infected       329  6.56  33260      316      341
##  NoIron Infected       363  6.89 137860      350      377
##  Iron   NonInfected    380 12.00  41978      356      403
##  NoIron NonInfected    415 10.40  57381      395      435
## 
## Confidence level used: 0.95
emtrends(cows_fit, ~ iron + infect, var = 'day')
##  iron   infect      day.trend      SE   df lower.CL upper.CL
##  Iron   Infected        0.311 0.00532  497    0.301    0.321
##  NoIron Infected        0.360 0.00542 1576    0.350    0.371
##  Iron   NonInfected     0.374 0.00930 2749    0.356    0.392
##  NoIron NonInfected     0.413 0.00857  320    0.396    0.430
## 
## Confidence level used: 0.95

Example multiple imputation workflow: Bayesian GLMM using brms

The procedure to fit a Bayesian GLMM with the brms package using multiply imputed data is very similar to the above. We use the function brm_multiple() which is designed to work with objects of class mids. Here, if we specify chains = 1, a single Markov chain will be sampled for each of the ten datasets (in practice this should be increased). The results are pooled simply by concatenating the samples from the different chains, just the same way it is done if you run multiple chains with a single dataset. Pooling and propagating uncertainty from imputation is much more straightforward in a Bayesian context. Because we’re using uninformative prior distributions on the treatment and time effects, and because uncertainty due to the imputation is being propagated in exactly the same way for the Bayesian and non-Bayesian model fits, we get virtually identical results to the frequentist statistical approach with glmmTMB().

I’m glossing over a lot of details about Bayesian statistics here, and skipping the usual convergence and model fit diagnostics I would do. This is just a quick demonstration that multiple imputation works well if you’re fitting Bayesian mixed models. Check out my Bayesian lessons for more details on the Bayes approach.

cows_bayesfit <- brm_multiple(
  weight ~ iron * infect * day + (1 | animal), 
  data = cows_multilevel_imputed,
  prior = c(
    prior(normal(0, 10), class = b)
  ),
  cores = 4, chains = 1, iter = 2000, warmup = 1000, seed = 142,
  file = 'data/cows_bayesfit', file_refit = 'never'
)

emmeans(cows_bayesfit, ~ iron + infect, at = list(day = 750))
##  iron   infect      emmean lower.HPD upper.HPD
##  Iron   Infected       329       316       341
##  NoIron Infected       365       352       378
##  Iron   NonInfected    379       359       397
##  NoIron NonInfected    412       392       430
## 
## Point estimate displayed: median 
## HPD interval probability: 0.95
emtrends(cows_bayesfit, ~ iron + infect, var = 'day')
##  iron   infect      day.trend lower.HPD upper.HPD
##  Iron   Infected        0.311     0.301     0.321
##  NoIron Infected        0.360     0.350     0.370
##  Iron   NonInfected     0.374     0.356     0.391
##  NoIron NonInfected     0.414     0.397     0.430
## 
## Point estimate displayed: median 
## HPD interval probability: 0.95
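Although I skipped the diagnostics above, one brm_multiple()-specific point is worth knowing: the R-hat statistics of the combined chains can look inflated simply because the imputed datasets differ, so brm_multiple() also stores the R-hat values computed within each imputed dataset, and those are the appropriate ones to check. A quick look might be (a sketch; output not shown):

```r
# Within-dataset R-hat values stored by brm_multiple(); one row per
# imputed dataset, one column per parameter. Values near 1 indicate
# convergence within each dataset's chain(s).
round(cows_bayesfit$rhats, 2)
```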

A final note on Bayesian imputation

I just told you that multiple imputation is the best kind of imputation, but I neglected to mention that there is also a fully Bayesian approach to imputation. The MICE approach is already Bayesian in spirit: it propagates uncertainty from the imputation step through to your final estimates, and it repeatedly samples each missing value from a "posterior" distribution. We've shown above how to use brm_multiple() to incorporate MICE imputation into a Bayesian modeling workflow. That is a two-step process: first do the imputation, then fit the same Bayesian model to each imputed dataset. Well, why not go "full Bayesian" and explicitly include the imputation process within a single statistical model? This is an elegant and statistically sound way of imputing missing data. The Bayesian approach treats the missing values in your dataset as additional parameters for the model to estimate, alongside the slopes, intercepts, variances, etc. that you may be more familiar with. In the Bayesian world, all parameters are treated as uncertain, and are modeled with probability distributions that represent the state of our (imperfect and uncertain) knowledge about them.

As described in the brms missing data vignette, this kind of Bayesian imputation has the great advantage that the model only needs to be fit once, instead of m times (once for each imputed dataset). Bayesian models can be computationally intensive to fit, so this can be a big deal. But we lose a little of the "automatic" flavor of mice imputation: by default, mice uses all the other variables to predict the missing ones, whereas when imputation and model fitting happen in the same step, we have to explicitly write out which variables are used to impute the others.
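To give a flavor of the syntax, here is a minimal sketch adapted from the brms missing data vignette, using the small nhanes example dataset that ships with mice (this model is separate from our cows example):

```r
library(brms)
data(nhanes, package = 'mice')

# 'chl' has missing values: it is modeled from 'age' and imputed on the
# fly, while mi(chl) lets the (partly imputed) chl predict 'bmi'.
# The '| mi()' term marks a response whose missing values are estimated
# as parameters. set_rescor(FALSE) turns off residual correlations
# between the two response variables.
bform <- bf(bmi | mi() ~ age * mi(chl)) +
  bf(chl | mi() ~ age) +
  set_rescor(FALSE)

fit <- brm(bform, data = nhanes)
```

Notice that, unlike with mice(), the formula spells out exactly which variables impute which.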

It’s outside the scope of this lesson to include a fully worked example of this kind of Bayesian imputation. For now, I will refer you to the brms missing data vignette. I would be happy to work with anyone interested in Bayesian imputation of their missing data — feel free to contact me!

Additional resources

More R packages for missing data

In addition to the great packages introduced in this lesson, you may also want to check these out:

  • Amelia: Uses a bootstrapping algorithm to do multiple imputation, specializing in time-series datasets
  • simputation: Implements a number of single, not multiple, imputation algorithms, and is written to work well with the “tidyverse pipe” syntax
  • miceadds: An extension of the mice package that implements some other techniques beyond what’s available in mice
  • imputeTS: Methods for time series imputation

Further reading

Here are some handbooks and tutorials about missing data and imputation that you can use to expand on what you’ve learned here:

Exercises

For the exercises, we will use another example dataset that contains missing values. It is the mammalsleep dataset from the mice package, originally published by Allison and Cicchetti in 1976. The researchers wanted to explore whether total hours of sleep per day (ts in the dataset) at the species level could be predicted by other species traits such as body weight (bw), brain weight (brw), and gestation time (gt). Type ?mammalsleep in your console to get more information about the dataset.

Little brown bats sleep ~20 hours a day. Image (c) Ondrej Prosicky

Here is how to load the dataset, which comes with the mice package, into your R environment. We exclude the species name column from the dataset to simplify how mice() will handle the imputation later.

data(mammalsleep, package = 'mice')
mammalsleep <- mammalsleep %>% select(-species)

Exercise 1

Use summary statistics and visualizations to explore the missing values in the dataset. Use the VIM or naniar package, or make your own visualizations.

Exercise 2

Fit a linear model using only rows that have no missing values for ts, bw, brw, or gt. Model hours slept per day as a function of the logarithm of body weight, the logarithm of brain weight, and the logarithm of gestation time. Your model formula will be ts ~ log(bw) + log(brw) + log(gt). Output the model coefficients and their standard errors.

Exercise 3

Impute the missing values using the mean of each column and fit the same linear model as above to the imputed dataset. Again, output the model coefficients and their standard errors.

Exercise 4

Impute the missing values using the PMM algorithm with mice(), setting m = 10. Fit the same linear model as above to each of the imputed datasets. Output pooled estimates of the model coefficients and their standard errors.

Exercise 5

Are the conclusions different depending on which of the three methods you used? If so, why?

Click here for answers


  1. It’s OK to say “there is data” because data is now considered by many speakers to be a mass collective noun.↩︎

  2. Notice this is base R’s ifelse(), not tidyverse’s if_else().↩︎