By popular demand, this is a lesson about missing data: what it is, what problems it can cause, how to recognize and describe it, and what to do about it.
At the end of this course, you will understand …
At the end of this course, you will be able to …
Where there is data, there is missing data.
Missing data can happen for many different reasons, and the most common causes of missingness differ between scientific fields. What are some reasons you can think of why a data point might be missing?
Not all missingness is created equal. There are three categories of missing data (Rubin 1976), based on the process that determines how probable it is for a data point to be missing.
Missing completely at random (MCAR): The probability of being missing is the same for every data point in the dataset. If this assumption is valid, that is convenient, because we don’t have to account for any biases resulting from the missing data.
Missing at random (MAR): The probability of being missing varies, but it varies according to a group or variable that we have in our observed data. For example our instrument is more likely to fail on a rainy day, but we have data on rainfall by day. This is not as easy to deal with as MCAR, but it is still feasible to deal with because we can make use of the other information we have to model the process that is causing the missing data pattern. Most of the standard methods for dealing with missing data assume MAR.
Missing not at random (MNAR): The probability of being missing varies, for unknown reasons, i.e., reasons that we cannot account for with some other source of data. This is the most difficult situation to deal with. For example, if you are doing a survey with a follow-up, people who do not have a strong opinion are less likely to respond or to follow up. Or plant genotypes that are less resistant to disease are more likely to die before the end of the study, so their final biomass value is more likely to be missing. Especially if you work with survival data, you may have heard the term censored data. Censoring is a special kind of missing not at random. If a data point is censored, that means it is missing if it’s above or below a threshold (for example, a concentration below the limit of detection of the instrument or assay). Censoring also occurs if the subject was removed from the study, or if it died. A common type of censoring occurs in survival studies where subjects are only tracked for a finite period of time. Any subjects still alive at the end of the study period have censored time-of-death values — we know the subject survived at least as long as the study period but we don’t know exactly how long. All of those types of censoring will bias the results if they are not properly accounted for.
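To make the MCAR/MAR distinction concrete, here is a small simulation sketch. All the variable names and numbers are made up for illustration; the point is that under MCAR the missingness probability is constant, while under MAR it depends on a fully observed covariate (rainfall) that we could use to model the mechanism.

```r
# Toy simulation contrasting MCAR and MAR (hypothetical variables).
set.seed(42)
n <- 10000
rainfall <- runif(n, 0, 50)      # fully observed covariate
reading <- rnorm(n, mean = 20)   # instrument reading that may go missing

# MCAR: every reading has the same 10% chance of being missing.
reading_mcar <- ifelse(runif(n) < 0.1, NA, reading)

# MAR: the instrument fails more often on rainy days, but rainfall itself
# is fully observed, so the missingness mechanism can be modeled.
p_miss <- plogis(-3 + 0.1 * rainfall)
reading_mar <- ifelse(runif(n) < p_miss, NA, reading)

mean(is.na(reading_mcar))            # roughly 0.10 overall
mean(rainfall[is.na(reading_mar)])   # missing readings come from rainier
mean(rainfall[!is.na(reading_mar)])  # days than the observed ones do
```

Comparing the mean rainfall for the missing versus observed readings in the MAR case shows the dependence that an analysis would need to account for.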
Note that I have not discussed the issue of entire replications missing from treatments in a designed experiment. For example, a plot in a randomized complete block design may be destroyed or not measured for whatever reason. In that case, the entire observation is missing, including the outcome. Obviously, it is never good to lose observations because it reduces the overall sample size. However, it is not a major concern from the perspective of analysis. In olden times, unbalanced datasets caused issues with analysis of variance. Modern procedures are very robust to unbalanced data. It is important to note that if you are doing ANOVA and using sums of squares to calculate F-statistics associated with different model terms, there is more than one way to calculate sum of squares. You may have heard of Type I, Type II, and Type III sum of squares. These differ in the way that the variability is estimated, specifically the order in which the main effect and interaction terms are considered. The different types of sum of squares are equivalent if the dataset is perfectly balanced (and there’s no correlation between predictors), but become different if you have unbalanced data. Therefore, if your dataset is severely unbalanced, you should investigate whether your bottom-line conclusions are sensitive to the type of sum of squares you choose. If they are, it would be worth thinking about it more deeply and making an informed decision. But I personally don’t usually sweat it that much, because I usually find that multiple comparisons between treatment means are more biologically and practically relevant than sum of squares F-tests anyway! Now, with that small digression out of the way, let’s get back to talking about missing data.
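As a brief illustration of the sum-of-squares digression above, here is a base R sketch (on a made-up unbalanced two-way design) showing that Type I (sequential) sums of squares depend on the order in which terms enter the model once the design is unbalanced. For Type II and Type III sums of squares directly, the car package’s Anova() function is the usual tool.

```r
# With unbalanced data, Type I (sequential) SS are order-dependent.
set.seed(1)
dat <- expand.grid(A = factor(1:2), B = factor(1:3), rep = 1:4)
dat$y <- rnorm(nrow(dat)) + as.numeric(dat$A)
dat <- dat[-(1:5), ]   # delete a few rows to unbalance the design

anova(lm(y ~ A + B, data = dat))  # SS for A, unadjusted for B
anova(lm(y ~ B + A, data = dat))  # SS for A, adjusted for B: different value
```

With a perfectly balanced design, the two tables would report identical sums of squares for A.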
Load any packages that we’ll need for the lesson. These include the agridat package for example datasets, some of the tidyverse data manipulation packages (dplyr, tidyr, ggplot2), some standard statistical packages (glmmTMB, brms, emmeans, broom.mixed), the zoo package which is specialized for working with time series data, the VIM and naniar packages which help with visualizing missing data, and the missForest and mice packages for imputation.
library(agridat)
library(glmmTMB)
library(dplyr)
library(ggplot2)
library(tidyr)
library(emmeans)
library(broom.mixed)
library(brms)
library(zoo)
library(VIM)
library(naniar)
library(missForest)
library(mice)
Let’s explore how R handles missing data. Variables in R can belong
to different data types, like integer, double-precision numeric,
character, and factor. Regardless of the data type, missing values are
represented by a special code, NA for “not available.”
x <- c(1, 2, NA, 4, 5)
For many operations, the default behavior is to return
NA if any of the input values are NA,
no matter how many non-missing values there are. This is the case for
many functions in base R that take a vector of values as input and
return a single value:
mean(x)
## [1] NA
sum(x)
## [1] NA
median(x)
## [1] NA
Many such functions have an argument na.rm which is
FALSE by default. We can change that default behavior by
explicitly setting na.rm = TRUE to remove missing values
and compute the summary function on the non-missing values:
mean(x, na.rm = TRUE)
## [1] 3
sum(x, na.rm = TRUE)
## [1] 12
median(x, na.rm = TRUE)
## [1] 3
Many new R users are frustrated at this choice of defaults. Why doesn’t R remove the missing values and calculate the mean or sum, without me having to waste keystrokes explicitly telling it to get rid of those values? I believe that the default behavior is actually sensible. Interpretation changes when missing values are deleted, so this should not be done silently without our express consent.
However, many R model fitting functions do indeed delete missing values without explicit consent. The deletion of missing values can cause unexpected problems. Let’s say we want to plot the observed versus predicted values for a linear mixed model, to see whether the model is a reasonably good fit for the data. This works fine for an example dataset with no missing values:
data(cochran.beets)
fit <- glmmTMB(yield ~ fert + (1|block), data = cochran.beets)
plot(x = cochran.beets$yield, y = predict(fit))
But what happens if we introduce a single missing value?
cochran.beets$yield[3] <- NA
fit2 <- glmmTMB(yield ~ fert + (1|block), data = cochran.beets)
plot(x = cochran.beets$yield, y = predict(fit2))
## Error in xy.coords(x, y, xlabel, ylabel, log): 'x' and 'y' lengths differ
We get an error because the length of x is 42, which
includes an NA somewhere in there, but the length of
y is 41 — the missing value was stripped out before
the model was fit, so the predicted values don’t include it! You would
need to do some gymnastics to line them up.
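One way to do that gymnastics is to subset the original data down to the rows the model actually used, with complete.cases(), and plot those against the predictions. Here is a minimal sketch with lm() on a toy data frame; the same idea applies to a glmmTMB fit.

```r
# Align observed values with predictions after silent NA deletion.
dat <- data.frame(y = c(1.1, 2.0, NA, 3.9, 5.2), x = 1:5)
fit <- lm(y ~ x, data = dat)

used <- complete.cases(dat[, c("y", "x")])   # rows that entered the model
length(dat$y[used]) == length(predict(fit))  # lengths now match: TRUE
plot(x = dat$y[used], y = predict(fit))
```

For lm() specifically, another option is to fit with na.action = na.exclude, which makes predict() return NA-padded predictions that line up with the original rows.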
By the way, you can verify how many observations were used to fit a model and compare it to the number of rows in the input data, to see if any were removed.
nrow(cochran.beets)
## [1] 42
nobs(fit2)
## [1] 41
Before we learn what to do about missing data, let’s learn a few tools for exploring a dataset and getting a better idea of the patterns of missing data in it. Two useful R packages for this are VIM and naniar (originally known as ggmissing). VIM uses syntax and plotting style closer to base R, while naniar depends on ggplot2 and the other “tidyverse” packages.
We will use a small example dataset for the simpler examples here: a few dozen observations of jassid (leafhopper, an insect pest) abundance, with seven weather covariates. The data were collected in the Maharashtra region of India and published on Kaggle by Madhura Bhagat. I have artificially deleted some values from the dataset, completely at random and independently for each covariate, with a different probability of missingness for each variable. The response variable, jassid abundance, has no missing values.
Import the data from a CSV file. You may import from a web URL where
I have hosted the file on GitHub (first option shown below) or, if you
are running this on the SCINet server, from the data folder
(second option shown below).
jassids <- read.csv('https://usda-ree-ars.github.io/SEAStats/dealing_with_missing_data/jassids.csv')
jassids <- read.csv('data/jassids.csv')
Look at the first few rows of the data.
head(jassids)
## Jassids RF temp_max temp_min RH1 RH2 BSS wind_velocity
## 1 0.20 18.87 34.60 24.50 68.82 45.54 6.55 6.48
## 2 0.23 34.96 33.92 24.19 71.23 48.12 5.25 6.17
## 3 0.87 29.51 NA 23.66 73.88 50.89 4.81 NA
## 4 1.63 NA 32.10 23.69 77.48 57.47 4.10 5.73
## 5 2.38 17.78 31.44 NA 77.87 59.18 3.58 5.41
## 6 5.12 29.38 31.06 23.55 77.08 61.15 2.91 5.68
One thing to notice is that the plain old base R function
summary(), which has a method for data frames, includes
information on missing values. If a column contains any NA
values, summary() will tell you how many:
summary(jassids)
## Jassids RF temp_max temp_min
## Min. :0.000 Min. : 0.000 Min. :28.00 Min. :10.18
## 1st Qu.:2.237 1st Qu.: 2.663 1st Qu.:29.25 1st Qu.:18.77
## Median :3.995 Median :21.435 Median :30.00 Median :22.30
## Mean :3.816 Mean :23.418 Mean :30.25 Mean :20.91
## 3rd Qu.:5.287 3rd Qu.:32.800 3rd Qu.:30.89 3rd Qu.:23.67
## Max. :8.000 Max. :90.300 Max. :36.60 Max. :26.20
## NA's :4 NA's :12 NA's :2
## RH1 RH2 BSS wind_velocity
## Min. :63.65 Min. :31.25 Min. : 1.000 Min. :0.550
## 1st Qu.:74.23 1st Qu.:41.71 1st Qu.: 3.800 1st Qu.:1.070
## Median :79.38 Median :54.77 Median : 5.970 Median :2.400
## Mean :78.45 Mean :52.58 Mean : 6.027 Mean :2.741
## 3rd Qu.:84.71 3rd Qu.:63.35 3rd Qu.: 7.522 3rd Qu.:4.300
## Max. :92.80 Max. :71.05 Max. :20.700 Max. :6.550
## NA's :5 NA's :9
At least in my RStudio environment, glimpse(jassids)
from the dplyr package will highlight NA
values in red (not shown here); note that this only shows the first few
rows of each column, so it won’t necessarily flag the presence of
missing values in a column.
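If all you want is a per-column count of missing values, combining colSums() with is.na() does it in one line. Shown here on a tiny made-up data frame; colSums(is.na(jassids)) works the same way on the lesson data.

```r
# Count missing values in each column of a data frame.
toy <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6), c = 7:9)
colSums(is.na(toy))
## a b c
## 1 2 0
```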
Another handy function is complete.cases(). It returns a
logical vector with TRUEs for each row that is “complete,”
and FALSEs for each row that contains at least one missing
value.
complete.cases(jassids)
## [1] TRUE TRUE FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE FALSE
## [13] TRUE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [25] TRUE TRUE TRUE FALSE FALSE TRUE FALSE TRUE TRUE FALSE TRUE FALSE
## [37] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
## [49] TRUE FALSE FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
Pass the output of complete.cases() to
table() to get the number of rows with and without missing
values. Here, 25 of the 58 rows have at least one missing value.
table(complete.cases(jassids))
##
## FALSE TRUE
## 25 33
The function is.na() tests whether each value of a
vector is missing. We can negate it by writing !is.na() to
return TRUE if a value isn’t missing. We can use this to
make tables of missingness for individual data frame columns. For
example, 12 of the 58 rows are missing a value for
temp_max.
table(!is.na(jassids$temp_max))
##
## FALSE TRUE
## 12 46
We can make a two-way table to see how many rows are missing values
from either or both of two columns. This will also work for three or
more columns but the table becomes hard to interpret. Here, we see that
no rows are missing both temp_max and
temp_min, but some are missing one of the two.
with(jassids, table(!is.na(temp_max), !is.na(temp_min)))
##
## FALSE TRUE
## FALSE 0 12
## TRUE 2 44
Here’s a visualization of the missing data from the VIM package. First we use
aggr() to compute the number of missing values in each
variable and in each combination of two variables. Then the resulting
object has a plot() method. The argument
numbers = TRUE is optional and shows the proportion of
missing values in each combination.
plot(aggr(jassids), numbers = TRUE)
The naniar package has
the function vis_miss() which is originally from the
visdat package. It shows missing data row-by-row.
Currently (December 2025) there is a bug in this package that causes
the axis labels to display incorrectly unless an adjustment is made with
theme_sub_axis_x(). This should be fixed in the next
release of the package.
vis_miss(jassids) + theme_sub_axis_x(text = element_text(vjust = 0))
Try adding cluster = TRUE and/or
sort_miss = TRUE to vis_miss() to arrange
missing values row-wise and column-wise (plots not shown here).
vis_miss(jassids, cluster = TRUE)
vis_miss(jassids, sort_miss = TRUE)
From now on, whenever I show a figure, we will use the same style as Van Buuren 2019, which is to plot the observed data with solid blue circles and the imputed values with hollow red circles. I’ve created my own custom functions for visualizing the missing data patterns and will not go into detail about the code. This is a workshop about learning what to do with missing data and we don’t have time to cover the ins and outs of plotting code.
It’s inevitable that you will have missing data at some point. But you still need to analyze the data and answer the questions you care about! How to proceed? Here is a rundown of some different solutions, starting from the quick-and-dirty and proceeding to the most robust (but also the most labor-intensive to implement).
The strategies for dealing with missing data can be grouped into three main categories. We will go through some examples of each.
But before we describe how each strategy works in detail and do hands-on examples, I want to go through potential rationale for choosing a strategy.
The biggest factor determining your missing data strategy is the proportion of data that are missing. If it is relatively small, differences in imputed values will have a negligible effect on your bottom-line conclusions. I would consider something on the order of 5% missing values to be a small proportion. In that case, it may not be worth the effort to develop a complex missing-data model; simpler strategies will work just as well. But be aware that there is no globally valid “rule of thumb” of what rate of missingness you can safely ignore. You should evaluate the consequences of missing data on a case-by-case basis.
The way in which the data are missing is another factor. If the data are missing completely at random, then if you delete observations that contain missing data, you’re just randomly selecting observations from the dataset and removing them. The worst-case scenario there is that you end up with a smaller dataset with reduced statistical power to detect effects. As mentioned above, if the proportion of observations with missing data is small, the reduction in power will be minimal. On the other hand, what if there is a relationship between the probability of an observation having missing data and one of the predictors or outcomes that you care about? In that case you could bias your inference if you ignore the missing data pattern, even if the number of missing values is relatively small.
When confronted with missing data, it can be tempting to simply ignore it. By “ignore,” I mean pretend the missing data don’t exist by deleting observations that contain missing data until you end up with a complete dataset with no missing values. There are a few different ways to do this.
Listwise deletion, a.k.a. complete-case analysis: This is by far the commonest way of dealing with missing data. Just delete any observations, or “cases,” with any missing values. In the typical data format needed for statistical analysis, each row of the data frame is a case.
I usually use the dplyr function
filter() together with complete.cases() to do
listwise deletion on a dataset. Note we can use . to mean
“the dataset that has been passed to this function from a previous
step.” In our jassids example this leaves us with 33
complete rows.
jassids_complete_only <- jassids %>%
filter(complete.cases(.))
Note that listwise deletion is the method used by default in many R and SAS statistical modeling packages. So even if you don’t explicitly delete incomplete cases before passing the data frame to the model, it will be done internally (taking into account only the columns that are needed for that model). As we’ve shown above, do this at your own risk!
This code shows that the number of observations is greater for a linear model that models jassid abundance as a linear function only of maximum temperature, compared to one that uses both maximum temperature and wind velocity. When we add another predictor, the number of complete cases goes down.
model1 <- lm(Jassids ~ temp_max, data = jassids)
model2 <- lm(Jassids ~ temp_max + wind_velocity, data = jassids)
nobs(model1)
## [1] 46
nobs(model2)
## [1] 41
Fortunately, as we’ve already mentioned, many R modeling functions
return an error if missing values are present in the data (such as
clmm()), or at least a warning message if deletion was done
(such as brm()). Be wary of the ones like
lm(), lmer(), and glmmTMB() that
do the deletion by default and bury the note somewhere in the summary
output!
In multivariate analyses, you may need to remove both columns (variables) and rows (cases) from the original dataset to get a complete dataset of acceptable size. For example, let’s say you are doing a multivariate analysis where you are trying to use the relative abundance of 100 microbiome taxa to predict a trait value. A few taxa have missing data for 90% of the rows. If you just try to do complete-case analysis on the full dataset, you will have few or no rows left to work with. In this case, a valid strategy could be to delete columns that are above some unacceptable threshold of proportion missing values, iterating until you get a reasonable number of complete cases.
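Here is a sketch of that column-then-row strategy on made-up data. The 50% cutoff is an arbitrary illustration, not a recommendation; you would choose a threshold appropriate to your own analysis.

```r
# Drop mostly-missing columns first, then take complete cases.
set.seed(1)
wide <- as.data.frame(matrix(rnorm(100 * 10), nrow = 100))
wide$V1[sample(100, 90)] <- NA   # a mostly-missing column
wide$V2[sample(100, 5)] <- NA    # a lightly-missing column

sum(complete.cases(wide))             # few usable rows in the full data

keep <- colMeans(is.na(wide)) <= 0.5  # drop columns >50% missing
trimmed <- wide[, keep]
sum(complete.cases(trimmed))          # many more usable rows remain
```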
Pairwise deletion, a.k.a. available-case analysis: This is a method where instead of using only the complete cases for every calculation, you do separate calculations for each variable or subset of variables, using the cases that are complete for those variables. This is often the method chosen for generating correlation matrices. If you want the correlation between two particular variables, you don’t necessarily need to worry about values being missing for other variables in that dataset.
The function cor() calculates a correlation matrix
containing the correlations between each pair of columns in a matrix. It
has an argument use that allows you to choose whether
you’re doing complete-case analysis/listwise deletion,
use = 'complete.obs' or pairwise deletion,
use = 'pairwise.complete.obs'. The default is
use = 'everything' which will return NA if a
pair of variables contains any missing values.
Below you can see how the output of cor() differs
depending on which deletion strategy we choose.
cor(jassids)
## Jassids RF temp_max temp_min RH1 RH2 BSS wind_velocity
## Jassids 1.0000 NA NA NA 0.193 NA 0.0262 NA
## RF NA 1 NA NA NA NA NA NA
## temp_max NA NA 1 NA NA NA NA NA
## temp_min NA NA NA 1 NA NA NA NA
## RH1 0.1934 NA NA NA 1.000 NA -0.2764 NA
## RH2 NA NA NA NA NA 1 NA NA
## BSS 0.0262 NA NA NA -0.276 NA 1.0000 NA
## wind_velocity NA NA NA NA NA NA NA 1
cor(jassids, use = 'complete.obs')
## Jassids RF temp_max temp_min RH1 RH2 BSS wind_velocity
## Jassids 1.0000 0.0326 -0.318 0.0274 0.360 0.230 -0.105 -0.116
## RF 0.0326 1.0000 0.150 0.5810 0.431 0.658 -0.306 0.130
## temp_max -0.3179 0.1495 1.000 0.5034 -0.302 -0.109 0.160 0.423
## temp_min 0.0274 0.5810 0.503 1.0000 0.309 0.726 -0.345 0.679
## RH1 0.3599 0.4313 -0.302 0.3092 1.000 0.499 -0.161 -0.183
## RH2 0.2300 0.6577 -0.109 0.7257 0.499 1.000 -0.536 0.459
## BSS -0.1050 -0.3063 0.160 -0.3454 -0.161 -0.536 1.000 -0.460
## wind_velocity -0.1157 0.1296 0.423 0.6791 -0.183 0.459 -0.460 1.000
cor(jassids, use = 'pairwise.complete.obs')
## Jassids RF temp_max temp_min RH1 RH2 BSS wind_velocity
## Jassids 1.0000 -0.0409 -0.1544 0.0217 0.1934 0.0617 0.0262 -0.2515
## RF -0.0409 1.0000 0.1636 0.6278 0.4782 0.6936 -0.4050 0.2437
## temp_max -0.1544 0.1636 1.0000 0.4832 -0.1822 -0.0964 0.1326 0.4200
## temp_min 0.0217 0.6278 0.4832 1.0000 0.4795 0.7447 -0.4427 0.6969
## RH1 0.1934 0.4782 -0.1822 0.4795 1.0000 0.5488 -0.2764 0.0219
## RH2 0.0617 0.6936 -0.0964 0.7447 0.5488 1.0000 -0.6564 0.5505
## BSS 0.0262 -0.4050 0.1326 -0.4427 -0.2764 -0.6564 1.0000 -0.5337
## wind_velocity -0.2515 0.2437 0.4200 0.6969 0.0219 0.5505 -0.5337 1.0000
Of course the pairwise deletion method isn’t helpful for many statistical modeling approaches that require all variables to be non-missing, such as multiple regression.
We’ve shown that under certain circumstances, deletion methods like listwise deletion can waste a surprisingly high proportion of your data. Especially if your sample size was small to begin with, it’s often unacceptable to throw away so much data that you invested so much time, money, and resources to generate. More importantly, if your data aren’t missing completely at random, listwise deletion will bias your results!
Let’s start with the simplest class of strategies for imputing missing values. By “impute,” I mean replace a missing data point with some kind of substituted value. When we impute, we harness some other source of available information to make a more or less educated guess about what that missing value truly is (or what it would be, if we could measure it).
A simple form of imputation is to replace all missing values with the
mean or median of that variable, or if it is a categorical variable, the
mode (most common category). There are a lot of ways to do this in R.
The base R function replace() is a pretty straightforward
way to replace missing values with imputed values. It takes three
arguments, the first is a vector in which we want to replace values, the
second is a vector of indexes where we want to do the replacing, and the
third is the value or values we want to replace with. Here, I’ve defined
a function to take the vector x and, at all the locations
where is.na(x) is true, replace with the mean of the
non-missing values of x.
impute_mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))
We can define an almost identical function to impute with the median:
impute_median <- function(x) replace(x, is.na(x), median(x, na.rm = TRUE))
Here, we apply the impute_mean() function to all numeric
columns of the jassids dataframe. Note that I am assigning
the result to a new dataframe, jassids_mean_imputed, rather
than assigning the result back to the original jassids
object. This is good practice because the imputed dataframe does not
have any indication of which values are the actual observations and
which are the imputed values. If we keep the original dataframe in our
working environment, we can refer to it to get that information.
jassids_mean_imputed <- jassids %>%
mutate(across(where(is.numeric), impute_mean))
Also note that impute_mean() is applied to all numeric
columns even if they don’t have any missing values. If a column with no
missing values is passed to the function, it will return that column
unchanged.
How does this look for the temp_max variable? Here are
plots for the distribution of observed (blue) and imputed (red) values
and a scatter plot with temp_max plotted on the y-axis with
temp_min on the x-axis. We see, unsurprisingly, that all
the missing values have been imputed with the same value: the mean.
You’ll see more figures like this later.
A decent first assumption if you have data with space and/or time structure is that two observations that are close together in space or time are more likely to be similar than two observations that are far apart. We can put that to use if we have missing data.
The simplest way to do this is just to copy adjacent observations to
fill in missing values. The tidyverse has a function called
tidyr::fill() but it is more intended for filling in
missing categorical values rather than dealing with time series data.
The zoo package, on the other hand, is specialized for
time series data. zoo has the function
na.locf(), where LOCF stands for “last observation carried
forward.”
For the next couple of examples we will be using a version of the
diggle.cow dataset that you may be familiar with from other
lessons I’ve taught. It’s a time series dataset of body weight
measurements taken on 23 different days at semi-regular intervals from a
total of 26 cows in a two-way factorial experiment. The two treatments
were tuberculosis infection and iron supplementation. I’ve created a
version of the dataset with some values artificially missing completely
at random.
Import the data. As before you may import from the web URL where it’s
hosted or from the data folder if you are running this code
on the SCINet server.
cows <- read.csv('https://usda-ree-ars.github.io/SEAStats/dealing_with_missing_data/diggle.csv')
cows <- read.csv('data/diggle.csv')
Convert animal ID and treatment columns to factor variables.
cows <- cows %>%
mutate(across(c(animal, iron, infect), factor))
Here, if an observation of weight is missing, we take
the observation from the previous time point for that same animal and
fill it in. To make sure that only observations from the same animal are
taken, we first group_by(animal) then
arrange() to sort the data first by animal and
then in increasing order by day. The argument
na.rm = FALSE within na.locf() means that if
there are any remaining missing values at the beginning of any animal’s
time series, they should be retained so that the imputed column has the
same length as the original column.
cows_imputed <- cows %>%
group_by(animal) %>%
arrange(animal, day) %>%
mutate(weight_LOCF_imputed = na.locf(weight, na.rm = FALSE))
I’ve created a new dataframe called cows_imputed which
is a copy of cows with a new column called
weight_LOCF_imputed appended to it so that the missing and
imputed versions of weight can be compared
side-by-side.
There are still missing values because the first day’s value was
missing for some animals. We can go back and apply
na.locf() again to the already-imputed column, but this
time with the argument fromLast = TRUE. This will replace
that missing value with the next day’s value.
cows_imputed <- cows_imputed %>%
mutate(weight_LOCF_imputed = na.locf(weight_LOCF_imputed, na.rm = FALSE, fromLast = TRUE))
Simply copying the previous observation might not be what you are looking for. Another way to impute would be to use a simple linear interpolation, drawing an imaginary line between the non-missing points on either side of the missing one. For example, cow A04 weighed 290 kg on day 380, and 310 kg on day 478. Its weight on day 445 is missing. It might be reasonable to assume that it gained weight at a constant rate between the two dates. We can draw a line between the two points \((x_0, y_0)\) and \((x_1, y_1)\) and find the y-coordinate for a given value of x using this formula:
\[y = \frac{y_0(x_1-x) + y_1(x-x_0)}{x_1-x_0}\]
Here’s a demonstration that this works. The imputed value of ~303 kg is roughly halfway between 290 and 310 because day 445 is roughly halfway between 380 and 478.
interp <- function(x, x0, y0, x1, y1) (y0 * (x1 - x) + y1 * (x - x0)) / (x1 - x0)
interp(x = 445, x0 = 380, y0 = 290, x1 = 478, y1 = 310)
## [1] 303.2653
Luckily, we don’t have to define a linear interpolation function from
scratch every time we want to use it. Linear interpolation is
implemented in base R using the approx() function. We can
fill in the missing values in the cows dataset using this
method. Although the dataframe is already grouped by animal
and sorted by animal and day, I’ve repeated
those lines in the code because you would need to include them if you
are starting with a dataframe that is not grouped and sorted.
The approx() function needs x and
y columns, and the argument xout tells the
function which x values need to get interpolated
y values. In this case xout is the same as
x because we want interpolations for any missing
y values corresponding to x values. The
rule = 2 argument means that if there is an NA
value at the beginning or end, it will replace it with the closest
non-missing value.
cows_imputed <- cows_imputed %>%
group_by(animal) %>%
arrange(animal, day) %>%
mutate(weight_interpolation_imputed = approx(x = day, y = weight, xout = day, rule = 2)$y)
More sophisticated interpolation methods can be used for time series data; we will not get into that here.
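As one taste of those more sophisticated methods: the zoo package (already loaded above) provides cubic spline interpolation alongside its linear counterpart. Sketched on a short toy series:

```r
# Linear vs. spline interpolation of interior gaps in a series.
library(zoo)
y <- c(10, NA, NA, 22, 30, NA, 45)
na.approx(y)   # linear interpolation between neighboring observations
na.spline(y)   # spline interpolation: a smooth curve through the points
```

For our cows example, na.spline() could be dropped into the same mutate() pipeline in place of approx().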
Another old-school type of imputation is to impute using a randomly selected value from another observation. Back in the day this was called “hot deck” imputation because data used to be stored on decks of punch cards. If a replacement value was taken from the dataset that was currently being read in the card reader, it was taken from the “hot deck.” In contrast, “cold deck” imputation means taking a value from a different dataset. The value that is randomly chosen could be taken from any other observation in the dataset, or it could be stratified. For example, if a dataset contains individuals that are either juvenile or adult and either male or female, a missing value for body mass of a juvenile female could be chosen at random from the body mass values of the other juvenile females in the dataset.
Here’s an example of how to use R tidyverse functions to do hot-deck
imputation. First stratify a dataset using group_by(). Then
within a call to mutate() use some logic with the
if_else() function to replace missing values with randomly
chosen values from within the same group.
Group the data by treatment combination and time point using the
columns iron, infect, and day.
Within each of these groups, use the mutate() function to
create an imputed version of the column weight. We use
if_else() for this, which takes three arguments. The first
argument is the condition we are testing; in this case it is
is.na(weight), which will return TRUE if
weight is missing and FALSE if not. The second
argument is what value to assign if the condition is TRUE.
Here, it will be a randomly sampled value from the non-missing values of
weight. The third value is what value to assign if the
condition is FALSE, which in this case is the unchanged
value of weight.
set.seed(1)
cows_imputed <- cows_imputed %>%
group_by(iron, infect, day) %>%
mutate(weight_hotdeck_imputed = if_else(is.na(weight), sample(na.omit(weight), size = length(weight), replace = TRUE), weight))
IMPORTANT: You might have noticed that hot-deck imputation differs from the other forms of imputation we’ve covered so far because it involves randomness. Implementing hot-deck imputation more than once on the same dataset will give you a different result each time. Keep this in mind because it will be important when we start to talk about multiple imputation in a few minutes!
Here’s what the imputations look like. As usual, each time we show the distributions of the observed and imputed values on the left, and a scatterplot on the right. Does anything stand out to you about these plots?
The next stop on our journey from simple quick-and-dirty imputation to sophisticated cutting-edge imputation is imputing using regression. Why not take advantage of the relationships that exist between variables in our dataset to impute missing values? To do regression imputation, we just fit a regression model with the variable we want to impute as the dependent variable and one or more of the other variables in the dataset as predictors. The input data points are those where we have a complete set of values for every variable. Then we simply use the regression model to predict the value of the dependent variable where it’s missing.
We could do this with a simple bivariate regression. Going back to
the jassids dataset, let’s impute the temp_max
variable using the temp_min variable. Our linear model
tells us we can predict temp_max using the equation
25.95 + 0.21 * temp_min. (How could you check how good a
prediction that is?)
temp_linear_fit <- lm(temp_max ~ temp_min, data = jassids)
coefficients(temp_linear_fit)
## (Intercept) temp_min
## 25.9537226 0.2058685
Use the ifelse() function to create a vector of
imputed temperature values, using the predicted temp_max
values for the rows in the dataframe where they were missing originally,
and the observed values otherwise.
pred_temp_max <- predict(temp_linear_fit, newdata = jassids)
temp_linear_imputed <- ifelse(is.na(jassids$temp_max), pred_temp_max, jassids$temp_max)
How does this look? Below, the right-hand plot shows the
temp_max variable plotted against the temp_min
variable.
You can see an issue from the right-hand plot above: it looks like the relationship between maximum and minimum temperature is not really linear, especially as minimum temperature gets above 20 degrees. But the imputed values all fall right on that possibly inappropriate linear trendline. Well, linear models are so 20th century, anyway. Get with the times and do some machine learning instead! Instead of predicting maximum temperature with a linear model using one other predictor, let’s fit a random forest model that predicts maximum temperature — and any other variables that have missing values — using all the other predictors in the dataset. Random forest can capture nonlinear relationships quite well.
Because this isn’t a random forest lesson, I will not go into much
detail on exactly how it works; see my lesson on machine learning
basics if you want those details. What you need to
know for now is that the R package missForest has
a function called missForest(). If you pass
missForest() a dataframe with missing values, you get a
complete dataframe back where the missing values were imputed using a
random forest technique. There are a lot of additional arguments to
tweak the tuning parameters of the random forest model, but I won’t get
into any of those here. The only additional argument I have included is
variablewise = TRUE which will give us a report of the mean
squared error (MSE) of the random forest model column-by-column. These
values are the errors for the non-missing values in the column.
According to the package developer, if these errors are low, we can be
confident that the errors in the imputations of the missing values (the
imputations we actually care about) are also low.
set.seed(2)
jassids_RF_imputed <- missForest(jassids, variablewise = TRUE)
This function returns a list of two objects, ximp and
OOBerror. The first is the complete dataset and the second
is the MSE for each column.
head(jassids_RF_imputed$ximp)
## Jassids RF temp_max temp_min RH1 RH2 BSS wind_velocity
## 1 0.20 18.8700 34.6000 24.5000 68.82 45.54 6.55 6.480
## 2 0.23 34.9600 33.9200 24.1900 71.23 48.12 5.25 6.170
## 3 0.87 29.5100 31.1573 23.6600 73.88 50.89 4.81 4.357
## 4 1.63 24.1195 32.1000 23.6900 77.48 57.47 4.10 5.730
## 5 2.38 17.7800 31.4400 23.2172 77.87 59.18 3.58 5.410
## 6 5.12 29.3800 31.0600 23.5500 77.08 61.15 2.91 5.680
Take the square root of the MSE values and rename them after the
columns of jassids so we can get an idea of the quality of
the imputation. We see there are zeros for the columns that didn’t have
any missing values. The random forest model predicted the non-missing
values of temperature with an “average” error of 1.16 degrees.
Remember, this is the error for the predictions for the points where we do know the values. For the points where we don’t know the values (the missing values), we will never know for sure if the imputation was correct. That’s always the issue with imputation of missing data!
sqrt(jassids_RF_imputed$OOBerror) |> setNames(names(jassids))
## Jassids RF temp_max temp_min RH1
## 0.000000 15.186841 1.155743 1.684493 0.000000
## RH2 BSS wind_velocity
## 4.458125 0.000000 1.071593
By comparison, here is the root mean squared error for the simple linear model using only minimum temperature to predict maximum temperature. At 1.44, it’s a bit higher than the random forest, but still reasonable. The machine learning model only modestly outperforms the naive linear model approach in terms of predicting values we already know.
sqrt(mean(residuals(temp_linear_fit)^2))
## [1] 1.438877
Here are plots similar to those we’ve made for other imputation strategies. Compared to what we’ve seen before, the imputed temperature values are starting to look more reasonable in terms of having plausible-looking noise and being distributed similarly to the non-missing values.
Many datasets are “cleaned” by using ad hoc imputation
methods like substituting the mean or doing linear interpolation on a
time series. Some people even get fancy and do regression-based
imputation, including machine learning models like the random forest
imputation we just saw. You end up with a nice tidy-looking dataset with
no unsightly empty cells or NAs, and you proceed to fit a
statistical model, test hypotheses, and make inferences. So far, so
good, right? Right?
Well, it’s important to remember that any value you did not directly observe and measure is not data; it is the output of a model. Most ad hoc imputation methods generate a single imputed dataset containing modeled values that masquerade as data, hiding any uncertainty associated with the imputation process. This can have serious consequences for your inference later on: it might lead to inappropriate rejections of the null hypothesis (if using frequentist statistics) or inappropriately precise estimates of effect size (in both Bayesian and frequentist frameworks). This is true even for the most sophisticated machine learning prediction model used to impute missing data: pretending that modeled values are actual observations inevitably leads to bias.
I will just quote the words of the great Stef van Buuren here, talking about regression imputation.
“[Regression imputation] artificially strengthens the relations in the data. Correlations are biased upwards. Variability is underestimated. Imputations are too good to be true. Regression imputation is a recipe for false positive and spurious relations.”
Of course, as I’ve said a few times already in this lesson, it is a matter of degree. If the proportion of missing values is small, a quick fix approach is okay. Ignoring uncertainty in the modeled values generated by single imputations, if there are only a few, will not have serious consequences. But if you have lots of missing values and you really need to impute them, the only statistically and scientifically defensible thing to do is to impute using an approach that explicitly accounts for the uncertainty. This is an ideal segue into our next section . . .
For the rest of our lesson, we’ll be talking about the most technically demanding, but also the most interesting, rewarding, and useful, flavor of imputation.
All the imputation methods we’ve discussed so far, with the possible exception of hot-deck imputation, turn a missing value into a single imputed data point without any measure of uncertainty associated with the imputation process for each individual estimate. This is a problem no matter how fancy the statistical model doing the imputing is. How do we make sure our final uncertainty estimates for the effect sizes we care about include the uncertainty resulting from some of the underlying data being missing and imputed?
Multiple imputation is the best way of imputing missing values and correctly accounting for the associated uncertainty. Below, we will discuss and give some examples of different multiple imputation techniques, showing how you would present the results of an analysis based on multiply imputed data.
Multiple imputation traces back to a brilliant insight by statistician Donald Rubin in 1977. Rubin observed that you could capture the uncertainty in the missing data by creating multiple different imputed versions of a dataset, drawing the imputed values from an estimated distribution each time. Then, you fit the desired statistical model to each one of these imputed datasets and somehow combine the estimates from the multiple models, in such a way that the uncertainty in the missing values is reflected in the uncertainty estimates (e.g., confidence intervals) of the final estimates. Adjustments are done to the degrees of freedom because when you impute values, you are “using up” degrees of freedom, with consequences for your test statistics and p-values.
A mathematical derivation of how this all works is beyond me, and anyway it’s out of the scope of this short workshop. If you want to learn more, I highly recommend Stef van Buuren’s book (see Further Reading section below).
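Even without the full derivation, the heart of the pooling step (often called Rubin’s rules) is simple enough to sketch in a few lines of base R. The numbers below are hypothetical estimates and standard errors from m = 5 imputed datasets, not from any of our real analyses:

```r
# Hypothetical slope estimates and standard errors from m = 5 imputed datasets
est <- c(0.52, 0.48, 0.55, 0.50, 0.45)
se  <- c(0.10, 0.11, 0.09, 0.10, 0.12)
m   <- length(est)

qbar <- mean(est)            # pooled point estimate
w    <- mean(se^2)           # average within-imputation variance
b    <- var(est)             # between-imputation variance
tvar <- w + (1 + 1/m) * b    # total variance by Rubin's rules

pooled_se <- sqrt(tvar)
c(estimate = qbar, se = pooled_se)
```

The key move is the (1 + 1/m) * b term: the more the estimates disagree across the imputed datasets, the larger the pooled standard error becomes. That is exactly how the uncertainty due to missingness gets propagated into the final result.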
The mice package does all of
this to a tee. The developers have written tons of methods to allow mice
to be incorporated into your R stats workflow. Multiple imputations
generated with mice can be used not only with base R
model fitting functions like lm() and glm()
but also with mixed model functions like lmer() and
glmmTMB(), and even Bayesian GLMMs with brm().
You can also use emmeans() on a set of imputed datasets
generated with mice, and the increased uncertainty and
adjustments to the degrees of freedom associated with the imputation
will be correctly reflected in the estimates you get there. If you have
more customized analyses, you may have to write some of the code to
perform analyses on each imputed dataset and combine the results
“manually.” But the out-of-the-box code should work for most common
analyses you might do.
The MICE (multiple imputation using chained equations) workflow goes like this:

1. Create m imputed datasets using the function mice(). This produces an object of class mids (multiply imputed dataset).
2. Call with(), which fits the statistical model to each of the m datasets within the mids object. This produces an object of class mira (multiply imputed repeated analyses).
3. Call pool() on the mira object to pool (combine) the estimates from each model into a combined estimate, with adjusted degrees of freedom. This creates an object of class mipo (multiple imputation pooled object). Alternatively, we can pass the mira object to functions like emmeans().

Let’s go through each step with some examples, and talk about some of the different arguments we can use to modify the imputation process to suit our needs.
mice() for multiple imputation

Let’s do multiple imputation on the jassids dataset. For
a first try, let’s not supply any arguments other than the defaults,
except to specify m = 10 to create ten imputed datasets,
and seed to ensure the result is reproducible (remember
that multiple imputation involves random draws from probability
distributions). Simulation studies have shown that m = 10
is adequate for most practical applications, but feel free to try out
different values to see when your final estimates converge on a stable
value.
jassids_mice_imputed <- mice(jassids, m = 10, seed = 27606)
What do we get? Take a look at some components of the
mids object. nmis shows us how many missing
values there were in each column. imp shows us the imputed
values for each of the ten datasets, column by column. Columns that had
no missing values are associated with empty dataframes within
imp, but columns with missing values are associated with
n × m dataframes, where n is the number of missing values.
Lastly, method shows us the imputation method that was used
to impute each column.
jassids_mice_imputed$nmis
## Jassids RF temp_max temp_min RH1
## 0 4 12 2 0
## RH2 BSS wind_velocity
## 5 0 9
jassids_mice_imputed$imp
## $Jassids
## [1] 1 2 3 4 5 6 7 8 9 10
## <0 rows> (or 0-length row.names)
##
## $RF
## 1 2 3 4 5 6 7 8 9 10
## 4 22.06 21.87 23.10 32.5 2.00 42.11 17.78 19.69 21.00 21.00
## 7 21.00 10.60 24.44 47.2 23.70 2.66 10.60 23.70 0.00 22.06
## 18 5.65 5.65 4.80 0.0 0.00 10.60 2.00 24.44 19.69 28.90
## 21 2.67 2.67 2.60 0.0 8.54 0.00 28.90 2.60 2.60 24.44
##
## $temp_max
## 1 2 3 4 5 6 7 8 9 10
## 3 32.10 32.10 31.44 30.99 34.60 32.10 30.99 34.60 30.99 33.92
## 12 29.95 29.99 29.45 30.60 30.27 28.95 30.03 29.75 31.34 31.34
## 14 30.03 29.15 29.35 31.20 28.95 29.22 29.15 29.22 31.34 29.99
## 18 30.70 29.35 28.85 31.20 29.34 30.99 30.54 30.01 31.20 31.34
## 19 30.60 30.95 31.06 30.65 31.06 29.99 31.20 30.01 31.06 30.60
## 22 29.78 31.44 30.03 29.78 29.45 29.98 29.35 30.99 31.82 30.01
## 31 30.99 31.20 32.10 30.05 31.20 31.20 30.99 30.65 32.10 34.60
## 36 29.10 28.25 28.00 28.00 28.00 28.07 29.34 29.39 29.22 29.65
## 41 32.10 31.06 30.95 30.95 32.10 30.03 31.34 31.20 30.05 30.54
## 46 28.85 28.85 30.60 29.95 31.82 30.01 28.85 29.99 31.44 28.85
## 51 30.10 30.10 29.65 30.10 30.60 29.10 28.85 30.99 30.50 29.49
## 53 28.85 31.06 31.20 31.44 28.85 31.34 30.95 30.60 30.60 28.85
##
## $temp_min
## 1 2 3 4 5 6 7 8 9 10
## 5 22.73 23.31 23.69 23.31 25.0 23.31 23.55 23.5 23.55 25.0
## 23 15.89 13.88 13.88 15.50 18.9 16.55 19.60 19.6 15.50 15.2
##
## $RH1
## [1] 1 2 3 4 5 6 7 8 9 10
## <0 rows> (or 0-length row.names)
##
## $RH2
## 1 2 3 4 5 6 7 8 9 10
## 11 51.50 62.00 59.08 66.65 65.85 60.66 59.86 62.00 60.05 60.05
## 18 45.54 53.95 51.95 43.96 36.43 51.85 53.95 51.85 45.54 36.43
## 20 34.80 33.35 34.80 42.00 36.88 34.80 36.45 36.43 37.92 33.35
## 28 36.88 38.81 38.77 36.88 38.81 36.43 33.65 41.71 36.88 36.88
## 51 34.80 38.81 45.54 34.80 43.96 48.12 53.95 33.35 36.45 35.00
##
## $BSS
## [1] 1 2 3 4 5 6 7 8 9 10
## <0 rows> (or 0-length row.names)
##
## $wind_velocity
## 1 2 3 4 5 6 7 8 9 10
## 3 5.19 3.65 5.00 6.48 3.20 3.65 5.19 6.13 3.65 4.35
## 10 3.65 4.85 5.68 5.46 5.19 3.65 5.68 3.65 3.65 5.68
## 16 2.70 2.99 2.40 1.36 2.99 5.46 4.30 2.90 2.70 2.40
## 29 0.95 1.20 0.95 1.20 1.20 1.00 1.07 1.20 1.20 1.00
## 34 5.41 5.00 3.95 6.48 6.55 6.17 6.55 3.20 6.55 5.00
## 36 5.46 3.65 5.46 3.65 5.68 5.19 6.13 5.46 3.65 4.85
## 41 2.01 2.01 2.90 2.50 1.80 2.01 2.99 2.99 2.01 2.90
## 50 0.64 1.15 1.20 1.20 1.00 1.20 1.15 1.20 0.95 1.15
## 51 1.07 1.00 0.85 0.55 1.07 1.35 1.36 1.07 2.54 1.00
jassids_mice_imputed$method
## Jassids RF temp_max temp_min RH1
## "" "pmm" "pmm" "pmm" ""
## RH2 BSS wind_velocity
## "pmm" "" "pmm"
You can see that "pmm", which stands for predictive mean
matching, was chosen as the default method for each of the columns with
missing data. Other defaults are used for categorical variables. The PMM
algorithm works by first fitting a regression model with a ridge
regularization parameter that uses the other columns of the dataset to
predict values for all the data points for a given column, including
both the missing and non-missing ones. Then, for each missing entry, we
find a small set (default 5) of other observations in the dataset that
have similar predicted values to the predicted value of the missing
entry. We take the actual observed values for those five non-missing
entries (not their predicted values) and randomly draw a value from
them. That’s our imputed value!
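To make the algorithm concrete, here is a rough base-R sketch of the core PMM idea on made-up data. This is a simplification, not mice’s actual implementation, which also adds the ridge penalty and perturbs the regression parameters between imputations:

```r
set.seed(1)
# Made-up data: y depends on x, with a few y values missing
x <- runif(30, 0, 10)
y <- 2 + 3 * x + rnorm(30)
y[c(4, 11, 25)] <- NA

obs  <- !is.na(y)
fit  <- lm(y ~ x, subset = obs)                     # fit on complete cases
pred <- predict(fit, newdata = data.frame(x = x))   # predict for every row

y_imputed <- y
for (i in which(!obs)) {
  # Find the 5 observed points whose predicted values are closest
  # to the predicted value of the missing entry
  donors <- order(abs(pred[obs] - pred[i]))[1:5]
  # Impute with an actual observed value drawn at random from those donors
  y_imputed[i] <- sample(y[obs][donors], 1)
}
```

Because the imputed value is a real observed value drawn from the donor pool, PMM never invents impossible values (say, a negative count), which is one reason it makes a good default.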
If you were able to follow all that, you will have noticed that PMM
combines aspects of regression imputation (in particular it’s a bit
machine-learning in flavor because of the regularization parameter) and
hot-deck imputation (because it draws values from the dataset currently
being analyzed). You will also have noticed that there are various
parameters that could be tweaked to get slightly different results. Of
course, just like with any analysis you do, you should be aware of any
sensitivity your final results have to design decisions you make along
the way, such as the choice of methods and parameters for imputation.
The defaults of mice() were selected because they are
sensible and have been shown to work well in many cases, but you
definitely should not accept them as gospel without exploring
sensitivity.
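One easy sensitivity check along those lines is to vary the number of PMM donor candidates (the donors argument, default 5, which mice() passes down to the pmm method) and see whether the pooled estimates change appreciably. A sketch, shown on mice’s bundled nhanes example data rather than our dataset:

```r
library(mice)

# Same data and seed, but 10 donor candidates for PMM instead of the default 5
imp_d5  <- mice(nhanes, m = 10, seed = 1, printFlag = FALSE)
imp_d10 <- mice(nhanes, m = 10, seed = 1, donors = 10, printFlag = FALSE)

# Compare pooled estimates from the same model fit to each set of imputations
summary(pool(with(imp_d5,  lm(bmi ~ age + chl))))
summary(pool(with(imp_d10, lm(bmi ~ age + chl))))
```

If the two summaries tell meaningfully different stories, that is a sign your results are sensitive to the imputation settings and deserve closer scrutiny.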
Here’s a plot similar to the ones we made earlier showing the
relationship between the temp_max and temp_min
values, both observed and imputed. Now the imputed values have error
bars on them to show the distance between the highest and lowest values
from the ten imputations (showing only the temp_max
imputations in this case). The hollow red points are the means of the
ten imputed values. They are not “the” imputed value because there is no
single imputed value; they are only plotted to show a rough idea of what
MICE is doing.
If you look at the help documentation for ?mice you will
see that it supports many other methods, including random forest
imputation, and all kinds of other regression-based methods. But the key
is that most of them incorporate randomness to some extent, so that you
will get multiple realizations of the imputation that can be passed
along to the next steps in the process, propagating the uncertainty from
the imputation forward to the final result.
Actually, all the simpler methods we demonstrated earlier, like single imputation using the mean, or pure hot-deck imputation, are implemented within
mice(). Thus, you could use mice() as a one-stop shop for all your imputation use cases, setting m = 1 for the methods that don’t incorporate randomness. I demonstrated how to do those forms of imputation with base R and tidyverse functions so that you could understand better how they work, before introducing you to the increased power and flexibility of mice().
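As an illustration of that one-stop-shop idea, here is what single mean imputation looks like through mice() — shown on the nhanes example dataset that ships with the mice package rather than our data, and with maxit = 1 since mean imputation needs no iteration:

```r
library(mice)

# Single (m = 1) mean imputation through the mice() interface.
# nhanes is a small, all-numeric example dataset bundled with mice.
imp_mean <- mice(nhanes, method = 'mean', m = 1, maxit = 1,
                 seed = 1, printFlag = FALSE)
nhanes_mean_imputed <- complete(imp_mean)
anyNA(nhanes_mean_imputed)  # all missing values have been filled in
```

Everything downstream (with(), pool()) works the same way, which makes it painless to swap imputation methods when checking sensitivity.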
Let’s try a different imputation method. If we provide
method = 'rf' we will get random forest imputation; however
this isn’t just a single imputation like you get from
missForest(). It uses a special tweak on the random forest
algorithm to generate multiple imputed datasets.
jassids_mice_RF_imputed <- mice(jassids, m = 10, seed = 27606, method = 'rf')
Again, you should explore the arguments of mice() to see
what default tuning parameters are used for this method and how you
might modify them.
Here’s a plot of the imputation results for the random forest algorithm. It looks pretty similar to the PMM results.
Now we can fit a model. For example, let’s say we want to predict jassid abundance from all the other variables in the dataset (setting aside that it’s not a great model). First we’ll use the datasets imputed with the PMM method.
jassids_multiple_lms <- with(jassids_mice_imputed, lm(Jassids ~ RF + temp_max + temp_min + RH1 + RH2 + BSS + wind_velocity))
jassids_multiple_lm_coeffs <- pool(jassids_multiple_lms)
summary(jassids_multiple_lm_coeffs)
## term estimate std.error statistic df p.value
## 1 (Intercept) 3.22001541 13.94901920 0.2308417 28.91597 0.819063322
## 2 RF -0.04343613 0.02306303 -1.8833660 35.22209 0.067929015
## 3 temp_max -0.15770185 0.42503637 -0.3710314 28.13832 0.713393053
## 4 temp_min 0.39032166 0.24511860 1.5923788 37.11455 0.119782024
## 5 RH1 -0.01700536 0.06538380 -0.2600852 44.47061 0.795998535
## 6 RH2 0.05202114 0.07733337 0.6726868 31.14328 0.506107501
## 7 BSS -0.09162332 0.13436862 -0.6818803 44.60785 0.498839966
## 8 wind_velocity -0.92724852 0.30788293 -3.0116918 41.23133 0.004423532
Let’s look at whether we get different results if we use the datasets imputed with the random forest algorithm instead of the predictive mean matching algorithm.
jassids_multiple_lms_RF <- with(jassids_mice_RF_imputed, lm(Jassids ~ RF + temp_max + temp_min + RH1 + RH2 + BSS + wind_velocity))
jassids_multiple_lm_RF_coeffs <- pool(jassids_multiple_lms_RF)
summary(jassids_multiple_lm_RF_coeffs)
## term estimate std.error statistic df p.value
## 1 (Intercept) -0.52099982 13.19998453 -0.03946973 24.60334 0.96883443
## 2 RF -0.03898685 0.02300277 -1.69487618 34.83599 0.09902162
## 3 temp_max -0.05665819 0.40569700 -0.13965641 23.33124 0.89013006
## 4 temp_min 0.25100063 0.25320340 0.99130041 24.57854 0.33119833
## 5 RH1 0.01044568 0.06824466 0.15306217 34.72891 0.87923514
## 6 RH2 0.05725752 0.07813806 0.73277373 22.96654 0.47110981
## 7 BSS -0.03662229 0.14068826 -0.26030804 39.34083 0.79598382
## 8 wind_velocity -0.68459260 0.34112942 -2.00684129 21.35661 0.05759101
Here’s a plot of the coefficients and their approximate 95%
confidence intervals for the two methods. (Most of the code here is just
for putting together the estimates from the different methods and
drawing the plot.) For comparison, I’ve included the results from
complete-case analysis, where we are forced to use only the rows in the
jassids dataframe that have no missing values. Looks like
our estimates are not too sensitive to the method of imputation used.
However, we might draw somewhat different conclusions about the effects
of some of the variables on jassid abundance from our imputed analysis
compared to the complete-case analysis. Why do you think that might
be?
See the exercises below for more comparing of different multiple imputation methods, or feel free to try different ones out for yourself.
Let’s work through another example of the MICE workflow using the cow weight dataset we saw earlier today. Note that I am using a slightly inappropriate statistical model just for demonstration purposes.
For this imputation, we will be using the "2l.pan"
method implemented in the mice package. This method is
suitable for imputing data with a multilevel structure. This would be
anything you’d analyze with a mixed model, such as data that come from
an experiment with a randomized block design, or data with repeated
measures on the same individuals over time. The cow dataset has repeated
measures of weight taken on the same cows over multiple days.
This method takes some bookkeeping code to set up. We have to specify
that we only want to impute the weight column by creating a
vector of imputation method names that has an empty string
'' for all dataframe columns other than
weight, and the value '2l.pan' for
weight. (weight is the fourth of five columns
in the input dataframe.) Next we have to create a predictor matrix to
specify which effects are fixed and which are random (grouping
variables) when predicting weight. The weight
row of this matrix needs to have -2 for the grouping
variable and 1 for the fixed effects, with 0
in all other rows because we are not imputing the other variables in the
dataframe.
imp_methods <- c('', '', '', '2l.pan', '')
pred_mat <- rbind(
  animal = 0,
  iron = 0,
  infect = 0,
  weight = c(animal = -2, iron = 1, infect = 1, weight = 0, day = 1),
  day = 0
)
Next, do the multiple imputation on the cows dataframe.
Note that we need to convert the grouping variable, animal,
to an integer. This is required by the imputation method.
cows_multilevel_imputed <- cows %>%
mutate(animal = as.integer(animal)) %>%
mice(m = 10, method = imp_methods, predictorMatrix = pred_mat, seed = 142)
Here’s a plot showing how the imputation did, showing a few selected cows’ data on separate panels. Again there is an error bar on each imputed value showing the distance between the highest and lowest imputed values, and the hollow red points being the means of the imputations of each missing value.
Use with() to fit a mixed model to each of the imputed
datasets, with a random intercept for animal and fixed effects for the
two treatments, time in days, and their interaction.
cows_fit <- with(cows_multilevel_imputed, glmmTMB(weight ~ iron * infect * day + (1 | animal)))
Use pool() to get pooled estimates of the model
coefficients.
pool(cows_fit) |> summary()
## term estimate std.error statistic df
## 1 (Intercept) 95.32099880 6.727945471 14.1679208 571.3392
## 2 ironNoIron -2.28425203 9.771052539 -0.2337775 572.7028
## 3 infectNonInfected 3.83621449 13.998305763 0.2740485 572.8479
## 4 day 0.31094655 0.005315728 58.4955782 251.0726
## 5 ironNoIron:infectNonInfected 8.62215014 18.878210860 0.4567250 582.4862
## 6 ironNoIron:day 0.04946753 0.007564675 6.5392801 338.5060
## 7 infectNonInfected:day 0.06299665 0.010817034 5.8238378 348.1347
## 8 ironNoIron:infectNonInfected:day -0.01053632 0.014238521 -0.7399870 506.6398
## p.value
## 1 2.876813e-39
## 2 8.152413e-01
## 3 7.841462e-01
## 4 2.767789e-148
## 5 6.480390e-01
## 6 2.280503e-10
## 7 1.308385e-08
## 8 4.596507e-01
Usually in a designed experiment like this one, the model
coefficients by themselves are hard to interpret. We are often more
interested in estimated marginal means or trends. In this case we can
use emmeans() to get the pooled estimates for each
treatment’s mean weight on a specific day (day 750 in this example), or
emtrends() to get the pooled estimates for each treatment’s
marginal time trend (weight gain rate in kg/day). You can proceed to do
hypothesis tests such as pairwise mean comparisons as you normally
would; the uncertainty associated with the imputations is
incorporated.
emmeans(cows_fit, ~ iron + infect, at = list(day = 750))
## iron infect emmean SE df lower.CL upper.CL
## Iron Infected 329 6.56 33260 316 341
## NoIron Infected 363 6.89 137860 350 377
## Iron NonInfected 380 12.00 41978 356 403
## NoIron NonInfected 415 10.40 57381 395 435
##
## Confidence level used: 0.95
emtrends(cows_fit, ~ iron + infect, var = 'day')
## iron infect day.trend SE df lower.CL upper.CL
## Iron Infected 0.311 0.00532 497 0.301 0.321
## NoIron Infected 0.360 0.00542 1576 0.350 0.371
## Iron NonInfected 0.374 0.00930 2749 0.356 0.392
## NoIron NonInfected 0.413 0.00857 320 0.396 0.430
##
## Confidence level used: 0.95
The procedure to fit a Bayesian GLMM with the brms
package using multiply imputed data is very similar to the above. We use
the function brm_multiple() which is designed to work with
objects of class mids. Here, if we specify
chains = 1, a single Markov chain will be sampled for each
of the ten datasets (in practice this should be increased). The results
are pooled simply by concatenating the samples from the different
chains, just the same way it is done if you run multiple chains with a
single dataset. Pooling and propagating uncertainty from imputation is
much more straightforward in a Bayesian context. Because we’re using
uninformative prior distributions on the treatment and time effects, and
because uncertainty due to the imputation is being propagated in exactly
the same way for the Bayesian and non-Bayesian model fits, we get
virtually identical results to the frequentist statistical approach with
glmmTMB().
I’m glossing over a lot of details about Bayesian statistics here, and skipping the usual convergence and model fit diagnostics I would do. This is just a quick demonstration that multiple imputation works well if you’re fitting Bayesian mixed models. Check out my Bayesian lessons for more details on the Bayes approach.
cows_bayesfit <- brm_multiple(
weight ~ iron * infect * day + (1 | animal),
data = cows_multilevel_imputed,
prior = c(
prior(normal(0, 10), class = b)
),
cores = 4, chains = 1, iter = 2000, warmup = 1000, seed = 142,
file = 'data/cows_bayesfit', file_refit = 'never'
)
emmeans(cows_bayesfit, ~ iron + infect, at = list(day = 750))
## iron infect emmean lower.HPD upper.HPD
## Iron Infected 329 316 341
## NoIron Infected 365 352 378
## Iron NonInfected 379 359 397
## NoIron NonInfected 412 392 430
##
## Point estimate displayed: median
## HPD interval probability: 0.95
emtrends(cows_bayesfit, ~ iron + infect, var = 'day')
## iron infect day.trend lower.HPD upper.HPD
## Iron Infected 0.311 0.301 0.321
## NoIron Infected 0.360 0.350 0.370
## Iron NonInfected 0.374 0.356 0.391
## NoIron NonInfected 0.414 0.397 0.430
##
## Point estimate displayed: median
## HPD interval probability: 0.95
I just told you that multiple imputation is the best kind of
imputation, but I neglected to mention there is also a fully Bayesian
approach to imputation. The MICE approach is already basically Bayesian
in spirit, because of the way it propagates uncertainty from the
imputation step through to your final estimates, and the way it samples
repeatedly from a “posterior” distribution to impute each missing value.
We’ve shown above how we can use brm_multiple() to
incorporate MICE imputation into our Bayesian modeling workflow. It’s a
two-step process: first do the imputation, then fit the same Bayesian
model to each imputed dataset. Well, why not just go “full Bayesian” and
explicitly include the imputation process within a single statistical
model? This is an elegant and statistically sound way of imputing
missing data. The Bayesian approach to imputation treats missing values
in your dataset as more parameters for the model to estimate, in
addition to the slopes, intercepts, variances, etc. that you might be
more familiar with. In the Bayesian world, all parameters are treated as
being uncertain, and are modeled with probability distributions that
represent the state of our (imperfect and uncertain) knowledge about
them.
As described in the brms missing data vignette, this kind of Bayesian imputation has the great advantage that the model only needs to be fit once instead of m times, once for each imputed dataset. Bayesian models can be computationally intensive to fit, so this may be a big deal. But we lose a little of the “automatic” flavor of mice imputation. By default, mice uses all other variables to predict the missing ones. But when doing imputation and model fitting in the same step, we have to explicitly write out which variables are being used to impute the others.
It’s outside the scope of this workshop to include a worked example of this kind of Bayesian imputation. For now, I will refer you to the brms missing data vignette. I would be happy to work with anyone interested in Bayesian imputation of their missing data — feel free to contact me!
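As a taste of what writing out the imputation explicitly looks like, here is a model-specification sketch following the pattern in the brms missing data vignette. The choice of predictors is arbitrary and purely illustrative, and the model is not fit here (the brm() call is commented out):

```r
library(brms)

# Sketch only: one joint model that predicts jassid abundance AND
# imputes missing temp_max values as parameters.
# `temp_max | mi()` flags a variable whose missing values are to be
# estimated; mi(temp_max) on the right-hand side uses those estimates.
bform <- bf(Jassids ~ mi(temp_max) + temp_min + wind_velocity) +
  bf(temp_max | mi() ~ temp_min + RH2) +
  set_rescor(FALSE)

# fit <- brm(bform, data = jassids)  # imputation + inference in one step
```

Note how we had to state that temp_min and RH2 are the predictors used to impute temp_max; mice() would have made that choice for us by default.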
In addition to the great packages introduced in this lesson, you may also want to check these out:
Here are some handbooks and tutorials about missing data and imputation that you can use to expand on what you’ve learned here:
For the exercises, we will use another example dataset that contains
missing values. It is the mammalsleep dataset from the
mice package, originally published by Allison and
Cicchetti in 1976. The researchers wanted to explore whether total hours
of sleep per day (ts in the dataset) at the species level
could be predicted by other species traits such as body weight
(bw), brain weight (brw), and gestation time
(gt). Type ?mammalsleep in your console to get
more information about the dataset.
Here is how you load the dataset, which comes bundled with the
mice package, into your R environment. Exclude the
species name column from the dataset to simplify how mice()
will handle the imputation later.
data(mammalsleep, package = 'mice')
mammalsleep <- mammalsleep %>% select(-species)
Use summary statistics and visualizations to explore the missing values in the dataset. Use the VIM or naniar package, or make your own visualizations.
Fit a linear model using only rows that have no missing values for
ts, bw, brw, or gt.
Model hours slept per day as a function of the logarithm of body weight,
the logarithm of brain weight, and the logarithm of gestation time. Your
model formula will be ts ~ log(bw) + log(brw) + log(gt).
Output the model coefficients and their standard errors.
Impute the missing values using the mean of each column and fit the same linear model as above to the imputed dataset. Again, output the model coefficients and their standard errors.
Impute the missing values using the PMM algorithm with
mice(), setting m = 10. Fit the same linear
model as above to each of the imputed datasets. Output pooled estimates
of the model coefficients and their standard errors.
Are the conclusions different depending on which of the three methods you used? If so, why?