Dealing with Missing Data

What is this course?

This is a lesson about missing data: what it is, what problems it can cause, how to recognize and describe it, and what to do about it.

At the end of this course, you will understand …

  • What the different kinds of missing data are
  • The consequences of ignoring missing data
  • How missing data increases uncertainty in your conclusions
  • How to decide the best way to address missing data

At the end of this course, you will be able to …

  • Make graphs and tables to visualize missing data
  • Impute missing data with a variety of methods
  • Use multiple imputation to incorporate uncertainty due to missing data in your statistical estimates

What is missing data and why is it a problem?

Where there is data, there is missing data.

Getting to know missing data

  • Data might be missing for many different reasons
  • Causes of missingness are different in different scientific fields
  • What are some reasons you can think of for why a data point might be missing?

Reasons for missingness

  • Instruments fail
  • Datum not measured
    • Example: daily time series data with missing values on weekends and holidays
  • Attrition
    • Examples: a patient drops out of a study; or a plant dies before its final biomass can be recorded
  • Loss of data records

Image (c) USDA-ARS

Reasons for missingness, continued

  • Datum is possible to record for some subjects but not others
    • Example: we are measuring length of longest petal, but some plants do not have any flowers
  • Data value above or below the limit of detection
    • Examples: concentration of an element is too low to be measured; or a thermometer cannot measure temperatures above an upper limit
  • Data value doesn’t fit a statistical distribution
    • Example: zero values that become undefined if we log-transform the data

Image (c) Jerry Mason

Categories of missing data

  • Missing completely at random (MCAR)
  • Missing at random (MAR)
  • Missing not at random (MNAR)
  • This categorization was developed by Donald Rubin in 1976 and is based on the process that determines the probability of a data point being missing

Missing completely at random (MCAR)

  • The probability of being missing is equal for all data points in a dataset
  • If missing values are MCAR, we don’t have to account for any biases resulting from missing data

Missing at random (MAR)

  • The probability of being missing varies according to a group or variable that we have in our observed data
  • For example, our instrument is more likely to fail on a rainy day, but we have data on rainfall by day
  • Not as easy to deal with as MCAR, but we can use information we have to model the missing data process
  • Most standard methods for dealing with missing data assume MAR

Missing not at random (MNAR)

  • Probability of being missing depends on information we have not observed, potentially including the missing value itself
  • Most difficult situation to deal with
  • Example: in a survey with a follow-up, people who do not have a strong opinion are less likely to follow up
  • Example: plant genotypes that are less resistant to disease are more likely to die before the end of the study, so their final biomass value is more likely to be missing

Censoring

Image (c) Myresa Hurst
  • Censoring is a type of MNAR missing data pattern
  • A censored data point is missing if it’s above or below a threshold
  • Example: a concentration below the limit of detection of the instrument or assay
  • Censoring also occurs if the subject dies or is removed from the study
  • Often seen in survival studies where subjects are only tracked for a finite period of time
  • Any censoring will bias results if not properly accounted for

Side note: missing replicates in designed experiments

  • Sometimes measurements are missing in a designed experiment, for example if a plot in an RCBD is destroyed or skipped
  • Although losing data isn’t ideal, this is not a major concern from an analysis standpoint
  • Modern methods for ANOVA do not require perfectly balanced data
  • If data are unbalanced and you want to do F-tests, be aware of which type of sums of squares you’re using

How does R deal with missing data?

Load packages

library(agridat)      # example agricultural datasets
library(glmmTMB)      # frequentist mixed models
library(dplyr)        # data manipulation
library(ggplot2)      # plotting
library(tidyr)        # data reshaping
library(emmeans)      # estimated marginal means
library(broom.mixed)  # tidying mixed model output
library(brms)         # Bayesian mixed models
library(zoo)          # time series utilities, including na.locf()
library(VIM)          # missing data visualization, base R style
library(naniar)       # missing data visualization, ggplot2 style
library(missForest)   # random forest imputation
library(mice)         # multiple imputation

Set brms backend for Bayesian analysis

  • Set cmdstanr as the Bayesian model fitting engine if you’re on the SCINet server or have CmdStanR installed on your local machine
  • Otherwise skip this line of code
options(brms.backend = 'cmdstanr')

R’s missing value indicator

  • Missing values in R are represented by a special code: NA for “not available”
  • NA is used across data types (numeric, character, factor, logical)
x <- c(1, 2, NA, 4, 5)

NA in summary statistics

  • Summary statistic functions like mean() take a vector of values as input and return a single value
  • The default behavior is usually to return NA if any of the input values are NA, no matter how many non-missing values there are
mean(x)
sum(x)
median(x)

na.rm argument

  • Many summary statistic functions have an argument na.rm
  • Usually na.rm = FALSE by default
  • Explicitly setting na.rm = TRUE removes missing values and computes summary function on non-missing values
mean(x, na.rm = TRUE)
sum(x, na.rm = TRUE)
median(x, na.rm = TRUE)

Why is na.rm = FALSE the default?

  • Why doesn’t R remove the missing values and calculate the mean or sum without me having to tell it?
  • Interpretation changes when missing values are deleted, so this should not be done silently without our express consent
  • R model fitting functions have different defaults for dealing with missing values
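  • A minimal sketch of the na.action argument that controls this behavior (toy data frame, not one of our datasets): the usual default, na.omit, silently drops incomplete rows, while na.fail stops with an error
d <- data.frame(x = 1:5, y = c(2, 4, NA, 8, 10))
lm(y ~ x, data = d)                            # row 3 silently dropped (na.omit default)
try(lm(y ~ x, data = d, na.action = na.fail))  # error instead of silent deletion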

Issues caused by deleting missing values

  • Unlike mean(), which at least returns NA to flag the problem, model fitting functions like glmmTMB() silently drop rows with missing values
  • Compare results before and after introducing a missing value
data(cochran.beets)
fit <- glmmTMB(yield ~ fert + (1|block), data = cochran.beets)
plot(x = cochran.beets$yield, y = predict(fit))

cochran.beets$yield[3] <- NA
fit2 <- glmmTMB(yield ~ fert + (1|block), data = cochran.beets)
# predict(fit2) returns fitted values only for the rows actually used, so drop the NA from the observed values to match
plot(x = na.omit(cochran.beets$yield), y = predict(fit2))

Verifying how many observations were used to fit a model

nrow(cochran.beets)
nobs(fit2)

Visualizing missing data

Packages for missing data

  • VIM: package with syntax and plotting style similar to base R
  • naniar: originally known as ggmissing; depends on ggplot2 and the other “tidyverse” packages

The data

Image (c) University of Georgia Cooperative Extension
  • A few dozen observations of jassid (leafhopper) abundance, with seven weather covariates
  • Collected in the Maharashtra region of India, published on Kaggle by Madhura Bhagat
  • I artificially deleted some values of the covariates from the dataset
  • Deletion was completely at random and independent for each covariate

Import the data

  • Import from CSV to dataframe, from web URL if on your own machine
jassids <- read.csv('https://usda-ree-ars.github.io/SEAStats/dealing_with_missing_data/jassids.csv')
  • Import from data folder if on SCINet
jassids <- read.csv('data/jassids.csv')
  • Look at the first few rows of the data
head(jassids)

R functions for summarizing missing data

  • Base R function summary() has a method for data frames
  • Tells you number of NA values per column, if there are any
summary(jassids)
  • glimpse() from dplyr highlights NA values in the first few rows
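glimpse(jassids)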

R functions for summarizing missing data

  • complete.cases() returns logical vector with TRUE for complete rows, FALSE for rows with any missing values
  • Pass output to table() to get number of rows with and without missing values
complete.cases(jassids)
table(complete.cases(jassids))

R functions for summarizing missing data

  • is.na() tests whether each value of a vector is missing
  • Negation !is.na() returns TRUE for non-missing values
  • Use this to make tables of missingness
table(!is.na(jassids$temp_max))
  • Two-way tables or more are possible too
with(jassids, table(!is.na(temp_max), !is.na(temp_min)))

VIM missing-data visualization

  • aggr() computes the number of missing values in each variable and in each combination of variables
  • Resulting object has a plot() method
  • numbers = TRUE optionally shows proportion of missing values in each combination
plot(aggr(jassids), numbers = TRUE)

naniar missing data visualization

  • vis_miss(), imported from the visdat package, shows missing data row-by-row.
vis_miss(jassids) + theme_sub_axis_x(text = element_text(vjust = 0))
  • cluster = TRUE clusters the rows by missingness pattern
  • sort_miss = TRUE sorts the columns by proportion of missing values
vis_miss(jassids, cluster = TRUE)
vis_miss(jassids, sort_miss = TRUE)

What to do about missing data

Three groups of strategies for dealing with missing data

  • Delete the observations that contain missing data and analyze only the complete observations
  • Impute missing values with a single imputation, treating the imputed value the same as the other data values
  • Impute missing values with multiple imputation, accounting for the uncertainty in the imputation in your final results

Choosing a strategy: Proportion of missing data

  • If small, imputation may have a negligible effect on your bottom-line conclusions
  • More complex techniques are more important for higher missing proportions
  • No one-size-fits-all rule for what proportion needs to be missing for it to “matter”
  • If the missing percentage is too high, no imputation method may save you
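  • naniar has helper functions to quantify this, for example:
prop_miss(jassids)         # overall proportion of missing cells
miss_var_summary(jassids)  # per-variable counts and percentages of missing values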

Choosing a strategy: Type of missingness

  • If MCAR, deleting observations with missing data = deleting observations at random
    • Power is lost by reduced sample size but otherwise no major issue
  • If not MCAR, deleting observations with missing data = deleting observations non-randomly
    • Bias is introduced unless imputation is done properly
    • Severity depends on number of missing values

Missing data strategy 1: Delete it

  • Delete observations that contain missing data until you end up with a complete dataset
    • Listwise deletion, or complete-case analysis
    • Pairwise deletion, or available-case analysis

Listwise deletion, a.k.a. complete-case analysis

  • By far the most common way of dealing with missing data
  • Just delete any observations, or “cases,” with any missing values
  • In the typical stats-ready data format (so-called tidy data), each row of the data frame is a case
  • Many ways to do this, here is one using dplyr::filter() and base R complete.cases()
jassids_complete_only <- jassids %>%
  filter(complete.cases(.))
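  • Equivalent one-liners: base R na.omit(jassids) or tidyr::drop_na(jassids)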

Listwise deletion in statistical modeling packages

  • Listwise deletion is default in many R and SAS statistical modeling packages
  • Even if you don’t explicitly delete incomplete cases, it will be done by the model fitting function
  • Only columns used in the model will be considered when determining complete cases
model1 <- lm(Jassids ~ temp_max, data = jassids)
model2 <- lm(Jassids ~ temp_max + wind_velocity, data = jassids)

nobs(model1)
nobs(model2)

Warnings and errors for incomplete cases

  • Many R modeling functions return an error if missing values are present: clmm()
  • Some return a warning: brm()
  • Some only note the deletion deep in the summary output: lm(), lmer(), and glmmTMB() — beware!

Listwise deletion in multivariate analysis

  • You may need to remove both columns (variables) and rows (cases) to get an acceptable complete dataset
    • Example: you have 50 microbiome samples and a total of 100 microbiome taxa in the dataset
    • But a few taxa have missing data for 90% of the rows
    • Complete-case analysis on the full dataset will leave you with few or no rows to work with
    • You could delete columns above some unacceptable threshold of proportion missing values, iterating until you get a reasonable number of complete cases

Pairwise deletion, a.k.a. available-case analysis

  • Instead of using only the complete cases for every calculation, do separate calculations for each variable or subset of variables
  • For each calculation, use the cases that are complete for the variables involved
  • Often used for generating correlation matrices
  • Can’t be used for things like multiple regression which require all variables to be non-missing

Pairwise deletion in cor()

  • default use = 'everything' returns NA if a pair of variables has any missing values
  • use = 'complete.obs' does listwise deletion
  • use = 'pairwise.complete.obs' does pairwise deletion
cor(jassids)
cor(jassids, use = 'complete.obs')
cor(jassids, use = 'pairwise.complete.obs')

Missing data strategy 2: Single imputation

Imputing missing data

  • Listwise deletion can waste a large proportion of your data
  • Listwise deletion biases your results if data are not MCAR
  • Instead impute, or replace missing data points with substitute values
  • Use information we have to make educated guesses about what the missing value truly is, or would be if we could measure it

Imputation strategies

  • Impute using mean or mode
  • Impute using adjacent observations in space and/or time
  • Impute using a randomly selected value from another observation
  • Impute using a regression model

Imputing using the mean or mode

  • Replace all missing values with the mean or median (or mode for categorical data) of that variable
  • Can be done in base R with replace()
    • First argument is the vector with some missing values (x)
    • Second argument is a vector of indexes where we want to do the replacing (locations where is.na(x) is TRUE)
    • Third argument is the values we want to replace at those indexes (mean of non-NA elements of x)
impute_mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))
impute_median <- function(x) replace(x, is.na(x), median(x, na.rm = TRUE))
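  • Base R has no built-in mode function; a minimal sketch using the same replace() pattern (the helper name impute_mode is ours):
impute_mode <- function(x) {
  ux <- unique(na.omit(x))                         # distinct non-missing values
  mode_x <- ux[which.max(tabulate(match(x, ux)))]  # most frequent value
  replace(x, is.na(x), mode_x)
}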

Do mean imputation on jassids dataset

  • mutate(across(where(is.numeric), f)) applies function f to all numeric columns of a dataframe, even those without missing values
  • Assign the result to a new dataframe (better than overwriting original dataframe)
jassids_mean_imputed <- jassids %>%
  mutate(across(where(is.numeric), impute_mean))

Visualization of mean imputation

Imputing using adjacent observations in space and/or time

  • Two observations close together in space or time are more likely to be similar than two distant ones
  • Simplest of these kinds of imputation methods: copy adjacent observations to fill missing values
  • tidyr::fill() (see the quick illustration below)
  • zoo::na.locf() (Last Observation Carried Forward), specialized for time series data
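  • A toy illustration of tidyr::fill() (made-up data frame, not one of our datasets):
toy <- data.frame(day = 1:5, value = c(10, NA, NA, 13, NA))
fill(toy, value, .direction = 'down')  # each NA replaced by the last non-missing value above it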

Example dataset: cow weights over time

Image (c) Te Ara, the Encyclopedia of New Zealand
  • Time series of body weights taken at 23 time points from 26 cows
  • Two-way factorial experiment: tuberculosis infection × iron supplementation
  • I made some values missing completely at random

Import cow data

  • Import from web URL or from the data folder if you are running this code on the SCINet server
cows <- read.csv('https://usda-ree-ars.github.io/SEAStats/dealing_with_missing_data/diggle.csv')
cows <- read.csv('data/diggle.csv')
  • Convert animal ID and treatment columns to factor variables
cows <- cows %>%
  mutate(across(c(animal, iron, infect), factor))

LOCF imputation on cow weight data

  • If an observation of weight is missing, take the observation from the previous time point for that same animal
  • First group_by(animal) then arrange() to sort data by animal and then in increasing order by day
  • na.rm = FALSE argument: any missing values at the beginning of an animal’s time series are kept as NA
  • Create new dataframe with side-by-side missing and imputed weight columns
cows_imputed <- cows %>%
  group_by(animal) %>%
  arrange(animal, day) %>%
  mutate(weight_LOCF_imputed = na.locf(weight, na.rm = FALSE))

LOCF imputation continued

  • Some values still missing: first time point missing for some animals
  • Apply na.locf() again to the already-imputed column, with fromLast = TRUE
cows_imputed <- cows_imputed %>%
  mutate(weight_LOCF_imputed = na.locf(weight_LOCF_imputed, na.rm = FALSE, fromLast = TRUE))

Linear interpolation

Linear interpolation, image (c) Wikipedia
  • Draw an imaginary line between the non-missing points on either side of the missing one
  • Find the y-coordinate at the time point where the value is missing
  • Example: cow A04 weighed 290 kg on day 380, and 310 kg on day 478
  • Weight on day 445 is missing; fill in assuming the cow gained weight at a constant rate between the two dates

Linear interpolation formula

  • Draw a line between the two points \((x_0, y_0)\) and \((x_1, y_1)\)
  • Find the y-coordinate for a given value of x using this formula

\[y = \frac{y_0(x_1-x) + y_1(x-x_0)}{x_1-x_0}\]

interp <- function(x, x0, y0, x1, y1) (y0 * (x1 - x) + y1 * (x - x0)) / (x1 - x0)

interp(x = 445, x0 = 380, y0 = 290, x1 = 478, y1 = 310)

Linear interpolation using approx()

  • approx() function in base R does linear interpolation
  • Make sure dataframe is grouped and sorted
  • approx() arguments:
    • x and y
    • xout: which x values need interpolated y values? We get a value for every day here regardless of whether it was missing
    • rule = 2: replace NAs at beginning or end of time series with closest non-missing value
cows_imputed <- cows_imputed %>%
  group_by(animal) %>%
  arrange(animal, day) %>%
  mutate(weight_interpolation_imputed = approx(x = day, y = weight, xout = day, rule = 2)$y)

Imputing using randomly selected values from other observations

  • Known as “hot deck” imputation, referring to old-school punch card readers
  • The randomly chosen value could be taken from any other observation in the dataset
  • Or random sampling could be stratified, for example by age class and/or sex

Image (c) IBM

Hot-deck imputation with tidyverse functions

  • First stratify dataset by treatment combination and day using group_by()
  • Within mutate() use if_else() logic to replace missing values with randomly chosen values from within the same treatment combination
  • if_else() has three arguments:
    • First argument is the condition being tested, is.na(weight) is TRUE if weight missing
    • Second argument is what to assign if TRUE: sample() to select from non-missing weight values
    • Third argument is what to assign if FALSE: unchanged value of weight
set.seed(1)

cows_imputed <- cows_imputed %>%
  group_by(iron, infect, day) %>%
  mutate(weight_hotdeck_imputed = if_else(is.na(weight), sample(na.omit(weight), size = length(weight), replace = TRUE), weight))

Hot-deck imputation is random!

  • Hot-deck gives you a different result every time, unlike the other methods we’ve seen so far
  • Keep this in mind for when we talk about multiple imputation in a few minutes!

Visualization: Last observation carried forward

Visualization: Linear interpolation

Visualization: Hot-deck imputation

Imputing using a regression model

  • Use relationships between variables in our dataset to impute missing values
  • Fit a regression model with the variable to impute as dependent variable, one or more other variables as predictors
  • Input data points are the ones with a complete set of values for every variable
  • Then predict values of the dependent variable where it’s missing

Imputation with simple bivariate regression

  • In jassids dataset, impute temp_max using temp_min
  • How could you check how good of a prediction this is?
temp_linear_fit <- lm(temp_max ~ temp_min, data = jassids)

coefficients(temp_linear_fit)

Imputation with simple bivariate regression

  • Use ifelse() (base R, not tidyverse if_else()) to create a vector of imputed temperature values
  • Use predicted temp_max values if missing, observed if not missing
pred_temp_max <- predict(temp_linear_fit, newdata = jassids)

temp_linear_imputed <- ifelse(is.na(jassids$temp_max), pred_temp_max, jassids$temp_max)

Performance of bivariate regression imputation

Random forest imputation

  • Random forest allows many predictors and nonlinear relationships with dependent variable
  • package missForest has a function called missForest()
  • Pass missForest() a dataframe with missing values, get back an imputed dataframe
  • For now don’t worry about optional arguments to modify tuning parameters
  • variablewise = TRUE gives us mean squared error (MSE) of the random forest model column-by-column: errors for the non-missing values
set.seed(2)
jassids_RF_imputed <- missForest(jassids, variablewise = TRUE)

Output of missForest()

  • Returns a list of two objects
    • ximp (complete dataset)
    • OOBerror (MSE for each column)
head(jassids_RF_imputed$ximp)

Output of missForest(), continued

  • Square root of MSE for each column, with names
  • Note this only tells us error for non-missing values, not what we really want to know
sqrt(jassids_RF_imputed$OOBerror) |> setNames(names(jassids))
  • Compare random forest’s root MSE with simple bivariate regression model
sqrt(mean(residuals(temp_linear_fit)^2))

Performance of random forest imputation

Consequences of single imputation

  • Single imputation might use ad hoc methods like substituting mean or linear interpolation
  • It might use fancy methods like machine learning
  • Either way, you get a nice clean-looking dataset with no ugly missing values
  • Proceed to fit a statistical model, test hypotheses, and make inferences
  • So far, so good, right???

Imputed values are not data

  • Anything you did not directly observe and measure is not data
  • It is the output of a model masquerading as data
  • Single imputation hides uncertainty associated with the imputation process
  • May lead to inappropriate rejections of the null hypothesis or inappropriately precise estimates of effect size

Typical approach to imputation

In the words of the great Stef van Buuren

“[Regression imputation] artificially strengthens the relations in the data. Correlations are biased upwards. Variability is underestimated. Imputations are too good to be true. Regression imputation is a recipe for false positive and spurious relations.”

Missing data strategy 3: Multiple imputation

Why do multiple imputation?

  • All methods we’ve discussed so far (other than possibly hot-deck) turn a missing value into a single imputed value
  • Uncertainty associated with the imputation process is not carried forward to our final estimates of effect size
  • Multiple imputation attempts to solve this problem

Multiple imputation: the basic idea

  • Statistician Donald Rubin had the insight leading to multiple imputation in 1977
  • Create multiple different imputed versions of a dataset, drawing the imputed values from an estimated distribution each time
  • Fit your desired statistical model to each one of the imputed datasets
  • Combine the estimates from those models so that the uncertainty of the missing values is incorporated in the uncertainty estimates from the model
  • Adjustments are made to degrees of freedom, affecting widths of confidence intervals and p-values
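  • For reference, the pooling step follows Rubin’s rules (standard notation, not from the original slide): given estimates \(\hat{Q}_i\) from \(m\) imputed datasets, the pooled estimate and its total variance are

\[\bar{Q} = \frac{1}{m}\sum_{i=1}^m \hat{Q}_i, \qquad T = \bar{U} + \left(1 + \frac{1}{m}\right)B\]

where \(\bar{U}\) is the mean within-imputation variance and \(B\) is the between-imputation variance of the \(\hat{Q}_i\).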

Donald Rubin

mice package: Multiple imputation using chained equations

  • Has methods for many common stats packages so you can incorporate multiple imputation into your analysis
    • lm(), glm() (base R)
    • lmer(), glmmTMB() (functions from mixed model packages)
    • brm() (Bayesian mixed models)
    • emmeans() to estimate means from any of the above
  • In most cases, combining of estimates and adjustments of degrees of freedom are done “automatically”

MICE workflow

  1. Create a set of m imputed datasets using the function mice()
  • Output: mids object (multiply imputed dataset)
  2. Wrap statistical modeling function inside with() to fit the model to each of the m datasets
  • Output: mira object (multiply imputed repeated analyses)
  3. Use pool() on the mira object to pool (combine) estimates from each model
  • Output: mipo object (multiple imputation pooled object)
  • Alternatively, pass mira object to functions like emmeans()

mice() for multiple imputation

  • First use all default arguments except:
    • m = 10 to create ten imputed datasets (often adequate)
    • seed to ensure reproducibility of the result
jassids_mice_imputed <- mice(jassids, m = 10, seed = 27606)

Output of MICE imputation

  • Object of class mids
  • nmis: Number of missing values in each column
  • imp: Imputed values for each of the ten datasets, by column
  • method: Imputation method used for each column
jassids_mice_imputed$nmis

jassids_mice_imputed$imp

jassids_mice_imputed$method

Predictive mean matching

  • "pmm": default method for columns of continuous data
  • PMM algorithm fits ridge regression using other columns of the dataset to predict values for all data points in a column, including missing and non-missing
  • For each missing value, find a small set (default 5) other observations in the dataset with most similar predicted values
  • Take the actual observed values for those five non-missing entries and randomly draw a value from them

More on PMM

  • PMM combines aspects of regression/machine learning imputation and hot-deck imputation
  • You can tweak different parameters of the algorithm to get slightly different results
  • Be aware of potential sensitivity of your final results to choice of method and parameters
  • Defaults of mice() work well in many cases, but don’t blindly accept them!
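  • For example, the size of the donor pool can be changed through the ... argument of mice(), which passes named arguments down to the univariate imputation functions (a sketch using 10 donors instead of the default 5):
jassids_pmm_10donors <- mice(jassids, m = 10, seed = 27606, donors = 10)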

Performance of PMM imputation

Other mice() methods

  • Look at ?mice documentation, mice vignettes, or the mice book
  • Most have a random component so you can get multiple different realizations of the imputation
  • Simpler non-random methods are implemented, like mean imputation (set m = 1)
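  • A sketch of mean imputation via mice(): one dataset and one iteration suffice because nothing is random
jassids_mice_mean <- mice(jassids, method = 'mean', m = 1, maxit = 1)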

Random forest imputation with mice()

  • method = 'rf' gives us random forest imputation
  • Not a single imputation like missForest(); uses a special algorithm to generate multiple imputed datasets
  • Explore mice() documentation if you want to modify default tuning parameters for this method
jassids_mice_RF_imputed <- mice(jassids, m = 10, seed = 27606, method = 'rf')

Performance of random forest imputation

Fit linear model to data multiply imputed with PMM method

jassids_multiple_lms <- with(
  jassids_mice_imputed,
  lm(Jassids ~ RF + temp_max + temp_min + RH1 + RH2 + BSS + wind_velocity)
)

Pool coefficient estimates and standard errors across multiple model fits

jassids_multiple_lm_coeffs <- pool(jassids_multiple_lms)

summary(jassids_multiple_lm_coeffs)

Fit linear model to data multiply imputed with random forest

jassids_multiple_lms_RF <- with(jassids_mice_RF_imputed, lm(Jassids ~ RF + temp_max + temp_min + RH1 + RH2 + BSS + wind_velocity))
jassids_multiple_lm_RF_coeffs <- pool(jassids_multiple_lms_RF)

summary(jassids_multiple_lm_RF_coeffs)

Comparison of coefficient estimates by imputation method

  • How and why do these estimates differ by method?

Example multiple imputation workflow: glmmTMB and emmeans

  • Revisiting cow weight dataset
  • "2l.pan" method suitable for imputing hierarchical/multilevel data with a multilevel structure
    • “Mixed model” data such as RCBD experiment or repeated measures on same individuals over time

Multilevel imputation setup

  • Set up vector of imputation method names with empty strings everywhere except the weight column
  • Set up predictor matrix to specify which effects are fixed (1), random grouping variable (-2), or not used (0)
pred_mat <- rbind(
  animal = 0, iron = 0, infect = 0, 
  weight = c(animal = -2, iron = 1, infect = 1, weight = 0, day = 1), 
  day = 0
)

imp_methods <- c('', '', '', '2l.pan', '')

Multiply impute the cow time series data

  • Notice we need to convert the grouping variable, animal, to an integer
cows_multilevel_imputed <- cows %>%
  mutate(animal = as.integer(animal)) %>%
  mice(m = 10, method = imp_methods, predictorMatrix = pred_mat, seed = 142)

Performance of imputation for selected individuals

Fit mixed model to imputed dataset and pool estimates

  • Fit a (slightly oversimplified) mixed model to each of the imputed datasets
  • Random intercept for animal
  • Fixed effects for two treatments, time in days, and their interactions
cows_fit <- with(
  cows_multilevel_imputed, 
  glmmTMB(weight ~ iron * infect * day + (1 | animal))
)

pool(cows_fit) |> summary()
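  • As noted in the workflow slides, the mira object can also be passed directly to emmeans(), which pools the estimates across imputations; a sketch estimating treatment combination means at day 750 (the same day value used in the brms example below):
emmeans(cows_fit, ~ iron + infect, at = list(day = 750))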

Example multiple imputation workflow: Bayesian GLMM using brms

  • Very similar procedure for Bayesian GLMMs with brms
  • brm_multiple() works with mids objects
  • Here we sample one Markov chain for each of the ten datasets (chains = 1)
  • Estimates are pooled by concatenating all the chains together
  • Because of uninformative priors and similar uncertainty propagation, results are nearly identical to glmmTMB() in this example

brm_multiple() example code

cows_bayesfit <- brm_multiple(
  weight ~ iron * infect * day + (1 | animal), 
  data = cows_multilevel_imputed,
  prior = c(
    prior(normal(0, 10), class = b)
  ),
  cores = 4, chains = 1, iter = 2000, warmup = 1000, seed = 142,
  file = 'data/cows_bayesfit', file_refit = 'never'
)

emmeans(cows_bayesfit, ~ iron + infect, at = list(day = 750))

emtrends(cows_bayesfit, ~ iron + infect, var = 'day')

A final note on Bayesian imputation

  • MICE is semi-Bayesian but there is also a fully Bayesian approach to imputation
  • MICE is a two-step process (first impute then fit model to each imputed dataset)
  • Alternatively, we can explicitly include imputation within a single Bayesian model fit
  • Missing values are treated as parameters for the model to estimate, with probability distributions
  • Only one model to fit, but we need to explicitly write out a formula for each variable to be imputed (a sketch follows below)
  • See brms missing data vignette
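  • A hypothetical sketch for the cow data, following the syntax from that vignette (not run in this lesson): the mi() term on the response tells brms to estimate the missing weights as parameters
cows_mi_formula <- bf(weight | mi() ~ iron * infect * day + (1 | animal))
cows_mi_fit <- brm(cows_mi_formula, data = cows, chains = 4, iter = 2000, seed = 142)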

Wrapping up

  • See text version of lesson for further reading and exercises for additional practice
  • Fill out MS Forms survey
  • Contact me at quentin.read@usda.gov!