Dealing with Missing Data

What is this course?

This is a lesson about missing data: what it is, what problems it can cause, how to recognize and describe it, and what to do about it.

At the end of this course, you will understand …

  • What the different kinds of missing data are
  • The consequences of ignoring missing data
  • How missing data increases uncertainty in your conclusions
  • How to decide the best way to address missing data

At the end of this course, you will be able to …

  • Make graphs and tables to visualize missing data
  • Impute missing data with a variety of methods
  • Use multiple imputation to incorporate uncertainty due to missing data in your statistical estimates

What is missing data and why is it a problem?

Where there is data, there is missing data.

Getting to know missing data

  • Data might be missing for many different reasons
  • Causes of missingness are different in different scientific fields
  • What are some reasons you can think of for why a data point might be missing?

Reasons for missingness

  • Instruments fail
  • Datum not measured
    • Example: daily time series data with missing values on weekends and holidays
  • Attrition
    • Examples: a patient drops out of a study; or a plant dies before its final biomass can be recorded
  • Loss of data records

Image (c) USDA-ARS

Reasons for missingness, continued

  • Datum can be recorded for some subjects but not others
    • Example: we are measuring length of longest petal, but some plants do not have any flowers
  • Data value above or below the limit of detection
    • Examples: concentration of an element is too low to be measured; or a thermometer cannot measure temperatures above an upper limit
  • Data value doesn’t fit a statistical distribution
    • Example: zero values that become undefined if we log-transform the data

Image (c) Jerry Mason

Categories of missing data

  • Missing completely at random (MCAR)
  • Missing at random (MAR)
  • Missing not at random (MNAR)
  • This categorization was developed by Donald Rubin in 1976, based on the process that determines the probability of a data point being missing

Missing completely at random (MCAR)

  • The probability of being missing is equal for all data points in a dataset
  • If missing values are MCAR, we don’t have to account for any biases resulting from missing data

Missing at random (MAR)

  • The probability of being missing varies according to a group or variable that we have in our observed data
  • For example, our instrument is more likely to fail on a rainy day, but we have data on rainfall by day
  • Not as easy to deal with as MCAR, but we can use information we have to model the missing data process
  • Most standard methods for dealing with missing data assume MAR

Missing not at random (MNAR)

  • The probability of being missing depends on unobserved information, possibly including the missing value itself
  • Most difficult situation to deal with
  • Example: in a survey with a follow-up, people who do not have a strong opinion are less likely to follow up
  • Example: plant genotypes that are less resistant to disease are more likely to die before the end of the study, so their final biomass value is more likely to be missing
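The three mechanisms are easier to see in a small simulation. The sketch below (all variable names are invented for illustration) generates the same measurements and then deletes them under MCAR, MAR, and MNAR rules:

```r
set.seed(1)
n <- 1000
rain <- rbinom(n, 1, 0.3)          # observed covariate: rainy day or not
y <- rnorm(n, mean = 20, sd = 3)   # the variable we measure

# MCAR: every value has the same 10% chance of being missing
y_mcar <- ifelse(runif(n) < 0.1, NA, y)

# MAR: missingness depends on rain, which we observe
p_mar <- ifelse(rain == 1, 0.3, 0.05)
y_mar <- ifelse(runif(n) < p_mar, NA, y)

# MNAR: missingness depends on the unobserved value of y itself
p_mnar <- ifelse(y > 23, 0.5, 0.05)
y_mnar <- ifelse(runif(n) < p_mnar, NA, y)

mean(y_mcar, na.rm = TRUE)   # close to the true mean of 20
mean(y_mnar, na.rm = TRUE)   # biased downward: high values were deleted
```

Under MCAR the observed mean stays close to the truth, while under MNAR (here, large values more likely to be missing) the observed mean is biased downward.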

Censoring

Image (c) Myresa Hurst
  • Censoring is a type of MNAR missing data pattern
  • A censored data point is missing if it’s above or below a threshold
  • Example: a concentration below the limit of detection of the instrument or assay
  • Censoring also occurs if the subject dies or is removed from the study
  • Often seen in survival studies where subjects are only tracked for a finite period of time
  • Any censoring will bias results if not properly accounted for
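A brief sketch of why: if values below a hypothetical limit of detection are simply dropped, the mean of the remaining values is biased upward.

```r
set.seed(2)
conc <- rlnorm(1000, meanlog = 0, sdlog = 1)  # simulated true concentrations
lod <- 0.5                                    # hypothetical limit of detection
observed <- ifelse(conc < lod, NA, conc)      # left-censored: below LOD lost

mean(conc)                    # true mean of all concentrations
mean(observed, na.rm = TRUE)  # naive mean of detected values: too high
```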

Side note: missing replicates in designed experiments

  • Sometimes measurements are missing in a designed experiment, for example if a plot in an RCBD is destroyed or skipped
  • Although losing data isn’t ideal, this is not a major concern from an analysis standpoint
  • Modern methods for ANOVA do not require perfectly balanced data
  • If data are unbalanced and you want to do F-tests, be aware of what sum of squares you’re using

How does R deal with missing data?

Load packages

library(agridat)
library(glmmTMB)
library(dplyr)
library(ggplot2)
library(tidyr)
library(emmeans)
library(broom.mixed)
library(brms)
library(zoo)
library(VIM)
library(naniar)
library(missForest)
library(mice)

R’s missing value indicator

  • Missing values in R are represented by a special code: NA for “not available”
  • NA is used across data types (numeric, character, factor, logical)
x <- c(1, 2, NA, 4, 5)
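NA behaves the same way in non-numeric vectors; is.na() locates missing values regardless of type:

```r
x_num <- c(1, 2, NA, 4, 5)
x_chr <- c("a", NA, "c")
x_lgl <- c(TRUE, NA, FALSE)

is.na(x_num)       # FALSE FALSE  TRUE FALSE FALSE
is.na(x_chr)       # FALSE  TRUE FALSE
sum(is.na(x_num))  # count of missing values: 1
```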

NA in summary statistics

  • Summary statistic functions like mean() take a vector of values as input and return a single value
  • The usual default behavior is to return NA if any of the input values are NA, no matter how many non-missing values there are
mean(x)
sum(x)
median(x)

na.rm argument

  • Many summary statistic functions have an argument na.rm
  • Usually na.rm = FALSE by default
  • Explicitly setting na.rm = TRUE removes missing values and computes summary function on non-missing values
mean(x, na.rm = TRUE)
sum(x, na.rm = TRUE)
median(x, na.rm = TRUE)

Why is na.rm = FALSE the default?

  • Why doesn’t R remove the missing values and calculate the mean or sum without me having to tell it?
  • Interpretation changes when missing values are deleted, so this should not be done silently without our express consent
  • R model fitting functions have different defaults for dealing with missing values

Issues caused by deleting missing values

  • Unlike mean(), model fitting functions like glmmTMB() will not give you an error if values are missing
  • Compare results before and after introducing a missing value
data(cochran.beets)
fit <- glmmTMB(yield ~ fert + (1|block), data = cochran.beets)
plot(x = cochran.beets$yield, y = predict(fit))

cochran.beets$yield[3] <- NA
fit2 <- glmmTMB(yield ~ fert + (1|block), data = cochran.beets)
plot(x = cochran.beets$yield, y = predict(fit2, newdata = cochran.beets))

Verifying how many observations were used to fit a model

nrow(cochran.beets)
nobs(fit2)

Visualizing missing data

Packages for missing data

  • VIM: package with syntax and plotting style similar to base R
  • naniar: originally known as ggmissing; depends on ggplot2 and the other “tidyverse” packages

The data

Image (c) University of Georgia Cooperative Extension
  • A few dozen observations of jassid (leafhopper) abundance, with seven weather covariates
  • Collected in the Maharashtra region of India, published on Kaggle by Madhura Bhagat
  • I artificially deleted some values of the covariates from the dataset
  • Deletion was completely at random and independent for each covariate

Import the data

  • Import from CSV to dataframe, from web URL if on your own machine
jassids <- read.csv('https://usda-ree-ars.github.io/SEAStats/dealing_with_missing_data/jassids.csv')
  • Import from data folder if on SCINet
jassids <- read.csv('data/jassids.csv')
  • Look at the first few rows of the data
head(jassids)

R functions for summarizing missing data

  • Base R function summary() has a method for data frames
  • Tells you number of NA values per column, if there are any
summary(jassids)
  • glimpse() from dplyr highlights NA values in the first few rows
glimpse(jassids)

R functions for summarizing missing data

  • complete.cases() returns logical vector with TRUE for complete rows, FALSE for rows with any missing values
  • Pass output to table() to get number of rows with and without missing values
complete.cases(jassids)
table(complete.cases(jassids))

R functions for summarizing missing data

  • is.na() tests whether each value of a vector is missing
  • Negation !is.na() returns TRUE for non-missing values
  • Use this to make tables of missingness
table(!is.na(jassids$temp_max))
  • Two-way tables or more are possible too
with(jassids, table(!is.na(temp_max), !is.na(temp_min)))

VIM missing-data visualization

  • aggr() computes number of missing values in each variable and each combination of two variables
  • Resulting object has a plot() method
  • numbers = TRUE optionally shows proportion of missing values in each combination
plot(aggr(jassids), numbers = TRUE)

naniar missing data visualization

  • vis_miss(), imported from the visdat package, shows missing data row-by-row.
vis_miss(jassids) + theme_sub_axis_x(text = element_text(vjust = 0))
  • cluster = TRUE clusters the rows by missingness pattern
  • sort_miss = TRUE sorts the columns by proportion of missing values
vis_miss(jassids, cluster = TRUE)
vis_miss(jassids, sort_miss = TRUE)

What to do about missing data

Three groups of strategies for dealing with missing data

  • Delete the observations that contain missing data and analyze only the complete observations
  • Impute missing values with a single imputation, treating the imputed value the same as the other data values
  • Impute missing values with multiple imputation, accounting for the uncertainty in the imputation in your final results

Choosing a strategy: Proportion of missing data

  • If small, imputation may have a negligible effect on your bottom-line conclusions
  • More complex techniques are more important for higher missing proportions
  • No one-size-fits-all rule for what proportion needs to be missing for it to “matter”
  • If the proportion of missing data is too high, no imputation method may save you

Choosing a strategy: Type of missingness

  • If MCAR, deleting observations with missing data = deleting observations at random
    • Power is lost by reduced sample size but otherwise no major issue
  • If not MCAR, deleting observations with missing data = deleting observations non-randomly
    • Bias is introduced unless imputation is done properly
    • Severity depends on number of missing values

Missing data strategy 1: Delete it

  • Delete observations that contain missing data until you end up with a complete dataset
    • Listwise deletion, or complete-case analysis
    • Pairwise deletion, or available-case analysis

Listwise deletion, a.k.a. complete-case analysis

  • By far the commonest way of dealing with missing data
  • Just delete any observations, or “cases,” with any missing values
  • In the typical stats-ready data format (so-called tidy data), each row of the data frame is a case
  • Many ways to do this, here is one using dplyr::filter() and base R complete.cases()
jassids_complete_only <- jassids %>%
  filter(complete.cases(.))

Listwise deletion in statistical modeling packages

  • Listwise deletion is default in many R and SAS statistical modeling packages
  • Even if you don’t explicitly delete incomplete cases, it will be done by the model fitting function
  • Only columns used in the model will be considered when determining complete cases
model1 <- lm(Jassids ~ temp_max, data = jassids)
model2 <- lm(Jassids ~ temp_max + wind_velocity, data = jassids)

nobs(model1)
nobs(model2)

Warnings and errors for incomplete cases

  • Many R modeling functions return an error if missing values are present: clmm()
  • Some return a warning: brm()
  • Some bury the warning deep in the summary output: lm(), lmer(), and glmmTMB() — beware!

Listwise deletion in multivariate analysis

  • You may need to remove both columns (variables) and rows (cases) to get an acceptable complete dataset
    • Example: you have 50 microbiome samples and a total of 100 microbiome taxa in the dataset
    • But a few taxa have missing data for 90% of the rows
    • Complete-case analysis on the full dataset will leave you with few or no rows to work with
    • You could delete columns above some unacceptable threshold of proportion missing values, iterating until you get a reasonable number of complete cases
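One way to sketch this column-then-row strategy in base R, using a toy dataset and an arbitrary 50% threshold (both invented for illustration):

```r
set.seed(3)
# Toy dataset: 20 samples, 5 variables, with V5 mostly missing
dat <- as.data.frame(replicate(5, rnorm(20)))
dat$V5[sample(20, 18)] <- NA   # 90% missing in one column
dat$V1[sample(20, 2)]  <- NA   # a little missing elsewhere

prop_missing <- colMeans(is.na(dat))    # proportion missing per column
dat_cols <- dat[, prop_missing <= 0.5]  # drop columns over the threshold
dat_complete <- dat[complete.cases(dat_cols), colnames(dat_cols)]

nrow(dat_complete)  # far more complete rows than complete.cases(dat) leaves
```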

Pairwise deletion, a.k.a. available-case analysis

  • Instead of using only the complete cases for every calculation, do a separate calculation for each variable or subset of variables
  • For each calculation, use the cases that are complete for the variables involved
  • Often used for generating correlation matrices
  • Can’t be used for things like multiple regression which require all variables to be non-missing

Pairwise deletion in cor()

  • The default, use = 'everything', returns NA if a pair of variables has any missing values
  • use = 'complete.obs' does listwise deletion
  • use = 'pairwise.complete.obs' does pairwise deletion
cor(jassids)
cor(jassids, use = 'complete.obs')
cor(jassids, use = 'pairwise.complete.obs')

Missing data strategy 2: Single imputation

Imputing missing data

  • Listwise deletion can waste a large proportion of your data
  • Listwise deletion biases your results if data are not MCAR
  • Instead impute, or replace missing data points with substitute values
  • Use information we have to make educated guesses about what the missing value truly is, or would be if we could measure it

Imputation strategies

  • Impute using the mean, median, or mode
  • Impute using adjacent observations in space and/or time
  • Impute using a randomly selected value from another observation
  • Impute using a regression model

Imputing using the mean or mode

  • Replace all missing values with the mean or median (or mode for categorical data) of that variable
  • Can be done in base R with replace()
    • First argument is the vector with some missing values (x)
    • Second argument is a vector of indexes where we want to do the replacing (locations where is.na(x) is TRUE)
    • Third argument is the values we want to replace at those indexes (mean of non-NA elements of x)
impute_mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))
impute_median <- function(x) replace(x, is.na(x), median(x, na.rm = TRUE))
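A quick check of impute_mean() on a toy vector: the missing value is replaced by the mean of the four observed values.

```r
impute_mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

x <- c(1, 2, NA, 4, 5)
impute_mean(x)  # 1 2 3 4 5 -- the NA becomes mean(c(1, 2, 4, 5)) = 3
```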

Do mean imputation on jassids dataset

  • mutate(across(where(is.numeric), f)) applies function f to all numeric columns of a dataframe, even those without missing values
  • Assign the result to a new dataframe (better than overwriting original dataframe)
jassids_mean_imputed <- jassids %>%
  mutate(across(where(is.numeric), impute_mean))

Visualization of mean imputation

Imputing using adjacent observations in space and/or time

  • Two observations close together in space or time are more likely to be similar than two distant ones
  • Simplest of these kinds of imputation methods: copy adjacent observations to fill missing values
  • tidyr::fill()
  • zoo::na.locf() (“Last Observation Carried Forward”) is specialized for time series data
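For intuition, the core of LOCF can be written in a few lines of base R (a simplified sketch of what zoo::na.locf() does; the real function is vectorized and has more options):

```r
# Carry the last non-missing value forward through a vector
locf <- function(x) {
  for (i in seq_along(x)[-1]) {
    if (is.na(x[i])) x[i] <- x[i - 1]  # copy the previous value forward
  }
  x
}

locf(c(NA, 290, NA, NA, 310))  # NA 290 290 290 310 -- a leading NA has nothing to carry
```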

Example dataset: cow weights over time

Image (c) Te Ara, the Encyclopedia of New Zealand
  • Time series of body weights taken at 23 time points from 26 cows
  • Two-way factorial experiment: tuberculosis infection × iron supplementation
  • I made some values missing completely at random

Import cow data

  • Import from web URL or from the data folder if you are running this code on the SCINet server
cows <- read.csv('https://usda-ree-ars.github.io/SEAStats/dealing_with_missing_data/diggle.csv')
cows <- read.csv('data/diggle.csv')
  • Convert animal ID and treatment columns to factor variables
cows <- cows %>%
  mutate(across(c(animal, iron, infect), factor))

LOCF imputation on cow weight data

  • If an observation of weight is missing, take the observation from the previous time point for that same animal
  • First group_by(animal) then arrange() to sort data by animal and then in increasing order by day
  • na.rm = FALSE argument: any remaining missing values at the beginning of any animal’s time series kept as NA
  • Create new dataframe with side-by-side missing and imputed weight columns
cows_imputed <- cows %>%
  group_by(animal) %>%
  arrange(animal, day) %>%
  mutate(weight_LOCF_imputed = na.locf(weight, na.rm = FALSE))

LOCF imputation continued

  • Some values still missing: first time point missing for some animals
  • Apply na.locf() again to the already-imputed column, with fromLast = TRUE
cows_imputed <- cows_imputed %>%
  mutate(weight_LOCF_imputed = na.locf(weight_LOCF_imputed, na.rm = FALSE, fromLast = TRUE))

Linear interpolation

Linear interpolation, image (c) Wikipedia
  • Draw an imaginary line between the non-missing points on either side of the missing one
  • Find the y-coordinate at the time point where the value is missing
  • Example: cow A04 weighed 290 kg on day 380, and 310 kg on day 478
  • Weight on day 445 is missing; fill in assuming the cow gained weight at a constant rate between the two dates

Linear interpolation formula

  • Draw a line between the two points \((x_0, y_0)\) and \((x_1, y_1)\)
  • Find the y-coordinate for a given value of x using this formula

\[y = \frac{y_0(x_1-x) + y_1(x-x_0)}{x_1-x_0}\]

interp <- function(x, x0, y0, x1, y1) (y0 * (x1 - x) + y1 * (x - x0)) / (x1 - x0)

interp(x = 445, x0 = 380, y0 = 290, x1 = 478, y1 = 310)

Linear interpolation using approx()

  • approx() function in base R does linear interpolation
  • Make sure dataframe is grouped and sorted
  • approx() arguments:
    • x and y
    • xout: which x values need interpolated y values? We get a value for every day here regardless of whether it was missing
    • rule = 2: replace NAs at beginning or end of time series with closest non-missing value
cows_imputed <- cows_imputed %>%
  group_by(animal) %>%
  arrange(animal, day) %>%
  mutate(weight_interpolation_imputed = approx(x = day, y = weight, xout = day, rule = 2)$y)

Imputing using randomly selected values from other observations

  • Known as “hot deck” imputation, referring to old-school punch card readers
  • The randomly chosen value could be taken from any other observation in the dataset
  • Or random sampling could be stratified, for example by age class and/or sex
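A minimal sketch of unstratified hot-deck imputation in base R: each missing value is filled with a value sampled at random from the observed values of the same variable (the helper name hot_deck is invented for illustration).

```r
set.seed(4)
hot_deck <- function(x) {
  donors <- x[!is.na(x)]            # observed values act as the "deck"
  n_missing <- sum(is.na(x))
  x[is.na(x)] <- sample(donors, n_missing, replace = TRUE)
  x
}

x <- c(12, NA, 15, NA, 11, 14)
hot_deck(x)  # the two NAs are filled with values drawn from 12, 15, 11, 14
```

Stratified hot-deck imputation would restrict the donor pool to observations in the same group (for example, age class or sex) as the recipient.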