Dealing with Missing Data

What is this course?

This is a lesson about missing data: what it is, what problems it can cause, how to recognize and describe it, and what to do about it.

At the end of this course, you will understand …

  • What the different kinds of missing data are
  • The consequences of ignoring missing data
  • How missing data increases uncertainty in your conclusions
  • How to decide the best way to address missing data

At the end of this course, you will be able to …

  • Make graphs and tables to visualize missing data
  • Impute missing data with a variety of methods
  • Use multiple imputation to incorporate uncertainty due to missing data in your statistical estimates

What is missing data and why is it a problem?

Where there is data, there is missing data.

Getting to know missing data

  • Data might be missing for many different reasons
  • Causes of missingness are different in different scientific fields
  • What are some reasons you can think of for why a data point might be missing?

Reasons for missingness

  • Instruments fail
  • Datum not measured
    • Example: daily time series data with missing values on weekends and holidays
  • Attrition
    • Examples: a patient drops out of a study; or a plant dies before its final biomass can be recorded
  • Loss of data records

Image (c) USDA-ARS

Reasons for missingness, continued

  • Datum can be recorded for some subjects but not others
    • Example: we are measuring length of longest petal, but some plants do not have any flowers
  • Data value above or below the limit of detection
    • Examples: concentration of an element is too low to be measured; or a thermometer cannot measure temperatures above an upper limit
  • Data value doesn’t fit a statistical distribution
    • Example: zero values that become undefined if we log-transform the data

Image (c) Jerry Mason

Categories of missing data

  • Missing completely at random (MCAR)
  • Missing at random (MAR)
  • Missing not at random (MNAR)
  • This categorization was developed by Donald Rubin in 1976, based on the process that determines the probability of a data point being missing

Missing completely at random (MCAR)

  • The probability of being missing is equal for all data points in a dataset
  • If missing values are MCAR, we don’t have to account for any biases resulting from missing data

Missing at random (MAR)

  • The probability of being missing varies according to a group or variable that we have in our observed data
  • For example, our instrument is more likely to fail on a rainy day, but we have data on rainfall by day
  • Not as easy to deal with as MCAR, but we can use information we have to model the missing data process
  • Most standard methods for dealing with missing data assume MAR

Missing not at random (MNAR)

  • The probability of being missing depends on unobserved information, possibly including the missing value itself
  • Most difficult situation to deal with
  • Example: in a survey with a follow-up, people who do not have a strong opinion are less likely to follow up
  • Example: plant genotypes that are less resistant to disease are more likely to die before the end of the study, so their final biomass value is more likely to be missing
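The three mechanisms are easier to see in a small simulation. The sketch below (all variable names are invented for illustration) generates the same measurements and then deletes them under MCAR, MAR, and MNAR rules:

```r
set.seed(1)
n <- 1000
rain <- rbinom(n, 1, 0.3)          # observed covariate: rainy day or not
y <- rnorm(n, mean = 20, sd = 3)   # the variable we measure

# MCAR: every value has the same 10% chance of being missing
y_mcar <- ifelse(runif(n) < 0.1, NA, y)

# MAR: missingness depends on rain, which we observe
p_mar <- ifelse(rain == 1, 0.3, 0.05)
y_mar <- ifelse(runif(n) < p_mar, NA, y)

# MNAR: missingness depends on the unobserved value of y itself
p_mnar <- ifelse(y > 23, 0.5, 0.05)
y_mnar <- ifelse(runif(n) < p_mnar, NA, y)

mean(y_mcar, na.rm = TRUE)   # close to the true mean of 20
mean(y_mnar, na.rm = TRUE)   # biased downward: high values were deleted
```

Under MCAR the observed mean stays close to the truth, while under MNAR (here, large values more likely to be missing) the observed mean is biased downward.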

Censoring

Image (c) Myresa Hurst
  • Censoring is a type of MNAR missing data pattern
  • A censored data point is missing if it’s above or below a threshold
  • Example: a concentration below the limit of detection of the instrument or assay
  • Censoring also occurs if the subject dies or is removed from the study
  • Often seen in survival studies where subjects are only tracked for a finite period of time
  • Any censoring will bias results if not properly accounted for
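A brief sketch of why: if values below a hypothetical limit of detection are simply dropped, the mean of the remaining values is biased upward.

```r
set.seed(2)
conc <- rlnorm(1000, meanlog = 0, sdlog = 1)  # simulated true concentrations
lod <- 0.5                                    # hypothetical limit of detection
observed <- ifelse(conc < lod, NA, conc)      # left-censored: below LOD lost

mean(conc)                    # true mean of all concentrations
mean(observed, na.rm = TRUE)  # naive mean of detected values: too high
```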

Side note: missing replicates in designed experiments

  • Sometimes measurements are missing in a designed experiment, for example if a plot in an RCBD is destroyed or skipped
  • Although losing data isn’t ideal, this is not a major concern from an analysis standpoint
  • Modern methods for ANOVA do not require perfectly balanced data
  • If data are unbalanced and you want to do F-tests, be aware of what sum of squares you’re using

How does R deal with missing data?

Load packages

library(agridat)
library(glmmTMB)
library(dplyr)
library(ggplot2)
library(tidyr)
library(emmeans)
library(broom.mixed)
library(brms)
library(zoo)
library(VIM)
library(naniar)
library(missForest)
library(mice)

R’s missing value indicator

  • Missing values in R are represented by a special code: NA for “not available”
  • NA is used across data types (numeric, character, factor, logical)
x <- c(1, 2, NA, 4, 5)
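NA behaves the same way in non-numeric vectors; is.na() locates missing values regardless of type:

```r
x_num <- c(1, 2, NA, 4, 5)
x_chr <- c("a", NA, "c")
x_lgl <- c(TRUE, NA, FALSE)

is.na(x_num)       # FALSE FALSE  TRUE FALSE FALSE
is.na(x_chr)       # FALSE  TRUE FALSE
sum(is.na(x_num))  # count of missing values: 1
```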

NA in summary statistics

  • Summary statistic functions like mean() take a vector of values as input and return a single value
  • The usual default behavior is to return NA if any of the input values are NA, no matter how many non-missing values there are
mean(x)
sum(x)
median(x)

na.rm argument

  • Many summary statistic functions have an argument na.rm
  • Usually na.rm = FALSE by default
  • Explicitly setting na.rm = TRUE removes missing values and computes summary function on non-missing values
mean(x, na.rm = TRUE)
sum(x, na.rm = TRUE)
median(x, na.rm = TRUE)

Why is na.rm = FALSE the default?

  • Why doesn’t R remove the missing values and calculate the mean or sum without me having to tell it?
  • Interpretation changes when missing values are deleted, so this should not be done silently without our express consent
  • R model fitting functions have different defaults for dealing with missing values

Issues caused by deleting missing values

  • Unlike mean(), model fitting functions like glmmTMB() will not give you an error if values are missing
  • Compare results before and after introducing a missing value
data(cochran.beets)
fit <- glmmTMB(yield ~ fert + (1|block), data = cochran.beets)
plot(x = cochran.beets$yield, y = predict(fit))

cochran.beets$yield[3] <- NA
fit2 <- glmmTMB(yield ~ fert + (1|block), data = cochran.beets)
plot(x = cochran.beets$yield, y = predict(fit2, newdata = cochran.beets))

Verifying how many observations were used to fit a model

nrow(cochran.beets)
nobs(fit2)

Visualizing missing data

Packages for missing data

  • VIM: package with syntax and plotting style similar to base R
  • naniar: originally known as ggmissing; depends on ggplot2 and the other “tidyverse” packages

The data

Image (c) University of Georgia Cooperative Extension
  • A few dozen observations of jassid (leafhopper) abundance, with seven weather covariates
  • Collected in the Maharashtra region of India, published on Kaggle by Madhura Bhagat
  • I artificially deleted some values of the covariates from the dataset
  • Deletion was completely at random and independent for each covariate

Import the data

  • Import from CSV to dataframe, from web URL if on your own machine
jassids <- read.csv('https://usda-ree-ars.github.io/SEAStats/dealing_with_missing_data/jassids.csv')
  • Import from data folder if on SCINet
jassids <- read.csv('data/jassids.csv')
  • Look at the first few rows of the data
head(jassids)

R functions for summarizing missing data

  • Base R function summary() has a method for data frames
  • Tells you number of NA values per column, if there are any
summary(jassids)
  • glimpse() from dplyr highlights NA values in the first few rows
glimpse(jassids)

R functions for summarizing missing data

  • complete.cases() returns logical vector with TRUE for complete rows, FALSE for rows with any missing values
  • Pass output to table() to get number of rows with and without missing values
complete.cases(jassids)
table(complete.cases(jassids))

R functions for summarizing missing data

  • is.na() tests whether each value of a vector is missing
  • Negation !is.na() returns TRUE for non-missing values
  • Use this to make tables of missingness
table(!is.na(jassids$temp_max))
  • Two-way tables or more are possible too
with(jassids, table(!is.na(temp_max), !is.na(temp_min)))

VIM missing-data visualization

  • aggr() computes number of missing values in each variable and each combination of two variables
  • Resulting object has a plot() method
  • numbers = TRUE optionally shows proportion of missing values in each combination
plot(aggr(jassids), numbers = TRUE)

naniar missing data visualization

  • vis_miss(), imported from the visdat package, shows missing data row-by-row.
vis_miss(jassids) + theme_sub_axis_x(text = element_text(vjust = 0))
  • cluster = TRUE clusters the rows by missingness pattern
  • sort_miss = TRUE sorts the columns by proportion of missing values
vis_miss(jassids, cluster = TRUE)
vis_miss(jassids, sort_miss = TRUE)

What to do about missing data

Three groups of strategies for dealing with missing data

  • Delete the observations that contain missing data and analyze only the complete observations
  • Impute missing values with a single imputation, treating the imputed value the same as the other data values
  • Impute missing values with multiple imputation, accounting for the uncertainty in the imputation in your final results

Choosing a strategy: Proportion of missing data

  • If small, imputation may have a negligible effect on your bottom-line conclusions
  • More complex techniques are more important for higher missing proportions
  • No one-size-fits-all rule for what proportion needs to be missing for it to “matter”
  • If the proportion of missing data is too high, no imputation method may save you

Choosing a strategy: Type of missingness

  • If MCAR, deleting observations with missing data = deleting observations at random
    • Power is lost by reduced sample size but otherwise no major issue
  • If not MCAR, deleting observations with missing data = deleting observations non-randomly
    • Bias is introduced unless imputation is done properly
    • Severity depends on number of missing values

Missing data strategy 1: Delete it

  • Delete observations that contain missing data until you end up with a complete dataset
    • Listwise deletion, or complete-case analysis
    • Pairwise deletion, or available-case analysis

Listwise deletion, a.k.a. complete-case analysis

  • By far the commonest way of dealing with missing data
  • Just delete any observations, or “cases,” with any missing values
  • In the typical stats-ready data format (so-called tidy data), each row of the data frame is a case
  • Many ways to do this, here is one using dplyr::filter() and base R complete.cases()
jassids_complete_only <- jassids %>%
  filter(complete.cases(.))

Listwise deletion in statistical modeling packages

  • Listwise deletion is default in many R and SAS statistical modeling packages
  • Even if you don’t explicitly delete incomplete cases, it will be done by the model fitting function
  • Only columns used in the model will be considered when determining complete cases
model1 <- lm(Jassids ~ temp_max, data = jassids)
model2 <- lm(Jassids ~ temp_max + wind_velocity, data = jassids)

nobs(model1)
nobs(model2)

Warnings and errors for incomplete cases

  • Many R modeling functions return an error if missing values are present: clmm()
  • Some return a warning: brm()
  • Some bury the warning deep in the summary output: lm(), lmer(), and glmmTMB() — beware!

Listwise deletion in multivariate analysis

  • You may need to remove both columns (variables) and rows (cases) to get an acceptable complete dataset
    • Example: you have 50 microbiome samples and a total of 100 microbiome taxa in the dataset
    • But a few taxa have missing data for 90% of the rows
    • Complete-case analysis on the full dataset will leave you with few or no rows to work with
    • You could delete columns above some unacceptable threshold of proportion missing values, iterating until you get a reasonable number of complete cases
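One way to sketch this column-then-row strategy in base R, using a toy dataset and an arbitrary 50% threshold (both invented for illustration):

```r
set.seed(3)
# Toy dataset: 20 samples, 5 variables, with V5 mostly missing
dat <- as.data.frame(replicate(5, rnorm(20)))
dat$V5[sample(20, 18)] <- NA   # 90% missing in one column
dat$V1[sample(20, 2)]  <- NA   # a little missing elsewhere

prop_missing <- colMeans(is.na(dat))    # proportion missing per column
dat_cols <- dat[, prop_missing <= 0.5]  # drop columns over the threshold
dat_complete <- dat[complete.cases(dat_cols), colnames(dat_cols)]

nrow(dat_complete)  # far more complete rows than complete.cases(dat) leaves
```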

Pairwise deletion, a.k.a. available-case analysis

  • Instead of using only the complete cases for every calculation, do a separate calculation for each variable or subset of variables
  • For each calculation, use the cases that are complete for the variables involved
  • Often used for generating correlation matrices
  • Can’t be used for things like multiple regression which require all variables to be non-missing

Pairwise deletion in cor()

  • The default, use = 'everything', returns NA if a pair of variables has any missing values
  • use = 'complete.obs' does listwise deletion
  • use = 'pairwise.complete.obs' does pairwise deletion
cor(jassids)
cor(jassids, use = 'complete.obs')
cor(jassids, use = 'pairwise.complete.obs')

Missing data strategy 2: Single imputation

Imputing missing data

  • Listwise deletion can waste a large proportion of your data
  • Listwise deletion biases your results if data are not MCAR
  • Instead impute, or replace missing data points with substitute values
  • Use information we have to make educated guesses about what the missing value truly is, or would be if we could measure it

Imputation strategies

  • Impute using the mean, median, or mode
  • Impute using adjacent observations in space and/or time
  • Impute using a randomly selected value from another observation
  • Impute using a regression model

Imputing using the mean or mode

  • Replace all missing values with the mean or median (or mode for categorical data) of that variable
  • Can be done in base R with replace()
    • First argument is the vector with some missing values (x)
    • Second argument is a vector of indexes where we want to do the replacing (locations where is.na(x) is TRUE)
    • Third argument is the values we want to replace at those indexes (mean of non-NA elements of x)
impute_mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))
impute_median <- function(x) replace(x, is.na(x), median(x, na.rm = TRUE))
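A quick check of impute_mean() on a toy vector: the missing value is replaced by the mean of the four observed values.

```r
impute_mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

x <- c(1, 2, NA, 4, 5)
impute_mean(x)  # 1 2 3 4 5 -- the NA becomes mean(c(1, 2, 4, 5)) = 3
```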

Do mean imputation on jassids dataset

  • mutate(across(where(is.numeric), f)) applies function f to all numeric columns of a dataframe, even those without missing values
  • Assign the result to a new dataframe (better than overwriting original dataframe)
jassids_mean_imputed <- jassids %>%
  mutate(across(where(is.numeric), impute_mean))

Visualization of mean imputation

Imputing using adjacent observations in space and/or time

  • Two observations close together in space or time are more likely to be similar than two distant ones
  • Simplest of these kinds of imputation methods: copy adjacent observations to fill missing values
  • tidyr::fill()
  • zoo::na.locf() (“Last Observation Carried Forward”) is specialized for time series data
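For intuition, the core of LOCF can be written in a few lines of base R (a simplified sketch of what zoo::na.locf() does; the real function is vectorized and has more options):

```r
# Carry the last non-missing value forward through a vector
locf <- function(x) {
  for (i in seq_along(x)[-1]) {
    if (is.na(x[i])) x[i] <- x[i - 1]  # copy the previous value forward
  }
  x
}

locf(c(NA, 290, NA, NA, 310))  # NA 290 290 290 310 -- a leading NA has nothing to carry
```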

Example dataset: cow weights over time

Image (c) Te Ara, the Encyclopedia of New Zealand
  • Time series of body weights taken at 23 time points from 26 cows
  • Two-way factorial experiment: tuberculosis infection × iron supplementation
  • I made some values missing completely at random

Import cow data

  • Import from web URL or from the data folder if you are running this code on the SCINet server
cows <- read.csv('https://usda-ree-ars.github.io/SEAStats/dealing_with_missing_data/diggle.csv')
cows <- read.csv('data/diggle.csv')
  • Convert animal ID and treatment columns to factor variables
cows <- cows %>%
  mutate(across(c(animal, iron, infect), factor))

LOCF imputation on cow weight data

  • If an observation of weight is missing, take the observation from the previous time point for that same animal
  • First group_by(animal) then arrange() to sort data by animal and then in increasing order by day
  • na.rm = FALSE argument: any remaining missing values at the beginning of any animal’s time series kept as NA
  • Create new dataframe with side-by-side missing and imputed weight columns
cows_imputed <- cows %>%
  group_by(animal) %>%
  arrange(animal, day) %>%
  mutate(weight_LOCF_imputed = na.locf(weight, na.rm = FALSE))

LOCF imputation continued

  • Some values still missing: first time point missing for some animals
  • Apply na.locf() again to the already-imputed column, with fromLast = TRUE
cows_imputed <- cows_imputed %>%
  mutate(weight_LOCF_imputed = na.locf(weight_LOCF_imputed, na.rm = FALSE, fromLast = TRUE))

Linear interpolation

Linear interpolation, image (c) Wikipedia
  • Draw an imaginary line between the non-missing points on either side of the missing one
  • Find the y-coordinate at the time point where the value is missing
  • Example: cow A04 weighed 290 kg on day 380, and 310 kg on day 478
  • Weight on day 445 is missing; fill in assuming the cow gained weight at a constant rate between the two dates

Linear interpolation formula

  • Draw a line between the two points \((x_0, y_0)\) and \((x_1, y_1)\)
  • Find the y-coordinate for a given value of x using this formula

\[y = \frac{y_0(x_1-x) + y_1(x-x_0)}{x_1-x_0}\]

interp <- function(x, x0, y0, x1, y1) (y0 * (x1 - x) + y1 * (x - x0)) / (x1 - x0)

interp(x = 445, x0 = 380, y0 = 290, x1 = 478, y1 = 310)

Linear interpolation using approx()

  • approx() function in base R does linear interpolation
  • Make sure dataframe is grouped and sorted
  • approx() arguments:
    • x and y
    • xout: which x values need interpolated y values? We get a value for every day here regardless of whether it was missing
    • rule = 2: replace NAs at beginning or end of time series with closest non-missing value
cows_imputed <- cows_imputed %>%
  group_by(animal) %>%
  arrange(animal, day) %>%
  mutate(weight_interpolation_imputed = approx(x = day, y = weight, xout = day, rule = 2)$y)

Imputing using randomly selected values from other observations

  • Known as “hot deck” imputation, referring to old-school punch card readers
  • The randomly chosen value could be taken from any other observation in the dataset
  • Or random sampling could be stratified, for example by age class and/or sex
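A minimal sketch of unstratified hot-deck imputation in base R: each missing value is filled with a value sampled at random from the observed values of the same variable (the helper name hot_deck is invented for illustration).

```r
set.seed(4)
hot_deck <- function(x) {
  donors <- x[!is.na(x)]            # observed values act as the "deck"
  n_missing <- sum(is.na(x))
  x[is.na(x)] <- sample(donors, n_missing, replace = TRUE)
  x
}

x <- c(12, NA, 15, NA, 11, 14)
hot_deck(x)  # the two NAs are filled with values drawn from 12, 15, 11, 14
```

Stratified hot-deck imputation would restrict the donor pool to observations in the same group (for example, age class or sex) as the recipient.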