This is a lesson about missing data: what it is, what problems it can cause, how to recognize and describe it, and what to do about it.
Where there is data, there is missing data.
cmdstanr as the Bayesian model fitting engine if you’re on the SCINet server or have CmdStanR installed on your local machineNA for “not available”NA is used across data types (numeric, character, factor, logical)NA in summary statisticsmean() take a vector of values as input and return a single valueNA if any of the input values are NA, no matter how many non-missing values there arena.rm argumentna.rmna.rm = FALSE by default.na.rm = TRUE removes missing values and computes summary function on non-missing valuesna.rm = FALSE the default?mean(), model fitting functions like glmmTMB() will not give you an error if values are missingdata folder if on SCINetsummary()has a method for data framesNA values per column, if there are anyglimpse() from dplyr highlights NA values in the first few rowscomplete.cases() returns logical vector with TRUE for complete rows, FALSE for rows with any missing valuestable() to get number of rows with and without missing valuesis.na() tests whether each value of a vector is missing!is.na() returns TRUE for non-missing valuesaggr() computes number of missing values in each variable and each combination of two variablesplot() methodnumbers = TRUE optionally shows proportion of missing values in each combinationvis_miss(), imported from the visdat package, shows missing data row-by-row.cluster = TRUE arranges missing values row-wisesort_miss = TRUE arranges missing values column-wisedplyr::filter() and base R complete.cases()clmm()brm()lm(), lme4(), and glmmTMB() — beware!cor()use = 'everything' returns NA if a pair of variables has any missing valuesuse = 'complete.obs' does listwise deletionuse = 'pairwise.complete.obs' does pairwise deletionreplace()
x)is.na(x) is TRUE)NA elements of x)mutate(across(where(is.numeric), f)) applies function f to all numeric columns of a dataframe, even those without missing valuestidyr::fill()zoo::na.locf (Last Observation Carried Forward) specialized for time series datadata folder if you are running this code on the SCINet serverweight is missing, take the observation from the previous time point for that same animalgroup_by(animal) then arrange() to sort data by animal and then in increasing order by dayna.rm = FALSE argument: any remaining missing values at the beginning of any animal’s time series kept as NAna.locf() again to the already-imputed column, with fromLast = TRUE\[y = \frac{y_0(x_1-x) + y_1(x-x_0)}{x_1-x_0}\]
approx()approx() function in base R does linear interpolationapprox() arguments:
x and yxout: which x values need interpolated y values? We get a value for every day here regardless of whether it was missingrule = 2: replace NAs at beginning or end of time series with closest non-missing valuegroup_by()mutate() use if_else() logic to replace missing values with randomly chosen values from within the same treatment combinationif_else() has three arguments:
is.na(weight) is TRUE if weight missingTRUE: sample() to select from non-missing weight valuesFALSE: unchanged value of weightjassids dataset, impute temp_max using temp_minifelse() (base R, not tidyverse if_else()) to create a vector of imputed temperature valuestemp_max values if missing, observed if not missingmissForest()missForest() a dataframe with missing values, get back an imputed dataframevariablewise = TRUE gives us mean squared error (MSE) of the random forest model column-by-column: errors for the non-missing valuesmissForest()ximp (complete dataset)OOBerror (MSE for each column)missForest(), continued“[Regression imputation] artificially strengthens the relations in the data. Correlations are biased upwards. Variability is underestimated. Imputations are too good to be true. Regression imputation is a recipe for false positive and spurious relations.”
lm(), glm() (base R)lmer(), glmmTMB() (functions from mixed model packages)brm() (Bayesian mixed models)emmeans() to estimate means from any of the abovem imputed datasets using the function mice()mids object (multiply imputed dataset)with() to fit the model to each of the m datasetsmira object (multiply imputed regression analyses)pool() on the mira object to pool (combine) estimates from each modelmipo object (multiple imputation pooled object)mira object to functions like emmeans()mice() for multiple imputationm = 10 to create ten imputed datasets (often adequate)seed to ensure reproducibility of the resultmidsnmis: Number of missing values in each columnimp: Imputed values for each of the ten datasets, by columnmethod: Imputation method used for each column"pmm": default method for columns of continuous datamice() work well in many cases, but don’t blindly accept them!mice() methods?mice documentation, mice vignettes, or the mice bookm = 1)mice()method = 'rf' gives us random forest imputationmissForest(); uses special algorithm to generate multiple imputed datasetsmice() documentation if you want to modify default tuning parameters for this method"2l.pan" method suitable for imputing hierarchical/multilevel data with a multilevel structure
weight column1), random grouping variable (-2), or not used (0)animal, to an integeremmeans() pools estimates for each treatment’s marginal mean weight on a specific dayemtrends() pools estimates for each treatment’s marginal mean time trend (kg/day gain rate)brm_multiple() works with mids objectschains = 1)glmmTMB() in this examplebrm_multiple() example codecows_bayesfit <- brm_multiple(
weight ~ iron * infect * day + (1 | animal),
data = cows_multilevel_imputed,
prior = c(
prior(normal(0, 10), class = b)
),
cores = 4, chains = 1, iter = 2000, warmup = 1000, seed = 142,
file = 'data/cows_bayesfit', file_refit = 'never'
)
emmeans(cows_bayesfit, ~ iron + infect, at = list(day = 750))
emtrends(cows_bayesfit, ~ iron + infect, var = 'day')quentin.read@usda.gov!