This is a lesson about missing data: what it is, what problems it can cause, how to recognize and describe it, and what to do about it.
Where there is data, there is missing data.
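To make this concrete, here is a minimal sketch in base R of how a missing value behaves in a vector and in a data frame. The vector `x` and the data frame `d` are hypothetical toy data invented for illustration; they are not part of the lesson's datasets.

```r
# Hypothetical toy data, invented for illustration only
x <- c(2.5, NA, 4.0, 3.5)

mean(x)                # NA, because one of the input values is missing
mean(x, na.rm = TRUE)  # mean of the three non-missing values

is.na(x)               # TRUE only for the missing element
sum(!is.na(x))         # number of non-missing values

d <- data.frame(
  height = c(10, NA, 12, 15),
  width  = c(3, 4, NA, 6)
)

complete.cases(d)         # TRUE for rows with no missing values
table(complete.cases(d))  # counts of incomplete and complete rows

# Correlation: NA with the default use = "everything",
# computed from complete rows only with use = "complete.obs"
cor(d$height, d$width)
cor(d$height, d$width, use = "complete.obs")
```

With only two complete rows, the last correlation is not meaningful as a statistic; the point is only that the call no longer returns NA.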
- NA stands for “not available”
- NA is used across data types (numeric, character, factor, logical)

NA in summary statistics
- Summary functions such as mean() take a vector of values as input and return a single value
- They return NA if any of the input values are NA, no matter how many non-missing values there are

The na.rm argument
- na.rm = FALSE by default
- na.rm = TRUE removes missing values and computes the summary function on the non-missing values

Why is na.rm = FALSE the default?
- Unlike mean(), which returns NA, model fitting functions like glmmTMB() will not give you an error if values are missing

- Read the data from the data folder if on SCINet
- summary() has a method for data frames
- It reports the number of NA values per column, if there are any
- glimpse() from dplyr highlights NA values in the first few rows
- complete.cases() returns a logical vector with TRUE for complete rows and FALSE for rows with any missing values
- Use table() to get the number of rows with and without missing values
- is.na() tests whether each value of a vector is missing
- !is.na() returns TRUE for non-missing values

- aggr() computes the number of missing values in each variable and in each combination of two variables
- It has a plot() method
- numbers = TRUE optionally shows the proportion of missing values in each combination
- vis_miss(), imported from the visdat package, shows missing data row by row
- cluster = TRUE arranges rows so that rows with similar missingness are grouped together
- sort_miss = TRUE arranges columns in order of missingness

- Remove rows with missing values using dplyr::filter() and base R complete.cases()
- clmm()
- brm()
- lm(), lme4::lmer(), and glmmTMB(): beware!

cor()
- use = 'everything' returns NA if a pair of variables has any missing values
- use = 'complete.obs' does listwise deletion
- use = 'pairwise.complete.obs' does pairwise deletion

replace()
It takes three arguments:
- the vector to modify (x)
- which elements to replace (those where is.na(x) is TRUE)
- the replacement values (computed from the non-NA elements of x)
- mutate(across(where(is.numeric), f)) applies function f to all numeric columns of a data frame, even those without missing values

- tidyr::fill()
- zoo::na.locf() (Last Observation Carried Forward), specialized for time series data
- Read the data from the data folder if you are running this code on the SCINet server
- If weight is missing, take the observation from the previous time point for that same animal
- Use group_by(animal) then arrange() to sort the data by animal and then in increasing order by day
- na.rm = FALSE argument: any remaining missing values at the beginning of any animal’s time series are kept as NA
- Apply na.locf() again to the already-imputed column, with fromLast = TRUE, to fill those leading values backward

Linear interpolation between two known points (x_0, y_0) and (x_1, y_1):
\[y = \frac{y_0(x_1-x) + y_1(x-x_0)}{x_1-x_0}\]
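The last-observation-carried-forward steps listed above (group by animal, sort by day, fill forward, then fill backward) could be sketched roughly as follows. The data frame `weights` and its values are hypothetical toy data, and the new column names `weight_locf` and `weight_filled` are invented for this example; only the column names animal, day, and weight come from the lesson.

```r
library(dplyr)
library(zoo)

# Hypothetical toy data: two animals weighed on days 1 to 4
weights <- data.frame(
  animal = rep(c("a", "b"), each = 4),
  day    = rep(1:4, times = 2),
  weight = c(NA, 10, NA, 12, 20, NA, NA, 23)
)

weights_locf <- weights %>%
  group_by(animal) %>%
  arrange(animal, day) %>%
  mutate(
    # Carry the last observation forward within each animal;
    # na.rm = FALSE keeps leading NAs so the column length is unchanged
    weight_locf = na.locf(weight, na.rm = FALSE),
    # Second pass, backward: fills any NAs left at the start of a series
    weight_filled = na.locf(weight_locf, na.rm = FALSE, fromLast = TRUE)
  ) %>%
  ungroup()

weights_locf
```

Keeping the intermediate `weight_locf` column makes it easy to see which values came from the forward pass and which only from the backward pass.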
approx()
- The approx() function in base R does linear interpolation
- approx() arguments:
  - x and y: the coordinates of the known (non-missing) points
  - xout: which x values need interpolated y values? We get a value for every day here, regardless of whether it was missing
  - rule = 2: replace NAs at the beginning or end of the time series with the closest non-missing value
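As a minimal sketch of these arguments in use, the example below interpolates a single short series with approx() and checks one value against the linear interpolation formula by hand. The vectors `day` and `weight` are hypothetical toy data, not the lesson's dataset.

```r
# Hypothetical toy series: weight on days 1 to 6, with gaps
day    <- 1:6
weight <- c(NA, 10, NA, NA, 16, NA)

# Pass only the observed points as x and y
observed <- !is.na(weight)

interpolated <- approx(
  x    = day[observed],
  y    = weight[observed],
  xout = day,   # ask for a value on every day, missing or not
  rule = 2      # outside the observed range, use the nearest observed value
)

interpolated$y

# Hand check for day 3, between (x0 = 2, y0 = 10) and (x1 = 5, y1 = 16)
manual_day3 <- (10 * (5 - 3) + 16 * (3 - 2)) / (5 - 2)
manual_day3
```

Days 1 and 6 fall outside the observed range, so with rule = 2 they take the values observed on days 2 and 5 respectively instead of staying NA.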