Dealing With Missing Data Exercise Answers

Exercise 1

Here are a few different things you can do to explore the missing data in the mammalsleep dataset.

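summary() shows how many NA values are in each column.
plot(aggr(mammalsleep)), from the VIM package, plots the proportion of missing values in each variable alongside the combinations of variables that are missing together.
vis_miss(mammalsleep), from the naniar package, draws a heatmap showing which cells of the data frame are missing.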

I added a few optional arguments; you can try others.

summary(mammalsleep)

##        bw                brw               sws               ps       
##  Min.   :   0.005   Min.   :   0.14   Min.   : 2.100   Min.   :0.000  
##  1st Qu.:   0.600   1st Qu.:   4.25   1st Qu.: 6.250   1st Qu.:0.900  
##  Median :   3.342   Median :  17.25   Median : 8.350   Median :1.800  
##  Mean   : 198.790   Mean   : 283.13   Mean   : 8.673   Mean   :1.972  
##  3rd Qu.:  48.203   3rd Qu.: 166.00   3rd Qu.:11.000   3rd Qu.:2.550  
##  Max.   :6654.000   Max.   :5712.00   Max.   :17.900   Max.   :6.600  
##                                       NA's   :14       NA's   :12     
##        ts             mls                gt               pi       
##  Min.   : 2.60   Min.   :  2.000   Min.   : 12.00   Min.   :1.000  
##  1st Qu.: 8.05   1st Qu.:  6.625   1st Qu.: 35.75   1st Qu.:2.000  
##  Median :10.45   Median : 15.100   Median : 79.00   Median :3.000  
##  Mean   :10.53   Mean   : 19.878   Mean   :142.35   Mean   :2.871  
##  3rd Qu.:13.20   3rd Qu.: 27.750   3rd Qu.:207.50   3rd Qu.:4.000  
##  Max.   :19.90   Max.   :100.000   Max.   :645.00   Max.   :5.000  
##  NA's   :4       NA's   :4         NA's   :4                       
##       sei             odi       
##  Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.000  
##  Median :2.000   Median :2.000  
##  Mean   :2.419   Mean   :2.613  
##  3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :5.000   Max.   :5.000  
##

plot(aggr(mammalsleep), numbers = TRUE)

vis_miss(mammalsleep, cluster = TRUE) + theme_sub_axis_x(text = element_text(vjust = 0))
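If you just want the counts without loading any extra packages, base R is enough. Here is a minimal sketch; the toy data frame and its values are made up for illustration and only stand in for mammalsleep.

```r
# Toy data frame standing in for mammalsleep (values are invented)
toy <- data.frame(
  bw  = c(1.0, 2.5, 10.0, 0.3),
  sws = c(6.3, NA, 2.1, NA),
  gt  = c(42, 60, NA, 19)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# Number of rows that have at least one missing value
rows_with_na <- sum(!complete.cases(toy))
rows_with_na
```

Replacing toy with mammalsleep should give the same NA counts that summary() reports.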

Exercise 2

By default, lm() fits the model using only the rows that have no missing values for any of the variables included in the model. You don’t need to create an NA-free version of the dataframe separately before model fitting. summary() is used to output the model coefficients, their standard errors, and associated test statistics and p-values (among other things). The summary output also reports how many rows were dropped: here, 8 observations were deleted due to missingness.

lm_completecase <- lm(ts ~ log(bw) + log(brw) + log(gt), data = mammalsleep)

summary(lm_completecase)

## 
## Call:
## lm(formula = ts ~ log(bw) + log(brw) + log(gt), data = mammalsleep)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.604 -2.606 -0.016  1.957  7.802 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 20.39233    2.86052   7.129 3.77e-09 ***
## log(bw)     -0.41125    0.54283  -0.758  0.45224    
## log(brw)     0.08021    0.75972   0.106  0.91634    
## log(gt)     -2.18689    0.76447  -2.861  0.00615 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.474 on 50 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.4673, Adjusted R-squared:  0.4353 
## F-statistic: 14.62 on 3 and 50 DF,  p-value: 5.811e-07
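To convince yourself that lm() really drops incomplete rows on its own, you can compare it with a fit on manually filtered data. This is a small sketch on simulated data; the variable names x, y, and z are hypothetical and not part of mammalsleep.

```r
set.seed(42)

# Simulated data: y depends on x; z is an extra column not used in the model
toy <- data.frame(x = rnorm(20), z = rnorm(20))
toy$y <- 1 + 2 * toy$x + rnorm(20)

# Introduce some missing values
toy$x[c(3, 7)] <- NA
toy$y[12] <- NA
toy$z[5] <- NA  # z is not in the model, so this NA does not cost us a row

# Default behavior: rows with NA in x or y are dropped automatically
fit_default <- lm(y ~ x, data = toy)

# Same model fit after removing those rows by hand
keep <- complete.cases(toy[, c("x", "y")])
fit_manual <- lm(y ~ x, data = toy[keep, ])

nobs(fit_default)                               # 17 of the 20 rows are used
length(fit_default$na.action)                   # 3 rows were dropped
all.equal(coef(fit_default), coef(fit_manual))  # TRUE
```

Note that only missing values in variables that appear in the model formula cause a row to be dropped.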

Exercise 3

There are many ways to impute missing values with the mean. Here I will use the impute_mean() function we defined earlier in the lesson. I’ll also show how it can be done with the mice package. Again we use summary() on the linear model object.

impute_mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

mammalsleep_mean_imputed <- mammalsleep %>%
  mutate(across(where(is.numeric), impute_mean))

lm_mean_imputed <- lm(ts ~ log(bw) + log(brw) + log(gt), data = mammalsleep_mean_imputed)

summary(lm_mean_imputed)

## 
## Call:
## lm(formula = ts ~ log(bw) + log(brw) + log(gt), data = mammalsleep_mean_imputed)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.349 -2.639 -0.573  2.105  8.920 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 20.21676    2.68181   7.538 3.65e-10 ***
## log(bw)      0.03474    0.52938   0.066  0.94791    
## log(brw)    -0.43990    0.73619  -0.598  0.55247    
## log(gt)     -1.86473    0.68762  -2.712  0.00879 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.602 on 58 degrees of freedom
## Multiple R-squared:  0.3779, Adjusted R-squared:  0.3457 
## F-statistic: 11.74 on 3 and 58 DF,  p-value: 4.089e-06

The mice package yields identical results (output not shown). Because mean imputation is deterministic, a single imputed dataset (m = 1) is all we need. The complete() function is used to extract the imputed dataset from the mice object.

mammalsleep_mean_imputed_mice <- mice(mammalsleep, m = 1, method = 'mean')

lm_mean_imputed_mice <- lm(ts ~ log(bw) + log(brw) + log(gt), data = complete(mammalsleep_mean_imputed_mice))

summary(lm_mean_imputed_mice)
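It can help to look at what impute_mean() does to a single vector. The short sketch below uses a made-up vector x and also illustrates one known side effect of mean imputation: the imputed values all sit at the mean, so the spread of the variable shrinks.

```r
impute_mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# Made-up example vector with two missing values
x <- c(2, NA, 4, 6, NA)

x_imputed <- impute_mean(x)
x_imputed            # 2 4 4 6 4: each NA is replaced by the observed mean, 4

# The standard deviation gets smaller after mean imputation
sd(x, na.rm = TRUE)  # 2
sd(x_imputed)        # about 1.41
```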

Exercise 4

We use the same steps as in the lesson: impute with mice() to create a set of imputed datasets, pass that to with() to fit the linear model to each dataset, then use pool() to combine the coefficient estimates across models and output summary information with summary().

Note I used seed = 1 to ensure reproducibility of the results, and print = FALSE (which R matches to the printFlag argument of mice()) so that the progress indicators are not displayed in the output.

mammalsleep_pmm_imputed <- mice(mammalsleep, m = 10, method = 'pmm', seed = 1, print = FALSE)

## Warning: Number of logged events: 20

lm_pmm_imputed <- with(mammalsleep_pmm_imputed, lm(ts ~ log(bw) + log(brw) + log(gt)))

pool(lm_pmm_imputed) |> summary()

##          term   estimate std.error  statistic       df      p.value
## 1 (Intercept) 20.7216980 2.6859832  7.7147533 48.25995 5.756430e-10
## 2     log(bw) -0.1311432 0.5433055 -0.2413802 54.16271 8.101719e-01
## 3    log(brw) -0.1280311 0.7696150 -0.1663573 54.83086 8.684877e-01
## 4     log(gt) -2.1948165 0.7232911 -3.0344859 50.24129 3.809462e-03
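pool() combines the per-dataset estimates using Rubin's rules. To show the idea, here is a rough sketch of the pooling arithmetic for a single coefficient. The estimates and standard errors are made-up numbers, not values from the models above, and the sketch leaves out the degrees-of-freedom calculation that pool() also performs.

```r
# Hypothetical estimates and standard errors for one coefficient,
# one from each of m = 4 imputed datasets (numbers are invented)
est <- c(-2.1, -2.3, -2.0, -2.4)
se  <- c(0.70, 0.75, 0.72, 0.71)
m   <- length(est)

# Pooled point estimate: the average of the per-dataset estimates
pooled_est <- mean(est)

# Within-imputation variance: the average squared standard error
within_var <- mean(se^2)

# Between-imputation variance: how much the estimates differ across datasets
between_var <- var(est)

# Total variance adds the between-imputation part, inflated by (1 + 1/m)
total_var <- within_var + (1 + 1 / m) * between_var
pooled_se <- sqrt(total_var)

pooled_est
pooled_se
```

The pooled standard error is larger than a typical single-dataset standard error because it also reflects the uncertainty about the imputed values themselves.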

Exercise 5

It looks like the interpretation is not drastically different depending on how the missing values are handled. The coefficient estimate for log(gt) is large and negative in all three models, with p < 0.01 in each. However, its magnitude is somewhat smaller with mean imputation (about -1.86) than with complete-case analysis or predictive mean matching (about -2.19 in both). The log(bw) and log(brw) coefficients also vary in size, and even in sign, depending on which method is used. This is true even though those variables don’t have any missing values! However, ts and gt each have four missing values, so the complete-case model drops eight rows. When we impute those variables, the body weight and brain weight data points from those rows are included in the model as well, which affects their coefficient estimates.