Here are a few different things you can do to explore the missing
data in the mammalsleep dataset.
- summary() will show how many NA values are in each column.
- plot(aggr(mammalsleep)) from the VIM package
- vis_miss(mammalsleep) from the naniar package
I added a few optional arguments; you can try others.
summary(mammalsleep)
##        bw                brw              sws              ps
## Min.   :   0.005   Min.   :   0.14   Min.   : 2.100   Min.   :0.000
## 1st Qu.:   0.600   1st Qu.:   4.25   1st Qu.: 6.250   1st Qu.:0.900
## Median :   3.342   Median :  17.25   Median : 8.350   Median :1.800
## Mean   : 198.790   Mean   : 283.13   Mean   : 8.673   Mean   :1.972
## 3rd Qu.:  48.203   3rd Qu.: 166.00   3rd Qu.:11.000   3rd Qu.:2.550
## Max.   :6654.000   Max.   :5712.00   Max.   :17.900   Max.   :6.600
##                                      NA's   :14       NA's   :12
##      ts               mls               gt              pi
## Min.   : 2.60   Min.   :  2.000   Min.   : 12.00   Min.   :1.000
## 1st Qu.: 8.05   1st Qu.:  6.625   1st Qu.: 35.75   1st Qu.:2.000
## Median :10.45   Median : 15.100   Median : 79.00   Median :3.000
## Mean   :10.53   Mean   : 19.878   Mean   :142.35   Mean   :2.871
## 3rd Qu.:13.20   3rd Qu.: 27.750   3rd Qu.:207.50   3rd Qu.:4.000
## Max.   :19.90   Max.   :100.000   Max.   :645.00   Max.   :5.000
## NA's   :4       NA's   :4         NA's   :4
##      sei             odi
## Min.   :1.000   Min.   :1.000
## 1st Qu.:1.000   1st Qu.:1.000
## Median :2.000   Median :2.000
## Mean   :2.419   Mean   :2.613
## 3rd Qu.:4.000   3rd Qu.:4.000
## Max.   :5.000   Max.   :5.000
##
plot(aggr(mammalsleep), numbers = TRUE)
vis_miss(mammalsleep, cluster = TRUE)
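If you only want the counts without the other summary statistics, base R can tabulate the missing values directly. The sketch below uses a small made-up data frame (toy, with hypothetical columns a, b, and c) rather than mammalsleep so that it runs without any packages; the same calls should work on any data frame.

```r
# Toy data frame standing in for mammalsleep (values are made up)
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, 2, 5),
  c = c(1, 2, 3, 4)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Proportion of missing values in each column
na_props <- colMeans(is.na(toy))
print(na_props)

# Number of rows with no missing values in any column
n_complete <- sum(complete.cases(toy))
print(n_complete)
```

Replacing toy with mammalsleep gives the per-column NA counts in one named vector.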
By default, lm() fits the model using only the rows that have
no missing values for any of the variables included in the model, so you
don’t need to create an NA-free version of the data frame
separately before model fitting. summary() is used to
output the model coefficients, their standard errors, and the associated
test statistics and p-values (among other things).
lm_completecase <- lm(ts ~ log(bw) + log(brw) + log(gt), data = mammalsleep)
summary(lm_completecase)
##
## Call:
## lm(formula = ts ~ log(bw) + log(brw) + log(gt), data = mammalsleep)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.604 -2.606 -0.016 1.957 7.802
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.39233 2.86052 7.129 3.77e-09 ***
## log(bw) -0.41125 0.54283 -0.758 0.45224
## log(brw) 0.08021 0.75972 0.106 0.91634
## log(gt) -2.18689 0.76447 -2.861 0.00615 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.474 on 50 degrees of freedom
## (8 observations deleted due to missingness)
## Multiple R-squared: 0.4673, Adjusted R-squared: 0.4353
## F-statistic: 14.62 on 3 and 50 DF, p-value: 5.811e-07
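To check how many rows lm() actually used, you can compare nobs() on the fitted model with the number of complete cases. This is a minimal sketch on a made-up data frame (toy, with hypothetical columns y and x), assuming the default na.action of na.omit has not been changed in your session.

```r
# Toy data with one missing response and one missing predictor (made-up values)
toy <- data.frame(
  y = c(2.1, 3.9, NA, 8.2, 9.8, 12.1, 14.0, 15.9),
  x = c(1, 2, 3, NA, 5, 6, 7, 8)
)

fit <- lm(y ~ x, data = toy)

# Rows used in the fit
n_used <- nobs(fit)

# Rows dropped because of missing values (stored on the model object)
n_dropped <- length(fit$na.action)

# Rows with no missing values in the model variables
n_complete <- sum(complete.cases(toy[, c("y", "x")]))

print(c(used = n_used, dropped = n_dropped, complete = n_complete))
```

In the output above, the same bookkeeping applies: 50 residual degrees of freedom plus 4 estimated coefficients gives 54 rows used, and 8 observations were deleted due to missingness.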
There are many ways to impute missing values with the mean. Here I
will use the impute_mean() function we defined earlier in
the lesson. I’ll also show how it can be done with the mice package.
Again we use summary() on the linear model object.
impute_mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))
mammalsleep_mean_imputed <- mammalsleep %>%
  mutate(across(where(is.numeric), impute_mean))
lm_mean_imputed <- lm(ts ~ log(bw) + log(brw) + log(gt), data = mammalsleep_mean_imputed)
summary(lm_mean_imputed)
##
## Call:
## lm(formula = ts ~ log(bw) + log(brw) + log(gt), data = mammalsleep_mean_imputed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.349 -2.639 -0.573 2.105 8.920
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.21676 2.68181 7.538 3.65e-10 ***
## log(bw) 0.03474 0.52938 0.066 0.94791
## log(brw) -0.43990 0.73619 -0.598 0.55247
## log(gt) -1.86473 0.68762 -2.712 0.00879 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.602 on 58 degrees of freedom
## Multiple R-squared: 0.3779, Adjusted R-squared: 0.3457
## F-statistic: 11.74 on 3 and 58 DF, p-value: 4.089e-06
The mice package yields identical results (output not shown). The
complete() function is used to extract the imputed dataset
from the mice object.
mammalsleep_mean_imputed_mice <- mice(mammalsleep, m = 1, method = 'mean')
lm_mean_imputed_mice <- lm(ts ~ log(bw) + log(brw) + log(gt), data = complete(mammalsleep_mean_imputed_mice))
summary(lm_mean_imputed_mice)
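To see what mean imputation does to a single variable, it can help to apply impute_mean() to a short vector. The values below are made up for illustration. The sketch shows that the mean is unchanged but the variance shrinks, because every imputed value sits exactly at the mean.

```r
impute_mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# Made-up vector with two missing values
x <- c(1, NA, 3, NA, 5)
x_imputed <- impute_mean(x)
print(x_imputed)

# The mean is preserved...
mean_before <- mean(x, na.rm = TRUE)
mean_after <- mean(x_imputed)

# ...but the variance is reduced
var_before <- var(x, na.rm = TRUE)
var_after <- var(x_imputed)

print(c(mean_before = mean_before, mean_after = mean_after,
        var_before = var_before, var_after = var_after))
```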
We use the same steps as in the lesson: impute with
mice() to create a set of imputed datasets, pass that to
with() to fit the linear model to each dataset, then use
pool() to combine the coefficient estimates across models
and output summary information with summary().
Note that I used seed = 1 to make the results reproducible,
and print = FALSE so that the progress indicators
are not displayed in the output.
mammalsleep_pmm_imputed <- mice(mammalsleep, m = 10, method = 'pmm', seed = 1, print = FALSE)
## Warning: Number of logged events: 20
lm_pmm_imputed <- with(mammalsleep_pmm_imputed, lm(ts ~ log(bw) + log(brw) + log(gt)))
pool(lm_pmm_imputed) |> summary()
## term estimate std.error statistic df p.value
## 1 (Intercept) 20.7216980 2.6859832 7.7147533 48.25995 5.756430e-10
## 2 log(bw) -0.1311432 0.5433055 -0.2413802 54.16271 8.101719e-01
## 3 log(brw) -0.1280311 0.7696150 -0.1663573 54.83086 8.684877e-01
## 4 log(gt) -2.1948165 0.7232911 -3.0344859 50.24129 3.809462e-03
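As I understand it, pool() combines the per-dataset results using Rubin's rules: the pooled estimate is the average of the estimates, and the pooled variance adds the average within-imputation variance to an inflated between-imputation variance. The sketch below works through that arithmetic by hand for one coefficient. The estimates and standard errors are made-up numbers, not values from the mammalsleep fits, and the degrees-of-freedom adjustment that pool() also reports is left out.

```r
# Hypothetical estimates and standard errors of one coefficient
# from m = 5 imputed datasets (made-up numbers)
est <- c(-2.10, -2.25, -2.18, -2.30, -2.12)
se  <- c(0.70, 0.72, 0.69, 0.73, 0.71)
m   <- length(est)

pooled_est  <- mean(est)       # pooled point estimate
within_var  <- mean(se^2)      # average within-imputation variance
between_var <- var(est)        # variance of the estimates across imputations

# Total variance: within + (1 + 1/m) * between
total_var <- within_var + (1 + 1 / m) * between_var
pooled_se <- sqrt(total_var)

print(c(estimate = pooled_est, std.error = pooled_se))
```

The pooled standard error is always at least as large as the square root of the average within-imputation variance, which is how the uncertainty due to imputation enters the final result.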
The interpretation does not differ drastically across imputation
methods. The coefficient estimate for log(gt) is large and negative
in all three models, with p < 0.01 in each. However, the magnitude
of the estimate is somewhat smaller with mean imputation (about -1.86,
versus about -2.19 for the other two models). The log(bw) and log(brw)
coefficients also vary in size, and even change sign, depending on
which method is used, although none of them is distinguishable from
zero. This is true even though those variables don’t have any missing
values! However, gt and the response ts each have 4 missing values,
so the complete-case model drops 8 rows. When we impute those values,
the body weight and brain weight data points from those rows are
included in the model as well, affecting the coefficient estimates.
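One way to make this comparison easier to read is to collect the log(gt) results from the three outputs above into a single data frame. The numbers below are copied from the model summaries shown earlier; the object name gt_coef and its column names are my own choices for illustration.

```r
# log(gt) results copied from the three model outputs above
gt_coef <- data.frame(
  method    = c("complete case", "mean imputation", "pmm (pooled)"),
  estimate  = c(-2.18689, -1.86473, -2.1948165),
  std_error = c(0.76447, 0.68762, 0.7232911),
  p_value   = c(0.00615, 0.00879, 0.003809462)
)

# Size of each estimate relative to the complete-case estimate
gt_coef$rel_to_complete <- gt_coef$estimate / gt_coef$estimate[1]

print(gt_coef)
```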