At the end of this lesson, students will …
Taking the log transformation of the response variable before fitting the model is a good solution for many types of data. However, it won’t always work. For many classes of data, we can’t simply apply a transformation to “shoehorn” the data into a form that will work with a general linear mixed model. That means we need to take the next step and enter the world of generalized linear mixed models, or GLMMs as they are affectionately known. (It’s an unfortunate name because “general linear mixed model” and “generalized linear mixed model” begin with the same letters, so the two are easily confused, but we are stuck with it.)
Let’s say we have response data that is binary. For example, what if we are testing the ability of a vaccine to prevent some disease in animals? We would have a control group that was not inoculated and a treatment group that was inoculated. The response variable is whether or not the disease is present. The animals are raised in different pens of 10 animals each in a complete block design so we need to account for the effect of pen on disease as well, using a random intercept.
Here is an example of what data from that study might look like.
library(tidyverse)
library(lme4)
library(easystats)
disease <- read_csv('https://usda-ree-ars.github.io/SEAStats/mixed_models_in_R/datasets/disease.csv')
## Rows: 100 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): pen, inoc
## dbl (2): animal, disease
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Here I will use the group_by() and summarize() functions to make a table of how many animals in each pen and each treatment got the disease.
disease %>%
group_by(inoc, pen) %>%
summarize(no_disease = sum(disease == 0), disease = sum(disease == 1))
## `summarise()` has grouped output by 'inoc'. You can override using the
## `.groups` argument.
## # A tibble: 10 × 4
## # Groups: inoc [2]
## inoc pen no_disease disease
## <chr> <chr> <int> <int>
## 1 no A 0 10
## 2 no B 3 7
## 3 no C 4 6
## 4 no D 3 7
## 5 no E 1 9
## 6 yes A 6 4
## 7 yes B 9 1
## 8 yes C 9 1
## 9 yes D 9 1
## 10 yes E 10 0
What if we tried to just use a linear mixed model as we have done before? We could code disease absent as 0 and disease present as 1.
lmm_disease <- lmer(disease ~ inoc + (1|pen), data = disease)
That seems reasonable at first glance. The intercept is 0.78, so the baseline rate is 78% disease, and the coefficient on inoculation is -0.64, so the inoculation treatment makes the disease rate go down by 64 percentage points. What is wrong with that?
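If you want to check those estimates yourself, one quick way (a sketch, not part of the original output) is to extract the fixed-effect coefficients with lme4’s fixef() function:
# Fixed-effect estimates (intercept and inoculation coefficient) from the linear mixed model
fixef(lmm_disease)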
Well, what if we wanted to make predictions? Let’s say we had a population where the baseline rate of disease was only 40%. Our model would predict a negative disease rate if we gave that population the inoculation treatment (0.40 - 0.64 = -0.24, or -24% disease). That is logically impossible. It makes more sense to think about the effect of inoculation in terms of relative changes to the odds of getting the disease. If the baseline rate is very low, there isn’t much that inoculation can do to reduce it further. But if the baseline rate is high, we can easily lower it with even a moderately effective treatment.
Furthermore, the residuals of this model will probably not conform to the assumptions of normal distribution and homogeneous variance.
check_model(lmm_disease)
## Not enough model terms in the conditional part of the model to check for
## multicollinearity.
Also take a look at the posterior predictive check diagnostic plot at top left. The predicted values of the model go well outside the 0 to 1 range, leading to a very poor match between the data (green line) and predictions (blue lines).
So what do we do about this? Modeling a binary response can be done with a generalized linear model. We use a link function to convert the predicted response value into a scale that can be normally distributed. What we are predicting is the probability that an individual from each pen, with or without the inoculation treatment, will get the disease. Probability ranges from 0 to 1, so it cannot really have normally distributed error: normal distributions (bell curves) can take any value, positive or negative, and a bounded range like 0 to 1 does not play well with them.
The formula for predicting the probability of disease for individual \(i\) in pen \(j\) is as follows:
\[\text{logit}~\hat{y}_{i,j} = \beta_0 + \beta_1 inoc_i + u_j\]
In this formula, \(\hat{y}_{i,j}\) is the predicted probability of disease for individual \(i\) in pen \(j\). \(\beta_0\) is the intercept, \(\beta_1\) is the treatment coefficient, \(inoc_i\) is a binary variable that is 0 for uninoculated control and 1 for inoculated treatment, and \(u_j\) is the random effect, a separate intercept for each pen.
But what is that “logit” thing? That is the link function. It is used to transform the probability, which ranges from 0 to 1 and therefore can’t really have normally distributed error, to a scale that can take any value, positive or negative. Logit is another word for the log-odds function, defined as
\[\text{logit}~p = \log \frac {p}{1-p}\]
Here is a graph of that function:
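If you want to reproduce a graph like this yourself, here is a minimal sketch using qlogis(), R’s built-in logit function (the figure in the lesson may have been drawn differently):
# Plot the logit (log odds) function for probabilities just above 0 to just below 1
ggplot(data.frame(p = seq(0.001, 0.999, 0.001)), aes(x = p, y = qlogis(p))) +
  geom_line() +
  labs(x = 'probability', y = 'log odds (logit)') +
  theme_bw()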
You can see that it maps values ranging from 0 to 1 to values ranging from negative to positive infinity. It is more or less a straight line when the probability is between about 0.25 and 0.75, but it gets very steep as you approach the boundaries. This makes sense because if an event has an extremely low (or high) chance of happening, even a small absolute change in its probability is a relatively big change in its odds.
So to use statistical jargon, what we are going to do here is fit a binomial GLMM (binomial meaning our response variable is a binary outcome) with a logit link function to transform the predicted probabilities onto the log odds scale.
We use the glmer() function instead of lmer() in this case. (If this were SAS, we would be using proc glimmix.) We specify a formula in the same way as we have been doing with lmer(), and the data frame from which the variables come, but we now have a new argument, family. This refers to the “family” of response distributions that we are going to be using. Here we specify binomial(link = 'logit'). Other link functions are possible to use with binomial GLMMs, but we will not get into that today.
glmm_disease <- glmer(disease ~ inoc + (1|pen), data = disease, family = binomial(link = 'logit'))
What do the diagnostic plots look like now?
check_model(glmm_disease)
## Not enough model terms in the conditional part of the model to check for
## multicollinearity.
This looks much better, especially the posterior predictive check. All model predictions are between 0 and 1, which makes logical sense because they are probabilities, and they are a very good match to the observations. The residuals look better as well.
What does this model tell us?
summary(glmm_disease)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: disease ~ inoc + (1 | pen)
## Data: disease
##
## AIC BIC logLik deviance df.resid
## 97.2 105.0 -45.6 91.2 97
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.9600 -0.3627 -0.2951 0.5659 3.3881
##
## Random effects:
## Groups Name Variance Std.Dev.
## pen (Intercept) 0.4676 0.6838
## Number of obs: 100, groups: pen, 5
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.3914 0.4826 2.883 0.00394 **
## inocyes -3.3741 0.6244 -5.404 6.53e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## inocyes -0.534
Now we have an intercept of 1.39 and a coefficient on the inoculation treatment of -3.37. Great, but what exactly does that mean? When we are dealing with models that have link functions, the coefficients are hard to interpret at first glance because they are on the log odds scale. We have to back-transform them to the probability scale to make sense of them.
The intercept of 1.39 is the fitted log odds of getting the disease in the uninoculated control, averaged across the random effect of pen. To transform this log odds back to the probability scale ranging from 0 to 1, we use the inverse logit function, plogis().
plogis(1.39)
## [1] 0.8005922
So the population-level prediction of the model is that a random individual who isn’t inoculated has an 80% probability of getting the disease.
What about the slope of -3.37? That is the change in the log odds between uninoculated and inoculated. To find out the model prediction for inoculated individuals, we need to subtract 3.37 from the intercept, then take the inverse logit of that value.
plogis(1.39 - 3.37)
## [1] 0.1213188
The probability of getting the disease, says the model, plummets from 80% down to 12% if the individual is inoculated.
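Another common way to report a binomial GLMM coefficient, not shown in the lesson output but only one line of arithmetic, is as an odds ratio, obtained by exponentiating the coefficient:
# exp() of the coefficient gives the odds ratio: inoculated animals have only about
# 3% of the odds of getting the disease compared to uninoculated animals
exp(-3.37)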
In the next lesson, we will look at easier ways to extract model predictions from GLMMs – you will not have to do it “by hand” as we just did.
We can set up a generalized linear mixed model with all the same kinds of fixed and random effects as in plain vanilla linear mixed models. For example, we can have continuous and categorical predictors, interactions, crossed or nested random effects, random intercepts and/or random slopes. I will leave that for the exercises.
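Just to illustrate the syntax before you try it yourself, here is a sketch of a random-slope version of the disease model that lets the inoculation effect vary by pen (this exact model is not fit in the lesson, and with only five pens it may produce convergence or singular-fit warnings):
# Random intercept and random slope for inoculation, varying by pen (syntax illustration only)
glmm_disease_slopes <- glmer(disease ~ inoc + (inoc | pen), data = disease, family = binomial(link = 'logit'))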
If your response variable is a discrete count, you have a similar issue as with binary response data: the response can only be a non-negative integer, 0 or greater. It is easier to imagine count data following a roughly bell-shaped distribution, but the values are still discrete, not continuous as they would have to be for a true normal distribution. Also, we can easily end up with a model that predicts negative counts, which is basically nonsense.
GLMMs come to the rescue again! In this case, we will use a Poisson distribution with a log link function. The Poisson distribution is defined over the non-negative integers, so it is ideal for count data (as long as there are not too many 0 values).
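If you want a feel for what Poisson-distributed counts look like, here is a quick simulation sketch (the mean of 5 and the seed are arbitrary choices for illustration):
# Simulate 1000 Poisson counts with mean 5 and plot a histogram
set.seed(1)
ggplot(data.frame(count = rpois(1000, lambda = 5)), aes(x = count)) +
  geom_histogram(bins = 15) +
  theme_bw()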
Here we are going to use an example dataset from the agridat package, like we did for the exercises in the previous lesson. This dataset comes from an experiment conducted by George Stirret and colleagues in Canada in 1935 to determine whether fungal spores could be applied to corn plants to control the European corn borer. There were four levels of the fungal spore treatment (trt column in the dataset): an untreated control ("None"), early fungal treatment ("Early"), late fungal treatment ("Late"), and a treatment in which fungal spores were applied both early and late ("Both"). There were 15 experimental blocks (block column), each containing four plots. The number of borers per plot was counted on two different dates. For simplicity we will only consider the second count (count2 column).
Load the dataset:
data('stirret.borers', package = 'agridat')
If you are running R locally and for some reason you do not have access to the agridat package, I have provided the data in a CSV file:
stirret.borers <- read_csv('https://usda-ree-ars.github.io/SEAStats/mixed_models_in_R/datasets/stirret.borers.csv')
Here is a histogram of the count2 data for each treatment separately, set to have 10 bins using the geom_histogram() function from ggplot2, and the facet_wrap() function to make a separate panel for each treatment. I’ve also included a vertical line at the mean for each treatment.
borer_means <- stirret.borers %>%
group_by(trt) %>%
summarize(count2_mean = mean(count2))
ggplot(stirret.borers, aes(x = count2)) +
geom_histogram(bins = 10) +
geom_vline(aes(xintercept = count2_mean), data = borer_means, color = 'red') +
facet_wrap(~ trt) +
theme_bw()
You can see that we have quite a few counts of zero or close to it, and that the “Both” and “Late” treatments have the lowest mean counts, with “Early” higher and “None” the highest.
As we’ve been doing, let’s rearrange the factor levels of the trt column so that the control level "None" is first in the ordering and will be treated as the reference level by the model. This will assist our interpretation.
stirret.borers <- stirret.borers %>%
mutate(trt = factor(trt, levels = c('None', 'Early', 'Late', 'Both')))
For a proof of concept, let’s once again fit the model as a linear mixed model (with random intercepts), assuming normally distributed residuals.
lmm_borers <- lmer(count2 ~ trt + (1|block), data = stirret.borers)
Then fit the generalized linear mixed model. Here we use family = poisson(link = 'log'), because the link function that works best with the Poisson distribution is just log().
glmm_borers <- glmer(count2 ~ trt + (1|block), data = stirret.borers, family = poisson(link = 'log'))
Compare the diagnostic plots from each model, first from the linear mixed model:
check_model(lmm_borers)
## Not enough model terms in the conditional part of the model to check for
## multicollinearity.
Then from the Poisson GLMM:
check_model(glmm_borers)
## Not enough model terms in the conditional part of the model to check for
## multicollinearity.
The residual diagnostics actually look acceptable for both models, because the count data are really not that far off from normal. But compare the predictive check plots. You can see that the first model produces a lot of negative predictions. So the GLMM is ideal to use here, not just to “make the residuals behave” but also because it allows us to make meaningful predictions.
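A quick numeric illustration of that point (a sketch, not part of the original lesson): simulating new observations from each fitted model, which is essentially what the posterior predictive check does, will typically show that the linear mixed model can generate impossible negative counts while the Poisson GLMM cannot.
# Simulate 100 new datasets from each model and look at the range of simulated counts
set.seed(2)
range(unlist(simulate(lmm_borers, nsim = 100)))
range(unlist(simulate(glmm_borers, nsim = 100)))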
Let’s look at the coefficients of the model.
summary(glmm_borers)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: poisson ( log )
## Formula: count2 ~ trt + (1 | block)
## Data: stirret.borers
##
## AIC BIC logLik deviance df.resid
## 446.1 456.6 -218.0 436.1 55
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.6020 -0.7342 -0.1642 0.7398 2.6456
##
## Random effects:
## Groups Name Variance Std.Dev.
## block (Intercept) 0.1076 0.328
## Number of obs: 60, groups: block, 15
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.42372 0.09658 35.451 < 2e-16 ***
## trtEarly -0.24192 0.06837 -3.538 0.000403 ***
## trtLate -1.00030 0.08745 -11.439 < 2e-16 ***
## trtBoth -0.98359 0.08692 -11.317 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) trtErl trtLat
## trtEarly -0.311
## trtLate -0.243 0.344
## trtBoth -0.245 0.346 0.271
Again, we have to back-transform the coefficient estimates from the link function scale back to the original scale of the data. In this case, because the logarithm log() was the link function, we use exponentiation exp() as the inverse. The intercept, 3.42, represents the estimated mean for the "None" treatment (control group) on a log scale.
exp(3.42)
## [1] 30.56942
This is a population-level prediction for the control group of about 30.6 borers per plot on the original data scale. We can get similar predictions for the other treatment groups by adding their coefficients to the intercept and exponentiating. For example, the “Early” treatment’s coefficient was -0.24:
exp(3.42 - 0.24)
## [1] 24.04675
This corresponds to a prediction of 24.0 borers per plot.
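A related shortcut, not shown in the lesson output but only one extra line: exponentiating a treatment coefficient by itself gives a rate ratio relative to the control.
# exp() of the 'Early' coefficient: the Early treatment is predicted to have
# roughly 79% as many borers per plot as the untreated control
exp(-0.24)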
Unfortunately, we are only scratching the surface of generalized linear mixed models today, and I don’t have time to cover all GLMM types in this workshop. For one thing, glmer() can only fit a small subset of GLMMs. For example, your response variable might be a categorical outcome with more than two possibilities, as would be the case in a choice trial; then you would want to fit a multinomial GLMM. Some categorical response variables are ordered, such as disease ratings on a 1-5 scale. There are also GLMMs for that.
Other topics that you might be wondering about include
These topics are very near and dear to my heart and I am planning to develop workshops that cover them.
Finally, you may have noticed there was no mention of Bayesian statistics. In many real-world cases of data analysis, the “frequentist” methods we have covered today simply will not work. The best alternative is to use a Bayesian approach. I highly recommend the Bayesian R modeling packages brms and rstanarm to fit Bayesian GLMMs.
We have one more lesson coming up today, where we are going to look at comparing means and testing hypotheses.
These are more open-ended exercises to give you more practice with GLMMs.
The first example (Exercises 1 and 2) will use the Jansen apple dataset from the agridat package. Load it with data('jansen.apple', package = 'agridat'), or if you are running R locally and don’t have access to the agridat package, use jansen.apple <- read_csv('https://usda-ree-ars.github.io/SEAStats/mixed_models_in_R/datasets/jansen.apple.csv').
The dataset has five columns: inoculum, the amount of fungus apple tree shoots were infected with; gen, the variety of apple; block, the experimental block ID; y, the number of trees that developed apple canker; and n, the total number of shoots (there were 5 in all blocks).
Before proceeding, convert inoculum to a factor.
jansen.apple <- jansen.apple %>%
mutate(inoculum = factor(inoculum))
What is the average number of trees that got apple canker by inoculum level? By genotype? By combination of inoculum level and genotype?
Hint: use group_by(), summarize(), and mean().
NOTE: This dataset has a different format than what we saw previously for binary data. Instead of one row per individual, each row gives the number of infected trees in each block. glmer() can take data in this format if you use cbind(y, n - y) as the response variable. Don’t forget family = binomial(link = 'logit')!
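For instance, the model call might look something like this (a syntax sketch only; which fixed effects to include is up to you for the exercise):
# Two-column response: successes (cankered trees) and failures (healthy trees) in each row
glmer(cbind(y, n - y) ~ inoculum + (1 | block), data = jansen.apple, family = binomial(link = 'logit'))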
For the next two exercises we will use another agridat dataset, the Beall webworms dataset from 1940. This dataset has counts of webworms in a beet field arranged into experimental blocks (block) with two different insecticide treatments, spray insecticide (spray) and lead arsenate insecticide (lead). The y column is the number of webworms counted in the experimental plot.
Load the data with data('beall.webworms', package = 'agridat'), or if you are running R locally and don’t have access to the agridat package, use beall.webworms <- read_csv('https://usda-ree-ars.github.io/SEAStats/mixed_models_in_R/datasets/beall.webworms.csv').
This is a bonus exercise because we have not yet addressed count data with excessive zeros. One way to deal with count data that has too many zeros to get a good fit with the Poisson distribution is to fit a negative binomial GLMM instead. This can be done with the glmer.nb() function. Try it out and see if you can get a better fit.
Hint: see ?glmer.nb.