Lesson 5 learning objectives

At the end of this lesson, students will …

GLMM with binary response variable

Taking the log transformation of the response variable before fitting the model is a good solution for many types of data. However, it won’t always work. For many classes of data, we can’t just do a transformation to “shoehorn” the data into a form that will work with a general linear mixed model. That means we need to take the next step and enter the world of generalized linear mixed models, or GLMM as they are affectionately known. (It’s a terrible name because “general linear mixed model” and “generalized linear mixed model” begin with the same letters, which is a little confusing, but we are stuck with it.)

Let’s say we have response data that is binary. For example, what if we are testing the ability of a vaccine to prevent some disease in animals? We would have a control group that was not inoculated and a treatment group that was inoculated. The response variable is whether or not the disease is present. The animals are raised in different pens of 10 animals each in a complete block design, so we also need to account for the effect of pen on disease, using a random intercept.

Here is an example of what data from that study might look like.

library(tidyverse)
library(lme4)
library(easystats)
disease <- read_csv('https://usda-ree-ars.github.io/glmm-workshop-dec2022/datasets/disease.csv')
## Rows: 100 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): pen, inoc
## dbl (2): animal, disease
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Here I will use the group_by() and summarize() functions to make a table of how many animals in each pen and each treatment got the disease.

disease %>%
  group_by(inoc, pen) %>%
  summarize(no_disease = sum(disease == 0), disease = sum(disease == 1))
## `summarise()` has grouped output by 'inoc'. You can override using the
## `.groups` argument.
## # A tibble: 10 × 4
## # Groups:   inoc [2]
##    inoc  pen   no_disease disease
##    <chr> <chr>      <int>   <int>
##  1 no    A              0      10
##  2 no    B              3       7
##  3 no    C              4       6
##  4 no    D              3       7
##  5 no    E              1       9
##  6 yes   A              6       4
##  7 yes   B              9       1
##  8 yes   C              9       1
##  9 yes   D              9       1
## 10 yes   E             10       0

What if we tried to just use a linear mixed model as we have done before? We could code disease absent as 0 and disease present as 1.

lmm_disease <- lmer(disease ~ inoc + (1|pen), data = disease)
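
We can look at the estimated coefficients by extracting the fixed effects with lme4’s fixef() function:

fixef(lmm_disease)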

That seems reasonable at first glance. The intercept is 0.78, so the baseline rate is 78% disease, and the coefficient on inoculation is -0.64, so the inoculation treatment reduces the probability of disease by 64 percentage points. What is wrong with that?

Well, what if we wanted to make predictions? Let’s say we had a population where the baseline rate of disease was only 40%. Our model would predict that it would have -24% disease (0.40 - 0.64 = -0.24) if we gave that population the inoculation treatment. That is logically impossible. It makes more sense to think about the effect of inoculation in terms of relative changes to the odds of getting the disease. If the baseline rate is very low, there isn’t much that inoculation can do to reduce it further. But if the baseline rate is high, we can easily lower it with even a moderately effective treatment.

Furthermore, the residuals of this model will probably not conform to the assumptions of normal distribution and homogeneous variance.

check_model(lmm_disease)
## Not enough model terms in the conditional part of the model to check for
##   multicollinearity.

diagnostic plots for disease LMM

Also take a look at the posterior predictive check diagnostic plot at top left. The predicted values of the model go well outside the 0 to 1 range, leading to a very poor match between the data (green line) and predictions (blue lines).

So what do we do about this? Modeling a binary response can be done with a generalized linear model. We use a link function to convert the predicted response value to a scale that can be normally distributed. What we are predicting is the probability that an individual from each pen, with or without the inoculation treatment, will get the disease. Probability ranges from 0 to 1, so it cannot really have normally distributed error: normal distributions (bell curves) can take any value, positive or negative, and a bounded range like 0 to 1 doesn’t play well with that.

The formula for predicting the probability of disease for individual \(i\) in pen \(j\) is as follows:

\[\text{logit}~\hat{y}_{i,j} = \beta_0 + \beta_1 inoc_i + u_j\]

In this formula, \(\hat{y}_{i,j}\) is the predicted probability of disease for individual \(i\) in pen \(j\). \(\beta_0\) is the intercept, \(\beta_1\) is the treatment coefficient, \(inoc_i\) is a binary variable that is 0 for uninoculated control and 1 for inoculated treatment, and \(u_j\) is the random effect, a separate intercept for each pen.

But what is that “logit” thing? That is the link function. It is used to transform the probability, which ranges from 0 to 1 and therefore can’t really have normally distributed error, to a scale that can take any value, positive or negative. Logit is another word for the log-odds function, defined as

\[\text{logit}~p = \log \frac {p}{1-p}\]
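
In R, the logit function and its inverse are available as qlogis() and plogis(), respectively; we will use plogis() later to back-transform model coefficients. A few quick examples:

qlogis(0.5)  # logit(0.5) = log(0.5/0.5) = 0
plogis(0)    # inverse logit of 0, back to a probability of 0.5
qlogis(0.9)  # log(0.9/0.1), approximately 2.2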

Here is a graph of that function:

graph of logit function

You can see that it maps values ranging from 0 to 1 to values ranging from negative to positive infinity. It is more or less a straight line when the probability is between about 0.25 and 0.75, but gets very steep as you approach the boundaries. This makes sense: for an event with an extremely low (or high) chance of happening, a small absolute change in probability corresponds to a relatively large change in the odds.
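
If you would like to reproduce a plot like this yourself, here is a minimal sketch using ggplot2 and qlogis() (the variable names are mine):

logit_curve <- tibble(p = seq(0.001, 0.999, by = 0.001),
                      logit_p = qlogis(p))

ggplot(logit_curve, aes(x = p, y = logit_p)) +
  geom_line() +
  labs(x = 'probability', y = 'logit(probability)') +
  theme_bw()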

So to use statistical jargon, what we are going to do here is fit a binomial GLMM (binomial meaning our response variable is a binary outcome) with a logit link function to transform the predicted probabilities onto the log odds scale.

We use the glmer() function instead of lmer() in this case. (If this were SAS, we would be using PROC GLIMMIX.) We specify a formula in the same way as we have been doing with lmer(), and the data frame from which the variables come, but we now have a new argument, family. This refers to the “family” of response distributions that we are going to use. Here we specify binomial(link = 'logit'). Other link functions can be used with binomial GLMMs, but we will not get into that today.

glmm_disease <- glmer(disease ~ inoc + (1|pen), data = disease, family = binomial(link = 'logit'))

What do the diagnostic plots look like now?

check_model(glmm_disease)
## Not enough model terms in the conditional part of the model to check for
##   multicollinearity.

diagnostic plots of disease GLMM

This looks much better, especially the posterior predictive check. All model predictions are between 0 and 1, which makes logical sense because they are probabilities, and they also match the observations very well. The residuals look better as well.

What does this model tell us?

summary(glmm_disease)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: disease ~ inoc + (1 | pen)
##    Data: disease
## 
##      AIC      BIC   logLik deviance df.resid 
##     97.2    105.0    -45.6     91.2       97 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.9600 -0.3627 -0.2951  0.5659  3.3881 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  pen    (Intercept) 0.4676   0.6838  
## Number of obs: 100, groups:  pen, 5
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.3914     0.4826   2.883  0.00394 ** 
## inocyes      -3.3741     0.6244  -5.404 6.53e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##         (Intr)
## inocyes -0.534

Now we have an intercept of 1.39 and a coefficient on the inoculation treatment of -3.37. Great, but what exactly does that mean? When we are dealing with models that have link functions, the coefficients are hard to interpret at first glance because they are on the log odds scale. We have to back-transform them to the probability scale to make any sense of them.

The intercept of 1.39 is the fitted log odds of getting the disease in the uninoculated control, averaged across the random effect of pen. To transform this log odds back to the probability scale ranging from 0 to 1, we use the inverse logit function, plogis().

plogis(1.39)
## [1] 0.8005922

So the model’s population-level prediction is that a random individual who isn’t inoculated has an 80% probability of getting the disease.
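
As a quick sanity check, we can compare this to the raw proportion of diseased animals in the control group (39 of 50, from the table we made earlier):

disease %>%
  filter(inoc == 'no') %>%
  summarize(prop_disease = mean(disease))

The raw proportion, 0.78, is close to the back-transformed estimate of 0.80; the two are not identical because the fixed-effect estimate is made on the logit scale while accounting for pen-to-pen variation.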

What about the slope of -3.37? That is the change in the log odds between uninoculated and inoculated. To find out the model prediction for inoculated individuals, we need to subtract 3.37 from the intercept, then take the inverse logit of that value.

plogis(1.39 - 3.37)
## [1] 0.1213188

The probability of getting the disease, says the model, plummets from 80% down to 12% if the individual is inoculated.
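
Another common way to express this effect is as an odds ratio: exponentiating a coefficient gives the multiplicative change in the odds of disease.

exp(-3.37)  # roughly 0.034: inoculation divides the odds of disease by about 29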

In the next lesson, we will look at easier ways to extract model predictions from GLMMs – you will not have to do it “by hand” as we just did.

We can set up a generalized linear mixed model with all the same kinds of fixed and random effects as in plain vanilla linear mixed models. For example, we can have continuous and categorical predictors, interactions, crossed or nested random effects, random intercepts and/or random slopes. I will leave that for the exercises.

GLMM with count response variable

If your response variable is a discrete count, you have a similar issue as with binary response data: the response can only be a non-negative integer, 0 or greater. It is easier to imagine count data having a roughly bell-shaped distribution, but the values are still discrete, not continuous as a true normal distribution requires. Also, we can easily end up with a model that predicts negative counts, which is basically nonsense.

GLMMs come to the rescue again! In this case, we will use a Poisson distribution with a log link function. The Poisson distribution is defined over the non-negative integers, so it is ideal for count data (as long as there are not too many 0 values).
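
To build a little intuition, the dpois() function in base R gives the probability of each count under a Poisson distribution with a given mean (lambda); the distribution puts probability only on the non-negative integers:

dpois(0:5, lambda = 2)  # P(X = 0) through P(X = 5) for a mean count of 2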

Here we are going to use an example dataset from the agridat package, like we did for the exercises in the previous lesson. This dataset comes from an experiment conducted by George Stirret and colleagues in Canada in 1935 to determine whether fungal spores could be applied to corn plants to control the European corn borer. There were four levels of the fungal spore treatment (trt column in the dataset): an untreated control ("None"), early fungal treatment ("Early"), late fungal treatment ("Late"), and a treatment in which fungal spores were applied both early and late ("Both"). There were 15 experimental blocks (block column), each containing four plots. The number of borers per plot was counted on two different dates. For simplicity we will only consider the second count (count2 column).

Load the dataset:

data('stirret.borers', package = 'agridat')

If you are running R locally and for some reason you do not have access to the agridat package, I have provided the data in a CSV file:

stirret.borers <- read_csv('https://usda-ree-ars.github.io/glmm-workshop-dec2022/datasets/stirret.borers.csv')

Here is a histogram of the count2 data for each treatment separately, set to have 10 bins using the geom_histogram() function from ggplot2, and the facet_wrap() function to make a separate panel for each treatment. I’ve also included a vertical line at the mean for each treatment.

borer_means <- stirret.borers %>%
  group_by(trt) %>%
  summarize(count2_mean = mean(count2))

ggplot(stirret.borers, aes(x = count2)) +
  geom_histogram(bins = 10) +
  geom_vline(aes(xintercept = count2_mean), data = borer_means, color = 'red') +
  facet_wrap(~ trt) +
  theme_bw()

histograms of borer counts by treatment

You can see that we have quite a few counts of zero or close to it, and that the “Both” and “Late” treatments have the lowest mean counts, with “Early” higher and “None” the highest.

As we’ve been doing, let’s rearrange the factor levels of the trt column so that the control level 'None' is first in the ordering and will be treated as the reference level by the model. This will assist our interpretation.

stirret.borers <- stirret.borers %>%
  mutate(trt = factor(trt, levels = c('None', 'Early', 'Late', 'Both')))

For a proof of concept, let’s once again fit the model as a linear mixed model (with a random intercept for block), assuming normally distributed residuals.

lmm_borers <- lmer(count2 ~ trt + (1|block), data = stirret.borers)

Then fit the generalized linear mixed model. Here we use family = poisson(link = 'log'); the log is the canonical link function for the Poisson distribution.

glmm_borers <- glmer(count2 ~ trt + (1|block), data = stirret.borers, family = poisson(link = 'log'))

Compare the diagnostic plots from each model, first from the linear mixed model:

check_model(lmm_borers)
## Not enough model terms in the conditional part of the model to check for
##   multicollinearity.

diagnostic plots for borer LMM

Then from the Poisson GLMM:

check_model(glmm_borers)
## Not enough model terms in the conditional part of the model to check for
##   multicollinearity.

diagnostic plots for borer GLMM

The residual diagnostics actually look acceptable for both models, because the count data are really not that far off from normal. But compare the predictive check plots. You can see that the first model produces a lot of negative predictions. So the GLMM is ideal to use here, not just to “make the residuals behave” but also because it allows us to make meaningful predictions.
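
If you want to see this directly rather than reading it off the plot, one way (a quick sketch; exact values depend on the random seed) is to draw simulated datasets from each fitted model with simulate() and compare their ranges:

set.seed(1)
range(simulate(lmm_borers, nsim = 100))   # normal LMM: simulated counts can go below zero
range(simulate(glmm_borers, nsim = 100))  # Poisson GLMM: simulated counts are always >= 0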

Let’s look at the coefficients of the model.

summary(glmm_borers)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: poisson  ( log )
## Formula: count2 ~ trt + (1 | block)
##    Data: stirret.borers
## 
##      AIC      BIC   logLik deviance df.resid 
##    446.1    456.6   -218.0    436.1       55 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.6020 -0.7342 -0.1642  0.7398  2.6456 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  block  (Intercept) 0.1076   0.328   
## Number of obs: 60, groups:  block, 15
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.42372    0.09658  35.451  < 2e-16 ***
## trtEarly    -0.24192    0.06837  -3.538 0.000403 ***
## trtLate     -1.00030    0.08745 -11.439  < 2e-16 ***
## trtBoth     -0.98359    0.08692 -11.317  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##          (Intr) trtErl trtLat
## trtEarly -0.311              
## trtLate  -0.243  0.344       
## trtBoth  -0.245  0.346  0.271

Again, we have to back-transform the coefficient estimates from the link function scale back to the original scale of the data. In this case, because the logarithm log() was the link function, we use exponentiation exp() as the inverse. The intercept, 3.42, represents the estimated mean for the “None” treatment (control group) on a log scale.

exp(3.42)
## [1] 30.56942

This is a population-level prediction for the control group of about 30.6 borers per plot on the original data scale. We can get similar predictions for the other treatment groups by adding their coefficients to the intercept and exponentiating. For example, the “Early” treatment’s coefficient was -0.24:

exp(3.42 - 0.24)
## [1] 24.04675

This corresponds to a prediction of 24.0 borers per plot.
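
Equivalently, because coefficients on the log scale become multiplicative after back-transformation, you can exponentiate a coefficient by itself to get a ratio relative to the control:

exp(-0.24)  # about 0.79: the 'Early' treatment multiplies the expected count by ~0.79

In other words, the model estimates that the early fungal treatment reduces the expected borer count by roughly 21% relative to the control.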

Hey! What about …

Unfortunately, we are only barely scratching the surface of generalized linear mixed models today; these examples are only a small sample of what you can do. I don’t have time to cover all GLMM types in this workshop, and glmer() can only fit a small subset of them anyway. For example, your response variable might be a categorical outcome with more than two possibilities, as would be the case in a choice trial. In that case, you would want to fit a multinomial GLMM. Some categorical response variables are ordered, such as disease ratings on a 1-5 scale. There are GLMMs for those as well.

There are other topics you might be wondering about as well.

These topics are very near and dear to my heart and I am planning to develop workshops that cover them.

Finally, you may have noticed there was no mention of Bayesian statistics. In many real-world cases of data analysis, the “frequentist” methods we have covered today simply will not work. The best alternative is to use a Bayesian approach. I highly recommend the R packages brms and rstanarm for fitting Bayesian GLMMs. I’m going to do a workshop on brms soon.

We have one more lesson coming up today, where we are going to look at comparing means and testing hypotheses.

Exercises

These are more open-ended exercises to give you more practice with GLMMs.

The first example (Exercises 1 and 2) will use the Jansen apple dataset from the agridat package. Load it with data('jansen.apple', package = 'agridat') or if you are running R locally and don’t have access to the agridat package, use jansen.apple <- read_csv('https://usda-ree-ars.github.io/glmm-workshop-dec2022/datasets/jansen.apple.csv').

The dataset has five columns: inoculum, the amount of fungus apple tree shoots were infected with, gen, the variety of apple, block, the experimental block ID, y, the number of trees that developed apple canker, and n, the total number of shoots (there were 5 in all blocks).

Before proceeding, convert inoculum to a factor.

jansen.apple <- jansen.apple %>%
  mutate(inoculum = factor(inoculum))

Exercise 1

What is the average number of trees that got apple canker by inoculum level? By genotype? By combination of inoculum level and genotype?

  • Hint: Use group_by(), summarize(), and mean().

Exercise 2

  1. Fit a generalized linear mixed model with infection status as the response variable, genotype, inoculum level, and their interaction as fixed effects, and block as a random intercept.

NOTE: This dataset has a different format than what we saw previously for binary data. Instead of one row per individual, each row gives the number of infected trees in a block. glmer() can take data in this format if you use cbind(y, n - y) as the response variable; see the sketch after this list.

  • Hint: Don’t forget family = binomial(link = 'logit')!
  2. What do the diagnostic plots tell us?
  3. What does the model summary tell us?
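
Here is a sketch of that cbind() response syntax on a small hypothetical dataset (toy, not the exercise data), just to show the format:

# Hypothetical toy data: y infected out of n trees per block-by-treatment row
toy <- tibble(block = factor(rep(1:4, each = 2)),
              trt = rep(c('A', 'B'), 4),
              y = c(3, 1, 4, 0, 2, 1, 5, 2),
              n = 5)

glmer(cbind(y, n - y) ~ trt + (1 | block), data = toy,
      family = binomial(link = 'logit'))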

Exercise 3

For the next two exercises we will use another agridat dataset, the Beall webworms dataset from 1940. This dataset has counts of webworms in a beet field arranged into experimental blocks (block) with two different insecticide treatments, spray insecticide (spray) and lead arsenate insecticide (lead). The y column is the number of webworms counted in the experimental plot.

Load the data with data('beall.webworms', package = 'agridat') or if you are running R locally and don’t have access to the agridat package, use beall.webworms <- read_csv('https://usda-ree-ars.github.io/glmm-workshop-dec2022/datasets/beall.webworms.csv').

  1. Fit a generalized linear mixed model with webworm count as the response variable, a random intercept for block, and fixed effects for spray insecticide, lead insecticide, and their interaction. Use Poisson distribution for the response family.
  2. What do the diagnostic plots tell us?
  3. What does the model summary tell us?

Exercise 4

This is a bonus exercise because we did not yet address count data with excessive zeros. One way to deal with count data that has too many zeros for a good fit with the Poisson distribution is to fit a negative binomial GLMM instead. This can be done with the glmer.nb() function. Try it out and see if you can get a better fit.

  • Hint: Get help by calling ?glmer.nb.

Click here for answers