Introduction and setup

This is the fourth in a series of lessons that are a practical introduction to fitting Bayesian multilevel models in R and Stan using the brms package (Bayesian Regression Models using Stan).

Download the worksheet for this lesson here.

What did we learn in Lessons 1 through 3?

So far, you’ve learned some of the big-picture concepts behind Bayesian inference. You’ve been exposed to the idea that prior distributions of parameters reflect our belief about some phenomenon. Then we collect data and use it to calculate the likelihood, or how likely different parameter values are to have produced the data we observed. We combine the prior and likelihood to get the posterior distribution of the parameters: our new belief about the phenomenon, updated with the information from the data. You’ve learned how modifying the prior and adding more data to the likelihood can change the posterior. You’ve learned how we use Monte Carlo sampling to sample from posterior distributions until we have enough individual samples to describe the shape of the distribution. You’ve learned how to use medians and credible intervals to describe those distributions.

As for models, you’ve learned about generalized linear mixed models, building up from linear mixed models that assume normal error, then adding other distributions like binomial and Gamma. You’ve also learned about random effects: random intercepts and slopes to account for different study designs, and covariance structures to account for non-independence among data points. You’ve even learned some basic generalized additive mixed models.

What will we learn in Lesson 4?

With all the above under your belt, you can start exploring models for specific situations that you often encounter in ag science that require us to make new and different assumptions about the data. In this lesson, we’ll keep adding skills to your Bayesian toolkit.

  • Beta regression for proportion data
  • Zero-inflated Poisson regression for count data that include zeros
  • Hurdle models for continuous and proportion data that include zeros

Along the way, I will introduce you to some bigger-picture concepts. We’ll learn what mixture distributions are and how they can help us make our statistical models flexible enough to handle all kinds of data. We’ll also learn how to allow each separate parameter of a statistical distribution to vary as a function of fixed and random effects. This will open up a whole new exciting world of stats to you!

By the way, all of these techniques don’t have to be Bayesian. But the Bayesian approach works really well to estimate these models in a lot of cases — often, it’s the only way that works. So the concepts and models we are going to go over today, may be useful to you even if the Bayesian approach is not going to become your main go-to in your statistical toolkit.

Load packages

library(brms)
library(agridat)
library(tidyverse)
library(emmeans)
library(tidybayes)
library(ggdist)
library(bayestestR)
library(gt)
library(fitdistrplus)

theme_set(theme_bw())
colorblind_friendly <- scale_color_manual(values = unname(palette.colors(5)[-1]), aesthetics = c('colour', 'fill'))

options(mc.cores = 4, brms.backend = 'cmdstanr', brms.file_refit = 'on_change')

Beta regression

In Lesson 2, if you remember back that far, we demonstrated that if you have binary data, it is better to fit a logistic GLMM to the binary outcome data, instead of pre-calculating the proportion in each experimental unit and doing a regression with the proportions. But what if you do not have discrete counts of the outcomes? What if your response variables are, indeed, proportions?

We’ve shown how a model that assumes normally distributed error doesn’t do well with proportion response data because proportions are bounded between 0 and 1. A normal distribution has no lower and upper bound on its x-axis range. So, the residuals of proportion data can’t truly have normally distributed error. That said, if most of the proportions are between around 0.25 and 0.75, the normal distribution does a decent job of approximating the error. But often, proportion datasets in ag science have many values near 0 and 1. The normal distribution doesn’t do well in those cases.

So what distribution should we use instead? So far we’ve seen the binomial distribution, which can take two discrete values, 0 and 1. That won’t work because we have a continuous response variable, not discrete. We’ve seen the gamma distribution, which can take any positive value. That’s getting better but it also won’t work because proportions can’t be greater than 1 and the gamma has no upper bound. In fact, there are really very few statistical distributions that are bounded on the [0, 1] interval. Luckily, there is one distribution that perfectly meets our needs. It’s very flexible in terms of what shape it can have, and it is continuous on the interval from 0 to 1. I’m talking about the beta distribution!

Meet the beta distribution

I won’t get into the mathematical formula for the probability density function of the beta distribution, but I will introduce you to its parameters, and how modifying the parameters changes the shape of the beta distribution. Many sources, such as Wikipedia, give you the beta distribution in terms of two parameters named \(\alpha\) and \(\beta\). That’s also how the distribution is parameterized in the R function dbeta(), where the parameters are called shape1 and shape2. But in contrast, the formula for the beta distribution in brms is rewritten in terms of two parameters \(\mu\) (mu) and \(\phi\) (phi). These are the mean parameter and the precision parameter. Actually, most if not all distributions in brms are parameterized so that the first parameter is the mean of the distribution. This is convenient because it is easier to think about setting up a model so that you have fixed and/or random effects on the mean, and fixed and/or random effects on other parameters.

What does the beta distribution look like, with different parameters? By setting different parameter values, we can make it look all kinds of ways. We can make a more or less flat beta distribution with roughly equal probability of any proportion between 0 and 1. If the mean is far from 0 or 1, we can make a hump-shaped beta distribution with high probability for one proportion and low everywhere else (it looks roughly normal). If the mean is close to 0 or 1, the distribution slopes down in one direction. We can even make a horseshoe-shaped beta distribution with high probability for extreme proportions near 0 and 1 and low probability in the middle.

Fitting a beta regression model

Let’s look at a dataset we might want to model with a beta distribution. As usual it comes from the agridat package. We are dealing with Fusarium, a fungus that infects wheat. Here, researchers measured the percentage of winter wheat leaf area that was infected by Fusarium over three years. For each year, we have a percentage infected data point for each combination of 17 wheat genotypes and four Fusarium strains.

Fusarium head blight on winter wheat: image (c) Janet Lewis/CIMMYT
Fusarium head blight on winter wheat: image (c) Janet Lewis/CIMMYT


Because the beta distribution requires data between 0 and 1, and the data are presented as percentages, divide them by 100. Convert genotype, strain, and year to factor columns.

data(snijders.fusarium)

snijders.fusarium <- snijders.fusarium %>% 
  mutate(
    across(c(gen, strain, year), factor),
    p = y/100
  )

Take a look at the distribution of the data for each strain. These data show a typical shape for proportion data. We can easily come up with parameters for beta distributions that are shaped like this.

ggplot(snijders.fusarium, aes(x = p, fill = strain)) + 
  geom_histogram(bins = 15) +
  facet_wrap(~ strain) +
  colorblind_friendly +
  theme(legend.position = 'none')

Another way to look at the data is with a strip plot. Raw data are plotted as semi-transparent points, slightly jittered in the x direction so we can see multiple points with similar values. The median of each strain within each year is plotted as a horizontal line. I prefer plots like this, that show all the data points, to things like boxplots and violin plots if we have a reasonably small number of data points. It’s nice to see all the data.

ggplot(snijders.fusarium, aes(x = year, y = p, color = strain)) +
  geom_point(position = position_jitterdodge(jitter.width = 0.1, jitter.height = 0, dodge.width = 0.5), alpha = 0.5) +
  stat_summary(fun = 'median', geom = 'point', size = 15, shape = 95, position = position_dodge(width = 0.5)) +
  colorblind_friendly

We see that there is a suggestion of a strain by year interaction. F39 has higher infection rates in 1986 and ’87, but none of the strains caused much infection in ’88.

Let’s fit a model to these data, assuming beta distributed error. This is also a good time to review the GLMM model fitting syntax used by the brm() function. As usual we have a model formula with response on the left. On the right there are fixed effects, which are the Fusarium strain, the year treated as a discrete variable, and their interaction. Then the (1 + year | gen) term indicates that a random intercept will be fit for each genotype, and the year effects will be allowed to vary by genotype as well. Note we could also have treated genotype as a fixed effect if we wanted to make inferences about the differences between the means of specific genotypes. But in this case, we are considering the levels of genotype in our dataset to be a ‘randomly’ chosen subset from a broader population of genotypes that we could have used. Whether a particular effect in a model is fixed or random can change depending on the scientific question you’re interested in.

The response family is Beta. Notice that there are two link functions, a logit link for the mean response and a log link for phi, the precision parameter. A reasonably uninformative prior is set on the fixed effects, and standard arguments for number of chains and iterations are used.

fusarium_beta_fit <- brm(
  p ~ strain * year + (1 + year | gen),
  family = Beta(link = 'logit', link_phi = 'log'),
  data = snijders.fusarium,
  prior = c(
    prior(normal(0, 5), class = b)
  ),
  chains = 4, iter = 2000, warmup = 1000, seed = 555,
  file = 'fits/fusarium_beta_fit'
)

Let’s do a posterior predictive check. It looks reasonably good.

pp_check(fusarium_beta_fit)

Beta regression model with varying precision parameter

The model we just fit allows the mean of the beta distribution to vary as a function of the fixed and random effects. But what about the precision parameter phi? The model above fits one single value for phi. However, there is no reason why we have to stick with that. We could also look at whether allowing phi to vary as a function of fixed and/or random effects gives us a better fit.

Within the brm() function, if we have multiple parameters of the response distribution that we want to model as a function of fixed and random effects, we need to have a separate model formula for each of those parameters. We code the multiple formula lines by wrapping them in the function bf() and separating them with commas.

Let’s fit a model to the same dataset, this time allowing phi to vary as a function of the fixed effects: strain, year, and their interaction. Below you can see the model formula now has two lines, the first the same as before and the second being phi ~ strain * year. We could even put random effects on phi as well, but I think in most cases that is an unnecessary complication. The varying intercept and slope effects on the mean response with respect to genotype are more than enough to account for those effects.

Also note that you need to include prior distributions on the fixed effects on phi, separately from the priors on the fixed effects on the mean. We add the argument dpar = phi to one of our prior()s to specify which distributional parameter is getting those priors.

fusarium_beta_fit_varyphi <- brm(
  bf(
    p ~ strain * year + (1 + year | gen),
    phi ~ strain * year
  ),
  family = Beta(link = 'logit', link_phi = 'log'),
  data = snijders.fusarium,
  prior = c(
    prior(normal(0, 5), class = b),
    prior(normal(0, 2), class = b, dpar = phi)
  ),
  chains = 4, iter = 2000, warmup = 1000, seed = 777,
  file = 'fits/fusarium_beta_fit_varyphi'
)

We will revisit the bf() syntax shortly when we look at zero-inflated models. By the way, you could even use the bf() syntax in a general linear mixed model with normal error to account for different variance between groups. You can let the standard deviation sigma to vary as a function of fixed and/or random effects. See the brms vignette on estimating distributional models.

Here’s the posterior predictive check plot for the model that allows phi to vary as a function of fixed effects. I guess it looks a tiny bit better than the model with a single global value estimated for phi.

pp_check(fusarium_beta_fit_varyphi)

Let’s compare the two models with LOO cross-validation. Despite having about a dozen more parameters, the model that allows phi to vary by strain and year is possibly better, and certainly no worse, than the simpler model. In this case I would say that it is up to your personal judgment which model to present. If the more complex model isn’t much better, I would probably stick with the simpler model. As an exercise, you may compare the inference from the two models to see if there is an appreciable difference.

fusarium_beta_fit <- add_criterion(fusarium_beta_fit, 'loo')
fusarium_beta_fit_varyphi <- add_criterion(fusarium_beta_fit_varyphi, 'loo')

loo_compare(fusarium_beta_fit, fusarium_beta_fit_varyphi)
##                           elpd_diff se_diff
## fusarium_beta_fit_varyphi  0.0       0.0   
## fusarium_beta_fit         -8.7      10.0

Making predictions with the beta model

Let’s present results from the simpler model with a single value of phi. Here, I estimate marginal means for each combination of strain and year, using type = 'response' to present them on the inverse link scale (the same proportion scale as the original data). Then, I take contrasts between each pair of strains in each year to see if we have statistical evidence that the expected proportion of Fusarium infection differs between strains within each year. We have to use regrid(fusarium_emms, transform = 'log') to get the contrasts in the form of ratios. Otherwise, we will get them in the form of odds ratios, which does not make sense for the proportion data we have.

(fusarium_emms <- emmeans(fusarium_beta_fit, ~ strain | year, type = 'response'))
## year = 1986:
##  strain response lower.HPD upper.HPD
##  F329     0.0540   0.03023    0.0886
##  F348     0.0608   0.03175    0.0970
##  F39      0.2440   0.15630    0.3506
##  F436     0.0833   0.04582    0.1303
## 
## year = 1987:
##  strain response lower.HPD upper.HPD
##  F329     0.0201   0.00996    0.0348
##  F348     0.0147   0.00733    0.0264
##  F39      0.2839   0.18633    0.4003
##  F436     0.0649   0.03491    0.1037
## 
## year = 1988:
##  strain response lower.HPD upper.HPD
##  F329     0.0750   0.04228    0.1171
##  F348     0.0629   0.03438    0.0973
##  F39      0.1164   0.07285    0.1773
##  F436     0.0919   0.05125    0.1395
## 
## Point estimate displayed: median 
## Results are back-transformed from the logit scale 
## HPD interval probability: 0.95
(fusarium_contrasts <- contrast(regrid(fusarium_emms, transform = 'log'), 'pairwise'))
## year = 1986:
##  contrast     ratio lower.HPD upper.HPD
##  F329 / F348 0.8887    0.6251     1.208
##  F329 / F39  0.2220    0.1626     0.290
##  F329 / F436 0.6482    0.4597     0.854
##  F348 / F39  0.2492    0.1835     0.320
##  F348 / F436 0.7298    0.5277     0.950
##  F39 / F436  2.9281    2.2417     3.659
## 
## year = 1987:
##  contrast     ratio lower.HPD upper.HPD
##  F329 / F348 1.3721    0.7608     2.184
##  F329 / F39  0.0713    0.0454     0.100
##  F329 / F436 0.3108    0.2005     0.453
##  F348 / F39  0.0519    0.0314     0.076
##  F348 / F436 0.2275    0.1335     0.330
##  F39 / F436  4.3460    3.3186     5.686
## 
## year = 1988:
##  contrast     ratio lower.HPD upper.HPD
##  F329 / F348 1.1934    0.8687     1.600
##  F329 / F39  0.6467    0.4853     0.831
##  F329 / F436 0.8208    0.6084     1.071
##  F348 / F39  0.5419    0.3969     0.700
##  F348 / F436 0.6866    0.5156     0.929
##  F39 / F436  1.2674    0.9685     1.617
## 
## Point estimate displayed: median 
## Results are back-transformed from the log scale 
## HPD interval probability: 0.95

We use describe_posterior() from the bayestestR package to get our desired credible intervals for the means and pairwise differences. In addition, if you are keen on testing a null hypothesis, you can get Bayesian analogues to p-values for the pairwise differences.

(fusarium_emms_CI <- describe_posterior(fusarium_emms, ci = c(0.66, 0.95), ci_method = 'eti', test = NULL))
## Summary of Posterior Distribution
## 
## strain | year | Median |       66% CI |       95% CI
## ----------------------------------------------------
## F329   | 1986 |   0.05 | [0.04, 0.07] | [0.03, 0.09]
## F348   | 1986 |   0.06 | [0.05, 0.08] | [0.04, 0.10]
## F39    | 1986 |   0.24 | [0.20, 0.29] | [0.16, 0.36]
## F436   | 1986 |   0.08 | [0.07, 0.11] | [0.05, 0.14]
## F329   | 1987 |   0.02 | [0.01, 0.03] | [0.01, 0.04]
## F348   | 1987 |   0.01 | [0.01, 0.02] | [0.01, 0.03]
## F39    | 1987 |   0.28 | [0.24, 0.34] | [0.19, 0.41]
## F436   | 1987 |   0.06 | [0.05, 0.08] | [0.04, 0.11]
## F329   | 1988 |   0.08 | [0.06, 0.09] | [0.05, 0.12]
## F348   | 1988 |   0.06 | [0.05, 0.08] | [0.04, 0.10]
## F39    | 1988 |   0.12 | [0.09, 0.14] | [0.07, 0.18]
## F436   | 1988 |   0.09 | [0.07, 0.12] | [0.06, 0.15]
(fusarium_contrasts_CI <- describe_posterior(fusarium_contrasts, ci = c(0.66, 0.95), ci_method = 'eti', test = c('pd', 'p_map'), null = 1))
## Summary of Posterior Distribution
## 
## contrast    | year | Median |       66% CI |       95% CI | p (MAP) |     pd
## ----------------------------------------------------------------------------
## F329 / F348 | 1986 |   0.89 | [0.76, 1.04] | [0.64, 1.24] |  0.735  | 75.62%
## F329 / F39  | 1986 |   0.22 | [0.19, 0.25] | [0.17, 0.29] |  < .001 |   100%
## F329 / F436 | 1986 |   0.65 | [0.56, 0.75] | [0.48, 0.88] |  0.011  | 99.65%
## F348 / F39  | 1986 |   0.25 | [0.22, 0.28] | [0.19, 0.33] |  < .001 |   100%
## F348 / F436 | 1986 |   0.73 | [0.63, 0.84] | [0.54, 0.97] |  0.071  | 98.42%
## F39 / F436  | 1986 |   2.93 | [2.59, 3.31] | [2.30, 3.76] |  < .001 |   100%
## F329 / F348 | 1987 |   1.37 | [1.08, 1.75] | [0.82, 2.28] |  0.602  | 89.53%
## F329 / F39  | 1987 |   0.07 | [0.06, 0.09] | [0.05, 0.10] |  < .001 |   100%
## F329 / F436 | 1987 |   0.31 | [0.26, 0.38] | [0.20, 0.46] |  < .001 |   100%
## F348 / F39  | 1987 |   0.05 | [0.04, 0.06] | [0.03, 0.08] |  < .001 |   100%
## F348 / F436 | 1987 |   0.23 | [0.18, 0.28] | [0.14, 0.35] |  < .001 |   100%
## F39 / F436  | 1987 |   4.35 | [3.84, 4.99] | [3.37, 5.78] |  < .001 |   100%
## F329 / F348 | 1988 |   1.19 | [1.03, 1.39] | [0.88, 1.62] |  0.604  | 87.60%
## F329 / F39  | 1988 |   0.65 | [0.57, 0.74] | [0.49, 0.84] |  0.007  | 99.88%
## F329 / F436 | 1988 |   0.82 | [0.71, 0.94] | [0.62, 1.09] |  0.320  | 92.00%
## F348 / F39  | 1988 |   0.54 | [0.47, 0.62] | [0.41, 0.71] |  < .001 |   100%
## F348 / F436 | 1988 |   0.69 | [0.60, 0.79] | [0.51, 0.92] |  0.033  | 99.42%
## F39 / F436  | 1988 |   1.27 | [1.12, 1.44] | [0.98, 1.63] |  0.241  | 96.62%

Presentation ideas

Figure

This figure could be put in a manuscript to show the modeled posterior means and their credible intervals alongside the data. As usual, this simultaneously shows the model predictions and the raw data points. This emphasizes that the data are really the important thing, not just the “clean” model predictions. It is a graphical way to both show that the model is a good fit to the data and show how different the predicted means are from each other.

gather_emmeans_draws(fusarium_emms) %>%
  mutate(.value = plogis(.value)) %>%
  ggplot(aes(x = year, y = .value, group = interaction(strain, year))) +
    stat_interval(aes(color = strain, color_ramp = after_stat(level)), .width = c(.66, .95), position = ggpp::position_dodgenudge(width = .5, x = .03)) +
    stat_summary(fun = 'median', geom = 'point', size = 2, position = ggpp::position_dodgenudge(width = .5, x = .03)) +
    geom_point(aes(y = p, color = strain), data = snijders.fusarium, position = ggpp::position_dodgenudge(width = .5, x = -.03), alpha = .5) +
    colorblind_friendly +
    labs(y = 'proportion infected', color_ramp = 'CI level')

Table

This is a table of the pairwise ratios of proportion infection along with their associated \(p_{MAP}\) values. (For example in 1986 our posterior estimate is that we are 95% sure that F329’s average area infected is somewhere between 16.5% and 29.3% as much as F39, with a point estimate of 22%.)

fusarium_contrasts_CI %>%
  filter(CI == 0.95) %>%
  dplyr::select(-c(pd, CI)) %>%
  gt(groupname_col = 'year') %>%
  fmt_number(n_sigfig = 3) %>%
  cols_label(Median = 'ratio', CI_low = '95% CI lower bound', CI_high = '95% CI upper bound', p_MAP = md('p~MAP~')) %>%
  sub_zero(columns = p_MAP, zero_text = '< 0.001') %>%
  tab_options(row_group.background.color = 'slateblue3',
              column_labels.font.weight = 'bold')
contrast ratio 95% CI lower bound 95% CI upper bound pMAP
1986
F329 / F348 0.889 0.645 1.24 0.735
F329 / F39 0.222 0.165 0.293 < 0.001
F329 / F436 0.648 0.477 0.879 0.0115
F348 / F39 0.249 0.188 0.327 < 0.001
F348 / F436 0.730 0.543 0.969 0.0708
F39 / F436 2.93 2.30 3.76 < 0.001
1987
F329 / F348 1.37 0.824 2.28 0.602
F329 / F39 0.0713 0.0473 0.103 < 0.001
F329 / F436 0.311 0.203 0.459 < 0.001
F348 / F39 0.0519 0.0331 0.0783 < 0.001
F348 / F436 0.227 0.145 0.347 < 0.001
F39 / F436 4.35 3.37 5.78 < 0.001
1988
F329 / F348 1.19 0.882 1.62 0.604
F329 / F39 0.647 0.491 0.838 0.00680
F329 / F436 0.821 0.616 1.09 0.320
F348 / F39 0.542 0.407 0.715 < 0.001
F348 / F436 0.687 0.506 0.923 0.0328
F39 / F436 1.27 0.982 1.63 0.241

Methods and results

Here is a basic way I came up with to describe the statistical model and its predictions, that you could include in the methods and results sections of a manuscript. This is by no means the only way you could describe it but I think this covers all of the important parts.

Methods

We fit a GLMM with beta response distribution to the proportion Fusarium infection data. A logit link function was used for the mean parameter, and a log link for the precision parameter \(\phi\). Strain and year were treated as fixed effects, and a random intercept and random slope with respect to year were fit for each genotype. We assigned Normal(0, 5) prior distributions to the fixed effects. Hamiltonian Monte Carlo was used to sample from the posterior distribution, with four Markov chains each run for 1000 warmup iterations and 1000 sampling iterations. We verified model convergence with the R-hat statistic and graphically assessed model fit with posterior predictive check plots. We compared this model with a model that had additional fixed effects on the precision parameter \(\phi\). LOO cross-validation showed this model was no better than the model without fixed effects on \(\phi\) (\(\Delta_{ELPD}\) = 8.7 with standard error = 10.0), therefore we present the simpler model.

We obtained posterior estimates of the marginal means for each combination of strain and year, back-transformed from the logit scale, and took pairwise ratios between strain means within each year. We characterized the posterior distributions of means and ratios using the median and equal-tailed credible intervals. We teested the null hypothesis that each ratio was equal to 1 using two different Bayesian p-value analogues: probability of direction \(p_d\) and maximum a posteriori p-value \(p_{MAP}\).

Results

Incidence of Fusarium infection was overall low except for strain F39 in 1986 and 1987, with estimated mean infection proportions 0.24 with 95% CI [0.16, 0.36] and 0.28 [0.19, 0.41] respectively. The evidence is consistent with a greater proportion infection for F39 relative to other strains, with approximately a tenfold difference in 1987 and smaller differences in the twofold to fivefold range in 1986 and 1988 (see Table).

Image (c) Philippa Willitts Image (c) Kirby Lee/USA Today Sports
Images (c) Philippa Willitts and Kirby Lee/USA Today Sports


Zero inflated and hurdle models

So far, when we’ve worked with distributions like gamma and beta, we’ve been using datasets that don’t contain any zeros. Gamma and beta also only support positive (greater than zero) values. As a result the statistical distributions have modeled the data really well. But many non-negative right-skewed datasets contain some zeros, and many proportion datasets also contain some zeros. What do we do about that, if we’re assuming the data follow a distribution without zeros?

“Old school” way: get rid of the zeros and transform

As usual, there is a tried and true way to force this kind of dataset to fit in the “normal” box. We transform the data with a log or logit transformation, but first we add a little fudge factor. This turns the zeros into a small positive number. The fudge factor can be based on reality. For example often people use half of the lower limit of detection, if you are dealing with concentration data that contains zeros. In many cases, though, the value added is completely arbitrary. Do you do a \(\text{log}(x + 1)\) transformation? What about \(\text{log}(x + 0.001)\)? Or \(\text{log}(x + 0.000001)\)? And as we have learned in previous lessons, even without the fudge factor, a bias is introduced when transforming and back-transforming.

Another way of dealing with it is simply to remove the zeros from the dataset before fitting a model. This is not a great practice for obvious reasons. But if you only have a very small percentage of zeros in the dataset, it can still give more or less unbiased results.

Better way: mixture distributions

As we’ve seen over and over again in these lessons, the old-fashioned way of doing things distorts the data until it conforms to a model with very specific and restrictive assumptions. That often works well enough, but it’s better to use an appropriate statistical model that actually fits the data as it is.

How do we do that in this case? We use what’s called a mixture distribution. A mixture distribution is a statistical distribution made up of two or more distributions stacked on top of each other. We take a weighted sum of their density functions so that the area under the curve is still 1. Here are some examples of mixture distributions. Notice we can mix distributions of different families, and at different proportions.

You can see how the PDF curves are just added to each other at different proportions to create the mixtures. Here, we will consider an important application of mixture distributions: modeling excess zeros (and sometimes ones) that can’t be fit by a standard statistical distribution. In his book Statistical Rethinking, Richard McElreath gives an example of medieval monks that are working on copying manuscripts. It takes a monk a long time to copy a manuscript. They work on copying most days. On some of those days, they finish one manuscript, some days two or three, some days none at all. But on other days, they take a break and enjoy a beer. On those break days, they always finish zero manuscripts. Therefore, the distribution of number of manuscripts finished per day will have a lot of zeros. Some of the zeros will come from the days where they were working but didn’t happen to finish any manuscripts. Other zeros will come from the beer-drinking days.

Image from Li livres dou santé (13th c.) Image from Miracles de Notre Dame (15th c.)
Images from Li livres dou santé (13th c.) and Miracles de Notre Dame (15th c.)


We can model the distribution of manuscript counts as a mixture of two different distributions: one part is a count distribution, and the other is a “point mass” distribution which adds some extra probability at zero. It would look something like this:

The difference between zero-inflated and hurdle models

Note that to model this kind of data effectively, we don’t need to know which zeros are which. We just need to assume that there is a greater proportion of zeros in the data than you would expect for a Poisson count distribution with a given mean.

A slightly different type of mixture model is used discrete and continuous distributions. We will cover both situations here. The two types of model are similar in spirit but differ slightly in their technical details.

Zero-inflated model: This refers to the case where we have a discrete distribution like a Poisson distribution. It already has some finite probability of being zero. But we ‘inflate’ the probability of zero with an extra parameter.

Hurdle model: Continuous distributions like gamma and beta do not have a probability of being exactly zero, or any other number for that matter. Probability for a continuous distribution is only defined on an interval between two values. So, to add extra probability of zero to a continuous distribution, we need a hurdle model. This is a mixture of two distributions: a so-called “point mass” distribution that determines the probability of zero, and a continuous distribution that gives us the shape of the nonzero values.

Count data that contains zeros: Zero Inflated Poisson (ZIP)

A common type of data where we want to use a zero-inflated model is count data. This occurs a lot in natural and ecological systems where we are counting the abundance of some species. (It isn’t just for beer-drinking monks!) We often observe zero individuals at a given plot/time point/sample. The Poisson and negative binomial distributions are both discrete distributions that work well with count data. However, they have fairly restricted shapes and don’t have enough flexibility to closely fit datasets that have a lot of zeros. They already do have some probability of zero. We just need to add an extra parameter to inflate that number. In the model, not all zeros are created equal. Some of them are zeros from the main discrete distribution and others are zeros from the extra inflation component.

Beet webworm caterpillar and feeding damage: image (c) USU Extension
Beet webworm caterpillar and feeding damage: image (c) USU Extension


For this part of the lesson, we will work with an example dataset from the agridat package called beall.webworms. It is a classic dataset originally published in 1940 where two pesticide treatments, spray and lead, were applied in a 2 by 2 factorial design. The researchers counted the number of beet webworm egg masses in 25 subdivided areas in each plot. It is not completely clear from the original manuscript how the experiment was laid out but for these purposes we’ll consider it to be a randomized complete block design, with each treatment combination present in each block. The egg mass count response variable is called y.

Let’s take a look at the distribution of egg mass counts for each treatment combination. Somewhere between 0 and 9 egg masses were counted in each subplot. In the untreated control (top left), relatively fewer subplots had zero egg masses. In the treated plots, there were more subplots with no webworm egg masses. Spray seems to strongly reduce webworm egg mass counts, regardless of lead treatment.

data(beall.webworms)

ggplot(beall.webworms, aes(x = y)) +
  geom_histogram() + 
  facet_grid(spray ~ lead, labeller = labeller(spray = c(N = 'no spray', Y = 'spray'),
                                               lead = c(N = 'no lead', Y = 'lead'))) +
  scale_x_continuous(breaks = 1:9, name = 'number of egg masses')

Fitting a Poisson model to the count data

First, let’s try to fit a regular old Poisson model without a zero-inflation component to this dataset, and see how it does. We have a single-line model formula with response on the left and predictors on the right. First there are fixed effects, which are the two treatments and their interactions, a term indicating that a random intercept will be fit for each block, and a term for a random intercept for each combination of treatments and block (this is the plot; each plot is divided into 25 non-independent observations). The response family is poisson, which has a log link function. A reasonably uninformative prior is set on the fixed effects, and standard arguments for number of chains and iterations are used.

webworm_pois_fit <- brm(y ~ spray * lead + (1 | block) + (1 | spray:lead:block),
                        data = beall.webworms,
                        family = poisson(link = 'log'),
                        prior = c(
                          prior(normal(0, 5), class = b)
                        ),
                        chains = 4, iter = 2000, warmup = 1000, seed = 111,
                        file = 'fits/webworm_pois_fit')

Let’s look at a posterior predictive check plot to see how well the posterior predicted values match the observations. The argument type = 'bars' is suitable for discrete data. We observe that the model predictions (dark blue dots with error bars) substantially underpredict the zeros and overpredict the counts of 1 and 2.

pp_check(webworm_pois_fit, type = 'bars')

Fitting a zero-inflated Poisson model to the count data

How to improve the model fit? Let’s include a zero inflation component.

We have already seen how model formulas in brms can have multiple lines, one for each parameter that we’re modeling. Mixture models will always have at least two parameters you can include because of the multiple distributions. We can put fixed effects, random effects, or both on the mean of the Poisson component and on the mass of the zero-inflation component, which is called zi. As before, wrap all parts of the model formula in the bf() function.

Note that I have chosen to put only fixed effects on the zi component, and no random effects, because I think the non-independence of the measurements from the same plots and blocks is already well accounted for by the random effects on the mean component. Adding an additional set of random effects on zi would needlessly complicate the model. But nothing is stopping you from including those random effects! Also note that I have included an additional prior distribution for those fixed effects. If you do not supply a separate prior distribution for fixed effects on each component of the response, brm() will assign default flat priors to the ones you left out.

webworm_zip_fit <- brm(
  bf(
    y ~ spray * lead + (1 | block) + (1 | spray:lead:block),
    zi ~ spray * lead
  ),
  data = beall.webworms,
  family = zero_inflated_poisson(link = 'log', link_zi = 'logit'),
  prior = c(
    prior(normal(0, 5), class = b),
    prior(normal(0, 5), class = b, dpar = zi)
  ),
  chains = 4, iter = 2000, warmup = 1000, seed = 222,
  file = 'fits/webworm_zip_fit'
)

Now let’s take a look at the posterior predictive check plot. It’s much better! You may formally compare the models with LOO cross-validation if you like (left as an exercise).

pp_check(webworm_zip_fit, type = 'bars')

Making predictions with the ZIP model

As usual, let’s explore the model predictions a little bit. This is a little bit trickier than for models we’ve fit in the past. That’s because we can make several different kinds of predictions from the model:

  • The conditional mean \(\mu\), or the model’s predicted value of the nonzero egg mass count for each treatment (i.e., “conditional” on there being a nonzero number of egg masses)
  • \(p_{zi}\), The model’s predicted probability of observing zero egg masses for each treatment
  • The overall mean response \(\mu(1 - p_{zi})\), or the conditional mean multiplied by the probability of a nonzero response.

By default, emmeans() will give us the conditional mean. Supplying the argument dpar = 'zi' will give us the probability of zero. But we often are interested in how both the expected nonzero value and the probability of zero combine to result in an overall expected value. We get this by supplying the argument epred = TRUE, where “epred” stands for “expected value of the posterior predictive distribution.”

(webworm_emms_condmean <- emmeans(webworm_zip_fit, ~ spray + lead, type = 'response'))
##  spray lead  rate lower.HPD upper.HPD
##  N     N    1.587     1.177     2.022
##  Y     N    0.583     0.392     0.810
##  N     Y    0.922     0.662     1.223
##  Y     Y    0.497     0.308     0.718
## 
## Point estimate displayed: median 
## Results are back-transformed from the log scale 
## HPD interval probability: 0.95
(webworm_emms_probzero <- emmeans(webworm_zip_fit, ~ spray + lead, type = 'response', dpar = 'zi'))
##  spray lead response lower.HPD upper.HPD
##  N     N       0.177  1.04e-01     0.252
##  Y     N       0.181  2.90e-05     0.339
##  N     Y       0.153  1.24e-05     0.258
##  Y     Y       0.238  3.67e-06     0.404
## 
## Point estimate displayed: median 
## Results are back-transformed from the logit scale 
## HPD interval probability: 0.95
(webworm_emms_overall <- emmeans(webworm_zip_fit, ~ spray + lead, epred = TRUE))
##  spray lead emmean lower.HPD upper.HPD
##  N     N     1.303     0.988     1.655
##  Y     N     0.478     0.351     0.627
##  N     Y     0.782     0.579     1.004
##  Y     Y     0.379     0.273     0.502
## 
## Point estimate displayed: median 
## HPD interval probability: 0.95

We see that the overall expected value is about 20% lower than the conditional mean, as we would expect because the expected probability of zero is around 20% in each treatment.

To do the appropriate pairwise comparisons between the overall expected values, we have to “remind” emmeans to treat the expected values as if they were fit on the log scale. This information is not preserved when epred = TRUE. We do this to get two sets of contrasts: contrasts of no-spray versus spray within each level of lead, and contrasts of no-lead versus lead within each level of spray.

(spray_contrasts_by_lead <- regrid(webworm_emms_overall, transform = 'log') %>% contrast(method = 'pairwise', type = 'response', by = 'lead'))
## lead = N:
##  contrast ratio lower.HPD upper.HPD
##  N / Y     2.72      2.08      3.54
## 
## lead = Y:
##  contrast ratio lower.HPD upper.HPD
##  N / Y     2.07      1.52      2.71
## 
## Point estimate displayed: median 
## Results are back-transformed from the log scale 
## HPD interval probability: 0.95
(lead_contrasts_by_spray <- regrid(webworm_emms_overall, transform = 'log') %>% contrast(method = 'pairwise', type = 'response', by = 'spray'))
## spray = N:
##  contrast ratio lower.HPD upper.HPD
##  N / Y     1.66     1.266      2.10
## 
## spray = Y:
##  contrast ratio lower.HPD upper.HPD
##  N / Y     1.26     0.907      1.68
## 
## Point estimate displayed: median 
## Results are back-transformed from the log scale 
## HPD interval probability: 0.95

We see that the spray treatment reduces egg counts by a factor of 2 or 3 regardless of lead treatment, and the lead treatment reduces egg counts by a lesser amount (by a factor of around 1.5). There is a modest interaction with the spray treatment: the spray is quite effective, so adding lead to already-sprayed plots doesn’t do that much more to reduce webworm egg counts.

If you aren’t happy with just estimating the effect size and its uncertainty (null hypothesis significance testing brainwashing takes a long time to wear off!!!) we can generate some Bayesian analogues to p-values for our summary tables.

describe_posterior(spray_contrasts_by_lead, ci_method = 'eti', ci = 0.95, test = c('p_map', 'pd'), null = 1)
## Summary of Posterior Distribution
## 
## contrast | lead | Median |       95% CI | p (MAP) |   pd
## --------------------------------------------------------
## N / Y    |    N |   2.72 | [2.10, 3.58] |  < .001 | 100%
## N / Y    |    Y |   2.07 | [1.54, 2.75] |  < .001 | 100%
describe_posterior(lead_contrasts_by_spray, ci_method = 'eti', ci = 0.95, test = c('p_map', 'pd'), null = 1)
## Summary of Posterior Distribution
## 
## contrast | spray | Median |       95% CI | p (MAP) |     pd
## -----------------------------------------------------------
## N / Y    |     N |   1.66 | [1.30, 2.14] |  < .001 |   100%
## N / Y    |     Y |   1.26 | [0.92, 1.71] |  0.417  | 93.15%

Presentation ideas

Figure

Here’s a figure of the overall posterior predictions for each treatment combination, showing the medians of the posterior distributions and some quantile-based credible intervals. As usual, it is nice to put the raw data on the figure as well, so that you can show everyone that your nice clean model predictions have some connection to reality. Because there are too many overlapping raw data points for the counts to plot them all as individual points, I have chosen to plot the distribution of counts in each treatment combination as a histogram. This might not be up your alley in terms of aesthetics; feel free to try out different ways of plotting the model + data!

gather_emmeans_draws(webworm_emms_overall) %>%
  ggplot(aes(x = interaction(spray, lead, sep = ' + '), y = .value)) +
    stat_histinterval(aes(y = y), data = beall.webworms, adjust = 5, point_interval = NULL) +
    stat_interval(aes(color = after_stat(level)), .width = c(.66, .95), position = position_nudge(x = -0.1)) +
    stat_summary(fun = 'median', geom = 'point', size = 2, position = position_nudge(x = -0.1)) +
    scale_color_brewer(name = 'CI level', palette = 'Greens') +
    scale_x_discrete(name = 'treatment', labels = c('-spray -lead', '+spray -lead', '-spray +lead', '+spray +lead')) +
    scale_y_continuous(name = 'webworm count', breaks = 0:6, limits = c(0, 6)) +
    coord_flip() +
    theme(panel.grid = element_blank(), legend.position = 'inside', legend.position.inside = c(0.9, 0.85))

Methods and results

As before, here are my suggestions for how the methods and results could be written up in a manuscript.

Methods

We fit a zero-inflated Poisson GLMM to the webworm count data, with a log link function for the \(\lambda\) parameter and a logit link function for the zero-inflation parameter. The model included fixed effects for spray treatment, lead treatment, and their interactions, and random intercepts for block nested within treatment combination. We assigned Normal(0, 5) prior distributions to the fixed effects on both \(\lambda\) and the zero-inflation parameter. Hamiltonian Monte Carlo was used to sample from the posterior distribution, with four Markov chains each run for 1000 warmup iterations and 1000 sampling iterations. We verified model convergence with the R-hat statistic and graphically assessed model fit with posterior predictive check plots.

We obtained posterior estimates of the marginal means for each treatment combination of strain and year, combining the conditional mean and zero-inflation probability. We back-transformed these means from the log scale, and took pairwise ratios between strain means within each year. We characterized the posterior distributions of means and ratios using the median and equal-tailed credible intervals. We teested the null hypothesis that each ratio was equal to 1 using two different Bayesian p-value analogues: probability of direction \(p_d\) and maximum a posteriori p-value \(p_{MAP}\).

Results

There was statistical evidence that webworms were more abundant in the unsprayed condition versus sprayed. This was true both in the presence of lead where spraying reduced mean webworm abundance by a factor of 2.07 (95% CI [1.54, 2.75]), and in the absence of lead where spraying reduced abundance 2.72 times (95% CI [2.10, 3.58]). There was more modest evidence that lead treatment reduced webworm abundance, with a greater difference in the unsprayed condition (lead reduced abundance by a factor of 1.66, 95% CI [1.30, 2.14]) than sprayed (1.26, 95% CI [0.92, 1.71]).

Final note: Another way to deal with the poor fit of the Poisson regression is to use a negative binomial distribution instead. The negative binomial distribution also works with count data, but it has an additional parameter that the Poisson distribution doesn’t. This allows it to more flexibly fit the data. In many cases it can work as well as or better than a zero-inflated Poisson model. I have included that in the exercises at the end of this lesson.

Models for continuous positive data including zeros

In ag science, we often have data that are strictly non-negative. For example, the concentration of an element in a sample, in mass per unit volume, cannot be negative. It must be zero or greater. For that matter, it is probably never truly zero because any sample has jillions of atoms in it and at least a few of them must be the element in question. But the methods we use to determine the concentration aren’t sensitive enough to detect extremely low concentrations that are very close to zero. So for all practical purposes those values are zero.

There are statistical distributions that are very good for fitting positive continuous data, such as the Gamma distribution that we’ve seen previously, or the lognormal. These distributions are great, but there’s a problem. They do not accommodate values of exactly zero. That’s why people often resort to adding a “fudge factor” as we’ve discussed previously, before fitting a Gamma or lognormal model. Of course, there is also the even more common, but even more bias-introducing, method of doing a log(x + 0.001) transformation and fitting a normal model.

Hurdle model for continuous data with zeros

Let’s look at a better way of fitting a model to continuous positive data including zeros. In this case, we are fitting a hurdle model. We aren’t “inflating” the probability of zero because the Gamma or lognormal distribution don’t have any probability of zero. Continuous distributions don’t even have a probability of any one specific discrete value; the probability is only defined over an interval. Instead, we are fitting a hurdle component with a certain probability, meaning that the distribution has to overcome a “hurdle” in order to attain a non-zero value. The interpretation of this is very similar to a zero-inflated model.

Image (c) NABAS
Image (c) NABAS


For once, I couldn’t find appropriate example data in the agridat package, so I am using some edited and anonymized data from an ARS researcher. This dataset comes from the field of immunology. The researcher did an assay to determine how effective a particular vaccine was at stimulating immune response in chickens. My very limited understanding of how this works is this: a disease-causing microorganism has some proteins on its cell surface (antigens). Having immunity to a particular microorganism means that you will have antibodies circulating in your bloodstream that recognize and bind to the microorganism’s antigens. This either disables the pathogen or marks it for destruction later. In this study, chickens that were vaccinated presumably had more of the appropriate antibodies in their bloodstream. The immunoassay involved mixing chicken blood with chemically labeled antigen molecules that release photons of light when they bind with an antibody. If the sample lights up more, as measured by a value called chemiluminescence, the individual has a stronger immune response. (One of) the questions in the study was: for which of four different antigens does the vaccine stimulate the strongest response? We can look at this by estimating the difference in mean chemiluminescence between blood samples from vaccinated and unvaccinated chickens exposed to each antigen. A subsample of each chicken’s blood was exposed to each of the four antigens, but each chicken is only either vaccinated or unvaccinated so it isn’t quite a complete block design.

The relevance of this dataset here is that many samples, especially from unvaccinated chickens, produced no detectable chemiluminescence. These are 0 values. Otherwise, chemiluminescence varies by several orders of magnitude. It is an ideal case for a hurdle model.

Take a look at the distribution of the data for each antigen. I’ve plotted a solid horizontal bar for each median, and all the raw data points slightly jittered so you can see there are quite a few zeros, especially in the unvaccinated group, but even among a few of the vaccinated birds that apparently didn’t get anything out of the vaccine. There is some “natural” immune response to two of the antigens, but none at all to the other two. The vaccine seems to be increasing the response to all four antigens on average, but by different amounts in each case.

immunoassay <- read_csv('data/immunoassay.csv')

ggplot(immunoassay, aes(x = antigen, y = chemiluminescence, color = vaccination, group = vaccination)) +
  geom_point(position = position_jitterdodge(jitter.width = 0.1, jitter.height = 0, dodge.width = 0.5), size = 2, alpha = 0.5) +
  stat_summary(geom = 'point', fun = 'median', shape = 95, size = 10, position = position_dodge(width = 0.5)) +
  colorblind_friendly

This time I will go straight to the hurdle model because it is clear that a model that doesn’t accommodate zeros will be a poor fit to a dataset that is about half zeros. But brms supports two different hurdle models for continuous positive data with zeros: the hurdle Gamma and the hurdle lognormal. Let’s just get a rough idea of which of the two might be a better fit by quickly fitting a lognormal and a Gamma distribution to the nonzero values of chemiluminescence. This uses fitdist() from the package fitdistrplus. (We’re using maximum likelihood and not Bayesian methods … gasp!) Plot the density functions of each of the distributions over a scaled histogram of the observations. Orange is gamma and blue is lognormal.

nonzero_values <- immunoassay$chemiluminescence[immunoassay$chemiluminescence > 0]
gamma_fit <- fitdist(nonzero_values, 'gamma')
lnorm_fit <- fitdist(nonzero_values, 'lnorm')

ggplot(immunoassay %>% filter(chemiluminescence > 0), aes(x = chemiluminescence)) +
  geom_histogram(aes(y = after_stat(density)), bins = 20) +
  stat_function(fun = dgamma, args = list(shape = gamma_fit$estimate[1], rate = gamma_fit$estimate[2]), linewidth = 2, color = 'orange') +
  stat_function(fun = dlnorm, args = list(meanlog = lnorm_fit$estimate[1], sdlog = lnorm_fit$estimate[2]), linewidth = 2, color = 'slateblue2') 

The lognormal distribution might be a slightly better fit. We will use the hurdle lognormal model, but I will leave it as an exercise for you to try out the hurdle Gamma as well and compare them.

Fitting a hurdle lognormal model

Here is the hurdle lognormal model fit. There are fixed effects for vaccination status, antigen, and their interaction. Those fixed effects are applied to both the mean response and the hurdle component hu (proportion of zeros). As above, I only put random effects on the mean component, although you could also put random effects on the hurdle if you wanted. Note that the hurdle lognormal uses an identity link for the mean response and a logit link for the probability. There is also a sigma parameter (standard deviation of the lognormal) with a log link function. I’ve chosen not to allow the sigma parameter to vary as a function of fixed or random effects, to avoid needlessly complicating the model. Reasonably conservative normal priors are on all the fixed effects.

immuno_hurdlelognorm_fit <- brm(
  bf(chemiluminescence ~ vaccination * antigen + (1 | chickID), 
     hu ~ vaccination * antigen
  ),
  data = immunoassay, 
  family = hurdle_lognormal(link = 'identity', link_sigma = 'log', link_hu = 'logit'),
  prior = c(
    prior(normal(0, 3), class = 'b'),
    prior(normal(0, 3), class = 'b', dpar = 'hu')
  ),
  chains = 4, iter = 2000, warmup = 1000, seed = 333,
  file = 'fits/immuno_hurdlelognorm_fit'
)

How does the posterior predictive check plot look? I applied a “pseudo-log” transformation to the axis scale which allows zeros to be plotted on a modified log scale. We can see that the model captures the shape of the data extremely well, including the peak of zeros.

pp_check(immuno_hurdlelognorm_fit) + scale_x_continuous(trans = 'pseudo_log', breaks = c(0, 10, 1000, 100000))

Making predictions with the hurdle lognormal model

Now, we can estimate the marginal means using the epred (expected value of the posterior predictive) which will be the product of the conditional mean of the nonzero observations and the probability of a nonzero observation.

(immuno_emmeans <- emmeans(immuno_hurdlelognorm_fit, ~ vaccination | antigen, epred = TRUE))
## antigen = FimA:
##  vaccination   emmean lower.HPD upper.HPD
##  nonvaccinated   7896   2008.25     16319
##  vaccinated     32743  15835.92     55959
## 
## antigen = FimW:
##  vaccination   emmean lower.HPD upper.HPD
##  nonvaccinated  55952  33496.05     88496
##  vaccinated     98389  62045.02    148687
## 
## antigen = FlgK:
##  vaccination   emmean lower.HPD upper.HPD
##  nonvaccinated   1214      0.09     55471
##  vaccinated     98387  48828.00    158130
## 
## antigen = FliD:
##  vaccination   emmean lower.HPD upper.HPD
##  nonvaccinated    998      0.03     50980
##  vaccinated     72847  30109.74    124767
## 
## Point estimate displayed: median 
## HPD interval probability: 0.95

Take the contrast between nonvaccinated and vaccinated birds within each antigen, which is our estimate of the immune response promoted by the vaccine, above the baseline natural immunity. Use 'revpairwise' to subtract vaccinated - nonvaccinated. There’s decent evidence that the vaccine promotes an immune response to all four antigens (most if not all of the posterior probability density is > 0), but there’s a lot of uncertainty in exactly how strong that vaccine effect is for each antigen. This is because of the variability among the chickens: the chickens’ immune systems seem to vary both in how much of an immune response they can naturally mount against these antigens, and in how much they are benefited by the vaccine. The uncertainty is highest for the FimW antigen.

(immuno_contrasts <- contrast(immuno_emmeans, 'revpairwise'))
## antigen = FimA:
##  contrast                   estimate lower.HPD upper.HPD
##  vaccinated - nonvaccinated    24217      5603     47334
## 
## antigen = FimW:
##  contrast                   estimate lower.HPD upper.HPD
##  vaccinated - nonvaccinated    42223     -1905    104370
## 
## antigen = FlgK:
##  contrast                   estimate lower.HPD upper.HPD
##  vaccinated - nonvaccinated    92203     26106    179090
## 
## antigen = FliD:
##  contrast                   estimate lower.HPD upper.HPD
##  vaccinated - nonvaccinated    67317      7297    148199
## 
## Point estimate displayed: median 
## HPD interval probability: 0.95

Again, we can do hypothesis tests using our by-now familiar Bayesian p-value analogues.

describe_posterior(immuno_contrasts, ci_method = 'eti', ci = 0.95, test = c('p_map', 'pd'), null = 0)
## Summary of Posterior Distribution
## 
## contrast                   | antigen |   Median |                95% CI
## -----------------------------------------------------------------------
## vaccinated - nonvaccinated |    FimA | 24216.96 | [  7550.79, 50082.38]
## vaccinated - nonvaccinated |    FimW | 42223.19 | [ -5277.64, 1.02e+05]
## vaccinated - nonvaccinated |    FlgK | 92203.25 | [ -8769.94, 1.62e+05]
## vaccinated - nonvaccinated |    FliD | 67316.71 | [-49143.10, 1.27e+05]
## 
## contrast                   | p (MAP) |     pd
## ---------------------------------------------
## vaccinated - nonvaccinated |   0.020 | 99.83%
## vaccinated - nonvaccinated |   0.194 | 96.30%
## vaccinated - nonvaccinated |   0.024 | 97.22%
## vaccinated - nonvaccinated |   0.028 | 96.15%

Something important to note here is that I haven’t explored how sensitive our results are to our choice of prior distribution. Because some of the groups have only 0 values, we don’t really have a great idea of what the true natural variability in immune response is in those groups. We can start to think there can’t be that much … after all, for two of the antigens we observed 0 response in all ten unvaccinated individuals. But we are still leaning on our prior distributions to fill in the gaps and help us estimate the means and hurdle probabilities for those groups. We used normal(0, 3) for the fixed-effect priors but what if we used normal(0, 1)? normal(0, 10)? What if we put different priors on the mean component and hurdle component? Especially in cases like this where the data are limited and the priors are doing a lot of work, it is important to explore the sensitivity of your results to choice of prior. I have included this as an exercise at the end of this lesson, which I highly recommend you try out.

Presentation ideas

Figure

As usual, this figure shows the data and the model predictions to simultaneously show that the model is a reasonably good fit to the data and show what inference you can get from the model.

gather_emmeans_draws(immuno_emmeans, value = 'chemiluminescence') %>%
  ggplot(aes(x = vaccination, y = chemiluminescence, color = vaccination)) +
    facet_wrap(~ antigen) +
    stat_interval(aes(color_ramp = after_stat(level)), .width = c(.66, .95), position = position_nudge(x = 0.15)) +
    stat_summary(fun = 'median', geom = 'point', color = 'black', size = 2, position = position_nudge(x = 0.15)) +
    geom_point(data = immunoassay, position = position_jitter(width = 0.05, height = 0), size = 1, alpha = 0.5) +
    colorblind_friendly +
    labs(color_ramp = 'CI level') +
    guides(color = 'none') +
    theme(strip.background = element_blank(),
          legend.position = c(0.09, 0.87),
          legend.background = element_rect(color = 'black'))

Methods and results

As usual, these are some basic guidelines for how to describe the statistical methods and results in a manuscript.

Methods

We fit a Bayesian lognormal GLMM to the chemiluminescence data from the immunoassay. The model included a hurdle component to accommodate the zeros in the response. There was an identity link function for the mean parameter, a log link function for the standard deviation, and a logit link function for the hurdle parameter. The model included fixed effects for vaccination status, antigen type, and their interaction, and random intercepts for each individual bird. The hurdle component was fit with fixed effects as well. The lognormal distribution was chosen by visually comparing the fit of the lognormal and Gamma distributions to the nonzero values in the data. We assigned Normal(0, 3) prior distributions to the fixed effects on both the mean and the zero-inflation parameter. Hamiltonian Monte Carlo was used to sample from the posterior distribution, with four Markov chains each run for 1000 warmup iterations and 1000 sampling iterations. We verified model convergence with the R-hat statistic and graphically assessed model fit with posterior predictive check plots.

We obtained posterior estimates of the marginal means for each combination of vaccination status and antigen, combining the conditional mean and hurdle components. We took pairwise differences between the unvaccinated and vaccinated means within each antigen. We characterized the posterior distributions of means and mean differences using the median and equal-tailed credible intervals. We tested the null hypothesis that each difference was equal to 0, indicating no response to vaccination, using two different Bayesian p-value analogues: probability of direction \(p_d\) and maximum a posteriori p-value \(p_{MAP}\).

Results

There was evidence that each antigen provoked a stronger immune response in vaccinated birds, but the uncertainty in the strength of this effect varied by antigen. The largest difference in response between vaccinated and unvaccinated birds was observed in response to the antigen FlgK (92200, 95% credible interval [26100, 179000]) and the smallest in response to the antigen FimA (24200, [5600, 47000]).

Models for proportion data including zeros, ones, or both

Another situation where you might want to use a hurdle model is for proportion data. Plenty of proportion datasets contain values that are exactly zero or exactly one, in addition to everything in between. Just a little while ago, I introduced you to a new friend, the beta distribution. But you might have noticed that the beta distribution, no matter what parameter values it takes, doesn’t ever have exactly 0 or exactly 1 in its support. This isn’t a big problem if there are only a handful of zeros and ones in the data. But if you have meaningful numbers of zeros and ones, say 5% or more of the data, the plain-vanilla beta distribution fits the data less well.

Zero-and-one-inflated beta (ZOIB) model

You can guess where this is heading. We can modify the beta distribution by including a probability of observing exactly zero, exactly one, or both. Now this is also a hurdle model because the beta distribution has 0 probability of being exactly 0. (Like the Gamma distribution, it’s continuous so it can’t have a probability of “exactly” any specific value.) The same goes for being exactly 1.

You’ve already seen how to fit a hurdle model to continuous positive data. The process is similar for the beta distribution. brms has two response families: zero_inflated_beta and zero_one_inflated_beta. (They’re called inflated though technically speaking they are hurdle models).

Because you are already pretty much experts at hurdle models at this point, we will go straight to the more complex case: the zero-one-inflated beta.

For the zero-one-inflated beta case, we have no less than four distributional parameters that we can model as varying with any combination of fixed and random effects: the mean, the precision parameter phi, the zero-inflation parameter zoi, and the one-inflation parameter coi. That can get complicated quickly. We will keep things relatively simple but you are welcome to try to add complexity to the model, keeping in mind that the usefulness of adding more and more parameters declines, the more complicated the model gets.

Saline soil where alfalfa is cultivated: image (c) Alforex Seeds
Saline soil where alfalfa is cultivated: image (c) Alforex Seeds


The example dataset is called carlson.germination from the agridat package. We are interested in the effect of salt concentration on the germination of alfalfa seeds from various different genotypes. For whatever reason, we don’t have binary germination outcomes for each seed individually, so we can’t use a logistic regression. We only have the percentage of seeds that germinated for each combination of genotype and salt concentration.

Let’s take a look at the dataset, first converting germination from a percent to a proportion ranging between 0 and 1. Draw a plot with a separate line for each genotype.

data(carlson.germination)

carlson.germination <- carlson.germination %>%
  mutate(germ = germ / 100)

ggplot(carlson.germination, aes(x = nacl, y = germ, group = gen)) + geom_line()

Germination decreases as salt concentration increases, surprising absolutely no one. No genotype seems to be that much more salt-tolerant than any other. But you can also see from the plot that there are quite a few exactly 0 and exactly 1 values in the dataset, because no seeds germinated for some genotypes at the highest salt concentration, and there was perfect germination success for some genotypes at low concentrations.

Fitting the ZOIB model

Let’s imagine, for this example, that we don’t really care about the difference between genotypes. We are considering the genotypes we happened to be testing in this study to be randomly sampled from a bigger population of genotypes we have to choose from. So we’ll treat genotype as a random effect. For now, we will also assume we don’t want our model to allow the effect of salt to vary by genotype (no random slopes), though we do want to allow the intercept to vary by genotype.

Here’s a model we could use. We are putting a fixed effect for the continuous variable nacl on the response, as well as a random intercept for genotype gen. We are also allowing the zero-inflation parameter zoi and the one-inflation parameter coi to vary by salt concentration. Of course, we could make this model way more complicated by adding random effects to zoi and coi as well as fixed and/or random effects on the precision parameter phi. As usual, complexity in a model is only beneficial when it helps us understand the data better. With a small dataset, it is unlikely to be fruitful to overcomplicate things.

Note we are using family = zero_one_inflated_beta() with all the default link functions, although I’ve spelled them out in the code anyway. We are using fairly conservative priors on the fixed effects.

carlson_zoibeta <- brm(
  bf(
    germ ~ nacl + (1 | gen),
    zoi ~ nacl,
    coi ~ nacl
  ),
  family = zero_one_inflated_beta(link = 'logit', link_phi = 'log', link_zoi = 'logit', link_coi = 'logit'),
  prior = c(
    prior(normal(0, 2), class = b),
    prior(normal(0, 2), class = b, dpar = zoi),
    prior(normal(0, 2), class = b, dpar = coi)
  ),
  data = carlson.germination,
  file = 'fits/carlson_zoibeta'
)

We could simplify the model further by getting rid of the fixed effects on the zero-inflation and one-inflation parameters.

carlson_zoibeta_simpler <- brm(
  bf(
    germ ~ nacl + (1 | gen)
  ),
  family = zero_one_inflated_beta(link = 'logit', link_phi = 'log', link_zoi = 'logit', link_coi = 'logit'),
  prior = c(
    prior(normal(0, 2), class = b)
  ),
  data = carlson.germination,
  file = 'fits/carlson_zoibeta_simpler'
)

Compare the two models.

carlson_zoibeta <- add_criterion(carlson_zoibeta, 'loo')
carlson_zoibeta_simpler <- add_criterion(carlson_zoibeta_simpler, 'loo')

loo_compare(carlson_zoibeta, carlson_zoibeta_simpler)
##                         elpd_diff se_diff
## carlson_zoibeta           0.0       0.0  
## carlson_zoibeta_simpler -21.1       4.7

We see that the model with fixed effects on the zero-inflation and one-inflation parameters is superior. Its graphical posterior predictive check is reasonable but not perfect (this is a small dataset after all).

pp_check(carlson_zoibeta)

Making predictions with the ZOIB model

We can use the summary() method for the model to see the different slope parameter estimates.

summary(carlson_zoibeta)
##  Family: zero_one_inflated_beta 
##   Links: mu = logit; phi = identity; zoi = logit; coi = logit 
## Formula: germ ~ nacl + (1 | gen) 
##          zoi ~ nacl
##          coi ~ nacl
##    Data: carlson.germination (Number of observations: 120) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Multilevel Hyperparameters:
## ~gen (Number of levels: 15) 
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)     0.32      0.13     0.06     0.62 1.00      839      591
## 
## Regression Coefficients:
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept         6.30      0.37     5.57     7.02 1.00     2306     2379
## zoi_Intercept    -0.02      0.41    -0.83     0.79 1.00     6066     2944
## coi_Intercept     4.52      1.49     2.11     7.87 1.00     4850     2483
## nacl             -4.75      0.26    -5.24    -4.22 1.00     2481     2607
## zoi_nacl         -1.39      0.41    -2.22    -0.63 1.00     4636     2705
## coi_nacl         -3.94      1.09    -6.32    -2.09 1.00     4732     2870
## 
## Further Distributional Parameters:
##     Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## phi    16.10      3.04    10.65    22.39 1.00     1918     2058
## 
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

We can do hypothesis testing on the parameters of the model directly with describe_posterior().

describe_posterior(carlson_zoibeta, ci_method = 'eti', test = c('pd', 'p_map'))
## Summary of Posterior Distribution
## 
## Parameter     | Median |         95% CI | p (MAP) |     pd |  Rhat |     ESS
## ----------------------------------------------------------------------------
## (Intercept)   |   6.30 | [ 5.57,  7.02] |  < .001 |   100% | 1.000 | 2268.00
## zoi_Intercept |  -0.02 | [-0.83,  0.79] |  0.994  | 51.60% | 1.000 | 5960.00
## coi_Intercept |   4.35 | [ 2.11,  7.87] |  < .001 |   100% | 1.000 | 3998.00
## nacl          |  -4.75 | [-5.24, -4.22] |  < .001 |   100% | 1.000 | 2386.00
## zoi_nacl      |  -1.38 | [-2.22, -0.63] |  < .001 |   100% | 1.000 | 4492.00
## coi_nacl      |  -3.85 | [-6.32, -2.09] |  < .001 |   100% | 1.000 | 4362.00

Ignoring the zero-inflation and one-inflation parameters momentarily, we can interpret the slope coefficient on nacl as the amount that the log odds of germination declines by, for every one unit increase in nacl. Salt concentration was measured in percent, so we can say that for every 1% increase in salt concentration, odds of germination decreases by a factor of \(e^{4.75}\), or ~116-fold. The 95% CI can be transformed in a similar way.

Presentation ideas

Figure

Because it’s sort of confusing what the different slopes mean in the context of the hurdle model, I don’t think it is that illuminating to present these parameter estimates. More useful is to “holistically” present the model predictions as a fitted trend. Let’s make a plot of the germination trend, with credible interval, superimposed on the raw data for each genotype. We do this by generating the expected value of the posterior predictive for a set of evenly spaced concentration values. This accounts for the zero-inflation and one-inflation components.

germ_pred <- data.frame(nacl = seq(0, 2, by = 0.1)) %>%
  add_epred_draws(carlson_zoibeta, re_formula = ~ 0)

ggplot(carlson.germination, aes(x = nacl, y = germ)) + 
  stat_lineribbon(aes(y = .epred), data = germ_pred) +
  geom_line(aes(group = gen), alpha = 0.3) +
  scale_fill_brewer(palette = 'Greens', name = 'CI level') +
  scale_x_continuous(name = 'NaCl concentration') +
  scale_y_continuous(name = 'Germination', labels = scales::percent) +
  theme(legend.position = 'bottom') 

Methods and results

For the last time in this lesson, these are some basic guidelines for how to describe the statistical methods and results in a manuscript.

Methods

We fit a zero-one-inflated beta hurdle model to the percent germination data. NaCl concentration was treated as a continuous fixed effect, and genotype as a categorical random effect. We used a logit link for the response, a log link for the dispersion parameter, and logit links for the zero-inflation and one-inflation hurdle parameters. We fit a single dispersion parameter, but the zero-inflation parameter and one-inflation parameter were modeled as varying by NaCl concentration. Weakly informative Normal(0, 2) priors were put on the slope parameters. Hamiltonian Monte Carlo was used to sample from the posterior distribution, with four Markov chains each run for 1000 warmup iterations and 1000 sampling iterations. We verified model convergence with the R-hat statistic and graphically assessed model fit with posterior predictive check plots. We generated posterior distributions of fitted values from the model at a grid of evenly spaced concentration values, accounting for the conditional mean, zero-inflation probability, and one-inflation probability.

Results

Note: The percentages given below are extracted from the germ_pred object created for the figure.

The percent germination decreased strongly with increasing NaCl concentration; the posterior estimate for 0% NaCl was 99.3% (95% CI [94.5%, 99.9%]), for 1% NaCl 78.7% [69.6%, 85.4%], and for 2% NaCl 4.1% [2.8%, 6.1%].

Conclusion

Congratulations on finishing Lesson 4 of the Bayesian crash course! As we always do, let’s look back over the learning objectives to see what we’ve learned.

In this lesson we have:

  • Fit a beta regression model to proportion data with and without zeros and ones
  • Fit a zero-inflated Poisson GLMM to count data that includes zeros
  • Fit a hurdle lognormal model to continuous non-negative data that includes zeros

By now, you have a wide variety of Bayesian models in your “toolkit” that can handle many different kinds of data. I hope it’s an empowering feeling!

Going further

Here are some additional resources you can use to learn more about the kinds of Bayesian GLMMs we’ve worked on in this lesson.

Exercises

Here are some exercises to help you practice with the concepts we learned in this lesson. As always, the “homework” is optional but highly recommended! These exercises are all focused on making small modifications to the models we already fit during the lesson and exploring how sensitive our inference is to those tweaks. As you fit more and more complicated statistical models, there are more and more subjective design decisions you have to make when building the model. It is critical to examine those decisions and the assumptions behind them. In that way, both you and your readers are fully informed about how much your findings depend on exactly what model you fit to the data.

Exercise 1

Another distribution that is often used for count data is the negative binomial distribution. It can have a similar shape to the Poisson distribution but it has an additional shape parameter, making it more flexible. It can fit data with more or less variance than expected by the Poisson. It is supported in brms; its family function is negbinomial(). It even has a zero-inflated version, zero_inflated_negbinomial().

Fit a negative binomial model and a zero-inflated negative binomial model to the Beall webworm data. Put fixed effects on the shape parameter. Use appropriate prior distributions. Using leave-one-out cross-validation, compare these two models to the basic Poisson model and the zero-inflated Poisson model we already fit. What can you conclude?

If you would like some fill-in-the-blank code to help you get started, expand the box below.

Click to expand
webworm_nb_fit <- brm(
  bf(
    y ~ spray * lead + (1 | block) + (1 | spray:lead:block),
    ... ~ ... * ...
  ),
  data = beall.webworms,
  family = ...(link = 'log', link_shape = 'log'),
  prior = c(
    prior(normal(0, 5), ... = b),
    prior(normal(0, 5), ... = b, dpar = ...)
  ),
  chains = 4, iter = 2000, warmup = 1000, seed = 222,
  file = 'fits/webworm_nb_fit'
)

webworm_zinb_fit <- brm(
  bf(
    y ~ spray * lead + (1 | block) + (1 | spray:lead:block),
    ... ~ ... * ...,
    ... ~ ... * ...
  ),
  data = beall.webworms,
  family = zero_inflated_negbinomial(link = 'log', link_shape = 'log', link_zi = 'logit'),
  prior = c(
    prior(normal(0, 5), ... = b),
    prior(normal(0, 5), ... = b, dpar = ...),
    prior(normal(0, 5), ... = b, dpar = ...)
  ),
  chains = 4, iter = 2000, warmup = 1000, seed = 222,
  file = 'fits/webworm_zinb_fit'
)

webworm_pois_fit <- ...(webworm_pois_fit, 'loo')
webworm_zip_fit <- add_criterion(..., ...)
webworm_nb_fit <- add_criterion(..., ...)
webworm_zinb_fit <- add_criterion(..., ...)

...(webworm_pois_fit, webworm_zip_fit, webworm_nb_fit, webworm_zinb_fit)

Exercise 2

We chose to fit a hurdle lognormal model to the immunoassay chemiluminescence data based on a quick visual assessment that it would fit the data better than the hurdle Gamma model. Fit a hurdle Gamma model to the data using the family hurdle_gamma(). Use a log link function for the response and a logit link function for the hurdle component hu. Do not put any fixed or random effects on the shape parameter. Use leave-one-out cross-validation to compare the two models. What can you conclude?

Exercise 3

Consider once again the hurdle lognormal model fit to the immunoassay data. I mentioned during the lesson that because of the lack of nonzero values in some of the response groups, our results may be especially sensitive to our choice of prior distributions on the fixed effects. Refit the model with normal(0, 1) priors on the fixed effects on the mean response. (Don’t bother to change the priors on the hu fixed effects.) Fit it once more with normal(0, 10) priors. Compare the inference you get from the original model fit and the two new model fits, using the emmeans(), contrast(), and describe_posterior() functions as we did for the original model. How does the inference differ? Bonus question: How might you report this in a manuscript?

Exercise 4

Add random intercepts by genotype to the zero-inflation and one-inflation parameters in the model we fit to the alfalfa germination by salt concentration dataset. Compare this model to the other models we fit to this dataset using leave-one-out cross-validation. What can you conclude?

Click here for answers