A crash course in Bayesian mixed models with brms (Lesson 4)

Introduction and review

What is this class?

  • Part 4 of a practical introduction to fitting Bayesian multilevel models in R and Stan
  • Uses the brms package (Bayesian Regression Models using Stan)

How to follow the course

  • Slides and text version of lessons are online
  • Fill in code in the worksheet (replace ... with code)
  • You can always copy and paste code from text version of lesson if you fall behind

What we’ve learned in Lessons 1-3

  • Relationship between prior, likelihood, and posterior
  • How to sample from the posterior with Hamiltonian Monte Carlo
  • How to use posterior distributions from our models to make predictions and test hypotheses
  • GLMMs with non-normal response distributions and random intercepts and slopes
  • Models with residuals correlated in space and time
  • Spatial generalized additive mixed models

What will we learn in this lesson?

  • Beta regression for proportion data
  • Zero-inflated Poisson regression for count data that include zeros
  • Hurdle models for continuous and proportion data that include zeros

Big-picture concepts

  • What mixture distributions are, and how they make statistical models more flexible
  • How to allow each separate parameter of a statistical distribution to vary as a function of fixed and random effects
  • Reminder: not all of these analyses have to be done in a Bayesian framework, but it is often the only approach that works in practice!

Part 1. Beta-regression for proportion data

Load packages and set options

library(brms)
library(agridat)
library(tidyverse)
library(emmeans)
library(tidybayes)
library(ggdist)
library(bayestestR)
library(gt)
library(fitdistrplus)

theme_set(theme_bw())
colorblind_friendly <- scale_color_manual(values = unname(palette.colors(5)[-1]), aesthetics = c('colour', 'fill'))

# IF YOU ARE RUNNING THIS CODE ON YOUR OWN MACHINE
options(mc.cores = 4, brms.file_refit = 'on_change')

# IF YOU ARE RUNNING THIS CODE ON THE POSIT CLOUD SERVER (or you have configured CmdStan on your own machine)
options(brms.backend = 'cmdstanr', mc.cores = 4, brms.file_refit = 'on_change')

Regression with proportional data

  • In Lesson 2, we modeled binary data with a logistic GLMM instead of fitting a model to the proportions
  • But sometimes you just have the proportions and not the counts, or you truly have a continuous proportion not based on counts
  • Normally distributed error won’t work because proportions are bounded between 0 and 1
    • It can work OK if most of your data is between ~0.25 and ~0.75, but that often isn’t the case in ag science

What distribution to use?

  • Binomial won’t work because it is a discrete distribution for counts of successes, not for continuous proportions
  • Gamma won’t work because it doesn’t have an upper bound
  • We need a distribution that is bounded on the [0, 1] interval and is very flexible in its shape

Beta distribution to the rescue!

  • Can be a bell-like curve, a skewed hump, or even a horseshoe, but it always goes from 0 to 1

Parameters of the beta distribution

  • Many sources use parameters \(\alpha\) and \(\beta\) (Wikipedia)
  • R function dbeta() uses those parameters but calls them shape1 and shape2
  • In brms, the beta distribution is reparameterized in terms of different parameters, \(\mu\) (mu) and \(\phi\) (phi) (see the sketch below)
    • mu is the mean parameter
    • phi is the precision parameter
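For example, here is a minimal sketch (not part of the lesson worksheet) showing how the two parameterizations relate: shape1 = mu * phi and shape2 = (1 - mu) * phi describe the same beta distribution.
# Relationship between the (mu, phi) parameterization used by brms and the
# (shape1, shape2) parameterization used by dbeta():
# shape1 = mu * phi, shape2 = (1 - mu) * phi
mu <- 0.2   # mean
phi <- 10   # precision: larger phi = less spread around the mean
curve(dbeta(x, shape1 = mu * phi, shape2 = (1 - mu) * phi), from = 0, to = 1,
      xlab = 'proportion', ylab = 'density')
# Check: the mean of Beta(shape1, shape2) is shape1 / (shape1 + shape2), which equals mu
(mu * phi) / (mu * phi + (1 - mu) * phi)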

Mean parameterization

  • Most distributions in brms are parameterized so that the first parameter is the mean or expected value of the distribution
  • This is convenient because we can mentally keep track of effects on the mean, versus effects on other components of the distribution
  • You can put any combination of fixed and/or random effects on any parameter of any distribution in brms

Example dataset of proportions

Fusarium head blight on winter wheat: image (c) Janet Lewis/CIMMYT
  • snijders.fusarium from agridat package
  • Percentage of winter wheat leaf area infected by Fusarium fungus
  • 17 wheat genotypes, 4 Fusarium strains, 3 years

Pre-process data

  • Beta distribution requires proportions, not percentages
  • Convert gen, strain, and year to factor data type
data(snijders.fusarium)

snijders.fusarium <- snijders.fusarium %>% 
  mutate(
    across(c(gen, strain, year), factor),
    p = y/100
  )

Histograms: distribution of data for each strain

  • These shapes are typical for proportion data, and resemble beta distributions
ggplot(snijders.fusarium, aes(x = p, fill = strain)) + 
  geom_histogram(bins = 15) +
  facet_wrap(~ strain) +
  colorblind_friendly +
  theme(legend.position = 'none')

Strip plot: raw data and medians for each strain

  • Showing all the data is often preferable to a boxplot, if the dataset is small
ggplot(snijders.fusarium, aes(x = year, y = p, color = strain)) +
  geom_point(position = position_jitterdodge(jitter.width = 0.1, jitter.height = 0, dodge.width = 0.5), alpha = 0.5) +
  stat_summary(fun = 'median', geom = 'point', size = 15, shape = 95, position = position_dodge(width = 0.5)) +
  colorblind_friendly

Model formula and response family

  • p ~ strain * year + (1 + year | gen)
    • Response p on the left
    • Fixed effects for strain (categorical), year (categorical), and their interaction
    • Random intercept for each genotype, and random slope (year effects may vary by genotype)
  • family = Beta(link = 'logit', link_phi = 'log')
    • Logit link for the mean response
    • Log link for the precision parameter phi

Fit the beta regression model

fusarium_beta_fit <- brm(
  p ~ strain * year + (1 + year | gen),
  family = Beta(link = 'logit', link_phi = 'log'),
  data = snijders.fusarium,
  prior = c(
    prior(normal(0, 5), class = b)
  ),
  chains = 4, iter = 2000, warmup = 1000, seed = 555,
  file = 'fits/fusarium_beta_fit'
)

Posterior predictive check: is it a good fit?

pp_check(fusarium_beta_fit)

Beta regression model with varying precision parameter

  • The model we just fit has fixed and random effects on the mean mu, but only fits one global precision parameter phi
  • Does allowing phi to vary with fixed and/or random effects give us a better fit?
  • Within brm() we specify a separate formula for each model parameter, wrapped in bf()
  • phi ~ strain * year means the precision parameter has fixed effects on it too
  • Putting random effects on the precision parameter is usually overkill
  • We need to put separate prior distributions on the fixed effects of phi, using the argument dpar = phi
    • dpar stands for “distributional parameter”
fusarium_beta_fit_varyphi <- brm(
  bf(
    p ~ strain * year + (1 + year | gen),
    phi ~ strain * year
  ),
  family = Beta(link = 'logit', link_phi = 'log'),
  data = snijders.fusarium,
  prior = c(
    prior(normal(0, 5), class = b),
    prior(normal(0, 2), class = b, dpar = phi)
  ),
  chains = 4, iter = 2000, warmup = 1000, seed = 777,
  file = 'fits/fusarium_beta_fit_varyphi'
)

Note on bf() syntax

  • We will revisit shortly when we learn about zero-inflated and hurdle models
  • You could even use the distributional regression bf() syntax in a typical general linear model with normally distributed error
    • Standard deviation sigma can vary with fixed or random effects, if you want to model unequal variance among groups (see the sketch below)
  • See the brms vignette on estimating distributional models
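As an illustration, here is a hedged sketch of such a distributional Gaussian model (not fit in this lesson); the data frame mydata and the variables y, treatment, and block are hypothetical placeholders.
# Hypothetical distributional model sketch: the residual standard deviation sigma
# gets its own formula, so each treatment group can have its own variance.
# When sigma is given a formula, brms models it on the log scale by default.
unequal_variance_fit <- brm(
  bf(
    y ~ treatment + (1 | block),   # model for the mean
    sigma ~ treatment              # model for the residual SD
  ),
  family = gaussian(),
  data = mydata,
  chains = 4, iter = 2000, warmup = 1000
)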

pp_check() and model comparison

pp_check(fusarium_beta_fit_varyphi)

fusarium_beta_fit <- add_criterion(fusarium_beta_fit, 'loo')
fusarium_beta_fit_varyphi <- add_criterion(fusarium_beta_fit_varyphi, 'loo')

loo_compare(fusarium_beta_fit, fusarium_beta_fit_varyphi)

What can we conclude?

Making predictions with the beta model

  • Use the simpler model with a fixed phi
  • Estimate marginal means for each combination of strain and year
    • type = 'response' means use inverse link scale (the same scale as the original data)
  • Take contrasts between each pair of strains in each year
    • Use regrid(fusarium_emms, transform = 'log') to get the contrasts in the form of ratios, not odds ratios
(fusarium_emms <- emmeans(fusarium_beta_fit, ~ strain | year, type = 'response'))
(fusarium_contrasts <- contrast(regrid(fusarium_emms, transform = 'log'), 'pairwise'))

Credible intervals and hypothesis tests

  • Use describe_posterior() from the bayestestR package to get credible intervals for the means and pairwise differences
    • These are quantiles of the 4000 posterior samples we generated for each of those quantities
  • \(p_{MAP}\) and \(p_{d}\) are Bayesian analogues to p-values, if you are keen on testing a null hypothesis
(fusarium_emms_CI <- describe_posterior(fusarium_emms, ci = c(0.66, 0.95), ci_method = 'eti', test = NULL))
(fusarium_contrasts_CI <- describe_posterior(fusarium_contrasts, ci = c(0.66, 0.95), ci_method = 'eti', test = c('pd', 'p_map'), null = 1))

Presentation ideas: How to report this analysis in a manuscript?

  • Description of statistical model for methods section (including model comparison results)
  • Verbal description of results
  • Figure
  • Table

Methods section

We fit a GLMM with beta response distribution to the proportion Fusarium infection data. A logit link function was used for the mean parameter, and a log link for the precision parameter \(\phi\). Strain and year were treated as fixed effects, and a random intercept and random slope with respect to year were fit for each genotype. We assigned Normal(0, 5) prior distributions to the fixed effects. Hamiltonian Monte Carlo was used to sample from the posterior distribution, with four Markov chains each run for 1000 warmup iterations and 1000 sampling iterations. We verified model convergence with the R-hat statistic and graphically assessed model fit with posterior predictive check plots. We compared this model with a model that had additional fixed effects on the precision parameter \(\phi\). LOO cross-validation showed this model was no better than the model without fixed effects on \(\phi\) (\(\Delta_{ELPD}\) = 8.7 with standard error = 10.0), therefore we present the simpler model.

Methods section, continued

We obtained posterior estimates of the marginal means for each combination of strain and year, back-transformed from the logit scale, and took pairwise ratios between strain means within each year. We characterized the posterior distributions of means and ratios using the median and equal-tailed credible intervals. We tested the null hypothesis that each ratio was equal to 1 using two different Bayesian p-value analogues: probability of direction \(p_d\) and maximum a posteriori p-value \(p_{MAP}\).

Results section

Incidence of Fusarium infection was overall low except for strain F39 in 1986 and 1987, with estimated mean infection proportions of 0.24 (95% CI [0.16, 0.36]) and 0.28 ([0.19, 0.41]), respectively. The evidence is consistent with a greater infection proportion for F39 relative to the other strains, with approximately a tenfold difference in 1987 and smaller differences in the twofold to fivefold range in 1986 and 1988 (see Table).

Manuscript-quality figure

  • Modeled posterior means, credible intervals, and the raw data
  • Simultaneously shows that the model is a good fit to the data, and shows the differences between the predicted means
  • 66% and 95% credible intervals because that is about ± 1 and 2 SD, if the posterior distribution of the parameter is roughly normal
gather_emmeans_draws(fusarium_emms) %>%
  mutate(.value = plogis(.value)) %>%
  ggplot(aes(x = year, y = .value, group = interaction(strain, year))) +
    stat_interval(aes(color = strain, color_ramp = after_stat(level)), .width = c(.66, .95), position = ggpp::position_dodgenudge(width = .5, x = .03)) +
    stat_summary(fun = 'median', geom = 'point', size = 2, position = ggpp::position_dodgenudge(width = .5, x = .03)) +
    geom_point(aes(y = p, color = strain), data = snijders.fusarium, position = ggpp::position_dodgenudge(width = .5, x = -.03), alpha = .5) +
    colorblind_friendly +
    labs(y = 'proportion infected', color_ramp = 'CI level')

Manuscript-quality table

  • Pairwise ratios of proportion infection (posterior estimates and 95% CIs), with \(p_{MAP}\) values
fusarium_contrasts_CI %>%
  filter(CI == 0.95) %>%
  dplyr::select(-c(pd, CI)) %>%
  gt(groupname_col = 'year') %>%
  fmt_number(n_sigfig = 3) %>%
  cols_label(Median = 'ratio', CI_low = '95% CI lower bound', CI_high = '95% CI upper bound', p_MAP = md('p~MAP~')) %>%
  sub_zero(columns = p_MAP, zero_text = '< 0.001') %>%
  tab_options(row_group.background.color = 'slateblue3',
              column_labels.font.weight = 'bold')

Part 2. Zero-inflated Poisson model for count data with zeros

Zero-inflated and hurdle models

Images (c) Philippa Willitts and Kirby Lee/USA Today Sports

  • Gamma and Beta distributions only support positive values (Gamma: > 0; Beta: strictly between 0 and 1)
  • So far we’ve used datasets that don’t contain any zeros, so these distributions have worked very well
  • But many proportion, count, and non-negative continuous datasets contain zeros, sometimes lots of them!

“Old school” way: get rid of the zeros and transform

  • Force the dataset to fit in the “normal” box
  • Add a little fudge factor (constant), then log transform
    • Sometimes constant is reality-based: LOD/2
    • Sometimes pulled out of thin air: log(x + 1)? log(x + 0.0000001)?
  • Even without the fudge factor, a bias is introduced when transforming and back-transforming
  • Or you can simply remove the zeros from the dataset before fitting a model (acceptable if very few zeros)

Better way: mixture distributions

  • Mixture distribution: a statistical distribution made up of two or more distributions stacked on top of each other
  • Weighted sum of their density functions, so that the area under the mixture PDF curve = 1 (see the sketch below)
  • You can mix distributions from different families, at different proportions
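As a minimal illustration (not from the lesson worksheet), here is a 40/60 mixture of two normal distributions: the mixture density is the weighted sum of the two component densities, so it still integrates to 1.
# Sketch of a two-component mixture: weight 0.4 on Normal(-2, 1) and weight 0.6 on Normal(3, 1.5)
mix_density <- function(x) 0.4 * dnorm(x, mean = -2, sd = 1) + 0.6 * dnorm(x, mean = 3, sd = 1.5)
curve(mix_density, from = -6, to = 8, xlab = 'x', ylab = 'density')
# The weights sum to 1, so the area under the mixture density is still 1
integrate(mix_density, lower = -Inf, upper = Inf)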

Application of mixture distributions to datasets with zeros

  • We can use mixture distributions to model excess zeros (and sometimes ones) that can’t be fit by a standard statistical distribution
  • Let’s start with an example from the great book Statistical Rethinking by Richard McElreath

Images from Li livres dou santé (13th c.) and Miracles de Notre Dame (15th c.)

  • Medieval monks take a long time to copy manuscripts
  • Most days they work on copying manuscripts
  • On some of those days, they finish one manuscript, some days two or three, some days none at all
  • But on other days, they take a break and enjoy a beer, always finishing zero manuscripts

  • Distribution of number of manuscripts finished per day includes zeros
  • Some come from workdays when no manuscripts finished
  • The rest come from beer-drinking days
  • To model this data, we don’t need to know which zeros are which
  • Just assume that there is a greater proportion of zeros in the data than you would expect for a (Poisson) count distribution with a given mean (see the simulation sketch below)
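Here is a minimal simulation sketch of that story (the beer-day probability of 0.25 and the Poisson mean of 1 are made-up numbers for illustration).
# Simulate the monks: on each day there is a 25% chance of a beer day (zero manuscripts);
# otherwise the number of manuscripts finished is Poisson with mean 1.
set.seed(1)
n_days <- 1000
beer_day <- rbinom(n_days, size = 1, prob = 0.25)
manuscripts <- ifelse(beer_day == 1, 0, rpois(n_days, lambda = 1))
# Observed proportion of zeros exceeds what a plain Poisson with the same mean would predict
mean(manuscripts == 0)
dpois(0, lambda = mean(manuscripts))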

Zero-inflated models

  • For discrete distributions like Poisson that already have some finite probability of being zero.
  • ‘Inflate’ the probability of zero with an extra parameter.

Hurdle models

  • For continuous distributions like Gamma and Beta
  • Continuous distributions assign zero probability to any single exact value, including zero
  • Mixture of two distributions
    • “Point mass” distribution that determines the probability of zero
    • Continuous distribution that gives us the shape of the nonzero values (see the brms sketch below)
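For orientation, here is a hedged sketch of how a hurdle model for non-negative continuous data could be specified in brms; hu is the hurdle (probability of zero) parameter, and the data frame and variable names are hypothetical placeholders.
# Hypothetical hurdle gamma model sketch (mydata, y, treatment, and block are placeholders).
# hu is the hurdle part: the probability of an exact zero, modeled here with its own
# fixed effects (logit link by default); the gamma part models the nonzero values.
hurdle_fit <- brm(
  bf(
    y ~ treatment + (1 | block),   # mean of the nonzero values (log link)
    hu ~ treatment                 # probability of zero
  ),
  family = hurdle_gamma(link = 'log'),
  data = mydata,
  chains = 4, iter = 2000, warmup = 1000
)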

Count data that contains zeros: Zero Inflated Poisson (ZIP)

  • Many datasets that count the abundance of organisms have lots of zero observations
  • Poisson and negative binomial distributions are both discrete distributions that work well with count data
  • But they may not fit datasets well that have a lot of zeros
  • They already have some probability of zero; we just need an extra parameter to inflate that probability
  • Two kinds of zeros in the model: some from the main discrete distribution and others from the extra inflation component (see the brms sketch below)
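Before we get to the example dataset, here is a hedged sketch of how a zero-inflated Poisson model could be specified in brms; zi is the zero-inflation probability parameter, and the data frame and variable names are hypothetical placeholders.
# Hypothetical zero-inflated Poisson model sketch (mydata, count, treatment, and block are placeholders).
# zi is the zero-inflation part: the extra probability of a zero, modeled here with its own
# fixed effects (logit link by default); the Poisson part models the counts.
zip_fit <- brm(
  bf(
    count ~ treatment + (1 | block),   # Poisson part (log link)
    zi ~ treatment                     # zero-inflation part
  ),
  family = zero_inflated_poisson(),
  data = mydata,
  chains = 4, iter = 2000, warmup = 1000
)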

Example count dataset with zeros