A crash course in Bayesian mixed models with brms (Lesson 4)

Introduction and review

What is this class?

  • Part 4 of a practical introduction to fitting Bayesian multilevel models in R and Stan
  • Uses the brms package (Bayesian Regression Models using Stan)

How to follow the course

  • Slides and text version of lessons are online
  • Fill in code in the worksheet (replace ... with code)
  • You can always copy and paste code from text version of lesson if you fall behind

What we’ve learned in Lessons 1-3

  • Relationship between prior, likelihood, and posterior
  • How to sample from the posterior with Hamiltonian Monte Carlo
  • How to use posterior distributions from our models to make predictions and test hypotheses
  • GLMMs with non-normal response distributions and random intercepts and slopes
  • Models with residuals correlated in space and time
  • Spatial generalized additive mixed models

What will we learn in this lesson?

  • Beta regression for proportion data
  • Zero-inflated Poisson regression for count data that include zeros
  • Hurdle models for continuous and proportion data that include zeros

Big-picture concepts

  • What mixture distributions are, and how they make statistical models more flexible
  • How to allow each separate parameter of a statistical distribution to vary as a function of fixed and random effects
  • Reminder: not all of these analyses have to be done in a Bayesian framework, but it is often the only approach that works in practice!

Part 1. Beta-regression for proportion data

Load packages and set options

library(brms)
library(agridat)
library(tidyverse)
library(emmeans)
library(tidybayes)
library(ggdist)
library(bayestestR)
library(gt)
library(fitdistrplus)

theme_set(theme_bw())
colorblind_friendly <- scale_color_manual(values = unname(palette.colors(5)[-1]), aesthetics = c('colour', 'fill'))

# IF YOU ARE RUNNING THIS CODE ON YOUR OWN MACHINE
options(mc.cores = 4, brms.file_refit = 'on_change')

# IF YOU ARE RUNNING THIS CODE ON THE POSIT CLOUD SERVER (or you have configured CmdStan on your own machine)
options(brms.backend = 'cmdstanr', mc.cores = 4, brms.file_refit = 'on_change')

Regression with proportional data

  • In Lesson 2, we modeled binary data with a logistic GLMM instead of fitting a model to the proportions
  • But sometimes you just have the proportions and not the counts, or you truly have a continuous proportion not based on counts
  • Normally distributed error won’t work because proportions are bounded between 0 and 1
    • It can work OK if most of your data is between ~0.25 and ~0.75, but that often isn’t the case in ag science

What distribution to use?

  • Binomial won’t work because it is a discrete distribution for counts of successes, not for continuous proportions
  • Gamma won’t work because it doesn’t have an upper bound
  • We need a distribution that is bounded on the [0, 1] interval and is very flexible in its shape

Beta distribution to the rescue!

  • Can be a bell-like curve, a skewed hump, or even a horseshoe, but it always goes from 0 to 1

Parameters of the beta distribution

  • Many sources use parameters \(\alpha\) and \(\beta\) (Wikipedia)
  • R function dbeta() uses those parameters but calls them shape1 and shape2
  • In brms, the beta distribution is reparameterized in terms of different parameters, \(\mu\) (mu) and \(\phi\) (phi) (see the sketch below)
    • mu is the mean parameter
    • phi is the precision parameter
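For example, here is a minimal sketch (not part of the lesson worksheet) showing how the two parameterizations relate: shape1 = mu * phi and shape2 = (1 - mu) * phi describe the same beta distribution.
# Relationship between the (mu, phi) parameterization used by brms and the
# (shape1, shape2) parameterization used by dbeta():
# shape1 = mu * phi, shape2 = (1 - mu) * phi
mu <- 0.2   # mean
phi <- 10   # precision: larger phi = less spread around the mean
curve(dbeta(x, shape1 = mu * phi, shape2 = (1 - mu) * phi), from = 0, to = 1,
      xlab = 'proportion', ylab = 'density')
# Check: the mean of Beta(shape1, shape2) is shape1 / (shape1 + shape2), which equals mu
(mu * phi) / (mu * phi + (1 - mu) * phi)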

Mean parameterization

  • Most distributions in brms are parameterized so that the first parameter is the mean or expected value of the distribution
  • This is convenient because we can mentally keep track of effects on the mean, versus effects on other components of the distribution
  • You can put any combination of fixed and/or random effects on any parameter of any distribution in brms

Example dataset of proportions

Fusarium head blight on winter wheat: image (c) Janet Lewis/CIMMYT
  • snijders.fusarium from agridat package
  • Percentage of winter wheat leaf area infected by Fusarium fungus
  • 17 wheat genotypes, 4 Fusarium strains, 3 years

Pre-process data

  • Beta distribution requires proportions, not percentages
  • Convert gen, strain, and year to factor data type
data(snijders.fusarium)

snijders.fusarium <- snijders.fusarium %>% 
  mutate(
    across(c(gen, strain, year), factor),
    p = y/100
  )

Histograms: distribution of data for each strain

  • These shapes are typical for proportion data, and resemble beta distributions
ggplot(snijders.fusarium, aes(x = p, fill = strain)) + 
  geom_histogram(bins = 15) +
  facet_wrap(~ strain) +
  colorblind_friendly +
  theme(legend.position = 'none')

Strip plot: raw data and medians for each strain

  • Showing all the data is often preferable to a boxplot, if the dataset is small
ggplot(snijders.fusarium, aes(x = year, y = p, color = strain)) +
  geom_point(position = position_jitterdodge(jitter.width = 0.1, jitter.height = 0, dodge.width = 0.5), alpha = 0.5) +
  stat_summary(fun = 'median', geom = 'point', size = 15, shape = 95, position = position_dodge(width = 0.5)) +
  colorblind_friendly

Model formula and response family

  • p ~ strain * year + (1 + year | gen)
    • Response p on the left
    • Fixed effects for strain (categorical), year (categorical), and their interaction
    • Random intercept for each genotype, and random slope (year effects may vary by genotype)
  • family = Beta(link = 'logit', link_phi = 'log')
    • Logit link for the mean response
    • Log link for the precision parameter phi

Fit the beta regression model

fusarium_beta_fit <- brm(
  p ~ strain * year + (1 + year | gen),
  family = Beta(link = 'logit', link_phi = 'log'),
  data = snijders.fusarium,
  prior = c(
    prior(normal(0, 5), class = b)
  ),
  chains = 4, iter = 2000, warmup = 1000, seed = 555,
  file = 'fits/fusarium_beta_fit'
)

Posterior predictive check: is it a good fit?

pp_check(fusarium_beta_fit)

Beta regression model with varying precision parameter

  • The model we just fit has fixed and random effects on the mean mu, but only fits one global precision parameter phi
  • Does allowing phi to vary with fixed and/or random effects give us a better fit?
  • Within brm() we specify a separate formula for each model parameter, wrapped in bf()
  • phi ~ strain * year means the precision parameter has fixed effects on it too
  • Putting random effects on the precision parameter is usually overkill
  • We need to put separate prior distributions on the fixed effects of phi, using the argument dpar = phi
    • dpar stands for “distributional parameter”
fusarium_beta_fit_varyphi <- brm(
  bf(
    p ~ strain * year + (1 + year | gen),
    phi ~ strain * year
  ),
  family = Beta(link = 'logit', link_phi = 'log'),
  data = snijders.fusarium,
  prior = c(
    prior(normal(0, 5), class = b),
    prior(normal(0, 2), class = b, dpar = phi)
  ),
  chains = 4, iter = 2000, warmup = 1000, seed = 777,
  file = 'fits/fusarium_beta_fit_varyphi'
)

Note on bf() syntax

  • We will revisit shortly when we learn about zero-inflated and hurdle models
  • You could even use the distributional regression bf() syntax in a typical general linear model with normally distributed error
    • Standard deviation sigma can vary with fixed or random effects, if you want to model unequal variance among groups (see the sketch below)
  • See the brms vignette on estimating distributional models
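As an illustration, here is a hedged sketch of such a distributional Gaussian model (not fit in this lesson); the data frame mydata and the variables y, treatment, and block are hypothetical placeholders.
# Hypothetical distributional model sketch: the residual standard deviation sigma
# gets its own formula, so each treatment group can have its own variance.
# When sigma is given a formula, brms models it on the log scale by default.
unequal_variance_fit <- brm(
  bf(
    y ~ treatment + (1 | block),   # model for the mean
    sigma ~ treatment              # model for the residual SD
  ),
  family = gaussian(),
  data = mydata,
  chains = 4, iter = 2000, warmup = 1000
)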

pp_check() and model comparison

pp_check(fusarium_beta_fit_varyphi)

fusarium_beta_fit <- add_criterion(fusarium_beta_fit, 'loo')
fusarium_beta_fit_varyphi <- add_criterion(fusarium_beta_fit_varyphi, 'loo')

loo_compare(fusarium_beta_fit, fusarium_beta_fit_varyphi)

What can we conclude?

Making predictions with the beta model

  • Use the simpler model with a fixed phi
  • Estimate marginal means for each combination of strain and year
    • type = 'response' means use inverse link scale (the same scale as the original data)
  • Take contrasts between each pair of strains in each year
    • Use regrid(fusarium_emms, transform = 'log') to get the contrasts in the form of ratios, not odds ratios
(fusarium_emms <- emmeans(fusarium_beta_fit, ~ strain | year, type = 'response'))
(fusarium_contrasts <- contrast(regrid(fusarium_emms, transform = 'log'), 'pairwise'))

Credible intervals and hypothesis tests

  • Use describe_posterior() from the bayestestR package to get credible intervals for the means and pairwise differences
    • These are quantiles of the 4000 posterior samples we generated for each of those quantities
  • \(p_{MAP}\) and \(p_{d}\) are Bayesian analogues to p-values, if you are keen on testing a null hypothesis
(fusarium_emms_CI <- describe_posterior(fusarium_emms, ci = c(0.66, 0.95), ci_method = 'eti', test = NULL))
(fusarium_contrasts_CI <- describe_posterior(fusarium_contrasts, ci = c(0.66, 0.95), ci_method = 'eti', test = c('pd', 'p_map'), null = 1))

Presentation ideas: How to report this analysis in a manuscript?

  • Description of statistical model for methods section (including model comparison results)
  • Verbal description of results
  • Figure
  • Table

Methods section

We fit a GLMM with beta response distribution to the proportion Fusarium infection data. A logit link function was used for the mean parameter, and a log link for the precision parameter \(\phi\). Strain and year were treated as fixed effects, and a random intercept and random slope with respect to year were fit for each genotype. We assigned Normal(0, 5) prior distributions to the fixed effects. Hamiltonian Monte Carlo was used to sample from the posterior distribution, with four Markov chains each run for 1000 warmup iterations and 1000 sampling iterations. We verified model convergence with the R-hat statistic and graphically assessed model fit with posterior predictive check plots. We compared this model with a model that had additional fixed effects on the precision parameter \(\phi\). LOO cross-validation showed this model was no better than the model without fixed effects on \(\phi\) (\(\Delta_{ELPD}\) = 8.7 with standard error = 10.0), therefore we present the simpler model.

Methods section, continued

We obtained posterior estimates of the marginal means for each combination of strain and year, back-transformed from the logit scale, and took pairwise ratios between strain means within each year. We characterized the posterior distributions of means and ratios using the median and equal-tailed credible intervals. We tested the null hypothesis that each ratio was equal to 1 using two different Bayesian p-value analogues: probability of direction \(p_d\) and maximum a posteriori p-value \(p_{MAP}\).

Results section

Incidence of Fusarium infection was overall low except for strain F39 in 1986 and 1987, with estimated mean infection proportions of 0.24 (95% CI [0.16, 0.36]) and 0.28 ([0.19, 0.41]), respectively. The evidence is consistent with a greater infection proportion for F39 relative to the other strains, with approximately a tenfold difference in 1987 and smaller differences in the twofold to fivefold range in 1986 and 1988 (see Table).

Manuscript-quality figure

  • Modeled posterior means, credible intervals, and the raw data
  • Simultaneously shows that the model is a good fit to the data, and shows the differences between the predicted means
  • 66% and 95% credible intervals because that is about ± 1 and 2 SD, if the posterior distribution of the parameter is roughly normal
gather_emmeans_draws(fusarium_emms) %>%
  mutate(.value = plogis(.value)) %>%
  ggplot(aes(x = year, y = .value, group = interaction(strain, year))) +
    stat_interval(aes(color = strain, color_ramp = after_stat(level)), .width = c(.66, .95), position = ggpp::position_dodgenudge(width = .5, x = .03)) +
    stat_summary(fun = 'median', geom = 'point', size = 2, position = ggpp::position_dodgenudge(width = .5, x = .03)) +
    geom_point(aes(y = p, color = strain), data = snijders.fusarium, position = ggpp::position_dodgenudge(width = .5, x = -.03), alpha = .5) +
    colorblind_friendly +
    labs(y = 'proportion infected', color_ramp = 'CI level')

Manuscript-quality table

  • Pairwise ratios of proportion infection (posterior estimates and 95% CIs), with \(p_{MAP}\) values
fusarium_contrasts_CI %>%
  filter(CI == 0.95) %>%
  dplyr::select(-c(pd, CI)) %>%
  gt(groupname_col = 'year') %>%
  fmt_number(n_sigfig = 3) %>%
  cols_label(Median = 'ratio', CI_low = '95% CI lower bound', CI_high = '95% CI upper bound', p_MAP = md('p~MAP~')) %>%
  sub_zero(columns = p_MAP, zero_text = '< 0.001') %>%
  tab_options(row_group.background.color = 'slateblue3',
              column_labels.font.weight = 'bold')

Part 2. Zero-inflated Poisson model for count data with zeros

Zero-inflated and hurdle models

Images (c) Philippa Willitts and Kirby Lee/USA Today Sports

  • Gamma and Beta distributions only support positive values (Gamma: > 0; Beta: strictly between 0 and 1)
  • So far we’ve used datasets that don’t contain any zeros, so these distributions have worked very well
  • But many proportion, count, and non-negative continuous datasets contain zeros, sometimes lots of them!

“Old school” way: get rid of the zeros and transform

  • Force the dataset to fit in the “normal” box
  • Add a little fudge factor (constant), then log transform
    • Sometimes constant is reality-based: LOD/2
    • Sometimes pulled out of thin air: log(x + 1)? log(x + 0.0000001)?
  • Even without the fudge factor, a bias is introduced when transforming and back-transforming
  • Or you can simply remove the zeros from the dataset before fitting a model (acceptable if very few zeros)

Better way: mixture distributions

  • Mixture distribution: a statistical distribution made up of two or more distributions stacked on top of each other
  • Weighted sum of their density functions, so that the area under the mixture PDF curve = 1 (see the sketch below)
  • You can mix distributions from different families, at different proportions
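As a minimal illustration (not from the lesson worksheet), here is a 40/60 mixture of two normal distributions: the mixture density is the weighted sum of the two component densities, so it still integrates to 1.
# Sketch of a two-component mixture: weight 0.4 on Normal(-2, 1) and weight 0.6 on Normal(3, 1.5)
mix_density <- function(x) 0.4 * dnorm(x, mean = -2, sd = 1) + 0.6 * dnorm(x, mean = 3, sd = 1.5)
curve(mix_density, from = -6, to = 8, xlab = 'x', ylab = 'density')
# The weights sum to 1, so the area under the mixture density is still 1
integrate(mix_density, lower = -Inf, upper = Inf)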

Application of mixture distributions to datasets with zeros

  • We can use mixture distributions to model excess zeros (and sometimes ones) that can’t be fit by a standard statistical distribution
  • Let’s start with an example from the great book Statistical Rethinking by Richard McElreath

Images from Li livres dou santé (13th c.) and Miracles de Notre Dame (15th c.)

  • Medieval monks take a long time to copy manuscripts
  • Most days they work on copying manuscripts
  • On some of those days, they finish one manuscript, some days two or three, some days none at all
  • But on other days, they take a break and enjoy a beer, always finishing zero manuscripts

  • Distribution of number of manuscripts finished per day includes zeros
  • Some come from workdays when no manuscripts finished
  • The rest come from beer-drinking days
  • To model this data, we don’t need to know which zeros are which
  • Just assume that there is a greater proportion of zeros in the data than you would expect for a (Poisson) count distribution with a given mean (see the simulation sketch below)
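Here is a minimal simulation sketch of that story (the beer-day probability of 0.25 and the Poisson mean of 1 are made-up numbers for illustration).
# Simulate the monks: on each day there is a 25% chance of a beer day (zero manuscripts);
# otherwise the number of manuscripts finished is Poisson with mean 1.
set.seed(1)
n_days <- 1000
beer_day <- rbinom(n_days, size = 1, prob = 0.25)
manuscripts <- ifelse(beer_day == 1, 0, rpois(n_days, lambda = 1))
# Observed proportion of zeros exceeds what a plain Poisson with the same mean would predict
mean(manuscripts == 0)
dpois(0, lambda = mean(manuscripts))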

Zero-inflated models

  • For discrete distributions like Poisson that already have some finite probability of being zero.
  • ‘Inflate’ the probability of zero with an extra parameter.

Hurdle models

  • For continuous distributions like Gamma and Beta
  • Continuous distributions assign zero probability to any single exact value, including zero
  • Mixture of two distributions
    • “Point mass” distribution that determines the probability of zero
    • Continuous distribution that gives us the shape of the nonzero values (see the brms sketch below)
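For orientation, here is a hedged sketch of how a hurdle model for non-negative continuous data could be specified in brms; hu is the hurdle (probability of zero) parameter, and the data frame and variable names are hypothetical placeholders.
# Hypothetical hurdle gamma model sketch (mydata, y, treatment, and block are placeholders).
# hu is the hurdle part: the probability of an exact zero, modeled here with its own
# fixed effects (logit link by default); the gamma part models the nonzero values.
hurdle_fit <- brm(
  bf(
    y ~ treatment + (1 | block),   # mean of the nonzero values (log link)
    hu ~ treatment                 # probability of zero
  ),
  family = hurdle_gamma(link = 'log'),
  data = mydata,
  chains = 4, iter = 2000, warmup = 1000
)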

Count data that contains zeros: Zero Inflated Poisson (ZIP)

  • Many datasets that count the abundance of organisms have lots of zero observations
  • Poisson and negative binomial distributions are both discrete distributions that work well with count data
  • But they may not fit datasets well that have a lot of zeros
  • They already have some probability of zero; we just need an extra parameter to inflate that probability
  • Two kinds of zeros in the model: some from the main discrete distribution and others from the extra inflation component (see the brms sketch below)
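Before we get to the example dataset, here is a hedged sketch of how a zero-inflated Poisson model could be specified in brms; zi is the zero-inflation probability parameter, and the data frame and variable names are hypothetical placeholders.
# Hypothetical zero-inflated Poisson model sketch (mydata, count, treatment, and block are placeholders).
# zi is the zero-inflation part: the extra probability of a zero, modeled here with its own
# fixed effects (logit link by default); the Poisson part models the counts.
zip_fit <- brm(
  bf(
    count ~ treatment + (1 | block),   # Poisson part (log link)
    zi ~ treatment                     # zero-inflation part
  ),
  family = zero_inflated_poisson(),
  data = mydata,
  chains = 4, iter = 2000, warmup = 1000
)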

Example count dataset with zeros