A crash course in Bayesian mixed models with brms (Lesson 4)

Introduction and review

What is this class?

  • Part 4 of a practical introduction to fitting Bayesian multilevel models in R and Stan
  • Uses the brms package (Bayesian Regression Models using Stan)

How to follow the course

  • Slides and text version of lessons are online
  • Fill in code in the worksheet (replace ... with code)
  • You can always copy and paste code from text version of lesson if you fall behind

What we’ve learned in Lessons 1-3

  • Relationship between prior, likelihood, and posterior
  • How to sample from the posterior with Hamiltonian Monte Carlo
  • How to use posterior distributions from our models to make predictions and test hypotheses
  • GLMMs with non-normal response distributions and random intercepts and slopes
  • Models with residuals correlated in space and time
  • Spatial generalized additive mixed models

What will we learn in this lesson?

  • Beta regression for proportion data
  • Zero-inflated Poisson regression for count data that include zeros
  • Hurdle models for continuous and proportion data that include zeros

Big-picture concepts

  • What mixture distributions are, and how they make statistical models more flexible
  • How to allow each separate parameter of a statistical distribution to vary as a function of fixed and random effects
  • Reminder: not all of these techniques have to be done in a Bayesian framework, but in practice it’s often the only approach that works!

Part 1. Beta-regression for proportion data

Load packages and set options

library(brms)
library(agridat)
library(tidyverse)
library(emmeans)
library(tidybayes)
library(ggdist)
library(bayestestR)
library(gt)
library(fitdistrplus)

theme_set(theme_bw())
colorblind_friendly <- scale_color_manual(values = unname(palette.colors(5)[-1]), aesthetics = c('colour', 'fill'))

# IF YOU ARE RUNNING THIS CODE ON YOUR OWN MACHINE
options(mc.cores = 4, brms.file_refit = 'on_change')

# IF YOU ARE RUNNING THIS CODE ON THE POSIT CLOUD SERVER (or you have configured CmdStan on your own machine)
options(brms.backend = 'cmdstanr', mc.cores = 4, brms.file_refit = 'on_change')

Regression with proportional data

  • In Lesson 2, we modeled binary data with a logistic GLMM instead of fitting a model to the proportions
  • But sometimes you just have the proportions and not the counts, or you truly have a continuous proportion not based on counts
  • Normally distributed error won’t work because proportions are bounded between 0 and 1
    • It can work OK if most of your data is between ~0.25 and ~0.75, but that often isn’t the case in ag science

What distribution to use?

  • Binomial won’t work because it is a discrete distribution for counts of successes, not for continuous proportions
  • Gamma won’t work because it doesn’t have an upper bound
  • We need a distribution that is bounded on the [0, 1] interval and is very flexible in its shape

Beta distribution to the rescue!

  • Can be a bell-like curve, a skewed hump, or even a horseshoe, but it always goes from 0 to 1

Parameters of the beta distribution

  • Many sources use parameters \(\alpha\) and \(\beta\) (Wikipedia)
  • R function dbeta() uses those parameters but calls them shape1 and shape2
  • In brms, it is rewritten in terms of different parameters, \(\mu\) (mu) and \(\phi\) (phi)
    • mu is the mean parameter
    • phi is the precision parameter
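  • The two parameterizations are related by shape1 = mu * phi and shape2 = (1 - mu) * phi; here is a minimal sketch (not part of the worksheet code) plotting a beta density with dbeta() using that mapping
mu <- 0.3
phi <- 10
# The brms (mu, phi) parameterization maps onto dbeta()'s shape parameters:
# shape1 = mu * phi, shape2 = (1 - mu) * phi
curve(dbeta(x, shape1 = mu * phi, shape2 = (1 - mu) * phi), from = 0, to = 1,
      xlab = 'proportion', ylab = 'density')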

Mean parameterization

  • Most distributions in brms are parameterized so that the first parameter is the mean or expected value of the distribution
  • This is convenient because we can mentally keep track of effects on the mean, versus effects on other components of the distribution
  • You can put any combination of fixed and/or random effects on any parameter of any distribution in brms

Example dataset of proportions

Fusarium head blight on winter wheat: image (c) Janet Lewis/CIMMYT
  • snijders.fusarium from agridat package
  • Percentage of winter wheat leaf area infected by Fusarium fungus
  • 17 wheat genotypes, 4 Fusarium strains, 3 years

Pre-process data

  • Beta distribution requires proportions, not percentages
  • Convert gen, strain, and year to factor data type
data(snijders.fusarium)

snijders.fusarium <- snijders.fusarium %>% 
  mutate(
    across(c(gen, strain, year), factor),
    p = y/100
  )

Histograms: distribution of data for each strain

  • These shapes are typical for proportion data, and resemble beta distributions
ggplot(snijders.fusarium, aes(x = p, fill = strain)) + 
  geom_histogram(bins = 15) +
  facet_wrap(~ strain) +
  colorblind_friendly +
  theme(legend.position = 'none')

Strip plot: raw data and medians for each strain

  • Showing all the data is often preferable to a boxplot, if the dataset is small
ggplot(snijders.fusarium, aes(x = year, y = p, color = strain)) +
  geom_point(position = position_jitterdodge(jitter.width = 0.1, jitter.height = 0, dodge.width = 0.5), alpha = 0.5) +
  stat_summary(fun = 'median', geom = 'point', size = 15, shape = 95, position = position_dodge(width = 0.5)) +
  colorblind_friendly

Model formula and response family

  • p ~ strain * year + (1 + year | gen)
    • Response p on the left
    • Fixed effects for strain (categorical), year (categorical), and their interaction
    • Random intercept for each genotype, and random slope (year effects may vary by genotype)
  • family = Beta(link = 'logit', link_phi = 'log')
    • Logit link for the mean response
    • Log link for the precision parameter phi

Fit the beta regression model

fusarium_beta_fit <- brm(
  p ~ strain * year + (1 + year | gen),
  family = Beta(link = 'logit', link_phi = 'log'),
  data = snijders.fusarium,
  prior = c(
    prior(normal(0, 5), class = b)
  ),
  chains = 4, iter = 2000, warmup = 1000, seed = 555,
  file = 'fits/fusarium_beta_fit'
)

Posterior predictive check: is it a good fit?

pp_check(fusarium_beta_fit)

Beta regression model with varying precision parameter

  • The model we just fit has fixed and random effects on the mean mu, but only fits one global precision parameter phi
  • Does allowing phi to vary with fixed and/or random effects give us a better fit?
  • Within brm() we specify a separate formula for each model parameter, wrapped in bf()
  • phi ~ strain * year means the precision parameter has fixed effects on it too
  • Putting random effects on the precision parameter is usually overkill
  • We need to put separate prior distributions on the fixed effects of phi, using the argument dpar = phi
    • dpar stands for “distributional parameter”
fusarium_beta_fit_varyphi <- brm(
  bf(
    p ~ strain * year + (1 + year | gen),
    phi ~ strain * year
  ),
  family = Beta(link = 'logit', link_phi = 'log'),
  data = snijders.fusarium,
  prior = c(
    prior(normal(0, 5), class = b),
    prior(normal(0, 2), class = b, dpar = phi)
  ),
  chains = 4, iter = 2000, warmup = 1000, seed = 777,
  file = 'fits/fusarium_beta_fit_varyphi'
)

Note on bf() syntax

  • We will revisit shortly when we learn about zero-inflated and hurdle models
  • You could even use the distributional regression bf() syntax in a typical general linear model with normally distributed error
    • Standard deviation sigma can vary with fixed or random effects, if you want to model unequal variance among groups
  • See the brms vignette on estimating distributional models
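  • For instance, here is a minimal sketch (with simulated data; y, group, and dat are hypothetical, not from this lesson) of a Gaussian model where both the mean and sigma vary by group
# Hypothetical example: simulate two groups with different means and variances
set.seed(123)
dat <- data.frame(
  group = rep(c('A', 'B'), each = 50),
  y = c(rnorm(50, mean = 10, sd = 1), rnorm(50, mean = 12, sd = 4))
)

# Model both the mean and log(sigma) as functions of group
unequal_var_fit <- brm(
  bf(y ~ group, sigma ~ group),
  family = gaussian(),
  data = dat,
  chains = 4, iter = 2000, warmup = 1000, seed = 1
)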

pp_check() and model comparison

pp_check(fusarium_beta_fit_varyphi)

fusarium_beta_fit <- add_criterion(fusarium_beta_fit, 'loo')
fusarium_beta_fit_varyphi <- add_criterion(fusarium_beta_fit_varyphi, 'loo')

loo_compare(fusarium_beta_fit, fusarium_beta_fit_varyphi)

What can we conclude?

Making predictions with the beta model

  • Use the simpler model with a single, constant phi
  • Estimate marginal means for each combination of strain and year
    • type = 'response' means use inverse link scale (the same scale as the original data)
  • Take contrasts between each pair of strains in each year
    • Use regrid(fusarium_emms, transform = 'log') to get the contrasts in the form of ratios, not odds ratios
(fusarium_emms <- emmeans(fusarium_beta_fit, ~ strain | year, type = 'response'))
(fusarium_contrasts <- contrast(regrid(fusarium_emms, transform = 'log'), 'pairwise'))

Credible intervals and hypothesis tests

  • Use describe_posterior() from the bayestestR package to get credible intervals for the means and pairwise differences
    • These are quantiles of the 4000 posterior samples we generated for each of those quantities
  • \(p_{MAP}\) and \(p_{d}\) are Bayesian analogues to p-values, if you are keen on testing a null hypothesis
(fusarium_emms_CI <- describe_posterior(fusarium_emms, ci = c(0.66, 0.95), ci_method = 'eti', test = NULL))
(fusarium_contrasts_CI <- describe_posterior(fusarium_contrasts, ci = c(0.66, 0.95), ci_method = 'eti', test = c('pd', 'p_map'), null = 1))

Presentation ideas: How to report this analysis in a manuscript?

  • Description of statistical model for methods section (including model comparison results)
  • Verbal description of results
  • Figure
  • Table

Methods section

We fit a GLMM with beta response distribution to the proportion Fusarium infection data. A logit link function was used for the mean parameter, and a log link for the precision parameter \(\phi\). Strain and year were treated as fixed effects, and a random intercept and random slope with respect to year were fit for each genotype. We assigned Normal(0, 5) prior distributions to the fixed effects. Hamiltonian Monte Carlo was used to sample from the posterior distribution, with four Markov chains each run for 1000 warmup iterations and 1000 sampling iterations. We verified model convergence with the R-hat statistic and graphically assessed model fit with posterior predictive check plots. We compared this model with a model that had additional fixed effects on the precision parameter \(\phi\). LOO cross-validation showed this model was no better than the model without fixed effects on \(\phi\) (\(\Delta_{ELPD}\) = 8.7 with standard error = 10.0), therefore we present the simpler model.

Methods section, continued

We obtained posterior estimates of the marginal means for each combination of strain and year, back-transformed from the logit scale, and took pairwise ratios between strain means within each year. We characterized the posterior distributions of means and ratios using the median and equal-tailed credible intervals. We tested the null hypothesis that each ratio was equal to 1 using two different Bayesian p-value analogues: probability of direction \(p_d\) and maximum a posteriori p-value \(p_{MAP}\).

Results section

Incidence of Fusarium infection was overall low except for strain F39 in 1986 and 1987, with estimated mean infection proportions of 0.24 (95% CI [0.16, 0.36]) and 0.28 ([0.19, 0.41]), respectively. The evidence is consistent with a greater infection proportion for F39 relative to the other strains, with approximately a tenfold difference in 1987 and smaller differences in the twofold to fivefold range in 1986 and 1988 (see Table).

Manuscript-quality figure

  • Modeled posterior means, credible intervals, and the raw data
  • Simultaneously shows that the model is a good fit to the data, and shows the differences between the predicted means
  • 66% and 95% credible intervals because that is about ± 1 and 2 SD, if the posterior distribution of the parameter is roughly normal
gather_emmeans_draws(fusarium_emms) %>%
  mutate(.value = plogis(.value)) %>%
  ggplot(aes(x = year, y = .value, group = interaction(strain, year))) +
    stat_interval(aes(color = strain, color_ramp = after_stat(level)), .width = c(.66, .95), position = ggpp::position_dodgenudge(width = .5, x = .03)) +
    stat_summary(fun = 'median', geom = 'point', size = 2, position = ggpp::position_dodgenudge(width = .5, x = .03)) +
    geom_point(aes(y = p, color = strain), data = snijders.fusarium, position = ggpp::position_dodgenudge(width = .5, x = -.03), alpha = .5) +
    colorblind_friendly +
    labs(y = 'proportion infected', color_ramp = 'CI level')

Manuscript-quality table

  • Pairwise ratios of proportion infection (posterior estimates and 95% CIs), with \(p_{MAP}\) values
fusarium_contrasts_CI %>%
  filter(CI == 0.95) %>%
  dplyr::select(-c(pd, CI)) %>%
  gt(groupname_col = 'year') %>%
  fmt_number(n_sigfig = 3) %>%
  cols_label(Median = 'ratio', CI_low = '95% CI lower bound', CI_high = '95% CI upper bound', p_MAP = md('p~MAP~')) %>%
  sub_zero(columns = p_MAP, zero_text = '< 0.001') %>%
  tab_options(row_group.background.color = 'slateblue3',
              column_labels.font.weight = 'bold')

Part 2. Zero-inflated Poisson model for count data with zeros

Zero-inflated and hurdle models

Images (c) Philippa Willitts and Kirby Lee/USA Today Sports

  • The Gamma distribution supports only positive values (> 0), and the Beta distribution only values strictly between 0 and 1
  • So far we’ve used datasets that don’t contain any zeros, so these distributions have worked very well
  • But many proportion, count, and non-negative continuous datasets contain zeros, sometimes lots of them!

“Old school” way: get rid of the zeros and transform

  • Force the dataset to fit in the “normal” box
  • Add a little fudge factor (constant), then log transform
    • Sometimes constant is reality-based: LOD/2
    • Sometimes pulled out of thin air: log(x + 1)? log(x + 0.0000001)?
  • Even without the fudge factor, a bias is introduced when transforming and back-transforming
  • Or you can simply remove the zeros from the dataset before fitting a model (acceptable if very few zeros)

Better way: mixture distributions

  • Mixture distribution: a statistical distribution made up of two or more distributions stacked on top of each other
  • Weighted sum of their density functions so that the area under the PDF curve = 1
  • You can mix distributions from different families, at different proportions
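  • For example, here is a quick sketch (not from the lesson) of a 50:50 mixture of two normal densities; because the weights sum to 1, the mixture is still a proper density
# Density of a 50:50 mixture of Normal(0, 1) and Normal(4, 0.5)
dmix <- function(x, w = 0.5) w * dnorm(x, 0, 1) + (1 - w) * dnorm(x, 4, 0.5)
curve(dmix(x), from = -4, to = 7, ylab = 'density')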

Application of mixture distributions to datasets with zeros

  • We can use mixture distributions to model excess zeros (and sometimes ones) that can’t be fit by a standard statistical distribution
  • Let’s start with an example from the great book Statistical Rethinking by Richard McElreath

Images from Li livres dou santé (13th c.) and Miracles de Notre Dame (15th c.)

  • Medieval monks take a long time to copy manuscripts
  • Most days they work on copying manuscripts
  • On some of those days, they finish one manuscript, some days two or three, some days none at all
  • But on other days, they take a break and enjoy a beer, always finishing zero manuscripts

  • Distribution of number of manuscripts finished per day includes zeros
  • Some come from workdays when no manuscripts finished
  • The rest come from beer-drinking days
  • To model this data, we don’t need to know which zeros are which
  • Just assume that there is a greater proportion of zeros in the data than you would expect for a (Poisson) count distribution with a given mean
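  • A quick simulation sketch of the monk story (rates chosen arbitrarily): on drinking days zero manuscripts are finished, and on working days the count is Poisson
set.seed(1)
n_days <- 365
p_drink <- 0.2       # probability that a day is a beer-drinking day (assumed value)
lambda_work <- 1     # mean manuscripts finished on a working day (assumed value)
drink <- rbinom(n_days, size = 1, prob = p_drink)
manuscripts <- (1 - drink) * rpois(n_days, lambda_work)
table(manuscripts)   # more zeros than a plain Poisson(1) would produce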

Zero-inflated models

  • For discrete distributions like Poisson that already have some finite probability of being zero.
  • ‘Inflate’ the probability of zero with an extra parameter.

Hurdle models

  • For continuous distributions like Gamma and Beta
  • Continuous distributions do not have a probability of being exactly zero, or any other number
  • Mixture of two distributions
    • “Point mass” distribution that determines the probability of zero
    • Continuous distribution that gives us the shape of the nonzero values

Count data that contains zeros: Zero Inflated Poisson (ZIP)

  • Many datasets that count the abundance of organisms have lots of zero observations
  • Poisson and negative binomial distributions are both discrete distributions that work well with count data
  • But they may fit poorly when a dataset contains a lot of zeros
  • They already have some probability of zero; we just need an extra parameter to inflate it
  • Two kinds of zeros in the model: some from the main discrete distribution and others from the extra inflation component

Example count dataset with zeros

Beet webworm caterpillar and feeding damage: image (c) USU Extension
  • beall.webworms from agridat: classic dataset from 1940
  • Two pesticide treatments, spray and lead, applied in a 2 × 2 factorial design
  • Counted beet webworm egg masses (y) at 25 locations in each plot
  • Assuming a randomized complete block design, with each combination in each of the 4 blocks

Histograms: distribution of egg mass counts

  • Note the proportion of zeros in each treatment combination
data(beall.webworms)

ggplot(beall.webworms, aes(x = y)) +
  geom_histogram() + 
  facet_grid(spray ~ lead, labeller = labeller(spray = c(N = 'no spray', Y = 'spray'),
                                               lead = c(N = 'no lead', Y = 'lead'))) +
  scale_x_continuous(breaks = 1:9, name = 'number of egg masses')

Fitting a Poisson model to the count data

  • No zero-inflation component yet
  • Fixed effects for treatments and their interactions: spray * lead
  • Random intercepts for block and for plot (1 | block) + (1 | spray:lead:block)
  • Log link function is usually used with Poisson
webworm_pois_fit <- brm(y ~ spray * lead + (1 | block) + (1 | spray:lead:block),
                        data = beall.webworms,
                        family = poisson(link = 'log'),
                        prior = c(
                          prior(normal(0, 5), class = b)
                        ),
                        chains = 4, iter = 2000, warmup = 1000, seed = 111,
                        file = 'fits/webworm_pois_fit')

Posterior predictive check for Poisson model

  • Argument type = 'bars' is good for discrete data
  • What do we observe?
pp_check(webworm_pois_fit, type = 'bars')

Fitting a ZIP model to the count data

  • Again we will use bf() for the multi-line formula
  • Mixture models will always have at least two parameters you can set formulas for
  • zi is the zero-inflation component
    • Here we only put fixed effects on zi because the random effects on the mean are adequate
    • Must specify separate prior distributions for fixed effects with dpar = zi or default flat priors will be assigned
webworm_zip_fit <- brm(
  bf(
    y ~ spray * lead + (1 | block) + (1 | spray:lead:block),
    zi ~ spray * lead
  ),
  data = beall.webworms,
  family = zero_inflated_poisson(link = 'log', link_zi = 'logit'),
  prior = c(
    prior(normal(0, 5), class = b),
    prior(normal(0, 5), class = b, dpar = zi)
  ),
  chains = 4, iter = 2000, warmup = 1000, seed = 222,
  file = 'fits/webworm_zip_fit'
)

pp_check() the ZIP model

  • What do we observe?
  • Bonus exercise: formally compare the Poisson and ZIP models with LOO cross-validation
pp_check(webworm_zip_fit, type = 'bars')

What predictions can we make with the ZIP model?

  • Conditional mean \(\mu\) (mu): predicted value of the egg mass count for each treatment conditional on there being a non-zero number of egg masses
  • Zero-inflation probability \(p_{zi}\) (zi): predicted probability of observing zero egg masses for each treatment
  • Overall mean response \(\mu(1 - p_{zi})\): conditional mean times the probability of a nonzero response
  • emmeans() reports conditional mean by default
    • type = 'response' inverts from the link function scale (log counts) to the original data scale (counts)
  • Argument dpar = 'zi' gives us the probability of zero
    • type = 'response' inverts from the link function scale (log odds) to the original data scale (probabilities)
  • Argument epred = TRUE gives us the overall expected value
    • epred stands for “expected value of the posterior predictive distribution.”
(webworm_emms_condmean <- emmeans(webworm_zip_fit, ~ spray + lead, type = 'response'))
(webworm_emms_probzero <- emmeans(webworm_zip_fit, ~ spray + lead, type = 'response', dpar = 'zi'))
(webworm_emms_overall <- emmeans(webworm_zip_fit, ~ spray + lead, epred = TRUE))
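
  • As an optional cross-check (a sketch, not required for the analysis), you can compute essentially the same overall expected values without emmeans using tidybayes::add_epred_draws(), which averages over both the conditional mean and the zero-inflation component
# Overall expected count per treatment combination, computed directly from the model
# re_formula = NA drops the group-level effects (assumed here to match the emmeans default)
beall.webworms %>%
  distinct(spray, lead) %>%
  add_epred_draws(webworm_zip_fit, re_formula = NA) %>%
  group_by(spray, lead) %>%
  summarize(overall_mean = mean(.epred), .groups = 'drop')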

Pairwise comparisons between overall expected values

  • Tell emmeans() to treat the expected values as if they were fit on the log scale
    • regrid(..., transform = 'log')
  • type = 'response' inverts from the log count scale to the count scale
  • 'pairwise' contrasts of no-spray versus spray within each level of lead, and no-lead versus lead within each level of spray
(spray_contrasts_by_lead <- regrid(webworm_emms_overall, transform = 'log') %>% contrast(method = 'pairwise', type = 'response', by = 'lead'))
(lead_contrasts_by_spray <- regrid(webworm_emms_overall, transform = 'log') %>% contrast(method = 'pairwise', type = 'response', by = 'spray'))

Null hypothesis testing, if that’s your thing

describe_posterior(spray_contrasts_by_lead, ci_method = 'eti', ci = 0.95, test = c('p_map', 'pd'), null = 1)
describe_posterior(lead_contrasts_by_spray, ci_method = 'eti', ci = 0.95, test = c('p_map', 'pd'), null = 1)

Presentation ideas: How to report this analysis in a manuscript?

  • Description of statistical model for methods section
  • Verbal description of results
  • Figure

Methods section

We fit a zero-inflated Poisson GLMM to the webworm count data, with a log link function for the \(\lambda\) parameter and a logit link function for the zero-inflation parameter. The model included fixed effects for spray treatment, lead treatment, and their interaction, and random intercepts for block and for plot (each treatment combination within a block). We assigned Normal(0, 5) prior distributions to the fixed effects on both \(\lambda\) and the zero-inflation parameter. Hamiltonian Monte Carlo was used to sample from the posterior distribution, with four Markov chains each run for 1000 warmup iterations and 1000 sampling iterations. We verified model convergence with the R-hat statistic and graphically assessed model fit with posterior predictive check plots.

We obtained posterior estimates of the marginal means for each combination of spray and lead treatments, combining the conditional mean and zero-inflation probability. We back-transformed these means from the log scale, and took pairwise ratios between the levels of each treatment within each level of the other treatment. We characterized the posterior distributions of means and ratios using the median and equal-tailed credible intervals. We tested the null hypothesis that each ratio was equal to 1 using two different Bayesian p-value analogues: probability of direction \(p_d\) and maximum a posteriori p-value \(p_{MAP}\).

Results section

There was statistical evidence that webworms were more abundant in the unsprayed condition versus sprayed. This was true both in the presence of lead where spraying reduced mean webworm abundance by a factor of 2.07 (95% CI [1.54, 2.75]), and in the absence of lead where spraying reduced abundance 2.72 times (95% CI [2.10, 3.58]). There was more modest evidence that lead treatment reduced webworm abundance, with a greater difference in the unsprayed condition (lead reduced abundance by a factor of 1.66, 95% CI [1.30, 2.14]) than sprayed (1.26, 95% CI [0.92, 1.71]).

Manuscript-quality figure

  • As usual we want to plot the raw data in addition to the model predictions
  • This time there are too many individual raw data points to plot, so we use a histogram
gather_emmeans_draws(webworm_emms_overall) %>%
  ggplot(aes(x = interaction(spray, lead, sep = ' + '), y = .value)) +
    stat_histinterval(aes(y = y), data = beall.webworms, adjust = 5, point_interval = NULL) +
    stat_interval(aes(color = after_stat(level)), .width = c(.66, .95), position = position_nudge(x = -0.1)) +
    stat_summary(fun = 'median', geom = 'point', size = 2, position = position_nudge(x = -0.1)) +
    scale_color_brewer(name = 'CI level', palette = 'Greens') +
    scale_x_discrete(name = 'treatment', labels = c('-spray -lead', '+spray -lead', '-spray +lead', '+spray +lead')) +
    scale_y_continuous(name = 'webworm count', breaks = 0:6, limits = c(0, 6)) +
    coord_flip() +
    theme(panel.grid = element_blank(), legend.position = 'inside', legend.position.inside = c(0.9, 0.85))

A final note on modeling count data with zeros

  • Negative binomial is another discrete distribution that works with count data
  • It has an additional parameter which allows it to fit data more flexibly than Poisson
  • Often it can handle excess zeros as well as or better than a zero-inflated Poisson
  • See exercises at the end of the lesson; a minimal sketch of a negative binomial fit is shown below
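  • The sketch below shows what that alternative might look like for this dataset (model name, seed, and file chosen for illustration; fitting and comparing it is left as an exercise)
webworm_negbin_fit <- brm(
  y ~ spray * lead + (1 | block) + (1 | spray:lead:block),
  data = beall.webworms,
  family = negbinomial(link = 'log'),
  prior = c(
    prior(normal(0, 5), class = b)
  ),
  chains = 4, iter = 2000, warmup = 1000, seed = 444,
  file = 'fits/webworm_negbin_fit'
)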

Part 3. Hurdle lognormal model for continuous positive data including zeros

Non-negative continuous data with zeros

  • Common in ag and other natural sciences
    • For example the concentration of an element or compound in a sample can never be negative
    • Often it is below the limit of detection, meaning it is 0 for practical purposes
  • Gamma and lognormal distributions are great for fitting positive continuous data
  • But these distributions do not accommodate values of exactly 0 (or any exact value for that matter)

Old-school way, revisited

  • Do a log(x + fudge_factor) transformation and fit a model with normal error
  • Slightly better: add a fudge factor and fit a Gamma or lognormal model
  • Both ways introduce bias

Better way: Hurdle model

  • Similar in interpretation to a zero-inflated model
  • But instead of inflating a pre-existing probability of zero, we are adding some probability of zero to a distribution that ordinarily cannot be exactly zero
  • The distribution has to “jump over a hurdle” to reach a non-zero value
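  • To make the idea concrete, here is a sketch of a hurdle-lognormal density (numbers chosen arbitrarily): with probability hu the value is exactly zero, and with probability 1 - hu it comes from a lognormal distribution
hu <- 0.3   # probability of exactly zero, i.e. failing to clear the hurdle
# Density of the nonzero part, scaled by the probability of clearing the hurdle
curve((1 - hu) * dlnorm(x, meanlog = 0, sdlog = 1), from = 0.01, to = 10,
      xlab = 'y', ylab = 'density of the nonzero part')
# P(y = 0) = hu; P(y > 0) = 1 - hu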

Example continuous dataset with zeros

  • Edited and anonymized data from an ARS immunologist
  • Immunoassay: how effective is a vaccine at stimulating immune response in chickens?
    • Vaccinated chickens should have more antibodies in their bloodstream that bind to microorganism’s antigens
    • Mix chicken blood with chemically labeled antigen that releases light when bound
    • Measure chemiluminescence as an indicator of immune response

Image (c) NABAS

Chicken immunoassay data

  • Blood from 10 vaccinated and 10 unvaccinated birds exposed to four different antigens
    • “Split plot” design: aliquots of blood from the same bird were exposed to each of the four antigens
  • Many samples, especially from unvaccinated chickens, produced no detectable chemiluminescence

Strip plot: distribution of chemiluminescence for each antigen

  • Data points are jittered to left and right so you can see the many zeros
immunoassay <- read_csv('data/immunoassay.csv')

ggplot(immunoassay, aes(x = antigen, y = chemiluminescence, color = vaccination, group = vaccination)) +
  geom_point(position = position_jitterdodge(jitter.width = 0.1, jitter.height = 0, dodge.width = 0.5), size = 2, alpha = 0.5) +
  stat_summary(geom = 'point', fun = 'median', shape = 95, size = 10, position = position_dodge(width = 0.5)) +
  colorblind_friendly

Which statistical distribution to fit?

  • Use a hurdle model because the dataset is ~50% zeros
  • brms supports hurdle Gamma and hurdle lognormal for continuous positive data with zeros
  • Fit lognormal and Gamma distributions to the nonzero values of chemiluminescence
    • fitdist() from fitdistrplus package (maximum likelihood)
    • Plot density functions over a scaled histogram of the observations
    • Orange is Gamma and blue is lognormal
nonzero_values <- immunoassay$chemiluminescence[immunoassay$chemiluminescence > 0]
gamma_fit <- fitdist(nonzero_values, 'gamma')
lnorm_fit <- fitdist(nonzero_values, 'lnorm')

ggplot(immunoassay %>% filter(chemiluminescence > 0), aes(x = chemiluminescence)) +
  geom_histogram(aes(y = after_stat(density)), bins = 20) +
  stat_function(fun = dgamma, args = list(shape = gamma_fit$estimate[1], rate = gamma_fit$estimate[2]), linewidth = 2, color = 'orange') +
  stat_function(fun = dlnorm, args = list(meanlog = lnorm_fit$estimate[1], sdlog = lnorm_fit$estimate[2]), linewidth = 2, color = 'slateblue2') 
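
  • If you prefer a numerical comparison to supplement the visual one, fitdist() objects also store AIC values (a quick check; lower is better)
# Compare fit of the two candidate distributions to the nonzero values
c(gamma = gamma_fit$aic, lognormal = lnorm_fit$aic)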

Fitting a hurdle lognormal model

  • Fixed effects for vaccination status, antigen, and their interaction: vaccination * antigen
  • Fixed effects on both the mean response and the hurdle hu (proportion of zeros)
  • Random intercept (1 | chickID) only on the mean response
  • sigma (standard deviation of the lognormal) has no fixed or random effects
immuno_hurdlelognorm_fit <- brm(
  bf(chemiluminescence ~ vaccination * antigen + (1 | chickID), 
     hu ~ vaccination * antigen
  ),
  data = immunoassay, 
  family = hurdle_lognormal(link = 'identity', link_sigma = 'log', link_hu = 'logit'),
  prior = c(
    prior(normal(0, 3), class = 'b'),
    prior(normal(0, 3), class = 'b', dpar = 'hu')
  ),
  chains = 4, iter = 2000, warmup = 1000, seed = 333,
  file = 'fits/immuno_hurdlelognorm_fit'
)

Posterior predictive check for hurdle lognormal

  • trans = 'pseudo_log' allows us to show zeros on a log scale
pp_check(immuno_hurdlelognorm_fit) + scale_x_continuous(trans = 'pseudo_log', breaks = c(0, 10, 1000, 100000))

Making predictions with the hurdle lognormal model

  • epred (expected value of the posterior predictive): mean of non-zeros × probability of non-zero
  • Contrast between nonvaccinated and vaccinated within each antigen = estimate of how much the vaccine promotes immune response above natural baseline
    • 'revpairwise' (reverse pairwise) means subtract vaccinated - nonvaccinated
  • Null hypothesis tests if you want them
(immuno_emmeans <- emmeans(immuno_hurdlelognorm_fit, ~ vaccination | antigen, epred = TRUE))
(immuno_contrasts <- contrast(immuno_emmeans, 'revpairwise'))
describe_posterior(immuno_contrasts, ci_method = 'eti', ci = 0.95, test = c('p_map', 'pd'), null = 0)

Note on sensitivity of results to priors

  • Some groups in our data have only 0 values
  • We have very little information about the natural variability of immune response in those groups (though we can say it’s low, we don’t know how low)
  • The prior distributions help fill in that gap of information, but it makes the results sensitive to our choice of prior
  • We used normal(0, 3) but what if we used normal(0, 1)? normal(0, 10)? Etc.
  • See exercises at the end of this lesson; a rough sketch of one such check is shown below
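  • The sketch below (fit name and file chosen for illustration) refits the model with narrower Normal(0, 1) priors so you can compare the estimated means
immuno_hurdlelognorm_fit_narrowprior <- brm(
  bf(chemiluminescence ~ vaccination * antigen + (1 | chickID), 
     hu ~ vaccination * antigen
  ),
  data = immunoassay, 
  family = hurdle_lognormal(link = 'identity', link_sigma = 'log', link_hu = 'logit'),
  prior = c(
    prior(normal(0, 1), class = 'b'),
    prior(normal(0, 1), class = 'b', dpar = 'hu')
  ),
  chains = 4, iter = 2000, warmup = 1000, seed = 333,
  file = 'fits/immuno_hurdlelognorm_fit_narrowprior'
)

# Do the posterior means shift noticeably under the narrower prior?
emmeans(immuno_hurdlelognorm_fit_narrowprior, ~ vaccination | antigen, epred = TRUE)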

Presentation ideas: How to report this analysis in a manuscript?

  • Description of statistical model for methods section
  • Verbal description of results
  • Figure

Methods section

We fit a Bayesian lognormal GLMM to the chemiluminescence data from the immunoassay. The model included a hurdle component to accommodate the zeros in the response. There was an identity link function for the mean parameter, a log link function for the standard deviation, and a logit link function for the hurdle parameter. The model included fixed effects for vaccination status, antigen type, and their interaction, and random intercepts for each individual bird. The hurdle component was fit with fixed effects as well. The lognormal distribution was chosen by visually comparing the fit of the lognormal and Gamma distributions to the nonzero values in the data. We assigned Normal(0, 3) prior distributions to the fixed effects on both the mean and the hurdle parameter. Hamiltonian Monte Carlo was used to sample from the posterior distribution, with four Markov chains each run for 1000 warmup iterations and 1000 sampling iterations. We verified model convergence with the R-hat statistic and graphically assessed model fit with posterior predictive check plots.

Methods section, continued

We obtained posterior estimates of the marginal means for each combination of vaccination status and antigen, combining the conditional mean and hurdle components. We took pairwise differences between the unvaccinated and vaccinated means within each antigen. We characterized the posterior distributions of means and mean differences using the median and equal-tailed credible intervals. We tested the null hypothesis that each difference was equal to 0, indicating no response to vaccination, using two different Bayesian p-value analogues: probability of direction \(p_d\) and maximum a posteriori p-value \(p_{MAP}\).

Results

There was evidence that each antigen provoked a stronger immune response in vaccinated birds, but the uncertainty in the strength of this effect varied by antigen. The largest difference in response between vaccinated and unvaccinated birds was observed in response to the antigen FlgK (92200, 95% credible interval [26100, 179000]) and the smallest in response to the antigen FimA (24200, [5600, 47000]).

Manuscript-quality figure

gather_emmeans_draws(immuno_emmeans, value = 'chemiluminescence') %>%
  ggplot(aes(x = vaccination, y = chemiluminescence, color = vaccination)) +
    facet_wrap(~ antigen) +
    stat_interval(aes(color_ramp = after_stat(level)), .width = c(.66, .95), position = position_nudge(x = 0.15)) +
    stat_summary(fun = 'median', geom = 'point', color = 'black', size = 2, position = position_nudge(x = 0.15)) +
    geom_point(data = immunoassay, position = position_jitter(width = 0.05, height = 0), size = 1, alpha = 0.5) +
    colorblind_friendly +
    labs(color_ramp = 'CI level') +
    guides(color = 'none') +
    theme(strip.background = element_blank(),
          legend.position = 'inside', legend.position.inside = c(0.07, 0.87),
          legend.background = element_rect(color = 'black'))

Part 4. Zero-one-inflated beta models for proportion data including zeros and ones

Proportion data with zeros and ones

  • Lots of proportion datasets contain values exactly 0 (0%) and/or 1 (100%)
  • Beta distribution is continuous so it cannot be exactly 0 or exactly 1
  • We can fudge it if there are only a few 0s and 1s in the dataset

Zero-and-one-inflated beta (ZOIB) model

  • Beta distribution with a probability of observing exactly 0 and/or exactly 1
  • Technically this is a hurdle model because it’s modifying a continuous distribution
  • brms has two response families: zero_inflated_beta and zero_one_inflated_beta
  • We will go straight to the more complex zero-one-inflation case

Parameters of the ZOIB model

  • Four distributional parameters, all of which can have fixed and/or random effects
    • mean mu
    • precision parameter phi
    • zero-one-inflation parameter zoi: the probability that the response is exactly 0 or exactly 1
    • conditional one-inflation parameter coi: the probability that the response is 1, given that it is 0 or 1
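  • Under this parameterization the overall expected value combines all of the pieces; here is a quick sketch of the arithmetic
# Expected value of a zero-one-inflated beta response:
# P(y = 1) = zoi * coi, P(y = 0) = zoi * (1 - coi), otherwise y ~ Beta with mean mu
zoib_mean <- function(mu, zoi, coi) zoi * coi + (1 - zoi) * mu
zoib_mean(mu = 0.6, zoi = 0.2, coi = 0.5)   # example values chosen arbitrarily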

Example proportion dataset with zeros and ones

Saline soil where alfalfa is cultivated: image (c) Alforex Seeds
  • carlson.germination from agridat package
  • Effect of salt concentration on the germination of alfalfa seeds of different genotypes
  • 8 levels of salt concentration (nacl) from 0% to 2%
  • 15 alfalfa genotypes (gen)
  • We know the percentage of germinated seeds (germ) for each combination of nacl and gen

Line plot: effect of salt on germination rate by genotype

  • Convert germ from a percentage to a proportion for use with the Beta distribution
  • There are many 0 and 1 values in the dataset
data(carlson.germination)

carlson.germination <- carlson.germination %>%
  mutate(germ = germ / 100)

ggplot(carlson.germination, aes(x = nacl, y = germ, group = gen)) + geom_line()

Fitting the ZOIB model

  • Formula for mean response: germ ~ nacl + (1 | gen)
    • Consider genotype as a random effect for this example
    • Don’t allow the effect of salt to vary by genotype, only the baseline germination rate (random intercept but no random slope)
  • Allow zoi and coi to vary by salt concentration
  • No fixed or random effects on precision parameter phi
  • All default link functions used for the response family zero_one_inflated_beta()
carlson_zoibeta <- brm(
  bf(
    germ ~ nacl + (1 | gen),
    zoi ~ nacl,
    coi ~ nacl
  ),
  family = zero_one_inflated_beta(link = 'logit', link_phi = 'log', link_zoi = 'logit', link_coi = 'logit'),
  prior = c(
    prior(normal(0, 2), class = b),
    prior(normal(0, 2), class = b, dpar = zoi),
    prior(normal(0, 2), class = b, dpar = coi)
  ),
  data = carlson.germination,
  file = 'fits/carlson_zoibeta'
)

Simpler ZOIB model

  • No fixed effects on zero-inflation zoi or one-inflation coi
carlson_zoibeta_simpler <- brm(
  bf(
    germ ~ nacl + (1 | gen)
  ),
  family = zero_one_inflated_beta(link = 'logit', link_phi = 'log', link_zoi = 'logit', link_coi = 'logit'),
  prior = c(
    prior(normal(0, 2), class = b)
  ),
  data = carlson.germination,
  file = 'fits/carlson_zoibeta_simpler'
)

Leave-one-out cross-validation to compare ZOIB models

carlson_zoibeta <- add_criterion(carlson_zoibeta, 'loo')
carlson_zoibeta_simpler <- add_criterion(carlson_zoibeta_simpler, 'loo')

loo_compare(carlson_zoibeta, carlson_zoibeta_simpler)

What can we conclude?

Posterior predictive check for ZOIB model

pp_check(carlson_zoibeta)

Making predictions with the ZOIB model

  • summary() method shows the slope parameter estimates
  • Do hypothesis testing directly on those parameters using describe_posterior()
  • Coefficient on nacl: how much does the log-odds of germination change for every 1-unit increase in salt concentration?
    • If salt concentration goes from 0% to 1%, the odds of germination decrease by a factor of \(e^{4.75}\), or ~116-fold
summary(carlson_zoibeta)

describe_posterior(carlson_zoibeta, ci_method = 'eti', test = c('pd', 'p_map'))
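
  • For example (a quick sketch), you can pull the posterior estimate of the nacl coefficient with fixef() and exponentiate it
# The nacl slope is on the log-odds scale; exponentiating its negative gives the fold-decrease in odds
b_nacl <- fixef(carlson_zoibeta)['nacl', 'Estimate']
exp(-b_nacl)   # odds of germination decrease ~116-fold per 1-unit increase in nacl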

Presentation ideas: How to report this analysis in a manuscript?

  • Description of statistical model for methods section
  • Verbal description of results
  • Figure

Methods section

We fit a zero-one-inflated beta hurdle model to the percent germination data. NaCl concentration was treated as a continuous fixed effect, and genotype as a categorical random effect. We used a logit link for the response, a log link for the precision parameter \(\phi\), and logit links for the zero-one-inflation and conditional one-inflation parameters. We fit a single precision parameter, but the zero-one-inflation and conditional one-inflation parameters were modeled as varying by NaCl concentration. Weakly informative Normal(0, 2) priors were put on the slope parameters. Hamiltonian Monte Carlo was used to sample from the posterior distribution, with four Markov chains each run for 1000 warmup iterations and 1000 sampling iterations. We verified model convergence with the R-hat statistic and graphically assessed model fit with posterior predictive check plots. We generated posterior distributions of fitted values from the model at a grid of evenly spaced concentration values, accounting for the conditional mean, zero-one-inflation probability, and conditional one-inflation probability.

Results

  • The percentages given below are extracted from the prediction object I created for the figure

The percent germination decreased strongly with increasing NaCl concentration; the posterior estimate for 0% NaCl was 99.3% (95% CI [94.5%, 99.9%]), for 1% NaCl 78.7% [69.6%, 85.4%], and for 2% NaCl 4.1% [2.8%, 6.1%].

Manuscript-quality figure

  • Instead of showing slope coefficients which may be confusing, generate predictions instead
  • Generate epred value at an evenly spaced set of concentration values to get overall trend
  • This accounts for the zero-inflation and one-inflation probabilities but not the random genotype effects
germ_pred <- data.frame(nacl = seq(0, 2, by = 0.1)) %>%
  add_epred_draws(carlson_zoibeta, re_formula = ~ 0)

ggplot(carlson.germination, aes(x = nacl, y = germ)) + 
  stat_lineribbon(aes(y = .epred), data = germ_pred) +
  geom_line(aes(group = gen), alpha = 0.3) +
  scale_fill_brewer(palette = 'Greens', name = 'CI level') +
  scale_x_continuous(name = 'NaCl concentration') +
  scale_y_continuous(name = 'Germination', labels = scales::percent) +
  theme(legend.position = 'bottom') 

Conclusion

What did we learn to do today?

  • Fit a beta regression model to proportion data with and without zeros and ones
  • Fit a zero-inflated Poisson GLMM to count data that includes zeros
  • Fit a hurdle lognormal model to continuous non-negative data that includes zeros

Enjoy the feeling of empowerment!!!

  • See text version of lesson for further reading and useful resources
  • Please provide feedback to help me improve this course!
    • See course homepage for link to feedback form