A crash course in Bayesian mixed models with brms (Lesson 4)
Introduction and review
What is this class?
Part 4 of a practical introduction to fitting Bayesian multilevel models in R and Stan
Uses the brms package (Bayesian Regression Models using Stan)
How to follow the course
Slides and text version of lessons are online
Fill in code in the worksheet (replace ... with code)
You can always copy and paste code from text version of lesson if you fall behind
What we’ve learned in Lessons 1-3
Relationship between prior, likelihood, and posterior
How to sample from the posterior with Hamiltonian Monte Carlo
How to use posterior distributions from our models to make predictions and test hypotheses
GLMMs with non-normal response distributions and random intercepts and slopes
Models with residuals correlated in space and time
Spatial generalized additive mixed models
What will we learn in this lesson?
Beta regression for proportion data
Zero-inflated Poisson regression for count data that include zeros
Hurdle models for continuous and proportion data that include zeros
Big-picture concepts
What are mixture distributions, and how they make stat models more flexible
How to allow each separate parameter of a statistical distribution to vary as a function of fixed and random effects
Reminder: not all these things have to be done Bayesian, but it’s often the only way that works in practice!
Part 1. Beta-regression for proportion data
Load packages and set options
library(brms)library(agridat)library(tidyverse)library(emmeans)library(tidybayes)library(ggdist)library(bayestestR)library(gt)library(fitdistrplus)theme_set(theme_bw())colorblind_friendly <-scale_color_manual(values =unname(palette.colors(5)[-1]), aesthetics =c('colour', 'fill'))# IF YOU ARE RUNNING THIS CODE ON YOUR OWN MACHINEoptions(mc.cores =4, brms.file_refit ='on_change')# IF YOU ARE RUNNING THIS CODE ON THE POSIT CLOUD SERVER (or you have configured CmdStan on your own machine)options(brms.backend ='cmdstanr', mc.cores =4, brms.file_refit ='on_change')
Regression with proportional data
In Lesson 2, we modeled binary data with a logistic GLMM instead of fitting a model to the proportions
But sometimes you just have the proportions and not the counts, or you truly have a continuous proportion not based on counts
Normally distributed error won’t work because proportions are bounded between 0 and 1
It can work OK if most of your data is between ~0.25 and ~0.75, but that often isn’t the case in ag science
What distribution to use?
Binomial won’t work because it has only two discrete values (0 and 1), not continuous
Gamma won’t work because it doesn’t have an upper bound
We need a distribution that is bounded on the [0, 1] interval and is very flexible in its shape
Beta distribution to the rescue!
Can be a bell-like curve, a skewed hump, or even a horseshoe, but it always goes from 0 to 1
Parameters of the beta distribution
Many sources use parameters \(\alpha\) and \(\beta\) (Wikipedia)
R function dbeta() uses those parameters but calls them shape1 and shape2
In brms, it is rewritten in terms of different parameters, \(\mu\) (mu) and \(\phi\) (phi)
mu is the mean parameter
phi is the precision parameter
Mean parameterization
Most distributions in brms are parameterized so that the first parameter is the mean or expected value of the distribution
This is convenient because we can mentally keep track of effects on the mean, versuseffects on other components of the distribution
You can put any combination of fixed and/or random effects on any parameter of any distribution in brms
Example dataset of proportions
Fusarium head blight on winter wheat: image (c) Janet Lewis/CIMMYT
snijders.fusarium from agridat package
Percentage of winter wheat leaf area infected by Fusarium fungus
17 wheat genotypes, 4 Fusarium strains, 3 years
Pre-process data
Beta distribution requires proportions, not percentages
Use describe_posterior() from the bayestestR package to get credible intervals for the means and pairwise differences
These are quantiles of the 4000 posterior samples we generated for each of those quantities
\(p_{MAP}\) and \(p_{d}\) are Bayesian analogues to p-values, if you are keen on testing a null hypothesis
(fusarium_emms_CI <-describe_posterior(fusarium_emms, ci =c(0.66, 0.95), ci_method ='eti', test =NULL))(fusarium_contrasts_CI <-describe_posterior(fusarium_contrasts, ci =c(0.66, 0.95), ci_method ='eti', test =c('pd', 'p_map'), null =1))
Presentation ideas: How to report this analysis in a manuscript?
Description of statistical model for methods section (including model comparison results)
Verbal description of results
Figure
Table
Methods section
We fit a GLMM with beta response distribution to the proportion Fusarium infection data. A logit link function was used for the mean parameter, and a log link for the precision parameter \(\phi\). Strain and year were treated as fixed effects, and a random intercept and random slope with respect to year were fit for each genotype. We assigned Normal(0, 5) prior distributions to the fixed effects. Hamiltonian Monte Carlo was used to sample from the posterior distribution, with four Markov chains each run for 1000 warmup iterations and 1000 sampling iterations. We verified model convergence with the R-hat statistic and graphically assessed model fit with posterior predictive check plots. We compared this model with a model that had additional fixed effects on the precision parameter \(\phi\). LOO cross-validation showed this model was no better than the model without fixed effects on \(\phi\) (\(\Delta_{ELPD}\) = 8.7 with standard error = 10.0), therefore we present the simpler model.
Methods section, continued
We obtained posterior estimates of the marginal means for each combination of strain and year, back-transformed from the logit scale, and took pairwise ratios between strain means within each year. We characterized the posterior distributions of means and ratios using the median and equal-tailed credible intervals. We teested the null hypothesis that each ratio was equal to 1 using two different Bayesian p-value analogues: probability of direction \(p_d\) and maximum a posteriori p-value \(p_{MAP}\).
Results section
Incidence of Fusarium infection was overall low except for strain F39 in 1986 and 1987, with estimated mean infection proportions 0.24 with 95% CI [0.16, 0.36] and 0.28 [0.19, 0.41] respectively. The evidence is consistent with a greater proportion infection for F39 relative to other strains, with approximately a tenfold difference in 1987 and smaller differences in the twofold to fivefold range in 1986 and 1988 (see Table).
Manuscript-quality figure
Modeled posterior means, credible intervals, and the raw data
Simultaneously shows that the model is a good fit to the data, and shows the differences between the predicted means
66% and 95% credible intervals because that is about ± 1 and 2 SD, if the posterior distribution of the parameter is roughly normal
gather_emmeans_draws(fusarium_emms) %>%mutate(.value =plogis(.value)) %>%ggplot(aes(x = year, y = .value, group =interaction(strain, year))) +stat_interval(aes(color = strain, color_ramp =after_stat(level)), .width =c(.66, .95), position = ggpp::position_dodgenudge(width = .5, x = .03)) +stat_summary(fun ='median', geom ='point', size =2, position = ggpp::position_dodgenudge(width = .5, x = .03)) +geom_point(aes(y = p, color = strain), data = snijders.fusarium, position = ggpp::position_dodgenudge(width = .5, x =-.03), alpha = .5) + colorblind_friendly +labs(y ='proportion infected', color_ramp ='CI level')
Manuscript-quality table
Pairwise ratios of proportion infection (posterior estimates and 95% CIs), with \(p_{MAP}\) values
Pairwise comparisons between overall expected values
Tell emmeans() to treat the expected values as if they were fit on the log scale
regrid(..., transform = 'log')
type = 'response' inverts from the log count scale to the count scale
'pairwise' contrasts of no-spray versus spray within each level of lead, and no-lead versus lead within each level of spray
(spray_contrasts_by_lead <-regrid(webworm_emms_overall, transform ='log') %>%contrast(method ='pairwise', type ='response', by ='lead'))(lead_contrasts_by_spray <-regrid(webworm_emms_overall, transform ='log') %>%contrast(method ='pairwise', type ='response', by ='spray'))
Null hypothesis testing, if that’s your thing
describe_posterior(spray_contrasts_by_lead, ci_method ='eti', ci =0.95, test =c('p_map', 'pd'), null =1)describe_posterior(lead_contrasts_by_spray, ci_method ='eti', ci =0.95, test =c('p_map', 'pd'), null =1)
Presentation ideas: How to report this analysis in a manuscript?
Description of statistical model for methods section
Verbal description of results
Figure
Methods section
We fit a zero-inflated Poisson GLMM to the webworm count data, with a log link function for the \(\lambda\) parameter and a logit link function for the zero-inflation parameter. The model included fixed effects for spray treatment, lead treatment, and their interactions, and random intercepts for block nested within treatment combination. We assigned Normal(0, 5) prior distributions to the fixed effects on both \(\lambda\) and the zero-inflation parameter. Hamiltonian Monte Carlo was used to sample from the posterior distribution, with four Markov chains each run for 1000 warmup iterations and 1000 sampling iterations. We verified model convergence with the R-hat statistic and graphically assessed model fit with posterior predictive check plots.
We obtained posterior estimates of the marginal means for each treatment combination of strain and year, combining the conditional mean and zero-inflation probability. We back-transformed these means from the log scale, and took pairwise ratios between strain means within each year. We characterized the posterior distributions of means and ratios using the median and equal-tailed credible intervals. We teested the null hypothesis that each ratio was equal to 1 using two different Bayesian p-value analogues: probability of direction \(p_d\) and maximum a posteriori p-value \(p_{MAP}\).
Results section
There was statistical evidence that webworms were more abundant in the unsprayed condition versus sprayed. This was true both in the presence of lead where spraying reduced mean webworm abundance by a factor of 2.07 (95% CI [1.54, 2.75]), and in the absence of lead where spraying reduced abundance 2.72 times (95% CI [2.10, 3.58]). There was more modest evidence that lead treatment reduced webworm abundance, with a greater difference in the unsprayed condition (lead reduced abundance by a factor of 1.66, 95% CI [1.30, 2.14]) than sprayed (1.26, 95% CI [0.92, 1.71]).
Manuscript-quality figure
As usual we want to plot the raw data in addition to the model predictions
This time there are too many individual raw data points to plot, so we use a histogram
Negative binomial is another discrete distribution that works with count data
It has an additional parameter which allows it to fit data more flexibly than Poisson
Often it can handle excess zeros as well as or better than a zero-inflated Poisson
See exercises at the end of the lesson
Part 3. Hurdle lognormal model for continuous positive data including zeros
Non-negative continuous data with zeros
Common in ag and other natural sciences
For example the concentration of an element or compound in a sample can never be negative
Often it is below the limit of detection, meaning it is 0 for practical purposes
Gamma and lognormal distributions are great for fitting positive continuous data
But these distributions do not accommodate values of exactly 0 (or any exact value for that matter)
Old-school way, revisited
Do a log(x + fudge_factor) transformation and fit a model with normal error
Slightly better: add a fudge factor and fit a Gamma or lognormal model
Both ways introduce bias
Better way: Hurdle model
Similar in interpretation to a zero-inflated model
But instead of inflating a pre-existing probability of zero, we are adding some probability of zero to a distribution that ordinarily cannot be exactly zero
The distribution has to “jump over a hurdle” to reach a non-zero value
Example continuous dataset with zeros
Edited and anonymized data from an ARS immunologist
Immunoassay: how effective is a vaccine at stimulating immune response in chickens?
Vaccinated chickens should have more antibodies in their bloodstream that bind to microorganism’s antigens
Mix chicken blood with chemically labeled antigen that releases light when bound
Measure chemiluminescence as an indicator of immune response
Image (c) NABAS
Chicken immunoassay data
Blood from 10 vaccinated and 10 unvaccinated birds exposed to four different antigens
“Split plot” design: aliquot of blood from the same bird was exposed to each antigen
Many samples, especially from unvaccinated chickens, produced no detectable chemiluminescence
Strip plot: distribution of chemiluminescence for each antigen
Data points are jittered to left and right so you can see the many zeros
immunoassay <-read_csv('data/immunoassay.csv')ggplot(immunoassay, aes(x = antigen, y = chemiluminescence, color = vaccination, group = vaccination)) +geom_point(position =position_jitterdodge(jitter.width =0.1, jitter.height =0, dodge.width =0.5), size =2, alpha =0.5) +stat_summary(geom ='point', fun ='median', shape =95, size =10, position =position_dodge(width =0.5)) + colorblind_friendly
Which statistical distribution to fit?
Use a hurdle model because the dataset is ~50% zeros
brms supports hurdle Gamma and hurdle lognormal for continuous positive data with zeros
Fit lognormal and Gamma distributions to the nonzero values of chemiluminescence
fitdist() from fitdistrplus package (maximum likelihood)
Plot density functions over a scaled histogram of the observations
Making predictions with the hurdle lognormal model
epred (expected value of the posterior predictive): mean of non-zeros × probability of non-zero
Contrast between nonvaccinated and vaccinated within each antigen = estimate of how much the vaccine promotes immune response above natural baseline
'revpairwise' (reverse pairwise) means subtract vaccinated - nonvaccinated
Null hypothesis tests if you want them
(immuno_emmeans <-emmeans(immuno_hurdlelognorm_fit, ~ vaccination | antigen, epred =TRUE))(immuno_contrasts <-contrast(immuno_emmeans, 'revpairwise'))describe_posterior(immuno_contrasts, ci_method ='eti', ci =0.95, test =c('p_map', 'pd'), null =0)
Note on sensitivity of results to priors
Some groups in our data have only 0 values
We have very little information about the natural variability of immune response in those groups (though we can say it’s low, we don’t know how low)
The prior distributions help fill in that gap of information, but it makes the results sensitive to our choice of prior
We used normal(0, 3) but what if we used normal(0, 1)? normal(0, 10)? Etc.
See exercises at the end of this lesson
Presentation ideas: How to report this analysis in a manuscript?
Description of statistical model for methods section
Verbal description of results
Figure
Methods section
We fit a Bayesian lognormal GLMM to the chemiluminescence data from the immunoassay. The model included a hurdle component to accommodate the zeros in the response. There was an identity link function for the mean parameter, a log link function for the standard deviation, and a logit link function for the hurdle parameter. The model included fixed effects for vaccination status, antigen type, and their interaction, and random intercepts for each individual bird. The hurdle component was fit with fixed effects as well. The lognormal distribution was chosen by visually comparing the fit of the lognormal and Gamma distributions to the nonzero values in the data. We assigned Normal(0, 3) prior distributions to the fixed effects on both the mean and the zero-inflation parameter. Hamiltonian Monte Carlo was used to sample from the posterior distribution, with four Markov chains each run for 1000 warmup iterations and 1000 sampling iterations. We verified model convergence with the R-hat statistic and graphically assessed model fit with posterior predictive check plots.
Methods section, continued
We obtained posterior estimates of the marginal means for each combination of vaccination status and antigen, combining the conditional mean and hurdle components. We took pairwise differences between the unvaccinated and vaccinated means within each antigen. We characterized the posterior distributions of means and mean differences using the median and equal-tailed credible intervals. We tested the null hypothesis that each difference was equal to 0, indicating no response to vaccination, using two different Bayesian p-value analogues: probability of direction \(p_d\) and maximum a posteriori p-value \(p_{MAP}\).
Results
There was evidence that each antigen provoked a stronger immune response in vaccinated birds, but the uncertainty in the strength of this effect varied by antigen. The largest difference in response between vaccinated and unvaccinated birds was observed in response to the antigen FlgK (92200, 95% credible interval [26100, 179000]) and the smallest in response to the antigen FimA (24200, [5600, 47000]).
Manuscript-quality figure
gather_emmeans_draws(immuno_emmeans, value ='chemiluminescence') %>%ggplot(aes(x = vaccination, y = chemiluminescence, color = vaccination)) +facet_wrap(~ antigen) +stat_interval(aes(color_ramp =after_stat(level)), .width =c(.66, .95), position =position_nudge(x =0.15)) +stat_summary(fun ='median', geom ='point', color ='black', size =2, position =position_nudge(x =0.15)) +geom_point(data = immunoassay, position =position_jitter(width =0.05, height =0), size =1, alpha =0.5) + colorblind_friendly +labs(color_ramp ='CI level') +guides(color ='none') +theme(strip.background =element_blank(),legend.position =c(0.07, 0.87),legend.background =element_rect(color ='black'))
Part 4. Zero-one-inflated beta models for proportion data including zeros and ones
summary() method shows the slope parameter estimates
Do hypothesis testing directly on those parameters using describe_posterior()
Coefficient on nacl: how much does log-odds of germination change for every 1-unit increase in salt?
If salt concentration goes from 0% to 1%, odds of germination decreases by a factor of \(e^{4.75}\) or ~116-fold
summary(carlson_zoibeta)describe_posterior(carlson_zoibeta, ci_method ='eti', test =c('pd', 'p_map'))
Presentation ideas: How to report this analysis in a manuscript?
Description of statistical model for methods section
Verbal description of results
Figure
Methods section
We fit a zero-one-inflated beta hurdle model to the percent germination data. NaCl concentration was treated as a continuous fixed effect, and genotype as a categorical random effect. We used a logit link for the response, a log link for the dispersion parameter, and logit links for the zero-inflation and one-inflation hurdle parameters. We fit a single dispersion parameter, but the zero-inflation parameter and one-inflation parameter were modeled as varying by NaCl concentration. Weakly informative Normal(0, 2) priors were put on the slope parameters. Hamiltonian Monte Carlo was used to sample from the posterior distribution, with four Markov chains each run for 1000 warmup iterations and 1000 sampling iterations. We verified model convergence with the R-hat statistic and graphically assessed model fit with posterior predictive check plots. We generated posterior distributions of fitted values from the model at a grid of evenly spaced concentration values, accounting for the conditional mean, zero-inflation probability, and one-inflation probability.
Results
The percentages given below are extracted from the prediction object I created for the figure
The percent germination decreased strongly with increasing NaCl concentration; the posterior estimate for 0% NaCl was 99.3% (95% CI [94.5%, 99.9%]), for 1% NaCl 78.7% [69.6%, 85.4%], and for 2% NaCl 4.1% [2.8%, 6.1%].
Manuscript-quality figure
Instead of showing slope coefficients which may be confusing, generate predictions instead
Generate epred value at an evenly spaced set of concentration values to get overall trend
This accounts for the zero-inflation and one-inflation probabilities but not the random genotype effects