A crash course in Bayesian mixed models with brms (Lesson 4)
Introduction and review
What is this class?
Part 4 of a practical introduction to fitting Bayesian multilevel models in R and Stan
Uses the brms package (Bayesian Regression Models using Stan)
How to follow the course
Slides and text version of lessons are online
Fill in code in the worksheet (replace ... with code)
You can always copy and paste code from text version of lesson if you fall behind
What we’ve learned in Lessons 1-3
Relationship between prior, likelihood, and posterior
How to sample from the posterior with Hamiltonian Monte Carlo
How to use posterior distributions from our models to make predictions and test hypotheses
GLMMs with non-normal response distributions and random intercepts and slopes
Models with residuals correlated in space and time
Spatial generalized additive mixed models
What will we learn in this lesson?
Beta regression for proportion data
Zero-inflated Poisson regression for count data that include zeros
Hurdle models for continuous and proportion data that include zeros
Big-picture concepts
What are mixture distributions, and how they make stat models more flexible
How to allow each separate parameter of a statistical distribution to vary as a function of fixed and random effects
Reminder: not all these things have to be done Bayesian, but it’s often the only way that works in practice!
Part 1. Beta-regression for proportion data
Load packages and set options
library(brms)library(agridat)library(tidyverse)library(emmeans)library(tidybayes)library(ggdist)library(bayestestR)library(gt)library(fitdistrplus)theme_set(theme_bw())colorblind_friendly <-scale_color_manual(values =unname(palette.colors(5)[-1]), aesthetics =c('colour', 'fill'))# IF YOU ARE RUNNING THIS CODE ON YOUR OWN MACHINEoptions(mc.cores =4, brms.file_refit ='on_change')# IF YOU ARE RUNNING THIS CODE ON THE POSIT CLOUD SERVER (or you have configured CmdStan on your own machine)options(brms.backend ='cmdstanr', mc.cores =4, brms.file_refit ='on_change')
Regression with proportional data
In Lesson 2, we modeled binary data with a logistic GLMM instead of fitting a model to the proportions
But sometimes you just have the proportions and not the counts, or you truly have a continuous proportion not based on counts
Normally distributed error won’t work because proportions are bounded between 0 and 1
It can work OK if most of your data is between ~0.25 and ~0.75, but that often isn’t the case in ag science
What distribution to use?
Binomial won’t work because it has only two discrete values (0 and 1), not continuous
Gamma won’t work because it doesn’t have an upper bound
We need a distribution that is bounded on the [0, 1] interval and is very flexible in its shape
Beta distribution to the rescue!
Can be a bell-like curve, a skewed hump, or even a horseshoe, but it always goes from 0 to 1
Parameters of the beta distribution
Many sources use parameters \(\alpha\) and \(\beta\) (Wikipedia)
R function dbeta() uses those parameters but calls them shape1 and shape2
In brms, it is rewritten in terms of different parameters, \(\mu\) (mu) and \(\phi\) (phi)
mu is the mean parameter
phi is the precision parameter
Mean parameterization
Most distributions in brms are parameterized so that the first parameter is the mean or expected value of the distribution
This is convenient because we can mentally keep track of effects on the mean, versuseffects on other components of the distribution
You can put any combination of fixed and/or random effects on any parameter of any distribution in brms
Example dataset of proportions
Fusarium head blight on winter wheat: image (c) Janet Lewis/CIMMYT
snijders.fusarium from agridat package
Percentage of winter wheat leaf area infected by Fusarium fungus
17 wheat genotypes, 4 Fusarium strains, 3 years
Pre-process data
Beta distribution requires proportions, not percentages
Use describe_posterior() from the bayestestR package to get credible intervals for the means and pairwise differences
These are quantiles of the 4000 posterior samples we generated for each of those quantities
\(p_{MAP}\) and \(p_{d}\) are Bayesian analogues to p-values, if you are keen on testing a null hypothesis
(fusarium_emms_CI <-describe_posterior(fusarium_emms, ci =c(0.66, 0.95), ci_method ='eti', test =NULL))(fusarium_contrasts_CI <-describe_posterior(fusarium_contrasts, ci =c(0.66, 0.95), ci_method ='eti', test =c('pd', 'p_map'), null =1))
Presentation ideas: How to report this analysis in a manuscript?
Description of statistical model for methods section (including model comparison results)
Verbal description of results
Figure
Table
Methods section
We fit a GLMM with beta response distribution to the proportion Fusarium infection data. A logit link function was used for the mean parameter, and a log link for the precision parameter \(\phi\). Strain and year were treated as fixed effects, and a random intercept and random slope with respect to year were fit for each genotype. We assigned Normal(0, 5) prior distributions to the fixed effects. Hamiltonian Monte Carlo was used to sample from the posterior distribution, with four Markov chains each run for 1000 warmup iterations and 1000 sampling iterations. We verified model convergence with the R-hat statistic and graphically assessed model fit with posterior predictive check plots. We compared this model with a model that had additional fixed effects on the precision parameter \(\phi\). LOO cross-validation showed this model was no better than the model without fixed effects on \(\phi\) (\(\Delta_{ELPD}\) = 8.7 with standard error = 10.0), therefore we present the simpler model.
Methods section, continued
We obtained posterior estimates of the marginal means for each combination of strain and year, back-transformed from the logit scale, and took pairwise ratios between strain means within each year. We characterized the posterior distributions of means and ratios using the median and equal-tailed credible intervals. We teested the null hypothesis that each ratio was equal to 1 using two different Bayesian p-value analogues: probability of direction \(p_d\) and maximum a posteriori p-value \(p_{MAP}\).
Results section
Incidence of Fusarium infection was overall low except for strain F39 in 1986 and 1987, with estimated mean infection proportions 0.24 with 95% CI [0.16, 0.36] and 0.28 [0.19, 0.41] respectively. The evidence is consistent with a greater proportion infection for F39 relative to other strains, with approximately a tenfold difference in 1987 and smaller differences in the twofold to fivefold range in 1986 and 1988 (see Table).
Manuscript-quality figure
Modeled posterior means, credible intervals, and the raw data
Simultaneously shows that the model is a good fit to the data, and shows the differences between the predicted means
66% and 95% credible intervals because that is about ± 1 and 2 SD, if the posterior distribution of the parameter is roughly normal
gather_emmeans_draws(fusarium_emms) %>%mutate(.value =plogis(.value)) %>%ggplot(aes(x = year, y = .value, group =interaction(strain, year))) +stat_interval(aes(color = strain, color_ramp =after_stat(level)), .width =c(.66, .95), position = ggpp::position_dodgenudge(width = .5, x = .03)) +stat_summary(fun ='median', geom ='point', size =2, position = ggpp::position_dodgenudge(width = .5, x = .03)) +geom_point(aes(y = p, color = strain), data = snijders.fusarium, position = ggpp::position_dodgenudge(width = .5, x =-.03), alpha = .5) + colorblind_friendly +labs(y ='proportion infected', color_ramp ='CI level')
Manuscript-quality table
Pairwise ratios of proportion infection (posterior estimates and 95% CIs), with \(p_{MAP}\) values