A crash course in Bayesian mixed models with brms (Lesson 3)
Introduction and review
What is this class?
Part 3 of a practical introduction to fitting Bayesian multilevel models in R and Stan
Uses the brms package (Bayesian Regression Models using Stan)
How to follow the course
Slides and text version of lessons are online
Fill in code in the worksheet (replace ... with code)
You can always copy and paste code from text version of lesson if you fall behind
What we’ve learned in Lessons 1 and 2
Relationship between prior, likelihood, and posterior
How to sample from the posterior with Hamiltonian Monte Carlo
How to use posterior distributions from our models to make predictions and test hypotheses
GLMMs with non-normal response distributions and random intercepts and slopes
What will we learn in this lesson?
How to fit a linear mixed model with residuals that are correlated in time (repeated measures in time)
How to fit a linear mixed model to a split plot design, and add a spatial additive component to the model
Space and time
Basic insight: the closer two things are in time or space, the more closely related they are
First Law of Geography
Space and time relationship: Example
We are studying the effect of farm-level management practices on soil carbon
The closer together two farms are, the more likely they are to share environmental conditions
Adjacent farms are not statistically independent even if they make independent management decisions
This is an issue with observational data and with designed experiments
Residuals are independent … right?
The models we’ve seen so far assume the residuals of every data point are independent of all the others
Even the mixed models assume this is true after the random effects are accounted for
You may know the term “G-side models” from SAS: random effects are in the \(\textbf{G}\) matrix (variances and covariances of the random effects)
Patterns of variation in space and time can cause this assumption to be wrong
One way to model data correlated in space or time
Within the framework of a GLMM, we can model a residual covariance structure
Data points are modeled as being more closely correlated, the closer they are to each other in space or time
SAS calls these “R-side models”: random effects are in the \(\textbf{R}\) matrix, which tells us the pattern of how the residuals (errors of each data point) are correlated with each other
Another way to model data correlated in space or time
We can include an additive term, otherwise known as a “smoother” or “spline”
Models the response variable as depending on a smooth function of either space or time
Our model is now a GAMM (generalized additive mixed model)
Example 1: a GLMM with residual covariance structure to model autocorrelation in time
Example 2: a GAMM with an additive term to model autocorrelation in space
Both ways work for space or time, and you can even combine them in the same model!
Models with residual covariance for repeated measures in time
Repeated measures within subjects
Many datasets have the same subjects (plots, animals) measured at different time points
We need to account for measurements from the same individual at different times not being independent
Load packages
Same as before except ape for spatial autocorrelation statistics
Weight gain of an animal over time is an autoregressive process (cow’s weight at time \(t\) depends on its weight at time \(t-1\))
Compound symmetry might be better for other processes that are subject to random environmental fluctuations at each time (e.g., annual yield trends)
Model comparison results
Unstructured covariance model does much better than it would in a frequentist context
Bayesian priors “regularize” models (shrink parameters toward zero) so even complex models are not too overfit
Assessing model predictions
Use emmeans package to explore the posterior
emtrends() (estimated marginal trends): posterior estimate of slope for each treatment combo, averaged over random effects
Like emmeans() with one extra argument, var: name of time variable
revpairwise ~ iron | infect: reverse pairwise contrast between each level of iron within each level of infect.
Reverse pairwise subtracts the control mean from the treatment mean
Slopes are still in standardized units (we will convert to kg/day later)
emt_diggle <-emtrends(diggle_ar1, revpairwise ~ iron | infect, var ='dayafterstart')
Bayesian p-values
No strong evidence of effect of iron supplementation (but only \(n=7\) per group)
We can also contrast the contrasts if we want
describe_posterior(emt_diggle$contrasts, ci =NULL, test =c('pd', 'p_map'))
Presentation ideas: How to report this analysis in a manuscript?
Description of statistical model for methods section (including model comparison results)
Verbal description of results
Figure
Table
Methods section
We fit a linear mixed model to the trend in standardized body weight over time with residual covariance structure to account for repeated measures within animals. The model had fixed effects for time, iron supplementation, tuberculosis infection, and their interactions. We fit models with unstructured, compound symmetry, and AR(1) residual covariance. A Normal(0, 5) prior distribution was assumed for the fixed effects. An LKJ prior with η = 3 was given to the unstructured residual covariance matrix. Models were fit with Hamiltonian Monte Carlo. The joint posterior distribution was sampled with four Markov chains, run for a variable number of warmup iterations and 1000 sampling iterations. We verified the R-hat statistic was < 1.01 for all parameters, indicating model convergence. In k-fold cross-validation, the first-order autoregressive model performed better than the unstructured (ΔELPD = 41) and compound symmetry models (ΔELPD = 295). Thus we present predictions from the autoregressive model.
Results section
There was little evidence for an effect of iron supplementation on the rate of weight gain. In uninfected animals the posterior estimate of the difference in mean weight gain rate between animals receiving supplemental iron and those that did not was -0.046 kg/day (95% CI [-0.135, 0.046]; pMAP = 0.64), while in infected animals it was -0.039 kg/day (95% CI [-0.094, 0.015]; pMAP = 0.40).
Manuscript-quality figure
Raw data and modeled trends superimposed, with 95% credible intervals
Need to calculate expected value of posterior predictive across a grid of time points by treatment combination
day and weight are rescaled to their original scales
incl.autocor = FALSE gives us overall prediction, not individual trends
Fit looks good, possibly a nonlinear trend would slightly improve fit?
Presentation ideas: How to report this analysis in a manuscript?
Description of model for methods section
Table
Results write-up and figure left as an exercise!
Methods section
We fit a general additive mixed model to the z-transformed yield variable. The fixed effects were genotype, fungicide, and their interaction, with random intercepts for block and for main plot nested within block. To account for spatial variation, we included a Gaussian process spatial smoothing term. We fit models with basis dimension (k) values 10, 25, and 50. We put a Normal(0, 3) prior on all fixed effects. Models were fit with Hamiltonian Monte Carlo, with 4 Markov chains, each run for 2000 warmup iterations before sampling 2000 draws for each chain. Model convergence was assessed by checking that R-hat < 1.01 for all parameters, and model fit was assessed by examining posterior predictive check plots. We compared models with leave-one-out cross-validation. The model with basis dimension k=50 performed best (ΔELPD 58.8 or greater); thus we present predictions from that model.
Manuscript-quality table
Predictions for fungicides averaged over genotypes
Show a few different credible intervals
Transform from standard deviation units to original units (tonnes/ha)